3.2. The class type node

type node_type =
  T_data
| T_element of string
;;

class type [ 'ext ] node =
  object ('self)
    constraint 'ext = 'ext node #extension

    (* General observers *)

    method extension : 'ext
    method parent : 'ext node
    method root : 'ext node
    method sub_nodes : 'ext node list
    method iter_nodes : ('ext node -> unit) -> unit
    method iter_nodes_sibl : 
           ('ext node option -> 'ext node -> 'ext node option -> unit) -> unit
    method node_type : node_type
    method data : string
    method dtd : dtd
    method pinstr : string -> proc_instruction list
    method pinstr_names : string list

    (* Attribute observers *)

    method attribute : string -> Markup_types.att_value
    method required_string_attribute : string -> string
    method optional_string_attribute : string -> string option
    method required_list_attribute : string -> string list
    method optional_list_attribute : string -> string list
    method attribute_names : string list
    method attribute_type : string -> Markup_types.att_type
    method attributes : (string * Markup_types.att_value) list

    (* Modifying methods *)

    method add_node : 'ext node -> unit
    method add_pinstr : proc_instruction -> unit
    method delete : unit
    method set_nodes : 'ext node list -> unit
    method quick_set_attributes : (string * Markup_types.att_value) list -> unit

    (* Cloning methods *)

    method orphaned_clone : 'self
    method orphaned_flat_clone : 'self
    method create_element : dtd -> node_type -> (string * string) list 
                            -> 'ext node
    method create_data : dtd -> string -> 'ext node

    (* Weird methods *)

    method find : string -> 'ext node
    method reset_finder : unit
    method local_validate : unit

    (* Internal methods (undocumented). *)

    method internal_adopt : 'ext node option -> unit
    method internal_delete : 'ext node -> unit
    method internal_init : dtd -> string -> (string * string) list -> unit
  end
;;
In the module Markup_types you can find another type definition that is important in this context:
type Markup_types.att_value =
    Value     of string
  | Valuelist of string list
  | Implied_value
;;

3.2.1. The structure of document trees

A node represents either an element or a character data section. There are two classes implementing the two aspects of nodes: element_impl and data_impl. The latter class does not implement all methods because some methods do not make sense for data nodes.

The following figure () shows an example how a tree is constructed from element and data nodes. The circular areas represent element nodes whereas the ovals denote data nodes. Only elements may have subnodes; data nodes are always leaves of the tree. The subnodes of an element can be either element or data nodes; in both cases the O'Caml objects storing the nodes have the class type node.

Attributes (the clouds in the picture) are not directly integrated into the tree; there is always an extra link to the attribute list. This is also true for processing instructions (not shown in the picture). This means that there are separated access methods for attributes and processing instructions.

Figure 3-1. A tree with element nodes, data nodes, and attributes

Only elements, data sections, attributes and processing instructions can, directly or indirectly, occur in the document tree. It is impossible to add entity references to the tree; if the parser finds such a reference, not the reference as such but the referenced text (i.e. the tree representing the structured text) is included in the tree.

Figure 3-2. Nodes are doubly linked trees

The node tree has links in both directions: Every node has a link to its parent (if any), and it has links to the subnodes (see figure Nodes are doubly linked trees). Obviously, this doubly-linked structure simplifies the navigation in the tree; but has also some consequences for the possible operations on trees.

Because every node must have at most one parent node, operations are illegal if they violate this condition. The following figure (A node can only be added if it is a root) shows on the left side that node y is added to x as new subnode which is allowed because y does not have a parent yet. The right side of the picture illustrates what would happen if y had a parent node; this is illegal because y would have two parents after the operation.

Figure 3-3. A node can only be added if it is a root

The "delete" operation simply removes the links between two nodes. In the picture (A deleted node becomes the root of the subtree) the node x is deleted from the list of subnodes of y. After that, x becomes the root of the subtree starting at this node.

Figure 3-4. A deleted node becomes the root of the subtree

It is also possible to make a clone of a subtree; illustrated in The clone of a subtree. In this case, the clone is a copy of the original subtree except that it is no longer a subnode. Because cloning never keeps the connection to the parent, the clones are called orphaned.

Figure 3-5. The clone of a subtree

3.2.2. The methods of the class type node

General observers .

  • extension: The reference to the extension object which belongs to this node (see ...).

  • parent: Get the father node. Raises Not_found in the case the node does not have a parent, i.e. the node is the root.

  • root: Gets the reference to the root node of the tree. Every node is contained in a tree with a root, so this method always succeeds. Note that this method searches the root, which costs time proportional to the length of the path to the root.

  • sub_nodes: Returns references to the children. The returned list reflects the order of the children. For data nodes, this method returns the empty list.

  • iter_nodes f: Iterates over the children, and calls f for every child in turn.

  • iter_nodes_sibl f: Iterates over the children, and calls f for every child in turn. f gets as arguments the previous node, the current node, and the next node.

  • node_type: Returns either T_data which means that the node is a data node, or T_element n which means that the node is an element of type n.

  • data: Returns the character data of this node and all children, concatenated as one string. The string is ISO-8859-1-encoded. - For data nodes, this method simply returns the represented characters. For elements, the meaning of the method has been extended such that it returns something useful, i.e. the effectively contained characters, without markup.

  • dtd: Returns a reference to the global DTD. All nodes of a tree must share the same DTD.

  • pinstr n: Returns all processing instructions that are directly contained in this element and that have a target specification of n. The target is the first word after the <?.

  • pinstr_names: Returns the list of all targets of processing instructions directly contained in this element.

Attribute observers .

  • attribute n: Returns the value of the attribute with name n. This method returns a value for every declared attribute, and it raises Not_found for any undeclared attribute. Note that it even returns a value if the attribute is actually missing but is declared as #IMPLIED or has a default value. - Possible values are:

    • Implied_value: The attribute has been declared with the keyword #IMPLIED, and the attribute is missing in the attribute list of this element.

    • Value s: The attribute has been declared as type CDATA, as ID, as IDREF, as ENTITY, or as NMTOKEN, or as enumeration or notation, and one of the two conditions holds: (1) The attribute value is present in the attribute list in which case the value is returned in the string s. (2) The attribute has been omitted, and the DTD declared the attribute with a default value. The default value is returned in s. - Summarized, Value s is returned for non-implied, non-list attribute values.

    • Valuelist l: The attribute has been declared as type IDREFS, as ENTITIES, or as NMTOKENS, and one of the two conditions holds: (1) The attribute value is present in the attribute list in which case the space-separated tokens of the value are returned in the string list l. (2) The attribute has been omitted, and the DTD declared the attribute with a default value. The default value is returned in l. - Summarized, Valuelist l is returned for all list-type attribute values.

    Note that before the attribute value is returned, the value is normalized. This means that newlines are converted to spaces, and that references to character entities (i.e. &#n;) and general entities (i.e. &name;) are expanded; if necessary, expansion is performed recursively.

  • required_string_attribute n: returns the Value attribute called n, or the Valuelist attribute as a string where the list elements are separated by spaces. If the attribute value is implied, or if the attribute does not exists, the method will fail. - This method is convenient if you expect a non-implied and non-list attribute value.

  • optional_string_attribute n: returns the Value attribute called n, or the Valuelist attribute as a string where the list elements are separated by spaces. If the attribute value is implied, or if the attribute does not exists, the method returns None. - This method is convenient if you expect a non-list attribute value including the implied value.

  • required_list_attribute n: returns the Valuelist attribute called n, or the Value attribute as a list with a single element. If the attribute value is implied, or if the attribute does not exists, the method will fail. - This method is convenient if you expect a list attribute value.

  • optional_list_attribute n: returns the Valuelist attribute called n, or the Value attribute as a list with a single element. If the attribute value is implied, or if the attribute does not exists, an empty list will be returned. - This method is convenient if you expect a list attribute value or the implied value.

  • attribute_names: returns the list of all attribute names of this element. As this is a validating parser, this list is equal to the list of declared attributes.

  • attribute_type n: returns the type of the attribute called n. See the module Markup_types for a description of the encoding of the types.

  • attributes: returns the list of pairs of names and values for all attributes of this element.

Modifying methods . The following methods are only defined for element nodes (more exactly: the methods are defined for data nodes, too, but fail always).

  • add_node sn: Adds sub node sn to the list of children. This operation is illustrated in the picture A node can only be added if it is a root. This method expects that sn is a root, and it requires that sn and the current object share the same DTD.

  • add_pinstr pi: Adds the processing instruction pi to the list of processing instructions.

  • delete: Deletes this node from the tree. After this operation, this node is no longer the child of the former father node; and the node loses the connection to the father as well. This operation is illustrated by the figure A deleted node becomes the root of the subtree.

  • set_nodes nl: Sets the list of children to nl. It is required that every member of nl is a root, and that all members and the current object share the same DTD.

  • quick_set_attributes atts: sets the attributes of this element to atts. It is not checked whether atts matches the DTD or not; it is up to the caller of this method to ensure this. (This method may be useful to transform the attribute values, i.e. apply a mapping to every attribute.)

Cloning methods .

  • orphaned_clone: Returns a clone of the node and the complete tree below this node (deep clone). The clone does not have a parent (i.e. the reference to the parent node is not cloned). While copying the subtree, strings are skipped; it is likely that the original tree and the copy tree share strings. Extension objects are cloned by invoking the clone method on the original objects; how much of the extension objects is cloned depends on the implemention of this method.

    This operation is illustrated by the figure The clone of a subtree.

  • orphaned_flat_clone: Returns a clone of the node, but sets the list of sub nodes to [], i.e. the sub nodes are not cloned.

  • create_element dtd nt al: Returns a flat copy of this node (which must be an element) with the following modifications: The DTD is set to dtd; the node type is set to nt, and the new attribute list is set to al (given as list of (name,value) pairs). The copy does not have children nor a parent. It does not contain processing instructions. See the example below.

  • create_data dtd cdata: Returns a flat copy of this node (which must be a data node) with the following modifications: The DTD is set to dtd; the node type is set to T_data; the attribute list is empty (data nodes never have attributes); the list of children and PIs is empty, too (same reason). The new node does not have a parent. The value cdata is the new character content of the node. See the example below.

Weird methods . The following methods will hopefully be replaced by better ones in the next major release:

  • find id: Searches the tree (beginning at the root node) for the node with an ID attribute id, or raises Not_found. The method may also cause a Validation_error if something is wrong with the IDs.

  • reset_finder: Resets the ID cache. This is necessary after new nodes have been added; otherwise find would not find these nodes.

  • local_validate: Checks that this node conforms to the DTD. (Applications need not to call this method unless they add new nodes themselves to the tree.)

3.2.3. The class element_impl

This class is an implementation of node which realizes element nodes:

class [ 'ext ] element_impl : 'ext -> [ 'ext ] node

Constructor. You can create a new instance by

new element_impl extension_object
which creates a special form of empty element which already contains a reference to the extension_object, but is otherwise empty. This special form is called an exemplar. The purpose of exemplars is that they serve as patterns that can be duplicated and filled with data. The method create_element is designed to perform this action.

Example. First, create an exemplar by

let exemplar_ext = ... in
let exemplar     = new element_impl exemplar_ext in
The exemplar is not used in node trees, but only as a pattern when the element nodes are created:
let element = exemplar # create_element dtd (T_element name) attlist 
The element is a copy of exemplar (even the extension exemplar_ext has been copied) which ensures that element and its extension are objects of the same class as the exemplars; note that you need not to pass a class name or other meta information. The copy is initially connected with the dtd, it gets a node type, and the attribute list is filled. The element is now fully functional; it can be added to another element as child, and it can contain references to subnodes.

3.2.4. The class data_impl

This class is an implementation of node which should be used for all character data nodes:

class [ 'ext ] data_impl : 'ext -> string -> [ 'ext ] node

Constructor. You can create a new instance by

new data_impl extension_object cdata_string
which creates an exemplar node filled with the characters cdata_string which is connected to extension_object. The node does not contain a reference to any DTD, and because of this it cannot be added to node trees.

To get a fully working data node, apply the method create_data to the exemplar (see example).

(The interface is a bit confusing: Why it is possible to create exemplar nodes that are already filled with character data? Don't expect a reasonable answer...)

Example. First, create an exemplar by

let exemplar_ext = ... in
let exemplar     = new exemplar_ext data_impl "" in
The exemplar is not used in node trees, but only as a pattern when the data nodes are created:
let data_node = exemplar # create_data dtd "The characters contained in the data node" 
The data_node is a copy of exemplar. The copy is initially connected with the dtd, and it is filled with character material. The data_node is now fully functional; it can be added to an element as child.

3.2.5. Examples

Building trees. Here is the piece of code that creates the tree of the figure A tree with element nodes, data nodes, and attributes. The extension object and the DTD are beyond the scope of this example.

let exemplar_ext = ... (* some extension *) in
let dtd = ... (* some DTD *) in

let element_exemplar = new element_impl exemplar_ext in
let data_exemplar    = new data_impl    exemplar_ext "" in

let a1 = element_exemplar # create_element dtd (T_element "a") ["att", "apple"]
and b1 = element_exemplar # create_element dtd (T_element "b") []
and c1 = element_exemplar # create_element dtd (T_element "c") []
and a2 = element_exemplar # create_element dtd (T_element "a") ["att", "orange"]
in

let cherries = data_exemplar # create_data dtd "Cherries" in
let orange   = data_exemplar # create_data dtd "An orange" in

a1 # add_node b1;
a1 # add_node c1;
b1 # add_node a2;
b1 # add_node cherries;
a2 # add_node orange;
Alternatively, the last block of statements could also be written as:
a1 # set_nodes [b1; c1];
b1 # set_nodes [a2; cherries];
a2 # set_nodes [orange];
The root of the tree is a1, i.e. it is true that
x # root == a1
for every x from { a1, a2, b1, c1, cherries, orange }.

Furthermore, the following properties hold:

  a1 # attribute "att" = Value "apple"
& a2 # attribute "att" = Value "orange"

& cherries # data = "Cherries"
&   orange # data = "An orange"
&       a1 # data = "CherriesAn orange"

&       a1 # node_type = T_element "a"
&       a2 # node_type = T_element "a"
&       b1 # node_type = T_element "b"
&       c1 # node_type = T_element "c"
& cherries # node_type = T_data
&   orange # node_type = T_data

&       a1 # sub_nodes = [ b1; c1 ]
&       a2 # sub_nodes = [ orange ]
&       b1 # sub_nodes = [ a2; cherries ]
&       c1 # sub_nodes = []
& cherries # sub_nodes = []
&   orange # sub_nodes = []

&       a2 # parent == a1
&       b1 # parent == b1
&       c1 # parent == a1
& cherries # parent == b1
&   orange # parent == a2

Searching nodes. The following function searches all nodes of a tree for which a certain condition holds:

let rec search p t =
  if p t then
    t :: search_list p (t # sub_nodes)
  else
    search_list p (t # sub_nodes)

and search_list p l =
  match l with
    []      -> []
  | t :: l' -> (search p t) @ (search_list p l')
;;

For example, if you want to search all elements of a certain type et, the function search can be applied as follows:

let search_element_type et t =
  search (fun x -> x # node_type = T_element et) t
;;

Getting attribute values. Suppose we have the declaration:

<!ATTLIST e a CDATA #REQUIRED
            b CDATA #IMPLIED
            c CDATA "12345">
In this case, every element e must have an attribute a, otherwise the parser would indicate an error. If the O'Caml variable n holds the node of the tree corresponding to the element, you can get the value of the attribute a by
let value_of_a = n # required_string_attribute "a"
which is more or less an abbreviation for
let value_of_a = 
  match n # attribute "a" with
    Value s -> s
  | _       -> assert false
- as the attribute is required, the attribute method always returns a Value.

In contrast to this, the attribute b can be omitted. In this case, the method required_string_attribute works only if the attribute is there, and the method will fail if the attribute is missing. To get the value, you can apply the method optional_string_attribute:

let value_of_b = n # optional_string_attribute "b"
Now, value_of_b is of type string option, and None represents the omitted attribute. Alternatively, you could also use attribute:
let value_of_b = 
  match n # attribute "b" with
    Value s       -> Some s
  | Implied_value -> None
  | _             -> assert false

The attribute c behaves much like a, because it has always a value. If the attribute is omitted, the default, here "12345", will be returned instead. Because of this, you can again use required_string_attribute to get the value.

The type CDATA is the most general string type. The types NMTOKEN, ID, IDREF, ENTITY, and all enumerators and notations are special forms of string types that restrict the possible values. From O'Caml, they behave like CDATA, i.e. you can use the methods required_string_attribute and optional_string_attribute, too.

In contrast to this, the types NMTOKENS, IDREFS, and ENTITIES mean lists of strings. Suppose we have the declaration:

<!ATTLIST f d NMTOKENS #REQUIRED
            e NMTOKENS #IMPLIED>
The type NMTOKENS stands for lists of space-separated tokens; for example the value "1 abc 23ef" means the list ["1"; "abc"; "23ef"]. (Again, IDREFS and ENTITIES have more restricted values.) To get the value of attribute d, one can use
let value_of_d = n # required_list_attribute "d"
or
let value_of_d = 
  match n # attribute "d" with
    Valuelist l -> l
  | _           -> assert false
As d is required, the attribute cannot be omitted, and the attribute method returns always a Valuelist.

For optional attributes like e, apply

let value_of_e = n # optional_list_attribute "e"
or
let value_of_e = 
  match n # attribute "e" with
    Valuelist l   -> l
  | Implied_value -> []
  | _             -> assert false
Here, the case that the attribute is missing counts like the empty list.