2.3. Class-based processing of the node tree

By default, the parsed node tree consists of objects of the same class; this is a good design as long as you only want to access selected parts of the document. For complex transformations, it may be better to use different classes for objects describing different element types.

For example, if the DTD declares the element types a, b, and c, and if the task is to convert an arbitrary document into a printable format, the idea is to define for every element type a separate class that has a method print. The classes are eltype_a, eltype_b, and eltype_c, and every class implements print such that elements of the type corresponding to the class are converted to the output format.

The parser supports such a design directly. As it is impossible to derive recursive classes in O'Caml[1], the specialized element classes cannot be formed by simply inheriting from the built-in classes of the parser and adding methods for customized functionality. To get around this limitation, every node of the document tree is represented by two objects, one called "the node" and containing the recursive definition of the tree, one called "the extension". Every node object has a reference to the extension, and the extension has a reference to the node. The advantage of this model is that it is now possible to customize the extension without affecting the typing constraints of the recursive node definition.

Every extension must have the three methods clone, node, and set_node. The method clone creates a deep copy of the extension object and returns it; node returns the node object for this extension object; and set_node tells the extension object which node is associated with it. The set_node method is called automatically when the node tree is initialized. The following definition is a good starting point for these methods; usually clone must be further refined when instance variables are added to the class:

class custom_extension =
  object (self)

    val mutable node = (None : custom_extension node option)

    method clone = {< >} 
    method node =
      match node with
          None ->
            assert false
        | Some n -> n
    method set_node n =
      node <- Some n

  end
This part of the extension is usually the same for all classes, so it is a good idea to use custom_extension as the superclass of the further class definitions. Continuing the example above, we can define the element type classes as follows:
class virtual abstract_extension =
  object (self)
    ... clone, node, set_node defined as above ...

    method virtual print : out_channel -> unit
  end

class eltype_a =
  object (self)
    inherit abstract_extension
    method print ch = ...
  end

class eltype_b =
  object (self)
    inherit abstract_extension
    method print ch = ...
  end

class eltype_c =
  object (self)
    inherit abstract_extension
    method print ch = ...
  end
The method print can now be implemented for every element type separately. Note that you get the associated node by invoking
self # node
and you get the extension object of a node n by writing
n # extension
It is guaranteed that
self # node # extension == self
always holds.
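
The relationship between the two objects can be tried out in isolation. The following is only a minimal sketch and not Markup's actual implementation: the classes mini_node and mini_ext and the function make_node are hypothetical stand-ins for the node classes, an extension class, and the initialization that the parser performs.

```ocaml
(* Hypothetical miniature of the node/extension pairing.
   mini_node, mini_ext and make_node are illustrative names only. *)

(* A node is parameterized by the type of its extension. *)
class ['ext] mini_node (name : string) (ext : 'ext) =
  object
    method name = name
    method extension = ext
  end

(* An extension refers back to a node whose extension type is itself. *)
class mini_ext =
  object (self)
    val mutable node = (None : mini_ext mini_node option)
    method clone = {< >}
    method node =
      match node with
      | None -> assert false   (* set_node has not been called yet *)
      | Some n -> n
    method set_node n = node <- Some n
    (* Custom functionality lives in the extension and may use the node. *)
    method describe = "<" ^ self # node # name ^ ">"
  end

(* Create a node and tie the two objects together. *)
let make_node name (ext : mini_ext) =
  let n = new mini_node name ext in
  ext # set_node n;
  n

let demo_node = make_node "a" (new mini_ext)
let demo_ext = demo_node # extension

let () =
  (* Going from the extension to the node and back yields the same object. *)
  assert (demo_ext # node # extension == demo_ext);
  print_endline (demo_ext # describe)
```

Because the custom method describe is defined only in the extension class, further methods can be added there without touching the definition of the node class.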

The remaining task is to configure the parser such that these extension classes are actually used. Here another problem arises: it is not possible to dynamically select the class of an object to be created. As a workaround, Markup allows the user to specify exemplar objects for the various element types; instead of creating the nodes of the tree by applying the new operator, the nodes are produced by duplicating the exemplars. As object duplication preserves the class of the object, one can create fresh objects of every class for which an exemplar has previously been registered.
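
The exemplar idea does not depend on the parser and can be illustrated on its own. In the following sketch, the classes base_ext, ext_a, and ext_b, the table exemplars, and the function fresh_for are all hypothetical; the point is only that duplicating a registered object yields a fresh object of the same class, although the class is never named at the place where the copy is made.

```ocaml
(* Hypothetical illustration of exemplars; none of these names belong to
   Markup. *)

class virtual base_ext =
  object
    method virtual label : string
    (* Duplication: the copy has the class of the object that is copied. *)
    method clone = {< >}
  end

class ext_a = object inherit base_ext method label = "extension for a" end
class ext_b = object inherit base_ext method label = "extension for b" end

(* One exemplar per element type, registered under the element name. *)
let exemplars : (string, base_ext) Hashtbl.t = Hashtbl.create 4

let () =
  Hashtbl.add exemplars "a" (new ext_a :> base_ext);
  Hashtbl.add exemplars "b" (new ext_b :> base_ext)

(* The class is selected at run time by looking up the exemplar and
   copying it; the new operator is not used here.
   Raises Not_found if no exemplar is registered under this name. *)
let fresh_for name = (Hashtbl.find exemplars name) # clone

let () =
  print_endline ((fresh_for "a") # label);
  print_endline ((fresh_for "b") # label)
```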

Exemplars are meant to be objects without contents; the only interesting property of an exemplar is that it is an instance of a certain class. An exemplar for an element node can be created by:

let element_exemplar = new element_impl extension_exemplar
And a data node exemplar is created by:
let data_exemplar = new data_impl extension_exemplar ""
The classes element_impl and data_impl are defined in the module Markup_document. The constructors initialize the fresh objects as empty objects, i.e. without children, without data contents, and so on. The extension_exemplar is the initial extension object the exemplars are associated with.

Once the exemplars are created and stored somewhere (e.g. in a hash table), you can take an exemplar and create a concrete instance (with contents) by duplicating it. As a user of the parser you are normally not concerned with this because it is part of the internal logic of the parser, but as background knowledge it is worth mentioning that the two methods create_element and create_data actually duplicate the exemplar for which they are invoked, then apply modifications to the clone, and finally return the new object. Moreover, the extension object is copied, too, and the new node object is associated with the fresh extension object. This is the reason why every extension object must have a clone method.
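
Since every node created by the parser receives a copy of the exemplar's extension, it matters what clone copies once an extension carries state. The following sketch uses a hypothetical class counting_ext with a hash table as instance variable; the method shallow_clone exists only for comparison. It is an illustration under these assumptions, not code from the distribution.

```ocaml
(* Hypothetical extension with state, showing why clone may need to be
   refined when instance variables are added. *)
class counting_ext =
  object
    val seen : (string, int) Hashtbl.t = Hashtbl.create 8

    method record key =
      let n = try Hashtbl.find seen key with Not_found -> 0 in
      Hashtbl.replace seen key (n + 1)

    method count key =
      try Hashtbl.find seen key with Not_found -> 0

    (* Plain duplication: the copy refers to the same table as the
       original. *)
    method shallow_clone = {< >}

    (* Refined duplication: the copy gets a table of its own. *)
    method clone = {< seen = Hashtbl.copy seen >}
  end

let original = new counting_ext
let () = original # record "a"

let shared = original # shallow_clone
let independent = original # clone

(* This update is visible through [shared] but not through
   [independent]. *)
let () = original # record "a"

let () =
  Printf.printf "original=%d shared=%d independent=%d\n"
    (original # count "a") (shared # count "a") (independent # count "a")
```

If the state of the exemplar's extension were shared in this way, all nodes created from the same exemplar would see each other's updates, which is rarely intended.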

The configuration of the set of exemplars is passed to the parse_document_entity function as third argument. In our example, this argument can be set up as follows:

let domspec =
  let map = Hashtbl.create 4 in
  Hashtbl.add map (T_element "a") (new element_impl (new eltype_a));
  Hashtbl.add map (T_element "b") (new element_impl (new eltype_b));
  Hashtbl.add map (T_element "c") (new element_impl (new eltype_c));
  Hashtbl.add map T_data          (new data_impl (new data_ext));
  { map = map;
    default_element = new element_impl (new eltype_a);
  }
The map component of the record contains a hash table which stores the exemplars for several node types. The hash table must define an exemplar for data nodes; here data nodes have extensions of class data_ext. Extensions of data nodes work in the same way as extensions of element nodes. If there are element types for which entries in the hash table are missing, the exemplar defined in the default_element component is used.
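
The lookup rule just described can be sketched as a small function. The type and field names below mirror those used above but are redefined locally so that the sketch stands alone; the type node_kind, the function exemplar_for, and the use of strings in place of exemplar objects are assumptions made for illustration, and the parser's own lookup code may differ.

```ocaml
(* Hypothetical stand-alone model of the exemplar lookup. *)

type node_kind =
  | T_element of string
  | T_data

type 'node spec = {
  map : (node_kind, 'node) Hashtbl.t;
  default_element : 'node;
}

(* Use the registered exemplar if there is one; otherwise fall back to
   default_element for element types. There is no fallback for data
   nodes, so a missing entry is reported as an error. *)
let exemplar_for spec kind =
  match Hashtbl.find_opt spec.map kind, kind with
  | Some exemplar, _ -> exemplar
  | None, T_element _ -> spec.default_element
  | None, T_data ->
      invalid_arg "exemplar_for: no exemplar registered for data nodes"

(* Strings stand in for the exemplar objects. *)
let demo_spec =
  let map = Hashtbl.create 4 in
  Hashtbl.add map (T_element "a") "exemplar for a";
  Hashtbl.add map T_data "exemplar for data";
  { map = map; default_element = "default element exemplar" }

let () =
  print_endline (exemplar_for demo_spec (T_element "a"));
  print_endline (exemplar_for demo_spec (T_element "unknown"));
  print_endline (exemplar_for demo_spec T_data)
```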

The configuration is now complete. You can still use the same parsing functions, only the initialization is a bit different. For example, call the parser by:

let d = parse_document_entity default_config (File "doc.xml") domspec
Note that the resulting document d has a usable type; in particular, the print method we added is visible. So you can print your document by
d # print stdout

This object-oriented approach looks rather complicated; this is mostly caused by working around some restrictions of the strict type system of O'Caml. Some auxiliary concepts such as extensions were needed, but their practical consequences are minor. In the next section, one of the examples of the distribution is explained: a converter from readme documents to HTML.

Notes

[1] The problem is that the subclass is usually not a subtype in this case because the subtyping rule of O'Caml is contravariant in the argument types of methods.