Let me first give a rough overview of the object model of the parser. The following items are represented by objects:
Documents: The document representation is more or less the anchor for the application; all accesses to the parsed entities start here. It is described by the class document contained in the module Markup_document. You can get some global information, such as the XML declaration the document begins with, the DTD of the document, global processing instructions, and most important, the document tree.
The contents of documents: The contents have the structure of a tree: Elements contain other elements and text[1]. The common type to represent both kinds of content is node, a class type that unifies the properties of elements and character data. Every node has a list of children (which is empty if the element is empty or the node represents text); nodes may have attributes; nodes always have text contents. There are two implementations of node: the class element_impl for elements, and the class data_impl for text data. You will find these classes and class types in the module Markup_document, too.
Note that attribute lists are represented by non-class values.
The node extension: For advanced usage, every node of the document may have an associated extension which is simply a second object. This object must have the three methods clone, node, and set_node as bare minimum, but you are free to add methods as you want. This is the preferred way to add functionality to the document tree[2]. The class type extension is defined in Markup_document, too.
The DTD: Sometimes it is necessary to access the DTD of a document; the average application does not need this feature. The class dtd describes DTDs, and makes it possible to get representations of element, entity, and notation declarations as well as processing instructions contained in the DTD. This class, and dtd_element, dtd_notation, and proc_instruction can be found in the module Markup_dtd. There are a couple of classes representing different kinds of entities; these can be found in the module Markup_entity.
Markup_yacc: Here the main parsing functions such as parse_document_entity are located. Some additional types and functions allow the parser to be configured in a non-standard way.
Markup_types: This is a collection of basic types and exceptions.
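To make the idea of the node extension more concrete, here is a minimal, self-contained sketch of an extension object with the three required methods clone, node, and set_node, plus one custom method. The class toy_node and the class counting_extension are hypothetical stand-ins invented for this illustration; they are not the definitions from Markup_document, whose node class type has many more methods.

```ocaml
(* Hypothetical stand-in for a tree node; NOT the node class type of
   Markup_document. *)
class toy_node (name : string) = object
  method name = name
end

(* A sketch of an extension: the three required methods plus custom ones. *)
class counting_extension = object
  val mutable owner : toy_node option = None
  val mutable visits = 0

  (* clone: return a copy of this extension object *)
  method clone = {< >}

  (* node: the node this extension is attached to *)
  method node =
    match owner with
    | Some n -> n
    | None -> failwith "extension is not attached to a node"

  (* set_node: attach the extension to a node *)
  method set_node n = owner <- Some n

  (* Custom functionality added by the application (an assumption of
     this sketch). *)
  method visit = visits <- visits + 1
  method visits = visits
end

let () =
  let n = new toy_node "chapter" in
  let e = new counting_extension in
  e # set_node n;
  e # visit;
  Printf.printf "%s visited %d time(s)\n" ((e # node) # name) (e # visits)
```

The point of the sketch is only the division of labour: the tree node stays fixed, while application-specific state and methods live in the parallel extension object.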
Let the document to be parsed be stored in a file called doc.xml. The parsing process is started by calling the function

    val parse_document_entity : config -> source -> 'ext domspec -> 'ext document

defined in the module Markup_yacc. The first argument specifies some global properties of the parser; it is recommended to start with the default_config. The second argument determines where the document to be parsed comes from; this may be a file, a channel, or an entity ID. To parse doc.xml, it is sufficient to pass File "doc.xml".
The third argument specifies the document object model to use. Roughly speaking, it determines which classes implement the node objects of which element types, and which extensions are to be used. The 'ext polymorphic variable is the type of the extension. For the moment, let us simply pass default_dom as this argument, and ignore it.
So the following expression parses doc.xml:
    open Markup_yacc
    let d = parse_document_entity default_config (File "doc.xml") default_dom

Note that default_config implies that warnings are collected but not printed. Errors raise one of the exceptions defined in Markup_types; to get readable errors and warnings, catch the exceptions as follows:
    let rec print_error e =
      match e with
        Markup_types.At(where,what) ->
          print_endline where;
          print_error what
      | _ ->
          print_endline (Printexc.to_string e)

    try
      let d = parse_document_entity default_config (File "doc.xml") default_dom in
      let s = default_config.warner # print_warnings in
      if s <> "" then print_endline s;
      default_config.warner # reset
      ...
    with
      e ->
        let s = default_config.warner # print_warnings in
        if s <> "" then print_endline s;
        default_config.warner # reset;
        print_error e

Now d is an object of the document class. If you want the node tree, you can get the root element by
    let root = d # root

and if you would rather access the DTD, obtain it by

    let dtd = d # dtd

As it is more interesting, let us investigate the node tree now. Given the root element, it is possible to recursively traverse the whole tree. The children of a node n are returned by the method sub_nodes, and the type of a node is returned by node_type. This function traverses the tree, and prints the type of each node:
    let rec print_structure n =
      let ntype = n # node_type in
      match ntype with
        T_element name ->
          print_endline ("Element of type " ^ name);
          let children = n # sub_nodes in
          List.iter print_structure children
      | T_data ->
          print_endline "Data"

You can call this function by

    print_structure root

The type returned by node_type is either T_element name or T_data. The name of the element type is the string included in the angle brackets. Note that only elements have children; data nodes are always leaves of the tree.
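As a variation on this traversal, the following self-contained sketch indents each line by the depth of the node, which makes the nesting visible. Since it has to run without the parser, it declares its own stand-ins: the type node_type is re-declared locally, and the class type toy_node and the helper make are hypothetical names that mimic only the two methods node_type and sub_nodes.

```ocaml
(* Local stand-ins for illustration; the real definitions come from the
   Markup modules. *)
type node_type = T_element of string | T_data

class type toy_node = object
  method node_type : node_type
  method sub_nodes : toy_node list
end

(* Hypothetical helper that builds a toy node. *)
let make ntype children : toy_node = object
  method node_type = ntype
  method sub_nodes = children
end

(* Return one line per node, indented by two spaces per tree level. *)
let rec structure_lines depth (n : toy_node) : string list =
  let indent = String.make (2 * depth) ' ' in
  match n # node_type with
  | T_element name ->
      let below = List.map (structure_lines (depth + 1)) (n # sub_nodes) in
      (indent ^ "Element of type " ^ name) :: List.concat below
  | T_data ->
      [indent ^ "Data"]

(* A small tree: element a contains text and an element b with text. *)
let example =
  make (T_element "a")
    [ make T_data [];
      make (T_element "b") [ make T_data [] ] ]

let () = List.iter print_endline (structure_lines 0 example)
```

Returning the lines as a list instead of printing them directly is a design choice of this sketch; it keeps the traversal free of side effects so that the result can be inspected or tested.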
There are some more methods in order to access a parsed node tree:
n # parent: Returns the parent node, or raises Not_found
n # root: Returns the root of the node tree.
n # attribute a: Returns the value of the attribute with name a. If a has been declared, one of the following values is possible: Value s indicating that the attribute contains a single value; Valuelist sl indicating that this attribute has a token list as value (this is possible for the attribute types IDREFS, ENTITIES, and NMTOKENS); Implied_value indicating that the attribute has been left out and was declared in the DTD as #IMPLIED. Note that if the attribute is absent and the DTD declared a default value, this method will always return the default value. Not_found is raised only if you try to get a non-declared attribute.
Note that attribute values are normalized while being parsed. Most importantly, newline characters are turned into plain spaces.
n # data: Returns the character data contained in the node. For data nodes, the meaning is obvious as this is the main content of data nodes. For element nodes, this method returns the concatenated contents of all inner data nodes.
Note that entity references included in the text are resolved while they are being parsed; for example the text "a &lt;&gt; b" will be returned as "a <> b" by this method. Spaces of data nodes are always preserved. Newlines are preserved, but always converted to \n characters even if newlines are encoded as \r\n or \r.
Note that elements that do not allow #PCDATA as content will not have data nodes as children. This means that spaces and newlines, the only character material allowed for such elements, are silently dropped.
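The three possible results of the attribute method, and the Not_found case for undeclared attributes, can be handled as in the following sketch. It is self-contained, so the type att_value is re-declared locally with the three constructors named above; the helper names string_of_att_value and attribute_opt and the object toy are hypothetical and exist only for this illustration.

```ocaml
(* Local re-declaration for illustration, mirroring the three cases named
   in the text; in a real program the type comes from the Markup modules. *)
type att_value =
  | Value of string
  | Valuelist of string list
  | Implied_value

(* Flatten an attribute value to a string; token lists are joined with
   single spaces, and an implied value yields the given default. *)
let string_of_att_value ?(default = "") v =
  match v with
  | Value s -> s
  | Valuelist sl -> String.concat " " sl
  | Implied_value -> default

(* Turn the Not_found exception for undeclared attributes into an option. *)
let attribute_opt n name =
  try Some (n # attribute name) with Not_found -> None

(* Toy object with an attribute method, standing in for a parsed node. *)
let toy = object
  method attribute name =
    match name with
    | "priority" -> Value "1"
    | "refs" -> Valuelist ["x"; "y"]
    | "note" -> Implied_value
    | _ -> raise Not_found
end

let () =
  match attribute_opt toy "refs" with
  | Some v -> print_endline (string_of_att_value v)
  | None -> print_endline "attribute not declared"
```

Whether an implied value should be rendered as an empty string or as something else depends on the application; the optional default argument is merely one way to leave that decision to the caller.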
For example, the following function prints the contents of all elements of type "valuable" whose attribute "priority" has the value "1":

    let rec print_valuable_prio1 n =
      let ntype = n # node_type in
      match ntype with
        T_element "valuable" when n # attribute "priority" = Value "1" ->
          print_endline "Valuable node with priority 1 found:";
          print_endline (n # data)
      | _ ->
          let children = n # sub_nodes in
          List.iter print_valuable_prio1 children

You can call this function by:

    print_valuable_prio1 root

If you like a DSSSL-like style, you can make the function process_children explicit:
    let rec print_valuable_prio1 n =

      let process_children n =
        let children = n # sub_nodes in
        List.iter print_valuable_prio1 children
      in

      let ntype = n # node_type in
      match ntype with
        T_element "valuable" when n # attribute "priority" = Value "1" ->
          print_endline "Valuable node with priority 1 found:";
          print_endline (n # data)
      | _ ->
          process_children n

Used this way, O'Caml serves as a simple "style-sheet language": You can form a big "match" expression to distinguish between all significant cases, and react differently to each condition. But this technique has limitations; the "match" expression tends to get larger and larger, and it is difficult to store intermediate values as there is only one big recursion. Alternatively, it is also possible to represent the various cases as classes, and to use dynamic method lookup to find the appropriate class. The next section explains this technique in detail.
[1] Elements may also contain processing instructions. Unlike other document models, Markup separates processing instructions from the rest of the text and provides a second interface to access them.
[2] Due to the type system it is more or less impossible to derive recursive classes in O'Caml. To get around this, it is common practice to put the modifiable or extensible part of recursive objects into parallel objects.