Sorry, this chapter has not yet been written. For an introduction to parser configuration, see the previous chapters. As a first approximation, the interface definition of Markup_yacc outlines what could go here.
open Markup_types
open Markup_dtd
open Markup_document

type config =
    { warner : collect_warnings;
        (* An object that collects warnings. *)

      errors_with_line_numbers : bool;
        (* Whether error messages contain line numbers or not. The parser
         * is 10 to 20 per cent faster if line numbers are turned off;
         * you get only character positions in this case.
         *)

      processing_instructions_inline : bool;
        (* true: turns a special mode for processing instructions on. Normally,
         * you cannot determine the exact location of a PI; you only know
         * in which element the PI occurs. The "inline" mode makes it possible
         * to find the exact location out: Every PI is artificially wrapped
         * by a special element with name "-pi". For example, if the XML text
         * is <a><?x?><?y?></a>, the parser normally produces only an element
         * object for "a", and puts the PIs "x" and "y" into it (without
         * order). In inline mode, the object "a" will contain two objects
         * with name "-pi", and the first object will contain "x", and the
         * second "y".
         * Notes:
         * (1) The name "-pi" is reserved. You cannot use it for your own
         *     tags because tag names must not begin with '-'.
         * (2) You need not add a declaration for "-pi" to the DTD. These
         *     elements are handled separately.
         * (3) Of course, the "-pi" objects are created from exemplars of
         *     your DOM map.
         *)

      virtual_root : bool;
        (* true: the topmost element of the XML tree is not the root element,
         * but the so-called virtual root. The root element is a son of the
         * virtual root. The virtual root is an ordinary element with name
         * "-vr".
         * The following behaviour changes, too:
         * - PIs occurring outside the root element and outside the DTD are
         *   added to the virtual root instead of the document object
         * - If processing_instructions_inline is also turned on, these PIs
         *   are added inline to the virtual root
         * Notes:
         * (1) The name "-vr" is reserved. You cannot use it for your own
         *     tags because tag names must not begin with '-'.
         * (2) You need not add a declaration for "-vr" to the DTD. These
         *     elements are handled separately.
         * (3) Of course, the "-vr" objects are created from exemplars of
         *     your DOM map.
         *)

      (* The following options are not implemented, or only for internal
       * use.
       *)

      sgml_capitalize_names : bool;   (* TODO *)
      sgml_attribute_values : bool;   (* TODO *)
      sgml_omit_endtags : bool;       (* TODO *)
      debugging_mode : bool;
    }

type source =
    Entity of ((dtd -> Markup_entity.entity) * Markup_reader.resolver)
  | Channel of in_channel
  | File of string
  | Latin1 of string
  | ExtID of (ext_id * Markup_reader.resolver)

(* Note on sources:
 *
 * The sources do not all have the same capabilities. Here are the differences:
 *
 * - File: A File source reads from a file by name. This has the advantage
 *   that references to external entities can be resolved. The problem
 *   with SYSTEM references is that they usually contain relative file
 *   names; more exactly, a file name relative to the document containing it.
 *   It is only possible to convert such names to absolute file names if the
 *   name of the document containing such references is known; and File
 *   denotes this name.
 *
 * - Channel, Latin1: These sources read from documents given as channels or
 *   (Latin 1-encoded) strings. There is no file name, and because of this
 *   the documents must not contain references to external files (even
 *   if the file names are given as absolute names).
 *
 * - ExtID(x,r): The identifier x (either the SYSTEM or the PUBLIC name) of the
 *   entity to read from is passed to the resolver r as-is.
 *   The intention of this option is to allow customized
 *   resolvers to interpret external identifiers without any restriction.
 *   For example, you can assign the PUBLIC identifiers a meaning (they
 *   currently do not have any), or you can extend the "namespace" of
 *   identifiers.
 *   ExtID is the interface of choice for own extensions to resolvers.
 *
 * - Entity(m,r): You can implement every behaviour by using a customized
 *   entity class. Once the DTD object d is known that will be used during
 *   parsing, the entity e = m d is determined and used together with the
 *   resolver r.
 *   This is only for hackers.
 *)

type 'ext domspec =
    { map : (node_type, 'ext node) Hashtbl.t;
      default_element : 'ext node;
    }
  (* Specifies which node to use as exemplar for which node type. See the
   * manual for explanations.
   *)

val default_config : config
  (* - The resolver is able to read from files by name
   * - Warnings are thrown away
   * - Error messages will contain line numbers
   *)

val default_extension : ('a node extension) as 'a
  (* A "null" extension; an extension that does not extend the functionality *)

val default_dom : ('a node extension as 'a) domspec
  (* Specifies that you do not want to use extensions. *)

val parse_dtd_entity : config -> source -> dtd
  (* Parse an entity containing a DTD, and return this DTD. *)

val parse_document_entity : config -> source -> 'ext domspec -> 'ext document
  (* Parse a closed document, i.e. a document beginning with <!DOCTYPE...>,
   * and validate the contents of the document against the DTD contained
   * and/or referenced in the document.
   *)

val parse_content_entity : config -> source -> dtd -> 'ext domspec -> 'ext node
  (* Parse a file representing a well-formed fragment of a document. The
   * fragment must be a single element (i.e. something like <a>...</a>;
   * not a sequence like <a>...</a><b>...</b>). The element is validated
   * against the passed DTD, but it is not checked whether the element is
   * the root element specified in the DTD.
   * Note that you can create DTDs that specify not to validate at all
   * (invoke method allow_arbitrary on the DTD).
   *)

val parse_wf_entity : config -> source -> 'ext domspec -> 'ext document
  (* Parse a closed document (see parse_document_entity), but do not
   * validate it. Only checks on well-formedness are performed.
   *)

val open_gen: string -> string ->
  string * ('a Markup_document.node Markup_document.extension as 'a)
             Markup_document.document
  (* open_gen e f parses a closed document in file f (looking for it in the
   * directories default_dirs), i.e. a document beginning with <!DOCTYPE...>,
   * and validates the contents of the document against the DTD contained
   * and/or referenced in the document, ensuring its root is e. Returns
   * the empty string and the document ast, or a
   * non-empty string with the error (and an empty ast without root element).
   *)
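The effect of processing_instructions_inline, as described in the interface comments, can be pictured with a small toy model. This is only a sketch: the types item and node and the function build are hypothetical names invented for illustration and are not the node classes of Markup_document. It shows the example <a><?x?><?y?></a> from the comment in both modes.

```ocaml
(* Toy model only: [item], [node] and [build] are hypothetical and do not
   belong to the Markup library. *)

(* Content of an element as it appears in the XML text, in document order. *)
type item =
  | Pi of string                 (* a processing instruction, by target *)
  | Elem of string * item list   (* an element and its content *)

(* A node of the resulting tree. *)
type node = {
  name : string;
  pis : string list;     (* PIs attached to this node; order has no meaning *)
  children : node list;  (* sub-nodes in document order *)
}

(* Build the tree for element [name] with content [items].
   Normal mode: PIs are collected in the enclosing element.
   Inline mode: every PI is wrapped in its own "-pi" node, which keeps
   its position among the children. *)
let rec build ~inline name items =
  let pis =
    if inline then []
    else List.filter_map (function Pi t -> Some t | Elem _ -> None) items
  in
  let children =
    List.filter_map
      (function
        | Elem (n, sub) -> Some (build ~inline n sub)
        | Pi t ->
            if inline then Some { name = "-pi"; pis = [t]; children = [] }
            else None)
      items
  in
  { name; pis; children }

let () =
  (* Content of <a><?x?><?y?></a> *)
  let items = [Pi "x"; Pi "y"] in
  let normal = build ~inline:false "a" items in
  let inlined = build ~inline:true "a" items in
  Printf.printf "normal mode: %d children, %d PIs in \"a\"\n"
    (List.length normal.children) (List.length normal.pis);
  Printf.printf "inline mode: children of \"a\" = %s\n"
    (String.concat ", " (List.map (fun n -> n.name) inlined.children))
```

In the toy model the position of each PI is only recoverable in inline mode, because there the "-pi" nodes sit in the ordered list of children.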
There are the following main modes of invoking the parser:
parse_document_entity: You want to parse a complete and closed document consisting of a DTD and the document body; the body is validated against the DTD. This mode is interesting if you have a file
<!DOCTYPE root ... [ ... ] >
<root> ... </root>

and you can accept any DTD that is included in the file.
parse_wf_entity: You want to parse a complete and closed document consisting of a DTD and the document body; but the body is not validated, only checked for well-formedness. This mode is preferred if validation costs too much time.
parse_dtd_entity: You only want to parse an entity (file) containing the whole DTD or only a part of it. Sometimes it is interesting to read a DTD without a document, for example to compare it with the DTD included in a document, or to apply the next mode:
parse_content_entity: You only want to parse an entity (file) containing a fragment of a document body; this fragment is validated against the DTD you pass to the parser. In particular, the fragment must not have a <!DOCTYPE> clause, and must directly begin with an element. The element is validated against the DTD (note that it is possible to create a "DTD" that matches everything, i.e. you can turn off validation). This mode is interesting if you want to check documents against a fixed, immutable DTD.
There are a number of variations of these modes. One important application of a parser is to check documents from an untrusted source against a fixed DTD. One solution is to disallow the <!DOCTYPE> clause in these documents, and to treat the document like a fragment (using mode parse_content_entity). This is very simple, but inflexible; users of such a system cannot even define additional entities to abbreviate frequent phrases of their text.
A more intelligent checker may be necessary. For example, it is also possible to parse the document to be checked in full, i.e. together with its DTD, and to compare this DTD with the prescribed one. To parse the document in full, mode parse_document_entity is applied; to obtain the DTD to compare against, mode parse_dtd_entity can be used.
There is another very important configurable aspect of the parser: the so-called resolver. The task of the resolver is to locate the contents of an (external) entity for a given entity name, and to make the contents accessible as a character stream. (Furthermore, it also normalizes the character set; but this is a detail we can ignore here.) Suppose you have a file called "main.xml" containing
<!ENTITY % sub SYSTEM "sub/sub.xml">
%sub;

and a file stored in the subdirectory "sub" with name "sub.xml" containing
<!ENTITY % subsub SYSTEM "subsub/subsub.xml">
%subsub;

and a file stored in the subdirectory "subsub" of "sub" with name "subsub.xml" (the contents of this file do not matter). Here, the resolver must track that the second entity subsub is located in the directory "sub/subsub". The difficulty is to interpret the system (file) names of entities relative to the entities containing them, even if the entities are deeply nested.
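The bookkeeping described above can be sketched in a few lines. This is a minimal illustration, not the code of the parser's file resolver: the function names dir_of, is_absolute and resolve_relative are hypothetical, and the sketch assumes slash-separated names and does no normalization of "." or "..".

```ocaml
(* Minimal sketch; the names below are hypothetical and not part of the
   Markup library. *)

(* Directory part of a slash-separated name, including the trailing
   slash; the empty string if the name has no directory part. *)
let dir_of name =
  match String.rindex_opt name '/' with
  | Some i -> String.sub name 0 (i + 1)
  | None -> ""

let is_absolute name =
  String.length name > 0 && name.[0] = '/'

(* Interpret the SYSTEM name [name] relative to [base], the name of the
   entity that contains the reference. Absolute names are kept as-is. *)
let resolve_relative ~base name =
  if is_absolute name then name else dir_of base ^ name

let () =
  (* main.xml refers to "sub/sub.xml" ... *)
  let sub = resolve_relative ~base:"main.xml" "sub/sub.xml" in
  (* ... and sub/sub.xml in turn refers to "subsub/subsub.xml". *)
  let subsub = resolve_relative ~base:sub "subsub/subsub.xml" in
  Printf.printf "%s\n%s\n" sub subsub
```

Each resolved name becomes the base for the references found inside that entity, which is how the nesting is tracked to any depth.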
There is no fixed resolver that already does everything right; resolving entity names is a task that highly depends on the environment. The XML specification only demands that SYSTEM entities are interpreted like URLs (which is not very precise, as there are lots of URL schemes in use), hoping that this helps to overcome the local peculiarities of the environment; the idea is that if you do not know your environment you can refer to other entities by giving their URLs. I think that this interpretation of SYSTEM names may have some applications on the Internet, but it is not the first choice in general. Because of this, the resolver is a separate module of the parser that can be replaced by another one if necessary; more precisely, the parser already defines several resolvers.
The following resolvers do already exist:
A resolver for file names. The SYSTEM name is simply interpreted as a file name with the slash "/" as separator for directories.
A resolver that always reads from a given O'Caml string. This resolver is not able to resolve further names because the string is not associated with any name, i.e. if the document contained in the string refers to an external entity, this reference cannot be followed.
A resolver that reads from an arbitrary input channel. This resolver cannot resolve inner names either.
Note that the existing resolvers only interpret SYSTEM names, not PUBLIC names. If it helps you, it is possible to define resolvers for PUBLIC names, too; for example, such a resolver could look up the public name in a hash table, and map it to a system name which is passed on to the existing resolver for system names. It is relatively simple to provide such a resolver.
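The hash table lookup suggested above might look roughly like this. It is a sketch under assumptions, not a resolver for the real Markup_reader interface: the type external_id, the table catalog, the public identifier and the file names are all made up for illustration, and the existing resolver for system names is stood in for by an ordinary function argument.

```ocaml
(* Sketch only: [external_id], [catalog], [system_name_of] and
   [resolve_with] are hypothetical names, not the Markup_reader API. *)

(* An external identifier: either a SYSTEM name alone, or a PUBLIC name
   together with the SYSTEM name that accompanies it. *)
type external_id =
  | System of string
  | Public of string * string

(* Maps PUBLIC names to SYSTEM names. The entry is an invented example. *)
let catalog : (string, string) Hashtbl.t = Hashtbl.create 16

let () =
  Hashtbl.replace catalog
    "-//EXAMPLE//DTD Sample 1.0//EN" "dtds/sample-1.0.dtd"

(* Choose the system name to resolve: a catalog entry for the PUBLIC
   name wins; otherwise fall back to the SYSTEM name that was given. *)
let system_name_of id =
  match id with
  | System s -> s
  | Public (p, s) ->
      (match Hashtbl.find_opt catalog p with
       | Some mapped -> mapped
       | None -> s)

(* Hand the chosen system name to a resolver for system names, which is
   represented here by the function [system_resolver]. *)
let resolve_with ~system_resolver id =
  system_resolver (system_name_of id)

let () =
  let show name = Printf.printf "would open: %s\n" name in
  resolve_with ~system_resolver:show
    (Public ("-//EXAMPLE//DTD Sample 1.0//EN", "fallback.dtd"));
  resolve_with ~system_resolver:show (System "main.xml")
```

Keeping the mapping separate from the system resolver means the table can be extended without touching the code that actually opens files.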