The type source enumerates the possible places from which the document to parse may come.
type source =
    Entity of ((dtd -> Markup_entity.entity) * Markup_reader.resolver)
  | Channel of in_channel
  | File of string
  | Latin1 of string
  | ExtID of (ext_id * Markup_reader.resolver)
File s: The document is read from the file s; you may specify an absolute or relative path name. This source has the advantage that it can resolve inner external entities; i.e. if your document includes data from another file (referenced by a SYSTEM clause), this mode will find that file. This mode cannot resolve PUBLIC identifiers.
This mode is able to read UTF-8, UTF-16, and ISO-8859-1-encoded data.
Channel ch: The document is read from the channel ch. This mode can neither interpret file names given in SYSTEM clauses nor look up PUBLIC identifiers.
This mode is able to read UTF-8, UTF-16, and ISO-8859-1-encoded data.
Latin1 s: The string s is the document to parse. This mode can neither interpret file names given in SYSTEM clauses nor look up PUBLIC identifiers.
This mode is only able to read ISO-8859-1-encoded data.
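For illustration, here is a hedged sketch of how values of the three simple sources above can be constructed. The file name and the inline document are placeholders, and the source constructors are assumed to be in scope (e.g. by opening the module that defines type source):

(* Illustration only: "doc.xml" and the inline document are placeholders. *)
let from_file    = File "doc.xml"                       (* can resolve SYSTEM file names *)
let from_channel = Channel (open_in "doc.xml")          (* no SYSTEM/PUBLIC resolution *)
let from_string  = Latin1 "<greeting>Hello</greeting>"  (* string must be ISO-8859-1 *)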
ExtID (id, r): The document to parse is denoted by the identifier id (either a SYSTEM or PUBLIC clause), and this identifier is interpreted by the resolver r. Use this mode if you have written your own resolver.
Which character sets are possible depends on the passed resolver r.
Entity (get_entity, r): The document to parse is returned by the function invocation get_entity dtd, where dtd is the DTD object to use (it may be empty). Inner external references occurring in this entity are resolved using the resolver r.
Which character sets are possible depends on the passed resolver r.
A resolver is an object that can be opened like a file; however, instead of a file name you pass it the XML identifier of the entity to read from (either a SYSTEM or PUBLIC clause). When opened, the resolver must return a Lexing.lexbuf that reads the characters. The resolver can be closed, and it can be cloned. Furthermore, it is possible to tell the resolver which character set it should assume. (Note: Markup works internally with ISO-8859-1, so the resolver must recode the characters into this character set.)
class type resolver =
  object
    method open_in : ext_id -> Lexing.lexbuf
    method close_in : unit
    method change_encoding : string -> unit
    method clone : resolver
  end

The resolver object must work as follows:
If the parser wants to read from the resolver, it invokes the method open_in. Either opening succeeds, and the method must return a Lexing.lexbuf reading from the file or stream, or it fails, in which case the method implementation should raise an exception.
If the parser finishes reading, it calls the close_in method.
If the parser finds a reference to another external entity in the input stream, it calls clone to get a second resolver that is still closed (i.e. not yet connected to an input stream). The parser then invokes open_in and the other methods on the clone as described.
If you already know the character set of the input stream, you should recode it to ISO-8859-1, and define the method change_encoding as an empty method.
If you want to support multiple character sets, the object must follow a much more complicated protocol. Directly after open_in has been called, the resolver must return a lexical buffer that reads only one byte at a time. This is only possible if you create the lexical buffer with Lexing.from_function; the function must then return 1 as long as EOF has not been reached, and 0 once it has. After the parser has read the first line of the document, it invokes change_encoding to tell the resolver which character set to assume. From this moment on, the object may return more than one byte at once. The argument of change_encoding is either the value of the "encoding" attribute of the XML declaration, or the empty string if there is no XML declaration or if the declaration does not contain an encoding attribute.
At the beginning, the resolver must return only one character each time something is read from the lexical buffer. Otherwise you would not know exactly at which position in the input stream the character set changes.
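The following is a minimal sketch of this phase of the protocol. It assumes that the raw document bytes come from an in_channel (the parameter ch is hypothetical) and that no recoding is needed; a real resolver would additionally have to recode non-Latin-1 input to ISO-8859-1:

let make_lexbuf (ch : in_channel) =
  (* Becomes true once the parser has called change_encoding. *)
  let encoding_known = ref false in
  let lexbuf =
    Lexing.from_function
      (fun buf n ->
         (* Deliver single bytes until the encoding is known, then fill
            as much of the buffer as the parser asked for.  The standard
            function "input" returns 0 at end of file, as required. *)
         let want = if !encoding_known then n else 1 in
         input ch buf 0 want)
  in
  let change_encoding (_enc : string) =
    (* From now on, larger chunks may be returned at once. *)
    encoding_known := true
  in
  (lexbuf, change_encoding)

A resolver supporting multiple character sets would return such a lexbuf from open_in and delegate its change_encoding method to the returned change_encoding function.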
If you want automatic recognition of the character set, it is up to the resolver object to implement this.
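One common way to do this, sketched here under the assumption that the first two raw bytes are already available, is to inspect the byte order mark before anything else is read; the returned strings are only illustrative labels, not identifiers required by Markup:

let detect_encoding (b0 : char) (b1 : char) =
  match b0, b1 with
  | '\xFE', '\xFF' -> "UTF-16-BE"   (* big-endian byte order mark *)
  | '\xFF', '\xFE' -> "UTF-16-LE"   (* little-endian byte order mark *)
  | _              -> "UTF-8"       (* assume an ASCII-compatible encoding
                                       until the XML declaration says otherwise *)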
Example: How to define a resolver that is equivalent to Latin1.
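Below is a minimal sketch of such a resolver, assuming that ext_id and the class type resolver shown above are in scope. It holds the document as an ISO-8859-1 string, ignores the external ID it is asked to open, and needs no change_encoding logic because the character set is fixed; the class name latin1_string_resolver is only illustrative:

class latin1_string_resolver (s : string) =
  object
    method open_in (_ : ext_id) =
      (* The data is already Latin-1, so it can be handed to the lexer
         directly and in chunks of any size. *)
      Lexing.from_string s

    (* Nothing needs to be released when reading is finished. *)
    method close_in = ()

    (* Empty method: the character set is known in advance (see above). *)
    method change_encoding (_ : string) = ()

    (* A fresh copy of this object; it carries no open stream, so the
       copy is still "closed" as required. *)
    method clone = {< >}
  end

Such a resolver can then be used with the ExtID source, e.g. ExtID (System "data", new latin1_string_resolver s), assuming System is one of the ext_id constructors; the SYSTEM identifier is only a dummy because the resolver ignores it.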