3.4. Details of the mapping from XML text to the tree representation

3.4.1. The representation of character-free elements

If an element declaration does not allow the element to contain character data, the following rules apply.

If the element must be empty, i.e. it is declared with the keyword EMPTY, the element instance may still contain whitespace characters (spaces, tabs, carriage returns, and newlines). The parser ignores these characters, so such an element never has a data subnode for them.

If the element declaration only permits other elements to occur within that element but not character data, it is still possible to insert whitespace characters between the subelements. The parser ignores these characters, too, and does not create data nodes for them.

Example. Consider the following element types:

<!ELEMENT x ( #PCDATA | z )* >
<!ELEMENT y ( z )* >
<!ELEMENT z EMPTY>
Only x may contain character data, as the keyword #PCDATA indicates. The other two types are character-free.

The XML term

<x><z/> <z/></x>
will be internally represented by an element node for x with three subnodes: the first z element, a data node containing the space character, and the second z element. In contrast to this, the term
<y><z/> <z/></y>
is represented by an element node for y with only two subnodes, the two z elements. There is no data node for the space character because spaces are ignored in the character-free element y.
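
The rule illustrated above can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not the parser's own code: the function build_children, the tuple-based node representation, and the allows_character_data flag are all hypothetical names introduced for this example.

```python
WHITESPACE = " \t\r\n"

def build_children(items, allows_character_data):
    """Return the subnodes that would be kept for one element.

    items is a list of ("element", name) or ("data", text) pairs in
    document order (a hypothetical representation for this sketch).
    """
    children = []
    for kind, value in items:
        # A data item is "blank" if it consists of whitespace only.
        is_blank = kind == "data" and value.strip(WHITESPACE) == ""
        if is_blank and not allows_character_data:
            # Character-free element: no data node is created.
            continue
        children.append((kind, value))
    return children

# The content "<z/> <z/>" from the example above.
items = [("element", "z"), ("data", " "), ("element", "z")]
in_x = build_children(items, allows_character_data=True)   # x has #PCDATA
in_y = build_children(items, allows_character_data=False)  # y is character-free
```

Here in_x keeps all three subnodes, while in_y contains only the two z elements. Non-whitespace text inside a character-free element is not handled by this sketch; it would violate the element declaration.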

3.4.2. The representation of character data

The XML specification allows all Unicode characters in XML texts. This parser cannot handle characters with code points above 255, because it represents characters internally in the ISO-8859-1 character set (which is identical to Unicode code points 0 to 255). When the parser finds a character it cannot represent, it drops the character and reports a warning to the collect_warning object that must be passed when the parser is called.
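
The effect of this restriction can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the parser's implementation: the function to_latin1 is hypothetical, a plain list stands in for the collect_warning object, and the warning text is invented for the example.

```python
def to_latin1(text, warnings):
    """Keep only characters with code points 0 to 255.

    Every dropped character is reported by appending a message to
    warnings (a stand-in for the collect_warning object).
    """
    kept = []
    for ch in text:
        if ord(ch) <= 255:
            kept.append(ch)
        else:
            warnings.append(
                f"dropped unrepresentable character U+{ord(ch):04X}"
            )
    return "".join(kept)

warnings = []
# U+00E9 (e with acute accent) is representable; U+20AC (euro sign) is not.
result = to_latin1("caf\u00e9 \u20ac5", warnings)
```

After the call, result still contains the accented character, the euro sign is gone, and warnings holds exactly one message.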

The XML specification allows lines to be separated by single LF characters, by CR LF character sequences, or by single CR characters. Internally, these separators are always converted to single LF characters.
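
A short Python sketch of this conversion, assuming a hypothetical helper named normalize_line_separators (the parser's own routine is not shown in this manual):

```python
def normalize_line_separators(text):
    """Convert CR LF and single CR line separators to single LF."""
    # Replace CR LF first, so that the pair does not turn into two LFs.
    return text.replace("\r\n", "\n").replace("\r", "\n")

normalized = normalize_line_separators("a\r\nb\rc\nd")
```

The order of the two replacements matters: handling single CR characters first would turn every CR LF pair into two line breaks.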

3.4.3. The representation of entities within documents

Entities are not represented within documents! If the parser finds an entity reference in the document content, the reference is immediately expanded, and the parser reads the expansion text instead of the reference.

3.4.4. The representation of attributes

Because attribute values are composed of Unicode characters as well, the same restriction applies as for character data: characters that cannot be represented are dropped, and a warning is reported.

Attribute values are normalized before they are returned by methods like attribute. First, any remaining entity references are expanded; where necessary, expansion is performed recursively. Second, line separators (LF, CR LF, or CR) are each converted to a single space character. Note that the latter step in particular conforms to the XML standard.
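
The two normalization steps can be sketched in Python as shown below. This is an illustration only: the functions expand and normalize_attribute, the entities dictionary, and the simplified pattern for entity names are assumptions made for this example, and character references such as &#10; are not covered.

```python
import re

# Simplified pattern for general entity references such as "&name;".
ENTITY_REF = re.compile(r"&([A-Za-z_][A-Za-z0-9_.-]*);")

def expand(text, entities, active=()):
    """Recursively replace entity references by their expansion text."""
    def replace(match):
        name = match.group(1)
        if name in active:
            # An entity must not refer to itself, directly or indirectly.
            raise ValueError(f"recursive entity reference: {name}")
        return expand(entities[name], entities, active + (name,))
    return ENTITY_REF.sub(replace, text)

def normalize_attribute(raw, entities):
    """Step 1: expand entity references. Step 2: line separators to spaces."""
    expanded = expand(raw, entities)
    # CR LF is listed first so that the pair yields a single space.
    return re.sub(r"\r\n|\r|\n", " ", expanded)

entities = {"product": "&vendor; parser", "vendor": "Example"}
value = normalize_attribute("the &product;\r\nmanual", entities)
```

In this example the reference &product; expands to text that itself contains &vendor;, which is expanded in turn, and the CR LF sequence becomes one space.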

3.4.5. The representation of processing instructions

Processing instructions (PIs) are not (yet) well supported. They are parsed only to a limited extent: the first word of the PI is called the target, and it is stored separately from the rest of the PI:

<?target rest?>
The exact location where a PI occurs is not represented. The parser stores the PI in the object that represents the enclosing construct (an element, a DTD, or the whole document). This means you can find out which PIs occur in a certain element, in the DTD, or in the whole document, but you cannot look up their exact position within that construct.
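
The following Python sketch illustrates both points: splitting a PI into target and rest, and attaching it to the enclosing construct without recording a position. The names split_pi and Container, and the choice to index the stored PIs by target, are assumptions for this example; the manual does not specify how the parser stores them.

```python
def split_pi(pi_text):
    """Split '<?target rest?>' into the pair (target, rest)."""
    body = pi_text[2:-2]          # strip the "<?" and "?>" delimiters
    parts = body.split(None, 1)   # split at the first run of whitespace
    if not parts:
        raise ValueError("processing instruction without a target")
    target = parts[0]
    rest = parts[1] if len(parts) > 1 else ""
    return target, rest

class Container:
    """Stand-in for an element, a DTD, or the whole document."""

    def __init__(self, name):
        self.name = name
        self.pis = {}  # target -> list of rest strings; no positions kept

    def add_pi(self, pi_text):
        target, rest = split_pi(pi_text)
        self.pis.setdefault(target, []).append(rest)

element = Container("x")
element.add_pi("<?target rest?>")
element.add_pi('<?xml-stylesheet href="a.xsl"?>')
```

After these calls, element.pis tells you which PIs occurred inside x, but nothing in the structure records where among the subnodes they appeared.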

3.4.6. The attributes xml:lang and xml:space

These attributes receive no special treatment; they are handled like any other attribute.