[Foomatic] entity reference problem in XML data
Johannes Meixner
jsmeix at suse.de
Tue Jan 20 08:32:26 PST 2004
Hello,
as far as I can see there is a general problem in the XML data
with the entity references:
&amp; &apos; &gt; &lt; &quot;
When such entity references appear in XML character data
and are meant to stay unchanged in their literal form
then they must be written this way:
&amp;amp; &amp;apos; &amp;gt; &amp;lt; &amp;quot;
I.e. the XML special character & must be escaped separately
from the rest.
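A minimal Python sketch of this rule, assuming the standard library helper xml.sax.saxutils.escape is an acceptable way to produce character data. The function name escape_literal is only for illustration and is not part of Foomatic. It escapes every & on its own, so an entity reference that is meant literally survives a later parse:

```python
from xml.sax.saxutils import escape


def escape_literal(text):
    """Return text encoded so that an XML parser hands it back unchanged."""
    # escape() rewrites &, < and > only; in this input only & occurs,
    # so each & becomes &amp; and the rest of each reference is kept.
    return escape(text)


if __name__ == "__main__":
    print(escape_literal("&amp; &apos; &gt; &lt; &quot;"))
```

The printed line is what would have to be stored in the XML file if the five entity references are supposed to reach the consumer as literal text.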
Otherwise the meaning of the XML data would be wrong and this
would result in wrong data if it is processed with XML tools.
I.e. at the moment the Foomatic XML data cannot be processed
correctly with XML tools.
The real problem is that the Foomatic parser and perhaps a lot
of third party software would have to be changed.
Perhaps we can ignore this problem.
This depends on whether or not XML parser output is still valid
for the Foomatic and third party tools.
In particular the &quot; may cause a problem because an XML parser
would replace it with a plain " character but this is a special
character in a PPD file.
For example:
xmllint Kyocera-FS-1010.xml
produces this output for the comments:
...
Use the
&lt;a href="/download/PPD/Kyocera/"&gt;Kyocera PPD file&lt;/a&gt;
for PostScript,
for PostScript,
...
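The same effect can be sketched in Python with the standard library parser xml.etree.ElementTree. The fragment below is a hypothetical stand-in modelled on the comment above, not the real content of Kyocera-FS-1010.xml, and the element name comments is an assumption:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment, written the way the comment is assumed to be stored.
FRAGMENT = (
    "<comments>Use the "
    "&lt;a href=&quot;/download/PPD/Kyocera/&quot;&gt;Kyocera PPD file&lt;/a&gt; "
    "for PostScript,</comments>"
)


def parsed_text(xml):
    """Return the character data of the root element as a consumer sees it."""
    return ET.fromstring(xml).text


if __name__ == "__main__":
    # After parsing, the text holds plain " characters instead of &quot;
    print(parsed_text(FRAGMENT))
```

Any tool that copies this parsed text into a PPD file gets the plain " characters, which is the case described above.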
Detailed explanation:
&amp; &apos; &gt; &lt; &quot; are XML entity references.
When an XML parser reads them they are immediately replaced
by the matching plain characters.
I.e. "inside" the XML parser there are no longer the entity
references but the plain characters & ' > < "
You can verify this by using for example xmllint and watching the output:
echo '<foo>&amp; &apos; &gt; &lt; &quot;</foo>' | xmllint -
produces the output
<?xml version="1.0"?>
<foo>&amp; ' &gt; &lt; "</foo>
You may wonder why only &apos; and &quot; are replaced
by the plain characters.
The reason is a bit complicated:
When reading the input the parser replaces all entity references
with the matching plain characters.
Later when creating the output the characters & > and <
are re-replaced by the matching entity references because
this way the parser makes sure that the output is valid XML.
In this example & > < are character data - see
http://www.w3.org/TR/1998/REC-xml-19980210#syntax
-----------------------------------------------------------------
2.4 Character Data and Markup
Text consists of intermingled character data and markup.
Markup takes the form of start-tags, end-tags, empty-element tags,
entity references, character references, comments, CDATA section
delimiters, document type declarations, and processing instructions.
All text that is not markup constitutes the character data of the
document.
The ampersand character (&) and the left angle bracket (<) may
appear in their literal form only when used as markup delimiters,
-----------------------------------------------------------------
Those characters in character data which would lead to invalid
XML output are "escaped" by their matching entity references.
You can verify this by using numeric character references instead:
echo '<foo>&#038; &#039; &#062; &#060; &#034;</foo>' | xmllint -
produces exactly the same output as above:
<?xml version="1.0"?>
<foo>&amp; ' &gt; &lt; "</foo>
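For readers without xmllint, the read-then-write behaviour can be sketched in Python. This uses xml.etree.ElementTree, which is a different parser than the libxml2 behind xmllint. Its serializer also escapes only &, < and > in character data, but it does not print the <?xml ...?> declaration here. The function name round_trip is only for illustration:

```python
import xml.etree.ElementTree as ET

NAMED = "<foo>&amp; &apos; &gt; &lt; &quot;</foo>"
NUMERIC = "<foo>&#038; &#039; &#062; &#060; &#034;</foo>"


def round_trip(xml):
    """Parse xml and serialize it again, like reading and re-writing a file."""
    # Parsing replaces every reference by its plain character;
    # serializing re-escapes only the characters that would break the markup.
    return ET.tostring(ET.fromstring(xml), encoding="unicode")


if __name__ == "__main__":
    print(round_trip(NAMED))
    print(round_trip(NUMERIC))
```

Both calls print the same line, because after parsing the two inputs hold identical character data and the parser no longer knows which kind of reference was used.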
In contrast
echo '<foo>&amp;amp; &amp;apos; &amp;gt; &amp;lt; &amp;quot;</foo>' \
| xmllint -
produces
<?xml version="1.0"?>
<foo>&amp;amp; &amp;apos; &amp;gt; &amp;lt; &amp;quot;</foo>
I.e. the output looks as if it was not changed by the XML parser.
In fact it was changed twice as described above because
echo '<foo>&#038;amp; &#038;apos; &#038;gt; &#038;lt; &#038;quot;</foo>' \
| xmllint -
produces exactly the same as above but
echo '<foo>&amp;#038; &amp;#039; &amp;#062; &amp;#060; &amp;#034;</foo>' \
| xmllint -
produces a different (but nevertheless correct) output:
<?xml version="1.0"?>
<foo>&amp;#038; &amp;#039; &amp;#062; &amp;#060; &amp;#034;</foo>
Regards
Johannes Meixner
--
SUSE LINUX AG, Maxfeldstrasse 5 Mail: jsmeix at suse.de
90409 Nuernberg, Germany WWW: http://www.suse.de/