[SCL] XML question
Frank Farance
frank at farance.com
Sat Feb 7 20:20:50 CST 2004
At 15:06 2004-01-22 -0600, pat hayes wrote:
> Does anyone have a good answer to the following question? Its really
> about the design principles of XML.
Pat-
I believe James Anderson provided much of the response, but I'd like to add my two cents' worth. :-)
> In writing the SCL core syntax, I had in mind that it ought to be
> possible to include chunks of SCL core inside XML documents without
> the XML barfing. Since XML reserves the characters '<' and '&' and
> uses ' " ' for quoting, my first instinct was to simply ban these
> characters from appearing anywhere in the SCL core syntax: then one
> can take any piece of SCL core text, stick double-quotes on either
> side of it, and plonk it down as for example an attribute value in
> XHTML, and nothing breaks (this would be neat for example when
> attaching SCL as markup to a web page, since the SCL would be
> invisible in the HTML but visible to processors).
First, I believe that embedding in XML and embedding in HTML are two separate questions.
For HTML, for example, you could embed an anchor, text, and some SCL all together, with the SCL inside <script></script>:
<html>
<body>
<p>here's some text</p>
<p>
<a name=text_to_tag>
<script type="text/x-scl">
... some SCL text with < & >
</script>
my special text to tag
</a>
</p>
</body>
</html>
Assuming this HTML was in a file named "foobar.html", then "foobar.html#text_to_tag" would refer to the text and your SCL script. For the SCL script, you just have to make sure that the string "</script>" doesn't appear in the SCL code (probably not a problem).
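As a rough sketch of that check (the function name is my own invention, not part of any SCL tooling), a few lines of Python can refuse SCL text that would close the script element early. I match "</script" case-insensitively, on the assumption that an HTML parser does not care about the case of the tag name:

```python
import re

# Hypothetical helper for illustration only; not part of any SCL tool.
# Matches the start of a closing script tag in any letter case.
_END_SCRIPT = re.compile(r"</script", re.IGNORECASE)

def embed_scl_in_script(scl_text):
    """Wrap SCL text in a <script> element, rejecting text that would end it early."""
    if _END_SCRIPT.search(scl_text):
        raise ValueError("SCL text contains '</script', which would close the element")
    # '<', '&' and '>' are left untouched: inside <script> they need no escaping in HTML.
    return '<script type="text/x-scl">\n' + scl_text + "\n</script>"
```

Note that nothing is escaped here; the SCL text goes in verbatim, which is the whole attraction of this approach.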
In a practical sense, one can dump a lotta stuff into HTML.
The question is somewhat different for XML because HTML and XML, effectively, are "parsed" differently -- even though they appear to have very similar syntactic features. For HTML, it is seen as a series of tokens that drive a state machine (e.g., a browser) that has some useful side effects (e.g., rendering pages to the user). In other words, there is agreement upon the meaning of the data (i.e., HTML tags, their attributes, and their content).
For XML, there is no common agreement on the meaning of the data, but merely an agreement upon basic syntactic features (elements, entities, etc.) and their well-formedness (e.g., start and end tags need to be balanced). In many cases, additional constraints are ***layered on top of the agreement among data interchange participants***, such as validation of element structure, validation of element/attribute data itself, and the meaning of whitespace (must whitespace be preserved or can each set of whitespace characters be reduced to a single space?).
It is impossible to determine what the actual data interchange agreements are by just looking at the XML (of course, this is not a problem for HTML). Thus, it is impossible to know if adding <script></script> is valid, invalid, or conflicting (i.e., not the meaning you had intended). A common approach to this problem is to embed the script in comments. Here's the XML before:
<somexml>
<xx>
<yy>...</yy>
<zz>...</zz>
</xx>
<xx>
<yy>...</yy>
<zz>...</zz>
</xx>
</somexml>
and after:
<somexml>
<xx>
<!-- SCL: ... some SCL code -->
<yy>...</yy>
<zz>...</zz>
</xx>
<xx>
<yy>...</yy>
<zz>...</zz>
</xx>
</somexml>
So what are the hazards of the above?
- You need to agree upon some unique prefix (see the MS-Word example below).
- Your SCL code might get lost if someone decides to strip comments.
- Your SCL code (in the example above) might have caused extra whitespace that changes the meaning of the original XML code (see improvement below).
- Your SCL code cannot have the sequence "-->" in it; in fact, XML does not allow "--" anywhere inside a comment.
- Unlike the HTML anchor and its SCL code, it is difficult to create a pointer to the SCL code embedded as an XML comment.
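To make a couple of these hazards concrete, here is a minimal Python sketch using the standard xml.dom.minidom module. The "SCL:" prefix and both function names are assumptions for illustration. The check rejects any "--" in the SCL text, since XML does not permit a double hyphen inside a comment (which also rules out "-->"):

```python
from xml.dom import minidom

PREFIX = "SCL:"  # assumed marker that the interchange participants agreed upon

def make_scl_comment(scl_text):
    """Build an XML comment carrying SCL text, or fail if the text cannot be embedded."""
    if "--" in scl_text:
        raise ValueError("XML comments may not contain '--'")
    return "<!-- " + PREFIX + " " + scl_text + " -->"

def find_scl_comments(xml_text):
    """Return the SCL strings found in comments that start with the agreed prefix."""
    found = []

    def walk(node):
        for child in node.childNodes:
            if child.nodeType == child.COMMENT_NODE:
                data = child.data.strip()
                if data.startswith(PREFIX):
                    found.append(data[len(PREFIX):].strip())
            walk(child)

    walk(minidom.parseString(xml_text))
    return found
```

The extraction only works with a parser that keeps comments. A tool that discards them will silently lose the SCL, which is the second hazard in the list.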
To fix the whitespace problem above (assuming there is no whitespace compression), you would probably just insert the SCL string into the text with no extra newline:
<somexml>
<xx>
<!-- SCL: ... some SCL code --><yy>...</yy>
<zz>...</zz>
</xx>
<xx>
<yy>...</yy>
<zz>...</zz>
</xx>
</somexml>
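The whitespace difference can be checked with a short Python sketch (again xml.dom.minidom; the helper name and the three sample strings are mine, cut down from the examples above to a single <xx> element). It collects the text nodes directly under <xx> for each variant:

```python
from xml.dom import minidom

def child_text_nodes(xml_text, tag):
    """Return the raw text-node strings directly under the first <tag> element."""
    doc = minidom.parseString(xml_text)
    element = doc.getElementsByTagName(tag)[0]
    return [n.data for n in element.childNodes if n.nodeType == n.TEXT_NODE]

# Reduced versions of the examples above (one <xx> element each).
original = "<xx>\n<yy>...</yy>\n<zz>...</zz>\n</xx>"
own_line = "<xx>\n<!-- SCL: ... -->\n<yy>...</yy>\n<zz>...</zz>\n</xx>"
same_line = "<xx>\n<!-- SCL: ... --><yy>...</yy>\n<zz>...</zz>\n</xx>"

for label, text in (("original", original), ("own line", own_line), ("same line", same_line)):
    print(label, child_text_nodes(text, "xx"))
```

With the comment on its own line, <xx> gains one more newline-only text node than the original has; with the comment on the same line as <yy>, the text nodes match the original. Whether the extra node matters depends, once more, on the data interchange agreement.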
If you want to see this technique in a large example, look at how Microsoft embedded XML into their HTML code when saving an MS-Word document as HTML (FYI, Microsoft chose "o:" as a prefix for their Office products). It certainly isn't pretty.
As I said, the data interchange agreements are not known by just inspecting an XML file. So one cannot gratuitously add elements, attributes, etc. to "markup" or tag existing XML files unless one is certain about the particular data interchange agreements.
In conclusion, the angle brackets and ampersands probably won't be a problem. My guess is that the HTML/XML closing comment sequence "-->" might require some attention.
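For the attribute-value case in the original question, escaping is an alternative to banning characters from the SCL syntax. A small Python sketch, assuming a made-up element name "span" and attribute name "scl", uses the standard quoteattr function to escape and quote the text, then parses the result to confirm the SCL survives the round trip:

```python
from xml.dom import minidom
from xml.sax.saxutils import quoteattr

def scl_as_attribute(scl_text):
    """Return a tiny XML element carrying SCL text in an attribute (names are hypothetical)."""
    # quoteattr escapes '&', '<' and '>' and chooses a quote character for the value.
    return "<span scl=" + quoteattr(scl_text) + "/>"

def read_scl_attribute(xml_text):
    """Parse the element back and return the unescaped attribute value."""
    return minidom.parseString(xml_text).documentElement.getAttribute("scl")

scl = '(and (< x 1) (name "a & b"))'
assert read_scl_attribute(scl_as_attribute(scl)) == scl
```

This is only one way to do it, and it assumes that whoever reads the attribute uses an XML parser that undoes the escaping.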
Hope this helps.
-FF
______________________________________________________________________
Frank Farance, Farance Inc. T: +1 212 486 4700 F: +1 212 759 1605
mailto:frank at farance.com http://farance.com
Standards/Products/Services for Information/Communication Technologies