[SCL] XML question
Murray Altheim
m.altheim at open.ac.uk
Thu Jan 22 18:29:09 CST 2004
pat hayes wrote:
> Does anyone have a good answer to the following question? It's really
> about the design principles of XML.
>
> In writing the SCL core syntax, I had in mind that it ought to be
> possible to include chunks of SCL core inside XML documents without
> the XML barfing. Since XML reserves the characters '<' and '&' and
> uses ' " ' for quoting, my first instinct was to simply ban these
> characters from appearing anywhere in the SCL core syntax: then one
> can take any piece of SCL core text, stick double-quotes on either
> side of it, and plonk it down as for example an attribute value in
> XHTML, and nothing breaks (this would be neat for example when
> attaching SCL as markup to a web page, since the SCL would be
> invisible in the HTML but visible to processors).
In document content, which is called PCDATA, you have to escape '<' and
'&' (and you should escape '>' too in practice). In attribute values,
which are likely to be declared CDATA, you have to escape '<', '&' and
whichever quote character delimits the value; in practice you just
escape all markup characters, including the single and double quotes.
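A minimal sketch of the two escaping contexts described above, using Python's standard xml.sax.saxutils module. The SCL-like string is made up for illustration and is not taken from the SCL core syntax.

```python
from xml.sax.saxutils import escape, quoteattr

# Made-up SCL-like text containing '<', '&' and double quotes (an assumption,
# not real SCL core syntax).
scl = '(forall (x) (implies (< x 3) (R x "a & b")))'

# Element content (PCDATA): '&', '<' and '>' are escaped; quotes are left alone.
as_content = escape(scl)

# Attribute value: quoteattr escapes the markup characters and also chooses
# a quote delimiter that does not clash with quotes inside the text.
as_attribute = quoteattr(scl)

print(as_content)
print(as_attribute)
```

Because the sample text contains double quotes but no single quotes, quoteattr wraps the value in single quotes rather than rewriting the inner quotes.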
If we know beforehand that SCL will commonly appear within XML documents,
in other words, we might consider it as a design goal, then avoiding
those characters is probably a Very Good Idea. The XML characters
to really avoid are '<', '>' and '&'. The single and double quotes are
in practice not really a big deal, unless you think to put SCL content
within attribute values. I generally avoid putting anything like document
content within attribute values, so single and double quotes are usually
not an issue. Since you mention specifically the idea of putting SCL
within an HTML document as attribute values, you're taking a fairly big
risk, in that a misplaced quote will do an unending amount of harm. The
skies will fall (and worse still, inconsistently, behaving differently
on different browsers and platforms).
I'm not sure under what circumstance you'd actually want to include
SCL markup on a web page. If you did, you could escape *all* the markup
characters, but again, you're kinda asking for trouble (not from the
perspective of the specification, but from the users' difficulties,
i.e., in the real world).
> But what about an SCL string which might contain any character? Well,
> XML allows one to include any character in *parsed* character data by
> escaping the bad characters using entity references. That handles
> going from SCL-in-XML back into SCL. But how about going from SCL
> into SCL-in-XML? The use of the XML escaping seems to require that
> any software which creates XML - for example, something which wants
> to transmit some SCL text between SCL engines using SCL-in-XML -
> must perform a kind of XML-unparsing step to replace every occurrence
> of '<' or '&' by the entity reference.
If you're trying to make the distinction between PCDATA and CDATA, that's
not relevant within attribute values. In XML you can't declare an attribute
value to be PCDATA, only CDATA.
When you say SCL-in-XML, do you mean some sort of XML syntax for SCL,
such as what I've been calling XCL, or do you mean escaped SCL (a non-XML
syntax)?
> My question boils down to this. Do I *need* to keep the surface
> syntax of SCL Core "XML-safe" in the sense that it is guaranteed to
> simply never contain the characters less-than or ampersand (or
> double-quote, in fact) ? This can be done, but is a pain, and
> requires SCL to have its own character-escaping conventions different
> from XML's conventions. (It can't use the XML conventions, since then
> XML itself will alter the SCL string encodings.)
As I said, this depends on whether you really want SCL to be able to
live in attribute values, unescaped. That design decision will affect
everything else you do, so perhaps rather than me answer in a ton of
detail, we should iron that one out first.
> Or is this being inappropriately fussy, since XML tools are already
> capable of handling text which is not "XML-safe" in this way, and
> automatically doing the transformations to and from the XML-escaped
> forms? In which case I should just ignore XML's character
> restrictions when thinking about the SCL syntax itself, and rely on
> generic XML tools and conventions to faithfully handle the parsing
> and coding in and out of the XML syntax.
The XML tools (at least compliant ones) only do this kind of "handling"
when those characters occur in the proper context. Most tools do have
converters, but you also have to think about authoring, and honestly,
there are a lot of people who hand-author, and there are a lot of broken
tools.
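A rough sketch of what "the tools handle it" looks like in practice, assuming Python's standard xml.etree.ElementTree as the generic XML tool. The element name "note", the attribute name "scl" and the SCL-like string are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up SCL-like text (an assumption, not real SCL core syntax).
scl = '(implies (< x 3) (R x "a & b"))'

root = ET.Element("note")   # hypothetical element name
root.set("scl", scl)        # SCL carried in an attribute value
root.text = scl             # SCL carried as element content

# The serializer escapes on the way out; the parser unescapes on the way in.
wire = ET.tostring(root, encoding="unicode")
parsed = ET.fromstring(wire)
round_trip_ok = parsed.get("scl") == scl and parsed.text == scl

# Hand-authored markup that skips the escaping step is not well-formed,
# so a compliant parser rejects it instead of guessing.
try:
    ET.fromstring('<note>(< x 3)</note>')
    hand_authored_ok = True
except ET.ParseError:
    hand_authored_ok = False

print(wire)
print(round_trip_ok, hand_authored_ok)
```

The round trip is faithful when software writes the XML, while the unescaped hand-authored version fails to parse at all, which is the authoring risk mentioned above.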
> Or, should I use the CDATA feature of XML? This seems to have been
> designed for cases like this, but I have the sense that CDATA is
> rarely used in XML-based conventions, and wonder if there is any good
> reason why not. It rather worries me that an XML processor is
> apparently allowed to remove all traces of whether a piece of text
> was originally in a CDATA section or not. I would like XML to
> transmit any SCL-in-XML faithfully, and if XML parsers may remove
> some of the critical encoding information then this seems to
> introduce some fragility into the transaction.
What you're talking about is called a CDATA section. It can appear
anywhere within a document, and creates a completely opaque block
of characters that the XML processor pays no attention to whatsoever,
except for looking for that final ']]>' string. But while this is
very good for escaping code, it doesn't really buy you what you're
looking for, since from the perspective of a document structure,
CDATA sections don't really exist -- you can't engineer them into
a schema, for example, they're just a feature available to authors.
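As an illustration of why CDATA sections "don't really exist" structurally, here is a small sketch using Python's standard xml.etree.ElementTree. The element name foo follows the example in this message; the content and the helper cdata_wrap are hypothetical.

```python
import xml.etree.ElementTree as ET

# The same text, once inside a CDATA section and once with entity escapes.
with_cdata = ET.fromstring('<foo><![CDATA[(< x 3) & (> y 1)]]></foo>')
with_escapes = ET.fromstring('<foo>(&lt; x 3) &amp; (&gt; y 1)</foo>')

# After parsing, both give identical text; nothing records that a CDATA
# section was ever used.
same_text = with_cdata.text == with_escapes.text

def cdata_wrap(text):
    """Hypothetical helper: wrap text in CDATA, splitting any ']]>' so the
    terminator never appears inside a single section."""
    return '<![CDATA[' + text.replace(']]>', ']]]]><![CDATA[>') + ']]>'

tricky = 'a ]]> b'
recovered = ET.fromstring('<foo>' + cdata_wrap(tricky) + '</foo>').text

print(same_text)
print(recovered == tricky)
```

So a CDATA section is only an authoring convenience: the one string it cannot contain still needs special handling, and the parsed result is indistinguishable from escaped text.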
If you were creating a full-blown XML markup language, you could
create what's called an "application convention", unenforceable by
the schema/DTD, that says, whenever you see a <foo> element, its
content will contain a CDATA section containing a special notation.
This isn't a good idea and isn't generally done, except as a last
resort (IMO).
> I'm sure that the XML community has come to an agreement on a suitable
> best practice to follow in a case like this, and would appreciate
> any guidance or input.
I think in order to answer your question better I'd need to know
what the design goal is, really. If you want to be able to put SCL
into an XML attribute value, unescaped, then you want to avoid any
markup characters if possible, including single and double quotes.
Since that's really impractical, you (as you say) begin to rely on
XML authoring tools to escape the contents. This can be a real
difficulty for hand authoring, but it is certainly the solution. Now,
if XCL is an XML markup language in its own right, you'd just put
the XCL into the document as XML markup, using XML Namespaces. The
current generation of browsers should (in theory) be able to handle
that, though in my experience you never know what the hell is going
to come out of Redmond in this regard.
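A sketch of the namespace approach, again with Python's standard xml.etree.ElementTree. The XCL namespace URI and the element names implies and atom are invented for illustration; no actual XCL vocabulary is assumed.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
XCL_NS = "http://example.org/xcl"   # hypothetical namespace URI

# An XHTML document carrying hypothetical XCL markup in its own namespace.
doc = f'''<html xmlns="{XHTML_NS}" xmlns:xcl="{XCL_NS}">
  <body>
    <p>Some visible text.</p>
    <xcl:implies>
      <xcl:atom>P</xcl:atom>
      <xcl:atom>Q</xcl:atom>
    </xcl:implies>
  </body>
</html>'''

root = ET.fromstring(doc)

# A processor selects the XCL elements by namespace and ignores the rest.
atoms = [e.text for e in root.iter(f"{{{XCL_NS}}}atom")]
paragraphs = [e.text for e in root.iter(f"{{{XHTML_NS}}}p")]

print(atoms)
print(paragraphs)
```

With this design no SCL-specific escaping is needed, since the logical content is ordinary XML markup rather than text hidden in an attribute value.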
So, in a nutshell, is there an SCL design goal to put it into XML,
or will there be an XML markup language for SCL that would be used
instead?
Murray
PS. sorry if I'm rambling tonight. I normally don't respond to
email after I've been drinking...
......................................................................
Murray Altheim http://kmi.open.ac.uk/people/murray/
Knowledge Media Institute
The Open University, Milton Keynes, Bucks, MK7 6AA, UK .
"At the Fresno event, even some of the handpicked guests expressed
skepticism about the state selling $15 billion in bonds to balance
the budget. A few said the state could look harder for more cuts
to the government bureaucracy -- but nevertheless said they would
defer to Schwarzenegger's judgment for now."
http://www.sfgate.com/cgi-bin/article.cgi?file=/c/a/2004/01/21/MNG7L4E7IT1.DTL
Defer to Arnold's judgment?!