Invisible XML
Copyright © Steven Pemberton 2013, all rights reserved.
What if you could see everything as XML? XML has many strengths for data exchange, strengths both inherent in the nature of XML markup and strengths that derive from the ubiquity of tools that can process XML. For authoring, however, other forms are preferred: no one writes CSS or Javascript in XML. It does not follow, however, that there is no value in representing such information in XML. Invisible XML is a method for treating non-XML documents as if they were XML, enabling authors to write in a format they prefer while providing XML for processes that are more effective with XML content. There is really no reason why XML cannot be more ubiquitous than it is.
How to cite this paper
Pemberton, Steven. “Invisible XML.” Presented at Balisage: The Markup Conference 2013, Montréal, Canada, August 6 - 9, 2013. In Proceedings of Balisage: The Markup Conference 2013. Balisage Series on Markup Technologies, vol. 10 (2013). doi:10.4242/BalisageVol10.Pemberton01.
XML is a popular format. It is widely and successfully used for document and data storage, exchange and presentation. A major advantage of using XML is the toolchain and pipeline available for generic XML processing. You can easily use new formats within the generic framework.
However, for authoring purposes XML is seldom preferred over a notation more directly suited to the purpose. Few would prefer to write their CSS rules as
<rule><simple-selector name="body"/><block><property name="color" value="blue"/></block></rule>
to the more direct
body {color: blue}
and even fewer would prefer to write
<statement><if><condition><comparison name="&lt;"><var name="max"/><var name="a"/></comparison></condition><then><statement><assign><var name="max"/><expression><var name="a"/></expression></assign></statement></then></if></statement>
to the much more direct
if (max<a) then max=a;
And, of course it should be noted that even RELAX NG has both an XML syntax and a 'compact' syntax [RELAX NG] [RELAX NG COMPACT].
In fact if we are to be brutally honest, even XML formats take short cuts for authoring ease. Take for instance an <a> element in XHTML:
<a href="http://www.w3.org/TR/1999/xhtml">XHTML</a>
This does not surface the real structure of the underlying data. If we were to be completely faithful to the principle of making all relevant structure explicit, we should really write something along the lines of
<a><href><method type="href"/><domain name="org"/><site name="w3"/><sub name="www"/><path><root><sub name="TR"><sub name="1999"><sub name="xhtml"></sub></sub></sub></root></path></href><text>XHTML</text></a>
You might argue about the details here, but this example is only to show that there are parts of XML documents that could be further structured, but that we choose not to, possibly for authoring ease, possibly for fear of being laughed out of town.
The reasons for this are obvious: despite the disadvantages of not being able to use the generic toolchain any more, or only to a lesser degree, the increased readability of the source, and its closer relation to the problem domain, make authoring so much easier.
Part of the advantage of XML is that a single parser suffices to deal with any kind of document. This can be contrasted with, for instance, the situation for HTML, where you need a parser for the HTML, with separate parsers for CSS and Javascript at least (and URLs), creating extra complexity and brittleness.
But looked at through a suitable pair of glasses, what is XML apart from a description of a parse tree for some format (with some special treatment for text nodes)? And frankly, what is so difficult about general-purpose parsing? It is a widely understood and easily solved problem. Is it not possible to combine the best of both worlds, and have authorable formats that can still use the XML tool chain? Couldn't XML become the underlying format for everything?
The approach presented here is to add one more step to the XML processing chain, an initial one. This step takes any textual document, and (a reference to) a suitable syntax description, parses the document using the syntax description, and produces as output a parse tree that can be treated as an XML document with no further parsing necessary (or alternatively, the document can be serialised out to XML).
In other words, the input document might be
body {color: blue}
but the result of the parse will be the same as if an XML parser had been presented with the XML document
<css>
  <rule><simple-selector name="body"/>
    <block><property name="color" value="blue"/></block>
  </rule>
</css>
We call this method Invisible XML, since the document is treated as XML, but it is not visibly an XML document.
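To make the idea concrete, here is a minimal Python sketch of such an initial step for the CSS example. It is hand-written for this one format rather than driven by a syntax description, and the names `tokenize` and `parse_css` are our own, so it should be read as an illustration of the intended result only.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: names and values are runs of letters, digits and hyphens,
# and the only punctuation is { } : ;
TOKEN = re.compile(r"\s*([A-Za-z0-9-]+|[{}:;])")

def tokenize(text):
    """Split simplified CSS into a flat list of tokens."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def parse_css(text):
    """Parse simplified CSS into an element tree shaped like the XML above."""
    tokens = tokenize(text)
    pos = 0

    def take(expected=None):
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        tok = tokens[pos]
        if expected is not None and tok != expected:
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        pos += 1
        return tok

    css = ET.Element("css")
    while pos < len(tokens):
        rule = ET.SubElement(css, "rule")
        ET.SubElement(rule, "simple-selector", name=take())
        take("{")
        block = ET.SubElement(rule, "block")
        while tokens[pos:pos + 1] != ["}"]:
            name = take()
            take(":")
            value = take()
            ET.SubElement(block, "property", name=name, value=value)
            if tokens[pos:pos + 1] == [";"]:
                take(";")
        take("}")
    return css

tree = parse_css("body {color: blue}")
print(ET.tostring(tree, encoding="unicode"))
```

The rest of the chain can then work on `tree` exactly as if it had come from an XML parser.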
The requirement is to find a suitable way to describe the syntax of the input document so that the resultant parse-tree is of the form suitable for use in our XML chain. If we were to use BNF [BNF], arguably the most well-known syntax-description format, it might look like this (in what follows "..." is used for parts of the definition that have been elided and will be defined later):
<css> ::= <rules>
<rules> ::= <rule> | <rules> <rule>
<rule> ::= <selector> <block>
<block> ::= "{" <properties> "}"
<properties> ::= <property> | <property> ";" <properties>
<property> ::= <name> ":" <value> | <empty>
<selector> ::= <name>
etc, etc. But it is quickly apparent that this has some shortcomings. Firstly, there is a surface problem: since we are using this for XML, we could quickly go crazy with the use of angle brackets for two different purposes. Although there is a certain charm to defining the <css> element with a syntax rule whose name is <css>, let us rather use a different format. Therefore we shall use a variant of VWG format [VWG]. This looks like:
css: rules.
rules: rule; rules, rule.
rule: selector, block.
block: "{", properties, "}".
properties: property; property, ";", properties.
property: name, ":", value; empty.
selector: name.
name: ...
value: ...
empty: .
(We shall restrict ourselves to a simplified CSS grammar for the sake of this article).
Note that ";" signifies alternatives, and as is normal in syntax definitions, if one alternative is empty (or reduces to empty), the rule is optional.
If we parse the snippet of CSS above with this, and then represent the resulting parse tree in an XML style (so that each nonterminal is represented as an XML element), a second problem becomes apparent:
<css>
  <rules>
    <rule>
      <selector>body</selector>
      <block>
        <properties>
          <property>
            <name>color</name>
            <value>blue</value>
          </property>
        </properties>
      </block>
    </rule>
  </rules>
</css>
namely that there are certain elements in the tree (rules, properties) that we really aren't interested in. (You'll notice that some terminal symbols such as the brackets, colons and semicolons don't appear in the parse tree. This will be discussed later).
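The step from parse tree to XML is mechanical. As a rough Python sketch, assume a parse tree is a pair of a nonterminal name and a list of children, where each child is either a copied string or another such pair (this encoding and the name `build` are ours, purely for illustration):

```python
import xml.etree.ElementTree as ET

def build(node):
    """Turn a (name, children) parse-tree node into an XML element."""
    name, children = node
    element = ET.Element(name)
    for child in children:
        if isinstance(child, str):
            # Text goes after the last child element, or at the start if none.
            if len(element):
                element[-1].tail = (element[-1].tail or "") + child
            else:
                element.text = (element.text or "") + child
        else:
            element.append(build(child))
    return element

tree = ("property", [("name", ["color"]), ("value", ["blue"])])
print(ET.tostring(build(tree), encoding="unicode"))
```

Every nonterminal becomes an element, which is exactly why uninteresting nonterminals such as rules and properties end up in the output.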
The problem becomes even more apparent with a CSS snippet like
body {color: blue; font-weight: bold}
since the content of the <block> element then becomes even more unwieldy:
<properties>
  <property>
    <name>color</name>
    <value>blue</value>
  </property>
  <properties>
    <property>
      <name>font-weight</name>
      <value>bold</value>
    </property>
  </properties>
</properties>
where we would prefer to see the much more direct
<property>
  <name>color</name>
  <value>blue</value>
</property>
<property>
  <name>font-weight</name>
  <value>bold</value>
</property>
The problem arises in this case because the syntax description method relies on recursion to deal with repetition. To that end, we shall introduce a specific notation for repetition. Zero or more repetitions:
(rule)*
and one or more repetitions:
(rule)+
In fact we shall extend these two postfix operators to also act as infix operators, to handle a commonly occurring case:
(property)*";"
(property)+";"
which respectively mean "zero or more, separated by semicolon" and "one or more, separated by semicolon" (there is no reason to restrict the separator to a terminal as here; it may also be a nonterminal).
Now we can specify our syntax as:
css: (rule)*.
rule: selector, block.
block: "{", (property)*";", "}".
property: name, ":", value; .
name: ...
value: ...
and the parsetree will now look like this:
<css>
  <rule>
    <selector>body</selector>
    <block>
      <property>
        <name>color</name>
        <value>blue</value>
      </property>
      <property>
        <name>font-weight</name>
        <value>bold</value>
      </property>
    </block>
  </rule>
</css>
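The effect of the repetition operators can be sketched in Python as a small combinator that collects its matches in a flat list instead of nesting them. The function names and the convention that a parser takes `(tokens, pos)` and returns `(result, new_pos)` or `None` are assumptions made for this sketch; for simplicity it also ignores the empty alternative of property.

```python
def token(expected):
    """Parser for one literal token."""
    def parse(tokens, pos):
        if pos < len(tokens) and tokens[pos] == expected:
            return expected, pos + 1
        return None
    return parse

def prop(tokens, pos):
    """property: name, ":", value."""
    if pos + 2 < len(tokens) and tokens[pos + 1] == ":":
        return ("property", tokens[pos], tokens[pos + 2]), pos + 3
    return None

def repeat(item, separator, minimum):
    """(item)*separator when minimum is 0, (item)+separator when it is 1."""
    def parse(tokens, pos):
        results = []
        matched = item(tokens, pos)
        while matched is not None:
            result, pos = matched
            results.append(result)
            sep = separator(tokens, pos)
            if sep is None:
                break
            # A trailing separator is left unconsumed.
            matched = item(tokens, sep[1])
        if len(results) < minimum:
            return None
        return results, pos
    return parse

properties = repeat(prop, token(";"), 0)
tokens = ["color", ":", "blue", ";", "font-weight", ":", "bold"]
print(properties(tokens, 0))
```

Because the results are gathered in one list, the two properties come out as siblings, matching the tree above.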
However, there is another reason why we might not want a syntax rule name to appear in the parse tree, and that is when we use a syntax rule as a refinement, that is to say, when the syntax rule doesn't represent anything of semantic importance, but has been defined so that we can use it in several places without having to repeat it. For instance, suppose we wanted to define a series of properties in a separate rule:
properties: (property)*";".
and use it:
block: "{", properties, "}".
but not want <properties> to appear in the final parse tree. What we define is that the use of any rule name preceded by a minus sign is only being used for refinement. So that would give us:
properties: (property)*";".
block: "{", -properties, "}".
and this would result in the same parse-tree as above. Note that this still allows a rule to be used in other places and appear in the parse tree if needed.
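In tree terms, a refinement simply means that the children of the refined node are spliced into its parent. A Python sketch of that splice follows; note the simplification that it treats every occurrence of the named elements as a refinement, whereas the notation marks individual uses, and that the function name is ours.

```python
import xml.etree.ElementTree as ET

def splice_refinements(element, refinements):
    """Replace each child whose tag is in `refinements` by that child's children."""
    new_children = []
    for child in list(element):
        splice_refinements(child, refinements)
        if child.tag in refinements:
            new_children.extend(list(child))
        else:
            new_children.append(child)
    element[:] = new_children
    return element

block = ET.fromstring(
    "<block><properties><property/><property/></properties></block>")
splice_refinements(block, {"properties"})
print(ET.tostring(block, encoding="unicode"))
```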
Also note that for simplicity we have ignored treating spaces in the syntax description, but that is also an example of something you would not want to have in the parse tree:
colon: -spaces, ":", -spaces.
spaces: " "*.
Similarly, we can use it to make empty alternatives more explicit:
property: name, ":", value; -empty.
empty: .
As alluded to above, in general, terminal symbols do not appear in the parse-tree, since most of them are only there to delimit structural elements in the source file. If you want them to show up, you can add an explicit rule for them:
colon: ":".
which will cause them to show up in the tree like this:
<property>
  <name>color</name>
  <colon/>
  <value>blue</value>
</property>
However, there are places where terminals have semantic meaning, and you do want them to appear in the parse-tree, for instance in our example the names and values of the properties. To achieve this we mark terminals that are to be copied to the parse tree specially:
name: (+"a"; +"b"; ...etc...; +"9"; +"-")+.
In other words, normally terminals are discarded, but if they are preceded with a + they are copied to the parse-tree.
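The distinction between discarded and copied terminals can be illustrated with a very small Python sketch. Here a rule body is assumed to be a plain sequence of literals, each paired with a flag saying whether it carries a + mark; both the encoding and the function name are ours.

```python
def match_sequence(parts, text, pos=0):
    """Match literals in order; return (copied text, new position) or None.

    `parts` is a list of (literal, keep) pairs, where keep=True plays the
    role of the + mark.
    """
    copied = []
    for literal, keep in parts:
        if not text.startswith(literal, pos):
            return None
        if keep:
            copied.append(literal)
        pos += len(literal)
    return "".join(copied), pos

# The braces are matched but discarded; only the marked "a" is copied.
print(match_sequence([("{", False), ("a", True), ("}", False)], "{a}"))
```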
Strictly speaking, this would be enough to allow you to parse a document, and output it as an equivalent XML document. However, there are possible extensions that give you a little more control over the result. The most obvious is allowing the specification of attributes. This is simply done by marking the use of rules with at signs:
css: (rule)*.
rule: selector, block.
block: "{", (property)*";", "}".
property: @name, ":", value.
A rule used like this may clearly not contain any structural elements (though it may contain terminals and refinements), since attributes are not structured, but this is an easy condition to check for. The parsetree will now look like this:
<css>
  <rule>
    <selector>body</selector>
    <block>
      <property name="color">
        <value>blue</value>
      </property>
      <property name="font-weight">
        <value>bold</value>
      </property>
    </block>
  </rule>
</css>
If we changed the rule for property to look like this:
property: @name, ":", @value.
then the resultant parse-tree would look like
<css>
  <rule>
    <selector>body</selector>
    <block>
      <property name="color" value="blue"/>
      <property name="font-weight" value="bold"/>
    </block>
  </rule>
</css>
Note that by marking the use of a syntax rule in this way, and not the definition, it allows the syntax rule to be used for structural elements (<name>color</name>) as well as for attributes (name="color").
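As a tree operation, marking a use with an at sign amounts to folding a child element into an attribute of its parent, after checking that the child has no structure of its own. A Python sketch, with a function name of our own choosing:

```python
import xml.etree.ElementTree as ET

def to_attributes(element, names):
    """Turn the direct children named in `names` into attributes of `element`."""
    for child in list(element):
        if child.tag in names:
            if len(child):
                raise ValueError(
                    f"<{child.tag}> has child elements and cannot be an attribute")
            element.set(child.tag, child.text or "")
            element.remove(child)
    return element

prop = ET.fromstring(
    "<property><name>color</name><value>blue</value></property>")
to_attributes(prop, {"name"})
print(ET.tostring(prop, encoding="unicode"))
```

Calling it with `{"name", "value"}` instead gives the fully attribute-based form.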
Although it would be possible to require the syntax to be restricted to some class of language, such as LL(1) or LR(1) [LL1] in order to make the parser faster, in practice it is easier for the author of the syntax if we make no such restriction, since it would require the author to understand the principles, and it would require the system to check that the syntax adhered to the requirement. In practice a parsing algorithm such as Earley's [Earley] is fast enough, and will treat all context-free languages. The only remaining problem is if the syntax author describes an ambiguous language. To that end we just define that the parser outputs one of the parses, and leave it at that. For instance, if expression were defined as:
expr: i; expr, plus, expr.
i: "i".
plus: "+".
then a string such as
i+i+i
could be parsed as both
<expr><i/></expr>
<plus/>
<expr>
  <expr><i/></expr>
  <plus/>
  <expr><i/></expr>
</expr>
and as
<expr>
  <expr><i/></expr>
  <plus/>
  <expr><i/></expr>
</expr>
<plus/>
<expr><i/></expr>
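The ambiguity can be made tangible with a short Python sketch that enumerates every parse of a token list under this particular grammar. It is a brute-force enumeration written for this one rule, not a general parser, and the tuple encoding of trees is our own.

```python
def parses(tokens):
    """All parse trees for  expr: i; expr, plus, expr.  over a token list."""
    if tokens == ["i"]:
        return ["i"]
    trees = []
    for k, tok in enumerate(tokens):
        if tok == "+":
            # Try this plus as the top-level operator.
            for left in parses(tokens[:k]):
                for right in parses(tokens[k + 1:]):
                    trees.append((left, "+", right))
    return trees

for tree in parses(list("i+i+i")):
    print(tree)
```

A processor that "outputs one of the parses" would simply pick one element of this list; which one is left undefined here.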
To deliver a source document to be parsed by our system, we can use a media type [Media type] that supplies a reference to the required syntax description. For instance:
application/xml-invisible; syntax=http://example.com/syntax/css
Clearly a system can cache well-known syntax descriptions.
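A receiving system would have to pick the syntax reference out of the media type and look the description up, preferably in a cache. The following Python sketch shows one way this could be arranged; the splitting is deliberately naive (it does not handle quoted parameter values containing semicolons), and `SyntaxCache` and its injected `fetch` function are hypothetical names used only for this illustration.

```python
def parse_media_type(header):
    """Split 'type/subtype; key=value; ...' into (type, parameters)."""
    media_type, *params = [part.strip() for part in header.split(";")]
    parameters = {}
    for param in params:
        key, _, value = param.partition("=")
        parameters[key.strip().lower()] = value.strip().strip('"')
    return media_type.lower(), parameters

class SyntaxCache:
    """Remember syntax descriptions so each URL is fetched only once."""

    def __init__(self, fetch):
        self._fetch = fetch      # function from URL to syntax description
        self._cache = {}

    def get(self, url):
        if url not in self._cache:
            self._cache[url] = self._fetch(url)
        return self._cache[url]

media_type, params = parse_media_type(
    "application/xml-invisible; syntax=http://example.com/syntax/css")
print(media_type, params["syntax"])
```

Passing the fetch function in from outside keeps the sketch free of any network access and makes it easy to preload well-known descriptions.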
It should go without saying that the syntax descriptions themselves are in Invisible XML (though in their case the syntax description must be cached to prevent an infinite loop of processing.)
The definition might look like this:
ixml: (rule)+.
rule: @name, -colon, -definition, -stop.
definition: (alternative)*-semicolon.
alternative: (-term)*-comma.
term: -symbol; -repetition.
repetition: one-or-more; zero-or-more.
one-or-more: -open, -definition, -close, -plus, separator.
zero-or-more: -open, -definition, -close, -star, separator.
separator: -symbol; -empty.
empty: .
symbol: -terminal; nonterminal; refinement; attribute.
terminal: explicit-terminal; implicit-terminal.
explicit-terminal: -plus, @string.
implicit-terminal: @string.
nonterminal: @name.
refinement: -minus, @name.
attribute: -at, @name.

string: -openquote, (-character)*, -closequote.
name: (-letter)+.
letter: +"a"; +"b"; ...
character: ...

colon: -S, ":", -S.
stop: -S, ".", -S.
semicolon: -S, ";", -S.
comma: -S, ",", -S.
plus: -S, "+", -S.
minus: -S, "-", -S.
star: -S, "*", -S.
open: -S, "(", -S.
close: -S, ")", -S.
at: -S, "@", -S.
openquote: -S, """".
closequote: """", -S.
S: " "*.
This would then parse to the XML form:
<ixml>
  <rule name="ixml">
    <alternative>
      <one-or-more>
        <alternative>
          <nonterminal name="rule"/>
        </alternative><separator/>
      </one-or-more>
    </alternative>
  </rule>
  <rule name="rule">
    <alternative>
      <attribute name="name"/>
      <refinement name="definition"/>
    </alternative>
  </rule>
  <rule name="definition">
    <alternative>
      <zero-or-more>
        <alternative>
          <nonterminal name="alternative"/>
        </alternative>
        <separator><refinement name="semicolon"/></separator>
      </zero-or-more>
    </alternative>
  </rule>
  ... etc ...
  <rule name="separator">
    <alternative><refinement name="symbol"/></alternative>
    <alternative><refinement name="empty"/></alternative>
  </rule>
  ... etc ...
</ixml>
Thanks to Earley's parsing algorithm, we can remove the <alternative> elements when there is only one alternative in a rule, by redefining definition:
definition: -alternative; alternative, -semicolon, (alternative)+-semicolon.
Note how we have used the "-" character to prevent it being copied in the first case (when there is only one). You wouldn't be able to use such a rule as this if there were a requirement on the syntax to be LL(1) or LR(1), since the two parts of the rule start with the same symbols.
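To show why alternatives with a common prefix are unproblematic for an Earley-style parser, here is a minimal recogniser sketched in Python. It only answers yes or no and builds no parse tree; the grammar encoding (a dictionary of tuples), the assumption that no terminal has the same spelling as a nonterminal, and the single token "t" standing in for a whole alternative are all simplifications made for this example.

```python
def earley_recognise(grammar, start, tokens):
    """Return True if `tokens` can be derived from `start`.

    `grammar` maps each nonterminal to a list of alternatives, each a tuple
    of symbols. A symbol that is a key of `grammar` is a nonterminal; any
    other symbol is a terminal matched against one token.
    """
    # An item (head, body, dot, origin) means: body[:dot] has been matched,
    # starting at input position `origin`.
    chart = [set() for _ in range(len(tokens) + 1)]
    chart[0] = {(start, body, 0, 0) for body in grammar[start]}
    for i in range(len(tokens) + 1):
        worklist = list(chart[i])
        while worklist:
            head, body, dot, origin = worklist.pop()
            new_items = []
            if dot == len(body):
                # Complete: advance every item that was waiting for `head`.
                new_items = [(h, b, d + 1, o) for (h, b, d, o) in chart[origin]
                             if d < len(b) and b[d] == head]
            elif body[dot] in grammar:
                # Predict: start every alternative of the expected nonterminal.
                symbol = body[dot]
                new_items = [(symbol, alt, 0, i) for alt in grammar[symbol]]
                # If it has already matched the empty string here, step over it.
                if any(h == symbol and d == len(b) and o == i
                       for (h, b, d, o) in chart[i]):
                    new_items.append((head, body, dot + 1, origin))
            elif i < len(tokens) and tokens[i] == body[dot]:
                # Scan: the expected terminal is the next token.
                chart[i + 1].add((head, body, dot + 1, origin))
            for item in new_items:
                if item not in chart[i]:
                    chart[i].add(item)
                    worklist.append(item)
    return any(h == start and d == len(b) and o == 0
               for (h, b, d, o) in chart[len(tokens)])

# Both alternatives of "definition" begin with "alternative".
grammar = {
    "definition": [("alternative",), ("alternative", ";", "alternatives")],
    "alternatives": [("alternative",), ("alternative", ";", "alternatives")],
    "alternative": [("t",)],
}
print(earley_recognise(grammar, "definition", ["t", ";", "t"]))
```

Both alternatives are pursued in parallel in the chart, so no lookahead is needed to choose between them.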
Similarly, we can get rid of empty <separator/> elements thus:
one-or-more: -open, -definition, -close, -plus; -open, -definition, -close, -plus, separator.
zero-or-more: -open, -definition, -close, -star; -open, -definition, -close, -star, separator.
separator: -symbol.
We can move the value of the separator into an attribute with:
separator: @explicit; @implicit; @nonterminal; @refinement.
explicit: -plus, -string.
implicit: -string.
This would then generate:
<ixml>
  <rule name="ixml">
    <one-or-more>
      <nonterminal name="rule"/>
    </one-or-more>
  </rule>
  <rule name="rule">
    <attribute name="name"/>
    <refinement name="definition"/>
  </rule>
  <rule name="definition">
    <alternative>
      <refinement name="alternative"/>
    </alternative>
    <alternative>
      <nonterminal name="alternative"/>
      <one-or-more>
        <nonterminal name="alternative"/>
        <separator refinement="semicolon"/>
      </one-or-more>
    </alternative>
  </rule>
  ... etc ...
  <rule name="separator">
    <alternative><refinement name="symbol"/></alternative>
    <alternative><refinement name="empty"/></alternative>
  </rule>
  ... etc ...
</ixml>
(An observant reader will have spotted that we have allowed attributes to be defined by attributes here -- for instance with @refinement -- that is we treat an attribute within an attribute definition as if it were a refinement).
As yet another possibility, we can move the separator into an attribute of the one-or-more or zero-or-more elements:
one-or-more: -open, -definition, -close, -plus; -open, -definition, -close, -plus, -separator.
zero-or-more: -open, -definition, -close, -star; -open, -definition, -close, -star, -separator.
separator: @explicit; @implicit; @nonterminal; @refinement.
explicit: -plus, -string.
implicit: -string.
Although the syntax description so defined was developed iteratively based on the needs of the user, and is sufficient for its purpose, it is clear in the above example that refinements occur far more frequently than true semantic rules. An alternative worth exploring would be to say that nothing is copied to the syntax tree unless specifically marked. Let us use the "^" character to mark items that are copied to the tree. The result is clearly much more restful on the eyes:
ixml: (^rule)+.
rule: @name, colon, definition, stop.
definition: alternative; ^alternative, semicolon, (^alternative)+semicolon.
alternative: (term)*comma.
term: symbol; repetition.
repetition: ^one-or-more; ^zero-or-more.
one-or-more: open, definition, close, plus; open, definition, close, plus, ^separator.
zero-or-more: open, definition, close, star; open, definition, close, star, ^separator.
separator: terminal; @nonterminal; @refinement.
symbol: terminal; ^nonterminal; ^refinement; ^attribute.
terminal: ^explicit-terminal; ^implicit-terminal.
explicit-terminal: up, @string.
implicit-terminal: @string.
nonterminal: up, @name.
refinement: @name.
attribute: at, @name.

string: openquote, (character)*, closequote.
name: (letter)+.
letter: ^"a"; ^"b"; ...
character: ...

colon: S, ":", S.
stop: S, ".", S.
semicolon: S, ";", S.
comma: S, ",", S.
plus: S, "+", S.
up: S, "^", S.
star: S, "*", S.
open: S, "(", S.
close: S, ")", S.
at: S, "@", S.
openquote: S, """".
closequote: """", S.
S: " "*.
There are obvious extra odds and ends that need adding, such as sets of characters, to make terminal specification easier, for instance:
letter: ^["a"-"z", "A"-"Z", "-"].
S: [" ", "\t", "\n", ...]*.
but these are just details.
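One possible reading of such a character set, sketched in Python: each member is either a single character or an inclusive range, and the set is the union of them all. The function name and the use of tuples for ranges are assumptions of this sketch.

```python
def char_set(*members):
    """Build a set of characters from single characters and (low, high) ranges."""
    chars = set()
    for member in members:
        if isinstance(member, tuple):
            low, high = member
            chars.update(chr(code) for code in range(ord(low), ord(high) + 1))
        else:
            chars.add(member)
    return frozenset(chars)

# letter: ^["a"-"z", "A"-"Z", "-"].
letter = char_set(("a", "z"), ("A", "Z"), "-")
print("q" in letter, "3" in letter)
```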
It should be noted in passing that in the form presented here, Invisible XML only works in one direction: you can turn any textual document into an equivalent XML document. However, it is not in general possible to turn a textual document into a particular XML form without more work. For instance, you could turn Wiki markup into an XML document, but not into XHTML in particular.
Returning the resultant XML document to its original format is just a process of presentation, nothing that a suitable bit of XSLT couldn't do, or even CSS in some simple cases. In fact it should be apparent that from the Invisible XML syntax, it would be straightforward to automatically generate the required piece of XSLT directly.
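For the CSS example the return trip is indeed only a few lines. The sketch below uses Python rather than XSLT, purely for illustration, and assumes the attribute-based tree in which each property element carries name and value attributes; the function name is ours.

```python
import xml.etree.ElementTree as ET

def to_css(css):
    """Serialise the attribute-based <css> tree back to CSS text."""
    lines = []
    for rule in css.iter("rule"):
        selector = rule.findtext("selector")
        props = [f"{p.get('name')}: {p.get('value')}"
                 for p in rule.iter("property")]
        lines.append(selector + " {" + "; ".join(props) + "}")
    return "\n".join(lines)

doc = ET.fromstring(
    '<css><rule><selector>body</selector><block>'
    '<property name="color" value="blue"/>'
    '<property name="font-weight" value="bold"/>'
    '</block></rule></css>')
print(to_css(doc))
```

The original spacing and any comments are of course not recovered, since they were never copied to the tree.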
There is really no reason why XML can't be more ubiquitous than it is, and similarly there is no reason why XML documents have to be written in an explicit XML format per se. Anything that can be parsed can be perceived as XML, since parsing is very easy, and parse-trees are really just XML documents in different clothing. Invisible XML allows a multitude of document formats to be authored in their traditional form, but be processed as XML, with the concomitant advantages of the XML toolchain.
[RELAX NG] James Clark, Makoto MURATA (eds.), 2001, RELAX NG Specification, https://www.oasis-open.org/committees/relax-ng/spec.html
[RELAX NG COMPACT] James Clark (ed.), 2002, RELAX NG Compact Syntax, https://www.oasis-open.org/committees/relax-ng/compact-20021121.html
[BNF] Backus-Naur Form, http://en.wikipedia.org/wiki/Backus-Naur_Form
[VWG] S. Pemberton, 1982, "Executable Semantic Definition of Programming Languages Using Two-level Grammars", http://www.cwi.nl/~steven/vw.html
[LL1] Alfred Aho and Jeffrey D. Ullman, 1977, "Principles of Compiler Design", Addison-Wesley, ISBN 0-201-00022-9.
[Earley] Earley, Jay (1970), "An efficient context-free parsing algorithm", Communications of the ACM 13 (2): 94-102, doi:10.1145/362007.362035
[Media type] N. Freed et al., 1996, "Multipurpose Internet Mail Extensions, (MIME) Part Two: Media Types", http://www.ietf.org/rfc/rfc2046.txt