My document object model can do more than yours

Extending the DOM for data manipulation

Alain Couthures

This work is made available under a Creative Commons Attribution 3.0 License (http://creativecommons.org/licenses/by/3.0/).

expand Abstract

expand Alain Couthures

Balisage logo

Proceedings

expand How to cite this paper

My document object model can do more than yours

Extending the DOM for data manipulation

Balisage: The Markup Conference 2013
August 6 - 9, 2013

Introduction

XML is well-known for two major uses: documents and data exchange. Actually, any XML "document" can also be considered as a small consistent database. As an example, an invoice requires many tables to be described while one XML document is enough for this.

For small amount of data, an XML "document" is usually parsed in memory and the DOM API is a common library to manipulate its contents within browsers or in Microsoft environments (MSXML, .Net) or in PHP. DOM Level 1 was quite limited and was designed for HTML. DOM became namespace-aware with DOM2. The latest version of DOM is Level 3 and it has be published in 2004. There is a quite recent working draft for DOM4 (6 December 2012).

There are many critics about the DOM API, probably because it is clearly a low-level interface and because complexifying it for full XML support was out of interest for HTML-only fans. The DOM structure is also not fully appropriate for building an XPath engine (presence of CDATA and entities nodes and lack of namespaces nodes) as defined in XDM.

As a matter of fact, DOM3 has not been fully implemented in browsers and DOM4 might loose vital functionnalities for building an XPath engine with it, such as attributes not been nodes anymore.

XForms 1.0, and later XForms 1.1, has been designed for editing XML instances with a browser when embedded in an HTML page. XForms is based on XPath and any XForms implementation requires extra features about nodes such as properties ("validity", "relevant", "read-only", ...) which cannot be found in native DOM implementations in browsers. There are also extra XPath functions, such as "instance()", "index()", "event()", defined in XForms specifications while XPath engines in browser cannot be extended. That is why XSLTForms, a client-side XForms implementation, has its own XPath engine written in Javascript and that is why its ancestor project, AJAXForms, had even also its own DOM implementation. It was chosen to use native XML storage in XSLTForms for performance and for eprouved compliance when serializing XML instances with multiple namespaces. Today, this has to be reconsidered.

XForms 2.0 is not limited to XML editing. At least, CSV and JSON are also supported. The question for an XForms implementer is how to integrate these notations at low level for keeping XPath use at authors' level.

CSV format might seem to be an old format but it is still used in many import/export functions of applications. For relational databases, a CSV file is a natural table content.

Mapping JSON in XML and preserving XPath expressions readability is not easy. Many attempts have already been done but there is not yet an agreement about one in particular for each different situation.

This paper is describing that, with few extensions, CSV and JSON can be loaded in an extended DOM structure so an XPath engine can manipulate them immediately.

There are yet more challenging notations or file formats which can be of interest for authors. Typically, applications with a light server side or offline applications want to manipulate files, not just data in exchange notation. Many of them are binary formats and they can now be manipulated within browsers with Javascript. The most common ones are text-processing documents or spreadsheets in a ZIP package.

Possibilities are numerous. For XQuery implementors and developers, XQuery instructions might also be parsed into a DOM structure as if there was an XQueryX source. Programming languages are defined according to a grammar and a grammar is similar to an XML Schema for text sources. For Apache administrators, httpd.conf file and log files have a format which can be loaded into a DOM structure. There are also emerging notations which are simpler or richer than XML and they can surely be parsed and stored in an extended DOM structure.

A bigger challenge is to use a DOM structure not only for a tree or a forest of trees (such as a document with multiple document elements) but for graphs. A navigational approach can be obtained with named axes: the data designer can specify different sets of children for a node, each one being assigned a name.

A non-planar structure is even possible within a DOM structure, the difficulties being about internal ids to be used when parsing and serializing. But this is already done by developers with workarounds and a way to standardize this would be appreciated.

Node names

Most data engines (such as relational databases, CSV, JSON,...) don't have restrictions for names: any character is possible and an enclosing delimiter is used in the corresponding query languages to ensure there will be no mismatch. For example, MySQL uses the back-quote character, MS-Access uses square brackets. Names without restrictions are clearly more user-friendly. XML 1.1 extended possibilities for names but was not implemented much...

For the DOM, a name is stored within a string: there is no restriction for names due to the DOM structure.

Quote, apostrophe, brackets, angular brackets, parenthesis are already used in XML and XPath. So, the back-quote character is a good candidate for this purpose:

`+` is now a valid node name in XPath

Encoding within a name is still required in XML and XPath for special characters (new line, quotes,...). Those characters can easily be escaped as entities.

`&` stands for an ampersand character
`
` stands for a newline character

Multiple document type nodes and multiple document elements

Multiple document type nodes can be very useful within a ZIP package because each ZIP component might have its own document properties (media type, encoding, ...).

Multiple document elements is a pertinent feature for many formats for which serializers don't guarantee that the end of the contents has be reached. This is even an advantage to allow data to be added at the end without breaking a well-formed rule. Is it still interesting to always double check that all data has been received?

Those two extensions do not require the DOM structure to be modified. They impact the parser and the serializer. Methods of the DOM API, such as appendChild and so on, have to accept them.

For JSON:

{a: "Hello", b: "World"}

 becomes

document( 
  element("a",
    text("Hello")
  ),
  element("b",
    text("World")
  )
)

Sequence Node Type

Manipulation of an ordered set of data is possible in many formats. For example, JSON allows "arrays" which can be embedded.

There is a known trick when converting a named JSON array into XML: just iterate as many elements with the array name. Luckily, even the bracket notation is similar in XPath because of the shortcut for "[position() = n]". But there are also anonymous arrays in JSON, empty arrays and arrays with a unique item. So, this trick is not enough and extra meta data becomes mandatory.

Support of a new node type is enough to deal with ordered sets. This could be named "Sequence Node Type" because of sequences defined in XPath, even if XPath does not allow embedded sequences.

A sequence node can be seen as an element without a name and without attributes. Instead of defining such a new node type, it could also be considered to allow elements without a name but this would interfere within XPath expressions with a selector such as '*'.

It could also be compared to a document fragment being stored with the structure instead of just being a temporary embedding node. It would then be necessary to add a parameter to each method accepting a document fragment to specify whether the document fragment node should be dropped or not.

Since a sequence node type is corresponding to a specific need, it is better to define a new node type.

["a", "b"] becomes
document( sequence( text("a"), text("b") ) )

It is more problematic to choose a notation in XPath for such nodes:

  • "sequence()" such as in "text()"?

  • "#" or "(-)" as a shortcut?

"#/text()[1]" returns "a" for ["a", "b"]

Typed-value Node Type

In DOM3, just elements and attributes can have a data type as property and nodes with effective values are only text nodes.

This is not enough for JSON because stand-alone values are possible. The following JSON expressions are valid:

[1, true]

null

2.154E-7

["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]

Every value in JSON has a data type and JSON allows numbers, strings, booleans and null. From the Javascript point of view, because JSON data is directly evaluated, JSON can even contain expressions: for example,date object creations are possible.

In CSV, no type is specified and, commonly, a spreadsheet processor will try to associate the most fitting type for each value. Import tools help users to manually force a type for a column.

But, because the number of header lines is not limited to just one line, it is also a possible trick to associate a type to each CSV column within the header (when column names are unknown, an empty line can be used). For example:

xs:anyURI,xs:gYear,xs:positiveInteger
http://en.wikipedia.org/wiki/Afghanistan,1960,9616353
http://en.wikipedia.org/wiki/Afghanistan,1961,9799379
http://en.wikipedia.org/wiki/Afghanistan,1962,9989846
http://en.wikipedia.org/wiki/Afghanistan,1963,10188299

As a consequence, for a DOM structure to be used to store JSON or CSV data, the text node type is not enough and a new node type has to be defined for a typed value.

42 becomes
value("xsd:decimal", 42)
{ a: 42, b: "Hello"} becomes
document( element("a", value("xsd:decimal", 42) ), element("b", value("xsd:string", "Hello") ) )

This has not to be limited to XSD types and really binary types (not xsd:hexBinary nor xsd:base64Binary) are required in different situations such as ZIP packages with embedded images like in, for example, OpenXML files (.xlsx, .docx,...).

Named axes

Attributes can be considered as low-level elements but there are advantages at using them because they are clearly separated from children nodes. As a consequence, whether they are present or not, they do not interfere when calling XPath nodeset functions upon children. For XPath, attributes are accessed with a dedicated axis ("attribute::", with "@" as the well-known shorcut). So, in a sense, there are possibly two distinct sub-trees under an element.

<person firstname="Paul" lastname="Verdier" >
  <address>17 rue de Rivoli</address>
  <address>75001 Paris</address>
  <address>France</address>
</person>

could also be serialized as

<person>
  <attribute::firstname>Paul</attribute::firstname>
  <attribute::lastname>Verdier</attribute::lastname>
  <address>17 rue de Rivoli</address>
  <address>75001 Paris</address>
  <address>France</address>
</person>

Allowing developers to define named axes means that an element can have multiple children in a specific context which might also be seen as a relationship. The name for the axis is the name of the relationship. The resulting structure can be compared to the navigational CODASYL database model.

This point is not at all difficult to implement (a map of arrays for children instead of a unique array) and a corresponding syntax for XPath can easily be defined ("child(axis-name)::" for example).

For XML-like serialization, the list of axes could precede the name of an element child:

<product>
  <designer::person>Peter</designer::person>
  <user::person>Jack</user::person>
  <user::person>Emily</user::person>
</product>

When multiple parents are allowed, the structure looks like in the network CODASYL database model.

A use case for XML developers is embedding in the same document data and specific data types declarations (in the same approach, fonts can be embedded within a PDF file). This is possible with the data document element and the schema document element being at the top level.

Serialization is more clumsy with a network data model: a child element cannot be serialized under each parent element. A specific notation can be used to list links such as in multipart messages. This will be less human-friendly but XML documents are not often generated manually and serialization might be seen as just a way to exchange a database.

For example:

<person>
  <attribute::email>
    <attribute::xsi:type xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#email_type"/>
    paul.verdier@rivoli.fr
  </attribute::email>
</person>
<xs:simpleType xmlns:xs="http://www.w3.org/2001/XMLSchema" id="email_type">
  <xs:restriction base="xsd:string">
    <xs:pattern value="(\w[-._\w]*\w@\w[-._\w]*\w\.\w{2,3})"/>
  </xs:restriction>
</xs:simpleType>

Parsing and serializing

Currently, with the Javascript DOMParser class, only XML and, eventually HTML and SVG documents can be parsed in a browser and the corresponding mediatype is to be provided as a parameter for the parseFromString() method.

Extending support for other media types such as "text/csv" and "application/json" allows CSV and JSON being loaded in a DOM structure. XSLTForms now supports CSV/XML and JSON/XML conversions written in Javascript.

For other formats, a third parameter can be the grammar to be used to generate the corresponding tree.

Because data cannot always be stored as a string (binary data, for example), another method has also to be defined, such as parseFromUint32Array. This allows media types such as "application/zip" for ZIP packages to be supported. This has already been prototyped in XSLTForms with the zip_inflate and zip_deflate Javascript functions written by Masanao Izumo (http://www.onicos.com/staff/iz/amuse/javascript/expert/) and there is no performance issue with standard MS-Excel or MS-Word files.

When serializing, some links could be internal links. It could be possible to add id attributes when required but another mechanism should be provided, a sort of internal ids such as keys in XSLT, with its specific notation such as internal:key.

Of course, exceptions will be fired because not all DOM structures can be serialized in any format. There might be different workarounds to serialize a DOM structure in a different media type without data loss.

Conclusion

Trees are everywhere and they can even store links to represent graphs. XML notation is a convenient way to serialize not-so-limited trees but people can consider that it is too verbeous. The real point is the power in data description and data manipulation.

The DOM structure is already used by browsers as an XML Data Model. Extending the DOM for non-XML data manipulation is not really difficult and XPath is not heavily affected. The resulting structure can support plenty of existing data formats. Because it is a low level component, upper layers can immediately benefit of it without significant modifications.

XSLTForms is directly concerned by an extended DOM implementation written in Javascript. It is a good opportunity to easily prototype and test new features. This implementation is currently in development and it already has its own name: "Fleur".

References

[DOM Level 3 Core] W3C. DOM Level 3 Core W3C Recommendation 4 April 2004 [online]. http://www.w3.org/TR/DOM-Level-3-Core/core.html

[Linked CSV] Jeni Tennison, Open Data Institute. Linked CSV Unofficial Draft 08 March 2013 [online]. http://jenit.github.io/linked-csv/

[superset] Eric van der Vlist. Toward χίμαιραλ/superset 8 August 2012 [online]. http://eric.van-der-vlist.com/blog/2012/08/08/toward-%CF%87%CE%AF%CE%BC%CE%B1%CE%B9%CF%81%CE%B1%CE%BBsuperset/

[Embracing JSON] Eric van der Vlist. Embracing JSON? Of course, but how? 10 February 2013 [online]. http://archive.xmlprague.cz/2013/presentations/Embracing_JSON/presentation.html#/start

[Document Object Model - Attr] Mozilla Developer Network Document Object Model - Attr [online]. https://developer.mozilla.org/en/docs/DOM/Attr

[DOM4] W3C. DOM4 W3C Working Draft 6 December 2012 [online]. http://www.w3.org/TR/dom/

[CODASYL] Nick Scherbakov. Network (CODASYL) Data Model [online]. http://coronet.iicm.edu/is/scripts/lesson03.pdf