.notation). XML, on the other hand, supports additional datatypes, and is most commonly handled via SAX or DOM, both of which are low-level and meant to be cross-language. Typical developers want high-level access that feels
native in the language they are using. These shortcomings have little or nothing to do with XML, and can be remedied by a different API. Software that demonstrates this is presented and described. It uses Python's richer set of abstract datatypes (such as tuples and sets), and provides native Python-style syntax with richer semantics than JSON or Javascript.
DynaText, and has been deeply involved in standards development including XML, TEI, HyTime, HTML 4, XPath, XPointer, EAD, Open eBook, OSIS, NLM and others. He has served as Chief Scientist of Brown University's Scholarly Technology Group and Adjunct Associate Professor of Computer Science. He has written many papers and two books, and holds eleven patents. Most recently he has been working as a consultant in text analytics.
hierarchical. This paper is concerned first with the superficial problem of syntax, where JSON has achieved a reputation in some quarters for being easier to use than XML; and second with the subtler but ultimately more important problem of data modeling.
data tasks, it does arise all the time in dealing with documents. Trying to manage document-shaped information with arrays and hashes is (of course) possible, but exceedingly awkward. A much better solution is to implement the needed type, and make it as natural to deal with in programming as arrays and hashes are now.
Much too complicated.
Doesn't fit the way my tools work.
Programs just don't have to deal with that kind of thing very much.
Too much stuff in there.
You can build that all with stuff I've already got.
Clever, but not worth the effort. Or even,
If the experts who built Fortran didn't see it as necessary, why should I? These were also common responses to Unicode, to object-orientation, and to multi-threading. They are much the same arguments made now against XML. And although none of these features has become completely ubiquitous, few developers would oppose built-in support for them.
tables:
A table is similar to a one-dimensional array. However, instead of referencing an element with an integer, any data object can be used.(
T<'A'> = 5
. expected behavior. Then developers with better things to do will use them, and (if the Unicode folks do their implementations right) will notice (perhaps long after) that their systems work better. Meanwhile, their managers may notice that they can sell in new areas with minimal new development costs.
False
vs. #F
vs. 0
vs. nil
; 99
vs. 99
vs. "099"
; and a variety of challenges that arise from mathematical and real-world characteristics such as precision, spelling complexity and variability, etc.
$myArray[0]
vs. myArray(1)
vs. item 7 of myArray
. name or
key
hash table is a data structure very commonly used to implement such keyed collections (binary search trees are another); but the terms are commonly used interchangeably.
h = [ a=>1, b=>2 ]
vs. h = { 'a':1, 'b':2 }
vs. h["a"]=1, h["b"]=2
vs. (hash ('a 1) ('b 2))
, and so on).
[ 1, 2, 3 ]
is not an array; it beholes in a piece of compressed wood-pulp); or it might represent a vector in a left-handed non-orthogonal 3-space. It might be loaded into 12 bytes of RAM, intended to be interpreted as three 32-bit (signed?) integers; or Javascript commonly represents arrays as hashes, by converting the indices from integers to strings that it then hashes. Russell and others represent an integer
n
as a set nested n
-levels deep. And in the end, all of these are merely representations of an abstract notion of quantity that cannot be directly perceived. For present purposes we need not contend with all this ontological complexity; but the differences between these notions are central:
{ 'a':1, 'b':2, 'c':"\u2172" }
. loading, composed of constructs implemented by those who implement a given programming language. For example, a contiguous sequence of 12 bytes, preceded by 8 extra bytes of overhead such as the width (4) of the entries, the total length of the area, who owns it, etc.
myArray
, or myArray[1]
or myArray.1
for the second item.
JSON is a part of Javascript: a valid JSON file can be pasted in as the right-hand side of a Javascript assignment statement, and should
just work. The reverse is true for simple cases, though not in general. In my opinion, this is by far the largest factor in JSON's reputed ease of use:
mapping at all.
[ 1, myVar, "c",... ]
, so one cannot factor out heavily-reused pieces of data except by introducing conventions outside JSON's awareness. Thus JSON as a representation is slightly isJavascript.
eval(...)
), the programmer accesses any part of the (formerly JSON) data structure just as if it were the same data declared as native constants in Javascript.
obj.myProp
. Thus, there is no independent dictionary type.
Array
constructor, which makes objects that support the usual array/list operations (including sort
), and whose members can be accessed by a[2]
syntax. They can be initialized with syntax like a = [ 1, 2, 3 ]
.
obj["myProp"]
. This accesses the same property as obj.myProp.
But if you assign obj[99] = "hello"
, the integer 99 is converted to the string "99", and
obj
gains a property named "99" (see https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array). One could access it as
obj.99
except that Javascript syntax does not allow numbers as operands of the "." operator. Likewise, assigning
a["2"] = "bar"
affects the same member as a[2] = "bar"
.
a["foo"] = "bar"
. You can set and get such members. However, the value of a.length
does not change unless those strings; it is defined as the number (property name) of the last non-negative-number-named element, plus 1. Obviously that means that arrays start at 0. You can happily assign to other numeric indices (quoted or not): x[3.0] = 1; x[0.3e3] = 1; x[3.000000000000000001] = 1; x[3.1] = 0; x[-1] = 1;
. However, only the first three of those five become part of the array; the others quietly become properties instead (therefore still accessible as, for example,
x[3.1]
Array
constructor makes an object with extra properties and methods such as length
, sort
, etc.). Array methods work so long as you behave just so. For example, if you make an Array and assign to a["foo"]
or a[3.1]
, it does create a.foo
; but (for example) a.length
does not increase.
Lay out footnotes this way;
Retrieve all the distinct conference proceedings from the bibliography;
Check that all the verses are present in this translation of the Gospel of John;
Build a table of contents from the first 'title' element in each 'div' not nested more than 3 divs deep; and so on.
<span class="p">
looks strange).
data formats are perfectly adequate for documents, rarely show any examples of realistic documents. Rather, the examples virtually always lack hierarchy; even more frequently lack mixed content (an extraordinarily important omission); often have no relevant order; and usually include nothing notionally like attributes (such as properties of elements as wholes; the occasional exception is IDs).
the 3rd footnote in chapter 4,
the last word of each speech attributed to Medea,
all the images, and so on.
Document Object Model, is a standard, widely-used interface to XML structure. The term
Object Model can mean either the set of formal properties of some data object, or a collection of classes and APIs for accessing something. DOM involves mainly the second sense: it is essentially an API.
p element child of a given node; nor even the 3rd
indices, and one by string (usually)
keys. This glosses over many differences in theory, implementation, and use, but will suffice for the moment. Some common operations are shown here, for a plain Javascript/JSON array, and for the children of a DOM node (other aspects of XML and of DOM are discussed later).
Description | Javascript | Javascript DOM |
---|---|---|
Get first item | n[0] | n.firstChild |
Get second item | n[1] | n.childNodes[1] |
Items 1-3 | n.slice(0, 3) | Array.prototype.slice.call(n.childNodes, 0, 3) |
"eggs" attribute | n.eggs | n.getAttribute("eggs") |
two items equivalent | n1 == n2 | n1.isEqualNode(n2) |
two items identical | n1 === n2 | n1.isSameNode(n2) |
replace item 3 | n1[2] = n3 | n1.replaceChild(n2, n3) |
c = n.childNodes[3]
,
just say c = n[3]
; this could be done in current Javascript by making Node (or XMLNode or whatever) be a subclass of Array, with the child-pointers as the members, and the other data as properties. Although Javascript does not provide operator overloading, in languages that do, other simplifications such as
==
instead of isEqualNode
can also be provided.
Abstract in this case means that the datatypes are characterized by their storage and access behaviors (or topology, if you will), rather than by how they are implemented. Arrays are distinct from hashes because one indexes items by position, and the other by name (or more properly
key).
array data structure (a contiguous series of equal-sized memory blocks), but we have already seen that Javascript is an exception. Sparse arrays, such as those used in high-dimensionality problems in NLP and physics, are commonly implemented using linked lists or even hashes in order not to waste space on large numbers of empty members.
set, which has quite different semantics because it has neither position nor identifiers, only data items.
Priority queue (used to choose tasks in order of importance or urgency) is another abstract collection type, which introduces a new feature: its members are accessed by priority, which is very much like a position in an array; however, unlike with arrays, there can be any number of tasks with the same priority. Dictionaries and sets also fail to encompass priority queues.
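The point about duplicate priorities can be illustrated with a minimal Python sketch using the standard heapq module. The task names and priority numbers are invented for illustration, and the insertion counter used to break ties is a choice made in this sketch, not something the discussion above prescribes.

```python
import heapq
from itertools import count

# Hypothetical tasks: (priority, name). Lower numbers are more urgent.
tasks = [(2, "write tests"), (1, "fix crash"), (2, "update docs"), (1, "rotate keys")]

heap = []
arrival = count()  # tie-breaker so equal priorities come out in insertion order
for priority, name in tasks:
    heapq.heappush(heap, (priority, next(arrival), name))

# Drain the queue: members come out by priority, and several members
# may legitimately share the same priority (unlike array positions).
drained = []
while heap:
    priority, _, name = heapq.heappop(heap)
    drained.append((priority, name))

print(drained)
# [(1, 'fix crash'), (1, 'rotate keys'), (2, 'write tests'), (2, 'update docs')]
```

Two tasks share priority 1 and two share priority 2, which is exactly the "duplicate position" property that neither a plain array nor a dictionary models.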
bag or
multiset provided in many programming libraries is, similarly, a set but with duplicate entries.
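As a small illustration, Python's collections.Counter can serve as a bag. The element names below are invented for the example; using Counter here is one reasonable choice, not the only way to model a multiset.

```python
from collections import Counter

# A bag of (hypothetical) element names: order is not kept, duplicates are counted.
bag = Counter(["p", "div", "p", "note", "p"])

distinct = len(bag)          # number of distinct entries
total = sum(bag.values())    # number of entries counting duplicates

print(bag["p"], distinct, total)   # 3 3 5
```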
homogeneous ones (in which all entries' data items must be of the same datatype), and
heterogeneous ones (in which items can be of mixed types). For example, Python has lists, which are heterogeneous, and bytearrays, which are homogeneous. Collection types rarely place any restriction on entries containing equivalent (or identical)
Name | Position | Named | DupPos | DupName | Immutable form |
---|---|---|---|---|---|
set | 0 | 0 | – | 0 | frozenset |
(multiset or bag) | 0 | 0 | – | 1 (the DupName feature can be slightly complicated to account for this if desired) | |
dict, defaultdict | 0 | 1 | – | 0 | |
Counter | 0 | 1 | – | 1 | |
array/list, bytearray | 1 | 0 | 0 | – | tuple, string, u string, buffer |
priority queue | 1 | 0 | 1 | – | |
OrderedDict | 1 | 1 | 0 | 0 | namedtuple |
? | 1 | 1 | 0 | 1 |
OrderedDict has much in common with XML child sequences. It is a variation on dictionaries that also remembers in what order members were added. Members can be added with the usual
od["aardvark"] = "Tubulidentata"
syntax, and accessed by the same name. It remembers the order items were added, and can iterate in that order with for k,v in od.items()
items.
NamedArray, and it is described in the next section.
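Returning briefly to OrderedDict, the following short sketch shows both the behavior described above and the reason it falls short for XML child sequences. The animal entries beyond "aardvark" and the paragraph strings are invented for illustration.

```python
from collections import OrderedDict

od = OrderedDict()
od["aardvark"] = "Tubulidentata"
od["zebra"] = "Perissodactyla"
od["bat"] = "Chiroptera"

# Iteration follows insertion order, not key order.
names_in_order = [k for k, v in od.items()]
print(names_in_order)   # ['aardvark', 'zebra', 'bat']

# The limitation for XML children: names must be unique, so a second
# "p" replaces the first instead of becoming a second member.
children = OrderedDict()
children["p"] = "first paragraph"
children["p"] = "second paragraph"
print(len(children))    # 1
```

This is the gap marked by the "?" row in the table: a collection that is ordered, named, and allows duplicate names.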
childNodes
, so they are accessed by position. That's fine as far as it goes, especially if languages would make them accessible via their native array syntax as shown earlier.
para or
stanza; but there are also several types of nodes that are not elements. This 2-way or 2-level distinction complicates XML processing, so to keep the syntax and semantics simple it would help to get rid of it. The next section deals with this.
para or
chapter. The overloading of
type is confusing. The simplification proposed here begins by reducing the variety of node-types:
target as an attribute, or to accommodate the commonplace of using attribute syntax within PIs).
Elm, since it is largely similar to DOM Element, but subsumes the other node types as well. By introducing reserved Elm names such as
_TEXT,
_PI,
_COMMENT, and
_ROOT, and by treating the text content of such (empty, leaf) Elms as a special attribute, the inventory of node types drops to just Elms and attributes.
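One way to picture this reduction is to map each DOM node onto a single name, as in the sketch below. It uses Python's standard xml.dom.minidom; the function elm_name and the mapping table are hypothetical names introduced only for this illustration, and node kinds not mentioned above (such as CDATA sections) are deliberately left out.

```python
from xml.dom import minidom, Node

# Assumed mapping from DOM node types to the reserved Elm names.
RESERVED = {
    Node.TEXT_NODE: "_TEXT",
    Node.COMMENT_NODE: "_COMMENT",
    Node.PROCESSING_INSTRUCTION_NODE: "_PI",
    Node.DOCUMENT_NODE: "_ROOT",
}

def elm_name(node):
    """Return one name per node: the element type name, or a reserved name."""
    if node.nodeType == Node.ELEMENT_NODE:
        return node.tagName
    return RESERVED[node.nodeType]

doc = minidom.parseString("<div><!--note--><p>Hi</p><?fmt break?>tail</div>")
names = [elm_name(child) for child in doc.documentElement.childNodes]
print(names)   # ['_COMMENT', 'p', '_PI', '_TEXT']
```

With every child carrying exactly one name, code that walks a child sequence no longer has to test the node type first and the element type second.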
_TYPE.
_is used to prefix reserved names such as
_TYPE because it is an acceptable identifier-start character in Javascript, Python, and many other languages. XML names can, however, also include colon, period, and hyphen, which would necessitate using Javascript bracket notation instead of dot notation.
NamedArray. The basic properties of this new type are:
sort()
probably won't be used much for XML, but may be for other applications of NamedArray). Operations that insert members by position (such as append()
) take an extra parameter for the name.
* matches any Elm node type name, but not any of the reserved (
_-initial) names.
array[n]
notation for arrays. It also provides slicing such as array[start:end]
(note that the 2nd argument specifies the entry dict[key]
notation, and a separate notion of object properties, accessed like object.propname
[]
for arrays, {}
for dicts, and lacks .
notation. Javascript uses array[n]
, obj[key]
, and obj.key
, but they are largely synonymous.
array[start:end:interval]
notation. For example, array[0:20:2]
retrieves every other item from among the array's first 20 entries. This is said to be heavily used with the numpy/scipy scientific and math packages.
Description | DOM | Pythonish |
---|---|---|
Get first child | n.firstChild | n[1] |
Get second child | n.childNodes[1] | n[2] |
Get last child | n.lastChild | n[-1] |
children 1-3 | Array.prototype.slice.call(n.childNodes, 0, 3) | n[1:4] |
two nodes equivalent | n1.isEqualNode(n2) | n1 == n2 |
two nodes identical | n1.isSameNode(n2) | n1 is n2 |
replace third child | n1.replaceChild(n2, n3) | n1[3] = n3 |
get first "p" child | | n["p"] |
get third "p" child | | n["p":3] |
get last "p" child | | n["p":-1] |
get the first 20 "p" children | | n["p":1:21] |
get all "p" children from among the first 20 children | | n[1:21:"p"] |
walk down by types | | doc["chap":3]["sec":2]["p"] |
n["*"]
is
defined to do this. The reserved Elm types _TEXT,
_PI,
_COMMENT,
_ROOT, and
_ATTLIST are simply names when XML uses NamedArray, so NamedArray itself has no awareness of XML conventions.
[]
considerably reduces the number of methods and properties required to achieve DOM's functionality.
n["p":1:21]
and n[1:21:"p"]
. The former retrieves all children named p, and then retrieves the first 20 of those; the second instead retrieves the first 20 children, and then extracts all of those that are named
p. That is, the slicing operations go from left to right. This may seem familiar to users of XPath's successive
[]
filters.
nextSibling
, etc.). Another option, which seems better when it is possible, is to keep them all in a dictionary. The dictionary in turn can be a property, or as suggested below, the first member of the NamedArray.
special. With slicing operations, it can be included or excluded as desired. I see this as a rather nice compromise between (a) most programming languages using 0-based arrays, and (b) many people thinking of the first
parentNode
, etc.) are provided as properties as usual, as are more
Python-native synonyms: cloneNode (copy), textContent (toString),
appendChild (append), hasChildNodes and hasAttribute (in), removeAttribute (del),
removeChild (del), etc.
[]
notation invokes __getitem__()
. Python already accepts 3 arguments to the bracket notation, and tests show that any of them can be a string instead of an integer. Thus, the semantics just described are easily implemented:
buffer
type, which supports read-only access to successive subsequences of a sequence (reminiscent of Scheme cdr'ing down a list), would provide an effective approach.
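The bracket-notation semantics described earlier can be sketched in a few lines of Python. This is an illustrative reconstruction under stated assumptions, not the implementation presented in this paper: the storage as (name, value) pairs, the helper names, the attribute dictionary at position 0 under the reserved name _ATTLIST, and the 1-based counting among name matches are all assumptions made here to agree with the examples in the table.

```python
class NamedArray:
    """Sketch: an ordered sequence whose members each carry a repeatable name."""

    def __init__(self, attrs=None):
        # Assumption: position 0 holds the attribute dictionary under the
        # reserved name "_ATTLIST", so children start at position 1.
        self._members = [("_ATTLIST", dict(attrs or {}))]

    def append(self, name, value):
        # Inserting by position takes an extra parameter for the name.
        self._members.append((name, value))

    @staticmethod
    def _match(name, pattern):
        # "*" matches any name except the reserved ("_"-initial) ones.
        if pattern == "*":
            return not name.startswith("_")
        return name == pattern

    def _named(self, pattern):
        return [v for n, v in self._members if self._match(n, pattern)]

    def __getitem__(self, key):
        if isinstance(key, str):                    # n["p"]: first "p" member
            return self._named(key)[0]
        if not isinstance(key, slice):              # n[3], n[-1]
            return self._members[key][1]
        start, stop, step = key.start, key.stop, key.step
        if isinstance(start, str):                  # filter by name, then position
            matches = self._named(start)
            if step is None:                        # n["p":3], n["p":-1]
                return matches[stop - 1] if stop > 0 else matches[stop]
            return matches[stop - 1:step - 1]       # n["p":1:21], positive bounds assumed
        if isinstance(step, str):                   # n[1:21:"p"]: position, then name
            window = self._members[start:stop]
            return [v for n, v in window if self._match(n, step)]
        return [v for _, v in self._members[key]]   # ordinary slice, e.g. n[1:4]


n = NamedArray({"class": "intro"})
for name, value in [("title", "T"), ("p", "p1"), ("note", "n1"), ("p", "p2"), ("p", "p3")]:
    n.append(name, value)

print(n["p":3])      # p3
print(n["p":1:3])    # ['p1', 'p2']
print(n[1:4:"p"])    # ['p1']
```

The last two calls show the left-to-right reading of the brackets: the first filters by name and then by position, the second takes a positional window and then filters by name.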
<<x>>
?
<⊂x⊂>
?
<〔x〕>
?
ac for
abstract collection):
replaces XML, what is to be done? It is a commonplace that perhaps 80-90% of even corporate data resides in documents rather than databases; but whatever that precise figure may be, Apollonius of Rhodes, Clement of Alexandria, and Harry of Hogwarts are not going away.
["stanza", "...." ]
, or a dictionary such as {"label":"stanza", "content":"...." }
.
n.childNodes[3].childNodes[2].getAttribute('class')
when they would quite sensibly rather write n[3][2].attrs['class']
, countless times.
NamedArray type and implementation to fill that gap.
Extensible Markup Language (XML) 1.0 (Fifth Edition). W3C Recommendation, 26 November 2008. http://www.w3.org/TR/REC-xml/
Markup systems and the future of scholarly text processing.
What is text, really?
What Should Markup Really Be: Applying theories of text to the design of markup systems.
The application/json Media Type for JavaScript Object Notation (JSON). IETF RFC 4627. http://www.ietf.org/rfc/rfc4627.txt
ECMAScript® Language Specification. Standard ECMA-262, 5.1 Edition / June 2011. Geneva: Ecma International. http://www.ecma-international.org/ecma-262/5.1/
JQuery API. http://api.jquery.com
Document Object Model (DOM) Level 3 Core Specification. Version 1.0. W3C Recommendation, 07 April 2004. http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/