How to cite this paper
DeRose, Steven J. “Ragnarok: An Experimental XML environment.” Presented at Balisage: The Markup Conference 2025, Washington, DC, August 4 - 8, 2025. In Proceedings of Balisage: The Markup Conference 2025. Balisage Series on Markup Technologies, vol. 30 (2025). https://doi.org/10.4242/BalisageVol30.DeRose01.
Balisage: The Markup Conference 2025
August 4 - 8, 2025
Balisage Paper: Ragnarok: An Experimental XML environment
Steven J. DeRose
Steve DeRose has been working with electronic document and hypertext systems
since 1979. He holds degrees in Computer Science and Linguistics and a Ph.D. in
Computational Linguistics from Brown University.
He co-founded Electronic Book Technologies in 1989 to build the first SGML
browser and retrieval system, DynaText,
and has been deeply
involved in document standards including XML, TEI, HyTime, HTML 4, XPath,
XPointer, EAD, Open eBook, OSIS, and others. He has served as adjunct faculty at
Brown and Calvin Universities and has written many papers, two books, and more
than fifteen patents. Most recently he has been working as a consultant in text
analytics.
Copyright 2025 Steven J. DeRose. May be copied per the Creative Commons Attribution-Sharealike
license.
Abstract
XML has a highly reliable, consistent, widely-supported ecosystem. Python is
enormously popular, yet (perhaps surprisingly) its support for XML has weaknesses.
Several parsers are available but most (including the official
xml.parsers.expat) are not native Python, leading to issues with Python development
tools. The Python DOM library (xml.dom.minidom) is native, but is barely DOM 2.0,
slow, and lacks conveniences well-established elsewhere. It is also not
Pythonic,
using few modern Python features and idioms. lxml is
admirably Pythonic for Elements, but text poses problems.
Ragnarok is a new, pure Python XML tool suite
that addresses these issues. It provides plug-compatible replacements for Python XML
libraries, and is equipped with many Pythonic conveniences (the “batteries
included” philosophy of Python). The parser (Thor) uses recursive descent: methods map directly to the XML
grammar, easing debugging and extension. A validator (Heimdall) is in progress. Schemera
handles DTDs, but its architecture is more like XML Schema. The DOM library
(Dominµs, aka Yggdrasil) is much faster than minidom, with almost all of DOM 3
Core and many features drawn from other XML and HTML tools and from Python practice.
Flexible output serializing comes via components called Gleipnir and Bifrost.
Non-hierarchical structures have long been of interest to this community, but face
a dilemma because XML doesn’t really support them: one can coerce to XML
syntax via milestones or standoff markup; or create entirely new syntax. In either
case XML tools give little help. Besides Thor, Ragnarok also includes a second
parser, called Loki, which explores a middle way.
Loki accepts non-hierarchical structures via syntax that includes non-XML
extensions, but remains so similar that (a) no prior WF XML changes meaning, and (b)
the implementation is easily constructed on top of a regular parser. This may
enhance data and code re-use. Loki is a subclass (rather than brother) of Thor, and
like its namesake can change its shape. Loki can be configured with many
XML-adjacent options ranging from case-folding names or enabling named character
entities (both difficult with many other tools), on up to extensions for olists,
suspend/resume, and milestone-encoded structures.
Ragnarok overall can help with everyday XML tasks in Python by being faster, more
up-to-date, more Pythonic, and more fully functional. On the other hand, Loki can
help with overlap and a variety of other tasks at the edge.
Table of Contents
- Introduction
- Design principles
- Yggdrasil (aka Dominµs)
- Slicing
- Other list methods
- Other Yggdrasil features
- Relation to lxml and ElementTree
- Some ways documents differ
- Yggdrasil / Dominµs sibling representation
- Schemera (DTD/schema support)
- Loki: a shape-changing parser
- Security options
- Basic/practical parser options
- Weightier extensions
- Other options
- Schemera and Loki
- Parser APIs
- Special character handling
- Runeheim (character set management)
- Gleipnir and Bifrost (serializers)
- Testing
- Futures
- ID handling
- A longer view
- Why try Ragnarok?
- Why not try it?
- Availability
- Appendix A. Examples of extended syntax
- Appendix B. Example of JBook from Bifrost
- Appendix C. Dominµs/Yggdrasil vs. regular DOM
- Appendix D. Loki and Schemera options
Introduction
XML is plenty cool as is. Yet support for it in Python has some limitations. This
paper reports progress on a pure Python XML environment that I call Ragnarok after the apocalyptic rebirth in Norse mythology.
The great serpent (python?) Jörmungandr triggers it, as problems with the official
Python XML stack motivate this Ragnarok. For example, the official
parser
is not native Python and so has limitations with Python development tools. The Python
DOM library is behind, quite slow, and uses few modern Python features or practices.
Several Ragnarok tools can be plugged straight in for Python’s libraries,
particularly expat and minidom.
Ragnarok has several parts, all implemented in pure Python and fully
type-hinted:
-
The primary component is a modern DOM implementation [Le Hors et al. 2000]
covering nearly all of DOM 2 and DOM 3 Core. It is called Yggdrasil after the great tree of Norse myth; or if you prefer,
Dominµs — spelled with a micro sign because
it’s fast (unlike minidom).
-
An XML parser called Thor (short for Text Hierarchy Object
Reader, and previously called xsparser). Thor
has very few options, such as for protecting against DoS attacks.
-
Flexible serializers called Gleipnir and Bifrost.
-
A package for character sets and naming, called Runeheim.
-
DTD and partial XSD support, called Schemera
because it casts a wide feature net and integrates parts from several other
components and languages, like the Chimera (despite the lack of that
creature in Norse myth).
-
A still-incomplete validator, called Heimdall
after the guardian of the bridge joining Asgard and Midgard.
-
An extended (XML-adjacent) parser, called Loki because it is extremely configurable, including for overlap
and much more.
-
A still-incomplete persistent binary form called Sleipnir.
Ragnarok tools are backward-compatible: Yggdrasil has the normal DOM API, plus added
conveniences you can use if you want. Thor has an API almost identical to expat’s
and is a normal XML parser. Schemera does normal DTDs (and soon, XSDs) unless you
reconfigure it. Gleipnir binds DOM structures into XML syntax just like minidom’s
toprettyxml() unless you reconfigure it with an additional formatting
spec. Loki, in contrast to Thor, can be specifically reconfigured with a wide range
of
syntactic and semantic options (nevertheless, WF XML is treated normally). Ragnarok
also
includes a large Python unittest suite including
head-to-head swapping vs. prior tools.
Design principles
Ragnarok aims at several broad goals:
-
Support the existing ecosystems
-
Be backward-compatible with XML and DOM tools, building on XML
foundations rather than replacing them
-
Ensure regular XML parsers (including Thor) stop if Loki extensions are
enabled
-
Be Pythonic, XML-ic, and Unicode-ic
-
Provide for configuration and extension
-
Be pure Python and leverage Python libraries and features
-
Support non-hierarchical structures with minimal syntax change from
XML
-
Avoid components depending on other components’ options
-
Don’t necessarily be limited by quirks of the ecosystem
-
Remember that XML ≢ DOM ≢ structure.
-
Avoid making DTDs necessary (e.g., for named characters, finding IDs,
or attribute defaults)
-
Offer more sophisticated structure, ID, and other semantics
-
Provide pain relief
-
Address inconsistency, verbosity, and awkwardness in APIs
-
Be fast and allow users to make performance tradeoffs
-
Make entities safer against potential attacks
There are lots of functions and methods available across these components, but users
rarely have to care. For example, all the DOM node insertion and deletion methods
are
there in Yggdrasil so if you like your mutator, you can keep your
mutator.
On the other hand, XPath users needn’t remember whether to
use preceding/following or previous/next (I never can). Both work the same. Python
developers that are new to DOM can just use the already-familiar Python list operations.
For example, instead of
if newNode.nodeType == Node.ELEMENT_NODE:
    c3 = n.childNodes[3]
    n.insertBefore(newNode, c3)
    n.removeChild(c3)
they can say the following, which is Pythonic, shorter, more readable, and not prone
to mistaking
insertBefore’s parameter order:
if newNode.isElement: n[3] = newNode
Python, Unicode, and XML have express goals that are broadly compatible but differ
in
detail. Honoring them involves compromises, not only when combined but even for each
in
itself:
-
Python teaches [Peters 2004] that “There should be one-- and preferably only
one --obvious way to do it.”
One counterexample is that the fastest
way to build up a string (say, when parsing) is to make a list of separate
single-character strings, then combine them into one string at the end. This is
far from the obvious way to do it.
-
Unicode’s principles [Unicode, Section 2.2, Unicode Annex #15]
include that it be simple to parse and process,
that it
unify duplicate characters within scripts across languages,
and that characters have well-defined semantics
that
represent plain text.
All these have exceptions.
-
XML goals include #5 [Bray, Paoli, and Sperberg-McQueen 1998, §1.1]: The number
of optional features in XML is to be kept to the absolute minimum, ideally
zero.
Yet there are effectively options in
standalone,
attribute defaulting, XML 1.1 differences, and
validation and choice of schema language.
Nevertheless, Ragnarok tries to honor these principles as far as practical, except
for
having options. For example, Thor and Yggdrasil provide idioms that fit the
obvious
way of doing things in Python, such as n[3] to
access n’s 4th child, or n.isText to identify text
nodes. Runeheim modularizes Unicode dependencies so code need not be complicated
everywhere else. Ragnarok adds synonyms without removing original names, rather than
forcing users to adjust.
Yggdrasil (aka Dominµs)
Yggdrasil is my favorite piece of Ragnarok. It supports nearly all of the minidom
API
identically, while my profiling finds it about 40% faster (this will of course vary
across use cases). It has nearly all of DOM 2 and DOM 3 Core, and features from HTML
DOM, WhatWG, XPath, XPointer, lxml, and even ElementTree and CSS. Most existing code
should run fine with Yggdrasil replacing minidom (just faster and better
integrated).
Slicing
The most obvious change is that DOM Elements are a true subclass of Python lists —
consisting of their child nodes. This enables Python’s normal subscript or
slicing
notation: myNode[1], not just
myNode.childNodes[1].
In minidom, the first raises TypeError: 'Element' object is not
subscriptable.
The second works as a reference, but assigning to it in
obvious Python fashion only appears to work in
minidom. For example, this reports no
error:
myNode.childNodes[1] = myNode.childNodes[3]
minidom copies the pointer, leaving
myNode.childNodes with duplicate references
to the identical node. Thus
for c in myNode.childNodes afterward gives
children 0, 3, 2, 3, … But iterating with
c = myNode.childNodes[0]; while c:
c = c.nextSibling still gives the original list. Yggdrasil, in contrast,
correctly complains that [3] already has a parent.
If instead a new Node is spliced into position [1], minidom again corrupts its own
data: iterating via nextSibling would never see the new Node, though it would be
reachable as myNode.childNodes[1]. Yggdrasil instead does the right
thing. In Yggdrasil Pythonically-obvious things just work, both on left and right
sides.
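The minidom behavior described above can be reproduced with the standard library alone. This is a minimal sketch; the four-child sample document and the variable names are invented for illustration:

```python
from xml.dom import minidom

doc = minidom.parseString("<r><a/><b/><c/><d/></r>")
root = doc.documentElement

# Subscripting the element itself is not supported by minidom.
try:
    root[1]
    subscript_error = None
except TypeError as err:
    subscript_error = str(err)

# Assigning into childNodes reports no error...
root.childNodes[1] = root.childNodes[3]

# ...but the child list and the sibling chain now disagree.
via_list = [n.nodeName for n in root.childNodes]

via_siblings = []
c = root.childNodes[0]
while c is not None:
    via_siblings.append(c.nodeName)
    c = c.nextSibling

print(subscript_error)
print(via_list)      # ['a', 'd', 'c', 'd']
print(via_siblings)  # ['a', 'b', 'c', 'd']
```

The assignment only rewrites one slot of the Python list; none of the parentNode, previousSibling, or nextSibling pointers are touched, which is why the two traversals diverge.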
Python [] notation supports more varied arguments than just a
positive integer, and Yggdrasil also leverages that to support shorthands
reminiscent of XPath:
Table I
| Example | DOM | Description |
| x[-1] | x.childNodes[-1] # right-side only | Last child |
| x[0:-3] | nl = NodeList(x.childNodes[0:-3]) | First child up to (but not including) the 3rd-to-last child |
| x["para"] | nl = NodeList([ ch for ch in x.childNodes if ch.nodeName == "para" ]) | All children with nodeName "para" |
| x["@id"] | x.getAttribute("id") | Attribute named "id" |
| x["para":1:3] | temp = NodeList([ ch for ch in x.childNodes if ch.nodeName == "para" ]); NodeList(temp[1:3]) | Among all "para" children, the 2nd and 3rd |
| x["#text"] | nl = NodeList([ ch for ch in x.childNodes if ch.nodeName == "#text" ]) | All children with nodeName #text (i.e., text nodes) |
| x["*"] | nl = NodeList([ ch for ch in x.childNodes if ch.nodeType == Node.ELEMENT_NODE ]) | All element children |
| x[".."] | x.parentNode | The parentNode |
| x["/"] | NodeList(x.childNodes) | All children |
As with Python’s own extended slices
there are some limitations
on the left side (they raise errors if tried). Yggdrasil is even faithful to a
feature/quirk of Python slicing: An out-of-range index raises IndexError for a
singleton like x[999] but quietly returns an empty list for a range
like x[99:999] or x[99:999:2]. One difference, however, is
that elements with no children do not cast to Boolean False like empty Python lists
do. This is because empty elements have distinct data (such as attributes, not to
mention their tree context), unlike other falsish
things in Python:
None, 0, "", [],
{}, etc.
Finally, slicing supports scheme prefixes analogous to those in XPointers and
URLs. For example, n["css:#mainText"]. Many CSS selector types are
available (a few don’t make sense in non-browser contexts). Additional scheme
prefixes can be registered along with functions to handle them.
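To make the dispatch idea concrete, here is a toy sketch of how a list subclass can route different subscript types. The Elem class and its attribute names are hypothetical and are not Yggdrasil's implementation; only a few of the forms from Table I are covered:

```python
class Elem(list):
    """Toy element (hypothetical): a list of children plus a name and attributes."""

    def __init__(self, name, attrs=None, children=()):
        super().__init__()
        self.nodeName = name
        self.attrs = dict(attrs or {})
        self.parentNode = None
        for child in children:
            child.parentNode = self
            self.append(child)

    def __bool__(self):
        # An element with no children is still a real node, so it is never falsy.
        return True

    def __getitem__(self, key):
        if isinstance(key, str):
            if key.startswith("@"):          # x["@id"]: attribute value
                return self.attrs[key[1:]]
            if key == "..":                  # x[".."]: parent
                return self.parentNode
            # x["para"]: children with that name
            return [ch for ch in self if ch.nodeName == key]
        if isinstance(key, slice) and isinstance(key.start, str):
            # x["para":1:3]: filter by name, then slice with the other two parts
            return self[key.start][key.stop:key.step]
        return super().__getitem__(key)      # integers and ordinary slices


sec = Elem("sec", {"id": "s1"},
           [Elem("title"), Elem("para"), Elem("para"), Elem("note"), Elem("para")])

print(sec["@id"])                              # s1
print([ch.nodeName for ch in sec[0:-3]])       # ['title', 'para']
print(len(sec["para"]), len(sec["para":1:3]))  # 3 2
print(sec[99:999], bool(Elem("empty")))        # [] True
```

Python passes any objects written between the colons straight through as a slice, which is what lets a name occupy the first position.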
Other list methods
Regular Python list operations such as insert and del (not
to mention sort and many others) just work, as do regular DOM
ones:
myNode.insert(-1, newNode)
myNode.extend(someNodeList)
del myNode[-1]
myNode.pop(5)
x[0:2] = myNodeList
howMany = myNode.count("p")
myNode.appendChild(newNode)
myNode.insertBefore(newChild, oldChild)
Other Yggdrasil features
Some other features of Yggdrasil are inspired by the HTML DOM, such as
(configurable) case-ignoring and innerXml and outerXml (both setters and
getters). WhatWG inspired adding insertAdjacentXml() and methods for
tokenized attributes (such as, but not only, HTML CLASS).
The Yggdrasil API uses modern Python constructs. The much shorter nodeType test
properties were shown earlier (n.isElement, n.isText,
etc.); or of course Python isinstance() works fine. There are also
methods to get the first node along any XPath axis (not just parent, child, and
siblings), and generators to iterate along any XPath axis. There is also a generator for the SAX events that a parser would return
were it parsing a given subtree. The generators can take a callable to use as a
filter:
for i, n in enumerate(someNode.descendants(test=myCallback)):
    print(f"Descendant {i} is of type {n.nodeName}")
Yggdrasil accepts synonyms such as DOM previous/next vs. XPath
preceding/following, and fills symmetry gaps such as
prependChild().
Reserved string arguments are defined as enums, so in cases like
myNode.insertAdjacentXML("beforebegin", "<p>As seen below:</p>")
a reified type and value can be used: RelPosition.beforebegin. This
lets pylint provide earlier detection of errors, and compilers like mypyc optimize
better. The enums also accept the strings as equivalent so existing code works
without change.
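One common way to get an enum that also accepts plain strings is a str-mixin Enum. The sketch below assumes that approach; only RelPosition.beforebegin appears in this paper, the other three member names are taken from the WHATWG insertAdjacentHTML positions, and the describe function is hypothetical:

```python
from enum import Enum


class RelPosition(str, Enum):
    """Positions relative to a node; members compare equal to their string values."""
    beforebegin = "beforebegin"
    afterbegin = "afterbegin"
    beforeend = "beforeend"
    afterend = "afterend"


def describe(where):
    """Accept either a RelPosition member or its string; reject anything else."""
    where = RelPosition(where)   # raises ValueError for an unknown string
    return where.name


print(describe("beforebegin"))             # beforebegin
print(describe(RelPosition.afterend))      # afterend
print(RelPosition.beforebegin == "beforebegin")   # True
```

A misspelled string still fails only at run time, but a misspelled member name such as RelPosition.beforbegin can be flagged by static tools before the code runs.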
DOM methods that normally require a child node as a location specifier are
extended in Yggdrasil to also accept (signed) numbers. Whether a caller knows the
position of a node (perhaps because it’s iterating by number) or the actual
node (perhaps because it’s in a NodeList), it can just use what it has rather
than having to convert to the other — the following are all interchangeable if
oldNode is theParent[4]:
theParent.insertBefore(newNode, oldNode)
theParent.insertBefore(newNode, 4)
theParent.insert(4, newNode)
That last case is just the normal Python list insertion method, and therefore uses
that order of arguments. Nodes can even delete themselves, using:
myNode.removeNode()
rather than something like:
myNode.parentNode.removeChild(myNode)
Normal comparison operators such as
< are overloaded in typical
Python fashion, and test document order.
However, because a node cannot be in multiple places, using
== and
!= to test document order is redundant with
testing identity (
isSameNode()). It seems better to instead make them
do the equivalent of
isEqualNode(). DOM 3’s
compareDocumentPosition() is also available.
Yggdrasil can generate and interpret basic XPointers [Clark and DeRose 2002]
with or without IDs. As it turns out, generating XPointer child sequences for two
nodes and comparing them is a very fast and intuitive way to compare document
positions. Upgrades to support DOM/XPointer ranges and discontiguous (virtual)
elements are underway but incomplete (to be named Gungnir
after
Odin’s spear which always hits its target). Those of course will need
additional position, containment, traversal, and inheritance semantics.
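The idea of comparing document positions through child sequences can be sketched with minidom. This is an illustration under stated assumptions, not Yggdrasil's code: the child_sequence function is hypothetical, it handles elements only, and it counts element children in the manner of the XPointer element() scheme:

```python
from xml.dom import minidom


def child_sequence(elem):
    """Return the 1-based element child numbers from the document element down to elem."""
    steps = []
    node = elem
    while node.parentNode is not None and node.parentNode.nodeType == node.ELEMENT_NODE:
        position = 0
        for sib in node.parentNode.childNodes:
            if sib.nodeType == sib.ELEMENT_NODE:
                position += 1
            if sib is node:
                break
        steps.append(position)
        node = node.parentNode
    steps.append(1)                      # the document element itself
    return tuple(reversed(steps))


doc = minidom.parseString("<doc><sec><p/><p/></sec><sec><p/></sec></doc>")
paras = doc.getElementsByTagName("p")
secs = doc.getElementsByTagName("sec")

seq = child_sequence(paras[1])
print("/" + "/".join(str(n) for n in seq))                     # /1/1/2
print(child_sequence(paras[1]) < child_sequence(paras[2]))     # True
print(child_sequence(secs[0]) < child_sequence(paras[0]))      # True
```

Tuples compare lexicographically, and a prefix sorts before any longer tuple that extends it, so an ancestor correctly compares as earlier than its descendants.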
Since Elements and NodeLists are actually Python lists, operations like sort,
reverse, and even multiply are available. However, because no Node can appear in
more than one place in a document, operations like multiplication on Elements return
a new NodeList and not the same Element lengthened.
Relation to lxml and ElementTree
lxml is a popular parser, with an associated ElementTree package comparable to
DOM. It has the advantages of being much more Pythonic than minidom, for example
simply constructing nodes via Element(),
Comment(), etc.; treating elements much like lists, leveraging
Pythonisms such as is, in, len; treating
attributes as a dictionary; etc. These conventions are also used in Yggdrasil.
ElementTree also has its own names for various things: tag vs.
nodeName, getParent() vs. parentNode,
etc. These also are supported in the interest of users not
having to care.
ElementTree itself, however, has what I consider a serious drawback: It has no
concept of text nodes. Text is treated as a property of the preceding sibling
(almost like an attribute), named tail.
But of course there may not
be a preceding sibling. In that case the text instead goes onto the parent as a
property called text
(tail
on the parent would refer
to text in a different place). For Comments and PIs, text
means their
data content and they have no tail.
This works, but has consequences:
-
The number of children, child numbers, etc. are completely different
than in other XML tools.
-
Text that is conceptually part of an element is partly
on
that element and partly on
its
children, rather than all at the same level. This makes adding (say) an
inline node within that text topologically interesting.
-
The tail text logically follows all descendants’ text; but it
is a property of the parent, which precedes all its descendants in
document order.
-
In everyday speech people say that text is part of
or
contained in
a paragraph — ElementTree still has a
notion of children but it does not include any of the text (except
perhaps indirectly).
-
Text content is split across two different variables, one of which can
only logically occur on certain Elements (in DOM terms, just those which
happen to have a first child which is a text node).
-
There is data stored in text
on Comments and PIs that
is not text content despite the name.
-
Code to gather up all the text content is considerably more
complicated with ElementTree’s approach (and it is awkward to use
Python’s much faster ''.join() method for building
up strings):
def textContent(self) -> str:  # ElementTree way
    buf = self.text or ""
    for ch in self:
        # Comment and ProcessingInstruction are factory functions, not classes,
        # so such nodes are recognized by their tag rather than by isinstance().
        if ch.tag not in (ET.Comment, ET.ProcessingInstruction):
            buf += ch.textContent()
        buf += ch.tail or ""
    return buf

def textContent(self) -> str:  # DOM way
    if isinstance(self, Element):
        return ''.join([ ch.textContent() for ch in self.childNodes ])
    if isinstance(self, CharacterData):
        return self.nodeValue
    return ""

def textContent(self) -> str:  # Yggdrasil way
    if self.isElement:
        return ''.join([ ch.textContent() for ch in self ])
    if self.isCharacterData:
        return self.nodeValue
    return ""
I find ElementTree’s approach to text less than intuitive. Nevertheless,
along with the rest of the ElementTree API for nodes, Yggdrasil supplies setters and
getters for text and tail.
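The difference in child counts and text placement is easy to see with the standard library. A small sketch; the sample paragraph and the text_content helper are invented for illustration:

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

SOURCE = "<p>Hello <i>big</i> wide <b>world</b>!</p>"

# ElementTree: two children; the text is spread over .text and .tail properties.
para = ET.fromstring(SOURCE)
print(len(para), repr(para.text), repr(para[0].tail), repr(para[1].tail))
# 2 'Hello ' ' wide ' '!'


def text_content(elem):
    """Gather all text under elem, walking .text and .tail by hand."""
    parts = [elem.text or ""]
    for ch in elem:
        if ch.tag not in (ET.Comment, ET.ProcessingInstruction):
            parts.append(text_content(ch))
        parts.append(ch.tail or "")
    return "".join(parts)


print(text_content(para))            # Hello big wide world!

# DOM: the same paragraph has five child nodes (text, i, text, b, text).
dom_para = minidom.parseString(SOURCE).documentElement
print(len(dom_para.childNodes))      # 5
```

ElementTree does offer itertext() for the common case of gathering text, but code that inserts or moves inline elements still has to juggle text and tail directly.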
Some ways documents differ
It may be that this approach by ElementTree is related to how XML is commonly
(mis?) represented in the software world. It is extremely common to explain XML
using examples that completely omit features that are indispensable for
documents; that include only features that are needed for config files, CSV-like
data, etc. (see DeRose 1999). For example, countless examples
share these limitations:
-
Only elements, no attributes. Or perhaps attributes only for style
or ID. This conflates the notion of things having component parts
with the notion of things having properties (essentially mereology
vs. ontology, or the ubiquitous programming distinction of is-a vs.
has-part).
-
Essentially no hierarchy. Examples very commonly have a root
element, then a bunch of instances of a single element type, each
with the same sequence of sub-element types. In other words, a CSV
file in pointy brackets. This is just fine for data that
is that way; but documents are not that
way.
-
No ordering. Again like CSV, many XML examples have no use for
order — you can shuffle records without affecting the
meaning; but do that to the paragraphs of a document and the author
might not be pleased. Some sophisticated but non-document-like XML
applications, however, also do not make much use of order (for
example SVG and XSLT).
-
No mixed content or even no text content at all. Look at Apple XML
config files, or even the XML page on Wikipedia — which as of
this writing barely mentions mixed content and gives not one example
of non-trivially nested elements or of mixed content. Documents in
reality have lots of unnamed text portions — countless
inline-ish or font-ish elements are embedded in text, making the
text on either side unnamed.
The last item (mixed content) is, I think, the crux: Serializing a
SQL or CSV record or a C struct gives no occasion for freestanding
text
(or freestanding integers, for that matter). A record is
a tuple of scalar fields (essentially what Python calls a namedtuple).
Similarly, an OOP object is conceptually (and in Python literally) a dict with
named items. In both cases such items permit no repetitions (an item’s
value can of course be a list, dict, or other object, but if it is named
X you still only get one
X). Nameless parts simply
don’t exist in these contexts.
A fifth fundamental omission is that examples rarely show items having identity,
reference, or re-use of data by reference. Unlike the other points this should
be bread and butter even for developers who rarely deal with documents — objects
and databases are replete with pointers and re-use despite their notable absence
from CSV and JSON.
These features are absolutely required for documents. Without them XML could still
handle tabular and OOP-like cases, but not even simple documents. Someone who
considers only those cases may think all the rest is wasteful, but that is akin
to considering Unicode wasteful because one is monolingual. Sadly, XML examples
are commonly gerrymandered.
Yggdrasil / Dominµs sibling representation
DOM implementations provide direct access to the previous and next sibling from
any node. There are three main ways to implement this, which have significant
space/time tradeoffs:
-
Store explicit pointers to adjacent siblings in every node (minidom does
this). This is very fast for individual steps, but it costs space for the
pointers and time to update them on changes.
-
Store in each node its integer position among its siblings. This is faster
than searching the parent (see next) but slower for node insertions or
deletions other than at the end (because all later siblings need their
numbers adjusted).
-
Go to the parent, count through its list of children to find the original
child, then add or subtract 1. This slows down for nodes with very many
siblings (except for some operations like appends), but saves space and
makes tree changes fast (no pointers or counters to update).
Perhaps surprisingly, profiling revealed that going to the parentNode and counting
is very much faster than maintaining and using direct pointers, even for quite wide
trees. This is a good example of why a pure Python experimental framework is useful:
it’s very easy to change specific details and do head-to-head
comparisons.
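The third method (ask the parent) can be sketched in a few lines. The Node class below is a hypothetical toy, not Yggdrasil's code; it stores no sibling pointers or positions and derives them on demand:

```python
class Node(list):
    """Toy node (hypothetical): a list of children with a parent pointer only."""

    def __init__(self, name):
        super().__init__()
        self.nodeName = name
        self.parentNode = None

    def append(self, child):
        child.parentNode = self
        super().append(child)

    def insert(self, index, child):
        child.parentNode = self          # nothing else to update
        super().insert(index, child)

    def child_index(self):
        # Search by identity: list.index() compares with ==, and two
        # childless nodes would compare equal as (empty) lists.
        for i, sib in enumerate(self.parentNode):
            if sib is self:
                return i
        raise ValueError("node is not among its parent's children")

    @property
    def nextSibling(self):
        if self.parentNode is None:
            return None
        i = self.child_index() + 1
        return self.parentNode[i] if i < len(self.parentNode) else None

    @property
    def previousSibling(self):
        if self.parentNode is None:
            return None
        i = self.child_index()
        return self.parentNode[i - 1] if i > 0 else None


parent = Node("sec")
a, b, c = Node("a"), Node("b"), Node("c")
for child in (a, b, c):
    parent.append(child)

x = Node("x")
parent.insert(1, x)                      # children are now a, x, b, c
print(a.nextSibling.nodeName, x.nextSibling.nodeName, c.nextSibling)   # x b None
```

Insertion touches only the parent's list, so there is no bookkeeping that can fall out of step with the list operations.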
In keeping with Ragnarok’s overall philosophy Yggdrasil provides for
choosing any of the three methods, and even for changing in mid-stream. For example,
method 2 might be best during loading (it is very fast for additions at the end),
then method 3 once loaded (it is very fast for modifications), or method 1 if there
are very bushy nodes. It would not be very hard to have individual nodes switch
methods when they get wide (perhaps with hundreds of child nodes). However, method
3
is so much faster in my profiling that I have not seen a need for that. Direct
changes to nodes via list operations just work, regardless of the sibling
implementation in effect at the time.
A node’s child number is available, counting from either end. Options
enable counting only nodes with the same nodeName, ignoring white-space-only text
nodes, and/or coalescing adjacent (non-normalized) text nodes.
Schemera (DTD/schema support)
Ragnarok can read DTDs and internal subsets and make their information readily
accessible. This support is collectively called Schemera
even though it
includes parts within several other Ragnarok components. When markup declarations
are
loaded they are available via a simple but pretty complete API — it’s easy to
find out what’s declared or not, what the content models are (as strings, tokens,
or trees), what attributes are undeclared, of a given type, have defaults, and so
on.
The architecture is more informed by XML Schema than by DTD, and XSD syntax support
is
underway. Since the semantics are similar this will add capabilities such as importing
one and exporting the other (via either Gleipnir or Bifrost), not to mention a common
API. This also enables custom schema language extensions for Loki, such as supporting
XSD datatypes for attributes in DTDs:
<!ATTLIST employee
    salary   float  #IMPLIED
    hireDate gDate  #REQUIRED>
An xsdType
option already enables this, and affects both schemas and
documents. Loki and Schemera must recognize the XSD type names in ATTLIST declarations,
while Yggdrasil must provide access to their representation. Gleipnir and Bifrost
must
serialize them, at present by converting to the nearest DTD types. Schemera can already
check lexical formats and facets, although some other aspects of validation await
the
Heimdall validator. For example, salary in the example above would really need to
be a floating point number. Values can also be auto-cast to the declared type when returned via SAX or
DOM rather than leaving everything as strings (this should save a lot of tedium).
Each
XSD type knows what built-in Python type to cast to. Yggdrasil also keeps track of
the
order of declarations, so re-exporting doesn’t have to scramble the order.
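Auto-casting of the kind described can be sketched as a lookup from XSD type name to a Python callable. Everything named below (XSD_TO_PYTHON, cast_attributes, xsd_boolean) is hypothetical and covers only a handful of types; the example uses XSD's date type for the hire date:

```python
from datetime import date
from decimal import Decimal


def xsd_boolean(lexical):
    """XSD boolean accepts exactly these four lexical forms."""
    if lexical in ("true", "1"):
        return True
    if lexical in ("false", "0"):
        return False
    raise ValueError(f"not an XSD boolean: {lexical!r}")


# Hypothetical table: XSD type name -> callable producing a Python value.
XSD_TO_PYTHON = {
    "string": str,
    "boolean": xsd_boolean,
    "integer": int,
    "decimal": Decimal,
    "float": float,
    "date": date.fromisoformat,   # handles YYYY-MM-DD; time zone suffixes are not handled here
}


def cast_attributes(attrs, declared_types):
    """Cast each attribute string to its declared type; undeclared ones stay str."""
    return {
        name: XSD_TO_PYTHON[declared_types.get(name, "string")](value)
        for name, value in attrs.items()
    }


declared = {"salary": "float", "hireDate": "date"}
raw = {"salary": "52000.50", "hireDate": "2019-03-01", "name": "Ann"}
print(cast_attributes(raw, declared))
```

A value that does not fit its declared type raises ValueError from the cast itself, which doubles as a crude lexical check.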
For options discussed in this section, Yggdrasil supports the needed internal data
and
APIs, but only Loki can read markup for them (not Thor). For example, content models
can
be very slightly extended: First, there is an anyAttribute
option to
support an eponymous declared content type which is like ANY except that
#PCDATA is not allowed (Ragnarok dislikes symmetry gaps). A
repBrace
option enables content models to use XSD-like
minOccurs and maxOccurs via the well-known regex brace
notation:
(title, para{1,3}, sec+)
Options also allow a few SGML-like items in schemas such as OMITTAG flags
and exceptions (oflag). This is mainly to accommodate old SGML DTDs.
However, even Loki does not support tag omission, because it generally breaks
XML’s key principle of not needing a schema to parse correctly. Omitting end tags
immediately before another end tag or EOF does not break that, so Loki has options
for
those cases. Declaring multiple element types at once (or attributes for multiple element
types) is available via groupDcl
plus additions for declaring global
attributes via globalAttribute
and XSD-like AnyAttribute via
anyAttribute.
A multiPath
option allows multiple SYSTEM identifiers for DOCTYPE,
ENTITY, and NOTATION declarations. They are defined to be tried from left to right.
This
is mainly to relieve the situation where a document is passed among people with
differing paths for the same things, without needing a catalog.
schemaType
enables Loki and Schemera to recognize a schema language
(which can also be declared as a NOTATION) specified like:
<!DOCTYPE article SYSTEM "http://docbook.org/xsd/5.0/docbook.xsd" NDATA XSD []>
Loki: a shape-changing parser
Thor parses XML. It is pure Python and uses recursive descent, so it has overt methods
directly corresponding to the XML grammar and they are easy to find, examine, or debug.
The interface is like expat as seen from Python and it works as a direct replacement.
It
reports the same WF errors. I think it also gives pretty good error messages (let
me
know of exceptions, please). It includes a few options, such as restricting external
entities to certain source directories, depths, etc.
So far so good.
Loki is a second parser, which is backward-compatible with Thor but offers many more
options ranging from case-folding on up to supporting markup representing
overlap.
Security options
Several options have little relation to XML syntax but aim to improve the security
of entity usage. Python documentation [Python] gives fairly dire
warnings about attack vectors like entities with excessive expansion or risky system
identifiers:
<!ENTITY a "Boom">
<!ENTITY b "&a;&a;&a;&a;&a;&a;&a;&a;&a;&a;">
<!ENTITY c "&b;&b;&b;&b;&b;&b;&b;&b;&b;&b;">
<!ENTITY d "&c;&c;&c;&c;&c;&c;&c;&c;&c;&c;">
<!ENTITY e "&d;&d;&d;&d;&d;&d;&d;&d;&d;&d;">
<doc>&e;</doc>

<!ENTITY pw SYSTEM "../../../etc/passwd">
&pw;
These are indeed potential attacks, although they affect most any data format that
provides a macro or include
capability (a very long list). There is
even a Python library called defusedxml
to mitigate such risks [Heimes 2023]. Thor and Loki both provide mitigation options, because such
options do not affect syntax per se:
-
Limiting the depth and/or total length of entity expansions
(MAXENTITYDEPTH and MAXEXPANSION)
-
Deciding whether to fetch and parse a DOCTYPE at all (extSchema)
-
Ruling out external entities entirely (extEntities)
-
Ruling out non-local URIs as system identifiers (netEntities)
-
Setting a white-list of directories so system ids that point elsewhere
fail (entityDirs)
-
Allowing only (or not even) special character and/or unparsed entities
(charEntities).
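The first mitigation, limiting nesting depth and expanded length, can be sketched independently of any parser. The function below is a hypothetical illustration, not Thor's or Loki's code; its max_depth and max_expansion parameters only loosely mirror the MAXENTITYDEPTH and MAXEXPANSION options:

```python
import re

_ENTITY_REF = re.compile(r"&([A-Za-z_:][\w:.\-]*);")


class EntityLimitError(ValueError):
    """Raised when expansion exceeds a configured limit."""


def expand_entities(text, entities, max_depth=8, max_expansion=10_000):
    """Expand general entity references, refusing runaway nesting or growth."""

    def expand(s, depth):
        if depth > max_depth:
            raise EntityLimitError(f"entity nesting deeper than {max_depth}")
        pieces, pos = [], 0
        for m in _ENTITY_REF.finditer(s):
            pieces.append(s[pos:m.start()])
            pieces.append(expand(entities[m.group(1)], depth + 1))
            pos = m.end()
        pieces.append(s[pos:])
        result = "".join(pieces)
        # Checked at every level, so growth is stopped before the outermost join.
        if len(result) > max_expansion:
            raise EntityLimitError(f"expansion longer than {max_expansion} characters")
        return result

    return expand(text, 0)


laughs = {"a": "Boom", "b": "&a;" * 10, "c": "&b;" * 10,
          "d": "&c;" * 10, "e": "&d;" * 10}

print(expand_entities("<doc>&b;</doc>", laughs))
try:
    expand_entities("<doc>&e;</doc>", laughs)
except EntityLimitError as err:
    print("refused:", err)
```

Each added level multiplies the output by ten, so a length cap stops the attack after a bounded amount of work no matter how many levels are declared.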
Basic/practical parser options
Loki, like its namesake, is very good at shape-changing. It’s built as a
subclass of Thor. A few of its basic options have already been mentioned. For
example, case-folding is not XML but is fairly simple and ubiquitous — SGML
and HTML both use it (SGML via separate options for element vs. entity names). XML
does not offer it for users, but uses it internally (xmlns and
xml prefixes, xml:stylesheet, etc.). XSD has it as a
facet for datatypes. This all suggests that case-folding is pretty useful. It is
also nearly trivial to implement in a parser, and not disruptive to much else. So
Loki offers it as an option (actually several options: elementFold,
attributeFold,
entityFold,
idFold,
…).
However, case is not entirely trivial. I learned that some
HTML versions, SGML, and Unicode all define it very slightly differently. I bit the
bullet and Loki (via Runeheim) can use upper(), lower(),
or full Unicode case_fold() (or of course none, as in Thor). More
choices can be added. Unicode also defines several normalization forms [Unicode], and those are also available to deal with ligatures,
diacritics, halfwidth vs. fullwidth, etc. Likewise where specs differ in what counts
as whitespace.
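The difference between these choices is easy to see in plain Python. The following is a minimal sketch, not Runeheim's code: the function name and the option values other than "UPPER" are assumptions, and only standard-library calls are used.

```python
import unicodedata

# Hypothetical table of normalizers keyed by option value.
FOLDERS = {
    None: lambda s: s,       # no folding, as in plain XML
    "UPPER": str.upper,
    "LOWER": str.lower,
    "FOLD": str.casefold,    # full Unicode case folding
}


def normalize_name(name, fold=None, form=None):
    """Apply optional Unicode normalization, then optional case folding."""
    if form is not None:
        # form is one of "NFC", "NFD", "NFKC", "NFKD"
        name = unicodedata.normalize(form, name)
    return FOLDERS[fold](name)
```

For a name such as "Straße" the three folds give three different results ("STRASSE", "straße", "strasse"), which is why the choice has to be explicit. The compatibility forms additionally map ligatures and fullwidth letters to their plain equivalents.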
A somewhat related option (htmlNames
) enables all the usual
SGML/HTML named character entities in one step. This otherwise requires 252 explicit
ENTITY declarations (over 2,000 for HTML 5) – and a parser that handles them, and
the time to parse them. Even if you’re using a schema language other than DTD
your tools have to include DTD support to get this (or perhaps you can code a
callback for unknown entities), which seems a bit silly. Like case-folding, a
one-step way to enable these isn’t part of XML but it’s awfully
handy.
Setting entEncoding
enables an extra parameter for ENTITY
declarations (thus available in Loki only), to declare the encoding for an external
entity. This is especially useful because files often show up in unexpected
encodings. Since Python provides an encoding parameter when opening streams, this
is
straightforward:
<!ENTITY chap2 SYSTEM "chap2.xml" ENCODING cp1252>
As noted earlier all options are off by default (except a few for avoiding risky
entity fetches), in which case Loki has normal XML behavior just like Thor. But for
those inclined to rush in where parsers fear to read, there are two ways to turn
options on. First, construct a Loki parser and call its setOption()
method. Second, enable options from within a document. At the moment, the settings
go in the XML declaration (or perhaps I should call it the Loki declaration?). For
example:
<?xml version="1.1" encoding="utf-8" elementFold="UPPER" htmlNames="yes"?>
There are many other ways this could be done. An advantage of the present approach
is that it makes a document that uses extensions no longer be WF XML, while
remaining extremely close — which is precisely the point. With extensions in use a
document is only XML-ish, so XML parsers should reject the
document rather than interpreting any extended usage incorrectly. This syntax makes
them do so. It also makes the document reveal up front exactly which extensions it
uses (if one is used but not enabled, Loki raises a WF error). Extended documents
can be converted to regular XML: just load, then export them back out with Gleipnir.
That works for most options, though not for those supporting non-hierarchical
structures — those must wait for support in Yggdrasil.
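To make the in-document route concrete, here is a small sketch of how extension settings might be pulled out of such a declaration. It is an assumption about one possible approach, not Loki's actual code; the function name and regular expressions are hypothetical.

```python
import re

XML_DECL = re.compile(r"<\?xml\s+(.*?)\s*\?>", re.DOTALL)
PSEUDO_ATTR = re.compile(
    r"""([A-Za-z_][\w.-]*)\s*=\s*(?:"([^"]*)"|'([^']*)')""")
STANDARD = {"version", "encoding", "standalone"}


def declared_options(document):
    """Return the non-standard pseudo-attributes of the XML declaration."""
    match = XML_DECL.match(document)
    if match is None:
        return {}
    options = {}
    for name, double_quoted, single_quoted in PSEUDO_ATTR.findall(match.group(1)):
        if name not in STANDARD:
            options[name] = double_quoted or single_quoted
    return options
```

Each resulting name and value could then be handed to the parser's setOption() method, so both ways of enabling options end up in the same place.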
Weightier extensions
Being interested in more complex document structures (and well-versed in
Wall’s 3 Virtues of a Programmer
[Wall, Christiansen, and Orwant 2000]), I
also wanted a tool to help me examine related ideas, especially incremental support
for non-hierarchical and virtual markup structures. In contrast, the security
options discussed above have essentially no effect on parsing (so are in Thor), and
case-folding options have only a very narrow effect (so are in Loki only). Enabling
special syntax for non-hierarchical structures is a bigger deal (milestones are not
a special syntax per se, but involve a special
semantic of which an XML parser can be blissfully but unhelpfully unaware).
Nevertheless, non-hierarchy is not always as big a deal as it
may seem at first glance. A relatively simple example is olist
semantics (see Sperberg-McQueen and Huitfeldt 2000), where an element can be closed even if it
isn’t current. The following is neither WF XML nor MECS syntax, but in an
olist/overlap world it’s pretty clear what it must
mean:
<sec><para>Quoth the raven,</para>
<para><q>Nevermore.</para>
<para>Except maybe on Tuesdays.</q></para>
</sec>
Because XML (and therefore XML parsers) can’t do this it is typical to either:
-
coerce markup to syntactically-XML-compatible workarounds like
milestones, joins, standoffs, etc.; or
-
design an entirely new syntax.
Option 1 requires building separate checking even for many constraints XML parsers
already have most of the code for, such as ensuring that milestones are empty but
paired, that starts come before ends and are of corresponding type, that end
milestones lack attributes, and so on. The markup also has to be more complex: in
the case above nothing is required to indicate what the q end tag pairs
with, but with milestones they have to be co-indexed or have an interesting
algorithm to match them up (not to mention that using ID/IDREF attributes for co-indexing
does not involve quite the same concept as their use elsewhere).
Option 2 requires constructing a new parser (or modding an existing one, which is
harder if it is not in Python when the rest of your code is or if the new syntax is
not very close to XML). At that point people commonly create, implement, and debug
entirely new syntaxes and tools.
A key insight is that there can be syntax for things like olist that is only a
tiny step from XML. By choosing such a syntax and integrating it as an option within
a parser that also supports and is tested with regular XML, most of the constraints
just mentioned become trivial to add. For olist semantics, in the parser itself the
difference involves little but whether the open-element data is a stack or a list.
Since Loki is designed for extensibility it is very easy to add an option to remove
such an item from the stack
of open elements, return a SAX event for
it, and continue rather than issuing a WF error. Just add an elif
exactly where XML end tag processing would instead issue a WF
error:
if name == tagStack[-1]:
    tagStack.pop()
    yield SAXEvent.End, name
elif options.olist and name in tagStack:
    # Python lists have no rindex(); delete the innermost (last) match.
    del tagStack[len(tagStack) - 1 - tagStack[::-1].index(name)]
    yield SAXEvent.End, name
else:
    raise SyntaxError(
        f"Unexpected end tag for '{name}' (context: {tagStack}).")
The parser change is surgical, the syntax and semantics are predictable given
olist structural semantics, and everything else works as it did. When the olist
option is not enabled the behavior is exactly that of XML. Put another way, Loki
supports olist with a 3-line addition beyond Thor.
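The same logic can be tried in isolation. The sketch below is a stand-alone toy, not Loki itself: it takes a list of tag tokens (a plain name for a start tag, "/name" for an end tag) rather than real markup, and the function name and event tuples are invented for the illustration.

```python
def tag_events(tags, olist=False):
    """Yield ("start" | "end", name) events for a list of tag tokens."""
    open_elements = []
    for tag in tags:
        if not tag.startswith("/"):
            open_elements.append(tag)
            yield "start", tag
            continue
        name = tag[1:]
        if open_elements and name == open_elements[-1]:
            open_elements.pop()
        elif olist and name in open_elements:
            # Close the innermost open element of that name,
            # leaving the elements opened after it still open.
            last = len(open_elements) - 1 - open_elements[::-1].index(name)
            del open_elements[last]
        else:
            raise SyntaxError(
                f"Unexpected end tag for '{name}' (context: {open_elements}).")
        yield "end", name
```

Feeding it the raven example (sec, para, /para, para, q, /para, para, /q, /para, /sec) succeeds with olist=True and raises at the second /para otherwise.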
Yggdrasil cannot yet represent such structures directly, but a model that can
(such as LMNL [Piez 2008, Piez 2012] or TagML [Bleeker, Buitendijk, and Haentjens Dekker 2020] or MECS [Sperberg-McQueen and Huitfeldt 2008]) can source the necessary
information via Loki if desired.
Suspend and resume markup work similarly when the suspend
option
is enabled in Loki. It generates distinct SAX events, and a suspended element stays
in the open-element stack (flagged as suspended when it is). The markup takes the
form below. As with olist this requires very little code in Loki, and seems (at
least to me) a fairly intuitive increment beyond XML syntax (an error is raised in
cases like suspending an already suspended or closed element):
<q>....<-q>...<+q>...</q>
In this and the prior case there could in principle be multiple q
elements open and one might wish to operate on one other than the innermost. For
this reason an endTagId
option enables ID attributes to be repeated
on suspend, resume, and/or end tags to co-identify the correct element if more than
one of the same type is open (if not specified, the most recently encountered is
closed):
<p><q id="G">Say unto my people,
<q id="J">....<-q id="G">...<+q id="G">...</q id="G">...</q id="J">...</p>
This feature can also be used with end tags in general, to co-identify the
boundaries of very large elements for easier readability and more specific error
reporting. If an end tag’s name matches the current element but it has an ID
attribute that doesn’t, an error is raised that can specifically say where
the intended (that is, matching) start tag was (or that it isn’t there at
all). No other attributes are allowed in suspend, resume, or end tags. The
implementation is a near-trivial re-use of the attribute-parsing code already there
for start tags.
Also related to helping keep track of large element scopes, I am considering
experimenting with Loki options reminiscent of SGML RANK. Many people have
struggled to tag recursive structures correctly when the tags at all levels are
identical (say, div/div/div instead of div1/div2/div3 or chap/sec/subsec),
especially when converting from applications or data that have no real sections, but only headings (say, h1/h2/h3).
I find it much easier to debug when something explicit relates the starts to their
corresponding ends. Many schemas do this by providing different element types for
each level; but it is easy to support the general notion at the syntactic rather
than lexical level. Some obvious syntax possibilities are shown below; another is
simply to permit a numeric suffix, which if present must match the
depth:
<div@3>
<div x:level=3>
<div><?level 3?>
In semantic (not syntactic) homage to some LMNL concerns, simultaneous opens and
closes are supported via the simultaneous
option, using markup as
shown below. This is one possible way to represent more than one element with the
identical scope:
<b|i>.....</b|/i>
For the b/i case it is of course easier to create a portmanteau bi element; but
the problem is far more general (e.g., <foreign+term>).
Milestone markup needs no extensions for parsing per
se, though the DOM and validation implications are more complex. In
the short term Yggdrasil will likely add virtual element
support as
mentioned earlier, leveraging Schemera syntax for declaring the relevant elements
and attributes.
Other options
Loki makes options easy to experiment with. Several are available that recall SGML
features, but never break the XML principle that the parser can produce the correct
result with or without a schema. They also remain close to XML syntax for reasons
already discussed:
</> — omit the name when closing the innermost element
(emptyEnd
).
Omit end tag(s) immediately before another end tag (omitEnd,
about which I am ambivalent) or at EOF (omitAtEOF
).
<|> is not from SGML, but potentially useful. This closes the
innermost element and reopens a new instance of it (the name is optional, but
checked if present). Attributes (other than an ID) get the same values, but can be
changed by specifying them after the |
. This restart
option is reminiscent of MediaWiki tables, though it’s not quite as short.
<div id=Intro> — omit quotes when an attribute value is a name or
number token (unQuotedAttribute
).
A booleanAttribute
option enables Boolean-valued attributes (or
any, really) to be set to 1 or 0 as shown below. This addresses the
don’t repeat yourself
awkwardness of things like
border="border" (not to mention SGML’s rule that different
attributes cannot share enum values):
<td +border -underline...>
There are also minor options to work around autocorrect problems sometimes
introduced by uncooperative environments. For example, the parser can be instructed
to recognize comments with different delimiters: em dash as equivalent to 2 hyphens
(emComments
), or <#...#>
(poundComments
) to avoid the issue altogether (nestable comments
are on the possible
list). For similar reasons
curlyQuote
permits various fancy quotes around attributes.
Such options can be helpful to XML’s paradigmatic desperate Perl
hacker
who has to make some documents work right
now (see the prescient and still valuable Bray 2010).
Note that none of these cases change existing XML syntax to have a new meaning. They
intentionally use syntax that is unused in XML (that is, syntax that is not WF). A
list of parser options is provided in an appendix.
Schemera and Loki
Some Loki options affect the schema directly or indirectly. One helps with the
slight inconvenience of declaring defaults and types for attributes in the subset.
People seem rarely to do this to get defaulting (unless they are providing a full
DTD, which some parsers do not even support):
<!DOCTYPE mySchema SYSTEM "" [
<!ATTLIST p id ID #IMPLIED
class NMTOKENS "regular">
]>
Loki can be configured to allow setting a default within the document, on the
first use on a given element type (bangAttribute
):
<p foo!=bar>
With bangAttrType
a datatype can also be given (this would use
colon in homage to Python type-hints, but that would conflict with
namespacing):
<p foo!NMTOKEN=bar>
Assuming this is the first use of attribute foo on a p, it has the same effect
as:
<!ATTLIST p foo NMTOKEN "bar">
There are other Loki/Schemera options to slightly extend markup declarations. For
example, the SGML feature of declaring multiple element types in a single
declaration (groupDcl
).
Parser APIs
Both Thor and Loki have APIs almost identical to expat’s (as viewed from
Python). Handlers are assigned in the same way, with the same names and arguments.
However, there are a few options here too.
For example, each attribute can be returned as a separate SAX event immediately
following its start tag event (saxAttribute
). This has the advantage
of making every event have a fixed number of arguments rather than start tag events
having a potentially unbounded list of alternating names and values as some parsers
produce (though not expat for Python). It also gives room to return additional
information such as whether the attribute was explicit or defaulted, or the
pre-normalized and normalized forms. The default is to just do what expat does for
Python, which is to return a Python dictionary of the attributes.
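The two shapes can be compared using expat itself. This sketch parses with the standard xml.parsers.expat module, which hands the start handler a dictionary, and flattens each start tag into one event per attribute. The function name and event tuples are invented for the illustration and are not Ragnarok's API.

```python
import xml.parsers.expat


def attribute_events(xml_text):
    """Parse with expat and emit a separate event for each attribute."""
    events = []

    def start(name, attrs):
        # expat for Python passes the attributes as a dict.
        events.append(("start", name))
        for attr_name, value in attrs.items():
            events.append(("attribute", attr_name, value))

    def end(name):
        events.append(("end", name))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.Parse(xml_text, True)
    return events
```

Every event type now has a fixed shape, and the attribute tuple has room for extra fields such as a defaulted flag.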
A slight difference from expat is that neither Loki nor Thor breaks text into
separate SAX events at every newline and character entity. However, that behavior
can be achieved in either by setting option expatBreaks.
Although expat always hands attribute values to handlers as strings, with
attrCast
Loki (with Schemera’s help) can coerce them to
their declared types — so a program can have them ready to go as Python ints,
floats, datetimes; or as provided custom types that map directly to XSD’s
builtin types.
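A minimal sketch of such coercion follows, assuming the declared types are already available as a dictionary from attribute name to type name. The table of casts, the type names chosen, and the function name are all hypothetical; Schemera's real mapping is presumably richer.

```python
from datetime import datetime

# Hypothetical mapping from declared type names to Python constructors.
CASTS = {
    "int": int,
    "float": float,
    "boolean": lambda value: value in ("1", "true"),
    "dateTime": datetime.fromisoformat,
    "NMTOKENS": str.split,
}


def cast_attributes(attrs, declared):
    """Return a copy of attrs with values coerced to their declared types.

    Attributes with no declaration (or an unknown type) stay strings.
    """
    return {name: CASTS.get(declared.get(name), str)(value)
            for name, value in attrs.items()}
```

A handler would then receive, for example, a float for width and a list of tokens for class instead of raw strings.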
Special character handling
The inconvenience that all special-character entities must be declared has been
mentioned briefly. Sets such as those defined in Annex D of SGML and in HTML are
useful and widely known, but activating them in the standard ways is tedious. Such
sets can be activated in Loki merely by setting an option, either via the
API:
myParser.setOption("htmlNames", 1)
or via the declaration:
<?xml version="1.1" encoding="utf-8" htmlNames="1" ?>
Perl 6 (aka Raku) introduced the interesting ability to specify Unicode characters
by their official names [Raku]. Loki adds a similar capability.
After setting the
unicodeNames
option references like these are
supported:
•
&GREEK.SMALL.LETTER.OMICRON.WITH.TONOS;
Case is ignored because Unicode character names do not mix case (this will likely
be made subject to the entityFold option). Any of hyphen, underscore, or dot may be
used instead of spaces (so may space itself when looking characters up via the
API).
Some names are long. However, it turns out that abbreviating all but the last
token down to the first 4 characters leads to only 11 collisions in the Unicode BMP.
So abbreviations down as far as that also work, as do intermediate
abbreviations:
&GREEK.SMAL.LETT.OMIC.WITH.TONOS;
In the rare case of a collision (which Loki will report), lengthen anywhere to
disambiguate. For example, EQUIVALENT TO (U+0224d) and EQUIANGULAR TO (U+0225a)
collide when fully abbreviated, but giving at least the first 5 characters of the
first token succeeds: EQUIV.TO vs. EQUIA.TO. Further abbreviations are feasible,
such as SMALL LETTER OMICRON to LC O, or special-casing
a few frequent/long substrings like CJK UNIFIED IDEOGRAPH to CJK and simply omitting
LETTER; but you’ve already memorized the current rule.
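The abbreviation rule can be approximated with the standard unicodedata module. The sketch below is an assumption about how such a lookup could work, not Loki's implementation: it indexes the names of BMP characters once, then accepts a reference when every token but the last is either given in full or cut to a prefix of at least four characters. All names in the code are hypothetical.

```python
import re
import unicodedata

SEPARATORS = re.compile(r"[ ._-]+")


def _tokens(name):
    """Split a name or reference on space, dot, underscore, or hyphen."""
    return SEPARATORS.split(name.upper())


# Index BMP character names once; code points without a name are skipped.
BMP_NAMES = []
for cp in range(0x10000):
    char_name = unicodedata.name(chr(cp), "")
    if char_name:
        BMP_NAMES.append((cp, _tokens(char_name)))


def lookup_abbreviated(reference):
    """Return all BMP code points matching a possibly abbreviated name."""
    ref = _tokens(reference.strip("&;"))
    hits = []
    for cp, toks in BMP_NAMES:
        if len(toks) != len(ref) or toks[-1] != ref[-1]:
            continue
        if all(t == r or (len(r) >= 4 and t.startswith(r))
               for t, r in zip(toks[:-1], ref[:-1])):
            hits.append(cp)
    return hits
```

A result with more than one code point is a collision of the kind described above, which the caller can report; lengthening any token narrows the match.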
I sometimes find it inconvenient that XML defines no special-character mechanism
that applies inside PIs or comments. The piEscapes
option makes Loki
recognize and replace character references inside PIs.
Somewhat related, a piAttribute
option makes Loki parse PI data as
if it were an attribute list and return SAX events for PIs with the corresponding
arguments (similar to start tag events). This is an obviously useful convention (XML
employs it for the XML declaration, though it doesn’t offer the same thing
to users). With this and the prior option users can avoid re-implementing some
wheels. And finally, piAttrDcl
enables ATTLIST declarations to
constrain those PI quasi-attributes, merely by giving ?
plus a PI
target name where an element name would normally go (again applying Loki’s
idea of XML-adjacent syntax, and trivially implemented by small adjustments to
already-existing code):
<!ATTLIST ?ah hyphenate boolean #IMPLIED
kern float "1.0">
...
<?ah hyphenate="1" kern="0.8"?>
Those accustomed to backslashed hexadecimal character codes can choose to enable a
backslash
option to recognize \n, \\, \xFF, \uFFFF, \U000FFFFF,
etc. (though not \u{FFF} or \u{BULLET}, yet). As one might expect,
\< and \& are then ok.
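A decoder for the escapes listed above fits in a few lines. This is a sketch under assumptions: the function name and the exact set of single-character escapes are invented here, and a real parser would also have to make sure a decoded < or & is treated as data rather than markup.

```python
import re

# Single-character escapes assumed for this sketch.
SIMPLE = {"n": "\n", "t": "\t", "r": "\r", "\\": "\\", "<": "<", "&": "&"}

ESCAPE = re.compile(
    r"\\(?:x([0-9A-Fa-f]{2})|u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8})|(.))",
    re.DOTALL)


def unbackslash(text):
    """Replace backslash escapes in text with the characters they denote."""
    def replace(match):
        hex_digits = match.group(1) or match.group(2) or match.group(3)
        if hex_digits:
            return chr(int(hex_digits, 16))
        char = match.group(4)
        if char not in SIMPLE:
            raise ValueError(f"unknown escape: \\{char}")
        return SIMPLE[char]

    return ESCAPE.sub(replace, text)
```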
Runeheim (character set management)
Runeheim is a separate package that is the realm of the XML orthography: what
characters are what and how names are formed. Many programs have slightly inaccurate
lists of name- and name-start characters; some support only ASCII. Runeheim provides
the
correct lists in various forms, as well as full regexes to match things like QNames,
and
functions like isXmlQName().
The lists of (for example) name and name-start characters are built automatically
from
the literal hex ranges given in the XML Recommendation (with a choice of 1.0 or 1.1).
This makes them readable, less error-prone, and easy to update as Unicode grows.
Runeheim can also be used in isolation to provide name testing, normalization, etc.
to
any Python callers.
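Building the character classes from the literal ranges might look like the sketch below. The ranges are the NameStartChar and NameChar productions of XML 1.0 (Fifth Edition); the helper and function names are hypothetical and are not Runeheim's.

```python
import re

# NameStartChar ranges from the XML 1.0 (Fifth Edition) grammar.
NAME_START = [
    (0x3A, 0x3A), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF), (0x370, 0x37D),
    (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]
# Additional characters allowed after the first (NameChar).
NAME_EXTRA = [
    (0x2D, 0x2E), (0x30, 0x39), (0xB7, 0xB7),
    (0x300, 0x36F), (0x203F, 0x2040),
]


def char_class(ranges):
    """Turn (low, high) code point pairs into the body of a regex class."""
    return "".join(f"\\U{lo:08X}-\\U{hi:08X}" for lo, hi in ranges)


NAME_RE = re.compile(
    f"[{char_class(NAME_START)}][{char_class(NAME_START + NAME_EXTRA)}]*")


def is_xml_name(text):
    """True if text matches the XML Name production."""
    return NAME_RE.fullmatch(text) is not None
```

Keeping the ranges as data means the tables can be compared line by line against the Recommendation.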
There are centralized routines for checking various kinds of names, as well as for
escaping and unescaping text for each context. In contrast, many APIs only provide
escaping as for text content, and miss double hyphens in comments, ?> in
PIs, etc.; neglect ]]>; or overdo >. The Ragnarok
serializers (see next) can be configured to map to HTML or Unicode named characters, or
to
decimal or hex numerics of controllable case and width. Runeheim also provides types
such as NMTOKEN_t for use as type-hints, making modern tools such as
linters and Python compilers more effective and the code more readable.
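The point about context can be illustrated with four small functions. These are a sketch, not Runeheim's routines: the names are invented, and because XML defines no escape syntax inside comments or PIs, the policies chosen for those two contexts (inserting a space, writing ?&gt;) are only assumed conventions.

```python
def escape_text(s):
    """Element content: escape &, <, and the > of a literal ]]>."""
    return (s.replace("&", "&amp;")
             .replace("<", "&lt;")
             .replace("]]>", "]]&gt;"))


def escape_attribute(s, quote='"'):
    """Attribute value: escape &, <, and whichever quote delimits it."""
    s = s.replace("&", "&amp;").replace("<", "&lt;")
    return s.replace(quote, "&quot;" if quote == '"' else "&apos;")


def escape_comment(s):
    """Comment: break up "--" and avoid a trailing "-" (assumed policy)."""
    while "--" in s:
        s = s.replace("--", "- -")
    return s + " " if s.endswith("-") else s


def escape_pi(s):
    """PI data: break up "?>" (assumed convention, not defined by XML)."""
    return s.replace("?>", "?&gt;")
```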
Gleipnir and Bifrost (serializers)
Within Yggdrasil, Ragnarok provides a replacement for minidom’s
toprettyxml(), called Gleipnir
after the magical binding
used to restrain the great wolf Fenrir. It does the same things as regular toprettyxml,
and is what you get if you call that in Yggdrasil. However, it can also take a
FormatOptions object that encapsulates a wide range of options including how to break
around tags, attributes, and text; text wrapping; indentation; use of CDATA (Yggdrasil
can track which text nodes came in as CDATA sections); and how to do escaping (HTML
names, decimal, or hex; padding width; case of hex; ...). A list of inline-ish tags
can
also be set, preventing line-breaks around them. Pre-made default and canonical
FormatOptions objects are provided. Gleipnir does not generate XML that uses any of
Loki’s syntax extensions; just normal fully-conforming XML.
Ragnarok also provides another serialization, called Bifrost
after the
bridge connecting Asgard and Midgard. It serializes an entire XML document to valid
JSON. The JSON conventions applied are called JBook,
and a small sample
is in an appendix. Bifrost can also read the resulting JSON back to produce the same
DOM. The conversion is complete — it covers not just elements and attributes, but
marked
sections, comments, PIs, namespaces, optionally the DTD, and so on. JBook is similar
to
the form presented in [DeRose 2014a] but more fully developed. Text nodes
become JSON strings, and other nodes become JSON lists. The first item in each such
list
is a dict that includes the nodeName and attributes (or other data for non-Elements),
and the rest of the list consists of the children.
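The list-with-leading-dict shape makes a recursive walk very short. The sketch below converts such a structure back to XML for elements and text only. It assumes, from the sample in Appendix B, that the node name is stored under the "~" key and that names and keys beginning with "#" denote non-element nodes and metadata; it simply skips those, so it is far from Bifrost's full coverage.

```python
from xml.sax.saxutils import escape, quoteattr


def jbook_to_xml(node):
    """Serialize a JBook-style node (elements and text only) as XML."""
    if isinstance(node, str):
        return escape(node)
    head, *children = node
    name = head["~"]
    if name.startswith("#"):
        return ""          # comments, PIs, CDATA, etc. are not handled here
    attrs = "".join(f" {key}={quoteattr(value)}"
                    for key, value in head.items()
                    if key != "~" and not key.startswith("#"))
    body = "".join(jbook_to_xml(child) for child in children)
    if body:
        return f"<{name}{attrs}>{body}</{name}>"
    return f"<{name}{attrs}/>"
```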
The first round-trip (DOM → JBook → DOM or JBook → DOM →
JBook) can change details such as attribute quoting, type of escaping for special
characters, etc. After that, however, round-tripping all day makes no further changes.
In other words, the conversion loop is idempotent. I spent some time searching for
a
prior XML → JSON conversion that covered all constructs, and could find none
(much less any that were also idempotent, much less readable).
Testing
Ragnarok includes a large Python unittest suite which automates testing for many
cases. Coverage is about 75% so far. The code is thoroughly hinted and linted. Every
Node subclass has a checkNode() method which can test (recursively or not,
at option) many invariants such as the nodeType vs. class, consistency of child and
sibling information, well-formedness of names, etc.
Testing includes very long Unicode names, wide and deep documents, and even enormous
numbers of leading zeros on numeric character references (it turns out xmllint requires
a special option for depths over 256 or names over 64K).
Thor, Loki, and expat are tested with both minidom and Yggdrasil on top, just as
Yggdrasil is tested with Thor, Loki, and expat below. This ensures that they are closely
compatible, although there is a lot of additional testing specific to extensions.
The Gleipnir serializer will likely learn some of the parser’s options, if only
to help feed the testing process.
Futures
Work is underway on a model validator, Heimdall, that leverages regex processors.
To
accomplish that, each distinct element type is assigned a private-use Unicode character.
Those characters are substituted for the element type names in content models, while
commas and whitespace are dropped. For example (but using Greek instead of private
use
characters for readability), the content model in
<!ELEMENT chapter (title, para+, (note|section|para)*)>
becomes something like
(αβ+(γ|∂|β)*)
If the sequence of child types for a given element instance undergoes the same
mapping, a normal regex match against the transformed model achieves the correct
validation result. Loki/Schemera’s {} syntax for XSD-like minOccurs
and maxOccurs within content models also works fine this way. #PCDATA and
actual text nodes are set aside except for a single test for whether they’re
permitted (for full SGML support, a bit more would be needed). Attributes already
can be
checked via Schemera.
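The mapping described above can be sketched with the standard re module. This is an illustration under assumptions rather than Heimdall's code: the function names are invented, element names are limited to simple ASCII names, and #PCDATA and the {} repetition syntax are left out.

```python
import re


def compile_model(model, codes):
    """Translate a DTD content model into a regex over one-character codes.

    codes maps element type names to private-use characters and is
    extended as new names are seen.
    """
    def code_for(match):
        name = match.group(0)
        if name not in codes:
            codes[name] = chr(0xE000 + len(codes))   # next private-use char
        return codes[name]

    pattern = re.sub(r"[A-Za-z_][\w.-]*", code_for, model)
    pattern = re.sub(r"[,\s]+", "", pattern)         # drop commas and spaces
    return re.compile(pattern)


def is_valid(children, model_re, codes):
    """Match a sequence of child element type names against the model."""
    # An undeclared element type maps to a character no model contains.
    sequence = "".join(codes.get(name, "\uFFFF") for name in children)
    return model_re.fullmatch(sequence) is not None
```

For the chapter model above, a child sequence of title, para, note, para matches, while a chapter with only a title does not, because para+ requires at least one paragraph.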
This suffices for validating a DOM structure in hand. However, partial
validation is needed in cases like editing (and for much of the DOM 3 Validation
module): Perhaps an element is valid so far, but may (or even must) have more. For
example, a model like
<!ELEMENT dl (dt, dd)*>
is a partial but not a complete match after each dt. After each dd it’s both a
partial and a complete match (well, during parsing one must wait until after
</dl>). The built-in regex processor in Python does not support
partial matching. However, a common 3rd-party regex processor for Python
(regex) does, and seems to work fine
for this.
Also in progress is a binary format for nodes and documents, called
Sleipnir
after Odin’s 8-legged horse. Nodes are represented as
8 fixed-size fields that encapsulate the information needed to support DOM operations
while being trivial to aggregate and access. It is loosely modeled on [DeRose et al. 1996], but supports Unicode, namespaces, and dynamic
modification.
Finally, I’m looking into adding a new Node subclass specifically for binary objects.
This would enable images, sound, CAD models, etc. to live directly in Yggdrasil, much
like Blobs
live in databases. They can be added via the API, via
referring to unparsed (NDATA) entities, or perhaps also directly via a special marked
section type such as <![BASE64[…]]>.
ID handling
Yggdrasil indexes IDs, with or without case and similar normalization. They do not
(yet) update individually if the tree is modified (in fairness, minidom’s
don’t either). The index can, however, be discarded and rebuilt without
rebuilding the whole DOM.
There is a common problem of specifying just which attributes are IDs. XML
thankfully reserves xml:id but many schemas assign their own. Such
cases cannot be detected reliably without a schema. Ragnarok also provides for
configuring what is recognized (and indexed) as IDs via the API, based on the name
and namespace of the attribute and of the element it’s on. Multiple
definitions and wildcards are possible.
Other ID-like semantics will likely be added. For example:
-
an option to allow namespace prefixes on ID values (NAMESPACEID
)
-
options and distinct attribute declared types for milestones like
<q_start/>...<q_end/> (COID
)
-
Trojan milestones where start and end milestones use different attribute names
(STARTID
and ENDID
)
-
suspend/resume co-indexing (the last two plus SUSPENDID
and
RESUMEID
)
-
unique values constructed by accumulating ID values from ancestors, like hierarchical
section-numbering (STACKID
)
-
SQL-like compound IDs constructed by evaluating an XPath or similar expression
(COMPOUNDID
)
The types enable checking associated structural semantics such as that starts come
before ends, milestone elements are empty, etc. The declarations must differ
slightly to distinguish when the role is flagged by Loki syntax, element type,
attribute name, or potentially other mechanisms.
Element declarations will have ways to declare whether they are suspendable,
milestonable, olist-endable, etc. (though some of these may merely be inferred from
what attribute types they declare and use).
Like attribute defaults and named special characters, IDs introduce semantics that
require a schema to understand (though of course not to parse). They thus bump up
against XML’s general principle that documents are correctly parsable without
a DTD. This can be avoided by always using xml:id, but that’s
slightly unwieldy.
XML inherits SGML’s restriction that there can be only one ID attribute for
any given element type. Given that uniqueness it would be feasible to support a special syntax
instead of a special attribute type. One example could be suffixing IDs to the
element type: <p#p31> vs. <p xml:id="p31">. Such an
IDSuffix
option would save space and perhaps improve readability
while avoiding the need for explicit declarations.
A longer view
I am working on support for XPointer ranges [Clark and DeRose 2002] and on
DOM-integrated support for overlap features and potentially XCONCUR [Schonefeld 2008]. I and others have long lamented the lack of annotation,
versioning, and collaborative editing capabilities both on the Web and in XML
generally (see DeRose 2014b), for which overlap support is
prerequisite. By promoting such structures into a separate but coordinated structure
from the DOM itself, some of these capabilities might be brought within closer reach
(see also DeRose and van Dam 1999):
-
Switching between document views (including simultaneous/parallel/variorum
views)
-
Switching non-hierarchical markup between various in-line and out-of-line
representations
-
Having metadata on individual edits, such as voting, acceptance, etc., and
a well-defined process for making a true new version
-
Maintaining connections across such versions so one can answer questions
like Where did this sentence go?
-
Making such annotation modular/orthogonal with respect to the main
document’s schema(s)
I have also done some work to support mapping language and/or orthography codes to
the Unicode ranges they use (an option to be called langChecking).
This will check that text content comports with the xml:lang value in
effect in its context. Short of that, there are already options to prohibit C0
control characters (other than cr/lf/tab), C1 control characters (which commonly
arise via incorrect character-set handling), and/or private use characters.
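The character-class checks are simple to express with regular expressions. The sketch below uses the option names from Appendix D as keyword arguments, but the function name, the report format, and the exact ranges chosen for private use are assumptions made for the illustration.

```python
import re

CHECKS = {
    # C0 controls other than tab, LF, and CR
    "noC0": ("C0 control", re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F]")),
    "noC1": ("C1 control", re.compile(r"[\x80-\x9F]")),
    "noPrivateUse": ("private use", re.compile(
        r"[\uE000-\uF8FF\U000F0000-\U000FFFFD\U00100000-\U0010FFFD]")),
}


def check_text(text, **options):
    """Return sorted (offset, description) pairs for prohibited characters."""
    problems = []
    for option, (label, pattern) in CHECKS.items():
        if options.get(option):
            for match in pattern.finditer(text):
                problems.append(
                    (match.start(), f"{label} U+{ord(match.group()):04X}"))
    return sorted(problems)
```

A stray U+0085 or U+0092 in text that was really cp1252 is the typical case the noC1 check would catch.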
Why try Ragnarok?
-
Yggdrasil is almost twice as fast as minidom (especially for
building).
-
Yggdrasil is more up-to-date than minidom.
-
Ragnarok is in pure Python, thus relatively easy to modify, extend, or
(heaven forbid!) debug.
-
The API includes lots of Pythonic conveniences.
-
It’s modular, so parts can be used separately for other
things.
-
Loki can be easily configured with useful extensions such as character-set
flexibility and overlap support.
Why not try it?
-
There are still bugs (probably mostly involving namespaces and parameter
entities).
-
Some pieces (like Heimdall the validator, Sleipnir the binary DOM, and
particular extensions) are not complete.
-
You don’t like experimenting.
-
You don’t use Python.
Availability
The entire suite including the test suite will shortly be available on my github
page at https://github.com/sderose/Ragnarok/tree/master/. Error reports,
suggestions, etc. are welcome. Pythonistai (or perhaps Níðhǫggsfólk?) are also
welcome to contribute.
Appendix A. Examples of extended syntax
<?xml version="1.1" encoding="utf-8"?>
<!DOCTYPE article SYSTEM "c:\schemas\balisage.dtd"
"/home/tg/balisage.dtd" NDATA DTD [
<!ELEMENT article - - (head, div{1,10})>
<!ENTITY % soup "i|b|tt|strike">
<!ATTLIST (%soup;) #ANY CDATA #IMPLIED>
<!ELEMENT div (#PCDATA | %soup;)*>
<!ATTLIST * id ID #IMPLIED>
<!ATTLIST p width float #IMPLIED
sId COID #IMPLIED
eID COID #IMPLIED>
<!ATTLIST ?troff width int16 "60">
]>
<!— See that emdash? —>
<article +conf xml:lang=en:us -final class=“x y”>
<div>
<table +border>
<tr><td just!str=left>DTD<|>No<|>Yes<|>3.1
</table>
<p><q>Never<-q>, said the(\U0001f42) <+q>, really.</Q>
<| width=Inf>&therefore; &nbsp;
<b|i>No more</b|/i>.</>
<![IGNORE[ Why not? ]]>
<?troff width="12"?>
Appendix B. Example of JBook from Bifrost
[ { "~":"JBook", "#jbookversion":"0.9",
"#xmlversion":"1.0", "#encoding":"utf-8", "#standalone":"yes",
"#doctype":"html", "#systemId":"http://w3.org/html" },
[ { "~":"html",
"xmlns:html":"http://www.w3.org/1999/xhtml" },
[ { "~":"head" },
[ { "~":"title" }, "My document" ]
],
[ { "~":"body" },
[ { "~":"p", "id":"stuff", "class":"lead" },
"This is a ", [ { "~":"i" }, "very" ], " short document.",
[ { "~":"#cdata" }, "This is some <real> & <legit/> cdata." ]
],
[ { "~":"#pi", "#target":"troff" }, ".ss;.b" ],
[ { "~":"#comment" }, "This is <real> commentary." ],
[ { "~":"hr" } ]
]
]
]
Appendix C. Dominµs/Yggdrasil vs. regular DOM
Table II
| DOM |
Yggdrasil |
| x.nodeType == Node.ELEMENT_NODE |
x.isElement |
if n >= len(x.childNodes):
    x.appendChild(newNode)
else:
    x.insertBefore(newNode, x.childNodes[n])
|
x.insert(n, newNode)
|
found = 0
for ch in x.childNodes:
    if ch.nodeName == "p":
        if found == 3: return ch
        found += 1
return None
|
return x["p":3]
|
newEl = doc.createElement("p")
for a, v in someAttrs.items():
    newEl.setAttribute(a, v)
|
newEl = doc.createElement("p", someAttrs)
|
newEl = doc.createElement("p")
s1 = doc.createElement("speech")
s1.setAttribute("spkr", "Essex")
s1.appendChild(doc.createTextNode("Goodbye"))
newEl.appendChild(s1)
if len(x.childNodes) > 0:
    x.insertBefore(newEl, x.childNodes[0])
else:
    x.appendChild(newEl)
|
x.insertAdjacentXML(RelPosition.begin,
"""<p><speech spkr="Essex">Goodbye</speech></p>""")
|
if (x.compareDocumentPosition(y) < 0):...
# only in DOM 3
|
if x << y:...
|
if x.isEqualNode(y):...
|
if x == y:...
|
Appendix D. Loki and Schemera options
Options whose names are in [brackets] are not yet implemented.
Table III
| Name | Default | Description |
| Security and limits |
| MAXEXPANSION | 1<<20 | Total maximum entity length |
| MAXENTITYDEPTH | 16 | Maximum entity nesting depth |
| charEntities | True | Allow special character entities? |
| extEntities | True | Allow external entities? |
| [netEntities] | False | Allow off-localhost entities? |
| entityDirs | None | Permitted dirs to see? (None means any) |
| extSchema | True | Fetch and process external schema? |
| Case and Unicode |
| elementFold | None | Normalizer for element names |
| attributeFold | None | ... for attribute names |
| entityFold | None | ... for entity names |
| keywordFold | None | ... for XML reserved words (CDATA, etc.) |
| [xsdFold] | None | ... for XSD values (true, INF, etc.) |
| [wsDef] | XML | Definition of whitespace (vs HTML or WhatWG) |
| noC0 | False | Prohibit C0 control chars like XML 1.0 |
| noC1 | False | Prohibit C1 control chars |
| noPrivateUse | False | Prohibit Private Use chars |
| langChecking | False | Check chars used vs. xml:lang |
| Schemas |
| schemaType | False | <!DOCTYPE foo SYSTEM "" NDATA x> |
| fragComments | False | <!ELEMENT foo -- really? -- (p|q)*> |
| Elements |
| groupDcl | False | <!ELEMENT (x|y|z)...> |
| oflag | False | <!ELEMENT - O para...> |
| sgmlWord | False | Allow CDATA RCDATA etc. |
| [mixel] | False | Allow declared content ANYELEMENT |
| mixins | False | Allow SGML-ish inclusion/exclusion dcls |
| repBrace | False | Allow {min,max} in content models |
| emptyEnd | False | Allow </> |
| omitEnd | False | Can omit end tag before another |
| omitAtEOF | False | Can omit end tags at EOF |
| restart | False | <|> to close and reopen current element |
| restartName | False | <|name> to close and reopen multiple elements |
| [rankAnnots] | None | Permit "@" and a number on tags (cf SGML RANK) |
| [endTagID] | False | Permit repeat of ID on end tags |
| [entEncoding] | False | Permit ENCODING parameter on external entity declarations |
| Beyond hierarchy |
| [multiTag] | False | <div/title>...</title/div> |
| simultaneous | False | <b|i> </i|/b> |
| suspend | False | <x>...<-x>...<+x>...</x> |
| olist | False | Allow closing non-current elements |
| Attributes |
| saxAttribute | False | Generate separate SAX event per attribute |
| globalAttribute | False | <!ATTLIST * ...> |
| anyAttribute | False | <!ATTLIST para #ANY...> |
| xsdType | False | Allow XSD attribute types in DTD |
| [xsdPlural] | False | XSD types can be pluralized |
| [attrCast] | False | Return attrs in declared types |
| specialFloat | False | Nan, Inf, etc. |
| unQuotedAttribute | False | <p x=foo> |
| noAttributeNorm | False | Suppress whitespace normalization of undeclared attributes |
| curlyQuote | False | Allow fancy quotes for qlits |
| booleanAttribute | False | <x +border -foo> |
| booleanIsName | False | booleanAttributes return name/"", not "1"/"0" |
| bangAttribute | False | != on first use to set default |
| [bangAttributeType] | False | !type= to set datatype and default |
| [IDSuffix] | False | #ID can be appended to start tag name |
| [COID] | False | Add attribute type to co-index milestones |
| [NAMESPACEDID] | False | IDs can have ns prefix |
| [STACKID] | False | ID is accumulated from ancestors |
| Validation (beyond WF) |
| useDTD | False | Fetch and parse external DTD if available |
| valElementNames | False | Elements must be declared |
| [valModels] | False | Full content model checking |
| valAttributeNames | False | Attributes must be declared |
| valAttributeTypes | False | Attributes must match dcl datatype |
| Entities and special characters |
| htmlNames | False | Enable HTML/Annex D named character refs |
| unicodeNames | False | Enable Raku-like Unicode named character refs |
| multiPath | False | Allow multiple SYSTEM IDs |
| [multiSDATA] | False | <!SDATA nbsp 160 z 0x9D...> |
| backslash | False | \n \xff \uffff (not yet \\x{}) |
| Other |
| saveMarkup | False | Preserve source markup form on output (where practical) |
| expatBreaks | False | Break at \n and entities like expat |
| emComments | False | Treat emdash as -- for comments |
| poundComments | False | <#...#> alternative for comments |
| [piEscapes] | False | Recognize character refs in PIs |
| piAttribute | False | Parse/return PI data like attributes |
| [piAttrList] | False | Declare PI attrs (<!ATTLIST ?target ...>) |
| [nsUsage] | None | Limit namespaces to one/global/noredef/regular |
| [MSTypes] | False | Allow Marked sections beyond CDATA |
| [extraDcl] | False | Allow extra XML Declarations in documents referenced as entities |
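To make the first two options concrete, the following sketch shows one way limits like MAXEXPANSION and MAXENTITYDEPTH might be enforced while expanding general entities. It is an illustration under stated assumptions, not Loki's code: the function expand, the exception EntityLimitError, and the dict-based entity table are all hypothetical, and it assumes that MAXEXPANSION counts the characters of the fully expanded text.

```python
import re

MAXEXPANSION = 1 << 20      # default from Table III
MAXENTITYDEPTH = 16         # default from Table III

# Simplified general-entity reference; real XML names allow more characters.
ENTITY_REF = re.compile(r"&([A-Za-z_][\w.\-]*);")

class EntityLimitError(ValueError):
    """Raised when expansion exceeds a configured limit (hypothetical)."""

def expand(text, entities, max_expansion=MAXEXPANSION,
           max_depth=MAXENTITYDEPTH):
    """Expand entity references in text, enforcing size and depth limits."""
    remaining = max_expansion        # characters we may still emit

    def emit(chunk, out):
        nonlocal remaining
        remaining -= len(chunk)
        if remaining < 0:
            raise EntityLimitError("expansion exceeds %d characters"
                                   % max_expansion)
        out.append(chunk)

    def walk(source, depth):
        if depth > max_depth:
            raise EntityLimitError("entities nested deeper than %d"
                                   % max_depth)
        out, pos = [], 0
        for match in ENTITY_REF.finditer(source):
            emit(source[pos:match.start()], out)    # literal text before the ref
            # An undeclared entity name raises KeyError here.
            out.append(walk(entities[match.group(1)], depth + 1))
            pos = match.end()
        emit(source[pos:], out)                     # literal text after the last ref
        return "".join(out)

    return walk(text, 0)

entities = {"who": "world", "greet": "Hello, &who;!"}
print(expand("<p>&greet;</p>", entities))
```

Because the character budget is shared across all nesting levels, a "billion laughs" style set of declarations is stopped once the total output passes the limit, and a self-referential entity is stopped by the depth check.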
References
[Biron and Malhotra 2004] Biron, Paul V. and Ashok Malhotra. 28 October 2004. XML Schema Part 2: Datatypes
Second Edition.
W3C Recommendation. http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/.
[Bleeker, Buitendijk, and Haentjens Dekker 2020] Bleeker, Elli, Bram Buitendijk and Ronald Haentjens Dekker. 2020. Marking up microrevisions with major implications:
Non-linear text in TAG.
Presented at Balisage: The Markup Conference 2020,
Washington, DC, July 27 - 31, 2020. In Proceedings of Balisage: The Markup
Conference 2020. Balisage Series on Markup Technologies, vol. 25 (2020).
doi:https://doi.org/10.4242/BalisageVol25.Bleeker01.
[Boyer and Marcy 2008] Boyer, John and Glenn Marcy. 2 May 2008.
Canonical XML Version 1.1.
W3C Recommendation. http://www.w3.org/TR/2008/REC-xml-c14n11-20080502/.
[Bray, Paoli, and Sperberg-McQueen 1998] Bray, Tim, Jean Paoli, and C.M. Sperberg-McQueen. 1998. Extensible Markup Language (XML) 1.0.
W3C Recommendation. World
Wide Web Consortium, 10 February 1998. https://www.w3.org/TR/1998/REC-xml-19980210.
[Bray 2010] Bray, Tim. 2010. D.P.H.
https://www.tbray.org/ongoing/When/201x/2010/07/21/DPH.
[Carlisle and Ion 2015] Carlisle, David and Patrick Ion. 2015. XML Entity Definitions for Characters.
W3C Working Draft (3rd Edition). https://www.w3.org/Math/characters/unicode.xml.
[Clark and DeRose 1999] Clark, James and Steve DeRose. 1999. XML Path
Language (XPath) Version 1.0.
W3C Recommendation. World Wide Web Consortium,
16 November 1999. https://www.w3.org/TR/1999/REC-xpath-19991116.
[Clark and DeRose 2002] Clark, James and Steve DeRose. 2002. XML Pointer
Language (XPointer).
W3C Working Draft. World Wide Web Consortium, 16 August
2002. https://www.w3.org/TR/xptr/.
[DeRose et al. 1996] DeRose, Steven et al. 1996. Data
processing system and method for representing, generating a representation of and
random access rendering of electronic documents.
U.S. patent number
5,557,722 (expired). https://uspto.report/patent/grant/5,557,722.
[DeRose 1999] DeRose, Steven J. 1999. What Do Those
Weird XML Types Want, Anyway?
Keynote address, VLDB ’99, Edinburgh.
In VLDB ’99: Proceedings of the 25th International
Conference on Very Large Data Bases: 721-724. Morgan Kaufmann. ISBN
1-55860-615-7. https://dl.acm.org/doi/10.5555/645925.671670, https://www.vldb.org/dblp/db/conf/vldb/DeRose99.html.
[DeRose 2014a] DeRose, Steven J. 2014. JSOX: A Justly
Simple Objectization for XML: Or: How to do better with Python and XML.
Presented at Balisage: The Markup Conference 2014, Washington, DC, August 5 - 8, 2014.
In Proceedings of Balisage: The Markup Conference 2014. Balisage Series on
Markup Technologies, vol. 13 (2014). doi:https://doi.org/10.4242/BalisageVol13.DeRose02.
[DeRose 2014b] DeRose, Steven J. 2014. What do we still lack? Or:
Prolegomena to any future hypertext system.
Presented at Symposium on HTML5
and XML, Washington, DC, August 4, 2014. In Proceedings of the Symposium on
HTML5 and XML. Balisage Series on Markup Technologies, vol. 14 (2014).
doi:https://doi.org/10.4242/BalisageVol14.DeRose01.
[DeRose and van Dam 1999] DeRose, Steven J. and Andries van Dam.
1999. Document structure and markup in the FRESS hypertext system.
In
Markup Languages: Theory & Practice 1.1:
7-32.
[Diewald and Stührenberg 2013] Diewald, Nils, and Maik Stührenberg. 2013.
An extensible API for documents with multiple annotation layers.
Presented at Balisage: The Markup Conference 2013, Montréal, Canada, August 6 - 9,
2013.
In Proceedings of Balisage: The Markup Conference 2013. Balisage Series on
Markup Technologies, vol. 10 (2013). doi:https://doi.org/10.4242/BalisageVol10.Diewald01.
[Heimes 2023] Heimes, Christian (maintainer). 2023.
defusedxml.
Pypi library, version 0.7.1,
https://pypi.org/project/defusedxml/.
[Le Hors et al. 2000] Le Hors, Arnaud, Philippe Le Hégaret, Lauren Wood,
Gavin Nicol, Jonathan Robie, Mike Champion, and Steve Byrne. 2000. Document
Object Model (DOM) Level 2 Core Specification.
W3C Recommendation. World
Wide Web Consortium, 13 November 2000. https://www.w3.org/TR/2000/REC-DOM-Level-2-Core-20001113/.
[Peters 2004] Peters, T. 2004. PEP 20 — The Zen of
Python.
Python Enhancement Proposals. Retrieved from https://peps.python.org/pep-0020.
[Piez 2008] Piez, Wendell. 2008. LMNL in Miniature.
Presented at the LMNL Workshop, Amsterdam, December 2008.
[Piez 2012] Piez, Wendell. 2012. Luminescent:
parsing LMNL by XSLT upconversion.
Presented at Balisage: The Markup
Conference 2012, Montréal, Canada, August 7 - 10, 2012. In Proceedings of
Balisage: The Markup Conference 2012. Balisage Series on Markup Technologies, vol. 8 (2012). doi:https://doi.org/10.4242/BalisageVol8.Piez01.
[Python] Python. Retrieved 2025-03-20.
XML Processing Modules
(Python 3.11.11 documentation).
https://docs.python.org/3.11/library/xml.html.
[Raku] Raku. Retrieved 2025-03-19.
Unicode: Unicode support in Raku
(Raku documentation).
https://docs.raku.org/language/unicode.
[Schonefeld 2008] Schonefeld, Oliver. 2008. An
event-centric API for processing concurrent markup.
Presented at Balisage:
The Markup Conference 2008, Montréal, Canada, August 12 - 15, 2008. In
Proceedings of Balisage: The Markup Conference 2008. Balisage Series on
Markup Technologies, vol. 1 (2008). doi:https://doi.org/10.4242/BalisageVol1.Schonefeld01.
[Sperberg-McQueen and Huitfeldt 2000] Sperberg-McQueen, C.M., and Claus Huitfeldt.
2000. GODDAG: A Data Structure for Overlapping Hierarchies.
In
Digital Documents: Systems and Principles. PODDP 2000, pp. 139-160. Lecture Notes in Computer Science, vol. 2023. Springer-Verlag,
Berlin, Heidelberg. doi:https://doi.org/10.1007/978-3-540-39916-2_12.
[Sperberg-McQueen and Huitfeldt 2008] Sperberg-McQueen, C.M., and Claus Huitfeldt.
2008. Markup Discontinued: Discontinuity in TexMecs, Goddag structures, and
rabbit/duck grammars.
Presented at Balisage: The Markup Conference 2008,
Montréal, Canada, August 12 - 15, 2008. In Proceedings of Balisage: The Markup
Conference 2008. Balisage Series on Markup Technologies, vol. 1.
doi:https://doi.org/10.4242/BalisageVol1.Sperberg-McQueen01.
[Thompson et al. 2004] Thompson, Henry S., David Beech, Murray Maloney, and
Noah Mendelsohn. 2004. XML Schema Part 1: Structures Second Edition.
W3C
Recommendation. World Wide Web Consortium, 28 October 2004. https://www.w3.org/TR/2004/REC-xmlschema-1-20041028/.
[Unicode Annex #15] The Unicode Consortium. 2015. Unicode Standard
Annex #15: Unicode Normalization Forms.
Unicode Standard Annex.
Retrieved 2025-03-21. https://www.unicode.org/reports/tr15/tr15-43.html.
[Unicode] The Unicode Consortium. Principles of the Unicode Standard.
Section in The Unicode® Standard: A Technical Introduction.
https://www.unicode.org/reports/tr15/tr15-43.html.
[Unicode, Section 2.2] The Unicode Consortium. 2024-09-10.
Unicode Design Principles.
Section 2.2 in The Unicode® Standard,
Version 16.0 — Core Specification. https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-2/#G128.
[Wall, Christiansen, and Orwant 2000] Wall, Larry, Tom Christiansen and Jon Orwant. 2000. Programming Perl (3rd ed.).
O’Reilly Media. (The three virtues of a programmer, laziness, impatience, and hubris,
are referenced in Chapter 27, Perl Culture.)
[WHATWG] WHATWG. 2025. HTML Living Standard.
Web Hypertext Application Technology Working Group. https://html.spec.whatwg.org/multipage/. Retrieved 2025-07-22.
[Selectors Level 3] World Wide Web Consortium. 2009.
Selectors Level 3.
W3C Recommendation. World Wide Web Consortium, 15
December 2009. https://www.w3.org/TR/2009/REC-css3-selectors-20091215/.