How to cite this paper

DeRose, Steven J. “Ragnarok: An Experimental XML environment.” Presented at Balisage: The Markup Conference 2025, Washington, DC, August 4 - 8, 2025. In Proceedings of Balisage: The Markup Conference 2025. Balisage Series on Markup Technologies, vol. 30 (2025). https://doi.org/10.4242/BalisageVol30.DeRose01.

Balisage: The Markup Conference 2025
August 4 - 8, 2025

Balisage Paper: Ragnarok: An Experimental XML environment

Steven J. DeRose

Consultant

`<sderose@acm.org>`

ORCID ID: https://orcid.org/0000-0003-4216-548X

Steve DeRose has been working with electronic document and hypertext systems since 1979. He holds degrees in Computer Science and Linguistics and a Ph.D. in Computational Linguistics from Brown University.

He co-founded Electronic Book Technologies in 1989 to build the first SGML browser and retrieval system, DynaText, and has been deeply involved in document standards including XML, TEI, HyTime, HTML 4, XPath, XPointer, EAD, Open eBook, OSIS, and others. He has served as adjunct faculty at Brown and Calvin Universities and has written many papers, two books, and more than fifteen patents. Most recently he has been working as a consultant in text analytics.

Abstract

XML has a highly reliable, consistent, widely-supported ecosystem. Python is enormously popular, yet (perhaps surprisingly) its support for XML has weaknesses. Several parsers are available but most (including the official xml.parsers.expat) are not native Python, leading to issues with Python development tools. The Python DOM library (xml.dom.minidom) is native, but is barely DOM 2.0, slow, and lacks conveniences well-established elsewhere. It is also not Pythonic, using few modern Python features and idioms. lxml is admirably Pythonic for Elements, but text poses problems.

Ragnarok is a new, pure Python XML tool suite that addresses these issues. It provides plug-compatible replacements for Python XML libraries, and is equipped with many Pythonic conveniences (the “batteries included” philosophy of Python). The parser (Thor) uses recursive descent: methods map directly to the XML grammar, easing debugging and extension. A validator (Heimdall) is in progress. Schemera handles DTDs, but its architecture is more like XML Schema. The DOM library (Dominµs, aka Yggdrasil) is much faster than minidom, with almost all of DOM 3 Core and many features drawn from other XML and HTML tools and from Python practice. Flexible output serializing comes via components called Gleipnir and Bifrost.

Non-hierarchical structures have long been of interest to this community, but face a dilemma because XML doesn’t really support them: one can coerce to XML syntax via milestones or standoff markup; or create entirely new syntax. In either case XML tools give little help. Beside Thor, Ragnarok also includes a second parser, called Loki, which explores a middle way. Loki accepts non-hierarchical structures via syntax that includes non-XML extensions, but remains so similar that (a) no prior WF XML changes meaning, and (b) the implementation is easily constructed on top of a regular parser. This may enhance data and code re-use. Loki is a subclass (rather than brother) of Thor, and like its namesake can change its shape. Loki can be configured with many XML-adjacent options ranging from case-folding names or enabling named character entities (both difficult with many other tools), on up to extensions for olists, suspend/resume, and milestone-encoded structures.

Ragnarok overall can help with everyday XML tasks in Python by being faster, more up-to-date, more Pythonic, and more fully functional. On the other hand, Loki can help with overlap and a variety of other tasks at the edge.

Introduction

Design principles

Yggdrasil (aka Dominµs)

Slicing

Other list methods

Other Yggdrasil features

Relation to lxml and ElementTree

Some ways documents differ

Yggdrasil / Dominµs sibling representation

Schemera (DTD/schema support)

Loki: a shape-changing parser

Security options
Basic/practical parser options
Weightier extensions
Other options
Schemera and Loki
Parser APIs
Special character handling

Runeheim (character set management)

Gleipnir and Bifrost (serializers)

Testing

Futures

ID handling
A longer view
Why try Ragnarok?
Why not try it?
Availability

Appendix A. Examples of extended syntax

Appendix B. Example of JBook from Bifrost

Appendix C. Dominµs/Yggdrasil vs. regular DOM

Appendix D. Loki and Schemera options

Introduction

XML is plenty cool as is. Yet support for it in Python has some limitations. This paper reports progress on a pure Python XML environment that I call Ragnarok after the apocalyptic rebirth in Norse mythology. The great serpent (python?) Jörmungandr triggers it, as problems with the official Python XML stack motivate this Ragnarok. For example, the official parser is not native Python and so has limitations with Python development tools. The Python DOM library is behind, quite slow, and uses few modern Python features or practices. Several Ragnarok tools can be plugged straight in for Python’s libraries, particularly expat and minidom.

Ragnarok has several parts, all implemented in pure Python and fully type-hinted:^[1]

The primary component is a modern DOM implementation [Le Hors et al. 2000] covering nearly all of DOM 2 and DOM 3 Core. It is called Yggdrasil after the great tree of Norse myth; or if you prefer, Dominµs — spelled with a micro sign because it’s fast (unlike minidom).
An XML parser called Thor (short for Text Hierarchy Object Reader, and previously called xsparser). Thor has a very few options, such as for protecting against DoS attacks.
Flexible serializers called Gleipnir and Bifrost.
A package for character sets and naming, called Runeheim.
DTD and partial XSD support, called Schemera because it casts a wide feature net and integrates parts from several other components and languages, like the Chimera (despite the lack of that creature in Norse myth).
A still-incomplete validator, called Heimdall after the guardian of the bridge joining Asgard and Midgard.
An extended (XML-adjacent) parser, called Loki because it is extremely configurable, including for overlap and much more.
A still-incomplete persistent binary form called Sleipnir.

Ragnarok tools are backward-compatible: Yggdrasil has the normal DOM API, plus added conveniences you can use if you want. Thor has an API almost identical to expat’s and is a normal XML parser. Schemera does normal DTDs (and soon, XSDs) unless you reconfigure it. Gleipnir binds DOM structures into XML syntax just like minidom’s toprettyxml() unless you reconfigure it with an additional formatting spec. Loki, in contrast to Thor, can be specifically reconfigured with a wide range of syntactic and semantic options (nevertheless, WF XML is treated normally). Ragnarok also includes a large Python unittest suite including head-to-head swapping vs. prior tools.

Design principles

Ragnarok aims at several broad goals:

Support the existing ecosystems
1. Be backward-compatible with XML and DOM tools, building on XML foundations rather than replacing them
2. Ensure regular XML parsers (including Thor) stop if Loki extensions are enabled
3. Be Pythonic, XML-ic, and Unicode-ic
Provide for configuration and extension
1. Be pure Python and leverage Python libraries and features
2. Support non-hierarchical structures with minimal syntax change from XML
3. Avoid components depending on other components’ options
Don’t necessarily be limited by quirks of the ecosystem
1. Remember that XML ≢ DOM ≢ structure.
2. Avoid making DTDs necessary (e.g., for named characters, finding IDs, or attribute defaults)
3. Offer more sophisticated structure, ID, and other semantics
Provide pain relief
1. Address inconsistency, verbosity, and awkwardness in APIs
2. Be fast and allow users to make performance tradeoffs
3. Make entities safer against potential attacks

There are lots of functions and methods available across these components, but users rarely have to care. For example, all the DOM node insertion and deletion methods are there in Yggdrasil so if you like your mutator, you can keep your mutator. On the other hand, XPath users needn’t remember whether to use preceding/following or previous/next (I never can). Both work the same. Python developers who are new to DOM can just use the already-familiar Python list operations. For example, instead of^[2]

if newNode.nodeType == Node.ELEMENT_NODE:
    c3 = n.childNodes[3]
    n.insertBefore(newNode, c3)
    n.removeChild(c3)

they can say the following, which is Pythonic, shorter, more readable, and not prone to mistaking insertBefore’s parameter order:

if newNode.isElement: n[3] = newNode

Python, Unicode, and XML have express goals that are broadly compatible but differ in detail. Honoring them involves compromises, not only when combined but even for each in itself:

Python teaches [Peters 2004] that “There should be one-- and preferably only one --obvious way to do it.” One counterexample is that the fastest way to build up a string (say, when parsing) is to make a list of separate single-character strings, then combine them into one string at the end. This is far from the obvious way to do it.
Unicode’s principles [Unicode, Section 2.2, Unicode Annex #15] include that it be simple to parse and process, that it unify duplicate characters within scripts across languages, and that characters have well-defined semantics that represent plain text. All these have exceptions.
XML goals include #5 [Bray, Paoli, and Sperberg-McQueen 1998, §1.1]: “The number of optional features in XML is to be kept to the absolute minimum, ideally zero.” Yet there are effectively options in standalone, attribute defaulting, XML 1.1 differences, and validation and choice of schema language.
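The list-then-join idiom mentioned above for Python can be sketched as follows. This only illustrates the general technique; the function name `read_name` and its stopping rule are assumptions made for the example, not code taken from Thor.

```python
def read_name(source: str, start: int = 0) -> str:
    """Collect characters up to whitespace, '/' or '>' (hypothetical rule)."""
    parts: list[str] = []           # accumulate single-character strings
    for ch in source[start:]:
        if ch.isspace() or ch in "/>":
            break
        parts.append(ch)
    return "".join(parts)           # combine once, at the end


name = read_name("para id='p1'>")
print(name)
```

Appending to a list and joining once avoids creating a new intermediate string for every character, which is why the idiom is common in hand-written scanners.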

Nevertheless, Ragnarok tries to honor these principles as far as practical, except for having options. For example, Thor and Yggdrasil provide idioms that fit the obvious way of doing things in Python, such as n[3] to access n’s 4th child, or n.isText to identify text nodes. Runeheim modularizes Unicode dependencies so code need not be complicated everywhere else. Ragnarok adds synonyms without removing original names, rather than forcing users to adjust.

Yggdrasil (aka Dominµs)

Yggdrasil is my favorite piece of Ragnarok. It supports nearly all of the minidom API identically, while my profiling finds it about 40% faster (this will of course vary across use cases). It has nearly all of DOM 2 and DOM 3 Core, and features from HTML DOM, WhatWG, XPath, XPointer, lxml, and even ElementTree and CSS. Most existing code should run fine with Yggdrasil replacing minidom (just faster and better integrated).

Slicing

The most obvious change is that DOM Elements are a true subclass of Python lists — consisting of their child nodes. This enables Python’s normal subscript or slicing notation: myNode[1], not just myNode.childNodes[1].

In minidom, the first raises TypeError: 'Element' object is not subscriptable. The second works as a reference, but assigning to it in obvious Python fashion only appears to work in minidom. For example, this reports no error:

myNode.childNodes[1] = myNode.childNodes[3]

minidom copies the pointer, leaving myNode.childNodes with duplicate references to the identical node. Thus for c in myNode.childNodes afterward gives children 0, 3, 2, 3, …. But iterating with

c = myNode.childNodes[0]
while c:
    c = c.nextSibling

still gives the original list. Yggdrasil, in contrast, correctly complains that [3] already has a parent.

If instead a new Node is spliced into position [1], minidom again corrupts its own data: iterating via nextSibling would never see the new Node, though it would be reachable as myNode.childNodes[1]. Yggdrasil instead does the right thing. In Yggdrasil Pythonically-obvious things just work, both on left and right sides.
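The first minidom behavior described above (assigning one child over another) can be reproduced with only the standard library. The five-child sample document is an assumption chosen for illustration.

```python
from xml.dom import minidom

doc = minidom.parseString("<r><a/><b/><c/><d/><e/></r>")
root = doc.documentElement

# Plain list assignment: minidom reports no error.
root.childNodes[1] = root.childNodes[3]

# View 1: iterate the childNodes list.
by_index = [n.nodeName for n in root.childNodes]

# View 2: follow nextSibling pointers from the first child.
by_sibling = []
c = root.childNodes[0]
while c:
    by_sibling.append(c.nodeName)
    c = c.nextSibling

print(by_index)     # the list now holds <d> twice
print(by_sibling)   # the sibling chain is unchanged
```

The two views of the same element now disagree, which is the corruption the text describes.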

Python [] notation supports more varied arguments than just a positive integer, and Yggdrasil also leverages that to support shorthands reminiscent of XPath:

Table I

Example	DOM	Description
x[-1]	x.childNodes[-1] # right-side only	Last child
x[0:-3]	nl = NodeList(x.childNodes[0:-3])	First child up to, but not including, the 3rd-to-last child
x["para"]	nl = NodeList([ ch for ch in x.childNodes if ch.nodeName == "para" ])	All children with nodeName "para"
x["@id"]	x.getAttribute("id")	Attribute named "id"
x["para":1:3]	temp = NodeList([ ch for ch in x.childNodes if ch.nodeName == "para" ]); NodeList(temp[1:3])	Among all "para" children, the 2nd and 3rd
x["#text"]	nl = NodeList([ ch for ch in x.childNodes if ch.nodeName == "#text" ])	All children with nodeName #text (i.e., text nodes)
x["*"]	nl = NodeList([ ch for ch in x.childNodes if ch.nodeType == Node.ELEMENT_NODE ])	All element children
x[".."]	x.parentNode	The parentNode
x["/"]	NodeList(x.childNodes)	All children

As with Python’s own extended slices there are some limitations on the left side (they raise errors if tried). Yggdrasil is even faithful to a feature/quirk of Python slicing: An out-of-range index raises IndexError for a singleton like x[999] but quietly returns an empty list for a range like x[99:999] or x[99:999:2]. One difference, however, is that elements with no children do not cast to Boolean False like empty Python lists do. This is because empty elements have distinct data (such as attributes, not to mention their tree context), unlike other falsish things in Python: None, 0, "", [], {}, etc.
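To make the subscript behavior concrete, here is a minimal sketch of how a list subclass can accept such keys. The class `Elem` and its details are hypothetical and much simpler than Yggdrasil (text children are plain strings, and only a few of the forms from Table I are handled); it is not the library's implementation.

```python
from typing import Any


class Elem(list):
    """Toy element: a list of children plus a name and attributes."""

    def __init__(self, name: str, children=(), **attrs: str):
        super().__init__(children)
        self.nodeName = name
        self.attrs = dict(attrs)

    def __bool__(self) -> bool:
        return True   # a childless element is still a real object

    def __getitem__(self, key: Any):
        if isinstance(key, str):
            if key.startswith("@"):                      # x["@id"]
                return self.attrs.get(key[1:])
            if key == "*":                               # x["*"]
                return [ch for ch in self if isinstance(ch, Elem)]
            # x["para"] or x["#text"]; plain strings count as "#text"
            return [ch for ch in self
                    if getattr(ch, "nodeName", "#text") == key]
        if isinstance(key, slice) and isinstance(key.start, str):
            # x["para":1:3] filters by name, then slices the result [1:3]
            return self[key.start][key.stop:key.step]
        return super().__getitem__(key)                  # ints, plain slices


sec = Elem("sec", [Elem("para"), "some text", Elem("para"),
                   Elem("note"), Elem("para")], id="s1")
print(sec["@id"], len(sec["para"]), sec["#text"])
```

Integer and ordinary slice keys fall through to `list`, so the Python quirk noted above comes for free: `sec[999]` raises IndexError while `sec[99:999]` returns an empty list.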

Finally, slicing supports scheme prefixes analogous to those in XPointers and URLs. For example, n["css:#mainText"]. Many CSS selector types are available (a few don’t make sense in non-browser contexts). Additional scheme prefixes can be registered along with functions to handle them.

Other list methods

Regular Python list operations such as insert and del (not to mention sort and many others) just work, as do regular DOM ones:

myNode.insert(-1, newNode)
myNode.extend(someNodeList)
del myNode[-1]
myNode.pop(5)
x[0:2] = myNodeList
howMany = myNode.count("p")
myNode.appendChild(newNode)
myNode.insertBefore(newChild, oldChild)

Other Yggdrasil features

Some other features of Yggdrasil are inspired by the HTML DOM, such as (configurable) case-ignoring^[3] and innerXml and outerXml (both setters and getters). WhatWG inspired adding insertAdjacentXml() and methods for tokenized attributes (such as, but not only, HTML CLASS).

The Yggdrasil API uses modern Python constructs. The much shorter nodeType test properties were shown earlier (n.isElement, n.isText, etc.); or of course Python isinstance() works fine. There are also methods to get the first node along any XPath axis (not just parent, child, and siblings), and generators to iterate along any XPath axis.^[4] There is also a generator for the SAX events that a parser would return were it parsing a given subtree. The generators can take a callable to use as a filter:

for i, n in enumerate(someNode.descendants(test=myCallback)):
    print(f"Descendant {i} is of type {n.nodeName}")

Yggdrasil accepts synonyms such as DOM previous/next vs. XPath preceding/following, and fills symmetry gaps such as prependChild().

Reserved string arguments are defined as enums, so in cases like

myNode.insertAdjacentXML("beforebegin", "<p>As seen below:</p>")

a reified type and value can be used: RelPosition.beforebegin. This lets pylint provide earlier detection of errors, and compilers like mypyc optimize better. The enums also accept the strings as equivalent so existing code works without change.
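One common way to get an enum whose members also compare equal to their legacy strings is Python's `str` mixin. The sketch below assumes that approach; Yggdrasil's actual definition of RelPosition may differ, and the helper `normalize_position` is hypothetical. The four member names are the standard WhatWG insertAdjacent positions.

```python
from enum import Enum


class RelPosition(str, Enum):
    """Positions for insertAdjacent-style methods."""
    beforebegin = "beforebegin"
    afterbegin = "afterbegin"
    beforeend = "beforeend"
    afterend = "afterend"


def normalize_position(where) -> RelPosition:
    """Accept either an enum member or its string spelling."""
    return RelPosition(where)    # raises ValueError for unknown strings


print(normalize_position("beforebegin") is RelPosition.beforebegin)
print(RelPosition.afterend == "afterend")
```

Because lookup by value fails loudly, a misspelled position is caught at the call rather than silently ignored.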

DOM methods that normally require a child node as a location specifier are extended in Yggdrasil to also accept (signed) numbers. Whether a caller knows the position of a node (perhaps because it’s iterating by number) or the actual node (perhaps because it’s in a NodeList), it can just use what it has rather than having to convert to the other — the following are all interchangeable if oldNode is theParent[4]:

theParent.insertBefore(newNode, oldNode)
theParent.insertBefore(newNode, 4)
theParent.insert(4, newNode)

That last case is just the normal Python list insertion method, and therefore uses that order of arguments. Nodes can even delete themselves, using:

myNode.removeNode()

rather than something like:

myNode.parentNode.removeChild(myNode)

Normal comparison operators such as < are overloaded in typical Python fashion, and test document order.^[5] However, because a node cannot be in multiple places, using == and != to test document order is redundant with testing identity (isSameNode()). It seems better to instead make them do the equivalent of isEqualNode(). DOM 3’s compareDocumentPosition() is also available.

Yggdrasil can generate and interpret basic XPointers [Clark and DeRose 2002] with or without IDs. As it turns out, generating XPointer child sequences for two nodes and comparing them is a very fast and intuitive way to compare document positions. Upgrades to support DOM/XPointer ranges and discontiguous (virtual) elements are underway but incomplete (to be named Gungnir after Odin’s spear which always hits its target). Those of course will need additional position, containment, traversal, and inheritance semantics.
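The idea of comparing positions via child sequences can be sketched against minidom from the standard library. The helper names are hypothetical, and for simplicity the steps count all child nodes (XPointer child sequences count only elements) and attributes are not handled.

```python
from xml.dom import minidom


def child_sequence(node) -> tuple:
    """Return 1-based child positions from the document down to node."""
    steps = []
    while node.parentNode is not None:
        siblings = node.parentNode.childNodes
        # Locate the node by identity, then convert to a 1-based step.
        pos = next(i for i, sib in enumerate(siblings) if sib is node)
        steps.append(pos + 1)
        node = node.parentNode
    return tuple(reversed(steps))


def precedes(a, b) -> bool:
    """True if a comes before b in document order (ancestors first)."""
    return child_sequence(a) < child_sequence(b)


doc = minidom.parseString("<r><a><b/></a><c/></r>")
r = doc.documentElement
a, c = r.childNodes
b = a.firstChild
print(child_sequence(b), child_sequence(c), precedes(b, c))
```

Tuple comparison is lexicographic, and a prefix sorts before any longer tuple that extends it, so an ancestor automatically precedes its descendants.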

Since Elements and NodeLists are actually Python lists, operations like sort, reverse, and even multiply are available. However, because no Node can appear in more than one place in a document, operations like multiplication on Elements return a new NodeList and not the same Element lengthened.

Relation to lxml and ElementTree

lxml is a popular parser, with an associated ElementTree package comparable to DOM. It has the advantages of being much more Pythonic than minidom, for example simply constructing nodes via Element(), Comment(), etc.; treating elements much like lists, leveraging Pythonisms such as is, in, len; treating attributes as a dictionary; etc. These conventions are also used in Yggdrasil. ElementTree also has its own names for various things: tag vs. nodeName, getparent() vs. parentNode, etc. These also are supported in the interest of users not having to care.^[6]

ElementTree itself, however, has what I consider a serious drawback: It has no concept of text nodes. Text is treated as a property of the preceding sibling (almost like an attribute), named tail. But of course there may not be a preceding sibling. In that case the text instead goes onto the parent as a property called text (tail on the parent would refer to text in a different place). For Comments and PIs, text means their data content and they have no tail. This works, but has consequences:

The number of children, child numbers, etc. are completely different than in other XML tools.
Text that is conceptually part of an element is partly on that element and partly on its children, rather than all at the same level. This makes adding (say) an inline node within that text topologically interesting.
The tail text logically follows all descendants’ text; but it is a property of the parent, which precedes all its descendants in document order.
In everyday speech people say that text is part of or contained in a paragraph — ElementTree still has a notion of children but it does not include any of the text (except perhaps indirectly).
Text content is split across two different variables, one of which can only logically occur on certain Elements (in DOM terms, just those which happen to have a first child which is a text node).
There is data stored in text on Comments and PIs that is not text content despite the name.
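The difference in child counts and text placement listed above is easy to see with the standard library alone. The one-line sample paragraph is an assumption chosen for illustration.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

src = "<p>Hello <b>bold</b> world</p>"

# ElementTree: one child; the text is split across text and tail.
p = ET.fromstring(src)
b = p[0]
et_view = (len(p), p.text, b.text, b.tail)

# DOM: three children, two of which are text nodes.
dom_p = minidom.parseString(src).documentElement
dom_view = [n.nodeName for n in dom_p.childNodes]

print(et_view)
print(dom_view)
```

Here " world" belongs to the paragraph conceptually, but ElementTree stores it as the tail of the b element.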

Code to gather up all the text content is considerably more complicated with ElementTree’s approach (and it is awkward to use Python’s much faster ''.join() method for building up strings):^[7]

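def textContent(self) -> str:  # ElementTree way
    buf = self.text or ""
    for ch in self:
        # Comments and PIs keep their data in .text; it is not text content.
        # Their tag is the ET.Comment / ET.ProcessingInstruction factory.
        if ch.tag not in (ET.Comment, ET.ProcessingInstruction):
            buf += ch.textContent()
        buf += ch.tail or ""
    return buf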

def textContent(self) -> str:  # DOM way
    if isinstance(self, Element):
        return ''.join([ ch.textContent() for ch in self.childNodes ])
    if isinstance(self, CharacterData):
        return self.nodeValue
    return ""

def textContent(self) -> str:  # Yggdrasil way
    if self.isElement:
        return ''.join([ ch.textContent() for ch in self ])
    if self.isCharacterData:
        return self.nodeValue
    return ""

I find ElementTree’s approach to text less than intuitive. Nevertheless, along with the rest of the ElementTree API for nodes, Yggdrasil supplies setters and getters for text and tail.

Some ways documents differ

It may be that this approach by ElementTree is related to how XML is commonly (mis?) represented in the software world. It is extremely common to explain XML using examples that completely omit features that are indispensable for documents; that include only features that are needed for config files, CSV-like data, etc. (see DeRose 1999). For example, countless examples share these limitations:

Only elements, no attributes. Or perhaps attributes only for style or ID. This conflates the notion of things having component parts with the notion of things having properties (essentially mereology vs. ontology, or the ubiquitous programming distinction of is-a vs. has-part).
Essentially no hierarchy. Examples very commonly have a root element, then a bunch of instances of a single element type, each with the same sequence of sub-element types. In other words, a CSV file in pointy brackets. This is just fine for data that is that way; but documents are not that way.
No ordering. Again like CSV, many XML examples have no use for order — you can shuffle records without affecting the meaning; but do that to the paragraphs of a document and the author might not be pleased. Some sophisticated but non-document-like XML applications, however, also do not make much use of order (for example SVG and XSLT).
No mixed content or even no text content at all. Look at Apple XML config files, or even the XML page on Wikipedia — which as of this writing barely mentions mixed content and gives not one example of non-trivially nested elements or of mixed content. Documents in reality have lots of unnamed text portions — countless inline-ish or font-ish elements are embedded in text, making the text on either side unnamed.

The last item (mixed content) is, I think, the crux: Serializing a SQL or CSV record or a C struct gives no occasion for freestanding text (or freestanding integers, for that matter). A record is a tuple of scalar fields (essentially what Python calls a namedtuple). Similarly, an OOP object is conceptually (and in Python literally) a dict with named items. In both cases such items permit no repetitions (an item’s value can of course be a list, dict, or other object, but if it is named X you still only get one X). Nameless parts simply don’t exist in these contexts.

A fifth fundamental omission is that examples rarely show items having identity, reference, or re-use of data by reference. Unlike the other points this should be bread and butter even for developers who rarely deal with documents — objects and databases are replete with pointers and re-use despite their notable absence from CSV and JSON.

These features are absolutely required for documents. Without them XML could still handle tabular and OOP-like cases, but not even simple documents. Someone who considers only those cases may think all the rest is wasteful, but that is akin to considering Unicode wasteful because one is monolingual. Sadly, XML examples are commonly gerrymandered.

Yggdrasil / Dominµs sibling representation

DOM implementations provide direct access to the previous and next sibling from any node. There are three main ways to implement this, which have significant space/time tradeoffs:

Store explicit pointers to adjacent siblings in every node (minidom does this). This is very fast for individual steps, but it costs space for the pointers and time to update them on changes.^[8]
Store in each node its integer position among its siblings. This is faster than searching the parent (see next) but slower for node insertions or deletions other than at the end (because all later siblings need their numbers adjusted).
Go to the parent, count through its list of children to find the original child, then add or subtract 1. This slows down for nodes with very many siblings (except for some operations like appends), but saves space and makes tree changes fast (no pointers or counters to update).

Perhaps surprisingly, profiling revealed that going to the parentNode and counting is very much faster than maintaining and using direct pointers, even for quite wide trees. This is a good example of why a pure Python experimental framework is useful: it’s very easy to change specific details and do head-to-head comparisons.
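The third strategy (find siblings by searching the parent) can be sketched with a toy node class. The class and method names are hypothetical, and the sketch makes no performance claim; it only shows that no per-node sibling state needs updating when the tree changes.

```python
from typing import List, Optional


class TreeNode:
    """Toy node that stores only a parent pointer and a child list."""

    def __init__(self, name: str):
        self.name = name
        self.parent: Optional["TreeNode"] = None
        self.children: List["TreeNode"] = []

    def append(self, child: "TreeNode") -> "TreeNode":
        child.parent = self
        self.children.append(child)
        return child

    def child_index(self) -> int:
        """Count through the parent's children to find this node."""
        for i, sib in enumerate(self.parent.children):
            if sib is self:
                return i
        raise ValueError("node is not among its parent's children")

    @property
    def next_sibling(self) -> Optional["TreeNode"]:
        if self.parent is None:
            return None
        i = self.child_index() + 1
        sibs = self.parent.children
        return sibs[i] if i < len(sibs) else None

    @property
    def previous_sibling(self) -> Optional["TreeNode"]:
        if self.parent is None:
            return None
        i = self.child_index()
        return self.parent.children[i - 1] if i > 0 else None


root = TreeNode("root")
a, b, c = (root.append(TreeNode(n)) for n in "abc")
print(a.next_sibling.name, c.next_sibling)
```

Inserting a node only requires setting its parent and placing it in the parent's list; the neighbors' sibling answers change without any bookkeeping on them.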

In keeping with Ragnarok’s overall philosophy Yggdrasil provides for choosing any of the three methods, and even for changing in mid-stream. For example, method 2 might be best during loading (it is very fast for additions at the end), then method 3 once loaded (it is very fast for modifications), or method 1 if there are very bushy nodes. It would not be very hard to have individual nodes switch methods when they get wide (perhaps with hundreds of child nodes). However, method 3 is so much faster in my profiling that I have not seen a need for that. Direct changes to nodes via list operations just work, regardless of the sibling implementation in effect at the time.

A node’s child number is available, counting from either end. Options enable counting only nodes with the same nodeName, ignoring white-space-only text nodes, and/or coalescing adjacent (non-normalized) text nodes.

Schemera (DTD/schema support)

Ragnarok can read DTDs and internal subsets and make their information readily accessible. This support is collectively called Schemera even though it includes parts within several other Ragnarok components. When markup declarations are loaded they are available via a simple but pretty complete API — it’s easy to find out what’s declared or not, what the content models are (as strings, tokens, or trees), what attributes are undeclared, of a given type, have defaults, and so on.

The architecture is more informed by XML Schema than by DTD, and XSD syntax support is underway. Since the semantics are similar this will add capabilities such as importing one and exporting the other (via either Gleipnir or Bifrost), not to mention a common API. This also enables custom schema language extensions for Loki, such as supporting XSD datatypes for attributes in DTDs:

<!ATTLIST employee
    salary    float  #IMPLIED
    hireDate  gDate  #REQUIRED>

An xsdType option already enables this, and affects both schemas and documents. Loki and Schemera must recognize the XSD type names in ATTLIST declarations, while Yggdrasil must provide access to their representation. Gleipnir and Bifrost must serialize them, at present by converting to the nearest DTD types. Schemera can already check lexical formats and facets, although some other aspects of validation await the Heimdall validator.^[9] For example, salary in the example above would really need to be a floating point number.^[10] Values can also be auto-cast to the declared type when returned via SAX or DOM rather than leaving everything as strings (this should save a lot of tedium). Each XSD type knows what built-in Python type to cast to. Yggdrasil also keeps track of the order of declarations, so re-exporting doesn’t have to scramble the order.
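Auto-casting of this kind can be pictured as a table from type names to Python callables. The sketch below is an assumption about the general shape, not Schemera's code: the table `XSD_CASTS`, the function `cast_attribute`, and the small set of types are hypothetical, and the lexical rules are simplified (for example, time zones on dates are not accepted).

```python
from datetime import date

# Hypothetical mapping from XSD type names to Python casting functions.
XSD_CASTS = {
    "string": str,
    "integer": int,
    "float": float,
    "boolean": lambda s: {"true": True, "1": True,
                          "false": False, "0": False}[s],
    "date": date.fromisoformat,
}


def cast_attribute(value: str, xsd_type: str):
    """Cast a lexical value to its declared type, or raise ValueError."""
    try:
        return XSD_CASTS[xsd_type](value.strip())
    except KeyError as exc:   # unknown type name or bad boolean literal
        raise ValueError(f"cannot cast {value!r} as {xsd_type}") from exc


print(cast_attribute("52000.50", "float"))
print(cast_attribute("2021-03-15", "date"))
```

A value that does not fit its declared type raises ValueError, which is roughly the lexical check described above.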

For options discussed in this section, Yggdrasil supports the needed internal data and APIs, but only Loki can read markup for them (not Thor). For example, content models can be very slightly extended: First, there is an anyAttribute option to support an eponymous declared content type which is like ANY except that #PCDATA is not allowed (Ragnarok dislikes symmetry gaps). A repBrace option enables content models to use XSD-like minOccurs and maxOccurs via the well-known regex brace notation:

(title, para{1,3}, sec+)

Options also allow a few SGML-like items in schemas such as OMITTAG flags and exceptions (oflag). This is mainly to accommodate old SGML DTDs. However, even Loki does not support tag omission, because it generally breaks XML’s key principle of not needing a schema to parse correctly. Omitting end tags immediately before another end tag or EOF does not break that, so Loki has options for those cases.^[11] Declaring multiple element types at once (or attributes for multiple element types) is available via groupDcl plus additions for declaring global attributes via globalAttribute and XSD-like AnyAttribute via anyAttribute.

A multiPath option allows multiple SYSTEM identifiers for DOCTYPE, ENTITY, and NOTATION declarations. They are defined to be tried from left to right. This is mainly to relieve the situation where a document is passed among people with differing paths for the same things, without needing a catalog.

schemaType enables Loki and Schemera to recognize a schema language (which can also be declared as a NOTATION) specified like:

<!DOCTYPE article SYSTEM "http://docbook.org/xsd/5.0/docbook.xsd" NDATA XSD []>

Loki: a shape-changing parser

Thor parses XML. It is pure Python and uses recursive descent, so it has overt methods directly corresponding to the XML grammar and they are easy to find, examine, or debug. The interface is like expat as seen from Python and it works as a direct replacement. It reports the same WF errors. I think it also gives pretty good error messages (let me know of exceptions, please). It includes a few options, such as restricting external entities to certain source directories, depths, etc.

So far so good.

Loki is a second parser, which is backward-compatible with Thor but offers many more options ranging from case-folding on up to supporting markup representing overlap.

Security options

Several options have little relation to XML syntax but aim to improve the security of entity usage. Python documentation [Python] gives fairly dire warnings about attack vectors like entities with excessive expansion or risky system identifiers:

<!ENTITY a "Boom">
<!ENTITY b
    "&a;&a;&a;&a;&a;&a;&a;&a;&a;&a;">
<!ENTITY c
    "&b;&b;&b;&b;&b;&b;&b;&b;&b;&b;">
<!ENTITY d
    "&c;&c;&c;&c;&c;&c;&c;&c;&c;&c;">
<!ENTITY e
    "&d;&d;&d;&d;&d;&d;&d;&d;&d;&d;">
<doc>&e;</doc>

<!ENTITY pw SYSTEM "../../../etc/passwd">
&pw;

These are indeed potential attacks, although they affect most any data format that provides a macro or include capability (a very long list). There is even a Python library called defusedxml to mitigate such risks [Heimes 2023]. Thor and Loki both provide mitigation options, because such options do not affect syntax per se:

Limiting the depth and/or total length of entity expansions (MAXENTITYDEPTH and MAXEXPANSION)
Deciding whether to fetch and parse a DOCTYPE at all (extSchema)
Ruling out external entities entirely (extEntities)
Ruling out non-local URIs as system identifiers (netEntities)
Setting a white-list of directories so system ids that point elsewhere fail (entityDirs)
Allowing only (or not even) special character and/or unparsed entities (charEntities).
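The first mitigation in the list (limits on nesting depth and total expansion length) can be sketched with a small stand-alone expander. Everything here is hypothetical (`expand`, `EntityLimitError`, and the default limits); it illustrates the kind of check involved, not how Thor or Loki implement MAXENTITYDEPTH and MAXEXPANSION.

```python
import re

ENTITY_REF = re.compile(r"&([A-Za-z_][\w.-]*);")


class EntityLimitError(ValueError):
    """Raised when expansion exceeds a configured limit."""


def expand(text: str, entities: dict, max_depth: int = 8,
           max_expansion: int = 10_000, _depth: int = 0) -> str:
    """Expand &name; references, enforcing depth and length limits."""
    if _depth > max_depth:
        raise EntityLimitError("entity nesting too deep")

    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in entities:
            raise KeyError(f"undeclared entity: {name}")
        return expand(entities[name], entities,
                      max_depth, max_expansion, _depth + 1)

    result = ENTITY_REF.sub(replace, text)
    if len(result) > max_expansion:
        raise EntityLimitError("entity expansion too long")
    return result


# The entity set from the example above.
ents = {"a": "Boom", "b": "&a;" * 10, "c": "&b;" * 10,
        "d": "&c;" * 10, "e": "&d;" * 10}
print(len(expand("&d;", ents)))   # within the limit
```

With these defaults, expanding &e; would produce 40,000 characters and is refused, while a self-referential entity is stopped by the depth limit.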

Basic/practical parser options

Loki, like its namesake, is very good at shape-changing. It’s built as a subclass of Thor. A few of its basic options have already been mentioned. For example, case-folding is not XML but is fairly simple and ubiquitous — SGML and HTML both use it (SGML via separate options for element vs. entity names). XML does not offer it for users, but uses it internally (xmlns and xml prefixes, xml:stylesheet, etc.). XSD has it as a facet for datatypes. This all suggests that case-folding is pretty useful. It is also nearly trivial to implement in a parser, and not disruptive to much else. So Loki offers it as an option (actually several options: elementFold, attributeFold, entityFold, idFold, …).

However, case is not entirely trivial. I learned that some HTML versions, SGML, and Unicode all define it very slightly differently. I bit the bullet and Loki (via Runeheim) can use upper(), lower(), or full Unicode casefold() (or of course none, as in Thor). More choices can be added. Unicode also defines several normalization forms [Unicode], and those are also available to deal with ligatures, diacritics, halfwidth vs. fullwidth, etc. Likewise where specs differ in what counts as whitespace.
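The differences among these choices are easy to see with the Python standard library alone. The sketch below is illustrative only: the normalize_name function and the "LOWER" and "FOLD" keys are hypothetical (only "UPPER" appears as an option value in this paper), and Runeheim's real interface may differ.

```python
import unicodedata

# Hypothetical mapping from option values to Python's case operations.
FOLDERS = {
    "UPPER": str.upper,
    "LOWER": str.lower,
    "FOLD": str.casefold,     # full Unicode case folding
    None: lambda s: s,        # no folding, as in plain XML
}


def normalize_name(name, fold=None, form=None):
    """Apply an optional Unicode normalization form, then optional folding."""
    if form is not None:
        name = unicodedata.normalize(form, name)   # "NFC", "NFKC", ...
    return FOLDERS[fold](name)
```

For a name such as Straße, lower() keeps the sharp s while casefold() expands it to ss, so two names can match under one option and not under another. NFKC likewise replaces compatibility characters such as the fi ligature or fullwidth letters with their ordinary equivalents.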

A somewhat related option (htmlNames) enables all the usual SGML/HTML named character entities in one step. This otherwise requires 252 explicit ENTITY declarations (over 2,000 for HTML 5) – and a parser that handles them, and the time to parse them. Even if you’re using a schema language other than DTD your tools have to include DTD support to get this (or perhaps you can code a callback for unknown entities), which seems a bit silly. Like case-folding, a one-step way to enable these isn’t part of XML but it’s awfully handy.
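To give a sense of what htmlNames saves, the following sketch generates the explicit declarations that would otherwise be needed, using the HTML 4 entity table shipped in Python's html.entities module. The function name is hypothetical, and the sketch deliberately skips the names XML already predefines, since redeclaring lt and amp requires special care.

```python
from html.entities import name2codepoint

# XML predefines these, so they need no declaration.
XML_PREDEFINED = {"lt", "gt", "amp", "quot", "apos"}


def html_entity_declarations():
    """Return one ENTITY declaration per HTML 4 named character."""
    return [
        f'<!ENTITY {name} "&#{codepoint};">'
        for name, codepoint in sorted(name2codepoint.items())
        if name not in XML_PREDEFINED
    ]
```

The resulting list could be written into an internal subset, but every document would then carry a few hundred declarations that a single option makes unnecessary.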

Setting entEncoding enables an extra parameter for ENTITY declarations (thus available in Loki only), to declare the encoding for an external entity. This is especially useful because files often show up in unexpected encodings. Since Python provides an encoding parameter when opening streams, this is straightforward:

<!ENTITY chap2 SYSTEM "chap2.xml" ENCODING cp1252>

As noted earlier all options are off by default (except a few for avoiding risky entity fetches), in which case Loki has normal XML behavior just like Thor. But for those inclined to rush in where parsers fear to read, there are two ways to turn options on. First, construct a Loki parser and call its setOption() method. Second, enable options from within a document. At the moment, the settings go in the XML declaration (or perhaps I should call it the Loki declaration?). For example:

<?xml version="1.1" encoding="utf-8" elementFold="UPPER" htmlNames="yes"?>

There are many other ways this could be done. An advantage of the present approach is that it makes a document that uses extensions no longer be WF XML, while remaining extremely close — which is precisely the point. With extensions in use a document is only XML-ish, so XML parsers should reject the document rather than interpreting any extended usage incorrectly. This syntax makes them do so. It also makes the document reveal up front exactly which extensions it uses (if one is used but not enabled, Loki raises a WF error). Extended documents can be converted to regular XML: just load, then export them back out with Gleipnir. That works for most options, though not for those supporting non-hierarchical structures — those must wait for support in Yggdrasil.

Weightier extensions

Being interested in more complex document structures (and well-versed in Wall’s 3 Virtues of a Programmer [Wall, Christiansen, and Orwant 2000]), I also wanted a tool to help me examine related ideas, especially incremental support for non-hierarchical and virtual markup structures. In contrast, the security options discussed above have essentially no effect on parsing (so are in Thor), and case-folding options have only a very narrow effect (so are in Loki only). Enabling special syntax for non-hierarchical structures is a bigger deal (milestones are not a special syntax per se, but involve a special semantic of which an XML parser can be blissfully but unhelpfully unaware).

Nevertheless, non-hierarchy is not always as big a deal as it may seem at first glance. A relatively simple example is olist semantics (see Sperberg-McQueen and Huitfeldt 2000), where an element can be closed even if it isn’t current. The following is neither WF XML nor MECS syntax, but in an olist/overlap world it’s pretty clear what it must mean:

<sec><para>Quoth the raven,</para>
     <para><q>Nevermore.</para>
     <para>Except maybe on Tuesdays.</q></para>
</sec>

Because XML (and therefore XML parsers) can’t do this it is typical to either:

coerce markup to syntactically-XML-compatible workarounds like milestones, joins, standoffs, etc.; or
design an entirely new syntax.

Option 1 requires building separate checks even for many constraints that XML parsers already have most of the code for, such as ensuring that milestones are empty but paired, that starts come before ends and are of corresponding type, that end milestones lack attributes, and so on. The markup also has to be more complex: in the case above nothing is required to indicate what the q end tag pairs with, but milestones must either be co-indexed or be matched up by some interesting algorithm (not to mention that using ID/IDREF attributes for co-indexing does not involve quite the same concept as their use elsewhere).

Option 2 requires constructing a new parser (or modding an existing one, which is harder if it is not in Python when the rest of your code is or if the new syntax is not very close to XML). At that point people commonly create, implement, and debug entirely new syntaxes and tools.

A key insight is that there can be syntax for things like olist that is only a tiny step from XML. By choosing such a syntax and integrating it as an option within a parser that also supports and is tested with regular XML, most of the constraints just mentioned become trivial to add. For olist semantics, in the parser itself the difference involves little but whether the open-element data is a stack or a list. Since Loki is designed for extensibility it is very easy to add an option to remove such an item from the stack of open elements, return a SAX event for it, and continue rather than issuing a WF error. Just add an elif exactly where XML end tag processing would instead issue a WF error:

if name == tagStack[-1]:
    tagStack.pop()
    yield SAXEvent.End, name
elif options.olist and name in tagStack:
    # Plain lists have no rindex(); remove the innermost open match
    del tagStack[len(tagStack) - 1 - tagStack[::-1].index(name)]
    yield SAXEvent.End, name
else: raise SyntaxError(
    f"Unexpected end tag for '{name}' (context: {tagStack}).")

The parser change is surgical, the syntax and semantics are predictable given olist structural semantics, and everything else works as it did. When the olist option is not enabled the behavior is exactly that of XML. Put another way, Loki supports olist with a 3-line addition beyond Thor.
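A self-contained version of the same idea may help for experimentation. This is a sketch rather than Loki's code: it takes already-tokenized tags as (kind, name) pairs, uses plain strings for the event types, and the function name olist_events is hypothetical. It uses a plain Python list, which has no rindex() method, so the index of the innermost match is computed by hand.

```python
def olist_events(tags, olist=False):
    """Yield (event, name) pairs for a sequence of ("start"|"end", name) tags.

    With olist=False this enforces normal XML nesting. With olist=True an
    end tag may close any open element, not only the current one.
    """
    stack = []
    for kind, name in tags:
        if kind == "start":
            stack.append(name)
            yield "Start", name
        elif stack and name == stack[-1]:
            stack.pop()
            yield "End", name
        elif olist and name in stack:
            # Remove the innermost (most recently opened) matching element.
            innermost = len(stack) - 1 - stack[::-1].index(name)
            del stack[innermost]
            yield "End", name
        else:
            raise SyntaxError(
                f"Unexpected end tag for '{name}' (context: {stack}).")
    if stack:
        raise SyntaxError(f"Unclosed elements at end of input: {stack}.")
```

Fed the tags of the raven example, the generator raises SyntaxError at the second para end tag by default, and yields a complete event stream when olist is enabled.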

Yggdrasil cannot yet represent such structures directly, but a model that can (such as LMNL [Piez 2008, Piez 2012] or TagML [Bleeker, Buitendijk, and Haentjens Dekker 2020] or MECS [Sperberg-McQueen and Huitfeldt 2008]) can source the necessary information via Loki if desired.^[12]

Suspend and resume markup work similarly when the suspend option is enabled in Loki. It generates distinct SAX events, and a suspended element stays in the open-element stack (flagged as suspended when it is). The markup takes the form below. As with olist this requires very little code in Loki, and seems (at least to me) a fairly intuitive increment beyond XML syntax (an error is raised in cases like suspending an already suspended or closed element):

  <q>....<-q>...<+q>...</q>

In this and the prior case there could in principle be multiple q elements open and one might wish to operate on one other than the innermost. For this reason an endTagId option enables ID attributes to be repeated on suspend, resume, and/or end tags to co-identify the correct element if more than one of the same type is open (if not specified, the most recently encountered is closed):

 <p><q id="G">Say unto my people,
    <q id="J">....<-q id="G">...<+q id="G">...</q id="G">...</q id="J">...</p>

This feature can also be used with end tags in general, to co-identify the boundaries of very large elements for easier readability and more specific error reporting. If an end tag’s name matches the current element but it has an ID attribute that doesn’t, an error is raised that can specifically say where the intended (that is, matching) start tag was (or that it isn’t there at all). No other attributes are allowed in suspend, resume, or end tags. The implementation is a near-trivial re-use of the attribute-parsing code already there for starttags.

Also related to helping keep track of large element scopes, I am considering experimenting with Loki options reminiscent of SGML RANK. Many people have struggled to tag recursive structures correctly when the tags at all levels are identical (say, div/div/div instead of div1/div2/div3 or chap/sec/subsec), especially when converting from applications or data for which there are no real sections, but only headings (say, h1/h2/h3). I find it much easier to debug when something explicit relates the starts to their corresponding ends. Many schemas do this by providing different element types for each level; but it is easy to support the general notion at the syntactic rather than lexical level. Some obvious syntax possibilities are shown below; another is simply to permit a numeric suffix, which if present must match the depth:

<div@3>
<div x:level=3>
<div><?level 3?>

In semantic (not syntactic) homage to some LMNL concerns, simultaneous opens and closes are supported via the simultaneous option, using markup as shown below. This is one possible way to represent more than one element with the identical scope:^[13]

<b|i>.....</b|/i>

For the b/i case it is of course easier to create a portmanteau bi element; but the problem is far more general (e.g., <foreign+term>).

Milestone markup needs no extensions for parsing per se, though the DOM and validation implications are more complex. In the short term Yggdrasil will likely add virtual element support as mentioned earlier, leveraging Schemera syntax for declaring the relevant elements and attributes.

Other options

Loki makes options easy to experiment with. Several are available that recall SGML features, but never break the XML principle that the parser can produce the correct result with or without a schema. They also remain close to XML syntax for reasons already discussed:

</> — omit the name when closing the innermost element (emptyEnd).

Omit end tag(s) immediately before another end tag (omitEnd, about which I am ambivalent) or at EOF (omitAtEOF).

<|> is not from SGML, but potentially useful. This closes the innermost element and reopens a new instance of it (the name is optional, but checked if present). Attributes (other than an ID) get the same values, but can be changed by specifying them after the |. This restart option is reminiscent of MediaWiki tables, though it’s not quite as short.

<div id=Intro> — omit quotes when an attribute value is a name or number token (unQuotedAttribute).

A booleanAttribute option enables Boolean-valued attributes (or any, really) to be set to 1 or 0 as shown below. This addresses the don’t repeat yourself awkwardness of things like border="border" (not to mention SGML’s rule that different attributes cannot share enum values):

<td +border -underline...>

There are also minor options to work around autocorrect problems sometimes introduced by uncooperative environments. For example, the parser can be instructed to recognize comments with different delimiters: em dash as equivalent to 2 hyphens (emComments), or <#...#> (poundComments) to avoid the issue altogether (nestable comments are on the possible list). For similar reasons curlyQuote permits various fancy quotes around attributes.

Such options can be helpful to XML’s paradigmatic desperate Perl hacker who has to make some documents work right now (see the prescient and still valuable Bray 2010). Note that none of these cases change existing XML syntax to have a new meaning. They intentionally use syntax that is unused in XML (that is, syntax that is not WF). A list of parser options is provided in an appendix.

Schemera and Loki

Some Loki options affect the schema directly or indirectly. One helps with the slight inconvenience of declaring defaults and types for attributes in the subset. People seem rarely to do this to get defaulting (unless they are providing a full DTD, which some parsers do not even support):

<!DOCTYPE mySchema SYSTEM "" [
<!ATTLIST p id ID #IMPLIED
    class NMTOKENS "regular">
]>

Loki can be configured to allow setting a default within the document, on the first use on a given element type (bangAttribute):^[14]

<p foo!=bar>

With bangAttrType a datatype can also be given (this would use colon in homage to Python type-hints, but that would conflict with namespacing):

<p foo!NMTOKEN=bar>

Assuming this is the first use of attribute foo on a p, it has the same effect as:

<!ATTLIST p foo NMTOKEN "bar">

There are other Loki/Schemera options to slightly extend markup declarations. For example, the SGML feature of declaring multiple element types in a single declaration (groupDcl).

Parser APIs

Both Thor and Loki have APIs almost identical to expat’s (as viewed from Python). Handlers are assigned in the same way, with the same names and arguments. However, there are a few options here too.

For example, each attribute can be returned as a separate SAX event immediately following its start tag event (saxAttribute). This has the advantage of making every event have a fixed number of arguments rather than start tag events having a potentially unbounded list of alternating names and values as some parsers produce (though not expat for Python). It also gives room to return additional information such as whether the attribute was explicit or defaulted, or the pre-normalized and normalized forms. The default is to just do what expat does for Python, which is to return a Python dictionary of the attributes.

A slight difference from expat is that neither Loki nor Thor breaks text into separate SAX events at every newline and character entity. However, that behavior can be achieved in either by setting option expatBreaks.

Although expat always hands attribute values to handlers as strings, with attrCast Loki (with Schemera’s help) can coerce them to their declared types — so a program can have them ready to go as Python ints, floats, datetimes; or as provided custom types that map directly to XSD’s builtin types.
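The effect of attrCast can be approximated in a few lines. The sketch below is not Schemera's implementation: the CASTERS table, the cast_attributes function, and the way declared types are passed in (a plain dict from attribute name to type name) are all assumptions made for illustration.

```python
from datetime import datetime
from decimal import Decimal

# Hypothetical mapping from declared type names to Python converters.
CASTERS = {
    "int": int,
    "float": float,                      # also accepts INF and NaN
    "decimal": Decimal,
    "boolean": lambda v: {"true": True, "1": True,
                          "false": False, "0": False}[v],
    "dateTime": datetime.fromisoformat,
    "NMTOKENS": str.split,               # whitespace-separated tokens
}


def cast_attributes(attrs, declared):
    """Return a copy of attrs with values converted per their declared type.

    Attributes with no declaration, or an unknown type, stay as strings.
    """
    result = {}
    for name, value in attrs.items():
        caster = CASTERS.get(declared.get(name))
        result[name] = caster(value) if caster else value
    return result
```

A start-tag handler could apply such a function to the attribute dictionary before passing it on, so that downstream code never has to parse attribute strings itself.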

Special character handling

The inconvenience that all special-character entities must be declared has been mentioned briefly. Sets such as those defined in Annex D of SGML and in HTML are useful and widely known, but activating them in the standard ways is tedious. Such sets can be activated in Loki merely by setting an option, either via the API:

myParser.setOption("htmlNames", 1)

or via the declaration:

<?xml version="1.1" encoding="utf-8" htmlNames="1" ?>

Perl 6 (aka Raku) introduced the interesting ability to specify Unicode characters by their official names [Raku]. Loki adds a similar capability. After setting the unicodeNames option references like these are supported:

&bullet;
&GREEK.SMALL.LETTER.OMICRON.WITH.TONOS;

Case is ignored because Unicode character names do not mix case (this will likely be made subject to the entityFold option). Any of hyphen, underscore, or dot may be used instead of spaces (so may space itself when looking characters up via the API).^[15]

Some names are long. However, it turns out that abbreviating all but the last token down to the first 4 characters leads to only 11 collisions in the Unicode BMP. So abbreviations down as far as that also work, as do intermediate abbreviations:

&GREEK.SMAL.LETT.OMIC.WITH.TONOS;

In the rare case of a collision (which Loki will report), lengthen anywhere to disambiguate. For example, EQUIVALENT TO (U+0224d) and EQUIANGULAR TO (U+0225a) collide when fully abbreviated, but giving at least the first 5 characters of the first token succeeds: EQUIV.TO vs. EQUIA.TO. Further abbreviations are feasible, such as shortening SMALL LETTER OMICRON to LC O, special-casing a few frequent/long substrings like CJK UNIFIED IDEOGRAPH to CJK, or simply omitting LETTER; but you’ve already memorized the current rule.
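The abbreviation rule can be sketched with Python's unicodedata module. This is an illustration under stated assumptions, not Loki's lookup code: the lookup function is hypothetical, hyphens inside official names are treated as token separators, and only the BMP is indexed. Every token but the last may be shortened to four or more leading characters; the last token must be given in full.

```python
import re
import unicodedata

SEPARATORS = re.compile(r"[-_. ]+")


def _tokens(name):
    return SEPARATORS.split(name.upper())


# Index BMP character names by token count to keep lookups cheap.
_INDEX = {}
for _cp in range(0x10000):
    _name = unicodedata.name(chr(_cp), None)
    if _name:
        _toks = _tokens(_name)
        _INDEX.setdefault(len(_toks), []).append((_toks, chr(_cp)))


def lookup(ref, min_len=4):
    """Resolve a full or abbreviated character name to its character."""
    want = _tokens(ref)
    hits = []
    for toks, char in _INDEX.get(len(want), []):
        if toks[-1] != want[-1]:
            continue
        if all(full == abbr or
               (len(abbr) >= min_len and full.startswith(abbr))
               for full, abbr in zip(toks[:-1], want[:-1])):
            hits.append(char)
    if len(hits) != 1:
        raise LookupError(f"{ref!r} matches {len(hits)} characters")
    return hits[0]
```

An ambiguous abbreviation raises LookupError along with the number of candidates, which mirrors the collision report described above.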

I sometimes find it inconvenient that XML defines no special-character mechanism that applies inside PIs or comments. The piEscapes option makes Loki recognize and replace character references inside PIs.^[16]

Somewhat related, a piAttribute option makes Loki parse PI data as if it were an attribute list and return SAX events for PIs with the corresponding arguments (similar to start tag events). This is an obviously useful convention (XML employs it for the XML declaration, though it doesn’t offer the same thing to users). With this and the prior option users can avoid re-implementing some wheels. And finally, piAttrDcl enables ATTLIST declarations to constrain those PI quasi-attributes, merely by giving ? plus a PI target name where an element name would normally go (again applying Loki’s idea of XML-adjacent syntax, and trivially implemented by small adjustments to already-existing code):

<!ATTLIST ?ah hyphenate boolean #IMPLIED
              kern      float   "1.0">
...
<?ah hyphenate="1" kern="0.8"?>
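Without such an option, each application has to parse PI data for itself. A minimal sketch of that wheel, using only a regular expression, is shown here; the function name parse_pi_data is hypothetical, and unlike a real parser the sketch does not resolve character references or check names against the XML name rules.

```python
import re

# One name="value" or name='value' pair, with optional surrounding space.
PSEUDO_ATTR = re.compile(r"""\s*([^\s=]+)\s*=\s*(?:"([^"]*)"|'([^']*)')""")


def parse_pi_data(data):
    """Parse PI data such as 'hyphenate="1" kern="0.8"' into a dict."""
    attrs = {}
    pos = 0
    while data[pos:].strip():
        match = PSEUDO_ATTR.match(data, pos)
        if not match:
            raise SyntaxError(f"not an attribute list: {data[pos:]!r}")
        name = match.group(1)
        if name in attrs:
            raise SyntaxError(f"duplicate pseudo-attribute: {name}")
        double, single = match.group(2), match.group(3)
        attrs[name] = double if double is not None else single
        pos = match.end()
    return attrs
```

PI data that does not follow the convention, such as raw troff commands, is rejected rather than silently misread.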

Those accustomed to backslashed hexadecimal character codes can choose to enable a backslash option to recognize \n, \\, \xFF, \uFFFF, \U000FFFFF, etc. (though not \u{FFF} or \u{BULLET}, yet). As one might expect, \< and \& are then ok.
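The backslash forms listed above can be decoded with a single regular expression. The sketch below is an assumption about reasonable behavior, not Loki's code: the unescape function and the set of single-character escapes it accepts are hypothetical, and unknown escapes are treated as errors.

```python
import re

# Single-character escapes accepted by this sketch.
SIMPLE = {"n": "\n", "t": "\t", "r": "\r", "\\": "\\", "<": "<", "&": "&"}

ESCAPE = re.compile(
    r"\\(?:x([0-9A-Fa-f]{2})|u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8})|(.?))",
    re.DOTALL)


def unescape(text):
    """Replace backslash escapes in text with the characters they denote."""
    def replace(match):
        hex_digits = match.group(1) or match.group(2) or match.group(3)
        if hex_digits:
            return chr(int(hex_digits, 16))
        char = match.group(4)
        if char not in SIMPLE:
            raise ValueError(f"unknown escape: \\{char}")
        return SIMPLE[char]
    return ESCAPE.sub(replace, text)
```

In a real parser the replacement of \< and \& would have to happen before markup recognition, so that the resulting characters are treated as data.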

Runeheim (character set management)

Runeheim is a separate package that is the realm of the XML orthography: what characters are what and how names are formed. Many programs have slightly inaccurate lists of name- and name-start characters; some support only ASCII. Runeheim provides the correct lists in various forms, as well as full regexes to match things like QNames, and functions like isXmlQName().

The lists of (for example) name and name-start characters are built automatically from the literal hex ranges given in the XML Recommendation (with a choice of 1.0 or 1.1). This makes them readable, less error-prone, and easy to update as Unicode grows. Runeheim can also be used in isolation to provide name testing, normalization, etc. to any Python callers.
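The approach of building character classes from the Recommendation's own notation can be sketched as follows. The ranges are those of the NameStartChar and NameChar productions in XML 1.0 (Fifth Edition); the helper names char_class and is_xml_name are hypothetical and are not Runeheim's API.

```python
import re

# Productions copied in the notation used by the XML Recommendation.
NAME_START = """
    ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF]
    | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F]
    | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD]
    | [#x10000-#xEFFFF]
"""
NAME_EXTRA = """ "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040] """

ITEM = re.compile(
    r'"(.)"'                                     # quoted literal
    r"|\[#x([0-9A-Fa-f]+)-#x([0-9A-Fa-f]+)\]"    # hex range
    r"|\[(.)-(.)\]"                              # literal range
    r"|#x([0-9A-Fa-f]+)")                        # single hex code point


def char_class(spec):
    """Translate spec notation into the body of a regex character class."""
    parts = []
    for lit, lo, hi, clo, chi, single in ITEM.findall(spec):
        if lit:
            parts.append(re.escape(lit))
        elif lo:
            parts.append(f"\\U{int(lo, 16):08X}-\\U{int(hi, 16):08X}")
        elif clo:
            parts.append(f"{clo}-{chi}")
        else:
            parts.append(f"\\U{int(single, 16):08X}")
    return "".join(parts)


_START = char_class(NAME_START)
_REST = _START + char_class(NAME_EXTRA)
NAME_RE = re.compile(f"[{_START}][{_REST}]*")


def is_xml_name(text):
    """True if text matches the XML 1.0 Name production."""
    return NAME_RE.fullmatch(text) is not None
```

Because the ranges are kept in the form in which the specification prints them, checking them against the source or updating them for a new edition is a matter of comparing text.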

There are centralized routines for checking various kinds of names, as well as for escaping and unescaping text for each context. In contrast, many APIs only provide escaping as for text content, and miss double hyphens in comments, ?> in PIs, etc.; neglect ]]>; or overdo >. The Ragnarok serializers (see next) can be configured to map characters to HTML or Unicode named characters, or to decimal or hex numerics of controllable case and width. Runeheim also provides types such as NMTOKEN_t for use as type-hints, making modern tools such as linters and Python compilers more effective and the code more readable.

Gleipnir and Bifrost (serializers)

Within Yggdrasil, Ragnarok provides a replacement for minidom’s toprettyxml(), called Gleipnir after the magical binding used to restrain the great wolf Fenrir. It does the same things as regular toprettyxml, and is what you get if you call that in Yggdrasil. However, it can also take a FormatOptions object that encapsulates a wide range of options including how to break around tags, attributes, and text; text wrapping; indentation; use of CDATA (Yggdrasil can track which text nodes came in as CDATA sections); and how to do escaping (HTML names, decimal, or hex; padding width; case of hex; ...). A list of inline-ish tags can also be set, preventing line-breaks around them. Pre-made default and canonical FormatOptions objects are provided. Gleipnir does not generate XML that uses any of Loki’s syntax extensions; just normal fully-conforming XML.

Ragnarok also provides another serialization, called Bifrost after the bridge connecting Asgard and Midgard. It serializes an entire XML document to valid JSON. The JSON conventions applied are called JBook, and a small sample is in an appendix. Bifrost can also read the resulting JSON back to produce the same DOM. The conversion is complete: it covers not just elements and attributes, but marked sections, comments, PIs, namespaces, optionally the DTD, and so on. JBook is similar to the form presented in [DeRose 2014a] but more fully developed. Text nodes become JSON strings, and other nodes become JSON lists. The first item in each such list is a dict that includes the nodeName and attributes (or other data for non-Elements), and the rest of the list consists of the children.
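The core of this mapping is small enough to sketch against minidom. The code below is not Bifrost: it follows the conventions visible in the Appendix B sample ("~" for the node name, "#cdata", "#comment", "#pi", and "#target"), omits the document-level header and DTD, and the function name to_jbook is hypothetical.

```python
import json
from xml.dom import minidom


def to_jbook(node):
    """Convert a minidom node to a JBook-like structure of lists and strings."""
    if node.nodeType == node.CDATA_SECTION_NODE:
        return [{"~": "#cdata"}, node.data]
    if node.nodeType == node.TEXT_NODE:
        return node.data
    if node.nodeType == node.COMMENT_NODE:
        return [{"~": "#comment"}, node.data]
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE:
        return [{"~": "#pi", "#target": node.target}, node.data]
    if node.nodeType != node.ELEMENT_NODE:
        raise TypeError(f"unsupported node type: {node.nodeType}")
    head = {"~": node.nodeName}
    head.update(node.attributes.items())       # attribute name -> value
    return [head] + [to_jbook(child) for child in node.childNodes]


def element_to_json(xml_text):
    """Parse xml_text and serialize its document element as JSON."""
    doc = minidom.parseString(xml_text)
    return json.dumps(to_jbook(doc.documentElement), ensure_ascii=False)
```

Since every non-text node becomes a list whose first item is a dict, a reader can distinguish text from structure by JSON type alone, which is what makes the reverse conversion straightforward.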

The first round-trip (DOM → JBook → DOM or JBook → DOM → JBook) can change details such as attribute quoting, type of escaping for special characters, etc. After that, however, round-tripping all day makes no further changes. In other words, the conversion loop is idempotent. I spent some time searching for a prior XML → JSON conversion that covered all constructs, and could find none (much less any that were also idempotent, much less readable).

Testing

Ragnarok includes a large Python unittest suite which automates testing for many cases. Coverage is about 75% so far. The code is thoroughly hinted and linted. Every Node subclass has a checkNode() method which can test (recursively or not, at option) many invariants such as the nodeType vs. class, consistency of child and sibling information, well-formedness of names, etc.

Testing includes very long Unicode names, wide and deep documents, and even enormous numbers of leading zeros on numeric character references (it turns out xmllint requires a special option for depths over 256 or names over 64K).

Thor, Loki, and expat are tested with both minidom and Yggdrasil on top, just as Yggdrasil is tested with Thor, Loki, and expat below. This ensures that they are closely compatible, although there is a lot of additional testing specific to extensions.

The Gleipnir serializer will likely learn some of the parser’s options, if only to help feed the testing process.

Futures

Work is underway on a model validator, Heimdall, that leverages regex processors. To accomplish that, each distinct element type is assigned a private-use Unicode character. Those characters are substituted for the element type names in content models, while commas and whitespace are dropped. For example (but using Greek instead of private use characters for readability), the content model in

<!ELEMENT chapter (title, para+, (note|section|para)*)>

becomes something like

(αβ+(γ|∂|β)*)

If the sequence of child types for a given element instance undergoes the same mapping, a normal regex match against the transformed model achieves the correct validation result. Loki/Schemera’s {} syntax for XSD-like minOccurs and maxOccurs within content models also works fine this way. #PCDATA and actual text nodes are set aside except for a single test for whether they’re permitted (for full SGML support, a bit more would be needed). Attributes already can be checked via Schemera.
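The technique can be tried out with a short sketch. It is not Heimdall: the class and method names are hypothetical, #PCDATA and the {} repetition syntax are not handled, and unknown child types are mapped to a character that no model can contain so that they always fail.

```python
import re

NAME = re.compile(r"[A-Za-z_][\w.\-]*")


class ModelValidator:
    """Validate child-element sequences against DTD-style content models."""

    def __init__(self):
        self.codes = {}       # element type name -> private-use character

    def code(self, name):
        if name not in self.codes:
            self.codes[name] = chr(0xE000 + len(self.codes))
        return self.codes[name]

    def compile(self, model):
        """Turn a content model such as "(title, para+)" into a regex."""
        body = NAME.sub(lambda m: self.code(m.group()), model)
        body = re.sub(r"[,\s]", "", body)     # sequence is just adjacency
        return re.compile(body)

    def is_valid(self, pattern, child_names):
        """True if the sequence of child type names satisfies the model."""
        encoded = "".join(self.codes.get(n, "\uffff") for n in child_names)
        return pattern.fullmatch(encoded) is not None
```

The operators ?, *, +, | and parentheses mean the same thing in content models and in regular expressions, so they pass through untouched; only names, commas, and whitespace need translating.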

This suffices for validating a DOM structure in hand. However, partial validation is needed in cases like editing (and for much of the DOM 3 Validation module): Perhaps an element is valid so far, but may (or even must) have more. For example, a model like

<!ELEMENT dl (dt, dd)*>

is a partial but not a complete match after each dt. After each dd it’s both a partial and a complete match (well, during parsing one must wait until after </dl>). The built-in regex processor in Python does not support partial matching. However, a common 3rd-party regex processor for Python (regex) does, and seems to work fine for this.

Also in progress is a binary format for nodes and documents, called Sleipnir after Odin’s 8-legged horse. Nodes are represented as 8 fixed-size fields that encapsulate the information needed to support DOM operations while being trivial to aggregate and access. It is loosely modeled on [DeRose et al. 1996], but supports Unicode, namespaces, and dynamic modification.

Finally, I’m looking into adding a new Node subclass specifically for binary objects. This would enable images, sound, CAD models, etc. to live directly in Yggdrasil, much like Blobs live in databases. They can be added via the API, via referring to unparsed (NDATA) entities, or perhaps also directly via a special marked section type such as <![BASE64[…]]>.

ID handling

Yggdrasil indexes IDs, with or without case and similar normalization. They do not (yet) update individually if the tree is modified (in fairness, minidom’s don’t either). The index can, however, be discarded and rebuilt without rebuilding the whole DOM.

There is a common problem of specifying just which attributes are IDs. XML thankfully reserves xml:id but many schemas assign their own. Such cases cannot be detected reliably without a schema. Ragnarok also provides for configuring what is recognized (and indexed) as IDs via the API, based on the name and namespace of the attribute and of the element it’s on. Multiple definitions and wildcards are possible.

Other ID-like semantics will likely be added. For example:

an option to allow namespace prefixes on ID values (NAMESPACEID)
options and distinct attribute declared types for milestones like <q_start/>...<q_end/> (COID)
Trojan milestones where start and end milestones use different attribute names (STARTID and ENDID)
suspend/resume co-indexing (the last two plus SUSPENDID and RESUMEID)
unique values constructed by accumulating ID values from ancestors, like hierarchical section-numbering (STACKID)
SQL-like compound IDs constructed by evaluating an XPath or similar expression (COMPOUNDID)

The types enable checking associated structural semantics such as that starts come before ends, milestone elements are empty, etc. The declarations must differ slightly to distinguish when the role is flagged by Loki syntax, element type, attribute name, or potentially other mechanisms.

Element declarations will have ways to declare whether they are suspendable, milestonable, olist-endable, etc. (though some of these may merely be inferred from what attribute types they declare and use).

Like attribute defaults and named special characters, IDs introduce semantics that require a schema to understand (though of course not to parse). They thus bump up against XML’s general principle that documents are correctly parsable without a DTD. This can be avoided by always using xml:id, but that’s slightly unwieldy.

XML inherits SGML’s restriction that there can be only one ID attribute for any given element type.^[17] Given that uniqueness it would be feasible to support a special syntax instead of a special attribute type. One example could be suffixing IDs to the element type: <p#p31> vs. <p xml:id="p31">. Such an IDSuffix option would save space and perhaps improve readability while avoiding the need for explicit declarations.

A longer view

I am working on support for XPointer ranges [Clark and DeRose 2002] and on DOM-integrated support for overlap features and potentially XCONCUR [Schonefeld 2008]. I and others have long lamented the lack of annotation, versioning, and collaborative editing capabilities both on the Web and in XML generally (see DeRose 2014b), for which overlap support is prerequisite.^[18] By promoting such structures into a separate but coordinated structure from the DOM itself, some of these capabilities might be brought within closer reach (see also DeRose and van Dam 1999):

Switching between document views (including simultaneous/parallel/variorum views)
Switching non-hierarchical markup between various in-line and out-of-line representations
Having metadata on individual edits, such as voting, acceptance, etc., and a well-defined process for making a true new version
Maintaining connections across such versions so one can answer questions like Where did this sentence go?
Making such annotation modular/orthogonal with respect to the main document’s schema(s)

I have also done some work to support mapping language and/or orthography codes to the Unicode ranges they use (an option to be called langChecking). This will check that text content comports with the xml:lang value in effect in its context. Short of that, there are already options to prohibit C0 control characters (other than cr/lf/tab), C1 control characters (which commonly arise via incorrect character-set handling), and/or private use characters.
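The three existing character restrictions amount to simple range checks. The sketch below shows one way to express them; the function name and keyword arguments are hypothetical stand-ins for the noC0, noC1, and noPrivateUse options, and the private-use ranges include planes 15 and 16 as well as the BMP block.

```python
import re

# Character-class bodies; the re module interprets the escapes.
C0 = r"\x00-\x08\x0b\x0c\x0e-\x1f"      # C0 controls except tab, LF, CR
C1 = r"\x80-\x9f"
PRIVATE_USE = r"\ue000-\uf8ff\U000f0000-\U000ffffd\U00100000-\U0010fffd"


def forbidden_chars(text, no_c0=False, no_c1=False, no_private=False):
    """Return (offset, code point) pairs for each prohibited character."""
    body = ((C0 if no_c0 else "") + (C1 if no_c1 else "")
            + (PRIVATE_USE if no_private else ""))
    if not body:
        return []
    pattern = re.compile(f"[{body}]")
    return [(m.start(), f"U+{ord(m.group()):04X}")
            for m in pattern.finditer(text)]
```

A stray U+0092, for example, is a common sign that cp1252 text was read as Latin-1, which is the kind of character-set mishandling the C1 check is meant to surface.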

Why try Ragnarok?

Yggdrasil is almost twice as fast as minidom (especially for building).
Yggdrasil is more up-to-date than minidom.
Ragnarok is in pure Python, thus relatively easy to modify, extend, or (heaven forbid!) debug.
The API includes lots of Pythonic conveniences.
It’s modular, so parts can be used separately for other things.
Loki can be easily configured with useful extensions such as character-set flexibility and overlap support.

Why not try it?

There are still bugs (probably mostly involving namespaces and parameter entities).
Some pieces (like Heimdall the validator, Sleipnir the binary DOM, and particular extensions) are not complete.
You don’t like experimenting.
You don’t use Python.

Availability

The entire suite including the test suite will shortly be available on my github page at https://github.com/sderose/Ragnarok/tree/master/. Error reports, suggestions, etc. are welcome. Pythonistai (or perhaps Níðhǫggsfólk?) are also welcome to contribute.

Appendix A. Examples of extended syntax

<?xml version="1.1" encoding="utf-8"?>
<!DOCTYPE article SYSTEM "c:\schemas\balisage.dtd"
    "/home/tg/balisage.dtd" NDATA DTD [
<!ELEMENT article - - (head, div{1,10})>
<!ENTITY % soup "i|b|tt|strike">
<!ATTLIST (%soup;) #ANY  CDATA #IMPLIED>
<!ELEMENT div (#PCDATA | %soup;)*>
<!ATTLIST * id    ID    #IMPLIED>
<!ATTLIST p width float #IMPLIED
            sId   COID  #IMPLIED
            eID   COID  #IMPLIED>
<!ATTLIST ?troff  width  int16 "60">
]>
<!— See that emdash? —>
<article +conf xml:lang=en:us -final class=“x y”>
  <div>
    <table +border>
      <tr><td just!str=left>DTD<|>No<|>Yes<|>3.1
    </table>
    <p><q>Never<-q>, said the(\U0001f42) <+q>, really.</Q>
    <| width=Inf>&therefore; &nbsp;
           <b|i>No more</b|/i>.</>
    <![IGNORE[ Why not? ]]>
    <?troff width="12"?>

Appendix B. Example of JBook from Bifrost

[ { "~":"JBook", "#jbookversion":"0.9",
    "#xmlversion":"1.0", "#encoding":"utf-8", "#standalone":"yes",
    "#doctype":"html", "#systemId":"http://w3.org/html" },
    [ { "~":"html",
        "xmlns:html":"http://www.w3.org/1999/xhtml" },
        [ { "~":"head" },
            [ { "~":"title" }, "My document" ]
        ],
        [ { "~":"body" },
            [ { "~":"p", "id":"stuff", "class":"lead" },
                "This is a ", [ { "~":"i" }, "very" ], " short document.",
                [ { "~":"#cdata" }, "This is some <real> & <legit/> cdata." ]
            ],
            [ { "~":"#pi", "#target":"troff" }, ".ss;.b" ],
            [ { "~":"#comment" }, "This is <real> commentary." ],
            [ { "~":"hr" } ]
        ]
    ]
]

Appendix C. Dominµs/Yggdrasil vs. regular DOM

Table II

DOM	Yggdrasil
x.nodeType == Node.ELEMENT_NODE	x.isElement
if n >= len(x.childNodes): x.appendChild(newNode) else: x.insertBefore(newNode, x.childNodes[n])	x.insert(n, newNode)
found = 0 for ch in x.childNodes: if ch.nodeName == "p": if found == 3: return ch found += 1 return None	return x["p":3]
newEl = doc.createElement("p") for a, v in someAttrs.items(): newEl.setAttribute(a, v)	newEl = doc.createElement("p", someAttrs)
newEl = doc.createElement("p") s1 = doc.createElement("speech") s1.setAttribute("spkr", "Essex") s1.appendChild(doc.createTextNode("Goodbye")) newEl.appendChild(s1) if len(x.childNodes) > 0: x.insertBefore(newEl, x.childNodes[0]) else: x.appendChild(newEl)	x.insertAdjacentXML(RelPosition.begin, """<p><speech spkr="Essex">Goodbye</speech></p>""")
if (x.compareDocumentPosition(y) < 0):... # only in DOM 3	if x << y:...
if x.isEqualNode(y):...	if x == y:...

Appendix D. Loki and Schemera options

Options whose names are in [brackets] are not yet implemented.

Table III

Name	Default	Description
Security and limits
MAXEXPANSION	1<<20	Total maximum entity length
MAXENTITYDEPTH	16	Maximum entity nesting depth
charEntities	True	Allow special character entities?
extEntities	True	Allow external entities?
[netEntities]	False	Allow off-localhost entities?
entityDirs	None	Permitted dirs to see? (None means any)
extSchema	True	Fetch and process external schema?
Case and Unicode
elementFold	None	Normalizer for element names
attributeFold	None	... for attribute names
entityFold	None	... for entity names
keywordFold	None	... for XML reserved words (CDATA, etc.)
[xsdFold]	None	... for XSD values (true, INF, etc.)
[wsDef]	XML	Definition of whitespace (vs HTML or WhatWG)
noC0	False	Prohibit C0 control chars like XML 1.0
noC1	False	Prohibit C1 control chars
noPrivateUse	False	Prohibit Private Use chars
langChecking	False	Check chars used vs. xml:lang
Schemas
schemaType	False	<!DOCTYPE foo SYSTEM "" NDATA x>
fragComments	False	<!ELEMENT foo -- really? -- (p|q)*>
Elements
groupDcl	False	<!ELEMENT (x|y|z)...>
oflag	False	<!ELEMENT - O para...>
sgmlWord	False	Allow CDATA RCDATA etc.
[mixel]	False	Allow declared content ANYELEMENT
mixins	False	Allow SGML-ish inclusion/exclusion dcls
repBrace	False	Allow {min,max} in content models
emptyEnd	False	Allow </>
omitEnd	False	Can omit end tag before another
omitAtEOF	False	Can omit end tags at EOF
restart	False	<|> to close and reopen current element
restartName	False	<|name> to close and reopen multiple elements
[rankAnnots]	None	Permit "@" and a number on tags (cf SGML RANK)
[endTagID]	False	Permit repeat of ID on end tags
[entEncoding]	False	Permit ENCODING parameter on external entity declarations
Beyond hierarchy
[multiTag]	False	<div/title>...</title/div>
simultaneous	False	<b|i> </i|/b>
suspend	False	<x>...<-x>...<+x>...</x>
olist	False	Allow closing non-current elements
Attributes
saxAttribute	False	Generate separate SAX event per attribute
globalAttribute	False	<!ATTLIST * ...>
anyAttribute	False	<!ATTLIST para #ANY...>
xsdType	False	Allow XSD attribute types in DTD
[xsdPlural]	False	XSD types can be pluralized
[attrCast]	False	Return attrs in declared types
specialFloat	False	Nan, Inf, etc.
unQuotedAttribute	False	<p x=foo>
noAttributeNorm	False	Suppress whitespace normalization of undeclared attributes^[19]
curlyQuote	False	Allow fancy quotes for qlits
booleanAttribute	False	<x +border -foo>
booleanIsName	False	booleanAttributes return name/"", not "1"/"0"
bangAttribute	False	!= on first use to set default
[bangAttributeType]	False	!type= to set datatype and default
[IDSuffix]	False	#ID can be appended to start tag name
[COID]	False	Add attribute type to co-index milestones
[NAMESPACEDID]	False	IDs can have ns prefix
[STACKID]	False	ID is accumulated from ancestors
Validation (beyond WF)
useDTD	False	Fetch and parse external DTD if available
valElementNames	False	Elements must be declared
[valModels]	False	Full content model checking
valAttributeNames	False	Attributes must be declared
valAttributeTypes	False	Attributes must match dcl datatype
Entities and special characters
htmlNames	False	Enable HTML/Annex D named character refs
unicodeNames	False	Enable Raku-like Unicode named character refs
multiPath	False	Allow multiple SYSTEM IDs
[multiSDATA]	False	<!SDATA nbsp 160 z 0x9D...>
backslash	False	\n \xff \uffff (not yet \\x{})
Other
saveMarkup	False	Preserve source markup form on output (where practical)
expatBreaks	False	Break at \n and entities like expat
emComments	False	Treat emdash as -- for comments
poundComments	False	<#...#> alternative for comments
[piEscapes]	False	Recognize character refs in PIs
piAttribute	False	Parse/return PI data like attributes
[piAttrList]	False	Declare PI attrs (<!ATTLIST ?target ...>)
[nsUsage]	None	Limit namespaces to one/global/noredef/regular
[MSTypes]	False	Allow Marked sections beyond CDATA
[extraDcl]	False	Allow extra XML Declarations in documents referenced as entities
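
As an illustration of what the two numeric limits at the top of Table III guard against, the following sketch expands general entity references while enforcing a total-length budget and a nesting-depth cap. It is not Loki's code: only the option names and default values come from the table, while the function expand, the exception class, and the simplified name pattern are assumptions made for the example.

```python
import re

MAXEXPANSION = 1 << 20     # default from Table III: total expanded length
MAXENTITYDEPTH = 16        # default from Table III: entity nesting depth

# Simplified pattern for &name; references (numeric references do not match).
_REF = re.compile(r"&([A-Za-z_][\w.-]*);")


class EntityLimitError(ValueError):
    """Raised when expansion would exceed one of the configured limits."""


def expand(text, entities, max_len=MAXEXPANSION, max_depth=MAXENTITYDEPTH):
    """Expand &name; references recursively, enforcing both limits."""
    remaining = max_len

    def walk(s, depth):
        nonlocal remaining
        if depth > max_depth:
            raise EntityLimitError("entity nesting exceeds max_depth")
        pieces = []
        # split() with one capture group alternates literal text and names.
        for i, part in enumerate(_REF.split(s)):
            if i % 2:                          # odd items: entity names
                pieces.append(walk(entities[part], depth + 1))
            else:                              # even items: literal text
                remaining -= len(part)
                if remaining < 0:
                    raise EntityLimitError("expansion exceeds max_len")
                pieces.append(part)
        return "".join(pieces)

    return walk(text, 0)


entities = {"a": "x&b;x", "b": "yy"}
result = expand("1&a;2", entities)
```

With these limits a "billion laughs" style chain of entities fails as soon as the running total passes the budget, and a self-referential entity fails at the depth cap instead of recursing indefinitely. An undeclared entity simply raises KeyError in this sketch.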

References

[Biron and Malhotra 2004] Biron, Paul V. and Ashok Malhotra. 28 October 2004. XML Schema Part 2: Datatypes Second Edition. W3C Recommendation. http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/.

[Bleeker, Buitendijk, and Haentjens Dekker 2020] Bleeker, Elli, Bram Buitendijk and Ronald Haentjens Dekker. 2020. Marking up microrevisions with major implications: Non-linear text in TAG. Presented at Balisage: The Markup Conference 2020, Washington, DC, July 27 - 31, 2020. In Proceedings of Balisage: The Markup Conference 2020. Balisage Series on Markup Technologies, vol. 25 (2020). doi:https://doi.org/10.4242/BalisageVol25.Bleeker01.

[Boyer and Marcy 2008] Boyer, John and Glenn Marcy. 2 May 2008. Canonical XML Version 1.1. W3C Recommendation. http://www.w3.org/TR/2008/REC-xml-c14n11-20080502/.

[Bray, Paoli, and Sperberg-McQueen 1998] Bray, Tim, Jean Paoli, and C.M. Sperberg-McQueen. 1998. Extensible Markup Language (XML) 1.0. W3C Recommendation. World Wide Web Consortium, 10 February 1998. https://www.w3.org/TR/1998/REC-xml-19980210.

[Bray 2010] Bray, Tim. 2010. D.P.H. https://www.tbray.org/ongoing/When/201x/2010/07/21/DPH.

[Carlisle and Ion 2015] Carlisle, David and Patrick Ion. 2015. XML Entity Definitions for Characters. W3C Working Draft (3rd Edition). https://www.w3.org/Math/characters/unicode.xml.

[Clark and DeRose 1999] Clark, James and Steve DeRose. 1999. XML Path Language (XPath) Version 1.0. W3C Recommendation. World Wide Web Consortium, 16 November 1999. https://www.w3.org/TR/1999/REC-xpath-19991116.

[Clark and DeRose 2002] Clark, James and Steve DeRose. 2002. XML Pointer Language (XPointer). W3C Working Draft. World Wide Web Consortium, 16 August 2002. https://www.w3.org/TR/xptr/.

[DeRose et al. 1996] DeRose, Steven et al. 1996. Data processing system and method for representing, generating a representation of and random access rendering of electronic documents. U.S. patent number 5,557,722 (expired). https://uspto.report/patent/grant/5,557,722.

[DeRose 1999] DeRose, Steven J. 1999. What Do Those Weird XML Types Want, Anyway? Keynote address, VLDB ’99, Edinburgh. In VLDB ’99: Proceedings of the 25th International Conference on Very Large Data Bases: 721-724. Morgan Kaufmann. ISBN 1-55860-615-7. https://dl.acm.org/doi/10.5555/645925.671670, https://www.vldb.org/dblp/db/conf/vldb/DeRose99.html.

[DeRose 2014a] DeRose, Steven J. 2014. JSOX: A Justly Simple Objectization for XML: Or: How to do better with Python and XML. Presented at Balisage: The Markup Conference 2014, Washington, DC, August 5 - 8, 2014. In Proceedings of Balisage: The Markup Conference 2014. Balisage Series on Markup Technologies, vol. 13 (2014). doi:https://doi.org/10.4242/BalisageVol13.DeRose02.

[DeRose 2014b] DeRose, Steven J. 2014. What do we still lack? Or: Prolegomena to any future hypertext system. Presented at Symposium on HTML5 and XML, Washington, DC, August 4, 2014. In Proceedings of the Symposium on HTML5 and XML. Balisage Series on Markup Technologies, vol. 14 (2014). doi:https://doi.org/10.4242/BalisageVol14.DeRose01.

[DeRose and van Dam 1999] DeRose, Steven J. and Andries van Dam. 1999. Document structure and markup in the FRESS hypertext system. In Markup Languages: Theory & Practice 1.1: 7-32.

[Diewald and Stührenberg 2013] Diewald, Nils, and Maik Stührenberg. 2013. An extensible API for documents with multiple annotation layers. Presented at Balisage: The Markup Conference 2013, Montréal, Canada, August 6 - 9, 2013. In Proceedings of Balisage: The Markup Conference 2013. Balisage Series on Markup Technologies, vol. 10 (2013). doi:https://doi.org/10.4242/BalisageVol10.Diewald01.

[Heimes 2023] Heimes, Christian (maintainer). 2023. defusedxml. Pypi library, version 0.7.1, https://pypi.org/project/defusedxml/.

[Le Hors et al. 2000] Le Hors, Arnaud, Philippe Le Hégaret, Lauren Wood, Gavin Nicol, Jonathan Robie, Mike Champion, and Steve Byrne. 2000. Document Object Model (DOM) Level 2 Core Specification. W3C Recommendation. World Wide Web Consortium, 13 November 2000. https://www.w3.org/TR/2000/REC-DOM-Level-2-Core-20001113/.

[Peters 2004] Peters, T. 2004. PEP 20 — The Zen of Python. Python Enhancement Proposals. Retrieved from https://peps.python.org/pep-0020.

[Piez 2008] Piez, Wendell. 2008. LMNL in Miniature. Presented at the LMNL Workshop, Amsterdam, December 2008.

[Piez 2012] Piez, Wendell. 2012. Luminescent: parsing LMNL by XSLT upconversion. Presented at Balisage: The Markup Conference 2012, Montréal, Canada, August 7 - 10, 2012. In Proceedings of Balisage: The Markup Conference 2012. Balisage Series on Markup Technologies, vol. 8 (2012). doi:https://doi.org/10.4242/BalisageVol8.Piez01.

[Python] Python. Retrieved 2025-03-20. XML Processing Modules (Python 3.11.11 documentation). https://docs.python.org/3.11/library/xml.html.

[Raku] Raku. Retrieved 2025-03-19. Unicode: Unicode support in Raku (Raku documentation). https://docs.raku.org/language/unicode.

[Schonefeld 2008] Schonefeld, Oliver. 2008. An event-centric API for processing concurrent markup. Presented at Balisage: The Markup Conference 2008, Montréal, Canada, August 12 - 15, 2008. In Proceedings of Balisage: The Markup Conference 2008. Balisage Series on Markup Technologies, vol. 1 (2008). doi:https://doi.org/10.4242/BalisageVol1.Schonefeld01.

[Sperberg-McQueen and Huitfeldt 2000] Sperberg-McQueen, C.M., and Claus Huitfeldt. 2000. GODDAG: A Data Structure for Overlapping Hierarchies. In Digital Documents: Systems and Principles. PODDP 2000, pp. 139-160. Lecture Notes in Computer Science, vol. 2023. Springer-Verlag, Berlin, Heidelberg. doi:https://doi.org/10.1007/978-3-540-39916-2_12.

[Sperberg-McQueen and Huitfeldt 2008] Sperberg-McQueen, C.M., and Claus Huitfeldt. 2008. Markup Discontinued: Discontinuity in TexMecs, Goddag structures, and rabbit/duck grammars. Presented at Balisage: The Markup Conference 2008, Montréal, Canada, August 12 - 15, 2008. In Proceedings of Balisage: The Markup Conference 2008. Balisage Series on Markup Technologies, vol. 1 (2008). doi:https://doi.org/10.4242/BalisageVol1.Sperberg-McQueen01.

[Thompson et al. 2004] Thompson, Henry S., David Beech, Murray Maloney, and Noah Mendelsohn. 2004. XML Schema Part 1: Structures Second Edition. W3C Recommendation. World Wide Web Consortium, 28 October 2004. https://www.w3.org/TR/2004/REC-xmlschema-1-20041028/.

[Unicode Annex #15] The Unicode Consortium. 2015. Unicode Standard Annex #15: Unicode Normalization Forms. Unicode Standard Annex. Retrieved 2025-03-21. https://www.unicode.org/reports/tr15/tr15-43.html.

[Unicode] The Unicode Consortium. Principles of the Unicode Standard. Section in The Unicode® Standard: A Technical Introduction. https://www.unicode.org/reports/tr15/tr15-43.html.

[Unicode, Section 2.2] The Unicode Consortium. 2024-09-10. Unicode Design Principles. Section 2.2 in The Unicode® Standard, Version 16.0 — Core Specification. https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-2/#G128.

[Wall, Christiansen, and Orwant 2000] Wall, Larry, Tom Christiansen and Jon Orwant. 2000. Programming Perl (3rd ed.). O’Reilly Media. (The three virtues of a programmer — laziness, impatience, and hubris — are referenced in Chapter 27 Perl Culture.)

[WHATWG] WHATWG. 2025. HTML Living Standard. Web Hypertext Application Technology Working Group. https://html.spec.whatwg.org/multipage/. Retrieved 2025-07-22.

[Selectors Level 3] World Wide Web Consortium. 2009. Selectors Level 3. W3C Recommendation. World Wide Web Consortium, 15 December 2009. https://www.w3.org/TR/2009/REC-css3-selectors-20091215/.

^[1] Besides its value for readability and debugging, type-hinting helps Python compilers and performance tools optimize for even more speed.

^[2] Or the more Pythonic but (imho) still unwieldy isinstance(newNode, Node.ELEMENT_NODE).

^[3] Unlike Yggdrasil, Thor does not support case-ignoring, because that would violate XML syntactic rules. However, Loki does.

^[4] Except the -or-self ones, which use the regular axis method with an includeSelf=True option. Symmetry suggests offering this for all axes except self.

^[5] XPath 2 uses << and >> instead, so I added those too.

^[6] There are also related libraries, such as ElementPath for very basic XPath, and Davide Brunato’s https://github.com/brunato. I plan to make Ragnarok very easy to connect to his elementpath (for modern XPath) and xmlschema.

^[7] Yggdrasil actually defines textContent separately on various classes, obviating the type-tests entirely. This seems to me clearer. According to https://developer.mozilla.org/en-US/docs/Web/API/Node/nodeValue, textContent returns .nodeValue for CDATA, Comment, PI, and Text (the CharacterData subclasses); None for Document and Doctype; and, for other node types (including Element), the concatenation of the textContent of every child node, excluding comments and processing instructions. DocumentFragment has nodeValue None, but its textContent is the concatenation. For attributes both return the attribute value, but I’ve omitted that case because ElementTree doesn’t have attributes as a class one could hang a method on, while DOM and Yggdrasil just need a trivial addition.

^[8] Iterating through a node’s children (though very common) does not benefit, because the list of children is typically explicit in the parent node.

^[9] There is attention to details such as case for special values (true, false, nan, inf, etc.). This seems to me valuable since the case for IEEE values does not match that used in XSD or in some programming languages.

^[10] Date and time types can convert back and forth with the Python datetime types.

^[11] The EOF case is especially useful for log files, so one can just append records without having to remove and replace the end-marker every time as in XML, JSON, etc. See “Poor performance when writing large XML-based log file” (Stackoverflow).

^[12] I plan to add a library to support more complex logical (virtual?) elements (Draupnir, after a magical ring that duplicates itself). This would include Schemera support to identify and check various overlap syntaxes; a DOM-based API to make them look as much like regular elements as feasible (such as for text extraction and search, order and containment tests, etc); and conversion between various overlap representations. Rendering becomes fairly easy again: CSS-style inheritance works pretty well for olists: Nevermore is in sec/para/q, but Except is in sec/q/para. It may prove helpful to introduce distinct start syntax and SAX events for virtual elements; perhaps <<para>> or <para*> and START_VIRTUAL; otherwise it can be necessary to restructure the DOM slightly when an olist close or suspend event is found later.

^[13] The semantics can get complicated. For basic purposes tools could merely treat such tags like the corresponding sequence or like a single tag with a funny type name. But full support would require more. For example, validity might require that both flips be valid, or perhaps introduce entirely new kinds of constraints.

^[14] This is not like the SGML #CURRENT feature. It must occur only on one instance of the element/attribute pair, and then becomes the permanent default. Also, this is an early experiment and may vanish or change. One can, after all, put ATTLIST declarations in the internal subset, though that seems to me awkward when using a different primary schema language.

^[15] I am considering adding support for other sets of names, such as TEX conventions. Sebastian Rahtz long ago assembled perhaps the definitive list of standardized character names, reflected for example in [Carlisle and Ion 2015].

^[16] I may add this for comments, too, and/or SGML-like RCDATA marked sections (nested, keyworded, and parameter-entity-based marked sections also have an option, and are partially supported).

^[17] I see no principled reason for this restriction, unlike its reverse (that there be only one element bearing a given ID value).

^[18] It is worth noting that MS Word’s annotation feature supports overlap, yet is fitted to XML in the saved docx files. The representation is not unreasonable.

^[19] This and saveMarkup were inspired by an email from Syd Bauman.

Author's keywords for this paper:

Python; XML; DOM; Markup Systems
