How to cite this paper

Durusau, Patrick. “Deferred Well-Formedness and Validity: Change.log, Collaboration, Immutability, XML, UUIDs.” Presented at Balisage: The Markup Conference 2021, Washington, DC, August 2 - 6, 2021. In Proceedings of Balisage: The Markup Conference 2021. Balisage Series on Markup Technologies, vol. 26 (2021). https://doi.org/10.4242/BalisageVol26.Durusau01.

Balisage: The Markup Conference 2021
August 2 - 6, 2021

Balisage Paper: Deferred Well-Formedness and Validity

Change.log, Collaboration, Immutability, XML, UUIDs

Patrick Durusau

Independent Consultant

Patrick Durusau is the Chair of the OASIS Open Document Format for Office Applications (OpenDocument) TC and has been a member of that TC since its initial meeting on December 16, 2002. His employer/sponsor has changed several times over the years and Patrick has been a co-editor/editor of the OpenDocument Format (ODF) for the majority of that time. Patrick is also the project editor for the ISO/IEC mirror of ODF as ISO/IEC 26300.

Patrick blogs about topic maps (being one of the co-editors of ISO 13250-5), other semantic issues and of late, how irregular forces can leverage data for their causes at Another Word for It.

Abstract

This proposal emerges out of conversations about introducing collaborative editing into OpenDocument Format (ODF) applications, as a type of change tracking.[1] Vis-a-vis a document, a lone author is a lesser and included case of collaborative editing. In either case, changes have to be captured, along with their metadata, and reconciled in the case of conflicting edits.

Despite progress on the software side of collaborative editing for a variety of formats, there has been no visible progress on the capturing of changes, or their reconciliation, in OpenDocument Format documents. Being habituated, not to say addicted, to markup approaches, it's understandable that I find the lack of format discussions disquieting. It's all well and good to have change tracking/collaborative editing working successfully in software, but what the hell am I going to write down in ODF?[2]

How to capture changes, from one or many authors, and how to capture reconciliations are the focus of this proposal. That requires unique identification of changes (one or many authors), identifying where changes may be applied, and recording the application of changes (the resulting document).

Table of Contents

Introduction
Change Log
Identification of Changes (2), Proposed Changes (5), Location of Proposed Changes (6)
Acceptance or Denial of Change
Well-formed and Valid (finally)
Conclusion

Introduction

Thanks of this kind are usually consigned to a footnote, but I want to thank reviewers #1, #2, and #3 here for saving you from a poorly written and likely boring presentation. I attempted to write in the grandiose voice of tech papers instead of saying what I have found and why I find it persuasive, without a lot of hand waving or convoluted arguments. Any grandiosity, hand waving, or convoluted argument that remains in this paper and/or presentation remains because I failed to take their advice. A large round of thanks for Balisage reviewers!

As I say in the abstract, I view the problem of collaborative editing to be a superset of change tracking for a single editor. That being the case, what works for the larger use case should suffice for the lesser. Moreover, they should share a common syntax. Exceptions, "or" statements, seem to trouble programmers, so one goal of the proposed format is to treat all cases the same. No exceptions.

The requirements of change tracking in XML are well known enough to not require citation:

  1. change log

  2. identification of the change (for acceptance or denial)

  3. author of the change

  4. date of change

  5. proposed change

  6. location of proposed change

  7. acceptance or denial of the change

  8. date of acceptance or denial of change

  9. acceptor or denier of change

  10. a well-formed and, in the case of ODF, valid XML document for presentation to the parser

Implementations may choose to optimize this information internally. What is presented here assumes verbosity is not an issue.

Change Log

The genesis of the idea of using a change log to capture proposed changes to an ODF document came about from discussions of Operational Transformations (OT).[3] Not that OT has a log such as proposed here, but capturing proposed changes requires a means of recording them.

As we will see later, capturing proposed changes separately from the content.xml file allows us to avoid the question of how to capture changes while at the same time maintaining well-formedness and validity. In fact, conflicting changes can be captured when held separately from the document instance. But that doesn't answer the question of how to uniquely identify changes from random authors.

Identification of Changes (2), Proposed Changes (5), Location of Proposed Changes (6)

Identification of changes, proposed changes, and the location of proposed changes all share one difficulty: how do we coordinate uncoordinated editing of documents? That is to say, authors may be online simultaneously, online separately, or even offline and still editing the same document. Before we even reach reconciliation, how do we reliably distinguish one edit from another?

Fortunately, the problem of uncoordinated identification was solved outside the markup world, under the unwieldy title: Information technology – Procedures for the operation of object identifier registration authorities: Generation of universally unique identifiers and their use in object identifiers, Recommendation ITU-T X.667.[4]

Recommendation ITU-T X.667 defines the concept of generating "universally unique identifiers (UUIDs)" and specifies procedures for their generation. The details of generation need not delay us, but the introduction lays the groundwork for incorporation of UUIDs as part of a change tracking log for ODF documents:

This Recommendation | International Standard standardizes the generation of universally unique identifiers (UUIDs).

UUIDs are an octet string of 16 octets (128 bits). The 16 octets can be interpreted as an unsigned integer encoding, and the resulting integer value can be used as the primary integer value (defining an integer-valued Unicode label) for an arc of the International Object Identifier tree under the Joint UUID arc. This enables users to generate object identifier and OID internationalized resource identifier names without any registration procedure.

...

If generated according to one of the mechanisms defined in this Recommendation | International Standard, a UUID is either guaranteed to be different from all other UUIDs generated before 3603 A.D., or is extremely likely to be different (depending on the mechanism chosen).

No centralized authority is required to administer UUIDs. Centrally generated UUIDs are guaranteed to be different from all other UUIDs centrally generated.

A UUID can be used for multiple purposes, from tagging objects with an extremely short lifetime, to reliably identifying very persistent objects across a network, particularly (but not necessarily) as part of an object identifier or OID internationalized resource identifier value, or in a uniform resource name (URN).

With a near guarantee (check with your lawyers) of uniqueness until 3603 C.E. (that is beyond the end of 32-bit Unix time, if you are interested), the identification of changes in a change log with a UUID (that's GUID for people from Redmond) looks good.
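The quoted description of a UUID as 16 octets that can be read as a single unsigned integer can be checked with a few lines of Python. This is only a sketch using the standard library `uuid` module; the UUID value is borrowed from the `id` attribute of the United States Code excerpt quoted in the Conclusion, and the `2.25` prefix is the Joint UUID arc mentioned in the Recommendation.

```python
import uuid

# Example UUID: the id from the US Code excerpt quoted in the
# Conclusion, with its "id" prefix removed.
u = uuid.UUID("92be36ef-db2e-11eb-bf11-e2f53ffbac53")

# A UUID is an octet string of 16 octets (128 bits).
octets = u.bytes
assert len(octets) == 16

# The same 16 octets interpreted as one unsigned integer.
as_int = int.from_bytes(octets, byteorder="big")
assert as_int == u.int

# Object identifier under the Joint UUID arc (2.25), formed
# without any registration procedure.
oid = f"2.25.{as_int}"

# The uniform resource name (URN) form of the same UUID.
urn = u.urn

print(oid)
print(urn)
```

Nothing here depends on a registry or a network connection, which is the property the change log relies on.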

But it's not just the identification of changes, what of the identification of elements within a change? And the poor editor who is editing off-line, how does he align his changes against an ever changing XML tree?

What if all ODF elements used their xml:ids to hold UUIDs, prefixed by "odf" so as to be a valid xml:id? The author of any ODF element, and anyone with whom that element has been shared, knows a unique xml:id for addressing that element, whether to put material before or after it or to delete it. What's more, offline editors generate their own unique xml:ids, enabling them to edit both the XML elements known to them and the XML elements they have created.

That scenario presumes that xml:ids are immutable and change logs are append only, but why not? Memory is for all practical intents and purposes unlimited, so we need not keep acting like we are all editing XML on XT clones. Not to mention that databases (I know, document crowd, but you have heard of databases, yes?) routinely use UUIDs. If anything, we are behind the curve on using them in connection with XML documents.
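The "odf" prefix and the immutability of xml:ids described above can be sketched in Python. The function names `new_odf_id` and `stamp` are hypothetical, the choice of random (version 4) UUIDs is an assumption, and the regular expression is a simplified ASCII-only stand-in for the full NCName rule that xml:id values must satisfy.

```python
import re
import uuid
import xml.etree.ElementTree as ET

# Attribute name ElementTree uses for xml:id.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Simplified, ASCII-only approximation of NCName (assumption: the
# real production allows many more characters). The key point is
# that a name may not start with a digit.
NCNAME_ASCII = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def new_odf_id() -> str:
    """Return a fresh UUID prefixed with "odf" (hypothetical helper)."""
    return "odf" + str(uuid.uuid4())


def stamp(element: ET.Element) -> str:
    """Give an element an xml:id only if it has none; never overwrite."""
    if XML_ID not in element.attrib:
        element.set(XML_ID, new_odf_id())
    return element.get(XML_ID)


# A bare UUID that starts with a digit is not a valid xml:id value.
bare = "92be36ef-db2e-11eb-bf11-e2f53ffbac53"
print(bool(NCNAME_ASCII.fullmatch(bare)))          # False
print(bool(NCNAME_ASCII.fullmatch("odf" + bare)))  # True

# An offline editor can stamp new elements with no coordination.
paragraph = ET.Element("p")
first_id = stamp(paragraph)
second_id = stamp(paragraph)  # immutable: the same id comes back
print(first_id == second_id)  # True
```

Because `stamp` never replaces an existing xml:id, anyone who has seen an element can keep addressing it in later log entries.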

Acceptance or Denial of Change

To just rough in the syntax of a changelog.txt file, at this point we have:

  odfUUID author date insert (node|nodes) items before location

or

  odfUUID author date insert (node|nodes) items after location

or

  odfUUID author date delete (node|nodes) location

In order to avoid creating unnecessary difficulties, insert and delete operations should be at element boundaries. If a deletion crosses a paragraph boundary, for example, the deletion should be of the text nodes and not the beginning paragraph element.
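The three patterns above can be sketched as a small record type in Python. Everything beyond the field order is an assumption made for illustration: the class and function names are hypothetical, fields are separated by tabs, the literal keyword "node" is always written, and items are assumed to contain no tabs or newlines.

```python
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


def new_odf_id() -> str:
    """Fresh "odf"-prefixed UUID for a log entry (hypothetical helper)."""
    return "odf" + str(uuid.uuid4())


@dataclass(frozen=True)  # frozen: an entry never changes once written
class Change:
    change_id: str
    author: str
    date: str
    op: str                         # "insert" or "delete"
    location: str                   # xml:id of the target element
    items: Optional[str] = None     # serialized node(s), insert only
    position: Optional[str] = None  # "before" or "after", insert only

    def to_line(self) -> str:
        """Serialize in the field order of the patterns above."""
        head = [self.change_id, self.author, self.date, self.op, "node"]
        if self.op == "insert":
            tail = [self.items, self.position, self.location]
        elif self.op == "delete":
            tail = [self.location]
        else:
            raise ValueError(f"unsupported operation: {self.op}")
        return "\t".join(head + tail)


def now() -> str:
    return datetime.now(timezone.utc).isoformat()


def insert(author: str, items: str, position: str, location: str) -> Change:
    if position not in ("before", "after"):
        raise ValueError("position must be 'before' or 'after'")
    return Change(new_odf_id(), author, now(), "insert", location, items, position)


def delete(author: str, location: str) -> Change:
    return Change(new_odf_id(), author, now(), "delete", location)


# Placeholder location; a real one would be an odf-prefixed UUID.
print(insert("alice", "<p>new text</p>", "after", "odf-example").to_line())
print(delete("bob", "odf-example").to_line())
```

Appending each line to the end of the file, and never rewriting earlier lines, keeps the log append only.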

In terms of representation, not as a constraint on execution, I propose the use of XQuery 3.1 Update primitives, but only insert and delete.

One pattern that follows the recordation of changes format could be:

  odfUUID author date accept/deny odfUUID (of change accepted or denied)

That serves to identify the acceptor of a change or deletion, separate from its original author.
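An accept/deny record of this shape can be written and read back with very little code. The sketch below assumes tab-separated fields and uses hypothetical function names and short placeholder identifiers in place of real odf-prefixed UUIDs. It only collects decisions per change; it does not decide which decision wins.

```python
from collections import defaultdict


def decision_line(record_id, decider, date, verdict, target_change_id):
    """Build one accept/deny line in the field order of the pattern above."""
    if verdict not in ("accept", "deny"):
        raise ValueError("verdict must be 'accept' or 'deny'")
    return "\t".join([record_id, decider, date, verdict, target_change_id])


def decisions_by_change(lines):
    """Map each change id to the (verdict, decider, date) records about it."""
    decisions = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Decision records have exactly five fields; other lines are skipped.
        if len(fields) == 5 and fields[3] in ("accept", "deny"):
            _record_id, decider, date, verdict, target = fields
            decisions[target].append((verdict, decider, date))
    return dict(decisions)


# Placeholder ids for readability; real entries would carry UUIDs.
log = [
    decision_line("odf-d1", "bob", "2021-08-02", "accept", "odf-c1"),
    decision_line("odf-d2", "carol", "2021-08-03", "deny", "odf-c1"),
    decision_line("odf-d3", "bob", "2021-08-03", "accept", "odf-c2"),
]

for change_id, records in decisions_by_change(log).items():
    print(change_id, records)
```

Keeping conflicting decisions side by side, rather than overwriting one with the other, matches the append-only character of the log.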

Well-formed and Valid (finally)

Assuming we have an append only change log and immutable xml:ids, how does that get us to a well-formed and valid XML document to feed to an XML parser? Good question!

The beginning state of the XML document is recorded, just like any other change, following the pattern:

  odfUUID author date insert (node|nodes) items

except that it has no "location" value. It is the starting state of the document and all changes will be recorded against the nodes in that starting state.

A version of the document is captured in the change log as follows:

  odfUUID author date odfUUIDs (separated by commas)

The list of odfUUIDs, when those operations are performed, results in a well-formed and valid ODF document for presentation to an XML parser.

There is no constraint on changelog.txt to prevent there being multiple versions of the same document, representing differing decisions about what changes to accept or reject.
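How a version record turns into a document can be sketched by replaying the listed changes in order. This is an illustration under several assumptions, not an implementation of the proposal: log entries are held as Python dictionaries rather than parsed from changelog.txt, the identifiers are short placeholders rather than UUIDs, the element names are not ODF, and only well-formedness is checked (validating against the ODF schema is outside what the standard library offers).

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def find_with_parent(root, xml_id):
    """Return (parent, index) of the element carrying the given xml:id."""
    for parent in root.iter():
        for index, child in enumerate(parent):
            if child.get(XML_ID) == xml_id:
                return parent, index
    raise KeyError(xml_id)


def replay(log, version_ids):
    """Apply the listed changes in order and return the resulting root."""
    root = None
    for change_id in version_ids:
        change = log[change_id]
        if change["op"] == "insert" and "location" not in change:
            # Starting state: an insert with no location.
            root = ET.fromstring(change["items"])
        elif change["op"] == "insert":
            parent, index = find_with_parent(root, change["location"])
            offset = 0 if change["position"] == "before" else 1
            parent.insert(index + offset, ET.fromstring(change["items"]))
        elif change["op"] == "delete":
            parent, index = find_with_parent(root, change["location"])
            del parent[index]
    return root


log = {
    "odf-c1": {"op": "insert",
               "items": '<doc xml:id="odf-e0"><p xml:id="odf-e1">one</p></doc>'},
    "odf-c2": {"op": "insert", "items": '<p xml:id="odf-e2">two</p>',
               "position": "after", "location": "odf-e1"},
    "odf-c3": {"op": "delete", "location": "odf-e1"},
}

# Two versions of the same document from the same log.
version_a = replay(log, ["odf-c1", "odf-c2"])
version_b = replay(log, ["odf-c1", "odf-c2", "odf-c3"])

print([p.text for p in version_a])  # ['one', 'two']
print([p.text for p in version_b])  # ['two']

# The serialized result parses again, so it is well-formed.
ET.fromstring(ET.tostring(version_b))
```

The log itself is never modified by a replay, so both versions remain available from the same file.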

Conclusion

The immutability of xml:ids used in this proposal introduces several advantages that may not be immediately evident. One of the primary ones is that any editor capable of producing a pointer to an xml:id can submit both edits and comments to a document, so long as it exists in digital form. In cases where public comment is sought but later not included in final publication, the attachment of that content is never lost.

The same longevity of annotations and comments is true when a document is purged of such notes when shared with others, but you want to restore the notes when the document, perhaps edited, returns to your possession.

Horror stories of editorial comments leaking out can be avoided automatically with this proposal because the changelog will be zero bytes for a document with no changes. Pristine for distribution, as it were.

There are whispers that programmers don't like to preserve xml:ids, but we know in fact that quite large document systems can and do, every day. Consider this example:

  <section style="-uslm-lc:I80" id="id92be36ef-db2e-11eb-bf11-e2f53ffbac53" identifier="/us/usc/t26/s107">
  <num value="107">§ 107.</num>
  <heading> Rental value of parsonages</heading>

Part of title 26, Internal Revenue Code, from the Office of the Law Revision Counsel, United States Code[5]

Serious publishers have no objections to UUIDs. Why should you?

References

[Schubert 2014] Svante Schubert, Sebastian Rönnau, and Patrick Durusau. 2014. Interoperable Document Collaboration. In Proceedings of the 2nd International Workshop on (Document) Changes: modeling, detection, storage and visualization (DChanges '14). Association for Computing Machinery, New York, NY, USA, Article 6, 1–4. doi:https://doi.org/10.1145/2723147.2723155.

[Schubert 2019] Svante Schubert. 2019. The Next Millennium Document Format. In Proceedings of the ACM Symposium on Document Engineering 2019 (DocEng '19). Association for Computing Machinery, New York, NY, USA, Article 40, 1–4. doi:https://doi.org/10.1145/3342558.3345419.

[ITU-T X.667 2012] Recommendation ITU-T X.667 http://handle.itu.int/11.1002/1000/11746



[1] See: Svante Schubert, Sebastian Rönnau, and Patrick Durusau. 2014. Interoperable Document Collaboration. In Proceedings of the 2nd International Workshop on (Document) Changes: modeling, detection, storage and visualization (DChanges '14). Association for Computing Machinery, New York, NY, USA, Article 6, 1–4. DOI:https://doi.org/10.1145/2723147.2723155, and Svante Schubert. 2019. The Next Millennium Document Format. In Proceedings of the ACM Symposium on Document Engineering 2019 (DocEng '19). Association for Computing Machinery, New York, NY, USA, Article 40, 1–4. DOI:https://doi.org/10.1145/3342558.3345419.

[2] The ODF TC specifies a document format, not how the document is processed. See the OASIS Open Document Format for Office Applications (OpenDocument) TC.

[3] See: Svante Schubert, Sebastian Rönnau, and Patrick Durusau. 2014. Interoperable Document Collaboration. In Proceedings of the 2nd International Workshop on (Document) Changes: modeling, detection, storage and visualization (DChanges '14). Association for Computing Machinery, New York, NY, USA, Article 6, 1–4. DOI:https://doi.org/10.1145/2723147.2723155.

[5] This is where I stole the idea to prepend a string to a UUID to make it a valid xml:id

Author's keywords for this paper:
OpenDocument Format (ODF); XML; Change Tracking; XQuery; XQuery Update