Divide and Conquer: can we handle complex markup simply?

Robin La Fontaine

Abstract

Cultural Heritage markup can quickly become complex because of the need to represent multiple, and even overlapping, hierarchical structures. It can therefore become very difficult to maintain correctly. This talk suggests that a better approach is now possible: markup that is designed to represent different aspects of a text could be handled separately from the point of view of checking and maintenance, and then only combined into a single document when needed, e.g. for some kind of analysis. Advances in comparison and merge tools for XML make this a possibility.

Introduction and Background

Our cultural heritage is important, and we can learn from it. In looking at better ways of handling cultural heritage documents using structured markup, there is an opportunity also to learn from computer science ‘heritage’. Although many things in computer science are changing very rapidly, lessons can be learned from past mistakes and experiences, and what is deemed to be a new approach is often an old approach revisited.

One of the purposes of cultural heritage markup is to represent many variants of a document in a single document. The variation may be in how it is marked up, or in the text itself. This can lead to very complex markup, which can become extremely difficult to manage without very good tools. Indeed, as the information content becomes richer, the difficulty of handling the complexity increases. Schmidt describes this very well [1], [2] and proposes that one way to solve it is to keep the variants separate and merge them as needed. He points out, however, that this is not a simple task.

The purpose of this short paper is twofold. Firstly, to note that this approach of divide and conquer has been used in similar situations very successfully. Secondly, to summarize developments in the area of XML comparison and merge, developed primarily for other purposes, that relate to and may help in this area.

An example of Divide and Conquer

The cultural heritage markup problem has similarities to the handling of multiple versions in other areas of computer science. An example of this is a project for handling the documentation of a complex data model, using a version controlled relational database. Although this work was done some twenty-five years ago, the lessons learned remain pertinent.

The purpose of this project was to document a complex data model, and have this reviewed by subject matter experts. The problem was that these experts were, as they always are, short of time, and therefore we wanted to ensure that their time was well spent in review. However, the model had to be developed and reviewed over many different versions in order to make sure that it was correct, and therefore we needed to present the subject matter experts with successive versions as these were developed. The experts clearly wanted to know what had changed, rather than reviewing the whole document again. This was in an era when tracked changes had not even been thought about, and certainly good word-processing technology was not widely available.

The documentation was therefore put into a relational database, which was versioned so that each successive version was recorded and identified. Using what was called a 4GL, or fourth-generation language [3], it was possible to write a report that generated the full documentation with an indication of which parts had been updated since the previous version. (As an aside, it turned out to be impossible to parameterize the 4GL reports sufficiently, and therefore large sections had to be duplicated and slightly modified, resulting in a very large number of lines of code, which eventually became impossible to maintain.) In terms of the result that was produced, the project was very successful and the subject matter experts were pleased because they were able to review only the changes.

As more versions of the document were added to this database, it became more and more difficult to maintain the integrity of the database. It was extremely difficult, for example, to remove a particular version from the database, or even to make updates to the latest version. This was partly due to inadequate tools, but it was fundamentally difficult because whenever something was changed, it had to be duplicated first and all the versioning information set up correctly.

To get round this problem, a new approach was adopted. Rather than working directly on the versioned database, each new version of the documentation was created independently of it. It was then possible to write a script that added this independently created documentation back into the versioned database as a new version. Because this step was automated, its correctness could be assured. Using this approach, it became far easier to create a new version of the document while still maintaining the versioned documentation that was required by the subject matter experts.
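The idea of adding an independently created version back into a versioned store can be sketched in a few lines. The following is only an illustration, not the original 4GL system: the table layout, the function names `add_version` and `changes`, and the use of SQLite are all assumptions made for this example. Each version is supplied as a plain mapping from item name to text, and the script works out the change status by comparing with the previous version. Items that disappear between versions are not tracked in this sketch.

```python
import sqlite3


def add_version(conn, version, items):
    """Store `items` (name -> text) as `version`, flagging what changed.

    Hypothetical schema for illustration: one row per item per version.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS doc ("
        "version INTEGER, name TEXT, body TEXT, status TEXT, "
        "PRIMARY KEY (version, name))"
    )
    # Load the previous version so the new one can be compared against it.
    previous = dict(
        conn.execute(
            "SELECT name, body FROM doc WHERE version = ?", (version - 1,)
        )
    )
    for name, body in items.items():
        if name not in previous:
            status = "added"
        elif previous[name] != body:
            status = "changed"
        else:
            status = "unchanged"
        conn.execute(
            "INSERT INTO doc VALUES (?, ?, ?, ?)", (version, name, body, status)
        )
    conn.commit()


def changes(conn, version):
    """Return (name, status) pairs a reviewer needs to look at."""
    rows = conn.execute(
        "SELECT name, status FROM doc "
        "WHERE version = ? AND status != 'unchanged'",
        (version,),
    )
    return sorted(rows.fetchall())
```

The point of the design is that authors never touch the versioning columns: they edit a simple, unversioned copy, and only the script writes to the versioned table.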

It was quite a simple idea, but it made an increasingly complex situation much easier to handle. There are some parallels with cultural heritage markup, so this approach is worth pursuing based on this similar past experience.

Application to Cultural Heritage Markup

We will now consider how this approach applies to cultural heritage markup. If we could work on the representation of a particular variant of the document, this would have relatively simple markup, which could be validated using conventional XML tools. There would not be a need for overlapping hierarchy, and possibly not even text variations. If we could then combine these simpler variants into a single document, using markup to show structural and text variations, we would still be able to publish the rich information that cultural heritage markup provides.

One of the advantages of this approach would be that we would not need to keep all the variants together in a single document all the time, but rather we would combine only those variants that were relevant to a particular publishing scenario. In addition, we could combine two related variants in order to check their integrity with respect to each other.

In order to achieve this simplification, there are some significant challenges in performing the merge, as noted by Schmidt. The merge would need to be based on comparison, but it would be important to align the text independently of the structural markup. That said, some of the markup may be important for alignment and therefore a flexible comparison approach is needed. Traditional text comparison tools are line-based and do not understand the markup, so they are unsuitable for this work. XML comparison is traditionally guided by the document structure, and again this is not suitable unless it can be made more flexible. A prerequisite is therefore the ability to distinguish between structurally significant markup, i.e. markup that is an important divider in terms of alignment, and structurally insignificant markup, i.e. markup that should be ignored for alignment.
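The distinction between structurally significant and insignificant markup can be illustrated with a small sketch. It is not the algorithm of any particular product: the function names `alignment_tokens` and `align`, and the choice of Python's standard `difflib` for the alignment, are assumptions made for this example. Each document is flattened into a sequence of words; only elements named in `significant` contribute boundary markers, while all other tags are left out of the sequence used for alignment.

```python
import difflib
import xml.etree.ElementTree as ET


def alignment_tokens(xml_text, significant):
    """Flatten a document into words plus markers for significant elements."""
    tokens = []

    def walk(elem):
        keep = elem.tag in significant
        if keep:
            tokens.append(f"<{elem.tag}>")
        tokens.extend((elem.text or "").split())
        for child in elem:
            walk(child)
            # Text following a child belongs to the parent's content.
            tokens.extend((child.tail or "").split())
        if keep:
            tokens.append(f"</{elem.tag}>")

    walk(ET.fromstring(xml_text))
    return tokens


def align(a_xml, b_xml, significant):
    """Align two documents on text and significant markup only."""
    a = alignment_tokens(a_xml, significant)
    b = alignment_tokens(b_xml, significant)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (op, a[i1:i2], b[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
    ]
```

With `significant={"p"}`, comparing `<doc><p>The <b>quick</b> fox</p></doc>` with `<doc><p>The quick <i>red</i> fox</p></doc>` reports only the inserted word `red`; the differing `b` and `i` tags do not disturb the alignment. A real tool would also need to carry the ignored tags through to the result rather than discard them.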

Once the text has been aligned, it is then necessary to have a representation of the overlapping structural hierarchy in a form that is suitable for conversion into cultural heritage markup, e.g. TEI [4]. The representation of overlapping hierarchies is a difficult problem, and quite a number of papers have been presented at this conference and others about it [5].

Developments in XML-aware Comparison and Delta Representation

XML-aware comparison understands the structure of the XML, and therefore uses this when aligning two documents. Where the XML elements represent structure that is significant to the alignment process, this approach is appropriate. However, XML element tags are also used to mark up formatting information, and it is usually desirable not to show text changes when only formatting changes have been made. We therefore end up with a mixture of XML structure, some of which is significant and needs to be considered in the alignment process, and other elements that are not significant and need to be ignored in the alignment process. The ignored elements need to be represented in the final result and not lost. This, of course, typically leads to overlapping hierarchy.

Ignoring for a moment cultural heritage markup, for regular structured documents in formats such as DITA[6] or DocBook[7], it is not generally necessary to be able to represent overlapping hierarchy. However, it is often desirable to be able to distinguish between textual changes and formatting changes, and a good representation of overlapping hierarchy enables such a distinction to be made in a delta file[8].

Another requirement of conventional structured document comparison is the need to control the alignment of specific elements within the document. This can be achieved by assigning keys to these elements, and ensuring that these keyed elements are aligned in preference to the alignment of any other elements. The use of keys enables a very reproducible and controllable merge.
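Keyed alignment can be sketched as follows. This is a simplified illustration under stated assumptions: the key is taken from an attribute (here `id`, chosen for the example), only the direct children of the root are considered, and two elements count as the same if their tag, attributes and text content agree. The function name `align_by_key` is hypothetical.

```python
import xml.etree.ElementTree as ET


def _content(elem):
    """Simplified fingerprint: tag, attributes and concatenated text."""
    return (elem.tag, dict(elem.attrib), "".join(elem.itertext()))


def align_by_key(a_xml, b_xml, key="id"):
    """Pair up child elements by key value, regardless of their position."""
    a = {e.get(key): e for e in ET.fromstring(a_xml) if e.get(key) is not None}
    b = {e.get(key): e for e in ET.fromstring(b_xml) if e.get(key) is not None}
    result = []
    for k, elem in a.items():
        if k not in b:
            result.append((k, "deleted"))
        elif _content(elem) == _content(b[k]):
            result.append((k, "unchanged"))
        else:
            result.append((k, "modified"))
    # Keys that appear only in the second document are additions.
    result.extend((k, "added") for k in b if k not in a)
    return result
```

Because elements are paired by key rather than by position, reordering keyed elements between the two documents does not change which elements are compared with each other, which is what makes the outcome reproducible.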

We are therefore moving to a situation where generic XML delta formats are able to represent not only changes to textual information, and the simple addition or deletion of complete elements, but also the presence or absence of XML tags around portions of text. We are getting close to the ability to take multiple variants of the document, where the text is similar but not necessarily the same, and where markup may be completely different, and merge these into a single document where the variants are represented in XML in a generic form, without loss of information. To validate that no information is lost, it should be possible to generate all of the original documents from the merged document.
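The round-trip check described above can be illustrated with a toy merge. This is not the DeltaV2 format or any existing delta format: the segment representation, the variant labels `A` and `B`, and the function names are all assumptions made for this sketch. Tags and words are treated alike as tokens, so the presence or absence of a tag around a portion of text is recorded in the same way as a textual change, and whitespace is not preserved.

```python
import difflib
import re

# A token is either a complete tag or a run of non-space, non-tag characters.
TOKEN = re.compile(r"<[^>]+>|[^<\s]+")


def tokens(xml_text):
    """Split serialized XML into tag tokens and word tokens."""
    return TOKEN.findall(xml_text)


def merge(a, b):
    """Merge two token lists into segments labelled with their variants."""
    segments = []
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            segments.append((frozenset("AB"), a[i1:i2]))
        else:
            if i1 < i2:
                segments.append((frozenset("A"), a[i1:i2]))
            if j1 < j2:
                segments.append((frozenset("B"), b[j1:j2]))
    return segments


def extract(segments, variant):
    """Regenerate one variant's token list from the merged segments."""
    return [
        tok
        for members, toks in segments
        if variant in members
        for tok in toks
    ]
```

For example, merging `tokens("<p>The <b>quick</b> fox</p>")` with `tokens("<p>The quick red fox</p>")` yields segments from which `extract(merged, "A")` and `extract(merged, "B")` reproduce the two input token lists exactly, which is the no-loss property the text asks for.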

The original purpose of this generic delta format was to be able to generate derivatives for a variety of different purposes. For example, it would be possible to show where text has been changed, and distinguish this from where formatting has been changed so that those who are interested in the one are not confused by the other. It is also possible to ignore certain types of change in an intelligent way.

These advances may have applications useful to the cultural heritage markup community.

Conclusions

This short paper has described how a divide-and-conquer approach was adopted for a complex version-controlled relational database, designed to support documentation of a data model. The version-controlled database became too complex to manage directly, but success was achieved by working on each version independently and only then adding it to the versioned database. This gave very useful results that were simply not possible at the time with any other approach.

The paper has explored parallels between this and the management of cultural heritage markup and shown that advances in techniques to compare and merge structured XML documents mean that a similar approach could be applied.

A purpose of these short talks is to explore different approaches to existing problems. The question for discussion and feedback from the audience is how useful this approach would be to the Cultural Heritage markup community.

References

[1] Schmidt, Desmond. “The role of markup in the digital humanities.” Historical Social Research 37 (2012), 3, pp. 125-146. URN: http://nbn-resolving.de/urn:nbn:de:0168-ssoar-378369

[2] Schmidt, Desmond. “Merging Multi-Version Texts: a Generic Solution to the Overlap Problem.” Balisage Series on Markup Technologies, vol. 3 (2009), http://www.balisage.net/Proceedings/vol3/html/Schmidt01/BalisageVol3-Schmidt01.html, doi:https://doi.org/10.4242/BalisageVol3.Schmidt01

[3] Informix 4GL: https://en.wikipedia.org/wiki/Informix-4GL

[4] TEI: Text Encoding Initiative, http://www.tei-c.org/index.xml

[5] Marcoux, Yves, Michael Sperberg-McQueen, Claus Huitfeldt. “Modeling overlapping structures.” Balisage Series on Markup Technologies, vol. 10 (2013), http://www.balisage.net/Proceedings/vol10/html/Marcoux01/BalisageVol10-Marcoux01.html, doi:https://doi.org/10.4242/BalisageVol10.Marcoux01

[6] OASIS Darwin Information Typing Architecture (DITA) TC, https://www.oasis-open.org/committees/tc_home.php?wg_abbrev=dita

[7] OASIS DocBook TC, https://www.oasis-open.org/committees/tc_home.php?wg_abbrev=docbook

[8] Overlapping Hierarchies in DeltaV2 Format, http://www.deltaxml.com/support/documents/deltav21

Robin La Fontaine

Robin is the founder and CEO of DeltaXML. He holds an Engineering Science degree from Oxford University and an MSc in Computer Science. His background includes computer aided design software and he has been addressing the challenges and opportunities associated with information change for many years.

Balisage Paper: Divide and Conquer: can we handle complex markup simply?