Rendered result table, using bold italics to show deletion
Cell 1
Cell 2
Cell 3
Cell 4
Cell 5
Cell 6
As can be seen, the resultant table does not render well as the second row now includes too many cells, thus pushing Cell 4 too far to the right. A better result would be to handle
the change to row spanning by including the problematic rows from the original table, marked as deleted, followed by the matching rows from the modified table, marked as added. This
can be seen in the example below.
Rendered result table, using bold italics to show deletion and underline to show addition
Cell 1
Cell 2
Cell 3
Cell 4
Cell 1
Cell 2
Cell 4
Cell 5
Cell 6
This is one example of the way that tables are handled intelligently during the comparison phase. As mentioned above, the XHTML table model is simpler than the CALS table model,
leading to fewer potential issues during comparison, but there were still a number of problems that needed to be solved.

Text formatting changes

Changing the format of specific pieces of text, e.g. highlighting a word by making it bold or italic, is common during text editing, but should this constitute a change in a redline
document? The answer will depend on the context of the change, whether the subject domain places meaning on such formatting, and whether or not there is a requirement to see these kinds
of changes in the redline document. In the case that it should be highlighted, there may be different ways of doing so. The document reviewer may wish to see the text with its old
formatting marked as deleted and the text with its new formatting marked as added so that a complete view of the change is present. In other situations, it may be sufficient to mark
the text with some other kind of highlighting to show that there has been a formatting change but not include details of how the formatting has changed.

Many content authors may not even understand that there is an XML structure underlying their document and that a format change actually constitutes a structural change. Thus, when
they make a word bold and the comparison result shows the word deleted and then added again, they see this as a mistake.

To be able to mark formatting changes in a different way, or indeed to ignore them completely, we need some way of detecting the structural change
without having to mark the underlying text as changed as well. One technique we have utilised for this is to pre-process the documents to flatten the structure of formatting elements.
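
As a rough illustration of this kind of pre-processing, the following Python sketch replaces formatting elements with empty start and end markers so that all text sits at the same level. It is a minimal sketch under stated assumptions, not the implementation used in the project: the namespace URI, the `element` attribute, the `flatten` function and the set of formatting tags are all hypothetical. Only the `deltaxml:format-start` and `deltaxml:format-end` element names come from the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI and attribute name: the text names the
# deltaxml:format-start / deltaxml:format-end elements but not these details.
DX_NS = "http://example.com/hypothetical/deltaxml-format"
START = f"{{{DX_NS}}}format-start"
END = f"{{{DX_NS}}}format-end"
FORMAT_TAGS = {"b", "i", "span"}  # assumed set of formatting elements


def flatten(parent):
    """Replace formatting children of `parent` with empty markers, in place."""
    flat = []
    for child in list(parent):
        flatten(child)  # handle nested formatting first
        if child.tag in FORMAT_TAGS:
            # Keep the original tag name and attributes on the start marker
            # so that the structure could be rebuilt after comparison.
            start = ET.Element(START, dict(child.attrib, element=child.tag))
            start.tail = child.text  # text that was inside the element
            end = ET.Element(END, {"element": child.tag})
            end.tail = child.tail    # text that followed the element
            flat.extend([start, *child, end])
        else:
            flat.append(child)
    parent[:] = flat


para = ET.fromstring("<p>A <b>bold</b> word</p>")
flatten(para)
ET.register_namespace("deltaxml", DX_NS)
print(ET.tostring(para, encoding="unicode"))
```

After flattening, the paragraph's text is a single run at one level of the tree, so a text comparison is unaffected by whether the word was bold in only one of the inputs. The markers carry enough information here to rebuild the original element, although a real reconstruction step would also need to resolve overlapping markers.
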
The following example shows a document with a bold word that has had its formatting flattened.

This flattened structure can handle formatting elements that are a simple tag, e.g. <b/> or <i/>, and also more complex formatting such as
<span style="font-size:14; font-weight:bold;"/>. Processing the input documents in this way then allows the text to be compared more intuitively, as it is all at the
same level in the XML structure. Format changes are detected as changes to the <deltaxml:format-start/> and <deltaxml:format-end/> elements and the
structured formatting can be reconstructed after comparison. There is the potential for overlapping structures in the result when formatting is flattened; to solve this problem, the
formatting from one of the input documents, typically the latest or ‘B’ document, is given priority when reconstructing.

ISO’s requirement was to ignore formatting changes completely and, for content that was in both input documents, to include the formatting from the latest or ‘B’ document. This
makes reconstructing the formatting elements a lot simpler because in the case where formatting has changed it is possible to ignore all of the elements marked as being only in
document ‘A’.

ID and IDREF attributes

ID attributes and their associated IDREFs are typically used for internal cross-referencing in documents. It is important that the target of a cross-reference is declared as an
attribute having type ID in order to ensure uniqueness within the document. Unfortunately, this uniqueness constraint can cause problems in the result file, which must be overcome.
Imagine the situation where an image, e.g. an <img/> element, is used to display a diagram and defines an ID, e.g. <img xml:id="widget"/>. An editor of
the document decides that this should have been defined using a figure element but, to avoid having to update references to the diagram, uses the same ID: <fig xml:id="widget"/>.
This is all perfectly valid because each document maintains uniqueness of its IDs. However, the comparison result file will contain the following content
because of the requirement to view both added and deleted content in the same document.

The document now contains two elements with the same ID value, which makes it invalid. This situation can be resolved by renaming the IDs on any deleted (‘A’ document) elements
and also updating any references to that element (these will be elements in the ‘A’ document only, that contain an IDREF whose value is the ID in question). The following figure shows
an example of a fixed result file.

This document is now valid in respect of its ID uniqueness. The deleted first paragraph contains a reference to the old diagram as that is what it was referencing. The remaining
second paragraph now points to the new version of the diagram. The naming scheme for updating deleted ID attributes can ensure uniqueness by using a number suffix that does not exist
in the document. This can be checked against all existing IDs in the document.

Another potential use of ID values is to use them during comparison to align elements of the same type with matching IDs. This can improve comparison results, particularly for
documents that include repeated sentences and phrases, as is typical of legal documents, for example. For this technique to work, an element must maintain its ID value across
different versions of the document so that its identity is consistent. Many XML documents are auto-generated from some other format and part of this process will involve the generation
of ID values. If these are randomly generated, they will not be suitable for this use as equivalent elements in different versions of a document will not have the same ID. Even if they
are not random and use a naming scheme, e.g. fig1, fig2, fig3 etc., removal of an element in this sequence could have a ripple effect on the ID values for all subsequent elements,
again making them unsuitable for use during comparison. This was the case for the ISO documents and the ripple effect of ID values changing caused a large amount of change to ID
attributes that had to be handled using the technique above.

Processing Instructions

Processing instructions are used to supply a consuming application with information. One thing they are increasingly used for is to insert data and/or content into a document
format that does not allow for that content in its model. This is a way of providing a customized extension to a document format but is often used as a quick fix when a more
appropriate solution would be to add the required functionality to the language specification. An example of this is the use of a processing instruction to specify the size at which a
table should be rendered on a page. In the ISOSTS documents we tested, we saw the use of processing instructions to specify an external image location that could have been included as an attribute, e.g. <img><?img-id D09291AZ.PNG?></img> instead of <img href="D09291Az.PNG"/>.

One of the problems this causes is that if you compare documents containing such processing instructions and you want the result file to include the processing instructions, there
is no sensible way of representing change to them as they are not XML elements. It is possible to preserve processing instructions, and even detect change in them by first converting
them into an XML structure, comparing documents, and then converting the XML structure back into processing instructions. A potential solution to representing change is to duplicate
the containing element whenever a change is detected in a processing instruction. For example, an <img/> element containing a processing instruction as above with a change to
the external location of that image could be represented as an image deletion and addition.

This solution is not as good as being able to represent change to an href attribute, as it is not as easily processed, but it provides a reasonable result. This can,
however, be problematic if the element containing the processing instruction is very large, e.g. a table containing a processing instruction that gives information on how it should be
rendered. Including two versions of the whole table in order to represent the processing instruction change does not give a sensible result.

Word Capitalization

Word capitalization, like formatting change, is often viewed as an insignificant change that should not be highlighted in a redline document. This was indeed the case with ISO’s
requirements. As with formatting, the result document needed to include the version of the text that was in the latest, or ‘B’, document.

A potential solution to this problem is to pre-process the input documents to ensure that all text uses only lower case. For documents whose text is mainly prose, this is not
appropriate as upper case letters are an important feature of the text and should be preserved during comparison. Because pre-processing the inputs in this way does not make sense for
the ISOSTS documents, the solution was to post-process the result file to detect those text changes where the only difference between the two versions was letter case. The following
figure gives an example of the kind of change that can be detected.

A text-based comparison of the ‘A’ and ‘B’ branches of the <deltaxml:textGroup/> element, after converting both strings to lower case, shows that there is no
change. In this situation, we can remove the marked changes and include only the text from the ‘B’ document.

This technique works well for the cases where a text change is purely a capitalization change. More complex changes that involve capitalization in conjunction with addition and/or
deletion of surrounding words will still include the capitalization change in the final output. As the capitalization is part of a larger change which will need to be reviewed anyway,
this is not likely to be a significant inconvenience.

HTML change visualization

As well as the ISOSTS specification, ISO provide XSLT stylesheets that convert an ISOSTS document into standalone XHTML. These stylesheets provide a useful and simple way of
producing a published version of standards documents for previewing during authoring. They can also be used to publish an online version of a standard.

As well as providing the intermediate change representation for input into Typefi Publish, we were able to extend the XSLT stylesheets to provide some redline functionality in the
XHTML output. In the simplest cases, this involved first categorizing the elements in ISOSTS as either block-level or inline elements and then extending the output templates to wrap
block-level elements in a <div/> and inline elements in a <span/> with these wrappers defining a class attribute containing the value of
the intermediate result’s deltaV2 attribute where it was ‘A’ or ‘B’. These classes were then styled using CSS to highlight deletions with a red background and additions
with a green background.

Other cases were more complicated and involved the overriding of whole processing templates in the original XSLT, but the final result was a useful rendering of redlining in
XHTML.

Results

The following figures show an excerpt from each of the different types of redline result that were produced. The PDF result was produced using the intermediate result delta,
published through Typefi Publish and the HTML result was produced by transforming the intermediate delta file using our XSLT extension to the ISO stylesheets. Unfortunately, images
were not available for the HTML output at the time of writing.

Summary

Document comparison is a key part of any workflow involving changing documents and, with more and more documents being stored as XML, it is important to provide tools that
understand the XML structure and the implications that it has on comparison results. As we have demonstrated, there are many subtle areas to consider when looking at XML comparison and
change representation and many of the problems we have encountered could have been made simpler by designing the document formats with comparison and change representation in mind.
This case study shows that the problems arising during comparison of structured content are not insurmountable and those considering moving to an XML representation for their document
storage should not be reluctant to do so based on any of the issues seen here.

Structured content offers huge benefits, not least of which is the processability of content to multiple published formats. This case study has shown that the production of an
intermediate document containing change representation can be used to produce redline documents in both PDF and XHTML. This intermediate file can quite easily be further processed to
select the types of change which should be highlighted and those which should be ignored. Coupling this technology with Typefi Publish, which provides the flexibility of multiple output formats and professional layout and design capabilities, provided ISO with a
comprehensive solution to their requirements for published redline documents.

Bibliography

DeltaXML, "DeltaV2 Format", http://www.deltaxml.com/support/documents/deltav2 (accessed July 15 2013)
ISO, "ISO Standards Tag Set", http://www.iso.org/schema/isosts/ (accessed July 15 2013)
Mulberry Technologies, "Mulberry Technologies Inc", http://www.mulberrytech.com (accessed July 15 2013)
Typefi, "Typefi Publish", http://www.typefi.com/typefi-publish (accessed July 15 2013)
W3C, "Change Tracking Markup Community Group", http://www.w3.org/community/change/ (accessed July 15 2013)
W3C, "XHTML 1.1 - Conformance Definition", http://www.w3.org/TR/xhtml11/conformance.html#s_conform (accessed July 15 2013)