Stand-alone Encoding of Document History

(or One Step Beyond XML Diff)

Jean-Yves Vion-Dury

Senior Scientist

Xerox Research Centre Europe

Copyright © 2010 Xerox. All rights reserved.

expand Abstract

expand Jean-Yves Vion-Dury

Balisage logo

Proceedings

expand How to cite this paper

Stand-alone Encoding of Document History

(or One Step Beyond XML Diff)

Balisage: The Markup Conference 2010
August 3 - 6, 2010

Introduction

XML being recognized today as the lingua franca of computerized information, some key basic functionalities become of universal interest and will predictably gain momentum in the upcoming XML related economy.

One of these functionalities is the computation of differences between two XML documents, and more generally, the management of the history of a given document. In particular, this functions as a cornerstone of any system aiming at offering long term preservation.

Today, such functionality is achieved through content management systems or databases. Models and operations are therefore hidden inside a black box, and up to the author's knowledge, no standard mechanism makes this information explicit and accessible to human users neither to external algorithms and processors.

Our proposal addresses this issue by associating a target document instance together with its history inside a standalone and consistent document, thus gaining strong potential for current or future interoperability.

The notion of history, when applied to a document, calls for a deeper understanding of the global context, intents and social processes underpinning the creation, evolution and maintenance of documents (see pédauque03). Hence, the relevant structure of a meaningful document history highly depends on the document typology and usages.

In addition, the criterion that founds any document history is the permanence of some key deep invariants (for a general, ontological, reflection around this line, see pédauque05). Those invariants define the identity of the document itself and their loss inevitably ends the versioning process and calls for a new creation cycle.

The method we describe in this paper takes into consideration the key points above, especially through a flexible calculus of difference descriptors, and through offering high level transformation operations that preserve the consistency of both the document instance and its whole history.

Overview

The target document (also called body hereafter), i.e. any XML document using any namespace, is encapsulated inside a dedicated packaging XML document. The body always consistently relates to a given state of its history (not necessarily the most recent one) thanks to a dedicated attribute that refers to a version label and makes this relation explicit.

The history itself is encoded as interrelated nodes (XML elements) and has the structure of a directed acyclic graph (DAG); each node of this graph models a versioning point, i.e. a particular significant state that the document reached during its lifetime and for which a version has been recorded. The meaning of any versioning point with respect to the document life cycle is application dependent, as discussed above, and the proposed approach abstracts over application semantics, just considering that a new version is built when some actor in the document life cycle decided it makes sense to do so. Arcs that connect the nodes are decorated with deltas, and these arcs model the transformation that allows changing the document from one versioning point to the other. Thus arcs are oriented. Deltas descriptors are combinations of low level operations on the document tree, mainly based on deletion and insertion of subtrees.

Thanks to a set of transformations qualitatively described in this paper, it is possible to navigate inside the history of the document and to consistently extract any version of the document captured in the history graph. It is also possible to create new versioning points, new branches or to merge existing branches, all these operations producing novel consistent encapsulations of the target document.

Modelling Abstract Differencing Operations

Document model

The encapsulating XML document can be represented (in an abstract way) as a tree like this:

Figure 1: Abstract tree structure of the document history

x-version[x-bodyvi[d],x-history[…vi…]]

where d represent the target document and vi a versioning point. The full syntax of our encapsulation is available through a RelaxNG Schema (see appendix). The examples Figure 26, Figure 27, Figure 28 and Figure 29 illustrate concretely the way we propose to encode the history graph (DAG) in XML, and the way delta operations are attached to transition arcs.

Diff Engine

We assume that a diff engine is able to operate as a black box function. Its abstract signature could be for instance:

Figure 2: Abstracting over diff operation

diff(config, d, d’) Δ

with config being some input information used to configure the processor (e.g. filter to apply, mode commutative/non-commutative,…). The notation Δ represents a set of basic delta operations, formally described hereafter.

Abstracting over Delta Operations

Figure 3: Syntax of delta operations

Δ::= { } no operation
| { δ1 … δi } commutative snapshots
| Δ ; Δ sequences
δ::= insert (pp, A)subtree insertion
| insert-attr(pp/@nm, A)attribute insertion
| delete(p)subtree deletion

The delta operation use paths, noted p, to designate the tree location where the modification should be applied. Those paths are describe by the following grammar (i is a positive integer, nm an attribute name, pp denotes "pure paths"):

Figure 4: Syntax of path description

p ::=pp | pp/@nm
pp ::=i/pp | i

These paths are interpreted as relative to the root of the encapsulated document, and can be easily translated as XPath expressions. For example, 1/2/1 is translated into *[1]/*[2]/*[1], and 1/3/@id into *[1]/*[3]/@id [1].

Each delta belonging to a snapshot must comply with a structural constraint that ensures orthogonality (thus making the snapshot indeed commutative). In particular, it is enough to verify that two paths in a snapshot do not designate sibling trees, and one path does not designate a sibling tree of the parent node designated by the other path. We assume that this constraint is part of the well-formedness condition assured for all definitions [2].

The semantics of delta operations expresses changes that occur on the operand (an XML tree) ; we note this transformation of d into d’ by applying Δ as follows:

Figure 5: Transformation by Delta application

d › Δ › d’

More precisely, the previous notation is a logical assertion saying that a well-formed document d is changed into a well-formed document d’ after application of the well-formed Δ operation.

Formally, for all subtree A, path p, documents d and d’, deltas Δi and δi, the transformations verify the following abstract properties:

Figure 6: Abstract properties of delta transformations

(a-seq) d › Δ12 › d’ ∃ d’’ , d › Δ1 › d’’ ⋀ d’’› Δ2 › d’
(a-snap) d › { δ1 ... δi} › d’ d › {δπ1} ; { δπi} › d’for any permutation π defined over the sequence of indexes [1,…,i]
(a-void) d › { } › d’ d ≈ d’
(a-ins) d › { insert(pp,A)} › d’ get(d',pp)≈A ⋀ invar(d,d',pp,f) with f = ζ(pp)
(a-ins-@) d › { insert-attr(pp/@nm,A)} › d’ get(d',pp/@nm) = A ⋀ invar(d,d',pp/@nm)
(a-del) d › { delete(pp)} › d’ get(d',pp)≈get(d,pp ⊕ f) ⋀ invar(d,d',pp,-f) with f = ζ(pp)
(a-del-@) d › { delete(pp/@nm)} › d’ get(d',pp/@nm) = ε

Note that the definition of insert operator makes use of the function get, which extracts the subtree rooted at a given location p, as well as a path addition function ⊕ and a so called fingerprint extraction ζ . The invar operator deals with locality of the transformation (see Invariance Property Fig Figure 11 and associated explanations). Moreover, ≈ (the equivalence of trees) is defined extensionally using quantification of paths, and is insensitive to attribute order, according to path definition.

Figure 7: Tree equivalence

d ≈ d’ ∀ p, get(d,p) = get(d’,p)

Path based, extensional equivalence over trees

The path addition is defined over pure paths (paths designating element nodes only, noted pp), can deal with operands of various depth and is commutative:

Figure 8: Path addition

i/pp j/pp’ = (i + j)/(pp ⊕ pp’)
i/pp j= (i + j)/pp
i j= (i + j)

Also fingerprints capture some structural information, and more precisely, the depth level of a path:

Figure 9: Fingerprint computation

ζ(i/pp) = 0/ζ(pp)
ζ(i) = 1

Thus ζ (1/2/3) = 0/0/1 and 1/2/3 ⊕ ζ(1/2/3) = 1/2/4 and 1/2/3/2 ⊕ ζ(1/2/3) = 1/2/4/2

The property (a-snap) holds for all permutations over indexes. It simply means that the set of deltas must be commutative with respect to their sequential application. In other words, all basic deltas of a snapshot are pairwise orthogonal.

The property (a-del) involves a minus operator over paths. This one is easily defined as inverting all integers found in the path.

Figure 10: Path inversion

-(i/pp) = (-i)/-(pp)
-(i) = -i

The invar property used in (a-ins), (a-ins-@) and (a-del) expresses that other parts of the tree are not modified by the operation. This is defined as follows:

Figure 11: Invariance property

invar(d, d’, pp/@nm)∀ nm' [ get(d, pp/@nm’) = get(d’, pp/@nm’) ]
invar(d, d’, pp, f)∀ p' [

p' ≪ pp get(d, p’) = get(d’,p’)
pp ≪ p'get(d, p’) = get(d’,p’ ⊕ f)

]

The relation ≪ is a strict total order over (pure) paths, which is sound with respect to the standard total order over document nodes as defined by the XML document model.

Figure 12: Strict total order on paths

for all i, j integers, pp, pp' pure paths,

Table I

i < j i = j i > j
i/pp j/pp' true pp ≪ pp' false
i j/pp' true false false
i/pp j true false false
i j true false false

The intuitive notion of path prefix relation captures the idea of tree embedding

Figure 13: Path prefix relation

pp1 ≺ pp2 iff ∃ pp | pp1 = pp2/pp

We now define the notion of path orthogonality as a binary boolean operator that captures the notion that a path is not a subpath of another (or equivalently, a subtree is not included into another subtree)

Figure 14: Path orthogonality

pp1 ⊥ pp2 iff not [ pp1 ≺ pp2 ⋁ pp2 ≺ pp1]

Extension of basic vocabulary

In order to ease understanding and increase the generality of descriptions, we propose now to extend the vocabulary through definitions based on the fundamental operations described above: move, copy and replace [3].

Figure 15: Definition of Additional Operators

pp2 ≪ pp1d › { move(pp1, pp2) } › d’ d › { insert (pp2,get(d, pp1))};{delete(pp1 ⊕ ζ(pp2)) } › d’
pp1 ≪ pp2d › { move(pp1, pp2) } › d’ d › { insert (pp2,get(d, pp1))};{delete(pp1) } › d’
pp1 ⊥ pp2d › { copy(pp1, pp2) } › d’ d › { insert (pp2,get(d, pp1)) } › d’
d › { replace(pp,A) } › d’ d › { insert (pp,A)};{delete(pp ⊕ ζ(pp)) } › d’
d › { replace(pp/@nm,A) } › d’ d › {delete(pp/@nm) }; { insert-attr(pp/@nm,A)} › d’

Note that move and copy operations are only defined for orthogonal paths[4].

Inversion of basic operations and snapshots

Inverting a delta (i.e. computing the changes that will exactly bring the operand in the previous state) requires knowing the original operand on which changes will be applied.

The inversion function is inductively defined as follows:

Figure 16: Inversion Function ∘

{ } = { }
d › {δ1 ... δi} › d’ 1 ... δi} = {δ1 ... δi}
d › Δ12 › d’ 12) = 2);1)
d › {delete(pp)} › d’ (delete(pp)) = insert(pp,get(d,pp))
d › {delete(pp/@nm)} › d’ (delete(pp/@nm)) = insert-attr(pp/@nm,get(d,pp/@nm))
d › {insert(pp,A)} › d’ (insert(pp,A)) = delete(pp)
d › {insert-attr(pp/@nm,A)} › d’ (insert-attr(pp/@nm,A)) = delete(pp/@nm)

Delta inversion is characterized by the following soundness property:

Figure 17: Proposition 1

d › Δ › d’ d’ › Δ › d

Inversion of well-formed delta operations over well-formed documents produces a well-formed reverse tree transformations

Proof : See appendix Appendix A

The inversion of deltas is an important functionality which allows navigating in the history graph. Moreover, it provides a more compact representation of changes, especially when successive versions represent documents whose content tends to increase incrementally (corresponds to the construction phase inside a standard document life-cycle).

Indeed, in such cases, subgraphs of the form [5]

viinsert(p,A)vjinsert(p',B)[vk]

can be rewritten using delta inversion as:

videlete(p)vjdelete(p')[vk]

This transformation is beneficial because the subtrees A and B are redundantly stored: once inside the history and once inside the encapsulated instance itself. In case the focus is set to a non-terminal versioning point (e.g. vj ), we also may have

videlete(p)[vj]insert(p',B)vk

which is still quite meaningful as the subtree B is only stored once inside the delta transition (indeed, the encapsulated document, consistent with focused version vj, does not comprise the B subtree (see examples 3 et 4 below to grasp more concretely the point).

Note that inverting the operands of a diff operation should also lead to reversed deltas [6]:

Figure 18: Strong soundness of delta inversion

diff(c, d, d’)=Δ diff(c, d’, d) = (Δ)

A weaker version of this property (see Figure 19 below) requires defining an equivalence relation over deltas. This relation is based on the effect of delta application rather than its syntactic structure.

Figure 19: Weak soundness of delta inversion

diff(c, d, d’)=Δ diff(c, d’, d) ≈ (Δ)

Figure 20: Equivalence of deltas

Δ ≈ Δ'

∀ d, d', d''
d › Δ › d' ⋀ d › Δ' › d''d' ≈ d''

In the general case, it is much more interesting (from the performance point of view) to compute inversion on deltas rather than to perform a reversed diff.

Fundamental Operations over Encapsulated Documents

Creation of a history from an initial XML document

This operation has the following abstract signature:

Figure 21: Initial Encapsulation

create-history(d) x-version[x-body v0 [d], x-history[v0]]

This reflects that the document is encapsulated, and that an initial versioning point v0 is created inside the history. The link that relates the embedded document with the consistent versioning point is inserted in the x-body subtree.

Versioning

The operation requires two operands: the encapsulated document and another variant of the document which is to be considered as the novel versioning point. What is returned is a novel encapsulated document including a new versioning point, a (consistent) link inside the body part, and a transition from the previous versioning point to the new one. This transition expresses the delta operations abstracted from the diff engine outputs.

Figure 22: Version Creation

create-version(x-version[x-body vi [d], x-history[... vi ...]], d') x-version[ x-body vj [d], x-history[... vi Δ vj ]]
with diff(c,d,d') = Δ

Extraction

Unfolds the embedded document. This is useful to work on the target document, update or change it.

Figure 23: Version Extraction

extract( x-version[x-body vi [d], x-history[...vi...]] ) d

Focusing

This operation allows modifying the embedded document d in order to be compliant to a given version stored in the history. This requires first building the path that connects the current versioning point vi to the novel vj and second, applying all deltas to obtain the new embedded document d'

Figure 24: Version Focusing

focus(x-version[x-bodyvi[d], x-history[...vi...]], vj) x-version[ x-body vj [d'], x-history[... vj ...]]
with vi Δi ... → Δjvj and d › Δi ; ... ; Δj › d’

Note that in the connection path, some vertices could have the reversed form vnΔn vm requiring the computation of inverse delta (indeed, the connection graph is a DAG admitting branching). Note also that we assume that the algorithm is able to choose a path when several possibilities are present in the graph (it may decide which is the optimal path with respect to performance criteria based on standard algorithm based on simple metrics).

Merging branches

This operation requires two operands: the embedded document and another existing versioning point. An algorithm creates a novel versioning point and two transitions that relate original versioning points to the new one. The idea is to perform a merge with maximal preservation (no deletion operation in the respective deltas). However, conflicts may arise, and can be signaled thanks to a dedicated annotation inside the target document (this annotation is based on foreign namespace which cannot conflict with namespaces used by target document)

Figure 25: Version Merging

merge(x-version[x-bodyvi[d], x-history[...vi...]], vj) x-version[ x-bodyvk[d'], x-history[... viΔa vk , vjΔb vk ]]
with dvi › Δa › d’ and dvj › Δb › d’

In the above formula, Δa represents the concatenation of (possibly reversed) deltas that allows to reach the new (merged) version starting from version vi, and similarly for Δa and vj .

Encoding Graphs of Deltas in XML

Each versioning point is captured using a dedicated element (e.g. version) associated with an id attribute that uniquely identifies it. Our convention uses a naming scheme of the form v0… v3 for instance. Names for diverging branches use a dot, e.g. v1.1, v1.2.

Each version element contains a sequence of delta elements which capture the transition from this versioning point to another versioning point. This information is conveyed by an attribute (e.g. fwd for forward links and bwd for backward links). Moreover, a delta contains a non order-significant sequence of delta operations, which captures the so-called snapshot. Several delta elements are interpreted as sequences of snapshots (corresponds to the Δ1 ; Δ2 syntactic form, where Δi= {…}).

Each delta operation is described using a dedicated element name according to its semantics (insert, insert-attr …). The path information may be concisely encoded, e.g. through an “ipath” attribute. Copies of subtrees may be expressed through a “copy” attribute attached to the delta element. When such an attribute is not defined, an extensive definition of the subtree is expected as the content of the operation.

The example 1 below is a target document we want to track

Figure 26: The Original Document

        <html xmlns="http://www.w3c.org/1999/xhtml">
            <!-- XHTML 1.0 -->
            <head>
                <title>New Testament (Matthew, chapter 2)</title>
            </head>
            <body>
                <p class="verse" number="1">
                Now when Jesus was born in Bethlehem of Judaea in the days of Herod the king,
                behold, there came wise men from the east to Jerusalem,
                </p>
            </body>
        </html>
           

The original (target) document to be encoded using our method

After applying the create-history operation, the document of figure Figure 27 below is created.

Figure 27: First encapsulation


      <x-version id="xhtml-1.0" xmlns="XEROX::XRCE::DS:X-Version">
        <x-body version=”v0”>
            <html xmlns="http://www.w3c.org/1999/xhtml">
            <!-- XHTML 1.0 -->
            <head>
                <title>New Testament (Matthew, chapter 2)</title>
            </head>
            <body>
                <p class="verse" number="1">
                Now when Jesus was born in Bethlehem of Judaea in the days of Herod the king,
                behold, there came wise men from the east to Jerusalem,
                </p>                
            </body>
            </html>
         </x-body>
       <x-history >
         <version id=”v0”/>
       </x-history>
     </x-version>

            

A first version (labeled v0) has been created ; note that the attribute “version” of element x-body points to the right versioning point inside the x-history subtree.

In the following example, we assume three distinct versioning points have been created.

Figure 28: A focused history

                <x-version id="xhtml-1.0" xmlns="XEROX::XRCE::DS:X-Version">
                    <x-body version=”v1”>
                        <html xmlns="http://www.w3c.org/1999/xhtml">
                            <!-- XHTML 1.0 -->
                            <head>
                                <title>New Testament (Matthew, chapter 2)</title>
                            </head>
                            <body>
                                <p class="verse" number="1">
                                Now when Jesus was born in Bethlehem of Judaea in the days of Herod the king,
                                behold, there came wise men from the east to Jerusalem,
                                </p>
                                <p class="verse" number="2">
                                Saying, Where is he that is born King of the Jews? For we have seen his star in 
                                the east, and are come to worship him.
                                </p>
                            </body>
                        </html>
                    </x-body>
                    <x-history start=”v0” end=”v2”>
                        <version id=”v0” />
                        <version id=”v1”>
                            <delta bwd=”v0”>
                                <delete ipath=”1/2/2”/>
                            </delta >
                            <delta fwd=”v2”>
                                <insert ipath=”1/2/3”>
                                    <p class="verse" number="3">
                                    When Herod the king had heard these things, he was troubled, and all 
                                    Jerusalem with him.
                                    </p>
                                </insert >
                            </delta >
                        </version >
                        <version id=”v2” />
                    </x-history>
                </x-version>
                
            

Three different versioning points have been created. The current version is v1 (cf attribute version of x-body element). As the versioning mode is “focused”, we do have backward oriented delta to the previous version, and forward oriented delta to the following version (to the latest one in this example).

The following example illustrate how the same information can be encoded differently, featuring different properties.

Figure 29: The same document (see Figure 28) in "linear mode"


    <x-version id="xhtml-1.0" xmlns="XEROX::XRCE::DS:X-Version">
        <x-body version=”v1”>
            <html xmlns="http://www.w3c.org/1999/xhtml">
            <!-- XHTML 1.0 -->
            <head>
                <title>New Testament (Matthew, chapter 2)</title>
            </head>
            <body>
                <p class="verse" number="1">
                Now when Jesus was born in Bethlehem of Judaea in the days of Herod the king,
                behold, there came wise men from the east to Jerusalem,
                </p>
                <p class="verse" number="2">
                Saying, Where is he that is born King of the Jews? For we have seen his star in 
                the east, and are come to worship him.
                </p>
            </body>
            </html>
        </x-body>
        <x-history start=”v0” end=”v2”>
            <version id=”v0” >
                <delta fwd=”v1”>
                    <insert ipath=”1/2/2”>
                    <p class="verse" number="2">
                    Saying, Where is he that is born King of the Jews? For we have seen his star in 
                    the east, and are come to worship him.
                    </p>
                    </insert >
                </delta >
            </version >
            <version id=”v1”>
                <delta fwd=”v2”>
                    <insert ipath=”1/2/3”>
                        <p class="verse" number="3">
                        When Herod the king had heard these things, he was troubled, and all 
                        Jerusalem with him.
                        </p>
                    </insert >
                </delta >
            </version >
            <version id=”v2” />
        </x-history>
    </x-version>                

            

We do have a forward oriented delta from v0 to v1, and a forward oriented delta from v1 to v2. Note that the whole history can be read in compliance with the evolution order, from the beginning up to the end. However, in that case, we have information redundancy as a subtree (verse number 2) is present both in the body and in the history.

We propose a particular optimisation that could be applied in linear mode: redundant trees are eliminated tanks to an explicit copy instruction using an XPath expression to designate the subtree (thus the XPath selection refers to the focused version).

                <delta fwd="v1">
                    <insert ipath=”1/2/2”>
                        <p class="verse" number="2">
                        Saying, Where is he that is born King of the Jews? For we have seen his star 
                        in the east, and are come to worship him.
                        </p>
                    </insert >                
                </delta>                 
            
In the case of figure Figure 29, the node "insert", child of the forward delta to v1, is suppressed from the history and its parent is changed into the expression below:
                <delta fwd="v1">
                    <insert ipath=”1/2/2” copy=”/html[1]/body[1]/p[2]”/>    
                </delta>                 
            

This optimization applies only to versions belonging to past history, but makes a lot of sense when the evolution of document instances is best captured through using linear mode and comprises many incremental additions.

State of the Art

XML diff operations are basically used to realize two fundamental operations on tree structured documents: comparison and merging. Therefore tree based diff algorithms are central to solve various important problems related to XML document management and editing processes (Chawathe96), among which:

  • checking modifications with respect to a reference document, typically, the last version stored inside a repository or a file system (version management Chien00)

  • synchronizing variants, that is, detection of changes that occurred concurrently on two copies of a reference document, detecting potential conflicts (LaFontaine02

  • merging variants, which is about fusing two variants into a unique document integrating modifications of both instances (this may imply solving potential conflicts when the underlying document management system applies a weak control policy)

  • recovering anterior state(s) of a document

An abundant literature exists on this topic, including recent and excellent synthetic work on an old topic, linear differencing (Khanna07). Most work covers :

  1. Algorithmic complexity of the differencing operation, either for ordered or unordered tree models.

    The problem was first approached as a particular case of string oriented diff computation (see Khanna07 for a point on most recent algorithms in this area), implying computing an adequate linearization of the XML trees. Zhang and Shasha described a fast algorithm in 1989 for computing tree edit distance Zha89 improving previous work conducted in 1979. Still the runtime and space complexity was higher than quadratic on sequential architectures, but could drop to quadratic levels on parallel architectures. In Wang03 an algorithm for unordered tree was presented with a polynomial complexity for optimal differences, whereas previous works established a NP-complete complexity for general cases. A survey (still of interest 14 years after) and a comparative study on the topic can be found in Bille95 and Cobe02. Most recent algorithms for ordered diff computation are described in Lind06, Rönnau08, Rönnau09.

  2. Optimality of editing scripts (also called deltas)

    The very notion of optimality is somehow controversial, because bound to execution models one could consider either as over simplistic or too specific. Of course execution models are needed to justify underlying cost functions associated with deltas application. However, such an approach intrinsically restricts the application field or the area of proposed results. In addition, usages and document types deeply impact the structure of delta and thus the runtime performance of algorithms. As a consequence, diff algorithms should be chosen and parameterized on the basis of document type and related activities.

  3. Delta models and use of diff operations inside storage architectures (typically databases).

    In Martinez02, a versioning graph is proposed based on XLink designation mechanism. But in this proposal, the graph relates two separate documents that may exist in independent storage infrastructure. Hence there is no notion of encapsulation and explicit consistency. In Arevalo09, an application is proposed that offers online versioning services for XML documents. However, there is no notion of history encapsulation, and the history itself cannot be exported or organized into an explicit form (see Arevalo09b for an online demo).

A recent paper Rönnau08 proposes reversible deltas using a straightforward mechanism as deletion operation conveys explicitly the deleted subtree (this could quickly lead to enormous overhead, especially in the perspective of storing the whole history). The authors focus on algorithmic and speed performance issues of a novel diff algorithm using a bottom-up approach (and also precomputed hash codes on a sliding window of adjacent nodes.)

Most work conducted on this area focused on related algorithm performance, time and space complexity of the diff operation and also optimality of generated deltas, but little attention has been paid to tools and models to make use of these in efficient ways. From the practical point of view, no work considered the necessity and applicative interest of abstracting over deltas and related operation models.

Recently, the XQuery Update working group from W3C (see XQueryUp) published a candidate recommendation proposing to enrich XQuery with mechanisms intended to modify XML documents. This impressive work largely focused on expressive power and consistent integration of the new constructs to the existing language. The patching operations rely on "insert", "delete", "rename" and "replace" , using XPath to designate the locations in the XML tree ; an innovative "transform" operator allows for combining copy with on-the-fly modifications. One major difficulty was to deal with the extra power raised by the selection semantic model of XPath, i.e sequences of nodes. As a consequence, many runtime error may occur depending on the evaluation of selectors and the semantics of the delta operation. Additionally the insert operator of the language is particularly powerful since it can make explicit positional constraints ("first/last position", "before/after A") whereas when implicit, positional constraints are interpreted so that global ordering constraint are optimally satisfied (this remain a bit unclear how far this can be hard).

The transformation model of XQUpdate relies on a kind of transactional framework, where changes are stored in a list of "pending modifications" only applied after calling a dedicated primitive upd:applyUpdates. This is a pragmatic solution to the problem of path relocation (and variable scope validity!), but is a nightmare regarding static analysis. Therefore it will be much more realistic to use the XQUpdate interpreter as a patch engine (i.e generating XQUpdate code from our descriptions), than considering an XQUpdate program as a description of differences.

Conclusion

The main contribution of this work is to propose a universal XML data structure and related transformations which allow abstracting from underlying storage systems and from any execution model associated with the computation of XML versioning information.

This is in favor of long term preservation of XML documents, infrastructure and vendor independence, and open the way to interoperable processing of XML versioning information.

The method targets generic XML documents, and makes use of a particular namespace to embed any XML document (whatever vocabulary it may use) without any change in its content and tag/attribute set. The history is encoded using a specific vocabulary that captures change operations in a formal and universal way, so that any XML diff enabled processor can generate suitable deltas. Based on the properties of this change description language, we propose a set of high level operators which allows exploiting the historical information in order to navigate inside the history, compute new version in the current branch, create new branches or suppress intermediate versioning points.

Various information may be attached to the versioning points, such as user meta information, universal date/time of the creation of the versioning point, additional data required to optimize or ameliorate the computation (hash code, digest number,...), which are not central to this contribution but could substantially enhance any of its application.

The problem with existing diff related approaches is that the description model and the transformation model are identical. In other words, they use the same objects both to describe operationally and declaratively the differences between trees. We outline in our proposal a calculus that formally allows making a distinction between modification descriptors and modification operations. This calculus allows modeling "standard" processing of deltas and allows transforming descriptions in order to gain compactness or efficiency. We are not aware of any equivalent proposal.

The two transformation structures we described can lead to two broad classes of delta interpretations:

  1. Each modification step transforms the tree, so that the paths used inside subsequent deltas should be defined on the modified tree. This model is used inside database systems and DOM-like in memory tree processors (especially XUpdate, XQueryUp).

  2. All modification steps are defined on the same reference tree, and all modifications are realized in one global step. Thus the paths always refer to the same tree operand and consequently, delta sequences are more easily understood by a human reader, because they do not have to maintain the cognitive load of mentally propagating tree modifications to paths. From a computational point of view, this model is well adapted to stream based processing pipes in which each operation can be performed concurrently (e.g. on a multi core processor). As this approach doesn't need building a tree in memory, it is particularly indicated for large and very large document processing.

The delta model we propose as foundational for history encoding integrates those two transformation models, in the sense that both can be expressed explicitly. The first one is captured by sequences of singleton snapshots whereas the second may involve snapshots including several deltas.

We identified 4 key issues when addressing the challenge of encapsulating document history into a stand alone XML instance: mastering the potentially big data volume, usability of the history model (human vs machine), flexibility of the history representation and abstracting over diff operations (i.e. not being dependent of a particular diff algorithm/engine behavior). The method described in this paper addresses these four issues at various levels. We believe our approach therefore constitutes a significant contribution having a realistic applicability level, though the data volume issue would still require large scale experiments.

Acknowledgment

This work is supported by the European project SHAMAN (FP7- ICT-216736, http://shaman-ip.eu/shaman/ ). The author also would like to thanks Jean-Pierre Chanod for his continuous support and helpful comments

Appendix A. Proof of proposition Figure 17

Global structure:

We do a structural induction over Δ. Each sub-case is associated with a corresponding well-formedness hypothesis; we use these together with the definitions of inversion. We show here the case for the Δ12 composition:

  1. Goal :

    d › Δ12 › d’ d’ › ( Δ12 ) › d

    We introduce the left hand as a new hypothesis H and we get the new goal

    d’ › ( Δ12 ) › d

    By applying the definition of inversion (thanks to H), we get

    d’ › Δ2 ; Δ1 › d

    The we apply the property (a-seq)

    ∃ dd , d’ › Δ2 › dd ⋀ dd › Δ1 › d

    We apply again the (a-seq) property to hypothesis H, and get a particularisation for intermediate tree d''. Thus we choose d'' as an evidence for the existence of dd, and get

    d’ › Δ2 › d'' ⋀ d'' › Δ1 › d

    Now we split the goal in two subgoals and apply the induction hypothesis to both.

Appendix B. An RelaxNG schema (compact syntax ) capturing our model

                
default namespace xversion = "XRCE::DS::X-Version"
datatypes xsd = "http://www.w3.org/2001/XMLSchema-datatypes"

start =  x-version

anyElement = element * - xversion:* { anyAttribute*, anyElement*} | text
anyAttribute = attribute * { text}

x-version = element x-version { 
    element x-version-body { 
        attribute ref {V-REF}, 
        attribute xml:space {"preserve" | "ignore"}?,
        SUBTREE
        }+,
    element x-version-history {
        attribute first {list {V-REF+}},
        attribute last {list {V-REF+}},
        # the version history encodes a DAG (Directed Acyclic Graph)
        version*
    }    
}

    
version = element version {
    attribute id { V-REF},
        
    # allows zero or more
    attribute conflicts {xsd:nonNegativeInteger}?, 
    a-time?,
    delta*
    }

a-target=
    # this attribute defines a "one-step" next version 
    attribute to {V-REF}  


delta = element delta {  a-target,   (rename | insert | append | move | copy | delete | replace)* }

a-here= 
    # the "here" attribute designates the insertion point (existing stuff is to be shift to right after insertion)
    attribute here {IPath}
    
    
insert=
    # "insert" only applies to elements, text, comment and PI
    element insert {
    a-here,
    # the subtree to be inserted is here (but skipped thanks to a NVDL rule)
    SUBTREE
    }

append=
    # "append" applies to everything including attributes
    # for elements, text, comments, PI, the item is appended at the end of the sequence of children
    element append {
    a-here,
    # the "attribute" attribute is used to specify attribute insertion; it must have the "QName=Value" syntactic structure 
    ((attribute attribute {A-DEF}?, empty) | SUBTREE)
    }

copy= element copy {
    a-what, a-here,
    attribute append {flag}?
    }
    
move= element move {
    a-what, a-here,
    attribute append {flag}?
    }

delete= element delete { a-what, SUBTREE }
replace= element replace { a-what, SUBTREE }

rename= 
    # works for elements, attributes, and PI as well
    # the "as" attribute is a qualified name (that is, may refer to a given namespace through a declared prefix
    element rename {  a-what,    attribute as {token}}


SUBTREE=anyElement | conflict

conflict=element conflict { item+ }
item = element item { attribute ref {V-REF}, anyElement }

attribute-conflict =element attribute-conflict { attribute ref {V-REF}, A-DEF}
tree-conflict = element tree-conflict { attribute ref {V-REF}, A-DEF}

V-REF= xsd:string { pattern="v\d+(\.\d+)*" }

A-DEF=string


flag = "yes" | "no"

a-what = attribute what {XPath}
a-time = attribute time {xsd:dateTime}

XPath = xsd:string
IPath = xsd:string { pattern="1(/\d+)*(/@([\c-[:]]+:)?[\c-[:]]+)?" }
                
            

Appendix C. an ISO Schematron schema Schematron

                
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" xml:lang="en" queryBinding="xslt2">
    <sch:ns uri="XRCE::DS::X-Version" prefix="xv"/>
    <sch:pattern >
        <sch:let name="versions" value="/*[1]/*[2]/xv:version"/>
        <sch:let name="versions-id" value="for $i in $versions/@id return normalize-space($i)"/>
        <sch:let name="Nversions-id" value="distinct-values($versions-id)"/>

        <sch:let name="history" value="/*[1]/xv:x-version-history[1]"/>
        <sch:let name="starting-vector" value="tokenize(normalize-space($history/@first),'\s')"/>
        <sch:let name="ending-vector" value="tokenize(normalize-space($history/@last),'\s')"/>
        <sch:let name="Nstarting-vector" value="distinct-values($starting-vector)"/>
        <sch:let name="Nending-vector" value="distinct-values($ending-vector)"/>
        
        
        <sch:rule context="/*[1]/xv:x-version-history[1]">
            <sch:assert test="count($versions-id) ge count($Nversions-id)">Every version element must have an unique id attribute</sch:assert>
            
            <sch:assert test="count($starting-vector) eq count($Nstarting-vector)">"first" attribute should refer to each relevant version only once</sch:assert>
            <sch:assert test="count($ending-vector) eq count($Nending-vector)">"last" attribute should refer to each relevant version only once</sch:assert>
            
            <sch:assert test="every $i in $Nstarting-vector satisfies index-of($Nversions-id,$i) gt 0 ">
                the "first" attribute should point to existing version(s) (cf 
                "<sch:value-of select="string-join($Nstarting-vector,'/')"/>" versus "<sch:value-of select="string-join($Nversions-id,'/')"/>"
                )
            </sch:assert>
            <sch:assert test="every $i in $Nending-vector satisfies index-of($Nversions-id,$i) gt 0 ">
                the "last" attribute should point to existing version(s) (cf 
                "<sch:value-of select="string-join($Nending-vector,'/')"/>" versus "<sch:value-of select="string-join($Nversions-id,'/')"/>"
                )
            </sch:assert>
        </sch:rule>
        <sch:rule context="xv:version">
            <sch:report test="count(index-of($versions-id,normalize-space(@id))) gt 1">
                The "id" attribute must be unique ("<sch:value-of select="@id"/>")
            </sch:report> 
            <sch:let name="my-id" value="@id"/>
            <sch:assert test="(count(../xv:version/xv:delta[@to=$my-id]) gt 0) or (index-of($Nstarting-vector,$my-id) gt 0)">
                missing back link to anterior version (version "<sch:value-of select="@id"/>")
            </sch:assert>
            <sch:assert test="(count(xv:delta[@to]) gt 0) or (index-of($Nending-vector,$my-id) gt 0)">
                missing link to posterior version (version "<sch:value-of select="@id"/>")
            </sch:assert>
        </sch:rule>
        <sch:rule context="xv:delta">
            <sch:assert test="index-of($Nversions-id,@to) gt 0">
                The link to version "<sch:value-of select="@to"/>" is dangling (no corresponding version found in the whole history)
            </sch:assert> 
            <sch:let name="my-dest" value="@to"/>
            <sch:report test="count(preceding-sibling::xv:delta[@to=$my-dest]) gt 0" >
                Delta must be unique for one target version ("<sch:value-of select="@to"/>")
            </sch:report>
        </sch:rule>
        <sch:rule context="xv:conflict">
            <sch:let name="all-item-refs" value="for $i in xv:item/@ref return normalize-space($i)"/>
            <sch:assert test="count(distinct-values($all-item-refs)) eq count($all-item-refs)">
                Conflicting items must be uniquely defined for a given version 
            </sch:assert>
        </sch:rule>
        <sch:rule context="xv:item">
            <sch:assert test="index-of($Nversions-id,@ref) gt 0">
                The reference to version "<sch:value-of select="@ref"/>" is dangling (no corresponding version found in the whole history)
            </sch:assert>
        </sch:rule>
    </sch:pattern>
</sch:schema>

            

References

[Zha89] Simple Fast Algorithms for the Editing Distance between Trees and Related Problems, Kaizhong Zhang; Dennis Shasha, SIAM J. Comput., 1989

[Bille95] A survey on tree edit distance and related problems, Philip Bille, Theoretical Computer Science , Volume 337 Issue 1-3 June 2005.

[Chawathe96] Change detection in hierarchically structured information, Sudarshan S. Chawathe; Anand Rajaraman;Hector Garcia-Molina ; Jennifer Widom, SIGMOD '96: Proceedings of the 1996 ACM SIGMOD international conference on Management of data, June 1996, doi:10.1145/233269.233366

[Chien00] A Comparative Study of Version Management Schemes for XML Documents, Shu-Yao Chien ; Vassilis J. Tsotras ; Carlo Zaniolo, 2000

[Cobe02] A comparative study for XML change detection, Grégory Cobéna; Talel Abdessalem; Yassine Hinnach, Research Report, INRIA Rocquencourt, France, 2002 PDF

[LaFontaine02] Merging XML files: a new approach providing intelligent merge of XML data sets, Robin La Fontaine, 2002

[Martinez02] A method for the dynamic generation of virtual versions of evolving documents, Mercedes Martinez, Jean-Claude Derniame, Pablo de la Fuente, SAC 2002, Madrid, Spain, 2002

[pédauque03] Document: Form, Sign and Medium, as Reformulated for Electronic Documents, Roger T. Pédauque, collective writing, STIC-CNRS, 12 September 2003. PDF

[Wang03] X-Diff: A Fast Change Detection Algorithm for XMLDocuments. Yuan Wang , David J. DeWitt , Jin-Yi Cai in Proceedings of the International Conference on Data Engineering (ICDE'03) PDF

[pédauque05] Le texte en jeu (Permanence et transformations du document), Roger T. Pédauque, collective writing, STIC-CNRS, 7 April 2005. PDF

[Lind06] Fast and Simple XML Tree Differencing by Sequence Alignment, Tancred Lindholm; Jaakko Kangasharju; Sasu Tarkoma, DocEng '06: Proceedings of the 2006 ACM symposium on Document engineering October 2006, doi:10.1145/1166160.1166183

[Khanna07] A Formal Investigation of diff3, Sanjeev Khanna; Keshav Kunal ; Benjamin C. Pierce, In Arvind and Prasad, editors, Foundations of Software Technology and Theoretical Computer Science (FSTTCS), December 2007 PDF

[Tata07] Tree automata techniques and applications, Hubert Comon; Max Dauchet; Florent Jacquemard; Denis Lugiez; Sophie Tison; Marc Tommasi , 2007 PDF

[Rönnau08] Merging Changes in XML Documents Using Reliable Context Fingerprints, Sebastian Rönnau; Christian Pauli; Uwe M. Borghoff, ACM Symposium on Document Engineering, September 2008, doi:10.1145/1410140.1410151

[Rönnau09] Efficient Change Control of XML Documents, Sebastian Rönnau; Christian Pauli; Uwe M. Borghoff, ACM Symposium on Document Engineering, September 2009, doi:10.1145/1600193.1600197

[Arevalo09] A Web-Based Version Editor for XML Documents, Luis Arévalo Rosado, Antonio Polo Márquez and Miryam Salas Sánchez, ACM Symposium on Document Engineering, September 2009, doi:10.1145/1600193.1600249

[Arevalo09b] A Web-Based Version Editor for XML Documents, online demonstration version: http://picaro.unex.es:8180/vEditor/

[XQueryUp] XQuery Update Facility 1.0, W3C, Specification, June 2009 http://www.w3.org/TR/xquery-update-10/

[XUpdate] http://xmldb-org.sourceforge.net/xupdate/xupdate-wd.html ; http://en.wikipedia.org/wiki/XUpdate

[Saxonica] Saxonica, XSLT and XQuery processing Michael Kay, http://www.saxonica.com/

[Schematron] ISO Schematron, a language for making assertions about patterns found in XML documents, Topologi , web site



[1] the XPath translation of 1/2/1 could also be node()[1]/node()[2]/node()[1] in order to consider all possible nodes as allowed by the XML document tree model.

[2] Actually the calculus can be refined in such a way that this condition can be relaxed thanks to a particular rewriting of conflicting deltas. This suppose a particular semantic interpretation of conflicting information.

[3] These operations are known to be non-trivial to compute by a diff engine, can be non optimal, and are actually rarely produced for this reason. However, other sources of delta computation such as smart versioning oriented editors can produce very useful move and copy delta operations.

[4] A theorem establishes that pp1 ⊥ pp2 ⇔ pp1≪ pp2 ⋁ pp2 ≪ pp1

[5] focus node vk is by convention identified through surrounding brackets [.]

[6] Such an abstract property could be hardly met by a "black box" diff operator, from the implementation point of view. However, we investigate if a delta normalization procedure could reduce those cases, so that the abstract property of Figure 18 would be always verified.