Stand-alone Encoding of Document History

Stand-alone Encoding of Document History (or One Step Beyond XML Diff) Balisage: The Markup Conference 2010 August 3 - 6, 2010 This paper describes an approach to encapsulate together an XML document and its history (i.e. the various significant states it adopted along its life cycle) into a single standalone XML document. Our proposal introduces an XML data model suited to capture versioning information combined with an operational model to handle such encapsulated data. We describe basic operators involved along the transformation process of the document/history pair, mainly designed around the principle of maintaining coherence properties. Jean-Yves Vion-Dury Jean-Yves Vion-Dury holds an CS engineering degree from the "Conservatoire National des Arts et Métiers, Paris" (1993) and graduated with a PhD in CS from Université Joseph Fourier, Grenoble in 1999. He has been working at Xerox Research Centre Europe (in Grenoble, France) since 1995, as a research scientist; he has also been on a two year sabbatical with Vincent Quint's team at INRIA in 2002-2004. His research interests relate to various aspect of XML including models, the impact of standards, validation/transformation languages and architectures, with theoretical background in programming languages, compilation, type systems and formal logics. Jean-Yves was Program Chair of DocEng (ACM Document Engineering Symposium ) in 2004, is member of its Program Committee since 2003, and member of its Steering Committee since 2005. Senior Scientist Xerox Research Centre Europe jean-yves.vion-dury@xeroxlabs.com Copyright © 2010 Xerox. All rights reserved.

Introduction XML being recognized today as the lingua franca of computerized information, some key basic functionalities become of universal interest and will predictably gain momentum in the upcoming XML related economy. One of these functionalities is the computation of differences between two XML documents, and more generally, the management of the history of a given document. In particular, this functions as a cornerstone of any system aiming at offering long term preservation. Today, such functionality is achieved through content management systems or databases. Models and operations are therefore hidden inside a black box, and up to the author's knowledge, no standard mechanism makes this information explicit and accessible to human users neither to external algorithms and processors. Our proposal addresses this issue by associating a target document instance together with its history inside a standalone and consistent document, thus gaining strong potential for current or future interoperability. The notion of history, when applied to a document, calls for a deeper understanding of the global context, intents and social processes underpinning the creation, evolution and maintenance of documents (see ). Hence, the relevant structure of a meaningful document history highly depends on the document typology and usages. In addition, the criterion that founds any document history is the permanence of some key deep invariants (for a general, ontological, reflection around this line, see ). Those invariants define the identity of the document itself and their loss inevitably ends the versioning process and calls for a new creation cycle. The method we describe in this paper takes into consideration the key points above, especially through a flexible calculus of difference descriptors, and through offering high level transformation operations that preserve the consistency of both the document instance and its whole history.

Overview The target document (also called body hereafter), i.e. any XML document using any namespace, is encapsulated inside a dedicated packaging XML document. The body always consistently relates to a given state of its history (not necessarily the most recent one) thanks to a dedicated attribute that refers to a version label and makes this relation explicit. The history itself is encoded as interrelated nodes (XML elements) and has the structure of a directed acyclic graph (DAG); each node of this graph models a versioning point, i.e. a particular significant state that the document reached during its lifetime and for which a version has been recorded. The meaning of any versioning point with respect to the document life cycle is application dependent, as discussed above, and the proposed approach abstracts over application semantics, just considering that a new version is built when some actor in the document life cycle decided it makes sense to do so. Arcs that connect the nodes are decorated with deltas, and these arcs model the transformation that allows changing the document from one versioning point to the other. Thus arcs are oriented. Deltas descriptors are combinations of low level operations on the document tree, mainly based on deletion and insertion of subtrees. Thanks to a set of transformations qualitatively described in this paper, it is possible to navigate inside the history of the document and to consistently extract any version of the document captured in the history graph. It is also possible to create new versioning points, new branches or to merge existing branches, all these operations producing novel consistent encapsulations of the target document.

Modelling Abstract Differencing Operations

Document model The encapsulating XML document can be represented (in an abstract way) as a tree like this:

Abstract tree structure of the document history x-version[x-bodyvi[d],x-history[…vi…]] where d represent the target document and vi a versioning point. The full syntax of our encapsulation is available through a RelaxNG Schema (see appendix). The examples , , and illustrate concretely the way we propose to encode the history graph (DAG) in XML, and the way delta operations are attached to transition arcs.

Diff Engine We assume that a diff engine is able to operate as a black box function. Its abstract signature could be for instance:

Abstracting over diff operation diff(config, d, d’) → Δ with config being some input information used to configure the processor (e.g. filter to apply, mode commutative/non-commutative,…). The notation Δ represents a set of basic delta operations, formally described hereafter.

Abstracting over Delta Operations

Syntax of delta operations Δ ::= { } no operation | { δ1 … δi } commutative snapshots | Δ ; Δ sequences δ ::= insert (pp, A) subtree insertion | insert-attr(pp/@nm, A) attribute insertion | delete(p) subtree deletion The delta operation use paths, noted p, to designate the tree location where the modification should be applied. Those paths are describe by the following grammar (i is a positive integer, nm an attribute name, pp denotes "pure paths"):

Syntax of path description p ::= pp | pp/@nm pp ::= i/pp | i These paths are interpreted as relative to the root of the encapsulated document, and can be easily translated as XPath expressions. For example, 1/2/1 is translated into *[1]/*[2]/*[1], and 1/3/@id into *[1]/*[3]/@id the XPath translation of 1/2/1 could also be node()[1]/node()[2]/node()[1] in order to consider all possible nodes as allowed by the XML document tree model. . Each delta belonging to a snapshot must comply with a structural constraint that ensures orthogonality (thus making the snapshot indeed commutative). In particular, it is enough to verify that two paths in a snapshot do not designate sibling trees, and one path does not designate a sibling tree of the parent node designated by the other path. We assume that this constraint is part of the well-formedness condition assured for all definitions Actually the calculus can be refined in such a way that this condition can be relaxed thanks to a particular rewriting of conflicting deltas. This suppose a particular semantic interpretation of conflicting information. . The semantics of delta operations expresses changes that occur on the operand (an XML tree) ; we note this transformation of d into d’ by applying Δ as follows:

Transformation by Delta application d › Δ › d’ More precisely, the previous notation is a logical assertion saying that a well-formed document d is changed into a well-formed document d’ after application of the well-formed Δ operation. Formally, for all subtree A, path p, documents d and d’, deltas Δi and δi, the transformations verify the following abstract properties:

Abstract properties of delta transformations (a-seq) d › Δ1;Δ2 › d’ ⇔ ∃ d’’ , d › Δ1 › d’’ ⋀ d’’› Δ2 › d’ (a-snap) d › { δ1 ... δi} › d’ ⇔ d › {δπ1} ; { δπi} › d’ for any permutation π defined over the sequence of indexes [1,…,i] (a-void) d › { } › d’ ⇔ d ≈ d’ (a-ins) d › { insert(pp,A)} › d’ ⇒ get(d',pp)≈A ⋀ invar(d,d',pp,f) with f = ζ(pp) (a-ins-@) d › { insert-attr(pp/@nm,A)} › d’ ⇒ get(d',pp/@nm) = A ⋀ invar(d,d',pp/@nm) (a-del) d › { delete(pp)} › d’ ⇒ get(d',pp)≈get(d,pp ⊕ f) ⋀ invar(d,d',pp,-f) with f = ζ(pp) (a-del-@) d › { delete(pp/@nm)} › d’ ⇒ get(d',pp/@nm) = ε Note that the definition of insert operator makes use of the function get, which extracts the subtree rooted at a given location p, as well as a path addition function ⊕ and a so called fingerprint extraction ζ . The invar operator deals with locality of the transformation (see Invariance Property Fig and associated explanations). Moreover, ≈ (the equivalence of trees) is defined extensionally using quantification of paths, and is insensitive to attribute order, according to path definition.

Tree equivalence d ≈ d’ ⇔ ∀ p, get(d,p) = get(d’,p) Path based, extensional equivalence over trees The path addition is defined over pure paths (paths designating element nodes only, noted pp), can deal with operands of various depth and is commutative:

Path addition i/pp ⊕ j/pp’ = (i + j)/(pp ⊕ pp’) i/pp ⊕ j = (i + j)/pp i ⊕ j = (i + j) Also fingerprints capture some structural information, and more precisely, the depth level of a path:

Fingerprint computation ζ(i/pp) = 0/ζ(pp) ζ(i) = 1 Thus ζ (1/2/3) = 0/0/1 and 1/2/3 ⊕ ζ(1/2/3) = 1/2/4 and 1/2/3/2 ⊕ ζ(1/2/3) = 1/2/4/2 The property (a-snap) holds for all permutations over indexes. It simply means that the set of deltas must be commutative with respect to their sequential application. In other words, all basic deltas of a snapshot are pairwise orthogonal. The property (a-del) involves a minus operator over paths. This one is easily defined as inverting all integers found in the path.

Path inversion -(i/pp) = (-i)/-(pp) -(i) = -i The invar property used in (a-ins), (a-ins-@) and (a-del) expresses that other parts of the tree are not modified by the operation. This is defined as follows:

Invariance property invar(d, d’, pp/@nm) ≡ ∀ nm' [ get(d, pp/@nm’) = get(d’, pp/@nm’) ] invar(d, d’, pp, f) ≡ ∀ p' [ p' ≪ pp ⇒ get(d, p’) = get(d’,p’) ⋀ pp ≪ p' ⇒ get(d, p’) = get(d’,p’ ⊕ f) ] The relation ≪ is a strict total order over (pure) paths, which is sound with respect to the standard total order over document nodes as defined by the XML document model.

Strict total order on paths for all i, j integers, pp, pp' pure paths,

			i < j	i = j	i > j
i/pp	≪	j/pp'	true	pp ≪ pp'	false
i	≪	j/pp'	true	false	false
i/pp	≪	j	true	false	false
i	≪	j	true	false	false

The intuitive notion of path prefix relation captures the idea of tree embedding

Path prefix relation pp1 ≺ pp2 iff ∃ pp | pp1 = pp2/pp We now define the notion of path orthogonality as a binary boolean operator that captures the notion that a path is not a subpath of another (or equivalently, a subtree is not included into another subtree)

Path orthogonality pp1 ⊥ pp2 iff not [ pp1 ≺ pp2 ⋁ pp2 ≺ pp1]

Extension of basic vocabulary In order to ease understanding and increase the generality of descriptions, we propose now to extend the vocabulary through definitions based on the fundamental operations described above: move, copy and replace These operations are known to be non-trivial to compute by a diff engine, can be non optimal, and are actually rarely produced for this reason. However, other sources of delta computation such as smart versioning oriented editors can produce very useful move and copy delta operations. .

Definition of Additional Operators pp2 ≪ pp1 ⇒ d › { move(pp1, pp2) } › d’ ⇔ d › { insert (pp2,get(d, pp1))};{delete(pp1 ⊕ ζ(pp2)) } › d’ pp1 ≪ pp2 ⇒ d › { move(pp1, pp2) } › d’ ⇔ d › { insert (pp2,get(d, pp1))};{delete(pp1) } › d’ pp1 ⊥ pp2 ⇒ d › { copy(pp1, pp2) } › d’ ⇔ d › { insert (pp2,get(d, pp1)) } › d’ d › { replace(pp,A) } › d’ ⇔ d › { insert (pp,A)};{delete(pp ⊕ ζ(pp)) } › d’ d › { replace(pp/@nm,A) } › d’ ⇔ d › {delete(pp/@nm) }; { insert-attr(pp/@nm,A)} › d’ Note that move and copy operations are only defined for orthogonal paths A theorem establishes that pp1 ⊥ pp2 ⇔ pp1≪ pp2 ⋁ pp2 ≪ pp1 .

Inversion of basic operations and snapshots Inverting a delta (i.e. computing the changes that will exactly bring the operand in the previous state) requires knowing the original operand on which changes will be applied. The inversion function is inductively defined as follows:

Inversion Function ∘ ∘{ } = { } d › {δ1 ... δi} › d’ ⇒ ∘{δ1 ... δi} = {∘δ1 ... ∘δi} d › Δ1;Δ2 › d’ ⇒ ∘(Δ1;Δ2) = ∘(Δ2);∘(Δ1) d › {delete(pp)} › d’ ⇒ ∘(delete(pp)) = insert(pp,get(d,pp)) d › {delete(pp/@nm)} › d’ ⇒ ∘(delete(pp/@nm)) = insert-attr(pp/@nm,get(d,pp/@nm)) d › {insert(pp,A)} › d’ ⇒ ∘(insert(pp,A)) = delete(pp) d › {insert-attr(pp/@nm,A)} › d’ ⇒ ∘(insert-attr(pp/@nm,A)) = delete(pp/@nm) Delta inversion is characterized by the following soundness property:

Proposition 1 d › Δ › d’ ⇒ d’ › ∘Δ › d Inversion of well-formed delta operations over well-formed documents produces a well-formed reverse tree transformations Proof : See appendix The inversion of deltas is an important functionality which allows navigating in the history graph. Moreover, it provides a more compact representation of changes, especially when successive versions represent documents whose content tends to increase incrementally (corresponds to the construction phase inside a standard document life-cycle). Indeed, in such cases, subgraphs of the form focus node vk is by convention identified through surrounding brackets [.] vi →insert(p,A) vj →insert(p',B) [vk] can be rewritten using delta inversion as: vi ←delete(p) vj ←delete(p') [vk] This transformation is beneficial because the subtrees A and B are redundantly stored: once inside the history and once inside the encapsulated instance itself. In case the focus is set to a non-terminal versioning point (e.g. vj ), we also may have vi ←delete(p) [vj] →insert(p',B) vk which is still quite meaningful as the subtree B is only stored once inside the delta transition (indeed, the encapsulated document, consistent with focused version vj, does not comprise the B subtree (see examples 3 et 4 below to grasp more concretely the point). Note that inverting the operands of a diff operation should also lead to reversed deltas Such an abstract property could be hardly met by a "black box" diff operator, from the implementation point of view. However, we investigate if a delta normalization procedure could reduce those cases, so that the abstract property of would be always verified. :

Strong soundness of delta inversion diff(c, d, d’)=Δ ⇒ diff(c, d’, d) = ∘(Δ) A weaker version of this property (see below) requires defining an equivalence relation over deltas. This relation is based on the effect of delta application rather than its syntactic structure.

Weak soundness of delta inversion diff(c, d, d’)=Δ ⇒ diff(c, d’, d) ≈ ∘(Δ)

Equivalence of deltas Δ ≈ Δ' ⇔ ∀ d, d', d'' d › Δ › d' ⋀ d › Δ' › d'' ⇒ d' ≈ d'' In the general case, it is much more interesting (from the performance point of view) to compute inversion on deltas rather than to perform a reversed diff.

Fundamental Operations over Encapsulated Documents

Creation of a history from an initial XML document This operation has the following abstract signature:

Initial Encapsulation create-history(d) → x-version[x-body v0 [d], x-history[v0]] This reflects that the document is encapsulated, and that an initial versioning point v0 is created inside the history. The link that relates the embedded document with the consistent versioning point is inserted in the x-body subtree.

Versioning The operation requires two operands: the encapsulated document and another variant of the document which is to be considered as the novel versioning point. What is returned is a novel encapsulated document including a new versioning point, a (consistent) link inside the body part, and a transition from the previous versioning point to the new one. This transition expresses the delta operations abstracted from the diff engine outputs.

Version Creation create-version(x-version[x-body vi [d], x-history[... vi ...]], d') → x-version[ x-body vj [d], x-history[... vi → Δ vj ]] with diff(c,d,d') = Δ

Extraction Unfolds the embedded document. This is useful to work on the target document, update or change it.

Version Extraction extract( x-version[x-body vi [d], x-history[...vi...]] ) → d

Focusing This operation allows modifying the embedded document d in order to be compliant to a given version stored in the history. This requires first building the path that connects the current versioning point vi to the novel vj and second, applying all deltas to obtain the new embedded document d'

Version Focusing focus(x-version[x-bodyvi[d], x-history[...vi...]], vj) → x-version[ x-body vj [d'], x-history[... vj ...]] with vi → Δi ... → Δjvj and d › Δi ; ... ; Δj › d’ Note that in the connection path, some vertices could have the reversed form vn ←Δn vm requiring the computation of inverse delta (indeed, the connection graph is a DAG admitting branching). Note also that we assume that the algorithm is able to choose a path when several possibilities are present in the graph (it may decide which is the optimal path with respect to performance criteria based on standard algorithm based on simple metrics).

Merging branches This operation requires two operands: the embedded document and another existing versioning point. An algorithm creates a novel versioning point and two transitions that relate original versioning points to the new one. The idea is to perform a merge with maximal preservation (no deletion operation in the respective deltas). However, conflicts may arise, and can be signaled thanks to a dedicated annotation inside the target document (this annotation is based on foreign namespace which cannot conflict with namespaces used by target document)

Version Merging merge(x-version[x-bodyvi[d], x-history[...vi...]], vj) → x-version[ x-bodyvk[d'], x-history[... vi →Δa vk , vj →Δb vk ]] with dvi › Δa › d’ and dvj › Δb › d’ In the above formula, Δa represents the concatenation of (possibly reversed) deltas that allows to reach the new (merged) version starting from version vi, and similarly for Δa and vj .

Encoding Graphs of Deltas in XML Each versioning point is captured using a dedicated element (e.g. version) associated with an id attribute that uniquely identifies it. Our convention uses a naming scheme of the form v0… v3 for instance. Names for diverging branches use a dot, e.g. v1.1, v1.2. Each version element contains a sequence of delta elements which capture the transition from this versioning point to another versioning point. This information is conveyed by an attribute (e.g. fwd for forward links and bwd for backward links). Moreover, a delta contains a non order-significant sequence of delta operations, which captures the so-called snapshot. Several delta elements are interpreted as sequences of snapshots (corresponds to the Δ1 ; Δ2 syntactic form, where Δi= {…}). Each delta operation is described using a dedicated element name according to its semantics (insert, insert-attr …). The path information may be concisely encoded, e.g. through an “ipath” attribute. Copies of subtrees may be expressed through a “copy” attribute attached to the delta element. When such an attribute is not defined, an extensive definition of the subtree is expected as the content of the operation. The example 1 below is a target document we want to track

The Original Document <html xmlns="http://www.w3c.org/1999/xhtml">  <head> <title>New Testament (Matthew, chapter 2)</title> </head> <body> Now when Jesus was born in Bethlehem of Judaea in the days of Herod the king, behold, there came wise men from the east to Jerusalem, </body> </html> The original (target) document to be encoded using our method After applying the create-history operation, the document of figure below is created.

First encapsulation <x-version id="xhtml-1.0" xmlns="XEROX::XRCE::DS:X-Version"> <x-body version=”v0”> <html xmlns="http://www.w3c.org/1999/xhtml">  <head> <title>New Testament (Matthew, chapter 2)</title> </head> <body> Now when Jesus was born in Bethlehem of Judaea in the days of Herod the king, behold, there came wise men from the east to Jerusalem, </body> </html> </x-body> <x-history > <version id=”v0”/> </x-history> </x-version> A first version (labeled v0) has been created ; note that the attribute “version” of element x-body points to the right versioning point inside the x-history subtree. In the following example, we assume three distinct versioning points have been created.

A focused history <x-version id="xhtml-1.0" xmlns="XEROX::XRCE::DS:X-Version"> <x-body version=”v1”> <html xmlns="http://www.w3c.org/1999/xhtml">  <head> <title>New Testament (Matthew, chapter 2)</title> </head> <body> Now when Jesus was born in Bethlehem of Judaea in the days of Herod the king, behold, there came wise men from the east to Jerusalem, Saying, Where is he that is born King of the Jews? For we have seen his star in the east, and are come to worship him. </body> </html> </x-body> <x-history start=”v0” end=”v2”> <version id=”v0” /> <version id=”v1”> <delta bwd=”v0”> <delete ipath=”1/2/2”/> </delta > <delta fwd=”v2”> <insert ipath=”1/2/3”> When Herod the king had heard these things, he was troubled, and all Jerusalem with him. </insert > </delta > </version > <version id=”v2” /> </x-history> </x-version> Three different versioning points have been created. The current version is v1 (cf attribute version of x-body element). As the versioning mode is “focused”, we do have backward oriented delta to the previous version, and forward oriented delta to the following version (to the latest one in this example). The following example illustrate how the same information can be encoded differently, featuring different properties.

The same document (see <xref linkend="Ex3"/>) in "linear mode" <x-version id="xhtml-1.0" xmlns="XEROX::XRCE::DS:X-Version"> <x-body version=”v1”> <html xmlns="http://www.w3c.org/1999/xhtml">  <head> <title>New Testament (Matthew, chapter 2)</title> </head> <body> Now when Jesus was born in Bethlehem of Judaea in the days of Herod the king, behold, there came wise men from the east to Jerusalem, Saying, Where is he that is born King of the Jews? For we have seen his star in the east, and are come to worship him. </body> </html> </x-body> <x-history start=”v0” end=”v2”> <version id=”v0” > <delta fwd=”v1”> <insert ipath=”1/2/2”> Saying, Where is he that is born King of the Jews? For we have seen his star in the east, and are come to worship him. </insert > </delta > </version > <version id=”v1”> <delta fwd=”v2”> <insert ipath=”1/2/3”> When Herod the king had heard these things, he was troubled, and all Jerusalem with him. </insert > </delta > </version > <version id=”v2” /> </x-history> </x-version> We do have a forward oriented delta from v0 to v1, and a forward oriented delta from v1 to v2. Note that the whole history can be read in compliance with the evolution order, from the beginning up to the end. However, in that case, we have information redundancy as a subtree (verse number 2) is present both in the body and in the history. We propose a particular optimisation that could be applied in linear mode: redundant trees are eliminated tanks to an explicit copy instruction using an XPath expression to designate the subtree (thus the XPath selection refers to the focused version). <delta fwd="v1"> <insert ipath=”1/2/2”> Saying, Where is he that is born King of the Jews? For we have seen his star in the east, and are come to worship him. </insert > </delta> In the case of figure , the node "insert", child of the forward delta to v1, is suppressed from the history and its parent is changed into the expression below: <delta fwd="v1"> <insert ipath=”1/2/2” copy=”/html[1]/body[1]/p[2]”/> </delta> This optimization applies only to versions belonging to past history, but makes a lot of sense when the evolution of document instances is best captured through using linear mode and comprises many incremental additions.

State of the Art XML diff operations are basically used to realize two fundamental operations on tree structured documents: comparison and merging. Therefore tree based diff algorithms are central to solve various important problems related to XML document management and editing processes (), among which: checking modifications with respect to a reference document, typically, the last version stored inside a repository or a file system (version management ) synchronizing variants, that is, detection of changes that occurred concurrently on two copies of a reference document, detecting potential conflicts ( merging variants, which is about fusing two variants into a unique document integrating modifications of both instances (this may imply solving potential conflicts when the underlying document management system applies a weak control policy) recovering anterior state(s) of a document An abundant literature exists on this topic, including recent and excellent synthetic work on an old topic, linear differencing (). Most work covers : Algorithmic complexity of the differencing operation, either for ordered or unordered tree models. The problem was first approached as a particular case of string oriented diff computation (see for a point on most recent algorithms in this area), implying computing an adequate linearization of the XML trees. Zhang and Shasha described a fast algorithm in 1989 for computing tree edit distance improving previous work conducted in 1979. Still the runtime and space complexity was higher than quadratic on sequential architectures, but could drop to quadratic levels on parallel architectures. In an algorithm for unordered tree was presented with a polynomial complexity for optimal differences, whereas previous works established a NP-complete complexity for general cases. A survey (still of interest 14 years after) and a comparative study on the topic can be found in and . Most recent algorithms for ordered diff computation are described in , , . Optimality of editing scripts (also called deltas) The very notion of optimality is somehow controversial, because bound to execution models one could consider either as over simplistic or too specific. Of course execution models are needed to justify underlying cost functions associated with deltas application. However, such an approach intrinsically restricts the application field or the area of proposed results. In addition, usages and document types deeply impact the structure of delta and thus the runtime performance of algorithms. As a consequence, diff algorithms should be chosen and parameterized on the basis of document type and related activities. Delta models and use of diff operations inside storage architectures (typically databases). In , a versioning graph is proposed based on XLink designation mechanism. But in this proposal, the graph relates two separate documents that may exist in independent storage infrastructure. Hence there is no notion of encapsulation and explicit consistency. In , an application is proposed that offers online versioning services for XML documents. However, there is no notion of history encapsulation, and the history itself cannot be exported or organized into an explicit form (see for an online demo). A recent paper proposes reversible deltas using a straightforward mechanism as deletion operation conveys explicitly the deleted subtree (this could quickly lead to enormous overhead, especially in the perspective of storing the whole history). The authors focus on algorithmic and speed performance issues of a novel diff algorithm using a bottom-up approach (and also precomputed hash codes on a sliding window of adjacent nodes.) Most work conducted on this area focused on related algorithm performance, time and space complexity of the diff operation and also optimality of generated deltas, but little attention has been paid to tools and models to make use of these in efficient ways. From the practical point of view, no work considered the necessity and applicative interest of abstracting over deltas and related operation models. Recently, the XQuery Update working group from W3C (see ) published a candidate recommendation proposing to enrich XQuery with mechanisms intended to modify XML documents. This impressive work largely focused on expressive power and consistent integration of the new constructs to the existing language. The patching operations rely on "insert", "delete", "rename" and "replace" , using XPath to designate the locations in the XML tree ; an innovative "transform" operator allows for combining copy with on-the-fly modifications. One major difficulty was to deal with the extra power raised by the selection semantic model of XPath, i.e sequences of nodes. As a consequence, many runtime error may occur depending on the evaluation of selectors and the semantics of the delta operation. Additionally the insert operator of the language is particularly powerful since it can make explicit positional constraints ("first/last position", "before/after A") whereas when implicit, positional constraints are interpreted so that global ordering constraint are optimally satisfied (this remain a bit unclear how far this can be hard). The transformation model of XQUpdate relies on a kind of transactional framework, where changes are stored in a list of "pending modifications" only applied after calling a dedicated primitive upd:applyUpdates. This is a pragmatic solution to the problem of path relocation (and variable scope validity!), but is a nightmare regarding static analysis. Therefore it will be much more realistic to use the XQUpdate interpreter as a patch engine (i.e generating XQUpdate code from our descriptions), than considering an XQUpdate program as a description of differences.

Conclusion The main contribution of this work is to propose a universal XML data structure and related transformations which allow abstracting from underlying storage systems and from any execution model associated with the computation of XML versioning information. This is in favor of long term preservation of XML documents, infrastructure and vendor independence, and open the way to interoperable processing of XML versioning information. The method targets generic XML documents, and makes use of a particular namespace to embed any XML document (whatever vocabulary it may use) without any change in its content and tag/attribute set. The history is encoded using a specific vocabulary that captures change operations in a formal and universal way, so that any XML diff enabled processor can generate suitable deltas. Based on the properties of this change description language, we propose a set of high level operators which allows exploiting the historical information in order to navigate inside the history, compute new version in the current branch, create new branches or suppress intermediate versioning points. Various information may be attached to the versioning points, such as user meta information, universal date/time of the creation of the versioning point, additional data required to optimize or ameliorate the computation (hash code, digest number,...), which are not central to this contribution but could substantially enhance any of its application. The problem with existing diff related approaches is that the description model and the transformation model are identical. In other words, they use the same objects both to describe operationally and declaratively the differences between trees. We outline in our proposal a calculus that formally allows making a distinction between modification descriptors and modification operations. This calculus allows modeling "standard" processing of deltas and allows transforming descriptions in order to gain compactness or efficiency. We are not aware of any equivalent proposal. The two transformation structures we described can lead to two broad classes of delta interpretations: Each modification step transforms the tree, so that the paths used inside subsequent deltas should be defined on the modified tree. This model is used inside database systems and DOM-like in memory tree processors (especially , ). All modification steps are defined on the same reference tree, and all modifications are realized in one global step. Thus the paths always refer to the same tree operand and consequently, delta sequences are more easily understood by a human reader, because they do not have to maintain the cognitive load of mentally propagating tree modifications to paths. From a computational point of view, this model is well adapted to stream based processing pipes in which each operation can be performed concurrently (e.g. on a multi core processor). As this approach doesn't need building a tree in memory, it is particularly indicated for large and very large document processing. The delta model we propose as foundational for history encoding integrates those two transformation models, in the sense that both can be expressed explicitly. The first one is captured by sequences of singleton snapshots whereas the second may involve snapshots including several deltas. We identified 4 key issues when addressing the challenge of encapsulating document history into a stand alone XML instance: mastering the potentially big data volume, usability of the history model (human vs machine), flexibility of the history representation and abstracting over diff operations (i.e. not being dependent of a particular diff algorithm/engine behavior). The method described in this paper addresses these four issues at various levels. We believe our approach therefore constitutes a significant contribution having a realistic applicability level, though the data volume issue would still require large scale experiments.

Acknowledgment This work is supported by the European project SHAMAN (FP7- ICT-216736, http://shaman-ip.eu/shaman/ ). The author also would like to thanks Jean-Pierre Chanod for his continuous support and helpful comments

Proof of proposition <xref linkend="proposition-1"/> Global structure: We do a structural induction over Δ. Each sub-case is associated with a corresponding well-formedness hypothesis; we use these together with the definitions of inversion. We show here the case for the Δ1;Δ2 composition: Goal : d › Δ1;Δ2 › d’ ⇒ d’ › ∘( Δ1;Δ2 ) › d We introduce the left hand as a new hypothesis H and we get the new goal d’ › ∘( Δ1;Δ2 ) › d By applying the definition of inversion (thanks to H), we get d’ › ∘Δ2 ; ∘Δ1 › d The we apply the property (a-seq) ∃ dd , d’ › ∘Δ2 › dd ⋀ dd › ∘Δ1 › d We apply again the (a-seq) property to hypothesis H, and get a particularisation for intermediate tree d''. Thus we choose d'' as an evidence for the existence of dd, and get d’ › ∘Δ2 › d'' ⋀ d'' › ∘Δ1 › d Now we split the goal in two subgoals and apply the induction hypothesis to both. An RelaxNG schema (compact syntax ) capturing our model default namespace xversion = "XRCE::DS::X-Version" datatypes xsd = "http://www.w3.org/2001/XMLSchema-datatypes" start = x-version anyElement = element * - xversion:* { anyAttribute*, anyElement*} | text anyAttribute = attribute * { text} x-version = element x-version { element x-version-body { attribute ref {V-REF}, attribute xml:space {"preserve" | "ignore"}?, SUBTREE }+, element x-version-history { attribute first {list {V-REF+}}, attribute last {list {V-REF+}}, # the version history encodes a DAG (Directed Acyclic Graph) version* } } version = element version { attribute id { V-REF}, # allows zero or more attribute conflicts {xsd:nonNegativeInteger}?, a-time?, delta* } a-target= # this attribute defines a "one-step" next version attribute to {V-REF} delta = element delta { a-target, (rename | insert | append | move | copy | delete | replace)* } a-here= # the "here" attribute designates the insertion point (existing stuff is to be shift to right after insertion) attribute here {IPath} insert= # "insert" only applies to elements, text, comment and PI element insert { a-here, # the subtree to be inserted is here (but skipped thanks to a NVDL rule) SUBTREE } append= # "append" applies to everything including attributes # for elements, text, comments, PI, the item is appended at the end of the sequence of children element append { a-here, # the "attribute" attribute is used to specify attribute insertion; it must have the "QName=Value" syntactic structure ((attribute attribute {A-DEF}?, empty) | SUBTREE) } copy= element copy { a-what, a-here, attribute append {flag}? } move= element move { a-what, a-here, attribute append {flag}? } delete= element delete { a-what, SUBTREE } replace= element replace { a-what, SUBTREE } rename= # works for elements, attributes, and PI as well # the "as" attribute is a qualified name (that is, may refer to a given namespace through a declared prefix element rename { a-what, attribute as {token}} SUBTREE=anyElement | conflict conflict=element conflict { item+ } item = element item { attribute ref {V-REF}, anyElement } attribute-conflict =element attribute-conflict { attribute ref {V-REF}, A-DEF} tree-conflict = element tree-conflict { attribute ref {V-REF}, A-DEF} V-REF= xsd:string { pattern="v\d+(\.\d+)*" } A-DEF=string flag = "yes" | "no" a-what = attribute what {XPath} a-time = attribute time {xsd:dateTime} XPath = xsd:string IPath = xsd:string { pattern="1(/\d+)*(/@([\c-[:]]+:)?[\c-[:]]+)?" } an ISO Schematron schema <xref linkend="Schematron"/> <?xml version="1.0" encoding="UTF-8"?> <sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" xml:lang="en" queryBinding="xslt2"> <sch:ns uri="XRCE::DS::X-Version" prefix="xv"/> <sch:pattern > <sch:let name="versions" value="/*[1]/*[2]/xv:version"/> <sch:let name="versions-id" value="for $i in $versions/@id return normalize-space($i)"/> <sch:let name="Nversions-id" value="distinct-values($versions-id)"/> <sch:let name="history" value="/*[1]/xv:x-version-history[1]"/> <sch:let name="starting-vector" value="tokenize(normalize-space($history/@first),'\s')"/> <sch:let name="ending-vector" value="tokenize(normalize-space($history/@last),'\s')"/> <sch:let name="Nstarting-vector" value="distinct-values($starting-vector)"/> <sch:let name="Nending-vector" value="distinct-values($ending-vector)"/> <sch:rule context="/*[1]/xv:x-version-history[1]"> <sch:assert test="count($versions-id) ge count($Nversions-id)">Every version element must have an unique id attribute</sch:assert> <sch:assert test="count($starting-vector) eq count($Nstarting-vector)">"first" attribute should refer to each relevant version only once</sch:assert> <sch:assert test="count($ending-vector) eq count($Nending-vector)">"last" attribute should refer to each relevant version only once</sch:assert> <sch:assert test="every $i in $Nstarting-vector satisfies index-of($Nversions-id,$i) gt 0 "> the "first" attribute should point to existing version(s) (cf "<sch:value-of select="string-join($Nstarting-vector,'/')"/>" versus "<sch:value-of select="string-join($Nversions-id,'/')"/>" ) </sch:assert> <sch:assert test="every $i in $Nending-vector satisfies index-of($Nversions-id,$i) gt 0 "> the "last" attribute should point to existing version(s) (cf "<sch:value-of select="string-join($Nending-vector,'/')"/>" versus "<sch:value-of select="string-join($Nversions-id,'/')"/>" ) </sch:assert> </sch:rule> <sch:rule context="xv:version"> <sch:report test="count(index-of($versions-id,normalize-space(@id))) gt 1"> The "id" attribute must be unique ("<sch:value-of select="@id"/>") </sch:report> <sch:let name="my-id" value="@id"/> <sch:assert test="(count(../xv:version/xv:delta[@to=$my-id]) gt 0) or (index-of($Nstarting-vector,$my-id) gt 0)"> missing back link to anterior version (version "<sch:value-of select="@id"/>") </sch:assert> <sch:assert test="(count(xv:delta[@to]) gt 0) or (index-of($Nending-vector,$my-id) gt 0)"> missing link to posterior version (version "<sch:value-of select="@id"/>") </sch:assert> </sch:rule> <sch:rule context="xv:delta"> <sch:assert test="index-of($Nversions-id,@to) gt 0"> The link to version "<sch:value-of select="@to"/>" is dangling (no corresponding version found in the whole history) </sch:assert> <sch:let name="my-dest" value="@to"/> <sch:report test="count(preceding-sibling::xv:delta[@to=$my-dest]) gt 0" > Delta must be unique for one target version ("<sch:value-of select="@to"/>") </sch:report> </sch:rule> <sch:rule context="xv:conflict"> <sch:let name="all-item-refs" value="for $i in xv:item/@ref return normalize-space($i)"/> <sch:assert test="count(distinct-values($all-item-refs)) eq count($all-item-refs)"> Conflicting items must be uniquely defined for a given version </sch:assert> </sch:rule> <sch:rule context="xv:item"> <sch:assert test="index-of($Nversions-id,@ref) gt 0"> The reference to version "<sch:value-of select="@ref"/>" is dangling (no corresponding version found in the whole history) </sch:assert> </sch:rule> </sch:pattern> </sch:schema> References Simple Fast Algorithms for the Editing Distance between Trees and Related Problems, Kaizhong Zhang; Dennis Shasha, SIAM J. Comput., 1989 A survey on tree edit distance and related problems, Philip Bille, Theoretical Computer Science , Volume 337 Issue 1-3 June 2005. Change detection in hierarchically structured information, Sudarshan S. Chawathe; Anand Rajaraman;Hector Garcia-Molina ; Jennifer Widom, SIGMOD '96: Proceedings of the 1996 ACM SIGMOD international conference on Management of data, June 1996, doi:10.1145/233269.233366 A Comparative Study of Version Management Schemes for XML Documents, Shu-Yao Chien ; Vassilis J. Tsotras ; Carlo Zaniolo, 2000 A comparative study for XML change detection, Grégory Cobéna; Talel Abdessalem; Yassine Hinnach, Research Report, INRIA Rocquencourt, France, 2002 PDF Merging XML files: a new approach providing intelligent merge of XML data sets, Robin La Fontaine, 2002 A method for the dynamic generation of virtual versions of evolving documents, Mercedes Martinez, Jean-Claude Derniame, Pablo de la Fuente, SAC 2002, Madrid, Spain, 2002 Document: Form, Sign and Medium, as Reformulated for Electronic Documents, Roger T. Pédauque, collective writing, STIC-CNRS, 12 September 2003. PDF X-Diff: A Fast Change Detection Algorithm for XMLDocuments. Yuan Wang , David J. DeWitt , Jin-Yi Cai in Proceedings of the International Conference on Data Engineering (ICDE'03) PDF Le texte en jeu (Permanence et transformations du document), Roger T. Pédauque, collective writing, STIC-CNRS, 7 April 2005. PDF Fast and Simple XML Tree Differencing by Sequence Alignment, Tancred Lindholm; Jaakko Kangasharju; Sasu Tarkoma, DocEng '06: Proceedings of the 2006 ACM symposium on Document engineering October 2006, doi:10.1145/1166160.1166183 A Formal Investigation of diff3, Sanjeev Khanna; Keshav Kunal ; Benjamin C. Pierce, In Arvind and Prasad, editors, Foundations of Software Technology and Theoretical Computer Science (FSTTCS), December 2007 PDF Tree automata techniques and applications, Hubert Comon; Max Dauchet; Florent Jacquemard; Denis Lugiez; Sophie Tison; Marc Tommasi , 2007 PDF Merging Changes in XML Documents Using Reliable Context Fingerprints, Sebastian Rönnau; Christian Pauli; Uwe M. Borghoff, ACM Symposium on Document Engineering, September 2008, doi:10.1145/1410140.1410151 Efficient Change Control of XML Documents, Sebastian Rönnau; Christian Pauli; Uwe M. Borghoff, ACM Symposium on Document Engineering, September 2009, doi:10.1145/1600193.1600197 A Web-Based Version Editor for XML Documents, Luis Arévalo Rosado, Antonio Polo Márquez and Miryam Salas Sánchez, ACM Symposium on Document Engineering, September 2009, doi:10.1145/1600193.1600249 A Web-Based Version Editor for XML Documents, online demonstration version: http://picaro.unex.es:8180/vEditor/ XQuery Update Facility 1.0, W3C, Specification, June 2009 http://www.w3.org/TR/xquery-update-10/ http://xmldb-org.sourceforge.net/xupdate/xupdate-wd.html ; http://en.wikipedia.org/wiki/XUpdate Saxonica, XSLT and XQuery processing Michael Kay, http://www.saxonica.com/ ISO Schematron, a language for making assertions about patterns found in XML documents, Topologi , web site