<?xml version="1.0" encoding="UTF-8"?><article xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0-subset Balisage-1.2"><title>Multi-structured documents and the emergence of annotations vocabularies</title><info><confgroup><conftitle>Balisage: The Markup Conference 2010</conftitle><confdates>August 3 - 6, 2010</confdates></confgroup><abstract><para>
        The construction of multi-structured documents often implies the
        definition of annotation vocabularies.  Moreover, in a multi-user context, the
        growth of these vocabularies has to be controlled. We therefore propose using the
        trace of user activity to limit this growth and to document the vocabularies. For
        example, a user can follow and annotate a term in the context of the actions
        surrounding it, from its creation to the last time it was used. More broadly,
        this work is grounded in our Web-based philological platform,
        DINAH, and is mainly motivated by our collaboration with a group of philosophers
        studying the handwritten manuscripts of Jean-Toussaint Desanti.
    </para></abstract><author><personname><firstname>Pierre-Édouard</firstname><surname>Portier</surname></personname><personblurb><para>Pierre-Édouard Portier is a computer science engineer. He graduated from INSA-Lyon in September 2007
with a Master's degree in computer science. He is continuing his studies at INSA-Lyon as a Ph.D. student.
He works in the DRIM team of the LIRIS laboratory under the supervision of Sylvie Calabretto.</para></personblurb><affiliation><orgname>Université de Lyon, CNRS, INSA-Lyon, LIRIS, UMR5205, F-69621</orgname></affiliation><email>pierre-edouard.portier@insa-lyon.fr</email></author><author><personname><firstname>Sylvie</firstname><surname>Calabretto</surname></personname><personblurb><para>Sylvie Calabretto received her doctorate in computer science from the Institut National des Sciences Appliquées de
Lyon in 1993. She is currently an associate professor at the Institut National des Sciences Appliquées de Lyon
(INSA-Lyon) and a researcher at the Laboratory of Images and Information Systems Engineering (LIRIS).
She has co-supervised nine PhD dissertations and has published one collective book and about 100 papers on various
computing subjects, including structured documents, information retrieval and digital libraries.</para></personblurb><affiliation><orgname>Université de Lyon, CNRS, INSA-Lyon, LIRIS, UMR5205, F-69621</orgname></affiliation><email>sylvie.calabretto@insa-lyon.fr</email></author><legalnotice><para>Copyright © 2010 by the authors.  Used with
permission.</para></legalnotice><keywordset role="author"><keyword>digital libraries</keyword><keyword>vocabularies</keyword><keyword>interaction traces</keyword><keyword>overlapping hierarchies</keyword><keyword>XML</keyword></keywordset></info><section><title>Introduction</title><para>
        In previous work [<xref linkend="portier2009"/>], we proposed a methodology for the construction of
        multi-structured documents. Here we add tools for documenting and managing the
        evolution of the terms and relations that users create to describe structures.
    </para><para>
        The following ideas were conceived in the context of a digital library
        project involving a group of philosophers interested in the work of Jean-Toussaint
        Desanti<footnote xml:id="fnInstitut"><para>http://institutdesanti.ens-lyon.fr/</para></footnote>.
        Our documents are digital images of manuscript pages (more than
        35 000 pages). From a technical point of view, our users transcribe, annotate and reorder those pages.
    </para><para>
        Transcription is the association of the image of a manuscript page with the
        digitized version of its textual content. Many levels of transcription have been
        defined<footnote xml:id="fnTranscription"><para>http://www.tei-c.org/About/Archive_new/ETE/Preview/driscoll.xml</para></footnote>,
        ranging from the diplomatic transcription, whose result should be an exact copy
        of the manuscript (the layout of the page is retained, abbreviations are not
        expanded, etc.), to transcriptions that remain easy to read.  However, with the rich underlying data
        structures of existing philological software platforms, this choice can be
        postponed to the editorial phase, when multiple versions of the transcription will
        be produced.  Moreover, the TEI<footnote xml:id="fnTEI"><para>http://www.tei-c.org/index.xml</para></footnote> defines
        well-documented and customizable tag sets (thanks to the Roma tool, which generates
        validators for ad hoc customizations<footnote xml:id="fnRoma"><para>http://www.tei-c.org/Roma/</para></footnote>).  Finally,
        there are useful case studies ([<xref linkend="huitfeldt2004"/>], [<xref linkend="gants2006"/>], ...) of electronic edition projects. They give us
        perspectives on the processes of choosing a tag set, establishing encoding rules,
        etc. They also introduce more general problems that most certainly arose from
        the formalization that electronic editing requires. For example, the two notions of
        representation and interpretation benefit from being presented as a conceptual
        continuum: "common sense" always seems able to distinguish between the two
        concepts, to draw a border, but every time we look closer this border vanishes.  As
        a second example, computer programs considered as tools make the interactive
        relationship between the user and the text explicit.
    </para><para>
        A particularity of the Jean-Toussaint Desanti Archives is the need to
        reorder the handwritten pages. In fact, we have noticed that this problem
        applies to many projects of electronic edition of handwritten manuscripts. Often,
        the pages are initially disordered: the documents may have passed through
        the hands of many people and lost their initial order. More importantly, for
        testing philological hypotheses, it is often useful to reorder collections of
        pages, even though these reorderings do not correspond to explicit choices of the
        author. We choose two examples from the J.T. Desanti Archives to
        illustrate this problem.
    </para><para>
        For the first example, let us say a user isolates the four pages of <xref linkend="fgFourPages"/> from a set of unclassified pages. Since three of the four
        pages are numbered in their top right corner, he easily manages to reorder
        them. Nonetheless, this reordering is to be considered an interpretation, and the
        original order and context must be preserved. Later, inside another unclassified
        set of pages, a page is found that shares many similarities with the first page of
        the new four-page collection. A researcher concludes that the newly
        created collection may be an alternative version of the page just found.
    </para><figure xml:id="fgFourPages"><mediaobject><imageobject><imagedata format="jpg" fileref="../../../vol5/graphics/Portier01/Portier01-001.jpg" width="60%"/></imageobject><caption><para>Four pages from an unclassified collection</para></caption></mediaobject></figure><para>
    For the second example, a user finds, on the last page of a
    notebook, a reference to an unknown appendix (see <xref linkend="fgLastPage"/>). Later, the appendix is found inside an unclassified set
    of pages (see <xref linkend="fgAppendix"/>). The user inserts the appendix after the
    last page of the notebook, but the original context in which the appendix was found
    must be preserved. This context is highly significant, since it helped to link the content of
    the notebook with the themes developed in the unclassified set of pages where the
    appendix was found.
</para><figure xml:id="fgLastPage"><mediaobject><imageobject><imagedata format="jpg" fileref="../../../vol5/graphics/Portier01/Portier01-002.jpg" width="60%"/></imageobject><caption><para>A reference to an unknown appendix</para></caption></mediaobject></figure><figure xml:id="fgAppendix"><mediaobject><imageobject><imagedata format="jpg" fileref="../../../vol5/graphics/Portier01/Portier01-003.jpg" width="60%"/></imageobject><caption><para>Found appendix</para></caption></mediaobject></figure><para>
    We spent some time describing this simple situation of reordering sets of pages because we
    find it relevant to the considerations that follow. A reordering is
    formally quite simple: we move a few pages, we add some annotations. Yet it may
    convey valuable interpretations.
</para><para>
    We divided user activity into three main operations: transcription, reordering
    and annotation. We briefly covered the first two and now introduce the third.
    The annotation operation is in fact the most generic: transcription and
    reordering both depend on annotations, but it seemed convenient to introduce them
    separately. Our annotations are simple relations, and we keep them in an RDF store.
</para><para>
    Users create those relations in a GUI (see <xref linkend="fgMainScreenshot"/>) that
    imitates the graph structure of the data.
</para><figure xml:id="fgMainScreenshot"><mediaobject><imageobject><imagedata format="png" fileref="../../../vol5/graphics/Portier01/Portier01-004.png" width="100%"/></imageobject><caption><para>DINAH main interface</para></caption></mediaobject></figure><para>
    Those relations are used to build a faceted
    navigation interface (see <xref linkend="fgNavigation"/>).
</para><figure xml:id="fgNavigation"><mediaobject><imageobject><imagedata format="png" fileref="../../../vol5/graphics/Portier01/Portier01-005.png" width="70%"/></imageobject><caption><para>Faceted navigation interface</para></caption></mediaobject></figure><para>
    We should now clarify our orientation towards vocabularies of terms.  According to
    Patrick Durusau [<xref linkend="durusau2006"/>], vocabularies of terms and relations
    have to be defined before any annotations are created.  However, as explained by Dino
    Buzzetti and Jerome McGann [<xref linkend="buzzetti2006"/>], a new annotation modifies
    the semiotic nature of the document by facilitating new interpretations that can
    themselves prompt the desire to add new annotations in order to formalize those
    interpretations, thus feeding a theoretically endless hermeneutic process.
    It therefore seems necessary to allow the continuous creation of terms. Our work focuses
    on how to enable this creation effectively. That being said, we recognize the
    relevance and importance of consolidation efforts such as the TEI. Our own
    focus, however, is the problem of the emergence of annotation vocabularies when
    researchers study a set of documents.
</para><para>
    We first provide a short introduction to the notion of "sign".
    We then describe the evolution of our model for the construction of
    multi-structured documents. Next, we introduce the notion of the "trace of user
    interactions" and show how this trace can be used for managing and documenting
    vocabularies of terms and relations. Then, we explain how an activity of model
    confrontation emerged from the possibility of interacting with the trace. We
    illustrate these notions with examples from our philological platform named
    DINAH (Dinah is Irrelevantly Not Alice Heron). Finally, we compare DINAH with
    other philological platforms.
</para></section><section><title>Around the concept of sign</title><para>The ideas presented in the next sections grew out of reflections on the concept
of sign. Historically, this concept was developed at approximately the same time by two men working independently: Ferdinand
de Saussure (1857-1913), a Swiss linguist, and Charles Sanders Peirce (1839-1914), an American philosopher and
logician (and statistician, and astronomer, and ...). Their conceptions were in many respects quite
similar.</para><para>Saussure's concept of sign makes use of two notions: the signifier and the signified. The signifier is the form
of the sign while the signified is the concept represented by the sign. <xref linkend="fgSaussureanSign"/>
shows the traditional way of representing the Saussurean sign. A simplistic interpretation of Saussure's ideas
would tend to consider the sign as a univocal relation between a signified and a signifier. This would
clearly be a misinterpretation. For Saussure, the relation signified/signifier is linguistically arbitrary.
Moreover, he saw "meaning" as wholly relational or structural: the meaning of a sign arises from its relations
with other signs and not from some essential property or reference to a material world.</para><figure xml:id="fgSaussureanSign"><mediaobject><imageobject><imagedata format="png" fileref="../../../vol5/graphics/Portier01/Portier01-006.png" width="20%"/></imageobject><caption><para>A representation of the Saussurean concept of sign: the signification is the relationship between the
signified and the signifier represented by the arrows.</para></caption></mediaobject></figure><para>Peirce produced a triadic model of the sign whose terms are the representamen, or the form of the sign;
the interpretant, or the sense made of a sign (not to be confused with the interpreter); and the object to which
the sign refers. The representamen can be related to the signifier and the interpretant to the signified. So,
the new term present in Peirce's model with no equivalent in Saussure's model is the object. However, it can
be argued that a referent was implicitly present in Saussure's model. What really differentiates the two
models is that the interpretant can be considered as a sign in the mind of the interpreter. This implies that
interpretation is a process, which Peirce named semiosis. It seems clear from the
above definitions that semiosis is theoretically unlimited (the notion of "unlimited semiosis" was first
introduced by Umberto Eco). For Saussure, interpretation was the highlighting of the complex relations a
sign maintains with other signs within a linguistic structure, while for Peirce interpretation is the
never-completed process of connecting signs.</para><para>Modern interpretations of Peirce's model, with its concept of semiosis, tend to erase the notion of
"signified" since the interpretant can always become a new sign. We found an interesting example in
Gregory Bateson's work [<xref linkend="bateson1979"/>] (note that we have modified it slightly). We
consider a mathematical property: "The sum of the first n odd numbers is equal to n squared". <xref linkend="fgOddNumbers"/> shows three representations of this property. Is the content the same, with only the
form varying? It seems more satisfying not to speak in terms of signifier/signified, form/content, etc., but
to consider different configurations of signifiers leading to different interpretations. Incidentally, this
last example can help in understanding a difficult concept: the difference between cardinal and ordinal
numbers.</para><figure xml:id="fgOddNumbers"><mediaobject><imageobject><imagedata format="png" fileref="../../../vol5/graphics/Portier01/Portier01-007.png" width="50%"/></imageobject><caption><para>Three representations of a mathematical property</para></caption></mediaobject></figure><para>In computer science, references to Peirce are usually made within a logical framework where first-order
logic is used as a uniform way to represent assertions in order to compute new assertions. The best example of
this trend is the work of John F. Sowa [<xref linkend="sowa2000"/>]; indeed, conceptual graphs were
derived from the work of Peirce. However, we rarely find work inspired by the notion of semiosis. That is what
we attempt here, by offering users signifiers to interpret and by taking the results of their
interpretations into account. The next sections will illustrate this approach.</para></section><section><title>A model for the construction of multi-structured documents</title><section><title>Definitions</title><para>We first give some definitions:</para><para>
<itemizedlist><listitem><para><emphasis role="ital">A resource</emphasis> is anything uniquely identified by a URI. Fragments, intervals,
zones, terms, classes, binary relations, vocabularies and documents
are resources.</para></listitem><listitem><para><emphasis role="ital">A fragment</emphasis> is a part of a document's content. Our documents are
textual documents and manuscript images. In the case of textual documents, a fragment is the pair (D, (inf,
sup)) where D is a document identifier and (inf, sup) is an integer interval addressing a part of the
document. In the case of images, a fragment is the pair (I, ((x1, y1), (x2, y2))) where I is an image identifier
and ((x1, y1), (x2, y2)) are the coordinates of a rectangular zone of the image.</para></listitem><listitem><para><emphasis role="ital">A term</emphasis> is a string of characters.</para></listitem><listitem><para><emphasis role="ital">A class</emphasis> is a set of terms.</para></listitem><listitem><para><emphasis role="ital">A binary relation</emphasis> R(x, y) links together two resources.</para></listitem><listitem><para><emphasis role="ital">A vocabulary</emphasis> is a set of binary relations.</para></listitem><listitem><para><emphasis role="ital">A multi-structured document</emphasis> is
a document with fragments participating in relations that belong to multiple
vocabularies.</para></listitem></itemizedlist>
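As a minimal sketch of how these definitions could map onto RDF, the SPARQL update below describes a text fragment, a term belonging to a class, and a binary relation assigned to a vocabulary. The ex: namespace and every name in it (ex:Fragment, ex:inDocument, ex:inf, ex:sup, ex:memberOf, ex:inVocabulary, and so on) are hypothetical and chosen only for illustration; they are not the actual DINAH schema.
<programlisting xml:space="preserve">
PREFIX rdf:  &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt;
PREFIX rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt;
PREFIX ex:   &lt;http://example.org/dinah-sketch#&gt;    # hypothetical namespace
INSERT DATA
{
   # A fragment of a textual document: the pair (D, (inf, sup)).
   ex:fragment1       a                 ex:Fragment ;
                      ex:inDocument     ex:document1 ;
                      ex:inf            120 ;
                      ex:sup            164 .

   # A term is a string of characters; a class is a set of terms.
   ex:marx            a                 ex:Term ;
                      rdfs:label        "marx" ;
                      ex:memberOf       ex:Author .
   ex:Author          a                 ex:TermClass .

   # A binary relation R(x, y) links two resources ...
   ex:fragment1       ex:citationTitle  ex:document1 .

   # ... and a vocabulary is a set of binary relations.
   ex:citationTitle   a                 rdf:Property ;
                      ex:inVocabulary   ex:citations .
   ex:citations       a                 ex:Vocabulary .
}
</programlisting>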
</para><para>
    How can these vocabularies be constructed? This is the problem
    we address.  But before proceeding further, we illustrate the previous definitions
    with an example. Consider the following scenario: a philologist finds a
    consistent subset about Marx inside a sizeable stack of pages. He isolates
    this subset by creating a new collection (using the GUI of <xref linkend="fgMainScreenshot"/>). He creates a relation "mainSubject" between this
    collection and the term "marx" from the class "Author". He starts the transcription of the
    collection and also creates relations, such as "quotation" and "citationTitle", between
    intervals of the transcribed text and the document (using the GUI of <xref linkend="fgTranscription"/>). He discovers later that this collection is in fact a
    preparation for another work he found in the archive. He creates a relation
    "preparationFor" between the two collections, and so on. These newly created relations
    dynamically update the faceted navigation interface, which can be used to find specific
    collections or pages by iterative refinement (see <xref linkend="fgNavigation"/>).
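    To make the iterative refinement concrete, here is a minimal sketch of a query that a faceted interface might issue once two facets have been selected. It assumes, hypothetically, that the relations of the scenario are stored as properties in an ex: namespace; neither the namespace nor the exact property URIs come from DINAH.
<programlisting xml:space="preserve">
PREFIX ex: &lt;http://example.org/dinah-sketch#&gt;    # hypothetical namespace
SELECT ?collection ?target
WHERE
{
   # First facet: collections whose main subject is the term "marx".
   ?collection   ex:mainSubject      ex:marx .
   # Second facet (refinement): keep those that prepare another work.
   ?collection   ex:preparationFor   ?target .
}
</programlisting>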
</para><figure xml:id="fgTranscription"><mediaobject><imageobject><imagedata format="png" fileref="../../../vol5/graphics/Portier01/Portier01-008.png" width="100%"/></imageobject><caption><para>Screenshot from the transcription interface</para></caption></mediaobject></figure><para>
    How is it that, for example, a user chooses to place the relation "citationTitle"
    within the "citations" vocabulary while he assigns the relation "hasLine" to the
    "physicalStructure" vocabulary? In a multi-user context, how will a user know the
    meaning of a relation created by someone else? We address the first question in
    the remaining parts of this section, and the second question in the next section. We
    first recall some characteristics of the existing models for the representation
    of multi-structured documents.
</para></section><section><title>Existing models</title><para>
        Multi-structured documents have to be analyzed in their historical context, where
        the most widely used formalisms for document representation (first SGML, then XML)
        implied tree structures. That is why this problem has so far been considered from
        the technical point of view of overlapping hierarchies. Returning to our previous example,
        let us say a page has been transcribed and relations have been created to indicate
        some citations. Then, the lines of text are isolated in order to align the
        transcription with the manuscript facsimile. It might happen that a quotation
        overlaps two lines, so that the structure is locally a graph rather than a tree: a natural use of
        XML becomes impossible (see <xref linkend="fgOverlapping"/>). We now describe
        different solutions for the representation of multi-structured documents.
    </para><figure xml:id="fgOverlapping"><mediaobject><imageobject><imagedata format="png" fileref="../../../vol5/graphics/Portier01/Portier01-009.png" width="60%"/></imageobject><caption><para>An example of overlapping markup</para></caption></mediaobject></figure><para>
    We divide the set of existing solutions into four classes: historical solutions, ad hoc
    solutions, models not compatible with XML and, finally, models compatible with XML.
</para><para>
    CONCUR [<xref linkend="goldfarb1990"/>] is a feature of SGML designed to allow the
    integration within a single document of tags drawn from different DTDs. Thus, if the
    definitions of the overlapping tags appear in different DTDs, the representation
    problem of multi-structured documents is solved. However, because of its complexity,
    this SGML proposal has never been fully implemented.
</para><para>
    The TEI describes different syntactic solutions for the representation of multiple
    hierarchies in the same text (such as the use of milestones or the fragmentation of
    elements). These solutions make it impossible to use standard XML tools (XQuery,
    XPath, ...).
</para><para>
    Since the main problem for the representation of multi-structured documents seems to
    be the syntactic limitations of XML, some solutions are based on models with
    alternative syntaxes. However, they cannot benefit from the ecosystem of tools offered by
    XML. Among those solutions, we can distinguish LMNL [<xref linkend="tennison2002"/>]
    and TexMecs [<xref linkend="huitfeldt2003"/>], which are alternatives to XML (formal
    models and syntaxes) specifically designed for the representation of overlapping
    structures, from proposals that take advantage of the native graph model of RDF to
    represent multi-structured documents. Among the latter, the most convincing is certainly
    EARMARK [<xref linkend="vitali2009"/>]. The notions of "location", "range", "markup
    item", etc. for the modeling of multi-structured documents are precisely defined in an
    OWL ontology.  Moreover, the SPARQL language can be used to query the documents.
    The origins of the EARMARK proposal are to be found in two previous
    works: annotation graphs [<xref linkend="maeda2002"/>] are used, in the context of
    linguistic research, to represent documents as graphs so as to avoid the overlapping
    hierarchies problem; RDFTef [<xref linkend="tummarello2005"/>] can be seen as an
    adaptation of annotation graphs to the RDF standard formalism.
    Finally, XCONCUR [<xref linkend="schonefeld2006"/>] has the same notation as the SGML
    CONCUR option and is based on a multi-rooted trees model. Moreover, the XCONCUR-CL
    language can define constraints between tree layers.
</para><para> 
    Finally, there are solutions compatible with XML. They either extend the XML
    model itself or modify some XML tools (such as XPath and XQuery) to work with
    multi-structured documents. As representatives of the first category, the multi-colored
    trees [<xref linkend="jagadish2004"/>] and the delay nodes [<xref linkend="lemaitre2006"/>] solutions have very similar models based on an extension
    of the core XML model to consider documents as sets of XML trees. Unlike
    multi-colored trees, delay nodes need no XPath extension in order to
    navigate inside the structures.
</para><para>
    We now introduce members of the second category
    (modification of XML tools to
    operate on otherwise standard XML documents). GODDAG [<xref linkend="sperberg2000"/>] (General Ordered Descendant
    Directed Acyclic Graph), MSXD [<xref linkend="bruno2006"/>], MonetDB [<xref linkend="alink2006"/>] and MultiX [<xref linkend="chatti2007"/>] are similar
    proposals since in each case several trees
    are defined over the same textual content by sharing their leaves (textual fragments).
    MSXD is the first to introduce the idea of a schema for multi-structured documents.
    The MonetDB proposal is an extension of the MonetDB/XQuery XML DBMS with optimized XPath query
    operators with four new axis steps. These steps have been implemented very
    efficiently by using a region index and fast algorithms. MSDM is a lightweight solution
    that needs no more than a few specialized XQuery functions. Each of these four
    solutions fails to manage change in content or structures, since the entire
    structures have to be reconstructed every time a modification happens. MuLaX [<xref linkend="hilbert2005"/>] is an adaptation of the previously described SGML CONCUR
    option to the XML world. An editor has been developed as an Eclipse plugin for the
    creation of MuLaX documents, but no query mechanism has been defined. Finally, feature
    structures [<xref linkend="stegmann2009"/>] are a general-purpose knowledge representation
    format that can be used for XML documents annotated with
    heterogeneous tag sets; the format was adopted as a standard by the TEI in 2006. Feature
    structures have solid mathematical foundations. In particular, the two operations of
    unification and generalization are well defined and offer very interesting perspectives
    for the combination of multi-structured documents. However, there is no specialized query
    mechanism and no way of managing change in content or structures.
</para><para>
    Finally, there is the solution proposed by Desmond Schmidt [<xref linkend="schmidt2009"/>]. Its model is named MVD, for Multi-Version Document. It is
    very similar to the kind of acyclic graph models we saw before (GODDAG, MSDM, etc.),
    but it is the only solution to provide an efficient algorithm for updating
    multi-structured documents. The author explains that updating reduces to merging a new
    version into an existing MVD. He therefore draws inspiration from algorithms conceived for the
    alignment of genetic sequences. Moreover, a heuristic has been developed to handle
    block transpositions in a very short time.
</para></section><section><title>Strategy for the construction of multi-structured documents</title><para>
        The previous solutions help us understand what multi-structured documents are and
        how they can be represented, but none of them addresses the way
        structures appear. They must appear in the process of document construction,
        which we study in this section. First of all, we
        choose to represent our documents in the RDF formalism but, as will soon be
        explained, we deliberately require each structure to be hierarchical (as in the
        MultiX, MSXD and GODDAG solutions).
    </para><para> 
        The technical issue of multi-structured documents is that of overlapping
        hierarchies.  Moreover, if we do not consider documents as immutable objects
        but as dynamic objects that have to be constructed, overlapping hierarchies
        arise at precise moments.  Let us say a user has annotated some citation titles and
        quotations he found in his transcription of a manuscript. Later he is told that, in
        order to align his transcription precisely with the original facsimile, he should
        annotate each line of the manuscript. So he starts this new annotation task and,
        since the "line" relation does not exist, he adds it to the current vocabulary (the
        one already containing "citationTitle", "quotation", etc.). Then, after he has
        marked some lines, a line overlaps with an existing citation title. Our
        system (DINAH) then alerts him to an incompatibility between the relations
        "citationTitle" and "line" and advises him to assign either "citationTitle" or
        "line" to another, possibly new, vocabulary. In this case, he may assign
        "line" to a "physical structure" vocabulary. <xref linkend="fgRDF"/> is a sample
        of the resulting graph.
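        The overlap test behind such an alert can be sketched as a query. Assume, hypothetically, that each annotated interval is a resource carrying integer bounds (ex:inf, ex:sup), its document (ex:inDocument) and the relation it instantiates (ex:relation), and that relations are attached to a vocabulary with ex:inVocabulary. The following SPARQL query then lists the pairs of intervals that partially overlap within one vocabulary. It illustrates the idea only and is not the query DINAH actually runs.
<programlisting xml:space="preserve">
PREFIX ex: &lt;http://example.org/dinah-sketch#&gt;    # hypothetical namespace
SELECT ?relation1 ?relation2 ?interval1 ?interval2
WHERE
{
   ?interval1   ex:inDocument     ?document ;
                ex:relation       ?relation1 ;
                ex:inf            ?inf1 ;
                ex:sup            ?sup1 .
   ?interval2   ex:inDocument     ?document ;
                ex:relation       ?relation2 ;
                ex:inf            ?inf2 ;
                ex:sup            ?sup2 .
   # Both relations belong to the same vocabulary, which must stay hierarchical.
   ?relation1   ex:inVocabulary   ?vocabulary .
   ?relation2   ex:inVocabulary   ?vocabulary .
   # Partial overlap: interval2 starts strictly inside interval1 and ends
   # strictly after it, so neither interval contains the other.
   FILTER ( ?inf1 &lt; ?inf2 &amp;&amp; ?inf2 &lt; ?sup1 &amp;&amp; ?sup1 &lt; ?sup2 )
}
</programlisting>
        Nested or disjoint intervals never satisfy the filter, so an empty result means the vocabulary is still hierarchical over the document.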
    </para><figure xml:id="fgRDF"><mediaobject><imageobject><imagedata format="png" fileref="../../../vol5/graphics/Portier01/Portier01-010.png" width="80%"/></imageobject><caption><para>A sample RDF representation of a multi-structured document</para></caption></mediaobject></figure><para>
    In summary, our strategy for the management of multi-structured documents promotes the
    construction of a multiplicity of structures that should reflect the perspectives
    adopted by the users while accessing the documents. Each user is free to
    create new vocabularies. Moreover, when overlapping hierarchies are detected, users are
    encouraged to solve the problem by introducing a new vocabulary. In our multi-user
    context, this freedom could lead to an uncontrolled growth of vocabularies with many
    duplicate usages, synonyms, etc. That is why the next section presents a solution
    for the dynamic documentation of vocabularies.
</para></section></section><section><title>Managing and documenting vocabularies with the trace of user interactions</title><section><title>Dynamic documentation</title><para>
        Our idea for the dynamic documentation of vocabularies relies on the
        monitoring of user actions. When a user wants to know how to use a term or a
        relation, he can ask for a representation of the trace of user actions centered on
        the creation of the term (or relation) or on any of its instances.
        This trace can itself be annotated. Thanks to these annotations, users can
        document the vocabularies (see <xref linkend="fgTrace1"/> and <xref linkend="fgTrace2"/>).
    </para><figure xml:id="fgTrace1"><mediaobject><imageobject><imagedata format="png" fileref="../../../vol5/graphics/Portier01/Portier01-011.png" width="100%"/></imageobject><caption><para>Presentation of the trace of user interactions: the user PEP added the term "private" to the class
"Visibility"; an annotation documents this term.</para></caption></mediaobject></figure><figure xml:id="fgTrace2"><mediaobject><imageobject><imagedata format="png" fileref="../../../vol5/graphics/Portier01/Portier01-012.png" width="100%"/></imageobject><caption><para>Presentation of the trace of user interactions: a page was deleted just before the new term
"private" was added.</para></caption></mediaobject></figure></section><section><title>Trace model</title><para>
        The first work we are aware of that made use of traces dealt with the development
        of HCI for collaborative systems [<xref linkend="dourish1995"/>]. There are few
        works dealing with the use of activity traces for knowledge management ([<xref linkend="laflaquiere2006"/>] being one of the most representative).  They
        insist on the reflexive nature of the trace as a way of sharing knowledge.
        They also define generic
        (and quite complex) activity models, as well as rules to transform the
        granularity of the original trace so that it makes sense to the user.
        We choose instead to adopt a more lightweight approach, well adapted to our needs.
    </para><para>
        We define a simple RDF vocabulary to model actions (the definition below
        is written as a SPARQL update whose triples use the Turtle syntax). Every time a developer adds
        a new action to the system, he has to create a sub-property of the "withArgument"
        property for each argument of the new action. Moreover, we have simple SPARQL queries to
        build representations of the trace (see <xref linkend="fgTrace1"/> and <xref linkend="fgTrace2"/>).
    </para><programlisting xml:space="preserve">
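# Namespaces of the user and trace vocabularies
PREFIX users: &lt;http://desanti.org/schemas/users#&gt;
PREFIX traces: &lt;http://desanti.org/schemas/traces#&gt;
# Standard RDF and RDFS namespaces used below
PREFIX rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt;
PREFIX rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt;
# The trans: prefix is used below; its namespace is not given in this excerpt
INSERT INTO &lt;http://desanti.org/&gt;
{
   # An action, the user who performed it, and when it happened
   traces:Action           a                    rdfs:Class  .
   traces:hasDoer          a                    rdf:Property .
   traces:hasDoer          rdfs:domain          traces:Action .
   traces:hasDoer          rdfs:range           users:User .
   traces:hasTimestamp     a                    rdf:Property .
   traces:hasTimestamp     rdfs:domain          traces:Action .
   # Generic argument property; each action argument gets its own sub-property
   traces:withArgument     a                    rdf:Property .
   traces:withArgument     rdfs:domain          traces:Action .
   # Free-text annotation attached to an action
   traces:documentation    a                    rdf:Property .
   traces:documentation    rdfs:domain          traces:Action .
   traces:documentation    rdfs:range           rdfs:Literal .
   # Example of an argument sub-property, for actions that take an interval
   traces:withInterval     rdfs:subPropertyOf   traces:withArgument .
   traces:withInterval     rdfs:range           trans:Interval .
   traces:withInterval     rdfs:label           "intervalle" .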
   ...
}
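
# Illustrative sketch only, not taken from the DINAH sources: a separate SPARQL
# request of the kind that could build a trace view centered on one resource.
# Assumptions: the term URI &lt;http://desanti.org/terms/private&gt; is hypothetical,
# and the store applies RDFS inference so that sub-properties of
# traces:withArgument (such as traces:withInterval) also match the pattern.
PREFIX traces: &lt;http://desanti.org/schemas/traces#&gt;
SELECT ?action ?doer ?time ?doc
FROM &lt;http://desanti.org/&gt;
WHERE {
   # Every action that took the term as one of its arguments
   ?action  traces:withArgument  &lt;http://desanti.org/terms/private&gt; ;
            traces:hasDoer       ?doer ;
            traces:hasTimestamp  ?time .
   # Annotations are optional: not every action has been documented
   OPTIONAL { ?action  traces:documentation  ?doc }
}
# Chronological order, from the creation of the term to its latest use
ORDER BY ?time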
</programlisting></section><section><title>The emergence of a new way of confronting models</title><para>
        As explained, users of our system need to reorder subsets of the
        archive, and we have stressed that this operation may require a great deal of
        interpretation. We were therefore very interested to find users of our system
        engaged in an activity we had not anticipated. Disagreeing with an ordering
        made by user B, user A navigated inside the trace and found the actions that led to
        the problematic ordering. We then encouraged him to annotate these actions in order to
        share his disagreement. The trace thus appears to be a well-suited tool for
        documenting a new term (or relation) or finding interesting configurations. However, it
        may not be the right tool for developing arguments, as it relies on a recursive
        annotation mechanism that is not meant to support conversations.
    </para></section></section><section><title>Comparisons with some existing philological platforms</title><para>
        We compare our work with other philological platforms, which we divide into two categories:
        platforms of historical interest and Web-based platforms.
    </para><section><title>Historical platforms</title><para>
        BAMBI [<xref linkend="bozzi1997"/>] (Better Access to Manuscripts and Browsing of
        Images) is, according to the authors, "an hypermedia system allowing historians to
        read and transcribe manuscripts, write annotations, and navigate between the words
        of the transcription and the matching piece of image in the facsimile of the
        manuscript". It was the first philological software platform, but it does not allow
        typed annotations.
    </para><para>
        Part of the DEBORA [<xref linkend="nichols2000"/>] (Digital Access to Books of the
        Renaissance) project consisted of a digital library system with collaborative
        features. It introduced the notion of "virtual books". A virtual book is the
        representation of a path among pages of the entire archive. However, virtual books are not
        resources themselves and so cannot be annotated. Nevertheless, we can consider this
        system a first step towards a reflexive system that places users in front of
        their own activities.
    </para><para>
        HyperNietzsche [<xref linkend="diorio2007"/>] (today Nietzschesource) was a
        pioneering digital library platform. It offers a path mechanism very similar to the
        virtual books of the DEBORA project. However, like the virtual books, the paths
        are not resources and thus cannot truly take part in a collaborative process in
        which they would be exchanged and annotated.
    </para></section><section><title>Web based platforms</title><para>
        Collate [<xref linkend="stein2004"/>], TALIA [<xref linkend="hahn2008"/>], PINAKES
        [<xref linkend="scotti2001"/>], BRICKS [<xref linkend="bertoncini2007"/>] and
        JeromeDL [<xref linkend="kruk2007"/>] are philological platforms based on Semantic
        Web technologies. They offer high-quality mechanisms for collaborative
        annotation. However, they do not provide convergence mechanisms for creating and
        documenting vocabularies.
    </para><para>
        Armarius [<xref linkend="doumat2008"/>] is used to classify and annotate
        collections of manuscripts. It provides only untyped generic annotations. However, it
        offers a view of all the user actions that occurred during the current session, and
        its authors plan to apply graph matching algorithms in order to, for example, estimate
        probabilities for the next actions. In this respect, it can be compared with our use of
        traces.
    </para></section></section><section><title>Conclusions</title><para>
        We proposed an open system that allows the construction of multi-structured
        documents and the creation of annotation vocabularies. To manage the
        growth of these vocabularies, we introduced a dynamic documentation mechanism based
        on the trace of user actions. All of these proposals have been
        implemented in our philological software platform, DINAH.
    </para></section><bibliography><title>Bibliography</title><bibliomixed xml:id="portier2009" xreflabel="Portier2009">Pierre-Edouard Portier and Sylvie Calabretto,
<emphasis role="ital">Methodology for the construction of multi-structured documents.</emphasis> In: Proceedings of Balisage:
The Markup Conference 2009. Balisage Series on Markup Technologies, vol. 3
(2009). <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.balisage.net/Proceedings/vol3/html/Portier01/BalisageVol3-Portier01.html</link>.
doi: <biblioid class="doi">10.4242/BalisageVol3.Portier01</biblioid>. (August 2009)</bibliomixed><bibliomixed xml:id="huitfeldt2004" xreflabel="Huitfeldt2004">Claus Huitfeldt, <emphasis role="ital">Editorial
Principles of Wittgenstein's Nachlass—The Bergen Electronic Edition.</emphasis> in Dino Buzzetti, Giuliano
Pancaldi and Harold Short (eds): Augmenting Comprehension: Digital Tools and the History of Ideas, Office for
Humanities Communication, London 2004 (ISBN 1 897791 18 6)</bibliomixed><bibliomixed xml:id="gants2006" xreflabel="Gants2006">David Gants, <emphasis role="ital">Editing
Drama.</emphasis> in Electronic Textual Editing. Eds. Lou Burnard, Katherine O'Brien O'Keefe, John Unsworth.
MLA, 2006. <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.tei-c.org/About/Archive_new/ETE/Preview/gants.xml</link></bibliomixed><bibliomixed xml:id="buzzetti2006" xreflabel="Buzzetti2006">Dino Buzzetti and Jerome McGann, <emphasis role="ital">Critical Editing in a Digital Horizon</emphasis> in Electronic Textual Editing. Eds. Lou Burnard,
Katherine O'Brien O'Keefe, John Unsworth. MLA, 2006.
<link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.tei-c.org/About/Archive_new/ETE/Preview/mcgann.xml</link></bibliomixed><bibliomixed xml:id="durusau2006" xreflabel="Durusau2006">Patrick Durusau, <emphasis role="ital">How and Why
to Formalize your Markup</emphasis> in Electronic Textual Editing. Eds. Lou Burnard, Katherine O'Brien
O'Keefe, John Unsworth. MLA, 2006.
<link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.tei-c.org/About/Archive_new/ETE/Preview/durusau.xml</link></bibliomixed><bibliomixed xml:id="bateson1979" xreflabel="Bateson1979">Gregory Bateson, <emphasis role="ital">Mind and Nature: A Necessary
Unity (Advances in Systems Theory, Complexity, and the Human Sciences).</emphasis> Hampton Press. ISBN 1-57273-434-5,
1979.</bibliomixed><bibliomixed xml:id="sowa2000" xreflabel="Sowa2000">John F. Sowa, <emphasis role="ital">Ontology, Metadata,
and Semiotics</emphasis> in B.  Ganter &amp; G. W. Mineau, eds., Conceptual Structures: Logical, Linguistic, and
Computational Issues, Lecture Notes in AI #1867, Springer-Verlag, Berlin, 2000, pp. 55-81.
<link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://users.bestweb.net/~sowa/peirce/ontometa.htm</link>.
doi: <biblioid class="doi">10.1007/10722280_5</biblioid>.</bibliomixed><bibliomixed xml:id="goldfarb1990" xreflabel="Goldfarb1990">C.F. Goldfarb,
<emphasis role="ital">The SGML handbook.</emphasis> Oxford University Press,
Inc., New York, NY, USA (1990)</bibliomixed><bibliomixed xml:id="tennison2002" xreflabel="Tennison2002">J. Tennison and W. Piez, <emphasis role="ital">The
layered markup and annotation language (lmnl)</emphasis>. In: Extreme Markup Languages. (2002)</bibliomixed><bibliomixed xml:id="huitfeldt2003" xreflabel="Huitfeldt2003">Claus Huitfeldt and Michael Sperberg-McQueen,
<emphasis role="ital">Texmecs: An experimental markup meta-language for complex documents.</emphasis>
(2003)</bibliomixed><bibliomixed xml:id="vitali2009" xreflabel="Peroni2009">Angelo Di Iorio, Silvio Peroni and Fabio Vitali,
<emphasis role="ital">Towards markup support for full GODDAGs and beyond: the EARMARK approach.</emphasis>
Presented at Balisage: The Markup Conference 2009, Montréal, Canada, August 11 - 14, 2009. In Proceedings of
Balisage: The Markup Conference 2009. Balisage Series on Markup Technologies, vol. 3 (2009).
doi: <biblioid class="doi">10.4242/BalisageVol3.Peroni01</biblioid>.</bibliomixed><bibliomixed xml:id="maeda2002" xreflabel="Maeda2002">K. Maeda, S. Bird, X. Ma and H. Lee, <emphasis role="ital">Creating annotation tools with the annotation graph toolkit.</emphasis> In: Proceedings of the
Third International Conference on Language Resources and Evaluation. (Apr 2002)</bibliomixed><bibliomixed xml:id="tummarello2005" xreflabel="Tummarello2005">G. Tummarello, C. Morbidoni and E. Pierazzo,
<emphasis role="ital">Toward textual encoding based on
rdf.</emphasis> In: ELPUB. (2005) 57–63</bibliomixed><bibliomixed xml:id="jagadish2004" xreflabel="Jagadish2004">H.V. Jagadish, L.V.S. Lakshmanan, M. Scannapieco,
D. Srivastava and N. Wiwatwattana, <emphasis role="ital">Colorful xml: one hierarchy isn’t enough.</emphasis>
In: SIGMOD ’04: Proceedings of the 2004 ACM SIGMOD international conference on Management of data, New York,
NY, USA, ACM (2004) 251–262. doi: <biblioid class="doi">10.1145/1007568.1007598</biblioid></bibliomixed><bibliomixed xml:id="lemaitre2006" xreflabel="LeMaitre2006">Jacques Le Maitre, <emphasis role="ital">Describing multistructured xml documents by means of delay nodes.</emphasis> In: DocEng ’06:
Proceedings of the 2006 ACM symposium on Document engineering, New York, NY, USA, ACM (2006)
155–164. doi: <biblioid class="doi">10.1145/1166160.1166200</biblioid></bibliomixed><bibliomixed xml:id="sperberg2000" xreflabel="Sperberg-McQueen">C.M. Sperberg-McQueen, C. Huitfeldt, <emphasis role="ital">Goddag: A data structure for overlapping hierarchies.</emphasis> In: DDEP/PODDP. (2000)
139–160</bibliomixed><bibliomixed xml:id="bruno2006" xreflabel="Bruno2006">E. Bruno and E. Murisasco, <emphasis role="ital">Multistructured xml textual documents.</emphasis> GESTS International Transactions on Computer
Science and Engineering 34(1) (November 2006) 200–211</bibliomixed><bibliomixed xml:id="alink2006" xreflabel="Alink2006">W. Alink, R.A.F. Bhoedjang, A.P. de Vries and P.A.
Boncz, <emphasis role="ital">Efficient xquery support for stand-off annotation.</emphasis> In: XIME-P.
(2006)</bibliomixed><bibliomixed xml:id="chatti2007" xreflabel="Chatti2007">N. Chatti, S. Kaouk, S. Calabretto, and J.M. Pinon,
<emphasis role="ital">MultiX: an XML-based formalism to encode multi-structured documents.</emphasis> In:
Proceedings of Extreme Markup Languages’2007, Montréal (Canada). (August 2007)</bibliomixed><bibliomixed xml:id="hilbert2005" xreflabel="Hilbert2005">M. Hilbert, A. Witt, M. Quebec and O. Schonefeld,
<emphasis role="ital">Making concur work.</emphasis> In: Extreme Markup Languages. (2005)</bibliomixed><bibliomixed xml:id="stegmann2009" xreflabel="Stegmann2009">J. Stegmann and A. Witt, <emphasis role="ital">Tei
feature structures as a representation format for multiple annotation and generic xml documents.</emphasis>
In: Proceedings of Balisage: The Markup Conference 2009. Balisage Series on Markup Technologies, vol. 3
(2009).  doi: <biblioid class="doi">10.4242/BalisageVol3.Stegmann01</biblioid>. (August 2009)</bibliomixed><bibliomixed xml:id="laflaquiere2006" xreflabel="Laflaquiere2006">Laflaquière, J., Settouti, L.S., Prié, Y.,
Mille, A.: <emphasis role="ital">Trace-based framework for experience management and engineering.</emphasis>
In: KES. (2006) 1171–1178</bibliomixed><bibliomixed xml:id="dourish1995" xreflabel="Dourish1995">Paul Dourish, <emphasis role="ital">Developing a
reflective model of collaborative systems.</emphasis> ACM Trans. Comput.-Hum. Interact. 2, 1 (Mar. 1995),
  40-63. doi: <biblioid class="doi">10.1145/200968.200970</biblioid></bibliomixed><bibliomixed xml:id="bozzi1997" xreflabel="Bozzi1997">Bozzi, A., Calabretto, S.: <emphasis role="ital">The
digital library and computational philology: The bambi project.</emphasis> In: ECDL. Volume 1324 of Lecture
Notes in Computer Science., Springer (1997) 269–285.  doi: <biblioid class="doi">10.1007/BFb0026733</biblioid>.</bibliomixed><bibliomixed xml:id="nichols2000" xreflabel="Nichols2000">Nichols, D.M., Pemberton, D., Dalhoumi, S., Larouk,
O., Belisle, C., Twidale, M.B.: <emphasis role="ital">Debora: developing an interface to support collaboration
in a digital library.</emphasis> In: European Conference on Digital Libraries, Springer (2000)
239–248.  doi: <biblioid class="doi">10.1007/3-540-45268-0_22</biblioid>.</bibliomixed><bibliomixed xml:id="diorio2007" xreflabel="D'Iorio2007">D’Iorio, P.: <emphasis role="ital">Nietzsche on new
paths: The hypernietzsche project and open scholarship on the web.</emphasis> In: Maria Cristina Fornari,
Sergio Franzese (eds.), Friedrich Nietzsche. Edizioni e interpretazioni, Pisa ETS. (2007)</bibliomixed><bibliomixed xml:id="stein2004" xreflabel="Stein2004">Stein, A., Keiper, J., Bezerra, L., Brocks, H., Thiel,
U.: <emphasis role="ital">Collaborative research and documentation of european film history: The collate
collaboratory.</emphasis> In International Journal of Digital Information Management (JDIM), special issue on
Web-based collaboratories from centres without. (2004) 30–39 </bibliomixed><bibliomixed xml:id="hahn2008" xreflabel="Hahn2008">Hahn, D., Nucci, M., Barbera, M.: <emphasis role="ital">The talia library
platform - rapidly building a digital library on rails.</emphasis> In: 4th Workshop on Scripting for the Semantic Web.
(2008) </bibliomixed><bibliomixed xml:id="scotti2001" xreflabel="Scotti2001">Scotti, A., Nuzzo, D.: <emphasis role="ital">Pinakes – a modeling
environment for scientific heritage database applications.</emphasis> In: Proc. of Reconstructing science – Contributions
to the enhancement of the European scientific heritage Workshop, Ravenna, Italy (2001) </bibliomixed><bibliomixed xml:id="bertoncini2007" xreflabel="Bertoncini2007">Bertoncini, M.: <emphasis role="ital">On the move towards the
european digital library: Bricks, tel, michael and delos converging experiences.</emphasis> In: Research and Advanced
Technology for Digital Libraries, 11th European Conference, ECDL 2007, Budapest, Hungary, September 16-21,
2007, Proceedings. Volume 4675 of Lecture Notes in Computer Science., Springer (2007) 440–441.
doi: <biblioid class="doi">10.1007/978-3-540-74851-9_37</biblioid>. </bibliomixed><bibliomixed xml:id="kruk2007" xreflabel="Kruk2007">Kruk, S.R., Woroniecki, T., Gzella, A., Dabrowski, M.:
<emphasis role="ital">Jeromedl - a semantic digital library.</emphasis> In Golbeck, J., Mika, P., eds.:
Semantic Web Challenge. Volume 295 of CEUR Workshop Proceedings., CEUR-WS.org (2007) </bibliomixed><bibliomixed xml:id="doumat2008" xreflabel="Doumat2008">Doumat, R., Egyed-Zsigmond, E., Pinon, J.M., Csiszar,
E.: <emphasis role="ital">Online ancient documents: Armarius.</emphasis> In: ACM DocEng’08. Proceedings of the
Eighth ACM Symposium on Document Engineering, ACM (September 2008) 127–130. doi: <biblioid class="doi">10.1145/1410140.1410167</biblioid> </bibliomixed><bibliomixed xml:id="schonefeld2006" xreflabel="Schonefeld2006">Schonefeld, O., Witt, A.: <emphasis role="ital">Towards validation of concurrent markup.</emphasis> In: Proceedings of the Extreme Markup 2006, Montréal, Canada </bibliomixed><bibliomixed xml:id="schmidt2009" xreflabel="Schmidt2009">Schmidt, D.: <emphasis role="ital">Merging Multi-Version Texts: a Generic Solution to the Overlap Problem.</emphasis> Presented at Balisage: The Markup Conference 2009, Montréal, Canada, August 11 - 14, 2009. In Proceedings of Balisage: The Markup Conference 2009. Balisage Series on Markup Technologies, vol. 3 (2009). doi: <biblioid class="doi">10.4242/BalisageVol3.Schmidt01</biblioid>.</bibliomixed></bibliography></article>
