<?xml version="1.0" encoding="UTF-8"?><article xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0-subset Balisage-1.2"><title>A formal approach to XML semantics: implications for archive standards</title><info><confgroup><conftitle>International Symposium on XML for the Long Haul:  Issues in the Long-term Preservation of XML</conftitle><confdates>August 2, 2010</confdates></confgroup><abstract><para>Previous literature characterizing XML semantics (<citation linkend="sperbergmcqueen2000">Sperberg-McQueen et al. 2000</citation>, <citation linkend="renear2002">Renear et al. 2002</citation>, <citation linkend="piez2002">Piez 2002</citation>) takes reasonably syntactically and semantically plausible
                markup and/or schemas as a starting point. In contrast, for this paper we aim to work
                towards such a schema as an idealized end goal, by characterizing the necessary—
                if not sufficient— semantic constraints that differentiate a schema intended for archival use from nonsense and implausible
                schemas, as well as schemas that fail to sufficiently take semantics into account.
                In addition to the goal of providing a novel approach to the perennially thorny
                problem of XML semantics, we are particularly concerned with the interaction between
                the goals of archival purposes and XML semantics.</para></abstract><author><personname><firstname>Andrew</firstname><surname>Dombrowski</surname></personname><personblurb><para>Andrew Dombrowski is a 4th year PhD student at the University of Chicago in
                    the Department of Slavic Languages and Literatures and the Department of
                    Linguistics. His research focuses on language change and contact between Slavic
                    and non-Slavic languages.</para></personblurb><affiliation><jobtitle>PhD student</jobtitle><orgname>Department of Slavic Languages and Literatures and Department of
                    Linguistics, University of Chicago</orgname></affiliation><email>adombrow@uchicago.edu</email></author><author><personname><firstname>Quinn</firstname><surname>Dombrowski</surname></personname><personblurb><para>Quinn Dombrowski is the manager of the Scholarly Technology group in the
                    University of Chicago's IT Services organization. She has an MA in Slavic
                    Linguistics from the University of Chicago, and an MLS from the University of
                    Illinois at Urbana-Champaign.</para></personblurb><affiliation><jobtitle>Manager, Scholarly Technology</jobtitle><orgname>University of Chicago</orgname></affiliation><email>quinnd@uchicago.edu</email></author><legalnotice><para>Copyright © 2010 by the authors. Licensed under Creative Commons’ attribution, non-commercial, share-alike license (<link xlink:href="http://creativecommons.org/licenses/by-sa/3.0/" xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://creativecommons.org/licenses/by-sa/3.0/</link>).</para></legalnotice></info><section><title>0. Introduction</title><para>In contrast to syntax, which is explicitly (and machine-readably) defined for XML
            documents through use of a schema, XML semantics is notoriously difficult to pin down.
                <citation linkend="sperbergmcqueen2000">Sperberg-McQueen, et al. 2000</citation>
            takes the approach of describing semantics by defining some of the processes one goes
            through unconsciously when interpreting the semantics of XML: what meaning elements and
            attributes convey, how one makes sense of seemingly conflicting statements, the
            different behavior of distributed and non-distributed features, etc. <citation linkend="renear2002">Renear, et al. 2002</citation> presents the issue of XML
            semantics in its historic context, identifies important aspects of semantics (class
            relationships, feature propagation, context and reference, etc.) that are usually only
            specified in accompanying prose documentation—if at all, and argues for the value of a
            machine-readable representation scheme for markup semantics, which is one of the
            research goals of the BECHAMEL Project. <citation linkend="piez2002">Piez
                2002</citation> takes a more philosophical approach to XML semantics, drawing on the
            work of Ferdinand de Saussure and the Structuralist movement by describing markup as a
            layered sign system. All of these approaches take reasonably syntactically and
            semantically plausible markup and/or schemas as their starting point. In contrast, for this paper we aim to work
            towards such a schema as an idealized end goal, by characterizing the necessary—
            if not sufficient— semantic constraints that differentiate a schema intended for archival use from nonsense and implausible
            schemas, as well as schemas that fail to sufficiently take semantics into account. In addition to the goal of 
            providing a novel approach to the perennially
            thorny problem of XML semantics, we are particularly concerned with the interaction
            between the goals of archival purposes and XML semantics.</para><para>We argue that for archival purposes, XML semantics are non-trivial - i.e., (1) that
            the problem of XML semantics cannot be reduced to the set of all possible use cases, (2)
            that XML syntax and semantics differ with regard to crucial structural properties, and
            (3) that semantics and syntax impose independent well-formedness constraints on schemas. We examine these properties in the context of a hypothetical long-haul archival situation in which documentation may not have been preserved – and in which the agendas underpinning the original markup may not be easy to reconstruct. In such circumstances, the interpretation of a given XML markup schema will be facilitated by an ability to explicitly delineate plausible markup schemas from non-plausible schemas independent of subject-specific knowledge. </para><para>With this in mind, we provide a formal semantic characterization of traits found in good (reasonably plausible, as
            contrasted with merely syntactically valid) schemas, and finally propose a set of properties that characterize such schemas in a way that incorporates both semantic and syntactic considerations. We hope
            that specifically considering what semantic characteristics should exclude a schema from
            consideration as a plausible archive standard will indirectly shed light on the nature
            of XML semantics more broadly. However, it is not our goal in this paper to propose an exhaustive treatment of XML semantics; rather, we aim to elucidate the bare minimum necessary for a schema to be plausible. This paper is informed by linguistic methodology in the
            broad sense – i.e., the proposition that a characterization of the bare minimum of “grammaticality” can yield insight of broader interest. In particular, we draw upon notions developed in the modern school of semantics that began with Montague Grammar. As
            such, we hope that some of the developments in the field of linguistics in the last 50
            years, as reflected herein, prove as insightful a lens onto markup as the earlier
            Structuralist school.</para></section><section><title>1. Why semantics?</title><para>The characterization of archive-appropriate schemas necessitates separating "good" (i.e.,
            plausibly useful) schemas from the infinitely large space of valid XML schemas. At any
            given point in time, practical and case-specific evaluations of the utility of a given
            schema should suffice for most purposes. However, long-term preservation also means
            planning for environments in which a significant amount of case-specific detail may have
            been lost. Lexical semantics are particularly mutable over time; the description of
            "symbol" provided by TEI, <quote>documents the intended significance of a particular
                character or character sequence within a metrical notation, either explicitly or in
                terms of other symbol elements in the same metDecl</quote>
            <citation linkend="teip4">TEI P4</citation> is easier to intuitively grasp given the
            modern English meaning of the word than based on the 15th century usage, meaning "creed,
            summary, religious belief" <citation linkend="onlinetym">Online Etymology
                Dictionary</citation>. The assumptions underlying research programs are even less
            stable than lexical semantics; the concern with structuralist semantics was superseded
            in the 1960s by the controversial and short-lived generative semantics research program,
            which was itself eventually superseded (in the 1980s and onward) by more modern schools
            of semantics, beginning with Montague grammar, that have drawn on techniques of formal
            logic for their basis. An illustrative thought experiment here is to imagine projecting
            markup technologies into the past to be coextensive with literacy. What XML schemas
            would have been created by, for instance: a Greek dramatist, St. Augustine, an early
            medieval Chinese chronicler, and an alchemist? And how would these schemas differ from,
            say, TEI? While a single modern guideline such as TEI may be able to encode the written
            records of this diverse group of individuals in a way that is meaningful to the modern
            scholar, a TEI encoding of these texts informed by modern scholarly interests would not
            only fail to be interoperable with the schemas devised by the original authors, but may
            perhaps not even be comprehensible to them.</para><para>A rich knowledge of the specific situations (intended use, cultural context, concept
            of authorship/citation, etc.)  in which these hypothetical schemas were created would
         ameliorate the situation. However, a goal of long-term preservation standards is to allow a certain
            degree of interoperability without crucial context-specific knowledge. One step in doing
            so is to separate out the relatively small set of plausibly useful schemas from the
            potentially vast space of valid schemas; it is the goal of this paper to outline a way
            of doing so. </para><para>To illustrate this, we can consider example XML using completely ridiculous schemas
            and some using merely implausible schemas. Examples using completely ridiculous schemas
            are shown below (1-3). In each of these schemas, the permitted content type of each
            element is the actual object, action, or part of speech specified by the name of the element (e.g.,
            <code>&lt;branch /&gt;</code> can only contain such a protrusion from a
            tree, <code>&lt;simplify /&gt;</code> can only contain the act of
            simplification, etc.)</para><orderedlist><listitem><para> the tree-list schema: <code>&lt;trunk /&gt;&lt;oak
                        /&gt;&lt;maple /&gt;&lt;branch /&gt;</code></para></listitem><listitem><para>the command-list schema: <code>&lt;simplify /&gt;&lt;eat
                        /&gt;&lt;breathe /&gt;</code></para></listitem><listitem><para>the English conjunctions schema: <code>&lt;and /&gt;&lt;but
                        /&gt;&lt;however /&gt;</code></para></listitem></orderedlist><para>Some structurally similar schemas are intuitively less ridiculous, although also
            implausible. Examples of XML using implausible schemas are given below (4-7).</para><orderedlist startingnumber="4"><listitem><para>the word-length schema: <code>&lt;word length="x"/&gt;</code>, where x
                    = # of letters in word</para></listitem><listitem><para>the "broken clock is right twice a day" incorrect word-length schema:
                        <code>&lt;word length="x*sin(n°)"/&gt;</code> where of x = # of
                    letters in the n-th word in the document</para></listitem><listitem><para>the count-words-by-threes schema: <code>&lt;word1 /&gt;&lt;word2
                        /&gt;&lt;word3 /&gt;&lt;word1 /&gt;&lt;word2
                        /&gt;&lt;word3 /&gt;&lt;word1 /&gt;</code> etc...</para></listitem><listitem><para>the conspiracy-theorist schema: <code>&lt;word(n) /&gt;
                        &lt;word(n+k)/&gt; &lt;word(n+2k)/&gt;</code>, etc., where n
                    is the n-th word in the text and k is a number imbued with some significance
                    (e.g. 666, 42, (with a few tweaks) a succession of prime numbers, etc...)</para></listitem></orderedlist><para>How, then, to distinguish between the ridiculous, the implausible, and the
            plausible?</para><para>An immediate and intuitive objection to these schemas might be that they can be ruled
            out on the grounds that no one would possibly be interested in them. However, that
            explanation, which can be termed the "practical usability explanation", is not fully
            adequate. First, it is not necessarily clear that this approach would capture the
            difference between the ridiculous and the merely implausible. On a certain level, the
            English conjunctions schema could be thought to be more plausible than the "broken clock
            is right twice a day" incorrect word-length schema, insofar as it is much easier to
            imagine why someone would be interested in conjunctions than in looking at the result of
            multiplying word-length figures by the sine function. However, the English conjunctions
            schema is clearly bad in a way that the "broken clock is right twice a day" incorrect
            word-length schema isn't. Intuitively speaking, we might say that conjunctions are a
            reasonable area of interest, but given an interest in conjunctions, the English
            conjunctions schema is unlikely to be your choice. On the other hand, being interested in
            multiplying word-length figures by the sine function is bizarrely implausible, but if
            for some reason one wanted to do that, the "broken clock is right twice a day" incorrect
            word-length schema would work.</para><para>The "practical usability explanation" is especially problematic in the context of
            archival preservation. Part of the reason why long-term archival preservation of XML is
            a non-trivial task is precisely the fact that it is not always obvious what future
            generations of researchers will find interesting or useful. Furthermore, the
            establishment of practical usability will always to a certain extent be in the eye of
            the beholder. Schemas like our conspiracy-theorist schema could be of potential interest
            - Dan Brown, for instance, could testify to the wide public appeal of conspiracy
            theories. More seriously, debates about intuitive assessments of practical utility are
            unlikely to be a fundamentally productive line of discussion.</para><para>Another possible objection is that by definition XML markup is performed on text. This renders the tree-list schema and the command-list schema impossible insofar as it is a feature of the real world that tree parts and actions are not composed of combinations of characters. While this is a reasonable objection, the degree to which these assertions are based on potentially contestable real-world knowledge is problematic. It may be difficult to imagine a situation in which a sane person would assert that trees are composed of characters in an ontologically real sense, but one can more easily imagine a lively argument about whether actions can be expressed with words in an ontologically real sense (e.g. performatives). Regardless, this line of reasoning is only applicable with difficulty in a hypothetical long-haul preservation scenario – assumptions about real-world phenomena have been known to change over time.</para><para>What criteria, then, can we use to distinguish ridiculous, implausible, and plausible
            schemas without reference to practical utility or related questions? Syntax could help; an intuitive
            observation about schemas (1) - (7) is that they are structurally flat, an observation
            which leads to the suggestion that more elaborate syntactic structure may be
            characteristic of plausible schemas. While this may be the case, it is also the case
            that equally absurd examples could be constructed to an arbitrary degree of syntactic
            nestedness, and not all flat schemas are absurd (e.g., Dublin Core). This illustrates
            that syntactic considerations are not sufficient to the task at hand. The rest of this
            paper develops a proposal that employs semantics to characterize plausible schemas, as
            opposed to syntactically valid but ridiculous or implausible schemas.</para></section><section><title>2. Syntax-Semantics Mismatches in XML</title><para>A prerequisite to any discussion of XML syntax versus XML semantics is to determine
            whether or not XML syntax and XML semantics are on some level equivalent. If a
            generalization about XML semantics could be restated making reference only to XML
            syntax, this would render any mention of semantics irrelevant. In this section, it is
            shown that there are at least two senses in which syntax and semantics are crucially
            distinct in XML. (A note on representation: in the field of semantics, angled brackets
            are used to refer to words, while square brackets refer to what the words mean, or
            their denotation. Therefore, in this context, &lt;cat&gt; refers to an element
            that could be employed in a schema, while [[cat]] refers to the furry animal, and <code>&lt;cat&gt;</code> refers to an XML representation.) </para><para>First, XML syntax is strictly hierarchical, but XML semantics does not have to be. An
            example where both syntax and semantics are hierarchical can be seen in paragraph
            structure: &lt;sentence&gt; ∊ &lt;paragraph&gt; (in XML,
                <code>&lt;paragraph&gt;&lt;sentence
                /&gt;&lt;/paragraph&gt;</code>) and [[sentence]] ⊂ [[paragraph]] (a
            sentence is a subset of a paragraph). However, when elements refer to properties that
            are not inherently hierarchical, this is not the case. For instance,
            &lt;damage&gt; ∊ &lt;sentence&gt; but [[damage]] ⊄ [[sentence]] - i.e.,
            the element &lt;damage&gt; may be a child element of
            &lt;sentence&gt;, but it does not make sense to say that the concept of damage
            is a subset of the concept of sentence. This can be formalized as follows: if the subset
            relationship holds between the denotations of two or more elements (like [[sentence]]
            and [[paragraph]]), let these elements be called semantically hierarchical. If not (like
            [[damage]] and [[sentence]]), then let these elements be called semantically
            non-hierarchical. The semantic hierarchy can be captured by arranging semantically
            hierarchical elements on the semantic levels s, s(1), s(2), ..., s(k) for k levels of
            specificity (proceeding from general to specific) - i.e., the semantic hierarchy
            consists of semantically hierarchical elements, arranged accordingly.</para><para>As an aside, it can be noted that proposals have been made for XML syntax to be
            non-strictly hierarchical in order to accommodate different kinds of structures in a
            document <citation linkend="renear1993">Renear, et al. 1993</citation>, which stands in
            contrast to earlier conceptions of a document as containing a single logical hierarchy
            of content objects <citation linkend="derose1990">DeRose, et al. 1990</citation>.
            Non-hierarchical syntax involves the use of different (concurrent) structures that may
            overlap with one another but share the same content <citation linkend="chatti2007">Chatti, et al. 2007</citation>.  Syntactic non-hierarchicality applies only to
            interactions between different syntactic levels of the schema (although, in extreme
            cases, such as the Dublin Core, there may only be one level of syntax at all), and does
            not obviate hierarchicality in the semantics. </para><para>Syntax and semantics also impose independent constraints on the well-formedness of
            schemas (where well-formedness is understood as the property that characterizes
            plausible schemas). The independence of syntactic and semantic constraints is
            illustrated below; again, here the element <code>&lt;every&gt;</code> can only
            contain the concept of every-ness:</para><orderedlist startingnumber="8"><listitem><para>good syntax + good semantics: <code>&lt;paragraph&gt;&lt;sentence
                        /&gt;&lt;/paragraph&gt;</code></para></listitem><listitem><para>bad syntax + good semantics:
                        <code>&lt;paragraph&gt;&lt;sentence&gt;&lt;/paragraph&gt;&lt;/sentence&gt;</code></para></listitem><listitem><para>good syntax + bad semantics: <code>&lt;paragraph&gt;&lt;every
                        /&gt;&lt;/paragraph&gt;</code></para></listitem><listitem><para>bad syntax + bad semantics:
                        <code>&lt;paragraph&gt;&lt;every&gt;&lt;/paragraph&gt;&lt;/every&gt;</code></para></listitem></orderedlist><para>These considerations demonstrate that XML syntax and semantics must be analyzed as
            separate domains. The restrictions that hold on valid XML syntax have been well
            documented <citation linkend="W3C2008">W3C 2008</citation>, whereas the restrictions
            that must hold on the semantics of plausible schemas are less well described.</para></section><section><title>3. Formal Semantics of XML</title><section><title>3.1 Semantic Types</title><para>In this section, we propose that attributes and elements in plausible XML schemas
                must be of type &lt;e, t&gt;, where the notation &lt;e, t&gt; is
                understood as indicating a function from individuals (&lt;e&gt;) onto truth
                values (&lt;t&gt;).<footnote><para>The use of angled brackets to indicate
                        semantic types is common practice in the field of modern semantics. To avoid
                        inventing a new notation system just for this paper, we adopt the same
                        practice despite the risk of confusion with markup notation. In this paper,
                        &lt;e, t&gt; indicates semantic type, &lt;e&gt; indicates
                        an individual, and &lt;t&gt; indicates a truth
                    value.</para></footnote> This is the semantic type generally postulated to
                characterize common nouns and adjectives in English. For instance, [[dog]] can be
                thought of as the set of all things that are dogs - i.e., a function f from
                individuals (any and all conceivable entities in this world) onto truth values (1 =
                true, 0 = false) such that f(x) = 1 iff [[x]] is a dog. One could object that it
                would be simpler to state this proposal in terms of nouns and adjectives - i.e., to
                propose that attributes and elements should be nouns and adjectives. However, it is
                preferable to state this in terms of semantics, because we need to keep our terms
                straight. "Nouns" and "adjectives" are terms taken from English syntax, which is not
                optimal when what we really want to talk about is XML semantics - i.e., neither
                English nor syntax. This proposal rules out absurd schemas (2) and (3) from the
                introduction, and captures the intuition that attributes and elements should be
                statements about things.</para><para>Beyond the intuitive appeal of this proposal, it can be derived in a bottom-up
                fashion, based only on the assumptions that (1) texts are made up of things, and (2)
                that markup says things about things. Assumption (1) shows that texts are made up of
                basic components of type &lt;e&gt;. Assumption (2) leads directly to a
                semantic type of &lt;e, t&gt; for elements and attributes; i.e., something
                is tagged <code>&lt;paragraph&gt;</code> only if it is true that it is a
                paragraph, modulo whatever definition of paragraph is appropriate in context. A
                formal definition of "tag abuse" can also fall out from assumption (2), i.e., tag
                abuse is the mapping of an individual onto a truth value of zero.  In a situation
                where <code>&lt;ship&gt;</code> is being used to cause some arbitrary text
                (other than a ship name) to be rendered in italics <citation linkend="piez2001">Piez
                    2001</citation>, the user has misunderstood that the element
                &lt;ship&gt; is a function that assigns the value 1 to its contents, if and
                only if it is true that the denotation of the contents is a ship. </para><para>Translated into the terms above, the element &lt;paragraph&gt; is a
                function from individual bits of text onto truth values such that
                &lt;paragraph&gt;(x) = 1 iff [[x]] is a paragraph. Assumptions (1) and (2)
                should be basic for all archival purposes. Denying assumption (2) could lead to the
                emergence of bizarre surrealist schemas, but it seems safe to conclude that ruling
                out such schemas is precisely the goal for developing archival standards. It is not
                clear what denying assumption (1) would even mean ontologically.</para><para>    More complicated functions are of course conceivable, but they are the domain
                of the processing language rather than the XML itself. An example of this would be a
                function of the type &lt;&lt;e, t&gt;, &lt;e, t&gt;&gt; -
                i.e., a function that takes one element/attribute and returns another. For instance,
                one such function would take a nested element and return the element one level
                higher.</para><para>It should be noted that in the above proposal XML schemas are not assumed to be compositional semantically. To some extent, it is an open question whether or not a compositional minimal semantics for XML is a desirable feature. Compositional semantics would inevitably result in a proliferation of types, thereby obviating the proposed distinction between &lt;e, t&gt; elements that belong to XML and other elements that are the domain of the processing language. On the other hand, non-compositional semantics means that the concept of function admissible in XML must be wide enough to include input from outside the local domain of the element. For instance, the attribute <code>lang = "en"</code> must be valued by referring to something beyond the string of characters "en". Similarly, an element containing many sub-elements would have to be evaluable in terms of its sub-elements. To a certain extent, it remains to be seen whether non-compositional semantics makes undesirable predictions. Absent such evidence, the more parsimonious option is not to include compositionality as an explicit requirement.
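</para><para>As an informal illustration of the typing proposed in this section (a sketch under simplifying assumptions, not part of the formal proposal), elements can be modelled in Python as functions from individuals onto truth values. The names <code>Chunk</code>, <code>denotes</code>, <code>is_tag_abuse</code>, <code>one_level_up</code>, and <code>PARENT</code> are hypothetical, and the <code>denotes</code> label merely stands in for whatever real-world knowledge fixes the denotation of a bit of text.</para><programlisting xml:space="preserve"><![CDATA[from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Chunk:
    """An individual (type e): a bit of text plus an assumed denotation label."""
    text: str
    denotes: str  # hypothetical stand-in for real-world knowledge


# Type <e, t>: a function from individuals onto truth values.
Element = Callable[[Chunk], bool]


def ship(x: Chunk) -> bool:
    """ship(x) = 1 iff the denotation of x is a ship."""
    return x.denotes == "ship"


def sentence(x: Chunk) -> bool:
    """sentence(x) = 1 iff x is a sentence."""
    return x.denotes == "sentence"


def paragraph(x: Chunk) -> bool:
    """paragraph(x) = 1 iff x is a paragraph."""
    return x.denotes == "paragraph"


def is_tag_abuse(element: Element, x: Chunk) -> bool:
    """Tagging x with an element that maps x onto 0 counts as tag abuse."""
    return not element(x)


# Type <<e, t>, <e, t>>: takes one element and returns another.
# On the proposal above this belongs to the processing language, not the XML.
PARENT: Dict[Element, Element] = {sentence: paragraph}  # assumed nesting


def one_level_up(element: Element) -> Element:
    """Return the element one level higher in the assumed nesting."""
    return PARENT[element]


if __name__ == "__main__":
    print(is_tag_abuse(ship, Chunk("Titanic", "ship")))
    print(is_tag_abuse(ship, Chunk("very", "emphasis")))
    print(one_level_up(sentence) is paragraph)
]]></programlisting><para>Under these assumptions, <code>is_tag_abuse(ship, Chunk("very", "emphasis"))</code> holds, because <code>ship</code> maps that individual onto 0, while <code>one_level_up</code> is kept apart from the elements themselves, mirroring the division of labor between the XML and the processing language.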
        </para></section><section><title>3.2 Semantic Coherence</title><para>The requirement that attributes and elements in plausible XML schemas be of type
                &lt;e, t&gt; is necessary but not sufficient to the task of ruling in
                plausible schemas while ruling out implausible schemas. To illustrate the point,
                consider the XML in (12) and (13):</para><orderedlist startingnumber="12"><listitem><para>
                        <code>&lt;title /&gt;&lt;creator /&gt;&lt;subject
                            /&gt;&lt;description /&gt;&lt;publisher /&gt;</code>
                    </para></listitem><listitem><para>
                        <code>&lt;title /&gt;&lt;giraffe /&gt;&lt;arsenic
                            /&gt;&lt;starvation /&gt;&lt;King of France
                            /&gt;</code>
                    </para></listitem></orderedlist><para>Example (12) is an excerpt from the well-known Dublin Core schema for marking up
                metadata, while schema (13) is nonsense that satisfies the requirement that
                attributes and elements be of semantic type &lt;e, t&gt;. How, then, to rule
                out (13) as compared to (12)? In this section, we attempt to develop the intuition
                that there exists a real-world object such that the traits [[title]], [[creator]],
                [[subject]], [[description]], and [[publisher]] can be predicated of it or its constituent parts with a truth
                value of 1 (i.e., there exists at least one object that has all of these traits),
                but there is no real-world object such that [[title]], [[giraffe]], [[arsenic]],
                [[starvation]], and [[King of France]] can be predicated of it with a truth value of
                1. As a reminder, the notation [[title]] should be understood as meaning roughly "something that is a title".</para><para>In order to formalize this insight, it is necessary to take a closer look at how
                entities of type &lt;e, t&gt; operate. The denotation of such an entity
                ([[x]] where x is of type &lt;e, t&gt;) is either 1 or 0 (corresponding to
                true or false). Such an entity must give a truth value based on an entity of type
                &lt;e&gt; - i.e., a chunk of text. The only restriction on this process is
                that it be a function, which for these purposes only means that some individual x
                cannot be assigned to both true and false - i.e., it cannot be simultaneously true
                and false that a chunk of text is a paragraph. Within this very wide scope, it is
                possible to distinguish multiple types of functions. Structural-type functions
                assign truth values based on whether or not the individual entity under evaluation
                meets certain structural criteria; i.e., x is a paragraph if and only if x is a
                paragraph. Predicative-type functions assign truth values based on a
                non-definitional but inherent property of the entity under evaluation; i.e., x is in
                German if and only if x is in German (as distinct from being a sentence, a
                paragraph, a word, etc.) Attributive-type functions assign truth values based on a
                non-definitional and non-inherent property of the entity under evaluation - i.e., x
                is the title if and only if x is the title, a bit of information that requires
                specific real-world context to determine.</para><para>With this in mind, we can return to the main topic and provide a more precise
                characterization of semantic coherence. A schema S is said to be semantically
                coherent iff for its elements and attributes a1, a2, a3, ..., an ∊ S there exists a
                set of entities (of type &lt;e&gt;) {x1, x2, x3, ..., xn} such that
                [[ak(xk)]] = 1 for all k from 1 to n. The concrete interpretation of this will vary
                depending on whether the elements or attributes in question are structural,
                predicative, or attributive. This rules out example 13, because there are no
                real-world objects to which every element could simultaneously assign a truth
                value of 1 (i.e., there is no thing that literally consists of or contains a
                title, a giraffe, arsenic, starvation, and the King of France, all at the same
                time.)</para></section><section><title>3.3 Semantic Hierarchies</title><para>At least one more issue must be discussed in order to fully characterize plausible
                schemas. Compare (14) and (15):</para><orderedlist startingnumber="14"><listitem><para>
                        <code>&lt;paragraph&gt;&lt;sentence/&gt;&lt;/paragraph&gt;</code>
                    </para></listitem><listitem><para>
                        <code>&lt;sentence&gt;&lt;paragraph/&gt;&lt;/sentence&gt;</code>
                    </para></listitem></orderedlist><para>(14) obviously corresponds to a common schema, while (15) is nonsense. Syntax
                cannot help here, nor does it suffice to appeal to the claim that (15) is not
                plausible because it is not plausible. (15) is not plausible because its syntax
                conflicts with its semantics. In order to get a precise handle on
                this, it is necessary to formalize the notion of semantic hierarchies.</para><para>The semantic representation of an XML tree may be considered to consist of the
                linearly arranged denotations of the elements and attributes present within an XML
                tree. In other words, &lt;element&gt; → [[element]] and
                &lt;attribute&gt; → [[attribute]]. As applied to (14) and (15), this yields
                the following table.</para><table cellspacing="10px"><thead><tr><th/><th>Linear Representation</th><th>Hierarchical Representation</th></tr></thead><tbody><tr><th>Syntax of (14)</th><td>
                            <code>&lt;paragraph&gt;&lt;sentence/&gt;&lt;/paragraph&gt;</code>
                        </td><td>
                            <code>&lt;sentence&gt; ∊ &lt;paragraph&gt;</code>
                        </td></tr><tr><th>Semantics of (14)</th><td>[[paragraph]][[sentence]]</td><td>[[sentence]] ⊂ [[paragraph]] </td></tr><tr><th>Syntax of (15)</th><td>
                            <code>&lt;sentence&gt;&lt;paragraph/&gt;&lt;/sentence&gt;</code>
                        </td><td>
                            <code>&lt;paragraph&gt; ∊ &lt;sentence&gt;</code>
                        </td></tr><tr><th>Semantics of (15)</th><td>[[sentence]][[paragraph]]</td><td>[[sentence]] ⊂ [[paragraph]]</td></tr></tbody></table><para>Table I indicates what the problem with (15) is: we can freely
                change the syntax of (14), but however much we change the syntax, we cannot change
                what "paragraph" and "sentence" mean. In particular, we cannot change the fact that
                [[paragraph]] and [[sentence]] are semantically hierarchical. The only remaining
                step is to smooth over the notational discrepancy between hierarchical syntactic
                relationships and hierarchical semantic relationships.</para><para>Below is a formal characterization of semantic hierarchies as conceived more
                abstractly as ordering relationships: given a set of entities E = {e1, e2, e3, ...,
                en}, a hierarchy can be defined as an ordered k-tuple (ei, ej, ek) made up of
                elements of E. An XML schema S is then made up of both syntactic elements/attributes
                and their denotations: S = {&lt;e1&gt;, [[e1]], &lt;e2&gt;, [[e2]],
                &lt;e3&gt;, [[e3]], &lt;e4&gt;, [[e4]], ..., &lt;en&gt;,
                [[en]]}. We can then state that any ordering that holds for a syntactic element
                &lt;ek&gt; in S must also hold for its semantic correspondent [[ek]]. If the
                above holds, we may then state that semantic hierarchies respect syntactic
                hierarchies; i.e., while the syntactic and semantic hierarchies need not
                correspond exactly, they cannot contradict each other. This rules out (15).</para></section></section><section><title>4. Some Very Basic Features of Archive Standards</title><para>In this section, we summarize the above points and add some other criteria that must
            be met by plausible archive standards.</para><orderedlist><listitem><para><emphasis>Syntax is arbitrarily nested</emphasis>. If the most general level
                    is p, let the more specific levels be denoted by p + 1, p + 2, ..., p + k for k
                    levels of specificity. It is not necessarily the case that one and only one
                    element corresponds to each syntactic level. For instance, it is possible that
                    elements like &lt;sentence&gt; and &lt;metaphor&gt; are on the
                    same level.</para></listitem><listitem><para>
                    <emphasis>Elements and attributes are of semantic type &lt;e,
                        t&gt;.</emphasis>
                </para></listitem><listitem><para>
                    <emphasis>Schemas must be semantically coherent.</emphasis>
                </para></listitem><listitem><para>
                    <emphasis>Syntactic hierarchies must respect semantic hierarchies.</emphasis>
                </para></listitem><listitem><para><emphasis>Elements and attributes are assigned at the highest possible
                        level.</emphasis> This is an obvious insight that is not trivial to
                    formalize, the insight being that elements and attributes should not be
                    gratuitously repeated<footnote><para>This is similar to the insight that
                            "distributed properties" (as we classify them, predicative, and some
                            attributive, attributes and elements) are non-countable, i.e., <quote>If
                                an element type x marks a distributed property, then any two
                                adjacent x elements may be joined, or one x element may be split:
                                &lt;x&gt;abc&lt;/x&gt;&lt;x&gt;def&lt;/x&gt;
                                means exactly the same thing as
                                &lt;x&gt;abcdef&lt;/x&gt;.</quote>
                            <citation linkend="sperbergmcqueen2000">Sperberg-McQueen, et al.
                                2000</citation></para></footnote>. Sometimes (in the case of
                    structural elements), this is because to do otherwise would be semantically
                    invalid (e.g.,
                    <code>&lt;paragraph&gt;&lt;paragraph&gt;&lt;/paragraph&gt;&lt;/paragraph&gt;</code>).
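                    <footnote><para>As a rough illustration only, the structural case can be checked
                            mechanically. The following Python sketch assumes that "level" can be read as
                            parent/child depth in the XML tree, and it uses the standard library module
                            xml.etree.ElementTree. The function name self_nested is hypothetical and is
                            not part of any schema tool. It lists every element that directly contains a
                            child of the same name.</para><programlisting xml:space="preserve"><![CDATA[import xml.etree.ElementTree as ET


def self_nested(root):
    """Return the tag of every element that directly contains a same-named child.

    Assumption: one syntactic "level" equals one step of parent/child depth.
    """
    return [
        parent.tag
        for parent in root.iter()   # root and all descendants, in document order
        for child in parent         # direct children only
        if child.tag == parent.tag
    ]


# The invalid structural nesting discussed in the text.
bad = ET.fromstring("<paragraph><paragraph></paragraph></paragraph>")
# An ordinary paragraph containing sentences.
good = ET.fromstring("<paragraph><sentence/><sentence/></paragraph>")

print(self_nested(bad))   # ['paragraph']
print(self_nested(good))  # []
]]></programlisting><para>A hit from such a check would only flag a candidate for review, since
                            attributive elements such as &lt;metaphor&gt; or &lt;damage&gt; may
                            legitimately nest, as the discussion of attributive elements in this section
                            notes.</para></footnote>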
                    For transitive predicative attributes or elements, it would be redundant (i.e., not
                    everything needs to be redundantly marked for language). Thus, for most elements
                    and attributes, it is sufficient to state that an element or attribute that maps
                    onto t = 1 (true) at level {p + k} maps onto t = 0 (untrue) at level {p + (k -
                    1)}. This will handle structural elements like &lt;paragraph&gt; and
                    predicative attributes like &lt;language&gt;. The situation is more
                    complex with regard to attributive elements and attributes like
                    &lt;metaphor&gt; or &lt;damage&gt;. One can imagine situations in
                    which these elements might occur on two structurally contiguous levels -
                    metaphors within metaphors or damage within damage. Ontologically, the situation
                    could be saved by positing that underlyingly, different metaphors or different
                    types of damage are being denoted. The details of how to formalize this are not
                    entirely clear but would likely capitalize on the intuition that metaphors
                    inside metaphors only work if the two metaphors are different.</para></listitem></orderedlist></section><section><title>5. Conclusion</title><para>Starting our exploration of XML semantics from the perspective of all syntactically
            valid schemas has allowed us to formalize some semantic traits shared by most
            widely used schemas, traits that are easy to overlook but of great significance when assessing
            how useful a schema might be for archival purposes, and when reverse-engineering the interpretation of schemas that have been used for archival purposes but for which adequate documentation is lacking. This may also have implications for
            ongoing work towards machine-interpretation of XML semantics. If an XML document uses a
            schema that conforms to our proposed archive standards, stronger statements can be made
            about the relationship between the elements in that document. The fact that the
            syntactic hierarchy of elements is compatible with a real-world semantic hierarchy, in
            combination with the other generalizations that we have made about archive-appropriate
            XML semantics, facilitates the development of automatizable processes of analysis, and
            enables developers to bring to bear existing tools used for classifying the real
            world.</para></section><bibliography><title>Bibliography</title><bibliomixed xml:id="chatti2007"> Chatti, Noureddine; Suha Kaouk, Sylvie Calabretto and Jean
            Marie Pinon. "MultiX: an XML based formalism to encode multi-structured documents" In
            Proceedings of Extreme Markup Languages 2007.
                <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://conferences.idealliance.org/extreme/html/2007/Chatti01/EML2007Chatti01.html</link>
        </bibliomixed><bibliomixed xml:id="derose1990">DeRose, S. J., Durand, D. G., Mylonas, E., and Renear A. H.
            (1990), 'What is Text, Really?', Journal of Computing in Higher Education, 1.2:
            3-26. doi: <biblioid class="doi">10.1007/BF02941632</biblioid></bibliomixed><bibliomixed xml:id="onlinetym">Online Etymology Dictionary. "Symbol". Accessed 15 April
            2010. <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.etymonline.com/index.php?term=symbol</link></bibliomixed><bibliomixed xml:id="piez2001"> Piez, Wendell. "Beyond the “descriptive vs. procedural”
            distinction." In Proceedings of Extreme Markup Languages 2001.
                <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://conferences.idealliance.org/extreme/html/2001/Piez01/EML2001Piez01.html</link>
        </bibliomixed><bibliomixed xml:id="piez2002"> Piez, Wendell. "Human and Machine Sign Systems." In
            Proceedings of Extreme Markup Languages 2002.
                <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://conferences.idealliance.org/extreme/html/2002/Piez01/EML2002Piez01.html</link>
        </bibliomixed><bibliomixed xml:id="renear1993">Renear, Allen; Elli Mylonas, and David Durand. "Refining
            our Notion of What Text Really Is: The Problem of Overlapping Hierarchies."
                <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.stg.brown.edu/resources/stg/monographs/ohco.html</link></bibliomixed><bibliomixed xml:id="renear2002">Renear, Allen; David Dubin, and C.M. Sperberg-McQueen.
            "Towards a semantics for XML markup". Proceedings of the 2002 ACM symposium on Document
            engineering. doi: <biblioid class="doi">10.1145/585058.585081</biblioid></bibliomixed><bibliomixed xml:id="sperbergmcqueen2000">Sperberg-McQueen, C.M.; Claus Huitfeldt, Allen
            Renear. "Meaning and interpretation of markup." Markup Languages: Theory &amp;
            Practice 2.3 (2000): 215-234. <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://cmsmcq.com/2000/mim.html</link>. doi: <biblioid class="doi">10.1162/109966200750363599</biblioid></bibliomixed><bibliomixed xml:id="teip4">Text Encoding Initiative: The XML Version of the TEI Guidelines:
            5 The TEI Header. Accessed 15 April 2010.
                <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.tei-c.org/cms/Guidelines/P4/html/HD.html</link></bibliomixed><bibliomixed xml:id="W3C2008">Extensible Markup Language (XML) 1.0 (Fifth Edition). W3C
            Recommendation 26 November 2008
            <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.w3.org/TR/2008/REC-xml-20081126/</link></bibliomixed></bibliography></article>
