A formal approach to XML semantics: implications for archive standards

Andrew Dombrowski; Quinn Dombrowski

Abstract

Previous literature characterizing XML semantics (sperbergmcqueen2000, renear2002, piez2002) takes reasonably syntactically and semantically plausible markup and/or schemas as a starting point. In contrast, in this paper we aim to work towards such a schema as an idealized end goal, by characterizing the necessary (if not sufficient) semantic constraints that differentiate a schema intended for archival use from nonsensical and implausible schemas, as well as from schemas that fail to take semantics sufficiently into account. In addition to providing a novel approach to the perennially thorny problem of XML semantics, we are particularly concerned with the interaction between archival goals and XML semantics.

0. Introduction

In contrast to syntax, which is explicitly (and machine-readably) defined for XML documents through the use of a schema, XML semantics is notoriously difficult to pin down. sperbergmcqueen2000 takes the approach of describing semantics by defining some of the processes one goes through unconsciously when interpreting the semantics of XML: what meaning elements and attributes convey, how one makes sense of seemingly conflicting statements, the different behavior of distributed and non-distributed features, etc. renear2002 presents the issue of XML semantics in its historical context, identifies important aspects of semantics (class relationships, feature propagation, context and reference, etc.) that are usually specified only in accompanying prose documentation, if at all, and argues for the value of a machine-readable representation scheme for markup semantics, which is one of the research goals of the BECHAMEL Project. piez2002 takes a more philosophical approach to XML semantics, drawing on the work of Ferdinand de Saussure and the Structuralist movement by describing markup as a layered sign system. All of these approaches take reasonably syntactically and semantically plausible markup and/or schemas as their starting point. In contrast, in this paper we aim to work towards such a schema as an idealized end goal, by characterizing the necessary (if not sufficient) semantic constraints that differentiate a schema intended for archival use from nonsensical and implausible schemas, as well as from schemas that fail to take semantics sufficiently into account. In addition to providing a novel approach to the perennially thorny problem of XML semantics, we are particularly concerned with the interaction between archival goals and XML semantics.

We argue that for archival purposes, XML semantics are non-trivial - i.e., (1) that the problem of XML semantics cannot be reduced to the set of all possible use cases, (2) that XML syntax and semantics differ with regard to crucial structural properties, and (3) that semantics and syntax impose independent well-formedness constraints on schemas. We examine these properties in the context of a hypothetical long-haul archival situation in which documentation may not have been preserved – and in which the agendas underpinning the original markup may not be easy to reconstruct. In such circumstances, the interpretation of a given XML markup schema will be facilitated by an ability to explicitly delineate plausible markup schemas from non-plausible schemas independent of subject-specific knowledge.

With this in mind, we provide a formal semantic characterization of traits found in good (reasonably plausible, as contrasted with merely syntactically valid) schemas, and finally propose a set of properties that characterize such schemas in a way that incorporates both semantic and syntactic considerations. We hope that specifically considering what semantic characteristics should exclude a schema from consideration as a plausible archive standard will indirectly shed light on the nature of XML semantics more broadly. However, it is not our goal in this paper to propose an exhaustive treatment of XML semantics; rather, we aim to elucidate the bare minimum necessary for a schema to be plausible. This paper is informed by linguistic methodology in the broad sense, i.e., by the proposition that a characterization of the bare minimum of "grammaticality" can yield insight of broader interest. In particular, we draw upon notions developed in the modern school of semantics that began with Montague Grammar. As such, we hope that some of the developments in the field of linguistics in the last 50 years, as reflected herein, prove as insightful a lens onto markup as the earlier Structuralist school.

1. Why semantics?

The characterization of archive-appropriate schemas necessitates separating "good" (i.e., plausibly useful) schemas from the infinitely large space of valid XML schemas. At any given point in time, practical and case-specific evaluations of the utility of a given schema should suffice for most purposes. However, long-term preservation also means planning for environments in which a significant amount of case-specific detail may have been lost. Lexical semantics is particularly mutable over time; the description of "symbol" provided by TEI, "documents the intended significance of a particular character or character sequence within a metrical notation, either explicitly or in terms of other symbol elements in the same metDecl" (teip4), is easier to grasp intuitively given the modern English meaning of the word than given its 15th-century usage, "creed, summary, religious belief" (onlinetym). The assumptions underlying research programs are even less stable than lexical semantics; the concern with structuralist semantics was superseded in the 1960s by the controversial and short-lived generative semantics research program, which was itself eventually superseded (in the 1980s and onward) by more modern schools of semantics, beginning with Montague grammar, that draw on techniques of formal logic. An illustrative thought experiment here is to imagine projecting markup technologies into the past to be coextensive with literacy. What XML schemas would have been created by, for instance, a Greek dramatist, St. Augustine, an early medieval Chinese chronicler, and an alchemist? And how would these schemas differ from, say, TEI?

While a single modern guideline such as TEI may be able to encode the written records of this diverse group of individuals in a way that is meaningful to the modern scholar, a TEI encoding of these texts informed by modern scholarly interests would not only fail to be interoperable with the schemas devised by the original authors, but might not even be comprehensible to them.

A rich knowledge of the specific situations (intended use, cultural context, concept of authorship/citation, etc.) in which these hypothetical schemas were created would ameliorate the situation. However, a goal of long-term preservation standards is to allow a certain degree of interoperability without crucial context-specific knowledge. One step in doing so is to separate out the relatively small set of plausibly useful schemas from the potentially vast space of valid schemas; it is the goal of this paper to outline a way of doing so.

To illustrate this, we can consider example XML using completely ridiculous schemas and some using merely implausible schemas. Examples using completely ridiculous schemas are shown below (1-3). In each of these schemas, the permitted content type of each element is the actual object, action, or part specified by the name of the element (e.g., <branch /> can only contain such a protrusion from a tree, <simplify /> can only contain the act of simplification, etc.).

(1) the tree-list schema: <trunk /><oak /><maple /><branch />
(2) the command-list schema: <simplify /><eat /><breathe />
(3) the English conjunctions schema: <and /><but /><however />

Some structurally similar schemas are intuitively less ridiculous, although also implausible. Examples of XML using implausible schemas are given below (4-7).

(4) the word-length schema: <word length="x"/>, where x = number of letters in the word
(5) the "broken clock is right twice a day" incorrect word-length schema: <word length="x*sin(n°)"/>, where x = number of letters in the n-th word in the document
(6) the count-words-by-threes schema: <word1 /><word2 /><word3 /><word1 /><word2 /><word3 /><word1 /> etc.
(7) the conspiracy-theorist schema: <word(n) /> <word(n+k)/> <word(n+2k)/>, etc., where n indexes the n-th word in the text and k is a number imbued with some significance (e.g., 666, 42, or, with a few tweaks, a succession of prime numbers)

How, then, to distinguish between the ridiculous, the implausible, and the plausible?

An immediate and intuitive objection to these schemas might be that they can be ruled out on the grounds that no one would possibly be interested in them. However, that explanation, which can be termed the "practical usability explanation" is not fully adequate. First, it is not necessarily clear that this approach would capture the difference between the ridiculous and the merely implausible. On a certain level, the English conjunctions schema could be thought to be more plausible than the "broken clock is right twice a day" incorrect word-length schema, insofar as it is much easier to imagine why someone would be interested in conjunctions than in looking at the result of multiplying word-length figures by the sine function. However, the English conjunctions schema is clearly bad in a way that the "broken clock is right twice a day" incorrect word-length schema isn't. Intuitively speaking, we might say that conjunctions are a reasonable area of interest, but given an interest in conjunctions, the English conjunctions schema is unlikely to be your choice. On the other hand, being interested in multiplying word-length figures by the sine function is bizarrely implausible, but if for some reason one wanted to do that, the "broken clock is right twice a day" incorrect word-length schema would work.

The "practical usability explanation" is especially problematic in the context of archival preservation. Part of the reason why long-term archival preservation of XML is a non-trivial task is precisely the fact that it is not always obvious what future generations of researchers will find interesting or useful. Furthermore, the establishment of practical usability will always to a certain extent be in the eye of the beholder. Schemas like our conspiracy-theorist schema could be of potential interest - Dan Brown, for instance, could testify to the wide public appeal of conspiracy theories. More seriously, debates about intuitive assessments of practical utility are unlikely to be a fundamentally productive line of discussion.

Another possible objection is that by definition XML markup is performed on text. This renders the tree-list schema and the command-list schema impossible insofar as it is a feature of the real world that tree parts and actions are not composed of combinations of characters. While this is a reasonable objection, the degree to which these assertions are based on potentially contestable real-world knowledge is problematic. It may be difficult to imagine a situation in which a sane person would assert that trees are composed of characters in an ontologically real sense, but one can more easily imagine a lively argument about whether actions can be expressed with words in an ontologically real sense (e.g., performatives). Regardless, this line of reasoning is applicable only with difficulty in a hypothetical long-haul preservation scenario, since assumptions about real-world phenomena have been known to change over time.

What criteria, then, can we use to distinguish ridiculous, implausible, and plausible schemas without reference to practical utility or related questions? Syntax could help; an intuitive observation about schemas (1) - (7) is that they are structurally flat, an observation which leads to the suggestion that more elaborate syntactic structure may be characteristic of plausible schemas. While this may be the case, equally absurd examples could be constructed to an arbitrary degree of syntactic nestedness, and not all flat schemas are absurd (e.g., Dublin Core). This illustrates that syntactic considerations are not sufficient for the task at hand. The rest of this paper develops a proposal that employs semantics to characterize plausible schemas, as opposed to syntactically valid but ridiculous or implausible schemas.

2. Syntax-Semantics Mismatches in XML

A prerequisite to any discussion of XML syntax versus XML semantics is to determine whether or not XML syntax and XML semantics are on some level equivalent. If a generalization about XML semantics could be restated making reference only to XML syntax, this would render any mention of semantics irrelevant. In this section, it is shown that there are at least two senses in which syntax and semantics are crucially distinct in XML. (A note on representation: in the field of semantics, angled brackets are used to refer to words, while double square brackets refer to what the words mean, or their denotation. Therefore, in this context, <cat> refers to an element that could be employed in a schema or in an XML representation, while [[cat]] refers to the furry animal.)

First, XML syntax is strictly hierarchical, but XML semantics does not have to be. An example where both syntax and semantics are hierarchical can be seen in paragraph structure: <sentence> ∊ <paragraph> (in XML, <paragraph><sentence /></paragraph>) and [[sentence]] ⊂ [[paragraph]] (a sentence is a subset of a paragraph). However, when elements refer to properties that are not inherently hierarchical, this is not the case. For instance, <damage> ∊ <sentence> but [[damage]] ⊄ [[sentence]] - i.e., the element <damage> may be a child element of <sentence>, but it does not make sense to say that the concept of damage is a subset of the concept of sentence. This can be formalized as follows: if the subset relationship holds between the denotations of two or more elements (like [[sentence]] and [[paragraph]]), let these elements be called semantically hierarchical. If not (like [[damage]] and [[sentence]]), then let these elements be called semantically non-hierarchical. The semantic hierarchy can be captured by arranging semantically hierarchical elements on the semantic levels s, s(1), s(2), ..., s(k) for k levels of specificity (proceeding from general to specific) - i.e., the semantic hierarchy consists of semantically hierarchical elements, arranged accordingly.

As an aside, it can be noted that proposals have been made for XML syntax to be non-strictly hierarchical in order to accommodate different kinds of structures in a document renear1993, which stands in contrast to earlier conceptions of a document as containing a single logical hierarchy of content objects derose1990. Non-hierarchical syntax involves the use of different (concurrent) structures that may overlap with one another but share the same content chatti2007. Syntactic non-hierarchicality applies only to interactions between different syntactic levels of the schema (although, in extreme cases, such as the Dublin Core, there may only be one level of syntax at all), and does not obviate hierarchicality in the semantics.

Syntax and semantics also impose independent constraints on the well-formedness of schemas (where well-formedness is understood as the property that characterizes plausible schemas). The independence of syntactic and semantic constraints is illustrated below; again, here the element <every> can only contain the concept of every-ness:

(8) good syntax + good semantics: <paragraph><sentence /></paragraph>
(9) bad syntax + good semantics: <paragraph><sentence></paragraph></sentence>
(10) good syntax + bad semantics: <paragraph><every /></paragraph>
(11) bad syntax + bad semantics: <paragraph><every></paragraph></every>
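
The independence of the two checks in the four cases above can be sketched in Python. This is only an illustration under stated assumptions: the set TEXT_PREDICATES is a hypothetical stand-in for "element names whose denotation can hold of a chunk of text", and the function names are ours rather than part of any standard. The syntactic check relies on the standard library parser; the semantic check deliberately reads element names without parsing, so that it can still give a verdict on markup that is not well-formed.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical toy vocabulary (an assumption for illustration only):
# element names whose denotation can plausibly hold of a chunk of text.
TEXT_PREDICATES = {"paragraph", "sentence"}


def syntax_ok(markup: str) -> bool:
    """Return True if the markup parses as well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def semantics_ok(markup: str) -> bool:
    """Return True if every element name used is in the toy vocabulary.

    Names are read from opening tags with a regular expression, so the
    verdict does not depend on whether the markup is well-formed.
    """
    names = re.findall(r"<([A-Za-z_][\w.-]*)", markup)
    return all(name in TEXT_PREDICATES for name in names)


def classify(markup: str) -> tuple:
    """Return the pair (syntax verdict, semantics verdict)."""
    return syntax_ok(markup), semantics_ok(markup)
```

Calling classify on each of the four strings yields every combination of the two verdicts, which is the sense in which neither constraint can be reduced to the other.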

These considerations demonstrate that XML syntax and semantics must be analyzed as separate domains. The restrictions that hold on valid XML syntax have been well documented W3C2008, whereas the restrictions that must hold on the semantics of plausible schemas are less well described.

3. Formal Semantics of XML

3.1 Semantic Types

In this section, we propose that attributes and elements in plausible XML schemas must be of type <e, t>, where the notation <e, t> is understood as indicating a function from individuals (<e>) onto truth values (<t>).^[1] This is the semantic type generally postulated to characterize common nouns and adjectives in English. For instance, [[dog]] can be thought of as the set of all things that are dogs - i.e., a function f from individuals (any and all conceivable entities in this world) onto truth values (1 = true, 0 = false) such that f(x) = 1 iff [[x]] is a dog. One could object that it would be simpler to state this proposal in terms of nouns and adjectives - i.e., to propose that attributes and elements should be nouns and adjectives. However, it is preferable to state this in terms of semantics, because we need to keep our terms straight. "Nouns" and "adjectives" are terms taken from English syntax, which is not optimal when what we really want to talk about is XML semantics - i.e., neither English nor syntax. This proposal rules out absurd schemas (2) and (3) from Section 1, and captures the intuition that attributes and elements should be statements about things.

Beyond the intuitive appeal of this proposal, it can be derived in a bottom-up fashion, based only on the assumptions that (1) texts are made up of things, and (2) that markup says things about things. Assumption (1) shows that texts are made up of basic components of type <e>. Assumption (2) leads directly to a semantic type of <e, t> for elements and attributes; i.e., something is tagged <paragraph> only if it is true that it is a paragraph, modulo whatever definition of paragraph is appropriate in context. A formal definition of "tag abuse" can also fall out from assumption (2), i.e., tag abuse is the mapping of an individual onto a truth value of zero. In a situation where <ship> is being used to cause some arbitrary text (other than a ship name) to be rendered in italics piez2001, the user has misunderstood that the element <ship> is a function that assigns the value 1 to its contents, if and only if it is true that the denotation of the contents is a ship.

Translated into the terms above, the element <paragraph> is a function from individual bits of text onto truth values such that <paragraph>(x) = 1 iff [[x]] is a paragraph. Assumptions (1) and (2) should be basic for all archival purposes. Denying assumption (2) could lead to the emergence of bizarre surrealist schemas, but it seems safe to conclude that ruling out such schemas is precisely the goal for developing archival standards. It is not clear what denying assumption (1) would even mean ontologically.
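
As a rough illustration of the proposal, an element of type <e, t> can be modelled in Python as a function from a chunk of text to a truth value, and tag abuse as the case in which that function returns false for the tagged content. Everything named here is a hypothetical assumption made for the sketch: the list KNOWN_SHIPS, the crude test for paragraph-hood, and the function names are ours, not definitions taken from any schema.

```python
from typing import Callable, Dict

# A predicate of semantic type <e, t>: from a chunk of text (an
# individual) to a truth value.
Predicate = Callable[[str], bool]

# Hypothetical real-world knowledge, assumed only for this sketch.
KNOWN_SHIPS = {"Titanic", "HMS Beagle"}


def is_ship(text: str) -> bool:
    """<ship>(x) = 1 iff the denotation of x is a ship (toy lookup)."""
    return text in KNOWN_SHIPS


def is_paragraph(text: str) -> bool:
    """Crude structural stand-in: non-empty text with no blank line inside."""
    stripped = text.strip()
    return bool(stripped) and "\n\n" not in stripped


# Each element name is paired with the function it denotes.
DENOTATIONS: Dict[str, Predicate] = {
    "ship": is_ship,
    "paragraph": is_paragraph,
}


def is_tag_abuse(element: str, content: str) -> bool:
    """Tag abuse: the element maps its content onto the truth value 0."""
    return not DENOTATIONS[element](content)
```

Under these assumptions, tagging "Titanic" as <ship> is legitimate, while tagging an arbitrary phrase as <ship> merely to obtain italics counts as tag abuse, because the function denoted by the element returns false for that content.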

More complicated functions are of course conceivable, but they are the domain of the processing language rather than the XML itself. An example of this would be a function of the type <<e, t>, <e, t>> - i.e., a function that takes one element/attribute and returns another. For instance, one such function would take a nested element and return the element one level higher.

It should be noted that in the above proposal XML schemas are not assumed to be compositional semantically. To some extent, it is an open question whether or not a compositional minimal semantics for XML is a desirable feature. Compositional semantics would inevitably result in a proliferation of types, thereby obviating the proposed distinction between <e, t> elements that belong to XML and other elements that are the domain of the processing language. On the other hand, non-compositional semantics means that the concept of function admissible in XML must be wide enough to include input from outside the local domain of the element. For instance, the attribute lang = "en" must be valued by referring to something beyond the string of characters "en". Similarly, an element containing many sub-elements would have to be evaluable in terms of its sub-elements. To a certain extent, it remains to be seen whether non-compositional semantics makes undesirable predictions. Absent such evidence, the more parsimonious option is not to include compositionality as an explicit requirement.

3.2 Semantic Coherence

The requirement that attributes and elements in plausible XML schemas be of type <e, t> is necessary but not sufficient to the task of ruling in plausible schemas while ruling out implausible schemas. To illustrate the point, consider the XML in (12) and (13):

(12) <title /><creator /><subject /><description /><publisher />
(13) <title /><giraffe /><arsenic /><starvation /><King of France />

Example (12) is an excerpt from the well-known Dublin Core schema for marking up metadata, while schema (13) is nonsense that satisfies the requirement that attributes and elements be of semantic type <e, t>. How, then, to rule out (13) as compared to (12)? In this section, we attempt to develop the intuition that there exists a real-world object such that the traits [[title]], [[creator]], [[subject]], [[description]], and [[publisher]] can be predicated of it or its constituent parts with a truth value of 1 (i.e., there exists at least one object that has all of these traits), but there is no real-world object such that [[title]], [[giraffe]], [[arsenic]], [[starvation]], and [[King of France]] can be predicated of it with a truth value of 1. As a reminder, the notation [[title]] should be understood as meaning roughly "something that is a title".

In order to formalize this insight, it is necessary to take a closer look at how entities of type <e, t> operate. The denotation of such an entity ([[x]] where x is of type <e, t>) is either 1 or 0 (corresponding to true or false). Such an entity must give a truth value based on an entity of type <e> - i.e., a chunk of text. The only restriction on this process is that it be a function, which for these purposes only means that some individual x cannot be assigned to both true and false - i.e., it cannot be simultaneously true and false that a chunk of text is a paragraph. Within this very wide scope, it is possible to distinguish multiple types of functions. Structural-type functions assign truth values based on whether or not the individual entity under evaluation meets certain structural criteria; i.e., x is a paragraph if and only if x is a paragraph. Predicative-type functions assign truth values based on a non-definitional but inherent property of the entity under evaluation; i.e., x is in German if and only if x is in German (as distinct from being a sentence, a paragraph, a word, etc.) Attributive-type functions assign truth values based on a non-definitional and non-inherent property of the entity under evaluation - i.e., x is the title if and only if x is the title, a bit of information that requires specific real-world context to determine.

With this in mind, we can return to the main topic and provide a more precise characterization of semantic coherence. A schema S is said to be semantically coherent iff for its elements and attributes {a1, a2, a3, ..., an} ⊆ S there exists a set of entities (of type <e>) {x1, x2, x3, ..., xn}, each of which is a single real-world object or one of its constituent parts, such that [[ak(xk)]] = 1 for all ak ∊ S. The concrete interpretation of this will vary depending on whether the elements or attributes in question are structural, predicative, or attributive. This rules out (13), because there are no real-world objects such that each element could assign those objects a truth value of 1 simultaneously (i.e., there is no thing that literally consists of or contains a title, a giraffe, arsenic, starvation, and the King of France, all at the same time).
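
The coherence condition can be sketched as a search for a witness in a toy model of the world. The sketch below rests on loudly stated assumptions: WORLD is a hypothetical, hand-written table that lists, for each object, the predicates that hold of it or of one of its constituent parts; real-world knowledge of this kind is exactly what the definition presupposes and is not supplied by XML itself.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical toy world (an assumption for illustration): each object
# is mapped to the predicates true of it or of one of its parts.
WORLD: Dict[str, Set[str]] = {
    "catalogue record": {"title", "creator", "subject", "description", "publisher"},
    "zoo": {"giraffe"},
    "poison cabinet": {"arsenic"},
}


def witnesses(schema: Iterable[str]) -> List[str]:
    """Return the objects for which every element of the schema holds."""
    wanted = set(schema)
    return [obj for obj, traits in WORLD.items() if wanted <= traits]


def is_coherent(schema: Iterable[str]) -> bool:
    """A schema is coherent iff at least one witness object exists."""
    return bool(witnesses(schema))


DUBLIN_CORE_EXCERPT = ["title", "creator", "subject", "description", "publisher"]
NONSENSE = ["title", "giraffe", "arsenic", "starvation", "King of France"]
```

In this toy world the Dublin Core excerpt has a witness (the catalogue record), whereas the nonsense schema has none: each of its elements may hold of something, but no single object satisfies all of them at once.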

3.3 Semantic Hierarchies

At least one more issue must be discussed in order to fully characterize plausible schemas. Compare (14) and (15):

(14) <paragraph><sentence/></paragraph>
(15) <sentence><paragraph/></sentence>

(14) obviously corresponds to a common schema, while (15) is nonsense. Syntax cannot help here, nor does it suffice to claim that (15) is not plausible because it is not plausible. (15) is implausible because its syntax conflicts with its semantics. In order to get a precise handle on this, it is necessary to formalize the notion of semantic hierarchies.

The semantic representation of an XML tree may be considered to consist of the linearly arranged denotations of the elements and attributes present within an XML tree. In other words, <element> → [[element]] and <attribute> → [[attribute]]. As applied to (14) and (15), this yields the following table.

Table I

	Linear Representation	Hierarchical Representation
Syntax of (14)	`<paragraph><sentence/></paragraph>`	`<sentence> ∊ <paragraph>`
Semantics of (14)	[[paragraph]][[sentence]]	[[sentence]] ⊂ [[paragraph]]
Syntax of (15)	`<sentence><paragraph/></sentence>`	`<paragraph> ∊ <sentence>`
Semantics of (15)	[[sentence]][[paragraph]]	[[sentence]] ⊂ [[paragraph]]

Table I gives an indication of what the problem is with (15) - we can freely change the syntax of (14), but as much as we change the syntax, we cannot change what "paragraph" and "sentence" mean - in particular, we cannot change the fact that [[paragraph]] and [[sentence]] are semantically hierarchical. The only remaining step is to smooth over the notational discrepancy between hierarchical syntactic relationships and hierarchical semantic relationships.

Below is a formal characterization of semantic hierarchies, conceived more abstractly as ordering relationships: given a set of entities E = {e1, e2, e3, ..., en}, a hierarchy can be defined as an ordered tuple (ei, ej, ..., em) made up of elements of E. An XML schema S is then made up of both syntactic elements/attributes and their denotations: S = {<e1>, [[e1]], <e2>, [[e2]], <e3>, [[e3]], <e4>, [[e4]], ..., <en>, [[en]]}. We can then state that any ordering that holds for a syntactic element <ek> in S must also hold for its semantic correspondent [[ek]]. If the above holds, we may then state that syntactic hierarchies respect semantic hierarchies - i.e., while the syntactic and semantic hierarchies do not need to correspond, they cannot be contradictory. This rules out (15).
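
The requirement that the two hierarchies not contradict one another can be sketched as follows. The relation SEMANTIC_PART_OF is a hypothetical, hand-written assumption standing in for the semantic hierarchy, and the check is limited to direct parent-child pairs for brevity. A pair of elements that is semantically non-hierarchical, such as <damage> and <sentence>, never produces a contradiction, which matches the claim that the hierarchies need not correspond.

```python
import xml.etree.ElementTree as ET
from typing import Set, Tuple

# Hypothetical semantic hierarchy (an assumption for illustration):
# (part, whole) pairs, read as "[[part]] is a subset of [[whole]]".
SEMANTIC_PART_OF: Set[Tuple[str, str]] = {
    ("word", "sentence"),
    ("sentence", "paragraph"),
    ("word", "paragraph"),
}


def syntactic_pairs(markup: str) -> Set[Tuple[str, str]]:
    """Return the (child, parent) name pairs found in the XML tree."""
    root = ET.fromstring(markup)
    return {(child.tag, parent.tag) for parent in root.iter() for child in parent}


def contradictions(markup: str) -> Set[Tuple[str, str]]:
    """Return pairs whose syntactic order reverses the semantic order.

    A pair (child, parent) is contradictory when the syntactic parent
    is semantically a part of its own syntactic child.
    """
    return {
        (child, parent)
        for child, parent in syntactic_pairs(markup)
        if (parent, child) in SEMANTIC_PART_OF
    }


def respects_semantics(markup: str) -> bool:
    """True if no direct parent-child pair contradicts the semantic hierarchy."""
    return not contradictions(markup)
```

With this toy relation, a paragraph containing a sentence passes, a sentence containing a paragraph is rejected, and a sentence containing damage passes because no semantic ordering holds between the two.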

4. Some Very Basic Features of Archive Standards

In this section, we summarize the above points and add some other criteria that must be met by plausible archive standards.

Syntax is arbitrarily nested. If the most general level is p, let the more specific levels be denoted by p(1), p(2), ..., p(k) for k levels of specificity. It is not necessarily the case that one and only one element corresponds to each syntactic level. For instance, it is possible that elements like <sentence> and <metaphor> are on the same level.
Elements and attributes are of semantic type <e, t>.
Schemas must be semantically coherent.
Syntactic hierarchies must respect semantic hierarchies.
Elements and attributes are assigned at the highest possible level. This is an obvious insight that is not trivial to formalize, the insight being that elements and attributes should not be gratuitously repeated.^[2] Sometimes (in the case of structural elements), this is because to do otherwise would be semantically invalid (e.g., <paragraph><paragraph></paragraph></paragraph>). For transitive predicative attributes or elements, it would be redundant (e.g., not everything needs to be redundantly marked for language). Thus, for most elements and attributes, it is sufficient to state that an element or attribute that maps onto t = 1 (true) at level {p + k} maps onto t = 0 (untrue) at level {p + (k - 1)}. This will handle structural elements like <paragraph> and predicative attributes like <language>. The situation is more complex with regard to attributive elements and attributes like <metaphor> or <damage>. One can imagine situations in which these elements might occur on two structurally contiguous levels - metaphors within metaphors or damage within damage. Ontologically, the situation could be saved by positing that underlyingly, different metaphors or different types of damage are being denoted. The details of how to formalize this are not entirely clear, but a formalization would likely capitalize on the intuition that metaphors inside metaphors only work if the two metaphors are different.
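
The last property, that a predicative attribute should not be gratuitously repeated below the level at which it already holds, lends itself to a small mechanical check. The following sketch assumes a plain attribute named lang that is inherited by descendants until overridden; the function name and the inheritance rule are our assumptions for illustration, and the sketch does not address the harder attributive cases such as nested metaphors.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional


def redundant_marks(markup: str, attribute: str = "lang") -> List[str]:
    """Return tags of elements that repeat a value already in force.

    An element is reported when it carries the attribute with the same
    value as its nearest ancestor that carries the attribute.
    """
    root = ET.fromstring(markup)
    found: List[str] = []

    def walk(node: ET.Element, inherited: Optional[str]) -> None:
        own = node.get(attribute)
        if own is not None and own == inherited:
            found.append(node.tag)
        # The value in force for the children is the element's own value
        # if present, otherwise whatever was inherited from above.
        current = own if own is not None else inherited
        for child in node:
            walk(child, current)

    walk(root, None)
    return found


SAMPLE = (
    '<text lang="en">'
    '<paragraph lang="en"><sentence/></paragraph>'
    '<paragraph lang="de"><sentence lang="de"/></paragraph>'
    "</text>"
)
```

In SAMPLE, the first paragraph and the German sentence are reported as redundant, since each restates a language already established one level higher, while the German paragraph is not, because it changes the value in force.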

5. Conclusion

Starting our exploration of XML semantics from the perspective of all syntactically valid schemas has allowed us to formalize some semantic traits shared by most widely used schemas. These traits are easy to overlook but are of great significance when assessing how useful a schema might be for archival purposes, and when reverse-engineering the interpretation of schemas that have been used for archival purposes but for which adequate documentation is lacking. This may also have implications for ongoing work towards machine interpretation of XML semantics. If an XML document uses a schema that conforms to our proposed archive standards, stronger statements can be made about the relationship between the elements in that document. The fact that the syntactic hierarchy of elements is compatible with a real-world semantic hierarchy, in combination with the other generalizations that we have made about archive-appropriate XML semantics, facilitates the development of automatable processes of analysis, and enables developers to bring to bear existing tools used for classifying the real world.

References

[chatti2007] Chatti, Noureddine; Suha Kaouk, Sylvie Calabretto and Jean Marie Pinon. "MultiX: an XML based formalism to encode multi-structured documents" In Proceedings of Extreme Markup Languages 2007. http://conferences.idealliance.org/extreme/html/2007/Chatti01/EML2007Chatti01.html

[derose1990] DeRose, S. J., Durand, D. G., Mylonas, E., and Renear A. H. (1990), 'What is Text, Really?', Journal of Computing in Higher Education, 1.2: 3-26. doi:https://doi.org/10.1007/BF02941632

[onlinetym] Online Etymology Dictionary. "Symbol". Accessed 15 April 2010. http://www.etymonline.com/index.php?term=symbol

[piez2001] Piez, Wendell. "Beyond the “descriptive vs. procedural” distinction." In Proceedings of Extreme Markup Languages 2001. http://conferences.idealliance.org/extreme/html/2001/Piez01/EML2001Piez01.html

[piez2002] Piez, Wendell. "Human and Machine Sign Systems." In Proceedings of Extreme Markup Languages 2002. http://conferences.idealliance.org/extreme/html/2002/Piez01/EML2002Piez01.html

[renear1993] Renear, Allen; Elli Mylonas, and David Durand. "Refining our Notion of What Text Really Is: The Problem of Overlapping Hierarchies." http://www.stg.brown.edu/resources/stg/monographs/ohco.html

[renear2002] Renear, Allen; David Dubin, and C.M. Sperberg-McQueen. "Towards a semantics for XML markup". Proceedings of the 2002 ACM symposium on Document engineering. doi:https://doi.org/10.1145/585058.585081

[sperbergmcqueen2000] Sperberg-McQueen, C.M.; Claus Huitfeldt, Allen Renear. "Meaning and interpretation of markup." Markup Languages: Theory & Practice 2.3 (2000): 215-234. http://cmsmcq.com/2000/mim.html. doi:https://doi.org/10.1162/109966200750363599

[teip4] Text Encoding Initiative: The XML Version of the TEI Guidelines: 5 The TEI Header. Accessed 15 April 2010. http://www.tei-c.org/cms/Guidelines/P4/html/HD.html

[W3C2008] Extensible Markup Language (XML) 1.0 (Fifth Edition). W3C Recommendation 26 November 2008 http://www.w3.org/TR/2008/REC-xml-20081126/

^[1] The use of angled brackets to indicate semantic types is common practice in the field of modern semantics. To avoid inventing a new notation system just for this paper, we adopt the same practice despite the risk of confusion with markup notation. In this paper, <e, t> indicates semantic type, <e> indicates an individual, and <t> indicates a truth value.

^[2] This is similar to the insight that "distributed properties" (in our classification, predicative and some attributive attributes and elements) are non-countable, i.e., if an element type x marks a distributed property, then any two adjacent x elements may be joined, or one x element may be split: <x>abc</x><x>def</x> means exactly the same thing as <x>abcdef</x> (sperbergmcqueen2000).

Andrew Dombrowski

PhD student

Department of Slavic Languages and Literatures and Department of Linguistics, University of Chicago

`<adombrow@uchicago.edu>`

Andrew Dombrowski is a 4th year PhD student at the University of Chicago in the Department of Slavic Languages and Literatures and the Department of Linguistics. His research focuses on language change and contact between Slavic and non-Slavic languages.

Quinn Dombrowski

Manager, Scholarly Technology

University of Chicago

`<quinnd@uchicago.edu>`

Quinn Dombrowski is the manager of the Scholarly Technology group in the University of Chicago's IT Services organization. She has an MA in Slavic Linguistics from the University of Chicago, and an MLS from the University of Illinois at Urbana-Champaign.

BalisageSymposium

Balisage Paper: A formal approach to XML semantics: implications for archive standards

Balisage Series on Markup Technologies