Markup Meaning and Mereology

Balisage: The Markup Conference 2009, August 11 - 14, 2009

When marking up a document we chop it up into elements. Elements are parts of the document, some of which contain further elements, i.e., have parts of their own. Thus, the part-whole relation is central to the way markup works. Mereology is precisely the theory of part-whole relationships, but has not yet found much application in markup theory. In this paper we provide a sketch of how mereology, in the form more specifically of Nelson Goodman's Calculus of Individuals, might be applied to markup. We discuss ways of identifying the individuals of marked-up documents and of referencing these individuals, and we sketch some ways of applying the calculus to the problem of propagation of properties in documents.

Claus Huitfeldt
Claus Huitfeldt is Associate Professor at the Department of Philosophy of the University of Bergen. His research interests are within philosophy of language, philosophy of technology, text theory, editorial philology and markup theory. He was founding Director (1990-2000) of the Wittgenstein Archives at the University of Bergen, for which he developed the text encoding system MECS as well as the editorial methods for the publication of Wittgenstein's Nachlass - The Bergen Electronic Edition (Oxford University Press, 2000). He has been active in the Text Encoding Initiative (TEI) since 1991, and was centrally involved in the foundation of the TEI Consortium. Huitfeldt was Research Director (2000-2002) of Aksis (Section for Culture, Language and Information Technology at the Bergen University Research Foundation).
Associate professor, University of Bergen, Norway
claus.huitfeldt@fof.uib.no

C. M. Sperberg-McQueen
C. M. Sperberg-McQueen is an independent consultant for Black Mesa Technologies LLC. He currently serves as an editor of the W3C XML Schema Definition Language (XSD) 1.1.
Black Mesa Technologies LLC
cmsmcq@blackmesatech.com

Yves Marcoux
Yves Marcoux has been a faculty member at EBSI, University of Montréal, since 1991. He is mainly involved in teaching and research activities in the field of document informatics. Prior to his appointment at EBSI, he worked for 10 years in systems maintenance and development, in Canada, the U.S., and Europe. He obtained his Ph.D. in theoretical computer science from University of Montréal in 1991. His main research interests are document semantics, structured document implementation methodologies, and information retrieval in structured documents. Through GRDS, his research group at EBSI, he has been principal architect for the Governmental Framework for Integrated Document Management, a project funded by the National Archives of Québec and by the Québec Treasury Board.
Associate professor, Université de Montréal, Canada
yves.marcoux@umontreal.ca

Copyright © 2009 by the authors. Used with permission.

Introduction

XML documents consist of marked elements, which may in turn contain sequences of marked elements, etc. This hierarchy of elements is conveniently represented as a tree in which each node stands for an element, in which each arc between elements stands for a parent-child relationship, and in which the children of each node are ordered sequentially in accordance with their document order.

While it is commonly the case that the generic identifier of an element is understood to ascribe a property to the element's content, that elements represented by nodes dominated by that element's node in the document tree are also understood to be contained by it, and that these nodes are understood to inherit the properties ascribed to their ancestor elements, none of this is always or necessarily the case. As we have pointed out elsewhere [], the parent-child relationship may be taken to indicate either a containment relationship, or a dominance relationship. Frequently these relationships coincide, and no harm is caused by not distinguishing them. When they do not coincide, however, the result may easily be confusing.

One view of the structure of XML documents emphasizing the part-whole relationship is this: A document contains elements, i.e., parts. Some of these parts contain further elements, i.e., have parts of their own. The generic identifiers of elements ascribe properties to their own content and/or to the content of elements related to them by part-whole relationships.

Mereology is precisely the theory of part-whole relationships. Even so, mereology does not seem to have found much application in markup theory until now. It may therefore be interesting to investigate whether the application of mereology may give insights relevant to the understanding of interpretation and processing of marked-up documents.
It is sometimes said that XML provides a formal syntax for document representation, but no formal semantics for the interpretation or processing of this syntax. If mereology can be brought to bear on the ascription and propagation of properties and relations between parts of marked-up documents, it may help in providing a general approach to markup semantics. For example, the work presented here may turn out to be of direct relevance for the work on formal tag set descriptions and intertextual semantics specifications presented in [] and [].

Before we proceed, some words on the limitations of this paper are in order.

First, although our focus is on XML, and although we mention other markup languages only in passing, we believe that mereology deserves to be studied in relation to markup languages in general (such as XML, SGML, TexMECS, LMNL, and others) rather than XML only. We think so partly because application of mereology may be equally or more profitable when it comes to some non-XML markup systems, and partly because such broader studies might inspire modifications of, or alternatives to, any or all of these. We hope to come back to applications of mereology to markup more generally in future work.

Second, the concept XML document as used in this paper refers almost exclusively to XML in its serialized form. We do not explicitly attempt to apply mereology to XML documents considered as graphs of XPath nodes, Infoset items, or the like.

Finally, we limit ourselves to an attempt to apply the so-called Calculus of Individuals, a mereological system worked out by Nelson Goodman [] (initially in cooperation with Henry S. Leonard []). As a further simplification, and in order to ensure focus, we will ignore XML attributes, entities, declarations, comments, processing instructions, and marked sections; in short, we will regard XML documents as consisting of elements and their content only.

The Calculus of Individuals

The origins of mereology go back to ancient Greece, but it was taken up as a formal study and developed mathematically only early in the 20th century. Today, it is a well developed formal discipline, and there are a number of different mereological systems. The term mereology is sometimes used to refer to these formal calculi in particular, sometimes to formal as well as non-formalized theories of part-whole relationships in general [, pp. 13–15].

Early developments of formal mereology were largely motivated by scepticism towards set theory and the calculus of classes, and a desire to translate or reduce all talk of abstract classes and their members to talk of concrete individuals and their parts. Mereology therefore came to be associated with a particular ontological stance, nominalism, and to be shunned by most adherents of other ontological views.

Goodman, whose work we will take as our basis here, was a well known nominalist, though of a peculiar kind. For Goodman, nominalism did not consist in the rejection of abstract entities, or even of universals, but in the refusal to admit anything but individuals as values of variables. He strongly repudiated all talk of classes as incomprehensible [, pp. 25-26, , p. 156] and therefore philosophically suspect. He also worked hard to establish a foundation for mathematics replacing set theory with the calculus of individuals. But at the same time he had no qualms taking abstract objects such as qualia as basic constituents of his own ontology [, chapters IV ff].

Such ontological considerations may or may not motivate, but do not in any way need to concern, our attempt to apply mereology to markup languages, however: later work in the field is generally taken to demonstrate that mereology and set theory may live merrily together, that in fact the one may be seen as an extension of the other, and that the adoption of mereology does not by itself commit one to any particular ontological stance.
...there is no necessary internal link between mereology and the philosophical position of nominalism. We may simply think of the former as a theory concerned with the analysis of parthood relations among whatever entities are allowed into the domain of discourse (including sets and other abstract entities, if one will). []

The part-whole relationships that mereology studies are relationships between entities that are, in Goodman's terminology, called individuals. Generally speaking an individual may be any thing in a very wide sense of the word (a concrete, an abstract, a universal or a particular), i.e., any object or entity of which something can be predicated.

This is admittedly still pretty general, and more specific talk may be in order: As examples of individuals we may take stones, tables, chairs, animals and other medium-sized everyday objects; but if we like we may also populate our world with individuals such as molecules, atoms, electrons, quarks; or planets, stars and galaxies; or for that matter persons, visual after-images, mental images or sense data. If we believe in abstract objects we may include numbers, geometrical objects, concepts, etc., and according to some applications of mereology there may also be temporal individuals such as processes, events, and snippets of time.

Individuals need not be contiguous, neither in space nor in time. This is one of the principles of the Calculus of Individuals which has provoked some discussion. In its defence one may point to the fact that we actually do employ the notion of at least some such disconnected wholes in everyday language. Thus, to treat the land mass of Japan (or any geographic entity which includes two or more islands) as an individual may seem unobjectionable. However, according to another principle, the sum of any two individuals is always also an individual.
This seems to force us to accept as individuals, i.e., wholes, sums of randomly scattered parts such as Caesar's nose and the state of Utah [, p. 37]. For an entertaining collection of other candidate sum individuals, see []. Goodman bites that bullet, while much of the ensuing debate has been concerned with attempts to find ways of distinguishing such scattered and arbitrary sums from more cohesive or integral individuals as wholes consisting of parts in a more intuitively satisfactory sense.

A formal mereological theory takes conventional first-order predicate logic as its basis. We will use conventional modern logical notation for quantifiers, operators, predicates, variables and constants. More specifically, we will use (x) for universal and (∃x) for existential quantification over x; ¬ for negation, → for implication, ∨ for inclusive disjunction, ∧ for conjunction, ⇔ for equivalence, and = for identity. We use the small roman letters a, b, c... for constants, x, y, z... for variables, and upper roman letters A, B, C... for predicates. We will occasionally use the conventional abbreviation iff for if and only if.

The extension which mereology makes to this basis is very modest: it consists in adding a single primitive relation to the first-order system. This specifically mereological, primitive relation may be chosen from among the relations part of, proper part of, discrete from or overlapping with. As each of these relations may be defined in terms of any of the others, it does not matter much which one we choose as our undefined primitive. Equivalent systems (or rather, systems with only minimal and trivial differences) may be built whichever we choose as the primitive relation.

With a hopefully obvious appeal to markup theorists, we will follow [] in choosing overlap for our primitive relation. In [], Leonard and Goodman chose discrete from as the primitive relation.
A more common practice seems to be the choice of part or proper part.

Variables are taken to range over individuals only, and predicates are taken to ascribe properties to, or relations between, individuals.

From a mereological point of view, two individuals overlap iff they have some content in common. One consequence of this definition may briefly confuse markup specialists: since in an XML document a child element and its parent element have some content in common (everything contained by the child is also contained by the parent), it follows that in the sense introduced here the child and the parent overlap. That is, the term overlap, as used in the calculus of individuals, includes proper nesting or normal part/whole relations.

Thus, suppose we think of XML elements as individuals consisting of stretches of consecutive character occurrences, and consider the following four cases (strictly speaking, the first line is not well formed XML and is included only for purposes of illustration):

<s> <q> </s> </q>
<s> <q> </q> </s>
<q> <s> </s> </q>
<s> </s> <q> </q>

The first three cases exhibit an overlap between elements s and q. Only in the last case do the two elements not overlap, i.e., they are discrete. In contrast, markup theorists would probably consider only the first case to be one of overlap.

The overlap operator is written ∘. The following condition on ∘ captures the intuitive notion of “having some content in common,” and we thus take it as an axiom. (Numbers in the left margin give references to theorem and definition numbers in []. Note that Goodman used a notation slightly different from ours, but that we have retained Goodman's use of implicit universal quantification.)

2.41 x ∘ y ⇔ (∃z)(w)((w ∘ z) → ((w ∘ x) ∧ (w ∘ y)))

Any relation satisfying this condition is necessarily reflexive and symmetric (but not necessarily transitive).

We now state further relation and operator definitions, theorems and axioms.
Note that not all of them belong to all variants of mereological systems; they do, however, belong to ours.

As already mentioned, the relations part of, proper part, and discrete may all be defined in terms of the overlap relation.

Iff x is a part of y, then everything that overlaps x also overlaps y:

D2.042 x < y =df (z)((z ∘ x) → (z ∘ y))

The part relation is reflexive, anti-symmetric and transitive.

Iff x is a proper part of y, then x is a part of y but y is not a part of x:

D2.043 x ≪ y =df (x < y) ∧ ¬(y < x)

The proper part relation is irreflexive, anti-symmetric and transitive.

Iff x and y are discrete, then they have no part in common, i.e., they do not overlap:

D2.041 x ʅ y =df ¬(x ∘ y)

(Leonard and Goodman use for the discrete from relation a symbol we have not been able to locate in Unicode; we use here a fairly close approximation, the symbol “ʅ”, which usually means caution.) The discrete relation is irreflexive and symmetric (and thus, non-transitive).

It is worth noting that identity can be defined in terms of the primitive relation:

D2.044 x = y =df (z)((z ∘ x) ⇔ (z ∘ y))

The product of x and y is the individual which exactly contains their common part:

D2.045 x · y =df (℩z)(w)((w < z) ⇔ ((w < x) ∧ (w < y)))

The sum of x and y is the individual which contains exactly and exhaustively both of them, or, in other words, the individual which overlaps all and only those individuals which overlap any of them:

D2.047 x + y =df (℩z)(w)((w ∘ z) ⇔ ((w ∘ x) ∨ (w ∘ y)))

The negate of an individual includes everything which does not overlap with that individual (i.e., what is often called its complement, or the rest of the world):

D2.046 –x =df (℩z)(y)((y ʅ x) ⇔ (y < z))

The difference between x and y is what remains of x after we eliminate the parts it has in common with y:

x – y =df (x · –y)

There is considerable controversy in the literature over the nil individual. The nil individual is the mereological analogue of the empty class.
If accepted, it is part of any individual. Most mereological systems reject its existence, and we will do the same in this paper. This may be seen simply as a reflection of the fact that most mereologists have been nominalists (in Goodman's sense). But the topic also has other far-reaching repercussions; see [].

There is less controversy over the existence of the universal individual, i.e., the one individual of which every other is a part: the world or the universe as an individual. In our case, we are not applying the Calculus of Individuals as a Grand Theory of Everything, but limit its application to domains consisting of a single document, to collections (not to say sets or classes) of documents, or perhaps to documents and whatever else we may need to take into consideration to make sense of what these documents say. So we, too, will endorse the existence of a universal individual, customarily denoted by the letter W:

W =df (℩x)(y)(y < x)

Note that, because there is no nil individual:

the product of x and y can possibly exist only if x and y overlap,
the difference between x and y can possibly exist only if x is not a part of y, and
W (the universe) does not have a negate.

However, the following statements hold, either as axioms or theorems, depending on how one elaborates the system:

(x)(y)(∃z)(z = x + y), i.e., the sum of any two individuals exists (that is, is an individual),
(x)(y)((x ∘ y) ⇔ (∃z)(z = x · y)), i.e., the product of any two individuals exists iff they overlap,
(x)(¬(x = W) ⇔ (∃z)(z = –x)), i.e., the negate of an individual exists iff the individual is not the universe, and
(x)(y)(¬(x < y) ⇔ (∃z)(z = x – y)), i.e., the difference between any individual x and any individual y exists iff x is not a part of y.

Do all individuals have parts, or are there some individuals which are not further divisible into parts? Whether we take the one or the other position may have wide-reaching consequences for other properties of a mereological system, and the literature abounds with discussion on the subject. Given our domain of application, however, we believe that any system will have to be atomistic: on none of our analyses will documents have parts below character-level, or at least we foresee no need to talk about parts of characters. So we may simply add the axiom of atomicity to our system right away:

(x)(∃y)((y < x) ∧ ¬(∃z)(z ≪ y)) [, p. 61]
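The definitions and existence conditions of this section can be checked mechanically on a small model. The following is a minimal sketch, not part of the calculus itself: it assumes a finite atomistic universe in which each individual is represented as a non-empty frozenset of atoms. The atom names and all function names are hypothetical, and operations whose result would be the nil individual return None.

```python
from itertools import combinations

# Assumption: a finite, atomistic universe with three hypothetical atoms.
ATOMS = frozenset({"a", "b", "c"})
W = ATOMS  # the universal individual


def individuals(atoms=ATOMS):
    """All non-empty sets of atoms (there is no nil individual)."""
    items = sorted(atoms)
    return [frozenset(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)]


def overlaps(x, y):
    """Primitive relation: x and y have some content in common."""
    return bool(x & y)


def part(x, y):
    """D2.042: everything that overlaps x also overlaps y."""
    return all(overlaps(z, y) for z in individuals() if overlaps(z, x))


def proper_part(x, y):
    """D2.043: x is a part of y, but y is not a part of x."""
    return part(x, y) and not part(y, x)


def discrete(x, y):
    """D2.041: x and y do not overlap."""
    return not overlaps(x, y)


def identical(x, y):
    """D2.044: x and y overlap exactly the same individuals."""
    return all(overlaps(z, x) == overlaps(z, y) for z in individuals())


def _or_none(atoms):
    """Return an individual, or None where only the nil individual would fit."""
    return frozenset(atoms) if atoms else None


def mereological_sum(x, y):
    return x | y            # always exists


def product(x, y):
    return _or_none(x & y)  # exists iff x and y overlap


def negate(x):
    return _or_none(W - x)  # exists iff x is not W


def difference(x, y):
    return _or_none(x - y)  # exists iff x is not a part of y
```

In this model parthood defined through overlap coincides with the subset relation, and the four existence statements above can be verified by looping over all pairs of individuals.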

The Calculus applied to XML

What might it mean to apply the Calculus of Individuals to XML documents (or, for short, to XML) and what purpose might such an application of the calculus serve?

A preliminary answer to the first question is that an application of the Calculus of Individuals to XML would require us to decide which entities to count as individuals, to decide which of these are to count as atomic individuals, as well as which properties they can have and which relations hold between them.

Given the Calculus of Individuals' rules of composition, different decisions on these issues will bring us to recognize the existence of individuals which may or may not coincide with established ways of viewing the structure of XML documents. Identifying rules which replicate such conventional views is, if possible, in itself of interest. Identifying rules which provide alternative views of XML documents may be of even greater interest, at least if they also suggest alternate and useful ways of analysing the parts of a document, of addressing them, and of ascribing properties to, and relations between, parts of a document.

A preliminary answer to the second question has thus already been suggested: We suspect that an application of the Calculus of Individuals to XML might suggest ways of identifying and addressing parts of a document which in some cases, or for some purposes, would be more convenient or more powerful than existing methods such as SAX, DOM or XPath. We also suspect that some application of the Calculus of Individuals to XML might suggest ways of dealing with what is sometimes called the semantics of XML, i.e., how to understand XML documents in terms of the properties that the markup ascribes to, and the relations it indicates between, their various parts.

In what follows we have nothing but tentative answers to the general questions just posed.
Trying to answer the first question, we will present different ways of applying the Calculus of Individuals to XML. We will also explore some of their implications for answers to the second question. The explorative nature of our work should be emphasized: We do not want to suggest that these are the only, or the best, ways of applying the Calculus of Individuals to XML, nor do we suggest that we have identified all or even the most important implications of the approaches that we consider. Therefore, each of the following sections begins by suggesting a different answer to the question Which are the individuals of a marked-up document? First, we consider the possibility that the individuals simply are XML elements. Next, we go down one step in level of granularity and identify tags and character strings as individuals. Finally, we proceed to a still finer level of granularity in order to see what happens if we recognize individual characters as atomic individuals, and distinguish between different kinds of individuals built from these atoms.

The element-as-individual approach

What to count as individuals is a matter of choice, a choice which must be made on the basis of such criteria as naturalness, convenience, expressiveness, simplicity, etc. We begin by simply assuming a one-to-one matching between the elements of an XML document and the individuals of our calculus. On this assumption, consider the following simple XML document:

(1) <para>A <quote>rose</quote> is <emph>a</emph> rose.</para>

If each element is an individual, then (1) itself, as well as the elements

(2) <quote>rose</quote>
(3) <emph>a</emph>

are individuals. Now, the sum of any two individuals must (by our mereological axioms) be an individual. Thus, the sum of (2) and (3) must be an individual and, by our hypothesis, an XML element. No matter what model we have in mind for XML elements and documents, it is hard to imagine a way in which the sum of (2) and (3) could be an XML element: it would be at best two!

In fact, the goal we have set ourselves here turns out to be self-defeating: It is not possible to identify XML elements with individuals, without accepting as individuals parts of the document which are not XML elements. In other words, if all XML elements are individuals, then some XML documents necessarily give rise to individuals which are not XML elements.

In practice, we may read “nearly all” for “some” here. Examples of exceptions would be documents consisting of only one element, or in which each element has at most one child element. Examples:

<s>...</s>
<s><t>...</t></s>
<s><t>...</t></s>

and so on. Only in such cases may there in fact be a one-to-one correlation between elements and individuals.

An obvious fix would be to retain the decision that every element is an individual, but allow for composite individuals having more than one element as their parts. This would solve the problem of sums, but others would remain (e.g., what elements can the difference (1) – (2) be the sum of?).
Even taking the closure of elements under sum and difference would still not solve a granularity issue in handling text content: Take, for example, the strings “A ”, “ is ”, and “ rose.”; any given individual would contain either all three or none. There would be no way to separate those strings.

Another issue is that the definition of parthood implies nothing about the ordering of parts, with the result that individuals are unordered. Thus, there is no way in our approach to say, for example, that (2) occurs before (3). The Calculus of Individuals offers in itself no way of defining ordered pairs (and thus, relations) as individuals [, p. 164; but see also p. 268]. However, relations can be represented by predicates on individuals. Thus, we can order (either totally or partially) our individuals by defining an appropriate binary predicate corresponding to the desired relation.

If we think of individuals as corresponding to objects in an XML data model, and if that model allows serializations in which no two distinct elements or characters start at the same offset in a serialization (we will need to deal with characters in later sections), then we can induce a total ordering of the individuals that correspond to elements and characters, based on the total order among the offsets of their XML counterparts in the serialization. (The condition holds if we think of XML documents and elements as consisting of stretches of consecutive character occurrences, remembering that we exclude entity declarations and references from our discussion, and it also holds for the XPath data model. It does not necessarily hold for the Infoset data model.) We call that order relation document order. Throughout this paper, we assume that document order exists and is well defined.

So far we have assumed that XML elements containing no sub-elements have no parts, i.e., that they are atoms in our system. A solution to the granularity issue may perhaps be to recognize a more generous set of individuals.
But before we proceed to investigate this, we pause to make a couple of observations on other characteristics of the element-as-individual approach.

The lack of a fine enough granularity prevents a satisfactory treatment of strings, let alone parts of strings. However, we could regard a string as a property of an individual. Thus, although we cannot strictly speaking say that in (1) the string “rose” is a part of the string “A rose is a rose.”, we could say that an individual having the string “rose” as a property is part of an individual having the string “A rose is a rose.” as a property. Note that the strings “rose is” or “ose i” would not be properties of any individual, and thus not a part of the document even in this extended sense.

Building a tree structure in which each node is an individual (i.e., an element), in which each arc represents a whole-part relationship, and in which the children of each node are ordered in document order, produces a tree which is almost identical to the XML tree for the same document, except for PCDATA leaf nodes of mixed content elements, which would be lost. This might be considered, by some, an interesting observation, since some markup theorists have argued against the use of mixed content, either generally or for specific applications or uses of markup. (However, empty element leaf nodes would appear in the tree.)
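The two problems of the element-as-individual approach, sums that are not elements and text that cannot be separated, can be made concrete with a small sketch. It assumes, purely for illustration, that each element of document (1) is modelled as the set of text-character offsets it contains, that generic identifiers are unique, and that the document has no attributes, comments or empty elements. All names in the code are hypothetical.

```python
import re

# Example document (1) from the text.
DOC = "<para>A <quote>rose</quote> is <emph>a</emph> rose.</para>"
TEXT = re.sub(r"<[^>]+>", "", DOC)  # character data with all tags removed


def element_extents(doc):
    """Map each generic identifier to the set of text offsets its element contains.

    Assumption: generic identifiers are unique and there are no empty
    elements, attributes or comments.
    """
    extents, stack, pos = {}, [], 0
    for tok in re.findall(r"<[^>]+>|[^<]+", doc):
        if tok.startswith("</"):          # end tag closes the innermost element
            name, start = stack.pop()
            extents[name] = frozenset(range(start, pos))
        elif tok.startswith("<"):         # start tag opens a new element
            stack.append((tok[1:-1], pos))
        else:                             # character data advances the offset
            pos += len(tok)
    return extents


def closure(seed):
    """Close a collection of individuals under sum and (non-nil) difference."""
    known = set(seed)
    while True:
        new = {z for x in known for y in known
               for z in (x | y, x - y) if z} - known
        if not new:
            return known
        known |= new


def text_of(individual):
    """Concatenate the characters of an individual in document order."""
    return "".join(TEXT[i] for i in sorted(individual))


ELEMENTS = element_extents(DOC)
SUM_2_3 = ELEMENTS["quote"] | ELEMENTS["emph"]   # the sum of (2) and (3)
INDIVIDUALS = closure(ELEMENTS.values())
REST = ELEMENTS["para"] - SUM_2_3                # text outside (2) and (3)

print(SUM_2_3 in ELEMENTS.values())  # is the sum itself an element?
print(len(INDIVIDUALS), repr(text_of(REST)))
```

Under these assumptions the sum of (2) and (3) is not among the elements, and every individual in the closure either contains all of REST or none of it, which is the granularity issue described above.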

The tags and PCDATA approach

Moving one step down in level of granularity, we might take tags and PCDATA strings delimited by tags as atomic individuals. Thus (1) would contain the following 11 atomic individuals:

<para>, “A ”, <quote>, “rose”, </quote>, “ is ”, <emph>, “a”, </emph>, “ rose.”, </para>

From these, we might compose composite individuals such as, for example:

<para> <para>A <para>A <quote> <para>A <quote>rose A rose A rose. rose a <para>A <quote> A <quote>rose </quote> is <emph> rose </quote> rose.</para>

As a matter of fact, (1) would give rise to no less than 2^11 - 1 = 2047 individuals on this account (-1 because there is no nil individual); in the interest of the reader we do not list all of them here. Only a handful of these individuals would be well-balanced XML fragments, of course.

A total order relation on the atomic individuals based on document order could be defined, as in the preceding section. Note that in this case, the sequence of ordered atomic individuals is isomorphic to the sequence of events identified by a SAX-like XML tokenizer.

Observe that although many of the individuals could be identified or referenced using XPath or similar XML-aware mechanisms, many of them could not. In particular, tag atoms could not (or, at least, it is unclear how and in what sense they could). However, the interest of being able to refer to tags individually is not obvious. Also, since strings are atoms, it is still impossible to handle parts of strings: “ose i” is still not an individual. Therefore, we do not pursue this avenue any further.
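The count of atoms and individuals for document (1) can be reproduced with a short sketch. It assumes a deliberately naive tokenizer (a regular expression that only handles tags without attributes and plain character data), and it identifies each individual with a non-empty combination of atom positions. The function and variable names are hypothetical.

```python
import re
from itertools import combinations

# Example document (1) from the text.
DOC = "<para>A <quote>rose</quote> is <emph>a</emph> rose.</para>"


def tokenize(doc):
    """Split a serialized document into tags and PCDATA strings, in document order.

    Assumption: tags carry no attributes and the document contains no
    comments, entity references or processing instructions.
    """
    return re.findall(r"<[^>]+>|[^<]+", doc)


def individuals(atoms):
    """Yield every individual as a tuple of atom positions (never empty)."""
    positions = range(len(atoms))
    for size in range(1, len(atoms) + 1):
        yield from combinations(positions, size)


ATOMS = tokenize(DOC)
COUNT = sum(1 for _ in individuals(ATOMS))

print(ATOMS)
print(len(ATOMS), COUNT)
```

Positions rather than token strings are used to identify atoms, since the atoms are occurrences: a document containing the same tag or string twice would still have two distinct atoms.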

The character-atom approach

The approach

Finally, and moving one further step down in the level of granularity, we take character occurrences as the atomic individuals in our application of the calculus. For the sake of conciseness, we will use character as a synonym for character occurrence, except where confusion might arise.

The type of a character occurrence is represented in our system by a property of that character occurrence. So any atom (i.e., character occurrence) has the property of being an a, or a b, or a c, etc., thus populating our vocabulary with one predicate for each of the characters of the writing system at hand.

We might allow a character occurrence to have more than one such property. For example, it could have the property of being an a, as well as that of being of some other type. Exploiting this option might be interesting in trying to account for multiple readings or interpretations in transcription, such as in []. For the time being, however, we will assume that the ascription of one such character-type-property to a particular character excludes the ascription of any other character-type-property to that character.

We define a total order relation on atoms, based on document order, represented by the predicate PA(x, y), true iff x precedes y in document order (“P” stands for “precedes” and “A” indicates it is a predicate on atoms). The transitive reduction of PA is represented by the predicate NA(x, y), true iff x immediately precedes y in document order (“N” stands for “next” and “A” indicates it is a predicate on atoms).

Since characters are atomic individuals, all individuals which can be composed on the basis of the characters of a document are also individuals, i.e., composite individuals. Composite individuals of special interest for our purposes are strings. We define strings as individuals which are either atoms, or the sum of atoms consecutive in NA order. A string that consists of only one character is (also) an atom. There is no such thing as an empty string (which would have to be the nil individual). Note that strings constitute a tiny fraction of all existing individuals.

Some strings are of particular interest to us. We define a molecular string (or molecule) as a string that is delimited on both sides (in the serialization underlying document order) by a tag, with no other tag intervening in between. A total ordering of molecular strings, represented by the predicate P(x, y), is trivially derived from the ordering of atoms (itself based on document order). The transitive reduction of P is represented by the predicate N(x, y). (“P” stands for “precedes” and “N” for “next”.)

We define an elemental string as a string delimited by the matching tags of an XML element (there may be intervening tags). We do not rely on any ordering of elemental strings.

For any given string x, we define (for convenience only) the label of x as the sequence of the types of the atoms composing x, in NA order. That is, for example, a string is labelled rose (or has the label rose) iff it is the sum of atoms of types r, o, s, and e, and those atoms are NA-ordered so that the one of type r comes first, the one of type o comes second, etc.
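The definitions of atoms, molecular strings, elemental strings and labels can be sketched as follows for document (1). This is an illustration under stated assumptions, not a definitive implementation: atoms are identified by their zero-based position in document order, a string is a half-open range of such positions, and the tokenizer only handles tags without attributes. All names are hypothetical.

```python
import re

# Example document (1) from the element-as-individual section.
DOC = "<para>A <quote>rose</quote> is <emph>a</emph> rose.</para>"


def analyse(doc):
    """Return (atoms, molecules, elementals) for a serialized document.

    atoms      : list of character types, indexed by position in NA order
    molecules  : list of (start, end) ranges delimited by adjacent tags
    elementals : list of (generic identifier, start, end) ranges
    """
    atoms, molecules, elementals, stack = [], [], [], []
    for tok in re.findall(r"<[^>]+>|[^<]+", doc):
        if tok.startswith("</"):
            name, start = stack.pop()
            if start < len(atoms):        # there is no empty string
                elementals.append((name, start, len(atoms)))
        elif tok.startswith("<"):
            stack.append((tok[1:-1], len(atoms)))
        else:
            molecules.append((len(atoms), len(atoms) + len(tok)))
            atoms.extend(tok)             # one atom per character occurrence
    return atoms, molecules, elementals


def PA(x, y):
    """Atom x precedes atom y in document order."""
    return x < y


def NA(x, y):
    """Atom x immediately precedes atom y (transitive reduction of PA)."""
    return y == x + 1


def label(atoms, start, end):
    """The sequence of character types of a string, in NA order."""
    return "".join(atoms[start:end])


ATOMS, MOLECULES, ELEMENTALS = analyse(DOC)

print(len(ATOMS))
print([label(ATOMS, s, e) for s, e in MOLECULES])
print([(name, label(ATOMS, s, e)) for name, s, e in ELEMENTALS])
```

Representing strings as ranges rather than as sets of atoms is a convenience of the sketch; it relies on strings being sums of atoms that are consecutive in NA order.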
While it might have been plausible to treat tags as a special kind of string, and to build elements and nodes with their ordering and parent-child relationship in a way similar to that suggested in the tags and PCDATA approach above, we shall instead regard tags simply as delimiting certain string individuals, and as ascribing properties to (or relations between) those individuals.

We can now read (1) as follows:

There are 17 atomic individuals. Their ordered sequence of types is: “A”, “ ”, “r”, “o”, “s”, “e”, “ ”, “i”, “s”, “ ”, “a”, “ ”, “r”, “o”, “s”, “e”, and “.”.
There are five molecular string individuals. Their ordered sequence of labels is: “A ”, “rose”, “ is ”, “a”, and “ rose.”.
There are three elemental string individuals, labelled “A rose is a rose.”, “rose” and “a”.
The elemental string labelled “A rose is a rose.” has the property indicated by the generic identifier <para>. Note that this does not imply that any of its parts, such as the molecular strings labelled “A ”, “rose”, etc., has this property.
The elemental string labelled “rose” has the property indicated by the generic identifier <quote>.
The elemental string labelled “a” has the property indicated by the generic identifier <emph>. Here we have an example of an atom which is also a molecule and an elemental string.

We introduce the following predicates:

Predicate	Meaning	Range of x and y

NA(x,y)	next after x is y (or, x immediately precedes y)	atoms
PA(x,y)	x precedes y	atoms
N(x,y)	next after x is y (or, x immediately precedes y)	molecules
P(x,y)	x precedes y	molecules
A(x)	x is atomic	any
M(x)	x is molecular	any
E(x)	x is elemental	any
ccc(x)	x has the property assigned by ccc (where ccc is an XML generic identifier)	any
T("c",x)	x is of type c (where c is a character type)	atoms
L("ccc",x)	x is labelled ccc (where ccc is a sequence of character types)	any

The last two predicates (T and L) are to be regarded as notational convenience features. In a real system, character type indications enclosed within quotes and occurring within two-place predicates, like T("A",i01) here, should be replaced with one-place predicates using for example Unicode names for character values, like T.x0041(i01). Character types are properties, not individuals, and so should not really appear as variables in the calculus. One unattractive consequence of the shorthand notation used here is that assignment of whitespace characters comes out as T(" ",i02), which is both imprecise and perhaps somewhat confusing. As mentioned, saying that an individual is labelled with a string is merely a shorthand for saying that it consists of a sequence of atoms each with certain character types as their values. So expressions like L(" is ",i20) in the example below are really shorthands for more complex expressions referring to the atomic parts of the individual i20 and their next and type properties. Assuming that i20=i07+i08+i09+i10, what L(" is ",i20) says should be construed as something like NA(i07,i08) ∧ NA(i08,i09) ∧ NA(i09,i10) ∧ T.x0020(i07) ∧ T.x0069(i08) ∧ T.x0073(i09) ∧ T.x0020(i10). We are ignoring potential problems of name conflicts in this presentation (which would arise e.g. in the case of a document containing XML generic identifiers A, M or E).

Examples We assign the identifiers i01, i02, i03, etc. to individuals of (1). (In a working system one would probably use more meaningful identifiers; the only requirement on identifiers is that they should identify individuals uniquely.) We state some facts about these individuals as follows:

T("A",i01)	A(i01)	NA(i01,i02)
T(" ",i02)	A(i02)	NA(i02,i03)

T("r",i03)	A(i03)	NA(i03,i04)
T("o",i04)	A(i04)	NA(i04,i05)
T("s",i05)	A(i05)	NA(i05,i06)
T("e",i06)	A(i06)	NA(i06,i07)

T(" ",i07)	A(i07)	NA(i07,i08)
T("i",i08)	A(i08)	NA(i08,i09)
T("s",i09)	A(i09)	NA(i09,i10)
T(" ",i10)	A(i10)	NA(i10,i11)

T("a",i11)	A(i11)	NA(i11,i12)

T(" ",i12)	A(i12)	NA(i12,i13)
T("r",i13)	A(i13)	NA(i13,i14)
T("o",i14)	A(i14)	NA(i14,i15)
T("s",i15)	A(i15)	NA(i15,i16)
T("e",i16)	A(i16)	NA(i16,i17)
T(".",i17)	A(i17)

i18=i01+i02	M(i18)	N(i18,i19)
i19=i03+i04+i05+i06	M(i19)	N(i19,i20)
i20=i07+i08+i09+i10	M(i20)	N(i20,i11)
	M(i11)	N(i11,i21)
i21=i12+i13+i14+i15+i16+i17	M(i21)
i22=i18+i19+i20+i11+i21

L("A ",i18)
L("rose",i19)	E(i19)	quote(i19)
L(" is ",i20)
L("a",i11)	E(i11)	emph(i11)
L("rose.",i21)
L("A rose is a rose.",i22)	E(i22)	para(i22)

The same information may be presented more conspicuously in the following table, listing for each individual its identifier, its type, its label, the kind of individual it is (A for atoms, M for molecular and E for elemental strings), its assigned properties (i.e., properties assigned by an XML generic identifier), its next atom or molecular string and its immediate proper parts. At least as long as we are limiting ourselves to XML the notion immediate proper part can be given a straightforward and natural definition: x is an immediate proper part of y =df (x ≪ y) ∧ ¬(∃z)((x ≪ z) ∧ (z ≪ y))

Id	Type	Label	Kind	Assigned property	Next atom	Next molecule	Immediate parts
i01	"A"		A		i02
i02	" "		A		i03
i03	"r"		A		i04
i04	"o"		A		i05
i05	"s"		A		i06
i06	"e"		A		i07
i07	" "		A		i08
i08	"i"		A		i09
i09	"s"		A		i10
i10	" "		A		i11
i11	"a"	"a"	A M E	emph	i12	i21
i12	" "		A		i13
i13	"r"		A		i14
i14	"o"		A		i15
i15	"s"		A		i16
i16	"e"		A		i17
i17	"."		A
i18		"A "	M			i19	i01, i02
i19		"rose"	M E	quote		i20	i03, i04, i05, i06
i20		" is "	M			i11	i07, i08, i09, i10
i21		"rose."	M				i12, i13, i14, i15, i16, i17
i22		"A rose is a rose."	E	para			i18, i19, i20, i11, i21
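The derivation of atoms, molecular strings and elemental strings from a serialized document can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the serialization of (1) used below is reconstructed from the facts listed above, the function and variable names are hypothetical, and the tokenizer ignores attributes, comments and processing instructions. Individuals are modelled as sets of atom positions, so that the sum of two individuals is simply the union of their sets.

```python
import re
from collections import defaultdict


def individuals(xml):
    """Derive atoms, molecular strings and elemental strings from a serialization.

    Assumes a well-formed fragment without attributes, comments or
    processing instructions (a simplification made for this sketch).
    """
    atoms = []                     # character types in document (NA) order
    molecules = []                 # each molecule: frozenset of atom positions
    elementals = defaultdict(set)  # extent -> generic identifiers assigned to it
    open_tags = []                 # stack of (generic identifier, first atom position)
    for token in re.findall(r"<[^>]+>|[^<]+", xml):
        if token.startswith("</"):
            name, start = open_tags.pop()
            extent = frozenset(range(start, len(atoms)))
            if extent:             # there is no empty individual
                elementals[extent].add(name)
        elif token.startswith("<"):
            open_tags.append((token[1:-1], len(atoms)))
        else:
            start = len(atoms)
            atoms.extend(token)    # one atom per character
            molecules.append(frozenset(range(start, len(atoms))))
    return atoms, molecules, dict(elementals)


def label(atoms, individual):
    """Label of a string: its character types in NA order."""
    return "".join(atoms[i] for i in sorted(individual))


# Reconstructed serialization of example (1); an assumption of this sketch.
atoms, molecules, elementals = individuals(
    "<para>A <quote>rose</quote> is <emph>a</emph> rose.</para>")
molecule_labels = [label(atoms, m) for m in molecules]
assigned = {label(atoms, e): names for e, names in elementals.items()}
```

Because extents rather than tags are used as keys, two elements with exactly the same content would end up as one entry carrying two generic identifiers, which is the behaviour the text describes for coextensive elements.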

The elemental strings i22, i19 and i11 correspond to the XML elements (1)-(3) in a fairly straightforward way, and can now be identified for example as follows:
i22 = (℩x)(para(x) ∧ E(x))
i19 = (℩x)(quote(x) ∧ E(x))
i11 = (℩x)(emph(x) ∧ E(x))
The non-elemental molecules i18, i20 and i21 can be identified for example as follows:
i18 = (℩x)(∃y)(quote(y) ∧ N(x,y))
i20 = (℩x)(∃y)(emph(y) ∧ N(x,y))
i21 = (℩x)(M(x) ∧ ¬(∃y)N(x,y))
Although in this particular case the denoting expressions identifying individuals are fairly simple, identifying individuals by means of denoting expressions may in general become rather tedious. For example, in any document with more than one individual assigned the property quote, the denoting expression identifying individual i19 above would return the sum of all those individuals. So although we have shown that all atoms, molecular and elemental strings of (1) can be identified by our relatively straightforward application of the Calculus, some of the above examples draw on the simplicity of the example and are rather ad hoc. Therefore, before we proceed to discuss how the Calculus can be used to make statements and make inferences about a document, we introduce a slightly more complicated (and also more realistic) example. Consider the following XML document:
<?xml version="1.0" encoding="UTF-8"?>
<doc>
  A rule:
  <list>
    <item>First:</item>
    <item>
      <list>
        <item>think,</item>
        <item>decide.</item>
      </list>
    </item>
    <item>Then:</item>
    <item>
      <list>
        <item>act,</item>
        <item>regret.</item>
      </list>
    </item>
  </list>
</doc>
Once again we provide identifiers for individuals of the document and present their properties and relations in tabular form, but this time we include only the molecular and elemental individuals. (We have made life even more comfortable for ourselves by leaving out the whitespace-only molecular strings which occur between each of the molecules listed in the table.)

Id	Label	Kind	Assigned property	Next molecule	Immediate parts
i01	A rule:	M		i02
i02	First:	M E	item	i03
i03	think,	M E	item	i04
i04	decide.	M E	item	i05
i05	Then:	M E	item	i06
i06	act,	M E	item	i07
i07	regret.	M E	item
i08		E	list, item		i03, i04
i09		E	list, item		i06, i07
i10		E	list		i02, i08, i05, i09
i11		E	doc		i01, i10

Note that the individuals i08 and i09 are each represented as one individual with two assigned properties, rather than as two individuals each with one property. The difference between this representation and the conventional XML representation can be illustrated by juxtaposing a conventional XML tree of the document (to the left) and what we might call a mereological graph (to the right): It should be noted that the mereological graph here has been constructed so as to highlight the differences from XML discussed in this particular example, and that other important differences do not come out with this kind of visualization. For example, the nodes of the XML graph are commonly understood to represent XML elements, which in this case have been decorated with their generic identifiers. The nodes of the mereological graph, however, represent individuals and are decorated with what we have here called their assigned properties. Moreover, the nodes visible in the mereological graph represent only a tiny fraction of the individuals of the document. The arcs of the XML graph are commonly understood to represent containment and/or dominance relations between elements. In the mereological graph, they represent exclusively part-whole relationships. Again, the part-whole relationships depicted in the graph are only a fraction of the part-whole relationships between the individuals of the document. Because of our decision not to count tags as part of the document, all coextensive XML elements will be represented as one elemental individual. The nesting order of these elements in the XML document will not be preserved in this representation. It might of course seem that the nesting order is preserved by the order in which the assigned properties are mentioned in the table. However, the table represents an unordered set of statements, so the order is insignificant. More on the nesting order of coextensive elements further below. 
As before, we can use denoting expressions to refer to any part of the document, for example: i01 = (℩x)¬(∃y)N(y,x) i02 = (℩x)(item(x) ∧ ¬(∃y)(item(y) ∧ P(y,x))) i03 = (℩x)(∃y)(∃z)(w)(v) ((x ≪ y) ∧ list(y) ∧ (y ≪ z) ∧ list(z) ∧ (N(w,x) → ¬(w ≪ y)) ∧ (N(v,w) → ¬(v ≪ z))) i09 = (℩x)(∃y)(∃z) (list(x) ∧ (x ≪ y) ∧ list(y) ∧ list(z) ∧ (z ≪ y) ∧ ¬(x = z) ∧ P(x,z))
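How a denoting expression picks out an individual can be illustrated with a small evaluator. The sketch below uses the simpler first example, and only its named molecular and elemental individuals; the helper `the` and the dictionary keys are hypothetical names. One assumption should be stressed: here a description that does not single out exactly one individual is reported as an error, which is a simplification and not necessarily how the Calculus itself treats such descriptions.

```python
def the(domain, predicate):
    """Return the unique member of domain satisfying predicate.

    Assumption of this sketch: a description that does not single out
    exactly one individual is reported as an error.
    """
    matches = [x for x in domain if predicate(x)]
    if len(matches) != 1:
        raise ValueError(f"description matches {len(matches)} individuals")
    return matches[0]


# Facts about the molecular and elemental strings of the first example.
M = {"i18", "i19", "i20", "i11", "i21"}
E = {"i19", "i11", "i22"}
N = {("i18", "i19"), ("i19", "i20"), ("i20", "i11"), ("i11", "i21")}
quote, emph, para = {"i19"}, {"i11"}, {"i22"}
domain = sorted(M | E)

found = {
    # (℩x)(para(x) ∧ E(x))
    "para element": the(domain, lambda x: x in para and x in E),
    # (℩x)(∃y)(quote(y) ∧ N(x,y))
    "before quote": the(domain, lambda x: any(y in quote and (x, y) in N for y in domain)),
    # (℩x)(∃y)(emph(y) ∧ N(x,y))
    "before emph": the(domain, lambda x: any(y in emph and (x, y) in N for y in domain)),
    # (℩x)(M(x) ∧ ¬(∃y)N(x,y))
    "last molecule": the(domain, lambda x: x in M and not any((x, y) in N for y in domain)),
}
```

Adding a second individual to `quote` would make the "before quote" description ambiguous, which illustrates why such expressions depend on the simplicity of the example.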

Statements and inferences We can also use the Calculus to make statements about the document, either unquantified, such as (1)–(4), or quantified, such as (5)–(8):
(1) list(i09)
(2) item(i09)
(3) i07 ≪ i09
(4) i09 ≪ i10
(5) (x)(y)((list(x) ∧ item(x) ∧ (y ≪ x)) → item(y))
(6) (x)(y)((list(x) ∧ item(x) ∧ (x ≪ y)) → (list(y) ∨ doc(y)))
(7) (x)(item(x) → (∃y)((x ≪ y) ∧ list(y)))
(8) (x)(item(x) → (∃y)(∃z) (item(y) ∧ list(z) ∧ (x ≪ z) ∧ (y ≪ z) ∧ ¬(x = y)))
In order to avoid unnecessary misunderstanding, it should be pointed out that (1)–(8) are descriptive statements about this particular document. (In other contexts, for example in situations where we wanted to express general constraints on document structure, we might of course also want to state facts about document types, but that is not our issue here.) From the statements we can make inferences, such as for example:
(9) item(i07) [From (1), (2), (3) and (5).]
(10) list(i10) ∨ doc(i10) [From (1), (2), (4) and (6).]
(11) (∃y)((i09 ≪ y) ∧ list(y)) [From (2) and (7).]
(12) (∃y)(∃z)(item(y) ∧ list(z) ∧ (i07 ≪ z) ∧ (y ≪ z) ∧ ¬(i07 = y)) [From (8) and (9).]
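The quantified statements (5)–(8) can be checked mechanically against the table of the list example. The following is a minimal sketch, with one important assumption: the quantifiers range only over the eleven individuals named in that table, whereas the Calculus itself admits many more individuals (every atom and every sum of atoms). Each individual is modelled as the set of molecule numbers it is the sum of; all names are hypothetical.

```python
from itertools import product

# Each named individual is modelled as the set of molecules it is the sum of.
parts = {f"i{n:02d}": frozenset({n}) for n in range(1, 8)}
parts.update({
    "i08": frozenset({3, 4}),
    "i09": frozenset({6, 7}),
    "i10": frozenset({2, 3, 4, 5, 6, 7}),
    "i11": frozenset({1, 2, 3, 4, 5, 6, 7}),
})
item = {"i02", "i03", "i04", "i05", "i06", "i07", "i08", "i09"}
lst = {"i08", "i09", "i10"}
doc = {"i11"}
D = sorted(parts)  # restricted domain of quantification (assumption)


def pp(x, y):
    """x ≪ y: x is a proper part of y."""
    return parts[x] < parts[y]


# Statements (5)-(8), evaluated over the restricted domain D.
s5 = all(y in item for x, y in product(D, D)
         if x in lst and x in item and pp(y, x))
s6 = all(y in lst or y in doc for x, y in product(D, D)
         if x in lst and x in item and pp(x, y))
s7 = all(any(pp(x, y) and y in lst for y in D) for x in item)
s8 = all(any(y in item and z in lst and pp(x, z) and pp(y, z) and x != y
             for y, z in product(D, D)) for x in item)
```

Under this restriction all four statements evaluate to true for the document, and inference (9) corresponds to the observation that `pp("i07", "i09")` holds while i09 is both a list and an item.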

Conclusion We have shown that strings composed of characters defined as atomic individuals can be identified and referenced by denoting expressions, that the Calculus can be used to describe the part-whole relationships and ordering relations between parts of the document as well as the properties ascribed by generic identifiers. We have also shown that this application of the Calculus can be used for making statements about documents and for drawing inferences from these statements. The approach chosen here has at least two obvious problems, or shortcomings; one concerns the representation of coextensive elements, one relates to the representation of empty elements. Before we discuss these problems, however, we would like to assess one of its possible merits. In the next section, we will therefore sketch how this application of the Calculus can be used for the formulation of rules for propagation of properties among the parts of a document.

Property Propagation — a Sketch We have assumed that the generic identifier of an element may be seen as assigning a property to the PCDATA content of that element, and not to any proper part of that PCDATA content. But sometimes, the meaning of the markup is such that that property is not assigned (or not only assigned) to the contents of the element itself, but also to all or some of its descendants, or to all or some of its ancestors, or to one or more of its siblings, or to only specific other elements. Furthermore, what is assigned to the element or elements in question may be not a monadic property, but a relation of them to other elements in the same document, or even to document elements or other entities outside that document. Thus, the propagation of properties ascribed by the generic identifier of an element may follow a large diversity of patterns. Using examples from the TEI and HTML encoding schemes, we will show that some of these patterns can conveniently be described by means of our application of the Calculus. We will first address some of the general distribution patterns identified by Nelson Goodman, which seem to represent important aspects of the intended semantics of certain TEI or HTML element types. We will then proceed to more complicated examples.

Dissective and anti-dissective properties As mentioned, in our application of the Calculus so far we have assumed that the property designated by the generic identifier of an XML element is assigned exclusively to the individual delimited by the start and end tags of the element, and not to its parts. This seems plausible enough for a number of element types, such as paragraphs, list items and titles. For example, a part of a paragraph, a list item or a title is not in general itself a paragraph, a list item or a title. TEI element types such as <hi> (highlighting) or <add> (added), however, do not seem to follow this rule. (In the following we will often use the expression element or element type as short for property ascribed to an element by its generic identifier.) Every part of a highlighted or added element is itself presumably highlighted or added. Other examples may be <del> (deleted) and <foreign>. The HTML element type <i> (italics) may provide an even clearer example here: every part of an italicized element is itself in italics. According to Goodman, "a ... predicate is ... dissective if it is satisfied by every part of every individual that satisfies it" [, p. 38]. A dissective one-place predicate is defined as follows: F is dissective iff (x)(y)((F(x) ∧ (y < x)) → F(y)) Consider the following document fragment: <s>We <add>, as all <del>purely <hi>human</hi> and</del> finite beings, </add> are all fallible.</s> As earlier, we represent the properties of this fragment in tabular form. From now on, however, instead of indicating assigned properties for each individual we will list relevant statements (some of which may be inferences from statements about the properties of other individuals):

Id	Label	Kind	Statements	Next	Parts
i01	We	M		i02
i02	, as all	M		i03
i03	purely	M		i04
i04	human	M E	hi(i04)	i05
i05	and	M		i06
i06	finite beings,	M		i07
i07	are all fallible.	M
i08		E	del(i08)		i03, i04, i05
i09		E	add(i09)		i02, i08, i06
i10		E	s(i10)		i01, i09, i07

However, if we add the following statements to the effect that the properties add, del and hi are dissective: (x)(y)((add(x) ∧ (y < x)) → add(y)) (x)(y)((del(x) ∧ (y < x)) → del(y)) (x)(y)((hi(x) ∧ (y < x)) → hi(y)) — then, we can infer additional properties, with the following result:

Id	Label	Kind	Statements	Next	Parts
i01	We	M		i02
i02	, as all	M	add(i02)	i03
i03	purely	M	del(i03), add(i03)	i04
i04	human	M E	hi(i04), del(i04), add(i04)	i05
i05	and	M	del(i05), add(i05)	i06
i06	finite beings,	M	add(i06)	i07
i07	are all fallible.	M
i08		E	del(i08), add(i08)		i03, i04, i05
i09		E	add(i09)		i02, i08, i06
i10		E	s(i10)		i01, i09, i07
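The effect of declaring add, del and hi dissective can be computed directly. The sketch below closes the assigned statements under the dissectiveness rule, with the quantifiers restricted to the ten individuals named in the table (an assumption of this sketch) and with y < x read as "y is part of x", identity included. Individuals are modelled as sets of molecule numbers; the function name is hypothetical.

```python
def propagate_dissective(parts, assigned, dissective):
    """Close the assigned statements under (x)(y)((F(x) ∧ (y < x)) → F(y))."""
    inferred = {name: set(members) for name, members in assigned.items()}
    for name in dissective:
        for x in assigned[name]:
            for y, extent in parts.items():
                if extent <= parts[x]:      # y < x (part of, identity included)
                    inferred[name].add(y)
    return inferred


# Named individuals of the fragment, as sets of molecule numbers.
parts = {f"i{n:02d}": frozenset({n}) for n in range(1, 8)}
parts.update({
    "i08": frozenset({3, 4, 5}),              # content of <del>
    "i09": frozenset({2, 3, 4, 5, 6}),        # content of <add>
    "i10": frozenset({1, 2, 3, 4, 5, 6, 7}),  # content of <s>
})
assigned = {"hi": {"i04"}, "del": {"i08"}, "add": {"i09"}, "s": {"i10"}}
inferred = propagate_dissective(parts, assigned, ["hi", "del", "add"])
```

A single pass over the assigned individuals is enough here, because the part-of relation is transitive: every part of a part of x is already found when the parts of x are enumerated.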

(Note that this is the first example so far of non-elemental individuals carrying assigned properties.) Goodman observes that "In practice, we are usually concerned only with dissectiveness under some special or systematic limitations..." [, p. 38]. This seems to be the case here, too: While the TEI elements <hi>, <add> and <del> and the HTML element <i> seem to apply all the way down to every atomic part of an individual, an element type like <foreign> hardly applies below word-level. Furthermore, there seem to be exceptions even in the case of <hi>, <add> and <del>: In a transcription, a <note> (note) element is normally not intended to inherit the property in question. A more generally usable formula for dissectiveness may therefore be this: (x)(y)(z)((F(x) ∧ (y < x) ∧ ¬((z < x) ∧ (y < z) ∧ (G(z) ∨ H(z) ∨ ...))) → F(y)) where G, H,... indicate exceptions. Let us define an anti-dissective one-place predicate as follows: F is anti-dissective iff (x)(y)((F(x) ∧ (y ≪ x)) → ¬F(y)) (The term anti-dissective, and its definition, is ours, not Goodman's. The same goes for the terms anti-expansive and anti-collective in the following paragraphs.) The TEI element <docDate> (document date) and the TEI and HTML <body> may serve as examples of anti-dissective properties: no part of a <docDate> or a <body> element is itself a <body> or a <docDate>. The HTML <p> (paragraph) element is also clearly anti-dissective. The TEI <p> element presents a complication. It would seem to be anti-dissective, but unlike HTML, TEI allows <p>s nested within <p>s. So (x)(y)((p(x) ∧ (y ≪ x)) → ¬p(y)) is true in HTML, but not in TEI. The TEI <p> element can therefore not be said to be either dissective or anti-dissective. A reflection upon this fact may also make us change our judgement of the HTML <p> element: Perhaps it is just a result of the content model of <p> in HTML that it seems anti-dissective. Anyhow, since nested <p>s simply do not occur in HTML, it does not matter much whether we classify the property as non-dissective or anti-dissective.
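The formula for dissectiveness with exceptions can be sketched as follows. Two assumptions are made: the exception clause is read as "there is no z such that y is part of z, z is part of x, and z satisfies one of the exception predicates", and the quantifiers range only over the individuals listed. The fragment (an addition containing an editorial note) and all names are hypothetical.

```python
def propagate_with_exceptions(parts, holders, blockers):
    """Propagate F from each holder down to its parts, except through blockers.

    holders: individuals to which F is assigned.
    blockers: individuals satisfying one of the exception predicates G, H, ...
    A part y of x is skipped if some blocker z lies between them (y < z < x).
    """
    result = set(holders)
    for x in holders:
        for y, extent in parts.items():
            if not extent <= parts[x]:
                continue                    # y is not a part of x
            shielded = any(extent <= parts[z] <= parts[x] for z in blockers)
            if not shielded:
                result.add(y)
    return result


# Hypothetical fragment: <add>see <note>ed.</note> above</add>
parts = {
    "m1": frozenset({1}),         # "see "
    "m2": frozenset({2}),         # "ed." (content of <note>)
    "m3": frozenset({3}),         # " above"
    "a": frozenset({1, 2, 3}),    # content of <add>
}
added = propagate_with_exceptions(parts, holders={"a"}, blockers={"m2"})
```

In this reading the note shields itself and everything inside it, so the property add reaches the two surrounding molecules but not the note.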

Expansive and anti-expansive properties "A one-place predicate is expansive if it is satisfied by everything that has a part satisfying it" [, p. 38]. An expansive one-place predicate can be defined as follows: F is expansive iff (x)(y)((F(x) ∧ (x < y)) → F(y)) In more conventional XML terms, while dissective predicates propagate down the document tree, expansive predicates propagate upwards in the tree, from children to their parents. This might be thought to be unusual, and actually it is difficult to find examples of such properties in the TEI and HTML encoding schemes. Element types such as <docDate> and <docAuthor> may, as we shall see later, be said to ascribe properties to individuals of which they are a part, but that does not make these individuals themselves <docDate>s or <docAuthor>s. (Even so, it is easy to think of expansive properties: for example, the property of containing the word Hamlet would clearly be expansive.) Let us define an anti-expansive property as follows: F is anti-expansive iff (x)(y)((F(x) ∧ (x ≪ y)) → ¬F(y)) The TEI element <foreign> may be an example of a property which is anti-expansive, at least up to a certain level, and at least insofar as it seems reasonable to assume that if something is marked as foreign, then it is marked off from something which is not in a foreign language.

Collective and anti-collective properties That a one-place predicate is collective means that it is satisfied by the sum of every two individuals (distinct or not) that satisfy it severally [, p. 39]. A collective one-place predicate can be defined as follows: F is collective iff (x)(y)((F(x) ∧ F(y)) → F(x + y)) Dissective elements like the TEI elements <hi>, <add>, <del> and <foreign> and the HTML element <i> seem also to be collective: any sum of strings in italics would seem itself to be in italics, etc. There probably are examples of expansive and non-dissective or anti-dissective properties in TEI or HTML, but so far we have not found any. Let us define an anti-collective property as follows: F is anti-collective iff (x)(y)((F(x) ∧ F(y) ∧ (x ʅ y)) → ¬F(x + y)) Both the TEI and the HTML <div> (division) element types seem to be anti-collective: no sum of <div>s is itself a <div>.
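For very small documents the distribution patterns defined above can be tested by brute force over every individual, that is, over every sum of one or more atoms. The sketch below does so; it assumes that x ʅ y expresses discreteness (no common part), models sums as set unions, and uses hypothetical toy predicates, including word-level atoms for the Hamlet example.

```python
from itertools import combinations


def all_individuals(atoms):
    """Every sum of one or more atoms (there is no nil individual)."""
    return [frozenset(c) for r in range(1, len(atoms) + 1)
            for c in combinations(atoms, r)]


def is_dissective(F, domain):
    return all(F(y) for x in domain if F(x) for y in domain if y <= x)


def is_expansive(F, domain):
    return all(F(y) for x in domain if F(x) for y in domain if x <= y)


def is_collective(F, domain):
    return all(F(x | y) for x in domain if F(x) for y in domain if F(y))


def is_anti_collective(F, domain):
    # x.isdisjoint(y) stands in for discreteness (assumed reading of the symbol)
    return all(not F(x | y) for x in domain if F(x)
               for y in domain if F(y) and x.isdisjoint(y))


# Toy predicates (hypothetical): word-level atoms for the Hamlet example,
# numbered atoms for the italics and division examples.
words = all_individuals(["said", "Hamlet", "softly"])
contains_hamlet = lambda x: "Hamlet" in x

numbered = all_individuals([1, 2, 3, 4])
in_italics = lambda x: x <= frozenset({2, 3})      # every atom of x is italic
divisions = {frozenset({1, 2}), frozenset({3, 4})}
is_div = lambda x: x in divisions
```

The exhaustive domain grows exponentially with the number of atoms, so this is a way of exploring the definitions on toy cases, not a practical procedure for real documents.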

The HTML title element So far, we have been concerned only with one-place predicates. We have simply tried to find examples of the patterns Goodman terms dissective, expansive and collective, and added the corresponding patterns anti-dissective etc. (Goodman also identifies patterns he terms nucleative, pervasive, cumulative and agglomerative [, p. 39–40]. We do not discuss these here, as we have not found any interesting application of them for the present purposes. In particular, a property F is nucleative iff (F(x) ∧ F(y)) → F(x · y). Since XML has no elements which overlap without the one being a part of the other, the product of two element strings is always a part of one of them. Therefore the pattern does not have any interesting applications to XML, although it may have for markup systems such as xConcur, TexMecs, Goddag, LMNL and others which allow overlapping elements.) Many TEI and HTML elements ascribe properties according to more complicated patterns which can more conveniently be accounted for by representing them as relations, or predicates with two or more places. We begin with a simple example of an element expressing a two-place predicate, the HTML title element. From:
<!DOCTYPE html SYSTEM "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Simple HTML</title>
  </head>
  <body>
    <p>First para</p>
    <p>Second para</p>
  </body>
</html>
we get:

Id	Label	Kind	Statements	Next	Parts
i01	Simple HTML	M E	head(i01), title(i01)	i02
i02	First para	M E	p(i02)	i03
i03	Second para	M E	p(i03)
i04		E	body(i04)		i02, i03
i05		E	html(i05)		i01, i04

We state the propagation rule that: (x)(y)((title(x) ∧ (x < y) ∧ html(y)) → hasTitle(y,x)) and get for the last line of the previous table:

Id	Label	Kind	Statements	Next	Parts
i05		E	html(i05), hasTitle(i05,i01)		i01, i04

The fact that the propagation rule can be made so simple in this case is partly due to the fact that we are assuming that the document is valid, and that the relative structural positions of the elements are constant. For example, there is no need to state that the title element has to be the child of a head element which in turn is directly succeeded by a body element etc.
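The title rule can be applied with a single comprehension. In this minimal sketch the individuals of the table are modelled as sets of molecule numbers, x < y is read as "x is part of y", and the variable names are hypothetical.

```python
# Individuals of the HTML example, as sets of molecule numbers.
parts = {
    "i01": frozenset({1}),          # Simple HTML
    "i02": frozenset({2}),          # First para
    "i03": frozenset({3}),          # Second para
    "i04": frozenset({2, 3}),       # content of <body>
    "i05": frozenset({1, 2, 3}),    # content of <html>
}
title, html = {"i01"}, {"i05"}

# (x)(y)((title(x) ∧ (x < y) ∧ html(y)) → hasTitle(y,x))
has_title = {(y, x) for x in title for y in html if parts[x] <= parts[y]}
```

No test on head or body is needed, which reflects the point made above: validity already fixes where a title can occur.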

The TEI sp, speaker and stage elements While it is quite legitimate to assume document validity when stating propagation rules, these rules tend to become more complex when more elements are involved, and/or the rules for the structural positions of the elements concerned are more complex. The relation between the TEI elements <sp> (speech), <speaker> and <stage> (stage direction) is that a <sp> may contain a <speaker>, and if it does, the <speaker> element contains the name of the speaker of the rest of the <sp> element, except for any <stage>s (stage directions) it might contain. From: <sp> <speaker>Peer</speaker> Why <stage>(hesitating)</stage> swear? </sp> we get:

Id	Label	Kind	Statements	Next	Parts
i01	Peer	M E	speaker(i01)	i02
i02	Why	M		i03
i03	(hesitating)	M E	stage(i03)	i04
i04	swear?	M
i05		E	sp(i05)		i01, i02, i03, i04

We state the following propagation rule: (x)(y)((speaker(x) ∧ (x < y) ∧ sp(y)) → (z)(((z < y) ∧ ¬(speaker(z) ∨ stage(z))) → saidBy(z,x))) and get:

Id	Label	Kind	Statements	Next	Parts
i01	Peer	M E	speaker(i01)	i02
i02	Why	M	saidBy(i02,i01)	i03
i03	(hesitating)	M E	stage(i03)	i04
i04	swear?	M	saidBy(i04,i01)
i05		E	sp(i05)		i01, i02, i03, i04
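The speaker rule can be evaluated in the same style. The sketch below restricts the quantifiers to the five individuals named in the table and reads z < y as "z is part of y", identity included; both are assumptions, and the names are hypothetical.

```python
# Individuals of the <sp> example, as sets of molecule numbers.
parts = {
    "i01": frozenset({1}),              # Peer
    "i02": frozenset({2}),              # Why
    "i03": frozenset({3}),              # (hesitating)
    "i04": frozenset({4}),              # swear?
    "i05": frozenset({1, 2, 3, 4}),     # content of <sp>
}
speaker, stage, sp = {"i01"}, {"i03"}, {"i05"}

# (x)(y)((speaker(x) ∧ (x < y) ∧ sp(y)) →
#        (z)(((z < y) ∧ ¬(speaker(z) ∨ stage(z))) → saidBy(z,x)))
said_by = {
    (z, x)
    for x in speaker
    for y in sp if parts[x] <= parts[y]
    for z in parts
    if parts[z] <= parts[y] and z not in speaker and z not in stage
}
```

Read literally, with identity counting as part-of, the rule also yields saidBy(i05,i01), since the sp individual is a part of itself and is neither a speaker nor a stage direction. Restricting z to molecules would give exactly the two statements shown in the table.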

The TEI docTitle, docDate and docAuthor elements The TEI <docTitle> (document title) element may occur directly within <titlePage> or <front> (front matter); <titlePage> may occur directly within <front> or <back> (back matter), and <front> and <back> may occur directly within <text>. <docTitle> behaves very much like the HTML <title> element: (x)(y)((docTitle(x) ∧ (x < y) ∧ text(y)) → hasTitle(y,x)) <docTitle> assigns the property of being a document title to its own content, and the property of having that title to the individual which carries the property of being a text, and of which it is itself a part. Thus, while no other parts of the elemental text individual have any of these properties, all its parts have the property of being the part of an individual which carries the title in question. The <docDate> (document date) element, in turn, behaves very much like the <docTitle> element. Although it may occur in a larger variety of positions, it assigns the property of being (or identifying) the date of the document to its own content, and the property of having that date to the individual which carries the property of being a text, and of which it is itself a part. We may assume, however, that the document date carries over to most or all the parts of the text, i.e., that all the parts of the element have the property of having that date, too. If we are dealing with a transcription of an authorial document which according to the <docDate> element dates from a particular year, it may be the case that we also know that all parts of the document marked by <add> contain corrections in that document made by another person several years later, and that all <note>s are editorial notes supplied even later than that, by the creator of the electronic version. 
A propagation rule to this effect may be expressed for example as follows: (x)(y)(z)(w)((docDate(x) ∧ (x < y) ∧ text(y)) → (((z < y) ∧ ¬((z < w) ∧ (add(w) ∨ note(w)))) → (hasDate(y,x) ∧ hasDate(z,x)))) Note, however, that in some situations the TEI <docDate> element gives the date of the first edition of the text, while the text actually transcribed by the document comes from a later edition. In such situations the semantics of the element is rather different, and the property of having the date given may possibly not propagate to elements below <text> level at all. The <docAuthor> (document author) element, again, behaves much like the <docDate> element. It assigns the property of being the name of the author of the document to its own content, and the property of having the author of that name to the text of which it is a part. In the example just discussed, we may again assume that the property, in this case the property of having the author in question, is not carried over to later additions and notes. Other element types, such as <q> (quote) and <cit> (citation), would for more or less obvious reasons also have to be considered for exclusion. However, there is a further complication: If a person is considered the author of a document, he is normally also considered the author of parts of that document, such as its chapters, sections and paragraphs. Perhaps authorship may also be attributed to sentences or phrases, but certainly not to individual words or letters. Again we are faced with a property which propagates down to a certain level, but where it is unclear exactly where that level ends. And as is so often the case with markup, it does not help us much to become clear about the level at which the propagation ends, be it subparagraphs, sentences or phrases, if it turns out that the elements at that level have not been marked up.

Problems We have mentioned that there are at least two serious problems with our application of the Calculus. One problem, which has already been identified, relates to the representation of coextensive elements. The other problem, which relates to the representation of empty elements, has only been mentioned in passing. We believe this is the less serious of the two, and we will therefore discuss it first.

Empty elements For the purposes of this discussion, we may conveniently distinguish between milestone elements and other empty elements.

Milestone elements Milestones are empty elements which ascribe properties to parts of a document, but which for various reasons are represented by empty elements. The reason why some textual phenomena are represented by milestones rather than ordinary elements is often a need to overcome the XML constraint that element structure must be hierarchical. Typically, a milestone may be seen as assigning a property to the following parts of the document, up to the next milestone element of the same type, up to the occurrence of an element of some specific other type, or to the end of the document. We think we have already demonstrated that our application of the Calculus to XML documents can handle such property assignment. We believe that many of the other mechanisms proposed to handle so-called overlapping hierarchies in XML (for example, Trojan Horse milestones, [] and fragmented or virtual elements []) can be handled in similar ways, and therefore do not constitute a serious problem for our application of the Calculus.

Other empty elements Empty elements which are not milestones typically stand for and/or ascribe properties to some part of the document which cannot straightforwardly be represented as a character or string of characters. These empty elements are more difficult to deal with, because according to our application of the Calculus something which cannot be said to consist of character atoms simply cannot be an individual. And if it is no individual there seems to be nothing to which properties can be ascribed; only individuals can have properties. The TEI elements <ptr> (pointer), <anchor> (anchor point), <index> (index entry) and <divGen> (automatically generated text division) are some examples. Either they indicate a point in the document, i.e., they have no extension in the terms of our application of the Calculus and would seem to have to be located in a position between two atoms, or they do not indicate any point or extension in the document, but rather an instruction to generate strings with certain properties at the position where they are located. In some cases, the problems outlined here can be solved by replacing the empty element in question with a character string, taken for example from an attribute value of the element in question. In cases where the element occupies or points to a location between characters, we might find a practical workaround by letting it apply or point instead to the atom immediately before or after the relevant location in our model of the document. A slightly different kind of problem is presented by the TEI <graphic> (inline graphic, illustration, or figure) and HTML <img> elements. The basic meaning of these elements is easy enough to grasp: The occurrence of the element indicates that an illustration or a figure occurs at a specific location in the document. 
Therefore, a more appropriate solution to this as well as to the previously mentioned examples is probably to lift the requirement that all atoms should have a character type as a property. A graphics element, for example, might simply be represented in our model by a graphics atom. More generally, this would be a model in which a document consists not of a sequence of character atoms, but of a sequence of some more generic kind of atoms. We might, for example, agree to call them atomic content objects, and concede that such atoms may or may not have a character property, an image property etc. Although we have not investigated the matter, we believe that such a modification would not drastically change the application of the Calculus described above.
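What such a more generic kind of atom might look like can be sketched with a small data class. Everything here is hypothetical, including the class name, the field names and the file name; the point is only that an atom keeps its place in document order whether or not it carries a character property.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Atom:
    """An atomic content object; all names here are hypothetical."""
    position: int                    # place in document (NA) order
    char: Optional[str] = None       # character type, if the atom is a character
    image: Optional[str] = None      # e.g. a file reference, if the atom is a graphic


def next_after(a, b):
    """NA(a, b): b is the atom next after a."""
    return b.position == a.position + 1


# A document with four character atoms and one graphics atom.
doc = [Atom(0, char="S"), Atom(1, char="e"), Atom(2, char="e"),
       Atom(3, image="figure1.png"), Atom(4, char=".")]
text = "".join(a.char for a in doc if a.char is not None)
```

The ordering predicate is unaffected by the change, which is consistent with the expectation that the rest of the application would carry over largely unchanged.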

Coextensive elements We have already exemplified and briefly discussed the problem with coextensive elements: If two or more nested elements have exactly the same content, i.e., share exactly the same leaf nodes in the XML tree, they will be represented in our application of the Calculus as one individual sharing all the properties ascribed by the nested XML elements. What kind of problem this is, and whether and how it can be solved, depends on the wider requirements and aims for our application of the Calculus to markup. Under certain requirements or perspectives, it may cease to be a problem. If our aim is to establish a representation from which the serialized form of an XML document can be regenerated, we obviously have a problem: It is by no means obvious if or how this could be done. Likewise, if our aim is to establish a representation from which the XML DOM, the XDM or the XML Infoset representation can be generated, or which is isomorphic to and/or contains (all) the information given in any of those, then it is perhaps even more obvious that we have a problem. We have two responses to this: On the one hand, the value of the approach presented here does not depend on such capabilities. The value of the approach to property propagation, for example, may be simply as an ancillary representation of some of the features of marked-up documents, a representation which is not intended to capture all the information present in XML documents but rather to assist in the processing of such documents. Therefore, the problem discussed here is a problem only to the extent that it impedes our work to realize this more modest aim. So far, we have not found any indication that it does. On the other hand, we might want to use this representation in order to modify the XML documents so represented, and in that case we would clearly need to reserialize them to XML or generate an XML-conformant document model of them. 
For such purposes, we believe that information about the XML nesting order of coextensive elements could easily be stored in some ancillary data structure which would make reserialization etc. possible. It should also be mentioned that, although again we have not investigated the matter, it is not unreasonable to assume that a representation of documents in the way proposed for our application of the Calculus might be a convenient step in the process of converting XML documents to certain other markup systems, such as TexMecs or LMNL. Finally, if our aim is to offer an alternative representation based on a different understanding of the structure and semantics of marked-up documents, then we have a problem only if it can convincingly be argued that our representation is in some respect inferior to these standard ways of modelling documents. We think such a discussion is premature unless and until the application sketched here is developed further, but at least two lines of argument seem to present themselves as possible responses to the challenge. First, one might argue that the problem is with XML, and not with the approach discussed here. For example, if a TEI <p> (paragraph) element and an <s> (s-unit, sentence) element are coextensive, XML forces us to decide whether we are dealing with a paragraph containing a sentence, or a sentence containing a paragraph, and leaves us no other option. But we might just as well (or rather) want to say that we are dealing with one object which has two properties: that of being a paragraph and that of being a sentence. The part-whole relationship which seems forced upon us by XML is an artifact of the serialization, a result of one of the limitations of embedded markup. Second, we might concede that the representation of coextensive elements as conceived in the present approach is a problem, and try to solve it by amending our mereological system. 
Part of the solution may be found in allowing a more generous set of atoms, as discussed above in connection with the problem of empty elements. Another part of the solution might be to replace the Calculus of Individuals with some other formal mereological system. For example, there seem to be mereological systems which allow for the idea that one individual may be part of another even in cases where we cannot identify any part which they do not share. For options along these lines, see the discussion of supplementation and closure principles in p. 38 ff.
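The collapse of coextensive elements into one individual, and the suggestion that an ancillary data structure could preserve their XML nesting order, can be sketched as follows. This is a minimal Python illustration under assumptions introduced here, not an implementation of the approach: the input format (name, start, end), the names properties, nesting_order and serialize, and the sample markup are hypothetical. The sketch handles only properly nested, non-empty elements without attributes.

```python
from collections import defaultdict

# Hypothetical input: elements as (name, start, end) spans over character
# atoms, listed in document order with outer elements before inner ones.
text = "Hello world."
elements = [
    ("div", 0, 12),
    ("p", 0, 12),    # coextensive with <div> above and <s> below
    ("s", 0, 12),
    ("hi", 6, 11),   # covers "world"
]

# An individual is identified by its atoms alone, so coextensive elements
# collapse into one individual that carries all of their properties.
properties = defaultdict(set)
# Ancillary structure (an assumption of this sketch): the XML nesting order
# of the elements sharing an individual, outermost first.
nesting_order = defaultdict(list)

for name, start, end in elements:
    individual = frozenset(range(start, end))
    properties[individual].add(name)
    nesting_order[individual].append(name)


def serialize(text, nesting_order):
    """Regenerate XML from the individuals plus the stored nesting order."""
    opens, closes = defaultdict(list), defaultdict(list)
    # Visit wider individuals first, so that they enclose narrower ones.
    ordered = sorted(nesting_order.items(),
                     key=lambda kv: (min(kv[0]), -len(kv[0])))
    for individual, names in ordered:
        start, end = min(individual), max(individual) + 1
        opens[start].extend(f"<{n}>" for n in names)
        # Narrower individuals are visited later and must close earlier,
        # so their end tags are placed in front of those already stored.
        closes[end][:0] = [f"</{n}>" for n in reversed(names)]
    out = []
    for i in range(len(text) + 1):
        out.extend(closes[i])
        out.extend(opens[i])
        if i < len(text):
            out.append(text[i])
    return "".join(out)
```

Here the three coextensive elements yield a single individual with the properties div, p and s. The set of properties alone does not determine which element contains which; only the separately stored nesting_order lets serialize regenerate the original nesting.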

Conclusion and Future Work We have considered some possible applications of the Calculus of Individuals to XML, whereof the so-called character-atom approach has seemed the most promising so far. Strings composed of characters defined as atomic individuals can be identified and referenced by denoting expressions. The part-whole relationships and ordering relations between parts of the document as well as the properties ascribed by generic identifiers can be described. Statements about the individuals of documents and their properties can be made, and inferences can be drawn from these statements. We have shown, by means of examples from the TEI and HTML encoding schemes, how this application of the Calculus can be used for the formulation of rules describing the propagation of properties among the parts of a document. We have identified problems or shortcomings concerning the representation of empty elements and coextensive elements, and suggested that these problems may be overcome partly by allowing a more generous set of atoms, and partly by replacing the Calculus of Individuals with some other formal mereological system. In order to assess whether the application of formal mereology to markup semantics is worthwhile, we believe that continued work is required along several lines: The application to XML should be extended beyond the limitations of the approach presented here to include the full range of XML mechanisms, such as attributes, entities, declarations, comments, processing instructions, and marked sections. While the approach presented here is limited to the consideration of XML documents in serialized form, i.e., as character streams, attempts should be made at applying formal mereology to XML documents considered as graphs of XPath nodes, Infoset items, and the like. 
Furthermore, and as already mentioned, mereological systems beyond the Calculus of Individuals should be considered in order to overcome some of the problems encountered in the approach presented here. Last, but not least: The application of formal mereological systems should be extended to other markup systems such as SGML, TexMecs, LMNL, Goddag and others.

References
Casati, Roberto and Varzi, Achille C. Parts and Places. The Structures of Spatial Representation. MIT Press, 1999.
DeRose, Steven J. 2004. Markup overlap: A review and a horse. In Proceedings of Extreme Markup Languages 2004.
Fitzgerald, Henry. Nominalist things. Analysis 63.2, OUP, April 2003, pp 170-71. doi:10.1093/analys/63.2.170.
Goodman, Nelson. Problems and Projects. Hackett, Indianapolis 1972.
Goodman, Nelson. The structure of appearance. Third edition. Boston: Reidel, 1977.
Leonard, Henry S. and Goodman, Nelson. The Calculus of Individuals and Its Uses, The Journal of Symbolic Logic Vol 5, No. 2, pp 45-55, June 1940. doi:10.2307/2266169.
Libardi, Massimo. Applications and limits of mereology. From the theory of parts to the theory of wholes, Axiomathes, n.1, aprile 1994, pp. 13-54.
Marcoux, Yves, Michael Sperberg-McQueen, and Claus Huitfeldt. Formal and informal meaning from documents through skeleton sentences: Complementing formal tag-set descriptions with intertextual semantics and vice-versa. Presented at Balisage: The Markup Conference 2009, Montréal, Canada, August 11 - 14, 2009. In Proceedings of Balisage: The Markup Conference 2009. Balisage Series on Markup Technologies, vol. 3 (2009). doi:10.4242/BalisageVol3.Sperberg-McQueen01.
Pitkänen, Risto. Content Identity. Mind, 1976; LXXXV: 262–268. doi:10.1093/mind/LXXXV.338.262.
Sperberg-McQueen, C. M., and Claus Huitfeldt. Containment and dominance in Goddag structures. Talk given at Conference on Text Technology, Bielefeld, March 2008. Forthcoming.
Raymond, Darrell, Frank Wm. Tompa and Derick Wood. From Data Representation to Data Model: Meta-Semantic Issues in the Evolution of SGML, Computer Standards and Interfaces 18 p. 25-36 (1996). doi:10.1016/0920-5489(96)00033-5.
Sperberg-McQueen, C. M., Claus Huitfeldt and Yves Marcoux. What is transcription? (Part 2). Talk given at Digital Humanities 2009, Maryland, June 2009. Forthcoming.
The TEI Consortium / The Association for Computers and the Humanities (ACH); The Association for Computational Linguistics (ACL); The Association for Literary and Linguistic Computing (ALLC). TEI P4: Guidelines for Electronic Text Encoding and Interchange XML-compatible edition. Ed. C. M. Sperberg-McQueen and Lou Burnard; XML conversion by Syd Bauman, Lou Burnard, Steven DeRose, and Sebastian Rahtz. Oxford, Providence, Charlottesville, Bergen: TEI Consortium, December 2001. http://www.tei-c.org/release/doc/tei-p4-doc/html/
Varzi, Achille. Mereology. Stanford Encyclopedia of Philosophy. http://plato.stanford.edu/entries/mereology/ First published Tue May 13, 2003; substantive revision Thu May 14, 2009.