How to cite this paper

DeRose, Steven J. “The structure of content.” Presented at International Symposium on Quality Assurance and Quality Control in XML, Montréal, Canada, August 6, 2012. In Proceedings of the International Symposium on Quality Assurance and Quality Control in XML. Balisage Series on Markup Technologies, vol. 9 (2012).

International Symposium on Quality Assurance and Quality Control in XML
August 6, 2012

Balisage Paper: The structure of content

Steven J. DeRose

Director of R&D


Steve DeRose has been working with electronic document systems since joining Andries van Dam's FRESS project in 1979. He holds degrees in Computer Science and in Linguistics and a Ph.D. in Computational Linguistics from Brown University. His development of fast, accurate part-of-speech tagging methods for English and Greek corpora helped touch off the shift from heuristic to statistical methods in computational linguistics.

He co-founded Electronic Book Technologies to build the first SGML browser and retrieval system, "DynaText", and has been deeply involved in document standards including XML, TEI, HyTime, HTML 4, XPath, XPointer, EAD, Open eBook, OSIS, NLM and others. He has served as Chief Scientist of Brown University's Scholarly Technology Group and Adjunct Associate Professor of Computer Science. He has written many papers, two books, and eleven patents. Most recently he joined OpenAmplify, a text analytics company that does very high-volume analysis of texts, mostly from social media.

Copyright © 2012 by the author. Used with permission.


Text analytics involves extracting features of meaning from natural language texts and making them explicit, much as markup does. It uses linguistics, AI, and statistical methods to get at a level of "meaning" that markup generally does not: down in the leaves of what to XML may be unanalyzed "content". This suggests potential for new kinds of error, consistency, and quality checking. However, text analytics can also discover features that markup is used for; this suggests that text analytics can also contribute to the markup process itself.

Perhaps the simplest example of text analytics' potential for checking is xml:lang. Language identification is well-developed technology, and xml:lang attributes "in the wild" could be much improved. More interestingly, the distribution of named entities (people, places, organizations, etc.), topics, and emphasis interacts closely with documents' markup structures. Summaries, abstracts, conclusions, and the like all have distinctive features which can be measured.

This paper provides an overview of how text analytics works, what it can do, and how that relates to the things we typically mark up in XML. It also discusses the trade-offs and decisions involved in just what we choose to mark up, and how that interacts with automation. It presents several specific ways that text analytics can help create, check, and enhance XML components, and exemplifies some cases using a high-volume analytics tool.

Table of Contents

Tedious and brief: Language vs. Markup?
Text Analytics to the rescue?
Statistical methods
Intuitive methods
Text Analytics' Relation to Markup
How shall we find the concord of this discord?


The basic purpose of markup is to make the implicit structure of texts explicit; the same is true of "text analytics" (or "TA"), a fast-growing application area involving the automated extraction of (some) meaning from documents. In this paper I will try to place markup and analytics in a larger frame, and discuss their relationship to each other and to the notion of explicit vs. implicit information.

To begin, note that 'implicit' and 'explicit' are matters of degree. Language itself is an abstract system for thought, entirely implicit in the relevant sense here. The words and structures of natural language make thought more explicit, fixing it in the tangible medium of sound (of course, some people cannot use that medium, but still use language). Writing systems take us another step: particularly in their less familiar, non-phonetic features: sentence-initial capitals, quotation marks, ellipses, etc. all make linguistic phenomena explicit, and it has long been noted that punctuation (not to mention layout) are kinds of "markup"[Coom87], albeit often ambiguous. In a yet broader sense, the "s" on the end of a plural noun can be considered "markup" for a semantic feature we call "plural", and a space is a convenient abbreviation for <word>. XML markup thus fills a "metalinguistic" role analogous to punctuation, though (ideally) richer and less ambiguous.

'Documents' is also an imprecise term. Here I am focusing only on the more traditional sense: human-readable discourses, in a (human) language, with typically near-total ordering. This doesn't preclude including pictures, data tables, graphs, video, models, etc.; but I am for now excluding documents that are mainly for machine consumption, or are in a specialized "language" such as for mathematics, vector graphics, database tables, etc.

What, then, can text analytics do for XML? What are some phenomena in texts that we might make explicit, or check, using text analytics methods? How do these mesh with how our linguistic and markup systems are organized? An anonymous reviewer raised the valuable question of where text analytics can be effective for auto-tagging material, which in turn poses an old problem in a new guise: If you can use a program to find instances of some textual feature, why mark it up?

Tedious and brief: Language vs. Markup?

To explore these questions we may arrange reality into successive levels of abstraction and explicitness. This is similar to the typical organization of layers in natural language processing systems.

  • Semantic structure: implicit (in that we rarely "see" semantic structure directly); presumably relatively unambiguous.

    When representing semantics, we tend to use artificial, formal languages such as predicate calculus, or model the real world with semantic structures such as Montague's[Mont73], involving Entities (people, places, things); Relationships (A does V to B with C near D because R...); and Truth values (characterizing, say, the meaning of a sentence).

  • Syntactic structure: mostly implicit (but with some explicit markers); ambiguous. "The six armed men held up the bank" is a classic example of syntactic and lexical ambiguity.

    Typical syntactic units include Morphemes -> Words -> Phrases -> Clauses -> Sentences -> Discourse units (such as requests, questions, and arguments), etc. At this level we begin to see some explicit markers. In speech they include timing, tone contours, and much more; in writing punctuation and whitespace often mark these kinds of units.

    A full analysis of speech must also deal with many features we tend to ignore with written language: incomplete or flatly incorrect grammar; back-tracking; background noise that obscures the "text" much as coffee stains obscure manuscripts; even aphasic speech; but we'll ignore most of that here.

  • Orthographic structure: explicit (by definition), but still ambiguous. Characters in character sets need not map trivially to characters as linguistic units (ligatures, CJK unification, etc.). More significantly, orthographic choices can communicate sociolinguistic messages: Hiragana vs. Katakana; italics for foreign words; even the choice of fonts. Case distinctions often indicate part of speech. And a given spelling may represent different words: "refuse", "tear", "lead", "dove", "wind", "console", "axes"[het1, het2]. These are "heteronyms"; true "homonyms" (which also share pronunciation), such as "bat", are rarer. Both are commonly misused, making them obvious candidates for automatic checking.

    Most English punctuation marks also have multiple uses. Ignoring numeric uses, periods may end sentences, fall in the midst of them, or mark abbreviations. A colon can mark a phrase, clause, or sentence boundary. The lowly apostrophe, introduced to English in the 16th century, can open or close a quotation; mark a possessive; mark a contraction, as in "fo'c's'le"; or sometimes mark a plural, as in P's and Q's. Punctuation is, I think, underappreciated in both markup and text analytics.

    All these levels involve both markup-like and linguistic-like features, so it seems clear that there is much potential for synergy. At the same time, at each level the nature of errors to be detected and of markup to be automated differs, and so the applications of text analytics must be considered pragmatically.

  • Markup structure: explicit and unambiguous. Markup is rarely considered in natural-language processing, even though it has obvious implications. Consider tags such as <del> and <ins>, <abbr>, as well as tags that change appropriate syntactic expectations: <q>, <heading>, <code>, and many others. We may hope that never anything can be amiss, when simpleness and duty tender it. But markup directly affects the applicability and techniques of NLP.

In large-scale text projects, the creation of high-quality markup is one of the largest expenses, whether measured by time, money, or perhaps even frustration. How do we choose what to mark up? Can text analytics shift the cost/benefit calculus, so that we can (a) mark up useful features we couldn't before, and (b) assure the quality of the features we have marked up?

Text Analytics to the rescue?

Text analytics tries to extract specific features of meaning from natural language text (i.e., documents). It mixes linguistics, statistical and machine learning, AI, and the like, to do a task very much like the task of marking up text: searching through documents and trying to categorize meaningful pieces. Like markup, this can be done by humans (whose consensus, when achievable, is deemed the "gold standard" against which algorithms are measured), or by algorithms of many types.

Text analytics today seeks very specific features; largely ones that are saleable; it does not typically aim at a complete analysis of linguistic or pragmatic meaning. Among the features TA systems typically seek are:

  • Language identification

  • Genre categorization: Is this reportage, advocacy, forum discussion, a review, an ad, spam, ...?

  • Characteristics of the author: gender, age, education level,....

  • Topic disambiguation: Flying rodents vs. baseball implements; non-acids vs. other baseball implements; misspelled photographs vs. baseball players vs. carafes; and so on.

  • Topic weight: Is the most important topic here baseball, or sports, or the Acme Mark-12 jet-assisted fielding glove's tendency to overheat?

  • Sentiment: Does the author like topic X?

  • Intentions: Is the user intent on buying/selling/getting help?

  • Times and events: Is the text reporting or announcing some event? What kind? When?

  • "Style" or "tone" measures: Decisiveness, flamboyance, partisan flavor, etc.

Text analytics is related to many more traditional techniques, such as text summarization, topical search, and the like. However, it tends to be more focused on detecting very specific features, and on operating over very large collections such as Twitter feeds, Facebook posts, and the like. The OpenAmplify[oa] service, for example, regularly analyzes several million documents per day.

A natural thing for a text analysis system to do, is to mark up parts of documents with what features they were found to express: this "he" and that "our renowned Duke" are references to the "Theseus" mentioned at the beginning; this paragraph is focused on fuel shortages and the author appears quite angry about them; and so on. Of course, some features only emerge from the text as a whole: the text may be about space exploration, even though that topic is never mentioned by name.

Users of text analytics commonly want to process large collections of short social media texts, searching for particular kinds of mentions. For example, Acme Mustardseed Company may want to know whenever someone posts about them, and what attitude is expressed. Beyond mere positive vs. negative, knowing that some posters are advocates or detractors (advising others to buy or avoid), can facilitate better responses. Other companies may want to route emails to tech support vs. sales, or measure the response to a marketing campaign, or find out what specific features of their product are popular.

Text analytics algorithms are varied and sometimes complex, but there are two overall approaches:

One method is almost purely statistical: gather a collection of texts that have been human-rated for some new feature, and then search for the best combination of features for picking out the cases you want. This "Machine Learning" approach allows for fast bootstrapping, even for multiple languages, and is sometimes very accurate. However, since it does not take much account of language structure, complex syntax, negation, and almost any kind of subtlety may trip it up. It's also very hard to fix manually -- if you go tweaking the statistically-derived parameters, unintended consequences show up.

The other general method is heuristic or intuitive: linguists analyze the feature in question, and find vocabulary, syntactic patterns, and so on that express it. Then you run a pattern-matching engine over documents. The plusses and minuses of this approach are opposite those of machine learning: It is much more time- and expertise-intensive; if the analysts are good the results can be amazing. But it's hard to do this for 100 languages. When problems crop up, the linguists can add new and better patterns, add information to the lexicon, etc.

As Klavans and Resnik point out[Kla96], it can be very effective to combine these approaches. One way to do that is to use statistical methods as a discovery tool, facilitating and checking experts' intuitions.

With either approach or a combination, you end up with an algorithm, or more abstractly a function, that takes a text and returns some rating of how likely the text is to exhibit the phenomenon you're looking for. For the trivial

Statistical methods

These methods, as noted, involve combining many features to characterize the phenomenon sought. Usually the features are very simple, so they can be detected reliably and very quickly: frequencies of words or specific function-words, character-pair frequencies, sentence and word lengths, frequency of various punctuation, and so on.

The researcher may choose some features to try based on intuition, but it is also common simply to throw in a wide variety of features, and see what works. The features just listed turn out to be commonly useful as well as convenient.

Taking again the simple example of language identification, one might guess that the overall frequencies of individual characters would suffice. To test this hypothesis, one collects from hundreds to perhaps a million texts, categorized (in this example) by language. Say, all the English texts in one folder, and all the French texts in another. Generating such a "corpus" is usually the critical-path item: somehow you must sort the texts out by the very phenomenon you don't yet have a tool for. This can be done by the researcher (or more likely their assistants), by Mechanical Turk users, or perhaps by a weak automatic classifier with post-checking; sometimes an appropriate corpus can be found rather than constructed de novo. Constructing an annotated corpus has much in common with the sometimes difficult and expensive task of doing XML markup, particularly in a new domain where you must also develop and refine schemas and best-practice tagging guidelines and conventions.

Given such a "gold standard" corpus, software runs over each text to calculate the features (in this example by counting characters). This produces a list of numbers for each text. The mathematically-inclined reader will have noticed that such a list is a vector, or a position in space -- only the space may have hundreds or thousands of dimensions, not merely 11. For ease of intuition, imagine using only 3 features, such as the frequencies of "k" and "?" and a flag for whether the text is from Twitter. Usually, frequency measures are normalized by text length, so that texts of widely varying lengths are comparable; and binary features such as being from Twitter, are treated as 0 or 1.

Each document's list of 3 numbers equates to a location in normal space: (x, y, z). It is easy to calculate the distance between any two such points, and this is a measure of "distance" or similarity between the documents those two points represent. The "as the crow flies" (Euclidean) distance is often used, but there are other useful measures such as the "Manhattan distance", the "cosine distance", and others.
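As a minimal sketch of the three-feature example above, the following Python computes a feature vector per text and three of the distance measures just mentioned. The function names and the two sample sentences are invented for illustration; they are not taken from any particular analytics tool.

```python
import math
from collections import Counter

def features(text, from_twitter):
    """Return (frequency of 'k', frequency of '?', Twitter flag) for one text.
    Frequencies are normalized by text length; the flag is 0 or 1."""
    counts = Counter(text.lower())
    n = max(len(text), 1)  # avoid dividing by zero on empty input
    return (counts["k"] / n, counts["?"] / n, 1.0 if from_twitter else 0.0)

def euclidean(a, b):
    """The 'as the crow flies' distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of the absolute differences along each dimension."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 minus the cosine of the angle between the vectors.
    Returns 1.0 if either vector is all zeros (an arbitrary convention)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0

english = features("Did you kick the black bucket?", from_twitter=True)
french = features("Le chat dort sur la chaise.", from_twitter=False)
print(euclidean(english, french), manhattan(english, french))
```

The same functions work unchanged for vectors with hundreds of dimensions, since they only iterate over paired coordinates.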

Software such as WEKA[WEKA] runs through all the vectors, and tries to find the best combination of features to look at, in order to correctly assign documents to the target categories. This is called "training". Of these 3 features, the frequency of "k" is typically much higher in English texts than French texts, while the other 2 features don't contribute much (that is, they are not distinctive for this purpose). Such software typically produces a "weight" for each feature for each language: in this case "k" frequency would have a substantial positive weight for English, and negative for French.

Intuitively, this training allows you to discard irrelevant features, and to weight the relevant features so as to maximize the number of texts that will be accurately categorized. The real proof comes when you try out the results on texts that were not part of the original training corpus, and you discover whether the training texts were really representative or not. With too small a corpus or with features that really aren't appropriate, you may get seemingly good results from training that don't actually work in the end.

This example is a bit too simple -- as it happens, counting sequences of 2 or 3 letters characterizes specific languages far better than single letters. Abramson[Abr63] (pp. 33-38) presents text generated randomly, but in accordance with tables of such counts (a simple form of "Markov model") based on various languages. For example, if the last two letters generated were "mo", the next letter would be chosen according to how often each letter occurs following "mo". Abramson's randomly generated texts illustrate how well even so trivial a model distinguishes languages:

  • (1) jou mouplas de monnernaissains deme us vreh bre tu de toucheur dimmere lles mar elame re a ver il douvents so

  • (2) bet ereiner sommeit sinach gan turhatt er aum wie best alliender taussichelle laufurcht er bleindeseit uber konn

  • (3) rama de lla el guia imo sus condias su e uncondadado dea mare to buerbalia nue y herarsin de se sus suparoceda

  • (4) et ligercum siteci libemus acerelin te vicaescerum pe non sum minus uterne ut in arion popomin se inquenque ira

These texts are clearly identifiable as to the language on whose probabilities each was based.
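The generation procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny training sample and the function names are invented here, and a real table would be built from a large corpus in each language, as Abramson's were.

```python
import random
from collections import Counter, defaultdict

def train(text, order=2):
    """Count which character follows each `order`-character context."""
    table = defaultdict(Counter)
    for i in range(len(text) - order):
        table[text[i:i + order]][text[i + order]] += 1
    return table

def generate(table, seed, length, rng):
    """Extend `seed` one character at a time, choosing each next character
    in proportion to how often it followed the current context in training.
    `seed` must be as long as the `order` used in train()."""
    order = len(seed)
    out = seed
    while len(out) < length:
        followers = table.get(out[-order:])
        if not followers:          # context never seen in training: stop early
            break
        letters = list(followers)
        weights = [followers[c] for c in letters]
        out += rng.choices(letters, weights=weights)[0]
    return out

# Invented miniature "corpus"; far too small for realistic output.
sample = ("the course of true love never did run smooth "
          "and the lunatic the lover and the poet are of imagination all compact ")
table = train(sample)
print(generate(table, "th", 60, random.Random(0)))
```

Every three-letter sequence in the output occurs somewhere in the training text, which is why output trained on a given language keeps that language's characteristic look.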

Given some set of available features, there are many specific statistical methods that programs like WEKA can use to derive an effective model. Among them are Support Vector Machines (SVM), simulated annealing, Bayesian modeling, and a variety of more traditional statistics such as regressions. The first of these is graphically intuitive: the program tries to position a plane in space, with all (or as many as possible) of the French texts falling on one side, and the English texts on the other. Some methods (such as SVM) can only learn "pairwise" distinctions (such as "English or not" and "French or not"); others can distinguish multiple categories at once (English or French or Spanish...). Sometimes a degree of certainty or confidence can also be assigned.
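To make the "plane in space" picture concrete, here is a minimal sketch using a perceptron, which is a much simpler method than an SVM: it finds some separating plane rather than the best-placed one. It is not how WEKA works internally. The feature values are invented, chosen only so that the two groups are cleanly separable.

```python
def train_perceptron(points, labels, epochs=100):
    """Find weights w and bias b so that the sign of (w . x + b) matches
    each label (+1 or -1). Converges only if the data are separable."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:                  # point is on the wrong side
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:                       # every point correctly placed
            break
    return w, b

def classify(w, b, x):
    """Return +1 or -1 according to which side of the plane x falls on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Invented vectors: (percent "k", percent accented vowels).
# +1 = English, -1 = French.
points = [(0.8, 0.0), (0.7, 0.1), (0.9, 0.0), (0.1, 2.0), (0.0, 1.8), (0.1, 2.5)]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_perceptron(points, labels)
print([classify(w, b, p) for p in points])
```

On this toy data the learned weight for "k" frequency comes out positive and the weight for accented vowels negative, matching the intuition about per-feature weights given earlier.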

These methods often work quite well, and (given a training corpus) can be tried out rapidly. If the method fails, adding new features lets you try again. Programs can typically manage hundreds to a few thousand features. This seems generous, but remember that if you want to track the frequencies of all English words, or even all letter-triples, you'll quickly exceed that, so some restraint is necessary.

On the other hand, statistical methods are not very intuitive. Sometimes the results are obvious: the frequency of accented vowels is vastly higher in French than English. But often it is hard to see why some combination of features works, and this can make it hard to "trust" the resulting categorizer. This may be reminiscent of the methods used for norming some psychological tests, by asking countless almost random questions of many people, and seeing which ones correlate with various diagnoses. This can work very well, and is hard to accuse of bias; on the other hand, if it stops working in a slightly larger sample space, that may be hard to notice or repair.

Intuitive methods

A more "traditional" approach to developing text analytics systems is for experts to articulate how they would go about categorizing documents or detecting the desired phenomena and then implementing those methods. This is usually done as an iterative process, running the implementation against a corpus much as with statistical methods and then studying the results and refining.

For example, a linguist might articulate a set of patterns to use to detect names of educational institutions in text. One might be: "any sequence of capitalized words, the first or last of which is one of 'College', 'University', 'Institute', 'Seminary', or 'Academy'". This rule can be easily tried on some text, and two checks can be done:

1: checking what it finds, in order to discover false positives such as "College Station, TX"; and

2: checking what is left over (perhaps just series of capitalized words in the leftovers), in order to discover false negatives such as "State University of New York", "College of the Ozarks".

The rules are then refined, for example to allow uncapitalized words like "of", "at", and "the"; and to allow "State", "City", and "National" preceding the already-listed possible starter words. Extending rules by searching for words that occur in similar contexts also helps.

Eventually this process should produce a pretty good set of rules, although in natural language the rules will often have to include lists of idiosyncratic cases: "Harvard", "Columbia", "McGill", "Brown" (with the difficulty of sorting out uses as surnames, colors, place names, and the like).
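One possible rendering of the refined rule as a regular expression is sketched below. The decomposition into a keyword-first and a keyword-last pattern, and the choice to require a linking word after a leading keyword, are assumptions made for this illustration; the rules an actual linguist arrives at would be shaped by the corpus.

```python
import re

KEY = r"(?:College|University|Institute|Seminary|Academy)"
CAP = r"[A-Z][A-Za-z]+"                      # one capitalized word
LINK = r"(?:of|at|the)"                      # uncapitalized words allowed inside
PREFIX = r"(?:(?:State|City|National)\s+)?"  # optional word before the keyword

# Keyword first, e.g. "College of the Ozarks", "State University of New York".
# Requiring a linking word here is what rejects "College Station".
KEY_FIRST = rf"{PREFIX}{KEY}(?:\s+{LINK})+(?:\s+{CAP})+"
# Keyword last, e.g. "Brown University".
KEY_LAST = rf"(?:{CAP}\s+)+{KEY}"

PATTERN = re.compile(rf"\b(?:{KEY_FIRST}|{KEY_LAST})\b")

def find_institutions(text):
    """Return every substring that the two patterns accept, in order."""
    return [m.group(0) for m in PATTERN.finditer(text)]

text = ("She left Brown University for the State University of New York, "
        "then taught at College of the Ozarks near College Station, TX.")
print(find_institutions(text))
# ['Brown University', 'State University of New York', 'College of the Ozarks']
```

Bare names such as "Harvard" or "McGill" are still missed, which is where the lists of idiosyncratic cases come in.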

Intuition-based methods tend to be far better when larger phenomena matter. For example, the ever-popular "Sentiment" calculation is highly sensitive to negation, and natural languages have a huge variety of ways to negate things. Besides the many ways overt negatives like "not" can be used, there are antonyms, sarcasm, condemnation by faint praise, and countless other techniques. Statistical methods are unlikely to work well for negation, in part because "not" or other signals of negation may occur quite a distance from what they modify; just having "not" in a sentence tells you little about what specifically is being negated. "There's seldom a paucity of evidence that could preclude not overlooking an ersatz case of negative polarity failing to be missing."

Intuitive approaches have the advantage of being more easily explained and more easily enhanced when shortcomings arise. But building them requires human rather than machine effort (especially difficult if one needs to support many languages).

Intuitive methods have the added cost and benefit, that experts commonly refer to high-level, abstract linguistic notions. In more realistic cases than the last example, a linguist might want to create rules involving parts of speech, clause boundaries, active vs. passive sentences, and so on. To do this requires a bit of recursion: how do you detect *those* phenomena in the first place? That requires some tool that can identify those constructs, and make them available as features for the next "level".

"Shallow parsing" is well understood and readily available (such as via [Gate] and [NLTK]), and can provide many of those features. "Shallow parsers" identify parts of speech (using far more than the traditional 8 distinctions), as well as relatively small components such as noun phrases, simple predicates, and so on, often using grammars constructed from rules broadly similar to those in XML schemas. Shallow parsing draws the line about at clauses: subordinate clauses as in "She went to the bank that advertised most" are very complex in the general case, and attaching them correctly to the major clause they occur in even more so.

The results of shallow parsing are usually strongly hierarchical, since they exclude many of the complex long-distance phenomena in language (such as relationships between non-adjacent nouns, pronouns, clauses, etc.). Because of this, XML is commonly used to represent natural-language parser output. However, this is separate from potential uses of text analytics on general XML. An example of the structures produced by a shallow parser:

          <at base="the">The</at>
          <nn base="course">course</nn>
          <in base="of">of</in>
             <jj base="true">true</jj>
             <nn base="love">love</nn>
          <rb base="never">never</rb>
          <vb base="do">did</vb>
          <vb base="run">run</vb>
       <jj base="smooth">smooth</jj>
          <dl base=".">.</dl>

Many current text analytics systems use a lexicon, part of speech tagging, and shallow parsing to extract such linguistic structures. While far from a complete analysis of linguistic structure, these features permit much more sophisticated evaluation of "what's going on" than strictly word-level features such as described earlier. For example, knowing whether a topic showed up as a subject vs. an object or a mere modifier is invaluable for estimating its importance. Knowing whether a given action is past or future, part of a question, or subject to connectives such as "if" or "but" (not to mention negation!) also has obvious value. Having a handle on sentence structure also helps reveal what is said about a given topic, when a topic is referred to indirectly (by pronouns, generic labels like "the actor", etc.).


Users of text analytics systems always ask "how accurate is it?" Unlike XML validation, this is not a simple yes/no question, and so using TA in creating or checking markup is a probabilistic matter. As it turns out, even a "percent correct" measure is often a misleading oversimplification. In the interest of brevity, I'll give a few examples of the problems with a unitary notion of "accuracy", using Sentiment as an example (since it is perhaps the most common TA measure in practice):

  • Let's say a system categorizes texts as having "Positive" or "Negative" sentiment (leaving aside the precise definition of "Sentiment"), and gets the "right" answer for 70% of the test documents. The first key question is how the desired answer came to be considered "right" in the first place. Normally, accuracy is measured against human ratings on a set of texts. Yet if one asks several people to rate a particular text, they only agree with each other about 70% of the time[1]. If texts only have one rater, 30% of the texts probably have debatable ratings. If one throws out all the cases where people disagree, that unfairly inflates the computer's score because all the hard/borderline cases are gone. Treating the humans' ratings like votes, if the computer agrees with the consensus of 3 humans 70% of the time, is it 70% accurate, or 100% as good as a human? If the algorithm does even better, say 80%, what does that even mean? In considering applications to XML, it would be interesting to know how closely human readers agree about the matters TA might be called on to evaluate.

  • In practice, Sentiment requires a third category: "Neutral". A corpus that is 80% neutral, 5% negative, and 15% positive is typical. That means a program can beat the last example merely by always returning "Neutral": that's 80% accurate, right? This illustrates the crucial distinction of precision versus recall: this strategy perfectly recalls (finds) 100% of the neutral cases; but it's not very precise: 1/5 of the texts it calls "Neutral" are wrong. In addition, it has 0% recall for positives and negatives (which are much more important to find for most purposes).

  • At some level Sentiment is typically calculated as a real number, not just three categories; say, ranging from -1.0 to +1.0. How close to 0.0 counts as Neutral? That decision is trivial to adjust, and may make a huge difference in the precision and recall (and note that the Neutral zone doesn't have to be symmetrical around 0).

  • For systems that generate quantities instead of just categories, should a near miss be considered the same as a totally wild result? Depending on the user's goals and capabilities, a close answer might (or might not) still be useful. Using TA methods to evaluate XML document portions is a likely case of this: Although most abstracts are likely to be broadly positive, only severe negativity would likely warrant a warning; similarly, so long as an abstract is reasonably stylistically "abstract-ish", it's probably ok (but not if it looks "bibliography-ish").

  • Many documents longer than a tweet have a mixture of positive, neutral, and negative sentiments expressed toward their topics. Does a long document that's scrupulously neutral deserve the same sentiment as one that describes two very strong opposing views? They may average out the same, but that's not very revealing.

  • A subtler problem is that rating a document's sentiment toward a given topic depends on rightly identifying the topic. What if the topic isn't identified right? Is "Ms. Gaga" the same as "Lady Gaga"? Is the topic economics, the national debt, or fiscal policy? Some systems avoid this problem by reporting only an "overall" sentiment instead of sentiment towards each specific topic, but that greatly exacerbates the previous problem.

  • Users' goals may dictate very different notions of what counts as positive and negative. For market analysts or reporters, a big change in a company's stock price may be "just data": good or bad for the company, but just a fact to the analyst. Political debate obviously has similar issues.
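The always-"Neutral" strategy and the adjustable neutral zone from the list above can be worked through numerically. The sketch below builds a 100-text corpus with the proportions given there; the function names and the default thresholds of -0.2 and 0.2 are arbitrary choices for illustration.

```python
def precision_recall(gold, predicted, label):
    """Precision and recall for one label, given parallel lists of ratings.
    Precision is reported as 0.0 when the label is never predicted
    (strictly it is undefined in that case)."""
    true_pos = sum(1 for g, p in zip(gold, predicted) if g == p == label)
    predicted_count = sum(1 for p in predicted if p == label)
    gold_count = sum(1 for g in gold if g == label)
    precision = true_pos / predicted_count if predicted_count else 0.0
    recall = true_pos / gold_count if gold_count else 0.0
    return precision, recall

def to_category(score, low=-0.2, high=0.2):
    """Map a sentiment score in [-1, 1] to a category. The neutral zone
    [low, high] is adjustable and need not be symmetrical around 0."""
    if score < low:
        return "Negative"
    if score > high:
        return "Positive"
    return "Neutral"

# 80% neutral, 5% negative, 15% positive, as in the example above.
gold = ["Neutral"] * 80 + ["Negative"] * 5 + ["Positive"] * 15
always_neutral = ["Neutral"] * len(gold)

accuracy = sum(g == p for g, p in zip(gold, always_neutral)) / len(gold)
print(accuracy)                                            # 0.8
print(precision_recall(gold, always_neutral, "Neutral"))   # (0.8, 1.0)
print(precision_recall(gold, always_neutral, "Positive"))  # (0.0, 0.0)
```

Narrowing the zone, for instance calling `to_category(0.1, high=0.05)`, turns a borderline score from "Neutral" into "Positive", which is how a small threshold change can shift precision and recall substantially.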

Understanding those general methods and caveats, it's fair to generalize that text analysis systems typically detect features with 70-80% accuracy, although some features are far easier than others. Language identification is far more accurate; sarcasm detection, far less. This means that such systems work best when there are many texts to process -- the errors will generally come out in the statistical wash.

Text Analytics' Relation to Markup

Markup has multiple purposes; among them are

  • Disambiguating structure (e.g., famous OED italics)

  • Controlling layout and other processing

  • Identifying things to search on

Markup makes aspects of document structure explicit. In principle, any phenomenon that text analytics can identify, can then be marked up, to a corresponding level of accuracy. Exactly the same analytics can be used in checking: If a text is already marked up for feature X, we need only run an auto-tagger for X and compare. This simultaneously gives feedback on the text-analytic output's accuracy, and the prior markup's.

When two sources of data are available like this, they can be used to check each other. In addition, the degree of overlap in what is "found" by each source enables estimating the number of cases not found by either. A simple statistic called the "Lincoln Index", originating in species population surveys, provides this estimate. In the same way, text analytics can be used to do XML markup de novo, or as a direct quality check on existing markup.
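As a sketch of the arithmetic: if one source finds n1 instances, the other finds n2, and m instances are found by both, the Lincoln Index estimates the true total as n1 × n2 / m. The estimate assumes the two sources find instances independently of each other, which human taggers and an auto-tagger may only approximate. The numbers below are invented.

```python
def lincoln_index(found_by_a, found_by_b):
    """Estimate the total number of instances, and how many both sources
    missed, from the overlap between two independent sets of findings."""
    a, b = set(found_by_a), set(found_by_b)
    overlap = len(a & b)
    if overlap == 0:
        raise ValueError("no overlap: the estimate is undefined")
    estimated_total = len(a) * len(b) / overlap
    missed_by_both = estimated_total - len(a | b)
    return estimated_total, missed_by_both

# Hypothetical identifiers of the elements each source tagged with feature X.
human_tagged = set(range(0, 80))     # 80 instances
auto_tagged = set(range(20, 110))    # 90 instances, 60 shared with the humans
print(lincoln_index(human_tagged, auto_tagged))   # (120.0, 10.0)
```

Here the two sources together found 110 distinct instances, so the estimate suggests about 10 more that neither caught.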

Such comparative analysis may be one of the most useful applications of TA to XML evaluation. In a text project where markup is not straightforward, how can one evaluate how well the taggers are doing? Say a literary project is marking up subjective features such as allusions, or sometimes-unclear features such as who the speaker is for each beat in dialog. TA methods can learn how to characterize such features, and then be run against the human-tagged texts. Disagreement can trigger a re-check, thus saving time versus checking everything.

There seems to be an implicit "sweet spot" for markup use. We don't mark up sufficiently obvious phenomena, such as word boundaries (except in special cases). Given that almost every kind of processing needs to know where they are, why not? Probably because finding word boundaries seems trivial in English.[2] Yet word boundaries are also unlikely to be marked up in Asian languages, where identifying them is far from trivial. Thus, simplicity can't be the whole story. Perhaps it is that consciously or not we assume that most any downstream software will do this by itself, so there would be no return on even a small investment in explicit markup.

Language-use has better ROI, for example enabling language-specific spell-checking or indexing. Downstream software is perhaps less likely to "just handle it." Nevertheless, it is not very common to see xml:lang more than once in a document except in special cases such as bilingual dictionaries, diglot literature editions, and the like.

TA systems can certainly add (or check) word-boundary and language-use markup, and the most common related attributes, such as part of speech and lemma. Such markup is perhaps of limited value except in special applications, such as text projects in Classical languages or that contend with paleographic issues.

Marking up small-scope, very specific semantics such as emphasis and the role of particular nouns is a traditionally awkward matter in markup. Some schemas merely provide tags for italic, bold, and the like; using less font-oriented tags such as <emph> is considered a step up, but often accomplishes little more than moving "italic" from element to attribute. If more meaningful small-scale elements are not available, conventions such as RDF[RDF] and microformats[micro] make it feasible to represent the results of text analytics or even linguistic parsing in ways accessible to XML systems.

DocBook[docb] provides many more meaningful items at this level: guibutton, keycap, menuchoice, etc. As with other fairly concrete features already described and as an anonymous referee pointed out, text analytics could be used to tag (or check) many such distinctions automatically: "ENTER" is going to be a keycap, not a guiItem; in other cases nearby text such as "press the ___ key" can help, as can context such as being in a section titled "List of commands". This seems entirely tractable for analytics, and could have significant value because such markup is valuable for downstream processing (particularly search), but tedious for humans to create or check, and therefore error-prone.
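A rough sketch of how such a check might look, using only the two cues mentioned above (a known key name, and the context "press the ___ key"). The key-name list, the function names, and the dictionary form of the existing markup are assumptions made for illustration; a real system would use a fuller catalog and genuine linguistic context rather than a single regular expression.

```python
import re

# Hypothetical list of key names; a real project would use a fuller catalog.
KEY_NAMES = {"ENTER", "ESC", "TAB", "SHIFT", "CTRL", "ALT", "DELETE"}

# Context cue from the text: "press the ___ key".
PRESS_KEY = re.compile(r"\bpress(?:ing)? the (\w+) key\b", re.IGNORECASE)


def suggest_keycaps(sentence):
    """Return the set of words in `sentence` that look like keycaps."""
    candidates = {m.group(1) for m in PRESS_KEY.finditer(sentence)}
    candidates |= {
        w for w in re.findall(r"\b[A-Z]{2,}\b", sentence) if w in KEY_NAMES
    }
    return candidates


def check_markup(sentence, tagged):
    """Report suggested keycaps whose existing tag (or lack of one) disagrees.

    `tagged` maps a word to the element name it currently carries.
    """
    return {
        word: tagged.get(word)
        for word in suggest_keycaps(sentence)
        if tagged.get(word) != "keycap"
    }


sentence = "Press the Home key, then hit ENTER."
suggested = suggest_keycaps(sentence)
flagged = check_markup(sentence, {"Home": "keycap", "ENTER": "guibutton"})
```

Here `flagged` holds only "ENTER", which the hypothetical prior markup had tagged as guibutton; such disagreements would be queued for human review rather than corrected automatically.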

However, even this level of markup can get subtle. In Making Hypermedia Work[DeRo94] David Durand and I decided to distinguish SGML element names from HyTime architectural form names in the text, because the distinction is crucial but, at least to the new user, subtle. We also decided, in many examples, to name elements the same as the architectural form they represented. In most cases there was only one element of a given form under discussion; and because elements are concrete while forms are abstract, one cannot easily reify the latter without the former. In most cases this was trivial; but in a few cases the decision seemed impossible. Examining those cases via TA methods would likely reveal much about the distinction we were trying to achieve, as well as no doubt reveal markup errors.

Bibliography entries are notoriously troublesome, whether in XML or any other representation. They have many sub-parts, with complex rules for which parts can co-occur; the order (not to mention formatting) of the parts varies considerably from one publisher to another; and there are special cases that are difficult given the usual rules. PubMedCentral receives an extraordinary number of XML articles, often in a standard XML Schema[NCBI]; but usage varies significantly even in valid data, and the code to manage bibliographic entries in the face of such variability is substantial. Many publishers opt for the "free" style, in which most schema constraints are relaxed, and recovering the meaningful parts of entries is a task worthy of AI.

At a higher or at least larger level, many schemas are heavy on tags for idiosyncratic components with much linguistic content, which also have distinctive textual features: Bibliography, Abstract, Preface, etc. For example, a Preface will likely use much future tense, while a Prior Work section will use past. Text analytics can find and quantify such patterns, and then find and report outliers, which might show up due to tagging errors, or to writing which, whether through error or wisdom, does not fit the usual patterns.

This provides a particularly promising area for applying text analytics. Although the titles of such sections differ, and there may or may not be distinct tags for each kind, a text analytics system could learn the distinctive form of such components, and then evaluate how consistent the tagging and/or content of corresponding sections in other documents are.

We usually mark up things that are necessary for layout; the ROI is often obvious and quick. But it takes a lot of dedication, sophistication, and money to, say, disambiguate the many distinct uses of italics, bold, and other typography in the Oxford English Dictionary[Tom91], or to characterize the implicitly-expected style for major components such as those in front and back matter. Many of the implicit data described earlier can be detected using text analytic methods, but using this to assist the markup process has been little explored.

How shall we find the concord of this discord?

If you can find it reliably via some algorithm, why mark it up? In a sense, creating markup via algorithms is kind of like the old saw about Artificial Intelligence: "as soon as you know how to do it, it's not AI anymore." If text analytics (or any other technology) could completely reliably detect some feature we used to mark up, we might stop marking it up. But in reality, neither humans nor algorithms are entirely reliable for marking up items of interest. The probability that the two will err in quite different ways, means there is synergy to be had.

Anyone who has tried to write a program to turn punctuated quotes into marked-up quote elements has discovered that there are many difficult cases, at a variety of levels: choice of character set and encoding, international differences in style, nested and/or continued quotations in dialog, alternate uses of apostrophe and double quote, and even quotations whose scope crosses tree boundaries[see DeRo04]. Would we bother marking quotations if the punctuation were unambiguous, or if we had widespread text-analytics solutions that could always identify quotations for formatting, search, and other common needs?

Typical XML schemas define some very common specific items: Bibliography, Table of Contents, Preface; and some common generic items: Chapter, Appendix, etc. But (perhaps pragmatically) we don't enumerate the many score front and back matter sections listed in the Manual of Style, or the additional ones that show up in countless special cases -- at some point we just say "front-matter section, type=NAME" and quit. Worse, we sometimes cannot choose the "correct" markup: whether we are the original author or a later redactor, we may simply be unable to say whether to mark up "Elizabeth" as <paramour> or <tourist> in "Elizabeth went to Essex. She had always liked Essex."[TEI P3]

The short response to these issues, I think, is that markup is always a tradeoff; there are levels we make explicit, and levels we don't. Perhaps it cannot be otherwise. Intuitively, it seems that at least for authors many of the choices should always be clear; and to that extent text analytics can also find many of these phenomena. So why does the principle not work, that an author knows when component X is needed, and so should have an easier time just naming X, than carrying out commands to achieve a look that others will (hopefully) process back to X?[Coom87]

I think it is because people's interaction with language (not via language) is largely unconscious. We rarely think "the next preposition is important, so I'm going to say it louder and slower"; we don't think "I've said everything that relates to that topic sentence, so it's time for a new paragraph"; nor even "'Dr' is an abbreviation, so I need a period". Expertise has been defined as the reduction of more and more actions to unconsciousness -- that's how we (almost always) walk and/or chew gum. Our understanding of language is often similarly tacit. As Dreyfus and Dreyfus put it[Drey05, p. 788], "If one asks an expert for the rules he or she is using, one will, in effect, force the expert to regress to the level of a beginner and state the rules learned in school. Thus, instead of using rules he or she no longer remembers, as the knowledge engineers suppose, the expert is forced to remember rules he or she no longer uses."

The act of markup, whether automated or manual, seems similar: we know a paragraph (or emphasis, or lacuna) when we see it, just as we know an obstacle on the sidewalk when we see it; but neither often makes it to consciousness.

Text analytics and markup are very similar tasks, though they tend to identify different things; it is rare for (say) a literary text project to mark up sentiment in novels, while it is equally rare for text analytics to identify emphasis (although emphasis might contribute to other features, such as topic weight).

Perhaps the most obvious place to start, beyond simple things like language-identification, is checking whether existing markup "makes sense", at a higher level of abstraction than XML schema languages -- a level closely involving the language rather than the text. The usual XML schema languages do little or nothing with text-content; with DTDs one can strip out *all* text content and the validity status of a document cannot change. With a text analytics system in place, however, it is possible to run tests related to the actual meaning of content. For example:

  • After finding the topics of each paragraph, one can estimate the cohesiveness of sections and chapters, as well as check that section titles at least mention the most relevant topics. Comparison between the topics found in an abstract, and the topics found in the remainder of an article, could be quite revealing.

  • The style for a technical journal might require a Conclusions section (which might want to be very topically similar to the abstract), and a future work section that should be written in the future tense. Similarly, a Methods section should probably come out low on subjectivity. In fiction, measuring stylistic features of each character's speech could reveal either mis-attributed speeches, or inconsistencies in how a given character is presented.

  • The distribution of specific topics can also be valuable: Perhaps a definition should accompany the *first* use of a given topic -- this is relatively easy to check, and a good text analytic system will not often be fooled by a prior reference not being identical (for example, plural vs. singular references), or by similar phrases that don't actually involve the same topic.

  • One important task in text analytics is identification of "Named Entities": Is this name a person, organization, location, etc? Many XML schemas are rich in elements whose content should be certain kinds of named entities: <author>, <editor>, <copyright-holder>, <person>, <place>, and many more. These can easily be checked by many TA systems. Since TA systems typically use large catalogs of entities, marked-up texts can also contribute to TA systems as entity sources.
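The last of these checks can be sketched briefly. The example below walks an XML document and reports elements whose content does not match the kind of named entity the element name implies. The tiny `CATALOG`, the `EXPECTED` mapping from element names to entity types, and the sample document are all hypothetical; a real TA system would replace the dictionary lookup with its own entity recognizer and a far larger catalog.

```python
import xml.etree.ElementTree as ET

# Hypothetical entity catalog; real TA systems use very large ones.
CATALOG = {
    "Steven J. DeRose": "person",
    "Brown University": "organization",
    "Montréal": "location",
}

# Assumed mapping from element name to the entity type it should contain.
EXPECTED = {"author": "person", "editor": "person", "place": "location"}


def check_entities(xml_text):
    """Return (tag, content, found_type) for elements whose content disagrees."""
    root = ET.fromstring(xml_text)
    problems = []
    for elem in root.iter():
        expected = EXPECTED.get(elem.tag)
        if expected is None:
            continue  # No entity expectation for this element.
        content = "".join(elem.itertext()).strip()
        found = CATALOG.get(content, "unknown")
        if found != expected:
            problems.append((elem.tag, content, found))
    return problems


doc = """<article>
  <author>Steven J. DeRose</author>
  <author>Brown University</author>
  <place>Montréal</place>
</article>"""

problems = check_entities(doc)
```

In this sample only the second author element is reported, since its content is cataloged as an organization. Content reported as "unknown" is not necessarily wrong; it may simply be an entity worth adding to the catalog, which is the reverse contribution mentioned above.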

Text analytics is strongest at identifying abstract/conceptual features, when those features are not easily characterized by specific words or phrases, but emerge from larger linguistic context. The most blatant example is the perennial problem with non-language-aware search engines: negation. There are many ways to invert the sense (or sentiment) of a text, some simple but many extremely subtle or complex. Tools that do not analyze syntax, clause roles, and the like can't distinguish texts that mean nearly the opposite of each other. Thus, at all levels from initial composition and markup, through validation and production, to search and retrieval, text analytics can enable new kinds of processes. Perhaps as such technology becomes widespread and is integrated into our daily workflows, it may help us to reach more deeply into the content of our texts.

Little has been published on the use of text analytics in direct relation to markup, although text analytics tools often use XML extensively, particularly for the representation of their results. However, TA has the potential to contribute significantly to our ability to validate exactly those aspects of documents that markup does not help with: namely, what's going on down in the leaves.


[Abr63] Norman Abramson. 1963. Information Theory and Coding. New York: McGraw-Hill.

[Coom87] James H. Coombs, Allen H. Renear, and Steven J. DeRose. 1987. "Markup systems and the future of scholarly text processing." Communications of the ACM 30, 11 (November 1987), 933-947. doi:

[DeRo04] Steven DeRose. 2004. "Markup Overlap: A Review and a Horse." Extreme Markup Languages 2004, Montréal, Québec, August 2-6, 2004.

[DeRo94] Steven DeRose and David Durand. 1994. "Making Hypermedia Work: A User's Guide to HyTime." Boston: Kluwer Academic Publishers. doi:

[Drey05] Hubert L. Dreyfus and Stuart E. Dreyfus. "Peripheral Vision: Expertise in Real World Contexts." Organization studies 26(5): 779-792. doi:

[Gate] Gate: General Architecture for Text Engineering

[het1] "Heteronym Homepage"

[het2] "The Heteronym Page"

[Kla96] Judith Klavans and Philip Resnik. The Balancing Act: Combining Symbolic and Statistical Approaches to Language. MIT Press 1996. 978-0-262-61122-0.

[micro] Microformats (home page)

[Mont73] Richard Montague. 1973. "The Proper Treatment of Quantification in Ordinary English". In: Jaakko Hintikka, Julius Moravcsik, Patrick Suppes (eds.): Approaches to Natural Language. Dordrecht: 221–242. doi:

[NCBI] National Center for Biotechnology Information, National Library of Medicine, National Institutes for Health. "Journal Publishing Tag Set".

[NLTK] NLTK 2.0 documentation: The Natural Language Toolkit.

[docb] OASIS Docbook TC.

[oa] OpenAmplify.

[RDF] Resource Description Framework.

[Stede] Manfred Stede and Arhit Suriyawongkul. "Identifying Logical Structure and Content Structure in Loosely-Structured Documents." In Linguistic Modeling of Information and Markup Languages: Contributions to Language Technology. Andreas Witt and Dieter Metzing, eds., pp. 81-96.

[TEI P3] TEI Guidelines for the Encoding of Machine-Readable Texts. Edition P5.

[Tom91] Frank Wm. Tompa and Darrell R. Raymond. 1991. "Database Design for a Dynamic Dictionary." In (Eds.) Susan Hockey and Nancy Ide. Research in Humanities Computing: Selected Papers from ALLC/ACH Conference, Toronto.

[WEKA] Machine Learning Group at University of Waikato. "Weka 3: Data Mining Software in Java."

[1] Cohen's Kappa statistic is a commonly used measure of inter-rater reliability.

[2] It isn't quite trivial; there are many edge cases such as "$10 million", "AT&T", "@twitter", ":)", H2O, "New York-based", contractions, and some particularly ugly cases where people use hyphen when they mean emdash. Humans don't usually need a truly precise notion of "word", and our writing systems don't provide one.



Author's keywords for this paper:
Text Analytics; Markup Systems; Markup Theory