One Document Does-it-all (ODD): a language for documentation, schema generation, and customization from the Text Encoding Initiative

Raffaele Viglianti

Research Programmer

Maryland Institute for Technology in the Humanities (MITH) at the University of Maryland

Copyright ©2019 by the author. Used with permission.

Preliminary Proceedings

Symposium on Markup Vocabulary Customization
July 29, 2019

Introduction

The Text Encoding Initiative (TEI) began as an international research project in 1987, with the goal of creating guidelines for the representation of texts in digital form. While these guidelines are still its main focus today, the TEI has since evolved into a non-profit consortium with numerous members and an elected body of individuals (the Technical Council) who maintain and expand the Guidelines in response to the needs of the community. TEI’s broad mission of representing “text” has resonated in particular with the academic community, libraries, and cultural heritage institutions, who have widely applied the TEI—and consequently shaped it—as an instrument for online research, teaching, and preservation.

In 2007, the TEI released version P5 of the Guidelines and with it introduced a complete revision of ODD, or One Document Does-it-all, the system for its documentation, schema generation, and customization.[1] This system applies literate programming principles to keep the documentation, grammar, and constraint rules of the TEI together in the same TEI XML document. To achieve this, the large documentation text of the Guidelines is encoded in TEI and is peppered with references to formal declarations of elements, attributes, modules, and classes. These formal declarations are themselves expressed using TEI elements, which allows the TEI’s processing tools to generate both human-readable documentation and schemas in a variety of formats.[2] These elements, described in Chapter 22 of the Guidelines,[3] can be used both to define and to customize the various components of the TEI; this allows users to define and document customizations, and to generate human-readable and machine-readable output.

Customizing the TEI is an essential step in the creation of a TEI project: the specification is very large and using it all at once is discouraged. Indeed, the TEI offers customization “exemplars” to users, including TEI Lite, TEI for Manuscript Description, and jTEI (a customization for articles for the Journal of the TEI). Researchers using the TEI are encouraged (most often via workshops and other training sessions) to let their research questions drive their customization design and to create a subset that most closely addresses their needs. Besides selecting a subset, customization in ODD allows encoders to add constraints (such as limiting open attribute values) and to introduce extensive prose documentation tightly coupled with formal declarations.

What is in an ODD

Because ODD is used to both define and customize a markup vocabulary, it takes two ODD files to tango: one to define the vocabulary (e.g. the whole of the TEI) and one to customize it (e.g. TEI Lite). Both kinds of operation are performed with the same element set, but a @mode attribute determines whether something is being added, changed only in part, replaced, or explicitly removed. The absence of @mode means that something new is being declared. The following subsections introduce what is in an ODD by way of a brief introduction to some of these elements.[4]

Documentation

As a TEI document, an ODD can contain extensive prose describing either a new markup language or a customization. This human-readable documentation is typically contained within the TEI <text> element, and can make use of the standard TEI elements including those for divisions, headings, paragraphs, and snippets of computer code.

A schema specification (or more)

Specification and customization elements are contained by <schemaSpec>, on which a number of top-level options can be set, such as the schema name, language, namespace, and the possible root or outermost elements.

<schemaSpec ident="myTEI" start="TEI" ns="http://www.tei-c.org/ns/1.0">
  <!-- specification and customization elements go here -->
</schemaSpec>
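
Taken together, the two preceding subsections suggest the overall shape of an ODD file. The following is only a minimal sketch: the title, the header text, and the prose are placeholders invented for illustration, and the identifier myTEI is simply reused from the example above.

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>A hypothetical project customization</title>
      </titleStmt>
      <publicationStmt>
        <p>Unpublished example</p>
      </publicationStmt>
      <sourceDesc>
        <p>Born digital</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div>
        <head>Encoding guidelines</head>
        <!-- human-readable documentation, written with ordinary TEI elements -->
        <p>Our project records textual divisions with <code>div</code>.</p>
        <!-- formal declarations live alongside the prose -->
        <schemaSpec ident="myTEI" start="TEI">
          <!-- specification and customization elements go here -->
        </schemaSpec>
      </div>
    </body>
  </text>
</TEI>
```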

Specifications: a brief overview

The specifications introduced below are defined and referenced using ODD elements that share a similar structure. First, the names of these elements are formed by a term plus “Spec”, such as <elementSpec> or <classSpec>. Elements that express references to these specifications end in “Ref”, such as <elementRef> or <classRef>. Other shared features include the @ident attribute to indicate the name of the object being specified or referred to, and documentation elements for providing descriptions and usage examples.

<*Spec ident="name">
  <gloss>An expansion of the name, if necessary</gloss>
  <desc>A description of this specification</desc>

  <!-- definitions depending on the type of specification -->

  <exemplum>
    <!-- Examples of usage -->
  </exemplum>
  <remarks>
    <!--Any further notes or comments about this specification-->
  </remarks>
</*Spec>

Modules

A module provides a name for a set of formal declarations. Other specifications use that name to declare their membership in the module, and in that module alone: a specification can belong to only one module.

<moduleSpec ident="namesdates">
  <desc>Additional elements for names and dates</desc>
</moduleSpec>

Modules are rarely changed, but a customization ODD will use <moduleRef> to indicate which of them are to be included in the customization. This element is also equipped with attributes to exclude or include element members; for example, the following includes the whole “namesdates” module, but without <event> and <listEvent>:

<moduleRef key="namesdates" except="event listEvent" />
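
The inverse selection is made with @include. As a minimal sketch, assuming a project that needs only personal and place names from the same module:

```xml
<!-- include only <persName> and <placeName> from the namesdates module -->
<moduleRef key="namesdates" include="persName placeName" />
```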

This element can also be used to bring in external schemata if necessary.

<moduleRef url="svg.rng" />

A TEI customization needs four modules to be functional: tei, core, header, and textstructure. TEI’s ODD processor does not enforce the presence of these modules, however, and it is left to the user to make sure they are included.[5]
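
A bare-bones customization that satisfies this requirement might therefore look like the following sketch, in which the identifier myMinimalTEI is a made-up name:

```xml
<schemaSpec ident="myMinimalTEI" start="TEI">
  <!-- the four modules a TEI customization needs to be functional -->
  <moduleRef key="tei"/>
  <moduleRef key="core"/>
  <moduleRef key="header"/>
  <moduleRef key="textstructure"/>
</schemaSpec>
```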

Model Classes

Model classes work similarly to modules, but only accept memberships from element declarations (elements may be members of multiple model classes). A model class can be referenced from within a content model, allowing all members of the model class to appear at that point in the content model. When the class is referenced, one can indicate cardinality and the order (alternation or sequence) of any or all members of the class (see Elements below).

<classSpec module="tei" type="model" ident="model.segLike">
  <desc>groups elements used for arbitrary segmentation.</desc>
  <classes>
    <memberOf key="model.phrase"/>
  </classes>
</classSpec>

Customizations may change model classes to fine-tune class dependencies. Here is an example that allows members of the model.segLike class to appear wherever members of the model.addrPart class are allowed.

<classSpec module="tei" type="model" ident="model.segLike" mode="change">
  <classes>
    <memberOf key="model.phrase"/>
    <memberOf key="model.addrPart"/>
  </classes>
</classSpec>

Attribute Classes

Attribute classes declare and provide documentation for a set of attributes. Elements and other attribute classes can inherit from them.

<classSpec module="verse" type="atts" ident="att.enjamb">
  <attList>
    <attDef ident="enjamb" usage="opt">
      <desc>indicates whether the end of a verse line is marked by enjambement.</desc>
      <datatype>
        <dataRef key="teidata.enumerated"/>
      </datatype>
      <valList type="open">
        <valItem ident="no">
          <desc>the line is end-stopped </desc>
        </valItem>
        <valItem ident="yes">
          <desc>the line in question runs on into the next </desc>
        </valItem>
        <valItem ident="weak">
          <desc>the line is weakly enjambed </desc>
        </valItem>
        <valItem ident="strong">
          <desc>the line is strongly enjambed</desc>
        </valItem>
      </valList>
    </attDef>
  </attList>
</classSpec>

Customizations may adjust dependencies on other attribute classes and will often update and constrain attribute values. The example below makes the @enjamb attribute required (by default it is optional), changes its values to a particular preferred terminology, and closes the list of values, thus disallowing any value outside that terminology. All of these changes are well within the original specification of the TEI, which is quite permissive, but this case supposes a situation where a mandatory and stricter version of @enjamb is required by a text encoding project. Note how children of <classSpec> that do not need to change (such as <desc>) are not included, and note the use of @mode="replace" to override the attribute declaration.

<classSpec module="verse" type="atts" ident="att.enjamb" mode="change">
  <attList>
    <attDef ident="enjamb" usage="req" mode="replace">
      <valList type="closed">
        <valItem ident="endstop">
          <desc>the line is end-stopped</desc>
        </valItem>
        <valItem ident="light">
          <desc>the line is lightly enjambed</desc>
        </valItem>
        <valItem ident="heavy">
          <desc>the line is heavily enjambed</desc>
        </valItem>
      </valList>
    </attDef>
  </attList>
</classSpec>

Elements

The definition of elements includes their memberships to modules and classes, attributes, and a content model declaration.

<elementSpec module="tagdocs" ident="code">
  <desc>contains literal code</desc>
  <classes>
    <memberOf key="model.emphLike"/>
  </classes>
  <content>
    <textNode/>
  </content>
  <attList>
    <attDef ident="type" usage="opt">
      <desc>the language of the code</desc>
      <datatype>
        <dataRef key="teidata.enumerated"/>
      </datatype>
    </attDef>
  </attList>
</elementSpec>

Content models can be defined using RELAX NG, or (preferably) using dedicated ODD elements. There are a number of features available to organize the content model, such as <alternate> and <sequence> to determine how the referenced elements can be combined; and @minOccurs and @maxOccurs attributes to set cardinality. Note that each specification element (<moduleSpec>, <classSpec>, <elementSpec>) has corresponding reference elements (<moduleRef>, <classRef>, <elementRef>).

<content>
  <alternate>
    <classRef key="model.pLike" maxOccurs="unbounded"/>
    <sequence>
      <elementRef key="summary" minOccurs="0" maxOccurs="1"/>
      <elementRef key="msItem" maxOccurs="unbounded"/>
    </sequence>
  </alternate>
</content>

Model classes group elements by membership and typically do not impose a specific order. When referenced, however, the @expand attribute can be used to override this behavior. For example, the following content model boils down to ( p*, ab* ) rather than the usual ( p | ab ).

<content>
  <classRef key="model.pLike" expand="sequenceOptionalRepeatable" />
</content>
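
The other values of @expand follow the same pattern. The sketch below lists them as they are described in the TEI Guidelines, again assuming for brevity that model.pLike contains only <p> and <ab>; the reference documentation for <classRef> remains the authority on their exact meaning.

```xml
<!-- no @expand, or expand="alternate": ( p | ab ) -->
<classRef key="model.pLike"/>
<!-- each member exactly once, in order: ( p, ab ) -->
<classRef key="model.pLike" expand="sequence"/>
<!-- each member at most once, in order: ( p?, ab? ) -->
<classRef key="model.pLike" expand="sequenceOptional"/>
<!-- each member at least once, in order: ( p+, ab+ ) -->
<classRef key="model.pLike" expand="sequenceRepeatable"/>
```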

In a customization, including or removing an element is typically done when selecting a module via the @include and @except attributes on <moduleRef>. However, these operations can also be performed explicitly inside a <schemaSpec>: <elementRef> adds a single element, while <elementSpec> in delete mode removes one (<elementRef> itself carries no @mode attribute). For example, to add the element <msItem> (manuscript item) without including the manuscript description module:

<elementRef key="msItem" />

Or to remove the <p> (paragraph) element without removing the core module (without which TEI would make little sense):

<elementSpec ident="p" mode="delete" />

More minute changes to elements are quite common in a TEI customization and they will range from adjusting the description, to adjusting attribute values, to class memberships. A typical operation would be constraining attributes, for example the @type attribute on the <div> (textual division) element. The @type attribute is derived from <div>’s membership in the attribute class att.typed. Note the use of @mode="replace" to override the declaration of the @type attribute inherited from att.typed.

<elementSpec ident="div" mode="change">
  <attList>
    <attDef ident="type" mode="replace">
      <valList type="closed">
        <valItem ident="chapter"/>
        <valItem ident="section"/>
      </valList>
    </attDef>
  </attList>
</elementSpec>

Entirely new elements can be added as well, though when customizing TEI, the Guidelines require that new elements and attributes be added under a new namespace. Membership in classes will determine where the element can go; for example, model.phrase groups “inline” elements, so a new inline element can simply declare its membership in that class.

<elementSpec ident="opus" ns="myTEI.example.org" mode="add">
  <desc>The opus number or "work number" that is assigned to a musical composition</desc>
  <classes>
    <memberOf key="model.phrase"/>
    <memberOf key="att.global"/>
  </classes>
  <content>
    <textNode/>
  </content>
</elementSpec>

Likewise, a “block” element could be part of the same model class as <div> (model.divLike) or as <p> (model.pLike). When the new element is meant to be a child of another specific element, the parent element’s content model will need to be changed. For example, this is how the new element <opus>, presuming it is not a member of model.phrase, could be added to TEI’s <title> element only.

<elementSpec ident="title" mode="change">
  <content>
    <alternate minOccurs="0" maxOccurs="unbounded">
      <macroRef key="macro.paraContent"/>
      <elementRef key="opus" />
    </alternate>
  </content>
</elementSpec>

Datatypes

Datatypes for attributes and other string content can be specified and used by multiple declarations. W3C XML datatypes can be referred to directly by name and need not be redefined.

<dataSpec ident="teidata.pointer">
  <desc>defines the range of attribute values used to provide a single URI, absolute or relative,
    pointing to some other resource, either within the current document or elsewhere.</desc>
  <content>
    <dataRef name="anyURI"/>
  </content>
</dataSpec>

When referenced, datatypes can be restricted to match a given regular expression.

<!-- a fraction: -->
<dataRef name="token" restriction="(\-?[\d]+/\-?[\d]+)"/>

Datatypes can be changed by customizations, though it is more common to add or change restrictions or introduce entirely new datatypes.
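
Introducing a new datatype could look like the following sketch, which wraps the restriction shown above in a reusable declaration. The names mydata.fraction and ratio are hypothetical and are not part of the TEI.

```xml
<!-- a new, project-specific datatype (hypothetical name) -->
<dataSpec ident="mydata.fraction" mode="add">
  <desc>a fraction written as two integers separated by a slash</desc>
  <content>
    <dataRef name="token" restriction="(\-?[\d]+/\-?[\d]+)"/>
  </content>
</dataSpec>

<!-- elsewhere, an attribute definition refers to the new datatype by key -->
<attDef ident="ratio" usage="opt">
  <desc>a proportion expressed as a fraction</desc>
  <datatype>
    <dataRef key="mydata.fraction"/>
  </datatype>
</attDef>
```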

Macros

Macros are used to declare predefined strings or patterns. Content models can be defined here just like they are in elements.

<macroSpec module="tei" ident="macro.paraContent">
  <content>    
    <alternate minOccurs="0" maxOccurs="unbounded">
      <textNode/>
      <classRef key="model.gLike"/>
      <classRef key="model.phrase"/>
      <classRef key="model.inter"/>
      <classRef key="model.global"/>
      <elementRef key="lg"/>
      <classRef key="model.lLike"/>
    </alternate>    
  </content>
</macroSpec>

Customizations may consider introducing new macros, or adding new classes and elements to existing macros.

Constraints

Other formal constraints can be documented and specified within the <constraintSpec> element. These can be placed within other specification elements, or elsewhere in the documentation text. The TEI source uses Schematron to express constraints, for example:

<constraintSpec ident="activemutual" scheme="schematron">
  <constraint>
    <s:report test="@active and @mutual">Only one of the
      attributes @active and @mutual may be supplied</s:report>
  </constraint>
</constraintSpec>

Expressing constraints is a powerful tool for building customizations, particularly when multiple encoders will be working with the schema. Schematron in particular offers many levels of reporting for catching encoding errors and offering suggestions to encoders.
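
Constraints need not only forbid combinations; they can also require something. The following sketch, with a made-up identifier, places a constraint inside an element specification so that every <div> must carry a @type attribute. It assumes, as in the example above, that the s: prefix is bound to the Schematron namespace, and that a constraint placed inside an <elementSpec> is evaluated in the context of that element.

```xml
<elementSpec ident="div" mode="change">
  <!-- divRequiresType is a hypothetical identifier -->
  <constraintSpec ident="divRequiresType" scheme="schematron">
    <constraint>
      <s:assert test="@type">A division must indicate its kind
        using the @type attribute</s:assert>
    </constraint>
  </constraintSpec>
</elementSpec>
```

An assertion of this kind complements the closed value list shown earlier for @type on <div>: the value list restricts what the attribute may say, while the constraint makes sure it is present.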

More literate programming

A truly literate programming ODD will couple prose with specifications, yet the elements introduced so far are children of <schemaSpec>, which is somewhat divorced from the <text> element containing the bulk of the human-readable documentation. It is possible, nonetheless, to refer from the prose to specifications in the <schemaSpec>, which a processor will expand when generating documentation. The TEI Guidelines use this mechanism, which simplifies the maintenance of specifications organized into multiple XML files.

<listRef>
  <ptr target="#ID_OF_SPEC_ELEMENT"/>
</listRef>

This is hardly a tight coupling of prose and specification, but it works well for a complex ecosystem such as the TEI Guidelines. It is also possible, however, to do just the opposite: break up specifications into groups to be included within the documentation prose. References within <schemaSpec> can then take care of telling the processor how to put everything back together. The more recent TEI customization for “Simple Print” documents employs this strategy.[6]

The following (abridged) bit of prose describes the selection of elements from the TEI header for the Simple Print customization:

<div>
  <p>A subset of 45 elements is selected from the TEI header module. In addition, <!-- etc. --></p>
  <specGrp xml:id="header">
    <moduleRef key="header" include="abstract availability biblFull catDesc etc" />
    <moduleRef key="corpus" include="particDesc settingDesc"/>
  </specGrp>
</div>

Elsewhere in the document, the <schemaSpec> points back to this and other <specGrp> elements.

<schemaSpec ident="teisimpleprint" start="TEI teiCorpus">
  <specGrpRef target="#base"/>
  <specGrpRef target="#header"/>
  <!-- etc. -->
</schemaSpec>

Processing ODDs: zapping, sourcing, chaining

To generate documentation and schemata, a processor will merge together the source ODD and the customization ODD, resulting in a compiled document containing everything that the customization selected from the source, plus the instructions to perform the additions and changes required. This first step is an opportunity to drop anything that is not needed, which makes it possible to write fairly lean customizations. For example, the TEI analysis module has among its members the global attribute class att.global.analytic. In turn, this class is a member of the att.global class, which is referenced by every single element in the TEI. When a customization excludes the analysis module, att.global.analytic will also be dropped without needing to change att.global explicitly. Similarly, when selected classes or elements end up not being referenced anywhere else in the compiled ODD, they get “zapped” to avoid unreferenced declarations in the resulting schemata and to exclude unnecessary documentation from the human-readable output.

In a typical TEI customization, only the customization ODD is supplied by the user and the processor obtains the source ODD for the latest release of TEI P5 before compilation. While an altogether different source can be passed to the processor, the user can also indicate in the customization file that certain specifications should be obtained from specific sources. <schemaSpec> and other reference elements (e.g. <moduleRef>, <elementRef>) can use the @source attribute to point the processor to a different ODD to look for that specification. The TEI Guidelines specify a private URI (tei:x.y.z) to be able to refer to specifications from older versions of the TEI. Because @source can point to any ODD via a URI, it is possible to “chain” ODDs by customizing an existing customization. This example from the TEI Guidelines[7] shows how to extend the customization “TEI Bare”, which doesn’t include <q> (quote), with <q> from version 3.0.0 of the TEI.

<schemaSpec ident="Bare-plus" source="tei_bare.compiled.odd" start="TEI">
  <moduleRef key="tei"/>
  <moduleRef key="header"/>
  <moduleRef key="core" include="p list item label head author title"/>
  <elementRef key="q" source="tei:3.0.0"/>
  <moduleRef key="textstructure"/>
</schemaSpec>

Tools

The only existing ODD processor is a set of XSLT scripts maintained by the TEI.[8] Besides obtaining and running these scripts directly, there are a number of ways to process ODD to generate documentation and schemata.

  • Command line: the TEI Stylesheets repository on GitHub includes a number of scripts to perform transformations. The script bin/teitorelaxng compiles ODDs and transforms them into RELAX NG. It allows the user to set a non-TEI source ODD as well as a number of other options. Other scripts such as bin/teitohtml5 can generate documentation from a compiled ODD.

  • Oxygen XML editor: the TEI Oxygen plugin includes the TEI Stylesheets and routines for generating documentation and schemata from TEI ODD customizations.

  • OxGarage: this online service at https://oxgarage.tei-c.org/ provides both a graphical interface and an API to the TEI Stylesheets. It can compile ODDs and generate documentation and schemata in a number of formats.

Additionally, the TEI has created Roma (https://roma.tei-c.org/), an online tool to create customizations via a user interface, which also interfaces with the TEI Stylesheets to generate documentation and schemata. The interface does not cover the full expressiveness of ODD, but it supports users with less schema design expertise. An entirely new version of Roma is currently in beta (https://romabeta.tei-c.org/). Besides a complete rewrite of the interface, the new version takes advantage of the OxGarage API for processing ODD and covers a wider range of customization operations.

ODD for TEI interchange and beyond

ODD plays an important role in data interchange within the TEI ecosystem. Because TEI is a large and highly adaptable format, TEI-encoded documents can look quite different from one another. When carefully crafted, ODD customizations become the key to facilitating TEI interchange, because they contain human-readable documentation as well as a formal description of how a schema differs from the whole TEI specification.[9]

Finally, because ODD can be used both to express and to customize a markup vocabulary, it has been adopted outside of the TEI. The most notable case is the Music Encoding Initiative,[10] a markup language for representing music notation, targeted at library and musicological research, that shares many of the documentation and customization principles and needs of the TEI. It uses ODD for its source and customizations and provides an online transformation service at http://customization.music-encoding.org/, which applies the TEI Stylesheets to process ODD and generate schemata. ODD has also been used for the definition of the Internationalization Tag Set (ITS);[11] and “various standards proposal designed within ISO committee TC 37 have been totally or partially written in TEI/ODD: MLIF, MAF, ISO 16642 rev., ISOTimeML”.[12] New applications of ODD continue to emerge, including Martin Holmes’ proposed use for HTML (Holmes 2018).

Acknowledgements

My thanks to Syd Bauman for his extensive feedback on this piece and to the organizers of the pre-conference Symposium on Markup Vocabulary Customization for inviting me to talk about TEI ODD.

References

[Bauman 2011] Bauman, Syd. 2011. Interchange vs. Interoperability, in Proceedings of Balisage: The Markup Conference 2011. Balisage Series on Markup Technologies, vol. 7 (2011). doi:https://doi.org/10.4242/BalisageVol7.Bauman01

[Bauman 2017] Bauman, Syd. 2017. tei_customization: A TEI customization for writing TEI customizations (paper), in Proceedings of the Text Encoding Initiative Conference and Members Meeting, Victoria, British Columbia, Canada, November 11 - 15 2017. https://hcmc.uvic.ca/tei2017/abstracts/t_110_bauman_teicustomization.html

[Burnard 2000] Burnard, Lou. 2000. Text Encoding for Interchange: a new Consortium, Ariadne, 24. http://www.ariadne.ac.uk/issue/24/tei/

[Burnard and Rahtz 2000] Burnard, Lou and Sebastian Rahtz. 2000. Relax NG with Son of ODD, in Proceedings of Extreme Markup Languages 2000. https://ora.ox.ac.uk/objects/pubs:394056

[Cummings 2008] Cummings, James. 2008. The text encoding initiative and the study of literature, A Companion to Digital Literary Studies, ed. Susan Schreibman and Ray Siemens. Oxford: Blackwell, 2008. http://www.digitalhumanities.org/companion/view?docId=blackwell/9781405148641/9781405148641.xml&chunk.id=ss1-6-6

[Holmes 2018] Holmes, Martin. 2018. Using ODD for HTML, in Proceedings of the Text Encoding Initiative Conference and Members Meeting, Tokyo, Japan, September 9 - 13 2018. Pages 240 - 241. https://tei2018.dhii.asia/AbstractsBook_TEI_0907.pdf

[Vanhoutte 2004] Vanhoutte, Edward. 2004. An Introduction to the TEI and the TEI Consortium, Literary and Linguistic Computing, Volume 19, Issue 1, April 2004, Pages 9–16. doi:https://doi.org/10.1093/llc/19.1.9

[Wittern et al. 2009] Wittern, Christian, Arianna Ciula, Conal Tuohy. 2009. The making of TEI P5, Literary and Linguistic Computing, Volume 24, Issue 3, September 2009, Pages 281–296. doi:https://doi.org/10.1093/llc/fqp017



[1] The story of the TEI has been told several times in writing (Burnard 2000, Vanhoutte 2004, Cummings 2008, Wittern et al. 2009, to name a few). Burnard and Rahtz 2000 explain how the idea behind ODD first originated with Lou Burnard and Michael Sperberg-McQueen in 1998, yet the transition to P5 (concluded in 2007) determined most of the modern shape of ODD that this paper will introduce.

[2] RELAX NG, DTD, and Schematron are generated directly, but for XML Schemas, the current processing takes a shortcut by converting RELAX NG to XSD using Trang.

[4] This overview is meant to showcase the capabilities of ODD for creating customizations and it is not intended to be a comprehensive documentation of the language or a tutorial. Please refer to Chapter 22 of the TEI Guidelines for a more comprehensive description of TEI ODD.

[5] Nonetheless, Bauman 2017 has created a customization that helps enforce TEI-specific requirements when using ODD to create TEI customizations.

[9] On this topic, see also Bauman 2011 on “Interchange vs. Interoperability”.

[12] From the TEI Wiki page on ODD: https://wiki.tei-c.org/index.php/ODD.

Author's keywords for this paper: markup ecosystems; interchange; interoperable; interoperational; TEI; TEI P5; ODD; One Document Does-it-all; customization; markup customization; validation