Graceful Tag Set Extension
Balisage: The Markup Conference 2016
August 2 - 5, 2016
We live in a time of tag set extensions.
There was a time when organizations planning a conversion to XML, or planning to move a
new document type to XML, assumed that the process would involve creating a tag set for that
document type. The costs of creating that new tag set usually included an outside expert to
create and document the tag set, internal subject experts to assist in document
analysis, and programmers to customize editing, database, and formatting tools to work
with the new tag set.
Now, the assumption is that there is an existing public tag set that meets their needs, or meets them closely enough. Most
organizations don’t consider a new bespoke tag set, and some consider the choice of public
tag set so obvious that they don’t waste time exploring other options. Even among
those who do explore their options, the default assumption seems to be that there is a model
they can adopt to meet their needs. Many publishers with older bespoke tag sets convert to a public one.
There are many good reasons to adopt rather than develop from scratch, including:
Cost to Develop and Document
Vocabulary development in real-life complex domains is a multi-year
multi-person project that requires the time and skills of subject matter experts
as well as XML expertise. Costs include identifying not only the
structures and types of information that are key to the expected data usage of this community, but also structures that are common in documents and needed for the applications and publications to be made from these documents.
A group developing a subject-specialized vocabulary in a subject area is
likely to do a better job modeling aspects relating to the subject matter
than normal prose structures — partly because they are more interested in
their own subject matter and partly because modeling common prose structures
is likely to feel like a waste of time to them. We have seen subject matter
experts sigh, turn on their phones, or even leave the room when a lively
discussion of the metadata needed to identify the subject of one of their
reports turned to a discussion of what types of lists they would need in the
prose portions of the same documents.
Also surprising are the cost and time required
to document a vocabulary well enough that tagging and usage will be consistent.
Adoption of a vocabulary out of the box enables a community to avoid all
of these costs. Adoption and adaptation enable the community to spend its
energy (and time and money) modeling only those structures that are unique
to the community and to document only the new or revised structures.
Cost of tool customization
While it is possible to create XML documents using XML editing tools out
of the box, and it is possible to store, search, and retrieve XML documents
using an XML database as it is shipped, neither of these provides an
attractive user experience, especially for people who are not very
comfortable with the syntax of XML. There is a significant investment in customizing
tools to work with XML documents. Some of these customizations are specific to the type of document, but many are
specific to each element, element in context, or element with attribute value.
Users of a new tag set can save a significant amount of time and money if they do not have to tell their editing tool which elements should be displayed to an editor as blocks and which as in-line, which are list items, and what text should be generated on display. Similarly, a lot of set-up time can be saved if they do not have to tell their database which elements contain non-textual material (such as TeX) and which should be considered higher value for search result ranking (perhaps titles and table column heads).
Cost of formatting and display development
Technically, it is also possible to format an XML document for human
consumption without customizing the formatting software, but it is unlikely
that the documents will be recognizable or useful. One of the major
advantages people hope for when adopting a vocabulary is to be able to use,
or at least start from, existing formatting applications to make common display formats such as HTML and PDF.
Availability of experienced staff and vendors
It is far easier to work in an environment in which one can hire
experienced staff and in which service vendors are familiar with your
requirements. Of course, you could train all of your staff members from
scratch, but that takes time and resources and significantly increases the
loss when they leave. Similarly, if you develop an XML vocabulary from the
bottom up, you will be able to find vendors to create, manage, and host your
documents, but you will have to pay them to learn your vocabulary and needs,
pay them to train their staff, and pay them to customize their tools and
processes. If you adopt an existing vocabulary, you will only have to work with
your staff and vendors on any variations you prefer and teach them about any
customizations you have made.
Pressure from tool vendors, service suppliers, and XML community
XML is rarely created and used strictly in-house any longer. There are numerous
partners who will be involved in creating and using it including tagging vendors, publishing
partners, and aggregators. Using a tag set that is familiar to these partners
simplifies these relationships and may significantly reduce costs and errors because there is less need to explain the XML model and how it is used and less need for exception processing. (Many organizations choose a particular vocabulary because a particular vendor requires it or a particular tool creates or ingests it.)
Adopt and Adapt
There are some situations in which users, and whole user sectors, can adopt an XML
model and use it comfortably. However, in many cases, it is more accurate to describe
the process as “Adopt and Adapt” than simply “Adopt”.
A user who has exactly the situation envisioned when a tag set was developed may well
be able to simply use it. A user who wants to encode their system manuals in XML may find
DocBook works well for them as published, and they gain the added value of being able to
use existing user interface layers on tools and formatting stylesheets.
Similarly, a user who wants to send their journal articles to an archive or document
repository may be required to use JATS (ANSI/NISO Z39.96-2015), and may even be provided with guidelines that specify
which of the JATS tag sets to use and how to use optional features.
A user who wants to participate in an existing data interchange process may be
required to use the tag set used by the existing participants regardless of comfort. For
example, a user who wants to include their poster and pamphlet content in a publication
locator service based on XML-tagged technical reports will have to find a way to tag those posters and
pamphlets using the vocabulary used for the technical reports.
A community that wants to begin interchanging XML documents may find that there is no
existing tag set and community of practice that exactly meets their needs. Some members
of the community may be using XML, but if they have not worked together to develop their
practices it is likely that they have different approaches. Even if the individual
members have adopted public models, they may not have adopted the same public model.
An example of a community that is currently working on developing a shared XML model
for interchange of documents among the participants is the Standards community. For an
excellent overview of this process, see
NISO STS Project Overview and Update [Wheeler et al. 2016]. In this case,
various members of the community already use DocBook-, DITA-, XHTML-, and JATS-based
models, and at least one has done a TEI-based pilot project. None found that any of the public
models met their needs out of the box; all adapted the models they had adopted. This
community is now working to create an interchange tag set that will serve all of their
needs. They are starting with a tag set created by one of the participants (ISO) that
was developed by adopting and adapting JATS [ISO 2016]. This process is,
we believe, typical of the way shared tag sets are being developed now.
Public models have been developed and documented with the assumption that they will be
adapted. NIEM describes itself as a
framework and provides tools for
domains to use to develop
[NIEM 2016]. DITA includes the
Specialization feature, which
enables users to extend the tag set and use DITA processors that are unaware of the
extension [Eberlein et al. 2010]. The Text Encoding Initiative Guidelines describe
clean and unclean modifications and provide a tool for
creating extended TEI-based vocabularies [TEI 2016]. JATS documents how the
tag sets can be modified [NCBI 2015] and provides terminology to identify and distinguish between
JATS-Based and JATS-Conforming extensions [ANSI/NISO 2015].
As users adopt (by choice or fiat) a customizable model and begin to adapt that model
to meet their needs, they are faced with decisions that may have far-reaching
consequences. It is not uncommon for users to make customization decisions early in the
adaptation process that they come to regret later. In some cases, there is considerable
discussion of options, and a choice is made between what are known to be imperfect
options. In other cases, however, the customizers do not even know that they
are creating problems for themselves and their users down the line.
If your adapted tag set is for use in isolation, most of these guidelines are
irrelevant to your project and usage. If you intend to craft or customize tools as needed and are unconcerned
about how your adapted tag set will work with existing tools, others of these guidelines
are irrelevant. If you are going to train all of the people who will create, manage,
use, and archive your documents, others of these guidelines are irrelevant. If you and
your documents are on a technologically isolated deserted island and expect to remain
so, none of this matters to you; do what you want as you want.
Most tag set adapters want documents that use their adopted/adapted tag set to play
nicely with others. They want to be able to store their documents in databases with
documents tagged with the source tag set or other adaptations of it and to be able to
search them all as one coherent collection. They want to be able to use tools such as
editors with customized user interfaces by adding only those features needed for the new
structures in their documents. They want to be able to use formatting and display tools
for the existing documents by adding handling for any new structures (if that; with a
DITA specialization, even that should be unnecessary).
JATS Compatibility Guidelines
We, the authors of this paper, have been inspired by the ways in which JATS is being
extended and occasionally surprised by problems people who have adapted JATS have
reported. We have been drafting a set of Guidelines [Usdin et al. 2016]
for people extending JATS to help them understand which adaptations will integrate
gracefully into existing JATS environments and how to tell if an adaptation might bite
them later. To our surprise, this was not always obvious. Many types of adaptation that
we initially assumed would be problematic seem to be fine, and a few types of changes
that seem innocuous can create significant surprises at late stages of the document life cycle.
The principles articulated in this paper are based on the work done to develop the
JATS Compatibility Guidelines [Usdin et al. 2016], and many of the examples
are taken from JATS and the JATS Compatibility Guidelines. However, readers who intend
to create a JATS-compatible tag set are referred to those Guidelines; this paper is not
a substitute for those Guidelines. We also hope that the JATS work and the thought that went into
creating those Guidelines are more widely applicable.
Things that Must Match to Maintain Compatibility
Respect the Semantics
Starting from first principles, when using or extending a tag set, respect the
semantics of the starting structures. This should be obvious, but an amazing number
of XML users think that they are doing no harm by repurposing an element or
attribute they would not use for the original purpose.
They don’t call it tag abuse, but that is what it is. Sometimes blatant,
sometimes with a story justifying bending the meaning of a structure
for convenience, tag abuse is rarely a good short-term strategy and virtually always
a bad long-term strategy.
Tag abuse is using an element or attribute for content for which it was not
intended. Tags are often abused when users are trying to control display. For example, it is common
to use several empty <p> elements in HTML to produce some blank
space on the screen. There are not several empty logical paragraphs in the document;
this is tag abuse to achieve screen formatting. Similarly, using a block-quote
element to emphasize instructions, making them stand out from the prose around them,
may achieve an acceptable display at the cost of junking up searches for
block-quotes and hiding the content from a search for instructions.
If you need to store the country in which some people live and you don’t use
the phone number element for foreigners you could put their country names in the
phone number element. We have seen this done. So, what happens when you start to
validate phone numbers? Or when you decide that you can make phone calls across
state lines and need a place to put the phone numbers for those people? Can your database
list the countries for all authors? What about
when a formatting engine inserts the usual punctuation for a phone number into those
country names and displays them?
If your starting tag set has an element called <state> for the state
or province, do not create an attribute called @state whose possible values
are states of matter. @state does not technically infringe on
<state>, but it will confuse people. Call your attribute
@state-of-matter or some such.
Sometimes tag abuse happens from a coincidence of names, when a new user does not check the semantics and is misled by a homonym. “Oh,” they think,
“I need an element for what the witness said at the trial and there is a <statement>,” not noticing that
<statement> is defined as a logical proof or hypothesis.
Use the Same Style of Nesting/Recursion for Sections
There are, generally speaking, three styles of modeling nested sections in XML:
In the recursive model, sections contain sections, which can contain sections,
which can contain sections. Display styling of the section headers is based on
analysis of the location of the section in the containment structure, that is, its depth in the section hierarchy.
In the nested-with-explicit-levels model, sections level 1 may contain sections
level 2, which may contain sections level 3, etc.
In the non-nested with explicit levels model, sections level 1 may be followed by
sections level 2 which may be followed by sections level 3, but these may come in
any order and are not nested.
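As a rough illustration of why mixing these styles confuses software, the sketch below derives heading levels in two ways: from nesting depth (the recursive model) and from the element name (the explicit-level models). The element names sec, sec1, and sec2 and both helper functions are assumptions made for this example; they are not taken from any particular tag set.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for the three styles described above.
RECURSIVE = "<doc><sec><title>A</title><sec><title>B</title></sec></sec></doc>"
NESTED_EXPLICIT = "<doc><sec1><title>A</title><sec2><title>B</title></sec2></sec1></doc>"
FLAT_EXPLICIT = "<doc><sec1><title>A</title></sec1><sec2><title>B</title></sec2></doc>"


def levels_by_depth(xml_text):
    """Recursive model: the level is the number of enclosing <sec> elements plus one."""
    found = []

    def walk(elem, depth):
        for child in elem:
            if child.tag == "sec":
                found.append((child.findtext("title"), depth + 1))
                walk(child, depth + 1)
            else:
                walk(child, depth)

    walk(ET.fromstring(xml_text), 0)
    return found


def levels_by_name(xml_text):
    """Explicit-level models: the level is read from the element name (sec1, sec2, ...)."""
    found = []
    for elem in ET.fromstring(xml_text).iter():
        suffix = elem.tag[3:]
        if elem.tag.startswith("sec") and suffix.isdigit():
            found.append((elem.findtext("title"), int(suffix)))
    return found


print(levels_by_depth(RECURSIVE))        # [('A', 1), ('B', 2)]
print(levels_by_name(NESTED_EXPLICIT))   # [('A', 1), ('B', 2)]
print(levels_by_name(FLAT_EXPLICIT))     # [('A', 1), ('B', 2)]
print(levels_by_depth(NESTED_EXPLICIT))  # [] : a depth-based tool finds no sections at all
```

Each function gives sensible answers only for the style it was written for, which is why a tool built for one style may silently misbehave on documents that use another.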
The section logic is fundamental to complex prose documents, and mixing section
logic in the same environment creates the opportunity for significant confusion. People and
software can get very confused if it is not clear whether
sections are nested or not; whether the level of nesting should be computed from the
level of sections in which a section is contained or derived from the name of the
section. Worst of all is a model in which sections that have explicitly named levels
are sometimes nested at other levels. (Yes, this does occur in real documents.)
Maintain Distinction Between Elements and Attributes
In the XML world there are people who argue that the distinction between elements
and attributes is arbitrary and that since it is easy to transform one to the other
using XSLT, vocabulary developers should feel free to use either at any time for any
purpose. If the vocabulary is being developed in a vacuum, this may be so, but if a
new or modified vocabulary is intended to interoperate with another vocabulary, this
is very much not so! While attributes are often used to control display, and their
values may be used either to prompt selection of generated text or be displayed,
their use in display is significantly different from element content. Similarly,
although there are times when element content is not displayed, the default in most
(text-based) applications is that element content is displayed to the reader.
In most databases, attribute values are indexed, searched, and displayed differently
from element content. Also, in most XML editing systems, attribute values are entered and
displayed differently from element content.
If content in the source vocabulary is element content, keep it as element
content. If it is attribute content, keep it as attribute content. Should there be a
need in a new vocabulary to change the form of content in a source vocabulary from
element to attribute or vice versa, we recommend using a different name for the new
structure and documenting the relationship to the content in the source vocabulary.
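A minimal sketch of the difference, assuming a hypothetical year element and year attribute (neither is taken from a specific tag set): a generic text extractor, of the kind a display or indexing tool might use by default, sees element content but never sees attribute values.

```python
import xml.etree.ElementTree as ET

# The same fact recorded two ways (hypothetical markup).
AS_ELEMENT = "<p>Published in <year>2016</year>.</p>"
AS_ATTRIBUTE = '<p year="2016">Published in .</p>'


def visible_text(xml_text):
    """Concatenate all character data in document order; attribute values are not included."""
    return "".join(ET.fromstring(xml_text).itertext())


print(visible_text(AS_ELEMENT))    # Published in 2016.
print(visible_text(AS_ATTRIBUTE))  # Published in .
```

A tool written for the element form would lose the year if an extension moved it into an attribute under the same name.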
In XML, some whitespace is significant and some is insignificant. How whitespace is handled has serious impact on the
ability to re-use tools among documents in a heterogeneous collection. If elements in a tag set
extension do not have the same whitespace handling properties as display tools were developed to expect, there will be
unfortunate (and in some cases surprising) effects on the display of the document content.
Three whitespace handling types are listed below. A compatible tag set extension must not change
the whitespace handling type for any existing element.
Content models that contain only elements (no characters) have insignificant whitespace. That is,
XML tools may create or destroy whitespace in these models with, by definition, no effect on the document, how it is handled, or how it is displayed.
Content models that contain character data or mixed content contain significant whitespace. That is,
XML tools may fold the whitespace (collapse multiple whitespace characters into a single space character),
but they may not create or destroy any whitespace nodes.
Content models defined as preserve whitespace are character or mixed content models where the
whitespace nodes must not be folded. Each whitespace character in the XML must be preserved. Usually
this is used for alignment of code or other layout-sensitive text.
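The three handling types can be sketched as a single normalization function. The type labels "element-only", "mixed", and "preserve" and the function name are our own shorthand for this example, not terms from any specification.

```python
import re

XML_WS = " \t\r\n"  # the four characters XML treats as whitespace


def normalize(text, handling):
    """Apply one of the three whitespace handling types to a text node."""
    if handling == "element-only":
        # Whitespace between child elements is insignificant: discard it.
        return text.strip(XML_WS)
    if handling == "mixed":
        # Fold each run of whitespace to a single space; never remove a run entirely.
        return re.sub(r"[ \t\r\n]+", " ", text)
    if handling == "preserve":
        # Every whitespace character must survive unchanged.
        return text
    raise ValueError(f"unknown handling type: {handling}")


print(repr(normalize("\n    ", "element-only")))  # ''
print(repr(normalize("one  \n two", "mixed")))    # 'one two'
print(repr(normalize("x =  1\n", "preserve")))    # 'x =  1\n'
```

If an extension moves an element from one type to another, a display tool that still applies the old rule will either collapse alignment that mattered or show whitespace that was never meant to be seen.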
Rendering and behavior, especially the rendering and behavior of links, is often dependent on the
ID/IDREF relationship. If an attribute that has a type of ID in the source vocabulary is
changed to any other type, rendering tools may not process the links correctly. Changing an
IDREF to IDREFS or vice versa is not a concern. The number
of pointers will not affect compatibility. Changing the direction of the pointer or obscuring the pointer is the problem.
We have actually seen one instance in which a user reversed the uses of IDs and
IDREFs, creating documents that looked similar to those in the source vocabulary.
The result was chaotic; it turned out that the XSLT that created the HTML version of
these documents relied on the ID/IDREF mechanism MOST of the time, but occasionally
simply treated the attribute values as values. So, SOME of the links worked as
expected and some did not. (On further thought, this is as much the fault of an
inconsistent transformation as a surprising document; all of these links should probably have failed!)
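A simple consistency check can catch this kind of surprise before a formatter does. The sketch below assumes, for illustration only, that the ID-typed attribute is named id and the pointer attribute is named rid; because no DTD is read, the attribute types are taken on trust from those names.

```python
import xml.etree.ElementTree as ET

DOC = """<article>
  <p>See <xref rid="fig1"/> and <xref rid="tab9"/>.</p>
  <fig id="fig1"/>
</article>"""


def dangling_pointers(xml_text, id_attr="id", idref_attr="rid"):
    """Return pointer values that do not match any ID in the document."""
    root = ET.fromstring(xml_text)
    ids = {e.get(id_attr) for e in root.iter() if e.get(id_attr) is not None}
    dangling = []
    for elem in root.iter():
        # An IDREFS-style value is a whitespace-separated list of pointers,
        # so a single IDREF and a list are handled the same way.
        for target in (elem.get(idref_attr) or "").split():
            if target not in ids:
                dangling.append(target)
    return dangling


print(dangling_pointers(DOC))  # ['tab9']
```

Splitting the pointer value on whitespace is what makes the IDREF versus IDREFS distinction harmless here; the direction of the pointer is what the check depends on.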
Alternatives or Media-specific Content
In the world of prose documents, it is assumed that the reader should
have access to all content. However, there are situations in which that is not the
case. For example, it is common to provide several versions of the same graphical
object: one for high resolution or full-screen display, one for display on small
devices such as hand-helds, a thumbnail for navigation, and perhaps a very high
resolution or black & white version for print. In counting the number of figures
in a document, this figure should be counted once, not as many times as there are
media- or use-specific versions, and only the version most appropriate for the display medium should be rendered. Similarly, it is becoming common for journals to
publish author names both in the language and script of the journal and in the
language and script of the author’s home environment. This person should be counted
only once in specifying the number of authors of the paper and, more importantly,
this paper should only count once when calculating the author’s influence.
Any structure in the original vocabulary that is provided to wrap two or more alternative structures must be used in the same way in all compatible vocabularies.
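To make the counting rule concrete, here is a small sketch. It assumes a wrapper element named alternatives and an item element named graphic with a plain href attribute; the sample markup is simplified and hypothetical, and the function is ours.

```python
import xml.etree.ElementTree as ET

DOC = """<article>
  <fig id="f1">
    <alternatives>
      <graphic href="f1-large.png"/>
      <graphic href="f1-thumb.png"/>
      <graphic href="f1-print.tif"/>
    </alternatives>
  </fig>
  <fig id="f2"><graphic href="f2.png"/></fig>
</article>"""


def count_graphics(xml_text, wrapper="alternatives", item="graphic"):
    """Count items, treating all versions inside one wrapper as a single item."""
    root = ET.fromstring(xml_text)
    wrappers = [w for w in root.iter(wrapper) if w.find(item) is not None]
    wrapped = sum(len(w.findall(item)) for w in wrappers)
    total = sum(1 for _ in root.iter(item))
    return total - wrapped + len(wrappers)


naive = sum(1 for _ in ET.fromstring(DOC).iter("graphic"))
print(naive)                # 4
print(count_graphics(DOC))  # 2
```

A tool that relies on the wrapper in this way will miscount as soon as an extension uses the same wrapper for things that are not alternatives of one another.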
Things that Don’t Seem to Matter in Compatible Modeling
In drafting the JATS Compatibility Meta-Model
Description [Usdin et al. 2016] we considered quite a few areas of conformance that, on
further examination, proved to be unnecessary to create document models that were
compatible for our purposes. There are recognizable, classifiable distinctions that just
turn out not to matter for these purposes.
EMPTY Elements versus Content-containing Ones
One obvious element differentiator was EMPTY elements versus those with
#PCDATA, element, or mixed content.
Element content is indeed unique, but data characters, mixed content, and
EMPTY are all the same, since characters
are, by definition, optional in XML. An element with a
#PCDATA model or mixed content may have nothing in it, and will look the same as an
EMPTY element in the document. Thus, the following categories are uninteresting in this context:
Structures that contain character data only
Elements that may not have internal markup. In many tag sets Date may not have internal markup.
Structures that contain character data and phrase-like structures
Paragraph is often allowed to contain character data and phrase-like structures such as Italic, Place Name, or Cross Reference but not allowed to contain larger nesting structures such as lists and figures.
Structures that contain character data, phrase-like structures, and block-like structures
In some tag sets there are structures that may contain character data, phrases, and block-like structures. For example, paragraphs may be allowed to contain lists, boxed text, display equations, block quotes, tables, or figures.
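The observation that an element declared EMPTY and an element that merely happens to have no content look the same in a document can be checked with any parser. In this sketch the element names marker and note are hypothetical; imagine marker declared EMPTY and note declared with a #PCDATA model.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<doc><marker/><note></note></doc>")
marker = root.find("marker")
note = root.find("note")


def has_no_content(elem):
    """True when the element has neither character data nor child elements."""
    return elem.text is None and len(elem) == 0


print(has_no_content(marker), has_no_content(note))  # True True
```

Without consulting the schema, a tool working from the parsed document cannot tell which of the two was declared EMPTY.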
Some structures (whole documents, authors, boxed-text, appendices) may have
metadata and there are other structures that are unlikely to have metadata (italic,
break, address-line). However, on analysis, we found that there are circumstances in
which almost any structure could have metadata (at least an
IDREF that associates this structure with others), and that this does not affect interoperability as we were
looking at it.
In many tag sets, some elements are only used in the metadata of a document (journal in which published)
while others are only used in the narrative text (figure). But in most tag sets there are many elements
that can be used both to describe the document in which they occur and to describe other documents (copyright,
digital identifiers, publication date), so this distinction is not just unimportant, it often changes over time.
Sections and Section-like Structures
It seemed intuitively obvious that an element that had the section structure in
one vocabulary should have a section structure in a compatible vocabulary. That, for
example, if a Boxed-text could contain not only paragraph-like structures but also
nested headed sections in a source vocabulary, it should in any compatible
vocabularies. But since those nested sections are, or could be, optional in the
source vocabulary, documents without them can clearly be handled by the tools and
formatter because we believe that a subset of an element model is always conforming.
Thus, it is not necessary that compatible vocabularies allow nested
sections in all of the places that the source vocabulary does.
Conversely, we considered requiring that nested sections be allowed only in the places where
they are allowed in the source vocabulary, and found that this, too, is not a
requirement. If a tool or format is data driven (what in XSLT-speak is called
push-processed), it should be able to accommodate sections of the same style
as those already present in the vocabulary, even in new locations.
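The following sketch imitates data-driven (push-style) processing outside XSLT: each element is handled wherever it is encountered, so a recursive section is rendered correctly even inside a container that the source vocabulary never allowed it in. The element names (sec, title, p, boxed-text) and the output format are assumptions chosen for this example.

```python
import xml.etree.ElementTree as ET

DOC = (
    "<body><sec><title>Intro</title><p>Text.</p>"
    "<boxed-text><sec><title>Aside</title><p>More.</p></sec></boxed-text>"
    "</sec></body>"
)


def render(elem, depth=0):
    """Render children by element name, regardless of where they occur."""
    lines = []
    for child in elem:
        if child.tag == "sec":
            title = child.findtext("title") or ""
            lines.append(f"h{depth + 1}: {title}")
            lines.extend(render(child, depth + 1))
        elif child.tag == "p":
            lines.append(child.text or "")
        elif child.tag != "title":
            # Any other container (here, boxed-text): keep descending.
            lines.extend(render(child, depth))
    return lines


print(render(ET.fromstring(DOC)))
# ['h1: Intro', 'Text.', 'h2: Aside', 'More.']
```

A pull-style tool that asked only for sections in the places it expected them would never visit the section inside the box.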
Role in the Document
Structures can easily be grouped by role in a document, and it is tempting to think that structures must play the same role in all document types in order to be compatible. We found that this is not so, and that while it might be interesting to group structures by their roles in documents, at least these roles do not seem to affect interoperability:
Elements that may be used at the same structural level as a Paragraph (
<p>), for example, inside of a section. This would include many block-level structures such as figures and tables.
Elements that have the Preserve whitespace model, which is often used for Code and sometimes for poetry.
Emphasis-like with Toggle
Inline elements that may be toggled on and off with recursion. In some tag sets, Italic toggles. That is, if an Italic tagged phrase appears in a context that would be displayed in italic anyway, the Italic tagged phrase is NOT displayed in italics to retain the typographic emphasis.
Emphasis-like without Toggle
Inline elements that do not toggle on and off with recursion. Some structures must be displayed as tagged even if the context they are in would have that display. For example, Sans Serif often does not toggle.
Structures that identify the document, such as ISSNs, ISBNs, author names, or volume and issue numbers.
Structures that contain several related structures but that have no formatting consequences themselves. For example, Article Metadata may be grouped separately from Issue Metadata, and Keywords may be grouped into a Keyword Group.
Structures that are generally displayed as footnotes may include Footnote, Author Note, Funding Source, and Corresponding Author Address.
Elements that are used to identify locations in the document or that are used in pairs to indicate the start and end of some portion of a document, typically that cannot be simply wrapped in an element because of overlap problems. Milestones may be Revision Start and Revision End, or simply Pull Quote.
Structures that can have labels and/or titles
Many, but not all, block type structures can have labels and/or titles. For example, Block Quotes, Boxed Text, Sections, Bibliographies, Lists, and Figures can have labels and titles in many tag sets.
Elements that mark a location in the document or that may have attributes but no element content.
Structures that have accessibility data
The ability to provide alternate text or long descriptions may be available for Figures, Graphics, Equations, Tables, and a variety of other structures.
Structures that have attribution and/or permissions or licensing
Structures such as Articles, Boxes, Sections, Tables, and Appendices may have information about who wrote them or who may use them and under what conditions.
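The difference between emphasis with and without toggle, described in the list above, can be sketched as follows. The element name italic and the function are assumptions for illustration; real formatters implement this in their own ways.

```python
import xml.etree.ElementTree as ET

DOC = "<p>a <italic>b <italic>c</italic> d</italic> e</p>"


def italic_runs(elem, italic=False, toggle=True):
    """Yield (text, is_italic) pairs in document order."""
    if elem.tag == "italic":
        # With toggle, nested italic switches back to upright; without, it stays italic.
        italic = (not italic) if toggle else True
    if elem.text:
        yield elem.text, italic
    for child in elem:
        yield from italic_runs(child, italic, toggle)
        if child.tail:
            # Text after a child belongs to this element's context.
            yield child.tail, italic


root = ET.fromstring(DOC)
print(list(italic_runs(root)))
# [('a ', False), ('b ', True), ('c', False), (' d', True), (' e', False)]
print(list(italic_runs(root, toggle=False)))
# [('a ', False), ('b ', True), ('c', True), (' d', True), (' e', False)]
```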
Attribute Value Types (other than ID and IDREF)
Even in a DTD, it is possible to type attribute values, and in XSD and RNG attribute
value types can be quite strongly specified. We know (see above) that it is
critical that attributes of type ID remain of type ID and that
attributes of type IDREF or IDREFS remain of type IDREF or IDREFS in order for
documents to be compatible. However, that leaves many other attribute types.
Some processing may be tied to specific values of attributes, and if none of the
expected values are present the processing may fail. For example, if a formatter renders
<styled-content view="GrIt"> as green italic and that value is not
present, the formatter will not render the content in green and italic. However, we
see no disruption from:
Adding or removing items from a specified value list
Changing a CDATA attribute to one with a specified value list, or vice versa
Changing an NMTOKENS attribute to CDATA or vice versa
Changing the value of a #FIXED attribute or changing a #FIXED attribute to
CDATA or a specified value list
We came to the conclusion that most attribute typing is useful in the creation of
correct documents as specified by the content creator, but is not essential to the
storage, management, or rendering of the documents.
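The value-dependent processing mentioned above can be pictured as a simple lookup. In this sketch only the value GrIt comes from the example in the text; the style table, the CSS string, and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical style table; only the key "GrIt" appears in the text above.
STYLES = {"GrIt": "color: green; font-style: italic"}


def style_for(elem, attr="view"):
    """Return the style for a known attribute value, or no special style otherwise."""
    return STYLES.get(elem.get(attr, ""), "")


known = ET.fromstring('<styled-content view="GrIt">x</styled-content>')
other = ET.fromstring('<styled-content view="RedBold">x</styled-content>')
print(repr(style_for(known)))  # 'color: green; font-style: italic'
print(repr(style_for(other)))  # ''
```

The lookup behaves the same whether the attribute was declared as CDATA or with a specified value list; only the values present in the documents matter to it.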
The first public draft of the JATS Compatibility Meta-Model
Description [Usdin et al. 2016] was released to the public in
July 2016. We anticipate that the assumptions we have made in this work will be tested
through the process of public review and comment. We hope that we will be prompted to
improve the content of the guidelines to make them more effective and to improve the
descriptions of them to make them clearer and easier to implement.
Although some of the compatibility principles we describe, such as whitespace handling,
ID/IDREF consistency, and maintaining the meaning of object names, are applicable for testing
tag set compatibility in general, we were working specifically on compatibility of extensions
to ANSI/NISO Z39.96-2015 JATS.
We welcome your comments on this conference paper and, more importantly, on the document
at the NISO site.
[Wheeler et al. 2016] Wheeler, Robert, Bruce Rosenblum, and Lesley West. 2016.
NISO STS Project Overview and Update. In Journal Article
Tag Suite Conference (JATS-Con) Proceedings 2016. Bethesda (MD): National Center for Biotechnology Information (US). http://www.ncbi.nlm.nih.gov/books/NBK350146/.
[ISO 2016] International Organization for Standardization (ISO). 2016.
Welcome to the ISO Standards Tag Set (ISOSTS). Accessed April 19.
[NIEM 2016] National Information Exchange Model (NIEM). 2016. NIEM. Accessed
April 19. https://www.niem.gov/.
[Eberlein et al. 2010] Eberlein, Kristen James, Robert D. Anderson, and Gershon Joseph,
eds. December 2010. Darwin Information Typing Architecture (DITA) Version
1.2. Organization for the Advancement of Structured Information Standards
(OASIS) Standard. http://docs.oasis-open.org/dita/v1.2/os/spec/DITA1.2-spec.html.
[TEI 2016] Text Encoding Initiative (TEI). 2016.
Customization. In P5: Guidelines for Electronic Text Encoding and
Interchange. Version 3.0.0. Last modified March 29, revision 89ba24e.
[NCBI 2015] National Center for Biotechnology Information (NCBI),
National Library of Medicine (NLM). 2015.
Modifying This Tag Set.
Journal Archiving and Interchange Tag Library NISO JATS Version 1.1 (ANSI/NISO
Z39.96-2015). Last modified December.
[ANSI/NISO 2015] American National Standards Institute/National Information
Standards Organization (ANSI/NISO). 2015. ANSI/NISO Z39.96-2015, JATS: Journal
Article Tag Suite, version 1.1. Baltimore: National Information Standards Organization.
[Usdin et al. 2016] Usdin, B. Tommie, Deborah A. Lapeyre, Laura Randall, and Jeffrey Beck. 2016. JATS Compatibility Meta-Model Description. Draft Version 0.7. 32 p. http://www.niso.org/apps/group_public/document.php?document_id=16764&wg_abbrev=jats-sc.