Markup Vocabulary Customization

Introduction

Syd Bauman

Digital Scholarship Group, Northeastern University

Balisage Proceedings

Symposium on Markup Vocabulary Customization
July 29, 2019

Introduction

Natural languages change over time.[1] Take the word “wicked”, which gained a new meaning in the 21st century (“Excellent; awesome; masterful”)[2], the opposite of its historical meaning (“Evil or mischievous by nature”[3], since the 13th century[4]). One of my favorite anecdotal examples is the pair of words “urgent” and “emergency” as used in American medicine. “Emergency”, which less than 100 years ago meant unscheduled and possibly serious, now means very serious, very urgent. “Urgent”, which used to mean demanding immediate attention, now indicates something that, while unscheduled, does not require immediate medical attention.[5]

Markup languages, in particular XML markup languages, often change over time, too. But unlike a natural language, an XML markup language may be designed to be altered: its designers may have explicitly built in mechanisms that let users change the language to suit their particular application. In turn, users of the language may be permitted, or even expected, to customize it to fit their needs more closely.

While this may seem counter-intuitive at first (after all, the basic underlying technology these languages use is the “Extensible Markup Language”: isn’t it the XML layer that is intended to be extensible by the schema layer?), in the end it makes perfect sense. The data and contexts to which we modern computer users apply our markup languages, and the processing we expect from the marked-up results, are almost as varied as we are. It is inevitable that some uses will be very similar to, but not precisely the same as, others. Take, as a fictional example, a scholar who is studying the effect of weather on the tone of letters to the editor. Besides the usual metadata about each letter (date written, date of publication, which newspaper, etc.) and the transcription of the letter itself, features that would likely be readily available in any tagset designed for transcribing or writing letters, including TEI, she also needs metadata about the weather on the day the letter was written at the place it was written, a feature I daresay very few, if any, tagsets for transcribing letters would include.

So it is not surprising that many major markup languages have built-in mechanisms for user extension. These mechanisms permit the user to modify the vocabulary, the grammar, or the semantics of the base markup language. Customization mechanisms often include methods to:

  • narrow the schema components (removing elements or attributes)

  • expand the schema components (adding new elements or attributes)

  • loosen or restrict the schema (required versus optional, etc.)

  • add to or change the semantics of a component

  • document the customizations
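
To make these five capabilities concrete, here is a minimal sketch in Python that applies them to the fictional weather study described above. It is a toy model only: the `ElementSpec` and `Schema` classes, the function name, and the element and attribute names (`letter`, `postscript`, `weather`, `dateWritten`, and so on) are all hypothetical, invented for illustration, and it does not represent the customization mechanism of TEI or of any other real markup language.

```python
from copy import deepcopy
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ElementSpec:
    """One element in a toy schema (hypothetical model, not a real schema language)."""
    attributes: Dict[str, bool] = field(default_factory=dict)  # attribute name -> required?
    gloss: str = ""  # prose statement of the element's meaning


@dataclass
class Schema:
    elements: Dict[str, ElementSpec] = field(default_factory=dict)
    notes: List[str] = field(default_factory=list)  # documentation of each change


def customize_for_weather_study(base: Schema) -> Schema:
    """Return a customized copy of `base`; the base schema itself is left unchanged."""
    custom = deepcopy(base)

    # Narrow: remove a component this project does not use.
    del custom.elements["postscript"]
    custom.notes.append("Removed <postscript>: not transcribed in this project.")

    # Expand: add a new element with its own attributes.
    custom.elements["weather"] = ElementSpec(
        attributes={"conditions": True, "temperature": False},
        gloss="Weather at the place and on the day the letter was written.",
    )
    custom.notes.append("Added <weather> with @conditions (required) and @temperature.")

    # Restrict: make a formerly optional attribute required.
    custom.elements["letter"].attributes["dateWritten"] = True
    custom.notes.append("Made @dateWritten on <letter> required.")

    # Change semantics: narrow the meaning of an existing element.
    custom.elements["letter"].gloss = "A letter to the editor of a newspaper."
    custom.notes.append("Redefined <letter> to mean letters to the editor only.")

    return custom


# A hypothetical base vocabulary for transcribing letters.
base = Schema(elements={
    "letter": ElementSpec({"dateWritten": False, "newspaper": True}, "A letter."),
    "postscript": ElementSpec({}, "A postscript to a letter."),
})
custom = customize_for_weather_study(base)

for note in custom.notes:  # the fifth capability: the customization documents itself
    print(note)
```

The sketch copies the base schema rather than modifying it in place, so the base language and the customization can be compared side by side. In this toy model, a document that uses the added `weather` element would no longer fit the base vocabulary, whereas one that merely avoids the removed `postscript` element still would.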

Not all vocabularies provide all of these customization capabilities. More importantly, there is no agreement, nay, not even much similarity, among the mechanisms that various markup languages use to allow and disallow user customizations of the language.

Are some of those mechanisms far better, or far worse, than others? How easy are they to use? How much power do they afford customizers? How difficult is it to maintain the customization mechanism writ large, or a particular customization? Can a document that conforms to a customized schema be interchanged among groups that use the main language?

So for our symposium we have taken a first step toward a deeper understanding of customization mechanisms. We have assembled experts in five of the major XML markup languages that expect user customization, and asked each to describe, in detail, the mechanism used by that language.



[1] This is true even of those languages for which a language academy tries to control or regulate changes. Wikipedia lists over 80 such languages.

[5] When my dad was in medical school “emergency surgery” was not necessarily all that important (although it might be), but its chief characteristic was that it was unscheduled. After all, the etymology of “emergency” is from “emerge”, to come forth from concealment or come to the surface. “Urgent”, on the other hand, for hundreds of years meant “important, requiring immediate attention”. But in the mid-20th century the area of a US hospital that treats unscheduled patients became the “emergency room”, later the “emergency department”. Since many of these patients have particularly urgent problems, the word “emergency” began to mean “serious” or “urgent”. (Personally, I blame the 1970s television show Emergency for popularizing this usage.) But during the late 20th century as changes in attitudes and insurance systems caused an overflow of patients showing up at emergency departments with minor problems, hospitals needed to find a place to put them that did not tie up the resources of the Emergency Department. Roughly simultaneously (give or take a decade) free-standing treatment centers for unscheduled, but non-serious, problems cropped up. In many cases these centers were not open 24 hours, and local legislation limited the use of the name “emergency” to establishments that were open 24 hours a day. Thus these new units that handled less urgent unscheduled medical problems needed a different name, and they became “urgent care”. Nowadays any emergency nurse can tell you that an “emergency” patient is much more urgent than an “urgent” one, and many a patient gets sent from the Emergency Department to the Urgent Care unit because their problem is not particularly urgent.