I have been working on the PubMed Central PMC01 repository at the National Library of Medicine for 11 years. My role there is to ingest XML from different publishers and transform it into the JATS format JATS01 for inclusion in our database. In this role, I have seen a lot of article SGML and XML content and had to make decisions on whether it was of a consistent quality to be included in the PMC database or not.
Sometimes we have problems with content that is submitted to PMC. We see content that is not well-formed or not valid against the schema being used, as well as tag abuse and other inconsistencies in how the elements and attributes of the XML model have been applied.
PMC accepts content in many different XML formats, but well over 50% of the content being supplied currently is in one of the JATS models. Both the Archiving and Interchange and the Journal Publishing models use the XHTML table model; the CALS table model is supplied in the Tag Suite but not used in these two models. A wonderful example of content no longer being valid outside of a closed system involved a modification of the DTD to call in the CALS table model. The DTD was expanded correctly, and everything ran fine on the publisher's system. However, the DTD files were not given a new name. Also, the PUBLIC and SYSTEM IDs used in the DOCTYPE declaration in the instances were those that were defined for the Journal Publishing Model.
When this content arrived in PMC, the PUBLIC ID was resolved to the standard Journal Publishing DTD (without the extra table model), and all of the instances were invalid.
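The failure mode can be sketched in a few lines of Python. This is a simulation under stated assumptions, not PMC's resolver: the public identifier, both catalogs, and the abbreviated element lists are hypothetical, and a set of declared element names stands in for a real DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical public identifier, used only for illustration.
PUBLIC_ID = "-//EXAMPLE//DTD Journal Publishing (illustrative)//EN"

# Heavily abbreviated stand-in for the elements the standard DTD declares.
STANDARD_MODEL = {"article", "body", "table-wrap", "table", "tbody", "tr", "td"}

# The publisher's local copy was extended with CALS elements but kept the
# same public identifier; the receiver's copy is the unmodified standard.
PUBLISHER_CATALOG = {PUBLIC_ID: STANDARD_MODEL | {"tgroup", "row", "entry"}}
RECEIVER_CATALOG = {PUBLIC_ID: STANDARD_MODEL}

# Hypothetical instance using the CALS table structure.
INSTANCE = (
    "<article><body><table-wrap><table><tgroup><tbody>"
    "<row><entry>1</entry></row>"
    "</tbody></tgroup></table></table-wrap></body></article>"
)


def undeclared_elements(xml_text, public_id, catalog):
    """Return element names used in the instance but absent from the
    model that this catalog resolves the public identifier to."""
    declared = catalog[public_id]
    used = {element.tag for element in ET.fromstring(xml_text).iter()}
    return sorted(used - declared)


print(undeclared_elements(INSTANCE, PUBLIC_ID, PUBLISHER_CATALOG))
print(undeclared_elements(INSTANCE, PUBLIC_ID, RECEIVER_CATALOG))
```

The same identifier names two different models depending on who resolves it, which is why giving a modified DTD its own name and identifiers matters as soon as the files leave the system.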
When we provide feedback on the sample XML supplied during the PMC evaluation process Beck01, the response we most often hear from the publishers is "We paid a lot of money to get this XML. It works on our website, so we know it is good."
If inconsistent and invalid content works in their system, it is obvious that they are creating and using their content in a Closed XML System.
What Is a Closed XML System?
A closed XML system is a system where the XML files never leave the system or are never used by anyone other than the creator. We have all created little XML documents with one-off throw-away models for one task or another that only we use.
An example would be if you were creating a To Do list application to run in XML for your own use. First you would figure out what you want to track and probably start with a sample document.
<todolist>
  <todo>
    <due month="07" day="08" year="2011"/>
    <reminder month="07" day="01" year="2011"/>
    <task>Finish Balisage Paper.</task>
  </todo>
</todolist>
This works for a while, and then you decide to add a second reminder.
<todolist>
  <todo>
    <due month="07" day="08" year="2011"/>
    <first-reminder month="07" day="01" year="2011"/>
    <second-reminder month="07" day="01" year="2011"/>
    <task>Finish Balisage Paper.</task>
  </todo>
</todolist>
And maybe a way to add custom messages.
<todolist>
  <todo>
    <due month="07" day="08" year="2011"/>
    <first-reminder month="07" day="01" year="2011">
      <message>Just one week left to finish the paper.</message>
    </first-reminder>
    <second-reminder month="07" day="01" year="2011">
      <message>Time to call Tommie to get an extension.</message>
    </second-reminder>
    <task>Finish Balisage Paper.</task>
  </todo>
</todolist>
As this simple model has evolved, you have kept the XSLT or XQuery that processes the To Do list in step with it. You can handle a <reminder>, a <first-reminder>, and a <second-reminder>, both with and without <message>. The To Do list works fine because, although you have inconsistent data, you were able to make allowances for it in your processor when you made the changes to the model.
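A processor that makes these allowances might look like the following minimal sketch. It is written in Python rather than XSLT or XQuery purely for illustration; the function name, the sample document (a variant of the one above with one message removed), and the fallback to the task text are assumptions.

```python
import xml.etree.ElementTree as ET

# Variant of the To Do list above: one reminder with a message, one without.
TODO_XML = """<todolist>
  <todo>
    <due month="07" day="08" year="2011"/>
    <first-reminder month="07" day="01" year="2011">
      <message>Just one week left to finish the paper.</message>
    </first-reminder>
    <second-reminder month="07" day="01" year="2011"/>
    <task>Finish Balisage Paper.</task>
  </todo>
</todolist>"""

# Every reminder element name the model has ever used.
REMINDER_TAGS = ("reminder", "first-reminder", "second-reminder")


def list_reminders(xml_text):
    """Collect (date, text) pairs for every reminder variant.

    A reminder without a <message> falls back to the task text
    (an assumption made for this sketch)."""
    reminders = []
    for todo in ET.fromstring(xml_text).iter("todo"):
        task = todo.findtext("task", default="")
        for child in todo:
            if child.tag in REMINDER_TAGS:
                date = "{year}-{month}-{day}".format(**child.attrib)
                text = child.findtext("message", default=task)
                reminders.append((date, text))
    return reminders


for date, text in list_reminders(TODO_XML):
    print(date, text)
```

Every special case lives in the processor rather than in the model, so nothing outside this code records what the three element names mean.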
This is not a problem, because you control both ends of the process and nothing is showing up unexpectedly. Confusion could arise if you shared your data with someone else who had to figure out what the difference was between a <reminder> and a <first-reminder> and what to do with those that have messages and those that don't.
But these little documents are not the focus of this paper.
Coming from the document publishing side of the XML world, I am going to concentrate on document content XML: journal articles, books, book chapters, reports. But these 'rules' apply to any XML that is intended to be saved, used or reused. Certainly this would apply to both document and data XML applications.
A not-closed XML system is one where there is some interchange of XML. This could be interchange between organizations, interchange between departments in an organization, or between individuals. Interchange can be a sharing of content between entities, but it could also be between steps in an XML workflow.
The submission of papers for this conference is an example of XML interchange. The author creates XML to be used by the conference committee for peer review and (hopefully) publication in the conference proceedings.
A Classic Communication Model
Wiener's modification of Shannon's classic communication model (see Fig. 1; Wiener01 Wiener02 Foulger01) can be applied to XML interchange. In the communication model, there are two actors, the sender (information source) and the receiver (destination).
Fig. 1: Interactive Communication Model
The communication in Fig. 1 contains the following steps:
The Information Source creates a Message.
The Message is converted into a Signal and sent by the Transmitter.
The Signal may be acted upon or interfered with by Noise - some third party or environmental activity.
The Received Signal is converted into a Message by the Receiver.
The Destination receives the Message.
The Destination provides feedback to the Information Source.
Of course, this feedback is another message, with the original Destination as the Information Source, but what is important here is that the Receiver acknowledges or confirms the Message. (At this point, the Balisage audience should all be nodding their heads in agreement.) The feedback is a critical element of communication. It allows the sender to know whether the message is getting through and to make the adjustments necessary to make the communication successful.
This model can be applied to any communication. For example, a telephone conversation:
Person A (Information Source) says "XML is great!" (Message) into his telephone.
Telephone (Transmitter) converts sound to an electrical Signal.
The cell drops out (Noise).
Person B's telephone (Receiver) converts the Received Signal into sound (Message).
Person B hears, "XML is gray---".
Person B provides feedback: "Gray?"
What can go wrong with XML - The four layers of "bad"
XML can go bad on several levels. These levels were beautifully and simply illustrated by Bauman01 in "The 4 'Levels' of XML Rectitude". The TEI examples in this section are his.
The first thing that can go wrong is that the XML is not well-formed. Simply put, the basic rules of XML are not followed.
Fig. 2: Not Well-Formed XML
<titleStmt> <title>Fun!</head </titleDesc
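Any conforming XML parser will reject Fig. 2. A minimal sketch using Python's standard library parser (the function name is illustrative):

```python
import xml.etree.ElementTree as ET

# The fragment from Fig. 2: mismatched and unterminated end tags.
FIG_2 = "<titleStmt> <title>Fun!</head </titleDesc"


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed(FIG_2))
print(is_well_formed("<titleStmt><title>Fun!</title></titleStmt>"))
```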
If the document is well formed, the next potential problem is validity. Does the syntax of the XML match the schema?
Fig. 3: Invalid TEI XML
<note href="#there" > <div> What? In TEI, <gi>div</gi> is not allowed in <gi>note</gi>. </div></note>
Assuming that the document is well-formed and valid, next you have to worry about Sensibility. XML constructions that do not make any sense are no good to anyone.
Fig. 4: Nonsense XML construction
<caesura xml:space="preserve"/> <respStmt> <name><catchwords/></name> <resp><height unit="cm"/></resp> </respStmt>
Finally, if the XML is well-formed, valid, and is sensibly constructed, the content may just be wrong.
Fig. 5: Just wrong content
<quote who="#Washington"><time dur="PT4H7M">Four score and seven years</time> ago our fathers brought forth on this continent a new nation …</quote>
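One reading of why Fig. 5 is wrong, offered as an illustration and not as part of the original poster: the words are from Lincoln's Gettysburg Address, not from Washington, and the duration PT4H7M means 4 hours and 7 minutes, while "four score and seven years" is 87 years. The arithmetic can be sketched as follows; the parser is a hypothetical helper that covers only the years, hours, and minutes fields of an ISO 8601 duration.

```python
import re

# Handles only years, hours, and minutes; a full ISO 8601 duration
# parser would also cover months, days, seconds, and weeks.
DURATION = re.compile(r"P(?:(\d+)Y)?(?:T(?:(\d+)H)?(?:(\d+)M)?)?")


def parse_duration(text):
    """Parse a small subset of ISO 8601 durations into a dict."""
    match = DURATION.fullmatch(text)
    if match is None:
        raise ValueError("unsupported duration: " + text)
    years, hours, minutes = (int(g) if g else 0 for g in match.groups())
    return {"years": years, "hours": hours, "minutes": minutes}


# A score is twenty, so "four score and seven" is 4 * 20 + 7.
four_score_and_seven = 4 * 20 + 7

tagged = parse_duration("PT4H7M")
print(tagged)
print(four_score_and_seven, tagged["years"] == four_score_and_seven)
```

No parser or schema can catch this kind of error, because the markup is well-formed, valid, and sensible; only knowledge of the content reveals the problem.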
Applying the communication model to XML Interchange
We can apply the classic communication model to interchange of XML between entities.
Fig. 6: XML Interchange in the Communication model
Fig. 6 shows how the communication model can be applied to XML interchange between parties. The steps in this communication are:
The Information Source creates some Content.
The Content is encoded into a file based on an XML Model and sent.
The file may be acted upon or interfered with by Noise.
The file is converted into Content with the XML Model.
The Destination receives the Content.
The Destination provides feedback to the Information Source.
For our purposes, we can simplify this model somewhat by removing the Noise. Certainly there can be noise in XML interchange, but I see noise that occurs in the transfer of files as a Systems problem and outside the scope of this discussion.
There is another change we need to make, which is at the root of our discussion. Just as there may have been problems encoding and decoding the Message into and out of the Signal in the communication model because the Transmitter and Receiver are not the same entity, we need to note here that the XML is encoded with the Sender's XML model and decoded with the Receiver's XML model (see Fig. 7).
Fig. 7: Modified XML Interchange model
So, if the Sender's XML Model is not exactly the same as the Receiver's XML model, there will be distortion of the Content, just as there is distortion of the Message if the Receiver is not decoding the signal as the Transmitter encoded it.
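This distortion can be sketched with the To Do list from earlier. In this hypothetical setup, the Sender's model knows all three reminder elements, while the Receiver's model was built from the first sample document and knows only <reminder>; the names and the decode function are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Content encoded with the Sender's evolved model.
SENT = """<todolist>
  <todo>
    <due month="07" day="08" year="2011"/>
    <first-reminder month="07" day="01" year="2011"/>
    <second-reminder month="07" day="01" year="2011"/>
    <task>Finish Balisage Paper.</task>
  </todo>
</todolist>"""

SENDER_MODEL = {"reminder", "first-reminder", "second-reminder"}
RECEIVER_MODEL = {"reminder"}  # hypothetical: only the original element


def decode(xml_text, reminder_tags):
    """Return the reminder dates visible to a processor that
    recognises only the given element names."""
    return [
        "{year}-{month}-{day}".format(**element.attrib)
        for element in ET.fromstring(xml_text).iter()
        if element.tag in reminder_tags
    ]


print(decode(SENT, SENDER_MODEL))
print(decode(SENT, RECEIVER_MODEL))
```

The file arrives intact and parses without error, yet the Receiver sees no reminders at all. The loss is silent, which is what makes feedback from the Destination so important.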
Because XML is intended to be machine-processed content, we can run some tests after the XML has been received. First, we can test Well-Formedness with any XML parser. Well-formedness is defined by the XML Specification.
The rest of the tests require some agreement between Sender and Receiver, either explicitly ("I am going to send you this article in DocBook 5 format.") or implicitly - where the XML file identifies itself. Either way, we can test for validity by processing the file with the agreed-upon schema.
Next, content Sensibility can be checked with a content-application-level tool such as a Schematron or other application-specific checking tool like the PMC Stylechecker. For example, it would be trivial to write a Schematron rule to check for the schema-valid but Nonsense XML in Fig. 4. If the application thought that a <height> element with units but no value was not "correct", a test could be added for that; similarly, if <catchwords> was not something you wanted to see in <name>, you could have a test for that. But Sensibility checking at this level comes with a price, which is even greater communication between Sender and Receiver.
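The two tests just described can be sketched in Python as a stand-in for Schematron rules. This is a minimal illustration under stated assumptions: the fragment is the <respStmt> part of Fig. 4 without a namespace, the function name and messages are hypothetical, and "value" is taken to mean either text content or a quantity attribute.

```python
import xml.etree.ElementTree as ET

# The <respStmt> part of Fig. 4, with no namespace declared.
FIG_4_FRAGMENT = """<respStmt>
  <name><catchwords/></name>
  <resp><height unit="cm"/></resp>
</respStmt>"""


def sensibility_problems(xml_text):
    """Return a list of application-level complaints about the markup."""
    root = ET.fromstring(xml_text)
    problems = []
    # Test 1: a height that states a unit but carries no value.
    for height in root.iter("height"):
        has_text = bool((height.text or "").strip())
        has_value = has_text or "quantity" in height.attrib
        if "unit" in height.attrib and not has_value:
            problems.append("height has a unit but no value")
    # Test 2: catchwords anywhere inside a name.
    for name in root.iter("name"):
        if name.find(".//catchwords") is not None:
            problems.append("catchwords appears inside name")
    return problems


for problem in sensibility_problems(FIG_4_FRAGMENT):
    print(problem)
```

Each test encodes a tagging convention that the schema does not express, so Sender and Receiver have to agree on the list of tests as well as on the schema.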
All of this Sender/Receiver communication is for one goal: to get the Sender's XML model and the Receiver's XML model to be as closely aligned as possible. This works pretty well in XML systems where content is transferred between entities, because the receiver is accustomed to running at least well-formedness and validity checks on incoming content.
Fig. 8: XML Interchange model with checks
In a closed XML system, the Sender and Receiver are the same entity. This greatly simplifies the XML interchange model (see Figs. 9 and 10).
Fig. 9: XML Interchange in a closed system.
Fig. 10: Actually the Information Source and Destination are the same Entity.
In Fig. 10, things have gotten very simple with the Information Source and Destination collapsed. And in some cases things are quite simple here. The danger of a closed XML system is that it gives a false sense of "All's Well." There are several things that happen in closed systems. First, one entity controls both ends of the pipe.
For example, one person can be responsible for tagging articles for a magazine and building the rendering software that renders the articles on the web in HTML. When a new object appears in an article, the only requirement for the XML tagging is that it works in the renderer.
Generally in these systems, the only test that things are OK is that the XML is working in the system; that is, in a system that was created to fit each twist and turn in the evolving XML model. If XML tools are used, then well-formedness tests come along for the ride, but validation against a schema (if there is one) is deemed unnecessary and complicated. After all, "Our XML works".
In a closed system, "Garbage In" is OK, because you control both ends of the pipe, know the garbage is coming through, and build something to deal with it when it comes out the other end. Sometimes Garbage In is OK.
And if it works, that is great. The real price for this closed system won't come due until you either have to send your XML to someone for reuse or reuse it yourself.
Switching to an XML Interchange workflow from a closed system can be humbling and expensive. Any Destination that will be taking your content will expect it to pass all four levels of XML Rectitude, will actively check well-formedness and validity against a schema, and will seek information on tagging conventions (sensibility) expecting (or at least hoping for) some consistency in the tagging.
If the XML corpus has not been subjected to these checks throughout, there will be big problems with reuse of the content by an outside entity.
Similarly, when everything about your XML system is in one geek's head, you will have trouble maintaining the system, let alone changing it or upgrading it when that geek moves on to greener pastures.
Standards are the answer?
Actually, not really. Using a standard model like the TEI, DocBook, or the JATS will get you schemas and some information on best practices for tagging, but there is no forced validation. Also, in closed XML Systems, there is no penalty for Tag Abuse. That is, if the standard schema does not have an element for an object, you can just tag your content any way you like, because you control both ends of the pipe.
Requirements for XML Interchange
There are two requirements for any XML Interchange, and they both come from an agreement between the Information Source and the Destination.
Validation - XML files must be well-formed and valid against an agreed-upon schema.
Defined tagging practices - XML files will be tagged consistently in a manner that makes sense.
These two requirements for interchange are the same ones you will need to run a consistent, sane system over time.
If XML interchange is like a conversation, then a closed system is like listening to the voices in your head.
[Wiener01] Wiener, N. (1948). Cybernetics: or Control and Communication in the Animal and the Machine. Wiley.
[Wiener02] Wiener, N. (1986). Human Use of Human Beings: Cybernetics and Society. Avon.
[Foulger01] Foulger, Davis. (2004). "Models of the Communication Process." http://davis.foulger.info/research/unifiedModelOfCommunication.htm
[Bauman01] Bauman, Syd. (2010). "The 4 Levels of XML Rectitude." Balisage 2010, poster.
[Beck01] Beck, Jeff. “Report from the Field: PubMed Central, an XML-based Archive of Life Sciences Journal Articles.” Presented at International Symposium on XML for the Long Haul: Issues in the Long-term Preservation of XML, Montréal, Canada, August 2, 2010. In Proceedings of the International Symposium on XML for the Long Haul: Issues in the Long-term Preservation of XML. Balisage Series on Markup Technologies, vol. 6 (2010). doi:10.4242/BalisageVol6.Beck01. http://www.balisage.net/Proceedings/vol6/html/Beck01/BalisageVol6-Beck01.html