Note: Editor’s Note
This talk was given in the opening session of “Balisage: The Markup Conference” 2009.
“Standards considered harmful.” Actually, when it occurred to me that I might want to call a talk that, it occurred to me that “XYZ considered harmful” talks were probably harmful. I thought about that, and I did a little browsing around, and learned that actually several people have given conference talks or written essays about “considered harmful” essays being harmful. And I decided that proves it’s a cliché, and there’s nothing wrong with clichés, so here I am talking about “standards considered harmful.” And as with most “harmful-considered” papers, what I’m really saying is: Reflexive or mindless use of standards or mindless application of standards that are not applicable is harmful.
We have, unfortunately I think, reached a stage in the evolution of markup and markup specifications and standards where we not only have a multitude of them, if not an excess, but we also have generated ourselves some religious fanatics about them. I am not really fond of fanatics of any sort — I think political fanatics make me even crazier than religious fanatics — but in either case application without reason or encouragement of application without reason is harmful. And it’s easier to fall into than you would think.
Standards bullying often comes in innocuous and helpful-sounding guises. “Best practice” is one that frequently makes my skin crawl. “Oh, you can’t do that! It’s best practice to ….” I’m going to talk about that some more in a minute, and we have a [conference] panel on best practices tomorrow.
“It would be more standards-compliant if you did this instead of that.” How many times have people outside your projects reminded you of requirements you must meet? Those are the people who really make me anxious.
A lot of these “requirements” translate to “You should spend a lot of your money and your time and your energy doing something that will be of no value to you, but that might someday be useful to me or to someone I can imagine.”
Or worse, in my opinion: “You should spend a lot of your money and make everything you are doing harder so that if, in the future I, or someone like me, wants to re-use your content, it will be easy for us.” Often the people who are helpfully pushing you toward standards or best practices
or shared recommendations are not only not involved in funding your project, their
goals are not your primary mission, or, for that matter your secondary, or tertiary,
or quaternary mission. They are sidetracking you. And in many cases they don’t even
These self-styled standards evangelists are of good will. They think it would be really wonderful if everyone could just get along, and if everyone could just share all of their information. And, of course, they are right; it would be really wonderful. But that doesn’t make sharing information, especially with unknown future users with unspecified and unknowable requirements, the primary goal for each and every information project. In fact, I would venture to say that it is not the primary goal of any funded project that has the money to create a substantial body of marked up documents. (It may be the primary goal of projects or activities to enhance information sharing, but those projects rarely (never?) have the funding to prepare the documents. They assume that they will write guidelines and others will use those guidelines.)
The projects and activities that actually seem to have the money to mark up a substantial
body of documents generally have as their missions things like: Create print and electronic
publications that follow our house style for display, are searchable using our organization’s
favorite search engine, and that will help make the next update easier and cheaper
than the last one was. It is possible that the mission also includes being a basis
for creation of unknown future electronic publications and possible mix-n-match print.
But there’s usually a
use scenario and a
user scenario in mind at the point when the funding is available to do the work. I think
we need to stop forgetting that.
Know Your Goals and Stick to Them
When you start a project — practically any project — there are some basic management tasks that we all ought to do. These tasks help us stay on track by making the track explicit, to ourselves and others. In addition, they allow us to know what we are not trying to do, so we can focus on what we are trying to do.
Identify the stakeholders. That is, who are you doing this for? Be specific. You are not creating a document collection for “all future humans,” or at least I doubt that you are. Your stakeholders are more likely to include the people who will read, use, or buy your electronic product (are they students, teachers, railroad switch engineers, managers of non-profit associations, art historians, or … some other group), and the people funding your project, and the managers of your organization, and the printer with whom you have a contract.
Identify your project goals. What do you have to do to be a success? Project goals are probably a mixture of technical, educational, financial, scheduling, and budgetary.
Identify your non-goals. That is, things that perhaps somebody else might want to do and therefore assume that you are setting out to do, but that you are not doing. It is my opinion that explicitly recording your non-goals is the best safety measure you can take in starting any funded project.
I didn’t forget to make a Spanish translation of this content. That wasn’t a goal of my project.
I didn’t forget to make an RDF-access tool for my data; that’s not what I was doing.
Prioritize your project goals. If you prioritize your project goals into “high” and “low,” and all your project goals end up as “high,” prioritize again. Everything can’t be top priority. It’s not possible. If you’ve done that, what you’ve done is, I don’t want to say, lied to yourself; you just haven’t done the hard work of figuring out what is the most important thing and what are secondary things. Now, you can say, “Here’s a list of eight things that my project has to do. All of them must be done for me to be a success.” But that doesn’t mean that all of them have to be the most important thing because all of them can’t be. Your project goals may be a mixture of technical goals, educational goals, financial goals, scheduling goals, and budget goals. Without explicit goals you can’t know if you are getting where you want to go, and you are wandering about without your primary weapon against being forever sidetracked before you accomplish anything.
At that point, I think you probably also want to start thinking about a distinction I talk about a lot when we’re talking about document modeling: what is true versus what is useful. When you start looking at a set of documents, you can find a lot of things that are true about them and that you could identify and spend an awful lot of money on. The question is how many of those things are useful. It would be possible, for example, when marking up business documents for a subject retrieval system to identify the parts of the document on your subject taxonomy and to identify the documents themselves and the chapters of them and the sections of them and the paragraphs and the sentences and the words and what was the language of origin of each of the verbs in each of the sentences. Is this likely to be useful? Is it conceivable that there might be somebody someplace who would find that useful? Yes. If your goal is to make a corporate procedure library easily available to the telephone help desk, is it likely that knowing which word is a verb is going to be helpful? Probably not. So, if you’re supporting a telephone help desk, maybe you don’t need to get into the linguistic analysis of the sentences. That’s what I’m talking about: not spending money to supply something just because it is possible that some unknown future person might want it. Stick to what you’re supposed to be doing. There may be a text markup standard that specifies how to mark the part of speech of each word in documents and its language of origin. If there is, and knowing or manipulating this information is related to one of your project goals, then, and only then, is that standard relevant to your project.
I think in this talk I should also talk about “cool” versus “useful.” It may be good to know (I’m trying to think of a good example about these same help desk documents) how many times a document has been used, who read it, and what language it was originally written in (if you’re in a multi-lingual environment). There may be a whole lot of other things that it might be “cool” to know but that are irrelevant to the users and the uses of this data. Maybe you don’t need to keep track of that. Even if there is a document describing standard markup for tracking such information.
Peer Pressure in Academia
Some of you are probably thinking that maybe this is true in the business environment, but not in the academic world. Right? I mean, we know that in the academic world being a good team player is the route to gaining respect, admiration, and support, right? Or at least that’s what people say. (It doesn’t actually look like that sometimes, but that’s what they say.)
So, when we get our hands on a new set of data in the academic environment, we should work with the expectation that we’re going to share it with others, right? That’s what data is for.
Let’s imagine a scenario. We have an eager, young cultural anthropologist — I like cultural anthropologists because they seem to be interested in everything — and he wanders into a little library in a farming town in the middle of who-cares-what-country (it doesn’t matter). Some little town somewhere. Little tiny town library. And he discovers literally hundreds of diaries of the farmers and the townspeople for the last 150 years. It’s a treasure trove. Wow! What a find! It just made his career, right? He’s going to study these, he’s going to publish about them, he’s got it.
The town is losing population fast, the library is underfunded, its roof is leaking, and the diaries may be destroyed by water and mold the next time there’s a big summer storm. But the librarian says, “You can’t have my diaries. You can take pictures of them, but you can’t have them.” What does our young researcher go out and do? He needs a grant, so he needs to write a proposal. He can’t just go home to the university and say “I found some cool data. Give me some money.” So, he starts planning. He’s going to have to scan them; he’s going to have to clean up the pages; he’s going to have to transcribe them, analyze them, and start publishing about all the things he’s learned about the history of this farming community and the history of farmers and all these wonderful things.
How to transcribe it? Well, his professors when he was in school got their source materials and sat down at typewriters and transcribed them into typescript. That’s probably not the 2010 way to do it (it’s probably going to be 2010 before he gets his money). Among other things, he knows he’s going to want to put bits of these on the web with his analysis of them, besides which, he wants to do “sophisticated searching” of his documents.
[Aside:] How many of you have had users who said they wanted to do “sophisticated searching”? How many of you think they knew what they were talking about? I have an opinion. I know what “sophisticated searching” means. “Sophisticated searching” means you can show it off at a cocktail party, and people will say, “Cool.” When a user says “Sophisticated searching,” and then you say, “Give me ten examples of searches you want to do,” and they can’t give you any, what they want is a “wow.” That’s what our mythical cultural anthropologist wants: a “wow.”
He’s going to want to flag the things he finds most interesting in these documents: crops, harvests, the weather, the health of the people, the names of the people, the names of places, the stuff that he’s going to be wanting to find amongst this data set so he can do his analysis. Now, this is going to be a little tricky, but fortunately our young cultural anthropologist is an expert on HTML. How did he get to be an expert on HTML? He has a 64-page book about it, and he read it. Well, he looked at it. Well, he has it, anyway. So, he’s an expert. He knows how to do this. He’s going to have an army of students read the copies of the pages of these diaries and transcribe them into HTML which is pretty easy, and he can afford a whole lot more copies of the 64-page book so he can teach them, too. This is going to be easy.
He thinks about the information he really cares about in these documents, and how to flag it so he can find it and analyze it later; he read about it in his HTML book. There’s this, there’s this, there’s this, there’s a way to say that something is about a class: he can add class attributes to stuff so he’s going to be able to find the stuff about the crops and the weather and the people’s health. He’s got it all figured out. He’s got his budget, he’s got his proposal ready to submit to his funding agency, he’s telling his friends about it, and a colleague with several years of experience in creating collections of historical documents says, “Nah, nah, nah. You don’t want to do this HTML stuff. The difference between HTML and XML is that in XML you can make up your own tags. XML is make-up-your-own-tag-HTML. And it’s going to be much easier for you to use XML than HTML because you don’t have to say <span class="crop"> to mark your crops; you can have a tag called <crop> and just surround it. It’s much easier, and you’re going to be much happier.” So, at the last minute he makes a change: He’s not going to mark the data up in HTML; he’s going to mark it up in XML.
He sends in his grant proposal, chews on his fingers for a few months, and gets funded. Cool. Two years of funding; it’s time to get started. He celebrates, he hires an army of students, and he starts getting organized.
“Wait, wait,” says another member of his departmental coffee-klatch, “How are you going to identify your documents?”
“What do you mean, how am I going to identify my documents? We’re going to put in the file name, we’re going to have the name of the person who wrote the diary, which volume it is, and which page it is, and we’re going to transcribe it. What’s the big deal?”
“You never heard of Dublin Core? What’s the matter with you? How anti-social are you? You have to use the standard.”
“Oh. I don’t know anything about it.”
“Well, you better learn ’cause it’s the right way to do it.”
Okay. So he goes and reads up on Dublin Core. And actually that doesn’t look so hard; he agrees to make little Dublin Core headers.
He sends the teams out to go and scan the documents, and he starts marking up his little XML documents that have HTML tags for all the sort of normal things like paragraphs, and he makes up tags for the things like weather and crops that HTML doesn’t have a tag for. This should work, because, after all, XML is add-a-tag-HTML. And he’s chatting about his cool new project at a conference, when a well-respected expert on documentary editing says, “Wait, wait. You can’t do that.”
“What do you mean I can’t do that? The sample document I have been working on looks OK to me, and I’m in the middle of writing guidelines so we can do this consistently. Look,” he says, “I have a <p> tag around my paragraphs, and I have <crop> tags around each mention of a crop. <crop> seems much easier to deal with than <span class="crop">.”
“But,” the expert says, “XML is more complicated than that. Go and get an XML book.” So he takes himself off to the little local library to look for the 64-page book all about XML, and there aren’t any. There are books, way more than 64 books, but none of them is as short as 64 pages. (This XML stuff is complicated! Did you know they have conferences about XML? There’s enough that you can have a conference?? It’s ridiculous.)
So, he comes back, and he starts talking to his friends and saying, “Okay, which of these books do I buy? And do I really have to read all of it? These books are big.”
“No, you don’t have to buy a book about XML, and you don’t have to read one. You just have to go and read about the Text Encoding Initiative because that is the one and only right way to tag your scholarly XML documents so other scholars can use them.”
“Does everybody else have to use them?”
“Yes. How anti-social are you going to be? You’ve got to use this standard.”
So, okay, he’s going to go and read about the Text Encoding Initiative. Do you know how big that sucker is? [Laughter.] You know, they don’t have a 64-page book about it either. So, he settles down, and he does some reading, and then he takes a class, and he goes to a workshop, and gets a consultant in to help him cut the TEI tag set down to something that is merely daunting. And his budget is flying out the window.
So, he does this, he does that, and he’s finally got himself a subset of the TEI with Dublin Core metadata because how can you not do that, and he’s got some sample documents he can show to his student-workers, and he’s got a few scanned pages that he can mark up, and he starts marking them up. And now, instead of HTML’s <span class="crop"> or the XML he made up for himself, <crop>, he has the infinitely better <rs type="crop"> (making the word “corn” into a “referencing string” of type “crop”). Well, that’s an improvement, isn’t it.
And what does our young researcher want to do now? He wants to look at his sample tagged documents. That is, he wants to do what he set out to do in the first place which is to be able to look at his text on the screen with the crops in yellow and the weather in blue. And he finds that he can’t do it with off-the-shelf tools (not even the tools promulgated by the TEI). He has to find a programmer to write some code to render his documents readable and searchable, so he can start doing his analysis. So, he finds somebody who knows some Perl to do this for him.
“Wait, wait, wait, wait,” says the standards do-gooder, “You’re going to write some throw-away Perl to do that. You can’t do that. That’s anti-social. Respectable people don’t do throw-away Perl. We do XSLT. Well-documented and properly parameterized so that when somebody wants to go and re-use it, they don’t have it tied to your tags and your subjects; they can plug theirs in and their colors.” And, all of a sudden, we’re not talking about an afternoon of work anymore, are we?
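To give a sense of scale for that afternoon of work, here is a minimal sketch of the kind of throw-away script the researcher had in mind. It is written in Python rather than Perl, and everything specific in it is an assumption made for illustration: the sample sentence, the `COLORS` table, the function name `render`, and the use of `<rs type="...">` elements without a namespace (real TEI files declare a namespace, so the tag test would need adjusting).

```python
from html import escape
import xml.etree.ElementTree as ET

# Hypothetical mapping from the researcher's categories to highlight colors.
COLORS = {"crop": "yellow", "weather": "lightblue"}

def render(elem):
    """Return HTML for elem, wrapping <rs type="..."> in a colored span."""
    parts = [escape(elem.text or "")]
    for child in elem:
        parts.append(render(child))
        parts.append(escape(child.tail or ""))  # text that follows the child
    inner = "".join(parts)
    if elem.tag == "rs" and elem.get("type") in COLORS:
        color = COLORS[elem.get("type")]
        return '<span style="background:%s">%s</span>' % (color, inner)
    if elem.tag == "p":
        return "<p>%s</p>" % inner
    return inner  # unknown elements: keep the text, drop the tag

# Invented sample, not taken from any real diary.
sample = ('<p>Planted <rs type="crop">corn</rs> today; '
          '<rs type="weather">heavy rain</rs> at noon.</p>')
html = render(ET.fromstring(sample))
print(html)
```

The script is deliberately tied to one project's tags and colors, which is exactly what the do-gooder objects to; making it documented and parameterized for strangers is where the extra cost comes from.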
So, he gets somebody to write this thing and document it.
And now he’s out of money. He now needs to go back to his funding agency and say, “You gave me the money to be able to get the first of these things up, searched, and analyzed, but I was busy being socially helpful. And I have learned a bunch of tag sets, I have learned a bunch of specs, I have had some code written that can be re-used, and I have no documents and no analyses to show for it. Can I have some more money?”
How does that story sound to you? [Answer from audience: “Typical,” to which Tommie responds, “Yeah, but is it a good idea?”] And how likely is our young researcher to get more money with so little to show for the funds he already spent? Even if this is par for the course, how much easier would it be for him to get additional money if he did have something to show for his previous funding?
Inappropriate Standards Application in the Commercial World
So, does this only happen in the academic world? Not a chance.
We don’t have time to go into details on this one, so I will simply tell you that I know of a very, very large commercial publisher who has a database intended to be the font from which vast riches, in the form of re-used content, will flow. All content published after the start date for the re-use project must be either produced in or converted to the XML format for this database and the content stored in the database. Then, editors or marketers looking for existing material they can incorporate in new products can simply search this database, find the bits they want to re-use, copy them out, and combine them with other resources from the database to make new targeted products. This means that the budget for every single publication needs to include making this XML, either from the XML they may need to produce the product-specific electronic product they are going to sell or from the typesetting files used to create their print publication. Now, for some of their content this makes sense; they have a lot of text books and reference books on the same subject matter, and specialty volumes can often be created by combining the portions of several text books and sections of reference volumes that address a specialized topic. However, there are publications that everyone knows cannot be so re-purposed, and they must participate anyway. Because it is a corporate standard, and for no other reason.
Underspecification Hurts, Too
The folks at Mulberry recently wrote a tag set — actually, we revised a public tag set, customized it — for a publisher who wanted to move from their old SGML-based document model to a new XML-based one. They said that they wanted this new model to last as long and be as flexible as the model it was replacing, which had been in use for over 20 years. They wanted a really good model of their documents; highly semantic, rich enough to support typesetting as they typeset now (meaning they needed to be able to override the formatting manually in many places), vendor and technology neutral, and adopting all of the appropriate standards. Oh, yeah. And they wanted it to be as compatible as possible with their existing model for a related type of materials. By that they meant that paragraphs should have the same tag, and sections should be recursive, as they were in the other documents. You know, stuff like that. Use the same list and table tagging. We sat with them for quite a while, learning about their documents, their plans for the future of their documents, the functionality they could imagine supporting in future electronic versions of their documents, the variations in their documents historically, and even how they might want to market their documents, and thus what information their marketing people might want from the documents to support marketing.
We built them a beautiful document model, if I say so myself. It met all of the functional requirements we discussed, and it was graceful. We sent them a draft, with draft user documentation and some guidelines on how to evaluate it. And the day after they received it, we got a phone call saying, “We love it.”
“Okay, how do you love it so quickly?”
“We love it because we just slid it into the editing application we’re using for our other documents, and it works really well.”
At that moment I should have said, “Oh, @$&#*, we’re in trouble.” But I didn’t, because I was busy being happy that my client was happy. I hadn’t realized at that point that using their old editing application was not one of the functional requirements, and they loved it for doing something they’d never said they wanted it to do. I would have been right if I had started to worry then. Three or four days later I got email from the project manager, who knows the organization, and the documents, and their end users, but is as technical as my left shoe. This email said essentially “It’s a really nice tag set, but there are a few things in it that you should modify for current best practice.” Now, I expected and wanted a set of comments and suggestions, but these suggestions were really odd. They were phrased as “Most people now …” or “It is best practice to …” or “You need XYZ in order to be able to do ABC with your documents.” Now “most people” and “best practice” are pretty soft, so I let them go, and started with the “You need XYZ in order to ABC.” The email suggested that since there were elements in common in the metadata for the documents and in citations (that is, the document would have a title, an author, and a publication date, and the documents cited in the document might have titles, authors, and publication dates), we should put all of the citation-related content into its own namespace.
“Grrr,” I said, at least to myself. That may be convenient for this particular application,
but it is a really revolting thing to do in general. It is a sneaky underhanded way
to make the names of the same information in different contexts different, and in
a way that may surprise future users. You can simply look up the tree to see what
the context is of that author, or title, or … Sigh.
“And it is best practice,” continued this email, “to have not only tagged in your data all of the parts of the names of the people, but also to store the full display name, including all punctuation.”
“And it is best practice to store in addition to the titles and all of the subtitles of which there may be none or multiples, a combined title which is, for example, the display title.” Since when is it best practice to do this? Best practice according to whom? It does not seem to me to be “best practice,” or even “good practice” to store the same content twice, nor does it seem like even acceptable practice to store computable data in the same XML file as the source from which it could be computed.
So I pushed back a little, and I discovered that they not only had selected a database
in which this stuff was going to reside, they had already customized the database
product and written the searches, before bringing us in to design the tag set. And
they needed a tag set that was going to work with this database product in this particular
database with these already canned searches. So now I know what “best practice” means in this environment. “Best practice” means “works in the database they already have.” This also tells me what “most people now” means; it means that “since I have been working pushing a product that requires this, most of the applications I see do it that way.”
I suggested that they create two XML formats: one generic format, in which information
was stored only once (instead of, for example, as the parts of each name as fine granules
and as combined names for sorting), and in which context was used to identify what
sort of date, author, heading, etc. each was. And then a second format, which could
be automatically created from the first, which was optimized for their database. This
would allow them to create, quality control, and manage for the long term documents
that were smaller and
cleaner, in that they didn’t contain duplicate content. They rejected this suggestion; they
didn’t want the complication, and they had apparently been assured by their database
vendor that the database was XML, so they would have all of the advantages of XML
when they used this database.
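The two-format suggestion can be sketched roughly as follows. This is an illustration in Python, not the client's actual model or tooling: the element names (`article`, `metadata`, `author`, `given`, `middle`, `surname`, `display-name`), the placeholder name, and the function `to_database_format` are all hypothetical. The idea is only that the generic file stores each fact once, and the database-optimized file is generated from it on demand.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical "generic" format: each name part is stored exactly once.
GENERIC = """<article>
  <metadata>
    <author><given>Jane</given><middle>Q.</middle><surname>Public</surname></author>
  </metadata>
</article>"""

def to_database_format(generic_root):
    """Return a copy of the tree with a derived <display-name> per <author>."""
    derived = copy.deepcopy(generic_root)  # never touch the archival copy
    for author in list(derived.iter("author")):
        parts = [author.findtext(tag) for tag in ("given", "middle", "surname")]
        display = ET.SubElement(author, "display-name")
        display.text = " ".join(p for p in parts if p)  # skip missing parts
    return derived

generic = ET.fromstring(GENERIC)
db = to_database_format(generic)
print(db.find("metadata/author/display-name").text)
```

Because the second format is regenerated rather than edited, the two copies of a name cannot drift apart; the cost is one more step in the production pipeline, which is the complication this client did not want.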
They were quite disappointed in the design we sent them. And they were right to be disappointed; it didn’t meet their real requirement. What did they really want? Well, they had just made a major investment in a tool and were spending a small fortune building a system around that tool. They needed a model for their XML that would work gracefully with that tool. That was their top priority, and the model we had designed was not optimized for the tool. We had not met their real needs.
How did this happen? I think it happened because they have heard what XML is good for, and they thought, “Good was good.” If they told us that they wanted a model that was “good,” it would work in their tool. “Good XML is vendor neutral.” “Good XML uses all appropriate standards.” “Good XML enables as much semantic tagging as possible.” They thought that if they wanted all of those good things they would end up with good XML, and that would work well in their tool.
They needed a model that would support the tool in which they had made a major investment.
This tool and the application built on top of it would make or break their business.
That, actually, was their top priority. And in a half-day discussing their functional
requirements and several days discussing the details of their model, they never mentioned
it. Because they assumed that a
good XML model would be good for their chosen tool.
So, we modified the model to accommodate the tool. Did we compromise our principles?
I don’t think so. Is the model they are using less elegant than the draft we sent
them? Yes. It has some unnecessary namespaces, for example. Why? Because either the
tool or the application designer (I don’t know which, and it doesn’t really matter)
can’t cope with context. They want (need?) to have a different namespace for the
<author> in a <citation> than for the
<author> in the document metadata. From a strictly XML design point of view, this is silly;
any application can know what the context of that
<author> is. However, it does no harm in the long run; all the information we modeled is still
there, and if they switch to a different application that doesn’t need these namespaces,
and perhaps that doesn’t want to deal with all these namespaces they can remove them.
The context is still there, after all. Similarly, the XML files contain some things
that are completely derived from other things in the file: Those same author names
are not only stored as first name, middle initial, and surname, they are also stored
in the XML file as
“display name,” meaning all the parts of the name, in one element, with spaces as needed. Is this
harmful? No, it’s clutter. (Well, it isn’t harmful unless/until someone updates
one but not the other copy of the same information; that’s the danger of storing
information twice in any format.) But apparently either the tool or the application
finds it unacceptable or inefficient to concatenate the parts of the name and spaces
on display. This sets my teeth on edge; it shouldn’t be in the long-term storage XML
file; they are paying conversion vendors to create the same data twice (or, preferably,
to produce it programmatically and charge them as if they had created it twice). Worse,
they are going to ask their editors to create the same data twice, and are either
going to add a validation step to ensure that the data is the same or deal with the
inconsistencies that storing the same data twice creates. (If you need to store information
like that in your database …) But if having it pre-assembled really makes the database
or the application more efficient, or even if it simply caters to the limitations
of the application developer, then it needs to be there.
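Two claims in that paragraph can be made concrete with a small sketch: that context is recoverable from the tree without extra namespaces, and that storing a name twice calls for a validation step. The Python below is only an illustration under assumed names; the element names, the placeholder authors, and the functions `authors_with_context` and `stale_display_names` are hypothetical and not part of the tag set described in the talk.

```python
import xml.etree.ElementTree as ET

# Hypothetical file in which names are stored both as parts and as a display name.
SAMPLE = """<article>
  <metadata>
    <author><given>Jane</given><middle>Q.</middle><surname>Public</surname>
      <display-name>Jane Q. Public</display-name></author>
  </metadata>
  <citation>
    <author><given>John</given><surname>Doe</surname>
      <display-name>J. Doe</display-name></author>
  </citation>
</article>"""

def authors_with_context(root):
    """Yield (parent tag, author element); the parent tag is the context."""
    for parent in root.iter():
        for child in parent:
            if child.tag == "author":
                yield parent.tag, child

def computed_display_name(author):
    """Join the name parts that are present, separated by single spaces."""
    parts = [author.findtext(tag) for tag in ("given", "middle", "surname")]
    return " ".join(p for p in parts if p)

def stale_display_names(root):
    """Return (context, stored, computed) wherever the two copies disagree."""
    problems = []
    for context, author in authors_with_context(root):
        stored = (author.findtext("display-name") or "").strip()
        computed = computed_display_name(author)
        if stored != computed:
            problems.append((context, stored, computed))
    return problems

root = ET.fromstring(SAMPLE)
contexts = [ctx for ctx, _ in authors_with_context(root)]
problems = stale_display_names(root)
print(contexts)
print(problems)
```

Here the same `author` tag is told apart purely by its parent, and the check flags the citation whose stored display name no longer matches its parts. A real check would also have to decide what counts as "the same" (punctuation, initials, spacing), which is part of the cost of keeping both copies.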
The Mulberry pig got passed around a few times during the process, but now everybody is happy. Do you know about the Mulberry pig? We have a trophy at Mulberry; it’s got a big, fancy, gold color base and a big pseudo-malachite column and a little gold color pig on the top. When you grump creatively or excessively, especially about a project or client, the pig sits on your desk until the next person earns it. Well, the pig passed around quite a bit for a few weeks.
I don’t want to tell you not to use standards. I don’t want to tell you that standards are bad. I don’t want to tell you to ignore advice given to you by your friends and colleagues. I think I want to suggest that you treat suggestions to use a standard the way I treat solicitations from charities.
My guess is that I get an average of ten solicitations from charitable organizations a week. Some of them are from organizations promoting a position I find abhorrent, and these are easy to deal with: into the recycle bin they go! Most of them are from organizations doing work or promoting causes with which I have some sympathy. (There are no diseases for which I think a cure would be a bad thing.) And a few are solicitations for something that will be of immediate use to me as well as perhaps helping the world; for example membership in my local zoo not only supports the zoo, it gets me free parking at the zoo. Similarly, use of some XML application standards not only increases the chances that unplanned and unknown users will find your data useable, it may well enable you to interchange data with your business partners.
There is no question about contributing to those charities for which I see immediate personal benefit; I send them money every year. As for the rest: I have a budget, and I try to help as many worthy causes as I can within that charitable contribution budget. I do not even try to donate to every worthy cause that asks for money, nor do I even try to donate as much to any of them as they ask. I simply can’t.
I suggest that you might want to consider the various XML specifications, standards, and applications that vie for your time and money in the same way. Set a “standards compliance budget” and stay within it, both in terms of time and money. Of course, if there is an immediate win to your project from using a specific specification, use it. If there might be a benefit to the world in general, or to some unknown future users of your data if you add some metadata, or use an existing tag set, or some such, decide if this additional effort is within your “charitable contributions budget.”
So, as you participate in Balisage this year I charge you to consider the relevance of the various technologies, specifications, and applications you are hearing about. Learn as much as you can. Add as many tools to your toolbox as possible. Consider as many points of view as your mind can manage. And when you go home, remember that you don’t have to use your shiny new hammer on every task. Think carefully about which of the specifications and standards you know of should be used in what situation, and to what benefit.