The semantics of “semantic”
Copyright © 2013 by the author. Used with permission.
The first thing I have to say is “Welcome to Balisage, and welcome to Montréal.” This is my favorite week of the year, and I want to be sure that you all enjoy it too.
Before I start talking about what’s really on my mind and what I said I was going to talk about, which is “semantics”, I want to talk about something that should go without saying. Two people caught me yesterday and asked me to say a few words about expected behavior this week, as I did last year and the year before and probably every year before that. It’s what I think of as the “Conference Mommy Moment”. I’m talking about courtesy for everyone at this conference. As far as I’m concerned, every one of you is my guest, and that means I expect every one of you to treat each other as my guests. I don’t expect you to agree with each other, and in fact a large part of what we’re here to do is to disagree. That’s a lot of the fun; we frequently disagree on how to address problems that we agree are important. And we sometimes disagree on what is important and why. I hope such disagreements are heartfelt, articulate, and clear. I insist they be respectful.
This means challenging the idea, not the person. By the way, “That’s the dumbest idea I have ever heard” is not challenging the idea. That’s challenging the person whether the sentence seems to use the word “idea” as its subject or not. “I think there are some important factors that you should consider that might change your mind, for example, …” is challenging the idea.
Another aspect of treating each other with courtesy is using the microphones. Many of you are used to speaking to large groups. Many of you believe you can be heard well without a microphone. I am telling you that there are people in this room who will not be able to hear you or understand you if you don’t use a microphone. If you have anything more to say than “Yea, verily, yea,” get up out of your seat and go stand at the microphone. (Shouting “Yea, verily, yea” from your seat will do just fine.)
Eschewing the microphone means that people with less than perfect hearing may have trouble understanding you, and people with relatively little English — or with an English accent dramatically different than yours — will have a difficult time understanding you. And it wouldn’t be a bad idea to eschew words such as “eschew” which many people don’t understand.
I like Balisage for a lot of reasons. I like Balisage because I think of it as a playground where a group of really smart people gather together to learn from each other. This means it is a gathering of friends, but it is also a gathering of “might be” friends. I suggest that those of you who know twenty or thirty people in this room consider the fact that there are probably twenty or thirty other people in this room who might become friends if you spent time talking with them. It is easy for those of us who know and like each other and see each other once a year to concentrate on old friends. I’m not saying you shouldn’t do that — I’m not saying you shouldn’t enjoy the people you know — but meet some of these other people because it’s a good bet they’re pretty smart.
“Tell them where to meet!” First of all, there’s the riskiest thing to do — which I highly recommend: sit down next to somebody you don’t know at lunch. What’s the worst thing that’s going to happen? They’re going to be boring. Odds are, they won’t. Because they’re here. And we don’t have a lot of boring people here.
However, we also have a conference office downstairs. You can tell it’s the conference office because there’s a Balisage logo on the door. And when that door is open, come on in! We have chairs, we have tables. We’ve also got a printer there, so stop in if you need to print a few pages. And we’ve got a refrigerator with some water in it. So come by! This evening go across the hall to dinner. After that, if you’re looking for a group of people to go out with, check the office; in the early evening, people will gather there and make groups. It’s a good place to be.
This is sort of a week-long party. I don’t actually go to parties where people have projectors and throw their slides up on the wall, but I do go to parties where people climb on metaphorical soapboxes and make speeches about all sorts of things that they care a lot about, which is one of the things we’re going to do at Balisage.
I actually go to dinner parties where things get a little weirder than that. I remember one group — I think there were eight of us — in a very, very fancy French restaurant in the Georgetown area of Washington where somebody pulled out a dataflow diagram and passed it around the table. And the first person who got it, looked at it for a minute and snickered, then passed it to the next person who laughed. It went all the way around the table, and everybody laughed. After everybody had had a chance to look at the joke, we discussed where the error had to be and how silly this diagram was. It said there was one office that was going to fill with paper documents because there were all these paper documents being printed, and they went to this one person, and they never left. And then the guy who brought this diagram pulled a photograph of that office out of his pocket. It was stacked to the ceiling with paper, with little paths through it. We laughed at that, and it occurred to somebody that there we were sitting in a restaurant drinking wine … and passing around a dataflow diagram. Some of those people are here, and many of the rest of you would have enjoyed the evening. It’s that kind of crowd.
I like Balisage because I learn a lot here. I learn about interesting things people are doing with marked up content and interesting applications people are developing to create and manipulate and display marked up content. But that isn’t surprising. It is the markup conference, after all.
More interesting, I learn about the sorts of problems that people are paying attention to in the first place. As I read the submissions to this conference, I read about people using markup to solve problems I hadn’t even realized were problems. People are doing work I never thought anybody cared about, solving problems I had not thought about at all. That’s really interesting.
Every year I leave here knowing that I know less than I thought I did when I arrived. I recommend that approach. I suppose that’s the reason my house is filling with books, though now that I’m buying both print books and electronic books, I keep hoping that will mean the rate at which I acquire physical books is going to be reduced.
I leave Balisage with a long list of things I need to learn more about: details of things I didn’t know about, aspects of specifications and tools I hadn’t known about, big things I need to know about, whole new specifications, capabilities, problems. That’s a lot of what you’re going to pick up here: things to go learn about.
One of the things that I keep learning about is the limitations of my knowledge and the way that I reveal those in my use of language. I, for example, cannot have an informed conversation with my nephew about Pokémon because I use the language incorrectly and it is clear that I don’t know what I’m talking about. (I’m actually okay with not being able to have an informed conversation about Pokémon with a five-year-old.) But I’m not so okay about not being able to have what sounds like an informed conversation about some of the things that I think matter to me, for example, semantics. (Hey, I finally got to what I said I was going to talk about.)
I recently learned that I don’t know what the word “semantics” means. This is interesting because I thought I did. I thought I had known that for a long time. Way back in pre-history, before I had ever heard of markup, when the only use I knew for pointy brackets was in mathematical expressions (some of us were not born knowing what chicken lips are good for, you know), I thought I knew what semantics meant. I thought semantics was the study of what words meant, and the study of semantics focused on how the meanings of words were created and changed and, occasionally, lost. I had more than a casual interest in words and terms and their meanings; my first real job, after I escaped from the educational establishment, was as a lexicographer. A lexicographer?? A person who creates dictionaries, or in my case, controlled vocabularies for search-and-retrieval systems. Semantics was the heart of my work. I thought I knew what semantics meant. We were very careful to create controlled vocabularies that had a minimum of homophones in them and a minimum of ambiguity (notice we never promised we would get rid of either); we were very careful in selecting our terms and defining them to control the semantics of the datasets that were going to be indexed with them. So, a long time ago I knew what semantics meant.
And then I got involved with markup. I got there because the way we were creating these full-text databases that we were indexing with the controlled vocabularies I was making is that people typeset and printed this material, and the first two copies to come from the printer were then sent immediately to be processed into the full-text system. So we waited until we had bound books, then we cut off the bindings, and we sent them to two different rekeying shops to be retyped. Those two files were compared, and then we built a searchable database from them.
This was problematic. Among other things, people had already proofread it once to make the print, and they didn’t want to do it again, besides which it took a long time and was expensive. The full-text databases might take up to a year or year and a half after the print was out. This was imperfect. Very imperfect.
So, we started getting involved in ways we could make both the print and the database from the same source, and we had these pointy-bracket things that we were putting around parts of the document. We had documents in which parts were laid out in a grid shape. And there were two different ways that we could identify the information we wanted to see in a grid. We could identify it by what kind of information it was, for example, “condition,” “patient age,” and “dosage.” Or we could identify it by rows and columns: how did you want it laid out in that grid shape? You could do a lot more with the information if you identified it by what kind of information it was (“condition,” “patient age,” and “dosage”) than if you had “first column,” “second column,” “third column.” But it was a lot harder to set things up so that it displayed in the grid shape you wanted. You only did that for stuff that you had a lot of and that was worth spending effort on.
I learned that when you identified the kind of information it was, we called that semantic tagging. And when you identified what cell in the table you wanted it in, we called that syntactic tagging.
Okay, this was related to what I thought I knew. Semantics means “what it means,” and syntax means “what you want it to look like.” Okay, we can work with this.
And then I learned that the line between syntax and semantics is a squishy line; it slides around. Somebody explained it to me: “If you understand how to cope with it, it’s syntax. And if you can’t quite manage it — there’s a little magic in the formula — it’s semantics.” (Actually, I think there may be something to that.)
I have recently learned that semantically-rich content is the key to future wealth: precise searching, high quality retrieval, good health, great weather. Or at least that’s what it seems like. The “semantic web,” for example, is going to solve all of our economic, social, and perhaps even climate problems. Some of the people promising these wonders berate me because I’ve advised clients to manage their content in ways that are not “future friendly,” specifically in ways that are not semantic. Okay, what do we mean by that?
Well, I’ve advised them to identify names and given names and family names and to associate them with institutional and public identifiers; to identify institution names, street names, drug names; to identify if a drug name is a brand name, a generic name, or a street name. I think identifying all that stuff in a document collection is creating a semantically-rich document collection, so why are these people so cross?
Because there’s no semantic tagging in those documents! What?? Well, there isn’t a triple in sight. Oh, I understand. There is only one appropriate syntax for semantics now. Whoops.
That’s just a little bit narrow-minded. I wish it were really news; it actually isn’t.
A few years ago (actually, quite a few years ago, now that I think about it) someone asked for permission to include the proceedings of one of Balisage’s predecessor conferences in a topic map about things related to SGML. (Remember topic maps? Topic maps were cool. Does anybody remember SGML?) So, after all the appropriate logistical details had been taken care of and the lawyers had all waved their hands and blessed this activity, I provided the necessary SGML files and instantly received back a howl of anger. What could we have been thinking?? This data was unusable. How stupid could we possibly be? Didn’t we know that the point of creating SGML was to make it repurposable? (Actually, I had thought that the point of creating that particular SGML was to create the proceedings for that event, not to accommodate unknown future users and to work with products that had not yet been invented, but perhaps not.)
So, what was the problem that caused this outrage? The keywords were hierarchically structured; this was already problematic. Worse, the nesting was indicated with colons! This was outrageous. It was unprofessional. It was unacceptable. It was slovenly. It meant that they would have to preprocess the data before they could pour it into their tool.
I smell a failure of imagination.
These sorts of disagreements on the meaning of semantic don’t actually end there, and they didn’t end many years ago when we were talking about inconceivably badly structured SGML files because they contained colons. My colleague, Debbie Lapeyre, who is sitting over there, was the JATS expert in a group of people who recently wrote a paper entitled “From Markup to Linked Data: Mapping NISO JATS version 1.0 to RDF using the SPAR Ontologies.”
Her co-authors were astonished that, in this day and age, a respectable and respected vocabulary such as JATS could be totally lacking in semantics. Her co-authors were shocked to learn that in JATS, for example, the names and the definitions of the tags embedded in the document identified what the information meant; there wasn’t a separate ontology that identified what the tags meant.
Perhaps even more peculiar to them was the fact that structures nest and context affects meaning. For example, the tag <article-title> in the header of the document contained the title of the article, and the tag <article-title> in a citation in the reference list contained the title of an article being referenced by this article. That was very peculiar to them.
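The point that context affects meaning can be made concrete with a minimal sketch. It assumes Python and the standard library's `xml.etree.ElementTree`; the sample document is invented and heavily simplified (it is not a complete or valid JATS file), and the function name `classify_titles` is hypothetical. The same `<article-title>` tag is read two different ways depending only on where it sits.

```python
import xml.etree.ElementTree as ET

# Invented, heavily simplified sample; not a complete or valid JATS document.
SAMPLE = """
<article>
  <front>
    <article-meta>
      <title-group>
        <article-title>The article you are reading</article-title>
      </title-group>
    </article-meta>
  </front>
  <back>
    <ref-list>
      <ref>
        <element-citation>
          <article-title>An article that this one cites</article-title>
        </element-citation>
      </ref>
    </ref-list>
  </back>
</article>
"""


def classify_titles(xml_text):
    """Return (role, text) pairs; the role comes from each title's ancestors."""
    root = ET.fromstring(xml_text)
    results = []
    # An <article-title> anywhere under <front> names this article.
    for el in root.findall("./front//article-title"):
        results.append(("title of this article", el.text))
    # An <article-title> inside a <ref> under <back> names a cited article.
    for el in root.findall("./back//ref//article-title"):
        results.append(("title of a cited article", el.text))
    return results


titles = classify_titles(SAMPLE)
for role, text in titles:
    print(f"{role}: {text}")
```

Nothing outside the document is consulted here: the tag name plus its position carries the distinction, which is the sense in which the nesting itself is doing semantic work.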
That and a variety of other oddnesses convinced them that either Debbie was from another planet or there were no semantics in JATS. They were very kind to her. I think they treated her as if she were an idiot savant, knowing a lot about this peculiar XML stuff, but knowing nothing about semantics … or anything semantic.
To give them credit, they did include a section in the paper that I think properly should have been entitled “Peering Through the Looking Glass.” They called it something like “Philosophical Differences between RDF and XML” — which in fact they didn’t really address — in which they described the difficulty of describing in RDF terms data that is based on such a different philosophical point of view. I wish they were here at Balisage because they were able to recognize that a completely alien viewpoint might not be stupid and might provide something of interest and importance to them.
We at this conference — each of us — are going to encounter a few completely alien viewpoints. That’s one of the things the Balisage committee looks for as we select the papers. When we get peer reviews back for Balisage papers, the papers we take instantly are the ones where some of the peer reviewers say “Yes, that’s brilliant!” and other peer reviewers say “No, that makes no sense.” We like those papers!! We like the alien viewpoints.
To circle back: it’s not that I don’t think ontologies are useful. I think they can be. In some circumstances, they are very valuable indeed. And I’ll even acknowledge that it’s possible — although in my opinion it’s far from a sure thing — that widely shared ontologies will be of increasing value. I’m suspicious because I used to write ontologies, and I know how good many of them aren’t. I don’t think links to ontologies are the only way semantically rich information can and should be created or managed. I don’t think RDF triples are the only syntax in which semantic information can be created, managed, or stored.
I’m reminded of long passionate arguments at predecessor conferences to this one about what kind of information should be stored in elements and what kind of information should be stored in attributes. People cared. People got red in the face. They banged on tables.
There were people who said all element content should be banned and all content should be stored as attribute values. They were serious, and they thought it was important.
There were people who said attributes were nothing but syntactic sugar and should be banned from the language. Everything should be element content; it would be much easier to process. They were equally serious.
There were people who came up with complex formulations for what should be in which place. My favorite one — which in some circumstances may be reasonable — was: If the end user should see it, then it should be element content. But if you use it to control the display or for other back room uses, it should be attribute content. That makes sense until you realize that you can’t actually decide for all time for any content what users will see and what they won’t because the line between data and metadata is really fuzzy and keeps moving.
But this mattered to them. It mattered to them a lot. And then XSLT came along, and it seemed to occur as a flash of insight to huge numbers of people simultaneously that it was actually a pretty trivial transformation to take something that was attribute content and make it element content, and take something that was element content and make it an attribute value if you really cared. We can change the syntax if it isn’t what we want for our tool at this moment.
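To show how small that transformation is, here is a minimal sketch. The talk credits XSLT with the insight; this sketch uses Python and the standard library's `xml.etree.ElementTree` instead, purely as an illustration. The function names and the `<drug>` sample are invented, and the sketch deliberately ignores namespaces and mixed content.

```python
import xml.etree.ElementTree as ET


def attributes_to_elements(el):
    """Move every attribute of `el` into a leading child element of the same name."""
    for index, (name, value) in enumerate(list(el.attrib.items())):
        child = ET.Element(name)
        child.text = value
        el.insert(index, child)
    el.attrib.clear()
    return el


def elements_to_attributes(el):
    """Move every text-only, attribute-free child of `el` into an attribute."""
    for child in list(el):
        if len(child) == 0 and not child.attrib:
            el.set(child.tag, child.text or "")
            el.remove(child)
    return el


# Hypothetical sample element, invented for this illustration.
drug = ET.fromstring('<drug name-type="generic" dosage="10mg"/>')

as_elements = ET.tostring(attributes_to_elements(drug), encoding="unicode")
print(as_elements)

round_trip = ET.tostring(
    elements_to_attributes(ET.fromstring(as_elements)), encoding="unicode"
)
print(round_trip)
```

The information survives the trip in both directions; only the syntax changes, which is why the element-versus-attribute argument lost most of its heat.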
Suddenly, that argument became moot. Fortunately, I haven’t heard it in a while.
I suggest that much, if not most, of the posturing about the appropriate syntax for semantic information is or should be similarly moot. It’s just not important. And if you disagree with me, I guess I can’t punch you in the nose — we’re at Balisage. And I’m not going to call you a nasty name because we’re at Balisage. So, I’m going to ask you to explain why you think it matters in a clear and coherent fashion. And I will remind you that if you explain it in a clear and coherent and respectful fashion, I might even change my mind and agree with you. And if your argument is “because that’s the way my application wants to receive it,” I will suggest that you go and buy a book on XSLT.
Which brings me back to Balisage. This is a week that seriously stretches my imagination. There are talks on the program that are on topics in which I have absolutely no interest; I’ll listen to some of them. Perhaps I’ll find that I should be interested. Perhaps I’ll find that I am interested.
In general, I have found that the more I know about anything, the more interesting it is. (Except perhaps basketball. I can’t manage to get interested in basketball despite working with a basketball fan. You know, to me it’s just basketball.)
I suspect you will find that some of the talks at Balisage are “just basketball.” You don’t have to be interested in everything, and you don’t have to pretend you are. But consider going to some things you don’t think you’re interested in because you might be surprised.
This week I expect us to hear about new languages, new projects, new uses for old languages, new names for old technologies. A colleague of mine recently told me that all the new ideas in computing are recycled and renamed from work the engineers at IBM did in the 1950s. “They invented everything,” said my colleague. For example, he insisted there is no such thing as “big data”; it’s just “volume.” And all the things they’re inventing now to deal with big data, such as linked lists and indexes, were invented in the 1950s by the guys at IBM.
I think he was exaggerating just a little. But he does have a point. We as a community are doing a lot of relearning, reinventing, and recycling. I also think we are doing some new thinking, combining some old clothes into new outfits, knitting some new fabric — have I mixed enough metaphors in that sentence? Can I get a few more in there if I try hard?
I have noticed the lifecycle of many languages seems to be similar, including many of the languages we’re going to be talking about here. They grow out of a need for an easy way to do something. A person or a group of people find that it’s unreasonably difficult to do something that would be very easy with a tool that was designed to support what it is they want to do. So, they design it, and some people build it, and users say “That’s wonderful! But it also needs to do this and this and this to meet our needs.” So they extend it. And then more users show up and start using this wonderful tool … and say “But it also needs to do this and this and this.” And it gets a little more flexible. “And we need it to be a little more abstract so it has a little more flexibility.”
And before you know it, you have a Turing-complete language. Cool, right? This is great. And people start using it for things the originators never imagined and that are completely unrelated to the original mission. And at this point, two things happen. First, a group of enthusiasts starts talking about using this tool or language for everything, eliminating the need for other languages that are older and clearly competing. And at the same time, some people (sometimes the same people, and sometimes different people) start talking about the need to simplify the language, removing features that are only used by a few fringe cases … like the original point of the language. Watch for that at Balisage this year. I think we have three or four papers that are really addressing just that.
Yesterday I heard someone say “I know I can do anything I need if I have a Turing-complete language, especially one that I like.” Well, to be more precise, it was Michael Sperberg-McQueen, and what he said was “I know that I can do anything I need if I have a Turing-complete language, especially one I like, like XSLT.” So, the question is: How do we decide which languages to like? Or perhaps, how do we convince other people to use the languages we like instead of the languages that they already know and already like?
It seems to me that we at Balisage have a joint interest in markup, marked-up documents, and tools that deal with marked-up documents. And one of the challenges that many of us face is convincing people who have had this bizarre XML stuff thrust upon them to use tools that are markup smart. To use the markup instead of treating an XML document as if it were a string. Not that they can’t do everything they need to do using string processing tools — they most assuredly can. But because that has them doing the same work two times, three times, four times … and maybe introducing errors. When they are using our technologies incorrectly, we tend to get very angry at them. Or we tend to feel sorry for them because they just haven’t been enlightened.
One of the things I challenge all of you to think about is how to persuasively explain to them that if they have markup, they ought to use it even if that means learning a new tool or a new way of thinking.
Balisage is a place to say what you think and think about what you think and why you think it. My brother recently asked me what I thought about genetically modified food crops. I said I didn’t know enough about them to have an informed opinion. He said, “Most people don’t. But what do you think?” He was fishing for an uninformed opinion.
One of the joys of Balisage is that we will have a mixture of informed and uninformed opinions, probably both equally passionately stated. One of the things I challenge you to do is to detect the difference, not only in other people’s informed and uninformed opinions — it really isn’t as simple as if they agree with you, they’re informed, and if they don’t agree with you, they’re uninformed — but also to challenge that in yourself. How many of your opinions are informed, and how many of them are uninformed? Can you tell the difference?
As you listen at Balisage, remember that the speaker may have a dramatically different point of view than you do. Try to understand it. Question the premises behind their point of view and their methods, processes, and conclusions. But start from the assumption that they are smart people working on important problems and that they have made educated decisions that put them where they are, even if you find that surprising. Maybe especially if you find it surprising.
When a speaker says something that contradicts something you know to be true, do not leap to your feet, rush to the floor microphone, and shout “You idiot!” Figure out a polite way to ask a question that unearths the reason this person so clearly disagrees with you. If possible, ask it in a way that convinces them to investigate the tool, technique, or position you hold. You will be more persuasive if you are polite.
You will not agree with everything you hear. It is my opinion that any statement that absolutely everyone agrees with is so bland there’s no point in stating it.
I recently went to a concert sponsored by the local folklore society where I live, held in a local church. I got there early so I could get a good seat. I was sitting with nothing to do for a few minutes, and being me, I picked up the local reading material and opened it. What was there to read, sitting in the church? There was the hymnal which had in its front cover the creed of this particular organization. A full page of fairly small print in which I decided that there was not one statement with which any human being on planet Earth could disagree. In other words, there was no content on that page. This was what they believe — so does everybody else. They weren’t telling me who they were.
We’re not going to have any content-free talks of that sort at Balisage. I don’t want to aim for them; I don’t want the speakers to give them. I want the audience to respect the fact that there will be content in the talks and therefore there will be things we disagree with. That’s a good thing.
State your positions, describe the reasons for them, allow for the possibility that there are things you don’t know — yeah, even you — and have a great time at Balisage.