Topic maps in near-real time

Sam Hunting

Introduction

Why did you adopt a Q&A format?

Writer’s block. When I found myself making a process flow diagram figure, I felt that a lot of people here can do that better than I can. So why do it? Ditto for topic map theory, where I’m sure I’ve committed at least one major howler. What I do claim to have is a topic map application, complete with a colorable disclosure, that can deliver unique value to users on a widely used platform, and whose development has been informed by all the work we’ve done together on topic maps over the years.

Why did you write a topic map application?

I believe that the news, like food, should be good, clean, fair--and local: This toxic waste dump, this voting machine debacle, this housing authority scandal, this corrupt official. Contemporary journalism is neither good, clean, fair, nor local (see Bob Somerby). However, local news gatherers must also be able to connect their narratives to other, similar narratives that contain subject matter of interest to them. (Toxic Waste, Inc., knows all about Localities A and B, but unless Localities A and B know about each other, any narrative they create about Toxic Waste, Inc. will necessarily remain partial, and any local action based on that narrative could well lack critical information.) Hence, there is a requirement for a distributed information system that would enable localities to discover subjects of mutual interest, perhaps serendipitously, intrinsic to content that they themselves have created.

Hence topic maps.

Why did you write your topic map application in Drupal?

Subjects are a function of content:

Equation (a)

S = f(c)

and Drupal is the content management system par excellence. Drupal is open source (GPL 2.0); Drupal has a vibrant community; Drupal is hot, being blessed by Google ("Summer of Code"); Drupal has superior community building and categorization tools; and I am intimately familiar with the platform, having built, administered, and moderated a Technorati 5000 community site for several years. Drupal also adds functionality by adding modules, so I determined to write a topic map module for Drupal.

Proxy disclosure

Can you disclose your proxies?

Yes.

What are your proxies?

Here is the "proxy spider" diagram for my proxy; it works, at least, like the BigAssert assertion model devised by Steve Newcomb. Keys that are not "map," "association caption," or "association type" are roles. Values with role keys are players.

Player properties may be related to a notation processor. The presence of a processor may, or may not, affect the subject identity of the proxy that contains the property.

When are two proxies the same?

Two proxies are the same when they are in the same map, their type properties are the same, their role/player combinations are the same, and, if a player has a notation that can impact subject identity, both notations are the same.
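The rule above can be sketched as a comparison function. This is an illustration in Python rather than the module's PHP; the `Casting` and `Proxy` classes, their field names, and the `IDENTITY_NOTATIONS` set are all assumptions made for the example, and ignoring the order of role/player combinations is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical: notations whose processors can impact subject identity.
IDENTITY_NOTATIONS = {"caselaw"}


@dataclass(frozen=True)
class Casting:
    role: str
    player: str
    notation: Optional[str] = None


@dataclass(frozen=True)
class Proxy:
    map_name: str
    type_name: str
    castings: Tuple[Casting, ...]


def identity_key(proxy):
    """Reduce a proxy to the parts that matter for sameness."""
    parts = []
    for casting in proxy.castings:
        # A notation counts only if it can impact subject identity.
        notation = casting.notation if casting.notation in IDENTITY_NOTATIONS else None
        parts.append((casting.role, casting.player, notation))
    # Sort so casting order is ignored (an assumption); duplicates are kept.
    return (proxy.map_name, proxy.type_name, tuple(sorted(parts, key=repr)))


def same_proxy(a, b):
    """True when two proxies are the same under the rule sketched above."""
    return identity_key(a) == identity_key(b)
```

Reducing each proxy to a key, rather than comparing field by field, also makes it straightforward to look up duplicates in a dictionary or set.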

Where are the individual topics?

There are no individual topics, because no topic can exist in isolation. So, in my disclosure, they’re properties. And I have to admit that when I looked at Barta's handy guide to the TMRM, I couldn’t make my diagram work any other way. This may be the howler to which my introduction alludes!

How can users navigate using your proxies?

As if properties were holes in a punch card:

Implementation

What were the main challenges you encountered during development?

Different layers of the LAMP stack needed different representations. User input required a representation that would work in a text box (since JavaScript editors are not ready, and would be WYSIWYG if they were, and XML editors are not available either). However, user input is no more appropriate for processing proxies than processing angle brackets, instead of using the DOM, would be. Finally, a relational representation is needed for storage.

What markup did you devise for user input?

A wiki-like markup language:

The workflow envisaged is a user integrating associations into actual text. The markup is reasonably easy for the user to enter, and reasonably easy for the application to parse.

As can be seen from the sample, [[ and ]] delimit the proxy. : delimits the properties of the proxy. Only the value of the player property is visible. (However, hiding a player is such a ubiquitous use case that there's a special syntax for it: a hat before the property delimiter hides the player, so ^:Bart hides "Bart".) = adds a notation to a player property, a la [[ ... text:Federalist 51=federalist ...]]]. (Later, we will see how the plugin for the federalist notation would process the player.)
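The property syntax just described can be sketched as a small parser. This is a Python illustration under stated assumptions, not the module's PHP: it handles a single role/player property only, assumes the role precedes the hat, and assumes the player text contains no "=" of its own. The function name and the dictionary keys are hypothetical.

```python
def parse_property(text):
    """Parse one 'role:player' property, with optional '^' and '=notation'."""
    role, sep, rest = text.partition(":")
    if not sep:
        raise ValueError("property needs a ':' delimiter: %r" % text)
    # A hat immediately before the delimiter hides the player.
    hidden = role.endswith("^")
    if hidden:
        role = role[:-1]
    # Assumption: the first '=' separates the player from its notation.
    player, _, notation = rest.partition("=")
    return {
        "role": role.strip(),
        "player": player.strip(),
        "notation": notation.strip() or None,
        "hidden": hidden,
    }
```

For example, `parse_property("text:Federalist 51=federalist")` yields the role `text`, the player `Federalist 51`, and the notation `federalist`, with `hidden` false.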

Implementation uses Drupal’s nodeapi hook; when content is being processed, any function of the form [drupal_module]_nodeapi is invoked, with the content as input and output for the function. So, topicmap_nodeapi takes content marked up as proxies and transforms it to the data structure used for processing the proxy, to which we now turn.

What data structure did you use for processing?

A tuple-like representation. Here is a sample:

Proxies need to be transformed from user input because they are processed in several ways: for TOCs, for legends, for validation, for plugins, etc. Because the wiki syntax isn’t suitable for PHP processing, we adopt the PHP mindset and represent proxies as arrays, so we get to use PHP’s rich variety of array functions.

After a false start using a keyed array (a proxy can have duplicate keys), I adopted the tuple representation shown. (Jack Park and I did a tuple representation of topic maps long ago, on the theory, which I still believe to be correct, that a Linda-like tuple space implementation would be a great way to federate topic maps, and this idea was inspired by that work.)
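The duplicate-key problem is easy to reproduce. The following Python sketch (not the module's PHP, and not its actual tuple layout) uses made-up keys and values to show a keyed structure silently dropping a repeated role, while a sequence of pairs keeps every one.

```python
# Hypothetical proxy properties; the role "member" appears twice.
pairs = [
    ("map", "sample_map"),
    ("type", "org"),
    ("member", "player_1"),
    ("member", "player_2"),
]

# A keyed structure keeps only the last value for a repeated key.
keyed = dict(pairs)


def players_for(properties, role):
    """Return every player cast in the given role, in input order."""
    return [value for key, value in properties if key == role]
```

Here `keyed` holds three entries, with `player_1` lost, whereas `players_for(pairs, "member")` still returns both players.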

This representation is efficient since the map value is always at position three, roles are every fourth array element starting at 12, roles are every fourth array element starting at 16, and so on. It will also map cleanly to XTM and JSON output formats.

Implementation (still in topicmap_nodeapi) takes post content with wiki-like markup embedded, parses it, rips out the properties for each proxy, validates each tuple either generically or by type, and associates the tuple with an offset ("8090","8096") back into the content. The tuple is then transformed into HTML by adding generic span, div, and class markup, and notation processing (for example, the federalist notation processor might transform its player data into an HTML A tag linking to an online version of the Federalist Papers).
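The locate-and-transform step can be sketched as follows. This is a Python illustration, not the PHP in topicmap_nodeapi: the regular expression assumes a proxy contains no nested brackets, the `proxy` class name is hypothetical, and validation and notation processing are left to the `to_html` callback.

```python
import re
from html import escape

# Assumption: a proxy is the shortest run of text between [[ and ]].
PROXY_RE = re.compile(r"\[\[(.+?)\]\]", re.DOTALL)


def find_proxies(content):
    """Return (start, end, raw_markup) for each [[...]] span in content."""
    return [(m.start(), m.end(), m.group(1)) for m in PROXY_RE.finditer(content)]


def render(content, to_html=escape):
    """Replace each proxy span with a generic span wrapping to_html(raw)."""
    out = []
    cursor = 0
    for start, end, raw in find_proxies(content):
        out.append(content[cursor:start])  # text before the proxy, unchanged
        out.append('<span class="proxy">%s</span>' % to_html(raw))
        cursor = end
    out.append(content[cursor:])  # text after the last proxy
    return "".join(out)
```

Keeping the start and end offsets with each proxy is what lets a TOC entry link back to the exact occurrence in the content.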

What relational structure did you use for storage?

A table that permits result sets like this, which you will shortly see is convenient for generating navigation tables, or TOCs:

(I apologize for not having a diagram; I couldn’t find a diagrammer that works with Postgres on OS X.) Here is the basic idea.

In the text box, users enter text values ("Bart","Alberto Gonzales"); it wouldn’t be sensible to have users do anything else. However, all processing on the database side takes place through manipulating integers called (adopting Drupal jargon), "nids," or node IDs. Therefore:

There is a table of values
There is a mapping table of value to nid
There is a table of nids (All logging data (creator, date created) goes on the nid table)
There is a "big table" with columns for each property. The columns are: [a]ssociation, [t]ype, [r]ole, [p]layer, [c]asting, [n]otation, and ac (association caption). Alas, ac and notation are optional, and so these columns can contain ugly NULLs.
The data in each column is a foreign key into the table of nids.
There are ancillary mapping tables of association to source (the Drupal post), nid to its autogenerated Drupal page, and so on.

The implementation is almost certainly naïve; I never did figure out a way to cram an association into a single row, because the role and player combinations vary in number. In practice, that means that to grab a single association, you need to grab all the rows that have the same values for a, t, r, and p (and are in the same map, and have notations that either do not affect subject identity, or are the same) and no others. Relational purists also take the view that auto-generating nids as relational keys is pernicious; however, that’s how Drupal does things. Even so, the implementation is operating fast enough to deliver value to users at the required scale.
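The table layout described above can be sketched with SQLite from Python. This is not the module's Postgres schema: the table and column names (`tm_value`, `tm_nid`, `tm_big`, `label`) are hypothetical, each casting is given its own nid as an assumption, the ancillary mapping tables are omitted, and, as a simplification, rows are grouped by the a column alone.

```python
import sqlite3

SCHEMA = """
CREATE TABLE tm_nid (id INTEGER PRIMARY KEY, creator TEXT, created TEXT);
CREATE TABLE tm_value (id INTEGER PRIMARY KEY, label TEXT UNIQUE);
CREATE TABLE tm_value_nid (value_id INTEGER REFERENCES tm_value(id),
                           nid INTEGER REFERENCES tm_nid(id));
-- One row per role/player casting; n and ac may be NULL.
CREATE TABLE tm_big (a INTEGER REFERENCES tm_nid(id), t INTEGER REFERENCES tm_nid(id),
                     r INTEGER REFERENCES tm_nid(id), p INTEGER REFERENCES tm_nid(id),
                     c INTEGER REFERENCES tm_nid(id), n INTEGER REFERENCES tm_nid(id),
                     ac INTEGER REFERENCES tm_nid(id));
"""


def open_store():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def new_nid(conn):
    return conn.execute("INSERT INTO tm_nid DEFAULT VALUES").lastrowid


def nid_for(conn, label):
    """Return the nid for a text value, creating both on first use."""
    row = conn.execute(
        "SELECT m.nid FROM tm_value v JOIN tm_value_nid m ON m.value_id = v.id "
        "WHERE v.label = ?", (label,)).fetchone()
    if row:
        return row[0]
    nid = new_nid(conn)
    vid = conn.execute("INSERT INTO tm_value (label) VALUES (?)", (label,)).lastrowid
    conn.execute("INSERT INTO tm_value_nid (value_id, nid) VALUES (?, ?)", (vid, nid))
    return nid


def add_association(conn, type_label, castings):
    """Store one association as one row per (role, player); return its nid."""
    a, t = new_nid(conn), nid_for(conn, type_label)
    for role, player in castings:
        conn.execute(
            "INSERT INTO tm_big (a, t, r, p, c) VALUES (?, ?, ?, ?, ?)",
            (a, t, nid_for(conn, role), nid_for(conn, player), new_nid(conn)))
    return a


def castings_of(conn, a):
    """Gather every row of one association back into (role, player) text."""
    sql = """
        SELECT rv.label, pv.label FROM tm_big b
        JOIN tm_value_nid rm ON rm.nid = b.r JOIN tm_value rv ON rv.id = rm.value_id
        JOIN tm_value_nid pm ON pm.nid = b.p JOIN tm_value pv ON pv.id = pm.value_id
        WHERE b.a = ? ORDER BY b.rowid
    """
    return conn.execute(sql, (a,)).fetchall()
```

The sketch shows the cost mentioned above: reading one association back means collecting several rows and joining each integer key back to its text value.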

Plugin Architecture

What are the advantages of notation plugins?

Notation plugins allow the administrator to add data-driven functionality to a Drupal site that uses the topic map module. For example, entering the following proxy:

[[test:test_type_7[role_7_2_1:player_7_2_1] and [role_7_2_2:364 U.S. 507=caselaw]]

uses the caselaw plugin and generates a sidebar with metadata about the case:

What notation plugins did you include?

There are notation plugins for aircraft tail numbers, email addresses, citations to the Bible, the U.S. Constitution, and the Federalist Papers, and geocoded maps.

The geocoding/mapping plugin takes a physical address as input, geocodes it, and returns a map, which illustrates the distinction between this topic map implementation and most other approaches to the semantic web: The plugins operate at the data level, not the resource level, and integrate into content at a point of the author’s own choosing. This is quite distinct from the model where resources are vertically organized by site, and then "mashed up" into a new resource that is still not integrated into content.
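Dispatching a player to its notation plugin can be sketched as a registry keyed by notation name. This is a Python illustration, not the module's plugin API: the `notation` decorator, the `NOTATION_PLUGINS` registry, and the HTML that the sample email processor emits are all assumptions.

```python
from html import escape

# Hypothetical registry: notation name -> function(player) returning HTML.
NOTATION_PLUGINS = {}


def notation(name):
    """Decorator that registers a processor under a notation name."""
    def register(func):
        NOTATION_PLUGINS[name] = func
        return func
    return register


@notation("email")
def email_plugin(player):
    """Illustrative processor: turn an email address into a mailto link."""
    return '<a href="mailto:%s">%s</a>' % (escape(player), escape(player))


def process_player(player, notation_name=None):
    """Run the player through its notation plugin, if one is registered."""
    plugin = NOTATION_PLUGINS.get(notation_name)
    if plugin is None:
        return escape(player)  # no plugin: show the player as plain text
    return plugin(player)
```

Because the processor runs on the player value itself, its output lands wherever the author placed the proxy in the content.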

What are the advantages of association type plugins?

In a word, validation.

Validation parameters can be set by the user:

And enforced by the application:

The interface is crude, being CSS-driven and therefore not dynamic, but usable.

What type plugins did you include?

Plugins for the types required by ISO 13250 (class/instance and supertype/subtype), as well as types for asserting that two properties are the same, and an "org" type, for analyzing who reports to whom inside an organization (like, in the sample topic map, the Bush administration).
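Validation by type can be sketched as a table of required roles. This is a Python illustration under assumptions, not the module's validation logic: the rule table, the role names, and the error wording are hypothetical, and the real validation parameters are set by the user.

```python
from collections import Counter

# Hypothetical rules: for each type, how many times each role must appear.
TYPE_RULES = {
    "class-instance": {"class": 1, "instance": 1},
    "supertype-subtype": {"supertype": 1, "subtype": 1},
}


def validate(type_name, castings):
    """Return a list of error strings; an empty list means the proxy passes."""
    rules = TYPE_RULES.get(type_name)
    if rules is None:
        return []  # no type plugin registered for this type
    counts = Counter(role for role, _player in castings)
    errors = []
    for role, expected in sorted(rules.items()):
        if counts[role] != expected:
            errors.append("role %r: expected %d, found %d" % (role, expected, counts[role]))
    for role in sorted(counts):
        if role not in rules:
            errors.append("role %r is not allowed in type %r" % (role, type_name))
    return errors
```

Returning a list of messages, rather than a single boolean, lets the application report every problem with a proxy at once.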

What are the advantages of search plugins?

The topic map paradigm is extremely rich, and provides an almost unlimited number of ways to "connect the dots." However, since we can’t know the method appropriate to navigating a corpus before actually knowing the corpus, it makes sense to enable developers to create plugins, rather than decide in advance that I know better than they do.

What search plugins did you include?

The "degrees" plugin, for example:

This example shows that indeed the head bone is connected to the neck bone, the neck bone is connected to the back bone, the back bone is connected to the hip bone, the hip bone is connected to the thigh bone, and the thigh bone is connected to the knee bone. Which may seem trivial, unless you want to know how many degrees separate Alberto Gonzales from Monica Lewinsky, say. (Note that the implementation does not depend on the type of association in which the player participates, but solely on building a linked list of associations with player overlap. And sometimes, indeed, "you can't get there from here.")
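A degrees-of-separation search can be sketched with a breadth-first search over players that share an association. This is a Python illustration that substitutes breadth-first search for the module's linked-list construction; the function names and the representation of an association as a set of players are assumptions.

```python
from collections import deque


def build_neighbours(associations):
    """Map each player to the players it shares at least one association with."""
    neighbours = {}
    for players in associations:
        for p in players:
            neighbours.setdefault(p, set()).update(q for q in players if q != p)
    return neighbours


def degrees(associations, start, goal):
    """Return the shortest chain of players from start to goal, or None."""
    if start == goal:
        return [start]
    neighbours = build_neighbours(associations)
    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbours.get(current, ()):
            if nxt in previous:
                continue
            previous[nxt] = current
            if nxt == goal:
                # Walk the back-pointers to rebuild the chain.
                path = [nxt]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(nxt)
    return None  # "you can't get there from here"
```

The number of degrees is the length of the returned chain minus one. As in the text, association type plays no part; only player overlap matters.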

Interlude: The TOC

Why is your presentation of the topic map so resolutely un-Flashy?

Here is what the questioner means by "resolutely un-Flashy."

How the TOC works, from top left: Each numbered "stripe" is a single proxy, and each column is a property of the proxy. The downward pointing blue triangles link down into the occurrence of the proxy in the content. Further, each type plugin annotates its proxies by adding a footnote to a proxy’s property, where appropriate. The footnote shows metadata for the proxy via a "hover," and links to the TOC legend, also generated by the plugin. For example, all instances of "cat" are footnoted with "2," in blue, color-coding the note as applying to a supertype/subtype type of proxy. Clicking on footnote "2" takes the user to note "2" in the legend, where the user may click on the "cat" to go to a (dynamically created) page that shows all the proxies that use the "cat" property, or the user may click on the outline icon in parentheses, and go to the "cat" node on the type page, which shows the type hierarchy. (Different type plugins, as you see, have different colors, icons, and also organize their pages differently.) Naturally, if a new plugin were added, it too would add its own notes and legend, and have its own page, assuming it used the plugin API.

This is not sexy; most designers prefer a graph style presentation, with nodes and arcs, but I think that’s "Visualization That Doesn’t Help You Visualize." Such an approach has a number of disadvantages: The graphs are generally a Flash presentation or equivalent and so are not searchable via search engines, are static, can’t evolve with the community, and are not integrated into any content. In addition, they take up an awful lot of space on the page, and consume a lot of bandwidth. (There’s no reason to assume that net neutrality will continue, and so low bandwidth applications may assume increasing importance.) The un-Flashy TOC presented here has none of those disadvantages, and all of the functionality listed, none of which the Flash approach offers.

Example

Do you have an example of a topic map that uses your application?

Yes: The Criminal Bush Regime. Based on stories from the Washington Post and Slate, it contrasts the prose and parallel Flash-y approach to a topic map. Although it is not "local" (except inside the Beltway), I like to think it's good, clean, and fair, and a community could add value to it.

Conclusion

What would you say to anyone thinking about using topic maps?

Great paradigm, great people, great software.

Is the latest version of the topic map module for Drupal available for download?

Yes, at The Universal Pantograph.

Anything else?

Humongous thanks to the conference organizers for their patience, and for Balisage, too.

Balisage: The Markup Conference

Balisage Paper: Topic maps in near-real time
