
Balisage 2009 Program

Tuesday, August 11, 2009

Tuesday 10:00 am - 10:30 am

(FP) Standards considered harmful

Tommie Usdin, Mulberry Technologies

Standards and shared specifications allow us to share data, build general purpose tools, and significantly reduce training and customization costs and startup time. That is, the use of appropriate specifications can help us reduce costs, reduce startup time, and increase quality, usability, and reusability of content. Some vigorous standards proponents insist that the more standards used the better. To them I say “mind your own business and let me mind my own store”. They argue that using standards is always the right thing to do, because it enables re-use and interchange. Maybe so. But adoption of a standard that supports an activity that is not central to your mission is a distraction, an unwarranted expense, a bad idea.

Tuesday 11:00 am - 11:45 am

XML in the browser: the next decade

Alex Milowski, Appolux

XML in the browser was first demonstrated by Netscape in 1999. Since then, XML has become ubiquitous and browser technology has matured into a platform for delivery of complicated services and applications, based largely on a combination of HTML and JavaScript, which does not quite match the original vision of ubiquitous delivery of information over the web via specialized XML documents. All major browsers have the needed core technologies, yet XML applications can’t use them the way HTML applications can. We can deliver XML content augmented with application semantics within some of today’s browsers, although there are some limitations. These limitations and what can be done to overcome them in the near term are discussed, as are future directions. And, of course, there will be a cool demo or two.

Tuesday 11:45 am - 12:30 pm

(FP) Automatic XML namespaces

Liam Quin, W3C

The XML community has lived with XML namespaces for a decade. They are useful to the point of seeming indispensable, they are ubiquitous, and yet they are at the same time unwieldy and flawed. Namespace declarations can be inconvenient to remember, and mistakes in them are frequently the source of subtle, hard-to-diagnose errors. From a programming perspective, namespaces provide scope and disambiguation; from a document authoring perspective, namespaces provide headaches. By introducing a single new feature, namespace declarations could be simplified and namespace functionality enhanced without losing the existing benefits of namespaces. Let’s talk about making namespace lemonade from namespace lemons.

Tuesday 2:00 pm - 2:45 pm

Engineering document applications — from UML models to XML schemas

Dennis Pagano & Anne Brüggemann-Klein, Technische Universität München

UML models for documents need to be exchanged like other models. XML Metadata Interchange (XMI) satisfies this interchange requirement by representing UML models as XML documents. But it seems to us that UML class diagrams, which model persistent data, are more closely aligned with XML schemas, which model the XML representation of persistent data as documents. The Object Management Group (OMG) has defined a method to turn models written in its MOF modeling language into equivalent XSD schemas using XMI. Since MOF models can be considered to be a specific case of UML models, we patterned our method (named uml2xsd) after the OMG translation of MOF models into XSD, extending it to concepts present in UML class diagrams but not in MOF models. Our tool uml2xsd transforms an XMI representation of a UML class diagram into an XSD schema that constrains XML instances of the UML model.

Tuesday 2:00 pm - 2:45 pm

(LB) An XML user steps into, and escapes from, XPath quicksand

David J. Birnbaum, University of Pittsburgh

The otherwise admirable and impressive eXist XML database sometimes fails to optimize queries with numerical predicates. For example, a search for $i/following::word[1] retrieves all following <word> elements and only then applies the predicate as a filter to return only the first of them, which can be enormously inefficient when $i points to a node near the beginning of a very large document, with many thousands of following <word> elements. As an end-user without the Java programming skills to write optimization code for eXist, the author describes two types of optimization in the more familiar XML, XPath, and XQuery, which reduce the number of nodes that need to be accessed and thus improve response time substantially.
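
The cost difference described above can be illustrated outside of eXist. The following is a minimal Python sketch, not eXist's internals and not the author's XPath/XQuery optimizations: it assumes a small in-memory document, approximates the following axis over elements only, and uses hypothetical helper names (`following`, `counted`, `first_following_eager`, `first_following_lazy`) to compare filtering after full retrieval with stopping at the first match.

```python
import xml.etree.ElementTree as ET
from itertools import islice


def following(root, context):
    """Lazily yield the elements after `context` in document order,
    excluding its descendants (an element-only view of the following axis)."""
    walker = root.iter()  # document-order traversal, produced on demand
    for node in walker:
        if node is context:
            break
    else:
        return  # context is not part of this tree
    # Descendants come directly after their ancestor in document order: skip them.
    descendant_count = sum(1 for _ in context.iter()) - 1
    for _ in islice(walker, descendant_count):
        pass
    yield from walker


def counted(nodes, tally):
    """Pass nodes through unchanged while counting how many were produced."""
    for node in nodes:
        tally["visited"] += 1
        yield node


def first_following_eager(root, context, tag, tally):
    # Retrieve every match first, then apply the positional predicate [1].
    matches = [n for n in counted(following(root, context), tally) if n.tag == tag]
    return matches[0] if matches else None


def first_following_lazy(root, context, tag, tally):
    # Stop as soon as the first match appears.
    candidates = (n for n in counted(following(root, context), tally) if n.tag == tag)
    return next(candidates, None)


doc = ET.fromstring(
    "<doc><s><word>a</word><word>b</word></s>"
    "<s><word>c</word><word>d</word></s></doc>"
)
start = doc[0][0]  # the first <word> element
eager_tally, lazy_tally = {"visited": 0}, {"visited": 0}
eager_hit = first_following_eager(doc, start, "word", eager_tally)
lazy_hit = first_following_lazy(doc, start, "word", lazy_tally)
```

Both functions return the same element; the tallies differ because the eager version walks every following element before picking the first. The sketch still pays for locating the context node, so it only models the part of the cost that the abstract attributes to the unoptimized predicate.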

Tuesday 2:45 pm - 3:30 pm

Prying apart semantics and implementation: Generating XML schemata directly from ontologically sound conceptual models

Bruce Todd Bauman, U.S. Department of Defense

Models produced with mainstream modeling languages (e.g. UML, ERD) or directly in implementation languages (XSD, RDFS, OWL) reflect technology-specific design decisions. These modeling languages both obscure the expression of domain semantics and inherently limit the potential for model reuse in other designs. Frustrated by this, we have started to use an ontological profile of UML defined by Giancarlo Guizzardi in “Foundations for Structural Conceptual Models” (2005) to create conceptual models that document a shared, implementation-neutral understanding of a domain targeted for human understanding. Physical models are generated from these conceptual models and annotated with encoding directives that custom software uses to compile an XSD. By way of example, this talk introduces the constructs of the conceptual modeling language, the physical XSD annotations, and the XSD compiler.

Tuesday 2:45 pm - 3:30 pm

Formal and informal meaning from documents through skeleton sentences: Complementing formal tag-set descriptions with intertextual semantics and vice-versa

Yves Marcoux, Université de Montréal, C. M. Sperberg-McQueen, Black Mesa Technologies, & Claus Huitfeldt, University of Bergen

What do we mean when we add markup to a document? Proponents of two approaches to markup semantics (formal tag-set description and intertextual semantics) show how these two approaches can be combined to generate analytical tools for documents. With examples of increasing complexity, they demonstrate how an intertextual semantics approach can generate the materials for building a formal tag-set description.

Tuesday 4:00 pm - 4:45 pm

(LB) Akara - Spicy Bean Fritters and an XML Data services Platform

Uche Ogbuji, Zepheira

Akara is an open-source XML/Web mashup platform supporting XML processing in an environment of RESTful data services. It includes “Web triggers”, which build on REST architecture to support orchestration of Web events. This is a powerful system for integrating services and components across the Web in a declarative way, so that, for example, a Web request could call a service running on Amazon EC2 to analyze information gathered from social networks and filtered through a remote spam-detection service. Akara is designed from the ground up to support such rich interactions, using the latest conventions and standards of the Web 2.0 era. It’s also designed for performance, modern processor conventions and architectures, and for ready integration with other tools and components.

Tuesday 4:00 pm - 4:45 pm

Documents cannot be edited

Allen H. Renear & Karen M. Wickett, University of Illinois at Urbana-Champaign

What is a document? We often say that they are strings of characters (perhaps among other things). But strings or sequences of any kind are extensional objects: timeless, eternal, unchanging. How can an immutable object be edited?

Tuesday 4:45 pm - 5:30 pm

(LB) Visual Designers: Those XML tools with no angle bracket at all!

Jean Michel Cau & Mohamed Zergaoui, Innovimax

Is the future of XML planned to be without XML? Visual tools are everywhere and XProc might be the first XML dialect to be immediately available with its visual editor. After erratic evolutions, visual tools have become more and more precise (even HTML+CSS tools are now very powerful), and are becoming more and more mainstream. Could we imagine dealing with XML Schema without decent visual tools? In this presentation we will give an overview of where we already do XML without seeing any angle brackets and of the places where we expect to have equivalent tools soon.

Tuesday 4:45 pm - 5:30 pm

(FP) How to play XML: Markup technologies as nomic game

Wendell Piez, Mulberry Technologies

Projects involving markup technologies are game-like: they have players (teams and individuals), equipment, rules, victories, and defeats. In many of the markup games we play, the making of the game’s rules is part of the game itself. When the playing of a game involves the modification of the game’s own rules, it is said to be a “nomic game”. The process of legislation, for example — including the collaborative development of markup vocabularies and other markup standards — is a nomic game. This meditation considers how the experiences of earlier nomic games are influencing today’s contests, the far-reaching influence today’s nomic games will exert on those to be played later, and things to consider as we engage each other in the nomic games of markup theory and practice.

Wednesday, August 12, 2009

Wednesday 9:00 am - 9:45 am

TEI feature structures as a representation format for multiple annotation and generic XML documents

Jens Stegmann, Bielefeld University & Andreas Witt, Institute for the German Language (IDS), Mannheim

Annotated texts can usefully be represented in terms of feature structures — rooted labeled directed acyclic graphs. The ISO-standard tag set for the representation of the structural features of texts based on the feature-structure markup of TEI P5 can be used to integrate sets of annotation documents, such as different linguistic analyses of a common core; such an approach is known to facilitate the representation and processing of overlapping structures. Less frequently discussed is the possibility that any XML documents and document sets might be usefully represented in terms of feature structures, thus making the tools of computational linguists, and specifically the operations of unification and generalization, available in XML processing contexts.
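
As a rough illustration of the two operations named above, here is a minimal Python sketch. It is not the TEI or ISO feature-structure encoding and not the authors' tooling: it assumes feature structures modeled as nested dictionaries with string atoms, ignores shared (re-entrant) substructures, and uses `None` to signal failure or absence of common information.

```python
def unify(a, b):
    """Combine the information in two feature structures.

    Returns the merged structure, or None when the two conflict.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for feature, value in b.items():
            if feature in result:
                merged = unify(result[feature], value)
                if merged is None:
                    return None  # conflicting values for the same feature
                result[feature] = merged
            else:
                result[feature] = value
        return result
    return a if a == b else None  # atoms unify only with themselves


def generalize(a, b):
    """Keep only the information that both feature structures share."""
    if isinstance(a, dict) and isinstance(b, dict):
        shared = {}
        for feature in a.keys() & b.keys():
            common = generalize(a[feature], b[feature])
            if common is not None:
                shared[feature] = common
        return shared
    return a if a == b else None


# Two hypothetical analyses of the same token.
tagger = {"pos": "noun", "agr": {"num": "sg"}}
parser = {"agr": {"num": "sg", "per": "3"}}
combined = unify(tagger, parser)
clash = unify(tagger, {"agr": {"num": "pl"}})
overlap = generalize(tagger, {"pos": "noun", "agr": {"num": "pl"}})
```

Here `combined` carries the features of both analyses, `clash` is `None` because the number values disagree, and `overlap` retains only what the two inputs agree on.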

Wednesday 9:45 am - 10:30 am

Towards markup support for full Goddags and beyond: the EARMARK approach

Silvio Peroni, Fabio Vitali, & Angelo Di Iorio, University of Bologna

For representing overlapping structures, why not use something designed for graphs? Why not use ... RDF? EARMARK (Extreme Annotational RDF Markup) uses RDF to encode non-hierarchical structures (overlap, repetitions, transpositions) which have been previously addressed by TEI, LMNL, TexMecs, and XConcur, among others. OWL provides a standardized, well supported notation for declaring the document model for such complex structures. EARMARK documents can be translated, of course, into other notations, using well known techniques for working around restrictions of existing syntaxes. EARMARK thus provides a unifying model for a wide variety of phenomena of interest both in markup theory and in practice. And since it exploits RDF and OWL, EARMARK can be processed conveniently using existing RDF and OWL tools and technologies.

Wednesday 11:00 am - 11:45 am

Markup, meaning, and mereology

Claus Huitfeldt, University of Bergen, C. M. Sperberg-McQueen, Black Mesa Technologies, & Yves Marcoux, Université de Montréal

XML markup divides a whole (a document) into parts (elements), which may themselves be further subdivided into parts (also elements). Thus the part-whole relationship is central to our understanding of XML markup. Mereology is a branch of logic that deals with theories of part-whole relationships, without reference to the idea of sets and their members and dealing instead with part-whole and sum relationships between individuals. But documents are more complex than mere part-wholes, and the propagation of properties between the various parts of a document can follow diverse patterns. Some of these patterns are difficult to specify along the more commonly used containment/dominance dimensions of XPath (for example). We investigate whether some of these patterns can be more conveniently or usefully described with the formalism of the “Calculus of Individuals”, a mereological system worked out by Nelson Goodman and Henry S. Leonard that may have application to marked-up text.

Wednesday 11:45 am - 12:30 pm

TNTBase: Versioned storage for XML

Vyacheslav Zholudev & Michael Kohlhase, Jacobs University Bremen

Version Control systems like CVS and Subversion have transformed collaboration workflows in software engineering and made possible globally distributed project teams. Even though XML, as a text-based format, is amenable to version control, the fact that most version control systems work on files makes the integration of fragment access techniques like XPath and XQuery difficult. The TNTBase system is an open-source versioned XML database created by integrating Berkeley DB XML into the Subversion Server. The system is intended as a basis for collaborative editing and sharing XML-based documents that integrates the versioning and fragment access needed for fine-grained document content management. Our aim is to make possible the kinds of workflows and globally distributed project teams familiar from open source projects.

Wednesday 2:00 pm - 2:45 pm

Investigating the streamability of XProc pipelines

Norman Walsh, MarkLogic

High-performance XML processing, particularly on very large documents, requires that processing components be usable in streamed pipelines. XProc is a W3C specification for describing a sequence of XML operations to be performed over a set of documents. The spec imposes no streamability constraints, leaving it up to the implementation whether or not to stream. A streaming implementation could be expected to outperform a similar non-streaming implementation, but not all steps in such a pipeline may be streamable. So the question arises: Would a majority of real world pipelines benefit from streaming? As you read this, comparison data is being collected from thousands of pipeline runs (where the pipelines were not constructed by the author). Conclusions will be drawn.

Wednesday 2:45 pm - 3:30 pm

You pull, I’ll push: On the polarity of pipelines

Michael Kay, Saxonica

What's the most effective way to move XML data through a processing pipeline? The answer isn't always simple. Control flow in the pipeline can run either with the data flow ("push") or against it ("pull"), reflecting the "push" and "pull" styles familiar to XSLT authors; each is useful in some situations. Mixing them, however, presents challenges: buffering the data leads to latency and memory problems, while using multiple threads leads to coordination overheads. The concept of program inversion, originally developed to eliminate bottlenecks in magnetic-tape-based processes, offers help. In particular, ideas derived from Jackson Structured Programming allow processes written in a convenient pull style to be compiled into push-style code; this can reduce both coordination overhead and latency.
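
The two polarities, and the idea of inverting one into the other, can be sketched with Python generators. This is only an illustration under stated assumptions: it is not Saxon's implementation or the compilation technique the talk describes, the stage names (`pairs_pull`, `pairs_inverted`, `push_all`) are hypothetical, and a trailing unpaired item is simply dropped in both versions.

```python
def pairs_pull(source):
    """Pull style: this stage drives, asking upstream for each item it needs."""
    upstream = iter(source)
    while True:
        try:
            first = next(upstream)
            second = next(upstream)
        except StopIteration:
            return
        yield (first, second)


def pairs_inverted(downstream):
    """The same logic written in the same sequential shape, but inverted:
    each `yield` suspends the stage until the producer pushes an item in,
    and results are pushed on to `downstream`."""
    while True:
        first = yield
        second = yield
        downstream((first, second))


def push_all(items, make_stage):
    """Push-style driver: the producer controls the flow and feeds the stage."""
    received = []
    stage = make_stage(received.append)
    next(stage)  # advance the coroutine to its first `yield`
    for item in items:
        stage.send(item)
    stage.close()
    return received


pulled = list(pairs_pull("abcd"))
pushed = push_all("abcd", pairs_inverted)
```

The pairing logic keeps its state in local variables and its position in the code in both versions; only the direction of control differs, with no intermediate buffer and no extra thread.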

Thursday, August 13, 2009

Thursday 9:00 am - 9:45 am

A toolkit for multi-dimensional markup: The development of SGF to XStandoff

Maik Stührenberg & Daniel Jettka, University of Bielefeld

The Sekimo Generic Format (SGF) and its successor, XStandoff, use stand-off annotation to handle overlapping structures in richly annotated XML documents, while retaining XML compatibility and allowing the use of standard XML tools like XSLT. This paper describes the changes introduced by XStandoff and its suite of XSLT stylesheets. These can be used to create XStandoff instances from documents with inline annotations, to merge two XStandoff documents over the same primary data into a single instance, to delete one or more levels of annotation, and to serialize an XStandoff document in XML with inline markup and milestone elements.
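
The stand-off idea itself can be sketched in a few lines of Python. This is not the XStandoff format or its XSLT stylesheets: the classes `Span` and `StandoffDoc`, the layer names, and the sample text are all hypothetical, and annotations are assumed to be half-open character ranges over a shared primary text.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Span:
    start: int  # offset of the first character covered
    end: int    # offset just past the last character covered
    label: str


@dataclass
class StandoffDoc:
    text: str                                   # the primary data
    layers: dict = field(default_factory=dict)  # layer name -> list of Span

    def merge(self, other):
        """Combine two documents that annotate the same primary data."""
        if self.text != other.text:
            raise ValueError("primary data differ")
        return StandoffDoc(self.text, {**self.layers, **other.layers})

    def without(self, *names):
        """Return a copy with the named annotation layers removed."""
        kept = {k: v for k, v in self.layers.items() if k not in names}
        return StandoffDoc(self.text, kept)


def overlaps(a, b):
    """True when two spans cross, so neither can nest inside the other."""
    return a.start < b.start < a.end < b.end or b.start < a.start < b.end < a.end


primary = "John likes Mary"
syntax = StandoffDoc(primary, {"syntax": [Span(0, 10, "clause")]})
prosody = StandoffDoc(primary, {"prosody": [Span(5, 15, "phrase")]})
merged = merged_doc = syntax.merge(prosody)
crossing = overlaps(merged.layers["syntax"][0], merged.layers["prosody"][0])
```

Because each layer only points into the primary text, the crossing spans coexist without conflict; serializing them back to inline XML is where milestone elements become necessary.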

Thursday 9:00 am - 9:45 am

Managing electronic records business objects using XForms and Genericode at the National Archives and Records Administration

Quyen L. Nguyen, National Archives and Records Administration & Betty Harvey, Electronic Commerce Connection

The Electronic Records Archives (ERA) system at the U.S. National Archives and Records Administration (NARA) is intended to handle large volumes of electronic records on widely varying topics in many different formats. It must be extensible, evolvable, and scalable. Naturally, XML is used where possible. XForms and Genericode are used within ERA to manage transfer requests, records schedules, and other archival business objects; they make it convenient to verify controlled fields against authority lists and to check inter-field dependencies. This case study outlines the design and construction of the Electronic Records Archives system and describes how it permits agile responses to the ongoing evolution of requirements at NARA.

Thursday 9:45 am - 10:30 am

Methods for the construction of multi-structured documents

Pierre-Edouard Portier & Sylvie Calabretto, Université de Lyon

In recent years, numerous methods have been proposed for representing complex overlapping structures. But how are multi-structured documents to be created? This paper presents methods for creating and interacting with multi-structured documents using the MultiX2 model. The basic operations have been implemented using the functional language Haskell; this prototype implementation will be described.

Thursday 9:45 am - 10:30 am

Gracefully handling a level of change in a complex specification: Configuration management for community-scale implementation of an HL7v3 messaging specification

Charlie McCay, Ramsey Systems, Michael Odling-Smee, XML Solutions, Joseph Waller, XML Solutions, & Ann Wrightson, Informing Healthcare (NHS Wales)

Change management for a complex specification is always difficult. When the specification involves the life-and-death issues of healthcare messaging, a full strategy for handling both changes to the specification and the variations in data over time becomes essential. The technical requirement is to make interfaces both flexible and breakable, to accommodate change and enforce necessary compliance. The authors describe an in-depth analytical method and the resulting maintenance process for a key interoperability specification for the English National Health Service.

Thursday 11:00 am - 11:45 am

Merging multi-version texts: A generic solution to the overlap problem

Desmond Schmidt, Queensland University of Technology

In XML processing contexts, “multi-version documents” (MVDs, as proposed in a published 2009 paper by Schmidt and Colomb) can represent overlapping (separate, partial, conditional) hierarchies and variations (insertions, deletions, alternatives, and transpositions) in texts. The MVD data structure allows most desired operations on texts to be simple and fast. However, creating and editing MVDs is a much harder and more complex operation with no approaches known to be both optimal and practical. The problem is similar to the multiple-sequence alignment problem in molecular biology. A heuristic algorithm partly derived from recent biological alignment programs offers satisfactory speed and quality for creation and editing operations, which means that MVDs can be considered as a practical and editable format suitable for overlapping structures in digital texts.

Thursday 11:00 am - 11:45 am

(LB) hData - A Simplified Approach to Health Data Exchange

Gerald Beuchelt, Robert Dingwell, Andy Gregorowicz, Harry Sleeper, MITRE Corporation

Interoperability issues have limited the expected benefits of Electronic Health Record (EHR) systems. Ideally, the medical history of a patient is recorded in a set of digital continuity of care documents which are securely available to the patient and their care providers on demand. The history of continuity of care standards includes multiple standards organizations, differing goals, and ongoing efforts to reconcile the various specifications. Existing standards define a format that is too complex for exchanging continuity of care information effectively. We propose hData, a simplified XML framework to describe health information. hData addresses the challenges of the current HL7 Continuity of Care Document format and is explicitly designed for extensibility to address health information exchange needs, in general. hData applies established best practices for XML document architectures to the vertical health domain, which has experienced significant XML-based interoperability issues.

Thursday 11:45 am - 12:30 pm

XSAQCT: An XML queryable compressor

Tomasz Müldner, Acadia University, Christopher Fry, Acadia University, Jan Krzysztof Miziołek, University of Warsaw, & Scott Durno, Acadia University

An XML-aware compressor reduces the size of a document by taking advantage of the redundancy in the XML syntax. A queryable XML compressor furthermore stores the compressed data in a form that can be queried without first decompressing the full document. XSAQCT (pronounced "exact") is a queryable, grammar-free compressor (meaning it is informed only by the document instance and not by a schema) that separates the document structure from the text and attribute values, storing the structure as an annotated tree and the data values in containers. Both are compressed; a decompressor can restore the original document, but a query processor operates on the compressed document, lazily decompressing as little as possible. Preliminary results look good: XSAQCT compresses documents in our corpus to 12% of the original size and outperforms the other XML compressors we have tested.
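
The general idea of separating structure from data values can be sketched as follows. This is not the XSAQCT algorithm or its annotated-tree representation: it is a simplified Python illustration that assumes element-only documents without attributes or mixed content, groups text values into one container per element path, and compresses each container separately with zlib. All function names are hypothetical.

```python
import xml.etree.ElementTree as ET
import zlib
from collections import defaultdict


def split(root):
    """Separate a document into a tag skeleton and per-path text containers."""
    containers = defaultdict(list)

    def walk(node, parent_path):
        path = parent_path + "/" + node.tag
        containers[path].append(node.text or "")
        return (node.tag, [walk(child, path) for child in node])

    return walk(root, ""), dict(containers)


def pack(containers):
    """Compress each container on its own (NUL cannot occur in XML text)."""
    return {p: zlib.compress("\0".join(v).encode("utf-8")) for p, v in containers.items()}


def values_at(packed, path):
    """Answer a simple path query by decompressing only one container."""
    return zlib.decompress(packed[path]).decode("utf-8").split("\0")


def rebuild(skeleton, packed):
    """Restore the document from the skeleton and the compressed containers."""
    cursors = {path: iter(values_at(packed, path)) for path in packed}

    def build(item, parent_path):
        tag, children = item
        path = parent_path + "/" + tag
        node = ET.Element(tag)
        node.text = next(cursors[path])  # same pre-order as in split()
        for child in children:
            node.append(build(child, path))
        return node

    return build(skeleton, "")


source = ET.fromstring(
    "<lib><book><title>A</title><year>2009</year></book>"
    "<book><title>B</title><year>2009</year></book></lib>"
)
skeleton, containers = split(source)
packed = pack(containers)
titles = values_at(packed, "/lib/book/title")
restored = rebuild(skeleton, packed)
```

Grouping values by path places similar strings next to each other, which tends to help a general-purpose compressor, and the `values_at` query touches a single container rather than the whole document.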

Thursday 11:45 am - 12:30 pm

(LB) The Graphic Visualization of XML Documents

Zoe Borovsky, University of California, Los Angeles; David J. Birnbaum, University of Pittsburgh; Lewis R. Lancaster, University of California, Berkeley; James A. Danowski, University of Illinois at Chicago

We propose to show how graphic visualizations of deeply encoded XML documents allow Humanities scholars to reap the rewards of their work. These visualizations become, in turn, objects that scholars can analyze and interpret. Beginning with a short overview outlining the history of development in visualization strategies of Humanities computing technologies, we present Birnbaum’s Repertorium Workstation as an early attempt at graphic visualization of a large collection of XML encoded texts. Borovsky’s work shows how graphs of encoded data can themselves become objects of analysis; she will present examples of visual queries and results. Lancaster’s work envisions a visual query system using large graphs—a framework designed for exploring structurally complex Humanities data sets. Our work leads us to conclude that graphic visualization isn’t just something one can do with XML data; it is often crucial to making the data usable in research.

Thursday 2:00 pm - 3:30 pm

XML best practices: panel discussion

Peter F Brown, Pensive, David Chesnutt, Chet Ensign, Bloomberg, Betty Harvey, Electronic Commerce Connection, Laura Kelly, National Library of Medicine, & Mary McRae, OASIS

Who doesn't want to do things well? Who doesn't want to stand on the shoulders of giants? Who doesn't want to share hard earned wisdom with others? So why is it that "best practices" are so elusive? In this panel discussion we consider how "best practices" (and practices that, for whatever reasons, masquerade as "best") can be discovered, recognized, verified, modified, replaced, debunked, enforced, promulgated, etc.

Thursday 4:00 pm - 4:45 pm

EXPath: A practical introduction: Collaboratively defining open standards for portable XPath extensions

Florent Georges

The EXPath project was established in April 2009 to define libraries of extension functions for XPath. These functions will exist outside of any existing processor, allowing processors to implement them natively or to install them as external packages. Ideally, these expressions will become portable across every processor and will be usefully employed in XQuery, XSLT, and other XPath-based languages. Come find out about the project and enjoy some dynamite examples of submitted functions.

Thursday 4:45 pm - 5:30 pm

Managing XML references through the XRM vocabulary

Jean-Yves Vion-Dury, Xerox Research Centre Europe

XML References Management (XRM) is a method and vocabulary for formalizing knowledge about the types of links found in a given family of XML documents (i.e., in an XML document type). With such formalized knowledge in hand, instances of the document type are amenable to automated link verification, and to transformation and derivation with predictable and desirable results. The XRM approach allows link description (the definition of link types in terms of the contexts in which they appear), link validation description (the properties of valid instances of each type of link), and link translation description (the rules that should govern transformations and derivations).

Friday, August 14, 2009

Friday 9:00 am - 9:45 am

Why writers don’t use XML: The usability of editing software for structured documents

Peter Flynn, University College Cork

Why can’t people stand XML editors? The details are legion, but the root cause is simple: XML editors routinely put XML and its structure at the center of their concerns, whereas the central concerns of most writers of prose lie elsewhere. Even when they are agreeing on the importance of document structure, XML thinkers and actual writers often mean different things by it. What would it take to make the editor’s interface support the user’s mostly bottom-up model of text, instead of insisting on the top-down XML model most easily and commonly implemented? A user survey can help reveal whether a more task- and user-centered interface would help make XML editors more usable.

Friday 9:45 am - 10:30 am

(FP) Open data and the XML community

Kurt Cagle

The world of XML is changing. Large “super schemas” like OOXML, XBRL, NIEMs, HL7, and so on, push the limits of existing XML software, while also encouraging the creation of ecosystems built around them, in order to exploit the large quantities of important data now or soon to be available in these formats. Standardization around these formats is driven less by existing proprietary formats and less by industry consortia than by government adoption. The super schemas are often formulated less as definitions of single concrete vocabularies than as meta-definitions of families of vocabularies. The confluence of emerging Open Data standards, the government-as-database conjecture, and a shift towards RESTful services will serve to turbocharge the XML community.

Friday 11:00 am - 11:45 am

Documenting and implementing guidelines with Schematron

Joshua Lubell, National Institute of Standards and Technology

Data exchange specifications must be broad and general to achieve acceptance; for actual interoperability, the data need to be more tightly constrained by specific business rules. Naming and Design Rules (NDR) are guidelines for constraining the development of new schemas or extending existing schemas to ensure interoperability. Both the business rules and the NDR can be implemented in Schematron, a rule-checking and reporting language for XML documents. Schematron is also a particularly useful literate programming tool for documenting and implementing guidelines. The Schematron literate-programming approach is compared to a previously implemented NDR document model approach with embedded Schematron for enforcing guidelines.

Friday 11:45 am - 12:30 pm

Test assertions on steroids for XML artifacts

Jacques Durand, Fujitsu America, Stephen Greene, Document Engineering Services, Serm Kulvatunyou, Oracle, & Tom Rutt, Fujitsu America

Testing of XML material (either XML-native business documents or XML-formatted inputs of various sources) combines diverse validation requirements that are in general not well supported by any single validation tool. Schema-based validation must be complemented with additional syntactic and semantic rules. Known tools in this space are limited in their expressive power, and/or they can’t make reports that are sufficiently nuanced, and/or they can’t take additional information, such as metadata and operational artifacts, into account. This paper describes a more integrated XML testing paradigm that supports chaining and parameterization of test cases, and the modularization of reusable tests. It combines the familiar notion of test assertions (as described in the OASIS TAG model) with XPath 2.0 and XSLT 2.0.

Friday 12:30 pm - 1:15 pm

(FP) Sometimes a question of scale

C. M. Sperberg-McQueen, Black Mesa Technologies

Reflections on size, scale, scalability, and value.