XML for the Long Haul: Issues in the Long-term Preservation of XML
(August 2, 2010)

Balisage 2010 Program

Tuesday, August 3, 2010

Tuesday 9:15 am - 9:45 am

The high cost of risk aversion

Tommie Usdin, Mulberry Technologies

Avoiding risk is not always the way to minimize risk.

Tuesday 9:45 am - 10:30 am

Multi-channel eBook production as a function of diverse target device capabilities

Eric Freese, Aptara

The challenge: develop an eBook that can demonstrate a number of enhanced eBook capabilities (intelligent table of contents, bidirectional linking, external links to study files and geospatial data, hidden text, media support, epubcheck validation, etc.) that will work on many “standard” eBook devices (such as the Kindle, nook, Sony Reader, iPad, and eDGe platforms, and even on smart phones). The text: the World English Bible. The talk: show-and-tell session and the sharing of lessons learned.

Tuesday 11:00 am - 11:45 am

gXML, a new approach to cultivating XML trees in Java

Amelia A. Lewis & Eric E. Johnson, TIBCO Software Inc.

Different XML tree-processing tasks may require tree models with different design tradeoffs, producing problems of multiplicity, interoperability, variability, and weight. It is no longer necessary to use a different API for each different tree model. A single unified Java-based API, gXML, can provide a programming platform for all tree models for which a “bridge” has been developed. gXML exploits the Handle/Body design pattern and supports the XQuery Data Model (XDM).
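The Handle/Body pattern that the abstract mentions can be sketched in a few lines of Java. Everything below is illustrative — `TreeBridge`, `Cursor`, and the toy node type are hypothetical names, not the real gXML API: a generic handle delegates every operation to a per-model bridge, so one cursor API serves any tree model that supplies a bridge.

```java
// Illustrative sketch of the Handle/Body pattern; the names here are
// hypothetical and do not reflect the actual gXML API.
interface TreeBridge<N> {              // the "body": one implementation per tree model
    String name(N node);
    N firstChild(N node);              // null if the node has no children
}

final class Cursor<N> {                // the "handle": one uniform API over any bridge
    private final TreeBridge<N> bridge;
    private final N node;
    Cursor(TreeBridge<N> bridge, N node) { this.bridge = bridge; this.node = node; }
    String name() { return bridge.name(node); }
    Cursor<N> firstChild() {
        N child = bridge.firstChild(node);
        return child == null ? null : new Cursor<>(bridge, child);
    }
}

// A toy tree model and its bridge, standing in for DOM, JDOM, AXIOM, etc.
final class ToyNode {
    final String name; final ToyNode child;
    ToyNode(String name, ToyNode child) { this.name = name; this.child = child; }
}

final class ToyBridge implements TreeBridge<ToyNode> {
    public String name(ToyNode n) { return n.name; }
    public ToyNode firstChild(ToyNode n) { return n.child; }
}
```

Code written against the handle never sees which tree model is underneath, which is the point of the pattern: adding support for another model means writing another bridge, not another API.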

Tuesday 11:00 am - 11:45 am

Grammar-driven markup generation

Mario Blažević, Stilo International

For use in document conversions, we have written a normalizer that generates a grammatical element structure for incompletely tagged document instances, guided by a RELAX NG schema. From a well-formed but invalid instance that contains only tags that occur in the target schema, the normalizer generates a document instance valid against the grammar; the weakly structured input is first translated into elements from the schema, then the instance is manipulated into validity. We introduce a set of processing instructions that allow a user to control how the normalizer resolves ambiguity in the instance.

Tuesday 11:45 am - 12:30 pm

Java integration of XQuery — an information unit oriented approach

Hans-Jürgen Rennau

Need to process XML data in Java? Keen to let Java delegate to XQuery what XQuery can do much better, and to discover a novel pattern of cooperation between XQuery and Java developers? A new API, XQJPLUS, makes it possible to let XQuery build “information units” collected into “information trays”. Tray design is driven by the requirements of the Java side; the implementation is a pure XQuery task, and using the trays from Java requires no knowledge of XQuery.

Tuesday 11:45 am - 12:30 pm

Reverse modeling for domain-driven engineering of publishing technology

Anne Brüggemann-Klein, Tamer Demirel, Dennis Pagano, & Andreas Tai, Technische Universität München

Our ultimate goal is to develop a meta-meta-modeling facility whose instances are custom meta-models for conceptual document and data models. Such models could drive development by being systematically transformed into lower-level models and software artifacts (model-driven architecture). In a step toward that goal, we present “reverse modeling” that constructs a conceptual model by working backwards from a pre-existing model such as an XML Schema or a UML model. Starting with a schema, we abstract a custom, domain-specific meta-model, which explicitly captures salient points of the model bearing on system and interface design, and then we re-formulate the original model as an instance of the new meta-model.

Tuesday 2:00 pm - 2:45 pm

Managing semantics in XML vocabularies: an experience in the legal and legislative domain

Gioele Barabucci, Luca Cervone, Angelo Di Iorio, Monica Palmirani, Silvio Peroni, & Fabio Vitali, University of Bologna

Akoma Ntoso is an XML vocabulary for legal and legislative documents sponsored by the United Nations for use in African and other countries. Documents include concrete semantic information describing and identifying the resource itself as well as the legal knowledge contained in it. This paper shows how the Akoma Ntoso standard expresses the multiple independent conceptual layers and provides ontological structures on top of them. We also describe features intended to allow future access to the legal information represented in documents without relying on the future availability of today's technology.

Tuesday 2:45 pm - 3:30 pm

XML pipeline processing in the browser

Vojtěch Toman, EMC Corporation

Powerful XML processing pipelines can be specified using the W3C XProc language, but the currently available XProc implementations, such as EMC's Java-based Calumet, are expected to run on servers, not on client-side browsers. A client-side implementation could be provided as a browser plug-in, but a Javascript-based implementation would offer comprehensive client-side portability for XML pipelines specified in XProc. A Javascript port of Calumet is in the works, and the early results are encouraging.

Tuesday 4:00 pm - 4:45 pm

Extension of the type/token distinction to document structure

Claus Huitfeldt, University of Bergen; Yves Marcoux, Université de Montréal; & C. M. Sperberg-McQueen, Black Mesa Technologies

C. S. Peirce's type/token distinction can be extended beyond words and atomic characters to higher-level document structures. Also, mechanisms for handling tokens whose types are (perhaps intentionally) ambiguous can be added. Thus fortified, the distinction offers an intellectual tool useful for closer examination of the relationships between XML element types and their instances, and, more broadly, across the whole hierarchy of character, word, element, and document types.

Tuesday 4:00 pm - 4:45 pm

A virtualization-based retrieval and update API for XML-encoded corpora

Cyril Briquet, McMaster University & ATILF (CNRS & Nancy-Université); Pascale Renders, University of Liège & ATILF (CNRS & Nancy-Université); Etienne Petitjean, ATILF (CNRS & Nancy-Université)

Processing a large textual corpus with many XML tags is fraught with difficulties for many processes such as search and editing. The presence of tags interleaved with text may cause textual operations to return invalid results, such as false positives or false negatives. Virtualization of XML documents offers the possibility to guarantee correct results by hiding selected tags, text and combinations thereof without invalidating the overall corpus. A Java API that supports virtualization has enabled automatic processing (retrieval and update) of large and complex documents that contain multipurpose semantic tags.
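The tag hiding described above can be illustrated with a toy “virtual view” in Java. The class and method names are hypothetical, and a real virtualization API would preserve offset mappings back into the original document rather than produce a flat string; this sketch only shows why hiding markup makes textual searches succeed across interleaved tags.

```java
import java.util.Set;
import java.util.regex.Pattern;

final class Virtualizer {
    // Suppress the markup (but keep the text content) of the given element
    // names, so that a plain string search no longer stumbles over tags
    // interleaved with the text. Hypothetical API for illustration only.
    static String virtualView(String xml, Set<String> hiddenTags) {
        StringBuilder alternation = new StringBuilder();
        for (String t : hiddenTags) {
            if (alternation.length() > 0) alternation.append('|');
            alternation.append(Pattern.quote(t));
        }
        // Matches <tag ...>, </tag>, and <tag .../> for the hidden names only.
        Pattern p = Pattern.compile("</?(?:" + alternation + ")(?:\\s[^>]*)?/?>");
        return p.matcher(xml).replaceAll("");
    }
}
```

With the inline `i` element hidden, a search for the word “etymology” matches even though the stored form is `ety<i>mo</i>logy` — the kind of false negative the abstract warns about.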

Tuesday 4:45 pm - 5:30 pm

Discourse situations and markup interoperability

Karen Wickett, University of Illinois Urbana-Champaign

Interoperability of markup across time and systems requires a mapping from tags to the logical predicates associated with those tags. The use of natural-language element names allows readers to loosely interpret markup by exploiting the everyday resource situations that support ordinary language-based communication, but (as we demonstrate) the name of a tag alone does not convey everything necessary to interpret the meaning of the markup. Misinterpretation problems become obvious when the markup is used to derive erroneous RDF statements. Semantic resolution requires sufficient access to documentation. Without such support, interoperability across time and systems is an unlikely prospect.

Tuesday 4:45 pm - 5:30 pm

(LB) XHTML Dialects: Interchange over domain vocabularies through upward expansion, with examples of manifesting and validating microformats

Erik Hennum

The XML community exhibits a persistent tension between the value of sharing (motivating standards) and the value of individuation (motivating customization of those standards). Some communities resolve this tension through particular emphasis on customizations that produce subsets of base vocabularies. Current practices for defining subset vocabularies, however, have limitations that reduce the value of this approach. This paper proposes enhancing the XML ecosystem with a general-purpose mechanism for defining and managing subset extensions of a vocabulary. The proposal makes use of Semantic Web strategies — in particular, asserting new type relations for existing type definitions and simplifying content models — to identify commonality for variant vocabularies. This approach has particular promise for extending XHTML as illustrated with a few microformats.

Wednesday, August 4, 2010

Wednesday 9:00 am - 9:45 am

Where XForms meets the glass: bridging between data and interaction design

Charlie Wiecha, Rahul Akolkar, & Andrew Spyker, IBM

XForms offers a model-view framework for XML applications. Some developers take a data-centric approach, developing XForms applications by first specifying abstract operations on data and then gradually giving those operations a concrete user interface using XForms widgets. Other developers start from the user interface and develop the MVC model only as far as is needed to support the desired user experience. Tools and design methods suitable for one group may be unhelpful (at best) for the other. We explore a way to bridge this divide by working within the conventions of existing Ajax frameworks such as Dojo.

Wednesday 9:45 am - 10:30 am

I say XSLT, you say XQuery: let’s call the whole thing off

David J. Birnbaum, University of Pittsburgh

XSLT and XQuery can both be used for extracting information from XML resources and transforming it for presentation in a different form. The same task can be performed entirely with XSLT, entirely with XQuery, or using a combination of the two, and there seems to be no general consensus or guidelines concerning best practice for choosing among the available approaches. The author solved a specific problem initially (and satisfactorily) with XSLT because XQuery was not a sufficiently mature technology at the time the task first arose, but years later began to suspect that XQuery might be, in some ineffable way, a better fit than XSLT for the data and the task. Both the exclusively XSLT approach and the exclusively XQuery approach were comparable in functionality, efficiency, ease of development, and ease of maintenance, and they also shared (of course) an XPath addressing component, but they were nonetheless profoundly different in the way they interacted with the same source XML files. The goal of this presentation is to consider why one or the other technology may be a better fit for a particular combination of data and task, and to suggest guidelines for making decisions of that sort.

Wednesday 11:00am - 11:45am

Refining the taxonomy of XML schema languages: a new approach for categorizing XML schema languages in terms of processing complexity

Maik Stührenberg & Christian Wurm, Bielefeld University

During the last decade, many researchers have worked in the fields of XML applications (especially regarding schema languages) and formal languages. Among the results is the taxonomy of XML schema languages described by Murata et al., which distinguishes local tree grammars (DTDs), single-type tree grammars (XSD schemas), restrained competition grammars, and regular tree grammars (RELAX NG schemas).

We refine and extend this hierarchy using the concepts of determinism and of local and global ambiguity. It turns out that there exist interesting grammar types which are not yet captured formally, such as “unambiguous restrained competition grammars” and “unique subtree grammars”. In addition, we prove some interesting results regarding ambiguous grammars and languages: if a tree language is inherently ambiguous (i.e., its ambiguity cannot be eliminated), different interpretations of the same structure are isomorphic. This has important consequences for the treatment of ambiguity in document grammars.

Wednesday 11:45 am - 12:30 pm

Schema component paths for schema analysis

Mary Holstege, Mark Logic

An XPath-like syntax for XSD schema components allows sets of XSD schema documents to be described and navigated in convenient and familiar ways. Each component has a unique canonical path, which can be used to identify the component; canonical paths are robust against changes in the physical organization of the schema. A set of canonical paths provides a sort of snapshot or signature of a schema, which can provide a quick and simple summary of what has changed in a new version of a familiar schema. Schema signatures may also be helpful in the calculation of simple measures of schema complexity.
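The “signature” idea — comparing sets of canonical component paths across schema versions — reduces to a set difference. A minimal Java sketch, with an invented path syntax (the real canonical-path syntax from the talk is not reproduced here):

```java
import java.util.Set;
import java.util.TreeSet;

final class SchemaSignature {
    // Given two signatures (sets of canonical component paths, one per
    // schema version), summarize which components appeared and disappeared.
    // The path strings used in the test are illustrative, not the talk's
    // actual canonical-path syntax.
    static String diff(Set<String> oldSig, Set<String> newSig) {
        Set<String> added = new TreeSet<>(newSig);
        added.removeAll(oldSig);
        Set<String> removed = new TreeSet<>(oldSig);
        removed.removeAll(newSig);
        return "added=" + added + " removed=" + removed;
    }
}
```

Because canonical paths identify components independently of how the schema documents are physically organized, the same diff survives a reshuffling of includes and imports that would defeat a file-level comparison.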

Wednesday 2:00 pm - 2:45 pm

A packaging system for EXPath

Florent Georges, H2O Consulting

EXPath provides a framework for collaborative community-based development of extensions to XPath and XPath-based technologies (including XSLT and XQuery), thus exploiting the built-in extensibility of those technologies. But given multiple modules extending XPath, how can a user conveniently manage installation and de-installation of extension modules? How can developers make installation easy for users? How can users and developers avoid being trapped in dependency hell? These problems are familiar from other platforms, as are potential solutions. We can adapt conventional ideas of packaging to work well in the EXPath environment.

Wednesday 2:45 pm - 3:30 pm

A streaming XSLT processor

Michael Kay, Saxonica

XSLT transformations can refer to any information in the source document from any point in the stylesheet, without constraint; XSLT implementations typically support this freedom by building a tree representation of the entire source document in memory and in consequence can process only documents which fit in memory. But many transformations can in principle be performed without storing the entire source tree. The W3C XSL Working Group is developing a new version of XSLT designed to make streamed implementations of XSLT feasible. The author (editor of the XSLT 2.1 specification) has been implementing streaming features in his Saxon XSLT processor; the paper will describe how the implementation is organized and how far it has progressed to date. The exposition is chronological to show how the streaming features have developed gradually from small beginnings.

Thursday, August 5, 2010

Thursday 9:00 am - 9:45 am

Why TEI stand-off annotation doesn't quite work and why you might want to use it nevertheless

Piotr Bański, University of Warsaw

Textual and linguistic analysis of corpora awakens all the sleeping dragons of markup overlap. The TEI, like many others with an interest in markup, has taken up stand-off markup as one of its weapons of choice. That choice has problems in both the technical and sociological realms, however. Implementing extensions to XML tools to support XInclude and XPointer would make life easier for OWLs (ordinary working linguists).

Thursday 9:00 am - 9:45 am

DITA or Not?

Lynne A. Price, Text Structure Consulting

Use of DITA has become so pervasive that some users assume that anyone who inquires about moving to an XML environment must use DITA. Often, the selection of DITA is independent of DITA's strengths, such as ease of reuse, specialization, support for distributed authoring, and availability of the Open Toolkit. While numerous DITA case studies have been published, such reports tend to focus on what was accomplished rather than on how the approach was chosen, and typically reflect successful implementations in large organizations. This study focuses on why end users, consultants, and tool vendors have chosen to use or to avoid DITA. While this should not be considered an unbiased or scientifically balanced survey, anecdotal evidence such as this can be valuable to organizations faced with similar decisions.

Thursday 9:45 am - 10:30 am

Freestyle Markup Language

Denis Pondorf & Andreas Witt, Institute for the German Language (IDS)

Freestyle Markup Language (FML) is a nascent generalized descriptive markup language for describing polyhierarchical markup of texts and data. FML is (we hope) the next generation in the evolution of markup languages. By design, FML is described using a Type-2 grammar (production rules in EBNF), so that FML may be produced by a context-free grammar and recognized by a nondeterministic pushdown automaton. FML documents will be transformable into a semantically unambiguous corresponding graph structure. By overcoming many of the restrictions inherent in monohierarchical OHCO (ordered hierarchy of content objects) structures, FML should overcome problems such as congruence, interference, and content redundancy that result from root- and hierarchy-bondage.

Thursday 9:45 am - 10:30 am

IPSA RE: A New Model of Data/Document Management, Defined by Identity, Provenance, Structure, Aptitude, Revision and Events

Walter E. Perry, Fiduciary Automation

In private investment fund dealing, each transaction is a series of interactions between parties transacting business at different granularities and often with materially different understandings of the substance of the transaction. Data records for private investment fund trading often don't accurately reflect whose money has gone into, or come out of, a given transaction or, conversely, in which particular transaction an investor's stake in a fund was secured, and at what basis. Investor skepticism in light of recent events, and government insistence on regulation, necessitate transparency about whose money is deployed in what exact amounts in which transactions for which investment assets at what basis and through what chain of provenance.

The design of Google BigTable and the API for Google App Engine facilitate implementation of a "linksbase", which redefines a data record as an instance aggregation of linkages or "extended arc" on whose path may lie any number of instances identified by entity types, each separately influencing the resultant arc. The instance record is transactable across the gross differences of granularity separating transaction parties and widely different understandings of the instance transaction.

Thursday 11:00 am - 11:45 am

Multi-structured documents and the emergence of annotation vocabularies

Pierre-Édouard Portier & Sylvie Calabretto, University of Lyon

Annotation vocabularies frequently need to grow and change as the user's understanding of the documents being annotated grows. We have developed methods to allow users to add new annotation terms while keeping some control over the growth and change of the annotation vocabularies; we use traces of user actions involving particular terms to help document those terms for users. Our ideas are being tested in a project involving the papers of Jean-Toussaint Desanti, the French philosopher of mathematics.

Thursday 11:00 am - 11:45 am

Processing arbitrarily large XML using a persistent DOM

Martin Probst

Processing of large XML documents usually traps the user between the memory constraints on DOM processing and the limitations on tree traversal in streaming processes. Moving the DOM out of memory and into persistent storage offers another processing option. Because disk storage is much slower than memory access, an efficient binary representation of the XML document has been developed, with a supporting Java API. Results are promising for gigabyte-sized documents that are not suitable for conventional DOM techniques.

Thursday 11:45 am - 12:30 pm

On Implementing string-range() for TEI

Hugh Cayless, NYU & Adam Soroka, UVA

The long-standing argument over the theoretical validity of “embedded” XML markup (particularly the TEI) flared again recently on the Humanist mailing list. That discussion prompted a group of programmers (including the authors) to meet for a session at THATCamp Prime in May to see whether anything practical could be done to address the deficiencies of TEI-style embedded markup. The TEI guidelines contain XPointer schemes which, if implemented, would allow the kinds of standoff markup and annotation that the anti-embedded-markup camp want within the context of a widely used standard. In the years since these (still unimplemented) pointer schemes were proposed, there have been developments (one very recent) that might now make implementation practical, so we decided to make one of these schemes, string-range(), actually work. This paper will present our implementation and a discussion of how it might be used to manage overlapping hierarchies of markup within a single TEI document.

Thursday 11:45 am - 12:30 pm

There are No Documents

Allen H. Renear & Karen M. Wickett, University of Illinois at Urbana-Champaign

Last year at Balisage (2009) we considered the claim that documents cannot be modified. This consideration took the form of identifying and evaluating possible responses to this inconsistent triad: 1) Documents are strings; 2) Strings cannot be modified; 3) Documents can be modified. Late this spring we were surprised to realize that our survey of possible responses had overlooked one: There are no documents. We turn to that neglected possible response now.

Thursday 2:00 pm - 3:30 pm

Panel Discussion. Greasing the Wheels: Overcoming User Resistance to XML

While Balisage is filled with people who are comfortable working in XML (many of us are more comfortable with XML than with spreadsheets, word processors, or pens), many of the users we work with find XML confusing or intimidating and resist learning about XML and using XML tools. This discussion will focus on how to overcome user resistance to XML, including ways to hide the XML, or at least the full complexity of the XML, from end users.

Thursday 4:00 pm - 4:45 pm

XML essence testing

Abraham Becker & Jeff Beck, U.S. National Library of Medicine (NLM)

PubMed Central (PMC) is the U.S. National Institutes of Health free digital archive, gathering together biomedical and life sciences journal literature from diverse sources. When an article arrives at PMC, it conforms to one of over 40 evolving DTDs. An ingestion process applies appropriate “Modified Local” and “Production” XSLT stylesheets to produce two instances of the common NLM Archiving and Interchange DTD. In the “essence testing” phase, the essential nodes of these instances, as specified by some 60 XPath expressions, are compared. This method allows the reliable detection of unintentional changes to an XSLT stylesheet with negative impacts on product quality.
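The core of essence testing — evaluating the same XPath against two instances and comparing the results — can be sketched with Java's standard XPath API. This is a minimal illustration, not the PMC pipeline: the class and method names are invented, and the real process compares some 60 expressions across full article instances rather than one expression on toy strings.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

final class EssenceTest {
    // Compare one "essential" node, selected by XPath, across two instances.
    // Differences elsewhere in the documents are deliberately ignored: only
    // the essence named by the expression has to match.
    static boolean sameEssence(String xmlA, String xmlB, String xpath) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document a = db.parse(new ByteArrayInputStream(xmlA.getBytes("UTF-8")));
            Document b = db.parse(new ByteArrayInputStream(xmlB.getBytes("UTF-8")));
            XPath xp = XPathFactory.newInstance().newXPath();
            // evaluate(String, Object) returns the string value of the selection
            return xp.evaluate(xpath, a).equals(xp.evaluate(xpath, b));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

A stylesheet change that touches non-essential markup passes such a test; one that alters an essential node fails it, flagging the regression before it reaches production.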

Thursday 4:45 pm - 5:30 pm

Automatic upconversion using XSLT 2.0 and XProc

Stefanie Haupt & Maik Stührenberg, University of Bielefeld

Upconversion of presentation-oriented HTML documents to a data-centric XML form is a non-trivial but automatable process. Our data is a corpus of video game reviews represented as (sometimes invalid) HTML 4.01. Hidden in these reviews are useful pieces of metadata such as genre, number of players, age ratings, difficulty, and the pros and cons of the game. With a schema that cleanly defines and extends useful datatypes and an XSLT 2.0 stylesheet that makes heavy use of regular expressions and string processing, we recursively process the HTML documents using an XProc pipeline. Thus we transform tag soup into fully structured (and valid!) XML instances that allow semantically rich XQuery queries over the data.
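The regex-driven extraction step at the heart of such upconversion can be illustrated outside XSLT. The sketch below uses Java regular expressions on an invented markup fragment; the talk itself uses XSLT 2.0 regex functions inside an XProc pipeline, and the field label and class name here are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReviewUpconverter {
    // Toy extraction of one metadata field from presentation-oriented HTML.
    // A real upconversion would handle many fields, tolerate markup variants,
    // and emit schema-valid XML rather than a bare string.
    static String genre(String html) {
        Matcher m = Pattern.compile("Genre:\\s*</b>\\s*([^<]+)",
                                    Pattern.CASE_INSENSITIVE).matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }
}
```

The pattern anchors on the human-readable label rather than on document structure, which is exactly why upconversion of tag soup is feasible even when the HTML is invalid.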

Friday, August 6, 2010

Friday 9:00 am - 9:45 am

Stand-alone encoding of document history

Jean-Yves Vion-Dury, Xerox Research Centre Europe

Tracking the change history of a document has frequently depended either on external systems that atomize the document into databases or on running differences over separately stored intermediate versions. Why not encapsulate the entire history process in a single XML document? Using appropriate namespaces, both an instance and its history can be combined in a single construct. Unification of document and history allows the use of XPath expressions to express delta structures and the systematic distinction between modification descriptors and modification operations, with gains in both compactness and efficiency of storage.

Friday 9:45 am - 10:30 am

Scripting documents with XQuery: Virtual documents in TNTBase

Vyacheslav Zholudev & Michael Kohlhase, Jacobs University, Bremen

If x is to XQuery as views are to the query language SQL, what is x? We present a virtual-document facility integrated into TNTBase, an XML database with support for versioning. Our virtual documents consist of document skeletons with static text and parameterizable embedded XQuery queries; they can be edited, and the edits are propagated automatically back to the elements of the underlying XML repository. The ability to integrate computational tasks into documents makes virtual documents an enabling technology with far-reaching possibilities.

Friday 11:00 am - 11:45 am

XQuery design patterns

William Candillon, Matthias Brantner, Dennis Knochenwefel, 28msec Inc.

The idea that design patterns are identifiable, reusable, and teachable is itself a (meta) design pattern whose applications and benefits extend far beyond the field of object-oriented programming. XQuery is an XML technology that, like OO technology, both suggests design patterns and is being influenced by them. A working AtomPub-based cloud application illustrates some XQuery design patterns: “Chain of Responsibility”, “Pattern Matching”, “Strategy”, and “Observer”.

Friday 11:45 am - 12:30 pm

Platform independence 2010 - Helping documents fly well in emerging architectures

Ann Wrightson, NHS Wales Informatics Service

XML data structures, and those who design them, must adapt to the reality that multiprocessing technologies, including multicore processors, multichannel memory, multilevel caches, clouds, etc., are now ubiquitous. What does this mean for the practices and design patterns of the XML industry, and for our instincts about design tradeoffs such as cascading defaults, our willingness to incur the cost of data replication, or the addition of an optional detail to a model?

Friday 12:30 pm - 1:15 pm

(FP) Stone soup

C. M. Sperberg-McQueen, Black Mesa Technologies

Reflections on making the best of unpromising situations.