International Symposium on Quality Assurance and Quality Control in XML
Preliminary Program

Monday, August 6, 2012

9:00 am - 9:10 am


9:10 am - 9:50 am

Quality assurance in the XML world: Beyond validation

Dale Waldt, LexisNexis

Validation of XML documents typically provides feedback in binary, yes/no form. This avoids the ambiguity, manual intervention, and increased cost of other approaches. But it may not be enough to make XML applications efficient, accurate, or semantically rich. How do you ensure that the correct element and attribute types are applied to the appropriate content chunks? That XML documents are accurate and current? That your XML has a level of semantic richness appropriate to your business goals? How do you control quality over large collections? How do you resolve conflicting organizational goals for information integration and ensure that content and schemas help the enterprise as a whole? Conceptual and physical models, model/schema traceability, and effective stakeholder review can all help. Schematron, document comparison (diff) tools, and statistical methods can also help, but may raise QA questions of their own. Improvements in requirements gathering and QA processes can produce visible results; concrete examples will be discussed.

9:50 am - 10:30 am

XML instances to validate XML schemas

Eric van der Vlist, Dyomedea

Ever modified an XML schema? Ever broken something while fixing a bug or adding a new feature? As with any piece of engineering, the more complex a schema is, the harder it is to maintain. In other domains, unit tests dramatically reduce the number of regressions and thus provide a kind of safety net for maintainers. Can we learn from these techniques and adapt them to XML schema languages? In this workshop session, we will develop a schema using unit test techniques, to illustrate their benefits in this domain.
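
As a rough illustration of the idea (not the presenter's method), schema unit testing can be sketched as a table of small XML instances, each paired with the verdict the schema is expected to give. In the Python sketch below, is_valid is a hypothetical stand-in for a real schema processor, and the book/title vocabulary is invented for the example.

```python
import unittest
import xml.etree.ElementTree as ET


def is_valid(xml_text):
    """Hypothetical stand-in for a real schema validator.

    Rule under test: a <book> must carry an 'id' attribute and
    contain exactly one <title> child.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return (
        root.tag == "book"
        and "id" in root.attrib
        and len(root.findall("title")) == 1
    )


# Each test instance is paired with the verdict the schema should give.
CASES = [
    ('<book id="b1"><title>XML</title></book>', True),
    ('<book><title>XML</title></book>', False),        # missing id
    ('<book id="b1"/>', False),                        # missing title
    ('<book id="b1"><title/><title/></book>', False),  # two titles
]


class SchemaRegressionTest(unittest.TestCase):
    def test_instances(self):
        # A schema change that flips any verdict fails the suite.
        for xml_text, expected in CASES:
            with self.subTest(xml=xml_text):
                self.assertEqual(is_valid(xml_text), expected)
```

Saved to a file, the suite can be run with python -m unittest; in practice the stand-in would be replaced by a call to whichever validator the schema language requires.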

11:00 am - 11:40 am

Quality control practice for Scholars Portal, an XML-based e-journals repository

Wei Zhao, Jayanthy Chengan, & Agnes Bai, all of OCUL Scholars Portal

A repository for e-journals that adds thousands of new records daily must develop quality-assurance procedures to avoid being overwhelmed by invalid data. The Ontario Scholars Portal is an XML-based digital repository of over 31 million articles; it has standardized on the NLM suite of XML applications, but not all its suppliers have taken the same step. A detailed workflow provides logging of every step from initial receipt to release into the database, with both automated and human-curated tracking of errors. The results enable the repository staff to aid information suppliers in providing the best results.

11:40 am - 12:20 pm

Quality control of PMC content: A case study

Christopher Kelly & Jeff Beck, National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH)

PubMed Central (PMC) is the US National Library of Medicine’s digital archive of life sciences journal literature. Publishers submit XML and SGML-tagged articles to PMC, which transforms them into a NISO Journal Archiving and Interchange Tag Set format; on average PMC processes 15,000 articles per month. For journals new to PMC, sample journal files are checked automatically and manually for completeness, DTD-validity, metadata accuracy, and graphic quality. Once journal content is coming to PMC on a regular basis, articles are spot checked. A PMC-built system describes content that needs to be checked and provides a list of typical errors. When problems are encountered, PMC staff determine whether the problems result from source content, PMC ingest transforms, or errors in rendering the normalized XML content. XML and automated tools do not solve all difficulties: a sharp eye and attention to detail are still a necessity.

12:20 pm - 1:00 pm

ACS publications — Ensuring XML quality

Tamara Stoker & Keith Rose, American Chemical Society

As a publisher of over 40 technical journals, the American Chemical Society has chosen an internal workflow that is entirely in XML, based on the NLM suite. However, authors do not submit papers in XML, so conversion to XML and validation of the results are necessary. The ACS uses both internal translation and validation tools and external conversion and composition vendors. Extensive statistical tracking has proven to be a key tool for process improvement and time savings.

2:05 pm - 2:45 pm

Beyond well-formed and valid

Sheila M. Morrissey, John Meyer, Sushil Bhattarai, Gautham Kalwala, Sachin Kurdikar, Jie Ling, Matt Stoeffler, & Umadevi Thanneeru, ITHAKA

Portico is a digital preservation service for journals, books, and other content. As of April 2012, Portico was preserving approximately 17.7 million journal articles, 17,000 books, and 1.5 million historical items, originally produced in approximately 300 different XML/SGML vocabularies. The Portico workflow is a pluggable framework: XML configuration files control each workflow step and dynamically select which tool to employ, based on the format or MIME type of the files being processed. To ensure error-free application deployment, these configuration files must be checked for properties beyond well-formedness and validity: logical consistency of the configuration, referential integrity for format identifiers and other configuration files, and the existence of directories, files, classes or other resources needed for processing a particular dataset. Using XSLT (with Java extension functions in some cases), ITHAKA is better able to check the correctness of configuration files.
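
The abstract describes these checks as implemented in XSLT. Purely as an illustration of the same kind of check, and in Python rather than XSLT, the sketch below tests referential integrity and resource existence on an invented configuration. The workflow, format, and step element names and the workdir attribute are assumptions made for the example, not Portico's actual vocabulary.

```python
import os
import xml.etree.ElementTree as ET

# Invented configuration: well-formed, but with two deployment errors.
CONFIG = """
<workflow>
  <formats>
    <format id="fmt-xml"/>
    <format id="fmt-pdf"/>
  </formats>
  <step name="validate" format="fmt-xml" workdir="."/>
  <step name="render" format="fmt-tiff" workdir="./no-such-directory-here"/>
</workflow>
"""


def check_config(xml_text):
    """Return a list of problems that well-formedness and validity would miss."""
    root = ET.fromstring(xml_text)
    known = {f.get("id") for f in root.iter("format")}
    problems = []
    for step in root.iter("step"):
        name = step.get("name")
        # Referential integrity: every step must point at a declared format.
        if step.get("format") not in known:
            problems.append(f"{name}: unknown format {step.get('format')!r}")
        # Resource existence: the working directory must be present on disk.
        if not os.path.isdir(step.get("workdir", "")):
            problems.append(f"{name}: missing directory {step.get('workdir')!r}")
    return problems


for problem in check_config(CONFIG):
    print(problem)
```

Here the "render" step is reported twice: once for a format identifier that was never declared and once for a directory that does not exist.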

2:45 pm - 3:15 pm

Case study: Quality assurance and quality control techniques in an XML data conversion project

Charlie Halpern-Hamu, Tata Consultancy Services

A wide variety of techniques have been used in an XML data conversion project. Emphasis on Quality Assurance, not making errors in the first place, was supported by Quality Control, catching errors that occurred anyway. Data analysis and estimation techniques included counting function points in source documents to estimate effort and autogeneration of tight schemas to discover variation. Quality Assurance relied on a guiding specification built around parent-child pairs and on programming for context and all content. Quality Control techniques included source-to-target comparison to check for lost or duplicated content, automatic highlighting of anomalous data, and use of XQuery to review data.
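
One simple way to picture source-to-target comparison (a minimal sketch, not the project's actual tooling) is to strip the markup from both documents and compare word counts. The function names and sample documents below are invented for the illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def words(xml_text):
    """Count the words in all text content, ignoring markup."""
    root = ET.fromstring(xml_text)
    # Join text nodes with spaces so adjacent elements do not fuse words.
    return Counter(" ".join(root.itertext()).split())


def compare(source_xml, target_xml):
    """Return (lost, duplicated) word counts between source and target."""
    src, tgt = words(source_xml), words(target_xml)
    lost = src - tgt        # words that occur more often in the source
    duplicated = tgt - src  # words that occur more often in the target
    return lost, duplicated


source = "<doc><p>Alpha beta gamma.</p><note>Delta</note></doc>"
target = "<article><para>Alpha beta beta gamma.</para></article>"
lost, duplicated = compare(source, target)
print("lost:", dict(lost))
print("duplicated:", dict(duplicated))
```

In this example the note text "Delta" is reported as lost and the repeated "beta" as duplicated. A bag-of-words comparison ignores word order, so it catches dropped or repeated content but not rearranged content.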

3:45 pm - 4:15 pm

Text analytics and the internal structure of content

Steven J. DeRose, OpenAmplify

Text analytics extracts features of meaning from natural-language texts and makes them explicit in much the same way as markup does. Using linguistic analysis, artificial intelligence, and statistical methods, text analytics can be used to get at a level of meaning often left unanalyzed down in the leaf nodes of XML trees. This suggests potential for new kinds of tools for quality control, consistency checking, and error detection. Are xml:lang values coded correctly? Text-analytic tools can reliably identify the language of a paragraph. Do references to people, places, organizations, and topics need to be marked up? Text analytics can often identify them reliably. Summaries, abstracts, conclusions, and discussions of methodology or future work often have distinctive text-analytic features, which means text-analytic tools can be used to check that they have been correctly, or at least plausibly, tagged in documents. This paper discusses a variety of ways that text analytics can help check and enhance the content of XML components, and will demonstrate some cases using a high-volume analytics tool on actual texts.

4:15 pm - 5:15 pm

Open Forum

Symposium attendees are invited to submit mini-proposals (a title and one sentence describing the topic) for five-minute mini-presentations. If there are more proposals than can be accommodated, proposals from people who have already spoken will be discarded, and random selections from the remaining proposals will be made. The only restriction on allowed content is the symposium topic. The Symposium’s organizers will terminate presentations that are not related to quality and XML, that are disrespectful of others or their points of view, or that are still incomplete after five minutes have elapsed.

Attendees are welcome to bring (or create on-site) visuals to support their mini-presentations, preferably in HTML on USB flash (thumb) drives.

5:15 pm - 5:30 pm


Mary McRae

There is nothing so practical as a good theory