How to cite this paper

Halpern-Hamu, Charlie. “TCS tcs: Tata Consultancy Services trash compactor script: Design considerations in the implementation of a boil-this-corpus-down-to-a-sample-document tool.” Presented at Balisage: The Markup Conference 2012, Montréal, Canada, August 7 - 10, 2012. In Proceedings of Balisage: The Markup Conference 2012. Balisage Series on Markup Technologies, vol. 8 (2012). https://doi.org/10.4242/BalisageVol8.Halpern-Hamu02.

Balisage: The Markup Conference 2012
August 7 - 10, 2012

Balisage Paper: TCS tcs: Tata Consultancy Services trash compactor script

Design considerations in the implementation of a boil-this-corpus-down-to-a-sample-document tool

Charlie Halpern-Hamu

Senior Solutions Architect

Tata Consultancy Services

`<charlie.hamu@tcs.com>`

Charlie has been working with structured text since 1991. During this time, he has acted as a content and systems architect, programmer, systems integrator, consultant, mentor, best-practices coordinator, trainer, book editor, project lead, department manager, and vice president. His consulting and training work has taken him all over North America as well as visits to South America, Europe, Australia and China. Charlie has a PhD in Computer Science from the University of Toronto and an MBA from Heriot-Watt University. He's good at making complex systems easy to understand. Or so he claims.

Disclaimer: All views expressed in the publication are of the author and Tata Consultancy Services (TCS) does not warrant, either expressly or implied, the accuracy, appropriateness of the information in the publication. TCS disclaims any responsibility for content error, omissions and any responsibility associated with relying on the information provided in the publication.

Abstract

Creation of representative sample(s) of a large document collection can be automated using XSLT. Such samples will be useful for analysis, as a preliminary document analysis step in vocabulary redesign or conversion and to guide design of storage, editing, and transformation processing. Design goals are: to work intuitively with default configuration and no schema, produce plausible output, and produce a range of outputs from a large representative set to a short but highly complex sample document. The technique can be conceptualized in passes: annotate structures as original or redundant; keep wrappers to accommodate original markup found lower in the hierarchy; retain required children and attributes; and collapse similar structures. Possible settings include redundancy thresholds, text compression techniques, target length, schema-awareness, schema intuitions, how much context to preserve around kept elements, and whether similar structures should be collapsed (overlaid).

Context
Alternatives
Goals
Algorithm
Efficiency
Parameters
Code
Discussion

Context

Sample documents play a critical role in the development of most XML systems.

But good sample documents are difficult to generate. There seem to be three basic approaches, each with its difficulties:

Craft. Artfully crafting sample documents by hand is usually a labor-intensive path to an artificial-feeling result. Hand-crafted samples can be especially frustrating when they go stale, when even minor schema changes necessitate significant rework.
Mad Libs.™ Automatically-generated samples are often so random and disconnected from reality that they are difficult for humans to comprehend and use. Based on all the possibilities of a schema, they include many combinations that simply would not occur in real life. The schema provides no guidance about the kinds of content that would make sense in each element, so contents are unfamiliar nonsense.
Curation. Thoughtfully choosing some sample documents from a larger set of available documents is the approach most similar to the approach presented by this paper. The difficulty is that most XML corpora are like war: long stretches of boredom punctuated by moments of mildly increased interest. It's easy to miss those documents that have interesting, unusual markup. And it's easy to miss the interesting, unusual markup even within the documents chosen.

There is often a tension between brevity and completeness. It's easiest for a human to understand a shorter, simpler documents. Including more elements, in more contexts, requires longer, more complex documents.

Alternatives

Before this paper was presented on Wednesday of the Balisage conference, there was a discussion of approaches to creating sample documents at the preceding symposium on QA and QC on the Monday. The following alternatives were listed:

Dale Waldt suggested starting with all XPaths (presumably FQGI, fully qualified generic identifier, ancestry paths) in a larger set, and finding an example of each.
Paul Ryan said he automated the extraction of all XPaths (again, presumably FQGIs) and automatically created Lorem Ipsum text.
Murray Maloney spoke of running all variations, as determined by the possibilities allowed by the Schema.
John Cowan talked about how, with LexisNexis, it was easy to get a firehose blast of volumes of data. He talked about crafting a sample document that was actually true facts, though a constructed sample.
Robert Stuart advised against real, or real-looking data, having once had an intelligence agency get upset with him when he to closely simulated, using public sources, the look of secret documents.
Syd Bauman said that when he needs a set of test documents to exercise a collection, he typically starts by cutting out the uninteresting TEI header metadata, and then keeping the first few paragraphs of each of several document. When he needs to generate test data about an instance he tends to need to be much pickier about which bits he keeps, and, in general, he does this by hand.

For the cases where content could not be shared, Syd also mentioned that Elliotte Rusty Harold had presented an XML Randomizer that worked like a neutron bomb on XML documents (my inappropriate analogy), scrambling the content but leaving the structures intact. [Harold]
Debbie Lapeyre warned against taking the first ten pages, and relayed a anecdote in which the last thousand pages were very different from the first ten.
Ian Gorman described taking all available content as the universe from which to choose. If I understood correctly, this is the approach automated by the current paper.

Goals

The primary goal of this project is a tool to boil large sets of XML data down to smaller samples. The samples should maintain as much as possible of the variety and complexity of markup patterns, and remove as much as possible of the useless repetitions.

Subsidiary design goals include:

Plausible results. The hope is that the results will feel natural: compact, but recognizable.
Good defaults. The default parameter settings should hit some kind of sweet spot. Is it presumptious to imagine that such a sweet spot exists? I imagine it to be something like this: one can say that the sample is complete and that at the same time it is manageably short.
Single dial. Given the trade-offs between brevity and completeness, it would be nice to have a single parameter that simultaneously adjusted all the other parameters towards one extreme or the other.
Ease over perfection. If the occassional result is invalid, because of a wrong guess about required elements, that's probably okay.
Use information from schema. If there is a schema available, it would be nice to take advantage of it. The goal would not to be to include elements only because they exist in the schema, but to avoid deleting elements that are required by the schema.
Intuit schema. My guess is that in most real-life situations, required elements can be easily intuited. For example, if every section contains a title, it's reasonable to guess that title is required in section. Of course this rule of thumb can fail in one of two ways:
1. False positive. It could be the case that every section in a sample contains a title, but only by chance. But even in this case, there's no denying that a representative sample will contain a title for every section.
2. False negative. It could be that the content model is more complicated, that either number or title is required on every section. As a result, the intuition will fail, and conclude both number and title to be optional. I haven't had enough real-life experience with the tool yet to conclude how much trouble this will create.

Algorithm

The algorithm, as implemented, involves multiple passes:

Annotate with signatures. This present implementation uses an attribute in a special namespace, tcs:signature to record the characteristics we're using to determine what counts as the same and what counts as different. This is configurable, and includes some combination of ancestry, previous siblings, element name, attributes specified and child elements.

Because of the choice to use an attribute, it's easiest to annotate only elements, not attributes, processing-instructions or comments. In an attempt to increase efficiency, a signature-to-element key was introduced. This introduced the possibility of using the same key mechanism to annotate attributes, processing instructions and comments.

The current implementation uses this first pass to shorten the text nodes. The reasoning is that there will be less text passing through the subsequent passes. This is really for ease of debugging: it actually seems to slow the process.
Mark unique elements. The next pass is to indicate a small number of each signature to keep. As implemented the first one (or two or more ... configurable) of each unique-signature element is kept, but there's no requirement that the first elements be kept.
Mark wrapping elements. If an element is to be kept, obviously all its ancestor elements must be kept. The current implementation marks these ancestors in a separate pass.
Mark required children. For all the elements that are to be kept, certain child elements need to be kept, as they are required ... or guessed to be.

In addition, this is a good time to mark inline elements mixed in with the text of the elements that are being kept. The reason for doing this is to maintain the sense of the retained text. If not, it's likely we'll lose the second and subsequent emphasized words in a sentence.
Prune unneeded elements. The final pass removes those elements that haven't been marked to be kept. The current implementation keeps all atrributes and also text, processing instruction and comment children of kept elements.

Efficiency

In the course of developing the current implementation, several efficiency issues were encountered. Not all have yet been resolved.

Regex text trimming. The current implementation keeps the first few words and the last few words of each text node. The regular expression processing to acheive this appears to be slow. My guess is that it has to do with backtracking, and might be improved by simply taking the first few words and not looking to keep the last.
Best time to trim text. In the naïve expectation that reducing the amount of text to be processed by subsequent passes would be an efficiency gain, the first pass does the trimming of the text nodes. But only a small number of text nodes are retained in the end, so it turns out to be more efficient to prune the tree first before abbreviating the text nodes.

The truth is that the most useful results might come from untrimmed text.
N-squared comparison to previous elements. Unlike text trimming, which isn't essential to the process, finding unique signatures is the purpose of the script, so it can't be skipped. The first implementation of the script used a comparison of the current tcs:signature with all the tcs:signatures on the with preceding axis. This obviously introduces an order n ² inefficiency. The second attempt involved using a key to map signatures to elements, and an intersect to detect if the set of similar signatures includes elements that overlap with the preceding axis. This is someone more efficient, but on the same order.

In order to work with large data sets, this needs to be resolved. The current solution is to use a non-XSLT streaming pass. The current implementation is not XML-aware, programmed using Perl regular expressions. It could be done using a streaming XML parser, but unless there are tcs:signature attributes that aren't actually markup in the documents, there will be no problem.
Multiple passes vs complicated XPath. The current implementation consists of several passes as outlined above. In order to accommodate a non-XSLT streaming pass in the middle, the first pass is in a separate script from the last passes. It's almost certainly the case that several of these calculations could be strung together in a more complicated XPath expression. This might be easier to understand, or not. And it might be more efficient, or not. I imagine it might be elegant, in a way.

Parameters

The following parameters have been implemented:

Ancestor count. How much context should be included in an element's signature? Measured on a scale of zero to infinity, with 0 representing no ancestry, 1 representing the the parent of the current element, 2 representing the grandparent and parent, 7 presumed to be the largest number ever needed short of 8 which is taken to represent infinity (at a ninety-degree angle). Infinity seems to be a good default, as some XML processing operates on fully-qualified generic identifiers. Also, it's common for deeply nested elements have a tendency to stress stylesheets and other processing.
Preceding-sibling count. How many preceding siblings should be included in an element's signature? A good default seems to be 1, looking only at the closest preceding sibling.
Self count. Should the element name itself be include in its own signature? Surely the answer should be yes in most applications. For consistency, this parameter is included, on a scale from 0 (no) to 1 (yes).
Attribute count. Should the element's specified attributes count as part of its signature? Measured on a scale of 0 (no) to 8 (all of them), with no sensible meaning for the numbers in between. The current script has a bug in that it doesn't sort the attribute names as it should. A good default seems to be to include the attributes names.
Child count. How many children should be included an element's signature? Again, this parameter is measured on a scale of 0 = none to 8 = infinity. The children are counted out in document order. The thought here is to distinguish between a section that starts with a number from a section that starts with a title. An overlapping effect can sometimes also be achieved by using using the preceding-sibling parameter: including a section not on its own merits, but because it contains a title that follows a number.
Start words count. How many words should kept at the start when abbreviating a text node?
Middle words count. What is the minimum number of words to delete from the middle of a text node? If there are less than this number of words in the middle, don't abbreviate.
End words count. How many words should be kept at the end when abbreviating a text node? So, a text node is not abbreviated unless it contains start + middle + end words. Though these parameters are still available in the full code, more readable results are found more efficiently by simply taking the first sentence of each text node. This is the approach taken in the simplified code below.
Repetition count. How many times would we like to see each signature? Since we may want to reduce a very large corpus to a still large sample, it doesn't make sense to follow the silly pattern of the other parameters, insisting that after 7 comes infinity. One discovery, working with the tool, is that the most effective way to get a bulkier sample is to include higher values for the various context parameters. This way, in addition to being simply bulkier, the extra included elements will show more variety. For this reason, the chosen default value is 1.
Show deletions. Should comments be left in the sample to indicate where elements have been pruned? It can be helpful, when working with a smaller sample, to have indication of where elements have been deleted from the larger source. As implemented, the integer parameter is interpreted as a Boolean. But it might make sense to interpret as a number: showing some number of comments before stopping. To avoid overwhelming results, the default value is 0 = no.
Debug signatures. Should the intermediate element signatures be shown in the final output? Like with the parameters above and below, it would make sense to interpret this value as a integer, rather than as a Boolean. The default value here is 0, meaning no tcs:signatures are shown.
Debug marks. Should the intermediate tcs:mark attributes be shown? Again, this integer parameter might make better sense interpreted as a count, rather than interpreted as a Boolean. The tcs:mark attributes are added to all elements that are to be kept, with the value indicating the reason. Reasons include that the element had a unique signature (or was within the specified repetition count), that the element had a descendent that was to be kept, that the element was known or guessed to be a required child of a kept element, or that the element was an textual inline in the middle of kept text. The default value is 0, meaning no intermediate tcs:mark attributes are shown.

The following parameters have been designed, but not yet implemented:

Collapse parallel structures.
Target length.
Schema information.
Text abbreviation.

Code

To illustrate the key concepts, a reference implementation is presented here. This version of the script is stripped of comments, parameters and efficiencies. Please contact the author for the current revision of the full script.

<?xml version='1.0' encoding='UTF-8'?>
<!-- TCS = Tata Consultancy Services; tcs = trash compactor script. -->
<!-- For brevity, this version is missing comments, parameters and efficiencies. -->
<transform xmlns='http://www.w3.org/1999/XSL/Transform' version='2.0'
           xmlns:xs='http://www.w3.org/2001/XMLSchema'
           xmlns:tcs='mailto:charlie.hamu@tcs.com'>
    <template match='/'>
        <variable name='signed'>
            <apply-templates select='node()' mode='sign' />
        </variable>
        <variable name='marked'>
            <apply-templates select='$signed' mode='mark' />
        </variable>
        <variable name='wrapped'>
            <apply-templates select='$marked' mode='wrap' />
        </variable>
        <variable name='needed'>
            <apply-templates select='$wrapped' mode='need' />
        </variable>
        <variable name='needed-2'>
            <apply-templates select='$needed' mode='need' />
        </variable>
        <variable name='wanted'>
            <apply-templates select='$needed-2' mode='want' />
        </variable>
        <variable name='pruned'>
            <apply-templates select='$wanted' mode='prune' />
        </variable>
        <sequence select='$pruned' />
    </template>
    <template match='*' mode='sign'>
        <copy>
            <attribute name='tcs:signature' select='ancestor::*/name(),
                                                    preceding-sibling::*[1]/name(),
                                                    name(),
                                                    @*/name(),
                                                    *[1]/name()' />
            <copy-of select='@*' />
            <apply-templates select='node()' mode='sign' />
        </copy>
    </template>
    <template match='*' mode='mark'>
        <copy>
            <if test='not(preceding::*[@tcs:signature = current()/@tcs:signature])'>
                <attribute name='tcs:mark' select='"keep"' />
            </if>
            <copy-of select='@*' />
            <apply-templates select='node()' mode='mark' />
        </copy>
    </template>
    <template match='*' mode='wrap'>
        <copy>
            <if test='not(@tcs:mark) and *//@tcs:mark'>
                <attribute name='tcs:mark' select='"wrap"' />
            </if>
            <copy-of select='@*' />
            <apply-templates select='node()' mode='wrap' />
        </copy>
    </template>
    <template match='*' mode='want'>
        <copy>
            <copy-of select='@*' />
            <if test='not(@tcs:mark) and ../@tcs:mark
                      and text()[normalize-space() ne ""]
                      and ../text()[normalize-space() ne ""]'>
                <attribute name='tcs:mark' select='"want"' />
            </if>
            <apply-templates select='node()' mode='want' />
        </copy>
    </template>
    <template match='*' mode='need'>
        <copy>
            <if test='not(@tcs:mark)
                      and ../@tcs:mark
                      and not(preceding-sibling::*[name() eq current()/name()])
                      and not(//*[name() eq current()/../name()
                                  and not(*[name() eq current()/name()])])'>
                <attribute name='tcs:mark' select='"need"' />
            </if>
            <copy-of select='@*' />
            <apply-templates select='node()' mode='need' />
        </copy>
    </template>
    <template match='*' mode='prune'>
        <if test='@tcs:mark'>
            <copy>
                <copy-of select='@* except @tcs:*' />
                <apply-templates select='node()' mode='prune' />
            </copy>
        </if>
    </template>
    <template match='text()' mode='prune'>
        <value-of select='replace(.,"\.\s.+","...","s")' />
    </template>
    <template match='comment() | processing-instruction()' mode='#all'>
        <copy />
    </template>
</transform>

Discussion

This paper was presented at Balisage 2012. [Balisage] In the discussion after, the following points were raised by attendees.

Pointers. It would be nice to link the samples, element-by-element, back to the full corpus, so that you could see the full context if curious.
Randomize. Liam Quinn suggested that if you are taking a small sample from a larger set, it would be nice to be be able to randomize the selection, rather than just taking the first examples of each markup combination.
Log the seed. As a follow-on suggestion, Lee says to log the seed used for the random selection, so that it can be recreated as required.
Redo from back. Wendell Piez observed that in order to avoid redundant elements, in which for example, the first paragraph is included because we've never seen a paragraph before, but the second paragraph is included because it contains an inline, we can run the algorithm twice: once front to back and then a second time back to front.
XQuery. There seemed to be general consensus that multiple passes were the natural approach, and that any attempt to collapse this into a single XPath expression or XQuery was misguided.
Xmlsh. David Lee said the individual passes could be run together with xmlsh. This would be especially relevant if the inefficient have-I-seen-this-before pass were recast using another programming language.
Efficiency. Michael Sperberg-McQueen said that the n² inefficiency could be solved, even in XSLT 1.0, using keys and Muenchian method. [Tennison]
Accummulators. Abel Braaksma said that XSLT 3.0 accumulators were designed to solve problems just like the have-I-seen-this-before inefficiency, and took the example back to the XSLT working group.

My initial intuition was to add the current element's signature to the accumulated set as the element starts. But this is too soon, as we need to know about all preceding elements, not including the one we're on. On the other had, updating the accumulator at the end of the element would be too late, as we would not have had its signature as we reviewed all of its descendants. The naive fix would be to somehow put the current element's signature on deck so that it can be folded in to the accumulated list of all signatures as soon we hit the start of the next element. Rather than building this data structure ourselves, the elegant choice seems to be to avoid the temptation to add the current element's signature to the accumulated set, but to rather add the immediately preceding element's signature to the accumulator.

References

[Balisage] Balisage: The Markup Conference, 2012 August 7-10, Hotel Europa, Montreal, Canada, http://balisage.net/Proceedings/vol8/contents.html (accessed 2012 August 17).

[Harold] Harold, Elliote Rusty. Paper: Obsuring XML, Proceedings of Extreme Markup Languages 2005 August 1-5, Montreal, Canada. http://conferences.idealliance.org/extreme/html/2005/Harold01/EML2005Harold01.html Presentation: Randomizing XML, http://cafeconleche.org/slides/extreme/randomizer (accessed 2012 August 20).

[Mad_Libs] Penguin Group (USA). Mad Libs, http://madlibs.com (Accessed 2012 August 20).

[QA-QC] International Symposium on Quality Assurance and Quality Control in XML, 2012 August 6, Hotel Europa, Montreal, Canada, http://balisage.net/QA-QC (accessed 2012 August 17).

[Tennison] Tennison, Jeni. Grouping using the Muenchian Method, http://jenitennison.com/xslt/grouping/muenchian.html (accessed 2012 August 17).

Balisage: The Markup Conference, 2012 August 7-10, Hotel Europa, Montreal, Canada, http://balisage.net/Proceedings/vol8/contents.html (accessed 2012 August 17).

Harold, Elliote Rusty. Paper: Obsuring XML, Proceedings of Extreme Markup Languages 2005 August 1-5, Montreal, Canada. http://conferences.idealliance.org/extreme/html/2005/Harold01/EML2005Harold01.html Presentation: Randomizing XML, http://cafeconleche.org/slides/extreme/randomizer (accessed 2012 August 20).

Penguin Group (USA). Mad Libs, http://madlibs.com (Accessed 2012 August 20).

International Symposium on Quality Assurance and Quality Control in XML, 2012 August 6, Hotel Europa, Montreal, Canada, http://balisage.net/QA-QC (accessed 2012 August 17).

Tennison, Jeni. Grouping using the Muenchian Method, http://jenitennison.com/xslt/grouping/muenchian.html (accessed 2012 August 17).

Author's keywords for this paper:

sample documents; sampling; big data; interoperability; quality; XSLT; modelling; querying; software-based processing; transforming

BalisageThe Markup Conference2012