TCS tcs: Tata Consultancy Services trash compactor script
Design considerations in the implementation of a boil-this-corpus-down-to-a-sample-document tool
Senior Solutions Architect
Tata Consultancy Services
Representative samples of a large document collection can be created automatically using XSLT. Such samples are useful for analysis, as a preliminary document analysis step in vocabulary redesign or conversion, and to guide the design of storage, editing, and transformation processing. The design goals are to work intuitively with the default configuration and no schema, to produce plausible output, and to produce a range of outputs, from a large representative set to a short but highly complex sample document. The technique can be conceptualized in passes: annotate structures as original or redundant; keep wrappers to accommodate original markup found lower in the hierarchy; retain required children and attributes; and collapse similar structures. Possible settings include redundancy thresholds, text compression techniques, target length, schema-awareness, schema intuitions, how much context to preserve around kept elements, and whether similar structures should be collapsed (overlaid).
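The first two passes can be illustrated with a minimal sketch. It is written in Python rather than XSLT purely so that it is self-contained, and it is not the tool's actual implementation. The names `signature`, `annotate`, `prune`, and `sample` are hypothetical, and the notion of "same structure" used here (element name, attribute names, and sequence of child element names) is an assumption made for illustration. Retaining required children, collapsing similar structures, and all configurable settings are left out.

```python
import xml.etree.ElementTree as ET

def signature(elem):
    # Assumed fingerprint of a structure: element name, attribute names,
    # and the sequence of child element names. Text content is ignored.
    return (elem.tag, tuple(sorted(elem.attrib)), tuple(c.tag for c in elem))

def annotate(elem, seen, kept):
    # Pass 1: an element is "original" the first time its signature is seen,
    # and "redundant" afterwards. A redundant element is still kept as a
    # wrapper if any descendant is original.
    sig = signature(elem)
    keep = sig not in seen
    seen.add(sig)
    for child in elem:
        # Visit every child so that all signatures are recorded.
        if annotate(child, seen, kept):
            keep = True
    if keep:
        kept.add(elem)
    return keep

def prune(elem, kept):
    # Pass 2: drop every subtree that holds nothing original.
    for child in list(elem):
        if child in kept:
            prune(child, kept)
        else:
            elem.remove(child)

def sample(xml_text):
    root = ET.fromstring(xml_text)
    kept = set()
    annotate(root, set(), kept)
    prune(root, kept)
    return root

SOURCE = """<doc>
  <p>one</p>
  <p>two</p>
  <p>three <b>bold</b></p>
  <sec><p>four</p></sec>
  <sec><p>five</p></sec>
  <sec><p id="x">six</p></sec>
</doc>"""

demo = sample(SOURCE)
print([child.tag for child in demo])  # ['p', 'p', 'sec', 'sec']
```

In this toy input the second plain `p` and the second `sec` are dropped as redundant, while the third `sec` survives only as a wrapper for the `p` that carries a new attribute. The first `sec` is left empty, which is the kind of implausible output that the "retain required children" pass is meant to prevent.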