More data than you know what to do with
It’s increasingly common for information workers to come in contact with unfamiliar datasets. Governments around the world are releasing more and bigger datasets. Corporations are collecting and releasing more data than ever, often in XML or another format easily convertible to XML. This trend, while welcomed by many, poses questions about how to come to understand large XML datasets.
In general, bulk analysis of large datasets gets done through an adhoc assortment of machine learning techniques. Few of these, however, are specifically targeted at the unique aspects of the XML data model, as examined in this paper.
One tool in particular that has come into widespread use for analyzing datasets is R, which has available a set of XML libraries R Language; Package 'XML'. However, while R is often useful for initial exploration, it lacks an upgrade path into an operational data store such as an XML database, and the XML capabilities provided are DOM and XPathcentric, which are less suited to bulk analysis.
The most common approach used by databases to deal with searching and navigating through large datasets is an index, trading space for time. Once an index is set up and occupying additional disk and/or memory space, certain kinds of queries run significantly faster.
With an unknown dataset, though, this presents a chickenandegg problem. Many database systems assume the availability of indexes to perform navigational functions such as faceted search. Without such features, it’s more difficult to get to know the data. And if the structure of the data is unknown, how can one create the necessary indexes in the first place?
One answer is that indexes should be arranged based on queries rather than data. This is a valid approach, although not of much help in the case of bootstrapping a truly unknown dataset.
Another approach, one explored in the remainder of this paper, is to rely on statistical sampling approaches rather than indexes. This presents tradeoffs in terms of size, speed, and accuracy. At the same time, it offers benefits:
Rapid start 
The ability to run queries immediately, without prior configuration. 
Productivity 
The ability to run queries before or during a lengthy indexing operation. 
Exploration 
An aid to identifying areas in the dataset that might requires some amount of cleanup before applying indexing. 
Crossplatform development 
None of the techniques here depend on proprietary index structures. 
Performance baselining 
These techniques can be used as a measuring stick to compare the size and speed tradeoffs of various index configurations. 
Though the techniques shown here are universally applicable, for convenience, code samples in this paper are based on an early access release of MarkLogic 6 MarkLogic docs.
Random Sampling
A common approach to characterizing a large population involves taking a random sample. A simple XQuery expression
fn:collection()[1]takes advantage of the implementation detail that many databases define document order across the entire database in an arbitrary order so that the "first" document is essentially a random choice. One common manual approach is to "eyeball" the first such document, and maybe a few others, to get an intuitive feel for the kinds of data and structures at hand. This approach has obvious scaling difficulties.
For larger samples, greater automation is needed for the analysis. We can define a process producing a sample of size N to be considered random if any particular collection of documents is equally likely to turn up as any other samesized collection of documents. Randomness of a sample is important, because a nonrandom sample will lead to systematic errors not accounted for in statistical measures of probability.
Within a random sample it is possible to examine characteristics of the sample and make statistical inferences about the overall population. The more documents in the sample, the better the approximation. When the overall population is small, it is possible to sample most or even all of the documents. But with large collections of documents, the proportion of documents contained in a reasonablysized sample will be very small. When dealing with a dataset of this size, certain simplifying assumptions become possible, as outlined in later sections of this paper.
Users of search interfaces have become accustomed to approximations, for example, a web search engine may report something like “Page 2 of about 415,000,000 results”, but as users go deeper into the result set, the estimated number tends to converge on some actual value. Users have accepted this behavior, though it is less commonly used in finergrained navigational situations. For example, if a sidebar on a retail site says, “New in last 90 days (328)”, quite often one will find that exactly 328 items will be available by clicking through all the pages of results.
This difference in user expectations can be exploited. In particular, in the case of first contact with a new XML dataset, exact results are far less important than overall trends and correlations, which makes a sampling approach ideal.
Sample size end error
Statistical estimates by definition are not completely reliable. For example, it's possible (though breathtakingly unlikely) that a random sample of 100 documents would happen to contain all 100 unusual instances out of database of a million documents. A surer bet, though, would be none of the unusual documents would turn up in a sample of 100. One measure of an estimate's reliability, called a confidence interval is related to a chosen probability range. To put it another way, if the random experiment were repeated many times whereby it was found that 95.4% of the calculated confidence intervals included the true value, one could say that 95.4% was the confidence interval. (Note, however, that when speaking about a single experiment, the estimated range either contains the true value or it doesn't and one would need more complicated Bayesian techniques to delve deeper.)
For a large population, it is convenient to use a confidence interval of 95.4%, which encompasses values two standard deviations from the mean in either direction and makes the math come out easier later on. At that confidence level, and assuming a large overall population relative to the sample size, the maximum margin of error is simply (continuing to use XQuery notation)
1 div math:sqrt($samplesize)although in particular cases, the observed error can be somewhat less.
For example, a sample size of 1000, likely to be conveniently held in memory, the maximum margin of error comes out to about 3.16%, which is good enough for many purposes.
A key weakness in sampling is that rare values are likely to be missed. A doctype that appears only 100 times out of a million is unlikely to show up in a sample of 100 documents, and on rare occasions when it does show up exactly once, straightforward extrapolation would infer that it occurs in 1% of all documents, a gross overestimation.
Performance
The approaches in this paper assume that an entire sample of documents can comfortably fit into main memory. To fulfill a randomsampling query without indexes, each document needs to be read from disk. Therefore, overall the performance of the query will be roughly linear in the sample size, plus time for local processing. Details will vary depending on the database system, but in general there will be some amount of data locality making for shorter documenttodocument disk seeks. A backoftheenvelope estimate is about 1 second per 200 documents in the sample, ignoring the prospect of documents cached in memory, perhaps somewhat higher with highperformance disks such as SSD or RAID configurations that stripe data across multiple disks.^{[1]}.
Dipping a toe into the database
Getting familiar with an XML dataset requires a combination of automated and handson approaches. While writing this paper, I had on hand a dataset of over 5 million documents, crawled from an assortment of public social media sources, of which I knew very little in advance.
The firstdocumentinthecollection approach mentioned above yields this ^{[2]}:
<sl:streamlet xmlns:sl="http://example.com/ns/socialmedia/streamlet"> <sl:vid>13067505336836346999</sl:vid> <sl:tweet>#Jerusalem #News 'Iran cuts funding for Hamas due to Syria unrest' http://t.co/ARRqabU</sl:tweet> </sl:streamlet>
The choice of the "first" document, is, of course, completely arbitrary. The second document has something completely different:
<person:person xmlns:person="http://example.com/ns/socialmedia/person"> <person:id>8999631448253261463</person:id> <person:followercount>0</person:followercount> <person:influence>0</person:influence> <person:name>Borana Mukesh</person:name> ... </person:person>
There's only so much one can learn from picking through individual documents.
The following XQuery 3.0 XQuery 3.0 code demonstrates this approach by examining a sample of documents in order to extract potentially interesting features, which a human operator can use to make decisions about where to dive deeper. The code assembles the following:

Root elements

Commonlyoccurring elements

Commonlyoccurring namespaces

Elements that tend to have a lot (or a little) text content

Text nodes that look like dates

Text nodes that almost look like dates

Text nodes that look like numeric data, for example years
let $dv := distinctvalues#1 let $n := ($samplesize, 1000)[1] let $xmldocs := spx:estdocs() let $textdocs := spx:esttextdocs() let $samp := spx:randomsample($n) let $cnames := $dv($samp/*/spx:name(.)) let $allns := $dv($samp//namespace::*) let $leafe := $samp//*[empty(*)] let $leafetxt := $leafe[text()] let $leafelong := $leafetxt [stringlength(.) ge 10] let $leafeshort := $leafetxt [stringlength(.) le 4] let $dates := $dv($leafelong [. castable as xs:dateTime]/spx:name(.)) let $neardates := $dv($leafelong [matches(localname(.), '[Dd]ate')] [not(. castable as xs:dateTime)]/spx:name(.)) let $allyears := $dv($leafeshort [matches(., "^(1920)\d\d$")]/spx:name(.)) let $allsmallnum := $dv($leafeshort [. castable as xs:double]/spx:name(.)) let $epd := count($samp//*) div count($samp/*) return <spx:datasketch xmldoccount="{$xmldocs}" textdoccount="{$textdocs}" binarydoccount="{$binarydocs}" elementsperdoc="{$epd}"> {$cnames!<spx:rootelem name="{.}" count="{spx:estbyQName(spx:QName(.))}"/> } {$allns!<spx:nsseen>{.}</spx:nsseen>} {$dates!<spx:date>{.}</spx:date>} {$neardates!<spx:almostdate>{.}</spx:almostdate>} {$allyears!<spx:year>{.}</spx:year>} {$allsmallnum!<spx:smallnum>{.}</spx:smallnum>} </spx:datasketch>
This code and the following listings make use of the following helper functions which contain vendorspecific implementations, which are not important here (see the Code section later for details):
spx:estdocs() 
A function that quickly estimates the total number of documents in the database. 
spx:esttestdocs() 
A function that quickly estimates the total number of documents in the database that consist of a single text node. 
spx:randomsample() 
A function that returns a random sample of documents from the database. 
spx:name() 
Returns a Clark name of a given node. 
spx:formatq() 
Returns a Clark name from a given QName. 
spx:nodepath() 
Returns an XPath expression that uniquely identifies a node. 
This code produced the following (with adjustments for line length):
<spx:datasketch xmlns:spx="http://dubinko.info/spelunx" xmldoccount="5789128" textdoccount="0" binarydoccount="0" elementsperdoc="12.88" > <spx:rootelem name="{...}person" count="1248848"/> <spx:rootelem name="{...}media" count="2117625"/> <spx:rootelem name="{...}streamlet" count="1173545"/> <spx:rootelem name="{...}author" count="1248815"/> <spx:nsseen> http://example.com/ns/socialmedia/person </spx:nsseen> <spx:nsseen> http://example.com/ns/socialmedia/media </spx:nsseen> <spx:nsseen> http://example.com/ns/socialmedia/streamlet </spx:nsseen> <spx:nsseen> http://example.com/ns/socialmedia/author </spx:nsseen> <spx:nsseen> http://www.w3.org/XML/1998/namespace </spx:nsseen> <spx:date>{...}ingested</spx:date> <spx:date>{...}published</spx:date> <spx:date>{...}canonical</spx:date> <spx:date>{...}inserted</spx:date> <spx:smallnum>{...}followercount</spx:smallnum> <spx:smallnum>{...}influence</spx:smallnum> <spx:smallnum>{...}followercount</spx:smallnum> </spx:datasketch>
This dataset appears fairly homogeneous: only four different root element
QNames, were observed over 1,000 samples. Additionally, these documents
contain a number of elements that seem datelike, but would require some
cleanup in order to be represented in as the Schema datatype xs:dateTime.
For purposes of this paper, one particular element, influence
,
as seen earlier, seems particularly
interesting. Is there a way to learn more about it?
Digging Deeper
It’s possible to perform similar kinds of analysis on specific nodes in the database. Given a starting node, the system of XPath axes provides a number of different ways in which to characterize that element’s use in a larger dataset. Some care must be taken to handle edge cases, assuming nothing in an unknown environment. The following code listing characterizes a given element node (named with a QName) along several important axes:
let $dv := distinctvalues#1 let $n := ($samplesize, 1000)[1] let $samp := spx:randomsample($n) let $ocrs := $samp//*[nodename(.) eq $e] let $vals := data($ocrs) let $numbervals := $vals [. castable as xs:double] let $nv := $numbervals let $datevalues := $vals [. castable as xs:dateTime] let $blankvals := $vals[matches(., "^\s*$")] let $parents := $dv( $ocrs/nodename(..)!spx:formatq(.)) let $children := $dv($ocrs/*!spx:name(.)) let $attrs := $dv($ocrs/@*!spx:name(.)) let $roots := $dv($ocrs/root()/*!spx:name(.)) let $paths := $dv($ocrs/spx:nodepath(.)) return <spx:nodereport estimatecount="{spx:estbyQName($e)}" samplecount="{count($ocrs)}" numbercount="{count($numbervals)}" datecount="{count($datevalues)}" blankcount="{count($blankvals)}"> {$parents!<spx:parent>{.}</spx:parent>} {$roots!<spx:root>{.}</spx:root>} {$paths!<spx:path>{.}</spx:path>} <spx:min>{min($numbervals)}</spx:min> <spx:max>{max($numbervals)}</spx:max> {if (exists($vals)) then <spx:mean> {sum($nv) div count($nv)} </spx:mean> else () } </spx:nodereport>
These two techniques combine to provide a powerful tool for picking through an unknown dataset. First identify ‘interesting’ element nodes, then dig into each one to see how it is used in the data. While the sample documents are in memory, it is possible to infer datatype information, and for values that look numeric, to calculate the sample min, max, mean, median, standard deviation, and other useful statistics.
These techniques can be readily expanded to include statistics for other node types, notably attribute and processinginstruction nodes.
Freeform faceting
Indexbacked approaches make it possible to produce a histogram of values, often called "facets", for example all the prices in a product database, arranged into buckets of values like 'less than $10' or '$10 to $50' and so on.
It’s possible to combine the concepts introduced thus far by breaking down a random sample into faceted data.
With no advance knowledge of the range of values, it’s difficult to arrange values into
reasonable buckets, but with some spelunking, as in the preceding section,
it’s possible to construct reasonable bucketing. Based on the exploration
from the preceding sections, the influence
element looks
worth further investigation.
The following XQuery function plots out the values of a given element as xs:double values in specified ranges.
declare function spx:histogram( $e as xs:QName, $samplesize as xs:unsignedInt?, $bounds as xs:double+ ) { let $n := ($samplesize, 1000)[1] let $samp := spx:randomsample($n) let $fullpopulation := spx:estdocs() let $multiplier := ($fullpopulation div $n) let $ocrs := $samp//*[nodename(.) eq $e] let $vals := data($ocrs) let $numbervals := $vals [. castable as xs:double]!xs:double(.) let $buckettops := ($bounds, xs:float("INF")) for $buckettop at $idx in $buckettops let $bucketbottom := if ($idx eq 1) then xs:float("INF") else $buckettops[position() eq $idx  1] let $sampcount := count($numbervals [. lt $buckettop][. ge $bucketbottom]) let $p := $sampcount div $n let $moe := 1 div math:sqrt($samplesize) let $SE := math:sqrt(($p * (1  $p)) div $n) let $estcount := $sampcount * $multiplier let $error := $SE * $fullpopulation let $esttop := $estcount + $error let $estbot := $estcount  $error return <histogramvalue ge="{$bucketbottom}" lt="{$buckettop}" samplecount="{$sampcount}" estcount="{$estcount}" estrange="{$estbot} to {$esttop}" error="{$error}"/> };
This code accepts a particular QName referring to an element, a
sample size, and an ordered set of numeric bounds, and returns the
approximate count of values that occur in between each boundary. The first
bucket includes values down to INF
, and the last bucket
includes all values up to INF
. Selecting values to partition
the values into orderofmagnitude buckets will give a broad first
approximation of the distribution.
For comparison, the following vendorspecific code, which requires a preexisting inmemory index, resolves the exact counts of different values occurring in the database.
for $bucket in cts:elementvalueranges( QName("http://example.com/ns/socialmedia/person", "influence"), (1,10,100,1000), "empties") return <histogramvalue ge="{($bucket/cts:lowerbound, 'INF')[1]}" lt="{($bucket/cts:upperbound, 'INF')[1]}" count="{cts:frequency($bucket)}"/>
The results of calling these function on the test database are given below in table format.
How wrong can you get?
As the book Statistics Hacks Statistics Hacks states,
Anytime you have used statistics to summarize observations, you’ve probably been wrong.
This technique is no exception.
As mentioned earlier, if we assume that the sample is of a small proportion of the overall population and is randomly selected, the maximum margin of error is a simple function of sample size. However, against particular values we can usually find a more accurate estimate.
To estimate the overall proportion, the standard error of the proportion must be computed, using the following formula.
math:sqrt(($p  $p * $p) div $samplesize )The maximum error, which occurs when the proportion is exactly 50%, is exactly half of the margin of error calculation earlier^{[3]}.
The following table illustrates the tradeoffs in accuracy, runtime, and necessity of a preconfigured index. The columns on the left represent histogram buckets of values of various ranges, while the rows across the top represent different sample sizes, or in the case of the final column, exact index resolution. The hardware under test consisted of a 2.4Ghz Dualcore 64bit Intel machine with two 15k rpm disks in a RAID O configuration.
Table I
estimate, n=10  estimate, n=100  estimate, n=1000  estimate, n=10000  exact index resolution  

values < 1  115,7825  810,477  897,314  782,111  807,284 
1 <= values < 10  0  0  17,367  25,472  28,414 
10 <= values < 100  578,912  115,782  208,408  164,990  161,734 
100 <= values < 1000  578,912  347,347  219,986  208,408  204,298 
values >= 1000  0  0  28,945  36,471  47,070 
Run time  0.35 sec  0.48 sec  1.9 sec  19.4 sec  0.19 sec 
Unsurprisingly, at smaller sample sizes, it is probable that infequentlyoccurring data will be completely excluded from the random sample. Even the most frequentlyoccurring values, in this case the bucket of values less than one, occurs in less than 14% of the 5.7M documents. Given this, the accuracy of the random sampling technique, even at the lower sample counts, is more than enough to give a general impression of the distribution of the data values.
To visulaize this, it is possible to export these values into a desktop spreadsheet program and produce a graph, including error bars, as shown in the following figure.
Figure 1: Graphical representation of data distribution (by percentage)
Conclusion
The techniques shown in the paper offer a useful framework within which to make the initial foray into an unknown XML dataset. Starting with an automated rundown of highlevel features in the dataset, particular QNames chosen by the user can be drilled down into deeper analysis. The dataset can even be summarized through histogram facets, much like those available to significantly more resourceintensive indexed databases.
The techniques shown here do not rely on proprietary features and are applicable to a wide range of available XQuery processors.
The book How to Lie with Statistics How to Lie with Statistics concludes with advice on how to be properly skeptical of statistics, and the guidelines apply to the techniques in this paper as much as in any other area.
What's missing? 
Be on the lookout for areas where summarization may be obscuring important facts. 
Did somebody change the subject? 
Beware of an unfounded jump from raw figures to conclusions. 
Does it make sense? 
Any results that come from these techniques need to be eyeballed. Anything that seems wildly out of proportion needs to be more closely examined. 
With those caveats, the techniques in this paper can provide a useful lever by which to pry open a large, unkonwn XML data set.
Code availability
The code samples mentioned in this paper are in available in a project named Spelunx available at GitHub Spelunx.
Further topics to explore

Correlation and coexistence between given nodes

Multidimensional sampling, as in geographic data

Searching for correlated latitude and longitude pairs

Hypothesis testing and type I vs type II errors

Markov chain analysis for element and attribute containership

Comparison and contrast with machine learning techniques

Exploring the availability of randomsampling extension functions from different database vendors

Ways to summarize mixed content
References
[R Language; Package 'XML'] Duncan Temple Lang (editor), Package 'XML', version 3.94
[online]. [cited 13th July, 2012]. http://cran.rproject.org/web/packages/XML/XML.pdf
[XQuery 3.0] Jonathan Robie, Don Chamberlin, Michael Dyck, and John Snelson (editors), XQuery 3.0: An XML Query Language
[online]. [cited 13th July 2012]. http://www.w3.org/TR/2011/WDxquery3020111213/
[MarkLogic docs] MarkLogic Corporation, MarkLogic Server Search Developer's Guide
[online]. © 2012 [cited 13th July 2012]. http://developer.marklogic.com/pubs/5.0/books/searchdevguide.pdf
[Clark notation] James Clark, XML Namespaces
[online]. [cited 13th July, 2012]. James describes "universal names written as a URI in curly brackets followed by the local name" which have proved to be a useful construction in more contexts than explaining namespaces. http://www.jclark.com/xml/xmlns.htm
[Statistics Hacks] Bruce Frey, Statistics Hacks: Tips & Tools for Measuring the World and Beating the Odds
(O'Reilly Media, 2006). Despite the name, includes a great deal of basic information on statistics and the math behind it.
[How to Lie with Statistics] Darrell Huff, How to Lie with Statistics
(W. W. Norton & Company, 1993 reprint). To appreciate the power of statistics, you must understand how it can be used as a weapon.
[Spelunx] Micah Dubinko, Spelunx
open source project [online] https://github.com/mdubinko/spelunx
^{[1]} In general, documentlevel caching provides little benefit for randomsampling based approaches, as a different random set of documents gets selected on each run of the experiment.
^{[2]} Namespace URIs have been anonymized.
^{[3]} Half because margin of error already accounts for error in the plus or minus direction, i.e. a diameter, while standard error or the proportion does not, i.e. a radius.