Where Are All The Bugs?
Introspection in XQuery
Copyright © 2013 Mary Holstege
Table of Contents
Copyright © 2013 Mary Holstege
Table of Contents
Ensuring that software works properly can be a difficult undertaking. When a code base is small and worked on by a handful of people, you can manually generate tests with good confidence that they cover the ground. As the code base grows it develops more complex interactions, includes more features, and is modified by more and more developers. The code develops unused byways and tangles of dependencies. It becomes unfeasible to manually develop tests for every feature and combination of features. Complete automated testing of all possibilities also becomes increasing less feasible: developing cases, executing them, and analyzing the results becomes more and more time consuming. The sad truth about QA is that the difficulty of ensuring that software works correctly does not scale linearly with the size of the code base.
Analysis of code usage patterns can help focus attention where testing can do the most good. Code analysis can also provide a different perspective on the code that can lead to uncovering potential issues as well. For example, a function that is not called much probably doesn't need a lot of testing. Testing is designed to limit risk, and the risk of a bug in a rarely used function is commensurately low. On the other hand, a function that is widely called may also not need a lot of additional testing on the grounds that it is already tested in the process of testing its callers. So again, from a risk perspective, the risk of not adding additional tests for that function is also low. Beyond simple testing, analysis of usage patterns can perhaps highlight areas where APIs can be simplified by removing unnecessary options and alternatives.
The schema types and elements used in function argument and return types form an important part of the API. Analysis of the usage patterns of declared types and elements against the available pool exposed by the in scope schemas can also provide insights into possible API simplification, or testing focus.
The idea explored here is that (a) functions and types form vocabularies of sorts and the distribution of usage of those vocabularies follows the well-established laws for natural language vocabularies (i.e. Zipf's law) and as such (b) TF-IDF score calculations over those vocabularies make mathematical sense and (c) high scoring items make interesting and useful targets for testing.
This paper describes a case study in analyzing a large XQuery code base using XQuery and XML Schema introspection in an attempt to determine where to focus testing efforts, based on this idea. For functions at least, there seems to be some reason to conclude that the idea holds promise. Types, less so.
Several extensions to XQuery are required to perform this introspection: a schema component API, functions to provide lists of available functions and return types and argument types of functions. It is also convenient, in order to implement the entire processing reasonably efficiently in XQuery, to have map data types, a structured dump of the parse tree of XQuery modules, filesystem access functions, and the ability to evaluated constructed XQuery functions.
About ten years ago, a handful of developers worked on a couple thousand lines of XQuery code, and one person manually constructed the tests to add to automatic regression suites. All was well. As time went by, the code grew and became more complex, the number of people manually constructing tests increased, with additional people to manually analyze regression failures. Some tests are automatically generated, but such tests tend to raise analysis costs when something does fail. The time has come to think about new approaches and find ways to focus effort.
The code base consists of about 630 thousand lines of XQuery code, divided roughly evenly between an old and much modified schema-driven GUI application for database configuration and management, more recent REST endpoint implementations and web applications, and a collection of library modules (including some main modules used in triggers). In total there are a little over 1500 XQuery module files, defining over 68 thousand functions. In addition, the particular XQuery implementation associated with this collection of XQuery modules also provides a little over 1800 built-in functions. XML schemas are actively used to strongly type parameters: there are over 150 schema files exposing over 12 thousand global element declarations, 26 thousand named simple type definitions, and 7 thousand named complex type definitions. The bulk of these schemas relate to XHTML, however. Setting that aside there are over 50 schema files exposing over 7 thousand global element declarations, almost 18 thousand named simple type definitions, and 3 thousand complex type definitions.
This is a live and active body of code, ranging in age from code first written ten years ago to functions still just being implemented today. A peculiar problem with this particular code base is the range of syntax in play: The code ranges from XQuery 1.0 compliant code to code using some XQuery 3.0 features or anticipatory XQuery extensions (similar to although not necessarily identical to proposed XQuery features) and some very old code still using some 2003 draft syntax. Relying on the XQuery engine itself to analyze such diverse code is helpful, because the engine already has to reconcile the syntactic differences internally. This peculiarity does mean that third party code analysis tools cannot be used.
Figure 1: Detailed statistics
|Group||Files||Lines of code|
|REST endpoints and applications||387||101322|
|Global element declarations||12324||7338|
|Named simple type definitions||26536||17866|
|Named complex type definitions||7038||3009|
|Group||Number of Functions|
|Built-in functions (C++)||1849|
|XQuery module functions||68191|
Zipf's law is an empirical observation Zipf49 that
in a natural language corpus, the distribution of word frequencies (number of
occurrences, that is) varies in
inverse proportion to the rank of the word in frequency order.
Frequency(R) = Frequency(1)/Rk where
Frequency(R) is the frequency of the word of rank
k is some constant.
If you graph the distribution on log-log axes, you get a straight line.
Such distributions are very common across various fields and
applications. This paper will consider sequence types used in function
parameters and return types as forming one vocabulary, and the names of
functions themselves forming another, and look at the frequency distribution of
these terms in the corpus of XQuery modules.
TF-IDF scoring has become a standard technique for computing relevance of documents to full-text queries since it was introduced in 1972 Jones72. TF stands for "term frequency" and is a count of how often a particular term occurs in a particular document. In the context of full-text search a term is typically a single word, but other terms for phrases or wildcards are possible. IDF stands for "inverse document frequency" and is the number of documents a term occurs in, as a fraction of the total number of documents in the corpus. The "inverse" part of IDF means that this number is divided, so that higher document frequencies lead to smaller scores.
When TF-IDF scores are used to compute relevance to a query, the term and document frequencies of each query term are combined together. The technique is based on a particular probabilistic model of documents, but the intuition behind it is simple enough: more occurrences of a query term makes a document more relevant, and occurrences of query terms that only occur in a few documents makes a document more relevant.
Various modifications and extensions of the technique are currently ubiquitous in modern search engines. It has also been put to related purposes, such as computing document similarities and clustering. In these applications there may not be a query per se, but the set of terms is taken from the documents themselves.
Here, I will attempt to use TF-IDF scores of function calls and types as a way of selecting "interesting" functions and types on which to focus QA attention. In this model, the XQuery modules play the role of documents, function calls or types play the role of terms, and the terms are taken from the documents themselves, as one would see in a clustering or similarity application.
First, we will look at the actual distributions of types and functions, to see if computing scores makes sense at all. Where it makes sense to do so, we will look at computed scores qualitatively and quantitatively to try to see whether the high scoring items are indeed interesting targets for testing focus.
XQuery XQuery30 functions may have declared
sequence types for their parameters and their return values. XQuery sequence
types may be either simple type names XSD11.2 or node
tests of some kind. It is
possible to define sequence types that name a complex type without naming a
specific element. XQuery
3.0 F&O30 also allows for functions tests.
types also have cardinality indicators, with '?' meaning a sequence of 0 or 1
instances of the base type, '*' meaning any
number, '+' meaning 1 or more, and no indicator meaning exactly 1. When no
explicit sequence type declaration is given, the sequence type is the most
generic, matching anything:
declare function my:example($x, $y as xs:string?) as element(my:result)
In this code base and the underlying XQuery engine associated with it,
the built-in functions return 196 distinct declared sequence types while the
XQuery modules use 629 distinct declared sequence types. In this particular
XQuery implementation, it is not possible to declare different named element
sequence types for parameters or return types of built-in functions.
That is, a built-in function may be declared as returning
If we consolidate all the
types for module functions, there are only 103 distinct declared sequence types.
Obviously, the sequence types used in the API form but a small part of
the total set exposed through the schemas. Only 3 named complex types are
explicitly used, from the complete set of 3 thousand (non-XHTML) named complex
types. Only 474 named elements are explicitly used, from the complete set of 7
thousand. Only 49 named simple types are explicitly used by module
functions. The overabundance of named schema components is partly because named
types and are used implicitly in the schema-driven configuration and
administration UI, and partly a matter of stylistic preference for global
element declarations and for declaring functions using generic element tests
element()) rather than specific complex types
element(*,my:example)). Regardless, it doesn't look like it would
be helpful to pay special attention to types or elements that are not
specifically referenced in function signatures.
Figure 3: Sequence type distribution (return values)
The return type frequency distribution curves in Figure 3 do show that these return types mostly follow a power law distribution, although for built-in functions there is very substantial drop-off at ends, possibly as a consequence of the relatively small vocabulary size.
The distribution of function parameter sequence types looks very similar to that of function return value sequence types, including the drop-offs at the extremes for built-in functions, as shown in Figure 4
Figure 4: Sequence type distribution (parameters)
The top five return value and parameter sequence types share some
overlap between built-ins and module functions, although there are some
interesting differences. The atomic types
cts:query are special built-in types, the former anticipating
XQuery maps, and the latter in support of full-text queries. Given the
importance of full-text queries to this implementation, the strong showing
cts:query is unsurprising. The strong showing of
map:map demonstrates the utility of this feature. In both
cases the distribution does suggest that these are important types and
should be the focus of some QA effort.
code base it appears that developers use strong types on parameters more
often than on
function return values (as indicated by the higher prevalence of
item()* for function return values but not for parameters).
Simple types dominate the distribution for parameter types but less so for
return types. The high frequency of
empty-sequence() in return
types is a consequence of the fact that the implementation supports
functions that update the data store, so many functions are called for
effect rather than for the return value.
|Return types (modules)|
|Simple types (ignoring cardinality)||
|Return types (built-ins)|
|Simple types (ignoring cardinality)||
Given that the distribution of types follows a classic Zipf distribution, it is not unreasonable to apply the probabilistic document model on which TF-IDF scoring is based.
The score of a particular term
t that occurs at least
once in the corpus of
N documents is given by the following
log is the natural logarithm,
TF(t) is the number of occurrences of
DF(t) is the number of documents in which
The sequence types with the highest scores are almost all
distinctly named elements, typically those used as important API
elements. On the other hand, it must be said that there doesn't seem to
be any qualitative difference between distinctly named elements with high
scores and those with much lower scores. Various kinds of
elements are scattered throughout the range of scores, for
element(xproc:xslt) has a much higher score than
element(xproc:xquery) but again, it is hard to see a difference in
the QA needs of these two sequence types. What are we to make of the fact that
xs:long has a much lower score than
much, I think. Perhaps the vocabulary of sequence types is too small to be
Figure 5: Sequence Type Scores
39.10 element(configuration) 33.70 element(opt:options) 31.55 element(forest:forest-status)* 30.14 element(flexrep:configuration) 28.85 element(plugin:plugin-model)? 28.66 element(opt:constraint) 28.18 element(rsrc:resources) 27.74 element(alert:config) 27.26 lnk:uri 27.09 element(sec:external-security) ... 14.10 xs:integer 14.04 xs:float ... 12.45 element(xproc:xslt) 12.30 xs:integer* 12.12 xs:double ... 8.33 xs:date 7.86 xs:long ... 7.35 element(xproc:string-replace) 7.35 element(xproc:xquery) ... 4.79 cts:element-attribute-value-query 4.79 element(project)* 4.64 xs:long* 4.41 cts:word-query 4.36 function(*)+ 4.35 text() 4.23 element(test)* 4.07 map:map* 4.03 function() 4.00 cts:element-value-query
Type frequencies for built-in functions were computed using some introspection APIs: a function to enumerate all the defined functions, and accessor functions for the return type, the arity of the function, and the parameter types. Function accessors defined by Holstege Holstege and function introspection functions defined in XQuery 3.0 F&O30 do not quite do the job. The accessors defined by Holstege only give the atomic type names, not the full sequence types, while the XQuery 3.0 functions only give access to the function name and its arity. I used three additional built-in extension functions, one to enumerate the set of functions, and two to give the sequence types as string values. Maps kept running totals of the usage of each sequence type.
Computing type frequencies for module functions requires a little
more. There are two approaches: (1) use an evaluation function that
imports a given module and uses the same approach as for built-ins, but
only considering functions in the proper namespace, (2) dump out the
module parse tree in a structured form and analyze it for the necessary
information. In this case, I used the second approach, using a built-in
extension function that dumped out the parse tree. The format of this
dump was not, alas, XML, so fishing information out of the text relied
heavily on the XQuery 3.0 function
Doing everything in XQuery was convenient. Doing so meant using
non-standard features such as maps and access to filesystem information.
The filesystem access gave me the equivalent of a recursive
find. It was handy, but not
strictly necessary: one could instead use shell scripts to gather up the
set of module names to process them, and then pass that to the analysis
module. Strictly speaking, maps are not necessary either, but they made
it much easier to write the analysis modules.
Function call distributions show the same characteristic Zipf
distribution as types (see Figure 6, although with a
bit of a drop-off at the upper end of the curve. The majority of the top
10 functions are XQuery standard built-in functions, although map-related
functions occur at rank 8 and 9. The top 10 most frequently called
Figure 6: Function call distribution
Quite a few functions are not used within the code base at
all. A total of 497 built-in functions are unused, although only 16
module functions are unused. Given that there is a much larger number of
declared module functions, the disproportionate nature of this result
demands comment. It turns out that the unused built-in functions include a
number of accessor functions for special types (defined for API
completeness), some specialized functions that just happen not to be used
in the applications in this code base (geospatial functions, for
example), and the
constructor functions for some built-in types including some relatively
uncommon types such as
Figure 7 gives a rough break down.
|Accessors and constructors for special built-in types||30.6|
|System management and operations||16.3|
|Advanced specialized functions||14.7|
|Assorted standard functions||13.8|
|Assorted standard XQuery functions||6.4|
|In active development||2.4|
The fact that certain specialized functions are not called anywhere in the code base is not a problem, per se, but the fact that out of an assortment of special built-in types of a particular class (geospatial, for example) only a couple are not represented at all may warrant further investigation. It may be a sign of missing functionality in higher level APIs.
The situation with the unused module functions is better: all but one of the unused functions have to do with features under active development or testing/debugging code. The remaining unused function, however, looks like it does represent a genuine oversight.
Again, since the function call distribution follows a classic Zipf distribution, it is not unreasonable to apply the probabilistic document model on which TF-IDF scoring is based.
Figure 8: Function Scores
43.02 agui:navItem 42.49 admin:database-set-value 41.43 agui:navItemClosed 41.35 admin:gr-config 40.60 admin:database-get-value 39.75 agui:reindex-done 39.48 admin:db-config 38.99 admin:appserver-set-value 38.48 admin:appserver-get-value 38.04 aws:add-param ... 7.35 agui:buildHTTPServer2 7.35 converters:set-response-content-type 7.35 utils:distinct-values 7.35 conf-server:create-server 7.35 pipeline:xquery 7.35 agui:get-action 7.35 entity:call-calais 7.35 search-ast:do-parse 7.35 compile:build-zip 7.35 forest:get-default-rep
A look at the functions with the highest scores was encouraging:
the number one
navItem, is in a notorious part of the code
that has been subject to much revision and bug fixes. The function with
the third highest score was its companion function
navItemClosed. Other functions in the top ten were
service functions used within administrative API. These service
functions are called from many places, but only from within that one
module. However, that one module is very large. Document size
normalization is a common modification to TF-IDF scoring. Perhaps a similar
normalization here would be useful.
Most of the top 50 functions are other service functions for
other modules that have similar calling patterns. Are these good
functions to focus extra testing effort on? Perhaps. Directed testing of
such functions is likely to expose important bugs, but
probably normal API testing would have accomplished the same thing.
The fact that only one or two modules uses these functions does increase
the risk that fewer developers are calling them and therefore they are more
likely to be making (possibly unwarranted) assumptions about how they
will be used.
At the other end of the
scale, with very low scores, are widely used functions with names like
buildHTTPServer2 (widely used throughout the Admin UI) and
do-parse (a key function for the search APIs) and rarely
used functions such as
create-server. Are these good functions to avoid expending
extra testing effort on? Perhaps. Certainly the very widely used functions
would have been exercised many times in many different contexts through normal
API testing. The rare functions have demonstrably few interactions with the
rest of the code, so presumably the risk of not expending a lot of effort on
testing them is low.
Quantifying the usefulness of this approach is difficult. Getting a concrete measure for the bugginess of a particular function is hard as bug reports are linked only weakly to particular files, much less individual functions within those files. Here we will take the number of source control revisions within the scope of a particular function as a measure of how likely it is to need fixing. This is obviously an imperfect measure, as source changes may reflect new functionality, or trivial formatting changes. It is also a manually intensive measure to obtain. The following graph shows statistics for functions that were used more than once in the corpus that were defined in the same module as the top scoring function. It is an admittedly very small data set.
Figure 9: Revisions as a function of score
The correlation of score to the numbers of revisions is positive,
although not strong (
There are other associations we might expect to have stronger or more
meaningful correlations. For example, one might think that the number
of revisions would correlate with the age of a function, so that older
functions would tend to have more revisions associated with them. This
seems to be the case. For this little data set, the number of revisions
does have a stronger relationship to age, using a power
as shown in Figure 10.
Figure 10: Revisions as a function of age
Similar relationships held for functions in the hand-generated data
for a couple of other modules as well, with scores showing a weak direct
0.3 and age showing an inverse relationship with
R2 values about double that.
However, neither of these correlations holds up well on a larger data set
across modules, as you can see in
Figure 11 and
Figure 12. The best fit line in
both cases has
R2 of essentially
zero, virtually indistinguishable from the mean value line. The data
behind these numbers is much messier than the hand-generated set, as
the automated detection of function boundaries is much more error-prone
and tends to pull in large comment blocks and non-function declarations.
Nevertheless, it is implausible to suggest that such factors
overwhelm what would otherwise be an interesting correlation. A manual
check of the statistics for a particular module from the automated
data set showed only small deviations from the statistics from the
Figure 11: Revisions as a function of score
Figure 12: Revisions as a function of age
Some of the same introspective functions for computing the distribution of sequence types in modules were used to compute the distribution of function calls in modules. The key technique involved analyzing the structured dump of the module parse tree for function calls, using maps to store running totals, and using filesystem access functions to recursively find and process all modules in the code base.
Computing the revisions within a particular function involved
using the source control system's ability to produce the source code
annotated with revision numbers. The resulting output was
chopped up into individual functions manually, with a little help from
Emacs keyboard macros. A shell script counted up the number of distinct
revision numbers as well as the minimum revision number for the
function using standard Unix tools such as
The larger data set began with a file containing the annotated
source files concatenated together. The XQuery function
tokenize was used to chop up the file first into subfiles
(by looking for the
xquery version declaration) and then into
functions (by looking for
declare function, and finally into
individual lines of code, from which the revision number was obtained.
Maps were used to collect minimum, maximum, and
counts of revision number.
XQuery introspection can be used to provide a fresh slant on a code base, even a fairly large one. An analysis of unused schema components (elements and types) proved largely uninteresting because stylistic considerations for schema writing dominated actual usage requirements in the APIs. The fact that certain extension types (in particular maps) were so common suggests that there is a real need for providing such types as part of standard XQuery. TF-IDF scores of sequence types did not produce any interesting information.
The analysis of function usage points in a different direction. A look at unused functions can highlight some interesting problems, but the most frequently used functions are mainly basic standard XQuery functions. Looking at a combination of frequency of use and rarity across modules using TF-IDF scoring can pinpoint some interesting functions that warrant more QA attention. This is a qualitative judgement. Attempts to quantify the correlation of score to quality failed, using the number of revisions applied to the function as a proxy for quality. Relationships within one module at a time hold up better, but are still weak.
One common modification to TF-IDF scoring is to normalize the TF values by the size of the document. Since the high scoring functions included many that were only found in a very large module, perhaps applying a similar normalization here would lead to better results.
To perform introspection of XQuery from within XQuery, a number of extension functions are useful:
Parse tree accessors
File system accessors
Provide a convenient means of keeping running totals keyed to function names, for example
[Holstege] Holstege, Mary. Type Introspection in XQuery. Presented at Balisage: The Markup Conference 2012, Montréal, Canada, August 7 - 10, 2012. In Proceedings of Balisage: The Markup Conference 2012. Balisage Series on Markup Technologies, vol. 8 (2012). http://www.balisage.net/Proceedings/vol8/html/Holstege01/BalisageVol8-Holstege01.html. doi:10.4242/BalisageVol8.Holstege01.
[Jones72] Jones, Karen. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation 28 (1): 11–21. http://www.soi.city.ac.uk/~ser/idfpapers/ksj_orig.pdf. doi:10.1108/eb026526. 1972.
[XSD11.2] W3C: David Peterson, Shudi (Sandy) Gao 高殊镝, Ashok Malhotra, C.M. Sperberg-McQueen, and Henry S. Thompson, editors. W3C XML Schema Definition Language (XSD) 1.1 Part 2: Datatypes. W3C. April 2012. http://www.w3.org/TR/xmlschema11-2/
[Zipf49] Zipf, George Kingsley. Human Behaviour and the Principles of Least Effort. Addison-Wesley, 1949.