Holstege, Mary. “Schema Component Paths for Schema Analysis.” Presented at Balisage: The Markup Conference 2010, Montréal, Canada, August 3 - 6, 2010. In Proceedings of Balisage: The Markup Conference 2010. Balisage Series on Markup Technologies, vol. 5 (2010). https://doi.org/10.4242/BalisageVol5.Holstege01.
Balisage: The Markup Conference 2010 August 3 - 6, 2010
Balisage Paper: Schema Component Paths for Schema Analysis
Mary Holstege is Principal Engineer at Mark Logic
Corporation. She has worked as a software engineer in and around markup
technologies for over 20 years. She is a member of the W3C XML Schema and XML
Query working groups, and an editor of the W3C XML Schema Component Designators
and the XML Query Full Text specifications. Mary Holstege holds a Ph.D. from
Stanford University in Computer Science, for a thesis on document
Schema component paths define an XPath-like syntax for describing
and navigating W3C XML Schema component models. Canonical schema component
paths provide a unique, string-comparable designator for each component in
schema. MHSCD is a driver than can generate canonical schema component paths or
non-canonical schema component paths to a certain depth, or locate a component
or set of components in a schema given a schema component path.
Component paths can be applied to various schema analysis
tasks. The set of canonical schema component paths provides a simple signature
for a schema that is robust to differences in the physical organization of the
schema document. Comparing two such signatures gives a quick "what's changed
between these two schema versions?" summary. This signature can also be used
for the calculation of basic schema complexity metrics, including basic counts
of components of various types.
XML Schemas have become artifacts that play
a role in many software projects. Software is generated or driven from them.
While there is a long history of work on software metrics and analysis, work
is only beginning on understanding the XML Schemas as software
artifacts in their own right.
This paper introduces schema component paths, a specification under
development by the W3C, and shows how they can be used to tame some of
the complexity of the XML Schema model itself, and provide the
basis of some XML Schema metrics and analysis tools.
Schema Component Paths
Schema component paths, or SCPs, define an
XPath-like syntax for describing and navigating W3C XML Schema XSD10 component models. Certain schema component paths define the
minimal path to each specific component in the component model: these are the
canonical schema component paths.
The XML Schema component model is complex, with many
asymmetries and special cases. A particular assembled schema consists
of a rooted graph of components and property records typically assembled from
one or more schema documents. Property records are used to encapsulate
certain compound properties, but are not themselves considered schema
Schema components and property records have properties, some of which are
simple values, and some of which are other schema components and properties.
For the purposes of schema component paths, component-valued
properties define labelled arcs between schema components. Each labelled arc
defined a different axis of traversal from one component to another. Some axes
select more than one component. To distinguish the components that an axis
selects, SCPs use name tests and positional predicates: the name test matches
components by their name and namespace URI and positional predicates count
components in order.
Syntactically, a SCP resembles an XPath expression: the path consists
of a sequence of steps separated by a slash ('/'), where each step consists of
an axis name, a double-colon ('::') separator, a name test, and possibly a
predicate surrounded by square brackets ('[' and ']'). In the case of SCPs the
only predicate available is the numerical positional predicate: an integer.
Again, as with XPath expressions, various axis abbreviations are available.
Complete details are available in the specification SCD.
Figure Figure 1 shows some SCPs.
The first SCP selects an
attribute declaration named 'inner' for an
element declaration named 'outer' whose type is a locally defined anonymous
type. The path starts at the root of the assembled schema ('/') and then
traverses the schemaElement axis ('schemaElement::') with a name test
('p:outer'). The name test matches an element declaration whose local name is
'outer' and whose namespace URI matches the namespace bound to the prefix
'p'. The path continues through the type axis ('type::') with a name test
('0') that in this case matches a type definition with no name ('0' being the
indicator for this case).
Finally the schemaAttribute axis is traversed to select the
attribute declaration whose name matches 'p:inner'.
The second SCP selects the type of the second element
declaration named 'duplicate' in the sequence within the type definition named
'second'. The path starts at the root of the assembled schema ('/'), traverses
through the type axis and then the model axis ('model::').
Here the test ('sequence') matches
a model group's kind (sequence vs. choice vs. all) and selects only sequence
model groups. Then the schemaElement axis is traversed. The
predicate on this axis ('') selects the second element declaration with the
name 'duplicate' in the namespace bound to 'p': this can only be the case if
there are two local element declarations. Finally, the type axis is traversed
and a wildcard name test ('*') is applied, which will match the type
definition, regardless of type.
The third and fourth SCPs are abbreviated versions of the first and
second, using the tilde '~' abbreviation for the type axis, the use of
the bare name as an abbreviation for the schemaElement axis, and the use of the
at sign ('@') as an abbreviation for the schemaAttribute axis.
Table Table I summarizes the schema component axes. Not all axes apply to canonical paths, and
some axes apply only against the XML Schema 1.1 XSD11 component model.
Schema Component Path Axes
Axes appearing in canonical paths
Named attribute group definition
Named model group definition
Identity constraint definition
Referenced key in identity constraint definition
Constraining or fundamental facet
Assertion (1.1 component model only)
Type alternative (1.1 component model only)
Axes appearing in non-canonical paths
The current component
The substitution group head of an element declaration
The base type of a type definition
The primitive type of a simple type definition
The item type of a list simple type definition
A member type of a union simple type definition
A particle in a model group
An attribute use (local attribute declaration)
The complex type definition, attribute group definition, or model group definition
defining the scope of a local element or attribute declaration
The complex type definition, attribute declaration, or element declaration defining
the context of a local type definition
There is one privileged path to each component in the schema, the
canonical schema component path. Intuitively, the canonical SCP of a component
is the SCP that minimally describes that component and only that component.
For example, the SCP for a global type definition is the SCP that traverses
solely the type axis from the root; the SCP for a local type definition
is the SCP for the element declaration that governs the type definition,
extended by traversing the type axis. Canonical paths restrict traversals to
certain axes, sometimes based on complex constraints involving other components
(particularly the base type component), and eliminating abbreviations and
wildcarding wherever possible. Every canonical SCP is the extension of an
existing SCP with an allowable step, plus the canonical SCP for component that
represents the whole schema, whose canonical SCP is a slash ('/').
The set of canonical paths can be generated for an assembled schema
by traversing the component graph from the root, gathering up canonical SCPs,
and extending them through allowable transitions. Figure
Figure 2 shows a small schema and its canonical SCPs,
excluding the canonical SCPs for the built-in schema components that are
present in every assembled schema.
The set of canonical SCPs for a schema give us a quick summary of
basic facts of the schema. In this case we can see that the schema has
four top-level element declarations, five top-level
type definitions, two constraining facets, one model group, and one local
attribute declaration. The schema appears to be written in the Garden of Eden
style, because there are no anonymous type definitions.
Note, however, that the canonical SCPs (and indeed, SCPs in general)
do not currently include information about non-component properties of the
components, such as occurrence indicators or value constraints. Clearly such
properties provide important information about a schema, and their absence
is a serious limitation to using SCPs alone. The SCP specification does define
an accessor syntax, but declines to define any specific accessors or their
Comparison with Extended XPaths
Coates and Dui Coates10 present the idea of
using "extended XPaths" for XML Schema differencing. As with schema component
paths, these extended XPaths use an XPath-like syntax to traverse the
component model for an assembled schema. The paper presents how these
paths can be used to compare schemas for changes.
A key difference between the extended XPaths and schema component
paths is simply that there is no specification of the rules for generation
and interpretation of the extended XPaths, while schema component paths are
defined in a public formal specification.
Still, some differences are clear:
Extended XPaths include information about non-component
properties, such as occurrence indicators (minOccurs and maxOccurs), value
constraints (default values), and facet values. A predicate style of
representation is used.
Extended XPaths focus on paths for elements and
attributes, with annotations for certain kinds of type
Schema component paths includes a definition of canonical
paths; these paths distinguish shared components from locally defined
Schema component paths cover all component types, including named
model groups and attribute groups.
There are strengths and weaknesses to both approaches.
non-component properties in the path means that metrics or differences that
depend on those properties can be calculated using the paths alone. For
example, a canonical path-based schema difference will report no change
in the schema if the default value for an attribute changes, or if one
schema requires 1 or more occurrences of an element instead of 0 or more.
Schema component paths are therefore insufficient to detect such
differences. The predicate style of representation makes these properties
manifest in the paths, which makes the information more immediately
accessible than relying on something else to use accessors to fetch the
values and compute information based on those values.
A central aim of complexity metrics is to measure reuse.
Distinguishing between paths that involve shared components, such as
those inherited from base types or named groups, is therefore essential to
compute such metrics. Extended XPaths cannot be used to compute such metrics
because by design they elide such differences.
The design of extended XPaths captures differences that make a
difference to validation outcomes, but not other kinds of differences. If the
purpose of computing the difference between two schemas is to determine if some
inadvertent material change has been made to the set of documents that are
valid per the schema, this approach is preferable. One need not be
bothered to review changes that do not materially affect outcomes.
In the sections that follow we will look at how to apply SCPs to
perform various schema analysis tasks, with some comparison to extended
The set of canonical SCPs for a given schema provides a useful
schema signature. This signature can be identify schemas that are functionally
the same, robustly in the face of differences in physical organization of the
schema documents, ordering of declarations within those schema documents and
the presence of extraneous information such as comments. Furthermore, the text
format of a list of canonical SCPs is simple enough that it can be processed
with simple tools, such as Unix command line tools, to analyze the schema and
compare it with other schemas.
This schema signature procedure performs a stable sort on the
second field only (which is to say, the first step after the root), so top level
schema components will appear with their names in order by component type,
while model groups will not be reshuffled. In the case of sequence model
groups, it is important to preserve the order of the particles because a change
in the ordering constitutes a significant difference. However, this means that
ordering changes in choice or all groups will produce different signatures even
though these changes do not materially affect the schema.
Similarly the reordering of attributes would also produce a different signature.
Alternatively, a global sort could be used, with the opposite weakness of
giving equivalent signatures to two schemas that differ in the order of
particles in a sequence model group.
Determining what has changed between two versions by looking at the
schema documents themselves can be a daunting task. Simple file differencing
can include lots of irrelevant detail, or can be stymied by a reorganization of
the partitioning of the schema across multiple schema documents. A comparison
of the two schema signatures is much easier to grasp and doesn't suffer from
A schema difference based on a canonical schema component path signature
will be sensitive to additions and deletions of elements and attributes, the
introduction of new named types or groups, or a switch in compositor type.
A change in base type will be seen as a second order effect: by what impact it
has on the derived type. Such a schema difference will be insensitive to
changes in occurrence or value constraints, or in facet values.
A schema difference based on extended XPaths
will also be sensitive to additions and deletions of elements and attributes
and switches in compositor types. It will also pick up differences in
occurrence and value constraints and in facet values. A change in base type
will be directly visible, but the introduction of new types will be visible as
a second order effect and only if the new type is actually used within the
schema. The introduction or removal of named model group and attribute groups,
or the switching of an element from being local to being global will be
From the point of view of knowing what the changes are that
materially affect the set of valid documents, the extended XPath approach is
clearly preferable. The lack of non-component properties on schema component
paths is a serious weakness in this respect. Facet values and occurrence
constraints have an obvious effect on validation and changes to them count as
important changes. Augmenting the schema component path model to make such
values manifest as predicates, as extended XPaths do, would be a good step
forward. On the other hand, from the point of view of knowing about
substantive changes to the usability of the schema by other schemas or for
non-validation purposes, the schema component path approach of enumerating
canonical paths for all components makes sense. When an XML Schema is imported
into an XQuery module, for example, all the types are present and available,
even ones not used in any content model. Augmenting extended XPaths to capture
information about all components would be a positive step forward for that
Schema signatures can be used to calculate schema complexity
metrics. Compared to the large body of work on software metrics, little has
been done on schema metrics. Neither is there clear consensus of what the
useful metrics should be.
A paper by Lammel, Kitsis, and Remy Lammel05 examines
a number of counts and metrics
and computes them against a corpus of actual schemas, as an attempt to
characterize the usage patterns found in practice. The paper begins with basic
counts against the XML document, and then to XML Schema aware
counts of the number of global element and attribute declarations, global
complex and simple type definitions, and named model group and attribute group
definitions. The paper argues against the simple sum of global element
declarations and global complex type definitions as a metric of schema size on
the grounds that this measure is sensitive to schema construction styles: a
Russian Doll schema would always rank as small (one global element declaration)
no matter how deeply nested its inner element declarations became. The paper
moves on to counts of local element declarations and type definitions, and
proposes a simple size metric that is purely the count of all complex type
The authors then attempt to apply define something akin to
McCabe McCabe76 complexity measures for XML Schemas. The
metric combines the number of branches in choice model groups, the number of
non-default occurrence constraints (minOccurs or maxOccurs something other than
1), the number of references to a substitution group head, the number of
references to a global type definition, the number of nillable attributes, and
the number of global element declarations.
Additional metrics are defined for code-oriented and
instance-oriented breadth and depth. The depth metrics incorporate such
features as the number of particles in content models or the number of
"parties": the difference is that the code-oriented depth metric counts a
reference to a named model group as 1, but the instance-oriented depth metric
counts all the particles obtained by the reference. The code-oriented depth
metric counts the amount of nesting of element declarations in the schema.
A paper by McDowell, Schmidt, and Yue McDowell04 proposes
various schema complexity and quality
metrics: counts of complex type declarations (broken down by the type of the
content model), simple type declarations, annotations, derived complex types,
global type declarations, the average number of attributes per type
declaration, the number of references to global types, the number of unbounded
elements, the average range in bounds for bounded elements ("multiplicity"),
the average number of restrictions per simple type, and the fan-in and fan-out
of element declarations.
Overall complexity and quality indexes apply weighting factors to various
measures to give an overall score. The quality index combines the ratio of
simple to complex type declarations, the percentage of annotations over total
number of element declarations, the average restrictions per simple type
declaration, percentage of derived complex type declarations of the total
number of complex type declarations, the average bounded multiplicity size,
and the average number of attributes per type declaration. The complexity
index combines the number of unbounded elements, the element fanning, the
number of complex type declarations, the number of simple type declarations,
and the average number of attributes per complex type declaration.
Element declarations: #global, #local, #references
Complex type definitions: #global, #local, total, #simple,
#mixed, #element-only, #derived
Simple type definitions: total, restrictions/total
Attributes: average per complex type
Elements: average bounded element multiplicity, fanning
There is some overlap in these metrics, such as basic counts in the
number of different kinds of components, but in the main these are two very
different takes on what kind of information might be interesting or useful to
Many of these metrics can be readily calculated from the schema
signature. For example the number of element declarations can be determined by
counting the number of canonical SCPs containing 'schemaElement::' as the last
step, the number of global element declarations is the number of canonical SCPs
beginning with '/schemaElement::' but not containing two slashes, and the
number of local element declarations is the number of canonical SCPs containing
'schemaElement::' somewhere other than at the start.
Many of the metrics listed above to not lend themselves well to a
simple schema-signature-based approach.
Certain kinds of metrics do not lend then well to a SCP-based
approach at all: certainly those that rely on the XML representation of
the schema rather than the schema itself such as the number of XML nodes, for
SCPs do not distinguish directly between simple and complex type
definitions because they form a single symbol space in XML Schema: one can have
both an element declaration and a type definition named 'example', but not both
a simple and complex type definition with that name. If there is a canonical
SCP where a facet axis follows a type definition, we know that the type is a
simple type; if there is a canonical SCP where a model axis or attribute axis
follows a type definition, we know that the type is a complex type. Otherwise,
we can't tell from the SCPs alone.
Another class of metrics that are not readily
computable from canonical SCPs are certain kinds of inbound counts: the number
of uses of global element declarations, the number of uses of a substitution
group head, and so forth. Similarly, statistics that distinguish particles
that derive from references to named model groups from the rest cannot be
computed with canonical SCPs alone, as the schema component model records that
information through through the scope property. In any case, if the content
model has a reference to a global element declaration, this will not create a
canonical SCP for that particle: the canonical SCP for the element declaration
is the top-level one.
In addition, since non-component properties of schema components
are not reflected in the SCPs, any statistic that depends on the value of such
a property cannot be computed with SCPs alone. The the unbounded element
multiplicity from McDowell04 and the cyclomatic complexity from
Lammel05, which look at the minOccurs and maxOccurs, fall
into this class.
Coates and Dui Coates10 did not look at
metrics in their paper, but surely many metrics can be calculated using their
extended XPaths as well. Counts of components of various types would be
difficult: the distinction between element and attribute declarations seems to
be manifest only in the how the type predicates are represented, the
apparent loss of information about named groups suggests that not only will
named groups not be counted at all, but some over-counting of element and
attribute declarations is likely. Similarly, types cannot be accurately
counted. In general the extended XPath approach is not conducive to measuring
component reuse, which is an important aspect of schemas to measure.
On the other hand, extended XPaths provide information that can be
used to compute metrics that depend on occurrence constraints.
At this point, the reader will be excused for thinking that things
are looking grim for the use of SCPs for obtaining serious schema metrics. All
is not lost, however. First, many of these statistics can be computed by using
non-canonical SCPs to select a particular set of components, and counting
against that set. Where the metrics need the values of properties of
particular components, the ability to select components using a non-canonical
SCP needs to be augmented with an ability to inspect or query the properties.
For example, the particle axis can be used to count number of particles in
content models, the scope axis can be used to distinguish particles derived
from named model groups from local ones, and the substitutionGroup axis can be
used to count references to substitution group heads.
Second, since it is far from clear which statistics to use to examine schema
size, complexity, or quality, a more fruitful approach may be to see what
statistics we can compute from SCPs and see what they show
us. Some metrics can be replaced by similar metrics that are more amenable to
calculation via SCPs. For example, looking at the ratio of SCPs containing a
type definition as an intermediate step against the number of SCPs representing
a type definition (that is, whose final step is a type axis) gets at similar
schema characteristics as element fanning.
The simplest measure obtainable from the schema signature is a
simple count of how many canonical paths there are for a particular schema.
A schema with a high level of reuse of global declarations and definitions will
result in fewer canonical paths. Suppose there are two schemas, one of which
defines a global element declaration and uses it in two places, and one of
which defines a local element declaration in each place. The first schema will
have one canonical SCP for the global element declaration, while the second
will have a canonical SCP for each local element declaration. Each reuse of
the global declaration leads to one less canonical SCP in the schema.
If two schemas have a similar number of declarations, the one with fewer paths
is the simpler.
An interesting extension of the path count can be obtained by
generating SCPs along the canonical axes to a particular depth, but not
worrying about the other whether the SCP is canonical or not. For example, if
a content model references a global element declaration, a SCP that extends the
content model's SCP through the schemaElement axis would not be a canonical
one (the canonical SCP is the one directly from the root to the global element
declaration), but it would be a level 1 extension to a canonical SCP.
A level 2 extension to a canonical SCP is the addition of one more step through
a canonical axis to a level 1 extension to a canonical SCP, and so on.
In some simple data-oriented schemas, the set of canonical SCPs is no different
from the set of level 1 extensions. At the other extreme, schemas where
content models for different elements recursively refer to each other can have
a set of level 1 extensions substantially larger than the set of canonical
SCPs. The growth in the number of paths as the level increases is a measure of
the inter-relatedness of the components in the schema, and high growth can be
the sign of a schema that has many dependencies and is therefore more complex.
Calculation of the total and average number of steps in the SCPs can
also be computed readily. A higher average path length indicates less
component reuse and more local definitions, more complex content
models, or more additional constraints on simple type. In short: a more complex
schema. Again, the growth of this measure for level N extensions gives some
indication of the inter-relatedness of the components in the schema.
Statistics for a selection of schemas
Level 10 paths
Level 10 path length
Table IV shows some metrics for an
assortment of schemas. XMLSpec (the vocabulary
used to write W3C specifications) and XHTML are relatively
simple document-oriented vocabularies and measured as the sum of element
declarations, attribute declarations, and type declarations they are of
roughly comparable size. Simplified DocBook, a somewhat more extensive
document-oriented vocabulary, is about twice as large by this measure.
In terms of paths, however, DocBook is smaller than XHTML and about only about
one and a half times the size of XMLSpec.
GML and FpML are both highly structured data-oriented schemas.
Which is more complex? FpML has more elements, but GML has a larger E+T+A
count. FpML has more paths and a larger average path length. E+T+A speaks
to what one needs to know about a schema to fully make use of it in a processing
environment such as XSLT or XQuery, while the path count speaks to the
burden on a schema maintainer.
While the average path length for canonical SCPs does not vary
greatly, the growth in average path length for level 10 paths is quite
The differences in level 10 extensions is astonishing, however: simplified
DocBook produces two orders of magnitudes more level 10 extensions than
XMLSpec and one order of magnitude more than XHTML,
despite the fact that XMLSpec is larger in terms of the element/type/attribute
count and not drastically smaller in terms of the canonical
SCP count and XHTML is larger in terms of the canonical SCP count.
It appears that the main reason for this is that XMLSpec and XHTML make use of
substitution groups and named model and attribute groups, while simplified
DocBook uses large choice groups instead.
Reusable components in a selection of schemas
Substitution Group Heads
As we can see in Table V,
there is a wide range in the utilization of reusable
schema components. Global elements and types are broadly used, but global
attributes are rare. Where attribute reuse is desired, it is accomplished (in
these schemas at least), through named attribute groups. Substitution groups
and named model groups seem to restrict both level 10 path counts and lengths.
Schema component paths provide a characterization of the structure of
schemas that is insensitive to details of the XML representation and
partitioning into multiple files. They can be used as the basis to analyze and
compare schemas, and to compute metrics of schema size and complexity. This
paper attempts to sketch some of the possibilities in these areas. Fuller
metrics and analysis could by obtained by following the lead of extended XPaths
and including non-component accessors on the paths as well.
These metrics calculated in this paper are suggestive and seem to
capture interesting differences in schema designs, but a more systematic study
Appendix A. Tools
MHSCD MHSCD is a set of Java tools for
manipulating schema component paths. Both a SCP generator and a locator
API (which provides information about component properties) is included.
It was used to generate the examples of schema component paths in this paper
and is available under a Creative Commons Attribution license.
The schema component path specification SCD is
currently under development by the W3C (as of this writing at the Candidate
Recommendation phase). Readers are invited to review and
comment on that specification.
Anthony B. Coates and Daniel Dui.
"Full Impact" Schema Differencing.
Conference proceedings XML Prague 2010.
Ralf Lammel, Stan Kitsis, and Dave Remy.
Analysis of XML schema usage.
Conference Proceedings XML 2005.
A Measure of Complexity.
IEEE Transactions on Software Engineering, 2(4), pp. 308-320,
Andrew McDowell, Chris Schmidt, and Kwon-Bun Yue.
Analysis and Metrics of XML Schema.
Proceedings of the 2004 International Conference on Software Engineering Research
and Practice. Volume 2.