Balisage logo

Proceedings

Schema Component Paths for Schema Analysis

Mary Holstege

Principal Engineer

Mark Logic Corporation

Balisage: The Markup Conference 2010
August 3 - 6, 2010

Copyright © 2010 Mary Holstege

How to cite this paper

Holstege, Mary. “Schema Component Paths for Schema Analysis.” Presented at Balisage: The Markup Conference 2010, Montréal, Canada, August 3 - 6, 2010. In Proceedings of Balisage: The Markup Conference 2010. Balisage Series on Markup Technologies, vol. 5 (2010). DOI: 10.4242/BalisageVol5.Holstege01.

Abstract

Schema component paths define an XPath-like syntax for describing and navigating W3C XML Schema component models. Canonical schema component paths provide a unique, string-comparable designator for each component in schema. MHSCD is a driver than can generate canonical schema component paths or non-canonical schema component paths to a certain depth, or locate a component or set of components in a schema given a schema component path.

Component paths can be applied to various schema analysis tasks. The set of canonical schema component paths provides a simple signature for a schema that is robust to differences in the physical organization of the schema document. Comparing two such signatures gives a quick "what's changed between these two schema versions?" summary. This signature can also be used for the calculation of basic schema complexity metrics, including basic counts of components of various types.

Table of Contents

Introduction
Schema Component Paths
Comparison with Extended XPaths
Analyzing Schemas
Schema Signatures
Schema Differences
Schema Metrics
Conclusion
Appendix A. Tools

Introduction

XML Schemas have become artifacts that play a role in many software projects. Software is generated or driven from them. While there is a long history of work on software metrics and analysis, work is only beginning on understanding the XML Schemas as software artifacts in their own right.

This paper introduces schema component paths, a specification under development by the W3C, and shows how they can be used to tame some of the complexity of the XML Schema model itself, and provide the basis of some XML Schema metrics and analysis tools.

Schema Component Paths

Schema component paths, or SCPs, define an XPath-like syntax for describing and navigating W3C XML Schema XSD10 component models. Certain schema component paths define the minimal path to each specific component in the component model: these are the canonical schema component paths.

The XML Schema component model is complex, with many asymmetries and special cases. A particular assembled schema consists of a rooted graph of components and property records typically assembled from one or more schema documents. Property records are used to encapsulate certain compound properties, but are not themselves considered schema components. Schema components and property records have properties, some of which are simple values, and some of which are other schema components and properties. For the purposes of schema component paths, component-valued properties define labelled arcs between schema components. Each labelled arc defined a different axis of traversal from one component to another. Some axes select more than one component. To distinguish the components that an axis selects, SCPs use name tests and positional predicates: the name test matches components by their name and namespace URI and positional predicates count components in order.

Syntactically, a SCP resembles an XPath expression: the path consists of a sequence of steps separated by a slash ('/'), where each step consists of an axis name, a double-colon ('::') separator, a name test, and possibly a predicate surrounded by square brackets ('[' and ']'). In the case of SCPs the only predicate available is the numerical positional predicate: an integer. Again, as with XPath expressions, various axis abbreviations are available. Complete details are available in the specification SCD.

Figure 1

  /schemaElement::p:outer/type::0/schemaAttribute::p:inner
  /type::p:second/model::sequence/schemaElement::p:duplicate[2]/type::*
  /p:outer/~0/@p:inner
  /~p:second/model::sequence/p:duplicate[2]/~*
  

Some SCPs

Figure Figure 1 shows some SCPs. The first SCP selects an attribute declaration named 'inner' for an element declaration named 'outer' whose type is a locally defined anonymous type. The path starts at the root of the assembled schema ('/') and then traverses the schemaElement axis ('schemaElement::') with a name test ('p:outer'). The name test matches an element declaration whose local name is 'outer' and whose namespace URI matches the namespace bound to the prefix 'p'. The path continues through the type axis ('type::') with a name test ('0') that in this case matches a type definition with no name ('0' being the indicator for this case). Finally the schemaAttribute axis is traversed to select the attribute declaration whose name matches 'p:inner'.

The second SCP selects the type of the second element declaration named 'duplicate' in the sequence within the type definition named 'second'. The path starts at the root of the assembled schema ('/'), traverses through the type axis and then the model axis ('model::'). Here the test ('sequence') matches a model group's kind (sequence vs. choice vs. all) and selects only sequence model groups. Then the schemaElement axis is traversed. The predicate on this axis ('[2]') selects the second element declaration with the name 'duplicate' in the namespace bound to 'p': this can only be the case if there are two local element declarations. Finally, the type axis is traversed and a wildcard name test ('*') is applied, which will match the type definition, regardless of type.

The third and fourth SCPs are abbreviated versions of the first and second, using the tilde '~' abbreviation for the type axis, the use of the bare name as an abbreviation for the schemaElement axis, and the use of the at sign ('@') as an abbreviation for the schemaAttribute axis.

Table Table I summarizes the schema component axes. Not all axes apply to canonical paths, and some axes apply only against the XML Schema 1.1 XSD11 component model.

Table I

Schema Component Path Axes

AxisMeaning
Axes appearing in canonical paths
schemaAttributeAttribute declaration
schemaElementElement declaration
typeType definition
attributeGroupNamed attribute group definition
groupNamed model group definition
identityConstraintIdentity constraint definition
keyReferenced key in identity constraint definition
notationNotation declaration
modelModel group
anyAttributeAttribute wildcard
anyWildcard
facetConstraining or fundamental facet
annotationAnnotation
assertionAssertion (1.1 component model only)
alternativeType alternative (1.1 component model only)
Axes appearing in non-canonical paths
componentAny component
currentComponentThe current component
substitutionGroupThe substitution group head of an element declaration
baseTypeThe base type of a type definition
primitiveTypeThe primitive type of a simple type definition
itemTypeThe item type of a list simple type definition
memberTypeA member type of a union simple type definition
particleA particle in a model group
attributeUseAn attribute use (local attribute declaration)
scopeThe complex type definition, attribute group definition, or model group definition defining the scope of a local element or attribute declaration
contextThe complex type definition, attribute declaration, or element declaration defining the context of a local type definition

There is one privileged path to each component in the schema, the canonical schema component path. Intuitively, the canonical SCP of a component is the SCP that minimally describes that component and only that component. For example, the SCP for a global type definition is the SCP that traverses solely the type axis from the root; the SCP for a local type definition is the SCP for the element declaration that governs the type definition, extended by traversing the type axis. Canonical paths restrict traversals to certain axes, sometimes based on complex constraints involving other components (particularly the base type component), and eliminating abbreviations and wildcarding wherever possible. Every canonical SCP is the extension of an existing SCP with an allowable step, plus the canonical SCP for component that represents the whole schema, whose canonical SCP is a slash ('/').

The set of canonical paths can be generated for an assembled schema by traversing the component graph from the root, gathering up canonical SCPs, and extending them through allowable transitions. Figure Figure 2 shows a small schema and its canonical SCPs, excluding the canonical SCPs for the built-in schema components that are present in every assembled schema.

Figure 2

  
  <xs:schema targetNamespace="http://www.w3.org/xmlschema-ref/example1"
   xmlns="http://www.w3.org/xmlschema-ref/example1"
   xmlns:xs="http://www.w3.org/2001/XMLSchema"
   elementFormDefault="qualified">
 
    <xs:complexType name="registered-query">
      <xs:complexContent>
        <xs:extension base="query">
          <xs:sequence>
            <xs:element ref="id" minOccurs="0" maxOccurs="unbounded"/>
            <xs:element ref="option" minOccurs="0" maxOccurs="unbounded"/>
          </xs:sequence>
          <xs:attribute name="weight" type="weight" use="optional"/>
        </xs:extension>
      </xs:complexContent>
    </xs:complexType>
 
    <xs:element name="registered-query" type="registered-query"
                substitutionGroup="query"/>
 
    <xs:simpleType name="id">
      <xs:restriction base="xs:unsignedLong"/>
    </xs:simpleType>
 
    <xs:element name="id" type="id"/>
 
    <xs:element name="option" type="option"/>
 
    <xs:simpleType name="option">
      <xs:restriction base="xs:string">
        <xs:enumeration value="stemmed"/>
        <xs:enumeration value="unstemmed"/>
        <xs:enumeration value="wildcarded"/>
        <xs:enumeration value="unwildcarded"/>
      </xs:restriction>
    </xs:simpleType>
 
    <xs:complexType name="query">
      <xs:annotation>
        <xs:documentation>Any query.</xs:documentation>
        <xs:appinfo/>
      </xs:annotation>
      <xs:complexContent>
        <xs:restriction base="xs:anyType">
          <xs:anyAttribute processContents="lax"/> 
        </xs:restriction>
      </xs:complexContent>
    </xs:complexType>
 
    <xs:element name="query" type="query" abstract="true"/>
 
    <xs:simpleType name="weight">
      <xs:restriction base="xs:double">
        <xs:minInclusive value="0"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:schema>
  
      
  /
  /schemaElement::p:registered-query
  /schemaElement::p:option
  /schemaElement::p:id
  /schemaElement::p:query
  /type::p:registered-query
  /type::p:registered-query/model::sequence
  /type::p:registered-query/schemaAttribute::weight
  /type::p:option
  /type::p:option/facet::enumeration
  /type::p:id
  /type::p:weight
  /type::p:weight/facet::minInclusive
  /type::p:query
  /type::p:query/anyAttribute::*
      

Schema and its canonical paths

The set of canonical SCPs for a schema give us a quick summary of basic facts of the schema. In this case we can see that the schema has four top-level element declarations, five top-level type definitions, two constraining facets, one model group, and one local attribute declaration. The schema appears to be written in the Garden of Eden style, because there are no anonymous type definitions.

Note, however, that the canonical SCPs (and indeed, SCPs in general) do not currently include information about non-component properties of the components, such as occurrence indicators or value constraints. Clearly such properties provide important information about a schema, and their absence is a serious limitation to using SCPs alone. The SCP specification does define an accessor syntax, but declines to define any specific accessors or their semantics.

Comparison with Extended XPaths

Coates and Dui Coates10 present the idea of using "extended XPaths" for XML Schema differencing. As with schema component paths, these extended XPaths use an XPath-like syntax to traverse the component model for an assembled schema. The paper presents how these paths can be used to compare schemas for changes.

A key difference between the extended XPaths and schema component paths is simply that there is no specification of the rules for generation and interpretation of the extended XPaths, while schema component paths are defined in a public formal specification.

Still, some differences are clear:

  1. Extended XPaths include information about non-component properties, such as occurrence indicators (minOccurs and maxOccurs), value constraints (default values), and facet values. A predicate style of representation is used.

  2. Extended XPaths focus on paths for elements and attributes, with annotations for certain kinds of type information.

  3. Schema component paths includes a definition of canonical paths; these paths distinguish shared components from locally defined ones.

  4. Schema component paths cover all component types, including named model groups and attribute groups.

There are strengths and weaknesses to both approaches.

  • Including non-component properties in the path means that metrics or differences that depend on those properties can be calculated using the paths alone. For example, a canonical path-based schema difference will report no change in the schema if the default value for an attribute changes, or if one schema requires 1 or more occurrences of an element instead of 0 or more. Schema component paths are therefore insufficient to detect such differences. The predicate style of representation makes these properties manifest in the paths, which makes the information more immediately accessible than relying on something else to use accessors to fetch the values and compute information based on those values.

  • A central aim of complexity metrics is to measure reuse. Distinguishing between paths that involve shared components, such as those inherited from base types or named groups, is therefore essential to compute such metrics. Extended XPaths cannot be used to compute such metrics because by design they elide such differences.

  • The design of extended XPaths captures differences that make a difference to validation outcomes, but not other kinds of differences. If the purpose of computing the difference between two schemas is to determine if some inadvertent material change has been made to the set of documents that are valid per the schema, this approach is preferable. One need not be bothered to review changes that do not materially affect outcomes.

In the sections that follow we will look at how to apply SCPs to perform various schema analysis tasks, with some comparison to extended XPaths.

Analyzing Schemas

Schema Signatures

The set of canonical SCPs for a given schema provides a useful schema signature. This signature can be identify schemas that are functionally the same, robustly in the face of differences in physical organization of the schema documents, ordering of declarations within those schema documents and the presence of extraneous information such as comments. Furthermore, the text format of a list of canonical SCPs is simple enough that it can be processed with simple tools, such as Unix command line tools, to analyze the schema and compare it with other schemas.

Figure 3

 
      canonicals example.xsd | sort -f/ -s -k2,2 
      

Computing a schema signature

This schema signature procedure performs a stable sort on the second field only (which is to say, the first step after the root), so top level schema components will appear with their names in order by component type, while model groups will not be reshuffled. In the case of sequence model groups, it is important to preserve the order of the particles because a change in the ordering constitutes a significant difference. However, this means that ordering changes in choice or all groups will produce different signatures even though these changes do not materially affect the schema. Similarly the reordering of attributes would also produce a different signature. Alternatively, a global sort could be used, with the opposite weakness of giving equivalent signatures to two schemas that differ in the order of particles in a sequence model group.

Schema Differences

Determining what has changed between two versions by looking at the schema documents themselves can be a daunting task. Simple file differencing can include lots of irrelevant detail, or can be stymied by a reorganization of the partitioning of the schema across multiple schema documents. A comparison of the two schema signatures is much easier to grasp and doesn't suffer from these problems.

Figure 4

 
      canonicals example_v1.xsd | sort -f/ -s -k2,2 > 1.out
      canonicals example_v2.xsd | sort -f/ -s -k2,2 > 2.out
      echo "*********** New in $xsd2"
      diff -w 1.out 2.out | grep '>' | sed 's/^> //'
      echo "*********** Removed from $xsd2"
      diff -w 1.out 2.out | grep '<' | sed 's/^< //'
      

Comparing schema versions

Figure 5

 
*********** New in example_v2.xsd
/schemaElement::p:cluster
/schemaElement::p:clustering
/schemaElement::p:clustering/type::0
/schemaElement::p:clustering/type::0/model::choice
/schemaElement::p:complete
/schemaElement::p:max-terms
/schemaElement::p:min-weight
/schemaElement::p:options
/schemaElement::p:options/type::0
/schemaElement::p:options/type::0/model::choice
/schemaElement::p:score
/schemaElement::p:term/type::0/model::sequence
/schemaElement::p:term/type::0/schemaAttribute::fitness
/schemaElement::p:term/type::0/schemaAttribute::confidence
/schemaElement::p:use-db-config
/type::p:cluster
/type::p:cluster/schemaAttribute::id
/type::p:cluster/schemaAttribute::parent-id
/type::p:cluster/schemaAttribute::label
/type::p:cluster/schemaAttribute::count
/type::p:cluster/schemaAttribute::nodes
/type::p:nodes
/type::p:nodes/facet::finite
/type::p:score-kind
/type::p:score-kind/facet::enumeration
*********** Removed from example_v2.xsd
      

A sample schema difference report

A schema difference based on a canonical schema component path signature will be sensitive to additions and deletions of elements and attributes, the introduction of new named types or groups, or a switch in compositor type. A change in base type will be seen as a second order effect: by what impact it has on the derived type. Such a schema difference will be insensitive to changes in occurrence or value constraints, or in facet values.

A schema difference based on extended XPaths will also be sensitive to additions and deletions of elements and attributes and switches in compositor types. It will also pick up differences in occurrence and value constraints and in facet values. A change in base type will be directly visible, but the introduction of new types will be visible as a second order effect and only if the new type is actually used within the schema. The introduction or removal of named model group and attribute groups, or the switching of an element from being local to being global will be invisible.

From the point of view of knowing what the changes are that materially affect the set of valid documents, the extended XPath approach is clearly preferable. The lack of non-component properties on schema component paths is a serious weakness in this respect. Facet values and occurrence constraints have an obvious effect on validation and changes to them count as important changes. Augmenting the schema component path model to make such values manifest as predicates, as extended XPaths do, would be a good step forward. On the other hand, from the point of view of knowing about substantive changes to the usability of the schema by other schemas or for non-validation purposes, the schema component path approach of enumerating canonical paths for all components makes sense. When an XML Schema is imported into an XQuery module, for example, all the types are present and available, even ones not used in any content model. Augmenting extended XPaths to capture information about all components would be a positive step forward for that technique.

Schema Metrics

Schema signatures can be used to calculate schema complexity metrics. Compared to the large body of work on software metrics, little has been done on schema metrics. Neither is there clear consensus of what the useful metrics should be.

A paper by Lammel, Kitsis, and Remy Lammel05 examines a number of counts and metrics and computes them against a corpus of actual schemas, as an attempt to characterize the usage patterns found in practice. The paper begins with basic counts against the XML document, and then to XML Schema aware counts of the number of global element and attribute declarations, global complex and simple type definitions, and named model group and attribute group definitions. The paper argues against the simple sum of global element declarations and global complex type definitions as a metric of schema size on the grounds that this measure is sensitive to schema construction styles: a Russian Doll schema would always rank as small (one global element declaration) no matter how deeply nested its inner element declarations became. The paper moves on to counts of local element declarations and type definitions, and proposes a simple size metric that is purely the count of all complex type definitions.

The authors then attempt to apply define something akin to McCabe McCabe76 complexity measures for XML Schemas. The metric combines the number of branches in choice model groups, the number of non-default occurrence constraints (minOccurs or maxOccurs something other than 1), the number of references to a substitution group head, the number of references to a global type definition, the number of nillable attributes, and the number of global element declarations.

Additional metrics are defined for code-oriented and instance-oriented breadth and depth. The depth metrics incorporate such features as the number of particles in content models or the number of "parties": the difference is that the code-oriented depth metric counts a reference to a named model group as 1, but the instance-oriented depth metric counts all the particles obtained by the reference. The code-oriented depth metric counts the amount of nesting of element declarations in the schema.

Table II

Summary of metrics in Lammel05

File size kB or lines of code
XML nodes: total
XML annotation nodes: total
Element declarations: #global, #local, total
Complex type definitions: #global, #local, total
Simple type definitions: #global, #local, total
Named model group definitions: #global, total
Attribute group definitions: #global, total
Attribute declarations: #global, #local, total
McCabe cyclomatic complexity for XML Schema
Code-oriented breadth and depth
Instance-oriented breadth and depth

A paper by McDowell, Schmidt, and Yue McDowell04 proposes various schema complexity and quality metrics: counts of complex type declarations (broken down by the type of the content model), simple type declarations, annotations, derived complex types, global type declarations, the average number of attributes per type declaration, the number of references to global types, the number of unbounded elements, the average range in bounds for bounded elements ("multiplicity"), the average number of restrictions per simple type, and the fan-in and fan-out of element declarations.

Overall complexity and quality indexes apply weighting factors to various measures to give an overall score. The quality index combines the ratio of simple to complex type declarations, the percentage of annotations over total number of element declarations, the average restrictions per simple type declaration, percentage of derived complex type declarations of the total number of complex type declarations, the average bounded multiplicity size, and the average number of attributes per type declaration. The complexity index combines the number of unbounded elements, the element fanning, the number of complex type declarations, the number of simple type declarations, and the average number of attributes per complex type declaration.

Table III

Summary of metrics in McDowell04

Annotation nodes: total
Element declarations: #global, #local, #references
Complex type definitions: #global, #local, total, #simple, #mixed, #element-only, #derived
Simple type definitions: total, restrictions/total
Attributes: average per complex type
Elements: average bounded element multiplicity, fanning
Quality index
Complexity index

There is some overlap in these metrics, such as basic counts in the number of different kinds of components, but in the main these are two very different takes on what kind of information might be interesting or useful to measure.

Many of these metrics can be readily calculated from the schema signature. For example the number of element declarations can be determined by counting the number of canonical SCPs containing 'schemaElement::' as the last step, the number of global element declarations is the number of canonical SCPs beginning with '/schemaElement::' but not containing two slashes, and the number of local element declarations is the number of canonical SCPs containing 'schemaElement::' somewhere other than at the start.

Figure 6

# Total number of global element declarations
canonicals example.xsd | grep '^/schemaElement::[^/]*$' | wc -l
# Total number of local element declarations
canonicals example.xsd | grep '[^/].*/schemaElement::[^/]*$' | wc -l
# Total number of element declarations
canonicals example.xsd | grep 'schemaElement::[^/]*$' | wc -l
# Type definitions
canonicals example.xsd | grep 'type::[^/]*$' | wc -l
# Attribute declarations
canonicals example.xsd | grep 'schemaAttribute::[^/]*$' | wc -l
# Named model group definitions
canonicals example.xsd | grep 'group::[^/]*$' | wc -l
# Attribute group definitions
canonicals example.xsd | grep 'attributeGroup::[^/]*$' | wc -l
# Notation declarations
canonicals example.xsd | grep 'notation::[^/]*$' | wc -l
# Identity constraint definitions
canonicals example.xsd | grep 'identityConstraint::[^/]*$' | wc -l
# Total number of components
canonicals example.xsd | wc -l 
      

Computing simple count metrics

Many of the metrics listed above to not lend themselves well to a simple schema-signature-based approach. Certain kinds of metrics do not lend then well to a SCP-based approach at all: certainly those that rely on the XML representation of the schema rather than the schema itself such as the number of XML nodes, for example.

SCPs do not distinguish directly between simple and complex type definitions because they form a single symbol space in XML Schema: one can have both an element declaration and a type definition named 'example', but not both a simple and complex type definition with that name. If there is a canonical SCP where a facet axis follows a type definition, we know that the type is a simple type; if there is a canonical SCP where a model axis or attribute axis follows a type definition, we know that the type is a complex type. Otherwise, we can't tell from the SCPs alone.

Another class of metrics that are not readily computable from canonical SCPs are certain kinds of inbound counts: the number of uses of global element declarations, the number of uses of a substitution group head, and so forth. Similarly, statistics that distinguish particles that derive from references to named model groups from the rest cannot be computed with canonical SCPs alone, as the schema component model records that information through through the scope property. In any case, if the content model has a reference to a global element declaration, this will not create a canonical SCP for that particle: the canonical SCP for the element declaration is the top-level one.

In addition, since non-component properties of schema components are not reflected in the SCPs, any statistic that depends on the value of such a property cannot be computed with SCPs alone. The the unbounded element multiplicity from McDowell04 and the cyclomatic complexity from Lammel05, which look at the minOccurs and maxOccurs, fall into this class.

Coates and Dui Coates10 did not look at metrics in their paper, but surely many metrics can be calculated using their extended XPaths as well. Counts of components of various types would be difficult: the distinction between element and attribute declarations seems to be manifest only in the how the type predicates are represented, the apparent loss of information about named groups suggests that not only will named groups not be counted at all, but some over-counting of element and attribute declarations is likely. Similarly, types cannot be accurately counted. In general the extended XPath approach is not conducive to measuring component reuse, which is an important aspect of schemas to measure. On the other hand, extended XPaths provide information that can be used to compute metrics that depend on occurrence constraints.

At this point, the reader will be excused for thinking that things are looking grim for the use of SCPs for obtaining serious schema metrics. All is not lost, however. First, many of these statistics can be computed by using non-canonical SCPs to select a particular set of components, and counting against that set. Where the metrics need the values of properties of particular components, the ability to select components using a non-canonical SCP needs to be augmented with an ability to inspect or query the properties. For example, the particle axis can be used to count number of particles in content models, the scope axis can be used to distinguish particles derived from named model groups from local ones, and the substitutionGroup axis can be used to count references to substitution group heads. Second, since it is far from clear which statistics to use to examine schema size, complexity, or quality, a more fruitful approach may be to see what statistics we can compute from SCPs and see what they show us. Some metrics can be replaced by similar metrics that are more amenable to calculation via SCPs. For example, looking at the ratio of SCPs containing a type definition as an intermediate step against the number of SCPs representing a type definition (that is, whose final step is a type axis) gets at similar schema characteristics as element fanning.

The simplest measure obtainable from the schema signature is a simple count of how many canonical paths there are for a particular schema. A schema with a high level of reuse of global declarations and definitions will result in fewer canonical paths. Suppose there are two schemas, one of which defines a global element declaration and uses it in two places, and one of which defines a local element declaration in each place. The first schema will have one canonical SCP for the global element declaration, while the second will have a canonical SCP for each local element declaration. Each reuse of the global declaration leads to one less canonical SCP in the schema. If two schemas have a similar number of declarations, the one with fewer paths is the simpler.

An interesting extension of the path count can be obtained by generating SCPs along the canonical axes to a particular depth, but not worrying about the other whether the SCP is canonical or not. For example, if a content model references a global element declaration, a SCP that extends the content model's SCP through the schemaElement axis would not be a canonical one (the canonical SCP is the one directly from the root to the global element declaration), but it would be a level 1 extension to a canonical SCP. A level 2 extension to a canonical SCP is the addition of one more step through a canonical axis to a level 1 extension to a canonical SCP, and so on. In some simple data-oriented schemas, the set of canonical SCPs is no different from the set of level 1 extensions. At the other extreme, schemas where content models for different elements recursively refer to each other can have a set of level 1 extensions substantially larger than the set of canonical SCPs. The growth in the number of paths as the level increases is a measure of the inter-relatedness of the components in the schema, and high growth can be the sign of a schema that has many dependencies and is therefore more complex.

Calculation of the total and average number of steps in the SCPs can also be computed readily. A higher average path length indicates less component reuse and more local definitions, more complex content models, or more additional constraints on simple type. In short: a more complex schema. Again, the growth of this measure for level N extensions gives some indication of the inter-relatedness of the components in the schema.

Table IV

Statistics for a selection of schemas

SchemaElementsTypesAttributesE+T+APathsPath lengthLevel 10 pathsLevel 10 path length
XSLT 2.052931853304813.5618506.54
XHTML 1.19711923044616823.8192337413.58
XMLSpec17822613954310873.221088910.09
SDocBook119282785118615743.19418369012.41
FpML 4.41972889262312373133.85731109.59
GML 3.2106311371717391763863.22612498.58

Table IV shows some metrics for an assortment of schemas. XMLSpec (the vocabulary used to write W3C specifications) and XHTML are relatively simple document-oriented vocabularies and measured as the sum of element declarations, attribute declarations, and type declarations they are of roughly comparable size. Simplified DocBook, a somewhat more extensive document-oriented vocabulary, is about twice as large by this measure. In terms of paths, however, DocBook is smaller than XHTML and about only about one and a half times the size of XMLSpec.

GML and FpML are both highly structured data-oriented schemas. Which is more complex? FpML has more elements, but GML has a larger E+T+A count. FpML has more paths and a larger average path length. E+T+A speaks to what one needs to know about a schema to fully make use of it in a processing environment such as XSLT or XQuery, while the path count speaks to the burden on a schema maintainer.

While the average path length for canonical SCPs does not vary greatly, the growth in average path length for level 10 paths is quite substantial. The differences in level 10 extensions is astonishing, however: simplified DocBook produces two orders of magnitudes more level 10 extensions than XMLSpec and one order of magnitude more than XHTML, despite the fact that XMLSpec is larger in terms of the element/type/attribute count and not drastically smaller in terms of the canonical SCP count and XHTML is larger in terms of the canonical SCP count. It appears that the main reason for this is that XMLSpec and XHTML make use of substitution groups and named model and attribute groups, while simplified DocBook uses large choice groups instead.

Table V

Reusable components in a selection of schemas

GlobalE+T+A Global/TotalNamed GroupsSubstitution Group Heads
SchemaElementsTypesAttributesE+T+AModelAttribute
XSLT 2.052284840.25223
XHTML 1.1979811960.44571410
XMLSpec1786141980.361716816
SDocBook11911902380.20000
FpML 4.411088809980.3269111
GML 3.26316601413050.33715114

As we can see in Table V, there is a wide range in the utilization of reusable schema components. Global elements and types are broadly used, but global attributes are rare. Where attribute reuse is desired, it is accomplished (in these schemas at least), through named attribute groups. Substitution groups and named model groups seem to restrict both level 10 path counts and lengths.

Conclusion

Schema component paths provide a characterization of the structure of schemas that is insensitive to details of the XML representation and partitioning into multiple files. They can be used as the basis to analyze and compare schemas, and to compute metrics of schema size and complexity. This paper attempts to sketch some of the possibilities in these areas. Fuller metrics and analysis could by obtained by following the lead of extended XPaths and including non-component accessors on the paths as well.

These metrics calculated in this paper are suggestive and seem to capture interesting differences in schema designs, but a more systematic study is warranted.

Appendix A. Tools

MHSCD MHSCD is a set of Java tools for manipulating schema component paths. Both a SCP generator and a locator API (which provides information about component properties) is included. It was used to generate the examples of schema component paths in this paper and is available under a Creative Commons Attribution license.

The schema component path specification SCD is currently under development by the W3C (as of this writing at the Candidate Recommendation phase). Readers are invited to review and comment on that specification.

References

[Coates10] Anthony B. Coates and Daniel Dui. "Full Impact" Schema Differencing. Conference proceedings XML Prague 2010.

[MHSCD] Mary Holstege. MHSCD, available at http://www.mathling.com/xsd/scds.html.

[Lammel05] Ralf Lammel, Stan Kitsis, and Dave Remy. Analysis of XML schema usage. Conference Proceedings XML 2005.

[McCabe76] T.J. McCabe. A Measure of Complexity. IEEE Transactions on Software Engineering, 2(4), pp. 308-320, December 1976.

[McDowell04] Andrew McDowell, Chris Schmidt, and Kwon-Bun Yue. Analysis and Metrics of XML Schema. Proceedings of the 2004 International Conference on Software Engineering Research and Practice. Volume 2.

[XSD11] W3C: Shudi (Sandy) Gao 高殊镝, C. M. Sperberg-McQueen, and Henry S. Thompson, editors. W3C XML Schema Definition Language (XSD) 1.1 Part 1: Structures. Last Call Working Draft. W3C, December 2009. http://www.w3.org/TR/2009/WD-xmlschema11-1-20091203/

[SCD] W3C: Mary Holstege and Asir S. Vedamuthu, editors. W3C XML Schema Definition Language (XSD): Component Designators. Candidate Recommendation. W3C, January 2010. http://www.w3.org/TR/2010/CR-xmlschema-ref-20100119/

[XSD10] W3C: Henry S. Thompson, Murray Maloney, David Beech, and Noah Mendelsohn, editors. XML Schema Part 1: Structures Second Edition. W3C, October 2004. http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/

Mary Holstege

Principal Engineer

Mark Logic Corporation

Mary Holstege is Principal Engineer at Mark Logic Corporation. She has worked as a software engineer in and around markup technologies for over 20 years. She is a member of the W3C XML Schema and XML Query working groups, and an editor of the W3C XML Schema Component Designators and the XML Query Full Text specifications. Mary Holstege holds a Ph.D. from Stanford University in Computer Science, for a thesis on document representation.