Evaluating the algorithm of Figure “” on the schema shown in Figure “” annotates the alternatives of the Type Table within R with the following error conditions:

not(@kind='string') and (@kind='base64' or (@kind='binary' or (@kind='xml' or @kind='XML')))

TRUE

The error condition associated with the first alternative states that an element E of the instance document and R violate CTSR whenever E is assigned one of the types messageTypeBase64 and messageTypeXML in the context of B (we are assuming that messageTypeString is not a valid restriction of either of those two types). The error condition associated with the second alternative states that, regardless of the type assigned in the context of B, a CTSR violation occurs. This is because error is not a valid restriction of any of the types of the Type Table within B.

On the other hand, the algorithm annotates each alternative of the Type Table within B with the error condition FALSE. This means that, for any element E, E and B do not violate CTSR. This is because B's base is anyType, and every type in the Type Table within B is a valid restriction of anyType.

OCP Run-Time Phase

At validation time, the annotations on context-determined Type Tables are read in order to check CTSR. In particular, let E be an element of the instance document, T be the type of E's parent, and TT be the context-determined Type Table of E in T. First, TT is evaluated to select an alternative. Then the error condition associated with the selected alternative is evaluated. If the error condition evaluates to true, CTSR is not satisfied. Otherwise, the same procedure is executed recursively using T's base type. The recursion stops either when a CTSR violation occurs or when anyType is reached.

The procedure described above is shown in Java-like pseudo-code in Figure “”. To show how it works, let us consider the following document:
<messages xsi:type="R">
  <message kind="string">
    ...
  </message>
  <message kind="binary">
    ...
  </message>
</messages>
and suppose we have to validate it against the schema depicted in Figure “” (the error conditions built during the static phase are described in Section “”). When the first <message> element is processed, its context-determined Type Table is evaluated. The element satisfies the first condition, @kind='string', so it is assigned the first alternative, and the error condition associated with that alternative is evaluated. That condition is not(@kind='string') and (@kind='base64' or (@kind='binary' or (@kind='xml' or @kind='XML'))). Clearly, the error condition is not satisfied (not(@kind='string') evaluates to false). Consequently, the Type Table within B has to be evaluated. Again, the first alternative is chosen, and its error condition is evaluated. But that condition is FALSE, so we can conclude that the first <message> element and R satisfy CTSR.

The second <message> element does not satisfy the predicate of the first alternative, so it is assigned the default alternative. The error condition associated with that alternative is TRUE, so the second <message> element and R do not satisfy CTSR.

OCP Cost Analysis

In this subsection we provide a cost analysis for the static phase and for the run-time phase of OCP.

OCP Static Phase Analysis

Here we are not interested in analyzing the cost of the static phase applied to the entire schema type hierarchy. Rather, we fix an element name and consider a single path from the root to a generic leaf of the type hierarchy.

Thus, let T_1, ..., T_n be a derivation chain, and e be our element name. We can now consider the sequence of Type Tables TT_1, ..., TT_n, where TT_i is the context-determined Type Table for an element named e within T_i.
The size of each TT_i is denoted by d_i.

Given 1 < i <= n, we now analyze the time needed to annotate TT_i.

The function build-error-condition iterates over the whole alternative sequence of TT_{i-1}, and for each alternative it performs a number of operations whose cost is constant. Thus the cost of the function is linear in the size of TT_{i-1}, i.e., d_{i-1}.

The function simplify can be implemented by visiting the structure of the expression returned by build-error-condition. The number of nodes of such an expression is linear in d_{i-1}. Thus, the computational cost of simplify is linear in d_{i-1} too.

As both simplify and build-error-condition are called for each alternative of TT_i, the asymptotic computational cost of the function annotate-type-table is d_{i-1}⋅d_i.

Thus, the asymptotic cost of building and simplifying the error conditions of the whole sequence of Type Tables is given by:

d_1 + d_1⋅d_2 + ... + d_{n-1}⋅d_n

We believe such a cost is perfectly acceptable at schema compile time.

OCP Run-Time Phase Analysis

Here we provide a computational cost analysis of the run-time phase of OCP. As was done for RTC, we are interested in determining the number of XPath predicates that have to be evaluated for a generic element of the instance document.

Let E be an element of the instance document, and T be the type assigned to E's parent. Consider the derivation chain T_1, ..., T_k, where T_1 is anyType and T_k is T. Also consider the usual Type Table sequence TT_1, ..., TT_k, where TT_i is the context-determined Type Table for E within T_i.

If E and T satisfy CTSR, the entire Type Table sequence is processed. For any 1 < i <= k, TT_i is evaluated to obtain the assigned alternative. The cost of this operation is linear in d_i. Once the assigned alternative has been determined, the algorithm evaluates the corresponding error condition. As already discussed, such a condition is a boolean expression over the XPath predicates of TT_{i-1}.
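The walk along the derivation chain analyzed here can be sketched in Java as follows. This is only an illustrative sketch, not the prototype's code: the classes Alternative and Type, the use of java.util.function.Predicate over an attribute map in place of real XPath predicates, and the reduced Type Table given to B are all assumptions made for the example. Only the error conditions of R (the one shown earlier for the first alternative, TRUE for the default) and the FALSE annotations of B come from the text.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class CtsrRuntimeSketch {
    /** One Type Table alternative, annotated with its error condition by the static phase. */
    static final class Alternative {
        final Predicate<Map<String, String>> test;           // null marks the default alternative
        final String type;
        final Predicate<Map<String, String>> errorCondition;
        Alternative(Predicate<Map<String, String>> test, String type,
                    Predicate<Map<String, String>> errorCondition) {
            this.test = test; this.type = type; this.errorCondition = errorCondition;
        }
    }

    static final class Type {
        final String name;
        final Type base;                    // null only for anyType
        final List<Alternative> typeTable;  // context-determined Type Table for one element name
        Type(String name, Type base, List<Alternative> typeTable) {
            this.name = name; this.base = base; this.typeTable = typeTable;
        }
    }

    /** Evaluates a Type Table: the first alternative whose test holds, else the default. */
    static Alternative select(List<Alternative> tt, Map<String, String> attrs) {
        for (Alternative a : tt) {
            if (a.test == null || a.test.test(attrs)) return a;
        }
        throw new IllegalStateException("Type Table has no default alternative");
    }

    /** Walks from the parent's type towards anyType, stopping at the first violation. */
    static boolean satisfiesCtsr(Type parentType, Map<String, String> attrs) {
        for (Type t = parentType; t.base != null; t = t.base) {
            Alternative selected = select(t.typeTable, attrs);
            if (selected.errorCondition.test(attrs)) return false; // CTSR violated
        }
        return true; // anyType reached without violations
    }

    static Predicate<Map<String, String>> kindIs(String v) {
        return attrs -> v.equals(attrs.get("kind"));
    }

    /** Checks a <message kind="..."> element whose parent has type R. */
    static boolean check(String kind) {
        Predicate<Map<String, String>> never = a -> false, always = a -> true;
        Type anyType = new Type("anyType", null, Collections.<Alternative>emptyList());
        // Hypothetical, reduced Type Table for B; every alternative is annotated with FALSE.
        Type b = new Type("B", anyType, Arrays.asList(
            new Alternative(kindIs("string"), "messageTypeString", never),
            new Alternative(kindIs("base64"), "messageTypeBase64", never),
            new Alternative(null, "messageTypeXML", never)));
        // not(@kind='string') and (@kind='base64' or @kind='binary' or @kind='xml' or @kind='XML')
        Predicate<Map<String, String>> err1 = kindIs("string").negate().and(
            kindIs("base64").or(kindIs("binary")).or(kindIs("xml")).or(kindIs("XML")));
        Type r = new Type("R", b, Arrays.asList(
            new Alternative(kindIs("string"), "messageTypeString", err1),
            new Alternative(null, "error", always)));
        return satisfiesCtsr(r, Collections.singletonMap("kind", kind));
    }

    public static void main(String[] args) {
        System.out.println("kind=string satisfies CTSR: " + check("string"));
        System.out.println("kind=binary satisfies CTSR: " + check("binary"));
    }
}
```

Under these assumptions, check("string") returns true and check("binary") returns false, matching the two <message> elements of the example document. For each type on the chain other than anyType, the loop evaluates one Type Table and one error condition.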
In our analysis, the cost of evaluating an error condition with n predicates is linear in n. As, by construction, no XPath predicate appears more than once within the same error condition, the error condition associated with the assigned alternative contains at most d_{i-1} predicates of TT_{i-1}. So its evaluation cost is linear in d_{i-1}. Thus, the number of predicates evaluated for TT_i is upper-bounded by d_i + d_{i-1}.

Considering the whole Type Table sequence, the number of evaluated XPath predicates is bounded by the formula shown in Equation “”.

2⋅d_1 + ... + 2⋅d_{k-1} + d_k

Comparing CP, OCP, and RTC

In this section we compare the main techniques discussed so far: Optimized Cartesian Product, Run-Time Check, and Cartesian Product. The comparison focuses on the number of XPath predicates evaluated at run-time. Before starting, let us fix some notation. Let:

- E be an element of the instance document;
- T be the type assigned to E's parent;
- T_1, ..., T_k be the derivation chain for T, where T_1 is anyType and T_k is T;
- TT_1, ..., TT_k be the sequence of context-determined Type Tables of E along the derivation chain;
- d_i be the size of TT_i, for every i;
- TT'_1, ..., TT'_k be the Type Tables generated by the Cartesian Product static phase.

Both OCP and RTC evaluate TT_k in order to decide which type alternative E has to be assigned. Clearly, both techniques evaluate the same XPath predicates of TT_k. The number of evaluated XPath predicates ranges from 1 to d_k.

On the other hand, CP evaluates TT'_k. If, for every i between 1 and k, E satisfies the first alternative of TT_i, then CP selects the first alternative of TT'_k, and thus only the condition of that alternative is evaluated. However, that condition is the conjunction of k XPath predicates. So in the best case, CP evaluates k XPath predicates. But if, for every i between 1 and k, E satisfies the last alternative of TT_i, then CP has to process every alternative of TT'_k. This means that it has to evaluate d_1 ⋅ ... ⋅ d_k conditions, where each condition is the conjunction of k XPath predicates.

After the evaluation of TT'_k, CP already knows whether E and T satisfy CTSR, without the need to walk along the derivation chain: if TT'_k selected the type error then CTSR is violated, otherwise CTSR is satisfied. The problem is that the evaluation of TT'_k might be very expensive.

On the other hand, after the evaluation of TT_k, both OCP and RTC execute further operations. OCP evaluates the error condition linked to the alternative returned by TT_k, while RTC evaluates TT_{k-1}. Thus, for the purposes of our comparison, it is important to understand whether evaluating the error condition is more or less expensive than evaluating TT_{k-1}. In order to work with a clearer notation, we temporarily rename some variables:

- T_k becomes R;
- T_{k-1} becomes B;
- TT_k becomes TT_R;
- TT_{k-1} becomes TT_B;
- d_k becomes n;
- d_{k-1} becomes m.

We denote the TT_R alternatives by <r_1, R_1>, ..., <r_n, R_n>, and the TT_B alternatives by <b_1, B_1>, ..., <b_m, B_m>. Moreover, let i be the (index of the) alternative selected by TT_R. We denote the error condition associated with that alternative by err_i.

As already observed in Section “”, err_i is a boolean expression over the XPath predicates (here called atoms) of TT_B. Assuming the simplification process did not rewrite it, err_i contains each of the m atoms of TT_B.

At this point it is important to study the structure of a generic error expression err_i. As also shown in Figure “”, an error condition has a fixed structure: for each or (and) operator, its left operand is always a (negated) atom, while its right operand is either another binary operator, or FALSE (TRUE). Moreover, we can observe that the atoms appear in the same order as they appear in TT_B.

It is easy to implement an error condition evaluator as a lazy boolean evaluator: for any input binary operator it always evaluates the left operand first, and it evaluates the right operand only if necessary.
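A minimal Java sketch of such a lazy evaluator is given below. It is not the prototype's ErrorExpressionEvaluator: the Expr interface, the helper names, and the modelling of an atom as an attribute comparison on a map (rather than an XPath predicate) are assumptions made for illustration. The expression built in exampleCondition is the error condition used in the running example, and the counter only records how many atoms were actually evaluated.

```java
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

public class LazyErrorCondition {
    /** A node of an error-condition tree; atomCount records how many atoms were evaluated. */
    interface Expr {
        boolean eval(Map<String, String> attrs, AtomicInteger atomCount);
    }

    // Hypothetical stand-in for an XPath predicate of the form @attr='value'.
    static Expr atom(String attr, String value) {
        return (attrs, n) -> {
            n.incrementAndGet();
            return value.equals(attrs.get(attr));
        };
    }

    static Expr not(Expr e) {
        return (attrs, n) -> !e.eval(attrs, n);
    }

    // Java's && and || short-circuit: the right operand is evaluated only if necessary.
    static Expr and(Expr l, Expr r) {
        return (attrs, n) -> l.eval(attrs, n) && r.eval(attrs, n);
    }

    static Expr or(Expr l, Expr r) {
        return (attrs, n) -> l.eval(attrs, n) || r.eval(attrs, n);
    }

    // not(@kind='string') and (@kind='base64' or (@kind='binary' or (@kind='xml' or @kind='XML')))
    static Expr exampleCondition() {
        return and(not(atom("kind", "string")),
                   or(atom("kind", "base64"),
                      or(atom("kind", "binary"),
                         or(atom("kind", "xml"), atom("kind", "XML")))));
    }

    /** Returns {1 if the condition holds, else 0; number of atoms evaluated}. */
    static int[] run(String kind) {
        AtomicInteger n = new AtomicInteger();
        boolean error = exampleCondition().eval(Collections.singletonMap("kind", kind), n);
        return new int[] { error ? 1 : 0, n.get() };
    }

    public static void main(String[] args) {
        for (String kind : new String[] { "string", "binary", "XML" }) {
            int[] r = run(kind);
            System.out.println(kind + ": error=" + (r[0] == 1) + ", atoms evaluated=" + r[1]);
        }
    }
}
```

For kind="string" the evaluation stops after one atom with result false. For kind="binary" it stops after three atoms with result true, and for kind="XML" all five atoms are evaluated. The condition is applied to these attribute values only to show the short-circuiting; in an actual validation it is evaluated only for elements assigned the alternative it annotates.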
The atoms of err_i actually evaluated by such a boolean evaluator are exactly the same as those evaluated by RTC to decide the type selected by TT_B.

For instance, suppose that, for a given j, our element E satisfies none of b_1, ..., b_{j-1}, and it does satisfy b_j. RTC evaluates b_1, ..., b_j. Our technique also evaluates those predicates, and it does not evaluate further ones. Indeed, within err_i, b_j appears either in negated form as the left operand of an and operator, or directly as the left operand of an or operator (depending on whether or not R_i is validly substitutable as restriction for B_j). In either case, the evaluation of err_i stops before processing the right operand.

Thus we can conclude that, even if it is not possible to simplify err_i, OCP and RTC are equivalent in terms of evaluated atoms. But there are cases in which err_i is simplified by the rewriting rules described in Section “”. Indeed, if there exists a j such that either

- for each j < j' <= m, R_i is not validly substitutable as restriction for B_{j'}, or
- for each j < j' <= m, R_i is validly substitutable as restriction for B_{j'},

then the simplification process removes from err_i the atoms b_{j+1}, ..., b_m.

In such cases, if E does not satisfy any of the predicates b_1, ..., b_{j+k}, for some k, then OCP does not need to evaluate the k atoms b_{j+1}, ..., b_{j+k} in order to decide whether CTSR is satisfied or not. On the other hand, RTC does evaluate those atoms, because it has to find the type actually selected by TT_B.

Thus, we can conclude that on a single step of a derivation chain, OCP evaluates a number of predicates less than or equal to the number of predicates evaluated by RTC.

However, as can be noted from the formulas shown in Equations “” and “”, OCP might evaluate the same atoms twice. Coming back to the notation introduced earlier in this section, if E does not satisfy the error condition of the alternative selected by TT_k, then OCP has to evaluate TT_{k-1}.
But as the error condition previously processed was built on the atoms of TT_{k-1}, it is clear that some predicates of TT_{k-1} might be processed twice.

However, this additional cost can be avoided if, during the processing of an error condition, the result of each atom evaluation is stored in some data structure. In this way, an XPath predicate is actually evaluated only if it has not been evaluated yet.

So we conclude that, for a given derivation chain, OCP evaluates a number of XPath predicates less than or equal to the number of XPath predicates RTC evaluates.

Implementation

We built a prototype implementation of Optimized Cartesian Product, thus demonstrating its feasibility. We implemented it in Java within Xerces []. Our prototype patches Xerces in three respects:

- support for XSD 1.1 related components;
- implementation of the OCP static phase;
- implementation of the OCP run-time phase within the existing validation code.

As Xerces is an XML parser for XSD 1.0, it does not handle 1.1-specific constructs. Our prototype modifies the Xerces modules responsible for the construction of schema components (package org.apache.xerces.impl.xs.traversers). It also modifies the Xerces implementation of the XML Schema API [], in order to represent type alternative components, and to make element declarations aware of their Type Tables (packages org.apache.xerces.xs and org.apache.xerces.impl.xs).

The OCP static phase is implemented within a separate package, it.unibo.cs.cta. The code for the error condition construction is within the class it.unibo.cs.cta.preprocessor.impl.ErrorConditionBuilder. This class processes an input XSD schema, associating each type with a map. That map is our implementation of tt-mapT: it associates element names with context-determined Type Tables. ErrorConditionBuilder also annotates each context-determined Type Table with its error conditions. Error conditions are built directly using the algorithm described in Section “”.
The classes handling error conditions are within the package it.unibo.cs.cta.errorexpr. In particular, the simplification of error conditions is implemented by ErrorExpressionSimplifier, while their evaluation is implemented by ErrorExpressionEvaluator.

The static phase is delegated to a pre-processor invoked when a schema document is loaded. To invoke it, the following code is used:

    // instantiation
    PreprocessorFactory pf = PreprocessorFactory.getInstance();
    fPreprocessor = pf.createPreprocessorSequence(
        new String[]{"ErrorConditionBuilder"}
    );
    // invocation on an XS Model
    fPreprocessor.processModel(model);

The static phase result (i.e., the association between types and maps) is read by calling the pre-processor method getStateByName("type-table-map").

The OCP run-time phase is implemented within the class org.apache.xerces.impl.xs.OptimizedCTAXMLSchemaValidator, a patched version of the original XSD validator provided by Xerces. In particular, the code for the CTSR verification is within the method handleStartElement. XPath predicates are evaluated using the interfaces in javax.xml.xpath. Currently, our prototype does not check whether an XPath predicate has already been evaluated. Thus, as observed in Section “”, an XPath predicate might be evaluated twice for the same element.

Our prototype is meant to prove the feasibility of OCP, and as such it does not aim to be XSD 1.1 conformant. In particular it has some limitations, the most important of which are:

- only XPath 1.0 expressions are accepted;
- all non-CTA-related syntax is ignored, e.g., <assert> elements are not considered legal within a schema;
- derivations by restriction are checked using the original Xerces code, i.e., XSD 1.0 rules are applied.

XSD 1.0 defines derivation by restriction in terms of ad hoc rules provided by the recommendation itself. XSD 1.1 allows processors to choose the algorithm they prefer to check whether a content model includes another content model.

We also developed a small test suite for OCP. It can be run through a simple graphical interface. Source code and jars are available from http://tesi.fabio.web.cs.unibo.it/Tesi/OptimizedCartesianProduct.

Related Works

Among the best-known validation languages (DTD [], RELAX NG [], Schematron [], DSD [], etc.), the problem of verifying the subtype relation in the presence of conditional declarations is very specific to XSD 1.1. Indeed, although there exists at least one language, DSD, permitting the definition of conditional content models, that language is not type-based, and consequently it has no concept of type derivation.
We are not aware of any work on restriction checking in the presence of conditional declarations.

However, there exist works on the problem of verifying whether an XSD 1.0 type is a legal restriction of another type [], [], []. Those works propose techniques to statically verify whether a type accepts a subset of what the base type accepts. Along the same lines, Neven et al. present theoretical results about some basic decision problems concerning schemas, among which is the problem of testing for inclusion of schemas [].

Conclusions

In XSD 1.1, the presence of conditional declarations increases the difficulty of verifying whether a type is a legal restriction of its base. We discussed three main approaches to the problem: CTA usage limitation, run-time verification, and hybrid verification. Solutions of the first kind ensure it is possible to statically verify whether a type is a legal restriction of its base, but at the cost of limiting the expressivity of CTA. Solutions of the second and third kinds allow the highest degree of expressivity, but they may also recognize as a legal restriction a type accepting something its base rejects. They throw an error only for those instance documents that actually prove that a type is not a legal restriction of its base. Hybrid solutions precompute, during the static phase, some information that might decrease the work to be done at run-time.

In particular, we described the solution adopted by the current XSD draft, which follows a run-time approach described within the specs by the Conditional Type Substitutable in Restriction (CTSR) constraint. We discussed an algorithm verifying CTSR, which we called Run-Time Check (RTC). Then we proposed an alternative solution to RTC, named Optimized Cartesian Product (OCP). OCP is a hybrid solution. Its idea is to analyze conditional declarations in order to statically decide which XPath predicates can be ignored at run-time.
We showed that, unlike Cartesian Product (CP), another hybrid solution of which OCP can be seen as an optimization, the cost of the OCP static analysis is perfectly acceptable.

We then compared the RTC, OCP and CP techniques, focusing on the number of XPath predicates evaluated at run-time. We showed that CP is the worst technique, as it inherits from the static phase a high volume of information that might heavily slow down the run-time phase. We also showed that, although OCP might process the same alternatives twice, if the XPath predicate evaluation results are stored then OCP evaluates a number of predicates less than or equal to the number of predicates RTC evaluates.

An interesting piece of future work is the experimental comparison of RTC, OCP and CP on a corpus of real schema documents. Moreover, it would be interesting to improve our error condition simplification process. For instance, our simplification rules are not able to rewrite expressions like not(@a = 'v1') and (@a = 'v2') into (@a = 'v2'). There are also error conditions that are clearly unsatisfiable when associated with a particular alternative. For instance, if the alternative predicate is (@a = 'v1') and the error condition is (@a = 'v2'), it is clear that the error condition will never be satisfied. Improving the simplification rule set should increase the number of situations in which OCP is preferable to RTC.

Acknowledgements

We would like to thank Stefano Zacchiroli for the technical discussions we had during the design of the Optimized Cartesian Product technique, the anonymous reviewers for their comments, and the XML Schema Working Group for the many inspiring discussions on the topics covered by this paper.

References

Co-occurrence constraints ESW Wiki. http://esw.w3.org/topic/Co-occurrence_constraints

A. Møller. Document Structure Description 2.0. BRICS, Department of Computer Science, University of Aarhus, Aarhus, Denmark. 2002. http://www.brics.dk/DSD/

M. Fuchs, and A. Brown. Supporting UPA and restriction on an extension of XML Schema. In Proceedings of Extreme Markup Languages. August, 2003. Montréal, Québec. http://www.idealliance.org/papers/extreme03/html/2003/Fuchs01/EML2003Fuchs01.html

P. Marinelli, C. Sacerdoti Coen, and F. Vitali. SchemaPath, a Minimal Extension to XML Schema for Conditional Constraints. In Proceedings of the Thirteenth International World Wide Web Conference. New York, NY, USA. May, 2004. Pages 164-174. ACM Press.

W. Martens, F. Neven, and T. Schwentick. Which XML Schemas Admit 1-Pass Preorder Typing? In Proceedings of the 10th International Conference on Database Theory. Edinburgh, UK, January 5-7, 2005. LNCS. Volume 3363. Pages 68-82.

Information technology -- Document Schema Definition Language (DSDL) -- Part 2: Regular-grammar-based validation -- RELAX NG. ISO/IEC 19757-2:2003, JTC1/SC34 Committee. Publicly available at http://standards.iso.org/ittf/PubliclyAvailableStandards/c037605_ISO_IEC_19757-2_2003(E).zip

Information technology -- Document Schema Definition Language (DSDL) -- Part 3: Rule-based validation -- Schematron. ISO/IEC 19757-3:2006, JTC1/SC34 Committee. Publicly available at http://standards.iso.org/ittf/PubliclyAvailableStandards/c040833_ISO_IEC_19757-3_2006(E).zip

C. M. Sperberg-McQueen. Applications of Brzozowski derivatives to XML Schema processing. In Proceedings of Extreme Markup Languages. August, 2005. Montréal, Québec. http://www.mulberrytech.com/Extreme/Proceedings/html/2005/SperbergMcQueen01/EML2005SperbergMcQueen01.html

H. S. Thompson, and R. Tobin. Using Finite State Automata to Implement W3C XML Schema Content Model Validation and Restriction Checking. In Proceedings of XML Europe. London, England. May, 2003. http://www.idealliance.org/papers/dx_xmle03/papers/02-02-05/02-02-05.html

N. Walsh, and J. Cowan. Schema Language Comparison. December, 2001. http://nwalsh.com/xml2001/schematownhall/slides/

The Apache Software Foundation. Apache Xerces. http://xml.apache.org

E. Litani. XML Schema API. W3C Member Submission. 22 January 2004. http://www.w3.org/Submission/2004/SUBM-xmlschema-api-20040122/

Extensible Markup Language (XML) 1.1 (Second Edition). W3C Recommendation. 16 August 2006. http://www.w3.org/TR/xml11/

XML Schema Part 1: Structures Second Edition. W3C Recommendation. 28 October 2004. http://www.w3.org/TR/xmlschema-1/

XML Schema Part 2: Datatypes Second Edition. W3C Recommendation. 28 October 2004. http://www.w3.org/TR/xmlschema-2/

W3C XML Schema Definition Language (XSD) 1.1 Part 1: Structures. W3C Working Draft. 20 June 2008. http://www.w3.org/TR/2008/WD-xmlschema11-1-20080620/

W3C XML Schema Definition Language (XSD) 1.1 Part 2: Datatypes. W3C Working Draft. 20 June 2008. http://www.w3.org/TR/2008/WD-xmlschema11-2-20080620/