Bridging the Gap Between XML and RDF Validation

Kurt Cagle

Copyright © 2017 Semantical LLC


Preliminary Proceedings


Balisage: The Markup Conference 2017
August 1 - 4, 2017

They say that structure is freedom, and in a sense it is. When you're dealing with multiple constraints, you have to figure out what you can get out of that.

— Dmitri Martin

Introduction

One can argue that the XML Schema Definition Language, or XSD, had a profound impact upon the XML community. XML, coming from SGML, has long had a clear mechanism for identifying the structure of a given document instance. However, it had only a limited concept of type, which inhibited the adoption of XML among developers for whom type declarations and bindings were often more important than class relationships. At the same time, until XML, there were few mechanisms for the ad hoc assignment of type that weren't intrinsically tied to a specific byte-level storage implementation.

Yet XSD by itself has also proven to be only part of a broader constraint validation strategy. While XSD can, for a given element, identify the children of that element and the data types for atomic data (including cardinality, regular expression patterns, min and max values and enumerations), certain types of constraints fall outside of these terms. The ISO Schematron standard that emerged in the wake of XSD specifically addressed these relational constraints, making it possible to specify constraints such as certain enumerations only being valid when an attribute has one type of value as compared to another. These constraints are frequently specified as rules. The XSD 1.1 specification incorporated some of the features of Schematron, and the ability to create constraints that span multiple nodes is proving to be one of the more desired features for validation.

The Resource Description Framework (RDF) is younger than XML (and indeed, younger than XSD) and because its initial focus was more atomic and assertional, it is perhaps not surprising that RDF quickly evolved a set of schematic constraint languages - from RDF Schema to the Web Ontology Language (OWL) to a whole collection of OWL profiles. Arguably, because RDF works upon the open world assumption, constraining the language has always been more complex. Much of the flexibility of the language has been compromised as a result, because OWL itself evolved to be an internally consistent constraint language with an extremely robust toolset for differentiating between different types of constraints.

However, OWL also predated SPARQL, which is a language for both querying and constructing RDF triples. OWL established constraints through the use of blank nodes - an "open" slot in the triple - while SPARQL made it possible to inject some operational semantics (variable names) into these slots as rules. The complexity of blank node semantics tends to make OWL a major hurdle for even semanticists to master. For those more used to thinking in terms of SQL queries, OWL seemed like expensive overkill, especially when it often required the forward chaining of assertions and the commitment of significant memory for data that in general changed at best slowly over time.

The idea that SPARQL could be used to perform validation and constraint has consequently been floated for a while, and has given rise (through some interesting historical stepping stones) to the notion of semantic shapes.

Creating Shapes

The distinction between a shape and a class is subtle but can best be stated in XSD terms. A class can be thought of as analogous to an element declaration in XSD, while a shape is the analog of a simple or complex type declaration - the first identifies the existence of a given entity and identifies it as being the structure for all instances of that type, while the second defines what that structure is. In effect, a shape holds roughly the same role as an abstract type in XSD.

For instance, consider a movie such as "Star Wars: A New Hope" by Walt Disney Studio, that has the following (highly abbreviated) XML structure:

<packet>
   <movie id="star-wars-a-new-hope">
      <title>Star Wars: A New Hope</title>
      <productionDate>2015-11-25</productionDate>
      <franchiseRef ref="star-wars-franchise"/>
      <studioRef ref="the_walt-disney-company"/>
      <studioRef ref="lucasfilms"/>
   </movie>
   <franchise id="star-wars-franchise">
      <name>Star Wars</name>
   </franchise>
   <studio id="the_walt-disney-company">
      <name>The Walt Disney Company</name>
   </studio>
   <studio id="lucasfilms">
      <name>Lucas Films</name>
   </studio>
</packet>

This structure is normalized (broken down into pieces). A schema for it might look like the following:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
    <!-- Global element declarations; each one is bound to a named type below. -->
    <xsd:element name="packet" type="packet_type"/>
    <xsd:element name="movie" type="movie_type"/>
    <xsd:element name="franchise" type="franchise_type"/>
    <xsd:element name="studio" type="studio_type"/>
    <xsd:element name="franchiseRef" type="reference"/>
    <xsd:element name="studioRef" type="reference"/>

    <xsd:complexType name="packet_type">
        <xsd:sequence>
           <xsd:element ref="movie" maxOccurs="unbounded"/>
           <xsd:element ref="franchise" minOccurs="0" maxOccurs="unbounded"/>
           <xsd:element ref="studio" maxOccurs="unbounded"/>
        </xsd:sequence>
    </xsd:complexType>
    <xsd:complexType name="movie_type">
        <xsd:sequence>
            <xsd:element name="title" type="xsd:string"/>
            <xsd:element name="productionDate" type="xsd:date" minOccurs="0"/>
            <!-- An element may carry ref or type, but not both. -->
            <xsd:element ref="franchiseRef" minOccurs="0"/>
            <xsd:element ref="studioRef" minOccurs="1" maxOccurs="unbounded"/>
        </xsd:sequence>
        <xsd:attribute name="id" type="xsd:ID"/>
    </xsd:complexType>

    <!-- Attributes are declared directly on the type, not inside a sequence. -->
    <xsd:complexType name="reference">
        <xsd:attribute name="ref" type="xsd:IDREF"/>
    </xsd:complexType>

    <xsd:complexType name="franchise_type">
        <xsd:sequence>
            <xsd:element name="name" type="xsd:string"/>
        </xsd:sequence>
        <xsd:attribute name="id" type="xsd:ID"/>
    </xsd:complexType>

    <xsd:complexType name="studio_type">
        <xsd:sequence>
            <xsd:element name="name" type="xsd:string"/>
        </xsd:sequence>
        <xsd:attribute name="id" type="xsd:ID"/>
    </xsd:complexType>
</xsd:schema>

One problem that XML faces is that references usually carry no type information: an IDREF resolves to any element with a matching ID, whatever its type, and it cannot reach outside of the document itself. Because RDF consists of normalized data (as is increasingly the case for large XML structures as well), this kind of constraint is complex to model with a schema.
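The limitation can be made concrete with a small sketch in Python, using only the standard library. The helper name resolve_refs and the trimmed sample packet are assumptions made for illustration; they are not part of the schema above. The sketch resolves every ref attribute against the id attributes in the packet and reports which element each reference lands on. An ID/IDREF check alone is satisfied even when a franchiseRef points at a studio.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the packet, with franchiseRef deliberately pointing
# at a studio (hypothetical test data).
PACKET = """
<packet>
   <movie id="star-wars-a-new-hope">
      <title>Star Wars: A New Hope</title>
      <franchiseRef ref="lucasfilms"/>
      <studioRef ref="lucasfilms"/>
   </movie>
   <franchise id="star-wars-franchise"><name>Star Wars</name></franchise>
   <studio id="lucasfilms"><name>Lucas Films</name></studio>
</packet>
"""

def resolve_refs(xml_text):
    """Return (referring element, ref value, target element name or None)."""
    root = ET.fromstring(xml_text)
    # Index every element that carries an id attribute by that id.
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    results = []
    for el in root.iter():
        ref = el.get("ref")
        if ref is not None:
            target = by_id.get(ref)
            results.append((el.tag, ref, target.tag if target is not None else None))
    return results

for source, ref, target in resolve_refs(PACKET):
    # Every reference resolves, so ID/IDREF validation passes; only the
    # target's element name reveals the franchiseRef-to-studio mismatch.
    print(f"{source} -> {ref} resolves to <{target}>")
```

Catching the mismatch requires an extra rule on top of the schema (for instance in Schematron), which is the gap that the class constraints discussed later are meant to fill on the RDF side.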

The same information can be rendered in RDF (here using Turtle, and again, being rather cavalier about namespaces):

movie:ANewHope a class:Movie;
     movie:title "A New Hope"^^xsd:string;
     movie:productionDate "2015-11-25"^^xsd:date;
     movie:franchise franchise:StarWars;
     movie:studio studio:DisneyStudios,studio:Lucasfilms;
     .

franchise:StarWars a class:Franchise;
     franchise:name "Star Wars"^^xsd:string;
     .
studio:DisneyStudios a class:Studio;
     franchise:name "Disney Studios"^^xsd:string;
     .
studio:Lucasfilms a class:Studio;
     franchise:name "Lucasfilms"^^xsd:string;
     .

The RDF here is a bit friendlier for references, at the expense of being more verbose for atomic types. The Shapes Constraint Language, or SHACL (defined at https://www.w3.org/TR/shacl/), provides the language for establishing the constraint models, along with a reporting vocabulary that describes the results of validation.

An example can showcase what such shapes are capable of specifying. The following illustrates the shapes associated with this data:

shape:Movie
     a sh:NodeShape;
     sh:targetClass class:Movie; #Applies to all movies.
     sh:property [
         sh:name "Title"; #identifies the UX name of the property
         sh:path movie:title; #This property shape applies to movie title.
         sh:datatype xsd:string; #title is a string.
         sh:minCount "1"^^xsd:integer; #title is a required property
         sh:maxCount "1"^^xsd:integer;
         sh:order "0"^^xsd:integer;
     ];

     sh:property [
         sh:name "Production Date"^^xsd:string; #identifies the UX name of the property
         sh:path movie:productionDate; #This property shape applies to the production date of the movie.
         sh:datatype xsd:date; #production date is a date.
         sh:minCount "0"^^xsd:integer; #production date is an optional property
         sh:maxCount "1"^^xsd:integer;
         sh:order "1"^^xsd:integer;
     ];

     sh:property [
         sh:name "Franchise"^^xsd:string; #identifies the UX name of the property
         sh:path movie:franchise; #This property shape applies to franchise.
         sh:nodeKind sh:IRI; #The franchise is given as an IRI link.
         sh:minCount "0"^^xsd:integer; #franchise is optional and, with no sh:maxCount, unbounded
         sh:class class:Franchise; #the object of the property is a franchise class
         sh:order "2"^^xsd:integer;
      ];

     sh:property [
         sh:name "Studio"^^xsd:string; #identifies the UX name of the property
         sh:path movie:studio; #This property shape applies to studio.
         sh:nodeKind sh:IRI; #The studio is given as an IRI link.
         sh:minCount "0"^^xsd:integer; #studio is optional and, with no sh:maxCount, unbounded
         sh:class class:Studio; #the object of the property is a studio class
         sh:order "3"^^xsd:integer;
      ];
     .

shape:Franchise
     a sh:NodeShape;
     sh:targetClass class:Franchise; #Applies to all franchises.
     sh:property [
         sh:path franchise:name; #This property shape applies to the franchise name.
         sh:name "Name";
         sh:datatype xsd:string; #name is a string.
         sh:minCount "1"^^xsd:integer; #name is a required property
         sh:maxCount "1"^^xsd:integer;
     ];
     .

shape:Studio
     a sh:NodeShape;
     sh:targetClass class:Studio; #Applies to all studios.
     sh:property [
         sh:path franchise:name; #Studios reuse the franchise:name predicate for their name.
         sh:name "Name";
         sh:datatype xsd:string; #name is a string.
         sh:minCount "1"^^xsd:integer; #name is a required property
         sh:maxCount "1"^^xsd:integer;
     ];
     .

There are a few key predicates in the shape namespace that need to be explained. The shape:Movie object is an instance of a node shape (as opposed to a property shape). It has a target class of class:Movie - the shape describes the movie class. It has four properties, each of which is here treated as a blank node.

Each property shape has a sh:path property which identifies the predicate that the property shape applies to. This can be a single predicate, or a more complex predicate path (such as a choice between multiple predicates, or a chain of predicates such as that used for walking an RDF collection). This differs from XSD, in which any given element reference must always be relative to its immediate parent. Similarly, the sh:class property within a property shape identifies the class that the values must belong to. Unlike XML and XSD, this can be used for constraining the result of what would be an IDREF to just those instances that are of a given class. Note that this is basically the same kind of operation as rdfs:range, while sh:targetClass performs much the same operation as rdfs:domain. In that regard, a lot of what SHACL does is to pull together a minimal data-friendly ontology from the RDFS+ and very basic OWL class sets.

The sh:order component addresses another weakness of RDF. The framework generally has no preferred order for the output of properties (unlike XML, which defaults to an xsd:sequence model from the schema). The use of sh:order establishes an ordering for properties, making it easier to build interfaces that follow a specific layout. Additionally, SHACL defines sh:group for gathering properties into logical groups, and then uses sh:order to determine the ordering within each group.
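How grouping and ordering combine can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than anything SHACL itself prescribes: the property shapes are assumed to have already been read into plain dictionaries, and the group names, the GROUP_ORDER table and the display_order helper are all hypothetical.

```python
# Property shapes flattened into dicts (hypothetical data; in practice these
# values would be read from the shapes graph).
PROPERTIES = [
    {"name": "Studio", "group": "credits", "order": 3},
    {"name": "Title", "group": "basics", "order": 0},
    {"name": "Franchise", "group": "credits", "order": 2},
    {"name": "Production Date", "group": "basics", "order": 1},
]

# Each group carries its own order value (assumed here to come from an
# order attached to the group itself).
GROUP_ORDER = {"basics": 0, "credits": 1}

def display_order(properties, group_order):
    """Sort by the group's order first, then by the property's own order."""
    return sorted(
        properties,
        # Properties without a known group sort after all grouped ones.
        key=lambda p: (group_order.get(p.get("group"), float("inf")), p["order"]),
    )

for prop in display_order(PROPERTIES, GROUP_ORDER):
    print(prop["group"], prop["order"], prop["name"])
```

Sorting on a (group order, property order) pair keeps the properties of one group together regardless of the order in which the store returns them.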

Shapes can define cardinality. sh:minCount and sh:maxCount determine the lower and upper bounds respectively for cardinality, with the assumption that sh:minCount defaults to "0" while sh:maxCount, when not included, gives the unbounded case. This can be used with sh:defaultValue to populate UI widgets or establish behavior of "new" components.
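The cardinality rule can be sketched as follows, assuming the data graph is available as simple (subject, predicate, object) tuples. The check_cardinality helper and the sample triples are hypothetical; a real validator works against an RDF store rather than a Python list.

```python
# Triples as (subject, predicate, object) tuples; hypothetical sample data.
TRIPLES = [
    ("movie:ANewHope", "movie:title", "A New Hope"),
    ("movie:ANewHope", "movie:studio", "studio:DisneyStudios"),
    ("movie:ANewHope", "movie:studio", "studio:Lucasfilms"),
]

def check_cardinality(triples, focus, path, min_count=0, max_count=None):
    """Return a list of problems; an empty list means the counts conform.

    min_count defaults to 0 and max_count=None means unbounded, matching
    the defaults described in the text.
    """
    count = sum(1 for s, p, _ in triples if s == focus and p == path)
    problems = []
    if count < min_count:
        problems.append(f"{path}: {count} value(s), expected at least {min_count}")
    if max_count is not None and count > max_count:
        problems.append(f"{path}: {count} value(s), expected at most {max_count}")
    return problems

# Title is required and single-valued; studio is left at the defaults.
print(check_cardinality(TRIPLES, "movie:ANewHope", "movie:title", 1, 1))
print(check_cardinality(TRIPLES, "movie:ANewHope", "movie:studio"))
# Tightening studio to at most one value produces a problem entry.
print(check_cardinality(TRIPLES, "movie:ANewHope", "movie:studio", max_count=1))
```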

Finally, SHACL supports SPARQL-based constraints, in which a SPARQL query attached to a shape (through sh:sparql) performs additional validation. These constraints are intriguing, because they can make for more sophisticated queries, can construct interim results and can use those to provide internodal constraints.

Shape Validation and Reporting

Validation is a two stage process. The first stage passes a node of a given type to the validator along with the graph holding the shapes themselves, with the processor most likely being a fairly complex SPARQL query. The output is a set of triples, which in the second stage can be passed to a post-processor for conversion to an HTML or XML page of some sort.

The reports so generated are similar to those of XSD, in that passing validation produces no validation results. However, if even one property or component relationship is not valid, then this will be reported. For instance, suppose that the above instance had franchise set to a studio value instead:

movie:ANewHope a class:Movie;
     movie:title "A New Hope"^^xsd:string;
     movie:productionDate "2015-11-25"^^xsd:date;
     movie:franchise studio:DisneyStudios; # This line is invalid
     movie:studio studio:DisneyStudios,studio:Lucasfilms;
     .

The invalid line will then, when validated, return the following report.

[	a sh:ValidationReport ;
	sh:conforms false ;
	sh:result [
		a sh:ValidationResult ;
		sh:resultSeverity sh:Violation ;
		sh:focusNode movie:ANewHope ;
		sh:resultPath movie:franchise ;
		sh:value studio:DisneyStudios ;
		sh:resultMessage "movie:franchise expects an IRI of type class:Franchise." ;
		sh:sourceConstraintComponent sh:ClassConstraintComponent ;
		sh:sourceShape shape:Movie ;
	]
] .

The report will include one sh:result node for each violation found, across every node in the passed nodesets (or every node of the given target class in the triple store). Reports can also be output as Turtle or similar files, or even mapped to JSON or XML files for additional post-processing.
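The class check behind such a report can be sketched in Python, again treating the graph as plain tuples. The check_class helper and the dictionary layout of its result are assumptions for illustration only. The sketch also ignores subclass relationships, which a conforming SHACL processor takes into account when evaluating sh:class.

```python
# Hypothetical data graph as (subject, predicate, object) tuples.
DATA = [
    ("movie:ANewHope", "rdf:type", "class:Movie"),
    ("movie:ANewHope", "movie:franchise", "studio:DisneyStudios"),
    ("movie:ANewHope", "movie:studio", "studio:DisneyStudios"),
    ("franchise:StarWars", "rdf:type", "class:Franchise"),
    ("studio:DisneyStudios", "rdf:type", "class:Studio"),
]

def check_class(data, target_class, path, expected_class):
    """Report each value of `path` that is not typed as `expected_class`."""
    types = {(s, o) for s, p, o in data if p == "rdf:type"}
    # Focus nodes are all instances of the target class.
    focus_nodes = sorted(s for s, c in types if c == target_class)
    results = []
    for focus in focus_nodes:
        for s, p, value in data:
            if s == focus and p == path and (value, expected_class) not in types:
                results.append({
                    "focusNode": focus,
                    "resultPath": path,
                    "value": value,
                    "sourceConstraintComponent": "sh:ClassConstraintComponent",
                })
    # The data conforms only when no result was produced.
    return {"conforms": not results, "results": results}

report = check_class(DATA, "class:Movie", "movie:franchise", "class:Franchise")
print(report["conforms"], report["results"])
```

Running the same check on movie:studio against class:Studio yields a conforming report with an empty result list, which corresponds to the "no results on success" behavior described earlier.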

SHACL and System Generated User Interfaces

A huge challenge exists for people working with large scale RDF triple databases. Typically, there may be thousands of different classes involved in such databases, making the hand creation of user interfaces problematic - especially when dealing with data hubs and similar aggregate enterprise systems. Generating user interfaces from XSDs is a well-known process, but because of the complexities of OWL, having systems create their own interfaces was simply out of bounds for all but the simplest of models.

SHACL has the potential to change that. Because SHACL resources can be grouped and ordered, there is generally enough information to build not only display-only but also editable interfaces from SHACL graphs, as well as to support services for validation and import processing of content when used in a RESTful architecture (which is typical for RDF systems). This can also be supplemented with permission constraints to determine editability of content at the property or record level. Because such information is just RDF in a different graph in a federatable system, there is no real need to create separate vocabularies as part of SHACL - these simply become other constraint conditions.

Additionally, SPARQL can be used to determine what tools and widgets work best for editing, and could potentially construct these as part of an output. A simple example illustrates the concept:

select ?output where {
    $node a ?class.
    graph graph:sh {
        ?shape sh:targetClass ?class.
        ?shape sh:property ?property.
        ?property sh:order ?order.
        optional {?property sh:group ?group.}
        ?property sh:path ?path.
        ?property sh:datatype ?datatype.
        ?property sh:name ?label.
    }
    # The instance data is matched outside of the shapes graph.
    $node ?path ?value.
    # IF takes exactly three arguments, so the date case is nested in the
    # else branch; any other datatype yields an empty string.
    bind(if(?datatype = xsd:string,
            concat('<div class="prop" id="',str($node),'"><span class="label">',
                ?label,'</span><input type="text" name="',str(?path),
                '" value="',str(?value),'"/></div>'),
            if(?datatype = xsd:date,
                concat('<div class="prop"><span class="label">',?label,
                    '</span><input type="text" name="',str(?path),
                    '" value="',str(?value),'"/></div>'),
                '')) as ?output)
    } order by ?group ?order

This query would then generate an ordered sequence of HTML items for inputting string or date content. Obviously, a real world scenario would be more complex, but not dramatically so.
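One of the complications a real-world version has to handle is escaping, since labels and values end up inside HTML attributes. The following Python sketch shows the same idea applied to rows that have already been fetched from the store. The row layout, the INPUT_TYPES mapping and the render helper are assumptions for illustration, not the output format of any particular SPARQL endpoint.

```python
from html import escape

# Hypothetical rows, one per property shape and value, as they might look
# after the bindings have been fetched from the store.
ROWS = [
    {"order": 1, "label": "Production Date", "path": "movie:productionDate",
     "datatype": "xsd:date", "value": "2015-11-25"},
    {"order": 0, "label": "Title", "path": "movie:title",
     "datatype": "xsd:string", "value": 'A "New" Hope'},
]

# Assumed mapping from datatype to HTML input type.
INPUT_TYPES = {"xsd:string": "text", "xsd:date": "date"}

def render(rows):
    """Render one labelled input per row, ordered by the shape's order."""
    fields = []
    for row in sorted(rows, key=lambda r: r["order"]):
        input_type = INPUT_TYPES.get(row["datatype"])
        if input_type is None:
            continue  # no widget is known for this datatype
        fields.append(
            '<div class="prop"><span class="label">{}</span>'
            '<input type="{}" name="{}" value="{}"/></div>'.format(
                escape(row["label"]), input_type,
                escape(row["path"]), escape(row["value"]),
            )
        )
    return "\n".join(fields)

print(render(ROWS))
```

Keeping the datatype-to-widget mapping in a table rather than in nested conditionals makes it straightforward to add further datatypes without touching the rendering logic.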

Summary

The graph model that describes RDF and the folded hierarchy of XML are readily translatable, although some of the assumptions that OWL makes about the RDF model are simply too rich to capture in XSD. However, SHACL, as a smaller, more data-centric format, may actually be a good tool for managing equivalency in pipelines where large numbers of resources (millions or even billions of data "documents") are involved. Because of the work done in reconciling the core XDM and JSON-DM models, SHACL could act as a unifying bridge, a mechanism for storing both normalized and denormalized content and providing both validation and potentially visualization of interfaces moving forward.

Author's keywords for this paper: XML Schema; SPARQL; SHACL; RDF; XML; JSON; OWL