Application of Brzozowski Derivatives to JSON Schema Validation

Mary Holstege

Distinguished Engineer

MarkLogic Corporation

Copyright ©2019 Mary Holstege

expand Abstract

expand Mary Holstege

Balisage logo

Preliminary Proceedings

expand How to cite this paper

Application of Brzozowski Derivatives to JSON Schema Validation

Balisage: The Markup Conference 2019
July 30 - August 2, 2019

Introduction

JSON Schema is a JSON vocabulary for defining valid JSON data. The data may be null, simple – numbers, strings, or boolean values – or structured – arrays of JSON values or objects consisting of key/value pairs. In JSON parlance the keys are called properties. The JSON Schema specification defines a set of properties that define possible constraints on the JSON values. In general these constraints are independent of one another.

Among the constraints are some that define what we might call, in the XML world, "content models" or "content types". As such, we pose the question: can we apply similar techniques as we use for XML Schema to implementing JSON Schema? In particular, can we use Brzozowski derivatives to compute validity with respect to these "content models"?

Background and Related Work

Brzozowski64 defines the concept of a derivative of an extended regular expression with respect to a symbol, and to a sequence of symbols. The derivative of an extended regular expression with respect to a symbol is the extended regular expression that would need to be matched once that symbol is consumed. By repeating the process as we walk through a sequence of symbols (the input sequence) we end up with an extended regular expression that must match the empty sequence once we have consumed the whole input sequence. An extended regular expression is called nullable if it can be reduced to match the empty sequence. Brzozowski shows that the input sequence is valid with respect to the initial extended regular expression if and only if the derivative of that expression with respect to the input sequence is nullable.

Sperberg-McQueen (Sperberg-McQueen05) discusses the application of Brzozowski derivatives to various problems of XML Schema processing, and offers a nice summary of some of the previous work in this area. Brzozowski derivatives are reported [1] to be used in James Clark's JING Validator and Sun's MSV validator, and I know it to be used in MarkLogic's validation support as well. There are no doubt others.

The essential idea in using Brzozowski derivatives for XML Schema content model validation is to observe that, with a few extensions to the model as shown by Sperberg-McQueen, XML Schema content models are extended regular expressions over the set of defined element names (plus one "every other name" symbol). When the derivative of a sequence of names from the input with respect to the declared content model is nullable, the input matches the content model. Given the unique particle attribution property in XML Schema, we can look up each symbol in the matching sequence, check simple type constraints as necessary, and complete validation.

JSON Schema Overview

The definitions of JSON Schema (JSONSchema) and JSON Schema Validation (JSValidation) are evolving and managed for the most part outside of the scope of any standards body (see https://json-schema.org/specification.html for the current state), although there are now IETF Internet-Drafts (cited above) that track the changes. Here I describe draft 07.

The conceptual framework behind JSON Schema differs greatly from the traditions coming from the XML community or the formal languages community and some of the vocabulary is used differently as well. A "schema" is a JSON value: either an object or boolean value. It defines a set of constraints against a JSON value, which may be an object (a map of name/value pairs), an array of JSON values, or a simple number, string, boolean, or null value. Certain properties of the schema object have defined meanings; undefined properties are irrelevant and ignored. Certain properties contain schemas as values. It is important to remember that the term "schema" does not mean the totality of the set of constraints defined in a particular schema document and its referenced schema documents but instead means the current set of constraints against a particular JSON value.

A schema defines a set of constraints through a set of properties and their values. Importantly, with a very few defined exceptions, the constraints apply independently. Certain properties are ignored for certain values. A JSON value is valid with respect to its schema when all the relevant properties are valid with respect to that value.

A few examples might help make this clear:

Figure 1

{
  "minimum": 12,
  "pattern": "a+"
}
      

A schema for a simple value

Figure 1 shows a simple JSON Schema. The minimum property requires that the JSON value being validated be a numeric value greater than twelve. The pattern property requires that the JSON value being validated be a string value consisting of a sequence of 'a's. Is this an unsatisfiable schema? No. If the value presented happens to be a number, it will be valid if it is more than twelve, because the pattern property will be ignored for numeric values. If the value presented happens to be a string, it will be valid if it consists of a sequence of 'a's because the minimum property will be ignored for string values. What is more, if the value presented happens to be a JSON object, it will also be valid, because both the minimum and pattern properties will be ignored!

We can fix the schema to disallow that last scenario by adding another property:

Figure 2

{
  "type": ["number","string"]
  "minimum": 12,
  "pattern": "a+"
}
      

A schema for an integer or string value

In Figure 2 we have a schema that will match either a number greater than twelve or a sequence of 'a's and nothing else.

Similar kinds of scenarios play out with constraints on JSON objects.

Figure 3

{
  "type": "object",
  "required": ["a","b","c"],
  "properties": {
     "a": true,
     "b": true,
     "c": true
  },
  "additionalProperties": false
}
      

A schema requiring an object with exactly three properties

The schema in Figure 3 needs all four[2] of the properties to ensure that the value is an object with exactly three properties named a, b, and c. The type property excludes other kinds of values such as numbers. The required property excludes objects that do not have all of the indicated properties. The properties property defines constraints for a certain set of named properties (those constraints being the boolean schema true which means anything is valid). And finally the additionalProperties property excludes objects that have other properties we haven't mentioned using a boolean schema of false (always invalid). There are no ordering constraints or occurrence constraints: objects are unordered maps in which keys can only occur once.

This is obviously a little more convoluted than just saying the content model is a & b & c. It is easy to imagine we might have a mapping that gets us such a content model, however, and we will see that in a moment. Nevertheless, the reader may be forgiven for suspecting that applying regular expression derivatives to validation here is a dubious enterprise. Hold on to that thought! One more example:

Figure 4

{
  "type": "object",
  "required": ["a","b"],
  "oneOf": [
    {
      "properties": {
        "a": { "type": "number", "minimum": 0 },
        "b": { "type": "number", "minimum": 0 },
        "c": { "type": "number" }
      },
      "additionalProperties": false
    },
    {
      "properties": {
        "a": { "type": "number", "maximum": 0 },
        "b": { "type": "number", "maximum": 0 },
        "d": { "type": "number" }
      },
      "additionalProperties": false
    }
  ]
}
      

A tricky schema

The schema in Figure 4 will successfully validate objects that have two numeric properties named a and b, where either both numbers are more than zero, or both numbers are less than zero, together with an additional numeric property named either c or d, depending on whether we are in the positive or negative case. We could map this to the "content model" (a & b & c) | (a & b & d) but verifying that the numeric constraints hold would require a separate check that essentially performs all the content model checking anyway. The reader now has probably concluded that regular expression derivatives for JSON Schema validation are doomed. Not so fast, reader! We will be adding some extensions to the Brzozowski model to rescue the situation.

Before we get to the extensions to the model required, let's enumerate the JSON Schema validation properties. The classification is not part of the JSON Schema specification: it is an informational organization that we will make use of in a minute.

Property name Relevance Meaning Classification
type all Data value kind: object, array, number, integer, string, boolean, null type (special)
format string Required format such as date-time (optional) type
enum all Set of possible values type
const all Fixed value type
minLength string Minimum length type
maxLength string Maximum length type
pattern string String matches regular expression type
multipleOf number Number is multiple of this type
minimum number Inclusive minimum type
minimumExclusive number Exclusive minimum type
maximum number Inclusive maximum type
maximumExclusive number Exclusive maximum type
minProperties object Minimum number of properties type
maxProperties object Maximum number of properties type
minItems array Minimum number of array items type
maxItems array Maximum number of array items type
uniqueItems array Are values in array unique? type
properties object Child properties composition
additionalProperties object Constraints on unenumerated properties composition
required object Names of required properties composition
items (array value) array Sequence of constraints for items in array composition
items (single value) array Constraint on every item in array composition
additionalItems array Constraints on additional unenumerated items composition
contains array At least one item is valid composition
anyOf all Valid against at least one subschema composition
oneOf all Valid against exactly one subschema composition
allOf all Valid against all subschemas composition
not all Valid if not valid against subschema composition
if,then,else all Conditional validation composition
propertyNames object Schema constraint on the names of properties type (special)
dependencies object Additional constraints if certain child properties are present composition (special)
$ref all References the schema via a path rather than in place special
$id all Sets a base URI for resolution of $ref properties special

Extending the Model

The approach taken to applying Brzozowski derivatives to JSON Schema validation is to map the compositional properties to extended regular expressions, where the symbols will be composites of the JSON property names tagged with a type identifier. A further mapping will gather up bundles of type properties in the schema; these bundles will get a type identifier. Call these bundles types and their constraints facets. We will show how we can define nullability for these extended expressions and derivatives of them with respect to JSON values. Some additional machinery and notation will need to be in play for unnamed values, either as top-level values or array items.

In sum, here is the program of work:

  1. Define a mapping from JSON Schema properties to extended regular expressions

  2. Define a mapping from JSON Schema properties to types and facets

  3. Define extension operators to account for all the composition properties

  4. Define extensions to tag property names with type identifiers in the expressions

  5. Define a type matching function

  6. Define extensions for type-only items (top level and array items)

  7. Define derivatives with respect to JSON values

  8. Define derivatives over extended expressions

  9. Define nullability over extended expressions

  10. Sip wine in celebration

Mapping

Mapping performs two operations: mapping type properties to faceted types, and mapping compositional properties to extended regular expressions. We will denote this with two separate functions, μ and μM respectively. To complete the picture, we will also need a function that provides the type with its name, and a lookup function that can obtain the facets for a given named type. I use the form f(A) => B to say that the definition of the result of applying function f to A is B.

Basic Notation

{} a schema with no constraints
true true schema, equivalent to a schema with no constraints
ε empty sequence (true match)[3]
empty set (impossible match)
false false schema, equivalent to empty set
n symbol (property name)
[condition] convenience notation: ε if condition is true; ∅ otherwise
p1:v1 a property:value pair
{p1:v1,...,pn:vn} map of property:value pairs, JSON object
[v1,...,vn] array of values, JSON array

Mapping function μ

μ(schema) maps schema to type facets {p1:v1,...,pn:vn}

Model mapping function μM

μM(schema) maps schema to extended regular expression E

Key function κ

κ({p1:v1,...,pn:vn}) maps set of type properties to key uniquely identifying type
κ(ε) => ε special case ε
κ(∅) => ∅ special case ∅

Lookup function λ

λ(key) maps type key back to type facets {p1:v1,...,pn:vn}
λ(κ(P)) = P identity: lookup of the key of type properties gets the properties back again
λ(ε) => ε special case ε
λ(∅) => ∅ special case ∅

The type mapping function is fairly straightforward, although capturing the interactions with the compositional properties gets a little messy in a functional paradigm. In the formulation here, the extended regular expressions from μM are expressed as the output for a particular property (model) in the output of μ.

For the most part type properties in the schema map directly to facets in the type. If there is a type property in the schema, its value will constrain which properties map to facets; if it is an array of values, we will treat this as creating a content model disjunction between the alternatives; and if it is absent we will treat this as creating a content model disjunction between all[4] alternatives (array, object, number, string, boolean, null). Each alternative will only gather properties relevant for that specific type value.

In what follows, P is a map of property-value pairs {p1:v1,...,pn:vn} where each property name is unique and the + operator performs a union of maps. The rules are generally written so that the set of property names is disjoint.[5] In addition, s1, s2, etc. refer to (sub)schemas.

The definition below shows the mapping, broken down for each of the distinct type values. Most properties can be mapped independently, although there are some exceptions. That is, unless specified otherwise: μ({p:v,P}) => μ({p:v})+μ(P) and similarly μM({p:v,P}) => μM({p,v})+μM(P). Where there is a rule for a specific combination of properties, that rule takes precedence. There are a few edges cases to take care of: boolean schemas and $ref dereferencing.

It will be useful to define some standard bundles of properties. The function relevant(P,kind) will return just the properties listed for the specific kind from the map of property-value pairs. For example, relevant({minLength:1,maximum:12},"string") is {minLength:1}:

Figure 5

Kind Properties selected by relevant(P,kind)
any enum, const, oneOf, anyOf, allOf, not, if, then, else
boolean Those for any
string minLength, maxLength, pattern and those for any
integer Same as number
number multipleOf, minimum, exclusiveMinimum, maximum, exclusiveMaximum and those for any
null Those for any
array minItems, maxItems, uniqueItems, items, additionalItems and those for any
object propertyNames, minProperties, maxProperties, properties, patternProperties, additionalProperties, required and those for any

Properties relevant to certain types

First, we define how some edge cases are handled:

Figure 6

s μ(s)
true true
false
{$ref:path} μ(dereference(path))

μ edge cases

The boolean schemas map to true or empty set as appropriate, and references map to the referenced definition. Specifying the details of how the JSON path gets dereferenced is beyond the scope of this paper. In what follows assume we have a fully dereferenced schema in hand.[6] We are also assuming the schema as a whole (including the references) is valid.

The type property is the main driver for the mapping. In general it gets mapped to a kind facet plus a simpleType facet for references to simple values and occasionally other facets as well. The type names have the same meaning as in XSD2. Here are the rules for mapping the type property:[7]

Figure 7

s μ(s)
{type:[t1,...,tn], P} μ({oneOf: [{type:t1,P},...,{type:tn,P}]})
{type:boolean, P} μ({type:boolean}) + μ(relevant(P,"boolean"))
{type:string, format:f1, P} μ({type:string, format:f1}) + μ(relevant(P,"string"))
{type:string, P} μ({type:string}) + μ(relevant(P,"string"))
{type:integer, P} μ({type:integer}) + μ(relevant(P,"integer"))
{type:number, P} μ({type:number}) + μ(relevant(P,"number"))
{type:null, P} μ({type:null}) + μ(relevant(P,"null"))
{type:array, P} μ({type:array}) + μ(relevant(P,"array"))
{type:object, P} μ({type:object}) + μ(relevant(P,"object"))
{type:boolean} {kind:boolean, simpleType:"xs:boolean", enum:[true,false]}
{type:string, format:date-time} {kind:string, simpleType:"xs:dateTime"}
{type:string} {kind:string, simpleType:"xs:string"}
{type:number} {kind:number, simpleType:"xs:numeric"}
{type:integer} {kind:number, simpleType:"xs:integer"}
{type:null} {kind:null}
{type:array} {kind:array}
{type:object} {kind:object}

μ mappings for type

The enum and const properties apply to all kinds of values:

Figure 8

s μ(s)
{enum:[s1,s2,...,sn]} {enum:[s1,s2,...,sn]}
{const:value} {fixed:value}

μ for const and enum

The string-specific properties are minLength, maxLength, and pattern

Figure 9

s μ(s)
{minLength:value} {minLength:value}
{maxLength:value} {maxLength:value}
{pattern:value} {pattern:value}

μ for string properties

Here are the numeric property mappings:

Figure 10

s μ(s)
{multipleOf:value} {multipleOf:value}
{minimum:value} {minimumInclusive:value}
{minimumExclusive:value} {minimumExclusive:value}
{maximum:value} {maximumInclusive:value}
{maximumExclusive:value} {maximumExclusive:value}

μ for numeric properties

Type properties for objects:[8]

Figure 11

s μ(s)
{minProperties:value} {minProperties:value}
{maxProperties:value} {maxProperties:value}
{propertyNames:s1} {propertyNames:μ(s1)}

μ for object properties

Array properties:

Figure 12

s μ(s)
{minItems:value} {minItems:value}
{maxItems:value} {maxItems:value}
{uniqueItems:value} {uniqueItems:value}

μ for array properties

Before we get to model mapping, we need to introduce some additional notation. The notation is similar to Sperberg-McQueen05, which is a little different from Brzozowski64, but with a few alterations and extensions. JSON Schema does not need general occurrence counts, and the wildcards required are relative to a set of property names, not namespaces.

Basic Extended Regular Expressions

ε empty string (universal match)
impossible match
n property name
n:v name/value pair
F model F
F@T F with type identifier T
Thus F = F@ε, a model F with no type identifier
F? optional F
F* one or more F
F,G F followed by G
F|G F or G (non exclusive)
F&G F and G in any order
!F F invalid
F⊕G F xor G
(F?G:H) conditional: if F valid then G otherwise H
<F,G> both F and G simultaneously on same input
wc(pattern) any key matching pattern
wc(-N,-P) any key that is not a name in the name set N and that does not match a pattern in the pattern set P
wc(-N,+P) any key that is not a name in the name set N and that does match a pattern in the pattern set P

The type marking, exclusive or, conditional, simultaneous, and wildcards are extensions. Negation is present in Brzozowski64 although with a slightly different notation.

It may seem that F&G and <F,G> mean the same thing, but the difference will become clearer when we get to the definition of derivatives, because F&G matches F and G against different segments of the input, whereas with <F,G> matches F and G against the same segment of the input simultaneously. For example, a@T & a@S can never be satisfied because each symbol is unique in a JSON object so there can only be one a matched, but <a@T, a@S> may well be satisfied if both type constraints are satisfied on the value of property a.

The mapping of the basic composition operators is fairly direct:

Figure 13

s μ(s)
{anyOf: [s1f,...,sn]} {model: μM({anyOf:[s1,...,sn])}
{oneOf: [s1,...,sn]} {model: μM({oneOf:[s1,...,sn])}
{allOf: [s1,...,sn]} {model: μM({allOf:[s1,...,sn])}
{not:s} {model: μM({not:s})}
{if:s1,then:s2,else:s3} {model: μM({if:s1,then:s2,else:s3})}

μ for basic composition properties

Figure 14

s μM(s)
{anyOf: [s1,s2,...,sn]} μM(s1)|μM(s2)|...|μM(sn)
{oneOf: [s1,s2,...,sn]} μM(s1)⊕μM(s2)⊕...⊕μM(sn)
{allOf: [s1,s2,...,sn]} <μM(s1),μM(s2),...,μM(sn)>
{not:s} !μM(s)
{if:s1,then:s2,else:s3} (μM(s1)?μM(s2):μM(s3))
{if:s1,then:s2} (μM(s1)?μM(s2):true)
{if:s1,else:s3} (μM(s1)?true:μM(s3))

μM for basic composition properties

Mapping of the object-related compositional properties requires dealing with the interactions and dependencies between them. The two main compositional properties are properties and patternProperties. Each of these consists of a set of keys and an associated schema for each key. In the case of properties the key is the property name to match in the instance; in the case of patternProperties the key is a regular expression that matches the name of the property in the instance. All properties named are optional unless their name appears in the required array. Properties otherwise unaccounted for must conform to the schema given for additionalProperties, which defaults to the true schema if absent.

To simplify things somewhat, let's first ensure that when the additionalProperties property was not specified, we treat it as an explicit true schema, that is, that all additional properties are OK. We also want to normalize the additionalProperties property when it is the false schema to remove it, and to normalize the properties and patternProperties properties to replace absence with an empty object, and the required property to replace absence with an empty array. Assume the following normalizations apply one after another:

Figure 15

Constraints on {p1:s1,...,pn:sn} μ({p1:s1,...,pn:sn})
∀pi|pi!=additionalProperties μ({p1:s1,...,pn:sn,additionalProperties:true})
{p1:s1,...,pn:sn}={additionalProperties:false,P} μ({P})
∀pi|pi!=properties μ({p1:s1,...,pn:sn,properties:{}})
∀pi|pi!=patternProperties μ({p1:s1,...,pn:sn,patternProperties:{}})
∀pi|pi!=required μ({p1:s1,...,pn:sn,required:[]}

Normalizations for interdependent object composition properties

This leaves us with two variations to map: the bundle of properties P with the properties named properties, patternProperties, and required, and the bundle of properties with those plus the property named additionalProperties as well. In either case, the mapping creates a model property with the property bundle mapped by μM: {model: μM(P)}.

Capturing the semantics of additionalProperties requires some trickiness, because if its value is not the false schema, then it applies to any properties that are not named in properties and that do not match a pattern in patternProperties. We capture this in the wildcard which takes those properties and patterns as its sets. We also capture the required property by marking appropriate properties as optional or not in the model. This is an optimization: we could instead have a simultaneous compositor where one branch handles the required properties and the other handles the type information. Both branches end up being relatively weak, in general.

The semantics of patternProperties is particularly tricky:

  1. Pattern properties do not count as additional properties.

  2. More than one pattern property can match a specific input property and both constraints need to be enforced.

  3. A pattern property may or may not match a specifically named property; if it does its constraint must be enforced.

  4. We may lack named properties (properties) but still have a constraint on "additional" properties (additionalProperties).

In general, every property listed is optional, unless it is specifically mentioned in the required array. To capture the semantics of required in an effective way is a little awkward to express algebraically. Note that it is perfectly acceptable in JSON Schema for a property to be in the required array but not mentioned by either the properties or patternProperties.

For example the following schema is perfectly OK, and will not match an object {"c":1,"dd":2} as it lacks the required a and b properties:

Figure 16

{
  "type": "object",
  "required": ["a","b"],
  "properties": {
    "c": { "type": "number" }
   },
  "patternProperties": {
    "d+": { "type": "number" }
  },
  "additionalProperties": true
}
        

Schema with required property that is not otherwise described

With all that said, the general rules for mapping the object-related composition properties are:

Figure 17

s μM(s)
{
  properties: {p1:s1,...pm:sm},
  patternProperties: {r1:pt1,...,ro:pto},
  additionalProperties:sp,
  required:[rq1,...,rqn]
}







    < 
      (q1@κ(t1) & ... & qk@κ(tk) & 
       ql@κ(ql)? & ... & qm@κ(tm)? &
       u1@κ(sp) & ... & uv@κ(sp) & 
       (wc(-[p1,...,pm],-[r1,...,ro])@κ(sp)|
        wc(-[p1,...,pm],+[r1,...,ro]))*
      ),
      (wc(r1)@κ(pt1) | wc(-[],-[r1]))*,
      (wc(r2)@κ(pt2) | wc(-[],-[r2]))*,
        ...
      (wc(rn)@κ(pto) | wc(-[],-[ro]))*
    >
  
{
  properties: {p1:s1,...pm:sm},
  patternProperties: {r1:pt1,...,ro:pto},
  required:[rq1,...,rqn]
}













    < 
      (q1@κ(t1) & ... & qk@κ(tk) & 
       ql@κ(ql)? & ... & qm@κ(tm)? &
       u1@κ(sp) & ... & uv@κ(sp) & 
       wc(-[p1,...,pm],+[r1,...,ro])*
      ),
      (wc(r1)@κ(pt1) | wc(-[],-[r1]))*,
      (wc(r2)@κ(pt2) | wc(-[],-[r2]))*,
        ...
      (wc(rn)@κ(pto) | wc(-[],-[ro]))*
    >

where
  l = k+1
  for i<=k, qi=pj and ti=sj and ∃rqh|pj=rqh
  for i>k, qi=pj and ti=sj and ∀rh|pj!=rqh
  for i<=n, ui=rqj where ∀ph|ph!=rqj
  

μM for object composition properties

This looks complicated, but all it is saying is that our mapped content model consists of a mix, in any order, of: properties that are named and required (q1 through qk), properties that are named but not required (ql through qm, properties that are not named but are nevertheless required (u1 through uv), a wildcard to pick up properties that are not named that match none of the patterns but satisfy the schema for additionalProperties, and a wildcard to pick up properties that are not named but do match one of the patterns. The wildcard clauses in the second and subsequent branches of this simultaneous model ensure that properties matching patterns satisfy their appropriate schema: everything matches the pattern and its corresponding schema, or doesn't match the pattern at all. Each pattern gets its own simultaneous branch because an input property may match more than one of them, in which case it would need to satisfy constraints from each match.

If patternProperties is empty, as is often the case, this simplifies a great deal. We can drop the wildcards that match against patterns and simplify the simultaneous model with one branch to just that branch:

Figure 18

s μM(s)
{
  properties: {p1:s1,...pm:sm},
  additionalProperties:sp,
  required:[rq1,...,rqn]
}


    (
     q1@κ(t1) & ... & qk@κ(tk) & 
     ql@κ(ql)? & ... & qm@κ(tm)? &
     u1@κ(sp) & ... & uv@κ(sp) & 
     wc(-[p1,...,pm],-[])@κ(sp)*
    )
  
{
  properties: {p1:s1,...pm:sm},
  required:[rq1,...,rqn]
}

  
    (
     q1@κ(t1) & ... & qk@κ(tk) & 
     ql@κ(ql)? & ... & qm@κ(tm)? &
     u1@κ(sp) & ... & uv@κ(sp)
    )
  

Simplified μM for object composition properties

Notice that we still get fairly weak content models here: where every property can occur in any order, optional unless specifically required, with all kinds of wildcards unless additionalProperties is explicitly false and there are no patternProperties. This has performance implications, as it creates larger and more complex derivatives. In JSON objects, every property name is unique, however. We can use this knowledge to strengthen the wildcards in the first branch by adding any required properties to the excluded list. (In the examples we will assume we do this.) Other optimizations are possible. We will return to this point later.

Mapping the array-relevant compositional properties follows a similar paradigm, but an additional class of extension is in order, because array members do not have any symbol associated with them, just values. However, given our type constraint scheme, we still have something to say about such things, we just need a little bit more notational mechanism:

Non-symbol extensions

•@T type-only, no symbol
• = •@ε no symbol, no type; i.e. value must exist

Given the type-only expressions, we are now in a position to map array properties. The items property in JSON Schema comes in two varieties with different semantics: as a single schema, or as an array of schemas. If it is a single schema, then that schema applies to every member of the array, and additionalItems is irrelevant. If it is an array, then each member of the array applies one to one to the values in the array in order, with any surplus values in the array matching the additionalItems schema, if any:

Figure 19

s μ(s)
{items:I,additionalItems:sa,contains:sc} {model: <μM({items:I,additionalItems:sa}),μM({contains:sc})>}
{items:I,additionalItems:sa} {model: μM({items:I,additionalItems:sa})}
{items:I,additionalItems:false} {model: μM({items:I})}
{items:I} {model: μM({items:I,additionalItems:true})}

μ for array composition properties

Figure 20

s μM(s)
{contains:sc} (•@ε*,•@κ(sc),•@ε*)
{items:s1,additionalItems:sa} (•@κ(s1)*)
{items:[s1,s2,...,sn], additionalItems:sa}
  (•@κ(s1),•@κ(s2),...,•@κ(sn),•@κ(sa)*|
   •@κ(s1),•@κ(s2),...,•@κ(sn)?|
   •@κ(s1),•@κ(s2),...,•@κ(sn-1)?|
   ...
   •@κ(s1),•@κ(s2)?|
   •@κ(s1)?
  )

  =

  ((•@κ(s1),
     (•@κ(s2),
        ...
        (•@κ(sn)?,•@κ(sa)*)
        ...
     )?
   )?
  )
  
{items:[s1,s2,...,sn]}
  (•@κ(s1),•@κ(s2),...,•@κ(sn)?|
   •@κ(s1),•@κ(s2),...,•@κ(sn-1)?|
   ...
   •@κ(s1),•@κ(s2)?|
   •@κ(s1)?
  ) 

  =

  ((•@κ(s1),
    (•@κ(s2),
      ...
      •@κ(sn)?
      ...
    )?
   )?
  )

μM for array composition properties

The reason for the complexity of the rules for items is this: just because the property defines type constraints for three slots, that doesn't mean a matching array must have three entries in it. That is handled by a different constraint property, the minItems property. We could apply it here to remove some of the optionality.

In other words, array constraint properties map to type-only expressions. The additionalItems property is only relevant if the items property enumerates a sequence of type constraints for each item in the array.

Mapping the dependencies property is a little confusing because its value is a JSON object where each value is either a subschema or an array of property names. The semantics are very different, and the dependencies property is likely to bifurcate in future drafts of JSON Schema. Each item in the JSON object maps separately, using a different rule depending on whether its value is a schema or an array.

Here are the rules:

Figure 21

s μ(s)
{dependencies: {p1:v1,p2:v2,...,pn:vn}}
{model: <
     μM({dependencies:{p1:v1}}),
     ...,
     μM({dependencies:{pn:vn}})
   >}

μ for dependencies

Figure 22

s μM(s)
dependencies:{p:S} ((p & wc(-[p],-[])*) ? wc(*)@κ(S) : true)
dependencies:{p:[q1,q2,...,qn]}
    ((p & wc(-[p],-[])*) ? 
     (q1 & ... & qn & wc(-[q1,...,qn],-[])*) : 
     true)
     

μM for dependencies

That is, if we have the property named p plus a bunch of other stuff not named p, the object must match either the given schema, or must include all the named properties (depending on which form was used).

It should also be noted that although at first blush we have made so many extensions to regular expressions that it seems likely that none of the supporting results would hold, in fact we have hardly extended anything substantively. Instead of the finite vocabulary consisting of the symbols in the schema we have a vocabulary consisting of the symbols crossed with the type identifiers plus the symbol. It is still a finite vocabulary.[9] The operators and functions we have added, being boolean operators, fit into the rules defined by Brzozowski. True, we have made the rules for matching more complex, but one could also think of them as just being a funny kind of equality. On the other side, while our input can consist of an arbitrary number of distinct values, they are all just annotations on that finite vocabulary. The infinitude of values will concern us when we get to caching, however.

Derivatives

Given a JSON Schema mapped in this way, we can compute derivatives of those expressions with respect to JSON values. Most of these rules come directly from Sperberg-McQueen05 or Brzozowski64 or can be derived by Brzozowski's observation that the derivative of any boolean operator over some operands is the operator over the derivatives of those operands. Algebraically: Δf(R)/Δa = f(ΔR/Δa) for any boolean function f. A key difference is that when we get to the leaves, we have to do type matching as well as just symbol matching. Let us introduce a little more notation:

Derivative function Δ

ΔX/Δy compute a new expression E, the derivative of expression X with respect to JSON value y

Nullability function ν

ν(model) produces ε if ε is in language of model; ∅ otherwise

Typematch function τ

τ(value,type) produces ε if value matches type constraints; ∅ otherwise

JSON values

{p1:v1,...,pn:vn} a JSON object with property/value pairs
[v1,...,vn] a JSON array with a sequence of values
n = n:• name with no particular value
v = •:v value with no name

Taking lowercase a and b as symbols and uppercase E, F, and G as expressions, the derivatives of basic expressions (without type matching) are as follows:

Figure 23

E ΔE/Δa
b [a=b]
ε
true true
E? ΔE/Δa | ν(E)
E* ΔE/Δa,E*
E,F (ΔE/Δa),F | ν(E),ΔF/Δa
E|F ΔE/Δa | ΔF/Δa
E&F (ΔE/Δa & F) | (E & ΔF/Δa)
!E !ΔE/Δa
E⊕F (ΔE/Δa) ⊕ (ΔF/Δa)
(E?F:G) (ΔE/Δa ? ΔF/Δa : ΔG/Δa)
<E,F> <ΔE/Δa, ΔF/Δa>
wc(pat) [matches(a,pattern)]
wc(-[p1,...,pn],-[r1,...,rn]) [∀pj:a!=pj ∧ ∀rj:!matches(a,rj)]
wc(-[p1,...,pn],+[r1,...,rn]) [∀pj:a!=pj ∧ ∃rj:matches(a,rj)]

Basic derivative rules

To take the derivative of an expression with respect to a JSON object, first note the edge case for an empty sequence is defined so that:

ΔE/Δε => E
    
Then observe that for a sequence of symbols u followed by a symbol a:
ΔE/Δ(u,a) => (ΔE/Δu)/Δa
        
or, unrolling an entire sequence:
ΔE/Δ(p1,p2,...,pn) => (...(((ΔE/Δp1)/Δp2)/...)/Δpn)
        

Given the set of properties of an object p1, p2, etc., then the derivative of the set of properties is the union of the derivative of each possible sequence starting with some pj:

Figure 24

Derivative expression Derivative result
ΔE/Δ•:{p1:v1,p2:v2,...,pn:vn}
  (ΔE/Δp1:v1)/Δ•:{p2:v2,p3:v3,...,pn:vn} | 
  (ΔE/Δp2:v2)/Δ•:{p1:v1,p3:v3,...,pn:vn} |
  ... | 
  (ΔE/Δpn:vn)/Δ•:{p1:v1,p2:v2,...,pn-1:vn-1}
        

Derivative with respect to a whole object

Adding type matching to the picture doesn't change the rules much although they look a little more complicated. All the rules are the same, just carrying through the type annotation on the expressions, except for the leaf cases where we need to perform a type check as well as a symbol matching:[10]

Figure 25

Derivative expression Derivative result
Δb@T/Δa:v [a=b],τ(v,λ(T))
Δwc(pat)@T/Δa:v [matches(a,pattern)],τ(v,λ(T))
Δwc(-[p1,...,pn],-[r1,...,rn])@T/Δa:v [∀pj:a!=pj ∧ ∀rj:!matches(a,rj)],τ(v,λ(T))
Δwc(-[p1,...,pn],+[r1,...,rn])@T/Δa:v [∀pj:a!=pj ∧ ∃rj:matches(a,rj)],τ(v,λ(T))
ΔE@T/Δ•:{p1:v1,p2:v2,...,pn:vn}
  τ({p1:v1,p2:v2,...,pn:vn},λ(T)),
  ((ΔE/Δp1:v1)/Δ•:{p2:v2,p3:v3,...,pn:vn} | 
   (ΔE/Δp2:v2)/Δ•:{p1:v1,p3:v3,...,pn:vn} | 
    ... |
   (ΔE/Δpn:vn)/Δ•:{p1:v1,p2:p2,...,pn-1:vn-1}
  )

Derivatives with type matching

Now we have type-aware derivatives for expressions over symbols, we still need to account for arrays and top level values that have no symbols and compute derivatives over these type-only expressions. Arrays unroll like sequences:

Figure 26

Derivative expression Derivative result
ΔE@T/Δ•:[v1,v2,...vn] τ([v1,v2,...,vn],λ(T)), ΔE/Δ(•:v1,•:v2,...,•:vn)
ΔE/Δ(•:v1,•:v2,...,•:vn) (...(((ΔE/Δ•:v1)/Δ•:v2)/...)/Δ•:vn)
Δ(•@T)/Δ•:v τ(v,λ(T))
Δ•/Δb
Δa/Δ•
Δwc(pat)/Δ•
Δwc(-[p1,...,pn],-[r1,...,rn])/Δ•
Δwc(-[p1,...,pn],+[r1,...,rn])/Δ•
Δ•/Δ• ε
Δε/Δ•

Derivatives for arrays and nameless values

What this says is that derivative of an array can be treated mostly as a sequence of non-symbol values, although we need to consider the type constraints on the array as a whole. A symbol cannot match a non-symbol, so derivatives of that kind are empty. A derivative of a non-symbol with respect to a non-symbol value comes down to the type matching function against that value.

Nullability

The nullability function decides whether an expression can be matched against nothing. When the derivative of an expression with respect to the input is not nullable, that means that either it would need additional input in order to match or it has already run down a blind alley and there is no way to match what is there. Therefore the data cannot be valid with respect to that expression. Conversely, when the derivative is nullable, it means that the input has been successfully matched by the regular expression (or consumed by the state machine, if you prefer) and no additional input is necessary for a successful match.

The rules for the nullability function are as follows:

Figure 27

E ν(E)
ε ε
true ε
a
E* ε
E,F <ν(E), ν(F)>
<E,F> <ν(E), ν(F)>
E|F ν(E) | ν(F)
!E ε if ν(E) = ∅
!E ∅ if ν(E) = ε
F&G <ν(F), ν(G)>
F?G:H (ν(F)?ν(G):ν(H))
F? ε
F⊕G ν(F) ⊕ ν(G) = (ν(F)?ν(!G):ν(G))
wc(pat)
wc(N,P)
F@T ν(F)

Rules for nullability function ν

Type Matching

Type matching checks the type constraints of a type T against a particular value v. It returns ε if the value matches the type constraints of the type and otherwise.

Figure 28

τ(v,T) expression τ(v,T) result
τ(•,T) ε
τ(v,T) ε if ∀f∈facets(T) τ(v,f)=ε, otherwise ∅
τ(v,ε) ε
τ(v,∅)

τ basic rules

Type-matching evaluates the various facets according to their semantics, something like:

Figure 29

f τ(v,f)
{kind:x} [v isa x]
{simpletype:x} [v instanceof x]
{enum:[x1,x2,...,xn]} [∃xi|v=x1]
{const:x} [v=x]
{minLength:x} [!(v isa string) ∨ string-length(v) >= v]
{maxLength:x} [!(v isa string) ∨ string-length(v) <= v]
{pattern:x} [!(v isa string) ∨ matches(v,x)]
{multipleOf:x} !(v isa number) ∨ (v mod x)=0]
{minimumInclusive:x} [!(v isa number) ∨ v >= x]
{minimumExclusive:x} [!(v isa number) ∨ v > x]
{maximumInclusive:x} [!(v isa number) ∨ v <= x]
{maximumExclusive:x} [!(v isa number) ∨ v < x]
{minProperties:x} [!(v isa object) ∨ count(keys(v)) >= x]
{maxProperties:x} [!(v isa object) ∨ count(keys(v)) <= x]
{propertyNames:x} [!(v isa object) ∨ ∀k∈keys(v)|τ(k,x)]
{minItems:x} [!(v isa array) ∨ count(items(v)) >= x]
{maxItems:x} [!(v isa array) ∨ count(items(v)) <= x]
{uniqueItems:x} [!(v isa array) ∨ !x ∨ count(items(v))=count(distinct(items(v)))]

τ for type facets

There is an additional rule for the τ function that handles subschemas. This will make more sense after we have looked at some examples:

Figure 30

f τ(v,f)
{model:E} ν(ΔE/Δ•:v)

τ over model facet

Given this rule, we can observe that a JSON value is valid with respect to a JSON schema if and only if the type matching function for the mapped JSON schema is ε.

Relations and Simplifications

There are a number of relations and simplifications of the derivative expressions that come in useful in computing derivatives and keeping the size of the derivatives in reasonable bounds.

First, some of the operators are associative:[11]

Figure 31

(E|(F|G)) = ((E|F)|G) = (E|F|G)
(E,(F,G)) = ((E,F),G) = (E,F,G)
(E&(F&G)) = ((E&F)&G) = (E&F&G)
(F⊕G)⊕H = F⊕(G⊕H) = F⊕G⊕H
<<F,G>,H> = <F,<G,H>> = <F,G,H>
        

Derivative associativity

Some of the operators are also commutative:

Figure 32

E|F = F|E
E&F = F&E
E⊕F = F⊕E
<E,F> = <F,E>
        

Derivative commutativity

There are straightforward simplifications involving ε, , and redundant expressions. We will make use of these, generally without comment, in the examples.

Figure 33

E|E = E
ε&E = E&ε = E
∅&E = E&∅ = ∅
∅|E = E|∅ = E
ε,E = E,ε = E
∅,E = E,∅ = ∅

E⊕E = ∅
E⊕!E = ε
∅⊕E = E⊕∅ = E

(∅?F:G) = G
(ε?F:G) = F

<E,E> = E
<E,!E> = ∅
<ε,E> = <E,ε> = E
<∅,E> = <E,∅> = ∅
        

Derivative simplification identities

These identities mean that, given a function or operator f that can only take the values of ε or , such as the type matching function or the nullability function

Figure 34

<f,E> = f,E
        

Boolean function equivalence

We have implicitly made use of this relation in the definitions of some of the derivatives, above.

Negation behaves as expected with respect to boolean operators:

Figure 35

!(!E) = E
!<E,F> = !E|!F
!(E|F) = <!E,!F>
!(E⊕F) = <!E,!F>|<E,F>
        

Derivative negation identities

Finally, certain of the extension operators can be defined in terms of the others. This can be useful for proving some of these derivations, but it also gives us some comfort that we are not actually enhancing extended regular expressions beyond recognition:

Figure 36

F⊕G = <F|G,!F|!G> 
(F?G:H) = <F,G>|<!F,H>
        

Relations between extension operators

JSON Schema Validation

Again, following Sperberg-McQueen05 we can use derivatives for validation. Given a JSON value v and a JSON schema mapped to a type T, where T is a member of a set of mapped types accessible via λ: v is valid with respect to T if and only if the type check τ(v,T) returns ε. If the mapping includes a mapped {model:E} then calculation of τ will compute the nullability of the derivative of E with respect to the nameless value: ν(ΔE/Δ•:v) per the type matching rule given previously.

The overall procedure is to compute the mapping of the schema. This will give us a type whose identifier (given by κ) will be the type to match. Where the mapping produces a model, computing the nullability of the derivative of this model with respect to the nameless JSON value in hand tells us whether that value is valid with respect to the schema.

Some examples might help:

Examples

Simple Type

Let's start with a very simple case. Is the value 47 valid with respect to this schema?

Figure 37

{ type:"number" }
           

Trivial schema

We would certainly expect it to be. Let's walk through the mechanics.

The mapping of the schema gives us just:[12]

Figure 38

T0: { kind:number, simpleType:"xs:numeric" }
         

Mapping of trivial schema

We need to compute the type matching function, which is simple enough (here = shows the chain of derivations):

Figure 39

τ(47,λ(T0)) = 
  τ(47,{kind:number, simpleType:"xs:numeric"}) = 
  [47 isa number],[47 instanceof xs:numeric] = ε
           

Evaluation of τ

Since the result is ε, the value is valid with respect to the schema.

More Complicated Example

Let's take a more complicated example, the schema from Figure 4, reproduced here:

Figure 40

{
  "type": "object",
  "required": ["a","b"],
  "oneOf": [
    {
      "properties": {
        "a": { "type": "number", "minimum": 0 },
        "b": { "type": "number", "minimum": 0 },
        "c": { "type": "number" }
      },
      "additionalProperties": false
    },
    {
      "properties": {
        "a": { "type": "number", "maximum": 0 },
        "b": { "type": "number", "maximum": 0 },
        "d": { "type": "number" }
      },
      "additionalProperties": false
    }
  ]
}
           

A tricky schema

Mapping gives us:

Figure 41

T0: {
  kind:object,
  model: 
    <
     (a@T1? & b@T1? & c@T2?) ⊕ (a@T3? & b@T3? & d@T2?),
     (a & b & wc(-[a,b],[])*)
    >
}
T1: {
  kind:number,
  simpleType:"xs:numeric",
  minimumInclusive:0
}
T2: {
  kind:number,
  simpleType:"xs:numeric"
}
T3: {
  kind:number,
  simpleType:"xs:numeric",
  maximumInclusive:0
}
           

Mapping of tricky schema

We will henceforth abbreviate wc(-[a,b],[]) as just wc().

Let's see if this object is valid:

Figure 42

{
  a: 1,
  b: -1,
  c: 2
}
           

JSON object to validate against tricky schema

To do that, we need to compute the type function:

Figure 43

τ({a:1,b:-1,c:2}, λ(T0)) = 
  τ({a:1,b:-1,c:2}, 
    {
      kind:object,
      model:
        <
          (a@T1? & b@T1? & c@T2?) ⊕
            (a@T3? & b@T3? & d@T2?),
          (a & b & wc()*)
        >
  }) =
  ν(Δ<
    (a@T1? & b@T1? & c@T2?) ⊕ (a@T3? & b@T3? & d@T2?),
    (a & b & wc()*)
    >/Δ•:{a:1,b:-1,c:2}
  )
           

Computing τ

which is to say, determine the nullability of the derivative of the model on the top level type.

To do that, we need to compute the following derivative:

Figure 44

Δ<
  (a@T1? & b@T1? & c@T2?) ⊕ (a@T3? & b@T3? & d@T2?),
  (a & b & wc()*)
>/Δ•:{a:1,b:-1,c:2}
           

Derivative to be computed

This expression is equal to:

Figure 45

(
  Δ<
    (a@T1? & b@T1? & c@T2?) ⊕ (a@T3? & b@T3? & d@T2?), 
    (a & b & wc()*)
  >/Δa:1
)/Δ•:{b:-1,c:2}
           

Derivative with respect to a:1

Let's unroll that a bit:

Figure 46

(
  Δ<
    (a@T1? & b@T1? & c@T2?) ⊕ (a@T3? & b@T3? & d@T2?),
    (a & b & wc()*)
   >/Δa:1
)/Δ•:{b:-1,c:2} =

(
  <
    Δ(
      (a@T1? & b@T1? & c@T2?) ⊕
        (a@T3? & b@T3? & d@T2?)
    )/Δa:1, 
    Δ(a & b & wc()*)/Δa:1
  >
)/Δ•:{b:-1,c:2} =    

(
  <
    Δ(a@T1? & b@T1? & c@T2?)/Δa:1 ⊕ 
      Δ(a@T3? & b@T3? & d@T2?)/Δa:1, 
    Δ(a & b & wc()*)/Δa:1
  >
)/Δ•:{b:-1,c:2} =    
           

Moving a:1 inside simultaneous and exclusive or

Let's consider the left side of the exclusive or (⊕):

Figure 47

Δ(a@T1? & b@T1? & c@T2?)/Δa:1 =
  Δ(a@T1?)/Δa:1 & b@T1? & c@T2? |
  a@T1? & Δ(b@T1?)/Δa:1 & c@T2? |
  a@T1? & b@T1? & Δ(c@T2?)/Δa:1
          

Moving a:1 inside intermix

Evaluating the first clause gives us ε because the symbol and type match:

Figure 48

Δ(a@T1?)/Δa:1 = Δa@T1/Δa:1 | ν(a@T1) = 
  [a=a],τ(1,λ(T1)) | ν(a@T1) = 
  τ(1,
    {
      kind:number,
      simpleType:"xs:numeric",
      minimumInclusive:0
    }
   ) = ε
          

Computing derivative of a@T1?

Therefore:

Figure 49

        
Δ(a@T1?)/Δa:1 & b@T1? & c@T2? = b@T1? & c@T2?
          

Derivative of first full intermix clause

So far so good! What about the other alteratives? These don't match (leading to a derivative of ) because b and c do not match a and are not nullable.

Since we have

Figure 50

Δ(b@T1?)/Δa:1 = Δb@T1/Δa:1 | ν(b@T1) = ∅ | ∅ = ∅
Δ(c@T2?)/Δa:1 = Δc@T2/Δa:1 | ν(c@T2) = ∅ | ∅ = ∅
          

Computing derivatives of b@T1? and c@T2?

so

Figure 51

a@T1? & Δ(b@T1?)/Δa:1 & c@T2? = a@T1? & ∅ & c@T2? = ∅
a@T1? & b@T1? & Δ(c@T2?)/Δa:1 = a@T1? & b@T1? & ∅ = ∅
          

Derivative of other full intermix clauses

each alternative reduces to and we are still left with

Figure 52

Δ(a@T1? & b@T1? & c@T2?)/Δa:1 = b@T1? & c@T2?
          

Derivative of first clause of exclusive or

On other clause of the exclusive or (⊕) we have

Figure 53

Δ(a@T3? & b@T3? & d@T2?)/Δa:1 =
  Δ(a@T3?)/Δa:1 & b@T3? & d@T2?) |
  a@T3? & Δ(b@T3?)/Δa:1 & d@T2? |
  a@T3? & b@T3? & Δ(c@T2?)/Δa:1
          

Moving a:1 inside intermix

Here the first alternative fails, because the type check fails:

Figure 54

Δ(a@T3?)/Δa:1 =
  Δa@T3/Δa:1 | ν(a@T3) =
  [a=a],
  τ(1,
    {
      kind:number,
      simpleType:"xs:numeric",
      maximumInclusive:0
    }
  )
          

Computing derivative of a@T3?

which is all just .

The other alternatives don't fare any better, for reasons similar to the case on the other side (b and d do not match a and are not nullable):

Figure 55

Δ(b@T3?)/Δa:1 = Δb@T3/Δa:1 | ν(b@T3) = ∅
Δ(d@T2?)/Δa:1 = Δ(d@T2)/Δa:1 | ν(d@T2) = ∅
          

Computing derivatives of b@T3? and d@T2?

so each of those alternatives nets out to .

Putting this all together, we have got to

Figure 56

(
  <
    (b@T1? & c@T2?) ⊕ ∅, 
    Δ(a & b & wc()*)/Δa:1
  >
)/Δ•:{b:-1,c:2}
          

Summary: derivative of exclusive or clause in context

The clause coming from the required property is a little easier since it doesn't involve any types:

Figure 57

Δ(a & b & wc()*)/Δa:1 = 
  (Δa/Δa:1 & b & wc()* |
   a & Δb/Δa:1 & wc()* |
   a & b & Δwc()*/Δa:1) =
  b & wc()*
          

Pushing a:1 inside intermix

because Δb/Δa:1 is (eliminating the second alternative) and

Figure 58

Δwc()*/Δa:1 = Δwc()/Δa:1,wc()* = ∅,wc()* = ∅

Derivative of wildcard

because, recall, this wildcard excludes a (eliminating the third).

Whew! We've consumed the first property. On to b!

Figure 59

<(b@T1? & c@T2?) ⊕ ∅, b & wc()*>/Δ•:{b:-1,c:2} =
  <
   Δ(b@T1? & c@T2?)/Δb:-1 ⊕ Δ∅/Δb:-1, 
   Δ(b & wc()*)/Δb:-1
  >/Δc:2
          

Pushing b:-1 inside simultaneous and exclusive or

Type matching b fails. We have

Figure 60

Δ(b@T1? & c@T2?)/Δb:-1 =
  (Δ(b@T1?)/Δb:-1 & c@T2?) | (b@T1? & Δ(c@T2?)/Δb:-1) = 
  ([b=b],τ(-1,λ(T1)) & c@T2?) | (b@T1? & ∅)
          

Pushing b:-1 into intermix

since

Figure 61

τ(-1,
  {
    kind:number,
    simpleType:"xs:numeric",
    minimumInclusive:0
  }
) = ∅
          

Computing τ of -1

and

Figure 62

Δ∅/Δb:-1 = ∅
          

Derivative of with respect to b:-1

we end up with

Figure 63

Δ(b@T1? & c@T2?)/Δb:-1 ⊕ Δ∅/Δb:-1 = ∅ ⊕ ∅
          

Summary: derivative of exclusive or

On the other half of simultaneous operator we have:

Figure 64

Δ(b & wc()*)/Δb:-1 = 
  Δb/Δb:-1 & wc()* | b & Δwc()*/Δb:-1 =
  wc()* | b & (Δwc()/Δb:-1,wc()*) =
  wc()* | b & (∅,wc()*) =
  wc()* | b & ∅ = 
  wc()*
          

Derivative of intermix with respect to b:-1

We're down to the final symbol, which is where things fail entirely because the exclusive or fails, dragging everything else down with it:

Figure 65

Δ<∅ ⊕ ∅, wc()*>/Δc:2 = 
  Δ<∅, wc()*>/Δc:2 =
  Δ∅/Δc:2 = 
  ∅
          

Derivative with respect to c:2

Since is not nullable, the object is not valid with respect to the schema.

Array Example

Let's look an array example to show how arrays unroll.

Is the array

Figure 66

[1,"a","b"]
          

JSON array

valid with respect to this schema?

Figure 67

{
  type:"array",
  items:[{type:"number"},{type:"string"}],
  additionalItems:false
}
          

Array schema

The mapped schema:

Figure 68

T0: {
  kind:array,
  model: •@T1,•@T2
}
T1: {
  kind:number, simpleType:"xs:numeric"
}
T2: {
  kind:string, simpleType:"xs:string"
}
          

Mapped array schema

We need to compute:

Figure 69

τ([1,"a","b"],λ(T0)) =
  τ([1,"a","b"],{kind:array}),τ([1,"a","b"], {model: •@T1,•@T2})
          

Evaluation of τ

The kind check passes because the value is an array, leaving the derivative we need to compute being

Figure 70

Δ(•@T1,•@T2)/Δ•:[1,"a","b"] =
  ((Δ(•@T1,•@T2)/Δ•:1)/Δ•:"a")/Δ•:"b"
          

Unrolling •:[1,"a","b"]

Starting with the first item:

Figure 71

Δ(•@T1,•@T2)/Δ•:1 = 
  Δ(•@T1)/Δ•:1,•@T2 | ν(•@T1),Δ(•@T2)/Δ•:1 =
  τ(1,λ(T1)),•@T2 =
  τ(1,{kind:number, simpleType:"xs:numeric"}),•@T2 =
  •@T2
          

Derivative with respect to •:1

The second item ("a") also works:

Figure 72

Δ(•@T2)/Δ•:"a" =
  τ("a",{kind:string, simpleType:"xs:string"}) = ε
         

Derivative with respect to •:"a"

But things run aground on the third item ("b") because we run out of expressions to match against the value:

Figure 73

Δε/Δ•:"b" = ∅
          

Derivative with respect to •:"b"

Here we have a non-nullable derivative, and the array is not valid with respect to the schema.

Nested Object Example

Finally, let's look at an example where we have nested objects.

Here is the schema:

Figure 74

{
  type: "object",
  properties: {
    "a": {
       type: "object",
       properties: {
         "b": { type: "integer" }
       },
       additionalProperties: false
     }
  },
  additionalProperties: false
}
          

Nested object schema

This maps to:

Figure 75

TO = { kind:object, model: a@T1? }
T1 = { kind:object, model: b@T2? }
T2 = { kind:number, simpleType:"xs:integer" }
          

Mapping of nested object schema

Let's validate the JSON object

Figure 76

{
  a: { c: false }
}
          

JSON object with nesting

In the familiar way, we begin with the type matching of the top-level type

Figure 77

τ({a:{c:false}}, λ(T0)) = 
  τ({a:{c:false}}, {kind:object}),
  τ({a:{c:false}}, {model: a@T1?})
          

Evaluation of τ

which nets out to determining the nullability of the following derivative:

Figure 78

Δ(a@T1?)/Δa:{c:false}
          

Derivative to be computed

Resolving this:

Figure 79

Δ(a@T1?)/Δa:{c:false} = 
  Δ(a@T1)/Δa:{c:false} | ν(a@T1) =
  [a=a],τ({c:false},λ(T1)) =
  τ({c:false},λ(T1))
          

Computing derivative of a@T1?

Here is where things get interesting, and we move down to the derivative of the subschema:

Figure 80

τ({c:false},λ(T1)) = 
  τ({c:false},{kind:object}),
  τ({c:false}, {model: b@T2?}) =
  ν(Δ(b@T2?)/Δc:false)
          

Recursion to subschema

The derivative for the nested object fails:

Figure 81

Δ(b@T2?)/Δc:false = Δb@T2/Δc:false | ν(b@T2) 
          

Derivative of b@T2?

because both those alternatives net out to .

Therefore

Figure 82

τ({c:false},λ(T1)) = ∅
          

Computing τ

making the derivative

Figure 83

Δ(a@T1?)/Δa:{c:false} = ∅
          

Popping the recursion

Since this is non-nullable, it makes the whole of τ({a:{c:false}},λ(T0)) equal to , so the object is not valid with respect to the schema.

Implementation Considerations

The rules given above for mapping object-related properties produce very weak expressions, which require a lot of work to resolve. We could do slightly better if we recall that properties are unique in a JSON object and assume we can always access those properties in a particular order, say, in codepoint order. Then we can use the sequence operator instead of the intermix operator when we don't have wildcards. We could also use the sequence operator and just insert wildcards everywhere, but that is not better than just using intermix. Because JSON property names are unique within a given object, it is easy enough to prune the branches, especially if we recognize the pattern:

Figure 84

Δ(p1@T1? & p2@T2? & .. & pn@TN? & wc()@S*)/Δx:v = 
  (Δp1@T1?/Δx:v & p2@T2? & ... & pn@TN? & wc()@S*) |
  (p1@T1? & Δp2@T2?/Δx:v & ... & pn@TN? & wc()@S*) |
  ...
  (p1@T1? & p2@T2? & ... & Δpn@TN?/Δx:v & wc()@S*) |
  (p1@T1? & p2@T2? & ... & pn@TN? & Δwc()@S*/Δx:v)
        

A common pattern arising from JSON objects

Furthermore, if x equals any pj we know this can only be nullable if we pick the branch where we take the derivative of that property, because the derivative of any other pi or of the wildcard will be (the wildcard excludes all property names mapped from properties). That is:

Figure 85

Δ(
  p1@T1? & p2@T2? & ... & pn@Tn? & wc(-[p1,...,pn])@S*
)/Δx:v = 
  ([∃pj|pj=x] ?
   (τ(v,Tj),
     (p1@T1? & ... & pj-1@Tj-1? & pj+1@Tj+1? & ... 
       & pn@Tn? & wc()@S*)
   ) :
   (τ(v,S),
     (p1@T1? & p2@T2? & ... 
       & pn@Tn? & wc(-[p1,...,pn])@S*)
   )
  )
        

Simplifying a common pattern

and in fact if we iterate over the whole object:

Figure 86

Δ(
  p1@T1? & p2@T2? & ... & pn@Tn? & wc([-pi])@S*
)/Δ{q1:v1,q2:v2,...,qm:vm} =
  τ(q1,S1),τ(q2,S2),...,τ(qm,Sm),
  (r1@V1? & r2@V2? & ... & wc(-[p1,...pn])@S*)

where Sj=([∃i|pi=pj]?Ti:S) and 
the rk@Vk are the pi@Si where ∄j|pi=qj
        

Full simplification of a common pattern

It is possible to create yet another extension operator for this case and handle it specially as a performance optimization.

To make this whole scheme worthwhile, we want to amortize the cost of computing derivatives, both within a particular validation episode and across validation episodes. Indeed, the major theme of Brzozowski's paper is that it can be used to construct a finite state automaton for matching the complete extended regular expression. Having constructed such a thing, it can be used again and again. We choose to be slightly more incremental here, and only compute the parts of the machine we actually need, as we go. Experience with this technique in XML Schema has shown that this is more cost effective, as there can be large parts of the state machine that never get accessed by actual data, especially with large schemas.

If we don't worry about type matching, it is pretty straightforward to create a cache of derivative expressions keyed off an expression ID and a symbol.[13] The expression ID needs to be computed by memoizing it, lest we keep storing the same derivative computation over and over again. That is, from the state machine point of view, we don't want to create lots of redundant states. It is reasonably straightforward to canonicalize expressions for memoization.

It is helpful, in practice, to prune the computed derivatives as much as possible, precomputing any of the predicate functions (such as [a=x] or ν which lead to ε and values) that can be used to prune further. Here is where the type-matching causes us trouble: we could compute this and prune the derivative, but not if we wish to cache the result, because the cache is keyed only off the symbol, and τ evaluation depends on the value as well.

What is to be done?

There are three basic options:

  1. Don't cache derivatives at all. We can then prune based on τ results without constraint.

  2. Include the value in the cache key, so that we cache the appropriate derivative for the symbol:value pair.

  3. Cache "partial" derivatives: derivatives where all τ calls are replaced with a marker to be evaluated using actual value presented. When a derivative is pulled from the cache, if it has any unresolved markers, evaluate them and return the pruned result.

Not caching means a lot of work calculating answers: we'd probably be better off just using a more direct algorithm and not using derivatives at all. The whole point of this technique was to compute the finite state machine (here implemented in pieces via the cache). Including values in the cache key means that the derivative cache will have a lot of duplicative entries and looks like an unpromising approach, especially when you remember that some of these values are whole JSON objects and arrays. We essentially sacrifice the finiteness of states, as a practical matter. A more promising variation is to cache partial derivatives and add a secondary cache for the unresolved τ calculations. The τ cache only needs to store one bit of information for each type plus value, and we can make different decisions about the sizing and lifespan of entries for the two caches. The partial derivative cache ( cache) can be readily shared across validation episodes, while the τ cache could live for only a single episode. We can tune whether the τ cache caches objects and arrays at all, or if so, how large they might be before we don't bother.

Performance

Measure, because it is never what you think

This test performs a simple validation on a small document (~3K) with up to five levels of nesting, a Medline citation, transcribed directly from its XML representation. The JSON Schema imposes structural constraints (required properties, and sub-properties) as well as some value and simple type constraints. The document is repeatedly validated in a single-threaded loop under various regimes. As a baseline, an XML Schema validation of the XML version of the document is also tested (also using derivatives and the same baseline simple type facet testing). The documents are valid against their respective schemas and both documents and schema files are pulled from a pre-warmed cache. The table shows both the validations per second for the first validation, which for some of the regimes is the one that does most of the work, and for subsequent validations.

Format Cached schemas? Derivatives? ∂ cache? τ cache? V/s (non-initial) V/s (initial)
XSD yes yes N/A N/A 15088 678
JSON Schema yes yes no yes 12618 610
JSON Schema yes yes yes yes 11380 415
JSON Schema yes no N/A N/A 9419 1071
JSON Schema no no N/A N/A 2375 1720
JSON Schema no yes no yes 650 617
JSON Schema no yes yes yes 500 461
JSON Schema yes yes no no 220 203
JSON Schema yes yes yes no 207 617
JSON Schema no yes no no 201 210
JSON Schema no yes yes no 178 178

The variance in these numbers is quite high, although the relative ordering is generally the same.

What is perhaps most interesting here are the negative results:

  • The simple recursive walk algorithm with no caching is not greatly worse than any derivative algorithm, regardless of caching strategy. Given its substantially better initialization cost compared to the derivative algorithms, this suggests that where the same schema is not substantially reused, the derivative algorithms aren't worth the bother. This is despite the fact that this test is something of a best case scenario for the caching in the derivative algorithms.

  • Caching partial derivatives doesn't help performance in this scenario: hurts it slightly, in fact.

I believe the answer to the first negative result is that a lot of the work, at least in this schema, is located in the τ checking, not in the property relation constraints, and that the models are simplistic enough that the recursive walk doesn't have to do much beyond looking at the current set of properties, so the work of computing, constructing, and caching derivatives doesn't really pay.

As for the partial derivatives, I think what is going on here is (1) the τ cache is in play regardless, and in this scenario with the same document repeated over and over, it provides a strong benefit and (2) resolving partial derivatives requires construction of more objects representing the resolved derivative, and the top line on performance here is time spent in allocation from the heap. We can see the strong impact of the τ cache by seeing the two orders of magnitude hit to performance when we turn it off. This scenario is clearly a best-case one for the τ cache, as we can see from these numbers.

The strongest performance win here is from simply having compiled schema objects cached, which gives about a 5 times gain for the recursive walk algorithm, and a 20 times gain for the derivative algorithm. The usefulness of the derivative algorithm is much stronger for cached schemas, unsurprisingly, as the work of computing the derivatives can be amortized over multiple validation episodes.

Similar results obtain from running multiple validations in parallel. This test is a validate-on-load one, where 18 thousand Medline citations are loaded into a database after being validated, using 20 threads. Here the performance is overall lower and overall more similar, probably because the update operations dominate performance overall. Here the data is more varied, so the impact of the τ cache is substantially reduced. This test doesn't separate out the costs of the initial use of the schema, and in fact, in effect, each client thread may end up doing some of the initialization work redundantly in parallel, which impacts the amortization.

Format Cached schemas? Derivatives? ∂ cache? τ cache? V/s
JSON Schema yes no N/A N/A 868
JSON Schema yes yes no yes 701
JSON Schema yes yes no no 675
JSON Schema yes yes yes yes 608
JSON Schema yes yes yes no 588

Again we see the same fundamental negative results: derivatives aren't much better (and here are a little worse, in fact) than a simple recursive walk and derivative caching is not particularly helpful.

This is a somewhat disappointing discovery.

It is entirely possible that there are deficiencies in the specific implementation strategy here. While tying model checking directly to τ simplifies the model of what is going on, I don't think it is a great implementation strategy: you end up computing τ recursively on complex objects over and over as you move through branches of their parent type's derivative. The τ cache helps here somewhat, but it might be better to deal with this as a local validation instead. I believe the numbers from the XML Schema implementation are instructive here because that is how it works: local validity worked from the bottom up. Another approach that might work better is a scheme to cache the fully resolved derivatives also, keyed off the partial derivative expression ID and the actual value of whatever unevaluated τ's it might contain. That would avoid repeated construction of new expression objects whenever we resolve from the cache.

It is possible to reclaim finiteness and perform full caching by vectorizing the values. Take each distinct facet anywhere in the schema. Each slot in the vector is 1 or 0 depending on whether a particular value satisfies that facet. This then compresses to a binary value of however many bits. While this recovers finiteness, evaluating a lot more facets that necessary in the moment is also not without cost, so it isn't clear what the conditions are where this could be a performance win.

It may well also be the case that the optimization of folding the required property into the mapping for object composition properties is no optimization at all. Perhaps we'd be better served with a mapping like:

Figure 87

μM({
  properties: {p1:s1,...pm:sm},
  patternProperties: {r1:pt1,...,ro:pto},
  required:[rq1,...,rqn]
}) =>
  {model: 
    < rq1 & ... & rqn & wc(-[rq1,...,rqn])*,
      p1@κ(t1)? & ... & pm@κ(sm)? & 
        wc(-[p1,...,pm],+[r1,...,ro])*,
      (wc(r1)@κ(pt1) | wc(-[],-[r1]))*,
      (wc(r2)@κ(pt2) | wc(-[],-[r2]))*,
        ...
      (wc(rn)@κ(pto) | wc(-[],-[ro]))*
    >
  }
       

Alternative mapping for object properties

Although the models are weaker overall, the first branch is entirely independent of type matching of specific values, and we can fail fast, pruning values that don't match it by a pure cache lookup in the derivative cache without any additional resolution required.

Other Applications of Type-Tagged Regular Expressions

Type-tagged regular expressions like this can be applied to other problems, or can be used to recast other problems in different terms.

First, most obviously, we can recast XML Schema validation in these terms as well. The mapping function is different (rather more straightforward). We do need additional wildcard functions (see Sperberg-McQueen). The τ function evaluates the constraining facets of simple types, analogously to what we have above, and looks at the nullability of the derivative of the content models of complex types. These content models will have their type information retained, tagged either with the type name itself (for named types), or with a generated stable ID. We don't have a single top-level type, so we kick things off by identifying the appropriate element declaration (if any), and computing the τ with respect to its type. Attribute checks folds into this model as well, through appropriate mapping. The details of this are beyond the scope of this paper, and work for another day.

Whether evaluated with derivatives or not, type-tagged regular expressions provide a framework for clearer expression of certain kinds of patterns or for a more general understanding of certain kinds of regular expressions. Consider, for example, a conventional regular expression[14]:

Figure 88

([0-1]\d)|(2[0-4])-([0-5]\d)-([0-5]\d)(.\d\d\d)?(.{3,3})
      

Regular expression

A type-tagged expression of this could be something like:

Figure 89

(\d\d)\@{HOUR}-(\d,\d)\@{MINUTE}-(\d,\d)\@{SECOND}(.,\d,\d,\d)?,(.{3,3})\@{TZ}
      

Tagged version of regular expression

with types

Figure 90

HOUR = {minimumInclusive:0, maximumInclusive:24}
MINUTE = {minimumInclusive:0, maximumExclusive:60}
SECOND = {minimumInclusive:0, maximumExclusive:60}
TZ = {enum:["PST","PDT",....]}
      

Associated type definitions

(In keeping with the spirit of abbreviation for this style of regular expression, the type tagging uses \@{type} as a marker rather than the bare @type.)

I believe this is a clearer expression of the intention and meaning of the desired pattern, and may be useful for expositional clarity, if nothing else.

There are extensions to regular expressions that capture a certain class of τ functions in this fashion already: Unicode character category and block escapes:

Figure 91

((\p{Lu}\p{Ll}*)_)*
\p{IsGreek}*
      

Regular expressions with Unicode escapes

The first of these expressions matches camel-cased strings of letters with an underbar separator (e.g. "Example_Match") and the second matches a sequence of Greek characters. Rewriting them as type-tagged expressions in the notation of this paper:

Figure 92

((wc()@Lu,(wc()@Ll)*),_)*
(wc()@IsGreek)*
      

Reinterpreting Unicode escapes as type tags

Clearly this is not a compact rewrite! However, it does allow us to think about the escape as less a special case than as a way of naming something whose significance is defined by the τ function. Let us reimagine the \p escape as naming a type where the τ function in this case is either "matches a Unicode category with the given name" or "matches a Unicode block with the same name as what follows the 'Is'".

One way to think about this is that we have a representation of a state machine with τ representing a conditional function on the transitions. There is no reason (other than computation efficiency) why τ needs to be a "type" function at all. What restrictions should we place on τ? Any?

Final Thoughts

This work followed a path of applying Brzozowski derivatives to JSON Schema validation, following the pattern of using them for XML Schema validation. This led to the need to extend regular expressions with type tagging, and adding a type checking function that applied the appropriate matching constraints for the compound symbol, and to allow for type-only symbols in the expressions. It turns out, this formulation also works to provide a more uniform story for XML Schema validation beyond content model checking. More generally, tagged symbols combined with a tag matching function can be used more widely to improve the clarity of certain regular expressions and provide a framework for expressing other kinds of matching problems. However, the current implementation performance numbers are not particularly compelling: the need to match types against an infinitude of specific values makes it difficult to find effective caching strategies, undermining the point of using derivatives as an implementation technique in the first place.

References

[JSValidation] H. Andrews, Editor, G. Luff. JSON Schema Validation: A Vocabulary for Structural Validation of JSON IETF Internet-Draft, March 19, 2018. Available at https://tools.ietf.org/html/draft-handrews-json-schema-validation-01

[JSONSchema] H. Andrews, Editor. JSON Schema: A Media Type for Describing JSON Documents IETF Internet-Draft, March 19, 2018. Available at https://tools.ietf.org/html/draft-handrews-json-schema-01

[Brzozowski64] Janusz A. Brzozowski Derivatives of Regular Expressions. Journal of the ACM 11.4 (1964): 481-494.

[XSD2] W3C: David Peterson, Shudi (Sandy) Gao 高殊镝, Ashok Malhotra, C.M. Sperberg-McQueen, and Henry S. Thompson, editors. W3C XML Schema Definition Language (XSD) 1.1 Part 2: Datatypes. W3C. April 2012. http://www.w3.org/TR/xmlschema11-2/

[Sperberg-McQueen05] Michael Sperberg-McQueen Applications of Brzozowski derivatives to XML Schema processing. Presented at Extreme Markup Languages, Montréal, Canada, August 1-5, 2005. Available at http://cmsmcq.com/2005/EML2005Sper0518.html.



[1] The links to Clark's "An algorithm for RELAX NG validation" and Bob Foster's blob referenced by Sperberg-McQueen are, alas, dead.

[2] Strictly speaking, in this case, where we don't have any actual constraints on a, b, and c we could have dispensed with properties. If we wanted to ensure the properties were all numeric, we would need to have a schema like {"type":"numeric"} instead of true for each of the properties, and properties would be necessary.

[3] The true schema is not actually equivalent to ε: derivative rules are different.

[4] Actually, integer is another alternative value for the type property, but as the facets for integer are the same as for number, we don't need to account for it separately here.

[5] There is a small detail to take care of: it is possible to end up with more than one model property according to these rules, but we require the output to have unique keys. In that case we use the simultaneous operator to combine them: {model:m1} + {model:m2} => model:<m1,m2>.

[6] You can't actually implement things that way as there may be circular references. In practice, the usual kind of pre-pass or symbol table fix-up is required, if for no other reason than to prevent problems. The JSON Schema specification does say that schemas "should not" make use of infinitely recursive circular references as they may blow up naive validators and specifies such behaviour as undefined.

[7] Support for the format property is optional, as are the specific kinds of formats that are supported. Here I give an exemplar as to how it might be mapped. For other format kinds more elaborate mappings would be required, or the use of special simpleType values that encapsulate any checks that can't be mapped to other facets.

[8] From an implementation perspective both the type matching rules of the propertyNames property and its mapping are a little strange and special. We store the mapping of s1, but realistically what is stored is the key (κ(μ(s1))) with the actually mapped faceted type stored in a lookup table elsewhere.

[9] Once you follow the conventional dodge of adding a unique symbol for all non-matching symbols encountered.

[10] Note: There are actually a couple of annoying of special cases around empty objects and handling subtype recursion that I have elided here. There is an asymmetry due to the fact that top level values have no name, but values inside a JSON object (including other objects) do.

[11] See Sperberg-McQueen05 for many of these; the logic for the others is similar. Since JSON objects have unique keys and contain nothing but keyed items, the analysis of intermix with respect to UPA apply here too.

[12] T0, T1, etc. are the identifiers given to us by κ. The set of properties on the other side of the equals sign is what is produced from the mapping.

[13] Given JSON's proclivity for open-ended sets of properties, it is useful to be able to group all symbols not found in the schema together as a single "unknown symbol" here.

[14] For those unfamiliar with it: This grep-style regular expression uses a slightly different notation than what we've used in this paper: The sequence operator (,) is assumed, [0-5] is an abbreviation for (0|1|2|3|4|5), \d is an abbreviation for a choice between all digits, and dot (.) is a wildcard.