Application of Brzozowski Derivatives to JSON Schema Validation
Copyright ©2019 Mary Holstege
Copyright ©2019 Mary Holstege
JSON Schema is a JSON vocabulary for defining valid JSON data. The data may be null, simple – numbers, strings, or boolean values – or structured – arrays of JSON values or objects consisting of key/value pairs. In JSON parlance the keys are called properties. The JSON Schema specification defines a set of properties that define possible constraints on the JSON values. In general these constraints are independent of one another.
Among the constraints are some that define what we might call, in the XML world, "content models" or "content types". As such, we pose the question: can we apply similar techniques as we use for XML Schema to implementing JSON Schema? In particular, can we use Brzozowski derivatives to compute validity with respect to these "content models"?
Brzozowski64 defines the concept of a derivative of an extended regular expression with respect to a symbol, and to a sequence of symbols. The derivative of an extended regular expression with respect to a symbol is the extended regular expression that would need to be matched once that symbol is consumed. By repeating the process as we walk through a sequence of symbols (the input sequence) we end up with an extended regular expression that must match the empty sequence once we have consumed the whole input sequence. An extended regular expression is called nullable if it can be reduced to match the empty sequence. Brzozowski shows that the input sequence is valid with respect to the initial extended regular expression if and only if the derivative of that expression with respect to the input sequence is nullable.
SperbergMcQueen (SperbergMcQueen05) discusses the application of Brzozowski derivatives to various problems of XML Schema processing, and offers a nice summary of some of the previous work in this area. Brzozowski derivatives are reported ^{[1]} to be used in James Clark's JING Validator and Sun's MSV validator, and I know it to be used in MarkLogic's validation support as well. There are no doubt others.
The essential idea in using Brzozowski derivatives for XML Schema content model validation is to observe that, with a few extensions to the model as shown by SperbergMcQueen, XML Schema content models are extended regular expressions over the set of defined element names (plus one "every other name" symbol). When the derivative of a sequence of names from the input with respect to the declared content model is nullable, the input matches the content model. Given the unique particle attribution property in XML Schema, we can look up each symbol in the matching sequence, check simple type constraints as necessary, and complete validation.
The definitions of JSON Schema (JSONSchema) and JSON Schema Validation (JSValidation) are evolving and managed for the most part outside of the scope of any standards body (see https://jsonschema.org/specification.html for the current state), although there are now IETF InternetDrafts (cited above) that track the changes. Here I describe draft 07.
The conceptual framework behind JSON Schema differs greatly from the traditions coming from the XML community or the formal languages community and some of the vocabulary is used differently as well. A "schema" is a JSON value: either an object or boolean value. It defines a set of constraints against a JSON value, which may be an object (a map of name/value pairs), an array of JSON values, or a simple number, string, boolean, or null value. Certain properties of the schema object have defined meanings; undefined properties are irrelevant and ignored. Certain properties contain schemas as values. It is important to remember that the term "schema" does not mean the totality of the set of constraints defined in a particular schema document and its referenced schema documents but instead means the current set of constraints against a particular JSON value.
A schema defines a set of constraints through a set of properties and their values. Importantly, with a very few defined exceptions, the constraints apply independently. Certain properties are ignored for certain values. A JSON value is valid with respect to its schema when all the relevant properties are valid with respect to that value.
A few examples might help make this clear:
Figure 1
{ "minimum": 12, "pattern": "a+" }
Figure 1 shows a simple JSON Schema. The minimum
property requires that the JSON value being validated be a numeric value greater
than twelve. The pattern
property requires that the JSON value being validated be a string value consisting
of a sequence of 'a'
s. Is this an unsatisfiable schema? No. If the value presented happens to be a number,
it will be valid if it is more than twelve, because the pattern
property will be ignored for numeric values. If the value presented happens to be
a string, it will be valid if it consists of a sequence of 'a'
s because the minimum
property will be ignored for string values. What is more, if the value presented
happens to be a JSON object, it will also be valid, because both the minimum
and pattern
properties will be ignored!
We can fix the schema to disallow that last scenario by adding another property:
Figure 2
{ "type": ["number","string"] "minimum": 12, "pattern": "a+" }
In Figure 2 we have a schema that will match either a number greater than twelve or a sequence
of 'a'
s and nothing else.
Similar kinds of scenarios play out with constraints on JSON objects.
Figure 3
{ "type": "object", "required": ["a","b","c"], "properties": { "a": true, "b": true, "c": true }, "additionalProperties": false }
The schema in Figure 3 needs all four^{[2]} of the properties to ensure that the value is an object with exactly three properties
named a
, b
, and c
. The type
property excludes other kinds of values such as numbers. The required
property excludes objects that do not have all of the indicated properties. The properties
property defines constraints for a certain set of named properties (those constraints
being the boolean schema true
which means anything is valid). And finally the additionalProperties
property excludes objects that have other properties we haven't mentioned using a
boolean schema of false
(always invalid). There are no ordering constraints or occurrence constraints: objects
are unordered maps in which keys can only occur once.
This is obviously a little more convoluted than just saying the content model is a & b & c
. It is easy to imagine we might have a mapping that gets us such a content model,
however, and we will see that in a moment. Nevertheless, the reader may be forgiven
for suspecting that applying regular expression derivatives to validation here is
a dubious enterprise. Hold on to that thought! One more example:
Figure 4
{ "type": "object", "required": ["a","b"], "oneOf": [ { "properties": { "a": { "type": "number", "minimum": 0 }, "b": { "type": "number", "minimum": 0 }, "c": { "type": "number" } }, "additionalProperties": false }, { "properties": { "a": { "type": "number", "maximum": 0 }, "b": { "type": "number", "maximum": 0 }, "d": { "type": "number" } }, "additionalProperties": false } ] }
The schema in Figure 4 will successfully validate objects that have two numeric properties named a
and b
, where either both numbers are more than zero, or both numbers are less than zero,
together with an additional numeric property named either c
or d
, depending on whether we are in the positive or negative case. We could map this
to the "content model" (a & b & c)  (a & b & d)
but verifying that the numeric constraints hold would require a separate check that
essentially performs all the content model checking anyway. The reader now has probably
concluded that regular expression derivatives for JSON Schema validation are doomed.
Not so fast, reader! We will be adding some extensions to the Brzozowski model to
rescue the situation.
Before we get to the extensions to the model required, let's enumerate the JSON Schema validation properties. The classification is not part of the JSON Schema specification: it is an informational organization that we will make use of in a minute.
Property name  Relevance  Meaning  Classification 

type  all  Data value kind: object, array, number, integer, string, boolean, null  type (special) 
format  string  Required format such as datetime (optional)  type 
enum  all  Set of possible values  type 
const  all  Fixed value  type 
minLength  string  Minimum length  type 
maxLength  string  Maximum length  type 
pattern  string  String matches regular expression  type 
multipleOf  number  Number is multiple of this  type 
minimum  number  Inclusive minimum  type 
minimumExclusive  number  Exclusive minimum  type 
maximum  number  Inclusive maximum  type 
maximumExclusive  number  Exclusive maximum  type 
minProperties  object  Minimum number of properties  type 
maxProperties  object  Maximum number of properties  type 
minItems  array  Minimum number of array items  type 
maxItems  array  Maximum number of array items  type 
uniqueItems  array  Are values in array unique?  type 
properties  object  Child properties  composition 
additionalProperties  object  Constraints on unenumerated properties  composition 
required  object  Names of required properties  composition 
items (array value)  array  Sequence of constraints for items in array  composition 
items (single value)  array  Constraint on every item in array  composition 
additionalItems  array  Constraints on additional unenumerated items  composition 
contains  array  At least one item is valid  composition 
anyOf  all  Valid against at least one subschema  composition 
oneOf  all  Valid against exactly one subschema  composition 
allOf  all  Valid against all subschemas  composition 
not  all  Valid if not valid against subschema  composition 
if,then,else  all  Conditional validation  composition 
propertyNames  object  Schema constraint on the names of properties  type (special) 
dependencies  object  Additional constraints if certain child properties are present  composition (special) 
$ref  all  References the schema via a path rather than in place  special 
$id  all  Sets a base URI for resolution of $ref properties

special 
The approach taken to applying Brzozowski derivatives to JSON Schema validation is to map the compositional properties to extended regular expressions, where the symbols will be composites of the JSON property names tagged with a type identifier. A further mapping will gather up bundles of type properties in the schema; these bundles will get a type identifier. Call these bundles types and their constraints facets. We will show how we can define nullability for these extended expressions and derivatives of them with respect to JSON values. Some additional machinery and notation will need to be in play for unnamed values, either as toplevel values or array items.
In sum, here is the program of work:
Define a mapping from JSON Schema properties to extended regular expressions
Define a mapping from JSON Schema properties to types and facets
Define extension operators to account for all the composition properties
Define extensions to tag property names with type identifiers in the expressions
Define a type matching function
Define extensions for typeonly items (top level and array items)
Define derivatives with respect to JSON values
Define derivatives over extended expressions
Define nullability over extended expressions
Sip wine in celebration
Mapping performs two operations: mapping type properties to faceted types, and mapping
compositional properties to extended regular expressions. We will denote this with
two separate functions, μ and μM respectively. To complete the picture, we will also
need a function that provides the type with its name, and a lookup function that can
obtain the facets for a given named type. I use the form f(A) => B
to say that the definition of the result of applying function f
to A
is B
.
Basic Notation 


Mapping function 


Model mapping function 


Key function 


Lookup function 

The type mapping function is fairly straightforward, although capturing the interactions
with the compositional properties gets a little messy in a functional paradigm. In
the formulation here, the extended regular expressions from μM
are expressed as the output for a particular property (model
) in the output of μ
.
For the most part type properties in the schema map directly to facets in the type.
If there is a type
property in the schema, its value will constrain which properties map to facets;
if it is an array of values, we will treat this as creating a content model disjunction
between the alternatives; and if it is absent we will treat this as creating a content
model disjunction between all^{[4]} alternatives (array
, object
, number
, string
, boolean
, null
). Each alternative will only gather properties relevant for that specific type
value.
In what follows, P
is a map of propertyvalue pairs {p1:v1,...,pn:vn}
where each property name is unique and the +
operator performs a union of maps. The rules are generally written so that the set
of property names is disjoint.^{[5]} In addition, s1
, s2
, etc. refer to (sub)schemas.
The definition below shows the mapping, broken down for each of the distinct type
values. Most properties can be mapped independently, although there are some exceptions.
That is, unless specified otherwise: μ({p:v,P}) => μ({p:v})+μ(P)
and similarly μM({p:v,P}) => μM({p,v})+μM(P)
. Where there is a rule for a specific combination of properties, that rule takes
precedence. There are a few edges cases to take care of: boolean schemas and $ref
dereferencing.
It will be useful to define some standard bundles of properties. The function relevant(P,kind)
will return just the properties listed for the specific kind from the map of propertyvalue
pairs. For example, relevant({minLength:1,maximum:12},"string")
is {minLength:1}
:
Figure 5
Kind  Properties selected by relevant(P,kind) 

any 
enum, const, oneOf, anyOf, allOf, not, if, then, else 
boolean 
Those for any 
string 
minLength, maxLength, pattern and those for any 
integer 
Same as number 
number 
multipleOf, minimum, exclusiveMinimum, maximum, exclusiveMaximum and those for any 
null 
Those for any 
array 
minItems, maxItems, uniqueItems, items, additionalItems and those for any 
object 
propertyNames, minProperties, maxProperties, properties, patternProperties, additionalProperties,
required and those for any 
First, we define how some edge cases are handled:
Figure 6
s 
μ(s) 

true 
true 
false 
∅ 
∅ 
∅ 
{$ref:path} 
μ(dereference(path)) 
The boolean schemas map to true
or empty set as appropriate, and references map to the referenced definition. Specifying
the details of how the JSON path gets dereferenced is beyond the scope of this paper.
In what follows assume we have a fully dereferenced schema in hand.^{[6]} We are also assuming the schema as a whole (including the references) is valid.
The type
property is the main driver for the mapping. In general it gets mapped to a kind
facet plus a simpleType
facet for references to simple values and occasionally other facets as well. The
type names have the same meaning as in XSD2. Here are the rules for mapping the type
property:^{[7]}
Figure 7
s 
μ(s) 

{type:[t1,...,tn], P} 
μ({oneOf: [{type:t1,P},...,{type:tn,P}]}) 
{type:boolean, P} 
μ({type:boolean}) + μ(relevant(P,"boolean")) 
{type:string, format:f1, P} 
μ({type:string, format:f1}) + μ(relevant(P,"string")) 
{type:string, P} 
μ({type:string}) + μ(relevant(P,"string")) 
{type:integer, P} 
μ({type:integer}) + μ(relevant(P,"integer")) 
{type:number, P} 
μ({type:number}) + μ(relevant(P,"number")) 
{type:null, P} 
μ({type:null}) + μ(relevant(P,"null")) 
{type:array, P} 
μ({type:array}) + μ(relevant(P,"array")) 
{type:object, P} 
μ({type:object}) + μ(relevant(P,"object")) 
{type:boolean} 
{kind:boolean, simpleType:"xs:boolean", enum:[true,false]} 
{type:string, format:datetime} 
{kind:string, simpleType:"xs:dateTime"} 
{type:string} 
{kind:string, simpleType:"xs:string"} 
{type:number} 
{kind:number, simpleType:"xs:numeric"} 
{type:integer} 
{kind:number, simpleType:"xs:integer"} 
{type:null} 
{kind:null} 
{type:array} 
{kind:array} 
{type:object} 
{kind:object} 
The enum
and const
properties apply to all kinds of values:
Figure 8
s 
μ(s) 

{enum:[s1,s2,...,sn]} 
{enum:[s1,s2,...,sn]} 
{const:value} 
{fixed:value} 
The stringspecific properties are minLength
, maxLength
, and pattern
Figure 9
s 
μ(s) 

{minLength:value} 
{minLength:value} 
{maxLength:value} 
{maxLength:value} 
{pattern:value} 
{pattern:value} 
Here are the numeric property mappings:
Figure 10
s 
μ(s) 

{multipleOf:value} 
{multipleOf:value} 
{minimum:value} 
{minimumInclusive:value} 
{minimumExclusive:value} 
{minimumExclusive:value} 
{maximum:value} 
{maximumInclusive:value} 
{maximumExclusive:value} 
{maximumExclusive:value} 
Type properties for objects:^{[8]}
Figure 11
s 
μ(s) 

{minProperties:value} 
{minProperties:value} 
{maxProperties:value} 
{maxProperties:value} 
{propertyNames:s1} 
{propertyNames:μ(s1)} 
Array properties:
Figure 12
s 
μ(s) 

{minItems:value} 
{minItems:value} 
{maxItems:value} 
{maxItems:value} 
{uniqueItems:value} 
{uniqueItems:value} 
Before we get to model mapping, we need to introduce some additional notation. The notation is similar to SperbergMcQueen05, which is a little different from Brzozowski64, but with a few alterations and extensions. JSON Schema does not need general occurrence counts, and the wildcards required are relative to a set of property names, not namespaces.
Basic Extended Regular Expressions 

The type marking, exclusive or, conditional, simultaneous, and wildcards are extensions. Negation is present in Brzozowski64 although with a slightly different notation.
It may seem that F&G
and <F,G>
mean the same thing, but the difference will become clearer when we get to the definition
of derivatives, because F&G
matches F
and G
against different segments of the input, whereas with <F,G>
matches F
and G
against the same segment of the input simultaneously. For example, a@T & a@S
can never be satisfied because each symbol is unique in a JSON object so there can
only be one a
matched, but <a@T, a@S>
may well be satisfied if both type constraints are satisfied on the value of property
a
.
The mapping of the basic composition operators is fairly direct:
Figure 13
s  μ(s) 

{anyOf: [s1f,...,sn]} 
{model: μM({anyOf:[s1,...,sn])} 
{oneOf: [s1,...,sn]} 
{model: μM({oneOf:[s1,...,sn])} 
{allOf: [s1,...,sn]} 
{model: μM({allOf:[s1,...,sn])} 
{not:s} 
{model: μM({not:s})} 
{if:s1,then:s2,else:s3} 
{model: μM({if:s1,then:s2,else:s3})} 
Figure 14
s  μM(s) 

{anyOf: [s1,s2,...,sn]} 
μM(s1)μM(s2)...μM(sn) 
{oneOf: [s1,s2,...,sn]} 
μM(s1)⊕μM(s2)⊕...⊕μM(sn) 
{allOf: [s1,s2,...,sn]} 
<μM(s1),μM(s2),...,μM(sn)> 
{not:s} 
!μM(s) 
{if:s1,then:s2,else:s3} 
(μM(s1)?μM(s2):μM(s3)) 
{if:s1,then:s2} 
(μM(s1)?μM(s2):true) 
{if:s1,else:s3} 
(μM(s1)?true:μM(s3)) 
Mapping of the objectrelated compositional properties requires dealing with the interactions
and dependencies between them. The two main compositional properties are properties
and patternProperties
. Each of these consists of a set of keys and an associated schema for each key. In
the case of properties
the key is the property name to match in the instance; in the case of patternProperties
the key is a regular expression that matches the name of the property in the instance.
All properties named are optional unless their name appears in the required
array. Properties otherwise unaccounted for must conform to the schema given for
additionalProperties
, which defaults to the true
schema if absent.
To simplify things somewhat, let's first ensure that when the additionalProperties
property was not specified, we treat it as an explicit true
schema, that is, that all additional properties are OK. We also want to normalize
the additionalProperties
property when it is the false
schema to remove it, and to normalize the properties
and patternProperties
properties to replace absence with an empty object, and the required
property to replace absence with an empty array. Assume the following normalizations
apply one after another:
Figure 15
Constraints on {p1:s1,...,pn:sn}  μ({p1:s1,...,pn:sn}) 

∀pipi!=additionalProperties 
μ({p1:s1,...,pn:sn,additionalProperties:true}) 
{p1:s1,...,pn:sn}={additionalProperties:false,P} 
μ({P}) 
∀pipi!=properties 
μ({p1:s1,...,pn:sn,properties:{}}) 
∀pipi!=patternProperties 
μ({p1:s1,...,pn:sn,patternProperties:{}}) 
∀pipi!=required 
μ({p1:s1,...,pn:sn,required:[]} 
This leaves us with two variations to map: the bundle of properties P
with the properties named properties
, patternProperties
, and required
, and the bundle of properties with those plus the property named additionalProperties
as well. In either case, the mapping creates a model
property with the property bundle mapped by μM
: {model: μM(P)}
.
Capturing the semantics of additionalProperties
requires some trickiness, because if its value is not the false
schema, then it applies to any properties that are not named in properties
and that do not match a pattern in patternProperties
. We capture this in the wildcard which takes those properties and patterns as its
sets. We also capture the required
property by marking appropriate properties as optional or not in the model. This
is an optimization: we could instead have a simultaneous compositor where one branch
handles the required
properties and the other handles the type information. Both branches end up being
relatively weak, in general.
The semantics of patternProperties
is particularly tricky:
Pattern properties do not count as additional properties.
More than one pattern property can match a specific input property and both constraints need to be enforced.
A pattern property may or may not match a specifically named property; if it does its constraint must be enforced.
We may lack named properties (properties
) but still have a constraint on "additional" properties (additionalProperties
).
In general, every property listed is optional, unless it is specifically mentioned
in the required
array. To capture the semantics of required
in an effective way is a little awkward to express algebraically. Note that it is
perfectly acceptable in JSON Schema for a property to be in the required
array but not mentioned by either the properties
or patternProperties
.
For example the following schema is perfectly OK, and will not match an object {"c":1,"dd":2}
as it lacks the required a
and b
properties:
Figure 16
{ "type": "object", "required": ["a","b"], "properties": { "c": { "type": "number" } }, "patternProperties": { "d+": { "type": "number" } }, "additionalProperties": true }
With all that said, the general rules for mapping the objectrelated composition properties are:
Figure 17
s 
μM(s) 

{ properties: {p1:s1,...pm:sm}, patternProperties: {r1:pt1,...,ro:pto}, additionalProperties:sp, required:[rq1,...,rqn] } 
< (q1@κ(t1) & ... & qk@κ(tk) & ql@κ(ql)? & ... & qm@κ(tm)? & u1@κ(sp) & ... & uv@κ(sp) & (wc([p1,...,pm],[r1,...,ro])@κ(sp) wc([p1,...,pm],+[r1,...,ro]))* ), (wc(r1)@κ(pt1)  wc([],[r1]))*, (wc(r2)@κ(pt2)  wc([],[r2]))*, ... (wc(rn)@κ(pto)  wc([],[ro]))* > 
{ properties: {p1:s1,...pm:sm}, patternProperties: {r1:pt1,...,ro:pto}, required:[rq1,...,rqn] } 
< (q1@κ(t1) & ... & qk@κ(tk) & ql@κ(ql)? & ... & qm@κ(tm)? & u1@κ(sp) & ... & uv@κ(sp) & wc([p1,...,pm],+[r1,...,ro])* ), (wc(r1)@κ(pt1)  wc([],[r1]))*, (wc(r2)@κ(pt2)  wc([],[r2]))*, ... (wc(rn)@κ(pto)  wc([],[ro]))* > where l = k+1 for i<=k, qi=pj and ti=sj and ∃rqhpj=rqh for i>k, qi=pj and ti=sj and ∀rhpj!=rqh for i<=n, ui=rqj where ∀phph!=rqj 
This looks complicated, but all it is saying is that our mapped content model consists
of a mix, in any order, of: properties that are named and required (q1
through qk
), properties that are named but not required (ql
through qm
, properties that are not named but are nevertheless required (u1
through uv
), a wildcard to pick up properties that are not named that match none of the patterns
but satisfy the schema for additionalProperties
, and a wildcard to pick up properties that are not named but do match one of the
patterns. The wildcard clauses in the second and subsequent branches of this simultaneous
model ensure that properties matching patterns satisfy their appropriate schema: everything
matches the pattern and its corresponding schema, or doesn't match the pattern at
all. Each pattern gets its own simultaneous branch because an input property may match
more than one of them, in which case it would need to satisfy constraints from each
match.
If patternProperties
is empty, as is often the case, this simplifies a great deal. We can drop the wildcards
that match against patterns and simplify the simultaneous model with one branch to
just that branch:
Figure 18
s 
μM(s) 

{ properties: {p1:s1,...pm:sm}, additionalProperties:sp, required:[rq1,...,rqn] } 
( q1@κ(t1) & ... & qk@κ(tk) & ql@κ(ql)? & ... & qm@κ(tm)? & u1@κ(sp) & ... & uv@κ(sp) & wc([p1,...,pm],[])@κ(sp)* ) 
{ properties: {p1:s1,...pm:sm}, required:[rq1,...,rqn] } 
( q1@κ(t1) & ... & qk@κ(tk) & ql@κ(ql)? & ... & qm@κ(tm)? & u1@κ(sp) & ... & uv@κ(sp) ) 
Notice that we still get fairly weak content models here: where every property can
occur in any order, optional unless specifically required, with all kinds of wildcards
unless additionalProperties
is explicitly false
and there are no patternProperties
. This has performance implications, as it creates larger and more complex derivatives.
In JSON objects, every property name is unique, however. We can use this knowledge
to strengthen the wildcards in the first branch by adding any required properties
to the excluded list. (In the examples we will assume we do this.) Other optimizations
are possible. We will return to this point later.
Mapping the arrayrelevant compositional properties follows a similar paradigm, but an additional class of extension is in order, because array members do not have any symbol associated with them, just values. However, given our type constraint scheme, we still have something to say about such things, we just need a little bit more notational mechanism:
Nonsymbol extensions 

Given the typeonly expressions, we are now in a position to map array properties.
The items
property in JSON Schema comes in two varieties with different semantics: as a single
schema, or as an array of schemas. If it is a single schema, then that schema applies
to every member of the array, and additionalItems
is irrelevant. If it is an array, then each member of the array applies one to one
to the values in the array in order, with any surplus values in the array matching
the additionalItems
schema, if any:
Figure 19
s 
μ(s) 

{items:I,additionalItems:sa,contains:sc} 
{model: <μM({items:I,additionalItems:sa}),μM({contains:sc})>} 
{items:I,additionalItems:sa} 
{model: μM({items:I,additionalItems:sa})} 
{items:I,additionalItems:false} 
{model: μM({items:I})} 
{items:I} 
{model: μM({items:I,additionalItems:true})} 
Figure 20
s 
μM(s) 

{contains:sc} 
(•@ε*,•@κ(sc),•@ε*) 
{items:s1,additionalItems:sa} 
(•@κ(s1)*) 
{items:[s1,s2,...,sn], additionalItems:sa} 
(•@κ(s1),•@κ(s2),...,•@κ(sn),•@κ(sa)* •@κ(s1),•@κ(s2),...,•@κ(sn)? •@κ(s1),•@κ(s2),...,•@κ(sn1)? ... •@κ(s1),•@κ(s2)? •@κ(s1)? ) = ((•@κ(s1), (•@κ(s2), ... (•@κ(sn)?,•@κ(sa)*) ... )? )? ) 
{items:[s1,s2,...,sn]} 
(•@κ(s1),•@κ(s2),...,•@κ(sn)? •@κ(s1),•@κ(s2),...,•@κ(sn1)? ... •@κ(s1),•@κ(s2)? •@κ(s1)? ) = ((•@κ(s1), (•@κ(s2), ... •@κ(sn)? ... )? )? ) 
The reason for the complexity of the rules for items
is this: just because the property defines type constraints for three slots, that
doesn't mean a matching array must have three entries in it. That is handled by a
different constraint property, the minItems
property. We could apply it here to remove some of the optionality.
In other words, array constraint properties map to typeonly expressions. The additionalItems
property is only relevant if the items
property enumerates a sequence of type constraints for each item in the array.
Mapping the dependencies
property is a little confusing because its value is a JSON object where each value
is either a subschema or an array of property names. The semantics are very different,
and the dependencies
property is likely to bifurcate in future drafts of JSON Schema. Each item in the
JSON object maps separately, using a different rule depending on whether its value
is a schema or an array.
Here are the rules:
Figure 21
s 
μ(s) 

{dependencies: {p1:v1,p2:v2,...,pn:vn}} 
{model: < μM({dependencies:{p1:v1}}), ..., μM({dependencies:{pn:vn}}) >} 
Figure 22
s 
μM(s) 

dependencies:{p:S} 
((p & wc([p],[])*) ? wc(*)@κ(S) : true) 
dependencies:{p:[q1,q2,...,qn]} 
((p & wc([p],[])*) ? (q1 & ... & qn & wc([q1,...,qn],[])*) : true) 
p
plus a bunch of other stuff not named p
, the object must match either the given schema, or must include all the named properties
(depending on which form was used).
It should also be noted that although at first blush we have made so many extensions
to regular expressions that it seems likely that none of the supporting results would
hold, in fact we have hardly extended anything substantively. Instead of the finite
vocabulary consisting of the symbols in the schema we have a vocabulary consisting
of the symbols crossed with the type identifiers plus the •
symbol. It is still a finite vocabulary.^{[9]} The operators and functions we have added, being boolean operators, fit into the
rules defined by Brzozowski. True, we have made the rules for matching more complex,
but one could also think of them as just being a funny kind of equality. On the other
side, while our input can consist of an arbitrary number of distinct values, they
are all just annotations on that finite vocabulary. The infinitude of values will
concern us when we get to caching, however.
Given a JSON Schema mapped in this way, we can compute derivatives of those expressions
with respect to JSON values. Most of these rules come directly from SperbergMcQueen05 or Brzozowski64 or can be derived by Brzozowski's observation that the derivative of any boolean
operator over some operands is the operator over the derivatives of those operands.
Algebraically: Δf(R)/Δa = f(ΔR/Δa)
for any boolean function f
. A key difference is that when we get to the leaves, we have to do type matching
as well as just symbol matching. Let us introduce a little more notation:
Derivative function Δ 


Nullability function 


Typematch function 


JSON values 

Taking lowercase a
and b
as symbols and uppercase E
, F
, and G
as expressions, the derivatives of basic expressions (without type matching) are
as follows:
Figure 23
E 
ΔE/Δa 

b 
[a=b] 
ε 
∅ 
∅ 
∅ 
true 
true 
E? 
ΔE/Δa  ν(E) 
E* 
ΔE/Δa,E* 
E,F 
(ΔE/Δa),F  ν(E),ΔF/Δa 
EF 
ΔE/Δa  ΔF/Δa 
E&F 
(ΔE/Δa & F)  (E & ΔF/Δa) 
!E 
!ΔE/Δa 
E⊕F 
(ΔE/Δa) ⊕ (ΔF/Δa) 
(E?F:G) 
(ΔE/Δa ? ΔF/Δa : ΔG/Δa) 
<E,F> 
<ΔE/Δa, ΔF/Δa> 
wc(pat) 
[matches(a,pattern)] 
wc([p1,...,pn],[r1,...,rn]) 
[∀pj:a!=pj ∧ ∀rj:!matches(a,rj)] 
wc([p1,...,pn],+[r1,...,rn]) 
[∀pj:a!=pj ∧ ∃rj:matches(a,rj)] 
To take the derivative of an expression with respect to a JSON object, first note the edge case for an empty sequence is defined so that:
ΔE/Δε => EThen observe that for a sequence of symbols
u
followed by a symbol a
:
ΔE/Δ(u,a) => (ΔE/Δu)/Δaor, unrolling an entire sequence:
ΔE/Δ(p1,p2,...,pn) => (...(((ΔE/Δp1)/Δp2)/...)/Δpn)
Given the set of properties of an object p1
, p2
, etc., then the derivative of the set of properties is the union of the derivative
of each possible sequence starting with some pj
:
Figure 24
Derivative expression  Derivative result 

ΔE/Δ•:{p1:v1,p2:v2,...,pn:vn} 
(ΔE/Δp1:v1)/Δ•:{p2:v2,p3:v3,...,pn:vn}  (ΔE/Δp2:v2)/Δ•:{p1:v1,p3:v3,...,pn:vn}  ...  (ΔE/Δpn:vn)/Δ•:{p1:v1,p2:v2,...,pn1:vn1} 
Adding type matching to the picture doesn't change the rules much although they look a little more complicated. All the rules are the same, just carrying through the type annotation on the expressions, except for the leaf cases where we need to perform a type check as well as a symbol matching:^{[10]}
Figure 25
Derivative expression  Derivative result 

Δb@T/Δa:v 
[a=b],τ(v,λ(T)) 
Δwc(pat)@T/Δa:v 
[matches(a,pattern)],τ(v,λ(T)) 
Δwc([p1,...,pn],[r1,...,rn])@T/Δa:v 
[∀pj:a!=pj ∧ ∀rj:!matches(a,rj)],τ(v,λ(T)) 
Δwc([p1,...,pn],+[r1,...,rn])@T/Δa:v 
[∀pj:a!=pj ∧ ∃rj:matches(a,rj)],τ(v,λ(T)) 
ΔE@T/Δ•:{p1:v1,p2:v2,...,pn:vn} 
τ({p1:v1,p2:v2,...,pn:vn},λ(T)), ((ΔE/Δp1:v1)/Δ•:{p2:v2,p3:v3,...,pn:vn}  (ΔE/Δp2:v2)/Δ•:{p1:v1,p3:v3,...,pn:vn}  ...  (ΔE/Δpn:vn)/Δ•:{p1:v1,p2:p2,...,pn1:vn1} ) 
Now we have typeaware derivatives for expressions over symbols, we still need to account for arrays and top level values that have no symbols and compute derivatives over these typeonly expressions. Arrays unroll like sequences:
Figure 26
Derivative expression  Derivative result 

ΔE@T/Δ•:[v1,v2,...vn] 
τ([v1,v2,...,vn],λ(T)), ΔE/Δ(•:v1,•:v2,...,•:vn) 
ΔE/Δ(•:v1,•:v2,...,•:vn) 
(...(((ΔE/Δ•:v1)/Δ•:v2)/...)/Δ•:vn) 
Δ(•@T)/Δ•:v 
τ(v,λ(T)) 
Δ•/Δb 
∅ 
Δa/Δ• 
∅ 
Δwc(pat)/Δ• 
∅ 
Δwc([p1,...,pn],[r1,...,rn])/Δ• 
∅ 
Δwc([p1,...,pn],+[r1,...,rn])/Δ• 
∅ 
Δ•/Δ• 
ε 
Δε/Δ• 
∅ 
What this says is that derivative of an array can be treated mostly as a sequence of nonsymbol values, although we need to consider the type constraints on the array as a whole. A symbol cannot match a nonsymbol, so derivatives of that kind are empty. A derivative of a nonsymbol with respect to a nonsymbol value comes down to the type matching function against that value.
The nullability function decides whether an expression can be matched against nothing. When the derivative of an expression with respect to the input is not nullable, that means that either it would need additional input in order to match or it has already run down a blind alley and there is no way to match what is there. Therefore the data cannot be valid with respect to that expression. Conversely, when the derivative is nullable, it means that the input has been successfully matched by the regular expression (or consumed by the state machine, if you prefer) and no additional input is necessary for a successful match.
The rules for the nullability function are as follows:
Figure 27
E 
ν(E) 

ε 
ε 
∅ 
∅ 
true 
ε 
a 
∅ 
E* 
ε 
E,F 
<ν(E), ν(F)> 
<E,F> 
<ν(E), ν(F)> 
EF 
ν(E)  ν(F) 
!E 
ε if ν(E) = ∅ 
!E 
∅ if ν(E) = ε 
F&G 
<ν(F), ν(G)> 
F?G:H 
(ν(F)?ν(G):ν(H)) 
F? 
ε 
F⊕G 
ν(F) ⊕ ν(G) = (ν(F)?ν(!G):ν(G)) 
wc(pat) 
∅ 
wc(N,P) 
∅ 
F@T 
ν(F) 
• 
∅ 
Type matching checks the type constraints of a type T
against a particular value v
. It returns ε
if the value matches the type constraints of the type and ∅
otherwise.
Figure 28
τ(v,T) expression

τ(v,T) result


τ(•,T) 
ε 
τ(v,T) 
ε if ∀f∈facets(T) τ(v,f)=ε, otherwise ∅ 
τ(v,ε) 
ε 
τ(v,∅) 
∅ 
Typematching evaluates the various facets according to their semantics, something like:
Figure 29
f 
τ(v,f) 

{kind:x} 
[v isa x] 
{simpletype:x} 
[v instanceof x] 
{enum:[x1,x2,...,xn]} 
[∃xiv=x1] 
{const:x} 
[v=x] 
{minLength:x} 
[!(v isa string) ∨ stringlength(v) >= v] 
{maxLength:x} 
[!(v isa string) ∨ stringlength(v) <= v] 
{pattern:x} 
[!(v isa string) ∨ matches(v,x)] 
{multipleOf:x} 
!(v isa number) ∨ (v mod x)=0] 
{minimumInclusive:x} 
[!(v isa number) ∨ v >= x] 
{minimumExclusive:x} 
[!(v isa number) ∨ v > x] 
{maximumInclusive:x} 
[!(v isa number) ∨ v <= x] 
{maximumExclusive:x} 
[!(v isa number) ∨ v < x] 
{minProperties:x} 
[!(v isa object) ∨ count(keys(v)) >= x] 
{maxProperties:x} 
[!(v isa object) ∨ count(keys(v)) <= x] 
{propertyNames:x} 
[!(v isa object) ∨ ∀k∈keys(v)τ(k,x)] 
{minItems:x} 
[!(v isa array) ∨ count(items(v)) >= x] 
{maxItems:x} 
[!(v isa array) ∨ count(items(v)) <= x] 
{uniqueItems:x} 
[!(v isa array) ∨ !x ∨ count(items(v))=count(distinct(items(v)))] 
There is an additional rule for the τ
function that handles subschemas. This will make more sense after we have looked
at some examples:
Figure 30
f 
τ(v,f) 

{model:E} 
ν(ΔE/Δ•:v) 
Given this rule, we can observe that a JSON value is valid with respect to a JSON
schema if and only if the type matching function for the mapped JSON schema is ε
.
There are a number of relations and simplifications of the derivative expressions that come in useful in computing derivatives and keeping the size of the derivatives in reasonable bounds.
First, some of the operators are associative:^{[11]}
Figure 31
(E(FG)) = ((EF)G) = (EFG) (E,(F,G)) = ((E,F),G) = (E,F,G) (E&(F&G)) = ((E&F)&G) = (E&F&G) (F⊕G)⊕H = F⊕(G⊕H) = F⊕G⊕H <<F,G>,H> = <F,<G,H>> = <F,G,H>
Some of the operators are also commutative:
Figure 32
EF = FE E&F = F&E E⊕F = F⊕E <E,F> = <F,E>
There are straightforward simplifications involving ε
, ∅
, and redundant expressions. We will make use of these, generally without comment,
in the examples.
Figure 33
EE = E ε&E = E&ε = E ∅&E = E&∅ = ∅ ∅E = E∅ = E ε,E = E,ε = E ∅,E = E,∅ = ∅ E⊕E = ∅ E⊕!E = ε ∅⊕E = E⊕∅ = E (∅?F:G) = G (ε?F:G) = F <E,E> = E <E,!E> = ∅ <ε,E> = <E,ε> = E <∅,E> = <E,∅> = ∅
These identities mean that, given a function or operator f
that can only take the values of ε
or ∅
, such as the type matching function or the nullability function
Figure 34
<f,E> = f,E
Negation behaves as expected with respect to boolean operators:
Figure 35
!(!E) = E !<E,F> = !E!F !(EF) = <!E,!F> !(E⊕F) = <!E,!F><E,F>
Finally, certain of the extension operators can be defined in terms of the others. This can be useful for proving some of these derivations, but it also gives us some comfort that we are not actually enhancing extended regular expressions beyond recognition:
Figure 36
F⊕G = <FG,!F!G> (F?G:H) = <F,G><!F,H>
Again, following SperbergMcQueen05 we can use derivatives for validation. Given a JSON value v
and a JSON schema mapped to a type T
, where T
is a member of a set of mapped types accessible via λ
: v
is valid with respect to T
if and only if the type check τ(v,T)
returns ε
. If the mapping includes a mapped {model:E}
then calculation of τ
will compute the nullability of the derivative of E
with respect to the nameless value: ν(ΔE/Δ•:v)
per the type matching rule given previously.
The overall procedure is to compute the mapping of the schema. This will give us a
type whose identifier (given by κ
) will be the type to match. Where the mapping produces a model, computing the nullability
of the derivative of this model with respect to the nameless JSON value in hand tells
us whether that value is valid with respect to the schema.
Some examples might help:
Let's start with a very simple case. Is the value 47
valid with respect to this schema?
Figure 37
{ type:"number" }
We would certainly expect it to be. Let's walk through the mechanics.
The mapping of the schema gives us just:^{[12]}
Figure 38
T0: { kind:number, simpleType:"xs:numeric" }
We need to compute the type matching function, which is simple enough (here =
shows the chain of derivations):
Figure 39
τ(47,λ(T0)) = τ(47,{kind:number, simpleType:"xs:numeric"}) = [47 isa number],[47 instanceof xs:numeric] = ε
Since the result is ε
, the value is valid with respect to the schema.
Let's take a more complicated example, the schema from Figure 4, reproduced here:
Figure 40
{ "type": "object", "required": ["a","b"], "oneOf": [ { "properties": { "a": { "type": "number", "minimum": 0 }, "b": { "type": "number", "minimum": 0 }, "c": { "type": "number" } }, "additionalProperties": false }, { "properties": { "a": { "type": "number", "maximum": 0 }, "b": { "type": "number", "maximum": 0 }, "d": { "type": "number" } }, "additionalProperties": false } ] }
Mapping gives us:
Figure 41
T0: { kind:object, model: < (a@T1? & b@T1? & c@T2?) ⊕ (a@T3? & b@T3? & d@T2?), (a & b & wc([a,b],[])*) > } T1: { kind:number, simpleType:"xs:numeric", minimumInclusive:0 } T2: { kind:number, simpleType:"xs:numeric" } T3: { kind:number, simpleType:"xs:numeric", maximumInclusive:0 }
We will henceforth abbreviate wc([a,b],[])
as just wc()
.
Let's see if this object is valid:
Figure 42
{ a: 1, b: 1, c: 2 }
To do that, we need to compute the type function:
Figure 43
τ({a:1,b:1,c:2}, λ(T0)) = τ({a:1,b:1,c:2}, { kind:object, model: < (a@T1? & b@T1? & c@T2?) ⊕ (a@T3? & b@T3? & d@T2?), (a & b & wc()*) > }) = ν(Δ< (a@T1? & b@T1? & c@T2?) ⊕ (a@T3? & b@T3? & d@T2?), (a & b & wc()*) >/Δ•:{a:1,b:1,c:2} )
To do that, we need to compute the following derivative:
Figure 44
Δ< (a@T1? & b@T1? & c@T2?) ⊕ (a@T3? & b@T3? & d@T2?), (a & b & wc()*) >/Δ•:{a:1,b:1,c:2}
This expression is equal to:
Figure 45
( Δ< (a@T1? & b@T1? & c@T2?) ⊕ (a@T3? & b@T3? & d@T2?), (a & b & wc()*) >/Δa:1 )/Δ•:{b:1,c:2}
Let's unroll that a bit:
Figure 46
( Δ< (a@T1? & b@T1? & c@T2?) ⊕ (a@T3? & b@T3? & d@T2?), (a & b & wc()*) >/Δa:1 )/Δ•:{b:1,c:2} = ( < Δ( (a@T1? & b@T1? & c@T2?) ⊕ (a@T3? & b@T3? & d@T2?) )/Δa:1, Δ(a & b & wc()*)/Δa:1 > )/Δ•:{b:1,c:2} = ( < Δ(a@T1? & b@T1? & c@T2?)/Δa:1 ⊕ Δ(a@T3? & b@T3? & d@T2?)/Δa:1, Δ(a & b & wc()*)/Δa:1 > )/Δ•:{b:1,c:2} =
Let's consider the left side of the exclusive or (⊕):
Figure 47
Δ(a@T1? & b@T1? & c@T2?)/Δa:1 = Δ(a@T1?)/Δa:1 & b@T1? & c@T2?  a@T1? & Δ(b@T1?)/Δa:1 & c@T2?  a@T1? & b@T1? & Δ(c@T2?)/Δa:1
Evaluating the first clause gives us ε
because the symbol and type match:
Figure 48
Δ(a@T1?)/Δa:1 = Δa@T1/Δa:1  ν(a@T1) = [a=a],τ(1,λ(T1))  ν(a@T1) = τ(1, { kind:number, simpleType:"xs:numeric", minimumInclusive:0 } ) = ε
Figure 49
Δ(a@T1?)/Δa:1 & b@T1? & c@T2? = b@T1? & c@T2?
So far so good! What about the other alteratives? These don't match (leading to a
derivative of ∅
) because b
and c
do not match a
and are not nullable.
Since we have
Figure 50
Δ(b@T1?)/Δa:1 = Δb@T1/Δa:1  ν(b@T1) = ∅  ∅ = ∅ Δ(c@T2?)/Δa:1 = Δc@T2/Δa:1  ν(c@T2) = ∅  ∅ = ∅
Figure 51
a@T1? & Δ(b@T1?)/Δa:1 & c@T2? = a@T1? & ∅ & c@T2? = ∅ a@T1? & b@T1? & Δ(c@T2?)/Δa:1 = a@T1? & b@T1? & ∅ = ∅
∅
and we are still left with
Figure 52
Δ(a@T1? & b@T1? & c@T2?)/Δa:1 = b@T1? & c@T2?
On other clause of the exclusive or (⊕) we have
Figure 53
Δ(a@T3? & b@T3? & d@T2?)/Δa:1 = Δ(a@T3?)/Δa:1 & b@T3? & d@T2?)  a@T3? & Δ(b@T3?)/Δa:1 & d@T2?  a@T3? & b@T3? & Δ(c@T2?)/Δa:1
Here the first alternative fails, because the type check fails:
Figure 54
Δ(a@T3?)/Δa:1 = Δa@T3/Δa:1  ν(a@T3) = [a=a], τ(1, { kind:number, simpleType:"xs:numeric", maximumInclusive:0 } )
∅
.
The other alternatives don't fare any better, for reasons similar to the case on the
other side (b
and d
do not match a
and are not nullable):
Figure 55
Δ(b@T3?)/Δa:1 = Δb@T3/Δa:1  ν(b@T3) = ∅ Δ(d@T2?)/Δa:1 = Δ(d@T2)/Δa:1  ν(d@T2) = ∅
∅
.
Putting this all together, we have got to
Figure 56
( < (b@T1? & c@T2?) ⊕ ∅, Δ(a & b & wc()*)/Δa:1 > )/Δ•:{b:1,c:2}
The clause coming from the required
property is a little easier since it doesn't involve any types:
Figure 57
Δ(a & b & wc()*)/Δa:1 = (Δa/Δa:1 & b & wc()*  a & Δb/Δa:1 & wc()*  a & b & Δwc()*/Δa:1) = b & wc()*
Δb/Δa:1
is ∅
(eliminating the second alternative) and
Figure 58
Δwc()*/Δa:1 = Δwc()/Δa:1,wc()* = ∅,wc()* = ∅
a
(eliminating the third).
Whew! We've consumed the first property. On to b
!
Figure 59
<(b@T1? & c@T2?) ⊕ ∅, b & wc()*>/Δ•:{b:1,c:2} = < Δ(b@T1? & c@T2?)/Δb:1 ⊕ Δ∅/Δb:1, Δ(b & wc()*)/Δb:1 >/Δc:2
Type matching b
fails. We have
Figure 60
Δ(b@T1? & c@T2?)/Δb:1 = (Δ(b@T1?)/Δb:1 & c@T2?)  (b@T1? & Δ(c@T2?)/Δb:1) = ([b=b],τ(1,λ(T1)) & c@T2?)  (b@T1? & ∅)
Figure 61
τ(1, { kind:number, simpleType:"xs:numeric", minimumInclusive:0 } ) = ∅
Figure 62
Δ∅/Δb:1 = ∅
Figure 63
Δ(b@T1? & c@T2?)/Δb:1 ⊕ Δ∅/Δb:1 = ∅ ⊕ ∅
On the other half of simultaneous operator we have:
Figure 64
Δ(b & wc()*)/Δb:1 = Δb/Δb:1 & wc()*  b & Δwc()*/Δb:1 = wc()*  b & (Δwc()/Δb:1,wc()*) = wc()*  b & (∅,wc()*) = wc()*  b & ∅ = wc()*
We're down to the final symbol, which is where things fail entirely because the exclusive or fails, dragging everything else down with it:
Figure 65
Δ<∅ ⊕ ∅, wc()*>/Δc:2 = Δ<∅, wc()*>/Δc:2 = Δ∅/Δc:2 = ∅
Since ∅
is not nullable, the object is not valid with respect to the schema.
Let's look an array example to show how arrays unroll.
Is the array
Figure 66
[1,"a","b"]
Figure 67
{ type:"array", items:[{type:"number"},{type:"string"}], additionalItems:false }
The mapped schema:
Figure 68
T0: { kind:array, model: •@T1,•@T2 } T1: { kind:number, simpleType:"xs:numeric" } T2: { kind:string, simpleType:"xs:string" }
We need to compute:
Figure 69
τ([1,"a","b"],λ(T0)) = τ([1,"a","b"],{kind:array}),τ([1,"a","b"], {model: •@T1,•@T2})
The kind check passes because the value is an array, leaving the derivative we need to compute being
Figure 70
Δ(•@T1,•@T2)/Δ•:[1,"a","b"] = ((Δ(•@T1,•@T2)/Δ•:1)/Δ•:"a")/Δ•:"b"
Starting with the first item:
Figure 71
Δ(•@T1,•@T2)/Δ•:1 = Δ(•@T1)/Δ•:1,•@T2  ν(•@T1),Δ(•@T2)/Δ•:1 = τ(1,λ(T1)),•@T2 = τ(1,{kind:number, simpleType:"xs:numeric"}),•@T2 = •@T2
The second item ("a"
) also works:
Figure 72
Δ(•@T2)/Δ•:"a" = τ("a",{kind:string, simpleType:"xs:string"}) = ε
But things run aground on the third item ("b"
) because we run out of expressions to match against the value:
Figure 73
Δε/Δ•:"b" = ∅
Here we have a nonnullable derivative, and the array is not valid with respect to the schema.
Finally, let's look at an example where we have nested objects.
Here is the schema:
Figure 74
{ type: "object", properties: { "a": { type: "object", properties: { "b": { type: "integer" } }, additionalProperties: false } }, additionalProperties: false }
This maps to:
Figure 75
TO = { kind:object, model: a@T1? } T1 = { kind:object, model: b@T2? } T2 = { kind:number, simpleType:"xs:integer" }
Let's validate the JSON object
Figure 76
{ a: { c: false } }
In the familiar way, we begin with the type matching of the toplevel type
Figure 77
τ({a:{c:false}}, λ(T0)) = τ({a:{c:false}}, {kind:object}), τ({a:{c:false}}, {model: a@T1?})
Figure 78
Δ(a@T1?)/Δa:{c:false}
Resolving this:
Figure 79
Δ(a@T1?)/Δa:{c:false} = Δ(a@T1)/Δa:{c:false}  ν(a@T1) = [a=a],τ({c:false},λ(T1)) = τ({c:false},λ(T1))
Here is where things get interesting, and we move down to the derivative of the subschema:
Figure 80
τ({c:false},λ(T1)) = τ({c:false},{kind:object}), τ({c:false}, {model: b@T2?}) = ν(Δ(b@T2?)/Δc:false)
The derivative for the nested object fails:
Figure 81
Δ(b@T2?)/Δc:false = Δb@T2/Δc:false  ν(b@T2)
∅
.
Therefore
Figure 82
τ({c:false},λ(T1)) = ∅
Figure 83
Δ(a@T1?)/Δa:{c:false} = ∅
τ({a:{c:false}},λ(T0))
equal to ∅
, so the object is not valid with respect to the schema.
The rules given above for mapping objectrelated properties produce very weak expressions, which require a lot of work to resolve. We could do slightly better if we recall that properties are unique in a JSON object and assume we can always access those properties in a particular order, say, in codepoint order. Then we can use the sequence operator instead of the intermix operator when we don't have wildcards. We could also use the sequence operator and just insert wildcards everywhere, but that is not better than just using intermix. Because JSON property names are unique within a given object, it is easy enough to prune the branches, especially if we recognize the pattern:
Figure 84
Δ(p1@T1? & p2@T2? & .. & pn@TN? & wc()@S*)/Δx:v = (Δp1@T1?/Δx:v & p2@T2? & ... & pn@TN? & wc()@S*)  (p1@T1? & Δp2@T2?/Δx:v & ... & pn@TN? & wc()@S*)  ... (p1@T1? & p2@T2? & ... & Δpn@TN?/Δx:v & wc()@S*)  (p1@T1? & p2@T2? & ... & pn@TN? & Δwc()@S*/Δx:v)
Furthermore, if x
equals any pj
we know this can only be nullable if we pick the branch where we take the derivative
of that property, because the derivative of any other pi
or of the wildcard will be ∅
(the wildcard excludes all property names mapped from properties
). That is:
Figure 85
Δ( p1@T1? & p2@T2? & ... & pn@Tn? & wc([p1,...,pn])@S* )/Δx:v = ([∃pjpj=x] ? (τ(v,Tj), (p1@T1? & ... & pj1@Tj1? & pj+1@Tj+1? & ... & pn@Tn? & wc()@S*) ) : (τ(v,S), (p1@T1? & p2@T2? & ... & pn@Tn? & wc([p1,...,pn])@S*) ) )
Figure 86
Δ( p1@T1? & p2@T2? & ... & pn@Tn? & wc([pi])@S* )/Δ{q1:v1,q2:v2,...,qm:vm} = τ(q1,S1),τ(q2,S2),...,τ(qm,Sm), (r1@V1? & r2@V2? & ... & wc([p1,...pn])@S*) where Sj=([∃ipi=pj]?Ti:S) and the rk@Vk are the pi@Si where ∄jpi=qj
It is possible to create yet another extension operator for this case and handle it specially as a performance optimization.
To make this whole scheme worthwhile, we want to amortize the cost of computing derivatives, both within a particular validation episode and across validation episodes. Indeed, the major theme of Brzozowski's paper is that it can be used to construct a finite state automaton for matching the complete extended regular expression. Having constructed such a thing, it can be used again and again. We choose to be slightly more incremental here, and only compute the parts of the machine we actually need, as we go. Experience with this technique in XML Schema has shown that this is more cost effective, as there can be large parts of the state machine that never get accessed by actual data, especially with large schemas.
If we don't worry about type matching, it is pretty straightforward to create a cache of derivative expressions keyed off an expression ID and a symbol.^{[13]} The expression ID needs to be computed by memoizing it, lest we keep storing the same derivative computation over and over again. That is, from the state machine point of view, we don't want to create lots of redundant states. It is reasonably straightforward to canonicalize expressions for memoization.
It is helpful, in practice, to prune the computed derivatives as much as possible,
precomputing any of the predicate functions (such as [a=x]
or ν
which lead to ε
and ∅
values) that can be used to prune further. Here is where the typematching causes
us trouble: we could compute this and prune the derivative, but not if we wish to
cache the result, because the cache is keyed only off the symbol, and τ
evaluation depends on the value as well.
What is to be done?
There are three basic options:
Don't cache derivatives at all. We can then prune based on τ
results without constraint.
Include the value in the cache key, so that we cache the appropriate derivative for the symbol:value pair.
Cache "partial" derivatives: derivatives where all τ
calls are replaced with a marker to be evaluated using actual value presented. When
a derivative is pulled from the cache, if it has any unresolved markers, evaluate
them and return the pruned result.
Not caching means a lot of work calculating answers: we'd probably be better off just
using a more direct algorithm and not using derivatives at all. The whole point of
this technique was to compute the finite state machine (here implemented in pieces
via the cache). Including values in the cache key means that the derivative cache
will have a lot of duplicative entries and looks like an unpromising approach, especially
when you remember that some of these values are whole JSON objects and arrays. We
essentially sacrifice the finiteness of states, as a practical matter. A more promising
variation is to cache partial derivatives and add a secondary cache for the unresolved
τ
calculations. The τ
cache only needs to store one bit of information for each type plus value, and we
can make different decisions about the sizing and lifespan of entries for the two
caches. The partial derivative cache (∂
cache) can be readily shared across validation episodes, while the τ
cache could live for only a single episode. We can tune whether the τ
cache caches objects and arrays at all, or if so, how large they might be before
we don't bother.
This test performs a simple validation on a small document (~3K) with up to five levels of nesting, a Medline citation, transcribed directly from its XML representation. The JSON Schema imposes structural constraints (required properties, and subproperties) as well as some value and simple type constraints. The document is repeatedly validated in a singlethreaded loop under various regimes. As a baseline, an XML Schema validation of the XML version of the document is also tested (also using derivatives and the same baseline simple type facet testing). The documents are valid against their respective schemas and both documents and schema files are pulled from a prewarmed cache. The table shows both the validations per second for the first validation, which for some of the regimes is the one that does most of the work, and for subsequent validations.
Format  Cached schemas?  Derivatives?  ∂ cache?  τ cache?  V/s (noninitial)  V/s (initial) 

XSD  yes  yes  N/A  N/A  15088  678 
JSON Schema  yes  yes  no  yes  12618  610 
JSON Schema  yes  yes  yes  yes  11380  415 
JSON Schema  yes  no  N/A  N/A  9419  1071 
JSON Schema  no  no  N/A  N/A  2375  1720 
JSON Schema  no  yes  no  yes  650  617 
JSON Schema  no  yes  yes  yes  500  461 
JSON Schema  yes  yes  no  no  220  203 
JSON Schema  yes  yes  yes  no  207  617 
JSON Schema  no  yes  no  no  201  210 
JSON Schema  no  yes  yes  no  178  178 
The variance in these numbers is quite high, although the relative ordering is generally the same.
What is perhaps most interesting here are the negative results:
The simple recursive walk algorithm with no caching is not greatly worse than any derivative algorithm, regardless of caching strategy. Given its substantially better initialization cost compared to the derivative algorithms, this suggests that where the same schema is not substantially reused, the derivative algorithms aren't worth the bother. This is despite the fact that this test is something of a best case scenario for the caching in the derivative algorithms.
Caching partial derivatives doesn't help performance in this scenario: hurts it slightly, in fact.
I believe the answer to the first negative result is that a lot of the work, at least
in this schema, is located in the τ
checking, not in the property relation constraints, and that the models are simplistic
enough that the recursive walk doesn't have to do much beyond looking at the current
set of properties, so the work of computing, constructing, and caching derivatives
doesn't really pay.
As for the partial derivatives, I think what is going on here is (1) the τ
cache is in play regardless, and in this scenario with the same document repeated
over and over, it provides a strong benefit and (2) resolving partial derivatives
requires construction of more objects representing the resolved derivative, and the
top line on performance here is time spent in allocation from the heap. We can see
the strong impact of the τ
cache by seeing the two orders of magnitude hit to performance when we turn it off.
This scenario is clearly a bestcase one for the τ
cache, as we can see from these numbers.
The strongest performance win here is from simply having compiled schema objects cached, which gives about a 5 times gain for the recursive walk algorithm, and a 20 times gain for the derivative algorithm. The usefulness of the derivative algorithm is much stronger for cached schemas, unsurprisingly, as the work of computing the derivatives can be amortized over multiple validation episodes.
Similar results obtain from running multiple validations in parallel. This test is
a validateonload one, where 18 thousand Medline citations are loaded into a database
after being validated, using 20 threads. Here the performance is overall lower and
overall more similar, probably because the update operations dominate performance
overall. Here the data is more varied, so the impact of the τ
cache is substantially reduced. This test doesn't separate out the costs of the initial
use of the schema, and in fact, in effect, each client thread may end up doing some
of the initialization work redundantly in parallel, which impacts the amortization.
Format  Cached schemas?  Derivatives?  ∂ cache?  τ cache?  V/s 

JSON Schema  yes  no  N/A  N/A  868 
JSON Schema  yes  yes  no  yes  701 
JSON Schema  yes  yes  no  no  675 
JSON Schema  yes  yes  yes  yes  608 
JSON Schema  yes  yes  yes  no  588 
Again we see the same fundamental negative results: derivatives aren't much better (and here are a little worse, in fact) than a simple recursive walk and derivative caching is not particularly helpful.
This is a somewhat disappointing discovery.
It is entirely possible that there are deficiencies in the specific implementation
strategy here. While tying model checking directly to τ
simplifies the model of what is going on, I don't think it is a great implementation
strategy: you end up computing τ
recursively on complex objects over and over as you move through branches of their
parent type's derivative. The τ
cache helps here somewhat, but it might be better to deal with this as a local validation
instead. I believe the numbers from the XML Schema implementation are instructive
here because that is how it works: local validity worked from the bottom up. Another
approach that might work better is a scheme to cache the fully resolved derivatives
also, keyed off the partial derivative expression ID and the actual value of whatever
unevaluated τ
's it might contain. That would avoid repeated construction of new expression objects
whenever we resolve from the ∂
cache.
It is possible to reclaim finiteness and perform full caching by vectorizing the values. Take each distinct facet anywhere in the schema. Each slot in the vector is 1 or 0 depending on whether a particular value satisfies that facet. This then compresses to a binary value of however many bits. While this recovers finiteness, evaluating a lot more facets that necessary in the moment is also not without cost, so it isn't clear what the conditions are where this could be a performance win.
It may well also be the case that the optimization of folding the required
property into the mapping for object composition properties is no optimization at
all. Perhaps we'd be better served with a mapping like:
Figure 87
μM({ properties: {p1:s1,...pm:sm}, patternProperties: {r1:pt1,...,ro:pto}, required:[rq1,...,rqn] }) => {model: < rq1 & ... & rqn & wc([rq1,...,rqn])*, p1@κ(t1)? & ... & pm@κ(sm)? & wc([p1,...,pm],+[r1,...,ro])*, (wc(r1)@κ(pt1)  wc([],[r1]))*, (wc(r2)@κ(pt2)  wc([],[r2]))*, ... (wc(rn)@κ(pto)  wc([],[ro]))* > }
Typetagged regular expressions like this can be applied to other problems, or can be used to recast other problems in different terms.
First, most obviously, we can recast XML Schema validation in these terms as well.
The mapping function is different (rather more straightforward). We do need additional
wildcard functions (see SperbergMcQueen). The τ
function evaluates the constraining facets of simple types, analogously to what we
have above, and looks at the nullability of the derivative of the content models of
complex types. These content models will have their type information retained, tagged
either with the type name itself (for named types), or with a generated stable ID.
We don't have a single toplevel type, so we kick things off by identifying the appropriate
element declaration (if any), and computing the τ
with respect to its type. Attribute checks folds into this model as well, through
appropriate mapping. The details of this are beyond the scope of this paper, and work
for another day.
Whether evaluated with derivatives or not, typetagged regular expressions provide a framework for clearer expression of certain kinds of patterns or for a more general understanding of certain kinds of regular expressions. Consider, for example, a conventional regular expression^{[14]}:
Figure 88
([01]\d)(2[04])([05]\d)([05]\d)(.\d\d\d)?(.{3,3})
A typetagged expression of this could be something like:
Figure 89
(\d\d)\@{HOUR}(\d,\d)\@{MINUTE}(\d,\d)\@{SECOND}(.,\d,\d,\d)?,(.{3,3})\@{TZ}
Figure 90
HOUR = {minimumInclusive:0, maximumInclusive:24} MINUTE = {minimumInclusive:0, maximumExclusive:60} SECOND = {minimumInclusive:0, maximumExclusive:60} TZ = {enum:["PST","PDT",....]}
\@{type}
as a marker rather than the bare @type
.)
I believe this is a clearer expression of the intention and meaning of the desired pattern, and may be useful for expositional clarity, if nothing else.
There are extensions to regular expressions that capture a certain class of τ
functions in this fashion already: Unicode character category and block escapes:
Figure 91
((\p{Lu}\p{Ll}*)_)* \p{IsGreek}*
The first of these expressions matches camelcased strings of letters with an underbar separator (e.g. "Example_Match") and the second matches a sequence of Greek characters. Rewriting them as typetagged expressions in the notation of this paper:
Figure 92
((wc()@Lu,(wc()@Ll)*),_)* (wc()@IsGreek)*
Clearly this is not a compact rewrite! However, it does allow us to think about the
escape as less a special case than as a way of naming something whose significance
is defined by the τ
function. Let us reimagine the \p
escape as naming a type where the τ
function in this case is either "matches a Unicode category with the given name"
or "matches a Unicode block with the same name as what follows the 'Is'".
One way to think about this is that we have a representation of a state machine with
τ
representing a conditional function on the transitions. There is no reason (other
than computation efficiency) why τ
needs to be a "type" function at all. What restrictions should we place on τ
? Any?
This work followed a path of applying Brzozowski derivatives to JSON Schema validation, following the pattern of using them for XML Schema validation. This led to the need to extend regular expressions with type tagging, and adding a type checking function that applied the appropriate matching constraints for the compound symbol, and to allow for typeonly symbols in the expressions. It turns out, this formulation also works to provide a more uniform story for XML Schema validation beyond content model checking. More generally, tagged symbols combined with a tag matching function can be used more widely to improve the clarity of certain regular expressions and provide a framework for expressing other kinds of matching problems. However, the current implementation performance numbers are not particularly compelling: the need to match types against an infinitude of specific values makes it difficult to find effective caching strategies, undermining the point of using derivatives as an implementation technique in the first place.
[JSValidation] H. Andrews, Editor, G. Luff. JSON Schema Validation: A Vocabulary for Structural Validation of JSON IETF InternetDraft, March 19, 2018. Available at https://tools.ietf.org/html/drafthandrewsjsonschemavalidation01
[JSONSchema] H. Andrews, Editor. JSON Schema: A Media Type for Describing JSON Documents IETF InternetDraft, March 19, 2018. Available at https://tools.ietf.org/html/drafthandrewsjsonschema01
[Brzozowski64]
Janusz A. Brzozowski
Derivatives of Regular Expressions
. Journal of the ACM 11.4 (1964):
481494.
[XSD2] W3C: David Peterson, Shudi (Sandy) Gao 高殊镝, Ashok Malhotra, C.M. SperbergMcQueen, and Henry S. Thompson, editors. W3C XML Schema Definition Language (XSD) 1.1 Part 2: Datatypes. W3C. April 2012. http://www.w3.org/TR/xmlschema112/
[SperbergMcQueen05]
Michael SperbergMcQueen
Applications of Brzozowski derivatives to XML Schema processing
. Presented at Extreme Markup Languages, Montréal, Canada, August 15, 2005. Available
at http://cmsmcq.com/2005/EML2005Sper0518.html.
^{[1]} The links to Clark's "An algorithm for RELAX NG validation" and Bob Foster's blob referenced by SperbergMcQueen are, alas, dead.
^{[2]} Strictly speaking, in this case, where we don't have any actual constraints on a
, b
, and c
we could have dispensed with properties
. If we wanted to ensure the properties were all numeric, we would need to have a
schema like {"type":"numeric"}
instead of true
for each of the properties, and properties
would be necessary.
^{[3]} The true
schema is not actually equivalent to ε
: derivative rules are different.
^{[4]} Actually, integer
is another alternative value for the type
property, but as the facets for integer
are the same as for number
, we don't need to account for it separately here.
^{[5]} There is a small detail to take care of: it is possible to end up with more than one
model
property according to these rules, but we require the output to have unique keys.
In that case we use the simultaneous operator to combine them: {model:m1} + {model:m2} => model:<m1,m2>
.
^{[6]} You can't actually implement things that way as there may be circular references. In practice, the usual kind of prepass or symbol table fixup is required, if for no other reason than to prevent problems. The JSON Schema specification does say that schemas "should not" make use of infinitely recursive circular references as they may blow up naive validators and specifies such behaviour as undefined.
^{[7]} Support for the format
property is optional, as are the specific kinds of formats that are supported. Here
I give an exemplar as to how it might be mapped. For other format kinds more elaborate
mappings would be required, or the use of special simpleType
values that encapsulate any checks that can't be mapped to other facets.
^{[8]} From an implementation perspective both the type matching rules of the propertyNames
property and its mapping are a little strange and special. We store the mapping of
s1
, but realistically what is stored is the key (κ(μ(s1))
) with the actually mapped faceted type stored in a lookup table elsewhere.
^{[9]} Once you follow the conventional dodge of adding a unique symbol for all nonmatching symbols encountered.
^{[10]} Note: There are actually a couple of annoying of special cases around empty objects and handling subtype recursion that I have elided here. There is an asymmetry due to the fact that top level values have no name, but values inside a JSON object (including other objects) do.
^{[11]} See SperbergMcQueen05 for many of these; the logic for the others is similar. Since JSON objects have unique keys and contain nothing but keyed items, the analysis of intermix with respect to UPA apply here too.
^{[12]} T0
, T1
, etc. are the identifiers given to us by κ
. The set of properties on the other side of the equals sign is what is produced from
the mapping.
^{[13]} Given JSON's proclivity for openended sets of properties, it is useful to be able to group all symbols not found in the schema together as a single "unknown symbol" here.
^{[14]} For those unfamiliar with it: This grepstyle regular expression uses a slightly different
notation than what we've used in this paper: The sequence operator (,
) is assumed, [05]
is an abbreviation for (012345)
, \d
is an abbreviation for a choice between all digits, and dot (.
) is a wildcard.