Introduction

Whereas the computing community invents a new programming language almost every week, and the best ideas from these many experiments find their way into perhaps one programming language a year that sees the light of day outside the project that conceived it, new markup languages are rather rare, and most attempts to create them (such as the MicroXML project1) have self-imposed constraints of compatibility that limit the freedom of the designers to find new ways of doing things, even in areas where existing designs are universally acknowledged to be problematic.

Invited to run a course at a summer school in August 2012 for a high-achieving group of German undergraduates, I decided to take the opportunity to remedy this. While enjoying the thin air of the Swiss Alps in the Romansch-speaking village of Ftan at 1700m above sea level, the students spent the first week learning the technologies in the XML stack, and the second week designing a replacement. The result was FtanML. [1]

FtanML Goals

Some of the design goals the students set themselves at the end of the first week were:

  • The language would be as good as JSON3 in handling typed data, and as good as XML in handling documents.

  • The language would be more concise than XML, while still being human-readable.

  • Both a syntax and a data model would be defined; the data model must map readily to data structures available in most modern programming languages.

Perhaps as important was a non-goal of which we had to remind ourselves frequently: compatibility with the past was not an objective. We did not want to repeat other people's mistakes for the sake of compatibility, whether at the level of documents, parsers, APIs, editing tools, or simply user expectations. Associated with this goal was the implicit decision that we would not compromise technical quality in the interests of market acceptance. The aim was to do it right, and we would not measure success by the level of adoption. Having said that, there was no point in being needlessly different when there was nothing wrong with existing designs.

During the second week of the course we defined the FtanML markup language and object model, and implemented a parser using JavaCC. In the weeks after the course, some of the students rewrote the parser in Scala, and together we worked on extending the system with a type/constraint language. Inevitably, with the students dispersed to their various institutions, momentum was lost, but I decided that there were enough good ideas that it was worth bringing the design to some kind of completion. This paper provides an overview of the language rather than a complete specification (which remains as work to be done). A Scala implementation covering a significant subset is available at [2].

Requirements Background

XML has been remarkably successful and is widely used. It meets a wide variety of needs, achieves a high level of interoperability, and is not expensive to implement.

Nevertheless, over a period of 15 years' use, the drawbacks and limitations of XML have become well known, and are acknowledged by XML's critics and enthusiasts alike. Perhaps the most notable limitations and frustrations are:

  • XML has been widely adopted as a serialization format for structured data, but its data model has a poor fit to the type systems of most popular programming languages. Hence alternatives such as JSON and YAML4.

  • XML is over-complex. Many of its features are rarely used, or used only in very simple ways, but still make everything more complicated. Hence MicroXML.

  • XML cannot handle overlap or graph structures. Hence LMNL5 and GODDAG6.

  • XML is verbose and inefficient. Hence the various Binary XML contenders, including Fast Infoset7 and EXI8, as well as the adoption of custom non-XML syntax for various applications such as RelaxNG and RDF in direct competition with an XML syntax for the same information.

  • XML is syntax without an agreed data model. No-one knows, for example, whether CDATA sections should be treated as information-bearing or not. Similarly for comments. Hence the myriad XML data models such as DOM and XDM, all of them different.

So there's clearly room for improvement. A standard, once entrenched, rarely gives way to a technically superior alternative: the QWERTY keyboard is an oft-cited example, and XML will probably be no exception. However, there's room for diversity, and the aim of this exercise is to explore what is possible. It doesn't tackle all the problems noted above (for example, there's nothing on overlap or graph structures); but it tries to address most of them.

FtanML: The Markup Language

This section presents the syntax of FtanML. We'll present the "data-only" core of the language at this stage, but with some forwards references to how the language is subsequently extended to enable active scripting of documents.

A document (the unit of input to the parser) is a sequence of Unicode characters conforming to the grammar defined in this section. The encoding of characters as octets (or as scratches on clay tablets) is out of scope — it belongs in a different layer of the protocol stack. But if in doubt, UTF-8 is recommended.

The document must match the value production.

value ::= null | boolean | number | string | list | element | richText
        

As this production shows, there are seven kinds of value, which we will present in turn, starting with the simplest. The term "rich text" means text with interspersed markup: what the markup community traditionally calls "mixed content".

Later we will introduce an eighth kind of value, namely functions. But first, let's start with an example.

FtanML Example: the Purchase Order

This is what the purchase order from the XML Schema Primer2 might look like in FtanML.

<purchaseOrder 
   orderDate="1999-10-20" 
   shipTo = <country="US" [
      <name "Alice Smith">
      <street "123 Maple Street">
      <city "Mill Valley">
      <state "CA">
      <zip 90952>
   ]>
   billTo = <country="US" [
      <name "Robert Smith">
      <street "8 Oak Avenue">
      <city "Old Town">
      <state "PA">
      <zip 95819>
   ]>
   comment = |<emph |Hurry|>, my lawn is going wild|
   items = [
      <  partNum="872-AA"
         productName="Lawnmower"
         quantity=1
         USPrice=148.95
         comment=|Confirm this is <strong |electric|>|
      >
      <  partNum="926-AA"
         productName="Baby Monitor"
         quantity=1
         USPrice=39.98
         shipDate="1999-05-21"
      >
   ]
>

This example follows the example given in the XML Schema Primer very closely; I've only made one change, which is to use rich text in the comment fields. Let's compare it with the XML version:

  • End tags reduce to a simple ">".

  • The content of an element, and the content of an attribute, can be either a string (in single or double quotes), a number, a boolean (not used in this example), rich text (delimited with vertical bars), an element, or a list of elements (inter alia). When elements have element content, the child elements are enclosed in a list marked by square brackets.

  • Since an attribute can contain anything an element can contain, it's possible to use structured attributes, and I have taken advantage of this. I have chosen to use attributes rather than child elements in cases where ordering does not matter, and where there is only one child of the parent element with a given name: specifically for the top-level properties of a purchase order, and for the properties of each item. Where there is some significance in the ordering, as with the components of an address, I chose to use child elements.

  • In the list of items, the original XML has an element named items, whose children are all elements named item. Since the name of the child element is always the same, it is redundant, so I chose to leave it out: the content of the items attribute is now a list of anonymous elements.

  • There's a difference between a singleton and a list of length one. Lists are always explicitly marked with square brackets. That might be a little inconvenient for authors, but it makes life a lot easier for the programmer at the receiving end. (You could choose to allow the items attribute to contain a single item rather than a list if only one item has been ordered, but the program reading the data would then have to cater for both possibilities.)

  • For the purpose of the example I have followed the XML Schema Primer in defining the ZIP code as a number, though in reality it should be a string of digits, which is not the same thing.

  • There's no ambiguity about where whitespace is and is not significant. It's only significant if it appears in a string, or in rich text.

If we compare this with how it might have been done in JSON, there are two main differences. Firstly, JSON provides no satisfactory way to handle the mixed content comments. Secondly, with JSON we would have to make a choice how to represent the addresses: either use an object (i.e. a map), in which case ordering information is lost, or use an array in which case the components have no names. A minor difference with JSON, or at least with official JSON, is that we don't need quotes around the element and attribute names.

Now let's look at the individual constructs of FtanML.

The null value

null ::= "null"
            

There is a single value in this class, denoted by the keyword "null". It is borrowed directly from JSON, but plays a wider part in the data model.

Boolean values

boolean ::= "true" | "false"
            

There are two boolean values, denoted by the keywords "true" and "false".

Numeric values

number ::= "-"? digits ("." digits)? ([eE] [+-]? digits)?
digits ::= [0-9]+
            

The production rule for numbers is a little different from both the DoubleLiteral of XPath 2.0 (it requires digits both before and after the decimal point), and the equivalent in JSON (it allows leading zeros). The value space is not binary floating point, but decimal. Specifically, it is the set of values that can be represented in the form N * 10^M where N and M are integers, and N is not a multiple of ten. Implementations may impose limits on this infinite set.

Why decimals? Because that's what most human beings on the planet use in their everyday lives. Floating-point binary is designed for machines, not for humans. Also, because it survives round-trip conversion to and from text without ambiguity or loss of precision. However, the use of decimals gives a problem with the design goal that it should be easy to program using conventional programming languages. We take a hit here: in the case of programming languages with no decimal data type, numbers may be converted to whatever number system that language uses. But the native language for processing FtanML, namely FtanSkrit, treats the values as decimals.

Strings

string ::= ('"' (charRep - '"')* '"') | ("'" (charRep - "'")* "'")
charRep ::= (char - "\") | escape
char ::= (any Unicode character)
escape ::= (see prose)
            

Strings are enclosed in either double or single quotes. The value space for strings is the set of all sequences of Unicode characters. In the FtanML representation of a string, these characters are represented as themselves, except in the case of characters that have a special meaning, notably the string delimiter, and the escape character "\".

Escape sequences fall into a number of categories:

  • Whitespace escapes: \n, \r, \t, \s, and \S represent newline, carriage return, tab, space, and non-breaking space respectively.

  • Formatting escapes: \ followed by a sequence of whitespace characters represents nothing. This means that a FtanML editor can reformat the text for display purposes by inserting or removing escaped newlines without changing the actual content.

  • Special character escapes: \\ for backslash, \" for quotation mark, \' for apostrophe, \| for vertical bar, \` for backtick, \< for a left angle bracket, \[ for a left square bracket, \{ for a left curly brace.

  • Unicode codepoint escapes: \xHHHHH; represents the Unicode codepoint whose hexadecimal value is HHHHH. This may be any number of digits, followed by a semicolon. (Unlike JSON, non-BMP characters are represented by the actual codepoint, not by a surrogate pair.)

  • Cells: \[§....§] where § is any character that does not appear in the string. This is analogous to XML's CDATA section, except that it can also be used in attributes: it allows a literal string to appear without escaping of special characters. For example a sequence of four backslashes might be written \[⟡\\\\⟡]. Cells are handy for things such as regular expressions and Windows filenames, and for authoring papers that describe new markup languages.[2]

The only characters that must be escaped in strings are \, { , and the character used as the string delimiter (" or '). We'll come on to the significance of curly braces later.

Lists

A list is a sequence of values. The values may be of any of the seven kinds (null, boolean, number, string, list, element, or rich text).

The unabbreviated syntax is the same as for arrays in JSON:

list ::= "[" (value ("," value)* )? "]"
            

For example, [1, 3, "London", null]

Two abbreviations are allowed:

  • Commas may be omitted, so [1 2 3] is equivalent to [1,2,3] and [<first><last>] is equivalent to [<first>,<last>].

  • The value null is implicit if there is nothing between two commas, or before the first comma, or after the last. So [,,] is equivalent to [null,null,null]

The effect of these two rules is that the abbreviated syntax for lists becomes:

lists ::= "[" ( value | ",")* "]"
            

Whitespace is needed between two values only where necessary to terminate a token; specifically, when one value ends with an alphanumeric character and the next starts with an alphanumeric character.

Elements

Elements serve the same purpose as objects (maps) in JSON and elements in XML.

element ::= "<" name? (name "=" value)* content? ">"
content ::= value
            

Elements have three parts: an optional name, a set of name/value pairs called attributes, and an optional value referred to as the element's content.

The values of attributes can be of any type: not just strings as in XML, but numbers, booleans, lists, elements, rich text. An attribute with the value null is deemed equivalent to omitting the attribute.[3]

Attribute names within an element must be distinct.

Like attributes, the content value can be of any type.

As with lists, whitespace is needed only where necessary to terminate a token.

We'll have more to say on element and attribute names later. For the moment, suffice it to say that the name can be any non-empty string. If the name contains special characters it can be written within backticks (a convention borrowed from the SQL world).

Here are some examples of elements. (We haven't explained rich text yet, so we won't use it in any of our examples):

Table I

Examples of Elements

Example Explanation
<> An empty element (no name, attributes, or content)
<br> An empty element named br
<age 23> An element whose name is age and whose content value is the number 23
<colors ["red", "green", "blue"]> An element whose name is colors and whose content value is a list of three strings
<x=0.13 y=0.57> An unnamed element containing two attributes, both numeric
<polygon coords=[[1,1], [1,3], [3,2]]> An element named polygon with an attribute named coords whose content value is a list; the list contains three sublists, and each sublist contains two numbers.
<[<i><j><k>]> An unnamed element whose content value is a list of three elements. Note the omission of the optional commas.
<`Graduate Trainee` `date of birth`="1995-01-01"> An element where both the element name and attribute name contain spaces.

Rich Text

Rich Text (known in the XML world as mixed content) consists of characters and markup, or more specifically a sequence whose members are either characters or elements.

richText ::= "|" (charRep | element) "|"
            

Rich text is written between vertical bars.[4]

For example: |H<sub|2|>O| represents text consisting of the character "H", an element whose name is sub and whose content is the rich text "2", and the character "O".

Escapes can be used in rich text just as they can in strings. Any recognized escape sequence may be used; the only characters that must be escaped are "\", "|", "{", and "<".

Whitespace

FtanML (unlike XML) is explicit about the difference between significant and insignificant whitespace.

Whitespace appearing directly within a string or within rich text is significant and is retained in the data model — except that a sequence of whitespace characters preceded by a backslash is ignored (this is formatting whitespace, used only to make the text more easily readable on screen or paper). Whitespace between tokens in a list or element is insignificant and is not retained. Whitespace is never required between tokens unless necessary for disambiguation.

Note that because elements may be embedded in rich text, these rules apply recursively. Whitespace characters appearing between the tokens of an element that itself appears within rich text are not significant; it is the immediate container that matters. Support for rich text means that unlike JSON, this is not a two-level grammar where it makes sense to think of a tokenization phase followed by a syntax analysis phase, with whitespace being discarded during tokenization.

Names and Namespaces

As stated earlier, the name of an element or attribute may be any string. Names without special characters are called simple names; those containing special characters must be written with enclosing backticks (grave accent, x60), and are called quoted names.

The rule for a simple name is that it must begin with a letter or underscore, and continue with letters, digits, or underscore. The terms "letter" and "digit" are defined by reference to Unicode character categories.

A quoted name may use escaped characters in the same way as a string literal. The only characters that must be escaped are the backslash and backtick.

name ::= simpleName | quotedName
simpleName ::= [\p{L}_][\p{L}\p{D}_]*
quotedName ::= "`" ((charRep - "`") | Escape)+ "`"
             

A name written in a FtanML document, with or without backticks, cannot be zero length; in the data model, however, the content value is modelled as an attribute with a zero-length name.

There are no namespaces in FtanML.

As a matter of convention, it is recommended that an element or attribute intended to be used in an alien context, that is, a context where the containing element is part of a different vocabulary defined by a different specification, should be made unique by use of a "reverse-DNS" qualified name along the lines of org_w3c_xsl_transform.

By contrast, in the normal case where an element or attribute always has a containing element whose name is defined as part of the same vocabulary, short names such as status or name are perfectly adequate and cause no ambiguity.

For interoperability with XML, there may be cases where it is desirable to use the same names for elements and attributes as defined in an XML vocabulary. There are two ways this might be done:

  • The XML expanded name can be used in Clark notation, enclosed in backticks. For example: [<`{http://www.w3.org/1999/XSL/Transform}stylesheet` version="2.0"...>

  • Prefixes and namespace declaration attributes may be used, following XML conventions: <`xsl:stylesheet` `xmlns:xsl`="http://www.w3.org/1999/XSL/Transform" version="2.0"...>. The FtanML system will not attach any meaning to such namespace declaration attributes, but it is capable of representing them if required. Note that any name containing a colon (or various other characters such as ".") needs to be backtick-quoted.

Data Model

The data model for FtanML corresponds closely to the syntactic structure.

Null values, booleans, strings, and numbers need no further explanation.

A list is an ordered sequence of values; a list of length one is not the same thing as its singleton member.

An element comprises a name (which is a string, or absent) and a set of zero or more name/value pairs, the element's attributes. The content value of the element is modelled as an attribute whose name is the zero-length string. Attributes whose value is null are treated as absent.

Rich text is modelled as a sequence of strings and elements, in which no string is zero-length, and no two strings are immediately adjacent. But note that rich text is a distinct data type and is distinguishable from a list of strings and elements. [5]

All values in the model are immutable; modifications always involve creating new values rather than modifying existing values. There is no notion of identity; it is not meaningful to ask whether two lists both containing the values [1,2,3] are "the same list", and this is also true for elements.

These concepts have mappings to the data structures of popular programming language that in most cases are fairly obvious. There are a few exceptions: some languages do not have a natural way of representing decimal numbers; others have difficulty representing Unicode strings, especially strings in which the NUL character (x00) is permitted. The way in which such conflicts are resolved is outside the scope of this paper.

A noteworthy feature of the data model is that there are no "parent pointers". It is not possible to navigate from a value to its container. Closely related to this, values have no "identity" in the sense of object-oriented data models. In this respect the data model follows JSON rather than the various models used to represent XML. The absence of parent references and object identity creates some challenges, but has many benefits in establishing a purely functional semantics for the processing language, and in enabling efficient transformation: it means, for example, that copying a subtree from one element to another is a very cheap operation, because the physical data can be shared. [6]

The Schema Language: FtanGram

The schema language can be used to define constraints on values, including constraints on entire documents. This is the only purpose of a schema; validation returns a true or false answer, perhaps with a stream of error messages as a side effect, but it does not change the data being validated in any way, except perhaps as an internal optimization.

A type is thus a predicate; it distinguishes values that match the type from those that do not.

A schema is a set of named types. The seven named types null, boolean, number, string, list, element, and text are always available; other types are user-defined.

Types have a representation as FtanML elements, and we will use this representation in discussing types. However, the element used to represent a type must not be confused with the type itself.

The convention for type representations is to use elements such as <number gt=0 le=1000>, where number is the name of a base type, and attributes such as gt=0 and le=1000 define constraints. These attributes are referred to as facets. If there are multiple attributes, they define multiple constraints, which are independent and orthogonal. In this example, the gt facet defines a minimum value (exclusive), while the le facet defines a maximum value (inclusive). Specifying a base type is often unnecessary — in this example every value that can be greater than zero is necessarily a number, so every value that satisfies the predicate will also satisfy the base type. However, including the base type can still be useful to aid clarity.

Although we speak of "base type" here, there is no type hierarchy. One value can belong to any number of types, and although it may be true that one type subsumes another, the language makes no use of the fact. Naming a base type in a type representation merely indicates that to satisfy the type, a value must satisfy all the constraints imposed by the base type in addition to the facets explicitly listed.

Before we get into a detailed exposition, we'll again start with an example.

FtanGram Example: the Purchase Order Schema

In this section we present the schema for the purchase order shown earlier. This is based on the example schema in the XSD primer, modified to correspond with the way we restructured the instance document to take advantage of FtanML.

 <org_ftanml_schema [
  <import "ftan_calendar.ftg">
  <types
   purchaseOrderType = 
     <element form=<purchaseOrder 
                      shipTo=<addressType>
                      billTo=<addressType>
                      comment=<nullable<text elements=<inlineType>>>
                      items=<occurs=[1,] <itemType>>
     >
   addressType = 
     <element form=<  country=<eq="US">
                      <seq [ <element form=<name <string>>>,
                             <element form=<street <string>>>,
                             <element form=<city <string>>>,
                             <element form=<state <string>>>,
                             <element form=<zip <number>>>]>
     >
   itemType =
     <element form=<  partNum=<SKUType>
                      productName=<string>
                      quantity=<number ge=1 lt=100 step=1>
                      USPrice=<number ge=0 step=0.01>
                      comment=<nullable<text elements=<inlineType>>>
                      shipDate=<nullable<org_ftanml_calendar_dateType>>
                   >
     >
   inlineType = 
     <element elemName=<enum=["ital", "bold"]> 
              form=<<inlineType>>
     > 
   SKUType = <string pattern="\[#\d{3}-[A-Z]{2}#]">  
  >   
]>

Looking at this in a little detail, we see:

  • A schema is a set of named types. Some of these types are defined inline, some (in this case org_ftanml_calendar_dateType) are imported from an external type library.

  • Elements are defined using the form attribute. The value of this attribute is a proforma element. The name of the proforma element matches the name of the instance element; the attributes of the proforma element define the types of the attributes of the instance element; and the content value of the proforma element defines the type of the content value of the instance element.

  • An optional attribute is given a type such as <nullable<T>>. This reflects the fact that an absent attribute is equivalent to an attribute that has the explicit value of null; so as well as the normal type of the attribute, the schema must also allow it to take the value null.

  • Note the use of a "cell" for escaping the regular expression in the pattern facet for SKUType. This helps to avoid clutter in a string that makes generous use of special characters, especially backslashes.

Constructing Types

The construct <value> represents a type that matches every value.

Given types T, U, V, the construct <anyOf [T, U, V]> represents the union of these types, while <allOf [T, U, V]> represents their intersection.

For example, <anyOf [<number>, <string>]> allows numbers and strings, while <allOf [<positive>, <even>]> allows values provided they satisfy both the (user-defined) types positive and even.

For convenience, the construct <nullable <T>> is equivalent to <anyOf [<T>, <null>]>: that is, either T or null. Thus <nullable <number>> matches either a number, or null.

An enumeration type can be defined using the construct <enum=[A,B,C,...]>. For example, <enum=["red", "green", "blue"]> matches the three specified strings and nothing else. A singleton enumeration can be defined with the eq facet: for example <eq=""> matches the zero-length string only.

The construct <not <T>> denotes a type that matches all values that are not instances of T. This can be useful in constructing more complex types; for example <not<eq="">> matches all non-empty strings, while <allOf [<number>, <not <eq=0>>]> matches values that are numbers and that are not equal to zero.

The most general way of defining a restriction is with an assertion facet, for example: <assert={$.startsWith("abc")}>. To understand assertions, however, we need to look at the scripting language, which comes later in the paper. (The curly braces signal that the value is a function; this represents an extension to the base FtanML syntax which is used only in scripts.)

Restricting numbers

Numeric ranges may be defined using the four attributes ge, gt, le, and lt, corresponding to the XML Schema facets minInclusive, minExclusive, maxInclusive, and maxExclusive, together with the facets eq and ne which are applicable to all values. For example, the type consisting of numbers in the range 0 to 100 inclusive may defined as <number ge=0 le=100>. (As mentioned earlier, the element name number is redundant, because only a number can satisfy the other constraints.)

A step facet constrains the number to be an integer multiple of the given increment. The most common values (both found in our example schema) are 1, which requires the value to be an integer, and 0.01, which is often suitable for currency amounts. Specifying step=17.2 would be unusual, but is perfectly legal. The facet does not constrain the way the value is written, for example an integer can be validly written as 1.00000.

Restricting strings

Strings may be restricted using a regular expression, for example <string pattern="[A-Z]*">. There are no special facets for defining a minimum, fixed, or maximum length, since regular expressions are sufficient for this purpose.

Restricting lists

A list can be constrained with a grammar. A grammar is a facet like any other: just another way of defining a restriction on the content, and it is defined in the same way: <list grammar=....>. A simple grammar might allow a list to consist of a sequence of zero or more numbers. This would be defined like this:

<list grammar=<number occurs=[0,]>>                

To take another example, a grammar might require a value to be a list comprising a string, a number, and a boolean. Here is the definition:

<list grammar=<seq [<string>, <number>, <boolean>]>>                

Unlike most schema languages in the XML world, grammars can constrain any sequence of values, not only a sequence of elements. In principle, if there are subtypes of string representing nouns, verbs, and so on, then a grammar could constrain a list to contain a sequence of words making up an English sentence.

The "alphabet" of the grammar — the set of tokens it recognizes — is the set of types. The fact that a value might belong to more than one of these types does not matter. The grammar exists not to define an unambiguous parse tree of the input, but only to determine whether the input is valid against the type definition or not.

A grammar can be represented as a tree of particles. Each particle consists of a term (what does it match?), and a repetition indicator (how often does it match?). For leaf particles, the term is a type. Non-leaf particles are either sequence particles or choice particles, and in each case the term is the set of child particles in the tree.

The value of the grammar facet is an element representing the root particle in this tree.

The three kinds of particle are represented as follows:

  • A sequence particle is represented by an element named seq; an optional occurs attribute; and content which is a list containing the child particles in the tree. For example: <seq occurs=[0,] [<white>,<black>]>, which matches an alternating sequence of values of types <white> and <black>.

  • A choice particle is represented by an element named choice; an optional occurs attribute; and content which is a list containing the child particles in the tree. For example: <choice occurs=[0,] [<white>,<black>]>, which matches sequence of values, each of which can be either of <white> or <black> type.

  • A leaf particle is represented by the same element used to describe the type, augmented if necessary with an occurs attribute. For example <number>, or <number occurs=10>. The occurs attribute defaults to 1; it appears alongside the attributes defining facets of the type, though it is not really a property of the type, but rather of the particle referring to the type.

The value of the occurs attribute is either an integer (indicating a fixed number of occurrences), or a list of size two (indicating a range with a minimum and maximum). The first item must be an integer, the second can be either another integer, or null to indicate an unbounded range. For example [0,1] indicates an optional particle (zero or one occurrences), [0,] indicates zero or more, and [1,] indicates one or more. The default is occurs=1.

Some further examples of grammars are shown in the table below:

Table II

Examples of Grammars

Example Explanation
<seq [<string>, <number>, <number>]> A string followed by two numbers
<seq [<string>, <number occurs=2>]> A string followed by two numbers
<occurs=[0,] <seq [<string>, <number>]>> An alternating sequence of strings and numbers
<enum=["red", "green", "blue"] occurs=[1,]> A sequence of one or more strings each taken from a defined set of colour values
<occurs=[0,100] <choice [<string>, <number>]>> A list of up to 100 items, each of which may be either a string or a number. Note that when the sub-particles of a choice are leaf particles, an alternative approach is to define a union type using <anyOf>

Many of these examples serve the purpose that in XML Schema would be achieved using simple types of variety list or union. But of course, in the document markup tradition, grammars are commonly used to define sequences of elements, and we will see examples of this in the next section.

Restricting elements

The simplest way to place restrictions on elements is by use of the form facet. Its value is an element, known as a proforma, which works as follows:

  • The name of the proforma element constrains the name of the target element.

  • The attributes of the proforma element constrain the attributes of the target element.

  • The content value of the proforma element constrains the content value of the target element.

For example, the proforma:

                <img height=<number> width=<number> <null>>
            

represents an element whose name must be "img", whose height and width attributes must be numbers, and whose content value must be absent (null).

This proforma can be used to define an element type like this:

                <element form=<img height=<number> width=<number> <null>>>
            

Like all facets, a proforma can only define restrictions. If the proforma includes no element name, then it places no restrictions on the element name. If a particular attribute is not present in the proforma, then it places no restrictions on the presence or content of that attribute. If the proforma has no content value, then the content value of the target element is unconstrained.

If an attribute is to be optional, this can be indicated by permitting null as the value: for example writing height=<nullable<number>> indicates that the height attribute must either be a number, or null. Recall that omitting an attribute is the same as giving it a value of null.

Some additional facets are available for elements for use where the proforma construct is insufficiently expressive:

  • The elemName facet defines the type of the element name.

    For example <element elemName=<enum=["i", "b", "u"]>> constrains the element name to be one of the names listed.

  • The attName facet defines the type that all attribute names must conform to (for example, as an enumeration, or by means of a pattern). This is the easiest way of prohibiting attributes from appearing (the other way is to constrain the value to be null). For example, attName=<ne="xmlns"> would disallow the attribute name xmlns; this constraint could also be expressed in the proforma as xmlns=<null>.

  • For convenience, as an alternative to using a proforma, the content of the element can be constrained using the content facet. The value is a type. For example, content=<boolean> constrains the content to be the boolean value true or false, while content=<null> constrains it to be null (which can be achieved either by omitting the content, or using the FtanML keyword null).

We can now see how to define an element type that participates in the content model of another element type. Suppose we have an element named items whose children are elements named item with string content. We can define the type of items like this:

<element form=
  <items 
    grammar=<element form=<item <string>> occurs=[0,]>
  >  
>

(I find it useful when writing such constructs to ensure that every angle bracket is aligned either vertically or horizontally with its partner, and to limit the nesting of angle brackets on a single line to about 3.)

Content models like this would quickly become unwieldy if the whole structure had to be defined inline. In addition, it would not be possible to reuse types in different parts of the model. It is therefore possible for the definition of one type to refer to other types by name. The above example could be expressed using named types in a schema, thus:

<types
  itemsType = <element form=<items <grammar=<itemType occurs=[0,]>>>>
  itemType = <element form=<item <string>>>
>

Restricting Rich Text

Most of the time, the only restriction that needs to be placed on rich text is to define what elements may appear within it. This is done with an elements facet, whose value is a type. All elements appearing in the text must conform to this type.

We don't expect it to be used very often, but FtanML also allows rich text to be constrained with a grammar. The rules for defining a grammar are exactly the same as for lists, and they define the grammar when the text is considered as a list containing strings and elements. For example, a grammar might define that the first thing to appear is a headword element, and after that there are no constraints.

Uniqueness and Referential Constraints

As with XML Schema, definition of constraints takes advantage of the processing language, so this section contains some forward references to facilities not yet introduced.

Uniqueness is enforced by a function-valued facet. For example:

<list unique={$@id}>

expresses a contraint on a list of elements stating that among the elements in this list, all attributes named id must have distinct values. Null values are excluded. This facet can be applied to lists and elements; in each case the supplied function is used as a mapping function, and is applied to each item in the list or each attribute of the element, as if by the ! operator; the value is invalid if the resulting list contains any duplicates. So a simple constraint that all the numbers in a list of numbers be unique can be expressed as unique={$}; a constraint that the names of the attributes in an element should each start with a different letter can be written unique={substring($, 0, 1)}, and a constraint that all the non-null attributes of an element should have distinct values can be expressed as unique={$2} (when a mapping function is applied to an element, the first argument $ is the attribute name, and the second argument $2 is the attribute value).

Referential constraints are enforced by a similar facet whose value is a pair of functions, one of which selects the references (foreign keys) and one the target identifiers (primary keys):

<ref=<from={$@ref} to={$@id}>>

The rule is that the set of values selected by the from function (again excluding any nulls) must be a subset of the set of values selected by the to function.

Queries and Transformations: the FtanSkrit Processing Language

FtanSkrit is a functional, weakly-typed, Turing-complete programming language for manipulating instances of the FtanML data model. It is an expression language with full orthogonality: any expression can be used as an operand of any other expression, subject only to rules on operator precedence and type constraints.

A program in FtanSkrit is written as a function (a function which typically takes a source document as input and produces a result document as its result). The body of a function is an expression, and this exposition of the language will focus on the different kinds of expression that can be written.

The data model that can result from parsing a FtanML document, as we saw earlier, can contain seven types of value: null, boolean, number, string, list, element, and text. We also mentioned an eighth type of value, namely a function. Functions can appear in the data model anywhere that the other seven types of value can appear, for example as the value of an attribute in an element, or as the value of an item in a list.

Because expressions can be nested arbitrarily, it's not easy to define the different classes of expression without forward references to concepts that haven't been explained yet, and it's also rather difficult to know where to begin. But because functions are so important and central, that's where I'll start.

Functions

There are two important kinds of expression associated with functions: function declarations and function calls.

Function Declarations

A function is written as an expression enclosed in curly braces. Here's a function that computes the sum of its two arguments: {$1 + $2}

References to parameters are written $1, $2 etc, where $N refers to the Nth supplied argument in the function call. The expression $ can be used in place of $1 to refer to the first argument, and is particularly useful for functions that expect a single argument. It can be used in rather the same way as . (the context item) in XPath, and plays a similar role to _ in languages such as Perl or Scala.

For example, a function that returns true if the supplied element has a name might be written {name($) != null}.

Functions have no name, but can be bound to named variables, in which case the variable name serves effectively as a function name. Functions in the system library are bound to predefined variables.

A function does not have a fixed arity. The example function {$1 + $2} expects two arguments, but it can be called with more than two arguments (excess arguments are ignored), or with fewer than two (unsupplied arguments default to null).

The expression $$ returns all the supplied arguments in the form of a list. This makes it possible to write functions that take a variable number of arguments: the actual number is accessible as count($$).

Functions can refer to variables defined outside the function body, which become part of the closure of the function, to be used when it is evaluated.

Within the body of a function, the variable self is bound to the function itself. This makes it easy to write anonymous recursive functions: for example a function to compute the sum of its arguments can be written as {if empty($$) then 0 else $ + self(tail($$))}. We'll see later how to write mutually-recursive functions.

Because a function is an expression, it can be used anywhere an expression can appear; for example as the value of an attribute in an element. This allows an element to be used as a map from strings to functions, which is very like Javascript's notion of an object. This enables a kind of polymorphism.

Sometimes it is useful to design a function so that parameters are supplied by name rather than positionally. The can be achieved by writing the function to accept an element as its argument. The caller might supply the arguments like this: f(<x=2 y=3>); and in the function body the supplied values can then be referenced as $@x or $@y.

Functions do not declare their arguments explicitly. As a matter of convention, when writing a public function it is good practice to bind the supplied parameters to variables along with a type check. For example the following implementation of the indexOf function starts by giving names to its arguments and checking their type, which simultaneously makes the function more robust and more readable. [7]

let indexOf = {
  let Array = $1.as(<occurs=[0,] <number>>);
  let Search = $2.as(<number>);
  0..count(Array)-1?{Array[$]=Search}
  }

Because argument types are not declared, it's up to the implementor of a function what to do when the caller supplies arguments of the wrong type. There are no implicit conversions defined as part of the call mechanism. The preferred approach is to throw an error, which can be readily achieved using the coding style in the above example.

Function Calls

If F is an expression that evaluates to a function, then the function may be called with arguments x and y using the expression F(x, y).

If f is a variable whose value is a function, and if the function has at least one argument, then a function call can be written either as f(x,y) or as x.f(y).

As in XPath 3.0, partial function application (currying) is possible by supplying ? for one of the arguments: contains(?, ':') returns a function whose effect is to test whether its first argument is a string that contains a colon.

Some built-in functions can also be invoked using an infix operator. For example the + operator corresponds to the plus function; a + b has the same meaning as plus(a, b) or a.plus(b). All the operators in the language, including higher-order operators, are defined in terms of functions, to allow them to be passed as arguments to higher-order functions.

The names of built-in functions always use the ASCII alphabet; for some operators we have allowed ourselves the luxury of reaching beyond ASCII, but users can always avoid relying on such operators and can use the function name instead.

List and Element Constructors

The syntax of lists and elements is extended so that expressions may appear anywhere the FtanML syntax allows a value.

For example, the expression {[$, $+1, $+2]}(5) returns the list [5, 6, 7].

Lists in FtanML are not automatically flattened, so the expression [1 to 5, 10] produces the length-2 list [[1,2,3,4,5],10] rather than the length-6 list [1,2,3,4,5,10]. The latter result can be achieved either by applying the flatten() function explicitly, or by using list concatenation/append operators: for example (1..5).append(10).

In an element constructor, expressions can be used to compute the values of attributes, but cannot be used to compute their names. The value can be expressed either as a parenthesized expression, or using a string or text value containing expressions embedded in curly braces: <img size=(x+1) caption="Figure {n}">. The same applies to the content value. Note that curly braces are used only for inline expansion of strings and text (and for writing functions); to compute general structured content, parenthesized expressions should be used. The expression:

<job-titles (distinct-values(employee@job-title))>

might generate

<job-titles ["Manager", "Programmer", "Bottle-Washer"]>

A null value for an attribute indicates the effective absence of the attribute, so the expression <size x=(a+1) y=(if a=2 then 3 else null)> might produce the output <size x=3>.

More specifically, in an element constructor, the value of an attribute, or of the content value, can take any of the following forms:

  • A literal value, for example a=3 or a="blue" or a=false.

  • A string or text value with embedded expressions enclosed between curly braces, for example a="Chapter {n}" The value of the attribute is obtained by evaluating the embedded expressions as strings and inserting the resulting strings into the text.

  • A list constructor, for example a=[n, n+1, n+2].

  • An element constructor, for example a=<x=(n+1) y=(n+2)>

  • A parenthesized expression, for example a=(n+1)

  • A function, for example a={$+1}. In this case the value of the attribute is the function, not the result of evaluating the function.

Where element constructors cannot be used because the element or attribute names are not known statically, functions can be used to construct an element. For example:

element("img").add("x", 3).add("y", 5).add("", "An image")
}

Here the function call element("img") constructs an element with a given name, and the add() function adds an attribute with a given name and value (copying the element to create a new element). The last call adds the element content, represented as an attribute with an empty name. It should be remembered that although we use the term "element", FtanML elements will not only be used in the way that XML elements are traditionally used, but also in the way that maps are used in other programming languages, where the keys (attribute names) are highly dynamic: indeed, to satisfy the kind of use cases for which maps are being added to XSLT 3.0.

Rich text (mixed content) is constructed as a list of strings and elements, which is then converted to rich text by applying the toText() function.

Conditional Expressions

The syntax for a conditional expression is:

                "if" expression "then" expression "else" expression
            

There is no need for parentheses (though you can use them if you like, for old time's sake). The "else" branch is mandatory, partly to avoid choosing an arbitrary default (null?) and partly to prevent the dangling-else ambiguity when conditional expressions are nested. For example:

 if $ = 0 then x else y
 

A simple try/catch construct is provided:

                "try" expression "catch" function
            

which returns the result of the expression unless an error occurs during its evaluation, in which case the catch function is called, supplying error information as its argument, in the form of an element with attributes representing the error code and error description.

For example, the following catches a divide-by-zero error (we assume use of the XPath error codes), and returns null if it occurs; otherwise the error is re-thrown:

                try (x.div(n)) catch {if $@code="FOAR0001" then null else error($)}
            

A function orElse allows a default to be substituted when a value is null. For example a.orElse(0) returns the value of a unless it is null, in which case it returns zero. This function could be defined as {if $1=null then $2 else $1}.

Variables

Variables have simple names (no "$" prefix, no backticks). The names true, false, and null are reserved: they are used syntactically like variables, but have fixed predefined values. Language keywords such as if and let are also reserved: unlike XPath, this is possible because bare names are not used to refer to elements and attributes in input documents.

Variables may be declared using the construct:

                LetExpression ::= "let" name "=" expression; expression
            

which evaluates the second expression with the named variable bound to the value of the first expression; for example let x=2; x+x returns 4.

Variables declared in this way are available only after they have been declared. An alternative style of declaration allows forwards references to variables, which is necessary when writing recursive functions. This style uses element notation:

                let <
                  even = {if $=0 then true else odd(abs($)-1)}
                  odd  = {if $=0 then false else even(abs($)-1)}
                >;
                even(32)
            

With this approach, all the variables declared as attributes of the same element are in scope within each others' declarations, failing dynamically (or in the worst case, failing to finish) if this results in non-terminating recursion.

Equality and Other Comparisons

The eq function (operator =) is defined over all values. To be equal, two values must have the same fundamental type (this means, for example, that the string "John" is not equal to the rich text |John|). Strings are compared codepoint-by-codepoint. Lists are equal if they have the same size and their items are pairwise equal. Elements are equal if they have the same name, and if there is a one-to-one correspondence of attributes in which both the attribute names and the corresponding values are equal (the content value is treated here as an attribute). Two texts are equal if the two sequences of strings and elements making up the texts are pairwise equal.

Defining equality for functions requires further work. Some languages such as ML and Haskell make equality of functions undefined, but this would mean that equality of lists and elements containing functions also becomes undefined. Currently my preference is to make equality of functions implementation-defined, subject to the proviso that two functions can only be equal if all invocations are guaranteed to return equal results. It would be useful to attempt a more careful definition, for example one that guarantees that the result of the expression let a=b; a=b is always true, but formalizing this is not easy without introducing some notion of identity.

The ne function (operator !=) is the inverse of eq.

Ordering (specifically, the functions le, lt, ge, gt, and their corresponding operators <=, <, >=, >=) is defined over numbers and strings only. Strings are sorted in Unicode codepoint sequence.

Testing whether a value V is present in a list A (the equivalent of the = operator in XPath) is sufficiently common that we provide a function, in(V, A) with corresponding operator ∊ (x220A). The function in(V, A) can be defined as {let V=$1; let A=$2; exists(A?{$=V})}. (This uses a filter operator which we will introduce in due course.)

A collation is modelled as a set of functions. Specifically, a collation for a particular language, say Swedish, is obtained using the function call collation(<lang="se">). This returns an element, whose attributes are functions. One of these functions is a sort function, so to sort a list of strings using Swedish collation, one can write collation(<lang="se">)@sort(input). Other functions available as attributes of a collation include comparison functions eq, le, etc, and collation-sensitive versions of other functions that involve string comparison such as in, min, max, indexOf, contains, startsWith, endsWith, substringAfter, substringBefore.

Comparing A = null returns true if A is null, false otherwise. (This is not as obvious as it might seem, given the semantics in some other languages.)

Operations involving Types

The function isA tests whether a value belongs to a given type. Types are represented using the element representation introduced in an earlier section. For example, x.isA(<number ge=0>) returns true if the value of x is a number and satisfies the facet ge=0. Recall that a type is a predicate, not a label associated with a value, so this tests whether the value meets all the constraints defined by the type, not whether the value carries any particular type label.

For convenience, the functions isNull, isBoolean, isNumber, isString, isList, isElement, isText, and isFunction are available to test for membership of the primitive types.

The function as is similar to isA, but instead of returning a boolean indicating whether or not the value is a member of the type, it returns the value unchanged if this is the case, and throws an error otherwise. We saw this function used to check the arguments to a function call.

For convenience, the functions asNull, asBoolean, asNumber, asString, asList, asElement, asText, and asFunction are available to test for membership of the primitive types. In each case, they return the argument unchanged if it matches the corresponding type, or throw an error otherwise.

Functions for conversion of values have names such as toString, toList, and toText. There are no general rules here; as in other languages, the rules for what can be converted to what are inherently ad-hoc.

The parse function takes a FtanML lexical representation of a value and returns the corresponding value; conversely, serialize takes a value and returns its FtanML lexical representation.

Boolean Functions and Operators

The functions and, or, and not are available. The first two have equivalent operators && and ||.

The argument must either be a boolean or null; there is no implicit conversion to boolean as in XPath. If an argument is null, the operators implement three-valued logic as in SQL, for example (null||true) is true.

Order of evaluation is not defined; programmers should not assume that the second argument will only be evaluated if it is required. (This rule might seem unnecessary in the absence of side-effects, but it becomes important when defining the terminating conditions of a recursive function. Like XPath, we choose to allow optimizers the freedom to re-order the terms in a predicate, which can be important when indexes are available on large data sets.)

Numeric Functions and Operators

The functions plus, minus, times, div, idiv, and mod are defined; the first four have corresponding operators +, -, *, and /. Arithmetic is performed in decimal. Division by zero is an error; the precision of the result of division is implementation-defined, as are limits on the value space.

Additional functions abs, round, floor, ceiling have the same effect as in XPath.

The function parse may be used to convert a string to a number. Writing parse(X).asA(<number>) checks that the value was indeed numeric.

Supplying null to an arithmetic function or operator results in the value null. Supplying any other non-numeric value causes an error. There is no implicit conversion of strings to numbers.

Aggregation functions sum, avg, min, max work as in XPath.

The function to or its equivalent operator .. returns a list of consecutive integers, for example 1..5 returns the list [1,2,3,4,5].

String Functions and Operators

The toString function can be applied to any value without exception to convert it to a string. If the input is already a string, it is returned unchanged. If the input is a boolean, number, or null, the result is the same as the FtanML literal representation of the value. If the input is a list, the result is string-join($!toString, " "), that is, the space-separated concatenation of the string values of the members of the array. If the input is an element, the result is string(content($)), that is, the string value of the element's content. If the input is rich text, the result is string-join($.toList()!toString), that is, the concatenation (without space separation) of the string-values of the strings and elements making up the text. If the value is a function, the resulting string begins with "{" and ends with "}", and is otherwise implementation-defined; it is not necessarily a string that can be parsed and executed as a function.

String concatenation can be achieved conveniently using string templates, for example "See section {s} in chapter {n}". This mechanism can be used wherever a string literal is permitted. The result of the enclosed expressions is converted to a string if necessary, by using the toString function.

Generally FtanSkrit avoids implicit conversion. For example, if rich text is to be compared to a string, it must be converted to a string explicitly. When null is supplied to a function that expects a string, it is generally treated as a zero-length string (but this is a convention adopted by the functions in the built-in function library; it is not a feature of the language).

Other functions are available as in XPath. Counting of characters in a string, however, starts at zero. The basic built-in functions use codepoint collation; equivalents using a different collation can be obtained as attributes of the collation.

Functions and Operators on Lists

The number of items in a list A is obtained using count(A) (or equivalently, A.count()).

The construct A[n] selects the n'th item in list A. This construct is never used for filtering, only for subscripting. If n is out of range, the expression returns null. This operation is also available as a function itemAt(A, n).

The function cat(a, a) (operator ++) concatenates two lists. The function append(a, i) (operator +~) appends an item to a list, while prepend(i, a) (operator ~+) prepends. Thus for example 0 ~+ [1,2,3] ++ [4,5,6] +~ 7 returns the list [0, 1, 2, 3, 4, 5, 6, 7].

The function head(a) is equivalent to a[0], while tail(a) equates to remove(a, 0).

The function flatten flattens a list: it creates a new list in which any non-list items in the argument list are copied to the new list, and any list items are processed by copying their contents. This only works one level deep. So flatten([[1,2],[3,[4,5]]]) returns [1,2,3,[4,5]].

Functions index-of, remove, subsequence work as in XPath, except that indexing starts at zero. The insert-before function inserts a single item at a specified position; if the supplied item is a list, it becomes a nested list (there is no flattening).

The function toList works as follows: if the argument is a list, it is returned unchanged. If the argument is rich text, it is converted to a list whose members are (non-zero-length) strings and elements. In other cases, the function creates and returns a singleton list in which the argument is the only item. This function is useful because it makes it easier to process different types of content in the same way: a single element looks the same as a list of elements of length one, which looks the same as mixed content comprising a single element; a single string looks the same as mixed content containing no elements.

The function forEach, or the equivalent operator !, applies a function to every item in a list. So forEach([1,2,3], {$+1}) returns [2,3,4]; this can also be written [1,2,3] ! {$+1}. Similarly, [1,2,3]!toString returns ["1", "2", "3"]. Note that this is a non-flattening mapping operation; the result list will contain exactly the same number of items as the input.

Another example: (1..5)!{<br>} returns [<br>, <br>, <br>, <br>, <br>]

The function select, or the equivalent operator ?, applies a function to every item in a list and returns a list containing those items for which the function returns true. So select([1,2,3], {$>=2}) returns [2,3]; this can also be written [1,2,3]?{$>=2}. Similarly, [1,2,"London"]?isNumber returns [1,2].

The result of the select operator or function is always a list, even if only one item is selected. If it is known that the predicate will select exactly one item, it is necessary to extract that item from the result, typically by a call on head, or by using the subscript operation (A?P)[0]. Because this is a common operation, the operator ?? is provided, equivalent to ? followed by head(): it selects the first item found, or null if nothing was matched. A query to find a singleton can now be written, for example items??{$@id='xyz'}.

The functions all and some can be used in conjunction with the forEach (!) operator to perform universal and existential quantification: they test whether a list consists entirely of boolean true values (all), or contains at least one true value (some). So, for example all([1,2,3]!{$>0}) returns true, while some([1,2,3]!{$=0}) returns false.

Functions fold-left and fold-right are available as in XPath 3.0.

Functions and Operators on Elements

The function name(E) returns the name of an element E, or null if it is unnamed. The syntax E.name() can also be used, of course.

The mapping and filtering operators (! and ?) apply to elements as well as to lists. In this case they expect a two argument function to be supplied, and call this with the attribute name as the first argument and the attribute value as the second. The mapping operator returns a list with as many members as there are non-null attributes in the input; the filter operator returns an element with the same name as original, and with a subset of its attributes. These operators treat the content value as just another attribute. They provide the most general and powerful means of processing elements, and other operations can be defined in terms of these two. For example, the function content(E), which returns the content of an element, could be defined as E?{$1=""}!{$2}.

For example E?{$.in("id", "code", "status"} returns a copy of element E, retaining only the three specified attributes.

If E is an element and xyz is the name of an attribute (known at the time the program is written), then E@xyz returns the value of the attribute. It returns the value only, not an "attribute node" as in XPath; if the attribute is not present, it returns null. If the name needs to be dynamically computed, this can be achieved using an expression in parentheses, for example E@(X@name) returns the attribute of E whose name is given by the expression (X@name). The construct E^ is an abbreviation for E@`` — it returns the value of the attribute whose name is the zero-length string, that is, the content value.

Filtering a list of elements to select those with a given name is likely to be a common operation. The syntax L?{name($)=N} achieves this but is a little cumbersome, and becomes more so if the list can also include values other than elements. So we provide the construct :N, where N is a name, which represents a function that returns true when its first argument is an element with the name N, and false otherwise. So given a list of elements L, we can now select those having the name N by writing L?:N. If we know there will be only one such element, we can select it using L??:N.

So if PO is the purchase order presented in section 2.1, then PO@shipTo^??:name gives the value "Alice Smith", while PO@items[0]@partNum gives "872-AA".

The following example selects from a list of elements those having a particular attribute value: PO@items?{$@USprice > 20.00}.

The @ operator performs implicit mapping. Specifically: if the left-hand operand is a list L, then any lists contained in this list are expanded recursively. Any values in the expanded list that are not elements are ignored, so we end up with a list of elements; call this LL. The value of the expression L@name is then defined as LL!{$@name}. Note that the result may be a list of lists; it is not flattened.

Returning again to the purchase order example, this means that PO@items@partNum returns the list of strings ["872-AA", "926-AA"].

The postfix operator // represents the deepContent() function, which is the flattened transitive closure of the content() function. Specifically, if the result of content() is a list, then any element in that list has its own descendants inserted immediately after it. So the function descendants(E) can be defined as content(E)!{$ ~+ (if $.isA(<element>) then $.descendants() else [])}. So if E is an element, then E//?:status will select all descendant elements named status, and E//?[$@id=12] will select all descendant elements having the id attribute equal to 12.

The postfix operator @@ similarly gives the transitive closure of attributes().

As already mentioned, the function forEach and the equivalent operator ! are overloaded for elements to process all the attributes of an element (including the content). The second operand is a function which is called with two arguments, the attribute name and the attribute value. For example, given an element E the expression E!{$} returns the names of its attributes.

Similarly the function select and the equivalent operator ? are overloaded to process all the attributes, and return an element in which only a subset of the original attributes are present.

The function element(name) constructs an element with a given name (which may be null). It can also be called with two arguments: element(name, attributes). The second argument is a list of attributes, each attribute being represented by a two-member list containing the name and the value.

The function add(element, name, value) takes an existing element, and adds an attribute with the given name and value, replacing any existing attribute with the same value. A new element is returned. Calls to this function can conveniently be chained: element("rect").add("id", "a001").add("color", "black").

For convenience, the function addContent is defined as add(?, "", ?).

So, for example, we can convert attributes to a list of child elements like this:

let elementsToAttributes = {
    let E = $.asElement();
    element(E.name()).setContent(E!{element($1).setContent($2)})    
 }
            

The semantics of these constructions in FtanSkrit are different from the corresponding operations in XPath, but hopefully they will have a familiar feel.

Future Features

FtanML as presented in this paper packs a large amount of functionality into a small language. It doesn't offer everything that anyone might ask for, and nor should it: keeping it small is important. Nevertheless, there are things one might want to add, and which have not been ruled out.

  • An extensible mechanism for data types is needed: for example, representing dates as values. Schema validation can confirm that a date is a valid string, but for processing one would like to manipulate it as a date, not just as a string. Similarly, support for binary data is important to some applications; and it would be nice if URIs were recognizable as such. A general mechanism for extending the set of types (perhaps along the lines suggested in [Jeni Tennison]), would be undoubtedly useful.

  • What about pointers and links? XML has a sorry tale to tell in this area, but that doesn't mean it can't be done better. Arguably links and anchors should be first-class constructs marked as such in the syntax, rather than a semantic overlay affecting the interpretation of strings. Both intra-document and inter-document links are needed, and they should ideally be handled using a single mechanism. Support for the kind of referential integrity found in relational databases is as important as support for the hyperlinking traditions of the markup community, and there is no good reason why the two mechanisms should be distinct.

  • In the scripting language, there is an obvious need for rule-based processing in the style popularised by XSLT. In this paper, I have concentrated on presenting a small functional core for the scripting language, but I would like to see rule-based processing superimposed, and I see no reason why this should not be achievable.

Conclusions and Summary

We have introduced three languages as replacements for the central pillars of the markup edifice: FtanML as the markup language, FtanGram as its schema language, and FtanSkrit as its query and transformation language. Let's try now to assess what we've achieved.

Firstly, FtanML compared to XML. FtanML is considerably smaller as a specification, but it's also more powerful. It gets rid of the same unwanted things that MicroXML gets rid of (namespaces, comments, processing instructions, entities, DTDs), but by allowing attributes and element content to be any value, the data model is much richer, more orthogonal, and more expressive. It also solves the whitespace issue (which whitespace is significant?). By dropping end tags, the language is a lot less verbose, which is particularly noticeable when it is used for highly structured information, as in the FtanGram syntax. There's a lot of general tidying-up in little areas like escaping of special characters.

Does verbosity matter? We think it does. The fact that XML is bulky and hard to read is a significant factor leading to the adoption of alternative syntaxes for languages such as RDF and RelaxNG, and is a big turn-off for people coming newly to XSLT. Even if specialist editors can reduce the burden of entering the markup, the amount of noise on the page affects the ability of a human reader to absorb information quickly. This is not to say that the most concise syntax is optimal, of course: we might have swung too far. XML had human readability as one of its goals, and we should remember that readability is not a binary attribute; there are degrees of readability, and readability also depends greatly on the familiarity of the reader with the notation.

Compared to JSON, FtanML's main contribution is that it adds support for mixed content. And element names, which are very handy when modelling document structures.

Compared to the XPath data model (XDM), the FtanML model has more capability and greater orthogonality. The core structuring constructs (elements and lists) are powerful enough for all computational requirements. XSLT and XQuery have found a need to extend the core XML-based model with other constructs such as maps and lists; the FtanML model does not have this awkward duality between constructs that can be directly serialized in the markup, and constructs used only for internal processing.

FtanGram learns from RelaxNG the importance of designing a schema language to do validation and nothing else. Unlike RelaxNG, it's able to take advantage of the simplification and orthogonality of the data model. The unification of facilities for "simple types" and "complex types" is particularly appealing, allowing a smaller number of constructs to be combined in more powerful ways to create richer functionality. The idea that element names as well as attributes and content are something that can be constrained by a type is also a useful simplification. FtanGram also attempts to show that by making the markup language itself more powerful and less verbose, the need for a "compact syntax" (that is, a syntax using constructs other than those available in the target language) is eliminated.

FtanSkrit is broadly equivalent in capability to XQuery, but with a stronger reliance on higher-order functions and operators in preference to custom syntax. It currently lacks any mechanism comparable to XSLT's template rules, but we have ideas for how that could be added.

There will always be debates about strong versus weak typing, static versus dynamic. I believe that FtanML's dynamic typing approach fits better with the philosophy that with markup, you can have as much or as little schema machinery as you want. The XPath ability to mix typed and untyped data is one solution to the problem of spanning the worlds of structured and unstructured data, but it is something of a camel.

Are there any downsides? Some may find the languages excessively terse; highly compact syntax is not easy on the eye. The absence of named end tags in FtanML can lead to long strings of closing angle-brackets which are hard to match up without the support of a syntax-driven editor. Generally, though, we feel that FtanML with its sister languages FtanGram and FtanSkrit together form a markup system that has more than the power of the equivalent XML stack, with much greater integrity of design, simplicity, orthogonality, efficiency, and usability.

Implementation

A Scala implementation is available as open source software. It can be downloaded from here:

https://github.com/organizations/FtanML-WG

The implementation is not 100% complete, and is intended as a proof of concept rather than as production quality software. It includes a complete parser for FtanML which constructs a tree represenation of the object model; an implementation of all the FtanGram types and facets (including the grammar facet), but not the schema language itself; and an implementation of most of the FtanSkrit processing language, though with some relatively unimportant functions and operators omitted.

Acknowledgements

FtanML was invented by a group of students from German universities taught by the author, with Stephanie Haupt as co-tutor, during a summer school in Ftan, Switzerland in August 2012, organised by the Max Weber Programm, Bayern. The students deserve much of the credit, if only for challenging things that I had assumed to be self-evident: they were Max Altgelt, Julien Bergner, Lukas Graf, Dominik Helm, Axel Kroschk, Uwe von Lüpke, My-Tien Nguyen, Sebastian Meßmer, Suhanyaa Nitkunanantharajah, Jan Landelin Pawellek, and Martin Schmitt. They were a most impressive team and a pleasure to work with: absorbing knowledge quickly, researching information thoroughly, generating ideas constantly, reaching consensus amicably, writing parsers correctly, making decisions wisely, and communicating bilingually. I am particularly indebted to Sebastian Meßmer for helping me climb the Scala learning curve.

References

[1] MicroXML. Ed. James Clark and John Cowan, 2012. W3C. https://dvcs.w3.org/hg/microxml/raw-file/tip/spec/microxml.html.

[2] XML Schema Part 0 Primer, Second Edition. Ed. David C. Fallside and Priscilla Walmsley. 28 Oct 2004. W3C. http://www.w3.org/TR/xmlschema-0/.

[3] Introducing JSON. http://www.json.org

[4] YAML: YAML ain't markup language. http://yaml.org

[5] LMNL: Layered Markup and Annotation Language. http://www.lmnl-markup.com

[6] ODDAG: A Data Structure for Overlapping Hierarchies. C. M. Sperberg-McQueen and C. Huitfeld. 2004. Springer.

[7] Information technology -- Generic applications of ASN.1: Fast Infoset. ISO/IEC 24824-1:2007

[8] Efficient XML Interchange (EXI) Format 1.0. 10 Mar 2011. W3C. http://www.w3.org/TR/2011/REC-exi-20110310/



[1] Ftan is a place name, not an acronym, and while words beginning "Ft" are uncommon in English, the pronunciation comes easily with practice.

[2] It's called a cell because escaping is not allowed.

[3] This decision means that JSON is not a pure subset of FtanML, because JSON distinguishes an absent entry in an object from an entry whose value is null. However, the decision makes programming simpler, and makes sense semantically.

[4] This is a change from the original FtanML design. Originally rich text was introduced by a vertical bar, and ended with the ">" delimiter marking the end of the element. This design prevented rich text appearing as the value of an attribute, or being used as a value in the scripting language. The revised design restores orthogonality by allowing rich text to appear where any value can appear.

[5] Modelling rich text as a list of strings and elements is convenient in some situations, especially because it's the only representation available using the data types of many programming languages. The main drawback is that it's not convenient when we want to treat the data as a simple string, and ignore the markup. So we make it a distinct data type, that can easily be converted either to a list or to a string for processing when required.

[6] I have previously [x] discussed the possibility of writing an XSLT optimizer in XSLT; I concluded that the only thing preventing this was the inefficiency of the XSLT processing model in cases where it is necessary to make many passes over a tree, with each pass effecting a small change. Allowing subtrees to be shared between the source and result of the transformation could eliminate this problem.

[7] It would be easy enough to add syntax for a more verbose function declaration with an explicit signature. But at this stage, it's important (a) to keep the language small, and (b) to provide a very concise syntax for functions, allowing them to be used as freely and easily as predicates and steps are used in XPath.

Author's keywords for this paper:
Markup Languages; Schema Languages; Document Processing Languages

Michael Kay

Saxonica

Michael Kay has been developing the Saxon product since 1998, initially as a spare-time activity at ICL and then Software AG, but since 2004 within the Saxonica company which he founded. He holds a Ph.D from the University of Cambridge where he studied databases under the late Maurice Wilkes, and spent 24 years with ICL, mainly working on the development of database software. He is the editor of the W3C XSLT specification. The FtanML project is totally separate from any W3C or Saxonica activities.