Transcending Triples

Micah Dubinko

Abstract

Semantic modeling to date has been largely an exercise in considering the whole world as triples (or at least, n-way tuples). Efforts to bridge the XML gap such as XSPARQL have focused on low-hanging syntactic fruit, and have not had much effect on deeper layers of the architecture.

In semantic modeling, common situations that are supposedly bread-and-butter for RDF, for example tracking provenance or recording facts known to a certain probability, give rise to complexity such as the much reviled RDF reification, or statements about statements. Even the W3C RDF Primer doesn't seem altogether comfortable with it:

While there are applications that successfully use reification, they do so by following some conventions, and making some assumptions, that are in addition to the actual meaning that RDF defines for the reification vocabulary, and the actual facilities that RDF provides to support it.

In a similar vein, inference is commonly viewed in the semantic world as something that produces (only) triples from (only) triples. A broader view of inference encompassing XML documents and values as inputs and outputs can make many common use cases far more straightforward. This paper discusses a (hypothetical?) world where triples and documents get along better with each other, and speculates about what future products fusing these technologies might look like.

Introduction

Q: How many designers does it take to change a light bulb?

A: What makes you think a light bulb is the best solution?

There's no shortage of works expounding on the obvious impedance mismatch between trees and graphs. The same goes for failed XML syntaxes for RDF. This paper isn't about those.

There's also plenty out there on extensions to either XML or RDF or SPARQL to broaden the field of use cases covered by the technology. (XSPARQL for example is an elegant solution.) This paper isn't about any of those either.

This paper outlines a number of common problems that application developers frequently encounter, and are normally associated with RDF and triple-based solutions, and yet prove to be complicated in practice to solve entirely with triples. This paper concludes with some initial thoughts and observations on the relative simplicity of treating triples and documents as part of a unified whole.

Asserting facts in XML

Consider a web-crawling application that encounters statements encoded in web pages, in two different cases:

<html>
   <head>
       <meta name="author" content="Micah Dubinko"/>
…

<div typeof="rdf:Statement">
   <div property="rdf:subject" href=""/>
   <div property="rdf:predicate" resource="dc:author"/>
   <div property="rdf:object">Micah Dubinko</div>
</div>

One way to ask about the meaning of these cases is to ask what the consuming application would do with them. In the former case, it's clear that the entity that controls the web page is asserting a statement of authorship, and that can be modeled as a triple. But in the latter case, absent additional statements-about-statements, it's not clear that the consuming application can do anything useful with the mere existence of an unasserted statement. That is to say, the latter case is less meaningful to the application.

True, this sort of capital-R Reification has been long damned through disuse, for example Reification Considered Harmful, but it's still serves an important role in discussion. The original designers of RDF had particular use cases in mind, and application developers should either repudiate those use cases themselves, or figure out how to actually implement them.

Next, consider:

<FactSet likelihoodPercent="50">
   <meta name="author" content="Micah Dubinko"/>
…

Can a consuming application consider this statement to have been asserted? Well, if you take the intent of the element and attribute names at face value, yes. By extension, the same document could have different FactSet elements each with different degrees of assertionness. Presumably the document head in the first example asserts at a full strength of 100%, while the opposite extreme

<FactSet likelihoodPercent="0">
   <meta name="author" content="James Bond"/>

outright asserts the falsehood of the given statement.

Can this use case be discredited? Probably not. A great deal of ongoing thought has gone into representing conditional knowledge in triples, for example Fukushige 2005, but again, the solution here shown in Turtle 2013 syntax, ends up looking an awful lot like capital-R Reification:

[a prob:Partition;
   prob:condition :cond0;
   prob:case
      [a prob:ProbabilisticStatement;
       prob:about :case1;
       prob:hasProbability :prob1],
      [a prob:ProbabilisticStatement;
       prob:about :case2;
       prob:hasProability :prob2].
].

It turns out thehe real world is messy, and modeling that messiness as triples adds a huge amount of complexity.

What does it mean for a fact to be embedded in an XML element? The short answer is 'pretty much anything'. A more nuanced answer has been discussed here before, for example Dombrowski 2010 and others.

Bug or feature? Let's dig deeper.

Asserting facts over time

Another common set of use cases involves the represtation facts in time in triples. Consider something obvious, like

:BarackObama :presidentOf :UnitedStatesOfAmerica .

If a consuming application encountered this fact, say during a web crawl performed in 2013, and found this statement embedded in a page as RDFa, few developers would question the truth of the asserted statement or their ability to do something useful with the fact as it stands. But what about this?

:RonaldReagan :presidentOf :UnitedStatesOfAmerica .

In this case, few developers would dispute the truth of that statement during the mid-80s, but asserting that fact in the same manner as the prior in is clearly problematic. This highlights a deficiency in the treatment of the first triple, and more broadly any such discovered statements. Facts change, and anything beyond the most simplistic models needs to reflect this. These include questions of the form "what was Martin's address on 1 Jul 1999" and "what did we think Martin's address was on 1 Jul 1999 when we sent him a bill on 12 Aug 1999" as discussed in Fowler Temporal Patterns. In order to model reality, the facts of which change over time, a model needs to take into account various aspects of time. But the consequences of modeling to this level of detail are significant--in short, the simple examples above are insufficient. An average web-page-creator may be hard pressed to put well-formed temporal facts in their pages. A brief look at some proposals will show why this is the case:

Gutierrez 2007 has this to say:

There is a blank node connected to the components of the triple, in a sort of “temporal reification” scheme (using the vocabulary tsubj, tpred, and tobj). The remainder of the graph are statements about the timestamps at which the triple was valid.

Henson 2009 outlines a similar approach, though more ontology-driven. Here's enough to give the flavor of the approach:

Therefore, om-owl:TimeSeriesObservation inherits properties from both om-owl:Observation and om-owl:ObservationCollection

Rodríguez 2009 offers yet another approach, extending both RDF and SPARQL. Timestamps can be embedded in subjects, predicates, objects, or some cases subject/object in the same statement. The query language is likewise extended, for example finding the most recent reading with a syntax reminiscent of XPath predicates:

SELECT ?temp ?s.t
WHERE {
   <urn:Chicago> <urn:hasSensor> ?s .
   ?s[LAST] <urn:hasValue> ?temp .
}

Temporal SPARQL also outlines a similar approach with both anonymous named graphs and a SPARQL extension, using data in this form:

_:kanzaki a foaf:Person _:Always .
_:kanzaki whois:place "Tokyo, Japan" _:Interval1 .
_:kanzaki whois:place "Mie, Japan" _:Interval2 .
…
_:Interval1 a time:Interval .
_:Interval1 time:begins "1982" .
…
_:Interval2 a time:Interval .
_:Interval2 time:begins "1968" .
_:Interval2 time:ends "1978" .

and queries in this form

SELECT ?name WHERE {
   TIME [ time:inside "2007"^^xsd:dateTime ] {
      [ a foaf:Person;
        foaf:name ?name;
        whois:place "Tokyo, Japan" .
      ]
   }
}

Note that as in the previous section, this solution makes use of a bnode identifer to name a graph.

Is this a case of XML to the rescue? Possibly. Representation of time values, or ranges of time values, is already commonplace in XML application models. Embedding a triple in XML is a possible way out. This also supports a convenient query processing model as follows

restrict the universe of documents down to those representing a temporal range of interest, using a document-centric query language such as XQuery
then run SPARQL over the triples associated with the documents in step one

But these are the simple cases. Temporal queries can get much more complicated. Take, for example, a query that expresses this: On Dec 31, 2012, what did the official records show that John Doe's address was? Facts change over time, but so do official records of facts. This is known as a bi-temporal query, and there's significant existing work in the RDBMS world on supporting this class of query. In general, the solution involves keeping track of multiple timestamps for each fact, one for when the fact was considered valid in the real world, and one for when our knowledge of the fact (official record) is in force.

If one-dimensional time is complicated for casual users to express, say in RDFa, how much more so are these kinds of data and corresponding queries? One can imagine a proliferation of unnamed graphs, each containing two (or more) time axes, with a corresponding increase in necessary plumbing and query complexity. Isn't this the kind of thing that XML databases already excel at? Just because something can be done entirely with triples doesn't mean it's a good idea to do so.

Trust, Security, and Provenance

Returning to use cases, one often associated with semantic technologies is tracking the provenance of statements. Any given statement may have an arbitrarily complicated network of facts that contribute to the given statement. In some cases, certain statements may have different levels of access permissions depending on which user is accessing the database.

For example, a financial services company needs to make legally-binding reports for regulatory purposes. Certain users need the ability to justify any published statement upon request, including which systems data flowed through on a particular date (even after said system change or no longer exist). Like the earlier, bi-temporal case, this can be accomplished by tracking more state at the level of named (or anonymous) graphs--which subsystems were involved in producing a fact (even if a certain subsystem no longer exists), which schemas were in force at the time (even if they no longer exist in the present), and so on. Such information may or may not be amenable to representation as triples.

Trust-Aware SPARQL ^[1]offers a partial approach to this. Again, the solution involves an extension to the SPARQL langauge and looks very similar to examples the earlier sections:

PREFIX ub: <http://www.lehigh.edu/.../univ-bench.owl#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?n ?c ?t
WHERE {
    { ?s rdf:type ub:Student .
      ?s ub:name ?n }
    { ?s ub:takesCourse ?c .
      TRUST AS ?t }
}

In this example, the 'coefficient of trustworthiness' is stored for each triple and returned in a variable ?t. Different implemenations have partially solved this problem by extending the triple model to include quads, quints, hexes, and so on.

Another form of metadata that can conceptually be applied at the triple level is security access, for example Access Control Lists, which would not be implemented as triples themselves, thus breaking the model of statements-about-statements. Much of the work done in this area is unlikely to ever be standardized.

Security in particular has been more ofen implemented in document-oriented technologies than triple-oriented technologies, which begs the question of how to think about semantic modeling for applications.

How to think about embedded RDF, and thereby named graph inference

Previous sections have hinted at the possibility of explicitly modeling triples as statements contained within documents. This section makes it explicit and examines the consequences. What does it mean for a triple to be embedded in a document? Consider some possible interpretations, as they relate to application modeling:

Nothing at all. Documents are mere conveyences to be disposed of as quickly as possible. (This attitude is implied in the term 'semantic lifting')
If a document is in one or more collections (in the XQuery sense) consider embedded triples to be in equivalent named graphs where the collection name is the graph name.
Triples embedded in a document are considered to be in a named graph, where the document name is the graph name.
Like 3, but even more tightly-scoped. The particular element-scope of where the triple occurs is relevant, as seen in the introduction, where different triples embedded in the same document had different likelihoodPercent values.

Other than the first interpretation, none of these are mutually exclusive.

Named graphs, which seem at first conceptually simple, underly more power and complexity than might be readily apparent. For one thing, they have multiple associated URIs and/or points of access:

The URI that names the graph (which may not exist in cases of blank-node-identified graphs)
The URI that can be used to dereference the graph (for example, in the SPARQL 1.1 Graph Store HTTP Protocol the named graph URI is composed into a longer URI for purposes of dereference as a HTTP GET request.)
The means by which machine-readable conditions that apply to this graph can be extracted. As previously mentioned, these conditions may not be conveniently representable as triples.^[2].

What do we mean by conditions of a graph? It may be useful to borrow an older term, that of a conceptual graph as defined to inSowa 1976 . Perhaps the important thing about a named graph isn't that it is named (and increasingly, they are not) but what a statement within means to the people who use it or how it relates to the overall operations of a business enterprise, three common examples of which earlier sections of this paper has examined.

Thus, we can define conditions as follows: a condition is an assumption that holds for an entire graph and applies to any statements within the graph. It fulfills much the same purpose as capital-R Reification, but instead of forming statements about single statements, addresses an entire graph. This includes whether or to what degree the facts are trustworthy, their time of validity either in the real-world or as recorded in an official record, any relevant facts about the provenance or history of coming to know these facts, and anything else deemed necessary by your application. Furthermore, we will call these graphs conceptual graphs to highlight their unique standing as opposed to regular named graphs. This definition is admittedly fuzzy, but that is a reflection of the fuzziness of the real world, something that XMLcomes closer to embracing than does RDF.

Conceptual graph conditions can just as easily be encoded in XML as triples, and in many cases, XML is more convenient and straightfoward to process and query. If interpretations 3 or 4 above are in force, an XML document that embeds triples can straightforwardly record the conditions that apply to those triples within.

The meaning of inference in light of named graphs doesn't seem to be a solved issue^[3] but thinking in terms of conditions might begin to point toward an answer. When statement C is logically inferred from statements A and B, a materialized statement C must end up as part of a graph that has conditions compatible with both of the graphs that A and B respectively reside in. In many cases this is a logical intersection, but given the broad definition of what can comprise a condition, it may need to be figured out on a case-by-case basis.

For example, if statement A is deemed to be valid only in the years 2000-2010, and statement B is valid only in the years 2008-2013, you'd expect statement C to be valid only 2008-2010 (and be treated as part of a conceptual graph that says as much). Other conditions including likelihood are potentially less straightforward. For example if statement A has a likelihood of 40% and statement B a likelihood of 50%, does the likelihood of statement C amount to prob(A) * prob(B) = 20% or would fuzzy set logic Fuzzy Sets be more applicable, in which case the answer is min(prob(A), prob(B)) = 40%? In many cases more complex Bayesian techniques would be required as part of inference.

A further complication arises with intepretation 4 above, where element-level scope is significant. In these cases materialized triples from inference need to exist in a particular element scope, which may not exist outside of the inference. Does this imply that part of inference is to construct new conceptual graph conditions in the form of new documents?

The Future

Anything that can be modeled with purely triples can also be modeled in part or in whole in XML. And anything that can be modeled in XML can (with enough layers of abstraction) be modeled as triples, though perhaps not elegantly. Combining the two can play to the strenghts of each, and open the way toward elegantly solving interesting real-world problems.

For example: Look through software reviews to see that on <Date> <SoftwarePackageX> was shown to work on OSX. From this infer

<SoftwarePackageX> :compatibleWith <OSX:ParticularVersion>

But more mundane solutions should not be dismissed out of hand. I've seen many instances of this: Based on a set of trusted facts about <topic X> assemble (infer?) an XML document, which is made available to full-text search engines. Often times this goes beyond mere assembly, and includes some amount of rules-driven processing. In search applications, Infoboxes (assembled by mechanical rules) in search results could be considered an example of this.

The real world is messy, and a formal models can only express this via additional complexity in themselves. Purists and partisans for a particular technology don't do well on the forward-deployed frontlines of technology. Perhaps a mixed approach, leveraging different technologies in their respective sweet spots isn't such a bad starting point for thinking about the relationship between an application and the facts is makes use of.

To conclude, a call to action: Every time you see a paper or proposal for some complicated extension to the RDF data model and possibly the SPARQL language, think back to the opening line of this paper, and ponder whether a lightbulb is actually the right solution.

References

[Reification Considered Harmful] Eric Hellman, Reification Considered Harmful [online]. [cited 17th August, 2013]. http://go-to-hellman.blogspot.com/2009/05/part-3-reification-considered-harmful.html

[Fukushige 2005] Yoshio Fukushige, Representing Probabilistic Relations in RDF Proceedings of the ISWC Workshop on Uncertainty Reasoning for the Semantic Web, 2005. http://ceur-ws.org/Vol-173/pos_paper5.pdf

[Dombrowski 2010] Andrew Dombrowski, and Quinn Dombrowski. A formal approach to XML semantics: implications for archive standards. Presented at International Symposium on XML for the Long Haul: Issues in the Long-term Preservation of XML, Montréal, Canada, August 2, 2010. In Proceedings of the International Symposium on XML for the Long Haul: Issues in the Long-term Preservation of XML. Balisage Series on Markup Technologies, vol. 6 (2010). http://www.balisage.net/Proceedings/vol6/html/Dombrowski01/BalisageVol6-Dombrowski01.html. doi:https://doi.org/10.4242/BalisageVol6.Dombrowski01.

[Fowler Temporal Patterns] Martin Fowler, Temporal Patterns [online]. [cited 12th July, 2013]. http://martinfowler.com/eaaDev/timeNarrative.html

[Gutierrez 2007] Claudio Gutierrez, Carlos A. Hurtado, and Alejandro Vaisman Introducting Time into RDF IEEE Transactions on Knowledge and Data Engineering, 2007. http://doi.ieeecomputersociety.org/10.1109/TKDE.2007.34. doi:https://doi.org/10.1109/TKDE.2007.34.

[Henson 2009] Cory A. Henson, Holger Neuhaus, Amit P. Sheth, Krishnaprasad Thirunarayan, and Rajkumar Buyya An Ontological Representation of Time Series Observations on the Semantic Sensor Web 1st International Workshop on theSemantic Sensor Web, 2009; Informal Proceedings. http://www.academia.edu/2174123/An_ontological_representation_of_time_series_observations_on_the_Semantic_Sensor_Web

[Rodríguez 2009] Alejandro Rodríguez, Robert McGrath, Yong Liu, and James Myers Semantic management of streaming data. International Workshop on Semantic Sensor Networks at ISWC, 2009 http://cet.ncsa.illinois.edu/publications/SemanticSN2009streaming.pdf

[Temporal SPARQL] Gregory Todd Williams tSPARQL: Using Quadstores for Temporal Querying of RDF [online]. [cited 12th July, 2013] http://tw.rpi.edu/2007/11/tsparql-poster.pdf

[Trust-Aware SPARQL] Olaf Hartig tSPARQL - A Trust-Aware Query Language [online]. [cited 12th July, 2013]. http://trdf.sourceforge.net/tsparql.shtml

[Carroll 2005] Jeremy J. Carroll, Christian Bizer, Pat Hayes, and Patrick Sticker Named Graphs, Provenance, and Trust WWW 2005 Proceedings of the 14th international conference on World Wide Web. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.59.4871. doi:https://doi.org/10.1145/1060745.1060835.

[SPARQL 1.1 Graph Store HTTP Protocol] Chimezie Ogbuji (editor), SPARQL 1.1 Graph Store HTTP Protocol [online]. [cited 1st September, 2013]. http://www.w3.org/TR/sparql11-http-rdf-update/

[Sowa 1976] John F. Sowa Conceptual Graphs for a Data Base Interface http://www.jfsowa.com/pubs/cg1976.pdf

[Fuzzy Sets] L.A. Zadeh Fuzzy Sets http://www-bisc.cs.berkeley.edu/Zadeh-1965.pdf

[Turtle 2013] David Beckett, and Tim Berners-Lee (editors) Turtle - Terse RDF Triple Language [online]. [cited 12th July, 2013]. http://www.w3.org/TeamSubmission/turtle/

^[1] Commonly called tSPARQL, though I refer to it by a longer moniker in order to distinguish it from Temporal SPARQL, which also is commonly called tSPARQL.

^[2] This is an important distinction, as it can greatly complicate matters, for example in a bi-temporal update when you mark certain records as obsolete (by adding a effective end-date to the record) you raise the spectre of identity of blank-node issues in the course of the update.

^[3] For example: http://answers.semanticweb.com/questions/1315/the-semantics-of-named-graphs

Author's keywords for this paper:

metadata; provenance; RDF; XML; modeling; formalism

Micah Dubinko

Lead Engineer

MarkLogic

`<Micah.Dubinko@marklogic.com>`

Micah Dubinko has worked on diverse projects, from portable heart monitors to mobile applications to search engines. He was the data engineer for Yahoo! SearchMonkey, a project that influenced his outlook toward semantic technologis. He is currently Lead Engineer for semantics and applications at MarkLogic.

BalisageThe Markup Conference

Balisage Paper: Transcending Triples

Modeling semantic applications that go beyond just triples

Micah Dubinko

`<Micah.Dubinko@marklogic.com>`

Table of Contents

Introduction

Asserting facts in XML

Asserting facts over time

Trust, Security, and Provenance

How to think about embedded RDF, and thereby named graph inference

The Future

References

Author's keywords for this paper:

`<Micah.Dubinko@marklogic.com>`

Balisage Series on Markup Technologies