Contemporary transformation of ancient documents for recording and retrieving maximum
information: when one form of markup is not enough
Balisage: The Markup Conference 2012
August 7 - 10, 2012
In this paper we primarily consider what we can gain from enhancing TEI-encoded texts with
RDF, though there are other choices of re-representation which could also be profitable in the
future. We consider the use of OAC annotations as part of our work for the future. To
illustrate our approach, we take as a case study the Sharing Ancient Wisdoms (SAWS) project, which explores and analyses the tradition of wisdom literatures in
ancient Greek, Arabic and other languages. Our methods for representing semantic links within
and between specific sections of these texts, and describing the relationships that exist
between them in a systematic way, are documented and explained. We consider that this approach
has the potential to be used widely to link and describe related sections of a variety of
different types of texts. Given the common practice of publishing TEI documents as part of
Digital Humanities research output, our central contribution is to demonstrate how the
usefulness of these TEI documents can be developed further in diverse directions, beyond their
current application for digital edition publication.
The Sharing Ancient Wisdoms (SAWS) use case: sources and materials
SAWS is a key use case for this work, demonstrating a requirement for a markup approach
that encapsulates various types of information, including structural markup and semantic
annotation. The SAWS project aims to present its texts digitally in a manner that enables
linking and comparisons within and between anthologies, their source texts, and the texts that
draw upon them. We are also creating a framework through which other projects can link their
own materials to these texts via the Semantic Web, thus providing a ‘hub’ for future
scholarship on these texts and in related areas. The project is funded by HERA (Humanities in
the European Research Area) as part of a programme to investigate cultural dynamics in Europe,
and is composed of teams at the Department of Digital Humanities and the Centre for e-Research
at King's College London, The Newman Institute Uppsala in Sweden, and the University of
Throughout antiquity and the Middle Ages, anthologies of extracts from larger texts
containing wise or useful sayings were created and circulated widely, as a practical response
to the cost and inaccessibility of full texts in an age when these existed only in manuscript form. SAWS focuses on gnomologia (also known as florilegia), which are manuscripts that
collected moral or social advice, and philosophical ideas, although the methods and tools
developed are applicable to other manuscripts of an analogous form (e.g. medieval scientific
or medical texts).
The key characteristics of these manuscripts are that they are collections of smaller
extracts of earlier works, and that, when new collections were created, they were rarely
straightforward copies. Rather, sayings were selected from various manuscripts, reorganised or
reordered, and subtly (or not so subtly) modified or reattributed. The genre also crossed
linguistic barriers, in particular being translated from Greek into Arabic, and again these
were rarely a matter of straightforward translations; they tend to be variations. In later
centuries, these collections were translated into western European languages, and their
significance is underlined by the fact that Caxton’s first imprint (the first book ever
published in England) was one such collection. Thus the corpus of material can be regarded as a very complex directed network or
graph of manuscripts and individual sayings that are interrelated in a great variety of ways,
an analysis of which can reveal a great deal about the dynamics of the cultures that created
and used these texts.
Identifying and extracting the required data for SAWS
TEI traditionally excels in areas such as text structure definition and document metadata,
and although it possesses the means to identify and define semantic relationships between
sections of text, none of these methods has, as far as we have been able to determine, been
adopted widely or used as a standard mechanism for recording the nature of the relationship
between texts. For instance we could use <ref target="..."> to point to another section
of text, but we would need to modify the schema to require that @type should appear, allowing
us to insert a description of the relationship between these two sections. Another possibility
would be to use an <interp> element with an @xml:id attribute that contained the
required relationship: its @inst attribute could then be used to point to another section of
text. However, the insertion of an attribute detailing the source of the asserted relationship
(i.e. the person or bibliographic source responsible for making the assertion) is also vital
to SAWS: we need to be able to trace the scholarly source of that link. The <relation>
element, which is a recent addition to the TEI, provides us with the ability to include all of
the desired information within one element: the ID of the section of text being linked from;
the ID of the section of text being linked to; the nature of the relationship between the two
sections; and the identity of the source responsible for making the assertion. Our use of the
<relation> element is discussed fully below (see ‘Use case implementation: illustrating
the SAWS usage of TEI and RDF’), but it is worth noting in this introductory section the
important point that the use of <relation> allows us to enter RDF directly into the TEI
document (i.e. the triples we are defining about the sections of text and their relationships)
and to combine this with information about scholarly responsibility, all within one element.
This is particularly useful when the data is being entered by scholars who are familiar with
TEI encoding and are marking up the rest of their documents in TEI, but who do not have any
training in RDF. Being able to enter the RDF data directly into the TEI document means that
they do not have to learn a second set of skills, while at the same time we can make use of
the advantages of RDF (see below, ‘Resulting benefits for information exploration and
retrieval in the SAWS project’).
These types of semantic relationships within and between texts are particularly important
to the understanding of how themes and ideas were transmitted between cultures, and across
languages and time. As an example use case, in the SAWS project a key point of interest to our
manuscript scholars is to represent relationships within and between different collections of
wise or moral sayings, and to investigate how these collections have been referred to, amended
and/or passed on from manuscript to manuscript. We want to record and visualise the links
within and between these collections; from these collections to their source texts (e.g.
Aristotle’s writings); and from these collections to their recipient texts (e.g. the
11th-century Strategikon of Kekaumenos, as well as later texts). Critically, we want to do
this in a way which can be repeated by others, so that our collection of texts acts as an
example and starting point for a larger enterprise taking this approach beyond our project
At the moment, scholars of gnomologia and their related
texts tend to work from manuscripts and printed editions, and the links between the texts they
are working on are recorded within commentaries and footnotes. Sometimes their editions will
include studies of the relationships between specific manuscripts: for instance, a discussion
of the transmission of a particular work through a number of different manuscripts. What SAWS
will provide is the ability for scholars to investigate much more deeply the relationships
between specific sayings within those texts, and to follow those links through a number of
different variants and languages. This is achieved by enabling identification and annotation
of relationships by different scholars within a ‘hub’ that will provide visualisations of
those relationships as well as direct links to the texts concerned (or in the case of texts
that are not digitised, a URI for that text). Scholars interested in a particular saying or
set of sayings will immediately be able to see both the fact that the saying is related to
sayings within other texts (each of these identifiers will be displayed to them, with a
clickable link to that text), and will also see a description of the nature of the
relationships that have been identified. They will also be able to view who has asserted that
relationship, and can add their own assertions or notes as desired.
As an illustration of why this is important for textual scholars, consider this saying
from Gnomologium Vaticanum (no. 87):
Ὁ αὐτὸς ἐρωτηθεὶς τίνα μᾶλλον ἀγαπᾷ, Φίλιππον ἢ Ἀριστοτέλην, εἶπεν· “ὁμοίως ἀμφοτέρους· ὁ
μὲν γάρ μοι τὸ ζῆν ἐχαρίσατο, ὁ δὲ τὸ καλῶς ζῆν ἐπαίδευσεν.”
Alexander, asked whom he loved more, Philip or Aristotle, said:
“Both equally, for one gave me the gift of life, the other taught me to live the virtuous
life.”
We can identify that this saying (i.e. section of text) exists in various forms in earlier
works, and that there are relationships that can be defined between our first example and
those below (and indeed between the various examples below):
Plutarch, Life of Alexander 8.4.1:
Ἀριστοτέλην δὲ θαυμάζων ἐν ἀρχῇ καὶ ἀγαπῶν οὐχ ἧττον, ὡς αὐτὸς ἔλεγε, τοῦ πατρός, ὡς δι'
ἐκεῖνον μὲν ζῶν, διὰ τοῦτον δὲ καλῶς ζῶν ...
Alexander admired Aristotle at the start and loved him no less, as
he himself said, than his own father, since he had life through his father but the virtuous
life through Aristotle …
Diogenes Laertius 5.19, Life of Aristotle:
Tῶν γονέων τοὺς παιδεύσαντας ἐντιμοτέρους εἶναι τῶν μόνον γεννησάντων· τοὺς μὲν γὰρ τὸ
ζῆν, τοὺς δὲ τὸ καλῶς ζῆν παρασχέσθαι.
Aristotle said that educators are more to be honored than mere
begetters, for the latter offer life but the former offer the good life.
Pythagoras? Selections from the Sayings of the Four
Philosophers: (B) Pythagoras saying 18 (ed. Gutas):
وقال الآباء هم سبب الحياة والحكماء هم سبب صلاح الحياة
He said: Fathers are the cause of life, but philosophers are the
cause of the good life.
We can see clearly that these four sayings are related to one another in various ways, but
that there are complexities between these texts that need to be described and documented (and
ideally visualised) if we are going to be able to trace these relationships in a systematic
way.
In the last example above, we can see that the saying has been attributed to a different
author (Pythagoras), rather than being associated with Aristotle or his pupil Alexander:
alternative attributions are a common feature of this type of text, and they add another layer
of complexity to the types of relationship that need to be defined.
In our TEI document, therefore, we need to be able to:
insert links between these sections of text (which may or may not already be digitised);
make scholarly assertions in a systematic way about the nature of the (often
complex) relationships between these texts.
In order to achieve these aims, we have chosen to enhance our TEI with RDF. RDF provides
an ideal way to store and manipulate our relationship data: each of the sayings can be linked
to other relevant sections of text by means of a subject-predicate-object relationship that is
defined as part of an ontology, which acts as an authority list. One of the main advantages of
the ontology for the SAWS project is that it ensures consistency of description across texts
that can vary greatly in their nature, but interestingly it has also acted as a means of
stimulating scholarly discussion about the nature of the relationships and the ways in which
they should be described. The textual scholars involved in the project have found that the
necessity to be completely explicit about their decision-making processes and definitions has
prompted them to identify, and describe concisely, new relationships that exist within and
between their texts.
The way in which we are implementing the use of RDF within our TEI documents will now be
described, and will be followed by specific examples from our SAWS texts to illustrate how
this is being put into practice.
Background: Previous TEI and RDF combinatory approaches
We would like to be able to use RDF-like syntax to mark up information of semantic
interest such as relations between the text and links to external entities, supported by a
relevant vocabulary. Whilst RDFa allows RDF to be directly encoded in markup documents, it has
been primarily deployed in XHTML documents to date. It would be desirable to extend the scope
of RDF to a wider scale, and particularly for our purposes (and others) to TEI XML documents, without extensive changes being required to the variant of
XML being used for the source document or to the skills and workflow being used in the markup
process. This last point is of particular concern for non-technical users of TEI markup: an
established and growing community, not least given the increasing adoption of TEI by
humanities scholars for Digital Humanities research. Keeping structural, syntactical and semantic information in the same documents
where possible also makes the process of markup simpler and less error-prone for
non-technical users who wish to mark up documents with their annotations, though it is
acknowledged that this is not always possible. To date, no method for accommodating TEI and
RDF in the same document has been adopted as standard by the TEI community, though several
approaches have recently been offered.
RDFTEF is a Java-based tool for converting TEI files to a form which can incorporate and
output RDF/XML markup. Based around the Jena framework for semantic web applications, RDFTEF implements a basic ontology for representing structural and syntactical
elements and allows additional ontologies to be added as required. Though SPARQL queries can
be fashioned to query the resulting RDF, these need to be relatively complex and standard XML
tools cannot be deployed within the RDFTEF environment. RDFTEF has been criticised as ‘[o]nly a “toy” experiment’ for these limitations and due to its lack of ongoing maintenance (last source code
update 2007). Also, RDFTEF introduces a new stage of work to the existing editing workflow and
requires extra software to be deployed for and learned by the users. Given the non-technical
nature of the target audience who will be marking up the documents with this semantic
information, this is a significant concern to the SAWS project and potentially hinders the
adoption of our approach by our target users.
The issues for non-technical users also problematise other interesting approaches, where
RDFa has been used to encode RDF in a TEI document.
Although the markup process was relatively straightforward, specialised scripts
had to be deployed to extract the RDF information in a form suitable for adding to a triple
store. Deploying such scripts is non-trivial for non-technical users both in setting up the
appropriate environment and in executing the scripts. The scripts used in Jewell’s and
Lawrence’s work were also highly specific to the type of information in those documents,
rather than being more domain-general. These issues with over-specific scripts and associated
implementation issues were also seen in a similar script-based approach to automated creation
of RDF triples from TEI documents, in work performed by the SPQR project. In terms of implementation and re-use, there is a more user-friendly alternative
of transformations through XSLT stylesheets, the execution of which is incorporated into the
user interface of tools like the Oxygen XML editor. To avoid or at least reduce
over-specificity and encourage re-use of our materials, the adoption of a more generic
underlying model for transformations is an interesting alternative, as is explored in this
paper.
Another tool is available to represent document structure(s) with RDF: the EARMARK OWL ontology.
The inclusion of RDF in TEI documents is a current area of interest in the TEI community.
Members of the TEI-Ontologies Special Interest Group (SIG) are using XSLTs to convert TEI to RDF, by relating TEI markup to vocabulary in the
CIDOC-CRM cultural heritage model (a recognised ISO standard: ISO 21127). The SIG has also discussed the
inclusion of FRBRoo (a bibliographical records model harmonised with CIDOC-CRM) in the base vocabulary; however, work in this area has not progressed and development has been concentrated
around a TEI-CIDOC harmonisation. This co-operation between TEI and CIDOC-CRM has been
formally active since the formation of the SIG in 2004 and has seen regular but reasonably
slow-paced development, probably due to the other commitments and geographical dispersion of the
researchers involved. Some mappings have been drafted (last updated 2007/8) and stylesheets (last updated 2011) and guidelines (last updated 2010) have been published, but several issues exist that are
hampering the SIG’s progress:
The approach taken by the SIG requires some changes to be made to TEI, with new
elements to be added and others to be extended. This raises questions as to the applicability of the resulting stylesheet to
existing and legacy TEI documents.
The size of the current TEI P5 tagset, containing hundreds of elements, raises
practical difficulties in providing a comprehensive mapping from TEI to alternative
representations. The TEI ontologies SIG has identified a subset of TEI elements to map
to CIDOC-CRM, choosing only elements which represent semantically meaningful elements
within the text, “elements such as persons, places, dates and events”. This approach is practical but disregards many triples of potential interest
within the TEI markup such as document structure and metadata. It also limits the scope
of output triples to only those elements encodable using TEI markup, such as names of
places and people.
It is questionable whether CIDOC-CRM is the best choice of vocabulary to be used for
modelling textual document information, especially as its only direct representation of
lexical material is through one class (E33 Linguistic Object) and its two subclasses
(E34 Inscription, E35 Title). This choice of CIDOC as base model is acknowledged to be
influenced by the research interests of the SIG members in cultural heritage and museum documentation. Particularly for metadata information such as that contained in the TEI
Header, the Dublin Core model seems a more natural choice and is a highly developed and widely adopted
ontology. A mapping from TEI to DC has been tackled in stylesheets but does not appear in their main approach or considerations.
It is desirable (e.g. for SAWS) to be able to mark up triple-like relations directly in
TEI, particularly if those relations are specific to the subject domain of the original text
and/or if the relations indicate semantic information which cannot currently be encoded using
TEI markup. The <relation> element has recently been recommended by the TEI for encoding RDF relations in a TEI document,
representing the Subject-Predicate-Object triple format through the following attributes of
<relation>: @active, @ref and @passive respectively. This has increased the
expressiveness of standard TEI markup without requiring changes within TEI. Further, RDF can
be included directly in TEI markup, allowing researchers to use the workflow and tools they
are already accustomed to rather than introducing a requirement for new tools to be learnt and
used, external to the existing workflow. This is of particular benefit for users of TEI who do
not have a strong technical background.
Automatic extraction of information from TEI documents
Much information can be extracted from the markup already in a TEI document, particularly
metadata and document structure. This ensures that markup work already invested in texts can
be extracted from the text and represented in alternative forms that are more amenable to
querying and automated reasoning. For example, in SAWS, there is an interest in how the
structure and ordering of wise sayings change as they are copied from one manuscript to
another.
Acknowledging the size of the TEI tagset and the associated practical difficulties in
mapping, we take the minimal subset of TEI needed to encode a document in TEI markup,
TEI-Bare. Work done with this schema serves as a basis for further extensions, for example to
TEI-Lite, identified as “the most widely used TEI customization”. The Dublin Core Metadata Initiative forms the base model for the mappings from TEI.
The comparison of TEI and RDF is an oddly emotional topic. The strength of RDF lies in its
apparent simplicity, and in its interoperability. RDF data is discoverable, and reusable. An
OAC annotation for instance may have any number of targets of differing types. TEI allows for
extremely granular expression with a context; RDF may often not require context to be
Deceptively simple SPO assertions can be combined to tell complex stories. The following
annotation is relatively terse, but conveys much information, all of it easily discoverable
using either SOLR or SPARQL. There is considerable metadata surrounding the individual
annotation indicating what standards were employed, how it was encoded, the creation date etc.
<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/"
<dcterms:created>2012-07-18 18:52:12 UTC</dcterms:created>
<cnt:chars>Sample text for demo</cnt:chars>
<cnt:chars><svg:rect xmlns:svg='http://www.w3.org/2000/svg' x='283.5' y='615.5'
width='377' height='108' r='0' rx='0' ry='0' fill='#ffffff' stroke='#000000'
style='opacity: 0.7; stroke-width: 2;' opacity='0.7' stroke-width='2'
An advantage of OAC style encoding is that embedded tags are not necessary for the
designation of a target. A target may be defined as either SVG coordinates as in the example
below, or starting and stopping at two line/character points. These points may be inside
tagsets, allowing us to mimic overlapping tags without breaking XML validation. In this example
the RDF targets a body of text beginning with the 6th character, and being 11 characters long,
and ties this back to an authority record.
<w:content type="props">Jean Golfin</w:content>
<w:term type="info">Golfin, Jean</w:term>
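How a character-range target of this kind resolves to a stretch of text can be sketched as follows. This is a minimal illustration, not the OAC tooling itself: the `TextTarget` class, the `resolve_target` function and the sample sentence are all invented here, and it assumes that "the 6th character" is counted from 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextTarget:
    """A character-range target: a 1-based start position plus a length."""
    start: int   # position of the first targeted character, counting from 1 (assumption)
    length: int  # number of characters covered by the target


def resolve_target(text: str, target: TextTarget) -> str:
    """Return the substring of `text` that the target points at."""
    begin = target.start - 1          # convert the 1-based position to a 0-based index
    end = begin + target.length
    if begin < 0 or end > len(text):
        raise ValueError("target range falls outside the text")
    return text[begin:end]


# Invented sample sentence; only the name "Jean Golfin" comes from the example above.
sample = "Sire Jean Golfin translated the text."
name_target = TextTarget(start=6, length=11)
print(resolve_target(sample, name_target))
```

Because the target is held outside the text, two such ranges may overlap freely without any tag in the source document having to nest.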
By moving our structural TEI encoding, still very valuable in its native form, to OAC/RDF
equivalents, we expose relationships based either on the physical textual coordinates, x/y
coordinates, or structural location.
Use case implementation: illustrating the SAWS usage of TEI and RDF
The requirements for the SAWS project have been described above; namely that we need to
insert links between sections of text within and between documents (some of which exist in
digital form, and some of which do not), and to make scholarly assertions in a systematic way
about the nature of these often complex relationships between sections of text.
First of all, therefore, we must define the basic unit of interest (a ‘section’ or
‘segment’ of text), i.e. the saying (or part of the saying). The SAWS TEI schema, designed at
King’s College London for the encoding of gnomologia, uses the
<seg> element to mark up
this unit of intellectual interest, such as a saying (statement) together with its surrounding
story (narrative). For example:
Alexander, asked whom he loved more, Philip or Aristotle, said:
“Both equally, for one gave me the gift of life, the other taught me to live the virtuous
life.”
This contains both a statement and a narrative:
Alexander, asked whom he loved more, Philip or Aristotle, said:
Both equally, for one gave me the gift of life, the other taught
me to live the virtuous life.
Each of these <seg> elements is given an @xml:id to
provide a unique identifier (which is automatically generated using simple XSLT). This
identifier differentiates one <seg> from all other <seg> elements, for instance
<seg type="statement" xml:id="K.al-Haraka_ci_s1">, where K.al-Haraka_ci_s1
is the unique identifier. In other words, it allows each intellectually interesting unit (as
identified by our team’s scholars) to be distinguished from each other unit, thus providing
the means of referring to a specific, often very brief, section of the text.
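The automatic generation of identifiers can be sketched as follows. SAWS does this with XSLT; Python is used here purely for illustration. The naming pattern (document prefix, enclosing `<div>` identifier, running number) is an assumption inferred from the single example K.al-Haraka_ci_s1, and the function name and sample fragment are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def assign_seg_ids(root: ET.Element, doc_prefix: str) -> list[str]:
    """Give every <seg> lacking an xml:id one of the form prefix_divid_sN.

    Assumes a flat structure: each <seg> sits under exactly one identified <div>.
    """
    assigned = []
    for div in root.iter(f"{{{TEI_NS}}}div"):
        div_id = div.get(XML_ID)
        if div_id is None:
            continue  # without a div identifier the assumed pattern cannot be built
        for number, seg in enumerate(div.iter(f"{{{TEI_NS}}}seg"), start=1):
            if seg.get(XML_ID) is None:
                seg.set(XML_ID, f"{doc_prefix}_{div_id}_s{number}")
                assigned.append(seg.get(XML_ID))
    return assigned


# Hypothetical fragment; the text content is placeholder material.
sample = f"""<body xmlns="{TEI_NS}">
  <div xml:id="ci">
    <p><seg type="statement">First saying.</seg>
       <seg type="narrative">Second unit.</seg></p>
  </div>
</body>"""

root = ET.fromstring(sample)
print(assign_seg_ids(root, "K.al-Haraka"))
```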
Secondly, we must have a systematic way of defining the relationship between one section
of text and another. Using a systematic method is important for two reasons: to ensure
consistency in the descriptive terms that we use across the SAWS project, and to develop a
shared vocabulary between SAWS and other projects to which we want to make links (and which
want to link their data to ours). We have therefore taken every possible opportunity to
explore with other manuscript scholars the terms they need to use to describe the
relationships that they can observe within, and between, their texts. Relationships identified
include terms such as isCloseRenderingOf, isLooseTranslationOf, isVerbatimOf, and a variety of
other terms that represent in an agreed form the different ways in which sections of text are
connected to one another.
We are representing these relationships using an ontology that extends the FRBR-oo model (the harmonisation of the FRBR model of bibliographic records and the CIDOC Conceptual Reference Model (CIDOC-CRM)). The SAWS ontology, developed through collaboration between domain experts and technical
observers, models the classes and links in the SAWS manuscripts. Basing the SAWS ontology
around FRBR-oo provides most vocabulary for both the bibliographic (FRBR) and cultural
heritage (CIDOC) aspects being modelled. Using this underlying ontology as a basis,
relationships between (or within) manuscripts can be added to the TEI documents using RDF markup.
To include RDF triples in TEI documents, three entities have to be represented for each
triple: the subject being linked from, the object being linked to, and a description of the
link between them. The subject and object entities in the RDF triple are represented by the
@xml:id that has been given to each of the TEI sections of
interest. We use the TEI element <relation> (recently
added to TEI) to place RDF markup in the SAWS documents, with four attributes as
follows:
The value of @active is the @xml:id of the subject being linked from;
The value of @passive is the @xml:id or URI of the object being linked to;
The value of @ref is the description of the relationship, which is drawn directly from the
list of relationships in the ontology;
The value of @resp is the name or identifier of a particular individual or resource (such as
a bibliographic reference). Many of the links being highlighted are subjectively identified
and are a matter of expert opinion, so it is important to record the identity of the
person(s) responsible.
برهان ثالث كل محرّك لذاته فهو راجع على ذاته
Πᾶν τὸ ἑαυτὸ κινοῦν πρώτως πρὸς ἑαυτό ἐστιν ἐπιστρεπτικόν.
This is equivalent to stating that the Arabic segment with the xml:id “ci_s5” in the
K._al-Haraka document is a close rendering of the Greek segment identified as “ci1” in
Proclus_ET_Prop.17, and that this relationship has been asserted by Elvira Wakelnig. The
definition of ‘isCloseRenderingOf’ has been agreed upon and
documented within the ontology, and the schema has been populated from the ontology so that a
drop-down menu appears in the XML editor, from which the required value of @ref can be
selected. The <relation/> element can be placed anywhere within the TEI document, or
indeed in a separate document if required: for our own purposes we have found it useful to
place it immediately after the closing tag of the <seg> identified as the “active” entity.
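Reading the four attributes of <relation> back out as a subject-predicate-object assertion with its responsible party can be sketched as follows. This is an illustration under stated assumptions, not the SAWS implementation: the attribute values in the sample are invented spellings based on the description above (the real documents may use full URIs or prefixed names), and the `Assertion` type and `extract_assertions` function are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple

TEI_NS = "http://www.tei-c.org/ns/1.0"


class Assertion(NamedTuple):
    subject: str      # taken from @active
    predicate: str    # taken from @ref
    obj: str          # taken from @passive
    responsible: str  # taken from @resp


def extract_assertions(root: ET.Element) -> list[Assertion]:
    """Collect one Assertion per complete <relation> element."""
    found = []
    for rel in root.iter(f"{{{TEI_NS}}}relation"):
        values = [rel.get(name) for name in ("active", "ref", "passive", "resp")]
        if None in values:
            continue  # skip incomplete relations rather than guess a value
        found.append(Assertion(*values))
    return found


# Hypothetical fragment: identifiers and attribute spellings are assumptions.
sample = f"""<div xmlns="{TEI_NS}">
  <seg xml:id="K.al-Haraka_ci_s5">...</seg>
  <relation active="K.al-Haraka_ci_s5" ref="isCloseRenderingOf"
            passive="Proclus_ET_Prop.17_ci1" resp="Elvira Wakelnig"/>
</div>"""

for assertion in extract_assertions(ET.fromstring(sample)):
    print(assertion)
```

Keeping the responsible party alongside the triple, rather than discarding it on extraction, preserves the scholarly attribution that the project treats as essential.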
Some of the content of our texts could also be enhanced by being viewed in context by
including information external to the XML document. For this purpose, the SAWS project will
also use Linked Data principles to mark up our texts with semantic links to collections of
data on the ancient world, such as the Pleiades historical gazetteer of ancient places and the Pelagios collection of ancient data interlinked through Pleiades references, and the Prosopography of the Byzantine World, which aims to document all the individuals mentioned in textual Byzantine sources
from the seventh to thirteenth centuries. We also plan to mark up links to existing relevant
documents such as those stored in the Perseus Digital Library (which holds editions of some of the texts we identify as source texts for the
Examples of transformations from TEI to RDF for the SAWS use case
Taking the SAWS use case as an example, the TEI version of the Kitāb al-Ḥaraka (“Book of
Motion”) held at Ankara Üniversitesi contains the following TEI-Bare-compliant information
in its TEI header:
<title>Hacı Mahmud Efendi 5683</title>
<publisher>Sharing Ancient Wisdoms</publisher>
Applying the XSLT generates the following Dublin Core triples:
<dct:title>Hacı Mahmud Efendi 5683</dct:title>
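The header-to-Dublin-Core step can be sketched as follows. Python stands in for the XSLT, the two-entry mapping table is an assumption covering only the elements shown above, and the document URI and the header nesting in the sample are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
DCT = "http://purl.org/dc/terms/"

# Assumed mapping from TEI header element names to Dublin Core terms.
HEADER_TO_DC = {"title": DCT + "title", "publisher": DCT + "publisher"}


def header_triples(header: ET.Element, doc_uri: str) -> list[tuple[str, str, str]]:
    """Emit (document, DC term, literal) for each mapped header element."""
    triples = []
    for element in header.iter():
        local_name = element.tag.split("}")[-1]   # drop the namespace part
        predicate = HEADER_TO_DC.get(local_name)
        if predicate and element.text and element.text.strip():
            triples.append((doc_uri, predicate, element.text.strip()))
    return triples


# Hypothetical header; only the title and publisher values come from the text.
sample = f"""<teiHeader xmlns="{TEI_NS}"><fileDesc>
  <titleStmt><title>Hacı Mahmud Efendi 5683</title></titleStmt>
  <publicationStmt><publisher>Sharing Ancient Wisdoms</publisher></publicationStmt>
</fileDesc></teiHeader>"""

for triple in header_triples(ET.fromstring(sample), "http://example.org/saws/K.al-Haraka"):
    print(triple)
```

A fuller mapping would need to distinguish, for instance, a title in the title statement from one in the source description; this sketch treats every matching element alike.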
As an example of structural triples, take SAWS’ TEI version of the Corpus Parisinum
manuscript as stored in the Digby collection in the Bodleian library, Oxford, UK, in which a
<div xml:id="Aristippus01"> section is contained by its parent,
<div xml:id="Part01">. From this we can derive the following two triples:
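The derivation of such containment triples can be sketched as follows. This is only an illustration: the choice of dcterms:isPartOf and dcterms:hasPart as predicates is an assumption made here, not confirmed as the vocabulary emitted by the SAWS stylesheet, and Python again stands in for the XSLT.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
DCT = "http://purl.org/dc/terms/"


def containment_triples(root: ET.Element) -> list[tuple[str, str, str]]:
    """For each identified <div> nested directly in another, emit both directions."""
    div_tag = f"{{{TEI_NS}}}div"
    triples = []
    for parent in root.iter(div_tag):
        parent_id = parent.get(XML_ID)
        for child in parent.findall(div_tag):   # direct child divs only
            child_id = child.get(XML_ID)
            if parent_id and child_id:
                # Assumed predicates: the child is part of the parent, and vice versa.
                triples.append((child_id, DCT + "isPartOf", parent_id))
                triples.append((parent_id, DCT + "hasPart", child_id))
    return triples


# Minimal fragment using the two identifiers named in the text.
sample = f"""<body xmlns="{TEI_NS}">
  <div xml:id="Part01">
    <div xml:id="Aristippus01"><p>...</p></div>
  </div>
</body>"""

for triple in containment_triples(ET.fromstring(sample)):
    print(triple)
```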
Resulting benefits for information exploration and retrieval in the SAWS project
We now have the capacity to extract many triples from our TEI document. The TEI-Bare XSLT allows us to extract RDF triples representing information about the document
structure and metadata about the markup, as encoded in the TEI markup. This XSLT can also now
be simply extended to extract more semantics, by transforming the triples encoded through the
<relation> element into RDF/XML syntax.
Once information is available in RDF format, it can be queried and reasoned with.
Critically, queries can be constructed based around the semantics encoded in the triples. The distribution of knowledge across Linked Data means that logical inferences can
be made to derive new knowledge from the facts, and also from the external data sources that
have been referenced by the RDF triples.
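Querying and link traversal over extracted triples can be sketched as follows. This is plain Python pattern matching rather than SPARQL or a reasoner, intended only to show the idea. The segment identifiers are invented; the three relationship names are those listed earlier, and treating them as traversable in one direction is an assumption.

```python
from typing import Optional

Triple = tuple[str, str, str]


def match(triples: list[Triple], s: Optional[str] = None,
          p: Optional[str] = None, o: Optional[str] = None) -> list[Triple]:
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


def reachable(triples: list[Triple], start: str, predicates: set[str]) -> set[str]:
    """Follow the given predicates outward from `start`, collecting every node reached."""
    seen: set[str] = set()
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for _, predicate, target in match(triples, s=node):
            if predicate in predicates and target not in seen:
                seen.add(target)
                frontier.append(target)
    return seen


# Invented identifiers standing in for segment xml:ids.
links = [
    ("seg_A", "isCloseRenderingOf", "seg_B"),
    ("seg_B", "isVerbatimOf", "seg_C"),
    ("seg_D", "isLooseTranslationOf", "seg_B"),
]

print(match(links, o="seg_B"))
print(reachable(links, "seg_A", {"isCloseRenderingOf", "isVerbatimOf"}))
```

The second call reaches seg_C from seg_A although no single assertion links the two, which is the kind of indirect connection that is hard to see when the links live only in footnotes.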
The ability to traverse links between sets of data and discover related information
serendipitously is one of the major benefits of adopting linked data for the SAWS project. For
the scholars working in SAWS, the study of the links between and within documents is a central
part of the academic research underpinning this project. Extra assistance in finding relevant information can help discover sources of
interest that might otherwise have been missed, as many potential sources are geographically
scattered, occasionally hard and/or time-consuming to access and may also be completely
unknown outside of a handful of scholars. As an example, the Perseus Digital Library holds a
collection of Classics-related documents which collectively contain over 68 million words, as
well as an Arabic collection containing over 5 million words, and other collections. Navigating such quantities of potential research material to find content of
interest is one of the challenges faced by Classics researchers. Digitisation and cataloguing
of the sources through projects like Perseus has been an important step in facilitating this
research, and is being enhanced further by semantic navigation such as that undertaken in the
To illustrate ways in which linked data specifically assists scholars in the use case of
SAWS, we look at how the scholars can discover information in new ways, draw from a broader
set of sources and compile evidence for their research. If, say, a researcher is looking at
how a particular place of interest is described across different manuscripts, information in
the Pleiades historical gazetteer can be consulted when constructing queries. Researchers can
ask to see, for example, all texts that refer to that particular geographical location, even
if the texts use different place names to refer to that geographical location (as it was often
the case that places were referred to by different names in different historical periods). For
SAWS, this helps with the added complication of manuscripts in different languages, with
different character sets (compare for example Ancient Greek, Arabic). This is possible through
examining the place names mentioned in the SAWS manuscripts in the context of the information
in the Pleiades ontology, which gives a precise geographical reference for each place.
For example, the place “Aphrodisias” (URI
http://pleiades.stoa.org/places/638753) was known by the names:
Ninoe (Classical period),
Aphrodeisias (Hellenistic-republican and Roman periods),
Lelegon polis (unspecified period),
Stauropolis (Late-antique period),
Aphrodisias (Roman and Late-antique periods).
In Ancient Greek it is referred to as Ἀφροδισιάς (or Νινόη, Ἀφροδεισιάς, Λελέγων πόλις,
Developing this example, we can disambiguate between Aphrodisias located in modern-day
Turkey and the Aphrodisias located by modern-day Spain (URI
http://pleiades.stoa.org/places/255978/), which the textual information alone
would not allow us to distinguish.
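The Aphrodisias example can be sketched in a few lines of Python. The two Pleiades URIs and the name variants are taken from the text above; the segment identifiers and the in-memory list of mentions are hypothetical stand-ins for links that an encoder would record, and a real system would more likely query an RDF store.

```python
# Sketch: querying by place identity rather than by the name written in a text.
# The URIs and name variants come from the example above; the segment ids
# and the "mentions" list are hypothetical.

APHRODISIAS_TURKEY = "http://pleiades.stoa.org/places/638753"
APHRODISIAS_SPAIN = "http://pleiades.stoa.org/places/255978/"

# (segment id, name as written, Pleiades URI chosen by the encoder)
mentions = [
    ("seg_a", "Ninoe", APHRODISIAS_TURKEY),
    ("seg_b", "Stauropolis", APHRODISIAS_TURKEY),
    ("seg_c", "Aphrodisias", APHRODISIAS_TURKEY),
    ("seg_d", "Aphrodisias", APHRODISIAS_SPAIN),
]


def segments_for_place(uri):
    """Return ids of all segments linked to the place, whatever name they use."""
    return [seg for (seg, _name, place) in mentions if place == uri]


def places_named(name):
    """Return the distinct place URIs that a written name has been linked to."""
    return sorted({place for (_seg, written, place) in mentions if written == name})
```

Here segments_for_place finds the segments that use the names Ninoe and Stauropolis alongside the one that uses Aphrodisias, while places_named("Aphrodisias") shows that the same written name resolves to two different places.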
Returning to the issue of the SAWS manuscripts being written in various languages
(Ancient Greek and Arabic being the two main languages, with some related documents in Spanish,
Latin and English to date): although the TEI documents contain transcriptions of manuscripts
in the original language, the use of RDF and linking allows the manuscript information to
transcend language boundaries to some extent, as parts of the text can be linked to resources
which are more language-neutral (e.g. the person “Aristotle” can be represented by the URI
http://dbpedia.org/resource/Aristotle regardless of whether he is
referred to as Aristotle, Ἀριστοτέλης, أرسطو, Aristoteles, Aristóteles or other alternative
forms in the original document). This is particularly helpful in studying the transmission of
information in the manuscripts across languages, especially if the researcher does not have
sufficient language skills to navigate between the different languages.
Evaluation of the SAWS implementation
To evaluate the usefulness of this work, researchers on the SAWS project are currently
encoding RDF information into existing TEI versions of manuscripts they are interested in.
After the researchers had discussed which research questions they would like to explore, the
TEI publications and the enhancements possible with the RDF information were demonstrated at a workshop
in June 2012. This highlighted several benefits, in particular the increased motivation that came from
actually seeing how the manuscripts could be navigated in this format, both through exploring
the TEI digital edition and through seeing the tangible benefits of a semantically enhanced
The demo also prompted useful constructive feedback, leading to further relation types
being identified for the SAWS model. It also sparked scholarly debate after different
interpretations of the notion of translation were identified
(these might not have been noticed and acted upon, had the scholars not been
required to formalise their tacit knowledge collaboratively). Since the demo, ongoing
consultation with manuscript scholars has provided, and will continue to provide,
formative evaluative feedback for further developments.
With a basic TEI to RDF mapping in place, implemented in an easily extensible transformation
mechanism such as XSLT, we have a firm basis for future development of mappings, by both
ourselves and others, to include more of the TEI tagset. More generally, the choice of TEI
tags being included will be dictated by individual needs (for example, SAWS uses a specific
customisation of the TEI schema, as mentioned above, so is concentrating on tags used in that
schema). In particular, we are discussing with collaborators how FRBR-oo can be used to
enhance the base ontological model for the TEI to RDF mapping, for a richer vocabulary which
includes more detailed semantics than Dublin Core (given that Dublin Core concentrates on
modelling metadata and basic structures). We hope to discuss this work with members of the
Special Interest Group on TEI and ontologies and make contributions to this group’s
Upon determining our mappings, obtaining the data becomes a matter of simple extraction.
The RDF in our example makes direct connections - A is a child of B. Having information
available in RDF is useful not only for what can be done directly with RDF, but for the
possible transformations from RDF to other data representations. One of this paper’s authors
is working with the image-based manuscript annotation environment Shared Canvas, which makes
use of Open Annotation Collaboration (OAC) syntax for annotations. An OAC annotation maps neatly
to an RDF triple, where an active/subject item has an annotation with a body of x (e.g.
isCloseTranslationOf) and a target of y (e.g. xml:id=GV132874897).
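The extraction step and the annotation mapping described above can be sketched as follows. This is an illustration only: it assumes, for the purposes of the example, that relationships are recorded in TEI relation elements carrying active, ref and passive attributes, and the sample fragment, the identifier AR_001 and the "saws:" prefix are hypothetical. The paper's own transformation mechanism is XSLT; Python is used here purely for brevity.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical TEI fragment; the markup pattern is an assumption of this sketch.
SAMPLE_TEI = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <seg xml:id="AR_001">...</seg>
    <listRelation>
      <relation active="#AR_001" ref="saws:isCloseTranslationOf"
                passive="#GV132874897"/>
    </listRelation>
  </body></text>
</TEI>
"""


def extract_triples(tei_xml):
    """Return (subject, predicate, object) tuples, one per complete relation."""
    root = ET.fromstring(tei_xml.strip())
    triples = []
    for rel in root.iter("{%s}relation" % TEI_NS):
        subject = rel.get("active", "").lstrip("#")
        predicate = rel.get("ref", "")
        obj = rel.get("passive", "").lstrip("#")
        if subject and predicate and obj:
            triples.append((subject, predicate, obj))
    return triples


def to_annotation(triple):
    """Recast a triple as subject/body/target, following the description above."""
    subject, predicate, obj = triple
    return {"subject": subject, "body": predicate, "target": obj}


triples = extract_triples(SAMPLE_TEI)
annotations = [to_annotation(t) for t in triples]
```

Relations that lack any of the three attributes are skipped rather than guessed at, so incomplete encoding never produces a partial triple.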
OAC-RDF mappings are more complex, but more meaningful. Once our basic mappings are in
place, we can spin off (or at least establish the framework for) more complex expressions.
Relationships can build on relationships, attaching creators (with foaf tags) to annotations,
which tie bodies of text (further identified by their character encoding) to the target being
described. There is no real depth limit. The data is all there to be explored, and the
framework exists to add many layers of metadata.
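One way to picture this layering is to treat each annotation as a node in its own right, so that it can carry a creator and can itself be the target of a further annotation. The sketch below is an assumption-laden illustration, not the Shared Canvas or SAWS data model: the prefixed property names are written as plain strings, and the scholars, the blank-node naming scheme and the body ex:disputed are hypothetical.

```python
import itertools

# Counter used to mint hypothetical blank-node ids for annotations.
_ids = itertools.count(1)


def annotate(body, target, creator_name, graph):
    """Append an annotation node with body, target and creator; return its id."""
    anno = "_:anno%d" % next(_ids)
    creator = "_:agent_%s" % creator_name.replace(" ", "_")
    graph.extend([
        (anno, "rdf:type", "oac:Annotation"),
        (anno, "oac:hasBody", body),
        (anno, "oac:hasTarget", target),
        (anno, "dcterms:creator", creator),
        (creator, "foaf:name", creator_name),
    ])
    return anno


def depth(anno, graph):
    """Count the annotation layers from anno down to the underlying text."""
    targets = {s: o for (s, p, o) in graph if p == "oac:hasTarget"}
    layers = 0
    while anno in targets:
        layers += 1
        anno = targets[anno]
    return layers


graph = []
# A relationship asserted about a segment of text ...
first = annotate("saws:isCloseTranslationOf", "GV132874897", "Scholar A", graph)
# ... and a second scholar's comment on that assertion.
second = annotate("ex:disputed", first, "Scholar B", graph)
```

Because the second annotation targets the first, depth(second, graph) is 2; further layers can be added in the same way without changing the structure.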
Islandora is an open source project that allows users to manage a Fedora repository
through PHP, using a Drupal front end. Fedora repositories are particularly adept at
maintaining and versioning the metadata that accompanies scholarly objects. The Digital
Humanities project is sponsored by EMiC to develop a suite of applications for the management
and critical analysis of Canadian modernism. One of the authors of this paper is the lead
programmer in both these projects, so will be able to incorporate these transformations into
the workflow to expose the data publicly. Of particular interest to our team is the ability to
extract data from the TEI stream to build and maintain authority lists.
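As a rough illustration of building an authority list from TEI, the sketch below collects the name forms found in persName and placeName elements and groups them under the URI given in each element's ref attribute. The sample fragment is hypothetical, and it is an assumption of this sketch that names are encoded in this way; it is not a description of the Islandora or EMiC workflow.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical TEI fragment used only to exercise the function below.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<p><persName ref="http://dbpedia.org/resource/Aristotle">Aristotle</persName> ...
<persName ref="http://dbpedia.org/resource/Aristotle">Aristoteles</persName> ...
<placeName ref="http://pleiades.stoa.org/places/638753">Aphrodisias</placeName></p>
</body></text></TEI>"""


def build_authority_list(tei_xml, tags=("persName", "placeName")):
    """Map each referenced URI to the sorted name forms used for it in the text."""
    root = ET.fromstring(tei_xml)
    authority = defaultdict(set)
    for tag in tags:
        for element in root.iter(TEI + tag):
            ref = element.get("ref")
            if ref:  # names without a ref cannot be placed in the list
                authority[ref].add("".join(element.itertext()).strip())
    return {ref: sorted(forms) for ref, forms in authority.items()}


authority_list = build_authority_list(SAMPLE)
```

Re-running the function over an updated TEI file regenerates the list, which is one simple way of keeping it in step with the encoded texts.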
We therefore have several possible avenues of work to explore in this area. Future
development will both require, and foster, collaboration amongst those who are pursuing the
question of what can be gained from the enhancement of TEI-encoded documents. It is envisaged
that the outcomes of this research will be applicable across a wide variety of texts, and it
is hoped that this paper will stimulate interest in new areas of future research into
combining different types of markup.
This work partly results from collaborative development between two of the paper authors
initiated at the Interedition 9th bootcamp, Leuven, Belgium, 2012, funded through COST action
IS0704. The SAWS project is funded by HERA as project 09-HERA-JRP-CD-FP-152 and we acknowledge
the benefits of this fruitful collaboration with our project partners. In preparing the final
version of this paper we were assisted by the feedback from several anonymous reviewers.
W. Caxton, The Dictes and Wise Sayings of the Philosophers (originally published London, 1477), reprinted 1877 (Elliot Stock, London)
A. Dekhtyar and I. E. Iacob. A framework for management of concurrent XML markup.
Data & Knowledge Engineering 52(2):185-208, 2005.
M. Doerr, “The CIDOC CRM - an Ontological Approach to Semantic Interoperability of
Metadata”, AI Magazine, Vol. 24, No. 3 (2003)
M. Doerr, and P. LeBoeuf, “Modelling Intellectual Processes: The FRBR – CRM
Harmonization” Digital Libraries: Research and Development, Vol. 4877, pp. 114-123. Springer
Ø. Eide, A. Felicetti, C. Ore, A. D'Andrea, and J. Holmen. Encoding Cultural
Heritage Information for the Semantic Web. In EPOCH Conference on Open Digital Cultural
Heritage Systems, Rome, Italy, 2008.
Hedges, Mark; Jordanous, Anna; Dunn, Stuart; Roueche, Charlotte; Kuster, Marc W.;
Selig, Thomas; Bittorf, Michael; Artes, Waldemar. "New models for collaborative textual
scholarship". Proceedings of the 6th IEEE International Conference
on Digital Ecosystems Technologies (DEST), Campione d’Italia, Italy.
H. V. Jagadish, L. V. S. Lakshmanan, M. Scannapieco, D. Srivastava, and N.
Wiwatwattana. Colorful XML: One Hierarchy Isn't Enough. In Proceedings of ACM SIGMOD International Conference on
Management of Data, volume 1, pages 251-262. ACM Press, 2004. doi:10.1145/1007568.1007598.
M. O. Jewell. Semantic Screenplays: Preparing TEI for Linked Data. In Proceedings
of Digital Humanities, London, UK, 2010.
A. Jordanous, K. F. Lawrence, M. Hedges, and C. Tupman. Exploring
manuscripts: sharing ancient wisdoms across the semantic web. In Proceedings of the 2nd International Conference on Web Intelligence, Mining and
Semantics (WIMS '12), Craiova, Romania. 2012.
K. F. Lawrence. Wherefore Art Thou? - Crowdsourcing Linked Data from Shakespeare to
Dr Who. In Proceedings of Web Science, Koblenz, Germany, 2011.
Christian-Emil Ore and Øyvind Eide. TEI and cultural heritage ontologies: Exchange
of information? Literary and Linguistic Computing 24(2): 161-172, 2009. doi:10.1093/llc/fqp010.
S. Peroni and F. Vitali. Annotations with EARMARK for arbitrary, overlapping and
out-of order markup. In Proceedings of the 9th ACM symposium on Document engineering, pages
171-180, Munich, Germany, 2009. doi:10.1145/1600193.1600232.
E. Pierazzo. A rationale of digital documentary editions. Literary and Linguistic
Computing, 26(4):463-477, 2011. doi:10.1093/llc/fqr033.
P. Portier, N. Chatti, S. Calabretto, E. Egyed-Zsigmond, and J. Pinon. Modeling,
encoding and querying multi-structured documents. Information Processing & Management.
M. Richard, “Florilèges grecs”, Dictionnaire de
Spiritualité V (1962), cols. 475-512
F. Rodríguez Adrados, Greek wisdom literature and the Middle
Ages: the lost Greek models and their Arabic and Castilian Translations (2001),
English translation by Joyce Greer (2009), pp. 91-97 on Greek models; D. Gutas, “Classical
Arabic Wisdom Literature: Nature and Scope”, Journal of the American
Oriental Society, Vol. 101, No. 1, Oriental Wisdom (Jan. -Mar., 1981), pp.
Solomon, J. (ed)., Accessing antiquity: The computerization of classical studies. Tucson: University of Arizona Press. 1993.
Sanderson, R. Albritton, B. Schwemmer, R. Van de Sompel, H. "SharedCanvas: A
Collaborative Model for Medieval Manuscript Layout Dissemination". Proceedings of the 11th ACM/IEEE
Joint Conference on Digital Libraries, Ottawa, Canada, June 2011.
B. Tillett, “What is FRBR? A Conceptual Model for the Bibliographic Universe”,
Library of Congress Cataloging Distribution Service, Library of Congress, Vol. 25, pp.1-8
G. Tummarello, C. Morbidoni, and E. Pierazzo. Toward textual encoding based on RDF.
In Proceedings of the 9th International Conference on Electronic Publishing (ELPUB 2005), Kath.
Univ. Leuven, June, pages 57-63. 2005.
Tupman, Charlotte; Hedges, Mark; Jordanous, Anna; Lawrence, Faith; Roueche, Charlotte;
Wakelnig, Elvira; Dunn, Stuart. Sharing Ancient Wisdoms: developing structures for
tracking cultural dynamics by linking moral and philosophical anthologies with their
source and recipient texts. In Proceedings of Digital
Humanities (DH2012), Hamburg, Germany. 2012.