The Component Metadata Infrastructure is developed in the context of the CLARIN project
(Váradi et al. 2008). CLARIN aims at building an integrated and interoperable
research infrastructure for language resources. The goal is to provide a stable, persistent
and accessible infrastructure for the eHumanities. One important aspect of CLARIN is to enable
easy sharing of language resources. This will allow researchers to use existing resources as a
basis for their work, e.g. by optimizing their existing or new tools, by building derivative
resources, or by exposing their resources to a broader audience. Therefore, to make this
infrastructure more usable, resources need to be easily accessible and, in particular, easily
findable. The most common approach towards achieving this is to provide descriptive metadata
about these resources and to use this information to find resources of interest for a particular
researcher.
Part of this context is also an already large installed base of metadata descriptions
that use fixed metadata schemas such as IMDI and OLAC (Simons et al. 2008).
Although the quality of this metadata is sometimes questionable, it would be unacceptable to put
a new framework into place that would lock out these existing metadata resources.
Since CLARIN is a rather large and diverse project, different project members have different
opinions on how to adequately model the metadata for their types of resources. For many
existing resources, extensive metadata descriptions are already available. It seems
naïve to assume that agreement on a common metadata schema can be achieved for a large-scale
project like CLARIN. Any such agreement would most likely settle on the least common
denominator, e.g. Dublin Core (Dublin Core, Baker 1998), and lose much of the expressive power
that is used in existing metadata. Using a “pivot” schema would have the same effect: both
approaches would result in information loss.
CLARIN tries to solve this problem with the Component Metadata Infrastructure (CMDI), which is
essentially a framework to accommodate different XML-based metadata formats. Supported by
various tools, CMDI provides a framework and workflows for creating metadata formats and
metadata descriptions, as well as a semantic foundation for processing metadata descriptions.
Framework overview
CMDI is a framework (see Figure 1) to build component-based
metadata descriptions. A metadata component is basically a collection of atomic metadata
fields or data categories (DatCats) and describes a specific aspect or dimension of a
resource, e.g. the title of a document, the creator or the native language of a subject in a
video recording. Components can have a recursive structure, i.e. in addition to atomic
fields, the components can also contain other components. Thus, components serve as small
building blocks or reusable templates for a specific aspect of a resource. Together with a
header, these components are combined into metadata profiles, each of which can be used as a
schema for metadata instances. Both components and profiles can (and should) be stored in a
component registry, which is a directory of components to be reused in different contexts.
Users can either reuse existing profiles for their metadata descriptions or create new
profiles by reusing existing components or creating new ones, either manually or with a
specialized component editor. Various profiles already exist in the component registry, e.g.
for IMDI, OLAC, Dublin Core or the TEI header.
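The relation between atomic fields, recursive components, profiles and the registry can be illustrated with a minimal Python sketch. All class, field and component names below are hypothetical and chosen for illustration only; the sketch models the concepts described above in memory and does not reflect the actual CMDI implementation or its web service.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AtomicField:
    """An atomic metadata field (illustrative name, not a CMDI type)."""
    name: str


@dataclass
class Component:
    """A reusable building block: atomic fields plus nested components."""
    name: str
    elements: List[AtomicField] = field(default_factory=list)
    children: List["Component"] = field(default_factory=list)

    def field_paths(self, prefix: str = "") -> List[str]:
        """Return slash-separated paths of all atomic fields, recursively."""
        here = f"{prefix}/{self.name}" if prefix else self.name
        paths = [f"{here}/{e.name}" for e in self.elements]
        for child in self.children:
            paths.extend(child.field_paths(here))
        return paths


@dataclass
class Profile:
    """A header combined with components; acts as a schema for instances."""
    name: str
    header: Dict[str, str]
    components: List[Component]

    def field_paths(self) -> List[str]:
        return [p for c in self.components for p in c.field_paths()]


# Toy in-memory registry (assumption: keyed by component name).
registry: Dict[str, Component] = {}


def register(component: Component) -> Component:
    registry[component.name] = component
    return component


language = register(Component("Language", elements=[AtomicField("languageName")]))
title = register(Component("Title", elements=[AtomicField("title")]))
# "Creator" reuses the registered "Language" component as a nested component.
creator = register(
    Component("Creator", elements=[AtomicField("name")], children=[language])
)

# A new profile is assembled from components already in the registry.
profile = Profile(
    name="SimpleDocument",
    header={"profileName": "SimpleDocument"},
    components=[registry["Title"], registry["Creator"]],
)
all_paths = profile.field_paths()
print(all_paths)
```

The nested "Language" component shows the recursive structure: its field appears under the path of the component that reuses it.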
Storing schemas in a centralized infrastructure is common practice for metadata schemas,
though this adds a sustainability problem, inasmuch as the repository
of schemas needs to be constantly available.
Although this could be seen as problematic in principle, for pragmatic reasons it seems more appropriate
than using local copies with modifications, because it ensures that tools can operate on the centrally
stored files. A metadata archive could instantiate a local store of schema copies, but this would
require adjusting the schema references, which could cause additional interoperability
problems. Hence the use of a central infrastructure is probably the safest solution and, in the context of an
infrastructure of data and services, the one most likely to be sustainable. This is also consistent with the approaches
described by Rehm et al. (2011).
Each metadata field is linked to exactly one data category in a data category registry (DCR)
using a persistent identifier. The DCR indicates how the content of the field in a metadata
description should be interpreted. If the same data category is used in various metadata
schemas, the reference to the DCR will still be the same. The reference is also independent of the
concrete name of the XML element, including variation in naming, case and orthography.
For example, the field title in titleStmt in the TEI header is linked
to the same category as the title in Dublin Core.
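A minimal Python sketch can make this linking concrete. The identifiers below are hypothetical placeholders under an example.org address, not real data category PIDs, and the lookup table is an assumption made for illustration rather than a description of how any CMDI tool stores these links.

```python
from typing import Dict, Optional, Tuple

# Hypothetical PIDs; real identifiers are issued by the data category registry.
TITLE_PID = "http://example.org/dcr/title"
CREATOR_PID = "http://example.org/dcr/creator"

# (schema, element path) -> PID of the linked data category (illustrative).
concept_links: Dict[Tuple[str, str], str] = {
    ("TEI", "titleStmt/title"): TITLE_PID,
    ("DublinCore", "title"): TITLE_PID,
    ("DublinCore", "creator"): CREATOR_PID,
}


def same_category(a: Tuple[str, str], b: Tuple[str, str]) -> bool:
    """Two fields are equivalent if both link to the same data category."""
    pid_a: Optional[str] = concept_links.get(a)
    pid_b: Optional[str] = concept_links.get(b)
    return pid_a is not None and pid_a == pid_b


tei_title = ("TEI", "titleStmt/title")
dc_title = ("DublinCore", "title")
dc_creator = ("DublinCore", "creator")
print(same_category(tei_title, dc_title))    # True
print(same_category(tei_title, dc_creator))  # False
```

The comparison never looks at the element names themselves, which is why differences in naming between schemas do not matter.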
In the CLARIN project the preferred concept registry is the ISO data category registry
(ISO-DCR). This registry is an
implementation of the ISO 12620 standard model for data categories and offers ample
functionality for the needs of the CMDI framework. For the CMDI framework it makes no
essential difference if another registry, such as for instance the
DCMI, is used. However, the ISO-DCR is tightly integrated with other CMDI software
components such as the component editor, which allows efficient searching for suitable data
categories and even combining metadata modelling with defining new data categories.
The component registry contains CMDI components and profiles. If a metadata creator
needs to describe a type of resource that is new to him, he can browse through the available
profiles and see if there is one that suits his needs. If there is no suitable profile
available, he can create a new one based on existing components, or he can create new
components and combine these with existing ones into a new profile.
When creating metadata elements in new metadata components users can browse and search
for entries in the ISO-DCR to find a concept that matches the semantics of the metadata
element. The identifier of the concept is then automatically inserted in the metadata
component specification.
To create metadata descriptions users load profiles into the metadata editor, which then
can automatically generate forms based on the metadata profile. The user then fills out these forms
to enter the data. Of course users may also use an XML editor to create metadata descriptions
directly and use the provided XML schemas (see below) to validate the XML documents.
The resulting metadata records are offered for harvesting via OAI-PMH
and gathered in one or more central repositories.
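As an illustration of the harvesting step, the following Python sketch constructs OAI-PMH ListRecords request URLs without sending them. The verb, metadataPrefix and resumptionToken parameters are defined by the OAI-PMH protocol; the endpoint address and the prefix value "cmdi" are assumptions made for this example and depend on the individual data provider.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(
    base_url: str,
    metadata_prefix: str,
    resumption_token: Optional[str] = None,
) -> str:
    """Build an OAI-PMH ListRecords request URL.

    In OAI-PMH, resumptionToken is an exclusive argument: a follow-up
    request carries only the verb and the token.
    """
    if resumption_token is not None:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint and metadata prefix.
first_url = list_records_url("https://example.org/oai", "cmdi")
next_url = list_records_url("https://example.org/oai", "cmdi", "page2")
print(first_url)
print(next_url)
```

A harvester would fetch the first URL, store the returned records and repeat the request with each resumption token until the provider returns none.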
Multiple ways to exploit the collected metadata are foreseen, ranging from systems doing
simple keyword search to those using faceted browsing or structured search. In all of these,
semantic mapping using the ISO-DCR plays a crucial role. When a user specifies a metadata
query, the ISO-DCR allows this query to be expanded into a set of equivalent queries that
can retrieve metadata records in which a different terminology than that of the
original query is used. The identifiers of the terms in the query are used to find equivalent terms,
and these are then used to generate an additional query. For example, when a user queries for
titleStmt, an additional query is generated for title,
since titleStmt is linked to title via the ISO-DCR.
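The expansion step can be sketched in a few lines of Python. The mapping from element names to data category identifiers is a hypothetical stand-in for the links held in the ISO-DCR, and the identifiers are placeholders; the sketch only shows the principle, not the actual search machinery.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical links from element names to data category PIDs.
element_to_pid: Dict[str, str] = {
    "titleStmt": "http://example.org/dcr/title",
    "title": "http://example.org/dcr/title",
    "creator": "http://example.org/dcr/creator",
}

# Invert the mapping: PID -> all element names linked to it.
pid_to_elements: Dict[str, List[str]] = defaultdict(list)
for element, pid in element_to_pid.items():
    pid_to_elements[pid].append(element)


def expand_query(term: str) -> List[str]:
    """Return the original term followed by all terms sharing its category."""
    pid = element_to_pid.get(term)
    if pid is None:
        return [term]  # unknown term: nothing to expand
    equivalents = sorted(t for t in pid_to_elements[pid] if t != term)
    return [term] + equivalents


print(expand_query("titleStmt"))  # ['titleStmt', 'title']
print(expand_query("creator"))    # ['creator']
```

Each term in the returned list would then be used to generate one query against the collected metadata records.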
As mentioned before, a metadata component describes various aspects or dimensions of a resource.
Figure 2 shows a schematic representation of a very simple example
metadata component “Actor”.
It contains two atomic fields “firstName” and “lastName” and refers to another component “ActorLanguage”,
which contains a repeatable atomic field “actorLanguageName”.
An entity “Actor” therefore consists of a first name, a last name and a list of languages.
In CMDI, components are expressed in XML files. Figure 3 displays the “Actor”
component in the CMDI component XML specification tag-set. CMD_Component elements define new components,
CMD_Element elements define new atomic fields. The ConceptLink attribute is the most important aspect
in terms of interoperability, because it stores the link to a DCR or, more specifically, the PID of a data category.
Software interpreting the component definitions can use this concept link to draw further conclusions from the information,
such as establishing an equality relation between different fields in different metadata schemas, and use this,
e.g., for smart searching. The component descriptions are normally transformed to XML Schema using an XSLT transformation.
These XML Schemas are available from the component registry and can, e.g., be used in special metadata editors or plain XML
editors to aid the user in creating metadata records. Figure 3 shows an example instance
of an “Actor” component. In a complete CMDI metadata record, the component, together with one or more links to the
described resource, is wrapped with a header.
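As a rough illustration of such a component specification, the following Python sketch builds the “Actor” structure with the standard library's ElementTree. The element names CMD_Component and CMD_Element and the ConceptLink attribute are taken from the description above; the attributes name and CardinalityMax as used here, and all concept link values, are assumptions for illustration and should be checked against the actual specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical base address for data category PIDs.
DCR = "http://example.org/dcr/"


def add_element(parent: ET.Element, name: str, concept: str,
                repeatable: bool = False) -> ET.Element:
    """Add an atomic field with a concept link to the given component."""
    attrs = {"name": name, "ConceptLink": DCR + concept}
    if repeatable:
        # Assumed way of marking a repeatable field.
        attrs["CardinalityMax"] = "unbounded"
    return ET.SubElement(parent, "CMD_Element", attrs)


# Component "Actor" with two atomic fields ...
actor = ET.Element("CMD_Component", {"name": "Actor"})
add_element(actor, "firstName", "first-name")
add_element(actor, "lastName", "last-name")

# ... and a nested component with one repeatable atomic field.
actor_language = ET.SubElement(actor, "CMD_Component", {"name": "ActorLanguage"})
add_element(actor_language, "actorLanguageName", "language-name", repeatable=True)

element_names = [e.get("name") for e in actor.iter("CMD_Element")]
print(ET.tostring(actor, encoding="unicode"))
```

Because every CMD_Element carries its own ConceptLink, software reading the specification can relate each field to a data category without relying on the field name.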
Especially in connection with the standardization efforts mentioned in section “Conclusion”, TEI ODD
will be evaluated as an alternative apparatus for defining metadata components.
Other representation formats such as RDF, OWL and Topic Maps do not seem as appropriate as CMDI for the description of the metadata. It is obvious
that CMDI can become rather complex due to the recursive structure of component definitions, but the structures are at least assumed to be human readable and organized
like a human prose text about a resource.
In contrast, the RDF family does not require a linear order and may present RDF triples in arbitrary order.
Though CMDI documents can be rendered in RDF (and probably in OWL and Topic Maps), the structure of CMDI is more transparent and usable for human users. CMDI is also
not a form of knowledge representation in which the concepts of a resource are described; rather, it is intended to provide structured information about a resource for
human users.
Tools
Various tools exist for use with the Component Metadata Infrastructure, some
reused from other contexts, others developed explicitly in this context. Among them are
editors, registries and search applications, which are described briefly below.
ISOcat: the Data category registry for ISO TC 37
The data category registry ISOcat stores data categories and
implements ISO 12620:2009. It is a specialized concept registry, historically developed
for data categories used in terminology exchange. However, the concept proved so flexible and
useful that it was extended to further areas, including linguistic resource management
with all required metadata categories.
As a web-based registry for data categories and concepts, ISOcat can be extended with
additional data categories as required by users to cater for individual project
needs. Data categories can be defined privately or publicly, and submitted for
ISO standardization or not.
Each data category in ISOcat receives a Persistent Identifier (PID), which is used to
reference it. PIDs are especially suited for inclusion in the metadata and schemata of linguistic
resources to foster semantic interoperability. Some schema languages, e.g. TBX XCS and
TEI ODD, have built-in support for embedding these PIDs into the schema. More generic
schema languages such as Relax NG and W3C XML Schema do not, but by defining additional
attributes, schemas in these languages can easily be extended to include them.
The CMDI Component Registry
The Component Registry is also a web-based service, but it is currently not part of an ISO
standard. Within the Component Registry, users of CMDI can store their metadata components
and profiles. It not only allows storage but also offers editing functionality.
In the CMDI Component Registry each component is also assigned an identifier that is unique
in the context of the component registry, so that other components can integrate it.
Additionally, this component identifier can be used as a reference for profiles, that is,
an instance's document type declaration and namespaces can point to the component registry
for their XML Schema.
Arbil: The CMDI supporting metadata editor
A special challenge for any metadata framework is the creation of instances, which needs to be easy
and user friendly. As CMDI is highly adjustable and flexible, this poses additional
complications. The metadata editor Arbil is an XML editor that is aware of
CMDI structures and connects to the component registry to download the available (schematized)
CMDI profiles. Since there can be a large number of these, the user can limit the number of CMDI profiles that
are actually shown in the user interface.
Relation Registry
The CMDI Relation Registry (RR) is designed to address a limitation of the ISO-DCR by allowing
the metadata search user to create (temporary) simple relations between different data categories in the ISO-DCR.
The ISO-DCR can overcome “accidental” semantic overlap between different terms, i.e. cases where two metadata
developers used different terms but agree on the same definitions. The RR can be used by users searching
the metadata to overcome intentional semantic overlap, i.e. cases where the metadata modelers decided that two terms
actually mean different things, but the user decides that this difference is irrelevant for
him. He would specify the relation “Term1” == “Term2”, and the semantic mapping machinery
of the metadata search would expand every query with “Term1” with one that also uses “Term2”.
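A minimal Python sketch of this user-level expansion follows. It treats each relation as symmetric and considers only directly related terms; both choices are assumptions made for illustration, as is the representation of relations as simple pairs.

```python
from typing import Iterable, List, Set, Tuple

Relation = Tuple[str, str]


def expand_with_relations(term: str, relations: Iterable[Relation]) -> List[str]:
    """Return the query term plus all terms the user declared equal to it.

    Assumption: "Term1 == Term2" is read symmetrically, and only direct
    relations are followed (no transitive closure).
    """
    related: Set[str] = set()
    for left, right in relations:
        if left == term:
            related.add(right)
        elif right == term:
            related.add(left)
    related.discard(term)
    return [term] + sorted(related)


# Temporary relations specified by one user for one search session.
user_relations = [("Term1", "Term2")]
print(expand_with_relations("Term1", user_relations))  # ['Term1', 'Term2']
print(expand_with_relations("Term3", user_relations))  # ['Term3']
```

Since the relations belong to the user rather than to the registry, they can be discarded after the search without affecting the definitions of the data categories.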
Joint Metadata Repository
The joint metadata repository (JMDR) is the place where all the harvested CMDI metadata records are stored.
The harvesting method is the well-known OAI-PMH. There is not yet a registry where the
CMDI metadata providers are registered, but such a registry is under consideration.
There may be several such joint metadata repositories, each specializing in one type of metadata search service.
Considering the (expected) great variety of metadata schemas, it was thought advantageous to use a native XML
database to allow searching through the collected CMDI records.
Currently no semantic normalization is done when the records are stored in the JMDR; this allows a query
to retrieve only those records that actually use a profile-specific terminology.
Searching over structured CMDI data
Highly structured and rich metadata descriptions add value if the
search process is more elaborate, leading to more precise and faster results than a
full-text search, without lowering the recall. Two examples of such search interfaces are
the Virtual Language Observatory and the NaLiDa Faceted Search. Both harvest the CMDI
metadata from data providers, but they offer different functionality.
The Virtual Language Observatory started off with earlier metadata versions. It
presents a number of different facets, from which a user selects the data
categories of interest.
The NaLiDa faceted browser is slightly more elaborate, as it implements conditional
facets, i.e. additional facets appear based on earlier selections. For example, the facet
“corpus type” is irrelevant for non-corpora and hence is only shown if the resource type
“corpus” is selected. However, the NaLiDa faceted browser focuses on resources in a
national context.
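The idea of conditional facets can be sketched in Python as follows. The facet names “resource type” and “corpus type” are taken from the example above, while the “language” facet, the data structure and the function are hypothetical and do not describe the actual NaLiDa implementation.

```python
from typing import Dict, List, Optional, Tuple

# Facet name -> condition (facet, required value), or None if always shown.
FACETS: Dict[str, Optional[Tuple[str, str]]] = {
    "resource type": None,
    "language": None,  # hypothetical unconditional facet
    "corpus type": ("resource type", "corpus"),
}


def visible_facets(selection: Dict[str, str]) -> List[str]:
    """Return the facets to display for the user's current selections."""
    shown = []
    for name, condition in FACETS.items():
        if condition is None:
            shown.append(name)
        else:
            facet, required_value = condition
            if selection.get(facet) == required_value:
                shown.append(name)
    return shown


print(visible_facets({}))
print(visible_facets({"resource type": "corpus"}))
```

With no selection, only the unconditional facets are returned; once the resource type “corpus” is selected, “corpus type” is offered as an additional facet.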