How to cite this paper
Shifrin, Laurel. “Managing multiple vocabularies across a global enterprise.” Presented at International Symposium on Versioning XML Vocabularies and
Systems, Montréal, Canada, August 11, 2008. In Proceedings of the International Symposium on Versioning XML Vocabularies and
Systems. Balisage Series on Markup Technologies, vol. 2 (2008). https://doi.org/10.4242/BalisageVol2.Shifrin01.
International Symposium on Versioning XML Vocabularies and
August 11, 2008
Balisage Paper: Managing multiple vocabularies across a global enterprise
Manager, Data Architecture
Laurel Shifrin is the Manager of Data Architecture at LexisNexis, an international
company offering online information and solutions. She has been working with markup
languages since 1993 and presented at Markup Technologies, a predecessor to Balisage.
is also the co-author (with Michael Atkinson) of Flickipedia:
Perfect Films for Every Occasion, Holiday, Mood, Ordeal and Whim(Chicago
Copyright © 2008 LexisNexis
Organizations share vocabularies across disparate user groups and data to maximize
value of their investment in XML, and, without question, those XML vocabularies need
change as the businesses evolve and expand. Managing change to DTDs and schemas is
enough with a small group of co-located users working on the same content types. What
happens when you have hundreds of XML consumers spread across the globe and they have
completely different requirements, systems, and content? Get a view of the challenges
implementing change management and vocabulary versioning on a very large scale.
Table of Contents
- Background of LexisNexis and the migration to XML
- Shared Vocabularies
- XML Maintenance Protocol
- XML Maintenance System
- XML Maintenance Process
- XML Versioning Policy
- Benefits and Challenges
One of the primary motivators for executive sponsorship of an XML implementation is
decreased cost and increased efficiency. One of the ways to maximize this effectiveness
create a centrally managed XML vocabulary that can be shared across the entire organization.
In a small company where technical developers and users alike work in a single location,
this can be a relatively straightforward prospect: only a small number of people are
and they can communicate frequently and easily. Likewise, if the content is specialized
a single discipline, the vocabulary itself may be semantically rich but otherwise
uncomplicated. However, even in a small organization with a fairly lean vocabulary,
the XML definition is important: it will certainly continue to expand and ad hoc change
management can result in unexpected – and unpleasant – outcomes. In a large organization
a great diversity of data types, the potential for problems associated with loose
is increased. Governance of markup is paramount, not just because of the downstream
applications, stylesheets and other artifacts expecting certain data constructs but
because of the cost in changing your data: once you create content in XML and store
odds are that it’s going to stay that way for a very long time and that changes will
incremental. As the amount of data grows, so does the importance of vocabulary control.
Background of LexisNexis and the migration to XML
LexisNexis is a leading international provider of business information solutions to
professionals in a variety of areas, including: legal, corporate, government, law
tax, accounting, academic, and risk and compliance assessment. In more than 100 countries,
across six continents, LexisNexis provides customers with access to five billion searchable
documents from more than 40,000 legal, news and business sources. We have over 13,000
employees located in business units around the globe.
Since the majority of these business units have been acquired as existing entities,
acquisition they typically each had their own editorial systems, data formats and
products, much of these reflective of the different legal systems in place around
Some of the acquired business units already had XML-based Content Management Systems,
SGML, some were using either proprietary markup systems or Microsoft Word documents
varying degree of semantics – sometimes not at all but occasionally captured using
In 2004, we launched a common delivery platform that included news, financial and
data from seven countries across three continents. This shared platform was designed
consolidate the technology for delivering data to our customers, thus decreasing
establish a publishing pipeline that could enable the addition of new countries in
relatively clear-cut manner;
create a standardized user interface that could be recognized globally; and
make it possible to develop new products and services across different markets and
countries more easily.
At the same time, the U.S. business unit continued to enhance the existing flagship
American products, lexis.com and nexis.com, and to build our XML-based editorial systems.
editorial staff are similarly spread across the U.S., working in different time zones
create different products consisting of data that is sometimes shared but often unique
nature and geared toward a specific market; for example, an editor creating solutions
legal practitioners who specialize in Intellectual Property works with a very dissimilar
set than an editor abstracting and indexing International Statistics for an academic
All of these projects were staffed with representatives from a host of different groups
multiple countries (engineering, editorial, user experience, etc. ), sometimes leveraging
offshore resources as well. The schedules, time-line and project management were also
toward the specific goals of the individual efforts, with a limited amount of overlap
consideration of the other projects.
Underlying all of these considerable efforts was the XML definition, managed by a
centralized Data Architecture team, based in the U.S. Our team is dispersed among
locations and two time zones and we maintain over 100 DTDs and schemas.
At the beginning of our markup development, we were faced with two options. The first
option was building standalone DTDs for each content type -- and, for legal statutory
potentially building standalone DTDs for each content type within each
jurisdiction, as the structure and standards for a given jurisdiction can be
quite dissimilar from another and we’re frequently under a contractual obligation
prevents us from standardizing across jurisdictions. The second option was to build
from modules and share as much common vocabulary as possible.
There were several benefits to creating standalone DTDs: smaller scope makes for swifter,
less complicated development; fewer requirements for the markup to support; discrete
groups would mean less controversy and disagreement. However, the benefits of creating
modularized vocabulary would prevent redundancy in the markup, duplicated effort,
believed, would prove more efficient in the long run.
This was the genesis of what we have today: XML vocabularies that are shared across
multiple editorial systems, discrete applications, and editorial groups, and also
definition for data delivery from the business units around the world. The Data Architecture
team uses a toolbox of XML modules (body text elements, metadata elements, content-specific
elements, etc.) from which unique DTDs can be created, so that whole DTDs do not need
written from scratch for each new content type that is added to our voluminous data
Additionally, the XML definition for a given semantic component may need to be transformed
as it moves through our publishing workflow; for example, metadata necessary for Editorial
tools may be dropped on external delivery. To ensure data quality as it moves through
publishing pipeline, we maintain a DTD or schema for data validation at specific points
process. Thus, a certain element may have a very robust set of attributes at the beginning
the editorial cycle but may have a much leaner group of attributes farther down the
path – or,
conversely, may have additional metadata added by other post-editorial processes.
multiple configurations of a given semantic are defined using parameter entities with
With such a great variety and quantity of data moving through and being added to our
pipeline at any given time, it is expected that our DTDs will change – they need to
support improvements and enhancements to our products, services and systems and to
expand our content offerings. However, with so many disparate user and developer groups
the globe relying on established XML vocabularies, managing that change must be done
and transparently. Very early in our migration to XML, it became evident that the
method of requesting and communicating changes over cubicle walls or via email was
to be sufficient – and that was back when the Data Architecture team itself consisted
people sitting within 10 feet of one another and the majority of the interested parties
stone’s throw away. We determined that a formal maintenance protocol was needed, which
out to be prescient, considering the growth of our own team and the even greater expansion
the community depending on our XML. It was essential to institute a centralized system,
just to control the vocabularies but also the data using the vocabularies and to establish
common understanding and means of expression.
The maintenance protocol we introduced had three features:
an application for requesting and storing information about each DTD/schema
a publicized change control process; and
a versioning policy for the vocabulary itself.
XML Maintenance Protocol
XML Maintenance System
Our maintenance system is an intranet application accessible by any employee. Each
change request must contain enough information so that all of our XML consumers can
the impact of the change on their own realm of responsibility, including:
the reason for the change;
affected modules or DTDs/schemas;
existing and proposed content models;
change benefits and risks;
a document sample that has been validated to a mocked-up DTD/schema.
All change requests are assigned a unique ID and stored in an SQL database. The system
allows users to query various fields, or generate lists of changes by platform and/or
status. When a change request is submitted, the maintenance system alerts the appropriate
Data Architect, who is then responsible for shepherding the change through the maintenance
process. Generally, the users provide some basic information about the problem they’re
trying to solve and then the Data Architect follows the usual steps in designing markup:
reviewing requirements, conducting user interviews, modeling the data, assessing downstream
impacts, tagging or converting sample documents, etc. A related tool helps us research
existing vocabulary so that we’re as consistent and efficient as possible – a database
elements and attributes, which we search using XQuery.
XML Maintenance Process
Each change request is first reviewed in a Data Architecture-only round table
discussion; because the modules are shared, each Data Architect may find the new markup
his DTD and so needs to understand and agree to the change. This has the additional
of team members learning more about each other’s content expertise and of course provides
opportunity for newer staff members to learn about data modeling from those with more
experience. At this point, only our team can view the change request on the Intranet
this reduces churn and controversy within the larger community.
Once the change passes muster with our group, it is then viewable to everyone on our
Intranet site and distributed to representatives throughout the entire LexisNexis
organization. We typically allow 10 business days for assessing impacts, although
expedite the review period when deadlines demand.
If an objection or negative impact is raised, we will reconsider the change based
feedback and we sometimes must negotiate a compromise between the requester and the
objector. It’s entirely possible that a change that benefits one group will have a
impact on another group and the overall business benefits and risks must be evaluated
determine the right course of action. A simple DTD change may seem trivial – what
possibly cause damage? – but if we were to make a backwards-incompatible change without
working out the details of how to remediate the data, we could block up our publishing
pipeline; invalid data will get bounced back. This is not so bad if it’s a few hundred
documents but if it’s a few hundred thousand, this could wreak havoc. Additionally,
very aggressive deadlines for updating our documents, particularly for caselaw and
we were to introduce invalid markup that then resulted in even a short delay of delivering
our documents online, our customers would transfer to our competitors pretty quickly.
Once a change is approved, the new markup is implemented in the module with
documentation specifically referencing the change itself (change request ID, date,
Architect making the change, etc), so that we can view our DTDs in the future and
to scratch our heads wondering why a particular change was made two years ago. Then
are normalized and validated using a series of scripts and tested using a set of sample
documents. The number of DTDs being updated at a given time depends on several factors:
which system, platform or group needed the change, how many changes are in the queue
to be implemented, and whether or not we need to meet a specific project deadline.
Documentation is embedded within the DTDs and schemas so that it can be kept up-to-date
as changes are implemented; we autogenerate several different types of documentation
users, including a summary of what changes are included in each DTD update.
XML Versioning Policy
Revision control and storage of our source modules is handled by Rational ClearCase.
Since our customers only see normalized DTD/Schemas, we adopted a versioning policy
those deliverables. We use a three node, numerical versioning syntax, expressing three
degrees of revision: major, minor, and micro.
Major changes are defined as changes that would either render documents generated
the previous version invalid or alters a previous version’s existing default DOM expansion (such as changing the default or fixed value of
an attribute). When a change is determined to be major, the major revision node is
incremented by 1, and the minor and micro nodes are moved to 0.
Minor changes are defined as changes that will not
render documents generated under the previous version invalid, such as making previously
required nodes optional. When a change is determined to be minor, the minor revision
incremented by 1, and the micro node is moved to 0. The major node is left unchanged.
Micro changes are defined as changes that have no syntactic or semantic impact to
document (e.g., enhancements to documentation stored within comments in the DTD).
change is determined to be micro, the micro revision node is incremented by 1, and
and minor nodes are left unchanged.
We typically do not maintain multiple versions of the same DTD or schema at once,
we have agreed to this for short periods of time and under the caveat that if we need
maintain multiple major versions, we will not accept any backwards incompatible change
requests to the lower version – if a group needs a backwards incompatible change,
they’re going to need to upgrade to the later version.
Benefits and Challenges
This process and system have provided several benefits. The first is just simple
record-keeping: there’s no wondering about why a particular change was made three
because we can easily go back to the original details. Central control of the XML
prevents the practice of keeping local copies that rapidly fall out of sync as each
makes changes without regard to the consequences for other teams. This was the situation
one of the acquired business units and when we had to fold their DTDs into our catalog,
a labor-intensive, frustrating task to try to find all the copies of their DTDs and
consolidate them to a single canonical version of each.
This record-keeping provides us with an easy way to derive metrics about our own
efficiency in managing change. We can determine which DTDs or modules require the
modifications, which could be an indication that something needs a closer, more holistic
analysis. As the manager of the team, I can also draw metrics on the amount of time
individual team members to process a change, which could indicate any number of issues:
unbalanced workloads, lack of adequate training and skills, etc.
Central control of the XML also helps us keep redundancy in check – we’re less likely
create a new element for a very similar semantic because the process for doing this
transparent. When our team first began formal tracking of our DTD changes and introduction
new vocabulary, we were still rather siloed in our approach, which resulted in individual
Architects creating separate elements or attributes for the same or very similar semantic.
Adopting a maintenance process ensured that no XML can be introduced or changed without
the rest of the team aware. Our twice weekly discussions and the use of our Element
for researching existing vocabulary were an additional resolution to this problem.
Another benefit is that opening up the process to the development and user groups
increased their understanding of XML, which can be inconsistent across such a broad
And we've added to the information coming to us, not just
from us: by notifying the community of a proposed change rather than just distributing
updated DTD, we’ve made allowances for the situations when feedback is critical. We
many applications downstream of the XML definition it would be impossible for the
Architecture team to understand the minute details of how the XML is being interrogated
used everywhere (though we do know quite a bit of it), so we rely on the application
content experts for information in this regard.
The transparency of our process has its downside: users have become so accustomed
seeing XML syntax that they are prone to requesting specific markup instead of stating
requirements and letting the Data Architects design the markup. Then if we propose
alternate design, this sometimes causes frustration and contention all around. Also,
given semantic can be modeled in multiple ways, in addition to impact assessments,
we have to
be willing to accept constructive criticism about our markup design and to be flexible
to alter it.
Backwards-incompatibility can also be challenging. In the earliest stages of our initial
migration to markup (which was SGML), the number of DTD consumers could fit comfortably
conference room and the scope of the data conversion was intentionally limited. At
making backwards-incompatible changes to the DTD wasn’t a big problem. Also, if you’re
enough to have documents that are published once and don’t need updating,
backwards-incompatibility isn’t an issue. However, many of our documents do need to
continually and now that we’re several years into such a large-scale migration, making
change that would invalidate data is not a trivial matter. And with a shared vocabulary,
could have an enormous ripple effect into document types owned by groups completely
with the data that needed the backwards-incompatible change. For instance, a financial
developer might not understand why we need to push a change to a shared module to
problem for legislative products. Certainly the freedom to make these kinds of
backwards-incompatible changes easily was something we sacrificed when we went down
Another challenge is that sharing XML to this degree necessitates a relatively loose
of rules, particularly for online document delivery. We tend to move the data standards
conformance checks farther upstream: our Editorial DTDs and schemas are far stricter
delivery DTDs and we frequently employ other technologies such as Schematron to ensure
standards are met.
The protocol we follow for managing our XML vocabularies has been in place now since
albeit in a much more rigorous form now than it was at that time. The engineering
used various forms of change management for their software development -- both home-grown
off-the-shelf change control products -- but our XML management protocol (system and
has withstood whatever else has come and gone. In fact, several other ancillary vocabularies
have been put under our auspices because our system is so respected. For an enterprise
size to share XML across content types and continents, it’s critical to keep the change
management transparent and well-publicized. Even smaller organizations can benefit
following an established process, if only to eliminate the questions of why a predecessor
a particular modification. This system enables us to expand our XML vocabularies at
aggressive rate our internal users and external customers expect.
It is difficult to imagine how engineering costs, time-to-market, and product and
integration could be done on such a large scale if DTDs and schemas were spread throughout
organization under the control of local content and application development groups
a public process. In the loosely-coupled systems designs typified by Service Oriented
Architectures of today’s large scale systems, centrally managed schemas provide the
specifications for the content that systems trade with one another as well as something
as important – a common vocabulary that can be used by distant, possibly even unrelated,
product and application development groups to explain and express their needs. This
communication makes it possible for us to deliver a very broad set of products, services
content types but it also provides a framework for innovation through shared understanding
across the organization.