Managing multiple vocabularies across a global enterprise

Laurel Shifrin

Manager, Data Architecture 

LexisNexis

Copyright © 2008 LexisNexis

expand Abstract

expand Laurel Shifrin

Balisage logo

Proceedings

expand How to cite this paper

Managing multiple vocabularies across a global enterprise

International Symposium on Versioning XML Vocabularies and Systems
August 11, 2008

Introduction

One of the primary motivators for executive sponsorship of an XML implementation is decreased cost and increased efficiency. One of the ways to maximize this effectiveness is to create a centrally managed XML vocabulary that can be shared across the entire organization.

In a small company where technical developers and users alike work in a single location, this can be a relatively straightforward prospect: only a small number of people are involved and they can communicate frequently and easily. Likewise, if the content is specialized toward a single discipline, the vocabulary itself may be semantically rich but otherwise uncomplicated. However, even in a small organization with a fairly lean vocabulary, managing the XML definition is important: it will certainly continue to expand and ad hoc change management can result in unexpected – and unpleasant – outcomes. In a large organization with a great diversity of data types, the potential for problems associated with loose management is increased. Governance of markup is paramount, not just because of the downstream applications, stylesheets and other artifacts expecting certain data constructs but also because of the cost in changing your data: once you create content in XML and store it, the odds are that it’s going to stay that way for a very long time and that changes will be incremental. As the amount of data grows, so does the importance of vocabulary control.

Background of LexisNexis and the migration to XML

LexisNexis is a leading international provider of business information solutions to professionals in a variety of areas, including: legal, corporate, government, law enforcement, tax, accounting, academic, and risk and compliance assessment. In more than 100 countries, across six continents, LexisNexis provides customers with access to five billion searchable documents from more than 40,000 legal, news and business sources. We have over 13,000 employees located in business units around the globe.

Since the majority of these business units have been acquired as existing entities, upon acquisition they typically each had their own editorial systems, data formats and online products, much of these reflective of the different legal systems in place around the world. Some of the acquired business units already had XML-based Content Management Systems, some had SGML, some were using either proprietary markup systems or Microsoft Word documents with a varying degree of semantics – sometimes not at all but occasionally captured using MS Word styles.

In 2004, we launched a common delivery platform that included news, financial and legal data from seven countries across three continents. This shared platform was designed to meet several goals:

  • consolidate the technology for delivering data to our customers, thus decreasing cost;

  • establish a publishing pipeline that could enable the addition of new countries in a relatively clear-cut manner;

  • create a standardized user interface that could be recognized globally; and

  • make it possible to develop new products and services across different markets and countries more easily.

At the same time, the U.S. business unit continued to enhance the existing flagship American products, lexis.com and nexis.com, and to build our XML-based editorial systems. Our editorial staff are similarly spread across the U.S., working in different time zones to create different products consisting of data that is sometimes shared but often unique in nature and geared toward a specific market; for example, an editor creating solutions for legal practitioners who specialize in Intellectual Property works with a very dissimilar data set than an editor abstracting and indexing International Statistics for an academic audience.

All of these projects were staffed with representatives from a host of different groups in multiple countries (engineering, editorial, user experience, etc. ), sometimes leveraging offshore resources as well. The schedules, time-line and project management were also geared toward the specific goals of the individual efforts, with a limited amount of overlap and consideration of the other projects.

Shared Vocabularies

Underlying all of these considerable efforts was the XML definition, managed by a single, centralized Data Architecture team, based in the U.S. Our team is dispersed among eight locations and two time zones and we maintain over 100 DTDs and schemas.

At the beginning of our markup development, we were faced with two options. The first option was building standalone DTDs for each content type -- and, for legal statutory data, potentially building standalone DTDs for each content type within each jurisdiction, as the structure and standards for a given jurisdiction can be quite dissimilar from another and we’re frequently under a contractual obligation that prevents us from standardizing across jurisdictions. The second option was to build our DTDs from modules and share as much common vocabulary as possible.

There were several benefits to creating standalone DTDs: smaller scope makes for swifter, less complicated development; fewer requirements for the markup to support; discrete user groups would mean less controversy and disagreement. However, the benefits of creating the modularized vocabulary would prevent redundancy in the markup, duplicated effort, and, we believed, would prove more efficient in the long run.

This was the genesis of what we have today: XML vocabularies that are shared across multiple editorial systems, discrete applications, and editorial groups, and also provide the definition for data delivery from the business units around the world. The Data Architecture team uses a toolbox of XML modules (body text elements, metadata elements, content-specific elements, etc.) from which unique DTDs can be created, so that whole DTDs do not need to be written from scratch for each new content type that is added to our voluminous data stores.

Additionally, the XML definition for a given semantic component may need to be transformed as it moves through our publishing workflow; for example, metadata necessary for Editorial tools may be dropped on external delivery. To ensure data quality as it moves through the publishing pipeline, we maintain a DTD or schema for data validation at specific points in the process. Thus, a certain element may have a very robust set of attributes at the beginning of the editorial cycle but may have a much leaner group of attributes farther down the path – or, conversely, may have additional metadata added by other post-editorial processes. These multiple configurations of a given semantic are defined using parameter entities with conditional switches.

With such a great variety and quantity of data moving through and being added to our pipeline at any given time, it is expected that our DTDs will change – they need to change to support improvements and enhancements to our products, services and systems and to continue to expand our content offerings. However, with so many disparate user and developer groups around the globe relying on established XML vocabularies, managing that change must be done carefully and transparently. Very early in our migration to XML, it became evident that the small-shop method of requesting and communicating changes over cubicle walls or via email was not going to be sufficient – and that was back when the Data Architecture team itself consisted of three people sitting within 10 feet of one another and the majority of the interested parties were a stone’s throw away. We determined that a formal maintenance protocol was needed, which turned out to be prescient, considering the growth of our own team and the even greater expansion of the community depending on our XML. It was essential to institute a centralized system, not just to control the vocabularies but also the data using the vocabularies and to establish a common understanding and means of expression.

The maintenance protocol we introduced had three features:

  • an application for requesting and storing information about each DTD/schema change;

  • a publicized change control process; and

  • a versioning policy for the vocabulary itself.

XML Maintenance Protocol

XML Maintenance System

Our maintenance system is an intranet application accessible by any employee. Each change request must contain enough information so that all of our XML consumers can evaluate the impact of the change on their own realm of responsibility, including:

  • the reason for the change;

  • affected modules or DTDs/schemas;

  • existing and proposed content models;

  • change benefits and risks;

  • a document sample that has been validated to a mocked-up DTD/schema.

All change requests are assigned a unique ID and stored in an SQL database. The system allows users to query various fields, or generate lists of changes by platform and/or change status. When a change request is submitted, the maintenance system alerts the appropriate Data Architect, who is then responsible for shepherding the change through the maintenance process. Generally, the users provide some basic information about the problem they’re trying to solve and then the Data Architect follows the usual steps in designing markup: reviewing requirements, conducting user interviews, modeling the data, assessing downstream impacts, tagging or converting sample documents, etc. A related tool helps us research existing vocabulary so that we’re as consistent and efficient as possible – a database of elements and attributes, which we search using XQuery.

XML Maintenance Process

Each change request is first reviewed in a Data Architecture-only round table discussion; because the modules are shared, each Data Architect may find the new markup in his DTD and so needs to understand and agree to the change. This has the additional benefit of team members learning more about each other’s content expertise and of course provides an opportunity for newer staff members to learn about data modeling from those with more experience. At this point, only our team can view the change request on the Intranet site; this reduces churn and controversy within the larger community.

Once the change passes muster with our group, it is then viewable to everyone on our Intranet site and distributed to representatives throughout the entire LexisNexis organization. We typically allow 10 business days for assessing impacts, although we will expedite the review period when deadlines demand.

If an objection or negative impact is raised, we will reconsider the change based on the feedback and we sometimes must negotiate a compromise between the requester and the objector. It’s entirely possible that a change that benefits one group will have a negative impact on another group and the overall business benefits and risks must be evaluated to determine the right course of action. A simple DTD change may seem trivial – what could possibly cause damage? – but if we were to make a backwards-incompatible change without working out the details of how to remediate the data, we could block up our publishing pipeline; invalid data will get bounced back. This is not so bad if it’s a few hundred documents but if it’s a few hundred thousand, this could wreak havoc. Additionally, we have very aggressive deadlines for updating our documents, particularly for caselaw and news. If we were to introduce invalid markup that then resulted in even a short delay of delivering our documents online, our customers would transfer to our competitors pretty quickly.

Once a change is approved, the new markup is implemented in the module with documentation specifically referencing the change itself (change request ID, date, Data Architect making the change, etc), so that we can view our DTDs in the future and not have to scratch our heads wondering why a particular change was made two years ago. Then the DTDs are normalized and validated using a series of scripts and tested using a set of sample documents. The number of DTDs being updated at a given time depends on several factors: which system, platform or group needed the change, how many changes are in the queue waiting to be implemented, and whether or not we need to meet a specific project deadline.

Documentation is embedded within the DTDs and schemas so that it can be kept up-to-date as changes are implemented; we autogenerate several different types of documentation for our users, including a summary of what changes are included in each DTD update.

XML Versioning Policy

Revision control and storage of our source modules is handled by Rational ClearCase. Since our customers only see normalized DTD/Schemas, we adopted a versioning policy for those deliverables. We use a three node, numerical versioning syntax, expressing three degrees of revision: major, minor, and micro.

Major changes are defined as changes that would either render documents generated under the previous version invalid or alters a previous version’s existing default DOM expansion (such as changing the default or fixed value of an attribute). When a change is determined to be major, the major revision node is incremented by 1, and the minor and micro nodes are moved to 0.

Minor changes are defined as changes that will not render documents generated under the previous version invalid, such as making previously required nodes optional. When a change is determined to be minor, the minor revision node is incremented by 1, and the micro node is moved to 0. The major node is left unchanged.

Micro changes are defined as changes that have no syntactic or semantic impact to the document (e.g., enhancements to documentation stored within comments in the DTD). When a change is determined to be micro, the micro revision node is incremented by 1, and the major and minor nodes are left unchanged.

We typically do not maintain multiple versions of the same DTD or schema at once, though we have agreed to this for short periods of time and under the caveat that if we need to maintain multiple major versions, we will not accept any backwards incompatible change requests to the lower version – if a group needs a backwards incompatible change, then they’re going to need to upgrade to the later version.

Benefits and Challenges

This process and system have provided several benefits. The first is just simple record-keeping: there’s no wondering about why a particular change was made three years ago because we can easily go back to the original details. Central control of the XML vocabulary prevents the practice of keeping local copies that rapidly fall out of sync as each group makes changes without regard to the consequences for other teams. This was the situation at one of the acquired business units and when we had to fold their DTDs into our catalog, it was a labor-intensive, frustrating task to try to find all the copies of their DTDs and consolidate them to a single canonical version of each.

This record-keeping provides us with an easy way to derive metrics about our own efficiency in managing change. We can determine which DTDs or modules require the most modifications, which could be an indication that something needs a closer, more holistic analysis. As the manager of the team, I can also draw metrics on the amount of time needed by individual team members to process a change, which could indicate any number of issues: unbalanced workloads, lack of adequate training and skills, etc.

Central control of the XML also helps us keep redundancy in check – we’re less likely to create a new element for a very similar semantic because the process for doing this is so transparent. When our team first began formal tracking of our DTD changes and introduction of new vocabulary, we were still rather siloed in our approach, which resulted in individual Data Architects creating separate elements or attributes for the same or very similar semantic. Adopting a maintenance process ensured that no XML can be introduced or changed without making the rest of the team aware. Our twice weekly discussions and the use of our Element Repository for researching existing vocabulary were an additional resolution to this problem.

Another benefit is that opening up the process to the development and user groups has increased their understanding of XML, which can be inconsistent across such a broad audience. And we've added to the information coming to us, not just from us: by notifying the community of a proposed change rather than just distributing an updated DTD, we’ve made allowances for the situations when feedback is critical. We have so many applications downstream of the XML definition it would be impossible for the Data Architecture team to understand the minute details of how the XML is being interrogated and used everywhere (though we do know quite a bit of it), so we rely on the application and content experts for information in this regard.

The transparency of our process has its downside: users have become so accustomed to seeing XML syntax that they are prone to requesting specific markup instead of stating their requirements and letting the Data Architects design the markup. Then if we propose an alternate design, this sometimes causes frustration and contention all around. Also, because a given semantic can be modeled in multiple ways, in addition to impact assessments, we have to be willing to accept constructive criticism about our markup design and to be flexible enough to alter it.

Backwards-incompatibility can also be challenging. In the earliest stages of our initial migration to markup (which was SGML), the number of DTD consumers could fit comfortably in a conference room and the scope of the data conversion was intentionally limited. At that time, making backwards-incompatible changes to the DTD wasn’t a big problem. Also, if you’re lucky enough to have documents that are published once and don’t need updating, backwards-incompatibility isn’t an issue. However, many of our documents do need to be updated continually and now that we’re several years into such a large-scale migration, making a DTD change that would invalidate data is not a trivial matter. And with a shared vocabulary, this could have an enormous ripple effect into document types owned by groups completely unfamiliar with the data that needed the backwards-incompatible change. For instance, a financial product developer might not understand why we need to push a change to a shared module to solve a problem for legislative products. Certainly the freedom to make these kinds of backwards-incompatible changes easily was something we sacrificed when we went down this path.

Another challenge is that sharing XML to this degree necessitates a relatively loose set of rules, particularly for online document delivery. We tend to move the data standards conformance checks farther upstream: our Editorial DTDs and schemas are far stricter than the delivery DTDs and we frequently employ other technologies such as Schematron to ensure data standards are met.

Conclusion

The protocol we follow for managing our XML vocabularies has been in place now since 1999, albeit in a much more rigorous form now than it was at that time. The engineering groups have used various forms of change management for their software development -- both home-grown or off-the-shelf change control products -- but our XML management protocol (system and process) has withstood whatever else has come and gone. In fact, several other ancillary vocabularies have been put under our auspices because our system is so respected. For an enterprise of this size to share XML across content types and continents, it’s critical to keep the change management transparent and well-publicized. Even smaller organizations can benefit from following an established process, if only to eliminate the questions of why a predecessor made a particular modification. This system enables us to expand our XML vocabularies at the aggressive rate our internal users and external customers expect.

It is difficult to imagine how engineering costs, time-to-market, and product and content integration could be done on such a large scale if DTDs and schemas were spread throughout the organization under the control of local content and application development groups and without a public process. In the loosely-coupled systems designs typified by Service Oriented Architectures of today’s large scale systems, centrally managed schemas provide the reliable specifications for the content that systems trade with one another as well as something just as important – a common vocabulary that can be used by distant, possibly even unrelated, product and application development groups to explain and express their needs. This means of communication makes it possible for us to deliver a very broad set of products, services and content types but it also provides a framework for innovation through shared understanding across the organization.