<?xml version="1.0" encoding="UTF-8"?><article xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0-subset Balisage-1.2"><title>Managing multiple vocabularies across a global enterprise</title><info><confgroup><conftitle>Balisage: The Markup Conference 2008</conftitle><confdates>August 12 - 15, 2008</confdates></confgroup><abstract><para>Organizations share vocabularies across disparate user groups and data to maximize the
        value of their investment in XML, and, without question, those XML vocabularies need to
        change as the businesses evolve and expand. Managing change to DTDs and schemas is difficult
        enough with a small group of co-located users working on the same content types. What
        happens when you have hundreds of XML consumers spread across the globe and they have
        completely different requirements, systems, and content? Get a view of the challenges of
        implementing change management and vocabulary versioning on a <emphasis role="ital">very</emphasis> large scale.</para></abstract><author><personname><firstname>Laurel</firstname><surname>Shifrin</surname></personname><personblurb><para>Laurel Shifrin is the Manager of Data Architecture at LexisNexis, an international
          company offering online information and solutions. She has been working with markup
          languages since 1993 and presented at Markup Technologies, a predecessor to Balisage. She
          is also the co-author (with Michael Atkinson) of <emphasis role="ital">Flickipedia:
            Perfect Films for Every Occasion, Holiday, Mood, Ordeal and Whim</emphasis>(Chicago
          Review Press).</para></personblurb><affiliation><jobtitle>Manager, Data Architecture </jobtitle><orgname>LexisNexis</orgname></affiliation><email>laurel.shifrin@lexisnexis.com</email></author><legalnotice><para>Copyright © 2008 LexisNexis</para></legalnotice></info><section><title>Introduction</title><para>One of the primary motivators for executive sponsorship of an XML implementation is
      decreased cost and increased efficiency. One of the ways to maximize this effectiveness is to
      create a centrally managed XML vocabulary that can be shared across the entire organization.</para><para>In a small company where technical developers and users alike work in a single location,
      this can be a relatively straightforward prospect: only a small number of people are involved
      and they can communicate frequently and easily. Likewise, if the content is specialized toward
      a single discipline, the vocabulary itself may be semantically rich but otherwise
      uncomplicated. However, even in a small organization with a fairly lean vocabulary, managing
      the XML definition is important: it will certainly continue to expand and ad hoc change
      management can result in unexpected – and unpleasant – outcomes. In a large organization with
      a great diversity of data types, the potential for problems associated with loose management
      is increased. Governance of markup is paramount, not just because of the downstream
      applications, stylesheets and other artifacts expecting certain data constructs but also
      because of the cost in changing your data: once you create content in XML and store it, the
      odds are that it’s going to stay that way for a very long time and that changes will be
      incremental. As the amount of data grows, so does the importance of vocabulary control.</para></section><section><title>Background of LexisNexis and the migration to XML</title><para>LexisNexis is a leading international provider of business information solutions to
      professionals in a variety of areas, including: legal, corporate, government, law enforcement,
      tax, accounting, academic, and risk and compliance assessment. In more than 100 countries,
      across six continents, LexisNexis provides customers with access to five billion searchable
      documents from more than 40,000 legal, news and business sources. We have over 13,000
      employees located in business units around the globe.</para><para>Since the majority of these business units have been acquired as existing entities, upon
      acquisition they typically each had their own editorial systems, data formats and online
      products, much of these reflective of the different legal systems in place around the world.
      Some of the acquired business units already had XML-based Content Management Systems, some had
      SGML, some were using either proprietary markup systems or Microsoft Word documents with a
      varying degree of semantics – sometimes not at all but occasionally captured using MS Word
      styles.</para><para>In 2004, we launched a common delivery platform that included news, financial and legal
      data from seven countries across three continents. This shared platform was designed to meet
      several goals: <itemizedlist><listitem><para>consolidate the technology for delivering data to our customers, thus decreasing
            cost;</para></listitem><listitem><para>establish a publishing pipeline that could enable the addition of new countries in a
            relatively clear-cut manner;</para></listitem><listitem><para>create a standardized user interface that could be recognized globally; and</para></listitem><listitem><para>make it possible to develop new products and services across different markets and
            countries more easily.</para></listitem></itemizedlist></para><para>At the same time, the U.S. business unit continued to enhance the existing flagship
      American products, lexis.com and nexis.com, and to build our XML-based editorial systems. Our
      editorial staff are similarly spread across the U.S., working in different time zones to
      create different products consisting of data that is sometimes shared but often unique in
      nature and geared toward a specific market; for example, an editor creating solutions for
      legal practitioners who specialize in Intellectual Property works with a very dissimilar data
      set than an editor abstracting and indexing International Statistics for an academic audience.</para><para>All of these projects were staffed with representatives from a host of different groups in
      multiple countries (engineering, editorial, user experience, etc. ), sometimes leveraging
      offshore resources as well. The schedules, time-line and project management were also geared
      toward the specific goals of the individual efforts, with a limited amount of overlap and
      consideration of the other projects.</para></section><section><title>Shared Vocabularies</title><para>Underlying all of these considerable efforts was the XML definition, managed by a single,
      centralized Data Architecture team, based in the U.S. Our team is dispersed among eight
      locations and two time zones and we maintain over 100 DTDs and schemas.</para><para>At the beginning of our markup development, we were faced with two options. The first
      option was building standalone DTDs for each content type -- and, for legal statutory data,
      potentially building standalone DTDs for each content type <emphasis role="ital">within each
        jurisdiction</emphasis>, as the structure and standards for a given jurisdiction can be
      quite dissimilar from another and we’re frequently under a contractual obligation that
      prevents us from standardizing across jurisdictions. The second option was to build our DTDs
      from modules and share as much common vocabulary as possible.</para><para>There were several benefits to creating standalone DTDs: smaller scope makes for swifter,
      less complicated development; fewer requirements for the markup to support; discrete user
      groups would mean less controversy and disagreement. However, the benefits of creating the
      modularized vocabulary would prevent redundancy in the markup, duplicated effort, and, we
      believed, would prove more efficient in the long run.</para><para>This was the genesis of what we have today: XML vocabularies that are shared across
      multiple editorial systems, discrete applications, and editorial groups, and also provide the
      definition for data delivery from the business units around the world. The Data Architecture
      team uses a toolbox of XML modules (body text elements, metadata elements, content-specific
      elements, etc.) from which unique DTDs can be created, so that whole DTDs do not need to be
      written from scratch for each new content type that is added to our voluminous data stores.</para><para>Additionally, the XML definition for a given semantic component may need to be transformed
      as it moves through our publishing workflow; for example, metadata necessary for Editorial
      tools may be dropped on external delivery. To ensure data quality as it moves through the
      publishing pipeline, we maintain a DTD or schema for data validation at specific points in the
      process. Thus, a certain element may have a very robust set of attributes at the beginning of
      the editorial cycle but may have a much leaner group of attributes farther down the path – or,
      conversely, may have additional metadata added by other post-editorial processes. These
      multiple configurations of a given semantic are defined using parameter entities with
      conditional switches.</para><para>With such a great variety and quantity of data moving through and being added to our
      pipeline at any given time, it is expected that our DTDs will change – they need to change to
      support improvements and enhancements to our products, services and systems and to continue to
      expand our content offerings. However, with so many disparate user and developer groups around
      the globe relying on established XML vocabularies, managing that change must be done carefully
      and transparently. Very early in our migration to XML, it became evident that the small-shop
      method of requesting and communicating changes over cubicle walls or via email was not going
      to be sufficient – and that was back when the Data Architecture team itself consisted of three
      people sitting within 10 feet of one another and the majority of the interested parties were a
      stone’s throw away. We determined that a formal maintenance protocol was needed, which turned
      out to be prescient, considering the growth of our own team and the even greater expansion of
      the community depending on our XML. It was essential to institute a centralized system, not
      just to control the vocabularies but also the data using the vocabularies and to establish a
      common understanding and means of expression.</para><para>The maintenance protocol we introduced had three features: <itemizedlist><listitem><para>an application for requesting and storing information about each DTD/schema
          change;</para></listitem><listitem><para>a publicized change control process; and</para></listitem><listitem><para>a versioning policy for the vocabulary itself.</para></listitem></itemizedlist></para></section><section><title>XML Maintenance Protocol</title><section><title>XML Maintenance System</title><para>Our maintenance system is an intranet application accessible by any employee. Each
        change request must contain enough information so that all of our XML consumers can evaluate
        the impact of the change on their own realm of responsibility, including: <itemizedlist><listitem><para>the reason for the change;</para></listitem><listitem><para>affected modules or DTDs/schemas;</para></listitem><listitem><para>existing and proposed content models;</para></listitem><listitem><para>change benefits and risks;</para></listitem><listitem><para>a document sample that has been validated to a mocked-up DTD/schema.</para></listitem></itemizedlist></para><para>All change requests are assigned a unique ID and stored in an SQL database. The system
        allows users to query various fields, or generate lists of changes by platform and/or change
        status. When a change request is submitted, the maintenance system alerts the appropriate
        Data Architect, who is then responsible for shepherding the change through the maintenance
        process. Generally, the users provide some basic information about the problem they’re
        trying to solve and then the Data Architect follows the usual steps in designing markup:
        reviewing requirements, conducting user interviews, modeling the data, assessing downstream
        impacts, tagging or converting sample documents, etc. A related tool helps us research
        existing vocabulary so that we’re as consistent and efficient as possible – a database of
        elements and attributes, which we search using XQuery.</para></section><section><title>XML Maintenance Process</title><para>Each change request is first reviewed in a Data Architecture-only round table
        discussion; because the modules are shared, each Data Architect may find the new markup in
        his DTD and so needs to understand and agree to the change. This has the additional benefit
        of team members learning more about each other’s content expertise and of course provides an
        opportunity for newer staff members to learn about data modeling from those with more
        experience. At this point, only our team can view the change request on the Intranet site;
        this reduces churn and controversy within the larger community.</para><para>Once the change passes muster with our group, it is then viewable to everyone on our
        Intranet site and distributed to representatives throughout the entire LexisNexis
        organization. We typically allow 10 business days for assessing impacts, although we will
        expedite the review period when deadlines demand.</para><para>If an objection or negative impact is raised, we will reconsider the change based on the
        feedback and we sometimes must negotiate a compromise between the requester and the
        objector. It’s entirely possible that a change that benefits one group will have a negative
        impact on another group and the overall business benefits and risks must be evaluated to
        determine the right course of action. A simple DTD change may seem trivial – what could
        possibly cause damage? – but if we were to make a backwards-incompatible change without
        working out the details of how to remediate the data, we could block up our publishing
        pipeline; invalid data will get bounced back. This is not so bad if it’s a few hundred
        documents but if it’s a few hundred thousand, this could wreak havoc. Additionally, we have
        very aggressive deadlines for updating our documents, particularly for caselaw and news. If
        we were to introduce invalid markup that then resulted in even a short delay of delivering
        our documents online, our customers would transfer to our competitors pretty quickly.</para><para>Once a change is approved, the new markup is implemented in the module with
        documentation specifically referencing the change itself (change request ID, date, Data
        Architect making the change, etc), so that we can view our DTDs in the future and not have
        to scratch our heads wondering why a particular change was made two years ago. Then the DTDs
        are normalized and validated using a series of scripts and tested using a set of sample
        documents. The number of DTDs being updated at a given time depends on several factors:
        which system, platform or group needed the change, how many changes are in the queue waiting
        to be implemented, and whether or not we need to meet a specific project deadline.</para><para>Documentation is embedded within the DTDs and schemas so that it can be kept up-to-date
        as changes are implemented; we autogenerate several different types of documentation for our
        users, including a summary of what changes are included in each DTD update.</para></section><section><title>XML Versioning Policy</title><para>Revision control and storage of our source modules is handled by Rational ClearCase.
        Since our customers only see normalized DTD/Schemas, we adopted a versioning policy for
        those deliverables. We use a three node, numerical versioning syntax, expressing three
        degrees of revision: major, minor, and micro.</para><para>Major changes are defined as changes that would either render documents generated under
        the previous version invalid or alters a previous version’s <emphasis role="ital">existing</emphasis> default DOM expansion (such as changing the default or fixed value of
        an attribute). When a change is determined to be major, the major revision node is
        incremented by 1, and the minor and micro nodes are moved to 0.</para><para>Minor changes are defined as changes that <emphasis role="bold">will not</emphasis>
        render documents generated under the previous version invalid, such as making previously
        required nodes optional. When a change is determined to be minor, the minor revision node is
        incremented by 1, and the micro node is moved to 0. The major node is left unchanged.</para><para>Micro changes are defined as changes that have no syntactic or semantic impact to the
        document (e.g., enhancements to documentation stored within comments in the DTD). When a
        change is determined to be micro, the micro revision node is incremented by 1, and the major
        and minor nodes are left unchanged.</para><para>We typically do not maintain multiple versions of the same DTD or schema at once, though
        we have agreed to this for short periods of time and under the caveat that if we need to
        maintain multiple major versions, we will not accept any backwards incompatible change
        requests to the lower version – if a group needs a backwards incompatible change, then
        they’re going to need to upgrade to the later version.</para></section></section><section><title>Benefits and Challenges</title><para>This process and system have provided several benefits. The first is just simple
      record-keeping: there’s no wondering about why a particular change was made three years ago
      because we can easily go back to the original details. Central control of the XML vocabulary
      prevents the practice of keeping local copies that rapidly fall out of sync as each group
      makes changes without regard to the consequences for other teams. This was the situation at
      one of the acquired business units and when we had to fold their DTDs into our catalog, it was
      a labor-intensive, frustrating task to try to find all the copies of their DTDs and
      consolidate them to a single canonical version of each.</para><para>This record-keeping provides us with an easy way to derive metrics about our own
      efficiency in managing change. We can determine which DTDs or modules require the most
      modifications, which could be an indication that something needs a closer, more holistic
      analysis. As the manager of the team, I can also draw metrics on the amount of time needed by
      individual team members to process a change, which could indicate any number of issues:
      unbalanced workloads, lack of adequate training and skills, etc.</para><para>Central control of the XML also helps us keep redundancy in check – we’re less likely to
      create a new element for a very similar semantic because the process for doing this is so
      transparent. When our team first began formal tracking of our DTD changes and introduction of
      new vocabulary, we were still rather siloed in our approach, which resulted in individual Data
      Architects creating separate elements or attributes for the same or very similar semantic.
      Adopting a maintenance process ensured that no XML can be introduced or changed without making
      the rest of the team aware. Our twice weekly discussions and the use of our Element Repository
      for researching existing vocabulary were an additional resolution to this problem.</para><para>Another benefit is that opening up the process to the development and user groups has
      increased their understanding of XML, which can be inconsistent across such a broad audience.
      And we've added to the information coming <emphasis role="ital">to</emphasis> us, not just
      from us: by notifying the community of a proposed change rather than just distributing an
      updated DTD, we’ve made allowances for the situations when feedback is critical. We have so
      many applications downstream of the XML definition it would be impossible for the Data
      Architecture team to understand the minute details of how the XML is being interrogated and
      used everywhere (though we do know quite a bit of it), so we rely on the application and
      content experts for information in this regard.</para><para>The transparency of our process has its downside: users have become so accustomed to
      seeing XML syntax that they are prone to requesting specific markup instead of stating their
      requirements and letting the Data Architects design the markup. Then if we propose an
      alternate design, this sometimes causes frustration and contention all around. Also, because a
      given semantic can be modeled in multiple ways, in addition to impact assessments, we have to
      be willing to accept constructive criticism about our markup design and to be flexible enough
      to alter it.</para><para>Backwards-incompatibility can also be challenging. In the earliest stages of our initial
      migration to markup (which was SGML), the number of DTD consumers could fit comfortably in a
      conference room and the scope of the data conversion was intentionally limited. At that time,
      making backwards-incompatible changes to the DTD wasn’t a big problem. Also, if you’re lucky
      enough to have documents that are published once and don’t need updating,
      backwards-incompatibility isn’t an issue. However, many of our documents do need to be updated
      continually and now that we’re several years into such a large-scale migration, making a DTD
      change that would invalidate data is not a trivial matter. And with a shared vocabulary, this
      could have an enormous ripple effect into document types owned by groups completely unfamiliar
      with the data that needed the backwards-incompatible change. For instance, a financial product
      developer might not understand why we need to push a change to a shared module to solve a
      problem for legislative products. Certainly the freedom to make these kinds of
      backwards-incompatible changes easily was something we sacrificed when we went down this path.</para><para>Another challenge is that sharing XML to this degree necessitates a relatively loose set
      of rules, particularly for online document delivery. We tend to move the data standards
      conformance checks farther upstream: our Editorial DTDs and schemas are far stricter than the
      delivery DTDs and we frequently employ other technologies such as Schematron to ensure data
      standards are met.</para></section><section><title>Conclusion</title><para>The protocol we follow for managing our XML vocabularies has been in place now since 1999,
      albeit in a much more rigorous form now than it was at that time. The engineering groups have
      used various forms of change management for their software development -- both home-grown or
      off-the-shelf change control products -- but our XML management protocol (system and process)
      has withstood whatever else has come and gone. In fact, several other ancillary vocabularies
      have been put under our auspices because our system is so respected. For an enterprise of this
      size to share XML across content types and continents, it’s critical to keep the change
      management transparent and well-publicized. Even smaller organizations can benefit from
      following an established process, if only to eliminate the questions of why a predecessor made
      a particular modification. This system enables us to expand our XML vocabularies at the
      aggressive rate our internal users and external customers expect.</para><para>It is difficult to imagine how engineering costs, time-to-market, and product and content
      integration could be done on such a large scale if DTDs and schemas were spread throughout the
      organization under the control of local content and application development groups and without
      a public process. In the loosely-coupled systems designs typified by Service Oriented
      Architectures of today’s large scale systems, centrally managed schemas provide the reliable
      specifications for the content that systems trade with one another as well as something just
      as important – a common vocabulary that can be used by distant, possibly even unrelated,
      product and application development groups to explain and express their needs. This means of
      communication makes it possible for us to deliver a very broad set of products, services and
      content types but it also provides a framework for innovation through shared understanding
      across the organization.</para></section></article>
