The Clinical Document Architecture (CDA) Release 2 is derived from the Health
Level Seven (HL7) Reference Information Model (RIM). It defines a formal model and semantics
for concepts found in a clinical document — acts, such as procedures, substance
administrations, and observations of laboratory findings; entities such as people,
places, devices, and drugs; relationships such as participation and causality; and
CDA is undeniably complex, covering as it does the full range of clinical documents
for multiple countries and regions. Its derivation from a Unified Modeling Language
(UML) representation of healthcare concepts reflects that complexity.
A CDA Implementation Guide (IG) defines a specific implementation of CDA by
specifying how to use these building blocks to express a particular kind of clinical
document — a report on infection in hemodialysis patients, or the history and
physical taken during a visit to a physician.
The CDA model is instantiated as one W3C Schema (version 1.0) that covers all these
related clinical report types (also called implementations). The model is very
expressive but there are technology-determined gaps between the UML of the model and the
XML Schema expression of it, and then further gaps between the generic XML Schema that
covers all clinical documents and the specific type of document described in a given
implementation guide for a specific purpose. Some of these gaps could perhaps be filled
if CDA used W3C XSD 1.1, or RELAX NG, to define the schema, but neither of those options
is likely to be possible in the near term, given the practicalities of implementations
of complex standards that are used across the world in critical healthcare systems.
People who are guaranteeing to a healthcare organization that the documents they
deliver contain the right information for a specific purpose, and expressed using the
right syntax, need to know that the validation we provide for testing will pass all good
files and fail all bad files. This means we have to test our validation mechanism, and
that mechanism has to be in addition to the basic CDA.xsd (1.0) validation.
The validation mechanism we choose has to fulfill a number of criteria. It must be
easy for non-XML experts to use to test the files that come out of their
implementations, and it must be usable in many ways. It must cover the gaps between
the prose definitions in a specific implementation guide and the generic XSD schema that
is used for a large number of implementation guides as much as possible. The goal is to
make validation available in appropriate forms to guarantee the quality of the XML
documents that the public health organizations receive from hospitals and other
This paper introduces Clinical Document Architecture (CDA) concepts to show why
Schematron validation is needed to supplement schema validation, discusses how we
currently produce and test the Schematron validation, and explores some challenges. We
are interested in other approaches to quality management and testing that we could
investigate to supplement our current methods.
The documentation examples used in this paper are taken from the HL7
Implementation Guide for CDA® Release 2 - Level 3: Healthcare Associated Infection
Reports, Release 7 (US Realm) (HAI R7 IG), available at
Much of this work was carried out for the Lantana Consulting Group.
The authors appreciate the comments made by the anonymous peer reviewers as well
as Rick Geimer and Liora Alschuler from Lantana.
Aspects of the CDA Model
Those seeking to represent clinical documents in XML face the same choices as in many
other areas: where do we place the line between a centrally-mandated model that omits
data that only a few participants need to record, and a free-for-all in which no two
documents are modelled in the same way? CDA addresses this in a two-pronged approach
— the abstract model is expressed as elements; attribute values refine the
meaning, with heavy reliance on public vocabularies. These vocabularies are slightly
different than the ones often assumed in an XML context; they are not concepts described
in an XML schema but rather an ontology or set of codes that can be (and are) described
in many different formats. In many ways they play a role similar to that played by
elements in some schemas -- they associate semantic meaning with the data. Healthcare
professionals who specialize in vocabulary can spend just as long arguing over the precise meaning of a
term that is to be defined in a codeset, or over which one to use in which context, as XML
schema designers spend in arguing over what name to give a particular element in its
context. We will come back to vocabularies again throughout this paper.
What does it mean to say that attribute values refine the meaning of an
CDA has two aspects: a text-heavy document aspect, called the narrative block, that is
recorded in HTML-like elements; and an interoperable, machine-processable aspect with
more precise semantics, called clinical statements or coded entries. Coded entries do not record presentational aspects such as section,
paragraph, table, list, or figure. Rather, they record specializations of the abstract concepts of
entities, acts, and relationships. These specializations have XML
names: a participant is a specialization of the entity concept;
procedure and substanceAdministration are specializations of an act;
component-of, is-reason-for, is-cause-of are types of relationship. Some of these examples are elements, others are attributes.
In many applications of XML, attribute values are not central to interpretation.
Some of us were taught, when first learning a pointy-bracket syntax,
that an element name classifies the content and an attribute provides
additional, secondary information:
In CDA attributes have a stronger role: rather than providing supplementary
information, they usually continue refining the taxonomic distinctions
made by elements.
In the procedure element below, the code element refines its parent's meaning
by specifying the kind of procedure, using a value from a specific vocabulary. The vocabulary
is identified in the codeSystem attribute by a dot-notation object identifier (OID):
<!--ID of procedure -->
<id root="2.16.840.1.1138184.108.40.206.220.127.116.11" extension="232323"/>
<code codeSystem="2.16.840.1.113883.6.96" code="423827005"
The refinement of meaning can have multiple levels. The previous example captures "A
procedure; what kind of procedure? An endoscopy." The next example shows a
waterfall-like nesting of questions and answers: "A participant; what kind of
participant? A location; what kind of location? A service delivery location; what kind
of service delivery location? A Medical/Surgical Critical Care unit."
<!--ID of facility -->
<id root="2.16.840.1.113818.104.22.168.22.214.171.124" extension="9W"/>
codeSystemName="HL7 Healthcare Service Location Code" code="1029-8"
displayName="Medical/Surgical Critical Care"/>
CDA has two more general specializations of the act concept,
the observation elements and act elements.
<observation classCode="OBS" moodCode="EVN" negationInd="false">
<code codeSystem="2.16.840.1.113883.6.96" code="50373000"
<value xsi:type="PQ" value="180" unit="cm"/>
This captures "An observation; of what? A body height; what height was observed? 180cm."
A consequence of the semantic role of attributes in CDA XML is that the words "value" and "code" have several usages: the value element,
its value attribute, the value of that attribute (which may be a code), the value of the code element's code attribute (which is always a code), or -- which is usually clear from context -- the value of some other attribute. (Ordinary speech is similarly challenged in distinguishing between the abstract concepts, which are UML classes, such as Act, and XML elements in the CDA schema, such as act.) To cut through that confusion, don't think about the XML first! Focus on the clinical content -- what is being expressed? -- and consider the XML elements and attributes as packaging.
What is to be expressed? A body height.
That's an observation. An observation of what? body height (code element: code attribute).
What was observed? 180cm. (value element: datatype is physical quantity, value is 180, unit is cm)
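The same question-and-answer reading can be applied mechanically. The following Python sketch is illustrative only: it assumes the standard CDA namespace (urn:hl7-org:v3), wraps the body-height example in a self-contained fragment, and uses a hypothetical helper name, read_observation.

```python
import xml.etree.ElementTree as ET

# urn:hl7-org:v3 is the CDA namespace; xsi:type lives in the XML Schema instance namespace.
NS = {"cda": "urn:hl7-org:v3"}
XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"

# Illustrative fragment modelled on the body-height example; not a complete CDA document.
OBSERVATION = """
<observation xmlns="urn:hl7-org:v3"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             classCode="OBS" moodCode="EVN" negationInd="false">
  <code codeSystem="2.16.840.1.113883.6.96" code="50373000"/>
  <value xsi:type="PQ" value="180" unit="cm"/>
</observation>
"""

def read_observation(xml_text):
    """Answer 'an observation of what?' and 'what was observed?' for one observation."""
    obs = ET.fromstring(xml_text)
    code = obs.find("cda:code", NS)      # code element: code attribute
    value = obs.find("cda:value", NS)    # value element: datatype, value, unit
    return {
        "what": (code.get("codeSystem"), code.get("code")),
        "datatype": value.get(XSI_TYPE),
        "observed": (value.get("value"), value.get("unit")),
    }

print(read_observation(OBSERVATION))
```

The function reads the clinical content first (what, datatype, observed) and treats element and attribute names as packaging, which mirrors the reading order suggested above.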
Uses of Attribute Values
In CDA, attribute values can have implications for the node tree, primarily
through alternatives and through conditional requirements.
The range of
relationships in clinical content goes far beyond
child containment, so the model interposes a wrapper that can carry
information about the relationship. Here we’re recording the micro-organism
cause of a positive blood culture:
<observation classCode="OBS" moodCode="EVN" negationInd="false">
<code code="ASSERTION" codeSystem="2.16.840.1.113883.5.4"/>
displayName="Positive blood culture"/>
<entryRelationship typeCode="CAUS" inversionInd="true">
<observation classCode="OBS" moodCode="EVN">
One powerful attribute, @moodCode, expresses something akin to mood in
English verbs: it can change the sense of a substanceAdministration element from
prescription (an intent) to application (an event).
To record that something did not happen or was not done, CDA provides a
negation mechanism — this is also an attribute value. This patient experienced
no adverse reaction:
<observation classCode="OBS" moodCode="EVN" negationInd="true">
<code codeSystem="2.16.840.1.113883.5.4" code="ASSERTION"/>
This has great expressive power when used in combination with relationships
(the cause of the fever was not the bacterium).
Of course, that is not the same as not
knowing whether the cause of the fever was the bacterium....
Unlike many paper forms and database tables, CDA makes a strong distinction between a value and
the reason a value is not recorded. Such reasons are recorded in a @nullFlavor
attribute. Here, we haven’t asked for the patient’s birthdate (perhaps the
patient arrived unconscious and without his wallet):
For an elderly person living in a remote village, the appropriate nullFlavor
might be “UNK” — the question was asked, but
the answer wasn’t known.
The Price of Power
The price of this expressive power and interoperability is complexity, of course.
Nevertheless this provides a reasonably concise expression of the very large world
of clinical documents that is the model's scope. Every layer has a role; any
collapsing of the model leaves some body of information out. The act relationships
and moods are elegant: convert an intent to administer tachytherapy into a report of
having done it by changing the moodCode from INT to EVN, and look forward to
processing a volume of XML documents to compare the number of intents to the number of events, or to
report on the average elapsed time between intent and event.
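As a rough sketch of that kind of processing, the following Python tallies moodCode values across a set of documents. The two input fragments are made up for illustration, and the function name count_moods is hypothetical; only the element name, the INT and EVN moods, and the CDA namespace come from the model itself.

```python
import xml.etree.ElementTree as ET
from collections import Counter

CDA = "{urn:hl7-org:v3}"  # CDA namespace in ElementTree's {uri}name notation

def count_moods(documents, element="substanceAdministration"):
    """Tally moodCode values for one kind of act across many documents."""
    tally = Counter()
    for xml_text in documents:
        root = ET.fromstring(xml_text)
        for act in root.iter(CDA + element):
            tally[act.get("moodCode", "(missing)")] += 1
    return tally

# Two made-up fragments: one intent (INT) and one event (EVN).
DOCS = [
    '<entry xmlns="urn:hl7-org:v3">'
    '<substanceAdministration classCode="SBADM" moodCode="INT"/></entry>',
    '<entry xmlns="urn:hl7-org:v3">'
    '<substanceAdministration classCode="SBADM" moodCode="EVN"/></entry>',
]

tally = count_moods(DOCS)
print(tally["INT"], "intents,", tally["EVN"], "events")
```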
Vocabulary for those Attribute Values
Controlled and widely-used vocabularies are crucial to making this approach work.
There are several vocabularies covering every aspect of healthcare, from units of
measure to precise descriptions of body parts as a surgeon would view them. These
public vocabularies are crucial for interoperability.
There are many public vocabularies in the healthcare realm: for example, SNOMED
CT is a core general terminology with more than 311,000 active concepts
organized into hierarchies that is commonly used for clinical findings and body
parts; RxNorm provides
normalized names for clinical drugs and ingredients. There are, of course,
overlapping vocabularies with concepts that almost, but not quite, agree with each
other, so in practice many healthcare systems need to support multiple vocabularies
to cover all the cases.
These vocabularies are made available under differing licensing terms, and in
different formats. For Schematron testing purposes, we create custom XML files
with only the terms (codes) that are relevant to the specific implementation guide.
The format we use has entries like this:
<code value="413495001" displayName="ASA physical status class 1"
NHSNdisplayName="Normally healthy patient"
<code value="413496000" displayName="ASA physical status class 2"
NHSNdisplayName="Patient with mild systemic disease"
<code value="413497009" displayName="ASA physical status class 3"
NHSNdisplayName="Patient with severe systemic disease, not incapacitating"
<code value="413498004" displayName="ASA physical status class 4"
NHSNdisplayName="Patient with incapacitating systemic disease, constant threat to life"
<code value="413499007" displayName="ASA physical status class 5"
NHSNdisplayName="Moribund patient, &lt; 24-hour life expectancy"
The many-digit numbers are globally unique object identifiers (usually abbreviated
as OID). These identifiers are the preferred method of identifying objects in HL7
standards such as CDA, and are used for everything from sets of vocabulary (e.g.,
the ValueSet definition above) to chunks of the implementation guide, known as
templates (referenced in the pattern id in the Schematron snippet). HL7 has an OID
registry, available at http://www.hl7.org/oid/index.cfm, with more information about the design
and use of OIDs.
One thing to note about OIDs: there is something of a structure in that the
left-most number is considered the root and the right-most number the leaf node on
the tree. OIDs are assigned to organizations at a particular sub-tree level, and how
that organization chooses to arrange its sub-tree depends on that organization. It
may choose to have a logical structure for its OIDs, or not.
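The root-to-leaf structure can be made concrete with a small Python sketch. The function names oid_arcs and is_under are hypothetical; the two OIDs are HL7's own sub-tree root and the code system OID that appears in the procedure and observation examples.

```python
def oid_arcs(oid):
    """Split a dotted OID into integer arcs, root first and leaf last."""
    arcs = oid.split(".")
    if not all(a.isascii() and a.isdigit() for a in arcs):
        raise ValueError(f"not a dotted-decimal OID: {oid!r}")
    return [int(a) for a in arcs]

def is_under(oid, subtree_root):
    """True if oid lies strictly below subtree_root in the OID tree."""
    arcs, root = oid_arcs(oid), oid_arcs(subtree_root)
    return len(arcs) > len(root) and arcs[:len(root)] == root

HL7_ROOT = "2.16.840.1.113883"          # sub-tree assigned to HL7
CODE_SYSTEM = "2.16.840.1.113883.6.96"  # code system OID used in the examples

print(is_under(CODE_SYSTEM, HL7_ROOT))
```

Comparing arcs as integers rather than comparing the strings avoids a false match between, say, 2.16.840.1.1138 and 2.16.840.1.113883. The check says nothing about what the arcs below the root mean, since that arrangement is up to the owning organization.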
Constraints, Value Sets, Alternatives and Conditionals
In any healthcare record there are rules about which information must be present.
In CDA convention these are represented as constraints on the CDA model. The
constraints have a formal prose representation that is published in a document called
an Implementation Guide because it defines an implementation of CDA.
For example, an observation representing an
5. SHALL contain [1..1] code (CONF:11542).
a. This code SHALL contain [1..1] @code, which SHALL be selected from ValueSet
2.16.840.1.114126.96.36.19991 NHSNAdverseReactionTypeCode DYNAMIC (CONF:4698).
We generate this prose representation from a database. A constraint is associated with
a context (observation) and is recorded in data such as "conformance verb" (SHALL), "value",
"value conformance", and "value set".
A value set is a set of coded concepts, drawn from one or more public vocabularies,
that are appropriate for the context. In the example above, the value set members are types
of adverse reaction. The concepts in the previous example, showing patient status,
are members of a value set named ASAClassCode.
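A value-set membership test against a custom vocabulary file might look like the following Python sketch. The wrapper element valueSet and its name attribute are assumptions made for illustration, as are the function names; only the code entries follow the format shown earlier, closed off here so that the fragment is well-formed.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper: "valueSet" and its "name" attribute are assumptions.
# The <code> entries follow the format of the custom vocabulary file.
VOCAB = """
<valueSet name="ASAClassCode">
  <code value="413495001" displayName="ASA physical status class 1"/>
  <code value="413496000" displayName="ASA physical status class 2"/>
</valueSet>
"""

def load_value_set(xml_text):
    """Map each code in the value set to its display name."""
    root = ET.fromstring(xml_text)
    return {c.get("value"): c.get("displayName") for c in root.iter("code")}

def in_value_set(code, members):
    """True if the code is a member of the loaded value set."""
    return code in members

members = load_value_set(VOCAB)
print(in_value_set("413495001", members), in_value_set("000000000", members))
```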
Constraints that express alternatives are common in some implementations of CDA.
One necessary usage is to require that
a code element contain either (a) both the code and
codeSystem attributes OR (b) a nullFlavor attribute.
Value-driven conditional rules arise for specific content situations;
for example, if the procedure being recorded was a cesarean (the relevant
the report must also specify the estimated maternal blood loss.
Why Schematron validation is Needed to Supplement Schema Validation
As we've seen, much of the meaning of a CDA document resides in an element's attribute
values, which are used to
refine the meaning of those elements (rather than merely to describe the
object, as in
expand the varieties of relationship beyond what’s available from the XML
vary the verb mood,
switch the subject and object of a compound expression,
explain the absence of a value.
These tools can build remarkably complex sentences. “Marie’s grandmother, who is her
legal guardian, said Marie had pneumonia when she was six, which went untreated and is a
possible explanation for the scarring on her lungs; however, Marie’s mother denied this
and her father was unsure.”
In any healthcare record there are report-specific rules about which data must be
present. Since so much of the content in CDA is recorded in attribute values, these
rules amount to value dependencies, which are not adequately expressible in W3C Schema
validation. Some of the report-specific rules could be tested with a custom W3C Schema,
but not all, and, in practice, many of the most important report-dependent rules cannot
be checked by even a custom W3C Schema.
The two main problem areas for validation are alternatives and value-dependent conditionals.
As we saw above, one commonly-used construct in CDA is to require that a code element
contain either the code and codeSystem attributes (with optional displayName and
codeSystemName attributes), OR a nullFlavor attribute. The CDA Schema allows all the relevant attributes to appear on the code element, in any
combination. As a result, a valid document instance might populate the code attribute without
the codeSystem attribute, or populate both the code and nullFlavor attributes. Both combinations are inherently meaningless, but the CDA Schema
can’t check for them.
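The alternatives rule can be stated procedurally. The following Python sketch is only an illustration of the logic, not the Schematron used in practice; the function name check_code_alternatives and the message wording are hypothetical, and it flags just the two meaningless combinations described above.

```python
import xml.etree.ElementTree as ET

def check_code_alternatives(code_el):
    """Return a list of problems; an empty list means the rule is satisfied."""
    has_code = code_el.get("code") is not None
    has_system = code_el.get("codeSystem") is not None
    has_null = code_el.get("nullFlavor") is not None
    problems = []
    if has_null and has_code:
        # A code and a reason for having no code contradict each other.
        problems.append("code and nullFlavor must not both be populated")
    elif not has_null and not (has_code and has_system):
        # Without nullFlavor, a code is meaningless unless its code system is named.
        problems.append("need both code and codeSystem, or a nullFlavor")
    return problems

# Schema-valid but meaningless: code without codeSystem.
print(check_code_alternatives(ET.fromstring('<code code="423827005"/>')))
```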
The conditional rules that arise for specific content situations can be expressed as
if [some XPath] then [some other XPath]
If procedure/code/@code="1234" (a specific type of procedure), then performer/id must be present.
Since the type of procedure is recorded as an attribute value, even
a custom XML Schema can’t check this requirement.
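The if/then shape of such a rule is easy to see in procedural form. The Python sketch below follows the illustrative paths given above (procedure/code/@code and performer/id) and assumes the CDA namespace; the function name check_performer_rule is hypothetical, and "1234" is the placeholder code from the example rather than a real procedure code.

```python
import xml.etree.ElementTree as ET

NS = {"cda": "urn:hl7-org:v3"}

def check_performer_rule(procedure, trigger_code="1234"):
    """If procedure/code/@code equals trigger_code, performer/id must be present."""
    code = procedure.find("cda:code", NS)
    if code is None or code.get("code") != trigger_code:
        return True  # the condition is not met, so the rule does not apply
    return procedure.find("cda:performer/cda:id", NS) is not None

# Made-up fragment: the trigger code is present but no performer is recorded.
FRAGMENT = '<procedure xmlns="urn:hl7-org:v3"><code code="1234"/></procedure>'
print(check_performer_rule(ET.fromstring(FRAGMENT)))
```

The test depends on an attribute value selecting which content model applies, which is exactly the dependency that W3C Schema 1.0 has no way to express.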
Schematron covers the gap. We use a two-step validation approach: first against the
CDA XML Schema file (CDA.xsd), and then against a Schematron file and custom vocabulary
file that tests the rules that cannot be expressed in the
CDA.xsd. Currently we are using Schematron 1.5 and XPath 1.0, for compatibility and historical
reasons; we are gradually moving to ISO Schematron and thence to XPath 2.0.