
Preliminary Proceedings

Automatically Denormalizing Document Relationships

Will Thompson

Chief Software Engineer

O'Connor's

Balisage: The Markup Conference 2017
August 1 - 4, 2017

Copyright © 2017 by Will Thompson.

How to cite this paper

Thompson, Will. “Automatically Denormalizing Document Relationships.” Presented at Balisage: The Markup Conference 2017, Washington, DC, August 1 - 4, 2017. In Proceedings of Balisage: The Markup Conference 2017. Balisage Series on Markup Technologies, vol. 19 (2017). doi:10.4242/BalisageVol19.Thompson01.

Abstract

Native XML databases provide no exception to the problem that data may not be easily contained by any single data storage idiom. Many-to-many relationships, in particular, present a unique problem for documents, as strategies for joining across documents are a potential minefield of software maintenance and performance problems. Automatic denormalization shifts the responsibility for managing relationships to write-time, making an explicit trade-off for simplicity and speed at runtime. This paper discusses existing strategies for managing relationships across documents and explores design patterns, use cases, and trade-offs for performing automatic denormalization.

Table of Contents

Document databases
Many-to-many relationships
Recommendation 1: Don’t do that
Recommendation 2: Keys and Joins
Incomplete views
Runaway code
Proposed solution: automatic denormalization
Relationship create
Related document update
Delete
Read
Caveats
Extensions and Optimizations
Conclusion

Document databases

The relational model was conceived “primarily as a tool to free users from the frustrations of having to deal with the clutter of [database] storage representation details” [Codd1979], and its normalized representational form was intended to ensure stored data “would be free from [the] insertion, update, and deletion anomalies” [Codd1990] that result in inconsistent data, while also providing a more flexible data model from which developers could perform ad-hoc queries. In addition to the system design and maintenance benefits from abstracting a data storage and query layer, constraints in storage hardware capacity, speed, and reliability made the relational model a prudent choice for system designers, a choice that has persisted for decades.

The overarching constraint of the relational model, however, is that while it is ideal for highly structured and naturally relational data, hierarchical or document-like data must be deconstructed from its natural form, and additional work must be done to insert and retrieve these data. But hierarchies exist everywhere: in print documents, web pages, emails, message board threads, and Facebook posts, as well as in abstract and entity data models. Documents are a natural structure for modeling these forms of data, and document databases have emerged as a native means to store and query in document idioms. XQuery databases, in particular, allow a developer to store a document as XML (and sometimes JSON) and, using XQuery, XPath, and XSLT, insert, transform, and query documents with the expressiveness of natively designed languages that provide insight into the rich semantics of the underlying data.

One common problem facing developers, however, is that real-world data do not typically fit neatly into either a relational or a document model, so they are forced to design with some degree of impedance mismatch.

Most relational databases offer some level of support for XML, though typically with very limited XQuery and XPath features based on older versions or working drafts [1][2] and with limited indexing functionality [3][4]. In cases where data is highly structured and only minimally document-like, a relational database can still be a good bet if those parts don’t require complex querying.

In cases where much of the data would be ideally modeled using documents, the idiomatic environment provided by XQuery databases is a better bet. However, documents are not always self-contained, and they frequently describe relationships - books, articles, and web pages link to other books, articles, and web pages; entities reference other entities; and so on. Conventional methods of handling joins across related documents using XQuery and XPath language constructs may be sufficient in some cases; however, systems grow in complexity over time as they store larger volumes of data, add support for varieties of data types, and handle more sophisticated modes of querying and visualizing data, like search grammars and facets [5]. Here we will examine methods for dealing with document relationships, problems associated with joins, and one technique for coping with them as an application and its data scale.

Many-to-many relationships

One-to-one and one-to-many relationships are both hierarchical and are thus isomorphic to the XML document object model. These structures can be modeled and self-contained within a single XML document [Williams2000]. For a many-to-many relationship, using XML vocabulary, a child element would need to have multiple parents, which is not possible in the XML document model.

For example, a database may need to store a model for attorney and client entities. Most properties of each are hierarchical, but naturally an attorney will have multiple clients, and a client may hire multiple attorneys. How is this difference reconciled in a system whose data model is designed for self-contained documents?

Recommendation 1: Don’t do that

The most common advice on modeling these types of relationships in XML is to simply avoid them. There are a few ways to go about that. The first and simplest is to discard the data about that relationship [Williams2002]. That doesn’t work out very well in our example. Client documents would have no information about the attorneys representing them and the same for attorney documents with respect to their clients. In some limited cases, this option may be feasible, but generally only when there is a real-world many-to-many relationship that is not important to model in the database.

However, some data is just inherently relational, and if the application requires that it be modeled in XML, another potential solution is to reduce the real-world many-to-many relationship to a one-to-many relationship in XML by scoping the documents to represent only one side of the relationship [Williams2002]. Using the above example, this would require modeling either the attorney or client as a document root and the other as a child element. However, if the element-assigned entity is related to more than one document-assigned entity, this will result in duplicate records of the element-assigned entity.

  <attorney id="123">
      <client id="abc"></client>
      <client id=”xyz”></client>
  </attorney>
   
  <attorney id="456">
      <client id="def"></client>
      <client id=”abc”></client>
    </attorney>
      

In practice, this approach may satisfy the requirements for some applications, but generally only if the document-assigned entity is the only one queried directly (because querying the child entity would require deduplication), and if the child-assigned entity is never updated (because updates would require special handling for finding and updating its duplicates). Even when this can work, the trade-off is risky for any application whose requirements are not fixed indefinitely, since going outside of those constraints would entail complex workarounds or possibly rewriting much of the application back-end.

Recommendation 2: Keys and Joins

The next approach is the most common and amounts to essentially giving up on the document-centric model in favor of a relational model. In this model, both entities are designed as independent documents and are related using a two-way ID and IDREF convention, imitating the primary-key and foreign-key design of the relational model [Williams2000] [Williams2002] [Walmsley2015].

  <attorney id="123">
      ...    
      <attorney-client idref="abc"/>
      <attorney-client idref="xqy"/>    
  </attorney>
   
  <client id="abc">
      ...
      <client-attorney idref="123"/>    
  </client>
    
  <client id="xyz">    
      ...
      <client-attorney idref="123"/>    
  </client> 
      

This design is intuitive for anyone familiar with relational models, and it solves the above problem of duplicate canonical entities in the model. However, it introduces several new problems as a result of the normalized data, all of which stem from the fact that data about an entity and data about its related entities are now in separate documents.

Note

id and idref attributes are typically associated with the xs:ID and xs:IDREF types from XML Schema and the corresponding XPath functions fn:id() and fn:idref(), but those are all scoped within a single document. In the context of database joins, those types and functions are not very useful, but the attribute names have been used in these examples for clarity.

Incomplete views

The core problem with referenced documents is that accessing a complete view requires processing that document and building its missing content dynamically using joins. For each document we must find its references, get the referenced documents from the database, and insert them into the expanded view.
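
For instance, a view-building function along the following lines (a sketch only; the element names follow the attorney/client example above, and local:expand-view is a hypothetical helper) issues one additional database read per reference before the view is complete:

  (: Sketch: build a complete view of a document by expanding each ID/IDREF
     reference with a read of the referenced document. :)
  declare function local:expand-view($doc as element()) as element()
  {
    element { node-name($doc) } {
      $doc/@*,
      $doc/node(),
      for $ref in $doc/(attorney-client | client-attorney)
      return //(attorney | client)[@id = $ref/@idref]
    }
  };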

In the context of result sets, this effect is multiplied, and the overhead for generating complete views on a set of documents is no longer simply a factor of the number of documents selected by a query, but of that number plus the number of referenced documents in the result set. Performance-sensitive applications will need to ensure that a small result set of documents containing many references does not result in an explosion of reads. In some cases, that will not be necessary - or it can be mitigated. For example, in a page of query results, the application can choose not to expand references when rendering results, and only expand an individual document once it has been selected. But it is often the case that the application needs to address aspects of the expanded view in queries, which brings us to the next problem.

When using key-based relationships, search applications may be forced to support queries with constraints that join data across documents. For example, using the attorneys and clients model above, we may need to write or generate a query that selects “attorneys living on the west coast representing a client on the east coast”. Here we have a pair of related constraints, each in a different set of documents associated with ID/IDREF relationships. We now have to read both sets of documents filtered by their constraints and join them together at query time based on the criteria that they are related via ID and IDREF. Even a selective query with few results may, in the worst case, result in an explosion of document reads due to the fact that the selectivity of the query hinges only on the intersection of constraints, not any per-document constraint.

  let $states-west := ('CA', 'OR', 'WA')
  let $states-east := ('ME', 'NH', 'RI', ... , 'FL')
  let $attorneys-west := //attorney[state = $states-west]
  let $clients-east := //client[state = $states-east]
  return $attorneys-west[attorney-client/@idref = $clients-east/@id]
        

If either $attorneys-west or $clients-east selects a large number of documents, the overhead to perform this query will be high. It might be improved by selecting the east coast clients first, using their ids to add an additional constraint to the attorneys, increasing selectivity:

  let $clients-east := //client[state = $states-east]
  return //attorney[attorney-client/@idref = $clients-east/@id][state = $states-west]
        

However, it is possible that $clients-east still leads to overhead - or that by serializing the queries, we prevent the query engine from executing them in parallel. Ultimately, the best optimizations may be data dependent, resulting in queries that are difficult to write and maintain, especially as the content of the database changes.

Another common application requirement is displaying search facets, wherein aggregated data about a query’s full set of results are displayed alongside a subset of sorted results displayed as snippets. Conventionally, search snippets are hyperlinked to a more complete view of the item in the paged set, and facets are hyperlinked to a query that returns a new result set, constrained by the property and value indicated by the facet.

Figure 1: Search results

image ../../../vol19/graphics/Thompson01/Thompson01-001.svg

Most XQuery databases provide configurable indexes for efficiently querying specified path or element values that can be used to facilitate fast faceted search calculations. It is not possible, however, to create an index for a document using values from a referenced document, so indexing does not provide a shortcut around manual joins. Like the first example, calculating facets for documents outside the result set requires joining across both sets of documents using XQuery:

  let $client-results := //client[(: implementation-specific full text constraint :)]
  let $client-attorney-refs := $client-results/client-attorney/@idref
  return subsequence(
      for $attorney in //attorney[@id = $client-attorney-refs]
      let $count := count($client-results[client-attorney/@idref = $attorney/@id])
      order by $count descending
      return <attorney-facet value="{ $attorney/full-name }" count="{ $count }" />,
      1, 5)
        

Database implementations differ in their recommended indexing conventions and query patterns for these types of use cases, and those conventions can be unintuitive[6][7]; building a solution to support joins generically within an application can be tricky and still lead to variable or disappointing performance. Armed with the implementation-recommended indexing and query patterns, it is possible to minimize the impact of joins, but depending on the database implementation and application, it may still be necessary to put hard limits on the size of result sets or drop some of the harder-to-calculate information from the aggregate queries to prevent high-latency queries.

Runaway code

One issue that may be less obvious at the outset, but can be insidious and have a powerful impact, is the fragmenting effect on code that occurs when a new paradigm is introduced into an application, along with the maintenance problems that result from adding complexity and managing it as the application scales. At a high level, combining self-contained document models with relational-like data models brings together contrasting concepts for reading and writing information within an application. As outlined above, solutions built to handle querying across related documents will introduce coding patterns that diverge from those of fully coherent documents.

Software developers typically follow design principles and utilize coding patterns as they work, in many cases religiously. Despite the fact that many principles overlap or conflict, and various trends come and go over time, they all seem to work toward the same goal: preventing code complexity from ballooning out of control. Whether or not any particular set of maintenance strategies is more effective than another is a topic of debate (and unfortunately very little research), but experience has taught the industry as a whole that some strategy is necessary to continuously reduce complexity through the implementation of abstraction, modularization, limiting dependencies, and code reuse.

Applications take on expanded requirements over time, and it is critical for maintainers to carefully examine their designs and refactor as code is updated to implement new features. The types of code differences introduced in examples above to support the diverging document concept can affect nearly every aspect of code maintenance. Even in applications that are not performance critical, avoiding the introduction of these types of queries and runtime dependencies can result in code that is simpler to reason about, organize, and maintain as it scales.

Proposed solution: automatic denormalization

Automatic denormalization is not a new idea. The solution presented here is a type of precomputation similar to indexed views or materialized views in relational databases. In relational databases, the precomputed views are typically implemented with the goal of improving read performance. In this instance, while performance is a goal, it is not the primary goal. This solution takes a brutally practical approach and accepts the inherent performance trade-offs with eyes open.

The problems outlined above demonstrate the difficulty of integrating documents that contain many-to-many relationships alongside documents whose relationships are self-contained. We can take significant steps to resolve this problem by denormalizing the many-to-many relationships as they are ingested into the database. Instead of storing references to canonical documents using IDREFs, the referencing document will store a complete copy of the referenced document. In a way, documents still resemble the relational-like document model, but with references swapped for copies of the referenced data.
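
For example, the attorney document from Recommendation 2 might carry complete copies of its related clients under a container element (the ref:copies wrapper shown here matches the naming used in later examples; the exact markup is a design choice, not a requirement):

  <attorney id="123">
      ...
      <ref:copies>
          <client id="abc">
              <name> ... </name>
              <address> ... </address>
          </client>
          <client id="xyz">
              <name> ... </name>
              <address> ... </address>
          </client>
      </ref:copies>
  </attorney>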

The denormalization process can be broken down into two essential subprocesses: one for creating the initial many-to-many relationship between two documents and another for updating any document that is part of a relationship.

Note

XQuery update syntax differs substantially across database implementations, so the following examples are presented using figures only.

Relationship create

Creating a denormalized many-to-many relationship follows steps similar to the normalized two-way IDREF case, and in both cases, both documents must be updated. In the normalized create, only an element denoting a reference to the other document is added; in the denormalized create, the contents of the referenced document are used wholesale in place of the reference. A new requirement is that a denormalized create must read-on-update to exchange content. In some databases (depending on configuration), lock contention will not differ when reading a document that is updated in the same transaction[?], so the added resource cost of denormalizing this operation may be minimal.

Figure 2: Create denormalized relationship

image ../../../vol19/graphics/Thompson01/Thompson01-002.svg
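
As a rough sketch only (the note above applies: update syntax varies by implementation), the create operation might look like the following, written in XQuery Update Facility syntax. The ref namespace URI and the local:copy-entity helper are illustrative assumptions, and, as in Figure 2, neither document is assumed to have existing relationships.

  declare namespace ref = "http://example.com/ns/ref";  (: illustrative URI :)

  (: Sketch: add a stripped copy of each document to the other side of the
     relationship. local:copy-entity is sketched under Figure 3. :)
  declare updating function local:relate(
    $attorney as element(attorney),
    $client   as element(client)
  ) {
    insert node <ref:copies>{ local:copy-entity($client) }</ref:copies> into $attorney,
    insert node <ref:copies>{ local:copy-entity($attorney) }</ref:copies> into $client
  };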

This example creates a relationship on documents with no existing relationships. However, when relating or updating documents with relationships, a copied entity stored in a <ref:copy> should not also include its references/copies. From the perspective of the containing entity, the relationships within a copied entity are either circular, or they will reference documents that do not have references back, leaving no way to update them consistently when the canonical document is updated. Therefore, an entity copy transformation should be used to remove existing relationships on any dereferenced copy.

Figure 3: Entity copy transformation

image ../../../vol19/graphics/Thompson01/Thompson01-003.svg
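
A minimal sketch of such a transformation, assuming the same illustrative ref namespace as above, is a recursive identity copy that simply omits any ref:copies content:

  (: Copy an entity recursively, dropping existing ref:copies so the copy
     carries no circular or one-way references with it. :)
  declare function local:copy-entity($node as node()) as node()?
  {
    typeswitch($node)
    case element(ref:copies) return ()
    case element() return
      element { node-name($node) } {
        $node/@*,
        for $child in $node/node()
        return local:copy-entity($child)
      }
    default return $node
  };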

Related document update

Now that references are replaced with denormalized copies of the canonical entity’s data, updating a document requires updating every document on the other side of the relationship as well. Where the normalized model required expanding every reference to a canonical document in order to read a complete document, the denormalized model requires, to maintain consistency, that any document update propagate the new information to its denormalized copies in related documents.

Figure 4: Updating a related document

image ../../../vol19/graphics/Thompson01/Thompson01-004.svg
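
Continuing the sketch above (and assuming, for illustration, that a copy can be located by matching the canonical entity’s id inside a ref:copies container), the cascading update might look like this:

  (: Sketch: after the canonical document has been updated, refresh every
     denormalized copy of it held by related documents. :)
  declare updating function local:propagate($updated as element())
  {
    for $copy in //ref:copies/*[@id = $updated/@id]
    return replace node $copy with local:copy-entity($updated)
  };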

This step is the essence of the denormalization process, and it deliberately defies one of the original stated design goals of the relational model and normal form (to free relations from insert, update, and delete dependencies). The primary reason behind that goal was the expectation that managing dependencies would quickly become complex or impractical (or a problem writing to storage hardware would occur), leading to “update anomalies galore” [Codd1990], and inconsistencies across related data would accumulate throughout the database.

The automatic denormalization process proposed here can be implemented in a single component capable of handling any update operation. As a practical matter, maintenance of the component to support changes to the data model is simple, and in many cases (depending on database) the component can be written to handle generic references, such that code changes would only be required if the denormalization schema itself changes. BaseX, eXist-db, and MarkLogic all provide methods to apply cascading updates atomically, in a single transaction, eliminating a class of possible undesirable side effects from managing redundant data, one of the motivating factors behind normalization.

Because the number of relationships per document will typically not grow relative to the size of the database, avoiding joins across large portions of the database can also be a worthwhile performance trade-off for cascading document updates. However, the comparison is not apples-to-apples, since writing to a database is a more resource intensive operation than reading from it. Also, the details of lock handling across database implementations differ (and some are configurable), so performance testing for a specific implementation and configuration is appropriate if the likely trade-off is not clear from an application’s requirements or usage statistics.

Delete

Denormalized delete follows the same pattern as update: when a document is deleted, iterate over its references and remove the copy in any referenced document. Deleting a relationship follows the inverse of the create pattern.
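
Under the same assumptions as the update sketch, a denormalized delete might look like this:

  (: Sketch: remove the canonical document and every denormalized copy of it. :)
  declare updating function local:delete-entity($doc as element())
  {
    delete node //ref:copies/*[@id = $doc/@id],
    delete node $doc
  };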

Read

Now that the work has been done on the create/update/delete side to ensure consistent denormalized relationships, documents have been restored to an idealized structure. Rendering documents with relationships no longer has to go through a runtime expansion process to produce a complete view, and it is possible to query related documents without joining. The additional consistency across queries and rendering components should provide more opportunities for code reuse and less complexity to maintain overall.

The pathological query from above (“return any attorney working on the west coast representing a client on the east coast”) is now simplified and is structured more optimally for evaluation:

  //attorney[state = $states-west][ref:copies/client/state = $states-east]

The facet-calculating query example is similarly simplified without the need for joining, and it can benefit even further in MarkLogic and eXist: because those databases provide functions to access index values directly, indexing the path /client/ref:copies/attorney/full-name makes it possible to calculate facets on its values more efficiently.
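
For comparison with the join-based facet example above, a plain XQuery version over the denormalized copies (without implementation-specific index functions, and grouping by full-name purely for illustration) might look like this:

  let $client-results := //client[(: implementation-specific full text constraint :)]
  return subsequence(
      for $name in distinct-values($client-results/ref:copies/attorney/full-name)
      let $count := count($client-results[ref:copies/attorney/full-name = $name])
      order by $count descending
      return <attorney-facet value="{ $name }" count="{ $count }"/>,
      1, 5)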

Without the need to join data across documents, we avoid nebulous and often unsuccessful optimizations, and with related documents matching the same self-contained structure as other documents, we avoid the precarious path of integrating disparate data models into our application.

Caveats

Relationship-heavy workloads and/or update-heavy workloads

The resource overhead of this technique scales with both the frequency of updates to documents with relationships and the number of relationships per document. One must also consider the overlap of relationships across documents, as concurrent updates to multiple documents with relationships to a single document will all contend for the ability to lock and write to that document during cascading updates.

Only recommended for two-way joins

This technique is best confined to relationships between two entity types; chains of many-to-many relationships compound the update problem. For example, a LawFirm entity may have a many-to-many relationship with an Attorney entity, which also has a many-to-many relationship with a Client entity.

Figure 5: 3-way many-to-many relationship

image ../../../vol19/graphics/Thompson01/Thompson01-005.svg

Using only the technique described here, it is not possible to get denormalized data about Clients from LawFirms and vice versa. It is possible to modify the entity copy transformation to retain references, such that an entity’s references and its references’ references are part of the cascading update. But this is dangerous behavior that may result in combinatorial explosions of updates, and it can easily become impractical for many use cases.

Extensions and Optimizations

Entities and sub-entities

In real-world models, it is often unrealistic to insist that relationships occur only between document roots. Instead, it is more idiomatic to consider a document as an entity that may contain descendant sub-entities recursively, such that any entity or sub-entity can be related to any other entity or sub-entity.

Figure 6: Sub-entities

image ../../../vol19/graphics/Thompson01/Thompson01-006.svg

Additionally, it may be helpful when generating a copy of a sub-entity to also include data about that entity’s parent or ancestors. This pattern can be managed through modifications to the entity copy transform (used above only to remove references) and to update behavior. When an entity is copied, process the parent or ancestors recursively, and when an entity is updated, trigger updates on its child or descendant sub-entities. The structure of the denormalized ancestor markup should be determined by the needs of the application. In some cases, the application may benefit from maintaining the nested structure of the related document in its copy and annotating the directly-related entity:

    <client id="abc">
      <name>O.J. Simpson</name>
      <address> ... </address>
      <client-instance id="1357" id-client="abc">
        <name> ... </name>
        <address> ... </address>
        <ref:copies>
          <attorney> 
            <name>Johnny Cochran</name>
            ...
            <ref:context>
              <matter>
                <name>People v. O. J. Simpson</name>
                <judge>Lance Ito</judge>
                <dt-arraignment>1994-06-20T09:00:00-07:00</dt-arraignment>
              </matter>
            </ref:context>
          </attorney>      
        </ref:copies>
      </client-instance>
    </client>
          

In other cases, it may be more convenient to break the hierarchy into siblings and annotate the entities:

    <ref:copies>
      <ref:copy>
        <ref:context>
          <matter>
            <name>People v. O. J. Simpson</name>
            <judge>Lance Ito</judge>
            <dt-arraignment>1994-06-20T09:00:00-07:00</dt-arraignment>
          </matter>
        </ref:context>
        <ref:parent>
           <attorney> 
             <name>Johnny Cochran</name>
             ...
           </attorney>
        </ref:parent>
      </ref:copy>
    </ref:copies>
          

The annotation schema is not prescriptive and should be adjusted to fit the needs of the application, so other variations may be appropriate. For example, relationship annotation elements could be children of entities instead of parents, or they could be attributes on the copied element.

Overlapping hierarchies

Sometimes a model may not call for a many-to-many relationship, but it may have naturally overlapping one-to-many relationships. Overlapping data structures are not supported by the XML model directly, but they can be approximated using relationships.

Figure 7: Overlapping documents

image ../../../vol19/graphics/Thompson01/Thompson01-007.svg

In this scenario, one of the trees can be modeled as a “base” document, and overlapping trees can be modeled using many-to-many relationships between the shared entity of the base document and the parent entity of the overlapping document. Using automatic denormalization, overlapping hierarchies can be stored as separate and completely coherent documents.
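
A sketch of that arrangement, reusing the entities from the running example purely for illustration (the actual entity types, and which tree serves as the base, are modeling decisions):

    <!-- base document: the shared entity lives here -->
    <matter id="m1">
      <name>People v. O. J. Simpson</name>
      <attorney id="123">
        <name>Johnny Cochran</name>
        <ref:copies>
          <law-firm id="f1"> ... </law-firm>
        </ref:copies>
      </attorney>
    </matter>

    <!-- overlapping document: its parent entity holds a copy of the shared entity -->
    <law-firm id="f1">
      <name> ... </name>
      <ref:copies>
        <attorney id="123">
          <name>Johnny Cochran</name>
        </attorney>
      </ref:copies>
    </law-firm>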

Sparse entity copy transformations

If, within the context of a particular type of document, it is known ahead of time that full copies of referenced entities are unnecessary, then the entity copy transformation can be extended to remove other parts of the copy. Depending on the application, it may be appropriate to create rules based on the type of entity copied, the type of entity containing the copy, or a combination of the two.

  declare function ref:make-copy($source as element(), $target as element()) as element()
  {
    typeswitch($source)
    (: Rules based on copied entity :)
    case element(attorney) | element(client) return
      $source  (: placeholder for From: Attorney or Client + To: Anything rules :)
    case element(matter) return
      (: Combination rules :)
      typeswitch($target)
      case element(client) return
        $source  (: placeholder for From: Matter + To: Client rules :)
      default return
        $source  (: placeholder for From: Matter + To: Not-Client rules :)
    default return
      (: Rules based on containing entity :)
      typeswitch($target)
      case element(attorney) return $source  (: placeholder :)
      default return $source  (: placeholder :)
  };
          
Nearline reference updates (allow dirty reads)

If the number of relationships and/or frequency of updates per document are high, denormalization may be hard to justify. However, in some cases it may not be important for updated data to be consistent in all related documents immediately (within the same transaction). In these instances, one may implement a queue for processing updates, such that when a document is updated, cascading updates to related documents are only added to the queue to be processed later. This method allows the canonical document update to commit quickly, while the related data becomes “eventually consistent” and provides an opportunity for the developer to tune the queue processing to stay within a particular range of resource utilization.
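
One simple realization is to write a small queue entry document in the same transaction as the canonical update and let a scheduled task drain the queue; the entry format below is purely illustrative:

    <ref:update-queue-entry status="pending">
      <canonical-id>123</canonical-id>
      <!-- a scheduled task later refreshes every ref:copies/*[@id = '123']
           from the current canonical document, outside the original transaction -->
    </ref:update-queue-entry>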

Conclusion

NoSQL databases have emerged to serve an increasingly common need: handling larger, more complex, and more varied data structures without elaborate workarounds or recourse to non-database storage systems. Initially, many NoSQL databases were hyped as replacements for relational databases, but as the hype cycle dies down and these systems mature, it seems more likely that the future holds not one, but many types of databases suited to handle many types of data storage needs.

Many aspects of software development can be presented as a set of tradeoffs, and judgment is necessary to make good bets that will affect the speed of development, application performance, and maintainability. Even taking an optimistic view of the solutions available from the variety of new data storage models on the horizon, applications grow and change over time, and it is still unlikely that any one model will meet an application’s requirements perfectly over its lifetime. Finding the most reasonable tradeoffs when handling the things that do not fit our ideals, as always, can be one of the most critical aspects of software development.

The techniques described here provide a relatively simple set of patterns that offer reasonable tradeoffs for a common set of problems facing XQuery database developers. By explicitly creating and encapsulating the management of one set of dependencies - specifically, dependencies the relational model intended to eliminate outright - it’s possible to eliminate a different set of runtime dependencies that can have a more troubling influence on the development and maintenance of an application throughout its lifecycle.

References

[Codd1979] Codd, E. F. "Extending the Database Relational Model to Capture More Meaning." ACM Transactions on Database Systems 4.4 (1979): 397-434. Print.

[Codd1990] Codd, Edgar F. The Relational Model for Database Management: Version 2. Reading, Mass.: Addison-Wesley, 1990. Print.

[Williams2000] Williams, Kevin, Michael Brundage, and Patrick Dengler. Professional XML Databases. Birmingham: Wrox, 2000. N. pag. Print.

[Williams2002] Williams, Kevin. "XML for Data: Modeling Many-to-many Relationships." IBM DeveloperWorks. IBM, 1 Jan. 2002. Web. 12 May 2017. https://www.ibm.com/developerworks/library/x-xdm2m.html.

[Walmsley2015] Walmsley, Priscilla. XQuery: Search Across a Variety of XML Data. 2nd ed. N.p.: O'Reilly Media, 2015. Print.



[1] See "XQuery Language Reference (SQL Server)," Microsoft Docs, https://docs.microsoft.com/en-us/sql/xquery/xquery-language-reference-sql-server.

[2] See "SQL*Plus® User's Guide and Reference," ORACLE Help Center, http://docs.oracle.com/database/122/SQPUG/XQUERY.htm.

[4] See "8.13. XML Type." PostgreSQL documentation, https://www.postgresql.org/docs/9.6/static/datatype-xml.html.

[5] See "Faceted Search," Wikipedia, https://en.wikipedia.org/wiki/Faceted_search.

[6] See "Tuning the Database," EXist-db Documentation , http://exist-db.org/exist/apps/doc/tuning.xml.

[7] See "Query Performance and Tuning Guide," MarkLogic 8 Product Documentation, https://docs.marklogic.com/8.0/guide/performance.

Author's keywords for this paper: Document databases; Relationship modeling; Performance

Will Thompson

Chief Software Engineer

O'Connor's

Will Thompson leads software development on core technologies for O'Connor's Online, the web-based legal research platform that complements the company's expanding library of mostly home-grown (and mostly Texas-based) legal books. Will works on a wide array of software, from back-end editorial toolchains to customer-facing search engines.