How to cite this paper

Holstege, Mary. “Metaphors We Code By: Taking Things A Little Too Seriously.” Presented at Balisage: The Markup Conference 2018, Washington, DC, July 31 - August 3, 2018. In Proceedings of Balisage: The Markup Conference 2018. Balisage Series on Markup Technologies, vol. 21 (2018).

Balisage: The Markup Conference 2018
July 31 - August 3, 2018

Balisage Paper: Metaphors We Code By

Taking Things A Little Too Seriously

Mary Holstege

Distinguished Engineer

MarkLogic Corporation

Mary Holstege has been developing software in Silicon Valley for decades, in and around markup technologies and information extraction. She currently works at MarkLogic Corporation, where she mainly works on search. She holds a Ph.D. from Stanford University in Computer Science, for a thesis on document representation.

Copyright ©2018 Mary Holstege


Computer information and software are abstractions. We comprehend them through the use of metaphors. Different metaphors lead us to understand our information and our processing of it in different ways. They lead us to focus on certain aspects of the experience over other aspects.

This paper examines some metaphors we use to talk about markup. Being mindful of what our metaphors are telling us implicitly allows us to see what we are missing. By taking the metaphor a little too seriously, we can look to the non-metaphorical domain as a source of inspiration for good practices.

Table of Contents

Metaphors in Use: A Wee Analysis
Metaphorical Variations


Sure, software is metaphor — what else would it be?

— Michael Sperberg-McQueen

In the classic work Metaphors We Live By, George Lakoff and Mark Johnson pointed out the crucial role metaphor plays in organizing our everyday speech and our common understanding of the world. Choosing a metaphor is choosing to highlight certain aspects of experience and to downplay others. Examining those choices allows us to consider how we are choosing to structure how we see the world.

What are the metaphors we use to understand our handling of content in XML or other markup languages? The way we talk about these things puts the focus on certain issues and hides others. What do they tell us about what aspects of that content and its handling we put in the foreground and which we elide? The metaphors we choose to use bias us in certain ways. We can choose to be mindful about how we talk about what we're doing and circumvent that bias. That bias may have pragmatic consequences, but it can have ethical consequences too, when it pushes human actors from view. If we go further and take the metaphors seriously (perhaps a little too seriously), are there other lessons we can learn from the non-metaphorical version of those concepts?

Metaphors in Use: A Wee Analysis

To see what kinds of metaphors we use to talk about markup, I started by performing a quick analysis of the vocabulary used in the papers from the Balisage Series on Markup Technologies.

There are about two million words in this corpus and about forty thousand distinct words[1]. Filtering out stop words and ignoring case, the most frequent words are XML, document, element, data, model, language, markup, text, information, process, and content, so indeed this corpus concerns itself with markup, as advertised.

The corpus was loaded into a document database (MarkLogic) as HTML and then converted to XML using tidy. Despite the irony of back-converting something that was originally XML into XML, analyzing XHTML actually makes things easier, as the content is almost entirely wrapped in p elements.

Two methods were used to probe the popularity of particular metaphors. Seven different metaphors were probed, which we will look at in more detail later on: documents, trees, construction, paths, fluids, textiles, and music.

The first method tokenized the text nodes of every document in the corpus and gathered counts of each distinct word. A separate set of counts was gathered for each distinct stem of each word.

Figure 1: Method 1: Word Counts

xquery version "1.0-ml"; (: MarkLogic-specific extensions :)
declare namespace xh="http://www.w3.org/1999/xhtml";

(: Count every stem of every word token found in a p element. :)
let $counts := map:map()
let $_ :=
  for $word in 
    (for $p in //xh:p return cts:stem(cts:tokenize($p)[. instance of cts:word]))
  let $wordcount := map:get($counts,$word)
  return
    if (empty($wordcount))
    then map:put($counts,$word,1)
    else map:put($counts,$word,$wordcount + 1)
(: Emit one "word count" line per entry, most frequent first. :)
for $word in map:keys($counts)
let $wordcount := map:get($counts,$word)
order by $wordcount descending
return ($word||" "||$wordcount||"&#10;")

Stemming and tokenizing every word of every text node, and collecting sorted counts.

An awk script then summed up the counts for vocabulary items related to the specific metaphor.

Figure 2: Method 1: Selecting and Summing


BEGIN { IGNORECASE = 1 }   # case-insensitive matching (gawk extension)

# Input lines have the form "word count": $1 is the word, $2 its count.
/document|book|scroll|page|file|folder|archive|directory/ {
    document += $2;
    documentlist[$1] = $0;
}

/tree|leaf|root|branch|trunk|wood|forest|jungle/ {
    tree += $2;
    treelist[$1] = $0;
}

/construct|build|built|architect|carpenter|carpentry|brick|concrete/ {
    construction += $2;
    constructionlist[$1] = $0;
}

/road|track|navigate|navigation|path|journey|steps|movement|climb|walk/ {
    path += $2;
    pathlist[$1] = $0;
}

/flow|pipeline|stream|slide|volume|pour|plumbing|leak|drip|drop|fluid/ {
    fluid += $2;
    fluidlist[$1] = $0;
}

/weave|textile|thread|woven|cloth|fabric|clothes|clothing|knit/ {
    textile += $2;
    textilelist[$1] = $0;
}

/song|melody|tune|music|notes|harmony|counterpoint|chord|rhythm/ {
    music += $2;
    musiclist[$1] = $0;
}

# Print the total for each metaphor, followed by the matching lines.
END {
    print "document="document;
    for (w in documentlist) print documentlist[w];
    print "================================="
    print "tree="tree;
    for (w in treelist) print treelist[w];
    print "================================="
    print "construction="construction;
    for (w in constructionlist) print constructionlist[w];
    print "================================="
    print "path="path;
    for (w in pathlist) print pathlist[w];
    print "================================="
    print "fluid="fluid;
    for (w in fluidlist) print fluidlist[w];
    print "================================="
    print "textile="textile;
    for (w in textilelist) print textilelist[w];
    print "================================="
    print "music="music;
    for (w in musiclist) print musiclist[w];
    print "================================="
}

Selecting (case-insensitively) from the word list, and summing up.

This method is quick and easy but a little crude: plenty of usages of these terms have little to do with the metaphor in question. Still, it gives a broad view of overall patterns.
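One source of that crudeness is easy to demonstrate. The awk patterns in Figure 2 are unanchored, so they match anywhere inside a word. The Python sketch below is not part of the original analysis; it reuses the tree vocabulary from Figure 2, but the word counts are invented purely for illustration. It contrasts substring matching with whole-word matching.

```python
import re

# Vocabulary for the tree metaphor, copied from the awk rule in Figure 2.
TREE_TERMS = ["tree", "leaf", "root", "branch", "trunk", "wood", "forest", "jungle"]

# Unanchored alternation: behaves like the awk pattern /tree|leaf|.../.
substring_pattern = re.compile("|".join(TREE_TERMS), re.IGNORECASE)
# Anchored alternation: the whole token must be one of the terms.
whole_word_pattern = re.compile(r"^(?:" + "|".join(TREE_TERMS) + r")$", re.IGNORECASE)

def sum_counts(pattern, counts):
    """Sum the counts of every word that the pattern matches."""
    return sum(n for word, n in counts.items() if pattern.search(word))

# Hypothetical word counts, invented for this example only.
sample_counts = {"tree": 10, "street": 3, "Root": 4, "Woodward": 2, "element": 50}

loose = sum_counts(substring_pattern, sample_counts)
strict = sum_counts(whole_word_pattern, sample_counts)
print(loose, strict)  # 19 14
```

Here "street" and "Woodward" inflate the loose total. Whole-word matching trades one error for another, since it would miss inflected forms such as "trees" unless it is applied to stemmed counts.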

The second method made use of more complex full-text search capabilities, allowing the probes to exclude some of the usages that were irrelevant to the metaphor being probed. This method also made it possible to select the sample utterances shown in the sections below. It counts the number of paragraphs in which the terms occur rather than the number of distinct instances of each term. It turns out that this doesn't much change the relative or absolute distributions.

Figure 3: Method 2: Full-text Search

xquery version "1.0-ml"; (: MarkLogic extensions :)
declare namespace xh="http://www.w3.org/1999/xhtml";

declare variable $DOCUMENT := 
  ("document NOT_IN documentation", "book", "scroll",
    "page NOT_IN p:bibliomixed",
    "file", "folder", "archive", "directory");

declare variable $TREE :=
  ("tree", "leaf", "root", "branch", "trunk NOT_IN a:trunk", 
   "wood NOT_IN a:wood", "forest", "jungle");

declare variable $CONSTRUCTION :=
  ("construct", "construction", "build", "builder", 
   "architect", "architecture",
   "carpentry", "carpenter NOT_IN Carpenter[unstemmed]",
   "brick NOT_IN Brick[unstemmed]", "concrete");

declare variable $PATH :=
  ("road", "track NOT_IN (tracking[unstemmed] OR tracked[unstemmed])",
   "navigate", "navigation", "path", "journey", "steps[unstemmed]", 
   "movement", "climb", "walk");

declare variable $FLUID :=
  ("flow", "pipeline", "stream", "slide", "volume NOT_IN p:bibliomixed",
   "pour", "plumbing", "leak", "drip", "drop", "fluid");

declare variable $TEXTILE :=
  ("weave", "thread", "woven", "cloth", "fabric", "clothes", 
   "clothing", "knit");

declare variable $MUSIC :=
  ("song", "melody", "tune", "notes", "harmony", "counterpoint",
   "chord", "rhythm");

(: Custom bindings for the p: and a: prefixes used in the probes above. :)
declare variable $BINDINGS :=
  map:new((
    map:entry("p", function($op,$val,$opts) {
      (: assumed: the query constructor is not shown in the source listing :)
      cts:element-attribute-value-query(
        xs:QName("xh:p"), xs:QName("class"), $val, $opts)
    }),
    map:entry("a", function($op,$val,$opts) {
      cts:element-query(xs:QName("xh:a"), $val, $opts)
    })
  ));

declare variable $SAMPLE_SIZE := 5;

(: Count the paragraphs matching one probe and pick a few at random. :)
declare function local:sample($subquery as xs:string)
{
  let $query := cts:parse($subquery,$BINDINGS)
  let $results := cts:search(//xh:p, $query, "score-random")
  let $count := count($results)
  let $sample :=
    (for $res in $results
     let $rand := xdmp:random($count idiv $SAMPLE_SIZE)
     where $rand <= 1
     return $res)[1 to $SAMPLE_SIZE]
  let $highlighted :=
    for $s in $sample return cts:highlight($s,$query,<xh:b>{$cts:text}</xh:b>)
  return <subquery><q>{$subquery}</q><count>{$count}</count><sample>{$highlighted}</sample></subquery>
};

(: Run every probe for one metaphor and total the paragraph counts. :)
declare function local:analyze($label as xs:string, $subqueries as xs:string*)
{
  let $results :=
    for $q in $subqueries return local:sample($q)
  let $count := sum($results/count)
  return (
    "&#10;===="||$label||" "||$count, <sample>{$results/sample/*}</sample>
  )
};

local:analyze("DOCUMENT", $DOCUMENT),
local:analyze("TREE", $TREE),
local:analyze("CONSTRUCTION", $CONSTRUCTION),
local:analyze("PATH", $PATH),
local:analyze("FLUID", $FLUID),
local:analyze("TEXTILE", $TEXTILE),
local:analyze("MUSIC", $MUSIC) 

Searching for matching utterances, and selecting samples.

Each of the following sections discusses a particular metaphor. Sample utterances illustrate the application of the metaphor. They all come from this very conference series.

Metaphorical Variations


Figure 4: Document utterances[2]

As illustrated by the excerpt from Clapton's Wikipedia page below...

The horizontal slider at the top of the screen provides an alternate way to quickly scroll through the stages

When a block of digital files are transferred from an organisation to the Archives it is a requirement that a CSV file is provided with the data containing some metadata about each file or folder.

Such features are provided as functions, defined in a separate recommendation detailing a standard library.

Digital folder - A digital folder (also known as a directory) is a computer cataloguing structure that can contain files and/or more digital folders.

Some files in the system are, in effect, transient and so do not require archiving.

The document metaphor dominates our understanding of computer data. An astonishing one percent of all the words used in the corpus relate to this metaphor. The other metaphors hover around one or two permille at best. We store documents on the file system in directories, we archive them, or convert the information into records. We page and scroll through them and create bookmarks. Our software is put into libraries, our data is indexed, our file systems are organized into volumes. Our user interfaces show us little pictures of books and folders and animations of pages flying by as we copy data from one slice of spinning magnetic domains to another.

What is the story we tell ourselves about our data with this metaphor?

Data is authored and divided into useful and coherent units. It is finite in extent, and scaled to something a human could read in a reasonable amount of time: minutes, an hour or so, perhaps a day. It is organized and categorized: it is comprehensible. It contains human text. Data is visual: it is something to be read. Data is developed, and then done. Books are written. A new edition may come out, but it is a new edition. It is also the natural state of books to be read and shared.

This is not to say all these properties are true of any particular collection of data. Rather, these are the properties we inchoately assume and understand when we tell ourselves that data is documents, even when that data is in no way an electronic representation of a physical document.

The metaphor steers us towards thinking about certain things, and steers us away from thinking about others. When a social media feed is understood through the lens of documents, we think about editorial discretion, we think that it is something people produce. We think about human scales and timeframes. We don't think about mobs of bots generating posting swarms. A different metaphor, around pest-control or epidemiology, would better serve us in that case.

What would it mean to take the document metaphor seriously? It is so pervasive, one could hardly take this metaphor more seriously, it would seem.

Still, there are documents and there are documents. The Book of Kells is not your weekly shopping list. One is a treasured artifact, carefully protected, where modern copies are valued at tens of thousands of dollars. The other lies crumpled in your wastebasket, or rotting in the lint trap of your dryer. A document is an impermanent artifact whose inks fade and whose pages brittle with age and are eaten by beetle and moth larvae. To take the document metaphor more seriously would be to seriously consider the impermanence of things and the need for active curation and preservation and which documents need our care and which should be thrown out. This is hard to do when all documents look alike in our electronic spaces.

Make no mistake, electronic documents also need active curation and preservation. Large swathes of data gathered during the Apollo missions, at the cost of much treasure and some lives, rotted on tapes that became obsolete, then unreadable, and then were lost.

When you think about your data through the document metaphor, what should you think about to counter the implicit bias of that metaphor? What are the blind spots?

Human authorship

Does your application/interface/process still work when data is machine-generated?

Human scale in time and size

What happens if your data consists of large amounts of very small machine-generated documents, or very large or endless ones? What happens if the rate of generation exceeds human scales?

Mostly words, maybe a few images

What happens if your data is mostly markup instead of words?

Data does not change, it is added to incrementally

Can you handle change? Rapid and continuing change?

Data is for sharing

Is yours?

Data is generally important and permanent

Can you tell what is important and what is transient? What is your strategy for keeping the important stuff and removing the junk?

The danger is decay

How are you curating and preserving your data? Flipping the script: do you have a process for removing junk?

For example, think of the assumptions built into a classic search engine: indexing happens in an incremental and lazy fashion, on timescales of hours after data is changed. Search engines in the 80s and 90s assumed it was OK to completely rebuild the search index when the set of documents changed. These are all assumptions based on a "data is documents" way of thinking.
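The difference between rebuilding an index and updating it can be sketched in a few lines of Python. This is a toy inverted index, not a description of how any particular search engine works; the class, its method names, and the sample documents are all hypothetical.

```python
from collections import defaultdict

class TinyIndex:
    """Toy inverted index: maps each word to the ids of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)   # word -> set of document ids
        self.docs = {}                     # document id -> indexed text

    def rebuild(self, docs):
        """Batch style: discard everything and re-index every document."""
        self.postings.clear()
        self.docs = {}
        for doc_id, text in docs.items():
            self.update(doc_id, text)

    def update(self, doc_id, text):
        """Incremental style: touch only the postings of one document."""
        self.remove(doc_id)
        self.docs[doc_id] = text
        for word in set(text.lower().split()):
            self.postings[word].add(doc_id)

    def remove(self, doc_id):
        """Drop one document's postings, deleting words that become unused."""
        old = self.docs.pop(doc_id, None)
        if old is None:
            return
        for word in set(old.lower().split()):
            self.postings[word].discard(doc_id)
            if not self.postings[word]:
                del self.postings[word]

    def search(self, word):
        return sorted(self.postings.get(word.lower(), set()))

index = TinyIndex()
index.rebuild({"a": "trees and forests", "b": "paths through forests"})
index.update("a", "trees and roots")   # document "a" changes
print(index.search("forests"))  # ['b']
print(index.search("roots"))    # ['a']
```

A full rebuild costs time proportional to the whole collection, which is tolerable only if documents change rarely. The per-document update is what a system needs once change is rapid and continuing.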


Figure 5: Tree utterances

The nodes of the result tree are formatting objects.

We create a tree-structured document list in which only the leaf elements...

I don't want to have XML documents with anything as a root element!

Sometimes we need to focus on the trees, or the leaves on the trees.

Sometimes we need to focus on the forest.

The amazing capability of navigating trees and forests of information presupposes that those trees have been constructed completely and before the navigation starts.

The metric combines the number of branches in choice model groups...

Some repositories implemented the concept of a “project root” with three subdirectories trunk, branches , and tags.

The metaphor of the tree is largely a metaphor based on a similarity of structure, albeit one typically drawn tilted on its head, so that roots are in the air and branches lie below. It derives from a long history in computer science data structures and a metonymous application of the data structure for the data.

The story we tell when we use this metaphor is one of structure: there is one starting point, the root, from which branches diverge all the way to the leaves. The branches never come back together. There is a unique starting point. Leaves differ in some fundamental way from branches and branch points. Overlap doesn't happen.

Again, it doesn't mean all of this is true about our data; only that this is what we know in our bones when we speak of our data in this way.

Let's push the metaphor a bit, and think about trees, real living trees. We should start by considering the full tree, not just what we see above ground. Trees spread their roots wide and deep: their true shape is not a widening triangle, but two triangles meeting: the visible and the invisible. The trees do not live in isolation but, as Peter Wohlleben teaches us in The Hidden Life of Trees, in communities and families with mycorrhizal networks linking their roots together.

What, if anything, is the metaphorical application of this invisible triangle of roots? Roots whisper to us of unseen riches nourishing the whole. What could that tell us about our data? Here, perhaps, we have metadata which has its own hierarchy of organization all feeding into the visible base of the tree and which links one piece of data to other pieces, one tree to another. To take that part of the metaphor on board is to intuit that metadata is as complex and extensive as data, not just a few simple tags.

Trees, real trees, do not live alone, but in woodlands and forests. What kind of forest do we understand this to be? To think about different kinds of communities of trees is to have a different understanding of what our job is in processing that data.

A woodland with a single kind of tree, all the same? A farm of trees, carefully grown to spec? Our data processing task is one of managing the production in a careful way, fending off pests, and harvesting the results at the appropriate time. There is uniformity, and few surprises. One document is nothing special, just a cash crop. Sensor records are trees of this kind.

A wild and untamed jungle with trees of many shapes and kinds and a profusion of growth? Our data processing task is one of beating back unchecked abundance, hacking with machete, burning with fire. The jungle will always come back: our job is to keep a little of it under control, for a time. Data is understood as something that happens to us, not something we make. Social media data, perhaps? Maybe we should consider scorching a few acres from time to time to clear it out. Notice, however, that the jungle metaphor is a way of hiding human responsibility: the jungle is self-creating, so no one is responsible for perpetuating or preventing any harms. Saying Twitter is a jungle is saying that the company has no responsibility for what happens there.

A woodland filled with old and twisted trees that have accumulated centuries of accidents and contingencies to create their own unique shapes? Our data processing job here is preservation and understanding the unique needs of each individual. Each tree is old and precious. Digital archives are woodlands of this type, the Staverton Thicks of markup, with layers of annotations from generations of scholars.

Perhaps we have a well-managed woodland, with coppices or pollards and livestock roaming the understory? Most of the "wild" forests of Europe were woodlands of this sort. Our data processing job here is to be the woodcutter, venturing into the wood to cut beech and oak to the ground, to gather the wood and encourage growth for the future. Or perhaps we are the swineherd, setting the pigs loose to root through the humus and keep down the underbrush. Data is something that we manage, collect, and put to use and something that grows somewhat of its own accord. Business records live in this kind of forest.

Forests, unlike isolated trees, are a little scary. In the wonderful book Gossip from the Forest, Sara Maitland points out that the forests of our fairytales and our imaginings are places where, above all else, you get lost. We may hope to find a friendly woodcutter, but we're probably on our own and the wolves are tracking us. We can lose ourselves in our data forests too, if we're not careful. They extend far into the hills and it's dark in there. Best keep an eye out for the wolves.

How can we respond to the biases of the tree or forest metaphor? Even if your markup formalism doesn't admit non-tree-structured data, thinking about the ways in which your data is not logically a tree may lead you to add mechanisms to represent that.

Leaves, branches, trunks, and the root are not alike

Look at those element-only content models, or those purely structural elements with no attributes. Do mixed content and attributes really have no place in them? Is there really one unique starting point?

Branches never intersect

What relationships are there between different leaves and branches? How can you express that?

One hierarchy

Are there alternative hierarchies? Do you have overlap? What is the "tree of roots" for you? Metadata? What does that look like and how does it connect to the data?

Data is a living thing

Does your data really grow and change on its own? How can it be defined and planned? Think about the agency and actions of those working with your data.

Tree and forest

How does one piece of data (tree) stand in relation to others? Think about what those relationships are and how to represent them. Are your pieces of data really separate things at all? What are the dividing lines? Think about matters of scale.

Forests, jungles, woodlands

What kind of forest are you imagining? Imagine a different kind of forest, which is to say different kinds of dynamics, different kinds of scales, different kinds of requirements around management. How does your data need to be tended? Who is in control?

Getting lost

What kind of navigational aids or organizational system might you need, especially as things grow? Does what you have work when you have 10, 100, or 1000 times as much? Think about what needs to be pruned, and when, and by whom.

For example, when we think about our data as trees, cross-cutting relationships or overlaps are weird and exceptional. When you model the same information as RDF, notions of hierarchy, overlap, or parallel representation disappear entirely. Modeling data with a mix of XML and RDF can allow for a richer view. What is the appropriate metaphor for that?
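To make the contrast concrete, here is a small Python sketch of the same facts held once as a tree and once as a flat set of subject-predicate-object triples, in the spirit of RDF. No RDF library is involved: plain tuples stand in for triples, and the document, the ids, and the predicate names (hasPart, commentsOn) are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Tree view: the hierarchy is built into the representation itself.
doc = ET.fromstring(
    "<book id='b1'><chapter id='c1'><para id='p1'/><para id='p2'/></chapter></book>"
)

def tree_to_triples(elem):
    """Flatten parent/child nesting into (subject, predicate, object) tuples."""
    triples = set()
    for child in elem:
        triples.add((elem.get("id"), "hasPart", child.get("id")))
        triples |= tree_to_triples(child)
    return triples

triples = tree_to_triples(doc)

# Graph view: a cross-cutting relationship is just one more triple,
# no different in kind from the containment triples.
triples.add(("p2", "commentsOn", "p1"))

def objects(subject, predicate):
    """All objects related to the subject by the predicate, sorted."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)

parts = objects("c1", "hasPart")
comments = objects("p2", "commentsOn")
print(parts)     # ['p1', 'p2']
print(comments)  # ['p1']
```

In the triple set, containment has no privileged status and document order is gone unless it is stated explicitly. That is one sense in which hierarchy disappears, and one reason a mix of the two views can be richer than either alone.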


Figure 6: Building utterances

XML inherited and worsened SGML's legalistic tendencies, promoting a world of markup built to industrial standards.

The closest I came to visualizing how the result tree is built, was a slider-driven, graphical tree builder with a visual block representing each node.

This permitted document creators to conform to a general document architecture without having to constrain their own documents to every detail of a specific schema.

The solution...appears to be in deploying a mix of element types, some of which are hard in the sense just described, and some of which are pliable or even soft, to serve as a kind of spackle or carpenter's putty...

While we may have calmed down a bit from the heroic architect model of Howard Roark in The Fountainhead, markup language creators still expect to be able to lay things out as plans and have them faithfully executed by others who will live up to our specifications.

The metaphor of data as a building tells us a story of intention and artifice. Data is created with deliberation and a design for a particular human purpose. It is robust, solid, durable. It is singular: a thing, not some stuff. It is regular in form, not some wild organic wolf-infested thing. It is made, not grown. It doesn't happen to you, you happen to it. Data is a tangible thing, something to be grasped.

To be a builder is to execute a plan. There are blueprints from the architect. There is a project to be managed, coordinating the work of carpenters and roofers, electricians and plumbers. There are building codes to be followed and inspections to pass.

To take the metaphor more seriously would be to seriously consider what the building codes are for our projects as well. Serious data processing projects do have building codes: schemas and standards and rules. The building inspector is the validator.

Buildings may be used by many, but we don't think of them as shared by many in the same way a book is shared. Books circulate. Buildings stand. We visit them, and then leave again.

Like trees, buildings do not exist purely in isolation. They are grouped into towns and cities. Like trees, there are underground linkages: the shared infrastructure. What is the shared infrastructure for our data constructs? There are differences in scale in various assemblages that affect interactions and usage. A village has different security and privacy considerations than a megalopolis. What is the scale of your building project?

As there are different kinds of forests, there are different kinds of buildings and building projects. Banging together a tree house is not the same as putting up a cathedral.

What kind of construction is your project?

The prefab shed in the back yard can get put up by a couple of guys in an afternoon. The hard work was done back in the factory, in designing the separate pieces so they are easy to fit together. Data processing has been streamlined for us, and we just need to use some basic tools and follow a simple plan. The data processing work for the factory is to make a plan and pieces that can be put together simply. Success depends on tightly constraining what is possible, limiting the tools needed, and the kinds of interactions pieces can have. In the data realm, I think this is a goal more often aspired to than accomplished, but it is crucial for the empowerment of non-technical folks.

What one sees more often is the dry stone wall of data: when executed by a master, beautiful, solid, effective, each component fit together just so; when executed by a novice, a heap of stones that sort of does the job. Maintenance by someone other than the master is hopeless, and likely to cause more damage than repair. The lesson of wall-building in the real world is that standardization of parts and methods — bricks and mortar, so to say — makes wall-building accessible to folks other than master craftsmen. Not as strong, not as efficient, not as beautiful, but more widely possible. What are our markup bricks? HTML elements, perhaps.

One also sees skyscraper projects, where the construction of the support machinery is an undertaking in its own right. Making the stylesheets to generate the stylesheets to make the XML. The mistake we sometimes make is not to plan the support infrastructure properly, or to fail to ensure that it too follows building codes. The crane comes crashing to the ground, to the great dismay of all.

And we have our cathedrals: long-term projects consuming generations of graduate students to complete, complete with their gargoyles and stained glass, and the extra buttressing on the west wall, just to be sure.

The building metaphor centers human agency and planning; countering its bias is to think about ways in which things might happen outside of human intentions.

Data is intentional, built by humans

Does your application/interface/process still work when the data is machine-generated?

There is a blueprint

What is it? How do you ensure that it is being followed? What kinds of inspections are there? Performed how? What if you don't have a blueprint? Do you need one? Are there some parts that could be unplanned and more fluid?

There are building codes

What are the standards and how are they being applied? Are there standards? What if there are not?

Data is robust, solid, durable

What are you doing to ensure that is the case? How are you curating, preserving, and protecting it? What if it isn't? Distinguish the transient from the permanent.

Data is unchanging

What if it isn't? What happens if change is frequent?

There is a border/door/wall

Is there? How is access to the data managed? What security is there on this door? Does data need to be walled off at all? Are there really a limited number of points of entry into the data? What are they?

Data is owned

By whom? How is this managed?

Sheds, cathedrals, skyscrapers

What kind of building are you imagining? Imagine a different kind of building, with different robustness requirements, different use patterns, different kinds of coordination requirements. What are the different kinds of roles required in making this artifact? What are the infrastructural supports required?


Think about the scale of the collection of buildings. How would things change if it were much larger or smaller than you imagine? How do the separate pieces interrelate? Are they really separate?

There is shared subterranean infrastructure

What kind of shared infrastructure do you want to rely on? Public linked data, for example? How is it funded and maintained?

For example, you may have some nicely constructed documents, and still want to bring a folksonomy to the party to help with user-centered organization, even though it is unplanned and ever-changing.


Figure 7: Path utterances

The most common approach used by databases to deal with searching and navigating through large datasets is an index, trading space for time.

NIEM uses redirection and references that on the surface makes the data model hard to understand and navigate.

These are combined into the concept of a navigation path, which enables item selection in a very elegant, concise and yet readable way.

The tests ( Streamability of Axis Steps) are a little more complex and involve six cases and a tabular form, relating context posture and the axis of travel.

This is different from navigation of an object tree, which is a movement from single item to single item.

But there are many other navigation aids such as table of contents, hyperlinks, breadcrumbs,...

The story we tell ourselves with the metaphor of the path is that we are heading somewhere: there is a starting point and a destination, and if we keep moving forwards we can get there. Data is a space to be navigated. It isn't made. It doesn't grow. It just is. Our job is to find or make the path that gets us to our journey's end.

The thing about paths, too, is that they are shared. Once the trail is blazed, others may follow. So the metaphor of the path tells a story about reuse. It may also tell a story about return: There and back again. Making a path is about marking a path: markup as signposts. "This way to Llanelli." "This way to footnote 5."

To speak of paths is to focus on what you do with the data, not on the data itself.

Data as a place also tells us a story about enclosure and boundaries: my data, your data, their data. The data is bounded, and owned. You might need fences or (fire)walls to protect it.

When data is a place, you fear invasion. There are those who belong in that place and those who don't. Locking up books is censorship. Locking your front door is prudence. There can be real-world policy consequences about how we talk about our data.

The path metaphor puts the focus on interactions with data rather than on the data itself. It may be helpful to remember that shaping the data also shapes the possible navigations through it.

Data is neither made nor grown

Think about human agency in creating data. Think about change.

Data has a boundary

Does it? Does the data have relationships or interactions with data outside this boundary? How is access to the data managed? What security is there on this boundary? Does data need to be walled off at all?

Data doesn't change

What if it does? Will the paths and navigation markers still work properly?

There is a starting point and an ending point

Is there only one starting point? One place to start accessing the data? What other entry points might be useful?

Trails are blazed, and reused

How are navigations marked to begin with? Are there milestones indicating stable landmarks? How stable are they?

For example, where data may change, references that point to particular places in a document through a path must be stable for the paths to be reused. If ids are used to anchor places in the path, the ids must be stable. If tumblers count children, the number of children cannot change. There are plenty of document production systems that rely on ids on anchors to point to places in the TOC, where those ids are randomly generated, and therefore not consistent when the documents are edited. Unfortunate.
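A minimal sketch of the stability problem, in Python. Nothing here comes from a particular production system: the function names, the slug-plus-hash id scheme, and the sample headings are all assumptions made for illustration. It compares three ways of pointing at the same section before and after an unrelated edit: a positional tumbler, a randomly generated id, and an id derived from the content being anchored.

```python
import hashlib
import re
import uuid


def random_ids(headings):
    """Assign a fresh random id to each heading (differs on every run)."""
    return {h: "id-" + uuid.uuid4().hex[:8] for h in headings}


def stable_ids(headings):
    """Derive each id from the heading text itself.

    Hypothetical scheme: a readable slug plus a short content hash.
    """
    ids = {}
    for h in headings:
        slug = re.sub(r"[^a-z0-9]+", "-", h.lower()).strip("-")
        digest = hashlib.sha256(h.encode("utf-8")).hexdigest()[:6]
        ids[h] = f"{slug}-{digest}"
    return ids


target = "Data as Path"
before = ["Introduction", "Data as Path", "Conclusion"]
# An edit inserts a new section; the existing headings are untouched.
after = ["Introduction", "Data as Fluid", "Data as Path", "Conclusion"]

# Tumbler: the child position of the target shifts when a sibling is added.
tumbler_survives = before.index(target) == after.index(target)

# Random ids: regenerating the document hands out a different id.
random_link_survives = random_ids(before)[target] == random_ids(after)[target]

# Content-derived ids: the same link still resolves.
stable_link_survives = stable_ids(before)[target] == stable_ids(after)[target]

print(tumbler_survives, random_link_survives, stable_link_survives)
```

The content-derived id is only as stable as the content it is derived from: rename the heading and the trail marker moves too. Which landmark counts as stable is a design decision, not something the id scheme can settle on its own.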


Figure 8: Fluid utterances

It is an error if the fo:footnote occurs as a descendant of a flow ...

pipelining invites computational efficiency by making it possible to stream the flow of information

The standard XProc steps can be divided roughly into three categories: those for which a streaming can always be achieved (e.g., p:identity )...

XML parsing and validation is widely regarded as a performance bottleneck in the processing of very large XML documents.

It meant that they would have to preprocess the data before they could pour it into their tool.

This paper is about converting huge volumes of Rich Text Format (RTF) legal commentary to XML.

Then, we let those tags slide down from their node...

This allows a programmer to focus on its implementation without worrying about the plumbing details.

...this could be because we've successfully leaked information in error messages...

The story of pipelines is a story of change. You never step in the same river twice. There is a sense of abundance and continuity, of inexorableness and endlessness. Your data is pouring through your channels, and you can direct it, guide it, pen it up for a time: but it keeps on flowing. Once you start it off, there's no holding it back. You can no more stop your data than Canute can stop the tide. It's a little frightening too. You might drown in it.

In the data as fluid metaphor, our job is the job of plumbers, to set up the valves and conduits so the flow follows its proper path. Data is tangible, but it is stuff, something to be put in a container. Water runs everywhere. It isn't so much shared as distributed.

The language of leaks and breaches is also a means of removing agency and abdicating responsibility. The data is somehow responsible for its own actions rather than some person.

When you imagine your data pipelines, what kind of fluid do you imagine it to be?

Is it something benign and harmless like water from start to finish? We are the builders of aqueducts and canals, spreading the life-giving water to the parched plain. Heroes! This is a job of distribution and sharing. To take it seriously is to consider and measure how we are partitioning the data, how much is coming in, how much is going out, where it is going, where it is not.

Or is this a sewage treatment plant, where you start off with something horrid, and end up with something sparkling and clean? Our job is one of continual refinement, adding ever finer filters and treatments to ensure a good result. To take this view seriously is to make sure we are sampling our outputs to know that nothing harmful has made it through.

Or is this more of a chemical plant kind of scenario, where there are complex and dangerous interactions that might explode and endanger the townspeople if they don't go right? Is this crude oil where a leak creates a toxic mess?

If our data fluid is a volatile or harmful chemical, we should import the lessons of chemical process control. Maybe we should identify and record the critical process parameters, the inputs to the process. Maybe we should identify and monitor the critical quality attributes, the metrics on how processing is proceeding that tell us what is happening and where it is starting to fail. Chemical plants are filled with complex interactions and non-linear effects, making them difficult to control and manage safely (Perrow99). Sometimes data behaves that way too: you didn't expect there to be more than one of those kinds of elements, and you didn't check, and now your whole process misbehaves. Best have gauges measuring what you need at key junctures.
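As a sketch of what such a gauge might look like, here is a small Python example using the standard library's ElementTree. The pipeline, the `count_gauge` helper, and the sample documents are hypothetical, invented for illustration. The downstream step quietly assumes there is exactly one title element; the gauge measures that assumption at the juncture just before the step depends on it.

```python
import xml.etree.ElementTree as ET


class GaugeError(ValueError):
    """Raised when a measured quality attribute is out of range."""


def count_gauge(doc, tag, minimum, maximum):
    """Count descendant <tag> elements and check the count is in range."""
    count = len(doc.findall(f".//{tag}"))
    if not (minimum <= count <= maximum):
        raise GaugeError(
            f"expected {minimum}..{maximum} <{tag}> elements, found {count}"
        )
    return count


def extract_title(doc):
    """A pipeline step that silently assumes exactly one <title>."""
    return doc.find(".//title").text


def pipeline(xml_text):
    doc = ET.fromstring(xml_text)
    # Gauge placed before the step that relies on the assumption.
    count_gauge(doc, "title", 1, 1)
    return extract_title(doc)


good = "<article><title>Metaphors</title><p>text</p></article>"
bad = "<article><title>One</title><title>Two</title></article>"

print(pipeline(good))
try:
    pipeline(bad)
except GaugeError as err:
    print("gauge tripped:", err)
```

Without the gauge, the second document would flow straight through and the step would return the first title without complaint, which is exactly the kind of quiet misbehaviour described above.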

Certainly if the data we are pushing through our plumbing is of a sensitive nature — health care or banking records, for example — the cost of a leak can be extensive. A double-walled hull for the transport may not be too much to ask. Or, in the lingo of security folks, security in depth.

Maybe we need to pay attention to the effluvium that is not the primary output, but that may be dangerous if it propagates into the environment. We all may wish Facebook had thought of their data releases in this way.

Thinking about different kinds of fluids makes us think about different kinds of pipeline steps. Cleaning is about filtering and narrowing flows: single inputs and single outputs. Distribution is about selection and divergent flows. Chemical processing is about convergent flows, refractionation, and feedbacks.

The data as fluid metaphor removes human agency from the picture; counter its bias by thinking about human responsibilities and actions. This metaphor also removes the identity of pieces of data as distinct artifacts from view; counter that bias by thinking about the ways in which different data needs different handling.

Data is amorphous and undifferentiated

Is all data the same? Is some data more static and durable? Can it all be freely mixed? Think about whether you need to know what pieces came from where. Don't lose sight of the fact that for some purposes individual identity still matters for legal or financial reasons. Think about context that makes individual pieces of data meaningful.

Data flows continually

Think about your critical quality attributes: what should you measure to ensure that the flows went well and that you know they went well and that you know how they went wrong? Think about starting and stopping. Think about what the data looks like when it isn't moving.

Data flows and pools of its own accord

Pay attention to agency and responsibility. Data does not do anything on its own. People are responsible for managing it, protecting it, controlling it. Who are they? What are their responsibilities?

Water, sewage, chemicals

If you imagine your data fluid to be harmless, think again. What if it weren't? What if it were dangerous? What if mixing it with other data made it more dangerous? How should you contain it, manage it? What processes should be in place to prevent accidents?

Flooding, drowning

What mechanisms will you put in place to prevent this? Navigational aids, rate limits, controls?

Sometimes embracing a metaphor can help us grasp something more firmly. For example, we can understand the risks of deanonymization through the metaphor of data as fluid: mixing two chemicals that aren't dangerous in themselves can create a dangerous situation.
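To make the mixing concrete, here is a toy sketch in Python. Every record below is invented for illustration, and the choice of quasi-identifiers (postal code, birth year, sex) is an assumption, not a claim about any real release. Neither data set pairs a name with a diagnosis; joining them on the shared fields does.

```python
# Invented toy records: a "de-identified" release without names ...
health_release = [
    {"zip": "02138", "birth_year": 1961, "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_year": 1975, "sex": "M", "diagnosis": "diabetes"},
]
# ... and a public roll without any sensitive fields.
public_roll = [
    {"name": "A. Example", "zip": "02138", "birth_year": 1961, "sex": "F"},
    {"name": "B. Sample", "zip": "02139", "birth_year": 1975, "sex": "M"},
    {"name": "C. Person", "zip": "02139", "birth_year": 1980, "sex": "M"},
]

# Assumed quasi-identifiers: fields present in both data sets.
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def key(record):
    """Project a record onto the quasi-identifier fields."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def mix(release, roll):
    """Join on quasi-identifiers, keeping only unambiguous matches."""
    names_by_key = {}
    for person in roll:
        names_by_key.setdefault(key(person), []).append(person["name"])
    linked = {}
    for row in release:
        names = names_by_key.get(key(row), [])
        if len(names) == 1:  # exactly one candidate: re-identified
            linked[names[0]] = row["diagnosis"]
    return linked


reidentified = mix(health_release, public_roll)
print(reidentified)
```

Taking the chemical metaphor seriously here means asking, before any release, what else is already out in the environment for this data to react with.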


Figure 9: Textile utterances

Furthermore, there exists software that will weave the same specification above into easily readable hyperlinked documentation.

A cybersecurity digital thread requires standardized languages, data formats, taxonomies, and metrics

knowledge is all sort of knitted together, or woven, like cloth, and each piece of knowledge is only meaningful or useful because of the other pieces.

I also think we are doing some new thinking, combining some old clothes into new outfits, knitting some new fabric

There are lots of XML-clad web APIs out there.

We must treat every new act of building as an opportunity to mend some rent in the existing cloth...

Textiles and software go way back: Hollerith punched cards are an evolution from Jacquard punched cards, after all. The words text and textile are cousins, coming from the Latin verb texere (weaving). Knitting patterns are programs in a domain-specific instruction set, designed for humans to execute.

Textiles tell a story of unity in complexity, of interacting and entangled structures. Data is made, with skill and artistry. Taking the sewing side of the metaphor over the weaving side: data is pieced together from specifically cut pieces, again, with skill and artistry, hiding the seams. Data is flexible, within limits. It is finite and put to some deeply human purpose. There is a plan. There are measurements. We are not afraid of our textile data. There is a sense of fragility: weavings can be unravelled or torn. Data is tangible, something to be draped, something to be worn. Soft and pliable, but not freely intermixed.

Textiles are holistic but singular. They don't blur and mix like fluids. What matters isn't individual stitches, but the pattern as a whole.

As with buildings, there are plans, and patterns of various levels of complexity. But textiles are more personal. You don't share your socks.

Textiles are flexible, but they are not ever-changing. Change is about mending and patching, not fundamentally reshaping what we have.

What do different kinds of textiles teach us?

Weaving teaches us that the interactions of strands create patterns of their own. Overlap not only happens, it is essential.

Knitting patterns teach us that short and simple generation instructions can produce complex results. It might be better to store the generator rather than the instance.

Embroidery speaks of annotations, adding layers of meaning.

White work teaches us that we can create complex information patterns by removing threads and creating voids.

Quilting teaches us that we can create mash-ups from disparate data sources to produce something new.
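The knitting lesson can be sketched in a few lines of Python. The stitch notation (K for knit, P for purl) and the `checkerboard` function are illustrative assumptions, not a standard pattern format. The generator is a handful of lines; the instance it produces can be as long as you care to knit.

```python
from itertools import islice


def checkerboard(width, block=2):
    """Yield rows of a knit/purl checkerboard, swapping every `block` rows."""
    unit = "K" * block + "P" * block
    even = (unit * (width // len(unit) + 1))[:width]
    odd = even.translate(str.maketrans("KP", "PK"))
    row_number = 0
    while True:
        yield even if (row_number // block) % 2 == 0 else odd
        row_number += 1


# Storing the generator costs a few lines; materialise an instance on demand.
instance = list(islice(checkerboard(8), 4))
for row in instance:
    print(row)
```

Storing the generator trades storage for computation, and it only works when the pattern really is regular: a hand-mended row has to be recorded as an exception to the rule.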

Like the building metaphor, the textile metaphor centers human agency and planning. To counter that bias, think about what happens outside human intentions. Like the document metaphor, the textile metaphor lends itself to assumptions about human scales in time and size, and thinking about inhuman scales is a useful counterpoint.

Data is intentional, crafted by humans

Does your application/interface/process still work when the data is machine-generated, operating at machine scales?

Data is complex, tangled, and holistic

Think about reuse and slicing and dicing. Think about what pieces make sense in isolation or recombination.

Human scale in time and size

What happens if your data consists of large amounts of very small machine-generated documents, or very large or endless ones? What happens if the rate of generation exceeds human scales?

Data is skillfully fabricated by hand

Do your systems/processes hold up when intuitions about human scale don't apply? How can non-skilled use be enabled?

Data wears out

What does it mean to patch up your data, to preserve it, to decide it is worn out junk to be thrown away?

Knitting, quilting, stitching patterns

Is it better to keep a generator for the data instead of the data? What are the constituent pieces to be sewn together?

Tearing or unravelling

The textile metaphor biases us against extraction, reassembly, and reuse. Think about how you could support those use cases.

RDF data is often conceptualized as a graph, but textile metaphors work well too. There are threads of different colours knotted and linked together to form a complex whole. The difficulties of multiple roots or overlaps or concurrency just don't come up: it is the pattern as a whole that matters.


Figure 10: Music utterances

...error handling and many other options used to tune XML processes.

That degree of harmony between logical and physical structure was not required in ISO 8879

...differences intentional[ly] designed into JSON specifically as a counterpoint to XML “complexity”

The musical metaphor gets hardly any traction: less than half a permille of this corpus, and most of those usages are about actual music, not metaphorical at all.

One place where music does get taken more seriously in the context of data is sonification. Sonification is the rendering of data as literal music. It works because humans are better at hearing subtle patterns in complexity than seeing them. Wordless information of various kinds (EKGs, gravitational waves, Geiger counters) has been effectively sonified. What would sonification of markup look like?

The story this metaphor could tell is one of creating a beautiful unity out of richness. Data is created as an act of cooperation and blending together disparate voices and instruments. Music carries emotion of all kinds. Like the fluid metaphor, there is continuity and things happening over time. Music itself is often understood through the metaphor of flowing water. Like the construction metaphor, this is a planned activity. There is a composer and score and a conductor. Like the textile metaphor this is an intimate, deeply human craft. Data is not visual to be read, not tangible to be grasped, but audible, to be heard. Change is encoded in the very structure of the music itself: present, but part of the pattern. Sharing too is intrinsic in the concept of music: but the sharing of experience. Music is ephemeral.

What would markup be like if we thought of it as scoring? We wouldn't think of overlap as a problem at all, but as our essential task. Our representations would be representations of parallel tracks, our words arranged in time rather than space.

The data as music metaphor can itself be a useful alternative viewpoint to more common metaphors. What if we liked our data? What if it were a participatory activity, not an artifact? What if the goal of security weren't keeping secrets but preserving the usefulness and integrity of the performance? The musical metaphor might encourage us to consider the emotional impact of data and structure our systems accordingly.

Data is beautiful

How does the structure of your data impact its esthetics? How will that impact how humans are able to interact with it?

Data is participatory

Who is participating? Who is excluded? How can we make participation more inviting?

Data is performed

Who is the performance for? Who is performing? What is the scope of the performance in time and space?

Data is ephemeral

Not everything needs to be saved beyond the moment it was created.


When multiple streams come together, will they actually work together? Are the cadences compatible?


When talking about our data and its processing, our choice of metaphor directs our attention towards (and away from) certain aspects of reality. In particular, it selects a particular view of complexity and of change. It also selects an emotional and moral stance towards that data.

With the building metaphor we attend to the processing of data and its regularity. Data is intentional, controlled, a singular artifact. We think about plans and checks. We also think about stability and permanence. Buildings, once formed, generally do not change easily.

The path metaphor, by contrast, is all about the process. The data is in the background, a given, something for us to traverse, something to put boundary markers around. We pay attention to how one part connects to another. Change is not in the data, but in what we do with it.

The fluid metaphor evokes a frisson of fear: there is abundance, but a danger of over-abundance too. Data moves and must be guided and controlled. Attention is diverted from human agency: the data has agency of its own. Depending on what kind of fluid we imagine we have, we may attend more or less carefully to how we handle it. Change is expected and constant: everything may be reshaped from moment to moment.

The textile metaphor, by contrast, evokes feelings of cozy intimacy. Data isn't scary, although it can be very complex. Still, it is fabricated according to a plan. We have a sense of regularity and order. Change can happen, but generally in the context of repair.

When we think of trees, we think of hierarchy and branching. Cross-cutting tangles are exceptions. Yet once we start talking about forests, complexity and thoughts of the organic creep in. And a little fear. Change is about growth: addition at the leaves. Once again, attention shifts from human agency to agency of the data.

It is a pity the musical metaphor is not more popular: for here is the e pluribus unum of data, an embrace of complexity, and here is beauty, and here is joy.

In summary, each of these metaphors has a different story to tell about our data, and our relation to it:

Data is...  Sharing?                          Fears            Change            Patterns
Document    Passed around                     Rot              Written once      Linear, paged
Trees       Commons                           Being lost       Growth            Branching, untangled
Building    Used by many, owned by one        Collapse         Made, then solid  Repetitive, planned
Place       Traversed by many, owned by one   Invasion         Activity          Background
Fluid       Distributed widely                Leaks, drowning  Constant          Chaos
Textile     Personal                          Tearing          Flexible          Intertwined
Music       Experienced and created together  Disharmony       Intrinsic         Complex, multidimensional

There is a pragmatic aspect to being more mindful of the language we use to talk about what we do. Thinking about our data and the systems that surround it in particular ways biases us to thinking about certain issues and ignoring others: being more mindful of those biases allows us to temper them. Mindfully choosing to view things through the lens of a different metaphor allows us to consider a fuller view of our data and the processes and systems that operate on it.

Ethics come into play here as well. When data leaks or there is a data breach, some human did something they shouldn't have done to cause that to happen, or failed to take an action they should have to prevent it. But the language of leaking talks about data as a fluid that just oozed somewhere of its own accord: it blames the data for its own condition and abdicates human responsibility. Similarly, talking in terms of jungles and organic metaphors of unchecked growth again diverts our attention from human agency and control, and blames data for its own misuse. On the other hand, when we talk in cozy terms of weaving and knitting, we are biased to think of the data in ways that prevent us from thinking about the dangers of use or misuse. When we talk of documents and libraries, we are inclined to think of books being shared freely, and neglect issues of privacy.

The metaphors we code by are another tool. They can be wielded with purpose.


[Balisage] Balisage Series on Markup Technologies. ISSN 1947-2609. Accessed 2018-03-28.

[Maitland12] Sara Maitland. Gossip from the Forest. Granta Books, 2012.

[Lakoff80] George Lakoff and Mark Johnson. Metaphors We Live By. University of Chicago Press, 1980.

[Perrow99] Charles Perrow. Normal Accidents. Princeton University Press, 1999.

[Wohllenben16] Peter Wohlleben, translation by Jane Billinghurst. The Hidden Life of Trees. Greystone Books, 2016.

[1] Numbers are rough here, because it depends on what you count as the same: Distinct lexical forms? Distinct stems? Distinct senses? I counted both distinct lexical forms and distinct stems, and the numbers are in the same ballpark.

[2] All emphasis mine.

