Holstege, Mary. “Metaphors We Code By: Taking Things A Little Too Seriously.” Presented at Balisage: The Markup Conference 2018, Washington, DC, July 31 - August 3, 2018. In Proceedings of Balisage: The Markup Conference 2018. Balisage Series on Markup Technologies, vol. 21 (2018). https://doi.org/10.4242/BalisageVol21.Holstege01.
Mary Holstege has been developing software in Silicon Valley for decades, in and around
markup technologies and information extraction. She currently works at MarkLogic Corporation,
where she mainly works on search. She holds a Ph.D. from Stanford University in Computer
Science, for a thesis on document representation.
Computer information and software are abstractions. We comprehend them through the
use of metaphors. Different metaphors lead us to understand our information and our
processing of it in different ways. They lead us to focus on certain aspects of the
experience over other aspects.
This paper examines some metaphors we use to talk about markup. Being mindful of what
our metaphors are telling us implicitly allows us to see what we are missing. By taking
the metaphor a little too seriously, we can look to the non-metaphorical domain as
a source of inspiration for good practices.
Sure, software is metaphor — what else would it be?
— Michael Sperberg-McQueen
In the classic work Metaphors We Live By, George Lakoff and Mark Johnson pointed out the crucial role metaphor plays in organizing
our everyday speech and our common understanding of the world. Choosing a metaphor
is choosing to highlight certain aspects of experience and to downplay others. Examining
those choices allows us to consider how we are choosing to structure how we see the world.
What are the metaphors we use to understand our handling of content in XML or some
other markup languages? The way we talk about these things puts the focus on certain
issues and hides others. What do they tell us about what aspects of that content and
its handling we put in the foreground and which we elide? The metaphors we choose
to use bias us in certain ways. We can choose to be mindful about how we talk about
what we're doing and circumvent that bias. That bias may have pragmatic consequences,
but it can have ethical consequences too, when it pushes human actors from view. If
we go further, and take the metaphors seriously (perhaps a little too seriously) are
there other lessons we can learn from the non-metaphorical version of those concepts?
There are about two million words in this corpus and about forty thousand distinct
words. Filtering out stop words, the most frequent, ignoring case, are XML, document, element, data, model, language, markup, text, information, process, and content, so indeed this corpus concerns itself with markup, as advertised.
The corpus was loaded into a document database (MarkLogic) as HTML and then converted
to XML using tidy. Despite the irony of back-converting something that was originally XML into XML,
analyzing XHTML actually makes things easier, as the content is almost entirely wrapped
in p elements.
Two methods were used to probe the popularity of particular metaphors. Seven different
metaphors were probed, which we will look at in more detail later on: documents, trees,
construction, paths, fluids, textiles, and music.
The first method tokenized and gathered counts of each distinct word from the text
nodes of every document in the corpus. A separate set of counts were gathered for
each distinct stem of each word.
An awk script then summed up the counts for vocabulary items related to the specific metaphor.
This method is quick and easy, but a little crude — there are plenty of usages of
certain of these terms that have little to do with the metaphor in question — but
it gives a broad view of overall patterns.
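The first method can be sketched in a few lines of Python. This is a sketch only: the corpus and metaphor vocabulary here are illustrative, and the suffix-stripping stemmer is a crude stand-in for whatever stemming the actual analysis used.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case word tokens, dropping punctuation."""
    return re.findall(r"[a-z]+", text.lower())

def naive_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def metaphor_score(counts, vocabulary):
    """Sum the counts of all vocabulary items tied to one metaphor."""
    return sum(counts[w] for w in vocabulary)

corpus = [
    "The tree has a root and many branches and leaves.",
    "Branches of the tree never rejoin; leaves differ from branches.",
]
word_counts = Counter(w for doc in corpus for w in tokenize(doc))
stem_counts = Counter(naive_stem(w) for doc in corpus for w in tokenize(doc))

tree_vocab = {"tree", "root", "branch", "leaf", "forest"}
print(metaphor_score(stem_counts, tree_vocab))
```

The stemmer's crudeness shows immediately ("leaves" stems to "leav", not "leaf"), which is exactly the kind of noise that makes this method quick but rough.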
The second method made use of more complex full-text search capabilities, allowing
the probes to exclude some of the usages that were irrelevant to the metaphor being
probed. This method also allowed the selection of some utterances, below. It also
means that we're counting the number of paragraphs in which the terms occur rather
than the number of distinct instances of the term. It turns out that this doesn't
much change the relative or absolute distributions.
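A rough approximation of the second, paragraph-level method, using simple substring matching in place of MarkLogic's full-text search (the sample XHTML and probe terms are invented):

```python
import xml.etree.ElementTree as ET

def paragraphs_mentioning(xhtml, terms):
    """Count p elements whose text contains any of the probe terms.

    A paragraph counts once no matter how often a term recurs in it,
    mirroring paragraph-level rather than token-level counting.
    """
    root = ET.fromstring(xhtml)
    terms = {t.lower() for t in terms}
    hits = 0
    for p in root.iter("p"):
        text = " ".join(p.itertext()).lower()
        if any(t in text for t in terms):
            hits += 1
    return hits

doc = """<html><body>
  <p>We pour the data through the pipeline, pipeline after pipeline.</p>
  <p>The stream flows onward.</p>
  <p>Here is a paragraph about something else entirely.</p>
</body></html>"""
# First paragraph counts once despite three occurrences of "pipeline".
print(paragraphs_mentioning(doc, ["pipeline", "flow", "stream"]))
```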
Each of the following sections discusses a particular metaphor. Sample utterances
illustrate the application of the metaphor. They all come from this very conference.
DATA IS DOCUMENTS, PROCESSING DATA IS CLERKING
The document metaphor dominates our understanding of computer data. An astonishing
one percent of all the words used in the corpus relate to this metaphor. The other
metaphors hover around one or two permille at best. We store documents on the file system in directories, we archive them, or convert the information into records. We page and scroll through them and create bookmarks. Our software is put into libraries, our data is indexed, our file systems are organized into volumes. Our user interfaces show us little pictures of books and folders and animations of pages flying by as we copy data from one slice of spinning magnetic domains to another.
What is the story we tell ourselves about our data with this metaphor?
Data is authored and divided into useful and coherent units. It is finite in extent,
and scaled to something a human could read in a reasonable amount of time — minutes,
an hour or so, perhaps a day. It is organized and categorized: it is comprehensible.
It contains human text. Data is visual: it is something to be read. Data is developed,
and then done. Books are written. A new edition may come out, but it is a new edition. It is also the natural state of books to be read and shared.
This is not to say all these properties are true of any particular collection of data. Rather, these are the properties we inchoately
assume and understand when we tell ourselves that data is documents, even when that data in
no way is an electronic representation of a physical document.
The metaphor steers us towards thinking about certain things, and steers us away from
thinking about others. When a social media feed is understood through the lens of
documents, we think about editorial discretion, we think that it is something people
produce. We think about human scales and timeframes. We don't think about mobs of
bots generating posting swarms. A different metaphor, around pest-control or epidemiology,
would better serve us in that case.
What would it mean to take the document metaphor seriously? It is so pervasive, one
could hardly take this metaphor more seriously, it would seem.
Still, there are documents and there are documents. The Book of Kells is not your weekly shopping list. One is a treasured artifact, carefully protected,
where modern copies are valued at tens of thousands of dollars. The other lies crumpled
in your wastebasket, or rotting in the lint trap of your dryer. A document is an impermanent
artifact whose inks fade and whose pages grow brittle with age and are eaten by beetle
and moth larvae. To take the document metaphor more seriously would be to seriously
consider the impermanence of things and the need for active curation and preservation
and which documents need our care and which should be thrown out. This is hard to
do when all documents look alike in our electronic spaces.
Make no mistake, electronic documents also need active curation and preservation.
Large swathes of data gathered during the Apollo missions, at the cost of much treasure
and some lives, rotted on tapes that became obsolete, then unreadable, and then lost.
When you think about your data through the document metaphor, what should you think
about to counter the implicit bias of that metaphor? What are the blind spots?
Does your application/interface/process still work when data is machine-generated?
Human scale in time and size
What happens if your data consists of large amounts of very small machine-generated
documents, or very large or endless ones? What happens if the rate of generation exceeds any human capacity to keep up?
Mostly words, maybe a few images
What happens if your data is mostly markup instead of words?
Data does not change, it is added to incrementally
Can you handle change? Rapid and continuing change?
Data is for sharing
Data is generally important and permanent
Can you tell what is important and what is transient? What is your strategy for keeping
the important stuff and removing the junk?
The danger is decay
How are you curating and preserving your data? Flipping the script: do you have a
process for removing junk?
For example, think of the assumptions built into a classic search engine: indexing
happens in an incremental and lazy fashion, on timescales of hours after data is changed.
Search engines in the 80s and 90s assumed it was OK to completely rebuild the search
index when the set of documents changed. These are all assumptions based on a data
is documents way of thinking.
DATA IS TREES, PROCESSING DATA IS FORESTRY
The metaphor of the tree is largely a metaphor based on a similarity of structure,
albeit one typically drawn standing on its head, so that roots are in the air and branches
lie below. It derives from a long history in computer science data structures and
a metonymous application of the data structure for the data.
The story we tell when we use this metaphor is one of structure: there is one starting
point, the root, from which branches diverge all the way to the leaves. The branches
never come back together. There is a unique starting point. Leaves differ in some
fundamental way from branches and branch points. Overlap doesn't happen.
Again, it doesn't mean all of this is true about our data; only that this is what
we know in our bones when we speak of our data in this way.
Let's push the metaphor a bit, and think about trees, real living trees. We should
start by considering the full tree, not just what we see above ground. Trees spread
their roots wide and deep: their true shape is not a widening triangle, but two triangles
meeting: the visible and the invisible. The trees do not live in isolation but, as
Peter Wohlleben teaches us in The Hidden Life of Trees, in communities and families with mycorrhizal networks linking their roots together.
What, if anything, is the metaphorical application of this invisible triangle of roots?
Roots whisper to us of unseen riches nourishing the whole. What could that tell us
about our data? Here, perhaps, we have metadata which has its own hierarchy of organization
all feeding into the visible base of the tree and which links one piece of data to
other pieces, one tree to another. To take that part of the metaphor on board is to
intuit that metadata is as complex and extensive as data, not just a few simple tags.
Trees, real trees, do not live alone, but in woodlands and forests. What kind of forest
do we understand this to be? To think about different kinds of communities of trees
is to have a different understanding of what our job is in processing that data.
A woodland with a single kind of tree, all the same? A farm of trees, carefully
grown to spec? Our data processing task is one of managing the production in a careful
way, fending off pests, and harvesting the results at the appropriate time. There
is uniformity, and few surprises. One document is nothing special, just a cash crop.
Sensor records are trees of this kind.
A wild and untamed jungle with trees of many shapes and kinds and a profusion of growth?
Our data processing task is one of beating back unchecked abundance, hacking with
machete, burning with fire. The jungle will always come back: our job is to keep a
little of it under control, for a time. Data is understood as something that happens
to us, not something we make. Social media data, perhaps? Maybe we should consider
scorching a few acres from time to time to clear it out. Notice, however, the jungle
metaphor is a way of hiding human responsibility: the jungle is self-creating, so
no one is responsible for perpetuating or preventing any harms. Saying Twitter is a
jungle is saying that the company has no responsibility for what happens there.
A woodland filled with old and twisted trees that have accumulated centuries of accidents
and contingencies to create their own unique shapes? Our data processing job here
is preservation and understanding the unique needs of each individual. Each tree is
old and precious. Digital archives are woodlands of this type, the Staverton Thicks
of markup, with layers of annotations from generations of scholars.
Perhaps we have a well managed woodland, with coppices or pollards and livestock roaming
the understory? Most of the "wild" forests of Europe were woodlands of this sort.
Our data processing job here is to be the woodcutter, venturing into the wood to cut
beech and oak to the ground, to gather the wood and encourage growth for the future.
Or perhaps we are the swineherd, setting the pigs loose to root through the humus and
keep down the underbrush. Data is something that we manage, collect, and put to use
and something that grows somewhat of its own accord. Business records live in this
kind of forest.
Forests, unlike isolated trees, are a little scary. In the wonderful book Gossip from the Forest, Sara Maitland points out that the forests of our fairytales and our imaginings are
places where, above all else, you get lost. We may hope to find a friendly woodcutter, but we're probably on our own and the
wolves are tracking us. We can lose ourselves in our data forests too, if we're not
careful. They extend far into the hills and it's dark in there. Best keep an eye out
for the wolves.
How can we respond to the biases of the tree or forest metaphor? Even if your markup
formalism doesn't admit to non-tree-structured data, thinking about the way in which
your data is not logically a tree may lead you to add mechanisms to represent that.
Leaves, branches, trunks, and the root are not alike
Look at those element-only content models, or those purely structural elements with
no attributes. Does mixed content or attributes really have no place in them? Is there
really one unique starting point?
Branches never intersect
What relationships are there between different leaves and branches? How can you express them?
Are there alternative hierarchies? Do you have overlap? What is the "tree of roots"
for you? Metadata? What does that look like and how does it connect to the data?
Data is a living thing
Does your data really grow and change on its own? How can it be defined and planned?
Think about the agency and actions of those working with your data.
Tree and forest
How does one piece of data (tree) stand in relation to others? Think about what those
relationships are and how to represent them. Are your pieces of data really separate
things at all? What are the dividing lines? Think about matters of scale.
Forests, jungles, woodlands
What kind of forest are you imagining? Imagine a different kind of forest, which is
to say different kinds of dynamics, different kinds of scales, different kinds of
requirements around management. How does your data need to be tended? Who is in control?
What kind of navigational aids or organizational system might you need, especially
as things grow? Does what you have work when you have 10, 100, or 1000 times as much?
Think about what needs to be pruned, and when, and by whom.
For example, when we think about our data as trees, cross-cutting relationships or
overlaps are weird and exceptional. When you model the same information as RDF, notions
of hierarchy, overlap, or parallel representation disappear entirely. Modeling data
with a mix of XML and RDF can allow for a richer view. What is the appropriate metaphor for that mix?
DATA IS BUILDINGS, PROCESSING DATA IS CONSTRUCTION
The metaphor of data as a building tells us a story of intention and artifice. Data
is created with deliberation and a design for a particular human purpose. It is robust,
solid, durable. It is singular: a thing, not some stuff. It is regular in form, not
some wild organic wolf-infested thing. It is made, not grown. It doesn't happen to
you, you happen to it. Data is a tangible thing, something to be grasped.
To be a builder is to execute a plan. There are blueprints from the architect. There
is a project to be managed, coordinating the work of carpenters and roofers, electricians
and plumbers. There are building codes to be followed and inspections to pass.
To take the metaphor more seriously would be to seriously consider what the building
codes are for our projects as well. Serious data processing projects do have building
codes: schemas and standards and rules. The building inspector is the validator.
Buildings may be used by many, but we don't think of them as shared by many in the same way a book is shared. Books circulate. Buildings stand. We visit
them, and then leave again.
Like trees, buildings do not exist purely in isolation. They are grouped into towns
and cities. Like trees, there are underground linkages: the shared infrastructure.
What is the shared infrastructure for our data constructs? There are differences in
scale in various assemblages that affect interactions and usage. A village has different
security and privacy considerations than a megalopolis. What is the scale of your data?
As there are different kinds of forests, there are different kinds of buildings and
building projects. Banging together a tree house is not the same as putting up a cathedral.
What kind of construction is your project?
The prefab shed in the back yard can get put up by a couple of guys in an afternoon.
The hard work was done back in the factory, in designing the separate pieces so they
are easy to fit together. Data processing has been streamlined for us, and we just
need to use some basic tools and follow a simple plan. The data processing work for
the factory is to make a plan and pieces that can be put together simply. Success
depends on tightly constraining what is possible, limiting the tools needed, and the
kinds of interactions pieces can have. In the data realm, I think this is a goal more
often aspired to than accomplished, but it is crucial for the empowerment of non-technical users.
What one sees more often is the dry stone wall of data: when executed by a master,
beautiful, solid, effective, each component fit together just so; when executed by
a novice, a heap of stones that sort of does the job. Maintenance by someone other
than the master is hopeless, and likely to cause more damage than repair. The lesson
of wall-building in the real world is that standardization of parts and methods —
bricks and mortar, so to say — makes wall-building accessible to folks other than
master craftsmen. Not as strong, not as efficient, not as beautiful, but more widely
possible. What are our markup bricks? HTML elements, perhaps.
One also sees skyscraper projects, where the construction of the support machinery
is an undertaking in its own right. Making the stylesheets to generate the stylesheets
to make the XML. The mistake we sometimes make is not to plan the support infrastructure
properly, or to fail to ensure that it too follows building codes. The crane comes
crashing to the ground, to the great dismay of all.
And we have our cathedrals: long-term projects consuming generations of graduate students,
complete with their gargoyles and stained glass, and the extra buttressing
on the west wall, just to be sure.
The building metaphor centers human agency and planning; countering its bias is to
think about ways in which things might happen outside of human intentions.
Data is intentional, built by humans
Does your application/interface/process still work when the data is machine-generated?
There is a blueprint
What is it? How do you ensure that it is being followed? What kinds of inspections
are there? Performed how? What if you don't have a blueprint? Do you need one? Are
there some parts that could be unplanned and more fluid?
There are building codes
What are the standards and how are they being applied? Are there standards? What if
there are not?
Data is robust, solid, durable
What are you doing to ensure that is the case? How are you curating, preserving, and
protecting it? What if it isn't? Distinguish the transient from the permanent.
Data is unchanging
What if it isn't? What happens if change is frequent?
There is a border/door/wall
Is there? How is access to the data managed? What security is there on this door?
Does data need to be walled off at all? Are there really a limited number of points
of entry into the data? What are they?
Data is owned
By whom? How is this managed?
Sheds, cathedrals, skyscrapers
What kind of building are you imagining? Imagine a different kind of building, with
different robustness requirements, different use patterns, different kinds of coordination
requirements. What are the different kinds of roles required in making this artifact?
What are the infrastructural supports required?
Think about the scale of the collection of buildings. How would things change if it
were much larger or smaller than you imagine? How do the separate pieces interrelate?
Are they really separate?
There is shared subterranean infrastructure
What kind of shared infrastructure do you want to rely on? Public linked data, for example.
How is it funded and maintained?
For example, you may have some nicely constructed documents, and still want to bring
a folksonomy to the party to help with user-centered organization, even though it
is unplanned and ever-changing.
DATA IS A PLACE, PROCESSING DATA IS A JOURNEY
The story we tell ourselves with the metaphor of the path is that we are heading somewhere:
there is a starting point and a destination, and if we keep moving forwards we can
get there. Data is a space to navigated. It isn't made. It doesn't grow. It just is.
Our job is to find or make the path that gets us to our journey's end.
The thing about paths, too, is that they are shared. Once the trail is blazed, others
may follow. So the metaphor of the path tells a story about reuse. It may also tell
a story about return: There and back again. Making a path is about marking a path:
markup as signposts. "This way to Llanelli." "This way to footnote 5."
To speak of paths is to focus on what you do with the data, not on the data itself.
Data as a place also tells us a story about enclosure and boundaries: my data, your
data, their data. The data is bounded, and owned. You might need fences or (fire)walls
to protect it.
When data is a place, you fear invasion. There are those who belong in that place
and those who don't. Locking up books is censorship. Locking your front door is prudence.
There can be real-world policy consequences about how we talk about our data.
The path metaphor puts the focus on interactions with data rather than on the data
itself. It may be helpful to remember that shaping the data also shapes the possible
navigations through it.
Data is neither made nor grown
Think about human agency in creating data. Think about change.
Data has a boundary
Does it? Does the data have relationships or interactions with data outside this boundary?
How is access to the data managed? What security is there on this boundary? Does data
need to be walled off at all?
Data doesn't change
What if it does? Will the paths and navigation markers still work properly?
There is a starting point and an ending point
Is there only one starting point? One place to start accessing the data? What other
entry points might be useful?
Trails are blazed, and reused
How are navigations marked to begin with? Are there milestones indicating stable landmarks?
How stable are they?
For example, where data may change, references used to point to particular points
in a document through a path must be stable for the paths to be reused. If ids are
used to anchor places in the path, the ids must be stable. If tumblers count children,
the number of children cannot change. There are plenty of document production systems
that rely on ids on anchors to point to places in the TOC, where those ids are randomly
generated, and therefore not consistent when the documents are edited. Unfortunate.
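The contrast can be made concrete with a small sketch (hypothetical document and helper functions): a positional, tumbler-style reference silently shifts when a sibling is inserted, while an id-based reference stays anchored.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<book>"
    "<chapter id='intro'>Introduction</chapter>"
    "<chapter id='markup'>Markup</chapter>"
    "</book>"
)

def by_position(root, n):
    """Tumbler-style reference: the n-th child (1-based)."""
    return list(root)[n - 1]

def by_id(root, ref):
    """Id-based reference: stable so long as the ids are stable."""
    for el in root.iter():
        if el.get("id") == ref:
            return el
    return None

# Both references initially point at the same chapter.
print(by_position(doc, 2).text, by_id(doc, "markup").text)

# Insert a new first chapter: the positional reference silently shifts
# to a different chapter, while the id-based reference still finds its target.
doc.insert(0, ET.fromstring("<chapter id='preface'>Preface</chapter>"))
print(by_position(doc, 2).text, by_id(doc, "markup").text)
```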
DATA IS A FLUID, PROCESSING DATA IS PLUMBING
The story of pipelines is a story of change. You never step in the same river twice.
There is a sense of abundance and continuity, of inexorableness and endlessness. Your
data is pouring through your channels, and you can direct it, guide it, pen it up
for a time: but it keeps on flowing. Once you start it off, there's no holding it
back. You can no more stop your data than Canute can stop the tide. It's a little
frightening too. You might drown in it.
In the data as fluid metaphor, our job is the job of plumbers, to set up the valves
and conduits so the flow follows its proper path. Data is tangible, but it is stuff,
something to be put in a container. Water runs everywhere. It isn't so much shared as distributed.
The language of leaks and breaches is also a means of removing agency and abdicating
responsibility. The data is somehow responsible for its own actions rather than some human agent.
When you imagine your data pipelines, what kind of fluid do you imagine it to be?
Is it something benign and harmless like water from start to finish? We are the builders
of aqueducts and canals, spreading the life-giving water to the parched plain. Heroes!
This is a job of distribution and sharing. To take it seriously is to consider and
measure how we are partitioning the data, how much is coming in, how much is going
out, where it is going, where it is not.
Or is this a sewage treatment plant, where you start off with something horrid, and
end up with something sparkling and clean? Our job is one of continual refinement,
adding ever finer filters and treatments to ensure a good result. To take this view
seriously is to make sure we are sampling our outputs to know that nothing harmful
has made it through.
Or is this more of a chemical plant kind of scenario, where there are complex and
dangerous interactions that might explode and endanger the townspeople if they don't
go right? Is this crude oil where a leak creates a toxic mess?
If our data fluid is a volatile or harmful chemical, we should import the lessons
of chemical process control. Maybe we should identify and record the critical process
parameters, the inputs to the process. Maybe we should identify and monitor the critical
quality attributes, the metrics on how processing is proceeding that tell us what
is happening and where it is starting to fail. Chemical plants are filled with complex
interactions and non-linear effects, making them difficult to control and manage safely
(Perrow99). Sometimes data behaves that way too: you didn't expect there to be more than one
of those kinds of elements, and you didn't check, and now your whole process misbehaves.
Best have gauges measuring what you need at key junctures.
Certainly if the data we are pushing through our plumbing is of a sensitive nature
— health care or banking records, for example — the cost of a leak can be extensive.
A double-walled hull for the transport may not be too much to ask. Or, in the lingo
of security folks, defense in depth.
Maybe we need to pay attention to the effluvium that is not the primary output, but
that may be dangerous if it propagates into the environment. We all may wish Facebook
had thought of their data releases in this way.
Thinking about different kinds of fluids makes us think about different kinds of pipeline
steps. Cleaning is about filtering and narrowing flows. Single inputs and single outputs.
Distribution is about selection and divergent flows. Chemical processing is about
convergent flows, refractionation, and feedbacks.
The data as fluid metaphor removes human agency from the picture; counter its bias
by thinking about human responsibilities and actions. This metaphor also removes the
identity of pieces of data as distinct artifacts from view; counter that bias by thinking
about the ways in which different data needs different handling.
Data is amorphous and undifferentiated
Is all data the same? Is some data more static and durable? Can it all be freely mixed?
Think about whether you need to know what pieces came from where. Don't lose sight
of the fact that for some purposes individual identity still matters for legal or
financial reasons. Think about context that makes individual pieces of data meaningful.
Data flows continually
Think about your critical quality attributes: what should you measure to know that
the flows went well, and to tell how and where they went wrong? Think about starting and stopping. Think about what the data looks like when
it isn't moving.
Data flows and pools of its own accord
Pay attention to agency and responsibility. Data does not do anything on its own.
People are responsible for managing it, protecting it, controlling it. Who are they?
What are their responsibilities?
Water, sewage, chemicals
If you imagine your data fluid to be harmless, think again. What if it weren't? What
if it were dangerous? What if mixing it with other data made it more dangerous? How
should you contain it, manage it? What processes should be in place to prevent accidents?
What mechanisms will you put in place to prevent this? Navigational aids? Rate limits?
Sometimes embracing a metaphor can help us grasp something more firmly. For example,
we can understand the risks of deanonymization through the metaphor of data as fluid:
mixing two chemicals that aren't dangerous in themselves can create a dangerous situation.
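A minimal sketch of that chemistry, with invented data in the spirit of classic linkage attacks: neither table names a diagnosis-holder on its own, but joined on shared quasi-identifiers they do.

```python
# Two datasets, each "harmless" alone.
medical = [  # anonymized: no names
    {"zip": "02138", "birth": "1945-07-21", "diagnosis": "hypertension"},
    {"zip": "02139", "birth": "1962-03-02", "diagnosis": "asthma"},
]
voters = [  # public record: names, but no health data
    {"name": "A. Smith", "zip": "02138", "birth": "1945-07-21"},
    {"name": "B. Jones", "zip": "02139", "birth": "1962-03-02"},
]

def reidentify(medical, voters):
    """Join the two 'safe' fluids on (zip, birth): the mixture is toxic."""
    index = {(v["zip"], v["birth"]): v["name"] for v in voters}
    matches = {}
    for row in medical:
        key = (row["zip"], row["birth"])
        if key in index:
            matches[index[key]] = row["diagnosis"]
    return matches

print(reidentify(medical, voters))
```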
DATA IS A TEXTILE, PROCESSING DATA IS WEAVING
Textiles and software go way back: Hollerith punched cards are an evolution from Jacquard
punched cards, after all. The words text and textile are cousins, coming from the Latin verb texere (to weave). Knitting patterns are programs in a domain-specific instruction set, designed
for humans to execute.
Textiles tell a story of unity in complexity, of interacting and entangled structures.
Data is made, with skill and artistry. Taking the sewing side of the metaphor over
the weaving side: data is pieced together from specifically cut pieces, again, with
skill and artistry, hiding the seams. Data is flexible, within limits. It is finite
and put to some deeply human purpose. There is a plan. There are measurements. We
are not afraid of our textile data. There is a sense of fragility: weavings can be
unravelled or torn. Data is tangible, something to be draped, something to be worn.
Soft and pliable, but not freely intermixed.
Textiles are holistic but singular. They don't blur and mix like fluids. What matters
isn't individual stitches, but the pattern as a whole.
As with buildings, there are plans, and patterns of various levels of complexity.
But textiles are more personal. You don't share your socks.
Textiles are flexible, but they are not ever-changing. Change is about mending and
patching, not fundamentally reshaping what we have.
What do different kinds of textiles teach us?
Weaving teaches us that the interactions of strands create patterns of their own.
Overlap not only happens, it is essential.
Knitting patterns teach us that short and simple generation instructions can produce
complex results. It might be better to store the generator rather than the instance.
Embroidery speaks of annotations, adding layers of meaning.
White work teaches us that we can create complex information patterns by removing
threads and creating voids.
Quilting teaches us that we can create mash-ups from disparate data sources to produce a new, coherent whole.
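The knitting observation, that a short generator can stand in for a much larger instance, can be sketched as follows (the pattern notation here is invented, loosely echoing knitting shorthand):

```python
def expand(pattern, repeats):
    """Expand a short knitting-style instruction into the full row.

    pattern: list of (stitch, count) pairs, e.g. knit 2, purl 2.
    The stored generator is a few tokens; the instance it produces
    can be arbitrarily large.
    """
    return [stitch for _ in range(repeats)
            for stitch, count in pattern
            for _ in range(count)]

# "(k2, p2) x 4": a tiny generator, sixteen stitches of instance.
row = expand([("k", 2), ("p", 2)], 4)
print("".join(row))
```

Storing `[("k", 2), ("p", 2)]` plus a repeat count is both smaller and more meaningful than storing the expanded row.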
Like the building metaphor, the textile metaphor centers human agency and planning.
Countering that bias is to think about what happens outside human intentions. Like
the document metaphor, the textile metaphor lends itself to assumptions about human
scales in time and size, and thinking about inhuman scales is a useful counterpoint.
Data is intentional, crafted by humans
Does your application/interface/process still work when the data is machine-generated,
operating at machine scales?
Data is complex, tangled, and holistic
Think about reuse and slicing and dicing. Think about what pieces make sense in isolation.
Human scale in time and size
What happens if your data consists of large amounts of very small machine-generated
documents, or very large or endless ones? What happens if the rate of generation exceeds
the rate at which it can be consumed?
Data is skillfully fabricated by hand
Do your systems/processes hold up when intuitions about human scale don't apply? How
can non-skilled use be enabled?
Data wears out
What does it mean to patch up your data, to preserve it, to decide it is worn out
junk to be thrown away?
Knitting, quilting, stitching patterns
Is it better to keep a generator for the data instead of the data? What are the constituent
pieces to be sewn together?
Tearing or unravelling
The textile metaphor biases us against extraction, reassembly, and reuse. Think about
how you could support those use cases.
RDF data is often conceptualized as a graph, but textile metaphors work well too.
There are threads of different colours knotted and linked together to form a complex
whole. The difficulties of multiple roots or overlaps or concurrency just don't come
up: it is the pattern as a whole that matters.
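The knotted-threads view of RDF can be sketched without any particular library, using plain tuples as triples (the resource names here are invented for illustration): each triple is a knot joining two threads, there is no single root, and what matters is the pattern the knots form together.

```python
# RDF-style triples as knots in a textile: no root, no hierarchy,
# just threads (resources) linked wherever they cross.
triples = {
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("carol", "memberOf", "guild"),
    ("alice", "memberOf", "guild"),
}

def threads_through(node, triples):
    """All knots a given thread (resource) passes through,
    whether as subject or as object."""
    return {t for t in triples if node in (t[0], t[2])}

print(len(threads_through("alice", triples)))  # alice is knotted twice
```

Questions of multiple roots or overlap simply do not arise in this representation: any thread can be followed from any knot.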
DATA IS MUSIC, PROCESSING DATA IS PERFORMING
The musical metaphor gets hardly any traction: less than one half per mille of this
corpus, and most of those usages are about actual music, not metaphorical ones.
One place where music does get taken more seriously in the context of data is sonification.
Sonification is the rendering of data as literal music. It works because humans are
better at hearing subtle patterns in complexity than seeing them. Wordless information
of various kinds (EKGs, gravity waves, Geiger counters) has been effectively sonified.
What would sonification of markup look like?
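One speculative answer: map element depth to pitch and document order to time. A minimal sketch, assuming an arbitrary mapping of our own devising (the function name, base frequency, and whole-tone step are all invented for illustration, not an established sonification scheme):

```python
# Sonifying markup: one tone per element, in document order,
# with deeper nesting sounding as higher pitch.
from xml.etree.ElementTree import fromstring

def sonify(xml_text, base_hz=220.0):
    """Return (frequency, duration) pairs, one per element."""
    def walk(elem, depth):
        # a whole-tone step (ratio 9/8) per nesting level
        yield (base_hz * (9 / 8) ** depth, 0.25)
        for child in elem:
            yield from walk(child, depth + 1)
    return list(walk(fromstring(xml_text), 0))

tones = sonify("<a><b><c/></b><b/></a>")
print(len(tones))  # four elements, four tones
```

Played aloud, nesting becomes a rising and falling melodic contour, and a listener might pick out structural irregularities by ear that are hard to spot visually.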
The story this metaphor could tell is one of creating a beautiful unity out of richness.
Data is created as an act of cooperation and blending together disparate voices and
instruments. Music carries emotion of all kinds. Like the fluid metaphor, there is
continuity and things happening over time. Music itself is often understood through
the metaphor of flowing water. Like the construction metaphor, this is a planned activity.
There is a composer and score and a conductor. Like the textile metaphor this is an
intimate, deeply human craft. Data is not visual to be read, not tangible to be grasped,
but audible, to be heard. Change is encoded in the very structure of the music itself:
present, but part of the pattern. Sharing too is intrinsic to the concept of music:
not the sharing of an artifact, but the sharing of experience. Music is ephemeral.
What would markup be like if we thought of it as scoring? We wouldn't think of overlap
as a problem at all, but as our essential task. Our representations would be representations
of parallel tracks, our words arranged in time rather than space.
The data as music metaphor can itself be a useful alternative viewpoint to more common
metaphors. What if we liked our data? What if it were a participatory activity, not
an artifact? What if the goal of security weren't keeping secrets but preserving the
usefulness and integrity of the performance? The musical metaphor might encourage
us to consider the emotional impact of data and structure our systems accordingly.
Data is beautiful
How does the structure of your data impact its esthetics? How will that impact how
humans are able to interact with it?
Data is participatory
Who is participating? Who is excluded? How can we make participation more inviting?
Data is performed
Who is the performance for? Who is performing? What is the scope of the performance
in time and space?
Data is ephemeral
Not everything needs to be saved beyond the moment it was created.
When multiple streams come together, will they actually work together? Are the cadences
compatible?
When talking about our data and its processing, our choice of metaphor directs our
attention towards (and away from) certain aspects of reality. In particular, it selects
a particular view of complexity and of change. It also selects an emotional and moral
stance towards that data.
With the building metaphor we attend to the processing of data and its regularity.
Data is intentional, controlled, a singular artifact. We think about plans and checks.
We also think about stability and permanence. Buildings, once formed, generally do
not change easily.
The path metaphor, by contrast, is all about the process. The data is in the background,
a given, something for us to traverse, something to put boundary markers around. We
pay attention to how one part connects to another. Change is not in the data, but
in what we do with it.
The fluid metaphor evokes a frisson of fear: there is abundance, but a danger of over-abundance
too. Data moves and must be guided and controlled. Attention is diverted from human
agency: the data has agency of its own. Depending on what kind of fluid we imagine
we have, we may attend more or less carefully to how we handle it. Change is expected
and constant: everything may be reshaped from moment to moment.
The textile metaphor, by contrast, evokes feelings of cozy intimacy. Data isn't scary,
although it can be very complex. Still, it is fabricated according to a plan. We have
a sense of regularity and order. Change can happen, but generally in the context of
mending and patching.
When we think of trees, we think of hierarchy and branching. Cross-cutting tangles
are exceptions. Yet once we start talking about forests, complexity and thoughts of the organic creep in. And a little fear. Change is about
growth: addition at the leaves. Once again, attention shifts from human agency to
agency of the data.
It is a pity the musical metaphor is not more popular: for here is the e pluribus unum of data, an embrace of complexity, and here is beauty, and here is joy.
In summary, each of these different metaphors has a different story to tell about our
data, and our relation to it:
Used by many, owned by one
Made, then solid
Traversed by many, owned by one
Experienced and created together
There is a pragmatic aspect to being more mindful of the language we use to talk about
what we do. Thinking about our data and the systems that surround it in particular
ways biases us to thinking about certain issues and ignoring others: being more mindful
of those biases allows us to temper them. Mindfully choosing to view things through
the lens of a different metaphor allows us to consider a fuller view of our data and
the processes and systems that operate on it.
Ethics come into play here as well. When data leaks or there is a data breach, some human did something they shouldn't have done to cause that to happen, or failed
to take an action they should have to prevent it. But the language of leaking talks about data as a fluid that just oozed somewhere of its own accord: it blames
the data for its own condition and abdicates human responsibility. Similarly, talking
in terms of jungles and organic metaphors of unchecked growth again diverts our attention
from human agency and control, and blames data for its own misuse. On the other hand,
when we talk in cozy terms of weaving and knitting, we are biased to think of the data in ways that prevent us from thinking about the
dangers of use or misuse. When we talk of documents and libraries, we are inclined to think of books being shared freely, and neglect issues of privacy.
The metaphors we code by are another tool. They can be wielded with purpose.
[Maitland12] Sara Maitland. Gossip from the Forest. Granta Books, 2012.
[Lakoff80] George Lakoff and Mark Johnson. Metaphors We Live By. University of Chicago Press, 1980.
[Perrow99] Charles Perrow. Normal Accidents. Princeton University Press, 1999.
[Wohllenben16] Peter Wohlleben, translation by Jane Billinghurst. The Hidden Life of Trees. Greystone Books, 2016.
 Numbers are rough here, because it depends on what you count as the same: Distinct lexical forms? Distinct stems? Distinct senses? I counted both distinct
lexical forms and distinct stems, and the numbers are in the same ballpark.