

The sustainability of the scholarly edition in a digital world

Cathy Moran Hajo

Associate Editor, Margaret Sanger Papers

New York University

International Symposium on XML for the Long Haul: Issues in the Long-term Preservation of XML
August 2, 2010

Copyright © 2010 by the author. Licensed under a Creative Commons attribution, non-commercial, no derivatives 3.0 unported license.

How to cite this paper

Hajo, Cathy Moran. “The sustainability of the scholarly edition in a digital world.” Presented at International Symposium on XML for the Long Haul: Issues in the Long-term Preservation of XML, Montréal, Canada, August 2, 2010. In Proceedings of the International Symposium on XML for the Long Haul: Issues in the Long-term Preservation of XML. Balisage Series on Markup Technologies, vol. 6 (2010). DOI: 10.4242/BalisageVol6.Hajo01.


Scholarly editions must be used for generations; by nature they require a stable long-term publication format. Some editors have eagerly embraced digital editing and XML, but many more editors remain unconvinced that digital publications can last as long as printed books. Community standards and DTDs for editions have not been widely adopted and editors lack consensus about what a digital edition should be. XML's stability and sustainability are critical to efforts to go beyond “the book,” and to develop new ways of presenting texts and scholarly commentary. To build 21st century editions, we need tools to make XML encoding easier, to encourage collaboration, to exploit social media, and to separate transcriptions of texts from the editorial scholarship applied to them.

Scholarly editors have long been invested in the creation of long-lasting and sustainable publications. Whether they create complex multi-year projects that rely on cooperative teamwork or develop short-term solo projects, editors understand that their work will be consulted for years to come. The expense of locating, selecting, transcribing, annotating and publishing historical documents could not be maintained if these editions were not built to last. Editors have developed practices and policies to ensure that their readers can confidently rely upon their versions of important historical manuscripts. This care has always extended to the stable publication formats editors chose, whether in letterpress or microform.[1] When editors turn to digital publication, sustainability remains of critical importance.

Scholarly editions are also committed to bringing primary sources to broader audiences. Editors take Thomas Jefferson to heart when he wrote: “…let us save what remains: not by vaults and locks which fence them from the public eye and use, in consigning them to the waste of time, but by such a multiplication of copies, as shall place them beyond the reach of accident.”[2] By publishing edited works, we take fragile and unique archival manuscripts and make them available in research libraries where scholars and students can access them. They don't reach everyone. These editions are expensive to purchase and most libraries do not carry complete sets of all of them, but given the technology of the time, printed volumes certainly served to disseminate and preserve historically important materials. The advent of digital publication challenges editors to expand their reach, to move beyond the ivory towers of research libraries to high schools, town libraries and even to the comfort of private homes. This monumental expansion of our audience forces us to rethink how we edit documents and how a truly accessible edition should behave.

Editors have always had to balance costs against their ability to preserve and disseminate documents. Never has this been a more difficult task than at present, when the cost of creating long-term quality digital editions often prohibits editors from offering them freely to the public. Digital publishing permits unprecedented access possibilities, but it is a fragile medium that is susceptible to obsolescence. Should we create an edition that is as sustainable as digital text can be, but might not be widely accessible, or should we create a widely accessible edition that might not last a long time? Neither option is acceptable. The reality of funding for scholarly editing in the 21st century is this: it is difficult enough to raise the funds to create the content of these editions. Adding the technical specialization needed to render these texts in well-formed XML is beyond the capabilities of many editing projects. Some editors, too, are reluctant to embrace the notion of providing free and open access to their editions and support the idea of subscription-based access to digital editions. Arguing that their print volumes were never free of charge, many editors seek to generate income to help offset project costs or seek royalties for their intellectual work. Federal agencies prefer, or in some instances demand, that the editions they fund be produced using XML, but they do not provide sufficient guidance or tools to help editors comply. After fifteen years of being told that markup is the gold standard for digital publications, we get it. We know that we need to use XML on our digital texts, but I would say that the best adjective to describe our adoption of XML might be “reluctant.” This is especially true of historical editors as opposed to literary editors. Few historical editors have embraced digital publication's promise, and most have treated digital publication as an add-on to their central work of publishing volumes.

Aside from our goals of accessibility and sustainability, digital publication questions some of editing's guiding premises. I don't believe that many editors, even those working with digital text, have really explored or thought through these issues. To date, most digital editions still closely resemble the books they are based upon. As we finish digitizing older publications and begin to construct more and more born-digital editions, we have an opportunity to redefine the edition, the editor and the editing project. While none of us can peer into a crystal ball and know what is yet to come, I expect that the edition of the 21st century will differ substantially from the editions of the 20th century.

The transcription is central to the work of the scholarly editor. Faced with the problem of how to make primary sources useable and accessible in a print world, editors decided to transcribe them. Publishing paper-based facsimiles was too expensive and while microfilm editions offered images, they could not easily include annotation or much editorial intervention. To render all the complexities of their texts, editors developed typographical mechanisms to record changes of hand or pen, additions, insertions, deletions and margin notes. Their familiarity with the texts enabled them to read and transcribe difficult handwriting, account for variants of a text and provide historical context that makes the manuscript come alive. The transcription is the text for most editions. Once it is proofread or verified as accurate, it is the center around which the project orbits, and if transcriptions are questioned or found wanting, the edition is quickly discredited and disused.[3] As the capability to provide high quality images over the internet has increased dramatically, editors need to grapple with the idea that transcription might have a different role as we move into the future.

Another agreed-upon principle of scholarly editing is that the editor's work should be as unobtrusive and objective as possible, which in part contributes to the edition's long shelf-life. Despite the fact that selecting texts and drafting annotation can be intensely subjective tasks, editors try to keep interpretation and historical arguments to a minimum. Our role is to present the most important documents with factual annotation that enables the reader to understand the text and reach his or her own conclusions. More often than not, the editor who feels compelled to weigh in on current historiography or the controversial issues of the day does so outside the edition, in a journal article, monograph or biography.[4] At projects that employ more than one editor, the work of each is subsumed into one consistent editorial voice. We credit editors on the title page, not each portion of the edition that the scholar created. In this way editions are truly collaborative and a rare example of team-based humanities research. As digital scholarship allows us to build larger and more far-flung collaborations, editors need to decide whether that current mode of attribution should be continued.

Finally, editors agree that because their work is so labor intensive, it must be done well. It will be a long time before anyone has the opportunity to do the work again. The existence of a previous published edition on a topic forces the editor to explain why the existing publication is so flawed that the time and money need to be expended to redo it. One of the promises of the digital age is the ease of re-purposing objects once they have been digitized. When editors can continually revise and enhance their editions, will these projects ever truly end? Will we develop mechanisms by which we can allow other scholars to try different approaches on our texts? Will they need our permission?

Editions are not created in a vacuum. We rely on more than sixty years of tradition and previous work to guide our steps. But not all our inspiration comes from the print editions of the past. We are not just content creators, but users of digital resources. We have seen the work done by archivists, public historians, and digital historians to digitize primary sources and to bring them to new and larger audiences. We also see new ways of researching and collaborating with both experts and the public and we want to try these out in our editions. Editors need to decide how many of our tried and true methods still remain valuable, and what aspects of the new technology would be best incorporated into our editions. Digital historians seek to do more than just enable searches of electronically rendered texts; they want to encourage research and collaboration, to use data mining, visualizations, and other computer-aided tools to analyze texts on a much larger scale than once possible. Sometimes I am not sure that we recognize the power of conducting a simple Google Book search for a string of text: we are searching more than ten million books in a few seconds, a task unthinkable before digital publication. This kind of computing power can fundamentally change the way we do research as well as how we formulate a feasible research question. The problem that historians will face from this time onward is how to deal with an abundance of sources, not how to overcome a dearth. So any method that we can develop to help to dice up our editions into smaller useful portions will be valuable. While digital historians are only a small minority of the profession at present, in just a generation or two, all historians will be digital historians. Editors need to keep an eye on the trends in this emerging field, to educate themselves in the technology, and learn to adapt their editions to meet the needs of this important group of stakeholders.

Scholarly editors have been hearing about the benefits of XML markup for fifteen years but that does not mean that most are comfortable with the idea. Again, I am talking primarily about editors of historical documents, rather than those working on literature. English departments and linguistic scholars adopted markup languages and computer aided text analysis early and have led the way in the development of XML, especially the Text Encoding Initiative guidelines that focus on humanities texts. Digital editions like the Women Writers Project, the Whitman Archive, the Willa Cather Archive and the William Blake Archive have benefitted from close associations with humanities computing centers to create sites that rely on robust encoding of complex manuscript material.[5]

Historians have not been as quick to embrace markup language and I am not entirely certain why. We are more reluctant to put the time into mastering XML or the programming languages needed to search and display XML texts effectively. We balk at the amount of technology we need to learn, seeing time spent on it as time spent away from our documents and our research. It may also be that XML provides greater immediate benefit for literary research than it does for historical research. The Willa Cather Archive offers text visualization and analysis tools that enable readers to search for word frequencies, create word clouds and concordances. These are useful tools, to be sure, but not the first ones that a historian might apply to a set of documents.[6] We might prefer to locate all the mentions of a person, organization, place, subject or idea. We want to find the appropriate documents and study them through close reading. I think that at some level we resist the idea that a computer could ape the way that we attack documents. If our print editions are any clue, we invest the most time in creating detailed subject indexes that organize the documents into important categories. While historians are interested in the process of document creation, tracing variants and versions over time, they are more interested in assigning content to the text. Encoding these contextual relationships is a sort of cross between annotation and indexing, something that might be done slightly differently by each project depending on the interests of an editor or the specific documents being published. Believing each of their projects' needs and processes to be unique, historical editors have not really united to develop the digital tools that would lead to new ways of looking at texts, either within or across editions.

We don't have a good model in use today for a sustainable XML edition with which we could develop a shared conception of digital editing. There are a lot of silo projects that have been developed for specific sets of documents and that are not broad-minded enough to serve all editions. I think that a lot of these idiosyncratic editions have good ideas, but often do not have sufficient infrastructure to ensure longevity.

The XML-based edition we are building at the Margaret Sanger Papers is an example of such a specific application. The Project's focus is on Margaret Sanger, the 20th century American birth control activist. Our entire archive consists of slightly under 100,000 documents, published on microfilm. Only about one percent of these documents were selected, transcribed and annotated for our four-volume print edition.[7] For our digital edition, we did not want to repeat the work that we did with our book edition. Instead we wanted to explore the capabilities of XML encoding and text searching. We selected Sanger's articles and speeches, six hundred texts in all, as the best test for digital publication. Few of these documents were included in our book edition because of their length and the fact that they were somewhat repetitive, the vast majority dealing with some aspect of birth control. Even though all of them were accessible within the microfilm, they were not searchable in that format. We believed that these documents would best benefit from the ability to search text and the addition of metadata. We began this project in 2003, but without dedicated funding for it, progress has had to be slow. The beta site contains about three hundred documents now, searchable by text, date, title, format, publication venue, and by subject. We worked with staff at New York University's Humanities Computing Group to set up the initial search and display and as we complete the edition we will add additional searches for the names of organizations, people, places, titles of books, and text quoted by Sanger. All these are already tagged in the texts. Our XML encoding is not as complex as many others I have seen, but we feel that the combined search capabilities of text and metadata offer something different than either our book or microfilm editions. It does not offer annotation in its classic form, but allows in-depth research of Sanger's ideas, audiences, and changes in her arguments over time.[8]
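To make the idea of combined text and metadata searching concrete, here is a minimal sketch in Python. It is not the project's actual encoding or search code, which the paper does not show: the element names (`doc`, `person`, `place`), the `format` attribute, the document ids, and the sample sentences are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample documents in a simplified, hypothetical tag set.
SAMPLE_DOCS = {
    "doc-001": (
        '<doc date="1915-08-01" format="article">'
        "<p>A letter from <person>Emma Goldman</person> arrived from "
        "<place>Portland</place>.</p></doc>"
    ),
    "doc-002": (
        '<doc date="1921-11-01" format="speech">'
        "<p>The meeting in <place>Portland</place> was well attended.</p></doc>"
    ),
}


def find_documents(docs, tag, value, doc_format=None):
    """Return sorted ids of documents that tag `value` with element `tag`.

    If `doc_format` is given, only documents whose root element carries
    that `format` attribute are considered (a metadata filter).
    """
    hits = []
    for doc_id, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        if doc_format is not None and root.get("format") != doc_format:
            continue
        if any((el.text or "").strip() == value for el in root.iter(tag)):
            hits.append(doc_id)
    return sorted(hits)
```

Calling `find_documents(SAMPLE_DOCS, "place", "Portland", doc_format="speech")` narrows a tagged-place search to one document format. Neither a plain full-text search nor a print index can combine the two criteria this way.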

Ours is a concise digital edition drawn from a closely connected group of texts that share a common format and purpose. The content encoding that we have done required an understanding of the texts, knowledge of the references made, and the ability to construct detailed subject entries that provide meaningful intellectual divisions between six hundred documents that might all fall under a handful of Library of Congress Subject Headings. If we wanted to expand this edition to include correspondence or other kinds of documents, we would have to revise our tagging scheme as well as our display interface. We could not use a system like this to digitize all one hundred thousand documents in our microfilm edition because it would be too time consuming. We do not have the staff to carry out transcription and content encoding and index entries for such a large number of documents. It is not even easy for us to make small adjustments to the edition's design because the editorial staff did not create the programs that search and display the texts.

Our situation is similar to that of many other projects that don't have access to digital text experts at a humanities text center. Our original project was encoded using a slightly amended version of the Model Editions Partnership DTD for TEI P4, created with the help of Matthew Zimmerman of NYU's Humanities Computing Group.[9] Sustainability has become a concern for us. Since our edition began, the TEI has introduced P5 and NYU dissolved its Humanities Computing Group. What we have right now works, but we need to decide whether to spend the time and money that it will take to update our encoding to comply with TEI P5 or stay with the older version. If we do choose to migrate to P5, we would need to redevelop our encoding policies, resolve the differences in the texts already encoded, and recreate the search and display interfaces to work with the new encoding. While these tasks might not be difficult for those who work with TEI, XSLT and PHP every day, for us it will require either raising the funds to hire a programmer, or spending many, many nights laboriously working our way through web tutorials and a small library of the Complete Idiot's Guides!

These kinds of predicaments tempt editors like us, pressed for time, to consider farming out XML encoding to a consultant or their publisher. There can be a danger in going that way because I don't think that anyone knows better than we do how people can use our texts. If we don't master XML encoding we won't be able to participate fully in decisions made on how the texts should be tagged, nor will we be able to fully explore the possibilities of digital editing. And I think that we need to always be thinking about how to make better editions, even if it means breaking with some of our traditions. For example, I think that when we convert a print edition to digital form, we should be describing the manuscripts, not trying to replicate the organization or structure of a specific volume. When we cling to our older formats, I think that we limit the possibilities for redefining the way we do things with an eye to the capabilities of digital publishing.

A case in point is the University of Virginia's Rotunda digital imprint. Rotunda is the fastest growing source for digital historical editions, best known for its ambitious project to digitize the massive multi-volume Founding Era projects, including both volumes previously published in paper and those still to be published. The Founding Fathers problem offered a “perfect storm” of a test for digital publication. Their shared geographical, chronological, and subject focus provides a strenuous test of how XML can integrate research not only across volumes, but across editions as well. The Founders had hundreds of print volumes, only some of which were in any kind of digital form, which meant embarking on a large-scale legacy conversion project, while at the same time creating a workable platform for the continued production of volumes. Finally, it offered the challenge of developing a way to combine the editor's main access tool—the index—across large collections of volumes and editions. To date, the combined Founding Era collection includes six editing projects, almost 90,000 discrete documents and almost 850,000 index references.[10]

Putting the American Founding Era Collection online is a massive undertaking, but one that doesn't really serve as a model for developing new digital editions. Perhaps because it was led by an academic press, the Collection seems wedded to the idea that readers want to see a digital version of the old print edition. Yes, it provides text searching across the entire collection, but the organizing principle is not the document, but the edition and the volume. Granted, merging the work of so many different editorial projects is no simple task, as even slight differences in editorial approach or transcription styles can result in unexpected cross-edition search results. Despite its lush appearance and useful hyperlinks to texts mentioned within each edition, this digital publication still feels very much like using a book, or a series of books, rather than an integrated digital collection. Some examples:

  • Each edition is organized by the original print series and volume. Documents retain the original print volume's page breaks, rather than those of the manuscripts they describe.

  • Because of the topical overlap in editions, the same document can appear in more than one edition—for example a letter written by George Washington to James Madison will appear both in the Washington Papers and the Madison edition. Though each edition includes its own internal hyperlinks to related texts, there is no link to take the readers between the two versions of an identical text.

  • The text searches that tie together the six editions are rudimentary. In the body of the texts, XML encoding has been used sparingly, no doubt because of the cost of converting all those back volumes. When, as often is the case in scholarly editions, a portion of the text is bracketed to indicate the editor's regularization or uncertainty, the brackets are not removed from the text searches. Thus, if one edition used brackets in a phrase and the other did not, the text search would not find all instances of that document. Searches also do not always locate variant spellings of the same word, though the Collection does employ stemming. Most documentary editions use a literal transcription policy that seeks to capture the text as written, with misspellings and abbreviations rendered as is. This isn't usually confusing when read by a human, but it becomes problematic when we rely on a computer to read the text.[11]

  • Consolidating the indexes to multi-volume editions that were published for more than fifty years is a daunting proposition. Indexing styles change over time and with historiographical trends. Rotunda has made available consolidated indexes for the Adams, Washington and Jefferson editions thus far. The indexes are created by coding in hyperlinks from the index to the volume and page number, hence the decision to retain the original pagination. Right now you can only search one index at a time and it is not clear if there will be an effort to merge them.
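The bracket and variant-spelling problems described in the list above can be illustrated with a small Python sketch. This is not how Rotunda's search works. It only assumes that a search index could be built from a normalized copy of each transcription while the literal transcription stays untouched for display. The `VARIANTS` table and the sample sentence are invented.

```python
import re

# Hypothetical table mapping literal spellings to a regularized form.
VARIANTS = {"recieved": "received"}


def search_key(text):
    """Build a normalized search key from a literal transcription.

    Editorial brackets are dropped but their contents kept, case and
    punctuation are ignored, and known variant spellings are regularized.
    """
    without_brackets = re.sub(r"[\[\]]", "", text)
    tokens = re.findall(r"[a-z0-9']+", without_brackets.lower())
    return " ".join(VARIANTS.get(token, token) for token in tokens)


def phrase_in(phrase, transcription):
    """True if `phrase` occurs, on whole-word boundaries, in the normalized text."""
    return f" {search_key(phrase)} " in f" {search_key(transcription)} "
```

With this normalization, `phrase_in("received your letter", "I have recieved your let[ter] of the 4th.")` succeeds even though the literal text contains neither the regular spelling nor the unbracketed word. A real edition would need to distinguish kinds of brackets (conjectural readings, editorial insertions), which this sketch treats identically.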

Most of the questions raised by the American Founding Era Collection's digitization are ones that every editor would have to address when trying to make the conversion to digital form. It is made that much more complicated by the number of projects and legacy print volumes. The decision to maintain the organizational structure of the volumes limits the functionality that new volumes of the Founding Era or other editions can have if they want to join Rotunda. Of the six editions represented in the American Founding Era, only one was “born digital.” The Dolley Madison Digital Edition, when used as a standalone product, presents a more website-like document display. To the left of the text are links to biographical identifications of names mentioned, keywords assigned to the text, and places mentioned. Each of these is linked to fuller annotations, short biographical studies and descriptions of the places. A summary of the document is used on search results pages to help the reader decide which documents to open. This digital edition doesn't offer the indexing depth found in the Washington, Adams or Madison indexes, but it does provide a more useable and flexible digital text. When the Dolley Madison Edition is searched as part of the American Founding Era Collection, however, we only see the transcription. None of the annotation is visible. All the reader gets is a somewhat cryptic link at the bottom of the text that advises the reader to “See this document in the standalone Dolley Madison Digital Edition.”[12]

I don't mean to be overly critical of Rotunda's American Founding Era project. Rotunda is the only organization even making the attempt to tackle the issue of digitizing legacy editions. I don't think that Rotunda is claiming to be developing the next generation of digital editions, but because they are actively seeking new projects, their approach has the possibility of becoming a de facto standard for digital editions. I worry that its very literal approach to these print volumes may inhibit the development of more ambitious digital editions. Rotunda's light encoding of the texts and its limited search options do not maximize the capabilities of XML encoding. Rotunda editions are likely to always resemble print publications. Neither does it have the capacity to include social media tools that digital humanists and web-savvy readers seek to foster collaboration and reader engagement. And in order to cover its costs, Rotunda charges a subscription fee for access to the collection. To be fair, our small, intensive, and understaffed digital edition doesn't provide a good model either, with little in the way of support staff or capability for migration to newer data formats.

So where do we go from here? We need to do some hard thinking about how digital editions should look and how we can sustain them. Digital media presents a number of challenges to the way that editors think about their texts and how they prepare them for the public. Thinking and talking about some of these and trying to see how XML might affect our decisions may help developers to anticipate future needs.

  • Images of manuscripts are now easier to digitize and serve over the web. If editors can link their transcriptions to images of the manuscript, will that change the role of the transcription? How many people really want to see the image? How many readers really want to see all those strikeouts, additions, false starts and other complications of a text? While we could provide two different views of a transcription using stylesheets, do we need to do it? Could most of the people interested in the complexity of the original be served by looking at a good digital image? Could we then default to a regularized transcription that would be easier to read and more accurate as the base of text searches? Perhaps we can encode links to the document image in places where the user might want to consult it. I am sure that some editors and some users would not feel comfortable doing this, but I think that it is an option worthy of a trial. It might not be appropriate for complex literary editions, but for many historical editions, it might serve well.

  • The idea that once an edition was done it was unlikely to be done again is not a product of the digital age. Once text is digitized, particularly when using markup like XML, it becomes far easier to re-purpose it, run it through text analysis tools, add new levels of encoding, and open up the possibility that other scholars might find new uses for our old editions. At some point, a scholar can go back to the American Founding Era Collection and encode those variant spellings, or create a version that ignores brackets when searching. Someone might even want to try to tackle creating a comprehensive index. The chances of this happening go directly to the question of sustainability; because these texts were encoded in XML, they should be useable, so long as the scholars are allowed to use them.

  • How will the editor's job as annotator change as more and more materials are made available on the web? In days past, the editors' subject specialization and familiarity with hard-to-find primary and secondary sources ensured high quality annotation. No single scholar could dedicate the hours and years that long-term editing projects do to their subjects. But now, the availability of more and more web-based resources means that many once hard-to-find sources are readily available to the average reader. Should the editor still summarize a book when he can link directly to it on Google Books? Do we need to provide a short biographical identification when we can add a link to an entry in the online American National Biography, the individual's obituary in the New York Times, or heaven-forbid, his entry in Wikipedia? I don't actually think that links can replace all kinds of annotation, but with many kinds of facts easier to find every day, editors should question how they annotate documents. As the digital edition reaches further outside its boundaries for annotation, it may start to resemble a web site more than a book. With the ability to use the rest of the World Wide Web as linkable resources, will editions begin to resemble an ever expanding “life and times” of the subject, limited only by the questions asked by the researcher, and the paths they choose to take?

  • Following up on the idea that annotation may change, it strikes me that content encoding might replace at least some annotation and indexing tasks. If instead of spending time annotating, we can use our expertise to encode links to other documents, to names, organizations, and topics, and spend more time creating in-depth indexing entries, we may be able to provide as good a service to the readers as we now do by conducting annotation research. All annotation will not go away, but the editor will be freed to focus on the difficult concepts that the average reader might not be able to find for herself.

  • One of the promises of digital publication is that it will make collaboration easier. One can see that an editing team in the 21st century might not need to reside in the same city or same continent. Figuring out how we can use cloud computing to construct digital editions, and looking into how we might credit contributions may help attract collaborators that have a skill or specialization we lack. One of the main problems with digital scholarship, especially of a collaborative nature, is the inability to easily cite the works on vitae or resumes. This can dissuade some academics from participating in team-based research as they build tenure portfolios. Can we develop new systems where portions of the edition are credited, such as translations, annotations and metadata?

  • How will social media networks affect editing and XML? Web 2.0 tools are increasing in sophistication and enabling large numbers of people from all walks of life to participate in the creation of editions. One could conceive of a digital edition constructed as a wiki by volunteers who locate, digitize, transcribe, research, and proofread historical texts. Such a wikidition could grow either incrementally or exponentially depending on its ease of use, general interest, and word of mouth. If it should become even a hundredth as popular as Wikipedia has, one could see a large and diverse collection of materials taking shape outside of the control of editors and scholars. Blogging software has been used to present diaries, like that of Samuel Pepys, a site that encourages readers to comment on individual entries or provide more formal annotation in a companion digital encyclopedia.[13] Investigations of the feasibility of using crowd sourcing for transcription and annotation are currently underway at the Papers of the War Department, an image-based digital edition sponsored by George Mason University.[14] If any of these experiments take off, how will we preserve the digital editions they create? Can one export a wiki or a blog post into an XML format for long-term preservation?[15] Can we develop XML-capable wikis and blogs that retain their ease of use?

  • Will editors eventually include their project research files and databases in their editions? Only a small amount of the research conducted by editing projects makes it into the footnotes of their published editions. Should we share these research files, open up project chronologies, genealogies, image files, and name databases? Should we blog about research queries that come into our offices and about our own research undertaken for the edition? Should we somehow provide our readers with the experience of working in our editorial offices, where libraries, vertical files, word processing files, and databases are all pressed into the service of one topic? If we do, can we easily convert these kinds of work files to XML, or should we be seeking the development of an XML Office Suite that can handle our ongoing needs and also make them sustainable and accessible over the long haul?
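The export question raised in the list above can be made concrete with a small sketch. It assumes a blog post is available as a plain dictionary with a title, paragraphs, and reader comments; the dictionary keys, the sample entry, and the function name post_to_xml are all invented for illustration. The element names are loosely modeled on TEI but carry no namespace and are not validated against any schema, so this shows only the general shape of such an export, not a preservation-ready format.

```python
import xml.etree.ElementTree as ET

# Hypothetical blog post; the keys and the content are invented for illustration.
SAMPLE_POST = {
    "title": "Diary entry, 1 January",
    "paragraphs": ["Up early & to the office."],
    "comments": [{"author": "reader1", "text": "Which office?"}],
}


def post_to_xml(post):
    """Serialize a blog-post dictionary as a small TEI-like XML string."""
    root = ET.Element("TEI")

    # Minimal header: only the title is recorded here.
    header = ET.SubElement(root, "teiHeader")
    file_desc = ET.SubElement(header, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = post["title"]

    # Body: one division per post, one <p> per paragraph.
    text = ET.SubElement(root, "text")
    body = ET.SubElement(text, "body")
    entry = ET.SubElement(body, "div", {"type": "entry"})
    for paragraph in post["paragraphs"]:
        ET.SubElement(entry, "p").text = paragraph

    # Reader comments are kept as notes, attributed to the commenter.
    for comment in post["comments"]:
        note = ET.SubElement(
            entry, "note", {"type": "comment", "resp": comment["author"]}
        )
        note.text = comment["text"]

    # ElementTree escapes characters such as "&" during serialization.
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(post_to_xml(SAMPLE_POST))
```

Even in this toy form, the reader comments survive the export as attributed notes rather than being discarded, which is the part of a social edition most likely to be lost in a naive migration.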

Will we develop new ways of thinking about documents if we look outside of our traditions and perhaps if we look beyond what XML offers at present? Looking at other applications, like GIS (geographic information systems), for inspiration on how to organize information might be instructive. It strikes me that the way a GIS map is constructed, made of discrete layers and kinds of data that can be selected in any combination by the user, makes for an interesting model for digital editions. If the transcription and linked image served as the “map,” with interpretation, annotation, and metadata organized as stand-off encoding, many people could share the transcription but be free to add their own interpretative layer. A letter written by radical anarchist Emma Goldman to Margaret Sanger while Goldman was on a speaking tour in Portland in 1915 would interest both the Sanger and Goldman editing projects. The Goldman project might focus its annotation on Goldman's doings and ideas, whereas the Sanger editors might see the letter as evidence of Goldman's mentoring role in these years. Other interested parties could also use the letter, adding their own comments, contextualization, and interpretation to it. For example, a staff member at an Oregon historical society could use the letter in an online exhibit on the radicalism of Portland at the time. He could link the places mentioned to maps or historical photographs of the city. A genealogist might simply comment on a passing reference to her great-grandfather, adding a link to her web-based genealogy. Each of these users of the text would be adding to its meaning in different ways, all of which could enrich a reader's experience of the letter. How can we create this kind of document in a way that allows any user to add their annotation, and also allows users to choose to see any combination of these annotations simply by selecting them with a mouse click?
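One way to picture the layered model described above is a minimal sketch in which the transcription is never modified and each contributor's annotations are stored separately, anchored to it by character offsets. Everything here is an assumption made for illustration: the sample sentence is invented and is not a quotation from the Goldman letter, and the layer names and the functions annotate and view are hypothetical. A real edition would more likely express the same idea in stand-off XML.

```python
from dataclasses import dataclass

# Invented placeholder text, NOT a quotation from the actual Goldman letter.
TRANSCRIPTION = "Dear Margaret, I lecture in Portland on Friday evening."


@dataclass(frozen=True)
class Annotation:
    layer: str  # who contributed the note, e.g. "sanger" (hypothetical names)
    start: int  # character offset into the shared transcription
    end: int    # end offset, exclusive
    note: str


def annotate(text, layer, phrase, note):
    """Anchor a note to the first occurrence of phrase; the text is not edited."""
    start = text.index(phrase)
    return Annotation(layer, start, start + len(phrase), note)


def view(text, annotations, selected_layers):
    """Return (phrase, layer, note) tuples for only the layers a reader selects."""
    chosen = sorted(
        (a for a in annotations if a.layer in selected_layers),
        key=lambda a: a.start,
    )
    return [(text[a.start:a.end], a.layer, a.note) for a in chosen]


# Three independent interpretive layers over one shared transcription.
LAYERS = [
    annotate(TRANSCRIPTION, "goldman", "lecture",
             "Part of Goldman's 1915 speaking tour."),
    annotate(TRANSCRIPTION, "sanger", "Margaret",
             "Evidence of Goldman's mentoring role."),
    annotate(TRANSCRIPTION, "oregon", "Portland",
             "Link to a historical map of the city."),
]

if __name__ == "__main__":
    # A reader switches on two of the three layers, as with layers on a GIS map.
    for phrase, layer, note in view(TRANSCRIPTION, LAYERS, {"sanger", "oregon"}):
        print(f"[{layer}] {phrase!r}: {note}")
```

Because each layer refers to the transcription only by position, any number of projects can add notes without touching one another's work. The cost is that the offsets break if the transcription is later corrected, which is one reason stand-off approaches need stable, versioned base texts.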

I don't know if XML can do this, but what I am getting at is that if we don't keep thinking creatively about how we might present these documents, we will end up merely replicating old book editions in digital form. If we don't continue to evolve and improve, we run the risk that other kinds of digital publication, perhaps those that are not as long-lasting, will become more popular because they have better functionality. XML was created to render in digital form the publications and scholarship that we were already producing in print. So it will be good at doing that, and maybe not as good at representing less structured organizations of texts. That doesn't mean that we shouldn't figure out ways to make XML do what we need it to do; it just means that it might be harder.

Since we don't know, and can't know, where digital history and digital editing will go in the years to come, how can we ensure that the work we are doing now will last five years, never mind fifty? Sustainability is the capacity to endure change, and if we can say one thing about technology, it is that it is constantly changing. There are a few simple ways to ensure the longest life for our work. One is to make high-quality content. If the legacy volumes of the Founding Fathers projects were not seen as a valuable resource, there would be no great effort made to preserve them. Because they have lasting value, efforts were made and money spent to keep them viable and accessible. The other best practice is to do what you can to make it easy for the next generation to preserve your work. I don't think we have to promise that it will always last, just that it will last until the next generation of technology comes along. At that point, if the value of the work isn't there, it won't be preserved. If it is, it should not be that difficult to migrate it. So don't develop your own markup language unless you are truly a genius, and if you do, make sure to share it with the world. Pay attention to what the digital humanists of the day are using and advising. If we stick with the educated pack when it comes to data formats, we can reasonably expect that the tools will be there to preserve our work. In short, that means we should use XML.

But we are not the only ones with a responsibility for making our texts last. A big part of determining whether or not a format is sustainable is whether it achieves buy-in from those it seeks to serve. As I said, most editors know that they need to use XML to create their digital editions, but that doesn't mean they really want to. We need better tools and encoding environments to win over editors and other content providers. We need increased and sustained educational offerings, along with practical examples and templates that can help the greatest number of content providers, whether they be editors, archivists, scholars, or students, to put up XML-encoded manuscript material, and we need to make available the programs and stylesheets that will make these texts display clearly and will take advantage of the encoding to generate valuable searches. It behooves us to master XML encoding in order to take a creative part in the development of digital editions. If we hand XML encoding over to consultants or to publishers, we are unlikely to get the kind of rich encoding that can substitute for annotation or indexing.

XML encoding is expensive. Even well-funded providers like Rotunda do not have the manpower to create in-depth encoding. It is no surprise that some of the best digital editions are coming out of universities that have dedicated digital humanities centers. Virginia, Nebraska, and Brown have built expertise and tools in XML encoding that benefit their affiliated projects. George Mason University's Center for History and New Media has taken a different approach, fostering image-based projects that use its open-source Omeka software, which relies upon form-based creation of Dublin Core metadata for each object. The cost of adopting XML for your edition at an institution that is not engaged with digital humanities is high. Educational opportunities on the web as well as through in-person workshops are wonderful resources, but they don't replace the kind of extended aid one can get from experts or consultants. Lacking funding to build up a national program of educational resources on XML, we need to foster more communication among XML users working with digital texts, archives, literature, and museums. We can learn from each other and share with them the special experience that we have in publishing historical texts in easy-to-read forms.

Ultimately, the best thing that we can do to ensure the long-term sustainability of our editions is to engage more fully with XML developers and our colleagues working with similar materials. Those of us who use XML ought to encourage our peers to become more involved and truly master the capabilities and limitations of the format. We need to pursue joint projects and consortia to fund this development. If we can build some simple tools that can get editors started on encoding their manuscript material and displaying it on the web, we will have come a long way towards ensuring that both the XML format and the work of historical editors will be around for the long haul.

[1] The National Historical Publications and Records Commission mandates that publishers follow archival permanence standards for paper, printing, and binding (accessed 28 July 2010).

[2] Thomas Jefferson to Ebenezer Hazard, Feb. 18, 1791, in The Papers of Thomas Jefferson Digital Edition, ed. Barbara B. Oberg and J. Jefferson Looney (Charlottesville: University of Virginia Press, Rotunda, 2008) (accessed 28 July 2010); also published in Main Series, Volume 19 (24 January-31 March 1791).

[3] See, for example, Meghan Marshall, “The Impossible Art of Deciphering Manuscripts,” Slate, Feb. 8, 2008 (accessed 28 July 2010), for the reception given to Robert Faggen's The Notebooks of Robert Frost (Harvard University Press, 2007).

[4] One of the most influential of early editors, Julian Boyd, the founder of the Thomas Jefferson Papers, became well-known for his copious introductions to documents, one of which famously ran to thirty pages. See Mark F. Bernstein, “History, letter by letter,” Princeton Alumni Weekly, May 14, 2003 (accessed 28 July 2010).

[5] See the Women Writers Project's website (accessed 29 July 2010), the Walt Whitman Archive (accessed 29 July 2010), the Willa Cather Archive (accessed 29 July 2010), and the William Blake Archive (accessed 29 July 2010), among others.

[6] See the TokenX tool page at the Cather Archive (accessed 29 July 2010).

[7] See Esther Katz, ed. The Margaret Sanger Microfilm Edition (Bethesda, Md., Lexis-Nexis, 1995-96) and Esther Katz, ed. with Cathy Moran Hajo and Peter C. Engelman, The Selected Papers of Margaret Sanger, 4 vols. (Urbana, Illinois, 2003-).

[9] For details on the software see Matthew Zimmerman, “Publishing XML Documents on the Web,” Connect: Information Technology at NYU (Fall 2003) (accessed 30 July 2010).

[10] Currently, the American Founding Era contains the Adams Papers, the George Washington Papers, James Madison Papers, Thomas Jefferson Papers, Dolley Madison Papers, and the Documentary History of the Ratification of the Constitution. Future additions planned include the Alexander Hamilton Papers, the Andrew Jackson Papers, the John Jay Papers, and the John Marshall Papers. For more on Rotunda see; for up to the minute statistics on the American Founding Era collection see (Accessed 29 July 2010).

[11] To see this, consult George Washington to James Madison, June 12, 1784. In the first paragraph, a sentence begins “Must the merits…” The Madison Papers transcribed the phrase as “Mus[t] the merits” while the Washington Papers silently regularized the text. When searching for “Must the merits,” only the Washington Papers version of the document is returned. Inconsistencies of this kind will likely compromise the accuracy of searching throughout the collection. (Accessed 29 July 2010.)

[12] See, for example, Dolley Payne Todd Madison to Anna Payne Cutts, 18 May 1804 ( and [accessed 29 July 2010].)

[13] See (Accessed 29 July 2010).

[14] For a short description of this effort see; for more on the Papers of the War Department see (Accessed 29 July 2010). This project is jointly funded by the National Endowment for the Humanities and the National Historical Publications and Records Commission.

[15] As this paper was given, the Center for History and New Media announced the creation of Anthologize, built during its One Week, One Tool program. Anthologize is a WordPress plugin that creates eBooks out of websites; these can be saved in TEI format. For more on this development, see

Cathy Moran Hajo

Cathy Moran Hajo is the Associate Editor and Assistant Director of the Margaret Sanger Papers, a scholarly editing project located at NYU. With the Sanger Papers, she has published three volumes of The Selected Papers of Margaret Sanger, a two-series microfilm edition, and two electronic publications. She has worked as a documentary editor for over twenty years, specializing in the publication of historical materials in digital form, and participating in scholarly conferences and meetings on digital issues.

Cathy is a Past President of the Association for Documentary Editing. Dr. Hajo received her PhD from NYU in 2006. She is the author of several articles on documentary editing, most recently, “Scholarly Editing in a Web 2.0 World,” (Documentary Editing, 2009) and “Last Words: Documenting the End of Lives,” (Documentary Editing, Fall 2006).

In addition to her work with the Sanger Project, she is the author of Birth Control on Main Street: Organizing Clinics in the United States, 1916-1939 (U. of Illinois Press, 2010).