(Re)building the TEI Website: A Bit of History and New Directions

Hugh Cayless

Abstract

The Text Encoding Initiative has had a web presence for almost thirty years. It's instructive to consider how a large, robust, and widely-used XML vocabulary defines its presence on the web, how it has weathered the storms of change (management, institutional, technological) to be where it is today, and how it imagines its future.

The Text Encoding Initiative has had an online presence since the early days of the web. It has progressed from old school static HTML, to dynamic XML processing systems, to WordPress, and most recently to a static site built via Continuous Integration using Eleventy. I will survey where the site has been over the years and then talk about how its most recent iteration handles some of its architectural quirks, including XML sources!

The Beginnings

The Text Encoding Initiative (TEI), which develops and promulgates a set of guidelines for the markup of cultural heritage texts for research purposes, began before the World Wide Web existed. Its origins date back to a meeting at Vassar College in Poughkeepsie, New York in November 1987.^[1] It began as a joint effort of the Association for Computers and the Humanities, the Association for Literary and Linguistic Computing, and the Association for Computational Linguistics and was later organized (in 2001) as a consortium. The tei-c.org domain, the TEI Consortium's home on the web, was registered on March 22nd, 1999, and the first available capture in the Internet Archive is from October 9th of that year (figure 1). Before that, the site was hosted by the University of Illinois at Chicago, maintained by Wendy Plotkin and our own, dearly missed, Michael Sperberg-McQueen (figure 2).^[2] The TEI is one of the longest continuously-running Digital Humanities projects in existence. It serves as the infrastructure for very many text-based projects worldwide. The history of the organization created to support the development of the TEI Guidelines, the TEI Consortium (and along with it the website), overlaps with the period archived by the Internet Archive, and so we can observe its full history at the tei-c.org domain. It has transitioned through management on a variety of academic hosts, by a variety of organizations, and finally to being self-managed. It contains (and always has) a variety of resource types, with different and overlapping publishing pipelines. It therefore makes for an interesting case study of how scholarly web communication has evolved over the last few decades.

From there, the TEI site moved to the Institute for Advanced Technology in the Humanities at the University of Virginia (UVA), but development and site generation were done by Oxford University Computing Services (OUCS) and mirrored to the host at UVA. It was in this period that the site began to be generated from XML sources, first using an XSLT 1.0 stylesheet and the Saxon XSLT processor (figs. 3-5).

Figure 3: DOCTYPE declaration and comment from the front page of the 2004 site

<!DOCTYPE html
  PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
   <!--THIS FILE IS GENERATED FROM AN XML MASTER. 
 DO NOT EDIT-->

Figure 4: Comment from the 2004 site

<!--
Generated using an XSLT version 1 stylesheet
based on http://www.oucs.ox.ac.uk/stylesheets/teihtml.xsl
processed using SAXON 6.3 from Michael Kay-->

Because so much of the TEI's early content was static HTML and other formats (including, e.g. GML, Waterloo SCRIPT, PostScript, and PDF), that content has for a long time been stored in a section of the website known as the Vault. The Vault contains content that may or may not be viewable with a web browser, but even if it is, does not follow the formatting conventions (such as menus) that the rest of the site does. Published copies of the TEI Guidelines go in there, as well as old project outputs, meeting minutes, etc. A large portion of the site has thus always been static.

Dynamic, XML-driven Websites

In 2005, OUCS migrated the website to an Apache Cocoon^[3] based system (see fig. 6). Cocoon was an XML-driven web publishing system that allowed users to define transformation pipelines for a variety of routes and document types. It was extremely flexible and was, for a time, much in vogue for XML-based Digital Humanities projects. The Cocoon iteration lasted only until the end of 2007, however, when the site was migrated to an OpenCMS^[4] instance managed by the University of Virginia and customized to process TEI-based sources into HTML (fig. 7). The new setup promised to allow authenticated users to directly modify the site content via the web for the first time, without needing to have access to upload source files to the web server. A MediaWiki instance was also added in 2008, which allowed broader member participation in producing site content.

The OpenCMS setup lasted a long time—longer than it should have, in fact. Page editing was done via a Java Applet and by the mid-2010s this had become very poorly supported (not to mention a security risk). In 2014, the site was moved from UVA to the Alliance of Digital Humanities Organizations' (ADHO) cluster in Hamburg. In mid-2016 Kevin Hawkins, then the TEI web administrator, announced an RFP^[5] to migrate the site to WordPress. The plan was to work in two phases. The first would create a WordPress site with the same look and feel as the OpenCMS site; the second would work on refactoring the site to improve its aesthetics and usability. Phase 1 took a long time and was finally completed in 2018 (fig. 8). Phase 2 never began.

WordPress

Because WordPress doesn't have native support for XML sources, the TEI files in OpenCMS were converted to HTML as part of the migration. This conversion did not always result in very clean HTML. The TEI header information, for example, was dumped in a hidden HTML div on each page. Oddities like this did not always work well with WordPress's HTML editor. But what was worse, the switchover came scarcely a month before ADHO suffered a major disk failure, which disabled their services for an extended period. Since the TEI website was unavailable, without an estimated recovery time, we decided to temporarily relocate it to the University of Victoria, where our then webmaster, Luis Meneses, worked and therefore had access to their computing infrastructure. Luis accomplished this with remarkable efficiency, and the TEI site remained at UVic for about a year, but the announcement that Compute Canada would be reorganized^[6] prompted a search for a new host. Laurent Romary suggested Huma-Num^[7], the French computing infrastructure for the Social Sciences and Humanities, and helped arrange the transition. In August 2019 the site and other TEI services moved to three virtual machines hosted by Huma-Num, where they have been ever since. All of this churn meant that any immediate energy that might have gone into remediating the infelicities of the new site was diverted into rescuing it.

While WordPress, since it is a Content Management System like OpenCMS, seemed like a good fit for the TEI website, it was in fact not optimal. Authenticated users could edit pages on the site and add news articles, etc., but this was often an awkward process due to the state of the once-TEI HTML sources. In addition, since the sources had been migrated over in totality and not pruned or reorganized, the site's structure itself was extremely unwieldy. As a result, site maintenance slipped and much-desired projects, such as an architectural redesign and translations of the site, seemed out of reach. Some of the website's sources were (and still remain) TEI XML documents, notably Technical Council documentation and more recently the TEI Bylaws. Displaying these in WordPress involved either keeping the XML and WordPress HTML versions in sync manually, developing a different publishing pipeline (like the one for the TEI Guidelines), or utilizing a custom-developed plugin. All of these strategies were used at one time or another, but none of them was very satisfactory. OpenCMS handled content URLs by directly referencing the source filename, so, for example, the homepage resolved to http://tei-c.org/index.xml. This was incompatible with WordPress's URL conventions, wherein URLs generally end with a forward slash. The problem was solved with a 1,825-item redirect list, which only added to the unmanageability of the overall site.

Moreover, since WordPress is notoriously vulnerable to being compromised, keeping it and its plugins patched was a headache and the fear of being hacked was a source of constant stress. Another solution seemed called for, but it needed to handle our idiosyncratic mixture of sources, provide for easy content editing, and offer a much higher level of security.

Back to the Future

In 2023, I began an effort to reimagine the TEI website as a static site, with a new design, and a mission to pare back the cruft. In such a setup, the sources could be contained in a GitHub repository, which could handle the CMS functions of the WordPress site (user authentication, online editing, etc.). The requirements were 1) support for source files in both Markdown and (crucially) TEI, 2) seamless integration with the TEI Guidelines^[8], 3) a simpler editing workflow, and 4) a more attractive, up-to-date appearance with less complexity.

After evaluating several static site generators, including Jekyll and Hugo, I settled on Eleventy, a very flexible JavaScript-based static site generator. The requirement to handle TEI XML as a source format meant some level of customization would be required, and Eleventy makes registering new source types very easy. Moreover, it is written in JavaScript, a language I am very familiar with.^[9] To get the content out of WordPress, I exported it as an XML dump and wrote a converter to extract pages by link pattern from the export. The converter, written in Python, parsed the XML, pulled out the HTML pages that matched the designated link patterns, then converted them to Markdown. This meant I could selectively extract content rather than blindly re-create the sprawling mess of the old site. The new site went live on September 3, 2024. The source code is located at https://github.com/TEIC/website and is directly editable by members with commit access to the repository. Site rebuilds are managed by a GitHub Action and are triggered by a push to the main branch or to the Documentation repository^[10], which contains the TEI XML sources for TEI Council Working Papers and the TEI Consortium bylaws. The site typically builds and deploys to the TEI web server in about 30 seconds.^[11]

Balisage attendees and readers will no doubt be interested in the XML-processing pipeline, which is, I'm afraid, somewhat boring. Eleventy allows developers to add new Template types in its configuration, so I created one that handles XML files and compiles them by piping them through CETEIcean, a JavaScript library for displaying TEI documents on the web. The documentation files you see on the site are therefore static HTML with TEI Custom Elements and CSS to style them. The whole process is accomplished with fewer than 30 lines of code.

Figure 9: Configuring an XML-based template language

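  // Assumes fs (Node's file system module), JSDOM (from the jsdom package),
  // and CETEI (from CETEIcean) are imported earlier in the configuration file.
  eleventyConfig.addExtension("xml", {

    // Supply page metadata (title and navigation entries) for each TEI source.
    getData: async function(inputPath) {
      const file = fs.readFileSync(inputPath, 'utf8');
      const jdom = new JSDOM(file, { contentType: "text/xml" });
      // Skip XML files that are not TEI documents.
      if (!jdom.window.document.querySelector("TEI")) {
        return;
      }
      return {
        "title": jdom.window.document.querySelector("titleStmt > title").textContent,
        // File name without its directory or ".xml" extension.
        "navkey": inputPath.replace(/.*\//, "").replace(/\.xml$/, ""),
        "eleventyNavigation": {
          // Council working papers (TCW) are listed under "Council".
          parent: inputPath.includes("TCW") ? "Council" : "About",
          key: inputPath,
          title: jdom.window.document.querySelector("titleStmt > title").textContent
        }
      }
    },

    // Convert the TEI source into HTML Custom Elements via CETEIcean.
    compile: async function(contents, inputPath) {
      const jdom = new JSDOM(contents, { contentType: "text/xml" });
      // Skip XML files that are not TEI documents.
      if (!jdom.window.document.querySelector("TEI")) {
        return;
      }
      let cetei = new CETEI({ documentObject: jdom.window.document });
      let doc = await cetei.domToHTML5(jdom.window.document);
      // Eleventy calls the returned function to render the page content.
      return async (data) => {
        return cetei.utilities.serializeHTML(doc, true);
      };
    }
  });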

The getData function adds page metadata, including navigation data that enables the display of links to Documentation pages on the Council Activity page. The compile function is the meat of the process, where the source document is loaded into a DOM and converted to TEI-flavored Custom HTML Elements. These are styled in the usual way with CSS.
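The path handling in getData can be tried out in isolation. The following is a minimal sketch, assuming that navkey is meant to be the bare file name without its directory or extension; the helper names and the example paths are hypothetical and are not taken from the website repository.

```javascript
// Hypothetical helpers mirroring the path handling in getData.
// Assumption: navkey should be the bare file name, e.g. "tcw20".

// String.prototype.replace treats a string pattern as literal text,
// so stripping the directory part needs a regular expression.
function navKeyFromPath(inputPath) {
  return inputPath
    .replace(/.*\//, "")     // drop everything up to the last "/"
    .replace(/\.xml$/, "");  // drop the trailing ".xml" extension
}

// Documents whose path contains "TCW" are listed under "Council";
// everything else falls under "About", as in Figure 9.
function navParentFromPath(inputPath) {
  return inputPath.includes("TCW") ? "Council" : "About";
}

// Example paths are made up for illustration.
console.log(navKeyFromPath("./Documentation/TCW/tcw20.xml"));    // tcw20
console.log(navParentFromPath("./Documentation/TCW/tcw20.xml")); // Council
console.log(navParentFromPath("./Documentation/bylaws.xml"));    // About
```

Keeping these as small pure functions makes it easy to check the navigation keys without running a full Eleventy build.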

The TEI Guidelines are built from their own distinct sources, which reside in their own GitHub repository. They use a custom XSLT pipeline which produces HTML pages, and for the current release series, P5, all of them are available in the Vault under https://tei-c.org/Vault/P5/. Full integration of the TEI Guidelines entails adjusting the CSS they use in the build process so that they match the look and feel of the main website. But also, importantly, current versions should support the same menus as the rest of the site. The WordPress version relied on an API call to fetch the menus as JSON and dynamically write them into the page using JavaScript. The new site does something very similar, but with a static JSON representation of the menu data, which is used as a data source in the website build procedure and is also made available directly.

Figure 10: Menus

[
  {
    "id": "guidelines",
    "en": {
      "name": "Guidelines",
      "items": [
        {
          "name": "Current Guidelines",
          "url": "/release/doc/tei-p5-doc/en/html/index.html"
        },
        {
          "name": "Older Versions",
          "url": "/guidelines/p5/"
        },
        {
          "name": "Customization",
          "url": "/guidelines/customization/"
        },
        {
          "name": "Licensing & Citation",
          "url": "/guidelines/licensing-and-citation/"
        },
        {
          "name": "TEI @ GitHub",
          "url": "https://github.com/TEIC/TEI"
        },
        {
          "name": "About the Guidelines",
          "url": "/guidelines/"
        }
      ]
    },
    "es": {
      "name": "Directrices",
      "items": [
        {
          "name": "Directrices actuales",
          "url": "/release/doc/tei-p5-doc/es/html/index.html"
        }, ...

Figure 11: Using the menus

    <div class="collapse navbar-collapse" id="TEIMenu">
      <ul class="navbar-nav ms-1 me-auto mb-2 mb-lg-0">
        {% for menu in menus %}
          {% if menu.en.items %}
            <li class="nav-item dropdown">
              <a class="nav-link dropdown-toggle" href="#" id="{{ menu.id }}Menu" role="button" data-bs-toggle="dropdown" aria-haspopup="true" aria-expanded="false">
                {{ menu.en.name }}
              </a>
              <div class="dropdown-menu" aria-labelledby="{{ menu.id }}Menu">
                {% for item in menu.en.items %}
                  <a class="dropdown-item" href="{{ item.url }}">{{ item.name }}</a>
                {% endfor %}
              </div>
            </li>
          {% else %}
            <li class="nav-item">
              <a class="nav-link" href="{{ menu.en.url }}">{{ menu.en.name }}</a>
            </li>
          {% endif %}
        {% endfor %}
      </ul>
      ...
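For pages produced outside the Eleventy build, such as the Guidelines, the same menu data could be written into the page with JavaScript. The following is a minimal sketch of that idea, not the site's actual script: the function names, the fallback to English when a translation is missing, and the "news" entry in the sample data are all assumptions made for illustration.

```javascript
// Hypothetical client-side rendering of menu data shaped like Figure 10.

// Escape text for safe use in HTML content and attribute values.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Render one top-level menu for the requested language.
// Assumption: fall back to the English entry when no translation exists.
function renderMenu(menu, lang) {
  const entry = menu[lang] || menu.en;
  const name = escapeHtml(entry.name);
  if (!Array.isArray(entry.items)) {
    // A plain link, like the else branch of the template in Figure 11.
    return `<li class="nav-item"><a class="nav-link" href="${escapeHtml(entry.url)}">${name}</a></li>`;
  }
  const links = entry.items
    .map((item) => `<a class="dropdown-item" href="${escapeHtml(item.url)}">${escapeHtml(item.name)}</a>`)
    .join("");
  return `<li class="nav-item dropdown"><a class="nav-link dropdown-toggle" href="#">${name}</a>` +
    `<div class="dropdown-menu">${links}</div></li>`;
}

function renderMenus(menus, lang) {
  return menus.map((menu) => renderMenu(menu, lang)).join("\n");
}

// Abbreviated sample data; the "news" entry is made up for illustration.
const sampleMenus = [
  {
    id: "guidelines",
    en: { name: "Guidelines", items: [{ name: "Licensing & Citation", url: "/guidelines/licensing-and-citation/" }] },
    es: { name: "Directrices", items: [{ name: "Directrices actuales", url: "/release/doc/tei-p5-doc/es/html/index.html" }] }
  },
  { id: "news", en: { name: "News", url: "/news/" } }
];

console.log(renderMenus(sampleMenus, "es"));
```

In a browser, the array would come from fetching the published JSON file rather than from inline sample data, and the resulting markup would be inserted into the navigation list.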

The question of editing workflow improvements is a little harder to quantify. Since the new site's release, there have been 271 changes committed to the website repo; of those, 85 are mine and the rest are by 15 other members of the TEI community.^[12] For a similar period of time in 2023–2024, the WordPress site had about 40 page creations or edits and 7 posts (news items) by about 5 editors. GitHub is likely a much friendlier environment for the sort of person who works with TEI, but this represents more than a doubling both of work done and of contributors doing the work.

As for the site's appearance and usability, I will leave it to the audience to judge whether the new site (fig. 12) is an improvement. There remain some broken links and there is still content from the old site that needs to be moved over, but that is being done in response to community needs and the old site remains available for the time being in a read-only state at https://old.tei-c.org/. The new site has been checked against the Web Content Accessibility Guidelines 2.2 and passes with no violations.^[13]

The new website structure and workflow are also allowing us to develop internationalized versions of parts of the site, something we have long wished to do, but which proved difficult in the WordPress régime. A contributor from Argentina has been adding translations this summer, and the site has been configured to deliver a Spanish version of the homepage if a user's language preferences have been set accordingly.
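How the language selection is configured is not described here; it could equally be done in web server configuration or in a client-side script. As a minimal sketch of the underlying idea, assuming the preference arrives as an HTTP Accept-Language header and that English is the default, a selection function might look like this (all names are hypothetical):

```javascript
// Hypothetical language negotiation based on an Accept-Language header.

// Parse the header into language tags sorted by descending q-value.
function parseAcceptLanguage(header) {
  return header
    .split(",")
    .map((part) => {
      const [tag, ...params] = part.trim().split(";");
      const qParam = params.map((p) => p.trim()).find((p) => p.startsWith("q="));
      const q = qParam ? parseFloat(qParam.slice(2)) : 1; // q defaults to 1
      return { tag: tag.trim().toLowerCase(), q: Number.isNaN(q) ? 0 : q };
    })
    .filter((entry) => entry.tag && entry.q > 0)
    .sort((a, b) => b.q - a.q);
}

// Return the first supported primary language, or the fallback.
function pickLanguage(header, supported, fallback) {
  for (const { tag } of parseAcceptLanguage(header || "")) {
    const primary = tag.split("-")[0]; // "es-AR" counts as "es"
    if (supported.includes(primary)) {
      return primary;
    }
  }
  return fallback;
}

console.log(pickLanguage("es-AR,es;q=0.9,en;q=0.8", ["en", "es"], "en")); // es
console.log(pickLanguage("fr-FR,fr;q=0.9", ["en", "es"], "en"));          // en
```

Matching on the primary language subtag means a regional preference such as es-AR still selects the Spanish homepage.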

Long-running projects tend to accrete a lot of information, and making that information public-facing is a difficult job, even more so in an all-volunteer organization. The TEI site has evolved over time from a static HTML site, to one pre-generated from XML sources, then dynamically generated from those sources, then to an XML-based content management system, an HTML-based content management system, and now at last back to a static site, pre-generated from mixed sources and managed from an online Git repository. It has in some ways come full circle, even though the technologies employed have changed greatly. It is perhaps significant that the website management, which for most of the organization's existence was a DIY affair, has returned to those roots as well.

References

Campbell, Alastair et al., Web Content Accessibility Guidelines (WCAG) 2.2 (2024). https://www.w3.org/TR/WCAG22/

Cayless, Hugh and Raffaele Viglianti, CETEIcean (2016–2025). https://github.com/TEIC/CETEIcean

Ide, Nancy and C. M. Sperberg-McQueen, "The TEI: History, Goals, and Future," Computers and the Humanities 29, pp. 5–15 (1995). https://doi.org/10.1007/BF01830313

Kahle, Brewster et al., Internet Archive. https://archive.org/

^[1] See Ide and Sperberg-McQueen (1995) for a history of the TEI's beginnings.

^[2] The Internet Archive's records go back as far as 1996. The earliest capture of the site at UIC dates from May 29, 1997.

^[3] https://cocoon.apache.org/. Now retired as of 2025.

^[4] https://www.opencms.org/en/.

^[5] Hawkins, email from the TEI-L archives, 2016-06-29.

^[6] Compute Canada, which operated Canada's national advanced research computing platform, ceased operations in 2022 and responsibility for the platform was handed over to the Digital Research Alliance of Canada. See 2022 Resource Allocations Competition Results. As it turned out, the reorganization was less impactful than seemed likely in 2019.

^[7] From the Huma-Num website: "La principale mission de l’IR* est de construire, avec les communautés et à partir d’un pilotage scientifique, une infrastructure numérique de niveau international (nœud français des ERIC DARIAH et CLARIN) pour les SHS." IR* signifies a 'star' Research Infrastructure, and SHS the Humanities and Social Sciences.

^[8] Recall that the Guidelines are located in the Vault, but unlike the other content there should present as part of the website, with the same menus and styling.

^[9] And, importantly, enjoy. Jekyll is written in Ruby, a language I am very familiar with and hate. Hugo is written in Go, which I have only a passing knowledge of.

^[10] https://github.com/TEIC/Documentation.

^[11] The Eleventy build itself takes single-digit seconds to run, but the Action is also spinning up a container, installing Node.js, checking out repositories, etc.

^[12] My changes are sometimes content and sometimes code edits. The other users' changes are almost all content.

^[13] The Guidelines do get flagged for a large number of violations, for the most part because the HTML produced has not yet been upgraded to modern HTML5. Various checkers were used, including the IBM Equal Access Checker browser plugin.

Hugh Cayless

Hugh is a Senior Digital Humanities Developer at Duke University Libraries. He is the Treasurer and past Council Chair for the Text Encoding Initiative Consortium.

Balisage: The Markup Conference

Balisage Series on Markup Technologies