How to cite this paper

Takeda, Joey, and Martin Holmes. “Serverless Searching with XSLT and JavaScript: Introducing staticSearch.” Presented at Balisage: The Markup Conference 2022, Washington, DC, August 1 - 5, 2022. In Proceedings of Balisage: The Markup Conference 2022. Balisage Series on Markup Technologies, vol. 27 (2022). https://doi.org/10.4242/BalisageVol27.Takeda01.

Balisage: The Markup Conference 2022
August 1 - 5, 2022

Balisage Paper: Serverless Searching with XSLT and JavaScript

Introducing staticSearch

Joey Takeda

Digital Humanities Innovation Lab, Simon Fraser University

Joey Takeda is a Digital Humanities Developer at Simon Fraser University’s Digital Humanities Innovation Lab. He holds an MA in English Literature from the University of British Columbia. He currently serves as Technical Editor for the Winnifred Eaton Archive and is a member of TEI By Example’s International Advisory Committee.

Martin Holmes

Humanities Computing and Media Centre, University of Victoria

Martin Holmes is a programmer in the University of Victoria Humanities Computing and Media Centre. He served on the TEI Technical Council from 2010 to 2015 and was managing editor of the Journal of the Text Encoding Initiative from 2013 to 2015. He is lead programmer on several large DH projects, including The Map of Early Modern London and Digital Victorian Periodical Poetry.

Copyright rests with the authors.

Abstract

Increasing awareness of the burdens of technical debt and of the risks and costs associated with complex server infrastructures has prompted digital humanists to consider moving to purely static HTML/CSS/JS websites for projects such as online digital editions. Building static sites is not difficult, but digital editions also require sophisticated search capabilities, incorporating text search with stemming and wildcards as well as a rich array of filters tuned to the contents of the edition itself. Responding to this need, we created staticSearch, a system which builds search indexes offline with XSLT and queries them statically using JavaScript, providing a straightforward approach to search for purely static websites. This presentation will describe how staticSearch works, and discuss some of the interesting and challenging problems that such a system has to overcome.

Table of Contents

Introduction
Prior Work and Existing Solutions
What is staticSearch?
Configuration
Tokenizing
Cleaning and Pre-Processing
Stemming
Indexing
Stem Files
Additional Files
Filter Files
Search Page
Conclusion
Appendix A. Statistics
Appendix B. Sample Static Search Implementations

Introduction

The Endings project is a SSHRC-funded collaboration between scholars, programmers, and librarians to devise and implement guidelines and practices that ensure the long-term survival and archivability of digital edition projects, not just as data but as functioning web applications. In the first phase of the project, we converted numerous projects, ranging from small collections of texts to densely interlinked documents and datasets, from eXist-db backends into entirely static sites consisting only of HTML, CSS, and JavaScript [Holmes and Takeda 2019a; Holmes and Takeda 2019b; Holmes and Takeda Forthcoming].

Having demonstrated the feasibility of building large sites which are both static and interactive, we were left with one remaining issue: how to provide a search engine for our collections without adding unwanted technical debt in the form of a server-side backend. Searching is often an essential component of digital edition projects, and most, if not all, of our projects are built with the assumption that some sort of search capability will be available in the future; much of our encoding work is predicated on the notion that good, clean encoding produces not only better texts, but texts more amenable to complex and specific querying in the future. These projects thus require robust search mechanisms that can support a range of queries, from simple word searches to complex faceted searches, to aid in both the usability and discoverability of documents.

This paper introduces staticSearch: a serverless text-search engine with full stemming, wildcard, keyword-in-context, and filter support.[1] It is made up of two distinct, albeit interdependent, components: an XSLT-based indexer that generates an inverted index of JSON files from a collection of XHTML files and an associated JavaScript module for querying and displaying search results.[2] While our initial expectation was that staticSearch would be practical for smaller sites and perhaps not realistic for our larger ones which have tens of thousands of documents, our pessimism has proved unwarranted. Now nearing its second major release, staticSearch has become a core part of over a dozen projects, many of which contain thousands of documents, and its performance has surprised us.

Prior Work and Existing Solutions

As we note in Holmes and Takeda Forthcoming, static websites have become popular due, in part, to their long-term durability. Static sites, once deployed, require minimal maintenance; unlike server-side applications, there is no risk of incompatible upgrades that either break the application (forcing the site to be remade, remediated, or retired) or, if ignored, leave the site vulnerable to attacks or pre-emptive shutdown by system administrators. While the Endings project initially focused on generating static versions of existing web applications for long-term preservation, we quickly realized that creating static sites from the outset is relatively straightforward, and offers significant improvements in terms of workflow, project consistency, maintenance costs, and project management. The majority of the projects developed at the HCMC are now entirely static from the start.

While full-text search engines like eXist and Solr offer powerful and well-documented mechanisms for indexing and querying large document collections, there is a lack of good options when it comes to adding search to static websites. As we detailed in Holmes and Takeda 2018, we tested a number of approaches, including invoking Google Custom Search Engine (CSE)—which the authors of O’Reilly’s introduction to static website development nominate as the best solution and the undisputed king of search [Camden and Rinaldi 2018]—and hooking into a centralized Library-run Solr indexer, but all of these systems had significant drawbacks. External services such as Google CSE and the Library’s Solr are not only fragile, but also difficult to customize for our needs and challenging to update. The Lunr.js JavaScript library, which bills itself as [a] bit like Solr, but much smaller and not as bright [Nightingale 2011], is another popular mechanism for adding search to static sites [Wikle, Williamson, and Becker 2020]. Lunr.js requires only a pre-built JSON index file, which it parses in memory to display search results. One serious drawback of Lunr, however, is that the entire index must be loaded into memory as a single JSON file, which can quickly become unwieldy for large document collections. Lunr also assumes that the index will be built from simple text or Markdown files, but our HTML is often highly nested and enriched with data attributes and classes that retain important information from the source documents; while not all of the information contained in the HTML is critical, much of it is important when it comes to fine-tuning and configuring specific site searches.

There are, of course, many other ways to index a document collection beyond the services listed above, but our ultimate goal was to create a search engine that could fit easily within our established infrastructure of XML languages and software. As Kraetke and Imsieke 2016 note, XSLT offers a modern, powerful static website generator, one that, we realized, could quite capably handle all of the tasks required by an indexing system.

What is staticSearch?

Broadly, staticSearch works by first taking in a user-supplied XML configuration file that tells the staticSearch build process where to find the documents and the search page, and that sets various options such as the number and length of keyword-in-context fragments to harvest for each stem.[3] It then runs the build process as follows:

  1. Checks and validates the input document collection.

  2. Checks the user’s configuration file, and if it is valid, uses it to build an XSLT configuration file for the remaining processes.

  3. Processes all documents in the collection to create versions in which stemmed tokens are tagged, and each tagged token has additional information about its context (more on this later). Each document is given an identifier consisting of its path relative to the search page.

  4. Uses the tokenized texts to build a collection of JSON files which are used to power the search.

  5. Creates the search page itself.

  6. Creates a report on the process.

Figure 1. [Diagram of the build process outlined above.]

There is one stipulation: the input document collection must consist of well-formed HTML5 in the XHTML namespace. Well-formedness is essential because we use Saxon to process the collection; the XHTML namespace arises purely out of our own prejudice.[4] That staticSearch uses and produces HTML, however, is an infrastructural feature. While extending staticSearch to other namespaces and to other XML dialects in general is certainly feasible (and, in fact, our HTML documents are frequently derived from TEI XML encoded documents), it is difficult, in our minds, to imagine cases where indexing and tokenizing non-HTML files would be more effective. Since the search is meant to power a web application, users of the search are looking for information that they can find in the mass of HTML files, not the source documents from which they are produced. Our index, in other words, reflects the documents that are available in the collection, and thus search results can be easily linked to the places in the source document where a term appears.

We will now discuss the technical implementation in further detail.

Configuration

The structure and syntax of the configuration file are defined by staticSearch’s custom schema (expressed as a TEI ODD file); the file provides specific options for the staticSearch build process. A basic configuration file looks something like this:

<config xmlns="http://hcmc.uvic.ca/ns/staticSearch" version="2">
    <params>
        <searchFile>test/search.html</searchFile>
        <versionFile>test/VERSION</versionFile>
        <recurse>true</recurse>
        <phrasalSearch>true</phrasalSearch>
        <wildcardSearch>true</wildcardSearch>
        <createContexts>true</createContexts>
        <resultsPerPage>5</resultsPerPage>
        <minWordLength>2</minWordLength>
        <maxKwicsToHarvest>5</maxKwicsToHarvest>
        <maxKwicsToShow>5</maxKwicsToShow>
        <totalKwicLength>15</totalKwicLength>
        <kwicTruncateString>...</kwicTruncateString>
        <stopwordsFile>test/test_stopwords.txt</stopwordsFile>
        <dictionaryFile>xsl/english_words.txt</dictionaryFile>
        <outputFolder>ssTest</outputFolder>
    </params>
    <rules>
        <rule weight="2"
            match="h1 | h2"/>
        <rule weight="0"
            match="span[@class='lineNum']"/>
        <rule weight="0"
            match="script | style"/>
        <rule weight="0"
            match="header | footer"/>
    </rules>
    <contexts>
        <context match="blockquote" label="Quotations"/>
        <context match="div[@class='l']"/>
        <context match="span[@class='note'] | *[contains-token(@class,'sidenotes')]"
            label="Notes"/>
        <context match="cite" label="Citations"/>
        <context match="p[contains-token(@class,'citation')]" label="Citations"/>
    </contexts>
    <excludes>
        <exclude type="index" match="html[@id='excluded']"/>
        <exclude match="meta[contains-token(@class,'excludedMeta')]" type="filter"/>
    </excludes>
</config>

There are many interesting configuration options that are beyond the scope of this paper (full documentation of each option is available on the project’s website and the GitHub repository), but the crucial parameter here is <searchFile>. <searchFile> contains the path (relative to the configuration file) for the page in the collection that will be populated with the search form and controls for filters. This page may or may not already exist: if it does, it must contain an HTML block element (<div>, <section>, etc.) with the id "staticSearch"; if it does not, the page is created during the build process. The <searchFile> parameter also gives the location of the collection to index; in the above case, staticSearch will index all of the HTML files in the test/ directory.
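
For example, an existing search page needs only a placeholder element such as the following somewhere in its body, which the build process will populate with the search form and filter controls:

<div id="staticSearch"></div>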

This configuration file is transformed into an XSLT stylesheet that is included in all subsequent steps of the build process. It is necessary to convert the configuration file into its own XSLT as some configuration options, like weighting rules and context specifications, rely on XPath match statements. For example, elements can be assigned weights[5] via the <rule> element in the configuration file:

<rule match="header" weight="3"/>
<rule match="menu | aside | footer" weight="0"/>

The @weight attribute above signals the multiplier that should be applied to each instance of that element within a document when computing a term’s score. We make some default assumptions about the weighting of specific elements (all heading-like elements, <h1> etc., are given a weight of 2); a weight of 0 means that the indexer should ignore the element entirely.

The <rule> element (and other elements that bear a @match attribute) is converted into <xsl:template>s that are run during the multi-phase tokenization process.

Tokenizing

Cleaning and Pre-Processing

What we refer to as the "tokenization" process is a bit of a misnomer: it is a single monolithic stylesheet—tokenize.xsl—that processes each source document in multiple passes in order to create the minimal HTML structure necessary for generating the index.[6] The first stage in the process is to remove irrelevant content, retaining only the information that is necessary for the indexing process and removing ignored elements, unnecessary wrappers, and most attributes. In most cases, input documents will contain a significant amount of boilerplate HTML that appears on every page and should be completely ignored by the indexer, like the site menu, sidebar, or footer. As the example above shows, these elements are given a weight of 0, which means they are removed from the tokenized document. The tokenization process also removes elements that have no bearing on the indexing process; this includes most inline elements, like links, spans, etc., unless they must be retained for a specific reason (e.g. they are assigned a specific weight or they contain a fragment identifier, which can be linked from the search results).

Often, a well-configured instance of staticSearch will produce tokenized documents that are significantly smaller than the original. For example, consider this line from a poem in the Digital Victorian Periodical Poetry Project:

<div class="l" data-el="l" id="l_1"><span class="lineContent"><span data-el="hi" class="hi" style="font-variant: small-caps; letter-spacing: 0.06em;">A blush</span>, a smile, a dusk sweet vio<span class="rhyme label_a" data-el="rhyme" title="Masculine rhyme (Final syllable rhymes exactly; for example, Keats/beets.); label: a">let</span>—</span><span class="lineNum">1</span></div>

After being run through the tokenizer, all classes, data attributes, superfluous wrapping elements, and other information irrelevant to the indexer are removed:

<div id="l_1" ss-ctx="true">A <span ss-pos="3" ss-fid="l_1" ss-stem="blush">blush</span>, a <span ss-pos="4" ss-fid="l_1" ss-stem="smile">smile</span>, a <span ss-pos="5" ss-fid="l_1" ss-stem="dusk">dusk</span> <span ss-pos="6" ss-fid="l_1" ss-stem="sweet">sweet</span> <span ss-pos="7" ss-fid="l_1" ss-stem="violet">violet</span>—</div>

Stemming

The second pass is, of course, tokenization proper. The tokenization stage wraps each token in a span element and decorates the element with the token’s stem, position, weight, et cetera. Each meaningful text node is matched and analyzed using <xsl:analyze-string> to identify each word, where a word is understood as:

  • A number: [\d]+([\.,]?\d+)

  • An alphanumeric word: [\p{L}\p{M}]+

  • A hyphenated word: $alphanumeric-$alphanumeric(-$alphanumeric)*

We also consider apostrophes and quotation marks (both straight and curly) as part of a word, so the constructed regular expression is slightly more complicated when expressed in the XSLT:

 <xsl:variable name="numericWithDecimal">[<xsl:value-of select="string-join($allApos,'')"/>\d]+([\.,]?\d+)</xsl:variable>
 <xsl:variable name="alphanumeric">[\p{L}\p{M}<xsl:value-of select="string-join($allApos,'')"/>]+</xsl:variable>
 <xsl:variable name="hyphenatedWord">(<xsl:value-of select="$alphanumeric"/>-<xsl:value-of select="$alphanumeric"/>(-<xsl:value-of select="$alphanumeric"/>)*)</xsl:variable>
 <xsl:variable name="tokenRegex">(<xsl:value-of select="string-join(($numericWithDecimal,$hyphenatedWord,$alphanumeric),'|')"/>)</xsl:variable>

This yields the following:

(['‘’”“"\d]+([\.,]?\d+)|([\p{L}\p{M}'‘’”“"]+-[\p{L}\p{M}'‘’”“"]+(-[\p{L}\p{M}'‘’”“"]+)*)|[\p{L}\p{M}'‘’”“"]+)

If a token is indeed a word, and is neither too short nor a stopword, it is then run through the user-configured XSLT stemmer. At the moment, staticSearch has four different stemmers: the Porter stemming algorithms for English and French (Porter 1980; Porter 2002; Porter French); an identity stemmer; and a diacritic stemmer, which simply strips diacritics and is otherwise idempotent.[7] Users can specify their own stemmers, but, at the moment, the stemmers need to be implemented identically in both XSLT and JavaScript. We are currently exploring options for integrating existing implementations of Porter’s stemming algorithms in Java and JavaScript (for Saxon and the browser, respectively).
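
To give a sense of the simplest case, a diacritic stemmer can be written in a few lines; the following is a minimal JavaScript sketch of the idea (the function name is ours, and the shipped implementation may differ in detail):

// A minimal sketch of a diacritic stemmer: decompose each token to NFD, then
// strip all combining marks. Tokens without diacritics pass through unchanged.
// (Unicode property escapes such as \p{M} require ES2018 or later.)
function diacriticStem(token) {
  return token.normalize('NFD').replace(/\p{M}/gu, '');
}

diacriticStem('métier'); // returns "metier"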

Indexing

staticSearch works by generating an inverted index from the tokenized documents (Zobel and Moffat 2006). This index is simply a directory full of JSON files on the file system: each unique stemmed term has a JSON file to itself, named for the stem ('book.json', 'walk.json', etc.), which contains information about the documents in which that term appears. This means that when the search page queries the index, it need only retrieve the individual JSON files for the terms in the search; the bulk of the index is never retrieved.
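
The retrieval side of this design can be sketched in a few lines of JavaScript (the function name and index path are ours for illustration, and the sketch ignores details such as the version string that the indexer appends to file names):

// Simplified sketch of per-term retrieval: stem each search term and fetch
// only the JSON files for those stems; the rest of the index is never touched.
async function fetchStemFiles(terms, stem) {
  const entries = await Promise.all(terms.map(async (term) => {
    const resp = await fetch(`staticSearch/stems/${stem(term)}.json`);
    // A 404 simply means the stem occurs nowhere in the collection.
    return resp.ok ? resp.json() : null;
  }));
  return entries.filter((entry) => entry !== null);
}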

The JSON files vary in size depending, of course, on the frequency of their terms within the document collection; in most cases, the individual files are trivially small, but for very common words not included in the stopwords file, they can reach several megabytes. However, given that these are text files and most servers support GZIP compression, the files can be highly compressed and thus retrieved almost instantly. As shown in Appendix A, regardless of compression, the JSON index is significantly smaller than the input document collection.

Stem Files

Here’s an example of the stem file for the term glow:

{
    "stem": "glow",
    "instances": [
        {
            "docUri": "poems/twilight.html",
            "score": 1,
            "contexts": [
                {
                    "form": "glow",
                    "weight": "1",
                    "pos": 49,
                    "context": "Twilight for dreams, the dun and dying <mark>glow</mark>",
                    "fid": "l_9"
                }
            ]
        }
    ]
}

The file contains an entry for each document in which the stem appears, giving an overall score for that stem in that document and precise information about each individual instance, including a keyword-in-context extract in which the term is marked.

Each stem file is created by grouping the entire set of stemmed spans by their @ss-stem value.

<xsl:for-each-group select="$stems" group-by="string(@ss-stem)">
  <xsl:variable name="stem" select="current-grouping-key()" as="xs:string"/>
  <xsl:call-template name="makeTokenCounterMsg"/>
  <xsl:variable name="map" as="element(j:map)">
    <xsl:call-template name="makeMap"/>
  </xsl:variable>
  <xsl:result-document href="{$outDir}/stems/{$stem}{$versionString}.json" method="text">
    <xsl:sequence select="xml-to-json($map, map{'indent': $indentJSON})"/>
  </xsl:result-document>
</xsl:for-each-group>

The makeMap template takes each group of stems and creates an XML map in the JSON namespace[8] for the file:

    <xsl:template name="makeMap" as="element(j:map)">
        <!--The term we're creating a JSON file for, inherited from the calling for-each-group -->
        <xsl:variable name="stem" select="current-grouping-key()" as="xs:string"/>
        <!--The group of all the terms (so all of the spans that have this particular term
            in their @ss-stem) -->
        <xsl:variable name="stemGroup" select="current-group()" as="element(span)*"/>
        <!--Create the outermost part of the structure-->
        <map xmlns="http://www.w3.org/2005/xpath-functions">
           <!--The stem is the top level string key for this map; it should be
                the same as the JSON file name.-->
            <string key="stem">
                <xsl:value-of select="$stem"/>
            </string>
             <!--Start instances array: this contains all of the instances of the stem
                 per document -->
            <array key="instances">
                <!--Group the spans by the URI of their containing document -->
                <xsl:for-each-group select="$stemGroup"
                    group-by="document-uri(/)">
                    <!--Sort the documents so that the document with the
                    most number of this hit comes first-->
                    <xsl:sort select="count(current-group())" order="descending"/>
                    <!--The current document uri, which functions
                    as the key for grouping the spans-->
                    <xsl:variable name="currDocUri" select="current-grouping-key()" 
                    as="xs:string"/>
                    <!--The spans that are contained within this document-->
                    <xsl:variable name="thisDocSpans" 
                    select="current-group()" as="element(span)*"/>
                    <!--Get the total number of documents 
                    (i.e. the number of iterations that this
                     for-each-group will perform) for this span-->
                    <xsl:variable name="stemDocsCount" select="last()" as="xs:integer"/>
                    <!--The document that we want to process
                            will always be the ancestor html of
                        any item of the current-group() -->
                    <xsl:variable name="thisDoc"
                        select="current-group()[1]/ancestor::html"
                        as="element(html)"/>
                    <!--Get the raw score of all the spans by getting the weight for 
                        each span and then adding them all together -->
                    <xsl:variable name="rawScore" 
                        select="sum(for $span in $thisDocSpans
                        return hcmc:returnWeight($span))"
                        as="xs:integer"/>
                        
                   <!--Map for each document that has this token-->
                    <map xmlns="http://www.w3.org/2005/xpath-functions">
                        <string key="docId">
                            <xsl:value-of select="$thisDoc/@id"/>
                        </string>
                        <!--And the relative URI from the document, which is to be used
                        for linking from the KWIC to the document. We've created this
                        already in the tokenization stage and stored it in a custom
                        data-attribute-->
                        <string key="docUri">
                            <xsl:value-of select="$thisDoc/@data-staticSearch-relativeUri"/>
                        </string>
                        <!--The document's score, forked depending on configured
                            algorithm -->
                        <number key="score">
                            <xsl:choose>
                                <xsl:when test="$scoringAlgorithm = 'tf-idf'">
                                    <xsl:sequence select="hcmc:returnTfIdf($rawScore, $stemDocsCount, $currDocUri)"/>
                                </xsl:when>
                                <xsl:otherwise>
                                    <xsl:sequence select="$rawScore"/>
                                </xsl:otherwise>
                            </xsl:choose>
                        </number>
                        <!--Now add the contexts array, if specified to do so -->
                        <xsl:if test="$phrasalSearch or $createContexts">
                            <xsl:call-template name="returnContextsArray"/>
                        </xsl:if>
                    </map>
                </xsl:for-each-group>
            </array>
        </map>
    </xsl:template>

Each stem file contains precise information about each individual instance in which that stem appears. This is the most onerous part of the process, as each context contains a keyword-in-context string, which shows the word in situ.

This has been difficult to optimize. Our approach so far has been to move up the tree and use the node comparison operators (<< and >>) to compile all of the nodes that precede the span and all of the nodes that follow it, and then trim each string to the configured length.

This can still lead to very long strings being stored in memory, however, and so we have tried to optimize by iterating through each node using <xsl:iterate> and breaking once a string of the desired length has been found.

    <xsl:function name="hcmc:returnSnippet" as="xs:string?">
        <xsl:param name="nodes" as="node()*"/>
        <xsl:param name="isStartSnippet" as="xs:boolean"/>
    
        <!--Iterate through the nodes: 
            if we're in the start snippet we want to go from the end to the beginning-->
        <xsl:iterate select="if ($isStartSnippet) then reverse($nodes) else $nodes">
            <xsl:param name="stringSoFar" as="xs:string?"/>
            <xsl:param name="tokenCount" select="0" as="xs:integer"/>
            <!--If the iteration completes, then just return the full string-->
            <xsl:on-completion>
                <xsl:sequence select="$stringSoFar"/>
            </xsl:on-completion>
            <xsl:variable name="thisNode" select="."/>
            <!--Normalize and determine the word count of the text-->
            <xsl:variable name="thisText" select="replace(string($thisNode),'\s+', ' ')" as="xs:string"/>
            <xsl:variable name="tokens" select="tokenize($thisText)" as="xs:string*"/>
            <xsl:variable name="currTokenCount" select="count($tokens)" as="xs:integer"/>
            <xsl:variable name="fullTokenCount" select="$tokenCount + $currTokenCount" as="xs:integer"/>
            
            <xsl:choose>
                <!--If the number of preceding tokens plus the number of current tokens is 
                    less than half of the kwicLimit, then continue on, passing 
                    the new token count and the new string-->
                <xsl:when test="$fullTokenCount lt $kwicLengthHalf + 1">
                    <xsl:next-iteration>
                        <xsl:with-param name="tokenCount" select="$fullTokenCount"/>
                        <!--If we're processing the startSnippet, prepend the current text;
                            otherwise, append the current text-->
                        <xsl:with-param name="stringSoFar" 
                            select="if ($isStartSnippet)
                                    then ($thisText || $stringSoFar) 
                                    else ($stringSoFar || $thisText)"/>
                    </xsl:next-iteration>
                </xsl:when>
                
                <xsl:otherwise>
                    <!--Otherwise, break out of the loop and output the current context string-->
                    <xsl:break>
                        <!--Figure out how many tokens we need to snag from the current text-->
                        <xsl:variable name="tokenDiff" select="1 + $kwicLengthHalf - $tokenCount"/>
                        <xsl:choose>
                            <xsl:when test="$isStartSnippet">
                                <!--We need to see if there's a space before the token we care about:
                                    (there often is, but that is removed when we tokenized above) -->
                                <xsl:variable name="endSpace" 
                                    select="if (matches($thisText,'\s$')) then ' ' else ()"
                                    as="xs:string?"/>
                                <!--Get all of the tokens that we want from the string by:
                                    * Reversing the current token sequence
                                    * Getting the subset of tokens we need to hit the limit
                                    * And then reversing that sequence of tokens again.
                                -->
                                <xsl:variable name="newTokens" 
                                    select="reverse(subsequence(reverse($tokens), 1, $tokenDiff))"
                                    as="xs:string*"/>
                                <!--Return the string: we know we have to add the truncation string here too-->
                                <xsl:sequence 
                                    select="$kwicTruncateString || string-join($newTokens,' ') || $endSpace || $stringSoFar "/>
                            </xsl:when>
                            <xsl:otherwise>
                                <!--Otherwise, we're going left to right, which is simpler
                                    to handle: the same as above, but with no reversing -->
                                <xsl:variable name="startSpace" 
                                    select="if (matches($thisText,'^\s')) then ' ' else ()"
                                    as="xs:string?"/>
                                <xsl:variable name="newTokens" 
                                    select="subsequence($tokens, 1, $tokenDiff)" 
                                    as="xs:string*"/>
                                <xsl:sequence
                                    select="$stringSoFar || $startSpace || string-join($newTokens,' ') || $kwicTruncateString"/>
                            </xsl:otherwise>
                        </xsl:choose>
                    </xsl:break>
                </xsl:otherwise>
            </xsl:choose>
        </xsl:iterate>
    </xsl:function>

Additional Files

In addition to the stem files, the build process also creates the following individual JSON files:

ssTitles.json

This maps each document’s unique identifier (its path relative to the search page) to its title. It may also include an icon with which to identify the document in search results, and an optional sort key to be used instead of its title when search results with the same score are being listed.

ssWordString.json

This is a list of all the individual (unstemmed) words appearing in the collection, each wrapped in pipe characters:

...|page||pairs||paragraph||part||parts||peep||People||per||percent||percentages||perhaps|...

This file is used when processing wildcard searches. When the user enters a wildcard term, it is expanded into a regular expression, which is used to extract all of the individual matching words from the word string. Each of those words is a potential match, so each is stemmed and its stem file retrieved. Then all of the contexts in those stem files are searched for matches against the wildcard term, so that all actual hits can be found.
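
The expansion step might be sketched as follows (the function is our illustration rather than the shipped code, and it handles only the * wildcard; because every word in the word string is wrapped in pipes, a pipe can never occur inside a word):

// Sketch of wildcard expansion: escape regex metacharacters, turn '*' into
// '[^|]*' (anything but a pipe), and collect every word in the pipe-delimited
// word string that the resulting pattern matches.
function expandWildcard(term, wordString) {
  const escaped = term.replace(/[.+?^${}()[\]\\]/g, '\\$&');
  const pattern = new RegExp('\\|(' + escaped.replace(/\*/g, '[^|]*') + ')\\|', 'g');
  return [...wordString.matchAll(pattern)].map((m) => m[1]);
}

expandWildcard('pe*', '|peep||People||per|'); // returns ['peep', 'per']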

For exact phrase (i.e. quoted string) searches, the quoted string is tokenized and the first non-stopword is extracted from it; that word is stemmed, and its stem file retrieved. Then all the contexts in that stem file are searched for an exact match for the phrase.
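
A comparable sketch for the phrase search (again ours; the instances and contexts structures are those shown in the stem file example above):

// Sketch of an exact-phrase search: stem the first non-stopword in the quoted
// phrase, fetch its stem file, and keep only the instances whose stored
// contexts (with the <mark> tags removed) contain the phrase verbatim.
async function phraseSearch(phrase, stem, stopwords) {
  const firstWord = phrase.split(/\s+/).find((w) => !stopwords.has(w.toLowerCase()));
  const resp = await fetch(`staticSearch/stems/${stem(firstWord)}.json`);
  if (!resp.ok) return [];
  const entry = await resp.json();
  return entry.instances.filter((inst) =>
    inst.contexts.some((ctx) =>
      ctx.context.replace(/<\/?mark>/g, '').includes(phrase)));
}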

Filter Files

In addition to the text search, the user can trigger the creation of a range of different search filter controls on the search page by including HTML meta tags with specific formats in the document. For example, if a document has these three meta tags:

    <meta name="Document type" class="staticSearch_desc" content="Poems"/>
    <meta name="Document type" class="staticSearch_desc" content="Translations"/>
    <meta name="Date of publication" class="staticSearch_date" content="1895-01-05"/>

then the containing document will be classified as belonging to two document categories, Poems and Translations, in the Document type selection filter (which we refer to as a description filter). A second filter, a date range filter, will also be created. If an end-user searches for documents in either of those categories, using a date range that includes 1895-01-05, then this document will be selected. Other filter types include boolean, number range, and feature filters, which provide a typeahead searchable list of keywords. The build process creates a separate JSON file for each of these filters.

The JSON for a description filter looks like this (heavily truncated example):

{
  "filterId": "ssDesc4",
  "filterName": "Poet’s nationality",
  "ssDesc4_1": {
    "name": "English",
    "docs": 
        ["poems\/p_1095_a_duet.html",
       "poems\/p_1099_the_ox.html"]
  },
  "ssDesc4_2": {
    "name": "Irish",
    "docs": 
    [ "poems\/p_8866_golden_lilies.html",
      "poems\/p_8825_in_a_cathedral.html"]
  }
}

When an end-user’s search makes use of a filter control, the required filter JSON will be downloaded along with any stem files needed; however, the filter files are also downloaded in the background on page load, so most are already available by the time a user has initiated a search.

When filters are combined with text search, the list of documents containing hits for the text search is first computed; those hits are then filtered based on the filter settings. The small size and innate compressibility of the JSON files enable staticSearch to produce results quite rapidly, even from relatively large document collections.
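
For a description filter like the one shown above, combining filter and text results amounts to a set intersection; a rough JavaScript sketch (ours), using the filter JSON structure from the previous example:

// Build the set of documents permitted by the selected filter values (e.g.
// 'ssDesc4_1' for "English" above), then keep only text hits inside that set.
function applyDescFilter(hits, filterJson, selectedIds) {
  const allowed = new Set(
    selectedIds.flatMap((id) => (filterJson[id] ? filterJson[id].docs : [])));
  return hits.filter((hit) => allowed.has(hit.docUri));
}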

Search Page

Once the documents have been indexed, staticSearch then creates the search page using data assembled by the indexing process.[9] This search page is pre-populated with all necessary values for the search, including the query input, checkboxes for filters, inputs for dates and numeric filters, et cetera; the form itself also bears custom HTML data-attributes specifying some of the configuration options—the name of the folder that contains the index, the number of results to show, and so on—to be used by the JavaScript.

Building the page beforehand means that the client-side script does not need to retrieve and parse any of the filters in order for the page to display the necessary controls; while some files—the list of stopwords, the word string, and the titles file—are crucial for any search to be performed and are thus fetched immediately on page load, staticSearch retrieves these asynchronously in the background such that the page is immediately responsive and usable.
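
That non-blocking pattern is easy to sketch (file and folder names here are illustrative, and the stopwords file name in particular is our assumption): start the requests on page load without awaiting them, and resolve the promises only when the first search actually runs.

// Kick off the essential fetches immediately, but do not block the page.
const essentials = ['ssStopwords.json', 'ssWordString.json', 'ssTitles.json']
  .map((name) => fetch(`staticSearch/${name}`).then((resp) => resp.json()));

// Later, when the user runs a search:
async function runSearch(query) {
  const [stopwords, wordString, titles] = await Promise.all(essentials);
  // ... perform the search using the now-available data ...
}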

Conclusion

While in many cases staticSearch has been implemented in projects as a replacement for pre-existing search engines, we now find ourselves using staticSearch from the start in many of our projects. Since staticSearch runs on HTML files, it can be integrated into any publishing workflow that produces well-formed HTML5. We are also continuing to research ways that the static index produced by staticSearch could be packaged with a web archive file (WARC) so that web archives displayed in the Wayback Machine and other web archive viewers would retain their essential search functionality.

Overall, this paper has outlined the approach behind staticSearch and demonstrated how XSLT can be used to generate a robust search index without the use of server-side technologies.

Appendix A. Statistics

The following table details statistics about staticSearch’s indexing process for three different projects: the very small staticSearch test set of documents; the Winnifred Eaton Archive’s (Chapman 2022) documents, including transcriptions; and the Landscapes of Injustice’s (Stanger-Ross 2021) large archive of primary and secondary source materials. The statistics below were gathered on an Apple MacBook Pro with 16GB of RAM and an Apple silicon (M1 Pro) processor; timings and sizes are as reported by gtime, a port of GNU time for macOS.

Table I

Project                        | staticSearch Test Set | Winnifred Eaton Archive | Landscapes of Injustice
Number of HTML files tokenized | 10                    | 1820                    | 93998
Size of Document Collection    | 17.4K                 | 31M                     | 264M
Average document size          | 1.8K                  | 17K                     | 2.9K
Number of token files          | 678                   | 20514                   | 92203
Total Size (uncompressed)      | 285K                  | 188M                    | 617M
Average size (uncompressed)    | 420B                  | 9.2K                    | 6.7K
Total Size (compressed)        | 171K                  | 39M                     | 106M
Average size (compressed)      | 252B                  | 1.9K                    | 1.2K
Build time                     | 6s 680ms              | 1m 24s 52ms             | 8m 53s 20ms
Memory Used                    | 391M                  | 1.3G                    | 3.7G

References

[Camden and Rinaldi 2018] Camden, R., and Rinaldi, B. 2018. Working with Static Sites. O’Reilly Media Inc. https://learning.oreilly.com/library/view/working-with-static/9781491960936/.

[Chapman 2022] Chapman, M., et al. 2022. The Winnifred Eaton Archive. University of British Columbia. https://winnifredeatonarchive.org.

[Holmes and Takeda 2018] Holmes, M.D. and Takeda, J. 2018. Why do I need Four Search Engines? Proceedings of the Japanese Association of Digital Humanities Conference, Tokyo, Japan, September 2018. https://conf2018.jadh.org/files/Proceedings_JADH2018.pdf#page=58.

[Holmes and Takeda 2019a] Holmes, M.D., and Takeda, J. 2019. The Prefabricated Website: Who needs a server anyway? Paper presented at the Text Encoding Initiative Conference, Graz, Austria, September 2019. https://gams.uni-graz.at/o:tei2019.116/sdef:TEI/get?context=context:tei2019.papers.

[Holmes and Takeda 2019b] Holmes, M.D. and Takeda, J. 2019. Beyond Validation: Using Programmed Diagnostics to Learn About, Monitor, and Successfully Complete Your DH Project. Digital Scholarship in the Humanities, 34 (suppl_1) (December 2019): i100–i109. Oxford University Press/EADH. doi:https://doi.org/10.1093/llc/fqz011.

[Holmes and Takeda Forthcoming] Holmes, M.D. and Takeda, J. Forthcoming. From Tamagotchis to Pet Rocks: Static Websites for Long-term Sustainability. In peer review with Digital Humanities Quarterly, 2022.

[Kraetke and Imsieke 2016] Kraetke, Martin, and Gerrit Imsieke. 2016. XSLT as a Modern, Powerful Static Website Generator: Publishing Hogrefe’s Clinical Handbook of Psychotropic Drugs as a Web App. Presented at XML In, Web Out: International Symposium on sub rosa XML, Washington, DC, August 1, 2016. In Proceedings of XML In, Web Out: International Symposium on sub rosa XML. Balisage Series on Markup Technologies, vol. 18. doi:https://doi.org/10.4242/BalisageVol18.Kraetke02.

[Nightingale 2011] Nightingale, Oliver. lunr.js. JavaScript. https://github.com/olivernn/lunr.js.

[Porter 1980] Porter, M.F. 1980. An algorithm for suffix stripping. Program 14(3) (1980), 130–137. doi:https://doi.org/10.1108/eb046814.

[Porter 2002] Porter, M.F. The English (Porter2) stemming algorithm. http://snowball.tartarus.org/algorithms/english/stemmer.html.

[Porter French] Porter, M.F. French stemming algorithm. http://snowball.tartarus.org/algorithms/french/stemmer.html.

[Quin 2008] Quin, Liam R.E. 2008. Text Retrieval for XML-Encoded Corpora: A Lexical Approach. Presented at Balisage: The Markup Conference 2008, Montréal, Canada, August 12 - 15, 2008. In Proceedings of Balisage: The Markup Conference 2008. Balisage Series on Markup Technologies, vol. 1. doi:https://doi.org/10.4242/BalisageVol1.Quin01.

[Stanger-Ross 2021] Stanger-Ross, J., et al. Landscapes of Injustice. University of Victoria. https://loi.uvic.ca/archive.

[Wikle, Williamson, and Becker 2020] Wikle, O., Williamson, E., and Becker, D. 2020. What is Static Web and What’s it Doing in the Digital Humanities Classroom? dh+lib June 2020. https://acrl.ala.org/dh/2020/06/22/what-is-static-web-and-whats-it-doing-in-the-digital-humanities-classroom/.

[Zobel and Moffat 2006] Zobel, J., and Moffat, A. Inverted Files for Text Search Engines. ACM Computing Surveys 38 (2) (2006): 6-es. doi:https://doi.org/10.1145/1132956.1132959.



[1] All code and documentation for staticSearch can be found on its GitHub repository: https://github.com/projectEndings/staticSearch. Rendered documentation for the most recent release can be found at https://endings.uvic.ca/staticSearch/docs/index.html.

[2] While this paper focuses primarily on the XSLT-based indexer, we would like to address the concerns helpfully raised by a reviewer about the sustainability of using JavaScript for the client-side retrieval. While much of the history of JavaScript has been defined by the use of libraries like jQuery (which often cause critical problems within applications due to incompatible upgrades, changed pointers, etc.), the language itself has been remarkably stable and backwards-compatible since the 1990s. What makes client-side scripts fragile, in our view, is not the language itself, but changes in browser security policies (e.g. the recent rollout of CSP rules), which are ultimately managed at the server level anyway, as well as the multiple points of failure introduced by external dependencies (and, crucially, the dependency’s dependencies). staticSearch is written entirely in ES6 JavaScript and does not have any dependencies (either bundled with it or as external scripts).

[3] For full details on how to implement staticSearch and integrate it into a development workflow, see the How Do I Use It? section of the project documentation (https://endings.uvic.ca/staticSearch/docs/howDoIUseIt.html).

[4] Our insistence on well-formed HTML5 in the XHTML namespace is part ideological—given our fealty to XML, our hope is that this constraint will encourage projects to create well-formed HTML—and part practical: it is beyond the scope of our project to try to handle the range of ill-formed tag-soup HTML that is common in the wild. That said, since staticSearch does not modify the input files, implementers could use any number of existing conversion tools to pre-process their files into well-formed HTML (such as HTML Tidy or the TagSoup parser); as well, since the codebase is open-source, implementers could fork the repository and use a custom parser in Saxon.

[5] Weight here is a slightly misleading term: most discussions of search engines refer to what we call weight as boost, while what we call score is usually framed as weight.

[6] Quin 2008 discusses possible solutions for full-text querying of XML with lq-text and notes that, while tedious, a pragmatic approach is to re-write documents before indexing them, perhaps with XSLT. While we were not aware of Quin’s extensions to lq-text before working on staticSearch, the approach described there in many ways prefigures and anticipates our own.

[7] While the identity stemmer is not necessarily ideal, it does vastly simplify the creation of a search engine for multi-lingual documents and document collections. It also provides a convenient starting point for users who might want to implement their own stemmers.

[8] The advantage of using this structure rather than XPath maps and arrays is the ease with which we can construct an array, at least until the proposed XSLT 4.0 <xsl:array> instruction becomes available.

[9] See Appendix B for examples of the search pages produced by staticSearch.
