Pipelining and micropipelining

Pipelining describes the process of using the output of a function or other operation as the input to a subsequent function or other operation. As an example of a function pipeline, if we wish to convert a string to upper case and normalize white space within it, we can pass our original input string through a pipeline that combines the XPath upper-case() and normalize-space() functions in either order. If our original input string reads Hi,  Mom! (with two spaces between the words), we can first apply the upper-case() function (yielding HI,  MOM!, still with two spaces between the words) and then use the output of applying the upper-case() function as input into the normalize-space() function, which collapses the two consecutive space characters into one. Or we can apply the functions in the opposite order. In this example and many other pipelines, the order in which the functions are applied does not affect the result; for situations where it does, see the discussion of feeding and bleeding orders, below.

Pipelines may connect small, individual functions, as in the example above, or they may be constructed at higher levels. For example, an XSLT transformation might be designed as a pipeline of XSLT stylesheets, where the original input XML is passed into a transformation that uses one stylesheet and creates output XML, which becomes the input XML for another stylesheet, etc., until eventually the desired final output is created. While pipelining is well established as a technical term, micropipelining is less so, and we use it in this report to refer to pipelines that operate entirely within a single XSLT stylesheet, whether at the level of individual functions or in some of the more complex ways described below.

Pipes and pipelines trace their popularity as a process-control architecture to a 1964 typed note by Doug McIlroy about Unix. In a summary of what’s most important, McIlroy writes that that [w]e should have some ways of coupling programs like garden hose—screw in another segment when it becomes necessary to massage data in another way. [McIlroy 1964] As for pipelines specifically in an XSLT and XQuery context, Michael Kay writes:

Constructing an application in the form of a pipeline has many advantages, the main ones being (a) that the code of each step in the pipeline is kept very simple; (b) that it is very easy to assemble an application from a set of components, thus maximizing the potential for component reuse, and (c) there is no requirement that each step in a pipeline should use the same technology; it's easy to mix XSLT, XQuery, Java and so on in different stages.

A number of products are available to assist with the development and management of pipeline-based applications, examples being Orbeon [Orbeon] and Apache Cocoon [Cocoon]. More recently W3C has developed a standard language, XProc [W3C XProc], for the definition of pipelines. [Kay 2009]

Insofar as XSLT allocates its processing among templates and functions, Wendell Piez writes that transformations are sometimes (frequently) pipelines themselves internally (in whole or in part), as indeed XSLT can be seen as a language for pipelining function calls. [Piez 2016]

Beyond using the output of one process as input to the next, pipelining invites computational efficiency by making it possible to stream the flow of information, potentially avoiding the overhead of building complete intermediate structures in memory or writing intermediate output of the first process to disk and then reading it from disk into the input of the next process. Whether an application takes advantage of this possibility depends on the implementation (and not all operations are streamable), but although pipelining is similar in some respects (especially pertaining to the eventual final output) to saving and reusing intermediate results, in a functional environment that supports streaming and multiprocessing, it also has the potential of offering performance benefits by eschewing the creation of intermediate storage objects in variables or in disk files and permitting data to be processed as soon as it and a processing core become available. Furthermore, pipelining may also prove economically advantageous with respect to development effort, about which Piez writes that:

a well designed pipeline of steps can be cheaper to build and easier to maintain than a single transformation doing the same thing (and [for the same reason] micropipelining within a transformation is also a useful thing). Not infrequently, a complex problem can be addressed by refactoring it into […] phases, each […] of which can be simpler and easier (even together) to build and operate than a one-pass solution—even one that by some measures is more efficient. [Piez 2016]

Patterns and antipatterns

The term pattern refers to a generalizable way of addressing a common and generalizable type of problem. It is used widely in object-oriented programming (see, for example, Gamma 1994), but it is not restricted to that domain, and in this presentation the generalizable problem we care about is pipelining in XSLT, and the generalizable strategies that are commonly employed to deal with it are our patterns. The term antipattern was introduced by Andrew Koenig in 1995 to describe something that looks like a paradigm for a solution to a common type of problem, that is, a pattern, but that for structural reasons leads to results that do not live up to best practice expectations. [Koenig 1995] As used in this report, the distinction between patterns and antipatterns helps us focus on the strengths and weaknesses of alternative strategies for pipelining within XSLT.

It is often said that code is read more often than it is written, and from that perspective, legibility, which contributes to ease of maintenance, is one feature that can distinguish patterns from antipatterns. Another is computational complexity, and in that context scalability is another such feature (but see immediately below).[2] Michael Kay, in a discussion of Jackson Structured Programming, writes that [t]he essential principle of JSP was the idea that the structure of the program should mirror the structure of the data, continuing that JSP can contribute to both code legibility and maintenance, on the one hand, and operational efficiency, on the other:

[T]he user achieves the benefits that come from writing the application as a controlling application (maintainability, readability, reusability) combining these with the improvements in system performance that come from using a push-based control model. [Kay 2009]

The boundary between a pattern and an antipattern is not always unambiguous. Some programming strategies discussed below, such as nested functions, are easy to write and read in simple situations (e.g., two functions, each of which takes one parameter), but may scale poorly with respect to legibility as the complexity of the context increases. Such models might therefore be sensible patterns when applied in those simpler situations, while transforming themselves gradually into antipatterns as the complexity of the context increases. Scalability may be a useful feature in a pattern, but the strategy that outperforms others with big data is not necessarily optimal in contexts where the data never grow very large. For that reason, the discussion below avoids labeling specific solutions globally as pattern or antipattern, and instead discusses both the strengths and weaknesses of each approach, with attention to the programming context. We thus use pattern and antipattern as a framework for evaluating the strengths and weaknesses of coding strategies, rather than as a taxonomy for classifying them as Good or Bad in a binary fashion.

Micropipelining in context

Pipelining may play a role in XSLT development on at least three levels: 1) internally, within the XSLT processor; 2) between or across stylesheets; and 3) within a single stylesheet. We describe these below by way of providing a broad context and overview, but the focus of this report is on only the last of the three, on pipelining within a single XSLT stylesheet, which we call micropipelining.

1. Pipelining within the XSLT processor

To the typical XSLT developer, pipelining within the XSLT processor is a black box. Not only can optimizations within an XSLT processor transform XSLT instructions into internal operations that achieve the same results by following different execution logic, but the largely declarative XSLT programming paradigm means that the order employed by the processor to execute parts of a transformation is not entirely under the control of the XSLT developer. Among other things, these properties enable an XSLT processor to improve performance by executing operations in parallel, a feature with consequences that sometimes surprise XSLT developers who may be used to thinking within an imperative paradigm. For example, many new XSLT programmers have tried at some point to profile a stylesheet by inserting the current-dateTime() or current-time() function before and after an operation of interest, only to be surprised when the operation appears to take no time at all! This happens because there is no promise in the XSLT processing model that functions will be executed in the order in which they appear in the XSLT code, and for that reason XSLT was designed to follow a functional programming paradigm here:

In XSLT 2.0 it is defined that multiple calls on current-dateTime() and the other two [time and date] functions will return the same result every time they are called within a single transformation. This means that you can’t call the function at the beginning and end of the transformation to measure the elapsed time. The reason for this rule is that XSLT is rather purist about being a strictly functional language, and in a strictly functional language, calling the same function twice with the same arguments always returns the same result. This property makes life much simpler for optimizers. [Kay 2009, p. 738]

The functional nature of XSLT means that although an XSLT developer might write code that is amenable to pipelined execution, how the pipeline is executed is at least partially dependent on processor-internal code to which the XSLT developer has no direct access. Michael Kay’s alliteratively titled You pull, I’ll push: on the polarity of pipelines, cited above, continues that:

For performance, the stages of a pipeline should communicate by means of streams of events. Traditionally, this can be done either by writing components with pull polarity (the consumer calls the supplier when data is required), or by components with push polarity (the supplier calls the consumer when data is available). Performance problems arise when components with different polarity are mixed within the same pipeline. [Kay 2009, emphasis added]

As important as this polarity is for performance, to the extent that it is determined within the XSLT processor, rather than within the XSLT code, it is at least partially outside the control of the XSLT developer. For that reason, in this XSLT-developer-focused report, pipelining within the XSLT processor will not be discussed further.

2. Pipelining between and across XSLT stylesheets

Pipelining between or across XSLT stylesheets is much more under the control of the XSLT developer. A complex transformation might be broken down into simpler stages, where each stage is implemented in a separate XSLT stylesheet, and interim results are written to disk by one process and then read from disk by another. This strategy may make the individual stylesheets easier to read because each may implement only a small and coherent aspect of the overall transformation. This approach also makes it easy to examine the intermediate output during development, and to develop each step of the pipeline without distraction by what precedes or follows. But disk I/O is much slower than in-memory access, and the cost of writing the intermediate files is likely to compromise the efficiency of the transformation. Saving intermediate stages to disk and processing them as separate transformations also entails a cost in re-launching the transformation engine; for example, running the Java implementation of Saxon has to create a new Java Virtual Machine (JVM) and then start a new Saxon process. Whether JVM or I/O or process initialization or other overhead deserves consideration is likely to depend on the volume of data; augmenting the execution time by an order of magnitude may not matter to the developer if the overall time is still subjectively brief.

The overhead of disk I/O can be avoided by using pipelining utilities, whether broader purpose (such as Apache Cocoon [Cocoon]) or more narrowly XML specific (XProc [XProc]). Pipelining tools avoid the overhead of disk I/O, although whether they build complete intermediate in-memory structures or decompose (and perhaps parallelize or stream) the transformation in a functional manner is implementation-specific. These technologies are well know to the Balisage audience: XProc has been the subject of six presentations by six different authors or teams of authors at Balisage alone (see https://www.balisage.net/Proceedings/topics/XProc.html), and Cocoon has also made several appearances, including as part of the toolkit used to produce the Balisage proceedings you are reading now (https://www.balisage.net/Proceedings/vol17/acknowledgements.html). Pipelining between or across XSLT stylesheets with technologies like Cocoon or XProc does not prohibit writing intermediate output to disk for examination during debugging, but it does not require it, and XProc in particular incorporates additional features that may be useful in complex transformations of XML documents (see https://www.w3.org/TR/xproc/#pipeline-concepts). The focus of the current report is on micropipelining, that is, on pipelines implemented within a single XSLT stylesheet, and we will not explore XProc or other technologies for pipelining between or across stylesheets, except to note that whether to implement a pipeline within a single stylesheet or between or across stylesheets is a separate consideration that lies outside the within-stylesheet scope of the present report.

3. Micropipelining: pipelining within a single XSLT stylesheet

As mentioned above, we use the term micropipelining to refer to pipelining within a single XSLT stylesheet. Micropipelines implemented within a single stylesheet can always be refactored and spread across multiple stylesheets, but in situations where the steps are conceptually part of what a human would consider a single operation, and where intermediate stages are unlikely to be of interest except for debugging purposes during development, combining them within a single XSLT stylesheet that performs that single operation may make for easier implementation and maintenance.

The application that motivated this exploration is connected to our use of XML technologies to identify and analyze meter and rhyme in Russian verse. We presented a report on the identification of meter at Balisage 2015 [Birnbaum and Thorsen 2015], and in order also to identify rhyme we need to convert normal Russian orthography to a phonetic representation. As long as the place of stress is known,[3] the phonetic conversion, although complex, turns out to be easier in Russian than in English because Russian orthography, although not phonetic, is closer to the phonetics of the language than is the case with English. [4]

It is common in historical linguistics to represent language change in terms of the application of ordered rules as a pipeline that accepts an initial form as input, pipes it (that is, passes it along) through a sequence of individual sound-change rules, and generates transformed output. Constraints on the order in which rules apply are traditionally described using alimentary and sanguinary metaphors, and those metaphors emerge from the linguistic reality that sound change may happen at a certain point in history and then cease to operate, that is, cease to be a productive process:

  • Feeding order. The output of one rule creates new input for another rule. For example, if in Rule A x → y (read as x is transformed into y) and in Rule B y → z, Rule A feeds Rule B by creating new input for it.

  • Bleeding order. The output of one rule modifies a form that would otherwise have served as input for another rule. For example, if in Rule A z → q and in Rule B x → y / _ z (read as x is transformed into y when it precedes z), Rule A is said to bleed Rule B because it reduces the application of Rule B by the destroying the environment it requires.

  • Counter-feeding order. The output of one rule does not create new input for another rule, but were the order of the two rules reversed, they would observe a feeding relationship. For example, if in Rule A y → z and in Rule B x → y, the two stand in a counter-feeding relationship, since Rule B creates what would have been new input into Rule A, except that it fails to do so because Rule A operates first, and is no longer active when the potential new input is created by Rule B.

  • Counter-bleeding order. The output of one rule does not bleed forms that might otherwise have served as input for another rule, but were the order of the two rules reversed, they would observe a bleeding relationship. For example, if in Rule A x → y / _ z and in Rule B z → q, the rules observe a counter-bleeding order because were the order reversed, Rule B would reduce the applicability of Rule A by destroying the context it requires.

Note that feeding and bleeding are not opposites, and that some rules may not be critically ordered with respect to others because their inputs, outputs, and conditioning environments do not intersect with one another. These types of relationships among rules are important because of their role in directing or constraining linguistic change (see, for example, Kiparsky 1973). In the context of the present report, these relationships also provide a vocabulary for discussing the interaction of XPath functions and XSLT templates when those are used to implement transformations that model linguistic processes. Feeding, bleeding, and their opposites are appropriate for modeling historical language change because history imposes a real chronological order, but a similar model that uses similar terminology may also describe other linguistic phenomena, including the synchronic derivation of surface forms from more abstract underlying forms. In the present instance, we employ this framework to describe, as a pipeline of individual rules, the XSLT implementation of a transformation of standard Russian orthography (which is not phonetic) to a phonetic representation that can serve as the basis for identifying rhyme in verse.

Micropipelining: a closer look

Background and context

When we began implementing in XSLT the machine-assisted identification and analysis of Russian rhyme, we were aware that pipelining was required because our algorithm for converting natural Russian orthography (enhanced by information about stress, which is not part of normal Russian writing) was conceptualized as a series of rules of insertion, deletion, and replacement, the order of which was constrained by feeding and bleeding relationships. As we have been crafting bespoke solutions to these types of pipelining tasks over the years, we have wished for a tutorial that would describe the options, say something useful about the advantages and disadvantages of each, point out where they behave similarly or differently from one another, and perhaps make recommendations for deciding among them in the field. The principal goal of the present report is to fill the gap by describing and exploring the alternatives that we considered, including the one we eventually adopted for the Russian rhyme task.

Convenience variables

As the name implies, convenience variables are variables created for the benefit of the programmer, rather than because they necessarily improve the efficiency or operation of the program. The purpose of convenience variables is that typically they improve the legibility of the code, which makes it easier to develop (one piece at a time) and maintain.

Variables that improve the efficiency of the transformation process itself are not (or: are not entirely) convenience variables (although they may also be convenient). For example, if we need to calculate the same result more than once in an XSLT stylesheet, performing the calculation only once and saving it in a variable that we can then reuse may reduce the amount of computation and improve the performance of the transformation. On the other hand, when we calculate a value that we will use only once, save it in a variable, and then use the variable, not only is there no similar saving, but we actually risk reducing the efficiency of the transformation by imposing the overhead needed to create the variable as an intermediate structure. Within XSLT, this is especially the case when variables are declared without explicit typing using the @as attribute, the omission of which results in the expensive creation of a temporary in-memory document (that we may not need) to hold the value.[5]

As an example of a convenience variable, consider an XSLT transformation to SVG where we need to calculate the x and y coordinates of an object and then plot it. If those calculations are complex, packing them into the declaration of the object may compromise its transparency. Compare the version with convenience variables:

<xsl:variable name="xPos" as="xs:double" select="position() * $xScale"/>
<xsl:variable name="yPos" as="xs:double" select=". * $yScale"/>
<rect x={$xPos} y="-{$yPos}" width="{$width}" height="{$yPos}"/>

to one without:

<rect x={position() * $xScale} y="-{. * $yScale}" width="{$width}" height="{. * $yScale}"/>

The more complex the calculation, the more the legibility may be improved through the use of convenience variables.[6]

Convenience variables may play a role in pipelining when the result of one function is passed to another. For example,

<xsl:value-of select="djb:func2(djb:func1($x))"/>

may be rewritten as:

<xsl:variable name="y" select="djb:func1($x)"/>
<xsl:value-of select="djb:func2($y)">

It is common, while developing and debugging pipelines, to insert, delete, alter, or rearrange the individual steps in order to fine-tune their interaction. Removing a step, for example, with nested functions means finding the function name, the open parenthesis (which is always immediately after the function name), and the closing parenthesis. This last step may be annoyingly difficult if the nesting is deep and if the functions have multiple parameters; if they have only one, the closing parentheses are all bunched together at the end, and it doesn’t matter which one we remove, but if they are not all consecutive, the developer has to find exactly the right one. With convenience variables, removing a step requires not only removing the step itself, but also adjusting a variable reference elsewhere. Consider first the nested version:

<xsl:sequence select="djb:func3(djb:func2(djb:funct1($input)))"/>

If we wish to remove func2() from the pipeline, we remove the string “func2(” and any one of the closing parentheses. Now consider the version with convenience variables:

<xsl:variable name="first" select="djb:func1($input)"/>
<xsl:variable name="second" select="djb:func2($first)"/>
<xsl:sequence select="djb:func3($second)"/>

If we remove the middle line in the interest of removing djb:func2() from the pipeline, we can find the entire function easily, and we don’t have to look separately for the closing parenthesis. But our modifications are not limited to just one line; we must also change the variable reference in the last line from $second to $first. This is not an unmanageable burden, but it is an extra step, and an extra opportunity for error.

The difference between nesting functions directly inside one another and nesting inside one function a variable containing the result of evaluating another function is discussed in greater detail below. The distinction between convenience variables and those that contribute to the quality of the code in ways that are not limited to human legibility is important because the costs and benefits of creating variables to hold intermediate values must take into consideration whether using the variable conveys advantages beyond convenience, that is, beyond the legibility that supports ease of maintenance.

Nested functions

One of the most natural types of micropipelining involves the nesting of functions, as in the example earlier that combines normalize-space() and upper-case() by nesting one inside the other. This simple nested structure involves only two functions, each of which accepts only one parameter, but as the number of functions and the number of parameters grows, it is easy to lose one’s place in a thicket of parentheses. Furthermore, the normalize-space() and upper-case() functions do not participate in feeding or bleeding relationships, which means that once we have determined that we need to use them, there is no risk of nesting them in a way that produces the wrong output, and that therefore requires reordering. That matters because part of the challenge of multiple nested functions, and of the multiple pairs of parentheses that come with them, involves legibility during development. The fact that here the nesting order doesn’t matter obviates a potential need to transplant a contained function into a containing position, and vice versa, because we inadvertently nested them incorrectly the first time.

Consider, for example, a situation using the Wordhoard TEI edition of Hamlet,[7] where our task is to construct a comma-separated list of distinct speakers (from the <speaker> element children of the <sp> speech elements) in Act 4. We can retrieve a deduplicated list of all <speaker> values with a single XPath function wrapped around a single path expression: distinct-values(//body/div[4]//speaker). The catch is that this list includes not only Rosencrantz and Guildenstern, both of whom speak alone, but also Rosencrantz and Guildenstern (as the string content of a single <speaker> element) where they speak in unison, and our result should recognize that Rosencrantz and Guildenstern is not a unique speaker. It is possible to perform the operation in a single line using nested functions:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs" version="2.0"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0">
    <xsl:output method="text"/>
    <xsl:template match="/">
        <xsl:value-of
            select="distinct-values(tokenize(string-join(//body/div[4]//speaker/replace(., ' and ', ' '), ' '), '\s+'))"
        />
    </xsl:template>
</xsl:stylesheet>

Here we pipe the output of the innermost replace() function into string-join(), and then into tokenize(), and then into distinct-values().[8] The order of functions has some flexibility, but the following constraints apply:

  1. string-join() feeds tokenize()

  2. tokenize() feeds distinct-values()

  3. replace() may occur at any time before tokenize() (that is, before or after string-join()), or we can eliminate replace() entirely and instead use a predicate to exclude the eventual token with the value and.

In favor of this strategy, nesting is the most direct possible representation of the pipeline relationship, since the input parameters to each function are included between its parentheses and each function statement evaluates to its output. Against this strategy, the depth of the nesting and the fact that several of the functions require multiple parameters greatly compromise the legibility, and thus the development and subsequent maintenance, of the code. Writing this code is messy and error-prone, and, recalling that code is read much more often than it is written, nobody would regard it as easy to read. We might try to improve the legibility through indentation:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs" version="2.0"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0">
    <xsl:output method="text"/>
    <xsl:template match="/">
        <xsl:value-of
            select="distinct-values(
                tokenize(
                    string-join(
                        //body/div[4]//speaker/replace(
                            .,
                            ' and ',
                            ' '
                        ),
                        ' '),
                    '\s+')
                )"
        />
    </xsl:template>
</xsl:stylesheet>

although most developers would not consider this very legible either.[9] There is good reason that the functional programming language Lisp, known for its deeply nested parenthesized constructions, has been glossed humorously as Lots of Irritating Silly Parentheses and in other equivalent ways.[10]

We can improve the legibility by rewriting the code to use convenience variables:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs" version="2.0"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0">
    <xsl:output method="text"/>
    <xsl:template match="/">
        <xsl:variable name="first" as="xs:string+"
            select="//body/div[4]//speaker/replace(., ' and ', ' ')"/>
        <xsl:variable name="second" as="xs:string" select="string-join($first, ' ')"/>
        <xsl:variable name="third" as="xs:string+" select="tokenize($second, '\s+')"/>
        <xsl:variable name="fourth" as="xs:string+" select="distinct-values($third)"/>
        <xsl:sequence select="$fourth"/>
    </xsl:template>
</xsl:stylesheet>

This improvement comes at a price, though. Much as the complexity of the compound operations makes it easy for a developer inadvertently to nest the functions incorrectly in the nested syntax, it is also easy to order them incorrectly in the version that employs convenience variables. A debugging fix that is simple to describe (revise the order in which the functions are applied, which might incorrectly suggest that this could be mapped onto rearrange the lines in which the variables are created) turns out to be error prone because each line refers to two variable names, the one it is creating with its @name attribute and the one that it is using to determine the value, which is part of the value of its @select attribute. Changes in the order in which the functions are executed thus becomes far more complex and error-prone than dragging the lines into a different order. One way to conceptualize the challenge is that the variable declarations exist in a feeding relationship not just by virtue of the order in which they are declared (see immediately below about the XSLT prohibition against using a variable before it has been declared), but also by virtue of the way each declaration uses a value created by a different declaration.

Given that variables in XSLT cannot be redeclared, one might wonder why XSLT requires that sibling variable declarations within a stylesheet be arranged in conformity with the order in which they are referenced. That is, one might wonder why XSLT prohibits ordering sibling <xsl:variable> elements in arbitrary order, much as it is possible to apply templates within one template that references another regardless of the order in which the two are listed in the XSLT code. Whatever the reason for this restriction, it means that rearranging the order in which the functions are applied involves not only changing the references among them, but also modifying the order of the <xsl:variable> elements so that the order of declaration follows the order in which the values are used.

Function chaining

One reason that deeply nested functions with multiple parameters are difficult to read is that code is written from left to right, which tacitly encourages humans to read it from left to right, but the feeding order (and thus the execution) of nested functions flows from the inside out. This means that the function names flow from right to left (but not starting the far right of the line, which is filled with closing parentheses and possibly function arguments), yet because functions may have multiple parameters, it is necessary to look to both the left and the right when moving out to the next level. In languages that support object-oriented programming, the dotted method notation avoids this clash because both the notation and the execution flow from left to right. For example, the following Python example returns the same results as the XSLT above:[11]

names = ['King','Gertrude','King','Gertrude','King','Gertrude','King','Hamlet','Rosencrantz and Guildenstern',\
         'Hamlet','Rosencrantz','Hamlet','Rosencrantz','Hamlet','Rosencrantz','Hamlet','Rosencrantz','Hamlet',\
         'Rosencrantz','Hamlet','Rosencrantz','Hamlet','Guildenstern','Hamlet','King','Rosencrantz','King',\
         'Rosencrantz','King','Rosencrantz','King','Hamlet','King','Hamlet','King','Hamlet','King','Hamlet','King',\
         'Hamlet','King','Hamlet','King','Hamlet','King','Hamlet','King','Hamlet','King','Hamlet','King','Fortinbras',\
         'Captain','Fortinbras','Hamlet','Captain','Hamlet','Captain','Hamlet','Captain','Hamlet','Captain','Hamlet',\
         'Captain','Hamlet','Captain','Rosencrantz','Hamlet','Gertrude','Gentleman','Gertrude','Gentleman','Horatio',\
         'Gertrude','Ophelia','Gertrude','Ophelia','Gertrude','Ophelia','Gertrude','Ophelia','Gertrude','Ophelia',\
         'King','Ophelia','King','Ophelia','King','Ophelia','King','Ophelia','King','Gertrude','King','Gentleman',\
         'Gertrude','King','Laertes','Danes','Laertes','Danes','Laertes','Gertrude','Laertes','King','Laertes','King',\
         'Gertrude','King','Laertes','King','Laertes','King','Laertes','King','Laertes','King','Danes','Laertes',\
         'Ophelia','Laertes','Ophelia','Laertes','Ophelia','Laertes','Ophelia','Laertes','Ophelia','Laertes','King',\
         'Laertes','King','Horatio','Servant','Horatio','Sailor','Horatio','Sailor','Horatio','King','Laertes','King',\
         'Laertes','King','Messenger','King','Messenger','King','Laertes','King','Laertes','King','Laertes','King',\
         'Laertes','King','Laertes','King','Laertes','King','Laertes','King','Laertes','King','Laertes','King',\
         'Laertes','King','Laertes','King','Laertes','King','Gertrude','Laertes','Gertrude','Laertes','Gertrude',\
         'Laertes','King']
result = set(' '.join(names).replace(' and ', ' ').split())
print(result)

The set() function is used to remove the duplicates and its input is contained in the parentheses that immediately follow the function name, but otherwise the order of execution and the syntactic expression both flow from left to right using dotted method notation: from join() (equivalent to XSLT string-join()) into replace() (also called replace() in XSLT) into split() (equivalent to tokenize() in XSLT).[12]

While dotted method notation is natural within an object-oriented model, the notation is also not inherently incompatible with functional programming paradigms. For example, pipelines with dot notation when working within a functional model in Python are supported by the PyFunctional package [PyFunctional], which makes creating data pipelines easy by using chained functional operators, e.g.:

seq(1, 2, 3).map(lambda x: x * 2).reduce(lambda x, y: x + y)

XPath 3.1 introduced the “arrow operator” (=>) for this purpose, so that:

tokenize((normalize-unicode(upper-case($string))),"\s+")
can be rewritten as
$string => upper-case() => normalize-unicode() => tokenize("\s+")
[XPath 3.1, §3.16] One limitation to this XPath operator is that the value that is passed can be used only as the first argument to the next function, which means that traditional methods must be used when the value supplies any non-initial parameter.

Variables, templates, and functions

The description of nested functions above was based on functions from the core XPath library, but much more complex operations can be embedded in user-defined functions, which can then be nested on a single line, similarly to the nesting in the basic XPath library example. Indeed, a user-defined function can transform an entire input document into an entire intermediate document, optionally assigning the value of the function to a variable, and then apply another function to the intermediate document. If we create two (imaginary) user-defined functions, one that translates a document from Russian into English and another that tokenizes the English translation and returns a report (perhaps an HTML table) of the distribution of Scrabble values of the words, the starting point of the transformation might look like:

<xsl:template match="/">
    <xsl:variable name="intermediate" as="document-node()" select="djb:translate_into_english(.)"/>
    <xsl:sequence select="djb:create_scrabble_values($intermediate)"/>
</xsl:template>

As was the case in the discussion of core XPath library functions above, $intermediate is a convenience variable, and the functions can, alternatively, be nested directly:

<xsl:template match="/">
    <xsl:sequence select="djb:create_scrabble_values(djb:translate_into_english(.))"/>
</xsl:template>

The two user-defined functions might call other functions, apply templates, or otherwise perform operations of arbitrary complexity.

Insofar as functions and templates have similar properties (and are compiled to the same type of operations within the processor), instead of creating user-defined functions, it is also possible to use convenience variables with templates:

<xsl:template match="/">
    <xsl:variable name="intermediate" as="document-node()">
        <xsl:apply-templates select="."/>
    </xsl:variable>
    <xsl:apply-templates select="$intermediate"/>
</xsl:template>

Dimitre Novatchev provided real examples of this type of chaining in XSLT 1.0 and 2.0 in a StackOverflow response in 2010. [Novatchev 2010]

The use of intermediate variables with either functions or templates neither requires nor precludes serializing all or part of the intermediate results with <xsl:result-document> or <xsl:message> if that is needed for development or for production. But the use of templates with intermediate variables for pipelining does exhibit one of the principal limitations discussed above of using functions with variables for pipelining: editing the pipeline requires, beyond the insertion, deletion, or rearrangement of lines, changing the value of the @select attribute when the intermediate variable is used. For example, in the code above, if between the translation into English and the creation of the Scrabble value output we want to pluralize all words that are capable of being pluralized, not only do we need to create another variable, but we also need to change the value of the @select attribute of an <xsl:apply-templates> elements in a different line (modifications associated with the insertion are highlighted):

<xsl:template match="/">
    <xsl:variable name="intermediate" as="document-node()">
        <xsl:apply-templates select="."/>
    </xsl:variable>
    <xsl:variable name="pluralized" as="document-node()">
        <xsl:apply-templates select="$intermediate"/>
    </xsl:variable>
    <xsl:apply-templates select="$pluralized"/>
</xsl:template>

Inserting or deleting a step requires, in addition to the insertion or deletion itself, one adjustment to the value of a @select attribute in a different line. Rearranging steps requires more adjustments. There is also no easy way of changing the location where intermediate results are serialized, whether for development or for production. And because the use of intermediate variables requires additional physical lines of code for legibility, it becomes harder to see the entire pipeline as an uninterrupted sequence. What would be more robust, flexible, and maintainable would be a pattern that has the following properties:

  • Each step of the pipeline fits on a single short line of XSLT code, with no wrappers, no intervening statements, and nothing else that intrudes into a bare-bones list of steps.

  • Each step of the pipeline is entirely self-contained. Inserting, deleting, or rearranging steps does not require any modification to any code pertaining to other steps. This self-containment has two subcomponents:

    • The code that specifies the order of operations in the pipeline does not mention or pass any values explicitly. This means that changing the sequence of steps does not require changing variables in the arguments.

    • The code blocks that perform the pipeline operations all expect the same parameters and return the same result types. This means that changing the sequence of steps does not require any changes within any of the steps.

  • Intermediate results can be dumped at any time (or at more than one time) as a step in the pipeline, whether for development or for production purposes. Dumping is a part of the pipeline that happens between transformative operations on the data; it is not part of any individual transformative operation within the pipeline.

  • Optional: Steps can be added to the list before the code that performs them has been created, without requiring any adjustment to the parameters.[13]

Nested template calls

As was noted above, functions and templates have similar capabilities, but because the nesting syntax is different, performing the same complex operation by using nested template calls instead of nested functions addresses some of the preceding desiderata. In the example below, we pass the original input, saved in the $russian-text variable, as a parameter into a translate-into-english template, the return value of which becomes the value of an $english-text parameter, which we then pass into a create-scrabble-values template. That last template returns our results:

<xsl:variable name="scrabble_values">
    <xsl:call-template name="create-scrabble-values">
        <xsl:with-param name="english-text" as="document-node()">
            <xsl:call-template name="translate-into-english">
                <xsl:with-param name="input" select="$russian-text" as="document-node()"/>
            </xsl:call-template>
        </xsl:with-param>
    </xsl:call-template>
</xsl:variable>

This method addresses some, but not all, of the desiderata listed above. Its primary advantage is that it keeps the parameters adjacent to the template calls that use them, which is an improvement over scattering them, as often happens with nested functions. But insertion, deletion, and especially rearrangement are not one-line operations, and neither is dumping interim results during development, and the hierarchy becomes harder to follow as the number of steps, and therefore the degree of indentation in pretty-printed output, increases. Below we examine how the visitor pattern addresses these concerns.

The visitor pattern

The visitor pattern in an XSLT context

The visitor pattern is usually defined and illustrated in an object-oriented context that may be difficult to map onto the XSLT programming paradigm.[14] James Sugrue provides some real-world analogies:

A real world analogy always helps with the understanding of a design pattern. One example I have seen for the Visitor pattern in action is a taxi example, where the customer […] orders a taxi, which arrives at his door. Once the person sits in [it], the visiting taxi is in control of the transport for that person.

Shopping in the supermarket is another common example, where the shopping cart is your set of elements. When you get to the checkout, the cashier acts as a visitor, taking the disparate set of elements (your shopping), some with prices and others that need to be weighed, in order to provide you with a total.

In summary, if you want to decouple some logical code from the elements that you’re using as input, visitor is probably the best pattern for the job. [Sugrue 2010]

Because XSLT is not object-oriented, logical operations on objects (e.g., nodes of various types, atomic values) are naturally decoupled from the objects themselves. Michael Kay, writing about the simulation of higher-order functions in XSLT 2.0, situates the visitor pattern in a functional context:

[Y]ou might write a function that does a depth-first traversal of a tree and processes each node in the tree using a function that is supplied as an argument. This makes it possible to use one tree-walking routine to achieve many different effects (you might recognize this as the Visitor Pattern).

In XSLT (and XPath), functions are not first-class objects, so they cannot be passed as arguments to other functions.[15] However, the template-matching capability of <xsl:apply-templates> can be used to simulate higher order functions.

The use of <xsl:apply-templates> has been developed to a fine art by Dimitre Notatchev in his FXSL library (http://fxsl.sourceforge.net/), which provides many examples of the technique. [Kay 2008]

The source

The full XSLT stylesheet used to identify rhyme in Russian verse, which is the source of the excerpts presented here, is available at https://raw.githubusercontent.com/djbpitt/poetry/master/rhyme.xsl. It can be run against itself, in which case it processes default input that is embedded in the stylesheet. Additional documentation, and the ancillary XSLT stylesheets that it imports, are available at the project’s GitHub repo at https://github.com/djbpitt/poetry.

Defining the order of operations

The transformation is driven by a sequence of operations, each of which is represented by an empty element in a custom namespace:

<xsl:variable name="operations" as="element()+">
    <djb:prepareWords/>
    <djb:lexical/>
    <djb:proclitics/>
    <djb:enclitics/>
    <djb:tsa/>
    <djb:palatalize/>
    <djb:jot/>
    <djb:romanize/>
    <djb:finalDevoice/>
    <djb:regressiveDevoice/>
    <djb:regressiveVoice/>
    <djb:palatalAssimilation/>
    <djb:consonantCleanup/>
    <djb:vowelReduction/>
    <!-- 
        diagnosticOutput writes the string to stderr as an <xsl:message>
        move it around or comment it out, as appropriate
    -->
    <!--<djb:diagnosticOutput/>-->
    <djb:stripSpaces/>
    <djb:rhymeString/>
</xsl:variable>

Note that <djb:diagnosticOutput> can be inserted wherever the developer wishes to examine interim results, including in multiple locations.

Starting the transformation

The transformation is run once for each <line> element in a poem. It operates through sibling recursion, applying templates to the first member of the $operations variable defined in the example above and passing as parameters the current line ($input) and the sequence of remaining parameters (remove($operations, 1)):

<xsl:apply-templates select="$operations[1]" mode="operate">
    <xsl:with-param name="input" select="."/>
    <xsl:with-param name="remaining" select="remove($operations, 1)"/>
</xsl:apply-templates>

Note that we apply templates not to the input (the <line> element), but to the operation, and we pass along to the operation as parameters the input and the sequence of remaining operations. None of the operations refers to any other explicitly, and none refers to any specific parameters, which means that inserting, deleting, and rearranging operations involves changing only the list of operations itself. Because the sibling recursion strategy will walk through the operations in order (see below), one operation feeds the next without having to know its identity.

Template matching

When we apply templates to an element in the sequence of operations, a single template matches all elements in the custom namespace, and it dispatches the processing to the appropriate location within an <xsl:choose> element. That single template begins as follows:

<xsl:template match="djb:*" mode="operate">
    <xsl:param name="input" as="xs:string" required="yes"/>
    <xsl:param name="remaining" as="element()*"/>
    <xsl:variable name="results" as="xs:string*">
        <xsl:choose>
            <!-- ******************************************* -->
            <!-- djb:lexical: Idiosyncrasies in pronunciation (including -ogo) -->
            <!-- ******************************************* -->
            <xsl:when test="self::djb:lexical">
                <xsl:sequence select="djb:lexical($input)"/>
            </xsl:when>
            <!-- ******************************************* -->
            <!-- djb:proclitics: Merge proclitics with bases -->
            <!-- ******************************************* -->
            <xsl:when test="self::djb:proclitics">
                <xsl:sequence select="djb:proclitic($input, 1)"/>
            </xsl:when>
            <!-- ******************************************* -->
            <!-- djb:enclitics: Merge enclitics with bases -->
            <!-- ******************************************* -->
            <xsl:when test="self::djb:enclitics">
                <xsl:sequence select="djb:enclitic($input, 1)"/>
            </xsl:when>
            <!-- ******************************************* -->
            <!-- djb:tsa: Convert ть?ся$ to тса -->
            <!-- ******************************************* -->
            <xsl:when test="self::djb:tsa">
                <xsl:sequence select="replace($input, 'ться$', 'тса')"/>
            </xsl:when>
            <!-- ******************************************* -->
            <!-- djb:palatalize: Capitalize all palatalized consonants (including unpaired) -->
            <!-- ******************************************* -->
            <xsl:when test="self::djb:palatalize">
                <xsl:variable name="result1" as="xs:string+">
                    <xsl:analyze-string select="$input"
                        regex="([бвгдзклмнпрстфх])([яеиёюЯЕИЁЮь])">
                        <xsl:matching-substring>
                            <xsl:sequence
                                select="concat(upper-case(regex-group(1)), regex-group(2))"/>
                        </xsl:matching-substring>
                        <xsl:non-matching-substring>
                            <xsl:sequence select="."/>
                        </xsl:non-matching-substring>
                    </xsl:analyze-string>
                </xsl:variable>
                <xsl:sequence select="translate(string-join($result1, ''), 'чйщ', 'ЧЙЩ')"/>
            </xsl:when>

The pipline operations are the only elements in the djb: namespace, so this template matches all of them, and nothing else. It accepts the two parameters and then creates a variable called $results, inside of which it uses <xsl:choose> to identify the operation that is being processed. Some of these operations (e.g., <djb:lexical>, <djb:proclitics>, <djb:enclitics>) call user-defined functions (in a separate function library) to do their work. Others use core XPath library functions (e.g., <djb:tsa>). Others are longer and perform more complex processing (e.g., <djb:palatalize>). Each <xsl:when> element, which processes a specific operation, returns a sequence as the value of the (interim) $results variable that wraps the <xsl:choose> element. How this variable contributes to the result of the surrounding template is described below.

Unmatched operations

At different moments in the development we implemented different treatments of operations that are not matched in the single processing template. One option is to let operations listed in the sequence but not associated with processing code pass their logical input, unchanged, along to the next step in the pipeline. This is the simplest form of a stub; the name of the operation is present, but the default behavior if the operation is not defined is to pass:

<!-- ******************************************* -->
<!-- Default to continue processing if operate step has no code-->
<!-- ******************************************* -->
<xsl:otherwise>
    <xsl:apply-templates select="$remaining[1]" mode="visit">
        <xsl:with-param name="input" select="$input"/>
        <xsl:with-param name="remaining" select="remove($remaining, 1)"/>
    </xsl:apply-templates>
</xsl:otherwise>

The <xsl:otherwise> default applies templates to the next operation in the sequence, passing along the input it received unchanged.

Alternatively, when the development is complete, we might prefer to raise a fatal error if the sequence of operations includes any that have not been defined:

<!-- ******************************************* -->
<!-- Default to terminate if operate step has no code-->
<!-- ******************************************* -->
<xsl:otherwise>
    <xsl:message terminate="yes">Unmatched visitor element <xsl:value-of
        select="local-name()"/></xsl:message>
</xsl:otherwise>

One or the other of these two defaults should be commented out at all times.

Processing the next operation

Except for the <xsl:otherwise> option that raises an error and terminates, each operation creates a sequence that becomes the value of the $results variable that wraps the <xsl:choose> element. In some cases the value is a single string; in others it is a sequence of strings (thus the plural name).[16] After the $results variable has been constructed, it is used to construct a $result (singular), typed as a single string. The template then passes that value to the next operation, using sibling recursion:

<xsl:variable name="result" as="xs:string" select="string-join($results, '')"/>
<xsl:apply-templates select="$remaining[1]" mode="#current">
    <xsl:with-param name="input" select="$result"/>
    <xsl:with-param name="remaining" select="remove($remaining, 1)"/>
</xsl:apply-templates>

Termination

The sibling recursion pattern removes each operation after it has been processed and then passes the sequence of remaining operations along as a parameter. When the last operation has been processed and removing it from the sequence of operations leaves an empty sequence, the pipeline has finished, and returns gracefully:

<xsl:if test="empty($remaining)">
    <xsl:sequence select="$result"/>
</xsl:if>

Evaluation

The paradigm described here meets all of the desiderata enumerated earlier and repeated here:

  • Each step of the pipeline fits on a single short line of XSLT code, with no wrappers, no intervening statements, and nothing else that intrudes into a bare-bones list of steps.

  • Each step of the pipeline is entirely self-contained. Inserting, deleting, or rearranging steps does not require any modification to any code pertaining to other steps.

  • The list of steps does not mention or pass any parameters explicitly. Parameter passing is handled within the processing steps (whether functions or templates), and changing the sequence of steps does not require changing the code that manages parameters within any of the steps.

  • Intermediate results can be dumped at any time (or at more than one time) as a step in the pipeline, whether for development or for production purposes. Dumping is a part of the pipeline that happens between transformative operations on the data; it is not part of any individual transformative operation within the pipeline.

  • Steps can be added to the list before the code that performs them has been created, without requiring any adjustment to the parameters.

Adapting the visitor pattern to use <xsl:iterate>[17]

The introduction of <xsl:iterate> in XSLT 3.0 suggests an alternative way to implement the visitor pattern.[18] Consider the following example, which combines the processing pipeline and the text to be transformed in a single XML document:

<example>
    <pipeline xmlns="Pipeline">
        <tokenize/>
        <lower-case/>
        <swap s1="fox" s2="dog"/>
        <swap s1="quick" s2="lazy"/>
        <pipeline>
            <replace match="b" replace="z"/>
            <replace match="o" replace="a"/>
        </pipeline>
        <replace match="lazy" replace="eager"/>
        <capitalize/>
        <reverse/>
        <string-join separator="-"/>
    </pipeline>
    <source>
        <div>
            <p>The quick BROWN fox jumps OVER the lazy dog</p>
        </div>
    </source>
</example>

As in the original visitor pattern example, the pipeline steps are specified as elements, in a custom namespace, to which templates are applied. <pipeline> elements can be nested; this has no effect on the execution (the nesting is flattened), but it can simplify modular development and inclusion. The following XSLT stylesheet applies the pipeline steps in order using xsl:iterate:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:math="http://www.w3.org/2005/xpath-functions/math" exclude-result-prefixes="xs math"
    xmlns:p="Pipeline" version="3.0">
    <xsl:output method="xml" indent="true"/>
    <xsl:mode on-no-match="shallow-copy"/>
    <xsl:template match="/">
        <xsl:apply-templates select="example/source" mode="#current">
            <xsl:with-param name="pipeline" select="example/p:pipeline" tunnel="true"/>
        </xsl:apply-templates>
    </xsl:template>
    <xsl:template match="text()[matches(., '\S')]">
        <xsl:param name="pipeline" tunnel="true"/>
        <xsl:call-template name="pipeline">
            <xsl:with-param name="input" select="." as="xs:string"/>
            <xsl:with-param name="pipeline" as="element()" select="$pipeline"/>
        </xsl:call-template>
    </xsl:template>
    <xsl:template name="pipeline" as="item()*">
        <xsl:param name="input" as="item()*"/>
        <xsl:param name="pipeline" as="element()"/>
        <xsl:iterate select="$pipeline/*">
            <xsl:param name="input" select="$input"/>
            <xsl:on-completion>
                <xsl:sequence select="$input"/>
            </xsl:on-completion>
            <xsl:next-iteration>
                <xsl:with-param name="input" as="item()*">
                    <xsl:apply-templates select=".">
                        <xsl:with-param name="input" select="$input"/>
                    </xsl:apply-templates>
                </xsl:with-param>
            </xsl:next-iteration>
        </xsl:iterate>
    </xsl:template>
    <xsl:template match="p:tokenize" as="xs:string*">
        <xsl:param name="input" as="xs:string?"/>
        <xsl:variable name="split" select="(@split, '\s+')[1]"/>
        <xsl:sequence select="tokenize($input, $split)"/>
    </xsl:template>
    <xsl:template match="p:string-join" as="xs:string">
        <xsl:param name="input" as="item()*"/>
        <xsl:variable name="sep" select="(@separator, ' ')[1]"/>
        <xsl:sequence select="string-join($input ! string(.), $sep)"/>
    </xsl:template>
    <xsl:template match="p:lower-case" as="xs:string*">
        <xsl:param name="input" as="item()*"/>
        <xsl:sequence select="$input ! string(.) ! lower-case(.)"/>
    </xsl:template>
    <xsl:template match="p:capitalize" as="xs:string*">
        <xsl:param name="input" as="item()*"/>
        <xsl:sequence
            select="$input ! string(.) ! (upper-case(substring(., 1, 1)) || lower-case(substring(., 2)))"
        />
    </xsl:template>
    <xsl:template match="p:reverse" as="item()*">
        <xsl:param name="input" as="item()*"/>
        <xsl:sequence select="reverse($input)"/>
    </xsl:template>
    <xsl:template match="p:replace" as="xs:string*">
        <xsl:param name="input" as="item()*"/>
        <xsl:variable name="match" select="(@match, ' ')[1]"/>
        <xsl:variable name="replace" select="(@replace, ' ')[1]"/>
        <xsl:sequence select="$input ! string(.) ! replace(., $match, $replace)"/>
    </xsl:template>
    <xsl:template match="p:swap" as="xs:string*">
        <xsl:param name="input" as="item()*"/>
        <xsl:variable name="s1" select="@s1"/>
        <xsl:variable name="s2" select="@s2"/>
        <xsl:for-each select="$input">
            <xsl:variable name="parts" as="xs:string*">
                <xsl:analyze-string select="string(.)" regex="{$s1}|{$s2}">
                    <xsl:matching-substring>
                        <xsl:sequence
                            select="
                                if (. eq $s1) then
                                    $s2
                                else
                                    $s1"
                        />
                    </xsl:matching-substring>
                    <xsl:non-matching-substring>
                        <xsl:sequence select="."/>
                    </xsl:non-matching-substring>
                </xsl:analyze-string>
            </xsl:variable>
            <xsl:sequence select="string-join($parts)"/>
        </xsl:for-each>
    </xsl:template>
    <xsl:template match="p:pipeline" as="item()*">
        <xsl:param name="input" as="item()*"/>
        <xsl:call-template name="pipeline">
            <xsl:with-param name="input" select="$input"/>
            <xsl:with-param name="pipeline" select="."/>
        </xsl:call-template>
    </xsl:template>
</xsl:stylesheet>

The output is:

<source>
    <div>
        <p>Fax-Quick-The-Aver-Jumps-Dag-Zrawn-Eager-The</p>
    </div>
</source>

The <xsl:iterate> strategy uses iteration over the children of the <pipeline> element instead of sibling recursion. The example above also relies on separate templates for every element in the Pipeline namespace instead of using a single template and <xsl:choose>, a decision that is independent of the choice between iteration and recursion. The recursion implementation relies on explicit surgery on the list of pipeline steps with the help of the XPath remove() function, as well as a somewhat indirect test for the end of the pipeline, which uses the XPath empty() function to check whether any pipeline steps remain. The iterative approach may therefore feel more natural, since it uses elements with more transparent, self-documenting, names, such as <xsl:next-iteration> and <xsl:on-completion>.

Nested calls to the replace() functions

Use cases for nested replace() functions

Developers who work with uncommon writing systems (ancient, medieval, minority, etc.) often need to convert documents from one character coding to another. Sometimes the writing system is the same but the character set is different, such as when pre-Unicode legacy documents must be converted to Unicode. Sometimes the conversion is from one writing system to another, such as when Cyrillic texts are Romanized for library catalog purposes, or when linguistic field data in domain-specific notation are converted to the International Phonetic Alphabet (IPA).

If the mapping from one character representation to another is one to one, it can be implemented with a single translate() function. The implementation of translate() has the advantage of not suffering from feeding or bleeding conflicts. For example, translate($input,'AB','BA') replaces A with B and B with A simultaneously, so that old instances of A that are translated to B do not then wind up feeding the translation of B to A, or vice versa.

But mappings that are one to many, many to many, and many to one cannot be managed with translate() because translate() can only map exactly one character to exactly one character (or to zero characters). For this reason, mappings that are not one-to-one are performed using the replace() function, and if there are many such replacements, the transformation requires a pipeline of replace() operations (which must be ordered in a way that is sensitive to feeding and bleeding issues). Even if we bracket ordering constraints, the necessity of pipelining replace() functions (in our work, often more than two dozen) raises the legibility and maintainability issues discussed above. We avoid those issues by implementing a table-driven approach that separates the processing with replace() from the match and replacement values. The example below is drawn from a Uyghur-to-IPA mapping project.

The mapping table

The mapping table consists of pairs of match and replacement strings, grouped as those that are not one to one (<multiple>) and those that are (<unitary>):[19]

<pairs>
    <multiple>
        <pair>
            <orth>ch</orth>
            <ipa>ʧʰ</ipa>
        </pair>
        <pair>
            <orth>gh</orth>
            <ipa>ɣ</ipa>
        </pair>
        <pair>
            <orth>ng</orth>
            <ipa>ŋ</ipa>
        </pair>
        <pair>
            <orth>sh</orth>
            <ipa>ʃ</ipa>
        </pair>
        <pair>
            <orth>zh</orth>
            <ipa>ʒ</ipa>
        </pair>
    </multiple>
    <unitary>
        <pair>
            <orth>a</orth>
            <ipa>a</ipa>
        </pair>
        <pair>
            <orth>e</orth>
            <ipa>ɛ</ipa>
        </pair>
        <pair>
            <orth>b</orth>
            <ipa>b</ipa>
        </pair>
        <pair>
            <orth>d</orth>
            <ipa>d</ipa>
        </pair>
        <pair>
            <orth>é</orth>
            <ipa>e</ipa>
        </pair>
        <pair>
            <orth>f</orth>
            <ipa>f</ipa>
        </pair>
        <pair>
            <orth>g</orth>
            <ipa>g</ipa>
        </pair>
        <pair>
            <orth>h</orth>
            <ipa>h</ipa>
        </pair>
        <pair>
            <orth>x</orth>
            <ipa>χ</ipa>
        </pair>
        <pair>
            <orth>i</orth>
            <ipa>i</ipa>
        </pair>
        <pair>
            <orth>j</orth>
            <ipa>ʤ</ipa>
        </pair>
        <pair>
            <orth>k</orth>
            <ipa>k</ipa>
        </pair>
        <pair>
            <orth>q</orth>
            <ipa>q</ipa>
        </pair>
        <pair>
            <orth>l</orth>
            <ipa>l</ipa>
        </pair>
        <pair>
            <orth>l</orth>
            <ipa>ɫ</ipa>
        </pair>
        <pair>
            <orth>m</orth>
            <ipa>m</ipa>
        </pair>
        <pair>
            <orth>n</orth>
            <ipa>n</ipa>
        </pair>
        <pair>
            <orth>o</orth>
            <ipa>o</ipa>
        </pair>
        <pair>
            <orth>ö</orth>
            <ipa>ø</ipa>
        </pair>
        <pair>
            <orth>p</orth>
            <ipa>pʰ</ipa>
        </pair>
        <pair>
            <orth>r</orth>
            <ipa>r</ipa>
        </pair>
        <pair>
            <orth>s</orth>
            <ipa>s</ipa>
        </pair>
        <pair>
            <orth>t</orth>
            <ipa>tʰ</ipa>
        </pair>
        <pair>
            <orth>u</orth>
            <ipa>u</ipa>
        </pair>
        <pair>
            <orth>ü</orth>
            <ipa>y</ipa>
        </pair>
        <pair>
            <orth>w</orth>
            <ipa>w</ipa>
        </pair>
        <pair>
            <orth>y</orth>
            <ipa>j</ipa>
        </pair>
        <pair>
            <orth>z</orth>
            <ipa>z</ipa>
        </pair>
        <pair>
            <orth>‘</orth>
            <ipa>ˀ</ipa>
        </pair>
        <pair>
            <orth>'</orth>
            <ipa>ˀ</ipa>
        </pair>
    </unitary>
</pairs>

Processing mappings that are not one to one

Because some of the characters in the <multiple> portion of the mapping table also appear in the <unitary> portion, bleeding constraints require that mappings that are not one to one be processed before those that are, and that processing requires the replace() function. This processing is performed with a single user-defined function:

<xsl:function name="djb:multireplace" as="xs:string">
    <!-- this function handles all replacements that aren't one-to-one -->
    <xsl:param name="instring" as="xs:string"/>
    <xsl:param name="pairs" as="element(pair)*"/>
    <xsl:choose>
        <xsl:when test="empty($pairs)">
            <xsl:sequence select="$instring"/>
        </xsl:when>
        <xsl:otherwise>
            <xsl:variable name="result"
                select="replace($instring, $pairs[1]/orth, $pairs[1]/ipa)"/>
            <xsl:sequence select="djb:multireplace($result, remove($pairs, 1))"/>
        </xsl:otherwise>
    </xsl:choose>
</xsl:function>

When the function is called for the first time on a string, $pairs has been declared as a sequence of all <pair> elements that are children of <multiple>. The function (in the <xsl:otherwise> clause) recruits the first <pair> elements in that sequence and uses its <orth> child as the match pattern and its <ipa> child as the replacement. It then removes the <pair> it has just used from the head of the sequence of <pair> elements and calls itself recursively; when there are no <pair> elements left to process, the recursion terminates and the last string value is returned. The separation of the replace() function from the arguments on which it operates makes it easy for the developer to insert, delete, or rearrange the pairings in the mapping file.

Processing mappings that are one to one

Mapping that are one to one can be processed with a single instance of the translate() function:

<xsl:variable name="unitary-orth" select="string-join($unitary//orth,'')" as="xs:string"/>
<xsl:variable name="unitary-ipa" select="string-join($unitary//ipa,'')" as="xs:string"/>
<xsl:sequence select="translate($temp,$unitary-orth,$unitary-ipa)"/>

The table-driven design all but eliminates the opportunity for off-by-one errors when aligning the individual characters in the match argument with those in the replacement argument.

Conclusion

The most natural and direct human understanding of pipelining is that it takes the output of one operation (function or template) and uses it as the input to the next. The most direct representations of that processing model in XSLT code encounter challenges to legibility as developers drown in a sea of parentheses or struggle to update variable names and references as they add, delete, or rearrange steps in the pipeline. The use cases illustrated in this report avoid those challenges and struggles by moving the statement of the pipeline into a sequence of simple steps (operations in the visitor pattern, mapping pairs in the transliteration example) that are easy to read and to edit because adjustments in a step are entirely self-contained, and do not require adjustments in others. The feeding and bleeding logic remains the intellectual responsibility of the developer, but the inevitable mistakes in this area are more easily remedied during debugging because modifications in the order of pipeline steps are constrained to a single element. None of the methods described here is new, but their explicit juxtaposition, comparison, and evaluation in a tutorial context based on real use cases has clarified much about micropipelining for the author, and, it is hoped, for the reader, as well.

References

[Adams and Birnbaum 1997] Adams, Lawrence D. and David J. Birnbaum. Text technology 7, 1 (1997): 1–17. Corrected reprint available at http://poetry.obdurodon.org/reports/adams-birnbaum.xhtml

[Birnbaum and Thorsen 2015] Birnbaum, David J., and Elise Thorsen. Markup and meter: Using XML tools to teach a computer to think about versification. Presented at Balisage: The Markup Conference 2015, Washington, DC, August 11 - 14, 2015. In Proceedings of Balisage: The Markup Conference 2015. Balisage Series on Markup Technologies, vol. 15 (2015). doi:https://doi.org/10.4242/BalisageVol15.Birnbaum01. https://www.balisage.net/Proceedings/vol15/html/Birnbaum01/BalisageVol15-Birnbaum01.html

[Cocoon] Apache Cocoon. http://cocoon.apache.org/

[Gamma 1994] Gamma, Erich, Richard Helm, Ralph Johnson, and John Vlissides. Design patterns: elements of reusable object-oriented software. Boston, MA: Addison-Wesley, 1994.

[Kay 2008] Kay, Michael. XSLT 2.0 and XPath 2.0 programmer’s reference, 4th edition. Indianapolis: Wiley/Wrox, 2008.

[Kay 2009] Kay, Michael. You pull, I’ll push: on the polarity of pipelines. Presented at Balisage: The Markup Conference 2009, Montréal, Canada, August 11 - 14, 2009. In Proceedings of Balisage: The Markup Conference 2009. Balisage Series on Markup Technologies, vol. 3 (2009). doi:https://doi.org/10.4242/BalisageVol3.Kay01. https://www.balisage.net/Proceedings/vol3/html/Kay01/BalisageVol3-Kay01.html

[Kiparsky 1973] Kiparsky, Paul. Abstractness, opacity, and global rules. In Three dimensions of linguistic theory, O. Fujimura, ed, pages 57–86. Tokyo: TEC, 1973.

[Knuth 1974] Kunth, Donald E. Structured programming with go to statements. Computing surveys 6(4): 261–301. doi:https://doi.org/10.1145/356635.356640. http://web.archive.org/web/20130731202547/http://pplab.snu.ac.kr/courses/adv_pl05/papers/p261-knuth.pdf

[Koenig 1995] Patterns and antipatterns. Journal of object-oriented programming 8(1): 46-48 (1995)

[McIlroy 1964] The origin of Unix pipes. Bell Labs document archive. http://doc.cat-v.org/unix/pipes/

[Novatchev 2010] Novatchev, Dimitre. Response to Is daisy chaining xslt an accepted practice? http://stackoverflow.com/questions/4571956/is-daisy-chaining-xslt-an-accepted-practice

[Piez 2016] Piez, Wendell. Framing the problem: building customized editing environments and workflows. Presented at Balisage: The Markup Conference 2016, Washington, DC, August 2 - 5, 2016. In Proceedings of Balisage: The Markup Conference 2016. Balisage Series on Markup Technologies, vol. 17 (2016). doi:https://doi.org/10.4242/BalisageVol17.Piez01. https://www.balisage.net/Proceedings/vol17/html/Piez01/BalisageVol17-Piez01.html

[Piez 2014] Piez, Wendell. Response to Ignoring ambiguous matches. https://www.oxygenxml.com/archives/xsl-list/201402/msg00085.html

[PyFunctional] PyFunctional (Python package). https://github.com/EntilZha/PyFunctional

[Saxonica <xsl:iterate>] xsl:iterate. Saxon 9.8 Documentation. http://www.saxonica.com/html/documentation/xsl-elements/iterate.html

[Sugrue 2010] Sugrue, James. Visitor pattern tutorial with Java examples. https://dzone.com/articles/design-patterns-visitor

[Why] Why is join() a string method instead of a list or tuple method? Part of the [Python] Design and history FAQ. https://docs.python.org/3/faq/design.html#why-is-join-a-string-method-instead-of-a-list-or-tuple-method

[XPath 3.1] XML Path Language (XPath) 3.1. W3C Recommendation, 21 March 2017. https://www.w3.org/TR/xpath-31/

[XProc] XProc: An XML Pipeline Language. W3C Recommendation, 11 May 2010. https://www.w3.org/TR/xproc/



[1] The author is grateful to two anonymous Balisage referees for comments and suggestions, to Wendell Piez for the introduction to the visitor pattern, and to John Lumley for the <xsl:iterate> example.

[2] Donald Knuth’s oft-cited statement that premature optimization is the root of all evil does not blanketly disparage optimization. In narrow context it reads:

Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.

Yet we should not pass up our opportunities in that critical 3%. [Knuth 1974, p. 268; see also the larger context on that page]

[3] The place of stress in Russian, as in English, is not normally indicated orthographically, and because stress can fall on different syllables in different words, including different inflected forms of the same lexeme, it is not predictable without linguistic (lexical and morphological) knowledge. The identification of meter depends on knowing the place of stress, and Birnbaum and Thorsen 2015 describes how stress is added automatically to enrich original unstressed input in natural Russian orthography before performing metrical analysis.

[4] An early version of the algorithm we developed for this purpose was described in Adams and Birnbaum 1997. The XSLT implementation of the enhanced version currently used in the project is available at https://github.com/djbpitt/poetry/blob/master/rhyme.xsl.

[5] If a sequence constructor is used with no as attribute, a temporary document is constructed. This is done by creating a new document node and using the value of the result sequence to form the children of the document node. [Kay 2008, p. 503] There are times when constructing a temporary tree is nonetheless appropriate; for an example, see Kay 2008, p. 510ff.

[6] In this case, the reuse of the same value in the @y and @height attributes means that the variable is not merely a matter of convenience, since that reuse also means that the same expression does not have to be evaluated more than once.

[8] Replacing and with a space is a brittle strategy because it would break on multiple-speaker values that were joined in different ways, such as <speaker>Curly, Larry, and Moe</speaker>. For that matter, tokenizing on white space is brittle because it would break on single persons with names like Player King or First Player. The point of the example is not to find an optimal solution to the multiple-speaker problem, but to provide an example of pipelining functions.

[9] Furthermore, it does not emerge unchanged when the stylesheet is pretty-printed in <oXygen/>, and probably in at least some other editors.

[10] See https://xkcd.com/297/. The name is actually derived from LISt Processor.

[11] To keep the focus on the pipelining code, the speaker names, as they appear in the act, are supplied as a Python list. In a real application they would be extracted from the XML using one of the several Python XML libraries.

[12] The syntax of the join() method appears to make some programmers feel uncomfortable. As is explained in Why, join() is a method not of the list (the Python equivalent of an XSLT sequence, in this case the sequence of speaker names), but of the separator string (in this example, a space character), which might suggest that the code is operating on the separator character and using the list to perform the operation. The reason this may feel unnatural is that what is being joined is the list items, and a human might think that we perform joining to modify a list, and not to modify a separator string. That perception is buttressed by the Python split() function, where 'Curly,Larry,Moe'.split(',') splits the Three Stooges at the comma. That is, split() is a method of the complex object we seek to modify with the help of a simple string, but join() is a method of the simple string that we use (at least from a human perspective) to modify a complex list. The perceived assymetry of join() with split() invites confusion, especially for an XSLT programmer for whom first or leftmost feels like the locus of an operation because that’s the way the signatures of the XPath contains(), ends-with(), tokenize(), and (especially) string-join() functions operate.

[13] This can be done with functions or templates by creating stubs that pass their input along to the next step in the pipeline, but it would be more convenient to have that happen by default, without separate explicit stub code for each not-yet-fully-specified step.

[14] I am grateful to Wendell Piez for introducing me to this pattern, and for his many generous contributions, in the course of a rich correspondence in spring 2017, to my elaboration of the ideas described here. Piez identifies this strategy as an instance of the visitor pattern in Piez 2014.

[15] [Added:] This statement is based on XPath 2.0 and XSLT 2.0. Functions may be used as first-class objects in more recent versions of the standards.

[16] The first operation (<djb:prepareWords>) returns a sequence of elements, rather than a string or sequence of strings, and is handled separately in a template that matches the operation’s fully qualified name explicitly, outside the template that matches djb:*. The more explicit @match value has a higher priority than the wildcard element match, and therefore supervenes when needed.

[17] I am grateful to John Lumley for introducing me to this pattern, and for supplying the code used here to illustrate it.

[18] Although not all features of XSLT 3.0 streaming are supported in the open source Saxon HE product, <xsl:iterate> is. See Saxonica <xsl:iterate>

[19] The match parameter is a regular expression, but in most of our applications it has turned out to be a string, that is, a regular expression that uses no metacharacters or escapes and matches only its entire literal self.

Author's keywords for this paper:
pipeline

David J. Birnbaum

Professor and Chair

Department of Slavic Languages and Literatures, University of Pittsburgh

David J. Birnbaum is Professor and Chair of the Department of Slavic Languages and Literatures at the University of Pittsburgh. He has been involved in the study of electronic text technology since the mid-1980s, has delivered presentations at a variety of electronic text technology conferences, and has served on the board of the Association for Computers and the Humanities, the editorial board of Markup languages: Theory and practice, and the Text Encoding Initiative Council. Much of his electronic text work intersects with his research in medieval Slavic manuscript studies, but he also often writes about issues in the philosophy of markup.