Constructing an application in the form of a pipeline has many advantages, the main
ones being (a) that the code of each step in the pipeline is kept very simple; (b) that it
is very easy to assemble an application from a set of components, thus maximizing the
potential for component reuse, and (c) there is no requirement that each step in a
pipeline should use the same technology; it's easy to mix XSLT, XQuery, Java and so on in
different stages.A number of products are available to assist with the development and management of
pipeline-based applications, examples being Orbeon [Orbeon] and Apache Cocoon [Cocoon].
More recently W3C has developed a standard language, XProc [W3C XProc], for the definition
of pipelines. []
Insofar as XSLT allocates its processing among templates and functions, Wendell Piez
writes that transformations are sometimes (frequently) pipelines themselves internally
(in whole or in part), as indeed XSLT can be seen as a language for pipelining function
calls. []Beyond using the output of one process as input to the next, pipelining invites
computational efficiency by making it possible to stream the flow of information, potentially
avoiding the overhead of building complete intermediate structures in memory or writing
intermediate output of the first process to disk and then reading it from disk into the input
of the next process. Whether an application takes advantage of this possibility depends on the
implementation (and not all operations are streamable), but although pipelining is similar in
some respects (especially pertaining to the eventual final output) to saving and reusing
intermediate results, in a functional environment that supports streaming and multiprocessing,
it also has the potential of offering performance benefits by eschewing the creation of
intermediate storage objects in variables or in disk files and permitting data to be processed
as soon as it and a processing core become available. Furthermore, pipelining may also prove
economically advantageous with respect to development effort, about which Piez writes that:
a well designed pipeline of steps can be cheaper to build and easier to maintain than
a single transformation doing the same thing (and [for the same reason] micropipelining
within a transformation is also a useful thing). Not infrequently, a complex problem can
be addressed by refactoring it into […] phases, each […] of which can be simpler and
easier (even together) to build and operate than a one-pass solution—even one that by some
measures is more efficient. []
Patterns and antipatternsThe term pattern refers to a generalizable way of
addressing a common and generalizable type of problem. It is used widely in object-oriented
programming (see, for example, ), but it is not restricted to that
domain, and in this presentation the generalizable problem we care about is pipelining in
XSLT, and the generalizable strategies that are commonly employed to deal with it are our
patterns. The term antipattern was introduced by Andrew
Koenig in 1995 to describe something that looks like a paradigm for a solution to a common
type of problem, that is, a pattern, but that for structural reasons leads to results that do
not live up to best practice expectations. [] As
used in this report, the distinction between patterns and antipatterns helps us focus on the
strengths and weaknesses of alternative strategies for pipelining within XSLT.It is often said that code is read more often than it is written, and from that
perspective, legibility, which contributes to ease of maintenance, is one feature that can
distinguish patterns from antipatterns. Another is computational complexity, and in that
context scalability is another such feature (but see immediately below).Donald Knuth’s oft-cited statement that premature optimization is the root of
all evil does not blanketly disparage optimization. In narrow context it reads:
Programmers waste enormous amounts of time thinking about, or worrying about, the
speed of noncritical parts of their programs, and these attempts at efficiency
actually have a strong negative impact when debugging and maintenance are considered.
We should forget about small efficiencies, say about
97% of the time: premature optimization is the root of all evil.Yet we should not pass up our opportunities in that critical 3%. [, p. 268; see also the larger context on that page]
Michael Kay, in a discussion of Jackson Structured Programming, writes that
[t]he essential principle of JSP was the idea that the structure of the program
should mirror the structure of the data, continuing that JSP can contribute to both
code legibility and maintenance, on the one hand, and operational efficiency, on the other:
[T]he user achieves the benefits that come from writing the application as a controlling
application (maintainability, readability, reusability) combining these with the
improvements in system performance that come from using a push-based control model. []
The boundary between a pattern and an antipattern is not always unambiguous. Some
programming strategies discussed below, such as nested functions, are easy to write and read
in simple situations (e.g., two functions, each of which takes one parameter), but may scale
poorly with respect to legibility as the complexity of the context increases. Such models
might therefore be sensible patterns when applied in those simpler situations, while
transforming themselves gradually into antipatterns as the complexity of the context
increases. Scalability may be a useful feature in a pattern, but the strategy that outperforms
others with big data is not necessarily optimal in contexts where the data
never grow very large. For that reason, the discussion below avoids labeling specific
solutions globally as pattern or antipattern, and instead discusses both the strengths and
weaknesses of each approach, with attention to the programming context. We thus use pattern
and antipattern as a framework for evaluating the strengths and weaknesses of coding
strategies, rather than as a taxonomy for classifying them as Good or Bad in a binary
fashion.Micropipelining in contextPipelining may play a role in XSLT development on at least three levels: 1) internally,
within the XSLT processor; 2) between or across stylesheets; and 3) within a single
stylesheet. We describe these below by way of providing a broad context and overview, but the
focus of this report is on only the last of the three, on pipelining within a single XSLT
stylesheet, which we call micropipelining.1. Pipelining within the XSLT processorTo the typical XSLT developer, pipelining within the XSLT processor is a black box. Not
only can optimizations within an XSLT processor transform XSLT instructions into internal
operations that achieve the same results by following different execution logic, but the
largely declarative XSLT programming paradigm means that the order employed by the processor
to execute parts of a transformation is not entirely under the control of the XSLT
developer. Among other things, these properties enable an XSLT processor to improve
performance by executing operations in parallel, a feature with consequences that sometimes
surprise XSLT developers who may be used to thinking within an imperative paradigm. For
example, many new XSLT programmers have tried at some point to profile a stylesheet by
inserting the current-dateTime() or current-time() function before
and after an operation of interest, only to be surprised when the operation appears to take
no time at all! This happens because there is no promise in the XSLT processing model that
functions will be executed in the order in which they appear in the XSLT code, and for that
reason XSLT was designed to follow a functional programming paradigm here:
In XSLT 2.0 it is defined that multiple calls on current-dateTime() and
the other two [time and date] functions will return the same result every time they are
called within a single transformation. This means that you can’t call the function at
the beginning and end of the transformation to measure the elapsed time. The reason for
this rule is that XSLT is rather purist about being a strictly functional language, and
in a strictly functional language, calling the same function twice with the same
arguments always returns the same result. This property makes life much simpler for
optimizers. [, p. 738]
The functional nature of XSLT means that although an XSLT developer might write code
that is amenable to pipelined execution, how the pipeline is executed is at least partially
dependent on processor-internal code to which the XSLT developer has no direct access.
Michael Kay’s alliteratively titled You pull, I’ll push: on the polarity of
pipelines, cited above, continues that:
For performance, the stages of a pipeline should communicate by means of streams of
events. Traditionally, this can be done either by writing components with pull polarity
(the consumer calls the supplier when data is required), or by components with push
polarity (the supplier calls the consumer when data is available). Performance problems arise when components with different polarity are mixed within
the same pipeline. [, emphasis added]
As important as this polarity is for performance, to the extent that it is determined
within the XSLT processor, rather than within the XSLT code, it is at least partially
outside the control of the XSLT developer. For that reason, in this XSLT-developer-focused
report, pipelining within the XSLT processor will not be discussed further.2. Pipelining between and across XSLT stylesheetsPipelining between or across XSLT stylesheets is much more under the control of the XSLT
developer. A complex transformation might be broken down into simpler stages, where each
stage is implemented in a separate XSLT stylesheet, and interim results are written to disk
by one process and then read from disk by another. This strategy may make the individual
stylesheets easier to read because each may implement only a small and coherent aspect of
the overall transformation. This approach also makes it easy to examine the intermediate
output during development, and to develop each step of the pipeline without distraction by
what precedes or follows. But disk I/O is much slower than in-memory access, and the cost of
writing the intermediate files is likely to compromise the efficiency of the transformation.
Saving intermediate stages to disk and processing them as separate transformations also
entails a cost in re-launching the transformation engine; for example, running the Java
implementation of Saxon has to create a new Java Virtual Machine (JVM) and then start a new
Saxon process. Whether JVM or I/O or process initialization or other overhead deserves
consideration is likely to depend on the volume of data; augmenting the execution time by an
order of magnitude may not matter to the developer if the overall time is still subjectively
brief.The overhead of disk I/O can be avoided by using pipelining utilities, whether broader
purpose (such as Apache Cocoon []) or more narrowly XML specific
(XProc []). Pipelining tools avoid the overhead of disk I/O, although
whether they build complete intermediate in-memory structures or decompose (and perhaps
parallelize or stream) the transformation in a functional manner is implementation-specific.
These technologies are well know to the Balisage audience: XProc has been the subject of six
presentations by six different authors or teams of authors at Balisage alone (see
https://www.balisage.net/Proceedings/topics/XProc.html), and Cocoon has also
made several appearances, including as part of the toolkit used to produce the Balisage
proceedings you are reading now
(https://www.balisage.net/Proceedings/vol17/acknowledgements.html).
Pipelining between or across XSLT stylesheets with technologies like Cocoon or XProc does
not prohibit writing intermediate output to disk for examination during debugging, but it
does not require it, and XProc in particular incorporates additional features that may be
useful in complex transformations of XML documents (see
https://www.w3.org/TR/xproc/#pipeline-concepts). The focus of the current
report is on micropipelining, that is, on pipelines implemented within a single XSLT
stylesheet, and we will not explore XProc or other technologies for pipelining between or
across stylesheets, except to note that whether to implement a pipeline within a single
stylesheet or between or across stylesheets is a separate consideration that lies outside
the within-stylesheet scope of the present report.3. Micropipelining: pipelining within a single XSLT stylesheetAs mentioned above, we use the term micropipelining to refer to pipelining within a
single XSLT stylesheet. Micropipelines implemented within a single stylesheet can always be
refactored and spread across multiple stylesheets, but in situations where the steps are
conceptually part of what a human would consider a single operation, and where intermediate
stages are unlikely to be of interest except for debugging purposes during development,
combining them within a single XSLT stylesheet that performs that single operation may make
for easier implementation and maintenance.The application that motivated this exploration is connected to our use of XML
technologies to identify and analyze meter and rhyme in Russian verse. We presented a report
on the identification of meter at Balisage 2015 [], and in
order also to identify rhyme we need to convert normal Russian orthography to a phonetic
representation. As long as the place of stress is known,The place of stress in Russian, as in English, is not normally indicated
orthographically, and because stress can fall on different syllables in different words,
including different inflected forms of the same lexeme, it is not predictable without
linguistic (lexical and morphological) knowledge. The identification of meter depends on
knowing the place of stress, and describes how stress is
added automatically to enrich original unstressed input in natural Russian orthography
before performing metrical analysis. the phonetic conversion, although complex, turns out to be easier in Russian
than in English because Russian orthography, although not phonetic, is closer to the
phonetics of the language than is the case with English. An early version of the algorithm we developed for this purpose was described in
. The XSLT implementation of the enhanced version
currently used in the project is available at
https://github.com/djbpitt/poetry/blob/master/rhyme.xsl.It is common in historical linguistics to represent language change in terms of the
application of ordered rules as a pipeline that accepts an initial form as input, pipes it
(that is, passes it along) through a sequence of individual sound-change rules, and
generates transformed output. Constraints on the order in which rules apply are
traditionally described using alimentary and sanguinary metaphors, and those metaphors
emerge from the linguistic reality that sound change may happen at a certain point in
history and then cease to operate, that is, cease to be a productive process:Feeding order. The output of one rule creates new
input for another rule. For example, if in Rule A x → y
(read as x is transformed into y) and in Rule B y → z, Rule A feeds Rule
B by creating new input for it.Bleeding order. The output of one rule modifies a
form that would otherwise have served as input for another rule. For example, if in Rule
A z → q and in Rule B x → y / _
z (read as x is transformed into y when it precedes z), Rule A
is said to bleed Rule B because it reduces the application of Rule B by the destroying
the environment it requires.Counter-feeding order. The output of one rule does
not create new input for another rule, but were the order of the two rules reversed,
they would observe a feeding relationship. For example, if in Rule A y → z and in Rule B x → y, the
two stand in a counter-feeding relationship, since Rule B creates what would have been
new input into Rule A, except that it fails to do so because Rule A operates first, and
is no longer active when the potential new input is created by Rule B.Counter-bleeding order. The output of one rule does
not bleed forms that might otherwise have served as input for another rule, but were the
order of the two rules reversed, they would observe a bleeding relationship. For
example, if in Rule A x → y / _ z and in Rule B
z → q, the rules observe a counter-bleeding order
because were the order reversed, Rule B would reduce the applicability of Rule A by
destroying the context it requires.Note that feeding and bleeding are not opposites, and that some rules may not be
critically ordered with respect to others because their inputs, outputs, and conditioning
environments do not intersect with one another. These types of relationships among rules are
important because of their role in directing or constraining linguistic change (see, for
example, ). In the context of the present report, these
relationships also provide a vocabulary for discussing the interaction of XPath functions
and XSLT templates when those are used to implement transformations that model linguistic
processes. Feeding, bleeding, and their opposites are appropriate for modeling historical
language change because history imposes a real chronological order, but a similar model that
uses similar terminology may also describe other linguistic phenomena, including the
synchronic derivation of surface forms from more abstract underlying forms. In the present
instance, we employ this framework to describe, as a pipeline of individual rules, the XSLT
implementation of a transformation of standard Russian orthography (which is not phonetic)
to a phonetic representation that can serve as the basis for identifying rhyme in
verse.Micropipelining: a closer lookBackground and contextWhen we began implementing in XSLT the machine-assisted identification and analysis of
Russian rhyme, we were aware that pipelining was required because our algorithm for
converting natural Russian orthography (enhanced by information about stress, which is not
part of normal Russian writing) was conceptualized as a series of rules of insertion,
deletion, and replacement, the order of which was constrained by feeding and bleeding
relationships. As we have been crafting bespoke solutions to these types of pipelining tasks
over the years, we have wished for a tutorial that would describe the options, say something
useful about the advantages and disadvantages of each, point out where they behave similarly
or differently from one another, and perhaps make recommendations for deciding among them in
the field. The principal goal of the present report is to fill the gap by describing and
exploring the alternatives that we considered, including the one we eventually adopted for
the Russian rhyme task.Convenience variablesAs the name implies, convenience variables are
variables created for the benefit of the programmer, rather than because they necessarily
improve the efficiency or operation of the program. The purpose of convenience variables is
that typically they improve the legibility of the code, which makes it easier to develop
(one piece at a time) and maintain.Variables that improve the efficiency of the transformation process itself are not (or:
are not entirely) convenience variables (although they may also be convenient). For example,
if we need to calculate the same result more than once in an XSLT stylesheet, performing the
calculation only once and saving it in a variable that we can then reuse may reduce the
amount of computation and improve the performance of the transformation. On the other hand,
when we calculate a value that we will use only once, save it in a variable, and then use
the variable, not only is there no similar saving, but we actually risk reducing the
efficiency of the transformation by imposing the overhead needed to create the variable as
an intermediate structure. Within XSLT, this is especially the case when variables are
declared without explicit typing using the @as attribute, the omission of which
results in the expensive creation of a temporary in-memory document (that we may not need)
to hold the value.If a sequence constructor is used with no as attribute, a
temporary document is constructed. This is done by creating a new document node and
using the value of the result sequence to form the children of the document
node. [, p. 503] There are times when constructing a
temporary tree is nonetheless appropriate; for an example, see , p. 510ff.As an example of a convenience variable, consider an XSLT transformation to SVG where we
need to calculate the x and y coordinates of an object and then
plot it. If those calculations are complex, packing them into the declaration of the object
may compromise its transparency. Compare the version with convenience variables:<xsl:variable name="xPos" as="xs:double" select="position() * $xScale"/>
<xsl:variable name="yPos" as="xs:double" select=". * $yScale"/>
<rect x={$xPos} y="-{$yPos}" width="{$width}" height="{$yPos}"/>to one without:<rect x={position() * $xScale} y="-{. * $yScale}" width="{$width}" height="{. * $yScale}"/>The more complex the calculation, the more the legibility may be improved through the
use of convenience variables.In this case, the reuse of the same value in the @y and
@height attributes means that the variable is not merely a matter of
convenience, since that reuse also means that the same expression does not have to be
evaluated more than once.Convenience variables may play a role in pipelining when the result of one function is
passed to another. For example,<xsl:value-of select="djb:func2(djb:func1($x))"/>may be rewritten as:<xsl:variable name="y" select="djb:func1($x)"/>
<xsl:value-of select="djb:func2($y)">It is common, while developing and debugging pipelines, to insert, delete, alter, or
rearrange the individual steps in order to fine-tune their interaction. Removing a step, for
example, with nested functions means finding the function name, the open parenthesis (which
is always immediately after the function name), and the closing parenthesis. This last step
may be annoyingly difficult if the nesting is deep and if the functions have multiple
parameters; if they have only one, the closing parentheses are all bunched together at the
end, and it doesn’t matter which one we remove, but if they are not all consecutive, the
developer has to find exactly the right one. With convenience variables, removing a step
requires not only removing the step itself, but also adjusting a variable reference
elsewhere. Consider first the nested version:<xsl:sequence select="djb:func3(djb:func2(djb:funct1($input)))"/>If we wish to remove func2() from the pipeline, we remove the string
“func2(” and any one of the closing parentheses. Now consider the version with convenience
variables:<xsl:variable name="first" select="djb:func1($input)"/>
<xsl:variable name="second" select="djb:func2($first)"/>
<xsl:sequence select="djb:func3($second)"/>If we remove the middle line in the interest of removing djb:func2() from
the pipeline, we can find the entire function easily, and we don’t have to look separately
for the closing parenthesis. But our modifications are not limited to just one line; we must
also change the variable reference in the last line from $second to
$first. This is not an unmanageable burden, but it is an extra step, and an
extra opportunity for error.The difference between nesting functions directly inside one another and nesting inside
one function a variable containing the result of evaluating another function is discussed in
greater detail below. The distinction between convenience variables and those that
contribute to the quality of the code in ways that are not limited to human legibility is
important because the costs and benefits of creating variables to hold intermediate values
must take into consideration whether using the variable conveys advantages beyond
convenience, that is, beyond the legibility that supports ease of maintenance.Nested functionsOne of the most natural types of micropipelining involves the nesting of functions, as
in the example earlier that combines normalize-space() and
upper-case() by nesting one inside the other. This simple nested structure
involves only two functions, each of which accepts only one parameter, but as the number of
functions and the number of parameters grows, it is easy to lose one’s place in a thicket of
parentheses. Furthermore, the normalize-space() and upper-case()
functions do not participate in feeding or bleeding relationships, which means that once we
have determined that we need to use them, there is no risk of nesting them in a way that
produces the wrong output, and that therefore requires reordering. That matters because part
of the challenge of multiple nested functions, and of the multiple pairs of parentheses that
come with them, involves legibility during development. The fact that here the nesting order
doesn’t matter obviates a potential need to transplant a contained function into a
containing position, and vice versa, because we inadvertently nested them incorrectly the
first time.Consider, for example, a situation using the Wordhoard TEI edition of Hamlet,http://www.wwp.northeastern.edu/outreach/seminars/xslt_2012-03/demos/xslt_intro/ham.xml where our task is to construct a comma-separated list of distinct speakers (from
the <speaker> element children of the <sp> speech
elements) in Act 4. We can retrieve a deduplicated list of all <speaker>
values with a single XPath function wrapped around a single path expression:
distinct-values(//body/div[4]//speaker). The catch is that this list includes
not only Rosencrantz and Guildenstern, both of whom speak
alone, but also Rosencrantz and Guildenstern (as the string content of a
single <speaker> element) where they speak in unison, and our result
should recognize that Rosencrantz and Guildenstern is not a unique speaker.
It is possible to perform the operation in a single line using nested functions:<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs" version="2.0"
xpath-default-namespace="http://www.tei-c.org/ns/1.0">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:value-of
select="distinct-values(tokenize(string-join(//body/div[4]//speaker/replace(., ' and ', ' '), ' '), '\s+'))"
/>
</xsl:template>
</xsl:stylesheet>Here we pipe the output of the innermost replace() function into
string-join(), and then into tokenize(), and then into
distinct-values().Replacing and with a space is a brittle strategy because it would
break on multiple-speaker values that were joined in different ways, such as
<speaker>Curly, Larry, and Moe</speaker>. For that matter,
tokenizing on white space is brittle because it would break on single persons with names
like Player King or First Player. The point of the example
is not to find an optimal solution to the multiple-speaker problem, but to provide an
example of pipelining functions. The order of functions has some flexibility, but the following constraints
apply:string-join() feeds tokenize()tokenize() feeds distinct-values()replace() may occur at any time before tokenize() (that
is, before or after string-join()), or we can eliminate
replace() entirely and instead use a predicate to exclude the eventual
token with the value and.In favor of this strategy, nesting is the most direct possible representation of the
pipeline relationship, since the input parameters to each function are included between its
parentheses and each function statement evaluates to its output. Against this strategy, the
depth of the nesting and the fact that several of the functions require multiple parameters
greatly compromise the legibility, and thus the development and subsequent maintenance, of
the code. Writing this code is messy and error-prone, and, recalling that code is read much
more often than it is written, nobody would regard it as easy to read. We might try to
improve the legibility through indentation:<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs" version="2.0"
xpath-default-namespace="http://www.tei-c.org/ns/1.0">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:value-of
select="distinct-values(
tokenize(
string-join(
//body/div[4]//speaker/replace(
.,
' and ',
' '
),
' '),
'\s+')
)"
/>
</xsl:template>
</xsl:stylesheet>although most developers would not consider this very legible either.Furthermore, it does not emerge unchanged when the stylesheet is pretty-printed in
<oXygen/>, and probably in at least some other editors. There is good reason that the functional programming language Lisp, known for
its deeply nested parenthesized constructions, has been glossed humorously as Lots of
Irritating Silly Parentheses and in other equivalent ways.See https://xkcd.com/297/. The name is actually derived from
LISt Processor.We can improve the legibility by rewriting the code to use convenience variables:<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs" version="2.0"
xpath-default-namespace="http://www.tei-c.org/ns/1.0">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:variable name="first" as="xs:string+"
select="//body/div[4]//speaker/replace(., ' and ', ' ')"/>
<xsl:variable name="second" as="xs:string" select="string-join($first, ' ')"/>
<xsl:variable name="third" as="xs:string+" select="tokenize($second, '\s+')"/>
<xsl:variable name="fourth" as="xs:string+" select="distinct-values($third)"/>
<xsl:sequence select="$fourth"/>
</xsl:template>
</xsl:stylesheet>This improvement comes at a price, though. Much as the complexity of the compound
operations makes it easy for a developer inadvertently to nest the functions incorrectly in
the nested syntax, it is also easy to order them incorrectly in the version that employs
convenience variables. A debugging fix that is simple to describe (revise the order
in which the functions are applied, which might incorrectly suggest that this
could be mapped onto rearrange the lines in which the variables are created)
turns out to be error prone because each line refers to two variable names, the one it is
creating with its @name attribute and the one that it is using to determine the
value, which is part of the value of its @select attribute. Changes in the
order in which the functions are executed thus becomes far more complex and error-prone than
dragging the lines into a different order. One way to conceptualize the challenge is that
the variable declarations exist in a feeding relationship not just by virtue of the order in
which they are declared (see immediately below about the XSLT prohibition against using a
variable before it has been declared), but also by virtue of the way each declaration uses a
value created by a different declaration.Given that variables in XSLT cannot be redeclared, one might wonder why XSLT requires
that sibling variable declarations within a stylesheet be arranged in conformity with the
order in which they are referenced. That is, one might wonder why XSLT prohibits ordering
sibling <xsl:variable> elements in arbitrary order, much as it is
possible to apply templates within one template that references another regardless of the
order in which the two are listed in the XSLT code. Whatever the reason for this
restriction, it means that rearranging the order in which the functions are applied involves
not only changing the references among them, but also modifying the order of the
<xsl:variable> elements so that the order of declaration follows the
order in which the values are used.Function chainingOne reason that deeply nested functions with multiple parameters are difficult to read
is that code is written from left to right, which tacitly encourages humans to read it from
left to right, but the feeding order (and thus the execution) of nested functions flows from
the inside out. This means that the function names flow from right to left (but not starting
the far right of the line, which is filled with closing parentheses and possibly function
arguments), yet because functions may have multiple parameters, it is necessary to look to
both the left and the right when moving out to the next level. In languages that support
object-oriented programming, the dotted method notation
avoids this clash because both the notation and the execution flow from left to right. For
example, the following Python example returns the same results as the XSLT above:To keep the focus on the pipelining code, the speaker names, as they appear in the
act, are supplied as a Python list. In a real application they would be extracted from
the XML using one of the several Python XML libraries.names = ['King','Gertrude','King','Gertrude','King','Gertrude','King','Hamlet','Rosencrantz and Guildenstern',\
'Hamlet','Rosencrantz','Hamlet','Rosencrantz','Hamlet','Rosencrantz','Hamlet','Rosencrantz','Hamlet',\
'Rosencrantz','Hamlet','Rosencrantz','Hamlet','Guildenstern','Hamlet','King','Rosencrantz','King',\
'Rosencrantz','King','Rosencrantz','King','Hamlet','King','Hamlet','King','Hamlet','King','Hamlet','King',\
'Hamlet','King','Hamlet','King','Hamlet','King','Hamlet','King','Hamlet','King','Hamlet','King','Fortinbras',\
'Captain','Fortinbras','Hamlet','Captain','Hamlet','Captain','Hamlet','Captain','Hamlet','Captain','Hamlet',\
'Captain','Hamlet','Captain','Rosencrantz','Hamlet','Gertrude','Gentleman','Gertrude','Gentleman','Horatio',\
'Gertrude','Ophelia','Gertrude','Ophelia','Gertrude','Ophelia','Gertrude','Ophelia','Gertrude','Ophelia',\
'King','Ophelia','King','Ophelia','King','Ophelia','King','Ophelia','King','Gertrude','King','Gentleman',\
'Gertrude','King','Laertes','Danes','Laertes','Danes','Laertes','Gertrude','Laertes','King','Laertes','King',\
'Gertrude','King','Laertes','King','Laertes','King','Laertes','King','Laertes','King','Danes','Laertes',\
'Ophelia','Laertes','Ophelia','Laertes','Ophelia','Laertes','Ophelia','Laertes','Ophelia','Laertes','King',\
'Laertes','King','Horatio','Servant','Horatio','Sailor','Horatio','Sailor','Horatio','King','Laertes','King',\
'Laertes','King','Messenger','King','Messenger','King','Laertes','King','Laertes','King','Laertes','King',\
'Laertes','King','Laertes','King','Laertes','King','Laertes','King','Laertes','King','Laertes','King',\
'Laertes','King','Laertes','King','Laertes','King','Gertrude','Laertes','Gertrude','Laertes','Gertrude',\
'Laertes','King']
result = set(' '.join(names).replace(' and ', ' ').split())
print(result)The set() function is used to remove the duplicates and its input is
contained in the parentheses that immediately follow the function name, but otherwise the
order of execution and the syntactic expression both flow from left to right using dotted
method notation: from join() (equivalent to XSLT string-join())
into replace() (also called replace() in XSLT) into
split() (equivalent to tokenize() in XSLT).The syntax of the join() method appears to make some programmers
feel uncomfortable. As is explained in ,
join() is a method not of the list (the Python equivalent of an XSLT
sequence, in this case the sequence of speaker names), but of the separator string (in
this example, a space character), which might suggest that the code is operating on the
separator character and using the list to perform the operation. The reason this may
feel unnatural is that what is being joined is the list items, and a human might think
that we perform joining to modify a list, and not to modify a separator string. That
perception is buttressed by the Python split() function, where
'Curly,Larry,Moe'.split(',') splits the Three Stooges at the comma. That
is, split() is a method of the complex object we seek to modify with the
help of a simple string, but join() is a method of the simple string that
we use (at least from a human perspective) to modify a complex list. The perceived
assymetry of join() with split() invites confusion, especially
for an XSLT programmer for whom first or leftmost feels
like the locus of an operation because that’s the way the signatures of the XPath
contains(), ends-with(), tokenize(), and
(especially) string-join() functions operate.While dotted method notation is natural within an object-oriented model, the notation is
also not inherently incompatible with functional programming paradigms. For example,
pipelines with dot notation when working within a functional model in Python are supported
by the PyFunctional package [], which makes creating
data pipelines easy by using chained functional operators, e.g.:
seq(1, 2, 3).map(lambda x: x * 2).reduce(lambda x, y: x + y)XPath 3.1 introduced the “arrow operator” (=>) for this purpose, so that:
tokenize((normalize-unicode(upper-case($string))),"\s+")
can be rewritten as
$string => upper-case() => normalize-unicode() => tokenize("\s+")[, §3.16] One limitation to this XPath operator is that the value that is
passed can be used only as the first argument to the next
function, which means that traditional methods must be used when the value supplies any
non-initial parameter.Variables, templates, and functionsThe description of nested functions above was based on functions from the core XPath
library, but much more complex operations can be embedded in user-defined functions, which
can then be nested on a single line, similarly to the nesting in the basic XPath library
example. Indeed, a user-defined function can transform an entire input document into an
entire intermediate document, optionally assigning the value of the function to a variable,
and then apply another function to the intermediate document. If we create two (imaginary)
user-defined functions, one that translates a document from Russian into English and another
that tokenizes the English translation and returns a report (perhaps an HTML table) of the
distribution of Scrabble values of the words, the starting point of the transformation might
look like:<xsl:template match="/">
<xsl:variable name="intermediate" as="document-node()" select="djb:translate_into_english(.)"/>
<xsl:sequence select="djb:create_scrabble_values($intermediate)"/>
</xsl:template>As was the case in the discussion of core XPath library functions above,
$intermediate is a convenience variable, and the functions can,
alternatively, be nested directly:<xsl:template match="/">
<xsl:sequence select="djb:create_scrabble_values(djb:translate_into_english(.))"/>
</xsl:template>The two user-defined functions might call other functions, apply templates, or otherwise
perform operations of arbitrary complexity.Insofar as functions and templates have similar properties (and are compiled to the same
type of operations within the processor), instead of creating user-defined functions, it is
also possible to use convenience variables with templates:<xsl:template match="/">
<xsl:variable name="intermediate" as="document-node()">
<xsl:apply-templates select="."/>
</xsl:variable>
<xsl:apply-templates select="$intermediate"/>
</xsl:template>Dimitre Novatchev provided real examples of this type of chaining in XSLT 1.0 and 2.0 in
a StackOverflow response in 2010. []The use of intermediate variables with either functions or templates neither requires
nor precludes serializing all or part of the intermediate results with
<xsl:result-document> or <xsl:message> if that is
needed for development or for production. But the use of templates with intermediate
variables for pipelining does exhibit one of the principal limitations discussed above of
using functions with variables for pipelining: editing the pipeline requires, beyond the
insertion, deletion, or rearrangement of lines, changing the value of the
@select attribute when the intermediate variable is used. For example, in the
code above, if between the translation into English and the creation of the Scrabble value
output we want to pluralize all words that are capable of being pluralized, not only do we
need to create another variable, but we also need to change the value of the
@select attribute of an <xsl:apply-templates> elements in
a different line (modifications associated with the insertion are highlighted):<xsl:template match="/">
<xsl:variable name="intermediate" as="document-node()">
<xsl:apply-templates select="."/>
</xsl:variable>
<xsl:variable name="pluralized" as="document-node()">
<xsl:apply-templates select="$intermediate"/>
</xsl:variable>
<xsl:apply-templates select="$pluralized"/>
</xsl:template>Inserting or deleting a step requires, in addition to the insertion or deletion itself,
one adjustment to the value of a @select attribute in a different line.
Rearranging steps requires more adjustments. There is also no easy way of changing the
location where intermediate results are serialized, whether for development or for
production. And because the use of intermediate variables requires additional physical lines
of code for legibility, it becomes harder to see the entire pipeline as an uninterrupted
sequence. What would be more robust, flexible, and maintainable would be a pattern that has
the following properties:Each step of the pipeline fits on a single short line of XSLT code, with no
wrappers, no intervening statements, and nothing else that intrudes into a bare-bones
list of steps.Each step of the pipeline is entirely self-contained. Inserting, deleting, or
rearranging steps does not require any modification to any code pertaining to other
steps. This self-containment has two subcomponents: The code that specifies the order of operations in the pipeline does not
mention or pass any values explicitly. This means that changing the sequence of
steps does not require changing variables in the arguments.The code blocks that perform the pipeline operations all expect the same
parameters and return the same result types. This means that changing the sequence
of steps does not require any changes within any of the steps.Intermediate results can be dumped at any time (or at more than one time) as a step
in the pipeline, whether for development or for production purposes. Dumping is a part
of the pipeline that happens between transformative operations on the data; it is not
part of any individual transformative operation within the pipeline.Optional: Steps can be added to the list before the code that performs them has been
created, without requiring any adjustment to the parameters.This can be done with functions or templates by creating stubs that pass their
input along to the next step in the pipeline, but it would be more convenient to
have that happen by default, without separate explicit stub code for each
not-yet-fully-specified step.Nested template callsAs was noted above, functions and templates have similar capabilities, but because the
nesting syntax is different, performing the same complex operation by using nested template
calls instead of nested functions addresses some of the preceding desiderata. In the example
below, we pass the original input, saved in the $russian-text variable, as a
parameter into a translate-into-english template, the return value of which
becomes the value of an $english-text parameter, which we then pass into a
create-scrabble-values template. That last template returns our
results:<xsl:variable name="scrabble_values">
<xsl:call-template name="create-scrabble-values">
<xsl:with-param name="english-text" as="document-node()">
<xsl:call-template name="translate-into-english">
<xsl:with-param name="input" select="$russian-text" as="document-node()"/>
</xsl:call-template>
</xsl:with-param>
</xsl:call-template>
</xsl:variable>This method addresses some, but not all, of the desiderata listed above. Its primary
advantage is that it keeps the parameters adjacent to the template calls that use them,
which is an improvement over scattering them, as often happens with nested functions. But
insertion, deletion, and especially rearrangement are not one-line operations, and neither
is dumping interim results during development, and the hierarchy becomes harder to follow as
the number of steps, and therefore the degree of indentation in pretty-printed output,
increases. Below we examine how the visitor pattern addresses these concerns.The visitor patternThe visitor pattern in an XSLT contextThe visitor pattern is usually defined and
illustrated in an object-oriented context that may be difficult to map onto the XSLT
programming paradigm.I am grateful to Wendell Piez for introducing me to this pattern, and for his many
generous contributions, in the course of a rich correspondence in spring 2017, to my
elaboration of the ideas described here. Piez identifies this strategy as an instance
of the visitor pattern in . James Sugrue provides some real-world analogies:
A real world analogy always helps with the understanding of a design pattern. One
example I have seen for the Visitor pattern in action is a taxi example, where the
customer […] orders a taxi, which arrives at his door. Once the person sits in [it], the
visiting taxi is in control of the transport for that person.Shopping in the supermarket is another common example, where the shopping cart is
your set of elements. When you get to the checkout, the cashier acts as a visitor,
taking the disparate set of elements (your shopping), some with prices and others that
need to be weighed, in order to provide you with a total.…In summary, if you want to decouple some logical code from the elements that you’re
using as input, visitor is probably the best pattern for the job. []
Because XSLT is not object-oriented, logical operations on objects (e.g., nodes of
various types, atomic values) are naturally decoupled from the objects themselves. Michael
Kay, writing about the simulation of higher-order functions in XSLT 2.0, situates the
visitor pattern in a functional context:
[Y]ou might write a function that does a depth-first traversal of a tree and
processes each node in the tree using a function that is supplied as an argument. This
makes it possible to use one tree-walking routine to achieve many different effects (you
might recognize this as the Visitor Pattern).In XSLT (and XPath), functions are not first-class objects, so they cannot be passed
as arguments to other functions.[Added:] This statement is based on XPath 2.0 and XSLT 2.0. Functions may be
used as first-class objects in more recent versions of the standards. However, the template-matching capability of
<xsl:apply-templates> can be used to simulate higher order
functions.The use of <xsl:apply-templates> has been developed to a fine art
by Dimitre Notatchev in his FXSL library (http://fxsl.sourceforge.net/),
which provides many examples of the technique. []
The sourceThe full XSLT stylesheet used to identify rhyme in Russian verse, which is the source
of the excerpts presented here, is available at
https://raw.githubusercontent.com/djbpitt/poetry/master/rhyme.xsl. It can
be run against itself, in which case it processes default input that is embedded in the
stylesheet. Additional documentation, and the ancillary XSLT stylesheets that it imports,
are available at the project’s GitHub repo at
https://github.com/djbpitt/poetry.Defining the order of operationsThe transformation is driven by a sequence of operations, each of which is represented
by an empty element in a custom namespace:<xsl:variable name="operations" as="element()+">
<djb:prepareWords/>
<djb:lexical/>
<djb:proclitics/>
<djb:enclitics/>
<djb:tsa/>
<djb:palatalize/>
<djb:jot/>
<djb:romanize/>
<djb:finalDevoice/>
<djb:regressiveDevoice/>
<djb:regressiveVoice/>
<djb:palatalAssimilation/>
<djb:consonantCleanup/>
<djb:vowelReduction/>
<!--
diagnosticOutput writes the string to stderr as an <xsl:message>
move it around or comment it out, as appropriate
-->
<!--<djb:diagnosticOutput/>-->
<djb:stripSpaces/>
<djb:rhymeString/>
</xsl:variable>Note that <djb:diagnosticOutput> can be inserted wherever the
developer wishes to examine interim results, including in multiple locations.Starting the transformationThe transformation is run once for each <line> element in a poem.
It operates through sibling recursion, applying templates to the first member of the
$operations variable defined in the example above and passing as parameters
the current line ($input) and the sequence of remaining parameters
(remove($operations, 1)):<xsl:apply-templates select="$operations[1]" mode="operate">
<xsl:with-param name="input" select="."/>
<xsl:with-param name="remaining" select="remove($operations, 1)"/>
</xsl:apply-templates>Note that we apply templates not to the input (the <line> element),
but to the operation, and we pass along to the operation as parameters the input and the
sequence of remaining operations. None of the operations refers to any other explicitly,
and none refers to any specific parameters, which means that inserting, deleting, and
rearranging operations involves changing only the list of operations itself. Because the
sibling recursion strategy will walk through the operations in order (see below), one
operation feeds the next without having to know its identity.Template matchingWhen we apply templates to an element in the sequence of operations, a single template
matches all elements in the custom namespace, and it dispatches the processing to the
appropriate location within an <xsl:choose> element. That single
template begins as follows:<xsl:template match="djb:*" mode="operate">
<xsl:param name="input" as="xs:string" required="yes"/>
<xsl:param name="remaining" as="element()*"/>
<xsl:variable name="results" as="xs:string*">
<xsl:choose>
<!-- ******************************************* -->
<!-- djb:lexical: Idiosyncrasies in pronunciation (including -ogo) -->
<!-- ******************************************* -->
<xsl:when test="self::djb:lexical">
<xsl:sequence select="djb:lexical($input)"/>
</xsl:when>
<!-- ******************************************* -->
<!-- djb:proclitics: Merge proclitics with bases -->
<!-- ******************************************* -->
<xsl:when test="self::djb:proclitics">
<xsl:sequence select="djb:proclitic($input, 1)"/>
</xsl:when>
<!-- ******************************************* -->
<!-- djb:enclitics: Merge enclitics with bases -->
<!-- ******************************************* -->
<xsl:when test="self::djb:enclitics">
<xsl:sequence select="djb:enclitic($input, 1)"/>
</xsl:when>
<!-- ******************************************* -->
<!-- djb:tsa: Convert ть?ся$ to тса -->
<!-- ******************************************* -->
<xsl:when test="self::djb:tsa">
<xsl:sequence select="replace($input, 'ться$', 'тса')"/>
</xsl:when>
<!-- ******************************************* -->
<!-- djb:palatalize: Capitalize all palatalized consonants (including unpaired) -->
<!-- ******************************************* -->
<xsl:when test="self::djb:palatalize">
<xsl:variable name="result1" as="xs:string+">
<xsl:analyze-string select="$input"
regex="([бвгдзклмнпрстфх])([яеиёюЯЕИЁЮь])">
<xsl:matching-substring>
<xsl:sequence
select="concat(upper-case(regex-group(1)), regex-group(2))"/>
</xsl:matching-substring>
<xsl:non-matching-substring>
<xsl:sequence select="."/>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:variable>
<xsl:sequence select="translate(string-join($result1, ''), 'чйщ', 'ЧЙЩ')"/>
</xsl:when>The pipline operations are the only elements in the djb: namespace, so
this template matches all of them, and nothing else. It accepts the two parameters and
then creates a variable called $results, inside of which it uses
<xsl:choose> to identify the operation that is being processed. Some
of these operations (e.g., <djb:lexical>,
<djb:proclitics>, <djb:enclitics>) call
user-defined functions (in a separate function library) to do their work. Others use core
XPath library functions (e.g., <djb:tsa>). Others are longer and
perform more complex processing (e.g., <djb:palatalize>). Each
<xsl:when> element, which processes a specific operation, returns a
sequence as the value of the (interim) $results variable that wraps the
<xsl:choose> element. How this variable contributes to the result of
the surrounding template is described below.Unmatched operationsAt different moments in the development we implemented different treatments of
operations that are not matched in the single processing template. One option is to let
operations listed in the sequence but not associated with processing code pass their
logical input, unchanged, along to the next step in the pipeline. This is the simplest
form of a stub; the name of the operation is present, but the default behavior if the
operation is not defined is to pass:<!-- ******************************************* -->
<!-- Default to continue processing if operate step has no code-->
<!-- ******************************************* -->
<xsl:otherwise>
<xsl:apply-templates select="$remaining[1]" mode="visit">
<xsl:with-param name="input" select="$input"/>
<xsl:with-param name="remaining" select="remove($remaining, 1)"/>
</xsl:apply-templates>
</xsl:otherwise>The <xsl:otherwise> default applies templates to the next operation
in the sequence, passing along the input it received unchanged.Alternatively, when the development is complete, we might prefer to raise a fatal
error if the sequence of operations includes any that have not been defined:<!-- ******************************************* -->
<!-- Default to terminate if operate step has no code-->
<!-- ******************************************* -->
<xsl:otherwise>
<xsl:message terminate="yes">Unmatched visitor element <xsl:value-of
select="local-name()"/></xsl:message>
</xsl:otherwise>One or the other of these two defaults should be commented out at all times.Processing the next operationExcept for the <xsl:otherwise> option that raises an error and
terminates, each operation creates a sequence that becomes the value of the
$results variable that wraps the <xsl:choose> element.
In some cases the value is a single string; in others it is a sequence of strings (thus
the plural name).The first operation (<djb:prepareWords>) returns a sequence of
elements, rather than a string or sequence of strings, and is handled separately in a
template that matches the operation’s fully qualified name explicitly, outside the
template that matches djb:*. The more explicit @match value
has a higher priority than the wildcard element match, and therefore supervenes when
needed. After the $results variable has been constructed, it is used to
construct a $result (singular), typed as a single string. The template then
passes that value to the next operation, using sibling recursion:<xsl:variable name="result" as="xs:string" select="string-join($results, '')"/>
<xsl:apply-templates select="$remaining[1]" mode="#current">
<xsl:with-param name="input" select="$result"/>
<xsl:with-param name="remaining" select="remove($remaining, 1)"/>
</xsl:apply-templates>TerminationThe sibling recursion pattern removes each operation after it has been processed and
then passes the sequence of remaining operations along as a parameter. When the last
operation has been processed and removing it from the sequence of operations leaves an
empty sequence, the pipeline has finished, and returns gracefully:<xsl:if test="empty($remaining)">
<xsl:sequence select="$result"/>
</xsl:if>EvaluationThe paradigm described here meets all of the desiderata enumerated earlier and
repeated here:Each step of the pipeline fits on a single short line of XSLT code, with no
wrappers, no intervening statements, and nothing else that intrudes into a bare-bones
list of steps.Each step of the pipeline is entirely self-contained. Inserting, deleting, or
rearranging steps does not require any modification to any code pertaining to other
steps.The list of steps does not mention or pass any parameters explicitly. Parameter
passing is handled within the processing steps (whether functions or templates), and
changing the sequence of steps does not require changing the code that manages
parameters within any of the steps.Intermediate results can be dumped at any time (or at more than one time) as a
step in the pipeline, whether for development or for production purposes. Dumping is a
part of the pipeline that happens between transformative operations on the data; it is
not part of any individual transformative operation within the pipeline.Steps can be added to the list before the code that performs them has been
created, without requiring any adjustment to the parameters.Adapting the visitor pattern to use <xsl:iterate>I am grateful to John Lumley for introducing me to this pattern, and for supplying
the code used here to illustrate it.The introduction of <xsl:iterate> in XSLT 3.0 suggests an alternative
way to implement the visitor pattern.Although not all features of XSLT 3.0 streaming are supported in the open source
Saxon HE product, <xsl:iterate> is. See Consider the following example, which combines the processing pipeline and the
text to be transformed in a single XML document:<example>
<pipeline xmlns="Pipeline">
<tokenize/>
<lower-case/>
<swap s1="fox" s2="dog"/>
<swap s1="quick" s2="lazy"/>
<pipeline>
<replace match="b" replace="z"/>
<replace match="o" replace="a"/>
</pipeline>
<replace match="lazy" replace="eager"/>
<capitalize/>
<reverse/>
<string-join separator="-"/>
</pipeline>
<source>
<div>
<p>The quick BROWN fox jumps OVER the lazy dog</p>
</div>
</source>
</example>As in the original visitor pattern example, the pipeline steps are specified as
elements, in a custom namespace, to which templates are applied.
<pipeline> elements can be nested; this has no effect on the execution
(the nesting is flattened), but it can simplify modular development and inclusion. The
following XSLT stylesheet applies the pipeline steps in order using
xsl:iterate:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:math="http://www.w3.org/2005/xpath-functions/math" exclude-result-prefixes="xs math"
xmlns:p="Pipeline" version="3.0">
<xsl:output method="xml" indent="true"/>
<xsl:mode on-no-match="shallow-copy"/>
<xsl:template match="/">
<xsl:apply-templates select="example/source" mode="#current">
<xsl:with-param name="pipeline" select="example/p:pipeline" tunnel="true"/>
</xsl:apply-templates>
</xsl:template>
<xsl:template match="text()[matches(., '\S')]">
<xsl:param name="pipeline" tunnel="true"/>
<xsl:call-template name="pipeline">
<xsl:with-param name="input" select="." as="xs:string"/>
<xsl:with-param name="pipeline" as="element()" select="$pipeline"/>
</xsl:call-template>
</xsl:template>
<xsl:template name="pipeline" as="item()*">
<xsl:param name="input" as="item()*"/>
<xsl:param name="pipeline" as="element()"/>
<xsl:iterate select="$pipeline/*">
<xsl:param name="input" select="$input"/>
<xsl:on-completion>
<xsl:sequence select="$input"/>
</xsl:on-completion>
<xsl:next-iteration>
<xsl:with-param name="input" as="item()*">
<xsl:apply-templates select=".">
<xsl:with-param name="input" select="$input"/>
</xsl:apply-templates>
</xsl:with-param>
</xsl:next-iteration>
</xsl:iterate>
</xsl:template>
<xsl:template match="p:tokenize" as="xs:string*">
<xsl:param name="input" as="xs:string?"/>
<xsl:variable name="split" select="(@split, '\s+')[1]"/>
<xsl:sequence select="tokenize($input, $split)"/>
</xsl:template>
<xsl:template match="p:string-join" as="xs:string">
<xsl:param name="input" as="item()*"/>
<xsl:variable name="sep" select="(@separator, ' ')[1]"/>
<xsl:sequence select="string-join($input ! string(.), $sep)"/>
</xsl:template>
<xsl:template match="p:lower-case" as="xs:string*">
<xsl:param name="input" as="item()*"/>
<xsl:sequence select="$input ! string(.) ! lower-case(.)"/>
</xsl:template>
<xsl:template match="p:capitalize" as="xs:string*">
<xsl:param name="input" as="item()*"/>
<xsl:sequence
select="$input ! string(.) ! (upper-case(substring(., 1, 1)) || lower-case(substring(., 2)))"
/>
</xsl:template>
<xsl:template match="p:reverse" as="item()*">
<xsl:param name="input" as="item()*"/>
<xsl:sequence select="reverse($input)"/>
</xsl:template>
<xsl:template match="p:replace" as="xs:string*">
<xsl:param name="input" as="item()*"/>
<xsl:variable name="match" select="(@match, ' ')[1]"/>
<xsl:variable name="replace" select="(@replace, ' ')[1]"/>
<xsl:sequence select="$input ! string(.) ! replace(., $match, $replace)"/>
</xsl:template>
<xsl:template match="p:swap" as="xs:string*">
<xsl:param name="input" as="item()*"/>
<xsl:variable name="s1" select="@s1"/>
<xsl:variable name="s2" select="@s2"/>
<xsl:for-each select="$input">
<xsl:variable name="parts" as="xs:string*">
<xsl:analyze-string select="string(.)" regex="{$s1}|{$s2}">
<xsl:matching-substring>
<xsl:sequence
select="
if (. eq $s1) then
$s2
else
$s1"
/>
</xsl:matching-substring>
<xsl:non-matching-substring>
<xsl:sequence select="."/>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:variable>
<xsl:sequence select="string-join($parts)"/>
</xsl:for-each>
</xsl:template>
<xsl:template match="p:pipeline" as="item()*">
<xsl:param name="input" as="item()*"/>
<xsl:call-template name="pipeline">
<xsl:with-param name="input" select="$input"/>
<xsl:with-param name="pipeline" select="."/>
</xsl:call-template>
</xsl:template>
</xsl:stylesheet>The output is:<source>
<div>
<p>Fax-Quick-The-Aver-Jumps-Dag-Zrawn-Eager-The</p>
</div>
</source>The <xsl:iterate> strategy uses iteration over the children of the
<pipeline> element instead of sibling recursion. The example above also
relies on separate templates for every element in the Pipeline namespace
instead of using a single template and <xsl:choose>, a decision that is
independent of the choice between iteration and recursion. The recursion implementation
relies on explicit surgery on the list of pipeline steps with the help of the XPath
remove() function, as well as a somewhat indirect test for the end of the
pipeline, which uses the XPath empty() function to check whether any pipeline
steps remain. The iterative approach may therefore feel more natural, since it uses elements
with more transparent, self-documenting, names, such as
<xsl:next-iteration> and <xsl:on-completion>.Nested calls to the replace() functionsUse cases for nested replace() functionsDevelopers who work with uncommon writing systems (ancient, medieval, minority, etc.)
often need to convert documents from one character coding to another. Sometimes the
writing system is the same but the character set is different, such as when pre-Unicode
legacy documents must be converted to Unicode. Sometimes the conversion is from one
writing system to another, such as when Cyrillic texts are Romanized for library catalog
purposes, or when linguistic field data in domain-specific notation are converted to the
International Phonetic Alphabet (IPA).If the mapping from one character representation to another is one to one, it can be
implemented with a single translate() function. The implementation of
translate() has the advantage of not suffering from feeding or bleeding
conflicts. For example, translate($input,'AB','BA') replaces A with B and B with A simultaneously, so that
old instances of A that are translated to B do not then wind
up feeding the translation of B to A, or vice versa.But mappings that are one to many, many to many, and many to one cannot be managed
with translate() because translate() can only map exactly one
character to exactly one character (or to zero characters). For this reason, mappings that
are not one-to-one are performed using the replace() function, and if there
are many such replacements, the transformation requires a pipeline of
replace() operations (which must be ordered in a way that is sensitive to
feeding and bleeding issues). Even if we bracket ordering constraints, the necessity of
pipelining replace() functions (in our work, often more than two dozen)
raises the legibility and maintainability issues discussed above. We avoid those issues by
implementing a table-driven approach that separates the processing with
replace() from the match and replacement values. The example below is drawn
from a Uyghur-to-IPA mapping project.The mapping tableThe mapping table consists of pairs of match and replacement strings, grouped as those
that are not one to one (<multiple>) and those that are
(<unitary>):The match parameter is a regular expression, but in most of our applications it
has turned out to be a string, that is, a regular expression that uses no
metacharacters or escapes and matches only its entire literal self.<pairs>
<multiple>
<pair>
<orth>ch</orth>
<ipa>ʧʰ</ipa>
</pair>
<pair>
<orth>gh</orth>
<ipa>ɣ</ipa>
</pair>
<pair>
<orth>ng</orth>
<ipa>ŋ</ipa>
</pair>
<pair>
<orth>sh</orth>
<ipa>ʃ</ipa>
</pair>
<pair>
<orth>zh</orth>
<ipa>ʒ</ipa>
</pair>
</multiple>
<unitary>
<pair>
<orth>a</orth>
<ipa>a</ipa>
</pair>
<pair>
<orth>e</orth>
<ipa>ɛ</ipa>
</pair>
<pair>
<orth>b</orth>
<ipa>b</ipa>
</pair>
<pair>
<orth>d</orth>
<ipa>d</ipa>
</pair>
<pair>
<orth>é</orth>
<ipa>e</ipa>
</pair>
<pair>
<orth>f</orth>
<ipa>f</ipa>
</pair>
<pair>
<orth>g</orth>
<ipa>g</ipa>
</pair>
<pair>
<orth>h</orth>
<ipa>h</ipa>
</pair>
<pair>
<orth>x</orth>
<ipa>χ</ipa>
</pair>
<pair>
<orth>i</orth>
<ipa>i</ipa>
</pair>
<pair>
<orth>j</orth>
<ipa>ʤ</ipa>
</pair>
<pair>
<orth>k</orth>
<ipa>k</ipa>
</pair>
<pair>
<orth>q</orth>
<ipa>q</ipa>
</pair>
<pair>
<orth>l</orth>
<ipa>l</ipa>
</pair>
<pair>
<orth>l</orth>
<ipa>ɫ</ipa>
</pair>
<pair>
<orth>m</orth>
<ipa>m</ipa>
</pair>
<pair>
<orth>n</orth>
<ipa>n</ipa>
</pair>
<pair>
<orth>o</orth>
<ipa>o</ipa>
</pair>
<pair>
<orth>ö</orth>
<ipa>ø</ipa>
</pair>
<pair>
<orth>p</orth>
<ipa>pʰ</ipa>
</pair>
<pair>
<orth>r</orth>
<ipa>r</ipa>
</pair>
<pair>
<orth>s</orth>
<ipa>s</ipa>
</pair>
<pair>
<orth>t</orth>
<ipa>tʰ</ipa>
</pair>
<pair>
<orth>u</orth>
<ipa>u</ipa>
</pair>
<pair>
<orth>ü</orth>
<ipa>y</ipa>
</pair>
<pair>
<orth>w</orth>
<ipa>w</ipa>
</pair>
<pair>
<orth>y</orth>
<ipa>j</ipa>
</pair>
<pair>
<orth>z</orth>
<ipa>z</ipa>
</pair>
<pair>
<orth>‘</orth>
<ipa>ˀ</ipa>
</pair>
<pair>
<orth>'</orth>
<ipa>ˀ</ipa>
</pair>
</unitary>
</pairs>Processing mappings that are not one to oneBecause some of the characters in the <multiple> portion of the
mapping table also appear in the <unitary> portion, bleeding
constraints require that mappings that are not one to one be processed before those that
are, and that processing requires the replace() function. This processing is
performed with a single user-defined function:<xsl:function name="djb:multireplace" as="xs:string">
<!-- this function handles all replacements that aren't one-to-one -->
<xsl:param name="instring" as="xs:string"/>
<xsl:param name="pairs" as="element(pair)*"/>
<xsl:choose>
<xsl:when test="empty($pairs)">
<xsl:sequence select="$instring"/>
</xsl:when>
<xsl:otherwise>
<xsl:variable name="result"
select="replace($instring, $pairs[1]/orth, $pairs[1]/ipa)"/>
<xsl:sequence select="djb:multireplace($result, remove($pairs, 1))"/>
</xsl:otherwise>
</xsl:choose>
</xsl:function>When the function is called for the first time on a string, $pairs has
been declared as a sequence of all <pair> elements that are children of
<multiple>. The function (in the <xsl:otherwise> clause)
recruits the first <pair> elements in that sequence and uses its
<orth> child as the match pattern and its <ipa> child as
the replacement. It then removes the <pair> it has just used from the head
of the sequence of <pair> elements and calls itself recursively; when
there are no <pair> elements left to process, the recursion terminates and the last
string value is returned. The separation of the replace() function from the
arguments on which it operates makes it easy for the developer to insert, delete, or
rearrange the pairings in the mapping file.Processing mappings that are one to oneMapping that are one to one can be processed with a single instance of the
translate() function:<xsl:variable name="unitary-orth" select="string-join($unitary//orth,'')" as="xs:string"/>
<xsl:variable name="unitary-ipa" select="string-join($unitary//ipa,'')" as="xs:string"/>
<xsl:sequence select="translate($temp,$unitary-orth,$unitary-ipa)"/>The table-driven design all but eliminates the opportunity for off-by-one errors when
aligning the individual characters in the match argument with those in the replacement
argument.ConclusionThe most natural and direct human understanding of pipelining is that it takes the output
of one operation (function or template) and uses it as the input to the next. The most direct
representations of that processing model in XSLT code encounter challenges to legibility as
developers drown in a sea of parentheses or struggle to update variable names and references
as they add, delete, or rearrange steps in the pipeline. The use cases illustrated in this
report avoid those challenges and struggles by moving the statement of the pipeline into a
sequence of simple steps (operations in the visitor pattern, mapping pairs in the
transliteration example) that are easy to read and to edit because adjustments in a step are
entirely self-contained, and do not require adjustments in others. The feeding and bleeding
logic remains the intellectual responsibility of the developer, but the inevitable mistakes in
this area are more easily remedied during debugging because modifications in the order of
pipeline steps are constrained to a single element. None of the methods described here is new,
but their explicit juxtaposition, comparison, and evaluation in a tutorial context based on
real use cases has clarified much about micropipelining for the author, and, it is hoped, for
the reader, as well.Works citedAdams, Lawrence D. and
David J. Birnbaum. Text technology 7, 1 (1997): 1–17.
Corrected reprint available at
http://poetry.obdurodon.org/reports/adams-birnbaum.xhtmlBirnbaum, David J.,
and Elise Thorsen. Markup and meter: Using XML tools to teach a computer to think about
versification. Presented at Balisage: The Markup Conference 2015, Washington, DC,
August 11 - 14, 2015. In Proceedings of Balisage: The Markup Conference
2015. Balisage Series on Markup Technologies, vol. 15 (2015). doi:10.4242/BalisageVol15.Birnbaum01.
https://www.balisage.net/Proceedings/vol15/html/Birnbaum01/BalisageVol15-Birnbaum01.htmlApache Cocoon.
http://cocoon.apache.org/Gamma, Erich, Richard Helm, Ralph
Johnson, and John Vlissides. Design patterns: elements of reusable object-oriented
software. Boston, MA: Addison-Wesley, 1994.Kay, Michael. XSLT 2.0
and XPath 2.0 programmer’s reference, 4th edition. Indianapolis: Wiley/Wrox,
2008.Kay, Michael. You pull, I’ll push: on
the polarity of pipelines. Presented at Balisage: The Markup
Conference 2009, Montréal, Canada, August 11 - 14, 2009. In Proceedings of Balisage: The Markup Conference 2009. Balisage Series on Markup
Technologies, vol. 3 (2009). doi:10.4242/BalisageVol3.Kay01.
https://www.balisage.net/Proceedings/vol3/html/Kay01/BalisageVol3-Kay01.htmlKiparsky, Paul. Abstractness,
opacity, and global rules. In Three dimensions of
linguistic theory, O. Fujimura, ed, pages 57–86. Tokyo: TEC, 1973.Kunth, Donald E. Structured
programming with go to statements. Computing surveys 6(4): 261–301. doi:10.1145/356635.356640.
http://web.archive.org/web/20130731202547/http://pplab.snu.ac.kr/courses/adv_pl05/papers/p261-knuth.pdfPatterns and
antipatterns.Journal of object-oriented programming 8(1): 46-48
(1995)The origin of Unix
pipes. Bell Labs document archive.
http://doc.cat-v.org/unix/pipes/Novatchev, Dimitre. Response to
Is daisy chaining xslt an accepted practice?
http://stackoverflow.com/questions/4571956/is-daisy-chaining-xslt-an-accepted-practicePiez, Wendell. Framing the problem:
building customized editing environments and workflows. Presented at Balisage: The Markup Conference 2016, Washington, DC, August 2 - 5,
2016. In Proceedings of Balisage: The Markup Conference
2016. Balisage Series on Markup Technologies, vol. 17 (2016). doi:10.4242/BalisageVol17.Piez01.
https://www.balisage.net/Proceedings/vol17/html/Piez01/BalisageVol17-Piez01.htmlPiez, Wendell. Response to Ignoring
ambiguous
matches. https://www.oxygenxml.com/archives/xsl-list/201402/msg00085.htmlPyFunctional (Python package).
https://github.com/EntilZha/PyFunctionalxsl:iterate. Saxon 9.8 Documentation.
http://www.saxonica.com/html/documentation/xsl-elements/iterate.htmlSugrue, James. Visitor pattern
tutorial with Java examples.
https://dzone.com/articles/design-patterns-visitorWhy is join() a string method instead
of a list or tuple method? Part of the [Python] Design and
history FAQ.
https://docs.python.org/3/faq/design.html#why-is-join-a-string-method-instead-of-a-list-or-tuple-method
XML Path Language (XPath) 3.1. W3C
Recommendation, 21 March 2017. https://www.w3.org/TR/xpath-31/ XProc: An XML Pipeline Language. W3C
Recommendation, 11 May 2010. https://www.w3.org/TR/xproc/