How to cite this paper

Kalvesmaki, Joel. “Multiple String Comparison in XSLT with tan:collate().” Presented at Balisage: The Markup Conference 2022, Washington, DC, August 1 - 5, 2022. In Proceedings of Balisage: The Markup Conference 2022. Balisage Series on Markup Technologies, vol. 27 (2022). https://doi.org/10.4242/BalisageVol27.Kalvesmaki01.

Balisage: The Markup Conference 2022
August 1 - 5, 2022

Balisage Paper: Multiple String Comparison in XSLT with tan:collate()

Joel Kalvesmaki

Joel Kalvesmaki is a software developer for the Government Publishing Office, founder and director of the Text Alignment Network, and a scholar specializing in the study of early Christianity.

Copyright Joel Kalvesmaki, released under a Creative Commons Attribution 4.0 International license

Abstract

In this article I introduce tan:collate(), an XSLT function for comparing three or more strings. Collated multiple string comparison is a difficult problem that merits experimentation with alternative methods. tan:collate() approaches the task in three different ways. Two of those methods involve a sequence of pairwise comparisons, with each comparison grafted into a guide tree. The third method applies the distinctive staggered-sample approach of tan:diff() to multiple strings, to detect a common superskeleton, before attempting to refine results. This paper situates the problem of multiple string comparison in the context of textual criticism, proposes a general output format and test suite, and presents measured results for the quality and performance of tan:collate() against an example test suite. Released as part of an open-source function library, tan:collate() enables developers to incorporate robust text comparison directly into XSLT applications.

Table of Contents

Multiple String Difference
A Standardized Output for N-Way Text Difference Operations
Multiple String Difference (N-Way Diff) and Collation
The Three Approaches in tan:collate()
Building and Refining the Guide Tree
Testing tan:collate()
tan:collate() Quality and Performance Test Setup
tan:collate() Test Results
Commonality Metric
Superskeleton Metric
Performance
tan:collate() Applied to Real Texts
Further Work
Conclusion

Multiple String Difference

In the 2021 Balisage Conference proceedings, I published a description of tan:diff(), an XSLT function that performs pairwise string comparison. [kalvesmaki] The restriction to two strings raises the question: could tan:diff() be applied to three or more versions? If so, how? The need for multiple-string comparison commonly arises, and a practical XSLT application would benefit a variety of endeavors.

  • An editor may need to review changes in an article, blog, or book that has gone through multiple revisions, from multiple hands.

  • Computer programmers may want to assess many versions of the same code base, to document when a particular bug was introduced.

  • A lawyer or forensic specialist may need to compare numerous versions of a document to demonstrate instances of forgery or tampering.

  • A publisher may want to evaluate the results of multiple optical character recognition (OCR) passes on the same block of text, to determine an optimal configuration.

  • A textual scholar may want to compare dozens or hundreds of transcriptions from different manuscript witnesses, to explore the phylogenetic relationships between manuscripts, and create a stemma codicum.

  • Biologists may want to compare DNA sequences as if they were texts, to identify mutations and abnormalities between test samples.

Pairwise comparison of three versions of a text would result in three individual output trees. Four versions would result in six pairwise comparisons; five versions, in ten; six, in fifteen, and so forth (n versions result in (n × (n - 1)) / 2 pairwise comparisons). There may be cases where dozens, hundreds, or thousands of versions need to be compared, in which case the burgeoning number of output trees is impossible to manage. To interpret and analyze so many versions, the data must be synthesized within a single tree structure.

The task requires attention to performance, because the number of possible pairwise difference operations grows quadratically with the number of versions. One could avoid this problem by arbitrarily choosing a sequence in which sequential pairwise comparisons should be integrated into a single output tree. But quality might suffer. The first pair that begins the result tree exerts a significant bias on the output. Every new pairwise comparison grafted into the result tree must conform to the contours of the first. One could attempt as a preliminary measure to assess the best sequence of the versions, but using what method or methods? The most obvious approach would be to fetch all possible pairwise comparisons, then sort them into an ideal sequence. But that sorting operation could be expensive on numerous files. Would any gains in quality be significant enough to justify the extra work? And what sorting method produces an optimal sequence? Does "optimal" have a clear meaning?

Multiple-string comparison presents considerable challenges. In this paper I present tan:collate(), an XSLT 3.0 function that compares three or more strings. To my knowledge, it is the first function of its kind in the XSLT language. After presenting the problem, I discuss the output format and the methods used by tan:collate(). A test suite is proposed, to measure the performance and quality of the function, and results of tests on a sample suite are summarized.

A Standardized Output for N-Way Text Difference Operations

We start with a series of texts, for example, "be gone", "beg one", and "bag bone". If each of these were compared pairwise with tan:diff(), we would get the following three output trees:

<diff>
   <common>be</common>
   <b>g</b>
   <common> </common>
   <a>g</a>
   <common>one</common>
</diff>
<diff>
   <common>b</common>
   <a>e</a>
   <b>a</b>
   <common>g </common>
   <b>b</b>
   <common>one</common>
</diff>
<diff>
   <common>b</common>
   <a>e</a>
   <b>ag</b>
   <common> </common>
   <a>g</a>
   <b>b</b>
   <common>one</common>
</diff>

We want those results to be combined into a single tree that allows us to see at a glance where all versions agree, and where they do not. Because an N-way difference output may involve hundreds or thousands of versions, keeping track of each version requires some kind of id system. Ids could be simply integers (see below) but the option to use more meaningful terms or phrases would be useful.

One further output requirement has to do with location. Strings to be compared will be of variable lengths, and shared passages will likely fall in different absolute positions in each text. In our example above, all three end in "one" but that substring occurs later in the third string than it does in the first two. Version-specific position may be an important feature for applications that process the output. Because that information is fetched by tan:collate() to assemble the output tree, it is retained in the output, rather than discarded.

The desired output, at the bare minimum, should be something like the following:

<collation>
   <c>
      <txt>b</txt>
      <wit ref="1" pos="1"/>
      <wit ref="2" pos="1"/>
      <wit ref="3" pos="1"/>
   </c>
   <u>
      <txt>e</txt>
      <wit ref="1" pos="2"/>
      <wit ref="2" pos="2"/>
   </u>
   <u>
      <txt>a</txt>
      <wit ref="3" pos="2"/>
   </u>
   <u>
      <txt>g</txt>
      <wit ref="2" pos="3"/>
      <wit ref="3" pos="3"/>
   </u>
   <c>
      <txt> </txt>
      <wit ref="1" pos="3"/>
      <wit ref="2" pos="4"/>
      <wit ref="3" pos="4"/>
   </c>
   <u>
      <txt>g</txt>
      <wit ref="1" pos="4"/>
   </u>
   <u>
      <txt>b</txt>
      <wit ref="3" pos="5"/>
   </u>
   <c>
      <txt>one</txt>
      <wit ref="1" pos="5"/>
      <wit ref="2" pos="5"/>
      <wit ref="3" pos="6"/>
   </c>
</collation>

In the tree above, <collation> specifies that it is the result of tan:collate(). Its content consists of individual <c>s, separated from each other by one or more <u>s. <c>, short for "common", represents text shared by all versions; <u>, for "unique" or "uncommon", represents defective text, i.e., text found in some but not all versions.

Within each <c> and <u> is a single <txt> element, with the substring of text in question. It is followed by one or more <wit>s, specifying which versions attest to that particular substring. Within each <wit> the @ref specifies the version id; and @pos, an integer marking exactly the position of the substring in that version (starting with 1).

The tree has several advantages for further XML-based processing. Anyone can restore the text of any version with a simple XPath expression, tan:collation/*[tan:wit[@ref eq $my-id]]/tan:txt, where $my-id is the string id of the version desired. Other similar XPath operations may be useful. For example, tan:collation/(* except tan:u[every $i in tan:wit satisfies ($i/@ref = $ids-to-ignore)]) would get rid of all defective readings from witnesses not of interest.
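For instance, with the tan:collation element bound to $collation and $my-id holding a version's id, the full text of that version can be reconstructed with an expression such as the following (a minimal sketch, not a function in the library):

string-join($collation/*[tan:wit[@ref eq $my-id]]/tan:txt, '')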

In tan:diff() output, the unique readings are ordered such that <a> always precedes <b> (if both are present). That order is arbitrary, more a convenience than a requirement, because string a is not necessarily chronologically prior to string b. The counterpart in tan:collate() output is a sequence of one or more <u>s. But the order of <u>s is arbitrary, aside from the requirement that for any version, successive <u>s accurately preserve each version's text, in correct sequence. If, in the example above, versions 1, 2, and 3 had a codepoint where they all disagreed, there would be no guarantee that the three alternative readings would appear in a particular order. Although the output is an ordered tree, it really represents a directed acyclic graph, with adjacent <u>s representing unordered, alternative version-paths from one <c> to the next.

The output tree reflects the practices of textual criticism, the science of detecting the original reading behind manuscript variants. A textual editor takes various manuscript witnesses (hence the name of <wit>), assigns them a code or siglum, and uses those sigla to document editorial decisions. In the typical critical edition, the main text represents the reading attested to by most or all relevant manuscripts. Exceptions and editorial interventions are declared at the foot of the page, where the editor indicates via sigla any manuscripts that depart from the main text.

In the example below, taken from Bekker's 1831 edition of Aristotle's works, the first page specifies what manuscripts are signified by sigla A through H. The second page shows how the sigla are used at the bottom (here only manuscripts A-D), to explain tersely where individual manuscripts differ from the main text adopted by the editor.

The sigla system is useful for describing the genetic relationship between manuscripts. Editors may group manuscripts into families, or into families of families, assigning each family a siglum (normally in an alphabet or case distinct from what was used for individual manuscripts). Sigla can even be assigned to conjectures—manuscripts or families that lack any direct material evidence. Those manuscript and family sigla—extant or conjectural—can then be arranged to illustrate the scholar's reconstructed history of the text, by depicting the relationships in a stemmatic diagram. The following example shows Butler's stemma for the 5th-c. CE Historia Lausiaca.

The stemmatic relationship among versions will be an important backdrop for the discussion of tan:collate(), which was initially created to support textual scholarship. The function has been written with the assumption that the input strings descend from a common archetype, and that the user wishes the output to respect as much as possible that archetype.

Multiple String Difference (N-Way Diff) and Collation

tan:diff() was designed to compare two strings that shared some linear genetic relationship. One was a revision of the other, perhaps some steps removed. That assumption was especially important for the corruption test, to measure the accuracy and performance of the function. The corruption test was set up by first generating several natural language strings of lengths 100, 1000, 10,000, 100,000, and 1,000,000 characters. For each string-length, a series of copies was made, each time with a random one percent of the codepoints changed to a character that did not exist in the original. Thus, the first copy had 1% of its codepoints corrupted, the next 2%, and so on, through the one hundredth version, in which all the characters were changed to ones not found in the original. That controlled test suite made it possible to measure difference results precisely.

That test works fine for pairwise comparison. But with three or more versions of a text, one cannot assume that for any arbitrary pair of versions one version descends from the other. Like those textual scholars staring at a dozen or more manuscripts with many variations, we can assume nothing more than a general family relationship between the versions, i.e., that they likely descend from a common archetype.

Most functions or applications that perform N-way diff proceed seriatim. The input strings are placed in a sequence. The first two strings are diffed. One of the strings is treated as a bridge string and diffed against the third input string. At this point we have two diff outputs, and they must be merged together. Once that is done, we have what is termed the guide tree. From that guide tree a new bridge string is chosen and diffed against the next string, and those results are merged into the guide tree. The process continues until all strings have been diffed and merged into the guide tree for the final output.
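By way of illustration only, that seriatim loop can be sketched in XPath 3.0 as follows. Here my:to-collation() and my:merge() are hypothetical placeholders for the conversion and grafting steps (they are not functions in the TAN library), and tan:diff() is assumed to take the two strings to be compared as its arguments.

(: hypothetical sketch of a seriatim N-way diff over $strings and $labels :)
fold-left(
   3 to count($strings),
   my:to-collation(tan:diff($strings[1], $strings[2]), $labels[1], $labels[2]),
   function ($guide-tree, $i) {
      (: here the bridge is simply the previous input string; a smarter
         routine would pick the version already in the guide tree that is
         most similar to the incoming string :)
      my:merge($guide-tree, tan:diff($strings[$i - 1], $strings[$i]),
         $labels[$i - 1], $labels[$i])
   }
)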

In the seriatim approach, the first two strings exert an enormous influence upon the results, sometimes resulting in a suboptimal guide tree. Consider, for example, an N-way string comparison that begins by comparing "XXabcde" with "fghijXX". That first alignment will set "XX" as the common anchor. But what if the next string to be compared is "abcdefghij"? The output will be fragmented, since the guide tree began with "XX" as the primary anchor. The algorithm had no basis for thinking that "abcde" or "fghij" might provide a better foundation. Such a scenario can be avoided if one can first sort the input strings into an optimal sequence.

What is that ideal sequence? Similarity seems an obvious candidate. Such similarity is quantified in the default output of tan:diff() not according to Levenshtein distance but along the model of a Venn diagram. Given strings a and b, some number of codepoints are shared (ideally the longest common subsequence), some codepoints are unique to a, and some unique to b. Any N-way diff could also be presented as a Venn diagram of N circles. The example below depicts a three-way difference (the portion of shared commonality of 1,953 codepoints is intentionally diminished, to make the differences more visible).

Using the Venn diagram to choose an optimal order, we notice that the leftmost two circles cluster with each other, and the right one is the outlier. It would seem to make sense to start with the two versions on the left. And we would be right. But things can get complicated quickly. Consider the following example:

In the 5-way Venn diagram above, pair A and C have the most in common, followed by the pair B and D. If we order the input strings based upon pairwise commonality, then we are likely to choose the sequence A, C, B, D, E. But another glance at the Venn diagram indicates that the two pairs of close versions are rather distant from each other, and that E is likely the best starting point. Although E does not have significant overlap with any single version, it has the least amount of unique text, and it encompasses not only the area where the other four agree but also all the passages where only three of them agree. E shares the most with the whole.

So we modify our principle of the optimal sequence and define commonality not in terms of pairs but in terms of the whole. Like textual scholars, we are searching for some archetype from which all the versions descend. We highly value clusters of common overlap, because they attest to the original string, the superskeleton of the textual corpus, if you will.[1]

A good collation algorithm should imitate, if possible, the other habits of textual editors. When a scholar first begins to edit a text, she or he finds all the relevant manuscripts, and then attempts to put them into families and then into a particular global order of priority. That task is called collation, and for good reason, because it is an act not merely of comparing but sorting, just as with other acts of collation. When anyone collates photocopies, or codepoints, or anything else, the primary goal is to sort the items. So too when creating critical editions: the goal of collation is to produce a sorted order of witnesses. String comparison is a means to that end. And it is only one of several means, joining other criteria such as provenance, scribal hands, and codicological data. Although string difference is taken into account, it is not pursued blindly. An old philological adage is, "Differences should be weighed, not counted." Background knowledge about, for example, the quality of the scribes may significantly affect how much weight a scholar ascribes to a particular difference, which in turn affects the priority assigned to a textual witness.

No algorithm should be expected to replicate the connoisseurship a scholar uses when weighing differences. On the other hand, every string difference algorithm should either require the user to determine the optimal sequence before launching the function, or else provide a surrogate sorting algorithm that ranks each string according to predicted shared commonality. Put another way, a good collation algorithm should prioritize the commonality shared among all versions. It should reveal the superskeleton.

Ideally, one could take all possible sequences, then evaluate them against each other. The result with the shortest cumulative text (in tan:collate()'s output, determined by string-length(string-join(tan:collation/*/tan:txt))) would be the preferred one. Such a process, of course, would impose a significant performance penalty. To take all possible sequences would expand the operation into factorial time, O(n!). A three-way collation would require six sequence comparisons. A four-way collation would require twenty-four of them; a five-way, one hundred twenty. And so forth.
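For example, if a sequence of candidate collation trees were already in hand (a hypothetical $candidate-collations), the winner could be picked with fn:sort() and a key function; the candidate whose cumulative text is shortest sorts first:

sort(
   $candidate-collations,
   (),
   function ($collation) { string-length(string-join($collation/*/tan:txt, '')) }
)[1]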

The Three Approaches in tan:collate()

tan:collate() approaches the task of collated N-way diff in three different modes, here named naive, calculated commonality, and superskeleton. The core signature of the function is tan:collate($strings-to-collate as xs:string*, $string-labels as xs:string*, $detect-superskeleton as xs:boolean, $preoptimize-string-order as xs:boolean). The first two parameters are common to all three approaches, so our discussion here focuses on $detect-superskeleton and $preoptimize-string-order.
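A minimal example of a call following that signature, using the three strings from the earlier example (the labels and parameter values are merely illustrative):

<xsl:variable name="collation"
   select="tan:collate(
      ('be gone', 'beg one', 'bag bone'), (: strings to collate :)
      ('1', '2', '3'),                    (: string labels :)
      true(),                             (: detect superskeleton :)
      false()                             (: preoptimize string order :)
   )"/>

As noted below, when $detect-superskeleton is true, the fourth argument is ignored.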

The first approach, naive, is to assume that the user has already preoptimized the order of the input. In this scenario, $detect-superskeleton is false and $preoptimize-string-order is false. The process applies seriatim pairwise comparisons. Strings 1 and 2 are diffed, and string 2 becomes the bridge string. The bridge is diffed against string 3. The two diff outputs are merged to form the core guide tree, and subsequent diffs are added. Additions occur by tracking the position of every string, and fragmenting both the result tree and the new diff input so that they can be merged. A basic outline of this merge operation is given in the next section; its full details go beyond the scope of this article.

The clear advantage for the naive method is that the layer it adds to tan:diff() operates in linear time. But the user is responsible for creating the optimal sequence.

In the second approach, calculated commonality, each version is given a commonality score that determines the optimal order of the input. In this scenario, $detect-superskeleton is false and $preoptimize-string-order is true. The important difference from the naive mode is the preprocess. Every pair of input strings is passed through tan:diff(), and the commonality is measured according to the following formula: c / (c + (a + b)/2), where c is the full length of the common text, and a and b, the lengths of text unique to a and b. Each diff output, then, has a commonality factor between 0 and 1.0 inclusive, where 1.0 represents no difference at all between the two versions. For each input version, the commonality score consists of the sum of its commonality factors across every diff in which it participates. The greater the total sum of commonality for a given version, the earlier it is placed in the sequence that informs the guide tree. In the case of the 5-way Venn depicted above, version E would be placed first, because the sum of its commonality with each of the other versions is greater than any other version's sum of commonality scores. The other versions follow according to their summed commonality score. That becomes the sequence in which versions are added to the master guide tree, built according to the same rules as followed in the naive mode. But the bridge string is not necessarily the most recent string added. Rather, it is the version in the guide tree that has the greatest amount of commonality with the incoming string.
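Expressed in XPath against a single tan:diff() output (assuming $diff is bound to the tan:diff element and that its children are in the TAN namespace), the formula amounts to something like:

let $c := string-length(string-join($diff/tan:common, '')),
    $a := string-length(string-join($diff/tan:a, '')),
    $b := string-length(string-join($diff/tan:b, ''))
return $c div ($c + ($a + $b) div 2)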

The disadvantage of the calculated commonality approach is that it requires a full tableau of pairwise comparisons. The process is memory-consumptive, and because it adds a layer that operates in quadratic time it can be prohibitively expensive to run. But a clear advantage this method has over the naive one is that the bridge string can be more selectively chosen, for better results. If the third string has more commonality with the first string than the second, then the first string serves as a bridge, and the guide tree is smoother.

The third approach, superskeleton, is the most unusual: it attempts to detect the superskeleton of the input, applying the same principle of staggered samples used by tan:diff(). But the process is applied not pair-by-pair but to all versions, all at the same time. In this scenario $detect-superskeleton is true and $preoptimize-string-order is ignored. The heart of the underlying function, tan:diff(), is modified. Every sample that is drawn from the shortest string is looked for not merely in one string but in all the other strings. The modified tan:diff() places an anchor only where it can be found in every version. That anchor is expanded to the extent agreed upon by all the versions. The process is repeated on either side of the new anchor. If at any point an anchor of an acceptable minimum size cannot be found, the algorithm falls back to the calculated commonality mode.

In superskeleton mode, a threshold is observed, specifying the minimum size of a permitted match, to avoid frivolous anchors. By default, the value is the rounded logarithmic value of the shortest string length, which means that if the shortest string is ten characters or fewer the minimum is a match of length 1; up to one hundred characters, a minimum of length 2; up to one thousand characters, a minimum of length 3; and so forth. The value can be overridden through a global parameter.
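One plausible way to express those tiers in XPath, treating the rounded logarithmic value as a ceiling of the base-10 logarithm (my reading of the description, not the library's code), with the hypothetical $shortest-string holding the shortest input:

(: math is bound to http://www.w3.org/2005/xpath-functions/math :)
max((1, xs:integer(ceiling(math:log10(string-length($shortest-string))))))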

When a match is found multiple times in at least one of the strings, an attempt is made to find the optimal match in each string. For each input string, the relative positions of all matches are gathered into a sequence of decimals from 0 through 1, inclusive. The ensemble of decimal sequences is passed to tan:closest-cluster(), which considers all the possible sequences that could be created by selecting one item from each input sequence. The function returns the one set that is cumulatively closest of all possibilities, to establish an anchor and continue the superskeleton process. If each version has dozens of matches, then finding the closest cluster could be prohibitively expensive, and, from the point of view of the user, not worthwhile, because the prospective anchor may not be reliable. In those cases, the process continues to look for a match.

If the versions are very different from each other, superskeleton detection is likely to be ineffective. In such a scenario, where the superskeleton is very hard to detect or highly fragmented, tan:collate() falls back to the calculated commonality method.

Further, the superskeleton method looks only for text shared by every witness, and does not look for majority readings that may attest to vestiges of a superskeleton. Those fragments are normally detected through the cleanup process, in which blocks of adjacent <u>s are analyzed to refine the results; this cleanup is applied to all three modes of tan:collate() and is described in the next section.

The superskeleton mode adds to tan:diff() a layer that runs in linear time, namely, the time required to match and expand every anchor. That benefit is applied only in passages where there are no changes in any of the versions.

Building and Refining the Guide Tree

In the naive and calculated commonality approaches, tan:collate() must build a guide tree. Building that guide tree is somewhat complicated. It begins by converting each tan:diff() output into a tree that adheres to the tan:collate() format. The first pairwise comparison is the starting guide tree. Into it we must merge the second pairwise comparison. Common between the two comparisons is a bridge string.

To illustrate, we take three input strings, "abcdefgh", "abQRdeSTgh", and "abcQdSTefg". We will create a guide tree following this sequence, giving first priority to the difference between the first two strings, second priority to that between the last two strings, and ignoring the difference between the first and last strings.

The first two strings compared create the first diff, depicted below with gaps and colors to make the similarities and differences evident:

We see that there are points of commonality, separated by two clusters created by substitution. We then get the next diff, between strings 2 and 3:

This particular difference has no substitutions, only three insertions and two deletions. In comparing the two sets of differences, we see that merging them will require fragmentation of the starting guide tree. The common strings "de" and "gh" will need to be broken up, and the unique string "QR" will need to be broken as well. Because the second string is the bridge between the first two comparisons, its relative position determines where insertions are made, as illustrated here:

This becomes the new guide tree. In hindsight, we can observe that in prioritizing the comparison between strings 2 and 3 over that between strings 1 and 3, the "ST" match has also been prioritized, foreclosing the possibility that the better match might be the "ef" shared by strings 1 and 3. If there were more versions to be collated, later discoveries might show the sequence to be flawed. Rather than looking back at every version to refine a new guide tree, we wait until all comparisons are made and the final guide tree is produced. At that point, we can review all the <u>s between <c>s and try to determine a more optimal alignment of the fragments.

The preceding description applies only to naive and calculated commonality modes. In contrast, the superskeleton mode's approach to the guide tree is quite simple. In fact, it has no guide tree. The algorithm uses staggered samples to detect all blocks of universally shared text, and anything in between those blocks is passed on to the calculated commonality mode.

In all three modes, the interim results will consist of a series of <c>s with intervening <u>s. Within an adjacent set of <u>s, there may be overlap across multiple witnesses, and the results could use refinement. They are treated like a miniature collation problem, and reassessed by passing the cluster of versions back through tan:collate() in calculated commonality mode. In such a localized re-collation, the original order of strings can be discarded and reassessed. The cleaned-up results are considerably improved, because commonality is calculated afresh for the versions at that particular passage. In the example above, within the second cluster of unique readings, the fragments would be drawn into a new collation tree on the basis of a better string order: 3, 1, 2.[2] Because string 3 in this cluster has more in common with string 2 than string 1 does, string 3 will be the bridge. So "ef" will be aligned before any attempt is made to integrate string 2. The result is that "e" is marked as a shared character, an extra part of the superskeleton not evident in the interim results. The final output is illustrated here:
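That localized re-collation can be sketched, very roughly, as follows, with one run of adjacent <u>s bound to $cluster; this illustrates the principle only, not the library's internal code:

let $refs  := distinct-values($cluster/tan:wit/@ref),
    $texts := (for $ref in $refs
               return string-join($cluster[tan:wit/@ref = $ref]/tan:txt, ''))
return
   (: false(), true() selects the calculated commonality mode :)
   tan:collate($texts, $refs, false(), true())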

Testing tan:collate()

Any algorithm such as tan:collate() should be evaluated upon the basis of a robust test suite, both to assess the accuracy and quality of the output, and to measure the efficiency of the implementation.

The article on tan:diff() introduced what I called the corruption test, where ordinary text was incrementally corrupted. To facilitate measurement, a base text was drawn up in blocks of 100, 1000, 10,000, 100,000, and 1,000,000 characters, and each of those blocks was corrupted in increments of 1%. Corruption involved substitution of a character that was nowhere to be found in the original. A simple example of a ten-character string, "People, be", corrupted in increments of 10%, would be as follows:

People, be
PeoΔle, be
PeoΔle,Δbe
PΔoΔle,Δbe
PΔoΔleΔΔbe
PΔoΔleΔΔbΔ
ΔΔoΔleΔΔbΔ
ΔΔoΔleΔΔΔΔ
ΔΔoΔlΔΔΔΔΔ
ΔΔΔΔlΔΔΔΔΔ
ΔΔΔΔΔΔΔΔΔΔ

The test is controlled, artificial, and demanding. For any given level of corruption, it is possible to measure the output and determine if it is both accurate and of the highest quality. For example, comparing the original string to one that has 40% corruption should result exactly in 60% of the characters held in common, i.e., the longest common subsequence. In a string of 10 characters the problem is trivial, but in strings of 1M characters in length, achieving optimal results with non-trivial rates of corruption can be quite difficult.

For tan:collate(), however, the corruption test as originally framed is inadequate. We need a test suite that can handle three or more strings. And they cannot be part of a single line of dependency. As noted in the discussion above on textual scholarship, versions of the same text are represented as parts of a tree describing a line of descent. Any two versions are more likely to be cousins to each other than to stand in an ancestor-descendant relationship. Hence, in any comparable corruption test for tan:collate(), the comparanda must be independent of each other, without compromising the other virtues of the corruption test.

The strategy is to revise the corruption test so as to generate independent pairs of string comparanda. If that independence can be applied to two versions, the process can be generalized for three, four, or any number of versions.

We begin by thinking of the corruption test suite as the outcome of a kind of card game. We start with the ten-character string "People, be", and two players. Each player is assigned a special unique letter, not found in the input string and different from the other player's unique letter. A series of integers are generated, one for each position in the original string. These integers are the playing cards. They are shuffled and distributed to each player.

Table I

Two-player corruption test, setup

                 | Player 1       | Player 2
Unique character | Γ              | Δ
Position cards   | 1, 7, 4, 10, 9 | 6, 2, 3, 8, 5

The game consists of as many rounds as there are players. Each round comprises as many turns as the smallest number of cards initially dealt (in our example, round one has five turns). Therefore, a full game consists of as many turns as the total number of cards. But for practical reasons, as explained below, the last round is normally of little interest and no consequence.

At round one, turn one, each player takes the first card in their hand, uses it to locate the position in the original string, and replaces that character with their unique letter.
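A minimal sketch of that substitution step in XPath 3.0 (purely illustrative; the actual test harness was driven by an ad hoc Java program, described below):

let $corrupt := function ($original as xs:string, $hand as xs:integer*,
                          $turns as xs:integer, $mark as xs:string) {
      string-join(
         (for $i in 1 to string-length($original)
          return
             if ($i = subsequence($hand, 1, $turns)) then $mark
             else substring($original, $i, 1)),
         '')
   }
return $corrupt('People, be', (1, 7, 4, 10, 9), 1, 'Γ')   (: yields 'Γeople, be' :)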

Table II

Two-player corruption test, round one

Turn | Player 1   | Player 2
0    | People, be | People, be
1    | Γeople, be | PeoplΔ, be
2    | ΓeopleΓ be | PΔoplΔ, be
3    | ΓeoΓleΓ be | PΔΔplΔ, be
4    | ΓeoΓleΓ bΓ | PΔΔplΔ,Δbe
5    | ΓeoΓleΓ ΓΓ | PΔΔpΔΔ,Δbe

In our primary example, at turn one, with a single substitution, each player is left with a version that is 10% corrupt, and the point of corruption is different from the one used by the other player. The same is true for all the other turns. By the fifth turn, every position in player 1's string differs from the same position in player 2's.

By setting up the process this way, we have created two independent lines of corruption. Any item in player 1's column can be compared to any item in player 2's, and we know precisely the measure of difference, because the players' hands of corruptible positions are mutually exclusive. Player 1, turn 1 is 10% corrupt from the original, and Player 2, turn 3 is 30%. If we compared those two strings against each other, we would expect a total of 40% corruption and therefore 60% commonality.

In the context of two players, the exercise is uninteresting, because it can be reduced quite easily to the original model specified in the tan:diff() article. The extra steps are quite unnecessary for pairwise string comparison. But the rationale for this approach becomes clearer when we add a third player, i.e., anticipate three-way string comparison. Here is a sample setup:

Table III

Three-player corruption test, setup

                 | Player 1    | Player 2 | Player 3
Unique character | Γ           | Δ        | Θ
Position cards   | 1, 4, 10, 9 | 6, 2, 8  | 7, 3, 5

The first round would look like this:

Table IV

Three-player corruption test, round one

Turn | Player 1   | Player 2   | Player 3
0    | People, be | People, be | People, be
1    | Γeople, be | PeoplΔ, be | PeopleΘ be
2    | ΓeoΓle, be | PΔoplΔ, be | PeΘpleΘ be
3    | ΓeoΓle, bΓ | PΔoplΔ,Δbe | PeΘpΘeΘ be

At the initial setup, player one was given an extra card. So at the end of the first round (turn three), that player still has a single unplayed card, 9. Thus, all three players preserve the ninth position of the original string intact.

Before we move to the second round, notice that some of the principles in the two-player version are still at play. All three columns of strings are independent of each other. One string drawn from each player across different turns in round one could be the subject of a three-way string comparison, and we would know precisely what the results should be. If the three strings from turn one are drawn, we are dealing with an accumulated 30% corruption. To process them through tan:collate() we would expect 70% of the characters from the original string to be wrapped in <c>s. The rest will be in <u>s, and no <u>s will be shared by two players. If the length of common text is less than 70% we have suboptimal results. We may see a rate higher than 70% if we encounter fortuitous commonality (see below).

This example also helps illustrate the concept of the superskeleton, i.e., the parts of the original version that can be detected in multiple versions. Even after the third turn, the entire superskeleton can be seen, if we take into account text shared by most but not all versions.

We now prepare round two, turn four. Each player passes to the right all the cards that have already been played. Player 1 keeps the fourth, unplayed card, at the top of the new hand. All players play the fourth turn.

Table V

Three-player corruption test, round two, turn four

          | Player 1   | Player 2   | Player 3
New cards | 9, 7, 3, 5 | 1, 4, 10   | 6, 2, 8
Turn 4    | ΓeoΓle, ΓΓ | ΔΔoplΔ,Δbe | PeΘpΘΘΘ be

Now, each of the three versions has suffered 40% corruption from the original. If we were to perform a three-way string comparison, there would be no <c> text, only <u>s. The <u>s with more than one witness would be a vestige of the superskeleton. But it fades from view as more turns are taken:

Table VI

Three-player corruption test, round two continued

Turn | Player 1   | Player 2   | Player 3
5    | ΓeoΓleΓ ΓΓ | ΔΔoΔlΔ,Δbe | PΘΘpΘΘΘ be
6    | ΓeΓΓleΓ ΓΓ | ΔΔoΔlΔ,ΔbΔ | PΘΘpΘΘΘΘbe

More turns could be taken, by once again passing cards to the right. But it is uncertain whether it is worth continuing the game. When there are no points of commonality, what are we looking for, exactly?

The exercise above could be expanded to four, five, and more players. The levels of corruption could be made more granular. The length of the initial input string could be adjusted. No matter how the parameters are adjusted, every test will be conducted using the same rules. The corruption level of each version can be calculated. The corruption levels can be summed, to determine how much commonality if any should be detected.

Given an arbitrary set of strings, one from each player, passed through tan:collate(), the results can be assessed based on two different measures: commonality and superskeleton.

Commonality measure. Based on the corruption levels in each input string, calculate the expected commonality. The algorithm should return in its <c>s the complementary percentage of codepoints. For example, if the corruption levels for five versions are 2%, 13%, 5%, 8%, and 10%, then the total corruption is 38%. Consequently, at least 62% of the original string should be retained in <c>s. Because of this requirement, a commonality test is meaningful only if at least one codepoint has not been corrupted.
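In terms of the output format, the measured commonality is simply the following quotient (with $result bound to the tan:collation output and $original to the uncorrupted base string), which is then compared against the expected proportion:

string-length(string-join($result/tan:c/tan:txt, '')) div string-length($original)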

The total corruption level does not consist of a simple sum of the individual corruption levels of each version, but rather a calculation as to which codepoints have and have not been corrupted. Consider, for example, three strings with 1%, 2%, and 70% corruption. Here the corruption level is not 73% but 70%, because the codepoints corrupted by the first two strings have also been corrupted by the third.

There may be cases where we get results better than expected because of fortuitous commonality. The phenomenon of fortuitous commonality appears so often that it is worth noting specially. Consider an original text of "eye," where the corruption process produces the following three strings to be compared: "%ye", "e*e" and "ey$". Because every position has been corrupted, these three versions collectively have a total 100% corruption rate, so one would expect 0% commonality. But because an e appears in all three versions, the detected commonality level is 33%. Fortuitous commonality appears frequently in natural language texts, and was evident in the tests that were performed.

Superskeleton measure. Some <u>s attest to a majority reading, i.e., places where most witnesses attest to the same variant reading. The entire superskeleton should be detectable as long as every position remains unchanged in at least two versions. However many players there are, the penultimate round begins with nearly every codepoint shared by exactly two versions, but the number of shared codepoints diminishes as the round unfolds. Thus, it is guaranteed that if every version comes from before the penultimate round, the entire superskeleton should be detectable through an XPath expression such as string-join($result/(tan:c | tan:u[tan:wit[2]])/tan:txt). An N-way difference algorithm should be graded as to whether its output can be used to extract a superskeleton that reflects the original input. A shorter superskeleton is symptomatic of fragmentation, and a deterioration in quality. A longer superskeleton may also reflect deterioration in quality, or (more likely) the occurrence of fortuitous commonality.
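Put as a rough formula, the superskeleton metric reported below is the detected length measured against the length of the original archetype, for example as a fraction; negative values signal fragmentation, positive values fortuitous commonality or multi-witness fragmentation:

(string-length(string-join($result/(tan:c | tan:u[tan:wit[2]])/tan:txt, ''))
   - string-length($original)) div string-length($original)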

Although the test suite and the two metrics described above have been designed to test the performance and quality of tan:collate(), they can be used to independently test any collation algorithm. Readers do not need to access my test suite to verify results for themselves. The rules outlined above permit anyone to create their own test suite and apply it to tan:collate() or any other multiple-string differencing algorithm.

The test is admittedly artificial, but it is better than nothing. To my knowledge, there are no gold-standard data sets for N-way text differencing (i.e., a corpus of texts consisting of a known original and a library of descendant versions, corrupted by transmission that reflects natural cultural practices). The test is also artificially narrow, because it focuses exclusively on substitution, which is only one of several types of natural text changes. Ideally, it would be of immense benefit to develop comparable rules for at least four other types of test suites, one dealing with transposition, another with substantial insertions, another with substantial deletions, and another with duplication (dittography).

tan:collate() Quality and Performance Test Setup

To test tan:collate(), several strings of length 100, 1,000, 10,000, 100,000, and 1,000,000 characters were generated from natural language texts, in this case random portions of the King James Version of the Bible. For each string length, test suites were built for 3, 4, 5, and 6 players, as described above. The number of turns was limited to exactly 100, meaning that each turn represented 1% corruption. That means that for the smallest string, 100 characters, the game could be played exactly as described above, with one hundred cards distributed to the players, each card representing a single place to be corrupted. But for larger strings, a card was replaced with a card packet, representing a cluster of randomly permuted positions. Taking the 1,000-character string as an example, the position numbers one through one thousand were shuffled, and instead of one hundred cards being distributed, one hundred packets of ten cards were distributed. All other rules remained the same.

A full test suite was generated for each scenario, e.g., 300 files for 3 players; 400 files for 4 players, and so forth, for a total of 9,005 files (retaining the original five strings).

The primary goal was to measure the performance and quality of tan:collate(), not merely in itself, but across its three modes: naive, calculated commonality (CC), and superskeleton.

A simple XSLT file was created, to run tan:collate() according to the choice of strings (number, length, level of corruption), and the desired mode. The stylesheet ran tan:collate() and reported on the quality of results (the amount of common text, the length of the superskeleton), and alerted the user to any lost text. Thus, the calculated time reflected not merely the work of tan:collate() but some simple post-processing.

An ad hoc Java program was created, to control and time all the operations. Working through all possible permutations was unrealistic. For the three-player, 100-character length test suite, three million unique tests were possible (one hundred files per player, three modes). Any six-player test suite would have three trillion unique tests. To make the operation more tractable, a small but distributed number of corruption levels was assigned to each player, so that permuted combinations would reflect a wide range of corruption scenarios. Iterations were performed on player 1's corruption levels 1, 2, 4, 8, 16, 32, 64; for player 2, on levels 1, 3, 9, 27, 81; and so on to player 6, where iterations were performed on corruption levels 1, 7, 49. Each combination was applied across all three modes: naive, calculated commonality, and superskeleton.

Although test suites were generated for texts of length 100K and 1M characters, they were not included in the results discussed below due to time constraints. I thought it more important to run thousands of tests with slight parameter changes on the three smallest sets of test suites rather than merely dozens of tests across all five test suite sizes. All tests were performed on a Dell Inspiron 5570 (Intel Core 1.6GHz i5-8250U with 4 physical and 8 logical cores), and allocated up to 4 GB of heap memory.

Before tests were executed, it was expected that the quality of the results would be equally high, across all three modes. Much of this expectation was shaped by the excellent results from tan:diff().

Similar considerations shaped my expectations for performance. As noted in the previous article, tan:diff() runs in logarithmic time, according to string length. Each of the three modes of tan:collate() adds a layer surrounding the core tan:diff() application. I expected that the naive and superskeleton tests would be comparable, because each one appears to add on top of tan:diff() a layer that runs in linear time—given N input items, the naive approach must run and collate N copies of tan:diff(), and the superskeleton approach must check every matching sample N times before accepting it as an anchor. The CC mode was expected to run in quadratic time, because it must perform tan:diff() (N × (N - 1)) / 2 times.

tan:collate() Test Results

Around forty thousand automated tests generated an enormous result data set, which tracked several properties. Below are summarized results, first concerning output quality (commonality metric and superskeleton metric), followed by performance results.[3]

It should be noted that across all the tests, every output of tan:collate() was double-checked against the input, to ensure that no text was lost, replaced, or inserted. Happily, that proved to be the case. So even in situations where the quality of the output was not optimal, the results were nevertheless accurate.

Commonality Metric

The commonality metric consists of the combined string length of all tan:c/tan:txt nodes in the output, measured against the expected amount of commonality, calculated as 100 minus the total percentage of codepoint positions corrupted. The expected commonality metric was subtracted from the actual commonality metric. A result below zero was treated as a flawed result, because genuine commonality was being missed. The significance of results above zero was unclear. Fortuitous commonality (see above) was the only discernible explanation. Are such extra matches to be treated as desirable or not? To some, they may be taken as clutter or noise that should be avoided. To others, the chance matches may be heuristic devices, helpful for recovering a lost original. To accommodate multiple, legitimate perspectives, I treated any test scoring zero or above as a success, and any below zero as a failure.

Across the tens of thousands of tests run, 13.9% of them resulted in failures. No individual cases of failed tests were studied, to determine the cause. When assessing the results, my primary goal was to discern trends, not to interpret individual scores.

Failed tests occurred more frequently in tests of shorter length, occurring more than twice as often in tests of length 100 than in tests of length 1000. Failed tests were quite rare in tests of length 10K (0.2%).

Within tests of length 100, the number of failed tests tended to double or triple with the addition of a new player. In tests of length 1000, however, the number of failed tests for three players was greater than the number for four players; but the addition of a fifth and then a sixth player each increased the failure rate seven- or eightfold. The reasons for this trend are unclear.

Failed tests appeared to be relatively evenly distributed across the corruption rate, at least in tests of length 100. In tests of length 1000, those failures tended to cluster at the low and high levels of corruption. The reasons for this trend, as well, are unclear.

Failed tests were somewhat higher (roughly 30%) in the naive mode than in the other two modes, but only in tests of length 100. In tests of length 1000, nearly 40% of the failures occurred in superskeleton mode, followed by naive mode (35%) and CC mode (25%). Whether mode significantly affects the commonality metric needs further study.

Turning from failed tests to successful ones, I found that the amount by which commonality exceeded expectation was significantly influenced by text length. Tests of length 100 resulted in an average excess commonality level of 1.7; those of length 1000, 15.9; and those of length 10K, 21.3.

Figure 1: Average difference of actual versus expected commonality levels of three modes, by text length

Corruption level was another significant influence on extra commonality. Generally speaking, tests with the lowest and highest cumulative corruption had the commonality results closest to zero. From 0% corruption through 100%, the amount of commonality rose, then fell, forming a gentle symmetrical sine arc.

Figure 2: Average difference of actual versus expected commonality levels of three modes, by cumulative corruption rate

Curiously, the superskeleton mode tended to capture the greatest amount of fortuitous commonality in sets with a corruption level of 40% or lower, but at corruption levels greater than 50% the mode's commonality score dropped below that of the other two modes. Generally speaking, the CC mode yielded 5% to 10% better commonality than did the naive mode. That is to be expected, since the CC mode attempts to maximize commonality by assessing every pairwise diff before proceeding. But the behavior of the superskeleton results was unexpected, and cannot, at the moment, be explained.

Also unusual was how commonality differed according to the number of players. In tests of length 100, the extra amount of commonality improved as players were added. But in tests of length 1000 and 10K, the story was different. The commonality results for four players were remarkably higher than for the other player counts, going as high as 27% over the expected commonality. It is unclear why four players produced abnormally high commonality rates.

Figure 3: Average difference of actual versus expected commonality levels of three modes, by number of players and text length

Overall, the commonality metric results were somewhat surprising. A significant portion of tests, especially those on shorter strings with more versions, fell short of optimal commonality. For tests that succeeded, fluctuations in the scores, due to fortuitous commonality, were easy to observe but hard to explain.

Superskeleton Metric

The superskeleton was detected by joining the string values of all (tan:c | tan:u[tan:wit[2]])/tan:txt nodes in the output. Ideally that new string should be identical to the original archetype from which all the other versions descend. The length of the detected superskeleton was taken as a qualitative measure of the algorithm's output. A shorter superskeleton would result from its fragmentation into separate <u>s with a single witness. A longer superskeleton would result either from fortuitous commonality between multiple versions, or, in the case of five players or more, fragmentation of a part of the superskeleton into multiple <u>s with multiple witnesses. Because a low superskeleton score was the result purely of fragmentation, I considered it to be undesirable and therefore a failure. But because a high superskeleton score was the result of either fortuitous commonality or fragmentation, I regarded it as only possibly problematic.

Overall, 16% of the tests resulted in detected superskeletons shorter than the ideal length, i.e., failures. The majority resulted in larger-than-expected superskeletons, whose lengths on average exceeded the ideal by 3.6%, 8.0%, and 3.1% for tests of length 100, 1000, and 10K respectively.

Figure 4: Average difference of actual versus expected superskeleton length of three modes, by text length

Generally speaking, negative superskeleton scores occurred most frequently in tests of length 10K, increasing in frequency and intensity as the corruption level rose. The naive mode tended to exaggerate the length of the superskeleton.

Figure 5: Average difference of actual versus expected superskeleton length of three modes, by cumulative corruption rate

Negative results also tended to correlate with the number of players. In tests with three players the superskeleton measure was almost always negative. Tests with four players commonly approached the expected length until corruption reached 50%, at which point the superskeleton length began to decline significantly. Superskeleton length was almost always positive for tests of 5 and 6 players, with 5-player tests averaging a 3.9% increase and 6-player tests, 9.0%.

Figure 6: Average difference of actual versus expected superskeleton length of three modes, by number of players and text length

Results differed mode by mode. The superskeleton mode, with an average 3.5% increase, tended to come closest to the ideal length, followed by CC (4.0%) and naive (6.1%).

Overall, the results were somewhat surprising. Before the tests were run, it was hoped that tan:collate() would be able to reproduce the original text by means of the superskeleton. That happened only when the texts were short and had a low level of aggregate corruption. In general, the superskeleton mode did a better job overall of preserving the length of the original archetype. But those observations should be treated as preliminary. Other trends associated with the superskeleton metric cannot be easily explained, and require further research.

Performance

Because of the number of variables, I found that I needed to study performance from different perspectives. Below are several summaries of the trends, each one approaching the same data set from the perspective of a different primary variable.

Performance across text length. Across all mode and player options, tests of length 100 took an average of 154.1 ms. Tests of length 1000 required an average 1,187.1 ms. Tests of length 10K required an average of 21,594.1 ms. Much of this steep increase was driven by the CC mode, as it took on increased text lengths and player counts. From the 100-length tests to the 1000-length ones, the CC mode time increased only eight- or ninefold (the more players, the steeper the increase). But from the 1000-length tests to the 10K-length ones, the CC mode time increased on average 20-fold. Such a steep increase made analysis of the CC mode across 100K- and 1M-length tests intractable, at least at the same density of test cases.

Figure 7: Average test speed in milliseconds of three modes, by number of players and text length

Performance across modes. Across the modes naive, CC, and superskeleton, tests of length 100 averaged, respectively, 142.4, 199.9, and 120.1 ms. At tests of length 1000, the averages were 1,054.1, 1,713.1, and 794.1 ms, respectively. In the 10K-length tests the averages were 16,765.2, 40,022.4, and 7,975.9 ms, respectively. Clearly, the CC mode took the longest time to perform. The naive mode generally took less than half the time of the CC mode. But the superskeleton mode cut the naive mode's time nearly in half, and this performance gain increased as the text length increased. As already suggested under text length, and as the statistics earlier in this paragraph confirm, the CC mode was clearly working in quadratic time, as expected. The naive mode, however, appeared to demonstrate a time complexity greater than linear. From 100- to 1000-length tests, the time needed increased sevenfold, but the move from 1000- to 10K-length tests imposed a sixteenfold increase in processing time. That gradually increasing time penalty in naive mode is likely due to the work required to build the result tree. By contrast, the superskeleton mode, which does not build a result tree, stayed relatively close to linear time: from 100- to 1000-length tests, the time needed increased sevenfold, and the move from 1000- to 10K-length tests imposed only a tenfold increase in processing time.

Performance across number of players. As might be expected, an increase in players required greater processing time. Across all modes, string lengths, and corruption levels, the addition of a new player imposed anywhere from a 22% to an 80% increase in processing time, usually around 40%. A higher penalty appeared in the CC mode, but that seemed to be the only factor that exacerbated the effect of adding new players. As text length increased, or as the corruption rate increased, the amount of time required with the addition of an extra player did not vary significantly.

Performance across levels of corruption. In tan:diff(), performance was significantly affected by the amount of corruption, represented in the graph by a gentle sine wave, with the maximum time peaking around the 50% to 70% level of corruption. In tan:collate(), however, the data followed a different pattern. In tests of length 100, the processing time required grew slowly but in linear fashion according to the amount of corruption. In tests of length 1000, the plotted curve looked more like a lazy half-parabola, with a steep rise at low levels of corruption, slowing as corruption increased. Tests of length 10K exhibited a tighter parabola, but at levels of corruption 65% and greater the graph exhibited erratic behavior, with levels of 99% corruption being quite time-consuming to assess. That behavior was most pronounced in the CC mode. The superskeleton mode had little variance in performance across levels of corruption, averaging 8,132 ms per operation (5,702 for three players, 9,142 for six).

Figure 8: Average test speed in milliseconds of three modes, by cumulative corruption level (only texts length 10K, any number of players)

Overall considerations on performance. Anyone using tan:collate() in demanding environments should be aware of the factors that may affect system performance. The number of versions has a predictable, smooth effect, as does the amount of corruption. The lengths of the input strings are a highly significant factor, but their effect is compounded by the most important factor of all, the mode. The calculated commonality (CC) mode performed the worst and the superskeleton mode the best. Because the qualitative scores of the modes are inconclusive, it is not necessarily clear that the superskeleton mode is always the most desirable. But because, on balance, it seems to introduce no deterioration in quality compared to the other two modes, and because it is clearly the best performing, it is set as the default mode in tan:collate().

tan:collate() Applied to Real Texts

tan:collate() has been applied in numerous real-world contexts, and has easily handled high-volume, complex work. The corruption test results should already give the reader a sense of performance on relatively small texts, so let us consider a real-world stress case: very long texts, in numerous versions, of the kind that would pose a challenge in production.

For my example, as in the paper on tan:diff(), I took an arbitrary volume from the Code of Federal Regulations (title 7, volume 1), and applied tan:collate() to the XML files that were published in different years (available for download at http://govinfo.gov).

Taking the plain string value of each file (without any normalization, even of spacing) yielded the following string lengths:

Table VII

Code of Federal Regulations, title 7, volume 1, string length of XML files

Year    String length
2021    3,290,256
2020    2,616,953
2019    2,620,171
2018    2,766,855
2017    2,547,263
2016    2,535,163

Using the latest version as the starting point, these strings were fed through tan:collate() as three-, four-, five-, and six-way comparisons, across the three different modes; a sketch of how such a run might be set up in XSLT follows. The results, presented in the table below, are quite informative.
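The following is a minimal, illustrative sketch only. The file names, the import path, and the two-argument call to tan:collate() (the strings to collate plus a parallel sequence of labels) are assumptions made for the sake of the example; the TAN function library documentation gives the authoritative signatures, along with the means of choosing a mode.

<!-- Illustrative sketch: collate the string values of several yearly CFR volumes. -->
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:tan="tag:textalign.net,2015:ns"
    exclude-result-prefixes="#all">

   <!-- Assumed import path; point this at your copy of the TAN function library. -->
   <xsl:import href="TAN-function-library.xsl"/>

   <xsl:output method="xml" indent="yes"/>

   <!-- Hypothetical file names, one per year, newest first -->
   <xsl:param name="years" as="xs:string*" select="('2021', '2020', '2019')"/>

   <xsl:template name="xsl:initial-template">
      <!-- Plain string value of each file, with no normalization of any kind -->
      <xsl:variable name="texts" as="xs:string*"
         select="for $y in $years return string(doc('CFR-' || $y || '-title7-vol1.xml'))"/>
      <!-- Assumed two-argument form: strings to collate plus parallel labels -->
      <xsl:sequence select="tan:collate($texts, $years)"/>
   </xsl:template>

</xsl:stylesheet>

Run with any XSLT 3.0 processor's initial-template option; the result is the collated tree described earlier, ready for post-processing or for an application such as TAN Diff+.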

Table VIII

tan:collate() performance of comparison of different versions of the Code of Federal Regulations, title 7, volume 1

Years compared   Version count   Mode performance (seconds)
                                 Naive    CC       Superskeleton
2019-2021        3               70.3     145.6    75.4
2018-2021        4               163.7    530.5    134.3
2017-2021        5               259.9    531.3    160.9
2016-2021        6               277.2    727.0    186.0

On the six-way test, the calculated commonality mode required more than twelve minutes to complete. The superskeleton mode, in contrast, needed slightly more than three minutes.

A study of the output of each of the scenarios was not conducted. In my experience, an assessment of the quality of output requires the connoisseurship of a person familiar with the material and knowledgeable about how the texts should be related. We should follow the textual scholar's adage: differences should be weighed, not counted. Because tan:collate() can only count, not weigh, its quantitative results are not the litmus test for quality. But because the function, as well as its dependency tan:diff(), has so many parameters for adjustment, the experienced user may find the resources needed to properly weigh the differences.

Equally important in the real world are the interpretations of our comparisons. As a core part of the TAN function library, tan:collate() drives a number of processes, the most important being TAN Diff+, an XSLT application that infuses tan:collate() results with statistical analysis and presents the output in a dynamic, interactive HTML display.

Below is a fragment of sample output from TAN Diff+ (the full original is available online), showing how well four different OCR outputs of a Greek text matched the ground-truth original:

Further Work

The commonality metric requires further study. Why suboptimal results occur so frequently, and in such unexpected patterns, deserves close attention.

Although the corruption test suite has been useful, it should be taken as the beginning of a general effort to provide project-independent benchmarks in the field of text comparison. Highly desirable is a set of gold-standard test data. For example, critical editions could be disassembled and simplified, one transcription per manuscript, and the edition itself turned into a simplified version of what would be expected from a collation of the input versions. That last item may require annotations documenting the rationale that motivated the editors' choices.

The corruption tests are all substitution-based. Comparable tests for insertions, deletions, transpositions, and duplication would need to be developed for a more holistic, realistic method of testing the quality and performance of a text-comparison algorithm.
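To give a sense of what such a generator might look like, here is a minimal XSLT 3.0 sketch of a deletion-based corruption pass, intended to be dropped into a stylesheet that declares the xsl namespace. The function local:corrupt-by-deletion is a hypothetical helper, not part of the TAN function library, and dropping each character independently with a fixed probability is only one of many ways such corruption could be modeled.

<!-- Illustrative only: delete each character of $input independently with     -->
<!-- probability $rate, using XPath 3.1's random-number-generator() so that a  -->
<!-- given seed always produces the same corrupted text.                        -->
<xsl:function name="local:corrupt-by-deletion" as="xs:string"
   xmlns:local="urn:example:local"
   xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <xsl:param name="input" as="xs:string"/>
   <xsl:param name="rate" as="xs:double"/>
   <xsl:param name="seed" as="xs:integer"/>
   <xsl:variable name="kept" as="xs:string*">
      <xsl:iterate select="string-to-codepoints($input)">
         <xsl:param name="rng" as="map(*)" select="random-number-generator($seed)"/>
         <!-- Keep the character only if the random draw clears the deletion rate -->
         <xsl:if test="$rng?number ge $rate">
            <xsl:sequence select="codepoints-to-string(.)"/>
         </xsl:if>
         <xsl:next-iteration>
            <xsl:with-param name="rng" select="$rng?next()"/>
         </xsl:next-iteration>
      </xsl:iterate>
   </xsl:variable>
   <xsl:sequence select="string-join($kept)"/>
</xsl:function>

An insertion pass could be written along the same lines, drawing added characters from the test text's own alphabet, and a transposition pass could swap randomly chosen slices; the existing substitution-based suite already provides the scaffolding into which such generators would fit.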

Transposition needs further study. Very commonly, editors and writers move a passage of significant length from one part of a text to another, where it may then be further modified, broken up, or intermingled with other transpositions. Transposition detection could be of significant benefit to any collation algorithm: if a particular passage is littered with small clumps of matches and many passages of distinct readings, perhaps a transposition would be easier and faster to detect than an optimal alignment. But this raises an even more complex question. In a pairwise string comparison, a transposition might be easy to express; in an N-way string comparison, how should transposition be detected and communicated?

Both tan:diff() and tan:collate() happen to have been implemented first in XSLT, because of pressing need and the near-impossibility of applying traditional, stateful solutions in that language. But nothing prevents the algorithms behind these functions from being implemented in other languages. Doing so would be highly desirable, since TAN's staggered-sample approach (for which I am still unaware of any precedent) improves on the traditional algorithm, turning a quadratic-time operation into a logarithmic one.

Conclusion

Despite remaining puzzles and challenges, tan:collate() provides an efficient, robust, and accurate way to perform sorted N-way string comparison, and the quality of its output is consistently high. Its novel approach to string comparison, using staggered samples across all versions of the input, provides a tremendous benefit in performance. It has been used in production environments to support a wide range of practical applications. As part of the open-source TAN function library, tan:collate() enables developers to incorporate robust text comparison directly into XSLT applications.

References

[kalvesmaki] Kalvesmaki, Joel. “String Comparison in XSLT with tan:diff().” Presented at Balisage: The Markup Conference 2021, Washington, DC, August 2 - 6, 2021. In Proceedings of Balisage: The Markup Conference 2021. Balisage Series on Markup Technologies, vol. 26 (2021). https://doi.org/10.4242/BalisageVol26.Kalvesmaki01.



[1] Unfortunately, this example is artificial. The Venn diagrams above presume texts in stasis, and present a global view of the versions. But in the real world, many textual versions change as the text proceeds, sometimes approaching the whole, other times drifting away. Malleable commonality regularly features in historical texts, but will not be dealt with in this article.

[2] Diff output commonality measures for "ef", "eST", and "STef" (the largest block of differences in the example above): 1 vs 2: commonality 0.4; 1 vs 3: commonality 0.67; 2 vs 3: commonality 0.571. Thus, string 1 has a sum 1.07 commonality; string 2, 0.97; string 3, 1.23. Hence the new order, 3, 1, 2.

[3] Results of corruption levels 97-99% were omitted, as they tended to produce results that were significant outliers in quality and performance metrics.
