<?xml version="1.0" encoding="UTF-8"?><article xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0-subset Balisage-1.2" xml:id="HR-23632987-8973"><title>Text Retrieval for XML-Encoded Corpora: A Lexical Approach</title><info><confgroup><conftitle>Balisage: The Markup Conference 2008</conftitle><confdates>August 12 - 15, 2008</confdates></confgroup><abstract><para>This paper describes some modifications done to an open source
      text retrieval package to make it XML-aware, and contrasts this lexical
      approach, in which XML documents are primarily treated as sequences of
      characters rather than trees, with the W3C XPath 2.0 and XQuery 1.0
      Full-Text facility.</para><para>Specific usage scenarios are taken into consideration, including
      World Wide Web publication and the searching and analysis of text
      corpora for research purposes.</para></abstract><author><personname><firstname>Liam</firstname><othername>R. E.</othername><surname>Quin</surname></personname><personblurb><para>Mr Quin has been involved with declarative, descriptive markup
        since the early 1980s. He wrote his open-source text retrieval system
        and first distributed it in the late 1980s.</para><para>He has worked at the World Wide Web Consortium since 2001, where
        he is XML Activity Lead, or, informally, Mrs XML.</para></personblurb><affiliation><jobtitle>XML Activity lead</jobtitle><orgname>W3C</orgname></affiliation><email>liam@w3.org</email></author><legalnotice><para>Copyright © 2008 Liam R E Quin. Used by permission.</para></legalnotice><keywordset role="author"><keyword>XML</keyword><keyword>Full Text</keyword><keyword>Information Retrieval</keyword><keyword>Natural Language Processing</keyword><keyword>Computational Linguistics</keyword></keywordset></info><section><title>Introduction</title><para>The W3C XML Query Working Group has published a specification for
    performing full-text queries over instances of the XPath and XQuery Data
    Model using an extension of the XQuery syntax. This is a text retrieval
    facility that operates on an abstract representation of XML trees, rather
    than on text files that happen to contain markup. Elements and their
    attributes are reified into hierarchies of nodes, text leaps into the
    lacunæ and swims between them, and not a pointy bracket in sight.</para><para>This paper compares the XQuery Full Text Facility with a more
    traditional open source text retrieval system, lq-text, and also explores
    the work done to make lq-text become more suitable to the processing needs
    of people who work with XML.</para><para>Disadvantages and advantages of the two approaches are
    discussed.</para></section><section><title>A Brief Description of the Full Text Facility</title><para>Although this paper is primarily concerned with a lexical approach,
    an understanding of the XPath 2 and XQuery approach is useful, and will be
    taken as a baseline for comparison.</para><para>Informally, a full text search is a search to find all documents in
    a collection, or all elements of some specific type (for example)
    containing one or more specific words. For example, one might want to find
    all occurrences of the phrase “warm socks” in a multi-gigabyte corpus of
    text. The underlying assumption of full text is that the implementation
    uses an index that has been constructed separately in advance, although
    this is not necessarily true.</para><section><title>Primary characteristics</title><para>XQuery 1.0 and XPath 2.0 Full-Text 1.0 [<xref linkend="FullText-2007"/>] extends XPath 2.0 (and XQuery 1.0 in turn,
      which itself extends XPath 2.0) to add support for explicit syntax for
      full text searches.</para><para>XPath 2.0 is node-based, matching text nodes which are contained
      by element nodes in a collection of XML document trees. The result is a
      Boolean value (when used in an XPath predicate) together with an
      optional numerical score or ranking.</para><para>The Full-Text facility includes a large number of possible
      modifiers, many of which are optional features and may or may not be
      available in any given implementation. These include (for example) both
      query expansion through a thesaurus and also query narrowing using a
      different sort of thesaurus. One can search for two tokens (words, for
      English) within a certain number of tokens, sentences or even
      paragraphs. The optional features are marked as being “at risk” in W3C
      parlance, meaning that unimplemented (or unimplementable) features will
      be dropped from the draft specification before it is published as a W3C
      Recommendation.</para></section></section><section><title>A Brief Description of lq-text</title><para>Lq-text is an open source text retrieval package that was first
    released in 1989. It has had sporadic development since then. Its main
    claims to fame are high precision, good performance (particularly when the
    data does not fit into available virtual memory), flexible concordance
    generation and an open, extensible, multi-process architecture.</para><para>Lq-text operates on text files. It makes an index to the files; this
    index stores the location of each occurrence of each natural-language word
    in all of the files. The resulting index is stored efficiently, and
    generally takes between a quarter and three quarters of the storage size
    of the original documents. The index is an adjunct; lq-text also refers to
    the original files, although these can be compressed to save space if
    needed. The package is designed to work best with many small files rather
    than a few large ones.</para><para>When lq-text indexes files, it can run a format-specific filter on
    each file before indexing it. The list of filters is currently built in to
    the software (but since it is open source, you can in fact change it if
    you wish).</para><para>A suite of separate Unix programs operates on the index for
    retrieval; some of these will be described in this paper. They are used in
    conjunction with each other, using a documented text-based format to
    communicate.</para><para>It is this open architecture that can be exploited to enable
    XML-specific searches, and that is the primary work described in this
    paper.</para></section><section><title>Commonalities Between The Approaches</title><para>An underlying assumption is that some sort of indexing will have
    been performed before queries are run; this is of course true for all full-text
    systems, and although in some cases the constructed indexes do not persist
    between invocations of the query software, usually the indexes are kept
    and re-used.</para><para>Although the Full-Text facility operates on trees and lq-text
    operates on flat text files, in practice both systems are matching
    sequences of tokens against an index, and returning matches based on text
    content.</para><para>The XQuery Update Facility allows queries to update documents, and,
    as a result, implementations must be able to re-index documents
    efficiently. Lq-text can also re-index documents, most efficiently when
    both the original and the new version are available.</para></section><section><title>Lq-text and XML: Objectives</title><para>The author wanted to experiment to understand what work would be
    needed to make lq-text be useful for people working with XML documents.
    Some goals of this work included:</para><!--<itemizedlist spacing="compact">--><itemizedlist><listitem><para>Make minimal changes to the architecture and index and match
        format, because of limited programming resources;</para></listitem><listitem><para>Retain a small index and efficient retrieval;</para></listitem><listitem><para>Solve common use cases rather than providing extensive and
        general mechanisms.</para></listitem></itemizedlist><para>Although lq-text was not (at the start of the work) XML-aware, it
    has the ability to run a format-specific filter program when indexing any
    given document. There was already an SGML filter, but all it did was
    ensure that element and attribute names were not indexed. This filter was
    re-used for XML, modified to allow indexing of elements and attributes.
    But at that point the work had only begun.</para><para>The following use cases were determined sufficient for
    experiments:</para><itemizedlist><listitem><para>Identify all documents containing two or more phrases in the
        same element, for any given element;</para></listitem><listitem><para>Refine the search to an element with a specific attribute set to
        a given value;</para></listitem><listitem><para>Highlight the matches of the search in context;</para></listitem><listitem><para>For a given match, print the parent element and its content, or
        the contents of the parent tag, or a given attribute value, or the
        name of the parent element... possibly constrained to any named
        ancestor element not just the parent.</para></listitem></itemizedlist><para>This of course is much less than one might want in a full XML-aware
    text retrieval system. On the other hand, the XPath-based approach taken
    by the Full-Text facility does not support highlighting of matches or
    generation of concordances, and the author felt this to be essential
    functionality, both for research and for industrial or commercial
    use.</para><para>The approach taken was to extend <emphasis role="ital">lqkwic</emphasis>, the concordance program, so the paper will
    describe the lq-text architecture and then <emphasis role="ital">lqkwic</emphasis>, and then explain the extensions that were
    added. After that, an example program will be shown that uses <emphasis role="ital">lqkwic</emphasis> to solve one of the use cases given above.
    At that point we will be able to compare an XQuery or XSLT 2
    solution.</para><para>Support for a subset of XML (“just enough XML, Eh?”) was
    implemented; this subset will also be described, as it may be of interest
    for other people considering adding XML support to older software.</para></section><section><title>Lq-text Architecture in Detail</title><para>Before explaining how lq-text was extended, it is necessary to give
    at least an abbreviated account of how lq-text works.</para><para>Lq-text builds and maintains a separate index for each set of
    documents, which it calls a <emphasis role="ital">database</emphasis>.
    When building the index, lq-text applies simple <emphasis role="ital">stemming</emphasis>, by reducing words to a root. Currently,
    only plural and possessive forms are recognised and recorded, and other
    forms are indexed separately. This code is specific to the English
    language, and may be removed in a future version, with stemming instead
    being done by term expansion at query time.</para><para>Lq-text comprises a suite of separate programs, and each program
    always uses a single database. For the sake of simplicity in this paper we
    will assume that only a single lq-text database is in use at any time,
    unless otherwise stated.</para><para>Some of the programs included with lq-text are listed for reference
    in the table. Only a few of them will be discussed further in this paper,
    but the table may give the reader a clearer sense of the software.</para><table><caption><para>Lq-text Programs</para></caption><col align="right" valign="top" span="1"/><col valign="top" span="1"/><thead><tr valign="top"><th>Program</th><th>Purpose</th></tr></thead><tbody><tr valign="top"><td><emphasis role="ital">lqaddfile</emphasis></td><td>Used to add documents to the index, and to manipulate the
          index.</td></tr><tr valign="top"><td><emphasis role="ital">lqunindexfile</emphasis></td><td>removes a file from the index.</td></tr><tr valign="top"><td><emphasis role="ital">lqphrase</emphasis></td><td>matches one or more exact phrases</td></tr><tr valign="top"><td><emphasis role="ital">lqquery</emphasis></td><td>matches words or phrases, but supports wildcard expansion</td></tr><tr valign="top"><td><emphasis role="ital">lqrank</emphasis></td><td>reorders results based on the number of documents matched
          (quorum ranking)</td></tr><tr valign="top"><td><emphasis role="ital">lqsort</emphasis></td><td>sorts matches by various criteria e.g. by the word before the
          match</td></tr><tr valign="top"><td><emphasis role="ital">lqshow</emphasis></td><td>text-terminal (curses) program to show matched text</td></tr><tr valign="top"><td><emphasis role="ital">lqsed</emphasis></td><td>process documents, highlighting matches by insertion</td></tr><tr valign="top"><td><emphasis role="ital">lqkwic</emphasis></td><td>the main keyword in context concordance program</td></tr></tbody></table><para>Once an index is built (for example with lqaddfile), it can be used.
    A sample search might be as follows:</para><programlisting xml:space="preserve">$ lqquery "on his face" | lqkwic</programlisting><para>For one small corpus (Brewer's Dictionary of Phrase and Fable, with
    about 17,000 files) the results are as follows:</para><programlisting xml:space="preserve">==== Document 1: xml/1251.xml: Balafré ====
1:t which left a frightful scar <emphasis role="bold">on his face</emphasis> (1550–1588).  So Ludovic Lesly, an
==== Document 2: xml/3720.xml: Cloud ====
2: He [Antony] has a cloud <emphasis role="bold">on his face</emphasis>.
==== Document 3: xml/6070.xml F ====
3: F is written <emphasis role="bold">on his face</emphasis>. “Rogue” is written on his face
4: face. “Rogue” is written <emphasis role="bold">on his face</emphasis>. The letter F used to be branded n
==== Document 4: xml/8745.xml Ill Omens ====
5: he happened to trip and fall <emphasis role="bold">on his face</emphasis>. This would have been considered a
6: shore at Bulverhythe he fell <emphasis role="bold">on his face</emphasis>, and a great cry went forth that i</programlisting><para>Here, the matched text is shown with a few words of context on
    either side, giving rise to the term <emphasis role="ital">key word in
    context</emphasis>, KWIC, <emphasis role="ital">index</emphasis>.</para><para>Two lq-text programs, <emphasis role="ital">lqquery</emphasis> and
    <emphasis role="ital">lqkwic</emphasis>, were combined in the search,
    using a Unix pipe; that is, both programs were run concurrently, with the
    output of one being fed as the input to the other. This is a usual way of
    working with lq-text, and although it sometimes requires some thought, it
    does mean that lq-text exploits multi-processor systems well, and also
    works well with Unix and Linux, which were designed to run pipelines of
    small programs very efficiently.</para><para>This description begs the question, exactly what output is passed
    from <emphasis role="ital">lqquery</emphasis> to <emphasis role="ital">lqkwic</emphasis> in the example? The answer to this question
    exposes the underlying index architecture, and can be seen by running just
    the first program without the second:</para><programlisting xml:space="preserve">$ lqquery "on his face"
3 0 41 2792 1251.xml
3 0 55 11703 3720.xml
3 0 15 14314 6070.xml
3 0 21 14314 6070.xml
3 0 75 17285 8745.xml
3 1 8 17285 8745.xml</programlisting><para>The format, as can be determined by inspection, is a sequence of
    lines of text, and, in each line, a number of space-separated fields. Each
    line represents a single match, and just as there were six results before,
    there are six matches here. The fields are, from left to right, the number
    of words matched, the block in the file, the word in the block, the file
    number and (optionally) the filename.</para><para>The lq-text index does not store exact locations for matches.
    Instead, the location to the nearest block number, and the word within the
    block, are stored. Blocks are by default 128 bytes in size. The result of
    this is that a match location within a file is usually represented by a
    pair of fairly small integers, but that finding the actual intended words
    to highlight requires accessing the file and counting words. This is a
    trade-off: a lq-text index is often much smaller than the indexed files,
    because the average English word is about 5 characters long (depending
    somewhat on the corpus), and it only takes 2 bytes in most cases to store
    the information about a match.</para><para>Lines in the match list starting with a <code>#</code> are
    considered to be comments, and lines of the form
    <code>{ variable = value }</code> are used by
    <emphasis role="ital">lqkwic</emphasis> to set values that can be used
    later, as we shall see.</para><para>Lq-text programs generally both accept this match format as input
    and produce it as output, so that they can be combined. In particular, the
    <emphasis role="ital">lqkwic</emphasis> program can both read and produce
    this format, as we shall see in the next section.</para></section><section><title>The lq-text lqkwic program</title><para>The lqkwic program takes lq-text matches as input, and prints them
    using a user-supplied format, or a built-in format. Matches are grouped by
    file, and another format is used to print the start of each group of
    matches, and yet another can be supplied to be used at the end of each
    group.</para><para>The format takes the form of a string with embedded variables that
    are interpolated each time the format is used. An example may clarify the
    format:</para><programlisting xml:space="preserve">$ lqquery "on his fa*" |
    lqkwic -S '' -A '' -s '${MatchNumber} ${MatchedText}\n'
1 on his father
2 on his face
3 on his favourite
4 on his face
5 on his father
6 on his face
7 on his face
8 on his father
9 on his favourite
10 on his face
11 on his face</programlisting><para>Here, the formats for the start and end of each group of matches
    have been set to the empty string with <code>-S ''</code> and <code>-A
    ''</code> respectively. The per-match format is set to a string in which
    for each match the match number is printed, followed by a space, followed
    by the matched text and (indicated by <code>\n</code> in the grand Unix
    tradition) a newline. The single quotes are used to surround the strings
    to prevent the Unix shell from seeing the dollar signs and treating them
    as references to shell variables.</para><para>Although the MatchedText variable is obviously useful for testing,
    one would normally use it in conjunction with other variables, such as
    TextBefore and TextAfter. The purpose of this section is not to document
    lqkwic, but to give the reader an understanding of the sorts of things one
    can print, since lqkwic has uses that are far removed from concordance
    generation, and since we will shortly be taking advantage of such
    uses.</para><para>The following table shows some of the variables available. In many
    cases, lqkwic must read the actual matched documents (or at least part of
    them), in order to evaluate the variables.</para><table><caption><para>The lqkwic formatting variables</para></caption><col align="right" valign="top" span="1"/><col valign="top" span="1"/><thead><tr valign="top"><th>Variable</th><th>Description</th></tr></thead><tbody><tr valign="top"><td>DocName</td><td>the name of the current document, as stored in the database</td></tr><tr valign="top"><td>FileName</td><td>the absolute path corresponding to ${DocName}</td></tr><tr valign="top"><td>DocTitle</td><td>the title of the document</td></tr><tr valign="top"><td>FID</td><td>the File Identifier Number of the document (an integer)</td></tr><tr valign="top"><td>FileNumber</td><td>starts at 1, increases for each new document in the output</td></tr><tr><td/><td><!--* the DTD appears to lack tgroup *--></td></tr><tr valign="top"><td>BlockInFile, WordInBlock</td><td>these determine the location of the match</td></tr><tr valign="top"><td>NumberOfWordsInPhrase</td><td>the length in words of the phrase matched</td></tr><tr valign="top"><td>TextBefore</td><td>the text in the document immediately before the match</td></tr><tr valign="top"><td>MatchedText</td><td>the document text that exactly matches the phrase</td></tr><tr valign="top"><td>TextAfter</td><td>the text in the document immediately after the match</td></tr><tr><td/><td><!--* the DTD appears to lack tgroup *--></td></tr><tr valign="top"><td>MatchNumber</td><td>starts at 1 and increases for each match</td></tr><tr valign="top"><td>MatchWithinFile</td><td>like MatchNumber but reset for each new document</td></tr><tr valign="top"><td>StartByte</td><td>the byte offset in the file at which the match begins</td></tr><tr valign="top"><td>EndByte</td><td>the byte offset in the file at which the match ends</td></tr><tr valign="top"><td>MatchLength</td><td>length in bytes of ${MatchedText} (EndByte - StartByte)</td></tr></tbody></table><para>There are also constructs for formatting variables, for padding them
    to a given width (measured in Unicode characters, not bytes), and for
    filtering them through routines that delete punctuation, convert
    punctuation to spaces, perform case conversion and so forth.</para></section><section><title>Extending lqkwic</title><para>The following XML-specific variables were added as an experiment to
    try to understand how viable the approach would be:</para><table><caption><para>XML-specific Variables</para></caption><col align="right" valign="top" span="1"/><col valign="top" span="1"/><thead><tr valign="top"><th>Variable</th><th>Description</th></tr></thead><tbody><tr valign="top"><td>XML.Parent.Tag</td><td>The content of the containing element's tag, between the
	  angle brackets</td></tr><tr valign="top"><td>XML.ContentBefore</td><td>Content up to the &gt; of the start tag of the immediately enclosing parent element (including any tags and content that open and close entirely between the match and the parent tag)</td></tr><tr valign="top"><td>XML.Parent.Name</td><td>the name of the parent element</td></tr><tr valign="top"><td>XML.Parent.EndTag</td><td>the content of the parent element's end tag</td></tr><tr valign="top"><td>XML.ContentAfter</td><td>content up to the &lt; of the parent's end tag</td></tr></tbody></table><para>It is not clear that this is sufficient to answer our use case of
    finding multiple phrases in the same XML element. To do that, we would
    need a way to identify parent elements and compare them.</para><para>One could use the File number and the byte offset of the matched
    text (<code>${StartByte}</code>), but this is not sufficient, because
    there may be close and open tags between matches of two phrases.</para><para>One approach to finding phrases with a common containing element
    named (of type) E would be to find all of the start and end tags for E,
    and then use the file, block and word within block numbers to perform
    range algebra.</para><para>But it would be more efficient if this were not needed. In a corpus
    of many files, it is likely that the element E will occur in many files,
    perhaps many times, and searching for them all will be too slow.</para><para>If lqkwic could print the location of the parent tag, a much simpler,
    faster algorithm would be possible.</para><para>The notation <code>-&gt;startbyte</code> or
    <code>-&gt;endbyte</code> was added; after any XML variable name, it
    generates the corresponding byte offset in the matched file.</para><para>In addition, the notation <code>XML.parent.Tag.e</code> was added,
    to be similar to the XPath notation <code>ancestor::e</code>; it is
    possible that a future version of <emphasis role="ital">lqkwic</emphasis>
    will use the XPath notation, as long as there is no danger that users will
    be confused into thinking that lq-text is using a node-based model
    internally.</para><para>The search for a parent tag is implemented by reading the matched
    document at the block containing the match, and for some distance
    beforehand. lqkwic then searches backwards from the match to find an open
    tag which has no corresponding close tag in the intervening distance. It
    is worth noting that this sort of approach is not generally possible with
    SGML, where empty elements have no end tag. The syntactic innovation of
    XML was to require empty tags to have a trailing slash, as in &lt;p/&gt;
    or &lt;p id="p301" /&gt;, and this enables the software to skip empty
    elements reliably. Start and end tags can be skipped more easily of
    course, although the algorithm used for backwards parsing does rely on
    attributes not containing unquoted &lt; or &gt; signs.</para><para>Unfortunately, backwards parsing suffers from a major drawback: the
    search for the parent tag will fail if it is too far away. Although lqkwic
    could in theory read arbitrarily far back in the file, this could mean that
    presenting matches in a dictionary would be very expensive, with every
    match processed necessitating a search back to the start of a large
    document.</para><para>In practice, an in-memory cache may be sufficient to achieve
    reasonable performance in most cases. Another possibility might be to
    store parent pointers in the index. For now, lq-text is primarily intended
    for working with many thousands of small files; use XSLT to split large
    files before indexing them.</para></section><section><title>A sample program</title><para>We are now in a position to find all elements E that contain all of
    a set of phrases P0 ... Pn, as follows:</para><para>First, match the phrases, and, for each match, use a format of the
    form <code>${xml.contentbefore.E-&gt;endbyte}</code> to find the end byte
    of the start tag of the parent element of type E; that is, the location
    just after the <code>&gt;</code> at the end of the start tag. If two
    matches have the same value for the start tag, and are in the same file,
    then they share the same XML ancestor E.</para><para>We can match the phrases with a single invocation of <emphasis role="ital">lqrank</emphasis> except for one difficulty: there is no way
    to determine, for a given match, to which phrase it corresponds, so we
    cannot determine whether an element contains all of the phrases.</para><para>The <emphasis role="ital">lqrank</emphasis> program has the ability
    (when instructed with the <code>-g</code> option) to output a line,
    <code>{ q = N }</code> where N is an integer, to
    identify to which result set the following matches correspond. This is
    available to <emphasis role="ital">lqkwic</emphasis> formats as the
    variable <code>g.q</code> (the <code>g</code> stands for <emphasis role="ital">glue</emphasis>, the unpublished and unfinished lq-text
    integration language).</para><para>Using this, it becomes a relatively simple matter in a language such
    as Perl, Python or even the Unix shell, to run</para><programlisting xml:space="preserve"><emphasis role="ital">print phrases one per line</emphasis> |
    lqrank -r all -g -F - |
    lqkwic -s '${FID} ${g.q} ${xml.contentbefore.E-&gt;endbyte} ${Match}\n'</programlisting><para>Each match is in this way prefixed by the numeric identifier of the
    document in the index (FID), the phrase number and the byte offset of the
    end of the nearest ancestor E element's start tag. The
    <code>-F -</code> option makes <emphasis role="ital">lqrank</emphasis> read the list of phrases to match from its
    input, rather than expecting them as command-line arguments; one could
    also use the Unix <emphasis role="ital">xargs</emphasis> program for this
    purpose.</para><para>Next we must group the matches by file identifier and startbyte, and
    if every different phrase occurred at least once, we print all the matches
    for that file identifier and startbyte.</para><para>The result can then be fed to <emphasis role="ital">lqkwic</emphasis> to generate a concordance, or perhaps to
    fetch information about the parent element, or both.</para><para>The program outlined here (and given in full in the appendix, in the
    Perl programming language) is intended as an example of the sort of
    flexibility that might be achieved as lq-text becomes more XML
    aware.</para></section><section><title>Unicode</title><para>In 1988, the use of 8-bit character sets was pretty usual; lq-text
    is at least 8-bit clean for data and has some locale awareness, so that
    conversion to UTF-8 seemed a simple matter. There were two tricky
    parts to the process of adding UTF-8 support. The first was to ensure that
    characters, rather than bytes, were counted when formatting, and of course
    that a UTF-8 octet sequence was never split part-way through.</para><para>The second difficulty was much harder: making sure that combining
    characters are never split from their corresponding base character. This
    last is not yet complete, but initial work using the GNOME GLib library
    is promising. This is the main issue preventing lq-text from being
    shipped, at present, and may have been completed by the time this paper is
    presented in August 2008.</para><para>Software cannot tell by inspecting a single byte (or octet, as
    standards people say, in case 9-bit systems should reoccur) whether that
    octet forms part of a longer UTF-8 sequence. One needs to scan backwards
    to check, because the <emphasis role="ital">first</emphasis> octet is the
    one that indicates the number of octets to follow in the sequence that
    constitutes a single character. This is of course easy to deal with as
    long as one can scan backwards a little. For diacritical marks and other
    combining characters, however, one must consult a database. The author
    could not help but wish that a single bit in the character representation
    could have been reserved for this purpose, but that would have prevented
    Unicode from being backwards-compatible with ISO 8859-1, a goal at the
    time Unicode was designed. A future version of lq-text may use its own
    database, with only the character properties that lq-text needs, perhaps
    created automatically at the same time as each database so as to take
    locale information into account.</para></section><section><title>Comparing with XQuery 1.0 or XSLT 2 + Full Text</title><para>The published draft of Full-Text does not support concordance
    generation, although some implementations in practice (such as MarkLogic)
    do appear to offer the necessary functionality through product-specific
    extensions. The author of this paper considers match highlighting to be
    essential functionality in practice. A future version of Full-Text may
    well include it.</para><para>Let us then assume, as we must, that we are using an XQuery or XSLT
    implementation that supports in some way identifying match locations, and
    hence allows highlighting.</para><section><title>Advantages of Full-Text</title><orderedlist><listitem><para>With Full-Text, XPath predicates and axes are available, so
          that one can easily find ancestors, parents, position in the element
          tree, and so forth. The lexical approach is very limited in this
          regard.</para></listitem><listitem><para>Full-Text is (or probably will soon be) a standard, and one
          can easily move between implementations. The necessity of using
          vendor extensions for highlighting reduces this somewhat, but of
          course there is only one implementation of lq-text, albeit with
          source code freely available.</para></listitem><listitem><para>An XPath implementation with Full-Text might have indexes for
          element location that enable higher performance, for example by
          using one CPU to find elements and another to resolve the text
          search. Although this sort of optimisation is largely at the
          research level today, it is likely to find its way into products,
          both closed and open source, in the near future. Lq-text uses
          multiple programs, which can run on separate CPUs of course (and
          will do so without any action from the user on a multi-CPU system)
          but there are no plans for finer-grained parallelism.</para></listitem><listitem><para>The Full-Text facility is designed to work with Unicode and
          XML-based language support, giving a high degree of
          internationalisation. Although the author is adding Unicode support
          to lq-text (which previously, because it predated Unicode, used
          8-bit character sets and a locale-based mechanism), it is not yet
          complete and pervasive.</para></listitem><listitem><para>Since lq-text is not tree-based, it does not currently have
          any means to respect xml:lang, nor does it have any understanding of
          namespaces. Prefixed elements and attribute names are not currently
          handled. A solution involving the XML indexing filter is being
          considered for both of these issues, but its effectiveness is as yet
          unknown.</para></listitem><listitem><para>The Full-Text XPath extension is already in wider use than
          lq-text; training, support, books and forums are available for it,
          but not for lq-text.</para></listitem></orderedlist></section><section><title>Advantages of a lexical approach</title><orderedlist><listitem><para>Open access to the match list supports flexibility and
          extensibility. The use of separate programs also allows intermediate
          results to be cached or stored and compared easily. By contrast,
          XQuery (where Full-Text is most likely to be found) is a large
          monolithic language. Open Source XQuery implementations are mostly
          in Java, which does not lend itself to good performance if a JVM
          must be started for each query, for example outside a servlet
          environment. None the less, it should be mentioned that the fastest
          readily available indexed XQuery implementation in the author's
          experience is written in Java and, once the JVM is started, is very
          fast.</para></listitem><listitem><para>Because the data is not forced into the shape of a tree, it is
          possible to experiment, for example with overlapping markup. The
          generation of results by <emphasis role="ital">lqkwic</emphasis> can
          include a span from start element to corresponding end element,
          regardless of other start or end tags. Although XQuery and XPath 2.0
          Full-Text allows matching as if tags were absent, it does not give
          good control over which tags are to be treated as word boundaries
          and which are not. This is difficult to do at query time in any
          case, and neither system today has a complete answer for
          it.</para></listitem><listitem><para>Lq-text can be used to generate non-XML results, for example a
          bitmap image representing a graph of word occurrence. XQuery and
          XSLT are limited to text and XML, although one can certainly write
          out SVG with them.</para><figure><title>Occurrences of four-digit numbers</title><mediaobject><imageobject><imagedata fileref="../../../vol1/graphics/Quin01/Quin01-001.png" format="png"/></imageobject></mediaobject><para>A graph showing four-digit numbers along the x-axis, from
            1500 to 1890 (most of which presumably represent years), and on
            the y-axis the number of times that number occurs in the corpus
            (17,000 entries from <emphasis role="ital">Brewer's Dictionary of
            Phrase and Fable</emphasis>).</para></figure></listitem><listitem><para>Experiments with Salton-style similarity functions,
          clustering, and other Information Retrieval techniques might
          eventually find their way back from work like this and into a future
          Full-Text specification. See, for example, <xref linkend="Salton-1989"/> or <xref linkend="Konchady2006"/> for
          descriptions of some applicable information retrieval
          techniques.</para></listitem><listitem><para>Lq-text lies more in the world of traditional Unix text
          processing than in the world of relational databases. If one is
          primarily interested in finding content in a database, Full-Text is
          a clear winner. If one is more interested in exploring or searching
          text, perhaps lq-text has something to offer.</para></listitem></orderedlist></section><section><title>JEXE: Just Enough XML, Eh?</title><para>It was harder to see how to support some XML features than others.
      The author does not intend to support all of XML at this time, only
      enough to be useful, regardless of how the XML is parsed. The
      following features are not supported, and are unlikely to be
      supported:</para><orderedlist><listitem><para>CDATA sections; you can use entities instead. This is because
          the retrieval software does not scan the document from the start
          each time, but from the middle, and cannot determine whether markup
          is part of a marked section.</para></listitem><listitem><para>External general entities (and XInclude); it is more useful
          for people working with XML as files to know the file than the
          document; if you want to resolve included entities, use a
          pre-processor such as <emphasis role="ital">xmllint</emphasis>
          before indexing.</para></listitem><listitem><para>Arbitrary namespace support; the limit on namespaces is that
          all <emphasis role="ital">xmlns</emphasis> declarations must come
          before any regular attributes. In other words, the order of
          attributes (or pseudo-attributes) is significant. This may change in
          the future; it is because of a limitation in the indexer to do with
          the amount of available look-ahead.</para></listitem><listitem><para>General entities; although support is planned for entities,
          the plan is to read the replacement text from a per-database
          configuration file. This is already done by lqkwic, but should also
          be done by the indexer. This means that per-document entities are
          not supported. External entities are not supported: the unit of
          retrieval is the file, not the document.</para></listitem><listitem><para>The internal subset; currently lq-text can skip over an
          internal subset correctly in most cases (the author suspects that it
          is possible to construct an internal subset that will confuse it,
          although this is always detected and a warning issued), but it is
          not parsed.</para></listitem><listitem><para>Fixed and defaulted attribute values; without reading a DTD or
          internal subset, there are no default values. This could be thought
          of as a minimization feature of SGML that was overlooked during the
          design of XML.</para></listitem><listitem><para>XML Notations; if the DTD were to be read, it might be
          possible to associate a URI or a MIME content type with a different
          tokenisation system, but the document author cannot know what MIME
          type will be used if a file is served on the Web; the DTD is not
          authoritative, and currently lq-text does not use HTTP to fetch
          things, but only works with local files. If lq-text used HTTP,
          behaviour would be based on the content type header for downloaded
          entities, not on any notation value in the DTD. For a local file,
          the notation value could be treated as a list of plausible content
          types, perhaps, but in practice content sniffing is more likely to
          work.</para></listitem></orderedlist><para>The result of this is that a JEXE document consists of an XML
      declaration (the encoding, if given, must be UTF-8), an
      optional doctype declaration to point at an external DTD to be ignored,
      and then one (or more) simple element trees. Elements may have
      attributes, and may also declare namespaces. Namespace prefixes may be
      “normalised” based on a per-database configuration file, with elements
      in a default namespace that is associated with a URI given an explicit
      prefix [This feature is not implemented at the time of writing]. Numeric
      character references are expanded on indexing. Entity references are
      replaced by their per-database string values on retrieval; the plan is
      to index entity references both with their entity name and with their
      expanded value.</para><para>The resulting XML can be parsed “from the middle out” for the
      purposes of retrieval.</para><para>Although there is only support for a subset of XML, enough of the
      syntax is understood that you can index any XML document. However, some
      features, such as CDATA sections, permit the construction of documents
      that will confuse retrieval, even though the actual CDATA sections will
      be correctly parsed. A possible work-around is to process documents with
      XSLT before indexing them, creating surrogate documents.</para></section><section><title>Future Work</title><para>The work has shown that adding some simple XML support to lq-text
      is possible, but leaves a lot to be desired. For people already using
      lq-text, the support described in this paper is useful, but it is
      unlikely to persuade many people to try the package for the first
      time.</para><para>Adding more support for “just enough XML” will make the package
      more interesting. In the short term, extending the Unicode support is
      necessary before a release, as is more thorough testing and (as always)
      more documentation. After that, changes in the indexer to add support
      for (just enough) namespaces, general text entities and numeric
      character references have been sketched out.</para><para>There are no plans to use a full XML parser right now. The
      author had originally intended to do so, but the difficulty in tracking
      exact byte positions in the input delayed the work; although this is
      now possible, proceeding has become a matter of human
      resources.</para><para>It is possible that the work here would be enough to enable
      lq-text to be used by an implementor of XQuery, and the author would
      like to do experiments in that area.</para><para>Searching a corpus of documents with disparate markup can be
      difficult with either approach, because one tends to write patterns that
      depend on the markup retrieved. One approach is to try to map queries at
      runtime; this can be a difficult problem of matching incompatible
      hierarchies of elements; see <xref linkend="Euzenat-2007"/> on various
      approaches to the problem of matching ontologies. A more pragmatic
      approach is to re-write documents before indexing them, perhaps with
      XSLT. This technique works with both approaches to text retrieval, but
      can be tedious. An intermediate approach might be to define some XPath
      expressions, or to use a W3C XML Schema to impose some specific types,
      to identify sections, titles, paragraphs, and to mark which elements are
      considered to break apart words, phrases and paragraphs. The index could
      then include this information alongside the element structure. More
      experiments in this area are planned.</para></section><section><title>Conclusions</title><para>The author's original goal in adding XML support to lq-text was to
      use lq-text to help optimise an XQuery implementation. After
      experimenting with an XQuery implementation that supported Full-Text,
      the author decided instead to focus on enhancing lq-text to see if the
      results would be useful. It turns out that they are indeed useful, and
      development is continuing.</para><para>It must be admitted, however, that any advantage of lq-text over
      sophisticated XQuery implementations is likely to diminish over
      time.</para><para>The subset of XML that is supported or planned, “just
      enough XML, Eh?” (JEXE), may be worth documenting separately.</para></section></section><bibliography><title>Bibliography</title><bibliomixed xreflabel="Adolphs, 2006" xml:id="Adolphs-2006">Adolphs,
    Svenja, “Introducing Electronic Text Analysis” (Routledge, 2006). A very
    clear and impressively slender introduction to the application of
    information retrieval, and especially the keyword-in-context list, to
    literary and linguistic research.</bibliomixed><bibliomixed xreflabel="Baeza-Yates and Marais, 1999" xml:id="Baeza-Yates-1999">Baeza-Yates, Ricardo, and Marais, H., “Modern
    Information Retrieval” (ACM Press, 1999). Describes information retrieval
    mostly from the perspective of a researcher in text retrieval rather than
    a programmer or a user, and assumes more background knowledge,
    particularly in mathematics, than Manu Konchady’s book, so may be best
    read second.</bibliomixed><bibliomixed xreflabel="Euzenat and Shvaiko, 2007" xml:id="Euzenat-2007">Euzenat, Jérôme and Shvaiko, Pavel, “Ontology
    Matching” (Springer, 2007); a surprisingly clear introduction to problems
    such as relating two or more different classification schemes (such as XML
    schemas) over the same subject matter, although the presentation uses a
    mathematical notation, and some background in formal logic may be
    helpful.</bibliomixed><bibliomixed xreflabel="Konchady2006" xml:id="Konchady2006">Konchady,
    Manu, “Text Mining Application Programming” (Charles River Media, Boston
    USA, 2006). A useful programmer-level introduction to topics relating to
    implementing and using text retrieval, part-of-speech tagging, clustering
    and other topics, together with just enough mathematics, but not specific
    to any particular language. Includes CD-ROM with code samples in Perl,
    however.</bibliomixed><bibliomixed xreflabel="Salton, 1989" xml:id="Salton-1989">Salton, Gerald,
    “Automatic Text Processing” (Addison-Wesley, 1989). Perhaps a little
    dated, but the late Dr. Salton was extremely influential in the field. His
    earlier, 1983, book formed the basis for a single chapter of this work,
    but the 1983 book is harder to find today.</bibliomixed><bibliomixed xreflabel="W3C Full-Text, 2007" xml:id="FullText-2007">Sihem
    Amer-Yahia, Chavdar Botev, Stephen Buxton, Pat Case, Jochen Doerre, Mary
    Holstege, Jim Melton, Michael Rys and Jayavel Shanmugasundaram (Editors),
    “XQuery 1.0 and XPath 2.0 Full-Text 1.0” [online]. [cited 18th April
    2008].
    <link xlink:type="simple" xlink:show="new" xlink:actuate="onRequest">http://www.w3.org/TR/xpath-full-text-10/</link>.</bibliomixed></bibliography></article>
