Balisage logo

Proceedings

Text Retrieval for XML-Encoded Corpora: A Lexical Approach

Liam R. E. Quin

XML Activity lead

W3C

Balisage: The Markup Conference 2008
August 12 - 15, 2008

Copyright © 2008 Liam R E Quin. Used by permission.

How to cite this paper

Quin, Liam R. E. “Text Retrieval for XML-Encoded Corpora: A Lexical Approach.” Presented at Balisage: The Markup Conference 2008, Montréal, Canada, August 12 - 15, 2008. In Proceedings of Balisage: The Markup Conference 2008. Balisage Series on Markup Technologies, vol. 1 (2008). DOI: 10.4242/BalisageVol1.Quin01.

Abstract

This paper describes some modifications done to an open source text retrieval package to make it XML-aware, and contrasts this lexical approach, in which XML documents are primarily treated as sequences of characters rather than trees, with the W3C XPath 1.0 and XQuery 2.0 Full-Text facility.

Specific usage scenarios are taken into consideration, including World Wide Web publication and the searching and analysis of text corpora for research purposes.

Table of Contents

Introduction
A Brief Description of the Full Text Facility
Primary characteristics
A Brief Description of lq-text
Commonalities Between The Approaches
Lq-text and XML: Objectives
Lq-text Architecture in Detail
The lq-text lqkwic program
Extending lqkwic
A sample program
Unicode
Comparing with XQuery 1.0 or XSLT 2 + Full Text
Advantages of Full-Text
Advantages of a lexical approach
JEXE: Just Enough XML, Eh?
Future Work
Conclusions

Introduction

The W3C XML Query Working Group has published a specification for performing full-text queries over instances of the XPath and XQuery Data Model using an extension of the XQuery syntax. This is a text retrieval facility that operates on an abstract representation of XML trees, rather than on text files that happen to contain markup. Elements and their attributes are reified into hierarchies of nodes, text leaps into the lacunæ and swims between them, and not a pointy bracket in sight.

This paper compares the XQuery Full Text Facility with a more traditional open source text retrieval system, lq-text, and also explores the work done to make lq-text become more suitable to the processing needs of people who work with XML.

Disadvantage and advantages of the two approaches are discussed.

A Brief Description of the Full Text Facility

Although this paper is primarily concerned with a lexical approach, an understanding of the XPath 2 and XQuery approach is useful, and will be taken as a baseline for comparison.

Informally, a full text search is a search to find all documents in a collection, or all elements of some specific type (for example) containing one or more specific words. For example, one might want to find all occurrences of the phrase “warm socks” in a multi-gigabyte corpus of text. The underlying assumption of full text is that the implementation uses an index that has been constructed separately in advance, although this is not necessarily true.

Primary characteristics

XQuery 1.0 and XPath 2.0 Full-Text 1.0 [W3C Full-Text, 2007] extends XPath 2.0 (and XQuery 1.0 in turn, which itself extends XPath 2.0) to add support for explicit syntax for full text searches.

XPath 2.0 is node-based, matching text nodes which are contained by element nodes in a collection of XML document trees. The result is a Boolean value (when used in an XPath predicate) together with an optional numerical score or ranking.

The Full-Text facility includes a large number of possible modifiers, many of which are optional features and may or may or be available in any given implementation. These include (for example) both query expansion through a thesaurus and also query narrowing using a different sort of thesaurus. One can search for two tokens (words, for English) within a certain number of tokens, sentences or even paragraphs. The optional features are marked as being “at risk” in W3C parlance, meaning that unimplemented (or unimplementable) features will be dropped from the draft specification before it is published as a W3C Recommendation.

A Brief Description of lq-text

Lq-text is an open source text retrieval package that was first released in 1989. It has had sporadic development since then. Its main claims to fame are high precision, good performance (particularly when the data does not fit into available virtual memory), flexible concordance generation and an open, extensible, multi-process architecture.

Lq-text operates on text files. It makes an index to the files; this index stores the location of each occurrence of each natural-language word in all of the files. The resulting index is stored efficiently, and generally takes between a quarter and three quarters of the storage size of the original documents. The index is an adjunct; lq-text also refers to the original files, although these can be compressed to save space if needed. The package is designed to work best with many small files rather than a few large ones.

When lq-text indexes files, it can run a format-specific filter on each file before indexing it. The list of filters is currently built in to the software (but since it is open source, you can in fact change it if you wish).

A suite of separate Unix programs operate on the index for retrieval; some of these will be described in this paper. They are used in conjunction with each other, using a documented text-based format to communicate.

It is this open architecture that can be exploited to enable XML-specific searches, and that is the primary work described in this paper.

Commonalities Between The Approaches

An underlying assumption is that some sort of indexing will have been performed before queries are run; this is of course for all full-text systems, and although in some cases the constructed indexes do not persist between invocations of the query software, usually the indexes are kept and re-used.

Although the Full-Text facility operates on trees and lq-text operates on flat text files, in practice both systems are matching sequence of tokens against an index, and returning matches based on text content.

The XQuery Update Facility allows queries to update documents, and, as a result, implementations must be able to re-index documents efficiently. Lq-text can also re-index documents, most efficiently when both the original and the new version are available.

Lq-text and XML: Objectives

The author wanted to experiment to understand what work would be needed to make lq-text be useful for people working with XML documents. Some goals of this work included:

  • Make minimal changes to the architecture and index and match format, because of limited programming resources;

  • Retain a small index and efficient retrieval;

  • Solve common use cases rather than providing extensive and general mechanisms.

Although lq-text was not (at the start of the work) XML-aware, it has the ability to run a format-specific filter program when indexing any given document. There was already an SGML filter, but all it did was ensure that element and attribute names were not indexed. This filter was re-used for XML, modified to allow indexing of elements and attributes. But at that point the work had only begun.

The following use cases were determined sufficient for experiments:

  • Identify all documents containing two or more phrases in the same element, for any given element;

  • Refine the search to an element with a specific attribute set to a given value;

  • Highlight the matches of the search in context;

  • For a given match, print the parent element and its content, or the contents of the parent tag, or a given attribute value, or the name of the parent element... possibly constrained to any named ancestor element not just the parent.

This of course is much less than one might want in a full XML-aware text retrieval system. On the other hand, the XPath-based approach taken by the Full-Text facility does not support highlighting of matches or generation of concordances, and the author felt this to be essential functionality, both for research and for industrial or commercial use.

The approach taken was to extend lqkwic, the concordance program, so the paper will describe the lq-text architecture and then lqkwic, and then explain the extensions that were added. After that, an example program will be shown that uses lqkwic to solve one of the use cases given above. At that point we will be able to compare an XQuery or XSLT 2 solution.

Support for a subset of XML (“just enough XML, Eh?”) was implemented; this subset will also be described, as it may be of interest for other people considering adding XML support to older software.

Lq-text Architecture in Detail

Before explaining how lq-text was extended, it is necessary to give at least an abbreviated account of how lq-text works.

Lq-text builds and maintains a separate index for each set of documents, which it calls a database. When building the index, lq-text applies simple stemming, by reducing words to a root. Currently, only plural and possessive forms are recognised and recorded, and other forms are indexed separately. This code is specific to the English language, and may be removed in a future version, with stemming instead being done by term expansion at query time.

Lq-text comprises a suite of separate programs, and each program always uses a single database. For the sake of simplicity in this paper we will assume that only a single lq-text database is in use at any time, unless otherwise stated.

Some of the programs included with lq-text are listed for reference in the table. Only a few of them will be discussed further in this paper, but the table may give the reader a clearer sense of the software.

Table I

Lq-text Programs

ProgramPurpose
lqaddfileUsed to add documents to the index, and to manipulate the index.
lqunindexfileremoves a file from the index.
lqphrasematches one or more exact phrases
lqquerymatches words or phrases, but supports wildcard expansion
lqrankreorders results based on the number of documents matched (quorum ranking)
lqsortsorts matches by various criteria e.g. by the word before the match
lqshowtext-terminal (curses) program to show matched text
lqsedprocess documents, highlighting matches by insertion
lqkwicthe main keyword in context concordance program

Once an index is built (for example with lqaddfile), it can be used. A sample search might be as follows:

$ lqquery "on his face" | lqkwic

For one small corpus (Brewer's Dictionary of Phrase and Fable, with about 17,000 files) the results are as follows:

==== Document 1: xml/1251.xml: Balafré ====
1:t which left a frightful scar on his face (1550–1588).  So Ludovic Lesly, an
==== Document 2: xml/3720.xml: Cloud ====
2: He [Antony] has a cloud on his face.
==== Document 3: xml/6070.xml F ====
3: F is written on his face. “Rogue” is written on his face
4: face. “Rogue” is written on his face. The letter F used to be branded n
==== Document 4: xml/8745.xml Ill Omens ====
5: he happened to trip and fall on his face. This would have been considered a
6: shore at Bulverhythe he fell on his face, and a great cry went forth that i

Here, the matched text is shown with a few words of context on either side, giving rise to the term key word in context, KWIC, index.

Two lq-text programs, lqquery and lqkwic, were combined in the search, using a Unix pipe; that is, both programs were run concurrently, with the output of one being fed as the input to the other. This is a usual way of working with lq-text, and although it sometimes requires some thought, it does mean that lq-text exploits multi-processor systems well, and also works well with Unix and Linux, which were designed to run pipelines of small programs very efficiently.

This description begs the question, exactly what output is passed from lqquery to lqkwic in the example? The answer to this question exposes the underlying index architecture, and can be seen by running just the first program without the second:

$ lqquery "on his face"
3 0 41 2792 1251.xml
3 0 55 11703 3720.xml
3 0 15 14314 6070.xml
3 0 21 14314 6070.xml
3 0 75 17285 8745.xml
3 1 8 17285 8745.xml

The format, as can be determined by inspection, is a sequence of lines of text, and, in each line, a number of space-separated fields. Each line represents a single match, and just as there were six results before, there are six matches here. The fields are, from left to right, the number of words matched, the block in the file, the word in the block, the file number and (optionally) the filename.

The lq-text index does not store exact locations for matches. Instead, the location to the nearest block number, and the word within the block, are stored. Blocks are by default 128 bytes in size. The result of this is that a match location within a file is usually represented by a pair of fairly small integers, but that finding the actual intended words to highlight requires accessing the file and counting words. This is a trade-off: a lq-text index is often much smaller than the indexed files, because the average English word is about 5 characters long (depending somewhat on the corpus), and it only takes 2 bytes in most cases to store the information about a match.

Lines in the match list starting with a # are considered to be comments, and lines of the form { variable = value } are used by lqkwic to set values that can be used later, as we shall see.

Lq-text programs generally both accept this match format as input and produce it as output, so that they can be combined. In particular, the lqkwic program can both read and produce this format, as we shall see in the next section.

The lq-text lqkwic program

The lqkwic program takes lq-text matches as input, and prints them using a user-supplied format, or a built-in format. Matches are grouped by file, and another format is used to print the start of each group of documents, and yet another can be supplied to be used at the end of each group.

The format takes the form of a string with embedded variables that are interpolated each time the format is used. An example may clarify the format:

$ lqquery "on his fa*" |
    lqkwic -S '' -A '' -s '${MatchNumber} ${MatchedText}\n'
1 on his father
2 on his face
3 on his favourite
4 on his face
5 on his father
6 on his face
7 on his face
8 on his father
9 on his favourite
10 on his face
11 on his face

Here, the formats for the start and end of each group of matches have been set to the empty string with -S '' and -A '' respectively. The per-match format is set to a string in which for each match the match number is printed, followed by a space, followed by the matched text and (indicated by \n in the grand Unix tradition) a newline. The single quotes are used to surround the strings to prevent the Unix shell from seeing the dollar signs and treating them as references to shell variables.

Although the MatchedText variable is obviously useful for testing, one would normally use it in conjunction with other variables, such as TextBefore and TextAfter. The purpose of this section is not to document lqkwic, but to give the reader an understanding of the sorts of things one can print, since lqkwic has uses that are far removed from concordance generation, and since we will shortly be taking advantage of such uses.

The following table shows some of the variables available. In many cases, lqkwic must read the actual matched documents (or at least part of them), in order to evaluate the variables.

Table II

The lqkwic formatting variables

VariableDescription
DocNamethe name of the current document, as stored in the database
FileNamethe absolute path corresponding to ${DocName}
DocTitlethe title of the document
FIDthe File Identifier Number of the document (an integer)
FileNumberstarts at 1, increases for each new document in the output
BlockInFile, WordInBlockthese determine the location of the match
NumberOfWordsInPhrasethe length in words of the phrase matched
TextBeforethe text in the document immediately before the match
MatchedTextthe document text that exactly matches the phrase
TextAfterthe text in the document immediately after the match
MatchNumberstarts at 1 and increases for each match
MatchWithinFilelike MatchNumber but reset for each new document
StartBytethe byte offset in the file at which the match begins
EndBytethe byte offset in the file at which the match ends
MatchLengthlength in bytes of ${MatchedText} (EndByte - StartByte)

There are also constructs for formatting variables, for padding them to a given width (measured in Unicode characters, not bytes), and for filtering them through routines that delete punctuation, convert punctuation to spaces, perform case conversion and so forth.

Extending lqkwic

The following XML-specific variables were added as an experiment to try to understand how viable the approach would be:

Table III

XML-specific Variables

VariableDescription
XML.Parent.TagThe content of the containing element's tag, between the angle brackets
XML.ContentBeforeContent up to the > of the start tag of the immediately enclosing parent element (including any tags and content that open and close entirely between the match and the parent tag)
XML.Parent.Namethe name of the parent element
XML.Parent.EndTagthe content of the parent element's end tag
XML.ContentAftercontent up to the < of the parent's end tag

It is not clear that this is sufficient to answer our use case of finding multiple phrases in the same XML element. To do that, we would need a way to identify parent elements and compare them.

One could use the File number and the byte offset of the matched text (${StartByte}), but this is not sufficient, because there may be close and open tags between matches of two phrases.

One approach to finding phrases with a common containing element named (of type) E would be to find all of the start and end tags for E, and then use the file, block and word within block numbers to perform range algebra.

But it would be more efficient if this were not needed. In a corpus of many files, it is likely that the element E will occur in many files, perhaps many times, and searching for them all will be too slow.

If lqkwic could print the location of the parent tag, a much simpler faster algorithm would be possible.

The notation ->startbyte or ->endbyte was added; after any XML variable name, it generates the corresponding byte offset in the matched file.

In addition, the notation XML.parent.Tag.e was added, to be similar to the XPath notation ancestor::e; it is possible that a future version of lqkwic will use the XPath notation, as long as there is no danger that users will be confused into thinking that lq-text is using a node-based model internally.

The search for a parent tag is implemented by reading the matched document at the block containing the match, and for some distance beforehand. lqkwic then searches backwards from the match to find an open tag which has no corresponding close tag in the intervening distance. It is worth noting that this sort of approach is not generally possible with SGML, where empty elements have no end tag. The syntactic innovation of XML was to require empty tags to have a trailing slash, as in <p/> or <p id="p301" />, and this enables the software to skip empty elements reliably. Start and end tags can be skipped more easily of course, although the algorithm used for backwards parsing does rely on attributes not containing unquoted < or > signs.

Unfortunately, backwards parsing suffers from a major drawback: the search for the parent tag will fail if it is too far away. Although lqkwic could in theory read arbitrarily back in the file, this could mean that presenting matches in a dictionary would be very expensive, with every match processed necessitating a search back to the start of a large document.

In practice, an in-memory cache may be sufficient to achieve reasonable performance in most cases. Another possibility might be to store parent pointers in the index. For now, lq-text is primarily intended for working with many thousands of small files; use XSLT to split large files before indexing them.

A sample program

We are now in a position to find all elements E that contain all of a set of phrases P0 ... Pn, as follows:

First, match the phrases, and, for each match, use a format of the form ${xml.contentbefore.E->endbyte} to find the end byte of the start tag of the parent element of type E; that is, the location just after the > at the end of the start tag. If two matches have the same value for the start tag, and are in the same file, then they share the same XML ancestor E.

We can match the phrases with a single invocation of lqrank except for one difficulty: there is no way to determine, for a given match, to which phrase it corresponds, so we cannot determine whether an element contains all of the phrases.

The lqrank program has the ability (when instructed with the -g option) to output a line, { q = N } where N is an integer, to identify to which result set the following matches correspond. This is available to lqkwic formats as the variable g.q (the g stands for glue, the unpublished and unfinished lq-text integration language).

Using this, it becomes a relatively simple matter in a language such as Perl, Python or even the Unix shell, to run

print phrases one per line |
    lqrank -r all -g -F - |
    lqkwic -s '${FID} ${g.q}
    ${xml.contentbefore.E->endbyte} ${Match}\n'

Each match is in this way prefixed by the numeric identifier of the document in the index (FID), the phrase number and the byte offset of the end of the nearest ancestor E element's end tag. The -F - option makes lqrank read the list of phrases to match from its input, rather than expecting them as command-line arguments; one could also use the Unix xargs program for this purpose.

Next we must group the matches by file identifier and startbyte, and if every different phrase occurred at least once, we print all the matches for that file identifier and startbyte.

The result can then be fed to lqkwic to generate a concordance, or perhaps to fetch information about the parent element, or both.

The program outlined here (and given in full in the appendix, in the Perl programming language) is intended as an example of the sort of flexibility that might be achieved as lq-text becomes more XML aware.

Unicode

In 1988, the use of 8-bit character sets was pretty usual; lq-text is at least 8-bit clean for data, so that conversion to UTF-8 seemed a simple matter, and also has some locale awareness. There were two tricky parts to the process of adding UTF-8 support. The first was to ensure that characters, rather than bytes, were counted when formatting, and of course that a UTF-8 octet sequence was never split part-way through.

The second difficulty was much harder: making sure that combining characters are never split from their corresponding base character. This last is not yet complete, but initial work using the GNOME glibc library is promising. This is the main issue preventing lq-text from being shipped, at present, and may have been completed by the time this paper is presented in August 2008.

Software cannot tell by inspecting a singly byte (or octet, as standards people say, in case 9-bit systems should reoccur) whether that octet forms part of a longer UTF-8 sequence. One needs to scan backwards to check, because the first octet is the one that indicates the number of octets to follow in the sequence that constitutes a single character. This is of course easy to deal with as long as one can scan backwards a little. For diacritical marks and other combining characters, however, one must consult a database. The author could not help but wish that a single bit in the character representation could have been reserved for this purpose, but that would have prevented Unicode from being backwards-compatible with ISO 8859-1, a goal at the time Unicode was designed. A future version of lq-text may use its own database, with only the character properties that lq-text needs, perhaps created automatically at the same time as each database so as to take locale information into account.

Comparing with XQuery 1.0 or XSLT 2 + Full Text

The published draft of Full-Text does not support concordance generation, although some implementations in practice (such as MarkLogic) do appear to offer the necessary functionality through product-specific extensions. The author of this paper considers match highlighting to be essential functionality in practice. A future version of Full-Text may well include it.

Let us then assume, as we must, that we are using an XQuery or XSLT implementation that supports in some way identifying match locations, and hence allows highlighting.

Advantages of Full-Text

  1. With Full-Text, XPath predicates and axes are available, so that one can easily find ancestors, parents, position in the element tree, and so forth. The lexical approach is very limited in this regard.

  2. Full-Text is (or probably will soon be) a standard, and one can easily move between implementations. The necessity of using vendor extensions for highlighting reduces this somewhat, but of course there is only one implementation of lq-text, albeit with source code freely available.

  3. An XPath implementation with Full-Text might have indexes for element location that enable higher performance, for example by using one CPU to find elements and another to resolve the text search. Although this sort of optimisation is largely at the research level today, it is likely to find its way into products, both closed and open source, in the near future. Lq-text uses multiple programs, which can run on separate CPUs of course (and will do so without any action from the user on a multi-CPU system) but there are no plans for finer-grained parallelism.

  4. The Full-Text facility is designed to work with Unicode and XML-based language support, giving a high degree of internationalisation. Although the author is adding Unicode support to lq-text (which previously, because it predated Unicode, used 8-bit character sets and a locale-based mechanism), it is not yet complete and pervasive.

  5. Since lq-text is not tree-based, it does not currently have any means to respect xml:lang, nor does it have any understanding of namespaces. Prefixed elements and attribute names are not currently handled. A solution involving the XML indexing filter is being considered for both of these issues, but its effectiveness is as yet unknown.

  6. The Full_text XPath extension is already in wider use than lq-text; training, support, books and forums are available for it, but not for lq-text.

Advantages of a lexical approach

  1. Open access to the match list supports flexibility and extensibility. The use of separate programs also allows intermediate results to be cached or stored and compared easily. By contrast, XQuery (where Full-Text is most likely to be found) is a large monolithic language. Open Source XQuery implementations are mostly in Java, which does not lend itself to good performance if a JVM must be started for each query, for example outside a servlet environment. None the less it should be mentioned that the fastest readily available indexed XQuery implementation in the author's experience is in Java, and once the JVM is started, is very fast.

  2. Because the data is not forced into the shape of a tree, it is possible to experiment, for example with overlapping markup. The generation of results by lqkwic can include a span from start element to corresponding end element, regardless of other start or end tags. Although XQuery and XPath 2.0 Full-Text allows for matching as if tags were absent, it does not give good control over which tags are to be treated as word boundaries and which not. But this is a difficult thing to do at query-time in any case, and neither system today has a complete answer for this.

  3. Lq-text can be used to generate non-XML results, for example a bitmap image representing a graph of word occurrence. XQuery and XSLT are limited to text and XML, although one can certainly write out SVG with them.

    Figure 1: Occurrences of four-digit numbers

    png image ../../../vol1/graphics/Quin01/Quin01-001.png

    A graph showing four-digit numbers along the x-axis, from 1500 to 1890 (presumably most of which represent years), and on the y-axis the number of times that number occurs in the corpus (17,000 entries from Brewer's Dictionary of Phrase and Fable).

  4. Experiments with Salton-style similarity functions, clustering, and other Information Retrieval techniques might eventually find their way back from work like this and into a future Full-Text specification. See, for example Salton, 1989 or Konchady2006 for descriptions of some applicable information retrieval techniques.

  5. Lq-text lies more in the world of traditional Unix text processing than in the world of relational databases. If one is primarily interested in finding content in a database, Full-Text is a clear winner. If one is more interested in exploring or searching text, perhaps lq-text has something to offer.

JEXE: Just Enough XML, Eh?

Some XML features were harder to see how to support than others. The author has no intent to support all of XML at this time, but just enough to be useful. This is regardless of how the XML is parsed. The following features are not supported, and are unlikely to be supported:

  1. CDATA sections; you can use entities instead. This is because the retrieval software does not scan the document from the start each time, but from the middle, and cannot determine whether markup is part of a marked section.

  2. External general entities (and XInclude); it is more useful for people working with XML as files to know the file than the document; if you want to resolve included entities, use a pre-processor such as xmllint before indexing.

  3. Arbitrary namespace support; the limit on namespaces is that all xmlns declarations must come before any regular attributes. In other words, the order of attributes (or pseudo-attributes) is significant. This may change in the future; it is because of a limitation in the indexer to do with the amount of available look-ahead.

  4. General entities; although support is planned for entities, the plan is to read the replacement text from a per-database configuration file. This is already done by lqkwic, but should also be done by the indexer. This means that per-document entities are not supported. External entities are not supported: the unit of retrieval is the file, not the document.

  5. The internal subset; currently lq-text can skip over an internal subset correctly in most cases (it is possible to construct an internal subset that will confuse it, I suspect, although this is always detected and a warning issued), but it is not parsed.

  6. Fixed and defaulted attribute values; without reading a DTD or internal subset, there are no default values. This could be thought of as a minimization feature of SGML that was overlooked during the design of XML.

  7. XML Notations; if the DTD were to be read, it might be possible to associate a URI or a MIME content type with a different tokenisation system, but the document author cannot know what MIME type will be used if a file is served on the Web; the DTD is not authoritative, and currently lq-text does not use HTTP to fetch things, but only works with local files. If lq-text used HTTP, behaviour would be based on the content type header for downloaded entities, not on any notation value in the DTD. For a local file, the notation value could be treated as a list of plausible content types, perhaps, but in practice content sniffing is more likely to work.

The result of this is that a JEXE document consists of an XML declaration (the encoding, if given, must be in UTF-8, however), an optional doctype declaration to point at an external DTD to be ignored, and then one (or more) simple element trees. Elements may have attributes, and may also declare namespaces. Namespace prefixes may be “normalised” based on a per-database configuration file, with elements in a default namespace that is associated with a URI given an explicit prefix [This feature is not implemented at the time of writing]. Numeric character references are expanded on indexing. Entity references are replaced by their per-database string values on retrieval; the plan is to index entity references both with their entity name and with their expanded value.

The resulting XML can be parsed “from the middle out” for the purposes of retrieval.

Although there is only support for a subset of XML, enough of the syntax is understood that you can index any XML document. However, some features, such as CDATA sections, permit the construction of documents that will confuse retrieval, even though the actual CDATA sections will be correctly parsed. A possible work-around is to process documents with XSLT before indexing them, creating surrogate documents.

Future Work

The work has shown that adding some simple XML support to lq-text is possible, but leaves a lot to be desired. For people already using lq-text, the support described in this paper is useful, but it is unlikely to persuade many people to try the package for the first time.

Adding more support for “just enough XML” will make the package more interesting. In the short term, extending the Unicode support is necessary before a release, as is more thorough testing and (as always) more documentation. After that, changes in the indexer to add support for (just enough) namespaces, general text entities and numeric character references have been sketched out.

There are no plans to use a full XML parser right now; although the author had originally intended to do so, the difficulty in tracking exact byte positions in the input delayed the work, and at this point although it is now possible, it has become a matter of human resources.

It is possible that the work here would be enough to enable lq-text to be used by an implementor of XQuery, and the author would like to do experiments in that area.

Searching a corpus of documents with disparate markup can be difficult with either approach, because one tends to write patterns that depend on the markup retrieved. One approach is to try to map queries at runtime; this can be a difficult problem of matching incompatible hierarchies of elements; see Euzenat and Shvaiko, 2007 on various approaches to the problem of matching ontologies. A more pragmatic approach is to re-write documents before indexing them, perhaps with XSLT. This approach works with both approaches to text retrieval, but can be tedious. An intermediate approach might be to define some XPath expressions, or to use a W3C XML Schema to impost some specific types, to identify sections, titles, paragraphs, and to mark which elements are considered to break apart words, phrases and paragraphs. The index could then include this information alongside the element structure. More experiments in this area are planned.

Conclusions

The author's original goal in adding XML support to lq-text was to use lq-text to help optimise an XQuery implementation. After experimenting with an XQuery implementation that supported Full-Text, the author decided instead to focus on enhancing lq-text to see if the results would be useful. It turns out that they are indeed useful, and development is continuing.

It must be admitted, however, that any advantage of lq-text over sophisticated XQuery implementations is likely to diminish over time.

The subset of XML supported (and with planned support), “just enough XML, Eh?” (JEXE), may be worth documenting separately.

References

[Adolphs, 2006] Adolphs, Svenja, “Introducing Electronic Text Analysis” (Routledge, 2006). A very clear and impressively slender introduction to the application of information retrieval, and especially the keyword-in-context list, to literary and linguistic research.

[Baeza-Yates and Marais, 1999] Baeza-Yates, Ricardo, and Marais, H., “Modern Information Retrieval” (ACM Press, 1999). Describes information retrieval mostly from the perspective of a researcher in text retrieval rather than a programmer or a user, and assumes more background knowledge, particularly in mathematics, than Manu Konchady’s book, so may be best read second.

[Euzenat and Shvaiko, 2007] Euzenat, Jérôme and Shvaiko, Pavel, “Ontology Matching” (Springer, 2007); a surprisingly clear introduction to problems such as relating two or more different classification schemes (such as XML schemas) over the same subject matter, although the presentation uses a mathematical notation, and some background in formal logic may be helpful.

[Konchady2006] Konchady, Manu, “Text Mining Application Programming” (Charles River Media, Boston USA, 2006). A useful programmer-level introduction to topics relating to implementing and using text retrieval, part-of-speech tagging, clustering and other topics, together with just enough mathematics, but not specific to any particular language. Includes CD-ROM with code samples in Perl, however.

[Salton, 1989] Salton, Gerald, “Automatic Text Processing” (Addison-Wesley, 1989). Perhaps a little dated, but the late Dr. Salton was extremely influential in the field. His earlier, 1983, book formed the basis for a single chapter of this work, but the 1983 book is harder to find today.

[W3C Full-Text, 2007] Sihem Amer-Yahia, Chavdar Botev, Stephen Buxton, Pat Case, Jochen Doerre, Mary Holstege, Jim Melton, Michael Rys and Jayavel Shanmugasundaram (Editors), “XQuery 1.0 and XPath 2.0 Full-Text 1.0” [online]. [cited 18th April 2008]. http://www.w3.org/TR/xpath-full-text-10/.

Author's keywords for this paper: XML; Full Text; Information Retrieval; Natural Language Processing; Computational Linguistics

Liam R. E. Quin

XML Activity lead

W3C

Mr Quin has been involved with declarative, descriptive markup since the early 1980s. He wrote his open-source text retrieval system and first distributed it in the late 1980s.

He has worked at the World Wide Web Consortium since 2001, where he is XML Activity Lead, or, informally, Mrs XML.