Feature | OED | FEW | Comment |
---|---|---|---|
Pages | 21730 | 16865 | |
Volumes | 20 | 25 | |
Entries | 300000 | 20000 | FEW entries are etymons, not lexemes, thus fewer |
Lexemes | 600000 | 900000 (*) | (*) back-of-the-envelope estimate |
<X><Y>some nice text</Y> <Z>and text to be made invisible</Z> and <W>finally</W> <Y>nice text again</Y></X> is virtualized into these three virtual strings: (1) "some nice text", (2) "and finally" and (3) "nice text again", if using the following configuration: <X> and <Y> tags should stop the virtualization mechanism, <Z> tags as well as their contents (if considered as elements) should be made invisible, and <W> tags (not their contents, if considered as elements) should be made invisible.
<date> tag inserted: 1787 would be matched as a licit date. This would lead to a false positive, as it has already been tagged as part of a bibliographical reference.
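The <X>/<Y>/<Z>/<W> configuration described above can be illustrated with a minimal Python sketch. It is not the implementation discussed in this article: the function name `virtualize`, its three parameters, the regex-based tokenizer, and the whitespace normalization are all assumptions made for illustration only.

```python
import re

# Hypothetical tokenizer: splits a string into tags and text runs.
TOKEN = re.compile(r"<(/?)([A-Za-z][\w.-]*)[^<>]*?(/?)>|([^<]+)")


def virtualize(xml, stop_tags, invisible_elements, invisible_tags):
    """Return the virtual strings of `xml` under a three-way tag configuration.

    stop_tags:          tags that end the current virtual string
    invisible_elements: tags hidden together with their contents
    invisible_tags:     tags hidden while their contents stay visible
    """
    strings = []
    buffer = []
    hidden_depth = 0  # > 0 while inside an invisible element

    def flush():
        # Assumption: whitespace is collapsed and trimmed.
        text = " ".join("".join(buffer).split())
        buffer.clear()
        if text:  # empty virtual strings are never returned
            strings.append(text)

    for closing, name, self_closing, text in TOKEN.findall(xml):
        if text:
            if hidden_depth == 0:
                buffer.append(text)
            continue
        if name in invisible_elements:
            if not self_closing:
                hidden_depth += -1 if closing else 1
        elif hidden_depth > 0:
            continue  # tag nested inside an invisible element
        elif name in stop_tags:
            flush()
        elif name not in invisible_tags:
            raise ValueError(f"tag <{name}> is not covered by the configuration")
    flush()
    return strings


EXAMPLE = (
    "<X><Y>some nice text</Y> <Z>and text to be made invisible</Z> "
    "and <W>finally</W> <Y>nice text again</Y></X>"
)
print(virtualize(EXAMPLE, {"X", "Y"}, {"Z"}, {"W"}))
# → ['some nice text', 'and finally', 'nice text again']
```

Under the same assumptions, listing a tag such as <biblio> among the invisible elements would hide an already tagged year from a later date search.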
<biblio> tags (as well as others, in practice) should be made totally invisible, their contents included, prior to full text search. The full text to which the search operation is applied thus becomes:
f. belongs to the keyword list of grammatical categories (it is an abbreviation standing for "feminine substantive").
f. would be matched as a licit grammatical category. This would lead to a false positive, as f. is here an abbreviation of the word fleur (i.e. flower) and has already been tagged as part of the definition of a lexeme.
<def> tags (as well as others, in practice and for the relevant algorithm) should be made totally invisible, their contents included, prior to full text search. The full text to which the search operation is applied thus becomes:
The – character is not a hyphen, but an en dash. This character, as well as the spacing around it, is accounted for in the regular expression used to match dates.
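A regular expression that tolerates both the en dash and optional spacing can be sketched as follows. The pattern is hypothetical and is not the expression actually used for date matching here. It covers only century ranges written in the "4e-6e s." notation found in this section's examples, and the names `CENTURY_RANGE` and `find_century_ranges` are illustrative.

```python
import re

# Hypothetical pattern: two ordinal centuries joined by an en dash or a
# hyphen, with optional whitespace around the dash, followed by "s.".
CENTURY_RANGE = re.compile(r"\b(\d{1,2})e\s*[–-]\s*(\d{1,2})e\s+s\.")


def find_century_ranges(text):
    """Return (start, end) century pairs found in `text`."""
    return [
        (int(m.group(1)), int(m.group(2)))
        for m in CENTURY_RANGE.finditer(text)
    ]


print(find_century_ranges("4e–6e s."))    # en dash, no spacing
print(find_century_ranges("4e – 6e s."))  # en dash with spacing
print(find_century_ranges("4e-6e s."))    # plain hyphen
```

All three calls return the same pair, since the character class accepts either dash and `\s*` absorbs any surrounding spaces.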
<date> tag inserted: 4e-6e s. (i.e. 4th-6th century) would be split into six fragments. This would lead to a false negative, i.e. the date would not be matched. By virtually removing the <e> and <lb/> tags (as well as others, in practice), the date can be matched by a regular expression. The full text to which the search operation is applied thus becomes:
<affix> tag inserted: -ivus would be separated from the rest of the word. This would lead to a false negative, i.e. the affix would not be matched. By virtually removing the <i> tags (as well as others, in practice), the affix keyword can be matched. The full text to which the search operation is applied thus becomes:
<X><Y>some text</Y><Z>some text</Z></X>. Second example: <X><Y>some text</Y></X> <Z>some text</Z>. A forward search (in our proposed node list representation), relative to the target node <X>, for a <Z> tag will find a match in both examples, whereas a similar descendant search (in a tree representation) would not find a match in the second example.
{<i>, <lb>, <s>, <t>, <u>, <v>}:
{<s>, <lb>}
{<i>}
{<v>}
{<t>}
{<u>}
Once upon a time,), configured with the aforementioned visibility partition:
Once upon a time, there was a sentence with an important part, followed by an .
It was followed by a second sentence separated from the first by a visible tag.
A word near the end of the third sentence was split by a break tag.
</v> and </t>; indeed, empty virtual strings are never added to the returned sequence.
merge), the break tag can be processed differently based on the value of the attribute. Three behaviors are defined:
FTIgnore option. It must be noted that constructing an intermediate full text representation and searching this full text representation are tightly interwoven, in order to return to user code the XML elements whose contents match the full text search query.
«Reading Contexts»
«contextes de lecture»