Markup as Index Interface
Thinking Like a Search Engine
To a search engine, indexes are specified by the content: the words, phrases, and characters that are actually present tell the search engine what inverted indexes to create. External knowledge can be applied to add to this inventory of indexes. For example, knowledge of the document language can lead to indexes for word stems or decompounding. These can unify different content into the same index or split the same content into multiple indexes. That is, different words manifest in the content can be unified under a single search key, and the same word can have multiple manifestations under different search keys. Turning this around, the indexes represent the retrievable information content in the document. Full-text search is not a yes/no system, but one of relative fit (scoring). Precision balances against recall, mediated by scoring.
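As a rough illustration of these ideas, the following Python sketch builds two inverted indexes from the same content: one keyed by the exact words present, and one keyed by word stems. Everything in it is an assumption made for illustration: the sample documents, the function names, and especially `toy_stem`, a deliberately crude suffix stripper standing in for a real language-aware stemmer. Scoring here is nothing more than a term count, which is far simpler than what a production search engine does.

```python
import re
from collections import defaultdict


def toy_stem(word):
    # Hypothetical, deliberately crude stemmer: strips a few English suffixes.
    # A real engine would use a language-specific analyzer instead.
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_indexes(docs):
    # Each index maps a search key to {doc_id: occurrence count}.
    exact = defaultdict(lambda: defaultdict(int))
    stems = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            exact[word][doc_id] += 1            # index driven by the content itself
            stems[toy_stem(word)][doc_id] += 1  # index added by external knowledge
    return exact, stems


def search(index, key):
    # Rank matching documents by count (a stand-in for real scoring),
    # breaking ties by document id so the order is stable.
    postings = index.get(key, {})
    return sorted(postings.items(), key=lambda item: (-item[1], item[0]))


# Illustrative documents, not taken from any real corpus.
docs = {
    "d1": "Indexes are specified by the content.",
    "d2": "Indexing the content creates an index; the index is searched.",
    "d3": "Markup describes structure.",
}
exact, stems = build_indexes(docs)

print(search(exact, "index"))  # only documents containing the literal word
print(search(stems, "index"))  # "indexes", "indexing", "index" unified under one key
```

In this sketch the word "indexes" in `d1` is reachable under two different search keys ("indexes" in the exact index and "index" in the stem index), while three different surface words are unified under the single stem key. Searching the stem index returns more documents than the exact index does, which is the trade of precision for recall described above, with the counts providing a relative ordering rather than a yes/no answer.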
The search engine perspective offers a different way to think about markup:
As a specification of the retrievable information content of the document.
As something that can, with additional information, unify different markup or provide multiple distinct views of the same markup.
As something that can be present to greater or lesser degrees, with a goodness of match (scoring).
As a specification that can be adjusted to balance precision and recall.
What does this search engine perspective on markup mean, concretely? Can we use it to reframe some persistent conundrums, such as vocabulary resolution and overlap? Let's see.