Jeremy Pickens contributed to this post.
Jeremy did a great job of presenting our Reverted Indexing paper, but the short session made it difficult to answer all questions and comments thoroughly. For example, William Webber wrote up a post summarizing our work, in which he observed
The authors surmise that the reverted index is more effective because it suggests more selective expansion terms, and they reproduce example term sets as evidence. This explanation is convincing enough as far as it goes; but what is not explained is why the reverted index’s expansion terms are more selective. The reason is not obvious. A single-term reverted index is not much more than a weighted direct index, mapping from documents to the terms that occur in them
I would like to address his comments because this is a key aspect of Reverted Indexing.
The key to understanding how a Reverted Index assigns weights is the following mapping:
- The number of queries that retrieved a particular document becomes the document frequency statistic in the reverted index.
- The sum of the retrieval scores of all docids retrieved by a Basis Query becomes the reverted document length
In general, basis queries that have a high probability/score for retrieving a particular docid are preferred. But if those basis queries retrieve a lot of other documents as well, the relative contribution of that score is lowered (due to a higher reverted document length). Therefore, docids that are retrieved by more specific basis queries (with lower DF) get preferred by the ranking algorithm. Terms with high DF (such as Africa and African in the example from the paper) tend not to get selected by the expansion algorithm because of the cutoff that William mentioned in his post, but also because of the reverted document length component. The cutoff is applied to a list of expansion terms (basis queries) ranked by their weights. Increasing the cutoff would serve only to increase the reverted document length, and therefore lower the relative contribution of these documents.
In short, the more specific expansion terms are selected over the more general terms by leveraging existing weighting frameworks that prefer “shorter” documents in the reverted index, which are roughly a function of lower document frequencies in the standard index, i.e.,
rev_tf / rev_doclength <-> score / sum_of_scores
Note that there is also a third factor at play in the reverted indexes: Reverted IDF.
Finally, these more specific terms also result in efficiency gains because the inverted lists associated with more specific expansion terms are shorter.
Thanks for the clarification, Gene (and Jeremy). You’re right; I was confused about the source of IDF in the reverted index. I’ll clarify my understanding and blog again in the next couple of days.
[…] FXPAL Blog On technology and beyond! « On term selection in reverted indexing […]
[…] Gene and Jeremy have disputed two assertions about index reversion made in my previous post. The first is that (1,10) normalization strips term occurences of their IDF factor. The second is that the cutoff depth used in creating the reverted corpus sets an upper bound on reverted term DF. Let me say immediately that my second assertion is wrong, and was due to confusion on my part: cutoff depth constrains reverted document length, not reverted DF; the latter depends (as Gene points out) on how many terms a document is highly retrievable by. […]
[…] By william, on November 4th, 2010 Copied with permission from IREvalEtAl GD Star Ratingloading…Gene and Jeremy have disputed two assertions about index reversion made in my previous post. The first […]