{"id":4882,"date":"2010-11-02T07:35:14","date_gmt":"2010-11-02T14:35:14","guid":{"rendered":"http:\/\/palblog.fxpal.com\/?p=4882"},"modified":"2010-11-02T22:40:30","modified_gmt":"2010-11-03T05:40:30","slug":"on-term-selection-in-reverted-indexing","status":"publish","type":"post","link":"https:\/\/blog.fxpal.net\/?p=4882","title":{"rendered":"On term selection in reverted indexing"},"content":{"rendered":"<p><em>Jeremy Pickens contributed to this post.<\/em><\/p>\n<p>Jeremy did a great job of presenting our <a title=\"Pickens, J, Cooper, M., and Golovchinsky, G. (2010) Reverted Indexing for Feedback and Expansion. In Proc. CIKM 2010\" href=\"http:\/\/fxpal.com\/?p=abstract&amp;abstractID=581\" target=\"_blank\">Reverted Indexing paper<\/a>, but the short session made it difficult to answer all questions and comments thoroughly.\u00a0<span style=\"font-size: 13.2px;\">For example, William Webber wrote up a <a title=\"Reverted indexes and true relevance feedback  IREvalEtAl\" href=\"http:\/\/blog.codalism.com\/?p=1255\" target=\"_blank\">post<\/a> summarizing our work, in which he observed<\/span><\/p>\n<blockquote><p>The authors surmise that the reverted index is more effective because it suggests more selective expansion terms, and they reproduce example term sets as evidence. This explanation is convincing enough as far as it goes; but what is not explained is why the reverted index\u2019s expansion terms are more selective. The reason is not obvious. A single-term reverted index is not much more than a weighted direct index, mapping from documents to the terms that occur in them<\/p><\/blockquote>\n<p>I would like to address his comments because this is a key aspect of Reverted Indexing.<\/p>\n<p><!--more-->The key to understanding how a Reverted Index assigns weights is the following mapping:<\/p>\n<ul>\n<li><span style=\"font-size: 13.2px;\">The number of queries that retrieved a particular document becomes the document frequency statistic in the reverted index. <\/span><\/li>\n<li><span style=\"font-size: 13.2px;\">The sum of the retrieval scores of all docids retrieved by a Basis Query becomes the reverted document length<\/span><\/li>\n<\/ul>\n<p>In  general, basis queries that have a high probability\/score for retrieving a  particular docid are preferred. \u00a0But if those basis queries retrieve a lot  of other documents as well, the relative contribution of that score is  lowered (due to a higher reverted document length). \u00a0Therefore, docids that are retrieved by more  specific basis queries (with lower DF) get preferred by the ranking algorithm.           Terms with high DF (such as Africa and African in           the example from the paper) tend not to get selected by the           expansion algorithm because of the cutoff that William           mentioned in his post, but also because of the reverted  document length component. \u00a0The cutoff is applied to a list of           expansion terms (basis queries) ranked by their weights. Increasing the cutoff would serve only to increase the reverted document length, and  therefore lower the relative contribution of these documents.<\/p>\n<p><span style=\"font-size: 13.2px;\">In short, the more specific expansion terms are selected over the more           general terms by leveraging existing weighting frameworks that           prefer &#8220;shorter&#8221; documents in the reverted index, which are  roughly a function of lower document frequencies in the standard index,  i.e., <\/span><\/p>\n<p><span style=\"font-size: 13.2px;\">rev_tf \/ rev_doclength &lt;-&gt; score \/ sum_of_scores<\/span><\/p>\n<p><span style=\"font-size: 13.2px;\">Note that there is also a third factor at play in the reverted indexes: Reverted IDF.<\/span><\/p>\n<div>What  is reverted IDF? \u00a0It is the transform of the number of basis query &#8220;reverted documents&#8221;  in which a docid is found. \u00a0Docids that are retrieved by a lot of  different basis queries will receive a lower &#8220;term&#8221; weight when doing a  reverted query. \u00a0Therefore, terms that come from those documents will  not be weighted as highly. \u00a0On the other hand, docids that are retrieved  by relatively few basis queries will receive a higher term  weight.<\/div>\n<div>Now, maybe this is a bit of a  stretch, because this isn&#8217;t the only reason why a docid would have fewer  basis queries that retrieve it, but one important reason why a docid would  have fewer basis queries that retrieve it is because the original document was shorter, with relatively few unique terms. \u00a0And  maybe this is a bit of handwaving, and shouldn&#8217;t be believed without  some sort of specific empirical study, but shorter documents a moderate number of unique terms tend to be more specific, more focused documents.<\/div>\n<div>So  you have this double whammy. \u00a0Basis queries that retrieve a lot of  documents tend to be downweighted by the reverted doclength factor,  whereas docids that are  retrieved by a lot of basis queries tend to be downweighted by the  reverted IDF factor. The c<span style=\"font-size: 13.2px;\">ombination of these two factors yields a strong, natural reason that the reverted queries prefer low DF terms.<\/span><\/div>\n<p><span style=\"font-size: 13.2px;\">Finally, these more specific terms also result in efficiency gains because the inverted lists associated with more specific expansion terms are shorter.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Jeremy Pickens contributed to this post. Jeremy did a great job of presenting our Reverted Indexing paper, but the short session made it difficult to answer all questions and comments thoroughly.\u00a0For example, William Webber wrote up a post summarizing our work, in which he observed The authors surmise that the reverted index is more effective [&hellip;]<\/p>\n","protected":false},"author":4,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[15],"tags":[255,275],"jetpack_featured_media_url":"","_links":{"self":[{"href":"https:\/\/blog.fxpal.net\/index.php?rest_route=\/wp\/v2\/posts\/4882"}],"collection":[{"href":"https:\/\/blog.fxpal.net\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.fxpal.net\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.fxpal.net\/index.php?rest_route=\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.fxpal.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=4882"}],"version-history":[{"count":12,"href":"https:\/\/blog.fxpal.net\/index.php?rest_route=\/wp\/v2\/posts\/4882\/revisions"}],"predecessor-version":[{"id":4887,"href":"https:\/\/blog.fxpal.net\/index.php?rest_route=\/wp\/v2\/posts\/4882\/revisions\/4887"}],"wp:attachment":[{"href":"https:\/\/blog.fxpal.net\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=4882"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.fxpal.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=4882"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.fxpal.net\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=4882"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}