{"id":4209,"date":"2010-07-13T07:28:38","date_gmt":"2010-07-13T14:28:38","guid":{"rendered":"http:\/\/palblog.fxpal.com\/?p=4209"},"modified":"2010-07-12T16:32:17","modified_gmt":"2010-07-12T23:32:17","slug":"boolean-illogic","status":"publish","type":"post","link":"https:\/\/blog.fxpal.net\/?p=4209","title":{"rendered":"Boolean illogic"},"content":{"rendered":"<p>I am trying to understand how Google patent search works, and am encountering some quite odd behavior. I am not talking about the <a title=\"Google\u2019s Patent Search \u201cfeature\u201d | FXPAL Blog\" href=\"http:\/\/palblog.fxpal.com\/?p=4041\" target=\"_blank\">inventor search bug<\/a> (which is still un-fixed), but about Boolean logic.<\/p>\n<p>If I run the query [<a title=\"&quot;information retrieval&quot; | Google Patent Search\" href=\"http:\/\/www.google.com\/patents?q=%22information+retrieval%22&amp;lr=&amp;sa=N&amp;start=0\" target=\"_blank\">&#8220;information retrieval&#8221;<\/a>], the system retrieves 323 documents. Similarly, [<a title=\"&quot;dynamic hypertext&quot; | Google Patent Search\" href=\"http:\/\/www.google.com\/patents?q=%22dynamic+hypertext%22&amp;lr=&amp;sa=N&amp;start=0\" target=\"_blank\">&#8220;dynamic hypertext&#8221;<\/a>] retrieves 368 documents. The combination, [<a title=\"&quot;information retrieval&quot; &quot;dynamic hypertext&quot; | Google Patent Search\" href=\"http:\/\/www.google.com\/patents?tbs=bks%3A1&amp;tbo=1&amp;q=%22information+retrieval%22+%22dynamic+hypertext%22&amp;btnG=Search+Patents\" target=\"_blank\">&#8220;information retrieval&#8221; &#8220;dynamic hypertext&#8221;<\/a>] yields 16. Putting a plus in front of either quoted phrase does not affect the results. 
So far, this seems reasonable.<\/p>\n<p><!--more-->This seems reasonable until you ask the following questions:<\/p>\n<ul>\n<li>Why does the query [<a title=\"&quot;information retrieval&quot; OR &quot;dynamic hypertext&quot; | Google Patent Search\" href=\"http:\/\/www.google.com\/patents?tbs=bks%3A1&amp;tbo=1&amp;q=%22information+retrieval%22+OR+%22dynamic+hypertext%22&amp;btnG=Search+Patents\" target=\"_blank\">&#8220;information retrieval&#8221; OR &#8220;dynamic hypertext&#8221;<\/a>] return only 294 documents? You would think it should produce 323 + 368 &#8211; 16 = 675 results.<\/li>\n<li>Why does the query [<a title=\"&quot;information retrieval&quot; -&quot;dynamic hypertext&quot; | Google Patent Search\" href=\"http:\/\/www.google.com\/patents?tbs=bks%3A1&amp;tbo=1&amp;q=%22information+retrieval%22+-%22dynamic+hypertext%22&amp;btnG=Search+Patents\" target=\"_blank\">&#8220;information retrieval&#8221; -&#8220;dynamic hypertext&#8221;<\/a>] return 326 results when you might expect 323-16=307??<\/li>\n<li>How did the number of results go <em>up<\/em> when a more restrictive clause was added??<\/li>\n<li>Why does transposing the terms ([<a title=\"&quot;-dynamic hypertext&quot; &quot;information retrieval&quot; | Google Patent Search\" href=\"http:\/\/www.google.com\/patents?tbs=bks%3A1&amp;tbo=1&amp;q=-%22dynamic+hypertext%22+%22information+retrieval%22&amp;btnG=Search+Patents\" target=\"_blank\">-&#8220;dynamic hypertext&#8221; &#8220;information retrieval&#8221;<\/a>]) return 324 documents?<\/li>\n<li>While the query [<a title=\"-&quot;information retrieval&quot; &quot;dynamic hypertext&quot; | Google Patent Search\" href=\"http:\/\/www.google.com\/patents?tbs=bks%3A1&amp;tbo=1&amp;q=-%22information+retrieval%22+%22dynamic+hypertext%22&amp;btnG=Search+Patents\" target=\"_blank\">-&#8220;information retrieval&#8221; &#8220;dynamic hypertext&#8221;<\/a>] returns 352, which is consistent (368-16=352), why does a transposition of the terms ([<a 
title=\"&quot;dynamic hypertext&quot; -&quot;information retrieval&quot; | Google Patent Search\" href=\"http:\/\/www.google.com\/patents?tbs=bks%3A1&amp;tbo=1&amp;q=%22dynamic+hypertext%22+-%22information+retrieval%22+&amp;btnG=Search+Patents\" target=\"_blank\">&#8220;dynamic hypertext&#8221; -&#8220;information retrieval&#8221;<\/a>]) return 353, again adding one additional document?<\/li>\n<\/ul>\n<p>So what is the ground truth? How many matching documents are there, really? I decided to cross-check these numbers with <a title=\"Advanced Search | USPTO\" href=\"http:\/\/patft.uspto.gov\/netahtml\/PTO\/search-adv.htm\" target=\"_blank\">USPTO searches<\/a>. USPTO seems to use a Boolean search system, so I figured I would compare the results. I searched the USPTO for the two phrases in the title (TTL), abstract (ABST), description (SPEC), and claims (ACLM) fields like this:<\/p>\n<p><code><br \/>\nTTL\/\"dynamic hypertext\" or SPEC\/\"dynamic hypertext\" or ABST\/\"dynamic hypertext\" or ACLM\/\"dynamic hypertext\"<br \/>\n<\/code><\/p>\n<p>The table below summarizes what I found:<\/p>\n<table>\n<tbody>\n<tr>\n<td><strong>Query<\/strong><\/td>\n<td><strong>USPTO count<\/strong><\/td>\n<td><strong>Google count<\/strong><\/td>\n<\/tr>\n<tr style=\"text-align: center;\">\n<td style=\"text-align: left;\">&#8220;dynamic hypertext&#8221;<\/td>\n<td>221<\/td>\n<td>368<\/td>\n<\/tr>\n<tr>\n<td>&#8220;information retrieval&#8221;<\/td>\n<td style=\"text-align: center;\">7489<\/td>\n<td style=\"text-align: center;\">323<\/td>\n<\/tr>\n<tr>\n<td>&#8220;dynamic hypertext&#8221; and<br \/>\n&#8220;information retrieval&#8221;<\/td>\n<td style=\"text-align: center;\">10<\/td>\n<td style=\"text-align: center;\">16<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>While for typical precision-oriented web search this sort of funny math doesn&#8217;t matter because you&#8217;re not interested in all the results, but just in one, any one, that matches your information need, this is not the 
case for patent search. Patent search has a significant recall-oriented component: it is in fact important to find every single document that matches your criteria. In such situations, apparently, one might not wish to rely on Google&#8217;s algorithms, given their lack of transparency.<\/p>\n<p><strong>Note: <\/strong>If you repeat this experiment, you are likely to get slightly different counts due to the varying availability of different parts of the index over time. Nonetheless, either this variation does not account for the logical inconsistencies in the result sets, or it should be considered a bug for recall-oriented tasks on collections such as the patent database.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I am trying to understand how Google patent search works, and am encountering some quite odd behavior. I am not talking about the inventor search bug (which is still un-fixed), but about Boolean logic. If I run the query [&#8220;information retrieval&#8221;], the system retrieves 323 documents. Similarly, [&#8220;dynamic hypertext&#8221;] retrieves 368 documents. 
The combination, [&#8220;information [&hellip;]<\/p>\n","protected":false},"author":4,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[15],"tags":[123],"jetpack_featured_media_url":"","_links":{"self":[{"href":"https:\/\/blog.fxpal.net\/index.php?rest_route=\/wp\/v2\/posts\/4209"}],"collection":[{"href":"https:\/\/blog.fxpal.net\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.fxpal.net\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.fxpal.net\/index.php?rest_route=\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.fxpal.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=4209"}],"version-history":[{"count":11,"href":"https:\/\/blog.fxpal.net\/index.php?rest_route=\/wp\/v2\/posts\/4209\/revisions"}],"predecessor-version":[{"id":4212,"href":"https:\/\/blog.fxpal.net\/index.php?rest_route=\/wp\/v2\/posts\/4209\/revisions\/4212"}],"wp:attachment":[{"href":"https:\/\/blog.fxpal.net\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=4209"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.fxpal.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=4209"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.fxpal.net\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=4209"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}