{"id":3980,"date":"2010-06-14T06:35:05","date_gmt":"2010-06-14T13:35:05","guid":{"rendered":"http:\/\/palblog.fxpal.com\/?p=3980"},"modified":"2010-06-14T06:40:19","modified_gmt":"2010-06-14T13:40:19","slug":"3980","status":"publish","type":"post","link":"https:\/\/blog.fxpal.net\/?p=3980","title":{"rendered":"Parsing patents"},"content":{"rendered":"<p>Since Google <a title=\"Free download: 10 terabytes of patents and trademarks | Google  Public Policy Blog\" href=\"http:\/\/googlepublicpolicy.blogspot.com\/2010\/06\/free-download-10-terabytes-of-patents.html\" target=\"_blank\">announced<\/a> its distribution of patents, I have been poking around the data trying to understand what&#8217;s in there and starting to index it for retrieval. The first challenge I&#8217;ve had to deal with is data formats. The second is how to display documents to users efficiently.<\/p>\n<p>The full text of the patents is available in ZIP files, one file per week, based on the date patents were granted. The files cover patents issued from 1976 to (as of this writing) the first week of 2010. In addition to the text, they contain all manner of metadata such as when the patent was filed, who the inventors and assignees were, etc. Interestingly, the zipped up files are in two different formats: patents from 2001 on are in XML, while earlier ones are in a funky <em>ad hoc<\/em> text format.<\/p>\n<p><!--more--><\/p>\n<h3>Parsing<\/h3>\n<p>The XML format was easy to parse using standard tools, whereas the proprietary format required some specialized code. 
Despite that, the XML format proved the more problematic in some ways:<\/p>\n<ul>\n<li>The DTD wasn&#8217;t really available in machine-readable form; the closest I could find was <a title=\"XML DTD : us-patent-grant-v41-2005-08-25.dtd | USPTO\" href=\"http:\/\/www.uspto.gov\/web\/offices\/ac\/ido\/oeip\/sgml\/st32\/redbook\/rb2004\/us-patent-grant-v41-2005-08-25-DTD-Documentation\/us-patent-grant-v41-2005-08-25_dtd.html\" target=\"_blank\">this page<\/a> on the USPTO site<\/li>\n<li>While both formats indicated the distinction between the patent&#8217;s summary vs. its details, the XML format chose to do it through processing instructions rather than via containment. Unfortunately, standard implementations of the <a title=\"Class DefaultHandler | java.sun.com\" href=\"http:\/\/java.sun.com\/j2se\/1.5.0\/docs\/api\/org\/xml\/sax\/helpers\/DefaultHandler.html?is-external=true\" target=\"_blank\">DefaultHandler<\/a> (such as the <a title=\"Class XmlSlurper | groovy.codehaus.org\" href=\"http:\/\/groovy.codehaus.org\/api\/groovy\/util\/XmlSlurper.html\" target=\"_blank\">XmlSlurper<\/a> I was using) drop processing instructions on the floor. To use that data would have required a complete rewrite of the handler. In the end, I decided to fudge it and added a pre-processing step that scanned for the specific processing instructions that bracketed the summary and details sections of the patent and replaced them with elements that the XML parser would represent properly. 
Unfortunately, this hack cost me DTD compliance, so I had to turn validation off.<\/li>\n<\/ul>\n<p>For those curious about the text format, it contains stuff like this:<\/p>\n<pre>PATN\r\nWKU  D02428814\r\nSRC  5\r\nAPN  611301&amp;\r\nAPT  4\r\nART  292\r\nAPD  19750908\r\nTTL  Diver's helmet\r\nISD  19770104\r\nNCL  1\r\nECL  1\r\nEXP  Feifer; Melvin B.\r\nNDR  2\r\nNFG  6\r\nTRM  14\r\nINVT\r\nNAM  Jones; Richard F.\r\nCTY  Santa Barbara\r\nSTA  CA\r\n<\/pre>\n<p>The file is organized into field codes with associated values (e.g., TTL is the title). Values that are longer than a single line continue on the next line, which starts with a few spaces to distinguish it as a continuation. Field codes are grouped into sections identified by four-character codes (e.g., INVT, above) with three-character codes for sub-fields. In short, it&#8217;s pretty easy to parse. In addition, it clearly sets out the summary (BSUM) and the details (DETD) fields.<\/p>\n<p>The reason this is important is that Xue and Croft\u00a0 <a title=\"Xue, X. and Croft, W. B. (2009) Automatic query generation for patent search. In Proc. CIKM '09. ACM, New York, NY, 2037-2040.\" href=\"http:\/\/maroo.cs.umass.edu\/pub\/web\/getpdf.php?id=896\" target=\"_blank\">found<\/a> that the summary field appears to be most effective for relevance feedback searches on patents.<\/p>\n<p>The meaning of the codes is documented <a title=\"Informal Notes on the U.S. Patent Data Format |  Linguistic Data Consortium\" href=\"http:\/\/www.ldc.upenn.edu\/Catalog\/desc\/addenda\/LDC93T3D_USPatent\" target=\"_blank\">here<\/a>, although the example is in SGML, rather than in the format described above. 
This SGML example does not have a one-to-one correspondence between the field codes and the element names, but the associated documentation lists the field codes and their definitions.<\/p>\n<p>One interesting (aka bizarre) aspect of both formats is that the patent number itself is encoded in the WKU field in a manner that does not correspond directly to the value you see on an actual patent! I reverse-engineered the actual patent number with the following regex:<\/p>\n<pre>\/^([a-zA-Z]{0,2})0(\\d{5,7}).$\/<\/pre>\n<p>which looks for an optional prefix of up to two letters, then a zero, then the 5-7 digit patent number, and then one final character (which seems to be a check digit of some kind, which I discard). The prefix and the patent number, concatenated, produce the patent number that the USPTO search engine recognizes. Of course, this doesn&#8217;t handle the patent applications; those need to be processed separately.<\/p>\n<h3>Querying<\/h3>\n<p>My goal in indexing these patents is to experiment with retrieval algorithms and user interfaces, but I don&#8217;t want to actually host all the data. While I can store and display the text associated with each patent, I don&#8217;t really want to store all the PDFs or drawings or page images, as that will consume all those terabytes Google was so proud of. Unfortunately, I have not yet figured out how to get Google or the USPTO to serve up a PDF or a page image for a specific patent that I can identify with a patent number.<\/p>\n<p>While Google does provide a JavaScript <a title=\"Google AJAX Search API | Google\" href=\"http:\/\/code.google.com\/apis\/ajaxsearch\/documentation\/\" target=\"_blank\">API for searching its patents<\/a> (among other collections), the API is rather limited and I am not sure how to get it to return a PDF of a specific patent reliably.<\/p>\n<p>It would be a terrible waste of resources to host that stuff and to serve it through a single point in the network. 
That leaves me with a few possibilities, including figuring out a way to obtain documents from a third-party based on the patent number or joining some kind of a consortium that would host the documents\u00a0 (for research purposes only!).<\/p>\n<p>I wonder if NIST is willing to do this.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Since Google announced its distribution of patents, I have been poking around the data trying to understand what&#8217;s in there and starting to index it for retrieval. The first challenge I&#8217;ve had to deal with is data formats. The second is how to display documents to users efficiently. The full text of the patents is [&hellip;]<\/p>\n","protected":false},"author":4,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[15],"tags":[123,213],"jetpack_featured_media_url":"","_links":{"self":[{"href":"https:\/\/blog.fxpal.net\/index.php?rest_route=\/wp\/v2\/posts\/3980"}],"collection":[{"href":"https:\/\/blog.fxpal.net\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.fxpal.net\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.fxpal.net\/index.php?rest_route=\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.fxpal.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=3980"}],"version-history":[{"count":11,"href":"https:\/\/blog.fxpal.net\/index.php?rest_route=\/wp\/v2\/posts\/3980\/revisions"}],"predecessor-version":[{"id":4044,"href":"https:\/\/blog.fxpal.net\/index.php?rest_route=\/wp\/v2\/posts\/3980\/revisions\/4044"}],"wp:attachment":[{"href":"https:\/\/blog.fxpal.net\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=3980"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.fxpal.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=3980"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.fxpal.net\/index.php
?rest_route=%2Fwp%2Fv2%2Ftags&post=3980"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}