Parsing patents
Since Google announced its distribution of patents, I have been poking around in the data, trying to understand what’s in there and starting to index it for retrieval. The first challenge I’ve had to deal with is data formats. The second is how to display documents to users efficiently.
The full text of the patents is available in ZIP files, one file per week, organized by the date the patents were granted. The files cover patents issued from 1976 to (as of this writing) the first week of 2010. In addition to the text, they contain all manner of metadata, such as when the patent was filed and who the inventors and assignees were. Interestingly, the zipped files come in two different formats: patents from 2001 on are in XML, while earlier ones are in a funky ad hoc text format.
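With two formats living in the same collection, a reasonable first step is to work out which one a given archive holds before handing it to a parser. Here is a minimal sketch of that idea in Python, using only the standard library. It rests on an assumption of mine rather than anything documented about the data: that an XML file starts with a `<` character (after an optional byte-order mark and whitespace) and that the older text files do not. The function names `sniff_format` and `classify_members` are made up for illustration.

```python
import io
import zipfile


def sniff_format(head: bytes) -> str:
    """Guess the format of a file from its first bytes.

    Assumption: XML content begins with '<'; anything else is treated
    as the older plain-text format.
    """
    # Drop UTF-8 byte-order-mark bytes and leading whitespace before checking.
    stripped = head.lstrip(b"\xef\xbb\xbf \t\r\n")
    return "xml" if stripped.startswith(b"<") else "text"


def classify_members(zip_source) -> dict:
    """Map each file inside a weekly ZIP archive to its guessed format.

    `zip_source` may be a path or a binary file-like object.
    """
    formats = {}
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # Read only a small prefix; the weekly files can be large.
            with archive.open(info) as member:
                formats[info.filename] = sniff_format(member.read(256))
    return formats


if __name__ == "__main__":
    # Build a tiny in-memory archive with placeholder contents
    # (not real patent data) to exercise the functions.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("newer.xml", '<?xml version="1.0"?><doc/>')
        archive.writestr("older.txt", "PLACEHOLDER HEADER\nfield value\n")
    buffer.seek(0)
    print(classify_members(buffer))
```

Sniffing the content rather than trusting the grant year in the file name means a misnamed or re-issued archive still gets routed to the right parser, though checking the year would work just as well if the 2001 cutoff holds cleanly.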