Blog Archive: 2010

Don’t go there

The field of information retrieval is inherently (some might say pathologically) data-driven. We need datasets to test algorithms, to compare systems, etc. This is all good. It’s particularly good to have data that are meaningful and relevant, because it makes it easier to motivate users and to generalize findings to data that people care about.

I expect that in the next few cycles of conference submissions, we will see a number of papers analyzing the “cable” data leaked by Bradley Manning to WikiLeaks. The dataset is large enough and topical enough that it is sure to attract all sorts of analyses, much as the Enron email dataset did in 2004.

But there are some important differences.
