Building the Ivory Tower


I recently read on Jeff Dalton’s blog that a new open-source search engine, called Ivory, has been released by Jimmy Lin. Ivory is based on Hadoop and is designed to handle terabyte-sized collections. Unlike Lucene, this is a research project, Jimmy Lin writes,

aimed at information retrieval researchers who need access to low-level data structures and who generally know their way around retrieval algorithms. As a result, a lot of “niceties” are simply missing—for example, fancy interfaces or ingestion support for different file types. It goes without saying that Ivory is a bit rough around the edges, but our philosophy is to release early and release often. In short, Ivory is experimental!

Still, it is interesting to see another tool made available publicly to help spur IR research. It joins a collection of search engines and search toolkits, including Lucene, Lemur, and Terrier. At first glance, these efforts look similar. Certainly they all show their academic roots in being able to index TREC collections out of the box. It’s not clear (to me!) what the tradeoffs among the various tools are, and how well they can be applied to support interactive search.

But it’s a trend worth following, and an experiment worth conducting to compare the various approaches with respect to the variety of information seeking tasks. Having that kind of a characterization would help those deciding which tool to use for a new research or deployment project. Dimensions on which the tools could be compared include the following:

  • Responsiveness: how well-suited is the search engine for supporting interactive use? What is its CPU and memory footprint?
  • Expressiveness: how rich is the query language available to the client UI, whether for queries constructed directly by the user or on the user’s behalf by the client software?
  • Transparency: how easy is it for client software to access low-level corpus statistics and the intermediate calculations that led to a particular ranking? (A sketch of what this might look like in Lucene follows the list.)
  • Modularity: how easy is it for an application to replace particular parts of the indexing or retrieval algorithms with variants designed to support situations not directly anticipated by the toolkit designers?
  • Reliability: how stable is the code for use in a deployment rather than in offline experimentation?
  • Quality: how good are the search results on known collections? Does the search engine excel at certain kinds of data (e.g., linked document collections, TREC, etc.), or is its ranking performance uniformly good?
  • Support: does the code have an active forum where questions about its use can be answered? Are its developers responsive in generating bug fixes and answering complicated questions that the community of users is unable to field?
  • Ease of use: how straightforward is it (it’s never “easy”!) to install the toolkit and to configure and integrate it with other source code? Is there a client-server model, or do you have to wrap your own server around it? Is the server architecture scalable? If not, how straightforward is it to add support for load balancing, redundancy, etc.?
  • Extensibility: Is the source code available and sufficiently documented to allow third parties to extend the code when necessary?
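To make the transparency dimension concrete, here is a minimal sketch of the kind of access I have in mind, using Lucene’s Java API (the class name, index path, field name, and query term below are purely illustrative, and the exact signatures vary somewhat across Lucene versions): the engine exposes raw corpus statistics and can explain the arithmetic behind an individual document’s score.

    // A sketch of "transparency" in Lucene: raw statistics plus score explanations.
    // The index path and field/term names are hypothetical.
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.*;
    import org.apache.lucene.store.FSDirectory;
    import java.nio.file.Paths;

    public class TransparencyProbe {
        public static void main(String[] args) throws Exception {
            try (IndexReader reader = DirectoryReader.open(
                    FSDirectory.open(Paths.get("/path/to/index")))) {
                IndexSearcher searcher = new IndexSearcher(reader);
                Term term = new Term("body", "retrieval");

                // Low-level corpus statistics: document frequency and collection frequency.
                System.out.println("docFreq       = " + reader.docFreq(term));
                System.out.println("totalTermFreq = " + reader.totalTermFreq(term));

                // Intermediate calculations behind a ranking: ask the searcher to
                // explain how the top hit's score was computed.
                Query query = new TermQuery(term);
                TopDocs hits = searcher.search(query, 10);
                if (hits.scoreDocs.length > 0) {
                    Explanation why = searcher.explain(query, hits.scoreDocs[0].doc);
                    System.out.println(why);
                }
            }
        }
    }

The question for each toolkit is how uniformly and how cheaply client code can get at this kind of information.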

This is only a partial sample of a longer list of questions, each of which would need to be answered at different collection sizes. Some comparisons of these tools have been made (see here and here), but these are somewhat ad hoc. It would be great if a methodology for comparing and evaluating these tools could be devised, one that would allow their creators to rate their software along these dimensions, or third parties to run comparisons against a known set of benchmarks. This kind of effort would allow us to make even greater progress, whether we are building ivory towers or bustling marketplaces.
