{"id":1472,"date":"2009-07-29T07:16:36","date_gmt":"2009-07-29T14:16:36","guid":{"rendered":"http:\/\/palblog.fxpal.com\/?p=1472"},"modified":"2009-07-27T00:01:24","modified_gmt":"2009-07-27T07:01:24","slug":"building-the-ivory-tower","status":"publish","type":"post","link":"https:\/\/blog.fxpal.net\/?p=1472","title":{"rendered":"Building the Ivory Tower"},"content":{"rendered":"<p>I recently read on Jeff Dalton&#8217;s <a title=\"Ivory: A New MapReduce Indexing and Retrieval System | Jeff's Search Engine Caff\u00e8\" href=\"http:\/\/www.searchenginecaffe.com\/2009\/07\/ivory-new-mapreduce-indexing-and.html\" target=\"_blank\">blog<\/a> that a new open-source search engine, called\u00a0 <a href=\"http:\/\/www.umiacs.umd.edu\/%7Ejimmylin\/ivory\/docs\/index.html\">Ivory<\/a>, has been released by<a href=\"http:\/\/www.umiacs.umd.edu\/%7Ejimmylin\/\"> Jimmy Lin<\/a>. Ivory is based on Hadoop, and is\u00a0 designed to handle terabyte-sized collections. Unlike Lucene, this is a research project, Jimmy Lin writes,<\/p>\n<blockquote><p>aimed at information retrieval researchers who need access to low-level data structures and who generally know their way around retrieval algorithms.  As a result, a lot of &#8220;niceties&#8221; are simply missing\u2014for example, fancy interfaces or ingestion support for different file types.  It goes without saying that Ivory is a bit rough around the edges, but our philosophy is to release early and release often.  In short, Ivory is <strong>experimental<\/strong>!<\/p><\/blockquote>\n<p><!--more-->Still, it is interesting to see another tool made available publicly to help spur IR research. 
It joins a collection of search engines and search toolkits, including <a title=\"Lucene | lucene.apache.org\" href=\"http:\/\/lucene.apache.org\/\" target=\"_blank\">Lucene<\/a>, <a title=\"Lemur Toolkit Features | Lemurproject.org\" href=\"http:\/\/www.lemurproject.org\/features.php\" target=\"_blank\">Lemur<\/a>, and <a title=\"TERabyte RetrIEveR | University of Glasgow\" href=\"http:\/\/ir.dcs.gla.ac.uk\/terrier\/\" target=\"_blank\">Terrier<\/a>. At first glance, these efforts look similar. Certainly they all show their academic roots in being able to index TREC collections out of the box. It&#8217;s not clear (to me!) what the tradeoffs among the various tools are, and how well they can be applied to support interactive search.<\/p>\n<p>But it&#8217;s a trend worth following, and an experiment worth conducting to compare the various approaches with respect to the variety of information seeking tasks. Having that kind of characterization would help those deciding which tool to use for a new research or deployment project. Dimensions on which the tools could be compared include the following:<\/p>\n<ul>\n<li>Responsiveness: how well-suited is the search engine for supporting interactive use? 
What is its CPU and memory footprint?<\/li>\n<li>Expressiveness: how rich is the query language available to the client UI, whether for queries constructed directly by the user or on the user&#8217;s behalf by the client software?<\/li>\n<li>Transparency: how easy is it for client software to access low-level corpus statistics and the intermediate calculations that led to a particular ranking?<\/li>\n<li>Modularity: how easy is it for an application to replace particular parts of the indexing or retrieval algorithms with different variants designed to support situations not directly anticipated by the toolkit designers?<\/li>\n<li>Reliability: how stable is the code for use in a deployment rather than in offline experimentation?<\/li>\n<li>Quality: how good are the search results on known collections? Does the search engine excel at certain kinds of data (e.g., linked document collections, TREC, etc.), or is its ranking performance uniformly good?<\/li>\n<li>Support: does the code have an active forum where questions about its use can be answered? Are its developers responsive in generating bug fixes and answering complicated questions that the community of users is unable to field?<\/li>\n<li>Ease of use: how straightforward is it (it&#8217;s never &#8220;easy&#8221;!) to install the toolkit and to configure and integrate it with other source code? Is there a client-server model, or do you have to wrap your own server around it? Is the server architecture scalable? If not, how straightforward is it to add support for load balancing, redundancy, etc.?<\/li>\n<li>Extensibility: is the source code available and sufficiently documented to allow third parties to extend the code when necessary?<\/li>\n<\/ul>\n<p>This is only a sample from a longer list of questions, which would also include answers to all of these at different collection sizes. Some comparisons of these tools have been made (see <a title=\"Eckard, E., and Chappelier, J-C. 
(2007) Free Software for research in Information Retrieval and Textual Clustering\" href=\"http:\/\/infoscience.epfl.ch\/record\/115460\/files\/Free_sofware_for_IR.pdf\" target=\"_blank\">here<\/a> and <a title=\"Perea-Ortega, J. M., Garc\u00eda-Cumbreras, M. A., Garc\u00eda-Vega, M., and Ure\u00f1a-L\u00f3pez, L. A. (2008) Comparing Several Textual Information Retrieval Systems for the Geographical Information Retrieval Task. In Proc. 13th international Conference on Natural Language and information Systems: Applications of Natural Language To information Systems. Springer-Verlag, 142-147.\" href=\"http:\/\/dx.doi.org\/10.1007\/978-3-540-69858-6_15\" target=\"_blank\">here<\/a>), but these are somewhat <em>ad hoc<\/em>. It would be great if a methodology for comparing and evaluating these tools could be devised that would allow the creators of these tools to rate their software on these various dimensions, or for third parties to manage comparisons against a known set of benchmarks. This kind of effort would allow us to make even greater progress whether we are building ivory towers or bustling market places.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I recently read on Jeff Dalton&#8217;s blog that a new open-source search engine, called\u00a0 Ivory, has been released by Jimmy Lin. Ivory is based on Hadoop, and is\u00a0 designed to handle terabyte-sized collections. 
Unlike Lucene, this is a research project, Jimmy Lin writes, aimed at information retrieval researchers who need access to low-level data structures [&hellip;]<\/p>\n","protected":false},"author":4,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[15],"tags":[98],"jetpack_featured_media_url":"","_links":{"self":[{"href":"https:\/\/blog.fxpal.net\/index.php?rest_route=\/wp\/v2\/posts\/1472"}],"collection":[{"href":"https:\/\/blog.fxpal.net\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.fxpal.net\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.fxpal.net\/index.php?rest_route=\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.fxpal.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1472"}],"version-history":[{"count":10,"href":"https:\/\/blog.fxpal.net\/index.php?rest_route=\/wp\/v2\/posts\/1472\/revisions"}],"predecessor-version":[{"id":1480,"href":"https:\/\/blog.fxpal.net\/index.php?rest_route=\/wp\/v2\/posts\/1472\/revisions\/1480"}],"wp:attachment":[{"href":"https:\/\/blog.fxpal.net\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1472"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.fxpal.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1472"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.fxpal.net\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1472"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}