In his recent post, Daniel Tunkelang issued a call for renewed interest in recall as a measure of performance of information retrieval systems, particularly for exploratory search tasks. It is interesting to note that there are several possible ways to measure recall and precision for interactive tasks, and which measure you should use depends on what aspect of the entire human-computer system you are interested in.
Consider a (ranked) set of documents identified by a search engine in response to a query. The set could contain hundreds, thousands, or even millions of results, which could (given some ground truth) be used to compute recall and precision. But is that a sensible thing to do? If you are interested in how well the search algorithm worked, the answer is probably yes. If, however, you’re interested in how well the user interface worked, this measure is flawed because many (quite often most) of the documents identified in response to a query are never shown to the user, and so have no bearing on how well the interface performed. In my PhD thesis, published in part here, I proposed modified measures of recall and precision to reflect the interactive search experience.
I modified the recall and precision measures by normalizing not by the total number of documents retrieved, but by the total number of documents viewed by the user. While at first blush this may seem like equating precision with recall, that is not the case when documents are presented as clusters or in other non-linear ways appropriate for sets.
Furthermore, if you are interested in measuring something about the user’s effectiveness in using a system, you might also want to measure selected recall and precision, where these scores are normalized by the number of documents the person marked or judged as relevant.
For example, in an information seeking experiment that varied the presentation style of search results, I found that viewed recall and precision increased as more documents had their contents displayed automatically in response to queries.
Thus we can unpack the recall and precision measures into three distinct measurements: retrieved (traditional) recall and precision for system performance, viewed recall and precision for interface performance, and selected recall and precision for user behavior. Choose your measure wisely!
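To make the distinction concrete, here is a minimal sketch of how the three pairs of scores might be computed for a single query. It rests on assumptions that are mine for illustration: each measure is computed against the same ground-truth set of relevant documents, precision is normalized by the size of the retrieved, viewed, or selected set, and recall is normalized by the number of relevant documents. The exact definitions in the thesis may differ, and the function name and document IDs are hypothetical.

```python
def precision_recall(relevant, shown):
    """Return (precision, recall) of the set `shown` against ground truth `relevant`.

    Assumed definitions (not taken from the thesis):
      precision = |relevant & shown| / |shown|
      recall    = |relevant & shown| / |relevant|
    """
    relevant, shown = set(relevant), set(shown)
    hits = len(relevant & shown)
    precision = hits / len(shown) if shown else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data for one query.
relevant = {"d1", "d2", "d3", "d4"}                                # ground truth
retrieved = {"d1", "d2", "d3", "d5", "d6", "d7", "d8", "d9"}       # returned by the engine
viewed = {"d1", "d2", "d5", "d6"}                                  # actually shown to the user
selected = {"d1", "d5"}                                            # marked relevant by the user

scores = {
    name: precision_recall(relevant, docs)
    for name, docs in [("retrieved", retrieved), ("viewed", viewed), ("selected", selected)]
}

for name, (p, r) in scores.items():
    print(f"{name:>9}: precision={p:.3f} recall={r:.3f}")
```

With this toy data the three pairs differ, which is the point: the same query yields different scores depending on whether you are evaluating the engine, the interface, or the user's judgments.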