What a tangled MeSH we weave

William Webber recently wrote an interesting analysis of the reports of the original Cranfield experiments that were so influential in establishing the primacy of evaluation in information seeking, and in particular a certain kind of evaluation methodology around recall and precision based on a ground truth. One reason that the experiments were so influential was that they provided strong evidence that previously-held assumptions about the effectiveness of various indexing techniques were unfounded. Specifically, the experiments showed that full-text indexing outperformed controlled vocabularies. While this result was shocking in the 1950s, 50 years later it seems banal. Or almost.

My initial reaction after reading this post was that sure, this was true for general searches, but for specialized domains there were advantages to using a controlled vocabulary. For example, PubMed uses MeSH to augment (expand) users’ queries behind the scenes, and that system is used by millions of people, sometimes in life-or-death situations. Surely its continued use is a strong indicator of its effectiveness.
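
To make the mechanism concrete, here is a toy sketch of what that kind of behind-the-scenes expansion amounts to. The synonym table and the clause structure are illustrative assumptions on my part; PubMed’s real Automatic Term Mapping consults much larger translation tables and produces considerably richer queries.

    # Toy sketch of MeSH-style query expansion; the synonym table is hypothetical
    # and the real PubMed mapping is far more elaborate.
    MESH_TRANSLATION = {
        "heart attack": "myocardial infarction",
        "cancer": "neoplasms",
    }

    def expand_query(user_query):
        """OR the user's text words together with the mapped MeSH heading, if any."""
        text_clause = '"%s"[All Fields]' % user_query
        heading = MESH_TRANSLATION.get(user_query.lower())
        if heading is None:
            return text_clause
        return '"%s"[MeSH Terms] OR %s' % (heading, text_clause)

    print(expand_query("heart attack"))
    # "myocardial infarction"[MeSH Terms] OR "heart attack"[All Fields]

The point of the extra [MeSH Terms] clause is to retrieve articles that the indexers tagged with the heading even when the authors never used the searcher’s words.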

But is it effective? Has the use of MeSH for query expansion been evaluated? I turned to William Hersh’s third edition of “Information Retrieval: A Health and Biomedical Perspective”, which has a whole section (4.3) on controlled vocabularies, including a very nice description of MeSH. But there wasn’t much there about the benefits to the user of this (or any other) controlled vocabulary. In another chapter (section 7.4), Hersh talks about some studies that compared full-text vs. abstract searching, and reports briefly on the results of a study by McKinin, Sievert, et al. (1991) that compared MEDLINE searching with indexing terms vs. text words. The results, as summarized by Hersh, suggest essentially no difference in recall (42% for indexing terms vs. 41% for full text), and a considerable disadvantage for indexing terms in precision (55% vs. 62%). Surprisingly, there is no discussion of the significance of these results.

I then poked around on the web and, after some digging, found a recent article by Lu et al. that claims that

… to the best of our knowledge, this is the first formal evaluation on the benefits of applying the query expansion technique to a daily operational search system as opposed to retrieval systems mainly designed and tested in research laboratories. Therefore, our analysis plays a critical role in the understanding of future technology development needs for PubMed, along with its mission to better fulfill the information needs of millions of PubMed users.

The authors say that while MeSH-based query expansion has been implemented and evaluated in several research systems in the context of the TREC Genomics ad hoc retrieval track, the results reported by those studies are mixed, with some reporting improved performance and others reporting no improvement or even worse performance with the inclusion of the expansion terms from thesauri.

The authors then describe an evaluation they performed on the same corpus. They constructed full-text and MeSH-expanded queries for 55 topics from the 2006 and 2007 TREC Genomics ad hoc retrieval tasks, and used the F-measure to evaluate the resulting performance. For the 2006 data (21 topics), they found better performance for expanded queries on 7 topics, worse on 5, and unchanged on 9; these results were not statistically significant. For the 2007 data (34 topics), they found benefits to expansion on 20 topics, with 8 performing worse and 6 unchanged; these results were statistically significant.
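
For those who don’t have the formula at their fingertips, the F-measure is just the harmonic mean of precision and recall, so it rewards a method only if a gain on one side isn’t wiped out by a loss on the other. A minimal sketch, plugging in the McKinin et al. numbers quoted above purely for illustration:

    def f_measure(precision, recall, beta=1.0):
        """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
        if precision == 0.0 and recall == 0.0:
            return 0.0
        b2 = beta * beta
        return (1.0 + b2) * precision * recall / (b2 * precision + recall)

    # Using the McKinin et al. figures quoted earlier, just to show the trade-off:
    print(round(f_measure(0.62, 0.41), 3))  # text words:     0.494
    print(round(f_measure(0.55, 0.42), 3))  # indexing terms: 0.476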

When the results were compared using precision at 5, 10, and 20 documents (measures useful for indicating how likely a person is to actually see a document retrieved by the system), they were mixed, with no clear advantage to either method. In the end, the authors conclude that while thesaural expansion tends to improve recall somewhat, the improvement comes at the expense of precision, just as McKinin et al. had found.
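
Precision at k is the simplest of these measures: the fraction of the top k ranked documents that are relevant, which is roughly what a searcher scanning the first page of results experiences. A minimal sketch, with an invented ranking and invented relevance judgments:

    def precision_at_k(ranking, relevant, k):
        """Fraction of the top-k ranked document ids that are judged relevant."""
        return sum(1 for doc_id in ranking[:k] if doc_id in relevant) / float(k)

    # Invented ranking and relevance judgments, purely for illustration.
    ranking = ["d3", "d7", "d1", "d9", "d4", "d8", "d2", "d6", "d5", "d0"]
    relevant = {"d1", "d2", "d3", "d4"}
    print(precision_at_k(ranking, relevant, 5))   # 0.6
    print(precision_at_k(ranking, relevant, 10))  # 0.4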

So the lessons of the Cranfield experiments seem to remain current even after 50 years of progress in information seeking algorithms. What’s even more surprising, however, is the lack of a solid body of work that characterizes the effectiveness of such an established thesaurus.

Update: Fixed typo in an author’s name.

14 Comments

  1. The TREC Genomics “topics” are not at all representative of queries — they’re just textual descriptions of information needs that are given once and for all. Real queries are interactive, and tend to be highly structured with boolean and field information. This can work OK for some specific information needs, but when you have high throughput assays, there’s really no current approach to search that works.

    F measure is very misleading, as is precision at 20. The biologists are willing to trade precision for recall — they’ll patiently sort through result sets with 10% precision after spending days tuning queries. What they don’t want to do is spend a year in the lab and analysis software repeating an experiment that’s already been done.

    What I’d like to see is precision at 99% recall, or even at 95% recall, used as an evaluation. The problem is that it doesn’t work with TREC’s pooled evaluations, which can only evaluate precision.

    In medicine, I imagine the search needs are very very different.

    My wife’s lab at NYU is currently looking for articles that are about variant alleles of yeast genes that have homologs in mouse and have been subjected to a temperature-sensitive expression assay with a conclusion about whether the allele is temperature sensitive or insensitive. This is a very structured relational information need.

    The Homologene database gives you homologies between genes of different species, Entrez-Gene gives you genes, aliases and functional descriptions (broken out by species), and Uniprot gives you a protein database like Entrez-Gene. The simple problem of identifying that a phrase refers to a yeast gene is tricky, because homologous genes tend to get named the same thing in different species. One problem is that aliases of genes can be acronyms like AT, which are then written as “at” if they’re in a species that lowercases gene mentions. Even seemingly discriminative acronyms are hard to find given the prevalence of acronyms in biomedical jargon.

    I think the most broken aspect of PubMed is that they don’t have snippets. That makes it very time consuming to do exploratory search as you go back and forth between the title/overview and abstracts with no highlighting of search terms.

  2. I meant to add that this reminds me of the experiments on stemming, which are also inconclusive. I summarize C. D. Paice’s and David Hull’s detailed analyses of various stemmers in a LingPipe blog post:

    To Stem or Not to Stem?

    These are also with TREC-like queries, I’m afraid, but there’s a lot of detailed error analysis that’s well worth reading.

    By the way, stemming or synonym expansion, when they’re useful, also improve precision at K by moving more relevant documents up in the list.

  3. Thanks for the great comments, Bob!

    The paper I talked about actually used the TREC topics to construct both probabilistic and boolean queries representative of the kinds of queries that are run on PubMed, and they did some validation of their approach against queries mined from the PubMed log. I decided to omit those details, but the paper seemed thorough in that respect.

    With respect to your observation about stemming, that thought occurred to me as well! I just couldn’t be bothered to do the searching to find those references when I was writing the post at 1 am :-)

  4. @ian_soboroff pointed out (via twitter) that

    … re your concl, lots of work shows that thesaurus exp is a mixed bag at best. Cf lots of WordNet QE work.

    I have seen those mixed results as well, but they surprise me less with respect to WordNet because it seems to me that the data there are inherently more ambiguous than in MeSH, and the assignment of MeSH terms to articles should be more considered (on average) than the choice of words in an arbitrary document.

    But those details aside, what surprises me is the lack of experimental testing of the assumption that these terms are indeed useful.

  5. One more reference I just ran across, because we’re thinking about building a MeSH classifier and document tagger:

    Dolf Trieschnigg et al. 2009. MeSH Up: effective MeSH text classification for improved document retrieval. Bioinformatics.

    They conclude MeSH is helpful for genomics IR (also using the TREC collections), but they take a rather roundabout approach compared to just thesaurus expansion, using KNN classifiers and restructuring of queries.

    PS: Gene: Which paper were you talking about that formed PubMed-like queries from TREC topics?

  6. Bob,

    It’s the Lu paper: Zhiyong Lu, Won Kim and W. John Wilbur. (2009) Evaluation of query expansion using MeSH in PubMed. Information Retrieval 12(1), Feb 2009. pp. 69-80
    DOI 10.1007/s10791-008-9074-8

    available online here

  7. There is in my experience a lack of really solid survey papers and meta-analyses on a whole series of these sorts of results — which is why posts like yours are particularly valuable and thought-provoking. Much of the received wisdom on these issues is floating around as common knowledge in the heads of senior researchers in the field, but if it never gets written down, then that is where it stays; and of course if it’s not written down, it can’t be critically assessed.

    An example that I came across recently where this sort of reflective survey work had been performed was in Karen Sparck Jones, “Further reflections on TREC” (IPM 2000) (Google scholar will bring up a link). She asks a series of questions of the form “What has TREC taught us about X?” (see from page 63 onwards); for instance, “Is manual query formation better than automatic? Is adaptive tuning to a collection valuable?”, and so forth. The idea that one should seek to draw such conclusions from an effort like TREC might seem to an outsider an obvious one, but in fact it is surprisingly rarely done. (Or perhaps I’m just not reading the right papers.)

  8. William, sounds to me like you’ve got a good candidate for a special issue of a journal. What do you think about an issue around meta-analysis of TREC or IR in general to revisit the kinds of issues you’ve raised in your posts? I’ve got just the right venue in mind, too.

  9. […] link is being shared on Twitter right now. @hcir_geneg said Posted "What a tangled MeSH we […]

  10. I’m intrigued! I can certainly think of some possible topics:

    – what have we really learnt from TREC? (pretty broad, I know)
    – has ad-hoc retrieval technology stopped improving and, if so, when?
    – when is ranked retrieval superior to Boolean and vice versa?
    – what techniques has the web track of TREC demonstrated to be effective?

    plus your remarks on controlled languages and thesauri.

  11. Also what we’ve learned about interactivity in search, perhaps something about the effectiveness of ranked lists, on people’s understanding of relevance feedback, etc. Tackling all sacred cows of IR might be too much, but certainly a critical review of some key ideas would be a great contribution! Who else do you think would be interested in doing this?

  12. […] is a fascinating discussion ongoing over at fxpal between librarians and information researchers regarding the use of MeSH […]

  13. […] MeSH-based queries perform no better on average than keyword-based queries, as I wrote in an earlier post. This means that MeSH queries and keyword queries retrieve about the same number of useful […]
