Have queries, want answers


Sarah Vogel's comment on yesterday's post got me thinking about recall-oriented search. She wrote about preferring Boolean queries for complex searches because they gave her a sense of when she really had exhausted a particular topic, something that's often required for medical literature reviews. But there are really two problems here that it may be useful to decouple: one is coverage (did we find all there was to find?) and the other is ranking (the order in which the documents are shown).

A perfect query would retrieve all the required documents, and no others; at that point, order wouldn't matter a whole lot. A slightly less perfect query might still retrieve all the required documents, but might also throw in some that aren't required. If it gets the order right, though, all the useful documents will precede the others. But this outcome relies on two things: having the right query terms, and having the right ranking function. (Of course, for most real topics you'll need multiple queries, but I'm building a straw man.)
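To make the distinction concrete, here is a toy sketch in Python (document IDs and judgments invented) that scores coverage as recall and checks the ordering condition that every useful document precedes the rest:

```python
# Toy sketch: coverage vs. ordering, with made-up document IDs and judgments.

def precision_recall(retrieved, relevant):
    """Coverage: what fraction of the relevant set we found, and how clean the result is."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

def relevant_first(ranking, relevant):
    """Ordering: True only if every relevant document precedes every non-relevant one."""
    seen_nonrelevant = False
    for doc in ranking:
        if doc in relevant:
            if seen_nonrelevant:
                return False
        else:
            seen_nonrelevant = True
    return True

relevant = {"d1", "d2", "d3"}
ranking = ["d1", "d2", "d9", "d3"]          # full coverage, imperfect ordering
print(precision_recall(ranking, relevant))  # (0.75, 1.0)
print(relevant_first(ranking, relevant))    # False: d9 outranks d3
```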

Back to the PubMed Search Strategies queries I wrote about yesterday: they represent considerable effort to construct an expression that gives good coverage for a particular concept, but when used with PubMed they suffer because the ranking isn’t based on the degree of match.

Predicting useful documents

One set of experiments I would like to run is to take a search built from several such concepts, look at the final set of documents judged to be useful, and analyze how the different concept expressions contributed to that set. My guess is that the expressions for each concept have some redundancy, and it would be interesting to understand that redundancy. In other words, given

  • a decent ranking function that takes the quality of a match into account (even if the documents are selected using a Boolean expression), and
  • sets of OR-ed terms that represent each concept (leaving the ANDs in place)

how close can we get to the ideal set of results (as judged by the person who did the comprehensive search) by adding or removing terms from the sets characterizing the constituent concepts?

Will we find that only a few terms are enough, taken in combination, to produce the required set of documents? Or will we find it difficult to get good recall without retrieving large sets of documents? Furthermore, will a solution to one combination of concepts be sufficiently similar to those of other sets so that we can generalize the results? If we cannot automate this process, will there still be clear points at which a person’s insight could guide the search? These questions are vague in the absence of concrete examples, but they are probably a good starting point for more concrete discussion.
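To give these questions some shape, here is a rough sketch of the kind of experiment I have in mind, with an invented term-to-document index and invented judgments standing in for real PubMed data. Each concept's OR list is treated as a greedy set-cover problem over the judged-useful documents; the pruned concepts are then AND-ed back together and compared against the full expression on recall and result-set size:

```python
# Rough sketch of the term-redundancy experiment; the term-to-document index and
# the judged-useful set are invented stand-ins, not real PubMed data.

def greedy_cover(terms, judged_useful, index):
    """Greedily keep the terms that cover the most still-uncovered useful documents."""
    chosen, uncovered = set(), set(judged_useful)
    while uncovered:
        best = max(terms - chosen,
                   key=lambda t: len(index.get(t, set()) & uncovered),
                   default=None)
        if best is None or not (index.get(best, set()) & uncovered):
            break  # no remaining term covers anything new for this concept
        chosen.add(best)
        uncovered -= index[best]
    return chosen

def retrieve(concepts, index):
    """Boolean AND of OR-ed term sets: a document must match each concept somewhere."""
    per_concept = [set().union(*(index.get(t, set()) for t in terms))
                   for terms in concepts]
    return set.intersection(*per_concept) if per_concept else set()

# Hypothetical two-concept query and relevance judgments.
index = {"aspirin": {1, 2, 3}, "acetylsalicylic acid": {3, 4, 6},
         "bleeding": {2, 3, 4, 7}, "haemorrhage": {4, 5, 6}}
concepts = [{"aspirin", "acetylsalicylic acid"}, {"bleeding", "haemorrhage"}]
useful = {2, 3, 4}

pruned = [greedy_cover(terms, useful, index) for terms in concepts]
full, reduced = retrieve(concepts, index), retrieve(pruned, index)
# Pruning keeps both aspirin terms but drops "haemorrhage": same recall, smaller set.
print(pruned)
print(len(full & useful), len(reduced & useful), len(full), len(reduced))
```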

Keywords vs. MeSH

Another class of questions I think could be answered with this data relates to the differences between keyword and MeSH expressions of concepts. Sarah Vogel commented that the quality of the index terms varies across MeSH. If the quality of the index could be characterized at some gross level, then it could be used as an independent variable in evaluating the effectiveness of MeSH for expressing queries.

Also, it would be interesting to understand the differences between the MeSH queries and the keyword queries on the PubMed Search Strategies site. While the two kinds of queries were often paired, they didn't retrieve the same set of documents. If the person who created the queries thought them to be similar, are their differences important? Are the documents retrieved by both queries (keyword and MeSH) somehow more useful, more central to the concept?

Related to this issue is the finding that there is no evidence that MeSH-based queries perform better on average than keyword-based queries, as I wrote in an earlier post. This means that MeSH queries and keyword queries retrieve about the same number of useful (relevant) documents. But do they retrieve the same documents, or different subsets? If the latter, what can we learn about the queries to figure out how to combine them to identify a larger subset of relevant documents? It would also be interesting to see whether conclusions about MeSH carry over in a meaningful way to other thesauri.
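A small sketch of the comparison I mean, again with invented PMIDs and judgments; the interesting numbers are how much of the relevant set each query reaches on its own and how much only their union reaches:

```python
# Sketch of the MeSH vs. keyword overlap question; PMIDs and judgments are invented.

def overlap_report(mesh_docs, keyword_docs, relevant):
    mesh_docs, keyword_docs, relevant = set(mesh_docs), set(keyword_docs), set(relevant)
    both, union = mesh_docs & keyword_docs, mesh_docs | keyword_docs
    return {
        "recall_mesh":     len(mesh_docs & relevant) / len(relevant),
        "recall_keyword":  len(keyword_docs & relevant) / len(relevant),
        "recall_union":    len(union & relevant) / len(relevant),
        "jaccard_overlap": len(both) / len(union) if union else 0.0,
        "relevant_found_by_only_one": len((relevant & union) - both),
    }

# Each query finds a similar *number* of relevant documents, but not the same ones,
# so the union does noticeably better than either query alone.
print(overlap_report(mesh_docs={1, 2, 3, 8},
                     keyword_docs={2, 3, 4, 9},
                     relevant={1, 2, 3, 4}))
```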

The point of this rambling post is that where there are data, there are interesting questions to ask and answer, and this particular set of queries (with the promise of more to come) intrigues me. But these queries are mere shadows on a wall, and it would be considerably more interesting to engage with people who construct such expressions to understand better how they look for information, and how to make them more productive.


9 Comments

  1. Sarah Vogel says:

    Hi Gene,

Me again – you ask such interesting questions! I should preface my comments below with the disclaimer that I don't claim to be an expert PubMed searcher. I generally use the Medline data on search systems that provide access to other databases and more powerful search tools (Ovid, Dialog, Datastar, STN primarily). Most of my comments will still be relevant but you should keep this in mind.

I like your separation of retrieval & ranking into distinct concepts to be explored separately. As a purely practical matter, when I do a search (and I think this is true for most of my colleagues as well), I become a human ranking engine and provide results in an order that I think makes sense based on what I understand of the original question and why it is being asked. I'm not sure that this is as easy as it looks because there is a lot of background information that goes into this ranking that I haven't provided to PubMed, such as why the question is being asked, the questioner's role (scientist, clinician, statistician, etc.), who the hot research groups are, and so on. However, given the power of statistical analysis and other tools at your disposal, there may be a way you wizards can approximate this in the interface if you can figure out which things matter the most.

    [For me personally, ranking on PubMed is a nice-to-have but not a needs-to-have because I very rarely search only PubMed. I’d much rather have some sort of easy-to-use (!) gadget that would work with results from multiple databases and spit me out a bibliography in a good ranked order.]

    My guess is that those pharmacoepidemiology searches you mention are two different stabs at the same concept for the same search question. Typically when I run a search in PubMed I use both a keyword strategy and MeSH strategy and/or combine elements of both. If the indexing works for my topic, MeSH can help me target in on really great articles really fast (for example adverse effects of a drug). However if I rely only on MeSH, I may miss a lot of articles because they either haven’t been indexed yet (really new) or weren’t indexed as an adverse effect in the first place (we could spend hours talking about THAT issue!). If I feel like I need to be more comprehensive, I would start working on a search that combined keywords and MeSH terms to increase the recall. Using MeSH appropriately in this setting can help you keep your search targeted in the right subject domains.

    In my experience, if MeSH has indexing that works for my concept(s), usually (though not always) most of the best documents are returned with a good MeSH search and I generally don’t get too much junk. However it almost always misses some really good documents that can be retrieved with a good keyword search. MeSH frequently doesn’t have the indexing I need and in those cases, I would primarily use a keyword search though I might use MeSH to help me keep my retrieval in the right general subject domains.

I would be very interested in what kinds of measures you would look at to see if your MeSH conclusions carry over in a meaningful way to other thesauri. From the searcher's point of view, if I don't need total recall, I often select a database based on how well it covers the subject matter and indexes that information. On the surface, Medline & Embase are very similar databases, but they can index the same article quite differently, so you can get pretty different results. This isn't necessarily a bad thing, so it would be interesting to see how you would go about comparing them.

  2. If it were possible to represent your context for a search (questioner’s role, who are the hot research groups, etc.) to the search engine, then it might be possible to use that information to identify promising documents. But given that searchers’ notions of relevance (usefulness) of documents vary from person to person, and may even evolve during a search session, such representation may not be tractable.

    On the other hand, being able to order documents retrieved from multiple databases into a single list to produce a ranked bibliography is straightforward.
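
    Here is a minimal sketch of one standard way to do the merging: reciprocal rank fusion, which combines ranks rather than incomparable engine scores. The record IDs below are invented, and a real tool would also need to deduplicate records across databases.

    ```python
    # Merge per-database rankings into one bibliography with reciprocal rank fusion.
    # Record IDs are invented; deduplication across databases is not handled here.

    from collections import defaultdict

    def reciprocal_rank_fusion(rankings, k=60):
        """Each document scores 1 / (k + rank) in every list it appears in; sum and re-sort."""
        scores = defaultdict(float)
        for ranking in rankings:
            for rank, doc in enumerate(ranking, start=1):
                scores[doc] += 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    pubmed = ["pmid:11", "pmid:42", "pmid:7"]
    embase = ["emb:301", "pmid:42", "emb:55"]
    print(reciprocal_rank_fusion([pubmed, embase]))  # pmid:42 rises: both databases return it
    ```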

    A thesaurus's utility with respect to retrieval can be assessed by the degree to which it helps to identify useful (relevant) documents. It would be interesting to compare two thesauri that index the same (or similar) collection over a range of queries to see what the tradeoffs are for using one versus the other. In particular, it should be possible to identify the subset of useful documents (with respect to a given information need) present in both collections and to measure what fraction of that subset the different thesauri help to identify. My guess is that neither will get all of them, and that a combined approach would increase recall.

  3. […] terms in medical literature and if MeSH actually adds any value.  (This discussion continues on to medical literature searching in general). Included in the discussion is a link to an old paper demonstrating that MeSH shows statistically […]

  4. Just to poke at one of your implicit assumptions: there are only a few situations where we want to retrieve “all the relevant documents” (lawsuits, CIA work). More often, we just want to retrieve any set of documents that answers our information need. There may be many distinct but overlapping sets of documents that do so. The goal of your ranking should be for a complete set to appear as early as possible in the ranking. But this may call for interesting models of overlap—there’s no need to retrieve redundant documents. We played with this a little bit in our paper on retrieving _fewer_ relevant documents a few years ago at SIGIR.

  5. David, you raise a good point about redundancy, and one related question is how reliable redundancy identification is. If searchers trust the algorithm (or the stakes are low), then omitting some results may be the right thing to do to reduce complexity.

    But it seems that for certain kinds of medical searches, redundancy has to be judged by the searcher. In that case you may still be able to help the person by identifying potential redundancy in some useful way, but searchers will still expect (or perhaps even be required) to see the entire result set.

  6. The biology researchers are putting in tremendous amounts of effort doing search compared to how easy we have it in computer science. They seem to have heroic tolerance for the pain of low-precision search, often spending hours or even days and sometimes weeks on a particular information-seeking need. Some consortiums, like Wormbase, are doing this collectively on a more organized basis with professional librarians involved.

    We’re currently working on an NIH-funded grant to do high-recall searches for specific entity and relational queries, such as “all references to the human gene SERPINA3” or “all relations between mutant genes and diseases”. This is surprisingly hard given the number of high-quality curated resources like Entrez-Gene and the Gene Ontology (GO). Especially if you want to plug-and-play quickly-bootstrapped categories and relation types.

    Legal searches are very similar to medical ones in their complexity and specificity. So I’m guessing proximity would be high on the wish list for professional librarians doing search over full-text docs.

    We actually started out working for DARPA building tools for intelligence analysts. They were very, very unhappy when a document they searched for using Boolean search didn't show up in someone's fancy tool. They loved that we just plugged in Lucene (which did have relevance ranking) in addition to our entity-faceted search. So even if you have some kind of diversity-based or cluster-based presentation, everything has to be there for these searchers.

  7. Thanks for the Wormbase reference: I hadn’t heard of them. Do you know if the wormbase site has predefined queries that I can look at? I wasn’t able to find them by casual inspection on their web site. I guess I didn’t spend enough time digging :-)

    My impression from having talked to and read stuff by librarians is that they are leery of techniques that rely on thresholds (e.g., proximity, probability) because they are afraid that something relevant will be cut off.

  8. Sarah Vogel says:

    I so agree on the need for proximity searching, especially in full text documents! Frankly it's one of my biggest issues in searching Google Scholar. While I'm leery of automatically applied thresholds, I use proximity searching all the time in the professional search systems such as Dialog, Datastar & STN – even in databases which only have abstracts. The best systems give me the control to choose my own thresholds, which vary based on the specific search. I select the thresholds by running test searches to see what I miss with various thresholds and selecting the one that gives me pain I can stand. (Though I admit to having my own standard defaults which probably wouldn't stand up to a rigorous statistical evaluation!)

    I agree that I’m not at all fond of probability thresholds when doing a search that needs high recall because I can’t control the results. This makes it hard to explain to my clients what we might be missing.

    Bottom line for me is that I need to understand the search engine enough to be able to advise my clients on what more could be done. Black boxes (like Google) are useful for searches like David describes above; I also need places to go that give me more options.

    And on the subject of fulltext searching in the scientific literature: why, oh why, won't most systems give me some limited field searching? Like let me search for the words only in the Methods section. Or search only in the title, abstract & text of the article (i.e., exclude author addresses, references, etc.). I pull up so much irrelevant crap on Google Scholar because my term is only mentioned in an article title in the references.

  9. […] discussion among commenters on a post about PubMed search strategies raised the issue of how people need to make sense of the results that a search engine provides. For […]
