I have been curious about the evolution of research interests in the IR community for a while, and have recently decided to do something quantitative about it. My plan is to track how different aspects of the field wax and wane throughout the conference series. To start off, I decided to compare SIGIR 2010 with SIGIR 2000. This is an arbitrary starting point, but I wanted to do something topical (relevant?) to start.
I counted the ACM classification codes for all papers in the two conferences to try to get a sense for the evolution of the field. I counted each code at most once per paper (ignoring duplicates within a paper), and did not down-weight papers that had multiple codes. I counted only full papers and keynote talks, not demos, posters, or other short papers.
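The counting rule above can be sketched in a few lines of Python. This is only an illustration of the procedure, not the script behind the numbers in this post (those were tallied by hand); the function names and the toy input are made up for the example.

```python
from collections import Counter


def count_codes(papers):
    """Tally ACM codes, counting each code at most once per paper.

    `papers` is a list of lists: one inner list of code strings per paper.
    """
    counts = Counter()
    for codes in papers:
        counts.update(set(codes))  # set() drops duplicates within a paper
    return counts


def summarize(papers):
    """Compute the kind of summary statistics reported in the table."""
    counts = count_codes(papers)
    total = sum(counts.values())  # total class assignments
    top_code, top_freq = counts.most_common(1)[0]
    by_letter = Counter()
    for code, n in counts.items():
        by_letter[code[0]] += n  # top-level category: G, H, I, J, ...
    return {
        "papers": len(papers),
        "classes": total,
        "classes_per_paper": total / len(papers),
        "unique_classes": len(counts),
        "top_code": top_code,
        "top_freq": top_freq,
        "top_normalized": top_freq / total,
        "pct_by_letter": {k: 100 * n / total for k, n in by_letter.items()},
    }


# Toy input, invented for illustration only (not the SIGIR data).
toy = [
    ["H.3.3", "H.3.3", "I.2.7"],  # duplicate H.3.3 is counted once
    ["H.3.3", "H.3.4"],
    ["H.3"],
]
stats = summarize(toy)
print(stats)
```

Papers with several codes contribute several counts, which matches the "no discounting" choice described above.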
Here are some summary statistics for the two years; the full data that I used is here.
Category | 2000 | 2010 |
# papers | 40 | 90 |
# classes | 83 | 132 |
# classes/paper | 2.1 | 1.5 |
# unique classes | 38 | 42 |
Most common class | H.3 | H.3.3 |
Most common class freq | 14 | 23 |
… normalized | 0.17 | 0.17 |
2nd most common class | H.3.4 | H.3.3 |
2nd most common class freq | 5 | 19 |
… normalized | 0.06 | 0.14 |
% G classes | 2.4 | 1.5 |
% H classes | 65 | 87 |
% I classes | 30 | 8.3 |
% J classes | 1.2 | 1.5 |
% unique G classes | 5.3 | 4.8 |
% unique H classes | 50 | 71 |
% unique I classes | 39 | 17 |
% unique J classes | 2.6 | 2.4 |
The main observation is that the 2000 classifications were more diverse: there were 38 unique classes across 40 papers in 2000, and only 42 across 90 in 2010. While the most frequent class accounted for about the same fraction of all class assignments in both years (17%), the distribution flattened out more quickly in 2000 than in 2010: the second most frequent class accounted for 14% of assignments in 2010 vs. 6% in 2000. While some of this diversity came from within-paper factors (2000 papers had 2.1 classes per paper on average, while 2010 papers had only 1.5), it seems that there was between-paper diversity as well. It’s also possible that because these codes were relatively new in 2000, people hadn’t figured out consistently where to situate their work. But these are computer scientists we’re talking about, and figuring out a rather shallow taxonomy is not that difficult.
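For readers who want to check the arithmetic, the derived figures can be recomputed from the raw counts in the table. The sketch below uses only numbers from the table; the `unique_per_paper` ratio is an extra convenience measure that does not appear in the table itself.

```python
# Raw counts copied from the summary table.
table = {
    2000: {"papers": 40, "classes": 83, "unique": 38, "top": 14, "second": 5},
    2010: {"papers": 90, "classes": 132, "unique": 42, "top": 23, "second": 19},
}


def derived(row):
    """Recompute the ratios quoted in the text from the raw counts."""
    return {
        "classes_per_paper": round(row["classes"] / row["papers"], 1),
        "unique_per_paper": round(row["unique"] / row["papers"], 2),
        "top_share": round(row["top"] / row["classes"], 2),
        "second_share": round(row["second"] / row["classes"], 2),
    }


results = {year: derived(row) for year, row in table.items()}
for year, values in results.items():
    print(year, values)
```

The shares are fractions of all class assignments in a given year, not of papers.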
The other thing to note is the shift from a better balance between H and I classes in 2000 (65% vs. 30% of class assignments) to much more emphasis on H in 2010 (87% vs. 8.3%). I think this corresponds roughly with a move away from symbolic processing to more statistical approaches.
Another aspect worth investigating in this data is how much these shifts are caused by technological progress impacting algorithm selection, and how much is signaling (in the sense used by Chen and Konstan) about the values held by the community. To what extent is the focus on H.3 (INFORMATION STORAGE AND RETRIEVAL) caused by technological possibilities vs. the methodological narrowing of a field? A comparison with trends in other related conferences might be useful here.
Of course all of this analysis is preliminary, and I should be looking at trends over the years, rather than just picking two at random. For the moment, however, I am satisfied with these results, given the rather tedious effort of copying the classifications from the ACM DL pages.
This copying was time-consuming for sure, but also frustrating because the numbers I collected by hand should be easily obtainable with a faceted query. This kind of processing would be trivial in Solr, for example. But the ACM DL, while it does have faceted filtering, only returns lists of papers, not classification counts. I am hoping that for further work, I will be able to obtain the metadata for this conference series (and also for some related conferences such as JCDL and CIKM) in some machine-readable form, which I could then index properly to test hypotheses more easily.
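To make the Solr remark concrete, here is a minimal sketch of what such a facet query might look like if the metadata were indexed. The host, the core name `sigir`, and the field names `acm_class` and `year` are all hypothetical; the request parameters (`facet`, `facet.field`, `facet.mincount`, `facet.limit`, `rows`, `json.nl`) are standard Solr ones. The sample response is invented to show the shape of the result, not real data.

```python
from urllib.parse import urlencode


def facet_url(base, field, year):
    """Build a Solr select URL that returns only facet counts for `field`."""
    params = {
        "q": "*:*",
        "fq": f"year:{year}",   # hypothetical field holding the conference year
        "rows": 0,              # no documents needed, only the counts
        "facet": "true",
        "facet.field": field,
        "facet.mincount": 1,
        "facet.limit": -1,      # return every class, not just the top few
        "wt": "json",
        "json.nl": "flat",      # facet counts come back as [term, count, ...]
    }
    return f"{base}/select?{urlencode(params)}"


def parse_facets(response, field):
    """Turn Solr's flat [term, count, term, count, ...] list into a dict."""
    flat = response["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))


# Hypothetical core and field names.
url = facet_url("http://localhost:8983/solr/sigir", "acm_class", 2010)

# Invented response fragment, for illustration only.
sample = {"facet_counts": {"facet_fields": {"acm_class": ["H.3.3", 2, "I.2.7", 1]}}}
class_counts = parse_facets(sample, "acm_class")
print(url)
print(class_counts)
```

One request per year would then replace the manual copying, with the per-class counts returned directly.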