Daniel Tunkelang’s recent post on Twitter search got me thinking about what an HCIR geek would do, which produced the following random thoughts.
First, we should start with tasks. What kinds of information do people want to find in tweet streams? Do they want to find a document that’s been referenced? Do they want information about an event? Are they interested in finding a community of interest? What other useful tasks are there with respect to this stream?
How tweets should be indexed probably depends on what kinds of tweets they are. Twitter seems to be used for a variety of purposes, and the different purposes are likely to have different statistical characteristics, making more than one approach to indexing appropriate. For example, some tweets mention a URL, whereas others don’t. Should the contents of the page referred to by the URL be indexed with the tweet? Probably. What weight you give the document text will depend on whether your goal is to find the tweets or the documents being tweeted about. Tweets that don’t have URLs should not share statistics directly with those that do because document content may overwhelm tweets.
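As a rough illustration of the weighting question above, here is a minimal sketch that scores a tweet against a query using separate, length-normalized statistics for the tweet text and for the linked page. The function names and the `page_weight` parameter are assumptions made up for this example, not part of any existing system; a real index would use a proper ranking function.

```python
from collections import Counter


def field_score(query_terms, text):
    """Fraction of the field's tokens that match a query term."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(counts[t] for t in query_terms) / total


def score(query, tweet_text, page_text="", page_weight=0.3):
    """Blend tweet and linked-page evidence (page_weight is hypothetical).

    Each field is normalized by its own length, so a long page
    cannot swamp a short tweet simply by containing more words.
    """
    terms = query.lower().split()
    tweet_part = field_score(terms, tweet_text)
    if not page_text:
        # No URL in the tweet: score on the tweet text alone.
        return tweet_part
    page_part = field_score(terms, page_text)
    return (1 - page_weight) * tweet_part + page_weight * page_part
```

Raising `page_weight` favors finding the documents being tweeted about; lowering it favors finding the tweets themselves.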
Another class of tweet is communication during or about some event, such as a conference or a concert. These typically use a hash tag to identify the event, but for popular events (e.g., currently #sxsw), the hash tag doesn’t help with precision because there are so many people using it. Trending topics tend to feed back on themselves, encouraging more people to use the hash tags, thereby reducing the effectiveness of the tag. One way to deal with this problem may involve clustering based on the social (follow) graph and allowing people to select whether they want to see messages from their network neighborhood, or from other parts. Clustering based on message content may help, as might segmenting by time or by geo-codes if (when) they become available.
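A very simple stand-in for the follow-graph idea is to split tagged tweets by whether the author is within a few hops of the reader. The sketch below assumes the follow graph is a dict mapping a user to the set of users they follow, and that tweets are dicts with hypothetical `author` and `text` keys. It does not do real clustering; it only separates the reader's neighborhood from everyone else.

```python
def neighborhood(follows, user, hops=2):
    """Return the user plus everyone reachable within `hops` follow edges."""
    seen = {user}
    frontier = {user}
    for _ in range(hops):
        # Expand one hop, keeping only users not visited before.
        frontier = {f for u in frontier for f in follows.get(u, set())} - seen
        seen |= frontier
    return seen


def split_by_network(tweets, follows, user, tag, hops=2):
    """Split tweets carrying `tag` into (inside, outside) the neighborhood."""
    near = neighborhood(follows, user, hops)
    tagged = [t for t in tweets if tag.lower() in t["text"].lower().split()]
    inside = [t for t in tagged if t["author"] in near]
    outside = [t for t in tagged if t["author"] not in near]
    return inside, outside
```

A reader could then choose which of the two lists to view for a busy tag.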
To help find communities of interest, you probably want to augment the social network graph with statistical text-based similarity metrics aggregated over many tweets.
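One hedged way to picture this augmentation is to blend a graph-based score (overlap in who two users follow) with a text-based score (cosine similarity of term counts aggregated over each user's tweets). Everything here, including the `alpha` mixing weight and the choice of Jaccard and cosine, is an assumption for illustration only.

```python
import math
from collections import Counter


def user_profile(tweets):
    """Aggregate term counts over all of one user's tweets."""
    profile = Counter()
    for text in tweets:
        profile.update(text.lower().split())
    return profile


def cosine(p, q):
    """Cosine similarity between two term-count profiles."""
    dot = sum(p[t] * q[t] for t in p.keys() & q.keys())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(
        sum(v * v for v in q.values())
    )
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Overlap between two sets of followed accounts."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def affinity(follows, tweets_by_user, u, v, alpha=0.5):
    """Weighted mix of graph and text similarity (alpha is hypothetical)."""
    graph = jaccard(follows.get(u, set()), follows.get(v, set()))
    text = cosine(
        user_profile(tweets_by_user.get(u, [])),
        user_profile(tweets_by_user.get(v, [])),
    )
    return alpha * graph + (1 - alpha) * text
```

Pairs with high affinity but no follow edge between them would be candidates for membership in the same community of interest.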
Interestingly, because tweets are so short, they are likely to have very different statistical characteristics compared to normal documents. For example, within-tweet term frequency is probably not a useful feature to include in tweet similarity measures, since few terms occur more than once in a single tweet. What about inverse tweet frequency (the analog of IDF)? Does the compressed format argue for not using stop words at all? How would standard techniques for event detection in news (for example) need to be modified to handle the much sparser and burstier text?
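To make the inverse tweet frequency question concrete, here is a small sketch that treats each tweet as a set of terms (so repeats within a tweet are ignored) and weights each term by the log of the inverse fraction of tweets containing it. The whitespace tokenizer and the function names are assumptions for this example; whether this weighting actually works well on tweets is exactly the open question.

```python
import math
from collections import Counter


def tokenize(tweet):
    """Naive tokenizer: lowercase, split on whitespace, drop repeats."""
    return set(tweet.lower().split())


def inverse_tweet_frequency(tweets):
    """Map each term to log(N / number of tweets containing the term)."""
    n = len(tweets)
    tweet_freq = Counter()
    for tweet in tweets:
        tweet_freq.update(tokenize(tweet))
    return {term: math.log(n / count) for term, count in tweet_freq.items()}


def similarity(a, b, itf):
    """Cosine similarity over binary term presence weighted by ITF."""
    terms_a, terms_b = tokenize(a), tokenize(b)

    def norm(terms):
        return math.sqrt(sum(itf.get(t, 0.0) ** 2 for t in terms))

    dot = sum(itf.get(t, 0.0) ** 2 for t in terms_a & terms_b)
    denom = norm(terms_a) * norm(terms_b)
    return dot / denom if denom else 0.0
```

Note that a term appearing in every tweet gets weight zero, which acts as an implicit stop list without one being supplied.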
This seems like an interesting research area that can support a range of Master’s and PhD-level research, and probably a number of startups as well.
I also think this area isn’t unique to Twitter. There are other status update streams, such as Facebook and LinkedIn; instant message conversations; chat rooms; and perhaps even other communications media I’ve neglected. Of course, Twitter is hot, so that’s probably the best place to focus the efforts.
You’re right. Some of the challenges are due to scale, and Twitter seems more likely to push the envelope, as Nova Spivack points out.