Michael Bernstein and a cast of thousands published an interesting paper at UIST 2010, which was honored with the Best Student Paper award. The paper describes and evaluates Soylent, a tool that uses Mechanical Turk to generate corrections and suggestions to improve writing. (The name Soylent does not refer to a soy-based dairy substitute for the weeks leading up to Easter; rather, it is derived from the film Soylent Green.)
This work is interesting in a number of ways: it automates the distribution and collection of Mechanical Turk tasks and then integrates the results into an interactive system, it recognizes the limitations of fully automated approaches, and it suggests a design pattern that can be applied in other contexts.
The main contribution of this paper is the idea of embedding paid crowd workers in an interactive user interface to support complex cognition and manipulation tasks on demand. These crowd workers do tasks that computers cannot reliably do automatically and the user cannot easily script.
The paper implements three components that use Mechanical Turk input: Shortn to shorten text, Crowdproof to proofread it, and The Human Macro to specify repeated tasks. The Find-Fix-Verify pattern mitigates errors by splitting the identification (find), generation (fix), and validation (verify) steps among different Turkers.
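To make the division of labor concrete, here is a minimal sketch of how one Find-Fix-Verify round might be orchestrated. This is not the authors’ code: the post_hit callback, the worker counts, and the agreement threshold are hypothetical placeholders for whatever HIT-posting machinery and parameters a real implementation would use.

```python
# A rough sketch of the Find-Fix-Verify pattern, not Soylent's implementation.
# `post_hit(stage, payload)` is a hypothetical function that posts a HIT and
# blocks until one worker responds.
from collections import Counter

def find_fix_verify(paragraph, post_hit, n_find=10, n_fix=5, n_verify=5):
    """Split a crowd task into independent Find, Fix, and Verify stages."""
    # Find: independent workers each return a list of text spans needing work.
    flagged = [post_hit("find", paragraph) for _ in range(n_find)]
    counts = Counter(span for spans in flagged for span in spans)
    # Keep only spans that at least two workers agreed on, to filter noise.
    agreed = [span for span, votes in counts.items() if votes >= 2]

    rewrites = []
    for span in agreed:
        # Fix: a different set of workers proposes candidate rewrites.
        candidates = [post_hit("fix", span) for _ in range(n_fix)]
        # Verify: yet another set of workers votes on the candidates.
        ballots = Counter(post_hit("verify", (span, candidates))
                          for _ in range(n_verify))
        best, _ = ballots.most_common(1)[0]
        rewrites.append((span, best))
    return rewrites
```

The point of the structure is that no single Turker both identifies and repairs a problem, and no Turker validates their own fix.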
One challenge with this approach is response time. For example, the authors report that most actual work times were under four minutes per stage for the Shortn task, whereas overall response times were closer to 45-60 minutes for most tasks. The authors argue that as the number of Turkers increases, wait times will decrease and approach the actual work times. It’s not clear to me whether the rate at which Turkers accept these tasks will keep pace with the rate at which writers submit them. On the other hand, the paper reports anecdotally that decreasing the payout for each stage resulted in comparable quality but took longer. This suggests that it may be possible to pay more for faster service; how to set price points, and whether the availability of higher-paying HITs will cause Turkers to hold out for them rather than complete the lower-paying ones, remain open questions.
The other, related issue is cost: the Shortn tasks cost roughly between $4.50 and $9.50 for a few paragraphs; Crowdproof cost $2 to $5 per paragraph; and the authors don’t report costs for The Human Macro. These numbers are not cheap, particularly if help is needed throughout a paper. For example, this would generally not be an effective technique for correcting significant mistakes in the writing of non-native English speakers, the kinds of mistakes that often lead to a paper being rejected because it is too hard to understand.
Nonetheless, this is an interesting, provocative, and well-written paper that breaks new ground in crowdsourcing and in interactive system design. It’ll be interesting to see the evolution of design patterns for crowdsourcing over the next few years. One possible challenge for the long-term stability of crowdsourcing design patterns is the human factor. Unlike computer systems whose behavior doesn’t change over time (MVC works just the same now as it did in the 1980s), patterns designed to work around undesired Turker behaviors may not remain effective once Turkers understand how they are applied and how to game them. And, as ever, there is the challenge of scale: finding enough qualified Turkers to sustain the demand for their labor.
Having slightly recovered from the trip to New York to present this work, I can say — thanks, Gene, for talking about it and sharing your thoughts.
My evidence (currently unpublished) is that paying more doesn’t necessarily get you faster work, at least not in the short term. You’ll still get just as many Turkers within the first minute or two of publishing your task. What changes is how quickly the latecomers take up your task. The acceptance curve is approximately exponential: you have to wait exponentially longer for each new Turker. If you pay more, you change the coefficient, but it’s still exponential. It just depends on whether you need a few people to work in a short time period after posting (in which case you should just pay $0.01) or need lots of work done, but not immediately (in which case you should up your payment if you want to hurry the process).
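To picture the shape being described, here is a toy model, not the unpublished data: if the wait before the k-th Turker arrives grows roughly like exp(k / c), then raising the payment changes the coefficient c but not the exponential character of the curve. The parameter values and the mapping from pay to coefficient below are invented purely for illustration.

```python
import math
import random

def simulated_acceptance_times(n_workers=20, pay_coefficient=3.0):
    """Toy model of HIT uptake: the gap before the k-th worker arrives grows
    roughly like exp(k / pay_coefficient), so a higher payout (larger
    coefficient) flattens the curve without changing its exponential shape."""
    t, times = 0.0, []
    for k in range(n_workers):
        gap = random.expovariate(1.0) * math.exp(k / pay_coefficient)
        t += gap
        times.append(t)
    return times

# The first few workers arrive about as quickly under either price; the
# difference shows up in how long the stragglers take.
cheap = simulated_acceptance_times(pay_coefficient=2.0)    # e.g., a low-paying HIT
pricier = simulated_acceptance_times(pay_coefficient=4.0)  # e.g., a higher payout
print(f"low payout: 20th worker after ~{cheap[-1]:.0f} time units")
print(f"higher payout: 20th worker after ~{pricier[-1]:.0f} time units")
```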
Getting costs down while maintaining quality is an interesting challenge. Panos Ipeirotis has an interesting take on the core problem, saying that MTurk is a market for lemons. So, it’s cheap exactly because the quality is low — I can’t trust the work you’ll do, so I need to pay more to put the work through Find-Fix-Verify to increase quality. If we had better reputation systems in the market, then maybe we could pay fewer people, slightly more, for better work.
I wonder whether it is economically viable to have HIT markets that don’t produce lemons. Clearly a different mechanism that has structural incentives for quality is required, but will the cost of this higher quality extinguish the market for it?