We compare three popular techniques for rating content: five-star rating, pairwise comparison, and magnitude estimation.
We collected 39,000 ratings on a popular crowdsourcing platform, allowing us to release a dataset that will be useful for many related studies on user rating techniques. The dataset is available here.
We ask each worker to rate 10 popular paintings using all three rating techniques: five-star rating, pairwise comparison, and magnitude estimation.
We run six different experiments (one for each of the 3! = 6 possible orderings of the three techniques), with 100 participants in each. This lets us analyze the bias introduced by the order in which the rating systems are presented, as well as obtain order-free results by aggregating the data across all orderings.
At the end of the rating activity, we dynamically build the three painting rankings induced by the participant's choices and ask them which of the three best reflects their preference. The comparison is blind: there is no indication of how each ranking was obtained, and the order of the rankings is randomized.
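To make the induced rankings concrete, here is a minimal sketch (in Python, our illustration rather than the code used in the experiment) of how a ranking can be derived from each technique: per-item scores from five-star rating or magnitude estimation are simply sorted, while pairwise comparisons are turned into a ranking by counting wins. All identifiers and data below are illustrative.

```python
# Sketch only: deriving a participant's ranking from each rating technique.
from collections import defaultdict

def ranking_from_scores(scores):
    """Rank items by a per-item score (five-star or magnitude estimation),
    highest score first. `scores` maps painting id -> numeric rating."""
    return sorted(scores, key=scores.get, reverse=True)

def ranking_from_pairwise(comparisons, items):
    """Rank items by the number of pairwise comparisons they won.
    `comparisons` is a list of (winner, loser) painting-id pairs."""
    wins = defaultdict(int)
    for winner, _loser in comparisons:
        wins[winner] += 1
    return sorted(items, key=lambda item: wins[item], reverse=True)

# Hypothetical data for one participant and three paintings.
stars = {"starry_night": 5, "mona_lisa": 4, "the_scream": 3}
magnitudes = {"starry_night": 90, "mona_lisa": 70, "the_scream": 40}
pairs = [("starry_night", "mona_lisa"),
         ("starry_night", "the_scream"),
         ("mona_lisa", "the_scream")]

print(ranking_from_scores(stars))                 # ranking from five-star ratings
print(ranking_from_scores(magnitudes))            # ranking from magnitude estimation
print(ranking_from_pairwise(pairs, list(stars)))  # ranking from pairwise wins
```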
Graphical interface used to let workers express their preference among the rankings induced by their own ratings.
Participants clearly prefer the ranking obtained from their pairwise comparisons. We also notice a memory-bias effect: the technique used last is more likely to be judged the most accurate reflection of the user's real preference. Even so, the pairwise comparison technique received the largest number of preferences in all cases.
Number of times participants preferred the ranking induced by each of the three techniques.
While the pairwise comparison technique clearly requires more time than the other techniques, its time cost would become comparable if a dynamic test system were used, one that asks only on the order of N log N comparisons rather than all possible pairs (a sketch follows the figure below).
Average time per test.
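For intuition on the N log N claim, here is a minimal sketch (in Python, our illustration rather than the system used in the study) of such a dynamic test: each new painting is inserted into the already-ranked list via binary search, so every comparison is a single question to the participant and only about N log N questions are needed in total.

```python
# Sketch only: adaptive pairwise ranking with O(N log N) questions.
def rank_with_queries(items, ask):
    """Return `items` ranked best-first. `ask(a, b)` should return True if
    the participant prefers a over b; it is called O(N log N) times."""
    ranking = []
    for item in items:
        lo, hi = 0, len(ranking)
        # Binary search for the insertion point: O(log N) questions per item.
        while lo < hi:
            mid = (lo + hi) // 2
            if ask(item, ranking[mid]):   # item preferred over ranking[mid]
                hi = mid
            else:
                lo = mid + 1
        ranking.insert(lo, item)
    return ranking

# Example: simulate a participant whose true preference order is known.
true_order = ["starry_night", "mona_lisa", "the_scream", "guernica"]
prefers = lambda a, b: true_order.index(a) < true_order.index(b)
print(rank_with_queries(["the_scream", "guernica", "starry_night", "mona_lisa"], prefers))
# -> ['starry_night', 'mona_lisa', 'the_scream', 'guernica']
```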
For more, see our full paper, Pairwise, Magnitude, or Stars: What’s the Best Way for Crowds to Rate?