We compare three popular techniques for rating content: five-star rating, pairwise comparison, and magnitude estimation.
We collected 39,000 ratings on a popular crowdsourcing platform, allowing us to release a dataset that will be useful for many related studies on user rating techniques. The dataset is available here.
We ask each worker to rate 10 popular paintings using all three rating techniques: five-star rating, pairwise comparison, and magnitude estimation.
We run six different experiments (one for each of the 3! = 6 possible orderings of the three techniques), with 100 participants in each. This lets us analyze the bias introduced by the order in which the rating systems are presented, as well as obtain order-free results by aggregating the data across all orderings.
At the end of the rating activity, we dynamically build the three painting rankings induced by the participant's choices and ask them which of the three best reflects their preference. The comparison is blind: there is no indication of how each ranking was obtained, and the order of the rankings is randomized.
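To make the induced rankings concrete, here is a minimal sketch (in Python, our illustration rather than the code used in the experiment) of how a ranking can be derived from each technique: per-item scores from five-star rating or magnitude estimation are simply sorted, while pairwise comparisons are turned into a ranking by counting wins. All identifiers and data below are illustrative.

```python
# Sketch only: deriving a participant's ranking from each rating technique.
from collections import defaultdict

def ranking_from_scores(scores):
    """Rank items by a per-item score (five-star or magnitude estimation),
    highest score first. `scores` maps painting id -> numeric rating."""
    return sorted(scores, key=scores.get, reverse=True)

def ranking_from_pairwise(comparisons, items):
    """Rank items by the number of pairwise comparisons they won.
    `comparisons` is a list of (winner, loser) painting-id pairs."""
    wins = defaultdict(int)
    for winner, _loser in comparisons:
        wins[winner] += 1
    return sorted(items, key=lambda item: wins[item], reverse=True)

# Hypothetical data for one participant and three paintings.
stars = {"starry_night": 5, "mona_lisa": 4, "the_scream": 3}
magnitudes = {"starry_night": 90, "mona_lisa": 70, "the_scream": 40}
pairs = [("starry_night", "mona_lisa"),
         ("starry_night", "the_scream"),
         ("mona_lisa", "the_scream")]

print(ranking_from_scores(stars))                 # ranking from five-star ratings
print(ranking_from_scores(magnitudes))            # ranking from magnitude estimation
print(ranking_from_pairwise(pairs, list(stars)))  # ranking from pairwise wins
```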
Graphical interface used to let workers express their preference among the rankings induced by their own ratings.
Participants clearly prefer the ranking obtained from their pairwise comparisons. We also notice a memory-bias effect: the technique used last is more likely to be judged the most accurate reflection of the user's real preference. Even so, the pairwise comparison technique received the largest number of preferences in all cases.
Number of times participants preferred the ranking induced by each of the three techniques.
While the pairwise comparison technique clearly requires more time than the other techniques, its time cost would become comparable if a dynamic test system were used, one that asks only on the order of N log N comparisons rather than all possible pairs (a sketch follows the figure below).
Average time per test.
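For intuition on the N log N claim, here is a minimal sketch (in Python, our illustration rather than the system used in the study) of such a dynamic test: each new painting is inserted into the already-ranked list via binary search, so every comparison is a single question to the participant and only about N log N questions are needed in total.

```python
# Sketch only: adaptive pairwise ranking with O(N log N) questions.
def rank_with_queries(items, ask):
    """Return `items` ranked best-first. `ask(a, b)` should return True if
    the participant prefers a over b; it is called O(N log N) times."""
    ranking = []
    for item in items:
        lo, hi = 0, len(ranking)
        # Binary search for the insertion point: O(log N) questions per item.
        while lo < hi:
            mid = (lo + hi) // 2
            if ask(item, ranking[mid]):   # item preferred over ranking[mid]
                hi = mid
            else:
                lo = mid + 1
        ranking.insert(lo, item)
    return ranking

# Example: simulate a participant whose true preference order is known.
true_order = ["starry_night", "mona_lisa", "the_scream", "guernica"]
prefers = lambda a, b: true_order.index(a) < true_order.index(b)
print(rank_with_queries(["the_scream", "guernica", "starry_night", "mona_lisa"], prefers))
# -> ['starry_night', 'mona_lisa', 'the_scream', 'guernica']
```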
For more, see our full paper, Pairwise, Magnitude, or Stars: What’s the Best Way for Crowds to Rate?