· Fri Apr 20, 2018 ·

Rotten Tomatoes, IMDb, Metacritic: which ratings can I trust?

Main topic: Recommender

I HATE SPOILERS. I really do. Movies are about genuine discovery, the feeling of walking into the theater without knowing what to expect. That excitement is hard to preserve when social media continuously flashes the latest footage of new releases. A trailer can really build anticipation, but I am always afraid that everything will be revealed.
Yet there are times when I need some help choosing among hundreds of movies on Netflix. The first thing I do is turn to the internet. But the more time I spend looking at website scores, the more confused I get. Which one should I trust? How do I make sense of the sometimes large differences between the ratings?
This article takes a simple approach to exploring the most common rating websites, such as IMDb and Rotten Tomatoes, and compares them with my own scores.

1. Personal and online ratings

The first thing I wanted to do was to compare my own ratings with some of the most popular online rating systems. So I looked at recent releases, from 2014 to 2017, and assembled a list of 360 movies with their ratings on the following websites: IMDb, Rotten Tomatoes (both critic and user scores), Metacritic, Fandango and MovieLens.

Obviously, there are other rating websites out there, such as Flixter, but I couldn't get my hands on a dataset to work with without scraping the entire website. You can also find a list of movie ratings on the Zanmel blog.

Besides the more traditional website rating systems, I found a few that focus on social media conversations. Tweetings collects every mention of 'I rated #IMDb' on Twitter. This is very similar to IMDb itself, but it can add a layer of analysis if you have access to the user profiles, for instance. Another Twitter-based rating system I found is twitflicks, which goes a bit further in collecting posts: it looks at all mentions of a movie on Twitter and analyzes the associated sentiment. Going even further, it estimates the strength of the emotion to come up with a rating.

Twitter is a great source of user content and opinions, but it has its own biases as well. Beyond the difficulty of collecting the right mentions, it is also a social network whose usage is concentrated in a few countries (US, Japan, India), which limits how representative it can be.

Biases will be discussed further in the next section. They matter for understanding the dataset, but they become less relevant if that dataset 'suits' me well. So without further ado, let's have a look at the results. How are my personal scores on recent movies related to the online ratings above?

Comparison of my ratings (X-axis) on 360 movies from 2014 to 2017

The charts above give the Pearson correlation between my ratings (X-axis) and the other rating systems (Y-axis). I computed the correlation both on the whole dataset (360 movies from 2014 to 2017) and on the 'top movies' only (online ratings above 6). The results are pretty clear: I get a high correlation with two of the user-driven scores, IMDb and the Rotten Tomatoes user rating. Fandango's 5-point scale is pretty far from my own ratings and from all the other ratings, so this source will be excluded from the rest of the analysis.
It is also interesting to notice that the two critic-driven systems are highly correlated with each other, as shown in the correlation matrix below.

Correlation heatmap between rating systems
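For readers who want to reproduce this kind of chart, here is a minimal Python sketch, assuming the ratings live in a pandas DataFrame with one numeric column per rating source. The file name and column names are illustrative assumptions, not my actual dataset:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical file: one row per movie, one numeric column per rating source
# (e.g. my_score, imdb, rt_users, rt_critics, metacritic, fandango)
ratings = pd.read_csv("movie_ratings_2014_2017.csv", index_col="title")

# Pearson correlation on the whole dataset...
corr_all = ratings.corr(method="pearson")

# ...and on the 'top movies' only (IMDb rating above 6)
corr_top = ratings[ratings["imdb"] > 6].corr(method="pearson")

# Heatmap of the full correlation matrix, similar to the one shown above
sns.heatmap(corr_all, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
plt.title("Correlation between rating systems")
plt.show()
```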

As briefly mentioned above, we need a better understanding of how the dataset is composed in order to conduct a thorough analysis. The next section looks at how these rating systems work, their methodologies and their potential biases.

2. Rating methodologies & bias towards male audience

This section gives a short introduction to each of the rating methodologies seen previously. While they may differ, they all seem to be biased toward a male audience, at least according to the literature I could find online. I will not go into detail because, apart from what was given in the FiveThirtyEight article, I could not find data to support a firm view on the matter. So if you have more information or datasets on these biases, please share them with me; I'd like to dig deeper than I could in this article.

Online reviews and ratings have recently come under the spotlight because they are argued to influence theater attendance. While this is not the subject of this article (though it might be a good one for the next), you can read the WIRED article to find out more about rating methodologies and about the impact a bad Rotten Tomatoes score might have on a movie's performance.

The Internet Movie Database, a.k.a. IMDb, has a rating system based on users voting on films on a scale out of 10. Each IMDb user can submit only one rating per film. These ratings are then adjusted so that no single demographic weighs too much in the overall rating. While this system seems bias-proof, its limitation comes from the pool of users, which is roughly 70% male.
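IMDb does not disclose the exact adjustment it applies, but the general idea of preventing one demographic from dominating can be sketched with a toy re-weighting in which each group counts equally regardless of its size (an illustration only, not IMDb's actual formula):

```python
import pandas as pd

# Made-up votes: four men and one woman rate the same film
votes = pd.DataFrame({
    "user_gender": ["M", "M", "M", "M", "F"],
    "score":       [8,   9,   8,   7,   4],
})

raw_mean = votes["score"].mean()                     # dominated by the larger group (7.2)

# Give each demographic group equal weight, whatever its size
group_means = votes.groupby("user_gender")["score"].mean()
adjusted_mean = group_means.mean()                   # (8.0 + 4.0) / 2 = 6.0

print(f"raw mean: {raw_mean:.1f}, demographic-balanced mean: {adjusted_mean:.1f}")
```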

Rotten Tomatoes, on the other hand, offers both a rating given by film critics and one given by users. A movie is given a score out of 100, corresponding to the share of reviews that are positive. At the end of 2015, at the London Film Festival, Meryl Streep mentioned that, according to her own study, male critics outnumbered female critics on Rotten Tomatoes by roughly four to one.
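To make the difference between a 'share of positive reviews' and a plain average rating concrete, here is a toy example with invented review scores; the 6/10 'positive' threshold is my own assumption for illustration, not Rotten Tomatoes' actual classification:

```python
# Invented critic ratings out of 10 for a single film
reviews = [8.0, 6.5, 6.0, 5.5, 3.0]

# Tomatometer-style score: share of reviews counted as positive (assumed threshold: 6/10)
positive_share = 100 * sum(r >= 6.0 for r in reviews) / len(reviews)   # 60%

# Plain average rating
average_rating = sum(reviews) / len(reviews)                           # 5.8/10

print(f"share of positive reviews: {positive_share:.0f}%, average rating: {average_rating:.1f}/10")
```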

Metacritic also provides a score on a 0-to-100 scale, based on a weighted average of critics' and publications' ratings. You can find a detailed list of featured critics on their website.
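Metacritic does not publish the weight it assigns to each publication, but the weighted average itself is simple; the scores and weights below are invented purely for illustration:

```python
# Hypothetical critic scores (0-100) and publication weights
critic_scores  = [90, 75, 60, 40]
critic_weights = [1.5, 1.0, 1.0, 0.5]

plain_mean = sum(critic_scores) / len(critic_scores)
weighted_mean = sum(s * w for s, w in zip(critic_scores, critic_weights)) / sum(critic_weights)

print(f"plain mean: {plain_mean:.1f}, weighted Metascore-style average: {weighted_mean:.1f}")
```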

3. Distribution of ratings

We all have some sense of what these ratings mean or what they represent. A movie rated 1 is likely among the worst of all time, and one rated 10 among the best. Let's take some examples. My two favourite Will Ferrell movies are Step Brothers and Stranger than Fiction, rated 6.9 and 7.6 respectively on IMDb. I must say he's probably my favourite comedian.

The question is: what do 6.9 and 7.6 really mean? A 6.9 is somewhat higher than the midpoint of a 1-to-10 scale, but lower than a 7.6-rated movie. We might therefore expect a 6.9-rated movie to be somewhat better than the typical movie. But is this expectation justified? We also know a 7.6 is obviously better than a 6.9, but how much better? How should we understand this 0.7-point difference? What does it mean when we look at the distribution of all the movies?

Distribution of movie ratings for the different datasets

Plotting the distribution is a great way to figure out where most of the movies are positioned. As we can see from the charts, 5.0 is neither the average nor the median score. In almost all the datasets, the median is around 6.3, which means that half of the movies score below 6.3 and half above.
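A quick sketch of how these distributions and medians can be plotted, reusing the same hypothetical DataFrame as in the correlation example (all scores assumed to be rescaled to a 0-10 range, column names illustrative):

```python
import matplotlib.pyplot as plt

sources = ["my_score", "imdb", "rt_users", "metacritic"]   # illustrative column names

fig, axes = plt.subplots(1, len(sources), figsize=(16, 3), sharey=True)
for ax, col in zip(axes, sources):
    ratings[col].plot.hist(bins=20, ax=ax)                              # distribution of one rating source
    ax.axvline(ratings[col].median(), color="red", linestyle="--")      # its median
    ax.set_title(f"{col} (median {ratings[col].median():.1f})")
plt.show()
```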

The median is a first step towards reading a movie rating in context. As we will see in the next section, percentiles give us a full grid for understanding the relative value of a film.

4. Percentiles, a better way to understand ratings

Percentiles give us a good proxy for how a movie stacks up against other movies. That's the kind of information I'm looking for when deciding which movies to add to my Netflix watchlist.

Percentiles of movie ratings for the different datasets

If you want the 10% best movies according to a rating system, you need to look at the score at the 90th percentile: the value below which 90% of the ratings fall. Coming back to my two Will Ferrell movies, neither made it into the golden space of the top 10% of all time. And now we can say how much better their 0.7-point difference really is: about 20 percentile points. Step Brothers sits at the 66th percentile (the film is in the top 34%) while Stranger than Fiction sits at the 86th percentile (top 14%).
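In code, the percentile rank of a given score is a one-liner with scipy, again assuming the hypothetical `ratings` DataFrame from earlier:

```python
from scipy.stats import percentileofscore

imdb_scores = ratings["imdb"]

# Value below which 90% of IMDb ratings fall, i.e. the entry ticket to the top 10%
top10_threshold = imdb_scores.quantile(0.90)
print(f"top-10% threshold: {top10_threshold:.1f}")

# Percentile rank of my two Will Ferrell favourites
for title, score in [("Step Brothers", 6.9), ("Stranger than Fiction", 7.6)]:
    pct = percentileofscore(imdb_scores, score)
    print(f"{title}: {score} -> {pct:.0f}th percentile (top {100 - pct:.0f}%)")
```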

That's an important point to keep in mind. Even though there is a strong correlation between my ratings and IMDb or Rotten Tomatoes, it doesn't necessarily mean that I would like whatever is highly ranked on these websites. I might prefer certain types of movies more than others. That's where looking at differences and outliers comes into play.

5. Outliers: Exploring the differences

For this section, I plotted the difference between IMDb ratings and my ratings (X-axis) against the number of IMDb reviews (Y-axis, log scale). This should help me understand which types of movies I tend to prefer, and which I tend to dislike, compared with IMDb users.
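Here is how such a chart could be put together, still with the hypothetical DataFrame used above, assuming it also carries the IMDb vote count and a genre label (both column names are illustrative):

```python
import matplotlib.pyplot as plt

# Positive difference = IMDb users like the film more than I do
ratings["diff"] = ratings["imdb"] - ratings["my_score"]

# Highlight comedies (blue) and horror movies (pink); everything else in grey
colors = ratings["genre"].map({"comedy": "blue", "horror": "pink"}).fillna("lightgrey")

plt.scatter(ratings["diff"], ratings["imdb_votes"], c=colors, alpha=0.6)
plt.yscale("log")
plt.xlabel("IMDb rating - my rating")
plt.ylabel("Number of IMDb reviews (log scale)")
plt.show()
```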

Outliers for horror movies (pink) and comedies (blue)

Moving to the right along the X-axis (positive difference), you will see the movies IMDb users favour more than I do. The chart shows that, to a certain extent, I tend to prefer comedies (blue) over horror movies (pink). Comedies cover a wide spectrum, from Zoolander 2 to Captain Fantastic, and I enjoy them all!

6. Final note

In this article, I ran a very simple correlation analysis between IMDb, MovieLens, Rotten Tomatoes and my own ratings. It appears that I can rely on the IMDb score and the Rotten Tomatoes user score.

Three things to take away from this article. First, knowing the biases some of these rating systems have, it is important to do the exercise yourself and see which one you can trust the most. Second, the percentiles of each platform give you a more accurate sense of how 'good' a movie is. Finally, scores give a quantitative value for a movie, but you should also read the comments and reviews to appreciate its qualitative value. Even with a strong correlation, you need to look out for outliers and for your own biases towards certain types of movies. This exercise highlighted my preference for comedies: even a movie rated 6.5 on IMDb can be among my favourites.

As a final note, I wanted to share a website that compiles most of these different rating systems: Flickmetrix. A must-visit if you are looking for your next movie to watch!