iMDB Ratings
(2003-2022)

Author

Elegant Evee
Diya Bansal, Sarah Young, Cyrus Irani, Tairan Zhang

Published

May 5, 2023

Introduce the topic and motivation

Our project is focused on exploring the relationship between different variables amongst the 100 most popular movies (from iMDB) for each year from 2003-2022. The research questions we are exploring here are:

Which release dates see the greatest profits?
How strong is the relationship between a film’s release year and income versus its iMDB rating?

Introduce the data

100 most popular movies for each year from 2003-2022
1989 (why not 2000?) rows and 13 columns
Movie title, iMDB rating, year of release, month of release, budget, income, etc.
Each row represents a unique film that has all the above data on iMDB

Q1 – Highlights from EDA


Rows: 2000 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (11): Title, Month, Certificate, Runtime, Directors, Stars, Genre, Filmi...
dbl  (2): Rating, Year

Q1 – Inference/modeling/other analysis

Warning: Please be cautious in reporting a p-value of 0. This result is an
approximation based on the number of `reps` chosen in the `generate()` step. See
`?get_p_value()` for more information.

# A tibble: 1 × 1
  p_value
    <dbl>
1       0

\[ \text{p-value} \approx 0 < 0.05 = \alpha \]

We reject the null hypothesis in favor of the alternative hypothesis.

Therefore, the data provide evidence that there is a difference in mean profit between movies released in favorable and unfavorable months.
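A test of this kind could be set up with the infer package roughly as follows. This is a minimal sketch, not the project's actual code: `movies_sim`, its `window` and `profit` columns, the group sizes, and the choice of 1,000 permutations are all assumptions made for illustration, and the data are simulated.

```r
library(dplyr)
library(infer)

set.seed(2023)

# Simulated stand-in for the movie data (NOT the real dataset):
# profit in USD and a two-level release-window label.
movies_sim <- tibble(
  window = factor(rep(c("favorable", "unfavorable"), times = c(60, 140))),
  profit = c(rnorm(60, mean = 150e6, sd = 100e6),
             rnorm(140, mean = 80e6, sd = 100e6))
)

# Observed difference in mean profit (favorable minus unfavorable)
obs_diff <- movies_sim %>%
  specify(profit ~ window) %>%
  calculate(stat = "diff in means", order = c("favorable", "unfavorable"))

# Null distribution under independence, built by permuting the labels
reps <- 1000
null_dist <- movies_sim %>%
  specify(profit ~ window) %>%
  hypothesize(null = "independence") %>%
  generate(reps = reps, type = "permute") %>%
  calculate(stat = "diff in means", order = c("favorable", "unfavorable"))

p_val <- null_dist %>%
  get_p_value(obs_stat = obs_diff, direction = "two-sided") %>%
  pull(p_value)

# A simulated p-value of exactly 0 only means "smaller than 1 / reps"
p_report <- if (p_val == 0) paste0("p < ", 1 / reps) else paste0("p = ", p_val)
p_report
```

Reporting the result as "p < 1 / reps" rather than "p = 0" follows the caution in the warning message: the simulation can only resolve p-values down to one over the number of permutations.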

Q1 – Inference/modeling/other analysis

# A tibble: 1 × 2
   lower_ci   upper_ci
      <dbl>      <dbl>
1 16356012. 135972444.

We are 95% confident that the true mean profit for movies released in favorable months is between ~16 million USD and ~136 million USD higher than the true mean profit for movies released in unfavorable months.

If we repeated this procedure on many new samples, about 95% of the resulting intervals would contain the true difference in mean profit.
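An interval with `lower_ci` and `upper_ci` columns like the one above could come from a percentile bootstrap in infer. The sketch below assumes that approach; the simulated `movies_sim` data, the column names, and the 1,000 bootstrap resamples are illustrative assumptions, not the project's actual inputs.

```r
library(dplyr)
library(infer)

set.seed(2023)

# Simulated stand-in for the movie data (NOT the real dataset)
movies_sim <- tibble(
  window = factor(rep(c("favorable", "unfavorable"), times = c(60, 140))),
  profit = c(rnorm(60, mean = 150e6, sd = 100e6),
             rnorm(140, mean = 80e6, sd = 100e6))
)

# Bootstrap distribution of the difference in mean profit
boot_dist <- movies_sim %>%
  specify(profit ~ window) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "diff in means", order = c("favorable", "unfavorable"))

# 95% percentile interval: the middle 95% of the bootstrap differences
ci <- boot_dist %>%
  get_confidence_interval(level = 0.95, type = "percentile")
ci
```

The `order` argument fixes the direction of subtraction, so a positive interval means higher mean profit in the favorable group.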

Q1 – Conclusions/Future Work

Observation: Movies released in favorable months (May, June, July, December) have a higher average profit than those released in unfavorable months (all other months).

Inference: We can expect movies released in May, June, July, and December to earn a higher profit (on average) than movies released in other months.

Support: Confidence interval and p-value (which showed us that there is a statistically significant difference in profits between favorable and unfavorable release months).

Small drop in profits around 2020 (can be attributed to COVID-19)
Future work could analyze profits by release year in more detail to find further trends and explore the causes of dips like this one

Q2 – Highlights from EDA

Warning: There was 1 warning in `mutate()`.
ℹ In argument: `income = parse_number(income)`.
Caused by warning:
! 145 parsing failures.
row col expected  actual
  6  -- a number Unknown
 16  -- a number Unknown
 17  -- a number Unknown
 18  -- a number Unknown
 19  -- a number Unknown
... ... ........ .......
See problems(...) for more details.
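The parsing failures above come from rows where income is recorded as the text "Unknown". One way to handle them explicitly is to tell `parse_number()` to treat that string as missing. This is a minimal sketch; the `income_raw` values are made up for illustration and are not taken from the dataset.

```r
library(readr)

# Hypothetical raw values imitating a text income column
income_raw <- c("$2,068,223,624", "Unknown", "$760,928,081", "Unknown")

# Declaring "Unknown" as a missing-value marker turns it into NA
# up front instead of raising a parsing failure
income <- parse_number(income_raw, na = c("", "NA", "Unknown"))
income

# Count how many incomes are missing before modeling
sum(is.na(income))
```

Either way the unknown incomes end up as `NA`, so those rows have no usable income value for any analysis that relies on it.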

Q2 – Inference/modeling/other analysis

# A tibble: 3 × 5
  term         estimate std.error statistic  p.value
  <chr>           <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept) -8.44e- 1  7.30e+ 0    -0.116 9.08e- 1
2 year         3.70e- 3  3.63e- 3     1.02  3.09e- 1
3 income       5.84e-10  7.34e-11     7.96  3.09e-15

\[ \begin{split} \widehat{\text{Rating}} = -8.440238 \times 10^{-1} + 3.696143 \times 10^{-3} \times \text{Year} \\ + 5.842187 \times 10^{-10} \times \text{Income} \end{split} \]

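When year and income are both 0, the expected rating is -0.844; this intercept is an extrapolation, since every film in the data was released between 2003 and 2022.
Holding income constant, the expected rating increases by 0.0037 for each additional year; holding year constant, it increases by 5.842187*10^-10 for each additional dollar of income (about 0.058 per additional $100 million).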
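To make the coefficients concrete, the fitted equation can be evaluated for a specific film. The sketch below copies the three estimates from the model output above; `predict_rating()` is a hypothetical helper written for this illustration, and the example year and income are arbitrary.

```r
# Coefficients copied from the fitted model output
b0       <- -8.440238e-1
b_year   <-  3.696143e-3
b_income <-  5.842187e-10

# Hypothetical helper: plug a year and an income (USD) into the fitted line
predict_rating <- function(year, income) {
  b0 + b_year * year + b_income * income
}

# A 2015 release earning $100 million: predicted rating of about 6.66
predict_rating(2015, 1e8)

# Effect of an extra $100 million at a fixed year: about 0.058 rating points
b_income * 1e8
```

On this scale the income coefficient is easier to read: even a $100 million difference in income shifts the predicted rating by less than a tenth of a point.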

Q2 – Inference/modeling/other analysis

# A tibble: 1 × 1
       r
   <dbl>
1 0.0317

# A tibble: 1 × 1
      r
  <dbl>
1 0.183

The correlation between release year and iMDB rating is negligible (r = 0.0317), while income has a weak positive relationship with rating (r = 0.183).
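Correlations like these can be computed with `cor()` inside `summarize()`. The sketch below uses simulated data, so its values will not match the ones reported here; `movies_sim` and its columns are assumptions, and `use = "complete.obs"` is one possible way to skip rows with missing income.

```r
library(dplyr)

set.seed(1)
n <- 200
income_full <- rexp(n, rate = 1 / 1e8)

# Simulated stand-in (NOT the real dataset); five incomes set to missing
movies_sim <- tibble(
  year   = sample(2003:2022, n, replace = TRUE),
  income = replace(income_full, 1:5, NA),
  rating = 6.5 + 2e-9 * income_full + rnorm(n, sd = 0.8)
)

# Pearson correlation of rating with each predictor
r_tbl <- movies_sim %>%
  summarize(
    r_year   = cor(year, rating),
    r_income = cor(income, rating, use = "complete.obs")
  )
r_tbl
```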

# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1   0.0938    0.256

Confidence interval for the difference in mean iMDB rating between blockbuster movies (income of at least $100,000,000) and non-blockbuster movies.
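The blockbuster grouping itself is a threshold on income. A minimal sketch of that step follows, using a made-up five-row sample; the titles, incomes, and ratings are hypothetical, and dropping rows with missing income is an assumption about how such rows might be handled.

```r
library(dplyr)

# Hypothetical mini-sample; incomes in USD
movies_sim <- tibble(
  title  = c("A", "B", "C", "D", "E"),
  income = c(2.5e8, 4.0e7, 1.0e8, 9.9e7, NA),
  rating = c(7.4, 6.1, 6.9, 6.5, 7.0)
)

rating_by_group <- movies_sim %>%
  filter(!is.na(income)) %>%                      # drop unknown incomes
  mutate(blockbuster = if_else(income >= 1e8,     # threshold is inclusive
                               "blockbuster", "non-blockbuster")) %>%
  group_by(blockbuster) %>%
  summarize(n = n(), mean_rating = mean(rating))
rating_by_group
```

With `>=`, a film earning exactly $100,000,000 counts as a blockbuster, which matches the "at least" wording.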

Q2 – Conclusions/Future Work

Observation: Movies with high box office incomes tend to earn higher iMDB ratings.

Inference: We can expect movies earning at least $100,000,000 at the box office to have higher iMDB ratings than those earning less than $100,000,000.

Support: The confidence interval (0.0938 to 0.256) lies entirely above zero, which suggests that the true mean difference in rating between blockbusters and non-blockbusters is positive.

Could explore further by . . .

Comparing results to Metacritic scoring (different methodology)
Comparing budget to rating as a measure of care put into a movie

References

The data that we used was from Kaggle, by a user with the username GEORGE SCUTELNICU (https://www.kaggle.com/datasets/georgescutelnicu/top-100-popular-movies-from-2003-to-2022-imdb).