library(tidyverse)
library(tidytuesdayR)
Summer Movies
tuesdata <- tidytuesdayR::tt_load(2024, week = 31)
---- Compiling #TidyTuesday Information for 2024-07-30 ----
--- There are 2 files available ---
── Downloading files ───────────────────────────────────────────────────────────
1 of 2: "summer_movie_genres.csv"
2 of 2: "summer_movies.csv"
summer_movie_genres <- tuesdata$summer_movie_genres
summer_movies <- tuesdata$summer_movies
# Save dataframes as CSV inside the "data" folder
write.csv(summer_movies, "data/summer_movies.csv", row.names = FALSE)
write.csv(summer_movie_genres, "data/summer_movie_genres.csv", row.names = FALSE)
Dataset
The “Summer Movies” dataset originates from the Internet Movie Database (IMDb) and was compiled as part of the TidyTuesday project. It covers films, videos, and TV movies with “summer” in their title. The dataset consists of two tables: summer_movies.csv, which contains details such as title type, IMDb rating, number of votes, and genres, and summer_movie_genres.csv, which breaks down movies by individual genres. We chose this dataset because it lets us analyze how ratings, votes, and genres interact across different title types (movies, videos, TV movies), providing insights into trends in summer-themed media.
summer_movies.csv - Main dataset; each entry is a movie with details such as identifier, title type, IMDb rating, number of votes, and genres.
summer_movie_genres.csv - Breaks down movies by individual genres.
Rows: 905
Columns: 10
Title Type (title_type) – Categorizes entries as movie, video, or TV movie.
IMDb Rating (average_rating) – A score from 1 to 10 based on user reviews.
Number of Votes (num_votes) – The total number of user votes for a title on IMDb.
We chose this dataset because it offers several interesting variables, both categorical and numerical, that can be used to analyze “summer” movies. Our shared interest in movie analytics led us to select it for our project.
Questions
How have different genres changed in popularity (number of votes) over time?
Is there a correlation between the number of votes, the average rating, and the runtime across different title types (movie, video, tvMovie)?
Analysis plan
Question #1
Merge Datasets:
Since the individual genre information is stored separately in summer_movie_genres.csv, we first merge this with summer_movies.csv using the common identifier tconst. This will allow us to analyze both the number of votes (num_votes) and genre classifications for each movie.
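The merge step could look roughly like the sketch below. It uses small made-up data frames (the `_toy` names and all values are illustrative, not taken from the real files) and assumes the main table carries its own combined genres column, which is dropped before the join so the per-genre column from the genre table is the only one left.

```r
library(dplyr)

# Toy stand-ins for the two tables (illustrative values, not real data)
summer_movies_toy <- data.frame(
  tconst    = c("tt0000001", "tt0000002"),
  year      = c(1999, 2005),
  num_votes = c(120, 45),
  genres    = c("Comedy,Romance", "Drama")
)
summer_movie_genres_toy <- data.frame(
  tconst = c("tt0000001", "tt0000001", "tt0000002"),
  genres = c("Comedy", "Romance", "Drama")
)

# One row per movie-genre pair, with the movie-level columns attached.
# The combined genres column is removed first to avoid genres.x / genres.y.
movies_by_genre <- summer_movie_genres_toy %>%
  left_join(select(summer_movies_toy, -genres), by = "tconst")
```

In this layout a movie listed under two genres appears in two rows, so its votes count toward both genres in any later aggregation.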
Aggregate Data:
Group by year and genres, summing num_votes to get total votes per genre per year. This step helps identify trends in genre popularity over time. We think this approach provides a better sense of which genres were dominant in specific time periods, rather than just looking at raw vote counts.
Normalize Popularity:
Since overall voting patterns may vary by year due to different levels of movie viewership, we compute a percentage popularity metric for each genre per year: (number of votes per genre / total votes per year) *100.
This normalization ensures that we can fairly compare trends over time, preventing bias from years with exceptionally high or low overall voting activity.
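The percentage metric can be sketched as a grouped mutate. The input table and the `pct_votes` column name are assumptions for illustration.

```r
library(dplyr)

# Toy yearly totals per genre (illustrative values)
votes_by_genre_year <- data.frame(
  year        = c(1999, 1999, 2005, 2005),
  genres      = c("Comedy", "Drama", "Comedy", "Drama"),
  total_votes = c(150, 50, 20, 60)
)

# Share of each year's votes that went to each genre, in percent
genre_share <- votes_by_genre_year %>%
  group_by(year) %>%
  mutate(pct_votes = total_votes / sum(total_votes) * 100) %>%
  ungroup()
```

Within each year the shares sum to 100, so a year with few votes overall is weighted the same as a year with many.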
Visualizations:
Line Chart:
We will plot a multi-line chart where each line represents a genre, and the y-axis reflects the total number of votes.
We believe this is effective because it allows us to see how different genres rise and fall in popularity across years.
Stacked Area Plot:
A stacked area chart will help us visualize relative genre dominance by stacking the proportion of votes per genre over time.
This helps us see which genres were most popular in different eras and how the overall composition of popular movies has changed.
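Both charts could be drawn with ggplot2 along the following lines. This is only a sketch on toy data; the column names `total_votes` and `pct_votes` are assumed to come from the aggregation and normalization steps.

```r
library(ggplot2)

# Toy genre-by-year table (illustrative values, not real data)
genre_share <- data.frame(
  year        = rep(c(1990, 2000, 2010), each = 2),
  genres      = rep(c("Comedy", "Drama"), times = 3),
  total_votes = c(150, 50, 20, 60, 300, 100),
  pct_votes   = c(75, 25, 25, 75, 75, 25)
)

# Multi-line chart: one line per genre, total votes on the y-axis
line_plot <- ggplot(genre_share, aes(x = year, y = total_votes, colour = genres)) +
  geom_line() +
  labs(x = "Year", y = "Total votes", colour = "Genre")

# Stacked area chart: share of each year's votes per genre
area_plot <- ggplot(genre_share, aes(x = year, y = pct_votes, fill = genres)) +
  geom_area() +
  labs(x = "Year", y = "Share of votes (%)", fill = "Genre")
```

Printing `line_plot` or `area_plot` renders the chart; `geom_area()` stacks the groups by default.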
Question #2
Filtering Data by Title Type:
- We categorize movies based on title_type (e.g., movie, video, tvMovie).
- We think that analyzing these categories separately will provide more meaningful insights, as different title types may have inherently different characteristics (e.g., TV movies might be shorter and receive fewer votes than theatrical releases).
Correlation Analysis:
- We calculate correlation coefficients between num_votes, average_rating, and runtime_minutes for each title type.
The correlation values will tell us whether:
Movies with higher ratings also receive more votes.
Longer movies tend to get higher ratings or more votes.
We anticipate that there might be positive correlations between votes and ratings (since widely viewed movies tend to be more rated), but the correlation with runtime might be less predictable.
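One way to compute the per-type correlations is a grouped summary, sketched here on made-up rows. The output column names are hypothetical, Pearson correlation (the default of `cor()`) is assumed, and `use = "complete.obs"` is a defensive choice in case some runtimes are missing.

```r
library(dplyr)

# Toy stand-in for summer_movies (illustrative values, not real data)
movies_toy <- data.frame(
  title_type      = rep(c("movie", "tvMovie"), each = 4),
  num_votes       = c(10, 200, 3000, 45000, 5, 20, 80, 150),
  average_rating  = c(5.1, 6.0, 6.8, 7.5, 6.2, 5.5, 6.9, 6.0),
  runtime_minutes = c(90, 105, 98, 120, 85, 88, NA, 92)
)

# Pairwise correlations within each title type, skipping incomplete pairs
cor_by_type <- movies_toy %>%
  group_by(title_type) %>%
  summarise(
    votes_rating   = cor(num_votes, average_rating, use = "complete.obs"),
    votes_runtime  = cor(num_votes, runtime_minutes, use = "complete.obs"),
    rating_runtime = cor(average_rating, runtime_minutes, use = "complete.obs"),
    n_titles       = n()
  )
```

If vote counts turn out to be heavily skewed, passing `method = "spearman"` to `cor()` is one alternative worth comparing.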
Visualization:
Scatter Plot Matrix:
A scatter plot matrix will allow us to observe pairwise relationships between num_votes, average_rating, and runtime_minutes. We will color the points based on title_type to distinguish different types of media. This will give us an intuitive sense of whether these variables show linear trends or clusters by category.
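A scatter plot matrix of this kind can be sketched with base R's `pairs()`. The data and the colour choices below are illustrative assumptions, and other tools would work equally well.

```r
# Toy stand-in for summer_movies (illustrative values, not real data)
movies_toy <- data.frame(
  title_type      = factor(c("movie", "movie", "video", "video", "tvMovie", "tvMovie")),
  num_votes       = c(1200, 300, 40, 15, 90, 60),
  average_rating  = c(6.8, 5.9, 5.2, 6.1, 6.4, 5.7),
  runtime_minutes = c(110, 95, 80, 75, 88, 92)
)

# One colour per title type, looked up by factor level
type_cols  <- c("steelblue", "darkorange", "forestgreen")
point_cols <- type_cols[as.integer(movies_toy$title_type)]

# Pairwise scatter plots of the three numeric variables
pairs(
  movies_toy[, c("num_votes", "average_rating", "runtime_minutes")],
  col = point_cols,
  pch = 19
)
```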
Box Plot:
To compare average_rating and num_votes across different title_type categories, we will use box plots. We believe this is useful because it will show the spread and central tendency of each metric per title type, helping us see if certain categories (e.g., movies vs. TV movies) have consistently higher ratings or receive more votes.
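The box plots could be produced as in the sketch below, which reshapes the two metrics into long form so each gets its own panel and y-axis scale. The toy data and the `metric` and `value` column names are assumptions for illustration.

```r
library(ggplot2)
library(tidyr)

# Toy stand-in for summer_movies (illustrative values, not real data)
movies_toy <- data.frame(
  title_type     = rep(c("movie", "video", "tvMovie"), each = 3),
  average_rating = c(6.8, 5.9, 7.1, 5.2, 6.1, 4.8, 6.4, 5.7, 6.0),
  num_votes      = c(1200, 300, 5000, 40, 15, 22, 90, 60, 130)
)

# Long form: one row per title and metric
movies_long <- pivot_longer(
  movies_toy,
  cols      = c(average_rating, num_votes),
  names_to  = "metric",
  values_to = "value"
)

# One panel per metric, with independent y-axes since the scales differ
box_plot <- ggplot(movies_long, aes(x = title_type, y = value)) +
  geom_boxplot() +
  facet_wrap(~ metric, scales = "free_y") +
  labs(x = "Title type", y = NULL)
```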