Project title

Author

Dank Tea - Jiahua Chen, Hung Phung, Brianna Ramnath

Introduction

The “Summer Movies” dataset originates from the Internet Movie Database (IMDb) and was compiled as part of the TidyTuesday project. It covers films, videos, and TV movies with “summer” in their title. The dataset consists of two tables: summer_movies.csv, the main table, which contains details such as title type, IMDb rating, number of IMDb votes, and genres; and summer_movie_genres.csv, which lists each movie once per individual genre. The two tables are linked by tconst, a column of unique identifiers for each “summer” movie.

This dataset was chosen for its potential to analyze how ratings, votes, and genres interact across different title types (movies, videos, TV movies), providing insights into trends in summer-themed media.

Question 1: How have different genres changed in popularity (number of votes) over time?

Introduction

Our first question explores how the popularity of each genre in the “summer” movies data has changed over time. We define popularity here as the number of votes received on IMDb. The parts of the dataset needed to answer this question are the genres field from the genres dataframe, and year and num_votes from the movies dataframe. This question lets us gauge whether audiences’ preferences among types of summer-themed movies have changed over time. In future research, the results could also be compared to movies as a whole, to see whether genre preferences for “summer” movies are similar to the general popularity of genres.

Approach

To analyze how genre popularity has evolved over time, we use two complementary visualizations: a multi-line plot and a stacked area chart. First, we merge summer_movies and summer_movie_genres on tconst and remove rows with missing values. We then sum the votes per genre per year and normalize them to get each genre’s percentage share of that year’s votes. Instead of plotting all genres, we focus on the favorite genres of our three team members: Action, Comedy, and Romance. We also think these genres represent diverse storytelling styles and have been consistently popular across decades.

The multi-line plot tracks absolute vote counts over time. It uses a logarithmic y-axis (scale_y_log10()) to handle the wide variation in vote counts, so that smaller values remain visible instead of being overshadowed by outliers. The stacked area chart displays each genre’s relative share of votes per year, which helps us see shifts in audience preferences. To improve interpretability, we use colorblind-friendly palettes (scale_color_viridis_d() and scale_fill_viridis_d()), reverse the legend order in the line plot, and give the x-axis evenly spaced breaks.

Analysis

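# Load packages and data (scales provides label_comma() and label_percent())
library(tidyverse)
library(scales)

movies <- read_csv("data/summer_movies.csv")
genres <- read_csv("data/summer_movie_genres.csv")

# Merge datasets on 'tconst' and remove rows with NA values.
# Both tables have a 'genres' column, so the join suffixes them:
# genres.x (from movies) and genres.y (one genre per row, from genres).
merged_data <- movies %>%
  inner_join(genres, by = "tconst") %>%
  drop_na(year, genres.y, num_votes)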

# Aggregate number of votes by year and genre
votes_by_genre <- merged_data %>%
  group_by(year, genres.y) %>%
  summarize(total_votes = sum(num_votes, na.rm = TRUE), .groups = "drop")

# Normalize popularity by calculating percentage of total votes per year
votes_by_year <- votes_by_genre %>%
  group_by(year) %>%
  mutate(percent_popularity = (total_votes / sum(total_votes, na.rm = TRUE)) * 100) %>%
  ungroup()

# Define a custom theme for better aesthetics
custom_theme <- theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
    axis.title = element_text(size = 12, face = "bold"),
    axis.text = element_text(size = 10),
    legend.title = element_text(size = 10, face = "bold"),
    legend.text = element_text(size = 10)
  )

# Select only three key genres
selected_genres <- c("Action", "Comedy", "Romance")

# Filter dataset to include only the selected genres
filtered_votes_by_genre <- votes_by_genre %>%
  filter(genres.y %in% selected_genres)

filtered_votes_by_year <- votes_by_year %>%
  filter(genres.y %in% selected_genres)

# Plot: Multi-line chart for selected genres
ggplot(filtered_votes_by_genre, aes(x = year, y = total_votes, color = genres.y)) +
  geom_line(linewidth = 1.2) +
  scale_x_continuous(breaks = seq(min(filtered_votes_by_genre$year, na.rm = TRUE), 
                                  max(filtered_votes_by_genre$year, na.rm = TRUE), by = 20)) +
  scale_y_log10(labels = label_comma()) +
  scale_color_viridis_d(end = 0.8) +
  labs(
    title = "Genre Popularity Over Time (Total Votes)",
    x = "Year",
    y = "Total Votes",
    color = "Genre"
  ) +
  guides(color = guide_legend(reverse = TRUE)) +  # Reverse legend order
  custom_theme

# Plot: Stacked area plot for selected genres
# position = "fill" rescales each year so the plotted genres sum to 100%
ggplot(filtered_votes_by_year, aes(x = year, y = percent_popularity, fill = genres.y)) +
  geom_area(position = "fill", alpha = 0.8) +
  scale_x_continuous(breaks = seq(min(filtered_votes_by_year$year, na.rm = TRUE), 
                                  max(filtered_votes_by_year$year, na.rm = TRUE), by = 20)) +
  scale_y_continuous(labels = label_percent()) +
  scale_fill_viridis_d(end = 0.8) +
  labs(
    title = "Relative Popularity of Selected Genres Over Time",
    x = "Year",
    y = "Share of Votes Among Selected Genres",
    fill = "Genre"
  ) +
  custom_theme
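
The stacked area chart above combines two normalizations: percent_popularity is computed against every genre in a year, while position = "fill" rescales again using only the genres that are plotted. The following is a minimal sketch of the difference, using made-up vote totals rather than the real data; the names toy_votes, selected, share_of_all, and share_of_selected are illustrative and do not appear in the analysis.

```r
library(dplyr)

# Toy vote totals for a single year (made-up numbers, not from the dataset)
toy_votes <- data.frame(
  year        = c(2000, 2000, 2000, 2000),
  genre       = c("Action", "Comedy", "Romance", "Drama"),
  total_votes = c(10, 30, 60, 100)
)

selected <- c("Action", "Comedy", "Romance")

toy_shares <- toy_votes %>%
  group_by(year) %>%
  # Denominator: votes for every genre in that year
  mutate(share_of_all = total_votes / sum(total_votes)) %>%
  filter(genre %in% selected) %>%
  # Denominator: votes for the selected genres only
  mutate(share_of_selected = total_votes / sum(total_votes)) %>%
  ungroup()

print(toy_shares)
```

In this toy year, Romance holds 30% of all votes but 60% of the votes among the three selected genres. The second figure is the kind of value a filled area chart displays.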

Discussion

Visualization 1 Discussion

The visualization shows that the most popular genre among Romance, Comedy, and Action has changed from year to year for “summer” movies. For example, in 1958 Romance received the most votes, but by 1963 Comedy had become the most-voted genre. These shifts could reflect changing audience preferences, or they could be driven by a single well-advertised, widely discussed movie in a given year that drew an influx of votes for itself and, in turn, for its genre.

Action has historically received fewer votes than Comedy and Romance, which we might expect given that movies involving “summer” tend to be fun and light-hearted. A notable year seems to be 2009, when votes spiked in all three genres. The Comedy and Romance votes that year may largely represent votes for “500 Days of Summer”, which received 585K votes according to IMDb.

Visualization 2 Discussion

The visualization shows each genre’s proportion of votes over time. There are no votes for Romance movies with “summer” in the title until around 1935, and none for Action movies until around 1956. We speculate that the emergence of Action movies may be related to advances in film technology around that time, such as early computer-generated imagery (CGI) in the late 1950s, although CGI did not become common in films until decades later. Most notably, every other year or so there is a large influx of votes for Romance movies, which then dominates the other genres. Similar spikes seem to occur for the other genres as well, in alternating years. From around 1997 to 2020, Comedy and Romance dominated heavily over Action in popularity.
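
Because some genres have no titles in some years, the aggregated table simply has no row for those year and genre pairs, and an absent row is not the same as an explicit zero when an area chart is drawn. The following is a minimal sketch, on made-up data, of filling such gaps with zeros using tidyr::complete(). It is not part of the analysis above, and the names toy_votes and toy_complete are illustrative.

```r
library(dplyr)
library(tidyr)

# Toy aggregated votes: Romance has no row for 1930 (made-up numbers)
toy_votes <- data.frame(
  year        = c(1930, 1931, 1931),
  genre       = c("Comedy", "Comedy", "Romance"),
  total_votes = c(40, 25, 75)
)

# Add every missing year/genre pair and set its votes to 0
toy_complete <- toy_votes %>%
  complete(year, genre, fill = list(total_votes = 0)) %>%
  arrange(year, genre)

print(toy_complete)
```

This only fills combinations of the years and genres that already occur in the data. Years with no titles at all would need to be added separately, for example with tidyr::full_seq(year, 1) inside complete().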

Question 2: How do average rating and runtime correlate across different title types?

Introduction

This question explores the relationship between the runtime of a “summer” movie and its average rating on IMDb, separated by title type (movie, video, tvMovie). Visualizing this relationship could show whether making a movie longer or shorter affects its audience rating. Analyzing the title types separately may also give more meaningful insights, since different title types may have inherently different characteristics (e.g., TV movies might be shorter and have lower ratings than theatrical releases).

Approach

To analyze the relationship between IMDb rating and runtime across different title types, we will create two distinct visualizations: a scatter plot and a box plot. These plots will allow us to examine patterns in audience engagement and rating distribution while making meaningful comparisons between different types of media.

The scatter plot will display the relationship between runtime and IMDb rating. Each point will represent a movie, TV movie, or video, with runtime in minutes on the x-axis and IMDb rating on the y-axis. Color will differentiate the title types, making it easy to see whether certain types of media tend to have longer runtimes or higher ratings.

The box plot will compare the distribution of IMDb ratings across title types. It will show how ratings differ between movies, TV movies, and videos through the median and interquartile range; individual outlier points are hidden to reduce clutter. The box plot is well suited to identifying whether one category tends to have higher or more consistent ratings than the others.

Analysis

library(tidyverse)


movies <- read_csv("data/summer_movies.csv")
genres <- read_csv("data/summer_movie_genres.csv")


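# Use the movies table directly: joining to genres would repeat each title
# once per genre and over-count multi-genre titles in the plots below.
# The name merged_data is kept so the plotting code is unchanged.
merged_data <- movies %>%
  drop_na(num_votes, average_rating, runtime_minutes, title_type)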

# Convert 'year' to integer
merged_data$year <- as.integer(merged_data$year)

# Scatter plot: Runtime vs. IMDb Rating
runtime_plot <- ggplot(merged_data, aes(x = runtime_minutes, y = average_rating, color = title_type)) +
  geom_point(alpha = 0.6, size = 1.5) + 
  geom_smooth(se = FALSE) + 
  labs(
    title = "Runtime vs. IMDb Rating",
    x = "Runtime (Minutes)",
    y = "IMDb Rating",
    color = "Title Type"
  ) +
  scale_x_continuous(labels = scales::comma) +  # Format axis labels
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    axis.title = element_text(size = 12, face = "bold"),
    legend.title = element_text(size = 11, face = "bold")
  )

# Print scatter plot
print(runtime_plot)

boxplot_rating <- ggplot(merged_data, aes(x = title_type, y = average_rating, fill = title_type)) +
  geom_boxplot(alpha = 0.7, outlier.shape = NA) + # Reduce clutter by removing outlier points
  labs(
    title = "Distribution of IMDb Ratings by Title Type",
    x = "Title Type",
    y = "IMDb Rating",
    fill = "Title Type"
  ) +
  scale_y_continuous(breaks = seq(1, 10, by = 1)) + # Ensure consistent y-axis intervals
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    axis.title = element_text(size = 12, face = "bold"),
    legend.position = "none" # Remove redundant legend
  )

# Print box plot
print(boxplot_rating)

Discussion

Visualization 1 Discussion

The visualization does not show a clear correlation between runtime and average IMDb rating for summer-related movies. Most movies run around 90 to 100 minutes, as expected, but ratings range from low to high at most runtimes across all three title types. Some small trends can still be gleaned from this data: for the movie and tvMovie types, very low ratings become less common as runtime moves away from the mean in either direction.
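
The visual impression of a weak relationship could be backed by a correlation coefficient per title type. The following is a minimal sketch on made-up data, not a result from the actual dataset; toy_movies and cor_by_type are illustrative names, and the column names mirror those used in the analysis.

```r
library(dplyr)

# Toy titles (made-up runtimes and ratings)
toy_movies <- data.frame(
  title_type      = c("movie", "movie", "movie", "tvMovie", "tvMovie", "tvMovie"),
  runtime_minutes = c(80, 95, 110, 85, 90, 95),
  average_rating  = c(5.5, 6.5, 6.0, 6.8, 6.2, 6.4)
)

# Correlation between runtime and rating within each title type
cor_by_type <- toy_movies %>%
  group_by(title_type) %>%
  summarize(
    n            = n(),
    pearson_r    = cor(runtime_minutes, average_rating),
    spearman_rho = cor(runtime_minutes, average_rating, method = "spearman"),
    .groups      = "drop"
  )

print(cor_by_type)
```

Pearson's r measures linear association, while Spearman's rho uses ranks and is less sensitive to a few very long or very short titles. Reporting both alongside the group size n helps show how much weight each value can bear.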

Visualization 2 Discussion

All title types have a median IMDb rating between 6 and 7. Videos appear to have the highest ratings, though not by a wide margin, and they also show the most variation. Overall, the box plot does not show a clear relationship between title type and rating.
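
The quantities a box plot encodes can also be tabulated directly, which makes it easier to compare centers and spreads across title types. The following is a minimal sketch on made-up ratings, not values from the dataset; toy_ratings and rating_summary are illustrative names.

```r
library(dplyr)

# Toy ratings (made-up numbers)
toy_ratings <- data.frame(
  title_type     = c("movie", "movie", "movie", "video", "video", "video"),
  average_rating = c(6.0, 6.5, 7.0, 4.0, 7.0, 9.0)
)

# Center and spread of ratings within each title type
rating_summary <- toy_ratings %>%
  group_by(title_type) %>%
  summarize(
    n             = n(),
    median_rating = median(average_rating),
    mean_rating   = mean(average_rating),
    iqr_rating    = IQR(average_rating),
    .groups       = "drop"
  )

print(rating_summary)
```

The median and IQR correspond to the center line and box height in the plot, while the mean is shown only for comparison since a box plot does not display it.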

Presentation

Our presentation can be found here.

Data

R for Data Science. (2024). Tidy Tuesday: Summer Movies Dataset (2024-07-30). Retrieved from https://github.com/rfordatascience/tidytuesday/blob/main/data/2024/2024-07-30/readme.md
