Holiday Movies & Their Changes


Yellow Echidna
Nayeon Kwon, Aishwarya Gupta, Yuxuan Chen


Our data comes from a set of CSV files about holiday-themed movies. We use one of them, ‘holiday_movies.csv’, which contains detailed information about each movie: identifiers, titles, release years, runtimes, genres, ratings, vote counts, and boolean flags indicating whether specific holiday keywords (“Christmas,” “Hanukkah,” “Kwanzaa,” and “holiday”) appear in the title.

The ‘tconst’ column is a unique alphanumeric identifier for each movie, typical of movie databases like IMDb. The ‘title_type’, ‘primary_title’, ‘original_title’, and ‘simple_title’ fields describe the movie’s title in different ways: ‘title_type’ gives the format of the title (movie, video, or tvMovie), ‘primary_title’ gives the promotional title, ‘original_title’ gives the original-language title, and ‘simple_title’ holds a simplified title for filtering and grouping. The ‘year’, ‘runtime_minutes’, ‘average_rating’, and ‘num_votes’ fields contain numerical data: the release year, the movie’s duration, the IMDb user rating, and the number of votes, respectively. The ‘genres’ field contains up to three genres associated with each movie. Lastly, the ‘christmas’, ‘hanukkah’, ‘kwanzaa’, and ‘holiday’ fields are boolean (logical) indicators of whether the movie’s title contains the corresponding holiday-related term.
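As a rough illustration of how the four logical flags relate to the title, the sketch below rebuilds them from a lowercase title column with `stringr::str_detect()`. The three titles are made up, and the matching rule (keyword anywhere in the title) is an assumption for illustration; the dataset's maintainers may have used a different rule.

```r
library(dplyr)
library(stringr)

# Hypothetical titles for illustration, not rows from holiday_movies.csv
toy_movies <- tibble(
  simple_title = c(
    "one snowy christmas",
    "eight nights of hanukkah",
    "the long holiday weekend"
  )
)

# Assumed rule: a flag is TRUE when the keyword appears anywhere in the title
toy_flags <- toy_movies %>%
  mutate(
    christmas = str_detect(simple_title, "christmas"),
    hanukkah  = str_detect(simple_title, "hanukkah"),
    kwanzaa   = str_detect(simple_title, "kwanzaa"),
    holiday   = str_detect(simple_title, "holiday")
  )

toy_flags
```

Because each flag is checked independently, a single title can set more than one flag, which is worth keeping in mind when counts per holiday are compared.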

Question 2 <- How did Christmas, Hanukkah, and Kwanzaa movie distribution change over the years? How different are their average ratings?


The second question looks at how the three types of holiday movies are distributed over the years, with particular attention to how average ratings relate to their production. We chose this question because we were curious about which holidays movies were made for and whether those trends changed over time. We plan to answer it using the variables ‘christmas’, ‘hanukkah’, ‘kwanzaa’, ‘year’, and ‘average_rating’.


To explore our question, we plan to use both a line plot and a bar plot. The line plot will show how the number of movies of each type changes over the years, making it easy to compare trends across the holidays and to track changes over time. The bar plot will show the average rating of each holiday’s movies, giving a direct comparison of which holiday’s movies are rated highest and which lowest. Together, the two plots combine temporal trends with rating data.



library(tidyverse)  # dplyr, tidyr, and ggplot2 are used below
library(gridExtra)  # provides grid.arrange()

# Count the movies flagged for each holiday in every release year
holiday_movies_line <- holiday_movies %>%
  mutate(across(c(christmas, hanukkah, kwanzaa), as.numeric)) %>%
  group_by(year) %>%
  summarise(
    christmas_movies = sum(christmas),
    hanukkah_movies = sum(hanukkah),
    kwanzaa_movies = sum(kwanzaa)
  ) %>%
  pivot_longer(cols = ends_with("_movies"), names_to = "holiday", values_to = "num_movies", names_prefix = "")

# Line plot: number of movies per year, one line per holiday
line_plot <- ggplot(holiday_movies_line, aes(x = year, y = num_movies, color = holiday)) +
  geom_line() +
  labs(
    title = "Holiday Movie Distribution Over Years",
    x = "Year", y = "Number of Movies",
    color = "Holiday"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    plot.title.position = "plot",
    legend.position = "bottom"
  )

holiday_movies_bar <- holiday_movies %>%
  pivot_longer(cols = c(christmas, hanukkah, kwanzaa), names_to = "holiday_type", values_to = "is_holiday") %>%
  filter(is_holiday) %>%
  group_by(holiday_type) %>%
  summarise(average_rating = mean(average_rating, na.rm = TRUE))

# Bar plot: mean rating per holiday; the legend is hidden because it repeats the x axis
bar_plot <- ggplot(holiday_movies_bar, aes(x = holiday_type, y = average_rating, fill = holiday_type)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Average Ratings",
    x = "Holiday", y = "Average Rating",
    fill = "Holiday"
  ) +
  theme_minimal() +
  scale_fill_brewer(palette = "Set1") +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = rel(0.9)),
    plot.title = element_text(face = "bold", hjust = 0.5),
    plot.title.position = "plot",
    legend.position = "none"
  )

grid.arrange(line_plot, bar_plot, ncol = 2, widths = c(5, 2))


The line plot, titled “Holiday Movie Distribution Over Years,” shows a marked increase in the number of Christmas movies produced over time, with a particularly sharp rise in recent years. Hanukkah and Kwanzaa movies, by contrast, remain scarce throughout, which suggests a stronger cultural or commercial emphasis on Christmas-themed content in the film industry.

In the bar plot, “Average Ratings,” Kwanzaa movies have the highest average rating, followed closely by Hanukkah and then Christmas movies. So while Christmas movies are far more abundant, Kwanzaa films, though few in number, tend to receive higher ratings on average. This could reflect a quality-over-quantity pattern in which the less frequently produced holiday movies are reviewed more favorably, or it could reflect the rating behavior of a niche audience for these films.
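Because these averages come from groups of very different sizes, it may help to report the number of movies behind each bar alongside the mean. The sketch below uses made-up data (`toy_ratings` and its values are hypothetical, not taken from the dataset) and adds a count column, `n_movies`, to the same kind of summary used for the bar plot.

```r
library(dplyr)
library(tidyr)

# Hypothetical rows for illustration only
toy_ratings <- tibble(
  christmas      = c(TRUE, TRUE, TRUE, FALSE, FALSE),
  hanukkah       = c(FALSE, FALSE, FALSE, TRUE, FALSE),
  kwanzaa        = c(FALSE, FALSE, FALSE, FALSE, TRUE),
  average_rating = c(6.0, 5.0, 7.0, 6.5, 7.5)
)

# Same reshaping as the bar-plot data, plus the group size
rating_summary <- toy_ratings %>%
  pivot_longer(
    cols = c(christmas, hanukkah, kwanzaa),
    names_to = "holiday_type",
    values_to = "is_holiday"
  ) %>%
  filter(is_holiday) %>%
  group_by(holiday_type) %>%
  summarise(
    n_movies       = n(),
    average_rating = mean(average_rating, na.rm = TRUE)
  )

rating_summary
```

In this toy data a single movie determines the Hanukkah and Kwanzaa averages, so one unusually high or low rating would move those bars much more than it would move the Christmas bar.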


Our presentation can be found here.


Harmon, J. (2023). rfordatascience/tidytuesday. GitHub. [Retrieved on Feb 29th, 2024]

