Project proposal

Author

Team name

library(tidyverse)
library(haven)
library(dplyr)
install.packages("tidytuesdayR")

Dataset

tuesdata <- tidytuesdayR::tt_load('2025-07-29')
movies <- tuesdata$movies
shows <- tuesdata$shows
movies
# A tibble: 36,121 × 8
   source      report title available_globally release_date hours_viewed runtime
   <chr>       <chr>  <chr> <chr>              <date>              <dbl> <chr>  
 1 1_What_We_… 2025J… Back… Yes                2025-01-17      313000000 1H 54M…
 2 1_What_We_… 2025J… STRAW Yes                2025-06-06      185200000 1H 48M…
 3 1_What_We_… 2025J… The … Yes                2025-03-28      198900000 2H 5M …
 4 1_What_We_… 2025J… Exte… Yes                2025-04-30      159000000 1H 49M…
 5 1_What_We_… 2025J… Havoc Yes                2025-04-25      154900000 1H 47M…
 6 1_What_We_… 2025J… The … No                 NA              106800000 1H 26M…
 7 1_What_We_… 2025J… The … Yes                2025-03-14      158200000 2H 8M …
 8 1_What_We_… 2025J… Coun… Yes                2025-02-28      101000000 1H 25M…
 9 1_What_We_… 2025J… Ad V… Yes                2025-01-10      114000000 1H 38M…
10 1_What_We_… 2025J… Desp… No                 NA              109400000 1H 34M…
# ℹ 36,111 more rows
# ℹ 1 more variable: views <dbl>
shows
# A tibble: 27,803 × 8
   source      report title available_globally release_date hours_viewed runtime
   <chr>       <chr>  <chr> <chr>              <date>              <dbl> <chr>  
 1 1_What_We_… 2025J… Adol… Yes                2025-03-13      555100000 3H 50M…
 2 1_What_We_… 2025J… Squi… Yes                2024-12-26      840300000 7H 10M…
 3 1_What_We_… 2025J… Squi… Yes                2025-06-27      438600000 6H 8M …
 4 1_What_We_… 2025J… Zero… Yes                2025-02-20      315800000 5H 9M …
 5 1_What_We_… 2025J… Miss… Yes                2025-01-01      218600000 3H 46M…
 6 1_What_We_… 2025J… Amer… Yes                2025-02-17      120600000 2H 9M …
 7 1_What_We_… 2025J… Ms. … Yes                NA              162100000 3H 2M …
 8 1_What_We_… 2025J… Sire… Yes                2025-05-22      252300000 4H 44M…
 9 1_What_We_… 2025J… The … Yes                2025-01-23      457800000 8H 36M…
10 1_What_We_… 2025J… Ginn… Yes                2025-06-05      508200000 10H 34…
# ℹ 27,793 more rows
# ℹ 1 more variable: views <dbl>
write.csv(movies, file = "data/movies.csv", row.names = FALSE)
write.csv(shows, file = "data/shows.csv", row.names = FALSE)

A brief description of your dataset including its provenance, dimensions, etc. as well as the reason why you chose this dataset.

This dataset comes from the TidyTuesday project (Week 30, 2025) and is sourced from Netflix’s official twice-yearly Engagement Reports, published on the Netflix Newsroom. It is split into two tables, movies and shows: the movies table has 36,121 rows and 8 columns, and the shows table has 27,803 rows and 8 columns, for 63,924 observations in total. Each row represents a title within a specific reporting period rather than a unique title, so the same movie or show can appear multiple times across the four reporting windows from July 2023 through June 2025. Key variables include title, release_date, runtime, hours_viewed, views (total hours viewed divided by runtime), available_globally, and report.
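As a quick sanity check on the views definition, the sketch below converts a runtime string to hours and divides hours_viewed by it. It assumes runtime strings follow an "xH yM" pattern like the truncated values in the printed tibbles; the helper name runtime_to_hours and the toy row are hypothetical and not part of the dataset or any package.

```r
library(dplyr)
library(stringr)

# Hypothetical helper: convert strings such as "1H 54M" to hours.
# Assumption: the hour count is followed by "H" and the minute count by "M".
runtime_to_hours <- function(runtime) {
  hours   <- coalesce(as.numeric(str_extract(runtime, "\\d+(?=H)")), 0)
  minutes <- coalesce(as.numeric(str_extract(runtime, "\\d+(?=M)")), 0)
  # Keep missing runtimes missing instead of turning them into 0 hours.
  if_else(is.na(runtime), NA_real_, hours + minutes / 60)
}

# Toy row modeled on the first movie shown above (runtime shortened to "1H 54M").
toy <- tibble(hours_viewed = 313000000, runtime = "1H 54M") |>
  mutate(
    runtime_hours = runtime_to_hours(runtime),
    implied_views = hours_viewed / runtime_hours
  )

toy
```

Comparing implied_views with the views column on the real tables would show how closely the stated definition holds, including any rounding applied in the source reports.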

Reason why we chose this Dataset

We chose this dataset because it provides large-scale, real-world engagement data from Netflix, a major global streaming platform. Unlike ratings or rankings, which tend to reflect opinion or relative placement, this dataset measures actual audience behavior through total hours viewed. This makes it more valuable for understanding what content actually captures and sustains attention.

This dataset also suits our research questions because it contains multiple reporting periods, which allows us to analyze changes over time. It also differentiates between movies and shows, includes release dates from which content age can be calculated, and identifies whether or not titles were available globally. Together, these features allow us to examine how content lifecycle, format, and distribution strategy relate to performance.

Questions

The two questions you want to answer.

Q1: Do movies or shows released in winter/summer get more views than those released in spring/fall?

Q2: Does globally available content consistently outperform regionally restricted titles, and has that gap changed across reporting periods?

Analysis plan

A plan for answering each of the questions including the variables involved, variables to be created (if any), external data to be merged in (if any).

Question 1: Plan of analysis

Variables involved:

  • release_date: in both movies and shows; the release date of each title, the basis for title age

  • views: in both movies and shows; viewership at the time of reporting

  • report: in both movies and shows; used to determine the reference date for the age calculation

  • title: in both movies and shows; the title identifier

Variables to be created:

  • content_type — a label ("Movie" or "Show") added before stacking both datasets into one combined table using bind_rows().

  • release_month — extracted from release_date using month(), stored as an abbreviated label (Jan, Feb, etc.) for plotting

  • release_month_num — the numeric version of the release month (1–12), used to ensure months are ordered correctly on the x-axis

  • season: a factor variable derived from release_month_num that assigns each title to “Spring”, “Summer”, “Fall”, or “Winter” based on its release month, with levels ordered Spring, Summer, Fall, Winter.

Analytical approach:

  1. Data preparation — Add a content_type column to each table, then combine with bind_rows() into a single netflix tibble. Extract release_month and release_month_num from release_date, then derive the season variable using a case_when() statement.

  2. Bar chart — Compute average views per release_month grouped by season, then plot with release_month on the x-axis, average views on the y-axis, and bars color-mapped by season. This reveals whether certain months or seasons consistently drive higher engagement across all titles.

  3. Box plot — Plot season on the x-axis against log10(views) on the y-axis, faceted by content_type. This shows the full spread and consistency of viewership within each season and whether the seasonal pattern differs between movies and shows.
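The three steps above can be sketched as follows. This is a minimal illustration on made-up toy tables (movies_toy and shows_toy are hypothetical stand-ins for the real data), and the month-to-season mapping is an assumption (Northern Hemisphere meteorological seasons), since the plan does not fix which months belong to which season.

```r
library(dplyr)
library(lubridate)
library(ggplot2)

# Toy stand-ins for the real tables; all values are invented for illustration.
movies_toy <- tibble(
  title = c("Movie A", "Movie B", "Movie C"),
  release_date = as.Date(c("2025-01-17", "2025-06-06", NA)),
  views = c(1.6e8, 1.0e8, 7.0e7)
)
shows_toy <- tibble(
  title = c("Show A", "Show B"),
  release_date = as.Date(c("2025-03-13", "2024-10-26")),
  views = c(1.4e8, 1.2e8)
)

# Step 1: label each table, stack them, and derive month and season.
netflix <- bind_rows(
  movies_toy |> mutate(content_type = "Movie"),
  shows_toy |> mutate(content_type = "Show")
) |>
  filter(!is.na(release_date)) |>  # titles without a release date have no season
  mutate(
    release_month_num = month(release_date),
    release_month = month(release_date, label = TRUE, abbr = TRUE),
    # Assumed mapping: Dec-Feb Winter, Mar-May Spring, Jun-Aug Summer, Sep-Nov Fall.
    season = case_when(
      release_month_num %in% c(12, 1, 2) ~ "Winter",
      release_month_num %in% 3:5 ~ "Spring",
      release_month_num %in% 6:8 ~ "Summer",
      release_month_num %in% 9:11 ~ "Fall"
    ),
    season = factor(season, levels = c("Spring", "Summer", "Fall", "Winter"))
  )

# Step 2: average views per release month, with bars filled by season.
monthly <- netflix |>
  group_by(season, release_month) |>
  summarise(avg_views = mean(views, na.rm = TRUE), .groups = "drop")

bar_plot <- ggplot(monthly, aes(x = release_month, y = avg_views, fill = season)) +
  geom_col() +
  labs(x = "Release month", y = "Average views", fill = "Season")

# Step 3: spread of log10(views) by season, one panel per content type.
# Rows with zero or missing views are dropped because log10() is undefined there.
box_plot <- netflix |>
  filter(!is.na(views), views > 0) |>
  ggplot(aes(x = season, y = log10(views))) +
  geom_boxplot() +
  facet_wrap(~content_type) +
  labs(x = "Season", y = "log10(views)")
```

Because release_month is an ordered factor, the x-axis of the bar chart follows calendar order without extra sorting. Swapping the toy tables for movies and shows should be the only change needed, provided the column names match.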

Question 2: Plan of analysis

Variables involved:

  • available_globally (categorical: Yes/No): grouping variable

  • hours_viewed (numeric): primary engagement metric

  • report (categorical: 4 reporting periods): temporal dimension

  • Content type (movies vs. shows): derived from which dataset each row comes from

Variables to be created:

  • content_type: a new column (“Movie” or “Show”) added before row-binding the two datasets, enabling a combined analysis

  • log_hours_viewed: log-transformed version of hours_viewed to handle the heavy right skew in viewership data

  • report: may need to be re-coded as an ordered factor so plots display periods chronologically
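A minimal sketch of the derived Question 2 variables is shown below. The report labels are placeholders, because the real values are truncated in the printed tibbles; the choice of log10 and the treatment of zero hours as missing are also assumptions, not decisions stated in the plan.

```r
library(dplyr)

# Placeholder labels listed in chronological order; replace with the real
# report values from the data.
report_levels <- c("period_1", "period_2", "period_3", "period_4")

# Toy combined table; all values are invented for illustration.
combined_toy <- tibble(
  report = c("period_2", "period_1", "period_4", "period_3"),
  available_globally = c("Yes", "No", "Yes", "No"),
  hours_viewed = c(313000000, 106800000, 555100000, 0),
  content_type = c("Movie", "Movie", "Show", "Show")
) |>
  mutate(
    # Ordered factor so plots show the periods chronologically.
    report = factor(report, levels = report_levels, ordered = TRUE),
    # log10 of hours viewed; zero hours become NA rather than -Inf.
    log_hours_viewed = if_else(hours_viewed > 0, log10(hours_viewed), NA_real_)
  )

combined_toy
```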

External data to be merged: None

Planned visualizations:

  • Faceted violin/box plot: distribution of log_hours_viewed by available_globally, faceted by report, colored by content_type

  • Line or dot plot of median hours_viewed over report periods, grouped by available_globally and faceted by content_type, to show whether the gap narrows, widens, or stays stable over time
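The two planned plots could look roughly like the sketch below. It runs on an invented toy table (q2_toy is hypothetical), uses placeholder report labels, and picks geom_boxplot() for the first plot; geom_violin() is the alternative named in the plan.

```r
library(dplyr)
library(ggplot2)

# Toy data: two placeholder periods, two titles per availability group.
q2_toy <- tibble(
  report = factor(rep(c("period_1", "period_2"), each = 4),
                  levels = c("period_1", "period_2"), ordered = TRUE),
  available_globally = rep(c("Yes", "Yes", "No", "No"), times = 2),
  content_type = "Movie",
  hours_viewed = c(100, 300, 50, 70, 200, 400, 80, 100)
)

# Distribution of log10(hours_viewed) by availability, one panel per report.
dist_plot <- ggplot(
  q2_toy,
  aes(x = available_globally, y = log10(hours_viewed), fill = content_type)
) +
  geom_boxplot() +
  facet_wrap(~report) +
  labs(x = "Available globally", y = "log10(hours viewed)", fill = "Content type")

# Median hours viewed per period and availability group.
medians <- q2_toy |>
  group_by(report, available_globally, content_type) |>
  summarise(median_hours = median(hours_viewed, na.rm = TRUE), .groups = "drop")

# Lines that diverge or converge across periods indicate a changing gap.
gap_plot <- ggplot(
  medians,
  aes(x = report, y = median_hours,
      colour = available_globally, group = available_globally)
) +
  geom_line() +
  geom_point() +
  facet_wrap(~content_type) +
  labs(x = "Report period", y = "Median hours viewed", colour = "Available globally")
```

Medians are used in the second plot because a few very popular titles would dominate a mean in data this skewed.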