The following package(s) will be installed:
- tidytuesdayR [1.2.1]
These packages will be installed into "~/proj-01-proud-beaver/renv/library/linux-ubuntu-noble/R-4.5/x86_64-pc-linux-gnu".
# Installing packages --------------------------------------------------------
- Installing tidytuesdayR ... OK [linked from cache]
Successfully installed 1 package in 20 milliseconds.
Dataset
tuesdata <- tidytuesdayR::tt_load('2025-07-29')
---- Compiling #TidyTuesday Information for 2025-07-29 ----
--- There are 2 files available ---
── Downloading files ───────────────────────────────────────────────────────────
1 of 2: "movies.csv"
Warning: The `file` argument of `vroom()` must use `I()` for literal data as of vroom
1.5.0.
# Bad:
vroom("X,Y\n1.5,2.3\n")
# Good:
vroom(I("X,Y\n1.5,2.3\n"))
ℹ The deprecated feature was likely used in the readr package.
Please report the issue at <https://github.com/tidyverse/readr/issues>.
2 of 2: "shows.csv"
movies <- tuesdata$movies
shows <- tuesdata$shows
movies
This dataset comes from the TidyTuesday project (Week 30, 2025) and is sourced from Netflix’s official twice-yearly Engagement Reports, published on the Netflix Newsroom. It is split into two tables, movies and shows: the movies table has 36,121 rows and 8 columns, and the shows table has 27,803 rows and 8 columns, for 63,924 observations in total. Each row represents a title within a specific reporting period rather than a unique title, so the same movie or show can appear multiple times across the four reporting windows from July 2023 through June 2025. Key variables include title, release_date, runtime, hours_viewed, views (total hours viewed divided by runtime), available_globally, and report.
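Because each row is a title within a reporting period, it can help to check how often titles repeat before summarising. The following is a minimal sketch on a hypothetical toy table, not the real data; the `report` labels and the `repeat_titles` name are assumptions made for illustration.

```r
library(dplyr)

# Hypothetical toy rows standing in for the real movies table
movies <- tibble(
  title = c("Movie A", "Movie A", "Movie B"),
  report = c("2024Jul-Dec", "2025Jan-Jun", "2025Jan-Jun"),  # assumed label format
  hours_viewed = c(5000000, 2000000, 900000)
)

# Rows and columns of the table
dim(movies)

# Titles that appear in more than one reporting period
repeat_titles <- movies |>
  count(title, name = "n_reports") |>
  filter(n_reports > 1)

repeat_titles
```

Running the same `count()` on the real tables would show how many titles contribute more than one observation.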
Reason why we chose this Dataset
We chose this dataset because it provides large-scale, real-world engagement data from Netflix, a major global streaming platform. Unlike ratings or rankings, which tend to reflect opinion or relative placement, this dataset measures actual audience behavior through total hours viewed. That makes it more useful for understanding which content actually captures and sustains attention.
We also found that this dataset suits our research questions: it contains multiple reporting periods, which allows us to analyze changes over time. It also differentiates between movies and shows, includes release dates for calculating content age, and identifies whether titles were available globally. Together, these features allow us to examine how content lifecycle, format, and distribution strategy relate to performance.
Questions
Q1: Do movies or shows released in winter/summer get more views than those released in spring/fall?
Q2: Does globally available content consistently outperform regionally restricted titles, and has that gap changed across reporting periods?
Analysis plan
Question 1 plan of analysis:
Variables involved:
- release_date: in both movies and shows; the release date of the title, used to derive release month and title age
- views: in both movies and shows; the number of views at the time of reporting
- report: in both movies and shows; used to determine the reference date for the age calculation
- title: in both movies and shows; the title identifier
Variables to be created:
content_type: a label ("Movie" or "Show") added before stacking both datasets into one combined table using bind_rows().
release_month: extracted from release_date using month() and stored as an abbreviated label (Jan, Feb, etc.) for plotting.
release_month_num: the numeric version of the release month (1–12), used to ensure months are ordered correctly on the x-axis.
season: a factor variable derived from release_month_num that assigns each title to “Spring”, “Summer”, “Fall”, or “Winter” based on its release month, with levels ordered Spring, Summer, Fall, Winter.
Analytical approach:
Data preparation — Add a content_type column to each table, then combine with bind_rows() into a single netflix tibble. Extract release_month and release_month_num from release_date, then derive the season variable using a case_when() statement.
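The data preparation step could look roughly like the sketch below. It uses small hypothetical tables in place of the real movies and shows data, and the month-to-season mapping (meteorological seasons for the Northern Hemisphere, with December through February as winter) is an assumption, since the plan does not fix the cut-offs.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy stand-ins for the real movies and shows tables
movies <- tibble(
  title = c("Movie A", "Movie B", "Movie C"),
  release_date = as.Date(c("2023-01-15", "2023-07-04", "2024-10-31")),
  views = c(1200000, 560000, 98000)
)
shows <- tibble(
  title = c("Show A", "Show B"),
  release_date = as.Date(c("2022-04-01", "2024-12-20")),
  views = c(4500000, 230000)
)

netflix <- bind_rows(
  movies |> mutate(content_type = "Movie"),
  shows |> mutate(content_type = "Show")
) |>
  mutate(
    release_month_num = month(release_date),
    release_month = month(release_date, label = TRUE, abbr = TRUE),
    # Assumed mapping: meteorological seasons, Northern Hemisphere.
    # Rows with a missing release_date match no branch and get NA.
    season = case_when(
      release_month_num %in% 3:5 ~ "Spring",
      release_month_num %in% 6:8 ~ "Summer",
      release_month_num %in% 9:11 ~ "Fall",
      release_month_num %in% c(12, 1, 2) ~ "Winter"
    ),
    season = factor(season, levels = c("Spring", "Summer", "Fall", "Winter"))
  )

netflix
```

Listing every season explicitly, rather than using a catch-all default, keeps titles with a missing release date out of the "Winter" group.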
Bar chart — Compute average views per release_month grouped by season, then plot with release_month on the x-axis, average views on the y-axis, and bars color-mapped by season. This reveals whether certain months or seasons consistently drive higher engagement across all titles.
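One possible way to build this bar chart is sketched here with ggplot2. The toy `netflix` table and the names `monthly_views` and `p_bar` are hypothetical; the real table would come from the data preparation step.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy table with the columns the bar chart needs
netflix <- tibble(
  release_month = factor(c("Jan", "Jan", "Apr", "Jul", "Oct"), levels = month.abb),
  season = factor(
    c("Winter", "Winter", "Spring", "Summer", "Fall"),
    levels = c("Spring", "Summer", "Fall", "Winter")
  ),
  views = c(100, 300, 50, 400, 80)
)

# Average views per release month, keeping season for the fill colour
monthly_views <- netflix |>
  group_by(season, release_month) |>
  summarise(avg_views = mean(views, na.rm = TRUE), .groups = "drop")

p_bar <- ggplot(monthly_views, aes(x = release_month, y = avg_views, fill = season)) +
  geom_col() +
  labs(x = "Release month", y = "Average views", fill = "Season")

p_bar
```

Because each month belongs to exactly one season, grouping by both columns gives one bar per month.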
Box plot — Plot season on the x-axis against log10(views) on the y-axis, faceted by content_type. This shows the full spread and consistency of viewership within each season and whether the seasonal pattern differs between movies and shows.
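The box plot could be drawn along these lines. The data here are simulated purely for illustration (log-normal views, an assumption rather than a property of the real data), and `plot_data` and `p_box` are hypothetical names.

```r
library(dplyr)
library(ggplot2)

season_levels <- c("Spring", "Summer", "Fall", "Winter")

# Simulated toy data; the real table comes from the data preparation step
set.seed(42)
netflix <- tibble(
  content_type = rep(c("Movie", "Show"), each = 40),
  season = factor(rep(season_levels, times = 20), levels = season_levels),
  views = round(rlnorm(80, meanlog = 12, sdlog = 1.5))
)

# log10() is undefined for zero or missing views, so drop those rows first
plot_data <- netflix |>
  filter(!is.na(season), !is.na(views), views > 0)

p_box <- ggplot(plot_data, aes(x = season, y = log10(views))) +
  geom_boxplot() +
  facet_wrap(~ content_type) +
  labs(x = "Release season", y = "log10(views)")

p_box
```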
Question 2 plan of analysis:
Variables involved:
- available_globally (categorical: Yes/No): grouping variable
- hours_viewed (numeric): primary engagement metric
- report (categorical: 4 reporting periods): temporal dimension
- Content type (movies vs. shows): derived from which dataset each row comes from
Variables to be created:
- content_type: a new column (“Movie” or “Show”) added before row-binding the two datasets, enabling a combined analysis
- log_hours_viewed: a log-transformed version of hours_viewed, to handle the heavy right skew in viewership data
- report: may need to be recoded as an ordered factor so that plots display the periods chronologically
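A minimal sketch of creating these variables follows, using hypothetical toy tables. Two things are assumptions: the `report` labels in `report_levels` (the real values should be checked with `unique(movies$report)`), and the use of base-10 logs, since the plan does not name a base.

```r
library(dplyr)

# Hypothetical toy stand-ins for the real tables
movies <- tibble(
  title = c("Movie A", "Movie B"),
  available_globally = c("Yes", "No"),
  hours_viewed = c(1000000, 10000),
  report = c("2025Jan-Jun", "2023Jul-Dec")
)
shows <- tibble(
  title = "Show A",
  available_globally = "Yes",
  hours_viewed = 100000000,
  report = "2024Jan-Jun"
)

# Assumed report labels, listed in chronological order
report_levels <- c("2023Jul-Dec", "2024Jan-Jun", "2024Jul-Dec", "2025Jan-Jun")

netflix_q2 <- bind_rows(
  movies |> mutate(content_type = "Movie"),
  shows |> mutate(content_type = "Show")
) |>
  # The log is undefined for zero or missing hours
  filter(!is.na(hours_viewed), hours_viewed > 0) |>
  mutate(
    log_hours_viewed = log10(hours_viewed),  # base 10 is an assumption
    report = factor(report, levels = report_levels, ordered = TRUE)
  )

netflix_q2
```

Setting the factor levels explicitly matters here because alphabetical ordering of the labels is not guaranteed to match chronological order.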
External data to be merged: None
Planned visualizations:
- Faceted violin/box plot: distribution of log_hours_viewed by available_globally, faceted by report and colored by content_type
- Line or dot plot: median hours_viewed across report periods, grouped by available_globally and faceted by content_type, to show whether the gap narrows, widens, or stays stable over time
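Both planned plots might be built roughly as follows. The data are simulated for illustration only, the `report` labels are assumed, and `median_hours`, `p_dist`, and `p_gap` are hypothetical names. The log-scaled y-axis on the line plot is a design choice added here, not part of the plan.

```r
library(dplyr)
library(ggplot2)

# Assumed report labels, in chronological order
report_levels <- c("2023Jul-Dec", "2024Jan-Jun", "2024Jul-Dec", "2025Jan-Jun")

# Simulated toy data: 20 titles per content type, report, and availability group
set.seed(1)
netflix_q2 <- expand.grid(
  content_type = c("Movie", "Show"),
  report = report_levels,
  available_globally = c("Yes", "No"),
  rep_id = 1:20,
  stringsAsFactors = FALSE
) |>
  as_tibble() |>
  mutate(
    report = factor(report, levels = report_levels, ordered = TRUE),
    hours_viewed = rlnorm(n(), meanlog = 13, sdlog = 1.5),
    log_hours_viewed = log10(hours_viewed)
  )

# Plot 1: distribution of log hours viewed, faceted by report
p_dist <- ggplot(
  netflix_q2,
  aes(x = available_globally, y = log_hours_viewed, fill = content_type)
) +
  geom_violin() +
  facet_wrap(~ report)

# Plot 2: median hours viewed over time, by availability
median_hours <- netflix_q2 |>
  group_by(content_type, report, available_globally) |>
  summarise(median_hours = median(hours_viewed, na.rm = TRUE), .groups = "drop")

p_gap <- ggplot(
  median_hours,
  aes(x = report, y = median_hours,
      colour = available_globally, group = available_globally)
) +
  geom_line() +
  geom_point() +
  facet_wrap(~ content_type) +
  scale_y_log10()

p_dist
p_gap
```

The `group` aesthetic is needed because `report` is a discrete axis; without it, ggplot2 would not know which points to connect.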