Project proposal

Author

Trusting-Panda

library(tidyverse)
library(lubridate)

movies <- read_csv("data/movies.csv")
shows <- read_csv("data/shows.csv")

Dataset

This project uses the “What have we been watching on Netflix?” dataset from TidyTuesday (2025-07-29), which compiles official Netflix Engagement Reports from late 2023 through the first half of 2025. These reports summarize the total hours users spent watching each title during six-month reporting periods. The full dataset covers four consecutive six-month reporting periods: 2023Jul-Dec, 2024Jan-Jun, 2024Jul-Dec, and 2025Jan-Jun. These reports summarize the total hours users spent watching each title during these distinct periods.

The dataset includes two files:

movies with 36121 rows and 8 columns
shows with 27803 rows and 8 columns

The key variables include report, available_globally, release_date, hours_viewed, runtime, and views. We picked this dataset because Netflix releases these reports every six months, so we can look at how viewing numbers change over time and compare titles with different availability settings.

glimpse(movies)

Rows: 36,121
Columns: 8
$ source             <chr> "1_What_We_Watched_A_Netflix_Engagement_Report_2025…
$ report             <chr> "2025Jan-Jun", "2025Jan-Jun", "2025Jan-Jun", "2025J…
$ title              <chr> "Back in Action", "STRAW", "The Life List", "Exterr…
$ available_globally <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Ye…
$ release_date       <date> 2025-01-17, 2025-06-06, 2025-03-28, 2025-04-30, 20…
$ hours_viewed       <dbl> 313000000, 185200000, 198900000, 159000000, 1549000…
$ runtime            <chr> "1H 54M 0S", "1H 48M 0S", "2H 5M 0S", "1H 49M 0S", …
$ views              <dbl> 164700000, 102900000, 95500000, 87500000, 86900000,…

glimpse(shows)

Rows: 27,803
Columns: 8
$ source             <chr> "1_What_We_Watched_A_Netflix_Engagement_Report_2025…
$ report             <chr> "2025Jan-Jun", "2025Jan-Jun", "2025Jan-Jun", "2025J…
$ title              <chr> "Adolescence: Limited Series", "Squid Game: Season …
$ available_globally <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Y…
$ release_date       <date> 2025-03-13, 2024-12-26, 2025-06-27, 2025-02-20, 20…
$ hours_viewed       <dbl> 555100000, 840300000, 438600000, 315800000, 2186000…
$ runtime            <chr> "3H 50M 0S", "7H 10M 0S", "6H 8M 0S", "5H 9M 0S", "…
$ views              <dbl> 144800000, 117300000, 71500000, 61300000, 58000000,…

Questions

Is Netflix’s content fleeting or enduring? Utilizing all five reporting periods, how do the post-release hype curves of films decay over time, and do these trajectories differ significantly across officially designated genres and languages?
How much does “global availability” contribute marginally to viewership? Are there specific types of programs that can generate extremely high returns even without global availability?

Analysis plan

Question 1: Post-release hype curves across genres and languages

Variables involved: - report - the specific reporting period (utilizing all four available periods to build a time series)

hours_viewed - engagement metric
title - used as the primary key to fetch external metadata
release_date - to establish the baseline for our time series
runtime - to normalize viewing patterns

Variables to be created: - genre_category - official primary genre (e.g., Animation, Action, Drama) fetched programmatically from the OMDB API.

content_language - official primary language fetched programmatically from the OMDB API.
periods_since_release - an integer calculating how many 6-month reporting periods have passed since the title’s release_date.
viewership_decay_rate - the percentage change in hours_viewed between consecutive reporting periods for the exact same title.

Analysis approach: To address the limitations of single-period analysis and manual coding, this question will utilize the full longitudinal depth of the dataset combined with external API data.

Filter the combined dataset to include titles that appear across multiple reporting periods to enable longitudinal tracking.
Query the OMDB (Open Movie Database) API using the title column to programmatically fetch accurate genre_category and content_language data, eliminating the severe inaccuracies of keyword-guessing.
Calculate the periods_since_release for each row to align all titles on a standardized temporal axis (e.g., Period 1 post-release, Period 2 post-release).
Apply a log transformation to hours_viewed to account for severe right-skewness caused by blockbuster titles.
Generate line plots tracking the median hours_viewed over periods_since_release.
Use faceting or color mapping to compare these decay trajectories across different genre_category and content_language groups.
Identify structural differences: Do non-English titles experience a sharper drop-off than English titles? Do Documentaries decay differently than Animations?

External data: We will utilize the OMDB API. By writing an R script to pass Netflix titles to the API, we will extract authoritative metadata (Genre and Language). This directly addresses concerns regarding the feasibility and accuracy of manually coding variables.

Note: Since we only have one reporting period (2025Jan-Jun), we’ll need to use release_date to create cohorts of titles released in different time windows and compare their current engagement levels.

Question 2: Global availability impact on viewership

Variables involved: - available_globally (binary Yes/No indicator), hours_viewed (total engagement metric), views (number of views), runtime (content length), and release_date (used for time normalization).

These variables allow us to evaluate whether global availability contributes to engagement differences while controlling for structural and temporal factors.

Variables to be created: -

months_since_release: Elapsed months between each title’s release_date and the end of the reporting period. Corrects for exposure bias, since earlier-released titles have had more time to accumulate engagement.

normalized_hours: Calculated as hours_viewed / months_since_release. Produces a time-adjusted engagement metric.

hours_per_view: Calculated as hours_viewed / views. Approximates average viewing intensity per title.

viewership_efficiency: Calculated as views / runtime. Normalizes popularity by content length.

content_type: Rather than manually categorizing titles through string matching, we will query the OMDB API using title name and release year to retrieve standardized genre and format metadata reliably.

Analysis approach:

We will analyze movies and shows separately, since runtime structure and viewing behavior differ substantially between formats.

For each dataset (movies and shows), split titles by available_globally (Yes vs. No)
Normalize hours_viewed by months_since_release to correct for exposure time bias.
Calculate summary statistics:
- Mean, median, and maximum of normalized hours_viewed for globally vs. regionally available content
- Distribution of views across both availability groups
Create visualizations:
- Side-by-side box plots comparing normalized_hours distributions for globally and regionally available titles (faceted by content type: movies vs. shows).
- Scatter plots of views vs. runtime, colored by available_globally, to examine whether global availability is associated with higher popularity across runtime ranges.
Identify high-performing regional content (titles where available_globally = No but normalized hours_viewed is exceptionally high)
Retrieve content_type and genre metadata for these outliers via the OMDB API, and examine whether strong regional performance clusters around specific genres or formats.
Estimate the marginal contribution of global availability using regression analysis: normalized_hours ~ available_globally + runtime + months_since_release

Separate regressions will be conducted for movies and shows. This allows us to estimate whether global availability independently predicts engagement after controlling for exposure time and content length.

External data: - If extension analysis is conducted, we will query the OMDB API using title name and release year to retrieve standardized genre and format metadata for identified regional outliers. This avoids manual string matching and improves reliability. All other variables are derived directly from the existing dataset. The split between globally and regionally available titles is sufficiently balanced to support meaningful comparisons without additional sampling or supplementary data.