Project proposal

Author

Proud Mink: Cindy Wang (cw748), Reinesse Wong (rw634), Hannah Wang (hw536)

library(tidyverse)

Dataset

Dataset Description

This project uses Netflix engagement data published in Netflix’s What We Watched reports, released twice a year beginning in late 2023. The dataset was compiled by Jen Richmond (RLadies Sydney) and distributed through the TidyTuesday project. It combines viewing data from four reporting periods spanning late 2023 through mid-2025 and represents approximately 99% of total Netflix viewing hours globally.

movies <- readr::read_csv(
  'https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-07-29/movies.csv'
)
Rows: 36121 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): source, report, title, available_globally, runtime
dbl  (2): hours_viewed, views
date (1): release_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
shows <- readr::read_csv(
  'https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-07-29/shows.csv'
)
Rows: 27803 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): source, report, title, available_globally, runtime
dbl  (2): hours_viewed, views
date (1): release_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(movies)
Rows: 36,121
Columns: 8
$ source             <chr> "1_What_We_Watched_A_Netflix_Engagement_Report_2025…
$ report             <chr> "2025Jan-Jun", "2025Jan-Jun", "2025Jan-Jun", "2025J…
$ title              <chr> "Back in Action", "STRAW", "The Life List", "Exterr…
$ available_globally <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Ye…
$ release_date       <date> 2025-01-17, 2025-06-06, 2025-03-28, 2025-04-30, 20…
$ hours_viewed       <dbl> 313000000, 185200000, 198900000, 159000000, 1549000…
$ runtime            <chr> "1H 54M 0S", "1H 48M 0S", "2H 5M 0S", "1H 49M 0S", …
$ views              <dbl> 164700000, 102900000, 95500000, 87500000, 86900000,…
glimpse(shows)
Rows: 27,803
Columns: 8
$ source             <chr> "1_What_We_Watched_A_Netflix_Engagement_Report_2025…
$ report             <chr> "2025Jan-Jun", "2025Jan-Jun", "2025Jan-Jun", "2025J…
$ title              <chr> "Adolescence: Limited Series", "Squid Game: Season …
$ available_globally <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Y…
$ release_date       <date> 2025-03-13, 2024-12-26, 2025-06-27, 2025-02-20, 20…
$ hours_viewed       <dbl> 555100000, 840300000, 438600000, 315800000, 2186000…
$ runtime            <chr> "3H 50M 0S", "7H 10M 0S", "6H 8M 0S", "5H 9M 0S", "…
$ views              <dbl> 144800000, 117300000, 71500000, 61300000, 58000000,…
#combine for summary stats
netflix_combined <- bind_rows(
  movies %>% mutate(content_type = "Movie"),
  shows %>% mutate(content_type = "Show")
)
print(nrow(movies))
[1] 36121
print(ncol(movies))
[1] 8
print(nrow(shows))
[1] 27803
print(ncol(shows))
[1] 8
print(nrow(netflix_combined))
[1] 63924
print(format(
  sum(netflix_combined$hours_viewed, na.rm = TRUE) / 1e9,
  digits = 2
)) #total hours
[1] "373"
print(n_distinct(netflix_combined$report)) #num reporting periods
[1] 4

Dataset Dimensions

The dataset consists of two related tables:

  • Movies dataset: 36,121 observations across 8 variables
  • Shows dataset: 27,803 observations across 8 variables
  • Combined total: 63,924 rows of title-level records

Key variables include:

  • title: Content title
  • available_globally: Whether content is available worldwide or regionally restricted
  • release_date: Original release date of the content
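  • hours_viewed: Total hours spent viewing the content (raw hours; top titles reach hundreds of millions)
  • runtime: Length of the movie or, for shows, of the full season, stored as text (e.g., "1H 54M 0S")
  • views: Number of views, calculated as hours_viewed divided by runtime
  • report: The reporting period (e.g., 2024Jan-Jun)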

The dataset has viewing activity totaling approximately 373 billion hours across the 4 reporting periods.

Importantly, the views variable is not independently observed by Netflix; it is inferred from hours_viewed and runtime. As a result, views will be interpreted as an approximate engagement indicator rather than true viewership counts.
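
As a rough check of this relationship, the sketch below recomputes views from hours_viewed and runtime for the first three movies shown in the glimpse(movies) output. Runtimes are typed in by hand as hours, and the names check, implied_views, and ratio are illustrative only; they are not columns in the dataset.

```r
# First three rows copied from the glimpse(movies) output
check <- data.frame(
  title         = c("Back in Action", "STRAW", "The Life List"),
  hours_viewed  = c(313000000, 185200000, 198900000),
  runtime_hours = c(1 + 54 / 60, 1 + 48 / 60, 2 + 5 / 60), # 1H 54M, 1H 48M, 2H 5M
  views         = c(164700000, 102900000, 95500000)
)

# Views implied by the formula hours_viewed / runtime
check$implied_views <- check$hours_viewed / check$runtime_hours

# A ratio near 1 means the reported views match the formula up to rounding
check$ratio <- check$implied_views / check$views
print(check[, c("title", "views", "implied_views", "ratio")])
```

For these three rows the ratios fall well within 1% of 1, which is consistent with views being a rounded quotient of the other two columns.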

Why this Dataset

We chose this dataset for its analytical opportunities: temporal patterns (release timing and reporting periods), geographic availability (global versus regional), content features such as runtime, and engagement metrics such as hours viewed and view counts. These allow us to create visualizations for time series analysis, cohort comparisons, and segmentation. Compared to alternative datasets we considered, such as Pixar films, Netflix’s engagement data provides substantially greater volume and variability (63,924 rows across four reporting periods), enabling deeper visual exploration. Because the data is sourced directly from Netflix’s official reports, it is also reliable and relevant to real-world business applications, particularly in understanding product performance and audience behavior.

Questions

Question 1: How does engagement differ between globally distributed and regionally released titles across reporting periods and release seasons?

As Netflix expands globally, understanding how content availability (global vs. regional) relates to viewer engagement across release periods is important for content release strategy. This question would help us understand whether global content outperforms regional-only content, and how this relationship varies across reporting periods and seasons. The answer could inform how Netflix allocates resources between globally distributed content and regional titles.

Existing Variables include:

  • hours_viewed - total viewing hours for each title (num)
  • views - calculated engagement metric (num)
  • available_globally - global vs regional (categorical)
  • release_date - original release date (date)
  • report - reporting period (categorical)

Variables to Create:

  • release_season (categorical): season derived from the month of release_date (spring, summer, fall, winter)
  • release_year (numeric): year extracted from release_date
  • months_since_release (numeric): months between release_date and the end of the reporting period
  • performance_tier (categorical): quartiles of hours_viewed within each reporting period
  • adjusted_hours_viewed (numeric): hours_viewed divided by months_since_release, used to account for unequal exposure time

Because the views variable is inferred from runtime and hours_viewed, engagement for Question 1 will primarily be measured using adjusted_hours_viewed.
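
The derived variables for Question 1 could be built along the following lines. This is a minimal sketch on made-up rows, not the final implementation. Several pieces are assumptions: the helper report_end() is hypothetical and assumes each report label ends on June 30 or December 31 of the labeled year, the season boundaries follow meteorological seasons, and months_since_release is floored at one month so that titles released near (or after) the end of a period do not produce extreme or negative values.

```r
library(tidyverse)
library(lubridate)

# Toy rows standing in for netflix_combined (values are illustrative only)
toy <- tibble(
  title        = c("A", "B", "C", "D"),
  report       = c("2025Jan-Jun", "2025Jan-Jun", "2024Jul-Dec", "2024Jul-Dec"),
  release_date = as.Date(c("2025-01-17", "2024-07-04", "2024-10-01", NA)),
  hours_viewed = c(313e6, 50e6, 120e6, 10e6)
)

# Hypothetical helper: map a report label to the assumed last day of its period
report_end <- function(report) {
  yr   <- str_sub(report, 1, 4)
  last <- if_else(str_detect(report, "Jan-Jun"), "06-30", "12-31")
  as.Date(paste0(yr, "-", last))
}

toy_q1 <- toy %>%
  mutate(
    release_year   = year(release_date),
    # Rows with a missing release_date fall through to NA
    release_season = case_when(
      month(release_date) %in% c(12, 1, 2) ~ "Winter",
      month(release_date) %in% 3:5         ~ "Spring",
      month(release_date) %in% 6:8         ~ "Summer",
      month(release_date) %in% 9:11        ~ "Fall"
    ),
    # Months between release and period end, floored at 1 (assumption)
    months_since_release = pmax(
      time_length(interval(release_date, report_end(report)), "month"),
      1
    ),
    adjusted_hours_viewed = hours_viewed / months_since_release
  ) %>%
  group_by(report) %>%
  # Quartile of hours_viewed within each reporting period (1 = lowest)
  mutate(performance_tier = ntile(hours_viewed, 4)) %>%
  ungroup()

print(toy_q1)
```

Titles with a missing release_date keep NA for the season, months_since_release, and adjusted_hours_viewed, so they would need to be dropped or handled separately before plotting.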

Question 2: How quickly does engagement decay after release, and does this decay differ between movies and shows?

This question examines how audience engagement changes over time following a title’s release and whether the rate of decline differs between movies and television shows. Because titles accumulate viewing hours unevenly depending on when they are released, raw engagement totals alone do not capture post-release dynamics. Instead, this analysis focuses on time since release as a primary explanatory variable, allowing us to model engagement decay and compare decay patterns across content formats.

Existing Variables include:

  • runtime: length of the movie or show
  • hours_viewed: total hours watched
  • views: derived engagement metric (hours_viewed / runtime)
  • report: reporting period
  • content_type: movie or show (categorical; added when combining the two tables)
  • release_date (date): original release date, used to adjust for exposure time

Variables to Create:

  • runtime_category (categorical): short, medium, or long, based on runtime_minutes
  • runtime_minutes (numeric): runtime parsed into minutes
  • months_since_release (numeric): months between release_date and the end of the reporting period
  • adjusted_hours_viewed (numeric): hours_viewed divided by months_since_release
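
One way to derive runtime_minutes and runtime_category is sketched below on made-up rows. The helper runtime_to_minutes() is hypothetical and assumes runtime strings follow the "1H 54M 0S" pattern seen in the glimpse output, with absent components treated as zero. The 90 and 240 minute cut points are placeholders; because show runtimes appear to cover a full season, the final cut points may need to be set separately for movies and shows.

```r
library(tidyverse)

# Hypothetical helper: "1H 54M 0S" -> 114 minutes; absent components count as 0
runtime_to_minutes <- function(x) {
  part <- function(unit) {
    coalesce(as.numeric(str_extract(x, paste0("\\d+(?=", unit, ")"))), 0)
  }
  # Keep missing runtimes as NA instead of turning them into 0
  if_else(is.na(x), NA_real_, part("H") * 60 + part("M") + part("S") / 60)
}

# Toy rows (illustrative values only)
toy_runtime <- tibble(
  content_type = c("Movie", "Movie", "Show", "Show"),
  runtime      = c("1H 54M 0S", "45M 0S", "7H 10M 0S", NA)
) %>%
  mutate(
    runtime_minutes  = runtime_to_minutes(runtime),
    # Placeholder cut points: [0, 90) short, [90, 240) medium, 240+ long
    runtime_category = cut(
      runtime_minutes,
      breaks = c(0, 90, 240, Inf),
      labels = c("Short", "Medium", "Long"),
      right  = FALSE
    )
  )

print(toy_runtime)
```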

Analysis Plan

To address Question 1, we will examine how engagement differs between globally distributed and regionally released titles across reporting periods and release seasons, using adjusted_hours_viewed as the primary engagement metric. Engagement distributions will be visualized using faceted boxplots with available_globally on the x-axis, adjusted_hours_viewed on the y-axis, release_season encoded by fill, and facets by reporting period. This design enables comparison of engagement patterns across distribution strategies while accounting for seasonal and temporal variation. In addition, we will compute summary statistics (mean and median adjusted_hours_viewed) within each availability and season group to support interpretation of distributional differences. A complementary heatmap will display mean adjusted_hours_viewed by release_season and available_globally to highlight broader seasonal trends. Together, these visualizations and summaries will allow us to evaluate whether globally released titles consistently show higher engagement than regional releases and whether these patterns shift across reporting windows. Because available_globally is binary, we also plan to facet density plots, scatterplots with trend lines, and other distribution plots by regional versus global availability so the two groups can be compared side by side. The data do not identify the specific regions where a title is available, which is a limitation of our analysis.
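
The boxplot and grouped summaries described for Question 1 might look roughly like this. The sketch runs on simulated data (the toy_q1_plot table and its log-normal values are invented for illustration), and the log-scaled y-axis is our assumption that viewing hours will be heavily right-skewed; it is not something confirmed from the real data yet.

```r
library(tidyverse)

set.seed(1)

# Simulated stand-in for the prepared data: 2 reports x 2 availability
# levels x 4 seasons, 5 titles per combination
toy_q1_plot <- tibble(
  report                = rep(c("2024Jul-Dec", "2025Jan-Jun"), each = 40),
  available_globally    = rep(c("Yes", "No"), times = 40),
  release_season        = rep(c("Winter", "Spring", "Summer", "Fall"),
                              each = 2, times = 10),
  adjusted_hours_viewed = rlnorm(80, meanlog = 14, sdlog = 1)
)

# Mean and median engagement within each report, availability, and season group
summary_stats <- toy_q1_plot %>%
  group_by(report, available_globally, release_season) %>%
  summarise(
    mean_adj   = mean(adjusted_hours_viewed),
    median_adj = median(adjusted_hours_viewed),
    n          = n(),
    .groups    = "drop"
  )

# Faceted boxplot: availability on x, season as fill, one panel per report
p_box <- ggplot(
  toy_q1_plot,
  aes(x = available_globally, y = adjusted_hours_viewed, fill = release_season)
) +
  geom_boxplot() +
  scale_y_log10() +
  facet_wrap(~ report) +
  labs(
    x    = "Available globally",
    y    = "Adjusted hours viewed (log scale)",
    fill = "Release season"
  )

print(summary_stats)
print(p_box)
```

The same summary_stats table could feed the planned heatmap, for example through geom_tile() with release_season and available_globally on the axes and mean_adj as the fill.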

For Question 2, we will examine how engagement decays over time following release and whether decay patterns differ between movies and television shows. Engagement decay will be analyzed by modeling adjusted_hours_viewed as a function of months_since_release and visualized using scatterplots with smoothed trend lines, with content_type encoded by color and facets by reporting period. The slopes and shapes of these curves will be used to compare decay rates between formats and across time. To further contextualize these patterns, supporting boxplots will compare adjusted_hours_viewed across runtime_category and content_type, allowing assessment of whether content length is associated with sustained engagement differently for movies versus shows. Together, these visualizations will characterize post-release engagement trajectories and reveal whether movies and television shows exhibit distinct lifecycle behaviors.
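
A sketch of the decay comparison follows, again on simulated data. The decay rates (0.10 per month for movies, 0.05 for shows) are invented so the example has something to recover, and the log-linear model is one possible assumption about the shape of decay rather than a settled modeling choice. On the real data, facet_wrap(~ report) would be added to split the panels by reporting period.

```r
library(tidyverse)

set.seed(42)

# Simulated titles with exponential decay plus multiplicative noise
toy_decay <- tibble(
  content_type          = rep(c("Movie", "Show"), each = 100),
  months_since_release  = rep(seq(1, 24, length.out = 100), times = 2),
  decay_rate            = if_else(content_type == "Movie", 0.10, 0.05), # made up
  adjusted_hours_viewed = 5e6 * exp(-decay_rate * months_since_release) *
    rlnorm(200, meanlog = 0, sdlog = 0.2)
)

# Log-linear model: the interaction term lets the slope differ by content type
fit <- lm(
  log(adjusted_hours_viewed) ~ months_since_release * content_type,
  data = toy_decay
)

# Per-format slopes on the log scale (more negative = faster decay)
b <- coef(fit)
slopes <- c(
  Movie = unname(b["months_since_release"]),
  Show  = unname(b["months_since_release"] +
                   b["months_since_release:content_typeShow"])
)
print(slopes)

# Scatterplot with smoothed trend lines, colored by content type
p_decay <- ggplot(
  toy_decay,
  aes(x = months_since_release, y = adjusted_hours_viewed, colour = content_type)
) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  scale_y_log10() +
  labs(
    x      = "Months since release",
    y      = "Adjusted hours viewed (log scale)",
    colour = "Content type"
  )

print(p_decay)
```

Comparing the two fitted slopes gives a single-number summary of decay speed per format, while the loess curves show whether the decay is actually close to log-linear.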