Project proposal

Author

Proud Mink: Cindy Wang (cw748), Reinesse Wong (rw634), Hannah Wang (hw536)

library(tidyverse)

Dataset

Dataset Description

This project uses Netflix engagement data published in Netflix’s What We Watched reports, released biannually beginning in late 2023. The dataset was compiled by Jen Richmond (RLadies Sydney) and distributed through the TidyTuesday project. It combines viewing data from four reporting periods spanning late 2023 through mid-2025 and represents approximately 99% of total Netflix viewing hours globally.

movies <- readr::read_csv(
  'https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-07-29/movies.csv'
)

Rows: 36121 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): source, report, title, available_globally, runtime
dbl  (2): hours_viewed, views
date (1): release_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

shows <- readr::read_csv(
  'https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-07-29/shows.csv'
)

Rows: 27803 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): source, report, title, available_globally, runtime
dbl  (2): hours_viewed, views
date (1): release_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(movies)

Rows: 36,121
Columns: 8
$ source             <chr> "1_What_We_Watched_A_Netflix_Engagement_Report_2025…
$ report             <chr> "2025Jan-Jun", "2025Jan-Jun", "2025Jan-Jun", "2025J…
$ title              <chr> "Back in Action", "STRAW", "The Life List", "Exterr…
$ available_globally <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Ye…
$ release_date       <date> 2025-01-17, 2025-06-06, 2025-03-28, 2025-04-30, 20…
$ hours_viewed       <dbl> 313000000, 185200000, 198900000, 159000000, 1549000…
$ runtime            <chr> "1H 54M 0S", "1H 48M 0S", "2H 5M 0S", "1H 49M 0S", …
$ views              <dbl> 164700000, 102900000, 95500000, 87500000, 86900000,…

glimpse(shows)

Rows: 27,803
Columns: 8
$ source             <chr> "1_What_We_Watched_A_Netflix_Engagement_Report_2025…
$ report             <chr> "2025Jan-Jun", "2025Jan-Jun", "2025Jan-Jun", "2025J…
$ title              <chr> "Adolescence: Limited Series", "Squid Game: Season …
$ available_globally <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Y…
$ release_date       <date> 2025-03-13, 2024-12-26, 2025-06-27, 2025-02-20, 20…
$ hours_viewed       <dbl> 555100000, 840300000, 438600000, 315800000, 2186000…
$ runtime            <chr> "3H 50M 0S", "7H 10M 0S", "6H 8M 0S", "5H 9M 0S", "…
$ views              <dbl> 144800000, 117300000, 71500000, 61300000, 58000000,…

#combine for summary stats
netflix_combined <- bind_rows(
  movies %>% mutate(content_type = "Movie"),
  shows %>% mutate(content_type = "Show")
)
print(nrow(movies))

[1] 36121

print(ncol(movies))

[1] 8

print(nrow(shows))

[1] 27803

print(ncol(shows))

[1] 8

print(nrow(netflix_combined)) #total rows; print() ignores big.mark, so it is dropped

[1] 63924

print(format(
  sum(netflix_combined$hours_viewed, na.rm = TRUE) / 1e9,
  digits = 2
)) #total hours

[1] "373"

print(n_distinct(netflix_combined$report)) #num reporting periods

[1] 4

Dataset Dimensions

The dataset consists of two related tables:

Movies dataset: 36,121 observations across 8 variables
Shows dataset: 27,803 observations across 8 variables
Combined total: 63,924 title entries across the 4 reporting periods

Key variables include:

title: Content title
available_globally: Whether content is available worldwide ("Yes") or regionally restricted ("No")
release_date: Original release date of the content
hours_viewed: Total hours spent viewing the content during the reporting period, recorded in hours (e.g., 313000000)
runtime: Total runtime of the movie or season, stored as text (e.g., "1H 54M 0S")
views: Approximate number of views, calculated as hours_viewed / runtime
report: The reporting period (e.g., 2024Jan-Jun)

Together, the two tables record approximately 373 billion viewing hours across the 4 reporting periods.

Importantly, the views variable is not independently observed by Netflix; it is inferred from hours_viewed and runtime. As a result, views will be interpreted as an approximate engagement indicator rather than true viewership counts.
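The relationship between views, hours_viewed, and runtime can be checked directly. The sketch below is a minimal illustration using three rows copied from the glimpse(movies) output above. parse_runtime_hours() is a hypothetical helper of our own, and it assumes every runtime string follows the "<h>H <m>M <s>S" pattern seen in that output.

```r
library(dplyr)
library(stringr)

# Three rows copied from the glimpse(movies) output above
sample_movies <- tibble(
  title        = c("Back in Action", "STRAW", "The Life List"),
  hours_viewed = c(313000000, 185200000, 198900000),
  runtime      = c("1H 54M 0S", "1H 48M 0S", "2H 5M 0S"),
  views        = c(164700000, 102900000, 95500000)
)

# Hypothetical helper (not part of the dataset or any package):
# converts "<h>H <m>M <s>S" text into decimal hours.
# Assumption: all runtime strings follow this exact pattern.
parse_runtime_hours <- function(x) {
  parts <- str_match(x, "(\\d+)H (\\d+)M (\\d+)S")
  as.numeric(parts[, 2]) +
    as.numeric(parts[, 3]) / 60 +
    as.numeric(parts[, 4]) / 3600
}

# Compare the reported views with hours_viewed / runtime
views_check <- sample_movies %>%
  mutate(
    runtime_hours  = parse_runtime_hours(runtime),
    implied_views  = hours_viewed / runtime_hours,
    relative_error = abs(implied_views - views) / views
  )

print(views_check)
```

For these three rows the implied and reported values should differ only by rounding, which is consistent with views being derived rather than measured.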

Why this Dataset

We chose this dataset for its analytical opportunities: time patterns (release timing and reporting periods), geographic availability (global vs. regional), content features (runtime and format), and engagement metrics (hours viewed and views). These support time series analysis, cohort comparisons, and segmentation of the data. Compared to alternative datasets we considered, such as Pixar films, Netflix’s engagement data provides substantially greater volume and variability, enabling deeper visual exploration. Because the data is sourced directly from Netflix’s official reports, it is also reliable and relevant to real-world business applications, particularly in understanding product performance and audience behavior.

Questions

Question 1: Is Netflix engagement driven more by new releases or by older catalog titles across reporting periods?

As streaming platforms mature, understanding whether viewer engagement is driven primarily by newly released titles or by older catalog content is critical for content investment strategy. If engagement is heavily concentrated among recently released titles, Netflix may need to continually invest in fresh content to sustain subscriber attention. Alternatively, if older titles continue to generate substantial engagement, this suggests long-term value in the content library and supports investment in durable, evergreen programming.

This question examines how total engagement is distributed across titles of different ages at the time of each reporting period. Specifically, we analyze whether recently released titles (e.g., within 3 months of release) account for a disproportionate share of viewing hours compared to older titles, and whether this pattern changes across reporting periods.

Existing Variables include:

hours_viewed - total viewing hours for each title (num)
views - calculated engagement metric (num)
release_date - release date (date)
report - reporting period (categorical)
runtime - runtime of the title (character, to be parsed into a duration)

Variables to Create:

report_end (date): end date of each reporting period
months_since_release (numeric): time between release_date and report_end
age_bucket (categorical): categorize titles by age at time of reporting (e.g., 0–3 months, 3–12 months, 12+ months)
engagement_share (numeric): proportion of total hours_viewed within each age bucket per reporting period
views_per_month (numeric): views divided by months_since_release to normalize engagement intensity
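A minimal sketch of how these variables might be derived is shown below. It uses three rows copied from the glimpse(movies) output plus one made-up catalog title, and it rests on several assumptions of ours: report labels end in "Jun" or "Dec", a month is approximated as 30.44 days, and months_since_release is floored at one month in views_per_month to avoid dividing by values near zero.

```r
library(dplyr)
library(stringr)

# Rows 1-3 are copied from glimpse(movies); row 4 is a made-up placeholder
toy <- tibble(
  title        = c("Back in Action", "STRAW", "The Life List",
                   "Hypothetical catalog title"),
  report       = "2025Jan-Jun",
  release_date = as.Date(c("2025-01-17", "2025-06-06",
                           "2025-03-28", "2020-01-01")),
  hours_viewed = c(313000000, 185200000, 198900000, 100000000),
  views        = c(164700000, 102900000, 95500000, 50000000)
)

q1 <- toy %>%
  mutate(
    # Assumption: the label is "<year>Jan-Jun" or "<year>Jul-Dec"
    report_year = str_sub(report, 1, 4),
    report_end  = as.Date(if_else(
      str_ends(report, "Jun"),
      paste0(report_year, "-06-30"),
      paste0(report_year, "-12-31")
    )),
    # Assumption: one month is approximated as 30.44 days
    months_since_release =
      as.numeric(report_end - release_date, units = "days") / 30.44,
    # Left-closed buckets: [0, 3), [3, 12), [12, Inf)
    age_bucket = cut(
      months_since_release,
      breaks = c(0, 3, 12, Inf),
      labels = c("0-3 months", "3-12 months", "12+ months"),
      right  = FALSE
    ),
    # Assumption: floor at one month so very new titles do not explode
    views_per_month = views / pmax(months_since_release, 1)
  )

# Share of hours_viewed in each age bucket, per reporting period
engagement_share <- q1 %>%
  group_by(report, age_bucket) %>%
  summarise(hours = sum(hours_viewed), .groups = "drop") %>%
  group_by(report) %>%
  mutate(engagement_share = hours / sum(hours)) %>%
  ungroup()

print(engagement_share)
```

Titles with a missing release_date, or a release_date after report_end, would receive an NA age_bucket under this sketch and would need a separate decision.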

Question 2: How does engagement differ between globally distributed and regionally released titles across reporting periods and content type?

As Netflix expands globally, understanding how content availability (global vs. regional) relates to viewer engagement is important for content release strategy. This question asks whether globally available content outperforms regional-only content, and how that relationship varies across reporting periods and content types. The answer could help Netflix decide whether to direct more resources toward globally released titles or toward regional ones.

Existing Variables include:

hours_viewed - total viewing hours for each title (num)
views - calculated engagement metric (num)
available_globally - global vs regional (categorical)
release_date - release date (date)
report - reporting period (categorical)

Variables to Create:

report_start, report_end (date): start and end dates of each reporting period.
exposure_start (date): the later of release_date or report_start, representing when the title became available during the reporting window.
months_exposed (numeric): the number of months the title was available within the reporting window.
adjusted_hours_viewed (numeric): hours viewed divided by the number of months the title was exposed during the reporting window.
performance_index (numeric): normalized engagement metric calculated by dividing adjusted_hours_viewed by the median value within each report × content_type group.

Explanation for Variables Created + Normalization: These variables make engagement comparisons between titles fair despite differences in release timing. Because Netflix reports engagement within fixed reporting windows, titles released earlier in a window have more time to accumulate viewing hours than titles released later. To correct for this, each title’s exposure within the window is measured from exposure_start to report_end and converted into months (months_exposed); a title released before the window opened counts as exposed for the full window. Engagement is then adjusted by dividing total viewing hours by months_exposed, producing adjusted_hours_viewed, the average viewing hours per month of availability within the reporting period. Finally, adjusted_hours_viewed is divided by the median value within each reporting period and content type to create the performance_index. This index compares each title with the typical title of the same content type in the same report: a value of 1 represents median performance, values greater than 1 indicate above-median engagement, and values below 1 indicate below-median engagement.
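The normalization described above could be sketched as follows. The release dates and hours come from the first three rows of each glimpse() output; the id column is a placeholder of ours, since some titles are truncated in that output. The 30.44-day month and the inclusive day count are assumptions, not part of the dataset.

```r
library(dplyr)
library(stringr)

# release_date and hours_viewed copied from rows 1-3 of glimpse(movies)
# and glimpse(shows); id is a placeholder label of ours
toy <- tibble(
  id           = c("movie_1", "movie_2", "movie_3",
                   "show_1", "show_2", "show_3"),
  content_type = rep(c("Movie", "Show"), each = 3),
  report       = "2025Jan-Jun",
  release_date = as.Date(c("2025-01-17", "2025-06-06", "2025-03-28",
                           "2025-03-13", "2024-12-26", "2025-06-27")),
  hours_viewed = c(313000000, 185200000, 198900000,
                   555100000, 840300000, 438600000)
)

q2 <- toy %>%
  mutate(
    # Assumption: the label is "<year>Jan-Jun" or "<year>Jul-Dec"
    report_year  = str_sub(report, 1, 4),
    first_half   = str_ends(report, "Jun"),
    report_start = as.Date(paste0(report_year,
                                  if_else(first_half, "-01-01", "-07-01"))),
    report_end   = as.Date(paste0(report_year,
                                  if_else(first_half, "-06-30", "-12-31"))),
    # The later of release_date and report_start
    exposure_start = pmax(release_date, report_start),
    # Assumption: inclusive day count, one month = 30.44 days
    months_exposed =
      (as.numeric(report_end - exposure_start, units = "days") + 1) / 30.44,
    adjusted_hours_viewed = hours_viewed / months_exposed
  ) %>%
  # Normalize by the median within each report x content_type group
  group_by(report, content_type) %>%
  mutate(
    performance_index = adjusted_hours_viewed / median(adjusted_hours_viewed)
  ) %>%
  ungroup()

print(select(q2, id, months_exposed, adjusted_hours_viewed, performance_index))
```

One thing this toy run makes visible: a title released a few days before report_end has a very small months_exposed, so its adjusted_hours_viewed becomes very large. A minimum exposure threshold may be worth considering.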

Analysis Plan

To assess Question 1, whether Netflix engagement is driven more by new releases or by older catalog titles, we will use two complementary visualizations capturing both overall engagement distribution and normalized intensity. First, we will create a stacked bar chart showing the percentage of total hours_viewed contributed by each age_bucket (e.g., 0–3 months, 3–12 months, 12+ months) within each reporting period. Converting values to proportions allows us to compare how engagement composition shifts over time, independent of overall viewing volume. Second, we will construct faceted boxplots of views_per_month by age_bucket across reporting periods. Because engagement is highly skewed, boxplots effectively display differences in median intensity and spread, while views_per_month adjusts for exposure time so older titles are not mechanically advantaged. Together, these plots allow us to evaluate both structural reliance on new content and engagement efficiency across title lifecycles.
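The two Question 1 plots might be structured as in the sketch below. The data are simulated with rlnorm() purely to show the plot layout; they are not results. The log-scaled y axis on the boxplots is an assumption of ours for handling skew, not something fixed by the plan above.

```r
library(dplyr)
library(ggplot2)

set.seed(42)
buckets <- c("0-3 months", "3-12 months", "12+ months")

# Simulated placeholder data (NOT Netflix values), 50 titles per cell
sim_titles <- tibble(
  report          = rep(c("2024Jul-Dec", "2025Jan-Jun"), each = 150),
  age_bucket      = factor(rep(rep(buckets, each = 50), times = 2),
                           levels = buckets),
  hours_viewed    = rlnorm(300, meanlog = 13, sdlog = 1.5),
  views_per_month = rlnorm(300, meanlog = 10, sdlog = 1.5)
)

# Plot 1: share of hours_viewed by age bucket within each report.
# position = "fill" rescales each bar so the stacks sum to 100%.
p_share <- ggplot(sim_titles,
                  aes(x = report, y = hours_viewed, fill = age_bucket)) +
  geom_col(position = "fill") +
  scale_y_continuous(labels = scales::label_percent()) +
  labs(x = "Reporting period", y = "Share of hours viewed",
       fill = "Title age")

# Plot 2: views_per_month by age bucket, one facet per report.
# Assumption: log scale to keep skewed values readable.
p_intensity <- ggplot(sim_titles,
                      aes(x = age_bucket, y = views_per_month)) +
  geom_boxplot() +
  scale_y_log10() +
  facet_wrap(~ report) +
  labs(x = "Title age", y = "Views per month (log scale)")

print(p_share)
print(p_intensity)
```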

For Question 2, we examine how viewer engagement differs between globally distributed and regionally released titles, and how those differences vary across reporting periods and content types (Movie vs Show). To make fair comparisons within each reporting window, engagement is first adjusted for time actually available inside the window by computing hours viewed per month exposed (adjusted_hours_viewed). Because overall viewing levels can differ by period and by format, we then normalize within each report × content_type group by dividing by the group median to create a relative performance_index, where 1 represents the “typical” title in that specific report period and content type.

We visualize these patterns in two ways. The boxplots compare Global vs Regional titles within Movies and Shows using the log-scaled performance index, highlighting differences in the median, spread (IQR), and outliers. The ridgeline density plots, faceted by content type and stacked by reporting period, show how the full distribution of performance shifts over time and whether Global and Regional releases differ in where most titles concentrate (relative to the dashed reference at 1). Together, these visuals summarize how both the typical performance and the shape of engagement variability differ between Global and Regional titles across reporting windows.
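A possible layout for the boxplot comparison is sketched below. The performance_index values are simulated and carry no information about real titles, and the availability labels assume that "Yes" in available_globally means Global and "No" means Regional. The ridgeline version is left out because it would need an additional package such as ggridges.

```r
library(dplyr)
library(ggplot2)

set.seed(1)

# Simulated placeholder data (NOT Netflix values), 100 titles per cell
sim_index <- tibble(
  content_type       = rep(c("Movie", "Show"), each = 200),
  available_globally = rep(rep(c("Yes", "No"), each = 100), times = 2),
  performance_index  = rlnorm(400, meanlog = 0, sdlog = 1)
) %>%
  # Assumption: "Yes" = Global, "No" = Regional
  mutate(availability = if_else(available_globally == "Yes",
                                "Global", "Regional"))

# Boxplots of the log-scaled index, with the dashed reference at 1
p_index <- ggplot(sim_index,
                  aes(x = availability, y = performance_index)) +
  geom_boxplot() +
  geom_hline(yintercept = 1, linetype = "dashed") +
  scale_y_log10() +
  facet_wrap(~ content_type) +
  labs(x = NULL, y = "Performance index (log scale)")

print(p_index)
```

On a log scale, equal distances above and below the dashed line correspond to equal ratios, so a title at 2 and a title at 0.5 sit equally far from typical performance.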