Project proposal

Author

Trusting-Panda

library(tidyverse)
library(lubridate)

Dataset

A brief description of your dataset including its provenance, dimensions, etc. as well as the reason why you chose this dataset.

This project we uses the “What have we been watching on Netflix?” dataset from TidyTuesday (2025-07-29), which compiles official Netflix Engagement Reports from late 2023 through the first half of 2025. These reports summarize the total hours users spent watching each title during six-month reporting periods.

The dataset includes two files:

1.movies with r nrow(movies) rows and r ncol(movies) columns

2.shows with r nrow(shows) rows and r ncol(shows) columns

The key variables include report, available_globally, release_date, hours_viewed, runtime, and views. The dataset contains both numerical and categorical variables, allowing us to explore engagement patterns across time and availability categories.

We chose this dataset because it provides multiple reporting periods and measurable engagement metrics, which enable comparisons of viewing performance over time.

movies <- read_csv(
  'https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-07-29/movies.csv'
)
Rows: 36121 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): source, report, title, available_globally, runtime
dbl  (2): hours_viewed, views
date (1): release_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
shows <- read_csv(
  'https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-07-29/shows.csv'
)
Rows: 27803 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): source, report, title, available_globally, runtime
dbl  (2): hours_viewed, views
date (1): release_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Questions

  • Is Netflix’s content fleeting or enduring? How do the post-release hype curves of new films differ across genres and languages?

  • How much does “global availability” contribute marginally to viewership? Are there specific types of programs that can generate extremely high returns even without global availability?

Analysis plan

Question 1: Post-release hype curves across genres and languages

Variables involved: - release_date - to calculate time since release - hours_viewed - engagement metric - report - reporting period (2025Jan-Jun) - title - contains content name and potential language indicators - runtime - to normalize viewing patterns

Variables to be created: - months_since_release - calculated as the difference between report period end date (June 2025) and release_date - content_language - extracted from title (e.g., titles with Korean characters, Spanish names, English titles) - genre_category - manually coded based on title keywords (e.g., “Action” for titles with “Attack”, “Drama” for titles like “Life”, “Animation” for titles like “Minions”, “Documentary” for non-fiction titles) - viewership_decay - percentage change in hours_viewed across different time periods since release

Analysis approach: 1. Filter for titles with release_date before 2025 that appear in the dataset 2. Calculate months_since_release for each title in the 2025Jan-Jun report period 3. Manually categorize a sample of popular titles by language (based on title text - Korean characters, Spanish words, English, etc.) 4. Create genre categories by identifying keywords in titles (Action, Romance, Animation, Documentary, etc.) 5. Create line plots showing hours_viewed trends over months_since_release 6. Compare decay patterns across language groups and genre categories 7. Identify “evergreen” content (sustained viewing) vs. “flash in the pan” content (rapid dropoff)

External data: - May manually code 50-100 top titles for genre based on IMDb/Wikipedia lookups - Language can be inferred from title text without external data

Note: Since we only have one reporting period (2025Jan-Jun), we’ll need to use release_date to create cohorts of titles released in different time windows and compare their current engagement levels.

Question 2: Global availability impact on viewership

Variables involved: - available_globally - binary variable (Yes/No) - hours_viewed - total engagement metric - views - number of views - runtime - content length - title - for content categorization

Variables to be created: - hours_per_view - calculated as hours_viewed / views (average viewing completion) - content_type - categorized from title (Movie series/franchises, Standalone films, Animated content, etc.) - viewership_efficiency - views normalized by runtime - global_vs_regional - binary categorization based on available_globally

Analysis approach: 1. Split movies dataset by available_globally (Yes vs. No) 2. Calculate summary statistics: - Mean, median, and maximum hours_viewed for each group - Distribution of views for globally vs. regionally available content 3. Create visualizations: - Box plots comparing hours_viewed distributions - Scatter plots of views vs. runtime colored by availability 4. Identify high-performing regional content (titles with available_globally = No but exceptionally high hours_viewed) 5. Categorize these outliers by content type (based on title analysis) 6. Calculate the marginal contribution of global availability: - Compare median hours_viewed between groups - Control for runtime and release_date in the comparison

External data: - No external data required - All analysis can be done with existing variables

Note: We have sufficient variation in available_globally to make meaningful comparisons without additional data sources.