Netflix Movies & Shows

Author

Proud Koala
Aniha Kuninti, Jasmine Pearson, Serene Pan

library(tidyverse)

Dataset

A brief description of your dataset including its provenance, dimensions, etc. as well as the reason why you chose this dataset.

Make sure to load the data and use inline code for some of this information.

movies <- readr::read_csv("data/movies.csv", show_col_types = FALSE)
shows <- readr::read_csv("data/shows.csv", show_col_types = FALSE)

Skim Dataset

library(skimr)

skim(movies)

Data summary
Name	movies
Number of rows	36121
Number of columns	8
_______________________
Column type frequency:
character	5
Date	1
numeric	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
source	0	1	57	57	4
report	0	1	11	11	4
title	6	1	1	144	13551
available_globally	9	1	2	19	3
runtime	12	1	5	10	212

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
release_date	29396	0.19	2013-12-12	2025-06-30	2021-08-13	1128

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
hours_viewed	12	1	2790858	9054284	1e+05	2e+05	4e+05	1900000	313000000	▇▁▁▁▁
views	12	1	1573256	4974895	1e+05	1e+05	3e+05	1100000	164700000	▇▁▁▁▁

skim(shows)

Data summary
Name	shows
Number of rows	27803
Number of columns	8
_______________________
Column type frequency:
character	5
Date	1
numeric	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
source	0	1.00	57	57	4
report	0	1.00	11	11	4
title	6	1.00	3	225	9913
available_globally	9	1.00	2	19	3
runtime	714	0.97	5	11	1519

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
release_date	14033	0.5	2010-04-01	2025-06-30	2021-04-14	1968

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
hours_viewed	12	1	9807243	26327450	1e+05	8e+05	2500000	8200000	840300000	▇▁▁▁▁
views	12	1	1414994	3665090	1e+05	2e+05	400000	1300000	144800000	▇▁▁▁▁

head(movies)

# A tibble: 6 × 8
  source       report title available_globally release_date hours_viewed runtime
  <chr>        <chr>  <chr> <chr>              <date>              <dbl> <chr>  
1 1_What_We_W… 2025J… Back… Yes                2025-01-17      313000000 1H 54M…
2 1_What_We_W… 2025J… STRAW Yes                2025-06-06      185200000 1H 48M…
3 1_What_We_W… 2025J… The … Yes                2025-03-28      198900000 2H 5M …
4 1_What_We_W… 2025J… Exte… Yes                2025-04-30      159000000 1H 49M…
5 1_What_We_W… 2025J… Havoc Yes                2025-04-25      154900000 1H 47M…
6 1_What_We_W… 2025J… The … No                 NA              106800000 1H 26M…
# ℹ 1 more variable: views <dbl>

head(shows)

# A tibble: 6 × 8
  source       report title available_globally release_date hours_viewed runtime
  <chr>        <chr>  <chr> <chr>              <date>              <dbl> <chr>  
1 1_What_We_W… 2025J… Adol… Yes                2025-03-13      555100000 3H 50M…
2 1_What_We_W… 2025J… Squi… Yes                2024-12-26      840300000 7H 10M…
3 1_What_We_W… 2025J… Squi… Yes                2025-06-27      438600000 6H 8M …
4 1_What_We_W… 2025J… Zero… Yes                2025-02-20      315800000 5H 9M …
5 1_What_We_W… 2025J… Miss… Yes                2025-01-01      218600000 3H 46M…
6 1_What_We_W… 2025J… Amer… Yes                2025-02-17      120600000 2H 9M …
# ℹ 1 more variable: views <dbl>

This project uses Netflix engagement data from the TidyTuesday Week 30 2025 dataset, by Jen Richmond. The data is derived from Netflix’s regularly released Engagement Reports and summarizes TV show and movie viewing activity from late 2023 through June 2025.

The dataset consists of two tables, one for movies and one for shows. The movies table has 36,121 rows and 8 columns, and the shows table has 27,803 rows and 8 columns. The data are recorded at the title-report level: each row represents one title in one reporting period and contains that title's viewing figures for the period. Netflix reports engagement in 6-month blocks, with one report covering the first half of a year and another covering the second half (e.g. Jan-June 2025, July-Dec 2025). Because the released data are already aggregated at this level and Netflix's underlying viewing data are proprietary, we cannot break the figures into finer units of time. For example, we cannot separate January 2025 viewing from February 2025 viewing for a particular title, since Jan-June is reported as a single block.
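The title-report grain described above can be checked rather than assumed. The following is a minimal sketch on a made-up toy table (the titles and report labels are illustrative, not taken from the real files). It counts rows per title and report pair and keeps any pair that appears more than once.

```r
library(dplyr)

# Toy stand-in for `movies` or `shows`; all values are made up for illustration.
grain_toy <- tibble(
  title  = c("Movie A", "Movie A", "Movie B", "Movie B"),
  report = c("2024Jul-Dec", "2025Jan-Jun", "2025Jan-Jun", "2025Jan-Jun")
)

# If every row is a unique title-report pair, this result has zero rows.
duplicate_pairs <- grain_toy |>
  count(title, report, name = "n_rows") |>
  filter(n_rows > 1)

duplicate_pairs
```

On the real tables, a non-empty result would mean some titles appear more than once within a report, which would need to be resolved before averaging.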

Both tables contain the same eight variables: the source file name (source), the reporting period (report), the title (title), whether the title is available globally (available_globally), the release date (release_date), total hours viewed (hours_viewed), total runtime (runtime), and views (views), which Netflix calculates as hours viewed divided by runtime.

We chose this dataset because, as college students who regularly use streaming platforms, Netflix is a part of our daily lives. As Netflix viewers ourselves, we are aware that our decisions and behavior on the platform are captured in the data we leave behind. Whenever viewers pick up the remote, they make many decisions (whether they are aware of it or not): which title to watch, whether to pick a movie or a TV show, how soon after a release to start watching, and whether and when to abandon a title or keep going. We are therefore curious whether Netflix's engagement data can reveal interesting or hidden patterns in viewer behavior at a larger scale than our own individual habits. Analyzing real engagement data also allows us to connect data science concepts to our own viewing habits.

Questions

The two questions you want to answer.

1. Are movies and shows released in certain 6-month windows associated with higher average engagement? Variables: release_date (grouped into 6-month windows), hours_viewed (the primary engagement metric), format

2. Do longer movies and shows have the same amount of viewer engagement as shorter ones? Variables: hours_viewed, views, runtime

Analysis plan

A plan for answering each of the questions including the variables involved, variables to be created (if any), external data to be merged in (if any).

Question 1 Analysis Plan

In this project, we define engagement primarily as total hours viewed, because it reflects the aggregate amount of attention that Netflix viewers give to a particular title. The “views” metric is hours viewed divided by runtime, so it approximates the number of complete viewings rather than measuring time directly. The “hours viewed” metric is therefore a better representation of the actual depth of viewers’ consumption on the platform.

We will create a column called “format” that records whether each title is a “movie” or a “show”, and then combine the two tables into one. We will keep the data in long format and retain multiple reporting periods per title rather than collapsing to one row per title, since hours viewed may vary over time.
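One way to build the format column is to stack the two tables and let dplyr record the source of each row. This is a sketch on made-up toy tibbles (movies_toy and shows_toy are hypothetical stand-ins with only three of the eight columns), not the final pipeline.

```r
library(dplyr)

# Toy stand-ins for the real `movies` and `shows` tibbles; values are made up.
movies_toy <- tibble(
  title        = c("Movie A", "Movie B"),
  report       = c("2025Jan-Jun", "2024Jul-Dec"),
  hours_viewed = c(1200000, 300000)
)
shows_toy <- tibble(
  title        = "Show A",
  report       = "2025Jan-Jun",
  hours_viewed = 5000000
)

# Stack the tables; the argument names ("movie", "show") become the values
# of the new `format` column. One row per title-report pair is preserved.
titles_toy <- bind_rows(movie = movies_toy, show = shows_toy, .id = "format")

titles_toy
```

Because both tables share the same columns, the stacked result stays in long format with no reshaping needed.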

Next, we will create a release window variable based on each title’s release date. We will group the releases into 6-month periods (e.g. Jan–June 2024, July–Dec 2024). This allows us to align release timing with Netflix’s engagement reporting periods.
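The release window could be derived from release_date roughly as follows. This is a sketch on made-up dates, and the label format ("2024 Jan-Jun") is our own assumption, not something defined by the dataset. Titles with a missing release date, which are common in both tables, keep a missing window.

```r
library(dplyr)
library(lubridate)

# Toy release dates, including one missing value; all made up.
releases_toy <- tibble(
  title        = c("A", "B", "C"),
  release_date = as.Date(c("2024-03-15", "2024-07-01", NA))
)

releases_toy <- releases_toy |>
  mutate(
    # Months 1-6 fall in the first half of the year, 7-12 in the second.
    release_half = if_else(month(release_date) <= 6, "Jan-Jun", "Jul-Dec"),
    # Guard against NA dates so paste() does not produce the string "NA NA".
    release_window = if_else(
      is.na(release_date),
      NA_character_,
      paste(year(release_date), release_half)
    )
  )

releases_toy
```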

We will then aggregate hours viewed at the release window and format level using group_by() and summarize() to calculate the average hours viewed for titles released in each window. This allows us to compare overall trends in hours viewed across release periods rather than focusing on individual titles.
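The aggregation step might look like the sketch below, run on a made-up toy table that is assumed to already have format and release_window columns. The median is included alongside the mean as our own addition, because the skim output shows hours_viewed is strongly right-skewed.

```r
library(dplyr)

# Toy combined table; values are made up for illustration.
titles_toy <- tibble(
  format         = c("movie", "movie", "show", "show", "show"),
  release_window = c("2024 Jan-Jun", "2024 Jan-Jun", "2024 Jan-Jun",
                     "2024 Jul-Dec", "2024 Jul-Dec"),
  hours_viewed   = c(100000, 300000, 800000, 600000, NA)
)

# One row per format and release window, ignoring missing hours_viewed.
window_summary <- titles_toy |>
  group_by(format, release_window) |>
  summarize(
    mean_hours_viewed   = mean(hours_viewed, na.rm = TRUE),
    median_hours_viewed = median(hours_viewed, na.rm = TRUE),
    n_rows              = n(),
    .groups = "drop"
  )

window_summary
```

Keeping n_rows in the summary makes it easy to spot windows whose average rests on only a few titles.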

Finally, we will create faceted visualizations that separate movies and shows to compare average hours viewed across release windows and reporting periods. These visualizations will help us see whether content released in certain time periods receives more hours viewed and whether these patterns differ between movies and television shows.
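A faceted plot of this kind could be sketched as follows. The summary values are made up, and the choice of dodged bars with report mapped to fill is only one possible design, assumed here for illustration.

```r
library(ggplot2)

# Toy summary table (one row per format, release window, and report); made up.
window_summary_toy <- data.frame(
  format            = rep(c("movie", "show"), each = 4),
  release_window    = rep(c("2024 Jan-Jun", "2024 Jul-Dec"), times = 4),
  report            = rep(rep(c("2024Jul-Dec", "2025Jan-Jun"), each = 2), times = 2),
  mean_hours_viewed = c(2.1e6, 2.6e6, 1.4e6, 1.9e6, 9.0e6, 8.4e6, 6.2e6, 7.1e6)
)

# One panel per format; bars within a release window compare reporting periods.
p <- ggplot(
  window_summary_toy,
  aes(x = release_window, y = mean_hours_viewed, fill = report)
) +
  geom_col(position = "dodge") +
  facet_wrap(~format, scales = "free_y") +
  labs(
    x    = "Release window",
    y    = "Average hours viewed",
    fill = "Reporting period"
  )

p
```

Free y-axes are used because average hours viewed for shows is several times larger than for movies, which would otherwise flatten the movie panel.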

Question 2 Analysis Plan

For Question 2, we will use the variables hours_viewed, runtime, and views, and plot them with geom_jitter() and geom_line() to see whether runtime is related to engagement. The jittered points will show where all of the data lie, and the line will show average engagement at each total runtime. We will convert runtime into hours so that it is on the same scale as hours_viewed when calculating proportions. Depending on the presence and number of outliers, we will filter the data frame by movie and show length to remove them. For movies, any runtime outside the range of 60 minutes (1 hr) to 180 minutes (3 hrs) is considered an outlier. For shows, any runtime outside the range of 60 minutes (1 hr) to 3,600 minutes (60 hrs) is considered an outlier.
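The runtime conversion and outlier filter could be sketched as below. The runtime strings are truncated in the printed output, so the exact format used here ("1H 54M 0S") is an assumption, and extract_unit() is a hypothetical helper written for this sketch.

```r
library(dplyr)
library(stringr)

# Toy rows; the runtime string format is assumed, and all values are made up.
runtime_toy <- tibble(
  format  = c("movie", "movie", "show", "show"),
  runtime = c("1H 54M 0S", "45M 0S", "7H 10M 0S", NA)
)

# Hypothetical helper: the number in front of a unit letter, or 0 when that
# unit is absent from a non-missing string. Missing runtimes stay NA.
extract_unit <- function(x, unit) {
  value <- as.numeric(str_match(x, paste0("(\\d+)", unit))[, 2])
  if_else(is.na(value) & !is.na(x), 0, value)
}

runtime_clean <- runtime_toy |>
  mutate(
    runtime_hours = extract_unit(runtime, "H") +
      extract_unit(runtime, "M") / 60 +
      extract_unit(runtime, "S") / 3600
  ) |>
  # Keep movies of 1 to 3 hours and shows of 1 to 60 hours; rows with a
  # missing runtime are dropped because the condition evaluates to NA.
  filter(
    (format == "movie" & between(runtime_hours, 1, 3)) |
      (format == "show" & between(runtime_hours, 1, 60))
  )

runtime_clean
```

In this toy example the 45-minute movie and the show with a missing runtime are removed, leaving two rows.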

X-axis: total length (hours)
Y-axis: proportion of watch time, calculated as hours_viewed / (views × runtime)

We will create two graphs, one for movies and one for shows, and compare them side by side using facets. In our analysis, we will make it clear that the two formats are not directly comparable: movies are typically watched until the end, whereas a viewer of a show may finish a couple of episodes and then never return to it, so the watch proportion is likely to skew higher for movies.
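The planned plot could be sketched as follows on made-up toy data. It assumes a runtime_hours column has already been created, and it draws the average line with stat_summary(geom = "line"), which is one way of producing a geom_line() of per-runtime means.

```r
library(dplyr)
library(ggplot2)

# Toy data; every value is made up for illustration.
engagement_toy <- tibble(
  format        = rep(c("movie", "show"), each = 4),
  runtime_hours = c(1.5, 1.5, 2.5, 2.5, 4, 4, 10, 10),
  hours_viewed  = c(300000, 450000, 500000, 250000,
                    800000, 1200000, 2000000, 1000000),
  views         = c(250000, 400000, 250000, 200000,
                    400000, 500000, 500000, 400000)
) |>
  # Proportion of watch time as defined in the plan.
  mutate(watch_proportion = hours_viewed / (views * runtime_hours))

p <- ggplot(engagement_toy, aes(x = runtime_hours, y = watch_proportion)) +
  # Jitter horizontally only, so the proportions themselves are not distorted.
  geom_jitter(width = 0.05, height = 0, alpha = 0.6) +
  # Line through the mean proportion at each runtime value.
  stat_summary(fun = mean, geom = "line") +
  facet_wrap(~format, scales = "free_x") +
  labs(x = "Total length (hours)", y = "Proportion of watch time")

p
```

One thing worth checking on the real data: the largest views value in the skim output for movies (164,700,000) is about what 313,000,000 hours divided by a 1.9-hour runtime gives, which suggests views may already be hours_viewed divided by runtime. If that holds for every row, this ratio would sit near 1 for all titles and a different engagement measure may be needed.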