Project title

Appendix to report

Data cleaning

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.2     ✔ purrr   1.0.0
✔ tibble  3.2.1     ✔ dplyr   1.1.2
✔ tidyr   1.2.1     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(dplyr)
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
✔ broom        1.0.2     ✔ rsample      1.1.1
✔ dials        1.1.0     ✔ tune         1.1.1
✔ infer        1.0.4     ✔ workflows    1.1.2
✔ modeldata    1.0.1     ✔ workflowsets 1.0.0
✔ parsnip      1.0.3     ✔ yardstick    1.1.0
✔ recipes      1.0.6     
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Learn how to get started at https://www.tidymodels.org/start/
movies_raw <- read.csv("data/movies.csv")
movies_cleaned <- movies_raw
colnames(movies_cleaned) <- gsub("\\.", "_", colnames(movies_raw))
movies_draft = select(movies_cleaned, -3, -4, -5, -9)
movies_cleaned <- movies_cleaned |>
  mutate(
    Age_Gap = abs(Actor_2_Age - Actor_1_Age)
  ) |>
  arrange(desc(Age_Gap), desc(Release_Year))
names(movies_cleaned) <- tolower(names(movies_cleaned))
movies_cleaned <- movies_cleaned |> 
  mutate(years_since_start = release_year - min(release_year))
movies_clean <- movies_cleaned
write.csv(movies_cleaned, "movies_clean.csv", row.names = FALSE)
write.csv(movies_cleaned, file = "data/movies_clean.csv", row.names = FALSE)

The data was quite clean and tidy when we received it, but it contained some attributes (columns) that we did not need for our planned data analysis. Thus, we removed a few unneeded columns, including the names of the many actors involved, since we only want to focus on the main couple and the age gap between them. We then created a new variable that contained the age gap of the main couple, and organized it in order of descending age gap. For movies that had the same age gap, we ordered them by descending release year. We also created the variable years_since_start, which depicts the number of years since 1935 that a movie was released, because 1935 is the earliest year represented in our dataset of movies. We can calculate time intervals and trends and build models based on the elapsed years, which can provide valuable insights into how the movie age gaps have evolved over time. By representing time as the number of years since the first year in the data, we normalize the dataset and create a standardized scale that can be useful for comparison purposes.

Other appendicies (as necessary)

# Analysis 2: Same Sex vs. Heterosexual relationships over time

# same-sex relationships in Hollywood movies over time
homo_movies_draft <- movies_clean |>
  filter(actor_1_gender == actor_2_gender)

ggplot(data = homo_movies_draft, breaks=30, mapping = aes(x = release_year, y = age_gap)) +
  geom_point(alpha = 0.5) +
  scale_color_viridis_d() +
  geom_smooth(method = "lm") + 
  labs(
    title = "Same-sex Relationship Gaps in Hollywood Movies Over the Years",
    x = "Movie Release Year",
    y = "Age Gap"
  )
`geom_smooth()` using formula = 'y ~ x'

homo_age_gap_year_reg <- lm(age_gap ~ release_year, data = homo_movies_draft)
tidy(homo_age_gap_year_reg)
# A tibble: 2 × 5
  term         estimate std.error statistic p.value
  <chr>           <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)  -620.      838.       -0.739   0.468
2 release_year    0.315     0.417     0.755   0.459
# heterosexual relationships in Hollywood movies over time
hetero_movies_draft <- movies_clean |>
  filter(actor_1_gender != actor_2_gender)

ggplot(data = hetero_movies_draft, breaks=30, mapping = aes(x = release_year, y = age_gap)) +
  geom_point(alpha = 0.5) +
  scale_color_viridis_d() +
  geom_smooth(method = "lm") + 
  labs(
    title = "Heterosexual Relationship Gaps in Hollywood Movies Over the Years",
    x = "Movie Release Year",
    y = "Age Gap"
  )
`geom_smooth()` using formula = 'y ~ x'

hetero_age_gap_year_reg <- lm(age_gap ~ release_year, data = hetero_movies_draft)
tidy(hetero_age_gap_year_reg)
# A tibble: 2 × 5
  term         estimate std.error statistic  p.value
  <chr>           <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   237.      29.4         8.07 1.82e-15
2 release_year   -0.113    0.0147     -7.71 2.68e-14

The first regression model above suggests that the age gap in movies with same-sex relationships have increased over the years. Although the relationship between the release year and age gap is weak because of a small sample size, there is still a loose suggestion that time affects the age gap in same-sex relationships because the linear regression line that is formed has a positive slope. The second regression model strongly suggests that there is a strong relationship between time and the age gap in heterosexual relationship gaps in Hollywood movies. It appears that as time passes, the age gap in heterosexual relationships in Hollywood movies decreases. This is evident in the fact that the points are closely located to one another in a dense cloud and the linear regression line has a negative slope.

older_man_movies <- movies_clean |>
  filter(actor_1_gender != actor_2_gender) |> 
  filter(actor_1_age > actor_2_age)

older_woman_movies <- movies_clean |>
  filter(actor_1_gender != actor_2_gender) |> 
  filter(actor_1_age < actor_2_age)

# Dataset to reflect number of movies where the man is older
older_man_movies_num <- older_man_movies |>
  group_by(release_year) |>
  count()

# Dataset to reflect number of movies where the woman is older
older_woman_movies_num <- older_woman_movies |>
  group_by(release_year) |>
  count()

# Dataset to reflect number of movies with heterosexual couples
hetero_movies_num <- hetero_movies_draft |>
  group_by(release_year) |>
  count()

# Dataset to reflect number of movies with homosexual couples
homo_movies_num <- homo_movies_draft |>
  group_by(release_year) |>
  count()
# Distribution of Actresses' Ages
movies_women <- movies_clean |> 
  filter(actor_2_gender == "woman") |>
  mutate(actor_age = actor_2_age)

ggplot(data = movies_women, mapping = aes(x = actor_age)) + 
  scale_x_continuous(limits = c(0, 80)) +
  scale_y_continuous(limits = c(0, 200), n.breaks = 5) +
  geom_histogram() + 
    labs(
    title = "Distribution of Actresses' Ages",
    x = "Age",
    y = "Number of Actresses"
  ) + 
  theme_minimal()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 2 rows containing missing values (`geom_bar()`).

tidy(movies_women$actor_age)
Warning: 'tidy.numeric' is deprecated.
See help("Deprecated")
# A tibble: 1,144 × 1
       x
   <int>
 1    75
 2    24
 3    20
 4    23
 5    17
 6    22
 7    30
 8    39
 9    19
10    23
# ℹ 1,134 more rows
sd(movies_women$actor_age)
[1] 8.116241
# Distribution of Actors' Ages
movies_men <- movies_clean |> 
  filter(actor_1_gender == "man") |>
  mutate(actor_age = actor_1_age)

ggplot(data = movies_men, mapping = aes(x = actor_age)) + 
  scale_x_continuous(limits = c(0, 80)) +
  scale_y_continuous(limits = c(0, 200), n.breaks = 5) +
  geom_histogram() + 
    labs(
    title = "Distribution of Actors' Ages",
    x = "Age",
    y = "Number of Actors"
  ) + 
  theme_minimal()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 1 rows containing non-finite values (`stat_bin()`).
Removed 2 rows containing missing values (`geom_bar()`).

tidy(movies_men$actor_age)
Warning: 'tidy.numeric' is deprecated.
See help("Deprecated")
# A tibble: 1,145 × 1
       x
   <int>
 1    23
 2    74
 3    69
 4    68
 5    81
 6    59
 7    62
 8    69
 9    77
10    57
# ℹ 1,135 more rows
sd(movies_men$actor_age)
[1] 10.79469
# Distribution of Ages of all Actors and Actresses in Dataset

movies_clean_all_ages <- bind_rows(movies_men, movies_women)

all_ages <- c(movies_clean$actor_1_age, movies_clean$actor_2_age)

ggplot(data = data.frame(x = all_ages), aes(x)) + 
  scale_x_continuous(limits = c(0, 80)) +
  scale_y_continuous(limits = c(0, 300)) +
  geom_histogram() + 
    labs(
    title = "Distribution of All Actors' and Actresses' Ages",
    x = "Age",
    y = "Number of Actors/Actresses"
  ) + 
  theme_minimal()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 1 rows containing non-finite values (`stat_bin()`).
Removed 2 rows containing missing values (`geom_bar()`).

tidy(movies_clean_all_ages$actor_age)
Warning: 'tidy.numeric' is deprecated.
See help("Deprecated")
# A tibble: 2,289 × 1
       x
   <int>
 1    23
 2    74
 3    69
 4    68
 5    81
 6    59
 7    62
 8    69
 9    77
10    57
# ℹ 2,279 more rows
sd(movies_clean_all_ages$actor_age)
[1] 10.50882
# Distribution of the romantic age gaps in our movie data

ggplot(data = movies_clean, mapping = aes(x = age_gap)) + 
  scale_x_continuous(limits = c(0, 55)) +
  scale_y_continuous(limits = c(0, 200), n.breaks = 5) +
  geom_histogram() + 
    labs(
    title = "Distribution of Romantic Age Gaps in Hollywood",
    x = "Age Gap",
    y = "Number of Movies"
  ) + 
  theme_minimal()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 2 rows containing missing values (`geom_bar()`).

tidy(movies_men$age_gap)
Warning: 'tidy.numeric' is deprecated.
See help("Deprecated")
# A tibble: 1,145 × 1
       x
   <int>
 1    52
 2    50
 3    49
 4    45
 5    43
 6    42
 7    40
 8    39
 9    38
10    38
# ℹ 1,135 more rows
# Summary of the release years in our movie data

tidy(movies_men$release_year)
Warning: 'tidy.numeric' is deprecated.
See help("Deprecated")
# A tibble: 1,145 × 1
       x
   <int>
 1  1971
 2  2006
 3  2002
 4  1998
 5  2010
 6  1992
 7  2009
 8  1999
 9  1999
10  1992
# ℹ 1,135 more rows
# Summary of movies by release year

summary(movies_clean)
  movie_name         release_year    director         age_difference 
 Length:1161        Min.   :1935   Length:1161        Min.   : 0.00  
 Class :character   1st Qu.:1997   Class :character   1st Qu.: 4.00  
 Mode  :character   Median :2004   Mode  :character   Median : 8.00  
                    Mean   :2001                      Mean   :10.47  
                    3rd Qu.:2012                      3rd Qu.:16.00  
                    Max.   :2022                      Max.   :52.00  
 actor_1_name       actor_1_gender     actor_1_birthdate   actor_1_age   
 Length:1161        Length:1161        Length:1161        Min.   :17.00  
 Class :character   Class :character   Class :character   1st Qu.:32.00  
 Mode  :character   Mode  :character   Mode  :character   Median :38.00  
                                                          Mean   :39.85  
                                                          3rd Qu.:47.00  
                                                          Max.   :81.00  
 actor_2_name       actor_2_gender     actor_2_birthdate   actor_2_age   
 Length:1161        Length:1161        Length:1161        Min.   :17.00  
 Class :character   Class :character   Class :character   1st Qu.:25.00  
 Mode  :character   Mode  :character   Mode  :character   Median :30.00  
                                                          Mean   :31.06  
                                                          3rd Qu.:35.00  
                                                          Max.   :75.00  
    age_gap      years_since_start
 Min.   : 0.00   Min.   : 0.00    
 1st Qu.: 4.00   1st Qu.:62.00    
 Median : 8.00   Median :69.00    
 Mean   :10.47   Mean   :65.59    
 3rd Qu.:16.00   3rd Qu.:77.00    
 Max.   :52.00   Max.   :87.00    
ggplot(data = movies_clean, mapping = aes(x = release_year)) +
  geom_histogram() + 
  scale_x_continuous(limits = c(1920, 2025), n.breaks = 5) +
    labs(
    title = "Romantic Hollywood Movies Released per Year",
    x = "Release Year",
    y = "Number of Movies"
  ) + 
  theme_minimal()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 2 rows containing missing values (`geom_bar()`).

# Summary of older-man movies by release year

summary(older_man_movies)
  movie_name         release_year    director         age_difference 
 Length:926         Min.   :1935   Length:926         Min.   : 1.00  
 Class :character   1st Qu.:1995   Class :character   1st Qu.: 5.00  
 Mode  :character   Median :2003   Mode  :character   Median :10.00  
                    Mean   :1999                      Mean   :11.76  
                    3rd Qu.:2011                      3rd Qu.:17.00  
                    Max.   :2022                      Max.   :50.00  
 actor_1_name       actor_1_gender     actor_1_birthdate   actor_1_age   
 Length:926         Length:926         Length:926         Min.   :19.00  
 Class :character   Class :character   Class :character   1st Qu.:35.00  
 Mode  :character   Mode  :character   Mode  :character   Median :40.00  
                                                          Mean   :41.82  
                                                          3rd Qu.:48.00  
                                                          Max.   :79.00  
 actor_2_name       actor_2_gender     actor_2_birthdate   actor_2_age   
 Length:926         Length:926         Length:926         Min.   :17.00  
 Class :character   Class :character   Class :character   1st Qu.:25.00  
 Mode  :character   Mode  :character   Mode  :character   Median :29.00  
                                                          Mean   :30.06  
                                                          3rd Qu.:34.00  
                                                          Max.   :68.00  
    age_gap      years_since_start
 Min.   : 1.00   Min.   : 0.00    
 1st Qu.: 5.00   1st Qu.:60.00    
 Median :10.00   Median :68.00    
 Mean   :11.76   Mean   :64.36    
 3rd Qu.:17.00   3rd Qu.:76.00    
 Max.   :50.00   Max.   :87.00    
ggplot(data = older_man_movies, mapping = aes(x = release_year)) +
  geom_histogram() + 
  scale_x_continuous(limits = c(1920, 2025), n.breaks = 5) +
    labs(
    title = "Older-Male Romantic Movies Released per Year",
    x = "Release Year",
    y = "Number of Movies"
  ) + 
  theme_minimal()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 2 rows containing missing values (`geom_bar()`).

# Summary of older-woman movies by release year

summary(older_woman_movies)
  movie_name         release_year    director         age_difference  
 Length:182         Min.   :1937   Length:182         Min.   : 1.000  
 Class :character   1st Qu.:2000   Class :character   1st Qu.: 2.000  
 Mode  :character   Median :2006   Mode  :character   Median : 4.000  
                    Mean   :2005                      Mean   : 5.286  
                    3rd Qu.:2013                      3rd Qu.: 6.000  
                    Max.   :2022                      Max.   :52.000  
 actor_1_name       actor_1_gender     actor_1_birthdate   actor_1_age   
 Length:182         Length:182         Length:182         Min.   :17.00  
 Class :character   Class :character   Class :character   1st Qu.:25.00  
 Mode  :character   Mode  :character   Mode  :character   Median :30.00  
                                                          Mean   :30.87  
                                                          3rd Qu.:34.75  
                                                          Max.   :63.00  
 actor_2_name       actor_2_gender     actor_2_birthdate   actor_2_age   
 Length:182         Length:182         Length:182         Min.   :19.00  
 Class :character   Class :character   Class :character   1st Qu.:29.00  
 Mode  :character   Mode  :character   Mode  :character   Median :35.50  
                                                          Mean   :36.15  
                                                          3rd Qu.:40.00  
                                                          Max.   :75.00  
    age_gap       years_since_start
 Min.   : 1.000   Min.   : 2.00    
 1st Qu.: 2.000   1st Qu.:65.00    
 Median : 4.000   Median :71.50    
 Mean   : 5.286   Mean   :69.55    
 3rd Qu.: 6.000   3rd Qu.:78.00    
 Max.   :52.000   Max.   :87.00    
ggplot(data = older_woman_movies, mapping = aes(x = release_year)) +
  geom_histogram() + 
  scale_x_continuous(limits = c(1920, 2025), n.breaks = 5) +
    labs(
    title = "Older-Female Romantic Movies Released per Year",
    x = "Release Year",
    y = "Number of Movies"
  ) + 
  theme_minimal()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 2 rows containing missing values (`geom_bar()`).