Project Elegant Evee

Appendix to report

Data cleaning

# Load packages
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(readr)
library(tidyverse)
── Attaching packages
───────────────────────────────────────
tidyverse 1.3.2 ──
✔ ggplot2 3.4.2     ✔ purrr   1.0.0
✔ tibble  3.2.1     ✔ stringr 1.5.0
✔ tidyr   1.2.1     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
✔ broom        1.0.2     ✔ rsample      1.1.1
✔ dials        1.1.0     ✔ tune         1.1.1
✔ infer        1.0.4     ✔ workflows    1.1.2
✔ modeldata    1.0.1     ✔ workflowsets 1.0.0
✔ parsnip      1.0.3     ✔ yardstick    1.1.0
✔ recipes      1.0.6     
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Search for functions across packages at https://www.tidymodels.org/find/
library(scales)
library(patchwork)

# Load data
movies <- read_csv("data/movies.csv")
Rows: 2000 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (11): Title, Month, Certificate, Runtime, Directors, Stars, Genre, Filmi...
dbl  (2): Rating, Year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Make all column names lowercase
names(movies) <- tolower(names(movies))

# Replace "Unknown" strings with NA for data analysis
movies <- movies |>
  mutate(
    income = na_if(income, "Unknown"),
    budget = na_if(budget, "Unknown")
  )

# Remove all films with NA values for income or budget
movies <- movies |>
  mutate(
    income = parse_number(income),
    budget = parse_number(budget)
    ) |>
  filter(is.na(income) == FALSE & is.na(budget) == FALSE)

# Split genre, directors, stars, country_of_origin
movies <- movies |>
  mutate(
    genre = str_split(string = genre, pattern = ","),
    directors = str_split(string = directors, pattern = ","),
    stars = str_split(string = stars, pattern = ","),
    country_of_origin = str_split(string = country_of_origin, pattern = ","))

# Create cleaned CSV file
write_csv(movies, "data/movies_clean.csv")

# We found na_if() through built-in R documentation and names() and tolower() through
# https://statisticsglobe.com/change-letter-case-column-names-r

We collected all of our data by downloading and reading an iMDB CSV file from Kaggle (https://www.kaggle.com/datasets/georgescutelnicu/top-100-popular-movies-from-2003-to-2022-imdb). We used the read_csv() function to store the data frame in the “movies” variable. The data was originally collected by the data set authors by scraping information from iMDB.com and was partially clened for our analysis. However, we completed significantly more cleaning on our own, as displayed above.

Although most of our data are currently stored in an appropriate format—no numerical variables are stored with type “character” in our downloaded data set—some key problems were resolved in preparation for detailed data analysis.

First, not all column names were lowercase, and the inconsistency of variable cases is poor data science practice. Using the names() and to_lower() functions, we made all columns names in our data frame lowercase.

Next, we found that the CSV file filled in missing values with the string “Unknown” instead of using NA values, which would have made data analysis difficult. We consequently used na_if() to change these “Unknown” values to NA for easier use of numerical R functions.

However, given that much of our analysis centers around analyzing/fitting the relationship between categorical variables versus budget and income (some of our only numerical variables), we thought it would be salient to strip the data set of movies with incomplete budget and income values, as these rows were useless for our needs. As such, we filtered with function filter() the data set to include only rows with complete budget and income data.

Lastly, we found that the genre, directors, stars, and country_of_origin columns sometimes have values with multiple items—a movie might have more than one genre, and the original data set listed all genres as a string separated by commas. We used the function str_split() to turn these character columns into lists comprised of multiple character vectors such that we can better analyze individual genres. The same applies to the other aforementioned columns.

These steps represent a comprehensive approach to data cleaning that enabled our results in the report.