Where does it pay to attend college?

Appendix to report

Data cleaning

library(ggplot2)
library(ggthemes)
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ tibble  3.2.1     ✔ dplyr   1.1.2
✔ tidyr   1.2.1     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
✔ purrr   1.0.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

library(tidymodels)

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
✔ broom        1.0.2     ✔ rsample      1.1.1
✔ dials        1.1.0     ✔ tune         1.1.1
✔ infer        1.0.4     ✔ workflows    1.1.2
✔ modeldata    1.0.1     ✔ workflowsets 1.0.0
✔ parsnip      1.0.3     ✔ yardstick    1.1.0
✔ recipes      1.0.6     
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Use suppressPackageStartupMessages() to eliminate package startup messages

salaries_type <- read.csv("data/college/salaries-by-college-type.csv")
salaries_region <- read.csv("data/college/salaries-by-region.csv")

salaries_type_summary <- salaries_type |>
  mutate(across(c(Starting.Median.Salary:Mid.Career.90th.Percentile.Salary),
                ~ as.numeric(na_if(gsub("[$,]|N/A", "", .), "")))) |>
  group_by(School.Type) |>
  summarize(starting_mean = mean(Starting.Median.Salary, na.rm = TRUE),
            midcareer_mean = mean(Mid.Career.Median.Salary, na.rm = TRUE),
            midcareer_10pct = mean(Mid.Career.10th.Percentile.Salary, na.rm = TRUE),
            midcareer_25pct = mean(Mid.Career.25th.Percentile.Salary, na.rm = TRUE),
            midcareer_75pct = mean(Mid.Career.75th.Percentile.Salary, na.rm = TRUE),
            midcareer_90pct = mean(Mid.Career.90th.Percentile.Salary, na.rm = TRUE)
            ) |>
  pivot_longer(
    cols = starting_mean:midcareer_90pct,
    names_to = "career_stage",
    values_to = "mean_salary"
  ) |>
  rename(school_type = School.Type)

salaries_region_summary <- salaries_region |>
  mutate(across(c(Starting.Median.Salary:Mid.Career.90th.Percentile.Salary),
                ~ as.numeric(na_if(gsub("[$,]|N/A", "", .), "")))) |>
  group_by(Region) |>
  summarize(starting_mean = mean(Starting.Median.Salary, na.rm = TRUE),
            midcareer_mean = mean(Mid.Career.Median.Salary, na.rm = TRUE),
            midcareer_10pct = mean(Mid.Career.10th.Percentile.Salary, na.rm = TRUE),
            midcareer_25pct = mean(Mid.Career.25th.Percentile.Salary, na.rm = TRUE),
            midcareer_75pct = mean(Mid.Career.75th.Percentile.Salary, na.rm = TRUE),
            midcareer_90pct = mean(Mid.Career.90th.Percentile.Salary, na.rm = TRUE)
            ) |>
  pivot_longer(
    cols = starting_mean:midcareer_90pct,
    names_to = "career_stage",
    values_to = "mean_salary"
  ) |>
  rename(school_region = Region)

salaries_joined <- full_join(salaries_type, salaries_region)

Joining with `by = join_by(School.Name, Starting.Median.Salary,
Mid.Career.Median.Salary, Mid.Career.10th.Percentile.Salary,
Mid.Career.25th.Percentile.Salary, Mid.Career.75th.Percentile.Salary,
Mid.Career.90th.Percentile.Salary)`

For salaries_type_summary and salaries_region_summary, for all the non-N/A median salaries, we converted the values to numeric. Next, we grouped by type of school/region in the US respectively and created a summary row with a row for each type of school/region, with each column containing the mean of the median salaries at each career stage. For the school type summary, we pivoted the table, so each row represents the mean of the median salaries for all values with the same type of school and career stage. For the region summary, we did the same, except each row represents the mean of the median salaries for values with the schools in the same region and career stage.

Finally, we created a merged data set called salaries_joined containing a full_join of salaries_type and salaries_region, meaning that it has a row representing every school from both sets, providing N/As in columns where data isn’t complete.