Where does it pay to attend college?

Exploratory data analysis

Research question(s)

How does the classification of a U.S. college affect the starting and mid-career earnings of graduates from that college? Does this vary across different regions of the U.S.?

Data collection and cleaning

library(ggplot2)
library(ggthemes)
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ tibble  3.2.1     ✔ dplyr   1.1.2
✔ tidyr   1.2.1     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
✔ purrr   1.0.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
✔ broom        1.0.2     ✔ rsample      1.1.1
✔ dials        1.1.0     ✔ tune         1.1.1
✔ infer        1.0.4     ✔ workflows    1.1.2
✔ modeldata    1.0.1     ✔ workflowsets 1.0.0
✔ parsnip      1.0.3     ✔ yardstick    1.1.0
✔ recipes      1.0.6     
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Learn how to get started at https://www.tidymodels.org/start/
college_women <- read.csv("data/college/percent-bachelors-degrees-women-usa.csv")
degrees <- read.csv("data/college/degrees-that-pay-back.csv")
salaries_type <- read.csv("data/college/salaries-by-college-type.csv")
salaries_region <- read.csv("data/college/salaries-by-region.csv")

college_women <- college_women |>
  pivot_longer(
    cols = "Agriculture":"Social.Sciences.and.History",
    names_to = "major",
    values_to = "pct_female"
  ) |>
  rename(year = Year)
  
salaries_type_summary <- salaries_type |>
  mutate(across(c(Starting.Median.Salary:Mid.Career.90th.Percentile.Salary),
                ~ as.numeric(na_if(gsub("[$,]|N/A", "", .), "")))) |>
  group_by(School.Type) |>
  summarize(starting_mean = mean(Starting.Median.Salary, na.rm = TRUE),
            midcareer_mean = mean(Mid.Career.Median.Salary, na.rm = TRUE),
            midcareer_10pct = mean(Mid.Career.10th.Percentile.Salary, na.rm = TRUE),
            midcareer_25pct = mean(Mid.Career.25th.Percentile.Salary, na.rm = TRUE),
            midcareer_75pct = mean(Mid.Career.75th.Percentile.Salary, na.rm = TRUE),
            midcareer_90pct = mean(Mid.Career.90th.Percentile.Salary, na.rm = TRUE)
            ) |>
  pivot_longer(
    cols = starting_mean:midcareer_90pct,
    names_to = "career_stage",
    values_to = "mean_salary"
  ) |>
  rename(school_type = School.Type)


salaries_region_summary <- salaries_region |>
  mutate(across(c(Starting.Median.Salary:Mid.Career.90th.Percentile.Salary),
                ~ as.numeric(na_if(gsub("[$,]|N/A", "", .), "")))) |>
  group_by(Region) |>
  summarize(starting_mean = mean(Starting.Median.Salary, na.rm = TRUE),
            midcareer_mean = mean(Mid.Career.Median.Salary, na.rm = TRUE),
            midcareer_10pct = mean(Mid.Career.10th.Percentile.Salary, na.rm = TRUE),
            midcareer_25pct = mean(Mid.Career.25th.Percentile.Salary, na.rm = TRUE),
            midcareer_75pct = mean(Mid.Career.75th.Percentile.Salary, na.rm = TRUE),
            midcareer_90pct = mean(Mid.Career.90th.Percentile.Salary, na.rm = TRUE)
            ) |>
  pivot_longer(
    cols = starting_mean:midcareer_90pct,
    names_to = "career_stage",
    values_to = "mean_salary"
  ) |>
  rename(school_region = Region)

salaries_joined <- full_join(salaries_type, salaries_region)
Joining with `by = join_by(School.Name, Starting.Median.Salary,
Mid.Career.Median.Salary, Mid.Career.10th.Percentile.Salary,
Mid.Career.25th.Percentile.Salary, Mid.Career.75th.Percentile.Salary,
Mid.Career.90th.Percentile.Salary)`

For salaries_type_summary and salaries_region_summary, we first converted the salary columns from text to numeric by stripping dollar signs and commas and treating N/A entries as missing. We then grouped by school type or region, respectively, and computed the mean of each salary column (the starting median, the mid-career median, and the mid-career 10th, 25th, 75th, and 90th percentiles), ignoring missing values. Finally, we pivoted each summary to long format, so each row holds the mean salary for one combination of school type (or region) and salary measure, with the measure stored in career_stage.
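The conversion step can be illustrated on a small vector. This is a minimal sketch of the same gsub/na_if/as.numeric chain used in the code above; the input strings are hypothetical and only assumed to resemble the raw CSV values.

```r
library(dplyr)

# Hypothetical salary strings, assumed to resemble the raw CSV columns
raw_salary <- c("$45,000.00", "N/A", "$102,500.00")

# Strip "$", "," and the literal "N/A", turn empty strings into NA,
# then convert what remains to numeric
clean_salary <- as.numeric(na_if(gsub("[$,]|N/A", "", raw_salary), ""))
clean_salary

# Missing values are skipped when averaging, as in the summaries above
mean_salary <- mean(clean_salary, na.rm = TRUE)
mean_salary
```

Converting N/A to NA before calling as.numeric avoids coercion warnings, and na.rm = TRUE keeps a single missing value from turning a whole group mean into NA.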

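Finally, we created a merged data set called salaries_joined by applying full_join to salaries_type and salaries_region. Because no join key was specified, the join matches on school name and all six shared salary columns. The result keeps every school from both data sets and fills in NA for School.Type or Region when a school appears in only one of them. A school whose salary figures differ between the two files would therefore appear as two separate rows.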
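To make the join behavior concrete, here is a small sketch with two made-up tables. The school names and salary strings are hypothetical; only the column names follow the real data.

```r
library(dplyr)

# Hypothetical miniature versions of the two tables
type_demo <- data.frame(
  School.Name = c("School A", "School B"),
  School.Type = c("Engineering", "State"),
  Starting.Median.Salary = c("$60,000.00", "$45,000.00")
)
region_demo <- data.frame(
  School.Name = c("School B", "School C"),
  Region = c("Southern", "Western"),
  Starting.Median.Salary = c("$45,000.00", "$50,000.00")
)

# With no `by` argument, full_join() matches on every shared column:
# here School.Name and Starting.Median.Salary
joined_demo <- full_join(type_demo, region_demo)
joined_demo
```

Only School B appears in both tables, so it is the only row with both School.Type and Region filled in; School A has NA for Region and School C has NA for School.Type.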

Data description

college_women dataset

  • Link: Kaggle - https://www.kaggle.com/datasets/sureshsrinivas/bachelorsdegreewomenusa

  • Each row in the original dataset represents a year, from 1970 to 2011, and each column represents a college major. For each year, every column shows the percentage of bachelor’s degrees in that major granted to women. After our cleaning step, each row represents one year and major combination.

  • Attributes:

    • year: year of observation

    • major: college major of observation

    • pct_female: percentage of bachelor’s degrees in that major and year granted to women

  • The dataset was created by the Department of Education Statistics, which annually releases a dataset containing the percentage of bachelor’s degrees granted to women across a variety of degree categories.

  • Funding came from the Department of Education.

  • We are unsure at this time which processes may have affected how the data were observed and recorded.

  • The data were cleaned by Kaggle users and further by our team.
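The reshaping from one column per major to the year, major, and pct_female attributes listed above can be sketched on a toy table. The percentages below are hypothetical, and only two majors are shown.

```r
library(tidyr)

# Hypothetical wide table: one row per year, one column per major
wide_demo <- data.frame(
  Year = c(1970, 1971),
  Agriculture = c(4.2, 5.5),
  Biology = c(29.1, 29.4)
)

# Stack the major columns into name/value pairs
long_demo <- pivot_longer(
  wide_demo,
  cols = Agriculture:Biology,
  names_to = "major",
  values_to = "pct_female"
)
long_demo
```

The long format gives one row per year and major, which is the shape ggplot2 expects when mapping major to color.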

degrees dataset

  • Link: Kaggle - https://www.kaggle.com/datasets/wsj/college-salaries

  • Each row in the dataset represents an undergraduate major and summary statistics / percentiles on its respective earnings

  • Attributes

    • Undergrad major

    • Starting Median Salary

    • Mid-Career Median Salary

    • Percent change from Starting to Mid-Career Salary

    • Mid-Career 10th Percentile Salary

    • Mid-Career 25th Percentile Salary

    • Mid-Career 75th Percentile Salary

    • Mid-Career 90th Percentile Salary

  • Dataset was created to track general earnings for early and mid career professionals based on their undergraduate major

  • Data was obtained by the Wall Street Journal based on data from Payscale, Inc.

  • We are unsure at this time which processes may have affected how the data were observed and recorded.

  • No preprocessing from any users, obtained directly from Kaggle
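As a worked example of the percent change attribute listed above, the sketch below assumes it is the ordinary relative change from the starting median to the mid-career median. That definition is our assumption, and the salaries are hypothetical rather than taken from the dataset.

```r
# Hypothetical salaries for a single major (not taken from the dataset)
starting_median <- 50000
midcareer_median <- 80000

# Assumed definition: relative change from starting to mid-career, in percent
pct_change <- (midcareer_median - starting_median) / starting_median * 100
pct_change
```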

salaries_type dataset

  • Link: Kaggle - https://www.kaggle.com/datasets/wsj/college-salaries

  • Each row represents a university, the classification of the university, and summary statistics / percentiles on its respective earnings

  • Attributes

    • School name

    • School type (classification, i.e., Party, Engineering, Liberal Arts, Ivy League, State)

    • Starting Median Salary

    • Mid-Career Median Salary

    • Mid-Career 10th Percentile Salary

    • Mid-Career 25th Percentile Salary

    • Mid-Career 75th Percentile Salary

    • Mid-Career 90th Percentile Salary

  • Dataset was created to track general earnings for early and mid career professionals based on their undergraduate college and the classification of that college

  • Data was obtained by the Wall Street Journal based on data from Payscale, Inc.

  • We are unsure at this time which processes may have affected how the data were observed and recorded.

  • No preprocessing from any users, obtained directly from Kaggle

salaries_region dataset

  • Link: Kaggle - https://www.kaggle.com/datasets/wsj/college-salaries

  • Each row represents a university, the geographical region of the university, and summary statistics / percentiles on its respective earnings

  • Attributes

    • School name

    • School region (geographical region within United States)

    • Starting Median Salary

    • Mid-Career Median Salary

    • Mid-Career 10th Percentile Salary

    • Mid-Career 25th Percentile Salary

    • Mid-Career 75th Percentile Salary

    • Mid-Career 90th Percentile Salary

  • Dataset was created to track general earnings for early and mid career professionals based on their undergraduate college and the geographical region of that college

  • Data was obtained by the Wall Street Journal based on data from Payscale, Inc.

  • We are unsure at this time which processes may have affected how the data were observed and recorded.

  • No preprocessing from any users, obtained directly from Kaggle

Data limitations

The schools in the salaries_type and salaries_region datasets do not match exactly, so the full join produces NA values in School.Type or Region for schools that appear in only one of the two.
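One way to see which schools cause these NA values is an anti-join in each direction. The sketch below uses hypothetical school names; with the real data, salaries_type and salaries_region would take the place of the two demo tables.

```r
library(dplyr)

# Hypothetical stand-ins for salaries_type and salaries_region
type_demo <- data.frame(School.Name = c("School A", "School B"))
region_demo <- data.frame(School.Name = c("School B", "School C"))

# Schools that have a type but no region, and vice versa
only_in_type <- anti_join(type_demo, region_demo, by = "School.Name")
only_in_region <- anti_join(region_demo, type_demo, by = "School.Name")

only_in_type
only_in_region
```

Counting the rows of each result with nrow() would show how many schools are affected on each side.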

Exploratory data analysis

Perform an (initial) exploratory data analysis.

ggplot(data = college_women, 
       mapping = aes(x = year, y = pct_female, col = major)) +
  geom_line()

ggplot(data = salaries_type_summary, 
       mapping = aes(y = mean_salary, x = school_type, col = career_stage)) +
  geom_point()

ggplot(data = salaries_region_summary, 
       mapping = aes(y = mean_salary, x = school_region, col = career_stage)) +
  geom_point()

Questions for reviewers

List specific questions for your peer reviewers and project mentor to answer in giving you feedback on this phase.

  • Is our research question refined and sophisticated enough to yield good-quality results?

  • Can you provide your input on our data quality?