Unveiling Workforce Dynamics: Exploring Pay Disparities in Los Angeles 2022

Exploration

Objective(s)

We want to present an analysis of the workforce in the city of Los Angeles. Through an interactive web application we will delve into the diversity of the workforce and try to determine if there are disparities in payroll distribution based on different factors such as department, gender, and ethnicity for the year 2022.

Data collection and cleaning

library(tidyverse)
library(scales)
library(dplyr)
library(stringr)
library(ggplot2)

payroll_2022 <- data.frame()

dept_no <- c(2, 4, 6, 8, 10, 11, 12, 13, 14, 15)

payroll_2022 <- dept_no |>
  map(~ str_glue("data/City_Employee_Payroll_Current_{.}.csv")) |>
  map_df(~ read_csv(.))

skimr::skim(payroll_2022)
Data summary
Name payroll_2022
Number of rows 5270
Number of columns 20
_______________________
Column type frequency:
character 10
numeric 10
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
LAST_NAME 0 1.00 1 20 0 3219 0
FIRST_NAME 0 1.00 1 16 0 2123 0
DEPARTMENT_TITLE 0 1.00 5 42 0 10 0
JOB_CLASS_PGRADE 0 1.00 6 6 0 428 0
JOB_TITLE 0 1.00 6 51 0 422 0
EMPLOYMENT_TYPE 0 1.00 9 9 0 1 0
JOB_STATUS 0 1.00 6 6 0 1 0
MOU_TITLE 0 1.00 8 51 0 29 0
GENDER 0 1.00 4 7 0 3 0
ETHNICITY 93 0.98 5 17 0 10 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
RECORD_NBR 0 1 2.184743e+11 1.688984e+11 3030303437.00 3.636306e+09 3.139338e+11 3.630333e+11 3.939394e+11 ▆▁▁▂▇
PAY_YEAR 0 1 2.022000e+03 0.000000e+00 2022.00 2.022000e+03 2.022000e+03 2.022000e+03 2.022000e+03 ▁▁▇▁▁
DEPARTMENT_NO 0 1 6.670000e+00 3.350000e+00 2.00 4.000000e+00 4.000000e+00 8.000000e+00 1.500000e+01 ▇▁▂▃▁
MOU 0 1 1.429000e+01 1.238000e+01 0.00 3.000000e+00 1.200000e+01 2.100000e+01 6.400000e+01 ▇▅▃▁▁
REGULAR_PAY 0 1 9.398666e+04 5.390533e+04 -2784.60 5.699107e+04 8.329789e+04 1.192961e+05 4.191589e+05 ▇▇▂▁▁
OVERTIME_PAY 0 1 7.731190e+03 1.606256e+04 0.00 0.000000e+00 6.508500e+02 7.970890e+03 1.788942e+05 ▇▁▁▁▁
ALL_OTHER_PAY 0 1 5.028420e+03 5.212070e+03 -4203.26 1.573840e+03 4.466120e+03 6.368750e+03 7.459857e+04 ▇▁▁▁▁
TOTAL_PAY 0 1 1.067463e+05 5.872182e+04 200.00 6.574242e+04 9.836682e+04 1.400994e+05 4.251589e+05 ▇▇▂▁▁
CITY_RETIREMENT_CONTRIBUTIONS 0 1 2.808288e+04 1.611783e+04 0.00 1.690269e+04 2.487305e+04 3.613389e+04 1.243225e+05 ▇▆▂▁▁
BENEFIT_PAY 0 1 1.283224e+04 6.662760e+03 0.00 8.156400e+03 1.342632e+04 1.863576e+04 2.104200e+04 ▃▃▃▃▇
# Calculate the average total pay for each category
avg_department <- payroll_2022 |>
  group_by(DEPARTMENT_TITLE) |>
  summarize(avg_total_pay = mean(TOTAL_PAY))

avg_gender <- payroll_2022 |>
  group_by(GENDER) |>
  summarize(avg_total_pay = mean(TOTAL_PAY))

avg_ethnicity <- payroll_2022 |>
  group_by(ETHNICITY) |>
  summarize(avg_total_pay = mean(TOTAL_PAY))

The raw data is from

https://controllerdata.lacity.org/Payroll/City-Employee-Payroll-Current-/g9h8-fvhu 

We did the filter online. Since we are only looking at active full time jobs in 2022, we’ve filtered PAY_YEAR = 2022, EMPLOYMENT_TYPE = “FULL_TIME”, and JOB_STATUS = “ACTIVE” on the website. Because the dataset is too large to be uploaded on GitHub, we separated the dataset by each department number (into 10 different datasets) and uploaded them on GitHub. We then combine them into one dataset using a map function because all the dataset files are named similarly except their department numbers. All columns are already in snake_case and their appropriate data types. The dataset is then turned it into the analysis-ready dataset: avg_department, avg_gender, and avg_ethnicity as we calculated the average total pay for each category (total pay by department, gender, and ethnicity).

Data description

We have taken the data filtering out only full-time, active workers for each department logged in the dataset prior to cleaning. After attaining a dataset for each department we joined each dataset together so that we can have a centralized location for all of our data, easing the process of further analysis.

Data limitations

There are two limitations for this dataset on this phase. The first one is if the position is unionized or not, and the second one is the work time and overtime frequency. For the first one, unionized position will have different pay structures, benefits, and work rules compared to non-unionized employees. Without this information, it becomes challenging to perform comprehensive labor cost analysis and budgeting. Work time and overtime frequency will also affect the payroll. In general, the more hours worked, especially overtime hours, the higher the employee’s earnings will be. 

Exploratory data analysis

ggplot(avg_department, aes(x = avg_total_pay, y = DEPARTMENT_TITLE)) +
  geom_col(fill = "steelblue") +
  scale_x_continuous(labels = scales::dollar_format(scale = 1e-3, suffix = "K")) +
  labs(
    x = "Average Total Pay",
    y = "Department Title",
    title = "Average Total Pay by Department"
  ) +
  theme_minimal()

ggplot(avg_gender, aes(x = avg_total_pay, y = GENDER)) +
  geom_col(fill = "steelblue") +
  scale_x_continuous(labels = scales::dollar_format(scale = 1e-3, suffix = "K")) +
  labs(
    x = "Average Total Pay",
    y = "Gender",
    title = "Average Total Pay by Gender"
  ) +
  theme_minimal()

ggplot(avg_ethnicity, aes(x = avg_total_pay, y = ETHNICITY)) +
  geom_col(fill = "steelblue") +
  scale_x_continuous(labels = scales::dollar_format(scale = 1e-3, suffix = "K")) +
  labs(
    x = "Average Total Pay",
    y = "Ethnicity",
    title = "Average Total Pay by Ethnicity"
  ) +
  theme_minimal()

department_total_pay <- payroll_2022 |>
  select(DEPARTMENT_TITLE, TOTAL_PAY) |>
  mutate(DEPARTMENT_TITLE = reorder(DEPARTMENT_TITLE, TOTAL_PAY, FUN = median))

ggplot(department_total_pay, aes(x = TOTAL_PAY, y = DEPARTMENT_TITLE, fill = DEPARTMENT_TITLE)) +
      geom_boxplot(show.legend = FALSE) +
      labs(title = "Department vs. Average Pay",
           x = "Total Pay",
           y = "Department") +
      scale_x_continuous(labels = scales::dollar_format(scale = 1e-3, suffix = "K")) +
      scale_fill_viridis_d() +
      theme_minimal()

# Calculate the total average pay
total_avg_pay <- sum(avg_ethnicity$avg_total_pay)

# Calculate the percentage for each ethnicity
avg_ethnicity$percentage <- (avg_ethnicity$avg_total_pay / total_avg_pay) * 100

# Create the pie chart
ggplot(avg_ethnicity, aes(x = "", y = percentage, fill = ETHNICITY)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  scale_fill_brewer(palette = "Set3") +  # Optional: Adds a color palette
  labs(
    x = "",
    y = "",
    title = "Percentage of Average Total Pay by Ethnicity",
    fill = "Ethnicity"
  ) +
  theme_minimal() +
  theme(axis.line = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        panel.grid = element_blank()) +
  geom_text(aes(label = paste0(round(percentage, 1), "%")), position = position_stack(vjust = 0.5))

Department Analysis: In our initial exploratory data analysis (EDA), we examined the distribution of average pay across various city departments in Los Angeles for the year 2022. We found that the average total pay across all departments hovers around $100,000. The department “Public Accountability” stands out with the highest average pay at $235,587, while the “Aging” department has the lowest average pay at $79,588.

Gender Analysis: Our EDA also involved an analysis of gender-based disparities in pay. On average, males earn significantly more with an average total pay of approximately $114,808, compared to females whose average is around $94,807. However, it’s important to note that there is a category labeled “Unknown” with a substantially lower average pay of $27,171, which warrants further investigation.

Ethnicity Analysis: In our exploration of workforce diversity, we analyzed average pay by ethnicity. “Caucasian” employees had the highest average pay at $134,676, while “NA” (Not Available) recorded the lowest average pay at $30,448. “Unknown” ethnicity had an average pay of $126,984, indicating a need for further clarification.

Visualizations, such as bar charts, box plots, and pie charts, can effectively display these differences and help identify any outliers within each department (using box plots). This initial analysis sets the stage for a deeper examination of potential disparities in pay across different departments.

Questions for reviewers

  1. Based on our understanding, gender and ethnicity will not affect total pay, but job title and department will affect it. If we want to add the prediction of total pay for the next year, what model can we use based on data we have already collected?

  2. Are there still some elements that will affect total pay?