Project title

Appendix to report

Appendix

Data Cleaning

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(scales)

Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor
library(dplyr)
library(stringr)
library(ggplot2)

payroll_2022 <- data.frame()

dept_no <- c(2, 4, 6, 8, 10, 11, 12, 13, 14, 15)

payroll_2022 <- dept_no |>
  map(~ str_glue("data/City_Employee_Payroll_Current_{.}.csv")) |>
  map_df(~ read_csv(.))
Rows: 38 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): LAST_NAME, FIRST_NAME, DEPARTMENT_TITLE, JOB_CLASS_PGRADE, JOB_TIT...
dbl (10): RECORD_NBR, PAY_YEAR, DEPARTMENT_NO, MOU, REGULAR_PAY, OVERTIME_PA...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 2787 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): LAST_NAME, FIRST_NAME, DEPARTMENT_TITLE, JOB_CLASS_PGRADE, JOB_TIT...
dbl (10): RECORD_NBR, PAY_YEAR, DEPARTMENT_NO, MOU, REGULAR_PAY, OVERTIME_PA...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 313 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): LAST_NAME, FIRST_NAME, DEPARTMENT_TITLE, JOB_CLASS_PGRADE, JOB_TIT...
dbl (10): RECORD_NBR, PAY_YEAR, DEPARTMENT_NO, MOU, REGULAR_PAY, OVERTIME_PA...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 892 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): LAST_NAME, FIRST_NAME, DEPARTMENT_TITLE, JOB_CLASS_PGRADE, JOB_TIT...
dbl (10): RECORD_NBR, PAY_YEAR, DEPARTMENT_NO, MOU, REGULAR_PAY, OVERTIME_PA...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 144 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): LAST_NAME, FIRST_NAME, DEPARTMENT_TITLE, JOB_CLASS_PGRADE, JOB_TIT...
dbl (10): RECORD_NBR, PAY_YEAR, DEPARTMENT_NO, MOU, REGULAR_PAY, OVERTIME_PA...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 3 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): LAST_NAME, FIRST_NAME, DEPARTMENT_TITLE, JOB_CLASS_PGRADE, JOB_TIT...
dbl (10): RECORD_NBR, PAY_YEAR, DEPARTMENT_NO, MOU, REGULAR_PAY, OVERTIME_PA...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 933 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): LAST_NAME, FIRST_NAME, DEPARTMENT_TITLE, JOB_CLASS_PGRADE, JOB_TIT...
dbl (10): RECORD_NBR, PAY_YEAR, DEPARTMENT_NO, MOU, REGULAR_PAY, OVERTIME_PA...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 36 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): LAST_NAME, FIRST_NAME, DEPARTMENT_TITLE, JOB_CLASS_PGRADE, JOB_TIT...
dbl (10): RECORD_NBR, PAY_YEAR, DEPARTMENT_NO, MOU, REGULAR_PAY, OVERTIME_PA...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 100 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): LAST_NAME, FIRST_NAME, DEPARTMENT_TITLE, JOB_CLASS_PGRADE, JOB_TIT...
dbl (10): RECORD_NBR, PAY_YEAR, DEPARTMENT_NO, MOU, REGULAR_PAY, OVERTIME_PA...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 24 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): LAST_NAME, FIRST_NAME, DEPARTMENT_TITLE, JOB_CLASS_PGRADE, JOB_TIT...
dbl (10): RECORD_NBR, PAY_YEAR, DEPARTMENT_NO, MOU, REGULAR_PAY, OVERTIME_PA...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(payroll_2022)
Data summary
Name payroll_2022
Number of rows 5270
Number of columns 20
_______________________
Column type frequency:
character 10
numeric 10
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
LAST_NAME 0 1.00 1 20 0 3219 0
FIRST_NAME 0 1.00 1 16 0 2123 0
DEPARTMENT_TITLE 0 1.00 5 42 0 10 0
JOB_CLASS_PGRADE 0 1.00 6 6 0 428 0
JOB_TITLE 0 1.00 6 51 0 422 0
EMPLOYMENT_TYPE 0 1.00 9 9 0 1 0
JOB_STATUS 0 1.00 6 6 0 1 0
MOU_TITLE 0 1.00 8 51 0 29 0
GENDER 0 1.00 4 7 0 3 0
ETHNICITY 93 0.98 5 17 0 10 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
RECORD_NBR 0 1 2.184743e+11 1.688984e+11 3030303437.00 3.636306e+09 3.139338e+11 3.630333e+11 3.939394e+11 ▆▁▁▂▇
PAY_YEAR 0 1 2.022000e+03 0.000000e+00 2022.00 2.022000e+03 2.022000e+03 2.022000e+03 2.022000e+03 ▁▁▇▁▁
DEPARTMENT_NO 0 1 6.670000e+00 3.350000e+00 2.00 4.000000e+00 4.000000e+00 8.000000e+00 1.500000e+01 ▇▁▂▃▁
MOU 0 1 1.429000e+01 1.238000e+01 0.00 3.000000e+00 1.200000e+01 2.100000e+01 6.400000e+01 ▇▅▃▁▁
REGULAR_PAY 0 1 9.398666e+04 5.390533e+04 -2784.60 5.699107e+04 8.329789e+04 1.192961e+05 4.191589e+05 ▇▇▂▁▁
OVERTIME_PAY 0 1 7.731190e+03 1.606256e+04 0.00 0.000000e+00 6.508500e+02 7.970890e+03 1.788942e+05 ▇▁▁▁▁
ALL_OTHER_PAY 0 1 5.028420e+03 5.212070e+03 -4203.26 1.573840e+03 4.466120e+03 6.368750e+03 7.459857e+04 ▇▁▁▁▁
TOTAL_PAY 0 1 1.067463e+05 5.872182e+04 200.00 6.574242e+04 9.836682e+04 1.400994e+05 4.251589e+05 ▇▇▂▁▁
CITY_RETIREMENT_CONTRIBUTIONS 0 1 2.808288e+04 1.611783e+04 0.00 1.690269e+04 2.487305e+04 3.613389e+04 1.243225e+05 ▇▆▂▁▁
BENEFIT_PAY 0 1 1.283224e+04 6.662760e+03 0.00 8.156400e+03 1.342632e+04 1.863576e+04 2.104200e+04 ▃▃▃▃▇
# Calculate the average total pay for each category
avg_department <- payroll_2022 |>
  group_by(DEPARTMENT_TITLE) |>
  summarize(avg_total_pay = mean(TOTAL_PAY))

avg_gender <- payroll_2022 |>
  group_by(GENDER) |>
  summarize(avg_total_pay = mean(TOTAL_PAY))

avg_ethnicity <- payroll_2022 |>
  group_by(ETHNICITY) |>
  summarize(avg_total_pay = mean(TOTAL_PAY))

write.csv(avg_department, "avg_department.csv", row.names = FALSE)
write.csv(avg_gender, "avg_gender.csv", row.names = FALSE)
write.csv(avg_ethnicity, "avg_ethnicity.csv", row.names = FALSE)
  1. Data Source:
  • The raw data was obtained from the Los Angeles City Controller’s data portal, available at City Employee Payroll Current Data.

  • The data encompasses payroll information for city employees.

  1. Data Selection and Filtering:
  • The data was filtered online to focus on active, full-time jobs for the year 2022. The specific filters applied were PAY_YEAR = 2022, EMPLOYMENT_TYPE = “FULL_TIME”, and JOB_STATUS = “ACTIVE”.

  • Due to the large size of the dataset, it was divided by department numbers into 10 different subsets, each uploaded separately on GitHub.

  1. Data Integration and Cleaning in R:
  • The R programming language, along with various libraries such as tidyverse, scales, dplyr, stringr, and ggplot2, was used for data manipulation and cleaning.

  • Each department’s dataset was named similarly, differing only in their department numbers. This naming convention facilitated the combination of these datasets into one comprehensive dataset using a map function.

  1. Data Processing and Analysis-Ready Dataset:
  • The combined dataset was reviewed using skimr::skim to ensure proper integration and to get an overview of the data structure.

  • The dataset was then processed to calculate the average total pay for each category (department, gender, and ethnicity). This was done using the group_by and summarize functions in R, resulting in three separate datasets: avg_department, avg_gender, and avg_ethnicity.

  • Each of these datasets represents the average total pay categorized by department title, gender, and ethnicity, respectively.

  1. Final Output:
  • The final, analysis-ready datasets (avg_department, avg_gender, avg_ethnicity) were written as .csv files and are included as part of the project submission.

  • These datasets provide a cleaned and structured view of the payroll information, making them suitable for further analysis and visualization in the project.