Project title

Appendix to report

Appendix

Data Cleaning

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(scales)


Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor

library(dplyr)
library(stringr)
library(ggplot2)

payroll_2022 <- data.frame()

dept_no <- c(2, 4, 6, 8, 10, 11, 12, 13, 14, 15)

payroll_2022 <- dept_no |>
  map(~ str_glue("data/City_Employee_Payroll_Current_{.}.csv")) |>
  map_df(~ read_csv(.))

Rows: 38 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): LAST_NAME, FIRST_NAME, DEPARTMENT_TITLE, JOB_CLASS_PGRADE, JOB_TIT...
dbl (10): RECORD_NBR, PAY_YEAR, DEPARTMENT_NO, MOU, REGULAR_PAY, OVERTIME_PA...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 2787 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): LAST_NAME, FIRST_NAME, DEPARTMENT_TITLE, JOB_CLASS_PGRADE, JOB_TIT...
dbl (10): RECORD_NBR, PAY_YEAR, DEPARTMENT_NO, MOU, REGULAR_PAY, OVERTIME_PA...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 313 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): LAST_NAME, FIRST_NAME, DEPARTMENT_TITLE, JOB_CLASS_PGRADE, JOB_TIT...
dbl (10): RECORD_NBR, PAY_YEAR, DEPARTMENT_NO, MOU, REGULAR_PAY, OVERTIME_PA...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 892 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): LAST_NAME, FIRST_NAME, DEPARTMENT_TITLE, JOB_CLASS_PGRADE, JOB_TIT...
dbl (10): RECORD_NBR, PAY_YEAR, DEPARTMENT_NO, MOU, REGULAR_PAY, OVERTIME_PA...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 144 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): LAST_NAME, FIRST_NAME, DEPARTMENT_TITLE, JOB_CLASS_PGRADE, JOB_TIT...
dbl (10): RECORD_NBR, PAY_YEAR, DEPARTMENT_NO, MOU, REGULAR_PAY, OVERTIME_PA...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 3 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): LAST_NAME, FIRST_NAME, DEPARTMENT_TITLE, JOB_CLASS_PGRADE, JOB_TIT...
dbl (10): RECORD_NBR, PAY_YEAR, DEPARTMENT_NO, MOU, REGULAR_PAY, OVERTIME_PA...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 933 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): LAST_NAME, FIRST_NAME, DEPARTMENT_TITLE, JOB_CLASS_PGRADE, JOB_TIT...
dbl (10): RECORD_NBR, PAY_YEAR, DEPARTMENT_NO, MOU, REGULAR_PAY, OVERTIME_PA...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 36 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): LAST_NAME, FIRST_NAME, DEPARTMENT_TITLE, JOB_CLASS_PGRADE, JOB_TIT...
dbl (10): RECORD_NBR, PAY_YEAR, DEPARTMENT_NO, MOU, REGULAR_PAY, OVERTIME_PA...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 100 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): LAST_NAME, FIRST_NAME, DEPARTMENT_TITLE, JOB_CLASS_PGRADE, JOB_TIT...
dbl (10): RECORD_NBR, PAY_YEAR, DEPARTMENT_NO, MOU, REGULAR_PAY, OVERTIME_PA...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 24 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): LAST_NAME, FIRST_NAME, DEPARTMENT_TITLE, JOB_CLASS_PGRADE, JOB_TIT...
dbl (10): RECORD_NBR, PAY_YEAR, DEPARTMENT_NO, MOU, REGULAR_PAY, OVERTIME_PA...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

skimr::skim(payroll_2022)

Data summary
Name	payroll_2022
Number of rows	5270
Number of columns	20
_______________________
Column type frequency:
character	10
numeric	10
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
LAST_NAME	0	1.00	1	20	3219
FIRST_NAME	0	1.00	1	16	2123
DEPARTMENT_TITLE	0	1.00	5	42	10
JOB_CLASS_PGRADE	0	1.00	6	6	428
JOB_TITLE	0	1.00	6	51	422
EMPLOYMENT_TYPE	0	1.00	9	9	1
JOB_STATUS	0	1.00	6	6	1
MOU_TITLE	0	1.00	8	51	29
GENDER	0	1.00	4	7	3
ETHNICITY	93	0.98	5	17	10

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
RECORD_NBR	1	2.184743e+11	1.688984e+11	3030303437.00	3.636306e+09	3.139338e+11	3.630333e+11	3.939394e+11	▆▁▁▂▇
PAY_YEAR	1	2.022000e+03	0.000000e+00	2022.00	2.022000e+03	2.022000e+03	2.022000e+03	2.022000e+03	▁▁▇▁▁
DEPARTMENT_NO	1	6.670000e+00	3.350000e+00	2.00	4.000000e+00	4.000000e+00	8.000000e+00	1.500000e+01	▇▁▂▃▁
MOU	1	1.429000e+01	1.238000e+01	0.00	3.000000e+00	1.200000e+01	2.100000e+01	6.400000e+01	▇▅▃▁▁
REGULAR_PAY	1	9.398666e+04	5.390533e+04	-2784.60	5.699107e+04	8.329789e+04	1.192961e+05	4.191589e+05	▇▇▂▁▁
OVERTIME_PAY	1	7.731190e+03	1.606256e+04	0.00	0.000000e+00	6.508500e+02	7.970890e+03	1.788942e+05	▇▁▁▁▁
ALL_OTHER_PAY	1	5.028420e+03	5.212070e+03	-4203.26	1.573840e+03	4.466120e+03	6.368750e+03	7.459857e+04	▇▁▁▁▁
TOTAL_PAY	1	1.067463e+05	5.872182e+04	200.00	6.574242e+04	9.836682e+04	1.400994e+05	4.251589e+05	▇▇▂▁▁
CITY_RETIREMENT_CONTRIBUTIONS	1	2.808288e+04	1.611783e+04	0.00	1.690269e+04	2.487305e+04	3.613389e+04	1.243225e+05	▇▆▂▁▁
BENEFIT_PAY	1	1.283224e+04	6.662760e+03	0.00	8.156400e+03	1.342632e+04	1.863576e+04	2.104200e+04	▃▃▃▃▇

# Calculate the average total pay for each category
avg_department <- payroll_2022 |>
  group_by(DEPARTMENT_TITLE) |>
  summarize(avg_total_pay = mean(TOTAL_PAY))

avg_gender <- payroll_2022 |>
  group_by(GENDER) |>
  summarize(avg_total_pay = mean(TOTAL_PAY))

avg_ethnicity <- payroll_2022 |>
  group_by(ETHNICITY) |>
  summarize(avg_total_pay = mean(TOTAL_PAY))

write.csv(avg_department, "avg_department.csv", row.names = FALSE)
write.csv(avg_gender, "avg_gender.csv", row.names = FALSE)
write.csv(avg_ethnicity, "avg_ethnicity.csv", row.names = FALSE)

Data Source:

The raw data was obtained from the Los Angeles City Controller’s data portal, available at City Employee Payroll Current Data.
The data encompasses payroll information for city employees.

Data Selection and Filtering:

The data was filtered online to focus on active, full-time jobs for the year 2022. The specific filters applied were PAY_YEAR = 2022, EMPLOYMENT_TYPE = “FULL_TIME”, and JOB_STATUS = “ACTIVE”.
Due to the large size of the dataset, it was divided by department numbers into 10 different subsets, each uploaded separately on GitHub.

Data Integration and Cleaning in R:

The R programming language, along with various libraries such as tidyverse, scales, dplyr, stringr, and ggplot2, was used for data manipulation and cleaning.
Each department’s dataset was named similarly, differing only in their department numbers. This naming convention facilitated the combination of these datasets into one comprehensive dataset using a map function.

Data Processing and Analysis-Ready Dataset:

The combined dataset was reviewed using skimr::skim to ensure proper integration and to get an overview of the data structure.
The dataset was then processed to calculate the average total pay for each category (department, gender, and ethnicity). This was done using the group_by and summarize functions in R, resulting in three separate datasets: avg_department, avg_gender, and avg_ethnicity.
Each of these datasets represents the average total pay categorized by department title, gender, and ethnicity, respectively.

Final Output:

The final, analysis-ready datasets (avg_department, avg_gender, avg_ethnicity) were written as .csv files and are included as part of the project submission.
These datasets provide a cleaned and structured view of the payroll information, making them suitable for further analysis and visualization in the project.