Unveiling Workforce Dynamics: Exploring Pay Disparities in Los Angeles 2022

Proposal

library(tidyverse)
library(skimr)

Data 1

Problem or question

Proposal: Analyzing Payroll Disparities in Los Angeles City Departments

Topic: Investigating Payroll Disparities

Problem or Question:

Question/Objective: Can we identify and understand disparities in payroll distribution among Los Angeles City employees based on various factors such as department, gender, and ethnicity for the year 2023?

Importance: Ensuring equitable payroll distribution is essential for fair and efficient municipal governance. This project aims to shed light on potential disparities in payroll among City employees, contributing to transparency and accountability.

Variables: Categorical (Department Title, Gender, Ethnicity), Quantitative (Regular Pay, Overtime Pay, All Other Pay, Total Pay, City Retirement Contributions, Benefit Pay)
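As a rough sketch of how these variables might be compared, the snippet below computes head counts and median total pay by gender. The `payroll_toy` data frame is invented for illustration; only the column names `DEPARTMENT_TITLE`, `GENDER`, and `TOTAL_PAY` are taken from the payroll file, and the department names, gender labels, and pay values are assumptions.

```r
library(dplyr)

# Hypothetical toy data; only the column names mirror the payroll file.
payroll_toy <- data.frame(
  DEPARTMENT_TITLE = c("Dept A", "Dept A", "Dept B", "Dept B", "Dept B", "Dept A"),
  GENDER           = c("FEMALE", "MALE", "FEMALE", "MALE", "MALE", "FEMALE"),
  TOTAL_PAY        = c(50000, 60000, 40000, 80000, 70000, 55000)
)

# Head count and median total pay for each gender.
pay_by_gender <- payroll_toy %>%
  group_by(GENDER) %>%
  summarise(
    n          = n(),
    median_pay = median(TOTAL_PAY),
    .groups    = "drop"
  ) %>%
  arrange(GENDER)

print(pay_by_gender)
```

The same pattern should extend to `DEPARTMENT_TITLE` or `ETHNICITY` by changing the grouping column. The median is used here because the pay columns are right-skewed in the summary output.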

Major Deliverable: A comprehensive report highlighting payroll disparities, interactive data visualizations, and a web application for further exploration.

Introduction and data

Dataset: Los Angeles City Employee Payroll (Current)

Source of Data

The dataset is sourced from the Los Angeles City Controller’s Office and includes payroll information for all City employees. It is updated bi-weekly, with the exception of the Department of Water and Power, which is updated quarterly.

Link:
https://controllerdata.lacity.org/Payroll/City-Employee-Payroll-Current-/g9h8-fvhu

Data Collection

Payroll information includes employee details such as name, department, job class, employment type, job status, and payments. The data allows for a detailed analysis of employee compensation and contributions to benefits and retirement.

Ethical Concerns

Ethical concerns primarily revolve around data privacy and confidentiality. Care will be taken to ensure that no personally identifiable information is disclosed, and results will be aggregated to protect the identities of individual employees.
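One possible way to put this into practice is to drop the name columns before analysis and to withhold statistics for very small groups. The sketch below assumes a minimum group size of 3, which is an illustrative threshold and not a value set by this proposal; the toy data are invented.

```r
library(dplyr)

# Hypothetical toy data; names and pay values are invented.
payroll_toy <- data.frame(
  LAST_NAME        = c("A", "B", "C", "D", "E"),
  FIRST_NAME       = c("V", "W", "X", "Y", "Z"),
  DEPARTMENT_TITLE = c("Dept A", "Dept A", "Dept A", "Dept B", "Dept B"),
  TOTAL_PAY        = c(50000, 60000, 70000, 40000, 90000)
)

min_group_size <- 3  # assumed threshold, not taken from the proposal

dept_summary <- payroll_toy %>%
  select(-LAST_NAME, -FIRST_NAME) %>%   # drop direct identifiers
  group_by(DEPARTMENT_TITLE) %>%
  summarise(n = n(), mean_pay = mean(TOTAL_PAY), .groups = "drop") %>%
  # Withhold the statistic for groups too small to report safely.
  mutate(mean_pay = if_else(n < min_group_size, NA_real_, mean_pay))

print(dept_summary)
```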

Glimpse of data

# Read the payroll extract and summarize each column
data1 <- read.csv("data/City_Employee_Payroll__Current.csv")
skimr::skim(data1)
Data summary
Name data1
Number of rows 800
Number of columns 20
_______________________
Column type frequency:
character 10
numeric 10
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
LAST_NAME 0 1.0 2 18 0 601 0
FIRST_NAME 0 1.0 2 12 0 495 0
DEPARTMENT_TITLE 0 1.0 3 45 0 39 0
JOB_CLASS_PGRADE 0 1.0 6 6 0 289 0
JOB_TITLE 0 1.0 5 71 0 281 0
EMPLOYMENT_TYPE 0 1.0 9 9 0 3 0
JOB_STATUS 0 1.0 6 10 0 2 0
MOU_TITLE 1 1.0 8 51 0 32 0
GENDER 0 1.0 4 7 0 3 0
ETHNICITY 81 0.9 5 17 0 9 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
RECORD_NBR 0 1 2.707489e+11 1.758930e+11 3.238394e+09 3.939374e+09 3.933380e+11 3.939340e+11 3.939390e+11 ▃▁▁▁▇
PAY_YEAR 0 1 2.023000e+03 0.000000e+00 2.023000e+03 2.023000e+03 2.023000e+03 2.023000e+03 2.023000e+03 ▁▁▇▁▁
DEPARTMENT_NO 0 1 5.813000e+01 2.845000e+01 2.000000e+00 3.800000e+01 7.000000e+01 8.400000e+01 9.400000e+01 ▃▃▃▇▇
MOU 0 1 1.318000e+01 1.084000e+01 0.000000e+00 3.000000e+00 1.200000e+01 2.300000e+01 6.500000e+01 ▇▆▁▁▁
REGULAR_PAY 0 1 5.053002e+04 3.910564e+04 0.000000e+00 1.402420e+04 4.768006e+04 8.019732e+04 2.067357e+05 ▇▆▃▁▁
OVERTIME_PAY 0 1 6.767200e+03 1.606183e+04 0.000000e+00 0.000000e+00 0.000000e+00 5.196650e+03 1.559988e+05 ▇▁▁▁▁
ALL_OTHER_PAY 0 1 4.712890e+03 1.145373e+04 0.000000e+00 5.809200e+02 1.940120e+03 5.858870e+03 1.692352e+05 ▇▁▁▁▁
TOTAL_PAY 0 1 6.201011e+04 5.022272e+04 9.990000e+00 1.600773e+04 5.663668e+04 9.440255e+04 2.845892e+05 ▇▆▂▁▁
CITY_RETIREMENT_CONTRIBUTIONS 0 1 1.717714e+04 1.634174e+04 0.000000e+00 0.000000e+00 1.458144e+04 2.647150e+04 6.177546e+04 ▇▅▂▂▁
BENEFIT_PAY 0 1 7.689150e+03 6.239330e+03 0.000000e+00 7.922000e+02 6.800940e+03 1.438974e+04 1.691928e+04 ▇▃▃▂▇

Data 2

Problem or question

Proposal: Develop a predictive model for calculating the probability of winning in Texas Hold’em with the initial two cards.

Topic: Probability and Model for Texas Hold’em

Question/Objective:

The objective is to develop a predictive model that uses the suit and rank of poker cards to estimate the likelihood of winning a Texas Hold’em game based on the initial two cards.

Importance: The model helps players make decisions by estimating the winning probability of their first two cards. The project also has educational value, helping players understand Texas Hold’em from a statistical and strategic perspective.

Variables: Categorical (S1 through S5, the suits of cards #1 through #5; Poker Hand), Quantitative (C1 through C5, the ranks of cards #1 through #5)

Major Deliverable: A prediction of the probability of winning in Texas Hold’em with just the initial two cards, and strategic insights for players to improve their decision-making in Texas Hold’em games.
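As a first, very rough proxy for this deliverable, one could condition on the first two cards of each five-card hand and look at how often the completed hand reaches a strong class. The sketch below is not a win-probability model, since the dataset contains no opponents; it only shows the conditioning step. The short column names (`S1`, `C1`, `S2`, `C2`, `hand`) and the toy rows are assumptions made for the example.

```r
library(dplyr)

# Hypothetical toy hands; column names are shortened for the example.
hands_toy <- data.frame(
  S1   = c(1, 2, 3, 4, 1, 2),
  C1   = c(10, 5, 7, 12, 3, 9),
  S2   = c(2, 2, 1, 4, 3, 1),
  C2   = c(10, 8, 7, 1, 3, 4),
  hand = c(1, 0, 3, 0, 2, 0)   # poker hand class, 0 to 9
)

start_summary <- hands_toy %>%
  # Flag hands whose first two cards share a rank.
  mutate(start_pair = C1 == C2) %>%
  group_by(start_pair) %>%
  summarise(
    n_hands                  = n(),
    share_two_pair_or_better = mean(hand >= 2),
    .groups                  = "drop"
  )

print(start_summary)
```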

Introduction and data

Dataset: Poker Hand

Source of Data:

The dataset is sourced from the UCI Machine Learning Repository. It was originally distributed as a data file and has already been converted to CSV.

Link:

https://archive.ics.uci.edu/dataset/158/poker+hand

Data Collection:

Each record in the Poker Hand dataset describes a hand of five cards by suit and rank. Suit columns are named S# (suit of card #) and rank columns are named C# (rank of card #). The Poker Hand column identifies the class of each hand:

  • 0: Nothing in hand; not a recognized poker hand

  • 1: One pair; one pair of equal ranks within five cards

  • 2: Two pairs; two pairs of equal ranks within five cards

  • 3: Three of a kind; three equal ranks within five cards

  • 4: Straight; five cards, sequentially ranked with no gaps

  • 5: Flush; five cards with the same suit

  • 6: Full house; pair + different rank three of a kind

  • 7: Four of a kind; four equal ranks within five cards

  • 8: Straight flush; straight + flush

  • 9: Royal flush; {Ace, King, Queen, Jack, Ten} + flush
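For readability in tables and plots, the numeric class codes can be turned into a labeled factor. The sketch below uses a made-up vector `hand_codes` as a stand-in for the Poker Hand column; the labels follow the class codes 0 to 9 described above.

```r
# Labels in code order, 0 through 9.
hand_labels <- c(
  "Nothing in hand", "One pair", "Two pairs", "Three of a kind",
  "Straight", "Flush", "Full house", "Four of a kind",
  "Straight flush", "Royal flush"
)

# Hypothetical class codes standing in for the Poker Hand column.
hand_codes <- c(0, 1, 1, 0, 2, 9)

# Map each code to its label; unused classes stay as empty levels.
hand_factor <- factor(hand_codes, levels = 0:9, labels = hand_labels)

table(hand_factor)
```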

Ethical Concerns:

Poker is closely associated with gambling, which raises potential issues. The main ethical concern for this model is the harm caused by gambling addiction. To mitigate this, it is crucial to accompany our model with responsible-gambling guidelines.

Glimpse of data

# Read the poker hand file and summarize each column
data2 <- read.csv("data/Pokerhand.csv")
skimr::skim(data2)
Data summary
Name data2
Number of rows 600
Number of columns 11
_______________________
Column type frequency:
numeric 11
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
S1..Suit.of.card..1 0 1 2.51 1.14 1 1 2 4 4 ▇▇▁▇▇
C1..Rank.of.card..1 0 1 7.01 3.78 1 4 7 10 13 ▇▅▇▆▇
S2..Suit.of.card..2 0 1 2.42 1.10 1 1 2 3 4 ▇▇▁▆▆
C2..Rank.of.card..2 0 1 7.08 3.76 1 4 7 10 13 ▇▅▇▅▇
S3..Suit.of.card..3 0 1 2.62 1.11 1 2 3 4 4 ▆▇▁▇▇
C3..Rank.of.card..3 0 1 7.27 3.77 1 4 7 11 13 ▆▅▇▅▇
S4.Suit.of.card..4 0 1 2.53 1.11 1 2 3 4 4 ▇▇▁▇▇
C4..Rank.of.card..4 0 1 7.03 3.74 1 4 7 10 13 ▇▅▇▅▇
S5.Suit.of.card..5 0 1 2.53 1.09 1 2 3 3 4 ▇▇▁▇▇
C5..Rank.of.card..5 0 1 7.06 3.81 1 4 7 10 13 ▇▅▇▅▇
Poker.Hand 0 1 0.76 1.29 0 0 1 1 9 ▇▁▁▁▁

Data 3

Problem or question

Objective: Explore the evolution of global temperatures over the past century and discern potential factors influencing these changes, with a focus on data from the International Monetary Fund (IMF) Climate Change Data Portal.

Importance: This project seeks to illuminate the patterns and potential causative agents of climate change, offering a data-driven base for policy-making and public awareness campaigns.

Variables:

  • Quantitative: Year, Average Temperature, CO2 Emissions, etc.

  • Categorical: Country, Region

Major Deliverable: A Shiny web application that enables users to interactively visualize temperature changes and related factors over time and across different regions.

Introduction and data

Data Source: IMF Climate Change Data Portal (https://climatedata.imf.org/pages/climatechange-data#cc2).

Collection Method:

  • The IMF Climate Change Data Portal aggregates data from various global sources, providing a comprehensive dataset related to climate change indicators.

  • Data will be accessed either directly from the portal or via any available API, ensuring accurate and up-to-date information.

  • Multiple data files contain information essential to the project, so we will wrangle the data to extract the relevant variables.
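One wrangling step that is likely to be needed: the surface temperature file stores one column per year (`F1961` through `F2022`), so it has to be reshaped to a long layout before it can be filtered or joined by year. The sketch below shows that reshape on a made-up two-row data frame; the country codes and values are assumptions, and only the `ISO3` and `F<year>` naming pattern comes from the file.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy rows; only the ISO3 and F<year> column pattern mirrors the file.
temp_wide <- data.frame(
  ISO3  = c("AAA", "BBB"),
  F2021 = c(1.2, 0.8),
  F2022 = c(1.5, NA)
)

# One row per country and year, with the year stored as an integer.
temp_long <- temp_wide %>%
  pivot_longer(
    cols         = matches("^F[0-9]{4}$"),
    names_to     = "year",
    names_prefix = "F",
    values_to    = "temp_change"
  ) %>%
  mutate(year = as.integer(year))

# Keep a single year and drop countries with no reading for it.
temp_2022 <- temp_long %>%
  filter(year == 2022, !is.na(temp_change))

print(temp_2022)
```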

Description: The dataset includes observations of yearly average temperatures, CO2 emissions, and other relevant variables, categorized by year, country, and region.

Ethical Concerns: Ensuring ethical use of the data, acknowledging any limitations or biases in the data, and ensuring that interpretations and communications are accurate and responsible.

Detailed Approach

  1. Data Acquisition and Cleaning:

    1. Utilize R to access, clean, and preprocess data from the IMF Climate Change Data Portal.

    2. Ensure data consistency, handle missing values, and validate the accuracy where possible.

    3. Pull from multiple datasets and join them based on year.

    4. We are only focusing on the year 2022 for all relevant datasets.

  2. Exploratory Data Analysis (EDA):

    1. Conduct thorough EDA to understand the distributions, trends, and relationships within the data.

    2. Utilize ggplot2 to visualize trends in global temperatures and potential influencing factors over time and across different regions.

  3. Model Development:

    1. Develop statistical models (e.g., regression models) to analyze the relationships between global temperatures and potential influencing factors.

    2. Validate models using appropriate metrics and diagnostic plots to ensure reliability and accuracy.

  4. Shiny Web Application Development:

    1. Develop an interactive Shiny web application that allows users to explore visualizations of the data and insights from the analysis.

    2. Ensure the application is user-friendly, accessible, and provides valuable and accurate insights.

  5. Interpretation and Communication:

    1. Clearly interpret the findings from the analysis and model, ensuring that insights are communicated in an accurate, clear, and impactful manner.

    2. Develop a comprehensive report or presentation that summarizes the findings, methodology, and implications of the project.

  6. Feedback and Iteration:

    1. Seek feedback from peers, instructors, and potential users to enhance the quality and impact of the project.

    2. Iterate on the analysis, model, and application based on feedback and any additional insights.
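The join-on-year step and the regression step listed above might look roughly like the sketch below. The two yearly data frames and every value in them are invented for illustration, and the column names `temp_change` and `co2_ppm` are hypothetical. A simple linear model is only one of several reasonable choices.

```r
library(dplyr)

# Hypothetical yearly values, invented for illustration.
temp_by_year <- data.frame(
  year        = 2018:2022,
  temp_change = c(1.0, 1.2, 1.3, 1.5, 1.6)
)
co2_by_year <- data.frame(
  year    = 2018:2022,
  co2_ppm = c(408, 411, 414, 416, 418)
)

# Join the two sources on year.
climate <- inner_join(temp_by_year, co2_by_year, by = "year")

# Simple linear regression of temperature change on CO2 concentration.
fit <- lm(temp_change ~ co2_ppm, data = climate)

# Basic diagnostics: coefficients and share of variance explained.
print(coef(fit))
print(summary(fit)$r.squared)
```

With real data, residual plots (for example `plot(fit)`) would be a natural next check before interpreting the slope.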

Glimpse of data

  • The dataset will be accessed, cleaned, and stored in a structured format (e.g., CSV) in the project repository, ensuring reproducibility and accessibility for all team members.

  • Preliminary analysis will be conducted using functions like skimr::skim() to understand the dataset's structure, variables, and initial insights.

# Read the sea level change file and summarize each column
data3.1 <- read.csv("data/Change_in_Mean_Sea_Levels.csv")
skimr::skim(data3.1)
Data summary
Name data3.1
Number of rows 802
Number of columns 13
_______________________
Column type frequency:
character 10
logical 1
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Country 0 1 5 5 0 1 0
ISO3 0 1 3 3 0 1 0
Indicator 0 1 42 44 0 2 0
Unit 0 1 11 11 0 1 0
Source 0 1 216 216 0 1 0
CTS_Code 0 1 4 4 0 1 0
CTS_Name 0 1 24 24 0 1 0
CTS_Full_Descriptor 0 1 73 73 0 1 0
Measure 0 1 4 14 0 25 0
Date 0 1 11 11 0 86 0

Variable type: logical

skim_variable n_missing complete_rate mean count
ISO2 802 0 NaN :

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ObjectId 0 1 401.50 231.66 1.00 201.25 401.50 601.75 802.00 ▇▇▇▇▇
Value 0 1 80.92 61.25 -182.53 52.85 73.43 110.07 467.87 ▁▇▆▁▁

# Read the atmospheric CO2 concentration file and summarize each column
data3.2 <- read.csv("data/Atmospheric_CO2_Concentrations.csv")
skimr::skim(data3.2)
Data summary
Name data3.2
Number of rows 24
Number of columns 12
_______________________
Column type frequency:
character 9
logical 1
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Country 0 1 5 5 0 1 0
ISO3 0 1 3 3 0 1 0
Indicator 0 1 49 81 0 2 0
Unit 0 1 7 17 0 2 0
Source 0 1 324 324 0 1 0
CTS_Code 0 1 4 4 0 1 0
CTS_Name 0 1 41 41 0 1 0
CTS_Full_Descriptor 0 1 90 90 0 1 0
Date 0 1 7 7 0 12 0

Variable type: logical

skim_variable n_missing complete_rate mean count
ISO2 24 0 NaN :

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ObjectId 0 1 12.50 7.07 1.00 6.75 12.50 18.25 24.00 ▇▇▆▇▇
Value 0 1 209.54 213.53 0.28 0.53 208.22 418.83 420.99 ▇▁▁▁▇

# Read the annual surface temperature change file and summarize each column
data3.3 <- read.csv("data/Annual_Surface_Temperature_Change.csv")
skimr::skim(data3.3)
Data summary
Name data3.3
Number of rows 225
Number of columns 72
_______________________
Column type frequency:
character 9
numeric 63
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Country 0 1 4 35 0 225 0
ISO2 1 1 0 2 1 224 0
ISO3 0 1 3 3 0 225 0
Indicator 0 1 96 96 0 1 0
Unit 0 1 14 14 0 1 0
Source 0 1 243 243 0 1 0
CTS_Code 0 1 4 4 0 1 0
CTS_Name 0 1 26 26 0 1 0
CTS_Full_Descriptor 0 1 75 75 0 1 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ObjectId 0 1.00 113.00 65.10 1.00 57.00 113.00 169.00 225.00 ▇▇▇▇▇
F1961 37 0.84 0.16 0.41 -0.69 -0.10 0.06 0.32 1.89 ▂▇▂▁▁
F1962 36 0.84 -0.01 0.34 -0.91 -0.16 -0.06 0.11 1.00 ▁▃▇▂▁
F1963 37 0.84 -0.01 0.39 -1.27 -0.21 0.00 0.23 1.20 ▁▂▇▃▁
F1964 37 0.84 -0.07 0.31 -0.88 -0.24 -0.06 0.13 1.10 ▂▆▇▁▁
F1965 37 0.84 -0.25 0.27 -1.06 -0.39 -0.23 -0.09 0.86 ▁▆▇▂▁
F1966 33 0.85 0.11 0.38 -1.80 -0.04 0.10 0.28 1.15 ▁▁▃▇▁
F1967 34 0.85 -0.11 0.34 -1.05 -0.26 -0.15 0.01 1.13 ▁▇▇▂▁
F1968 34 0.85 -0.20 0.27 -1.63 -0.34 -0.19 -0.07 0.48 ▁▁▂▇▂
F1969 35 0.84 0.16 0.31 -0.90 -0.01 0.20 0.35 0.94 ▁▂▆▇▁
F1970 36 0.84 0.09 0.35 -1.29 -0.05 0.13 0.30 0.98 ▁▁▅▇▁
F1971 34 0.85 -0.20 0.23 -0.87 -0.33 -0.20 -0.07 0.68 ▁▅▇▁▁
F1972 33 0.85 -0.08 0.38 -1.80 -0.19 -0.04 0.11 0.93 ▁▁▃▇▁
F1973 32 0.86 0.23 0.33 -0.99 0.06 0.26 0.46 1.15 ▁▂▇▇▁
F1974 33 0.85 -0.16 0.30 -0.98 -0.36 -0.19 -0.03 1.12 ▁▇▆▁▁
F1975 37 0.84 -0.02 0.42 -1.09 -0.28 -0.13 0.11 1.89 ▁▇▂▁▁
F1976 36 0.84 -0.25 0.32 -0.96 -0.44 -0.27 -0.07 0.73 ▂▇▇▂▁
F1977 40 0.82 0.17 0.25 -0.60 0.00 0.18 0.32 1.08 ▁▃▇▂▁
F1978 36 0.84 0.07 0.29 -0.87 -0.03 0.10 0.23 0.91 ▁▁▇▃▁
F1979 36 0.84 0.23 0.39 -1.24 0.10 0.27 0.44 1.29 ▁▁▇▇▁
F1980 34 0.85 0.25 0.34 -0.76 0.07 0.29 0.45 0.97 ▁▂▅▇▂
F1981 34 0.85 0.18 0.32 -0.91 0.04 0.18 0.38 1.56 ▁▃▇▁▁
F1982 33 0.85 0.18 0.32 -0.68 0.00 0.18 0.39 1.14 ▂▂▇▃▁
F1983 35 0.84 0.34 0.54 -2.06 0.15 0.45 0.64 1.62 ▁▁▂▇▂
F1984 37 0.84 0.08 0.33 -1.46 -0.11 0.06 0.28 0.85 ▁▁▅▇▂
F1985 37 0.84 0.07 0.37 -1.19 -0.06 0.10 0.31 0.89 ▁▁▆▇▂
F1986 35 0.84 0.15 0.29 -0.76 0.01 0.17 0.33 0.84 ▁▁▇▇▁
F1987 35 0.84 0.41 0.48 -1.65 0.19 0.49 0.69 1.56 ▁▁▃▇▂
F1988 35 0.84 0.49 0.29 -0.50 0.33 0.48 0.67 1.34 ▁▂▇▅▁
F1989 35 0.84 0.26 0.49 -1.54 -0.05 0.14 0.41 2.18 ▁▃▇▂▁
F1990 36 0.84 0.56 0.47 -0.74 0.27 0.45 0.76 1.84 ▁▅▇▂▂
F1991 37 0.84 0.37 0.30 -0.70 0.19 0.39 0.54 1.14 ▁▂▇▇▂
F1992 17 0.92 0.24 0.57 -1.34 -0.01 0.30 0.53 1.60 ▁▂▇▆▁
F1993 16 0.93 0.22 0.40 -1.35 0.01 0.28 0.48 1.10 ▁▁▃▇▂
F1994 17 0.92 0.61 0.49 -0.42 0.30 0.49 0.83 1.96 ▁▇▃▂▁
F1995 15 0.93 0.63 0.44 -0.33 0.38 0.63 0.81 2.10 ▂▇▇▂▁
F1996 15 0.93 0.28 0.41 -0.79 0.02 0.31 0.52 1.60 ▂▅▇▂▁
F1997 18 0.92 0.54 0.48 -0.43 0.26 0.55 0.82 1.93 ▃▇▇▂▁
F1998 15 0.93 0.97 0.39 -0.61 0.78 1.00 1.19 2.47 ▁▂▇▂▁
F1999 16 0.93 0.74 0.45 -0.27 0.46 0.64 1.03 2.06 ▂▇▆▃▁
F2000 16 0.93 0.67 0.53 -0.72 0.30 0.54 1.00 2.07 ▁▇▇▅▂
F2001 17 0.92 0.85 0.47 -0.19 0.50 0.73 1.28 1.99 ▁▇▅▅▂
F2002 13 0.94 0.92 0.38 0.01 0.68 0.84 1.14 2.26 ▁▇▃▂▁
F2003 11 0.95 0.84 0.43 -0.25 0.59 0.84 1.05 2.33 ▂▆▇▂▁
F2004 12 0.95 0.78 0.37 -0.62 0.54 0.73 0.98 2.15 ▁▂▇▂▁
F2005 13 0.94 0.85 0.37 -0.39 0.67 0.84 1.07 2.20 ▁▃▇▂▁
F2006 10 0.96 0.88 0.42 -0.50 0.61 0.84 1.13 2.34 ▁▅▇▃▁
F2007 8 0.96 1.02 0.55 -0.22 0.68 0.92 1.22 2.73 ▁▇▃▂▁
F2008 13 0.94 0.81 0.49 -0.14 0.44 0.69 1.11 2.61 ▃▇▃▁▁
F2009 13 0.94 0.91 0.38 -0.32 0.68 0.89 1.19 1.77 ▁▂▇▅▃
F2010 10 0.96 1.10 0.60 -0.34 0.77 1.11 1.31 3.06 ▂▅▇▂▁
F2011 8 0.96 0.82 0.39 -0.48 0.56 0.76 1.09 1.70 ▁▂▇▆▃
F2012 10 0.96 0.90 0.44 -0.13 0.59 0.81 1.19 2.14 ▁▇▆▃▁
F2013 9 0.96 0.93 0.32 0.12 0.74 0.90 1.19 1.64 ▁▃▇▅▂
F2014 9 0.96 1.11 0.56 -0.09 0.74 0.99 1.34 2.70 ▁▇▅▂▂
F2015 9 0.96 1.27 0.46 -0.43 1.02 1.22 1.52 2.61 ▁▂▇▃▁
F2016 12 0.95 1.44 0.40 0.25 1.15 1.45 1.71 2.46 ▁▅▇▆▂
F2017 11 0.95 1.28 0.39 0.02 1.03 1.28 1.53 2.49 ▁▃▇▃▁
F2018 12 0.95 1.30 0.60 0.24 0.86 1.12 1.83 2.77 ▃▇▂▃▂
F2019 12 0.95 1.44 0.47 0.05 1.17 1.41 1.70 2.69 ▁▃▇▃▂
F2020 13 0.94 1.55 0.62 0.23 1.16 1.48 1.83 3.69 ▂▇▅▁▁
F2021 12 0.95 1.34 0.48 -0.42 1.02 1.33 1.63 2.68 ▁▂▇▆▁
F2022 12 0.95 1.38 0.67 -1.30 0.88 1.31 1.92 3.24 ▁▁▇▅▁