Unveiling Workforce Dynamics: Exploring Pay Disparities in Los Angeles 2023

Proposal

library(tidyverse)
library(skimr)

Data 1

Problem or question

Proposal: Analyzing Payroll Disparities in Los Angeles City Departments

Topic: Investigating Payroll Disparities

Problem or Question:

Question/Objective: Can we identify and understand disparities in payroll distribution among Los Angeles City employees based on various factors such as department, gender, and ethnicity for the year 2023?

Importance: Ensuring equitable payroll distribution is essential for fair and efficient municipal governance. This project aims to shed light on potential disparities in payroll among City employees, contributing to transparency and accountability.

Variables: Categorical (Department Title, Gender, Ethnicity), Quantitative (Regular Pay, Overtime Pay, All Other Pay, Total Pay, City Retirement Contributions, Benefit Pay)

Major Deliverable: A comprehensive report highlighting payroll disparities, interactive data visualizations, and a web application for further exploration.

Introduction and data

Dataset: Los Angeles City Employee Payroll (Current)

Source of Data

The dataset is sourced from the Los Angeles City Controller’s Office and includes payroll information for all City employees. It is updated bi-weekly, with the exception of the Department of Water and Power, which is updated quarterly.

Link:
https://controllerdata.lacity.org/Payroll/City-Employee-Payroll-Current-/g9h8-fvhu

Data Collection

Payroll information includes employee details such as name, department, job class, employment type, job status, and payments. The data allows for a detailed analysis of employee compensation and contributions to benefits and retirement.

Ethical Concerns

Ethical concerns primarily revolve around data privacy and confidentiality. Care will be taken to ensure that no personally identifiable information is disclosed, and results will be aggregated to protect the identities of individual employees.
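A minimal sketch of the aggregation described above. The column names follow the skim output for data1, but the rows are invented toy values and `min_group_size` is a hypothetical disclosure threshold, not a value fixed by this proposal. Groups with fewer employees than the threshold are dropped so that no individual's pay can be read off the summary.

```r
library(dplyr)

# Toy stand-in for the payroll data; values are made up for illustration.
payroll <- data.frame(
  DEPARTMENT_TITLE = c("Police", "Police", "Police", "Library", "Library", "Airports"),
  GENDER           = c("FEMALE", "MALE", "MALE", "FEMALE", "FEMALE", "MALE"),
  TOTAL_PAY        = c(90000, 110000, 100000, 60000, 64000, 80000)
)

min_group_size <- 2  # hypothetical threshold; a real analysis would likely use a larger one

pay_by_group <- payroll %>%
  group_by(DEPARTMENT_TITLE, GENDER) %>%
  summarise(
    n_employees      = n(),
    median_total_pay = median(TOTAL_PAY),
    .groups          = "drop"
  ) %>%
  # Suppress small cells that could identify an individual employee
  filter(n_employees >= min_group_size)

print(pay_by_group)
```

Reporting medians rather than means is one possible choice here, since the skim output shows that the pay variables are strongly right-skewed.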

Glimpse of data

# Load the payroll extract and summarise every column
data1 <- read.csv("data/City_Employee_Payroll__Current.csv")
skimr::skim(data1)

Data summary
Name	data1
Number of rows	800
Number of columns	20
_______________________
Column type frequency:
character	10
numeric	10
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
LAST_NAME	0	1.0	2	18	601
FIRST_NAME	0	1.0	2	12	495
DEPARTMENT_TITLE	0	1.0	3	45	39
JOB_CLASS_PGRADE	0	1.0	6	6	289
JOB_TITLE	0	1.0	5	71	281
EMPLOYMENT_TYPE	0	1.0	9	9	3
JOB_STATUS	0	1.0	6	10	2
MOU_TITLE	1	1.0	8	51	32
GENDER	0	1.0	4	7	3
ETHNICITY	81	0.9	5	17	9

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
RECORD_NBR	1	2.707489e+11	1.758930e+11	3.238394e+09	3.939374e+09	3.933380e+11	3.939340e+11	3.939390e+11	▃▁▁▁▇
PAY_YEAR	1	2.023000e+03	0.000000e+00	2.023000e+03	2.023000e+03	2.023000e+03	2.023000e+03	2.023000e+03	▁▁▇▁▁
DEPARTMENT_NO	1	5.813000e+01	2.845000e+01	2.000000e+00	3.800000e+01	7.000000e+01	8.400000e+01	9.400000e+01	▃▃▃▇▇
MOU	1	1.318000e+01	1.084000e+01	0.000000e+00	3.000000e+00	1.200000e+01	2.300000e+01	6.500000e+01	▇▆▁▁▁
REGULAR_PAY	1	5.053002e+04	3.910564e+04	0.000000e+00	1.402420e+04	4.768006e+04	8.019732e+04	2.067357e+05	▇▆▃▁▁
OVERTIME_PAY	1	6.767200e+03	1.606183e+04	0.000000e+00	0.000000e+00	0.000000e+00	5.196650e+03	1.559988e+05	▇▁▁▁▁
ALL_OTHER_PAY	1	4.712890e+03	1.145373e+04	0.000000e+00	5.809200e+02	1.940120e+03	5.858870e+03	1.692352e+05	▇▁▁▁▁
TOTAL_PAY	1	6.201011e+04	5.022272e+04	9.990000e+00	1.600773e+04	5.663668e+04	9.440255e+04	2.845892e+05	▇▆▂▁▁
CITY_RETIREMENT_CONTRIBUTIONS	1	1.717714e+04	1.634174e+04	0.000000e+00	0.000000e+00	1.458144e+04	2.647150e+04	6.177546e+04	▇▅▂▂▁
BENEFIT_PAY	1	7.689150e+03	6.239330e+03	0.000000e+00	7.922000e+02	6.800940e+03	1.438974e+04	1.691928e+04	▇▃▃▂▇

Data 2

Problem or question

Proposal: Develop a predictive model for calculating the probability of winning in Texas Hold’em from the initial two cards.

Topic: Probability and Model for Texas Hold’em

Question/Objective: The objective is to develop a predictive model that uses the suit and rank of the cards to estimate the likelihood of winning a Texas Hold’em game based on the initial two cards.

Importance: Knowing the winning probability of the first two cards helps a player make decisions. The project also has educational value: it helps players understand Texas Hold’em from a statistical and strategic perspective.

Variables: Categorical (S1, Suit of card #1; S2, Suit of card #2; S3, Suit of card #3; S4, Suit of card #4; S5, Suit of card #5; Poker Hand), Quantitative (C1, Rank of card #1; C2, Rank of card #2; C3, Rank of card #3; C4, Rank of card #4; C5, Rank of card #5)

Major Deliverable: A prediction of the probability of winning in Texas Hold’em with just the initial two cards, and strategic insights for players to improve their decision-making in Texas Hold’em games.
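As a rough illustration of how the first two cards could feed such a model, the sketch below derives two starting-hand features (pocket pair, suited) and compares how often the five-card hand ends up as one pair or better. The short column names (`S1`, `C1`, `S2`, `C2`, `hand`) are hypothetical renames of the dataset's columns, the rows are invented, and the five-card hand class is only a proxy assumed here, not an actual win/loss outcome.

```r
# Toy rows; values are made up for illustration.
hands <- data.frame(
  S1   = c(1, 2, 3, 4),   # suit of card 1 (coded 1-4)
  C1   = c(10, 5, 13, 7), # rank of card 1 (coded 1-13)
  S2   = c(1, 3, 3, 1),   # suit of card 2
  C2   = c(11, 5, 2, 7),  # rank of card 2
  hand = c(0, 1, 0, 3)    # poker hand class (0 = nothing, ..., 9 = royal flush)
)

# Features that depend only on the first two cards
hands$is_pair   <- hands$C1 == hands$C2
hands$is_suited <- hands$S1 == hands$S2

# Share of hands classified as one pair or better, split by starting pair
made_hand_rate <- tapply(hands$hand >= 1, hands$is_pair, mean)
print(made_hand_rate)
```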

Introduction and data

Dataset: Poker Hand

Source of Data:

The dataset is sourced from the UCI Machine Learning Repository. It was originally distributed as a plain data file and has already been converted to CSV.

Link:

https://archive.ics.uci.edu/dataset/158/poker+hand

Data Collection:

Each record lists the five cards dealt in a hand, with the suit and rank of each card. Suit columns are named S# (suit of card #) and rank columns are named C# (rank of card #). The Poker Hand column identifies the hand with a numeric code:

0: Nothing in hand; not a recognized poker hand
1: One pair; one pair of equal ranks within five cards
2: Two pairs; two pairs of equal ranks within five cards
3: Three of a kind; three equal ranks within five cards
4: Straight; five cards, sequentially ranked with no gaps
5: Flush; five cards with the same suit
6: Full house; pair + different rank three of a kind
7: Four of a kind; four equal ranks within five cards
8: Straight flush; straight + flush
9: Royal flush; {Ace, King, Queen, Jack, Ten} + flush
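For plots and tables it may help to turn the numeric hand codes into readable labels. A minimal sketch using a factor follows; the vector `codes` is an invented example rather than data from the file.

```r
# Labels in code order 0-9, taken from the hand descriptions above
hand_labels <- c(
  "Nothing in hand", "One pair", "Two pairs", "Three of a kind", "Straight",
  "Flush", "Full house", "Four of a kind", "Straight flush", "Royal flush"
)

codes <- c(0, 1, 9, 4)  # illustrative values of the Poker Hand column

# levels = 0:9 keeps every hand class available, even if absent from the sample
hand_name <- factor(codes, levels = 0:9, labels = hand_labels)
print(hand_name)
```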

Ethical Concerns:

Poker is closely associated with gambling, which raises potential issues. The main ethical concern for this model is the harm that gambling addiction can cause. To mitigate this, it is crucial to accompany the model with guidelines on responsible gambling.

Glimpse of data

# Load the poker hand data and summarise every column
data2 <- read.csv("data/Pokerhand.csv")
skimr::skim(data2)

Data summary
Name	data2
Number of rows	600
Number of columns	11
_______________________
Column type frequency:
numeric	11
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
S1..Suit.of.card..1	1	2.51	1.14	1	1	2	4	4	▇▇▁▇▇
C1..Rank.of.card..1	1	7.01	3.78	1	4	7	10	13	▇▅▇▆▇
S2..Suit.of.card..2	1	2.42	1.10	1	1	2	3	4	▇▇▁▆▆
C2..Rank.of.card..2	1	7.08	3.76	1	4	7	10	13	▇▅▇▅▇
S3..Suit.of.card..3	1	2.62	1.11	1	2	3	4	4	▆▇▁▇▇
C3..Rank.of.card..3	1	7.27	3.77	1	4	7	11	13	▆▅▇▅▇
S4.Suit.of.card..4	1	2.53	1.11	1	2	3	4	4	▇▇▁▇▇
C4..Rank.of.card..4	1	7.03	3.74	1	4	7	10	13	▇▅▇▅▇
S5.Suit.of.card..5	1	2.53	1.09	1	2	3	3	4	▇▇▁▇▇
C5..Rank.of.card..5	1	7.06	3.81	1	4	7	10	13	▇▅▇▅▇
Poker.Hand	1	0.76	1.29	0	0	1	1	9	▇▁▁▁▁

Data 3

Problem or question

Objective: Explore the evolution of global temperatures over the past century and discern potential factors influencing these changes, with a focus on data from the International Monetary Fund (IMF) Climate Change Data Portal.

Importance: This project seeks to illuminate the patterns and potential causative agents of climate change, offering a data-driven base for policy-making and public awareness campaigns.

Variables:

Quantitative: Year, Average Temperature, CO2 Emissions, etc.
Categorical: Country, Region

Major Deliverable: A Shiny web application that enables users to interactively visualize temperature changes and related factors over time and across different regions.

Introduction and data

Data Source: IMF Climate Change Data Portal (https://climatedata.imf.org/pages/climatechange-data#cc2).

Collection Method:

The IMF Climate Change Data Portal aggregates data from various global sources, providing a comprehensive dataset related to climate change indicators.
Data will be accessed either directly from the portal or via any available API, ensuring accurate and up-to-date information.
Several data files contain information that is essential to the project, and we will wrangle them to extract the relevant variables.

Description: The dataset includes observations of yearly average temperatures, CO2 emissions, and other relevant variables, categorized by year, country, and region.

Ethical Concerns: Ensuring ethical use of the data, acknowledging any limitations or biases in the data, and ensuring that interpretations and communications are accurate and responsible.

Detailed Approach

Data Acquisition and Cleaning:
1. Utilize R to access, clean, and preprocess data from the IMF Climate Change Data Portal.
2. Ensure data consistency, handle missing values, and validate the accuracy where possible.
3. Pull from multiple datasets and join them based on year.
4. Focus only on the year 2022 for all relevant datasets.
Exploratory Data Analysis (EDA):
1. Conduct thorough EDA to understand the distributions, trends, and relationships within the data.
2. Utilize ggplot2 to visualize trends in global temperatures and potential influencing factors over time and across different regions.
Model Development:
1. Develop statistical models (e.g., regression models) to analyze the relationships between global temperatures and potential influencing factors.
2. Validate models using appropriate metrics and diagnostic plots to ensure reliability and accuracy.
Shiny Web Application Development:
1. Develop an interactive Shiny web application that allows users to explore visualizations of the data and insights from the analysis.
2. Ensure the application is user-friendly, accessible, and provides valuable and accurate insights.
Interpretation and Communication:
1. Clearly interpret the findings from the analysis and model, ensuring that insights are communicated in an accurate, clear, and impactful manner.
2. Develop a comprehensive report or presentation that summarizes the findings, methodology, and implications of the project.
Feedback and Iteration:
1. Seek feedback from peers, instructors, and potential users to enhance the quality and impact of the project.
2. Iterate on the analysis, model, and application based on feedback and any additional insights.
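The cleaning step for the surface temperature file can be sketched as follows. The skim output for data3.3 shows one column per year (F1961 to F2022), so a join on year first requires reshaping to long format. The toy data frame, the column name `temp_change`, and the country values are assumptions made for illustration.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the wide temperature table; values are made up.
temps_wide <- data.frame(
  Country = c("A", "B"),
  ISO3    = c("AAA", "BBB"),
  F2021   = c(1.2, 0.9),
  F2022   = c(1.4, NA)
)

# One row per country and year, with the "F" prefix stripped from the year
temps_long <- temps_wide %>%
  pivot_longer(
    cols            = matches("^F[0-9]{4}$"),
    names_to        = "year",
    names_prefix    = "F",
    names_transform = list(year = as.integer),
    values_to       = "temp_change"
  )

# Keep only 2022 and drop countries with no reported value
temps_2022 <- temps_long %>%
  filter(year == 2022, !is.na(temp_change))

print(temps_2022)
```

Once every file has an integer `year` column, the datasets could be combined with a join keyed on year (and on country where available).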

Glimpse of data

The dataset will be accessed, cleaned, and stored in a structured format (e.g., CSV) in the project repository, ensuring reproducibility and accessibility for all team members.
Preliminary analysis will be conducted using functions like skimr::skim() to understand the structure and variables of each dataset and to gain initial insights.

# Load the sea level data and summarise every column
data3.1 <- read.csv("data/Change_in_Mean_Sea_Levels.csv")
skimr::skim(data3.1)

Data summary
Name	data3.1
Number of rows	802
Number of columns	13
_______________________
Column type frequency:
character	10
logical	1
numeric	2
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
Country	1	5	5	1
ISO3	1	3	3	1
Indicator	1	42	44	2
Unit	1	11	11	1
Source	1	216	216	1
CTS_Code	1	4	4	1
CTS_Name	1	24	24	1
CTS_Full_Descriptor	1	73	73	1
Measure	1	4	14	25
Date	1	11	11	86

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
ISO2	802	0	NaN	:

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
ObjectId	0	1	401.50	231.66	1.00	201.25	401.50	601.75	802.00	▇▇▇▇▇
Value	0	1	80.92	61.25	-182.53	52.85	73.43	110.07	467.87	▁▇▆▁▁

data3.2 <- read.csv("data/Atmospheric_CO2_Concentrations.csv")
skimr::skim(data3.2)

Data summary
Name	data3.2
Number of rows	24
Number of columns	12
_______________________
Column type frequency:
character	9
logical	1
numeric	2
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
Country	1	5	5	1
ISO3	1	3	3	1
Indicator	1	49	81	2
Unit	1	7	17	2
Source	1	324	324	1
CTS_Code	1	4	4	1
CTS_Name	1	41	41	1
CTS_Full_Descriptor	1	90	90	1
Date	1	7	7	12

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
ISO2	24	0	NaN	:

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
ObjectId	0	1	12.50	7.07	1.00	6.75	12.50	18.25	24.00	▇▇▆▇▇
Value	0	1	209.54	213.53	0.28	0.53	208.22	418.83	420.99	▇▁▁▁▇

data3.3 <- read.csv("data/Annual_Surface_Temperature_Change.csv")
skimr::skim(data3.3)

Data summary
Name	data3.3
Number of rows	225
Number of columns	72
_______________________
Column type frequency:
character	9
numeric	63
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique
Country	0	1	4	35	0	225
ISO2	1	1	0	2	1	224
ISO3	0	1	3	3	0	225
Indicator	0	1	96	96	0	1
Unit	0	1	14	14	0	1
Source	0	1	243	243	0	1
CTS_Code	0	1	4	4	0	1
CTS_Name	0	1	26	26	0	1
CTS_Full_Descriptor	0	1	75	75	0	1

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
ObjectId	0	1.00	113.00	65.10	1.00	57.00	113.00	169.00	225.00	▇▇▇▇▇
F1961	37	0.84	0.16	0.41	-0.69	-0.10	0.06	0.32	1.89	▂▇▂▁▁
F1962	36	0.84	-0.01	0.34	-0.91	-0.16	-0.06	0.11	1.00	▁▃▇▂▁
F1963	37	0.84	-0.01	0.39	-1.27	-0.21	0.00	0.23	1.20	▁▂▇▃▁
F1964	37	0.84	-0.07	0.31	-0.88	-0.24	-0.06	0.13	1.10	▂▆▇▁▁
F1965	37	0.84	-0.25	0.27	-1.06	-0.39	-0.23	-0.09	0.86	▁▆▇▂▁
F1966	33	0.85	0.11	0.38	-1.80	-0.04	0.10	0.28	1.15	▁▁▃▇▁
F1967	34	0.85	-0.11	0.34	-1.05	-0.26	-0.15	0.01	1.13	▁▇▇▂▁
F1968	34	0.85	-0.20	0.27	-1.63	-0.34	-0.19	-0.07	0.48	▁▁▂▇▂
F1969	35	0.84	0.16	0.31	-0.90	-0.01	0.20	0.35	0.94	▁▂▆▇▁
F1970	36	0.84	0.09	0.35	-1.29	-0.05	0.13	0.30	0.98	▁▁▅▇▁
F1971	34	0.85	-0.20	0.23	-0.87	-0.33	-0.20	-0.07	0.68	▁▅▇▁▁
F1972	33	0.85	-0.08	0.38	-1.80	-0.19	-0.04	0.11	0.93	▁▁▃▇▁
F1973	32	0.86	0.23	0.33	-0.99	0.06	0.26	0.46	1.15	▁▂▇▇▁
F1974	33	0.85	-0.16	0.30	-0.98	-0.36	-0.19	-0.03	1.12	▁▇▆▁▁
F1975	37	0.84	-0.02	0.42	-1.09	-0.28	-0.13	0.11	1.89	▁▇▂▁▁
F1976	36	0.84	-0.25	0.32	-0.96	-0.44	-0.27	-0.07	0.73	▂▇▇▂▁
F1977	40	0.82	0.17	0.25	-0.60	0.00	0.18	0.32	1.08	▁▃▇▂▁
F1978	36	0.84	0.07	0.29	-0.87	-0.03	0.10	0.23	0.91	▁▁▇▃▁
F1979	36	0.84	0.23	0.39	-1.24	0.10	0.27	0.44	1.29	▁▁▇▇▁
F1980	34	0.85	0.25	0.34	-0.76	0.07	0.29	0.45	0.97	▁▂▅▇▂
F1981	34	0.85	0.18	0.32	-0.91	0.04	0.18	0.38	1.56	▁▃▇▁▁
F1982	33	0.85	0.18	0.32	-0.68	0.00	0.18	0.39	1.14	▂▂▇▃▁
F1983	35	0.84	0.34	0.54	-2.06	0.15	0.45	0.64	1.62	▁▁▂▇▂
F1984	37	0.84	0.08	0.33	-1.46	-0.11	0.06	0.28	0.85	▁▁▅▇▂
F1985	37	0.84	0.07	0.37	-1.19	-0.06	0.10	0.31	0.89	▁▁▆▇▂
F1986	35	0.84	0.15	0.29	-0.76	0.01	0.17	0.33	0.84	▁▁▇▇▁
F1987	35	0.84	0.41	0.48	-1.65	0.19	0.49	0.69	1.56	▁▁▃▇▂
F1988	35	0.84	0.49	0.29	-0.50	0.33	0.48	0.67	1.34	▁▂▇▅▁
F1989	35	0.84	0.26	0.49	-1.54	-0.05	0.14	0.41	2.18	▁▃▇▂▁
F1990	36	0.84	0.56	0.47	-0.74	0.27	0.45	0.76	1.84	▁▅▇▂▂
F1991	37	0.84	0.37	0.30	-0.70	0.19	0.39	0.54	1.14	▁▂▇▇▂
F1992	17	0.92	0.24	0.57	-1.34	-0.01	0.30	0.53	1.60	▁▂▇▆▁
F1993	16	0.93	0.22	0.40	-1.35	0.01	0.28	0.48	1.10	▁▁▃▇▂
F1994	17	0.92	0.61	0.49	-0.42	0.30	0.49	0.83	1.96	▁▇▃▂▁
F1995	15	0.93	0.63	0.44	-0.33	0.38	0.63	0.81	2.10	▂▇▇▂▁
F1996	15	0.93	0.28	0.41	-0.79	0.02	0.31	0.52	1.60	▂▅▇▂▁
F1997	18	0.92	0.54	0.48	-0.43	0.26	0.55	0.82	1.93	▃▇▇▂▁
F1998	15	0.93	0.97	0.39	-0.61	0.78	1.00	1.19	2.47	▁▂▇▂▁
F1999	16	0.93	0.74	0.45	-0.27	0.46	0.64	1.03	2.06	▂▇▆▃▁
F2000	16	0.93	0.67	0.53	-0.72	0.30	0.54	1.00	2.07	▁▇▇▅▂
F2001	17	0.92	0.85	0.47	-0.19	0.50	0.73	1.28	1.99	▁▇▅▅▂
F2002	13	0.94	0.92	0.38	0.01	0.68	0.84	1.14	2.26	▁▇▃▂▁
F2003	11	0.95	0.84	0.43	-0.25	0.59	0.84	1.05	2.33	▂▆▇▂▁
F2004	12	0.95	0.78	0.37	-0.62	0.54	0.73	0.98	2.15	▁▂▇▂▁
F2005	13	0.94	0.85	0.37	-0.39	0.67	0.84	1.07	2.20	▁▃▇▂▁
F2006	10	0.96	0.88	0.42	-0.50	0.61	0.84	1.13	2.34	▁▅▇▃▁
F2007	8	0.96	1.02	0.55	-0.22	0.68	0.92	1.22	2.73	▁▇▃▂▁
F2008	13	0.94	0.81	0.49	-0.14	0.44	0.69	1.11	2.61	▃▇▃▁▁
F2009	13	0.94	0.91	0.38	-0.32	0.68	0.89	1.19	1.77	▁▂▇▅▃
F2010	10	0.96	1.10	0.60	-0.34	0.77	1.11	1.31	3.06	▂▅▇▂▁
F2011	8	0.96	0.82	0.39	-0.48	0.56	0.76	1.09	1.70	▁▂▇▆▃
F2012	10	0.96	0.90	0.44	-0.13	0.59	0.81	1.19	2.14	▁▇▆▃▁
F2013	9	0.96	0.93	0.32	0.12	0.74	0.90	1.19	1.64	▁▃▇▅▂
F2014	9	0.96	1.11	0.56	-0.09	0.74	0.99	1.34	2.70	▁▇▅▂▂
F2015	9	0.96	1.27	0.46	-0.43	1.02	1.22	1.52	2.61	▁▂▇▃▁
F2016	12	0.95	1.44	0.40	0.25	1.15	1.45	1.71	2.46	▁▅▇▆▂
F2017	11	0.95	1.28	0.39	0.02	1.03	1.28	1.53	2.49	▁▃▇▃▁
F2018	12	0.95	1.30	0.60	0.24	0.86	1.12	1.83	2.77	▃▇▂▃▂
F2019	12	0.95	1.44	0.47	0.05	1.17	1.41	1.70	2.69	▁▃▇▃▂
F2020	13	0.94	1.55	0.62	0.23	1.16	1.48	1.83	3.69	▂▇▅▁▁
F2021	12	0.95	1.34	0.48	-0.42	1.02	1.33	1.63	2.68	▁▂▇▆▁
F2022	12	0.95	1.38	0.67	-1.30	0.88	1.31	1.92	3.24	▁▁▇▅▁