An Analysis of NYPD Arrest Data
Appendix to report
Data cleaning
Data Collection
- Download data
- Place data in data folder
- Read data using read_csv function and place in dataframe
library(tidyverse)

nypd_arrest_data_raw <- read_csv("data/NYPD_Arrest_Data__Year_to_Date_.csv")
Rows: 189774 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): ARREST_DATE, PD_DESC, OFNS_DESC, LAW_CODE, LAW_CAT_CD, ARREST_BORO...
dbl (9): ARREST_KEY, PD_CD, KY_CD, ARREST_PRECINCT, JURISDICTION_CODE, X_CO...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Data Cleaning
- Drop NA Values
- Keep only the relevant columns using the select function: ARREST_KEY, ARREST_DATE, PD_CD, PD_DESC, KY_CD, OFNS_DESC, LAW_CODE, LAW_CAT_CD, ARREST_BORO, AGE_GROUP, PERP_SEX, and PERP_RACE. We kept these columns because they were relevant to our research questions.
- For some of the analysis we collapsed the LAW_CAT_CD column into 3 levels instead of 4, because infractions and violations were essentially the same level and appeared very infrequently in the dataset. This change was not universal, so it was made in the individual analyses rather than in the data cleaning process.
- For analysis of arrest dates we separated the single date column into month, day, and year columns to make the time series analysis easier.
- For many of the visualizations we expanded coded values into their full names. For the PERP_SEX variable, M became Male and F became Female. For the LAW_CAT_CD variable, F became Felony, M became Misdemeanor, and I and V both became Violation.
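nypd_arrest_data <- nypd_arrest_data_raw |>
  drop_na() |>                    # drop rows with a missing value in any column
  select(1:14) |>                 # keep the first 14 columns
  select(!JURISDICTION_CODE) |>   # then remove the two columns not used in the analysis
  select(!ARREST_PRECINCT)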
nypd_arrest_data
# A tibble: 187,457 × 12
   ARREST_KEY ARREST_DATE PD_CD PD_DESC      KY_CD OFNS_DESC LAW_CODE LAW_CAT_CD
        <dbl> <chr>       <dbl> <chr>        <dbl> <chr>     <chr>    <chr>
 1  239553009 01/23/2022    464 JOSTLING       230 JOSTLING  PL 1652… M
 2  239922214 01/31/2022    397 ROBBERY,OPE…   105 ROBBERY   PL 1601… F
 3  239939130 02/01/2022    105 STRANGULATI…   106 FELONY A… PL 1211… F
 4  240521791 02/13/2022    101 ASSAULT 3      344 ASSAULT … PL 1200… M
 5  241022365 02/21/2022    397 ROBBERY,OPE…   105 ROBBERY   PL 1600… F
 6  242064428 03/14/2022    105 STRANGULATI…   106 FELONY A… PL 1211… F
 7  242456937 03/22/2022    105 STRANGULATI…   106 FELONY A… PL 1211… F
 8  242818613 03/29/2022    705 FORGERY,ETC…   358 OFFENSES… PL 1702… M
 9  243132247 04/05/2022    157 RAPE 1         104 RAPE      PL 1303… F
10  244567670 05/04/2022    109 ASSAULT 2,1…   106 FELONY A… PL 1200… F
# ℹ 187,447 more rows
# ℹ 4 more variables: ARREST_BORO <chr>, AGE_GROUP <chr>, PERP_SEX <chr>,
# PERP_RACE <chr>
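The date splitting and value recoding described in the list above were done inside the individual analyses and are not shown in this appendix. A minimal sketch of what those steps could look like follows. It assumes the tidyr, dplyr, and forcats functions shown are the ones used; the toy rows and the names toy_arrests and toy_clean are hypothetical and are not taken from the dataset.

```r
library(tibble)
library(dplyr)
library(tidyr)
library(forcats)

# Hypothetical toy rows standing in for nypd_arrest_data (illustrative values only)
toy_arrests <- tibble(
  ARREST_DATE = c("01/23/2022", "02/13/2022", "03/29/2022", "05/04/2022"),
  LAW_CAT_CD  = c("M", "F", "V", "I"),
  PERP_SEX    = c("M", "F", "M", "F")
)

toy_clean <- toy_arrests |>
  # Split MM/DD/YYYY into integer month, day, and year columns,
  # keeping the original ARREST_DATE column alongside them
  separate(ARREST_DATE, into = c("month", "day", "year"), sep = "/",
           convert = TRUE, remove = FALSE) |>
  mutate(
    # Collapse infractions (I) and violations (V) into one level
    LAW_CAT_CD = fct_collapse(LAW_CAT_CD,
                              Felony = "F",
                              Misdemeanor = "M",
                              Violation = c("V", "I")),
    # Expand the sex codes into full names; other codes are left unchanged
    PERP_SEX = recode(PERP_SEX, M = "Male", F = "Female")
  )

toy_clean
```

Keeping these steps out of the cleaning pipeline means nypd_arrest_data retains the original four LAW_CAT_CD codes for any analysis that needs them.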
Other Appendices
One analysis that we did not include in the report, but that we thought was of interest, compares misdemeanor and felony arrests of Black men for dangerous drugs. We found the difference to be statistically significant, which suggests that Black perpetrators make up a larger share of felony arrests than of misdemeanor arrests for dangerous drugs. Racial bias is one likely explanation, with Black people being charged more harshly because of skin color.
Null hypothesis: The true proportion of Black perpetrators among people arrested for dangerous-drug felonies is the same as the true proportion of Black perpetrators among people arrested for dangerous-drug misdemeanors.
\[H_0: p_f = p_m\]
Alternative hypothesis: The true proportion of Black perpetrators among people arrested for dangerous-drug felonies is not the same as the true proportion of Black perpetrators among people arrested for dangerous-drug misdemeanors.
\[H_A: p_f \neq p_m\]
For this analysis we grouped perpetrators into Black and non-Black and kept only arrests whose offense description was ‘DANGEROUS DRUGS’.
The point estimate, 0.0674148, is the observed difference between the proportion of Black perpetrators among dangerous-drug felony arrests and the proportion among dangerous-drug misdemeanor arrests.
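The code that produced point_estimate_ot is not shown in this appendix, and neither is the definition of offense_type_data. A minimal sketch of how such a point estimate could be computed with the infer package follows. The toy data, the "NOT BLACK" label, and the names toy_offense and toy_point_estimate are hypothetical and only illustrate the calculation.

```r
library(tibble)
library(infer)

# Hypothetical toy data: 10 felony and 10 misdemeanor arrests.
# "NOT BLACK" is an assumed label for the second race category.
toy_offense <- tibble(
  PERP_RACE = factor(c(rep("BLACK", 6), rep("NOT BLACK", 4),
                       rep("BLACK", 3), rep("NOT BLACK", 7))),
  LAW_CAT_CD = factor(c(rep("Felony", 10), rep("Misdemeanor", 10)))
)

# Observed difference in the proportion of Black perpetrators,
# felony arrests minus misdemeanor arrests
toy_point_estimate <- toy_offense |>
  specify(PERP_RACE ~ LAW_CAT_CD, success = "BLACK") |>
  calculate(stat = "diff in props", order = c("Felony", "Misdemeanor"))

toy_point_estimate
```

Using the same order argument for the point estimate and for the null distribution keeps the sign of the two statistics comparable.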
library(infer)

set.seed(123)

null_dist_ot <- offense_type_data |>
  # Collapse infractions (I) and violations (V) into a single level
  mutate(LAW_CAT_CD = fct_collapse(LAW_CAT_CD,
                                   Felony = "F",
                                   Misdemeanor = "M",
                                   Violation = c("V", "I")),
         LAW_CAT_CD = fct_relevel(LAW_CAT_CD,
                                  c("Felony", "Misdemeanor", "Violation"))) |>
  # Compare felonies with misdemeanors for dangerous-drug arrests only
  filter(LAW_CAT_CD != "Violation") |>
  filter(OFNS_DESC == "DANGEROUS DRUGS") |>
  droplevels() |>
  specify(PERP_RACE ~ LAW_CAT_CD, success = "BLACK") |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "diff in props", order = c("Felony", "Misdemeanor"))
We generated the null distribution of the difference in the proportion of Black perpetrators between felony and misdemeanor arrests for ‘dangerous drugs’ using 1,000 permutations.
visualize(null_dist_ot) +
shade_p_value(obs_stat = point_estimate_ot, direction = "two-sided")
null_dist_ot |>
  get_p_value(obs_stat = point_estimate_ot, direction = "two-sided")
Warning: Please be cautious in reporting a p-value of 0. This result is an
approximation based on the number of `reps` chosen in the `generate()` step. See
`?get_p_value()` for more information.
# A tibble: 1 × 1
p_value
<dbl>
1 0
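The reported p-value of 0 means that none of the 1,000 permuted statistics was as extreme as the observed one, so the p-value is best read as less than 1 in 1,000 rather than exactly 0. Since this is less than 0.05, we reject the null hypothesis. There is significant evidence that the true proportion of Black perpetrators among people arrested for dangerous-drug felonies is not the same as the true proportion among people arrested for dangerous-drug misdemeanors.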
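The warning about reporting a p-value of 0 can be illustrated with a short base R sketch. It counts how many simulated null statistics are at least as extreme as the observed one and then applies the common add-one adjustment, which never returns exactly 0. The null statistics here are random draws with a made-up spread of 0.01, not the permuted values from the analysis, and the absolute-value comparison is a simplification that may differ from how get_p_value() computes a two-sided p-value. All names are hypothetical.

```r
set.seed(123)

# Hypothetical stand-in for 1000 permuted statistics (the spread is made up)
toy_null_stats <- rnorm(1000, mean = 0, sd = 0.01)

# Observed statistic, set to the point estimate reported in this appendix
toy_obs_stat <- 0.0674148

reps <- length(toy_null_stats)

# Number of null statistics at least as extreme as the observed one
n_extreme <- sum(abs(toy_null_stats) >= abs(toy_obs_stat))

# Raw proportion, which can be exactly 0 when nothing is as extreme
p_raw <- n_extreme / reps

# Add-one adjustment: counts the observed statistic as one of the permutations
p_adjusted <- (n_extreme + 1) / (reps + 1)

c(raw = p_raw, adjusted = p_adjusted)
```

With more permutations in generate(), the smallest p-value that can be resolved gets smaller, so the choice of reps limits how strongly the result can be stated.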