Race and U.S. Exonerations

Proposal

library(tidyverse)
library(skimr)

Data 1

Introduction and data

Identify the source of the data.
- Source: NYC Open Data
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- This dataset is updated every day and is collected by the Department of Health and Mental Hygiene.
Write a brief description of the observations.
- Each row represents a date of interest, which takes one of three meanings: date of diagnosis, date of hospital admission, or date of death. The dataset gives citywide and borough-specific daily counts of confirmed COVID-19 cases, COVID-related hospitalizations, and confirmed and probable deaths among New York City residents. Columns include the case count on the date of interest, the hospitalized count, the death count for each borough, etc.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
- How do the citywide numbers of cases, deaths, and hospitalizations change over time (by date of interest)?
- How does the number of deaths in each borough change over time (by date of interest)?
A description of the research topic along with a concise statement of your hypotheses on this topic.
- The data show the number of cases, deaths, and hospitalizations citywide on each date. I expect the counts to fluctuate considerably, with distinct peaks, because COVID-19 had specific periods of major outbreaks.
Identify the types of variables in your research question. Categorical? Quantitative?
- Date: Quantitative (read in as character, so it must be parsed as a date first)
- Deaths: Quantitative
- Hospitalizations: Quantitative
- Cases: Quantitative
- Borough: Categorical
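Because `date_of_interest` is read in as a character column, a trend over time needs it converted to a real date first. The following is a minimal sketch on made-up rows, not the actual file; the `MM/DD/YYYY` layout is an assumption (the skim output only confirms that every value is 10 characters long) and should be checked against the raw data.

```r
library(dplyr)

# Toy rows standing in for the real file; values are invented.
# Assumption: dates are written as MM/DD/YYYY.
covid_toy <- tibble(
  date_of_interest = c("02/29/2020", "03/01/2020", "03/02/2020"),
  CASE_COUNT = c(1, 0, 3)
)

# Parse the character column into a Date so it can be ordered and plotted.
covid_dated <- covid_toy %>%
  mutate(date_of_interest = as.Date(date_of_interest, format = "%m/%d/%Y")) %>%
  arrange(date_of_interest)

class(covid_dated$date_of_interest)
```

If the format string does not match the file, `as.Date()` returns `NA` rather than failing, so a quick `anyNA()` check after parsing is worthwhile.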

Glimpse of data

COVID <- read_csv("data/COVID-19_Daily_Counts_of_Cases__Hospitalizations__and_Deaths.csv")

Rows: 1106 Columns: 67
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): date_of_interest
dbl (66): CASE_COUNT, PROBABLE_CASE_COUNT, HOSPITALIZED_COUNT, DEATH_COUNT, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

skimr::skim(COVID)

Data summary
Name	COVID
Number of rows	1106
Number of columns	67
_______________________
Column type frequency:
character	1
numeric	66
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
date_of_interest	0	1	10	10	0	1106	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p25	p50	p75	p100	hist
CASE_COUNT	1	2456.34	4984.53	601.00	1473.5	2793.75	55008	▇▁▁▁▁
PROBABLE_CASE_COUNT	1	481.01	625.08	96.00	357.5	652.00	5882	▇▁▁▁▁
HOSPITALIZED_COUNT	1	170.76	243.71	47.00	100.5	181.00	1840	▇▁▁▁▁
DEATH_COUNT	1	34.89	75.76	6.00	13.0	30.00	598	▇▁▁▁▁
PROBABLE_DEATH_COUNT	1	5.80	24.30	0.00	1.0	2.00	240	▇▁▁▁▁
CASE_COUNT_7DAY_AVG	1	2455.27	4577.64	613.50	1549.5	2836.50	39498	▇▁▁▁▁
ALL_CASE_COUNT_7DAY_AVG	1	2935.97	5121.63	765.25	1932.5	3547.75	43954	▇▁▁▁▁
HOSP_COUNT_7DAY_AVG	1	170.68	239.41	48.00	104.0	180.00	1662	▇▁▁▁▁
DEATH_COUNT_7DAY_AVG	1	34.88	74.87	7.00	12.0	29.00	566	▇▁▁▁▁
ALL_DEATH_COUNT_7DAY_AVG	1	40.67	97.87	8.00	13.0	31.75	775	▇▁▁▁▁
BX_CASE_COUNT	1	405.56	930.34	80.00	201.5	427.00	10560	▇▁▁▁▁
BX_PROBABLE_CASE_COUNT	1	94.84	143.35	14.00	63.0	127.75	1575	▇▁▁▁▁
BX_HOSPITALIZED_COUNT	1	36.77	56.35	9.00	20.0	39.00	390	▇▁▁▁▁
BX_DEATH_COUNT	1	6.58	15.87	1.00	2.0	5.00	132	▇▁▁▁▁
BX_PROBABLE_DEATH_COUNT	1	1.12	5.02	0.00	0.0	0.00	46	▇▁▁▁▁
BX_CASE_COUNT_7DAY_AVG	1	405.39	835.87	82.00	229.5	450.50	7480	▇▁▁▁▁
BX_PROBABLE_CASE_COUNT_7DAY_AVG	1	94.79	131.76	15.25	70.0	132.75	1094	▇▁▁▁▁
BX_ALL_CASE_COUNT_7DAY_AVG	1	500.19	959.35	107.00	302.0	584.25	8574	▇▁▁▁▁
BX_HOSPITALIZED_COUNT_7DAY_AVG	1	36.75	54.94	9.00	21.0	37.00	358	▇▁▁▁▁
BX_DEATH_COUNT_7DAY_AVG	1	6.59	15.55	1.00	2.0	5.00	117	▇▁▁▁▁
BX_ALL_DEATH_COUNT_7DAY_AVG	1	7.71	20.28	1.00	2.0	5.00	158	▇▁▁▁▁
BK_CASE_COUNT	1	740.56	1470.84	203.25	454.0	833.00	16667	▇▁▁▁▁
BK_PROBABLE_CASE_COUNT	1	131.44	174.73	29.00	99.0	170.50	1906	▇▁▁▁▁
BK_HOSPITALIZED_COUNT	1	51.70	71.67	16.00	30.0	53.00	555	▇▁▁▁▁
BK_DEATH_COUNT	1	10.88	23.38	2.00	4.0	9.75	201	▇▁▁▁▁
BK_PROBABLE_DEATH_COUNT	1	1.95	8.42	0.00	0.0	1.00	92	▇▁▁▁▁
BK_CASE_COUNT_7DAY_AVG	1	740.24	1357.21	213.75	465.5	842.00	11587	▇▁▁▁▁
BK_PROBABLE_CASE_COUNT_7DAY_AVG	1	131.36	162.22	29.00	104.0	172.00	1213	▇▁▁▁▁
BK_ALL_CASE_COUNT_7DAY_AVG	1	871.59	1508.30	251.00	570.5	1027.50	12787	▇▁▁▁▁
BK_HOSPITALIZED_COUNT_7DAY_AVG	1	51.68	70.06	17.00	31.0	52.00	490	▇▁▁▁▁
BK_DEATH_COUNT_7DAY_AVG	1	10.89	22.94	2.00	4.0	9.00	178	▇▁▁▁▁
BK_ALL_DEATH_COUNT_7DAY_AVG	1	12.84	30.75	3.00	4.0	10.00	252	▇▁▁▁▁
MN_CASE_COUNT	1	450.75	903.73	104.00	275.5	485.75	9114	▇▁▁▁▁
MN_PROBABLE_CASE_COUNT	1	88.84	113.01	20.00	67.0	121.75	972	▇▁▁▁▁
MN_HOSPITALIZED_COUNT	1	25.89	35.63	7.00	16.0	29.75	273	▇▁▁▁▁
MN_DEATH_COUNT	1	4.78	9.88	1.00	2.0	5.00	92	▇▁▁▁▁
MN_PROBABLE_DEATH_COUNT	1	0.80	3.19	0.00	0.0	0.00	33	▇▁▁▁▁
MN_CASE_COUNT_7DAY_AVG	1	450.54	824.06	119.00	291.5	470.75	6394	▇▁▁▁▁
MN_PROBABLE_CASE_COUNT_7DAY_AVG	1	88.77	106.61	19.50	73.0	126.50	766	▇▁▁▁▁
MN_ALL_CASE_COUNT_7DAY_AVG	1	539.30	924.27	147.25	365.0	595.00	7161	▇▁▁▁▁
MN_HOSPITALIZED_COUNT_7DAY_AVG	1	25.88	34.65	7.00	17.0	30.00	228	▇▁▁▁▁
MN_DEATH_COUNT_7DAY_AVG	1	4.77	9.58	1.00	2.0	4.00	73	▇▁▁▁▁
MN_ALL_DEATH_COUNT_7DAY_AVG	1	5.56	12.49	1.00	2.0	4.00	100	▇▁▁▁▁
QN_CASE_COUNT	1	685.39	1404.20	145.00	388.0	783.00	15225	▇▁▁▁▁
QN_PROBABLE_CASE_COUNT	1	133.17	168.74	24.00	96.0	190.00	1609	▇▁▁▁▁
QN_HOSPITALIZED_COUNT	1	47.93	75.40	13.00	26.0	51.00	609	▇▁▁▁▁
QN_DEATH_COUNT	1	10.45	23.94	1.00	4.0	9.00	202	▇▁▁▁▁
QN_PROBABLE_DEATH_COUNT	1	1.67	7.20	0.00	0.0	1.00	68	▇▁▁▁▁
QN_CASE_COUNT_7DAY_AVG	1	685.11	1298.45	149.00	408.5	814.75	11551	▇▁▁▁▁
QN_PROBABLE_CASE_COUNT_7DAY_AVG	1	133.08	158.31	23.00	101.0	191.00	1219	▇▁▁▁▁
QN_ALL_CASE_COUNT_7DAY_AVG	1	818.19	1442.99	185.00	527.5	1036.00	12689	▇▁▁▁▁
QN_HOSPITALIZED_COUNT_7DAY_AVG	1	47.91	73.99	12.25	28.0	49.00	562	▇▁▁▁▁
QN_DEATH_COUNT_7DAY_AVG	1	10.44	23.55	2.00	4.0	9.00	177	▇▁▁▁▁
QN_ALL_DEATH_COUNT_7DAY_AVG	1	12.11	30.31	2.00	4.0	9.00	240	▇▁▁▁▁
SI_CASE_COUNT	1	173.25	326.49	40.25	108.0	192.75	3720	▇▁▁▁▁
SI_PROBABLE_CASE_COUNT	1	32.66	35.86	6.00	25.5	48.00	316	▇▁▁▁▁
SI_HOSPITALIZED_COUNT	1	10.52	11.59	3.00	7.0	14.00	83	▇▂▁▁▁
SI_DEATH_COUNT	1	2.20	3.80	0.00	1.0	3.00	34	▇▁▁▁▁
SI_PROBABLE_DEATH_COUNT	1	0.25	0.96	0.00	0.0	0.00	9	▇▁▁▁▁
SI_PROBABLE_CASE_COUNT_7DAY_AVG	1	32.63	33.33	6.00	27.0	48.00	233	▇▂▁▁▁
SI_CASE_COUNT_7DAY_AVG	1	173.18	299.80	40.25	111.5	193.00	2687	▇▁▁▁▁
SI_ALL_CASE_COUNT_7DAY_AVG	1	205.79	328.04	51.25	145.0	243.00	2907	▇▁▁▁▁
SI_HOSPITALIZED_COUNT_7DAY_AVG	1	10.51	11.06	4.00	8.0	13.00	72	▇▂▁▁▁
SI_DEATH_COUNT_7DAY_AVG	1	2.18	3.52	1.00	1.0	2.00	26	▇▁▁▁▁
SI_ALL_DEATH_COUNT_7DAY_AVG	1	2.44	4.32	1.00	1.0	2.00	34	▇▁▁▁▁
INCOMPLETE	1	385.95	4838.12	0.00	0.0	0.00	60980	▇▁▁▁▁
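The borough-level question needs one row per date and borough, while the file stores each borough in its own column (`BX_DEATH_COUNT`, `BK_DEATH_COUNT`, and so on). One possible way to reshape it is sketched below on invented numbers; the anchored pattern is chosen so that the `PROBABLE` and `7DAY_AVG` columns are left out.

```r
library(dplyr)
library(tidyr)

# Toy data with the same column names as the skim output; counts are invented.
covid_toy <- tibble(
  date_of_interest = as.Date(c("2020-04-01", "2020-04-02")),
  BX_DEATH_COUNT = c(5, 7),
  BK_DEATH_COUNT = c(9, 11),
  MN_DEATH_COUNT = c(3, 4),
  QN_DEATH_COUNT = c(8, 10),
  SI_DEATH_COUNT = c(1, 2)
)

# Stack the five borough columns into a borough/deaths pair.
borough_deaths <- covid_toy %>%
  pivot_longer(
    cols = matches("^(BX|BK|MN|QN|SI)_DEATH_COUNT$"),
    names_to = "borough",
    names_pattern = "^([A-Z]{2})_DEATH_COUNT$",
    values_to = "deaths"
  )

borough_deaths
```

In this long form, `borough` can be mapped to colour or facets in a plot of `deaths` against `date_of_interest`.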

Data 2

Introduction and data

Identify the source of the data.

The source of the data is the Department of Health and Mental Hygiene (DOHMH).

State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

The original data curators collected raw data through measurements of air quality and composition. The measurements are then adjusted for weather and season and modeled based on environmental factors and nearby emission sources.

Write a brief description of the observations.

The observations describe metrics for every NYC neighborhood, such as outdoor air pollutants, health burdens, and air toxics.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
- Which NYC neighborhoods have the highest average levels of fine particulates, and are these levels correlated with overall air quality?
- Have ozone levels in a given neighborhood gone up or down over the last few years?
A description of the research topic along with a concise statement of your hypotheses on this topic.
- This data shows air quality over time in NYC neighborhoods. We want to investigate how ozone and air quality have changed over time and whether particular substances in the air have an effect on overall air quality. We believe that a higher level of fine particulates means worse air quality and that ozone levels, in general, have gone down in the last few years.
Identify the types of variables in your research question. Categorical? Quantitative?
- Categorical
  - Name
  - Place
  - Time Period
- Quantitative
  - Date
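To make the first research question concrete, the sketch below ranks places by their mean fine-particulate value. It runs on a made-up data frame that reuses the column names from the skim output (`Name`, `Geo.Place.Name`, `Data.Value`); the indicator label `"Fine particles (PM 2.5)"` and the place names are assumptions and should be checked against the values actually present in `Name` and `Geo.Place.Name`.

```r
library(dplyr)

# Toy rows; indicator labels, place names, and values are invented.
air_toy <- tibble(
  Name = c("Fine particles (PM 2.5)", "Fine particles (PM 2.5)",
           "Ozone (O3)", "Fine particles (PM 2.5)"),
  Geo.Place.Name = c("Midtown", "Midtown", "Midtown", "Flushing"),
  Data.Value = c(10, 12, 30, 8)
)

# Keep only the fine-particulate rows, then average per place.
pm_by_place <- air_toy %>%
  filter(Name == "Fine particles (PM 2.5)") %>%
  group_by(Geo.Place.Name) %>%
  summarise(mean_pm = mean(Data.Value), .groups = "drop") %>%
  arrange(desc(mean_pm))

pm_by_place
```

Filtering on `Name` first matters because `Data.Value` mixes several indicators with different units, so an unfiltered mean would not be interpretable.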

Glimpse of data

Air_Qual <- read.csv("data/Air_Quality.csv")
skimr::skim(Air_Qual)

Data summary
Name	Air_Qual
Number of rows	16122
Number of columns	12
_______________________
Column type frequency:
character	7
logical	1
numeric	4
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
Name	1	10	76	19
Measure	1	4	47	8
Measure.Info	1	3	21	8
Geo.Type.Name	1	2	8	5
Geo.Place.Name	1	5	46	114
Time.Period	1	4	19	45
Start_Date	1	10	10	36

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
Message	16122	0	NaN	:

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Unique.ID	1	339480.96	194099.81	130355	172183.25	221882.5	547749.75	671122.0	▇▂▁▂▃
Indicator.ID	1	427.13	109.66	365	365.00	375.0	386.00	661.0	▇▁▁▁▂
Geo.Join.ID	1	613339.41	7916715.24	1	202.00	303.0	404.00	105106107.0	▇▁▁▁▁
Data.Value	1	19.13	21.67	0	8.46	13.9	25.47	424.7	▇▁▁▁▁
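The ozone question asks about change over time, and `Start_Date` is stored as a 10-character string. A rough way to get a direction of change is to average by year and fit a straight line, as sketched below on invented values. The `MM/DD/YYYY` format, the `"Ozone (O3)"` label, and the place name are assumptions, and a linear fit is only a first look, not a full trend analysis.

```r
library(dplyr)

# Toy rows for a single place; values are invented.
ozone_toy <- tibble(
  Name = "Ozone (O3)",
  Geo.Place.Name = "Midtown",
  Start_Date = c("06/01/2015", "06/01/2016", "06/01/2017"),
  Data.Value = c(32, 30, 28)
)

# Extract the year from the date string and average within place and year.
ozone_by_year <- ozone_toy %>%
  mutate(year = as.integer(format(as.Date(Start_Date, format = "%m/%d/%Y"), "%Y"))) %>%
  group_by(Geo.Place.Name, year) %>%
  summarise(mean_ozone = mean(Data.Value), .groups = "drop")

# Slope of mean ozone on year: negative means levels went down.
trend <- lm(mean_ozone ~ year, data = ozone_by_year)
slope <- unname(coef(trend)["year"])
slope
```

With real data the same steps would be run per neighborhood, for example by grouping on `Geo.Place.Name` before fitting.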

Data 3

Introduction and data

Identify the source of the data.
- Source: The National Registry of Exonerations
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- The Registry was founded in 2012 as a project of the Newkirk Center for Science and Society at the University of California Irvine, the University of Michigan Law School, and Michigan State University College of Law in conjunction with the Center on Wrongful Convictions at Northwestern University School of Law. Their research allowed them to collect data on every known exoneration in the United States since 1989.
Write a brief description of the observations.
- Each observation represents one exonerated individual. The dataset includes their name, age, race, sex, and various details about the crime they were exonerated for, such as location, type of crime, years of conviction, whether DNA evidence was used, and more.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
- How does the presence of DNA evidence affect how long exonerees were in jail?
- Are there any differences in conviction and exoneration rates between races within different states?
- How do an individual’s characteristics affect how long they were wrongfully convicted?
A description of the research topic along with a concise statement of your hypotheses on this topic.
- The research topic is every known exoneration in the United States since 1989. The data give information about the sentence and the individual who was exonerated. We want to investigate how the individual’s sentence affects how long the individual was in jail. We believe that the harsher the sentence and the less DNA evidence at the scene, the longer the individual remained wrongfully convicted and in jail.
Identify the types of variables in your research question. Categorical? Quantitative?
- Categorical
  - Race
  - Sex
  - State
  - Description
  - County
  - DNA evidence (present or not)
- Quantitative
  - Year Convicted
  - Year Exonerated
  - Age
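The questions above all depend on a duration that the dataset does not store directly. One possible proxy is the gap between the `Convicted` and `Exonerated` years, sketched below on invented rows. Two assumptions are made here: that a missing `DNA` value means DNA evidence played no role, and that the year gap approximates time spent wrongfully convicted (it is not necessarily time in jail).

```r
library(dplyr)

# Toy rows using column names from the skim output; values are invented.
exon_toy <- tibble(
  Convicted = c(1990, 2000, 2005, 1995),
  Exonerated = c(2010, 2005, 2015, 2020),
  DNA = c("DNA", NA, NA, "DNA")
)

# Derive the duration and a TRUE/FALSE DNA indicator, then compare group means.
years_by_dna <- exon_toy %>%
  mutate(
    years_to_exoneration = Exonerated - Convicted,
    dna_flag = !is.na(DNA)
  ) %>%
  group_by(dna_flag) %>%
  summarise(mean_years = mean(years_to_exoneration), n = n(), .groups = "drop")

years_by_dna
```

The toy numbers carry no information about the real pattern; they only show the shape of the comparison.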

Glimpse of data

us_exonerations <- 
  read_csv("data/us_exonerations.csv")

Rows: 3284 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (20): Last Name, First Name, Race, Sex, State, County, Tags, Worst Crime...
dbl  (3): Age, Convicted, Exonerated

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

skimr::skim(us_exonerations)

Data summary
Name	us_exonerations
Number of rows	3284
Number of columns	23
_______________________
Column type frequency:
character	20
numeric	3
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
Last Name	0	1.00	2	18	2034
First Name	0	1.00	2	18	1305
Race	0	1.00	5	22	9
Sex	0	1.00	4	6	2
State	0	1.00	4	20	84
County	60	0.98	3	17	568
Tags	167	0.95	1	28	443
Worst Crime Display	0	1.00	5	29	45
List Add’l Crimes Recode	2002	0.39	4	132	248
Sentence	0	1.00	2	45	480
DNA	2710	0.17	3	3	1
*	3113	0.05	1	1	1
FC	2883	0.12	2	2	1
MWID	2392	0.27	4	4	1
F/MFE	2511	0.24	5	5	1
P/FA	1205	0.63	4	4	1
OM	1346	0.59	2	2	1
ILD	2405	0.27	3	3	1
Posting Date	0	1.00	8	10	1230
OM Tags	1347	0.59	2	35	134

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Age	26	0.99	28.43	10.24	11	20	26	34	83	▇▆▂▁▁
Convicted	0	1.00	1998.64	10.97	1956	1990	1998	2007	2021	▁▁▇▇▅
Exonerated	0	1.00	2010.45	8.96	1989	2003	2013	2018	2023	▂▃▅▇▇
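In the skim output, columns such as `DNA`, `FC`, `MWID`, `F/MFE`, `P/FA`, `OM`, and `ILD` each have a single unique value and many missing entries, which suggests they act as presence markers rather than free text. Assuming that a missing entry means "not applicable", they could be turned into logical columns as in this sketch (toy rows, three of the columns only):

```r
library(dplyr)

# Toy rows; which cases carry which marker is invented.
exon_toy <- tibble(
  DNA = c("DNA", NA, NA),
  FC = c(NA, "FC", NA),
  `P/FA` = c("P/FA", "P/FA", NA)
)

# Hypothetical helper vector naming the marker columns to convert.
flag_cols <- c("DNA", "FC", "P/FA")

# Replace each marker column with TRUE where a value is present.
exon_flags <- exon_toy %>%
  mutate(across(all_of(flag_cols), ~ !is.na(.x)))

# Number of cases carrying each marker.
colSums(exon_flags[flag_cols])
```

Column names containing `/` or `*` need backticks when referred to directly, which is why the sketch selects them by string through `all_of()`.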