Race and U.S. Exonerations

Proposal

library(tidyverse)
library(skimr)

Data 1

Introduction and data

  • Identify the source of the data.

    • Source: NYC Open Data
  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    • This dataset is updated every day and is collected by the Department of Health and Mental Hygiene.
  • Write a brief description of the observations.

    • Each row represents a date of interest which is separated into three types: date of diagnosis, date of hospital admission, and date of death. This dataset represents citywide and borough-specific daily counts of COVID-19 confirmed cases and COVID-related hospitalizations and confirmed and probable deaths among New York City residents. Columns include number of cases on date of interest, hospitalized count, death count in different boroughs, etc.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
    • What is the trend between the date of interest and the number of cases, deaths, and hospitalizations on that date citywide?

    • What is the trend between the date of interest and the number of deaths for each borough?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.
    • The data collected will show the number of cases, deaths, and hospitalizations on a specific date citywide. I believe that there will be quite a few fluctuations as we know that COVID-19 had specific periods of major outbreaks.
  • Identify the types of variables in your research question. Categorical? Quantitative?
    • Date: stored as character in the raw file; treated as a date (ordered, quantitative) for trend analysis

    • Deaths: quantitative

    • Hospitalizations: quantitative

    • Cases: quantitative

    • Borough: categorical
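To look at trends by date and by borough, the character date column needs to be parsed and the borough-specific columns need to be stacked into one borough variable. The following is a minimal base R sketch on made-up toy rows, not the real file. The month/day/year date format and the object names `covid_toy` and `deaths_long` are assumptions for illustration; `tidyr::pivot_longer()` would be the tidyverse route to the same long layout.

```r
# Toy rows mimicking three columns of the COVID file (values are made up).
covid_toy <- data.frame(
  date_of_interest = c("02/29/2020", "03/01/2020", "03/02/2020"),
  BX_DEATH_COUNT = c(0, 1, 2),
  BK_DEATH_COUNT = c(0, 0, 3),
  stringsAsFactors = FALSE
)

# Assumption: dates are written as month/day/year; adjust `format` if not.
covid_toy$date <- as.Date(covid_toy$date_of_interest, format = "%m/%d/%Y")

# Stack the borough columns so each row is one date and one borough.
death_cols <- c("BX_DEATH_COUNT", "BK_DEATH_COUNT")
deaths_long <- data.frame(
  date = rep(covid_toy$date, times = length(death_cols)),
  borough = rep(sub("_DEATH_COUNT$", "", death_cols), each = nrow(covid_toy)),
  deaths = unlist(covid_toy[death_cols], use.names = FALSE)
)

print(deaths_long)
```

In this layout, borough is a single categorical variable and deaths is a single quantitative variable, which matches the variable types listed above.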

Glimpse of data

COVID <- read_csv("data/COVID-19_Daily_Counts_of_Cases__Hospitalizations__and_Deaths.csv")
Rows: 1106 Columns: 67
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): date_of_interest
dbl (66): CASE_COUNT, PROBABLE_CASE_COUNT, HOSPITALIZED_COUNT, DEATH_COUNT, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(COVID)
Data summary
Name COVID
Number of rows 1106
Number of columns 67
_______________________
Column type frequency:
character 1
numeric 66
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
date_of_interest 0 1 10 10 0 1106 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
CASE_COUNT 0 1 2456.34 4984.53 0 601.00 1473.5 2793.75 55008 ▇▁▁▁▁
PROBABLE_CASE_COUNT 0 1 481.01 625.08 0 96.00 357.5 652.00 5882 ▇▁▁▁▁
HOSPITALIZED_COUNT 0 1 170.76 243.71 0 47.00 100.5 181.00 1840 ▇▁▁▁▁
DEATH_COUNT 0 1 34.89 75.76 0 6.00 13.0 30.00 598 ▇▁▁▁▁
PROBABLE_DEATH_COUNT 0 1 5.80 24.30 0 0.00 1.0 2.00 240 ▇▁▁▁▁
CASE_COUNT_7DAY_AVG 0 1 2455.27 4577.64 0 613.50 1549.5 2836.50 39498 ▇▁▁▁▁
ALL_CASE_COUNT_7DAY_AVG 0 1 2935.97 5121.63 0 765.25 1932.5 3547.75 43954 ▇▁▁▁▁
HOSP_COUNT_7DAY_AVG 0 1 170.68 239.41 0 48.00 104.0 180.00 1662 ▇▁▁▁▁
DEATH_COUNT_7DAY_AVG 0 1 34.88 74.87 0 7.00 12.0 29.00 566 ▇▁▁▁▁
ALL_DEATH_COUNT_7DAY_AVG 0 1 40.67 97.87 0 8.00 13.0 31.75 775 ▇▁▁▁▁
BX_CASE_COUNT 0 1 405.56 930.34 0 80.00 201.5 427.00 10560 ▇▁▁▁▁
BX_PROBABLE_CASE_COUNT 0 1 94.84 143.35 0 14.00 63.0 127.75 1575 ▇▁▁▁▁
BX_HOSPITALIZED_COUNT 0 1 36.77 56.35 0 9.00 20.0 39.00 390 ▇▁▁▁▁
BX_DEATH_COUNT 0 1 6.58 15.87 0 1.00 2.0 5.00 132 ▇▁▁▁▁
BX_PROBABLE_DEATH_COUNT 0 1 1.12 5.02 0 0.00 0.0 0.00 46 ▇▁▁▁▁
BX_CASE_COUNT_7DAY_AVG 0 1 405.39 835.87 0 82.00 229.5 450.50 7480 ▇▁▁▁▁
BX_PROBABLE_CASE_COUNT_7DAY_AVG 0 1 94.79 131.76 0 15.25 70.0 132.75 1094 ▇▁▁▁▁
BX_ALL_CASE_COUNT_7DAY_AVG 0 1 500.19 959.35 0 107.00 302.0 584.25 8574 ▇▁▁▁▁
BX_HOSPITALIZED_COUNT_7DAY_AVG 0 1 36.75 54.94 0 9.00 21.0 37.00 358 ▇▁▁▁▁
BX_DEATH_COUNT_7DAY_AVG 0 1 6.59 15.55 0 1.00 2.0 5.00 117 ▇▁▁▁▁
BX_ALL_DEATH_COUNT_7DAY_AVG 0 1 7.71 20.28 0 1.00 2.0 5.00 158 ▇▁▁▁▁
BK_CASE_COUNT 0 1 740.56 1470.84 0 203.25 454.0 833.00 16667 ▇▁▁▁▁
BK_PROBABLE_CASE_COUNT 0 1 131.44 174.73 0 29.00 99.0 170.50 1906 ▇▁▁▁▁
BK_HOSPITALIZED_COUNT 0 1 51.70 71.67 0 16.00 30.0 53.00 555 ▇▁▁▁▁
BK_DEATH_COUNT 0 1 10.88 23.38 0 2.00 4.0 9.75 201 ▇▁▁▁▁
BK_PROBABLE_DEATH_COUNT 0 1 1.95 8.42 0 0.00 0.0 1.00 92 ▇▁▁▁▁
BK_CASE_COUNT_7DAY_AVG 0 1 740.24 1357.21 0 213.75 465.5 842.00 11587 ▇▁▁▁▁
BK_PROBABLE_CASE_COUNT_7DAY_AVG 0 1 131.36 162.22 0 29.00 104.0 172.00 1213 ▇▁▁▁▁
BK_ALL_CASE_COUNT_7DAY_AVG 0 1 871.59 1508.30 0 251.00 570.5 1027.50 12787 ▇▁▁▁▁
BK_HOSPITALIZED_COUNT_7DAY_AVG 0 1 51.68 70.06 0 17.00 31.0 52.00 490 ▇▁▁▁▁
BK_DEATH_COUNT_7DAY_AVG 0 1 10.89 22.94 0 2.00 4.0 9.00 178 ▇▁▁▁▁
BK_ALL_DEATH_COUNT_7DAY_AVG 0 1 12.84 30.75 0 3.00 4.0 10.00 252 ▇▁▁▁▁
MN_CASE_COUNT 0 1 450.75 903.73 0 104.00 275.5 485.75 9114 ▇▁▁▁▁
MN_PROBABLE_CASE_COUNT 0 1 88.84 113.01 0 20.00 67.0 121.75 972 ▇▁▁▁▁
MN_HOSPITALIZED_COUNT 0 1 25.89 35.63 0 7.00 16.0 29.75 273 ▇▁▁▁▁
MN_DEATH_COUNT 0 1 4.78 9.88 0 1.00 2.0 5.00 92 ▇▁▁▁▁
MN_PROBABLE_DEATH_COUNT 0 1 0.80 3.19 0 0.00 0.0 0.00 33 ▇▁▁▁▁
MN_CASE_COUNT_7DAY_AVG 0 1 450.54 824.06 0 119.00 291.5 470.75 6394 ▇▁▁▁▁
MN_PROBABLE_CASE_COUNT_7DAY_AVG 0 1 88.77 106.61 0 19.50 73.0 126.50 766 ▇▁▁▁▁
MN_ALL_CASE_COUNT_7DAY_AVG 0 1 539.30 924.27 0 147.25 365.0 595.00 7161 ▇▁▁▁▁
MN_HOSPITALIZED_COUNT_7DAY_AVG 0 1 25.88 34.65 0 7.00 17.0 30.00 228 ▇▁▁▁▁
MN_DEATH_COUNT_7DAY_AVG 0 1 4.77 9.58 0 1.00 2.0 4.00 73 ▇▁▁▁▁
MN_ALL_DEATH_COUNT_7DAY_AVG 0 1 5.56 12.49 0 1.00 2.0 4.00 100 ▇▁▁▁▁
QN_CASE_COUNT 0 1 685.39 1404.20 0 145.00 388.0 783.00 15225 ▇▁▁▁▁
QN_PROBABLE_CASE_COUNT 0 1 133.17 168.74 0 24.00 96.0 190.00 1609 ▇▁▁▁▁
QN_HOSPITALIZED_COUNT 0 1 47.93 75.40 0 13.00 26.0 51.00 609 ▇▁▁▁▁
QN_DEATH_COUNT 0 1 10.45 23.94 0 1.00 4.0 9.00 202 ▇▁▁▁▁
QN_PROBABLE_DEATH_COUNT 0 1 1.67 7.20 0 0.00 0.0 1.00 68 ▇▁▁▁▁
QN_CASE_COUNT_7DAY_AVG 0 1 685.11 1298.45 0 149.00 408.5 814.75 11551 ▇▁▁▁▁
QN_PROBABLE_CASE_COUNT_7DAY_AVG 0 1 133.08 158.31 0 23.00 101.0 191.00 1219 ▇▁▁▁▁
QN_ALL_CASE_COUNT_7DAY_AVG 0 1 818.19 1442.99 0 185.00 527.5 1036.00 12689 ▇▁▁▁▁
QN_HOSPITALIZED_COUNT_7DAY_AVG 0 1 47.91 73.99 0 12.25 28.0 49.00 562 ▇▁▁▁▁
QN_DEATH_COUNT_7DAY_AVG 0 1 10.44 23.55 0 2.00 4.0 9.00 177 ▇▁▁▁▁
QN_ALL_DEATH_COUNT_7DAY_AVG 0 1 12.11 30.31 0 2.00 4.0 9.00 240 ▇▁▁▁▁
SI_CASE_COUNT 0 1 173.25 326.49 0 40.25 108.0 192.75 3720 ▇▁▁▁▁
SI_PROBABLE_CASE_COUNT 0 1 32.66 35.86 0 6.00 25.5 48.00 316 ▇▁▁▁▁
SI_HOSPITALIZED_COUNT 0 1 10.52 11.59 0 3.00 7.0 14.00 83 ▇▂▁▁▁
SI_DEATH_COUNT 0 1 2.20 3.80 0 0.00 1.0 3.00 34 ▇▁▁▁▁
SI_PROBABLE_DEATH_COUNT 0 1 0.25 0.96 0 0.00 0.0 0.00 9 ▇▁▁▁▁
SI_PROBABLE_CASE_COUNT_7DAY_AVG 0 1 32.63 33.33 0 6.00 27.0 48.00 233 ▇▂▁▁▁
SI_CASE_COUNT_7DAY_AVG 0 1 173.18 299.80 0 40.25 111.5 193.00 2687 ▇▁▁▁▁
SI_ALL_CASE_COUNT_7DAY_AVG 0 1 205.79 328.04 0 51.25 145.0 243.00 2907 ▇▁▁▁▁
SI_HOSPITALIZED_COUNT_7DAY_AVG 0 1 10.51 11.06 0 4.00 8.0 13.00 72 ▇▂▁▁▁
SI_DEATH_COUNT_7DAY_AVG 0 1 2.18 3.52 0 1.00 1.0 2.00 26 ▇▁▁▁▁
SI_ALL_DEATH_COUNT_7DAY_AVG 0 1 2.44 4.32 0 1.00 1.0 2.00 34 ▇▁▁▁▁
INCOMPLETE 0 1 385.95 4838.12 0 0.00 0.0 0.00 60980 ▇▁▁▁▁

Data 2

Introduction and data

  • Identify the source of the data.

The source of the data is the Department of Health and Mental Hygiene (DOHMH).

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

The original data curators collected raw measurements of air quality and composition. The data are then adjusted for weather and season and modeled based on environmental factors and nearby emission sources.

  • Write a brief description of the observations.

The observations describe metrics for each NYC neighborhood, such as outdoor air pollutants, health burdens, and air toxics.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
    • Which NYC neighborhoods have the highest average levels of fine particulates, and does this correlate with overall air quality?

    • Have the ozone levels in my neighborhood gone down or up over the last few years?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.
    • This data shows air quality over time in NYC neighborhoods. We want to investigate how ozone and air quality have changed over time and whether particular pollutants in the air have an effect on the overall air quality. We believe that a higher level of fine particulates means worse air quality and that ozone levels, in general, have gone down in the last few years.
  • Identify the types of variables in your research question. Categorical? Quantitative?
    • Categorical
      • Name

      • Place

      • Time Period

    • Quantitative
      • Date
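The first research question amounts to filtering to one pollutant and averaging the measured value within each place. A rough base R sketch on made-up toy rows follows. The column names match the skim output (`Name`, `Geo.Place.Name`, `Data.Value`), but the pollutant labels, place names, and values are invented for illustration and may not match the real file.

```r
# Toy rows with the same column names as Air_Qual (all values are made up).
air_toy <- data.frame(
  Name = c("Fine particles (PM 2.5)", "Fine particles (PM 2.5)",
           "Ozone (O3)", "Fine particles (PM 2.5)"),
  Geo.Place.Name = c("Neighborhood A", "Neighborhood A",
                     "Neighborhood A", "Neighborhood B"),
  Data.Value = c(10, 12, 30, 8),
  stringsAsFactors = FALSE
)

# Keep only the fine-particulate rows (the label is an assumption).
pm <- air_toy[air_toy$Name == "Fine particles (PM 2.5)", ]

# Average the measured value within each place, then sort highest first.
pm_means <- aggregate(Data.Value ~ Geo.Place.Name, data = pm, FUN = mean)
pm_means <- pm_means[order(-pm_means$Data.Value), ]

print(pm_means)
```

Note that the quantity being averaged is `Data.Value`, a numeric column, while `Name` and `Geo.Place.Name` only define the groups.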

Glimpse of data

# Load the air quality data and summarize each column
Air_Qual <- read.csv("data/Air_Quality.csv")
skimr::skim(Air_Qual)
Data summary
Name Air_Qual
Number of rows 16122
Number of columns 12
_______________________
Column type frequency:
character 7
logical 1
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Name 0 1 10 76 0 19 0
Measure 0 1 4 47 0 8 0
Measure.Info 0 1 3 21 0 8 0
Geo.Type.Name 0 1 2 8 0 5 0
Geo.Place.Name 0 1 5 46 0 114 0
Time.Period 0 1 4 19 0 45 0
Start_Date 0 1 10 10 0 36 0

Variable type: logical

skim_variable n_missing complete_rate mean count
Message 16122 0 NaN :

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Unique.ID 0 1 339480.96 194099.81 130355 172183.25 221882.5 547749.75 671122.0 ▇▂▁▂▃
Indicator.ID 0 1 427.13 109.66 365 365.00 375.0 386.00 661.0 ▇▁▁▁▂
Geo.Join.ID 0 1 613339.41 7916715.24 1 202.00 303.0 404.00 105106107.0 ▇▁▁▁▁
Data.Value 0 1 19.13 21.67 0 8.46 13.9 25.47 424.7 ▇▁▁▁▁

Data 3

Introduction and data

  • Identify the source of the data.

    • Source: The National Registry of Exonerations
  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    • The Registry was founded in 2012 as a project of the Newkirk Center for Science and Society at the University of California Irvine, the University of Michigan Law School, and Michigan State University College of Law in conjunction with the Center on Wrongful Convictions at Northwestern University School of Law. Their research allowed them to collect data on every known exoneration in the United States since 1989. 
  • Write a brief description of the observations.

    • Each observation represents one exonerated individual. The dataset includes their name, age, race, sex, and various details about the crime for which they were exonerated, such as location, type of crime, year of conviction, whether or not DNA evidence was involved, and more.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
    • How does the presence of DNA evidence relate to how long exonerees were in jail?

    • Are there any differences in conviction and exoneration rates between races within different states?

    • How do an individual’s characteristics impact how long they were wrongfully convicted for?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.
    • The research topic is every known exoneration in the United States since 1989. The data give information about the sentence and the individual who was exonerated. We want to investigate how the individual’s sentence relates to how long the individual was in jail. We believe that the harsher the sentence and the less DNA evidence at the scene, the longer the individual was wrongfully convicted and in jail.
  • Identify the types of variables in your research question. Categorical? Quantitative?
    • Categorical

      • Race

      • Sex

      • State

      • Description

      • County

      • DNA being found

    • Quantitative

      • Year Convicted

      • Year Exonerated

      • Age
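The hypothesis about DNA evidence and time served can be approximated from the listed variables by subtracting the conviction year from the exoneration year and comparing groups. This is a minimal base R sketch on made-up toy rows. It assumes that a missing value in `DNA` means no DNA evidence was flagged, and it uses the gap between the two years as a stand-in for time in jail, which is only an approximation. The names `exon_toy`, `years_to_exoneration`, and `dna_flag` are hypothetical.

```r
# Toy rows using three column names from the exonerations file
# (all values are made up).
exon_toy <- data.frame(
  Convicted = c(1990, 2005, 1998),
  Exonerated = c(2010, 2012, 2001),
  DNA = c("DNA", NA, NA),
  stringsAsFactors = FALSE
)

# Years between conviction and exoneration (a proxy for time served).
exon_toy$years_to_exoneration <- exon_toy$Exonerated - exon_toy$Convicted

# Assumption: a non-missing DNA entry flags a case involving DNA evidence.
exon_toy$dna_flag <- !is.na(exon_toy$DNA)

# Median gap in years for cases with and without the DNA flag.
median_by_dna <- aggregate(years_to_exoneration ~ dna_flag,
                           data = exon_toy, FUN = median)

print(median_by_dna)
```

The derived gap is quantitative and the DNA flag is categorical, so a comparison like this fits the variable types listed above.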

Glimpse of data

# Load the exonerations data and summarize each column
us_exonerations <- 
  read_csv("data/us_exonerations.csv")
Rows: 3284 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (20): Last Name, First Name, Race, Sex, State, County, Tags, Worst Crime...
dbl  (3): Age, Convicted, Exonerated

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(us_exonerations)
Data summary
Name us_exonerations
Number of rows 3284
Number of columns 23
_______________________
Column type frequency:
character 20
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Last Name 0 1.00 2 18 0 2034 0
First Name 0 1.00 2 18 0 1305 0
Race 0 1.00 5 22 0 9 0
Sex 0 1.00 4 6 0 2 0
State 0 1.00 4 20 0 84 0
County 60 0.98 3 17 0 568 0
Tags 167 0.95 1 28 0 443 0
Worst Crime Display 0 1.00 5 29 0 45 0
List Add’l Crimes Recode 2002 0.39 4 132 0 248 0
Sentence 0 1.00 2 45 0 480 0
DNA 2710 0.17 3 3 0 1 0
* 3113 0.05 1 1 0 1 0
FC 2883 0.12 2 2 0 1 0
MWID 2392 0.27 4 4 0 1 0
F/MFE 2511 0.24 5 5 0 1 0
P/FA 1205 0.63 4 4 0 1 0
OM 1346 0.59 2 2 0 1 0
ILD 2405 0.27 3 3 0 1 0
Posting Date 0 1.00 8 10 0 1230 0
OM Tags 1347 0.59 2 35 0 134 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Age 26 0.99 28.43 10.24 11 20 26 34 83 ▇▆▂▁▁
Convicted 0 1.00 1998.64 10.97 1956 1990 1998 2007 2021 ▁▁▇▇▅
Exonerated 0 1.00 2010.45 8.96 1989 2003 2013 2018 2023 ▂▃▅▇▇