library(tidyverse)
library(skimr)
Race and U.S. Exonerations
Proposal
Data 1
Introduction and data
Identify the source of the data.
- Source: NYC Open Data
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- This dataset is updated every day and is collected by the Department of Health and Mental Hygiene.
Write a brief description of the observations.
- Each row represents a date of interest which is separated into three types: date of diagnosis, date of hospital admission, and date of death. This dataset represents citywide and borough-specific daily counts of COVID-19 confirmed cases and COVID-related hospitalizations and confirmed and probable deaths among New York City residents. Columns include number of cases on date of interest, hospitalized count, death count in different boroughs, etc.
Research question
- A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
How do the citywide numbers of cases, deaths, and hospitalizations change over the dates of interest?
How does the number of deaths in each borough change over the dates of interest?
- A description of the research topic along with a concise statement of your hypotheses on this topic.
- The data show the number of cases, deaths, and hospitalizations on each date citywide. We expect substantial fluctuations, since COVID-19 had distinct periods of major outbreaks.
- Identify the types of variables in your research question. Categorical? Quantitative?
Date: quantitative (stored as text in the raw file, so it must be parsed as a date before plotting)
Deaths: quantitative
Hospitalizations: quantitative
Cases: quantitative
Borough: categorical
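Since `date_of_interest` is read in as a character column, a trend over time needs it converted to a date first. The following is a minimal sketch rather than the final analysis. It assumes the dates are written as month/day/year (an assumption to check against the file) and uses a small made-up tibble, `covid_toy`, in place of the real data:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Toy stand-in for the COVID data; the counts are invented for illustration.
covid_toy <- tibble(
  date_of_interest = c("02/29/2020", "03/01/2020", "03/02/2020"),
  CASE_COUNT = c(1, 0, 0),
  HOSPITALIZED_COUNT = c(1, 1, 2),
  DEATH_COUNT = c(0, 0, 0)
)

# Parse the date (assumed month/day/year) and stack the three
# citywide counts into one column so they can share a plot.
covid_long <- covid_toy %>%
  mutate(date_of_interest = as.Date(date_of_interest, format = "%m/%d/%Y")) %>%
  pivot_longer(
    cols = c(CASE_COUNT, HOSPITALIZED_COUNT, DEATH_COUNT),
    names_to = "measure",
    values_to = "count"
  )

# One line per measure over time.
trend_plot <- ggplot(covid_long, aes(x = date_of_interest, y = count, color = measure)) +
  geom_line() +
  labs(x = "Date of interest", y = "Daily count", color = "Measure")
```

With the real data, the same steps would apply to `COVID` after it is read in. Because case counts are much larger than death counts, separate panels or a log scale may be easier to read.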
Glimpse of data
COVID <- read_csv("data/COVID-19_Daily_Counts_of_Cases__Hospitalizations__and_Deaths.csv")
Rows: 1106 Columns: 67
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): date_of_interest
dbl (66): CASE_COUNT, PROBABLE_CASE_COUNT, HOSPITALIZED_COUNT, DEATH_COUNT, ...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(COVID)
| Name | COVID |
| Number of rows | 1106 |
| Number of columns | 67 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 66 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| date_of_interest | 0 | 1 | 10 | 10 | 0 | 1106 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| CASE_COUNT | 0 | 1 | 2456.34 | 4984.53 | 0 | 601.00 | 1473.5 | 2793.75 | 55008 | ▇▁▁▁▁ |
| PROBABLE_CASE_COUNT | 0 | 1 | 481.01 | 625.08 | 0 | 96.00 | 357.5 | 652.00 | 5882 | ▇▁▁▁▁ |
| HOSPITALIZED_COUNT | 0 | 1 | 170.76 | 243.71 | 0 | 47.00 | 100.5 | 181.00 | 1840 | ▇▁▁▁▁ |
| DEATH_COUNT | 0 | 1 | 34.89 | 75.76 | 0 | 6.00 | 13.0 | 30.00 | 598 | ▇▁▁▁▁ |
| PROBABLE_DEATH_COUNT | 0 | 1 | 5.80 | 24.30 | 0 | 0.00 | 1.0 | 2.00 | 240 | ▇▁▁▁▁ |
| CASE_COUNT_7DAY_AVG | 0 | 1 | 2455.27 | 4577.64 | 0 | 613.50 | 1549.5 | 2836.50 | 39498 | ▇▁▁▁▁ |
| ALL_CASE_COUNT_7DAY_AVG | 0 | 1 | 2935.97 | 5121.63 | 0 | 765.25 | 1932.5 | 3547.75 | 43954 | ▇▁▁▁▁ |
| HOSP_COUNT_7DAY_AVG | 0 | 1 | 170.68 | 239.41 | 0 | 48.00 | 104.0 | 180.00 | 1662 | ▇▁▁▁▁ |
| DEATH_COUNT_7DAY_AVG | 0 | 1 | 34.88 | 74.87 | 0 | 7.00 | 12.0 | 29.00 | 566 | ▇▁▁▁▁ |
| ALL_DEATH_COUNT_7DAY_AVG | 0 | 1 | 40.67 | 97.87 | 0 | 8.00 | 13.0 | 31.75 | 775 | ▇▁▁▁▁ |
| BX_CASE_COUNT | 0 | 1 | 405.56 | 930.34 | 0 | 80.00 | 201.5 | 427.00 | 10560 | ▇▁▁▁▁ |
| BX_PROBABLE_CASE_COUNT | 0 | 1 | 94.84 | 143.35 | 0 | 14.00 | 63.0 | 127.75 | 1575 | ▇▁▁▁▁ |
| BX_HOSPITALIZED_COUNT | 0 | 1 | 36.77 | 56.35 | 0 | 9.00 | 20.0 | 39.00 | 390 | ▇▁▁▁▁ |
| BX_DEATH_COUNT | 0 | 1 | 6.58 | 15.87 | 0 | 1.00 | 2.0 | 5.00 | 132 | ▇▁▁▁▁ |
| BX_PROBABLE_DEATH_COUNT | 0 | 1 | 1.12 | 5.02 | 0 | 0.00 | 0.0 | 0.00 | 46 | ▇▁▁▁▁ |
| BX_CASE_COUNT_7DAY_AVG | 0 | 1 | 405.39 | 835.87 | 0 | 82.00 | 229.5 | 450.50 | 7480 | ▇▁▁▁▁ |
| BX_PROBABLE_CASE_COUNT_7DAY_AVG | 0 | 1 | 94.79 | 131.76 | 0 | 15.25 | 70.0 | 132.75 | 1094 | ▇▁▁▁▁ |
| BX_ALL_CASE_COUNT_7DAY_AVG | 0 | 1 | 500.19 | 959.35 | 0 | 107.00 | 302.0 | 584.25 | 8574 | ▇▁▁▁▁ |
| BX_HOSPITALIZED_COUNT_7DAY_AVG | 0 | 1 | 36.75 | 54.94 | 0 | 9.00 | 21.0 | 37.00 | 358 | ▇▁▁▁▁ |
| BX_DEATH_COUNT_7DAY_AVG | 0 | 1 | 6.59 | 15.55 | 0 | 1.00 | 2.0 | 5.00 | 117 | ▇▁▁▁▁ |
| BX_ALL_DEATH_COUNT_7DAY_AVG | 0 | 1 | 7.71 | 20.28 | 0 | 1.00 | 2.0 | 5.00 | 158 | ▇▁▁▁▁ |
| BK_CASE_COUNT | 0 | 1 | 740.56 | 1470.84 | 0 | 203.25 | 454.0 | 833.00 | 16667 | ▇▁▁▁▁ |
| BK_PROBABLE_CASE_COUNT | 0 | 1 | 131.44 | 174.73 | 0 | 29.00 | 99.0 | 170.50 | 1906 | ▇▁▁▁▁ |
| BK_HOSPITALIZED_COUNT | 0 | 1 | 51.70 | 71.67 | 0 | 16.00 | 30.0 | 53.00 | 555 | ▇▁▁▁▁ |
| BK_DEATH_COUNT | 0 | 1 | 10.88 | 23.38 | 0 | 2.00 | 4.0 | 9.75 | 201 | ▇▁▁▁▁ |
| BK_PROBABLE_DEATH_COUNT | 0 | 1 | 1.95 | 8.42 | 0 | 0.00 | 0.0 | 1.00 | 92 | ▇▁▁▁▁ |
| BK_CASE_COUNT_7DAY_AVG | 0 | 1 | 740.24 | 1357.21 | 0 | 213.75 | 465.5 | 842.00 | 11587 | ▇▁▁▁▁ |
| BK_PROBABLE_CASE_COUNT_7DAY_AVG | 0 | 1 | 131.36 | 162.22 | 0 | 29.00 | 104.0 | 172.00 | 1213 | ▇▁▁▁▁ |
| BK_ALL_CASE_COUNT_7DAY_AVG | 0 | 1 | 871.59 | 1508.30 | 0 | 251.00 | 570.5 | 1027.50 | 12787 | ▇▁▁▁▁ |
| BK_HOSPITALIZED_COUNT_7DAY_AVG | 0 | 1 | 51.68 | 70.06 | 0 | 17.00 | 31.0 | 52.00 | 490 | ▇▁▁▁▁ |
| BK_DEATH_COUNT_7DAY_AVG | 0 | 1 | 10.89 | 22.94 | 0 | 2.00 | 4.0 | 9.00 | 178 | ▇▁▁▁▁ |
| BK_ALL_DEATH_COUNT_7DAY_AVG | 0 | 1 | 12.84 | 30.75 | 0 | 3.00 | 4.0 | 10.00 | 252 | ▇▁▁▁▁ |
| MN_CASE_COUNT | 0 | 1 | 450.75 | 903.73 | 0 | 104.00 | 275.5 | 485.75 | 9114 | ▇▁▁▁▁ |
| MN_PROBABLE_CASE_COUNT | 0 | 1 | 88.84 | 113.01 | 0 | 20.00 | 67.0 | 121.75 | 972 | ▇▁▁▁▁ |
| MN_HOSPITALIZED_COUNT | 0 | 1 | 25.89 | 35.63 | 0 | 7.00 | 16.0 | 29.75 | 273 | ▇▁▁▁▁ |
| MN_DEATH_COUNT | 0 | 1 | 4.78 | 9.88 | 0 | 1.00 | 2.0 | 5.00 | 92 | ▇▁▁▁▁ |
| MN_PROBABLE_DEATH_COUNT | 0 | 1 | 0.80 | 3.19 | 0 | 0.00 | 0.0 | 0.00 | 33 | ▇▁▁▁▁ |
| MN_CASE_COUNT_7DAY_AVG | 0 | 1 | 450.54 | 824.06 | 0 | 119.00 | 291.5 | 470.75 | 6394 | ▇▁▁▁▁ |
| MN_PROBABLE_CASE_COUNT_7DAY_AVG | 0 | 1 | 88.77 | 106.61 | 0 | 19.50 | 73.0 | 126.50 | 766 | ▇▁▁▁▁ |
| MN_ALL_CASE_COUNT_7DAY_AVG | 0 | 1 | 539.30 | 924.27 | 0 | 147.25 | 365.0 | 595.00 | 7161 | ▇▁▁▁▁ |
| MN_HOSPITALIZED_COUNT_7DAY_AVG | 0 | 1 | 25.88 | 34.65 | 0 | 7.00 | 17.0 | 30.00 | 228 | ▇▁▁▁▁ |
| MN_DEATH_COUNT_7DAY_AVG | 0 | 1 | 4.77 | 9.58 | 0 | 1.00 | 2.0 | 4.00 | 73 | ▇▁▁▁▁ |
| MN_ALL_DEATH_COUNT_7DAY_AVG | 0 | 1 | 5.56 | 12.49 | 0 | 1.00 | 2.0 | 4.00 | 100 | ▇▁▁▁▁ |
| QN_CASE_COUNT | 0 | 1 | 685.39 | 1404.20 | 0 | 145.00 | 388.0 | 783.00 | 15225 | ▇▁▁▁▁ |
| QN_PROBABLE_CASE_COUNT | 0 | 1 | 133.17 | 168.74 | 0 | 24.00 | 96.0 | 190.00 | 1609 | ▇▁▁▁▁ |
| QN_HOSPITALIZED_COUNT | 0 | 1 | 47.93 | 75.40 | 0 | 13.00 | 26.0 | 51.00 | 609 | ▇▁▁▁▁ |
| QN_DEATH_COUNT | 0 | 1 | 10.45 | 23.94 | 0 | 1.00 | 4.0 | 9.00 | 202 | ▇▁▁▁▁ |
| QN_PROBABLE_DEATH_COUNT | 0 | 1 | 1.67 | 7.20 | 0 | 0.00 | 0.0 | 1.00 | 68 | ▇▁▁▁▁ |
| QN_CASE_COUNT_7DAY_AVG | 0 | 1 | 685.11 | 1298.45 | 0 | 149.00 | 408.5 | 814.75 | 11551 | ▇▁▁▁▁ |
| QN_PROBABLE_CASE_COUNT_7DAY_AVG | 0 | 1 | 133.08 | 158.31 | 0 | 23.00 | 101.0 | 191.00 | 1219 | ▇▁▁▁▁ |
| QN_ALL_CASE_COUNT_7DAY_AVG | 0 | 1 | 818.19 | 1442.99 | 0 | 185.00 | 527.5 | 1036.00 | 12689 | ▇▁▁▁▁ |
| QN_HOSPITALIZED_COUNT_7DAY_AVG | 0 | 1 | 47.91 | 73.99 | 0 | 12.25 | 28.0 | 49.00 | 562 | ▇▁▁▁▁ |
| QN_DEATH_COUNT_7DAY_AVG | 0 | 1 | 10.44 | 23.55 | 0 | 2.00 | 4.0 | 9.00 | 177 | ▇▁▁▁▁ |
| QN_ALL_DEATH_COUNT_7DAY_AVG | 0 | 1 | 12.11 | 30.31 | 0 | 2.00 | 4.0 | 9.00 | 240 | ▇▁▁▁▁ |
| SI_CASE_COUNT | 0 | 1 | 173.25 | 326.49 | 0 | 40.25 | 108.0 | 192.75 | 3720 | ▇▁▁▁▁ |
| SI_PROBABLE_CASE_COUNT | 0 | 1 | 32.66 | 35.86 | 0 | 6.00 | 25.5 | 48.00 | 316 | ▇▁▁▁▁ |
| SI_HOSPITALIZED_COUNT | 0 | 1 | 10.52 | 11.59 | 0 | 3.00 | 7.0 | 14.00 | 83 | ▇▂▁▁▁ |
| SI_DEATH_COUNT | 0 | 1 | 2.20 | 3.80 | 0 | 0.00 | 1.0 | 3.00 | 34 | ▇▁▁▁▁ |
| SI_PROBABLE_DEATH_COUNT | 0 | 1 | 0.25 | 0.96 | 0 | 0.00 | 0.0 | 0.00 | 9 | ▇▁▁▁▁ |
| SI_PROBABLE_CASE_COUNT_7DAY_AVG | 0 | 1 | 32.63 | 33.33 | 0 | 6.00 | 27.0 | 48.00 | 233 | ▇▂▁▁▁ |
| SI_CASE_COUNT_7DAY_AVG | 0 | 1 | 173.18 | 299.80 | 0 | 40.25 | 111.5 | 193.00 | 2687 | ▇▁▁▁▁ |
| SI_ALL_CASE_COUNT_7DAY_AVG | 0 | 1 | 205.79 | 328.04 | 0 | 51.25 | 145.0 | 243.00 | 2907 | ▇▁▁▁▁ |
| SI_HOSPITALIZED_COUNT_7DAY_AVG | 0 | 1 | 10.51 | 11.06 | 0 | 4.00 | 8.0 | 13.00 | 72 | ▇▂▁▁▁ |
| SI_DEATH_COUNT_7DAY_AVG | 0 | 1 | 2.18 | 3.52 | 0 | 1.00 | 1.0 | 2.00 | 26 | ▇▁▁▁▁ |
| SI_ALL_DEATH_COUNT_7DAY_AVG | 0 | 1 | 2.44 | 4.32 | 0 | 1.00 | 1.0 | 2.00 | 34 | ▇▁▁▁▁ |
| INCOMPLETE | 0 | 1 | 385.95 | 4838.12 | 0 | 0.00 | 0.0 | 0.00 | 60980 | ▇▁▁▁▁ |
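
The summary above shows that borough-level deaths are spread across separate columns (`BX_DEATH_COUNT`, `BK_DEATH_COUNT`, and so on), so comparing boroughs needs the data in long form. Here is one possible sketch, using an invented tibble, `boro_toy`, with the same column naming. It keeps only the confirmed death columns and leaves out the probable counts and 7-day averages:

```r
library(dplyr)
library(tidyr)

# Toy stand-in with the same column naming as the real file; values are invented.
boro_toy <- tibble(
  date_of_interest = as.Date(c("2020-04-01", "2020-04-02")),
  BX_DEATH_COUNT = c(3, 5),
  BK_DEATH_COUNT = c(4, 6),
  MN_DEATH_COUNT = c(1, 2),
  QN_DEATH_COUNT = c(5, 5),
  SI_DEATH_COUNT = c(0, 1),
  BX_PROBABLE_DEATH_COUNT = c(9, 9)  # included to show it is filtered out
)

# Keep the date plus the five confirmed-death columns, then reshape to
# one row per date and borough. The two-letter prefix becomes `borough`.
borough_deaths <- boro_toy %>%
  select(date_of_interest, matches("^(BX|BK|MN|QN|SI)_DEATH_COUNT$")) %>%
  pivot_longer(
    cols = -date_of_interest,
    names_to = "borough",
    names_pattern = "^(..)_DEATH_COUNT$",
    values_to = "deaths"
  )
```

In this shape, `borough` can be mapped to color or facets in a plot of deaths over time.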
Data 2
Introduction and data
- Identify the source of the data.
The source of the data is the Department of Health and Mental Hygiene (DOHMH).
- State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
The original data curators collected raw data through measurements of air quality and composition. The data are then adjusted for weather and season and modeled based on environmental factors and nearby emission sources.
- Write a brief description of the observations.
The observations describe metrics for each NYC neighborhood, such as outdoor air pollutants, health burdens, and air toxics.
Research question
- A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
Which NYC neighborhoods have the highest average levels of fine particulates, and is this correlated with overall air quality?
Have the ozone levels in my neighborhood gone down or up over the last few years?
- A description of the research topic along with a concise statement of your hypotheses on this topic.
- This data shows air quality over time in NYC neighborhoods. We want to investigate how ozone and air quality have changed over time and whether particular substances in the air have an effect on overall air quality. We believe that a higher level of fine particulates means worse air quality and that ozone levels, in general, have gone down in the last few years.
- Identify the types of variables in your research question. Categorical? Quantitative?
- Categorical
Name
Place
Time Period
- Numerical
Date
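
To rank neighborhoods by fine particulate levels, the measurements can be averaged by place. This is a rough sketch on an invented tibble, `air_toy`, that borrows the column names `Name`, `Geo.Place.Name`, and `Data.Value` from the air quality file. The label `"Fine particles (PM 2.5)"` is an assumption about how the indicator is named and should be checked against the values actually present in `Name`:

```r
library(dplyr)

# Toy stand-in for Air_Qual; place names and values are invented.
air_toy <- tibble(
  Name = c("Fine particles (PM 2.5)", "Fine particles (PM 2.5)",
           "Fine particles (PM 2.5)", "Ozone (O3)"),
  Geo.Place.Name = c("Place A", "Place A", "Place B", "Place A"),
  Data.Value = c(10, 12, 8, 30)
)

# Keep only the fine-particulate rows (assumed label), average by place,
# and sort so the highest average comes first.
pm_by_place <- air_toy %>%
  filter(Name == "Fine particles (PM 2.5)") %>%
  group_by(Geo.Place.Name) %>%
  summarise(mean_pm = mean(Data.Value), .groups = "drop") %>%
  arrange(desc(mean_pm))
```

The same pattern, filtered to the ozone indicator and grouped by time period instead of place, would address the second research question.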
Glimpse of data
Air_Qual <- read.csv("data/Air_Quality.csv")
skimr::skim(Air_Qual)
| Name | Air_Qual |
| Number of rows | 16122 |
| Number of columns | 12 |
| _______________________ | |
| Column type frequency: | |
| character | 7 |
| logical | 1 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Name | 0 | 1 | 10 | 76 | 0 | 19 | 0 |
| Measure | 0 | 1 | 4 | 47 | 0 | 8 | 0 |
| Measure.Info | 0 | 1 | 3 | 21 | 0 | 8 | 0 |
| Geo.Type.Name | 0 | 1 | 2 | 8 | 0 | 5 | 0 |
| Geo.Place.Name | 0 | 1 | 5 | 46 | 0 | 114 | 0 |
| Time.Period | 0 | 1 | 4 | 19 | 0 | 45 | 0 |
| Start_Date | 0 | 1 | 10 | 10 | 0 | 36 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| Message | 16122 | 0 | NaN | : |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Unique.ID | 0 | 1 | 339480.96 | 194099.81 | 130355 | 172183.25 | 221882.5 | 547749.75 | 671122.0 | ▇▂▁▂▃ |
| Indicator.ID | 0 | 1 | 427.13 | 109.66 | 365 | 365.00 | 375.0 | 386.00 | 661.0 | ▇▁▁▁▂ |
| Geo.Join.ID | 0 | 1 | 613339.41 | 7916715.24 | 1 | 202.00 | 303.0 | 404.00 | 105106107.0 | ▇▁▁▁▁ |
| Data.Value | 0 | 1 | 19.13 | 21.67 | 0 | 8.46 | 13.9 | 25.47 | 424.7 | ▇▁▁▁▁ |
Data 3
Introduction and data
Identify the source of the data.
- Source: The National Registry of Exonerations
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- The Registry was founded in 2012 as a project of the Newkirk Center for Science and Society at the University of California Irvine, the University of Michigan Law School, and Michigan State University College of Law in conjunction with the Center on Wrongful Convictions at Northwestern University School of Law. Their research allowed them to collect data on every known exoneration in the United States since 1989.
Write a brief description of the observations.
- Each observation represents one exonerated individual. The dataset includes their name, age, race, sex, and various details about the crime they were exonerated for, such as location, type of crime, years of conviction, whether or not DNA was used and more.
Research question
- A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
How does the presence of DNA evidence relate to how long exonerees were in jail?
Are there any differences in conviction and exoneration rates between races within different states?
How do an individual’s characteristics affect how long they were wrongfully imprisoned?
- A description of the research topic along with a concise statement of your hypotheses on this topic.
- The research topic is every known exoneration in the United States since 1989. The data give information about the sentence and the individual who was exonerated. We want to investigate how the individual’s sentence relates to how long the individual was in jail. We believe that the harsher the sentence and the less DNA evidence at the scene, the longer the individual was wrongfully imprisoned.
- Identify the types of variables in your research question. Categorical? Quantitative?
Categorical
Race
Sex
State
Description
County
DNA being found
Quantitative
Year Convicted
Year Exonerated
Age
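
As a first pass at the DNA question, the gap between the conviction and exoneration years can be compared for cases with and without DNA evidence. The sketch below uses an invented tibble, `exon_toy`. It assumes that a missing `DNA` value means no DNA evidence was involved, and it uses `Exonerated - Convicted` as a stand-in for time served, which may overstate actual time in jail:

```r
library(dplyr)

# Toy stand-in for us_exonerations; all values are invented.
exon_toy <- tibble(
  Convicted = c(1990, 2000, 2005, 2010),
  Exonerated = c(2010, 2005, 2015, 2012),
  DNA = c("DNA", NA, "DNA", NA)
)

# Years from conviction to exoneration, summarised by whether the
# DNA flag is present (assumption: NA means no DNA evidence).
years_by_dna <- exon_toy %>%
  mutate(
    years_to_exoneration = Exonerated - Convicted,
    has_dna = !is.na(DNA)
  ) %>%
  group_by(has_dna) %>%
  summarise(
    median_years = median(years_to_exoneration),
    n = n(),
    .groups = "drop"
  )
```

The median is used here because the distribution of years is likely to be skewed. Adding `Race`, `Sex`, or `State` to `group_by()` would extend the comparison to the other research questions.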
Glimpse of data
us_exonerations <- read_csv("data/us_exonerations.csv")
Rows: 3284 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (20): Last Name, First Name, Race, Sex, State, County, Tags, Worst Crime...
dbl (3): Age, Convicted, Exonerated
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(us_exonerations)
| Name | us_exonerations |
| Number of rows | 3284 |
| Number of columns | 23 |
| _______________________ | |
| Column type frequency: | |
| character | 20 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Last Name | 0 | 1.00 | 2 | 18 | 0 | 2034 | 0 |
| First Name | 0 | 1.00 | 2 | 18 | 0 | 1305 | 0 |
| Race | 0 | 1.00 | 5 | 22 | 0 | 9 | 0 |
| Sex | 0 | 1.00 | 4 | 6 | 0 | 2 | 0 |
| State | 0 | 1.00 | 4 | 20 | 0 | 84 | 0 |
| County | 60 | 0.98 | 3 | 17 | 0 | 568 | 0 |
| Tags | 167 | 0.95 | 1 | 28 | 0 | 443 | 0 |
| Worst Crime Display | 0 | 1.00 | 5 | 29 | 0 | 45 | 0 |
| List Add’l Crimes Recode | 2002 | 0.39 | 4 | 132 | 0 | 248 | 0 |
| Sentence | 0 | 1.00 | 2 | 45 | 0 | 480 | 0 |
| DNA | 2710 | 0.17 | 3 | 3 | 0 | 1 | 0 |
| * | 3113 | 0.05 | 1 | 1 | 0 | 1 | 0 |
| FC | 2883 | 0.12 | 2 | 2 | 0 | 1 | 0 |
| MWID | 2392 | 0.27 | 4 | 4 | 0 | 1 | 0 |
| F/MFE | 2511 | 0.24 | 5 | 5 | 0 | 1 | 0 |
| P/FA | 1205 | 0.63 | 4 | 4 | 0 | 1 | 0 |
| OM | 1346 | 0.59 | 2 | 2 | 0 | 1 | 0 |
| ILD | 2405 | 0.27 | 3 | 3 | 0 | 1 | 0 |
| Posting Date | 0 | 1.00 | 8 | 10 | 0 | 1230 | 0 |
| OM Tags | 1347 | 0.59 | 2 | 35 | 0 | 134 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Age | 26 | 0.99 | 28.43 | 10.24 | 11 | 20 | 26 | 34 | 83 | ▇▆▂▁▁ |
| Convicted | 0 | 1.00 | 1998.64 | 10.97 | 1956 | 1990 | 1998 | 2007 | 2021 | ▁▁▇▇▅ |
| Exonerated | 0 | 1.00 | 2010.45 | 8.96 | 1989 | 2003 | 2013 | 2018 | 2023 | ▂▃▅▇▇ |