library(tidyverse)
library(skimr)
Project title
Proposal
Data 1
Introduction and data
Identify the source of the data.
- Pew Research Center
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- The data were collected through a telephone survey (cellphone and landline) of US adults, conducted January 25 to February 8, 2021.
Write a brief description of the observations.
From 2012 to 2021, YouTube and Facebook continued to dominate the online landscape. With the exception of YouTube and Reddit, most platforms show little growth since 2019.
Adults under 30 stand out for their use of Instagram, Snapchat and TikTok.
Age gaps in Snapchat and Instagram use are particularly wide; the gap is smaller for Facebook.
Address ethical concerns about the data, if any.
- The dataset only includes three races/ethnicities: White Non-Hispanic, Black Non-Hispanic, and Hispanic.
Research question
A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
- Is there an association between demographic group (age, gender, race, income, academic degree, location) and which online platforms people use?
Statement on why this question is important.
- With the rise of social media, consistent patterns of platform use have emerged, and these may be linked to certain factors such as demographic group. It is important to determine what specifically drives these stark differences.
A description of the research topic along with a concise statement of your hypotheses on this topic.
We want to examine whether demographic group affects which online platforms people prefer, and to learn which specific groups tend to use which platforms.
Concise statement of hypotheses: Of the demographic variables, age shows the largest differences in preference for certain online platforms over others.
Identify the types of variables in your research question. Categorical? Quantitative?
Age (categorical): 18-24, 25-29, 18-29, 30-49, 50-64, 65+
Gender (categorical): Man, woman, in some other way
Race/ethnicity (categorical): White non-Hispanic, Black non-Hispanic, Hispanic.
Highest level of school completed (categorical): HS graduate or less, some college, college degree, post grad degree.
Region (categorical): Northeast, Midwest, South, West
Frequency of internet usage (categorical): Less often, not internet user, don’t know / refuse
Online platforms (categorical): Snapchat, Instagram, YouTube, TikTok, Twitter, Facebook, Pinterest, LinkedIn, WhatsApp, Reddit, Nextdoor.
Number of US adults in each demographic group variable (quantitative)
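The age groups listed above could be derived from the numeric `age` column roughly as follows. This is a minimal sketch on made-up values: it assumes that `age` holds age in years and that 99 is a "don't know / refused" code, which should be checked against the Pew codebook. The names `toy` and `age_group` are hypothetical.

```r
library(dplyr)

# Made-up ages for illustration; 99 is ASSUMED to be a "don't know / refused" code
toy <- tibble(age = c(18, 24, 29, 30, 49, 50, 64, 65, 80, 99))

toy <- toy %>%
  mutate(
    age = na_if(age, 99),               # treat the assumed sentinel value as missing
    age_group = cut(
      age,
      breaks = c(18, 30, 50, 65, Inf),  # left-closed bins: [18,30), [30,50), [50,65), [65,Inf)
      labels = c("18-29", "30-49", "50-64", "65+"),
      right  = FALSE
    )
  )

# Number of respondents in each age group (NA = age not reported)
count(toy, age_group)
```

Binning age this way turns it into the categorical variable described in the list, while keeping unreported ages out of every group.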
Glimpse of data
# Read the Pew core trends survey data
coretrends <- read_csv("data/Social_Media_Use_in_2021/coretrendssurvey.csv")
Rows: 1502 Columns: 89
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): int_date, usr, bbsmart3f_1_other
dbl (85): respid, sample, comp, lang, state, density, qs1, sex, eminuse, int...
lgl (1): racem4
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(coretrends)
Name | coretrends |
Number of rows | 1502 |
Number of columns | 89 |
_______________________ | |
Column type frequency: | |
character | 3 |
logical | 1 |
numeric | 85 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
int_date | 0 | 1.00 | 8 | 9 | 0 | 15 | 0 |
usr | 98 | 0.93 | 1 | 1 | 0 | 3 | 0 |
bbsmart3f_1_other | 1421 | 0.05 | 5 | 104 | 0 | 72 | 0 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
racem4 | 1502 | 0 | NaN | : |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
respid | 0 | 1.00 | 134280.02 | 64625.39 | 798.00 | 106243.50 | 139521.00 | 172600.75 | 311426.00 | ▃▃▇▂▁ |
sample | 0 | 1.00 | 1.80 | 0.40 | 1.00 | 2.00 | 2.00 | 2.00 | 2.00 | ▂▁▁▁▇ |
comp | 0 | 1.00 | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | ▁▁▇▁▁ |
lang | 0 | 1.00 | 1.05 | 0.22 | 1.00 | 1.00 | 1.00 | 1.00 | 2.00 | ▇▁▁▁▁ |
state | 0 | 1.00 | 28.75 | 16.39 | 1.00 | 12.00 | 29.00 | 42.00 | 56.00 | ▇▅▅▇▇ |
density | 0 | 1.00 | 2.92 | 1.41 | 1.00 | 2.00 | 3.00 | 4.00 | 5.00 | ▇▇▇▇▇ |
qs1 | 300 | 0.80 | 2.00 | 0.00 | 2.00 | 2.00 | 2.00 | 2.00 | 2.00 | ▁▁▇▁▁ |
sex | 0 | 1.00 | 1.42 | 0.49 | 1.00 | 1.00 | 1.00 | 2.00 | 2.00 | ▇▁▁▁▆ |
eminuse | 0 | 1.00 | 1.08 | 0.33 | 1.00 | 1.00 | 1.00 | 1.00 | 9.00 | ▇▁▁▁▁ |
intmob | 0 | 1.00 | 1.12 | 0.33 | 1.00 | 1.00 | 1.00 | 1.00 | 2.00 | ▇▁▁▁▁ |
intfreq | 89 | 0.94 | 1.97 | 1.00 | 1.00 | 1.00 | 2.00 | 2.00 | 9.00 | ▇▁▁▁▁ |
snsint2 | 0 | 1.00 | 1.31 | 0.46 | 1.00 | 1.00 | 1.00 | 2.00 | 2.00 | ▇▁▁▁▃ |
home4nw | 0 | 1.00 | 1.16 | 0.50 | 1.00 | 1.00 | 1.00 | 1.00 | 8.00 | ▇▁▁▁▁ |
bbhome1 | 214 | 0.86 | 2.22 | 1.08 | 1.00 | 2.00 | 2.00 | 2.00 | 8.00 | ▇▁▁▁▁ |
bbhome2 | 1482 | 0.01 | 1.10 | 0.31 | 1.00 | 1.00 | 1.00 | 1.00 | 2.00 | ▇▁▁▁▁ |
device1a | 1202 | 0.20 | 1.11 | 0.32 | 1.00 | 1.00 | 1.00 | 1.00 | 2.00 | ▇▁▁▁▁ |
smart2 | 34 | 0.98 | 1.18 | 0.75 | 1.00 | 1.00 | 1.00 | 1.00 | 9.00 | ▇▁▁▁▁ |
bbsmart2 | 1217 | 0.19 | 1.94 | 1.13 | 1.00 | 2.00 | 2.00 | 2.00 | 9.00 | ▇▁▁▁▁ |
bbsmart3a | 1217 | 0.19 | 2.15 | 1.99 | 1.00 | 1.00 | 2.00 | 2.00 | 9.00 | ▇▁▁▁▁ |
bbsmart3b | 1217 | 0.19 | 1.87 | 1.32 | 1.00 | 1.00 | 2.00 | 2.00 | 9.00 | ▇▁▁▁▁ |
bbsmart3c | 1327 | 0.12 | 1.37 | 0.89 | 1.00 | 1.00 | 1.00 | 2.00 | 9.00 | ▇▁▁▁▁ |
bbsmart3d | 1217 | 0.19 | 1.73 | 1.17 | 1.00 | 1.00 | 2.00 | 2.00 | 9.00 | ▇▁▁▁▁ |
bbsmart3e | 1217 | 0.19 | 2.64 | 2.28 | 1.00 | 2.00 | 2.00 | 2.00 | 9.00 | ▇▁▁▁▂ |
bbsmart3f | 1217 | 0.19 | 1.94 | 1.32 | 1.00 | 1.00 | 2.00 | 2.00 | 9.00 | ▇▁▁▁▁ |
bbsmart4 | 1309 | 0.13 | 3.72 | 2.10 | 1.00 | 2.00 | 3.00 | 6.00 | 9.00 | ▆▇▂▅▂ |
cable1 | 0 | 1.00 | 1.40 | 0.59 | 1.00 | 1.00 | 1.00 | 2.00 | 8.00 | ▇▁▁▁▁ |
cable2 | 919 | 0.39 | 1.41 | 0.62 | 1.00 | 1.00 | 1.00 | 2.00 | 8.00 | ▇▁▁▁▁ |
cable3a | 919 | 0.39 | 1.58 | 0.75 | 1.00 | 1.00 | 2.00 | 2.00 | 9.00 | ▇▁▁▁▁ |
cable3b | 919 | 0.39 | 1.35 | 0.78 | 1.00 | 1.00 | 1.00 | 2.00 | 9.00 | ▇▁▁▁▁ |
cable3c | 919 | 0.39 | 1.29 | 0.53 | 1.00 | 1.00 | 1.00 | 2.00 | 8.00 | ▇▁▁▁▁ |
web1a | 0 | 1.00 | 1.78 | 0.49 | 1.00 | 2.00 | 2.00 | 2.00 | 9.00 | ▇▁▁▁▁ |
web1b | 0 | 1.00 | 1.66 | 0.56 | 1.00 | 1.00 | 2.00 | 2.00 | 9.00 | ▇▁▁▁▁ |
web1c | 0 | 1.00 | 1.35 | 0.57 | 1.00 | 1.00 | 1.00 | 2.00 | 9.00 | ▇▁▁▁▁ |
web1d | 0 | 1.00 | 1.80 | 0.44 | 1.00 | 2.00 | 2.00 | 2.00 | 9.00 | ▇▁▁▁▁ |
web1e | 0 | 1.00 | 1.20 | 0.45 | 1.00 | 1.00 | 1.00 | 1.00 | 9.00 | ▇▁▁▁▁ |
web1f | 0 | 1.00 | 1.82 | 0.63 | 1.00 | 2.00 | 2.00 | 2.00 | 9.00 | ▇▁▁▁▁ |
web1g | 0 | 1.00 | 1.72 | 0.54 | 1.00 | 1.00 | 2.00 | 2.00 | 9.00 | ▇▁▁▁▁ |
web1h | 0 | 1.00 | 1.71 | 0.64 | 1.00 | 1.00 | 2.00 | 2.00 | 9.00 | ▇▁▁▁▁ |
web1i | 0 | 1.00 | 1.85 | 0.53 | 1.00 | 2.00 | 2.00 | 2.00 | 9.00 | ▇▁▁▁▁ |
web1j | 0 | 1.00 | 1.84 | 0.41 | 1.00 | 2.00 | 2.00 | 2.00 | 9.00 | ▇▁▁▁▁ |
web1k | 0 | 1.00 | 1.90 | 0.63 | 1.00 | 2.00 | 2.00 | 2.00 | 8.00 | ▇▁▁▁▁ |
sns2a | 1156 | 0.23 | 2.63 | 1.42 | 1.00 | 1.00 | 3.00 | 3.75 | 8.00 | ▇▅▅▁▁ |
sns2b | 972 | 0.35 | 2.43 | 1.36 | 1.00 | 1.00 | 2.00 | 3.00 | 5.00 | ▇▅▅▂▃ |
sns2c | 514 | 0.66 | 2.08 | 1.30 | 1.00 | 1.00 | 2.00 | 3.00 | 9.00 | ▇▃▁▁▁ |
sns2d | 1195 | 0.20 | 2.41 | 1.55 | 1.00 | 1.00 | 2.00 | 3.00 | 9.00 | ▇▅▂▁▁ |
sns2e | 299 | 0.80 | 2.49 | 1.31 | 1.00 | 1.00 | 3.00 | 3.00 | 9.00 | ▇▇▂▁▁ |
paya | 34 | 0.98 | 1.90 | 0.40 | 1.00 | 2.00 | 2.00 | 2.00 | 9.00 | ▇▁▁▁▁ |
payb | 285 | 0.81 | 1.90 | 0.42 | 1.00 | 2.00 | 2.00 | 2.00 | 9.00 | ▇▁▁▁▁ |
payc | 583 | 0.61 | 1.91 | 0.54 | 1.00 | 2.00 | 2.00 | 2.00 | 9.00 | ▇▁▁▁▁ |
prob | 89 | 0.94 | 2.86 | 0.93 | 1.00 | 2.00 | 3.00 | 3.00 | 9.00 | ▃▇▁▁▁ |
coviddisa | 0 | 1.00 | 1.98 | 1.36 | 1.00 | 1.00 | 2.00 | 2.00 | 9.00 | ▇▂▁▁▁ |
coviddisb | 0 | 1.00 | 1.90 | 1.26 | 1.00 | 1.00 | 2.00 | 2.00 | 9.00 | ▇▂▁▁▁ |
coviddisc | 0 | 1.00 | 1.78 | 1.67 | 1.00 | 1.00 | 1.00 | 2.00 | 9.00 | ▇▁▁▁▁ |
coviddisd | 0 | 1.00 | 1.67 | 1.72 | 1.00 | 1.00 | 1.00 | 1.00 | 9.00 | ▇▁▁▁▁ |
coviddise | 0 | 1.00 | 1.87 | 1.39 | 1.00 | 1.00 | 2.00 | 2.00 | 9.00 | ▇▂▁▁▁ |
device1b | 0 | 1.00 | 1.48 | 0.63 | 1.00 | 1.00 | 1.00 | 2.00 | 9.00 | ▇▁▁▁▁ |
device1c | 0 | 1.00 | 1.19 | 0.43 | 1.00 | 1.00 | 1.00 | 1.00 | 8.00 | ▇▁▁▁▁ |
device1d | 0 | 1.00 | 1.66 | 0.64 | 1.00 | 1.00 | 2.00 | 2.00 | 9.00 | ▇▁▁▁▁ |
books1 | 0 | 1.00 | 16.92 | 26.48 | 0.00 | 1.25 | 5.00 | 20.00 | 99.00 | ▇▁▁▁▁ |
books2a | 301 | 0.80 | 1.14 | 0.45 | 1.00 | 1.00 | 1.00 | 1.00 | 9.00 | ▇▁▁▁▁ |
books2b | 301 | 0.80 | 1.71 | 0.52 | 1.00 | 1.00 | 2.00 | 2.00 | 8.00 | ▇▁▁▁▁ |
books2c | 301 | 0.80 | 1.64 | 0.66 | 1.00 | 1.00 | 2.00 | 2.00 | 8.00 | ▇▁▁▁▁ |
gender | 0 | 1.00 | 1.49 | 0.82 | 1.00 | 1.00 | 1.00 | 2.00 | 9.00 | ▇▁▁▁▁ |
age | 0 | 1.00 | 53.57 | 20.37 | 18.00 | 37.00 | 55.00 | 68.00 | 99.00 | ▆▆▇▆▂ |
marital | 0 | 1.00 | 2.88 | 2.21 | 1.00 | 1.00 | 2.00 | 5.00 | 9.00 | ▇▂▁▃▁ |
par | 0 | 1.00 | 1.84 | 0.83 | 1.00 | 2.00 | 2.00 | 2.00 | 9.00 | ▇▁▁▁▁ |
educ2 | 0 | 1.00 | 6.59 | 11.66 | 1.00 | 4.00 | 5.00 | 6.00 | 99.00 | ▇▁▁▁▁ |
emplnw | 0 | 1.00 | 3.69 | 11.44 | 1.00 | 1.00 | 2.00 | 3.00 | 99.00 | ▇▁▁▁▁ |
disa | 0 | 1.00 | 1.95 | 0.97 | 1.00 | 2.00 | 2.00 | 2.00 | 9.00 | ▇▁▁▁▁ |
hisp | 0 | 1.00 | 2.05 | 1.09 | 1.00 | 2.00 | 2.00 | 2.00 | 9.00 | ▇▁▁▁▁ |
racem1 | 0 | 1.00 | 1.80 | 1.95 | 1.00 | 1.00 | 1.00 | 1.00 | 9.00 | ▇▁▁▁▁ |
racem2 | 1443 | 0.04 | 4.10 | 2.02 | 1.00 | 2.00 | 5.00 | 5.00 | 7.00 | ▇▅▂▇▆ |
racem3 | 1495 | 0.00 | 4.57 | 1.62 | 2.00 | 4.00 | 5.00 | 5.00 | 7.00 | ▃▁▇▁▂ |
racecmb | 0 | 1.00 | 1.81 | 1.81 | 1.00 | 1.00 | 1.00 | 2.00 | 9.00 | ▇▁▁▁▁ |
birth_hisp | 1340 | 0.11 | 1.66 | 0.93 | 1.00 | 1.00 | 1.00 | 3.00 | 3.00 | ▇▁▁▁▃ |
income | 0 | 1.00 | 6.49 | 2.72 | 1.00 | 4.00 | 7.00 | 9.00 | 10.00 | ▃▅▅▇▇ |
party | 0 | 1.00 | 2.64 | 1.84 | 1.00 | 2.00 | 2.00 | 3.00 | 9.00 | ▇▅▁▁▁ |
partyln | 841 | 0.44 | 4.48 | 3.48 | 1.00 | 1.00 | 2.00 | 8.00 | 9.00 | ▇▁▁▁▆ |
hh1 | 0 | 1.00 | 2.83 | 1.89 | 1.00 | 2.00 | 2.00 | 4.00 | 9.00 | ▇▃▁▁▁ |
hh3 | 331 | 0.78 | 2.72 | 1.65 | 1.00 | 2.00 | 2.00 | 3.00 | 9.00 | ▇▃▁▁▁ |
ql1 | 1202 | 0.20 | 1.39 | 1.52 | 1.00 | 1.00 | 1.00 | 1.00 | 9.00 | ▇▁▁▁▁ |
ql1a | 1475 | 0.02 | 4.26 | 3.73 | 1.00 | 1.00 | 2.00 | 9.00 | 9.00 | ▇▁▁▁▅ |
qc1 | 300 | 0.80 | 1.83 | 1.06 | 1.00 | 1.00 | 2.00 | 2.00 | 9.00 | ▇▁▁▁▁ |
weight | 0 | 1.00 | 1.00 | 0.59 | 0.31 | 0.56 | 0.86 | 1.28 | 2.46 | ▇▆▃▁▂ |
cellweight | 300 | 0.80 | 1.00 | 0.51 | 0.39 | 0.65 | 0.86 | 1.23 | 2.27 | ▇▅▂▁▂ |
Data 2
Introduction and data
Identify the source of the data.
- NYC Open Data
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- Last updated on February 1, 2023, the data is provided by the NYPD, manually extracted every quarter and reviewed by the Office of Management Analysis and Planning. The data was provided starting from January 1, 2020.
Write a brief description of the observations.
The dataset contains every arrest effected in New York City.
The 25-44 age group was the most commonly arrested age group in NYC.
Drug-related arrests were the most common type of arrest.
Black and Hispanic individuals were arrested at a higher rate than those of other races/ethnicities.
Of the boroughs of New York, Staten Island had a significantly lower number of arrests, while the other boroughs had relatively similar numbers of arrests.
Males accounted for almost 70% of arrests.
Address ethical concerns about data, if any
- I do not think there are any major ethical concerns, since the identity of each arrested individual is withheld. Someone could only link a record to a person if they already knew the details of a specific arrest.
Research question
- #1: What area of New York can be identified as “most dangerous”? Is there a specific time of the day or time of year where it gets more dangerous?
Why is it important?
- Safety is always an important issue for citizens. Some areas have a bad reputation for historical reasons, so this question will first identify the areas that are deemed “dangerous” and then identify the time periods in which criminal activity is more likely.
Description
- This research topic will look into location (by coordinates, borough, etc.) as well as time period (hour, day, month, etc.) to look for trends in criminal activity.
Hypothesis
- East Harlem is the most dangerous area in New York, and crime is most prevalent at night on holidays.
Variables
Arrest Date: categorical or quantitative, depending on usage
categorical when split into month of year (January, February, etc.)
quantitative when used to calculate average time of day
Perpetrator’s Sex: Categorical
Perpetrator’s Race: Categorical
Perpetrator’s Age Group: Categorical
Precinct of Arrest: Categorical
Level of Offense: Categorical
PD_CD (three digit classification code): Categorical
Latitude, Longitude: Quantitative
- #2: How have New York State policies (regarding justice) affected crime rates over time?
Why is it important?
- New policies regarding safety are constantly being discussed, but it is important to evaluate the effectiveness of policies that have been introduced in the past. Policy makers can gain insight from the conclusions of this research and apply it to future policies.
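For research question #1, treating Arrest Date as a categorical month could look roughly like the sketch below. It runs on made-up rows and assumes `ARREST_DATE` is stored as MM/DD/YYYY text, which should be confirmed against the file. The names `toy_arrests`, `arrest_date`, `arrest_month` and `by_month` are hypothetical. Note that the skim output reports `ARREST_DATE` as 10-character strings, so time of day would have to come from another field if one is available.

```r
library(dplyr)
library(lubridate)

# Made-up rows; the MM/DD/YYYY format is an ASSUMPTION about ARREST_DATE
toy_arrests <- tibble(
  ARREST_DATE = c("01/15/2022", "01/31/2022", "07/04/2022", "12/25/2022")
)

by_month <- toy_arrests %>%
  mutate(
    arrest_date  = mdy(ARREST_DATE),                 # character -> Date
    arrest_month = month(arrest_date, label = TRUE)  # ordered factor of month names
  ) %>%
  count(arrest_month)                                # arrests per month

by_month
```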
Glimpse of data
# Read the NYPD arrest data
arrest_df <- read_csv("data/NYPD_Arrest_Data/main.csv")
Rows: 189774 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): ARREST_DATE, PD_DESC, OFNS_DESC, LAW_CODE, LAW_CAT_CD, ARREST_BORO...
dbl (9): ARREST_KEY, PD_CD, KY_CD, ARREST_PRECINCT, JURISDICTION_CODE, X_CO...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(arrest_df)
Name | arrest_df |
Number of rows | 189774 |
Number of columns | 19 |
_______________________ | |
Column type frequency: | |
character | 10 |
numeric | 9 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
ARREST_DATE | 0 | 1.00 | 10 | 10 | 0 | 365 | 0 |
PD_DESC | 0 | 1.00 | 6 | 54 | 0 | 243 | 0 |
OFNS_DESC | 0 | 1.00 | 4 | 36 | 0 | 66 | 0 |
LAW_CODE | 0 | 1.00 | 8 | 10 | 0 | 1058 | 0 |
LAW_CAT_CD | 1747 | 0.99 | 1 | 1 | 0 | 5 | 0 |
ARREST_BORO | 0 | 1.00 | 1 | 1 | 0 | 5 | 0 |
AGE_GROUP | 0 | 1.00 | 3 | 5 | 0 | 5 | 0 |
PERP_SEX | 0 | 1.00 | 1 | 1 | 0 | 2 | 0 |
PERP_RACE | 0 | 1.00 | 5 | 30 | 0 | 7 | 0 |
New Georeferenced Column | 0 | 1.00 | 21 | 42 | 0 | 34251 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
ARREST_KEY | 0 | 1 | 247721027.46 | 5378637.00 | 238492434.00 | 243171338.25 | 247532420.50 | 252021557.75 | 261197497.00 | ▇▇▇▆▁ |
PD_CD | 561 | 1 | 406.28 | 271.33 | 9.00 | 113.00 | 339.00 | 681.00 | 997.00 | ▇▆▃▅▂ |
KY_CD | 570 | 1 | 246.62 | 149.78 | 101.00 | 110.00 | 235.00 | 344.00 | 995.00 | ▇▇▁▁▁ |
ARREST_PRECINCT | 0 | 1 | 62.69 | 35.03 | 1.00 | 40.00 | 61.00 | 100.00 | 123.00 | ▆▇▆▅▇ |
JURISDICTION_CODE | 0 | 1 | 0.94 | 7.94 | 0.00 | 0.00 | 0.00 | 0.00 | 97.00 | ▇▁▁▁▁ |
X_COORD_CD | 0 | 1 | 1005207.28 | 21228.07 | 913554.00 | 991203.00 | 1005040.00 | 1017119.00 | 1067185.00 | ▁▁▇▇▂ |
Y_COORD_CD | 0 | 1 | 208693.04 | 29606.36 | 121312.00 | 186655.75 | 207651.00 | 236145.75 | 271909.00 | ▁▃▇▆▃ |
Latitude | 0 | 1 | 40.74 | 0.08 | 40.50 | 40.68 | 40.74 | 40.81 | 40.91 | ▁▃▇▆▃ |
Longitude | 0 | 1 | -73.92 | 0.08 | -74.25 | -73.97 | -73.92 | -73.88 | -73.70 | ▁▁▇▇▂ |
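As a first pass at the "most dangerous area" question, arrests could be counted by borough and by precinct. The sketch below uses made-up rows with the `ARREST_BORO` and `ARREST_PRECINCT` column names from the output above. The one-letter borough codes and precinct numbers are illustrative assumptions, and `toy_arrests`, `by_boro`, `by_precinct` and `n_arrests` are hypothetical names.

```r
library(dplyr)

# Made-up rows; borough codes and precinct numbers are ASSUMED for illustration
toy_arrests <- tibble(
  ARREST_BORO     = c("M", "M", "K", "B", "M", "K"),
  ARREST_PRECINCT = c(23, 23, 75, 40, 25, 75)
)

# Arrests per borough, largest first
by_boro <- toy_arrests %>%
  count(ARREST_BORO, sort = TRUE, name = "n_arrests")

# Arrests per precinct within each borough, largest first
by_precinct <- toy_arrests %>%
  count(ARREST_BORO, ARREST_PRECINCT, sort = TRUE, name = "n_arrests")

by_boro
by_precinct
```

Raw counts do not account for how many people live in or pass through an area, so turning them into rates would need population figures from outside this dataset.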
Data 3
Introduction and data
Identify the source of the data.
- The Occupational Safety and Health Administration (OSHA)
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- OSHA collected data about work-related injuries and illnesses within specific industry and employment size specifications. The data were collected from 2002 to 2011.
Write a brief description of the observations.
- Each row in the data set represents a specific injury case and contains information about age, gender, body part injured, nature of the injury, etc. The largest number of non-fatal injuries was seen in the goods production industry, with falls being the most common injury type. In contrast, the publishing and marketing industries were among those with the fewest reported non-fatal injuries. For fatal injuries, some of the most dangerous occupations are in construction and transportation. Other dangerous occupations with fatal injuries include protective service occupations, vehicle operators and, surprisingly, management occupations.
Address ethical concerns about data, if any
- There were no ethical concerns about the data. The data was collected legally by the government through employers, businesses and companies.
Research question
A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
- Which occupation is the riskiest in the United States? What are the chances of suffering a fatal or non-fatal injury based on the industry you work in? Do other factors like age, gender, and business location play a significant role in determining the risk of an industry? Or is the risk more industry-specific?
Why is it important?
- Workplace injuries and fatalities can have a significant impact on families and communities. Identifying the riskiest occupations and industries can help OSHA guide policies and interventions that can reduce these incidents. Workplace injuries also have economic effects: they can significantly reduce productivity and add to healthcare costs. Understanding which industries and occupations are most at risk brings more clarity to business decisions and risk management strategies.
A description of the research topic along with a concise statement of your hypotheses on this topic.
Description: We will research the relationship between types of occupations and workplace settings and the rate of injury risk across industries.
Hypothesis: The Manufacturing division has the highest injury rates of any industry division.
Identify the types of variables in your research question. Categorical? Quantitative?
Year: quantitative
Address City: categorical
Address State: categorical
Address Street: categorical
Address Zip: categorical (stored as a number, but it identifies a location rather than measuring a quantity)
Business Name: categorical
Business Second Name: categorical
Industry Division: categorical
Industry ID: categorical (a numeric code that identifies an industry)
Industry Label: categorical
Industry Major Group: categorical
Statistics Days Away: quantitative
Statistics Days Away/ Restricted/ Transfer: quantitative
Total Case Rate: quantitative
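The hypothesis could be checked by comparing the average Total Case Rate across industry divisions. Below is a minimal sketch on made-up rows, using the CSV column names `industry.division` and `statistics.total case rate`. The division labels and values are invented for illustration, and `toy_injuries`, `rate_by_division`, `n_records` and `mean_total_case_rate` are hypothetical names.

```r
library(dplyr)

# Made-up rows; column names containing spaces must be wrapped in backticks
toy_injuries <- tibble(
  industry.division = c("Manufacturing", "Manufacturing", "Retail Trade", "Retail Trade"),
  `statistics.total case rate` = c(10, 6, 4, 2)
)

rate_by_division <- toy_injuries %>%
  group_by(industry.division) %>%
  summarise(
    n_records = n(),                                                       # rows per division
    mean_total_case_rate = mean(`statistics.total case rate`, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(mean_total_case_rate))                                      # highest rate first

rate_by_division
```

Because the skim output shows heavily right-skewed rates, the median may be worth reporting alongside the mean.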
Glimpse of data
# Read the OSHA work-related injuries data
injuries <- read_csv("data/Work_Related_Injuries/injuries.csv")
Rows: 25010 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): address.city, address.state, address.street, business.name, busines...
dbl (6): year, address.zip, industry.id, statistics.days away, statistics.da...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(injuries)
Name | injuries |
Number of rows | 25010 |
Number of columns | 14 |
_______________________ | |
Column type frequency: | |
character | 8 |
numeric | 6 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
address.city | 0 | 1.00 | 3 | 21 | 0 | 6852 | 0 |
address.state | 0 | 1.00 | 2 | 2 | 0 | 48 | 0 |
address.street | 0 | 1.00 | 3 | 50 | 0 | 23573 | 0 |
business.name | 0 | 1.00 | 3 | 100 | 0 | 20864 | 0 |
business.second name | 14398 | 0.42 | 1 | 53 | 0 | 9077 | 0 |
industry.division | 0 | 1.00 | 6 | 68 | 0 | 10 | 0 |
industry.label | 0 | 1.00 | 4 | 103 | 0 | 675 | 0 |
industry.major_group | 0 | 1.00 | 8 | 110 | 0 | 62 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
year | 0 | 1 | 2006.47 | 2.80 | 2002 | 2004.0 | 2006.00 | 2009.00 | 2011.00 | ▇▇▇▇▇ |
address.zip | 0 | 1 | 47625.61 | 27103.72 | 1001 | 27501.0 | 46514.00 | 70037.00 | 99999.00 | ▇▇▇▆▅ |
industry.id | 0 | 1 | 4214.99 | 1870.38 | 112 | 3086.0 | 3625.00 | 5093.00 | 9621.00 | ▁▇▃▁▂ |
statistics.days away | 0 | 1 | 2.45 | 3.56 | 0 | 0.0 | 1.29 | 3.50 | 103.02 | ▇▁▁▁▁ |
statistics.days away/restricted/transfer | 0 | 1 | 4.72 | 5.49 | 0 | 0.0 | 3.27 | 6.96 | 206.04 | ▇▁▁▁▁ |
statistics.total case rate | 0 | 1 | 8.01 | 8.16 | 0 | 2.3 | 6.20 | 11.43 | 206.04 | ▇▁▁▁▁ |
#this is the link for the dataset: https://think.cs.vt.edu/corgis/csv/injuries/