Project title

Proposal

library(tidyverse)
library(skimr)

Data 1

Introduction and data

  • Identify the source of the data.

    • Pew Research Center
  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    • The data was collected through a telephone survey of US adults, reached by cellphone and landline. The survey was conducted January 25-February 8, 2021.
  • Write a brief description of the observations.

    • From 2012 to 2021, YouTube and Facebook have continued to dominate the online landscape. With the exception of YouTube and Reddit, most platforms show little growth since 2019.

    • Adults under 30 stand out for their use of Instagram, Snapchat and TikTok.

    • Age gaps in Snapchat and Instagram use are particularly wide; the gap is smaller for Facebook.

  • Address ethical concerns about the data, if any.

    • The dataset only includes three races/ethnicities: White Non-Hispanic, Black Non-Hispanic, and Hispanic.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

    • Is there an association between demographic group (age, gender, race, income, academic degree, location) and the online platforms people use?
  • Statement on why this question is important.

    • With the rise of social media, consistent patterns have emerged in who uses which platforms, and these may be related to factors such as demographic group. It is important to determine what specifically drives these stark differences.
  • A description of the research topic along with a concise statement of your hypotheses on this topic.

    • We want to observe if there are effects of demographic groups on the preferences of online platforms. We want to learn about the specific groups that tend to use certain online platforms.

    • Concise statement of hypotheses: Overall, age shows the largest differences in which online platforms people prefer.

  • Identify the types of variables in your research question. Categorical? Quantitative?

    • Age (categorical): 18-24, 25-29, 18-29, 30-49, 50-64, 65+

    • Gender (categorical): Man, woman, in some other way

    • Race/ethnicity (categorical): White non-Hispanic, Black non-Hispanic, Hispanic.

    • Highest level of school completed (categorical): HS graduate or less, some college, college degree, post grad degree.

    • Region (categorical): Northeast, Midwest, South, West

    • Frequency of internet usage (categorical): Less often, not internet user, don’t know / refuse

    • Online platforms (categorical): Snapchat, Instagram, YouTube, TikTok, Twitter, Facebook, Pinterest, LinkedIn, WhatsApp, Reddit, Nextdoor.

    • Number of US adults in each demographic group variable (quantitative)
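
As a minimal sketch of how the age groups and platform variables listed above might be prepared, the base R code below bins a numeric age into four non-overlapping groups and cross-tabulates them with platform use. The toy data frame and the column name `uses_instagram` are hypothetical stand-ins, not columns from the survey file. It is written in base R so it runs without extra packages; the same steps could be expressed with tidyverse verbs.

```r
# Toy data standing in for the survey; the values and the column
# name `uses_instagram` are hypothetical, not taken from the real file.
toy <- data.frame(
  age = c(18, 24, 29, 30, 49, 50, 64, 65, 80),
  uses_instagram = c("Yes", "Yes", "No", "Yes", "No", "No", "No", "No", "No")
)

# Bin age into left-closed intervals: [18,30), [30,50), [50,65), [65,Inf).
toy$age_group <- cut(
  toy$age,
  breaks = c(18, 30, 50, 65, Inf),
  labels = c("18-29", "30-49", "50-64", "65+"),
  right = FALSE
)

# Counts of platform use within each age group.
counts <- table(toy$age_group, toy$uses_instagram)

# Row proportions: share of each age group that uses the platform.
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```

Because both variables are categorical, a table of row proportions like this is one simple way to compare platform use across groups before any formal test.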

Glimpse of data

# add code here
coretrends <- read_csv("data/Social_Media_Use_in_2021/coretrendssurvey.csv")
Rows: 1502 Columns: 89
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): int_date, usr, bbsmart3f_1_other
dbl (85): respid, sample, comp, lang, state, density, qs1, sex, eminuse, int...
lgl  (1): racem4

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(coretrends)
Data summary
Name coretrends
Number of rows 1502
Number of columns 89
_______________________
Column type frequency:
character 3
logical 1
numeric 85
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
int_date 0 1.00 8 9 0 15 0
usr 98 0.93 1 1 0 3 0
bbsmart3f_1_other 1421 0.05 5 104 0 72 0

Variable type: logical

skim_variable n_missing complete_rate mean count
racem4 1502 0 NaN :

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
respid 0 1.00 134280.02 64625.39 798.00 106243.50 139521.00 172600.75 311426.00 ▃▃▇▂▁
sample 0 1.00 1.80 0.40 1.00 2.00 2.00 2.00 2.00 ▂▁▁▁▇
comp 0 1.00 1.00 0.00 1.00 1.00 1.00 1.00 1.00 ▁▁▇▁▁
lang 0 1.00 1.05 0.22 1.00 1.00 1.00 1.00 2.00 ▇▁▁▁▁
state 0 1.00 28.75 16.39 1.00 12.00 29.00 42.00 56.00 ▇▅▅▇▇
density 0 1.00 2.92 1.41 1.00 2.00 3.00 4.00 5.00 ▇▇▇▇▇
qs1 300 0.80 2.00 0.00 2.00 2.00 2.00 2.00 2.00 ▁▁▇▁▁
sex 0 1.00 1.42 0.49 1.00 1.00 1.00 2.00 2.00 ▇▁▁▁▆
eminuse 0 1.00 1.08 0.33 1.00 1.00 1.00 1.00 9.00 ▇▁▁▁▁
intmob 0 1.00 1.12 0.33 1.00 1.00 1.00 1.00 2.00 ▇▁▁▁▁
intfreq 89 0.94 1.97 1.00 1.00 1.00 2.00 2.00 9.00 ▇▁▁▁▁
snsint2 0 1.00 1.31 0.46 1.00 1.00 1.00 2.00 2.00 ▇▁▁▁▃
home4nw 0 1.00 1.16 0.50 1.00 1.00 1.00 1.00 8.00 ▇▁▁▁▁
bbhome1 214 0.86 2.22 1.08 1.00 2.00 2.00 2.00 8.00 ▇▁▁▁▁
bbhome2 1482 0.01 1.10 0.31 1.00 1.00 1.00 1.00 2.00 ▇▁▁▁▁
device1a 1202 0.20 1.11 0.32 1.00 1.00 1.00 1.00 2.00 ▇▁▁▁▁
smart2 34 0.98 1.18 0.75 1.00 1.00 1.00 1.00 9.00 ▇▁▁▁▁
bbsmart2 1217 0.19 1.94 1.13 1.00 2.00 2.00 2.00 9.00 ▇▁▁▁▁
bbsmart3a 1217 0.19 2.15 1.99 1.00 1.00 2.00 2.00 9.00 ▇▁▁▁▁
bbsmart3b 1217 0.19 1.87 1.32 1.00 1.00 2.00 2.00 9.00 ▇▁▁▁▁
bbsmart3c 1327 0.12 1.37 0.89 1.00 1.00 1.00 2.00 9.00 ▇▁▁▁▁
bbsmart3d 1217 0.19 1.73 1.17 1.00 1.00 2.00 2.00 9.00 ▇▁▁▁▁
bbsmart3e 1217 0.19 2.64 2.28 1.00 2.00 2.00 2.00 9.00 ▇▁▁▁▂
bbsmart3f 1217 0.19 1.94 1.32 1.00 1.00 2.00 2.00 9.00 ▇▁▁▁▁
bbsmart4 1309 0.13 3.72 2.10 1.00 2.00 3.00 6.00 9.00 ▆▇▂▅▂
cable1 0 1.00 1.40 0.59 1.00 1.00 1.00 2.00 8.00 ▇▁▁▁▁
cable2 919 0.39 1.41 0.62 1.00 1.00 1.00 2.00 8.00 ▇▁▁▁▁
cable3a 919 0.39 1.58 0.75 1.00 1.00 2.00 2.00 9.00 ▇▁▁▁▁
cable3b 919 0.39 1.35 0.78 1.00 1.00 1.00 2.00 9.00 ▇▁▁▁▁
cable3c 919 0.39 1.29 0.53 1.00 1.00 1.00 2.00 8.00 ▇▁▁▁▁
web1a 0 1.00 1.78 0.49 1.00 2.00 2.00 2.00 9.00 ▇▁▁▁▁
web1b 0 1.00 1.66 0.56 1.00 1.00 2.00 2.00 9.00 ▇▁▁▁▁
web1c 0 1.00 1.35 0.57 1.00 1.00 1.00 2.00 9.00 ▇▁▁▁▁
web1d 0 1.00 1.80 0.44 1.00 2.00 2.00 2.00 9.00 ▇▁▁▁▁
web1e 0 1.00 1.20 0.45 1.00 1.00 1.00 1.00 9.00 ▇▁▁▁▁
web1f 0 1.00 1.82 0.63 1.00 2.00 2.00 2.00 9.00 ▇▁▁▁▁
web1g 0 1.00 1.72 0.54 1.00 1.00 2.00 2.00 9.00 ▇▁▁▁▁
web1h 0 1.00 1.71 0.64 1.00 1.00 2.00 2.00 9.00 ▇▁▁▁▁
web1i 0 1.00 1.85 0.53 1.00 2.00 2.00 2.00 9.00 ▇▁▁▁▁
web1j 0 1.00 1.84 0.41 1.00 2.00 2.00 2.00 9.00 ▇▁▁▁▁
web1k 0 1.00 1.90 0.63 1.00 2.00 2.00 2.00 8.00 ▇▁▁▁▁
sns2a 1156 0.23 2.63 1.42 1.00 1.00 3.00 3.75 8.00 ▇▅▅▁▁
sns2b 972 0.35 2.43 1.36 1.00 1.00 2.00 3.00 5.00 ▇▅▅▂▃
sns2c 514 0.66 2.08 1.30 1.00 1.00 2.00 3.00 9.00 ▇▃▁▁▁
sns2d 1195 0.20 2.41 1.55 1.00 1.00 2.00 3.00 9.00 ▇▅▂▁▁
sns2e 299 0.80 2.49 1.31 1.00 1.00 3.00 3.00 9.00 ▇▇▂▁▁
paya 34 0.98 1.90 0.40 1.00 2.00 2.00 2.00 9.00 ▇▁▁▁▁
payb 285 0.81 1.90 0.42 1.00 2.00 2.00 2.00 9.00 ▇▁▁▁▁
payc 583 0.61 1.91 0.54 1.00 2.00 2.00 2.00 9.00 ▇▁▁▁▁
prob 89 0.94 2.86 0.93 1.00 2.00 3.00 3.00 9.00 ▃▇▁▁▁
coviddisa 0 1.00 1.98 1.36 1.00 1.00 2.00 2.00 9.00 ▇▂▁▁▁
coviddisb 0 1.00 1.90 1.26 1.00 1.00 2.00 2.00 9.00 ▇▂▁▁▁
coviddisc 0 1.00 1.78 1.67 1.00 1.00 1.00 2.00 9.00 ▇▁▁▁▁
coviddisd 0 1.00 1.67 1.72 1.00 1.00 1.00 1.00 9.00 ▇▁▁▁▁
coviddise 0 1.00 1.87 1.39 1.00 1.00 2.00 2.00 9.00 ▇▂▁▁▁
device1b 0 1.00 1.48 0.63 1.00 1.00 1.00 2.00 9.00 ▇▁▁▁▁
device1c 0 1.00 1.19 0.43 1.00 1.00 1.00 1.00 8.00 ▇▁▁▁▁
device1d 0 1.00 1.66 0.64 1.00 1.00 2.00 2.00 9.00 ▇▁▁▁▁
books1 0 1.00 16.92 26.48 0.00 1.25 5.00 20.00 99.00 ▇▁▁▁▁
books2a 301 0.80 1.14 0.45 1.00 1.00 1.00 1.00 9.00 ▇▁▁▁▁
books2b 301 0.80 1.71 0.52 1.00 1.00 2.00 2.00 8.00 ▇▁▁▁▁
books2c 301 0.80 1.64 0.66 1.00 1.00 2.00 2.00 8.00 ▇▁▁▁▁
gender 0 1.00 1.49 0.82 1.00 1.00 1.00 2.00 9.00 ▇▁▁▁▁
age 0 1.00 53.57 20.37 18.00 37.00 55.00 68.00 99.00 ▆▆▇▆▂
marital 0 1.00 2.88 2.21 1.00 1.00 2.00 5.00 9.00 ▇▂▁▃▁
par 0 1.00 1.84 0.83 1.00 2.00 2.00 2.00 9.00 ▇▁▁▁▁
educ2 0 1.00 6.59 11.66 1.00 4.00 5.00 6.00 99.00 ▇▁▁▁▁
emplnw 0 1.00 3.69 11.44 1.00 1.00 2.00 3.00 99.00 ▇▁▁▁▁
disa 0 1.00 1.95 0.97 1.00 2.00 2.00 2.00 9.00 ▇▁▁▁▁
hisp 0 1.00 2.05 1.09 1.00 2.00 2.00 2.00 9.00 ▇▁▁▁▁
racem1 0 1.00 1.80 1.95 1.00 1.00 1.00 1.00 9.00 ▇▁▁▁▁
racem2 1443 0.04 4.10 2.02 1.00 2.00 5.00 5.00 7.00 ▇▅▂▇▆
racem3 1495 0.00 4.57 1.62 2.00 4.00 5.00 5.00 7.00 ▃▁▇▁▂
racecmb 0 1.00 1.81 1.81 1.00 1.00 1.00 2.00 9.00 ▇▁▁▁▁
birth_hisp 1340 0.11 1.66 0.93 1.00 1.00 1.00 3.00 3.00 ▇▁▁▁▃
income 0 1.00 6.49 2.72 1.00 4.00 7.00 9.00 10.00 ▃▅▅▇▇
party 0 1.00 2.64 1.84 1.00 2.00 2.00 3.00 9.00 ▇▅▁▁▁
partyln 841 0.44 4.48 3.48 1.00 1.00 2.00 8.00 9.00 ▇▁▁▁▆
hh1 0 1.00 2.83 1.89 1.00 2.00 2.00 4.00 9.00 ▇▃▁▁▁
hh3 331 0.78 2.72 1.65 1.00 2.00 2.00 3.00 9.00 ▇▃▁▁▁
ql1 1202 0.20 1.39 1.52 1.00 1.00 1.00 1.00 9.00 ▇▁▁▁▁
ql1a 1475 0.02 4.26 3.73 1.00 1.00 2.00 9.00 9.00 ▇▁▁▁▅
qc1 300 0.80 1.83 1.06 1.00 1.00 2.00 2.00 9.00 ▇▁▁▁▁
weight 0 1.00 1.00 0.59 0.31 0.56 0.86 1.28 2.46 ▇▆▃▁▂
cellweight 300 0.80 1.00 0.51 0.39 0.65 0.86 1.23 2.27 ▇▅▂▁▂
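
Many numeric columns in the summary above have a maximum of 8, 9, or 99, which would be consistent with special codes for answers such as "don't know" or "refused" rather than real values. That reading is an assumption that should be confirmed against the survey codebook. A minimal base R sketch of recoding such values to NA before summarizing, using a made-up vector and a hypothetical helper `recode_missing()`:

```r
# Hypothetical helper: replace assumed nonresponse codes with NA.
# Which codes mean "don't know" or "refused" must come from the codebook.
recode_missing <- function(x, codes) {
  x[x %in% codes] <- NA
  x
}

# Made-up responses where 1 = yes, 2 = no, and 8/9 are assumed nonresponse codes.
responses <- c(1, 2, 2, 9, 1, 8, 2)

cleaned <- recode_missing(responses, codes = c(8, 9))

print(mean(responses))              # pulled upward by the 8 and 9 codes
print(mean(cleaned, na.rm = TRUE))  # mean of the substantive answers only
```

Leaving such codes in place would distort means and standard deviations, which is why the recoding step comes before any summary.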

Data 2

Introduction and data

  • Identify the source of the data.

    • NYC Open Data
  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    • The data is provided by the NYPD, manually extracted every quarter, and reviewed by the Office of Management Analysis and Planning. It was last updated on February 1, 2023, and covers arrests starting from January 1, 2020.
  • Write a brief description of the observations.

    • The dataset contains every arrest effected in New York City.

    • The 25-44 age group was the most commonly arrested age group in NYC.

    • Drug-related arrests were the most common type of arrest.

    • Black and Hispanic individuals were arrested at a higher rate than those of other races and ethnicities.

    • Of the boroughs of New York, Staten Island had a significantly lower number of arrests, while the other boroughs had relatively similar numbers of arrests.

    • Males accounted for almost 70% of arrests.

  • Address ethical concerns about data, if any

    • I do not think there are any major ethical concerns, since the identity of each arrested individual is hidden, unless somebody already knows of a specific arrest and the person associated with it.
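
Summary claims like the ones above about sex and borough can be checked with simple frequency tables. The sketch below uses a small made-up data frame with the same column names as the arrest data (`PERP_SEX`, `ARREST_BORO`); the values and the single-letter codes are illustrative assumptions only.

```r
# Made-up rows; column names follow the arrest data, values are illustrative.
toy_arrests <- data.frame(
  PERP_SEX    = c("M", "M", "F", "M", "F", "M", "M", "M", "F", "M"),
  ARREST_BORO = c("K", "M", "B", "Q", "K", "S", "B", "M", "K", "Q")
)

# Share of arrests by sex.
sex_share <- prop.table(table(toy_arrests$PERP_SEX))

# Number of arrests per borough, largest first.
boro_counts <- sort(table(toy_arrests$ARREST_BORO), decreasing = TRUE)

print(round(sex_share, 2))
print(boro_counts)
```

Raw counts per borough do not account for population size, so comparing boroughs by rate would need population figures from another source.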

Research question

  • #1: What area of New York can be identified as “most dangerous”? Is there a specific time of day or time of year when it becomes more dangerous?
    • Why is it important?

      • Safety is always an important issue for citizens. Some areas have a bad reputation for historical reasons, so this question will first identify the areas that are deemed “dangerous” and then identify the time periods in which criminal activity is more likely.
    • Description

      • This research topic will look into location (by coordinates, borough, etc.) as well as time period (hour, day, month, etc.) to look for trends in criminal activity.
    • Hypothesis

      • East Harlem is the most dangerous area in New York, and crimes are most prevalent at night on holidays.
    • Variables

      • Arrest Date: can be either, depending on usage

        • categorical when split into month of year (January, February, etc.)

        • quantitative when used to calculate average time of day

      • Perpetrator’s Sex: Categorical

      • Perpetrator’s Race: Categorical

      • Perpetrator’s Age Group: Categorical

      • Precinct of Arrest: Categorical

      • Level of Offense: Categorical

      • PD_CD (three digit classification code): Categorical

      • Latitude, Longitude: Quantitative

  • #2: How have New York State policies (regarding justice) affected crime rates over time?
    • Why is it important?

      • New policies regarding safety are constantly being discussed, but it is important to evaluate the effectiveness of policies that have been introduced in the past. Policy makers can gain insight from the conclusions of this research and apply it to future policies.
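
To illustrate treating Arrest Date as a categorical variable, the base R sketch below converts date strings to months and counts arrests per month. It assumes the strings are in MM/DD/YYYY form, which should be checked against the file; the example dates are made up. The summary of `ARREST_DATE` in the data glimpse shows strings of exactly 10 characters, which suggests the column holds a date without a time of day.

```r
# Made-up date strings; the MM/DD/YYYY format is an assumption.
arrest_date <- c("01/15/2022", "01/31/2022", "02/01/2022",
                 "07/04/2022", "12/25/2022")

# Parse the strings into Date objects.
parsed <- as.Date(arrest_date, format = "%m/%d/%Y")

# Month as a factor with levels in calendar order (January ... December).
# month.name is a built-in constant, so this does not depend on the locale.
arrest_month <- factor(
  month.name[as.integer(format(parsed, "%m"))],
  levels = month.name
)

# Arrests per month; the table also holds zero counts for unseen months.
month_counts <- table(arrest_month)
print(month_counts[month_counts > 0])
```

Keeping all twelve levels in the factor means months with no arrests still appear with a count of zero, which matters when plotting trends over the year.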

Glimpse of data

# add code here
arrest_df <- read_csv("data/NYPD_Arrest_Data/main.csv")
Rows: 189774 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): ARREST_DATE, PD_DESC, OFNS_DESC, LAW_CODE, LAW_CAT_CD, ARREST_BORO...
dbl  (9): ARREST_KEY, PD_CD, KY_CD, ARREST_PRECINCT, JURISDICTION_CODE, X_CO...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(arrest_df)
Data summary
Name arrest_df
Number of rows 189774
Number of columns 19
_______________________
Column type frequency:
character 10
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ARREST_DATE 0 1.00 10 10 0 365 0
PD_DESC 0 1.00 6 54 0 243 0
OFNS_DESC 0 1.00 4 36 0 66 0
LAW_CODE 0 1.00 8 10 0 1058 0
LAW_CAT_CD 1747 0.99 1 1 0 5 0
ARREST_BORO 0 1.00 1 1 0 5 0
AGE_GROUP 0 1.00 3 5 0 5 0
PERP_SEX 0 1.00 1 1 0 2 0
PERP_RACE 0 1.00 5 30 0 7 0
New Georeferenced Column 0 1.00 21 42 0 34251 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ARREST_KEY 0 1 247721027.46 5378637.00 238492434.00 243171338.25 247532420.50 252021557.75 261197497.00 ▇▇▇▆▁
PD_CD 561 1 406.28 271.33 9.00 113.00 339.00 681.00 997.00 ▇▆▃▅▂
KY_CD 570 1 246.62 149.78 101.00 110.00 235.00 344.00 995.00 ▇▇▁▁▁
ARREST_PRECINCT 0 1 62.69 35.03 1.00 40.00 61.00 100.00 123.00 ▆▇▆▅▇
JURISDICTION_CODE 0 1 0.94 7.94 0.00 0.00 0.00 0.00 97.00 ▇▁▁▁▁
X_COORD_CD 0 1 1005207.28 21228.07 913554.00 991203.00 1005040.00 1017119.00 1067185.00 ▁▁▇▇▂
Y_COORD_CD 0 1 208693.04 29606.36 121312.00 186655.75 207651.00 236145.75 271909.00 ▁▃▇▆▃
Latitude 0 1 40.74 0.08 40.50 40.68 40.74 40.81 40.91 ▁▃▇▆▃
Longitude 0 1 -73.92 0.08 -74.25 -73.97 -73.92 -73.88 -73.70 ▁▁▇▇▂

Data 3

Introduction and data

  • Identify the source of the data.

    • The Occupational Safety and Health Administration (OSHA)
  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    • OSHA collected data about work-related injuries and illnesses from employers within specific industry and employment size specifications. The data covers 2002 to 2011.
  • Write a brief description of the observations.

    • Each row in the data set represents a specific injury case and contains information about age, gender, body part injured, nature of the injury, etc. The largest number of non-fatal injuries was seen in the goods production industry, with falls being the most common injury type. Similarly, the publishing and marketing industries were among those with the fewest reported non-fatal injuries. For fatal injuries, some of the most dangerous occupations are in Construction and Transportation. Other dangerous occupations with fatal injuries include protective service occupations, vehicle operators and, surprisingly, management occupations.
  • Address ethical concerns about data, if any

    • There were no ethical concerns about the data. The data was collected legally by the government through employers, businesses and companies.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

    • Which occupation is the riskiest in the United States? What are the chances of suffering a fatal or non-fatal injury depending on the industry you work in? Do other factors like age, gender, and location of the business play a significant role in determining the risk of an industry? Or is risk more industry-specific?
  • Why is it important?

    • Workplace injuries and fatalities can have a significant impact on families and communities. Identifying the riskiest occupations and industries can help OSHA guide policies and interventions that reduce these incidents. Workplace injuries also have economic effects: they can significantly reduce productivity and add to healthcare costs. Understanding which industries and occupations are most at risk will bring more clarity to business decisions and risk management strategies.
  • A description of the research topic along with a concise statement of your hypotheses on this topic.

    • Description: We will be researching the correlation between different types of occupations and workplace settings among industries with different rates of injury risk.

    • Hypothesis: The Manufacturing division has the highest injury rate of all industry divisions.

  • Identify the types of variables in your research question. Categorical? Quantitative?

    • Year: quantitative

    • Address City: categorical

    • Address State: categorical

    • Address Street: categorical

    • Address Zip: quantitative

    • Business Name: categorical

    • Business Second Name: categorical

    • Industry Division: categorical

    • Industry ID: quantitative

    • Industry Label: categorical

    • Industry Major Group: categorical

    • Statistics Days Away: quantitative

    • Statistics Days Away/ Restricted/ Transfer: quantitative

    • Total Case Rate: quantitative
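
One way the hypothesis above might be examined is to compare the mean total case rate across industry divisions. The base R sketch below does this on made-up rows; `industry_division` and `total_case_rate` are hypothetical syntactic names standing in for the file's own column names, and the numbers are invented.

```r
# Made-up rows; the column names are hypothetical stand-ins and the
# rates are invented for illustration.
toy_injuries <- data.frame(
  industry_division = c("Manufacturing", "Manufacturing", "Construction",
                        "Construction", "Retail Trade", "Retail Trade"),
  total_case_rate   = c(10, 6, 9, 5, 4, 2)
)

# Mean total case rate within each industry division.
division_means <- aggregate(
  total_case_rate ~ industry_division,
  data = toy_injuries,
  FUN = mean
)

# Sort so the division with the highest mean rate comes first.
division_means <- division_means[order(-division_means$total_case_rate), ]
print(division_means)
```

In the actual file some column names contain spaces (for example `statistics.total case rate`), so they would need backticks or renaming before being used in a formula like this.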

Glimpse of data

# add code here
injuries <- read_csv("data/Work_Related_Injuries/injuries.csv")
Rows: 25010 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): address.city, address.state, address.street, business.name, busines...
dbl (6): year, address.zip, industry.id, statistics.days away, statistics.da...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(injuries)
Data summary
Name injuries
Number of rows 25010
Number of columns 14
_______________________
Column type frequency:
character 8
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
address.city 0 1.00 3 21 0 6852 0
address.state 0 1.00 2 2 0 48 0
address.street 0 1.00 3 50 0 23573 0
business.name 0 1.00 3 100 0 20864 0
business.second name 14398 0.42 1 53 0 9077 0
industry.division 0 1.00 6 68 0 10 0
industry.label 0 1.00 4 103 0 675 0
industry.major_group 0 1.00 8 110 0 62 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 2006.47 2.80 2002 2004.0 2006.00 2009.00 2011.00 ▇▇▇▇▇
address.zip 0 1 47625.61 27103.72 1001 27501.0 46514.00 70037.00 99999.00 ▇▇▇▆▅
industry.id 0 1 4214.99 1870.38 112 3086.0 3625.00 5093.00 9621.00 ▁▇▃▁▂
statistics.days away 0 1 2.45 3.56 0 0.0 1.29 3.50 103.02 ▇▁▁▁▁
statistics.days away/restricted/transfer 0 1 4.72 5.49 0 0.0 3.27 6.96 206.04 ▇▁▁▁▁
statistics.total case rate 0 1 8.01 8.16 0 2.3 6.20 11.43 206.04 ▇▁▁▁▁
#this is the link for the dataset: https://think.cs.vt.edu/corgis/csv/injuries/