Project title

Proposal

library(tidyverse)
library(skimr)

Data 1

Introduction and data

Identify the source of the data.
- Pew Research Center
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- The data was collected through a telephone survey of US adults, reached by cellphone and landline. The survey was conducted January 25 to February 8, 2021.
Write a brief description of the observations.
- From 2012 to 2021, YouTube and Facebook continue to dominate the online landscape. With the exception of YouTube and Reddit, most platforms show little growth since 2019.
- Adults under 30 stand out for their use of Instagram, Snapchat and TikTok.
- Age gaps in Snapchat and Instagram use are particularly wide; the gap is smaller for Facebook.
Address ethical concerns about the data, if any.
- The dataset only includes three races/ethnicities: White Non-Hispanic, Black Non-Hispanic, and Hispanic.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
- Is there a correlation between demographic group (age, gender, race, income, academic degree, location) and the online platforms people use?
Statement on why this question is important.
- With the rise of social media, usage shows consistent patterns across the population, and these patterns may be related to demographic factors. It is important to determine which specific factors are associated with these stark differences.
A description of the research topic along with a concise statement of your hypotheses on this topic.
- We want to observe whether demographic group affects online platform preferences, and to learn which specific groups tend to use which platforms.
- Concise statement of hypotheses: Of all the demographic variables, age shows the largest differences in which online platforms people prefer.
Identify the types of variables in your research question. Categorical? Quantitative?
- Age (categorical): 18-24, 25-29, 18-29, 30-49, 50-64, 65+
- Gender (categorical): Man, woman, in some other way
- Race/ethnicity (categorical): White non-Hispanic, Black non-Hispanic, Hispanic.
- Highest level of school completed (categorical): HS graduate or less, some college, college degree, post grad degree.
- Region (categorical): Northeast, Midwest, South, West
- Frequency of internet usage (categorical): Less often, not internet user, don’t know / refuse
- Online platforms (categorical): Snapchat, Instagram, YouTube, TikTok, Twitter, Facebook, Pinterest, LinkedIn, WhatsApp, Reddit, Nextdoor.
- Number of US adults in each demographic group variable (quantitative)
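The variables above are described as categories, but the survey file stores most of them as numeric codes (see the skim output below, where `age` is numeric and the `web1*` columns range from 1 to 9). A minimal sketch of how the recoding could look is given here. It uses a made-up toy table rather than the real file, and the code meanings (`age` 99 = refused; `web1a` 1 = yes, 2 = no, 8/9 = don't know/refused) are assumptions that should be checked against the survey codebook. The column `web1a` is used only as an example; which platform it refers to is not stated here.

```r
library(dplyr)

# Toy stand-in for a few coretrends columns (values are made up for illustration)
toy <- tibble(
  age   = c(19, 27, 35, 58, 70, 99),
  web1a = c(1, 1, 2, 2, 1, 8)
)

# Assumed coding (verify in the codebook):
#   age 99          -> refused / unknown
#   web1a 1 / 2     -> yes / no
#   web1a 8 / 9     -> don't know / refused
toy_recoded <- toy %>%
  mutate(
    age = na_if(age, 99),
    # Left-closed bins: [18,30), [30,50), [50,65), [65,Inf)
    age_group = cut(
      age,
      breaks = c(18, 30, 50, 65, Inf),
      labels = c("18-29", "30-49", "50-64", "65+"),
      right = FALSE
    ),
    uses_platform = case_when(
      web1a == 1 ~ TRUE,
      web1a == 2 ~ FALSE,
      TRUE ~ NA
    )
  )

# Unweighted share of respondents in each age group who use the platform
usage_by_age <- toy_recoded %>%
  filter(!is.na(age_group)) %>%
  group_by(age_group) %>%
  summarise(share_using = mean(uses_platform, na.rm = TRUE), n = n())
```

These shares are unweighted. The file also contains a `weight` column, so population-level estimates would probably need to use it.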

Glimpse of data

# Read the Pew Research Center core trends survey responses
coretrends <- read_csv("data/Social_Media_Use_in_2021/coretrendssurvey.csv")

Rows: 1502 Columns: 89
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): int_date, usr, bbsmart3f_1_other
dbl (85): respid, sample, comp, lang, state, density, qs1, sex, eminuse, int...
lgl  (1): racem4

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

skimr::skim(coretrends)

Data summary
Name	coretrends
Number of rows	1502
Number of columns	89
_______________________
Column type frequency:
character	3
logical	1
numeric	85
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
int_date	0	1.00	8	9	15
usr	98	0.93	1	1	3
bbsmart3f_1_other	1421	0.05	5	104	72

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
racem4	1502	0	NaN	:

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
respid	0	1.00	134280.02	64625.39	798.00	106243.50	139521.00	172600.75	311426.00	▃▃▇▂▁
sample	0	1.00	1.80	0.40	1.00	2.00	2.00	2.00	2.00	▂▁▁▁▇
comp	0	1.00	1.00	0.00	1.00	1.00	1.00	1.00	1.00	▁▁▇▁▁
lang	0	1.00	1.05	0.22	1.00	1.00	1.00	1.00	2.00	▇▁▁▁▁
state	0	1.00	28.75	16.39	1.00	12.00	29.00	42.00	56.00	▇▅▅▇▇
density	0	1.00	2.92	1.41	1.00	2.00	3.00	4.00	5.00	▇▇▇▇▇
qs1	300	0.80	2.00	0.00	2.00	2.00	2.00	2.00	2.00	▁▁▇▁▁
sex	0	1.00	1.42	0.49	1.00	1.00	1.00	2.00	2.00	▇▁▁▁▆
eminuse	0	1.00	1.08	0.33	1.00	1.00	1.00	1.00	9.00	▇▁▁▁▁
intmob	0	1.00	1.12	0.33	1.00	1.00	1.00	1.00	2.00	▇▁▁▁▁
intfreq	89	0.94	1.97	1.00	1.00	1.00	2.00	2.00	9.00	▇▁▁▁▁
snsint2	0	1.00	1.31	0.46	1.00	1.00	1.00	2.00	2.00	▇▁▁▁▃
home4nw	0	1.00	1.16	0.50	1.00	1.00	1.00	1.00	8.00	▇▁▁▁▁
bbhome1	214	0.86	2.22	1.08	1.00	2.00	2.00	2.00	8.00	▇▁▁▁▁
bbhome2	1482	0.01	1.10	0.31	1.00	1.00	1.00	1.00	2.00	▇▁▁▁▁
device1a	1202	0.20	1.11	0.32	1.00	1.00	1.00	1.00	2.00	▇▁▁▁▁
smart2	34	0.98	1.18	0.75	1.00	1.00	1.00	1.00	9.00	▇▁▁▁▁
bbsmart2	1217	0.19	1.94	1.13	1.00	2.00	2.00	2.00	9.00	▇▁▁▁▁
bbsmart3a	1217	0.19	2.15	1.99	1.00	1.00	2.00	2.00	9.00	▇▁▁▁▁
bbsmart3b	1217	0.19	1.87	1.32	1.00	1.00	2.00	2.00	9.00	▇▁▁▁▁
bbsmart3c	1327	0.12	1.37	0.89	1.00	1.00	1.00	2.00	9.00	▇▁▁▁▁
bbsmart3d	1217	0.19	1.73	1.17	1.00	1.00	2.00	2.00	9.00	▇▁▁▁▁
bbsmart3e	1217	0.19	2.64	2.28	1.00	2.00	2.00	2.00	9.00	▇▁▁▁▂
bbsmart3f	1217	0.19	1.94	1.32	1.00	1.00	2.00	2.00	9.00	▇▁▁▁▁
bbsmart4	1309	0.13	3.72	2.10	1.00	2.00	3.00	6.00	9.00	▆▇▂▅▂
cable1	0	1.00	1.40	0.59	1.00	1.00	1.00	2.00	8.00	▇▁▁▁▁
cable2	919	0.39	1.41	0.62	1.00	1.00	1.00	2.00	8.00	▇▁▁▁▁
cable3a	919	0.39	1.58	0.75	1.00	1.00	2.00	2.00	9.00	▇▁▁▁▁
cable3b	919	0.39	1.35	0.78	1.00	1.00	1.00	2.00	9.00	▇▁▁▁▁
cable3c	919	0.39	1.29	0.53	1.00	1.00	1.00	2.00	8.00	▇▁▁▁▁
web1a	0	1.00	1.78	0.49	1.00	2.00	2.00	2.00	9.00	▇▁▁▁▁
web1b	0	1.00	1.66	0.56	1.00	1.00	2.00	2.00	9.00	▇▁▁▁▁
web1c	0	1.00	1.35	0.57	1.00	1.00	1.00	2.00	9.00	▇▁▁▁▁
web1d	0	1.00	1.80	0.44	1.00	2.00	2.00	2.00	9.00	▇▁▁▁▁
web1e	0	1.00	1.20	0.45	1.00	1.00	1.00	1.00	9.00	▇▁▁▁▁
web1f	0	1.00	1.82	0.63	1.00	2.00	2.00	2.00	9.00	▇▁▁▁▁
web1g	0	1.00	1.72	0.54	1.00	1.00	2.00	2.00	9.00	▇▁▁▁▁
web1h	0	1.00	1.71	0.64	1.00	1.00	2.00	2.00	9.00	▇▁▁▁▁
web1i	0	1.00	1.85	0.53	1.00	2.00	2.00	2.00	9.00	▇▁▁▁▁
web1j	0	1.00	1.84	0.41	1.00	2.00	2.00	2.00	9.00	▇▁▁▁▁
web1k	0	1.00	1.90	0.63	1.00	2.00	2.00	2.00	8.00	▇▁▁▁▁
sns2a	1156	0.23	2.63	1.42	1.00	1.00	3.00	3.75	8.00	▇▅▅▁▁
sns2b	972	0.35	2.43	1.36	1.00	1.00	2.00	3.00	5.00	▇▅▅▂▃
sns2c	514	0.66	2.08	1.30	1.00	1.00	2.00	3.00	9.00	▇▃▁▁▁
sns2d	1195	0.20	2.41	1.55	1.00	1.00	2.00	3.00	9.00	▇▅▂▁▁
sns2e	299	0.80	2.49	1.31	1.00	1.00	3.00	3.00	9.00	▇▇▂▁▁
paya	34	0.98	1.90	0.40	1.00	2.00	2.00	2.00	9.00	▇▁▁▁▁
payb	285	0.81	1.90	0.42	1.00	2.00	2.00	2.00	9.00	▇▁▁▁▁
payc	583	0.61	1.91	0.54	1.00	2.00	2.00	2.00	9.00	▇▁▁▁▁
prob	89	0.94	2.86	0.93	1.00	2.00	3.00	3.00	9.00	▃▇▁▁▁
coviddisa	0	1.00	1.98	1.36	1.00	1.00	2.00	2.00	9.00	▇▂▁▁▁
coviddisb	0	1.00	1.90	1.26	1.00	1.00	2.00	2.00	9.00	▇▂▁▁▁
coviddisc	0	1.00	1.78	1.67	1.00	1.00	1.00	2.00	9.00	▇▁▁▁▁
coviddisd	0	1.00	1.67	1.72	1.00	1.00	1.00	1.00	9.00	▇▁▁▁▁
coviddise	0	1.00	1.87	1.39	1.00	1.00	2.00	2.00	9.00	▇▂▁▁▁
device1b	0	1.00	1.48	0.63	1.00	1.00	1.00	2.00	9.00	▇▁▁▁▁
device1c	0	1.00	1.19	0.43	1.00	1.00	1.00	1.00	8.00	▇▁▁▁▁
device1d	0	1.00	1.66	0.64	1.00	1.00	2.00	2.00	9.00	▇▁▁▁▁
books1	0	1.00	16.92	26.48	0.00	1.25	5.00	20.00	99.00	▇▁▁▁▁
books2a	301	0.80	1.14	0.45	1.00	1.00	1.00	1.00	9.00	▇▁▁▁▁
books2b	301	0.80	1.71	0.52	1.00	1.00	2.00	2.00	8.00	▇▁▁▁▁
books2c	301	0.80	1.64	0.66	1.00	1.00	2.00	2.00	8.00	▇▁▁▁▁
gender	0	1.00	1.49	0.82	1.00	1.00	1.00	2.00	9.00	▇▁▁▁▁
age	0	1.00	53.57	20.37	18.00	37.00	55.00	68.00	99.00	▆▆▇▆▂
marital	0	1.00	2.88	2.21	1.00	1.00	2.00	5.00	9.00	▇▂▁▃▁
par	0	1.00	1.84	0.83	1.00	2.00	2.00	2.00	9.00	▇▁▁▁▁
educ2	0	1.00	6.59	11.66	1.00	4.00	5.00	6.00	99.00	▇▁▁▁▁
emplnw	0	1.00	3.69	11.44	1.00	1.00	2.00	3.00	99.00	▇▁▁▁▁
disa	0	1.00	1.95	0.97	1.00	2.00	2.00	2.00	9.00	▇▁▁▁▁
hisp	0	1.00	2.05	1.09	1.00	2.00	2.00	2.00	9.00	▇▁▁▁▁
racem1	0	1.00	1.80	1.95	1.00	1.00	1.00	1.00	9.00	▇▁▁▁▁
racem2	1443	0.04	4.10	2.02	1.00	2.00	5.00	5.00	7.00	▇▅▂▇▆
racem3	1495	0.00	4.57	1.62	2.00	4.00	5.00	5.00	7.00	▃▁▇▁▂
racecmb	0	1.00	1.81	1.81	1.00	1.00	1.00	2.00	9.00	▇▁▁▁▁
birth_hisp	1340	0.11	1.66	0.93	1.00	1.00	1.00	3.00	3.00	▇▁▁▁▃
income	0	1.00	6.49	2.72	1.00	4.00	7.00	9.00	10.00	▃▅▅▇▇
party	0	1.00	2.64	1.84	1.00	2.00	2.00	3.00	9.00	▇▅▁▁▁
partyln	841	0.44	4.48	3.48	1.00	1.00	2.00	8.00	9.00	▇▁▁▁▆
hh1	0	1.00	2.83	1.89	1.00	2.00	2.00	4.00	9.00	▇▃▁▁▁
hh3	331	0.78	2.72	1.65	1.00	2.00	2.00	3.00	9.00	▇▃▁▁▁
ql1	1202	0.20	1.39	1.52	1.00	1.00	1.00	1.00	9.00	▇▁▁▁▁
ql1a	1475	0.02	4.26	3.73	1.00	1.00	2.00	9.00	9.00	▇▁▁▁▅
qc1	300	0.80	1.83	1.06	1.00	1.00	2.00	2.00	9.00	▇▁▁▁▁
weight	0	1.00	1.00	0.59	0.31	0.56	0.86	1.28	2.46	▇▆▃▁▂
cellweight	300	0.80	1.00	0.51	0.39	0.65	0.86	1.23	2.27	▇▅▂▁▂

Data 2

Introduction and data

Identify the source of the data.
- NYC Open Data
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- The data is provided by the NYPD. It is manually extracted every quarter and reviewed by the Office of Management Analysis and Planning. The dataset was last updated on February 1, 2023, and data is provided starting from January 1, 2020.
Write a brief description of the observations.
- The dataset contains every arrest effected in New York City.
- The 25-44 age group was the most commonly arrested age group in NYC.
- Drug-related arrests were the most common type of arrest.
- Black and Hispanic individuals were arrested at a higher rate than those of other races/ethnicities.
- Of the boroughs of New York, Staten Island had a significantly lower number of arrests, while the other boroughs had relatively similar numbers of arrests.
- Males accounted for almost 70% of arrests.
Address ethical concerns about data, if any
- I do not think there are any major ethical concerns, since the identity of each arrested individual is not disclosed. A record could only be linked to a person by someone who already knows the details of a specific arrest.

Research question

#1: What area of New York can be identified as “most dangerous”? Is there a specific time of day or time of year when it gets more dangerous?
- Why is it important?
  - Safety is always an important issue for citizens. Some areas of the city have a bad reputation for historical reasons, so this question will first identify the areas that can be deemed “dangerous” and then identify the time periods in which criminal activity is more likely.
- Description
  - This research topic will look into location (by coordinates, borough, etc.) as well as time period (hour, day, month, etc.) to look for trends in criminal activity.
- Hypothesis
  - East Harlem is the most dangerous area in New York, and crimes are most prevalent at night on holidays.
- Variables
  - Arrest Date: can be categorical or quantitative, depending on usage
    - categorical when split into month of year (January, February, etc.)
    - quantitative when used to calculate average time of day
  - Perpetrator’s Sex: Categorical
  - Perpetrator’s Race: Categorical
  - Perpetrator’s Age Group: Categorical
  - Precinct of Arrest: Categorical
  - Level of Offense: Categorical
  - PD_CD (three digit classification code): Categorical
  - Latitude, Longitude: Quantitative
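For research question #1, one way to look for location and time-of-year trends is to count arrests per borough and month. The sketch below is illustrative only: it runs on a made-up toy table instead of the real file, and it assumes `ARREST_DATE` is text in MM/DD/YYYY form (the skim output further down shows every value is 10 characters wide) and that `ARREST_BORO` holds one-letter borough codes. Both assumptions should be checked against the data dictionary.

```r
library(dplyr)

# Toy stand-in for arrest_df (made-up rows, not real arrests)
toy_arrests <- tibble(
  ARREST_DATE = c("01/05/2022", "01/20/2022", "02/14/2022", "02/15/2022", "02/28/2022"),
  ARREST_BORO = c("M", "M", "K", "M", "K")
)

# Assumption: ARREST_DATE is a character column formatted as MM/DD/YYYY
arrests_by_month <- toy_arrests %>%
  mutate(
    arrest_date  = as.Date(ARREST_DATE, format = "%m/%d/%Y"),
    arrest_month = as.integer(format(arrest_date, "%m"))
  ) %>%
  # One row per borough-month combination with its arrest count
  count(ARREST_BORO, arrest_month, name = "n_arrests") %>%
  # Largest counts first; ties broken by borough and month for a stable order
  arrange(desc(n_arrests), ARREST_BORO, arrest_month)
```

If the column really contains dates only, an hour-of-day analysis would need a separate time field, which does not appear among the 19 columns read in below.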
#2: How have New York State policies (regarding justice) affected crime rates over time?
- Why is it important?
  - New policies regarding safety are constantly being discussed, but it is also important to evaluate the effectiveness of policies that have been introduced in the past. Policy makers can gain insight from the conclusions of this research and apply it to future policies.

Glimpse of data

# Read the NYPD arrest records
arrest_df <- read_csv("data/NYPD_Arrest_Data/main.csv")

Rows: 189774 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): ARREST_DATE, PD_DESC, OFNS_DESC, LAW_CODE, LAW_CAT_CD, ARREST_BORO...
dbl  (9): ARREST_KEY, PD_CD, KY_CD, ARREST_PRECINCT, JURISDICTION_CODE, X_CO...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

skimr::skim(arrest_df)

Data summary
Name	arrest_df
Number of rows	189774
Number of columns	19
_______________________
Column type frequency:
character	10
numeric	9
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
ARREST_DATE	0	1.00	10	10	365
PD_DESC	0	1.00	6	54	243
OFNS_DESC	0	1.00	4	36	66
LAW_CODE	0	1.00	8	10	1058
LAW_CAT_CD	1747	0.99	1	1	5
ARREST_BORO	0	1.00	1	1	5
AGE_GROUP	0	1.00	3	5	5
PERP_SEX	0	1.00	1	1	2
PERP_RACE	0	1.00	5	30	7
New Georeferenced Column	0	1.00	21	42	34251

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
ARREST_KEY	0	1	247721027.46	5378637.00	238492434.00	243171338.25	247532420.50	252021557.75	261197497.00	▇▇▇▆▁
PD_CD	561	1	406.28	271.33	9.00	113.00	339.00	681.00	997.00	▇▆▃▅▂
KY_CD	570	1	246.62	149.78	101.00	110.00	235.00	344.00	995.00	▇▇▁▁▁
ARREST_PRECINCT	0	1	62.69	35.03	1.00	40.00	61.00	100.00	123.00	▆▇▆▅▇
JURISDICTION_CODE	0	1	0.94	7.94	0.00	0.00	0.00	0.00	97.00	▇▁▁▁▁
X_COORD_CD	0	1	1005207.28	21228.07	913554.00	991203.00	1005040.00	1017119.00	1067185.00	▁▁▇▇▂
Y_COORD_CD	0	1	208693.04	29606.36	121312.00	186655.75	207651.00	236145.75	271909.00	▁▃▇▆▃
Latitude	0	1	40.74	0.08	40.50	40.68	40.74	40.81	40.91	▁▃▇▆▃
Longitude	0	1	-73.92	0.08	-74.25	-73.97	-73.92	-73.88	-73.70	▁▁▇▇▂

Data 3

Introduction and data

Identify the source of the data.
- The Occupational Safety and Health Administration (OSHA)
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- OSHA collected data about work-related injuries and illnesses from employers within specific industry and employment-size specifications. The data covers 2002 to 2011.
Write a brief description of the observations.
- Each row in the data set represents a specific injury case and contains information about age, gender, body part injured, nature of the injury, etc. The largest number of non-fatal injuries was seen in the goods production industry, with falls being the most common injury type. The industries with the fewest reported non-fatal injuries included publishing and marketing. When looking at fatal injuries, some of the most dangerous occupations are in Construction and Transportation. Other dangerous occupations with fatal injuries include protective service occupations, vehicle operators and, surprisingly, management occupations.
Address ethical concerns about data, if any
- There were no ethical concerns about the data. The data was collected legally by the government through employers, businesses and companies.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
- Which occupation is the riskiest in the United States? What are the chances of suffering a fatal or non-fatal injury, depending on the industry you work in? Do other factors like age, gender, and business location play a significant role in determining the risk of an industry, or is the risk more industry-specific?
Why is it important?
- Workplace injuries and fatalities can have a significant impact on families and communities. Identifying the riskiest occupations and industries can help OSHA guide policies and interventions that reduce these incidents. Workplace injuries also have economic effects: they can significantly reduce productivity and add to healthcare costs. Understanding which industries and occupations are most at risk will bring more clarity to business decisions and risk management strategies.
A description of the research topic along with a concise statement of your hypotheses on this topic.
- Description: We will research how rates of injury risk vary across types of occupations and workplace settings in different industries.
- Hypothesis: The Manufacturing division has the highest injury risk rates of any industry division.
Identify the types of variables in your research question. Categorical? Quantitative?
- Year: quantitative
- Address City: categorical
- Address State: categorical
- Address Street: categorical
- Address Zip: categorical (stored as a number, but it identifies a location rather than measuring a quantity)
- Business Name: categorical
- Business Second Name: categorical
- Industry Division: categorical
- Industry ID: categorical (stored as a number, but it is an identifier code)
- Industry Label: categorical
- Industry Major Group: categorical
- Statistics Days Away: quantitative
- Statistics Days Away/Restricted/Transfer: quantitative
- Total Case Rate: quantitative
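To compare industry divisions as the hypothesis proposes, the total case rate could be summarised per division. The following is a minimal sketch on a made-up toy table. The division names and rates in it are invented for illustration, while the two column names follow the ones reported when the file is read in below. Note that column names containing spaces need backticks.

```r
library(dplyr)

# Toy stand-in for the injuries data (division names and rates are made up)
toy_injuries <- tibble(
  industry.division = c("Manufacturing", "Manufacturing", "Construction", "Retail Trade"),
  `statistics.total case rate` = c(10, 6, 9, 3)
)

# Mean and median total case rate per division, highest mean first
risk_by_division <- toy_injuries %>%
  group_by(industry.division) %>%
  summarise(
    n_records = n(),
    mean_total_case_rate = mean(`statistics.total case rate`, na.rm = TRUE),
    median_total_case_rate = median(`statistics.total case rate`, na.rm = TRUE)
  ) %>%
  arrange(desc(mean_total_case_rate))
```

Reporting the median next to the mean may be useful here, since the skim output below shows a strongly right-skewed total case rate (median 6.20, maximum 206.04).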

Glimpse of data

# Read the OSHA work-related injuries data
injuries <- read_csv("data/Work_Related_Injuries/injuries.csv")

Rows: 25010 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): address.city, address.state, address.street, business.name, busines...
dbl (6): year, address.zip, industry.id, statistics.days away, statistics.da...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

skimr::skim(injuries)

Data summary
Name	injuries
Number of rows	25010
Number of columns	14
_______________________
Column type frequency:
character	8
numeric	6
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
address.city	0	1.00	3	21	6852
address.state	0	1.00	2	2	48
address.street	0	1.00	3	50	23573
business.name	0	1.00	3	100	20864
business.second name	14398	0.42	1	53	9077
industry.division	0	1.00	6	68	10
industry.label	0	1.00	4	103	675
industry.major_group	0	1.00	8	110	62

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
year	1	2006.47	2.80	2002	2004.0	2006.00	2009.00	2011.00	▇▇▇▇▇
address.zip	1	47625.61	27103.72	1001	27501.0	46514.00	70037.00	99999.00	▇▇▇▆▅
industry.id	1	4214.99	1870.38	112	3086.0	3625.00	5093.00	9621.00	▁▇▃▁▂
statistics.days away	1	2.45	3.56	0	0.0	1.29	3.50	103.02	▇▁▁▁▁
statistics.days away/restricted/transfer	1	4.72	5.49	0	0.0	3.27	6.96	206.04	▇▁▁▁▁
statistics.total case rate	1	8.01	8.16	0	2.3	6.20	11.43	206.04	▇▁▁▁▁

# Link to the dataset: https://think.cs.vt.edu/corgis/csv/injuries/