Project title

Proposal

library(tidyverse)
library(skimr)

Data 1

Introduction and data

Identify the source of the data.

I found this dataset on Kaggle.com.

State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

The data was collected for use in two papers: Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath, “A Countrywide Traffic Accident Dataset,” 2019; and Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath, “Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights.” It has since been updated to cover through the year 2021.

The data covers 49 US states and is compiled from various law enforcement agencies, traffic cameras, traffic monitors, and the Department of Transportation. It spans the years 2016 to 2021. Because the full file was too large to push, I opted to use the same data but only for the state of New York.
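
The New York subset described above could be produced along the following lines. This is only a sketch: it assumes the full file has a `State` column holding two-letter codes (as the `nyacc` summary suggests), and the helper name `subset_state` and the toy rows are hypothetical.

```r
# Hypothetical helper: keep only the rows for one state.
# which() drops rows where State is NA instead of returning NA rows.
subset_state <- function(accidents, state_code) {
  accidents[which(accidents$State == state_code), , drop = FALSE]
}

# Toy stand-in for the national file (made-up rows, for illustration only).
toy_accidents <- data.frame(
  ID = c("A-1", "A-2", "A-3"),
  State = c("NY", "OH", "NY"),
  Severity = c(2, 3, 2)
)

ny_only <- subset_state(toy_accidents, "NY")
nrow(ny_only)

# With the real files this might look like:
# usacc <- read.csv("data/US_Accidents_Dec21_updated.csv")
# write.csv(subset_state(usacc, "NY"), "data/accidents_ny.csv")
```

If `write.csv()` is left at its default `row.names = TRUE`, the original row numbers are written as an extra column, which may be where the `X` column in the summary comes from.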

Write a brief description of the observations.

Each observation represents a traffic accident in the United States and contains information about the accident such as how much it impacted traffic, time of day, location, and the weather conditions at the time of the accident.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

What factors (time of day, weather, location) impact the severity of a traffic delay caused by an accident?

Are there some areas of the country that have more accidents than average?

Are accidents more common near junctions, rotaries, speed bumps, etc.?

A description of the research topic along with a concise statement of your hypotheses on this topic.

Understanding which factors are associated with higher numbers of accidents, and what may cause an accident to delay traffic for longer, could provide important insights for making our streets safer and for helping transportation and GPS companies plan the best routes for their customers. I hypothesize that wet, cold weather, as well as junctions, would be correlated with longer traffic delays. I also expect that traffic delays in urban areas would be longer, as it takes more time for the accident to be cleaned up and there would be more people in the area, resulting in more traffic.

Identify the types of variables in your research question. Categorical? Quantitative?

Presence of precipitation, whether the area is categorized as urban, suburban, or rural, and the presence of a junction or another traffic feature would all be categorical information. Quantitative information would include temperature, visibility, and how long it took for the accident to be cleaned up.
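
As a rough sketch of how two of these variables might be derived, the code below computes a duration from `Start_Time` and `End_Time` and a precipitation flag from `Precipitation.in.`. Those column names appear in the data summary; treating the interval between start and end time as a proxy for cleanup time is an assumption, and the toy values and the helper `parse_time` are made up for illustration.

```r
# Toy rows using column names from the data summary (values are made up).
toy_times <- data.frame(
  Start_Time = c("2021-01-05 08:00:00", "2021-01-05 17:30:00.000000000"),
  End_Time = c("2021-01-05 08:45:00", "2021-01-05 19:00:00.000000000"),
  Precipitation.in. = c(0, 0.12)
)

# Parse the first 19 characters as a timestamp; trailing characters
# beyond the format (such as fractional seconds) are ignored.
parse_time <- function(x) {
  as.POSIXct(x, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
}

# Quantitative: minutes between the recorded start and end of the accident.
toy_times$duration_min <- as.numeric(
  difftime(parse_time(toy_times$End_Time),
           parse_time(toy_times$Start_Time),
           units = "mins")
)

# Categorical: was any precipitation recorded?
toy_times$any_precip <- toy_times$Precipitation.in. > 0

toy_times[, c("duration_min", "any_precip")]
```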

Glimpse of data

#usacc <- read.csv("data/US_Accidents_Dec21_updated.csv")
#head(usacc)
nyacc <- read.csv("data/accidents_ny.csv")
skimr::skim(nyacc)

Data summary
Name	nyacc
Number of rows	108049
Number of columns	48
_______________________
Column type frequency:
character	33
numeric	15
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	empty	n_unique
ID	1	7	9	0	108049
Start_Time	1	19	29	0	83596
End_Time	1	19	29	0	95972
Description	1	14	431	0	58190
Street	1	4	50	0	6729
Side	1	1	1	0	2
City	1	0	22	25	1054
County	1	4	14	0	63
State	1	2	2	0	1
Zipcode	1	0	10	2	13208
Country	1	2	2	0	1
Timezone	1	0	10	2	2
Airport_Code	1	0	4	195	56
Weather_Timestamp	1	0	19	615	50269
Wind_Direction	1	0	8	1794	25
Weather_Condition	1	0	28	925	73
Amenity	1	4	5	0	2
Bump	1	4	5	0	2
Crossing	1	4	5	0	2
Give_Way	1	4	5	0	2
Junction	1	4	5	0	2
No_Exit	1	4	5	0	2
Railway	1	4	5	0	2
Roundabout	1	4	5	0	2
Station	1	4	5	0	2
Stop	1	4	5	0	2
Traffic_Calming	1	4	5	0	2
Traffic_Signal	1	4	5	0	2
Turning_Loop	1	5	5	0	1
Sunrise_Sunset	1	0	5	119	3
Civil_Twilight	1	0	5	119	3
Nautical_Twilight	1	0	5	119	3
Astronomical_Twilight	1	0	5	119	3

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
X	0	1.00	1373813.62	841120.31	31580.00	635660.00	1362727.00	2078972.00	2844975.00	▇▆▆▇▆
Severity	0	1.00	2.22	0.57	1.00	2.00	2.00	2.00	4.00	▁▇▁▁▁
Start_Lat	0	1.00	41.69	1.05	40.51	40.77	41.06	42.90	45.00	▇▁▅▁▁
Start_Lng	0	1.00	-74.70	1.63	-79.76	-75.18	-73.92	-73.78	-71.94	▁▁▁▇▁
End_Lat	0	1.00	41.69	1.05	40.51	40.77	41.06	42.90	44.99	▇▁▅▁▁
End_Lng	0	1.00	-74.70	1.63	-79.76	-75.18	-73.92	-73.79	-71.94	▁▁▁▇▁
Distance.mi.	0	1.00	0.83	1.56	0.00	0.10	0.35	0.93	53.31	▇▁▁▁▁
Number	80427	0.26	2255.54	3640.85	1.00	217.00	900.00	2718.00	60725.00	▇▁▁▁▁
Temperature.F.	807	0.99	54.01	18.43	-77.80	39.00	54.00	70.00	144.00	▁▁▇▇▁
Wind_Chill.F.	17035	0.84	49.28	21.51	-30.40	32.00	51.00	68.00	144.00	▁▆▇▂▁
Humidity...	863	0.99	65.56	20.18	8.00	50.00	66.00	83.00	100.00	▁▅▇▇▇
Pressure.in.	1072	0.99	29.75	0.40	19.75	29.50	29.80	30.03	30.87	▁▁▁▁▇
Visibility.mi.	1093	0.99	9.04	2.72	0.00	10.00	10.00	10.00	30.00	▁▇▁▁▁
Wind_Speed.mph.	4392	0.96	8.89	5.62	0.00	5.00	8.00	12.00	141.50	▇▁▁▁▁
Precipitation.in.	19660	0.82	0.02	0.34	0.00	0.00	0.00	0.00	10.05	▇▁▁▁▁
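
The summary shows the road-feature columns such as `Junction` stored as character with lengths of 4 to 5, which is consistent with the strings "True" and "False" (an assumption, since the raw values are not shown). It also counts empty strings separately from missing values. A minimal cleanup sketch on made-up rows:

```r
# Toy rows (made up); only the column names come from the summary above.
toy_flags <- data.frame(
  Junction = c("True", "False", "False", "True"),
  Weather_Condition = c("Rain", "", "Clear", ""),
  Severity = c(3, 2, 2, 4),
  stringsAsFactors = FALSE
)

# Treat empty strings as missing so they are counted as NA.
toy_flags$Weather_Condition[toy_flags$Weather_Condition == ""] <- NA

# Convert the text flag to a logical column (assumes "True"/"False" strings).
toy_flags$Junction <- toy_flags$Junction == "True"

# Mean severity for accidents away from and near a junction.
severity_by_junction <- tapply(toy_flags$Severity, toy_flags$Junction, mean)
severity_by_junction
```

When reading the real file, passing `na.strings = c("NA", "")` to `read.csv()` would apply the same empty-string handling to every column at once.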

# add code here

Data 2

Introduction and data

Identify the source of the data.

I found the data on Kaggle.com. The first dataset, Years of World Education by Country, is from https://ourworldindata.org/global-education, and the second dataset, the World Happiness Report, is from the Gallup World Poll.

State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

The Years of World Education by Country dataset covers 1870 to 2017; the source does not state how the data was collected. The World Happiness Report dataset was most recently collected in 2017, and its authors drew it from the Gallup World Poll.

Write a brief description of the observations.

Each observation in the Years of World Education by Country dataset represents one country in one year, from the first year for which that country has data through 2017 (for example, the records for Afghanistan begin in 1870), and gives the average years of schooling in that year.

Each observation in the World Happiness Report represents a country and its happiness data for 2017: the region in which it is located, its happiness rank, happiness score, standard error, economy, family, health, freedom, and government trust, all of which influence the happiness score and ranking.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

How does the change in the average years of education for countries in the world relate to the happiness level of the countries?

A description of the research topic along with a concise statement of your hypotheses on this topic.

I have always wondered what effect education has on a country’s happiness level. Are highly educated countries more or less happy with their country and their well-being? I believe that this topic is important because it can help countries around the world decide whether they need to put more funding into educating their children in order to have a happier country. In this research, I hope to track how education corresponds to happiness in countries around the world. However, I believe that education is not the only factor in a country’s happiness level, so I expect little correlation between education and happiness level.

Identify the types of variables in your research question. Categorical? Quantitative?

The variables I will be using are country, year, and average years of schooling from the Years of World Education by Country dataset, and the country and happiness score from the World Happiness Report dataset. The variables are classified as follows: country (categorical), year (quantitative), average years of schooling in 2017 (quantitative), happiness score (quantitative).
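
One way the two datasets might be combined is sketched below: keep the 2017 schooling rows, join them to the happiness scores by country, and compute a correlation. The column names (`country`, `year`, `avg_years_schooling`, `happiness_score`) and the values are hypothetical; the real files may use different names and country spellings.

```r
# Toy stand-ins for the two files (hypothetical columns, made-up values).
schooling_toy <- data.frame(
  country = c("A", "A", "B", "B", "C"),
  year = c(2010, 2017, 2010, 2017, 2017),
  avg_years_schooling = c(5.0, 6.0, 9.5, 10.0, 12.0)
)
happiness_toy <- data.frame(
  country = c("A", "B", "C", "D"),
  happiness_score = c(4.1, 5.6, 7.0, 6.2)
)

# Keep only the year that matches the happiness report.
schooling_2017 <- schooling_toy[schooling_toy$year == 2017, ]

# Inner join: countries missing from either table are dropped.
joined <- merge(schooling_2017, happiness_toy, by = "country")

# Pearson correlation between schooling and happiness.
schooling_happiness_cor <- cor(joined$avg_years_schooling, joined$happiness_score)
schooling_happiness_cor
```

Because `merge()` performs an inner join by default, any country whose name is spelled differently in the two files would silently drop out, so the row count of `joined` is worth checking.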

Glimpse of data

# add code here
#schooling <- read.csv("data/mean-years-of-schooling-long-run.csv")
#happiness <- read.csv("data/2017.csv")
#skimr::skim(schooling)
#skimr::skim(happiness)

Data 3

Introduction and data

Identify the source of the data.

I found this data set through the FiveThirtyEight data page. FiveThirtyEight is a statistics-driven news platform, and to be transparent about the reasoning behind their opinions, they publish on their website all of the data sets that they have organized and used. I found this specific data set on their site, downloaded the CSV file, and imported it into R.

State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

This data was recently uploaded by FiveThirtyEight; to be exact, it was added just 2 days ago. However, it contains details from the 2018 election to the present day and has a trend line to show the variation between the years. The data science team at FiveThirtyEight collected this data from generic online pollsters, such as Morning Consult, McLaughlin & Associates, and YouGov. For each observation within the data set, they cite where it comes from and give other metadata, such as the date, time, and method of collection.

Write a brief description of the observations.

Each observation is very detailed and contains all of the critical components of a polling result. For each observation, the data set gives the start date and end date of the poll, the pollster that conducted it, the sample size, and the URL of that exact poll for further verification. There are 1222 observations, aggregated from 2020 to 2022.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

What party do active voters hope to see in Congress, and how has that changed since 2018?
A description of the research topic along with a concise statement of your hypotheses on this topic.

This research topic would help us understand which party our country wants to see in power and how divided our nation has become. It would be incredibly fascinating to see how this varies based on one’s background and location and to analyze it at a national level. My hypothesis: Given the current political climate, our nation is quite split on which party it hopes to see in Congress, and neither party is clearly favored over the other.

Identify the types of variables in your research question. Categorical? Quantitative?

The variables in our research would be the pollster, the ‘grade’, the start date, sample size, and the adjusted metric. Most of these are quantitative, except the pollster and the grade, which are categorical and could be used to classify the data while performing analysis. These could help us dissect the data at first, and the quantitative measures could help us understand how the variables correlate with one another.
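
To illustrate how the start date, sample size, and adjusted metric might be used together, the sketch below computes a sample-size-weighted average per year. The column names (`start_date`, `sample_size`, `adjusted_share`) and the values are hypothetical, and weighting by sample size is just one possible choice.

```r
# Toy polls (hypothetical columns, made-up values).
polls_toy <- data.frame(
  start_date = as.Date(c("2020-03-01", "2020-09-15", "2021-02-10", "2021-11-20")),
  sample_size = c(1000, 3000, 2000, 2000),
  adjusted_share = c(48, 50, 46, 44)
)

# Extract the calendar year of each poll.
polls_toy$year <- format(polls_toy$start_date, "%Y")

# Average the adjusted share within each year, weighting by sample size.
yearly_share <- sapply(split(polls_toy, polls_toy$year), function(d) {
  weighted.mean(d$adjusted_share, d$sample_size)
})
yearly_share
```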

Glimpse of data

# add code here

#generic_polllist <- read.csv("data/generic_polllist.csv")
#skimr::skim(generic_polllist)

Data 4

Introduction and data

Identify the source of the data.

This dataset was found on CORGIS (The Collection of Really Great, Interesting, Situated Datasets).

State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

This data was created on September 15, 2021 by Dennis Kafura, Joung Min Choi, and Bo Guan. It records every fatal shooting by a police officer in the United States since January 1st, 2015. The records were compiled through research of news reports, law enforcement websites, and social media.

Write a brief description of the observations.

There are 16 columns and 6570 observations. Each observation gives the victim’s name, age, gender, and race, as well as details about their death. This includes where and when the incident occurred, how it occurred, and various other details.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

What factors (gender, race, threat level, political identification, etc.) tend to be the driving factors that lead to a victim’s death in a police shooting?

A description of the research topic along with a concise statement of your hypotheses on this topic.

The research topic examines the motives behind police shooting fatalities. This will help us see what action our country can take to reduce this number and to correct biases that may affect an officer’s decision on whether to shoot.

My hypothesis is that many police officers are power driven and are racist against people with darker skin. When they see a potential criminal, they immediately associate that person’s danger level with skin color. Additionally, I think that the location will have an impact, because more right-wing areas may be less inclusive than left-wing areas, which creates more bias.

Identify the types of variables in your research question. Categorical? Quantitative?

Most of the columns/variables in the dataset are categorical, including race, gender, location, and threat level; age is quantitative. The research question will take these categorical variables into account to see how different categories relate to police shooting fatalities.
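
Since the analysis would mostly compare categories, a cross-tabulation is a natural first step. The sketch below uses hypothetical column names (`race`, `threat_level`) and made-up placeholder values, not rows from the dataset.

```r
# Toy records (hypothetical columns, placeholder category labels).
shootings_toy <- data.frame(
  race = c("A", "B", "A", "C", "B", "A"),
  threat_level = c("attack", "other", "attack", "other", "attack", "other")
)

# Count the records in each race by threat-level combination.
counts <- table(shootings_toy$race, shootings_toy$threat_level)

# Convert the counts to within-row shares (each row sums to 1).
row_shares <- prop.table(counts, margin = 1)

counts
row_shares
```

Counts like these describe only the recorded incidents; comparing groups fairly would also need a population baseline, which the dataset as described does not include.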

Glimpse of data

# add code here
#police_shootings <- read.csv("data/police_shootings.csv")
#skimr::skim(police_shootings)

Data 5

Introduction and data

Identify the source of the data.
- This dataset was found on CORGIS (The Collection of Really Great, Interesting, Situated Datasets).
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- This data was created on March 13th, 2017 by Austin Cory Bart. The data collected is the sales and playtime of video games that were released from 2004 to 2010 (a 6-year range). The playtime data points come from crowd-sourced data on “How Long to Beat”.
Write a brief description of the observations.
- Each observation in the data set describes one video game, specifically how quickly the game was completed and how many copies it sold. An interesting data point that is listed is the rating for the game. My hypothesis is that the rating and the time that needs to be invested in the game will affect the sales. Specifically, the rating would broaden the pool of consumers (particularly in age). Additionally, I think we could observe whether there is a correlation between the amount of time spent and the ESRB rating: are games that are meant for an older audience typically longer to play?

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

How does the ESRB rating of a video game relate to the amount of time put into the game (playtime)?

A description of the research topic along with a concise statement of your hypotheses on this topic.

I wonder if there is any correlation between the ESRB ratings and the number of hours a person typically invests in a game. I wonder if games for an older audience would have significantly more playtime than games that are for a younger audience. One thing that makes me think this is that kids get bored more easily. How do we make a video game marketable for a child, and what factors incline them to “finish” the game or invest more hours in it?

Identify the types of variables in your research question. Categorical? Quantitative?

The variables in my research would be the rating, name of game, genre, sales, and mean time of players who have completed the game. Rating, name, and genre are categorical; sales and mean completion time are quantitative.
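
A first look at the research question could compare typical playtime across rating groups. The sketch below uses hypothetical column names (`rating`, `completion_hours`) and made-up values; the median is used here only because playtime is likely to be skewed, which is an assumption.

```r
# Toy games (hypothetical columns, made-up values).
games_toy <- data.frame(
  rating = c("E", "E", "T", "T", "M", "M"),
  completion_hours = c(6, 10, 12, 20, 15, 25)
)

# Median completion time within each ESRB rating group.
playtime_by_rating <- aggregate(
  completion_hours ~ rating,
  data = games_toy,
  FUN = median
)
playtime_by_rating
```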

Glimpse of data

# add code here
#video_games <- read.csv("data/video_games.csv")
#skimr::skim(video_games)