Accident severity in New York State

Appendix to report

Data cleaning

INITIAL CLEAN

The original dataset contained all accidents from 49 US states (>2.5 million observations). This dataset was too large to be imported into the rstudio web environment, but the original dataset can be downloaded from https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents.

The initial clean was completed on my local device with the goal to reduce the size of the dataset into something we could work with on the web servers as I could only get it to work on my desktop version of r-studio. The data was filtered to just the NY observations, which trimmed rows from 2.8 million observations to 108K observations. 47 columns were reduced to 24 potentially valuable columns for our analysis. In particular, columns initially removed included ID, a text description of each accident, address information on the crash, timezone, 3 additional parameters describing the time of day of the accident (twilight). Additionally, each crash had a logical variable if a certain traffic infrastructure parameter was present at the crash. The initial data clean choose to limit analysis to the four most abundant types of crashes (stop signs, junctions, crossings, traffic signal). Other POI’s (such as roundabouts, railways, bumps) had ‘true’ values of <<1% of the entire dataset, and were hence excluded from our analysis.

From the initial data clean, the output was saved as a seperate NY accidents csv, of which additional cleaning could be completed after research questions and methods had been established.

SECONDARY CLEAN

Rows: 108049 Columns: 24
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): City, County, State, Zipcode, Airport_Code, Wind_Direction, Weathe...
dbl  (9): Severity, Distance.mi., Temperature.F., Wind_Chill.F., Humidity......
lgl  (4): Crossing, Junction, Stop, Traffic_Signal
dttm (3): Start_Time, End_Time, Weather_Timestamp

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data summary
Name	Accidents_NY
Number of rows	108049
Number of columns	24
_______________________
Column type frequency:
character	8
logical	4
numeric	9
POSIXct	3
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
City	25	1.00	3	22	1053
County	0	1.00	4	14	63
State	0	1.00	2	2	1
Zipcode	2	1.00	5	10	13207
Airport_Code	195	1.00	4	4	55
Wind_Direction	1794	0.98	1	8	24
Weather_Condition	925	0.99	3	28	72
Sunrise_Sunset	119	1.00	3	5	2

Variable type: logical

skim_variable	complete_rate	mean	count
Crossing	1	0.07	FAL: 100844, TRU: 7205
Junction	1	0.16	FAL: 91022, TRU: 17027
Stop	1	0.02	FAL: 105930, TRU: 2119
Traffic_Signal	1	0.11	FAL: 95733, TRU: 12316

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Severity	0	1.00	2.22	0.57	1.00	2.0	2.00	2.00	4.00	▁▇▁▁▁
Distance.mi.	0	1.00	0.83	1.56	0.00	0.1	0.35	0.93	53.31	▇▁▁▁▁
Temperature.F.	807	0.99	54.01	18.43	-77.80	39.0	54.00	70.00	144.00	▁▁▇▇▁
Wind_Chill.F.	17035	0.84	49.28	21.51	-30.40	32.0	51.00	68.00	144.00	▁▆▇▂▁
Humidity…	863	0.99	65.56	20.18	8.00	50.0	66.00	83.00	100.00	▁▅▇▇▇
Pressure.in.	1072	0.99	29.75	0.40	19.75	29.5	29.80	30.03	30.87	▁▁▁▁▇
Visibility.mi.	1093	0.99	9.04	2.72	0.00	10.0	10.00	10.00	30.00	▁▇▁▁▁
Wind_Speed.mph.	4392	0.96	8.89	5.62	0.00	5.0	8.00	12.00	141.50	▇▁▁▁▁
Precipitation.in.	19660	0.82	0.02	0.34	0.00	0.0	0.00	0.00	10.05	▇▁▁▁▁

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
Start_Time	0	1.00	2016-03-23 02:35:03	2021-12-31 23:22:00	2021-01-20 15:24:13	78321
End_Time	0	1.00	2016-03-23 08:35:03	2021-12-31 23:49:05	2021-01-20 20:55:08	93167
Weather_Timestamp	615	0.99	2016-03-23 02:51:00	2021-12-31 23:32:00	2021-01-21 00:53:00	50268

Data summary
Name	accidents_NY_dropped
Number of rows	108049
Number of columns	16
_______________________
Column type frequency:
character	4
logical	4
numeric	6
POSIXct	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
City	25	1.00	3	22	1053
County	0	1.00	4	14	63
Weather_Condition	925	0.99	3	28	72
Sunrise_Sunset	119	1.00	3	5	2

Variable type: logical

skim_variable	complete_rate	mean	count
Crossing	1	0.07	FAL: 100844, TRU: 7205
Junction	1	0.16	FAL: 91022, TRU: 17027
Stop	1	0.02	FAL: 105930, TRU: 2119
Traffic_Signal	1	0.11	FAL: 95733, TRU: 12316

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Severity	0	1.00	2.22	0.57	1.0	2.0	2.00	2.00	4.00	▁▇▁▁▁
Distance.mi.	0	1.00	0.83	1.56	0.0	0.1	0.35	0.93	53.31	▇▁▁▁▁
Temperature.F.	807	0.99	54.01	18.43	-77.8	39.0	54.00	70.00	144.00	▁▁▇▇▁
Visibility.mi.	1093	0.99	9.04	2.72	0.0	10.0	10.00	10.00	30.00	▁▇▁▁▁
Wind_Speed.mph.	4392	0.96	8.89	5.62	0.0	5.0	8.00	12.00	141.50	▇▁▁▁▁
Precipitation.in.	19660	0.82	0.02	0.34	0.0	0.0	0.00	0.00	10.05	▇▁▁▁▁

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
Start_Time	0	1	2016-03-23 02:35:03	2021-12-31 23:22:00	2021-01-20 15:24:13	78321
End_Time	0	1	2016-03-23 08:35:03	2021-12-31 23:49:05	2021-01-20 20:55:08	93167

We decided to drop the following columns because they would not be useful for answering our research question(s) : state, zipcode, airport_code, weather_timestamp, wind_chill, humidity, pressure, wind_direction. We did this by selecting every column except these. The state variable is not important because the data is only for NY state, zipcode is not important because we could just use counties, and the rest of the dropped variables are insignificant because they should not have an effect on driving and thus accidents.

Data summary
Name	accidents_NY_replace_null…
Number of rows	107124
Number of columns	16
_______________________
Column type frequency:
character	4
logical	4
numeric	6
POSIXct	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
City	25	1	3	22	1036
County	0	1	4	14	63
Weather_Condition	0	1	3	28	72
Sunrise_Sunset	101	1	3	5	2

Variable type: logical

skim_variable	complete_rate	mean	count
Crossing	1	0.07	FAL: 100009, TRU: 7115
Junction	1	0.16	FAL: 90268, TRU: 16856
Stop	1	0.02	FAL: 105039, TRU: 2085
Traffic_Signal	1	0.11	FAL: 94918, TRU: 12206

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Severity	0	1.00	2.22	0.57	1	2.0	2.00	2.00	4.00	▁▇▁▁▁
Distance.mi.	0	1.00	0.82	1.55	0	0.1	0.35	0.93	53.31	▇▁▁▁▁
Temperature.F.	137	1.00	54.01	18.43	-19	39.0	54.00	70.00	144.00	▁▆▇▂▁
Visibility.mi.	351	1.00	9.04	2.72	0	10.0	10.00	10.00	30.00	▁▇▁▁▁
Wind_Speed.mph.	3690	0.97	8.89	5.62	0	5.0	8.00	12.00	141.50	▇▁▁▁▁
Precipitation.in.	1	1.00	0.24	0.92	0	0.0	0.00	0.00	10.05	▇▁▁▁▁

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
Start_Time	0	1	2016-03-23 02:35:03	2021-12-31 23:22:00	2021-01-21 17:47:30	77683
End_Time	0	1	2016-03-23 08:35:03	2021-12-31 23:49:05	2021-01-21 21:12:37	92430

After dropping these variables, we have 16 columns left and still have 108,049 rows. The only column with a concerning amount of nulls is the Precipitation.in. column. Because amount of precipitation depends on the weather, we decided to group by the weather type and fill in the null values for precipitation.in. with the mean of precipitation for each weather condition. For example, if a row with a null in the precipitation column had a weather type of light rain, the null value would be replaced by the mean precipitation amount for the light rain group. Before doing this, we dropped the rows with null values in the weather condition column, which is not an issue because there were only nulls in the weather condition column for .85% percent of the entire data frame. This left us with 107,124 rows.

Data summary
Name	cleaned_accidents_NY
Number of rows	102961
Number of columns	16
_______________________
Column type frequency:
character	4
logical	4
numeric	6
POSIXct	2
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
City	1	3	22	996
County	1	4	14	63
Weather_Condition	1	3	28	71
Sunrise_Sunset	1	3	5	2

Variable type: logical

skim_variable	complete_rate	mean	count
Crossing	1	0.07	FAL: 96242, TRU: 6719
Junction	1	0.16	FAL: 86899, TRU: 16062
Stop	1	0.02	FAL: 100965, TRU: 1996
Traffic_Signal	1	0.11	FAL: 91339, TRU: 11622

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Severity	1	2.20	0.56	1	2.00	2.00	2.00	4.00	▁▇▁▁▁
Distance.mi.	1	0.82	1.53	0	0.09	0.35	0.93	53.31	▇▁▁▁▁
Temperature.F.	1	53.96	18.40	-19	39.00	54.00	69.80	144.00	▁▆▇▂▁
Visibility.mi.	1	9.04	2.72	0	10.00	10.00	10.00	30.00	▁▇▁▁▁
Wind_Speed.mph.	1	8.90	5.62	0	5.00	8.00	12.70	141.50	▇▁▁▁▁
Precipitation.in.	1	0.20	0.84	0	0.00	0.00	0.00	10.02	▇▁▁▁▁

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
Start_Time	0	1	2016-03-23 02:35:03	2021-12-31 23:22:00	2021-01-28 22:04:00	74512
End_Time	0	1	2016-03-23 08:35:03	2021-12-31 23:49:05	2021-01-29 01:14:18	88958

After this, we dropped the remaining rows that contain null values because they only accounted for 3.7% of all of the data. After the first cleaning, we had 108,049 rows, and we ended up with 102,961 rows after the cleaning process. Therefore, we only lost about 4.7% of the data during the cleaning process, which is less than 5%.

Other appendicies (as necessary)