Accident severity in New York State

Appendix to report

Data cleaning

INITIAL CLEAN

The original dataset contained all accidents from 49 US states (>2.5 million observations). This dataset was too large to be imported into the rstudio web environment, but the original dataset can be downloaded from https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents.

The initial clean was completed on my local device with the goal to reduce the size of the dataset into something we could work with on the web servers as I could only get it to work on my desktop version of r-studio. The data was filtered to just the NY observations, which trimmed rows from 2.8 million observations to 108K observations. 47 columns were reduced to 24 potentially valuable columns for our analysis. In particular, columns initially removed included ID, a text description of each accident, address information on the crash, timezone, 3 additional parameters describing the time of day of the accident (twilight). Additionally, each crash had a logical variable if a certain traffic infrastructure parameter was present at the crash. The initial data clean choose to limit analysis to the four most abundant types of crashes (stop signs, junctions, crossings, traffic signal). Other POI’s (such as roundabouts, railways, bumps) had ‘true’ values of <<1% of the entire dataset, and were hence excluded from our analysis.

From the initial data clean, the output was saved as a seperate NY accidents csv, of which additional cleaning could be completed after research questions and methods had been established.

SECONDARY CLEAN

Rows: 108049 Columns: 24
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): City, County, State, Zipcode, Airport_Code, Wind_Direction, Weathe...
dbl  (9): Severity, Distance.mi., Temperature.F., Wind_Chill.F., Humidity......
lgl  (4): Crossing, Junction, Stop, Traffic_Signal
dttm (3): Start_Time, End_Time, Weather_Timestamp

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Data summary
Name Accidents_NY
Number of rows 108049
Number of columns 24
_______________________
Column type frequency:
character 8
logical 4
numeric 9
POSIXct 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
City 25 1.00 3 22 0 1053 0
County 0 1.00 4 14 0 63 0
State 0 1.00 2 2 0 1 0
Zipcode 2 1.00 5 10 0 13207 0
Airport_Code 195 1.00 4 4 0 55 0
Wind_Direction 1794 0.98 1 8 0 24 0
Weather_Condition 925 0.99 3 28 0 72 0
Sunrise_Sunset 119 1.00 3 5 0 2 0

Variable type: logical

skim_variable n_missing complete_rate mean count
Crossing 0 1 0.07 FAL: 100844, TRU: 7205
Junction 0 1 0.16 FAL: 91022, TRU: 17027
Stop 0 1 0.02 FAL: 105930, TRU: 2119
Traffic_Signal 0 1 0.11 FAL: 95733, TRU: 12316

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Severity 0 1.00 2.22 0.57 1.00 2.0 2.00 2.00 4.00 ▁▇▁▁▁
Distance.mi. 0 1.00 0.83 1.56 0.00 0.1 0.35 0.93 53.31 ▇▁▁▁▁
Temperature.F. 807 0.99 54.01 18.43 -77.80 39.0 54.00 70.00 144.00 ▁▁▇▇▁
Wind_Chill.F. 17035 0.84 49.28 21.51 -30.40 32.0 51.00 68.00 144.00 ▁▆▇▂▁
Humidity… 863 0.99 65.56 20.18 8.00 50.0 66.00 83.00 100.00 ▁▅▇▇▇
Pressure.in. 1072 0.99 29.75 0.40 19.75 29.5 29.80 30.03 30.87 ▁▁▁▁▇
Visibility.mi. 1093 0.99 9.04 2.72 0.00 10.0 10.00 10.00 30.00 ▁▇▁▁▁
Wind_Speed.mph. 4392 0.96 8.89 5.62 0.00 5.0 8.00 12.00 141.50 ▇▁▁▁▁
Precipitation.in. 19660 0.82 0.02 0.34 0.00 0.0 0.00 0.00 10.05 ▇▁▁▁▁

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
Start_Time 0 1.00 2016-03-23 02:35:03 2021-12-31 23:22:00 2021-01-20 15:24:13 78321
End_Time 0 1.00 2016-03-23 08:35:03 2021-12-31 23:49:05 2021-01-20 20:55:08 93167
Weather_Timestamp 615 0.99 2016-03-23 02:51:00 2021-12-31 23:32:00 2021-01-21 00:53:00 50268
Data summary
Name accidents_NY_dropped
Number of rows 108049
Number of columns 16
_______________________
Column type frequency:
character 4
logical 4
numeric 6
POSIXct 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
City 25 1.00 3 22 0 1053 0
County 0 1.00 4 14 0 63 0
Weather_Condition 925 0.99 3 28 0 72 0
Sunrise_Sunset 119 1.00 3 5 0 2 0

Variable type: logical

skim_variable n_missing complete_rate mean count
Crossing 0 1 0.07 FAL: 100844, TRU: 7205
Junction 0 1 0.16 FAL: 91022, TRU: 17027
Stop 0 1 0.02 FAL: 105930, TRU: 2119
Traffic_Signal 0 1 0.11 FAL: 95733, TRU: 12316

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Severity 0 1.00 2.22 0.57 1.0 2.0 2.00 2.00 4.00 ▁▇▁▁▁
Distance.mi. 0 1.00 0.83 1.56 0.0 0.1 0.35 0.93 53.31 ▇▁▁▁▁
Temperature.F. 807 0.99 54.01 18.43 -77.8 39.0 54.00 70.00 144.00 ▁▁▇▇▁
Visibility.mi. 1093 0.99 9.04 2.72 0.0 10.0 10.00 10.00 30.00 ▁▇▁▁▁
Wind_Speed.mph. 4392 0.96 8.89 5.62 0.0 5.0 8.00 12.00 141.50 ▇▁▁▁▁
Precipitation.in. 19660 0.82 0.02 0.34 0.0 0.0 0.00 0.00 10.05 ▇▁▁▁▁

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
Start_Time 0 1 2016-03-23 02:35:03 2021-12-31 23:22:00 2021-01-20 15:24:13 78321
End_Time 0 1 2016-03-23 08:35:03 2021-12-31 23:49:05 2021-01-20 20:55:08 93167

We decided to drop the following columns because they would not be useful for answering our research question(s) : state, zipcode, airport_code, weather_timestamp, wind_chill, humidity, pressure, wind_direction. We did this by selecting every column except these. The state variable is not important because the data is only for NY state, zipcode is not important because we could just use counties, and the rest of the dropped variables are insignificant because they should not have an effect on driving and thus accidents.

Data summary
Name accidents_NY_replace_null…
Number of rows 107124
Number of columns 16
_______________________
Column type frequency:
character 4
logical 4
numeric 6
POSIXct 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
City 25 1 3 22 0 1036 0
County 0 1 4 14 0 63 0
Weather_Condition 0 1 3 28 0 72 0
Sunrise_Sunset 101 1 3 5 0 2 0

Variable type: logical

skim_variable n_missing complete_rate mean count
Crossing 0 1 0.07 FAL: 100009, TRU: 7115
Junction 0 1 0.16 FAL: 90268, TRU: 16856
Stop 0 1 0.02 FAL: 105039, TRU: 2085
Traffic_Signal 0 1 0.11 FAL: 94918, TRU: 12206

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Severity 0 1.00 2.22 0.57 1 2.0 2.00 2.00 4.00 ▁▇▁▁▁
Distance.mi. 0 1.00 0.82 1.55 0 0.1 0.35 0.93 53.31 ▇▁▁▁▁
Temperature.F. 137 1.00 54.01 18.43 -19 39.0 54.00 70.00 144.00 ▁▆▇▂▁
Visibility.mi. 351 1.00 9.04 2.72 0 10.0 10.00 10.00 30.00 ▁▇▁▁▁
Wind_Speed.mph. 3690 0.97 8.89 5.62 0 5.0 8.00 12.00 141.50 ▇▁▁▁▁
Precipitation.in. 1 1.00 0.24 0.92 0 0.0 0.00 0.00 10.05 ▇▁▁▁▁

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
Start_Time 0 1 2016-03-23 02:35:03 2021-12-31 23:22:00 2021-01-21 17:47:30 77683
End_Time 0 1 2016-03-23 08:35:03 2021-12-31 23:49:05 2021-01-21 21:12:37 92430

After dropping these variables, we have 16 columns left and still have 108,049 rows. The only column with a concerning amount of nulls is the Precipitation.in. column. Because amount of precipitation depends on the weather, we decided to group by the weather type and fill in the null values for precipitation.in. with the mean of precipitation for each weather condition. For example, if a row with a null in the precipitation column had a weather type of light rain, the null value would be replaced by the mean precipitation amount for the light rain group. Before doing this, we dropped the rows with null values in the weather condition column, which is not an issue because there were only nulls in the weather condition column for .85% percent of the entire data frame. This left us with 107,124 rows.

Data summary
Name cleaned_accidents_NY
Number of rows 102961
Number of columns 16
_______________________
Column type frequency:
character 4
logical 4
numeric 6
POSIXct 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
City 0 1 3 22 0 996 0
County 0 1 4 14 0 63 0
Weather_Condition 0 1 3 28 0 71 0
Sunrise_Sunset 0 1 3 5 0 2 0

Variable type: logical

skim_variable n_missing complete_rate mean count
Crossing 0 1 0.07 FAL: 96242, TRU: 6719
Junction 0 1 0.16 FAL: 86899, TRU: 16062
Stop 0 1 0.02 FAL: 100965, TRU: 1996
Traffic_Signal 0 1 0.11 FAL: 91339, TRU: 11622

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Severity 0 1 2.20 0.56 1 2.00 2.00 2.00 4.00 ▁▇▁▁▁
Distance.mi. 0 1 0.82 1.53 0 0.09 0.35 0.93 53.31 ▇▁▁▁▁
Temperature.F. 0 1 53.96 18.40 -19 39.00 54.00 69.80 144.00 ▁▆▇▂▁
Visibility.mi. 0 1 9.04 2.72 0 10.00 10.00 10.00 30.00 ▁▇▁▁▁
Wind_Speed.mph. 0 1 8.90 5.62 0 5.00 8.00 12.70 141.50 ▇▁▁▁▁
Precipitation.in. 0 1 0.20 0.84 0 0.00 0.00 0.00 10.02 ▇▁▁▁▁

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
Start_Time 0 1 2016-03-23 02:35:03 2021-12-31 23:22:00 2021-01-28 22:04:00 74512
End_Time 0 1 2016-03-23 08:35:03 2021-12-31 23:49:05 2021-01-29 01:14:18 88958

After this, we dropped the remaining rows that contain null values because they only accounted for 3.7% of all of the data. After the first cleaning, we had 108,049 rows, and we ended up with 102,961 rows after the cleaning process. Therefore, we only lost about 4.7% of the data during the cleaning process, which is less than 5%.

Other appendicies (as necessary)