Accident severity in New York State
Appendix to report
Data cleaning
INITIAL CLEAN
The original dataset contained all accidents from 49 US states (>2.5 million observations). This dataset was too large to be imported into the rstudio web environment, but the original dataset can be downloaded from https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents.
The initial clean was completed on my local device with the goal to reduce the size of the dataset into something we could work with on the web servers as I could only get it to work on my desktop version of r-studio. The data was filtered to just the NY observations, which trimmed rows from 2.8 million observations to 108K observations. 47 columns were reduced to 24 potentially valuable columns for our analysis. In particular, columns initially removed included ID, a text description of each accident, address information on the crash, timezone, 3 additional parameters describing the time of day of the accident (twilight). Additionally, each crash had a logical variable if a certain traffic infrastructure parameter was present at the crash. The initial data clean choose to limit analysis to the four most abundant types of crashes (stop signs, junctions, crossings, traffic signal). Other POI’s (such as roundabouts, railways, bumps) had ‘true’ values of <<1% of the entire dataset, and were hence excluded from our analysis.
From the initial data clean, the output was saved as a seperate NY accidents csv, of which additional cleaning could be completed after research questions and methods had been established.
SECONDARY CLEAN
Rows: 108049 Columns: 24
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): City, County, State, Zipcode, Airport_Code, Wind_Direction, Weathe...
dbl (9): Severity, Distance.mi., Temperature.F., Wind_Chill.F., Humidity......
lgl (4): Crossing, Junction, Stop, Traffic_Signal
dttm (3): Start_Time, End_Time, Weather_Timestamp
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Name | Accidents_NY |
Number of rows | 108049 |
Number of columns | 24 |
_______________________ | |
Column type frequency: | |
character | 8 |
logical | 4 |
numeric | 9 |
POSIXct | 3 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
City | 25 | 1.00 | 3 | 22 | 0 | 1053 | 0 |
County | 0 | 1.00 | 4 | 14 | 0 | 63 | 0 |
State | 0 | 1.00 | 2 | 2 | 0 | 1 | 0 |
Zipcode | 2 | 1.00 | 5 | 10 | 0 | 13207 | 0 |
Airport_Code | 195 | 1.00 | 4 | 4 | 0 | 55 | 0 |
Wind_Direction | 1794 | 0.98 | 1 | 8 | 0 | 24 | 0 |
Weather_Condition | 925 | 0.99 | 3 | 28 | 0 | 72 | 0 |
Sunrise_Sunset | 119 | 1.00 | 3 | 5 | 0 | 2 | 0 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
Crossing | 0 | 1 | 0.07 | FAL: 100844, TRU: 7205 |
Junction | 0 | 1 | 0.16 | FAL: 91022, TRU: 17027 |
Stop | 0 | 1 | 0.02 | FAL: 105930, TRU: 2119 |
Traffic_Signal | 0 | 1 | 0.11 | FAL: 95733, TRU: 12316 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Severity | 0 | 1.00 | 2.22 | 0.57 | 1.00 | 2.0 | 2.00 | 2.00 | 4.00 | ▁▇▁▁▁ |
Distance.mi. | 0 | 1.00 | 0.83 | 1.56 | 0.00 | 0.1 | 0.35 | 0.93 | 53.31 | ▇▁▁▁▁ |
Temperature.F. | 807 | 0.99 | 54.01 | 18.43 | -77.80 | 39.0 | 54.00 | 70.00 | 144.00 | ▁▁▇▇▁ |
Wind_Chill.F. | 17035 | 0.84 | 49.28 | 21.51 | -30.40 | 32.0 | 51.00 | 68.00 | 144.00 | ▁▆▇▂▁ |
Humidity… | 863 | 0.99 | 65.56 | 20.18 | 8.00 | 50.0 | 66.00 | 83.00 | 100.00 | ▁▅▇▇▇ |
Pressure.in. | 1072 | 0.99 | 29.75 | 0.40 | 19.75 | 29.5 | 29.80 | 30.03 | 30.87 | ▁▁▁▁▇ |
Visibility.mi. | 1093 | 0.99 | 9.04 | 2.72 | 0.00 | 10.0 | 10.00 | 10.00 | 30.00 | ▁▇▁▁▁ |
Wind_Speed.mph. | 4392 | 0.96 | 8.89 | 5.62 | 0.00 | 5.0 | 8.00 | 12.00 | 141.50 | ▇▁▁▁▁ |
Precipitation.in. | 19660 | 0.82 | 0.02 | 0.34 | 0.00 | 0.0 | 0.00 | 0.00 | 10.05 | ▇▁▁▁▁ |
Variable type: POSIXct
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
Start_Time | 0 | 1.00 | 2016-03-23 02:35:03 | 2021-12-31 23:22:00 | 2021-01-20 15:24:13 | 78321 |
End_Time | 0 | 1.00 | 2016-03-23 08:35:03 | 2021-12-31 23:49:05 | 2021-01-20 20:55:08 | 93167 |
Weather_Timestamp | 615 | 0.99 | 2016-03-23 02:51:00 | 2021-12-31 23:32:00 | 2021-01-21 00:53:00 | 50268 |
Name | accidents_NY_dropped |
Number of rows | 108049 |
Number of columns | 16 |
_______________________ | |
Column type frequency: | |
character | 4 |
logical | 4 |
numeric | 6 |
POSIXct | 2 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
City | 25 | 1.00 | 3 | 22 | 0 | 1053 | 0 |
County | 0 | 1.00 | 4 | 14 | 0 | 63 | 0 |
Weather_Condition | 925 | 0.99 | 3 | 28 | 0 | 72 | 0 |
Sunrise_Sunset | 119 | 1.00 | 3 | 5 | 0 | 2 | 0 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
Crossing | 0 | 1 | 0.07 | FAL: 100844, TRU: 7205 |
Junction | 0 | 1 | 0.16 | FAL: 91022, TRU: 17027 |
Stop | 0 | 1 | 0.02 | FAL: 105930, TRU: 2119 |
Traffic_Signal | 0 | 1 | 0.11 | FAL: 95733, TRU: 12316 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Severity | 0 | 1.00 | 2.22 | 0.57 | 1.0 | 2.0 | 2.00 | 2.00 | 4.00 | ▁▇▁▁▁ |
Distance.mi. | 0 | 1.00 | 0.83 | 1.56 | 0.0 | 0.1 | 0.35 | 0.93 | 53.31 | ▇▁▁▁▁ |
Temperature.F. | 807 | 0.99 | 54.01 | 18.43 | -77.8 | 39.0 | 54.00 | 70.00 | 144.00 | ▁▁▇▇▁ |
Visibility.mi. | 1093 | 0.99 | 9.04 | 2.72 | 0.0 | 10.0 | 10.00 | 10.00 | 30.00 | ▁▇▁▁▁ |
Wind_Speed.mph. | 4392 | 0.96 | 8.89 | 5.62 | 0.0 | 5.0 | 8.00 | 12.00 | 141.50 | ▇▁▁▁▁ |
Precipitation.in. | 19660 | 0.82 | 0.02 | 0.34 | 0.0 | 0.0 | 0.00 | 0.00 | 10.05 | ▇▁▁▁▁ |
Variable type: POSIXct
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
Start_Time | 0 | 1 | 2016-03-23 02:35:03 | 2021-12-31 23:22:00 | 2021-01-20 15:24:13 | 78321 |
End_Time | 0 | 1 | 2016-03-23 08:35:03 | 2021-12-31 23:49:05 | 2021-01-20 20:55:08 | 93167 |
We decided to drop the following columns because they would not be useful for answering our research question(s) : state, zipcode, airport_code, weather_timestamp, wind_chill, humidity, pressure, wind_direction. We did this by selecting every column except these. The state variable is not important because the data is only for NY state, zipcode is not important because we could just use counties, and the rest of the dropped variables are insignificant because they should not have an effect on driving and thus accidents.
Name | accidents_NY_replace_null… |
Number of rows | 107124 |
Number of columns | 16 |
_______________________ | |
Column type frequency: | |
character | 4 |
logical | 4 |
numeric | 6 |
POSIXct | 2 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
City | 25 | 1 | 3 | 22 | 0 | 1036 | 0 |
County | 0 | 1 | 4 | 14 | 0 | 63 | 0 |
Weather_Condition | 0 | 1 | 3 | 28 | 0 | 72 | 0 |
Sunrise_Sunset | 101 | 1 | 3 | 5 | 0 | 2 | 0 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
Crossing | 0 | 1 | 0.07 | FAL: 100009, TRU: 7115 |
Junction | 0 | 1 | 0.16 | FAL: 90268, TRU: 16856 |
Stop | 0 | 1 | 0.02 | FAL: 105039, TRU: 2085 |
Traffic_Signal | 0 | 1 | 0.11 | FAL: 94918, TRU: 12206 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Severity | 0 | 1.00 | 2.22 | 0.57 | 1 | 2.0 | 2.00 | 2.00 | 4.00 | ▁▇▁▁▁ |
Distance.mi. | 0 | 1.00 | 0.82 | 1.55 | 0 | 0.1 | 0.35 | 0.93 | 53.31 | ▇▁▁▁▁ |
Temperature.F. | 137 | 1.00 | 54.01 | 18.43 | -19 | 39.0 | 54.00 | 70.00 | 144.00 | ▁▆▇▂▁ |
Visibility.mi. | 351 | 1.00 | 9.04 | 2.72 | 0 | 10.0 | 10.00 | 10.00 | 30.00 | ▁▇▁▁▁ |
Wind_Speed.mph. | 3690 | 0.97 | 8.89 | 5.62 | 0 | 5.0 | 8.00 | 12.00 | 141.50 | ▇▁▁▁▁ |
Precipitation.in. | 1 | 1.00 | 0.24 | 0.92 | 0 | 0.0 | 0.00 | 0.00 | 10.05 | ▇▁▁▁▁ |
Variable type: POSIXct
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
Start_Time | 0 | 1 | 2016-03-23 02:35:03 | 2021-12-31 23:22:00 | 2021-01-21 17:47:30 | 77683 |
End_Time | 0 | 1 | 2016-03-23 08:35:03 | 2021-12-31 23:49:05 | 2021-01-21 21:12:37 | 92430 |
After dropping these variables, we have 16 columns left and still have 108,049 rows. The only column with a concerning amount of nulls is the Precipitation.in. column. Because amount of precipitation depends on the weather, we decided to group by the weather type and fill in the null values for precipitation.in. with the mean of precipitation for each weather condition. For example, if a row with a null in the precipitation column had a weather type of light rain, the null value would be replaced by the mean precipitation amount for the light rain group. Before doing this, we dropped the rows with null values in the weather condition column, which is not an issue because there were only nulls in the weather condition column for .85% percent of the entire data frame. This left us with 107,124 rows.
Name | cleaned_accidents_NY |
Number of rows | 102961 |
Number of columns | 16 |
_______________________ | |
Column type frequency: | |
character | 4 |
logical | 4 |
numeric | 6 |
POSIXct | 2 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
City | 0 | 1 | 3 | 22 | 0 | 996 | 0 |
County | 0 | 1 | 4 | 14 | 0 | 63 | 0 |
Weather_Condition | 0 | 1 | 3 | 28 | 0 | 71 | 0 |
Sunrise_Sunset | 0 | 1 | 3 | 5 | 0 | 2 | 0 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
Crossing | 0 | 1 | 0.07 | FAL: 96242, TRU: 6719 |
Junction | 0 | 1 | 0.16 | FAL: 86899, TRU: 16062 |
Stop | 0 | 1 | 0.02 | FAL: 100965, TRU: 1996 |
Traffic_Signal | 0 | 1 | 0.11 | FAL: 91339, TRU: 11622 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Severity | 0 | 1 | 2.20 | 0.56 | 1 | 2.00 | 2.00 | 2.00 | 4.00 | ▁▇▁▁▁ |
Distance.mi. | 0 | 1 | 0.82 | 1.53 | 0 | 0.09 | 0.35 | 0.93 | 53.31 | ▇▁▁▁▁ |
Temperature.F. | 0 | 1 | 53.96 | 18.40 | -19 | 39.00 | 54.00 | 69.80 | 144.00 | ▁▆▇▂▁ |
Visibility.mi. | 0 | 1 | 9.04 | 2.72 | 0 | 10.00 | 10.00 | 10.00 | 30.00 | ▁▇▁▁▁ |
Wind_Speed.mph. | 0 | 1 | 8.90 | 5.62 | 0 | 5.00 | 8.00 | 12.70 | 141.50 | ▇▁▁▁▁ |
Precipitation.in. | 0 | 1 | 0.20 | 0.84 | 0 | 0.00 | 0.00 | 0.00 | 10.02 | ▇▁▁▁▁ |
Variable type: POSIXct
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
Start_Time | 0 | 1 | 2016-03-23 02:35:03 | 2021-12-31 23:22:00 | 2021-01-28 22:04:00 | 74512 |
End_Time | 0 | 1 | 2016-03-23 08:35:03 | 2021-12-31 23:49:05 | 2021-01-29 01:14:18 | 88958 |
After this, we dropped the remaining rows that contain null values because they only accounted for 3.7% of all of the data. After the first cleaning, we had 108,049 rows, and we ended up with 102,961 rows after the cleaning process. Therefore, we only lost about 4.7% of the data during the cleaning process, which is less than 5%.