The Storm Data used for analysis comprises 42 datasets: 21 Event Details Files (e.g., StormEvents_details-ftp_v1.0_d2019_c20200219.csv) and 21 Storm Data Fatality Files (e.g., StormEvents_fatalities-ftp_v1.0_d2011_c20180718.csv.gz). Both file types follow a common naming convention, with “dyyyy” giving the data year and “cyyyymmdd” the file creation date. During merging, prior years with missing entries and CSV errors in the Storm Event Location data were considered for exclusion. The objective was to create a comprehensive, representative dataset supporting a robust analysis of storm-related incidents in the United States from 2003 to 2023.
Data Loading and Initial Inspection
The data cleaning process begins with loading the raw storm data from the file “f_storm.csv” using the `read_csv` function from the “readr” package. The “BEGIN_DATE_TIME” column is then converted to the appropriate datetime format using the `dmy_hms` function from the “lubridate” package.
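The loading and parsing step described above can be sketched as follows. This is a minimal, self-contained illustration: the two-row inline CSV stands in for the real “f_storm.csv” (which has 61 columns), and the exact date format is an assumption based on the `dmy_hms` call.

```r
library(readr)
library(lubridate)
library(dplyr)

# Two-row stand-in for f_storm.csv; the real file is read with read_csv("f_storm.csv")
raw <- "BEGIN_DATE_TIME,END_DATE_TIME
28-APR-2011 14:00:00,28-APR-2011 16:30:00
29-AUG-2005 06:10:00,29-AUG-2005 12:00:00"

f_storm <- read_csv(I(raw), show_col_types = FALSE) |>
  mutate(
    # Parse day-month-year hour:minute:second strings into POSIXct datetimes
    BEGIN_DATE_TIME = dmy_hms(BEGIN_DATE_TIME),
    END_DATE_TIME   = dmy_hms(END_DATE_TIME)
  )
```

`read_csv(I(...))` reads literal text instead of a file path, which keeps the sketch runnable without the raw data on disk.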
Creating Additional Time-Related Variables
Two new time-related variables, “time” and “Duration,” are generated. The “time” variable is formatted to represent the time component of the “BEGIN_DATE_TIME” variable, and “Duration” is calculated as the numeric difference in seconds between the “END_DATE_TIME” and “BEGIN_DATE_TIME.”
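A minimal sketch of these two derived variables, using a one-row example tibble (the column names match the report; the `format()` pattern for “time” is an assumption):

```r
library(dplyr)
library(lubridate)

example <- tibble::tibble(
  BEGIN_DATE_TIME = ymd_hms("2011-04-28 14:00:00"),
  END_DATE_TIME   = ymd_hms("2011-04-28 16:30:00")
)

example <- example |>
  mutate(
    # Time-of-day component of the event start
    time = format(BEGIN_DATE_TIME, "%H:%M:%S"),
    # Event length in seconds (here, 2.5 hours = 9000 seconds)
    Duration = as.numeric(difftime(END_DATE_TIME, BEGIN_DATE_TIME, units = "secs"))
  )
```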
Selecting Relevant Columns
The data is then subset to the columns needed for analysis: “FATALITY_ID,” “EVENT_ID,” “FATALITY_TYPE,” “FATALITY_AGE,” “FATALITY_SEX,” “FATALITY_LOCATION,” “EPISODE_ID,” “STATE,” “YEAR,” “MONTH_NAME,” “EVENT_TYPE,” “CZ_TYPE,” “CZ_NAME,” “BEGIN_DATE_TIME,” “CZ_TIMEZONE,” “END_DATE_TIME,” “INJURIES_DIRECT,” “INJURIES_INDIRECT,” “DEATHS_DIRECT,” and “DEATHS_INDIRECT.” Rows with missing values in key fields, such as “YEAR,” “FATALITY_SEX,” and “FATALITY_AGE,” are dropped.
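The column selection and row-wise NA removal might look like the sketch below. The two-row stand-in tibble and the object names `f_storm`/`f_storm_clean` are assumptions for illustration; the second row has a missing age, so `drop_na()` removes it.

```r
library(dplyr)
library(tidyr)

# Tiny stand-in for the raw data: two rows, one with a missing FATALITY_AGE
f_storm <- tibble(
  FATALITY_ID = c(1, 2), EVENT_ID = c(10, 11), FATALITY_TYPE = "D",
  FATALITY_AGE = c(34, NA), FATALITY_SEX = c("M", "F"),
  FATALITY_LOCATION = "Outside", EPISODE_ID = c(5, 6), STATE = "TEXAS",
  YEAR = 2011, MONTH_NAME = "April", EVENT_TYPE = "Tornado",
  CZ_TYPE = "C", CZ_NAME = "Dallas", BEGIN_DATE_TIME = Sys.time(),
  CZ_TIMEZONE = "CST-6", END_DATE_TIME = Sys.time(),
  INJURIES_DIRECT = 0, INJURIES_INDIRECT = 0,
  DEATHS_DIRECT = 1, DEATHS_INDIRECT = 0
)

f_storm_clean <- f_storm |>
  select(
    FATALITY_ID, EVENT_ID, FATALITY_TYPE, FATALITY_AGE, FATALITY_SEX,
    FATALITY_LOCATION, EPISODE_ID, STATE, YEAR, MONTH_NAME, EVENT_TYPE,
    CZ_TYPE, CZ_NAME, BEGIN_DATE_TIME, CZ_TIMEZONE, END_DATE_TIME,
    INJURIES_DIRECT, INJURIES_INDIRECT, DEATHS_DIRECT, DEATHS_INDIRECT
  ) |>
  drop_na(YEAR, FATALITY_SEX, FATALITY_AGE)  # row with NA age is removed
```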
# Data Import and Cleaning
library(shiny)
library(readr)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ purrr 1.0.2
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.3 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(dplyr)
library(scales)
Attaching package: 'scales'
The following object is masked from 'package:purrr':
discard
The following object is masked from 'package:readr':
col_factor
Rows: 16146 Columns: 61
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (30): FATALITY_TYPE, FATALITY_DATE, FATALITY_SEX, FATALITY_LOCATION, STA...
dbl (31): FAT_YEARMONTH, FAT_DAY, FAT_TIME, FATALITY_ID, EVENT_ID, FATALITY_...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The data is further filtered based on user input for each specific analysis question (Q1-Q6). Filters include parameters such as “STATE,” “YEAR,” “FATALITY_SEX,” and “FATALITY_AGE.” These filters enable focused exploration of the data based on the characteristics of interest for each analysis.
Question-Specific #1: Calculating Derived Variables
Several derived variables are calculated to facilitate analysis. For Q1, the “CASUALTIES” and “FATALITIES” variables are computed as the sum of injuries and deaths, with the fatality rate being the ratio of total fatalities to total casualties for each storm type.
Question-Specific #2: Grouping and Summarizing
Data is grouped and summarized as needed for each analysis question. For example, Q2 involves summarizing fatalities by state, Q3 examines fatalities by age group and gender, and Q4 explores storm frequency by event type and year.
Handling Missing Data
In Q4, the data is filtered to exclude “CZ_TYPE” values other than ‘C’ or ‘Z’ to focus on relevant storm events. Additionally, missing values in the selected columns are dropped to ensure the accuracy of the analyses.
Formatting and Sorting
Variables such as “MONTH_NAME” are formatted to ensure proper ordering in visualizations. Dataframes are sorted in descending order based on relevant metrics to prioritize key information in the visualizations.
data <- f_storm_clean

# For Shinyapp Filters
# if (input$state_filter != "All") {
#   data <- data |> filter(STATE == input$state_filter)
# }
# if (input$year_filter != "All") {
#   data <- data |> filter(YEAR == input$year_filter)
# }
# if (input$gender_filter != "All") {
#   data <- data |> filter(FATALITY_SEX == input$gender_filter)
# }
# if (input$age_filter[1] != 0 || input$age_filter[2] != 96) {
#   data <- data |> filter(FATALITY_AGE >= input$age_filter[1], FATALITY_AGE <= input$age_filter[2])
# }

data_Q1 <- data |>
  mutate(
    CASUALTIES = INJURIES_DIRECT + INJURIES_INDIRECT + DEATHS_DIRECT + DEATHS_INDIRECT,
    FATALITIES = DEATHS_DIRECT + DEATHS_INDIRECT
  ) |>
  group_by(EVENT_TYPE) |>
  summarize(
    total = sum(CASUALTIES),
    total_fatalities = sum(FATALITIES),
    FATALITIES_RATE = total_fatalities / total
  ) |>
  drop_na() |>
  arrange(desc(FATALITIES_RATE))

# Return the filtered data to be displayed in the DataTable
data_Q1
data <- f_storm_clean

# For Shinyapp Filters
# if (input$year2_filter != "All") {
#   data <- data |> filter(YEAR == input$year2_filter)
# }
# if (input$gender2_filter != "All") {
#   data <- data |> filter(FATALITY_SEX == input$gender2_filter)
# }
# if (input$age2_filter[1] != 0 || input$age2_filter[2] != 96) {
#   data <- data |> filter(FATALITY_AGE >= input$age2_filter[1], FATALITY_AGE <= input$age2_filter[2])
# }

data_Q2 <- data |>
  mutate(
    FATALITIES = DEATHS_DIRECT + DEATHS_INDIRECT
  ) |>
  group_by(STATE) |>
  drop_na() |>
  summarize(
    sum = sum(FATALITIES)
  ) |>
  arrange(desc(sum))

# Return the filtered data to be displayed in the DataTable
data_Q2
# A tibble: 63 × 2
STATE sum
<chr> <dbl>
1 LOUISIANA 428022
2 MISSOURI 27291
3 MISSISSIPPI 13560
4 TEXAS 6874
5 ALABAMA 6627
6 NEVADA 5974
7 OREGON 4173
8 NEW YORK 3256
9 ILLINOIS 2712
10 CALIFORNIA 2633
# ℹ 53 more rows
data <- f_storm_clean

# For Shinyapp Filters
# if (input$year3_filter != "All") {
#   data <- data |> filter(YEAR == input$year3_filter)
# }
# if (input$gender3_filter != "All") {
#   data <- data |> filter(FATALITY_SEX == input$gender3_filter)
# }

data_Q3 <- data |>
  mutate(
    FATALITIES = DEATHS_DIRECT + DEATHS_INDIRECT
  ) |>
  drop_na() |>
  filter(CZ_TYPE %in% c('C', 'Z')) |>
  mutate(
    AGE_GROUP = case_when(
      FATALITY_AGE >= 90 ~ 'Over 90',
      FATALITY_AGE >= 72 ~ '72 to 90',
      FATALITY_AGE >= 54 ~ '54 to 71',
      FATALITY_AGE >= 36 ~ '36 to 53',
      FATALITY_AGE >= 18 ~ '18 to 35',
      FATALITY_AGE >= 0  ~ '0 to 17'
    )
  ) |>
  group_by(FATALITY_SEX, AGE_GROUP) |>
  summarise(
    sum = sum(FATALITIES)
  )
`summarise()` has grouped output by 'FATALITY_SEX'. You can override using the
`.groups` argument.
# Return the filtered data to be displayed in the DataTable
data_Q3
# A tibble: 12 × 3
# Groups: FATALITY_SEX [2]
FATALITY_SEX AGE_GROUP sum
<chr> <chr> <dbl>
1 F 0 to 17 6605
2 F 18 to 35 9681
3 F 36 to 53 28642
4 F 54 to 71 49021
5 F 72 to 90 115446
6 F Over 90 29896
7 M 0 to 17 6342
8 M 18 to 35 17745
9 M 36 to 53 56547
10 M 54 to 71 85698
11 M 72 to 90 104927
12 M Over 90 13852
data <- f_storm_clean

# For Shinyapp Filters
# if (input$state4_filter != "All") {
#   data <- data |> filter(STATE == input$state4_filter)
# }

data_Q4 <- data |>
  group_by(EVENT_TYPE, YEAR) |>
  drop_na() |>
  summarize(FREQUENCY = n()) |>
  mutate(
    total = sum(FREQUENCY)
  ) |>
  arrange(desc(total))
`summarise()` has grouped output by 'EVENT_TYPE'. You can override using the
`.groups` argument.
# Return the filtered data to be displayed in the DataTable
data_Q4
data <- f_storm_clean

# For Shinyapp Filters
# if (input$state6_filter != "All") {
#   data <- data |> filter(STATE == input$state6_filter)
# }
# if (input$year6_filter != "All") {
#   data <- data |> filter(YEAR == input$year6_filter)
# }
# if (input$gender6_filter != "All") {
#   data <- data |> filter(FATALITY_SEX == input$gender6_filter)
# }
# if (input$age6_filter[1] != 0 || input$age6_filter[2] != 96) {
#   data <- data |> filter(FATALITY_AGE >= input$age6_filter[1],
#                          FATALITY_AGE <= input$age6_filter[2])
# }

data_Q6 <- data |>
  mutate(
    FATALITIES = DEATHS_DIRECT + DEATHS_INDIRECT,
    MONTH_NAME = factor(MONTH_NAME, levels = month.name)
  ) |>
  drop_na() |>
  group_by(MONTH_NAME) |>
  summarise(
    total = sum(FATALITIES)
  )

# Return the filtered data to be displayed in the DataTable
data_Q6
# A tibble: 12 × 2
MONTH_NAME total
<fct> <dbl>
1 January 1681
2 February 3807
3 March 4563
4 April 8265
5 May 28625
6 June 8077
7 July 10698
8 August 446497
9 September 4345
10 October 2025
11 November 1268
12 December 4581
Ensuring Data Consistency
Throughout the cleaning process, efforts are made to ensure data consistency and accuracy. Proper data types are enforced, and missing or irrelevant entries are handled appropriately. The data cleaning process prepares the dataset for analysis by addressing missing values, creating relevant variables, and filtering the data based on user input. These steps contribute to the generation of meaningful insights in response to the specific questions guiding the analysis (Q1-Q6).
Other appendices (as necessary)
While attempting to enhance the visualization with a geographical representation of storm events using the location files available on the dataset website, we encountered challenges due to the files' substantial size and inconsistent formatting. In addition, the CSV.GZ files for 2003, 2007, 2009, 2013, 2016, 2019, and 2020 required extra conversion and cleaning. Due to time constraints and the need to address other critical troubleshooting aspects of the Shiny app, we decided to omit this feature, despite having completed the initial download and raw merge of the 21 location datasets.