Storm Event Data Analysis

Appendix to report

Data cleaning

Data Downloading and Merging

The storm data used for analysis comprises 42 datasets: 21 Event Details files (e.g., StormEvents_details-ftp_v1.0_d2019_c20200219.csv) and 21 Storm Data Fatality files (e.g., StormEvents_fatalities-ftp_v1.0_d2011_c20180718.csv.gz). In both naming schemes, “dyyyy” encodes the data year and “cyyyymmdd” the file creation date. During merging, years with missing entries or CSV errors in the Storm Event Location data were considered for exclusion. The objective was to create a comprehensive and representative dataset for a robust analysis of storm-related incidents in the United States from 2003 to 2023.
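The merge itself is not shown in the code below; a minimal sketch of how the 21 yearly file pairs could be combined is given here. The `data/raw` directory, the file-name patterns, and the join strategy are assumptions for illustration, not the exact procedure used.

```r
library(readr)
library(dplyr)
library(purrr)

# Assumed layout: one details file and one fatalities file per year in data/raw/.
# read_csv() reads .csv.gz files directly, so no manual decompression is needed here.
detail_files   <- list.files("data/raw",
                             pattern = "^StormEvents_details-.*\\.csv(\\.gz)?$",
                             full.names = TRUE)
fatality_files <- list.files("data/raw",
                             pattern = "^StormEvents_fatalities-.*\\.csv(\\.gz)?$",
                             full.names = TRUE)

# Stack all years into one details table and one fatalities table.
details    <- map_dfr(detail_files,   read_csv, show_col_types = FALSE)
fatalities <- map_dfr(fatality_files, read_csv, show_col_types = FALSE)

# Each fatality record points back to its storm event via EVENT_ID,
# giving one row per fatality with the event attributes attached.
f_storm <- inner_join(fatalities, details, by = "EVENT_ID")
```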

Data Loading and Initial Inspection

The data cleaning process begins with loading the raw storm data from the file “f_storm.csv” using the `read_csv` function from the “readr” package. The “BEGIN_DATE_TIME” column is then converted to the appropriate datetime format using the `dmy_hms` function from the “lubridate” package.

Creating Additional Time-Related Variables

Two new time-related variables, “time” and “Duration,” are generated. The “time” variable is formatted to represent the time component of the “BEGIN_DATE_TIME” variable, and “Duration” is calculated as the numeric difference in seconds between the “END_DATE_TIME” and “BEGIN_DATE_TIME.”

Selecting Relevant Columns

The data is then restricted to the columns needed for analysis: “FATALITY_ID,” “EVENT_ID,” “FATALITY_TYPE,” “FATALITY_AGE,” “FATALITY_SEX,” “FATALITY_LOCATION,” “EPISODE_ID,” “STATE,” “YEAR,” “MONTH_NAME,” “EVENT_TYPE,” “CZ_TYPE,” “CZ_NAME,” “BEGIN_DATE_TIME,” “CZ_TIMEZONE,” “END_DATE_TIME,” “INJURIES_DIRECT,” “INJURIES_INDIRECT,” “DEATHS_DIRECT,” and “DEATHS_INDIRECT.” Rows with missing values in key fields, such as “YEAR,” “FATALITY_SEX,” and “FATALITY_AGE,” are dropped.

library(shiny)

#Data Import and Cleaning

library(readr)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ purrr     1.0.2
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(dplyr)
library(scales)

Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor
library(stats)

f_storm <- read_csv("data/Storm_data_cleaned.csv")
Rows: 16146 Columns: 61
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (30): FATALITY_TYPE, FATALITY_DATE, FATALITY_SEX, FATALITY_LOCATION, STA...
dbl (31): FAT_YEARMONTH, FAT_DAY, FAT_TIME, FATALITY_ID, EVENT_ID, FATALITY_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
f_storm$BEGIN_DATE_TIME <- dmy_hms(f_storm$BEGIN_DATE_TIME)

f_storm$time <- format(f_storm$BEGIN_DATE_TIME, format = "%H:%M:%S")
f_storm$END_DATE_TIME <- dmy_hms(f_storm$END_DATE_TIME)
f_storm$Duration <- as.numeric(difftime(f_storm$END_DATE_TIME, 
                                        f_storm$BEGIN_DATE_TIME, units = "secs"))

f_storm_clean <- select(f_storm, c("FATALITY_ID", "EVENT_ID", "FATALITY_TYPE", 
                        "FATALITY_AGE", "FATALITY_SEX", "FATALITY_LOCATION", 
                        "EPISODE_ID", "STATE", "YEAR", "MONTH_NAME", "EVENT_TYPE", 
                        "CZ_TYPE", "CZ_NAME", "BEGIN_DATE_TIME", "CZ_TIMEZONE", 
                        "END_DATE_TIME", "INJURIES_DIRECT", "INJURIES_INDIRECT", 
                        "DEATHS_DIRECT", "DEATHS_INDIRECT")) |>
                drop_na(YEAR, FATALITY_SEX, FATALITY_AGE)

Filtering for Analysis

The data is further filtered based on user input for each specific analysis question (Q1-Q6). Filters include “STATE,” “YEAR,” “FATALITY_SEX,” and “FATALITY_AGE,” enabling focused exploration of the characteristics of interest for each question.
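In the Shiny app these filters run inside reactive expressions (shown commented out in the code below); outside the app, the same logic can be wrapped in a plain function. This is a sketch with illustrative names, where the sentinel value "All" means "apply no filter":

```r
library(dplyr)

# Applies the same optional filters the app exposes; "All" means no filtering.
filter_storm <- function(data, state = "All", year = "All",
                         sex = "All", age_range = c(0, 96)) {
  if (state != "All") data <- filter(data, STATE == state)
  if (year  != "All") data <- filter(data, YEAR == year)
  if (sex   != "All") data <- filter(data, FATALITY_SEX == sex)
  filter(data, FATALITY_AGE >= age_range[1], FATALITY_AGE <= age_range[2])
}
```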

Question-Specific Step #1: Calculating Derived Variables

Several derived variables are calculated to facilitate analysis. For Q1, “CASUALTIES” is computed as the sum of direct and indirect injuries and deaths, “FATALITIES” as the sum of direct and indirect deaths, and the fatality rate as the ratio of total fatalities to total casualties for each event type.

Question-Specific Step #2: Grouping and Summarizing

Data is grouped and summarized as needed for each analysis question. For example, Q2 involves summarizing fatalities by state, Q3 examines fatalities by age group and gender, and Q4 explores storm frequency by event type and year.

Handling Missing Data

In Q3, the data is filtered to keep only “CZ_TYPE” values of ‘C’ (county) or ‘Z’ (forecast zone) to focus on relevant storm events. Additionally, missing values in the selected columns are dropped to ensure the accuracy of the analyses.

Formatting and Sorting

Variables such as “MONTH_NAME” are formatted to ensure proper ordering in visualizations. Dataframes are sorted in descending order based on relevant metrics to prioritize key information in the visualizations.

data <- f_storm_clean

#For Shinyapp Filters
# if (input$state_filter != "All") {
#   data <- data |> filter(STATE == input$state_filter)
# }
# 
# if (input$year_filter != "All") {
#   data <- data |> filter(YEAR == input$year_filter)
# }
# 
# if (input$gender_filter != "All") {
#   data <- data |> filter(FATALITY_SEX == input$gender_filter)
# }
# 
# if (input$age_filter[1] != 0 || input$age_filter[2] != 96) {
#   data <- data |> filter(FATALITY_AGE >= input$age_filter[1], FATALITY_AGE <= input$age_filter[2])
# }    

data_Q1 <- data |>
  mutate(
    CASUALTIES = INJURIES_DIRECT + INJURIES_INDIRECT + DEATHS_DIRECT + DEATHS_INDIRECT,
    FATALITIES = DEATHS_DIRECT + DEATHS_INDIRECT
  ) |>
  group_by(EVENT_TYPE) |>
  summarize(
    total = sum(CASUALTIES),
    total_fatalities = sum(FATALITIES),
    FATALITIES_RATE = total_fatalities / total
  ) |>
  drop_na() |>
  arrange(desc(FATALITIES_RATE))


# Return the filtered data to be displayed in the DataTable
data_Q1
# A tibble: 45 × 4
   EVENT_TYPE                total total_fatalities FATALITIES_RATE
   <chr>                     <dbl>            <dbl>           <dbl>
 1 Astronomical Low Tide         1                1           1    
 2 Coastal Flood               559              559           1    
 3 Lakeshore Flood               4                4           1    
 4 Storm Surge/Tide            174              174           1    
 5 Tropical Depression          14               14           1    
 6 Cold/Wind Chill            2359             2335           0.990
 7 Tropical Storm              275              268           0.975
 8 Hurricane (Typhoon)      464873           440347           0.947
 9 Marine Thunderstorm Wind    225              212           0.942
10 Marine Dense Fog             32               29           0.906
# ℹ 35 more rows
data <- f_storm_clean

#For Shinyapp Filters
# if (input$year2_filter != "All") {
#   data <- data |> filter(YEAR == input$year2_filter)
# }
# 
# if (input$gender2_filter != "All") {
#   data <- data |> filter(FATALITY_SEX == input$gender2_filter)
# }
# 
# if (input$age2_filter[1] != 0 || input$age2_filter[2] != 96) {
#   data <- data |> filter(FATALITY_AGE >= input$age2_filter[1], FATALITY_AGE <= input$age2_filter[2])
# }    

data_Q2 <- data |>
  mutate(
    FATALITIES = DEATHS_DIRECT + DEATHS_INDIRECT
  ) |>
  group_by(STATE) |>
  drop_na() |>
  summarize(
    sum = sum(FATALITIES)
  ) |> 
  arrange(desc(sum))


# Return the filtered data to be displayed in the DataTable
data_Q2
# A tibble: 63 × 2
   STATE          sum
   <chr>        <dbl>
 1 LOUISIANA   428022
 2 MISSOURI     27291
 3 MISSISSIPPI  13560
 4 TEXAS         6874
 5 ALABAMA       6627
 6 NEVADA        5974
 7 OREGON        4173
 8 NEW YORK      3256
 9 ILLINOIS      2712
10 CALIFORNIA    2633
# ℹ 53 more rows
data <- f_storm_clean

#For Shinyapp Filters 
# if (input$year3_filter != "All") {
#   data <- data |> filter(YEAR == input$year3_filter)
# }
# 
# if (input$gender3_filter != "All") {
#   data <- data |> filter(FATALITY_SEX == input$gender3_filter)
# }

data_Q3 <- data |>
  mutate(
    FATALITIES = DEATHS_DIRECT + DEATHS_INDIRECT
  ) |>
  drop_na() |>
  filter(CZ_TYPE %in% c('C', 'Z')) |>
  mutate(
    AGE_GROUP = case_when(
      FATALITY_AGE >= 90 ~ 'Over 90',
      FATALITY_AGE >= 72 ~ '72 to 90',
      FATALITY_AGE >= 54 ~ '54 to 71',
      FATALITY_AGE >= 36 ~ '36 to 53',
      FATALITY_AGE >= 18 ~ '18 to 35',
      FATALITY_AGE >= 0 ~ '0 to 17'
    )
  ) |>
  group_by(FATALITY_SEX, AGE_GROUP) |>
  summarise(
    sum = sum(FATALITIES)
  )
`summarise()` has grouped output by 'FATALITY_SEX'. You can override using the
`.groups` argument.
# Return the filtered data to be displayed in the DataTable
data_Q3
# A tibble: 12 × 3
# Groups:   FATALITY_SEX [2]
   FATALITY_SEX AGE_GROUP    sum
   <chr>        <chr>      <dbl>
 1 F            0 to 17     6605
 2 F            18 to 35    9681
 3 F            36 to 53   28642
 4 F            54 to 71   49021
 5 F            72 to 90  115446
 6 F            Over 90    29896
 7 M            0 to 17     6342
 8 M            18 to 35   17745
 9 M            36 to 53   56547
10 M            54 to 71   85698
11 M            72 to 90  104927
12 M            Over 90    13852
data <- f_storm_clean

#For Shinyapp Filters
# if (input$state4_filter != "All") {
#   data <- data |> filter(STATE == input$state4_filter)
# }

data_Q4 <- data |>
  group_by(EVENT_TYPE, YEAR) |>
  drop_na() |>
  summarize(FREQUENCY = n()) |>
  mutate(
    total = sum(FREQUENCY)
  ) |>
  arrange(desc(total)) 
`summarise()` has grouped output by 'EVENT_TYPE'. You can override using the
`.groups` argument.
# Return the filtered data to be displayed in the DataTable
data_Q4
# A tibble: 596 × 4
# Groups:   EVENT_TYPE [46]
   EVENT_TYPE  YEAR FREQUENCY total
   <chr>      <dbl>     <int> <int>
 1 Tornado     2003        51  1590
 2 Tornado     2004        34  1590
 3 Tornado     2005        35  1590
 4 Tornado     2006        66  1590
 5 Tornado     2007        78  1590
 6 Tornado     2008       124  1590
 7 Tornado     2009        24  1590
 8 Tornado     2010        46  1590
 9 Tornado     2011       559  1590
10 Tornado     2012        71  1590
# ℹ 586 more rows
data <- f_storm_clean

#For Shinyapp Filters
# if (input$state5_filter != "All") {
#   data <- data |> filter(STATE == input$state5_filter)
# }
# if (input$year5_filter != "All") {
#   data <- data |> filter(YEAR == input$year5_filter)
# }
# if (input$gender5_filter != "All") {
#   data <- data |> filter(FATALITY_SEX == input$gender5_filter)
# }
# if (input$age5_filter[1] != 0 || input$age5_filter[2] != 96) {
#   data <- data |> filter(FATALITY_AGE >= input$age5_filter[1], 
#                          FATALITY_AGE <= input$age5_filter[2])
# }

data_Q5 <- data |>
  drop_na() |>
  mutate(
    FATALITIES = DEATHS_DIRECT + DEATHS_INDIRECT,
    # Compute duration explicitly in seconds; bare POSIXct subtraction lets
    # difftime choose its own units, which would make the thresholds below unreliable.
    DURATION = as.numeric(difftime(END_DATE_TIME, BEGIN_DATE_TIME, units = "secs"))
  ) |>
  mutate(
    DURATION_GROUP = case_when(
      DURATION >= 100000 ~ 'Over 100,000 secs',
      DURATION >= 50000 ~ '50,000 - 100,000 secs',
      DURATION >= 10000 ~ '10,000 - 50,000 secs',
      DURATION >= 0 ~ '0 - 10,000 secs'
    )
  ) |>
  group_by(DURATION_GROUP) |>
  summarize(
    SUM_FATALITIES = sum(FATALITIES)
  ) |>
  mutate(
    EVENT_PERSISTENCE_TIME = SUM_FATALITIES / n()
  )


# Return the filtered data to be displayed in the DataTable
data_Q5
# A tibble: 4 × 3
  DURATION_GROUP        SUM_FATALITIES EVENT_PERSISTENCE_TIME
  <chr>                          <dbl>                  <dbl>
1 0 - 10,000 secs                48239                 12060.
2 10,000 - 50,000 secs            6360                  1590 
3 50,000 - 100,000 secs           2940                   735 
4 Over 100,000 secs             466893                116723.
data <- f_storm_clean

#For Shinyapp Filters
# if (input$state6_filter != "All") {
#   data <- data |> filter(STATE == input$state6_filter)
# }
# 
# if (input$year6_filter != "All") {
#   data <- data |> filter(YEAR == input$year6_filter)
# }
# 
# if (input$gender6_filter != "All") {
#   data <- data |> filter(FATALITY_SEX == input$gender6_filter)
# }
# 
# if (input$age6_filter[1] != 0 || input$age6_filter[2] != 96) {
#   data <- data |> filter(FATALITY_AGE >= input$age6_filter[1], 
#                          FATALITY_AGE <= input$age6_filter[2])
# }

data_Q6 <- data |>
  mutate(
    FATALITIES = DEATHS_DIRECT + DEATHS_INDIRECT,
    MONTH_NAME = factor(MONTH_NAME, levels = month.name)
  ) |>
  drop_na() |>
  group_by(MONTH_NAME) |>
  summarise(
    total = sum(FATALITIES)
  )

# Return the filtered data to be displayed in the DataTable
data_Q6
# A tibble: 12 × 2
   MONTH_NAME  total
   <fct>       <dbl>
 1 January      1681
 2 February     3807
 3 March        4563
 4 April        8265
 5 May         28625
 6 June         8077
 7 July        10698
 8 August     446497
 9 September    4345
10 October      2025
11 November     1268
12 December     4581

Ensuring Data Consistency

Throughout the cleaning process, efforts are made to ensure data consistency and accuracy. Proper data types are enforced, and missing or irrelevant entries are handled appropriately. The data cleaning process prepares the dataset for analysis by addressing missing values, creating relevant variables, and filtering the data based on user input. These steps contribute to the generation of meaningful insights in response to the specific questions guiding the analysis (Q1-Q6).
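One way to enforce data types at load time, as the column-specification message from `read_csv` above suggests, is to declare them explicitly. The call below is a sketch: the listed columns are a subset chosen for illustration, not the full specification used in the analysis.

```r
library(readr)

# Declaring column types up front both silences the spec message and
# guarantees consistent types across runs (column list is illustrative).
f_storm <- read_csv(
  "data/Storm_data_cleaned.csv",
  col_types = cols(
    FATALITY_ID  = col_double(),
    EVENT_ID     = col_double(),
    FATALITY_AGE = col_double(),
    FATALITY_SEX = col_character(),
    STATE        = col_character(),
    .default     = col_guess()
  )
)
```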

Other Appendices

We attempted to enhance the visualization with a geographical representation of storm events using the location files available on the dataset website, but the substantial size and poor quality of each file posed challenges. Moreover, inconsistent CSV.GZ files for the years 2003, 2007, 2009, 2013, 2016, 2019, and 2020 required additional conversion and cleaning work. Due to time constraints and the need to address other crucial troubleshooting aspects of the Shiny app, we decided to omit this feature, despite having completed the initial download and raw merge of the 21 location datasets.
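For reference, the conversion step for the inconsistent compressed files could be handled in R without external tools; this is a sketch, and the `data/locations` path pattern is an assumption:

```r
# R.utils::gunzip() expands each .csv.gz to a plain .csv alongside the original;
# remove = FALSE keeps the compressed copy for reproducibility.
library(R.utils)

gz_files <- list.files("data/locations", pattern = "\\.csv\\.gz$", full.names = TRUE)
invisible(lapply(gz_files, gunzip, remove = FALSE, overwrite = TRUE))
```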