Project title

Appendix to report

Data cleaning

Steps 1-2: Read the raw datasets from our data directory and join them together
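
As a minimal sketch of Steps 1-2, the snippet below assumes the raw data is split across several CSV files with identical columns and that "joining" means stacking them row-wise. The directory name, file names, column names, and values are all hypothetical; two tiny files are written to a temporary folder only so that the example runs on its own.

```r
library(readr)
library(dplyr)
library(tibble)

# Hypothetical stand-in for the raw data directory: write two tiny CSV files
raw_dir <- file.path(tempdir(), "raw_earthquakes")
dir.create(raw_dir, showWarnings = FALSE)
write_csv(
  tibble(time = "2022-06-30T23:59:30Z", mag = 4.5, place = "10 km N of Anchorage, Alaska"),
  file.path(raw_dir, "part1.csv")
)
write_csv(
  tibble(time = "2022-06-29T10:00:00Z", mag = 5.1, place = "25 km SW of Hualien City, Taiwan"),
  file.path(raw_dir, "part2.csv")
)

# Read every CSV in the directory and stack the results into one data frame
raw_files <- list.files(raw_dir, pattern = "\\.csv$", full.names = TRUE)
earthquakes_raw <- bind_rows(lapply(raw_files, read_csv, show_col_types = FALSE))
```

If the raw files instead share a key and hold different attributes, a keyed join such as `left_join()` would replace `bind_rows()` here.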

Step 3: Remove unnecessary attributes and rename columns

First, we remove attributes that are not useful for building our interactive map and prediction model. We determined that the variables magType, nst, rms, id, updated, horizontalError, and depthError are outside the scope of our project for the following reasons:

  • magType: our project focuses on predicting earthquake magnitudes and creating an interactive map, so the specific method or algorithm used to calculate the magnitude might not significantly affect the predictive capability of our model. We are interested in the magnitude itself rather than in how it was calculated.

  • nst: the number of seismic stations used to determine the location might not contribute directly to this prediction task. Our focus is on the geological and geographical factors that influence magnitude.

  • updated: we focus on the time when the earthquake occurred rather than when it was reported, since human-reported data may lag behind the event.

  • depthError and horizontalError: these could be helpful for normalizing our data. However, given the scope of this class and our team's current skills, we omit them.

  • id: every observation has a unique id, so it provides no useful information for our prediction or mapping tasks. In predictive modeling, unique identifiers typically do not help in understanding or predicting the target variable.
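
The column removal and renaming described above could look roughly like the following sketch. The toy data frame and its values are hypothetical, and the rename from `mag` to `magnitude` is an assumption about the raw column name; only the list of dropped attributes comes from the text.

```r
library(dplyr)
library(tibble)

# Toy stand-in for the joined raw data (values are made up for illustration)
earthquakes_raw <- tibble(
  time = c("2022-06-30T23:59:30Z", "2022-06-30T23:46:30Z"),
  latitude = c(19.5, 7.30),
  longitude = c(-66.3, -80.0),
  depth = c(8.49, 10),
  mag = c(4.5, 4.6),
  magType = c("md", "mb"),
  nst = c(12, 40),
  rms = c(0.3, 0.8),
  id = c("a1", "b2"),
  updated = c("2022-07-01", "2022-07-01"),
  horizontalError = c(1.2, 5.4),
  depthError = c(0.9, 1.8)
)

earthquakes <- earthquakes_raw %>%
  # Drop the attributes judged irrelevant to the map and the model
  select(-c(magType, nst, rms, id, updated, horizontalError, depthError)) %>%
  # Rename for readability (the mag -> magnitude mapping is an assumption)
  rename(magnitude = mag)
```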

Step 4: Extract date information

We take the “earthquakes” data frame and apply a series of transformations that create or modify the date-related columns (year, month, day) from the existing “time” column, standardizing formats and converting month names to numeric values where needed. These variables help us identify temporal patterns.
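
One way to derive these columns is with lubridate, as in the sketch below. It assumes the “time” column holds ISO 8601 strings in UTC; the sample values are hypothetical, and our actual code may have produced the numeric month differently.

```r
library(dplyr)
library(lubridate)
library(tibble)

# Toy input: two timestamps in the assumed ISO 8601 format
earthquakes <- tibble(time = c("2022-06-30T23:59:30Z", "2022-01-05T08:15:00Z"))

earthquakes <- earthquakes %>%
  mutate(
    time = ymd_hms(time),  # parse the strings as UTC date-times
    year = year(time),
    month = month(time),   # numeric by default (1-12), not a month name
    day = day(time)
  )
```

Because `month()` returns a number unless `label = TRUE` is set, no separate name-to-number conversion is needed in this version.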

Step 5: Convert to numeric/char and create a separate country column

This step manipulates the “earthquakes” dataset. It first keeps only the rows where the “place” column contains a comma. It then separates “place” into two columns, “description” and “country”. Following this, it cleans the “country” column by removing leading and trailing spaces and converting it to character type.

Next, it performs a left join with a predefined list of U.S. states, marking the matching rows as “United States”. This aligns the earthquake data with the U.S. states. Finally, the if_else function resolves the “country” column: rows whose value matches a state name are set to “United States”, while rows that do not match the predefined list keep their original value. These transformations standardize the locations and categorize each earthquake as inside or outside the United States.
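
The sketch below illustrates Step 5 under stated assumptions: the sample place strings are hypothetical, the state list is taken from base R's `state.name` (50 states only, so territories are not covered), and the intended logic is taken to be that matched state names become “United States” while other values stay unchanged.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(tibble)

# Toy input in the "description, region" style assumed for the place column
earthquakes <- tibble(place = c(
  "10 km N of Anchorage, Alaska",
  "25 km SW of Hualien City, Taiwan",
  "South Sandwich Islands region"  # no comma, so this row is filtered out
))

# Lookup table of state names, flagged as United States
us_states <- tibble(state = state.name, if_us = "United States")

earthquakes <- earthquakes %>%
  filter(str_detect(place, ",")) %>%
  # Split on the first comma; anything after it goes into country
  separate(place, into = c("description", "country"), sep = ",", extra = "merge") %>%
  mutate(country = as.character(str_trim(country))) %>%
  left_join(us_states, by = c("country" = "state")) %>%
  # Matched states become "United States"; everything else is left as-is
  mutate(country = if_else(is.na(if_us), country, "United States"))
```

After the join, `if_us` is `NA` for rows outside the lookup table, which is what the final `if_else()` keys on.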

Cleaned data

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
earthquakes <- read_csv("./data/earthquakes.csv")
New names:
• `` -> `...1`
Rows: 21740 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (6): source, description, country, type, status, if_us
dbl (10): ...1, year, month, day, latitude, longitude, depth, magnitude, ga...
dttm (1): time
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
earthquakes
# A tibble: 21,740 × 17
    ...1 time                 year month   day latitude longitude  depth
   <dbl> <dttm>              <dbl> <dbl> <dbl>    <dbl>     <dbl>  <dbl>
 1     1 2022-06-30 23:59:30  2022     6    30    19.5      -66.3   8.49
 2     2 2022-06-30 23:46:30  2022     6    30     7.30     -80.0  10   
 3     3 2022-06-30 23:45:40  2022     6    30    64.5     -138.   11.2 
 4     4 2022-06-30 23:37:46  2022     6    30     4.95     -76.1 114.  
 5     5 2022-06-30 22:09:38  2022     6    30    -2.68     110.   10   
 6     6 2022-06-30 22:03:55  2022     6    30    38.9      142.   35   
 7     7 2022-06-30 22:00:24  2022     6    30    51.9      178.  125.  
 8     8 2022-06-30 21:51:04  2022     6    30    -6.32     105.  139.  
 9     9 2022-06-30 20:31:57  2022     6    30    20.3      121.   35.8 
10    10 2022-06-30 19:38:54  2022     6    30     7.17     -80.0  10   
# ℹ 21,730 more rows
# ℹ 9 more variables: magnitude <dbl>, gap <dbl>, min_dis_to_center <dbl>,
#   source <chr>, description <chr>, country <chr>, type <chr>, status <chr>,
#   if_us <chr>

Other appendices (as necessary)