Earthquakes Around the World in 2022
Report
Introduction
The primary motivation of our project is to enhance our understanding of the frequency and intensity of earthquakes in the United States and globally in 2022, recognizing that earthquakes are among the most devastating natural disasters, with significant impacts on lives and property. Utilizing a comprehensive dataset detailing the time, location, and magnitude of these events, our objectives include analyzing earthquake trends, understanding regional variations, and developing a user-friendly, interactive Shiny application that makes this data accessible to the general public. This project addresses a critical need within the broader scope of natural disaster research for accessible information tools, aiming to bridge the gap between abundant earthquake data and its interpretability by non-experts. By providing an engaging platform for exploring earthquake data, we seek to enhance public awareness and understanding of these natural phenomena, contributing to better disaster preparedness and response strategies. According to our findings, there is no strong correlation among location, time, magnitude, and frequency of earthquakes around the world.
Justification of approach
Our team created a Shiny application that allows users to explore earthquake trends around the world. The intended audience is anyone with an interest in exploring earthquake data. This deliverable suits their needs because it not only includes summary statistics, but its interactive elements also let users focus on the aspects they are most interested in, such as magnitude and depth. By allowing users to input their own data and generate earthquake predictions, the application also keeps users engaged and lets them explore the data they need.
To access the application, open the myApp folder, click on app.R, and run the file. Note that the app takes a while to load because an ML model is trained at startup. It is also worth mentioning that the team intended to publish the application (https://posit.cloud/content/7079543), but the app requires more memory than the 1 GB provided by Cornell's free account, so the team is still looking for a way to resolve this issue and will publish the website for the final project.
Data description
What are the observations (rows) and the attributes (columns)?
Each observation represents a reported earthquake in 2022 somewhere in the world. The attributes provide additional information on the earthquake's:
min_dis_to_center: minimum distance to the epicenter, the point on the surface of the Earth directly above the hypocenter where the rupture originates
depth: depth below the hypocenter point (focal depth)
latitude: the number of degrees north (N) or south (S) of the equator, varying from 0 at the equator to 90 at the poles
longitude: the number of degrees east (E) or west (W) of the prime meridian, which runs through Greenwich, England
magnitude: magnitude of the event
source: network distributor that provided the event data
status: indicates whether the event has been reviewed by a human (i.e., automatic, reviewed, deleted)
country: country in which the reported event occurred
time: time at which the event occurred
gap: the largest azimuthal gap between azimuthally adjacent stations (in degrees). In general, the smaller this number, the more reliable the calculated horizontal position of the earthquake. Earthquake locations in which the azimuthal gap exceeds 180 degrees typically have large location and depth uncertainties.
place: textual description of a named geographic region near the event; this may be a city name or a Flinn-Engdahl region name
Why was this dataset created?
The dataset was created to provide earth science information and products for loss reduction. With this real-time data, we can:
Elevate earthquake hazard identification and risk assessment techniques to bolster preparedness and resilience.
Deepen understanding of earthquake occurrences and their consequences to inform better mitigation strategies.
Prioritize the enhancement of the National Earthquake Hazards Reduction Program (NEHRP) with a focus on strengthening real-time monitoring in urban areas across the United States.
Who funded the creation of the dataset?
The creation of the dataset was funded by U.S. Geological Survey (USGS) grants, drawn from federal funds each year.
What processes might have influenced what data was observed and recorded and what was not?
Some attribute data are collected from different sources and/or use different methodologies. Therefore, the process of determining certain attributes, such as depth, varies.
What preprocessing was done, and how did the data come to be in the form that you are using?
The data is automatically updated and then reviewed by humans. The data are not normalized to a standard metric; however, the algorithms used to measure these values are stored as a separate attribute.
If people are involved, were they aware of the data collection and if so, what purpose did they expect the data to be used for?
The event data are reported by network distributors and recorded in the system automatically. A human then reviews and verifies the event; however, the level of review can vary.
Design process
From the start, the team wanted to build a Shiny application that would be fully functional on the web once published. However, there were a few considerations the team faced before implementation. First, we wanted the application to be interactive rather than static so that users could engage with it directly. Second, we wanted not only visualizations presenting the relevant findings we discovered, but also machine learning techniques so that users could predict the magnitude of earthquakes based on their inputs.
Given these considerations, we came up with a design: the application is split into three pages. The first page shows an interactive map of earthquakes in the United States, as we anticipate most users will be from the U.S. Users can select a specific state and the map changes accordingly, which gives them a sense of participation. The second page depicts the bigger picture beyond the U.S.: the top of the page shows summary statistics, followed by an interactive map of the whole world with sliders for time, magnitude, and depth that filter the data. Lastly, on the third page, users enter inputs and, based on the machine learning model trained behind the scenes, receive earthquake predictions.
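As a rough illustration, the three-page layout could be expressed in Shiny along the following lines; the input IDs, widgets, and rendering logic here are illustrative placeholders rather than the exact code in myApp/app.R:

```r
library(shiny)
library(leaflet)

# Sketch of the three-page layout; all IDs and widgets are illustrative.
ui <- navbarPage(
  "Earthquakes Around the World in 2022",
  tabPanel(
    "United States",
    selectInput("state", "State", choices = state.name),
    leafletOutput("us_map")     # redrawn when the selected state changes
  ),
  tabPanel(
    "World",
    sliderInput("magnitude", "Magnitude", min = 0, max = 10, value = c(4, 8)),
    sliderInput("depth", "Depth (km)", min = 0, max = 700, value = c(0, 100)),
    leafletOutput("world_map")  # summary statistics would sit above this map
  ),
  tabPanel(
    "Prediction",
    numericInput("depth_in", "Depth (km)", value = 10),
    selectInput("month_in", "Month", choices = month.name),
    actionButton("go", "Predict"),
    textOutput("prediction")    # output of the trained model
  )
)

server <- function(input, output, session) {
  # Rendering and prediction logic omitted; see myApp/app.R for the full app.
}

# shinyApp(ui, server)
```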
There were definitely design challenges along the way. The biggest question was how many interactive visualizations to present. We worried that having two maps might be excessive, but eventually we realized that our main audience is probably U.S. residents who especially care about earthquakes happening in the U.S. Thus, we decided to keep the U.S. map and add different interactions to keep users engaged.
Overall, we took a user-centered approach that considers our main audience and what they want to see in our application, and we solved challenges along the way.
Limitations
Incomplete Data Coverage: Data values may be inaccurate or missing due to technological limitations. For instance, the magnitude may be over- or underestimated when the earthquake occurs too far from the recording stations. The position uncertainty of the hypocenter location varies from about 100 m horizontally and 300 m vertically for the best-located events (those in the middle of densely spaced seismograph networks) to tens of kilometers for global events in many parts of the world. Therefore, any location-based analysis inherits this uncertainty.
Variability in Monitoring Equipment: Differences in the quality and sensitivity of seismic networks, data collection, and determination methods can affect the accuracy and consistency of earthquake data.
Data Manipulation or Exclusions: Data may be manipulated, censored, or selectively reported for various reasons, including political, economic, or security concerns, which can impact the accuracy of the available data.
Data Privacy and Security: Privacy and security concerns may limit access to certain earthquake data, especially in sensitive or classified areas.
Conclusion and Future work
Our current prediction model requires higher accuracy. Due to limited CPU power, we decided to use only the observations from 2022; adding more observations to the data set increases processing time substantially, and given the time constraints of this project, we restricted ourselves to 2022. One drawback of this data limitation is overfitting. Of the models we fitted (null, Naive Bayes, random forest, and decision tree), random forest gave the best ROC AUC (0.855). The null model and Naive Bayes produced ROC AUCs of 0.5 and 0.7819, respectively. The random forest's AUC on the training set is 0.9868, whereas its AUC on the test set is 0.8592; this gap reflects overfitting caused by the limited number of observations. From the confusion matrix of the selected model, we found that other classes are most likely to be misclassified as "minor damage" because most observations in the current data set belong to that class, a class imbalance problem. To reduce overfitting, we would need more observations, particularly more samples representing the other classes.
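The train-versus-test AUC comparison described above can be sketched with tidymodels as follows. This is an illustrative pipeline, not the code in myApp/app.R: the data frame, class labels, and predictors are simulated stand-ins so the sketch runs end to end.

```r
library(tidymodels)

# Simulated stand-in data; the real data and preprocessing live in the app.
set.seed(123)
quakes_2022 <- tibble(
  magnitude    = runif(500, 2, 8),
  depth        = runif(500, 0, 300),
  latitude     = runif(500, -60, 60),
  longitude    = runif(500, -180, 180),
  damage_class = factor(sample(c("minor damage", "moderate", "severe"),
                               500, replace = TRUE, prob = c(0.8, 0.15, 0.05)))
)

splits      <- initial_split(quakes_2022, strata = damage_class)
quake_train <- training(splits)
quake_test  <- testing(splits)

# Random forest, the best-performing of the four models we compared
rf_fit <- workflow() |>
  add_formula(damage_class ~ magnitude + depth + latitude + longitude) |>
  add_model(rand_forest(mode = "classification") |> set_engine("ranger")) |>
  fit(data = quake_train)

# A large gap between train and test ROC AUC signals overfitting
train_auc <- predict(rf_fit, quake_train, type = "prob") |>
  bind_cols(quake_train["damage_class"]) |>
  roc_auc(damage_class, starts_with(".pred_"))
test_auc <- predict(rf_fit, quake_test, type = "prob") |>
  bind_cols(quake_test["damage_class"]) |>
  roc_auc(damage_class, starts_with(".pred_"))
```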
In addition, using the interactive tools and visualizations in our Shiny app, we observed little variation in earthquake occurrence across months for the top countries by recorded earthquakes in 2022 (i.e., Indonesia, Japan, the Philippines, Puerto Rico, and the U.S., in ascending order). As a result, our prediction model shows little or no change in the reported likelihood of earthquake occurrence when users change the input month. Given the small effect of month and the other variables (see the data exploration for detailed correlations and visualizations), we may want to try ridge regression in the future, since many of our predictors have little or no effect on the outcome variable.
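A ridge specification for this multiclass outcome could look like the following tidymodels sketch. The variable names (damage_class, month, etc.) are illustrative placeholders; mixture = 0 selects a pure L2 (ridge) penalty, which shrinks weak predictors toward zero without removing them entirely.

```r
library(tidymodels)

# Multinomial regression with an L2 (ridge) penalty via glmnet;
# the penalty strength is left to be tuned by cross-validation.
ridge_spec <- multinom_reg(penalty = tune(), mixture = 0) |>
  set_engine("glmnet")

ridge_wf <- workflow() |>
  add_formula(damage_class ~ magnitude + depth + latitude + longitude + month) |>
  add_model(ridge_spec)

# Tune over penalty values with cross-validation, e.g.:
# tune_grid(ridge_wf, resamples = vfold_cv(train_data, v = 5), grid = 20)
```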
Acknowledgments
We want to extend our gratitude to all those who helped contribute to the successful completion of this project. Our appreciation goes to:
Stack Overflow: for providing resources and solutions to the errors we came across
INFO5001 staff: for their invaluable guidance and efforts in producing accessible, comprehensive, and informative class sessions and materials, whose concepts and code we could reapply in our project
ChatGPT: for suggesting relevant functions and libraries that we could implement in our algorithms based on the approaches we brainstormed
Medium/Kaggle/GitHub authors: for the pioneering work that laid the foundation for the machine learning portion of our project. Thanks to the following authors who inspired this project:
https://medium.com/analytics-vidhya/earthquake-damage-prediction-with-machine-learning-part-1-74cc73bb978b
https://vidhur2k.github.io/Earthquake-Prediction/
https://github.com/akash-r34/Earthquake-prediction-using-Machine-learning-models
Teammates: for their patience, encouragement, and resourcefulness in providing one another with mental and technical support.