library(tidyverse)
library(skimr)
Project title
Proposal
Data 1
Problem or question
Question: We want to explore how accurate World Cup predictions can be.
Importance: The rapid advancement of data science has profoundly changed sports. Because the World Cup is one of the most influential sporting events in the world, we are curious whether its results can be predicted. This is important because, if the predictions are indeed accurate, fans can get a better idea of whether their team is likely to win and by how much.
Types of variables used: numerical variable, date variables, and text variables.
Deliverable: The main deliverable may be an interactive web application. Users can compare the real-world and predicted results for a single match, so that they can judge how well the predictions hold up.
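One common way to put a number on the comparison between predicted and real results is the Brier score, the mean squared gap between a forecast probability and the 0/1 outcome. The sketch below is only an illustration of that idea, not the proposal's chosen method; the function name `brier_score` and the probabilities and outcomes are hypothetical and are not taken from the dataset.

```r
# Brier score: mean squared difference between forecast probabilities
# and observed 0/1 outcomes. Lower is better; 0 is a perfect forecast.
brier_score <- function(predicted_prob, outcome) {
  stopifnot(length(predicted_prob) == length(outcome))
  mean((predicted_prob - outcome)^2)
}

# Hypothetical values for illustration only (not from wc_forecasts.csv):
# forecast chance that each of four teams reaches a given round,
# and whether each team actually did (1) or did not (0).
predicted_prob <- c(0.9, 0.6, 0.3, 0.1)
outcome <- c(1, 1, 0, 0)

score <- brier_score(predicted_prob, outcome)
print(score)
```

A forecast that always says 0.5 scores 0.25, which gives a rough baseline to compare against.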
Introduction and data
- Data source: https://projects.fivethirtyeight.com/
- How it was collected: The data manager used a crawler to collect match data from sports websites and kept only the matches of national teams participating in the 2018 and 2022 World Cups. In addition, they removed older match data in which the players were completely different from today's team.
- Brief description of data: This dataset contains predictions for the 2018 and 2022 World Cup matches based on the teams' SPI ratings, along with the number of goals, corners, shots on goal, and related match data.
- Ethical concerns: n/a (It only takes into account the performance of teams in the league and does not involve personal information).
Glimpse of data
# add code here
wc_forecasts <- read_csv("./data/wc_forecasts.csv")
Rows: 512 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): forecast_timestamp, team, group, timestamp
dbl (18): spi, global_o, global_d, sim_wins, sim_ties, sim_losses, sim_goal_...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(wc_forecasts)
Rows: 512
Columns: 22
$ forecast_timestamp <chr> "2022-12-18 17:56:03 UTC", "2022-12-18 17:56:03 UTC…
$ team <chr> "Argentina", "France", "Morocco", "Croatia", "Engla…
$ group <chr> "C", "D", "F", "F", "B", "A", "H", "G", "E", "A", "…
$ spi <dbl> 89.64860, 88.30043, 73.16416, 78.82038, 87.82131, 8…
$ global_o <dbl> 2.83610, 2.96765, 1.74313, 2.20264, 2.71564, 2.5271…
$ global_d <dbl> 0.39397, 0.54381, 0.53433, 0.60290, 0.44261, 0.5494…
$ sim_wins <dbl> 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, …
$ sim_ties <dbl> 0, 0, 1, 2, 1, 1, 0, 0, 0, 0, 0, 0, 2, 1, 1, 1, 1, …
$ sim_losses <dbl> 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, …
$ sim_goal_diff <dbl> 3, 3, 3, 3, 7, 4, 2, 2, 1, 1, 1, -1, 1, 6, 0, 0, 1,…
$ goals_scored <dbl> 5, 6, 4, 4, 9, 5, 6, 3, 4, 5, 4, 3, 2, 9, 4, 2, 6, …
$ goals_against <dbl> 2, 3, 1, 1, 2, 1, 4, 1, 3, 4, 3, 4, 1, 3, 4, 2, 5, …
$ group_1 <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
$ group_2 <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, …
$ group_3 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
$ group_4 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ make_round_of_16 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, …
$ make_quarters <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ make_semis <dbl> 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ make_final <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ win_league <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ timestamp <chr> "2022-12-18 17:56:44 UTC", "2022-12-18 17:56:44 UTC…
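The glimpse shows that `forecast_timestamp` and `timestamp` were read as character columns, so they would need converting before being used as date variables. A minimal base R sketch, assuming every value follows the `YYYY-MM-DD HH:MM:SS UTC` pattern seen in the output above (the vector below copies the first displayed value rather than reading the file):

```r
# Sample values copied from the glimpse output; the real column would be
# wc_forecasts$forecast_timestamp.
forecast_timestamp <- c("2022-12-18 17:56:03 UTC", "2022-12-18 17:56:03 UTC")

# Parse the strings as date-times in UTC. The literal " UTC" suffix is
# matched as plain text in the format string.
parsed <- as.POSIXct(
  forecast_timestamp,
  format = "%Y-%m-%d %H:%M:%S UTC",
  tz = "UTC"
)

# Calendar date of each forecast, useful for grouping forecasts by day.
forecast_date <- as.Date(parsed, tz = "UTC")
print(forecast_date)
```

Any value that does not match the assumed pattern comes back as `NA`, so checking `anyNA(parsed)` after the conversion is a cheap sanity test.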
Data 2
Problem or question
Question: We want to explore the relationship between time, location, and magnitude of earthquakes in the past 30 days around the globe.
Importance: Earthquakes are among the most devastating natural disasters on this planet, destroying property and ending lives. It is therefore important to understand how many earthquakes happen each month and how powerful and harmful they are. Investigating the proposed question will give us a better understanding of how earthquakes influence our lives.
Types of variables used: categorical variable, numerical variable, and date variables.
Deliverable: The main deliverable will be an interactive web application that showcases information via maps and other visualizations. Users will be able to click or perform other interactions to explore the webpage.
Introduction and data
- Data source: https://earthquake.usgs.gov/earthquakes/search/
- How it was collected: The data come from the official USGS website and are recorded by a seismographic network. Each seismic station in the network measures the movement of the ground at its site. The slip of one block of rock over another in an earthquake releases energy that makes the ground vibrate.
- Brief description of data: This dataset contains earthquakes of magnitude 4.5 or greater from the past 30 days. There are 580 rows, each representing one earthquake, and 22 columns that describe each event. The date/time, location, and magnitude, as well as other factors such as depth, are all essential to answering our proposed question. We observe that most events have magnitudes between 4.5 and 6 and that many have a depth of 10.000.
- Ethical concerns: n/a
Glimpse of data
# add code here
earthquakes <- read_csv("./data/earthquakes.csv")
Rows: 580 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): magType, net, id, place, type, status, locationSource, magSource
dbl (12): latitude, longitude, depth, mag, nst, gap, dmin, rms, horizontalE...
dttm (2): time, updated
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(earthquakes)
Rows: 580
Columns: 22
$ time <dttm> 2023-10-10 20:20:24, 2023-10-10 16:42:28, 2023-10-10 …
$ latitude <dbl> -6.6481, -8.5598, 15.9832, 7.0223, -22.8844, -15.5450,…
$ longitude <dbl> 129.7709, 112.5475, 146.9409, -82.5847, -66.2242, -174…
$ depth <dbl> 166.592, 84.706, 36.505, 10.000, 247.255, 25.023, 538.…
$ mag <dbl> 4.6, 4.7, 4.9, 4.5, 6.0, 5.7, 5.6, 4.6, 5.0, 4.6, 5.1,…
$ magType <chr> "mb", "mb", "mb", "mb", "mww", "mww", "mww", "mb", "mw…
$ nst <dbl> 87, 81, 70, 65, 100, 57, 95, 41, 162, 48, 114, 55, 74,…
$ gap <dbl> 83, 49, 76, 149, 34, 57, 25, 149, 16, 112, 88, 75, 150…
$ dmin <dbl> 2.116, 1.590, 1.333, 1.763, 1.803, 3.092, 4.209, 5.843…
$ rms <dbl> 0.64, 1.00, 0.89, 0.60, 0.70, 1.21, 0.80, 0.64, 0.69, …
$ net <chr> "us", "us", "us", "us", "us", "us", "us", "us", "us", …
$ id <chr> "us6000lelx", "us6000lei9", "us6000leh9", "us6000leh8"…
$ updated <dttm> 2023-10-11 14:27:24, 2023-10-11 22:42:23, 2023-10-10 …
$ place <chr> NA, "43 km SSW of Gongdanglegi Kulon, Indonesia", "153…
$ type <chr> "earthquake", "earthquake", "earthquake", "earthquake"…
$ horizontalError <dbl> 8.12, 7.60, 9.76, 7.44, 8.15, 7.49, 10.66, 11.34, 5.39…
$ depthError <dbl> 6.894, 5.684, 4.030, 1.915, 5.840, 5.196, 6.116, 1.954…
$ magError <dbl> 0.063, 0.074, 0.066, 0.091, 0.047, 0.089, 0.055, 0.107…
$ magNst <dbl> 80, 55, 73, 35, 44, 12, 32, 26, 47, 42, 16, 58, 95, 34…
$ status <chr> "reviewed", "reviewed", "reviewed", "reviewed", "revie…
$ locationSource <chr> "us", "us", "us", "us", "us", "us", "us", "us", "us", …
$ magSource <chr> "us", "us", "us", "us", "us", "us", "us", "us", "us", …
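The observations in the data description (most magnitudes between 4.5 and 6, many depths of 10.000) can be checked with two proportions. The sketch below uses base R on the first six rows shown in the glimpse, so the printed shares describe only that small sample; running the same two lines on the full `earthquakes` data frame would give the figures for all 580 rows.

```r
# First six rows of mag and depth, copied from the glimpse output.
quakes <- data.frame(
  mag   = c(4.6, 4.7, 4.9, 4.5, 6.0, 5.7),
  depth = c(166.592, 84.706, 36.505, 10.000, 247.255, 25.023)
)

# Share of events with magnitude between 4.5 and 6 (inclusive).
share_mag_in_range <- mean(quakes$mag >= 4.5 & quakes$mag <= 6)

# Share of events whose depth is recorded as exactly 10.
share_depth_10 <- mean(quakes$depth == 10)

print(c(mag_4.5_to_6 = share_mag_in_range, depth_10 = share_depth_10))
```

Taking the mean of a logical vector gives the proportion of `TRUE` values, which keeps each check to a single line.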
Data 3
Problem or question
Question: We want to explore Americans’ attitudes toward Presidents Biden’s and Trump’s responses to COVID-19.
Importance: Assessing Americans’ attitudes toward the responses of Presidents Biden and Trump to COVID-19 is essential for evaluating the effectiveness of public health policies, informing political decision-making, and understanding the broader social, economic, and democratic implications of leadership during a global health crisis.
Types of variables used: categorical variable, numerical variable, date variables, and text variables.
Deliverable: The main deliverable will be an interactive web application. Users will be able to click different president names to switch between their different approval patterns among Americans.
Introduction and data
- Data source: https://github.com/fivethirtyeight/covid-19-polls
- How it was collected: The collection method is not specified, but the data were likely taken from surveys.
- Brief description of data: The dataset tracks poll results on whether Americans approve of the way the president is handling COVID-19. Its columns include the poll dates, the president involved, the sample size, the approval and disapproval percentages, and demographic information about the respondents, such as their party.
- Ethical concerns: n/a
Glimpse of data
# add code here
covid_approval_polls <- read_csv("./data/covid_approval_polls.csv")
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
Rows: 3867 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): pollster, sponsor, population, party, subject, text, url
dbl (3): sample_size, approve, disapprove
lgl (1): tracking
date (2): start_date, end_date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(covid_approval_polls)
Rows: 3,867
Columns: 13
$ start_date <date> 2020-02-02, 2020-02-02, 2020-02-02, 2020-02-02, 2020-02-0…
$ end_date <date> 2020-02-04, 2020-02-04, 2020-02-04, 2020-02-04, 2020-02-0…
$ pollster <chr> "YouGov", "YouGov", "YouGov", "YouGov", "Morning Consult",…
$ sponsor <chr> "Economist", "Economist", "Economist", "Economist", NA, NA…
$ sample_size <dbl> 1500, 376, 523, 599, 2200, 684, 817, 700, 1996, 700, 788, …
$ population <chr> "a", "a", "a", "a", "a", "a", "a", "a", "rv", "rv", "rv", …
$ party <chr> "all", "R", "D", "I", "all", "R", "D", "I", "all", "R", "D…
$ subject <chr> "Trump", "Trump", "Trump", "Trump", "Trump", "Trump", "Tru…
$ tracking <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
$ text <chr> "Do you approve or disapprove of Donald Trump’s handling o…
$ approve <dbl> 42, 75, 21, 39, 57, 88, 37, 50, 39, 71, 15, 34, 39, 74, 19…
$ disapprove <dbl> 29, 6, 51, 25, 22, 4, 37, 22, 35, 8, 60, 33, 28, 7, 50, 25…
$ url <chr> "https://d25d2506sfb94s.cloudfront.net/cumulus_uploads/doc…
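One simple quantity for comparing approval patterns across parties and presidents is net approval, `approve` minus `disapprove`. The sketch below is an illustration under that assumption, using base R and the first four rows shown in the glimpse (one YouGov poll split by party); the column name `net_approval` is introduced here and is not part of the dataset.

```r
# First four rows of the relevant columns, copied from the glimpse output.
polls <- data.frame(
  party      = c("all", "R", "D", "I"),
  subject    = c("Trump", "Trump", "Trump", "Trump"),
  approve    = c(42, 75, 21, 39),
  disapprove = c(29, 6, 51, 25)
)

# Net approval: positive when more respondents approve than disapprove.
polls$net_approval <- polls$approve - polls$disapprove

# Average net approval for each president and party group.
net_by_party <- aggregate(
  net_approval ~ subject + party,
  data = polls,
  FUN = mean
)
print(net_by_party)
```

On the full data the same `aggregate()` call would average over many polls per group, and adding a date component to the formula would show how the pattern changes over time.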
Data 4
Question: We want to explore the likelihood of seeing a specific type of squirrel in a specific park in NYC.
Importance: It is important for sanitation workers and tree surgeons to understand the habitats and temperament of the squirrels in their regions so that they can minimize harm to the squirrels and to themselves when cutting down trees and collecting trash.
Types of variables used: categorical variable, numerical variable, date variables, and text variables.
Deliverable: The main deliverable will be an interactive web application or chatbot. Users will be given a list of options from which to select their main objective, and visualizations (with interpretations) will be generated based on the selected objective. Users can specify constraints when the chatbot asks for clarification along the way.
Introduction and data
- Data source: https://www.thesquirrelcensus.com/
- How it was collected: Collected based on observation with the help of 300 volunteers.
- Brief description of data: The data are retrieved from The Squirrel Census, a multimedia project focused on Eastern gray squirrels in New York City. In March 2020, with 72 volunteers and NYC Open Data, they surveyed 24 parks, tallying 433 squirrel sightings. Their methodology prioritized narrative insights over mere population counts, highlighting the interactions between squirrels, humans, and urban parks. We observe from the data that many squirrels are gray and are found in trees.
- Ethical concerns: n/a
Glimpse of data
nyc_park_squirrel <- read_csv("./data/squirrel-data.csv")
Rows: 433 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): Area Name, Area ID, Park Name, Park ID, Squirrel ID, Primary Fur C...
dbl (2): Squirrel Latitude (DD.DDDDDD), Squirrel Longitude (-DD.DDDDDD)
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(nyc_park_squirrel)
Rows: 433
Columns: 16
$ `Area Name` <chr> "UPPER MANHATTAN", "UPPER MANHATTAN"…
$ `Area ID` <chr> "A", "A", "A", "A", "A", "A", "A", "…
$ `Park Name` <chr> "Fort Tryon Park", "Fort Tryon Park"…
$ `Park ID` <chr> "01", "01", "01", "01", "01", "01", …
$ `Squirrel ID` <chr> "A-01-01", "A-01-02", "A-01-03", "A-…
$ `Primary Fur Color` <chr> "Gray", "Gray", "Gray", "Gray", "Gra…
$ `Highlights in Fur Color` <chr> "White", "White", "White", "White", …
$ `Color Notes` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ Location <chr> "Ground Plane", "Ground Plane", "Gro…
$ `Above Ground (Height in Feet)` <chr> NA, NA, NA, NA, NA, NA, NA, "10", NA…
$ `Specific Location` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ Activities <chr> "Foraging", "Foraging", "Eating, Dig…
$ `Interactions with Humans` <chr> "Indifferent", "Indifferent", "Indif…
$ `Other Notes or Observations` <chr> NA, "Looks skinny", NA, NA, "She lef…
$ `Squirrel Latitude (DD.DDDDDD)` <dbl> 40.85941, 40.85944, 40.85942, 40.859…
$ `Squirrel Longitude (-DD.DDDDDD)` <dbl> -73.93394, -73.93394, -73.93389, -73…
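A first approximation of the likelihood of seeing a given type of squirrel in a given park is the share of that park's sightings with each primary fur color. The sketch below assumes that simple empirical proportion is an acceptable stand-in for likelihood; it uses base R on the first four rows visible in the glimpse, so it shows only the mechanics rather than a real result.

```r
# First four rows, copied from the glimpse output. The park name for rows
# 3 and 4 is inferred from their Park ID ("01") and Squirrel ID ("A-01-..").
# check.names = FALSE keeps the spaces in the original column names.
squirrels <- data.frame(
  `Park Name`         = rep("Fort Tryon Park", 4),
  `Primary Fur Color` = rep("Gray", 4),
  check.names = FALSE
)

# Count sightings for every park / fur color combination.
counts <- table(squirrels$`Park Name`, squirrels$`Primary Fur Color`)

# Convert counts to row-wise shares: within each park, the proportion of
# sightings with each fur color.
shares <- prop.table(counts, margin = 1)
print(shares)
```

Rows with a missing fur color are dropped by `table()` by default, so the shares are computed over sightings where a color was recorded.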