library(tidyverse)
library(skimr)
library(jsonlite)
library(dplyr)
library(tibble)
Wondrous Raichu
Project Proposal
Data 1
Introduction and data
Identify the source of the data.
The dataset comes from Inside Airbnb (http://insideairbnb.com/get-the-data/), an open platform that provides data on Airbnb listings in different locations around the world.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
The data cover NYC listings from the last quarter of 2022. They are compiled from the public information on Airbnb’s website and show all listings as they appeared on the scrape dates for that quarter (for this particular dataset, December 4th and 5th, 2022).
Write a brief description of the observations.
Each observation in this dataset is one Airbnb listing and contains the pieces of information posted on a typical listing page: who the host is, ratings of the host and of the place of stay, location, amenities, price, etc.
Research question(s)
How do Airbnb listing prices depend on rating of listing, location of listing, amenities included in the listing, and host response time?
In this research question, we aim to understand what factors affect Airbnb listing prices. There are five variables: four independent variables (i.e., rating of listing, location of listing, amenities included in the listing, and host response time) and one dependent variable (i.e., listing price). Listing price and rating of listing are quantitative variables, while location of listing, amenities included in the listing, and host response time are categorical variables (host response time is recorded in the data as one of a small set of categories rather than as a number of hours).
This research question is important because it provides us with a better understanding of the Airbnb price mechanism, which can shed light on the broader dynamics of the sharing economy and the way in which online platforms like Airbnb are changing the traditional hospitality industry.
Our hypothesis is that a high listing rating, a more comprehensive list of included amenities, a quick host response time, and a listing located in a safe and convenient location will all lead to higher listing prices.
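As a minimal sketch of how this question could be approached, the code below converts the price column, which is stored as text in the dataset, into a number and fits a linear model. The toy tibble and its values are made up for illustration; only the column names (price, review_scores_rating, neighbourhood_group_cleansed) come from the data, and a linear model is one assumed choice rather than a settled method.

```r
library(dplyr)
library(readr)
library(tibble)

# Hypothetical toy rows that mimic the relevant Airbnb columns
toy_listings <- tribble(
  ~price,      ~review_scores_rating, ~neighbourhood_group_cleansed,
  "$150.00",   4.8,                   "Manhattan",
  "$1,200.00", 4.9,                   "Manhattan",
  "$80.00",    4.5,                   "Brooklyn",
  "$95.00",    4.7,                   "Brooklyn",
  "$60.00",    4.2,                   "Queens",
  "$70.00",    4.6,                   "Queens"
)

# parse_number() drops the dollar sign and the thousands separator
toy_listings <- toy_listings |>
  mutate(price_num = parse_number(price))

# One possible model: price as a function of rating and borough
price_fit <- lm(
  price_num ~ review_scores_rating + neighbourhood_group_cleansed,
  data = toy_listings
)
coef(price_fit)
```

Amenities and host response time could be added to the same formula once they are recoded into usable predictors.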
How do characteristics (i.e., ratings and descriptions) of listings and of hosts affect the prices of listings?
In this research question, we aim to understand how the characteristics of listings and hosts affect the prices of listings on Airbnb. There are five variables: four independent variables (i.e., rating of host, description of host, rating of listing, description of listing) and one dependent variable (i.e., listing price). Ratings and listing prices are quantitative variables, while descriptions are categorical variables.
This research question is important because it highlights the strategies hosts take to make a good impression on potential guests and remain competitive in the crowded Airbnb marketplace. Moreover, for potential guests, understanding the strategies hosts use can help them make more informed decisions when selecting a listing that meets their needs and budget.
Our hypothesis is that higher listing ratings and higher host ratings will lead to higher listing prices. Furthermore, listings and hosts with more detailed and attractive descriptions will lead to higher listing prices.
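Because descriptions are free text, they need to be turned into something measurable before they can be compared with price. The sketch below shows one assumed way to do that, using text length as a rough proxy for how detailed a description is. The toy rows are invented; description and host_about are column names from the dataset, while the derived column names are hypothetical.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical toy rows; NA mimics listings or hosts without any text
toy_text <- tibble(
  description = c("Sunny loft near the park", NA, "Cozy room"),
  host_about  = c("Host since 2010, loves to travel", "Artist", NA)
)

toy_text <- toy_text |>
  mutate(
    # Character counts as a crude measure of detail (NA stays NA)
    description_length = str_length(description),
    host_about_length  = str_length(host_about),
    # Flag whether the host wrote anything about themselves
    has_host_about     = !is.na(host_about)
  )

toy_text
```

Length says nothing about how attractive a description is, so this would only cover the "detailed" half of the hypothesis.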
What are the most popular neighborhoods for Airbnb listings and how does this popularity vary by listing type and price?
In this research question, we aim to understand how the popularity of Airbnb listings depends on listing type and listing price. Here, we associate popularity with high ratings and a high number of reviews. There are four variables: two independent variables (i.e., listing type and listing price) and two dependent variables (i.e., ratings and number of reviews). Listing type is a categorical variable, while listing price, ratings, and number of reviews are quantitative variables.
This research question is important as it can provide valuable insights into the preferences of travelers and market trends for Airbnb hosts and guests. It helps potential hosts to identify profitable areas for their listings and to understand the preferences of Airbnb customers. Identifying the most popular neighborhoods for Airbnb listings and analyzing how this popularity varies by listing type and price can help hosts adjust their pricing strategies and better target their listings, while guests can make more informed decisions on where to stay. Additionally, this research can shed light on the factors that influence the popularity of neighborhoods for Airbnb listings, providing valuable information for city planners, policymakers, and tourism agencies in shaping their urban development strategies and promoting tourism in specific areas.
Our hypothesis is that the most popular neighborhoods will be in Manhattan because it is the center of business and entertainment. Moreover, we hypothesize that entire home/apt listings will be more popular compared to private or shared rooms and that lower-priced listings will be more popular than higher-priced ones.
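A grouped summary is one way to compare popularity across areas and listing types. The sketch below uses invented toy rows with column names taken from the dataset (neighbourhood_group_cleansed, room_type, number_of_reviews, review_scores_rating). Using the median number of reviews as the popularity measure is an assumption for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical toy rows; NA mimics a listing without a rating
toy_popularity <- tribble(
  ~neighbourhood_group_cleansed, ~room_type,        ~number_of_reviews, ~review_scores_rating,
  "Manhattan",                   "Entire home/apt", 120,                4.8,
  "Manhattan",                   "Private room",    40,                 4.6,
  "Brooklyn",                    "Entire home/apt", 90,                 4.9,
  "Brooklyn",                    "Private room",    10,                 NA,
  "Manhattan",                   "Entire home/apt", 60,                 4.7
)

# Summarise popularity for each borough and room type combination
popularity_summary <- toy_popularity |>
  group_by(neighbourhood_group_cleansed, room_type) |>
  summarise(
    n_listings     = n(),
    median_reviews = median(number_of_reviews),
    mean_rating    = mean(review_scores_rating, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(median_reviews))

popularity_summary
```

Grouping by neighbourhood_cleansed instead would give neighborhood-level rather than borough-level results.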
Glimpse of data
airbnb_data <- read_csv("data/airbnb_data/airbnb_data.csv")
# Preview some rows
head(airbnb_data)
# A tibble: 6 × 75
id listing_url scrape_id last_scraped source name description
<dbl> <chr> <dbl> <date> <chr> <chr> <chr>
1 2595 https://www.airbnb.com/… 2.02e13 2022-12-05 city … Skyl… "Beautiful…
2 5203 https://www.airbnb.com/… 2.02e13 2022-12-05 previ… Cozy… "Our best …
3 5136 https://www.airbnb.com/… 2.02e13 2022-12-04 city … Spac… "We welcom…
4 5121 https://www.airbnb.com/… 2.02e13 2022-12-05 city … Blis… "One room …
5 6848 https://www.airbnb.com/… 2.02e13 2022-12-05 city … Only… "Comfortab…
6 5178 https://www.airbnb.com/… 2.02e13 2022-12-05 city … Larg… "Please do…
# ℹ 68 more variables: neighborhood_overview <chr>, picture_url <chr>,
# host_id <dbl>, host_url <chr>, host_name <chr>, host_since <date>,
# host_location <chr>, host_about <chr>, host_response_time <chr>,
# host_response_rate <chr>, host_acceptance_rate <chr>,
# host_is_superhost <lgl>, host_thumbnail_url <chr>, host_picture_url <chr>,
# host_neighbourhood <chr>, host_listings_count <dbl>,
# host_total_listings_count <dbl>, host_verifications <chr>, …
# Skim through data
skim(airbnb_data)
Name | airbnb_data |
Number of rows | 41533 |
Number of columns | 75 |
_______________________ | |
Column type frequency: | |
character | 25 |
Date | 5 |
logical | 8 |
numeric | 37 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
listing_url | 0 | 1.00 | 33 | 47 | 0 | 41533 | 0 |
source | 0 | 1.00 | 11 | 15 | 0 | 2 | 0 |
name | 11 | 1.00 | 1 | 248 | 0 | 40242 | 0 |
description | 785 | 0.98 | 1 | 1000 | 0 | 37137 | 0 |
neighborhood_overview | 17443 | 0.58 | 1 | 1000 | 0 | 19392 | 0 |
picture_url | 0 | 1.00 | 60 | 126 | 0 | 40390 | 0 |
host_url | 0 | 1.00 | 38 | 43 | 0 | 26832 | 0 |
host_name | 5 | 1.00 | 1 | 35 | 0 | 9628 | 0 |
host_location | 7745 | 0.81 | 5 | 40 | 0 | 1104 | 0 |
host_about | 18316 | 0.56 | 1 | 7309 | 0 | 14151 | 26 |
host_response_time | 5 | 1.00 | 3 | 18 | 0 | 5 | 0 |
host_response_rate | 5 | 1.00 | 2 | 4 | 0 | 68 | 0 |
host_acceptance_rate | 5 | 1.00 | 2 | 4 | 0 | 99 | 0 |
host_thumbnail_url | 5 | 1.00 | 55 | 106 | 0 | 26362 | 0 |
host_picture_url | 5 | 1.00 | 57 | 109 | 0 | 26362 | 0 |
host_neighbourhood | 8189 | 0.80 | 3 | 50 | 0 | 539 | 0 |
host_verifications | 0 | 1.00 | 2 | 32 | 0 | 8 | 0 |
neighbourhood | 17443 | 0.58 | 13 | 55 | 0 | 191 | 0 |
neighbourhood_cleansed | 0 | 1.00 | 4 | 25 | 0 | 223 | 0 |
neighbourhood_group_cleansed | 0 | 1.00 | 5 | 13 | 0 | 5 | 0 |
property_type | 0 | 1.00 | 4 | 34 | 0 | 80 | 0 |
room_type | 0 | 1.00 | 10 | 15 | 0 | 4 | 0 |
bathrooms_text | 77 | 1.00 | 6 | 17 | 0 | 30 | 0 |
amenities | 0 | 1.00 | 2 | 2028 | 0 | 35522 | 0 |
price | 0 | 1.00 | 5 | 10 | 0 | 1287 | 0 |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
last_scraped | 0 | 1.00 | 2022-12-04 | 2022-12-05 | 2022-12-05 | 2 |
host_since | 5 | 1.00 | 2008-08-22 | 2022-12-02 | 2016-04-04 | 4649 |
calendar_last_scraped | 0 | 1.00 | 2022-12-04 | 2022-12-05 | 2022-12-05 | 2 |
first_review | 9393 | 0.77 | 2009-04-23 | 2022-12-04 | 2019-12-15 | 3772 |
last_review | 9393 | 0.77 | 2011-05-12 | 2022-12-04 | 2022-09-15 | 2715 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
host_is_superhost | 29 | 1 | 0.21 | FAL: 32635, TRU: 8869 |
host_has_profile_pic | 5 | 1 | 0.98 | TRU: 40904, FAL: 624 |
host_identity_verified | 5 | 1 | 0.85 | TRU: 35262, FAL: 6266 |
bathrooms | 41533 | 0 | NaN | : |
calendar_updated | 41533 | 0 | NaN | : |
has_availability | 0 | 1 | 0.85 | TRU: 35281, FAL: 6252 |
license | 41533 | 0 | NaN | : |
instant_bookable | 0 | 1 | 0.20 | FAL: 33123, TRU: 8410 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
id | 0 | 1.00 | 1.728318e+17 | 2.974371e+17 | 2.59500e+03 | 1.835861e+07 | 4.117861e+07 | 5.477978e+17 | 7.741268e+17 | ▇▁▁▁▂ |
scrape_id | 0 | 1.00 | 2.022120e+13 | 0.000000e+00 | 2.02212e+13 | 2.022120e+13 | 2.022120e+13 | 2.022120e+13 | 2.022120e+13 | ▁▁▇▁▁ |
host_id | 0 | 1.00 | 1.400636e+08 | 1.526932e+08 | 2.43800e+03 | 1.491162e+07 | 6.561181e+07 | 2.418897e+08 | 4.899967e+08 | ▇▂▂▁▂ |
host_listings_count | 5 | 1.00 | 8.662000e+01 | 5.183500e+02 | 1.00000e+00 | 1.000000e+00 | 2.000000e+00 | 5.000000e+00 | 4.559000e+03 | ▇▁▁▁▁ |
host_total_listings_count | 5 | 1.00 | 1.362600e+02 | 7.735100e+02 | 1.00000e+00 | 1.000000e+00 | 3.000000e+00 | 7.000000e+00 | 1.201700e+04 | ▇▁▁▁▁ |
latitude | 0 | 1.00 | 4.073000e+01 | 6.000000e-02 | 4.05000e+01 | 4.069000e+01 | 4.072000e+01 | 4.076000e+01 | 4.091000e+01 | ▁▂▇▅▁ |
longitude | 0 | 1.00 | -7.394000e+01 | 6.000000e-02 | -7.42500e+01 | -7.398000e+01 | -7.395000e+01 | -7.392000e+01 | -7.371000e+01 | ▁▁▇▂▁ |
accommodates | 0 | 1.00 | 2.960000e+00 | 2.080000e+00 | 0.00000e+00 | 2.000000e+00 | 2.000000e+00 | 4.000000e+00 | 1.600000e+01 | ▇▃▁▁▁ |
bedrooms | 3822 | 0.91 | 1.380000e+00 | 7.600000e-01 | 1.00000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 1.400000e+01 | ▇▁▁▁▁ |
beds | 941 | 0.98 | 1.650000e+00 | 1.160000e+00 | 1.00000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 4.200000e+01 | ▇▁▁▁▁ |
minimum_nights | 0 | 1.00 | 1.859000e+01 | 3.070000e+01 | 1.00000e+00 | 2.000000e+00 | 1.000000e+01 | 3.000000e+01 | 1.250000e+03 | ▇▁▁▁▁ |
maximum_nights | 0 | 1.00 | 5.324173e+04 | 1.053830e+07 | 1.00000e+00 | 6.000000e+01 | 3.650000e+02 | 1.125000e+03 | 2.147484e+09 | ▇▁▁▁▁ |
minimum_minimum_nights | 14 | 1.00 | 1.864000e+01 | 3.239000e+01 | 1.00000e+00 | 2.000000e+00 | 7.000000e+00 | 3.000000e+01 | 1.250000e+03 | ▇▁▁▁▁ |
maximum_minimum_nights | 14 | 1.00 | 2.297000e+01 | 4.853000e+01 | 1.00000e+00 | 2.000000e+00 | 1.400000e+01 | 3.000000e+01 | 1.250000e+03 | ▇▁▁▁▁ |
minimum_maximum_nights | 14 | 1.00 | 1.243053e+06 | 5.161702e+07 | 1.00000e+00 | 2.700000e+02 | 1.125000e+03 | 1.125000e+03 | 2.147484e+09 | ▇▁▁▁▁ |
maximum_maximum_nights | 14 | 1.00 | 2.122356e+06 | 6.745119e+07 | 1.00000e+00 | 3.650000e+02 | 1.125000e+03 | 1.125000e+03 | 2.147484e+09 | ▇▁▁▁▁ |
minimum_nights_avg_ntm | 14 | 1.00 | 2.251000e+01 | 4.738000e+01 | 1.00000e+00 | 2.000000e+00 | 1.000000e+01 | 3.000000e+01 | 1.250000e+03 | ▇▁▁▁▁ |
maximum_nights_avg_ntm | 14 | 1.00 | 1.398113e+06 | 5.294218e+07 | 1.00000e+00 | 3.650000e+02 | 1.125000e+03 | 1.125000e+03 | 2.147484e+09 | ▇▁▁▁▁ |
availability_30 | 0 | 1.00 | 7.900000e+00 | 1.014000e+01 | 0.00000e+00 | 0.000000e+00 | 2.000000e+00 | 1.500000e+01 | 3.000000e+01 | ▇▂▁▁▂ |
availability_60 | 0 | 1.00 | 2.192000e+01 | 2.210000e+01 | 0.00000e+00 | 0.000000e+00 | 1.800000e+01 | 4.200000e+01 | 6.000000e+01 | ▇▁▂▂▃ |
availability_90 | 0 | 1.00 | 3.681000e+01 | 3.487000e+01 | 0.00000e+00 | 0.000000e+00 | 3.300000e+01 | 7.000000e+01 | 9.000000e+01 | ▇▁▁▃▅ |
availability_365 | 0 | 1.00 | 1.432900e+02 | 1.442800e+02 | 0.00000e+00 | 0.000000e+00 | 8.700000e+01 | 3.120000e+02 | 3.650000e+02 | ▇▂▂▁▅ |
number_of_reviews | 0 | 1.00 | 2.620000e+01 | 5.618000e+01 | 0.00000e+00 | 1.000000e+00 | 5.000000e+00 | 2.500000e+01 | 1.666000e+03 | ▇▁▁▁▁ |
number_of_reviews_ltm | 0 | 1.00 | 7.980000e+00 | 1.856000e+01 | 0.00000e+00 | 0.000000e+00 | 1.000000e+00 | 8.000000e+00 | 9.920000e+02 | ▇▁▁▁▁ |
number_of_reviews_l30d | 0 | 1.00 | 6.700000e-01 | 1.550000e+00 | 0.00000e+00 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 7.400000e+01 | ▇▁▁▁▁ |
review_scores_rating | 9393 | 0.77 | 4.630000e+00 | 7.200000e-01 | 0.00000e+00 | 4.600000e+00 | 4.830000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
review_scores_accuracy | 9841 | 0.76 | 4.750000e+00 | 4.600000e-01 | 0.00000e+00 | 4.710000e+00 | 4.890000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
review_scores_cleanliness | 9831 | 0.76 | 4.630000e+00 | 5.400000e-01 | 0.00000e+00 | 4.500000e+00 | 4.800000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
review_scores_checkin | 9845 | 0.76 | 4.820000e+00 | 4.100000e-01 | 0.00000e+00 | 4.800000e+00 | 4.950000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
review_scores_communication | 9836 | 0.76 | 4.810000e+00 | 4.300000e-01 | 0.00000e+00 | 4.800000e+00 | 4.960000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
review_scores_location | 9848 | 0.76 | 4.740000e+00 | 4.100000e-01 | 0.00000e+00 | 4.640000e+00 | 4.860000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
review_scores_value | 9848 | 0.76 | 4.650000e+00 | 4.900000e-01 | 0.00000e+00 | 4.550000e+00 | 4.770000e+00 | 4.950000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
calculated_host_listings_count | 0 | 1.00 | 2.063000e+01 | 6.887000e+01 | 1.00000e+00 | 1.000000e+00 | 1.000000e+00 | 4.000000e+00 | 4.870000e+02 | ▇▁▁▁▁ |
calculated_host_listings_count_entire_homes | 0 | 1.00 | 1.131000e+01 | 5.645000e+01 | 0.00000e+00 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 4.870000e+02 | ▇▁▁▁▁ |
calculated_host_listings_count_private_rooms | 0 | 1.00 | 9.200000e+00 | 4.009000e+01 | 0.00000e+00 | 0.000000e+00 | 0.000000e+00 | 2.000000e+00 | 3.450000e+02 | ▇▁▁▁▁ |
calculated_host_listings_count_shared_rooms | 0 | 1.00 | 5.000000e-02 | 5.900000e-01 | 0.00000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.500000e+01 | ▇▁▁▁▁ |
reviews_per_month | 9393 | 0.77 | 1.280000e+00 | 1.940000e+00 | 1.00000e-02 | 1.400000e-01 | 5.800000e-01 | 1.880000e+00 | 1.029800e+02 | ▇▁▁▁▁ |
Data 2
Introduction and data
Identify the source of the data.
The dataset comes from Yelp Dataset (https://www.yelp.com/dataset), which provides a subset of the businesses, reviews, and user data on Yelp.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
The Yelp dataset was originally collected by Yelp’s data team and is publicly available for research and educational purposes. The dataset includes information on businesses located in 10 metropolitan areas in four countries: the United States, Canada, the United Kingdom, and Germany.
Write a brief description of the observations.
The dataset is divided into several JSON files:
business.json contains business data including location data, attributes, and categories
review.json contains full review text data
user.json contains user data and all the metadata associated with users
checkin.json contains data on checkins on a business
tip.json contains data on tips written by a user on a business
In this proposal, we will focus on the data outlined in business.json. The business.json file contains location details (address, city, state, etc.), number of stars, as well as number of reviews among other things.
Research question
Are there differences in the way that customers review chain vs. independent restaurants, and do these differences vary depending on the type of cuisine or location?
In this research question, we aim to understand how customers view chain and independent restaurants. Customer perceptions can be analyzed through ratings and reviews. There are five variables: three independent variables (i.e., type of restaurant, type of cuisine, and location) and two dependent variables (i.e., ratings and reviews). Ratings are quantitative variables, while reviews, type of restaurant, type of cuisine, and location are categorical variables.
This research question is important because it can help restaurant owners and managers to better understand customer perceptions and preferences towards chain and independent restaurants. By examining whether differences in customer reviews vary based on cuisine type or location, insights can be gained into the factors that influence customer satisfaction and provide guidance on how to improve customer experiences.
Our hypothesis is that customers may perceive chain and independent restaurants differently, with chain restaurants being perceived as more consistent and reliable in terms of quality and service, while independent restaurants may be seen as more unique and offering more personalized experiences. The differences in customer perceptions may also vary depending on the type of cuisine or location, with certain cuisines or cities having a stronger preference for chain or independent restaurants.
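The dataset has no column that marks a business as a chain, so that label would have to be derived. The sketch below shows one assumed approach: keep businesses whose categories mention "Restaurants", count how often each name appears, and call a name a chain once it reaches a cutoff. The toy rows, the chain_threshold value, and the derived column names are all hypothetical; name, categories, and stars are columns from business.json.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical toy rows that mimic business.json columns
toy_businesses <- tribble(
  ~name,             ~categories,              ~stars,
  "Burger Stop",     "Restaurants, Burgers",   3.0,
  "Burger Stop",     "Burgers, Restaurants",   3.5,
  "Burger Stop",     "Restaurants, Fast Food", 2.5,
  "Luigi's Kitchen", "Italian, Restaurants",   4.5,
  "Quick Cuts",      "Hair Salons",            4.0
)

# Hypothetical cutoff: a name with at least this many locations is a chain
chain_threshold <- 3

toy_restaurants <- toy_businesses |>
  # Rows with missing categories are dropped by filter()
  filter(str_detect(categories, "Restaurants")) |>
  add_count(name, name = "n_locations") |>
  mutate(
    restaurant_type = if_else(n_locations >= chain_threshold, "chain", "independent")
  )

# Compare average star ratings between the two groups
toy_restaurants |>
  group_by(restaurant_type) |>
  summarise(mean_stars = mean(stars), n = n())
```

Matching on exact names is crude, since unrelated businesses can share a name and chains can vary their spelling, so the cutoff would need checking against the real data.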
Glimpse of data
# Consulted Professor Soltoff on how to rectangle dataset
yelp_raw <- read_lines(file = "data/yelp_data/yelp_academic_dataset_business.json")
yelp_list <- map(.x = yelp_raw, .f = fromJSON)
yelp_data <- tibble(yelp = yelp_list) |>
  unnest_wider(col = yelp)
# Preview some rows
head(yelp_data)
# A tibble: 6 × 14
business_id name address city state postal_code latitude longitude stars
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Pns2l4eNsfO8kk… Abby… 1616 C… Sant… CA 93101 34.4 -120. 5
2 mpf3x-BjTdTEA3… The … 87 Gra… Afft… MO 63123 38.6 -90.3 3
3 tUFrWirKiKi_TA… Targ… 5255 E… Tucs… AZ 85711 32.2 -111. 3.5
4 MTSW4McQd7CbVt… St H… 935 Ra… Phil… PA 19107 40.0 -75.2 4
5 mWMc6_wTdE0EUB… Perk… 101 Wa… Gree… PA 18054 40.3 -75.5 4.5
6 CF33F8-E6oudUQ… Soni… 615 S … Ashl… TN 37015 36.3 -87.1 2
# ℹ 5 more variables: review_count <int>, is_open <int>, attributes <list>,
# categories <chr>, hours <list>
# Skim through data
skim(yelp_data)
Name | yelp_data |
Number of rows | 150346 |
Number of columns | 14 |
_______________________ | |
Column type frequency: | |
character | 7 |
list | 2 |
numeric | 5 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
business_id | 0 | 1 | 22 | 22 | 0 | 150346 | 0 |
name | 0 | 1 | 2 | 64 | 0 | 114117 | 0 |
address | 0 | 1 | 0 | 110 | 5127 | 122844 | 0 |
city | 0 | 1 | 3 | 52 | 0 | 1416 | 0 |
state | 0 | 1 | 2 | 3 | 0 | 27 | 0 |
postal_code | 0 | 1 | 0 | 7 | 73 | 3362 | 0 |
categories | 103 | 1 | 4 | 503 | 0 | 83160 | 0 |
Variable type: list
skim_variable | n_missing | complete_rate | n_unique | min_length | max_length |
---|---|---|---|---|---|
attributes | 0 | 1 | 87662 | 0 | 33 |
hours | 0 | 1 | 49823 | 0 | 7 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
latitude | 0 | 1 | 36.67 | 5.87 | 27.56 | 32.19 | 38.78 | 39.95 | 53.68 | ▅▂▇▁▁ |
longitude | 0 | 1 | -89.36 | 14.92 | -120.10 | -90.36 | -86.12 | -75.42 | -73.20 | ▅▁▁▇▇ |
stars | 0 | 1 | 3.60 | 0.97 | 1.00 | 3.00 | 3.50 | 4.50 | 5.00 | ▁▃▂▇▆ |
review_count | 0 | 1 | 44.87 | 121.12 | 5.00 | 8.00 | 15.00 | 37.00 | 7568.00 | ▇▁▁▁▁ |
is_open | 0 | 1 | 0.80 | 0.40 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | ▂▁▁▁▇ |
Data 3
Introduction and data
Identify the source of the data.
The dataset was downloaded from Kaggle (https://www.kaggle.com/code/ahmetburabua/drive-to-survive/input?select=final.csv).
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
The dataset was created by Ahmet Buğra Buğa, a data analyst at Mathrics (based on the Kaggle account). The account did not say much about the data curation process other than the fact that “the dataset was collected from public places and combined.” We presume that the data was scraped from sources like Wikipedia and the Ergast Developer API (http://ergast.com/mrd/). On the Kaggle page, we can also access yearly race data from 1983-2021. The Kaggler combined the data into one “final.csv” file.
Write a brief description of the observations.
The observations included in the “final.csv” file contain all the pieces of information relevant to a particular race. The dataset tells you which circuit the race is on, whether the weather during the race is warm, cold, dry, wet, and/or cloudy, and what grid position a driver started the race in among other things.
Research question
What factors are most strongly associated with drivers’ success in Formula 1 racing, and how have these factors changed over time?
In this research question, we aim to understand the factors that contribute to success in Formula 1 racing, including the race circuit, weather, starting grid position, and drivers’ ages. Here, we equate drivers’ success with finishing on the podium (i.e., in first, second, or third place). There are five variables: four independent variables (i.e., the race circuit, weather, starting grid position, and drivers’ ages) and one dependent variable (i.e., finishing position). Starting grid positions, finishing positions, and drivers’ ages are quantitative variables, while race circuit and weather are categorical variables. Other quantitative variables we could include in our analysis when it comes to success include driver points and qualifying times.
Formula 1 racing is one of the most popular and competitive sports in the world, and understanding the factors that contribute to success in this field is crucial for teams, drivers, and fans. By identifying the most important factors associated with success in Formula 1 racing, teams can optimize their strategies and improve their chances of winning, while fans can gain a deeper appreciation for the skills and abilities required to excel in this sport. Moreover, studying how these factors have changed over time can provide insights into the evolution of Formula 1 racing and shed light on the impact of technological advancements, changes in regulations, and other factors on the sport.
Our hypothesis is that warm and dry weather conditions are more conducive to better driver performance compared to cold, cloudy, and/or wet weather. Dry roads provide better traction for tires, enabling drivers to control the car more effectively. On the other hand, wet weather can cause the tires to slip, potentially resulting in loss of control. In cold weather, the tires and mechanical components may not reach optimal operating temperatures, leading to reduced grip and responsiveness of the car.
Furthermore, we hypothesize that there is an optimal age for drivers in Formula 1, as being too young may lead to lack of experience and being too old may result in slower reaction time.
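To make the success measure concrete, the sketch below derives a podium indicator and compares podium shares by weather. It assumes that the podium column holds the finishing position, which its 1 to 27 range in the data suggests. The toy rows, the driver labels, the age bands, and the derived column names are hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical toy rows that mimic final.csv columns
toy_results <- tribble(
  ~driver,    ~weather_wet, ~driver_age, ~podium,
  "driver_a", FALSE,        24,          1,
  "driver_b", FALSE,        31,          2,
  "driver_c", FALSE,        38,          9,
  "driver_a", TRUE,         24,          5,
  "driver_b", TRUE,         31,          3,
  "driver_c", TRUE,         38,          12
)

toy_results <- toy_results |>
  mutate(
    # Success: finishing first, second, or third
    on_podium = podium <= 3,
    # Illustrative age bands for looking at a possible optimal age
    age_band  = cut(driver_age, breaks = c(16, 25, 30, 35, 45))
  )

# Share of podium finishes in dry vs. wet races
toy_results |>
  group_by(weather_wet) |>
  summarise(podium_share = mean(on_podium))
```

Grouping by age_band or by season in the same way would address the age hypothesis and the change over time.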
Glimpse of data
f1_data <- read_csv("data/f1_data/f1_data.csv")
# Preview some rows
head(f1_data)
# A tibble: 6 × 22
...1 season round circuit_id weather_warm weather_cold weather_dry
<dbl> <dbl> <dbl> <chr> <lgl> <lgl> <lgl>
1 14 1983 1 jacarepagua FALSE FALSE TRUE
2 5 1983 1 jacarepagua FALSE FALSE TRUE
3 3 1983 1 jacarepagua FALSE FALSE TRUE
4 0 1983 1 jacarepagua FALSE FALSE TRUE
5 6 1983 1 jacarepagua FALSE FALSE TRUE
6 8 1983 1 jacarepagua FALSE FALSE TRUE
# ℹ 15 more variables: weather_wet <lgl>, weather_cloudy <lgl>, driver <chr>,
# nationality <chr>, constructor <chr>, grid <dbl>, podium <dbl>,
# driver_points <dbl>, driver_wins <dbl>, driver_standings_pos <dbl>,
# constructor_points <dbl>, constructor_wins <dbl>,
# constructor_standings_pos <dbl>, qualifying_time <dbl>, driver_age <dbl>
# Skim through data
skim(f1_data)
Name | f1_data |
Number of rows | 14794 |
Number of columns | 22 |
_______________________ | |
Column type frequency: | |
character | 4 |
logical | 5 |
numeric | 13 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
circuit_id | 0 | 1 | 3 | 14 | 0 | 50 | 0 |
driver | 0 | 1 | 3 | 18 | 0 | 232 | 0 |
nationality | 0 | 1 | 4 | 13 | 0 | 34 | 0 |
constructor | 0 | 1 | 3 | 12 | 0 | 66 | 0 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
weather_warm | 0 | 1 | 0.39 | FAL: 9063, TRU: 5731 |
weather_cold | 0 | 1 | 0.02 | FAL: 14473, TRU: 321 |
weather_dry | 0 | 1 | 0.22 | FAL: 11525, TRU: 3269 |
weather_wet | 0 | 1 | 0.10 | FAL: 13306, TRU: 1488 |
weather_cloudy | 0 | 1 | 0.12 | FAL: 12999, TRU: 1795 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
...1 | 0 | 1 | 7465.31 | 4350.15 | 0 | 3698.25 | 7403.5 | 11232.75 | 15085.0 | ▇▇▇▇▇ |
season | 0 | 1 | 2001.59 | 11.24 | 1983 | 1992.00 | 2001.0 | 2012.00 | 2021.0 | ▇▇▆▇▇ |
round | 0 | 1 | 9.19 | 5.12 | 1 | 5.00 | 9.0 | 13.00 | 21.0 | ▇▆▆▆▁ |
grid | 0 | 1 | 11.76 | 6.70 | 1 | 6.00 | 12.0 | 17.00 | 27.0 | ▇▆▆▆▂ |
podium | 0 | 1 | 11.90 | 6.77 | 1 | 6.00 | 12.0 | 17.00 | 27.0 | ▇▆▆▆▂ |
driver_points | 0 | 1 | 19.94 | 42.08 | 0 | 0.00 | 3.0 | 19.00 | 387.0 | ▇▁▁▁▁ |
driver_wins | 0 | 1 | 0.36 | 1.18 | 0 | 0.00 | 0.0 | 0.00 | 13.0 | ▇▁▁▁▁ |
driver_standings_pos | 0 | 1 | 10.66 | 7.67 | 0 | 4.00 | 10.0 | 17.00 | 30.0 | ▇▅▅▃▁ |
constructor_points | 0 | 1 | 40.06 | 81.62 | 0 | 0.00 | 8.0 | 41.00 | 722.0 | ▇▁▁▁▁ |
constructor_wins | 0 | 1 | 0.74 | 1.95 | 0 | 0.00 | 0.0 | 0.00 | 18.0 | ▇▁▁▁▁ |
constructor_standings_pos | 0 | 1 | 5.86 | 3.83 | 0 | 3.00 | 6.0 | 9.00 | 20.0 | ▇▆▅▁▁ |
qualifying_time | 0 | 1 | 2.55 | 8.00 | -77 | 1.00 | 2.1 | 3.50 | 904.6 | ▇▁▁▁▁ |
driver_age | 0 | 1 | 28.59 | 4.73 | 17 | 25.00 | 28.0 | 32.00 | 43.0 | ▂▇▇▅▁ |