Project proposal

Author

Team Trusting Kiwi

library(tidyverse)

weekly_gas_prices <- read_csv("data/weekly_gas_prices.csv")
weekly_gas_prices
# A tibble: 22,360 × 5
   date       fuel     grade   formulation  price
   <date>     <chr>    <chr>   <chr>        <dbl>
 1 1990-08-20 gasoline regular all           1.19
 2 1990-08-20 gasoline regular conventional  1.19
 3 1990-08-27 gasoline regular all           1.25
 4 1990-08-27 gasoline regular conventional  1.25
 5 1990-09-03 gasoline regular all           1.24
 6 1990-09-03 gasoline regular conventional  1.24
 7 1990-09-10 gasoline regular all           1.25
 8 1990-09-10 gasoline regular conventional  1.25
 9 1990-09-17 gasoline regular all           1.27
10 1990-09-17 gasoline regular conventional  1.27
# ℹ 22,350 more rows
head(weekly_gas_prices)
# A tibble: 6 × 5
  date       fuel     grade   formulation  price
  <date>     <chr>    <chr>   <chr>        <dbl>
1 1990-08-20 gasoline regular all           1.19
2 1990-08-20 gasoline regular conventional  1.19
3 1990-08-27 gasoline regular all           1.25
4 1990-08-27 gasoline regular conventional  1.25
5 1990-09-03 gasoline regular all           1.24
6 1990-09-03 gasoline regular conventional  1.24
str(weekly_gas_prices)
spc_tbl_ [22,360 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ date       : Date[1:22360], format: "1990-08-20" "1990-08-20" ...
 $ fuel       : chr [1:22360] "gasoline" "gasoline" "gasoline" "gasoline" ...
 $ grade      : chr [1:22360] "regular" "regular" "regular" "regular" ...
 $ formulation: chr [1:22360] "all" "conventional" "all" "conventional" ...
 $ price      : num [1:22360] 1.19 1.19 1.25 1.25 1.24 ...
 - attr(*, "spec")=
  .. cols(
  ..   date = col_date(format = ""),
  ..   fuel = col_character(),
  ..   grade = col_character(),
  ..   formulation = col_character(),
  ..   price = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

Dataset

We decided to use the Weekly U.S. Gas Prices dataset from TidyTuesday (2025-07-01), originally collected by the U.S. Energy Information Administration (EIA). The dataset contains 22360 observations and 5 variables, reporting national average weekly retail fuel prices (USD per gallon) from 1990 onward. Each row represents a specific fuel configuration in a given week, with variables including date, fuel (gasoline or diesel), grade, formulation, and price. formulation refers to the type of gasoline being sold. There are two possible values for this variable: conventional (the regular, unmodified gasoline used in regions where air quality regulations are not as strict) and reformulated (gasoline that has been modified to reduce emissions and is used in areas with stricter air quality standards, especially in high-pollution regions).

We chose this dataset because it combines a clear numerical outcome (price) with multiple categorical dimensions (fuel, grade, and formulation), making it well suited for time-series analysis and grouped visualizations. Gas prices are also socially relevant and intuitive, allowing us to explore how prices change during major economic events and how different gasoline grades and formulations compare over time. In particular, we are interested in understanding how gasoline prices fluctuate over time, with a focus on how major global events or political shifts (e.g., recessions, geopolitical tensions, or policy changes) influence these changes. This will help contextualize gas price trends in relation to broader economic, environmental, and political factors.

We are particularly interested in examining the diesel premium (the difference between diesel and all-grades gasoline prices) over time, as it provides insights into the unique market dynamics between diesel and gasoline. Diesel pricing is influenced by factors such as refining costs, regulatory changes, and demand from the transportation sector, making it an important metric to track. By focusing on the diesel premium, we aim to better understand how broader economic trends and policy impacts shape fuel pricing, particularly in industries reliant on diesel for transportation and logistics.

Questions

Q1: How do weekly U.S. retail fuel prices vary by season across fuel type (gasoline vs. diesel), gasoline grade (regular/midgrade/premium), and gasoline formulation (conventional vs. reformulated), and are the seasonal patterns similar or meaningfully different across these categories over time?

Q2: How does the size of the diesel premium (diesel price minus all-grades gasoline price) change over time, and how is that premium related to periods of higher vs. lower price volatility for diesel compared with gasoline?

Analysis plan

Q1: The variables involved include the date for the reported fuel price, fuel, grade, and formulation. The variables we will create include:

  • season: A categorical variable that classifies each week into one of four seasons (Winter, Spring, Summer, Fall), created from the date variable using the month (e.g., December, January, February = Winter)

  • month: A numerical or factor variable that represents the month extracted from the date variable

  • year: A variable representing the year extracted from the date variable.

No external data will be merged in. We will use these new variables to analyze the seasonal variation of U.S. retail fuel prices by fuel type, gasoline grade, and gasoline formulation over time, and compare seasonal patterns in fuel prices. Specifically, we will focus on how different fuel types (gasoline vs. diesel), grades (regular, midgrade, premium), and formulations (conventional vs. reformulated) behave across different seasons.

Analysis Steps:

  1. We will average the seasonal gas prices across all years in the dataset for each combination of fuel type, grade, and formulation. For example, we will calculate the average price for Winter across all years, rather than focusing on a specific year or span of years. This will allow us to analyze how seasonal price patterns emerge over time.

  2. Additionally, to address the broad time span of the dataset, we will perform the analysis on a smaller time chunk (e.g., focusing on the past 10 years or comparing pre-2008 and post-2008 data). This will help capture any recent changes in fuel price seasonality, as well as trends that might have emerged due to recent economic shifts, policy changes, or market fluctuations.

  3. The analysis will include a seasonal price trend visualization: A line or bar plot showing the average price by season for each combination of fuel, grade, and formulation. This will allow us to observe how fuel prices fluctuate over the year and whether any seasonal trends emerge.

  4. In addition to the seasonal trends, we will also investigate the correlation between spikes/drops in fuel prices and major world events (e.g., economic crises, wars, or political shifts), using external sources for context. This will help us understand how broader global factors influence the seasonal price patterns and if fuel prices follow specific trends during particular world events.

Q2: The variables involved include the date, fuel, and price. The variables we will create include:

  • diesel_all: The price of all-grade diesel.

  • gasoline_all_all: The price of all-grade diesel.

  • diesel_premium: The difference between diesel_all and gasoline_all_all.

  • gas_weekly_change: The weekly change in gasoline prices, respectively, calculated as the difference between the current week’s price and the previous week’s price.

  • diesel_weekly_change: The weekly change in diesel prices, respectively, calculated as the difference between the current week’s price and the previous week’s price.

  • gas_pct_change: The percent change in gasoline prices.

  • diesel_pct_change: The percent change in diesel prices.

  • gas_vol_12w: Rolling standard deviation (volatility) of gasoline prices over a 12-week window.

  • diesel_vol_12w: Rolling standard deviation (volatility) of diesel prices over a 12-week window.

  • vol_regime: A categorical variable categorizing volatility into three regimes: low, medium, and high, based on the rolling volatility of both gasoline and diesel prices.

Volatility Categorization: The classification of volatility into low, medium, and high will be based on the rolling standard deviation values of weekly price changes (gas_vol_12w and diesel_vol_12w). We will categorize volatility into three levels using the following wuantile method:

  • Low volatility: The bottom third of volatility values.

  • Medium volatility: The middle third of volatility values.

  • High volatility: The top third of volatility values.

Analysis Steps: We will use these new variables to analyze the change in the diesel premium over time by calculating diesel_premium for each week. We will also examine price volatility for both gasoline and diesel using rolling windows to compute weekly changes and rolling standard deviations.

The relationship between the diesel premium and volatility will be visualized to show how the premium behaves during high and low volatility periods.

  1. A layered ggplot graph that illustrates diesel_premium over time, allowing us to track its fluctuations and identify any patterns.

  2. A comparison of gas_vol_12w vs. diesel_vol_12w to compare price volatility for gasoline and diesel. This will provide insight into how volatility behaves for each fuel type.

  3. A scatter plot of diesel_premium against volatility measures (gas_vol_12w and diesel_vol_12w) to explore if there is a correlation between volatility and the size of the diesel premium.

This approach will provide insights into the evolution of the diesel premium and its relationship with periods of price volatility for both diesel and gasoline.