Project proposal

Author

Dream Team

library(tidyverse)

Dataset

A brief description of your dataset including its provenance, dimensions, etc. as well as the reason why you chose this dataset.

Our data consist of two datasets, one on water quality and the other on weather conditions in Sydney, Australia. Both were released through TidyTuesday (2025, week 20). The water quality dataset contains 123,530 observations and 10 variables. Each row represents a water sample collected at a specific swim site on a given date and time. Key variables include region, council, swim site name, geographic coordinates (latitude and longitude), enterococci concentration (CFU per 100 mL), water temperature, and conductivity. In our project, enterococci concentration serves as the primary indicator of microbial contamination and swimming safety. The weather dataset contains 12,538 daily observations and 6 variables: date, maximum and minimum air temperature, precipitation, and geographic coordinates (latitude and longitude). It provides the environmental context needed to link water quality measurements to weather conditions.

We are interested in these data because they support a clear narrative connecting weather conditions to water safety. Water quality determines how safe it is for people to swim, and environmental factors such as rainfall and temperature can influence bacterial contamination levels. Examining these relationships has real-world implications: the results can inform both officials and swimmers about best practices for water safety. The data are well suited to visualizing this narrative because they include a clear outcome variable (enterococci concentration) along with context such as geographic location and time.

Make sure to load the data and use inline code for some of this information.

tuesdata <- tidytuesdayR::tt_load(2025, week = 20)
---- Compiling #TidyTuesday Information for 2025-05-20 ----
--- There are 2 files available ---


── Downloading files ───────────────────────────────────────────────────────────

  1 of 2: "water_quality.csv"
Warning: The `file` argument of `vroom()` must use `I()` for literal data as of vroom
1.5.0.
  
  # Bad:
  vroom("X,Y\n1.5,2.3\n")
  
  # Good:
  vroom(I("X,Y\n1.5,2.3\n"))
ℹ The deprecated feature was likely used in the readr package.
  Please report the issue at <https://github.com/tidyverse/readr/issues>.
2 of 2: "weather.csv"
water_quality <- tuesdata$water_quality
weather <- tuesdata$weather

glimpse(water_quality)
Rows: 123,530
Columns: 10
$ region                <chr> "Western Sydney", "Sydney Harbour", "Sydney Harb…
$ council               <chr> "Hawkesbury City Council", "North Sydney Council…
$ swim_site             <chr> "Windsor Beach", "Hayes Street Beach", "Northbri…
$ date                  <date> 2025-04-28, 2025-04-28, 2025-04-28, 2025-04-28,…
$ time                  <time> 11:00:00, 11:40:00, 10:54:00, 09:28:00, 10:35:0…
$ enterococci_cfu_100ml <dbl> 620, 64, 160, 54, 720, 230, 120, 280, 60, 100, 1…
$ water_temperature_c   <dbl> 20, 21, 21, 21, 18, 21, 21, 21, 22, 22, 20, 20, …
$ conductivity_ms_cm    <dbl> 248, 45250, 48930, 52700, 64, 39140, 4845, 50600…
$ latitude              <dbl> -33.60448, -33.84172, -33.80604, -33.80073, -33.…
$ longitude             <dbl> 150.8170, 151.2194, 151.2228, 151.2748, 150.6979…
glimpse(weather)
Rows: 12,538
Columns: 6
$ date             <date> 1991-01-01, 1991-01-02, 1991-01-03, 1991-01-04, 1991…
$ max_temp_C       <dbl> 29.3, 27.5, 28.2, 30.8, 30.4, 25.5, 25.0, 22.0, 24.3,…
$ min_temp_C       <dbl> 22.1, 22.4, 21.1, 23.7, 19.5, 18.2, 19.4, 19.3, 20.8,…
$ precipitation_mm <dbl> 2.4, 0.0, 0.0, 0.0, 12.2, 0.0, 0.0, 5.2, 0.7, 0.4, 2.…
$ latitude         <dbl> -33.84886, -33.84886, -33.84886, -33.84886, -33.84886…
$ longitude        <dbl> 151.1955, 151.1955, 151.1955, 151.1955, 151.1955, 151…

Questions

Question 1

How does rainfall affect water quality in swimming areas over time?

Why is this question of interest? Swimmers, local governments, and health organizations all need to know how much rain it takes to change a beach’s status from “safe” to “unsafe” for recreational use. Heavy rainfall can carry bacteria and other pollutants from urban catchments into coastal waterways, so quantifying this association helps turn the vague “don’t swim after rain” advice into evidence-based recommendations with specific thresholds and time periods. Because we track bacteria levels (specifically enterococci), the analysis is also directly tied to public health risk and actual beach warnings.

Columns to be used for the analysis / Variables to be created

From the water quality dataset, we will use:

- region: to compare broader geographic patterns, like Sydney Harbour vs. Western Sydney
- council: to account for local governance differences
- swim_site: for site-level analysis
- date: key variable for merging with the rainfall data
- time: allows for investigating same-day timing effects
- enterococci_cfu_100ml: the primary outcome variable, as it is the bacterial contamination level
- latitude, longitude: for spatial mapping and clustering analysis

We plan to create several new variables to analyze the relationship between rainfall and water quality over time. First, we will construct a datetime variable by combining the date and time fields. We will also create a heavy_rain_indicator variable that flags days on which rainfall exceeds a chosen threshold, and an unsafe_indicator variable equal to 1 if the enterococci level exceeds recreational water quality guidelines and 0 otherwise. To account for temporal patterns, we will create season or month indicators and a weekday/weekend variable. Together, these variables will allow us to analyze both continuous contamination levels and the likelihood that water quality exceeds public health thresholds.
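The derived variables described above could be built roughly as follows. This is a minimal sketch on toy rows shaped like our data, not our final code. Both threshold values are hypothetical placeholders rather than official guideline values, the name is_weekend is ours, and the toy time column is stored as text for simplicity.

```r
library(dplyr)
library(lubridate)

# Assumed placeholder thresholds; replace with the guideline values we adopt.
unsafe_threshold <- 200     # CFU per 100 mL (hypothetical)
heavy_rain_threshold <- 10  # mm per day (hypothetical)

# Toy rows shaped like water_quality; `time` is stored as text here.
toy_water <- data.frame(
  swim_site = c("Windsor Beach", "Hayes Street Beach", "Windsor Beach"),
  date = as.Date(c("2025-04-28", "2025-04-28", "2025-05-03")),
  time = c("11:00:00", "11:40:00", "09:15:00"),
  enterococci_cfu_100ml = c(620, 64, 12)
)

water_derived <- toy_water %>%
  mutate(
    # Combine date and time into a single timestamp in Sydney local time
    datetime = ymd_hms(paste(date, time), tz = "Australia/Sydney"),
    unsafe_indicator = as.integer(enterococci_cfu_100ml > unsafe_threshold),
    month_num = month(date),
    # Southern Hemisphere seasons
    season = case_when(
      month_num %in% c(12, 1, 2) ~ "summer",
      month_num %in% 3:5 ~ "autumn",
      month_num %in% 6:8 ~ "winter",
      TRUE ~ "spring"
    ),
    # week_start = 1 makes Monday 1, so Saturday and Sunday are 6 and 7
    is_weekend = wday(date, week_start = 1) >= 6
  )

# Toy rows shaped like weather
toy_weather <- data.frame(
  date = as.Date(c("2025-04-28", "2025-05-03")),
  precipitation_mm = c(0, 12.2)
) %>%
  mutate(heavy_rain_indicator = as.integer(precipitation_mm > heavy_rain_threshold))
```

Keeping the thresholds as named constants at the top should make it easy to swap in the guideline values once we settle on them.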

How do we plan on doing the analysis? We plan to join the water quality and weather datasets by date (and by site where possible), then treat enterococci_cfu_100ml as the outcome and precipitation_mm as the main predictor. We will also include recent rainfall, such as rainfall on the same day, on the previous few days, and as short multi-day totals. First, we will explore the data with scatterplots of rainfall vs. enterococci, along with time-series plots that overlay rainfall bars and bacteria lines, with horizontal reference lines at the guideline thresholds for “unsafe” conditions. We will then fit simple models (e.g., linear or generalized additive models) to estimate how much an increment of rainfall changes expected contamination levels and the probability of exceeding safety guidelines over time.
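As a rough illustration of the join and the overlay plot, the sketch below uses made-up rows for a single site. The threshold is a hypothetical placeholder, and drawing both series on one shared y-axis is a simplification we would likely revisit (for example with separate panels), since the two series have different units.

```r
library(dplyr)
library(ggplot2)

unsafe_threshold <- 200  # hypothetical placeholder, CFU per 100 mL

# Toy daily rainfall and a handful of water samples (made-up values)
toy_weather <- data.frame(
  date = as.Date("2025-04-01") + 0:18,
  precipitation_mm = c(0, 0, 0, 0, 5, 30, 12, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0)
)
toy_water <- data.frame(
  swim_site = "Windsor Beach",
  date = as.Date("2025-04-01") + c(0, 6, 12, 18),
  enterococci_cfu_100ml = c(40, 620, 90, 30)
)

# Keep every water sample and attach that day's rainfall
joined <- left_join(toy_water, toy_weather, by = "date")

# Rainfall as bars, enterococci as a line, dashed line at the assumed threshold
p <- ggplot() +
  geom_col(data = toy_weather, aes(x = date, y = precipitation_mm),
           fill = "steelblue", alpha = 0.5) +
  geom_line(data = joined, aes(x = date, y = enterococci_cfu_100ml)) +
  geom_point(data = joined, aes(x = date, y = enterococci_cfu_100ml)) +
  geom_hline(yintercept = unsafe_threshold, linetype = "dashed") +
  labs(x = "Date", y = "Rainfall (mm, bars) and enterococci (CFU per 100 mL, line)")
```

A left join from the water samples keeps one row per sample, which matches the fact that sampling is much sparser than the daily weather record.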

Question 2

Are some regions more sensitive to rainfall-driven contamination?

Why is this question of interest? We think this question is important for public safety because rainfall can trigger increases in enterococci levels that make swimming sites unsafe shortly after rain events. Identifying regions that are more sensitive to rainfall-driven contamination allows for more targeted monitoring, especially during heavy rainfall seasons, and could help prevent harmful exposure. Additionally, high enterococci levels may point to an underlying issue in a region, which makes this analysis a useful first step toward identifying problematic regions.

Columns to be used for the analysis / Variables to be created

From the water quality dataset:

- region: main grouping variable for comparing sensitivity
- swim_site: site-level analysis
- date: to merge with rainfall data
- enterococci_cfu_100ml: outcome variable
- water_temperature_c: potential control variable
- latitude, longitude: for spatial matching and mapping

From the weather dataset:

- date: for joining datasets
- precipitation_mm: our main predictor variable
- max_temp_C, min_temp_C: additional weather controls
- latitude, longitude: to match each swim site with the nearest weather station

To evaluate whether certain regions are more sensitive to rainfall-driven contamination, we plan to create various variables:

- log_enterococci: a log transformation of enterococci levels to reduce skewness
- unsafe_indicator: a binary variable equal to 1 if bacteria levels exceed the safe threshold, and 0 otherwise
- rainfall_lag1, rainfall_lag2: rainfall in the previous 1-2 days
- rainfall_3day_total, rainfall_5day_total: cumulative rainfall measures
- heavy_rain_indicator: indicator for rainfall above a certain threshold
- region_average_enterococci: baseline contamination level per region
- rainfall_region_interaction: an interaction term between rainfall and region
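The lagged and cumulative rainfall variables, the log transformation, and the regional baseline listed above might be computed along these lines. The sketch makes several assumptions that we have not yet fixed: the weather data have exactly one row per consecutive day, the multi-day totals include the current day, and the log transformation is log(1 + x) so that zero counts stay defined.

```r
library(dplyr)

# Toy daily rainfall; assumes one row per consecutive day
toy_weather <- data.frame(
  date = as.Date("1991-01-01") + 0:5,
  precipitation_mm = c(2.4, 0, 0, 0, 12.2, 0)
)

weather_lagged <- toy_weather %>%
  arrange(date) %>%
  mutate(
    rainfall_lag1 = dplyr::lag(precipitation_mm, 1),
    rainfall_lag2 = dplyr::lag(precipitation_mm, 2),
    # Totals include the current day (an assumption); the first rows are NA
    rainfall_3day_total = precipitation_mm + rainfall_lag1 + rainfall_lag2,
    rainfall_5day_total = rainfall_3day_total +
      dplyr::lag(precipitation_mm, 3) + dplyr::lag(precipitation_mm, 4)
  )

# Toy water samples (made-up values)
toy_water <- data.frame(
  region = c("Western Sydney", "Sydney Harbour", "Sydney Harbour"),
  enterococci_cfu_100ml = c(620, 64, 0)
) %>%
  # log1p(x) = log(1 + x), so a count of zero maps to zero instead of -Inf
  mutate(log_enterococci = log1p(enterococci_cfu_100ml)) %>%
  group_by(region) %>%
  mutate(region_average_enterococci = mean(enterococci_cfu_100ml)) %>%
  ungroup()
```

If the real weather data turn out to have missing days, the row-based lags would need to be replaced with a date-based calculation.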

Creating these variables will allow us to compare both overall contamination levels as well as how strongly contamination increases in response to rainfall across different regions. The interaction term (rainfall_region_interaction) is particularly important because it will directly measure whether or not some regions experience larger increases in enterococci levels per unit of rainfall, which is our definition of sensitivity.
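One way the interaction could be estimated is sketched below on simulated data, with all numbers invented for illustration. Rather than storing rainfall_region_interaction as a column, the sketch lets the model formula generate the interaction terms, and it compares the model against one with a common rainfall slope using an F-test; both are choices we are assuming here, not settled decisions.

```r
set.seed(1)
n <- 100

# Simulated data: Western Sydney is given a steeper rainfall slope on purpose
sim <- data.frame(
  region = rep(c("Sydney Harbour", "Western Sydney"), each = n),
  precipitation_mm = rexp(2 * n, rate = 0.1)
)
extra_slope <- ifelse(sim$region == "Western Sydney", 0.05, 0)
sim$log_enterococci <- 3 + (0.02 + extra_slope) * sim$precipitation_mm +
  rnorm(2 * n, sd = 0.3)

# Common slope for all regions vs. region-specific slopes
fit_common <- lm(log_enterococci ~ precipitation_mm + region, data = sim)
fit_interaction <- lm(log_enterococci ~ precipitation_mm * region, data = sim)

# Extra change in log enterococci per mm of rain, relative to the reference region
interaction_coef <- coef(fit_interaction)["precipitation_mm:regionWestern Sydney"]

# F-test of whether region-specific slopes improve on a common slope
f_test <- anova(fit_common, fit_interaction)
```

Each interaction coefficient is read relative to the reference region (here the first level alphabetically), so the choice of reference affects interpretation but not the overall F-test.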

How do we plan on doing the analysis? We plan to join the water quality and weather datasets by date and coordinates (longitude and latitude), so that each swimming site can be matched with the local weather conditions on the same date. This allows us to link precipitation levels from the weather data with enterococci levels from the water quality data and examine how rainfall is related to bacterial contamination at each site.
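Matching sites to weather by coordinates could be done with a nearest-station lookup, sketched below using great-circle (haversine) distance. The helper haversine_km, the station identifiers, and the second station's coordinates are all hypothetical; we have not yet checked how many distinct station locations the weather data actually contain.

```r
library(dplyr)

# Great-circle distance in km between two points given in decimal degrees
haversine_km <- function(lat1, lon1, lat2, lon2) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * 6371 * asin(pmin(1, sqrt(a)))  # 6371 km: mean Earth radius
}

sites <- data.frame(
  swim_site = c("Windsor Beach", "Hayes Street Beach"),
  latitude = c(-33.60448, -33.84172),
  longitude = c(150.8170, 151.2194)
)

# station_A uses the coordinates seen in the weather data;
# station_B is a made-up location for illustration only
stations <- data.frame(
  station_id = c("station_A", "station_B"),
  station_lat = c(-33.84886, -33.60),
  station_lon = c(151.1955, 150.78)
)

# Pair every site with every station, then keep the closest station per site
nearest_station <- merge(sites, stations, by = NULL) %>%
  mutate(distance_km = haversine_km(latitude, longitude, station_lat, station_lon)) %>%
  group_by(swim_site) %>%
  slice_min(distance_km, n = 1, with_ties = FALSE) %>%
  ungroup()
```

With the nearest station attached to each site, the weather rows could then be joined on station and date. If the weather data turn out to describe a single location, this step reduces to a plain join by date.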

We will create scatterplots with precipitation levels on the x-axis and enterococci levels on the y-axis, where enterococci levels indicate the concentration of bacteria present in the water. This allows us to see the general trend between rainfall and enterococci levels. To compare differences across locations, we will facet the visualizations by region, which allows us to easily identify differences in patterns, such as whether some regions show steeper increases in enterococci levels at higher precipitation levels, indicating greater sensitivity to rainfall-driven contamination.
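A faceted scatterplot of this kind might look like the sketch below, which uses simulated values in place of the joined data. The log-scaled y-axis and the per-region linear trend lines are our own assumptions about how to make the slopes comparable; on the real data, zero counts would need handling before a log scale is applied.

```r
library(ggplot2)

set.seed(2)

# Simulated data: one region is given a steeper response to rainfall
toy <- data.frame(
  region = rep(c("Sydney Harbour", "Western Sydney"), each = 50),
  precipitation_mm = rexp(100, rate = 0.1)
)
toy$enterococci_cfu_100ml <- exp(
  3 + ifelse(toy$region == "Western Sydney", 0.07, 0.02) * toy$precipitation_mm +
    rnorm(100, sd = 0.4)
)

p <- ggplot(toy, aes(x = precipitation_mm, y = enterococci_cfu_100ml)) +
  geom_point(alpha = 0.5) +
  # One straight trend line per facet, fitted on the log10 scale
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  scale_y_log10() +
  facet_wrap(~ region) +
  labs(x = "Precipitation (mm)", y = "Enterococci (CFU per 100 mL, log scale)")
```

Because facet_wrap() shares axes across panels by default, a steeper line in one panel can be read directly as a stronger response to rainfall.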

Analysis plan

A plan for answering each of the questions including the variables involved, variables to be created (if any), external data to be merged in (if any).

To answer our two research questions, we will first merge the water quality and weather datasets by date and geographic proximity, using latitude and longitude. The main outcome variable in our analysis is enterococci_cfu_100ml, which measures bacterial contamination levels at each swim site. Our main explanatory variable is precipitation_mm from the weather dataset. After merging the two datasets, we will construct derived variables: a combined datetime variable, lagged and cumulative rainfall measures (rainfall_lag1, rainfall_lag2, rainfall_3day_total, and rainfall_5day_total), a heavy_rain_indicator, a log_enterococci transformation to address skewness, and an unsafe_indicator variable that captures whether contamination exceeds recreational water quality thresholds.

For our first question, we will analyze how rainfall affects water quality over time by visualizing the relationship between precipitation_mm and enterococci_cfu_100ml. We plan to use scatterplots and time series plots that overlay rainfall and bacterial levels, including horizontal reference lines to indicate unsafe thresholds. To quantify the relationship, we will estimate regression models with log_enterococci as the outcome and rainfall measures as predictors. These models will help us estimate how increases in rainfall affect expected contamination levels and the probability that water becomes unsafe.
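The two kinds of estimates mentioned above could come from two separate models, sketched here on simulated data. A linear model on log_enterococci gives the change in expected contamination, while the probability of unsafe water is modeled with a logistic regression on unsafe_indicator. The logistic model is our assumption about how to obtain that probability, and the threshold is a hypothetical placeholder.

```r
set.seed(3)
n <- 300
unsafe_threshold <- 200  # hypothetical placeholder, CFU per 100 mL

# Simulated data in which contamination rises with rainfall
sim <- data.frame(precipitation_mm = rexp(n, rate = 0.1))
sim$enterococci_cfu_100ml <- exp(3 + 0.1 * sim$precipitation_mm + rnorm(n, sd = 1))
sim$log_enterococci <- log1p(sim$enterococci_cfu_100ml)
sim$unsafe_indicator <- as.integer(sim$enterococci_cfu_100ml > unsafe_threshold)

# Expected contamination level as a function of rainfall
fit_level <- lm(log_enterococci ~ precipitation_mm, data = sim)

# Probability of exceeding the threshold as a function of rainfall
fit_unsafe <- glm(unsafe_indicator ~ precipitation_mm,
                  family = binomial, data = sim)

# Predicted probability of unsafe water at a few rainfall amounts
new_days <- data.frame(precipitation_mm = c(0, 20, 40))
p_unsafe <- predict(fit_unsafe, newdata = new_days, type = "response")
```

The same structure extends to the lagged and cumulative rainfall measures by adding them as predictors in either model.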

For our second question, we will assess whether certain regions are more sensitive to rainfall-driven contamination. We plan to compare patterns across region and swim_site by faceting visualizations and calculating region-level averages. In our regression model, we will include interaction terms between precipitation_mm and region to test whether the effect of rainfall differs across geographic regions. A statistically significant interaction would indicate that some regions experience larger increases in contamination per unit of rainfall. This would suggest greater environmental sensitivity and a potential need for more targeted water quality monitoring in those regions.