Project proposal

Author

Proud-Chipmunk (Sarah Lee, Shane Lee, Sophia Escalante)

library(tidyverse)

Dataset

We are combining two datasets from the same source:

Data obtained from TidyTuesday (2025-09-02)
Source: https://github.com/rfordatascience/tidytuesday

We do not currently plan to use any datasets other than the two described above.

Australian Frogs Dataset

This dataset contains records from the sixth annual release of data from the FrogID initiative, a citizen science project in Australia that collects frog call recordings through a mobile app. Volunteers record frog calls, which are then identified by museum experts and used to support research in frog ecology, taxonomy, and conservation. FrogID data has contributed to more than 30 scientific publications.

Australia is home to approximately 257 native frog species, many of which are found nowhere else in the world. However, nearly one in five species is threatened with extinction due to pressures such as climate change, urbanization, disease, and invasive species. This dataset helps researchers monitor frog populations and better understand environmental threats affecting amphibian biodiversity.

Loading In the Dataset:

## Loading in datasets:
frog_names_df <- read_csv("data/frog_names.csv")

Rows: 294 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): subfamily, tribe, scientificName, commonName, secondary_commonNames

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

frogID_df <- read_csv("data/frogID_data.csv")

Rows: 136621 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): scientificName, timezone, stateProvince
dbl  (6): occurrenceID, eventID, decimalLatitude, decimalLongitude, coordina...
date (1): eventDate
time (1): eventTime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## Dataset dimensions:
frog_names_dim <- dim(frog_names_df)
frogID_dim <- dim(frogID_df)

## Join datasets together: 
frogs_df <- frog_names_df |> 
    inner_join(frogID_df, by = "scientificName", relationship = "many-to-many")

Technical Dataset Description

We join the two datasets on scientificName into a single data frame, frogs_df. Both come from the same source:

Data obtained from TidyTuesday (2025-09-02)
Source: https://github.com/rfordatascience/tidytuesday

The dimensions of these datasets are the following:

frog_names_df: 294 rows and 5 columns
frogID_df: 136621 rows and 11 columns
frogs_df: 127467 rows and 15 columns
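frogs_df has fewer rows than frogID_df, which suggests that some recordings have a scientificName with no match in frog_names_df and are dropped by the inner join. A minimal sketch of how this could be checked with an anti-join follows. The toy tables and species names are made-up placeholders, not values from the real data.

```r
library(tidyverse)

# Toy stand-ins for frog_names_df and frogID_df (all values are placeholders)
names_toy <- tibble(
  scientificName = c("Genus alpha", "Genus beta"),
  subfamily      = c("Subfamily1", "Subfamily2")
)
records_toy <- tibble(
  occurrenceID   = 1:4,
  scientificName = c("Genus alpha", "Genus beta", "Genus beta", "Genus gamma")
)

# Records whose scientificName has no match in the names table;
# these are the rows an inner join would drop
unmatched <- records_toy |>
  anti_join(names_toy, by = "scientificName") |>
  count(scientificName, name = "n_records")

unmatched
```

Running the same pattern on the real tables (frogID_df against frog_names_df) would list which names fail to match and how many recordings each one accounts for.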

Why This Dataset:

Australia is famous for its unique and diverse wildlife, and its frogs are a particularly fascinating, yet vulnerable, part of that ecosystem. We chose this dataset because it offers a close look at how technology bridges the gap between public engagement and scientific research. By exploring the connection between app users and real-world conservation, we aim to create visualizations that highlight both the beauty of these species and the environments they inhabit.

Questions

How does the number of frog call recordings vary by season and state for the 10 most frequently recorded frog species in 2023?
How does the relative abundance of the 10 most frequently recorded frog species vary across Australian states in 2023?

Analysis plan

We are not planning to merge any external data at this moment.

For Question 1: We aim to create a series of box plots showing species richness across states and seasons, potentially using faceting to improve clarity. We may also construct a heat map of mean species richness by state and season to visualize overall patterns.
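As a rough sketch of the heat map idea, the code below counts distinct species per state and season and draws the result with geom_tile(). It assumes a season column already exists, and the toy species, states, and seasons are placeholders. It shows total richness per cell; computing a mean would need an extra grouping unit (for example year or month), which is not decided here.

```r
library(tidyverse)

# Toy records; species, states, and seasons are placeholders
toy <- tibble(
  scientificName = c("sp_a", "sp_b", "sp_a", "sp_c", "sp_a", "sp_a"),
  stateProvince  = c("NSW", "NSW", "NSW", "QLD", "QLD", "QLD"),
  season         = c("Summer", "Summer", "Winter", "Summer", "Summer", "Winter")
)

# Species richness = number of distinct species per state and season
richness <- toy |>
  group_by(stateProvince, season) |>
  summarise(richness = n_distinct(scientificName), .groups = "drop")

# One tile per state-season cell, shaded by richness
heat_map <- ggplot(richness, aes(x = season, y = stateProvince, fill = richness)) +
  geom_tile() +
  labs(x = "Season", y = "State", fill = "Species richness")

heat_map
```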

For Question 2: To complete this analysis, we aim to create circular (polar) plots to visualize the distribution of calling times across frog subfamilies and geographic regions, as calling time represents circular data over a 24-hour cycle. Additionally, we will generate density plots to compare the distribution of calling times among subfamilies and regions, allowing us to identify temporal patterns and differences in peak calling activity.
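One possible way to draw the circular plot is to count recordings per hour and wrap a bar chart around a 24-hour dial with coord_polar(). The sketch below assumes an integer hour column (0 to 23) has already been derived; the subfamily labels and counts are placeholders.

```r
library(tidyverse)

# Toy calling hours for two placeholder subfamilies
toy_calls <- tibble(
  subfamily = c("A", "A", "A", "B", "B", "B"),
  hour      = c(20, 21, 23, 0, 1, 5)
)

# Number of recordings per subfamily and hour of day
hourly <- toy_calls |>
  count(subfamily, hour)

# Bars on a 24-hour dial, one panel per subfamily
polar_plot <- ggplot(hourly, aes(x = hour, y = n, fill = subfamily)) +
  geom_col(width = 1) +
  scale_x_continuous(limits = c(-0.5, 23.5), breaks = seq(0, 21, by = 3)) +
  coord_polar() +
  facet_wrap(~ subfamily) +
  labs(x = "Hour of day", y = "Recordings")

polar_plot
```

Fixing the x limits to the full 24 hours keeps the dial complete even when some hours have no recordings. An ordinary density plot treats hour as a straight line, so activity that spans midnight may appear split between the two ends of the axis.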

Question 1:

Outcome variable:

Species richness (number of distinct scientificName values)

Explanatory variables:

stateProvince
season (from eventDate)

Grouping variables:

stateProvince
season

New variables to create:

month (from eventDate)
season (Winter, Spring, Summer, Fall)
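A minimal sketch of how month and season could be derived from eventDate is shown below. It assumes Southern Hemisphere meteorological seasons (December to February as Summer, and so on), which is our assumption rather than something defined in the dataset; the dates are toy values.

```r
library(tidyverse)
library(lubridate)

# Toy dates standing in for the eventDate column
toy_dates <- tibble(
  eventDate = as.Date(c("2023-01-15", "2023-04-10", "2023-07-01",
                        "2023-10-31", "2023-12-25"))
)

seasons <- toy_dates |>
  mutate(
    month = month(eventDate),
    # Assumed Southern Hemisphere meteorological seasons
    season = case_when(
      month %in% c(12, 1, 2) ~ "Summer",
      month %in% 3:5         ~ "Fall",
      month %in% 6:8         ~ "Winter",
      month %in% 9:11        ~ "Spring"
    ),
    # Ordered levels so plots show seasons in calendar order
    season = factor(season, levels = c("Summer", "Fall", "Winter", "Spring"))
  )

seasons
```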

Question 2:

Outcome variable:

Hour of day (from eventTime)

Explanatory variables:

subfamily
stateProvince

Grouping variables:

subfamily
stateProvince

New variables to create:

hour (from eventTime)
time_of_day category (dawn, dusk, day, night)
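The two new variables might be created as sketched below. eventTime is read as a time column, which stores seconds since midnight, so integer division by 3600 gives the hour. The time_of_day cutoffs are fixed clock hours chosen for illustration only; they are an assumption and do not reflect actual sunrise or sunset times, which vary by location and season.

```r
library(tidyverse)

# Toy times standing in for the eventTime column
toy_times <- tibble(
  eventTime = hms::parse_hms(c("05:30:00", "12:00:00", "18:45:00",
                               "23:10:00", "00:05:00"))
)

times <- toy_times |>
  mutate(
    # Seconds since midnight -> whole hour (0 to 23)
    hour = as.numeric(eventTime) %/% 3600,
    # Assumed cutoffs: dawn 5-7, day 7-17, dusk 17-19, night otherwise
    time_of_day = case_when(
      hour >= 5  & hour < 7  ~ "dawn",
      hour >= 7  & hour < 17 ~ "day",
      hour >= 17 & hour < 19 ~ "dusk",
      TRUE                   ~ "night"
    )
  )

times
```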