Project proposal
library(tidyverse)
Dataset Description
For this project, we use the Billboard Hot 100 Number Ones dataset from the TidyTuesday project (2025-08-26). The dataset was compiled by Chris Dalla Riva for his book Uncharted Territory: What Numbers Tell Us about the Biggest Hit Songs and Ourselves and documents every song that reached #1 on the Billboard Hot 100 between August 4, 1958 and January 11, 2025.
The data are loaded as follows:
billboard <- read_csv("data/billboard.csv")

billboard |>
  select(
    song,
    artist,
    date,
    cdr_genre,
    bpm,
    energy,
    danceability,
    happiness,
    artist_male
  ) |>
  head(5)

# A tibble: 5 × 9
song artist date cdr_genre bpm energy danceability happiness
<chr> <chr> <dttm> <chr> <dbl> <dbl> <dbl> <dbl>
1 Poor… Ricky… 1958-08-04 00:00:00 Pop;Rock 155 33 54 80
2 Nel … Domen… 1958-08-18 00:00:00 Pop 130 6 55 48
3 Litt… The E… 1958-08-25 00:00:00 Rock 73 40 41 70
4 It's… Tommy… 1958-09-29 00:00:00 Pop 71 15 33 61
5 It's… Conwa… 1958-11-10 00:00:00 Pop 127 43 44 36
# ℹ 1 more variable: artist_male <dbl>
The dataset contains 1177 rows and 105 columns. Each row represents a single #1 hit and includes the following types of variables:
- Numeric variables: bpm, energy, danceability, happiness, loudness_d_b, acousticness, length_sec, weeks_at_number_one, front_person_age, overall_rating
- Categorical variables: cdr_genre, cdr_style, simplified_key, artist_structure, artist_male, artist_white, artist_black
- Temporal variable: date (date the song first reached #1)
- Binary instrumentation variables: guitar_based, piano_keyboard_based, vocally_based, bass_based, orchestral_strings, horns_winds, and many more
The dataset also comes with a supplementary file, topics.csv, which lists 97 distinct lyrical topic categories used in the lyrical_topic column.
Why we chose this dataset
We chose this dataset because it offers a uniquely rich window into how popular music in the United States has evolved over nearly seven decades. The combination of audio features from Spotify (energy, danceability, BPM), hand-coded instrumentation and genre labels, and artist demographic information allows us to ask questions that span both the sonic and social dimensions of hit music. With 1177 #1 hits and over 100 variables, the dataset supports a wide range of visual analyses without needing external data.
Questions
1. How have the sonic characteristics of #1 hits changed from 1958 to 2025, and do these trends differ by genre?
This question explores the long-term evolution of what a chart-topping song sounds like. As music production technology, listener preferences, and industry practices have shifted over decades, we expect measurable changes in audio features like energy, danceability, tempo, and acousticness. Comparing these trends across genres will reveal whether all genres have followed the same trajectory or diverged.
This question involves:
- Numeric variables: energy, danceability, bpm, acousticness
- Categorical variable: cdr_genre
- Temporal variable: date
- Variables to create: decade (derived from date, grouping years into decades for cleaner visualization)
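One way the decade variable could be derived is sketched below. The toy tibble and the helper column decade_start are hypothetical and used only for illustration; the sketch assumes date is a date-time column, as in the printed sample above.

```r
library(dplyr)
library(lubridate)

# Toy stand-in for the real data: three illustrative dates, not dataset rows.
toy <- tibble(
  date = as.POSIXct(c("1958-08-04", "1969-12-27", "2025-01-11"), tz = "UTC")
)

toy <- toy |>
  mutate(
    # Floor the year to the start of its decade, e.g. 1969 -> 1960.
    decade_start = (year(date) %/% 10) * 10,
    # Ordered factor so decades sort chronologically in plots.
    decade = factor(
      paste0(decade_start, "s"),
      levels = paste0(sort(unique(decade_start)), "s"),
      ordered = TRUE
    )
  )

print(toy)
```

Because the data begin in August 1958, this binning places the earliest hits in a short "1950s" group that covers less than two years of charts.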
2. How has the gender composition of artists, songwriters, and producers behind #1 hits changed over time?
This question investigates representation in the music industry by examining the gender breakdown of the people who perform, write, and produce chart-topping songs. While discussions about gender equity in music are common, this dataset lets us visualize the actual trajectory of representation at the highest level of commercial success.
This question involves:
- Categorical variables: artist_male, songwriter_male, producer_male
- Temporal variable: date
- Variables to create: decade (derived from date); recoded gender labels from numeric codes (0 = “All female”, 1 = “All male”, 2 = “Mixed gender”; code 3, which indicates non-binary members, will be merged into “Mixed gender” due to very few observations)
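The recoding described above might look like the following sketch. The helper recode_gender() and the output column artist_gender are hypothetical names, and the toy codes are made up for illustration.

```r
library(dplyr)

# Hypothetical codes for illustration only; not rows from the dataset.
toy <- tibble(artist_male = c(0, 1, 2, 3, NA))

gender_levels <- c("All female", "All male", "Mixed gender")

# Map numeric codes to labels; code 3 is merged into "Mixed gender".
recode_gender <- function(code) {
  label <- case_when(
    code == 0 ~ "All female",
    code == 1 ~ "All male",
    code %in% c(2, 3) ~ "Mixed gender",
    TRUE ~ NA_character_  # missing or unexpected codes stay missing
  )
  factor(label, levels = gender_levels)
}

toy <- toy |>
  mutate(artist_gender = recode_gender(artist_male))

print(toy)
```

The same helper could be applied to songwriter_male and producer_male, for example with across() inside mutate().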
Analysis plan
Plan for Question 1
- Create a decade variable by extracting the year from date and binning into decades (1950s, 1960s, …, 2020s).
- Compute decade-level summary statistics (median, IQR) for energy, danceability, bpm, and acousticness.
- Plot 1 (smoothed time series): Plot each audio feature over time as a scatter plot with a LOESS smoothing line, using date on the x-axis and the feature value on the y-axis. Facet by audio feature to show all four trends in a single figure. This plot type is well suited to revealing long-term trends in continuous data over time.
- Plot 2 (ridgeline plot by genre and decade): For a selected feature (e.g., energy), show the distribution across decades, faceted or colored by cdr_genre. Genres with fewer than 15 total #1 hits will be grouped into an “Other” category to avoid misleading comparisons from small samples. A ridgeline plot (using {ggridges}) is effective for comparing many distributions across an ordered categorical variable, making it easy to see how genre-level distributions shift over time.
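The summary and genre-grouping steps above could be sketched roughly as follows. The data are simulated, and the names decade_summary and genre_lumped are hypothetical. The sketch treats each cdr_genre value as a single label, whereas the printed sample shows combined values such as "Pop;Rock" that may need splitting first.

```r
library(dplyr)
library(tidyr)
library(forcats)

set.seed(1)
# Simulated stand-in for billboard: random values, not real measurements.
toy <- tibble(
  decade = rep(c("1960s", "1970s"), each = 20),
  cdr_genre = rep(c("Pop", "Rock", "Polka"), times = c(20, 17, 3)),
  energy = runif(40, 0, 100),
  danceability = runif(40, 0, 100),
  bpm = runif(40, 60, 180),
  acousticness = runif(40, 0, 100)
)

# Median and IQR of each audio feature within each decade.
decade_summary <- toy |>
  pivot_longer(
    c(energy, danceability, bpm, acousticness),
    names_to = "feature",
    values_to = "value"
  ) |>
  group_by(decade, feature) |>
  summarise(
    median = median(value, na.rm = TRUE),
    iqr = IQR(value, na.rm = TRUE),
    n = n(),
    .groups = "drop"
  )

# Genres with fewer than 15 songs are collapsed into "Other".
toy <- toy |>
  mutate(genre_lumped = fct_lump_min(cdr_genre, min = 15, other_level = "Other"))

print(decade_summary)
print(count(toy, genre_lumped))
```

Reporting n alongside the median and IQR makes it easier to spot decades with too few songs to support a comparison.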
Variables used: date, energy, danceability, bpm, acousticness, cdr_genre
No external data will be merged in.
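A rough sketch of the smoothed time series in Plot 1 is given below. It uses simulated values rather than the real data, and the axis labels, point styling, and free y-scales are assumptions, not settled design choices.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(42)
# Simulated stand-in for billboard: random values, not real measurements.
toy <- tibble(
  date = seq(as.Date("1958-08-04"), as.Date("2025-01-11"), length.out = 120),
  energy = runif(120, 0, 100),
  danceability = runif(120, 0, 100),
  bpm = runif(120, 60, 180),
  acousticness = runif(120, 0, 100)
)

# One row per song and feature, so the features can be faceted.
long <- toy |>
  pivot_longer(-date, names_to = "feature", values_to = "value")

p <- ggplot(long, aes(x = date, y = value)) +
  geom_point(alpha = 0.3, size = 0.8) +
  geom_smooth(method = "loess", formula = y ~ x, se = TRUE) +
  # Free y-scales because bpm is on a different scale from the other features.
  facet_wrap(~ feature, scales = "free_y") +
  labs(x = "Date song first reached #1", y = "Feature value")

print(p)
```

In the real data, date is a date-time column rather than a Date, which ggplot2 handles with a datetime x-axis.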
Plan for Question 2
- Create a decade variable as above.
- Recode artist_male, songwriter_male, and producer_male from numeric codes to descriptive labels: “All female”, “All male”, “Mixed gender”.
- For each decade and each role (artist, songwriter, producer), compute the proportion of #1 hits in each gender category.
- Plot 1 (stacked proportional area chart): Show how the proportion of all-female, all-male, and mixed-gender acts has shifted over time for the artist role. A stacked area chart is well-suited for showing how parts of a whole change over a continuous time axis, making trends in representation immediately visible.
- Plot 2 (grouped bar chart, faceted by role): For each decade, show side-by-side bars for the proportion of all-female vs. all-male vs. mixed-gender contributions, faceted by role (artist, songwriter, producer). Faceting allows direct comparison across roles, revealing whether progress in artist representation is matched behind the scenes in songwriting and production.
Variables used: date, artist_male, songwriter_male, producer_male
No external data will be merged in.
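The proportion step and the faceted bar chart for Question 2 might be sketched as follows. The toy rows are hypothetical and already recoded, and the column names artist, songwriter, and producer are placeholders for the recoded versions of artist_male, songwriter_male, and producer_male.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical, already-recoded toy rows; not taken from the dataset.
toy <- tibble(
  decade = rep(c("1960s", "2010s"), each = 4),
  artist = c("All male", "All male", "All male", "All female",
             "All male", "All female", "All female", "Mixed gender"),
  songwriter = c("All male", "All male", "All male", "All male",
                 "All male", "All male", "Mixed gender", "All female"),
  producer = c("All male", "All male", "All male", "All male",
               "All male", "All male", "All male", "Mixed gender")
)

# Share of #1 hits in each gender category, per decade and role.
role_props <- toy |>
  pivot_longer(
    c(artist, songwriter, producer),
    names_to = "role",
    values_to = "gender"
  ) |>
  count(decade, role, gender) |>
  group_by(decade, role) |>
  mutate(prop = n / sum(n)) |>
  ungroup()

# Grouped bars by decade, one facet per role.
p <- ggplot(role_props, aes(x = decade, y = prop, fill = gender)) +
  geom_col(position = "dodge") +
  facet_wrap(~ role) +
  labs(x = "Decade", y = "Proportion of #1 hits", fill = NULL)

print(role_props)
```

Categories with no songs in a given decade are absent from role_props. Filling them with explicit zeros, for example with tidyr::complete(), may be needed before drawing the stacked area chart.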