Project proposal

Author

proud-lizard

library(tidyverse)

Dataset Description

For this project, we use the Billboard Hot 100 Number Ones dataset from the TidyTuesday project (2025-08-26). The dataset was compiled by Chris Dalla Riva for his book Uncharted Territory: What Numbers Tell Us about the Biggest Hit Songs and Ourselves and documents every song that reached #1 on the Billboard Hot 100 between August 4, 1958 and January 11, 2025.

The data are loaded as follows:

billboard <- read_csv("data/billboard.csv")
billboard |>
  select(
    song,
    artist,
    date,
    cdr_genre,
    bpm,
    energy,
    danceability,
    happiness,
    artist_male
  ) |>
  head(5)
# A tibble: 5 × 9
  song  artist date                cdr_genre   bpm energy danceability happiness
  <chr> <chr>  <dttm>              <chr>     <dbl>  <dbl>        <dbl>     <dbl>
1 Poor… Ricky… 1958-08-04 00:00:00 Pop;Rock    155     33           54        80
2 Nel … Domen… 1958-08-18 00:00:00 Pop         130      6           55        48
3 Litt… The E… 1958-08-25 00:00:00 Rock         73     40           41        70
4 It's… Tommy… 1958-09-29 00:00:00 Pop          71     15           33        61
5 It's… Conwa… 1958-11-10 00:00:00 Pop         127     43           44        36
# ℹ 1 more variable: artist_male <dbl>

The dataset contains 1177 rows and 105 columns. Each row represents a single #1 hit and includes the following types of variables:

  • Numeric variables: bpm, energy, danceability, happiness, loudness_d_b, acousticness, length_sec, weeks_at_number_one, front_person_age, overall_rating
  • Categorical variables: cdr_genre, cdr_style, simplified_key, artist_structure, artist_male, artist_white, artist_black
  • Temporal variable: date (date the song first reached #1)
  • Binary instrumentation variables: guitar_based, piano_keyboard_based, vocally_based, bass_based, orchestral_strings, horns_winds, and many more

The dataset also comes with a supplementary file, topics.csv, which lists 97 distinct lyrical topic categories used in the lyrical_topic column.

Why we chose this dataset

We chose this dataset because it shows how popular music in the United States has evolved over nearly seven decades. It combines audio features from Spotify (energy, danceability, BPM), hand-coded instrumentation and genre labels, and artist demographic information, which lets us ask questions that span both the sonic and social dimensions of hit music. With 1177 #1 hits and 105 columns, the dataset supports a wide range of visual analyses without needing external data.

Questions

2. How has the gender composition of artists, songwriters, and producers behind #1 hits changed over time?

This question investigates representation in the music industry by examining the gender breakdown of the people who perform, write, and produce chart-topping songs. While discussions about gender equity in music are common, this dataset lets us visualize the actual trajectory of representation at the highest level of commercial success.

This question involves:

  • Categorical variables: artist_male, songwriter_male, producer_male
  • Temporal variable: date
  • Variables to create: decade (derived from date); gender labels recoded from the numeric codes (0 = “All female”, 1 = “All male”, 2 = “Mixed gender”). Code 3, which indicates non-binary members, will be merged into “Mixed gender” because it has very few observations.
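
A minimal sketch of how these two derived variables could be created is shown here. The toy rows are invented for illustration, and `recode_gender()` and `artist_gender` are hypothetical names that do not appear in the dataset; the code mapping follows the description above.

```r
library(tidyverse)

# Toy rows with invented values, used only to illustrate the recoding;
# the real columns come from billboard.csv.
toy <- tibble(
  date = as.Date(c("1958-08-04", "1975-03-01", "1999-12-25", "2024-06-15")),
  artist_male = c(1, 0, 2, 3)
)

# Hypothetical helper: map numeric codes to labels, folding code 3
# into "Mixed gender". Any other value becomes NA.
recode_gender <- function(code) {
  case_when(
    code == 0 ~ "All female",
    code == 1 ~ "All male",
    code %in% c(2, 3) ~ "Mixed gender",
    TRUE ~ NA_character_
  )
}

toy_recoded <- toy |>
  mutate(
    # Round the year down to the start of its decade, e.g. 1958 -> "1950s"
    decade = paste0(floor(lubridate::year(date) / 10) * 10, "s"),
    artist_gender = recode_gender(artist_male)
  )
```

The same helper could be applied to songwriter_male and producer_male, assuming they use the same coding scheme.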

Analysis plan

Plan for Question 1

  1. Create a decade variable by extracting the year from date and binning into decades (1950s, 1960s, …, 2020s). Because the data begin in August 1958, the 1950s bin covers only 1958 and 1959.
  2. Compute decade-level summary statistics (median, IQR) for energy, danceability, bpm, and acousticness.
  3. Plot 1 — Smoothed time series: Plot each audio feature over time as a scatter plot with a LOESS smoothing line, using date on the x-axis and the feature value on the y-axis. Facet by audio feature to show all four trends in a single figure. This plot type is ideal for revealing long-term trends in continuous data over time.
  4. Plot 2 — Ridgeline plot by genre and decade: For a selected feature (e.g., energy), show the distribution across decades, faceted or colored by cdr_genre. Genres with fewer than 15 total #1 hits will be grouped into an “Other” category to avoid misleading comparisons from small samples. A ridgeline plot (using {ggridges}) is effective for comparing many distributions across an ordered categorical variable, making it easy to see how genre-level distributions shift over time.
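
Steps 1 and 2 above could be sketched roughly as follows. The toy values are invented, only two of the four features are included to keep the example short, and the names `decade_summary`, `median_value`, and `iqr_value` are our own placeholders.

```r
library(tidyverse)

# Toy rows with invented values; the real data come from billboard.csv.
toy <- tibble(
  date = as.Date(c("1961-01-01", "1965-06-01", "1968-09-01",
                   "1972-02-01", "1977-07-01")),
  energy = c(30, 50, 40, 60, 80),
  danceability = c(55, 45, 65, 70, 50)
)

decade_summary <- toy |>
  # Step 1: derive the decade label from the year
  mutate(decade = paste0(floor(lubridate::year(date) / 10) * 10, "s")) |>
  # Long format: one row per song and audio feature
  pivot_longer(
    c(energy, danceability),
    names_to = "feature",
    values_to = "value"
  ) |>
  # Step 2: median and IQR per decade and feature
  group_by(decade, feature) |>
  summarise(
    median_value = median(value, na.rm = TRUE),
    iqr_value = IQR(value, na.rm = TRUE),
    n = n(),
    .groups = "drop"
  )
```

The long format produced by `pivot_longer()` is also the shape needed for faceting by audio feature in Plot 1.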

Variables used: date, energy, danceability, bpm, acousticness, cdr_genre
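
The small-genre grouping planned for Plot 2 could be done with `fct_lump_min()` from {forcats}, as in this sketch. The genre counts are invented, and `genre_grouped` is a placeholder name. The sketch also assumes that combined labels such as "Pop;Rock" are treated as their own genre rather than split; that choice is still open.

```r
library(tidyverse)

# Toy genre labels with invented counts; real labels come from cdr_genre.
toy <- tibble(
  cdr_genre = c(rep("Pop", 20), rep("Rock", 16), rep("Polka", 3), rep("Jazz", 2))
)

toy_lumped <- toy |>
  # Levels that appear fewer than 15 times are collapsed into "Other"
  mutate(
    genre_grouped = fct_lump_min(cdr_genre, min = 15, other_level = "Other")
  )

genre_counts <- count(toy_lumped, genre_grouped)
```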

No external data will be merged in.

Plan for Question 2

  1. Create a decade variable as above.
  2. Recode artist_male, songwriter_male, and producer_male from numeric codes to descriptive labels (“All female”, “All male”, “Mixed gender”), merging code 3 into “Mixed gender”.
  3. For each decade and each role (artist, songwriter, producer), compute the proportion of #1 hits in each gender category.
  4. Plot 1 — Stacked proportional area chart: Show how the proportion of all-female, all-male, and mixed-gender acts has shifted over time for the artist role. A stacked area chart is well-suited for showing how parts of a whole change over a continuous time axis, making trends in representation immediately visible.
  5. Plot 2 — Grouped bar chart, faceted by role: For each decade, show side-by-side bars for the proportion of all-female vs. all-male vs. mixed-gender contributions, faceted by role (artist, songwriter, producer). Faceting allows direct comparison across roles, revealing whether progress in artist representation is matched behind the scenes in songwriting and production.
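
Step 3 and the faceted bar chart could be sketched as below. The rows are invented, the decade column is supplied directly to keep the example short, and the names `gender_props` and `gender_plot` are placeholders. The sketch assumes all three columns share the same coding.

```r
library(tidyverse)

# Toy rows with invented codes; the real columns come from billboard.csv.
toy <- tibble(
  decade = c("1960s", "1960s", "1960s", "1960s", "2010s", "2010s"),
  artist_male = c(1, 1, 1, 0, 0, 2),
  songwriter_male = c(1, 1, 1, 1, 2, 1),
  producer_male = c(1, 1, 1, 1, 1, 1)
)

gender_props <- toy |>
  # One row per song and role
  pivot_longer(
    c(artist_male, songwriter_male, producer_male),
    names_to = "role",
    values_to = "code"
  ) |>
  mutate(
    role = str_remove(role, "_male$"),
    # Unmatched codes fall through to NA
    gender = case_when(
      code == 0 ~ "All female",
      code == 1 ~ "All male",
      code %in% c(2, 3) ~ "Mixed gender"
    )
  ) |>
  count(decade, role, gender) |>
  # Proportion within each decade and role
  group_by(decade, role) |>
  mutate(prop = n / sum(n)) |>
  ungroup()

# Grouped bars per decade, one panel per role; print(gender_plot) to draw
gender_plot <- ggplot(gender_props, aes(x = decade, y = prop, fill = gender)) +
  geom_col(position = "dodge") +
  facet_wrap(~role) +
  labs(x = "Decade", y = "Proportion of #1 hits", fill = "Gender")
```

Computing proportions within each decade and role, rather than across the whole table, keeps the bars in each panel comparable even when decades contain different numbers of #1 hits.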

Variables used: date, artist_male, songwriter_male, producer_male

No external data will be merged in.