Project Lizard

Authors

Edward Choi
Kevin Huang

Introduction

This project uses the Billboard Hot 100 Number Ones dataset from the TidyTuesday project (released 2025-08-26). The data were compiled by Chris Dalla Riva for his book Uncharted Territory: What Numbers Tell Us about the Biggest Hit Songs and Ourselves and document every song that reached #1 on the Billboard Hot 100 between August 4, 1958 and January 11, 2025. The dataset contains 1177 rows and 105 columns, with each row representing a single #1 hit.

The dataset is unusually rich for a music chart record. Beyond basic song and artist information, it includes audio features sourced from the Spotify API — energy, danceability, tempo (BPM), acousticness, and more — as well as hand-coded genre and style labels, binary instrumentation indicators (e.g., guitar_based, piano_keyboard_based), and artist demographic variables (gender and racial composition of performers, songwriters, and producers). This combination makes it possible to ask questions that span both the sonic and social dimensions of nearly seven decades of American popular music.

Question 1: How have the sonic characteristics of #1 hits changed from 1958 to 2025, and do these trends differ by genre?

Introduction

Since the Billboard Hot 100 launched in 1958, the sounds sitting at the top of the chart have shifted dramatically — from the twangy pop of the late 1950s through the guitar-driven rock of the 1960s–70s, the synthesizer-heavy pop of the 1980s–90s, and into the bass-forward hip-hop and electronic productions that dominate today. Audio features derived from Spotify — specifically energy, danceability, bpm, and acousticness — give us a quantitative handle on these changes, capturing perceptual qualities like intensity, rhythmic suitability for dancing, tempo, and how acoustic (vs. electronic) a song sounds.

We are interested in this question because it connects the abstract statistics of a chart to the lived experience of how popular music has felt different across generations. We also want to know whether all genres have followed the same trajectory, or whether, say, Country and Hip-Hop have moved in opposite sonic directions over the same period. The relevant variables are the four Spotify audio features (energy, danceability, bpm, acousticness), the temporal variable date, and the genre label cdr_genre. We derive a decade variable from date for the second plot.

Approach

For the first plot, we use a scatter plot with LOESS smoothing lines, with date on the x-axis and each audio feature on the y-axis, faceted across the four features. A scatter plot with a smoothing line is ideal for this purpose because we have a dense, continuous time series and want to reveal long-run trends without committing to a parametric model. Faceting by feature lets us place all four trends in a single figure for easy comparison, while scales = "free_y" accommodates the fact that BPM is measured in beats per minute (roughly 60–180) whereas the other three features are Spotify scores on a 0–100 scale. The smoothing line is drawn in a fixed contrasting color (dark red against light blue points) to distinguish it from the raw data.
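
As a rough illustration of what the smoothing line does, the sketch below fits a LOESS curve directly with stats::loess(), the function that geom_smooth(method = "loess") relies on, using the same default span of 0.75. The data frame toy, its columns, and the simulated downward trend are assumptions made for illustration; they are not the Billboard data, and the band is a simple normal approximation rather than the exact interval ggplot2 draws.

```r
# Simulated stand-in for one audio feature (NOT the Billboard data)
set.seed(1)
toy <- data.frame(year = seq(1958, 2025, length.out = 300))
toy$value <- 80 - 0.8 * (toy$year - 1958) + rnorm(300, sd = 10)

# Local regression: each fitted value comes from a weighted fit to nearby
# points, so no single global functional form is imposed on the trend
fit <- loess(value ~ year, data = toy, span = 0.75)

# Smoothed values on a yearly grid, with standard errors for an approximate band
grid <- data.frame(year = 1960:2020)
pred <- predict(fit, newdata = grid, se = TRUE)
grid$fit   <- pred$fit
grid$lower <- pred$fit - 1.96 * pred$se.fit
grid$upper <- pred$fit + 1.96 * pred$se.fit

head(grid)
```

A smaller span would follow short-run wiggles more closely, while a larger span gives a flatter curve; the plots in this report keep the default.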

For the second plot, we use a ridgeline plot (from the {ggridges} package), showing the distribution of energy across decades, with one panel per genre. Genres with fewer than 15 total #1 hits are collapsed into an “Other” category to avoid misleading comparisons from very small samples. A ridgeline plot is well-suited here because it efficiently stacks many overlapping density curves along an ordered categorical axis (decade), making it easy to see how the shape and center of a genre’s energy distribution have evolved. Faceting by genre enables direct side-by-side comparison of genre-level trajectories.
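
The lumping rule described above can also be written with forcats, which is loaded as part of the tidyverse. The sketch below is an alternative to the count-and-filter approach used in the analysis, not a description of it, and it runs on an invented genre vector: the genre names and counts are made up for illustration.

```r
library(forcats)

# Invented genre labels (NOT the real cdr_genre counts)
genre <- c(rep("Pop", 20), rep("Rock", 16), rep("Polka", 3), rep("Reggae", 2))

# Keep levels with at least 15 observations; pool everything else as "Other"
genre_group <- fct_lump_min(genre, min = 15, other_level = "Other")

table(genre_group)
```

Either route gives the same grouping for non-missing labels; the choice is mostly a matter of whether the list of retained genres is needed elsewhere.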

Analysis

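# Packages used throughout; `billboard` is assumed to have been read in
# from the TidyTuesday 2025-08-26 release in an earlier setup step
library(tidyverse)
library(ggridges)

# Create decade variable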
billboard <- billboard |>
  mutate(
    year = year(date),
    decade = factor(
      paste0(floor(year / 10) * 10, "s"),
      levels = paste0(seq(1950, 2020, 10), "s")
    )
  )

# Identify genres with at least 15 #1 hits; lump the rest into "Other"
top_genres <- billboard |>
  count(cdr_genre, sort = TRUE) |>
  filter(n >= 15) |>
  pull(cdr_genre)

billboard <- billboard |>
  mutate(genre_group = if_else(cdr_genre %in% top_genres, cdr_genre, "Other"))

# Pivot to long format for faceting
billboard |>
  select(date, energy, danceability, bpm, acousticness) |>
  pivot_longer(
    cols = c(energy, danceability, bpm, acousticness),
    names_to = "feature",
    values_to = "value"
  ) |>
  mutate(
    feature = factor(
      feature,
      levels = c("energy", "danceability", "bpm", "acousticness"),
      labels = c("Energy", "Danceability", "BPM", "Acousticness")
    )
  ) |>
  ggplot(aes(x = date, y = value)) +
  geom_point(alpha = 0.15, size = 0.5, color = "steelblue") +
  geom_smooth(method = "loess", se = TRUE, color = "darkred", linewidth = 1) +
  facet_wrap(~feature, scales = "free_y", ncol = 2) +
  scale_x_date(date_breaks = "10 years", date_labels = "%Y") +
  labs(
    title = "Sonic Characteristics of Billboard #1 Hits Over Time",
    subtitle = "LOESS smoothing lines with 95% confidence bands",
    x = "Year",
    y = NULL,
    caption = "Source: TidyTuesday 2025-08-26 | Audio features from Spotify API"
  ) +
  theme_minimal() +
  theme(
    strip.text = element_text(face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

`geom_smooth()` using formula = 'y ~ x'

Warning: Removed 8 rows containing non-finite outside the scale range
(`stat_smooth()`).

Warning: Removed 8 rows containing missing values or values outside the scale range
(`geom_point()`).

Faceted scatter plots with LOESS smoothing lines showing how energy, danceability, BPM, and acousticness of Billboard #1 hits have changed from 1958 to 2025.

billboard |>
  filter(!is.na(energy), !is.na(genre_group)) |>
  mutate(decade_chr = as.character(decade)) |>
  # geom_density_ridges needs ≥2 points per group to estimate bandwidth;
  # drop singleton decade buckets within each genre to avoid the crash
  group_by(genre_group, decade_chr) |>
  filter(n() >= 2) |>
  ungroup() |>
  ggplot(aes(x = energy, y = decade_chr, fill = genre_group)) +
  geom_density_ridges(
    alpha = 0.7,
    scale = 0.9,
    color = "white",
    linewidth = 0.3
  ) +
  facet_wrap(~genre_group, ncol = 3) +
  scale_x_continuous(labels = scales::label_number(accuracy = 1)) +
  scale_y_discrete(limits = paste0(seq(1950, 2020, 10), "s")) +
  scale_fill_viridis_d(option = "plasma") +
  labs(
    title = "Energy Distribution of #1 Hits by Decade and Genre",
    x = "Energy (Spotify score, 0–100)",
    y = "Decade",
    caption = "Source: TidyTuesday 2025-08-26 | Genres with <15 hits grouped as 'Other'"
  ) +
  theme_minimal() +
  theme(
    legend.position = "none",
    strip.text = element_text(face = "bold"),
    axis.text.x = element_text(size = 7)
  )

Picking joint bandwidth of 4.46

Picking joint bandwidth of 9.72

Picking joint bandwidth of 7.39

Picking joint bandwidth of 6.21

Picking joint bandwidth of 10.9

Picking joint bandwidth of 8.01

Picking joint bandwidth of 8.15

Ridgeline plots showing the distribution of energy scores across decades for each genre of Billboard #1 hits.

Discussion

The smoothed time series in Plot 1 reveals several clear long-run trends in what chart-topping songs sound like. Acousticness shows the most dramatic decline, falling sharply from relatively high values in the late 1950s and 1960s — when acoustic instruments and live arrangements dominated — to consistently low values by the 1980s, reflecting the rise of electric and electronic production. Energy shows a roughly inverse pattern, rising over the same period and plateauing at higher levels in more recent decades. Danceability has also trended upward, consistent with the growing dominance of hip-hop, R&B, and dance-pop at the top of the charts. BPM shows more modest and less consistent variation, suggesting that tempo alone is a weaker signal of stylistic change compared to the other features.

The ridgeline plot in Plot 2 shows that these overall trends mask meaningful genre-level differences. Hip-Hop and R&B hits cluster at higher energy values in the most recent decades, with distributions shifting rightward over time, while older genre categories like Country or Pop show broader, more variable distributions across decades — reflecting both their stylistic diversity and, in Pop’s case, its role as a catch-all category. Genres that were more prominent in earlier eras (e.g., genres associated with the 1960s–70s rock era) tend to have their energy mass concentrated in older decades, with few recent observations. The “Other” panel aggregates the long tail of niche genres and accordingly shows a wide, diffuse distribution across all decades.

Together, the two plots support the view that the sonic landscape of #1 hits has moved steadily toward higher energy and lower acousticness over the past 67 years, but that this shift has been uneven across genres. A plausible reading is that the trends owe more to changing genre dominance, with hip-hop and electronic production displacing rock and acoustic pop at the top of the chart, than to a uniform shift within any single genre’s sound, although the plots do not formally separate the two effects.
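
The difference between a change in genre mix and a change within genres can be made concrete with a small toy calculation. In the sketch below every number is invented: each genre keeps a constant energy score, and only the share of hits per genre changes between two decades. It is meant to show the logic of the argument, not to reproduce any result from the Billboard data.

```r
# Toy data (invented values): genre energy is fixed, only the mix changes
hits <- data.frame(
  decade = rep(c("1960s", "2010s"), each = 10),
  genre  = c(rep("Rock", 8), rep("Hip-Hop", 2),
             rep("Rock", 2), rep("Hip-Hop", 8))
)
hits$energy <- ifelse(hits$genre == "Rock", 50, 70)

# Overall mean energy per decade moves with the genre mix
overall <- tapply(hits$energy, hits$decade, mean)

# Mean energy per genre and decade stays flat within each genre
within <- tapply(hits$energy, list(hits$genre, hits$decade), mean)

overall
within
```

Applying the same two summaries to the real data (overall decade means next to genre-by-decade means) would be one way to check how much of the chart-wide trend survives within genres.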

Question 2: How has the gender composition of artists, songwriters, and producers behind #1 hits changed over time?

Introduction

The music industry has long been male-dominated across all creative roles. This question examines how the gender makeup of the people who perform, write, and produce #1 Billboard hits has shifted from 1958 to 2025. The relevant variables are artist_male, songwriter_male, and producer_male, each encoded numerically (0 = all female, 1 = all male, 2 = mixed gender, 3 = includes non-binary members). Because code 3 accounts for very few observations, we merge it into “Mixed gender” to maintain stable proportions. The temporal variable date and the derived decade variable (created in the Question 1 analysis) allow us to track these shifts across time.
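
The recoding rule can be sketched as a single helper applied to all three columns. The function name recode_gender and the toy tibble are hypothetical and exist only for this illustration; the code-to-label mapping is the one stated above, with code 3 merged into “Mixed gender”.

```r
library(dplyr)

# Hypothetical helper: maps the 0-3 codes to three ordered labels
recode_gender <- function(code) {
  label <- case_when(
    code == 0 ~ "All female",
    code == 1 ~ "All male",
    code %in% c(2, 3) ~ "Mixed gender",
    .default = NA_character_
  )
  factor(label, levels = c("All male", "Mixed gender", "All female"))
}

# Toy rows (NOT the Billboard data), including missing codes
toy <- tibble(
  artist_male = c(0, 1, 2, 3, NA),
  songwriter_male = c(1, 1, 2, 0, 1),
  producer_male = c(1, 1, 1, NA, 2)
)

# Apply the helper to each role and name the outputs *_gender
toy_gender <- toy |>
  mutate(across(
    c(artist_male, songwriter_male, producer_male),
    recode_gender,
    .names = "{sub('_male$', '', .col)}_gender"
  ))

toy_gender
```

Missing codes stay missing rather than being assigned to a category, which matches the filtering on non-missing values later in the analysis.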

We are interested in this question because commercial success at the level of a #1 hit is a concrete, measurable form of industry recognition. If gender equity has improved over time, the proportion of all-female and mixed-gender acts should have risen in recent decades. Comparing the artist, songwriter, and producer roles lets us examine whether progress on stage is matched behind the scenes — a dimension that is often less visible in public discourse about representation in music.

Approach

For the first plot, we use a stacked proportional area chart with year on the x-axis and the proportion of #1 hits in each gender category (all male, mixed gender, all female) on the y-axis, restricted to the artist role. An area chart is well suited for showing how the composition of a whole changes continuously over time, making it easy to track broad shifts in the relative shares of each group. Color mapping distinguishes the three gender categories, and using continuous year (rather than decade) on the x-axis preserves the full resolution of temporal change.

For the second plot, we use a stacked proportional bar chart faceted by role (Artist, Songwriter, Producer), showing the gender composition of #1 hits per decade for all three roles simultaneously. A stacked proportional bar chart is ideal here because each bar sums to 100%, making the shifting balance of all-male, mixed-gender, and all-female contributions directly comparable across decades. Faceting by role enables side-by-side examination of whether trends for performing artists are matched or exceeded in the less-visible roles of songwriting and production. Consistent color mapping across both plots ties the two analyses together visually.
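
Both plots rest on the same calculation: count hits per period and category, fill in zero counts for categories that never appear in a period, then divide by the period total. The sketch below walks through that on a toy tibble whose rows are invented for illustration; in it, 1959 has no “All female” hits, so that combination only exists after complete().

```r
library(dplyr)
library(tidyr)

# Toy rows (NOT the Billboard data): 1959 has no "All female" hits
toy <- tibble(
  year = c(1958, 1958, 1958, 1959, 1959),
  artist_gender = factor(
    c("All male", "All male", "All female", "All male", "Mixed gender"),
    levels = c("All male", "Mixed gender", "All female")
  )
)

props <- toy |>
  count(year, artist_gender) |>
  # Add explicit zero rows for missing year x category combinations
  complete(year, artist_gender, fill = list(n = 0)) |>
  group_by(year) |>
  # Share of each category within its year
  mutate(prop = n / sum(n)) |>
  ungroup()

props
```

Because artist_gender is a factor, complete() expands over all of its levels, so every year ends up with one row per category and the shares within a year sum to one.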

Analysis

# Recode numeric gender codes to descriptive labels
# Codes: 0 = all female, 1 = all male, 2 = mixed gender, 3 = non-binary present (→ Mixed)
billboard_gender <- billboard |>
  mutate(
    artist_gender = case_when(
      artist_male == 0 ~ "All female",
      artist_male == 1 ~ "All male",
      artist_male %in% c(2, 3) ~ "Mixed gender",
      .default = NA_character_
    ),
    songwriter_gender = case_when(
      songwriter_male == 0 ~ "All female",
      songwriter_male == 1 ~ "All male",
      songwriter_male %in% c(2, 3) ~ "Mixed gender",
      .default = NA_character_
    ),
    producer_gender = case_when(
      producer_male == 0 ~ "All female",
      producer_male == 1 ~ "All male",
      producer_male %in% c(2, 3) ~ "Mixed gender",
      .default = NA_character_
    ),
    across(
      c(artist_gender, songwriter_gender, producer_gender),
      ~ factor(.x, levels = c("All male", "Mixed gender", "All female"))
    )
  )

# Year-level proportions for artist gender (Plot 1)
# complete() ensures every year has a row for all three categories so geom_area()
# stacks correctly without interpolating across missing values
artist_by_year <- billboard_gender |>
  filter(!is.na(artist_gender)) |>
  count(year, artist_gender) |>
  complete(year, artist_gender, fill = list(n = 0)) |>
  group_by(year) |>
  mutate(prop = n / sum(n)) |>
  ungroup()

# Decade-level counts by role (Plot 2)
# complete() adds explicit zero counts for any decade × role × gender combination with no hits
gender_by_role <- billboard_gender |>
  select(decade, artist_gender, songwriter_gender, producer_gender) |>
  pivot_longer(
    cols = c(artist_gender, songwriter_gender, producer_gender),
    names_to = "role",
    values_to = "gender"
  ) |>
  filter(!is.na(gender)) |>
  mutate(
    role = factor(
      role,
      levels = c("artist_gender", "songwriter_gender", "producer_gender"),
      labels = c("Artist", "Songwriter", "Producer")
    )
  ) |>
  count(decade, role, gender) |>
  complete(decade, role, gender, fill = list(n = 0))

ggplot(artist_by_year, aes(x = year, y = prop, fill = artist_gender)) +
  geom_area(alpha = 0.85) +
  scale_x_continuous(breaks = seq(1960, 2020, by = 10)) +
  scale_y_continuous(labels = scales::label_percent()) +
  scale_fill_manual(
    values = c(
      "All male" = "#2166ac",
      "Mixed gender" = "#92c5de",
      "All female" = "#d6604d"
    ),
    name = "Gender composition"
  ) +
  labs(
    title = "Gender Composition of #1 Hit Artists, 1958–2025",
    x = "Year",
    y = "Proportion of #1 hits",
    caption = "Source: TidyTuesday 2025-08-26"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

Stacked area chart showing the proportion of Billboard #1 hits by artist gender composition from 1958 to 2025.

ggplot(gender_by_role, aes(x = decade, y = n, fill = gender)) +
  geom_col(position = "fill", width = 0.75) +
  facet_wrap(~role, ncol = 1) +
  scale_y_continuous(labels = scales::label_percent(), expand = expansion(0)) +
  scale_fill_manual(
    values = c(
      "All male" = "#2166ac",
      "Mixed gender" = "#92c5de",
      "All female" = "#d6604d"
    ),
    name = NULL
  ) +
  labs(
    title = "Gender Representation Across Roles by Decade",
    x = NULL,
    y = "Share of #1 hits",
    caption = "Source: TidyTuesday 2025-08-26"
  ) +
  theme_minimal(base_size = 11) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "bottom",
    legend.justification = "center",
    strip.text = element_text(face = "bold", size = 12),
    panel.spacing = unit(1.2, "lines"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor = element_blank()
  )

Stacked proportional bar charts faceted by role (Artist, Songwriter, Producer) showing gender composition of Billboard #1 hits per decade.

Discussion

The stacked area chart in Plot 1 reveals the long-run dominance of all-male acts at the top of the Billboard Hot 100. For most of the chart’s history, the large majority of #1 hits have been by all-male artists or groups. Notable increases in the all-female and mixed-gender shares appear in the 1980s and 1990s, when female pop artists and mixed-gender groups achieved sustained commercial success. In the most recent decades, the proportions appear more variable year-to-year, reflecting a more competitive and diverse landscape at the top of the chart, though parity has not been reached.

The faceted bar chart in Plot 2 allows us to compare whether artist-level trends are matched in songwriting and production. Consistent with well-documented industry patterns, the songwriter and producer panels show even greater male dominance throughout the chart’s history, with slower rates of improvement compared to the artist panel. Production in particular has remained a heavily male-dominated field even as female artists have gained more visibility as performers. The divergence between the artist panel and the songwriter/producer panels suggests that while the faces of popular music have become somewhat more gender-diverse, the creative and technical infrastructure behind hits has remained more male-concentrated.

Together, the two plots show that representation gaps are structural and role-specific, not simply a reflection of who is visible on stage. Any apparent gains at the artist level should be interpreted with the caveat that this dataset captures only #1 hits, the very top of the chart, and does not reflect the broader pipeline of artists attempting to reach that level. The coarse gender coding (four codes, collapsed here to three categories) also limits nuance, particularly for large mixed-gender ensembles where individual contributions cannot be decomposed further.

Presentation

Our presentation can be found here.

Data

Dalla Riva, C. (2025). Billboard Hot 100 Number Ones [Data set]. TidyTuesday project (2025-08-26). Retrieved March 2026 from https://github.com/rfordatascience/tidytuesday/tree/main/data/2025/2025-08-26. Original data collected for: Dalla Riva, C. Uncharted Territory: What Numbers Tell Us about the Biggest Hit Songs and Ourselves. Audio features sourced from the Spotify Web API.

References

Dalla Riva, C. (2025). Billboard Hot 100 Number Ones [Data set]. TidyTuesday project (2025-08-26). https://github.com/rfordatascience/tidytuesday/tree/main/data/2025/2025-08-26

Dalla Riva, C. Uncharted Territory: What Numbers Tell Us about the Biggest Hit Songs and Ourselves.

R Core Team (2024). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. https://www.r-project.org/

Wickham, H., et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wilke, C. O. (2023). ggridges: Ridgeline plots in ‘ggplot2’. R package. https://CRAN.R-project.org/package=ggridges