Project Lizard

Authors

Edward Choi
Kevin Huang

Introduction

This project uses the Billboard Hot 100 Number Ones dataset from the TidyTuesday project (released 2025-08-26). The data were compiled by Chris Dalla Riva for his book Uncharted Territory: What Numbers Tell Us about the Biggest Hit Songs and Ourselves and document every song that reached #1 on the Billboard Hot 100 between August 4, 1958 and January 11, 2025. The dataset contains 1177 rows and 105 columns, with each row representing a single #1 hit.

The dataset is unusually rich for a music chart record. Beyond basic song and artist information, it includes audio features sourced from the Spotify API — energy, danceability, tempo (BPM), acousticness, and more — as well as hand-coded genre and style labels, binary instrumentation indicators (e.g., guitar_based, piano_keyboard_based), and artist demographic variables (gender and racial composition of performers, songwriters, and producers). This combination makes it possible to ask questions that span both the sonic and social dimensions of nearly seven decades of American popular music.

Question 2: How Has the Gender Composition of Artists, Songwriters, and Producers Behind #1 Hits Changed Over Time?

Introduction

The music industry has long been male-dominated across all creative roles. This question examines how the gender makeup of the people who perform, write, and produce #1 Billboard hits has shifted from 1958 to 2025. The relevant variables are artist_male, songwriter_male, and producer_male, each encoded numerically (0 = all female, 1 = all male, 2 = mixed gender, 3 = includes non-binary members). Because code 3 accounts for very few observations, we merge it into “Mixed gender” to maintain stable proportions. The temporal variable date and the derived decade variable (created in the Question 1 analysis) allow us to track these shifts across time.
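The statement that code 3 is rare can be checked with a quick tally of the raw codes before any merging. The sketch below is only an illustration: it uses a small made-up vector in place of the real artist_male column, and the names toy and code_shares are hypothetical, so the shares it produces say nothing about the actual dataset.

```r
library(dplyr)
library(tibble)

# Hypothetical toy codes standing in for billboard$artist_male
# (0 = all female, 1 = all male, 2 = mixed gender, 3 = includes non-binary members)
toy <- tibble(artist_male = c(1, 1, 1, 0, 2, 1, 0, 3, 1, 2))

# Count each raw code and convert the counts to shares of all rows
code_shares <- toy |>
  count(artist_male) |>
  mutate(share = n / sum(n))

code_shares
```

Running the same count() on the real columns (artist_male, songwriter_male, producer_male) would show how many observations the merge into “Mixed gender” actually affects for each role.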

We are interested in this question because commercial success at the level of a #1 hit is a concrete, measurable form of industry recognition. If gender equity has improved over time, the proportion of all-female and mixed-gender acts should have risen in recent decades. Comparing the artist, songwriter, and producer roles lets us examine whether progress on stage is matched behind the scenes — a dimension that is often less visible in public discourse about representation in music.

Approach

For the first plot, we use a stacked proportional area chart with year on the x-axis and the proportion of #1 hits in each gender category (all male, mixed gender, all female) on the y-axis, restricted to the artist role. An area chart is well suited for showing how the composition of a whole changes continuously over time, making it easy to track broad shifts in the relative shares of each group. Color mapping distinguishes the three gender categories, and using continuous year (rather than decade) on the x-axis preserves the full resolution of temporal change.
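One practical detail of a stacked area chart is that every year needs a row for each gender category, including categories with no hits that year, so that the areas stack on a common set of x positions. The sketch below shows one way to do this with tidyr::complete() on made-up data; toy and toy_props are hypothetical names used only for this illustration.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical toy data: 1961 has no "All female" hit and no year has a "Mixed gender" hit
toy <- tibble(
  year = c(1960, 1960, 1961),
  artist_gender = factor(
    c("All male", "All female", "All male"),
    levels = c("All male", "Mixed gender", "All female")
  )
)

toy_props <- toy |>
  count(year, artist_gender) |>
  # complete() expands to every year x factor-level combination, filling counts with 0
  complete(year, artist_gender, fill = list(n = 0)) |>
  group_by(year) |>
  mutate(prop = n / sum(n)) |>
  ungroup()

toy_props
```

Because artist_gender is a factor, complete() uses all of its levels, so the result has one row per year for each of the three categories and the proportions within each year sum to one.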

For the second plot, we use a stacked proportional bar chart faceted by role (Artist, Songwriter, Producer), showing the gender composition of #1 hits per decade for all three roles simultaneously. A stacked proportional bar chart is ideal here because each bar sums to 100%, making the shifting balance of all-male, mixed-gender, and all-female contributions directly comparable across decades. Faceting by role enables side-by-side examination of whether trends for performing artists are matched or exceeded in the less-visible roles of songwriting and production. Consistent color mapping across both plots ties the two analyses together visually.
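The "each bar sums to 100%" property comes from dividing each count by the total for its bar, which is what position = "fill" does at plot time. As a rough sketch, the same shares can be computed explicitly, which is handy when exact percentages are needed in the text. The counts below are invented for illustration, and toy_counts and toy_shares are hypothetical names.

```r
library(dplyr)
library(tibble)

# Hypothetical counts for one role in two decades (not real values from the dataset)
toy_counts <- tribble(
  ~decade, ~role,    ~gender,        ~n,
  "1960s", "Artist", "All male",     80,
  "1960s", "Artist", "Mixed gender",  5,
  "1960s", "Artist", "All female",   15,
  "2010s", "Artist", "All male",     50,
  "2010s", "Artist", "Mixed gender", 20,
  "2010s", "Artist", "All female",   30
)

# Share of each gender category within its decade x role bar
toy_shares <- toy_counts |>
  group_by(decade, role) |>
  mutate(share = n / sum(n)) |>
  ungroup()

toy_shares
```

Grouping by both decade and role mirrors the faceted layout: each bar in each panel is normalized on its own, so shares are comparable across decades and roles even when the number of #1 hits differs.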

Analysis

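library(tidyverse)

# Assumes `billboard` is already in scope with the `decade` column from the
# Question 1 analysis and the `year` column used below

# Recode numeric gender codes to descriptive labels
# Codes: 0 = all female, 1 = all male, 2 = mixed gender, 3 = non-binary present (→ Mixed)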
billboard_gender <- billboard |>
  mutate(
    artist_gender = case_when(
      artist_male == 0 ~ "All female",
      artist_male == 1 ~ "All male",
      artist_male %in% c(2, 3) ~ "Mixed gender",
      .default = NA_character_
    ),
    songwriter_gender = case_when(
      songwriter_male == 0 ~ "All female",
      songwriter_male == 1 ~ "All male",
      songwriter_male %in% c(2, 3) ~ "Mixed gender",
      .default = NA_character_
    ),
    producer_gender = case_when(
      producer_male == 0 ~ "All female",
      producer_male == 1 ~ "All male",
      producer_male %in% c(2, 3) ~ "Mixed gender",
      .default = NA_character_
    ),
    across(
      c(artist_gender, songwriter_gender, producer_gender),
      ~ factor(.x, levels = c("All male", "Mixed gender", "All female"))
    )
  )

# Year-level proportions for artist gender (Plot 1)
# complete() ensures every year has a row for all three categories so geom_area()
# stacks correctly without interpolating across missing values
artist_by_year <- billboard_gender |>
  filter(!is.na(artist_gender)) |>
  count(year, artist_gender) |>
  complete(year, artist_gender, fill = list(n = 0)) |>
  group_by(year) |>
  mutate(prop = n / sum(n)) |>
  ungroup()

# Decade-level counts by role (Plot 2)
# complete() ensures all decade × role × gender combinations exist, with explicit zero counts where a category is absent
gender_by_role <- billboard_gender |>
  select(decade, artist_gender, songwriter_gender, producer_gender) |>
  pivot_longer(
    cols = c(artist_gender, songwriter_gender, producer_gender),
    names_to = "role",
    values_to = "gender"
  ) |>
  filter(!is.na(gender)) |>
  mutate(
    role = factor(
      role,
      levels = c("artist_gender", "songwriter_gender", "producer_gender"),
      labels = c("Artist", "Songwriter", "Producer")
    )
  ) |>
  count(decade, role, gender) |>
  complete(decade, role, gender, fill = list(n = 0))

# Plot 1: stacked proportional area chart of artist gender composition by year
ggplot(artist_by_year, aes(x = year, y = prop, fill = artist_gender)) +
  geom_area(alpha = 0.85) +
  scale_x_continuous(breaks = seq(1960, 2020, by = 10)) +
  scale_y_continuous(labels = scales::label_percent()) +
  scale_fill_manual(
    values = c(
      "All male" = "#2166ac",
      "Mixed gender" = "#92c5de",
      "All female" = "#d6604d"
    ),
    name = "Gender composition"
  ) +
  labs(
    title = "Gender Composition of #1 Hit Artists, 1958–2025",
    x = "Year",
    y = "Proportion of #1 hits",
    caption = "Source: TidyTuesday 2025-08-26"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

Stacked area chart showing the proportion of Billboard #1 hits by artist gender composition from 1958 to 2025.

# Plot 2: stacked proportional bars by decade, one panel per role
ggplot(gender_by_role, aes(x = decade, y = n, fill = gender)) +
  geom_col(position = "fill", width = 0.75) +
  facet_wrap(~role, ncol = 1) +
  scale_y_continuous(labels = scales::label_percent(), expand = expansion(0)) +
  scale_fill_manual(
    values = c(
      "All male" = "#2166ac",
      "Mixed gender" = "#92c5de",
      "All female" = "#d6604d"
    ),
    name = NULL
  ) +
  labs(
    title = "Gender Representation Across Roles by Decade",
    x = NULL,
    y = "Share of #1 hits",
    caption = "Source: TidyTuesday 2025-08-26"
  ) +
  theme_minimal(base_size = 11) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "bottom",
    legend.justification = "center",
    strip.text = element_text(face = "bold", size = 12),
    panel.spacing = unit(1.2, "lines"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor = element_blank()
  )

Stacked proportional bar charts faceted by role (Artist, Songwriter, Producer) showing gender composition of Billboard #1 hits per decade.

Discussion

The stacked area chart in Plot 1 shows the long-run dominance of all-male acts at the top of the Billboard Hot 100. For most of the chart’s history, the large majority of #1 hits have been by all-male artists or groups. Notable increases in the all-female and mixed-gender shares appear in the 1980s and 1990s, when female pop artists and mixed-gender groups achieved sustained commercial success. In the most recent decades, the proportions vary more from year to year, which may reflect a more competitive and diverse landscape at the top of the chart, though parity has not been reached.

The faceted bar chart in Plot 2 allows us to compare whether artist-level trends are matched in songwriting and production. Consistent with well-documented industry patterns, the songwriter and producer panels show even greater male dominance throughout the chart’s history, with slower rates of improvement compared to the artist panel. Production in particular has remained a heavily male-dominated field even as female artists have gained more visibility as performers. The divergence between the artist panel and the songwriter/producer panels suggests that while the faces of popular music have become somewhat more gender-diverse, the creative and technical infrastructure behind hits has remained more male-concentrated.

Together, the two plots suggest that representation gaps are structural and role-specific, not simply a reflection of who is visible on stage. Any apparent gains at the artist level should be interpreted with the caveat that this dataset captures only #1 hits, the very top of the chart, and does not reflect the broader pipeline of artists attempting to reach that level. The three-category gender coding (with the non-binary code merged into “Mixed gender”) also limits nuance, particularly for large mixed-gender ensembles where individual contributions cannot be decomposed further.

Presentation

Our presentation can be found here.

Data

Dalla Riva, C. (2025). Billboard Hot 100 Number Ones [Data set]. TidyTuesday project (2025-08-26). Retrieved March 2026 from https://github.com/rfordatascience/tidytuesday/tree/main/data/2025/2025-08-26. Original data collected for: Dalla Riva, C. Uncharted Territory: What Numbers Tell Us about the Biggest Hit Songs and Ourselves. Audio features sourced from the Spotify Web API.

References

Dalla Riva, C. (2025). Billboard Hot 100 Number Ones [Data set]. TidyTuesday project (2025-08-26). https://github.com/rfordatascience/tidytuesday/tree/main/data/2025/2025-08-26

Dalla Riva, C. Uncharted Territory: What Numbers Tell Us about the Biggest Hit Songs and Ourselves.

R Core Team (2024). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. https://www.r-project.org/

Wickham, H., et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wilke, C. O. (2023). ggridges: Ridgeline plots in ‘ggplot2’. R package. https://CRAN.R-project.org/package=ggridges