Project proposal

Author

proud-lizard

library(tidyverse)

Dataset Description

For this project, we use the Billboard Hot 100 Number Ones dataset from the TidyTuesday project (2025-08-26). The dataset was compiled by Chris Dalla Riva for his book Uncharted Territory: What Numbers Tell Us about the Biggest Hit Songs and Ourselves and documents every song that reached #1 on the Billboard Hot 100 between August 4, 1958 and January 11, 2025.

The data are loaded as follows:

billboard <- read_csv("data/billboard.csv")

glimpse(billboard)

Rows: 1,177
Columns: 105
$ song                                         <chr> "Poor Little Fool", "Nel …
$ artist                                       <chr> "Ricky Nelson", "Domenico…
$ date                                         <dttm> 1958-08-04, 1958-08-18, …
$ weeks_at_number_one                          <dbl> 2, 5, 1, 6, 2, 1, 3, 4, 3…
$ non_consecutive                              <dbl> 0, 1, 0, 0, 1, 0, 0, 0, 0…
$ rating_1                                     <dbl> 4, 7, 5, 3, 7, 5, 8, 1, 9…
$ rating_2                                     <dbl> 5, 7, 6, 3, 8, 5, 8, 5, 9…
$ rating_3                                     <dbl> 3, 5, 6, 7, 9, 2, 8, 2, 8…
$ overall_rating                               <dbl> 4.000000, 6.333333, 5.666…
$ divisiveness                                 <dbl> 1.3333333, 1.3333333, 0.6…
$ label                                        <chr> "Imperial", "Decca", "Apt…
$ parent_label                                 <chr> "Imperial", "Decca", "ABC…
$ cdr_genre                                    <chr> "Pop;Rock", "Pop", "Rock"…
$ cdr_style                                    <chr> "Acoustic Rock", "Vocal",…
$ discogs_genre                                <chr> "Rock", "Pop;Folk, World,…
$ discogs_style                                <chr> "Rock & Roll", "Vocal;Can…
$ artist_structure                             <dbl> 1, 1, 0, 1, 1, 0, 0, 0, 0…
$ featured_artists                             <chr> NA, NA, NA, NA, NA, NA, N…
$ multiple_lead_vocalists                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ group_named_after_non_lead_singer            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ talent_contestant                            <chr> NA, NA, NA, NA, NA, NA, N…
$ posthumous                                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ artist_place_of_origin                       <chr> "United States", "Italy",…
$ front_person_age                             <dbl> 18.00000, 30.00000, 17.00…
$ artist_male                                  <dbl> 1, 1, 1, 1, 1, 1, 2, 1, 2…
$ artist_white                                 <dbl> 1, 1, 1, 0, 1, 1, 1, 0, 0…
$ artist_black                                 <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 1…
$ songwriters                                  <chr> "Sharon Sheeley", "Franco…
$ songwriters_w_o_interpolation_sample_credits <chr> "Sharon Sheeley", "Franco…
$ songwriter_male                              <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 1…
$ songwriter_white                             <dbl> 1, 1, 1, 1, 1, 1, 1, 0, 1…
$ artist_is_a_songwriter                       <dbl> 0, 1, 1, 0, 1, 0, 1, 1, 0…
$ artist_is_only_songwriter                    <dbl> 0, 0, 1, 0, 0, 0, 1, 1, 0…
$ producers                                    <chr> "Jimmie Haskell;Ozzie Nel…
$ producer_male                                <dbl> 1, NA, NA, 1, 1, 1, 1, 1,…
$ producer_white                               <dbl> 1, NA, NA, 1, 1, 1, 1, 0,…
$ artist_is_a_producer                         <dbl> 1, NA, NA, 0, 0, 0, 1, 1,…
$ artist_is_only_producer                      <dbl> 0, NA, NA, 0, 0, 0, 1, 1,…
$ songwriter_is_a_producer                     <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 0…
$ time_signature                               <chr> "4/4", "Free;6/8;4/4", "F…
$ keys                                         <chr> "C", "Bb", "A", "Eb", "B"…
$ simplified_key                               <chr> "C", "Bb", "A", "Eb", "B"…
$ bpm                                          <dbl> 155, 130, 73, 71, 127, 12…
$ energy                                       <dbl> 33, 6, 40, 15, 43, 14, 20…
$ danceability                                 <dbl> 54, 55, 41, 33, 44, 63, 3…
$ happiness                                    <dbl> 80, 48, 70, 61, 36, 52, 3…
$ loudness_d_b                                 <dbl> -12, -17, -13, -18, -10, …
$ acousticness                                 <dbl> 67, 98, 87, 4, 86, 83, 89…
$ vocally_based                                <dbl> 0, 0, 1, 1, 1, 0, 1, 0, 1…
$ bass_based                                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ guitar_based                                 <dbl> 1, 0, 1, 0, 0, 1, 1, 1, 0…
$ piano_keyboard_based                         <dbl> 0, 1, 0, 1, 1, 0, 0, 0, 0…
$ orchestral_strings                           <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 1…
$ horns_winds                                  <dbl> 0, 0, 1, 0, 0, 0, 0, 1, 0…
$ accordion                                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ banjo                                        <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0…
$ bongos                                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ clarinet                                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ cowbell                                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ falsetto_vocal                               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ flute_piccolo                                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ handclaps_snaps                              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ harmonica                                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ human_whistling                              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ kazoo                                        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ mandolin                                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ pedal_lap_steel                              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ ocarina                                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ saxophone                                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ sitar                                        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ trumpet                                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ ukulele                                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ violin                                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ sound_effects                                <chr> NA, NA, NA, NA, NA, NA, N…
$ song_structure                               <chr> "A2", "E1", "D3", "D1", "…
$ rap_verse_in_a_non_rap_song                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ length_sec                                   <dbl> 154, 219, 163, 156, 134, …
$ instrumental                                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ instrumental_length_sec                      <dbl> 12, 11, 0, 0, 0, 7, 8, 28…
$ intro_length_sec                             <dbl> 12, 40, 10, 3, 24, 31, 8,…
$ vocal_introduction                           <dbl> 0, 1, 1, 0, 1, 1, 0, 0, 0…
$ free_time_vocal_introduction                 <dbl> 0, 1, 1, 0, 1, 0, 0, 0, 0…
$ fade_out                                     <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 0…
$ live                                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ cover                                        <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 1…
$ sample                                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ interpolation                                <dbl> 0, 0, 1, 1, 0, 0, 0, 0, 0…
$ inspired_by_a_different_song                 <dbl> 0, 0, 1, 1, 0, 1, 0, 0, 1…
$ lyrics                                       <chr> "I used to play around wi…
$ lyrical_topic                                <chr> "Lost Love", "Flying;Drea…
$ lyrical_narrative                            <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0…
$ spoken_word                                  <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0…
$ explicit                                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ foreign_language                             <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0…
$ written_for_a_play                           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1…
$ featured_in_a_then_contemporary_play         <chr> NA, NA, NA, NA, NA, NA, N…
$ written_for_a_film                           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ featured_in_a_then_contemporary_film         <chr> NA, NA, NA, NA, NA, NA, N…
$ written_for_a_t_v_show                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ featured_in_a_then_contemporary_t_v_show     <chr> NA, NA, NA, NA, NA, NA, N…
$ associated_with_dance                        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ topped_the_charts_by_multiple_artist         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ double_a_side                                <chr> NA, NA, NA, NA, NA, NA, N…
$ eurovision_entry                             <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0…
$ u_s_artwork                                  <chr> "Cannot Locate", "Cannot …

The dataset contains 1,177 rows and 105 columns. Each row represents a single #1 hit and includes the following types of variables:

Numeric variables: bpm, energy, danceability, happiness, loudness_d_b, acousticness, length_sec, weeks_at_number_one, front_person_age, overall_rating
Categorical variables: cdr_genre, cdr_style, simplified_key, artist_structure, artist_male, artist_white, artist_black (the last four are stored as numeric codes)
Temporal variable: date (date the song first reached #1)
Binary instrumentation variables: guitar_based, piano_keyboard_based, vocally_based, bass_based, orchestral_strings, horns_winds, and many more

The dataset also comes with a supplementary file, topics.csv, which lists 97 distinct lyrical topic categories used in the lyrical_topic column.

Why we chose this dataset

We chose this dataset because it offers a rich window into how popular music in the United States has evolved over nearly seven decades. The combination of audio features from Spotify (energy, danceability, BPM), hand-coded instrumentation and genre labels, and artist demographic information allows us to ask questions that span both the sonic and social dimensions of hit music. With 1,177 #1 hits and 105 variables, the dataset supports a wide range of visual analyses without needing external data.

Questions

Because this is an observational dataset, our questions are framed to identify associations and trends rather than causal relationships.

1. How have the sonic characteristics of #1 hits changed from 1958 to 2025, and do these trends differ by genre?

This question explores the long-term evolution of what a chart-topping song sounds like. As music production technology, listener preferences, and industry practices have shifted over decades, we expect measurable changes in audio features like energy, danceability, tempo, and acousticness. Comparing these trends across genres will reveal whether all genres have followed the same trajectory or diverged.

This question involves:

Numeric variables: energy, danceability, bpm, acousticness
Categorical variable: cdr_genre
Temporal variable: date
Variables to create: decade (derived from date, grouping years into decades for cleaner visualization)
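One possible way to derive decade is sketched below. The helper name add_decade and the three toy rows are illustrative assumptions, not part of the dataset or of the final analysis code.

```r
library(dplyr)
library(lubridate)

# Toy rows for illustration only; the real input would be `billboard`.
toy <- tibble(
  song = c("song_a", "song_b", "song_c"),
  date = as.POSIXct(c("1958-08-04", "1969-12-29", "2025-01-11"), tz = "UTC")
)

# Hypothetical helper: floor the year to the start of its decade, then label it.
add_decade <- function(df) {
  df |>
    mutate(
      year = year(date),
      decade = paste0(floor(year / 10) * 10, "s")
    )
}

toy_decade <- add_decade(toy)
toy_decade
```

Keeping decade as a character label such as "1960s" sorts correctly in alphabetical order, so no explicit factor ordering should be needed for plotting.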

2. How has the gender composition of artists, songwriters, and producers behind #1 hits changed over time?

This question investigates representation in the music industry by examining the gender breakdown of the people who perform, write, and produce chart-topping songs. While discussions about gender equity in music are common, this dataset lets us visualize the actual trajectory of representation at the highest level of commercial success.

This question involves:

Categorical variables: artist_male, songwriter_male, producer_male
Temporal variable: date
Variables to create: decade (derived from date); recoded gender labels from numeric codes (0 = “All female”, 1 = “All male”, 2 = “Mixed gender”; code 3, which indicates non-binary members, will be merged into “Mixed gender” due to very few observations)
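The recoding described above could look roughly like this. The helper name recode_gender and the toy values are assumptions made for illustration; the code-to-label mapping follows the description in the text.

```r
library(dplyr)

# Hypothetical helper: map numeric codes to labels.
# Code 3 is folded into "Mixed gender"; anything else (including NA) stays NA.
recode_gender <- function(x) {
  case_when(
    x == 0 ~ "All female",
    x == 1 ~ "All male",
    x %in% c(2, 3) ~ "Mixed gender"
  )
}

# Toy rows for illustration only; the real input would be `billboard`.
toy <- tibble(
  artist_male     = c(1, 2, 0, 3),
  songwriter_male = c(0, 1, 1, 2),
  producer_male   = c(1, NA, 1, 0)
)

toy_recoded <- toy |>
  mutate(across(c(artist_male, songwriter_male, producer_male), recode_gender))
toy_recoded
```

Missing producer codes are left as NA here rather than assigned to a category, so they can be handled explicitly when proportions are computed.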

Analysis plan

Plan for Question 1

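Create a decade variable by extracting the year from date and binning into decades (1950s, 1960s, …, 2020s). Because the data start in August 1958 and end in January 2025, the 1950s and 2020s bins each cover only part of a decade.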
Compute decade-level summary statistics (median, IQR) for energy, danceability, bpm, and acousticness.
Plot 1 — Smoothed time series: Plot each audio feature over time as a scatter plot with a LOESS smoothing line, using date on the x-axis and the feature value on the y-axis. Facet by audio feature to show all four trends in a single figure. This plot type is ideal for revealing long-term trends in continuous data over time.
Plot 2 — Ridgeline plot by genre and decade: For a selected feature (e.g., energy), show the distribution across decades, faceted or colored by cdr_genre. Genres with fewer than 15 total #1 hits will be grouped into an “Other” category to avoid misleading comparisons from small samples. A ridgeline plot (using {ggridges}) is effective for comparing many distributions across an ordered categorical variable, making it easy to see how genre-level distributions shift over time.
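The decade-level summaries in the second step could be computed along these lines. This is a minimal sketch: the helper name summarise_features and the toy values are assumptions, and it expects a decade column to exist already.

```r
library(dplyr)
library(tidyr)

# Toy values for illustration only; the real input would be `billboard`
# with a decade column already added.
toy <- tibble(
  decade       = c("1960s", "1960s", "1960s", "1970s", "1970s"),
  energy       = c(30, 40, 50, 60, 80),
  danceability = c(50, 55, 60, 70, 70),
  bpm          = c(120, 130, 140, 100, 110),
  acousticness = c(90, 80, 70, 40, 20)
)

# Hypothetical helper: reshape to long form, then summarise each feature by decade.
summarise_features <- function(df) {
  df |>
    pivot_longer(
      c(energy, danceability, bpm, acousticness),
      names_to = "feature",
      values_to = "value"
    ) |>
    group_by(decade, feature) |>
    summarise(
      median = median(value, na.rm = TRUE),
      iqr    = IQR(value, na.rm = TRUE),
      n      = sum(!is.na(value)),
      .groups = "drop"
    )
}

toy_summary <- summarise_features(toy)
toy_summary
```

Reporting n alongside the median and IQR makes it easier to spot decades with few songs, such as the partial 1950s and 2020s.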

Variables used: date, energy, danceability, bpm, acousticness, cdr_genre

No external data will be merged in.
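The "Other" grouping planned for the ridgeline plot might be sketched as follows. Two assumptions are made here: that semicolon-separated entries in cdr_genre (such as "Pop;Rock" in the glimpse output) are split so a song counts once per listed genre, and that the threshold is applied after splitting. The helper name lump_genres and the toy rows are illustrative only.

```r
library(dplyr)
library(tidyr)
library(forcats)

# Toy rows for illustration only; the real input would be `billboard`.
toy <- tibble(
  song = paste0("song_", 1:8),
  cdr_genre = c("Pop", "Pop", "Pop", "Rock", "Rock", "Pop;Rock", "Polka", "Jazz")
)

# Hypothetical helper: one row per song-genre pair, then lump rare genres.
# min_hits defaults to the threshold of 15 named in the plan.
lump_genres <- function(df, min_hits = 15) {
  df |>
    separate_longer_delim(cdr_genre, delim = ";") |>
    mutate(
      cdr_genre   = trimws(cdr_genre),
      genre_group = fct_lump_min(cdr_genre, min = min_hits, other_level = "Other")
    )
}

# A lower threshold is used here only because the toy data are tiny.
toy_lumped <- lump_genres(toy, min_hits = 3)
count(toy_lumped, genre_group)
```

If multi-genre songs should instead count only once, an alternative would be to keep the first listed genre as the primary one; which choice fits better depends on how the genre labels were assigned.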

Plan for Question 2

Create a decade variable as above.
Recode artist_male, songwriter_male, and producer_male from numeric codes to descriptive labels: “All female”, “All male”, “Mixed gender”.
For each decade and each role (artist, songwriter, producer), compute the proportion of #1 hits in each gender category.
Plot 1 — Stacked proportional area chart: Show how the proportion of all-female, all-male, and mixed-gender acts has shifted over time for the artist role. A stacked area chart is well-suited for showing how parts of a whole change over a continuous time axis, making trends in representation immediately visible.
Plot 2 — Grouped bar chart, faceted by role: For each decade, show side-by-side bars for the proportion of all-female vs. all-male vs. mixed-gender contributions, faceted by role (artist, songwriter, producer). Faceting allows direct comparison across roles, revealing whether progress in artist representation is matched behind the scenes in songwriting and production.
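The per-decade, per-role proportions could be computed roughly as below. This sketch assumes the gender columns have already been recoded to labels and a decade column exists; it also assumes that songs with a missing code for a role are dropped from that role's denominator. The helper name gender_props and the toy rows are illustrative.

```r
library(dplyr)
library(tidyr)

# Toy rows for illustration only; the real input would be the recoded `billboard`.
toy <- tibble(
  decade = c("1960s", "1960s", "1960s", "1960s", "1970s", "1970s"),
  artist_male = c("All male", "All male", "All female", "Mixed gender",
                  "All male", "All female"),
  songwriter_male = c("All male", "All male", "All male", "All female",
                      "Mixed gender", "All male"),
  producer_male = c("All male", NA, "All male", "All male",
                    "All male", "All female")
)

# Hypothetical helper: stack the three roles, count, and convert to proportions.
gender_props <- function(df) {
  df |>
    pivot_longer(
      c(artist_male, songwriter_male, producer_male),
      names_to = "role",
      values_to = "gender"
    ) |>
    filter(!is.na(gender)) |>                 # assumption: NAs leave the denominator
    mutate(role = sub("_male$", "", role)) |>
    count(decade, role, gender) |>
    group_by(decade, role) |>
    mutate(prop = n / sum(n)) |>
    ungroup()
}

toy_props <- gender_props(toy)
toy_props
```

The resulting long table has one row per decade, role, and gender category, which is the shape both planned plots would need: filter to the artist role for the area chart, or facet by role for the grouped bars.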

Variables used: date, artist_male, songwriter_male, producer_male

No external data will be merged in.