Project proposal
library(tidyverse)
billboard <- read_csv("data/billboard.csv")
topics <- read_csv("data/topics.csv")
head(billboard)
# A tibble: 6 × 105
song artist date weeks_at_number_one non_consecutive rating_1
<chr> <chr> <dttm> <dbl> <dbl> <dbl>
1 Poor … Ricky… 1958-08-04 00:00:00 2 0 4
2 Nel B… Domen… 1958-08-18 00:00:00 5 1 7
3 Littl… The E… 1958-08-25 00:00:00 1 0 5
4 It's … Tommy… 1958-09-29 00:00:00 6 0 3
5 It's … Conwa… 1958-11-10 00:00:00 2 1 7
6 Tom D… The K… 1958-11-17 00:00:00 1 0 5
# ℹ 99 more variables: rating_2 <dbl>, rating_3 <dbl>, overall_rating <dbl>,
# divisiveness <dbl>, label <chr>, parent_label <chr>, cdr_genre <chr>,
# cdr_style <chr>, discogs_genre <chr>, discogs_style <chr>,
# artist_structure <dbl>, featured_artists <chr>,
# multiple_lead_vocalists <dbl>, group_named_after_non_lead_singer <dbl>,
# talent_contestant <chr>, posthumous <dbl>, artist_place_of_origin <chr>,
# front_person_age <dbl>, artist_male <dbl>, artist_white <dbl>, …
Dataset
Our dataset is a comprehensive catalog of every song that reached #1 on the Billboard Hot 100 chart, spanning from 1958-08-04 to 2025-01-11. The dataset contains 1177 songs described by 105 variables covering chart performance, musical characteristics, artist demographics, production credits, and lyrical content. The data comes from the Tidy Tuesday project. It was compiled by Chris Dalla Riva.
Key variable types include:
- Numerical: weeks_at_number_one, overall_rating, bpm, energy, danceability, happiness, acousticness, loudness_d_b, length_sec, front_person_age
- Categorical: cdr_genre, artist_structure, artist_male, artist_white, artist_black, lyrical_topic, simplified_key, time_signature
A companion reference table (topics.csv) lists 97 distinct lyrical topic categories used to classify songs thematically.
We chose this dataset because it combines numerical features with categorical ones and covers more than six decades, which lets us trace trends in popular music and how they change over time. It also lets us examine both chart success and key details about the music itself and its creators.
Questions
How have the musical characteristics of Billboard #1 hits changed over time, and do these trends differ across genres? Our team would like to analyze whether attributes such as energy, danceability, and acousticness show different patterns of change over time across genres.
How is chart longevity (weeks at #1) associated with artist demographics and song attributes? Our team would like to explore whether factors such as artist gender, artist race, song length, and critic ratings are associated with how long a song remains at the top of the charts.
Analysis plan
Question 1: Musical characteristics over time by genre
- Variables involved: date, energy, danceability, acousticness, cdr_genre
- Variables to be created: A decade variable derived from date (for example, the 1960s through the 2020s) to group songs into meaningful time periods. We may also pivot the audio features into long format for faceted visualization.
- Handling genre categories: Because the dataset may contain many genre labels with uneven representation, we will examine genre frequencies and consider grouping or filtering genres with small sample sizes to ensure reliable comparisons.
- Contextual interpretation: To interpret the visuals, we will look for shifts in averages, variability, and overall trajectories of musical attributes across decades and genres. We will also consider how broader industry and technological changes, like the rise of digital production and music streaming, may influence the trends, and we will discuss these factors when interpreting results.
- External data: None.
- Approach: We will create time-series visualizations (like smoothed line plots or boxplots by decade) of each musical characteristic, faceted or colored by genre. This will let us compare specific genres with one another, for example whether Pop #1 hits have grown more energetic over time while R&B hits have increased in danceability.
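The preparation steps for Question 1 could be sketched roughly as follows. This is a minimal sketch, not our final code: the helper name prep_q1, the min_songs threshold, and the toy rows are hypothetical and made up for illustration (they are not taken from billboard.csv), and pooling small genres into "Other" is only one of the grouping options we are considering.

```r
library(dplyr)
library(tidyr)
library(forcats)
library(ggplot2)

# Hypothetical helper: adds a decade variable, pools rare genres, and
# pivots the three audio features into long format.
prep_q1 <- function(data, min_songs = 30) {
  data %>%
    mutate(
      # 1958 -> 1950, 1964 -> 1960, 2024 -> 2020
      decade = floor(lubridate::year(date) / 10) * 10,
      # Genres with fewer than `min_songs` songs are pooled into "Other"
      # (the threshold is an assumption to be tuned on the real data)
      genre = fct_lump_min(cdr_genre, min = min_songs, other_level = "Other")
    ) %>%
    pivot_longer(
      cols = c(energy, danceability, acousticness),
      names_to = "feature",
      values_to = "value"
    )
}

# Made-up rows for illustration only
toy <- tibble(
  date = as.POSIXct(
    c("1958-08-04", "1964-02-01", "1964-06-01", "2024-05-01"),
    tz = "UTC"
  ),
  cdr_genre = c("Rock", "Rock", "Pop", "Country"),
  energy = c(40, 60, 55, 70),
  danceability = c(50, 45, 65, 60),
  acousticness = c(80, 30, 20, 10)
)

long <- prep_q1(toy, min_songs = 2)

# Mean of each feature per decade and genre
decade_means <- long %>%
  group_by(decade, genre, feature) %>%
  summarise(mean_value = mean(value, na.rm = TRUE), .groups = "drop")

# One panel per audio feature, one line per genre
p <- ggplot(decade_means, aes(x = decade, y = mean_value, colour = genre)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ feature) +
  labs(x = "Decade", y = "Mean value", colour = "Genre")
```

Calling print(p) renders the faceted plot. On the real data, prep_q1(billboard) would replace the toy tibble, and the decade means could be swapped for boxplots or smoothed lines depending on how many songs each genre has per decade.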
Question 2: Chart longevity, demographics, and song attributes
- Variables involved: weeks_at_number_one, artist_male, artist_black, artist_structure, overall_rating, length_sec
- Variables to be created: A categorical version of weeks_at_number_one (for example, “1 week,” “2-4 weeks,” “5+ weeks”) for grouped comparisons. These cutoffs will be informed by the distribution of the variable so that categories reflect meaningful differences in chart longevity. We will also recode artist_male and artist_black into factors with descriptive labels, so that plots show female/male instead of 0 or 1.
- Artist structure clarification: We will examine how artist_structure (solo vs. group) is defined and determine how to categorize hybrid artist types such as duets or collaborations to ensure consistent comparisons.
- Missing data handling: We will assess the dataset for missing or inconsistent values, particularly in demographic and rating variables, and determine whether filtering or recoding strategies are necessary.
- External data: None.
- Approach: We will use boxplots, bar charts, and scatterplots to visualize how weeks at #1 varies across artist gender, race, and structure, and how it relates to song attributes such as rating and length. Multivariate visualizations using color and faceting will allow us to examine several variables simultaneously. If appropriate, we may also explore simple statistical models (such as regression) to more formally assess relationships between song characteristics and chart longevity. To interpret these visuals, we will examine differences in medians, distributions, and spread across demographic and song attribute groupings.
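The variable creation and missing-data check for Question 2 might look something like the sketch below. The helper names prep_q2 and missing_counts, the label text, and the toy rows are hypothetical. The sketch assumes artist_male and artist_black are coded 0/1 as described above; any other code would become NA and should be checked against the data dictionary. The cutoffs shown are the example ones and may change once we look at the distribution.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: bins weeks at #1 and attaches descriptive labels.
prep_q2 <- function(data,
                    breaks = c(0, 1, 4, Inf),
                    labels = c("1 week", "2-4 weeks", "5+ weeks")) {
  data %>%
    mutate(
      # Right-closed intervals: (0, 1], (1, 4], (4, Inf]
      longevity = cut(weeks_at_number_one, breaks = breaks, labels = labels),
      # Assumes 0/1 coding; other values become NA
      artist_gender = factor(artist_male, levels = c(0, 1),
                             labels = c("Female", "Male")),
      artist_race = factor(artist_black, levels = c(0, 1),
                           labels = c("Not Black", "Black"))
    )
}

# Hypothetical helper: number of missing values in each selected column.
missing_counts <- function(data, cols) {
  data %>%
    summarise(across(all_of(cols), ~ sum(is.na(.x))))
}

# Made-up rows for illustration only
toy <- tibble(
  weeks_at_number_one = c(1, 3, 4, 5, 10),
  artist_male = c(1, 0, 1, NA, 0),
  artist_black = c(0, 1, 1, 0, NA),
  overall_rating = c(6.5, NA, 7.2, 5.0, 8.1)
)

q2 <- prep_q2(toy)
na_summary <- missing_counts(
  q2, c("artist_male", "artist_black", "overall_rating")
)

# Weeks at #1 by artist gender, dropping rows with unknown gender
p <- q2 %>%
  filter(!is.na(artist_gender)) %>%
  ggplot(aes(x = artist_gender, y = weeks_at_number_one)) +
  geom_boxplot() +
  labs(x = "Artist gender", y = "Weeks at #1")
```

Keeping the cutoffs as function arguments means they can be adjusted after inspecting the distribution without touching the rest of the pipeline.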
Code Organization Plan
- Our code will be organized into labeled sections:
  - Data loading
  - Data cleaning and preprocessing
  - Variable creation (e.g., decade, longevity categories)
  - Question 1 visualizations
  - Question 2 visualizations
We will include comments explaining both the purpose and reasoning behind key transformations.
Limitations and Considerations
- We acknowledge that external factors may influence observed trends, including:
  - Changes in chart calculation methodology
  - The rise of streaming platforms
  - Shifts in music production technology
  - Industry marketing practices
These contextual factors will be considered when interpreting our findings.