library(tidyverse)
#remotes::install_github("wjakethompson/taylor")
#install.packages(c("tayloRswift"))
#install.packages("tidytuesdayR")
library(tayloRswift)
Project proposal
Dataset
# tuesdata <- tidytuesdayR::tt_load('2023-10-17')
#
# # Taylor Swift data sets
# taylor_album_songs <- tuesdata$taylor_album_songs
# taylor_all_songs <- tuesdata$taylor_all_songs
# taylor_albums <- tuesdata$taylor_albums
# Using locally stored data due to exceeding API limit
taylor_album_songs <- read_csv("data/taylor_album_songs.csv")
Rows: 194 Columns: 29
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): album_name, track_name, artist, featuring, key_name, mode_name, k...
dbl (14): track_number, danceability, energy, key, loudness, mode, speechin...
lgl (4): ep, bonus_track, explicit, lyrics
date (4): album_release, promotional_release, single_release, track_release
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
taylor_all_songs <- read_csv("data/taylor_all_songs.csv")
Rows: 274 Columns: 29
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): album_name, track_name, artist, featuring, key_name, mode_name, k...
dbl (14): track_number, danceability, energy, key, loudness, mode, speechin...
lgl (4): ep, bonus_track, explicit, lyrics
date (4): album_release, promotional_release, single_release, track_release
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
taylor_albums <- read_csv("data/taylor_albums.csv")
Rows: 14 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): album_name
dbl (2): metacritic_score, user_score
lgl (1): ep
date (1): album_release
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
A brief description of your data set including its provenance, dimensions, etc. as well as the reason why you chose this data set.
Make sure to load the data and use inline code for some of this information.
The data set we will use for our project is the Taylor Swift data from the TidyTuesday project, which was uploaded to GitHub by the RforDataScience organization. It consists of three data sets. The first, taylor_albums, contains details about all of Swift’s albums, including release date, Metacritic score, and user score; it has 14 rows and 5 columns. The second, taylor_album_songs, includes all songs that are part of a Taylor Swift album, with information such as the album each song belongs to, any featured artists, and audio characteristics such as energy, danceability, and liveness; it has 194 rows and 29 columns. The third, taylor_all_songs, includes all songs Swift has ever been featured in. It has the same columns as taylor_album_songs but more entries: 274 rows and 29 columns.
We chose this data set because Taylor Swift is arguably the most talked-about celebrity in the world right now, so the material is very relevant. While she has been in the headlines a lot recently for her relationship with NFL player Travis Kelce, her fame stems from being one of, if not the most, popular artists in the world. Almost all of her albums have been extremely successful, and this data set has the information needed to explore which components of her songs make her albums so popular, so we thought it would be very interesting to look at.
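The row and column counts quoted in the description could be computed from the loaded data rather than typed by hand. Below is a minimal sketch of that idea, assuming a toy data frame named albums with made-up values standing in for taylor_albums. In the rendered document, the same calls would typically go into inline R code such as `r nrow(taylor_albums)`.

```r
# Toy stand-in for taylor_albums (hypothetical values, not the real data).
albums <- data.frame(
  album_name       = c("A", "B", "C"),
  metacritic_score = c(70, 80, NA),
  ep               = c(FALSE, FALSE, TRUE)
)

# Build the dimension sentence from the data itself so it cannot go stale.
dims_sentence <- sprintf(
  "This data set has %d rows and %d columns.",
  nrow(albums), ncol(albums)
)
dims_sentence
```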
Questions
- How does the average danceability of an album’s songs relate to the Metacritic score of that album?
- How have characteristics of Taylor Swift’s music (e.g. tempo and acousticness) changed over the course of her career?
Analysis plan
A plan for answering each of the questions including the variables involved, variables to be created (if any), external data to be merged in (if any).
Question 1
We will answer this question using the taylor_album_songs.csv and taylor_albums.csv data from TidyTuesday. From taylor_album_songs, we’ll be able to view the specific songs in each album and their danceability scores. From there we can compute summary metrics (average, median, sum, etc.) of the danceability scores and compare them across the different albums along with the metacritic_score. This will probably involve some combination of bar plots, box-and-whisker plots, and/or scatter plots showing the relationships between album_name, danceability, and metacritic_score. To evaluate whether the differences between metrics are significant, we can run two-sample t-tests comparing the various albums to see if there’s a significant difference in danceability/Metacritic score across albums. Any NA values for danceability or Metacritic score will simply be excluded from the metrics. If an album has none of these values, we will exclude it from our analysis.
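The Question 1 plan could be sketched roughly as follows. This is only an illustration on made-up toy data, not the analysis itself: the data frames songs and albums stand in for taylor_album_songs and taylor_albums, every value is invented, and comparing two albums with a Welch t-test (the default of R's t.test) is an assumption about how the comparison would be run.

```r
library(dplyr)

# Toy stand-ins for taylor_album_songs and taylor_albums (hypothetical values).
songs <- data.frame(
  album_name   = c("A", "A", "A", "B", "B", "B", "C", "C"),
  danceability = c(0.50, 0.60, NA, 0.70, 0.80, 0.90, NA, NA)
)
albums <- data.frame(
  album_name       = c("A", "B", "C"),
  metacritic_score = c(70, 80, NA)
)

# Per-album danceability summaries, ignoring missing song values.
# Album "C" has no danceability values at all, so it drops out here.
album_dance <- songs %>%
  filter(!is.na(danceability)) %>%
  group_by(album_name) %>%
  summarise(
    mean_dance   = mean(danceability),
    median_dance = median(danceability),
    n_songs      = n(),
    .groups      = "drop"
  )

# Attach critic scores and drop albums without a Metacritic score.
album_summary <- album_dance %>%
  inner_join(albums, by = "album_name") %>%
  filter(!is.na(metacritic_score))

# Two-sample (Welch) t-test of song-level danceability between two albums.
dance_a <- songs %>%
  filter(album_name == "A", !is.na(danceability)) %>%
  pull(danceability)
dance_b <- songs %>%
  filter(album_name == "B", !is.na(danceability)) %>%
  pull(danceability)
test_ab <- t.test(dance_a, dance_b)
```

The album_summary table has one row per album, which is the shape a scatter plot of mean danceability against metacritic_score would need, while the song-level vectors feed the t-test.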
Question 2
We will answer this question using the taylor_all_songs.csv data from TidyTuesday. We will combine the album_release, promotional_release, single_release, and track_release columns into one date column in order to have a continuous time series to plot against. We then plan to draw a scatter plot with a linear regression line representing the general trend across Taylor’s career. For multiple songs released on the same day, i.e. album songs, we will plot all the songs as separate data points. We can then create multiple layers reflecting different characteristics of her music, like danceability, energy, tempo, acousticness, etc., to see the general trend of these characteristics. If a song is missing the date after merging all the date columns, we will drop the row. If it is missing all of the song characteristics, we will also drop the row. Otherwise, we will plot only the values the row does have. We will distinguish different types of releases by color (singles one color, albums another, etc.).
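The Question 2 plan could look something like the sketch below, again on made-up toy data in which songs stands in for taylor_all_songs. Several choices here are assumptions rather than part of the plan: the priority order used to merge the date columns, the release_type labels derived from whichever column supplied the date, and the use of one facet per characteristic instead of stacked layers (tempo and acousticness sit on very different scales).

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Toy stand-in for taylor_all_songs (hypothetical values).
songs <- data.frame(
  track_name          = c("s1", "s2", "s3", "s4"),
  album_release       = as.Date(c("2010-10-25", "2010-10-25", NA, "2012-10-22")),
  promotional_release = as.Date(c(NA, NA, NA, "2012-10-01")),
  single_release      = as.Date(c(NA, "2010-08-04", NA, NA)),
  track_release       = as.Date(c("2010-10-25", NA, NA, NA)),
  tempo               = c(120, 96, 100, NA),
  acousticness        = c(0.10, NA, 0.50, NA)
)

songs_long <- songs %>%
  mutate(
    # Assumed priority: the first non-missing date in this order wins.
    release_date = coalesce(single_release, promotional_release,
                            album_release, track_release),
    # Label each song by the column that supplied its date.
    release_type = case_when(
      !is.na(single_release)      ~ "single",
      !is.na(promotional_release) ~ "promotional",
      !is.na(album_release)       ~ "album",
      !is.na(track_release)       ~ "track"
    )
  ) %>%
  # Drop songs with no date, then songs missing every characteristic.
  filter(!is.na(release_date)) %>%
  filter(!(is.na(tempo) & is.na(acousticness))) %>%
  # One row per song and characteristic; missing values are dropped.
  pivot_longer(
    c(tempo, acousticness),
    names_to       = "characteristic",
    values_to      = "value",
    values_drop_na = TRUE
  )

# Scatter plot with a linear trend line, colored by release type.
trend_plot <- ggplot(songs_long, aes(x = release_date, y = value)) +
  geom_point(aes(colour = release_type)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(~ characteristic, scales = "free_y")
```

In the toy data, s3 is dropped because it has no release date and s4 because it has no characteristics, while s2 contributes only its tempo value.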