Harmonizing with Data: Exploring music through data science

Appendix to report

Data cleaning

# A tibble: 4,214 × 11
# Groups:   artist_name [2,333]
   artist_name           artist_hotttnesss artist_terms song_loudness song_tempo
   <chr>                             <dbl> <chr>                <dbl>      <dbl>
 1 Casual                            0.402 hip hop             -11.2        92.2
 2 Gob                               0.402 pop punk             -4.50      130. 
 3 Planet P Project                  0.332 new wave            -13.5        86.6
 4 Wayne Watson                      0.352 ccm                  -7.54      118. 
 5 Blue Rodeo                        0.448 country rock         -8.58      120. 
 6 Richard Souther                   0.331 chill-out           -16.1       128. 
 7 Tesla                             0.513 hard rock            -5.27      150. 
 8 Elena                             0.378 uk garage            -8.05      112. 
 9 The Dillinger Escape…             0.542 math-core            -4.26      167. 
10 SUE THOMPSON                      0.306 pop rock            -12.3       138. 
# ℹ 4,204 more rows
# ℹ 6 more variables: song_duration <dbl>, song_year <int>,
#   song_hotttnesss <dbl>, song_key <dbl>, song_time_signature <dbl>,
#   num_songs <int>

We downloaded the music.csv data set from the CORGIS website and imported it with read.csv(). To record how many songs each artist has in the data set, we grouped the rows by artist and used mutate() to add a num_songs column containing that count. We then dropped the columns we did not plan to use in our analysis, keeping basic information about the artists (name, popularity, and genre) and the songs’ characteristics (such as key, time signature, and duration). To focus the analysis on songs that achieved at least some popularity, we also removed any song with a song.hotttnesss value of zero or lower. Finally, we cleaned the column names and dropped any rows containing NA values. The cleaned data set, shown above, has 4,214 songs by 2,333 artists and 11 variables.
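
The cleaning steps can be sketched roughly as follows. This is a minimal illustration, not the exact code used for the report. The toy data frame `raw` stands in for the output of read.csv("music.csv"), and the dotted column names are assumed from the song.hotttnesss name mentioned in the text and the cleaned names in the printout. The helper `clean_music()` is a hypothetical name, and `rename_with()` and `drop_na()` are assumed stand-ins for whatever functions were actually used to clean the names and drop missing values.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for read.csv("music.csv"); values are made up for illustration.
raw <- data.frame(
  artist.name         = c("A", "A", "B", "C"),
  artist.hotttnesss   = c(0.4, 0.4, 0.5, 0.3),
  artist.terms        = c("hip hop", "hip hop", "pop punk", NA),
  song.loudness       = c(-11.2, -9.8, -4.5, -7.5),
  song.tempo          = c(92.2, 101.0, 130.0, 118.0),
  song.duration       = c(218.9, 240.1, 209.6, 177.5),
  song.year           = c(1994L, 1995L, 2007L, 1984L),
  song.hotttnesss     = c(0.6, 0.0, 0.5, 0.7),
  song.key            = c(1, 7, 10, 2),
  song.time_signature = c(4, 4, 4, 3),
  extra.column        = 1:4,  # hypothetical column that is not kept
  stringsAsFactors    = FALSE
)

# Columns kept for the analysis (original, uncleaned names).
keep <- c(
  "artist.name", "artist.hotttnesss", "artist.terms",
  "song.loudness", "song.tempo", "song.duration", "song.year",
  "song.hotttnesss", "song.key", "song.time_signature", "num_songs"
)

# Hypothetical helper that follows the order of steps described above.
clean_music <- function(data) {
  data |>
    group_by(artist.name) |>
    mutate(num_songs = n()) |>            # songs per artist in the raw data
    select(all_of(keep)) |>               # drop unused columns
    filter(song.hotttnesss > 0) |>        # keep songs with some popularity
    rename_with(~ gsub(".", "_", .x, fixed = TRUE)) |>  # dots to underscores
    drop_na()                             # drop rows with any missing value
}

cleaned <- clean_music(raw)
print(cleaned)
```

Because the count is computed before the popularity filter in this sketch, num_songs reflects every song by the artist in the raw data, including songs that are later removed.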