TikTok Track Trends

Exploratory data analysis

Research question(s)

State your research question(s) clearly.

What factors contribute to making a track popular on TikTok?

How does the duration of the TikTok video impact its popularity?
What is the association between danceability and energy?
Do the release date and the artist play a role in the popularity of the track?

Data collection and cleaning

Have an initial draft of your data cleaning appendix. Document every step that takes your raw data file(s) and turns it into the analysis-ready data set that you would submit with your final project. Include text narrative describing your data collection (downloading, scraping, surveys, etc) and any additional data curation/cleaning (merging data frames, filtering, transformations of variables, etc). Include code for data curation/cleaning, but not collection.

# import packages
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.0
✔ tibble  3.2.1     ✔ dplyr   1.1.2
✔ tidyr   1.2.1     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

library(tidymodels)

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
✔ broom        1.0.2     ✔ rsample      1.1.1
✔ dials        1.1.0     ✔ tune         1.1.1
✔ infer        1.0.4     ✔ workflows    1.1.2
✔ modeldata    1.0.1     ✔ workflowsets 1.0.0
✔ parsnip      1.0.3     ✔ yardstick    1.1.0
✔ recipes      1.0.6     
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Use tidymodels_prefer() to resolve common conflicts.

library(skimr)

# import data
tiktok <- read_csv(file.path("data", "tiktok.csv"))

Rows: 6746 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (9): track_id, track_name, artist_id, artist_name, album_id, release_da...
dbl (14): duration, popularity, danceability, energy, key, loudness, mode, s...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# view data, check data types are valid
glimpse(tiktok)

Rows: 6,746
Columns: 23
$ track_id         <chr> "6kVuF2PYLuvl9T85XjNbaO", "1RGIjMFMgJxkZHMDXVYzOJ", "…
$ track_name       <chr> "Lay It Down Gmix - Main", "Bartender (feat. Akon)", …
$ artist_id        <chr> "1Xfmvd48oOhEWkscWyEbh9", "3aQeKQSyrW4qWr35idm0cy", "…
$ artist_name      <chr> "Lloyd", "T-Pain", "T-Pain", "Blxst", "Gryffin", "Bel…
$ album_id         <chr> "43C6GVlhXG4KfZuEbxty3r", "6CrSEKCF4TYrbSIitegb3h", "…
$ duration         <dbl> 302186, 238800, 238800, 161684, 218295, 122772, 12277…
$ release_date     <chr> "2011-01-01", "2007-06-05", "2007-06-05", "2020-12-04…
$ popularity       <dbl> 28, 75, 75, 76, 72, 89, 89, 50, 89, 70, 70, 98, 98, 4…
$ danceability     <dbl> 0.597, 0.832, 0.832, 0.571, 0.548, 0.855, 0.855, 0.77…
$ energy           <dbl> 0.800, 0.391, 0.391, 0.767, 0.839, 0.463, 0.463, 0.80…
$ key              <dbl> 1, 8, 8, 2, 6, 3, 3, 11, 4, 1, 1, 8, 8, 1, 3, 0, 11, …
$ loudness         <dbl> -5.423, -8.504, -8.504, -5.160, -2.371, -7.454, -7.45…
$ mode             <dbl> 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0,…
$ speechiness      <dbl> 0.3120, 0.0628, 0.0628, 0.2870, 0.0644, 0.0367, 0.036…
$ acousticness     <dbl> 0.04610, 0.05640, 0.05640, 0.33600, 0.13500, 0.21700,…
$ instrumentalness <dbl> 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 1.7…
$ liveness         <dbl> 0.1800, 0.2240, 0.2240, 0.0809, 0.1020, 0.3470, 0.347…
$ valence          <dbl> 0.565, 0.436, 0.436, 0.605, 0.314, 0.866, 0.866, 0.90…
$ tempo            <dbl> 155.932, 104.961, 104.961, 93.421, 98.932, 102.931, 1…
$ playlist_id      <chr> "6kVuF2PYLuvl9T85XjNbaO", "1RGIjMFMgJxkZHMDXVYzOJ", "…
$ playlist_name    <chr> "6kVuF2PYLuvl9T85XjNbaO", "1RGIjMFMgJxkZHMDXVYzOJ", "…
$ duration_mins    <dbl> 5.036433, 3.980000, 3.980000, 2.694733, 3.638250, 2.0…
$ genre            <chr> "TIKTOK DANCE", "TIKTOK DANCE", "TIKTOK DANCE", "TIKT…

# remove null and NaN values, if any
tiktok_clean <- na.omit(tiktok)

# drop unnecessary columns: playlist_id, playlist_name, genre
tiktok_clean <- select(tiktok_clean, -playlist_id, -playlist_name, -genre)

# column names checked: all are already snake_case, so none need renaming
# note: key is coded 0 to 11 in pitch-class order (0 = C, 1 = C#, 2 = D, 3 = D#, 4 = E, 5 = F, …)
tiktok_clean

# A tibble: 6,746 × 20
   track_id      track_name artist_id artist_name album_id duration release_date
   <chr>         <chr>      <chr>     <chr>       <chr>       <dbl> <chr>       
 1 6kVuF2PYLuvl… Lay It Do… 1Xfmvd48… Lloyd       43C6GVl…   302186 2011-01-01  
 2 1RGIjMFMgJxk… Bartender… 3aQeKQSy… T-Pain      6CrSEKC…   238800 2007-06-05  
 3 1RGIjMFMgJxk… Bartender… 3aQeKQSy… T-Pain      6CrSEKC…   238800 2007-06-05  
 4 1dIWPXMX4kRH… Chosen (f… 4qXC0i02… Blxst       7Awrgen…   161684 2020-12-04  
 5 4QVS8YCpK71R… Tie Me Do… 2ZRQcIgz… Gryffin     69t8rpg…   218295 2018-08-03  
 6 7BoobGhD4x5K… Build a B… 26cMerAx… Bella Poar… 5YKqfiQ…   122772 2021-05-14  
 7 7BoobGhD4x5K… Build a B… 26cMerAx… Bella Poar… 5YKqfiQ…   122772 2021-05-14  
 8 5OKHUpNLi4GE… Ever After 1mCY2mHc… Bonnie Bai… 4TrCexU…   231559 2018-05-11  
 9 3J8EOeKLTLXO… Calling M… 6jGMq4yG… Lil Tjay    3MEKpJ7…   205458 2021-04-02  
10 5caZgotE4D6e… Clap For … 03T8GHHc… YungManny   7nYMFoZ…   125579 2021-03-20  
# ℹ 6,736 more rows
# ℹ 13 more variables: popularity <dbl>, danceability <dbl>, energy <dbl>,
#   key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, duration_mins <dbl>
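The output above shows that release_date is stored as a character column, which makes it awkward to use for the release-date research question. A minimal sketch of one possible conversion, assuming every value follows the YYYY-MM-DD pattern seen in the preview (values in any other pattern would become NA and need separate handling). The toy tibble copies three rows visible in the preview; `toy` and `release_year` are illustrative names, not part of the dataset.

```r
library(dplyr)

# Toy stand-in for tiktok_clean: three rows copied from the preview above
toy <- tibble(
  release_date = c("2011-01-01", "2007-06-05", "2020-12-04"),
  popularity   = c(28, 75, 76)
)

toy_dated <- toy |>
  mutate(
    # An explicit format returns NA (rather than an error) for non-matching strings
    release_date = as.Date(release_date, format = "%Y-%m-%d"),
    # release_year is a hypothetical helper column for grouping or plotting by year
    release_year = as.integer(format(release_date, "%Y"))
  )

# Count values that failed to parse before relying on the new column
n_unparsed <- sum(is.na(toy_dated$release_date))
```

If `n_unparsed` were greater than zero on the full data, those rows would need a second parsing rule before any analysis by date.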

Data description

Have an initial draft of your data description section. Your data description should be about your analysis-ready data.

The dataset was created to analyze the popularity of trending TikTok tracks, and understand what factors influence a TikTok track’s popularity. The dataset was created by Team Dan, a team participating in the Eskwelabs DSCN Sprint (https://github.com/romeoben/DSC7-Sprint2-TeamDan) in June 2021. We acquired the dataset from Kaggle. It is unclear who funded the creation of the dataset.

Each observation is a TikTok track. The columns in our dataset include track_id and track_name, artist_id and artist_name, album_id, duration (measured in milliseconds; for example, a duration of 302186 corresponds to a duration_mins of 5.036433), duration_mins (duration measured in minutes), release_date, popularity (integer scale from 1 to 100), danceability (continuous scale from 0 to 1), and energy (continuous scale from 0 to 1). Additionally, columns related to the track musicality include key, loudness, mode, “speechiness”, “instrumentalness”, “acousticness”, liveness, valence (continuous scale from 0 to 1) and tempo. There are 6,746 instances in total. Of these, 3,560 have unique ids.
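Both the unit of duration and the count of unique ids can be checked directly from the data. A minimal sketch on a toy tibble that copies the first three rows shown in the glimpse() output; the same pipeline could be pointed at tiktok_clean. The names `toy`, `unit_check`, and `id_counts` are illustrative.

```r
library(dplyr)

# Toy stand-in for tiktok_clean: first three rows from the glimpse() output
toy <- tibble(
  track_id      = c("6kVuF2PYLuvl9T85XjNbaO", "1RGIjMFMgJxkZHMDXVYzOJ",
                    "1RGIjMFMgJxkZHMDXVYzOJ"),
  duration      = c(302186, 238800, 238800),
  duration_mins = c(5.036433, 3.980000, 3.980000)
)

# If duration is in milliseconds, dividing by 60,000 should reproduce duration_mins
unit_check <- toy |>
  mutate(
    mins_from_ms = duration / 60000,
    matches      = abs(mins_from_ms - duration_mins) < 1e-4
  )

# Total rows versus distinct track ids
id_counts <- toy |>
  summarise(n_rows = n(), n_unique_ids = n_distinct(track_id))
```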

The track creators were not involved in the data collection, which was done by a third party. The data was collected by scraping TikTok and compiling the results into CSV files. The tracks that were trending on TikTok at the time of collection determined which data was recorded. We removed observations that had any null values; the row count stayed at 6,746, so no observations were actually dropped. Then we dropped the genre, playlist_id, and playlist_name columns because they are not relevant to our research questions. We did not have to rename any columns, as they were all already in snake case.
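It can help to record how many rows the null-value filter removes, so the cleaning appendix states the effect of the step rather than only the step itself. A minimal sketch on made-up rows (the toy values and the names `toy_raw`, `na_per_column`, and `rows_dropped` are assumptions for illustration only):

```r
library(dplyr)

# Made-up rows, two of which contain a missing value
toy_raw <- tibble(
  track_name = c("Song A", "Song B", NA),
  popularity = c(28, NA, 75),
  genre      = c("TIKTOK DANCE", "TIKTOK DANCE", "TIKTOK DANCE")
)

# Number of missing values in each column
na_per_column <- toy_raw |>
  summarise(across(everything(), ~ sum(is.na(.x))))

# Same two cleaning steps as in the appendix: drop incomplete rows, then unused columns
toy_clean <- toy_raw |>
  na.omit() |>
  select(-genre)

# How many rows the null-value filter removed
rows_dropped <- nrow(toy_raw) - nrow(toy_clean)
```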

Data limitations

Identify any potential problems with your dataset.

As the creators of the dataset are based in the Philippines, our dataset may reflect the popularity of TikTok tracks in the Philippines and Southeast Asia specifically. Additionally, many tracks appear in repeated observations with identical values in every column. Because the creators provide no documentation, it is unclear what each observation represents. We assume that each observation is a TikTok video and that several videos use the same track, which would explain the repeated observations.
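Repeated observations can be quantified, and optionally collapsed, before modeling. A minimal sketch on made-up rows with hypothetical ids; whether deduplicating is appropriate depends on what an observation really represents, which remains an open question here.

```r
library(dplyr)

# Made-up rows with hypothetical ids; "b2" appears twice with identical values
toy <- tibble(
  track_id   = c("a1", "b2", "b2", "c3"),
  popularity = c(28, 75, 75, 76)
)

# Which track ids appear more than once, and how often?
repeat_counts <- toy |>
  count(track_id, name = "n_rows") |>
  filter(n_rows > 1)

# Keep one copy of each fully identical row
toy_unique <- distinct(toy)
```

Keeping the duplicates weights each track by how often it appears, while `distinct()` weights every track equally, so the two choices answer slightly different questions.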

Exploratory data analysis

Perform an (initial) exploratory data analysis.

# How does the duration of the TikTok video impact its popularity?
tiktok_clean |>
  select(track_name, artist_id, artist_name, duration_mins, popularity) |>
  ggplot(aes(x = duration_mins, y = popularity)) +
  geom_point(alpha = 0.1) +
  labs(
    title = "How does the duration of the TikTok video impact\nits popularity?",
    x =  "Song Duration (mins)",
    y = "Popularity (ranking)"
  )

# What is the association between danceability and energy?
tiktok_clean |>
  select(track_name, artist_id, artist_name, danceability, energy) |>
  ggplot(aes(x = energy, y = danceability)) +
  geom_point(alpha = 0.1) +
  geom_smooth() +
  labs(
    title = "What is the association between danceability and energy?",
    x =  "Energy",
    y = "Danceability"
  )

`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

# simple linear regression of danceability on energy
dance_energy <- linear_reg() |>
  fit(danceability ~ energy, data = tiktok_clean) # the slope estimate is close to 0
tidy(dance_energy)

# A tibble: 2 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)  0.746     0.00624   119.      0    
2 energy      -0.00739   0.00963    -0.767   0.443

glance(dance_energy)

# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df logLik    AIC    BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>
1 0.0000872    -0.0000611 0.138     0.588   0.443     1  3791. -7577. -7557.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

# association between popularity and danceability
tiktok_clean |>
  select(danceability, popularity) |>
  ggplot(aes(x = danceability, y = popularity)) +
  # color is set outside aes() so the points are drawn in blue;
  # inside aes() the string "blue" is treated as a data value, not a color
  geom_point(alpha = 0.5, color = "blue") +
  labs(
    title = "What is the association between danceability and popularity?",
    x =  "Danceability",
    y = "Popularity"
  )

How does the duration of the TikTok video impact its popularity?

Highly ranked tracks span a wide range of durations: tracks as short as about 0.72 minutes and tracks longer than 10 minutes both appear near the top of the charts. The highest density of points falls between about 1.5 and 3 minutes, but songs overall tend to be around 3 minutes long, so we cannot say that this range of durations implies popularity. One plausible interpretation is that songs in this range sit at a “safe” duration that gives the artist enough time for catchy moments. Longer songs may bore listeners and fail to hold their attention, whereas shorter songs may not have enough time for any notable moments within the song or video. Therefore, while duration is worth considering, it is not the only factor in determining a track's success.
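The visual impression of where the points concentrate could be backed by a small summary table. A minimal sketch that bins duration and reports the count and median popularity per bin; the rows and the bin edges (1.5, 3, and 5 minutes) are made up for illustration and are not results from the dataset.

```r
library(dplyr)

# Made-up rows for illustration only
toy <- tibble(
  duration_mins = c(0.72, 1.8, 2.0, 2.7, 3.6, 5.0, 10.4),
  popularity    = c(60, 70, 80, 90, 72, 28, 50)
)

duration_summary <- toy |>
  # right = FALSE gives left-closed bins: [0,1.5), [1.5,3), [3,5), [5,Inf)
  mutate(duration_bin = cut(duration_mins,
                            breaks = c(0, 1.5, 3, 5, Inf),
                            right = FALSE)) |>
  group_by(duration_bin) |>
  summarise(
    n_tracks          = n(),
    median_popularity = median(popularity),
    .groups = "drop"
  )
```

Similar medians across bins with very different counts would support the reading that the dense 1.5 to 3 minute region reflects typical song length rather than higher popularity.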

What is the association between danceability and energy?

We predicted a positive linear relationship between danceability and energy; however, there is almost no association between the two variables (the fitted slope is -0.00739 with a p-value of 0.443, and R-squared is below 0.0001). The discrepancy between our hypothesis and the data lies in our misunderstanding of what “energy” and “danceability” measure. Energy is a perceptual measure of intensity and activity that combines dynamic range, perceived loudness, timbre, onset rate, and general entropy, so a genre like death metal can score very high on this scale. Danceability, in contrast, is calculated from a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. Death metal is therefore very energetic but not necessarily the best to dance to. This graph was insightful and cleared up our confusion.
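The strength of the association can also be expressed as a correlation. In a simple linear regression, R-squared equals the squared Pearson correlation, and the correlation takes the sign of the slope. A minimal sketch that back-computes the correlation from the reported regression output; the second half checks the identity on made-up numbers that are not from the dataset.

```r
# Values copied from the tidy() and glance() output of the fitted model
slope     <- -0.00739
r_squared <- 0.0000872

# r has the sign of the slope and magnitude sqrt(R^2)
pearson_r <- sign(slope) * sqrt(r_squared)

# Check of the identity on made-up numbers (not from the dataset)
x <- c(0.2, 0.4, 0.5, 0.7, 0.9)
y <- c(0.71, 0.64, 0.80, 0.69, 0.75)
identity_holds <- isTRUE(all.equal(cor(x, y)^2, summary(lm(y ~ x))$r.squared))
```

The implied correlation is roughly -0.009, which is consistent with the flat smoothing line in the scatterplot.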

Questions for reviewers

List specific questions for your peer reviewers and project mentor to answer in giving you feedback on this phase.

We would like to know whether this dataset is sufficient, as its creators did not provide a data dictionary and explained little about the columns or how the data was collected.
What specific columns within our dataset do you think are most suitable for analysis?
Which relationships (between columns) are important to explore, and what plots would be helpful?