An Investigation of Song Popularity

Proposal

library(tidyverse)
library(skimr)

Data 1

Introduction and data

  • Identify the source of the data.

    CORGIS, https://think.cs.vt.edu/corgis/csv/video_games/

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    This dataset was created in 2017. The video game playtime information comes from crowd-sourced data on the “How Long to Beat” website, which publishes statistics on how long various video games take to play. The data from “How Long to Beat” appears to be ethically collected because user participation on the website is voluntary.

  • Write a brief description of the observations.

    Each observation is an individual video game. The columns record variables such as specific features of the game, the year it was released, the length of time needed to play specific parts of the game (average, median, rushed, and leisure times), and more.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

    1) What makes a video game popular, and how does length of time played relate to a game’s popularity?

    2) What is the most popular genre in each release year?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

    The research topic explores what makes video games popular and how popularity correlates with length of time played, which sheds light on the gaming tendencies of the 21st-century generation. We hypothesize that the most popular games have the longest playing times.

  • Identify the types of variables in your research question. Categorical? Quantitative?

    To measure a video game's popularity, the quantitative variables “Metrics.Review Score” and “Metrics.Sales” can be used. Similarly, length of time played can be measured with a quantitative variable, either the mean or the median playtime (for example, “Length.All PlayStyles.Average” or “Length.All PlayStyles.Median”). Categorical variables such as “Metadata.Genres” can be used to compare popularity across different genres of games.
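As a rough sketch of how these variables could address the two research questions, the R snippet below computes a rank correlation between median playtime and review score, then finds the best-selling genre in each release year. The column names come from the dataset, but the toy rows are invented for illustration. Two choices are our assumptions rather than settled decisions: treating a playtime of 0 as "not reported", and using total sales as the measure of genre popularity.

```r
library(dplyr)
library(tibble)

# Hypothetical toy rows; every value here is invented for illustration only.
games <- tibble(
  Title = c("Game A", "Game B", "Game C", "Game D", "Game E", "Game F"),
  Metadata.Genres = c("Action", "Action", "Sports", "Sports", "Action", "Sports"),
  Release.Year = c(2007, 2007, 2007, 2008, 2008, 2008),
  `Metrics.Review Score` = c(80, 72, 70, 90, 55, 75),
  Metrics.Sales = c(1.2, 0.3, 0.8, 2.0, 0.1, 0.5),
  `Length.All PlayStyles.Median` = c(20, 8, 10, 30, 0, 12)
)

# Assumption: a median playtime of 0 means no playtime was reported,
# so those games are dropped before looking at the relationship.
played <- games %>%
  filter(`Length.All PlayStyles.Median` > 0)

# Spearman rank correlation between playtime and review score.
# A rank-based measure is used because sales and playtime are right-skewed.
length_vs_score <- cor(
  played$`Length.All PlayStyles.Median`,
  played$`Metrics.Review Score`,
  method = "spearman"
)

# Assumption: "most popular genre" is measured by total sales in a year.
top_genre_by_year <- games %>%
  group_by(Release.Year, Metadata.Genres) %>%
  summarise(total_sales = sum(Metrics.Sales), .groups = "drop") %>%
  group_by(Release.Year) %>%
  slice_max(total_sales, n = 1) %>%
  ungroup()

print(length_vs_score)
print(top_genre_by_year)
```

With the real data, `games` would be replaced by `dataset1`. Review score could be swapped for “Metrics.Sales” in the correlation to compare the two popularity measures.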

Glimpse of data

dataset1 <- read_csv("data/video_games.csv")
Rows: 1212 Columns: 36
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): Title, Metadata.Genres, Metadata.Publishers, Release.Console, Rele...
dbl (25): Features.Max Players, Metrics.Review Score, Metrics.Sales, Metrics...
lgl  (6): Features.Handheld?, Features.Multiplatform?, Features.Online?, Met...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skim(dataset1)
Data summary
Name dataset1
Number of rows 1212
Number of columns 36
_______________________
Column type frequency:
character 5
logical 6
numeric 25
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Title 0 1.00 2 52 0 900 0
Metadata.Genres 0 1.00 6 52 0 48 0
Metadata.Publishers 264 0.78 2 20 0 31 0
Release.Console 0 1.00 4 13 0 5 0
Release.Rating 0 1.00 1 1 0 3 0

Variable type: logical

skim_variable n_missing complete_rate mean count
Features.Handheld? 0 1 1 TRU: 1212
Features.Multiplatform? 0 1 1 TRU: 1212
Features.Online? 0 1 1 TRU: 1212
Metadata.Licensed? 0 1 1 TRU: 1212
Metadata.Sequel? 0 1 1 TRU: 1212
Release.Re-release? 0 1 1 TRU: 1212

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Features.Max Players 0 1 1.66 1.20 1.00 1.00 1.00 2.00 8.00 ▇▁▁▁▁
Metrics.Review Score 0 1 68.83 12.96 19.00 60.00 70.00 79.00 98.00 ▁▂▅▇▂
Metrics.Sales 0 1 0.50 1.07 0.01 0.09 0.21 0.46 14.66 ▇▁▁▁▁
Metrics.Used Price 0 1 17.39 5.02 4.95 14.95 17.95 17.95 49.95 ▂▇▁▁▁
Release.Year 0 1 2006.82 1.05 2004.00 2006.00 2007.00 2008.00 2008.00 ▁▂▅▇▇
Length.All PlayStyles.Average 0 1 13.65 19.40 0.00 3.56 8.86 16.03 279.73 ▇▁▁▁▁
Length.All PlayStyles.Leisure 0 1 26.25 51.60 0.00 4.00 12.00 27.60 476.27 ▇▁▁▁▁
Length.All PlayStyles.Median 0 1 11.23 13.49 0.00 3.02 8.00 13.78 126.00 ▇▁▁▁▁
Length.All PlayStyles.Polled 0 1 44.42 154.84 0.00 1.00 6.00 25.00 2300.00 ▇▁▁▁▁
Length.All PlayStyles.Rushed 0 1 9.40 11.18 0.00 2.60 6.71 11.37 120.20 ▇▁▁▁▁
Length.Completionists.Average 0 1 19.81 46.63 0.00 0.00 6.00 21.55 683.13 ▇▁▁▁▁
Length.Completionists.Leisure 0 1 25.78 61.51 0.00 0.00 6.17 27.12 691.57 ▇▁▁▁▁
Length.Completionists.Median 0 1 18.80 44.04 0.00 0.00 6.00 20.35 683.13 ▇▁▁▁▁
Length.Completionists.Polled 0 1 5.66 19.70 0.00 0.00 1.00 3.00 379.00 ▇▁▁▁▁
Length.Completionists.Rushed 0 1 16.40 40.33 0.00 0.00 5.50 18.38 674.70 ▇▁▁▁▁
Length.Main + Extras.Average 0 1 12.73 23.98 0.00 0.00 7.29 16.11 291.00 ▇▁▁▁▁
Length.Main + Extras.Leisure 0 1 18.87 42.92 0.00 0.00 8.00 21.03 478.93 ▇▁▁▁▁
Length.Main + Extras.Median 0 1 12.10 23.36 0.00 0.00 7.00 15.00 291.00 ▇▁▁▁▁
Length.Main + Extras.Polled 0 1 14.00 57.33 0.00 0.00 1.00 7.00 1100.00 ▇▁▁▁▁
Length.Main + Extras.Rushed 0 1 10.32 20.90 0.00 0.00 6.28 12.94 291.00 ▇▁▁▁▁
Length.Main Story.Average 0 1 8.47 9.69 0.00 0.00 6.57 11.03 72.38 ▇▁▁▁▁
Length.Main Story.Leisure 0 1 11.05 14.09 0.00 0.00 8.00 14.51 135.58 ▇▁▁▁▁
Length.Main Story.Median 0 1 8.28 9.50 0.00 0.00 6.04 10.53 70.00 ▇▁▁▁▁
Length.Main Story.Polled 0 1 24.88 87.38 0.00 0.00 3.00 14.00 1100.00 ▇▁▁▁▁
Length.Main Story.Rushed 0 1 6.97 7.96 0.00 0.00 5.34 9.31 70.00 ▇▁▁▁▁

Data 2

Introduction and data

  • Identify the source of the data.

    Awesome Public Datasets (GitHub): https://github.com/JeffSackmann/tennis_atp/blob/master/atp_matches_2023.csv

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    This data was collected from ATP records by Jeff Sackmann, a GitHub user, in 2023, from the start of the year through March 6. In terms of ethics, ATP records are public information, so this data was ethically collected.

  • Write a brief description of the observations.

    Each observation in this dataset is a match in a tournament. The variables for each match include the tournament name, the competitors' names, statistics on the shots hit during the match, and the outcome of the match. Given the time span covered, this dataset contains ATP matches from the first two months of 2023.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

    1) How is the win percentage of a competitor related to successful first serves?

    2) How does a winner's age relate to their match win percentage?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

    The research topic explores the relationship between competitors' wins and their successful first serves, and possibly a relationship to player age. Because each row is a single match, we would have to calculate each competitor's win percentage from their matches across tournaments. We hypothesize that more successful players make more first serves and that, in relation to age, the most successful players are around the middle of the age range.

  • Identify the types of variables in your research question. Categorical? Quantitative?

    The winning competitor and tournament names are categorical variables, whereas win percentage, first-serve statistics (for example, “w_1stIn”), and age are quantitative variables.
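Since win percentage is not a column in the data, it has to be derived from the match rows. The sketch below shows one possible way to do this: stack the winner and loser sides so that each row is one player in one match, then summarise by player. The column names match the dataset, but the toy matches are invented. Measuring "successful first serves" as first serves in divided by serve points (“w_1stIn” / “w_svpt”) is our assumption about how the question could be made concrete.

```r
library(dplyr)
library(tibble)

# Hypothetical toy matches; names and numbers are invented for illustration.
# The last match has no serve statistics, as happens for some rows in the data.
matches <- tibble(
  winner_name = c("Player A", "Player A", "Player B", "Player C"),
  loser_name  = c("Player B", "Player C", "Player C", "Player A"),
  w_svpt  = c(80, 60, 100, NA),
  w_1stIn = c(56, 36, 55, NA),
  l_svpt  = c(90, 70, 110, NA),
  l_1stIn = c(54, 35, 66, NA)
)

# Reshape to one row per player per match by stacking winners and losers.
player_matches <- bind_rows(
  matches %>%
    transmute(player = winner_name, won = TRUE,
              svpt = w_svpt, first_in = w_1stIn),
  matches %>%
    transmute(player = loser_name, won = FALSE,
              svpt = l_svpt, first_in = l_1stIn)
) %>%
  # Flag rows where both serve columns are present.
  mutate(has_serve = !is.na(svpt) & !is.na(first_in))

player_summary <- player_matches %>%
  group_by(player) %>%
  summarise(
    matches_played = n(),
    win_pct = mean(won),
    # Assumption: first-serve success = first serves in / serve points,
    # computed only over matches that have serve statistics.
    first_serve_in_pct = sum(first_in[has_serve]) / sum(svpt[has_serve]),
    .groups = "drop"
  )

print(player_summary)
```

With the real data, `matches` would be replaced by `dataset2`. Players with very few matches would give unstable percentages, so a minimum on `matches_played` may be worth considering.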

Glimpse of data

dataset2 <- read_csv("data/atp_matches_2023.csv")
Rows: 723 Columns: 49
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (16): tourney_id, tourney_name, surface, tourney_level, winner_seed, win...
dbl (33): draw_size, tourney_date, match_num, winner_id, winner_ht, winner_a...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skim(dataset2)
Data summary
Name dataset2
Number of rows 723
Number of columns 49
_______________________
Column type frequency:
character 16
numeric 33
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
tourney_id 0 1.00 8 32 0 46 0
tourney_name 0 1.00 4 28 0 46 0
surface 0 1.00 4 4 0 2 0
tourney_level 0 1.00 1 1 0 3 0
winner_seed 438 0.39 1 2 0 30 0
winner_entry 630 0.13 1 2 0 5 0
winner_name 0 1.00 8 31 0 198 0
winner_hand 0 1.00 1 1 0 3 0
winner_ioc 0 1.00 3 3 0 60 0
loser_seed 529 0.27 1 2 0 34 0
loser_entry 589 0.19 1 2 0 5 0
loser_name 0 1.00 7 31 0 261 0
loser_hand 2 1.00 1 1 0 3 0
loser_ioc 0 1.00 3 3 0 68 0
score 0 1.00 3 29 0 414 0
round 0 1.00 1 4 0 8 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
draw_size 0 1.00 44.22 39.86 4.0 32.00 32.0 32.00 128.0 ▂▇▁▁▂
tourney_date 0 1.00 20230172.40 51.67 20230102.0 20230116.00 20230204.0 20230213.00 20230227.0 ▅▁▁▁▇
match_num 0 1.00 224.45 99.74 1.0 184.50 277.0 290.00 300.0 ▂▁▁▁▇
winner_id 0 1.00 142151.56 41421.97 100644.0 106330.00 126203.0 200221.00 210234.0 ▇▃▁▁▅
winner_ht 51 0.93 187.11 6.41 170.0 183.00 188.0 191.00 206.0 ▁▅▇▃▁
winner_age 0 1.00 26.62 4.31 18.0 23.90 26.1 28.70 41.4 ▃▇▃▂▁
loser_id 0 1.00 140747.63 41432.55 100644.0 106148.00 124186.0 200221.00 212041.0 ▇▅▁▁▅
loser_ht 94 0.87 186.06 6.40 170.0 183.00 185.0 188.00 206.0 ▁▅▇▃▁
loser_age 5 0.99 27.05 4.42 17.6 24.10 26.5 29.60 43.0 ▂▇▅▂▁
best_of 0 1.00 3.35 0.76 3.0 3.00 3.0 3.00 5.0 ▇▁▁▁▂
minutes 102 0.86 121.60 46.87 0.0 88.00 115.0 148.00 345.0 ▁▇▃▁▁
w_ace 101 0.86 7.68 5.59 0.0 4.00 7.0 10.00 42.0 ▇▃▁▁▁
w_df 101 0.86 2.35 2.04 0.0 1.00 2.0 3.00 14.0 ▇▃▁▁▁
w_svpt 101 0.86 79.77 29.80 14.0 58.25 76.0 95.00 191.0 ▂▇▅▁▁
w_1stIn 101 0.86 51.13 20.16 8.0 37.00 48.0 61.00 128.0 ▂▇▃▁▁
w_1stWon 101 0.86 38.96 14.64 6.0 29.00 36.0 47.00 95.0 ▂▇▅▁▁
w_2ndWon 101 0.86 15.95 6.23 2.0 12.00 15.0 19.75 37.0 ▂▇▅▂▁
w_SvGms 101 0.86 12.89 4.36 2.0 10.00 12.0 15.00 28.0 ▁▇▆▁▁
w_bpSaved 101 0.86 3.36 3.16 0.0 1.00 3.0 5.00 22.0 ▇▂▁▁▁
w_bpFaced 101 0.86 4.84 4.06 0.0 2.00 4.0 7.00 26.0 ▇▃▁▁▁
l_ace 101 0.86 5.80 5.59 0.0 2.00 4.0 8.00 44.0 ▇▂▁▁▁
l_df 101 0.86 3.04 2.53 0.0 1.00 3.0 4.00 25.0 ▇▁▁▁▁
l_svpt 101 0.86 83.25 29.84 12.0 62.00 78.0 100.00 205.0 ▂▇▅▁▁
l_1stIn 101 0.86 52.01 19.89 7.0 37.00 48.0 63.00 143.0 ▂▇▃▁▁
l_1stWon 101 0.86 35.02 15.13 4.0 25.00 32.0 44.00 101.0 ▃▇▃▁▁
l_2ndWon 101 0.86 14.63 6.72 1.0 10.00 14.0 19.00 38.0 ▃▇▆▂▁
l_SvGms 101 0.86 12.67 4.30 2.0 10.00 12.0 15.00 27.0 ▁▇▆▁▁
l_bpSaved 101 0.86 4.67 3.17 0.0 2.00 4.0 6.00 17.0 ▇▆▃▁▁
l_bpFaced 101 0.86 8.32 4.02 0.0 5.00 8.0 11.00 23.0 ▂▇▃▂▁
winner_rank 9 0.99 99.95 176.53 1.0 21.25 56.0 94.75 1594.0 ▇▁▁▁▁
winner_rank_points 9 0.99 1425.10 1397.77 2.0 574.25 832.5 1835.00 6980.0 ▇▂▁▁▁
loser_rank 18 0.98 148.81 243.49 1.0 42.00 75.0 129.00 1859.0 ▇▁▁▁▁
loser_rank_points 18 0.98 929.36 971.21 1.0 435.00 695.0 971.00 6980.0 ▇▁▁▁▁

Data 3

Introduction and data

  • Identify the source of the data.

    CORGIS: https://think.cs.vt.edu/corgis/csv/music/

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    The data is from the Million Song Dataset, a collaboration between Echo Nest and LabROSA (a laboratory that works on intelligent machine listening). The dataset was published in 2011. In terms of ethics, this dataset was collected ethically because the statistics and information about artists and songs are publicly available.

  • Write a brief description of the observations.

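    Each observation in the dataset is a unique song. The full Million Song Dataset contains one million songs, while the CORGIS file used here contains 10,000 of them. Each row has a popularity variable for the song (“song.hotttnesss”) and one for its artist (“artist.hotttnesss”).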

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

    1) What is the relationship between artist popularity, the year the song was released, and the artist’s genre?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

    This research topic explores the popularity of artists, the year they released their songs, and their corresponding genre. We hypothesize that artist popularity has generally increased in more recent years due to the increased accessibility of music. Similarly, different genres of music will peak in different ranges of years based on the popularity trends of the time.

  • Identify the types of variables in your research question. Categorical? Quantitative?

    Artist popularity and the year the song was released are quantitative variables. The artist’s genre is a categorical variable.
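One way these three variables could be brought together is to average artist popularity within each genre and release year. The sketch below uses the dataset's column names (“artist.hotttnesss”, “artist.terms”, “song.year”) with invented toy rows. Treating a “song.year” of 0 as a missing year is our assumption, not something the data source confirms here.

```r
library(dplyr)
library(tibble)

# Hypothetical toy rows; values are invented for illustration only.
songs <- tibble(
  artist.hotttnesss = c(0.40, 0.55, 0.35, 0.60, 0.45, 0.50),
  artist.terms = c("hip hop", "hip hop", "blues-rock",
                   "blues-rock", "hip hop", "blues-rock"),
  song.year = c(1995, 2005, 0, 2005, 2005, 1995)
)

by_genre_year <- songs %>%
  # Assumption: a year of 0 means the release year is unknown.
  mutate(song.year = na_if(song.year, 0)) %>%
  filter(!is.na(song.year)) %>%
  group_by(artist.terms, song.year) %>%
  summarise(
    n_songs = n(),
    mean_hotttnesss = mean(artist.hotttnesss),
    .groups = "drop"
  )

print(by_genre_year)
```

With the real data, `songs` would be replaced by `dataset3`. Grouping years into decades may give more stable averages for genres with few songs per year.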

Glimpse of data

dataset3 <- read_csv("data/music.csv")
Rows: 10000 Columns: 35
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): artist.id, artist.name, artist.terms, song.id
dbl (31): artist.familiarity, artist.hotttnesss, artist.latitude, artist.loc...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skim(dataset3)
Data summary
Name dataset3
Number of rows 10000
Number of columns 35
_______________________
Column type frequency:
character 4
numeric 31
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
artist.id 0 1 18 18 0 3888 0
artist.name 0 1 1 255 0 4412 0
artist.terms 5 1 2 40 0 458 0
song.id 0 1 18 51 0 10000 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
artist.familiarity 0 1 0.57 0.16 0.00 0.47 0.56 0.67 1.00 ▁▂▇▅▂
artist.hotttnesss 0 1 0.39 0.14 0.00 0.33 0.38 0.45 1.08 ▁▇▃▁▁
artist.latitude 0 1 13.90 20.36 -41.28 0.00 0.00 34.42 69.65 ▁▇▁▃▁
artist.location 0 1 0.08 7.80 0.00 0.00 0.00 0.00 780.00 ▇▁▁▁▁
artist.longitude 0 1 -23.92 43.72 -162.44 -73.95 0.00 0.00 174.77 ▁▂▇▁▁
artist.similar 0 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ▁▁▇▁▁
artist.terms_freq 0 1 224.89 22392.16 0.00 0.95 1.00 1.00 2239217.00 ▇▁▁▁▁
release.id 0 1 371024.06 236777.83 0.00 172858.00 333103.00 573532.50 823599.00 ▇▇▅▆▅
release.name 0 1 23.10 1322.90 0.00 0.00 0.00 0.00 85555.00 ▇▁▁▁▁
song.artist_mbtags 0 1 0.00 0.00 0.00 0.00 0.00 0.00 0.33 ▇▁▁▁▁
song.artist_mbtags_count 0 1 0.52 0.88 0.00 0.00 0.00 1.00 9.00 ▇▁▁▁▁
song.bars_confidence 0 1 0.24 0.29 0.00 0.04 0.12 0.35 8.86 ▇▁▁▁▁
song.bars_start 0 1 1.07 1.72 0.00 0.44 0.79 1.22 59.74 ▇▁▁▁▁
song.beats_confidence 0 1 0.61 0.32 0.00 0.41 0.69 0.88 1.00 ▃▂▃▆▇
song.beats_start 0 1 0.43 0.81 -60.00 0.19 0.33 0.50 12.25 ▁▁▁▁▇
song.duration 0 1 240.62 246.08 1.04 176.03 223.06 276.38 22050.00 ▇▁▁▁▁
song.end_of_fade_in 0 1 0.76 1.86 0.00 0.00 0.20 0.42 43.12 ▇▁▁▁▁
song.hotttnesss 0 1 -0.24 0.69 -1.00 -1.00 0.00 0.41 1.00 ▇▁▃▆▂
song.key 0 1 5.37 9.67 0.00 2.00 5.00 8.00 904.80 ▇▁▁▁▁
song.key_confidence 0 1 0.45 0.33 0.00 0.22 0.47 0.66 19.08 ▇▁▁▁▁
song.loudness 0 1 -10.48 5.40 -51.64 -13.16 -9.38 -6.53 0.57 ▁▁▁▆▇
song.mode 0 1 0.69 0.46 0.00 0.00 1.00 1.00 1.00 ▃▁▁▁▇
song.mode_confidence 0 1 0.48 0.19 0.00 0.36 0.49 0.61 1.00 ▂▅▇▅▁
song.start_of_fade_out 0 1 229.88 112.02 -21.39 168.86 213.86 266.27 1813.43 ▇▁▁▁▁
song.tatums_confidence 0 1 0.51 0.33 0.00 0.24 0.50 0.77 9.23 ▇▁▁▁▁
song.tatums_start 0 1 0.30 0.51 0.00 0.11 0.19 0.29 12.25 ▇▁▁▁▁
song.tempo 0 1 122.90 35.20 0.00 96.96 120.16 144.01 262.83 ▁▆▇▂▁
song.time_signature 0 1 3.56 1.27 0.00 3.00 4.00 4.00 7.00 ▂▁▇▁▁
song.time_signature_confidence 0 1 0.60 8.99 0.00 0.10 0.55 0.86 898.89 ▇▁▁▁▁
song.title 0 1 10.01 945.49 0.00 0.00 0.00 0.00 94496.00 ▇▁▁▁▁
song.year 0 1 934.70 996.65 0.00 0.00 0.00 2000.00 2010.00 ▇▁▁▁▇
dataset3
# A tibble: 10,000 × 35
   artist.familiarity artist.hotttnesss artist.id          artist.latitude
                <dbl>             <dbl> <chr>                        <dbl>
 1              0.582             0.402 ARD7TVE1187B99BFB1             0  
 2              0.631             0.417 ARMJAGH1187FB546F3            35.1
 3              0.487             0.343 ARKRRTF1187B9984DA             0  
 4              0.630             0.454 AR7G5I41187FB4CE6C             0  
 5              0.651             0.402 ARXR32B1187FB57099             0  
 6              0.535             0.385 ARKFYS91187B98E58F             0  
 7              0.556             0.262 ARD0S291187B9B7BF5             0  
 8              0.801             0.606 AR10USD1187B99F3F1             0  
 9              0.427             0.332 AR8ZCNI1187B9A069B             0  
10              0.551             0.423 ARNTLGG11E2835DDB9             0  
# ℹ 9,990 more rows
# ℹ 31 more variables: artist.location <dbl>, artist.longitude <dbl>,
#   artist.name <chr>, artist.similar <dbl>, artist.terms <chr>,
#   artist.terms_freq <dbl>, release.id <dbl>, release.name <dbl>,
#   song.artist_mbtags <dbl>, song.artist_mbtags_count <dbl>,
#   song.bars_confidence <dbl>, song.bars_start <dbl>,
#   song.beats_confidence <dbl>, song.beats_start <dbl>, song.duration <dbl>, …