An Investigation of Song Popularity

Proposal

library(tidyverse)
library(skimr)

Data 1

Introduction and data

Identify the source of the data.

CORGIS, https://think.cs.vt.edu/corgis/csv/video_games/
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

This dataset was created in 2017. The play-time information was crowd-sourced from the “How Long to Beat” website, which reports how long players take to finish various video games. Because user participation on the website is voluntary, the data appear to have been collected ethically.
Write a brief description of the observations.

The observations are individual video games. The columns record variables such as game features, release year, and the length of time needed to play specific parts of the game (average, fastest, slowest).

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

1) What makes a video game popular, and how does length of time played relate to a game’s popularity?

2) What is the most popular genre according to release year?
A description of the research topic along with a concise statement of your hypotheses on this topic.

The research topic explores the popularity of video games and how it correlates with the length of time played, which sheds light on the gaming tendencies of the 21st-century generation. We hypothesize that the most popular games have the longest playing times.
Identify the types of variables in your research question. Categorical? Quantitative?

To measure the popularity of a video game, we can use the quantitative variables “Metrics.Review Score” and “Metrics.Sales”. For length of time played, we can use either the mean or the median play time (for example, “Length.All PlayStyles.Average” or “Length.All PlayStyles.Median”), both of which are quantitative. Categorical variables such as “Metadata.Genres” can be used to compare popularity across different genres of games.
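As a rough sketch of how the first research question might be approached, the chunk below computes rank correlations between the two popularity measures and median play length. It runs on a small made-up tibble (`toy_games`, a hypothetical stand-in for `dataset1`) so that it is self-contained; only the column names are taken from the data. Using Spearman rather than Pearson correlation is an assumption on our part, motivated by the strong right skew in sales and play length.

```r
library(dplyr)

# Hypothetical toy rows standing in for dataset1; all values are made up.
toy_games <- tibble::tibble(
  `Metrics.Review Score` = c(60, 70, 80, 90, 95),
  Metrics.Sales = c(0.1, 0.3, 0.2, 1.5, 4.0),
  `Length.All PlayStyles.Median` = c(3, 8, 6, 20, 40)
)

# Rank-based (Spearman) correlation of each popularity measure
# with median play length.
popularity_cor <- toy_games %>%
  summarise(
    score_vs_length = cor(`Metrics.Review Score`,
                          `Length.All PlayStyles.Median`,
                          method = "spearman"),
    sales_vs_length = cor(Metrics.Sales,
                          `Length.All PlayStyles.Median`,
                          method = "spearman")
  )

popularity_cor
```

On the real data, `toy_games` would be replaced by `dataset1`. Games whose play length is recorded as 0 may need separate handling, since a 0 could mean that no play time was reported rather than a true zero.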

Glimpse of data

dataset1 <- read_csv("data/video_games.csv")

Rows: 1212 Columns: 36
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): Title, Metadata.Genres, Metadata.Publishers, Release.Console, Rele...
dbl (25): Features.Max Players, Metrics.Review Score, Metrics.Sales, Metrics...
lgl  (6): Features.Handheld?, Features.Multiplatform?, Features.Online?, Met...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

skim(dataset1)

Data summary
Name	dataset1
Number of rows	1212
Number of columns	36
_______________________
Column type frequency:
character	5
logical	6
numeric	25
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
Title	0	1.00	2	52	900
Metadata.Genres	0	1.00	6	52	48
Metadata.Publishers	264	0.78	2	20	31
Release.Console	0	1.00	4	13	5
Release.Rating	0	1.00	1	1	3

Variable type: logical

skim_variable	complete_rate	mean	count
Features.Handheld?	1	1	TRU: 1212
Features.Multiplatform?	1	1	TRU: 1212
Features.Online?	1	1	TRU: 1212
Metadata.Licensed?	1	1	TRU: 1212
Metadata.Sequel?	1	1	TRU: 1212
Release.Re-release?	1	1	TRU: 1212

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Features.Max Players	1	1.66	1.20	1.00	1.00	1.00	2.00	8.00	▇▁▁▁▁
Metrics.Review Score	1	68.83	12.96	19.00	60.00	70.00	79.00	98.00	▁▂▅▇▂
Metrics.Sales	1	0.50	1.07	0.01	0.09	0.21	0.46	14.66	▇▁▁▁▁
Metrics.Used Price	1	17.39	5.02	4.95	14.95	17.95	17.95	49.95	▂▇▁▁▁
Release.Year	1	2006.82	1.05	2004.00	2006.00	2007.00	2008.00	2008.00	▁▂▅▇▇
Length.All PlayStyles.Average	1	13.65	19.40	0.00	3.56	8.86	16.03	279.73	▇▁▁▁▁
Length.All PlayStyles.Leisure	1	26.25	51.60	0.00	4.00	12.00	27.60	476.27	▇▁▁▁▁
Length.All PlayStyles.Median	1	11.23	13.49	0.00	3.02	8.00	13.78	126.00	▇▁▁▁▁
Length.All PlayStyles.Polled	1	44.42	154.84	0.00	1.00	6.00	25.00	2300.00	▇▁▁▁▁
Length.All PlayStyles.Rushed	1	9.40	11.18	0.00	2.60	6.71	11.37	120.20	▇▁▁▁▁
Length.Completionists.Average	1	19.81	46.63	0.00	0.00	6.00	21.55	683.13	▇▁▁▁▁
Length.Completionists.Leisure	1	25.78	61.51	0.00	0.00	6.17	27.12	691.57	▇▁▁▁▁
Length.Completionists.Median	1	18.80	44.04	0.00	0.00	6.00	20.35	683.13	▇▁▁▁▁
Length.Completionists.Polled	1	5.66	19.70	0.00	0.00	1.00	3.00	379.00	▇▁▁▁▁
Length.Completionists.Rushed	1	16.40	40.33	0.00	0.00	5.50	18.38	674.70	▇▁▁▁▁
Length.Main + Extras.Average	1	12.73	23.98	0.00	0.00	7.29	16.11	291.00	▇▁▁▁▁
Length.Main + Extras.Leisure	1	18.87	42.92	0.00	0.00	8.00	21.03	478.93	▇▁▁▁▁
Length.Main + Extras.Median	1	12.10	23.36	0.00	0.00	7.00	15.00	291.00	▇▁▁▁▁
Length.Main + Extras.Polled	1	14.00	57.33	0.00	0.00	1.00	7.00	1100.00	▇▁▁▁▁
Length.Main + Extras.Rushed	1	10.32	20.90	0.00	0.00	6.28	12.94	291.00	▇▁▁▁▁
Length.Main Story.Average	1	8.47	9.69	0.00	0.00	6.57	11.03	72.38	▇▁▁▁▁
Length.Main Story.Leisure	1	11.05	14.09	0.00	0.00	8.00	14.51	135.58	▇▁▁▁▁
Length.Main Story.Median	1	8.28	9.50	0.00	0.00	6.04	10.53	70.00	▇▁▁▁▁
Length.Main Story.Polled	1	24.88	87.38	0.00	0.00	3.00	14.00	1100.00	▇▁▁▁▁
Length.Main Story.Rushed	1	6.97	7.96	0.00	0.00	5.34	9.31	70.00	▇▁▁▁▁

Data 2

Introduction and data

Identify the source of the data.

Awesome Public Datasets (GitHub): https://github.com/JeffSackmann/tennis_atp/blob/master/atp_matches_2023.csv
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

Jeff Sackmann, a GitHub user, collected this data from ATP records in 2023; it covers the start of the year through March 6. Because ATP records are public information, this data was ethically collected.
Write a brief description of the observations.

Each observation in this dataset is a match in a tournament. Each match has variables such as the tournament name, the competitors' names, statistics on the shots hit during the match, and the match outcome. Given its date range, the dataset covers ATP matches from roughly the first two months of 2023.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

1) How is the win percentage of a competitor related to successful first serves?

2) How does a winner’s age relate to their win percentage?
A description of the research topic along with a concise statement of your hypotheses on this topic.

The research topic explores the relationship between competitors’ wins and their successful first serves, and possibly a relationship with player age. Because each row is a single match, we would have to calculate win percentage ourselves from the match results in each tournament. We hypothesize that more successful players make more first serves and that, with respect to age, the most successful players fall around the middle of the age range.
Identify the types of variables in your research question. Categorical? Quantitative?

The winning competitor and tournament names are categorical variables, whereas win percentage, first-serve counts, and age are quantitative variables.
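Since win percentage is not a column in the data, one possible way to derive it is sketched below. The chunk stacks the winner and loser sides of each match into one row per player per match, then summarises per player. It uses a made-up tibble (`toy_matches`, hypothetical) in place of `dataset2`; the column names match the data, and we assume that "successful first serves" means first serves in (`w_1stIn`, `l_1stIn`) as a share of service points (`w_svpt`, `l_svpt`).

```r
library(dplyr)

# Hypothetical toy matches standing in for dataset2; names and counts are made up.
toy_matches <- tibble::tibble(
  winner_name = c("A", "A", "B", "C"),
  loser_name  = c("B", "C", "C", "A"),
  w_svpt  = c(80, 60, 100, 50),
  w_1stIn = c(56, 42, 60, 30),
  l_svpt  = c(90, 70, 80, 100),
  l_1stIn = c(45, 35, 40, 70)
)

# Stack winners and losers so each row is one player's side of one match.
player_matches <- bind_rows(
  toy_matches %>%
    transmute(player = winner_name, won = TRUE,
              svpt = w_svpt, first_in = w_1stIn),
  toy_matches %>%
    transmute(player = loser_name, won = FALSE,
              svpt = l_svpt, first_in = l_1stIn)
)

# Per player: share of matches won, and share of service points
# on which the first serve went in.
player_summary <- player_matches %>%
  group_by(player) %>%
  summarise(
    matches = n(),
    win_pct = mean(won),
    first_serve_in_pct = sum(first_in, na.rm = TRUE) / sum(svpt, na.rm = TRUE),
    .groups = "drop"
  )

player_summary
```

With only about two months of matches, many players will have very few rows, so a minimum-matches filter (for example `filter(matches >= 5)`, with the threshold being our own choice) may be needed before comparing win percentages.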

Glimpse of data

dataset2 <- read_csv("data/atp_matches_2023.csv")

Rows: 723 Columns: 49
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (16): tourney_id, tourney_name, surface, tourney_level, winner_seed, win...
dbl (33): draw_size, tourney_date, match_num, winner_id, winner_ht, winner_a...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

skim(dataset2)

Data summary
Name	dataset2
Number of rows	723
Number of columns	49
_______________________
Column type frequency:
character	16
numeric	33
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
tourney_id	0	1.00	8	32	46
tourney_name	0	1.00	4	28	46
surface	0	1.00	4	4	2
tourney_level	0	1.00	1	1	3
winner_seed	438	0.39	1	2	30
winner_entry	630	0.13	1	2	5
winner_name	0	1.00	8	31	198
winner_hand	0	1.00	1	1	3
winner_ioc	0	1.00	3	3	60
loser_seed	529	0.27	1	2	34
loser_entry	589	0.19	1	2	5
loser_name	0	1.00	7	31	261
loser_hand	2	1.00	1	1	3
loser_ioc	0	1.00	3	3	68
score	0	1.00	3	29	414
round	0	1.00	1	4	8

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
draw_size	0	1.00	44.22	39.86	4.0	32.00	32.0	32.00	128.0	▂▇▁▁▂
tourney_date	0	1.00	20230172.40	51.67	20230102.0	20230116.00	20230204.0	20230213.00	20230227.0	▅▁▁▁▇
match_num	0	1.00	224.45	99.74	1.0	184.50	277.0	290.00	300.0	▂▁▁▁▇
winner_id	0	1.00	142151.56	41421.97	100644.0	106330.00	126203.0	200221.00	210234.0	▇▃▁▁▅
winner_ht	51	0.93	187.11	6.41	170.0	183.00	188.0	191.00	206.0	▁▅▇▃▁
winner_age	0	1.00	26.62	4.31	18.0	23.90	26.1	28.70	41.4	▃▇▃▂▁
loser_id	0	1.00	140747.63	41432.55	100644.0	106148.00	124186.0	200221.00	212041.0	▇▅▁▁▅
loser_ht	94	0.87	186.06	6.40	170.0	183.00	185.0	188.00	206.0	▁▅▇▃▁
loser_age	5	0.99	27.05	4.42	17.6	24.10	26.5	29.60	43.0	▂▇▅▂▁
best_of	0	1.00	3.35	0.76	3.0	3.00	3.0	3.00	5.0	▇▁▁▁▂
minutes	102	0.86	121.60	46.87	0.0	88.00	115.0	148.00	345.0	▁▇▃▁▁
w_ace	101	0.86	7.68	5.59	0.0	4.00	7.0	10.00	42.0	▇▃▁▁▁
w_df	101	0.86	2.35	2.04	0.0	1.00	2.0	3.00	14.0	▇▃▁▁▁
w_svpt	101	0.86	79.77	29.80	14.0	58.25	76.0	95.00	191.0	▂▇▅▁▁
w_1stIn	101	0.86	51.13	20.16	8.0	37.00	48.0	61.00	128.0	▂▇▃▁▁
w_1stWon	101	0.86	38.96	14.64	6.0	29.00	36.0	47.00	95.0	▂▇▅▁▁
w_2ndWon	101	0.86	15.95	6.23	2.0	12.00	15.0	19.75	37.0	▂▇▅▂▁
w_SvGms	101	0.86	12.89	4.36	2.0	10.00	12.0	15.00	28.0	▁▇▆▁▁
w_bpSaved	101	0.86	3.36	3.16	0.0	1.00	3.0	5.00	22.0	▇▂▁▁▁
w_bpFaced	101	0.86	4.84	4.06	0.0	2.00	4.0	7.00	26.0	▇▃▁▁▁
l_ace	101	0.86	5.80	5.59	0.0	2.00	4.0	8.00	44.0	▇▂▁▁▁
l_df	101	0.86	3.04	2.53	0.0	1.00	3.0	4.00	25.0	▇▁▁▁▁
l_svpt	101	0.86	83.25	29.84	12.0	62.00	78.0	100.00	205.0	▂▇▅▁▁
l_1stIn	101	0.86	52.01	19.89	7.0	37.00	48.0	63.00	143.0	▂▇▃▁▁
l_1stWon	101	0.86	35.02	15.13	4.0	25.00	32.0	44.00	101.0	▃▇▃▁▁
l_2ndWon	101	0.86	14.63	6.72	1.0	10.00	14.0	19.00	38.0	▃▇▆▂▁
l_SvGms	101	0.86	12.67	4.30	2.0	10.00	12.0	15.00	27.0	▁▇▆▁▁
l_bpSaved	101	0.86	4.67	3.17	0.0	2.00	4.0	6.00	17.0	▇▆▃▁▁
l_bpFaced	101	0.86	8.32	4.02	0.0	5.00	8.0	11.00	23.0	▂▇▃▂▁
winner_rank	9	0.99	99.95	176.53	1.0	21.25	56.0	94.75	1594.0	▇▁▁▁▁
winner_rank_points	9	0.99	1425.10	1397.77	2.0	574.25	832.5	1835.00	6980.0	▇▂▁▁▁
loser_rank	18	0.98	148.81	243.49	1.0	42.00	75.0	129.00	1859.0	▇▁▁▁▁
loser_rank_points	18	0.98	929.36	971.21	1.0	435.00	695.0	971.00	6980.0	▇▁▁▁▁

Data 3

Introduction and data

Identify the source of the data.

CORGIS: https://think.cs.vt.edu/corgis/csv/music/
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

The data comes from a library called the Million Song Dataset, a collaboration between Echo Nest and LabROSA (a laboratory that works on intelligent machine listening). The dataset was published in 2011. In terms of ethics, this dataset was collected ethically because the statistics and information about artists and songs are publicly available.
Write a brief description of the observations.

Each observation in the dataset is a unique song. The full Million Song Dataset contains one million songs, while the file loaded below has 10,000 rows. Each row includes a variable for the song’s popularity and one for its artist’s popularity.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

1) What is the relationship between artist popularity, the year the song was released, and the artists’ genre?
A description of the research topic along with a concise statement of your hypotheses on this topic.

This research topic explores the popularity of artists, the year they released their songs, and their corresponding genre. We hypothesize that artist popularity has generally increased in more recent years because music has become more accessible. Similarly, we expect different genres of music to peak in different ranges of years, following the popularity trends of the time.
Identify the types of variables in your research question. Categorical? Quantitative?

Artist popularity and the year the song was released are quantitative variables. The artist’s genre is a categorical variable.
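The summary of `song.year` below has a median of 0, so at least half of the songs appear to have no recorded release year. The sketch below shows one way this might be handled before relating popularity to year and genre. It treats a year of 0 as missing (an assumption on our part), uses `artist.terms` as the genre label and `artist.hotttnesss` as artist popularity (also assumptions), and runs on a made-up tibble (`toy_songs`, hypothetical) rather than `dataset3`.

```r
library(dplyr)

# Hypothetical toy rows standing in for dataset3; all values are made up.
toy_songs <- tibble::tibble(
  artist.terms = c("hip hop", "hip hop", "blues-rock", "blues-rock", "hip hop"),
  artist.hotttnesss = c(0.40, 0.50, 0.30, 0.35, 0.60),
  song.year = c(0, 2005, 1998, 0, 2007)
)

hotness_by_decade <- toy_songs %>%
  # Assumption: a year of 0 means the release year is unknown.
  mutate(song.year = na_if(song.year, 0)) %>%
  filter(!is.na(song.year)) %>%
  # Bin years into decades so that sparse years are pooled.
  mutate(decade = floor(song.year / 10) * 10) %>%
  group_by(artist.terms, decade) %>%
  summarise(
    songs = n(),
    mean_hotness = mean(artist.hotttnesss),
    .groups = "drop"
  )

hotness_by_decade
```

Grouping by decade rather than by single year is a design choice to keep group sizes reasonable; with 458 distinct `artist.terms` values, the genres would likely also need to be collapsed into broader categories.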

Glimpse of data

dataset3 <- read_csv("data/music.csv")

Rows: 10000 Columns: 35
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): artist.id, artist.name, artist.terms, song.id
dbl (31): artist.familiarity, artist.hotttnesss, artist.latitude, artist.loc...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

skim(dataset3)

Data summary
Name	dataset3
Number of rows	10000
Number of columns	35
_______________________
Column type frequency:
character	4
numeric	31
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
artist.id	0	1	18	18	3888
artist.name	0	1	1	255	4412
artist.terms	5	1	2	40	458
song.id	0	1	18	51	10000

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
artist.familiarity	1	0.57	0.16	0.00	0.47	0.56	0.67	1.00	▁▂▇▅▂
artist.hotttnesss	1	0.39	0.14	0.00	0.33	0.38	0.45	1.08	▁▇▃▁▁
artist.latitude	1	13.90	20.36	-41.28	0.00	0.00	34.42	69.65	▁▇▁▃▁
artist.location	1	0.08	7.80	0.00	0.00	0.00	0.00	780.00	▇▁▁▁▁
artist.longitude	1	-23.92	43.72	-162.44	-73.95	0.00	0.00	174.77	▁▂▇▁▁
artist.similar	1	0.00	0.00	0.00	0.00	0.00	0.00	0.00	▁▁▇▁▁
artist.terms_freq	1	224.89	22392.16	0.00	0.95	1.00	1.00	2239217.00	▇▁▁▁▁
release.id	1	371024.06	236777.83	0.00	172858.00	333103.00	573532.50	823599.00	▇▇▅▆▅
release.name	1	23.10	1322.90	0.00	0.00	0.00	0.00	85555.00	▇▁▁▁▁
song.artist_mbtags	1	0.00	0.00	0.00	0.00	0.00	0.00	0.33	▇▁▁▁▁
song.artist_mbtags_count	1	0.52	0.88	0.00	0.00	0.00	1.00	9.00	▇▁▁▁▁
song.bars_confidence	1	0.24	0.29	0.00	0.04	0.12	0.35	8.86	▇▁▁▁▁
song.bars_start	1	1.07	1.72	0.00	0.44	0.79	1.22	59.74	▇▁▁▁▁
song.beats_confidence	1	0.61	0.32	0.00	0.41	0.69	0.88	1.00	▃▂▃▆▇
song.beats_start	1	0.43	0.81	-60.00	0.19	0.33	0.50	12.25	▁▁▁▁▇
song.duration	1	240.62	246.08	1.04	176.03	223.06	276.38	22050.00	▇▁▁▁▁
song.end_of_fade_in	1	0.76	1.86	0.00	0.00	0.20	0.42	43.12	▇▁▁▁▁
song.hotttnesss	1	-0.24	0.69	-1.00	-1.00	0.00	0.41	1.00	▇▁▃▆▂
song.key	1	5.37	9.67	0.00	2.00	5.00	8.00	904.80	▇▁▁▁▁
song.key_confidence	1	0.45	0.33	0.00	0.22	0.47	0.66	19.08	▇▁▁▁▁
song.loudness	1	-10.48	5.40	-51.64	-13.16	-9.38	-6.53	0.57	▁▁▁▆▇
song.mode	1	0.69	0.46	0.00	0.00	1.00	1.00	1.00	▃▁▁▁▇
song.mode_confidence	1	0.48	0.19	0.00	0.36	0.49	0.61	1.00	▂▅▇▅▁
song.start_of_fade_out	1	229.88	112.02	-21.39	168.86	213.86	266.27	1813.43	▇▁▁▁▁
song.tatums_confidence	1	0.51	0.33	0.00	0.24	0.50	0.77	9.23	▇▁▁▁▁
song.tatums_start	1	0.30	0.51	0.00	0.11	0.19	0.29	12.25	▇▁▁▁▁
song.tempo	1	122.90	35.20	0.00	96.96	120.16	144.01	262.83	▁▆▇▂▁
song.time_signature	1	3.56	1.27	0.00	3.00	4.00	4.00	7.00	▂▁▇▁▁
song.time_signature_confidence	1	0.60	8.99	0.00	0.10	0.55	0.86	898.89	▇▁▁▁▁
song.title	1	10.01	945.49	0.00	0.00	0.00	0.00	94496.00	▇▁▁▁▁
song.year	1	934.70	996.65	0.00	0.00	0.00	2000.00	2010.00	▇▁▁▁▇

dataset3

# A tibble: 10,000 × 35
   artist.familiarity artist.hotttnesss artist.id          artist.latitude
                <dbl>             <dbl> <chr>                        <dbl>
 1              0.582             0.402 ARD7TVE1187B99BFB1             0  
 2              0.631             0.417 ARMJAGH1187FB546F3            35.1
 3              0.487             0.343 ARKRRTF1187B9984DA             0  
 4              0.630             0.454 AR7G5I41187FB4CE6C             0  
 5              0.651             0.402 ARXR32B1187FB57099             0  
 6              0.535             0.385 ARKFYS91187B98E58F             0  
 7              0.556             0.262 ARD0S291187B9B7BF5             0  
 8              0.801             0.606 AR10USD1187B99F3F1             0  
 9              0.427             0.332 AR8ZCNI1187B9A069B             0  
10              0.551             0.423 ARNTLGG11E2835DDB9             0  
# ℹ 9,990 more rows
# ℹ 31 more variables: artist.location <dbl>, artist.longitude <dbl>,
#   artist.name <chr>, artist.similar <dbl>, artist.terms <chr>,
#   artist.terms_freq <dbl>, release.id <dbl>, release.name <dbl>,
#   song.artist_mbtags <dbl>, song.artist_mbtags_count <dbl>,
#   song.bars_confidence <dbl>, song.bars_start <dbl>,
#   song.beats_confidence <dbl>, song.beats_start <dbl>, song.duration <dbl>, …