How Song Characteristics Can Affect Song Popularity

Proposal

library(tidyverse)
library(skimr)

Data 1

Introduction and data

The data comes from the CORGIS Dataset Project: https://think.cs.vt.edu/corgis/csv/coffee/
The dataset was created by Sam Donald on 10/28/2022. Professional coffee testers from many countries rated each coffee on a 0-100 scale, scoring characteristics such as acidity and sweetness.
The variables cover where each coffee was produced, its total score, and the individual characteristic scores (acidity, sweetness, and so on) that make up the total.

Research question

How do the flavor, acidity, and aroma of coffee compare across countries?
As college students, we drink a large amount of coffee to make up for pulling all-nighters, so understanding the factors that contribute to the appeal of coffee matters to us. This information is also useful for farmers, who can use it to grow more popular flavors of coffee.
Our research topic explores how coffee flavor varies and how that variation relates to where the coffee is produced. We hypothesize that countries in the Americas have sweeter coffee.
Region and coffee species are categorical, while all of the flavor scores are floats and therefore quantitative.
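
As a rough sketch of how the sweetness hypothesis could be checked, the snippet below compares mean sweetness between countries in the Americas and all other countries. It uses the column names `Location.Country` and `Data.Scores.Sweetness` from the dataset, but the rows are invented for illustration, and the `americas` lookup and the `region_group` column are our own assumptions rather than part of the data. It uses base R only, so it runs without extra packages.

```r
# Toy stand-in for data/coffee.csv; these values are invented for illustration.
coffee_toy <- data.frame(
  Location.Country = c("Brazil", "Brazil", "Colombia",
                       "Ethiopia", "Ethiopia", "Kenya"),
  Data.Scores.Sweetness = c(10.00, 9.33, 10.00, 10.00, 8.67, 9.33)
)

# Hypothetical lookup: which countries count as "Americas" is our assumption.
americas <- c("Brazil", "Colombia")
coffee_toy$region_group <- ifelse(
  coffee_toy$Location.Country %in% americas, "Americas", "Other"
)

# Mean sweetness score for each group
sweetness_by_group <- aggregate(
  Data.Scores.Sweetness ~ region_group,
  data = coffee_toy,
  FUN = mean
)
print(sweetness_by_group)
```

With the real data, the same grouping would need a complete country-to-region lookup, and a difference in means alone would not establish that the gap is meaningful.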

Glimpse of data

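# Read the coffee data and summarize every column with skimr.
# Note: `coffee` stores the skim() summary, not the raw data frame.
coffee <- read.csv("data/coffee.csv") |>
  skimr::skim()
coffee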

Data summary
Name	read.csv(“data/coffee.csv…
Number of rows	989
Number of columns	23
_______________________
Column type frequency:
character	7
numeric	16
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
Location.Country	1	4	28	32
Location.Region	1	3	76	278
Data.Owner	1	3	50	263
Data.Type.Species	1	7	7	2
Data.Type.Variety	1	3	21	28
Data.Type.Processing.method	1	3	25	6
Data.Color	1	4	12	5

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Location.Altitude.Min	1	1640.08	9192.52	0	905.00	1300.00	1550.00	190164.00	▇▁▁▁▁
Location.Altitude.Max	1	1675.93	9191.96	0	950.00	1310.00	1600.00	190164.00	▇▁▁▁▁
Location.Altitude.Average	1	1658.00	9192.06	0	950.00	1300.00	1600.00	190164.00	▇▁▁▁▁
Year	1	2013.55	1.66	2010	2012.00	2013.00	2015.00	2018.00	▁▇▃▃▁
Data.Production.Number.of.bags	1	151.76	125.67	1	15.00	170.00	275.00	600.00	▇▁▇▁▁
Data.Production.Bag.weight	1	210.49	1666.71	0	1.00	60.00	69.00	19200.00	▇▁▁▁▁
Data.Scores.Aroma	1	7.57	0.40	0	7.42	7.58	7.75	8.75	▁▁▁▁▇
Data.Scores.Flavor	1	7.52	0.42	0	7.33	7.50	7.75	8.83	▁▁▁▁▇
Data.Scores.Aftertaste	1	7.39	0.43	0	7.25	7.42	7.58	8.67	▁▁▁▁▇
Data.Scores.Acidity	1	7.54	0.40	0	7.33	7.58	7.75	8.75	▁▁▁▁▇
Data.Scores.Body	1	7.51	0.39	0	7.33	7.50	7.67	8.50	▁▁▁▁▇
Data.Scores.Balance	1	7.50	0.43	0	7.33	7.50	7.75	8.58	▁▁▁▁▇
Data.Scores.Uniformity	1	9.82	0.59	0	10.00	10.00	10.00	10.00	▁▁▁▁▇
Data.Scores.Sweetness	1	9.83	0.69	0	10.00	10.00	10.00	10.00	▁▁▁▁▇
Data.Scores.Moisture	1	0.09	0.04	0	0.10	0.11	0.12	0.28	▃▇▆▁▁
Data.Scores.Total	1	81.97	3.86	0	81.08	82.50	83.58	90.58	▁▁▁▁▇

Data 2

Introduction and data

The data comes from the CORGIS Dataset Project: https://think.cs.vt.edu/corgis/csv/music/. We believe this data was ethically sourced.
The data is drawn from the Million Song Dataset, created in 2011.
The dataset contains information on artists (e.g. location, demographics, and popularity) and on each artist's songs (e.g. title, year, length, and tempo in bpm).

Research question

How do song characteristics (e.g. loudness, tempo, length) relate to song and artist popularity?
College students are one of the largest demographics of music listeners. We want to study the relationship between song characteristics and popularity, and whether some aspects of a song or its artist are more influential than others.
Artist name, artist location, and song title are categorical. Song length, song hotttnesss, and artist hotttnesss are quantitative. We hypothesize that American artists with shorter songs and a higher tempo (bpm) have the highest song and artist hotttnesss scores.
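
One possible first look at this hypothesis is a pair of correlations between song hotttnesss and tempo or duration, sketched below in base R. The column names `song.tempo`, `song.duration`, and `song.hotttnesss` come from the dataset, but the rows are invented for illustration. Treating a hotttnesss of -1 as a missing-value code is our assumption and is not confirmed by the data documentation.

```r
# Toy stand-in for data/music.csv; these values are invented for illustration.
music_toy <- data.frame(
  song.tempo      = c(90, 120, 140, 100, 160, 110),
  song.duration   = c(300, 240, 200, 280, 180, 260),
  song.hotttnesss = c(0.2, 0.5, 0.7, -1, 0.9, 0.4)
)

# Assumption: a hotttnesss of -1 marks a song with no score, so drop those rows.
rated <- music_toy[music_toy$song.hotttnesss >= 0, ]

# Pearson correlation of each characteristic with song hotttnesss
cor_tempo    <- cor(rated$song.tempo, rated$song.hotttnesss)
cor_duration <- cor(rated$song.duration, rated$song.hotttnesss)

print(c(tempo = cor_tempo, duration = cor_duration))
```

A positive tempo correlation together with a negative duration correlation would be in line with the hypothesis, though correlation alone would not show that these characteristics cause popularity.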

Glimpse of data

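# Read the music data and summarize every column with skimr.
# Note: `music` stores the skim() summary, not the raw data frame.
music <- read.csv("data/music.csv") |>
  skimr::skim()
music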

Data summary
Name	read.csv(“data/music.csv”…
Number of rows	10000
Number of columns	35
_______________________
Column type frequency:
character	4
numeric	31
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	empty	n_unique
artist.id	1	18	18	0	3888
artist.name	1	1	255	0	4412
artist.terms	1	0	40	5	459
song.id	1	18	52	0	10000

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
artist.familiarity	1	0.57	0.16	0.00	0.47	0.56	0.67	1.00	▁▂▇▅▂
artist.hotttnesss	1	0.39	0.14	0.00	0.33	0.38	0.45	1.08	▁▇▃▁▁
artist.latitude	1	13.90	20.36	-41.28	0.00	0.00	34.42	69.65	▁▇▁▃▁
artist.location	1	0.08	7.80	0.00	0.00	0.00	0.00	780.00	▇▁▁▁▁
artist.longitude	1	-23.92	43.72	-162.44	-73.95	0.00	0.00	174.77	▁▂▇▁▁
artist.similar	1	0.00	0.00	0.00	0.00	0.00	0.00	0.00	▁▁▇▁▁
artist.terms_freq	1	224.89	22392.16	0.00	0.95	1.00	1.00	2239217.00	▇▁▁▁▁
release.id	1	371024.06	236777.83	0.00	172858.00	333103.00	573532.50	823599.00	▇▇▅▆▅
release.name	1	23.10	1322.90	0.00	0.00	0.00	0.00	85555.00	▇▁▁▁▁
song.artist_mbtags	1	0.00	0.00	0.00	0.00	0.00	0.00	0.33	▇▁▁▁▁
song.artist_mbtags_count	1	0.52	0.88	0.00	0.00	0.00	1.00	9.00	▇▁▁▁▁
song.bars_confidence	1	0.24	0.29	0.00	0.04	0.12	0.35	8.86	▇▁▁▁▁
song.bars_start	1	1.07	1.72	0.00	0.44	0.79	1.22	59.74	▇▁▁▁▁
song.beats_confidence	1	0.61	0.32	0.00	0.41	0.69	0.88	1.00	▃▂▃▆▇
song.beats_start	1	0.43	0.81	-60.00	0.19	0.33	0.50	12.25	▁▁▁▁▇
song.duration	1	240.62	246.08	1.04	176.03	223.06	276.38	22050.00	▇▁▁▁▁
song.end_of_fade_in	1	0.76	1.86	0.00	0.00	0.20	0.42	43.12	▇▁▁▁▁
song.hotttnesss	1	-0.24	0.69	-1.00	-1.00	0.00	0.41	1.00	▇▁▃▆▂
song.key	1	5.37	9.67	0.00	2.00	5.00	8.00	904.80	▇▁▁▁▁
song.key_confidence	1	0.45	0.33	0.00	0.22	0.47	0.66	19.08	▇▁▁▁▁
song.loudness	1	-10.48	5.40	-51.64	-13.16	-9.38	-6.53	0.57	▁▁▁▆▇
song.mode	1	0.69	0.46	0.00	0.00	1.00	1.00	1.00	▃▁▁▁▇
song.mode_confidence	1	0.48	0.19	0.00	0.36	0.49	0.61	1.00	▂▅▇▅▁
song.start_of_fade_out	1	229.88	112.02	-21.39	168.86	213.86	266.27	1813.43	▇▁▁▁▁
song.tatums_confidence	1	0.51	0.33	0.00	0.24	0.50	0.77	9.23	▇▁▁▁▁
song.tatums_start	1	0.30	0.51	0.00	0.11	0.19	0.29	12.25	▇▁▁▁▁
song.tempo	1	122.90	35.20	0.00	96.96	120.16	144.01	262.83	▁▆▇▂▁
song.time_signature	1	3.56	1.27	0.00	3.00	4.00	4.00	7.00	▂▁▇▁▁
song.time_signature_confidence	1	0.60	8.99	0.00	0.10	0.55	0.86	898.89	▇▁▁▁▁
song.title	1	10.01	945.49	0.00	0.00	0.00	0.00	94496.00	▇▁▁▁▁
song.year	1	934.70	996.65	0.00	0.00	0.00	2000.00	2010.00	▇▁▁▁▇

Data 3

Introduction and data

The data comes from the CORGIS Dataset Project on Billionaires: https://think.cs.vt.edu/corgis/csv/billionaires/
This data was collected from the Forbes World's Billionaires lists from 1996 to 2014, with additional data added by scholars at the Peterson Institute for International Economics.
The dataset describes individual billionaires, including information about their company, their demographics, and how they obtained their wealth.

Research question

How do the source and growth of billionaires' wealth relate to their rankings?
We are researching the origin of billionaires' wealth (whether it was inherited, where the money came from, etc.) and how these factors may affect billionaire rankings. This is an important research topic because many of the billionaires in the dataset have an economic impact on companies and markets that individuals interact with daily, as well as on country GDPs. We hypothesize that billionaires with inherited wealth tend to hold higher rankings.
Billionaire names are strings (categorical) and rankings are integers (quantitative). The variables relating to wealth origin are booleans and strings, which are categorical.
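
A simple way to start on this question is to compare the median rank of billionaires with and without inherited wealth, as in the base R sketch below. The column names `rank` and `wealth.how.inherited` come from the dataset, but the rows, the category labels "not inherited" and "father", and the helper column `wealth_group` are invented for illustration.

```r
# Toy stand-in for data/billionaires.csv; these values are invented.
billionaires_toy <- data.frame(
  rank = c(1, 50, 200, 400, 30, 800),
  wealth.how.inherited = c("not inherited", "father", "not inherited",
                           "father", "not inherited", "father")
)

# Hypothetical recoding: any label other than "not inherited" counts as inherited.
billionaires_toy$wealth_group <- ifelse(
  billionaires_toy$wealth.how.inherited == "not inherited",
  "not inherited", "inherited"
)

# Median rank per group; a smaller rank number means a higher position on the list.
median_rank <- tapply(
  billionaires_toy$rank, billionaires_toy$wealth_group, median
)
print(median_rank)
```

Because rank 1 is the top of the list, the hypothesis would predict a smaller median rank for the inherited group in the real data.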

Glimpse of data

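# Read the billionaires data and summarize every column with skimr.
# Note: `billionaires` stores the skim() summary, not the raw data frame.
billionaires <- read.csv("data/billionaires.csv") |>
  skimr::skim()
billionaires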

Data summary
Name	read.csv(“data/billionair…
Number of rows	2614
Number of columns	22
_______________________
Column type frequency:
character	16
numeric	6
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	empty	n_unique
name	1	5	45	0	2077
company.name	1	0	59	38	1578
company.relationship	1	0	46	46	75
company.sector	1	0	52	23	521
company.type	1	0	22	36	19
demographics.gender	1	0	14	34	4
location.citizenship	1	4	20	0	73
location.country.code	1	3	6	0	74
location.region	1	1	24	0	8
wealth.type	1	0	24	22	6
wealth.how.category	1	0	18	1	10
wealth.how.from.emerging	1	4	4	0	1
wealth.how.industry	1	0	31	1	20
wealth.how.inherited	1	6	24	0	6
wealth.how.was.founder	1	4	4	0	1
wealth.how.was.political	1	4	4	0	1

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
rank	1	5.996700e+02	4.678900e+02	1	215.0	430	9.880e+02	1.565e+03	▇▅▃▂▃
year	1	2.008410e+03	7.480000e+00	1996	2001.0	2014	2.014e+03	2.014e+03	▂▂▁▁▇
company.founded	1	1.924710e+03	2.437800e+02	0	1936.0	1963	1.985e+03	2.012e+03	▁▁▁▁▇
demographics.age	1	5.334000e+01	2.533000e+01	-42	47.0	59	7.000e+01	9.800e+01	▁▂▁▇▃
location.gdp	1	1.769103e+12	3.547083e+12	0	0.0	0	7.250e+11	1.060e+13	▇▁▁▁▁
wealth.worth.in.billions	1	3.530000e+00	5.090000e+00	1	1.4	2	3.500e+00	7.600e+01	▇▁▁▁▁