Project title

Proposal

library(tidyverse) # attaches dplyr, so a separate library(dplyr) call is not needed
library(skimr)

Data 1

Introduction and data

Identify the source of the data.

Kaggle: Spotify Tracks Dataset

Source: https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset?resource=download
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

The data was collected and cleansed using Spotify’s web API and Python by the original data curator.
Write a brief description of the observations.

Each observation is a song. Each row contains the id, artist, album name, song name, popularity, duration, explicitness, danceability rating, energy rating, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, time signature, and track genre of the song on Spotify. There are 114,000 observations and 21 columns in the dataset.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
- Are the most popular songs on Spotify more danceable?
- Are the most popular songs on Spotify longer in duration?
- What genre are the most popular songs on Spotify?
A description of the research topic along with a concise statement of your hypotheses on this topic.

Music is an integral part of our daily lives. Whether we are listening to the radio or attending a concert, we encounter many types of music regularly. It is therefore interesting to explore what makes songs popular; we are especially interested in danceability, duration, and genre. We hypothesize that the most popular songs on Spotify are highly danceable, short pop songs.
Identify the types of variables in your research question. Categorical? Quantitative?
- Track_id: Qualitative - descriptive
- Artists: Qualitative - descriptive
- Duration: Quantitative
- Album_name: Qualitative - descriptive
- Track_name: Qualitative - descriptive
- Popularity: Quantitative
- Danceability: Quantitative
- Track_genre: Categorical
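
As a rough sketch of how the three research questions might be approached, the code below uses a small made-up stand-in for the Spotify data. The values are invented for illustration; only the column names come from the dataset. Treating tracks at or above the 75th percentile of popularity as the "most popular" is our assumption, not a definition from the source.

```r
library(dplyr)

# Toy stand-in for the Spotify data; values are invented for illustration.
toy_spotify <- data.frame(
  popularity   = c(90, 85, 80, 40, 35, 30, 10, 5),
  danceability = c(0.80, 0.75, 0.70, 0.60, 0.55, 0.50, 0.40, 0.30),
  duration_ms  = c(180000, 200000, 190000, 240000, 230000, 250000, 300000, 280000),
  track_genre  = c("pop", "pop", "dance", "rock", "rock", "jazz", "jazz", "folk")
)

# Assumption: "most popular" means at or above the 75th percentile of popularity.
cutoff <- quantile(toy_spotify$popularity, 0.75)

# Compare mean danceability and mean duration (in minutes) between the two groups.
popularity_summary <- toy_spotify %>%
  mutate(most_popular = popularity >= cutoff) %>%
  group_by(most_popular) %>%
  summarise(
    mean_danceability = mean(danceability),
    mean_duration_min = mean(duration_ms) / 60000,
    n_tracks = n()
  )

# Count genres among the most popular tracks, most frequent first.
top_genres <- toy_spotify %>%
  filter(popularity >= cutoff) %>%
  count(track_genre, sort = TRUE)
```

On the real data, the same steps would run on `spotify` instead of `toy_spotify`; a different cutoff (for example the top 10%) may turn out to be more appropriate.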

Glimpse of data

spotify <- read.csv("data/dataset.csv")

skimr::skim(spotify)

Data summary
Name	spotify
Number of rows	114000
Number of columns	21
_______________________
Column type frequency:
character	6
numeric	15
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	empty	n_unique
track_id	1	22	22	0	89741
artists	1	0	513	1	31438
album_name	1	0	243	1	46590
track_name	1	0	511	1	73609
explicit	1	4	5	0	2
track_genre	1	3	17	0	114

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
X	1	56999.50	32909.11	0.00	28499.75	56999.50	85499.25	113999.00	▇▇▇▇▇
popularity	1	33.24	22.31	0.00	17.00	35.00	50.00	100.00	▇▇▇▃▁
duration_ms	1	228029.15	107297.71	0.00	174066.00	212906.00	261506.00	5237295.00	▇▁▁▁▁
danceability	1	0.57	0.17	0.00	0.46	0.58	0.70	0.98	▁▃▇▇▂
energy	1	0.64	0.25	0.00	0.47	0.69	0.85	1.00	▂▃▅▆▇
key	1	5.31	3.56	0.00	2.00	5.00	8.00	11.00	▇▃▃▅▆
loudness	1	-8.26	5.03	-49.53	-10.01	-7.00	-5.00	4.53	▁▁▁▇▆
mode	1	0.64	0.48	0.00	0.00	1.00	1.00	1.00	▅▁▁▁▇
speechiness	1	0.08	0.11	0.00	0.04	0.05	0.08	0.96	▇▁▁▁▁
acousticness	1	0.31	0.33	0.00	0.02	0.17	0.60	1.00	▇▂▂▂▂
instrumentalness	1	0.16	0.31	0.00	0.00	0.00	0.05	1.00	▇▁▁▁▁
liveness	1	0.21	0.19	0.00	0.10	0.13	0.27	1.00	▇▃▁▁▁
valence	1	0.47	0.26	0.00	0.26	0.46	0.68	1.00	▆▇▇▇▅
tempo	1	122.15	29.98	0.00	99.22	122.02	140.07	243.37	▁▃▇▃▁
time_signature	1	3.90	0.43	0.00	4.00	4.00	4.00	5.00	▁▁▁▇▁
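
The summary above reports 89741 unique `track_id` values in 114000 rows, so some tracks appear more than once. We have not yet checked why. The sketch below, on invented toy data, shows one way to count repeated IDs and to keep a single row per track; the toy rows assume that repeats differ only in `track_genre`, which may not hold for the real data.

```r
library(dplyr)

# Toy stand-in; IDs and values are invented for illustration.
toy_tracks <- data.frame(
  track_id    = c("a1", "a1", "b2", "c3", "c3", "c3"),
  popularity  = c(70, 70, 55, 20, 20, 20),
  track_genre = c("pop", "dance", "rock", "jazz", "blues", "soul")
)

# Count how many rows share each track_id and keep only the repeated ones.
repeated_ids <- toy_tracks %>%
  count(track_id, name = "n_rows") %>%
  filter(n_rows > 1)

# Keep the first row per track_id. Which row to keep is a judgment call
# when the repeated rows differ (here they differ only in track_genre).
unique_tracks <- toy_tracks %>%
  distinct(track_id, .keep_all = TRUE)
```

Dropping repeats would suit the danceability and duration questions, where each song should count once, but it discards genre information, so the genre question may be better served by the full table.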

Data 2

Introduction and data

Identify the source of the data.
- Kaggle: “TV shows on Netflix, Prime Video, Hulu, and Disney+”
- https://www.kaggle.com/datasets/ruchi798/tv-shows-on-netflix-prime-video-hulu-and-disney
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- The data was collected and uploaded in 2021. The curator scraped the available television shows during that time from Netflix, Prime Video, Hulu, and Disney+ and uploaded them to this data set.
Write a brief description of the observations.
- We have 12 columns and 5368 observations. The columns include the title of the show (string), the year it was produced, the target age group (string), the show’s IMDb and Rotten Tomatoes ratings, and one indicator column each for Netflix, Hulu, Prime Video, and Disney+. An indicator is 1 if the show is streamed on that platform and 0 otherwise. Every show has a unique identifier.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
- Which platform offers the best-quality television shows for each age group?
A description of the research topic along with a concise statement of your hypotheses on this topic.
- As streaming platforms continue to grow in popularity, debate about which platform is best is increasingly common. We are therefore interested in how platforms’ television offerings compare for each age group (especially ours!). To address this, we intend to look at the IMDb and/or Rotten Tomatoes scores (as indicators of TV quality), age group, and platform for each TV show. We hypothesize that Netflix will have the highest-quality TV shows for most age groups.
Identify the types of variables in your research question. Categorical? Quantitative?
- Row ID: Quantitative
- Unique TV ID: Quantitative
- Title: Qualitative - Descriptive
- Year: Quantitative
- Age: Categorical
- IMDb Rating: Quantitative
- Rotten Tomatoes Rating: Quantitative
- Netflix: Categorical
- Hulu: Categorical
- Prime Video: Categorical
- Disney+: Categorical
- Type: Categorical
- NOTE: Some of these variables might be filtered out/manipulated in our cleaned data set; the above list represents the variables in their original form.
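
A minimal sketch of how the research question might be computed, using invented toy rows. Two things here are assumptions rather than facts from the source: that the IMDb rating is stored as text of the form "8.0/10", and that the age labels look like "16+". The column names `Prime.Video` and `Disney.` are written as `read.csv()` renames them.

```r
library(dplyr)
library(tidyr)

# Toy stand-in; titles and values are invented for illustration.
toy_tv <- data.frame(
  Title       = c("Show A", "Show B", "Show C", "Show D"),
  Age         = c("16+", "16+", "7+", "7+"),
  IMDb        = c("8.0/10", "6.0/10", "7.0/10", ""),
  Netflix     = c(1, 0, 1, 0),
  Hulu        = c(0, 1, 1, 0),
  Prime.Video = c(0, 0, 0, 1),
  Disney.     = c(0, 0, 0, 1)
)

quality_by_platform <- toy_tv %>%
  # Assumption: ratings look like "8.0/10"; keep the part before the slash.
  # Empty strings become NA.
  mutate(imdb_score = as.numeric(sub("/.*$", "", IMDb))) %>%
  # One row per show-platform pair.
  pivot_longer(
    cols = c(Netflix, Hulu, Prime.Video, Disney.),
    names_to = "platform",
    values_to = "available"
  ) %>%
  # Keep only platforms that carry the show and shows with a rating.
  filter(available == 1, !is.na(imdb_score)) %>%
  group_by(Age, platform) %>%
  summarise(mean_imdb = mean(imdb_score), n_shows = n(), .groups = "drop")
```

Because a show can stream on several platforms, it contributes to each of them after pivoting; whether that is the desired treatment is a design choice we would need to settle.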

Glimpse of data

tvshows <- read.csv("data/tv_shows.csv")

skimr::skim(tvshows)

Data summary
Name	tvshows
Number of rows	5368
Number of columns	12
_______________________
Column type frequency:
character	4
numeric	8
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	empty	n_unique
Title	1	1	77	0	5368
Age	1	0	3	2127	6
IMDb	1	0	6	962	79
Rotten.Tomatoes	1	6	7	0	85

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
X	1	2683.50	1549.75	0	1341.75	2683.5	4025.25	5367	▇▇▇▇▇
ID	1	2814.95	1672.39	1	1345.75	2788.0	4308.25	5717	▇▇▇▇▇
Year	1	2012.63	10.14	1904	2011.00	2016.0	2018.00	2021	▁▁▁▁▇
Netflix	1	0.37	0.48	0	0.00	0.0	1.00	1	▇▁▁▁▅
Hulu	1	0.30	0.46	0	0.00	0.0	1.00	1	▇▁▁▁▃
Prime.Video	1	0.34	0.47	0	0.00	0.0	1.00	1	▇▁▁▁▅
Disney.	1	0.07	0.25	0	0.00	0.0	0.00	1	▇▁▁▁▁
Type	1	1.00	0.00	1	1.00	1.0	1.00	1	▁▁▇▁▁
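
The summary above shows a complete_rate of 1 for `Age` and `IMDb` even though it also counts 2127 and 962 empty strings, which suggests blank fields were read as "" rather than as missing values. One possible way to make those gaps visible is the `na.strings` argument of `read.csv()`. The sketch below uses a few invented inline rows in place of the real file.

```r
# Inline CSV standing in for data/tv_shows.csv; rows are invented for illustration.
csv_text <- "Title,Age,IMDb
Show A,16+,8.0/10
Show B,,6.5/10
Show C,7+,
"

# Default behaviour: blank character fields are read as "" (not NA).
raw <- read.csv(text = csv_text)

# Treating "" as missing makes the gaps visible to is.na().
with_na <- read.csv(text = csv_text, na.strings = c("", "NA"))

# Number of missing values per column after the change.
missing_counts <- colSums(is.na(with_na))
```

For the real data, the same argument could be passed when reading "data/tv_shows.csv".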

Data 3

Introduction and data

Identify the source of the data.
- Kaggle: “Data Science Job Salaries”
- https://www.kaggle.com/datasets/ruchi798/data-science-job-salaries
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- The data was collected and uploaded in 2022.
Write a brief description of the observations.
- The data has 607 observations and 12 columns. Besides a row index, the columns record the work year, experience level, employment type, job title, salary, salary currency, salary in USD, employee’s place of residence, remote ratio, company location, and company size. Each observation represents an employee and describes their job through these columns.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
- Which job title/industry has the highest average salary from 2020-2022?
- What is the average salary for all (or specific) experience levels in small, medium, and large companies?
  - Potential Question to explore: What is the highest average salary that an EN (entry level/junior level) employee can expect to earn (based on small, medium, or large company)?
- What are the most popular job titles in the data science industry?
- Which experience level is most common in the industry?
- Which region do most employees come from (broken down by experience level)?
- Which experience level is most common among employees in each job title?
A description of the research topic along with a concise statement of your hypotheses on this topic.
- As budding Information Science majors, we see data science as a promising field in which to build our future careers. We’re interested in learning how the average salary in the industry has changed over the last couple of years and how experience level influences salary within popular data science job titles. To explore this in greater detail, we plan to analyze salary trends over time across experience levels and company sizes. Other variables can also be used to explore the questions listed above. For the first question, we hypothesize that data science roles in the tech industry will have the highest average salary. For the second question, we hypothesize that EX (executive level/director) employees at large companies will have the highest average salary.
Identify the types of variables in your research question. Categorical? Quantitative?
- Employee_ID: Quantitative
- Work_year: Quantitative
- Experience_level: Categorical
- Employment_type: Categorical
- Job_title: Categorical
- Salary: Quantitative
- Salary_in_usd: Quantitative
- Remote_ratio: Quantitative
- Company_location: Categorical
- Company_size: Categorical
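
As a hedged sketch of the second research question, the code below averages `salary_in_usd` by experience level and company size on invented toy rows. The company-size codes "S", "M", and "L" are our assumption about how small, medium, and large companies are encoded. We use `salary_in_usd` rather than `salary` because the raw salaries are recorded in different currencies.

```r
library(dplyr)

# Toy stand-in; values are invented for illustration.
toy_salary <- data.frame(
  experience_level = c("EN", "EN", "EN", "EX", "EX"),
  company_size     = c("S", "L", "L", "L", "M"),
  salary_in_usd    = c(50000, 70000, 90000, 250000, 200000)
)

# Mean salary for each combination of experience level and company size.
salary_summary <- toy_salary %>%
  group_by(experience_level, company_size) %>%
  summarise(
    mean_salary_usd = mean(salary_in_usd),
    n_employees = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(mean_salary_usd))

# Company size with the highest mean salary for entry-level (EN) employees.
best_for_entry <- salary_summary %>%
  filter(experience_level == "EN") %>%
  slice_max(mean_salary_usd, n = 1)
```

With few employees in some groups, the group counts in `n_employees` help judge how much weight each average deserves.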

Glimpse of data

ds_salary <- read.csv("data/ds_salaries.csv") 

skimr::skim(ds_salary)

Data summary
Name	ds_salary
Number of rows	607
Number of columns	12
_______________________
Column type frequency:
character	7
numeric	5
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
experience_level	1	2	2	4
employment_type	1	2	2	4
job_title	1	11	40	50
salary_currency	1	3	3	17
employee_residence	1	2	2	57
company_location	1	2	2	50
company_size	1	1	1	3

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
X	1	303.00	175.37	0	151.5	303	454.5	606	▇▇▇▇▇
work_year	1	2021.41	0.69	2020	2021.0	2022	2022.0	2022	▂▁▆▁▇
salary	1	324000.06	1544357.49	4000	70000.0	115000	165000.0	30400000	▇▁▁▁▁
salary_in_usd	1	112297.87	70957.26	2859	62726.0	101570	150000.0	600000	▇▅▁▁▁
remote_ratio	1	70.92	40.71	0	50.0	100	100.0	100	▂▁▂▁▇