Project title

Proposal

library(tidyverse)  # attaches dplyr, ggplot2, readr, and other core packages
library(skimr)

Data 1

Introduction and data

  • Identify the source of the data.

    Kaggle: Spotify Tracks Dataset

    Source: https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset?resource=download

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    The original data curator collected and cleaned the data using Spotify’s Web API and Python.

  • Write a brief description of the observations.

    Each observation is a song on Spotify. Each row contains the track ID, artist, album name, song name, popularity, duration, explicitness, danceability rating, energy rating, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, time signature, and track genre. There are 114,000 observations and 21 columns in the dataset.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

    • Are the most popular songs on Spotify more danceable?

    • Are the most popular songs on Spotify longer in duration?

    • What genre are the most popular songs on Spotify?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

    Music is an integral part of our daily lives. Whether listening to the radio or attending a concert, we encounter different types of music all the time. It is therefore interesting to explore what makes songs popular; we are especially interested in danceability, duration, and genre. We hypothesize that the most popular songs on Spotify are highly danceable, short pop songs.

  • Identify the types of variables in your research question. Categorical? Quantitative?

    • Track_id: Qualitative - descriptive

    • Artists: Qualitative - descriptive

    • Duration: Quantitative

    • Album_name: Qualitative - descriptive

    • Track_name: Qualitative - descriptive

    • Popularity: Quantitative

    • Danceability: Quantitative

    • Track_genre: Categorical
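
One way to start checking these hypotheses is to compare the most popular tracks with the rest. The following is only a minimal sketch: the helper names `compare_popular_tracks()` and `rank_genres()` and the 10% cutoff for "most popular" are assumptions made for illustration; only the column names (`popularity`, `danceability`, `duration_ms`, `track_genre`) come from the data summary.

```r
library(dplyr)

# Hypothetical helper: splits tracks into the top `top_share` by popularity
# and everything else, then summarises danceability and duration per group.
# The default cutoff of 10% is an assumption, not a value from the proposal.
compare_popular_tracks <- function(tracks, top_share = 0.10) {
  cutoff <- unname(quantile(tracks$popularity, probs = 1 - top_share, na.rm = TRUE))

  tracks %>%
    mutate(
      popularity_group = if_else(popularity >= cutoff, "most popular", "other"),
      duration_min = duration_ms / 60000  # milliseconds to minutes
    ) %>%
    group_by(popularity_group) %>%
    summarise(
      n_tracks = n(),
      mean_danceability = mean(danceability, na.rm = TRUE),
      median_duration_min = median(duration_min, na.rm = TRUE),
      .groups = "drop"
    )
}

# Hypothetical helper: ranks genres by their mean popularity.
rank_genres <- function(tracks) {
  tracks %>%
    group_by(track_genre) %>%
    summarise(mean_popularity = mean(popularity, na.rm = TRUE), .groups = "drop") %>%
    arrange(desc(mean_popularity))
}

# Intended usage with the data frame read in this proposal:
# compare_popular_tracks(spotify)
# rank_genres(spotify)
```

Because `track_id` has fewer unique values (89,741) than there are rows (114,000), some tracks appear more than once, so it may be worth deciding how to treat repeated tracks before relying on these summaries.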

Glimpse of data

spotify <- read.csv("data/dataset.csv")

skimr::skim(spotify)
Data summary
Name spotify
Number of rows 114000
Number of columns 21
_______________________
Column type frequency:
character 6
numeric 15
________________________
Group variables None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
track_id               0              1   22   22      0     89741           0
artists                0              1    0  513      1     31438           0
album_name             0              1    0  243      1     46590           0
track_name             0              1    0  511      1     73609           0
explicit               0              1    4    5      0         2           0
track_genre            0              1    3   17      0       114           0

Variable type: numeric

skim_variable     n_missing  complete_rate       mean         sd      p0        p25        p50        p75        p100  hist
X                         0              1   56999.50   32909.11    0.00   28499.75   56999.50   85499.25   113999.00  ▇▇▇▇▇
popularity                0              1      33.24      22.31    0.00      17.00      35.00      50.00      100.00  ▇▇▇▃▁
duration_ms               0              1  228029.15  107297.71    0.00  174066.00  212906.00  261506.00  5237295.00  ▇▁▁▁▁
danceability              0              1       0.57       0.17    0.00       0.46       0.58       0.70        0.98  ▁▃▇▇▂
energy                    0              1       0.64       0.25    0.00       0.47       0.69       0.85        1.00  ▂▃▅▆▇
key                       0              1       5.31       3.56    0.00       2.00       5.00       8.00       11.00  ▇▃▃▅▆
loudness                  0              1      -8.26       5.03  -49.53     -10.01      -7.00      -5.00        4.53  ▁▁▁▇▆
mode                      0              1       0.64       0.48    0.00       0.00       1.00       1.00        1.00  ▅▁▁▁▇
speechiness               0              1       0.08       0.11    0.00       0.04       0.05       0.08        0.96  ▇▁▁▁▁
acousticness              0              1       0.31       0.33    0.00       0.02       0.17       0.60        1.00  ▇▂▂▂▂
instrumentalness          0              1       0.16       0.31    0.00       0.00       0.00       0.05        1.00  ▇▁▁▁▁
liveness                  0              1       0.21       0.19    0.00       0.10       0.13       0.27        1.00  ▇▃▁▁▁
valence                   0              1       0.47       0.26    0.00       0.26       0.46       0.68        1.00  ▆▇▇▇▅
tempo                     0              1     122.15      29.98    0.00      99.22     122.02     140.07      243.37  ▁▃▇▃▁
time_signature            0              1       3.90       0.43    0.00       4.00       4.00       4.00        5.00  ▁▁▁▇▁

Data 2

Introduction and data

  • Identify the source of the data.

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    • The data was collected and uploaded in 2021. The curator scraped the television shows available at that time on Netflix, Prime Video, Hulu, and Disney+ and compiled them into this data set.
  • Write a brief description of the observations.

    • The data set has 5,368 observations and 12 columns. Each observation is a television show with a unique identifier. The columns include the title of the show (string), the year it was produced, the target age group (string), the IMDb and Rotten Tomatoes ratings of the show (both stored as strings; Rotten Tomatoes is scored out of 100), and one indicator column per platform (Netflix, Hulu, Prime Video, and Disney+), which is 1 if the show is streamed on that platform and 0 otherwise.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
    • Which platform offers the best-quality television shows for each age group?
  • A description of the research topic along with a concise statement of your hypotheses on this topic.
    • As streaming platforms continue to grow in popularity, debates about which platform is best are becoming more and more common. We are therefore interested in learning about each platform’s television offerings per age group (especially ours!). To address this, we intend to look at the IMDb and/or Rotten Tomatoes score (as an indicator of TV quality), the age group, and the platform for each TV show; these variables are enough to analyze the research question. We hypothesize that Netflix will have the highest-quality TV shows for most age groups.
  • Identify the types of variables in your research question. Categorical? Quantitative?
    • Row ID: Quantitative

    • Unique TV ID: Quantitative

    • Title: Qualitative - Descriptive

    • Year: Quantitative

    • Age: Categorical

    • IMDb Rating: Quantitative

    • Rotten Tomatoes Rating: Quantitative

    • Netflix: Categorical

    • Hulu: Categorical

    • Prime Video: Categorical

    • Disney+: Categorical

    • Type: Categorical

    • NOTE: Some of these variables might be filtered out/manipulated in our cleaned data set; the above list represents the variables in their original form.
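
The data summary shows that `IMDb` and `Rotten.Tomatoes` are read as character columns and that `Age` and `IMDb` contain empty strings rather than missing values, so some cleaning is likely needed before platforms can be compared. The following is a rough sketch of one possible approach, not a final method: it assumes ratings are stored in a "score/maximum" form such as "8.9/10" (an assumption that should be checked against the raw file), and the helper names `parse_score()` and `summarise_platform_quality()` are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: extracts the numeric score from a rating string.
# ASSUMPTION: ratings look like "8.9/10"; empty strings are treated as missing.
parse_score <- function(x) {
  as.numeric(sub("/.*$", "", na_if(x, "")))
}

# Hypothetical helper: mean IMDb score per age group and platform.
summarise_platform_quality <- function(shows) {
  shows %>%
    mutate(
      imdb_score = parse_score(IMDb),
      age_group = na_if(Age, "")
    ) %>%
    # One row per show-platform pair instead of one indicator column per platform
    pivot_longer(
      cols = c(Netflix, Hulu, Prime.Video, Disney.),
      names_to = "platform",
      values_to = "available"
    ) %>%
    # Keep only platforms that stream the show, and rows with usable values
    filter(available == 1, !is.na(imdb_score), !is.na(age_group)) %>%
    group_by(age_group, platform) %>%
    summarise(
      n_shows = n(),
      mean_imdb = mean(imdb_score),
      .groups = "drop"
    ) %>%
    arrange(age_group, desc(mean_imdb))
}

# Intended usage with the data frame read in this proposal:
# summarise_platform_quality(tvshows)
```

A show that streams on several platforms contributes one row to each of them after `pivot_longer()`, which seems appropriate here because the question is about what each platform offers.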

Glimpse of data

tvshows <- read.csv("data/tv_shows.csv")

skimr::skim(tvshows)
Data summary
Name tvshows
Number of rows 5368
Number of columns 12
_______________________
Column type frequency:
character 4
numeric 8
________________________
Group variables None

Variable type: character

skim_variable    n_missing  complete_rate  min  max  empty  n_unique  whitespace
Title                    0              1    1   77      0      5368           0
Age                      0              1    0    3   2127         6           0
IMDb                     0              1    0    6    962        79           0
Rotten.Tomatoes          0              1    6    7      0        85           0

Variable type: numeric

skim_variable  n_missing  complete_rate     mean       sd    p0      p25     p50      p75  p100  hist
X                      0              1  2683.50  1549.75     0  1341.75  2683.5  4025.25  5367  ▇▇▇▇▇
ID                     0              1  2814.95  1672.39     1  1345.75  2788.0  4308.25  5717  ▇▇▇▇▇
Year                   0              1  2012.63    10.14  1904  2011.00  2016.0  2018.00  2021  ▁▁▁▁▇
Netflix                0              1     0.37     0.48     0     0.00     0.0     1.00     1  ▇▁▁▁▅
Hulu                   0              1     0.30     0.46     0     0.00     0.0     1.00     1  ▇▁▁▁▃
Prime.Video            0              1     0.34     0.47     0     0.00     0.0     1.00     1  ▇▁▁▁▅
Disney.                0              1     0.07     0.25     0     0.00     0.0     0.00     1  ▇▁▁▁▁
Type                   0              1     1.00     0.00     1     1.00     1.0     1.00     1  ▁▁▇▁▁

Data 3

Introduction and data

  • Identify the source of the data.

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    • The data was collected and uploaded in 2022.
  • Write a brief description of the observations.

    • The data has 607 observations and 12 columns. The columns are a row index, the work year, experience level, employment type, job title, salary, salary currency, salary in USD, employee’s place of residence, company size, company location, and the remote ratio. Each observation represents an employee, described by the job information in those columns.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

    • Which job title/industry had the highest average salary from 2020 to 2022?

    • What is the average salary for all (or specific) experience levels in small, medium, and large companies?

      • Potential question to explore: What is the highest average salary that an EN (entry level/junior level) employee can expect to earn, depending on company size (small, medium, or large)?
    • What are the most popular job titles in the data science industry?

    • Which experience level is most common in the industry?

    • Which region do most employees come from (broken down by experience level)?

    • Which experience level is most common among employees in each job title?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

    • As budding Information Science majors, we see data science as a promising field for our future careers. We’re interested in learning how the average salary in the industry has changed over the last few years and how experience level influences salary within popular data science job titles. To explore this in greater detail, we plan to analyze salary trends over time by experience level and company size. Other variables can also be used to explore the remaining questions listed above. For the first question on the list, we hypothesize that data science in the tech industry will have the highest average salary. For the second question on the list, we hypothesize that the average salary for EX (executive level/director) employees at large companies will be the highest.
  • Identify the types of variables in your research question. Categorical? Quantitative?

    • Employee_ID: Quantitative

    • Work_year: Quantitative

    • Experience_level: Categorical

    • Employment_type: Categorical

    • Job_title: Categorical

    • Salary: Quantitative

    • Salary_in_usd: Quantitative

    • Remote_ratio: Quantitative

    • Company_location: Categorical

    • Company_size: Categorical
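
The salary questions above mostly come down to grouped averages. The following is a minimal sketch of how they might be computed, under a few assumptions: the helper names `summarise_salary()` and `top_job_titles()` are hypothetical, `salary_in_usd` is used so that salaries paid in different currencies are comparable, and the `min_n` threshold for ignoring rarely observed job titles is an illustrative choice rather than a value from the proposal.

```r
library(dplyr)

# Hypothetical helper: average salary (USD) for each combination of
# experience level and company size, highest average first.
summarise_salary <- function(salaries) {
  salaries %>%
    group_by(experience_level, company_size) %>%
    summarise(
      n_employees = n(),
      mean_salary_usd = mean(salary_in_usd),
      median_salary_usd = median(salary_in_usd),
      .groups = "drop"
    ) %>%
    arrange(desc(mean_salary_usd))
}

# Hypothetical helper: job titles ranked by average salary (USD).
# `min_n` drops titles with too few employees to average reliably;
# the default of 5 is an assumption for illustration.
top_job_titles <- function(salaries, min_n = 5) {
  salaries %>%
    group_by(job_title) %>%
    summarise(
      n_employees = n(),
      mean_salary_usd = mean(salary_in_usd),
      .groups = "drop"
    ) %>%
    filter(n_employees >= min_n) %>%
    arrange(desc(mean_salary_usd))
}

# Intended usage with the data frame read in this proposal:
# summarise_salary(ds_salary)
# top_job_titles(ds_salary)
```

Reporting the median alongside the mean may be useful here, since the summary of `salary_in_usd` suggests a right-skewed distribution (mean 112,297.87 versus median 101,570).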

Glimpse of data

ds_salary <- read.csv("data/ds_salaries.csv") 

skimr::skim(ds_salary)
Data summary
Name ds_salary
Number of rows 607
Number of columns 12
_______________________
Column type frequency:
character 7
numeric 5
________________________
Group variables None

Variable type: character

skim_variable       n_missing  complete_rate  min  max  empty  n_unique  whitespace
experience_level            0              1    2    2      0         4           0
employment_type             0              1    2    2      0         4           0
job_title                   0              1   11   40      0        50           0
salary_currency             0              1    3    3      0        17           0
employee_residence          0              1    2    2      0        57           0
company_location            0              1    2    2      0        50           0
company_size                0              1    1    1      0         3           0

Variable type: numeric

skim_variable  n_missing  complete_rate       mean          sd    p0      p25     p50       p75      p100  hist
X                      0              1     303.00      175.37     0    151.5     303     454.5       606  ▇▇▇▇▇
work_year              0              1    2021.41        0.69  2020   2021.0    2022    2022.0      2022  ▂▁▆▁▇
salary                 0              1  324000.06  1544357.49  4000  70000.0  115000  165000.0  30400000  ▇▁▁▁▁
salary_in_usd          0              1  112297.87    70957.26  2859  62726.0  101570  150000.0    600000  ▇▅▁▁▁
remote_ratio           0              1      70.92       40.71     0     50.0     100     100.0       100  ▂▁▂▁▇