library(tidyverse)
library(skimr)
library(dplyr)
Project title
Proposal
Data 1
Introduction and data
Identify the source of the data.
Kaggle: Spotify Tracks Dataset
Source: https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset?resource=download
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
The data was collected and cleansed using Spotify’s web API and Python by the original data curator.
Write a brief description of the observations.
Each observation is a song. Each row contains the id, artist, album name, song name, popularity, duration, explicitness, danceability rating, energy rating, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, time signature, and track genre of the song on Spotify. There are 114000 observations and 21 columns in the dataset.
Research question
A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
Are the most popular songs on Spotify more danceable?
Are the most popular songs on Spotify longer in duration?
What genre are the most popular songs on Spotify?
A description of the research topic along with a concise statement of your hypotheses on this topic.
Music is an integral part of daily life. Whether we are listening to the radio or attending a concert, we encounter different types of music frequently. It is therefore interesting to explore what makes songs popular; we are especially interested in danceability, duration, and genre. Currently, we hypothesize that the most popular songs on Spotify are highly danceable, short pop songs.
Identify the types of variables in your research question. Categorical? Quantitative?
Track_id: Qualitative - descriptive
Artists: Qualitative - descriptive
Duration: Quantitative
Album_name: Qualitative - descriptive
Track_name: Qualitative - descriptive
Popularity: Quantitative
Danceability: Quantitative
Track_genre: Categorical
Glimpse of data
spotify <- read.csv("data/dataset.csv")
skimr::skim(spotify)
| Name | spotify |
| Number of rows | 114000 |
| Number of columns | 21 |
| _______________________ | |
| Column type frequency: | |
| character | 6 |
| numeric | 15 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| track_id | 0 | 1 | 22 | 22 | 0 | 89741 | 0 |
| artists | 0 | 1 | 0 | 513 | 1 | 31438 | 0 |
| album_name | 0 | 1 | 0 | 243 | 1 | 46590 | 0 |
| track_name | 0 | 1 | 0 | 511 | 1 | 73609 | 0 |
| explicit | 0 | 1 | 4 | 5 | 0 | 2 | 0 |
| track_genre | 0 | 1 | 3 | 17 | 0 | 114 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| X | 0 | 1 | 56999.50 | 32909.11 | 0.00 | 28499.75 | 56999.50 | 85499.25 | 113999.00 | ▇▇▇▇▇ |
| popularity | 0 | 1 | 33.24 | 22.31 | 0.00 | 17.00 | 35.00 | 50.00 | 100.00 | ▇▇▇▃▁ |
| duration_ms | 0 | 1 | 228029.15 | 107297.71 | 0.00 | 174066.00 | 212906.00 | 261506.00 | 5237295.00 | ▇▁▁▁▁ |
| danceability | 0 | 1 | 0.57 | 0.17 | 0.00 | 0.46 | 0.58 | 0.70 | 0.98 | ▁▃▇▇▂ |
| energy | 0 | 1 | 0.64 | 0.25 | 0.00 | 0.47 | 0.69 | 0.85 | 1.00 | ▂▃▅▆▇ |
| key | 0 | 1 | 5.31 | 3.56 | 0.00 | 2.00 | 5.00 | 8.00 | 11.00 | ▇▃▃▅▆ |
| loudness | 0 | 1 | -8.26 | 5.03 | -49.53 | -10.01 | -7.00 | -5.00 | 4.53 | ▁▁▁▇▆ |
| mode | 0 | 1 | 0.64 | 0.48 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▅▁▁▁▇ |
| speechiness | 0 | 1 | 0.08 | 0.11 | 0.00 | 0.04 | 0.05 | 0.08 | 0.96 | ▇▁▁▁▁ |
| acousticness | 0 | 1 | 0.31 | 0.33 | 0.00 | 0.02 | 0.17 | 0.60 | 1.00 | ▇▂▂▂▂ |
| instrumentalness | 0 | 1 | 0.16 | 0.31 | 0.00 | 0.00 | 0.00 | 0.05 | 1.00 | ▇▁▁▁▁ |
| liveness | 0 | 1 | 0.21 | 0.19 | 0.00 | 0.10 | 0.13 | 0.27 | 1.00 | ▇▃▁▁▁ |
| valence | 0 | 1 | 0.47 | 0.26 | 0.00 | 0.26 | 0.46 | 0.68 | 1.00 | ▆▇▇▇▅ |
| tempo | 0 | 1 | 122.15 | 29.98 | 0.00 | 99.22 | 122.02 | 140.07 | 243.37 | ▁▃▇▃▁ |
| time_signature | 0 | 1 | 3.90 | 0.43 | 0.00 | 4.00 | 4.00 | 4.00 | 5.00 | ▁▁▁▇▁ |
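The summary above shows 89741 unique `track_id` values across 114000 rows, so some tracks appear more than once and would be counted repeatedly in a naive comparison. The following is a minimal sketch of how the danceability and duration questions might be approached. It runs on a small invented data frame (`toy_spotify`, hypothetical values), and the 75th-percentile cutoff for "most popular" is an assumption rather than a definition taken from the data source.

```r
library(dplyr)

# Toy stand-in for `spotify`; all values are invented for illustration only.
toy_spotify <- tibble(
  track_id     = c("a", "a", "b", "c", "d", "e"),
  track_genre  = c("pop", "dance", "pop", "rock", "jazz", "pop"),
  popularity   = c(90, 90, 80, 40, 20, 10),
  danceability = c(0.80, 0.80, 0.70, 0.50, 0.40, 0.60),
  duration_ms  = c(200000, 200000, 180000, 240000, 300000, 210000)
)

# Keep one row per track so repeated track_ids are counted once.
tracks <- toy_spotify %>%
  distinct(track_id, .keep_all = TRUE)

# Flag the "most popular" songs; the 75th-percentile cutoff is an assumption.
cutoff <- quantile(tracks$popularity, 0.75)

# Compare average danceability and duration between the two groups.
popularity_summary <- tracks %>%
  mutate(most_popular = popularity >= cutoff) %>%
  group_by(most_popular) %>%
  summarize(
    n_tracks          = n(),
    mean_danceability = mean(danceability),
    mean_duration_min = mean(duration_ms) / 60000
  )

popularity_summary
```

With the real data, `toy_spotify` would be replaced by `spotify`. Note that `distinct()` keeps only the first genre listed for a repeated track, so the genre question would need the full, non-deduplicated rows.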
Data 2
Introduction and data
Identify the source of the data.
Kaggle: “TV shows on Netflix, Prime Video, Hulu, and Disney+”
https://www.kaggle.com/datasets/ruchi798/tv-shows-on-netflix-prime-video-hulu-and-disney
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- The data was collected and uploaded in 2021. The curator scraped the available television shows during that time from Netflix, Prime Video, Hulu, and Disney+ and uploaded them to this data set.
Write a brief description of the observations.
- The data set has 5368 observations and 12 columns. Each observation is a TV show with a unique identifier. The columns include the title of the show (string), the year it was produced, the target age group (string), the show's IMDb rating and Rotten Tomatoes rating (out of 100), and one indicator per platform (Netflix, Hulu, Prime Video, and Disney+) that is 1 if the show streams on that platform and 0 otherwise. Both rating columns are read in as character strings.
Research question
- A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
- Which platform offers the best-quality television shows for each age group?
- A description of the research topic along with a concise statement of your hypotheses on this topic.
- As streaming platforms continue to grow in popularity, debate about which platform is best is becoming more common. We are therefore interested in each platform's television offerings per age group (especially ours!). To address this, we intend to look at the IMDb and/or Rotten Tomatoes scores (as indicators of TV quality), the age group, and the platform for each TV show; these variables are sufficient to analyze the research question. Currently, we hypothesize that Netflix will have the highest-quality TV shows for most age groups.
- Identify the types of variables in your research question. Categorical? Quantitative?
Row ID: Quantitative
Unique TV ID: Quantitative
Title: Qualitative - Descriptive
Year: Quantitative
Age: Categorical
IMDb Rating: Quantitative
Rotten Tomatoes Rating: Quantitative
Netflix: Categorical
Hulu: Categorical
Prime Video: Categorical
Disney+: Categorical
Type: Categorical
NOTE: Some of these variables might be filtered out/manipulated in our cleaned data set; the above list represents the variables in their original form.
Glimpse of data
tvshows <- read.csv("data/tv_shows.csv")
skimr::skim(tvshows)
| Name | tvshows |
| Number of rows | 5368 |
| Number of columns | 12 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Title | 0 | 1 | 1 | 77 | 0 | 5368 | 0 |
| Age | 0 | 1 | 0 | 3 | 2127 | 6 | 0 |
| IMDb | 0 | 1 | 0 | 6 | 962 | 79 | 0 |
| Rotten.Tomatoes | 0 | 1 | 6 | 7 | 0 | 85 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| X | 0 | 1 | 2683.50 | 1549.75 | 0 | 1341.75 | 2683.5 | 4025.25 | 5367 | ▇▇▇▇▇ |
| ID | 0 | 1 | 2814.95 | 1672.39 | 1 | 1345.75 | 2788.0 | 4308.25 | 5717 | ▇▇▇▇▇ |
| Year | 0 | 1 | 2012.63 | 10.14 | 1904 | 2011.00 | 2016.0 | 2018.00 | 2021 | ▁▁▁▁▇ |
| Netflix | 0 | 1 | 0.37 | 0.48 | 0 | 0.00 | 0.0 | 1.00 | 1 | ▇▁▁▁▅ |
| Hulu | 0 | 1 | 0.30 | 0.46 | 0 | 0.00 | 0.0 | 1.00 | 1 | ▇▁▁▁▃ |
| Prime.Video | 0 | 1 | 0.34 | 0.47 | 0 | 0.00 | 0.0 | 1.00 | 1 | ▇▁▁▁▅ |
| Disney. | 0 | 1 | 0.07 | 0.25 | 0 | 0.00 | 0.0 | 0.00 | 1 | ▇▁▁▁▁ |
| Type | 0 | 1 | 1.00 | 0.00 | 1 | 1.00 | 1.0 | 1.00 | 1 | ▁▁▇▁▁ |
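In the summary above, `IMDb` and `Rotten.Tomatoes` are read in as character columns, `Age` and `IMDb` contain empty strings, and each platform is a separate 0/1 column. The sketch below shows one way the data might be reshaped to compare mean IMDb score by age group and platform. It uses a small invented data frame (`toy_tvshows`, hypothetical values), and it assumes the ratings are stored as text of the form "8.0/10"; that format is not confirmed by the summary and should be checked against the raw file.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for `tvshows`; all values are invented for illustration only.
# The "x/10" rating format is an assumption about the raw file.
toy_tvshows <- tibble(
  Title       = c("Show A", "Show B", "Show C", "Show D"),
  Age         = c("16+", "16+", "7+", ""),
  IMDb        = c("8.0/10", "6.0/10", "", "9.0/10"),
  Netflix     = c(1, 0, 1, 1),
  Hulu        = c(1, 1, 0, 0),
  Prime.Video = c(0, 0, 0, 0),
  Disney.     = c(0, 0, 1, 0)
)

platform_quality <- toy_tvshows %>%
  mutate(
    # Treat empty strings as missing values.
    Age = na_if(Age, ""),
    # Drop everything from the slash onward, then convert to a number.
    imdb_score = as.numeric(sub("/.*$", "", na_if(IMDb, "")))
  ) %>%
  # One row per show-platform pair instead of one column per platform.
  pivot_longer(
    cols      = c(Netflix, Hulu, Prime.Video, Disney.),
    names_to  = "platform",
    values_to = "available"
  ) %>%
  filter(available == 1, !is.na(Age), !is.na(imdb_score)) %>%
  group_by(Age, platform) %>%
  summarize(n_shows = n(), mean_imdb = mean(imdb_score), .groups = "drop")

platform_quality
```

A show that streams on several platforms contributes one row to each of them after `pivot_longer()`, which matches the question of what each platform offers. Dropping rows with a missing age group or rating is a design choice for this sketch; with 2127 empty `Age` values in the real data, the effect of that filter would need to be reported.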
Data 3
Introduction and data
Identify the source of the data.
Kaggle: “Data Science Job Salaries”
https://www.kaggle.com/datasets/ruchi798/data-science-job-salaries
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- The data was collected and uploaded in 2022.
Write a brief description of the observations.
- The data has 607 observations and 12 columns. The columns represent the work year, experience level, employment type, job title, salary, salary currency, salary in USD, employee's place of residence, company size, company location, and the remote ratio. Each observation represents an employee, with information about their job recorded in the columns above.
Research question
A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
Which job title/industry has the highest average salary from 2020-2022?
What is the average salary for all (or specific) experience levels in small, medium, and large companies?
- Potential Question to explore: What is the highest average salary that an EN (entry level/junior level) employee can expect to earn (based on small, medium, or large company)?
What are the most popular job titles in the data science industry?
Which experience level is most common in the industry?
Which region do most employees come from (sorted by experience level)?
Which experience level is most common among employees in each job title?
A description of the research topic along with a concise statement of your hypotheses on this topic.
- As budding Information Science majors, we see data science as a promising field for our future careers. We're interested in learning how the average salary in the industry has changed over the last couple of years and how experience level influences salary within popular data science job titles. To explore this in greater detail, we plan to analyze salary trends over time across experience levels and company sizes. Other variables can also be used to explore the other questions listed above. For the first question on the list, we hypothesize that data science in the tech industry will likely have the highest average salary. For the second question on the list, we hypothesize that the average salary for EX (executive level/director) employees at large companies will be the highest.
Identify the types of variables in your research question. Categorical? Quantitative?
Employee_ID: Quantitative
Work_year: Quantitative
Experience_level: Categorical
Employment_type: Categorical
Job_title: Categorical
Salary: Quantitative
Salary_in_usd: Quantitative
Remote_ratio: Quantitative
Company_location: Categorical
Company_size: Categorical
Glimpse of data
ds_salary <- read.csv("data/ds_salaries.csv")
skimr::skim(ds_salary)
| Name | ds_salary |
| Number of rows | 607 |
| Number of columns | 12 |
| _______________________ | |
| Column type frequency: | |
| character | 7 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| experience_level | 0 | 1 | 2 | 2 | 0 | 4 | 0 |
| employment_type | 0 | 1 | 2 | 2 | 0 | 4 | 0 |
| job_title | 0 | 1 | 11 | 40 | 0 | 50 | 0 |
| salary_currency | 0 | 1 | 3 | 3 | 0 | 17 | 0 |
| employee_residence | 0 | 1 | 2 | 2 | 0 | 57 | 0 |
| company_location | 0 | 1 | 2 | 2 | 0 | 50 | 0 |
| company_size | 0 | 1 | 1 | 1 | 0 | 3 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| X | 0 | 1 | 303.00 | 175.37 | 0 | 151.5 | 303 | 454.5 | 606 | ▇▇▇▇▇ |
| work_year | 0 | 1 | 2021.41 | 0.69 | 2020 | 2021.0 | 2022 | 2022.0 | 2022 | ▂▁▆▁▇ |
| salary | 0 | 1 | 324000.06 | 1544357.49 | 4000 | 70000.0 | 115000 | 165000.0 | 30400000 | ▇▁▁▁▁ |
| salary_in_usd | 0 | 1 | 112297.87 | 70957.26 | 2859 | 62726.0 | 101570 | 150000.0 | 600000 | ▇▅▁▁▁ |
| remote_ratio | 0 | 1 | 70.92 | 40.71 | 0 | 50.0 | 100 | 100.0 | 100 | ▂▁▂▁▇ |
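The summary above lists 17 distinct values of `salary_currency`, so the raw `salary` column mixes currencies and `salary_in_usd` is the safer column to average. As a minimal sketch of the experience-level and company-size question, the code below computes the mean salary for each combination. It uses a small invented data frame (`toy_salary`, hypothetical values); the `EN` and `EX` codes come from the proposal text, while the `S`/`M`/`L` company size codes are an assumption based on the one-character, three-value `company_size` column.

```r
library(dplyr)

# Toy stand-in for `ds_salary`; all values are invented for illustration only.
# The S/M/L company size codes are an assumption.
toy_salary <- tibble(
  experience_level = c("EN", "EN", "EN", "EX", "EX"),
  company_size     = c("S", "L", "L", "L", "M"),
  salary_in_usd    = c(50000, 70000, 90000, 200000, 150000)
)

# Mean USD salary for each experience level and company size, highest first.
salary_summary <- toy_salary %>%
  group_by(experience_level, company_size) %>%
  summarize(
    n_employees     = n(),
    mean_salary_usd = mean(salary_in_usd),
    .groups         = "drop"
  ) %>%
  arrange(desc(mean_salary_usd))

salary_summary
```

Keeping `n_employees` next to each mean makes it easier to see which averages rest on only a handful of observations, which matters in a data set of 607 rows split across several groups.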