library(tidyverse)
library(skimr)
Dataset Exploration
Proposal
Data 1
Introduction and data
Identify the source of the data.
Opportunity Insights, a team of researchers based at Harvard University that studies how to improve upward mobility and works with stakeholders to implement policy changes.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
They collected anonymized data on 20 million Americans (who are in their mid-thirties today), mapping them back to the Census tract in which they grew up. For each of the 70,000 census tracts in America, they then estimated children’s average earnings, incarceration rates, and other outcomes by parental income level, race, gender, and county.
Write a brief description of the observations.
The first dataset reports predicted outcomes for children (born between 1978 and 1983) by county, race, and gender, with each row corresponding to one county. The second dataset reports key statistics on children’s income distributions based on their parents’ incomes by college tier and birth year/cohort.
Research question
How do the experiences and outcomes of low-income students who attend elite colleges compare to those who attend other types of colleges and universities, and what factors contribute to these differences?
Hypothesis: Low-income students who attend elite colleges will have better outcomes (higher graduation rates and higher incomes) than those attending other types of colleges and universities. Factors contributing to these differences might include differences in resources, such as academic support and networking opportunities, between elite and other colleges.
What are the long-term outcomes (e.g., educational attainment, income, health) for individuals who are able to overcome their disadvantaged family background, and how do these outcomes compare to those of individuals from more privileged backgrounds?
Hypothesis: Individuals who are able to overcome their disadvantaged family background will experience similar long-term outcomes to individuals from more privileged backgrounds.
Variables: documentation found here: https://opportunityinsights.org/wp-content/uploads/2018/04/Codebook-MRC-Table-8.pdf
tier : Selectivity and type combination (1-14)
par_ventile: Parent income ventiles 1 to 20, with 1 for bottom ventile and 20 for top ventile. (par_ventile = 99 provides data for the 99th parent income percentile)
par_mean: Mean parent household income in par_pctile-tier cell
k_mean: Mean kid earnings
k_q[KIDQUINT]: Fraction of kids in income quintile [KIDQUINT], where KIDQUINT is an integer from 1 to 5. 1 is the bottom quintile and 5 is the top.
k_nowork: Fraction of kids not working
k_median: Median child individual earnings in 2014 (the mean of the 3 middle observations in each cell, or the 4 middle observations when the count is even, when sorted on income).
k_median_nozero: Median child individual earnings when excluding zeros, defined analogously to k_median.
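As a rough sketch of how these variables could address the research question, the snippet below compares mean child earnings across college tiers for children from the bottom parent income ventile. It runs on a small made-up table (`child_outcome_toy`, a hypothetical stand-in whose values are illustrative only), and it assumes that a simple unweighted average across cohorts is acceptable for a first look. The column names follow the data preview for `mrc_table8.csv`.

```r
library(dplyr)

# Toy stand-in for mrc_table8.csv; the values are invented for illustration.
child_outcome_toy <- tibble(
  cohort      = c(1980, 1981, 1980, 1981, 1980, 1981),
  par_ventile = c(1, 1, 1, 1, 20, 20),
  tier_name   = c("Ivy Plus", "Ivy Plus", "Selective public",
                  "Selective public", "Ivy Plus", "Ivy Plus"),
  k_mean      = c(100000, 110000, 40000, 44000, 180000, 190000)
)

# Keep only the bottom parent income ventile, then average child
# earnings across cohorts within each college tier.
low_income_by_tier <- child_outcome_toy %>%
  filter(par_ventile == 1) %>%
  group_by(tier_name) %>%
  summarise(mean_kid_earnings = mean(k_mean), .groups = "drop") %>%
  arrange(desc(mean_kid_earnings))

low_income_by_tier
```

On the real data, the same pipeline could be applied to `child_outcome` directly; whether to weight cells by their size is a design choice this sketch leaves open.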
How does the extent of intergenerational mobility vary across different regions, and what factors contribute to these differences?
Hypothesis: Regions with greater income gaps will exhibit lower levels of intergenerational mobility, while regions with stronger public education systems and more job opportunities will have higher levels of intergenerational mobility.
Variables: [outcome]_[race]_[gender]_p[pctile]: Mean predicted outcome for children of a given race, gender and with parents at a given percentile in the national household income distribution.
- list of [outcomes] found here: https://opportunityinsights.org/wp-content/uploads/2019/07/Codebook-for-Table-5.pdf
[race] is either pooled, white, Black, Hispanic, Asian, Native American (natam), or other
[gender] is either pooled, male, or female
[pctile] is either the 1st, 25th, 50th, 75th, or 100th percentile
[outcome]_[race]_[gender]_mean: Mean outcome for children of a given race and gender
frac_below_median_[race]_[gender]: Fraction of children with parents who have income below the national median for parents with children in the same birth cohort (Children are weighted by the total number of childhood years spent in the county.)
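Because the outcome, race, gender, and percentile are all encoded in the column names, one possible first step is to reshape the wide table into a long one. The sketch below does this on a tiny made-up table (`county_toy`, hypothetical, with illustrative values) using `tidyr::pivot_longer()`. The regular expression is an assumption based on the naming scheme described above; it skips columns such as the `_se`, `_n`, and `_mean` variants.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for county_outcomes.csv; the values are invented for illustration.
county_toy <- tibble(
  state                = c(1, 1),
  county               = c(1, 3),
  kir_natam_female_p25 = c(NA, 0.344),
  kir_white_pooled_p25 = c(0.40, 0.42),
  jail_black_male_p75  = c(0.05, 0.04)
)

# Split names of the form [outcome]_[race]_[gender]_p[pctile] into
# four columns, dropping cells with no estimate.
county_long <- county_toy %>%
  pivot_longer(
    cols           = matches("_p\\d+$"),
    names_to       = c("outcome", "race", "gender", "pctile"),
    names_pattern  = "^(.+)_([^_]+)_([^_]+)_p(\\d+)$",
    values_to      = "estimate",
    values_drop_na = TRUE
  ) %>%
  mutate(pctile = as.integer(pctile))

county_long
```

In long form, filtering to one outcome and comparing counties or regions becomes an ordinary `filter()` and `group_by()` task rather than a search through thousands of columns.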
Identify the types of variables in your research question. Categorical? Quantitative?
Glimpse of data
# import and view data
county_outcomes <- read_csv(file.path("data", "county_outcomes.csv"))
Rows: 3219 Columns: 10827
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): czname
dbl (10818): state, county, kir_natam_female_p1, kir_natam_female_p25, kir_n...
lgl (8): kfr_imm_natam_female_p25_se, kfr_imm_natam_female_p75_se, kir_i...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(county_outcomes)
# A tibble: 6 × 10,827
state county kir_natam_female_p1 kir_natam_female_p25 kir_natam_female_p50
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 NA NA NA
2 1 3 0.344 0.344 0.344
3 1 5 NA NA NA
4 1 7 NA NA NA
5 1 9 NA NA NA
6 1 11 NA NA NA
# ℹ 10,822 more variables: kir_natam_female_p75 <dbl>,
# kir_natam_female_p100 <dbl>, kir_natam_female_n <dbl>,
# kir_natam_female_mean <dbl>, jail_natam_female_p1 <dbl>,
# jail_natam_female_p25 <dbl>, jail_natam_female_p50 <dbl>,
# jail_natam_female_p75 <dbl>, jail_natam_female_p100 <dbl>,
# jail_natam_female_n <dbl>, jail_natam_female_mean <dbl>,
# kfr_natam_female_p1 <dbl>, kfr_natam_female_p25 <dbl>, …
child_outcome <- read_csv(file.path("data", "mrc_table8.csv"))
Rows: 3528 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): tier_name
dbl (21): cohort, par_ventile, tier, par_mean, k_mean, k_rank, k_top1pc, k_t...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(child_outcome)
# A tibble: 6 × 22
cohort par_ventile tier tier_name par_mean k_mean k_rank k_top1pc k_top5pc
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1980 1 1 Ivy Plus 6200 106300 0.700 0.0970 0.333
2 1980 1 2 Other elite… 6500 71800 0.671 0.0500 0.217
3 1980 1 3 Highly sele… 7000 63600 0.650 0.0327 0.158
4 1980 1 4 Highly sele… 6800 60300 0.626 0.0283 0.118
5 1980 1 5 Selective p… 7200 41300 0.556 0.00790 0.0529
6 1980 1 6 Selective p… 7000 43900 0.568 0.0103 0.0683
# ℹ 13 more variables: k_top10pc <dbl>, k_q5 <dbl>, k_q4 <dbl>, k_q3 <dbl>,
# k_q2 <dbl>, k_q1 <dbl>, k_nowork <dbl>, married <dbl>, k_median <dbl>,
# k_median_nozero <dbl>, count <dbl>, tot_count <dbl>, density <dbl>
Data 2
Introduction and data
We acquired our dataset from Kaggle. It was created by “Team Dan” on GitHub (https://github.com/romeoben/DSC7-Sprint2-TeamDan) in June 2021. The data was collected by scraping data from TikTok and compiling it into csv files. Unfortunately, there isn’t further detail as to how this data was collected.
The observations are recorded in a csv file. Each observation includes the track name and artist name, along with the track's duration and measures of popularity, danceability, and energy.
Research question
What factors contribute to making a track popular on TikTok? This is a critical question to consider, given that TikTok popularity has paved the way for several artists to start their careers and to gain money and recognition.
Hypothesis: Tracks that are used in TikToks made by popular creators, that are turned into dances, or that contain words that can be used for memes become popular.
Variables
The categorical variables include track_name, artist_name, and release_date. The quantitative variables include duration (measured in milliseconds), popularity (integer scale from 0 to 100), danceability (float scale from 0 to 1), and energy (float scale from 0 to 1).
Glimpse of data
# import and view data
tiktok <- read_csv(file.path("data", "tiktok.csv"))
Rows: 6746 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): track_id, track_name, artist_id, artist_name, album_id, release_da...
dbl (14): duration, popularity, danceability, energy, key, loudness, mode, s...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skim(tiktok)
Name | tiktok |
Number of rows | 6746 |
Number of columns | 23 |
_______________________ | |
Column type frequency: | |
character | 9 |
numeric | 14 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
track_id | 0 | 1 | 22 | 22 | 0 | 3560 | 0 |
track_name | 0 | 1 | 2 | 116 | 0 | 3209 | 0 |
artist_id | 0 | 1 | 22 | 22 | 0 | 2087 | 0 |
artist_name | 0 | 1 | 2 | 34 | 0 | 2086 | 0 |
album_id | 0 | 1 | 22 | 22 | 0 | 3278 | 0 |
release_date | 0 | 1 | 4 | 10 | 0 | 1257 | 0 |
playlist_id | 0 | 1 | 22 | 22 | 0 | 2558 | 0 |
playlist_name | 0 | 1 | 11 | 99 | 0 | 2558 | 0 |
genre | 0 | 1 | 7 | 18 | 0 | 4 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
duration | 0 | 1 | 194428.74 | 59208.59 | 43426.00 | 155866.00 | 186980.00 | 224284.00 | 716206.00 | ▆▇▁▁▁ |
popularity | 0 | 1 | 57.65 | 24.62 | 0.00 | 44.00 | 64.00 | 76.00 | 100.00 | ▂▂▅▇▃ |
danceability | 0 | 1 | 0.74 | 0.14 | 0.15 | 0.66 | 0.76 | 0.84 | 0.99 | ▁▁▃▇▆ |
energy | 0 | 1 | 0.62 | 0.17 | 0.02 | 0.50 | 0.62 | 0.75 | 1.00 | ▁▂▇▇▃ |
key | 0 | 1 | 5.29 | 3.75 | 0.00 | 1.00 | 6.00 | 9.00 | 11.00 | ▇▂▃▅▆ |
loudness | 0 | 1 | -7.01 | 2.85 | -26.89 | -8.53 | -6.61 | -5.02 | 1.08 | ▁▁▁▇▂ |
mode | 0 | 1 | 0.58 | 0.49 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▆▁▁▁▇ |
speechiness | 0 | 1 | 0.14 | 0.13 | 0.02 | 0.05 | 0.08 | 0.20 | 0.91 | ▇▂▁▁▁ |
acousticness | 0 | 1 | 0.22 | 0.24 | 0.00 | 0.03 | 0.13 | 0.32 | 0.99 | ▇▃▁▁▁ |
instrumentalness | 0 | 1 | 0.03 | 0.14 | 0.00 | 0.00 | 0.00 | 0.00 | 0.96 | ▇▁▁▁▁ |
liveness | 0 | 1 | 0.18 | 0.14 | 0.02 | 0.09 | 0.12 | 0.22 | 0.95 | ▇▂▁▁▁ |
valence | 0 | 1 | 0.55 | 0.23 | 0.03 | 0.37 | 0.54 | 0.74 | 1.00 | ▃▆▇▇▅ |
tempo | 0 | 1 | 120.53 | 25.60 | 54.37 | 100.05 | 120.99 | 135.00 | 216.05 | ▂▆▇▂▁ |
duration_mins | 0 | 1 | 3.24 | 0.99 | 0.72 | 2.60 | 3.12 | 3.74 | 11.94 | ▆▇▁▁▁ |
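The summary above reports 3,560 unique track_id values across 6,746 rows, so the same track appears to occur more than once, and the duration values are consistent with milliseconds (a mean of about 194,429 corresponds to the 3.24 in duration_mins). A minimal sketch of a first pass at the research question, assuming one row per track is wanted, is shown below on a made-up table (`tiktok_toy`, hypothetical, with illustrative values).

```r
library(dplyr)

# Toy stand-in for tiktok.csv; the values are invented for illustration.
tiktok_toy <- tibble(
  track_id     = c("a", "a", "b", "c", "d"),
  duration     = c(180000, 180000, 240000, 150000, 210000),
  popularity   = c(80, 80, 40, 90, 60),
  danceability = c(0.8, 0.8, 0.5, 0.9, 0.7),
  energy       = c(0.7, 0.7, 0.6, 0.5, 0.8)
)

# Keep one row per track and express duration in minutes.
tracks <- tiktok_toy %>%
  distinct(track_id, .keep_all = TRUE) %>%
  mutate(duration_min = duration / 60000)

# Pearson correlation of each candidate factor with popularity.
popularity_cor <- tracks %>%
  summarise(across(c(danceability, energy, duration_min),
                   ~ cor(.x, popularity)))

popularity_cor
```

Correlations only describe linear association, so they would be a starting point for choosing variables rather than evidence of what makes a track popular.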
Data 3
Introduction and data
We acquired the dataset from the CDC Youth Risk Behavior Survey (YRBS), covering 2013 through 2019. The survey used a three-stage cluster sample design to produce a representative sample of high school students (9th through 12th grade) in the U.S. (excluding Puerto Rico and the Virgin Islands). The population comprised all types of schools (public, private, Catholic, etc.). The data were adjusted by applying a weighting factor to account for nonresponse and the overrepresentation of Black and Hispanic students in the results. Finally, the weights were scaled so that the weighted count of students equaled the total sample size and the weighted proportions of students in each grade matched population projections for each survey year. The dataset draws on 180 schools and includes 13,677 usable questionnaires. There are 99 questions on topics ranging from driving habits and tobacco use to other “risky” behaviors.
Research question
How has the prevalence of risky behaviors among teenagers, such as substance use and unprotected sex, changed over time, and what factors have contributed to these changes?
Has the prevalence of risky behaviors among teenagers, in particular nicotine consumption, actually declined (i.e., did vaping essentially replace cigarette use)?
Hypothesis
The prevalence of risky behaviors among teenagers has decreased over time as a consequence of increased awareness and education about engaging in such behaviors.
Variables: We chose these variables in order to gain coverage of tobacco use/e-cigarette use, drug usage (marijuana), highly illicit drug usage (heroin), and sexual behavior:
year
tobacco-use: Have you ever used an electronic vapor product? / Have you ever tried cigarette smoking, even one or two puffs?
marijuana-use: During your life, how many times have you used marijuana? (filter out those who responded that they never used marijuana before)
illicit-use: During your life, how many times have you used heroin (also called smack, junk, or China White)? (filter out those who have never used heroin before)
safe-sex: The last time you had sexual intercourse, did you or your partner use a condom? (filter data for only “yes” responses)
Our quantitative variables include the proportion of those who have engaged in risky behavior per year.
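The per-year proportions described above could be computed along the following lines. This is a sketch on a made-up table (`teens_toy`); the `year` and `ever_smoked` columns are hypothetical names, since the survey file stores responses in coded `q` columns that would first need to be recoded. The sketch also ignores the survey weights mentioned earlier, which a final analysis would likely need to apply.

```r
library(dplyr)

# Toy stand-in for the recoded survey data; the values are invented for illustration.
teens_toy <- tibble(
  year        = c(2013, 2013, 2013, 2019, 2019, 2019, 2019),
  ever_smoked = c(TRUE, FALSE, NA, TRUE, TRUE, TRUE, FALSE)
)

# Unweighted share of respondents reporting the behavior in each year,
# excluding missing responses from the denominator.
prevalence_by_year <- teens_toy %>%
  group_by(year) %>%
  summarise(
    n_respondents    = sum(!is.na(ever_smoked)),
    prop_ever_smoked = mean(ever_smoked, na.rm = TRUE),
    .groups = "drop"
  )

prevalence_by_year
```

Repeating the same summary for each behavior would give one proportion per behavior per year, which can then be plotted over time.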
Statement of importance
Understanding how the prevalence of risky behaviors among teenagers has evolved over time and what may have contributed to these changes is crucial for developing effective education within schools and communities. By creating these preventative measures and harm reduction strategies, we can greatly reduce the tragic consequences of these risky behaviors, including addiction and death.
Glimpse of data
# import and view data
library("readxl")
teens <- read_excel(file.path("data", "teens.xlsx"))
head(teens)
# A tibble: 6 × 109
site raceeth q6orig q7orig record orig_rec q1 q2 q3 q4 q5
<chr> <chr> <chr> <chr> <dbl> <lgl> <chr> <chr> <chr> <chr> <chr>
1 XX 7 504 121 1 NA 5 2 2 1 A
2 XX 8 503 119 2 NA 4 2 2 2 A D
3 XX 8 506 95 3 NA 4 1 2 2 B E
4 XX 5 510 152 4 NA 4 2 2 2 E
5 XX 6 510 130 5 NA 5 2 2 1 <NA>
6 XX 5 506 165 6 NA 4 1 2 2 E
# ℹ 98 more variables: q6 <dbl>, q7 <dbl>, q8 <chr>, q9 <chr>, q10 <chr>,
# q11 <chr>, q12 <chr>, q13 <chr>, q14 <chr>, q15 <chr>, q16 <chr>,
# q17 <chr>, q18 <chr>, q19 <chr>, q20 <chr>, q21 <chr>, q22 <chr>,
# q23 <chr>, q24 <chr>, q25 <chr>, q26 <chr>, q27 <chr>, q28 <chr>,
# q29 <chr>, q30 <chr>, q31 <chr>, q32 <chr>, q33 <chr>, q34 <chr>,
# q35 <chr>, q36 <chr>, q37 <chr>, q38 <chr>, q39 <chr>, q40 <chr>,
# q41 <chr>, q42 <chr>, q43 <chr>, q44 <chr>, q45 <chr>, q46 <chr>, …