Dataset Exploration

Proposal

library(tidyverse)
library(skimr)

Data 1

Introduction and data

Identify the source of the data.

The data come from Opportunity Insights, a team of researchers based at Harvard University that analyzes data on how to improve upward mobility and works with stakeholders to implement policy changes.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

They collected anonymized data on 20 million Americans (who are in their mid-thirties today), mapping them back to the Census tract in which they grew up. For each of the 70,000 census tracts in America, they then estimated children’s average earnings, incarceration rates, and other outcomes by parental income level, race, gender, and county.
Write a brief description of the observations.

The first dataset reports predicted outcomes for children (born between 1978 and 1983) by county, race, and gender, with each row corresponding to one county. The second dataset reports key statistics on children’s income distributions based on their parents’ incomes by college tier and birth year/cohort.

Research question

How do the experiences and outcomes of low-income students who attend elite colleges compare to those who attend other types of colleges and universities, and what factors contribute to these differences?
- Hypothesis: Low-income students who attend elite colleges will have better outcomes (higher graduation rates and higher incomes) than those attending other types of colleges and universities. Factors contributing to these differences might include differences in resources, such as academic support and networking opportunities, between elite and other colleges.
What are the long-term outcomes (e.g., educational attainment, income, health) for individuals who are able to overcome their disadvantaged family background, and how do these outcomes compare to those of individuals from more privileged backgrounds?
- Hypothesis: Individuals who are able to overcome their disadvantaged family background will experience similar long-term outcomes to individuals from more privileged backgrounds.
- Variables: documentation found here: https://opportunityinsights.org/wp-content/uploads/2018/04/Codebook-MRC-Table-8.pdf
  - tier : Selectivity and type combination (1-14)
  - par_ventile: Parent income ventiles 1 to 20, with 1 for bottom ventile and 20 for top ventile. (par_ventile = 99 provides data for the 99th parent income percentile)
  - par_mean: Mean parent household income in par_pctile-tier cell
  - k_mean: Mean kid earnings
  - k_q[KIDQUINT]: Fraction of kids in income quintile [KIDQUINT], where KIDQUINT is an integer from 1 to 5 (1 is the bottom quintile and 5 is the top).
  - k_nowork: Fraction of kids not working
  - k_median: Median child individual earnings in 2014 (the mean of the middle 3 observations in each cell when sorted on income, or the middle 4 when the count is even).
  - k_median_nozero: Median child individual earnings when excluding zeros, defined analogously to k_median.
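
As a rough sketch of how these variables could be used for this question, one might compare mean child earnings across college tiers within the bottom parent-income ventile. The six rows below are copied from the `head()` output of `mrc_table8.csv` shown in the data glimpse (including its `k_mean` column); the object names `child_outcome_sample` and `low_income_by_tier` are placeholders introduced here, and the real analysis would run on the full file.

```r
library(dplyr)

# Six rows copied from the head() output of mrc_table8.csv
# (1980 cohort, bottom parent-income ventile); the full file has 3,528 rows.
child_outcome_sample <- tibble(
  cohort      = 1980,
  par_ventile = 1,
  tier        = 1:6,
  par_mean    = c(6200, 6500, 7000, 6800, 7200, 7000),
  k_mean      = c(106300, 71800, 63600, 60300, 41300, 43900)
)

# Rank tiers by mean child earnings among children from the bottom ventile
low_income_by_tier <- child_outcome_sample %>%
  filter(par_ventile == 1) %>%
  group_by(tier) %>%
  summarize(mean_kid_earnings = mean(k_mean), .groups = "drop") %>%
  arrange(desc(mean_kid_earnings))

low_income_by_tier
```

With only one cohort in the sample, the grouped mean simply returns each row's `k_mean`; on the full data it would average over cohorts, which may or may not be the comparison we ultimately want.
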
How does the extent of intergenerational mobility vary across different regions, and what factors contribute to these differences?
- Hypothesis: Regions with greater income gaps exhibit lower levels of intergenerational mobility rates, while regions with stronger public education systems and more job opportunities will have higher levels of intergenerational mobility.
- Variables: [outcome]_[race]_[gender]_p[pctile]: Mean predicted outcome for children of a given race, gender and with parents at a given percentile in the national household income distribution.
  - list of [outcome] values found here: https://opportunityinsights.org/wp-content/uploads/2019/07/Codebook-for-Table-5.pdf
  - [race] is either pooled, white, Black, Hispanic, Asian, Native American (natam), or other
  - [gender] is either pooled, male, or female
  - [pctile] is either the 1st, 25th, 50th, 75th, or 100th percentile
- [outcome]_[race]_[gender]_mean: Mean outcome for children of a given race and gender
- frac_below_median_[race]_[gender]: Fraction of children with parents who have income below the national median for parents with children in the same birth cohort (children are weighted by the total number of childhood years spent in the county)
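
Because the outcome, race, gender, and percentile are all encoded in the column names, one possible way to work with them is to reshape the table into long form. The sketch below uses two rows and three columns copied from the `head()` output of `county_outcomes.csv` in the data glimpse; the regular expression and the names `county_sample`, `county_long`, and `estimate` are assumptions made here for illustration, not something taken from the codebook.

```r
library(dplyr)
library(tidyr)

# Two rows copied from the head() output of county_outcomes.csv
county_sample <- tibble(
  state  = c(1, 1),
  county = c(1, 3),
  kir_natam_female_p1  = c(NA, 0.344),
  kir_natam_female_p25 = c(NA, 0.344),
  kir_natam_female_p50 = c(NA, 0.344)
)

# Split [outcome]_[race]_[gender]_p[pctile] into four separate columns
county_long <- county_sample %>%
  pivot_longer(
    cols = matches("_p[0-9]+$"),
    names_to = c("outcome", "race", "gender", "pctile"),
    names_pattern = "^(.+)_([a-z]+)_([a-z]+)_p([0-9]+)$",
    names_transform = list(pctile = as.integer),
    values_to = "estimate"
  )

county_long
```

The `cols` selection only picks up names ending in `_p` plus digits, so columns such as the `_n`, `_mean`, and `_se` variants would be left out and need separate handling.
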
Identify the types of variables in your research question. Categorical? Quantitative?

Glimpse of data

# import and view data
county_outcomes <- read_csv(file.path("data", "county_outcomes.csv"))

Rows: 3219 Columns: 10827
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr     (1): czname
dbl (10818): state, county, kir_natam_female_p1, kir_natam_female_p25, kir_n...
lgl     (8): kfr_imm_natam_female_p25_se, kfr_imm_natam_female_p75_se, kir_i...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(county_outcomes)

# A tibble: 6 × 10,827
  state county kir_natam_female_p1 kir_natam_female_p25 kir_natam_female_p50
  <dbl>  <dbl>               <dbl>                <dbl>                <dbl>
1     1      1              NA                   NA                   NA    
2     1      3               0.344                0.344                0.344
3     1      5              NA                   NA                   NA    
4     1      7              NA                   NA                   NA    
5     1      9              NA                   NA                   NA    
6     1     11              NA                   NA                   NA    
# ℹ 10,822 more variables: kir_natam_female_p75 <dbl>,
#   kir_natam_female_p100 <dbl>, kir_natam_female_n <dbl>,
#   kir_natam_female_mean <dbl>, jail_natam_female_p1 <dbl>,
#   jail_natam_female_p25 <dbl>, jail_natam_female_p50 <dbl>,
#   jail_natam_female_p75 <dbl>, jail_natam_female_p100 <dbl>,
#   jail_natam_female_n <dbl>, jail_natam_female_mean <dbl>,
#   kfr_natam_female_p1 <dbl>, kfr_natam_female_p25 <dbl>, …

child_outcome <- read_csv(file.path("data", "mrc_table8.csv"))

Rows: 3528 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): tier_name
dbl (21): cohort, par_ventile, tier, par_mean, k_mean, k_rank, k_top1pc, k_t...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(child_outcome)

# A tibble: 6 × 22
  cohort par_ventile  tier tier_name    par_mean k_mean k_rank k_top1pc k_top5pc
   <dbl>       <dbl> <dbl> <chr>           <dbl>  <dbl>  <dbl>    <dbl>    <dbl>
1   1980           1     1 Ivy Plus         6200 106300  0.700  0.0970    0.333 
2   1980           1     2 Other elite…     6500  71800  0.671  0.0500    0.217 
3   1980           1     3 Highly sele…     7000  63600  0.650  0.0327    0.158 
4   1980           1     4 Highly sele…     6800  60300  0.626  0.0283    0.118 
5   1980           1     5 Selective p…     7200  41300  0.556  0.00790   0.0529
6   1980           1     6 Selective p…     7000  43900  0.568  0.0103    0.0683
# ℹ 13 more variables: k_top10pc <dbl>, k_q5 <dbl>, k_q4 <dbl>, k_q3 <dbl>,
#   k_q2 <dbl>, k_q1 <dbl>, k_nowork <dbl>, married <dbl>, k_median <dbl>,
#   k_median_nozero <dbl>, count <dbl>, tot_count <dbl>, density <dbl>

Data 2

Introduction and data

We acquired our dataset from Kaggle. It was created by “Team Dan” on GitHub (https://github.com/romeoben/DSC7-Sprint2-TeamDan) in June 2021. The data was collected by scraping TikTok and compiling the results into CSV files. Unfortunately, no further detail is available on how the data was collected.

The observations are recorded in a CSV file. Each observation includes a track name and artist name, along with the track's duration and measures of popularity, danceability, and energy.

Research question

What factors contribute to making a track popular on TikTok? This is a critical question to consider, given that TikTok popularity has paved the way for several artists to start their careers and to gain money and recognition.

Hypothesis: Tracks that are used in TikToks made by popular creators, that are turned into dances, or that contain words that can be used for memes become popular.

Variables

The categorical variables include track_name, artist_name, and release_date. The quantitative variables include duration (measured in milliseconds), popularity (integer scale from 0 to 100), danceability (float scale from 0 to 1), and energy (float scale from 0 to 1).

Glimpse of data

# import and view data
tiktok <- read_csv(file.path("data", "tiktok.csv"))

Rows: 6746 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (9): track_id, track_name, artist_id, artist_name, album_id, release_da...
dbl (14): duration, popularity, danceability, energy, key, loudness, mode, s...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

skim(tiktok) # summarize every column with skimr::skim()

Data summary
Name	tiktok
Number of rows	6746
Number of columns	23
_______________________
Column type frequency:
character	9
numeric	14
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
track_id	1	22	22	3560
track_name	1	2	116	3209
artist_id	1	22	22	2087
artist_name	1	2	34	2086
album_id	1	22	22	3278
release_date	1	4	10	1257
playlist_id	1	22	22	2558
playlist_name	1	11	99	2558
genre	1	7	18	4

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
duration	1	194428.74	59208.59	43426.00	155866.00	186980.00	224284.00	716206.00	▆▇▁▁▁
popularity	1	57.65	24.62	0.00	44.00	64.00	76.00	100.00	▂▂▅▇▃
danceability	1	0.74	0.14	0.15	0.66	0.76	0.84	0.99	▁▁▃▇▆
energy	1	0.62	0.17	0.02	0.50	0.62	0.75	1.00	▁▂▇▇▃
key	1	5.29	3.75	0.00	1.00	6.00	9.00	11.00	▇▂▃▅▆
loudness	1	-7.01	2.85	-26.89	-8.53	-6.61	-5.02	1.08	▁▁▁▇▂
mode	1	0.58	0.49	0.00	0.00	1.00	1.00	1.00	▆▁▁▁▇
speechiness	1	0.14	0.13	0.02	0.05	0.08	0.20	0.91	▇▂▁▁▁
acousticness	1	0.22	0.24	0.00	0.03	0.13	0.32	0.99	▇▃▁▁▁
instrumentalness	1	0.03	0.14	0.00	0.00	0.00	0.00	0.96	▇▁▁▁▁
liveness	1	0.18	0.14	0.02	0.09	0.12	0.22	0.95	▇▂▁▁▁
valence	1	0.55	0.23	0.03	0.37	0.54	0.74	1.00	▃▆▇▇▅
tempo	1	120.53	25.60	54.37	100.05	120.99	135.00	216.05	▂▆▇▂▁
duration_mins	1	3.24	0.99	0.72	2.60	3.12	3.74	11.94	▆▇▁▁▁
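
The quantiles in the summary above offer a quick check on the units of `duration`. The sketch below copies the five quantiles of `duration` and `duration_mins` from the skim output and tests whether dividing by 60,000 (milliseconds per minute) reproduces the minutes column; the object name `converted` is introduced here, and the check only suggests the unit rather than proving it.

```r
# Quantiles (p0, p25, p50, p75, p100) copied from the skim() output
duration      <- c(43426, 155866, 186980, 224284, 716206)
duration_mins <- c(0.72, 2.60, 3.12, 3.74, 11.94)

# If duration is stored in milliseconds, dividing by 60,000 gives minutes
converted <- round(duration / 60000, 2)

# Compare against the reported duration_mins quantiles
all.equal(converted, duration_mins)
```

If the comparison holds, `duration` is most plausibly recorded in milliseconds, which matters when describing the variable or plotting it on a readable scale.
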

Data 3

Introduction and data

We acquired the dataset from the CDC Youth Risk Behavior Survey (YRBS), covering 2013 to 2019. The survey used a three-stage cluster sample design to produce a representative sample of high schoolers (9th through 12th grade students) in the U.S. (excluding Puerto Rico and the Virgin Islands). The population comprised all types of schooling (public, private, Catholic, etc.). The data was adjusted by applying a weighting factor to account for nonresponse and the overrepresentation of Black and Hispanic students in the results. Finally, the weights were scaled so that the weighted count of students equaled the total sample size and the weighted proportions of students in each grade matched population projections for each survey year. The dataset includes 180 schools and 13,677 usable questionnaires. There are 99 questions covering topics such as driving habits, tobacco use, and other “risky” behaviors.
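
To make the weight-scaling step concrete, here is a minimal sketch with invented numbers. The weights and responses are hypothetical, as are the names `raw_weight`, `scaled_weight`, and `ever_used`; the sketch only shows rescaling so that the weights sum to the sample size, not the grade-level adjustment or the CDC's actual procedure.

```r
# Hypothetical raw weights and responses for five students (invented values)
raw_weight <- c(0.8, 1.5, 1.2, 0.6, 2.4)
ever_used  <- c(1, 0, 1, 1, 0)  # 1 = student reported the behavior

# Rescale so that the weights sum to the number of respondents
n <- length(raw_weight)
scaled_weight <- raw_weight * n / sum(raw_weight)
sum(scaled_weight)

# Weighted versus unweighted share reporting the behavior
weighted_share   <- weighted.mean(ever_used, scaled_weight)
unweighted_share <- mean(ever_used)
c(weighted = weighted_share, unweighted = unweighted_share)
```

Rescaling leaves weighted proportions unchanged, since every weight is multiplied by the same constant; it only changes the weighted count.
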

Research question

How has the prevalence of risky behaviors among teenagers, such as substance use and unprotected sex, changed over time, and what factors have contributed to these changes?
Has the prevalence of risky behaviors among teenagers, in particular nicotine consumption, actually declined (i.e., did vaping essentially replace cigarette use)?
Hypothesis

The prevalence of risky behaviors among teenagers has decreased over time as a consequence of increased awareness and education about engaging in such behaviors.
Variables: We chose these variables in order to gain coverage of tobacco use/e-cigarette use, drug usage (marijuana), highly illicit drug usage (heroin), and sexual behavior:
- year
- tobacco-use: Have you ever used an electronic vapor product? / Have you ever tried cigarette smoking, even one or two puffs?
- marijuana-use: During your life, how many times have you used marijuana? (filter out those who responded that they never used marijuana before)
- illicit-use: During your life, how many times have you used heroin (also called smack, junk, or China White)? (filter out those who have never used heroin before)
- safe-sex: The last time you had sexual intercourse, did you or your partner use a condom? (filter data for only “yes” responses)
- Our quantitative variables include the proportion of those who have engaged in risky behavior per year.
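
The yearly proportions described above could be computed along the following lines. This is only a sketch on invented data: the tibble `teens_toy` and the columns `ever_smoked` and `ever_vaped` are hypothetical stand-ins, since the real file stores responses in coded question columns that would first need to be recoded, and survey weights are ignored here.

```r
library(dplyr)

# Hypothetical recoded responses (invented values, not YRBS data)
teens_toy <- tibble(
  year        = c(2013, 2013, 2013, 2013, 2019, 2019, 2019, 2019),
  ever_smoked = c(TRUE, TRUE, FALSE, NA, FALSE, TRUE, FALSE, FALSE),
  ever_vaped  = c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)
)

# Share of respondents reporting each behavior, by survey year
risk_by_year <- teens_toy %>%
  group_by(year) %>%
  summarize(
    prop_smoked = mean(ever_smoked, na.rm = TRUE),
    prop_vaped  = mean(ever_vaped, na.rm = TRUE),
    .groups = "drop"
  )

risk_by_year
```

Dropping missing responses with `na.rm = TRUE` is one choice among several; treating skipped questions differently would change the denominators.
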

Statement of importance

Understanding how the prevalence of risky behaviors among teenagers has evolved over time and what may have contributed to these changes is crucial for developing effective education within schools and communities. By creating these preventative measures and harm reduction strategies, we can greatly reduce the tragic consequences of these risky behaviors, including addiction and death.

Glimpse of data

# import and view data
library("readxl")
teens <- read_excel(file.path("data", "teens.xlsx"))
head(teens)

# A tibble: 6 × 109
  site  raceeth q6orig q7orig record orig_rec q1    q2    q3    q4    q5   
  <chr> <chr>   <chr>  <chr>   <dbl> <lgl>    <chr> <chr> <chr> <chr> <chr>
1 XX    7       504    121         1 NA       5     2     2     1     A    
2 XX    8       503    119         2 NA       4     2     2     2     A  D 
3 XX    8       506    95          3 NA       4     1     2     2     B  E 
4 XX    5       510    152         4 NA       4     2     2     2     E    
5 XX    6       510    130         5 NA       5     2     2     1     <NA> 
6 XX    5       506    165         6 NA       4     1     2     2     E    
# ℹ 98 more variables: q6 <dbl>, q7 <dbl>, q8 <chr>, q9 <chr>, q10 <chr>,
#   q11 <chr>, q12 <chr>, q13 <chr>, q14 <chr>, q15 <chr>, q16 <chr>,
#   q17 <chr>, q18 <chr>, q19 <chr>, q20 <chr>, q21 <chr>, q22 <chr>,
#   q23 <chr>, q24 <chr>, q25 <chr>, q26 <chr>, q27 <chr>, q28 <chr>,
#   q29 <chr>, q30 <chr>, q31 <chr>, q32 <chr>, q33 <chr>, q34 <chr>,
#   q35 <chr>, q36 <chr>, q37 <chr>, q38 <chr>, q39 <chr>, q40 <chr>,
#   q41 <chr>, q42 <chr>, q43 <chr>, q44 <chr>, q45 <chr>, q46 <chr>, …