Impact of Education Level on Change in Mental Health during COVID

Proposal

Data 1 - Mental Health Post-COVID

Introduction and data

  • This data comes from Data.gov.

  • It was originally created 2 years ago by the U.S. Census Bureau and 5 other federal agencies via the Household Pulse Survey.

  • Each observation covers a 12-day period and reports a mental health indicator for one group, such as age, sex, region (state), or race.

Research question

  • Does post-COVID mental health differ by region (state)?
  • COVID-19 affected people’s mental wellness, but the impact could vary from region to region in the U.S. We hypothesize that regions with good weather, such as California, were the least affected by COVID-19 in terms of mental health.
  • The regions (states) are categorical, while the mental health index (score) is quantitative.
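
Because state is categorical and the indicator value is quantitative, one simple way to compare regions is the mean value per state. The following is a minimal sketch in Python using only the standard library. The column names `Group`, `State`, and `Value` come from the data summary; the sample rows are made up for illustration, the helper `mean_value_by_state` is hypothetical, and the `"By State"` label used to pick out state-level rows is an assumption about how the file marks them.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Illustrative rows only (not real survey values).
# Assumption: state-level rows are labelled "By State" in the Group column.
SAMPLE_CSV = """Indicator,Group,State,Value
A,By State,California,30.0
A,By State,California,34.0
A,By State,Ohio,36.0
A,By State,Ohio,
A,By Age,United States,40.0
"""


def mean_value_by_state(rows, group_label="By State"):
    """Average the quantitative Value within each categorical State."""
    values = defaultdict(list)
    for row in rows:
        # Skip rows for other groupings and rows with a missing Value.
        if row["Group"] != group_label or row["Value"] == "":
            continue
        values[row["State"]].append(float(row["Value"]))
    return {state: mean(vals) for state, vals in values.items()}


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
state_means = mean_value_by_state(rows)

# List states from lowest to highest mean indicator value.
for state, avg in sorted(state_means.items(), key=lambda kv: kv[1]):
    print(f"{state}: {avg:.1f}")
```

Rows with a missing `Value` are skipped rather than treated as zero, since the summary reports 490 missing values in that column.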

Glimpse of data

Rows: 10404 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): Indicator, Group, State, Subgroup, Phase, Time Period Label, Time ...
dbl  (5): Time Period, Value, LowCI, HighCI, Suppression Flag

Data summary
Name mental_health
Number of rows 10404
Number of columns 15
_______________________
Column type frequency:
character 10
numeric 5
________________________
Group variables None

Variable type: character

skim_variable           n_missing  complete_rate  min  max  empty  n_unique  whitespace
Indicator                       0           1.00   44   98      0         4           0
Group                           0           1.00    6   45      0        10           0
State                           0           1.00    4   20      0        52           0
Subgroup                        0           1.00    4   69      0        80           0
Phase                           0           1.00    1   19      0         8           0
Time Period Label               0           1.00   20   27      0        38           0
Time Period Start Date          0           1.00   10   10      0        38           0
Time Period End Date            0           1.00   10   10      0        38           0
Confidence Interval           490           0.95    9   11      0      7709           0
Quartile Range               3672           0.65    5   22      0       500           0

Variable type: numeric

skim_variable     n_missing  complete_rate   mean     sd   p0   p25   p50   p75  p100  hist
Time Period               0           1.00  28.13  11.04  1.0  20.0  29.0  37.0  45.0  ▁▅▇▇▇
Value                   490           0.95  17.45   8.27  1.4  10.3  16.2  24.0  62.9  ▇▇▃▁▁
LowCI                   490           0.95  14.77   7.66  0.8   8.0  13.9  20.8  53.2  ▇▆▃▁▁
HighCI                  490           0.95  20.48   9.05  2.0  12.9  19.2  27.4  71.9  ▇▇▃▁▁
Suppression Flag      10382           0.00   1.00   0.00  1.0   1.0   1.0   1.0   1.0  ▁▁▇▁▁

Data 2 - Music Popularity

Introduction and data

  • This dataset is sourced from the Million Song Dataset, a collection created by The Echo Nest and LabROSA.

  • The data was collected by The Echo Nest, which aimed to gather statistics on one million popular, contemporary songs, and by LabROSA, which studies machine learning in music.

    Both organizations were funded by the U.S. National Science Foundation to create a dataset for evaluating the composition of commercially successful songs.

  • Each observation includes the song name, artist, song length, and release year of a popular, contemporary song. There is also some composition and mixing information, such as the timing of musical bars and the song’s fade-in.

Research question

  • How does the popularity of contemporary songs relate to their artists? How does a song’s initial popularity relate to its later growth in popularity?

  • Music has become an important part of people’s lives, as it alleviates stress and provides an outlet from daily stressors. Internet platforms have also made music more accessible to listeners across the globe. We hypothesize that artist familiarity is positively correlated with the popularity of the artist’s songs.

  • Song popularity and artist familiarity/hotness are quantitative variables. Artist terms are categorical and include genres such as latin jazz and heavy metal.
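
Since both variables in the hypothesis are quantitative, the relationship can be summarized with a Pearson correlation coefficient. The sketch below is a minimal illustration in plain Python. The value pairs are made up and stand in for the `artist.familiarity` and `song.hotttnesss` columns, and `pearson_r` is a hypothetical helper. Treating negative `song.hotttnesss` values as missing is an assumption, suggested by the minimum of -1.00 in the summary, and would need to be checked against the dataset documentation.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up (artist.familiarity, song.hotttnesss) pairs, for illustration only.
pairs = [(0.2, 0.1), (0.4, 0.35), (0.5, -1.0), (0.6, 0.3), (0.8, 0.7)]

# Assumption: a negative hotttnesss marks a missing score, so drop those pairs.
kept = [(fam, hot) for fam, hot in pairs if hot >= 0]

r = pearson_r([fam for fam, _ in kept], [hot for _, hot in kept])
print(f"Pearson r on {len(kept)} pairs: {r:.3f}")
```

A value of `r` near 1 would be consistent with the hypothesis, while a value near 0 would suggest little linear relationship.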

Glimpse of data

Rows: 10000 Columns: 35
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): artist.id, artist.name, artist.terms, song.id
dbl (31): artist.familiarity, artist.hotttnesss, artist.latitude, artist.loc...

Data summary
Name music
Number of rows 10000
Number of columns 35
_______________________
Column type frequency:
character 4
numeric 31
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
artist.id 0 1 18 18 0 3888 0
artist.name 0 1 1 255 0 4412 0
artist.terms 5 1 2 40 0 458 0
song.id 0 1 18 51 0 10000 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
artist.familiarity 0 1 0.57 0.16 0.00 0.47 0.56 0.67 1.00 ▁▂▇▅▂
artist.hotttnesss 0 1 0.39 0.14 0.00 0.33 0.38 0.45 1.08 ▁▇▃▁▁
artist.latitude 0 1 13.90 20.36 -41.28 0.00 0.00 34.42 69.65 ▁▇▁▃▁
artist.location 0 1 0.08 7.80 0.00 0.00 0.00 0.00 780.00 ▇▁▁▁▁
artist.longitude 0 1 -23.92 43.72 -162.44 -73.95 0.00 0.00 174.77 ▁▂▇▁▁
artist.similar 0 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ▁▁▇▁▁
artist.terms_freq 0 1 224.89 22392.16 0.00 0.95 1.00 1.00 2239217.00 ▇▁▁▁▁
release.id 0 1 371024.06 236777.83 0.00 172858.00 333103.00 573532.50 823599.00 ▇▇▅▆▅
release.name 0 1 23.10 1322.90 0.00 0.00 0.00 0.00 85555.00 ▇▁▁▁▁
song.artist_mbtags 0 1 0.00 0.00 0.00 0.00 0.00 0.00 0.33 ▇▁▁▁▁
song.artist_mbtags_count 0 1 0.52 0.88 0.00 0.00 0.00 1.00 9.00 ▇▁▁▁▁
song.bars_confidence 0 1 0.24 0.29 0.00 0.04 0.12 0.35 8.86 ▇▁▁▁▁
song.bars_start 0 1 1.07 1.72 0.00 0.44 0.79 1.22 59.74 ▇▁▁▁▁
song.beats_confidence 0 1 0.61 0.32 0.00 0.41 0.69 0.88 1.00 ▃▂▃▆▇
song.beats_start 0 1 0.43 0.81 -60.00 0.19 0.33 0.50 12.25 ▁▁▁▁▇
song.duration 0 1 240.62 246.08 1.04 176.03 223.06 276.38 22050.00 ▇▁▁▁▁
song.end_of_fade_in 0 1 0.76 1.86 0.00 0.00 0.20 0.42 43.12 ▇▁▁▁▁
song.hotttnesss 0 1 -0.24 0.69 -1.00 -1.00 0.00 0.41 1.00 ▇▁▃▆▂
song.key 0 1 5.37 9.67 0.00 2.00 5.00 8.00 904.80 ▇▁▁▁▁
song.key_confidence 0 1 0.45 0.33 0.00 0.22 0.47 0.66 19.08 ▇▁▁▁▁
song.loudness 0 1 -10.48 5.40 -51.64 -13.16 -9.38 -6.53 0.57 ▁▁▁▆▇
song.mode 0 1 0.69 0.46 0.00 0.00 1.00 1.00 1.00 ▃▁▁▁▇
song.mode_confidence 0 1 0.48 0.19 0.00 0.36 0.49 0.61 1.00 ▂▅▇▅▁
song.start_of_fade_out 0 1 229.88 112.02 -21.39 168.86 213.86 266.27 1813.43 ▇▁▁▁▁
song.tatums_confidence 0 1 0.51 0.33 0.00 0.24 0.50 0.77 9.23 ▇▁▁▁▁
song.tatums_start 0 1 0.30 0.51 0.00 0.11 0.19 0.29 12.25 ▇▁▁▁▁
song.tempo 0 1 122.90 35.20 0.00 96.96 120.16 144.01 262.83 ▁▆▇▂▁
song.time_signature 0 1 3.56 1.27 0.00 3.00 4.00 4.00 7.00 ▂▁▇▁▁
song.time_signature_confidence 0 1 0.60 8.99 0.00 0.10 0.55 0.86 898.89 ▇▁▁▁▁
song.title 0 1 10.01 945.49 0.00 0.00 0.00 0.00 94496.00 ▇▁▁▁▁
song.year 0 1 934.70 996.65 0.00 0.00 0.00 2000.00 2010.00 ▇▁▁▁▇

Data 3 - Coffee Quality

Introduction and data

  • This data comes from the CORGIS Dataset Project, compiled by Sam Donald.

  • The data was collected from the Coffee Quality Institute’s review pages in January 2018 by BuzzFeed data scientist James LeDoux.

  • The data covers both Arabica and Robusta beans from many countries, professionally rated on a 0-100 scale. It includes scores for attributes such as acidity, sweetness, fragrance, and balance.

Research question

  • Which region (continent) produces the best coffee?
  • Coffee has been introduced to almost all tropical regions of the world, but which region produces the highest-quality coffee? We hypothesize that South American countries produce the best coffee.
  • Regions are categorical, while coffee quality (score) is quantitative.
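
The dataset records a country rather than a continent, so answering this question requires mapping each country to a continent before comparing scores. The following is a minimal sketch in Python under stated assumptions: the `CONTINENT` lookup is hypothetical and covers only a few countries, the sample records are made up and stand in for the `Location.Country` and `Data.Scores.Total` columns, and `mean_score_by_continent` is an illustrative helper.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical, partial lookup; a real analysis would need every country in the data.
CONTINENT = {
    "Brazil": "South America",
    "Colombia": "South America",
    "Ethiopia": "Africa",
    "Kenya": "Africa",
    "Guatemala": "North America",
}

# Made-up (Location.Country, Data.Scores.Total) records, for illustration only.
samples = [
    ("Brazil", 82.0),
    ("Colombia", 84.0),
    ("Ethiopia", 86.0),
    ("Kenya", 84.0),
    ("Guatemala", 81.5),
]


def mean_score_by_continent(records, lookup):
    """Group total scores by continent and average them."""
    scores = defaultdict(list)
    for country, total in records:
        # Countries missing from the lookup are kept visible as "Unknown".
        scores[lookup.get(country, "Unknown")].append(total)
    return {continent: mean(vals) for continent, vals in scores.items()}


continent_means = mean_score_by_continent(samples, CONTINENT)
best = max(continent_means, key=continent_means.get)

for continent, avg in sorted(continent_means.items()):
    print(f"{continent}: {avg:.2f}")
print("Highest mean score:", best)
```

Because the sample values are invented, the output says nothing about the hypothesis; it only shows the shape of the comparison.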

Glimpse of data

Data summary
Name coffee
Number of rows 989
Number of columns 23
_______________________
Column type frequency:
character 7
numeric 16
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Location.Country 0 1 4 28 0 32 0
Location.Region 0 1 3 76 0 278 0
Data.Owner 0 1 3 50 0 263 0
Data.Type.Species 0 1 7 7 0 2 0
Data.Type.Variety 0 1 3 21 0 28 0
Data.Type.Processing.method 0 1 3 25 0 6 0
Data.Color 0 1 4 12 0 5 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Location.Altitude.Min 0 1 1640.08 9192.52 0 905.00 1300.00 1550.00 190164.00 ▇▁▁▁▁
Location.Altitude.Max 0 1 1675.93 9191.96 0 950.00 1310.00 1600.00 190164.00 ▇▁▁▁▁
Location.Altitude.Average 0 1 1658.00 9192.06 0 950.00 1300.00 1600.00 190164.00 ▇▁▁▁▁
Year 0 1 2013.55 1.66 2010 2012.00 2013.00 2015.00 2018.00 ▁▇▃▃▁
Data.Production.Number.of.bags 0 1 151.76 125.67 1 15.00 170.00 275.00 600.00 ▇▁▇▁▁
Data.Production.Bag.weight 0 1 210.49 1666.71 0 1.00 60.00 69.00 19200.00 ▇▁▁▁▁
Data.Scores.Aroma 0 1 7.57 0.40 0 7.42 7.58 7.75 8.75 ▁▁▁▁▇
Data.Scores.Flavor 0 1 7.52 0.42 0 7.33 7.50 7.75 8.83 ▁▁▁▁▇
Data.Scores.Aftertaste 0 1 7.39 0.43 0 7.25 7.42 7.58 8.67 ▁▁▁▁▇
Data.Scores.Acidity 0 1 7.54 0.40 0 7.33 7.58 7.75 8.75 ▁▁▁▁▇
Data.Scores.Body 0 1 7.51 0.39 0 7.33 7.50 7.67 8.50 ▁▁▁▁▇
Data.Scores.Balance 0 1 7.50 0.43 0 7.33 7.50 7.75 8.58 ▁▁▁▁▇
Data.Scores.Uniformity 0 1 9.82 0.59 0 10.00 10.00 10.00 10.00 ▁▁▁▁▇
Data.Scores.Sweetness 0 1 9.83 0.69 0 10.00 10.00 10.00 10.00 ▁▁▁▁▇
Data.Scores.Moisture 0 1 0.09 0.04 0 0.10 0.11 0.12 0.28 ▃▇▆▁▁
Data.Scores.Total 0 1 81.97 3.86 0 81.08 82.50 83.58 90.58 ▁▁▁▁▇