library(tidyverse)
library(skimr)
library(janitor)
US Mental Health
Proposal
Data 1: Mental Health Client-Level, 2020
Introduction and data
We are sourcing our data from the Substance Abuse and Mental Health Data Archive (SAMHDA); specifically, we use the “Mental Health Client-Level Data, 2020” data set. The full data set can be accessed via this link:
- https://pdas.samhsa.gov/#/survey/MH-CLD-2020-DS0001
The MH-CLD data are collected by state mental health agencies (SMHAs) that provide mental health treatment services. The SMHAs then report the data to SAMHSA. More information can be found here:
- https://www.samhsa.gov/data/data-we-collect/mh-cld-mental-health-client-level-data
The observations are focused on clients and include factors such as their demographic characteristics as well as any diagnosed mental health disorders.
Research question
- How do mental health disorders vary by race? Is there any correlation between an individual’s race and their diagnosed mental health disorder?
- SAMHSA's Mental Health Client-Level Data set provides information on over 6 million individuals, including their self-identified race and their mental health diagnosis. We are interested in the relationship between an individual’s race and mental health diagnosis. We hypothesize that the rate of mental health disorders differs between minority and non-minority groups as a result of added race-based societal pressures and experiences. However, because access to mental health resources is unequal, it will be difficult to determine whether the data accurately represent mental health disorders in minority groups.
- Both race and mental health diagnosis are categorical variables, though the data set provides the count of individuals in each category (frequency table). Thus there are quantitative values associated with each race and each mental health diagnosis.
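One possible way to examine an association between two categorical variables stored as counts is a chi-squared test of independence on the diagnosis-by-race table. The sketch below is illustrative only: the counts and labels in `toy_counts` are made-up placeholders, not values from the MH-CLD file, and the choice of a chi-squared test is an assumption rather than a settled analysis plan.

```r
# Hypothetical toy counts (NOT from MH-CLD): rows = diagnoses, columns = groups.
toy_counts <- matrix(
  c(120, 80,
     60, 90,
     40, 30),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    diagnosis = c("diagnosis_a", "diagnosis_b", "diagnosis_c"),
    race = c("group_1", "group_2")
  )
)

# Column proportions: share of each diagnosis within each group.
col_props <- prop.table(toy_counts, margin = 2)

# Chi-squared test of independence between diagnosis and group.
test_result <- chisq.test(toy_counts)

print(round(col_props, 3))
print(test_result)
```

With the real data, the same calls would take the race columns of the wide table (excluding the `overall` column), converted with `as.matrix()`.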
Glimpse of data
# Keep race, diagnosis, and count; reshape to one row per diagnosis and one column per race.
# Note: the skim() summary, not the reshaped table, is what gets stored in mhcld.
mhcld <- read_csv("data/MHCLD.csv") |>
  select(`Race`, `Mental health diagnosis one`, `Unweighted Count`) |>
  pivot_wider(names_from = "Race",
              values_from = "Unweighted Count") |>
  rename(mental_health_diagnosis = `Mental health diagnosis one`,
         overall = `Overall`,
         invalid = `Missing/unknown/not collected/invalid`,
         american_indian_alaska_native = `American Indian/Alaska Native`,
         asian = `Asian`,
         black_african_american = `Black or African American`,
         native_hawaiian_pacific_islander = `Native Hawaiian or Other Pacific Islander`,
         white = `White`,
         other_multiple = `Some other race alone/two or more races`
  ) |>
  skimr::skim()
Rows: 120 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (13): Race, Mental health diagnosis one, Total % SE, Total % CI (lower),...
dbl (4): Total %, Row %, Column %, Unweighted Count
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(mhcld)
Rows: 9
Columns: 17
$ skim_type <chr> "character", "numeric", "numeric", "numeric", "nu…
$ skim_variable <chr> "mental_health_diagnosis", "overall", "invalid", …
$ n_missing <int> 0, 0, 0, 0, 0, 0, 0, 0, 0
$ complete_rate <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1
$ character.min <int> 7, NA, NA, NA, NA, NA, NA, NA, NA
$ character.max <int> 62, NA, NA, NA, NA, NA, NA, NA, NA
$ character.empty <int> 0, NA, NA, NA, NA, NA, NA, NA, NA
$ character.n_unique <int> 15, NA, NA, NA, NA, NA, NA, NA, NA
$ character.whitespace <int> 0, NA, NA, NA, NA, NA, NA, NA, NA
$ numeric.mean <dbl> NA, 926069.467, 87284.667, 17782.933, 12230.800, …
$ numeric.sd <dbl> NA, 1718339.880, 162900.322, 33065.169, 22981.474…
$ numeric.p0 <dbl> NA, 18773, 2489, 694, 329, 2481, 42, 11750, 988
$ numeric.p25 <dbl> NA, 88733.0, 8986.5, 1222.5, 860.5, 23490.5, 155.…
$ numeric.p50 <dbl> NA, 520176, 35090, 8782, 5032, 82441, 948, 299023…
$ numeric.p75 <dbl> NA, 826827.5, 81542.0, 15619.5, 12735.5, 158377.5…
$ numeric.p100 <dbl> NA, 6945521, 654635, 133372, 91731, 1230018, 1529…
$ numeric.hist <chr> NA, "▇▁▁▁▁", "▇▁▁▁▁", "▇▁▁▁▁", "▇▁▁▁▁", "▇▁▁▁▁", …
Data 2: Drug Use and Health, 2020
Introduction and data
We are sourcing our data from the Substance Abuse and Mental Health Data Archive (SAMHDA). The data set is the National Survey on Drug Use and Health, 2020:
- https://pdas.samhsa.gov/#/survey/NSDUH-2020-DS0001
SAMHSA's National Survey on Drug Use and Health data set provides information on a sample of over 32 thousand persons, including the age at which they first drank alcohol and whether they have ever used other drugs. The data were collected in 2020 through surveys answered by participants.
Each row in the dataset gives the number of persons who first drank alcohol at a given age, along with counts of how many had and had not used certain kinds of drugs.
Research question
Does the age at which a person has their first alcoholic drink affect the likelihood that they later use harder substances? Is there a correlation between the use of different drugs and the age at which drinking begins?
Our research topic, based on data from the Substance Abuse and Mental Health Data Archive, is the relationship between the age of onset of alcohol use and subsequent drug use. Specifically, we will examine whether the age at which an individual starts drinking alcohol is associated with an increased likelihood of using harder substances, such as opioids or cocaine. We hypothesize that individuals who start drinking alcohol at a younger age are more likely to experiment with other drugs and develop substance use disorders than those who start drinking at an older age.
The variables in this dataset are all categorical, with numeric values serving as the counts for each combination of categories.
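Because each row carries counts rather than individual responses, one simple way to compare age groups is to turn the counts into shares. The sketch below assumes a table shaped like the cleaned data, with one row per age at first drink and `cocaine_yes` / `cocaine_no` count columns; the numbers and the `age_first_drink` column name are hypothetical placeholders, not NSDUH values.

```r
library(dplyr)

# Hypothetical toy weighted counts (NOT from NSDUH), shaped like the cleaned table.
toy_alcohol <- tibble(
  age_first_drink = c(12, 15, 18, 21),
  cocaine_yes = c(300, 500, 400, 100),
  cocaine_no = c(700, 1500, 3600, 1900)
)

# Share of each age group that reports ever using cocaine.
toy_shares <- toy_alcohol |>
  mutate(share_cocaine = cocaine_yes / (cocaine_yes + cocaine_no))

print(toy_shares)
```

Plotting `share_cocaine` against `age_first_drink` would then give a first visual check of the hypothesis.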
Glimpse of data
al_cocaine <- read_csv("data/Alcohol_Cocaine_Use.csv")
Rows: 402 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): EVER USED COCAINE, AGE WHEN FIRST DRANK ALCOHOLIC BEVERAGE, Total %...
dbl (9): Total %, Total % SE, Row %, Row % SE, Column %, Column % SE, Weight...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Reshape to one row per age at first drink and one column per cocaine-use response.
al_cocaine_clean <- al_cocaine |>
  clean_names() |>
  select(ever_used_cocaine, age_when_first_drank_alcoholic_beverage, weighted_count) |>
  pivot_wider(
    names_from = ever_used_cocaine,
    values_from = weighted_count
  )

# Drop the first row and the Overall column, then shorten the response labels.
al_cocaine_clean <- al_cocaine_clean |>
  slice(2:nrow(al_cocaine_clean)) |>
  select(-Overall) |>
  rename(
    cocaine_yes = `1 - Yes`,
    cocaine_no = `2 - No`,
    cocaine_baddata = `85 - BAD DATA Logically assigned`,
    cocanine_dknow = `94 - DON T KNOW`,
    cocaine_refuse = `97 - REFUSED`
  )
glimpse(al_cocaine_clean)
Rows: 66
Columns: 6
$ age_when_first_drank_alcoholic_beverage <chr> "1", "10", "11", "12", "13", "…
$ cocaine_yes <dbl> 36579.93, 949174.50, 752564.67…
$ cocaine_no <dbl> 117932.28, 1665175.04, 979946.…
$ cocaine_baddata <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ cocanine_dknow <dbl> 0.000, 0.000, 0.000, 945.995, …
$ cocaine_refuse <dbl> 0.0000, 0.0000, 0.0000, 0.0000…
Data 3: Treatment Episode and Discharges, 2020
Introduction and data
We are sourcing our data from the Substance Abuse and Mental Health Data Archive (SAMHDA). The link to the data can be found below:
Treatment Episode Data Set -- Discharges (TEDS-D), 2020
- https://pdas.samhsa.gov/#/survey/TEDS-D-2020-DS0001
This particular dataset is sourced from SAMHSA, but includes data that were initially collected by state agencies that provide mental health support. All of the collected data are available as public files. The data we are concerned with were collected in 2020.
The observations in this dataset include demographic, addiction, drug use, and employment information for patients at the time of discharge from their respective institutions.
Research question
One of the main complications in treatment is that patients leave prematurely, for many different reasons. In our research we will therefore investigate the relationship between the number of days before discharge and the substance used at admission. More specifically, we will use the patient's age as a control to see whether the type of substance used has an effect on the number of days before discharge. This question matters for planning treatment for patients suffering from addiction: the findings will give a rough idea of the time needed before discharge, as well as of common unplanned (and damaging) reasons for discharge.
Using the recorded discharge data, we hope to gain a better understanding of which types of substance abuse contribute to failed treatment, with the aim of informing treatment plans for those types of abuse. By combining length of stay with type of substance, and narrowing our findings by demographic data such as age, ethnicity, employment status, and location, we hope to find a correlation between the substances abused and the time required for treatment. We hypothesize that substances known to have a higher rate of addiction and abuse will lead to the most negative outcomes in terms of reason for discharge, while more recreational and “weaker” substances will have higher success rates.
The type of substance used is a categorical variable, while age and time until discharge are discrete numerical variables. The count for each combination is also a numerical variable.
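Since the data arrive as counts per combination of categories, one way to compare substances is a count-weighted completion rate. The sketch below assumes columns named `treatment_type`, `substance_used`, and `n`, as in the cleaned table; the counts, the substance labels, and the "Dropped out" label are hypothetical placeholders, not TEDS-D values.

```r
library(dplyr)

# Hypothetical toy counts (NOT from TEDS-D); "Dropped out" is an assumed label.
toy_discharges <- tibble(
  treatment_type = c("Treatment completed", "Dropped out",
                     "Treatment completed", "Dropped out"),
  substance_used = c("substance_a", "substance_a",
                     "substance_b", "substance_b"),
  n = c(60, 40, 30, 70)
)

# Completion rate per substance: completed discharges divided by all discharges.
toy_rates <- toy_discharges |>
  group_by(substance_used) |>
  summarise(
    completion_rate = sum(n[treatment_type == "Treatment completed"]) / sum(n),
    .groups = "drop"
  )

print(toy_rates)
```

The same grouping could be extended with `length_of_stay` to compare completion rates across lengths of stay.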
Glimpse of data
# Skip the first 799 lines of the file, then name and keep the columns of interest.
discharge_data <- read_csv("data/discharge_data.csv", skip = 799) |>
  clean_names() |>
  rename(treatment_type = `overall`,
         length_of_stay = `x9`,
         substance_used = `hallucinogens`,
         n = `x13`) |>
  select(treatment_type, length_of_stay, substance_used, n) |>
  filter(length_of_stay != "Overall")
New names:
• `N/A` -> `N/A...5`
• `N/A` -> `N/A...6`
• `N/A` -> `N/A...7`
• `N/A` -> `N/A...9`
• `N/A` -> `N/A...10`
• `N/A` -> `N/A...11`
• `N/A` -> `N/A...13`
• `N/A` -> `N/A...14`
• `N/A` -> `N/A...15`
• `N/A` -> `N/A...16`
• `N/A` -> `N/A...18`
Rows: 1596 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): Overall, 9, Hallucinogens, N/A...5, N/A...6, N/A...7, N/A...9, N/A...
dbl (4): 1.39652651627867E-05, 0.012682926829268300, 0.0014777765147209300, 13
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Remove rows with no substance recorded and the "Overall" summary rows.
discharge_filtered <- discharge_data |>
  filter(substance_used != "None") |>
  filter(substance_used != "Overall")

discharge_filtered
# A tibble: 1,406 × 4
treatment_type length_of_stay substance_used n
<chr> <chr> <chr> <dbl>
1 Treatment completed 1 Missing/unknown/not collected/inval… 52886
2 Treatment completed 1 Methamphetamine/speed 2543
3 Treatment completed 1 Other amphetamines 115
4 Treatment completed 1 Other stimulants 70
5 Treatment completed 1 Benzodiazepines 173
6 Treatment completed 1 Other tranquilizers 2
7 Treatment completed 1 Barbiturates 4
8 Treatment completed 1 Other sedatives or hypnotics 39
9 Treatment completed 1 Inhalants 6
10 Treatment completed 1 Over-the-counter medications 7
# ℹ 1,396 more rows