Exploring the Age Factor:

Substance Use, Abuse, and the Impact of Age on Patterns and Behaviors

Introduction

As recreational drug use becomes more prevalent and accepted in the United States, particularly in the case of recreational Cannabis use, it is important to understand the broader trends and potentially broad implications those trends have on society. This report is a relatively small contribution to the wider body of knowledge concerning drug use in the United States that seeks to understand the relationships between age and substance use and abuse. We wanted to understand whether age was a contributing factor to whether individuals abused substances and if whether it could inform individuals choice in the substance they used from some of the most popular recreational drugs: alcohol, cannabis, tobacco, and cocaine.

Using widely available data from the National Survey on Drug Use and Health we looked at the summary statistics for the rates of use and abuse of a particular substance for a given age category and formulated a null-hypothesis significance to evaluate whether age played a significant role in drug abuse and “drug of choice”. We were curious as to whether individuals, especially adolescents, tended towards one substance or another and whether that changed overtime with age. We find this question especially pressing as legalization efforts accross the United States for substances such as Cannabis, which is shown to have certain medical uses, have increased the amount of discourse about the implications of legalization/acceptance of certain substances. As for our question of age’s role, in abuse we wondered whether the potential of increased ease of access as a result of legalization could lead younger individuals to consume substance they might not have otherwise.

Our findings below suggest that it is unlikely that age is the determining factor in substance abuse, although it does not rule out the role of age entirely. To be more specific, the results indicate the use rate for marijuana is higher in states with medical, decriminalized, or recreational legalization compared to states with illegal status for the 18-25 age group. However, it is important to note that this association is not uniform across all age groups, with the 12-17 age group showing a less pronounced increase in use rate due to legalization.

Data description

Motivation:

The Drugs data set from CORGIS was compiled by the CORGIS to be used as open source data in analysis by whoever wanted to use it. It was compiled from data from the National Survey on Drug Use and Health (NSDUH), which is administered by the Substance Abuse and Mental Health Administration (SAMHSA) to “provide estimates of substance use and mental illness at the national, state, and sub-state levels” and “help identify the extent of substance use and mental illness among different subgroups, estimate trends over time, and determine the need for treatment services” to “allow researchers, clinicians, policymakers, and the general public to better understand and improve the nation’s behavioral health”. The data collection is publicly funded through the U.S. Department of Health and Human Services.

The State Marijuana Laws (2019) data set was compiled by Selene Arrazolo in 2016 as a personal project for data.world from data collected by Michael Maciag funded by Governing Data, a magazine based in Washington, D.C. that covers state and local government that is owned by e.Republic a Folsom, California-based research and media company that focuses on connecting private IT companies with government and education agencies and maintains multiple publications, platforms, and brands all concerned with technology and government. The data set was updated to reflect the legal status of marijuana by state in 2019 and include decriminalization as a variable.

Composition:

The drugs data set is comprised of summary functions of the survey data collected by SAMHSA through the NSDUH from 2002-2019. The data represents the responses of individuals collected in face-to-face household interviews in all 50 states and the District of Columbia. The survey it was collected through covers residents of households, persons in non-institutional group quarters, and civilians living on military bases. It excludes individuals experiencing homelessness who do not use shelters, active military personnel, and residents of institutional group quarters such as jails, nursing homes, mental institutions, and long-term care hospitals. The original SAMHSA data sets identified five subgroups by residence state and an exhaustive list of different substances. The data set compiled by CORGIS combined the five original age groups to form three, and only includes alcohol, cigarette, tobacco, marijuana, and cocaine use.

The State Marijuana Laws (2019) data set is comprised of boolean variables relating to the legal status of marijuana in all 50 states and the District of Columbia. It was collected through research of publicly available data.

Collection:

The drugs data set is a compilation of data collected through the NSDUH from 2002-2019. The survey is a face-to-face interview by trained interviewing staff managed by field supervisors who, in turn, reported to regional supervisors. The interviewing staff was trained and use a questionnaire written by SAMHSA to conduct the interviews. Starting in 2018 respondents were offered a $30 incentive which had the expected effect of increasing the response rate. Interviewing staff were trained on the ethics and regulations involving research on human subjects.

The State Marijuana Laws (2019) data set was collected from publicly available records on state laws and public information on the internet.

Processing/cleaning/labeling:

The raw data collected through the NSDUH was processed into summary statistics by SAMHSA, but CORGIS compiled the data set from the raw data from NSDUH. It includes both totals (in thousands of people) and rates (as a percentage of the population) for each age subgroup compiled, 12-17, 18-25, and 26+, as well as the related states each respondent resided in when the data was collected. The data set is made up of data collected annually through the NSDUH from 2002-2019.

The State Marijuana Laws (2019) data set is unprocessed and only links the legal status of marijuana in 2019 in each state to its respective row.

Uses:

All of the data used in this project has likely been used before in other analyses as all the data is publicly available. The State Marijuana Laws (2019) data set was created for the use of this project only but was compiled from existing publicly available information. All the raw data for the drugs data set is available on the SAMHSA website, and the raw data for the State Marijuana Laws (2019) data set is available on Wikipedia.

Data analysis

Research Question: Are individuals more likely to abuse one substance over another based on their age?

#import libraries
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.0
✔ tibble  3.2.1     ✔ dplyr   1.1.2
✔ tidyr   1.2.1     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

library(tidyr)
library(dplyr)
library(broom)
library(glmnet)

Loading required package: Matrix

Attaching package: 'Matrix'

The following objects are masked from 'package:tidyr':

    expand, pack, unpack

Loaded glmnet 4.1-7

library(ggplot2)
library(ggthemes)

#load data
drugUse <- read.csv('data/drugs.csv')
legalStatus <- read.csv('data/marijuana_legal_status_2002_2019.csv')

#preview data
skimr::skim(drugUse)

Data summary
Name	drugUse
Number of rows	867
Number of columns	53
_______________________
Column type frequency:
character	1
numeric	52
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
State	0	1	4	20	0	51	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Year	1	2010.00	4.90	2002.00	2006.00	2010.00	2014.00	2018.00	▇▆▆▆▇
Population.12.17	1	489714.13	563795.85	30551.00	131540.50	339685.00	541095.00	3293484.00	▇▂▁▁▁
Population.18.25	1	658880.04	755989.75	57395.00	174293.50	456240.00	746808.00	4469106.00	▇▁▁▁▁
Population.26.	1	3874155.48	4320775.92	310110.00	1027871.00	2698757.00	4509094.00	25917724.00	▇▂▁▁▁
Totals.Alcohol.Use.Disorder.Past.Year.12.17	1	19.22	25.29	0.00	5.00	11.00	24.00	204.00	▇▁▁▁▁
Totals.Alcohol.Use.Disorder.Past.Year.18.25	1	94.48	108.27	6.00	26.00	64.00	119.50	717.00	▇▁▁▁▁
Totals.Alcohol.Use.Disorder.Past.Year.26.	1	224.15	254.02	19.00	57.50	154.00	271.50	1586.00	▇▁▁▁▁
Rates.Alcohol.Use.Disorder.Past.Year.12.17	1	0.04	0.02	0.01	0.03	0.04	0.05	0.11	▇▇▅▁▁
Rates.Alcohol.Use.Disorder.Past.Year.18.25	1	0.15	0.04	0.07	0.12	0.15	0.18	0.27	▃▇▇▂▁
Rates.Alcohol.Use.Disorder.Past.Year.26.	1	0.06	0.01	0.03	0.05	0.06	0.07	0.11	▂▇▃▁▁
Totals.Alcohol.Use.Past.Month.12.17	1	65.80	77.68	3.00	17.00	43.00	81.00	540.00	▇▁▁▁▁
Totals.Alcohol.Use.Past.Month.18.25	1	393.02	440.01	32.00	99.00	258.00	480.00	2639.00	▇▁▁▁▁
Totals.Alcohol.Use.Past.Month.26.	1	2124.66	2372.87	167.00	525.00	1380.00	2623.00	14513.00	▇▂▁▁▁
Rates.Alcohol.Use.Past.Month.12.17	1	0.14	0.04	0.05	0.11	0.13	0.16	0.25	▂▇▇▃▁
Rates.Alcohol.Use.Past.Month.18.25	1	0.61	0.08	0.30	0.56	0.61	0.66	0.76	▁▁▅▇▃
Rates.Alcohol.Use.Past.Month.26.	1	0.55	0.08	0.28	0.51	0.56	0.61	0.72	▁▂▅▇▃
Totals.Tobacco.Cigarette.Past.Month.12.17	1	36.80	41.88	1.00	10.00	23.00	47.50	295.00	▇▁▁▁▁
Totals.Tobacco.Cigarette.Past.Month.18.25	1	209.94	219.53	14.00	56.00	147.00	265.00	1281.00	▇▂▁▁▁
Totals.Tobacco.Cigarette.Past.Month.26.	1	857.22	844.42	76.00	223.00	678.00	1066.00	4452.00	▇▂▁▁▁
Rates.Tobacco.Cigarette.Past.Month.12.17	1	0.08	0.04	0.01	0.05	0.08	0.11	0.20	▆▇▇▃▁
Rates.Tobacco.Cigarette.Past.Month.18.25	1	0.34	0.08	0.13	0.28	0.35	0.40	0.53	▂▅▇▇▁
Rates.Tobacco.Cigarette.Past.Month.26.	1	0.23	0.04	0.12	0.21	0.23	0.26	0.34	▁▅▇▅▁
Totals.Illicit.Drugs.Cocaine.Used.Past.Year.12.17	1	5.06	7.51	0.00	1.00	3.00	6.00	56.00	▇▁▁▁▁
Totals.Illicit.Drugs.Cocaine.Used.Past.Year.18.25	1	37.11	46.66	2.00	10.00	22.00	46.00	345.00	▇▁▁▁▁
Totals.Illicit.Drugs.Cocaine.Used.Past.Year.26.	1	59.11	72.74	2.00	14.00	36.00	75.00	585.00	▇▁▁▁▁
Rates.Illicit.Drugs.Cocaine.Used.Past.Year.12.17	1	0.01	0.01	0.00	0.01	0.01	0.01	0.03	▇▆▃▁▁
Rates.Illicit.Drugs.Cocaine.Used.Past.Year.18.25	1	0.06	0.02	0.02	0.04	0.06	0.07	0.12	▃▇▆▂▁
Rates.Illicit.Drugs.Cocaine.Used.Past.Year.26.	1	0.01	0.01	0.01	0.01	0.01	0.02	0.05	▇▆▁▁▁
Totals.Marijuana.New.Users.12.17	1	24.72	28.67	2.00	7.00	17.00	29.00	197.00	▇▁▁▁▁
Totals.Marijuana.New.Users.18.25	1	24.70	29.33	2.00	6.00	16.00	29.50	204.00	▇▁▁▁▁
Totals.Marijuana.New.Users.26.	1	5.53	8.92	0.00	1.00	3.00	6.00	119.00	▇▁▁▁▁
Rates.Marijuana.New.Users.12.17	1	0.06	0.01	0.03	0.05	0.06	0.07	0.10	▁▇▆▂▁
Rates.Marijuana.New.Users.18.25	1	0.08	0.02	0.03	0.06	0.07	0.09	0.16	▁▇▃▁▁
Rates.Marijuana.New.Users.26.	1	0.00	0.00	0.00	0.00	0.00	0.00	0.02	▇▂▁▁▁
Totals.Marijuana.Used.Past.Month.12.17	1	34.86	41.58	2.00	9.50	22.00	42.00	307.00	▇▁▁▁▁
Totals.Marijuana.Used.Past.Month.18.25	1	123.24	147.85	8.00	35.00	79.00	152.50	1106.00	▇▁▁▁▁
Totals.Marijuana.Used.Past.Month.26.	1	216.13	290.39	10.00	56.50	121.00	276.50	3086.00	▇▁▁▁▁
Rates.Marijuana.Used.Past.Month.12.17	1	0.07	0.02	0.04	0.06	0.07	0.08	0.14	▂▇▅▂▁
Rates.Marijuana.Used.Past.Month.18.25	1	0.19	0.05	0.08	0.15	0.18	0.22	0.39	▂▇▃▂▁
Rates.Marijuana.Used.Past.Month.26.	1	0.06	0.03	0.02	0.04	0.05	0.07	0.18	▇▅▁▁▁
Totals.Marijuana.Used.Past.Year.12.17	1	65.55	76.86	4.00	18.00	43.00	81.00	545.00	▇▁▁▁▁
Totals.Marijuana.Used.Past.Year.18.25	1	202.54	237.62	16.00	56.00	131.00	252.50	1687.00	▇▁▁▁▁
Totals.Marijuana.Used.Past.Year.26.	1	348.76	449.85	17.00	91.50	212.00	439.50	4476.00	▇▁▁▁▁
Rates.Marijuana.Used.Past.Year.12.17	1	0.14	0.03	0.09	0.12	0.13	0.16	0.23	▃▇▅▂▁
Rates.Marijuana.Used.Past.Year.18.25	1	0.31	0.07	0.17	0.27	0.30	0.35	0.53	▂▇▅▂▁
Rates.Marijuana.Used.Past.Year.26.	1	0.09	0.04	0.04	0.07	0.08	0.11	0.25	▇▆▂▁▁
Totals.Tobacco.Use.Past.Month.12.17	1	47.51	51.33	1.00	13.00	31.00	62.00	358.00	▇▁▁▁▁
Totals.Tobacco.Use.Past.Month.18.25	1	249.24	253.20	18.00	67.50	181.00	313.00	1488.00	▇▂▁▁▁
Totals.Tobacco.Use.Past.Month.26.	1	1029.91	1001.45	95.00	258.50	828.00	1289.00	5099.00	▇▂▁▁▁
Rates.Tobacco.Use.Past.Month.12.17	1	0.11	0.04	0.02	0.07	0.11	0.14	0.24	▅▇▇▃▁
Rates.Tobacco.Use.Past.Month.18.25	1	0.40	0.08	0.17	0.35	0.42	0.46	0.59	▁▃▆▇▂
Rates.Tobacco.Use.Past.Month.26.	1	0.28	0.05	0.15	0.25	0.28	0.31	0.41	▁▅▇▅▁

skimr::skim(legalStatus)

Data summary
Name	legalStatus
Number of rows	867
Number of columns	4
_______________________
Column type frequency:
character	2
numeric	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
State	0	1	4	20	0	51	0
Legal_Status	0	1	7	14	0	4	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Year	0	1	2010.00	4.90	2002	2006	2010	2014	2018	▇▆▆▆▇
Legalization_Year	0	1	2019.02	2.18	2012	2020	2020	2020	2020	▁▁▁▁▇

# subset for rate of alcohol use disorder in past year
alcohol_year_rate <- drugUse[, c("Rates.Alcohol.Use.Disorder.Past.Year.12.17", 
                                 "Rates.Alcohol.Use.Disorder.Past.Year.18.25", 
                                 "Rates.Alcohol.Use.Disorder.Past.Year.26.")]

# subset for rate of ilicit drug use in past year
ilicit_year_rate <- drugUse[, c("Rates.Illicit.Drugs.Cocaine.Used.Past.Year.12.17", 
                                "Rates.Illicit.Drugs.Cocaine.Used.Past.Year.18.25", 
                                "Rates.Illicit.Drugs.Cocaine.Used.Past.Year.26.")]

# subset for rate of marijuana use in past year
marijuana_year_rate <- drugUse[, c("Rates.Marijuana.Used.Past.Year.12.17", 
                                   "Rates.Marijuana.Used.Past.Year.18.25", 
                                   "Rates.Marijuana.Used.Past.Year.26.")]

# alc mean and sd
alcohol_year_rate_mean <- c(mean(drugUse$Rates.Alcohol.Use.Disorder.Past.Year.12.17), 
                            mean(drugUse$Rates.Alcohol.Use.Disorder.Past.Year.18.25), 
                            mean(drugUse$Rates.Alcohol.Use.Disorder.Past.Year.26.))
alcohol_year_rate_sd <- c(sd(drugUse$Rates.Alcohol.Use.Disorder.Past.Year.12.17), 
                          sd(drugUse$Rates.Alcohol.Use.Disorder.Past.Year.18.25), 
                          sd(drugUse$Rates.Alcohol.Use.Disorder.Past.Year.26.))
alcohol_summary <- data.frame(age_group = c("12-17", "18-25", "26+"),
                              mean = alcohol_year_rate_mean,
                              sd = alcohol_year_rate_sd)

# illicit mean and sd
ilicit_year_rate_mean <- c(mean(drugUse$Rates.Illicit.Drugs.Cocaine.Used.Past.Year.12.17), 
                           mean(drugUse$Rates.Illicit.Drugs.Cocaine.Used.Past.Year.18.25), 
                           mean(drugUse$Rates.Illicit.Drugs.Cocaine.Used.Past.Year.26.))
ilicit_year_rate_sd <- c(sd(drugUse$Rates.Illicit.Drugs.Cocaine.Used.Past.Year.12.17), 
                         sd(drugUse$Rates.Illicit.Drugs.Cocaine.Used.Past.Year.18.25), 
                         sd(drugUse$Rates.Illicit.Drugs.Cocaine.Used.Past.Year.26.))
ilicit_summary <- data.frame(age_group = c("12-17", "18-25", "26+"),
                              mean = ilicit_year_rate_mean,
                              sd = ilicit_year_rate_sd)

# marijuana mean and sd
marijuana_year_rate_mean <- c(mean(drugUse$Rates.Marijuana.Used.Past.Year.12.17), 
                              mean(drugUse$Rates.Marijuana.Used.Past.Year.18.25), 
                              mean(drugUse$Rates.Marijuana.Used.Past.Year.26.))
marijuana_year_rate_sd <- c(sd(drugUse$Rates.Marijuana.Used.Past.Year.12.17), 
                            sd(drugUse$Rates.Marijuana.Used.Past.Year.18.25), 
                            sd(drugUse$Rates.Marijuana.Used.Past.Year.26.))
marijuana_summary <- data.frame(age_group = c("12-17", "18-25", "26+"),
                                 mean = marijuana_year_rate_mean,
                                 sd = marijuana_year_rate_sd)

# summary tables
alcohol_summary

  age_group       mean          sd
1     12-17 0.04094951 0.018241042
2     18-25 0.15118001 0.038558341
3       26+ 0.05976760 0.009945844

ilicit_summary

  age_group       mean          sd
1     12-17 0.01028040 0.005499766
2     18-25 0.05692399 0.018145774
3       26+ 0.01496876 0.005256846

marijuana_summary

  age_group       mean         sd
1     12-17 0.13802775 0.02566284
2     18-25 0.31406891 0.06666459
3       26+ 0.09191627 0.03779188

# combine data into one df
summary_total_data <- rbind(
  transform(alcohol_summary, drug_type = "Alcohol"),
  transform(ilicit_summary, drug_type = "Illicit (Cocaine)"),
  transform(marijuana_summary, drug_type = "Marijuana")
)

# create histogram
ggplot(summary_total_data, aes(x = age_group, y = mean, fill = drug_type)) +
  geom_bar(position = "dodge", stat = "identity") +
# change color and theme
  scale_color_colorblind() +
  theme_minimal() +
# add labels
  labs(title = "Mean Rate of Drug Use in the Past Year vs. Age Group",
       x = "Age Group",
       y = "Mean Rate of Use",
       fill = "Drug Type")

This visualization shows the mean rate of drug use in the past year divided into 3 age groups (12-17, 18-25, 26+) Focus on 3 types of drugs: Alcohol, Illicit Drugs (Cocaine), and Marijuana. The visualization shows that age groups 18-25 have the overall highest mean rate of drugs used for all 3 categories in the past year.

rate_columns <- c(
  "Year", 
  "State", 
  "Rates.Alcohol.Use.Past.Month.12.17", 
  "Rates.Alcohol.Use.Past.Month.18.25", 
  "Rates.Alcohol.Use.Past.Month.26.",
  "Rates.Tobacco.Use.Past.Month.12.17",
  "Rates.Tobacco.Use.Past.Month.18.25",
  "Rates.Tobacco.Use.Past.Month.26.",
  "Rates.Marijuana.Used.Past.Month.12.17", 
  "Rates.Marijuana.Used.Past.Month.18.25", 
  "Rates.Marijuana.Used.Past.Month.26.",
  "Rates.Illicit.Drugs.Cocaine.Used.Past.Year.12.17", 
  "Rates.Illicit.Drugs.Cocaine.Used.Past.Year.18.25", 
  "Rates.Illicit.Drugs.Cocaine.Used.Past.Year.26.")

drugUse_pastMonthRates_sub <- drugUse[, rate_columns]

# reshape the data to long format
long_drugUse <- drugUse_pastMonthRates_sub |>
  pivot_longer(
    cols = starts_with("Rates"),
    names_to = c("Substance", "Age.Group"),
    names_pattern = "Rates\\.(.*)\\.Past\\.Year\\.(.*)",
    values_to = "Rate") |>
  mutate(Substance = str_extract(Substance, "Alcohol|Marijuana|Cocaine"))

# convert Age.Group numeric values to factors
long_drugUse$Age.Group <- factor(long_drugUse$Age.Group, 
                                 levels = c("12.17", "18.25", "26."), 
                                 labels = c("12-17", "18-25", "26+"))

#pivot data to create Substance Type and Age Group columns and factor those columns
drugUse_long_factored <- drugUse_pastMonthRates_sub |>
  pivot_longer(
    cols = starts_with("Rates."),
    names_to = c("Substance Type", "Age Group"),
    names_pattern = "(?:Rates|Rates.Illicit.Drugs)\\.([a-zA-Z ]*)\\.(?:Use|Used)\\.Past\\.(?:Month|Year)\\.(..*)?",
    values_to = "Use Rate") |>
  mutate( 
    `Age Group` = factor(
      `Age Group`,
      levels = c("12.17", "18.25", "26."),
      labels = c("12-17", "18-25", "26+")),
    `Substance Type` = factor(
      `Substance Type`,
      levels = c("Alcohol", "Marijuana", "Cocaine", "Tobacco"),
      labels = c("Alcohol", "Marijuana", "Cocaine", "Tobacco")))

#create grouped bar plot
drugUse_grouped_bar <- ggplot(data = drugUse_long_factored, 
       aes(x = `Age Group`, y = `Use Rate`, fill = `Substance Type`)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal() +
  labs(title = "Substance Use Rates by Age Group and Substance Type",
       x = "Age Group",
       y = "Use Rate",
       fill = "Substance Type") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

drugUse_grouped_bar

# calculate the max rate to adjust y axis accordingly
max_sum_rate <- long_drugUse |>
  group_by(Year, Substance) |>
  summarize(total_rate = sum(Rate)) |>
  ungroup() |>
  summarize(max_sum_rate = max(total_rate)) |>
  pull(max_sum_rate)

`summarise()` has grouped output by 'Year'. You can override using the
`.groups` argument.

# create faceted bar plot
faceted_rates_plot <- ggplot(drugUse_long_factored, 
       aes(
         x = `Substance Type`, 
         y = `Use Rate`, 
         fill = `Age Group`)) +
  geom_bar(
    stat = "identity", 
    position = "stack") +
  facet_wrap(~ Year, ncol = 3) +
  labs(
    title = "Substance Use Disorder Rates by Age Group",
    x = "Substance",
    y = "Rate") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    strip.background = element_blank(),
    aspect.ratio = .5) +
  scale_y_continuous(
    limits = c(0, max_sum_rate * 1.1), 
    breaks = seq(0, max_sum_rate * 1.1, by = 5))

faceted_rates_plot

#mutate drugUse_long_factored to add a new column that combines the state and age group variables
drugUse_heatmap_sub <- drugUse_long_factored |>
  mutate(
    `Substance AgeGroup` = paste(
      `Substance Type`,
      `Age Group`, 
      sep = " "))

#create a heatmap where each square represents the use rate or each age group in each state
drugUse_heatmap <- ggplot(data = drugUse_heatmap_sub, 
       aes(
         x = `Substance AgeGroup`, 
         y = State, 
         fill = `Use Rate`)) +
  geom_tile(height = 4) +
  scale_fill_gradient2(
    low = "white", 
    mid = "blue", 
    high = "darkblue", 
    midpoint = median(drugUse_heatmap_sub$`Use Rate`)) +
  theme_minimal() +
  labs(
    title = "Substance Use Rates by State Age Group and Substance Type",
    x = "State and Age Group",
    y = "Substance Type",
    fill = "Use Rate") +
   theme(
     axis.text.x = element_text(
       angle = 90, 
       hjust = 1, 
       vjust = 0.5))

drugUse_heatmap

#join drugUse and legalStatus
drugUse_legalStatus_joined <- drugUse_long_factored |>
  left_join(legalStatus, by = c("Year", "State")) |>
  filter(`Substance Type` == "Marijuana",
         Legalization_Year != 2020)

#create the faceted line plot with the use rate of marijuana 
#for each age group as a line and the facets as the states
ggplot(drugUse_legalStatus_joined, aes(x = Year, y = `Use Rate`, color = `Age Group`)) +
  geom_line() +
  facet_wrap(~ State, ncol = 3) +
  geom_vline(
    data = subset(
      drugUse_legalStatus_joined, 
      !is.na(Legalization_Year)), 
    aes(xintercept = Legalization_Year), 
    color = "red", 
    linetype = "dashed") +
  labs(
    title = "Marijuana Use Rate by Age Group and State",
    x = "Year",
    y = "Use Rate",
    color = "Age Group") +
  theme_minimal()

# subset for rate of alcohol use disorder in past month
alcohol_month_rate <- drugUse[, c("Rates.Alcohol.Use.Past.Month.12.17", 
                                 "Rates.Alcohol.Use.Past.Month.18.25", 
                                 "Rates.Alcohol.Use.Past.Month.26.")]

# subset for rate of tabacco use disorder in past month
tobacco_month_rate <- drugUse[, c("Rates.Tobacco.Cigarette.Past.Month.12.17", 
                                 "Rates.Tobacco.Cigarette.Past.Month.18.25", 
                                 "Rates.Tobacco.Cigarette.Past.Month.26.")]

# subset for rate of marijuana use in past month
marijuana_month_rate <- drugUse[, c("Rates.Marijuana.Used.Past.Month.12.17", 
                                   "Rates.Marijuana.Used.Past.Month.18.25", 
                                   "Rates.Marijuana.Used.Past.Month.26.")]

# alcohol mean and sd
alcohol_month_rate_mean <- c(mean(drugUse$Rates.Alcohol.Use.Past.Month.12.17), 
                            mean(drugUse$Rates.Alcohol.Use.Past.Month.18.25), 
                            mean(drugUse$Rates.Alcohol.Use.Past.Month.26.))
alcohol_month_rate_sd <- c(sd(drugUse$Rates.Alcohol.Use.Past.Month.12.17), 
                          sd(drugUse$Rates.Alcohol.Use.Past.Month.18.25), 
                          sd(drugUse$Rates.Alcohol.Use.Past.Month.26.))
alcohol_summary <- data.frame(age_group = c("12-17", "18-25", "26+"),
                              mean = alcohol_month_rate_mean,
                              sd = alcohol_month_rate_sd)

# tobacco mean and sd
tobacco_month_rate_mean <- c(mean(drugUse$Rates.Tobacco.Cigarette.Past.Month.12.17), 
                             mean(drugUse$Rates.Tobacco.Cigarette.Past.Month.18.25), 
                             mean(drugUse$Rates.Tobacco.Cigarette.Past.Month.26.))
tobacco_month_rate_sd <- c(sd(drugUse$Rates.Tobacco.Cigarette.Past.Month.12.17), 
                           sd(drugUse$Rates.Tobacco.Cigarette.Past.Month.18.25), 
                           sd(drugUse$Rates.Tobacco.Cigarette.Past.Month.26.))
tobacco_summary <- data.frame(age_group = c("12-17", "18-25", "26+"),
                              mean = tobacco_month_rate_mean,
                              sd = tobacco_month_rate_sd)

# marijuana mean and sd
marijuana_month_rate_mean <- c(mean(drugUse$Rates.Marijuana.Used.Past.Month.12.17), 
                               mean(drugUse$Rates.Marijuana.Used.Past.Month.18.25), 
                               mean(drugUse$Rates.Marijuana.Used.Past.Month.26.))
marijuana_month_rate_sd <- c(sd(drugUse$Rates.Marijuana.Used.Past.Month.12.17), 
                             sd(drugUse$Rates.Marijuana.Used.Past.Month.18.25), 
                             sd(drugUse$Rates.Marijuana.Used.Past.Month.26.))
marijuana_summary <- data.frame(age_group = c("12-17", "18-25", "26+"),
                                mean = marijuana_month_rate_mean,
                                sd = marijuana_month_rate_sd)

# combine data into one df
summary_total_data <- rbind(
  transform(alcohol_summary, drug_type = "Alcohol"),
  transform(tobacco_summary, drug_type = "Tabacco (Cigarette)"),
  transform(marijuana_summary, drug_type = "Marijuana")
)

# create histogram
ggplot(summary_total_data, aes(x = age_group, y = mean, fill = drug_type)) +
  geom_bar(position = "dodge", stat = "identity") +
# change color and theme
  scale_color_colorblind() +
  theme_minimal() +
# add labels
  labs(title = "Mean Rate of Drug Use in the Past Month vs. Age Group",
       x = "Age Group",
       y = "Mean Rate of Use",
       fill = "Drug Type")

This visualization shows the mean rate of drug use in the past month divided into 3 age groups (12-17, 18-25, 26+) Focus on 3 types of drugs: Alcohol, Tobacco (Cigarettes), and Marijuana. Similar to the previous visualization, we can clearly see that age groups 18-25 have the overall highest mean rate of drugs used for all 3 categories in the past month.

Evaluation of significance

Null Hypothesis: There is no difference in the means of substance use rates across age groups.

In the equation, ρ is the population correlation coefficient between age and the likelihood of abusing one substance over another.

\[ H_0: p = 0 \]

Alternative Hypothesis: There is a significant relationship between age and the likelihood of abusing one substance over another.

\[ H_A: p \neq 0 \]

# #| label: loaded-and-cleaned-the-data
# drugs <- read.csv("data/drugs.csv")
# marijuana_laws <- read.csv("data/marijuana_legal_status_2002_2019.csv")
# drugs_legal <- inner_join(drugs, marijuana_laws) %>%
#   mutate(marijuana_legal_status = if_else(Medical == "Yes", "Medical", if_else(Recreational == "Yes", "Recreational", if_else(Illegal == "Yes", "Illegal", "Decriminalized")))) %>%
#   select(-Medical:-Decriminalized)
# 
# 
# drugs_legal <- drugs_legal |>
#   mutate(alcohol_abuse = ifelse(Totals.Alcohol.Use.Disorder.Past.Year.12.17 + 
#                                 Totals.Alcohol.Use.Disorder.Past.Year.18.25 +
#                                 Totals.Alcohol.Use.Disorder.Past.Year.26. > 0, 1, 0),
#          marijuana_abuse = ifelse(Totals.Marijuana.Used.Past.Year.12.17 + 
#                                 Totals.Marijuana.Used.Past.Year.18.25 +
#                                 Totals.Marijuana.Used.Past.Year.26. > 0, 1, 0),
#          cocaine_abuse = ifelse(Totals.Illicit.Drugs.Cocaine.Used.Past.Year.12.17 + 
#                                 Totals.Illicit.Drugs.Cocaine.Used.Past.Year.18.25 +
#                                 Totals.Illicit.Drugs.Cocaine.Used.Past.Year.26. > 0, 1, 0),
#          tobacco_abuse = ifelse(Totals.Tobacco.Cigarette.Past.Month.12.17 + 
#                                 Totals.Tobacco.Cigarette.Past.Month.18.25 +
#                                 Totals.Tobacco.Cigarette.Past.Month.26. > 0, 1, 0))
# 
# 
# fit_alc <- glm(alcohol_abuse ~ Population.12.17 + Population.18.25 + Population.26., data = drugs_legal, family = binomial)
# 
# fit_marijuana <- glm(marijuana_abuse ~ Population.12.17 + Population.18.25 + Population.26., data = drugs_legal, family = binomial)
# 
# fit_cocaine <- glm(cocaine_abuse ~ Population.12.17 + Population.18.25 + Population.26., data = drugs_legal, family = binomial)
# 
# fit_tobacco <- glm(tobacco_abuse ~ Population.12.17 + Population.18.25 + Population.26., data = drugs_legal, family = binomial)
# 
# library(tidyr)
# library(dplyr)
# library(broom)
# library(glmnet)
# 
#   
# summary(fit_alc)
#  
# summary(fit_marijuana)
#  
# summary(fit_cocaine)
#  
# summary(fit_tobacco)
# 
# tidy(fit_alc) |> 
#   filter(term %in% c("Population.12.17", "Population.18.25", "Population.26.")) |> 
#   select(p.value)
#  
# tidy(fit_marijuana) |> 
#   filter(term %in% c("Population.12.17", "Population.18.25", "Population.26.")) |> 
#   select(p.value)
#  
# tidy(fit_cocaine) |> 
#   filter(term %in% c("Population.12.17", "Population.18.25", "Population.26.")) |> 
#   select(p.value)
#  
# tidy(fit_tobacco) |> 
#   filter(term %in% c("Population.12.17", "Population.18.25", "Population.26.")) |> 
#   select(p.value)

# filter the data frame for the substance type
substance_type <- "Alcohol"

filtered_data <- drugUse_long_factored |> 
  filter(`Substance Type` == substance_type)

# perform the ANOVA test
anova_result <- aov(
  `Use Rate` ~ `Age Group`, 
  data = filtered_data)

# check the ANOVA results
summary(anova_result)

              Df Sum Sq Mean Sq F value Pr(>F)    
`Age Group`    2 114.65   57.32   12550 <2e-16 ***
Residuals   2598  11.87    0.00                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# filter the data frame for the substance type
substance_type <- "Tobacco"

filtered_data <- drugUse_long_factored |> 
  filter(`Substance Type` == substance_type)

# perform the ANOVA test
anova_result <- aov(
  `Use Rate` ~ `Age Group`, 
  data = filtered_data)

# check the ANOVA results
summary(anova_result)

              Df Sum Sq Mean Sq F value Pr(>F)    
`Age Group`    2  38.49  19.244    5602 <2e-16 ***
Residuals   2598   8.92   0.003                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# filter the data frame for the substance type
substance_type <- "Marijuana"

filtered_data <- drugUse_long_factored |> 
  filter(`Substance Type` == substance_type)

# perform the ANOVA test
anova_result <- aov(
  `Use Rate` ~ `Age Group`, 
  data = filtered_data)

# check the ANOVA results
summary(anova_result)

              Df Sum Sq Mean Sq F value Pr(>F)    
`Age Group`    2  9.315   4.657    3560 <2e-16 ***
Residuals   2598  3.398   0.001                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# filter the data frame for the substance type
substance_type <- "Cocaine"

filtered_data <- drugUse_long_factored |> 
  filter(`Substance Type` == substance_type)

# perform the ANOVA test
anova_result <- aov(
  `Use Rate` ~ `Age Group`, 
  data = filtered_data)

# check the ANOVA results
summary(anova_result)

              Df Sum Sq Mean Sq F value Pr(>F)    
`Age Group`    2 1.1438  0.5719    4432 <2e-16 ***
Residuals   2598 0.3353  0.0001                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#join drugUse and legalStatus create a binary variable for recreational legal status
drugUse_legalStatus_joined <- drugUse_long_factored |>
  left_join(legalStatus, by = c("Year", "State")) 

#filter for adolescents and marijuana
filtered_drugUse_legalStatus_joined <- drugUse_legalStatus_joined |>
  filter(
    `Age Group` == "12-17", 
    `Substance Type` == "Marijuana")

#create binary for pre legalization years and post legalization years
filtered_drugUse_legalStatus_joined$Legal <- ifelse(
  filtered_drugUse_legalStatus_joined$Year >= filtered_drugUse_legalStatus_joined$Legalization_Year, 1, 0)

#model <- lm(`Use Rate` ~ Legal + factor(Year) + factor(State), data = filtered_drugUse_legalStatus_joined)


#summary(drugUse_legalStatus_DiD)

#summary(model)

#| label: performed-post-hoc-analysis
#tukey_hsd <- TukeyHSD(anova_result)
#tukey_hsd

#| label: performed-two-way-ANOVA
result <- aov(`Use Rate` ~ `Age Group` * `Substance Type`, data = drugUse_long_factored)

summary(result)

                                Df Sum Sq Mean Sq F value Pr(>F)    
`Age Group`                      2  96.03   48.01   20345 <2e-16 ***
`Substance Type`                 3 249.24   83.08   35204 <2e-16 ***
`Age Group`:`Substance Type`     6  67.56   11.26    4771 <2e-16 ***
Residuals                    10392  24.52    0.00                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

drugUse_legalStatus_joined <- drugUse_long_factored |>
  left_join(legalStatus, by = c("Year", "State")) |>
  filter(`Substance Type` == "Marijuana") |>
  mutate(
    Legal_Status = factor(Legal_Status, c("Illegal", "Medical", "Decriminalized", "Recreational")),
    `Age Group` = factor(`Age Group`, c("12-17", "18-25", "26+"))
  )

drugUse_legalStatus_joined$Legal_Status <- relevel(drugUse_legalStatus_joined$Legal_Status, ref = "Illegal")

drugUse_legalStatus_joined$`Age Group` <- relevel(drugUse_legalStatus_joined$`Age Group`, ref = "18-25")

# Perform linear regression analysis
model <- lm(`Use Rate` ~ Legal_Status * `Age Group`, data = drugUse_legalStatus_joined)

# Display summary of the model
summary(model)


Call:
lm(formula = `Use Rate` ~ Legal_Status * `Age Group`, data = drugUse_legalStatus_joined)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.08990 -0.01361 -0.00310  0.01024  0.16173 

Coefficients:
                                             Estimate Std. Error t value
(Intercept)                                  0.167322   0.001243 134.653
Legal_StatusMedical                          0.058988   0.002145  27.503
Legal_StatusDecriminalized                   0.015613   0.002810   5.557
Legal_StatusRecreational                     0.131145   0.004450  29.470
`Age Group`12-17                            -0.100493   0.001757 -57.185
`Age Group`26+                              -0.123360   0.001757 -70.198
Legal_StatusMedical:`Age Group`12-17        -0.041704   0.003033 -13.750
Legal_StatusDecriminalized:`Age Group`12-17 -0.010512   0.003974  -2.646
Legal_StatusRecreational:`Age Group`12-17   -0.103296   0.006293 -16.414
Legal_StatusMedical:`Age Group`26+          -0.028477   0.003033  -9.389
Legal_StatusDecriminalized:`Age Group`26+   -0.009111   0.003974  -2.293
Legal_StatusRecreational:`Age Group`26+     -0.038879   0.006293  -6.178
                                            Pr(>|t|)    
(Intercept)                                  < 2e-16 ***
Legal_StatusMedical                          < 2e-16 ***
Legal_StatusDecriminalized                  3.03e-08 ***
Legal_StatusRecreational                     < 2e-16 ***
`Age Group`12-17                             < 2e-16 ***
`Age Group`26+                               < 2e-16 ***
Legal_StatusMedical:`Age Group`12-17         < 2e-16 ***
Legal_StatusDecriminalized:`Age Group`12-17  0.00821 ** 
Legal_StatusRecreational:`Age Group`12-17    < 2e-16 ***
Legal_StatusMedical:`Age Group`26+           < 2e-16 ***
Legal_StatusDecriminalized:`Age Group`26+    0.02193 *  
Legal_StatusRecreational:`Age Group`26+     7.53e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.02703 on 2589 degrees of freedom
Multiple R-squared:  0.8513,    Adjusted R-squared:  0.8506 
F-statistic:  1347 on 11 and 2589 DF,  p-value: < 2.2e-16

Interpretation and conclusions

Each analysis resulted in a p-value of less than 2e-16, essentially zero. This implies a significant difference in use rate across age groups means that the average use rates of a given substance type are not the same for all age groups. In other words, the proportion of people using a specific substance within one age group is significantly different from the proportion of people using the same substance within another age group.

This finding implies that age is an important factor in the use of a particular substance type. It suggests that substance use behavior might be influenced by factors that vary across different age groups, such as social environment, peer influence, availability, or developmental stage.

We ran a second analysis using a linear regression model to try and determine whether the effects of legalization increased use rate and if this increase was uniform across age groups. The main effect coefficients for Legal_StatusMedical, Legal_StatusDecriminalized, and Legal_StatusRecreational show the differences in use rate for the 18-25 age group compared to the reference category (illegal status) when the age group is held constant. All these coefficients are positive, which indicates that the use rate for marijuana is higher in states with medical, decriminalized, or recreational legalization compared to states with illegal status for the 18-25 age group.

The difference in use rate between medical legalization and illegal status for the 12-17 age group is 0.041704 units lower than the difference for the 18-25 age group. This means that the effect of medical legalization on use rate is less pronounced for the 12-17 age group compared to the 18-25 age group.

The difference in use rate between decriminalized status and illegal status for the 12-17 age group is 0.010512 units lower than the difference for the 18-25 age group. This implies that the effect of decriminalized status on use rate is also less pronounced for the 12-17 age group compared to the 18-25 age group.

The difference in use rate between recreational legalization and illegal status for the 12-17 age group is 0.103296 units lower than the difference for the 18-25 age group. This indicates that the effect of recreational legalization on use rate is less pronounced for the 12-17 age group compared to the 18-25 age group.

The difference in use rate between medical legalization and illegal status for the 26+ age group is 0.028477 units lower than the difference for the 18-25 age group. This means that the effect of medical legalization on use rate is less pronounced for the 26+ age group compared to the 18-25 age group.

The difference in use rate between decriminalized status and illegal status for the 26+ age group is 0.009111 units lower than the difference for the 18-25 age group. This implies that the effect of decriminalized status on use rate is also less pronounced for the 26+ age group compared to the 18-25 age group.

The difference in use rate between recreational legalization and illegal status for the 26+ age group is 0.038879 units lower than the difference for the 18-25 age group. This indicates that the effect of recreational legalization on use rate is less pronounced for the 26+ age group compared to the 18-25 age group.

These interaction terms suggest that the effect of legalization on use rate is not the same across different age groups. For both the 12-17 age group and the 26+ age group, the effect of legalization (be it medical, decriminalized, or recreational) on use rate appears to be less pronounced compared to the 18-25 age group.

The results indicate that legalization (medical, decriminalized, or recreational) is associated with higher use rates, particularly for the 18-25 age group. However, this association is not uniform across all age groups, with the 12-17 age group showing a less pronounced increase in use rate due to legalization.

Limitations

Regarding the drug data set, there are a few possible limitations. For one, the data is self-reported, which can lead to loads of biases or inaccuracies in the results. Respondents may also underreport or possibly over-report substance use depending on various factors such as social desirability bias or memory recall. Also, another flaw of the drug data set is that it comes with a limited scope when it comes to the number of drugs described, which in the case of the data set only includes four substances: alcohol, cigarettes, marijuana, and cocaine. There are certainly more drugs out there that individuals use that the data set does not include, and it affects our overall scope of what substance abuse looks like in the United States.
Regarding the State Marijuana Laws data set, there are a few limitations. First, the data set represents information from 2019, but in the past three years, there have been many changes to the rules/laws of marijuana use in the United States. State laws concerning marijuana usage are constantly evolving, and as the data set does not reflect the most current legal status of marijuana in each state, there can be several issues. Another potential issue with the data set is that it may not capture the full range of factors that influence the legal status of marijuana in each state, such as public opinion, political factors, or the influence of special interest groups. The view on marijuana usage is constantly evolving, and it is an issue that the data set does not consider.

Acknowledgments

We would like to recognize some online resources that we found helpful while completing this project: - The info2950-s23 GIT Discussion Board was always a great resource that we could resort to if we felt lost or confused. Seeing similar issues that our peers faced and the course staff’s responses was a great way to solve some problems that we came across. - Another online resource that we found helpful was an online community called Stack Overflow. There were multiple instances where Stack Overflow helped us use code that we did not know of before or even learn difficult concepts that were not covered in class - One specific example would be when we needed to figure out how to implement the anova test into our code. Here is one question that we used: - https://stackoverflow.com/questions/5799111/factors-in-aov - We also utilized Stack Overflow to learn how we could interpret the results we obtained from the ANOVA test. After doing some research, we thought it would be best if we utilized Tukey’s Honestly Significant Difference. We searched through stack overflow to find base code that could help us implement this test into our own code. - https://stackoverflow.com/questions/70189631/obtain-cld-from-tukeyhsd-test - Our group decided to use an ANOVA test to estimate how a dependent variable changes according to the levels of one or more categorical variables. Since we did not learn about this testing method during lectures, we used Youtube videos to familiarize ourselves with this method - How To Perform an ANOVA test in R by Eugene O’Loughlin: https://www.youtube.com/watch?v=VmJ2l8iXsxo&ab_channel=EugeneO%27Loughlin