Project proposal

Author

dank-cap

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Dataset

A brief description of your dataset including its provenance, dimensions, etc. as well as the reason why you chose this dataset. Make sure to load the data and use inline code for some of this information.

For this project, we chose the Campus Pride Index dataset, which was extracted from the Campus Pride Index website. The Campus Pride Index is a national tool that assesses LGBTQ+ inclusivity at colleges and universities across the United States. The dataset has 238 rows, one per U.S. campus, and is split into two data frames:

  1. Pride Index Ratings: This data frame contains 5 columns: campus name, location (city and state), Campus Pride Index rating (1–5 scale; half-point values such as 3.5 and 4.5 appear in the data), student population, and community type (large urban city, small town, rural, etc.).

  2. Campus Characteristics: This data frame contains 18 columns: campus name and location, plus 16 indicator columns (TRUE/NA) for attributes such as public vs. private status, academic classification (doctoral, master’s, or baccalaureate), and whether the institution is a Hispanic-serving institution, a historically Black college or university (HBCU), or an Asian American and Pacific Islander (AAPI)-serving institution.

We chose this dataset because, as current college students, we find it relevant and engaging. Its varied variables also lend themselves well to visualization and storytelling. We want a structured way to analyze the factors associated with LGBTQ+ inclusivity in higher education and to explore potential relationships between a university’s attributes (such as location, community type, and academic focus) and the inclusivity rating it receives.

# load data
df <- read_csv("data/pride_index.csv", show_col_types = FALSE)
df_tags <- read_csv("data/pride_index_tags.csv", show_col_types = FALSE)

# show some of the data
glimpse(df)
Rows: 238
Columns: 5
$ campus_name     <chr> "University of Maryland, College Park", "University of…
$ campus_location <chr> "College Park, MD", "Dearborn, MI", "Valhalla, NY", "B…
$ rating          <dbl> 5.0, 3.0, 4.0, 3.5, 4.0, 4.5, 5.0, 4.0, 4.5, 4.5, 4.5,…
$ students        <dbl> 37952, 9000, 13000, 29850, 8500, 14769, 19446, 1850, 3…
$ community_type  <chr> "large urban city", "medium city", "very small town", …
glimpse(df_tags)
Rows: 238
Columns: 18
$ campus_name            <chr> "University of Maryland, College Park", "Univer…
$ campus_location        <chr> "College Park, MD", "Dearborn, MI", "Valhalla, …
$ public                 <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, NA, TRUE, NA, TRU…
$ private                <lgl> NA, NA, NA, NA, NA, TRUE, NA, TRUE, NA, NA, NA,…
$ doctoral               <lgl> TRUE, TRUE, NA, TRUE, TRUE, TRUE, TRUE, NA, TRU…
$ masters                <lgl> NA, TRUE, NA, TRUE, TRUE, NA, NA, NA, TRUE, TRU…
$ baccalaureate          <lgl> NA, TRUE, NA, TRUE, TRUE, NA, NA, TRUE, TRUE, T…
$ community              <lgl> NA, NA, TRUE, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ residential            <lgl> TRUE, NA, NA, TRUE, TRUE, TRUE, TRUE, TRUE, TRU…
$ nonresidential         <lgl> NA, TRUE, TRUE, NA, NA, NA, NA, NA, NA, TRUE, T…
$ liberal_arts           <lgl> NA, NA, NA, NA, TRUE, TRUE, NA, TRUE, NA, NA, N…
$ technical              <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ religious              <lgl> NA, NA, NA, NA, NA, TRUE, NA, NA, NA, NA, NA, N…
$ military               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ hbcu                   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ hispanic_serving       <lgl> NA, NA, TRUE, NA, NA, NA, NA, NA, NA, TRUE, TRU…
$ aapi_serving           <lgl> TRUE, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ other_minority_serving <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
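
The indicator columns are coded TRUE/NA rather than TRUE/FALSE, so counting campuses per tag needs a recoding step first. The sketch below shows one way to do that on a small made-up stand-in for df_tags. The campus names, locations, and values are hypothetical, and reading NA as "tag does not apply" is our assumption rather than something the data source states.

```r
library(tidyverse)

# Made-up stand-in for df_tags (hypothetical campuses, three of the 16 indicators)
tags_toy <- tibble(
  campus_name     = c("Campus A", "Campus B", "Campus C"),
  campus_location = c("Alpha City, MD", "Beta Town, MI", "Gamma Falls, NY"),
  public          = c(TRUE, TRUE, NA),
  private         = c(NA, NA, TRUE),
  community       = c(NA, NA, NA)
)

# Assumption: NA means the tag does not apply, so recode NA to FALSE
tags_clean <- tags_toy |>
  mutate(across(where(is.logical), \(x) coalesce(x, FALSE)))

# Count how many campuses carry each tag, one row per tag
tag_counts <- tags_clean |>
  summarise(across(where(is.logical), sum)) |>
  pivot_longer(everything(), names_to = "tag", values_to = "n_campuses")
```

Applied to the real df_tags, the same two steps would give a quick view of how common each institution type is before any plotting.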

Questions

The two questions you want to answer.

  1. How do pride ratings vary by student population size and community type (e.g., large urban city, medium city, very small town)?

  2. How do pride ratings differ by location and university type (e.g., public, private, community college)?

Analysis plan

A plan for answering each of the questions including the variables involved, variables to be created (if any), external data to be merged in (if any).

Question 1:

We will use “rating” as the dependent variable and examine its relationship with the independent variables “students” and “community_type” from the pride_index.csv dataset. Student population size is a numerical variable, while “community_type” is a categorical variable with classifications such as large urban city, medium city, small town, and rural community. To illustrate differences in pride ratings across these categories, we will try visualizations such as scatter plots and stacked bar charts, and then decide which one best answers the question. For these plots, we envision using the y-axis to denote student population, while distinguishing community type by color or shape.
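
A first pass at the scatter plot might look like the sketch below. It uses a small made-up data frame with the same column names as pride_index.csv, so the numbers are illustrative only. Placing rating on the x-axis is our assumption, since the plan above only fixes student population on the y-axis and community type as color.

```r
library(tidyverse)

# Made-up stand-in for df (hypothetical values, same column names)
ratings_toy <- tibble(
  rating         = c(5, 3, 4, 3.5, 4.5),
  students       = c(38000, 9000, 13000, 30000, 8500),
  community_type = c("large urban city", "medium city", "very small town",
                     "medium city", "large urban city")
)

# Scatter plot: student population on the y-axis, community type as color
p <- ggplot(ratings_toy, aes(x = rating, y = students, color = community_type)) +
  geom_point(size = 3, alpha = 0.8) +
  scale_y_continuous(labels = scales::label_comma()) +
  labs(
    x = "Campus Pride Index rating",
    y = "Student population",
    color = "Community type"
  )

# Companion summary: number of campuses and median rating per community type
rating_by_type <- ratings_toy |>
  group_by(community_type) |>
  summarise(n = n(), median_rating = median(rating), .groups = "drop")
```

The summary table is included because ratings take only a handful of distinct values, so a per-category median may be easier to read than overlapping points.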

Question 2:

We will use “rating” as the dependent variable and “campus_location” along with the university type indicators (“public,” “private,” “community”) as independent variables. Because “rating” is stored in pride_index.csv and the type indicators in pride_index_tags.csv, we will first join the two files on “campus_name” and “campus_location.” The “campus_location” column holds city and state, so we may clean it to aggregate the data at the state level. We will try visualizations such as maps, using color to encode pride index ratings (similar to a heat map), and then decide which one most effectively addresses the question. University types may be unevenly distributed across locations, which could confound the comparison, so we will also facet the plots by university type.
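
The data preparation for this question could be sketched as follows, again on made-up rows with hypothetical campus names and locations. The code assumes every “campus_location” value ends in a two-letter state abbreviation, which holds for the rows visible in the glimpse output but would need checking on the full data.

```r
library(tidyverse)

# Made-up stand-ins for df and df_tags (hypothetical campuses)
ratings_toy <- tibble(
  campus_name     = c("Campus A", "Campus B", "Campus C", "Campus D"),
  campus_location = c("Alpha City, MD", "Beta Town, MD", "Gamma Falls, NY", "Delta, NY"),
  rating          = c(5, 3, 4, 4.5)
)
tags_toy <- tibble(
  campus_name     = c("Campus A", "Campus B", "Campus C", "Campus D"),
  campus_location = c("Alpha City, MD", "Beta Town, MD", "Gamma Falls, NY", "Delta, NY"),
  public          = c(TRUE, NA, TRUE, NA),
  private         = c(NA, TRUE, NA, TRUE),
  community       = c(NA, NA, TRUE, NA)
)

by_state_type <- ratings_toy |>
  # Bring ratings and type indicators together
  inner_join(tags_toy, by = c("campus_name", "campus_location")) |>
  # Assumption: the location string ends in a two-letter state abbreviation
  mutate(state = str_extract(campus_location, "[A-Z]{2}$")) |>
  # One row per campus and type tag, keeping only tags that are set
  pivot_longer(c(public, private, community),
               names_to = "university_type", values_to = "has_tag") |>
  filter(!is.na(has_tag)) |>
  # Average rating for each state and university type
  group_by(state, university_type) |>
  summarise(n = n(), mean_rating = mean(rating), .groups = "drop")
```

A campus that carries more than one tag (for example, both public and community) is counted once in each group. The resulting state-by-type table is the shape a faceted map or heat map would need as input.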