Home Owners’ Loan Corporation (HOLC) Grades and their Relationship to Racial Demographics

Proposal

library(tidyverse)
library(skimr)
library(readr)
library(readxl)

Data 1

Introduction and data

Identify the source of the data.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
Write a brief description of the observations.

The source of the data set is the United States Census Bureau.
The data was collected by the Bureau’s Small Area Income and Poverty Estimates (SAIPE) program and was last updated in 2021.
Each observation is a county in the United States, described by socioeconomic indicators such as poverty rates, median household income, and unemployment rates.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
A description of the research topic along with a concise statement of your hypotheses on this topic.
Identify the types of variables in your research question. Categorical? Quantitative?

What is the relationship between poverty rates and educational attainment in different counties of the United States?
The goal is to examine the relationship between poverty rates and educational attainment across counties of the United States. The hypothesis is that counties with higher poverty rates will have lower educational attainment levels than counties with lower poverty rates.
The poverty rate is a quantitative variable representing the percentage of people living below the poverty line. Educational attainment is an ordinal categorical variable that ranges from less than a high school diploma to a graduate degree. The county is a categorical variable that identifies the geographical location within the United States.

Glimpse of data

saipe_census <- read_excel("data/census.xlsx")

New names:
• `` -> `...2`
• `` -> `...3`
• `` -> `...4`
• `` -> `...5`
• `` -> `...6`
• `` -> `...7`
• `` -> `...8`
• `` -> `...9`
• `` -> `...10`
• `` -> `...11`
• `` -> `...12`
• `` -> `...13`
• `` -> `...14`
• `` -> `...15`
• `` -> `...16`
• `` -> `...17`
• `` -> `...18`
• `` -> `...19`
• `` -> `...20`
• `` -> `...21`
• `` -> `...22`
• `` -> `...23`
• `` -> `...24`
• `` -> `...25`
• `` -> `...26`
• `` -> `...27`
• `` -> `...28`
• `` -> `...29`
• `` -> `...30`
• `` -> `...31`

skim(saipe_census)

Data summary
Name	saipe_census
Number of rows	3198
Number of columns	31
_______________________
Column type frequency:
character	31
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
Table with column headers in rows 3 and 4	0	1	2	199	55
...2	2	1	3	16	327
...3	2	1	2	11	53
...4	2	1	4	33	1930
...5	1	1	1	26	2817
...6	2	1	1	18	2746
...7	2	1	1	18	2851
...8	2	1	1	25	296
...9	2	1	1	18	254
...10	2	1	1	18	348
...11	1	1	1	26	2243
...12	2	1	1	18	2034
...13	2	1	1	18	2361
...14	2	1	1	25	402
...15	2	1	1	18	319
...16	2	1	1	18	491
...17	1	1	1	38	2009
...18	2	1	1	18	1766
...19	2	1	1	18	2182
...20	2	1	1	37	394
...21	2	1	1	18	304
...22	2	1	1	18	481
...23	1	1	1	23	3089
...24	2	1	1	18	3098
...25	2	1	1	18	3091
...26	1	1	1	25	55
...27	2	1	1	18	53
...28	2	1	1	18	54
...29	2	1	1	24	47
...30	2	1	1	18	45
...31	2	1	1	18	46
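
The skim output above shows every column read as character, and the first column is named after the sheet's own note that the headers sit in rows 3 and 4. This suggests the title rows were imported as data. One option is the `skip` argument of `read_excel()`; another is to repair the imported table afterwards. The sketch below takes the second route on a toy data frame. `promote_header()` is a hypothetical helper, the toy values are invented, and the correct `header_row` for the real file is an assumption that would need to be checked against the spreadsheet.

```r
# Hypothetical helper: promote one row of an all-character data frame to
# column names, drop that row and everything above it, then re-guess types.
promote_header <- function(df, header_row) {
  new_names <- as.character(unlist(df[header_row, ], use.names = FALSE))
  body <- df[-seq_len(header_row), , drop = FALSE]
  names(body) <- make.names(new_names, unique = TRUE)
  rownames(body) <- NULL
  # as.is = TRUE keeps text columns as character instead of factor
  utils::type.convert(body, as.is = TRUE)
}

# Toy stand-in for the imported sheet (all values invented for illustration)
toy <- data.frame(
  first  = c(NA, "Name", "County A", "County B"),
  second = c(NA, "Poverty Percent", "12.5", "20.1"),
  stringsAsFactors = FALSE
)
names(toy) <- c("Table with column headers in rows 3 and 4", "...2")

clean <- promote_header(toy, header_row = 2)
str(clean)
```

If the real sheet spreads its headers over two rows, the two rows would need to be combined before being used as names; this sketch handles only a single header row.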

Data 2

Introduction and data

Identify the source of the data.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
Write a brief description of the observations.

The population and race/ethnicity data come from the U.S. decennial census and were uploaded to the fivethirtyeight GitHub under redlining by Jay Boice (jayb).
The dataset shows 2020 population estimates by race/ethnicity for redlining grades in micro- and metropolitan areas. The redlining grades were assigned by the HOLC in 1935-1940 and downloaded from the Mapping Inequality project. The population and race/ethnicity data were obtained from the 2020 U.S. census.
The dataset contains population estimates by race/ethnicity for each redlining grade, along with location quotients (LQs) that measure segregation. It includes micro- and metropolitan areas with zones rated A to D, covering 138 of the 143 metropolitan areas in the Mapping Inequality data. The data covers non-Hispanic white, non-Hispanic Black, Hispanic/Latino, non-Hispanic Asian, and other racial/ethnic communities.

Zone-block-matches.csv links 2020 U.S. census blocks to HOLC zones, using proportional weighting based on intersection area. HOLC zones are identified by city, state, grade, ID, and neighborhood ID.
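
As a rough illustration of what a location quotient expresses, the sketch below assumes the usual definition: a group's share of the population in a zone divided by its share in the surrounding area, so that values above 1 indicate overrepresentation. Whether the LQ columns in this dataset were computed exactly this way is an assumption; `location_quotient()` is a hypothetical helper and the example values are invented.

```r
# Location quotient under the usual definition (assumed here):
# share of a group inside the zone / share of that group in the surrounding area.
location_quotient <- function(pct_zone, pct_surrounding) {
  pct_zone / pct_surrounding
}

# Invented example: a group makes up 30% of a zone but 20% of the surrounding area
lq_example <- location_quotient(pct_zone = 30, pct_surrounding = 20)
print(lq_example)
```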

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
A description of the research topic along with a concise statement of your hypotheses on this topic.
Identify the types of variables in your research question. Categorical? Quantitative?

How is the racial composition of a given zone related to the HOLC grade (degree of desirability) of that area?
The central purpose of this research topic is to observe whether there is a correlation between the share of minority residents in a zone and that zone's HOLC grade. The hypothesis is that areas with higher percentages of minority residents, looking specifically at Black, Hispanic, and Asian populations, are more likely to be graded C or D than A or B. Conversely, areas with higher percentages of White residents are more likely to be graded A or B than C or D.
For this research question, the main variables are the HOLC grades and the percentages of each race (Black, Hispanic, Asian, White). The HOLC grade is a categorical variable. The percentage of residents belonging to each race (per zone) is a quantitative variable.

Glimpse of data

redlining <- read_csv("data/metro-grades.csv")

Rows: 551 Columns: 28
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): metro_area, holc_grade
dbl (26): white_pop, black_pop, hisp_pop, asian_pop, other_pop, total_pop, p...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

skim(redlining)

Data summary
Name	redlining
Number of rows	551
Number of columns	28
_______________________
Column type frequency:
character	2
numeric	26
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
metro_area	0	1	8	46	0	138	0
holc_grade	0	1	1	1	0	4	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
white_pop	1	28838.97	86150.59	158.00	3508.50	8448.00	21756.50	1164087.00	▇▁▁▁▁
black_pop	1	15735.14	62120.43	1.00	530.00	2380.00	8218.00	894704.00	▇▁▁▁▁
hisp_pop	1	19139.83	106594.49	6.00	407.50	1759.00	6920.50	1492338.00	▇▁▁▁▁
asian_pop	1	6216.87	41375.78	1.00	82.50	305.00	1446.50	767862.00	▇▁▁▁▁
other_pop	1	3809.54	14369.45	10.00	354.00	992.00	2350.00	239048.00	▇▁▁▁▁
total_pop	1	73740.32	297082.01	228.00	7205.00	17460.00	43386.00	4558038.00	▇▁▁▁▁
pct_white	1	55.45	22.48	3.77	39.27	59.01	74.71	94.12	▃▅▆▇▆
pct_black	1	19.98	18.67	0.31	5.86	13.39	28.44	85.40	▇▃▂▁▁
pct_hisp	1	15.54	16.86	1.10	4.62	8.80	19.89	93.90	▇▂▁▁▁
pct_asian	1	3.23	3.67	0.09	1.00	2.01	3.91	31.39	▇▁▁▁▁
pct_other	1	5.79	2.07	0.88	4.52	5.55	6.95	17.73	▂▇▂▁▁
lq_white	1	1.05	0.48	0.16	0.78	0.98	1.22	4.11	▇▇▁▁▁
lq_black	1	1.04	0.65	0.02	0.53	1.02	1.40	6.64	▇▃▁▁▁
lq_hisp	1	1.02	0.47	0.18	0.69	1.00	1.26	3.64	▆▇▁▁▁
lq_asian	1	0.82	0.43	0.11	0.54	0.74	1.03	2.68	▆▇▂▁▁
lq_other	1	1.05	0.20	0.39	0.94	1.05	1.16	2.20	▁▇▅▁▁
surr_area_white_pop	1	319868.85	710162.12	8715.00	47157.00	92469.00	248955.00	5169444.00	▇▁▁▁▁
surr_area_black_pop	1	119940.46	297059.96	1080.00	11130.00	30777.00	83252.00	2626933.00	▇▁▁▁▁
surr_area_hisp_pop	1	163548.68	620217.68	588.00	5652.00	23603.00	68357.00	5415084.00	▇▁▁▁▁
surr_area_asian_pop	1	63296.08	253564.47	113.00	1539.00	5186.00	19426.00	2077111.00	▇▁▁▁▁
surr_area_other_pop	1	34257.60	81771.26	1248.00	4450.00	9243.00	24755.00	675943.00	▇▁▁▁▁
surr_area_pct_white	1	54.77	17.34	11.98	42.08	57.22	66.97	89.11	▂▅▆▇▅
surr_area_pct_black	1	20.05	15.19	2.07	8.25	16.55	26.49	76.30	▇▅▂▁▁
surr_area_pct_hisp	1	15.58	15.19	1.53	5.39	9.57	21.00	81.20	▇▂▁▁▁
surr_area_pct_asian	1	4.08	4.23	0.37	1.35	3.10	4.91	29.21	▇▂▁▁▁
surr_area_pct_other	1	5.51	1.77	1.80	4.34	5.31	6.43	15.52	▃▇▂▁▁
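
One way the hypothesis for this dataset could be explored is to pool the population counts by HOLC grade and compare each group's share across grades. The sketch below is a minimal example under stated assumptions: it uses a toy tibble with the same column names as `redlining` (`holc_grade`, `white_pop`, `black_pop`, `total_pop`), and every value in it is invented. It is not a result from the real data.

```r
library(dplyr)

# Toy rows mirroring the column names of metro-grades.csv (values invented)
toy_redlining <- tibble(
  metro_area = c("Metro X", "Metro X", "Metro Y", "Metro Y"),
  holc_grade = c("A", "D", "A", "D"),
  white_pop  = c(800, 300, 900, 200),
  black_pop  = c(100, 500, 50, 600),
  total_pop  = c(1000, 1000, 1000, 1000)
)

# Population-weighted share of each group within every HOLC grade
grade_summary <- toy_redlining %>%
  group_by(holc_grade) %>%
  summarise(
    pct_white = 100 * sum(white_pop) / sum(total_pop),
    pct_black = 100 * sum(black_pop) / sum(total_pop),
    .groups = "drop"
  )

print(grade_summary)
```

Summing counts before dividing weights large metro areas more heavily than averaging the `pct_*` columns would; which choice suits the project is a design decision left open here.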

Data 3

Introduction and data

Identify the source of the data.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
Write a brief description of the observations.
1. The data come from pollsters who polled Senate races during August, combined with actual election data (i.e. the results of those Senate elections).
2. It was collected by the original data curator around 2018, which was after the last election represented in this dataset, the 2016 Senate elections.
3. Each observation is a Senate poll conducted in the August before an election, paired with the actual result of that election and the error of the poll relative to that result.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
A description of the research topic along with a concise statement of your hypotheses on this topic.
Identify the types of variables in your research question. Categorical? Quantitative?

How accurately do polls taken in the August before Senate races predict the eventual outcomes of these races, and what factors contribute to their accuracy or inaccuracy?
The goal of this project is to determine how accurate August polls are for Senate races, and how accuracy changes by election year or by state. The hypothesis is that August polls of Senate elections held in presidential election years (i.e. years evenly divisible by four) will be much more accurate than those held in midterm years.
Some of the variables are categorical, including cycle (election year), state, and senate_class (which of the three Senate classes the seat belongs to). However, most of them are quantitative, including DEM_poll and REP_poll (how much each party got in the poll), DEM_result and REP_result (how much each party got in the actual election), and error and absolute_error (how far off the poll was, in percentage points, and the absolute value of that). start_date and end_date record when each poll was conducted and are stored as dates.

Glimpse of data

aug_senate_poll <- read_csv("data/august_senate_polls.csv")

Rows: 594 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): state
dbl  (8): cycle, senate_class, DEM_poll, REP_poll, DEM_result, REP_result, e...
date (2): start_date, end_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

skim(aug_senate_poll)

Data summary
Name	aug_senate_poll
Number of rows	594
Number of columns	11
_______________________
Column type frequency:
character	1
Date	2
numeric	8
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
state	0	1	2	2	0	49	0

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
start_date	0	1	1990-07-26	2016-08-30	2008-08-14	265
end_date	0	1	1990-08-01	2016-08-31	2008-08-17	244

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
cycle	1	2006.87	7.38	1990.00	2004.00	2008.00	2012.00	2016.00	▂▂▂▇▇
senate_class	1	2.14	0.85	1.00	1.00	2.00	3.00	3.00	▆▁▅▁▇
DEM_poll	1	43.33	9.91	12.00	38.85	44.00	49.00	76.00	▁▂▇▂▁
REP_poll	1	42.57	8.92	11.00	38.00	43.00	48.00	69.00	▁▂▇▅▁
DEM_result	1	48.15	10.18	11.62	42.73	48.86	54.62	83.18	▁▂▇▃▁
REP_result	1	47.76	8.70	12.86	42.49	47.84	52.77	76.08	▁▂▇▅▁
error	1	-0.38	10.44	-35.70	-6.95	0.20	6.31	28.54	▁▃▇▇▁
absolute_error	1	8.17	6.50	0.01	3.37	6.62	11.47	35.70	▇▅▂▁▁
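
The hypothesis about presidential versus midterm years could be checked by labelling each poll by its `cycle` and comparing the mean `absolute_error` of the two groups. The sketch below shows the idea on a toy tibble. The column names `cycle` and `absolute_error` match the dataset, while the values and the `election_type` label are invented for illustration.

```r
library(dplyr)

# Toy polls (values invented); cycle and absolute_error mirror the real columns
toy_polls <- tibble(
  cycle          = c(2010, 2012, 2012, 2014, 2016),
  absolute_error = c(8, 4, 6, 10, 2)
)

# Within this dataset's range, cycles divisible by four are presidential years
error_by_type <- toy_polls %>%
  mutate(election_type = if_else(cycle %% 4 == 0, "presidential", "midterm")) %>%
  group_by(election_type) %>%
  summarise(
    n_polls        = n(),
    mean_abs_error = mean(absolute_error),
    .groups = "drop"
  )

print(error_by_type)
```

On the real data the same pipeline would start from `aug_senate_poll` instead of `toy_polls`; a difference in means alone would not establish that the gap is statistically meaningful.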