Credit Cards

Proposal

library(tidyverse)
library(skimr)

Data 1

Introduction and data

Identify the source of the data.

Source: FiveThirtyEight

State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

The data set begins in 1998 and has been updated every year through the present. It tracks election results, and most of the data is collected through voter feedback, surveys, and posted election results.

Write a brief description of the observations.

The observations are election results for Governor, U.S. Senate, and President (including primaries) at the state and national levels. The data contains 538 items, each described by variables such as race, political party, type of office, type of election, number of votes, and the overall winner for each ballot.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

How have the winning parties across all types of election races changed over time?

A description of the research topic along with a concise statement of your hypotheses on this topic.

The data records the winning party in every race along with the type of race (presidential, Senate, and so on). Looking across time, have the winning parties been mostly Democratic, mostly Republican, or a mix? Our hypothesis is that over time the winning parties have trended Democratic.

Identify the types of variables in your research question. Categorical? Quantitative?

Type = Categorical

Party = Categorical

Date = Quantitative
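
One way to examine the hypothesis is to count wins per party in each year and convert the counts to shares. The sketch below uses a small made-up table; the column names `year`, `type`, and `party` are assumptions and may not match the actual files.

```r
library(dplyr)

# Hypothetical toy data; real column names and values may differ
results <- tibble(
  year  = c(1998, 1998, 2000, 2000, 2000),
  type  = c("Senate", "Governor", "Senate", "President", "Governor"),
  party = c("R", "D", "D", "D", "R")
)

# Count wins per party within each year, then convert to a share of that year's races
party_share <- results %>%
  count(year, party, name = "wins") %>%
  group_by(year) %>%
  mutate(share = wins / sum(wins)) %>%
  ungroup()

party_share
```

Plotting `share` against `year`, with one line per party, would show whether the Democratic share rises over time.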

Glimpse of data

skimr::skim("electrion_results_gubernational.csv")

Data summary
Name	“electrion_results_gubern…
Number of rows	1
Number of columns	1
_______________________
Column type frequency:
character	1
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
data	0	1	35	35	0	1	0

skimr::skim("electrion_results_generichouse.csv")

Data summary
Name	“electrion_results_generi…
Number of rows	1
Number of columns	1
_______________________
Column type frequency:
character	1
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
data	0	1	34	34	0	1	0

skimr::skim("electrion_results_house.csv")

Data summary
Name	“electrion_results_house….
Number of rows	1
Number of columns	1
_______________________
Column type frequency:
character	1
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
data	0	1	27	27	0	1	0

skimr::skim("electrion_results_presidential.csv")

Data summary
Name	“electrion_results_presid…
Number of rows	1
Number of columns	1
_______________________
Column type frequency:
character	1
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
data	0	1	34	34	0	1	0

skimr::skim("electrion_results_senate.csv")

Data summary
Name	“electrion_results_senate…
Number of rows	1
Number of columns	1
_______________________
Column type frequency:
character	1
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
data	0	1	28	28	0	1	0

skimr::skim("races.csv")

Data summary
Name	“races.csv”
Number of rows	1
Number of columns	1
_______________________
Column type frequency:
character	1
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
data	0	1	9	9	0	1	0
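
Each summary above reports one row and one column because `skim()` received the file name as a character string, so it summarized the string rather than the file contents. The `min` and `max` values are simply the number of characters in each name. A minimal sketch of the difference, using a temporary file as a hypothetical stand-in for one of the election CSV files:

```r
library(readr)
library(skimr)

# The file name alone has 35 characters, matching min and max in the first summary
name_length <- nchar("electrion_results_gubernational.csv")

# Hypothetical stand-in file so the example runs without the real data
path <- tempfile(fileext = ".csv")
write_csv(data.frame(year = c(1998, 2000), party = c("D", "R")), path)

# Passing the path itself summarizes a single character value
string_summary <- skim(path)

# Reading the file first summarizes the actual columns
election <- read_csv(path, show_col_types = FALSE)
data_summary <- skim(election)
```

Applying the same pattern to each election file should yield summaries of the real variables.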

Data 2

Introduction and data

Identify the source of the data.

Source: UCI Machine Learning Repository

State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

The data was last updated 3 years ago. Regarding how it was collected, the curator stated that this information is confidential because disclosing it might put the participants at risk.

Write a brief description of the observations.

The data set concerns credit card applications. Participants’ names are replaced with IDs, and the remaining variables are categorical and numerical factors considered when approving credit card applications, which satisfies the project requirements.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

How do factors such as gender, income, housing, children, etc. affect whether a credit card application gets approved?

A description of the research topic along with a concise statement of your hypotheses on this topic.

Because many factors are involved in whether a credit card application is approved or denied, we will choose a few of them and examine how each relates to the outcome. For instance, I chose gender, income, housing, and children because these variables may influence whether someone’s application is rejected. Our hypothesis is that an applicant who is a woman, has a low income, lives in a rented apartment, and has multiple children may be treated differently than other applicants.

Identify the types of variables in your research question. Categorical? Quantitative?

gender = categorical

car = categorical

children = numerical

income = numerical

education = categorical

family status = categorical

housing = categorical

birth = numerical

employed days = numerical

Glimpse of data

application_cred <- read_csv("data/application_record.csv")

Rows: 438557 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): CODE_GENDER, FLAG_OWN_CAR, FLAG_OWN_REALTY, NAME_INCOME_TYPE, NAME...
dbl (10): ID, CNT_CHILDREN, AMT_INCOME_TOTAL, DAYS_BIRTH, DAYS_EMPLOYED, FLA...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

credit_record <- read_csv("data/credit_record.csv")

Rows: 1048575 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): STATUS
dbl (2): ID, MONTHS_BALANCE

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

skim(application_cred)

Data summary
Name	application_cred
Number of rows	438557
Number of columns	18
_______________________
Column type frequency:
character	8
numeric	10
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
CODE_GENDER	0	1.00	1	1	2
FLAG_OWN_CAR	0	1.00	1	1	2
FLAG_OWN_REALTY	0	1.00	1	1	2
NAME_INCOME_TYPE	0	1.00	7	20	5
NAME_EDUCATION_TYPE	0	1.00	15	29	5
NAME_FAMILY_STATUS	0	1.00	5	20	5
NAME_HOUSING_TYPE	0	1.00	12	19	6
OCCUPATION_TYPE	134203	0.69	7	21	18

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
ID	1	6022176.27	571637.02	5008804	5609375	6047745.0	6456971	7999952	▆▇▇▁▁
CNT_CHILDREN	1	0.43	0.72	0	0	0.0	1	19	▇▁▁▁▁
AMT_INCOME_TOTAL	1	187524.29	110086.85	26100	121500	160780.5	225000	6750000	▇▁▁▁▁
DAYS_BIRTH	1	-15997.90	4185.03	-25201	-19483	-15630.0	-12514	-7489	▃▆▇▇▅
DAYS_EMPLOYED	1	60563.68	138767.80	-17531	-3103	-1467.0	-371	365243	▇▁▁▁▂
FLAG_MOBIL	1	1.00	0.00	1	1	1.0	1	1	▁▁▇▁▁
FLAG_WORK_PHONE	1	0.21	0.40	0	0	0.0	0	1	▇▁▁▁▂
FLAG_PHONE	1	0.29	0.45	0	0	0.0	1	1	▇▁▁▁▃
FLAG_EMAIL	1	0.11	0.31	0	0	0.0	0	1	▇▁▁▁▁
CNT_FAM_MEMBERS	1	2.19	0.90	1	2	2.0	3	20	▇▁▁▁▁
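
In the summary above, DAYS_BIRTH is entirely negative and DAYS_EMPLOYED has a maximum of 365243, roughly 1,000 years. The sketch below shows one possible way to turn these into usable variables. It assumes that negative values count days backward from the application date and that large positive DAYS_EMPLOYED values are a placeholder; neither assumption is confirmed by the source. The rows are toy values taken from the summary, not real applicants.

```r
library(dplyr)

# Toy rows built from values in the summary above; not real applicants
applicants <- tibble(
  ID            = c(1, 2, 3),
  DAYS_BIRTH    = c(-7489, -15630, -25201),
  DAYS_EMPLOYED = c(-371, -1467, 365243)
)

applicants_clean <- applicants %>%
  mutate(
    # Assumption: negative values count days before the application date
    age_years = -DAYS_BIRTH / 365.25,
    # Assumption: positive values are a placeholder, so treat them as missing
    years_employed = if_else(DAYS_EMPLOYED > 0, NA_real_, -DAYS_EMPLOYED / 365.25)
  )

applicants_clean
```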

skim(credit_record)

Data summary
Name	credit_record
Number of rows	1048575
Number of columns	3
_______________________
Column type frequency:
character	1
numeric	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
STATUS	0	1	1	1	0	8	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
ID	0	1	5068286.42	46150.58	5001711	5023644	5062104	5113856	5150487	▇▅▃▅▅
MONTHS_BALANCE	0	1	-19.14	14.02	-60	-29	-17	-7	0	▁▂▅▆▇
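
Neither summary shows an approval column, so an outcome would have to be derived from STATUS and linked to the applicant variables through ID. The ID ranges of the two tables also overlap only partly. The sketch below illustrates one possible approach on toy tables. The STATUS codes treated as overdue are an assumption, since the source does not define them.

```r
library(dplyr)

# Toy stand-ins for application_cred and credit_record
application_toy <- tibble(ID = c(1, 2, 3), CODE_GENDER = c("F", "M", "F"))
credit_toy <- tibble(
  ID             = c(1, 1, 2, 2),
  MONTHS_BALANCE = c(0, -1, 0, -1),
  STATUS         = c("0", "2", "C", "X")
)

# Assumption: these STATUS codes mark seriously overdue months
overdue_codes <- c("2", "3", "4", "5")

# One row per applicant: was any month ever flagged as overdue?
outcome <- credit_toy %>%
  group_by(ID) %>%
  summarise(ever_overdue = any(STATUS %in% overdue_codes), .groups = "drop")

# inner_join keeps only applicants that also have a credit history
modeling_data <- inner_join(application_toy, outcome, by = "ID")

modeling_data
```

Applicant 3 has no credit history in the toy data and is dropped by the join, which is what would happen to real applicants without matching records.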

Data 3

Introduction and data

Identify the source of the data.

Source: FiveThirtyEight

State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

The redlining maps used in this project were originally drawn by the “Home Owners’ Loan Corporation” (HOLC) from 1935-40 and downloaded from the Mapping Inequality project. The population and race/ethnicity data come from the 2020 U.S. decennial census.

Write a brief description of the observations.

The observations include the name of a metro area, the “holc” grade (A to D) assigned to parts of that area, and population counts for each race and ethnicity. The data also includes estimated populations of specific groups in the surrounding areas, named “surr_area_***_pop”.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

How has redlining in the mid-20th century in metropolitan areas in the United States affected the current demographic groups that are found in those areas today?

A description of the research topic along with a concise statement of your hypotheses on this topic.

Using this data, we can compare the current demographics of minority groups with those of white populations in metropolitan areas to assess how redlining affected them. We will examine how the “holc” grades, which labeled places from “best” to “hazardous” to live in, relate to current demographics.

Identify the types of variables in your research question. Categorical? Quantitative?

Populations: Quantitative (example: white_pop, which is the estimate of the white population in HOLC areas)

HOLC Grade: Categorical (A-D, A being best and D being hazardous)

Surrounding area: Quantitative (example: surr_area_black_pop, which is the estimate of the black population in metro areas)
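
A minimal sketch of how these variables could be combined is to compute each group's share of the population within every HOLC grade. The values below are made up, and the column names `holc_grade`, `black_pop`, and `total_pop` are assumptions modeled on `white_pop`; they may not match the actual file.

```r
library(dplyr)

# Toy values; holc_grade, black_pop, and total_pop are hypothetical column names
metro_toy <- tibble(
  metro_area = c("A-town", "A-town", "B-ville", "B-ville"),
  holc_grade = c("A", "D", "A", "D"),
  white_pop  = c(800, 300, 600, 200),
  black_pop  = c(100, 500, 200, 600),
  total_pop  = c(1000, 1000, 1000, 1000)
)

# Pool all metro areas and compute each group's share within every grade
grade_shares <- metro_toy %>%
  group_by(holc_grade) %>%
  summarise(
    pct_white = sum(white_pop) / sum(total_pop),
    pct_black = sum(black_pop) / sum(total_pop),
    .groups = "drop"
  )

grade_shares
```

Comparing these shares across grades A through D would show whether formerly "hazardous" zones have a different demographic makeup today.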

Glimpse of data

skimr::skim("metro_grades.csv")

Data summary
Name	“metro_grades.csv”
Number of rows	1
Number of columns	1
_______________________
Column type frequency:
character	1
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
data	0	1	16	16	0	1	0

skimr::skim("zone_block_matches.csv")

Data summary
Name	“zone_block_matches.csv”
Number of rows	1
Number of columns	1
_______________________
Column type frequency:
character	1
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
data	0	1	22	22	0	1	0