library(tidyverse)
library(skimr)
Credit Cards
Proposal
Data 1
Introduction and data
Identify the source of the data.
Source: FiveThirtyEight
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- The data set begins in 1998 and has been updated every year since. It tracks election results, and most of the data are collected through voter feedback, surveys, and posted election results.
Write a brief description of the observations.
The observations cover election results for Governor, U.S. Senate, and President (including primaries) at the state and national levels. The data contain 538 items, each described by variables such as race, political party, type of office, type of election, number of votes, and the overall winner of each ballot.
Research question
A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
How have the winning parties across all types of election races changed over time?
A description of the research topic along with a concise statement of your hypotheses on this topic.
The data include the winning party in every race, along with the type of race (presidential, Senate, and so on). Looking at these results over time, have the winning parties been mostly Democratic, mostly Republican, or a mix? Our hypothesis is that winning parties have trended toward Democratic over time.
Identify the types of variables in your research question. Categorical? Quantitative?
Type = Categorical
Party = Categorical
Date = Quantitative
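One way to approach this research question is to compute, for each year, the share of races won by each party. The sketch below runs on a small made-up table; the column names `year`, `type`, and `party` are assumptions for illustration and are not taken from the FiveThirtyEight files.

```r
library(dplyr)

# Made-up winners table; column names and values are hypothetical
winners <- tibble(
  year  = c(2000, 2000, 2000, 2004, 2004, 2004),
  type  = c("President", "Senate", "Governor", "President", "Senate", "Governor"),
  party = c("R", "D", "D", "R", "R", "D")
)

# Share of races won by each party within each year
party_share <- winners %>%
  count(year, party, name = "wins") %>%
  group_by(year) %>%
  mutate(share = wins / sum(wins)) %>%
  ungroup()

print(party_share)
```

Plotting `share` against `year`, with one line per party, would then show whether the trend matches the hypothesis.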
Glimpse of data
skimr::skim("electrion_results_gubernational.csv")
Name | “electrion_results_gubern… |
Number of rows | 1 |
Number of columns | 1 |
_______________________ | |
Column type frequency: | |
character | 1 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
data | 0 | 1 | 35 | 35 | 0 | 1 | 0 |
skimr::skim("electrion_results_generichouse.csv")
Name | “electrion_results_generi… |
Number of rows | 1 |
Number of columns | 1 |
_______________________ | |
Column type frequency: | |
character | 1 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
data | 0 | 1 | 34 | 34 | 0 | 1 | 0 |
skimr::skim("electrion_results_house.csv")
Name | “electrion_results_house…. |
Number of rows | 1 |
Number of columns | 1 |
_______________________ | |
Column type frequency: | |
character | 1 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
data | 0 | 1 | 27 | 27 | 0 | 1 | 0 |
skimr::skim("electrion_results_presidential.csv")
Name | “electrion_results_presid… |
Number of rows | 1 |
Number of columns | 1 |
_______________________ | |
Column type frequency: | |
character | 1 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
data | 0 | 1 | 34 | 34 | 0 | 1 | 0 |
skimr::skim("electrion_results_senate.csv")
Name | “electrion_results_senate… |
Number of rows | 1 |
Number of columns | 1 |
_______________________ | |
Column type frequency: | |
character | 1 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
data | 0 | 1 | 28 | 28 | 0 | 1 | 0 |
skimr::skim("races.csv")
Name | “races.csv” |
Number of rows | 1 |
Number of columns | 1 |
_______________________ | |
Column type frequency: | |
character | 1 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
data | 0 | 1 | 9 | 9 | 0 | 1 | 0 |
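Each summary above reports 1 row and 1 column, and the `min` and `max` values equal the number of characters in the file name (9 for `races.csv`, for example). This suggests that `skim()` summarised the file-name string rather than the file contents. A minimal sketch of reading a file first and then skimming the resulting data frame follows; the temporary file and its columns are placeholders so the example runs anywhere.

```r
library(readr)
library(skimr)

# Placeholder CSV standing in for one of the election files (hypothetical contents)
path <- tempfile(fileext = ".csv")
write_csv(data.frame(year = c(2000, 2004), party = c("D", "R")), path)

# Read the file into a data frame, then summarise the data frame
results <- read_csv(path, show_col_types = FALSE)
results_summary <- skim(results)
print(results_summary)
```

With the real files, `path` would be replaced by the file's location in the project, for example inside a `data/` folder if that is where the files are stored.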
Data 2
Introduction and data
- Identify the source of the data.
Source: UCI Machine Learning Repository
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
The data were last updated 3 years ago. When we looked for how the data were collected, the curator stated that this information is kept confidential because disclosing it might put the participants at risk.
- Write a brief description of the observations.
From what I could tell, this data set concerns credit card applications. Participants’ names are replaced with IDs, and the other variables are categorical and numerical factors considered when approving credit card applications, which satisfies the project requirements.
Research question
- A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
How do factors such as gender, income, housing, children, etc. affect whether a credit card application gets approved?
A description of the research topic along with a concise statement of your hypotheses on this topic.
Since multiple factors are involved in whether a credit card application is approved or denied, we would choose a few of them and examine how they relate to the outcome. For instance, I chose gender, income, housing, and children because these are all variables that may determine whether someone’s credit card application gets rejected. If an applicant is a low-income woman who lives in a rented apartment and has multiple children, then she may be viewed differently from other applicants.
Identify the types of variables in your research question. Categorical? Quantitative?
gender = categorical
car = categorical
children = numerical
income = numerical
education = categorical
family status = categorical
housing = categorical
birth = numerical
employed days = numerical
Glimpse of data
application_cred <- read_csv("data/application_record.csv")
Rows: 438557 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): CODE_GENDER, FLAG_OWN_CAR, FLAG_OWN_REALTY, NAME_INCOME_TYPE, NAME...
dbl (10): ID, CNT_CHILDREN, AMT_INCOME_TOTAL, DAYS_BIRTH, DAYS_EMPLOYED, FLA...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
credit_record <- read_csv("data/credit_record.csv")
Rows: 1048575 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): STATUS
dbl (2): ID, MONTHS_BALANCE
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skim(application_cred)
Name | application_cred |
Number of rows | 438557 |
Number of columns | 18 |
_______________________ | |
Column type frequency: | |
character | 8 |
numeric | 10 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
CODE_GENDER | 0 | 1.00 | 1 | 1 | 0 | 2 | 0 |
FLAG_OWN_CAR | 0 | 1.00 | 1 | 1 | 0 | 2 | 0 |
FLAG_OWN_REALTY | 0 | 1.00 | 1 | 1 | 0 | 2 | 0 |
NAME_INCOME_TYPE | 0 | 1.00 | 7 | 20 | 0 | 5 | 0 |
NAME_EDUCATION_TYPE | 0 | 1.00 | 15 | 29 | 0 | 5 | 0 |
NAME_FAMILY_STATUS | 0 | 1.00 | 5 | 20 | 0 | 5 | 0 |
NAME_HOUSING_TYPE | 0 | 1.00 | 12 | 19 | 0 | 6 | 0 |
OCCUPATION_TYPE | 134203 | 0.69 | 7 | 21 | 0 | 18 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
ID | 0 | 1 | 6022176.27 | 571637.02 | 5008804 | 5609375 | 6047745.0 | 6456971 | 7999952 | ▆▇▇▁▁ |
CNT_CHILDREN | 0 | 1 | 0.43 | 0.72 | 0 | 0 | 0.0 | 1 | 19 | ▇▁▁▁▁ |
AMT_INCOME_TOTAL | 0 | 1 | 187524.29 | 110086.85 | 26100 | 121500 | 160780.5 | 225000 | 6750000 | ▇▁▁▁▁ |
DAYS_BIRTH | 0 | 1 | -15997.90 | 4185.03 | -25201 | -19483 | -15630.0 | -12514 | -7489 | ▃▆▇▇▅ |
DAYS_EMPLOYED | 0 | 1 | 60563.68 | 138767.80 | -17531 | -3103 | -1467.0 | -371 | 365243 | ▇▁▁▁▂ |
FLAG_MOBIL | 0 | 1 | 1.00 | 0.00 | 1 | 1 | 1.0 | 1 | 1 | ▁▁▇▁▁ |
FLAG_WORK_PHONE | 0 | 1 | 0.21 | 0.40 | 0 | 0 | 0.0 | 0 | 1 | ▇▁▁▁▂ |
FLAG_PHONE | 0 | 1 | 0.29 | 0.45 | 0 | 0 | 0.0 | 1 | 1 | ▇▁▁▁▃ |
FLAG_EMAIL | 0 | 1 | 0.11 | 0.31 | 0 | 0 | 0.0 | 0 | 1 | ▇▁▁▁▁ |
CNT_FAM_MEMBERS | 0 | 1 | 2.19 | 0.90 | 1 | 2 | 2.0 | 3 | 20 | ▇▁▁▁▁ |
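The summary shows that `DAYS_BIRTH` is always negative and that `DAYS_EMPLOYED` has a maximum of 365243, which is roughly 1,000 years. A plausible reading, which should be confirmed against the data documentation, is that both columns count days backward from the record date and that 365243 is a placeholder rather than a real duration. Under that assumption, the sketch below derives an approximate age in years and sets the placeholder to missing; the toy rows are made up, and `age_years` and `years_employed` are hypothetical column names.

```r
library(dplyr)

# Toy rows using column names from application_record.csv; values are made up
applications <- tibble(
  ID            = c(1, 2, 3),
  DAYS_BIRTH    = c(-7489, -15630, -25201),
  DAYS_EMPLOYED = c(-371, -1467, 365243)
)

cleaned <- applications %>%
  mutate(
    # Assumption: negative day counts run backward from the record date
    age_years      = -DAYS_BIRTH / 365.25,
    # Assumption: 365243 is a placeholder value, so treat it as missing
    DAYS_EMPLOYED  = na_if(DAYS_EMPLOYED, 365243),
    years_employed = -DAYS_EMPLOYED / 365.25
  )

print(cleaned)
```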
skim(credit_record)
Name | credit_record |
Number of rows | 1048575 |
Number of columns | 3 |
_______________________ | |
Column type frequency: | |
character | 1 |
numeric | 2 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
STATUS | 0 | 1 | 1 | 1 | 0 | 8 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
ID | 0 | 1 | 5068286.42 | 46150.58 | 5001711 | 5023644 | 5062104 | 5113856 | 5150487 | ▇▅▃▅▅ |
MONTHS_BALANCE | 0 | 1 | -19.14 | 14.02 | -60 | -29 | -17 | -7 | 0 | ▁▂▅▆▇ |
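Both files share an `ID` column, so one possible way to relate applicant attributes to credit history is to join the two tables on `ID`. The ID ranges in the two summaries differ (up to 7999952 in one and up to 5150487 in the other), so some applicants may have no credit records. The sketch below runs on made-up rows and is not a statement of the project's method; the meaning of the `STATUS` codes is not given here, so the values used are placeholders.

```r
library(dplyr)

# Made-up rows using column names from the two files
applications <- tibble(
  ID          = c(1, 2, 3),
  CODE_GENDER = c("F", "M", "F")
)
credit <- tibble(
  ID             = c(1, 1, 2),
  MONTHS_BALANCE = c(0, -1, 0),
  STATUS         = c("0", "1", "0")  # placeholder codes
)

# Keep only applicants that appear in both tables (one row per credit record)
joined <- inner_join(applications, credit, by = "ID")

# Applicants with no credit history at all
unmatched <- anti_join(applications, credit, by = "ID")

print(joined)
print(unmatched)
```

Checking the size of `unmatched` on the real data would show how many applications cannot be linked to an outcome.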
Data 3
Introduction and data
Identify the source of the data.
FiveThirtyEight
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
The Redlining maps used in this project were originally drawn by the “Home Owners’ Loan Corporation” (HOLC) from 1935-40 and downloaded from the Mapping Inequality project. The population and race/ethnicity data comes from the 2020 U.S. decennial census.
Write a brief description of the observations.
The observations in the data include the name of a metro area, the “holc” grade (from A to D) given to certain parts of the area, and counts for each race and ethnicity. Furthermore, estimates of the proportions of specific groups of people in certain areas were also calculated and are named “surr_area_***_pop”.
Research question
A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
How has redlining in the mid-20th century in metropolitan areas in the United States affected the current demographic groups that are found in those areas today?
A description of the research topic along with a concise statement of your hypotheses on this topic.
From the data collected, we can look at the demographics of minority groups and compare them to white populations in metropolitan areas to determine how redlining affected them. We will look at how these “holc” grades marked places as “hazardous” or “best” to live in and see how those grades relate to the current demographics.
Identify the types of variables in your research question. Categorical? Quantitative?
Populations: Quantitative (example: white_pop, which is the estimate of the white population in HOLC areas)
HOLC Grade: Categorical (A-D, A being best and D being hazardous)
Surrounding area: Quantitative (example: surr_area_black_pop, which is the estimate of the black population in metro areas)
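To compare demographics across HOLC grades, one option is to total the population counts within each grade and compute each group's share. The sketch below uses made-up numbers; `white_pop` is named above, while `metro_area`, `holc_grade`, `black_pop`, and `pct_black` are assumed column names for illustration.

```r
library(dplyr)

# Made-up rows; column names other than white_pop are assumptions
grades <- tibble(
  metro_area = c("Metro X", "Metro X", "Metro Y", "Metro Y"),
  holc_grade = c("A", "D", "A", "D"),
  white_pop  = c(800, 300, 600, 200),
  black_pop  = c(200, 700, 400, 800)
)

# Total each group within a grade, then compute the black share of the two groups
grade_summary <- grades %>%
  group_by(holc_grade) %>%
  summarise(
    white_pop = sum(white_pop),
    black_pop = sum(black_pop),
    .groups = "drop"
  ) %>%
  mutate(pct_black = black_pop / (white_pop + black_pop))

print(grade_summary)
```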
Glimpse of data
skimr::skim("metro_grades.csv")
Name | “metro_grades.csv” |
Number of rows | 1 |
Number of columns | 1 |
_______________________ | |
Column type frequency: | |
character | 1 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
data | 0 | 1 | 16 | 16 | 0 | 1 | 0 |
skimr::skim("zone_block_matches.csv")
Name | “zone_block_matches.csv” |
Number of rows | 1 |
Number of columns | 1 |
_______________________ | |
Column type frequency: | |
character | 1 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
data | 0 | 1 | 22 | 22 | 0 | 1 | 0 |