Credit Cards
Different factors that affect credit card applications
Introduce the topic and motivation
When deciding what type of dataset to evaluate, we looked for one that was pertinent to our daily lives. As college students who will soon enter the workforce, we will have to deal with financial topics we might not have considered before: applying for a credit card, investing money, buying equities, etc. After looking at several datasets, we chose one from Kaggle called “Credit Card Approval Prediction” because it was interesting and pertinent to our goals. The questions that we are trying to answer are the following:
How do factors such as gender, age, degree status, job, etc. affect whether an individual pays their credit card on time?
Is there a correlation between paying off credit cards and age?
Introduce the data
Who created the data set?
- This dataset was created by Seanny Song, a data engineer based in the Washington DC-Baltimore area. He collected the data from an undisclosed bank to build a machine learning model that predicts whether credit card applications will be approved based on different factors.
What does it contain?
It contains two data sets, which we merged and cleaned to remove duplicates and variables we did not need: one with general information about each applicant, and a second with the status of their credit card payments.
The first: Each row is one credit card applicant, identified by a unique ID so that personal information stays confidential. The columns are categorical and numerical variables describing the applicant, such as income, gender, and marital status.
The second: Records whether an applicant paid their credit card on time or was overdue.
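The merge-and-clean step described above can be sketched roughly as follows. This is a minimal illustration, not our actual cleaning script: the column names are taken from the column specifications of the two files, but the tiny tables, their values, and the choice of an inner join are assumptions made for the example.

```r
library(dplyr)

# Toy stand-ins for the two files. Column names match the real data;
# every value below is made up for illustration.
applicants <- tibble(
  ID = c(1, 2, 2, 3),
  CODE_GENDER = c("F", "M", "M", "F"),
  AMT_INCOME_TOTAL = c(100000, 85000, 85000, 120000)
)
records <- tibble(
  ID = c(1, 1, 2, 4),
  MONTHS_BALANCE = c(0, -1, 0, 0),
  STATUS = c("0", "C", "1", "X")
)

# Drop exact duplicate applicant rows, then keep only applicants that have
# at least one payment record (inner join on the shared ID column).
# Assumption: an inner join is used; the real analysis may have joined differently.
merged <- applicants %>%
  distinct() %>%
  inner_join(records, by = "ID")

print(merged)
```

In this toy example, applicant 3 has no payment record and record 4 has no applicant, so neither appears in the merged table.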
Highlights from EDA
Research question 1: How do factors such as gender, age, degree status, job, etc. affect whether an individual pays their credit card on time?
Hypothesis Test
Is there a correlation between paying off credit cards and age?
Our group was interested in whether there is a correlation between paying off a credit card and a person's age.
Null Hypothesis: There is no relationship between age and paying credit card bills on time.
\[ H_0 : \mu_{OnTime} = \mu_{OverDue} \]
Alternative Hypothesis: There is a relationship between age and paying credit card bills on time.
\[ H_A : \mu_{OnTime} \neq \mu_{OverDue} \]
Logistic Regression Equation:
\[ \log\Big(\frac{p}{1-p}\Big) = 0.665 - 0.002 \times new\_age\_years \]
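To make the equation more concrete, the sketch below plugs two ages into the fitted coefficients reported in the model summary and converts the resulting log-odds into probabilities. Here p is the probability of whichever `status` level R treats as the "success" outcome; the output does not say which level that is, so the interpretation is left open. The ages 30 and 60 are arbitrary choices for illustration.

```r
# Coefficients copied from the glm summary (unrounded values).
intercept <- 0.6649905
slope     <- -0.0016718

# Illustrative ages in years (arbitrary, not taken from the data).
ages <- c(30, 60)

# Log-odds from the fitted equation, then the inverse logit via plogis().
log_odds <- intercept + slope * ages
p_hat    <- plogis(log_odds)

print(round(p_hat, 3))
```

The two predicted probabilities differ by only about one percentage point (roughly 0.65 versus 0.64), which shows how small the estimated age effect is.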
Model
Call:
glm(formula = status ~ new_age_years, family = binomial, data = credit_app)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.4539  -1.4314   0.9317   0.9406   0.9546

Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)    0.6649905  0.0450988  14.745   <2e-16 ***
new_age_years -0.0016718  0.0009928  -1.684   0.0922 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 43125  on 33109  degrees of freedom
Residual deviance: 43122  on 33108  degrees of freedom
AIC: 43126

Number of Fisher Scoring iterations: 4
Conclusions + future work
Our conclusion
The p-value for age in the logistic regression (0.0922) is greater than 0.05, so the age coefficient is not statistically significant and we fail to reject the null hypothesis. There is not enough evidence to conclude that an applicant's age is related to whether they pay their credit card on time, and in particular we cannot conclude that older applicants are more likely to pay on time.
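The decision above can be checked by hand from the coefficient table. The sketch below recomputes the Wald z statistic and two-sided p-value for `new_age_years` from its reported estimate and standard error, and adds an approximate 95% confidence interval and odds ratio. The interval and odds ratio are not part of the original output; they are included here only as a supplementary illustration.

```r
# Estimate and standard error for new_age_years, copied from the glm summary.
estimate  <- -0.0016718
std_error <- 0.0009928
alpha     <- 0.05

# Wald test: z = estimate / SE, two-sided p-value from the standard normal.
z       <- estimate / std_error
p_value <- 2 * pnorm(-abs(z))

# Approximate 95% Wald confidence interval for the slope (log-odds scale).
ci <- estimate + c(-1, 1) * qnorm(1 - alpha / 2) * std_error

# Odds ratio for a one-year increase in age.
odds_ratio <- exp(estimate)

print(round(c(z = z, p_value = p_value), 4))
print(round(ci, 6))
print(p_value < alpha)  # FALSE means we fail to reject the null
```

The interval contains zero and the odds ratio is very close to 1, which is consistent with failing to reject the null hypothesis.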
Future Work
Possible future work could look at data from banks in other parts of the United States to see whether these different factors are related to applicants' credit card payment status there as well.