Credit Cards

Appendix to report

Data cleaning

library(tidyverse)
library(skimr)
library(tidymodels)
library(dplyr)
library(scales)
library(skimr)


#downloading both data sets and merging them
application_cred <- read_csv("data/application_record.csv")
credit_record <- read_csv("data/credit_record.csv")

merged_data <- merge(application_cred, credit_record, by = "ID") 

#removing variables that are irrelevant to our research question and cleaning the rest
credit_app<- merged_data |> 
  select(-FLAG_MOBIL, -FLAG_WORK_PHONE, 
         -FLAG_EMAIL, -OCCUPATION_TYPE, -FLAG_OWN_CAR,
         -FLAG_OWN_REALTY, -CNT_CHILDREN) |>
    filter(STATUS != "X",
         STATUS != "NA")|>
  mutate(
    new_age_years = (-1 * DAYS_BIRTH) / 365,
    new_employment_years = (-1 * DAYS_EMPLOYED) / 365,
    new_age_years = round(new_age_years),
    new_employment_years = round(new_employment_years),
    new_employment_years = replace(new_employment_years, 
                                   new_employment_years == -1001, 0),
    STATUS = case_when(
    STATUS == "C" ~ "On Time",
    STATUS == "X" ~ "On Time",
    STATUS == "0" ~ "OverDue",
    STATUS == "1" ~ "OverDue",
    STATUS == "2" ~ "OverDue",
    STATUS == "3" ~ "OverDue",
    STATUS == "4" ~ "OverDue",
    STATUS == "5" ~ "OverDue",)
  ) |>
  janitor::clean_names()|>
  mutate(code_gender = case_when(
    code_gender == "M" ~ "Male",
    code_gender == "F" ~ "Female"
  ))|>
  select(-days_birth, -days_employed, -flag_phone,
         -cnt_fam_members)

#removing duplicates
credit_app <- distinct(credit_app, id, .keep_all = TRUE)

skim(credit_app)
Data summary
Name credit_app
Number of rows 33110
Number of columns 11
_______________________
Column type frequency:
character 6
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
code_gender 0 1 4 6 0 2 0
name_income_type 0 1 7 20 0 5 0
name_education_type 0 1 15 29 0 5 0
name_family_status 0 1 5 20 0 5 0
name_housing_type 0 1 12 19 0 6 0
status 0 1 7 7 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1 5078014.66 41876.79 5008804 5041983 5069441 5114633 5150487 ▇▇▇▇▆
amt_income_total 0 1 185877.17 101412.41 27000 121500 157500 225000 1575000 ▇▁▁▁▁
months_balance 0 1 -15.87 14.18 -60 -25 -12 -4 0 ▁▂▂▃▇
new_age_years 0 1 43.87 11.55 21 34 43 53 69 ▅▇▆▆▃
new_employment_years 0 1 6.06 6.52 0 1 4 9 43 ▇▂▁▁▁

Data Cleaning Description

We began by importing the data using read_csv(). Then, we merged the data from application_cred and credit_record by the ID. Then, we created credit_app from merged_data. Then, we eliminated the following columns using the select() function: FLAG_MOBIL, FLAG_WORK_PHONE, FLAG_EMAIL, OCCUPATION_TYPE. Then, we filtered out the “NA” and “X” values. The next problem we encountered with the data was that the birthday values were formatted in the following way: “Count backwards from current day (0), -1 means yesterday”. Clearly, this is confusing so we used mutate to create new_age_years to calculate and round the appropriate age. The next issue we encountered in the dataset was that the amount of time worked was formatted in a similar way: Count backwards from current day(0). This was also not conducive to plotting and visualizing the data so we mutated it to create new_employment_years to represent the rounded appropriate number of years. Then, we classified the credit approval based on the status, rating it as “On Time” or “Overdue”. Then we did “janitor::clean_names()\|\>” in order to clean the names in the data frame. We then mutated the values in gender_code to be the words “Male” and “Female” instead of “M” and “F” for clarity. The last steps of the data tidying and cleaning were to use the select() function once again to eliminate the following columns: days_birth, days_employed, flag_phone, cnt_fam_members, and using distinct to remove rows of the dataset with duplicate ids to eliminate any duplicate data.