Credit Cards

Appendix to report

Data cleaning

library(tidyverse)
library(skimr)
library(tidymodels)
library(dplyr)
library(scales)
library(skimr)


#downloading both data sets and merging them
application_cred <- read_csv("data/application_record.csv")
credit_record <- read_csv("data/credit_record.csv")

merged_data <- merge(application_cred, credit_record, by = "ID") 

#removing variables that are irrelevant to our research question and cleaning the rest
credit_app<- merged_data |> 
  select(-FLAG_MOBIL, -FLAG_WORK_PHONE, 
         -FLAG_EMAIL, -OCCUPATION_TYPE, -FLAG_OWN_CAR,
         -FLAG_OWN_REALTY, -CNT_CHILDREN) |>
    filter(STATUS != "X",
         STATUS != "NA")|>
  mutate(
    new_age_years = (-1 * DAYS_BIRTH) / 365,
    new_employment_years = (-1 * DAYS_EMPLOYED) / 365,
    new_age_years = round(new_age_years),
    new_employment_years = round(new_employment_years),
    new_employment_years = replace(new_employment_years, 
                                   new_employment_years == -1001, 0),
    STATUS = case_when(
    STATUS == "C" ~ "On Time",
    STATUS == "X" ~ "On Time",
    STATUS == "0" ~ "OverDue",
    STATUS == "1" ~ "OverDue",
    STATUS == "2" ~ "OverDue",
    STATUS == "3" ~ "OverDue",
    STATUS == "4" ~ "OverDue",
    STATUS == "5" ~ "OverDue",)
  ) |>
  janitor::clean_names()|>
  mutate(code_gender = case_when(
    code_gender == "M" ~ "Male",
    code_gender == "F" ~ "Female"
  ))|>
  select(-days_birth, -days_employed, -flag_phone,
         -cnt_fam_members)

#removing duplicates
credit_app <- distinct(credit_app, id, .keep_all = TRUE)

skim(credit_app)

Data summary
Name	credit_app
Number of rows	33110
Number of columns	11
_______________________
Column type frequency:
character	6
numeric	5
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
code_gender	1	4	6	2
name_income_type	1	7	20	5
name_education_type	1	15	29	5
name_family_status	1	5	20	5
name_housing_type	1	12	19	6
status	1	7	7	2

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
id	1	5078014.66	41876.79	5008804	5041983	5069441	5114633	5150487	▇▇▇▇▆
amt_income_total	1	185877.17	101412.41	27000	121500	157500	225000	1575000	▇▁▁▁▁
months_balance	1	-15.87	14.18	-60	-25	-12	-4	0	▁▂▂▃▇
new_age_years	1	43.87	11.55	21	34	43	53	69	▅▇▆▆▃
new_employment_years	1	6.06	6.52	0	1	4	9	43	▇▂▁▁▁

Data Cleaning Description

We began by importing the data using read_csv(). Then, we merged the data from application_cred and credit_record by the ID. Then, we created credit_app from merged_data. Then, we eliminated the following columns using the select() function: FLAG_MOBIL, FLAG_WORK_PHONE, FLAG_EMAIL, OCCUPATION_TYPE. Then, we filtered out the “NA” and “X” values. The next problem we encountered with the data was that the birthday values were formatted in the following way: “Count backwards from current day (0), -1 means yesterday”. Clearly, this is confusing so we used mutate to create new_age_years to calculate and round the appropriate age. The next issue we encountered in the dataset was that the amount of time worked was formatted in a similar way: Count backwards from current day(0). This was also not conducive to plotting and visualizing the data so we mutated it to create new_employment_years to represent the rounded appropriate number of years. Then, we classified the credit approval based on the status, rating it as “On Time” or “Overdue”. Then we did “janitor::clean_names()\|\>” in order to clean the names in the data frame. We then mutated the values in gender_code to be the words “Male” and “Female” instead of “M” and “F” for clarity. The last steps of the data tidying and cleaning were to use the select() function once again to eliminate the following columns: days_birth, days_employed, flag_phone, cnt_fam_members, and using distinct to remove rows of the dataset with duplicate ids to eliminate any duplicate data.