library(tidyverse)library(skimr)library(tidymodels)library(dplyr)library(scales)library(skimr)#downloading both data sets and merging themapplication_cred <-read_csv("data/application_record.csv")credit_record <-read_csv("data/credit_record.csv")merged_data <-merge(application_cred, credit_record, by ="ID") #removing variables that are irrelevant to our research question and cleaning the restcredit_app<- merged_data |>select(-FLAG_MOBIL, -FLAG_WORK_PHONE, -FLAG_EMAIL, -OCCUPATION_TYPE, -FLAG_OWN_CAR,-FLAG_OWN_REALTY, -CNT_CHILDREN) |>filter(STATUS !="X", STATUS !="NA")|>mutate(new_age_years = (-1* DAYS_BIRTH) /365,new_employment_years = (-1* DAYS_EMPLOYED) /365,new_age_years =round(new_age_years),new_employment_years =round(new_employment_years),new_employment_years =replace(new_employment_years, new_employment_years ==-1001, 0),STATUS =case_when( STATUS =="C"~"On Time", STATUS =="X"~"On Time", STATUS =="0"~"OverDue", STATUS =="1"~"OverDue", STATUS =="2"~"OverDue", STATUS =="3"~"OverDue", STATUS =="4"~"OverDue", STATUS =="5"~"OverDue",) ) |> janitor::clean_names()|>mutate(code_gender =case_when( code_gender =="M"~"Male", code_gender =="F"~"Female" ))|>select(-days_birth, -days_employed, -flag_phone,-cnt_fam_members)#removing duplicatescredit_app <-distinct(credit_app, id, .keep_all =TRUE)skim(credit_app)
Data summary
Name
credit_app
Number of rows
33110
Number of columns
11
_______________________
Column type frequency:
character
6
numeric
5
________________________
Group variables
None
Variable type: character
skim_variable
n_missing
complete_rate
min
max
empty
n_unique
whitespace
code_gender
0
1
4
6
0
2
0
name_income_type
0
1
7
20
0
5
0
name_education_type
0
1
15
29
0
5
0
name_family_status
0
1
5
20
0
5
0
name_housing_type
0
1
12
19
0
6
0
status
0
1
7
7
0
2
0
Variable type: numeric
skim_variable
n_missing
complete_rate
mean
sd
p0
p25
p50
p75
p100
hist
id
0
1
5078014.66
41876.79
5008804
5041983
5069441
5114633
5150487
▇▇▇▇▆
amt_income_total
0
1
185877.17
101412.41
27000
121500
157500
225000
1575000
▇▁▁▁▁
months_balance
0
1
-15.87
14.18
-60
-25
-12
-4
0
▁▂▂▃▇
new_age_years
0
1
43.87
11.55
21
34
43
53
69
▅▇▆▆▃
new_employment_years
0
1
6.06
6.52
0
1
4
9
43
▇▂▁▁▁
Data Cleaning Description
We began by importing the data using read_csv(). Then, we merged the data from application_cred and credit_record by the ID. Then, we created credit_app from merged_data. Then, we eliminated the following columns using the select() function: FLAG_MOBIL, FLAG_WORK_PHONE, FLAG_EMAIL, OCCUPATION_TYPE. Then, we filtered out the “NA” and “X” values. The next problem we encountered with the data was that the birthday values were formatted in the following way: “Count backwards from current day (0), -1 means yesterday”. Clearly, this is confusing so we used mutate to create new_age_years to calculate and round the appropriate age. The next issue we encountered in the dataset was that the amount of time worked was formatted in a similar way: Count backwards from current day(0). This was also not conducive to plotting and visualizing the data so we mutated it to create new_employment_years to represent the rounded appropriate number of years. Then, we classified the credit approval based on the status, rating it as “On Time” or “Overdue”. Then we did “janitor::clean_names()\|\>” in order to clean the names in the data frame. We then mutated the values in gender_code to be the words “Male” and “Female” instead of “M” and “F” for clarity. The last steps of the data tidying and cleaning were to use the select() function once again to eliminate the following columns: days_birth, days_employed, flag_phone, cnt_fam_members, and using distinct to remove rows of the dataset with duplicate ids to eliminate any duplicate data.