This is primarily an ETL document for the 2024 Home Credit
competition data on Kaggle. This will also be a first attempt at 1) a
Kaggle competition, 2) approaching this dataset, and 3) documenting the
bottom-up process for public review, so it will serve mainly as a
reference for myself on how to improve on all these fronts. The goal of
this competition is to predict the target class from the
provided features.
# load libraries for wrangling, file paths, database access, and tree display
library(tidyverse)
library(here)
library(DBI)
library(RSQLite)
library(reticulate)
library(tools)
library(data.tree)
# connect to (or create) a local SQLite database file
con <- dbConnect(SQLite(), "mydb")
When I first approached these data, I attempted to read all the
csv files into my R environment with read_csv(). As it
turns out, several of the files contained millions of rows and
were clearly too big for my laptop’s memory. My roundabout
strategy for circumventing this problem was to create a local relational
database (SQLite), trim the data down on the database side, and then
import a data table that’s hopefully small enough to work with.
To do all this, I planned on reading each csv data file into a local database. I’m not a fan of manually writing out all my file names, so I first set up all the folder paths to read into a local SQLite database.
# folder paths
train_folder_path <- here("home-credit-credit-risk-model-stability","csv_files","train") %>% str_c(.,"/")
test_folder_path <- here("home-credit-credit-risk-model-stability","csv_files","test") %>% str_c(.,"/")
# file paths
train_names <- here("home-credit-credit-risk-model-stability","csv_files","train") %>% list.files()
test_names <- here("home-credit-credit-risk-model-stability","csv_files","test") %>% list.files()
# test/train full file paths
train_paths <- str_c(train_folder_path, train_names)
test_paths <- str_c(test_folder_path, test_names)
# Names for each data table
test_names_noext <- test_names %>% file_path_sans_ext()
train_names_noext <- train_names %>% file_path_sans_ext()
I then iterated over the file paths, creating a database table for each
csv with dbWriteTable(). (RSQLite’s dbWriteTable() accepts a csv file
path directly, so the files never need to be loaded into the R session first.)
# Import all csv files into one SQL database
# for each csv file in train folder, read into mydb
for (i in seq_along(train_paths)){
dbWriteTable(con, train_names_noext[i], train_paths[i], overwrite = TRUE)
}
# for each csv file in test folder, read into mydb
for (i in seq_along(test_paths)){
dbWriteTable(con, test_names_noext[i], test_paths[i], overwrite = TRUE)
}
Since the data is relatively unknown at this point, here are some
SQL/R functions for extracting quick overviews. We’ll run
these on the train subset of csv files.
dbListFields: DBI function to examine column names
"PRAGMA table_info": SQL statement to examine column data types
# table fields
train_fields <- tibble(table_name = train_names_noext) %>% mutate(
fields = map(table_name, ~dbListFields(con, .)),
info = map(table_name, ~dbGetQuery(con, str_c("PRAGMA table_info(",.,")")))
)
# show
train_fields %>% select(info) %>% unnest(info) %>% head() %>% kable()
| cid | name | type | notnull | dflt_value | pk |
|---|---|---|---|---|---|
| 0 | case_id | INTEGER | 0 | NA | 0 |
| 1 | actualdpd_943P | REAL | 0 | NA | 0 |
| 2 | annuity_853A | REAL | 0 | NA | 0 |
| 3 | approvaldate_319D | TEXT | 0 | NA | 0 |
| 4 | byoccupationinc_3656910L | REAL | 0 | NA | 0 |
| 5 | cancelreason_3545846M | TEXT | 0 | NA | 0 |
dbListTables: DBI function to examine existing table names
# table names
datatables <- dbListTables(con)
# original table names
original_table_names <- tibble(table = datatables) %>%
# filter out future added table names
filter(str_detect(table, "test|train"),
str_detect(table, "id|ID|_U|joined", negate = TRUE))
# show
original_table_names %>% head() %>% kable()
| table |
|---|
| test_applprev_1_0 |
| test_applprev_1_1 |
| test_applprev_1_2 |
| test_applprev_2 |
| test_base |
| test_credit_bureau_a_1_0 |
There are 1139 columns across 68 tables. Some recurring columns are
case_id, num_group1, and
num_group2. case_id is clearly a unique
identifier for each person. Based on the Kaggle data description,
num_group1 and num_group2 index rows within a
case_id when there are multiple entries per
case_id (e.g. historical loan contracts).
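To make that indexing concrete, a quick peek at one of the multi-row tables should show num_group1 counting up within a repeated case_id (a sketch against the tables loaded above; the actual values returned will depend on the data):
# peek at the num_group1 index within case_id (illustrative)
dbGetQuery(con,
  "SELECT case_id, num_group1
   FROM train_applprev_1_0
   ORDER BY case_id, num_group1
   LIMIT 10")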
We can also examine the structure of the files with a
data.tree below.
# set up nodes
homecredit <- Node$new("home-credit-credit-risk-model-stability")
csvfiles <- homecredit$AddChild("csv_files")
train <- csvfiles$AddChild("train")
test <- csvfiles$AddChild("test")
# add train nodes
for (i in (original_table_names %>% filter(str_detect(table, "train")) %>% pull(table))){
train$AddChild(i)}
# add test nodes
for (i in (original_table_names %>% filter(str_detect(table, "test")) %>% pull(table))){
test$AddChild(i)}
# show
homecredit %>%
as.data.frame() %>%
as.matrix() %>%
print(quote=FALSE)
## levelName
## [1,] home-credit-credit-risk-model-stability
## [2,] °--csv_files
## [3,] ¦--train
## [4,] ¦ ¦--train_applprev_1_0
## [5,] ¦ ¦--train_applprev_1_1
## [6,] ¦ ¦--train_applprev_2
## [7,] ¦ ¦--train_base
## [8,] ¦ ¦--train_credit_bureau_a_1_0
## [9,] ¦ ¦--train_credit_bureau_a_1_1
## [10,] ¦ ¦--train_credit_bureau_a_1_2
## [11,] ¦ ¦--train_credit_bureau_a_1_3
## [12,] ¦ ¦--train_credit_bureau_a_2_0
## [13,] ¦ ¦--train_credit_bureau_a_2_1
## [14,] ¦ ¦--train_credit_bureau_a_2_10
## [15,] ¦ ¦--train_credit_bureau_a_2_2
## [16,] ¦ ¦--train_credit_bureau_a_2_3
## [17,] ¦ ¦--train_credit_bureau_a_2_4
## [18,] ¦ ¦--train_credit_bureau_a_2_5
## [19,] ¦ ¦--train_credit_bureau_a_2_6
## [20,] ¦ ¦--train_credit_bureau_a_2_7
## [21,] ¦ ¦--train_credit_bureau_a_2_8
## [22,] ¦ ¦--train_credit_bureau_a_2_9
## [23,] ¦ ¦--train_credit_bureau_b_1
## [24,] ¦ ¦--train_credit_bureau_b_2
## [25,] ¦ ¦--train_debitcard_1
## [26,] ¦ ¦--train_deposit_1
## [27,] ¦ ¦--train_other_1
## [28,] ¦ ¦--train_person_1
## [29,] ¦ ¦--train_person_2
## [30,] ¦ ¦--train_static_0_0
## [31,] ¦ ¦--train_static_0_1
## [32,] ¦ ¦--train_static_cb_0
## [33,] ¦ ¦--train_tax_registry_a_1
## [34,] ¦ ¦--train_tax_registry_b_1
## [35,] ¦ °--train_tax_registry_c_1
## [36,] °--test
## [37,] ¦--test_applprev_1_0
## [38,] ¦--test_applprev_1_1
## [39,] ¦--test_applprev_1_2
## [40,] ¦--test_applprev_2
## [41,] ¦--test_base
## [42,] ¦--test_credit_bureau_a_1_0
## [43,] ¦--test_credit_bureau_a_1_1
## [44,] ¦--test_credit_bureau_a_1_2
## [45,] ¦--test_credit_bureau_a_1_3
## [46,] ¦--test_credit_bureau_a_1_4
## [47,] ¦--test_credit_bureau_a_2_0
## [48,] ¦--test_credit_bureau_a_2_1
## [49,] ¦--test_credit_bureau_a_2_10
## [50,] ¦--test_credit_bureau_a_2_11
## [51,] ¦--test_credit_bureau_a_2_2
## [52,] ¦--test_credit_bureau_a_2_3
## [53,] ¦--test_credit_bureau_a_2_4
## [54,] ¦--test_credit_bureau_a_2_5
## [55,] ¦--test_credit_bureau_a_2_6
## [56,] ¦--test_credit_bureau_a_2_7
## [57,] ¦--test_credit_bureau_a_2_8
## [58,] ¦--test_credit_bureau_a_2_9
## [59,] ¦--test_credit_bureau_b_1
## [60,] ¦--test_credit_bureau_b_2
## [61,] ¦--test_debitcard_1
## [62,] ¦--test_deposit_1
## [63,] ¦--test_other_1
## [64,] ¦--test_person_1
## [65,] ¦--test_person_2
## [66,] ¦--test_static_0_0
## [67,] ¦--test_static_0_1
## [68,] ¦--test_static_0_2
## [69,] ¦--test_static_cb_0
## [70,] ¦--test_tax_registry_a_1
## [71,] ¦--test_tax_registry_b_1
## [72,] °--test_tax_registry_c_1
We see that all the original csv files are organized by a
test_ or train_ prefix. They can be further related
by numeric suffixes that denote either 1) related tables split by
columns (train_person_1, train_person_2) or 2)
related tables split by rows
(train_credit_bureau_a_2_1, train_credit_bureau_a_2_2, …, train_credit_bureau_a_2_8).
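A quick way to check the split-by-rows interpretation is to compare column names across two siblings; if the interpretation holds, they should match exactly (a sketch using the tables created above):
# row-split siblings should expose identical columns; TRUE if so
identical(
  dbListFields(con, "train_credit_bureau_a_2_0"),
  dbListFields(con, "train_credit_bureau_a_2_1")
)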
In addition, a feature_definitions.csv file is provided
that maps column names to feature definitions. For future reference,
we create a tibble mapping feature definitions to column names to table
names.
# import feature definitions
features <- here("home-credit-credit-risk-model-stability","feature_definitions.csv") %>% read_csv(show_col_types = FALSE)
# table fields mapped to feature description
train_features <- train_fields %>% unnest(everything()) %>% left_join(features, by = c("fields" = "Variable")) %>% select(fields, everything())
train_features %>% select(fields, table_name, Description) %>% head()
The goal of this competition is to predict target, which
can be found in train_base. The definition of
target, found in a forum
Q&A, is:
...unpaid payment (one is enough) in certain time period. There is also some time tolerance (e.g. one day late is not default) and amount tolerance (if client paid $100 instead of $100.10).
Here’s some information on train_base.
Column names and types:
# Base training dataset - summary
# table columns
base_columns <- dbGetQuery(con, "PRAGMA table_info(train_base)")
base_columns %>% head()
Number of rows:
# table rows
base_rows <- dbGetQuery(con, "SELECT COUNT(*) FROM train_base")
base_rows %>% head()
Number of rows per participant:
# rows per participant
base_rows_per_id <- dbGetQuery(con, "SELECT case_id, COUNT(*) FROM train_base GROUP BY case_id")
base_rows_per_id %>% head()
# sum of rows
row_per_users <- dbGetQuery(con, "SELECT row_count, count(*) as user_count
FROM
(SELECT count(case_id) AS row_count
FROM train_base
GROUP BY case_id)
GROUP BY row_count")
row_per_users %>% head()
Now we have an idea of the main table we’ll be working with. Since
the prediction target is only available for the
case_id present in train_base, the first step
for reducing the data load is to filter for the same
case_id in all the other supplemental tables. Before we do
that, however, I want to simplify the number of tables I have by joining
tables split by rows. This will reduce the number of table
LEFT JOINs I will have to perform later on.
The split tables we’re looking for share the same
columns but different rows. We want to 1)
identify tables sharing the same columns, 2) identify the shared naming
scheme that these split tables use, and 3) join the split tables’ rows by
the common naming scheme. Here are the steps to set this up:
# renew datatable
datatables <- dbListTables(con)
# Filter original tables
original_fields <- tibble(table = datatables) %>% filter(
str_detect(table, "test|train"),
str_detect(table, "id|ID|_U|joined", negate = TRUE)
) %>% mutate(
fields = map(table, ~dbListFields(con, .)),
info = map(table, ~dbGetQuery(con, str_c("PRAGMA table_info(",.,")")))
) %>% select(table, fields) %>% unnest(fields)
# map extracted scheme to original table name, # of columns
split_datasets_revised <- original_fields %>% select(table) %>% unique() %>%
  mutate(table_name_extract = str_extract(table, "[A-Za-z_]*_\\d(?=_\\d)")) %>%
  select(table, table_name_extract) %>%
  mutate(ncol = map(table, ~dbGetQuery(con, str_c("PRAGMA table_info(", ., ")")) %>% nrow())) %>%
  unnest(ncol)
split_datasets_revised %>% head()
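As a sanity check on the lookahead regex above: it keeps the first numeral only when another _&lt;digit&gt; follows, so members of a row-split set map to a shared stem while singly-suffixed tables return NA (and are dropped by the na.omit() below):
# the lookahead keeps "…_2" only when another "_<digit>" follows
str_extract("train_credit_bureau_a_2_3", "[A-Za-z_]*_\\d(?=_\\d)")
## [1] "train_credit_bureau_a_2"
str_extract("train_person_1", "[A-Za-z_]*_\\d(?=_\\d)")
## [1] NA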
We then build UNION statements to create new joined
tables suffixed with _U.
# joined table set names
split_dataset_names <- split_datasets_revised$table_name_extract %>% unique() %>% na.omit()
# create UNION/create table query for every split tables
union_query_revised <-
# extract unique shared table names
split_datasets_revised %>% select(table_name_extract) %>% na.omit() %>% unique() %>%
# join to identify split tables
left_join(split_datasets_revised,by = join_by(table_name_extract)) %>%
# add # of split tables per group
group_by(table_name_extract) %>% add_tally() %>%
# iterate SELECT statements for SQL query
mutate(queries = str_c("SELECT * FROM ", table)) %>%
# create single query for each table
select(table_name_extract, queries) %>% group_by(table_name_extract) %>% nest(nested_query = queries) %>%
# create combined table
mutate(
combined_query = nested_query[[1]] %>% pull(queries) %>% paste0(., collapse = " "),
union_query = str_replace_all(combined_query,"(\\d )(?=SELECT)","\\1UNION " ),
write_query = str_c("CREATE TABLE IF NOT EXISTS ",table_name_extract,"_U"," AS ", union_query))
# examine the queries
union_query_revised %>% select(table_name_extract,write_query) %>% head()
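Each generated write_query should take roughly this shape (illustrative, inferred from the construction above):
# expected shape of one generated query (illustrative):
# "CREATE TABLE IF NOT EXISTS train_applprev_1_U AS
#    SELECT * FROM train_applprev_1_0
#    UNION SELECT * FROM train_applprev_1_1"
Note that SQL’s UNION de-duplicates rows; UNION ALL would preserve exact duplicates, which may matter if any split files contain identical rows.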
# SQL - union
# execute each query
union_query_revised$write_query %>% walk(~dbExecute(con,.))
Since the main focus of the operation is predicting
target from train_base, we can reduce our
operations to just the data available for case_ids that
are present in train_base. This can be done with an
INNER JOIN, which keeps only the rows from each supplemental
source whose case_id appears in train_base.
Extract the case_id column from the base tables:
# create IDs to filter out relevant IDs in data tables
dbExecute(con,"CREATE TABLE IF NOT EXISTS train_id AS SELECT case_id FROM train_base")
dbExecute(con,"CREATE TABLE IF NOT EXISTS test_id AS SELECT case_id FROM test_base")
# numgroup index
numgroup_tibble <- original_fields %>%
filter(str_detect(fields, "num_group")) %>%
pivot_wider(names_from = fields, values_from = fields, values_fn = ~1, values_fill = 0)
numgroup_tibble %>% head()
Next, create queries that INNER JOIN each table on the corresponding
base table’s case_id, filter rows by num_group, and save the
filtered tables with an _id suffix. Filtering on
num_group1 = 0 (and num_group2 = 0 where present) keeps only the
first-indexed record per case_id, so the later joins stay one row per case.
# renew data table
datatables <- dbListTables(con)
# filter data by id
id_join_query_revised <- tibble(tables = datatables) %>%
# filter for original tables,joined tables
filter(
# filter out split table names
str_detect(tables,str_c(str_c(split_dataset_names,"_\\d"), collapse = "|"),
negate = TRUE))%>%
filter(str_detect(tables, "id|ID|joined|base", negate = TRUE)) %>%
filter(str_detect(tables, "train|test")) %>%
# create query
mutate(
query = case_when(
str_detect(tables, "train") ~ str_c("CREATE TABLE IF NOT EXISTS ",tables, "_id"," AS SELECT * FROM ",tables,
" INNER JOIN train_id ON ", tables),
str_detect(tables, "test") ~ str_c("CREATE TABLE IF NOT EXISTS ",tables, "_id"," AS SELECT * FROM ",tables,
" INNER JOIN test_id ON ", tables)
),
query = case_when(
str_detect(tables, "train") ~ str_c(query, ".case_id = train_id.case_id"),
str_detect(tables, "test") ~ str_c(query, ".case_id = test_id.case_id")
)
)
# filter rows by num_group
id_join_query_num <- id_join_query_revised %>% left_join(numgroup_tibble , by = c("tables" = "table")) %>% mutate(
query = case_when(
num_group1 == 1 & num_group2 == 0 ~ str_c(query, " AND num_group1 = 0"),
num_group1 == 1 & num_group2 == 1 ~ str_c(query, " AND num_group1 = 0 AND num_group2 = 0"),
.default = query
)
)
# show queries
id_join_query_num$query %>% head()
## [1] "CREATE TABLE IF NOT EXISTS test_applprev_1_U_id AS SELECT * FROM test_applprev_1_U INNER JOIN test_id ON test_applprev_1_U.case_id = test_id.case_id"
## [2] "CREATE TABLE IF NOT EXISTS test_applprev_2_id AS SELECT * FROM test_applprev_2 INNER JOIN test_id ON test_applprev_2.case_id = test_id.case_id AND num_group1 = 0 AND num_group2 = 0"
## [3] "CREATE TABLE IF NOT EXISTS test_credit_bureau_a_1_U_id AS SELECT * FROM test_credit_bureau_a_1_U INNER JOIN test_id ON test_credit_bureau_a_1_U.case_id = test_id.case_id"
## [4] "CREATE TABLE IF NOT EXISTS test_credit_bureau_a_2_U_id AS SELECT * FROM test_credit_bureau_a_2_U INNER JOIN test_id ON test_credit_bureau_a_2_U.case_id = test_id.case_id"
## [5] "CREATE TABLE IF NOT EXISTS test_credit_bureau_b_1_id AS SELECT * FROM test_credit_bureau_b_1 INNER JOIN test_id ON test_credit_bureau_b_1.case_id = test_id.case_id AND num_group1 = 0"
## [6] "CREATE TABLE IF NOT EXISTS test_credit_bureau_b_2_id AS SELECT * FROM test_credit_bureau_b_2 INNER JOIN test_id ON test_credit_bureau_b_2.case_id = test_id.case_id AND num_group1 = 0 AND num_group2 = 0"# execute
id_join_query_num$query %>% walk(~dbExecute(con,.))
To facilitate a quick analysis, we’re going to pick out some variables
that I think are theoretically relevant for predicting
target:
- birth_259D: Birth date
- education_927M: Education
- empl_employedfrom_271D: Employment date
- empl_employedtotal_800L: Employment length
- empl_industry_691L: Job industry
- familystate_447L: Family status
- incometype_1044T: Salary type
- mainoccupationinc_384A: Income amount
- riskassesment_940T: Normalized risk, assessed by the credit bureau

We can use a tibble to map the selected features to their respective tables.
# highlight features of interest and identify within datatables
# pick features of interest and rename
revised_feature_list <- list(
base = c("case_id"),
person_1_id = c("birth_259D", #birthdate
"education_927M", # education
"empl_employedfrom_271D", # employ date
"empl_employedtotal_800L", # employ length
"empl_industry_691L", # job industry
"familystate_447L", # relationship
"incometype_1044T", # salary type
"mainoccupationinc_384A"# income amount
),
static_cb_0_id = c("riskassesment_940T" #risk assessment - normalized
))
# link features of interest to data tables
revised_features_tbl <-
# set up variables as tibble
tibble(table = names(revised_feature_list),
train_table = str_c("train_", table),
test_table = str_c("test_", table),
features = revised_feature_list) %>% unnest(features)
We’ll then set up the query parts using the mapped tibble.
# create query for joining relevant features to train_base
train_join <- revised_features_tbl %>% filter(str_detect(features, "case_id", negate = TRUE)) %>% na.omit() %>%
# table.column query
mutate(
table_column = str_c(train_table,".", features)) %>%
select(train_table, table_column) %>%
mutate(
join_string = str_c("LEFT JOIN ", train_table, " ON "),
id_string = str_c("train_base.case_id = ",train_table, ".case_id"),
join_id_string = str_c(join_string, id_string)
) %>% unique()
# create query for joining relevant features to test_base
test_join <- revised_features_tbl %>% filter(str_detect(features, "case_id", negate = TRUE)) %>% na.omit() %>%
# table.column query
mutate(
table_column = str_c(test_table,".", features)) %>%
select(test_table, table_column) %>%
mutate(
join_string = str_c("LEFT JOIN ", test_table, " ON "),
id_string = str_c("test_base.case_id = ",test_table, ".case_id"),
join_id_string = str_c(join_string, id_string)
) %>% unique()
train_base set up:
train_join %>% select(table_column, join_id_string) %>% head()
test_base set up:
test_join %>% select(table_column, join_id_string) %>% head()
The parts can be strung together for the final query.
# create the final train join query
# all join TABLE on TABLE
train_join_list <- train_join$join_id_string %>% unique() %>% str_c(collapse = " ")
# all TABLE.COLUMNS
train_table_column_join <- train_join$table_column %>% str_c(collapse = ", ")
# full query
train_join_query <- train_table_column_join %>%
# select train_base columns
str_c("SELECT train_base.*, ",.) %>%
#
str_c(., " FROM train_base") %>%
# left join table statements
str_c(., " ",train_join_list) %>%
# save as new table
str_c("CREATE TABLE IF NOT EXISTS train_joined AS ",.)
# create the final test join query
# all join TABLE on TABLE
test_join_list <- test_join$join_id_string %>% unique() %>% str_c(collapse = " ")
# all TABLE.COLUMNS
test_table_column_join <- test_join$table_column %>% str_c(collapse = ", ")
# full query
test_join_query <- test_table_column_join %>%
# select test_base columns
str_c("SELECT test_base.*, ",.) %>%
#
str_c(., " FROM test_base") %>%
# left join table statements
str_c(., " ",test_join_list) %>%
# save as new table
str_c("CREATE TABLE IF NOT EXISTS test_joined AS ",.)
train_base join query:
train_join_query
## [1] "CREATE TABLE IF NOT EXISTS train_joined AS SELECT train_base.*, train_person_1_id.birth_259D, train_person_1_id.education_927M, train_person_1_id.empl_employedfrom_271D, train_person_1_id.empl_employedtotal_800L, train_person_1_id.empl_industry_691L, train_person_1_id.familystate_447L, train_person_1_id.incometype_1044T, train_person_1_id.mainoccupationinc_384A, train_static_cb_0_id.riskassesment_940T FROM train_base LEFT JOIN train_person_1_id ON train_base.case_id = train_person_1_id.case_id LEFT JOIN train_static_cb_0_id ON train_base.case_id = train_static_cb_0_id.case_id"
test_base join query:
test_join_query
## [1] "CREATE TABLE IF NOT EXISTS test_joined AS SELECT test_base.*, test_person_1_id.birth_259D, test_person_1_id.education_927M, test_person_1_id.empl_employedfrom_271D, test_person_1_id.empl_employedtotal_800L, test_person_1_id.empl_industry_691L, test_person_1_id.familystate_447L, test_person_1_id.incometype_1044T, test_person_1_id.mainoccupationinc_384A, test_static_cb_0_id.riskassesment_940T FROM test_base LEFT JOIN test_person_1_id ON test_base.case_id = test_person_1_id.case_id LEFT JOIN test_static_cb_0_id ON test_base.case_id = test_static_cb_0_id.case_id"
# join tables with selected features
dbExecute(con, test_join_query)
dbExecute(con, train_join_query)
Now that we have a dataset with all our relevant features, we can move into data analysis and prediction. First, we need to define our variable types across our test and train data.
# read train table
train_joined_df <- dbGetQuery(con, "SELECT * FROM train_joined") %>% tibble()
"train_joined_df" %>% quickrds()
# read test table
test_joined_df <- dbGetQuery(con, "SELECT * FROM test_joined") %>% tibble()
"test_joined_df" %>% quickrds()
We label categorical variables (e.g., education_927M) as factors and numeric variables (e.g., mainoccupationinc_384A) as numeric:
train_joined_df <- read_rds("train_joined_df.rds")
# define train table variables
train_joined_tbl <- train_joined_df %>%
# convert date to numeric
mutate(
date_decision = as_date(date_decision),
birth_259D = as_date(birth_259D),
age = time_length(difftime(date_decision, birth_259D), "years"),
month.n = month(date_decision),
year.n = year(date_decision),
week.n = WEEK_NUM %>% as.numeric()) %>%
  # label numeric, factors
  # (patterns kept on one line: a regex split across lines would embed
  # the line break in the pattern and silently fail to match)
  mutate(
    across(matches("mainoccupationinc_384A|riskassesment_940T|age|\\.n"),
           ~ as.numeric(.)),
    across(matches("education_927M|empl_employedtotal_800L|empl_industry_691L|familystate_447L|incometype_1044T|target"),
           ~ as.factor(.)))
"train_joined_tbl" %>% quickrds()
test_joined_df <- read_rds("test_joined_df.rds")
# define test table variables
test_joined_tbl <- test_joined_df %>%
# convert date to numeric
mutate(
date_decision = as_date(date_decision),
birth_259D = as_date(birth_259D),
age = time_length(difftime(date_decision, birth_259D), "years"),
month.n = month(date_decision),
year.n = year(date_decision),
week.n = WEEK_NUM %>% as.numeric()) %>%
  # label numeric, factors
  mutate(
    across(matches("mainoccupationinc_384A|riskassesment_940T|age|\\.n"),
           ~ as.numeric(.)),
    across(matches("education_927M|empl_employedtotal_800L|empl_industry_691L|familystate_447L|incometype_1044T|target"),
           ~ as.factor(.)))
"test_joined_tbl" %>% quickrds()
train df:
train_joined_tbl %>% glimpse()
## Rows: 1,526,659
## Columns: 18
## $ case_id <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20…
## $ date_decision <date> 2019-01-03, 2019-01-03, 2019-01-04, 2019-01-03, 2019-01-04, 2019-01-02,…
## $ MONTH <int> 201901, 201901, 201901, 201901, 201901, 201901, 201901, 201901, 201901, …
## $ WEEK_NUM <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ target <fct> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ birth_259D <date> 1986-07-01, 1957-08-01, 1974-12-01, 1993-08-01, 1994-01-01, 1979-10-01,…
## $ education_927M <fct> P97_36_170, P97_36_170, P97_36_170, P33_146_175, P33_146_175, P97_36_170…
## $ empl_employedfrom_271D <chr> "2017-09-15", "2008-10-29", "2010-02-15", "2018-05-15", "2014-12-15", "2…
## $ empl_employedtotal_800L <fct> MORE_FIVE, MORE_FIVE, MORE_FIVE, MORE_FIVE, MORE_FIVE, MORE_FIVE, MORE_F…
## $ empl_industry_691L <fct> OTHER, OTHER, OTHER, OTHER, OTHER, OTHER, EDUCATION, EDUCATION, OTHER, ,…
## $ familystate_447L <fct> MARRIED, DIVORCED, MARRIED, MARRIED, MARRIED, MARRIED, SINGLE, SINGLE, M…
## $ incometype_1044T <fct> SALARIED_GOVT, SALARIED_GOVT, EMPLOYED, EMPLOYED, EMPLOYED, PRIVATE_SECT…
## $ mainoccupationinc_384A <dbl> 10800, 10000, 14000, 10000, 24000, 64000, 20000, 46000, 90000, 100000, 5…
## $ riskassesment_940T <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ age <dbl> 32.50924, 61.42368, 44.09309, 25.42368, 25.00753, 39.25530, 28.00548, 25…
## $ month.n <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ year.n <dbl> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, …
## $ week.n <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
test df:
test_joined_tbl %>% glimpse()
## Rows: 10
## Columns: 17
## $ case_id <int> 57543, 57549, 57551, 57552, 57569, 57630, 57631, 57632, 57633, 57634
## $ date_decision <date> 2020-10-06, 2020-10-06, 2020-10-06, 2020-10-07, 2020-10-06, 2020-10-06, …
## $ MONTH <int> 202010, 202010, 202010, 202010, 202010, 202010, 202010, 202010, 202010,…
## $ WEEK_NUM <int> 92, 92, 92, 92, 92, 92, 92, 92, 92, 92
## $ birth_259D <date> 1996-08-01, 1992-03-01, 1990-08-01, NA, NA, NA, NA, NA, NA, NA
## $ education_927M <fct> P97_36_170, P97_36_170, P97_36_170, NA, NA, NA, NA, NA, NA, NA
## $ empl_employedfrom_271D <chr> "2018-02-15", "2019-05-04", "2019-01-15", NA, NA, NA, NA, NA, NA, NA
## $ empl_employedtotal_800L <fct> MORE_ONE, MORE_ONE, MORE_FIVE, NA, NA, NA, NA, NA, NA, NA
## $ empl_industry_691L <fct> OTHER, OTHER, OTHER, NA, NA, NA, NA, NA, NA, NA
## $ familystate_447L <fct> MARRIED, SINGLE, MARRIED, NA, NA, NA, NA, NA, NA, NA
## $ incometype_1044T <fct> EMPLOYED, EMPLOYED, EMPLOYED, NA, NA, NA, NA, NA, NA, NA
## $ mainoccupationinc_384A <dbl> 36000, 15000, 24000, NA, NA, NA, NA, NA, NA, NA
## $ riskassesment_940T <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ age <dbl> 24.18070, 28.59959, 30.18207, NA, NA, NA, NA, NA, NA, NA
## $ month.n <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10
## $ year.n <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020
## $ week.n                  <dbl> 92, 92, 92, 92, 92, 92, 92, 92, 92, 92
Here’s a look at some summary statistics.
library(summarytools)
# factor level proportions
train_joined_tbl %>% select(where(is.factor)) %>% freq()
## Frequencies
## train_joined_tbl$target
## Type: Factor
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------- --------- --------- -------------- --------- --------------
## 0 1478665 96.86 96.86 96.86 96.86
## 1 47994 3.14 100.00 3.14 100.00
## <NA> 0 0.00 100.00
## Total 1526659 100.00 100.00 100.00 100.00
##
## train_joined_tbl$education_927M
## Type: Factor
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------------- --------- --------- -------------- --------- --------------
## a55475b1 798046 52.274 52.274 52.274 52.274
## P106_81_188 54661 3.580 55.854 3.580 55.854
## P157_18_172 631 0.041 55.896 0.041 55.896
## P17_36_170 5481 0.359 56.255 0.359 56.255
## P33_146_175 258589 16.938 73.193 16.938 73.193
## P97_36_170 409251 26.807 100.000 26.807 100.000
## <NA> 0 0.000 100.000
## Total 1526659 100.000 100.000 100.000 100.000
##
## train_joined_tbl$empl_employedtotal_800L
## Type: Factor
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## -------------------- --------- --------- -------------- --------- --------------
## (Empty string) 998344 65.39 65.39 65.39 65.39
## LESS_ONE 30467 2.00 67.39 2.00 67.39
## MORE_FIVE 371321 24.32 91.71 24.32 91.71
## MORE_ONE 126527 8.29 100.00 8.29 100.00
## <NA> 0 0.00 100.00
## Total 1526659 100.00 100.00 100.00 100.00
##
## train_joined_tbl$empl_industry_691L
## Type: Factor
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------------------- --------- ----------- -------------- ----------- --------------
## (Empty string) 1004423 65.79223 65.79223 65.79223 65.79223
## AGRICULTURE 5288 0.34638 66.13861 0.34638 66.13861
## ARMY_POLICE 14 0.00092 66.13952 0.00092 66.13952
## ART_MEDIA 245 0.01605 66.15557 0.01605 66.15557
## CATERING 3508 0.22978 66.38536 0.22978 66.38536
## CHARITY_RELIGIOUS 25 0.00164 66.38699 0.00164 66.38699
## EDUCATION 30346 1.98774 68.37473 1.98774 68.37473
## FINANCE 3150 0.20633 68.58106 0.20633 68.58106
## GAMING 96 0.00629 68.58735 0.00629 68.58735
## GOVERNMENT 35440 2.32141 70.90876 2.32141 70.90876
## HEALTH 13026 0.85324 71.76200 0.85324 71.76200
## INSURANCE 217 0.01421 71.77621 0.01421 71.77621
## IT 279 0.01828 71.79449 0.01828 71.79449
## LAWYER 564 0.03694 71.83143 0.03694 71.83143
## MANUFACTURING 9035 0.59182 72.42325 0.59182 72.42325
## MARKETING 239 0.01566 72.43890 0.01566 72.43890
## MINING 3582 0.23463 72.67353 0.23463 72.67353
## OTHER 386837 25.33880 98.01233 25.33880 98.01233
## POST_TELCO 612 0.04009 98.05241 0.04009 98.05241
## REAL_ESTATE 3680 0.24105 98.29346 0.24105 98.29346
## RECRUITMENT 85 0.00557 98.29903 0.00557 98.29903
## TOURISM 513 0.03360 98.33263 0.03360 98.33263
## TRADE 20696 1.35564 99.68827 1.35564 99.68827
## TRANSPORTATION 4318 0.28284 99.97111 0.28284 99.97111
## WELNESS 441 0.02889 100.00000 0.02889 100.00000
## <NA> 0 0.00000 100.00000
## Total 1526659 100.00000 100.00000 100.00000 100.00000
##
## train_joined_tbl$familystate_447L
## Type: Factor
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ------------------------- --------- --------- -------------- --------- --------------
## (Empty string) 798046 52.27 52.27 52.27 52.27
## DIVORCED 19296 1.26 53.54 1.26 53.54
## LIVING_WITH_PARTNER 8142 0.53 54.07 0.53 54.07
## MARRIED 484846 31.76 85.83 31.76 85.83
## SINGLE 183334 12.01 97.84 12.01 97.84
## WIDOWED 32995 2.16 100.00 2.16 100.00
## <NA> 0 0.00 100.00
## Total 1526659 100.00 100.00 100.00 100.00
##
## train_joined_tbl$incometype_1044T
## Type: Factor
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------------------------- --------- ------------ -------------- ------------ --------------
## EMPLOYED 298158 19.530098 19.530098 19.530098 19.530098
## HANDICAPPED 1 0.000066 19.530164 0.000066 19.530164
## HANDICAPPED_2 7371 0.482819 20.012983 0.482819 20.012983
## HANDICAPPED_3 5258 0.344412 20.357395 0.344412 20.357395
## OTHER 11436 0.749087 21.106482 0.749087 21.106482
## PRIVATE_SECTOR_EMPLOYEE 490562 32.133043 53.239525 32.133043 53.239525
## RETIRED_PENSIONER 311028 20.373115 73.612640 20.373115 73.612640
## SALARIED_GOVT 373646 24.474752 98.087392 24.474752 98.087392
## SELFEMPLOYED 29199 1.912608 100.000000 1.912608 100.000000
## <NA> 0 0.000000 100.000000
## Total 1526659 100.000000 100.000000 100.000000 100.000000
# numeric variable summaries
train_joined_tbl %>% select(where(is.numeric)) %>% descr()
## Descriptive Statistics
## train_joined_tbl
## N: 1526659
##
## age case_id mainoccupationinc_384A MONTH month.n
## ----------------- ------------ ------------ ------------------------ ------------ ------------
## Mean 44.56 1286076.57 57707.48 201936.29 6.43
## Std.Dev 14.03 718946.59 33348.30 44.74 3.51
## Min 20.96 0.00 0.00 201901.00 1.00
## Q1 32.74 766197.00 36000.00 201906.00 3.00
## Median 42.81 1357358.00 50000.00 201910.00 7.00
## Q3 56.20 1739023.00 70000.00 202001.00 9.00
## Max 76.04 2703454.00 200000.00 202010.00 12.00
## MAD 16.72 721155.91 29652.00 7.41 4.45
## IQR 23.46 972825.00 34000.00 95.00 6.00
## CV 0.31 0.56 0.58 0.00 0.55
## Skewness 0.27 0.14 1.66 0.87 -0.05
## SE.Skewness 0.00 0.00 0.00 0.00 0.00
## Kurtosis -1.03 -0.59 3.71 -1.22 -1.23
## N.Valid 1526659.00 1526659.00 1526659.00 1526659.00 1526659.00
## Pct.Valid 100.00 100.00 100.00 100.00 100.00
##
## Table: Table continues below
##
##
##
## riskassesment_940T WEEK_NUM week.n year.n
## ----------------- -------------------- ------------ ------------ ------------
## Mean 0.23 40.77 40.77 2019.30
## Std.Dev 0.98 23.80 23.80 0.46
## Min -3.67 0.00 0.00 2019.00
## Q1 -0.23 23.00 23.00 2019.00
## Median 0.37 40.00 40.00 2019.00
## Q3 0.97 55.00 55.00 2020.00
## Max 2.12 91.00 91.00 2020.00
## MAD 0.89 25.20 25.20 0.00
## IQR 1.20 32.00 32.00 1.00
## CV 4.32 0.58 0.58 0.00
## Skewness -0.83 0.30 0.30 0.88
## SE.Skewness 0.01 0.00 0.00 0.00
## Kurtosis 0.25 -0.65 -0.65 -1.23
## N.Valid 53560.00 1526659.00 1526659.00 1526659.00
## Pct.Valid 3.51 100.00 100.00 100.00
We see that riskassesment_940T is missing for roughly 96.5% of rows
(only 3.51% valid), so it may be best to leave it out for now.
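As a quick check on that figure (a sketch using the tibble built above; descr() reports 3.51% valid, so the complement should come out near 96.5):
# share of missing riskassesment_940T values, in percent
train_joined_tbl %>%
  summarise(pct_missing = 100 * mean(is.na(riskassesment_940T)))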