1 Introduction to model tuning

Since the random forest in my previous notebook did not appear to predict very well, let’s try some hyperparameter tuning. We’ll first set up the same pipeline as last time.

1.1 Setup

import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import FunctionTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

pd.options.display.max_columns = None
OS_df = pd.read_csv("Online Sales Data.csv")
OS_df.head()

# Date

##    Transaction ID        Date Product Category             Product Name  \
## 0           10001  2024-01-01      Electronics            iPhone 14 Pro   
## 1           10002  2024-01-02  Home Appliances         Dyson V11 Vacuum   
## 2           10003  2024-01-03         Clothing         Levi's 501 Jeans   
## 3           10004  2024-01-04            Books        The Da Vinci Code   
## 4           10005  2024-01-05  Beauty Products  Neutrogena Skincare Set   
## 
##    Units Sold  Unit Price  Total Revenue         Region Payment Method  
## 0           2      999.99        1999.98  North America    Credit Card  
## 1           1      499.99         499.99         Europe         PayPal  
## 2           3       69.99         209.97           Asia     Debit Card  
## 3           4       15.99          63.96  North America    Credit Card  
## 4           1       89.99          89.99         Europe         PayPal

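# Derive calendar features from the Date column
OS_df["Date"] = pd.to_datetime(OS_df.Date)
OS_df["year"] = OS_df.Date.dt.year
OS_df["month"] = OS_df.Date.dt.month
OS_df["day"] = OS_df.Date.dt.day
OS_df["dow"] = OS_df.Date.dt.dayofweek  # Monday=0, Sunday=6
OS_df["quarter"] = OS_df.Date.dt.quarter
OS_df["weekday"] = OS_df.Date.dt.weekday  # same values as dow (weekday is an alias of dayofweek)

# Treat the calendar features as categorical rather than numeric
for col in ['year','month','day','dow','quarter','weekday']:
    OS_df[col] = OS_df[col].astype('category')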
    
# label (target)
OS_labels = OS_df["Total Revenue"]
# features
OS_data = OS_df.drop(["Total Revenue", "Transaction ID"], axis = 1)

# split into train/test sets (80% train, 20% test)
x_train, x_test, y_train, y_test = train_test_split(OS_data, OS_labels, test_size=0.2, random_state=42)

# numerical transformer
numerical_transformer = Pipeline(steps=[
    ("standardize", StandardScaler())
])
# categorical transformer
categorical_transformer = Pipeline(steps=[
    ("ohe", OneHotEncoder(handle_unknown="ignore"))
])


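# name categorical features (object dtype only; the date features cast to
# 'category' above are not matched by this filter)
categorical_columns = [col for col in x_train.columns if x_train[col].dtype == "object"]

# name numerical features
numerical_columns = [col for col in x_train.columns if x_train[col].dtype in ["int64", "float64"]]

# bundle everything in ColumnTransformer; columns not listed in either group
# are dropped, since remainder="drop" is the default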
preprocessor = ColumnTransformer(
    transformers=[
        ("numerical", numerical_transformer, numerical_columns),
        ("categorical", categorical_transformer, categorical_columns),
    ]
)
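
Since the column lists above are built from dtype checks, it can help to confirm which columns each filter actually picks up. The following is a minimal sketch on a small made-up frame (`toy_df` and its values are hypothetical, not the sales data); it also shows how `make_column_selector`, which is imported above, could include `category` columns if we wanted them in the model.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical toy frame that mimics the dtypes produced in the setup above
toy_df = pd.DataFrame({
    "Units Sold": [2, 1, 3],
    "Unit Price": [999.99, 499.99, 69.99],
    "Region": pd.Series(["North America", "Europe", "Asia"], dtype="object"),
    "month": pd.Series([1, 1, 2]).astype("category"),
})

# Same dtype filters as the list comprehensions used for the preprocessor
object_cols = [c for c in toy_df.columns if toy_df[c].dtype == "object"]
numeric_cols = [c for c in toy_df.columns if toy_df[c].dtype in ["int64", "float64"]]

# A selector that also matches the 'category' dtype
cat_selector = make_column_selector(dtype_include=["object", "category"])
selected_cats = cat_selector(toy_df)

print("object filter:     ", object_cols)
print("numeric filter:    ", numeric_cols)
print("object + category: ", selected_cats)
```

In this sketch the `month` column is skipped by both of the original filters and only appears once `category` is included explicitly.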

1.2 Random Hyperparameter Grid

We’ll set up a random hyperparameter grid to optimize the following parameters:

- n_estimators = number of trees in the forest
- max_features = max number of features considered for splitting a node
- max_depth = max number of levels in each decision tree
- min_samples_split = min number of data points placed in a node before the node is split
- min_samples_leaf = min number of data points allowed in a leaf node
- bootstrap = method for sampling data points (with or without replacement)



# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 1, stop = 300, num = 15)]
# Number of features to consider at every split
# (0.5 = half of the features, 1 = a single feature, 'sqrt' = square root of the feature count)
max_features = [0.5,1, 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(5, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'rf__n_estimators': n_estimators,
               'rf__max_features': max_features,
               'rf__max_depth': max_depth,
               'rf__min_samples_split': min_samples_split,
               'rf__min_samples_leaf': min_samples_leaf,
               'rf__bootstrap': bootstrap}
print(random_grid)

## {'rf__n_estimators': [1, 22, 43, 65, 86, 107, 129, 150, 171, 193, 214, 235, 257, 278, 300], 'rf__max_features': [0.5, 1, 'sqrt'], 'rf__max_depth': [5, 15, 26, 36, 47, 57, 68, 78, 89, 99, 110, None], 'rf__min_samples_split': [2, 5, 10], 'rf__min_samples_leaf': [1, 2, 4], 'rf__bootstrap': [True, False]}
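
To get a feel for how large this search space is, we can count the combinations the grid defines. This is a small sketch that rebuilds the same candidate lists and uses scikit-learn’s `ParameterGrid` only for counting; the variable names `n_combinations` and `coverage` are introduced here for illustration.

```python
import math

import numpy as np
from sklearn.model_selection import ParameterGrid

# Same candidate values as the grid above
random_grid = {
    "rf__n_estimators": [int(x) for x in np.linspace(start=1, stop=300, num=15)],
    "rf__max_features": [0.5, 1, "sqrt"],
    "rf__max_depth": [int(x) for x in np.linspace(5, 110, num=11)] + [None],
    "rf__min_samples_split": [2, 5, 10],
    "rf__min_samples_leaf": [1, 2, 4],
    "rf__bootstrap": [True, False],
}

# Number of distinct combinations an exhaustive grid search would have to try
n_combinations = len(ParameterGrid(random_grid))

# Cross-check: product of the number of candidates per parameter
assert n_combinations == math.prod(len(v) for v in random_grid.values())

n_iter = 100  # number of draws used by the random search in this notebook
coverage = n_iter / n_combinations
print(f"{n_combinations} combinations; {n_iter} draws cover {coverage:.1%}")
```

The grid holds 15 × 3 × 12 × 3 × 3 × 2 = 9,720 combinations, so a random search with 100 draws looks at roughly 1% of them.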

1.3 Fit

We will fit our model in several steps. First, we set up the pipeline that we used last time. Instead of fitting the pipeline to the data directly, we’ll feed it into a random search with cross-validation to optimize our hyperparameters. The candidate hyperparameter values were defined in our parameter grid above. The random search samples 100 combinations of those values and scores each one with 3-fold cross-validation. Once it’s finished, we’re left with an “optimized” model for predicting. Note the quotation marks: the search optimizes the cross-validated score on the training data, and, as we’ll see, that does not always improve predictive accuracy on new data.
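
To make the sampling step concrete, here is a minimal sketch of how combinations can be drawn at random from a grid with scikit-learn’s `ParameterSampler`. The reduced `demo_grid` is hypothetical and much smaller than our real grid, and this only illustrates the idea of random draws rather than reproducing the internals of the search.

```python
from sklearn.model_selection import ParameterSampler

# Hypothetical, reduced grid for illustration only (3 * 3 * 2 = 18 combinations)
demo_grid = {
    "rf__n_estimators": [1, 150, 300],
    "rf__max_depth": [5, 57, None],
    "rf__bootstrap": [True, False],
}

# Draw 4 of the 18 combinations; random_state makes the draw repeatable
sampled = list(ParameterSampler(demo_grid, n_iter=4, random_state=42))

for params in sampled:
    print(params)
```

Because every entry in the grid is a list, the draws are made without replacement, so no combination is tested twice.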

from sklearn.model_selection import RandomizedSearchCV
# random forest model
rf_model = RandomForestRegressor()
pipe = Pipeline(steps = [("preprocessor",preprocessor), ('rf' , rf_model)])
# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator = pipe, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)

# Fit the random search model
rf_random.fit(x_train, y_train)

RandomizedSearchCV(cv=3,
                   estimator=Pipeline(steps=[('preprocessor',
                                              ColumnTransformer(transformers=[('numerical',
                                                                               Pipeline(steps=[('standardize',
                                                                                                StandardScaler())]),
                                                                               ['Units '
                                                                                'Sold',
                                                                                'Unit '
                                                                                'Price']),
                                                                              ('categorical',
                                                                               Pipeline(steps=[('ohe',
                                                                                                OneHotEncoder(handle_unknown='ignore'))]),
                                                                               ['Product '
                                                                                'Category',
                                                                                'Product '
                                                                                'Name',
                                                                                'Region',
                                                                                'Payment '
                                                                                'Method'])])),
                                             ('rf', RandomForestRegressor())]),
                   n_iter=100, n_jobs=-1,
                   param_distributions={'rf__bootstrap': [True, False],
                                        'rf__max_depth': [5, 15, 26, 36, 47, 57,
                                                          68, 78, 89, 99, 110,
                                                          None],
                                        'rf__max_features': [0.5, 1, 'sqrt'],
                                        'rf__min_samples_leaf': [1, 2, 4],
                                        'rf__min_samples_split': [2, 5, 10],
                                        'rf__n_estimators': [1, 22, 43, 65, 86,
                                                             107, 129, 150, 171,
                                                             193, 214, 235, 257,
                                                             278, 300]},
                   random_state=42, verbose=2)
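
Once a search like this has been fit, the winning combination and its scores can be read off the search object. The sketch below uses synthetic data from `make_regression` and a deliberately tiny grid (both are stand-ins chosen so the example runs quickly, not the sales data or the grid above) to show where those values live and how the cross-validated score can be compared with the score on held-out data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for this sketch)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipe = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("rf", RandomForestRegressor(random_state=42)),
])

# Hypothetical small grid so the example finishes in a few seconds
small_grid = {
    "rf__n_estimators": [10, 25, 50],
    "rf__max_depth": [3, 5, None],
    "rf__min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    pipe, param_distributions=small_grid, n_iter=5, cv=3, random_state=42
)
search.fit(X_train, y_train)

cv_r2 = search.best_score_              # mean cross-validated R^2 of the best candidate
test_r2 = search.score(X_test, y_test)  # R^2 of the refit best model on held-out data

print("best params:", search.best_params_)
print("cross-validated R^2:", round(cv_r2, 3))
print("held-out test R^2:", round(test_r2, 3))
```

The cross-validated score is computed only from the training split, which is why it is worth checking it against the held-out score before trusting the tuned model.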

