Since the random forest in my previous notebook did not appear to predict very well, let’s try some hyperparameter tuning. We’ll first set up with the pipeline from last time.
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import FunctionTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
pd.options.display.max_columns = None
OS_df = pd.read_csv("Online Sales Data.csv")
OS_df.head()
# Date
## Transaction ID Date Product Category Product Name \
## 0 10001 2024-01-01 Electronics iPhone 14 Pro
## 1 10002 2024-01-02 Home Appliances Dyson V11 Vacuum
## 2 10003 2024-01-03 Clothing Levi's 501 Jeans
## 3 10004 2024-01-04 Books The Da Vinci Code
## 4 10005 2024-01-05 Beauty Products Neutrogena Skincare Set
##
## Units Sold Unit Price Total Revenue Region Payment Method
## 0 2 999.99 1999.98 North America Credit Card
## 1 1 499.99 499.99 Europe PayPal
## 2 3 69.99 209.97 Asia Debit Card
## 3 4 15.99 63.96 North America Credit Card
## 4 1 89.99 89.99 Europe PayPal
OS_df["Date"] = pd.to_datetime(OS_df.Date)
OS_df["year"] = OS_df.Date.dt.year
OS_df["month"] = OS_df.Date.dt.month
OS_df["day"] = OS_df.Date.dt.day
OS_df["dow"] = OS_df.Date.dt.dayofweek
OS_df["quarter"] = OS_df.Date.dt.quarter
OS_df["weekday"] = OS_df.Date.dt.weekday  # note: dt.weekday is an alias of dt.dayofweek, so this duplicates "dow"
for col in ['year','month','day','dow','quarter','weekday']:
    OS_df[col] = OS_df[col].astype('category')
# split train
# label
OS_labels = OS_df["Total Revenue"]
# features
OS_data = OS_df.drop(["Total Revenue", "Transaction ID"], axis = 1)
# split into train/test
x_train, x_test, y_train, y_test = train_test_split(OS_data, OS_labels, test_size=0.2, random_state=42)
# numerical transformer
numerical_transformer = Pipeline(steps=[
    ("standardize", StandardScaler())
])
# categorical transformer
categorical_transformer = Pipeline(steps=[
    ("ohe", OneHotEncoder(handle_unknown="ignore"))
])
# name categorical features
categorical_columns = [col for col in x_train.columns if x_train[col].dtype == "object"]
# name numerical features
numerical_columns = [col for col in x_train.columns if x_train[col].dtype in ["int64", "float64"]]
# bundle everything in ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ("numerical", numerical_transformer, numerical_columns),
        ("categorical", categorical_transformer, categorical_columns),
    ]
)
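One thing worth checking at this point is which columns the dtype filters above actually pick up. The sketch below uses a small made-up frame (its values are illustrative, not the sales data) to show that columns cast to `category`, as well as datetime columns, match neither the `"object"` test nor the `["int64", "float64"]` test. Since `ColumnTransformer` drops unlisted columns by default (`remainder="drop"`), such columns would not reach the model.

```python
import pandas as pd

# Toy frame for illustration only; the values are made up.
toy = pd.DataFrame({
    "Region": pd.Series(["Asia", "Europe", "Asia"], dtype="object"),
    "Units Sold": pd.Series([2, 1, 3], dtype="int64"),
    "Unit Price": pd.Series([999.99, 499.99, 69.99], dtype="float64"),
    "Date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
})
toy["month"] = toy["Date"].dt.month.astype("category")

# Same dtype filters as in the pipeline set-up
categorical_cols = [c for c in toy.columns if toy[c].dtype == "object"]
numerical_cols = [c for c in toy.columns if toy[c].dtype in ["int64", "float64"]]

# Columns matched by neither filter
unselected = [c for c in toy.columns if c not in categorical_cols + numerical_cols]

print("categorical:", categorical_cols)
print("numerical:  ", numerical_cols)
print("unselected: ", unselected)
```

If the date-derived features are meant to be used, one option (an assumption about intent, not something done in this post) would be to select them with `make_column_selector(dtype_include="category")` and one-hot encode them as well.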
We’ll set up a random hyperparameter grid to optimize the following parameters:
- n_estimators = number of trees in the forest
- max_features = max number of features considered for splitting a node
- max_depth = max number of levels in each decision tree
- min_samples_split = min number of data points a node must contain before it can be split
- min_samples_leaf = min number of data points allowed in a leaf node
- bootstrap = method for sampling data points (with or without replacement)
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 1, stop = 300, num = 15)]
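# Number of features to consider at every split
# (0.5 = half of the features; the integer 1 = a single feature, whereas the float 1.0 would mean all features)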
max_features = [0.5,1, 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(5, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'rf__n_estimators': n_estimators,
               'rf__max_features': max_features,
               'rf__max_depth': max_depth,
               'rf__min_samples_split': min_samples_split,
               'rf__min_samples_leaf': min_samples_leaf,
               'rf__bootstrap': bootstrap}
print(random_grid)
## {'rf__n_estimators': [1, 22, 43, 65, 86, 107, 129, 150, 171, 193, 214, 235, 257, 278, 300], 'rf__max_features': [0.5, 1, 'sqrt'], 'rf__max_depth': [5, 15, 26, 36, 47, 57, 68, 78, 89, 99, 110, None], 'rf__min_samples_split': [2, 5, 10], 'rf__min_samples_leaf': [1, 2, 4], 'rf__bootstrap': [True, False]}
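To get a feel for how large this grid is, we can multiply the number of candidate values per hyperparameter. The counts below are read off the printed grid; the variable names are just for illustration.

```python
from math import prod

# Number of candidate values per hyperparameter, taken from the printed grid
grid_sizes = {
    "rf__n_estimators": 15,
    "rf__max_features": 3,
    "rf__max_depth": 12,
    "rf__min_samples_split": 3,
    "rf__min_samples_leaf": 3,
    "rf__bootstrap": 2,
}

total_combinations = prod(grid_sizes.values())
n_iter = 100  # number of combinations a random search would sample

print(total_combinations)                                # 9720
print(f"{n_iter / total_combinations:.1%} of the grid")  # 1.0% of the grid
```

An exhaustive grid search would need to fit every one of these combinations for each cross-validation fold, which is why sampling only a subset is attractive.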
We will fit our model in several steps. First, we set up the pipeline that we used last time. Instead of fitting the pipeline to the data directly, we’ll feed it into a random search with cross-validation to optimize our hyperparameters. The candidate hyperparameter values were defined in our parameter grid above. The random search samples 100 combinations of those values and scores each one with 3-fold cross-validation. Once it’s finished, we’re left with an “optimized” model for predicting. Note the quotation marks: although the search does try to optimize the model, it does so using only the training data. Sometimes, as we’ll see, this does not improve predictive accuracy on the test set.
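Conceptually, a random search with cross-validation boils down to the loop sketched here. This is a simplified illustration on synthetic data with a tiny, made-up parameter space, not how scikit-learn implements `RandomizedSearchCV` internally. It samples a few parameter combinations with `ParameterSampler`, scores each with `cross_val_score`, and keeps the best.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ParameterSampler, cross_val_score

# Synthetic data, for illustration only
X, y = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)

# Small, made-up parameter space
param_space = {
    "n_estimators": [10, 20, 30],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 2, 4],
}

results = []
for params in ParameterSampler(param_space, n_iter=5, random_state=42):
    model = RandomForestRegressor(random_state=0, **params)
    # 3-fold cross-validation; the default score for regressors is R^2
    fold_scores = cross_val_score(model, X, y, cv=3)
    results.append((fold_scores.mean(), params))

# Keep the combination with the highest mean cross-validated score
best_score, best_params = max(results, key=lambda r: r[0])
print(best_params, round(best_score, 3))
```

With its default `refit=True`, `RandomizedSearchCV` additionally refits the best combination on the full training set, which is what `best_estimator_` returns.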
from sklearn.model_selection import RandomizedSearchCV
# random forest model
rf_model = RandomForestRegressor()
pipe = Pipeline(steps = [("preprocessor",preprocessor), ('rf' , rf_model)])
# Random search of parameters, using 3 fold cross validation,
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(
    estimator=pipe,
    param_distributions=random_grid,
    n_iter=100,
    cv=3,
    verbose=2,
    random_state=42,
    n_jobs=-1,
)
# Fit the random search model
rf_random.fit(x_train, y_train)
## RandomizedSearchCV(cv=3, estimator=Pipeline(steps=[('preprocessor', ColumnTransformer(transformers=[('numerical', Pipeline(steps=[('standardize', StandardScaler())]), ['Units Sold', 'Unit Price']), ('categorical', Pipeline(steps=[('ohe', OneHotEncoder(handle_unknown='ignore'))]), ['Product Category', 'Product Name', 'Region', 'Payment Method'])])), ('rf', RandomForestRegressor())]), n_iter=100, n_jobs=-1, param_distributions={'rf__bootstrap': [True, False], 'rf__max_depth': [5, 15, 26, 36, 47, 57, 68, 78, 89, 99, 110, None], 'rf__max_features': [0.5, 1, 'sqrt'], 'rf__min_samples_leaf': [1, 2, 4], 'rf__min_samples_split': [2, 5, 10], 'rf__n_estimators': [1, 22, 43, 65, 86, 107, 129, 150, 171, 193, 214, 235, 257, 278, 300]}, random_state=42, verbose=2)
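After a search has been fitted, it can be useful to look at which combination won and how close the runners-up were. The sketch below does this on synthetic data with a small, made-up parameter space, so that it runs on its own. The attributes `best_params_`, `best_score_`, and `cv_results_` are part of the fitted `RandomizedSearchCV` object, and the same calls should work on `rf_random`.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data and a tiny parameter space, for illustration only
X, y = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions={"n_estimators": [10, 20, 30], "max_depth": [3, 5, None]},
    n_iter=4,
    cv=3,
    random_state=42,
)
search.fit(X, y)

# Winning combination and its mean cross-validated score (R^2 by default)
print(search.best_params_)
print(search.best_score_)

# One row per sampled combination, best first
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
cv_table = pd.DataFrame(search.cv_results_)[columns].sort_values("rank_test_score")
print(cv_table)
```

If the top few rows have nearly identical `mean_test_score` values, the choice among them is largely noise, and a gap between the cross-validated score and the test-set score is one hint of overfitting to the training folds.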
To understand the improvement from the random search, let’s compare the tuned model to a linear regression and a default random forest regressor.
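# function to print and return model evaluation metrics
def evaluate(model, test_features, test_labels):
    predictions = model.predict(test_features)
    # absolute errors, in the units of the target (revenue)
    errors = np.abs(predictions - test_labels)
    # mean absolute percentage error; assumes no zero labels
    mape = 100 * np.mean(errors / test_labels)
    # "accuracy" here is 100 - MAPE, so it goes negative when MAPE exceeds 100%
    accuracy = 100 - mape
    MSE = np.square(errors).mean()
    RMSE = np.sqrt(MSE)
    print('Model Performance')
    print('MSE: {:0.2f} '.format(MSE))
    print('RMSE: {:0.2f} '.format(RMSE))
    # the "degrees" label is kept so the printed output below matches
    print('Average Error: {:0.4f} degrees.'.format(np.mean(errors)))
    print('Accuracy = {:0.2f}%.'.format(accuracy))
    return accuracy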
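To make the metrics in `evaluate` concrete, here is a small worked example on made-up numbers (three true values and three predictions, chosen only for illustration). It shows how a single large relative error on a small true value can push the MAPE above 100%, which makes the "accuracy" (100 minus MAPE) negative.

```python
import numpy as np

# Made-up values, for illustration only
y_true = np.array([100.0, 200.0, 50.0])
y_pred = np.array([110.0, 180.0, 200.0])

errors = np.abs(y_pred - y_true)        # absolute errors: 10, 20, 150
mae = errors.mean()                     # mean absolute error
mape = 100 * np.mean(errors / y_true)   # relative errors: 0.1, 0.1, 3.0
accuracy = 100 - mape                   # negative once MAPE exceeds 100%
rmse = np.sqrt(np.mean(errors ** 2))    # penalizes large errors more heavily

print(f"MAE:      {mae:.2f}")
print(f"MAPE:     {mape:.2f}%")
print(f"Accuracy: {accuracy:.2f}%")
print(f"RMSE:     {rmse:.2f}")
```

Here the third prediction is off by three times the true value, so the MAPE comes out at roughly 106.67% and the accuracy at roughly -6.67%.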
from sklearn.linear_model import LinearRegression
# regression
regression_model = LinearRegression()
regression_pipe = Pipeline(steps = [("preprocessor",preprocessor), ('reg' , regression_model)])
regression_pipe.fit(x_train, y_train)
## Pipeline(steps=[('preprocessor', ColumnTransformer(transformers=[('numerical', Pipeline(steps=[('standardize', StandardScaler())]), ['Units Sold', 'Unit Price']), ('categorical', Pipeline(steps=[('ohe', OneHotEncoder(handle_unknown='ignore'))]), ['Product Category', 'Product Name', 'Region', 'Payment Method'])])), ('reg', LinearRegression())])
# default random forest set up
base_model = RandomForestRegressor( random_state = 42)
base_pipeline = Pipeline(steps = [("preprocessor",preprocessor), ('base' , base_model)])
base_pipeline.fit(x_train, y_train)
## Pipeline(steps=[('preprocessor', ColumnTransformer(transformers=[('numerical', Pipeline(steps=[('standardize', StandardScaler())]), ['Units Sold', 'Unit Price']), ('categorical', Pipeline(steps=[('ohe', OneHotEncoder(handle_unknown='ignore'))]), ['Product Category', 'Product Name', 'Region', 'Payment Method'])])), ('base', RandomForestRegressor(random_state=42))])
# compare models
regression_accuracy = evaluate(regression_pipe, x_test, y_test)
## Model Performance
## MSE: 5554664.09
## RMSE: 2356.83
## Average Error: 1304.7727 degrees.
## Accuracy = -435.77%.
# compare models
base_accuracy = evaluate(base_pipeline, x_test, y_test)
## Model Performance
## MSE: 12096.94
## RMSE: 109.99
## Average Error: 49.9369 degrees.
## Accuracy = 89.67%.
# random grid search model
best_random = rf_random.best_estimator_
random_accuracy = evaluate(best_random, x_test, y_test)
## Model Performance
## MSE: 11889.88
## RMSE: 109.04
## Average Error: 58.5152 degrees.
## Accuracy = 84.39%.
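In the outputs above, the tuned forest has the lower RMSE (109.04 vs. 109.99) but also the lower accuracy (84.39% vs. 89.67%). The two metrics can disagree because RMSE is dominated by large absolute errors, while the MAPE-based accuracy is dominated by large relative errors on small true values. The following sketch reproduces that kind of disagreement on made-up numbers; the two "models" are just hand-written prediction arrays.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100 * np.mean(np.abs(y_pred - y_true) / y_true)

# Made-up values: one small and one large true revenue
y_true = np.array([10.0, 1000.0])

# "Model A" misses the small value by 5; "Model B" misses the large value by 20
pred_a = np.array([15.0, 1000.0])
pred_b = np.array([10.0, 1020.0])

print(f"A: RMSE={rmse(y_true, pred_a):.2f}, MAPE={mape(y_true, pred_a):.2f}%")
print(f"B: RMSE={rmse(y_true, pred_b):.2f}, MAPE={mape(y_true, pred_b):.2f}%")
```

Model A wins on RMSE while Model B wins on MAPE, so which model counts as "better" depends on which kind of error matters more for the application.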
The results are mixed. On accuracy (100 minus the mean absolute percentage error) and on average error, the tuned model is outperformed by the default settings (84.39% vs. 89.67%, and 58.52 vs. 49.94), although its RMSE is marginally lower (109.04 vs. 109.99). It is possible that we’ve over-tuned the model and are overfitting the training data. However, both random forests performed far better than the linear regression model. Thank you for joining me in this brief introduction to hyperparameter tuning with random forests.