In this notebook we’ll explore how to use data pipelines to set up a machine learning model. We’ll work with an Online Sales dataset from Kaggle and examine which factors are most useful for predicting revenue. Here’s the setup.
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
pd.options.display.max_columns = None
OS_df = pd.read_csv("Online Sales Data.csv")
OS_df.head()
## Transaction ID Date Product Category Product Name \
## 0 10001 2024-01-01 Electronics iPhone 14 Pro
## 1 10002 2024-01-02 Home Appliances Dyson V11 Vacuum
## 2 10003 2024-01-03 Clothing Levi's 501 Jeans
## 3 10004 2024-01-04 Books The Da Vinci Code
## 4 10005 2024-01-05 Beauty Products Neutrogena Skincare Set
##
## Units Sold Unit Price Total Revenue Region Payment Method
## 0 2 999.99 1999.98 North America Credit Card
## 1 1 499.99 499.99 Europe PayPal
## 2 3 69.99 209.97 Asia Debit Card
## 3 4 15.99 63.96 North America Credit Card
## 4 1 89.99 89.99 Europe PayPal
Let’s look at what we’re working with. Here are the column types and missing values.
OS_df.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 240 entries, 0 to 239
## Data columns (total 9 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 Transaction ID 240 non-null int64
## 1 Date 240 non-null object
## 2 Product Category 240 non-null object
## 3 Product Name 240 non-null object
## 4 Units Sold 240 non-null int64
## 5 Unit Price 240 non-null float64
## 6 Total Revenue 240 non-null float64
## 7 Region 240 non-null object
## 8 Payment Method 240 non-null object
## dtypes: float64(2), int64(2), object(5)
## memory usage: 17.0+ KB
The dataframe contains eight variables for us to examine, plus a Transaction ID key. Another point to note is that this is a relatively clean dataset: across all 240 rows we find no missing data, which saves us a step in missing-data handling that we can cover in another analysis. For now, let’s visualize the different data types to see if we can learn more.
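As a quick sanity check on that claim, a one-liner along these lines (not shown in the original output) counts the missing values per column; every count should come back as zero for this dataset.
# count missing values per column; info() above suggests these are all zero
OS_df.isna().sum()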
We can explore the data with some visualizations.
Let’s start with the numerical features.
# select only the numeric columns
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numeric_df = OS_df.select_dtypes(include=numerics)
# histogram for each numeric column
num_desc = numeric_df.hist(bins=50, figsize=(20, 15))
plt.show()
Units Sold, Unit Price, and Total Revenue are all right-skewed. Transaction ID is a key, so we won’t continue to examine it as a numeric variable.
Here are some summary statistics.
numeric_df.loc[:,numeric_df.columns != "Transaction ID"].describe()
## Units Sold Unit Price Total Revenue
## count 240.000000 240.000000 240.000000
## mean 2.158333 236.395583 335.699375
## std 1.322454 429.446695 485.804469
## min 1.000000 6.500000 6.500000
## 25% 1.000000 29.500000 62.965000
## 50% 2.000000 89.990000 179.970000
## 75% 3.000000 249.990000 399.225000
## max 10.000000 3899.990000 3899.990000
Average Total Revenue is ~335 dollars. However, we know the distribution is quite right-skewed, and the maximum value of ~3,900 sits far above the mean.
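To put a number on that skew, pandas can compute sample skewness directly. This is a small optional sketch rather than part of the original analysis; values well above zero would confirm the long right tails we saw in the histograms.
# sample skewness of each numeric feature, excluding the Transaction ID key
numeric_df.loc[:, numeric_df.columns != "Transaction ID"].skew()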
Now let’s examine the categorical features.
# select only the string (object-dtype) columns
strings = ['object']
string_df = OS_df.select_dtypes(include=strings)
# count plot for each categorical column
for col in string_df.columns:
    cat_desc = sns.catplot(x=col, data=string_df, kind="count")
    plt.show()
Since this appears to be a curated dataset, the categorical variables are evenly distributed, with the exception of Payment Method. Date is, unsurprisingly, not like the other categorical variables, so we’ll have to transform it later for further insights.
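To make that imbalance explicit, a quick frequency table is enough; this is an optional check, and the exact counts aren’t shown in the original output.
# how many transactions used each payment method
OS_df["Payment Method"].value_counts()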
Our date variable is not very meaningful for us right now. We’ll break it down into multiple features to understand it better.
OS_df["Date"] = pd.to_datetime(OS_df.Date)
OS_df["year"] = OS_df.Date.dt.year
OS_df["month"] = OS_df.Date.dt.month
OS_df["day"] = OS_df.Date.dt.day
OS_df["dow"] = OS_df.Date.dt.dayofweek
OS_df["quarter"] = OS_df.Date.dt.quarter
OS_df["weekday"] = OS_df.Date.dt.weekday
for col in ['year','month','day','dow','quarter','weekday']:
OS_df[col] = OS_df[col].astype('category')
Before we examine relationships with our target of interest, Total Revenue, we need to prepare the train-test split. Any insights we gather from this point on should be based on the training data alone, so that we don’t accidentally leak information about the test set. We’ll also stratify on Payment Method because, as we saw earlier, it was the only variable with an unequal distribution of cases across its categories.
from sklearn.model_selection import train_test_split
# features
OS_data = OS_df.drop(["Total Revenue", "Transaction ID"], axis = 1)
# label
OS_labels = OS_df["Total Revenue"]
# split into train/test
x_train, x_test, y_train, y_test = train_test_split(
    OS_data, OS_labels, test_size=0.2, random_state=42, stratify=OS_data["Payment Method"]
)
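If we want to confirm that the stratification behaved as intended, comparing the Payment Method proportions in the two splits is a small optional sketch; the proportions should be nearly identical, though the exact values depend on the data.
# class proportions of Payment Method should match closely between train and test
print(x_train["Payment Method"].value_counts(normalize=True))
print(x_test["Payment Method"].value_counts(normalize=True))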
Let’s view some preliminary relationships.
from pandas.plotting import scatter_matrix
# combine x_train y_train for visualization
train_df = pd.concat([x_train, y_train], axis=1)
explore_num = [col for col in train_df.columns if train_df[col].dtype in ["int64", "float64"]]
# plot
numeric_df_correlation = train_df[explore_num]
num_plot = scatter_matrix(numeric_df_correlation, figsize=(12, 8))
plt.show()
Observations to consider: It appears that Unit Price and Total Revenue have a linear relationship. This may be a promising relationship to think about.
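To back the visual impression with a number, a correlation matrix over the same training-set columns is a one-line check (an optional sketch; the exact coefficients aren’t reported in the original notebook).
# Pearson correlations among the training-set numeric features and the target
numeric_df_correlation.corr()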
Now that we’ve covered all the features, we’re ready to prepare the dataset for modeling. We’ll set up a pipeline to handle the necessary transformations. First, we programmatically collect our column names by feature type.
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import FunctionTransformer
from sklearn.ensemble import RandomForestRegressor
# name categorical features
categorical_columns = [col for col in x_train.columns if x_train[col].dtype == "object"]
# name numerical features
numerical_columns = [col for col in x_train.columns if x_train[col].dtype in ["int64", "float64"]]
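One caveat worth flagging: the date features we engineered earlier were cast to the 'category' dtype, and Date itself is now datetime64, so neither is caught by the object/numeric filters above, and ColumnTransformer drops unlisted columns by default. They therefore won’t reach the model (the fitted pipeline printed further down confirms this). If we wanted to pull them in, one option, sketched here but not used in this run, is the make_column_selector helper we already imported.
# hypothetical alternative: select columns by dtype at fit time instead of by name
# (not applied below, so the date features remain excluded from this notebook's results)
categorical_selector = make_column_selector(dtype_include=["object", "category"])
numerical_selector = make_column_selector(dtype_include=np.number)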
Each feature type requires separate transformation handling. We’ll standardize our numerical variables and use one-hot encoding on our categorical variables.
# numerical transformer
numerical_transformer = Pipeline(steps=[
    ("standardize", StandardScaler())
])
# categorical transformer
categorical_transformer = Pipeline(steps=[
    ("ohe", OneHotEncoder(handle_unknown="ignore"))
])
Once our individual transformation pipelines are set up, we use ColumnTransformer to tell the pipeline which columns receive which transformations.
# bundle everything in ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ("numerical", numerical_transformer, numerical_columns),
        ("categorical", categorical_transformer, categorical_columns),
    ]
)
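Before bolting a model on, it can be reassuring to fit the preprocessor on its own and look at the shape of the transformed training matrix. This is an optional check; the exact number of output columns depends on how many distinct product names the one-hot encoder expands, so the sketch below simply prints whatever it finds.
# optional check: how wide is the design matrix after scaling and one-hot encoding?
print(preprocessor.fit_transform(x_train).shape)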
Now that we’ve set up the transformation pipeline, we can add a model for fitting and prediction. Here we’ve chosen a random forest for its ease of use and good out-of-the-box performance.
# random forest model
rf_model = RandomForestRegressor()
# pipeline
pipeline = Pipeline(steps=[("preprocessor", preprocessor), ("rfmodel", rf_model)])
To fit the model, we pass the training data through the pipeline; we’ll make predictions on the test data afterwards.
# fit
pipeline.fit(x_train, y_train)
## Pipeline(steps=[('preprocessor', ColumnTransformer(transformers=[('numerical', Pipeline(steps=[('standardize', StandardScaler())]), ['Units Sold', 'Unit Price']), ('categorical', Pipeline(steps=[('ohe', OneHotEncoder(handle_unknown='ignore'))]), ['Product Category', 'Product Name', 'Region', 'Payment Method'])])), ('rfmodel', RandomForestRegressor())])
Now we evaluate the predictions against the test labels using RMSE, the mean absolute error, and an accuracy score defined as 100 minus the mean absolute percentage error (MAPE).
# predict
preds = pipeline.predict(x_test)
# evaluate performance
from sklearn.metrics import mean_squared_error
# RMSE
RMSE = mean_squared_error(y_test, preds, squared=False)
# mean absolute error, MAPE, and a MAPE-based accuracy score
errors = abs(preds - y_test)
mape = 100 * np.mean(errors / y_test)
accuracy = 100 - mape
print('RMSE: {:0.4f} '.format(np.mean(RMSE)))
## RMSE: 332.7132
print('Average Error: {:0.4f} dollars.'.format(np.mean(errors)))
## Average Error: 103.6254 dollars.
print('Accuracy = {:0.2f}%.'.format(accuracy))
## Accuracy = 87.43%.
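As a cross-check on the hand-rolled numbers above, recent versions of scikit-learn ship equivalent metrics; note that mean_absolute_percentage_error returns a fraction rather than a percentage, so it needs multiplying by 100 to compare with our MAPE. This is an optional sketch, not part of the original run.
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
# these should agree with the manual calculations above
print('MAE: ', mean_absolute_error(y_test, preds))
print('MAPE:', 100 * mean_absolute_percentage_error(y_test, preds))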
Although the accuracy score of ~87% looks reasonable, the RMSE (~333) is nearly as large as the mean Total Revenue (~336), and the average error of ~104 dollars is a substantial fraction of that mean.
While our model appears to be a poor fit, this was quite a naive run-through of building a machine learning model. I will tackle data exploration and model building in future notebooks. Please reach out if you have any questions or feedback. Thank you for joining me in this pipeline exercise.