naive-pipeline.knit

1 Python pipelines

In this notebook we’ll explore how to use data pipelines to set up a machine learning model. We’ll work with an Online Sales dataset from Kaggle and examine which factors best predict total revenue. Here’s the setup.

import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

pd.options.display.max_columns = None
OS_df = pd.read_csv("Online Sales Data.csv")
OS_df.head()
##    Transaction ID        Date Product Category             Product Name  \
## 0           10001  2024-01-01      Electronics            iPhone 14 Pro   
## 1           10002  2024-01-02  Home Appliances         Dyson V11 Vacuum   
## 2           10003  2024-01-03         Clothing         Levi's 501 Jeans   
## 3           10004  2024-01-04            Books        The Da Vinci Code   
## 4           10005  2024-01-05  Beauty Products  Neutrogena Skincare Set   
## 
##    Units Sold  Unit Price  Total Revenue         Region Payment Method  
## 0           2      999.99        1999.98  North America    Credit Card  
## 1           1      499.99         499.99         Europe         PayPal  
## 2           3       69.99         209.97           Asia     Debit Card  
## 3           4       15.99          63.96  North America    Credit Card  
## 4           1       89.99          89.99         Europe         PayPal

Let’s look at what we’re working with. Here are the column types and missing values.

OS_df.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 240 entries, 0 to 239
## Data columns (total 9 columns):
##  #   Column            Non-Null Count  Dtype  
## ---  ------            --------------  -----  
##  0   Transaction ID    240 non-null    int64  
##  1   Date              240 non-null    object 
##  2   Product Category  240 non-null    object 
##  3   Product Name      240 non-null    object 
##  4   Units Sold        240 non-null    int64  
##  5   Unit Price        240 non-null    float64
##  6   Total Revenue     240 non-null    float64
##  7   Region            240 non-null    object 
##  8   Payment Method    240 non-null    object 
## dtypes: float64(2), int64(2), object(5)
## memory usage: 17.0+ KB

The dataframe contains nine columns: the Transaction ID key plus eight variables for us to examine. Another point to note is that this is a relatively clean dataset: across all columns and 240 rows, we find no missing data. That saves us the step of handling missing values, which we can cover in another analysis. For now, let’s visualize the different data types to see what we can learn.
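
As a quick check on that claim, one can count the missing values per column directly; a minimal sketch (output omitted, but every count should be zero for this dataset):

# count missing values in each column
OS_df.isna().sum()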

2 Descriptives

We can explore the data with some visualizations.

2.1 Numeric

Let’s start with the numerical features.

# keep only the numeric columns
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numeric_df = OS_df.select_dtypes(include=numerics)

# histogram of each numeric column
num_desc = numeric_df.hist(bins=50, figsize=(20, 15))
plt.show()

Units Sold, Unit Price, and Total Revenue are all right-skewed. Transaction ID is a key, so we won’t continue to examine it as a numeric variable.

Here are some summary statistics.

numeric_df.loc[:,numeric_df.columns != "Transaction ID"].describe()
##        Units Sold   Unit Price  Total Revenue
## count  240.000000   240.000000     240.000000
## mean     2.158333   236.395583     335.699375
## std      1.322454   429.446695     485.804469
## min      1.000000     6.500000       6.500000
## 25%      1.000000    29.500000      62.965000
## 50%      2.000000    89.990000     179.970000
## 75%      3.000000   249.990000     399.225000
## max     10.000000  3899.990000    3899.990000

Average Total Revenue is ~336 dollars. However, the distribution is heavily right-skewed, with a maximum of ~3,900 that sits far above the mean.
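
To put a number on that skew, pandas can report the sample skewness of each numeric column; a small sketch (values not shown here):

# positive skewness confirms the long right tails seen in the histograms
numeric_df.loc[:, numeric_df.columns != "Transaction ID"].skew()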

2.2 Categorical

Now let’s examine the categorical features.

# keep only the string (object) columns
strings = ['object']
string_df = OS_df.select_dtypes(include=strings)

# count plot for each categorical column
for col in string_df.columns:
  cat_desc = sns.catplot(x=col, data=string_df, kind="count")
  plt.show()

Since this appears to be a curated dataset, the categorical variables are roughly uniformly distributed, with the exception of Payment Method. Date is, unsurprisingly, not like the other categorical variables, so we’ll transform it shortly to get more insight out of it.
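
To confirm the uneven split for Payment Method, a quick frequency table is enough; a sketch (counts not shown here):

# relative frequency of each payment method
OS_df["Payment Method"].value_counts(normalize=True)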

2.3 Date

Our date variable is not very meaningful for us right now. We’ll break it down into multiple features to understand it better.

OS_df["Date"] = pd.to_datetime(OS_df.Date)
OS_df["year"] = OS_df.Date.dt.year
OS_df["month"] = OS_df.Date.dt.month
OS_df["day"] = OS_df.Date.dt.day
OS_df["dow"] = OS_df.Date.dt.dayofweek
OS_df["quarter"] = OS_df.Date.dt.quarter
OS_df["weekday"] = OS_df.Date.dt.weekday

for col in ['year','month','day','dow','quarter','weekday']:
    OS_df[col] = OS_df[col].astype('category')
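
A quick peek at the engineered features next to the original Date column can confirm the breakdown worked as intended; a sketch (output not shown here):

# preview the date-derived columns
OS_df[["Date", "year", "month", "day", "dow", "quarter", "weekday"]].head()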

2.4 Train/test split

To move on and examine relationships with our target of interest, Total Revenue, we must perform the train/test split first. This ensures that any insights we gather reflect only the training data and that we don’t accidentally leak information from the test set. We’ll also stratify by Payment Method because, as we saw earlier, it was the only categorical variable with an uneven distribution of cases across categories.

from sklearn.model_selection import train_test_split
# features
OS_data = OS_df.drop(["Total Revenue", "Transaction ID"], axis=1)
# label
OS_labels = OS_df["Total Revenue"]
# split into train/test
x_train, x_test, y_train, y_test = train_test_split(
    OS_data, OS_labels, test_size=0.2, random_state=42, stratify=OS_data["Payment Method"]
)
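
To verify the stratification, one could compare the Payment Method proportions in the two splits; a sketch (output not shown here, but the proportions should be nearly identical):

# compare payment method proportions across the splits
print(x_train["Payment Method"].value_counts(normalize=True))
print(x_test["Payment Method"].value_counts(normalize=True))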

3 Exploratory analysis

Let’s view some preliminary relationships.

from pandas.plotting import scatter_matrix
# combine x_train y_train for visualization
train_df = pd.concat([x_train, y_train], axis=1)
explore_num = [col for col in train_df.columns if train_df[col].dtype in ["int64", "float64"]]

# plot
numeric_df_correlation = train_df[explore_num]
num_plot = scatter_matrix(numeric_df_correlation, figsize=(12, 8))
plt.show()

Observations to consider: Unit Price and Total Revenue appear to have a roughly linear relationship, which makes Unit Price a promising predictor to keep in mind.
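
The visual impression can be checked with correlation coefficients; a sketch on the same training-set columns (numbers not shown here):

# Pearson correlation of each numeric column with the target
numeric_df_correlation.corr()["Total Revenue"].sort_values(ascending=False)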

4 Modeling

Now that we’ve covered all the features, we’re ready to prepare the dataset for modeling. We’ll set up a pipeline to handle the necessary transformations. First, we programmatically collect the column names by feature type.

4.1 Pipeline

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import FunctionTransformer
from sklearn.ensemble import RandomForestRegressor


# name categorical features
categorical_columns = [col for col in x_train.columns if x_train[col].dtype == "object"]

# name numerical features
numerical_columns = [col for col in x_train.columns if x_train[col].dtype in ["int64", "float64"]]

Each feature type requires its own transformation: we’ll standardize the numerical variables and one-hot encode the categorical variables. Note that neither selector above picks up the datetime Date column or the engineered date features (now category dtype), so the ColumnTransformer below will silently drop them (its default remainder is "drop"); the fitted pipeline printout later confirms this.

# numerical transformer
numerical_transformer = Pipeline(steps=[
    ("standardize", StandardScaler())
])
# categorical transformer
categorical_transformer = Pipeline(steps=[
    ("ohe", OneHotEncoder(handle_unknown="ignore"))
])

With the individual transformation pipelines in place, we use ColumnTransformer to tell the overall pipeline which transformation applies to which columns.

# bundle everything in ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ("numerical", numerical_transformer, numerical_columns),
        ("categorical", categorical_transformer, categorical_columns),
    ]
)
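
As an aside, the make_column_selector helper imported above offers a more concise way to route columns by dtype. A hedged alternative sketch (preprocessor_alt is an illustrative name and is not used further); note that, like the explicit lists, it still excludes the category-typed date features:

# equivalent ColumnTransformer using dtype-based selectors instead of explicit column lists
preprocessor_alt = ColumnTransformer(
    transformers=[
        ("numerical", numerical_transformer, make_column_selector(dtype_include=np.number)),
        ("categorical", categorical_transformer, make_column_selector(dtype_include=object)),
    ]
)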

Now that the transformation pipeline is set up, we can add a model for fitting and prediction. Here we’ve chosen a Random Forest for its ease of use and good out-of-the-box performance.

# random forest model
rf_model = RandomForestRegressor()

# pipeline
pipeline = Pipeline(steps=[("preprocessor", preprocessor), ("rfmodel", rf_model)])
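
One benefit of bundling preprocessing and model together is that the whole pipeline can be cross-validated without leaking statistics across folds. A minimal sketch on the training data (cv_rmse is an illustrative name; scores not shown here):

from sklearn.model_selection import cross_val_score

# 5-fold CV; sklearn returns negative RMSE for this scorer, so flip the sign
cv_rmse = -cross_val_score(pipeline, x_train, y_train,
                           scoring="neg_root_mean_squared_error", cv=5)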

4.2 Random Forest Model

To fit the model, we pass the training data through the pipeline; the test data is held back for prediction and evaluation.

# fit
pipeline.fit(x_train, y_train)
## Pipeline(steps=[('preprocessor',
##                  ColumnTransformer(transformers=[('numerical',
##                                                   Pipeline(steps=[('standardize',
##                                                                    StandardScaler())]),
##                                                   ['Units Sold', 'Unit Price']),
##                                                  ('categorical',
##                                                   Pipeline(steps=[('ohe',
##                                                                    OneHotEncoder(handle_unknown='ignore'))]),
##                                                   ['Product Category',
##                                                    'Product Name', 'Region',
##                                                    'Payment Method'])])),
##                 ('rfmodel', RandomForestRegressor())])

Now we compare the predictions against the test labels using RMSE, the mean absolute percentage error (MAPE), and accuracy (100 - MAPE).

# predict
preds = pipeline.predict(x_test)

# evaluate performance
from sklearn.metrics import mean_squared_error

# RMSE (squared=False returns the root mean squared error)
rmse = mean_squared_error(y_test, preds, squared=False)
# absolute errors, MAPE, and accuracy as 100 - MAPE
errors = abs(preds - y_test)
mape = 100 * np.mean(errors / y_test)
accuracy = 100 - mape

print('RMSE: {:0.4f}'.format(rmse))
## RMSE: 332.7132
print('Average Error: {:0.4f} dollars.'.format(np.mean(errors)))
## Average Error: 103.6254 dollars.
print('Accuracy = {:0.2f}%.'.format(accuracy))
## Accuracy = 87.43%.

The average absolute error is about 104 dollars and the MAPE works out to roughly 12.6% (100 - 87.43), but the RMSE of ~333 is nearly as large as the mean Total Revenue of ~336, which suggests some transactions are being predicted very poorly.
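
For context, it helps to compare against a trivial baseline that always predicts the mean training revenue; a sketch (baseline_rmse is an illustrative name; numbers not shown here):

# baseline: predict the training-set mean Total Revenue for every test row
baseline_pred = np.full(len(y_test), y_train.mean())
baseline_rmse = mean_squared_error(y_test, baseline_pred, squared=False)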

While our model appears to be a poor fit, this was quite a naive run-through of building a machine learning model. I will tackle data exploration and model building in future notebooks. Please reach out if you have any questions or feedback. Thank you for joining me in this pipeline exercise.