
How to use LightGBM and boosted decision trees to forecast sales

An extensive guide to structuring data to predict future sales using machine learning. It walks through how to use Python to create lagged variables, rolling means, and time-based features; how to perform target encoding and train-test splitting for time-dependent models; and how to build a gradient boosted tree model that predicts next month's sales for a retailer.

Photo by 褚 天成 on Unsplash

The Problem Statement

Most companies are interested in understanding their future performance. Public companies have to provide guidance to investors on where they think their financial performance will land next quarter, or over the next year.

To answer how the company will perform in future time periods, many companies employ analysts to build analytic solutions that predict business performance. This analysis tends to focus on averaging historical performance and extrapolating it to future results. Moving averages and rolling windows have long been the standard practice.

With more recent developments in data science, these models can be significantly improved using more sophisticated techniques such as gradient boosted trees.

In this guide, we will walk through forecasting next month's sales for a Russian retailer using ML. We will predict the sales volume for each item sold at each store.

By structuring the data as a monthly prediction, we can take advantage of the very granular data we have about our stores and products. We will use historical sales data captured at times t-n to predict future sales at time t+1.

image by author

The Data

This dataset is sourced from Kaggle's Predict Future Sales competition and includes several data files that we will have to join together. For those familiar with database modeling, the sales_train file (the training set) can be thought of as the fact table in a star schema, with items, item categories and shops being dimension tables we can join to with a primary key.

The test file is another fact table with similar relationships. The key difference between the sales_train file and the test file is that the sales file is daily while the test file is monthly. In practice we often want to forecast monthly sales so that the output is more digestible for end consumers. This means we will also have to aggregate the daily data to monthly before feeding it into our model.

The dataset has the following files:

  • sales_train.csv – the training set. Daily historical data from January 2013 to October 2015.
  • test.csv – the test set. You need to forecast the sales for these shops and products for November 2015.
  • sample_submission.csv – a sample submission file in the correct format.
  • items.csv – supplemental information about the items/products.
  • item_categories.csv – supplemental information about the item categories.
  • shops.csv – supplemental information about the shops.
## Import lots of libraries to use
import pandas as pd
import numpy as np
from google_trans_new import google_translator
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
%matplotlib inline
from itertools import product
import time
from sklearn.model_selection import KFold
from sklearn import base
import lightgbm as lgb
from lightgbm import LGBMRegressor
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, RobustScaler
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold, learning_curve, KFold, train_test_split
import calendar
from datetime import datetime

Let’s read in the data.

## allows us to pick up the European formatting (day.month.year) of the dates in the trainset
dateparse = lambda x: datetime.strptime(x, '%d.%m.%Y')
# importing the trainset with dates correctly parsed
sales = pd.read_csv('sales_train.csv', parse_dates = ['date'], date_parser = dateparse)
#import the rest of the files
test = pd.read_csv('test.csv')
items = pd.read_csv('items.csv')
item_categories = pd.read_csv('item_categories.csv')
shops = pd.read_csv('shops.csv')

Now that we have read in all the files, we can start analyzing them one by one, beginning with the item categories.

Item Categories file

The file is unfortunately in Russian. To better understand the data, we can translate the individual categories into meaningful data points for non-Russian speakers. To do that we use google_trans_new to translate the category names to English and store the translated values in a column called English_Name.

item_categories.head()
image by author

Not speaking Russian can be a disadvantage, but we can see that there appears to be a pattern where a word is followed by PS2, PS3, PS4 or PSP. That looks a lot like it relates to the various Sony PlayStation platforms. Maybe the first word can help us group these items together. Let's translate the words and see.

# Translate the item_category_name column from Russian to English, then append the result to the original dataframe.
translator = google_translator()  
list_a = []
for word in item_categories['item_category_name']:
    try:
        a = translator.translate(word)
        list_a.append(a)
    except:
        list_a.append(word)
item_categories['English_Name'] = list(list_a)
print(list_a)
image by author

The translator is not perfect, as it missed some terms. We can manually search for the words and replace them with the best translation we can find so that our categories are easier to work with.

## Программы means Programs
item_categories['English_Name']= item_categories['English_Name'].str.replace("Программы", "Programs")
## Книги means Books
item_categories['English_Name']= item_categories['English_Name'].str.replace("Книги", "Books")
item_categories.head()
image by author

Aha! It looks like the first part of each item category name contains the category type. In our example, the category type was Accessories. Let's extract that and store it in a new feature called Category_type.

## Create a feature called Category_type by splitting the English_Name strings where they have either a parenthesis or a dash.
list_a = []
for row in item_categories['English_Name']:
    a = row.replace('(','-').split(' -')[0] ## replace the opening parenthesis with a dash so we can use str.split to split on it
    list_a.append(a)
item_categories['Category_type'] = list(list_a)
## Let's check out the categories we have
pd.DataFrame((item_categories['Category_type'].unique()))

It looks like several of the categories have similar names and meanings. For example, game console and gaming console are virtually the same type of category. Let's clean those up a bit to get more uniformity in this new feature.

## Let's clean up some of this output in the categories:
## Game Consoles are really the same thing as Gaming Consoles
item_categories['Category_type']= item_categories['Category_type'].str.replace("Gaming Consoles", "Game Consoles")
## Payment cards with a lowercase c is the same as Payment Cards with upper case C
item_categories['Category_type']= item_categories['Category_type'].str.replace("Payment cards", "Payment Cards")
## Cinema and movie tend to be synonymous. Let's change the "The Movie" category type to Cinema
item_categories['Category_type']= item_categories['Category_type'].str.replace("The Movie", "Cinema")
## Pure and Clean Media Seem Similar. Let's combine into Pure/Clean Media
item_categories['Category_type']= item_categories['Category_type'].str.replace("Clean media", "Pure/Clean Media")
item_categories['Category_type']= item_categories['Category_type'].str.replace("Pure Media", "Pure/Clean Media")

Since this dataset is on the larger side (for a laptop to handle, anyway), let's drop the columns we are not going to use so the dataframe takes up less memory. We drop item_category_name and English_Name, leaving only item_category_id and Category_type.

item_categories = item_categories.drop(['item_category_name','English_Name'],axis =1)

The Shops

This file contains the names of the shops. It can be used as a key to look up the names of the shop IDs we have in the sales file. Because this file is also in Russian, we will again translate the words into English. Once we have the names in English, we will extract the cities these shops are located in and use that as a feature.

shops.head()
image by author

Despite having learned how to spell accessories in Russian, I am afraid my Russian is still not good enough to read the shop names. Let's translate them to English and take a look at what the words mean.

## Let's translate this into English
translator = google_translator()
list_a = []
for word in shops['shop_name']:
    a = translator.translate(word)
    list_a.append(a)
shops['English_Shop_Name'] = list_a
shops
image by author

Looks like the city is the first word, followed by something like shopping center, TC or SEC. Let's try to extract the city from this. Some googling of the words suggested that all the spots I checked, whether TC or SEC, were in shopping malls, so we did not create a feature out of that part of the shop name.

We will create a City feature containing only the city names (the first word) by splitting the English_Shop_Name strings on spaces.

Because there are some cities like St. Petersburg that have a space in their name, we remove spaces following a period and spaces following an exclamation point.

list_a = []
for row in shops['English_Shop_Name']:
    a = row.replace('. ','').replace('! ','').split(' ')[0] ## remove spaces following a period or exclamation point, then split on spaces; the first word is the city
    list_a.append(a)
shops['City'] = list(list_a)
## Let's check out the cities we have
pd.DataFrame((shops['City'].unique()))
image by author

We drop shop_name and English_Shop_Name since we have already extracted the city information from them.

shops = shops.drop(['shop_name','English_Shop_Name'],axis = 1)
shops.head()
image by author

Aggregating data

Since the task is to make a monthly prediction, we need to aggregate the data to the monthly level before doing any encodings. The following code cell does just that. It also renames the aggregated item_cnt_day variable to target.

index_cols = ['shop_id', 'item_id', 'date_block_num']
# For every month we create a grid of all shop/item combinations that appear in that month
grid = []
for block_num in sales['date_block_num'].unique():
    cur_shops = sales[sales['date_block_num'] == block_num]['shop_id'].unique()
    cur_items = sales[sales['date_block_num'] == block_num]['item_id'].unique()
    grid.append(np.array(list(product(*[cur_shops, cur_items, [block_num]])), dtype='int32'))
# Turn the grid into a pandas dataframe
grid = pd.DataFrame(np.vstack(grid), columns = index_cols, dtype=np.int32)
# Get aggregated monthly sales for each (shop_id, item_id, month) and rename the aggregate to target
gb = sales.groupby(index_cols, as_index=False).agg({'item_cnt_day': 'sum'})
gb.columns = ['shop_id', 'item_id', 'date_block_num', 'target']
# Join the aggregated data to the grid
all_data = pd.merge(grid, gb, how='left', on=index_cols).fillna(0)
# Sort the data
all_data.sort_values(['date_block_num','shop_id','item_id'], inplace=True)

Sometimes we have outliers in our training data. In this specific dataset, the true target values are clipped into the [0, 20] range, so if we see a value larger than 20 we simply call it 20 (and negative values become 0). This has a significant positive impact on our RMSE score, but it is not something that applies to every forecasting model.

all_data['target']=all_data['target'].clip(0,20)

Merge Datasets Together

Next we create one dataframe containing both our train and test sets, which we will join with the item_categories, items and shops data. We will use this dataframe to create most of our features, reducing the need to apply the same logic to multiple dataframes. For example, when we create lagged variables, we need them for the train, validation and test sets; by combining all the data into one dataframe we only have to create them once.

In the real world, we would build a pre-processing pipeline that applies the same feature engineering to the un-labelled data as was used in training. That is a topic for another day.
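
For illustration, a minimal sketch of such a pipeline could look like the following, using scikit-learn's Pipeline and ColumnTransformer. The column names and steps here are hypothetical placeholders, not the exact feature engineering used in this post.

## A minimal, hypothetical pre-processing pipeline sketch (illustrative column names)
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from lightgbm import LGBMRegressor

numeric_features = ['item_price', 'target_lag_1']        # assumed numeric columns
categorical_features = ['shop_id', 'item_category_id']   # assumed categorical columns

preprocess = ColumnTransformer([
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
])

model_pipeline = Pipeline([
    ('preprocess', preprocess),
    ('model', LGBMRegressor(n_estimators=500, learning_rate=0.05)),
])

## model_pipeline.fit(labelled_X, labelled_y)      # hypothetical labelled history
## preds = model_pipeline.predict(unlabelled_X)    # same transformations applied automatically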

We will union the train and test sets together. As we saw from the code right above, the test set is missing two columns: date_block_num and target. For now we will assign the target to be zero. We will also assign the number 34 to date_block_num. The date_block_num corresponds to the month in the dataset, so since we need to predict next month's item counts, we simply take the max of the training set (33) and add one.

## Assign 34 to date_block_num and 0.0 to target
test['date_block_num'] = 34
test['target'] = 0.0
TEST_ID = test['ID'] ## in case we need this later
## Then we need to union them and save that back as our all_data dataframe
all_data = pd.concat([all_data,test], axis =0, sort=True)
all_data = all_data.drop(columns = ['ID'])

Next we merge the all_data dataframe with the items, item_categories and shops dataframes. Since we want to avoid creating duplicate rows, we add a row-count check to make sure we do not add or drop any rows.

## Calculate number of rows prior to merge
prior_rows = all_data.shape[0]
## Merge the sales data with the items, item categories and shops datasets to get the item names, their categories and the shop features
all_data = pd.merge(all_data, items, on = "item_id")
all_data = pd.merge(all_data, item_categories, on = "item_category_id")
all_data = pd.merge(all_data, shops, on = "shop_id")
## Calculate and print the number of rows dropped (should be zero)
print("Dropped {} rows".format(prior_rows - all_data.shape[0]))
image by author

Feature Engineering

Since we have a lot of data that could potentially be predictive, we need to pre-process it into a format that our model can use. This is often referred to as feature engineering.

Handling Dates, Seasons and Days

Dates can tell us a lot about sales. For example, February sales may be lower than January sales simply because February has fewer days than other months. The mix of days also matters: more weekend days may mean more people frequent the stores. Seasons matter as well; June's sales may look different from December's. We will create features related to all of these items.

To get started, we first need to extract the month-end date for each date block and store it in a dataframe.

## Pull out the last date of each date block
from datetime import datetime
list_a = []
for dateblock in sales['date_block_num'].unique():
    a = sales[sales['date_block_num'] == dateblock]
    a = max(a['date'])
    list_a.append(a)

list_a.append(datetime.strptime('2015-11-30','%Y-%m-%d')) ## Manually adding the month-end for the test set
## Transform it to a dataframe so we can merge with all_data
list_a = pd.DataFrame(list_a)
## Give the data a descriptive column header
list_a.columns = ['Month_End_Date']

Now that the month-end dates have been extracted, we can count how many Mondays, Tuesdays, etc. there were in each month.

## Let's calculate how many of each specific weekday there are in each month.
import calendar
## Create the empty lists
mon_list = []
tue_list = []
wed_list = []
thu_list = []
fri_list = []
sat_list = []
sun_list = []
## Calculate the number of a specific weekday in a given month (for example, the number of Mondays in March 2015)
for date in list_a['Month_End_Date']:
    mon_list.append(len([1 for i in calendar.monthcalendar(date.year, date.month) if i[0] != 0]))
    tue_list.append(len([1 for i in calendar.monthcalendar(date.year, date.month) if i[1] != 0]))
    wed_list.append(len([1 for i in calendar.monthcalendar(date.year, date.month) if i[2] != 0]))
    thu_list.append(len([1 for i in calendar.monthcalendar(date.year, date.month) if i[3] != 0]))
    fri_list.append(len([1 for i in calendar.monthcalendar(date.year, date.month) if i[4] != 0]))
    sat_list.append(len([1 for i in calendar.monthcalendar(date.year, date.month) if i[5] != 0]))
    sun_list.append(len([1 for i in calendar.monthcalendar(date.year, date.month) if i[6] != 0]))
## Add these to our list we created with the dates
list_a['Number_of_Mondays'] = mon_list
list_a['Number_of_Tuesdays'] = tue_list
list_a['Number_of_Wednesdays'] = wed_list
list_a['Number_of_Thursdays'] = thu_list
list_a['Number_of_Fridays'] = fri_list
list_a['Number_of_Saturdays'] = sat_list
list_a['Number_of_Sundays'] = sun_list

We can also extract features related to the year, the month and the number of days in the month.

## Create the empty lists
year_list = []
month_list = []
day_list = []
## Next let's strip out the number of days in the month, the month number and the year
for date in list_a['Month_End_Date']:
    year_list.append(date.year)
    month_list.append(date.month)
    day_list.append(date.day)
## Add to our dataframe
list_a['Year'] = year_list
list_a['Month'] = month_list
list_a['Days_in_Month'] = day_list

The list_a dataframe can now be merged back with the all_data dataframe, and with that we have added several date features.

## Merge the new dataframe with the all_data, using the index and the date_block_num as keys
all_data = pd.merge(all_data, list_a, left_on = 'date_block_num', right_index = True)

Price Variables

Our original aggregation only captured the number of items sold and did not do anything with the price of the item. Let's average the monthly price and merge that feature back into our all_data dataframe.

## adding the average monthly price within a monthly block for each item at each store to the dataset
a = sales.groupby(['date_block_num','shop_id','item_id'])['item_price'].mean()
a = pd.DataFrame(a)
all_data = pd.merge(all_data,a,how = "left", left_on = ['date_block_num','shop_id','item_id'], right_on = ['date_block_num','shop_id','item_id'])

Months since item first sold & Months since item was last sold

These features show the number of date blocks (months) since the first time the item was sold and since the last time it was sold. They will help us understand how new the item is and could potentially tell us that the item is no longer being sold.

We will calculate the minimum date block for each item, which gives us the first month it was sold in. Then we take the difference between the current date block and that number to see how "old" the item is.

a = all_data.groupby('item_id')['date_block_num'].min()
a = pd.DataFrame(a)
a = a.reset_index()
a.columns = ['item_id','min_item_sale_date_block_num']
all_data = pd.merge(all_data,a, left_on = 'item_id', right_on = 'item_id')
all_data['Months_Since_Item_First_Sold'] = all_data['date_block_num']- all_data['min_item_sale_date_block_num']

Some of the data in the test set are for products we have never seen before. Let's create a feature that captures the average monthly sales of items in the first month they were sold; for all other rows we will set it to zero.

We will also apply the same logic to item category and shop ID combined. First, we calculate the average sales in the first month by item category:

a = all_data[all_data['Months_Since_Item_First_Sold'] == 0].groupby(['item_category_id','Months_Since_Item_First_Sold'])['target'].mean()
a = pd.DataFrame(a)
a = a.reset_index()
a.columns = ['item_category_id','Months_Since_Item_First_Sold','avg_first_months_sales_by_item_category_id']
all_data = pd.merge(all_data,a, left_on = ['item_category_id','Months_Since_Item_First_Sold'], right_on = ['item_category_id','Months_Since_Item_First_Sold'], how = 'left')
all_data['avg_first_months_sales_by_item_category_id'] = all_data['avg_first_months_sales_by_item_category_id'].fillna(0)

Calculate the average sales in the first month by category and shop ID.

a = all_data[all_data['Months_Since_Item_First_Sold'] == 0].groupby(['item_category_id', 'Months_Since_Item_First_Sold','shop_id'])['target'].mean()
a = pd.DataFrame(a)
a = a.reset_index()
a.columns = ['item_category_id','Months_Since_Item_First_Sold','shop_id','avg_first_months_sales_by_item_category_and_shop']
all_data = pd.merge(all_data,a, left_on = ['item_category_id','Months_Since_Item_First_Sold','shop_id'], right_on = ['item_category_id','Months_Since_Item_First_Sold', 'shop_id'], how = 'left')
all_data['avg_first_months_sales_by_item_category_and_shop'] = all_data['avg_first_months_sales_by_item_category_and_shop'].fillna(0)

Lagged Variables

If I were allowed only one data point to predict next month's sales, I would probably use this month's sales. This month's sales is a lag-1 variable. Lags of the target variable are very common in time series analysis. We will create several lagged variables (last month's sales, the month before that, and so on).

To create lags, the pandas library has a very useful function called shift. We wrap it in a loop to generate lags at several offsets, and we use the built-in fill_value argument to make the missing values zero.

## With the pandas shift function: order each shop/item series by month, then shift the target within the series
all_data = all_data.sort_values(['shop_id', 'item_id', 'date_block_num'])
for each in [1,2,3,4,5,6,12]:
    all_data[str("target_lag_"+str(each))] = all_data.groupby(['shop_id','item_id'])['target'].shift(each, fill_value = 0)

MORE! Lagging is fun. Let's also create features holding the mean target for each month by item, shop, category and city. Lags of these features can then be used in our model.

## Average number of sales by month and by item
all_data['avg_monthly_by_item'] = all_data.groupby(['item_id', 'date_block_num'])['target'].transform('mean')
## Average number of sales by month and by shop
all_data['avg_monthly_by_shop'] = all_data.groupby(['shop_id', 'date_block_num'])['target'].transform('mean')
## Average number of sales by month and by category
all_data['avg_monthly_by_category'] = all_data.groupby(['Category_type', 'date_block_num'])['target'].transform('mean')
## Average number of sales by month and by city
all_data['avg_monthly_by_city'] = all_data.groupby(['City', 'date_block_num'])['target'].transform('mean')

Moving Averages

Another way to add some more features to our data is to create rolling or moving averages. Rolling averages can be very predictive and help ascertain historical levels for the target. Let’s create two rolling averages, 3 and 6 months.

## 3-months rolling average
all_data['target_3_month_avg'] = (all_data['target_lag_1'] + all_data['target_lag_2'] +all_data['target_lag_3']) /3
## 6-months rolling average
all_data['target_6_month_avg'] = (all_data['target_lag_1'] + all_data['target_lag_2'] +all_data['target_lag_3'] + all_data['target_lag_4'] + all_data['target_lag_5'] +all_data['target_lag_6']) /6

Notice how we explicitly calculated these averages. A similar result can also be obtained with the pandas rolling and mean functions.

## The same idea with pandas rolling/mean, computed over last month's sales within each shop/item series
## (all_data is already sorted by shop_id, item_id and date_block_num from the lag step above)
## 3-month rolling average
all_data['target_3_month_avg'] = all_data.groupby(['shop_id','item_id'])['target_lag_1'].transform(lambda s: s.rolling(3, min_periods=1).mean())
## 6-month rolling average
all_data['target_6_month_avg'] = all_data.groupby(['shop_id','item_id'])['target_lag_1'].transform(lambda s: s.rolling(6, min_periods=1).mean())

Improving Memory Usage

Wow, that is a lot of features. Since we are only working with a laptop, we should make sure we store them in as small a dataframe as possible. Let's check the memory usage and the datatypes of our dataframe.

all_data.info(memory_usage = "deep")

At the bottom of the info() output, you get a very nice view of the number of columns of each data type along with their total memory usage.

Since a lot of these are stored as int64 or float64, we can probably reduce them to smaller datatypes like int16 or float32. Downcasting means we reduce each feature to the smallest datatype that can still hold its values.

for column in all_data:
    if all_data[column].dtype == 'float64':
        all_data[column] = pd.to_numeric(all_data[column], downcast='float')
    if all_data[column].dtype == 'int64':
        all_data[column] = pd.to_numeric(all_data[column], downcast='integer')
## Dropping item_name to free up memory
all_data = all_data.drop('item_name', axis =1)
## Let's check the size again
all_data.info(memory_usage = "deep")
image by author

Using downcasting, we were able to cut the size of the dataset roughly in half.

Train Test Splitting

In more traditional ML settings, we would randomly assign observations to train, test and validation sets. In forecasting, we need to consider the effect of time on our dataset and structure our splits accordingly. Now that our dataset has been downcast, we can split the data into training (months 0 to 32), validation (month 33) and the test set (month 34).

X_train = all_data[all_data.date_block_num < 33]
Y_train = all_data[all_data.date_block_num < 33]['target']
X_valid = all_data[all_data.date_block_num == 33]
Y_valid = all_data[all_data.date_block_num == 33]['target']
X_test = all_data[all_data.date_block_num == 34]

More Feature Engineering – Target Encoding

Why do we target encode?

Gradient boosted tree-based models such as XGBoost and LightGBM have a hard time handling high-cardinality categorical variables. Target encoding transforms a categorical variable into a numeric one by replacing each string or text value with the average outcome for that category. For example, if the average sales volume for 'PS3' is 300 and for 'PS2' is 200, we would replace the 'PS3' string with 300 and 'PS2' with 200. Intuitively, the model can now learn that we should expect PS3 sales volume to be higher than PS2 volume. This type of feature engineering can help improve model performance.
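
As a tiny illustration of the idea on a made-up toy dataframe (a naive, unregularized version, not the encoder we will actually use below):

## Toy example of naive target (mean) encoding: for illustration only
toy = pd.DataFrame({
    'platform': ['PS3', 'PS3', 'PS2', 'PS2', 'PS3'],
    'sales':    [320,   280,   210,   190,   300]})
## Replace each category with the average target observed for that category
toy['platform_encoded'] = toy.groupby('platform')['sales'].transform('mean')
print(toy)  ## PS3 rows become 300, PS2 rows become 200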

Why do we need to regularize?

Simply calculating the averages of the target variable can cause overfitting and often reduces the model's ability to generalize to new data, so we need to regularize the encoding.

Regularization Techniques:

  • Cross-validation loop inside the training data
  • Smoothing
  • Adding random noise
  • Sorting and calculating an expanding mean (sketched right after this list)
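
For reference, the expanding-mean approach from the last bullet could be sketched roughly as follows; it is not used in this post and assumes the rows are processed in time order:

## Rough sketch of expanding-mean encoding for item_id (not used below)
tmp = X_train.sort_values('date_block_num')
cumsum = tmp.groupby('item_id')['target'].cumsum() - tmp['target']
cumcnt = tmp.groupby('item_id').cumcount()
## Each row gets the mean target of all earlier rows of the same item, so it never sees its own target value
item_id_expanding_mean = (cumsum / cumcnt).fillna(tmp['target'].mean())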

We will only use the cross-validation loop inside our training data. To get started, we will define two helper functions that I picked up from https://medium.com/@pouryaayria/k-fold-target-encoding-dfe9a594874b

## Helper function for KFold mean encoding
class KFoldTargetEncoderTrain(base.BaseEstimator,
                               base.TransformerMixin):
    def __init__(self,colnames,targetName,
                  n_fold=5, verbosity=True,
                  discardOriginal_col=False):
        self.colnames = colnames
        self.targetName = targetName
        self.n_fold = n_fold
        self.verbosity = verbosity
        self.discardOriginal_col = discardOriginal_col
    def fit(self, X, y=None):
        return self
    def transform(self,X):
        assert(type(self.targetName) == str)
        assert(type(self.colnames) == str)
        assert(self.colnames in X.columns)
        assert(self.targetName in X.columns)
        mean_of_target = X[self.targetName].mean()
        ## shuffle is False, so no random_state is needed (newer scikit-learn versions reject the combination)
        kf = KFold(n_splits = self.n_fold, shuffle = False)
        col_mean_name = self.colnames + '_' + 'Kfold_Target_Enc'
        X[col_mean_name] = np.nan
        for tr_ind, val_ind in kf.split(X):
            X_tr, X_val = X.iloc[tr_ind], X.iloc[val_ind]
            X.loc[X.index[val_ind], col_mean_name] = X_val[self.colnames].map(X_tr.groupby(self.colnames)[self.targetName].mean())
            X[col_mean_name].fillna(mean_of_target, inplace = True)
        if self.verbosity:
            encoded_feature = X[col_mean_name].values
            print('Correlation between the new feature, {} and, {} is {}.'.format(col_mean_name,self.targetName,                    
                   np.corrcoef(X[self.targetName].values,
                               encoded_feature)[0][1]))
        if self.discardOriginal_col:
            X = X.drop(self.targetName, axis=1)
        return X
## Helper function to apply the KFold mean encoding to the validation and test sets
class KFoldTargetEncoderTest(base.BaseEstimator, base.TransformerMixin):

    def __init__(self,train,colNames,encodedName):

        self.train = train
        self.colNames = colNames
        self.encodedName = encodedName

    def fit(self, X, y=None):
        return self
    def transform(self,X):
        mean =  self.train[[self.colNames,
                self.encodedName]].groupby(
                                self.colNames).mean().reset_index() 

        dd = {}
        for index, row in mean.iterrows():
            dd[row[self.colNames]] = row[self.encodedName]
        X[self.encodedName] = X[self.colNames]
        X = X.replace({self.encodedName: dd})
        return X

Now that we have both of our helper functions defined, let's use them to mean-encode our variables:

  • item_id
  • shop_id
  • City
  • Category_type
  • item_category_id
## item_id mean encoding
targetc = KFoldTargetEncoderTrain('item_id','target',n_fold=5)
X_train = targetc.fit_transform(X_train)
## shop_id mean encoding
targetc = KFoldTargetEncoderTrain('shop_id','target',n_fold=5)
X_train = targetc.fit_transform(X_train)
## City mean encoding
targetc = KFoldTargetEncoderTrain('City','target',n_fold=5)
X_train = targetc.fit_transform(X_train)
## Category_type mean encoding
targetc = KFoldTargetEncoderTrain('Category_type','target',n_fold=5)
X_train = targetc.fit_transform(X_train)
## Item_category_id mean encoding
targetc = KFoldTargetEncoderTrain('item_category_id','target',n_fold=5)
X_train = targetc.fit_transform(X_train)
image by author

Apply similar transformations to the validation and test sets.

## Transform validation & test set
## Apply item id mean encoding to test set
test_targetc = KFoldTargetEncoderTest(X_train,'item_id','item_id_Kfold_Target_Enc')
X_valid = test_targetc.fit_transform(X_valid)
X_test = test_targetc.fit_transform(X_test)
## Apply shop id mean encoding to test set
test_targetc = KFoldTargetEncoderTest(X_train,'shop_id','shop_id_Kfold_Target_Enc')
X_valid = test_targetc.fit_transform(X_valid)
X_test = test_targetc.fit_transform(X_test)
## Apply city mean encoding to test set
test_targetc = KFoldTargetEncoderTest(X_train,'City','City_Kfold_Target_Enc')
X_valid = test_targetc.fit_transform(X_valid)
X_test = test_targetc.fit_transform(X_test)
## Apply Category_type mean encoding to test set
test_targetc = KFoldTargetEncoderTest(X_train,'Category_type','Category_type_Kfold_Target_Enc')
X_valid = test_targetc.fit_transform(X_valid)
X_test = test_targetc.fit_transform(X_test)
## Apply item_category_id mean encoding to test set
test_targetc = KFoldTargetEncoderTest(X_train,'item_category_id','item_category_id_Kfold_Target_Enc')
X_valid = test_targetc.fit_transform(X_valid)
X_test = test_targetc.fit_transform(X_test)

Final Dataset

We are getting close; our features are done. Let's do a couple of checks to make sure we only keep the features we will use.

## drop first 12 months since we have lagged variables
X_train = X_train[X_train.date_block_num > 12]
## Assign the target variables to separate variables
y= X_train['target']
Y_valid = X_valid['target']
## Drop the categorical variables that we mean encoded, plus the target and the month-end date.
columns_to_drop = ['target', 'Category_type','City','Month_End_Date', 'item_category_id']
X_train= X_train.drop(columns_to_drop, axis = 1)
X_valid = X_valid.drop(columns_to_drop, axis = 1)
X_test = X_test.drop(columns_to_drop, axis = 1)

Modeling with LightGBM

LightGBM is a gradient boosting framework that uses tree based learning algorithms. It is designed to be distributed and efficient with the following advantages:

  • Faster training speed and higher efficiency.
  • Lower memory usage.
  • Better accuracy.
  • Support of parallel and GPU learning.
  • Capable of handling large-scale data.

LightGBM is really good at handling datasets larger than 100K records, and does so relatively fast compared to XGBoost.

We need to transform train and validation into lgb dataset structures required for modeling.

lgb_train = lgb.Dataset(X_train, y)
lgb_eval = lgb.Dataset(X_valid, Y_valid, reference=lgb_train)

As with most boosted models, we will need to tune our hyperparameters. These are the ones I had the most success with, but that does not mean they are the "ultimate" settings. Generally speaking, I tend to follow the guidance provided by the documentation when I tune parameters.

LightGBM uses the leaf-wise tree growth algorithm, while many other popular tools use depth-wise tree growth. Compared with depth-wise growth, the leaf-wise algorithm can converge much faster. However, the leaf-wise growth may be over-fitting if not used with the appropriate parameters.

To get good results using a leaf-wise tree, these are some important parameters:

num_leaves. This is the main parameter to control the complexity of the tree model. Theoretically, we can set num_leaves = 2^(max_depth) to obtain the same number of leaves as depth-wise tree. However, this simple conversion is not good in practice. The reason is that a leaf-wise tree is typically much deeper than a depth-wise tree for a fixed number of leaves. Unconstrained depth can induce over-fitting. Thus, when trying to tune the num_leaves, we should let it be smaller than 2^(max_depth). For example, when the max_depth=7 the depth-wise tree can get good accuracy, but setting num_leaves to 127 may cause over-fitting, and setting it to 70 or 80 may get better accuracy than depth-wise.

min_data_in_leaf. This is a very important parameter to prevent over-fitting in a leaf-wise tree. Its optimal value depends on the number of training samples and num_leaves. Setting it to a large value can avoid growing too deep a tree, but may cause under-fitting. In practice, setting it to hundreds or thousands is enough for a large dataset.

max_depth. You can also use max_depth to limit the tree depth explicitly.
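
To make that guidance concrete, a hypothetical configuration following it could look like the snippet below. These values are illustrative only, derived from the documentation's advice above, and are not the parameters we train with next.

## Illustrative leaf-wise settings following the guidance above (not the tuned values used below)
leafwise_params = {
    'max_depth': 7,            # limit tree depth explicitly
    'num_leaves': 70,          # well below 2**7 = 128 to curb complexity
    'min_data_in_leaf': 500,   # hundreds to thousands for a large dataset
}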

# specify the configurations as a dict
params = {
 'boosting_type': 'gbdt',
 'objective': 'regression',
 'metric': 'rmse',
 'num_leaves': 31,
 'learning_rate': 0.05,
 'feature_fraction': 0.9,
 'bagging_fraction': 0.8,
 'bagging_freq': 5,
 'verbose': 0,
 'num_threads' : 4
}
print('Starting training...')
# train
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10000,
                valid_sets=lgb_eval,
                callbacks=[lgb.early_stopping(stopping_rounds=100)])
print('Saving model...')
# save model to file
gbm.save_model('model.txt')
print('Starting predicting...')
# predict
y_pred = gbm.predict(X_valid, num_iteration=gbm.best_iteration)
# eval
print('The rmse of prediction is:', mean_squared_error(Y_valid, y_pred) ** 0.5)

Let’s check out the feature importance plot.

num_features = 50
## argsort is ascending, so take the last num_features indices to get the most important features
indxs = np.argsort(gbm.feature_importance())[-num_features:]

feature_imp = pd.DataFrame(sorted(zip(gbm.feature_importance()[indxs],X_train.columns[indxs])), columns=['Value','Feature'])
plt.figure(figsize=(20, 20))
sns.barplot(x="Value", y="Feature", data=feature_imp.sort_values(by="Value", ascending=False))
plt.title('Top {} LightGBM features'.format(num_features))
plt.tight_layout()
plt.show()
image by author

That’s it, we have successfully trained a LightGBM model to forecast next month’s sales.
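
To turn the model into the actual forecast for November 2015, we would predict on X_test and write the results in the sample_submission.csv format (ID and item_cnt_month). That step is not shown above, so the snippet below is only a sketch; it maps the predictions back to the original test IDs via the shop and item keys.

## Sketch: forecast month 34 and build a submission file (illustrative, not shown in the original post)
test_preds = gbm.predict(X_test, num_iteration=gbm.best_iteration)
forecast = X_test[['shop_id', 'item_id']].copy()
forecast['item_cnt_month'] = np.clip(test_preds, 0, 20)  ## same [0, 20] clipping as the training target
## Join the predictions back onto the original test file so the IDs line up
submission = test[['ID', 'shop_id', 'item_id']].merge(forecast, on=['shop_id', 'item_id'], how='left')
submission['item_cnt_month'] = submission['item_cnt_month'].fillna(0)
submission[['ID', 'item_cnt_month']].to_csv('submission.csv', index=False)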

If you found this useful, please applaud the story.

