Machine Learning for Retail Price Recommendation with Python

Susan Li
Towards Data Science
6 min read · Jul 23, 2018


Photo credit: Pexels

Mercari, Japan’s biggest community-powered shopping app, knows one problem deeply: they would like to offer pricing suggestions to sellers, but this is tough because their sellers can put just about anything, or any bundle of things, on Mercari’s marketplace.

In this machine learning project, we will build a model that automatically suggests the right product prices. We are provided with the following information:

train_id — the id of the listing

name — the title of the listing

item_condition_id — the condition of the items provided by the sellers

category_name — category of the listing

brand_name — the name of the brand

price — the price that the item was sold for. This is the target variable that we will predict

shipping — 1 if the shipping fee is paid by the seller, 0 if paid by the buyer

item_description — the full description of the item

EDA

The data set can be downloaded from Kaggle. To be able to validate the results, I only need train.tsv. Let’s get started!

import gc
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error
import lightgbm as lgb
df = pd.read_csv('train.tsv', sep = '\t')

Randomly split the data into train and test sets. We will use the training set only for EDA.

msk = np.random.rand(len(df)) < 0.8
train = df[msk]
test = df[~msk]
train.shape, test.shape

((1185866, 8), (296669, 8))

train.head()
Figure 1
train.info()
Figure 2

Price

train.price.describe()
Figure 3

The price of items is right-skewed: the vast majority of items are priced at 10–20, while the most expensive item is priced at 2,009. So we will apply a log transformation to the price.

plt.subplot(1, 2, 1)
(train['price']).plot.hist(bins=50, figsize=(12, 6), edgecolor = 'white', range = [0, 250])
plt.xlabel('price', fontsize=12)
plt.title('Price Distribution', fontsize=12)
plt.subplot(1, 2, 2)
np.log(train['price']+1).plot.hist(bins=50, figsize=(12,6), edgecolor='white')
plt.xlabel('log(price+1)', fontsize=12)
plt.title('Price Distribution', fontsize=12)
Figure 4

Shipping

For over 55% of items, the shipping fee was paid by the buyer.

train['shipping'].value_counts() / len(train)
Figure 5

How is shipping related to price?

shipping_fee_by_buyer = train.loc[train['shipping'] == 0, 'price']
shipping_fee_by_seller = train.loc[train['shipping'] == 1, 'price']
fig, ax = plt.subplots(figsize=(18,8))
ax.hist(shipping_fee_by_seller, color='#8CB4E1', alpha=1.0, bins=50, range = [0, 100],
label='Price when Seller pays Shipping')
ax.hist(shipping_fee_by_buyer, color='#007D00', alpha=0.7, bins=50, range = [0, 100],
label='Price when Buyer pays Shipping')
plt.xlabel('price', fontsize=12)
plt.ylabel('frequency', fontsize=12)
plt.title('Price Distribution by Shipping Type', fontsize=15)
plt.tick_params(labelsize=12)
plt.legend()
plt.show()
Figure 6
print('The average price is {}'.format(round(shipping_fee_by_seller.mean(), 2)), 'if seller pays shipping')
print('The average price is {}'.format(round(shipping_fee_by_buyer.mean(), 2)), 'if buyer pays shipping')

The average price is 22.58 if seller pays shipping

The average price is 30.11 if buyer pays shipping

We compare again after log-transformation on the price.

fig, ax = plt.subplots(figsize=(18,8))
ax.hist(np.log(shipping_fee_by_seller+1), color='#8CB4E1', alpha=1.0, bins=50,
label='Price when Seller pays Shipping')
ax.hist(np.log(shipping_fee_by_buyer+1), color='#007D00', alpha=0.7, bins=50,
label='Price when Buyer pays Shipping')
plt.xlabel('log(price+1)', fontsize=12)
plt.ylabel('frequency', fontsize=12)
plt.title('Price Distribution by Shipping Type', fontsize=15)
plt.tick_params(labelsize=12)
plt.legend()
plt.show()
Figure 7

It is obvious that the average price is higher when the buyer pays shipping.

Category Names

print('There are', train['category_name'].nunique(), 'unique values in category name column')

There are 1265 unique values in category name column

Top 10 most common category names:

train['category_name'].value_counts()[:10]
Figure 8

Item condition vs. Price

sns.boxplot(x = 'item_condition_id', y = np.log(train['price']+1), data = train, palette = sns.color_palette('RdBu',5))
Figure 9

The average price seems to vary across the item condition ids.

After the above exploratory data analysis, I decided to use all of these features to build our model.

LightGBM

Under the umbrella of Microsoft’s DMTK project, LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient, with the following advantages:

  • Faster training speed and higher efficiency
  • Lower memory usage
  • Better accuracy
  • Parallel and GPU learning supported
  • Capable of handling large-scale data

Therefore, we are going to give it a try.

General settings:

NUM_BRANDS = 4000                      # keep only the 4,000 most frequent brand names
NUM_CATEGORIES = 1000                  # keep only the 1,000 most frequent category names
NAME_MIN_DF = 10                       # a name token must appear in at least 10 listings
MAX_FEATURES_ITEM_DESCRIPTION = 50000  # cap on the TF-IDF vocabulary for item_description

There are missing values in the columns that we have to fix:

print('There are %d items that do not have a category name.' %train['category_name'].isnull().sum())

There are 5083 items that do not have a category name.

print('There are %d items that do not have a brand name.' %train['brand_name'].isnull().sum())

There are 506370 items that do not have a brand name.

print('There are %d items that do not have a description.' %train['item_description'].isnull().sum())

There are 3 items that do not have a description.

Helper functions for handling missing values, trimming rare brands and categories, and converting columns to categorical dtype:

def handle_missing_inplace(dataset):
    dataset['category_name'].fillna(value='missing', inplace=True)
    dataset['brand_name'].fillna(value='missing', inplace=True)
    dataset['item_description'].replace('No description yet', 'missing', inplace=True)
    dataset['item_description'].fillna(value='missing', inplace=True)

def cutting(dataset):
    pop_brand = dataset['brand_name'].value_counts().loc[lambda x: x.index != 'missing'].index[:NUM_BRANDS]
    dataset.loc[~dataset['brand_name'].isin(pop_brand), 'brand_name'] = 'missing'
    pop_category = dataset['category_name'].value_counts().loc[lambda x: x.index != 'missing'].index[:NUM_CATEGORIES]
    dataset.loc[~dataset['category_name'].isin(pop_category), 'category_name'] = 'missing'

def to_categorical(dataset):
    dataset['category_name'] = dataset['category_name'].astype('category')
    dataset['brand_name'] = dataset['brand_name'].astype('category')
    dataset['item_condition_id'] = dataset['item_condition_id'].astype('category')

Reload the data, split it again, and drop the training rows where price = 0.

df = pd.read_csv('train.tsv', sep = '\t')
msk = np.random.rand(len(df)) < 0.8
train = df[msk]
test = df[~msk]
test_new = test.drop('price', axis=1)
y_test = np.log1p(test["price"])
train = train[train.price != 0].reset_index(drop=True)

Merge the train and new test data, so that the vectorizers and encoders below are fit on a consistent vocabulary.

nrow_train = train.shape[0]
y = np.log1p(train["price"])
merge: pd.DataFrame = pd.concat([train, test_new])

Training Preparation

handle_missing_inplace(merge)
cutting(merge)
to_categorical(merge)

Count vectorize name and category name columns.

cv = CountVectorizer(min_df=NAME_MIN_DF)
X_name = cv.fit_transform(merge['name'])
cv = CountVectorizer()
X_category = cv.fit_transform(merge['category_name'])

TF-IDF Vectorize item_description column.

tv = TfidfVectorizer(max_features=MAX_FEATURES_ITEM_DESCRIPTION, ngram_range=(1, 3), stop_words='english')
X_description = tv.fit_transform(merge['item_description'])

Label binarize brand_name column.

lb = LabelBinarizer(sparse_output=True)
X_brand = lb.fit_transform(merge['brand_name'])

Create dummy variables for item_condition_id and shipping columns.

X_dummies = csr_matrix(pd.get_dummies(merge[['item_condition_id', 'shipping']], sparse=True).values)

Create sparse merge.

sparse_merge = hstack((X_dummies, X_description, X_brand, X_category, X_name)).tocsr()

Remove features with document frequency <=1.

# getnnz(axis=0) counts how many rows have a non-zero entry in each column,
# so this keeps only features that appear in at least two documents.
mask = np.array(np.clip(sparse_merge.getnnz(axis=0) - 1, 0, 1), dtype=bool)
sparse_merge = sparse_merge[:, mask]

Separate train and test data from sparse merge.

X = sparse_merge[:nrow_train]
X_test = sparse_merge[nrow_train:]

Create dataset for lightgbm.

train_X = lgb.Dataset(X, label=y)

Specify our parameters as a dict.

params = {
    'learning_rate': 0.75,
    'application': 'regression',
    'max_depth': 3,
    'num_leaves': 100,
    'verbosity': -1,
    'metric': 'RMSE',
}
  • Use ‘regression’ as the application, since this is a regression problem.
  • Use ‘RMSE’ as the metric, for the same reason.
  • Set “num_leaves” to 100, as our data is relatively big.
  • Use “max_depth” to limit tree depth and avoid overfitting.
  • Use “verbosity” to control LightGBM’s logging level (< 0: Fatal only).
  • “learning_rate” determines the impact of each tree on the final outcome.

Training Start

Training a model requires a parameter list and a dataset, and training will take a while.

gbm = lgb.train(params, train_set=train_X, num_boost_round=3200, verbose_eval=100)
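The 3,200 boosting rounds are fixed here. If you would rather let the data pick the number of rounds, one optional variation (not part of the original walkthrough, and assuming the older LightGBM API in which lgb.train accepts early_stopping_rounds and verbose_eval directly) is to hold out a slice of the training rows as a validation set:

# A rough sketch: keep the last 10% of training rows for early stopping.
n_valid = int(X.shape[0] * 0.1)
d_train = lgb.Dataset(X[:-n_valid], label=y[:-n_valid])
d_valid = lgb.Dataset(X[-n_valid:], label=y[-n_valid:], reference=d_train)
gbm = lgb.train(params, train_set=d_train, num_boost_round=3200,
                valid_sets=[d_valid], early_stopping_rounds=50, verbose_eval=100)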

Predict

y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)
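Note that gbm.predict returns values on the log(price + 1) scale, because that is what the model was trained on. To turn these into actual price suggestions, invert the transform (a small addition that is not shown in the original code):

# Convert log-scale predictions back to dollar prices.
predicted_prices = np.expm1(y_pred)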

Evaluation

from sklearn.metrics import mean_squared_error
print('The rmse of prediction is:', mean_squared_error(y_test, y_pred) ** 0.5)

The rmse of prediction is: 0.46164222941613137
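Since both y_test and y_pred are on the log(price + 1) scale, this RMSE is effectively a root mean squared logarithmic error (RMSLE), the metric used in the Kaggle competition. One rough way to read the number (my own framing, not from the original post):

# An RMSLE of ~0.46 corresponds to a typical multiplicative error of
# exp(0.46) ≈ 1.59, i.e. predictions usually land within roughly a factor
# of 1.6 of the true price.
typical_factor = np.exp(mean_squared_error(y_test, y_pred) ** 0.5)
print('Typical multiplicative error factor: {:.2f}'.format(typical_factor))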

Source code can be found on GitHub. Have a productive week!

Reference: Kaggle
