
Table of Contents:
- Introduction
- Business Problem
- Prerequisites
- Data Source
- Existing Approaches
- Understanding the data
- EDA
- Data Preprocessing
- Benchmark Solution
- First Cut Solution
- Deep Learning-Based Solution
- Conclusion
- Deployment and Predictions
- Future Work
- Github Repository and Linkedin
- References
1. Introduction
Mercari is an e-commerce company currently operating in the US and Japan. It provides a platform where customers can sell items that are no longer useful to them. Mercari tries to make the whole process hassle-free by offering at-home pickups, same-day delivery, and many other conveniences. According to the company website, more than 350k items are listed every day, which reflects its popularity among users.
2. Business Problem
The problem is quite straightforward to understand: given the details of a product, the output should be the price of that product. When we pose this as a machine learning problem, we call it a Regression Problem, as the output is a real number (the price).
Performance Metric:
A performance metric is a function used to measure how well our model is doing. There are various metrics depending on the type of problem and the business requirement, e.g. accuracy, ROC curve, Mean Squared Error, etc. The metric we are using here is the Root Mean Squared Logarithmic Error (RMSLE).
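To make the metric concrete, here is a minimal sketch of how RMSLE can be computed with NumPy; the function name rmsle is my own choice for illustration, not code from the original notebook.

import numpy as np

def rmsle(y_true, y_pred):
    # RMSLE = sqrt(mean((log(1 + y_pred) - log(1 + y_true))^2))
    # np.log1p(x) computes log(1 + x), which also handles $0 prices safely
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Example: rmsle(np.array([10.0, 20.0]), np.array([12.0, 18.0]))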

Objectives:
a) Obviously, our prime objective for the problem is to predict an accurate price. Based on the description of the product, a genuine price must be suggested to the seller to maintain credibility among sellers as well as buyers.
b) As the machine learning model is going to run as a web app, the prediction latency must be low. A seller does not want to wait a long time to see the result; anything from a few milliseconds to a few seconds is an appropriate time for the output to be displayed to the seller.
3. Prerequisites
I am assuming the readers know the basic concepts of classical machine learning and deep learning, data preprocessing, and exploratory data analysis.
4. Data Source
The dataset has been taken from Kaggle and contains two files, one for training and the other for testing. The training file contains more than 1.4 million records and the test file contains about 3 million records. The files are in .tsv format, with zipped sizes of 74MB and 280MB respectively.
5. Existing Approaches
The problem has been solved by many data science enthusiasts with various approaches. The competition winners, Paweł and Konstantin, followed quite a simple approach combined with some amazing tricks. They trained a simple MLP with sparse inputs. Their tricks included name char-grams, stemming with PorterStemmer, and concatenation of text features. They used Bag of Words with 1,2-grams (with/without TF-IDF) and one-hot encoding for categorical features (link). Others have also tried CNN-based models.
6. Understanding the Data
Each row of the training dataset represents a listed product, and each column displays a specific detail of the product. In total there are seven columns, which are as follows:
1. name: This displays the title of the listed product.
2. item_condition_id: The condition of the item provided by the seller. This is an ordered categorical feature whose categories range from 1 to 5, where 1 is better than 5.
3. category_name: This feature holds the category name for the listing, e.g. "Women/Athletic Apparel/Pants, Tights, Leggings".
4. brand_name: This feature gives the name of the brand. The feature also contains null values.
5. price: This is the feature that we want to predict, i.e. the target variable. The unit is USD ($).
6. shipping: This is a categorical feature that tells who pays the shipping fee. It is "1" if the seller pays and "0" if the buyer pays for shipping.
7. item_description: This feature contains the full description of the item.
7. Exploratory Data Analysis
This step is one of the crucial steps in solving a machine learning problem because it provides an opportunity to explore and understand the dataset. This deep insight into the dataset is important not only to understand the existing features but also to develop new features to feed into the machine learning model. So now that we know why we want to do EDA, let's get into it.
Basic Statistics
In this part of EDA, we will try to understand the overall dataset which includes gathering initial stats about the raw data like the shape or the data types.
import pandas as pd

df = pd.read_csv("train.tsv", delimiter="\t", index_col=["train_id"])  # the file is tab-separated
print(df.shape)
df.info()


As we can clearly see, the dataset has 1482535 rows and 7 columns. The .info() output tells us the data type of each column and shows that 'category_name', 'brand_name' and 'item_description' contain some null values, which will be taken care of in the data preprocessing part.
7.1 Univariate analysis:
In simple terms, exploring the data by taking one feature(column) at a time is a univariate analysis. This approach helps to develop an understanding of the specific feature under consideration.
7.1.a price
As we know, this is our target variable. The data type of this feature is float, so we can check its statistics.
df.price.describe()

One thing we can see is that the mean value is $26.7 while the maximum value is around $2000, which means there is definitely some skewness in the data. To confirm this we can simply plot the distribution, which results in the following chart.

The plot is highly skewed, which is what we expected. This looks very close to a log-normal distribution. When a distribution is log-normal, the distribution of the logarithm of the values is Gaussian. This means that if we take the log of the price, the result will be close to a Gaussian distribution, as shown below:
import numpy as np
import matplotlib.pyplot as plt

# Note: we add 1 to avoid -inf values for $0 price products
df["log_price"] = df.price.apply(lambda x: np.log(x + 1))
plt.hist(df.log_price, bins=30, color="teal")
plt.title("Log(Price+1) Distribution")
plt.xlabel("log(Price+1)")
plt.ylabel("Count")
plt.show()

This distribution is close to the Gaussian Distribution which has certain advantages over the raw price.
- The values in a Gaussian distribution are much more evenly spread across all ranges, unlike the log-normal distribution, where very few data points are available for higher price values.
- This benefits models like linear regression, which assume that the dependent variable is normally distributed. Moreover, for other models as well, a Gaussian-distributed target variable generally gives better performance.
- And considering our performance metric, we have to compute the log of the predicted and actual prices anyway, so it is better to take log(price+1) as our target value and compute RMSE over it.
Thus, we are going to consider log(price+1) as our target variable. There are some products which are priced at $0. There is an interesting piece of EDA we will do later on these products.
7.1.b item_condition_id
Mercari categorizes the condition of their product into 5 categories.
- New
- Like New
- Good
- Fair
- Poor
These are represented in the dataset from 1 to 5 respectively. Now let’s see the products lying in each of these categories.

We can clearly observe that there is variability in the number of listings under each item_condition_id category. The maximum number of listings falls under category 1 whereas category 5 holds the minimum. This makes sense because the lower the number, the better the quality/condition of the product, and no one wants to rate their product as being in bad condition.
7.1.c shipping:
We are going to perform a similar analysis for shipping

- Here also there is a small imbalance between the categories. The variation of this feature with price would be an interesting analysis, which we will see in the bivariate analysis section.
7.1.d ‘brand_name’
The brand name is a feature that initially holds about 40% of its values as null. We are going to impute those null values as "missing", which we will see in the preprocessing part. We are also going to extract brand names from the name of the listing, which reduces the unknown brand names to 27%.


- There are almost 4800 categories in brand_name. The top 10 brands occurring the maximum number of times include Pink, Lularoe, Nike, Victoria's Secret, Apple, Nintendo, Forever 21, Lululemon and Michael Kors, which are actually very famous brand names. The least occurring brands include Police, Happy Socks, Pal Zileri, etc., which are much less heard of.
7.1.e category_name
The category_name is the name of the category in which the product falls. This is specific to the Mercari website. This feature also holds about 6k null values, which are imputed with the text 'missing'.
Let us first take an overall picture of category_name.


- There are 1287 categories in category_name, in which the majority of products are from women's clothing and beauty products.
- We can observe on the Mercari website that the categories are hierarchical, which means one main category holds sub-categories and so on; in other words, there are different Tiers (we are going to use this terminology). These are separated by '/'.

We will see in data preprocessing how to separate these categories, but for now, we can assume that the categories are divided into three tiers (Tier_1, Tier_2, Tier_3).
7.1.f Tier_1
In the first level, there are a total of 10 categories (unknown values are imputed as 'missing'). Almost half of the products fall under the category "women", followed by beauty and kids, which have quite low percentages in comparison.


7.1.g ‘Tier_2’
In the second tier, we have more than 100 categories. As expected, women's clothing and beauty products dominate the top 10. The least occurring category is 'quilt'.


7.1.h Tier_3
In the third tier, we can observe more than 800 categories. The top 10 products include t-shirts, pants-tights-leggings, games, shoes, and others. The 10 least occurring categories include videogame, tiles, educational (interesting), cleaning, cuff, etc.


7.1.i ‘name’
Next, we have the feature name, which is unique for most of the products; as we can see there are over 100k unique values in this feature, but for some products it is the same. The name occurring the maximum number of times is 'bundle'. This might be because those products are being sold in bulk.

For the text features, we can also check the distribution for the length of the characters and the number of words.

- The majority of sellers tend to give more than 20 characters in the name, as the distribution of the number of characters is left-skewed.
- There are some sellers who give an exceptionally high number of words in the name, i.e. more than 12.
7.1.j ‘item_description’
The item description is an important feature for modeling. It holds descriptive information about the product given by the seller. This feature contains 4 null rows, which are imputed as 'missing'.
When we observe the values in this feature, it is dominated by 'no description yet', which accounts for about 5% of the total listings. This might be the default value registered when a listing is created on the website without a description.

- The distribution of the number of characters in item_description is highly skewed. The CDF plot shows that around 90% of the descriptions have lengths lower than 400.

7.2 Bivariate Analysis
7.2.a item_condition_id VS log_price
Here we are plotting the distribution of log_prices for each category in item_condition.

- The overall plot is quite cluttered, but one thing we can observe is that the CDF for item_condition_id = 5 is lower than the others, which makes sense because condition 5 is poor, therefore the prices should also be low.
7.2.b shipping VS log_price
Here we are analyzing the relationship between log_price and the shipping for which we are plotting the distribution for each category of shipping.


- There is some information in this feature because the distributions of the categories are shifted along the axis.
- The CDF plot for 'buyer pays' is towards the right, which shows that prices are higher when the buyer pays the shipping, which is justified.
- The boxplot also shows that category 1 has lower prices than category 0. This is understandable because "1" signifies that the seller will pay the shipping, which makes the listed product price lower.
7.2.c brand_name VS price
For this, we are plotting the top 25 costliest brands in the dataset.

- We can observe that the prices of the costliest brands can go up to $2000.
- It also depends on the product type; for example, there is a brand 'ring' that most probably sells rings, and this kind of product is costly irrespective of the brand.
7.2.d Tier_1 VS Price
Now let's see the impact of Tier_1 categories on the price of the product. Here we plot the mean and median price for each category in Tier_1, from which we can observe that the mean values are higher than the median values. This is a property of a right-skewed distribution, and we know that the price of the product is highly right-skewed.

- The box plot of log_prices for each category shows that the range of prices for electronics, women, beauty, vintage collectibles and handmade is higher than for the remaining categories.

7.2.e name VS item_condition_id
This is an interesting analysis which tells how the names vary according to the condition of the product. We are going to make use of Wordcloud for this analysis.

- Clothing products and brands are visible in all the images, like "Pink", which is present in all categories.
- In item_condition_id = 1, "new" is an often-used word in the name, and the names also include brand names that are quite popular, whereas in item_condition_id = 5 words like "broken" can be seen, which makes sense because this category is for poor-quality products.
- As there is no separate size feature, the size is mentioned in the name itself. Hence we can see 'size' in almost all the word clouds.
7.2.f item_description VS item_condition_id
A similar analysis is done for item_description and the results are as follows:

- All the words occurring in each of the categories are understandable and quite genuine. For example, item_condition_id = 1 has words like brand new, never used, new, high quality.
- As the item_condition_id increases, the words used in the description also change. For item_condition_id = 3, 'used condition' and 'gently used' are the most prominent words, which change to 'parts', 'broken', 'work' for item_condition_id = 5.
7.3 Analysis for zero_price products
In this section, we are going to analyze the products that have a price equal to $0. So first, let's see how many such products there are.
zero_price = df[df.price<=0]
print("The shape of the df =",zero_price.shape)

To see which category these products belong we are going to plot the products in each category of feature Tier_1, Tier_2, Tier_3.


- Most of the products which are at price "0" are clothing or beauty products that fall under the women category.
We can also look at the percentage of products lying in each item_condition and compare it with the overall percentage.

- The zero-price products have a higher percentage of items with item conditions 2, 3, 4 and 5. This suggests that these may be used items, like clothing, that people are giving out for free as donations.
8. Data Preprocessing
Data Preprocessing, or data cleaning, is an integral step in building a machine learning model. In this step, we make the required changes to the dataset so that it is suitable for further processing. Data preprocessing steps are very specific to the problem or dataset we are dealing with. Here we are going to see the data cleaning for the features of this particular problem.
8.1 name
For text features, there are many techniques used for data preprocessing. These include decontraction, lowercasing, lemmatization, tokenization, stemming, stopword removal, and many more. Here we have done very simple preprocessing, which is decontraction, removing special characters, stopword removal, and lowercasing the text. The code for it is as follows:
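The original snippet is shown as an image; below is a minimal sketch of this kind of cleaning, assuming NLTK's English stopword list, with the helper name preprocess_text being my own choice.

import re
from nltk.corpus import stopwords  # may require nltk.download('stopwords')

STOPWORDS = set(stopwords.words("english"))

def decontract(text):
    # expand a few common contractions (illustrative, not exhaustive)
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "can not", text)
    text = re.sub(r"n't", " not", text)
    text = re.sub(r"'re", " are", text)
    text = re.sub(r"'ll", " will", text)
    text = re.sub(r"'ve", " have", text)
    return text

def preprocess_text(text):
    text = decontract(str(text).lower())
    text = re.sub(r"[^a-z0-9 ]", " ", text)                    # remove special characters
    words = [w for w in text.split() if w not in STOPWORDS]    # stopword removal
    return " ".join(words)

df["name_processed"] = df.name.apply(preprocess_text)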
Examples:

8.2 brand_name
For brand_name, not only is the basic text preprocessing done, but, as we discussed earlier in the EDA, there are certain products whose names include the brand name, so we have to extract it from there.
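As a rough illustration of the idea (not the author's exact code), missing brands can be filled by scanning the listing name for a known brand; the helper fill_brand_from_name below is hypothetical and not optimized for speed.

known_brands = set(df.brand_name.dropna().unique())

def fill_brand_from_name(brand, name):
    # keep the brand if the seller already provided one
    if isinstance(brand, str) and brand.strip():
        return brand
    # otherwise look for any known brand inside the listing name
    for candidate in known_brands:
        if candidate.lower() in name.lower():
            return candidate
    return "missing"

df["brand_name_processed"] = [
    fill_brand_from_name(b, n) for b, n in zip(df.brand_name, df.name)
]

In practice a faster lookup (for example, matching only on the first few words of the name) is needed at this scale, but the intent is the same.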
Examples:

Before extracting the brand_names from the name, about 42% of the values were 'missing', but after following this approach, we reduced this number to 27%!
8.3 category_name
For category_name also, the same preprocessing approach is followed.
Examples:

Once this is done, we will split the category into three tiers by splitting at "/".
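A minimal sketch of that split, assuming three levels and padding shorter categories with 'missing' (the helper name split_category is my own):

def split_category(cat):
    # "women/athletic apparel/pants tights leggings" -> three tiers
    parts = str(cat).split("/")
    parts += ["missing"] * (3 - len(parts))   # pad when fewer than 3 levels
    return parts[0], parts[1], parts[2]

df["Tier_1"], df["Tier_2"], df["Tier_3"] = zip(*df.category_name.apply(split_category))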
Examples:

8.4 item_description
This is also a text feature that needs to be processed. There are some null values as well, which need to be taken care of. The following function is used for this.
Example:

Here I have shown the preprocessing for the train data; the same has to be followed for the test data.
9. Benchmark Model
We will use this model to compare the performance of our machine learning models against. The benchmark model we are using here is a simple averaging model which gives the average price based on the two features "shipping" and "item_condition_id". To put it simply, when we input shipping (0 or 1) and item_condition_id (1 to 5), the model selects all the training data with the same shipping and item_condition_id, takes the average price of those points, and outputs the result.
Splitting the data
Before we test the model let us split the dataset into train and validation data in a ratio of 9:1 respectively.
from sklearn.model_selection import train_test_split

df_train, df_val = train_test_split(df, test_size=0.1, random_state=3)
print(df_train.shape)
print(df_val.shape)
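With the split in place, a minimal sketch of the averaging benchmark described above could look like this (built with a pandas groupby; the original implementation may differ):

# mean log(price+1) for every (shipping, item_condition_id) pair seen in training
benchmark_table = df_train.groupby(["shipping", "item_condition_id"]).log_price.mean().to_dict()
global_mean = df_train.log_price.mean()

def benchmark_predict(shipping, item_condition_id):
    # fall back to the overall mean for unseen combinations
    return benchmark_table.get((shipping, item_condition_id), global_mean)

y_val_pred = [benchmark_predict(s, c) for s, c in zip(df_val.shipping, df_val.item_condition_id)]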


When tested on the validation data, the benchmark model gives an RMSLE value of 0.7255, which becomes our benchmark. Any model with a metric value higher than this is performing badly.
10. First Cut Approach
For the first cut solution, I tried different machine learning models, but the most important part of machine learning is featurization. I followed two types of featurization.
10.1 First Approach
10.1.1 Feature Engineering
In the first approach, we will try to keep the dimensionality of our feature vectors low so that we can train the models faster. Therefore we are going to use Ordinal Encoding for the categorical data and Average Word2Vec for the text data. Let us see them one by one:
10.1.1.1 Categorical Data
We know the categorical data in our dataset consists of shipping, item_condition_id, brand_name, Tier_1, Tier_2 and Tier_3, out of which "shipping" and "item_condition_id" are already ordinally encoded. For the remaining features, I am going to use the OrdinalEncoder functionality of the Sklearn library.
Note that there will be some unknown categories in the validation data, to which I am giving a value of -1.
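A minimal sketch of this step, assuming a recent scikit-learn version and the processed column names used earlier (the exact code may differ from the original):

from sklearn.preprocessing import OrdinalEncoder

cat_cols = ["brand_name_processed", "Tier_1", "Tier_2", "Tier_3"]

# unseen categories in the validation data are mapped to -1
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_cat = encoder.fit_transform(df_train[cat_cols])
val_cat = encoder.transform(df_val[cat_cols])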
10.1.1.2 Text Data
For the text data, I am going to join name and item_description. There is not really any specific reason for this other than to reduce the dimension; you can also featurize them separately. I tried them separately as well, but as the vectors are dense, it was taking more time to train.
The approach for vectorizing text features in this first part is Average Word2Vec. Here I am using the Word2Vec functionality of the Gensim library and training it on the corpus of concatenated data (name + item description).
The idea is, for each sentence, to vectorize each word with the help of the Word2Vec model and then take the average of all these word vectors, which is considered the vector of the sentence.
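A minimal sketch of training Gensim's Word2Vec on the concatenated text and averaging the word vectors per sentence (the vector size and other hyperparameters are illustrative, not the originals):

import numpy as np
from gensim.models import Word2Vec

# concatenated, already-preprocessed text
df_train["text"] = df_train.name_processed + " " + df_train.processed_item_description
corpus = [sentence.split() for sentence in df_train.text]

# train Word2Vec on the training corpus (Gensim >= 4 API)
w2v = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=2, workers=4)

def avg_word2vec(sentence, model, dim=100):
    # average the vectors of all in-vocabulary words; zeros if none are found
    vectors = [model.wv[w] for w in sentence.split() if w in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

train_vec_text = np.vstack([avg_word2vec(s, w2v) for s in df_train.text])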
10.1.1.3 Is Missing
This feature gives 1 if a value is missing in brand name or name or item description otherwise it is 0.
df_train["is_missing"] = (df_train.brand_name_processed=="missing") | (df_train.name_processed =="missing")| (df_train.processed_item_description=="missing")
df_train["is_missing"] = df_train["is_missing"].astype(int)
10.1.1.4 Stacking all Features
All the features that are formed are stacked together horizontally.
'''STACKING ALL FEATURES OF TRAIN DATASET'''
x_train = np.hstack((df_train.item_condition_id.values.reshape(-1,1),
                     df_train.shipping.values.reshape(-1,1),
                     df_train.is_missing.values.reshape(-1,1),
                     train_vec_brand, train_vec_t1,
                     train_vec_t2, train_vec_t3, train_vec_text))
10.1.1.5 Modeling
a. Lasso
Lasso is a linear model which minimizes the squared loss with L1 regularization. Here I am using the Sklearn implementation. The model is hyperparameter-tuned over a range of values and trained on the best parameters.
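A minimal sketch of the kind of tuning described (the alpha grid below is an assumption, not the values actually searched):

from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

best_alpha, best_rmsle = None, float("inf")
for alpha in [0.0001, 0.001, 0.01, 0.1, 1]:
    model = Lasso(alpha=alpha)
    model.fit(x_train, y_train)
    # y_train / y_val are log(price+1), so RMSE here is the RMSLE on prices
    score = mean_squared_error(y_val, model.predict(x_val), squared=False)
    if score < best_rmsle:
        best_alpha, best_rmsle = alpha, score

lasso = Lasso(alpha=best_alpha).fit(x_train, y_train)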

The performance of the model comes out to be 0.6037 on validation data and 0.60476 on the test data on Kaggle (Public Score).
b. Ridge
The Ridge model is also a linear model which minimizes the squared loss, but with L2 regularization.

The RMSLE for this model is 0.6038 on validation data and 0.6048 on Test Data Kaggle.
c. Decision Tree
A Decision Tree, as the name suggests, uses a tree-based model to predict the output. I am using Sklearn's Decision Tree regressor here.

The RMSLE is 0.6353 on validation data and 0.63648 on the test data. Clearly, the performance of the decision tree is not as good as the earlier models.
I also tried Random Forest on these features, but its training time was very high, so I had to stop it midway. If interested, you can see the full code in the Github repository.
d. Light GBM
LightGBM provides the functionality of an ensemble (gradient boosting) model and is fast to train, therefore it is suitable for this problem.
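A minimal LightGBM regressor sketch on the same features (the hyperparameters shown are illustrative, not the tuned ones):

from lightgbm import LGBMRegressor

lgbm = LGBMRegressor(n_estimators=1000, learning_rate=0.1, num_leaves=63)
lgbm.fit(x_train, y_train, eval_set=[(x_val, y_val)], eval_metric="rmse")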

The performance for this model on validation data is 0.5008 and on test data is 0.50049 (Private Kaggle Score).
10.2 Second Approach
10.2.1 Feature Engineering
In the second approach, we are going to use One Hot Encoding for Categorical Data and Tfidf for text data.
10.2.1.1 Categorical Data (One Hot Encoding)
All the categorical data, which includes shipping, item_condition_id, Tier_1, Tier_2 and Tier_3, is one-hot encoded. The following code is used to do so:
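The original snippet is an image; a minimal sketch with Sklearn's OneHotEncoder could look like this (unknown categories in the validation data are simply ignored):

from sklearn.preprocessing import OneHotEncoder

ohe_cols = ["shipping", "item_condition_id", "Tier_1", "Tier_2", "Tier_3"]

# sparse output keeps the stacked feature matrix memory-friendly
ohe = OneHotEncoder(handle_unknown="ignore")
train_ohe = ohe.fit_transform(df_train[ohe_cols])
val_ohe = ohe.transform(df_val[ohe_cols])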
10.2.1.2 Text Data (Tfidf Vectorizer)
TF-IDF stands for "Term Frequency - Inverse Document Frequency". It has two parts. The first part, "Term Frequency", is the ratio of the count of a word to the total number of words in the document, which gives more weight to words that occur more often in a document. The second part, "Inverse Document Frequency", is based on the ratio of the total number of documents to the number of documents in which the word occurs; IDF gives higher weight to rarer words in the corpus.
I am using an n-gram range up to 2 (unigrams and bigrams) and a maximum of 50000 features.
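A minimal sketch of this vectorization for the item description (the same would be done for the name), fitting only on the training split:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_desc = TfidfVectorizer(ngram_range=(1, 2), max_features=50000)
train_vec_desc = tfidf_desc.fit_transform(df_train.processed_item_description)
val_vec_desc = tfidf_desc.transform(df_val.processed_item_description)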
10.2.1.3 Is Missing
This feature gives 1 if a value is missing in brand name or name or item description otherwise it is 0.
df_train["is_missing"] = (df_train.brand_name_processed=="missing") | (df_train.name_processed =="missing")| (df_train.processed_item_description=="missing")
df_train["is_missing"] = df_train["is_missing"].astype(int)
10.2.1.4 Stacking
Similarly, all the features are stacked together horizontally, forming one vector per product.
'''STACKING ALL THE FEATURES'''
# STACKING TRAIN FEATURES
# assuming scipy's sparse hstack here, since the TFIDF/one-hot outputs are sparse matrices
from scipy.sparse import hstack

x_train = hstack((train_vec_item_con, train_vec_shipping,
                  train_vec_name, train_vec_brand,
                  train_vec_t1, train_vec_t2, train_vec_t3,
                  df_train.is_missing.values.reshape(-1,1),
                  train_vec_desc))
10.2.1.5 Models
a. Linear Regression
This implementation of linear regression minimizes the squared loss. Here I am using Sklearn's implementation of Linear Regression.
'''TRAINING LINEAR REGRESSION'''
from sklearn.linear_model import LinearRegression

# normalize=True exists only in older scikit-learn versions (it was removed in 1.2)
lr = LinearRegression(normalize=True)
lr.fit(x_train, y_train)
This simple model gives a performance metric of 0.4620 on validation data and 0.4621 on the test data (Kaggle Private Score).
b. Ridge Model
The Ridge model with one-hot encoded and TFIDF features is hyperparameter-tuned over various parameters. The code remains the same as discussed above.

By training on the best parameters, we get the validation RMSLE of 0.4581 and 0.45831 on Test data(Kaggle Private Score).
Observations:
- By now, the best model we have is the Ridge model with one-hot encoded and TFIDF features.
- For label-encoded and Word2Vec features, complex models like LightGBM perform better than simple linear models like Lasso and Ridge.
- But for TFIDF features, the dimensionality becomes comparatively high, therefore linear models give decent values of the performance metric.
- For better performance, we have to try a deep learning model.
11. Deep Learning-Based Solution
Deep learning models use multiple layers between the input and output layers of the network, which makes them a powerful way to solve a problem. Moreover, their loose resemblance to the human brain helps build intuition for how they work.
Here we are going to use an RNN-type DL model. RNN stands for Recurrent Neural Network, which includes LSTM- and GRU-based models. RNN models are used for data with long sequences, which is exactly what our text data (name and item description) is. Therefore, you will see GRU layers being used for these features in the architecture.
11.1 Tokenizing and Padding
Before we give the data to our network it has to be vectorized. Keras provides an Embedding layer to vectorize the data, but it requires the input in a specific format. For that, we first have to tokenize our text data. Through tokenization we convert the text into numbers, which are nothing but the index values of the words in a vocabulary learned by fitting on the train data.
Following this, the data needs to be padded so that each datapoint has the same input length. This code is used for tokenizing and padding.
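A minimal sketch with Keras's Tokenizer and pad_sequences (the maximum length is an illustrative assumption):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# learn the vocabulary on the training text only
tok_name = Tokenizer()
tok_name.fit_on_texts(df_train.name_processed)

# convert words to vocabulary indices, then pad to a fixed length
MAX_NAME_LEN = 10  # assumed value for illustration
train_name_seq = pad_sequences(tok_name.texts_to_sequences(df_train.name_processed), maxlen=MAX_NAME_LEN)
val_name_seq = pad_sequences(tok_name.texts_to_sequences(df_val.name_processed), maxlen=MAX_NAME_LEN)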
11.2 Architecture
The architecture of the model is the most important factor in deep learning for achieving higher performance. The architecture that I am following here is as follows:

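The exact architecture is the one shown in the diagram above; as a rough sketch of its general shape only (GRU branches for the two text inputs, a branch for the remaining encoded features, concatenated into dense layers; every layer size below is a placeholder of my own, not the author's values):

from tensorflow.keras import layers, Model
from tensorflow.keras.metrics import RootMeanSquaredError

# text branches: Embedding + GRU
name_in = layers.Input(shape=(10,), name="name")
desc_in = layers.Input(shape=(75,), name="item_description")
name_gru = layers.GRU(8)(layers.Embedding(input_dim=50000, output_dim=32)(name_in))
desc_gru = layers.GRU(16)(layers.Embedding(input_dim=50000, output_dim=32)(desc_in))

# branch for the remaining (already encoded) features
other_in = layers.Input(shape=(8,), name="other_features")

x = layers.concatenate([name_gru, desc_gru, other_in])
x = layers.Dense(128, activation="relu")(x)
x = layers.Dense(64, activation="relu")(x)
out = layers.Dense(1)(x)  # predicts log(price + 1)

model = Model(inputs=[name_in, desc_in, other_in], outputs=out)
model.compile(optimizer="adam", loss="mse", metrics=[RootMeanSquaredError()])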
11.3 Training
The model is trained with the Adam optimizer, the loss is MSE, and the metric I have taken is RMSE. Callbacks for ModelCheckpoint, a learning rate scheduler, and early stopping are also used (see the sketch below).
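The callbacks named save, lr and earlystop in the snippet that follows could, for instance, be defined like this (a sketch; the actual patience values and schedule are not from the original code):

from tensorflow.keras.callbacks import ModelCheckpoint, LearningRateScheduler, EarlyStopping

save = ModelCheckpoint("best_model.h5", save_best_only=True, monitor="val_loss")
lr = LearningRateScheduler(lambda epoch, lr: lr if epoch < 3 else lr * 0.5)
earlystop = EarlyStopping(monitor="val_loss", patience=2, restore_best_weights=True)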
# FITTING THE MODEL
model.fit(x=x_train, y=y_train, validation_data=(x_val, y_val),
          epochs=10, batch_size=100,
          callbacks=[save, lr, earlystop])

Note that the model was set to run for 10 epochs, but training stopped after 4 epochs due to the early stopping callback.
The deep learning model gives us the performance metric of 0.4316 for validation data and 0.4331 for Test Data (Kaggle Private Score).

12. Model Comparison and Conclusion
These are the performances of all the models, which show the journey from a basic benchmark model to a complex deep learning model, with improvements at each step.
It clearly shows that for this problem the deep learning model performs better than the machine learning models.

13. Deployment and Predictions
I have deployed the model on AWS. To try the model, click the link. Below is a video of the model working and a prediction on one of the products. Do let me know your thoughts in the comments.
14. Future Work
After all this effort, I think I have achieved a decent value of the performance metric, but there is always scope for improvement. Following are some ideas I would like to work on in the future.
- To perform some more feature engineering techniques for the machine learning models.
- To try more complex models with Tfidf features.
- To try CNN-based models.
- To try Tfidf/CountVectorizer based features in Deep Learning models.
15. Github Repository and LinkedIn
Here is the GitHub Repository to refer to the full code and to connect with me on LinkedIn click here.