Sales & data science: a step-by-step guide to competitor analysis using Python

What makes sellers successful on online marketplaces like Amazon or eBay? Data science helps to answer this question.

Filip Ciesielski
Towards Data Science


Online marketplaces like Amazon or eBay are full of sellers competing with each other at every step. It’s not easy to increase sales when other people offer products that are very similar to yours.

Have you ever wondered what makes buyers choose one seller over another if they all offer pretty much the same thing?

You might be thinking that it’s all down to the price.

So did I, but I couldn’t resist checking whether other factors also come into play.

In this article, I step into the shoes of an online seller to see how a data-driven approach using Python programming tools can help answer this question.

Source: Courtesy of my team at Sunscrapers.

To do that, I took two steps.

Step 1: Find an online marketplace that is rich in content about sales history. This article was written for educational purposes, so I’d like to keep the name of the marketplace confidential.

Step 2: Come up with a method that allows comparing different offers. The trick here was finding a product that comes in many similar variants and is sold by a large number of sellers. One of the most commonly sold products on the online marketplace I picked for this study is something we all use:

A smartphone screen protector.

In this article, I’m going to ask and answer 4 questions:

Q1. What are the average prices of our product type?

Q2. How does pricing affect sellability?

Q3. What are the most popular screen protector brands?

Q4. Which brands are the most profitable?

Tools of the trade

Python — our programming language of choice for data science.

Python tools:

  • Scrapy — this web scraping/crawling framework offers convenient features such as marshalling and preprocessing of field values. You can run it via the online Scrapinghub platform, which makes the crawling process less taxing.
  • Pandas — I will use it to load data into tables (and then clean, process, and analyze it).
  • Seaborn and Matplotlib — these are some handy libraries for data visualization in Python.

BTW, I also wrote two articles to help you work with Pandas. Check them out:

How to use the Split-Apply-Combine strategy in Pandas groupby

How to handle large datasets in Python with Pandas and Dask

Data mining and preparation

1. Getting the data

The first step is finding a data source. I needed an online marketplace that offers some kind of indicator of sales performance, which I could then evaluate against other features of an offer. The platform I picked provides information about the last 100 transactions for each item.

Note: The web crawler code content is rather extensive and will vary for different marketplace websites, so I decided not to include examples of the scraping code in this article. The Python framework for scraping and web crawling, Scrapy, offers plenty of documentation with easy-to-follow tutorials, so refer to them if needed.

Here’s a short description of how I got the data for this article:

First, I manually searched for smartphone screen protectors on the online marketplace and started the crawling process from there. Typically, a search pattern starts with some kind of an offer list where each listing points to an item-dedicated page with more information, possibly even a list of past purchases.

Note that you can usually harvest valuable information about the product’s presentation straight from the search results list (like the seller’s status, points, or the number of past purchases). After all, this is the customer’s first exposure to the offer, and it may impact their decision-making process.
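For orientation only, here is a minimal sketch of how such a Scrapy spider could be structured. The start URL, CSS selectors, and field names are all hypothetical placeholders, since the real code depends entirely on the marketplace’s markup:

import scrapy

class OffersSpider(scrapy.Spider):
    # Hypothetical sketch: the URL and all selectors below are placeholders.
    name = 'offers'
    start_urls = ['https://marketplace.example/search?q=screen+protector']

    def parse(self, response):
        # Each search-result card links to an item-dedicated page;
        # harvest what's visible on the card before following the link.
        for card in response.css('div.offer-card'):
            yield response.follow(
                card.css('a.offer-link::attr(href)').get(),
                callback=self.parse_item,
                cb_kwargs={'seller_status': card.css('.seller-status::text').get()},
            )
        # Walk the paginated result list.
        yield from response.follow_all(response.css('a.next-page'), callback=self.parse)

    def parse_item(self, response, seller_status):
        yield {
            'product_id': response.css('[data-product-id]::attr(data-product-id)').get(),
            'current_price': response.css('span.price::text').get(),
            'brand': response.css('td.brand::text').get(),
            'seller_status': seller_status,
        }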

After the crawl, I ended up with two Pandas tables. The main table (df) had all of the product information, with each row corresponding to an item. The other table (sale_history) stored information about past purchases, with many rows of individual sale events per product.

I’ll show you the table examples later on.

2. Processing the data

After the data extraction step, it’s time to do some cleaning and data preparation. Besides all the usual steps (removing nulls, casting columns into the right data types, etc.), there were a couple of interesting steps I’d like to mention here — again, without going into the details.

As a first step, I tend to go through individual columns with the Pandas unique() method. That’s how I can see whether the values are consistent and sensible — and catch any potential issues. Then I check for data duplication by grouping rows by the column that serves as the unique identifier for a specific item — in this case I used product_id.
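In code, that first pass can be as short as this sketch:

# Eyeball the distinct values of every column to spot inconsistencies:
for col in df.columns:
    print(col, df[col].unique()[:10])

# Then check for duplication: how many rows share the same product_id?
duplicate_counts = df.groupby('product_id').size()
print(duplicate_counts[duplicate_counts > 1])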

One of the first things I noticed is that some product pages were linked to multiple listings in the search results page (coincidence? I don’t think so!). I got rid of the duplicates, but decided to keep this information for analysis. So I created a new column with the number of listings per item first, then deleted the copies except for one:

df['same_offer_count'] = df.groupby('product_id')['product_id']\
    .transform('count')
df = df.drop_duplicates(subset='product_id', keep='first')

Another interesting problem was dealing with multiple currencies used throughout the marketplace. My raw data table contained bare price string values along with the quoted currency symbol (for example, ‘US $1.09’ or ‘C $2.42’), so I needed to extract the numeric values and unify all the price currencies by converting them to USD. Here are a few example rows before the transformation:

Here’s the code I used to transform it:

import re
from currency_converter import CurrencyConverter

cc = CurrencyConverter()
currency_shortcuts = {'C': 'CAD', 'US': 'USD', 'AU': 'AUD'}  # first I checked only these occur...
regx_str = r'(\w+\s*)\$[ ]?(\d+[.|,]?\d+)'  # note the two 're' groups!

df[['currency', 'quoted_price']] = df['current_price']\
    .str.extract(pat=regx_str)
df['currency'] = df['currency'].str.replace(' ', '')
df['currency'] = df['currency'].map(currency_shortcuts)
df['quoted_price'] = df['quoted_price'].astype(float)  # cast the extracted strings to numbers
df['price_USD'] = df['quoted_price'].copy()

# Convert every non-USD price to USD:
for currency in [c for c in df['currency'].unique() if c not in ['USD']]:
    fltr = df['currency'].isin([currency])
    df.loc[fltr, 'price_USD'] = df.loc[fltr, 'quoted_price']\
        .apply(lambda x: cc.convert(x, currency, 'USD'))

Which resulted in:

Next, I processed the sales history table (sale_history). I performed some basic type fixes, extracted and converted prices and currencies, and filled in the null values (code not shown). I ended up with this table (again, it’s just a snapshot of rows):
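For reference, those fixes boiled down to steps like this sketch (the column names come from the code further below, but the fillna default and the total_price formula are my assumptions; pandas is assumed to be imported as pd):

# Basic type fixes on the sale history table (sketch):
sale_history['purchase_date'] = pd.to_datetime(sale_history['purchase_date'])
sale_history['quantity_sold'] = sale_history['quantity_sold'].fillna(1).astype(int)  # assumed default
sale_history['sell_price'] = sale_history['sell_price'].astype(float)
sale_history['total_price'] = sale_history['quantity_sold'] * sale_history['sell_price']  # assumed formula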

To make it useful for my analysis and plotting, I aggregated the entries by date (and product_id, of course) and calculated the number of sold items and the daily selling rates. Wrapping all this into a single function allowed me to apply it to each product’s group of rows:

import numpy as np
import pandas as pd

def calculate_sale_history_stats(df):
    """Calculates statistics on the sale history of one product,
    returns them as a new Series."""
    delta = df['purchase_date'].max() - df['purchase_date'].min()
    days = int(delta.days)
    values = list(df['quantity_sold'])
    earnings = list(df['total_price'])
    sold_count = sum(values)

    # Pad with zeros for the days without any sales:
    if len(values) < days:
        values.extend([0] * (days - len(values)))
        earnings.extend([0] * (days - len(earnings)))

    res = pd.Series(
        [sold_count, np.mean(values),
         np.std(values), np.mean(earnings),
         np.std(earnings)],
        index=['Sold_count', 'Mean_daily_sold_count',
               'Sold_count_St.Dev', 'Daily_earnings',
               'Daily_earnings_St.Dev']
    )

    return round(res, 2)

And applying it to the sale_history dataframe:

sale_history_stats = sale_history.groupby('product_id')\
    .apply(calculate_sale_history_stats)

resulted in:

Finally, I merged the aggregated sales stats (sale_history_stats) into the main df table:

df = pd.merge(
    how='left',
    on='product_id',
    left=sale_history_stats.reset_index(),  # the stats are indexed by product_id
    right=df[['product_id', 'shipping_cost', 'shipping_from',
              'top_rating_badge', 'seller_feedback_score',
              'seller_feedback_perc']]
)

Here’s the resulting df table (again, only a selection of columns is shown):

Now we’re good to go. So let’s begin our competitor analysis.

Q1: What are the average prices of our product type?

Let’s see how much profit we can make from screen protectors in general. How much do sellers generally charge for this type of product?

I can analyze the prices in the marketplace and see what customers usually pay for smartphone screen protectors like this:

import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots()  # shared axes for both distributions
prices = df['price_USD']
sns.distplot(prices, ax=ax, bins=200)
paid_prices = sale_history['sell_price']
sns.distplot(paid_prices, ax=ax, bins=100)

And here are the two resulting histograms with overlaid additional information (code not shown):
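That overlay code isn’t shown here either, but a minimal sketch of how such markers could be drawn on the shared axes from above looks like this:

# Sketch only: overlay summary statistics on the histogram axes
ax.axvline(prices.mean(), linestyle='--', color='red', label='mean asking price')
ax.axvline(paid_prices.median(), linestyle=':', color='green', label='median paid price')
ax.legend()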

As you can see, the most common asking price is about $1.15 (mean ~$3.9). However, customers often prefer to chip in a few extra bucks for their purchase (mean ~$5, median ~$3.8). ‘The cheaper, the better’ rule doesn’t apply here.

Based on this insight, we may assume that choosing to price our product around $4 will do the job.

Q2: How does pricing affect sellability?

Pricing is probably the most important factor in the customer’s decision-making process. Sellers often assume that a high price may discourage consumers from buying their products. Striking the right balance between profitability and affordability can become a challenge.

Let’s check how the number of sold items and daily earnings match the unit price (as a daily average):

# The daily earnings vs price:
sns.lmplot(
    x='sell_price',
    y='Daily_earnings',
    data=df[['sell_price', 'Daily_earnings']]
)
# Plot the sales frequency vs price:
sns.lmplot(
    x='sell_price',
    y='Mean_daily_sold_count',
    data=df[['sell_price', 'Mean_daily_sold_count']]
)

As expected, higher prices mean fewer sales on average. But when we look at the daily income, it seems that profits tend to increase with higher prices.

At this point, it would be interesting to find out whether the pricing reflects the quality and/or reputation of the product. Unfortunately, that’s beyond the scope of this study.

Q3. What are the most popular screen protector brands?

Let’s take a closer look at brand names. There’s a spectrum of values referring to “no brand.”

So let’s clean this mess up first and label them all as “unbranded”:

unbranded_values = [
    'Does not apply',
    'Does Not Apply',
    'Unbranded/Generic',
    'unbranded'
]
df['brand'] = df['brand'].apply(
    lambda s: 'Unbranded' if s in unbranded_values else s
)

Now I can put the data to work and draw a pie chart of brand names. It shows that the majority of the products offered on this marketplace (c. 60 percent) have no branding at all (or don’t declare one).
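The pie chart code isn’t shown in the original, but a quick sketch of how it could be produced:

# Sketch: share of offers per brand, lumping everything outside the top 5 together
brand_counts = df['brand'].value_counts()
top_brands = brand_counts.head(5).copy()
top_brands['Other'] = brand_counts.iloc[5:].sum()
top_brands.plot.pie(autopct='%1.0f%%', figsize=(6, 6))
plt.ylabel('')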

Consumers may want to stick to a recognizable name, so let’s ignore the unnamed products for a moment and focus instead on the top 20 brands offered in the marketplace that sell the highest number of products daily.

For this, I will use our sale_history table with all the transactions data.

Let’s create a table with information about the brands offered on the marketplace:

sold_brands = sale_history\
    .groupby('brand')\
    .apply(calculate_sale_history_stats)

Next, let’s see the top 20 brands that have recorded the highest number of sold items so far — create a table and plot it like this:

top_sold_brands = sold_brands.sort_values(
    by=['Sold_count', 'Daily_earnings', 'Mean_daily_sold_count'],
    ascending=False
).reset_index()
sns.barplot(
    data=top_sold_brands.iloc[1:21],  # row 0 is 'Unbranded'; keep the next 20
    x='brand',
    y='Sold_count')

One glance is enough to see that all the unnamed brands combined have accumulated the highest number of items sold. However, Spigen is the runner-up and clearly dominates among the named brands.

Q4. Which brands are the most profitable?

Which screen protector brands brought sellers the highest revenue in the shortest time? Let’s bring the unbranded products back to the table as the picture might turn out slightly different when it comes to earnings:

most_profitable_brands = sold_brands.sort_values(
    by=['Daily_earnings', 'Mean_daily_sold_count', 'Sold_count'],
    ascending=False
).reset_index()
most_profitable_brands = most_profitable_brands[[
    'brand', 'Daily_earnings',
    'Daily_earnings_St.Dev', 'Sold_count',
    'Mean_daily_sold_count', 'Sold_count_St.Dev']]

I get this table:

And let’s visualize it on a bar chart. The first few lines below are a sketch that prepares the x, y, and y_err inputs from the most_profitable_brands table built above:

top = most_profitable_brands.head(20)  # assumed slice, for readability
x = top['brand']
y = top['Daily_earnings']
y_err = top['Daily_earnings_St.Dev']
plt.bar(x, y, width=0.85, yerr=y_err,
        alpha=0.7, color='darkgrey',
        ecolor='black')
plt.xticks(rotation=90)

That should create the following graph:

It’s clear now that unbranded products aren’t all that profitable. Note that the order of brands changes a lot when we compare revenue instead of the total number of items sold. Brands with mid-range prices (like PureGear, at about $9.50) are now at the top of the list, despite their relatively infrequent sales (c. 1–2 items a day).

Answering these 4 questions showed that ‘quality over quantity’ is probably the smartest way to craft a sales strategy for online marketplaces.

In this article, I focused on taking you through the data mining and preparation process to finally answer four key questions about sales trends on the online marketplace I chose for this study.

Here are some other questions I could answer from the data:

  • Do shipping costs impact customer decisions?
  • How many sellers discount their products?
  • Do discounts result in higher sales?
  • Does it make any difference to customers where they purchase products from?
  • How does the seller feedback score affect sales?
  • Does having a Top Rated badge boost product sales?

Luckily, I’ve already done that. You can get all the answers in my study How to increase sales on online marketplaces with data science. Download it here for free.

Originally published at https://sunscrapers.com on March 16, 2020.
