Data Science in Inventory Management: Real case in managing a warehouse

In this post, we will know how to manage a warehouse with the most efficiency budgeting

Nhan Tran
Towards Data Science

--

Hello and welcome back to another story about Data Science! This is a real case example, not just a theory or academically post so we will do a testing experiment with problem in real life. So in case you have any concern, let’s drop a comment and we can discuss together!

Think about the problem

Alright, let’s think about 2 shopping cases that you have experienced when you go shopping. Someday you go to the shoes shop and decided to buy a new pair of shoes, and unfortunately, you got problem:

  1. You picked up a good model, but they don’t have any left that fit your foot size
  2. They offered another model which fit your foot size, but you don’t like it

what a bad day right?

As a shop manager, they are not happy too. They lost a customer — it’s you. And if this is the first time you visited the store, will you come back? Of course… “maybe”. But if this is the second time you got the same problem with this store, will you come back? It’s absolutely a shop manager’s nightmare.

Find a way

So how to avoid this problem if you are playing a role as shop manager? Of course, we cannot because we cannot control how many customer has the same style and the same size will visit your store at the same day. It depends on our luck!

Then we close the case, give up on this problem and try to explain to our big boss like “okay I’m sorry, it’s my bad luck”?

Absolutely no! We have enough statistics tool to solve the problem. But remember, we cannot make sure every customer will have their shoes BUT we can increase the chance that customer will find their favorite shoes at our store using Data Science (so don’t mess with our luck).

Check the historical data

Before making plan for future, we need to examine the history in the past — remember this! So first step, we need to meet the Sales Department to get the sales history of few years in the past. And this is the data we got from Sales Department (you can download here)

Al Bundy raw data

Examine and Cleaning up the data

Let’s examine our fields in the data set:

InvoiceNo, ProductID, Year, Month: they contain only number, but they’re not a numerical field, they’re categorical field

Date, Country: categorical field. We already have Date field which have the meaning of both Year, and Month fields, so we should keep Date and exclude Year, and Month out of our data set

Shop: absolutely a categorical field grouped by Country ID (US, UK, CAN, GER)

Size (US), Size (Europe), Size (UK): categorical field. But let’s think twice! 3 fields has the same meaning because we can convert Size (US) to Size (Europe) and Size (UK) and of course, vice versa using this conversion chart. So in this case, we should keep only 1 field as a representative. I will pick Size (US) because it looks simpler than Europe one and has less floating number than UK. Please remember that the size of Male and Female are different, so do not using Size field without Gender field or you will make a serious mistake!

Gender: another basic categorical field

UnitPrice, Discount, SalePrice: numerical field. And before doing any calculation, I suggest we should convert the Discount field which contain percentage values into float for easier calculation. If you are a clever one, you can see the hidden equation between these field which is

SalePrice = UnitPrice * (1 — Discount)

…so we should keep only SalePrice for our data set

After doing this step, we will have a much more simpler data set:

Cleaning up the data set

Analyze the data

The data set is ready to be analyzed! Now we have to count how many items sold in previous period (could be year, or month…). In data analysis, counting the appearance of single object is called “frequency”. To count the frequency, in this case, we will segment the data set by:

  • Country
  • Size
  • Gender (as I mentioned before, Size could not be used without Gender, so we have to include the Gender too)

Our crosstable will have 2 dimensions only but we already have 3 dimensions here so we have to split the table into 2 different one by Gender. This is the result:

Sizes distributed by country

Define the problem

  1. The problem: Based on our observation, what is the number of shoes (model, size) that are likely to be sold?
  2. The condition: We will find the answer of our problem with 95% confidence interval
  3. The plan:
    - Use the last 12 months data set
    - Use men shoes data set
    - Use US data set

This step, you will think about why do we use men shoes and US data set only. Let’s think: if a man visited the store, would a woman come? We cannot answer, because gender in this case is identical. It doesn’t have any reflection together. In another hand, for instance, if a kid visited the store, there is a high chance that his parents come too, because that kid cannot go shopping alone and using money by himself. But for our case, if a man visited the store, he could come alone, or with his bros, or with his girlfriend, or even a group of his friends. There is no connection between the chance of a male visitor and a female visitor. That’s how we call it identical.

Of course, the Country is the same with Gender, they are both identical.

Then why 12 months data set? Because it’s enough the represent a full cycle of a year. Shopping behavior mostly bases on a full year cycle such as: year-end sale off, Black Friday, company’s promotion on its anniversary, people likely to change their shoes when winter comes (to buy warmer shoes)…

Men shoes sold in 2016 distributed by Size and Month

Put the data to heatmap, let’s see the hidden message from our data set

Men shoes sold in 2016 distributed by Size and Month — Heatmap on Tableau

Now we can easily see the hidden message:

  • Too small sizes (6 to 7.5) and too big sizes (13 to 16) is not popular. So the chance that a man visits our store and buys a pair of shoes on these sizes is quite low, especially the oversize 15, and 16 — we have no unit sold for this size whole year!
  • Average sizes (9 to 10.5) seems very good on unit sold. US man has foot size of 9.5 seems out number of other sizes. Focus to produce and store this size, we have great chance to sell shoes and get money!

But this is just a high level of examination. Keep diving deeper to reveal the entire of hidden messages our data set hiding.

Calculation

Because we have 17 different shoes sizes (men only — based on our plan), we need to calculate 17 different CI (confidence intervals). Firstly let’s calculate the means with Microsoft Excel using the =average(number1, [number2]...) function (in this post, I will not mention much about programming. Just stick with KISS principal)

Men shoes, US 2016 & Mean

We do not know the population variance and our data set consists of only 12 observations (represented by 12 months of 2016), then we have to use the T-Statistic.

Let’s find the value in T-Statistic Table for 95% Confidence Interval with 11 Degrees of Freedom!

Let’s explain a bit in case you concern:

  • 11 is the Degree of Freedom of 12 months calculated using df = n— 1 (with n=12)
  • 0.025 is the Significant Level of 95% Confidence Interval calculated using SL = (1 - CI) / 2 (prefer image below)
Probability Density 2 tails

Alright! Look at the T-Statistic Table below and we can find out the T-Value is 2.201 (right at the cross point between red column and blue row)

T-Statistic Table of 95%CI (0.05%SL) and 11DF
T-Statistic Value

Now, let’s calculate Standard Errors and Margin Errors!

In Microsoft Excel, you can calculate Standard Errors by using this formula:

=STDEV.S(number1, [number2]...)/SQRT(n)

with:

  • number1, number2… are number of shoes was sold
  • n = 12 (months)

And Margin Errors = Standard Errors * t(11, 0.025)

Mean, SE, and ME

Last one, calculate the Confidence Interval

Confidence Interval formula

I know that you are getting headache with above formula but wait! We already have the Mean and Margin Errors, right? Then we can calculate the CI as (Mean — ME; Mean + ME)

Mean, SE, ME, and CI(95%)

In 95% of the cases, the true population mean of sales for each shoes will fall into the respective interval. The ceiling values (upper bound, or the higher values) of the CI shows us the maximum number of shoes needed. And vice versa for the flooring values as they are the minimum number of shoes was sold. As we don’t want to be low in stock a possible solution to the problem is get as many pairs of shoes as the closest number to the ceiling limit of the Confidence Interval to maximum selling possibility. And of course, we need to store more than the flooring limit of the CI to ensure all customers (who are men, and have particular shoes size) can buy their favorite shoes instead of leaving our store without buying because the items are out of stocks.

Let’s round up the CI and see:

Conclusions based on sample data set

Did we solved the problem?

Nope! Don’t blame on me

  • Alright. This is the conclusion for the year of 2016 but now is 2019, so can we use it? Absolutely yes! We can calculate for 2017, 2018 and then get the Mean of them and give the prediction for 2019
  • The prediction for 2019 can be incorrect due to other factors such as: fashion trend this year is changed, people like using sneakers more than shoes; or there is another shoes store just opened next to us. But this post ONLY analyze on historical data
  • This post we used data of 2016, in The US, and analyzed for man sizes only. So if you’re a store manager, you have to do the same analyze for women to get a bigger picture of your store

Extension! Let’s deep dive a bit more

I wonder do you notice the number of 105 and 175 in the bottom on previous image of “Conclusions based on sample data set”. What does it mean?

  • 105: the minimum number of pairs of shoes was sold. So as a store manager, we have to prepare big inventory enough to store this number of shoes or we will run out of stock
  • 175: a big inventory is good, but it’s no need to have a huge inventory! Big enough to store 175 pairs is enough. Remember, the bigger inventory you have, the more expensive you pay for management

What do you do for size US 16? If your eyes are keen enough, you can see there was no product sold with this size. Let’s check the data for 2017, and 2018 as well. If no item was sold, you need to stop producing or importing this size

What if your inventory is not big enough to store the minimum number of shoes? No worry. This is the number of sold items for whole year. Let’s check the Heatmap on Tableau on the top of this post, you can see the distribution during 12 months. Convert each number (per month) to frequency — how many percentage that month reflects to whole year. You can figure out which month we sold the most, double check the inventory, prepare the importing plan, then you have zero worry facing with the trend of supply and demand

Conclusion

  1. Data analyzing cannot help you AVOID the problem, but help you REFLECT the historical data to predict the future trend
  2. In data science as well as statistics, we DON’T ENSURE anything, we just give the conclusion base on how much CONFIDENCE (this post is 95%)
  3. And the last one, you can increase your CONFIDENCE by analyzing the historical numbers frequently. Keep it update yearly, or monthly, or weekly. The more frequently you update your data, the more accuracy you can predict

Once again, thanks for your reading time.

Happy learning everyday!

--

--