It’s an exciting time at Best Buy Canada: with the holidays upon us, we’ve officially entered busy season. That means stress testing the website, finding Black Friday deals, and lots of extra orders to ship.
Meanwhile, the Digital Intelligence team is awaiting the greatest gift of all: our new cloud data warehouse! Our Data Engineering elves have been working hard on this and we’re excited to build bigger, better and faster than ever. But our team is an impatient bunch and we’ve been itching to stretch our Machine Learning muscles, so this year we’ve been building out our ML toolset.
Unfortunately, this meant working with legacy systems that can be hard to access and don’t always play together nicely. Throughout the process we’ve learned a lot about building actionable and scalable data products. So if you’re looking to get started with machine learning without all the fancy tools, here’s how.
The Project: Smarter Pricing
Bestbuy.ca has tens of thousands of products online – millions when you include our marketplace. Our merchants rely on their extensive experience and expertise to secure the best prices for our customers. However, with such a large catalogue to manage, it’s impossible for them to give equal attention to all our products.
To help with this, we were tasked with finding an automated, data-driven approach for pricing a selection of our online inventory. Our solution was to forecast expected sales at various price points, with an additional layer to find the optimal price based on business requirements.
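As a rough sketch (not our production code; the names demand_model, product_features and unit_cost are purely illustrative), the pricing layer looks something like this: forecast demand at each candidate price, apply the business rules, and keep the best option.

```python
import numpy as np

def choose_price(demand_model, product_features, candidate_prices,
                 unit_cost, min_margin_pct=0.05):
    """Pick the revenue-maximising price that still meets a margin floor."""
    best_price, best_revenue = None, -np.inf
    for price in candidate_prices:
        # Forecast expected units sold at this price point.
        units = demand_model.predict([product_features + [price]])[0]
        revenue = price * units
        # Business rule: skip prices that fall below the minimum margin.
        if (price - unit_cost) / price < min_margin_pct:
            continue
        if revenue > best_revenue:
            best_price, best_revenue = price, revenue
    return best_price
```

In practice the business-rules layer was richer than a single margin floor, but the overall shape was the same.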
Start Small
Our first step was gathering input data, which we soon realized would be harder than planned. While some data was easily accessible, other key information was hidden behind APIs and in backend systems that were hard to reach.
It can be tempting to spend lots of time perfecting each part of your model before moving on. However, if we wanted to deliver value incrementally and keep our stakeholders happy while building a useful model, we had to carefully consider our priorities. When collating our input data, we could have:
- Exhaustively searched for the locations of all relevant data
- Set up secure connections to multiple data sources
- Built ETLs for uploading it to the cloud (since we run our notebooks in Azure Machine Learning)
But this could have taken months! While cleaner data pipelines are always appreciated within the analyst community, our business stakeholders were looking for more tangible outcomes. To split our data into ‘now’ and ‘later’ groups, we asked:
- How accessible is it?
- What value does it bring?
- Is there a simpler alternative that’s almost as good?
For example, our core sales data was vital to our model and kept in an easy-to-reach database: importing it was a no-brainer. Meanwhile, accessing our full pricing archive required developer help and IT support, which might have taken some time. Instead of waiting, we inferred product prices from recent sales. This wasn’t ideal – it could be inaccurate for products with sparse sales – but it would be good enough to start modelling our most important items.
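As a rough illustration, the stand-in price can be computed with a few lines of pandas; the table and column names here are hypothetical:

```python
import pandas as pd

# Hypothetical sales extract: one row per order line.
sales = pd.read_csv("recent_sales.csv", parse_dates=["order_date"])

# Use the most recent selling price per product as a stand-in for list price.
# Caveat: for products with sparse sales, this can be stale or misleading.
inferred_prices = (
    sales.sort_values("order_date")
         .groupby("product_id")["selling_price"]
         .last()
         .rename("inferred_price")
)
```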
Taking a flexible approach let us start building exploratory models in no time. Sure, we were missing some things that would prove to be important later, but remember the Pareto Principle: we now had the 20% of our data that would provide 80% of the results.
Iterate, Iterate, Iterate
OK, so you’ve started small and built your MVP. It’s easy to be overwhelmed at this stage: you need to improve model accuracy, bring in new data, and fix those random bugs that somehow keep appearing. To keep making progress, it’s important to break things down into small, manageable chunks and iterate on your work. Here are some examples of how we did this:
1) Building throwaway models
Although our long-term solution was ML-based, we took the opportunity to experiment with other approaches and build upon the results:
Iteration 1: Used simple heuristics to set prices (e.g. cost + X%, lowest recent selling price).
Iteration 2: Introduced a gradient-boosted decision tree (GBT), with outputs reviewed by analysts and merchants.
Iteration 3: Automated the GBT training schedule, built in business rule adjustments, and took full control of pricing for select products.
Our early models weren’t very accurate but that didn’t matter: delivering results early gave us useful context for later iterations – vendor contracts can get pretty complex, you know – and proved the value of dynamic pricing.
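For a sense of the gap between Iterations 1 and 2, here’s a compressed sketch; the file names and parameters are illustrative, and the GBT assumes the features have already been encoded as numbers:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Iteration 1: a throwaway heuristic, e.g. cost plus a fixed markup.
def cost_plus_price(unit_cost, markup_pct=0.15):
    return round(unit_cost * (1 + markup_pct), 2)

# Iteration 2: a gradient-boosted tree trained on historical sales.
train = pd.read_parquet("training_features.parquet")
features = [c for c in train.columns if c not in ("units_sold", "product_id")]

gbt = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05)
gbt.fit(train[features], train["units_sold"])

# Predictions were reviewed by analysts and merchants before anything went live.
predicted_units = gbt.predict(train[features])
```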
2) Incorporating new datasets
After starting small we had plenty of new data that could improve our model. Before investing time in setting up new ETLs, we followed the same pattern to measure their value.
Iteration 1: Manually upload new dataset (CSV format, natch) and retrain model, comparing performance to previous versions.
Iteration 2: If useful, try using existing tooling to automate the export. Protip: Azure Logic Apps are a great low-code starting point.
Iteration 3: Transition to Azure Data Factory when greater complexity or reliability is needed.
Best Buy Canada already uses Microsoft Azure, so we could lean on these tools under our existing subscription. The same iterative principles apply regardless of platform, though: even if you’re running everything on your local machine (where your setup is probably easier to manage!), you might start with manual downloads and hard-coded file paths before automating data retrieval and cataloguing later, as in the sketch below.
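Here’s what Iteration 1 might look like for a new dataset, assuming numeric features and hypothetical file names: load the manually exported CSV, bolt it onto the existing training table, retrain, and check whether a holdout metric actually improves.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

def holdout_mae(df, target="units_sold"):
    """Train on 80% of the rows and report MAE on the remaining 20%."""
    features = [c for c in df.columns if c not in (target, "product_id")]
    X_train, X_test, y_train, y_test = train_test_split(
        df[features].fillna(0), df[target], test_size=0.2, random_state=42
    )
    model = GradientBoostingRegressor().fit(X_train, y_train)
    return mean_absolute_error(y_test, model.predict(X_test))

# Iteration 1: manually exported CSV, hard-coded paths and all.
base = pd.read_parquet("training_features.parquet")
new_data = pd.read_csv("new_dataset.csv")
candidate = base.merge(new_data, on="product_id", how="left")

# Only invest in a proper ETL if the new data actually moves the metric.
print("baseline MAE:", holdout_mae(base))
print("with new data:", holdout_mae(candidate))
```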
3) Improving text processing
Product titles and descriptions can contain valuable data when handled correctly. This isn’t always easy, and our approach changed over time; a simplified sketch follows the list below.
Iteration 1: Used One-Hot Encoding as the simplest approach; strings were now usable, but the wide, sparse matrix caused long training times.
Iteration 2: Switched to Multiple Correspondence Analysis from the Prince library, reducing dimensionality and training time.
Iteration 3: Improved accuracy by experimenting with Target Encoding.
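Our pipeline used the Prince library for MCA and a proper target encoder, but the gist of Iterations 1 and 3 can be shown with plain pandas on toy data (in practice, target statistics should be computed out-of-fold to avoid leakage):

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["tv", "tv", "laptop", "audio", "laptop"],
    "units_sold": [3, 5, 2, 8, 1],
})

# Iteration 1: one-hot encoding. Simple, but wide and sparse for a big catalogue.
one_hot = pd.get_dummies(df["category"], prefix="cat")

# Iteration 3 (simplified): smoothed target/mean encoding in a single column.
prior = df["units_sold"].mean()
stats = df.groupby("category")["units_sold"].agg(["mean", "count"])
smoothing = 10  # hypothetical smoothing weight
encoded = (stats["count"] * stats["mean"] + smoothing * prior) / (stats["count"] + smoothing)
df["category_encoded"] = df["category"].map(encoded)
```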
You’ll refine the useful parts of your model and discard dead ends as you continue to experiment. A pattern of small iterations helps minimize wasted effort and provides a steady flow of updates – great for morale and keeping stakeholders happy.
Track Your Changes
The final point is short but sweet: use version control. When you move fast, things will break. If your previously perfect pipeline starts returning strange errors, it’s helpful to see what changed and when. You’ll have a much easier time finding and squashing bugs with a full history available.
Git has long been a standard tool in software engineering and is increasingly being adopted by modern analytics teams, and Learn Git Branching is a great place to start if you’ve never used it before. If you can’t commit to that (pun definitely intended), simply saving scripts with date suffixes will work for small projects. One final tip: if you use Jupyter Notebooks, Jupytext will automatically create plain Python files that are much easier to compare.
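As a small illustration (file names are placeholders), Jupytext can be driven from Python as well as from its CLI or notebook pairing config:

```python
import jupytext

# Read the notebook and write a paired .py file in the "percent" format,
# which diffs far more cleanly in git than raw .ipynb JSON.
notebook = jupytext.read("pricing_model.ipynb")
jupytext.write(notebook, "pricing_model.py", fmt="py:percent")
```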
In Conclusion
Bringing a machine learning model to production is no small undertaking, especially when dealing with legacy systems. By breaking things down and starting small, you can narrow your scope and focus on providing immediate value. Delivering early outputs grows your understanding of the problem area and provides a foundation for iterative building. And finally, tracking your changes will help when you encounter unexpected results or tricky bugs. Try this approach on your next project – even if you do have fancy tools available – and you might be surprised by how quickly you make progress.
P.S. Best Buy Canada is hiring! The Digital Intelligence team is looking for Data Analysts and you can apply here.