
How to structure business problems for data science solutions

Half the battle in a successful data science project can be expressing the problem in a way that ensures an optimal data-driven solution…

That lightbulb moment (image via Shutterstock under license to Whitehat Analytics)

Half the battle in a successful data science project can be expressing the problem in a way that ensures an optimal data-driven solution, with a clear set of realistic, achievable objectives. Problem structuring involves understanding four key questions:

  1. What business problem do we want to solve (in a precise, well-defined way)? This should be driven largely by operations, finance or other expert staff rather than the technology team. What exactly will be the commercial benefit of solving this problem?
  2. What data is available that might help solve this problem?
  3. How exactly can we use point 2 to solve point 1? How much pre-processing will be required, what techniques and technologies will be used, and what timelines, resources and budgets are required?
  4. Do these plans fit in with the rest of the business, and are they aligned with your overall data science strategy? If you have properly addressed the first three points, this should be a yes, but it is always worth this final check.

It is at points 3 and 4 that seemingly well-structured data projects often become unstuck. A granular analysis at this stage can save much subsequent hair-tearing and disappointment.

How do I turn my business problem into a data science problem?

When problem structuring, it can help to have a library of generic data-driven solutions to generic business problems. This allows us to pull ideas down from the shelf, see how we might best approach the problem – and gives a view of the common pitfalls.

While not every problem is categorisable in this way, a surprising number are. Four examples of common data science problems include:

  1. Customer-centric data science problems

As many as half of all commercial data science projects fall into this category. Retail businesses and marketing or advertising firms are often particularly interested in these problems, but almost all businesses would benefit from knowing and understanding their customers in more detail and with greater sophistication. Objectives frequently include:

– increasing revenue by improving product recommendations

– upselling

– cross-selling

– reducing churn and improving retention rates

– personalising the user experience

– improving targeted marketing

– sentiment analysis

– product or service personalisation

– pricing optimisation.

The common element is that all these goals hinge on a successful understanding of your customers’ needs, motivations, likes and dislikes, using all available data.
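As a concrete illustration of the product-recommendation objective above, here is a minimal sketch of item-based recommendation using co-purchase counts. The basket data and product names are hypothetical; a production system would use far richer signals, but the core idea – recommend what is most often bought alongside a product – is the same.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories: one set of product IDs per customer.
baskets = [
    {"coffee", "milk", "sugar"},
    {"coffee", "milk"},
    {"bread", "milk"},
    {"coffee", "sugar"},
]

# Count how often each pair of products is bought together.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

def recommend(product, top_n=3):
    """Rank products most frequently co-purchased with `product`."""
    scores = Counter()
    for (a, b), count in pair_counts.items():
        if a == product:
            scores[b] += count
        elif b == product:
            scores[a] += count
    return [item for item, _ in scores.most_common(top_n)]

print(recommend("coffee"))  # milk and sugar co-occur most often
```

The same counting logic underpins upselling and cross-selling: the "score" simply becomes an input to whichever channel surfaces the suggestion.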

  2. Optimisation problems

These are problems that can be characterised as maximising or minimising factors such as costs, revenues, risks, time or pollution, within a well-defined quantitative framework and with a given set of constraints.

We solve these problems by modelling them as graphs or networks and solving them heuristically using specialised algorithms. Typically, this is complex because solutions are ‘path dependent’ – i.e. where you can move next depends on where you are now.

Frequently seen examples are supply chain optimisation, logistics and transportation (e.g. delivery routing), finance (e.g. minimising the risk in an investment portfolio for a given target return), and scheduling (e.g. a retailer optimising staffing levels per store within a shift pattern of work, or an airline wishing to optimise its route network).
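To make the ‘path dependent’ point concrete, here is a minimal sketch of a classic routing heuristic: greedy nearest-neighbour ordering of delivery stops. The stop names and coordinates are invented for illustration; real routing engines use road networks and much stronger algorithms, but the structure – each choice constrains the choices that remain – is exactly the path dependence described above.

```python
import math

# Hypothetical depot and delivery stops as (x, y) coordinates.
stops = {
    "depot": (0, 0),
    "A": (2, 3),
    "B": (5, 1),
    "C": (1, 6),
}

def dist(p, q):
    return math.dist(stops[p], stops[q])

def nearest_neighbour_route(start="depot"):
    """Greedy heuristic: from the current stop, always visit the
    closest unvisited stop next. Path dependent: where you can go
    next depends on where you are now."""
    route = [start]
    unvisited = set(stops) - {start}
    while unvisited:
        current = route[-1]
        nxt = min(unvisited, key=lambda s: dist(current, s))
        route.append(nxt)
        unvisited.remove(nxt)
    return route

print(nearest_neighbour_route())  # ['depot', 'A', 'C', 'B']
```

Nearest-neighbour gives no optimality guarantee – it is a cheap baseline that better heuristics (and exact solvers, where the problem is small enough) are measured against.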

  3. Demand Prediction

Demand forecasting is often undertaken as a top-down process involving estimation of demand by product line or business unit based on historical aggregate demand, with some correction for obvious exogenous variables such as the weather.

Data science can be used to turn this process on its head and estimate demand from the bottom up using a range of data sources, including consumer data, macroeconomic data and other open data. We may be able to estimate demand on a per store, per hour or per customer basis with much greater confidence. In instances where logistical constraints are material, this kind of granularity can be critical.
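The bottom-up idea above can be sketched in a few lines: estimate demand at the most granular level available (here, per store per hour, from a hypothetical transaction log), then roll the estimates up. Real forecasts would blend in the consumer, macroeconomic and open data sources mentioned above; this sketch only shows the direction of aggregation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical transaction log: (store, hour of day, units sold).
sales = [
    ("store_1", 9, 12), ("store_1", 9, 14), ("store_1", 17, 30),
    ("store_2", 9, 5),  ("store_2", 17, 8), ("store_2", 17, 10),
]

# Bottom-up: estimate expected demand per store per hour...
history = defaultdict(list)
for store, hour, units in sales:
    history[(store, hour)].append(units)

forecast = {key: mean(units) for key, units in history.items()}

# ...then roll the granular estimates up to an aggregate figure,
# rather than starting from the aggregate and apportioning down.
total = sum(forecast.values())
print(forecast[("store_1", 9)])  # 13
print(total)
```

Because the granular estimates exist in their own right, they can feed logistics decisions (staffing, stock allocation) directly, which is where the granularity becomes critical.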

  4. Counter Fraud Analytics

While not a widespread use case, counter fraud can be among the most challenging data science problems for several reasons.

  • Firstly, fraudsters do not like to be caught and will change their methods in response to the success of your counter-fraud efforts. They will also change their behaviour when changes in the legal/regulatory framework create or remove opportunities for fraud. This means that counter-fraud analytics is chasing an ever-moving target.
  • Secondly, the counter-fraud analyst does not know the true extent of fraud, only the instances which have been caught. This makes statistical generalisation difficult, which in turn must be accounted for in the model building process.
  • Thirdly, and perhaps most importantly, fraud is (we hope) the quintessential needle in a haystack problem. 99.9% of banking transactions are not fraudulent. Therefore, the number of fraud data points from which to generalise is very small. This is, again, critical when considering statistical models.
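The needle-in-a-haystack point has a direct practical consequence: accuracy is a misleading metric under extreme class imbalance, and fraud models are instead judged on precision and recall. The toy data below is invented to show why – a classifier that never flags anything scores over 99% accuracy while catching no fraud at all.

```python
def precision_recall(y_true, y_pred):
    """Evaluate a fraud classifier (1 = fraud) on the metrics that
    matter when almost all transactions are legitimate."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy data: 98 legitimate transactions, 2 fraudulent ones.
y_true = [0] * 98 + [1, 1]
y_pred = [0] * 99 + [1]   # model flags one of the two fraud cases

print(precision_recall(y_true, y_pred))  # (1.0, 0.5)
```

Here the model is 99% accurate but has only 50% recall – it misses half the fraud, which is exactly the failure mode accuracy hides.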

Examples

Some of the work we have undertaken at Whitehat Analytics that conforms to these broad templates includes:

Supply chain risk optimisation

We developed a model and integrated front-end tool to enable large corporates to optimise risk exposures across a multi-layer supply chain, using a database of financial and commodity exposure between companies and cutting-edge linear programming (LP) solvers.

Counter-fraud analytics

A large ride-hailing company engaged us to build a risk scoring framework that incorporated a diverse range of data including vehicle, financial and app data as well as geolocation and demographic information to assess risk of fraud among their drivers.

Automated product matching

We worked with a leading global retailer to integrate and match product information such as metadata, product images and text descriptions using natural language processing (NLP) and deep learning techniques (convolutional neural networks) for image recognition. This allowed the client to optimise its supply chain by ensuring they could buy each product in their range at the lowest price.

Optimisation of real estate locations

We built an engine to optimise the real estate footprint of a large government department with more than 700 locations nationally, to minimise travel times for service users and employees whilst ensuring service standards and reducing office costs.

Conclusion

While project failure rates in data science can be high, successful problem structuring and project planning can substantially mitigate the risk. Identifying key objectives, deliverables, and data requirements, and realistically estimating dates and resources, will go a long way to ensuring that your business stakeholders and technology team are on the same page.

Finn Wheatley, Director of Data Science at Whitehat Analytics

  • Interested in finding out more about Whitehat Analytics and how we help companies scale and productionise data science capability?
  • Follow us on LinkedIn
  • Follow us on Twitter
  • Find out more about me
  • Follow me on Medium for more data science content
