5 Simple Questions to Find Data for a Machine Learning Project

Find the data you need to kick off a machine learning project by considering what data is available and what business need it satisfies.

Cory Randolph
Towards Data Science

Photo by Markus Winkler on Unsplash

Many businesses want innovation, and machine learning often comes up as a suggestion. But how can you find the data you need to try out a machine learning project? Below are 5 straightforward questions that I have found useful for finding data for new machine learning projects.

5 Questions to Ask When Finding Data for an ML Project:

1) What enterprise systems are being used where you work?

The very use of an enterprise system implies that the tool is providing business value, and data is likely being stored within that system. Even small to medium-sized companies usually have a few enterprise tools, and these systems are usually built with a relational database behind the scenes. In many cases the raw data is accessible with relatively simple SQL (Structured Query Language) or through front-end reports that can pull out the needed data.
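To make "relatively simple SQL" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The work_orders table, its columns, and the sample rows are all hypothetical stand-ins; a real enterprise system will have its own database driver, connection details, and schema.

```python
import sqlite3

# Hypothetical stand-in for an enterprise system's database. A real system
# would use its own driver, credentials, and table names.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE work_orders ("
    "id INTEGER PRIMARY KEY, work_type TEXT, cost REAL, labor_hours REAL)"
)
conn.executemany(
    "INSERT INTO work_orders (work_type, cost, labor_hours) VALUES (?, ?, ?)",
    [
        ("preventive", 120.0, 2.0),
        ("preventive", 80.0, 1.5),
        ("corrective", 450.0, 6.0),
    ],
)


def summarize_by_work_type(connection):
    """Return (work_type, order_count, total_cost) rows, sorted by work type."""
    query = """
        SELECT work_type, COUNT(*) AS order_count, SUM(cost) AS total_cost
        FROM work_orders
        GROUP BY work_type
        ORDER BY work_type
    """
    return connection.execute(query).fetchall()


summary = summarize_by_work_type(conn)
for row in summary:
    print(row)
```

A query like this is often enough to see what data exists and how much of it there is before committing to a project.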

2) When senior leadership is looking at data to help make decisions, where is that data coming from in the first place?

Another way to phrase the first question is to think about the information that senior leadership cares about and where it comes from. Chances are it will lead you back to the enterprise system mentioned earlier, or it could lead you to a spreadsheet that holds useful data. Regardless of the source (relational database, spreadsheet, etc.), you are at least in a good starting place for exploring questions that are important to the business.

3) Is there some sort of quality assurance process surrounding the data?

If the answer is no, that is not necessarily a reason to abandon the project; it just means more up-front data cleaning may be required.

4) What types of processes are routinely repeated?

Businesses are usually built around a few processes or products that have been refined through iteration, which means there is likely some historical data that could be leveraged for a data analysis or machine learning project. For example, most of my work has been in the maintenance industry, where preventative measures are routinely carried out to keep systems and equipment functioning well. Each task is captured in a work order that has multiple pieces of information (or features, in machine learning terminology) such as date, duration, cost, labor hours, work type, etc. Even just a few hundred historical work orders may be enough data to train a machine learning model that helps optimize a maintenance program by predicting relevant information, such as which work orders will become past due.
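As a rough illustration of how historical work orders become training data, the sketch below turns each record into a set of features and a past-due label. The WorkOrder fields and the sample records are hypothetical; an actual work order system will expose different columns.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class WorkOrder:
    # Hypothetical fields; real work order systems will differ.
    work_type: str
    due_date: date
    completed_date: date
    cost: float
    labor_hours: float


def to_features_and_label(order):
    """Turn one work order into (features, label); label is 1 if past due."""
    features = {
        "work_type": order.work_type,
        "cost": order.cost,
        "labor_hours": order.labor_hours,
        "due_month": order.due_date.month,
    }
    # A work order counts as past due when it was completed after its due date.
    label = int(order.completed_date > order.due_date)
    return features, label


orders = [
    WorkOrder("preventive", date(2021, 3, 1), date(2021, 2, 27), 120.0, 2.0),
    WorkOrder("corrective", date(2021, 3, 5), date(2021, 3, 9), 450.0, 6.0),
]
dataset = [to_features_and_label(order) for order in orders]
labels = [label for _, label in dataset]
print(labels)
```

When building a real model, it is worth checking which fields are actually known before the work is done. Values such as final cost may only be available afterward, so they may not be usable for prediction.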

5) Can you use financial data?

Financial data is a great place to start a machine learning or data analysis project because it usually satisfies all of the questions above. First, financial data has many built-in data quality assurances, since mistakes there can have large negative impacts on the business. Additionally, the historical records are usually accurate, with time stamps that allow for grouping data based on different use cases. Financial data is also very likely to be something that senior leadership is invested in.

If you’re looking for a place to start, here are three potential use cases for finance data with machine learning:

  • Estimating next year’s budget based on historical information (regression).
  • Predicting useful categories such as late payments or budget overruns (classification).
  • Identifying patterns in expenditures that are outside the norm (anomaly detection).
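To give a feel for the third use case, here is a minimal anomaly detection sketch using only Python's standard library. It flags values with a large modified z-score, which is based on the median and the median absolute deviation. This is one simple technique among many, not a recommendation from the use cases above. The expense figures are made up, and the 0.6745 constant and 3.5 threshold follow a commonly used convention rather than anything specific to finance data.

```python
import statistics


def flag_anomalies(amounts, threshold=3.5):
    """Return the amounts whose modified z-score exceeds the threshold."""
    median = statistics.median(amounts)
    # Median absolute deviation: a spread measure that outliers barely affect.
    mad = statistics.median([abs(a - median) for a in amounts])
    if mad == 0:
        # No spread around the median, so this method cannot rank outliers.
        return []
    return [a for a in amounts if 0.6745 * abs(a - median) / mad > threshold]


# Made-up monthly expenses with one unusual month.
monthly_expenses = [1020.0, 980.0, 1005.0, 995.0, 1010.0, 4800.0, 990.0]
anomalies = flag_anomalies(monthly_expenses)
print(anomalies)
```

Using the median instead of the mean keeps a single extreme value from hiding itself by inflating the average and the spread.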

Answering these five questions can be a great start to determine exactly where to begin with a machine learning project. As a bonus tip: Whatever project you land on, call it a Machine Learning “Pilot” project so that the goal remains innovation and exploration. This helps set realistic expectations for yourself, your team, and your higher-ups as you develop these sorts of early-stage Machine Learning initiatives. Even if no one on your team is doing Machine Learning, with these 5 questions to guide you, the opportunity to try something new and innovative is not so out of reach.

Enjoy applying machine learning to real business problems and helping others learn to do the same.