Notes from Industry
Modern product goals, regardless of domain, rely heavily on algorithms executed by computers. Typical approaches adopt solutions based on heuristics, i.e., step-by-step instructions on how to finish a task. Often, such approaches are not robust enough to tackle real-world situations. When data representing those situations is available, Machine Learning (ML) is a strong alternative that finds a probabilistic solution by learning from that data.
It may be tempting to think any product goal can be formulated as an ML problem. However, due to its heavy reliance on pattern recognition, ML comes with uncertainties. Therefore, it is necessary to identify which parts of the product goal are good candidates for ML-based framing. For example, in finance, how much to declare for tax is not well suited to ML, since getting it wrong may be considered a felony; there are complex, yet fixed, guidelines for figuring out the tax amount. On the other hand, how much one may spend on junk food or clothing in the coming year can be an ideal candidate for ML: there are no fixed rules, and the answer can vary a lot depending on taste, availability, weather, global trends, etc. In this article, I will shed some light on the problem. In particular, I will share approaches for figuring out which tasks are solvable by ML, what the data requirements are, and which ML approaches fit different task types.
Human experts are typically very good at making wise decisions in simple scenarios with a small amount of data to process, such as finding the treatment for a typical infection or setting the weekly menu for a restaurant. In such cases, ML solutions are likely to perform much worse. ML solutions, on the other hand, tend to perform better across a broad range of scenarios involving large volumes of data, such as recommending movies across different genres or recommending prices for different products. While these factors sound like good reasons to launch an ML project, such a simple framework is not enough to determine the project's success. Below, I describe three steps that may simplify the estimation.
Step 1: Focus on the problem
You should start with a product goal that needs solving. Once that is settled, be open about the approach used to meet the goal; it does not have to be ML. Avoid chasing the next shiny thing that excites you and try to be objective. Sometimes the goal can be met with a simple heuristic extracted from previous experience. Sometimes an algorithm based on a set of complex, interrelated signals needs to be devised. Sometimes simple descriptive statistics will do the trick. And sometimes none of the above works, and the solution needs to be learned from data. For example, movie recommendations can be as simple as the most recent blockbusters, can come from a slightly more complex fanbase-driven genre analysis, can be based on what others in the same geographic region are watching, or can be based on viewing habits mapped onto complex movie data.
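To make the heuristic end of that spectrum concrete, here is a minimal sketch of the "most recent blockbusters" recommender. The movies DataFrame, its column names, and the cutoff year are hypothetical placeholders I invented for illustration, not a real catalog schema.

```python
import pandas as pd

# Hypothetical catalog; in practice this would come from your movie database.
movies = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "release_year": [2023, 2023, 2022, 2021, 2023],
    "box_office_musd": [450, 90, 700, 300, 120],
})

def recent_blockbusters(catalog: pd.DataFrame, since_year: int, top_n: int = 3) -> pd.DataFrame:
    """Heuristic recommender: the highest-grossing recent releases."""
    recent = catalog[catalog["release_year"] >= since_year]
    return recent.sort_values("box_office_musd", ascending=False).head(top_n)

print(recent_blockbusters(movies, since_year=2022))
```

No training and no data pipeline: if something this simple already meets the product goal, there is no need for ML.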
Once you are certain simple analytics will not cut it, frame the product goal as an ML problem and re-evaluate the framing repeatedly until you are certain it is simple enough. For example, if the problem is recommending movies, the framing can rely on some sort of association between movie metadata and what people have watched. While it may be tempting to come up with an extremely complex metadata mapping, there is usually a simple enough version that covers the common cases.
The framing should start broad and become narrower with every iteration. You can start by identifying whether the problem is supervised, where learning happens on known labels; semi-supervised, where learning happens on weak labels; or unsupervised, where learning happens without any labels. The same problem can often be framed in several of these ways. For example, identifying credit defaults may be based on known cases of previous defaults (supervised) or on outliers (unsupervised). If the problem is pursued as supervised learning, it can further be modeled as regression, classification, or another well-known supervised learning subcategory. For example, the credit default problem may be studied as a regression on the default amount or as a binary classification of defaulter versus non-defaulter.
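To make those two framings concrete, the sketch below derives both targets from the same records. The loans table and its columns are invented for illustration.

```python
import pandas as pd

# Hypothetical loan records; default_amount is 0 when the loan was repaid in full.
loans = pd.DataFrame({
    "income": [40_000, 85_000, 52_000, 31_000],
    "loan_amount": [10_000, 25_000, 12_000, 9_000],
    "default_amount": [0.0, 4_200.0, 0.0, 1_750.0],
})

features = loans[["income", "loan_amount"]]

# Framing 1: regression -- predict how much will be defaulted.
y_regression = loans["default_amount"]

# Framing 2: binary classification -- predict defaulter vs. non-defaulter.
y_classification = (loans["default_amount"] > 0).astype(int)
```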

As depicted in Figure 2, narrowing down to the appropriate modeling approach should primarily rely on how much accuracy can be achieved with the data at hand. If, for example, labels are missing, there is no point in pursuing supervised learning. If there is little data on default amounts, it is not practical to pursue the problem as regression. In many real-world cases, modeling can be complex and may involve dividing the goal into subtasks, each of which is then modeled as above.
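One way to let the data veto a framing, as described above, is a crude triage function like the sketch below. The thresholds (minimum label count, unique-value cutoff) are arbitrary placeholders you would tune to your own setting.

```python
import pandas as pd

def suggest_framing(df: pd.DataFrame, label_col: str, min_labeled: int = 1000) -> str:
    """Crude triage: let the available data veto candidate framings."""
    if label_col not in df.columns:
        return "no labels -> consider unsupervised (e.g., outlier detection)"
    labeled = df[label_col].notna().sum()
    if labeled < min_labeled:
        return "too few labels -> consider semi-supervised, or collect more labels"
    # Enough labels: a rich continuous target supports regression; otherwise classify.
    if pd.api.types.is_numeric_dtype(df[label_col]) and df[label_col].nunique() > 20:
        return "ample labels, continuous target -> regression is viable"
    return "ample labels, discrete target -> classification is viable"
```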
Step 2: Understand data requirements
Data is the lifeblood of an ML solution. The availability of relevant data can make or break the case for an ML solution. More importantly, the data needs to be clean. If it is not, you need to consider whether it is feasible to clean. If the answer is yes, you can proceed, but be aware that a significant amount of effort will go toward cleaning the data. If the answer is no, you may have to rethink the modeling approach.
For example, imagine the problem of understanding whether a set of products is liked or disliked by customers. If most of the feedback texts are spam, you may have to discard the idea of modeling sentiment from the feedback texts. If that is not the case, but the texts mix many languages, it may be too difficult to extract a meaningful representation of liking/disliking; in such a case, effort must go either toward cleaning/transforming the data or toward collecting more useful data. If, on the other hand, the feedback includes ratings on a numerical/ordinal/categorical scale, the modeling effort and the subsequent study of solution approaches will likely be quite efficient. It is also useful to identify signals that are missing from the data and treat them as features worth collecting.
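A minimal sketch of this triage, assuming a hypothetical feedback table and a deliberately naive spam check: if enough numeric ratings survive cleaning, they become the cheaper sentiment signal.

```python
import pandas as pd

# Hypothetical feedback; note the spam entry and the mixed-language entry.
feedback = pd.DataFrame({
    "text": ["Great product!", "BUY CHEAP PILLS NOW", "No funciona bien", "Love it"],
    "rating": [5, None, 2, 4],
})

def looks_like_spam(text: str) -> bool:
    # Deliberately naive placeholder; a real pipeline would use a trained filter.
    return text.isupper()

clean = feedback[~feedback["text"].map(looks_like_spam)]

# If most surviving rows carry numeric ratings, they are a far cheaper
# sentiment signal than free text.
if clean["rating"].notna().mean() > 0.8:
    sentiment = (clean["rating"] >= 4).astype(int)  # 1 = liked, 0 = disliked
    print(clean.assign(liked=sentiment))
```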
Step 3: Explore a robust, efficient baseline solution
Once the modeling category has been identified, it is time to think about solution approaches. There are often many state-of-the-art techniques to consider. For regression/classification problems, one can consider linear methods, ensemble methods, deep learning, etc. It is not a good idea to start with either the most complex one or the simplest one. Choose something robust that can cover the most prominent cases across many trials. Rule out methods that take a long time to compute or are difficult to interpret. If a method takes too long, it will slow down all subsequent iterations, which ultimately impedes improvement. If it is too difficult to interpret, it will be hard to identify improvements, which may ultimately force the solution onto a plateau.
Often, ensemble methods such as random forests and gradient boosted trees provide good baselines to start with. They produce decent results, are robust to variations, provide a certain level of interpretability, and execute rather quickly.
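Here is a minimal baseline sketch in scikit-learn, using synthetic data as a stand-in for your real feature matrix. Cross-validated ROC-AUC gives the reference score, and feature importances give the coarse interpretability mentioned above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; swap in your real feature matrix and labels.
X, y = make_classification(n_samples=2_000, n_features=10, random_state=42)

baseline = RandomForestClassifier(n_estimators=200, random_state=42)
scores = cross_val_score(baseline, X, y, cv=5, scoring="roc_auc")
print(f"baseline ROC-AUC: {scores.mean():.3f} +/- {scores.std():.3f}")

# Feature importances give a coarse, quick read on what drives predictions.
baseline.fit(X, y)
for i, importance in enumerate(baseline.feature_importances_):
    print(f"feature_{i}: {importance:.3f}")
```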
Once the baseline is implemented and evaluated, it is time to plan improvement directions. Every accepted improvement trial must perform better than the baseline.
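A trivial way to enforce that rule is a gate that compares a candidate's cross-validation scores against the baseline's. The margin-free comparison here is an assumption; in practice you may want a significance test or a minimum improvement threshold.

```python
import numpy as np

def accept_trial(trial_scores: np.ndarray, baseline_scores: np.ndarray,
                 min_gain: float = 0.0) -> bool:
    """Accept a candidate only if its mean CV score beats the baseline's."""
    return trial_scores.mean() > baseline_scores.mean() + min_gain
```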
Iterate the 3-step process
Once you have the results of your first baseline solution, iterate the three-step process. See if a narrower modeling framing is possible and whether more representative data is available. If so, pivot to alternate approaches that are likely to give you more accurate results. Do not focus yet on making the ML solution run more efficiently, since later improvements may invalidate that optimization work. For example, given more data you can move to more advanced ensemble methods or deep learning methods, in addition to common accuracy boosters such as automated feature selection and hyperparameter tuning.
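As a sketch of one such accuracy booster, here is hyperparameter tuning of the random forest baseline with scikit-learn's RandomizedSearchCV. The parameter grid and search budget are placeholder choices, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Same synthetic stand-in data as the baseline sketch above.
X, y = make_classification(n_samples=2_000, n_features=10, random_state=42)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={
        "n_estimators": [100, 200, 400],
        "max_depth": [None, 5, 10, 20],
        "min_samples_leaf": [1, 2, 5],
    },
    n_iter=10,       # small budget; expand as iteration speed allows
    cv=5,
    scoring="roc_auc",
    random_state=42,
)
search.fit(X, y)
print(search.best_params_, f"tuned ROC-AUC: {search.best_score_:.3f}")
```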
Try to keep the iterations simple. A milestone may span many iterations and can be based on a fixed cadence of deadlines or on major improvement points. Stakeholders should be involved at least during milestone demos. Going down the rabbit hole of fine-tuning without the consent of stakeholders is usually a big mistake.
Remarks
Our development efforts, primarily in the AI/ML/data area, are riddled with too much focus on algorithms. Although picking the right algorithm is the faster and easier route, identifying and using the right data is the key to scaling this type of work. Finding the right data is an adventure, and doing it without friends and colleagues is hard. So, all in all, do not forget to make a lot of friends whom you mentor or are mentored by. If you have alternative or complementary views, please reach out.