
Data Science Survival Guide for Non-Technical Colleagues

5 Tips on How to Bring Value to a Data Science Project

5 Tips on How to Bring Value to an AI Project Without Being a Data Scientist

Photo by Andreas Wagner on Unsplash

Unlike my previous articles, which were highly technical and aimed mainly at fellow data scientists, this one is written for the other roles that are often present in data science projects and that we data scientists work with.

Whether you are a product manager, a domain expert, a scrum master, or a developer without much data science experience, if you are involved in a data science project and truly care about its outcome, I will try to give you several tips on what to expect, how to proceed, and how to understand what is really going on.


The "AI"

Let me start by clearing up some misconceptions that people often have when joining such a project. The "AI" buzzword is used in many ways and contexts, which can lead to unrealistic expectations. It might initially be perceived as a magic box that solves all your problems if you just ask. Unfortunately, this is very far from reality.

The truth is that developing a reliable AI system takes just as much time and effort as developing any other software, if not more. The path is also not as straightforward as in classical software development and can take unexpected turns along the way. The process usually requires running multiple experiments, and the next course of action depends on their outcome. This also means that it can be difficult to plan ahead and give reasonable time estimates.

You should also be aware that in some cases AI might not be the right solution for your problem, or might not be worth pursuing at all. Make sure to consider other options before diving in, and if you decide to proceed, be prepared that the AI might not work the way you imagine.

Tip #1: Set your expectations correctly and consider all the options.


Inputs and Outputs

Next, it is crucial for the team to agree on the task that needs to be solved, and to keep it in mind for the whole duration of the project. This might seem obvious at first, but the reality is often different.

What is really needed is to specify exactly what the inputs to the AI system will be, and even more importantly, what its outputs will be. For example, imagine you are building a dog classifier that tells you whether there is a dog in a picture or not. The input of this system is an image, and the output is either "yes, there is a dog" or "no, there isn’t".

A dog classifier example showing its inputs and outputs. Image by Author.

The inputs and outputs determine how the whole system works, what kind of data is needed to build it, and how the solution is finally evaluated. Therefore, changing them in the middle of the project (for example, by also detecting cats in the picture) almost always has negative consequences and practically resets the project back to the beginning.
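To make this more concrete, here is a minimal sketch of how the agreed inputs and outputs of the dog classifier could be written down in code. The names `DogPrediction` and `classify_image` are my own, chosen for illustration, and the function body is only a placeholder rather than a real model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DogPrediction:
    """Output of the system: a yes/no answer and nothing more."""
    contains_dog: bool


def classify_image(image_bytes: bytes) -> DogPrediction:
    """Input: one image. Output: whether it contains a dog.

    The body is a placeholder; a trained model would go here.
    """
    if not image_bytes:
        raise ValueError("expected a non-empty image")
    # Dummy answer for illustration only, not a real prediction.
    return DogPrediction(contains_dog=False)


result = classify_image(b"fake image content")
print(result.contains_dog)
```

Notice that also detecting cats would change `DogPrediction` itself, and with it the labels that need to be collected and the way the system is evaluated.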

Tip #2: Thoughtfully specify inputs and outputs of the task, and do not change them.


Data

In general, we say that an AI system = data + code. This means that the data are themselves a part of the system and define how it works. While data scientists can handle all the coding on their own, they cannot do much without data, no matter how skilled and experienced they are.

The data need to correspond to the task you are trying to solve, meaning they have to contain the input and output information in the same form that you have defined. In the best-case scenario, data containing both have already been gathered before the project starts, which means that the data scientist can start building the system right away.

It might also happen that you only have data with the input information and are missing the output. Going back to our dog classifier, an example would be having a lot of pictures of animals without knowing which of them show dogs. In this case, we as humans need to look at the pictures ourselves and manually assign the correct output label, a process called data labeling. The system needs the output labels to learn the task of assigning them to the given inputs. It is therefore important to be very precise when labeling data, since you are literally teaching the system.

The last scenario is that you have no data corresponding to either the input or the output of your task. If this happens, it is probably wise to reconsider whether to start the project in the first place. A possible way to proceed is to look for suitable datasets online and use them. However, the more unique your task is, the more difficult this becomes.
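As a rough illustration of what labeling produces, the sketch below pairs each input picture with an output label and lists the pictures that still need one. The file names, the label names and the helper functions are all made up for this example.

```python
# Hypothetical file names, for illustration only.
unlabeled = ["img_001.jpg", "img_002.jpg", "img_003.jpg"]

# After a human has looked at a picture, it gets an output label.
labeled = {
    "img_001.jpg": "dog",
    "img_002.jpg": "not_dog",
}

# The only labels the team agreed on (an assumption of this sketch).
ALLOWED_LABELS = {"dog", "not_dog"}


def missing_labels(inputs, labels):
    """Return the inputs that nobody has labeled yet."""
    return [name for name in inputs if name not in labels]


def invalid_labels(labels):
    """Return the labels that do not follow the agreed guidelines."""
    return {name: label for name, label in labels.items()
            if label not in ALLOWED_LABELS}


still_to_label = missing_labels(unlabeled, labeled)
print(still_to_label)
print(invalid_labels(labeled))
```

A small check like `invalid_labels` is one simple way to catch labels that drift away from the agreed outputs.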

Tip #3: Define how the AI system should work using the data.


Evaluation

After you manage to gather a suitable dataset, the data scientist can start training and evaluating the system.

The way this works is that the dataset is split into a training set and a testing set. As expected, the training set is used to train the system, and the testing set is used to evaluate it. All the metrics that you will hear are computed on the testing set, and the system never sees these data during training, just as it never sees the data on which it will be used in production. This makes the evaluation fair and representative of reality.

The resulting numbers and metrics can sometimes be deceiving. From experience, if a reported number says 99.8% accuracy or similar, there is probably something wrong. Therefore, do not just look at the numbers hoping for them to be as high as possible, but try to understand exactly how they are computed and what they represent. Do not hesitate to ask the data scientist to explain any given metric; he or she will be happy to do so.

Additionally, you can and should influence which metrics are reported and optimized for. The data scientist doesn’t know the domain as well as you do, which is probably why you are part of the project in the first place. This is your opportunity to shine. Usually, some types of errors that the system makes hurt you more than others. Going back to our dog classifier, it might hurt you more when the system says that something is a dog when it actually isn’t (a false positive error) than when it fails to say that something is a dog when it actually is (a false negative error). All of this can be reflected in the chosen metrics, so it is good practice to agree on them at the beginning, the same way you agreed on the inputs and outputs.
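For the curious, a split like this can be sketched in a few lines of plain Python. The function name `split_dataset`, the 80/20 ratio and the toy dataset are assumptions made for this example; real projects often use a library helper and a more careful splitting strategy.

```python
import random


def split_dataset(examples, test_fraction=0.2, seed=42):
    """Shuffle the examples and split them into (training set, testing set)."""
    shuffled = list(examples)              # copy, so the original stays untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy dataset of (file name, is it a dog?) pairs, invented for illustration.
examples = [(f"img_{i:03d}.jpg", i % 2 == 0) for i in range(100)]

train_set, test_set = split_dataset(examples)
print(len(train_set), len(test_set))
```

The important property is that no example ends up in both sets, so the testing set stays unseen during training.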
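Here is one way such a suspicious number can come about. This is my own illustrative scenario with invented data, not the only possible cause: if only 2 out of 1000 test pictures contain a dog, a "system" that always answers "no dog" scores 99.8% accuracy while being useless.

```python
# Invented testing set: 2 pictures with a dog, 998 without.
true_labels = [True] * 2 + [False] * 998

# A useless "classifier" that always answers "no dog".
predictions = [False] * len(true_labels)

correct = sum(p == t for p, t in zip(predictions, true_labels))
accuracy = correct / len(true_labels)

# How many of the actual dogs did it find?
dogs_found = sum(p and t for p, t in zip(predictions, true_labels))

print(f"accuracy: {accuracy:.1%}")
print(f"dogs found: {dogs_found} out of {sum(true_labels)}")
```

This is why it pays off to ask how a number was computed and on what data.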

A matrix showing the types of errors of the dog classifier. Image by Author.
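The four cells of such a matrix can be counted directly, as in the sketch below. The example labels are invented, and precision and recall are shown only as two common metrics that weigh false positives and false negatives differently; which one matters more depends on your domain.

```python
def confusion_counts(true_labels, predictions):
    """Count the four possible outcomes of a yes/no classifier."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, predicted in zip(true_labels, predictions):
        if predicted and truth:
            counts["tp"] += 1   # said "dog", and it is a dog
        elif predicted and not truth:
            counts["fp"] += 1   # said "dog", but it is not (false positive)
        elif truth:
            counts["fn"] += 1   # missed a real dog (false negative)
        else:
            counts["tn"] += 1   # correctly said "no dog"
    return counts


# Invented labels and predictions for eight test pictures.
true_labels = [True, True, True, False, False, False, False, False]
predictions = [True, True, False, True, True, False, False, False]

counts = confusion_counts(true_labels, predictions)

# Precision drops with false positives, recall drops with false negatives.
precision = counts["tp"] / (counts["tp"] + counts["fp"])
recall = counts["tp"] / (counts["tp"] + counts["fn"])

print(counts)
print(f"precision: {precision:.2f}, recall: {recall:.2f}")
```

If false positives hurt you more, you would push for high precision; if missed dogs hurt you more, you would push for high recall.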

Tip #4: Optimize metrics that reflect reality, and that everybody understands.


Error Analysis

If the system is trained and evaluated and you are not yet satisfied with its performance, you need to find ways to improve it. The data scientist might not always have a straightforward answer, and this is where an error analysis comes in.

This means manually looking at the mistakes that the system makes and trying to determine why they happen. You will often be surprised by the actual errors, but you will also get an overall feel for how the system could fail in production. Using our dog classifier again, the cause could be, for example, that the image is too blurry, that part of the dog is obscured by another object, or that the image contains a wolf, which simply looks similar to a dog. Another kind of mistake is that the label assigned by a human is simply wrong and the system is actually correct. This usually happens when the data labeling is done very quickly or without specific guidelines.

Do not get too carried away by a single mistake; try to find patterns and quantify them. The output of the analysis should not be "I saw a wolf classified as a dog", but rather "30% of mistakes are wolves classified as dogs, 20% are blurry images, 5% are wrong labels, etc." Not all types of mistakes can be easily fixed, and having them quantified helps to determine which are really worth pursuing, as well as what the potential performance gain is if you succeed.
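Quantifying the mistakes does not require anything fancy. In the sketch below, each reviewed mistake gets one category note, and the notes are then tallied into percentages. The categories, the counts and the function name `error_breakdown` are invented for illustration.

```python
from collections import Counter

# One note per reviewed mistake (hypothetical data for 20 mistakes).
error_notes = (
    ["wolf"] * 6
    + ["blurry"] * 4
    + ["occluded"] * 3
    + ["wrong_label"] * 1
    + ["other"] * 6
)


def error_breakdown(notes):
    """Return the share of each error category, in percent."""
    counts = Counter(notes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: round(100 * n / total, 1)
            for category, n in counts.most_common()}


breakdown = error_breakdown(error_notes)
for category, share in breakdown.items():
    print(f"{category}: {share}%")
```

A table like this makes it much easier to decide together which type of mistake is worth fixing first.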

Tip #5: Do the error analysis systematically, and focus on patterns rather than single examples.


Conclusion

These were the 5 tips, one for each stage of the project where I feel your role can bring the most value. I hope they will make your project run smoother and make the data scientist, as well as the other team members, happier.

Thank you for reading!
