
Reverse-Engineering water properties, with Machine Learning

Here is how to work out where water comes from, given its properties, with a few lines of code.

Photo by mrjn Photography on Unsplash

Machine Learning helps us a lot in our everyday lives, sometimes in ways we don’t even fully realize. The main principle (and the main reason why we use Machine Learning) is that when something is really hard to understand, we still might have a chance to solve it with data.

While it is true that ML is usually adopted to solve a problem, another powerful option is to use Machine Learning to reverse-engineer.

But what does that actually mean? Well, let’s say you create a new social media platform, and every time someone posts something containing a certain word (let’s say "stupid") you delete the post because it is against your policy. Imagine that your friends start using your platform without knowing this rule. If they want to understand why their posts have been deleted, they have to reverse-engineer your platform: look at which posts survive and which don’t, and infer the hidden rule from that. That is basically what reverse-engineering means.

In this blog post, I will show you how to reverse-engineer a dataset using Machine Learning, with a few lines of code.

In particular, we will work with a dataset describing the properties of the water found in a given place. Given these properties, we want to use Machine Learning to retrieve the country of origin.

Let’s get started!

0. Libraries:

For our purpose, we will use Python. We will use some well-known libraries like sklearn, pandas, matplotlib, and numpy. Here is the full list that you can just copy-paste in your notebook:
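The original import list is not reproduced here, so what follows is a minimal sketch. The four libraries are the ones named above, but the specific submodules are an assumption based on the steps that come next.

```python
# Assumed import list: the article names sklearn, pandas, matplotlib and numpy;
# the exact submodules below are a guess based on the later steps.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree

print("pandas", pd.__version__, "| numpy", np.__version__)
```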

1. Data exploration:

Nice, so we are almost ready. We need one important thing, though: data. You can download the dataset here and, good news, it is a relatively small one. Since the country is what we want to predict, we don’t want any feature that is just a surrogate of the country itself, because it would give the answer away. For this reason, we’ll drop the surrogate column from our data. Here is the final result:
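The file and the name of the surrogate column are not given here, so the sketch below uses a tiny made-up table instead of the real dataset. The column `country_code`, the feature `water_type`, and all the values are hypothetical placeholders; only the "drop the leaking column" step is the point.

```python
import pandas as pd

# Stand-in for the real dataset (hypothetical values and column names).
# With the real file you would load it with pd.read_csv(...) instead.
raw = pd.DataFrame({
    "country": ["Italy", "Italy", "Spain", "France"],
    "country_code": ["IT", "IT", "ES", "FR"],  # hypothetical surrogate of the target
    "metal": [12.0, 15.5, 3.2, 7.1],
    "glass": [1.0, 0.5, 8.3, 2.2],
    "water_type": ["river", "lake", "sea", "river"],
})

# Drop the column that would leak the country to the model.
data = raw.drop(columns=["country_code"])
print(data.head())
```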

We have a lot of properties, many of them non-numerical, plus the country column, which is the one we want to predict. We can check the column types like this:
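One possible way to do it, sketched on a small made-up table (the column names and values are placeholders, not the real dataset), is to print `dtypes` and then split the columns with `select_dtypes`:

```python
import pandas as pd

# Hypothetical stand-in for the real dataset.
data = pd.DataFrame({
    "country": ["Italy", "Italy", "Italy", "Spain", "Spain", "France", "Greece"],
    "metal": [12.0, 15.5, 11.2, 3.2, 2.9, 7.1, 6.4],
    "glass": [1.0, 0.5, 1.4, 8.3, 7.7, 2.2, 3.0],
    "water_type": ["river", "lake", "river", "sea", "sea", "river", "lake"],
})

# Type of every column.
print(data.dtypes)

# Split the column names into numerical and non-numerical ones.
numerical = data.select_dtypes(include="number").columns.tolist()
non_numerical = data.select_dtypes(exclude="number").columns.tolist()
print("Numerical:", numerical)
print("Non-numerical:", non_numerical)
```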

So we have a list of numerical features and a list of non-numerical ones:

We can count how many distinct values each non-numerical feature has like this:
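A minimal sketch, again on made-up placeholder data rather than the real table, could rely on `nunique`:

```python
import pandas as pd

# Hypothetical stand-in for the non-numerical part of the dataset.
data = pd.DataFrame({
    "country": ["Italy", "Italy", "Italy", "Spain", "Spain", "France", "Greece"],
    "water_type": ["river", "lake", "river", "sea", "sea", "river", "lake"],
})

# Number of distinct values in each non-numerical column.
distinct_counts = data.nunique()
print(distinct_counts)
```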

And we can check the correlation of numerical features by doing this:
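The original snippet is not shown here. One way to get the same information, assuming the numerical features sit in a pandas DataFrame (the two columns and their values below are placeholders), is the built-in `corr` method, which returns the pairwise Pearson correlations:

```python
import pandas as pd

# Hypothetical numerical features.
numerical_data = pd.DataFrame({
    "metal": [12.0, 15.5, 11.2, 3.2, 2.9, 7.1, 6.4],
    "glass": [1.0, 0.5, 1.4, 8.3, 7.7, 2.2, 3.0],
})

# Pairwise Pearson correlation between the numerical columns.
corr = numerical_data.corr()
print(corr.round(2))
```

On the real data, the same matrix can be passed to a heatmap plot to make strongly correlated pairs easier to spot.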

Now let’s take a look at our target column and see how many countries we have.
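As a rough sketch (the country labels below are invented for illustration), `value_counts` gives the number of rows per country, sorted from the most to the least frequent:

```python
import pandas as pd

# Hypothetical target column.
country = pd.Series(
    ["Italy", "Italy", "Italy", "Spain", "Spain", "France", "Greece"],
    name="country",
)

# Rows per country, most frequent first.
country_counts = country.value_counts()
print(country_counts)
print("Number of countries:", country.nunique())
```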

Ok so… it’s a mess. Let’s tidy it up by keeping only the most frequent countries and grouping all the other classes together.

Note: This is not necessary at all! It is just to make things clearer and easier to understand. Feel free to skip this part.
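A possible version of this grouping step is sketched below. The number of classes to keep (`top_n`) and the label `"Other"` are my own assumptions, and the data is made up:

```python
import pandas as pd

# Hypothetical target column.
country = pd.Series(
    ["Italy", "Italy", "Italy", "Spain", "Spain", "France", "Greece"],
    name="country",
)

top_n = 2  # assumed number of classes to keep
top_countries = country.value_counts().nlargest(top_n).index

# Keep the most frequent countries and relabel everything else as "Other".
grouped = country.where(country.isin(top_countries), "Other")
print(grouped.value_counts())
```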

Way better.

2. Machine Learning:

When you are reverse-engineering something, explainability is crucial. A well-known and explainable Machine Learning model is the Decision Tree. But how does it work? Well, let’s make that easy. Have you ever played "Guess who?" It is a classic game where you have to identify a certain character by asking your opponent yes/no questions. Let’s say you have to guess Ed Sheeran. It could go like this:

Is he an actor? No
Is he a singer? Yes 
Is he a pop singer? Yes 
Is he from Canada? No
Does he have ginger hair? Yes
It's Ed Sheeran!

And that is exactly what a decision tree does with our features to identify the country. Pretty simple, right? Let’s make it work! With these lines of code you can prepare the dataset and see the correlation with the target column.
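The preparation code is not reproduced here, so this is only a sketch of one reasonable way to do it: one-hot encode the non-numerical features with `pd.get_dummies`, turn the country into integer codes, and correlate each feature with those codes. The data is made up, and the choice of encoders is my assumption. Keep in mind that a correlation with an integer-coded category depends on the arbitrary order of the codes, so treat it as a rough hint only.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned dataset.
data = pd.DataFrame({
    "country": ["Italy", "Italy", "Spain", "Spain", "Other", "Other"],
    "metal": [12.0, 15.5, 3.2, 2.9, 7.1, 6.4],
    "glass": [1.0, 0.5, 8.3, 7.7, 2.2, 3.0],
    "water_type": ["river", "lake", "sea", "sea", "river", "lake"],
})

# Features: numerical columns stay as they are, categorical ones become 0/1 columns.
X = pd.get_dummies(data.drop(columns=["country"]), dtype=float)

# Target: one integer code per country (the order of the codes is arbitrary).
codes, labels = pd.factorize(data["country"])
y = pd.Series(codes, index=data.index, name="country_id")

# Correlation of every feature with the encoded target.
target_corr = X.corrwith(y).sort_values()
print(target_corr.round(2))
```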

With these lines of code you can actually train your model and check its performance (train/test split: 80%/20%):
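Here is a minimal sketch of that training step on synthetic data. The features, the thresholds that generate the labels, and the random seeds are all invented for illustration. The only things taken from the text are the model (a decision tree) and the 80%/20% split.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the label follows a hidden, deterministic rule (made up here).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "metal": rng.uniform(0, 20, 200),
    "glass": rng.uniform(0, 10, 200),
})
y = np.where(X["metal"] > 10, "Italy", np.where(X["glass"] > 5, "Spain", "Other"))

# 80% of the rows for training, 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print(f"Train accuracy: {train_acc:.3f} | Test accuracy: {test_acc:.3f}")
```

Because the labels come from an exact rule, a tree can fit the training set perfectly. Test points that fall very close to a threshold can still be misclassified.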

It is a reverse-engineering task, so it is supposed to be like this! There is a precise pattern that we want to understand and we don’t want to have any errors.

And this is the reverse-engineered result:
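The original figure is not reproduced here. As an illustration of how the learned rules can be read out of a fitted tree, the sketch below uses scikit-learn's `export_text` on synthetic data. The features and the hidden rule are made up, so the printed thresholds are not the ones from the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data generated by a hidden rule (invented for this example).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "metal": rng.uniform(0, 20, 200),
    "glass": rng.uniform(0, 10, 200),
})
y = np.where(X["metal"] > 10, "Italy", np.where(X["glass"] > 5, "Spain", "Other"))

model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Print the tree as nested if/else rules on the feature thresholds.
rules = export_text(model, feature_names=list(X.columns))
print(rules)
```

The printed thresholds should land close to the values used to generate the labels, which is the whole point of the exercise. `sklearn.tree.plot_tree` draws the same tree as a figure if you prefer a picture.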

That explains everything! If the water exceeds a certain metal percentage, it belongs to a certain class; the same goes for glass and for droughts_floods_temperature.

3. Final remarks

Being able to understand new stuff thanks to Machine Learning is amazing and inspiring. I really hope you liked the article and had as much fun reading it as I had writing it and developing the code.

If you liked the article and you want to know more about Machine Learning, or you just want to ask me something you can:

A. Follow me on Linkedin, where I publish all my stories.
B. Subscribe to my newsletter. It will keep you updated about new stories and give you the chance to text me with any corrections or doubts you may have.
C. Become a referred member, so you won’t have any "maximum number of stories for the month" and you can read whatever I (and thousands of other Machine Learning and Data Science top writers) write about the newest technology available.

Thank you so much, and have a wonderful day!

