Where is the data?

Or how to enrich your datasets and create new features automatically.

Favio Vázquez
Towards Data Science
7 min read · Apr 16, 2020

Illustration by Héizel Vázquez

One of the hardest things when you are working with a new dataset is discovering the most important features for predicting your target, and where you can find new sources of information that can improve your understanding of the data and your models.

In this article, I’m going to show you how to do that without any programming skills. Yes, that may sound weird right now, but bear with me. In future articles, I’ll explore programming libraries that can help you do the same thing, and we’ll see which approach gives better results.

We are going to do this with an example dataset: House Sales in King County, USA (King County is the county that includes Seattle). You can find all the information about the data here:

The idea of the dataset is to predict the price of a house given its different features. Before we get to the data enrichment itself, let’s load the data in Python to get some basic information about it. Below you can see a simple notebook that does that:
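Here is a minimal sketch of that notebook, assuming you saved the Kaggle CSV as kc_house_data.csv (its usual filename):

```python
# Minimal sketch of the exploratory notebook mentioned above.
# Assumes the Kaggle CSV was saved as "kc_house_data.csv"; adjust the
# path to wherever your copy lives.
import pandas as pd

df = pd.read_csv("kc_house_data.csv")

print(df.shape)                 # expect (21613, 21)
print(df.dtypes)                # column names and types
print(df["price"].describe())  # basic stats for the target
```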

It’s important to note that I’m not doing the whole EDA for the dataset; this is just to get basic information. And what I’m about to show does not eliminate the data science process you still have to follow; it just enriches your data and gives you more information about it.

Ok, now it’s time.

The way we are going to do this is with a system called Explorium. I discovered this software a while ago, and I’ve been using it ever since. They describe their product as:

Explorium is driving a new paradigm in the world of data science — one where companies can build models on the data they need, not the data they have. Discover the only end to end data science platform that focuses on superior data for machine learning.

So it’s an end-to-end platform where you can build and deploy models, but we will explore that in other articles. You can ask for a demo to replicate what I’m doing in this article here:

I’m going to do a step-by-step tutorial. If you have any questions, please let me know in the comment section below :).

Creating a project

The first thing you have to do when you have access to the app is to create a project.

And then name your project:

Loading the data

The next step is to load the data into the system. There are several ways of doing that, including:

  • Local
  • S3
  • Oracle
  • Teradata
  • MySQL
  • MSSQL
  • Postgres
  • Redshift
  • Hive
  • Google BigQuery
  • Google Storage

I’m going to load it from my local computer. After a few seconds you should see this screen:

And as you can see, you have a very basic but useful data exploration tool at the bottom, where you can get some basic stats about your data.

As you can see, we have 21 columns and 21,613 rows, which matches what we saw with Python.

The data engine

Now it’s time to get the magic started. In Explorium you have something called the “engine”: the place where the software creates new features from your data and fetches new datasets depending on your columns and their contents.

This is what we are going to do (taken from the website):

There are many more things the system can do, and as I mentioned, we will explore them in other articles. Let’s continue.

Now we have to set what we are trying to predict. This is more for the machine learning pieces of the software, but it’s necessary to continue:

We now have to push the PLAY button:

And now we wait.

And then we wait a little more. This can take several minutes; the system is using your information to extract other data from thousands of datasets.

Just so you know, after around 5 minutes the system had detected 30 useful datasets and around 890 features. You can stop the engine if you don’t want to wait any longer, but I recommend waiting until the end.

Insights

After several minutes you should see something like this:

Let me break that page down for you. At the top, we see that with the internal data (and some new features it created from your dataset) the platform achieved a score (R²) of 87.46, which is pretty good, and that with the external data it went up to 89.52, which is a little better. Not that surprising, but it is something.

On the left, you see that it discovered 30 useful datasets that match your data. In the middle, you have all of the different features it created from internal and external data. And finally, on the right, you see the results of some ML models; the best model was built with XGBoost, with an RMSE of 118.5K and an MAE of 66.3K.
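We won’t try to reproduce that model exactly, but if you want a rough offline point of comparison, a plain XGBoost baseline on the original columns (my own sketch, not Explorium’s pipeline; your numbers will differ from the screenshot since Explorium trains on the enriched feature set) could look like this:

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

df = pd.read_csv("kc_house_data.csv")
X = df.drop(columns=["price", "id", "date"])  # keep the numeric columns
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = XGBRegressor(n_estimators=300, learning_rate=0.1, random_state=42)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("R²:  ", r2_score(y_test, pred))  # Explorium shows this as a percentage
print("RMSE:", mean_squared_error(y_test, pred) ** 0.5)
print("MAE: ", mean_absolute_error(y_test, pred))
```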

We are not going to focus on the models right now, only on the data. In the middle section, you can get the 50 best features for predicting the price:

If you click there and then download the selected features:

You will download the 50 best features for predicting the price. One important thing: the price column itself is not included, so if you want to use the new data externally, you will need to append it yourself. Let’s load the data in pandas and see it:
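Assuming the export is a CSV (I’ll call it explorium_features.csv here; your actual filename will differ) and that it preserves the original row order, appending the target back could look like this:

```python
import pandas as pd

original = pd.read_csv("kc_house_data.csv")       # still has the price column
features = pd.read_csv("explorium_features.csv")  # hypothetical filename

# Assumption: the export keeps the original row order. If it keeps an
# id column instead, merge on it:
# enriched = features.merge(original[["id", "price"]], on="id")
enriched = features.copy()
enriched["price"] = original["price"].values

print(enriched.shape)
```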

As you can see, the column names are very explicit (and long), but we can change that easily with Python and pandas, so don’t worry about it.
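For example, here is a quick sketch of that renaming (the verbose column name below is made up for illustration, not taken from the actual export):

```python
import pandas as pd

# A toy frame with one made-up verbose name and one original column.
enriched = pd.DataFrame(
    columns=["KNeighbors(lat, long) -> mean(price)", "sqft_living"]
)

# Rename one column explicitly...
enriched = enriched.rename(
    columns={"KNeighbors(lat, long) -> mean(price)": "knn_mean_price"}
)

# ...or normalize every name at once: keep alphanumerics, lowercase the rest.
enriched.columns = [
    "".join(ch if ch.isalnum() else "_" for ch in col).strip("_").lower()
    for col in enriched.columns
]
print(list(enriched.columns))  # ['knn_mean_price', 'sqft_living']
```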

Now, going back to Explorium: in the Insights tab you can get the feature importance. Here you can see how important the variables are for modeling and where they come from. Some of them are from the original dataset and some are external, and we also get features like K-Neighbors aggregations over existing columns in our data. That can be very helpful for offline modeling.
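To make the K-Neighbors idea concrete, here is a hedged sketch of building a similar feature yourself: the mean price of each house’s five nearest neighbors by latitude and longitude (my own construction, not necessarily Explorium’s exact recipe):

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

df = pd.read_csv("kc_house_data.csv")

k = 5
nn = NearestNeighbors(n_neighbors=k + 1)  # +1: each point matches itself first
nn.fit(df[["lat", "long"]])

_, idx = nn.kneighbors(df[["lat", "long"]])
# Drop each point's self-match (column 0) and average the neighbors' prices.
df["knn_mean_price"] = df["price"].to_numpy()[idx[:, 1:]].mean(axis=1)

print(df[["price", "knn_mean_price"]].head())
```

In a real pipeline you would compute a feature like this on the training split only, since it uses the target and can leak information into your validation scores.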

By the way, you can also download the external datasets directly:

And get full profiling reports for them.

Conclusion

Enriching and improving your data is remarkably simple now. As I mentioned before, I’ll compare Explorium’s results with some Python libraries in the future, so stay tuned for that.

We saw that the process is simple and intuitive, and if you don’t want the modeling part, you can keep the new data and work on your own. I can assure you that most of the work lies in the discovery phase, data cleaning, and feature engineering; modeling is easy, and coding is getting simpler by the day.

If you combine this type of software with a good understanding of the business and an effective data science methodology, you will be rocking the data science world much faster.

Thanks for reading! If you want to keep up to date with my articles, please follow me here on Medium and on Twitter: https://twitter.com/faviovaz
