Introduction to AI Tables

Integrate ML inside databases using MindsDB

Javi Sánchez
Towards Data Science


Photo by Hitesh Choudhary on Unsplash

In this article, we are going to cover AI tables. Machine learning is a remarkably versatile tool nowadays: we can apply it to many different tasks, from image recognition to natural language processing.

However, one of the most important real-world use cases of ML models is making predictions on tabular data that we already have stored in a database. So what if we could apply the model directly to the table itself? Well, that is exactly what an AI table is.

AI Tables

AI tables integrate machine learning models directly into your database as virtual tables. This allows us to create predictors that can be queried with SQL statements.

This simplifies much of the ML lifecycle, since many operations happen inside the database and we do not have to move the data somewhere else for processing.

To build a good understanding of how they work, in this article we are going to walk through a complete example of how to create, train, and make predictions with a model directly inside the database.

To do that, we are going to use a tool called MindsDB. MindsDB is an in-database ML tool that helps us to extend the SQL language so that ML models can be created, queried, and maintained as if they were database tables. Let’s see it in action.

Example: Census Income Predictions

All the code used in this example can be found in this repository.

Data

For this example, we will use a “Census Income” dataset. The goal is to predict whether the income of a person exceeds $50K/yr based on census data. The dataset can be found in this link, extracted from the UCI ML repository.

The download contains three files:

  • adult.data: a file containing 32561 records of census data for training.
  • adult.test: a file containing 16281 records of census data for testing.
  • adult.names: a file with the description of the dataset, info, license, etc.

Each record in the data consists of 14 attribute columns, such as age, workclass, and education, plus a target column indicating whether the salary exceeds $50K per year.

Data preprocessing

We will perform a bit of preprocessing on the data to make the whole process easier to follow. First, let’s take a look at the raw dataset using the pandas library.
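If you want to follow along, loading the raw file could look something like this. The column names are taken from the adult.names file; two inline sample records stand in for the full adult.data file so the snippet runs on its own (point read_csv at the downloaded file for the real thing):

```python
import io
import pandas as pd

# Column names, as listed in the adult.names file
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "target",
]

# Two sample records standing in for the full adult.data download;
# replace this with "adult.data" to load the real dataset.
raw = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical,"
    " Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse,"
    " Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)

# The raw file has no header row, and fields are separated by ", ",
# so skipinitialspace strips the leading space from each value.
df = pd.read_csv(raw, header=None, names=columns, skipinitialspace=True)
print("Shape:", df.shape)
print(df.head())
```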

Shape: (32561, 15)
First 5 rows of the df, by Author

The list of names of each column can be found in the adult.names file along with other info on the dataset.

As we can see, we have a normal pandas dataframe with some categorical data and some numerical data. We could encode the categorical data, normalize values, clean the dataset, etc., but the model will work with the data as it is, and since the aim of the article is to explain the concept of AI Tables, we will leave the data alone for now.

The only column we are going to change is the last one. We will replace the values ‘<=50K’ and ‘>50K’ with 0 and 1 respectively, just to make the problem look like a standard binary classification problem.

This is all the preprocessing we need to do in the dataset. Now we can export the dataframe into a csv file. We will need it soon.

The file is called ‘data.csv’ and we will not keep the index column in the new file.
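The mapping and export could look something like this; a small stand-in column replaces the full dataframe so the snippet runs on its own:

```python
import pandas as pd

# Stand-in for the full dataframe; only the target column matters here.
df = pd.DataFrame({"target": ["<=50K", ">50K", "<=50K"]})

# Replace the income labels with 0/1 to frame it as binary classification.
df["target"] = df["target"].map({"<=50K": 0, ">50K": 1})

# Export without the index column, since we do not need it in the new file.
df.to_csv("data.csv", index=False)
print(df["target"].tolist())
```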

For the test set, we are going to do the same. We keep the last column because, when making predictions, we are going to create an entirely new column with the predicted values; we will then compare that column against the ‘target’ one to obtain metrics on the performance of the model.

Shape: (16281, 15)
First 5 rows of the test data, by Author

In this case, while loading the dataframe we need to use the skiprows parameter, since the first row of the test set is not a record per se.

Now we can also save the dataframe to a csv file. We will name it ‘test.csv’.
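Loading and exporting the test set could look like this; again, two inline sample records stand in for the full adult.test file (note that the first line of that file is a note rather than a record, and the test labels carry a trailing period):

```python
import io
import pandas as pd

columns = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "target",
]

# Stand-in for adult.test; replace with "adult.test" for the real file.
raw = io.StringIO(
    "|1x3 Cross validator\n"
    "25, Private, 226802, 11th, 7, Never-married, Machine-op-inspct,"
    " Own-child, Black, Male, 0, 0, 40, United-States, <=50K.\n"
    "38, Private, 89814, HS-grad, 9, Married-civ-spouse, Farming-fishing,"
    " Husband, White, Male, 0, 0, 50, United-States, <=50K.\n"
)

# skiprows=1 drops the leading note line that is not a record.
test_df = pd.read_csv(raw, header=None, names=columns,
                      skipinitialspace=True, skiprows=1)

# Test labels end with a period ("<=50K."), so strip it before mapping.
test_df["target"] = test_df["target"].str.rstrip(".").map({"<=50K": 0, ">50K": 1})

test_df.to_csv("test.csv", index=False)
print("Shape:", test_df.shape)
```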

MindsDB

It is time to bring up the AI Tables and create our predictive model. In this article, we will use the MindsDB cloud, but you can install it on your system. You can find all the information in their documentation.

First, we need to create an account and log in to the cloud.

MindsDB Cloud

We will have an editor to make normal SQL queries. Now we have to import the data we generated in the last step. To do that, go to the ‘Add Data’ page, select the tab ‘Files’ and go to ‘Import File’.

Import File page

Here we have to import the two csv files that we generated before (data.csv and test.csv). We will name them ‘train_data’ and ‘test_data’ respectively.

Once we have imported our files we can check if they are correctly uploaded by running the following query in the editor.
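Uploaded files live under the files database, so listing its tables is enough to check:

```sql
SHOW TABLES FROM files;
```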

Tables_in_files
-------------------
test_data
train_data

We can see that both tables are now in the cloud. From now on we will use SQL queries to interact with the data. For example, we can check the first 10 records of the training data using the following SQL command.
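A plain SELECT against the uploaded file does the job:

```sql
SELECT * FROM files.train_data LIMIT 10;
```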

Creating the predictor

Now that we have our data loaded, it is time to create a predictor. We are going to use the ‘create predictor’ statement.

  • salary_predictor is the name of the predictor.
  • files is the default database name under which uploaded files are stored.
  • target is the name of the column we want to predict.
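Putting those pieces together, the statement looks along these lines (following the CREATE PREDICTOR syntax of the MindsDB version used here):

```sql
CREATE PREDICTOR mindsdb.salary_predictor
FROM files (SELECT * FROM train_data)
PREDICT target;
```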

This query will create a predictor for our data. The predictor now has to be trained on the data. We can check its status with the following query.

All the predictors we create will be stored in the mindsdb.predictors table.
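So querying that table, filtered by the predictor’s name, shows its current state:

```sql
SELECT * FROM mindsdb.predictors
WHERE name = 'salary_predictor';
```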

The output of the query is something like this.

We can see that it is in the training phase, along with other information about the predictor.

The training process will take around 5–10 minutes to complete. Once it is finished we can run the previous query again to see the results.

We can see now that the status is ‘complete’ and the accuracy is 0.832. This means that our predictor achieves around 83% accuracy on the training data.

And that’s it! We have trained a model directly in our database using SQL queries. Now we can test the predictor with the test data.

Making predictions

MindsDB gives us two possible ways to make predictions. The first one is running a SQL query with a single new record in the parameters of the query. This is ideal when we want a prediction for a single, one-off data point.

The other way is to make a batch prediction for a large set of data in our database. This is our scenario, since we have a complete file (test_data) with new records, and going one by one would take far too long.

We can run predictions on a whole batch of data with the following query.
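The batch form joins the data table against the predictor as if it were another table; a sketch along these lines:

```sql
SELECT t.*, m.target AS predicted_salary
FROM files.test_data AS t
JOIN mindsdb.salary_predictor AS m;
```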

With the SELECT statement, we are going to use all the record fields available in the data. Then, we are going to use the JOIN statement to make the predictions using the model we have just trained. Lastly, we are going to generate a new column called ‘predicted_salary’ with the predictions.

If we run the query, it will output a new dataframe with an extra column of predictions. Let’s evaluate the model using Python. Above the output, on the right side, there is a button to export the results to a .csv file.

Click on it, and it will download the result of the query. I will change the file name to ‘predictions.csv’ so the following code will use that name. Now let’s use pandas to check the performance of the model. Remember that the accuracy of the predictor was around 83% based on the training process.

We will use seaborn to visualize the results and confusion_matrix to generate the metrics of the model.
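Loading the exported file might look like this; a tiny two-column stand-in file is written first so the snippet runs on its own (the real predictions.csv downloaded from the editor has all 15 original columns plus predicted_salary):

```python
import pandas as pd

# Stand-in for the exported query result; in the real run,
# predictions.csv is the file downloaded from the MindsDB editor.
pd.DataFrame({
    "target": [0, 0, 1, 1, 0, 1],
    "predicted_salary": [0, 1, 1, 1, 0, 0],
}).to_csv("predictions.csv", index=False)

preds = pd.read_csv("predictions.csv")
print(preds.shape)
```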

(16281, 16)

Now we open the file and check that the output has 16 columns, one more than the original dataframe: the extra column holds the predicted label.

Let’s select the columns we are interested in and compute the confusion matrix.
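A sketch with stand-in labels; with the real export, pass the ‘target’ and ‘predicted_salary’ columns instead:

```python
from sklearn.metrics import confusion_matrix

# Stand-in labels; in the real run these are the 'target' and
# 'predicted_salary' columns of the exported predictions file.
true_labels = [0, 0, 1, 1, 0, 1]
predicted = [0, 1, 1, 1, 0, 0]

# Rows are the true classes, columns the predicted ones.
cm = confusion_matrix(true_labels, predicted)
print(cm)
```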

Now that we have the matrix in array form, let’s create a visualization for it.
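A seaborn heatmap is a simple way to do it; this sketch uses a small stand-in matrix in place of the real counts:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script runs headless
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Stand-in confusion matrix; in the real run this is the array
# returned by confusion_matrix on the exported predictions.
cm = np.array([[2, 1], [1, 2]])

# Annotated heatmap with the income classes on both axes.
ax = sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
                 xticklabels=["<=50K", ">50K"],
                 yticklabels=["<=50K", ">50K"])
ax.set_xlabel("Predicted label")
ax.set_ylabel("True label")
plt.savefig("confusion_matrix.png")
```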

Metrics of the confusion model, by Author

And we can also take a look at the metrics.
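scikit-learn can compute these directly; a sketch on stand-in labels (with the real exported columns it produces the figures shown below):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Stand-in labels; in the real run use the 'target' and
# 'predicted_salary' columns of the exported predictions file.
true_labels = [0, 0, 1, 1, 0, 1]
predicted = [0, 1, 1, 1, 0, 0]

print("Precision:", precision_score(true_labels, predicted))
print("Recall:", recall_score(true_labels, predicted))
print("F1:", f1_score(true_labels, predicted))
```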

Precision: 0.8420587052673905
Recall: 0.9418907978771252
F1: 0.8891813858695653

As we can observe, the precision on the test set is around 84%, very close to the accuracy reported by the predictor’s own metrics.

Remember all the code is in this repository.

Conclusion

So that’s all for this tutorial. We have seen what an AI Table is, and we ran an example to see how it works. But this barely scratches the surface of what the tool can do; take a look at the official MindsDB documentation, linked here, to explore the other possibilities.

I suggest you try this with your own dataset or design a new experiment; for example, you could create a regression model instead of a classification one. Another idea I leave you with is to improve the performance of this model: try cleaning and processing the training data and check whether the predictor’s metrics improve. It can be a fun challenge!

I hope you have found this tutorial useful. If you have any further questions, feel free to leave a comment. See you next time! 👋🏻​​
