Powering SQL With Machine Learning Capabilities

Eyal Trabelsi
Towards Data Science
3 min read · Nov 15, 2019


Enabling Semantic Queries In SQL Using Word Embedding


In today’s world, demand for machine learning keeps growing, which has led to tremendous efforts to reduce the skills and effort these tasks require. One of the most promising strategies is to incorporate these capabilities into standard SQL. It started with Oded Shmueli’s paper and has since been implemented for Elasticsearch and PostgreSQL.

In order to enable those capabilities in SQL, we need to attach actual semantic meaning to database entities (rows/columns/tables). We do so by incorporating word embedding vectors into our database, so we can ask queries such as “show me all the rows similar to this row”, which is something you can’t do with standard SQL.

All we need to get this magic into our database is to incorporate word embeddings in the right way.

How Does It Work

To quantify semantic relationships inside the database, we need a way to represent the semantic meaning of database entities in addition to the usual relations.

Embedding is a technique for mapping a sequence to a numerical vector space, where the vectors capture various aspects of the associated sequence (the embedding used here is word2vec). The proposed solution is to construct embeddings of the database entities in order to represent their semantic meaning, and then access these vectors and compare their closeness via user-defined functions (UDFs).

The main phases of our query execution can be described by the following figure:

We start with a SQL query containing UDFs from the user. To execute it, the relevant sentences are tokenized, the corresponding vectors are calculated and compared, and the results are returned to the user.

So we are going to enhance queries and quantify semantic relationships inside the database using the following UDFs (a small sketch of how they fit together follows the list):

  • cosineDistance(a, b), which takes vectors a and b and returns their cosine distance.
  • vec(token), which takes a token and returns its associated vector.
  • Token entity e, which declares a variable e that can be bound to tokens.
  • Contain(row, entity), which states that entity must be bound to a token generated by tokenizing row.
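
As a minimal sketch of how these pieces fit together (the customers table, its description column and the LIMIT are my own illustrative assumptions, not part of the original work), a “show me the rows most similar to this row” query could look roughly like this:

    -- Illustrative sketch: the 10 customers most semantically similar to customer 42.
    -- vec() turns text into its embedding vector, cosineDistance() compares two
    -- vectors; a smaller distance means more similar. The schema is assumed.
    SELECT c2.customer_id,
           cosineDistance(vec(c1.description), vec(c2.description)) AS distance
    FROM   customers c1, customers c2
    WHERE  c1.customer_id = 42
      AND  c2.customer_id <> 42
    ORDER BY distance
    LIMIT 10;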

What Capabilities It Adds

The building blocks shown above allow us to run advanced queries. First, we can run “semantic similarity queries”.

For example, “find the most similar customer (semantically) to a potential customer by industry”.
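
A minimal sketch of such a query, assuming hypothetical customers and prospects tables with an industry column (the schema is illustrative, not from the original post):

    -- Illustrative sketch: the existing customer whose industry is semantically
    -- closest to the industry of prospect number 1.
    SELECT c.customer_id,
           c.industry,
           cosineDistance(vec(c.industry), vec(p.industry)) AS distance
    FROM   customers c, prospects p
    WHERE  p.prospect_id = 1
    ORDER BY distance
    LIMIT 1;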

Similarly, we can find all pairs of products a, b (only once per pair) and order them by similarity.
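
A sketch of this pairwise version, again under an assumed products schema; the a.product_id < b.product_id predicate is what keeps each pair only once:

    -- Illustrative sketch: all product pairs, each pair once, most similar first.
    SELECT a.product_name AS product_a,
           b.product_name AS product_b,
           cosineDistance(vec(a.product_name), vec(b.product_name)) AS distance
    FROM   products a, products b
    WHERE  a.product_id < b.product_id
    ORDER BY distance;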

Recall that word embeddings also support inductive reasoning (the classic “king is to man as queen is to ?”); in other words, we can calculate analogies.

For example, “find all pairs of products a, b which relate to each other as peanut butter relates to jelly”.

First, we create a table with the product names and their distances, together with the reference distance between the peanut butter and jelly vectors. Afterward, we find all pairs of products a, b whose distance is close to that reference, i.e. pairs that relate to each other (distance-wise) roughly as peanut butter relates to jelly.
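
A rough sketch of that idea under the same assumed products schema; the 0.05 cutoff is an arbitrary illustrative threshold:

    -- Illustrative sketch: compute the reference distance between 'peanut butter'
    -- and 'jelly', then keep product pairs whose mutual distance is close to it.
    WITH reference AS (
        SELECT cosineDistance(vec('peanut butter'), vec('jelly')) AS ref_distance
    )
    SELECT a.product_name AS product_a,
           b.product_name AS product_b
    FROM   products a, products b, reference r
    WHERE  a.product_id < b.product_id
      AND  ABS(cosineDistance(vec(a.product_name), vec(b.product_name)) - r.ref_distance) < 0.05;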

Another new set of capabilities is fuzzy joins on database entities (columns, rows or even tables). There are many use cases where one would like to run a query without knowing exactly which column or table to use.

For example, “find all tickets of user ‘Moshe’ given an unknown, fuzzy foreign key between tickets and users”.

First, we declare two token entities e1, e2 that represent the unknown join columns in our database. Afterward, we find all the users called Moshe and their tickets: since we don’t know the join column on the ticket table, we effectively try to join on every column and keep the matches with small distances (i.e. high similarity). Note: we could enforce a 1:1 relation by ranking the candidate matches by distance and keeping only the closest one.
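
A rough sketch of such a fuzzy join using the Token and Contain constructs described above; the users/tickets schema and the 0.1 cutoff are my own illustrative assumptions:

    -- Illustrative sketch: e1 ranges over tokens from a user row, e2 over tokens
    -- from a ticket row; a ticket matches a user when some pair of their tokens
    -- is close in embedding space. The exact calling convention of Contain may
    -- differ, and 0.1 is an arbitrary threshold.
    SELECT t.*
    FROM   users u, tickets t, Token e1, Token e2
    WHERE  u.name = 'Moshe'
      AND  Contain(u, e1)
      AND  Contain(t, e2)
      AND  cosineDistance(vec(e1), vec(e2)) < 0.1;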

The Last Word

This is a step towards empowering database systems with machine learning capabilities. These capabilities are applicable to a wide variety of applications, including data integration, search, and more.

I hope you found it interesting and useful. I am open to any kind of constructive feedback.

