Powering SQL With Machine Learning Capabilities
Enabling Semantic Queries In SQL Using Word Embedding
In today’s world, the demand for machine learning keeps growing, which drives tremendous efforts to reduce the skills and effort these tasks require. One of the most promising strategies is to incorporate these capabilities into standard SQL. It started with a paper by Oded Shmueli and has since been implemented for Elasticsearch and PostgreSQL.
To enable these capabilities in SQL, we need to attach actual semantic meaning to database entities (rows/columns/tables). We do so by incorporating word embedding vectors into our database, so we can ask queries such as “show me all the rows similar to this row,” which is something you can’t do with standard SQL.
All we need to get this magic in our database is to incorporate word embeddings in the right way.
How Does It Work
To quantify semantic relationships inside the database, we need a way to represent the semantic meaning of database entities alongside the standard database relations.
Embedding is a technique for mapping a sequence to a numerical vector space, where the vectors capture various aspects of the associated sequence (here we use word2vec). The proposed solution is to construct embeddings of the database entities in order to represent their semantic meaning, and then to access these vectors and compare their closeness via user-defined functions (UDFs).
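To make “closeness in vector space” concrete, here is a minimal sketch with invented toy vectors (real word2vec embeddings have hundreds of dimensions): semantically related tokens end up with a smaller cosine distance than unrelated ones.

```python
import math

# Toy 3-dimensional "embeddings" (real word2vec vectors have 100-300 dims).
embeddings = {
    "car":   [0.9, 0.1, 0.0],
    "truck": [0.8, 0.2, 0.1],
    "apple": [0.0, 0.9, 0.3],
}

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, larger means less similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / norm

# Semantically related tokens are closer in vector space.
d_related = cosine_distance(embeddings["car"], embeddings["truck"])
d_unrelated = cosine_distance(embeddings["car"], embeddings["apple"])
print(d_related < d_unrelated)  # True
```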
The main phases of our query execution can be described by the following figure:
So we are going to enhance queries, allowing us to quantify semantic relationships inside the database, using the following UDFs and constructs:
- cosineDistance(a, b) — takes vectors a and b and returns their cosine distance.
- vec(token) — takes a token and returns its associated vector.
- Token entity e — declares a variable e that can be bound to tokens.
- Contain(row, entity) — states that entity must be bound to a token generated by tokenizing row.
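As a sketch of how the two function-style building blocks could be wired into a database, here is SQLite (via Python’s sqlite3 module) used as a stand-in for PostgreSQL: the UDF names cosineDistance and vec come from the list above, while the toy vectors and JSON serialization are my own assumptions.

```python
import json
import math
import sqlite3

# Toy embedding store; in practice these would be pre-trained word2vec vectors.
EMBEDDINGS = {
    "retail":  [0.9, 0.1, 0.0],
    "grocery": [0.8, 0.2, 0.1],
    "banking": [0.0, 0.9, 0.2],
}

def vec(token):
    """vec(token): look up a token's vector, serialized as JSON for SQL transport."""
    return json.dumps(EMBEDDINGS[token.lower()])

def cosine_distance(a_json, b_json):
    """cosineDistance(a, b): 1 - cosine similarity of two JSON-encoded vectors."""
    a, b = json.loads(a_json), json.loads(b_json)
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

conn = sqlite3.connect(":memory:")
conn.create_function("vec", 1, vec)
conn.create_function("cosineDistance", 2, cosine_distance)

near, far = conn.execute(
    "SELECT cosineDistance(vec('retail'), vec('grocery')), "
    "       cosineDistance(vec('retail'), vec('banking'))"
).fetchone()
print(near < far)  # True: 'retail' is closer to 'grocery' than to 'banking'
```

The Token entity and Contain constructs need deeper query-language support (they bind variables during evaluation), so they are beyond what a plain scalar UDF can express.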
What Capabilities It Adds
The building blocks shown above allow us to run advanced queries. First, we can run “semantic similarity queries”.
For example, “Finding the most similar customer (semantically) to a potential customer by industry”.
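A sketch of how such a query could look, again using SQLite with hypothetical vec/cosineDistance UDFs and invented table, column, and customer names: we order customers by the cosine distance between their industry’s embedding and the potential customer’s industry.

```python
import json
import math
import sqlite3

# Toy industry embeddings; a real system would use pre-trained word2vec vectors.
EMBEDDINGS = {
    "retail":    [0.9, 0.1, 0.0],
    "grocery":   [0.8, 0.2, 0.1],
    "banking":   [0.0, 0.9, 0.2],
    "insurance": [0.1, 0.8, 0.3],
}

def vec(token):
    return json.dumps(EMBEDDINGS[token.lower()])

def cosine_distance(a_json, b_json):
    a, b = json.loads(a_json), json.loads(b_json)
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

conn = sqlite3.connect(":memory:")
conn.create_function("vec", 1, vec)
conn.create_function("cosineDistance", 2, cosine_distance)

conn.execute("CREATE TABLE customers (name TEXT, industry TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [
    ("Acme",    "retail"),
    ("Initech", "banking"),
    ("Globex",  "insurance"),
])

# Most similar existing customer to a potential customer in the 'grocery' industry.
best = conn.execute("""
    SELECT name
    FROM customers
    ORDER BY cosineDistance(vec(industry), vec('grocery'))
    LIMIT 1
""").fetchone()[0]
print(best)  # Acme
```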
Recall that word embeddings also support inductive reasoning (the classic “king is to man as queen is to ?”); in other words, we can compute analogies.
For example, “Find all pairs of products a, b which relate to each other as peanut butter relates to jelly”.
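Analogies reduce to vector arithmetic: king − man + woman should land nearest to queen. Here is a minimal sketch with hand-crafted toy vectors (real word2vec vectors exhibit this structure in hundreds of dimensions); the words and values are invented for illustration.

```python
import math

# Toy 4-d embeddings chosen so the "royalty" and "gender" directions separate.
E = {
    "king":  [0.9, 0.8, 0.1, 0.0],
    "queen": [0.9, 0.1, 0.8, 0.0],
    "man":   [0.1, 0.9, 0.1, 0.2],
    "woman": [0.1, 0.1, 0.9, 0.2],
    "apple": [0.0, 0.2, 0.2, 0.9],
}

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Analogy: king - man + woman -> nearest remaining word should be 'queen'.
target = [k - m + w for k, m, w in zip(E["king"], E["man"], E["woman"])]
best = max((word for word in E if word not in ("king", "man", "woman")),
           key=lambda word: cosine_sim(E[word], target))
print(best)  # queen
```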
Another new capability is fuzzy joins on database entities (columns, rows, or even tables). There are many use cases where one would like to run a query without knowing exactly which column or table to use.
For example, “find all tickets of user ‘Moshe’ given an unknown, fuzzy foreign key between tickets and users”.
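One way such a fuzzy foreign key could be guessed is by embedding column names and picking the closest pair across the two tables; the schema, column names, and vectors below are invented for illustration. Once the key is found, a standard join retrieves the tickets.

```python
import math

# Toy embeddings for column names; a real system would embed column names
# (and possibly sampled values) with pre-trained word2vec.
E = {
    "user":      [0.9, 0.1, 0.0],
    "customer":  [0.8, 0.2, 0.1],
    "price":     [0.0, 0.9, 0.2],
    "ticket_id": [0.1, 0.1, 0.9],
}

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

tickets_cols = ["ticket_id", "customer", "price"]  # tickets(...) schema

# Guess the tickets column that fuzzily matches users.user: the one whose
# name embedding is closest becomes the candidate foreign key for the join.
fk = max(tickets_cols, key=lambda c: cosine_sim(E[c], E["user"]))
print(fk)  # customer
```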
The Last Word
This is a step towards empowering database systems with machine learning capabilities. These capabilities are applicable to a wide variety of applications, including data integration, search, and more.
I hope you found it interesting and useful; I am open to any kind of constructive feedback.