
Natural Language to SQL from Scratch with TensorFlow

An end-to-end tutorial on training a machine learning model to translate natural language questions/instructions into SQL queries using…

Introduction

In this blog post, we’re going to look at an interesting task: translating natural language to SQL. The academic term for this is natural language interface to databases (NLIDB). Even though NLIDB is still an area of active research, building a model for one simple table is actually pretty straightforward. We’ll do that for an employee table with 3 columns: name, gender, and salary (as shown in Figure-1). By the end of this blog post, you will learn how to go from a natural language input string such as show me the names of the employees whose income is higher than 50000 to the SQL query output select e.name from employee e where e.salary > 50000.

Figure-1: Employee Table

Overview

In this section, we explain the core idea on a high level.

The core of this task is a machine translation problem. While it may be tempting to throw a sequence-to-sequence machine translation model at it and go directly from the natural language input to the SQL query output, that does not perform well in practice. The main reason is that the model may encounter out-of-vocabulary (OOV) tokens for column values. The model can learn to tolerate other minor unknown words to some extent, but OOV column values are fatal. Imagine the example from the introduction with a salary value the training dataset doesn’t cover – 60000, 70000, 80000, you name it – there will always be a salary number that’s out of vocabulary. The same goes for the name column. OOV tokens are mapped to a [UNK] symbol before being fed to the translation model, so there is no way for the model to reconstruct the exact column values in the SQL output.

The typical way to handle this is to run the raw input through a process called schema linking, which identifies and caches the column values, and substitutes them with placeholders that the model has seen during training. For instance, the input example from the introduction will become show me the names of the employees whose income is higher than [salary_1] after schema linking. During training, the model is given label output like select e.name from employee e where e.salary > [salary_1]. So the model is in fact learning how to translate from column-values-substituted natural language input to column-values-substituted SQL output. The final step is just to fill in the placeholders by looking up the corresponding column values cached by the schema linking step.

Now with the high level understanding, let’s dive deep into the concrete steps.

Data Acquisition

We’ll start with data acquisition. To have full control of the end-to-end process and also to make it more fun, we are going to generate the training data from scratch. The expected outcome of this data generation step is a list of pairs of column-values-substituted natural language sentences and column-values-substituted SQL queries. While you can just go ahead and handcraft ten thousand training pairs, it’s tedious and may not include a lot of the linguistic variation that the model will encounter during inference.

Instead, we only manually create a few dozen training pair templates to bootstrap the data generation, and then sample the possible column names and conditions to instantiate the training pairs. A template may look like the following:

Show me [cols] of employees whose [cond_cols] order by [comp_cols] in descending order
select [cols] from employee where [cond_cols] order by [comp_cols] desc

We can then sample a combination of employee columns (name, gender, salary) to fill in the [cols] part, sample a combination of predefined conditions (e.g., gender = [gender_1], salary > [salary_1] and salary < [salary_2]) to fill in the [cond_cols] part, and sample a combination of comparable columns (only salary in this case) to fill in the [comp_cols] part. Obviously, we need to define the possible content of [cols], [cond_cols], and [comp_cols] for both the natural language sentences and the SQL queries beforehand. This is often called specifying the ontology of the domain.
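The snippet below is a minimal sketch of this instantiation step. The ontology lists and the instantiate helper are hypothetical stand-ins for whatever slot fillers you define for your own domain; for an ontology this small we simply enumerate every combination instead of sampling.

```python
import itertools

# Hypothetical ontology: each slot filler has a natural language surface form
# and a SQL surface form.
COLS = [("the names", "e.name"), ("the salaries", "e.salary"), ("the genders", "e.gender")]
COND_COLS = [("income is higher than [salary_1]", "e.salary > [salary_1]"),
             ("gender is [gender_1]", "e.gender = '[gender_1]'")]
COMP_COLS = [("salary", "e.salary")]

NL_TEMPLATE = "show me [cols] of employees whose [cond_cols] order by [comp_cols] in descending order"
SQL_TEMPLATE = "select [cols] from employee e where [cond_cols] order by [comp_cols] desc"

def instantiate(nl_template, sql_template):
    """Enumerate the slot fillers to turn one template into many training pairs."""
    pairs = []
    for (nl_col, sql_col), (nl_cond, sql_cond), (nl_comp, sql_comp) in \
            itertools.product(COLS, COND_COLS, COMP_COLS):
        nl = (nl_template.replace("[cols]", nl_col)
                         .replace("[cond_cols]", nl_cond)
                         .replace("[comp_cols]", nl_comp))
        sql = (sql_template.replace("[cols]", sql_col)
                           .replace("[cond_cols]", sql_cond)
                           .replace("[comp_cols]", sql_comp))
        pairs.append((nl, sql))
    return pairs

training_pairs = instantiate(NL_TEMPLATE, SQL_TEMPLATE)
```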

By instantiating concrete pairs, we can easily go from dozens of training pairs to hundreds. Next, we need to augment the training pairs to include more linguistic variation. We can do that by replacing phrases in the natural language sentences with paraphrases. Essentially, for every one- or multi-word phrase in a sentence, we create a new sentence by swapping it for a paraphrase. The SQL query stays unchanged because we are focusing on linguistic variation in the natural language sentence. Each new sentence, paired with the original SQL query, is then added to the training dataset. We get the paraphrases from a paraphrase database. See this blog post for the concrete code of how to use a paraphrase database for natural language augmentation. This takes us from hundreds of training pairs to thousands, which suffices for this task.
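As a rough illustration of the idea (the linked post has the full version), the sketch below assumes the paraphrase database has already been loaded into a plain Python dict mapping phrases to alternatives; the toy PARAPHRASES dict and augment helper are hypothetical.

```python
# Toy stand-in for a paraphrase database (e.g. PPDB) loaded as phrase -> paraphrases.
PARAPHRASES = {
    "show me": ["give me", "list"],
    "higher than": ["greater than", "above"],
}

def augment(nl, sql):
    """Create new (sentence, query) pairs by swapping in paraphrases;
    the SQL side is left untouched."""
    augmented = []
    for phrase, alternatives in PARAPHRASES.items():
        if phrase in nl:
            for alt in alternatives:
                augmented.append((nl.replace(phrase, alt), sql))
    return augmented
```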

Natural Language to SQL Model

Now that we have thousands of column-values-substituted natural language and SQL query pairs, we can build our translation model. We use a sequence-to-sequence model with an attention mechanism, detailed in this blog post. See Figure-2 for the RNN part of the architecture.

Figure-2: RNN Architecture

We have one change to the embedding part of the architecture though. The original translation model from the machine translation blog post only learns input word embeddings from scratch. In this model, we are still going to learn the input word embeddings, but in addition, we’ll load a fixed pre-trained word embedding layer, and concatenate the output of these two embedding layers to feed to the remaining parts of the model. See Figure-3 for an illustration.

Figure-3: Natural Language Embedding Architecture

The rationale is that we may not have sufficient training data to represent the linguistic variation of the natural language input. So we want a pre-trained word embedding layer. But at the same time, we want to maintain our own trained-from-scratch embedding to capture the nuances in this particular domain. The following is the code snippet to load a fixed pre-trained word embedding layer.
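The sketch below assumes GloVe-style vectors stored in a local text file (the path and 100-dimensional size are placeholders) and that the input vocabulary comes from nl_text_processor.get_vocabulary():

```python
import numpy as np
import tensorflow as tf

EMBEDDING_DIM = 100  # must match the pre-trained vectors, e.g. glove.6B.100d

def load_fixed_embedding_layer(glove_path, vocabulary):
    """Build a non-trainable Embedding layer initialised with pre-trained vectors.

    `vocabulary` is the word list from the natural language text processor,
    e.g. nl_text_processor.get_vocabulary().
    """
    # Read the pre-trained vectors into a dict: word -> vector.
    pretrained = {}
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            word, *vec = line.split()
            pretrained[word] = np.asarray(vec, dtype="float32")

    # Rows for words missing from the pre-trained vocabulary stay zero.
    matrix = np.zeros((len(vocabulary), EMBEDDING_DIM), dtype="float32")
    for idx, word in enumerate(vocabulary):
        if word in pretrained:
            matrix[idx] = pretrained[word]

    return tf.keras.layers.Embedding(
        input_dim=len(vocabulary),
        output_dim=EMBEDDING_DIM,
        embeddings_initializer=tf.keras.initializers.Constant(matrix),
        trainable=False,  # keep the pre-trained embeddings fixed
    )
```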

Once we have that, we’ll use code similar to the machine translation [blog post](https://towardsdatascience.com/end-to-end-attention-based-machine-translation-model-with-minimum-tensorflow-code-ae2f08cc8218). The only difference is that, as mentioned above, we’ll concatenate the output of the fixed embedding layer with the output of the natural language embedding layer and feed the result to the recurrent neural network layer (instead of feeding in the output of the natural language embedding layer alone). The following snippet shows how that change looks on the encoder side. For how to create the model initialisation parameters nl_text_processor and sql_text_processor, and for the training loop code, feel free to refer to the machine translation blog post.
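This is a minimal sketch of the encoder only, under the assumption that the decoder, attention layer, and loss follow the machine translation post unchanged; the Encoder class name and constructor arguments are illustrative.

```python
import tensorflow as tf

class Encoder(tf.keras.Model):
    """Encoder with two embedding layers: one trained from scratch and one
    fixed pre-trained layer; their outputs are concatenated before the GRU."""

    def __init__(self, vocab_size, embedding_dim, units, fixed_embedding):
        super().__init__()
        self.trainable_embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.fixed_embedding = fixed_embedding  # e.g. built by load_fixed_embedding_layer
        self.gru = tf.keras.layers.GRU(units,
                                       return_sequences=True,
                                       return_state=True)

    def call(self, tokens, state=None):
        # (batch, time, embedding_dim) and (batch, time, EMBEDDING_DIM)
        learned = self.trainable_embedding(tokens)
        pretrained = self.fixed_embedding(tokens)
        # Concatenate along the feature axis before the recurrent layer.
        vectors = tf.concat([learned, pretrained], axis=-1)
        output, state = self.gru(vectors, initial_state=state)
        return output, state
```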

We train the model for 10 epochs and try it out with a few natural language inputs:

Translation Training
Input: show me the name of the employee whose salary is higher than [salary_1]
Output: select e.name from employee e where e.salary > [salary_1]
Input: get me employee with name [name_1]
Output: select * from employee e where e.name like '%[name_1]%'
Input: which employee has the highest salary
Output: select * , e.salary from employee e order by e.salary desc limit 1

Schema Linking Model

Now, the only task left is to link the column values. Given a natural language sentence like show me the names of the employees whose income is higher than 50000, the schema linking step should output show me the names of the employees whose income is higher than [salary_1], which can be consumed by the translation model, while remembering that [salary_1] is 50000. When the translation model outputs select e.name from employee e where e.salary > [salary_1], we write the actual column value 50000 back in to create the final executable SQL query select e.name from employee e where e.salary > 50000.

There are many ways to link the raw input to the schema. Some use rule-based syntactic detection to identify potential column values. Others scan variable-sized spans of the input and measure how similar each span is to a column. The similarity can be measured by the Euclidean distance between the span’s word embeddings and the column’s aggregate embedding, which can be pre-computed by summing the word embeddings of a representative sample of the column’s values.

We’re going to explore a novel approach here. We’ll use the same machine translation method to predict/generate an output mask for any given input. Let’s map each predefined placeholder to an integer and every other (non-column-value) token to zero.

[name_1]: 1, [salary_1]: 2, [salary_2]: 3, [gender_1]: 4, others: 0

Then, let’s generate the schema linking training data by replacing the placeholders from the translation training pairs with random values. See the following code snippet for more details:
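A minimal sketch of this generation step follows; the RANDOM_VALUES pools are toy stand-ins for whatever realistic values you sample in practice:

```python
import random

PLACEHOLDER_LABELS = {"[name_1]": 1, "[salary_1]": 2, "[salary_2]": 3, "[gender_1]": 4}

# Toy value pools; in practice these would be sampled from realistic data.
RANDOM_VALUES = {
    "[name_1]": ["john joe", "jane smith", "mary ann lee"],
    "[salary_1]": ["50000", "62000", "81000"],
    "[salary_2]": ["90000", "120000"],
    "[gender_1]": ["male", "female"],
}

def make_linking_pair(nl_with_placeholders):
    """Replace placeholders with random values and emit the aligned label mask."""
    words, labels = [], []
    for token in nl_with_placeholders.split():
        if token in PLACEHOLDER_LABELS:
            value_words = random.choice(RANDOM_VALUES[token]).split()
            words.extend(value_words)
            # Every word of a multi-word value gets the placeholder's label.
            labels.extend([PLACEHOLDER_LABELS[token]] * len(value_words))
        else:
            words.append(token)
            labels.append(0)
    return " ".join(words), " ".join(str(label) for label in labels)

sentence, mask = make_linking_pair(
    "all employees with name [name_1] and salary [salary_1] and gender [gender_1]")
```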

This gives us schema linking training pairs like the following, which we can feed into the exact same machine translation model architecture as before.

All employees with name John Joe and salary 50000 and gender male
 0      0       0    0    1   1   0     0     2    0     0     4

We train it for 10 epochs, and try a few input examples.

Schema Linking Training
# [name_1]: 1, [salary_1]: 2, [salary_2]: 3, [gender_1]: 4, others: 0
Input: name of the employee whose salary is higher than 50000
Output: 0    0  0     0       0      0    0    0     0    2
Input: get me employee whose name is John Joe
Output: 0   0     0      0     0   0   1   1
Input: show me employees whose salary is between 450000 and 650000
Output: 0    0     0       0      0    0    0       2    0     3

Now, we just need to use the predicted output as a mask to mark the placeholders in the raw input. Note that in the case of multi-word values such as names, we’ll need to collapse the consecutive words into a single placeholder.
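The following sketch shows one way to do that and to write the cached values back into the final SQL; the apply_mask and fill_back helpers are hypothetical.

```python
ID_TO_PLACEHOLDER = {1: "[name_1]", 2: "[salary_1]", 3: "[salary_2]", 4: "[gender_1]"}

def apply_mask(words, mask):
    """Collapse consecutive identically-labelled words into one placeholder
    and remember the original value for each placeholder."""
    linked, values = [], {}
    for word, label in zip(words, mask):
        if label == 0:
            linked.append(word)
        else:
            placeholder = ID_TO_PLACEHOLDER[label]
            if linked and linked[-1] == placeholder:
                values[placeholder] += " " + word  # multi-word value, e.g. a name
            else:
                linked.append(placeholder)
                values[placeholder] = word
    return " ".join(linked), values

def fill_back(sql, values):
    """Write the cached column values back into the predicted SQL."""
    for placeholder, value in values.items():
        sql = sql.replace(placeholder, value)
    return sql

# Example: the linked sentence goes to the translation model, and the cached
# values are written back into its SQL output.
words = "name of the employee whose salary is higher than 50000".split()
mask = [0, 0, 0, 0, 0, 0, 0, 0, 0, 2]
linked_sentence, values = apply_mask(words, mask)
final_sql = fill_back("select e.name from employee e where e.salary > [salary_1]", values)
```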

At this point, we have all the steps required to go from a natural language input to an executable SQL output.

Recommended Papers

  • An End-to-end Neural Natural Language Interface for Databases (link).
  • Neural Approaches for Natural Language Interfaces to Databases: A Survey (link).
  • Zero-shot Text-to-SQL Learning with Auxiliary Task (link).
  • A Transfer-Learnable Natural Language Interface for Databases (link).
  • Towards Complex Text-to-SQL in Cross-Domain Database with Intermediate Representation (link).
  • PPDB: The Paraphrase Database (link).
  • Learning a Neural Semantic Parser from User Feedback (link).
  • A Comparative Survey of Recent Natural Language Interfaces for Databases (link).
