
AI-enabled conversations with Analytics tables

A review of AI progress in transforming natural language database queries to SQL code


Written by Gowtham R and Sundeep Teki on November 1, 2020

1. Introduction

In recent years, the amount of data powering different industries and their systems has been increasing exponentially. The majority of business information is stored in relational databases, which store, process, and retrieve data. Databases power information systems across multiple industries, for instance, consumer tech (e.g. orders, cancellations, refunds), supply chain (e.g. raw materials, stocks, vendors), healthcare (e.g. medical records), finance (e.g. financial business metrics), customer support, search engines, and much more.

It is imperative for modern data-driven companies to track the real-time state of their business in order to quickly understand and diagnose any emerging issues, trends, or anomalies in the data and take immediate corrective action. This work is usually performed manually by business analysts, who compose complex queries in declarative query languages like SQL to derive business insights from data stored across multiple tables. The results are typically presented as charts or graphs that enable leadership teams to quickly visualize them and facilitate data-driven decision making.

Although the most common SQL queries that address fundamental business metrics come predefined and incorporated in commercial products like Power BI, any new or follow-up business query still needs to be manually coded by analysts. Such static interactions between database queries and consumption of the corresponding results require time-consuming manual intervention and result in slow feedback cycles. It is vastly more efficient to have non-technical business leaders directly interact with the analytics tables via natural language queries that abstract away the underlying SQL code.

Composing a SQL query requires a strong understanding of the database schema and SQL syntax, and can quickly become overwhelming for beginners and non-technical stakeholders. Efforts to bridge this communication gap have led to the development of a new type of interface, the Natural Language Interface to Database (NLIDB). This natural search capability has become more popular in recent years as companies such as Microsoft [1][2], Salesforce [3][4], and others develop technologies for converting natural language (NL) to SQL (NL2SQL). The converted SQL could also enable virtual assistants like Alexa, Google Home, and others to improve their responses when the answer lies in a database or table. This blog will review the challenges, evaluation methods, datasets, different approaches, and some state-of-the-art deep learning approaches for NL2SQL.

Figure 1. Example of NL query converted to SQL from WikiSQL dataset (source)
Figure 2. Example of NL2SQL involving multiple tables (source)

2. Technical challenges

2.1 Understanding NL query and aligning utterance with schema

The system must understand both the user’s question and the table schemas (columns, table names, and values) to map the query to SQL correctly. A key challenge here is understanding the structured schema of DB tables (e.g., the name, data type, and stored values of columns) and the alignment between the input query and the schema. For instance, for the question “Which country has the largest GDP?”, the model needs to map “GDP” to the Gross Domestic Product column. Sometimes the question might also require understanding the semantics of a column rather than just its name.

Figure 3. Some queries might require understanding cell values (source)

For the table and question shown in Figure 3, the Venue column needed to answer the example question refers to host cities. Hence, the model needs to align “city” in the query with the Venue column in the table.

2.2 Generalization to cross-domains

Figure 4. Examples of possible errors in cross-domain setting (source)

Collecting large training datasets for different domains is expensive and does not scale. Hence, it is important to train systems to generalize to different domains and databases. This generalization involves identifying new entities, mapping unseen phrases and entities correctly into the SQL query, and handling novel database and query structures (larger tables, novel compositions of SQL components, etc.) [5].

2.3 Order matters problem

One of the standard ways to solve the NL2SQL task is to use sequence-to-sequence (seq2seq) models and their variants, since both the NL query and the SQL query are sequences. One issue with this approach is that different SQL queries may be equivalent to each other due to the commutative and associative properties of SQL clauses.

Figure 5. Two ways to write the same SQL query (source)

In the above example, swapping the WHERE conditions yields the same results, yet the two queries are syntactically different. If the former is the ground truth and the seq2seq model predicts the latter, the model is unfairly penalized.

3. Datasets

Table 1. Overview of some of the tabular datasets for NL2SQL (source)

There are several datasets for NL2SQL tasks. These contain annotated NL question–SQL pairs corresponding to one or more tables. The datasets differ in domain coverage (single vs. cross-domain), size (number of queries, which matters for reliable model evaluation), and query complexity (single table vs. multi-table).

Early datasets like ATIS [6] and GeoQuery focus on single domains and are also limited in the number of queries. More recent datasets like WikiSQL [3] and Spider [14] are cross-domain and context-independent, and are much larger. One significant difference between WikiSQL and Spider is query complexity: queries in WikiSQL are simpler (covering only SELECT and WHERE clauses), and each database in WikiSQL is a single table without any foreign keys. Spider contains a modest number of queries but includes complex questions that involve table joins and nested queries. SParC [15] and CoSQL [16] are extensions of the Spider dataset, created for contextual cross-domain semantic parsing and conversational dialog text-to-SQL systems, respectively.

4. Evaluation methods

The most common methods to evaluate NL2SQL systems are execution accuracy and logical form accuracy.

Execution accuracy compares the result of executing the predicted SQL query with the result of the ground truth query. One downside of this method is that a predicted SQL query that is unrelated to the question can still return the right answer (for example, a NULL result) and be counted as correct.
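To make this concrete, here is a minimal sketch of execution accuracy using Python’s built-in sqlite3 module. The cars table, its rows, and both queries are hypothetical, chosen only to illustrate the comparison; note how the two syntactically different queries from the “order matters” problem above would match under this metric.

```python
import sqlite3

# A toy in-memory database (hypothetical schema and rows, for illustration only)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (model TEXT, cylinders INT, horsepower INT)")
conn.executemany("INSERT INTO cars VALUES (?, ?, ?)",
                 [("amc rebel", 8, 150), ("audi 100ls", 4, 90), ("bmw 2002", 4, 113)])

def execution_match(predicted_sql, gold_sql):
    """Execution accuracy: do both queries return the same result set?"""
    pred = conn.execute(predicted_sql).fetchall()
    gold = conn.execute(gold_sql).fetchall()
    # Compare as sorted lists so row order does not matter
    return sorted(pred) == sorted(gold)

# Syntactically different queries with swapped WHERE conditions (cf. Figure 5)
gold = "SELECT model FROM cars WHERE cylinders = 4 AND horsepower > 100"
pred = "SELECT model FROM cars WHERE horsepower > 100 AND cylinders = 4"
print(pred == gold)                 # False: exact string match penalizes this
print(execution_match(pred, gold))  # True: execution accuracy does not
```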

Figure 6. Example of SQL canonicalization (image by the author)

Logical form accuracy compares the predicted SQL query with the ground truth query by exact string match. This metric has the limitation of incorrectly penalizing predictions that yield correct results upon execution but do not exactly match the ground truth string. One approach to the ordering issue is to canonicalize SQL queries before comparison [17]. SQL canonicalization makes evaluation consistent by ordering the columns in SELECT, the tables in FROM, and the constraints in WHERE, and by standardizing table aliases, capitalization, and spacing between symbols.
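The sketch below shows the flavor of such canonicalization on a single-table query. It is a deliberately toy version (lowercasing, whitespace normalization, and sorting of WHERE conjuncts), far simpler than the full method in [17]:

```python
import re

def canonicalize(sql):
    """A toy SQL canonicalizer: normalize case and whitespace,
    then sort the WHERE conjuncts alphabetically."""
    sql = re.sub(r"\s+", " ", sql.strip()).lower()
    match = re.match(r"(.*\bwhere\b)(.*)", sql)
    if match:
        head, conds = match.groups()
        # 'a AND b' and 'b AND a' now canonicalize to the same string
        conds = " and ".join(sorted(c.strip() for c in conds.split(" and ")))
        sql = f"{head} {conds}"
    return sql

q1 = "SELECT model FROM cars WHERE cylinders = 4 AND horsepower > 100"
q2 = "select model  from cars where horsepower > 100 and cylinders = 4"
print(canonicalize(q1) == canonicalize(q2))  # True
```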

The authors of Spider [19] use component matching, which measures the average exact match between the prediction and the ground truth on individual SQL components like SELECT, WHERE, GROUP BY, etc. The prediction and the ground truth are parsed and decomposed into subcomponents, and their exact match is then calculated component-wise.

For example, to evaluate the SELECT component, SELECT avg(col1), max(col2), min(col1) is decomposed into the set {(avg, min, col1), (max, col2)}, and this set is then compared with the corresponding ground truth set.
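A simplified sketch of this idea, treating each aggregator–column pair as its own subcomponent so that ordering no longer matters (the actual Spider evaluation script handles many more cases):

```python
import re

def select_components(sql):
    """Decompose the SELECT clause into a set of (aggregator, column)
    pairs -- a simplified sketch of Spider-style component matching."""
    select = re.search(r"select (.*?) from", sql, re.IGNORECASE).group(1)
    components = set()
    for item in select.split(","):
        m = re.match(r"(\w+)\((\w+)\)", item.strip())
        components.add((m.group(1), m.group(2)) if m else ("none", item.strip()))
    return components

pred = "SELECT avg(col1), max(col2), min(col1) FROM t"
gold = "SELECT min(col1), avg(col1), max(col2) FROM t"  # same components, new order
print(select_components(pred) == select_components(gold))  # True
```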

Figure 7. Sample questions from the Spider dataset categorized by hardness (source)

Even though this takes care of the ordering issue, it still does not account for predictions that use different logic than the ground truth SQL to arrive at the same result. Hence, for a thorough evaluation, execution accuracy should also be used.

The authors of [19] also categorize queries by hardness based on the number of SQL components, selections, and conditions. This categorization is very helpful for gaining insight into model performance as a function of query complexity.

5. Different approaches for NL2SQL

Figure 8. Categorization of NLIs and the NLP technologies they use (source)

5.1 Rule-based approaches

Many existing approaches pair a rule-based natural language parser with ambiguity detection. Some rule-based systems use trigger words to identify patterns in the user’s question. For example, “by” is a common trigger word in aggregation queries like “List the movies directed by <director>”. Here, the text to the left of the trigger word suggests the keywords needed for the SELECT clause, and the text to the right suggests the keywords needed for the GROUP BY clause.

Figure 9. A simple pattern to match the country’s capital from countries table (source)

Despite its simplicity, this approach (if the rules are well-formed) has been shown to handle a surprisingly broad range of queries. Modern conversational agents such as Siri and Cortana follow a similar principle, although their rules are not deterministic but learned (e.g., a logistic regression classifier for intent).
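A hypothetical trigger-pattern rule in the spirit of Figure 9 might look like the sketch below; the pattern, table name, and column names are all assumptions for illustration:

```python
import re

# Hypothetical rule: questions like "What is the capital of France?"
# map to a fixed SQL template over an assumed 'countries' table.
CAPITAL_RULE = re.compile(r"what is the capital of (?P<country>[\w ]+)\?",
                          re.IGNORECASE)

def rule_based_parse(question):
    m = CAPITAL_RULE.match(question)
    if m:
        return f"SELECT capital FROM countries WHERE name = '{m.group('country')}'"
    return None  # no rule fired; try the next pattern

print(rule_based_parse("What is the capital of France?"))
# SELECT capital FROM countries WHERE name = 'France'
```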

5.2 Grammar-based systems

In grammar-based systems, the user’s question is parsed, and the resulting parse tree is directly mapped to an expression in SQL. A grammar is created that describes the possible syntactic structures of users’ questions.

Figure 10. An over-simplistic grammar (source)

The over-simplistic grammar shown in the figure considers the user’s question (S) to be composed of a noun phrase (NP) and a verb phrase (VP); a noun phrase consists of a determiner (Det) followed by a noun (N); a determiner is the word “Which” or “What”; and so on.

Figure 11. Resultant parse tree after parsing “Which rock contains magnesium?” with the above grammar (source)

This grammar can then be used to parse a question like “Which rock contains magnesium?” into a parse tree, which is then mapped to SQL. This mapping is carried out by rules and relies entirely on the syntactic information in the parse tree.
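For a feel of how this works, the over-simplistic grammar of Figure 10 can be transcribed as an NLTK context-free grammar (a sketch; real grammar-based NLIDBs use far richer grammars and mapping rules):

```python
import nltk

# The toy grammar from Figure 10, written as an NLTK CFG
grammar = nltk.CFG.fromstring("""
    S   -> NP VP
    NP  -> Det N
    VP  -> V N
    Det -> 'Which' | 'What'
    N   -> 'rock' | 'magnesium'
    V   -> 'contains'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("Which rock contains magnesium".split()):
    tree.pretty_print()  # the parse tree of Figure 11
# The tree would then be mapped to SQL by handwritten rules
# over its syntactic structure.
```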

5.3 Deep Learning based-approaches

Rule-based approaches are limited in coverage, scalability, and naturalness. They are not robust to the diversity of natural language and are very difficult to scale across domains. The advent of large-scale supervised datasets like WikiSQL and Spider, together with advances in natural language processing such as pretraining [20], has enabled deep learning models to achieve state-of-the-art results on NL2SQL tasks.

Almost all deep learning models generate the SQL query from the natural language input with an encoder-decoder [21] model. The encoder can be an RNN [22], an LSTM [24], or the more recent Transformer [25] network. The models mostly differ in how they encode the schema (table names, column names, cell values, etc.) and how they produce the SQL output.

Some models make the schema part of their output vocabulary. In other words, they put all the table names, column names, etc. into the output vocabulary and, while decoding the SQL output, select these words from the vocabulary. NSP [10] and DBPal [18] are two methods that use this approach. One major limitation is that such models cannot adapt to new domains, as they do not encode new schemas in their input.

Figure 12. Components of Seq2SQL model which takes the question, column names, and SQL tokens as input (source)

In contrast, other methods like Seq2SQL [3] take the schema as input to the model and, while decoding, select the table or column names mentioned in the input using pointer networks [26]. For instance, in Seq2SQL, the authors use the column names, the question tokens, and SQL tokens like SELECT, WHERE, COUNT, MIN, and MAX as input. Their pointer network produces the SQL query by selecting exclusively from this augmented input sequence. The authors also claim that, apart from limiting the output space, this augmented pointer network produces higher-quality WHERE clauses.

Figure 13. Example of a pointer network (source)
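The core of the pointer mechanism fits in a few lines. The sketch below (a minimal PyTorch illustration, not the exact Seq2SQL formulation) shows a single decoding step, where the output distribution is simply an attention distribution over the augmented input:

```python
import torch
import torch.nn.functional as F

def pointer_step(decoder_state, encoder_states):
    """One decoding step of a pointer network: score each input token
    against the decoder state and emit a distribution over input positions."""
    scores = encoder_states @ decoder_state  # (seq_len,)
    return F.softmax(scores, dim=0)

# Hypothetical augmented input: column names + SQL keywords + question tokens
tokens = ["venue", "year", "SELECT", "WHERE", "which", "city", "hosted"]
encoder_states = torch.randn(len(tokens), 64)  # toy encodings
probs = pointer_step(torch.randn(64), encoder_states)
print(tokens[probs.argmax()])  # the input token the decoder 'copies' next
```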

Based on how they generate the SQL query from the natural language input, there are three types of models: sequence-to-sequence, sequence-to-tree, and slot filling [23]. Sequence-to-sequence models generate the SQL as a sequence of words. Sequence-to-tree models generate a syntax tree of the predicted SQL query. Slot-filling methods treat the SQL query as a set of slots and decode the question with a dedicated decoder for each slot. An advantage of grammar-based (sequence-to-tree) decoders is that they can check for grammatical errors at every step, producing complex queries with joins, nested subqueries, etc. without any syntax errors.

5.4 Modern Deep Learning approaches

Modern deep learning approaches use additional techniques to learn joint representations over NL questions and the structured information present in tables. They use various attention-based architectures for question/schema encoding and AST-based (sequence-to-tree) architectures for query decoding. IRNet [1] and RAT-SQL (the current state-of-the-art approach on Spider) [2] use BERT [20] for NL representation along with in-house strategies to encode the structured information in tables. In contrast, TaBERT [27] uses a general-purpose pretraining approach to learn representations of natural language sentences and tabular data. These techniques include schema linking, better schema encoding, using DB content (cell values rather than just column and table names), and contextualizing question and schema representations.

5.4.1 Schema Linking

Figure 14. Schema linking (question to column and table linking) in RAT-SQL (source)

This involves aligning the entity references in the question with the corresponding schema columns or tables. Textual matches are the strongest evidence for question-schema alignment and can directly benefit the encoder. In IRNet and RAT-SQL, linking is generally done with string matching: n-grams (up to length 5 or 6) in the question are matched (both exactly and partially) against the column and table names in the schema. After linking, IRNet tags each entity mentioned in the question with the type of the corresponding entity (table name, column name, etc.) while encoding; the column names are also assigned the types EXACT MATCH and PARTIAL MATCH based on their n-gram overlap with the question words. RAT-SQL, on the other hand, constructs a graph with the question words and the column/table names as nodes, and edges such as QUESTION-COLUMN-M and QUESTION-TABLE-M, where M is one of EXACTMATCH, PARTIALMATCH, or NOMATCH.

Figure 15. Schema linking in IRNet (source)
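A simplified sketch of n-gram based linking (the real IRNet/RAT-SQL implementations involve lemmatization and more careful matching):

```python
def link_schema(question_tokens, columns, max_n=5):
    """Tag each column as EXACTMATCH, PARTIALMATCH, or NOMATCH
    against the n-grams of the question -- a simplified sketch."""
    ngrams = {" ".join(question_tokens[i:i + n]).lower()
              for n in range(1, max_n + 1)
              for i in range(len(question_tokens) - n + 1)}
    tags = {}
    for col in columns:
        name = col.lower()
        if name in ngrams:
            tags[col] = "EXACTMATCH"
        elif any(name in g or g in name for g in ngrams):
            tags[col] = "PARTIALMATCH"
        else:
            tags[col] = "NOMATCH"
    return tags

question = "For cars with 4 cylinders which model has the largest horsepower".split()
print(link_schema(question, ["model", "horsepower", "number of cylinders"]))
# {'model': 'EXACTMATCH', 'horsepower': 'EXACTMATCH',
#  'number of cylinders': 'PARTIALMATCH'}
```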

5.4.2 Value-based linking

The natural language question can also mention values (like “4” in “For cars with 4 cylinders, which model has the largest horsepower?”) that appear as cell values in some table. IRNet looks up the value mention from the question in a knowledge base and searches the returned results over the column names for partial or exact matches; the matched columns are assigned the types VALUE EXACT MATCH and VALUE PARTIAL MATCH. RAT-SQL, on the other hand, adds a COLUMN-VALUE edge between a question word and a column name if the question word occurs as a value in that column.
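A rough sketch of RAT-SQL-style COLUMN-VALUE edges (here `db` is a hypothetical mapping from column names to cell values; real systems index the database for this lookup):

```python
def value_link(question_tokens, db):
    """Add a COLUMN-VALUE edge when a question word occurs
    as a cell value in a column -- a rough sketch."""
    edges = []
    for word in question_tokens:
        for column, values in db.items():
            if word.lower() in {str(v).lower() for v in values}:
                edges.append((word, column, "COLUMN-VALUE"))
    return edges

db = {"cylinders": [4, 6, 8], "model": ["amc rebel", "bmw 2002"]}
print(value_link("For cars with 4 cylinders which model".split(), db))
# [('4', 'cylinders', 'COLUMN-VALUE')]
```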

TaBERT uses the DB content directly instead of linking through column names. The authors reason that a column’s contents reveal more about its semantics than its name alone, which may be ambiguous. They select a content snapshot consisting of only the few rows that are most relevant to the NL question based on string matching (n-gram overlap).
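A rough sketch of snapshot selection, scoring rows by simple token overlap rather than TaBERT’s n-gram matching:

```python
def content_snapshot(question_tokens, rows, k=3):
    """Keep the k rows with the highest token overlap with the question --
    a rough approximation of TaBERT's content snapshot."""
    q = {t.lower() for t in question_tokens}
    def overlap(row):
        return len(q & {str(cell).lower() for cell in row})
    return sorted(rows, key=overlap, reverse=True)[:k]

rows = [("amc rebel", 8, 150), ("audi 100ls", 4, 90), ("bmw 2002", 4, 113)]
print(content_snapshot("For cars with 4 cylinders".split(), rows, k=2))
# the two rows containing the value 4 rank highest
```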

5.4.3 Schema Encoding

Figure 16. Example Graph schema which has multiple tables (source)

This involves encoding the relational structure of the database. It is much more challenging for databases with multi-table relations, where encoding primary keys, foreign keys, etc. is essential. IRNet encodes both the columns and the tables to obtain column and table representations. Each column is represented by its name and its type, which is determined during schema linking; the final representation is created by adding the column name embedding, a context embedding (based on the n-grams matched in the question), and a type embedding.
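In code, this additive composition might look like the following sketch (the dimensions, the type inventory, and how the individual embeddings are produced are all assumptions here):

```python
import torch
import torch.nn as nn

d = 128  # hypothetical embedding size
# Hypothetical linking-type inventory (e.g. NOMATCH, EXACT MATCH, PARTIAL MATCH, ...)
type_embedding = nn.Embedding(5, d)

def encode_column(name_emb, context_emb, type_id):
    """IRNet-style column representation: the sum of the column name
    embedding, a context embedding, and a linking-type embedding."""
    return name_emb + context_emb + type_embedding(torch.tensor(type_id))

col_repr = encode_column(torch.randn(d), torch.randn(d), type_id=2)
print(col_repr.shape)  # torch.Size([128])
```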

Figure 17. Edge types in RAT-SQL Schema graph (source)

RAT-SQL represents the schema as a directed graph with the columns and tables as nodes, and edges defined by the database relations detailed in the diagram above.

5.4.4 Contextualizing question and schema representations

This helps in learning effective joint representations. RAT-SQL augments the schema graph with edges between question words and schema entities, defined during schema linking. The authors introduce a relation-aware self-attention [25] layer that uses the relational structure of the input while also learning “soft” relations between sequence elements. Known relations (such as the primary key and foreign key relations defined as edge labels) are communicated to the model by adding their representations to the attention computation.

Figure 18. r_ij encodes the known relationship and is added to the self-attention equation (source)
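A minimal PyTorch sketch of the idea (single head, no value-side relation term, toy dimensions): the relation embedding r_ij is added to the key of element j before the usual scaled dot product.

```python
import math
import torch

def relation_aware_attention(x, W_q, W_k, r):
    """Relation-aware self-attention scores, a sketch of the RAT-SQL idea:
    e_ij = (x_i W_q) . (x_j W_k + r_ij) / sqrt(d)"""
    q = x @ W_q                       # (n, d)
    k = x @ W_k                       # (n, d)
    # r has shape (n, n, d): one relation embedding per ordered pair (i, j)
    e = torch.einsum("id,ijd->ij", q, k.unsqueeze(0) + r) / math.sqrt(W_q.shape[1])
    return e.softmax(dim=-1)          # attention weights biased by known relations

n, d_in, d = 6, 32, 16  # toy sizes
attn = relation_aware_attention(torch.randn(n, d_in),
                                torch.randn(d_in, d), torch.randn(d_in, d),
                                torch.randn(n, n, d))
print(attn.shape)  # torch.Size([6, 6])
```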

6. Conclusions and Future trends

Figure 19. An example of incorporating human feedback to improve NL2SQL systems (source)

In this blog, we reviewed the state of the art in NL2SQL: the problem statement, the challenges, the evaluation of such systems, and the modern machine learning techniques for solving the task. Recent work also focuses on improving the user experience of such systems. Photon [4] is a flexible system that accepts both NL questions and SQL as input; it also has a confusion detection module that detects unanswerable questions and helps users paraphrase a question to get the right answer. The authors of [28] show that incorporating human feedback can further improve the accuracy and user experience of these systems (see Figure 19).

Figure 20. Example showing interpretability of NL2SQL system (source)

Although modern NL2SQL techniques achieve good accuracy on benchmark test sets, they are still far from demonstrating robust performance in production settings. In the context of business decision making, achieving reliable performance is critical to fostering and building users’ trust in such systems. NL2SQL methods have the potential to significantly enhance the efficiency of human analysts, freeing them to spend more time on contextual interpretation and validation of results. The output of modern end-to-end deep learning systems suffers from a lack of interpretability, and while there is significant research on how AI systems work under the hood, incorporating humans in the loop to provide feedback and improve predictive power will accelerate the adoption and use of NL2SQL systems across modern data-driven organizations.

7. References

[1] Guo, Jiaqi, et al. “Towards complex text-to-sql in cross-domain database with intermediate representation.” (2019) arXiv preprint arXiv:1905.08205

[2] Wang, Bailin, et al. “Rat-sql: Relation-aware schema encoding and linking for text-to-sql parsers.” (2019) arXiv preprint arXiv:1911.04942

[3] Zhong, Victor, Caiming Xiong, and Richard Socher. “Seq2sql: Generating structured queries from natural language using reinforcement learning.” (2017) arXiv preprint arXiv:1709.00103

[4] Zeng, Jichuan, et al. “Photon: A Robust Cross-Domain Text-to-SQL System.” (2020) arXiv preprint arXiv:2007.15280

[5] Suhr, Alane, et al. “Exploring Unexplored Generalization Challenges for Cross-Database Semantic Parsing.” (2020) Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

[6] Dahl, Deborah A., et al. “Expanding the scope of the ATIS task: The ATIS-3 corpus.” (1994) HUMAN LANGUAGE TECHNOLOGY: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8–11, 1994.

[7] Tang, Lappoon R., and Raymond J. Mooney. “Using multiple clause constructors in inductive logic programming for semantic parsing.” (2001) European Conference on Machine Learning. Springer, Berlin, Heidelberg.

[8] Tang, Lappoon R., and Raymond Mooney. “Automated construction of database interfaces: Integrating statistical and relational learning for semantic parsing.” (2000) Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora.

[9] Li, Fei, and H. V. Jagadish. “Constructing an interactive natural language interface for relational databases.” (2014) Proceedings of the VLDB Endowment 8.1: 73–84.

[10] Iyer, Srinivasan, et al. “Learning a neural semantic parser from user feedback.” (2017) arXiv preprint arXiv:1704.08760

[11] Yaghmazadeh, Navid, et al. “SQLizer: query synthesis from natural language.” (2017) Proceedings of the ACM on Programming Languages 1.OOPSLA: 1–26.

[12] Zhong, Victor, Caiming Xiong, and Richard Socher. “Seq2sql: Generating structured queries from natural language using reinforcement learning.” (2017) arXiv preprint arXiv:1709.00103

[13] Finegan-Dollak, Catherine, et al. “Improving text-to-sql evaluation methodology.” (2018) arXiv preprint arXiv:1806.09029

[14] Yu, Tao, et al. “Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task.” (2018) arXiv preprint arXiv:1809.08887

[15] Yu, Tao, et al. “Sparc: Cross-domain semantic parsing in context.” (2019) arXiv preprint arXiv:1906.02285

[16] Yu, Tao, et al. “Cosql: A conversational text-to-sql challenge towards cross-domain natural language interfaces to databases.” (2019) arXiv preprint arXiv:1909.05378

[17] Finegan-Dollak, Catherine, et al. “Improving text-to-sql evaluation methodology.” (2018) arXiv preprint arXiv:1806.09029

[18] Basik, Fuat, et al. “Dbpal: A learned nl-interface for databases.” (2018) Proceedings of the 2018 International Conference on Management of Data

[19] Yu, Tao, et al. “Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task.” (2018) arXiv preprint arXiv:1809.08887

[20] Devlin, Jacob, et al. “Bert: Pre-training of deep bidirectional transformers for language understanding.” (2018) arXiv preprint arXiv:1810.04805

[21] Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. “Sequence to sequence learning with neural networks.” (2014) Advances in neural information processing systems.

[22] Williams, Ronald J.; Hinton, Geoffrey E.; Rumelhart, David E. “Learning representations by back-propagating errors”. (October 1986) Nature.

[23] Kim, Hyeonji, et al. “Natural language to SQL: where are we today?.” (2020) Proceedings of the VLDB Endowment 13.10: 1737–1750.

[24] Hochreiter, Sepp; Schmidhuber, Jürgen “Long Short-Term Memory”. (1997–11–01) Neural Computation.

[25] Vaswani, Ashish, et al. “Attention is all you need.” (2017) Advances in neural information processing systems.

[26] Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. “Pointer networks.” (2015) Advances in neural information processing systems.

[27] Yin, Pengcheng, et al. “TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data.” (2020) arXiv preprint arXiv:2005.08314

[28] Elgohary, Ahmed, Saghar Hosseini, and Ahmed Hassan Awadallah. “Speak to your Parser: Interactive Text-to-SQL with Natural Language Feedback.” (2020) arXiv preprint arXiv:2005.02539

