Data analysis made easy: Text2Code for Jupyter notebook

Kartik Godawat
Towards Data Science
5 min read · Sep 6, 2020


Example of plugin in action

Inspiration: GPT-3

In June 2020, OpenAI launched its new model GPT-3, which not only demonstrated futuristic NLP (Natural Language Processing) capabilities but could also generate React code and simplify command-line commands.

These demos were a huge inspiration for us. We realized that while doing data analysis, we often forget less-used pandas or plotly syntax and have to search for it. Copying code from StackOverflow then requires modifying the variable and column names accordingly. So we started exploring something which generates ready-to-execute code for human queries like:

show rainfall and humidity in a heatmap from dataframe df

or

group df by state and get average & maximum of user_age

Snippets was one such extension we used for some time, but beyond a certain number of snippets the UI becomes unintuitive. While it works well for static templates, we needed something more to handle the dynamic nature of our use case.

Snippet extension example

We decided to attempt building a new Jupyter extension for this purpose. Unfortunately, we didn’t have beta access to GPT-3, so using that amazing model wasn’t an option.

Simplifying the task:

We wanted to build something that runs on our desktops (with GPUs). We initially treated the problem as a chatbot problem and started with Rasa, but were soon stopped short by the lack of proper training data.

Having failed to build a truly generative model, we decided to develop a supervised model that works for the use cases defined in the training pipeline and can be easily extended. Taking inspiration from chatbot pipelines, we decided to simplify the problem into the following components:

  • Generate / Gather training data
  • Intent matching: What is it that the user wants to do?
  • NER(Named Entity Recognition): Identify variables(entities) in the sentences
  • Fill Template: Use extracted entities in a fixed template to generate code
  • Wrap inside jupyter extension

Generating training data:

To simulate what end users would query, we started with some formats we ourselves would use to describe a command in English. For example:

display a line plot showing $colname on y-axis and $colname on x-axis from $varname

Then we used a very simple generator that replaces $colname and $varname with sample names to produce variations for the training set.
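A minimal sketch of what such a generator might look like: placeholder tokens like $colname and $varname in a template sentence are substituted with sample names to produce many training variations. The template, column names, and variable names below are illustrative, not the project's actual training set.

```python
from itertools import product

def expand_template(template, colnames, varnames):
    """Yield one training sentence per (y-column, x-column, variable) combination."""
    sentences = []
    for y_col, x_col, var in product(colnames, colnames, varnames):
        if y_col == x_col:
            continue  # skip degenerate plots with the same column on both axes
        # replace the first $colname with the y-axis column, the second with the x-axis column
        sentence = (template
                    .replace("$colname", y_col, 1)
                    .replace("$colname", x_col, 1)
                    .replace("$varname", var))
        sentences.append(sentence)
    return sentences

template = "display a line plot showing $colname on y-axis and $colname on x-axis from $varname"
data = expand_template(template, ["rainfall", "humidity"], ["df", "weather_df"])
print(len(data))   # 4 variations
print(data[0])     # display a line plot showing rainfall on y-axis and humidity on x-axis from df
```

Each generated sentence keeps the same intent, so the whole batch can share one intent_id while varying the entity values.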

Example of some (intent_id,ner-formats)

Intent Matching:

After generating the data, where each sentence is mapped to a unique “intent_id” for a specific intent, we used Universal Sentence Encoder to get embeddings of the user query and find its cosine similarity with our predefined intent queries (the generated data). Universal Sentence Encoder is similar to word2vec in that it generates embeddings, but for sentences instead of words.
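The matching step itself can be sketched like this: embed the query and every intent sentence, then pick the intent with the highest cosine similarity. The real pipeline embeds with Universal Sentence Encoder (loaded via tensorflow_hub); the bag-of-words embedding below is a self-contained stand-in, and the intents shown are illustrative.

```python
import numpy as np

def embed(sentence, vocab):
    """Toy embedding: binary bag-of-words over a fixed vocabulary
    (a stand-in for Universal Sentence Encoder embeddings)."""
    words = set(sentence.lower().split())
    return np.array([1.0 if w in words else 0.0 for w in vocab])

def match_intent(query, intent_sentences):
    vocab = sorted({w for s in list(intent_sentences.values()) + [query]
                    for w in s.lower().split()})
    q = embed(query, vocab)
    best_intent, best_score = None, -1.0
    for intent_id, sentence in intent_sentences.items():
        v = embed(sentence, vocab)
        # cosine similarity between query and intent sentence
        score = q @ v / (np.linalg.norm(q) * np.linalg.norm(v))
        if score > best_score:
            best_intent, best_score = intent_id, score
    return best_intent

intents = {
    "head_rows": "show 5 rows from df",
    "line_plot": "display a line plot showing rainfall on y-axis and humidity on x-axis from df",
}
print(match_intent("plot a line chart of humidity against rainfall from df", intents))
# line_plot
```

Swapping the toy `embed` for Universal Sentence Encoder would leave the rest of the matching logic unchanged.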

Example of intent matching

NER(Named Entity Recognition):

The same generated data could then be used to train a custom entity-recognition model that detects column, variable, and library names. For this purpose we explored HuggingFace models but ended up training a custom model with spaCy, primarily because HuggingFace models are transformer-based and rather heavy compared to spaCy.
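One nice property of generating the sentences ourselves is that we know exactly where each entity sits, so character-offset annotations in spaCy's training format can be produced mechanically. The helper and labels below are an illustrative sketch, not the project's exact code.

```python
def annotate(template, substitutions):
    """Fill placeholders one at a time and record (start, end, label)
    character spans in spaCy's training-data format."""
    text, entities = template, []
    for placeholder, value, label in substitutions:
        start = text.index(placeholder)
        text = text.replace(placeholder, value, 1)
        entities.append((start, start + len(value), label))
    return text, {"entities": entities}

# Substitutions must be listed in the order the placeholders appear,
# so earlier spans stay valid after later replacements.
text, ann = annotate(
    "show $number rows from $varname",
    [("$number", "5", "NUMBER"), ("$varname", "df", "VARIABLE")],
)
print(text)   # show 5 rows from df
print(ann)    # {'entities': [(5, 6, 'NUMBER'), (17, 19, 'VARIABLE')]}
```

A list of such (text, annotation) pairs is what spaCy's NER training loop consumes.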

Example of entity recognition

Fill Template:

Filling a template is very easy once the entities are correctly recognized and the intent is correctly matched. For example, the query “show 5 rows from df” yields two entities: a variable and a number. The template code for this was straightforward to write:

df.head() or df.head(5)
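The filling step then reduces to slotting the extracted entities into a fixed code template per intent. A minimal sketch, with illustrative template strings:

```python
# Each matched intent maps to a code template with named slots.
TEMPLATES = {
    "head_rows": "{variable}.head({number})",
    "line_plot": "px.line({variable}, x='{x_col}', y='{y_col}')",
}

def fill_template(intent_id, entities):
    """Render the intent's code template using the recognized entities."""
    return TEMPLATES[intent_id].format(**entities)

print(fill_template("head_rows", {"variable": "df", "number": "5"}))
# df.head(5)
```

The resulting string is exactly what gets inserted into the notebook cell.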

Integrate with Jupyter:

Surprisingly, this turned out to be the most complex part of all: writing such extensions for Jupyter is tricky, and there is little documentation and few examples available (compared to libraries like HuggingFace or spaCy). With some trial and error, and a bit of copy-paste from existing extensions, we were finally able to wrap everything up as a single Python package, installable via pip install.

We had to create both a frontend and a server extension, which get loaded when the Jupyter notebook starts. The frontend sends the query to the server to get the generated template code, then inserts it into the cell and executes it.
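The server side of that round trip can be sketched roughly as follows: the frontend POSTs the English query, and a handler runs the pipeline and returns the generated code as JSON. The handler and route names are illustrative, and `generate_code` is a stub standing in for the real intent/NER/template pipeline.

```python
import json

try:
    # the real extension runs inside the classic Jupyter Notebook server
    from notebook.base.handlers import IPythonHandler
except ImportError:
    # fallback so this sketch can be read and run standalone
    IPythonHandler = object

def generate_code(query):
    """Stub for intent matching + NER + template filling."""
    return "df.head(5)" if "rows" in query else ""

class Text2CodeHandler(IPythonHandler):
    def post(self):
        # frontend sends {"query": "..."}; respond with {"code": "..."}
        query = json.loads(self.request.body)["query"]
        self.finish(json.dumps({"code": generate_code(query)}))
```

In the extension's server-side entry point, a handler like this would be registered on a URL route (e.g. a hypothetical /text2code) that the frontend calls.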

Demo:

The demo video was prepared on the Chai Time Data Science dataset by Sanyam Bhutani.

Short video of supported commands

Limitations:

As with many ML models, intent matching and NER sometimes fail miserably, even when the intent is obvious to the human eye. Some areas where we could attempt to improve the situation:

  • Gather/generate higher-quality English training sentences. Paraphrasing is one technique we haven’t tried yet for generating different ways of phrasing the same sentence.
  • Gather real-world variable and library names, as opposed to generating them randomly.
  • Try NER with a transformer-based model.
  • With enough data, train a language model to translate English directly to code, as GPT-3 does, instead of having separate pipeline stages.

That’s all folks!

I hope you enjoyed reading the article. The entire code for the extension, ready to install on a local GPU machine, is available here.

Deepak and I hacked this together over a couple of weekends. The code is not production-ready, but it is good enough for people to modify and use for their own needs. We would love to hear feedback and ideas for improvement. :)
