
Data Visualization Generation Using Large Language and Image Generation Models with LIDA

An overview of the LIDA library, including how to get started, examples, and considerations going forward

Recently I came across LIDA – a grammar-agnostic library designed to automatically generate data visualizations and infographics using large language models (LLMs) and Image Generation Models (IGMs). LIDA works with various large language model providers, such as OpenAI and Hugging Face. In this post, I’ll provide a high-level overview of the library, show you how to get started, highlight a few examples, and share some thoughts and considerations on the use of LLMs and IGMs in the data visualization and business intelligence (BI) field.

Photo by and_machines on Unsplash

Overview¹

Creating data visualizations is often a complex task – one that involves data manipulation, coding, and design skills. LIDA is an open-source library that automates the data visualization creation process by reducing development time, number of errors, and overall complexity.

LIDA consists of 4 modules, as displayed in the following image. Each module serves a unique purpose in this multi-stage visualization generation approach.

Image by Victor Dibia, from the LIDA GitHub repository
  1. SUMMARIZER: this module converts data into a natural language summary. The summary is implemented in two stages. In the first stage, Base Summary Generation, rules are applied to extract properties from the dataset using the pandas library, general statistics are generated, and a few samples are pulled for each column in the dataset. In the second stage, Summary Enrichment, the content from the Base Summary Generation stage is enriched by either an LLM or a user via the UI to include a semantic description of the dataset and fields.¹
  2. GOAL EXPLORER: this module creates data exploration goals based on the summary generated by the SUMMARIZER module. Goals generated by this module are represented as JSON data structures containing the question, the visualization addressing the question, and the rationale.¹
  3. VIZ GENERATOR: this module consists of 3 submodules (a code scaffold constructor, a code generator, and a code executor). The goal of this module is to generate, evaluate, repair, filter, and execute visualization code according to specifications within a data visualization goal from the GOAL EXPLORER module, or from a new visualization goal created by the user.¹
  4. INFOGRAPHER: this module utilizes IGMs to create stylized infographics based on the output of the VIZ GENERATOR module and on visualization and style prompts.¹
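To make the GOAL EXPLORER's output concrete, a generated goal is a plain JSON structure with a question, a visualization addressing it, and a rationale. The example below is my own illustration of that shape, not actual LIDA output:

```python
import json

# Hypothetical goal in the structure described above:
# a question, a visualization addressing it, and a rationale.
goal = {
    "question": "What is the distribution of movie releases by month?",
    "visualization": "bar chart of release counts per month",
    "rationale": "Counting releases per month reveals seasonal release patterns.",
}

print(json.dumps(goal, indent=2))
```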

LIDA leverages two key capabilities of LLMs:

  1. Language Modeling – these capabilities assist in the generation of semantically meaningful visualization goals.¹
  2. Code Writing (i.e. Code Generation) – these capabilities assist in generating code to create data visualizations, which are then used as input to image generation models, such as DALL-E and Latent Diffusion, to generate stylized infographics.¹

Additionally, prompt engineering is used within the LIDA tool.

"Prompt Engineering is the process of designing, optimizing, and refining prompts used to communicate with AI language models. A prompt is a question, statement, or request that is input into an AI system to elicit a specific response or output."²

Prompt engineering shows up in LIDA in a couple of ways, including the prompts used to define the six evaluation dimensions, and the ability for users to specify style prompts that format a visualization.

The examples later in this post demonstrate some of these features, and you can read more about LIDA here.

Getting Started

There are 2 ways to get started with LIDA – via the Python API, or via a hybrid user interface. This section shows how to get started with the user interface from your local machine, using the optional bundled UI and web API in the LIDA library.

Note: In this example, OpenAI is used. To use a different LLM provider, or to use the Python API, check out the GitHub documentation here.
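For readers who prefer the Python API, the overall flow mirrors the modules described earlier. The sketch below follows the `Manager` interface shown in the LIDA README; treat it as a rough outline and defer to the official documentation for exact signatures:

```python
def run_lida_pipeline(csv_path: str):
    """Sketch of the LIDA Python API flow.

    Requires `pip install lida` and an OPENAI_API_KEY in the environment;
    the import is done lazily so this sketch loads without lida installed.
    """
    from lida import Manager, llm

    lida = Manager(text_gen=llm("openai"))
    summary = lida.summarize(csv_path)        # SUMMARIZER
    goals = lida.goals(summary, n=2)          # GOAL EXPLORER
    charts = lida.visualize(summary=summary,  # VIZ GENERATOR
                            goal=goals[0],
                            library="seaborn")
    return charts
```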

Step 1: Install the necessary libraries

Install the following libraries on your computer.

pip install lida
pip install -U llmx openai

Step 2: Create a variable to store your OpenAI API Key

To create an OpenAI API Key navigate to your Profile > User API Keys, then select + Create new secret key.

Image by Author: Retrieve API Key

Copy the API key. In a new terminal window, save the API key in a variable called OPENAI_API_KEY.

export OPENAI_API_KEY=""

Step 3: Launch the UI Web App

From the terminal window, launch the LIDA UI web app with the following command.

lida ui --port=8080 --docs

In a web browser, navigate to "localhost:8080", and then you’re all set to get started! Select either the Live demo or Demo tab to view the web app.

Image by Author: Accessing LIDA web app

Web App Examples

This section goes through a few examples and tips using the Top 10 Films US Box Office dataset³ from Kaggle.

Step 1: Select a visualization library/grammar

Before creating any data visualizations or summaries, select a visualization library to use. There are 4 options to pick from: Altair, Matplotlib, Seaborn, and GGPlot. To start with, select Seaborn – a Python library for data visualization, based on Matplotlib.

Image by Author: Select a visualization library/grammar

TIP: Not sure which library to start with? Pick one, and switch later! You can switch the visualization library/grammar at a later point, even after the data has been uploaded. If you switch after loading the data, and see an error, a quick refresh should resolve the issue.

Step 2: Review Generation Settings

On the right, there is an option to modify the Generation Settings. Here you can select the Model Provider, the Model to use for generation, and adjust other fields such as Max Tokens, Temperature, and Number of Messages. For now, keep the default settings.

Image by Author: Review Generation Settings

Step 3: Upload data

After setting the base parameters, upload the dataset. Click or drag the file to upload the dataset into the web app. Alternatively, you can use one of the sample files provided.

Image by Author: Upload file

TIP: If you get an error when trying to upload a file, check your usage and billing access for the model provider you have selected. Access issues can result in data file upload issues in LIDA. Additionally, the terminal window will display error messages, if there are any, that may be useful for troubleshooting an issue.

CAUTION: Be careful when switching back to the LIDA homepage – doing so will discard the work in your current display!

Step 4: Review the Data Summary

The Data Summary section provides a description of the dataset, and a summary of the fields in the dataset including column type, number of unique values, description of the column, and sample values. This output is a result of the SUMMARIZER module mentioned previously.

The following image shows the Data Summary for the Top 10 Films US Box Office dataset, with a description of the entire dataset and of each of its 9 columns.

Image by Author: Data Summary for Top 10 Films US Box Office Dataset

Select View raw summary? to view the data summary as a JSON dictionary.

Image by Author: View raw summary
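The raw summary is straightforward to approximate by hand. The sketch below uses pandas to build the same kind of per-column record the SUMMARIZER's rule-based stage produces – toy data and illustrative field names, not LIDA's exact schema:

```python
import pandas as pd

# Toy stand-in for the box-office dataset.
df = pd.DataFrame({
    "title": ["Avatar", "Barbie", "Oppenheimer"],
    "gross": [760.5, 636.2, 330.1],
})

# Illustrative per-column summary, in the spirit of LIDA's base summary:
# column type, number of unique values, and a few sample values.
fields = []
for col in df.columns:
    fields.append({
        "column": col,
        "dtype": str(df[col].dtype),
        "num_unique_values": int(df[col].nunique()),
        "samples": df[col].head(2).tolist(),
    })

raw_summary = {"name": "top_10_films", "fields": fields}
print(raw_summary)
```

An LLM (or a user in the UI) would then enrich this structure with semantic descriptions of the dataset and each field.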

Step 5: Review Goal Exploration

This section shows a list of automatically generated goals, or hypotheses, based on the dataset uploaded. Each goal is stated as a question, and includes a description of what the visualization will display. This output is a result of the GOAL EXPLORER module mentioned previously.

Here you can read through the different goals, and select one to visualize in the Visualization Generation section.

Image by Author: Goal Exploration results for Top 10 Films US Box Office Dataset

Step 6: Visualization Generation

Based on the goal selected in the previous section, Goal Exploration, you will see the visualization, as well as the Python code used to generate the visual for that goal.

The following image shows the result for the goal, "What is the distribution of movie releases by month?". On the left is the visualization, a vertical bar chart, and on the right is the Python code used to create the visual. This code snippet can be copied for external use.

Image by Author: Visualization Generation output for "What is the distribution of movie releases by month?"
Image by Author: Visualization Generation output for "What is the distribution of movie releases by month?"

Alternatively, you can enter a new visualization goal, outside of the ones listed in the Goal Exploration section.

For example, the following image shows the output for "What are the top 5 genres with the largest average budget?".

Image by Author: Visualization Generation output for "What are the top 5 genres with the largest average budget?"
Image by Author: Visualization Generation output for "What are the top 5 genres with the largest average budget?"

Note: Selecting the Generate button, to the right of the goal, refreshes the visualization. This may result in slight changes, such as a change in the color scheme.

Step 7: Visualization modification and evaluation

Once a visualization is generated, there are 4 tabs that can be utilized: Refine, Explain, Evaluate, and Recommend!

Image by Author: Refine, Explain, Evaluate, and Recommend! tabs under the Visualization Generation section

The first tab, Refine, modifies the chart using natural language commands.

The following image shows modifications made to the chart, "What is the distribution of movie releases by month?", using the Refine tab. The chart was modified using natural language commands to arrange the months in chronological order, to display the values in a horizontal bar chart, and to add count values to each bar.

Image by Author: Visualization Generation output after natural language commands input in Refine tab

TIP: Make sure your style prompts are clear, concise, & specific! Otherwise you may end up with a distorted visualization, unexpected results, or your natural language command may not render a chart. Remember, when it comes to writing prompts, Garbage In → Garbage Out! Writing prompts is an art, so writing effective style prompts may require some refinement.

If a few style prompts don’t turn out as expected, use the Clear Chat History button to reset the visualization.

Image by Author: Clear Chat History button

The second tab, Explain, provides a text explanation of how the visual was created – in terms of data transformations, chart elements, code, etc.

Image by Author: Chart explanations example

The third tab, Evaluate, evaluates the generated chart across 6 dimensions: bugs, transformation, compliance, type, encoding, and aesthetics. Each dimension has a rating out of 5, and a description on why it received that rating.

Image by Author: Chart evaluation example

There is an option to auto repair the chart using the Auto Repair Chart button on the bottom right, as seen in the image above. If you agree with the recommendations provided in the chart evaluation, this is a quick way to apply the fixes! The following image shows an updated chart after auto repairing it based on the Aesthetics evaluation.

Image by Author: Updated bar chart after Auto Repair Chart selected

The fourth tab, Recommend!, generates similar charts and corresponding code snippets – not tied to the initial goal. This can be useful for brainstorming other charts, or other insights to gain from the data.

Image by Author: Chart recommendation examples

Thoughts & Considerations

This section highlights a few areas of consideration regarding the use of LLMs and IGMs in the data visualization and business intelligence field – including, but not limited to, automatic data visualization generation.

Evaluation Metrics

LIDA makes use of 2 metrics – Visualization Error Rate (VER), and Self-Evaluated Visualization Quality (SEVQ).

VER shows how many of the generated visualizations result in code compilation errors, stated as a percentage.

SEVQ uses LLMs, such as GPT-4, to assess the quality of generated visualizations. It takes the average score of 6 dimensions – code accuracy, data transformations, goal compliance, visualization type, data encoding, and aesthetics. Each dimension’s score is generated by prompting an LLM (to see a sketch of the prompts used, read the paper here)¹. You may recall that these dimensions appear in the Evaluate tab in the LIDA web app.
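As a back-of-the-envelope illustration of the two metrics (my own arithmetic, not LIDA's implementation):

```python
# Visualization Error Rate: percentage of generated visualizations
# whose code fails to run.
def ver(num_errors: int, total: int) -> float:
    return 100.0 * num_errors / total

# Self-Evaluated Visualization Quality: mean of the six
# LLM-scored dimensions (each rated out of 5).
def sevq(scores: dict) -> float:
    return sum(scores.values()) / len(scores)

scores = {
    "code accuracy": 5, "data transformations": 4, "goal compliance": 5,
    "visualization type": 4, "data encoding": 5, "aesthetics": 3,
}
print(ver(3, 50))    # 6.0 – three failing charts out of fifty
print(sevq(scores))  # about 4.33
```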

These metrics evaluate visualization generation, and they raise a good point – how we evaluate the use of LLMs and IGMs in data visualization and BI tools matters. As this area continues to evolve, practitioners implementing LLMs and IGMs in data visualization and BI solutions for their organization should ask themselves – What metrics do we need to consider going forward? What processes need to be put in place? How do we ensure the output is accurate, trustworthy, explainable, and governed?

Deployment – Environment Setup Considerations

When utilizing LLMs and IGMs for data visualization within an organization, there are several things to consider regarding deployment.

Use of these models for data visualization, or in general, can require a large amount of resources, depending on factors such as the model size, dataset size, and number of users. This can lead to high costs if not planned correctly. It’s important to make sure the right infrastructure is in place to ensure a smooth implementation. Testing smaller, more specialized LLMs for a specific use case can also help reduce the overall footprint.

Additionally, data security and governance are important to keep in mind when using LLMs and IGMs for data visualization. Regardless of the tool, it’s crucial to ensure that data is secure within the tool, and that it is governed throughout its use.

Chart Explanations

As shown in a previous example, the chart explanations generated within LIDA focus on details regarding how the chart was created – in terms of data transformations, chart elements, and code generated. While this is helpful for a developer creating charts with a dataset, this kind of context is not beneficial for business users. Business users and analysts would benefit from chart explanations that include insights about the data points within a visualization, not just the chart elements and structure.

Regardless of an individual’s role, natural language text accompanying the charts can help surface key insights from a data visualization. Some natural language generation (NLG) tools can already integrate with BI tools today. It’ll be interesting to see how this space continues to evolve with LLMs, IGMs, and data visualization solutions.

Haven’t seen NLG with BI before? Check out this GitHub page for a quick intro.

Moving forward, it’s imperative to think about the end user, and to understand which LLM + IGM + data visualization solutions will fit that audience, based on their goals and interests.

Chart Design using Prompts

The examples earlier showed how data visualizations can be generated using LLMs and IGMs. While these charts are automatically generated, they still require modification to make sure they are well designed – often the first chart can’t be left as is. Refining it requires the Auto Repair capability in LIDA, which captures some but not all of the changes that should be made, as well as style prompts, which require some experience and knowledge in the data visualization domain.

These style prompts are inputted by the user in natural language and can include requests such as modifying chart titles, changing chart colors, or sorting chart values.

The use of these style prompts can help users save time when developing charts, reducing the time spent writing, debugging, and formatting code.

However, with the introduction of prompts in data visualization generation, it becomes equally important to understand what makes a good prompt. Prompts that are clear, concise, and specific will yield better results than ones that are not. Unclear requests can result in poor visualizations or unexpected results.

Now this doesn’t mean we shouldn’t leverage prompts in creating data visualizations – rather, it’s to point out that there may be a learning curve as you get started. Figuring out the right prompt may involve some testing, and will require clearly phrased commands.

Overall, LIDA is a great open-source tool to start learning about some of the advancements in LLMs, IGMs, & Data Visualization. Check out Victor Dibia’s full paper here & try out the web app, or Python API, to learn more about how LLMs and IGMs are changing the way we can create data visualizations.


Payal is a Data & AI specialist. In her spare time, she enjoys reading, traveling, and writing on Medium. If you enjoy her work, follow or subscribe to her list, and never miss a story!

The above article is personal and does not necessarily represent IBM’s positions, strategies, or opinions.

References

[1]: Dibia, Victor. LIDA: A Tool for Automatic Generation of Grammar-Agnostic Visualizations and Infographics Using Large Language Models, Microsoft Research, 8 May 2023, aclanthology.org/2023.acl-demo.11.pdf.

[2]: Vagh, Avinash. "NLP and Prompt Engineering: Understanding the Basics." DEV Community, DEV Community, 6 Apr. 2023, dev.to/avinashvagh/understanding-the-concept-of-natural-language-processing-nlp-and-prompt-engineering-35hg.

[3]: Will’s Films. "Top 10 Films at the US Box Office 2000–2023." Kaggle, 20 Mar. 2024, www.kaggle.com/datasets/willsfilms/top-10-films-at-the-us-box-office-2000-2023. (CC0)

