LMQL — SQL for Language Models

Yet another tool that could help you with LLM applications

Mariya Mansurova
Towards Data Science


Image by DALL-E 3

I’m sure you’ve heard of SQL or have even mastered it. SQL (Structured Query Language) is a declarative language widely used to work with data stored in databases.

According to the annual Stack Overflow survey, SQL is still one of the most popular languages in the world. For professional developers, SQL is in the top three (after JavaScript and HTML/CSS), and more than half of them use it. Surprisingly, SQL is even more popular than Python.

Graph by author, data from StackOverflow survey

SQL is a common way to talk to your data in a database. So, it is no surprise that there are attempts to use a similar approach for LLMs. In this article, I would like to tell you about one such approach called LMQL.

What is LMQL?

LMQL (Language Model Query Language) is an open-source programming language for language models. LMQL is released under the Apache 2.0 license, which allows you to use it commercially.

LMQL was developed by ETH Zurich researchers. They proposed the novel idea of LMP (Language Model Programming). LMP combines natural and programming languages: text prompts and scripting instructions.

In the original paper, “Prompting Is Programming: A Query Language for Large Language Models” by Luca Beurer-Kellner, Marc Fischer and Martin Vechev, the authors flagged the following challenges of the current LLM usage:

  • Interaction. For example, we could use meta prompting, asking the LM to expand the initial prompt. As a practical case, we could first ask the model to identify the language of the initial question and then respond in that language. For such a task, we would need to send the first prompt, extract the language from the output, add it to the second prompt template and make another call to the LM. That’s quite a lot of interaction to manage. With LMQL, you can define multiple input and output variables within one prompt (see the sketch after this list). More than that, LMQL optimises overall likelihood across numerous calls, which might yield better results.
  • Constraints & token representation. Current LMs don’t provide functionality to constrain output, which is crucial if we use LMs in production. Imagine building sentiment analysis in production to mark negative reviews in our interface for CS agents. Our program would expect to receive “positive”, “negative” or “neutral” from the LLM. However, quite often you get something like “The sentiment for the provided customer review is positive” instead, which is not so easy to process in your API. That’s why constraints would be pretty helpful. LMQL lets you control the output using human-understandable words (not the tokens LMs operate with).
  • Efficiency and cost. LLMs are large networks, so they are pretty expensive, regardless of whether you use them via an API or in your local environment. LMQL can leverage predefined behaviour and the constrained search space to reduce the number of LM invocations.
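To make the interaction point more concrete, here is a rough sketch of how the language-detection example could look as a single LMQL query. The syntax is explained in the next section; the prompt wording and token limits here are my own illustration, not from the paper.

import lmql

query_string = '''
"Q: What language is the following question written in: ```{question}```?\\n"
"A: [LANGUAGE]\\n"
"Q: Answer the question ```{question}``` in {LANGUAGE}.\\n"
"A: [ANSWER]" where STOPS_AT(LANGUAGE, '\\n') and (len(TOKENS(LANGUAGE)) < 10) \
    and (len(TOKENS(ANSWER)) < 100)
'''

# one query defines two output variables (LANGUAGE and ANSWER), so there is
# no glue code between the two model calls; without an explicit model, the
# default OpenAI backend is used, so OPENAI_API_KEY must be set (see below)
lmql.run_sync(query_string, question="¿Cuál es la capital de España?")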

As you can see, LMQL can address these challenges. It allows you to combine multiple calls in one prompt, control your output and even reduce cost.

The impact on cost and efficiency could be pretty substantial. Limiting the search space can significantly reduce LLM costs: in the cases from the LMQL paper, LMQL used 75–85% fewer billable tokens than standard decoding.

Image from the paper by Beurer-Kellner et al. (2023)

I believe the most crucial benefit of LMQL is complete control over your output. However, with such an approach, you also get another layer of abstraction over the LLM (similar to LangChain, which we discussed earlier). On the upside, it allows you to switch from one backend to another easily if you need to, since LMQL works with different backends: OpenAI, Hugging Face Transformers and llama.cpp.
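For instance, switching the backend is just a matter of changing the model reference, leaving the query itself untouched. Here is a minimal sketch; it assumes you have an OpenAI key configured and a local .gguf file downloaded, which we do in the setup section below.

import lmql

query_string = """
"What is the capital of {country}? [CAPITAL]" where STOPS_AT(CAPITAL, '.')
"""

# OpenAI backend
lmql.run_sync(query_string, country='France',
              model=lmql.model("openai/text-davinci-003"))

# local llama.cpp backend: the query stays exactly the same
lmql.run_sync(query_string, country='France',
              model=lmql.model("local:llama.cpp:zephyr-7b-beta.Q4_K_M.gguf",
                               tokenizer='HuggingFaceH4/zephyr-7b-beta'))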

You can install LMQL locally or use the web-based Playground online. The Playground can be pretty handy for debugging, but it only supports the OpenAI backend. For all other use cases, you will have to use a local installation.

As usual, there are some limitations to this approach:

  • This library is not very popular yet, so the community is pretty small, and few external materials are available.
  • In some cases, documentation might not be very detailed.
  • The most popular and best-performing OpenAI models have some limitations, so you can’t use the full power of LMQL with ChatGPT.
  • I wouldn’t use LMQL in production since I can’t say that it’s a mature project. For example, distribution over tokens provides pretty poor accuracy.

A somewhat close alternative to LMQL is Guidance. It also allows you to constrain generation and control the LM’s output.

Despite all the limitations, I like the concept of Language Model Programming, and that’s why I’ve decided to discuss it in this article.

If you’re interested in learning more about LMQL from its authors, check out this video.

LMQL syntax

Now we know a bit about what LMQL is. Let’s look at an example of an LMQL query to get acquainted with its syntax.

beam(n=3)
"Q: Say 'Hello, {name}!'"
"A: [RESPONSE]"
from "openai/text-davinci-003"
where len(TOKENS(RESPONSE)) < 20

I hope you can guess its meaning, but let’s discuss it in detail. Here’s the scheme of an LMQL query.

Image from paper by Beurer-Kellner et al. (2023)

Any LMQL program consists of 5 parts:

  • Decoder defines the decoding procedure used. In simple words, it describes the algorithm used to pick the next token. LMQL has three different types of decoders: argmax, beam and sample. You can learn about them in more detail from the paper.
  • Actual query is similar to a classic prompt but written in Python syntax, which means that you can use structures such as loops or if-statements.
  • The from clause specifies the model to use (openai/text-davinci-003 in our example).
  • The where clause defines constraints.
  • Distribution is used when you want to see the probabilities of the possible values in the return. We haven’t used distribution in this query, but we will use it later to get class probabilities for sentiment analysis.

Also, you might have noticed the special variables in our query: {name} and [RESPONSE]. Let’s discuss how they work:

  • {name} is an input parameter. It can be any variable from your scope. Such parameters help you create handy functions that can easily be re-used with different inputs.
  • [RESPONSE] is a phrase that the LM will generate. It can also be called a hole or placeholder. All the text before [RESPONSE] is sent to the LM, and then the model’s output is assigned to this variable. Conveniently, you can easily re-use this output later in the prompt, referring to it as {RESPONSE}.

We’ve briefly covered the main concepts. Let’s try it ourselves. Practice makes perfect.

Getting started

Setting up environment

First of all, we need to set up our environment. To use LMQL in Python, we need to install a package first. No surprises, we can just use pip. You need an environment with Python ≥ 3.10.

pip install lmql

If you want to use LMQL with a local GPU, follow the instructions in the documentation.

To use OpenAI models, you need to set up an API key to access OpenAI. The easiest way is to specify the OPENAI_API_KEY environment variable.

import os
os.environ['OPENAI_API_KEY'] = '<your_api_key>'

However, OpenAI models have many limitations (for example, you won’t be able to get distributions with more than five classes). So, we will use Llama.cpp to test LMQL with local models.

First, you need to install the Python bindings for llama.cpp in the same environment as LMQL.

pip install llama-cpp-python

If you want to use a local GPU (this example enables Metal for Apple silicon), specify the following parameters.

CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python

Then, we need to download model weights as .gguf files. You can find models on the Hugging Face Hub.

We will be using two models:

Llama-2-7B is the smallest of Meta’s fine-tuned generative text models. It’s a pretty basic model, so we shouldn’t expect outstanding performance from it.

Zephyr is a fine-tuned version of the Mistral model with decent performance. In some aspects, it performs better than the 10x larger open-source model Llama-2-70b. However, there’s still some gap between Zephyr and proprietary models like ChatGPT or Claude.

Image from the paper by Tunstall et al. (2023)

According to the LMSYS ChatBot Arena leaderboard, Zephyr is the best-performing model with 7B parameters. It’s on par with much bigger models.

Screenshot of leaderboard | source

Let’s load .gguf files for our models.

import os
import urllib.request


def download_gguf(model_url, filename):
    if not os.path.isfile(filename):
        urllib.request.urlretrieve(model_url, filename)
        print("file has been downloaded successfully")
    else:
        print("file already exists")


download_gguf(
    "https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/resolve/main/zephyr-7b-beta.Q4_K_M.gguf",
    "zephyr-7b-beta.Q4_K_M.gguf"
)

download_gguf(
    "https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_K_M.gguf",
    "llama-2-7b.Q4_K_M.gguf"
)

We need to download a few GB, so it might take some time (10–15 minutes for each model). Luckily, you need to do it only once.

You can interact with the local models in two different ways (documentation):

  • A two-process architecture, where you have a separate long-running process hosting your model and short-running inference calls. This approach is more suitable for production.
  • For ad-hoc tasks, we can use in-process model loading, specifying local: before the model name (see the sketch below). We will use this approach to work with the local models.
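Here is a minimal sketch of the in-process option we will rely on. The extra keyword arguments are passed through to llama.cpp: n_gpu_layers appears later in this article, while n_ctx (context size) is my assumption that the same pass-through mechanism applies to it.

import lmql

# in-process loading: prefix the model reference with "local:"
zephyr_model = lmql.model(
    "local:llama.cpp:zephyr-7b-beta.Q4_K_M.gguf",  # path to the downloaded .gguf file
    tokenizer='HuggingFaceH4/zephyr-7b-beta',      # matching Hugging Face tokeniser
    # extra keyword arguments are passed through to llama.cpp;
    # n_gpu_layers is used later in this article, n_ctx is my assumption
    n_gpu_layers=1000,
    n_ctx=2048,
)

# the model object can then be passed as the model argument
# to lmql.run_sync or the @lmql.query decorator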

Now, we’ve set up the environment, and it’s time to discuss how to use LMQL from Python.

Python functions

Let’s briefly discuss how to use LMQL from Python. The Playground can be handy for debugging, but if you want to use an LM in production, you need an API.

LMQL provides four main approaches to its functionality: lmql.F, lmql.run, the @lmql.query decorator and the Generations API.

The Generations API has been added recently. It’s a simple Python API that lets you run inference without writing LMQL yourself. Since I am more interested in the LMP concept, we won’t cover it in this article.

Let’s discuss the other three approaches in detail and try to use them.

First, there is lmql.F. It’s lightweight functionality, similar to lambda functions in Python, that allows you to execute a piece of LMQL code. lmql.F can have only one placeholder variable, whose value will be returned from the lambda function.

We can specify both the prompt and the constraint for the function. The constraint is equivalent to the where clause in an LMQL query.

Since we haven’t specified any model, the OpenAI text-davinci model will be used.

capital_func = lmql.F("What is the capital of {country}? [CAPITAL]",
                      constraints = "STOPS_AT(CAPITAL, '.')")

capital_func('the United Kingdom')

# Output - '\n\nThe capital of the United Kingdom is London.'

If you’re using Jupyter Notebooks, you might encounter problems since notebook environments are asynchronous. You can enable nested event loops in your notebook to avoid such issues.

import nest_asyncio
nest_asyncio.apply()

The second approach allows you to define more complex queries. You can use lmql.run to execute an LMQL query without creating a function. Let’s make our query a bit more complicated and use the answer from the model in the following question.

In this case, we’ve defined constraints in the where clause of the query string itself.

query_string = '''
"Q: What is the captital of {country}? \\n"
"A: [CAPITAL] \\n"
"Q: What is the main sight in {CAPITAL}? \\n"
"A: [ANSWER]" where (len(TOKENS(CAPITAL)) < 10) \
and (len(TOKENS(ANSWER)) < 100) and STOPS_AT(CAPITAL, '\\n') \
and STOPS_AT(ANSWER, '\\n')
'''

lmql.run_sync(query_string, country="the United Kingdom")

Also, I’ve used run_sync instead of run to get a result synchronously.

As a result, we got an LMQLResult object with a set of fields:

  • prompt — includes the whole prompt with the parameters and the model’s answers. We can see that the model’s answer was used in the second question.
  • variables — a dictionary with all the variables we defined: ANSWER and CAPITAL.
  • distribution_variable and distribution_values are None since we haven’t used this functionality.
Image by author
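For example, here’s a minimal sketch of how you could pick out the fields you need from this object (using the same query_string as above).

result = lmql.run_sync(query_string, country="the United Kingdom")

print(result.prompt)                # the full prompt with the model's answers filled in
print(result.variables['CAPITAL'])  # the model's first answer, e.g. 'London'
print(result.variables['ANSWER'])   # the final answer about the main sight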

The third way to use the Python API is the @lmql.query decorator, which allows you to define a Python function that is handy to use in the future. It’s more convenient if you plan to call this prompt several times.

We could create a function for our previous query and get only the final answer instead of returning the whole LMQLResult object.

@lmql.query
def capital_sights(country):
    '''lmql
    "Q: What is the capital of {country}? \\n"
    "A: [CAPITAL] \\n"
    "Q: What is the main sight in {CAPITAL}? \\n"
    "A: [ANSWER]" where (len(TOKENS(CAPITAL)) < 10) and (len(TOKENS(ANSWER)) < 100) \
        and STOPS_AT(CAPITAL, '\\n') and STOPS_AT(ANSWER, '\\n')

    # return just the ANSWER
    return ANSWER
    '''

print(capital_sights(country="the United Kingdom"))

# There are many famous sights in London, but one of the most iconic is
# the Big Ben clock tower located in the Palace of Westminster.
# Other popular sights include Buckingham Palace, the London Eye,
# and Tower Bridge.

Also, you could use LMQL in combination with LangChain:

  • LMQL queries are Prompt Templates on steroids and could be part of LangChain chains.
  • You could leverage LangChain components from LMQL (for example, retrieval). You can find examples in the documentation.

Now, we know all the basics of LMQL syntax, and we are ready to move on to our task — to define sentiment for customer comments.

Sentiment Analysis

To see how LMQL performs, we will use labelled Yelp reviews from the UCI Machine Learning Repository and try to predict their sentiment. All reviews in the dataset are either positive or negative, but we will keep neutral as one of the possible options for classification.

For this task, let’s use local models — Zephyr and Llama-2. To use them in LMQL, we need to specify the model and tokeniser when we are calling LMQL. For Llama-family models, we can use the default tokeniser.

First attempts

Let’s pick one customer review, “The food was very good.”, and try to determine its sentiment. We will use lmql.run for debugging since it’s convenient for such ad-hoc calls.

I’ve started with a very naive approach.

query_string = """
"Q: What is the sentiment of the following review: ```The food was very good.```?\\n"
"A: [SENTIMENT]"
"""

lmql.run_sync(
    query_string,
    model = lmql.model("local:llama.cpp:zephyr-7b-beta.Q4_K_M.gguf",
                       tokenizer = 'HuggingFaceH4/zephyr-7b-beta'))

The code looks absolutely straightforward. Surprisingly, however, it doesn’t work and returns the following error.

[Error during generate()] The requested number of tokens exceeds the llama.cpp 
model's context size. Please specify a higher n_ctx value.

Side note: if your local model works exceptionally slowly, check whether your computer is using swap memory. A restart can be a good way to fix that.

From the message, we can guess that the output doesn’t fit the context size. Our prompt is only about 20 tokens, so it’s a bit strange that we’ve hit the context-size threshold. Let’s constrain the number of tokens for SENTIMENT and look at the output.

query_string = """
"Q: What is the sentiment of the following review: ```The food was very good.```?\\n"
"A: [SENTIMENT]" where (len(TOKENS(SENTIMENT)) < 200)
"""

print(lmql.run_sync(query_string,
      model = lmql.model("local:llama.cpp:zephyr-7b-beta.Q4_K_M.gguf",
                         tokenizer = 'HuggingFaceH4/zephyr-7b-beta')).variables['SENTIMENT'])

# Positive sentiment.
#
# Q: What is the sentiment of the following review: ```The service was terrible.```?
# A: Negative sentiment.
#
# Q: What is the sentiment of the following review: ```The hotel was amazing, the staff were friendly and the location was perfect.```?
# A: Positive sentiment.
#
# Q: What is the sentiment of the following review: ```The product was a complete disappointment.```?
# A: Negative sentiment.
#
# Q: What is the sentiment of the following review: ```The flight was delayed for 3 hours, the food was cold and the entertainment system didn't work.```?
# A: Negative sentiment.
#
# Q: What is the sentiment of the following review: ```The restaurant was packed, but the waiter was efficient and the food was delicious.```?
# A: Positive sentiment.
#
# Q:

Now we can see the root cause of the problem — the model was stuck in a cycle, repeating question variations and answers again and again. I haven’t seen such issues with OpenAI models (I suppose they might control for it), but they are pretty common with open-source local models. To avoid such cycles, we can use the STOPS_AT constraint to stop generation as soon as we see “Q:” or a new line in the model’s response.

query_string = """
"Q: What is the sentiment of the following review: ```The food was very good.```?\\n"
"A: [SENTIMENT]" where STOPS_AT(SENTIMENT, 'Q:') \
and STOPS_AT(SENTIMENT, '\\n')
"""

print(lmql.run_sync(query_string,
      model = lmql.model("local:llama.cpp:zephyr-7b-beta.Q4_K_M.gguf",
                         tokenizer = 'HuggingFaceH4/zephyr-7b-beta')).variables['SENTIMENT'])

# Positive sentiment.

Excellent, we’ve solved the issue and got the result. But since we are doing classification, we would like the model to return one of three class labels: positive, negative or neutral. We can add such a constraint to the LMQL query to limit the output.

query_string = """
"Q: What is the sentiment of the following review: ```The food was very good.```?\\n"
"A: [SENTIMENT]" where (SENTIMENT in ['positive', 'negative', 'neutral'])
"""

print(lmql.run_sync(query_string,
      model = lmql.model("local:llama.cpp:zephyr-7b-beta.Q4_K_M.gguf",
                         tokenizer = 'HuggingFaceH4/zephyr-7b-beta')).variables['SENTIMENT'])

# positive

We no longer need stopping criteria since we are already limiting the output to just three possible options, and LMQL doesn’t consider any other possibilities.

Let’s try the chain-of-thought reasoning approach. Giving the model some time to think usually improves the results. Using LMQL syntax, we can implement this approach quickly.

query_string = """
"Q: What is the sentiment of the following review: ```The food was very good.```?\\n"
"A: Let's think step by step. [ANALYSIS]. Therefore, the sentiment is [SENTIMENT]" where (len(TOKENS(ANALYSIS)) < 200) and STOPS_AT(ANALYSIS, '\\n') \
and (SENTIMENT in ['positive', 'negative', 'neutral'])
"""

print(lmql.run_sync(query_string,
      model = lmql.model("local:llama.cpp:zephyr-7b-beta.Q4_K_M.gguf",
                         tokenizer = 'HuggingFaceH4/zephyr-7b-beta')).variables)

The output from the Zephyr model is pretty decent.

Image by author

We can try the same prompt with Llama 2.

query_string = """
"Q: What is the sentiment of the following review: ```The food was very good.```?\\n"
"A: Let's think step by step. [ANALYSIS]. Therefore, the sentiment is [SENTIMENT]" where (len(TOKENS(ANALYSIS)) < 200) and STOPS_AT(ANALYSIS, '\\n') \
and (SENTIMENT in ['positive', 'negative', 'neutral'])
"""

print(lmql.run_sync(query_string,
      model = lmql.model("local:llama.cpp:llama-2-7b.Q4_K_M.gguf")).variables)

The reasoning doesn’t make much sense. We’ve already seen on the leaderboard that the Zephyr model is much better than Llama-2-7b.

Image by author

In classical Machine Learning, we usually get not only class labels but also their probabilities. We can get the same data using distribution in LMQL. We just need to specify the variable and its possible values — distribution SENTIMENT in ['positive', 'negative', 'neutral'].

query_string = """
"Q: What is the sentiment of the following review: ```The food was very good.```?\\n"
"A: Let's think step by step. [ANALYSIS]. Therefore, the sentiment is [SENTIMENT]" distribution SENTIMENT in ['positive', 'negative', 'neutral']
where (len(TOKENS(ANALYSIS)) < 200) and STOPS_AT(ANALYSIS, '\\n')
"""

print(lmql.run_sync(query_string,
      model = lmql.model("local:llama.cpp:zephyr-7b-beta.Q4_K_M.gguf",
                         tokenizer = 'HuggingFaceH4/zephyr-7b-beta')).variables)

Now we get probabilities in the output, and we can see that the model is quite confident in the positive sentiment.

Probabilities can be helpful in practice if you want to act only on decisions where the model is confident.

Image by author
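For example, you could accept the model’s verdict only when it is confident enough and route low-confidence reviews to manual processing. Below is a rough sketch; the way I map distribution_values to class probabilities is an assumption about the result format, so double-check it against the output you get.

result = lmql.run_sync(query_string,
         model = lmql.model("local:llama.cpp:zephyr-7b-beta.Q4_K_M.gguf",
                            tokenizer = 'HuggingFaceH4/zephyr-7b-beta'))

# assumption: distribution_values holds one probability per class,
# in the same order as in the distribution clause
probs = dict(zip(['positive', 'negative', 'neutral'], result.distribution_values))

CONFIDENCE_THRESHOLD = 0.8  # arbitrary threshold, tune it for your use case
label, prob = max(probs.items(), key=lambda kv: kv[1])
final_label = label if prob >= CONFIDENCE_THRESHOLD else 'needs manual review'
print(final_label, probs)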

Now, let’s create a function to use our sentiment analysis for various inputs. It would be interesting to compare results with and without distribution, so we need two functions.

@lmql.query(model=lmql.model("local:llama.cpp:zephyr-7b-beta.Q4_K_M.gguf",
                             tokenizer = 'HuggingFaceH4/zephyr-7b-beta', n_gpu_layers=1000))
# n_gpu_layers is specified to use the GPU for higher speed
def sentiment_analysis(review):
    '''lmql
    "Q: What is the sentiment of the following review: ```{review}```?\\n"
    "A: Let's think step by step. [ANALYSIS]. Therefore, the sentiment is [SENTIMENT]" where (len(TOKENS(ANALYSIS)) < 200) and STOPS_AT(ANALYSIS, '\\n') \
        and (SENTIMENT in ['positive', 'negative', 'neutral'])
    '''


@lmql.query(model=lmql.model("local:llama.cpp:zephyr-7b-beta.Q4_K_M.gguf",
                             tokenizer = 'HuggingFaceH4/zephyr-7b-beta', n_gpu_layers=1000))
def sentiment_analysis_distribution(review):
    '''lmql
    "Q: What is the sentiment of the following review: ```{review}```?\\n"
    "A: Let's think step by step. [ANALYSIS]. Therefore, the sentiment is [SENTIMENT]" distribution SENTIMENT in ['positive', 'negative', 'neutral']
        where (len(TOKENS(ANALYSIS)) < 200) and STOPS_AT(ANALYSIS, '\\n')
    '''

Then, we can use these functions for a new review.

sentiment_analysis('Room was dirty')

The model decided that it was neutral.

Image by author

There’s a rationale behind this conclusion, but I would say this review is negative. Let’s see whether we could use other decoders and get better results.

By default, the argmax decoder is used. It’s the most straightforward approach: at each step, the model selects the token with the highest probability. We could try to play with other options.

Let’s try the beam search approach with n = 3 and a pretty high temperature = 0.8. As a result, we get three sequences sorted by likelihood, so we can just take the first one (with the highest likelihood).

sentiment_analysis('Room was dirty', decoder = 'beam',
                   n = 3, temperature = 0.8)[0]

Now, the model was able to spot the negative sentiment in this review.

Image by author

It’s worth noting that beam search decoding comes at a cost. Since we are working with three sequences (beams), getting an LLM result takes roughly three times longer on average: 39.55 seconds vs 13.15 seconds.
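If you want to check this trade-off on your own hardware, a simple timing sketch like the one below is enough.

import time

start = time.time()
sentiment_analysis('Room was dirty')  # argmax decoding by default
argmax_time = time.time() - start

start = time.time()
sentiment_analysis('Room was dirty', decoder = 'beam', n = 3, temperature = 0.8)
beam_time = time.time() - start

print(f'argmax: {argmax_time:.2f} s, beam (n = 3): {beam_time:.2f} s')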

Now, we have our functions and can test them with our real data.

Results on real-life data

I’ve run all the functions on a 10% sample of the 1K Yelp reviews dataset with different parameters (a sketch of the evaluation loop follows the list):

  • models: Llama 2 or Zephyr,
  • approach: using distribution or just constrained prompt,
  • decoders: argmax or beam search.
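The evaluation itself boils down to a loop like the sketch below. The file and column names are my assumptions about the dataset layout; the full code is on GitHub (linked below).

import pandas as pd

# assumption: the UCI file is tab-separated with a sentence and a 0/1 label
df = pd.read_csv('yelp_labelled.txt', sep='\t', names=['review', 'label'])
sample_df = df.sample(frac=0.1, random_state=42)

predictions = []
for review in sample_df['review']:
    result = sentiment_analysis(review)  # constrained prompt, argmax decoder
    predictions.append(result.variables['SENTIMENT'])

sample_df['predicted'] = predictions
sample_df['actual'] = sample_df['label'].map({1: 'positive', 0: 'negative'})
accuracy = (sample_df['predicted'] == sample_df['actual']).mean()
print(f'accuracy: {accuracy:.2%}')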

First, let’s compare accuracy — the share of reviews with correctly identified sentiment. We can see that Zephyr performs much better than the Llama 2 model. Also, for some reason, we get significantly poorer quality with distributions.

Graph by author

If we look a bit deeper, we can notice that:

  • For positive reviews, accuracy is usually higher.
  • The most common error is marking a review as neutral.
  • For Llama 2 with the plain prompt, we see a high rate of critical issues (positive comments labelled as negative).

In many cases, I suppose the model uses a rationale similar to the one we saw earlier with the “dirty room” example, scoring negative comments as neutral. The model is unsure whether “dirty room” carries negative or neutral sentiment since we don’t know whether the customer expected a clean room.

Graph by author
Graph by author

It’s also interesting to look at actual probabilities:

  • The 75th percentile of the probability of the positive label for positive comments is above 0.85 for the Zephyr model, while it is way lower for Llama 2.
  • All models show poor performance on negative comments, where the 75th percentile of the probability of the negative label is well below 0.5.
Graph by author
Graph by author

Our quick research shows that a vanilla prompt with a Zephyr model and argmax decoder would be the best option for sentiment analysis. However, it’s worth checking different approaches for your use case. Also, you could often achieve better results by tweaking prompts.

You can find the full code on GitHub.

Summary

Today, we’ve discussed the concept of LMP (Language Model Programming), which allows you to mix natural-language prompts and scripting instructions. We’ve tried using it for a sentiment analysis task and got decent results with local open-source models.

Even though LMQL is not widespread yet, this approach might be handy and gain popularity in the future since it combines natural and programming languages into a powerful tool for LMs.

Thank you very much for reading this article. I hope it was insightful. If you have any follow-up questions or comments, please leave them in the comments section.

Dataset

Kotzias, Dimitrios (2015). Sentiment Labelled Sentences. UCI Machine Learning Repository (CC BY 4.0 license). https://doi.org/10.24432/C57604
