Beyond chat-bots: the power of prompt-based GPT models for downstream NLP tasks
Large-scale language models have taken the NLP community by storm in the past few years. Generative Pre-Trained Transformer (GPT) models, such as GPT-3 by OpenAI and GPT-J-6B and GPT-NeoX-20B by EleutherAI, have shown impressive results when it comes to generating text that is indistinguishable from human-generated text. An intuitive use-case for such a GPT model is something conversational like a chat-bot or AI storyteller: you give the model a question or story prompt and the model continues it. However, the immense potential of prompt-based machine learning using GPT models for other tasks is often less intuitive, as it embodies a paradigm shift of sorts. In this article, I will discuss how you can use creative prompt engineering and GPT models to help solve the downstream NLP tasks that you care about.
For the uninitiated, what is a GPT model?
The family of GPT models consists of generative language models that predict the next token in a sequence of tokens. These models are deep neural networks with billions of parameters, trained on an extremely large volume of textual data, primarily collected from the internet. This means that these models are not trained for specific tasks; they just generate text based on the preceding text, whatever it may be. This might not sound particularly useful, but it closely resembles how humans communicate: someone might ask you a question (i.e., the preceding text) and you provide an answer (i.e., the generated text). So for example, if I give the OpenAI GPT-3 model a prompt such as:
“Where is the University of Washington located?”
It will generate a response that looks like this:
“The University of Washington is located in Seattle, Washington.”
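In code, a call to the OpenAI completions API for this prompt could look roughly like the snippet below. This is a minimal sketch: the model name, parameters, and output are illustrative, and the exact argument names can vary with the version of the `openai` package.

```python
import openai  # official OpenAI Python client: pip install openai

openai.api_key = "YOUR_API_KEY"  # replace with your own API key

# Ask the model to continue the prompt; model name and parameters are illustrative.
response = openai.Completion.create(
    model="davinci",
    prompt="Where is the University of Washington located?\n",
    max_tokens=30,
    temperature=0,  # low temperature for a more deterministic answer
)

print(response["choices"][0]["text"].strip())
# e.g. "The University of Washington is located in Seattle, Washington."
```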
Using GPT models for downstream NLP tasks
It is evident that these GPT models are powerful and can generate text that is often indistinguishable from human-generated text. But how can we get a GPT model to perform tasks such as classification, sentiment analysis, topic modeling, text cleaning, and information extraction? A natural thought might be to just ask the model to perform such a task, but that can be fiddly, unpredictable, and hard to control. For example, let’s say I have a research project about employee compensation and my downstream NLP task is as follows: extract all sentences out of an employee review that relate to compensation and classify them as positive, negative, or neutral.
If I were to ask a person to do this, I could phrase it something like this:
Here is the review:
Plumbing Co is a great place to work for those who are interested in the field of plumbing. The company is always expanding and there is room for advancement. The pay is too low, however, and this is the only downside to working for Plumbing Co. In general, check it out!
The task is to:
Give me the sentences that relate to compensation and classify them as positive, negative, or neutral.
What is the answer?
If I give this prompt, as is, to OpenAI GPT-3 Davinci, their largest and most expensive model, and run it a few times, it gives me the following responses:
Try 1: The pay is too low, however, and this is the only downside to working for Plumbing Co.
Try 2: negative: “The pay is too low, however, and this is the only downside to working for Plumbing Co.”
Try 3: Negative: The pay is too low
These results are honestly quite impressive: without giving the model any examples, it provided results that are somewhat in the spirit of what we need. However, the results are not consistent, and parsing such a wide range of answers will be difficult, if not impossible. So, how do we make this work? The answer is prompt engineering and fine-tuning.
With appropriate prompt engineering and a sufficient number of examples we can use a single GPT model to do virtually any downstream NLP task, including:
- Text classification
- Topic modeling
- Text cleaning, text correction, and text normalization
- Named entity and information extraction
- And much more; your creativity is the limit!
Let me show you an example (with code 🔥)
Ok, so let’s look at the example of the employee reviews. As a refresher, our objective is to extract all sentences out of an employee review that relate to compensation and classify them as positive, negative, or neutral. The code underlying this example is included in the Jupyter Notebook linked at the end.
To illustrate, here are some example employee reviews with compensation sentences:
Review #1 — negative sentiment:
Plumbing Co is a great place to work for those who are interested in the field of plumbing. The company is always expanding and there is room for advancement. The pay is too low, however, and this is the only downside to working for Plumbing Co.
Review #2 — positive sentiment:
Plumbing Co is a great company to work for! The compensation is great and above the industry standard. The benefits are also very good. The company is very fair and treats its employees well. I would definitely recommend Plumbing Co to anyone looking for a great place to work.
Review #3 — neutral sentiment:
I’ve been working at Plumbing Co for a few months now, and I’ve found it to be a pretty decent place to work. The salary is pretty average, but the coffee is really great. Overall, I think it’s a pretty good company to work for. The hours are reasonable, and the work is fairly easy. I would recommend it to anyone looking for a decent job.
Without a GPT model we might solve this task using a machine learning pipeline that looks something like this:
- Do a thorough cleanup to make sure the text is consistent and normal.
- Split each review into separate sentences using a library like spaCy.
- Set up a keyword list to identify sentences that relate to compensation.
- Create a large training sample of compensation sentences by manually classifying them into positive, neutral, or negative.
- Convert the text into a numerical representation using something like TF-IDF or word embeddings.
- Train a supervised machine learning model (e.g., Naïve Bayes or SVM) on your training sample.
- Run each sentence about compensation through your prediction model and link it back to the reviews.
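To make the comparison concrete, here is a minimal sketch of what such a traditional pipeline could look like in Python. The keyword list, training sentences, and model choice are purely illustrative; a real project would need a far larger, manually labeled training sample.

```python
import spacy  # requires: python -m spacy download en_core_web_sm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

nlp = spacy.load("en_core_web_sm")  # used for sentence segmentation
COMP_KEYWORDS = {"pay", "salary", "compensation", "benefits", "wage"}  # illustrative keyword list

def compensation_sentences(review: str):
    """Split a review into sentences and keep those that mention compensation."""
    return [sent.text for sent in nlp(review).sents
            if any(kw in sent.text.lower() for kw in COMP_KEYWORDS)]

# A tiny, manually labeled training sample (in practice: hundreds or thousands of sentences).
train_sentences = ["The pay is too low.", "The compensation is great.", "The salary is average."]
train_labels    = ["negative",            "positive",                   "neutral"]

# TF-IDF features + Naïve Bayes classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_sentences, train_labels)

review = "Plumbing Co is a great place to work. The pay is too low, however."
for sent in compensation_sentences(review):
    print(sent, "->", model.predict([sent])[0])
```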
There is nothing wrong with this approach, but it is a lot of work, requires a lot of discretionary decisions, and isn’t particularly flexible. For example, if a compensation word is spelled slightly differently, it will not be picked up; and if you do not have enough training data, or make a small mistake during the training step, the prediction model will likely overfit and not work properly. Overall, this prediction pipeline requires a lot of time, planning, and attention to get right.
So let’s look at a prompt-based GPT pipeline and compare:
- Do a rough cleanup of the text to make it reasonably clean.
- Design a prompt and completion that performs your task.
- Create a small training sample to generate examples for the model.
- Fine-tune the general GPT model to start generating the completions you are after (this is optional, it depends on the complexity of your task).
- Generate a completion for each prompt using your model.
- Parse the information out of the generated completions.
Because the GPT model already has a strong understanding of language, we can save a lot of hassle and skip straight to formulating our task (i.e., the prompt) and coming up with good examples. As a bonus, the GPT pipeline is also likely to result in better performance for many tasks. How awesome! 😄
Prompt-engineering
The primary “paradigm shift” of the prompt-based GPT approach is that we have to design a prompt and completion using natural language to get the model to do what we want. This is generally referred to as prompt engineering and is important because it is the primary way to tell the model what we want it to do. I consider this a bit of a paradigm shift because it requires a fundamentally different way of thinking about your problem relative to a more traditional pipeline that revolves around numbers, vectors, and matrices. A careful prompt design will give you the best prediction performance and it will also make it possible to easily process the generated completions afterward.
Let’s design a prompt and completion for our task:
Prompt + Completion:
Plumbing Co is a great company to work for! The compensation is great and above the industry standard. The benefits are also very good. The company is very fair and treats its employees well. I would definitely recommend Plumbing Co to anyone looking for a great place to work.
####
<positive> The compensation is great and above the industry standard
<positive> The benefits are also very good.
<|endoftext|>
Our prompt starts with the review and ends with \n####\n. The “\n####\n” is important because it tells our model where the prompt ends and the completion begins. The completion consists of a single line for each compensation sentence, starting with the sentiment inside angle brackets. We end the completion with <|endoftext|>, which is a common stop indicator that tells the API when to stop generating tokens. The completion here is designed such that we can easily parse it afterward: forcing every sentence onto its own line lets us distinguish sentences, and putting the sentiment in angle brackets lets us extract it easily. As illustrated in the Jupyter Notebook, this completion design enables us to parse the entire completion using a single, relatively basic regular expression.
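A parsing step along these lines is all that is needed. This is a sketch of my own; the exact regular expression in the notebook may differ, but the idea is the same: one capture group for the sentiment and one for the sentence.

```python
import re

# One line per extracted sentence: "<sentiment> sentence text"
COMPLETION_PATTERN = re.compile(r"<(positive|negative|neutral)>\s*(.+)")

completion = """<positive> The compensation is great and above the industry standard
<positive> The benefits are also very good."""

for sentiment, sentence in COMPLETION_PATTERN.findall(completion):
    print(sentiment, "->", sentence.strip())
# positive -> The compensation is great and above the industry standard
# positive -> The benefits are also very good.
```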
Teaching the model to generate our completions
You can generate predictions from a GPT model in one of three ways:
- Zero-shot
> Don’t give the model any examples; just give it your prompt.
- Few-shot
> Include a few prompt+completion examples inside your prompt to indicate what type of completion you are expecting from the model.
- Fine-tuning
> Provide the GPT model with a larger number of examples and force it to adjust its internal weights to get particularly good at generating your specific completions.
In a zero-shot scenario, the model will not have seen your completion format, so it will instead try to guess what should come next based on what is common in regular text. This resembles the example I showed earlier. If we want the model to generate our specific completions, we will need to give it examples. For more general tasks it might be sufficient to give a handful of examples as part of your prompt, which is called the few-shot approach. This is easy and intuitive; however, it limits the number of examples we can give the model, and we need to include these examples every time we want to make a prediction, which is slow and not cost-efficient. Below is a sketch of what a few-shot prompt could look like (you can try something like it in the OpenAI Playground):
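The examples, separator, and formatting below are illustrative and mirror the prompt design from the previous section; the reviews themselves are made up for the sake of the example.

```python
# Illustrative few-shot prompt: two solved examples followed by the new review.
examples = [
    ("The compensation is great and above the industry standard. The hours are long.",
     "<positive> The compensation is great and above the industry standard."),
    ("Nice colleagues, but the pay is too low for the work involved.",
     "<negative> the pay is too low for the work involved."),
]

new_review = ("I've been working at Plumbing Co for a few months now. "
              "The salary is pretty average, but the coffee is really great.")

# Each example is "review \n####\n completion"; the new review ends with the separator
# so the model knows it should now generate the completion.
few_shot_prompt = "\n\n".join(f"{review}\n####\n{completion}" for review, completion in examples)
few_shot_prompt += f"\n\n{new_review}\n####\n"

print(few_shot_prompt)  # paste this into the Playground, or send it to the completions API
```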
Fine-tuning a custom model enables us to provide the general GPT model with more examples of our prompt+completion so that it will learn how to generate a completion when presented with one of our prompts. This is a more time-consuming approach, but it is generally necessary for more complex and specific downstream NLP tasks. The companion Jupyter Notebook guides you through a fine-tuning example using the OpenAI API for our employee review use case.
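For reference, the OpenAI fine-tuning workflow expects a JSONL file with one prompt/completion pair per line. You could prepare such a file along these lines; the file name, base model, and training examples are illustrative, and the notebook covers the full workflow in detail.

```python
import json

# Each training example is one JSON object per line with "prompt" and "completion" keys,
# using the same "####" separator and "<|endoftext|>" stop indicator as our prompt design.
training_examples = [
    {
        "prompt": "Plumbing Co is a great company to work for! The compensation is great "
                  "and above the industry standard.\n####\n",
        "completion": " <positive> The compensation is great and above the industry standard."
                      "\n<|endoftext|>",
    },
    # ... a few hundred more labeled reviews ...
]

with open("compensation_train.jsonl", "w") as f:
    for example in training_examples:
        f.write(json.dumps(example) + "\n")

# The file can then be used to fine-tune a base model, e.g. via the OpenAI CLI:
#   openai api fine_tunes.create -t compensation_train.jsonl -m davinci
#
# Once fine-tuned, completions can be generated with the stop sequence from our design:
#   openai.Completion.create(model="<your-fine-tuned-model>",
#                            prompt=review + "\n####\n",
#                            max_tokens=200, temperature=0, stop="<|endoftext|>")
```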
Pros and cons of the prompt-based GPT approach
Every method has strengths and weaknesses. To help you evaluate whether the GPT approach is right for your project, let me summarize the pros and cons based on my own experience of using it for several research projects.
The pros:
- Prompt-based machine learning enables you to design your task using human language, which is often more intuitive once you get used to it. And because you can actually read the completions, it is much easier to quickly check whether your predictions make sense for your task.
- GPT models are very flexible. The only requirement is that you can express your task in terms of a prompt + completion. Creative prompt engineering opens up a lot of opportunities to automate downstream tasks that would be extraordinarily difficult to accomplish using traditional methods.
- Because the GPT models are trained on extremely large volumes of data, you generally only need to give it a few hundred examples for it to start working reliably for most downstream tasks. This makes it much more feasible to generate a high-quality gold-standard training sample versus a scenario where you need multiple thousands of examples.
- A prompt-based GPT pipeline can handle occasional text imperfections and text nuances, as these are quite prevalent in the underlying training data. This means that the discretionary text-processing choices you make play a less influential role, which will often result in more robust and easier-to-reproduce predictions.
The cons:
- Fine-tuning and inference (i.e., making predictions) can be very computationally intensive and require specific state-of-the-art GPU resources. You can circumvent this by using a Machine Learning as a Service (MLaaS) solution such as OpenAI, NLP Cloud, or Forefront. These are paid services, however, and the costs generally scale by the number of predictions that you need to make. The resulting costs can be very manageable (e.g., sub $100) or extraordinary depending on your prompt+completion length and your number of predictions.
- Evaluating the accuracy and performance of your predictions can be more challenging than with traditional approaches and might require coding up your own evaluation logic. For example, in the case of classifying our employee reviews, we would need to write some code to calculate the hold-out performance for the metrics that we care about (see the sketch after this list).
- If you use a larger GPT model, such as GPT-3, GPT-J, or GPT-NeoX-20B, the inference throughput speed will be relatively slow as every prediction needs to propagate through billions of parameters. Running large volumes of predictions (e.g., 1 million+) for long complex prompts can take multiple days to complete, or longer.
- Designing appropriate prompts and completions takes some trial and error to get right; it is sometimes more of an art than a science. Processing the completions will also require some basic coding skills in Python and, ideally, some experience with regular expressions.
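For the classification part of our example, a hold-out evaluation could be as simple as comparing the sentiments parsed from the completions against manual labels. The label lists below are placeholders, purely to show the shape of such a check.

```python
from sklearn.metrics import classification_report

# Manually labeled hold-out sentiments vs. the sentiments parsed from the model's completions.
true_labels      = ["negative", "positive", "neutral", "positive"]   # placeholder data
predicted_labels = ["negative", "positive", "positive", "positive"]  # placeholder data

# Precision, recall, and F1 per sentiment class.
print(classification_report(true_labels, predicted_labels, zero_division=0))
```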
Wrap-up
I hope this post gave you a clearer idea of how you can use creative prompt engineering to use GPT models for downstream NLP tasks! To help you get started, I also coded up a Jupyter Notebook that walks you through all the steps of designing your prompts, fine-tuning your model, and making your predictions using the OpenAI API. You can find the repository here and a quick link to the notebook is below:
I need your help! 👋 I am considering writing up a research paper with more detailed guidance on the promise and do’s-and-don’ts of using prompt-based GPT for research projects. Are you interested in reading such a paper? If you are, it would be very helpful if you could express your interest through the form below. You can also optionally sign up for a notification when it comes out. Thanks a lot! 🙏
You can find the form here✏️: https://forms.gle/wo5aStgux3SvktmN8