
Guide to ChatGPT’s Advanced Settings – Top P, Frequency Penalties, Temperature, and More

Unlock the full potential of ChatGPT by optimizing extended configurations like Top P, frequency and presence penalties, stop sequences…

Photo by Volodymyr Hryshchenko on Unsplash

While ChatGPT already yields impressive results with the default settings, there is huge untapped potential that comes from its advanced parameters.

By adjusting settings like Top P, frequency penalty, presence penalty, stop sequences, maximum length, and temperature, we can steer text generation to meet nuanced demands through new levels of creativity and specificity.

In this article, we explore these advanced settings and learn how to tune them effectively.

Table of Contents

(1) Temperature
(2) Maximum Length
(3) Stop Sequences
(4) Top P
(5) Frequency Penalty
(6) Presence Penalty


Introduction

Using ChatGPT is simple – type a prompt and receive a response. Yet, there are numerous advanced parameters that we can configure to enrich the output generated.

The OpenAI Playground lets us interact with language models while offering various configuration options, as shown below:

OpenAI Playground landing page with advanced settings in the right-hand panel | Image by author

These advanced settings can also be configured via the API, as shown in the following Python snippet:

import openai

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "user",
         "content": ""}
    ],
    temperature=0,
    max_tokens=100,
    stop=["goodbye"],
    top_p=0.5,
    frequency_penalty=0,
    presence_penalty=0)
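
For completeness, with the pre-1.0 openai Python library used above, the generated reply can then be read from the response object as follows:

reply = response["choices"][0]["message"]["content"]
print(reply)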

Aside from the self-explanatory mode and model settings, let’s take a deep dive into the other six parameters.


(1) Temperature

Temperature controls the degree of randomness in the responses and its value ranges between 0 and 2.

At zero, the output is highly predictable and sticks closely to the most likely words. If we want consistent answers, a zero temperature is ideal, especially when using these models for grounded responses.

If we want responses that are more creative and unpredictable, we can increase the temperature. Consider the following sentence:

"The cat sat on the mat and started to ___"

  • At a low temperature (e.g., 0), the model will choose a highly probable word like "purr" or "sleep."
  • At a medium temperature (e.g., 1), the model might introduce slightly less expected yet reasonable words, such as "groom" or "stretch."
  • At a high temperature (e.g., 2), the model may generate more diverse and less predictable outcomes, such as "contemplate" or "brainstorm."

At higher temperatures, the model is more inclined to take risks, leading to a wider variety of possible completions. However, high temperatures may lead to nonsensical output, as shown below:

Example of gibberish output from the maximum temperature value | Image by author

From a technical perspective, a higher temperature flattens the probability distribution over the next token, so that typically less-common tokens become more likely to be generated. Conversely, a lower temperature sharpens the distribution, giving the more common tokens an even higher probability of being generated.
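
To make this concrete, below is a minimal sketch of how temperature scaling is typically applied to the model’s raw scores (logits) before sampling; the candidate words and logit values are made up for illustration:

import numpy as np

def softmax_with_temperature(logits, temperature):
    # Divide the logits by the temperature before normalizing:
    # a small temperature sharpens the distribution, a large one flattens it.
    scaled = np.array(logits) / max(temperature, 1e-6)   # guard against division by zero
    exp = np.exp(scaled - np.max(scaled))                # subtract the max for numerical stability
    return exp / exp.sum()

# Hypothetical logits for the candidates "purr", "stretch", "contemplate"
logits = [4.0, 2.5, 1.0]

print(softmax_with_temperature(logits, 0.2))  # ~[1.00, 0.00, 0.00] -> near-deterministic
print(softmax_with_temperature(logits, 1.0))  # ~[0.79, 0.18, 0.04]
print(softmax_with_temperature(logits, 2.0))  # ~[0.59, 0.28, 0.13] -> flatter, more random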


(2) Maximum Length

The maximum length relates to the maximum number of tokens that will be generated.

For English text, 1 token is roughly 0.75 words (or 4 characters). Check out OpenAI’s token counter to calculate the number of tokens in your text.

One caveat is that the input prompt also counts toward the model’s context window: the prompt tokens plus the maximum length cannot exceed the model’s context limit. For instance, with a 1,000-token context limit and a 300-token input, the maximum length cannot be set above 1,000 – 300 = 700 tokens.

Furthermore, the upper limit to the maximum length is model-specific. For example, the GPT-4 model can reach 8,191 tokens.

With this, we can generate responses that fit within custom token limits for different use cases. An example is marketing campaigns where we create SMS messages within the 160-character (~40 tokens) limit.

Suppose we have the following 20-token prompt:

"Generate an SMS marketing message for a local bakery in London called Delights that is offering a discount."

To generate a message that fits the SMS limit, we set the maximum length to 40 tokens. After entering the prompt, we get this concise SMS message:

Hey! Delicious savings at Delights Bakery, London! 
Enjoy 20% off exquisite pastries & delightful bread. 
Pop in & make your day a bit sweeter. Hurry, offer ends soon!
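
If you are calling the API programmatically, a minimal sketch of this token budgeting with OpenAI’s tiktoken library might look like the following (the 40-token budget matches the SMS example above):

import tiktoken

prompt = ("Generate an SMS marketing message for a local bakery in London "
          "called Delights that is offering a discount.")

# Count the prompt's tokens with the tokenizer used by GPT-4
encoding = tiktoken.encoding_for_model("gpt-4")
prompt_tokens = len(encoding.encode(prompt))

sms_token_budget = 40  # roughly the 160-character SMS limit
print(f"Prompt uses {prompt_tokens} tokens; set max_tokens={sms_token_budget}")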

(3) Stop Sequences

The "stop sequences" parameter instructs the model to halt generation upon reaching a certain string. This is useful when we want the output to end at specific points, ensuring that the response is concise and omits unwanted information.

Suppose we have the following output after asking ChatGPT to generate a resignation letter template:

Subject: Resignation Notice

Dear [Manager's Name],

I hereby resign from my position at [Company Name], 
effective [Last Working Day, typically two weeks from the date of the email].

Best,
[Your Name]

If we want to exclude the sign-off at the bottom, we can set the string "Best" as one of the stop sequences. By doing so, the regenerated output will cut off at the stop sequence, as seen below:

Stop sequence output in OpenAI Playground | Image by author

The returned output excludes the stop sequence itself, and up to four stop sequences can be defined per request.
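
In the API, this is done by passing the string to the stop parameter; a minimal sketch using a hypothetical resignation-letter prompt:

import openai

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user",
               "content": "Write a short resignation letter template."}],
    stop=["Best"])  # generation halts just before this string appears

print(response["choices"][0]["message"]["content"])  # sign-off is omitted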


(4) Top P

Top P is associated with the top-p sampling technique (aka nucleus sampling). As a recap, GPT models generate the next word by assigning probabilities to all possible next words in its vocabulary.

With top-p sampling, instead of considering the entire vocabulary, the next word will be sampled from a smaller set of words that collectively have a cumulative probability above the Top P value.

Top P ranges from 0 to 1 (default), and a lower Top P means the model samples from a narrower selection of words. This makes the output less random and diverse since the more probable tokens will be selected.

For instance, if Top P is set at 0.1, only tokens comprising the top 10% probability mass are considered.

_Given that Top P impacts output randomness, OpenAI recommends adjusting either Top P or temperature, but not both. Nonetheless, there is no harm in experimenting with tuning both._

The following shows the outputs for different Top P values based on this prompt:

"Write a wildly creative short synopsis about a whale"

Comparing outputs of different Top P values | Image by author

The example above shows that the output from a lower Top P of 0.01 appears less creative and fancy in its description.


Technical Details

If Top P is set to 0.1, it does not strictly mean that tokens in the top 10% probability mass are considered. Rather, the model finds the smallest set of most probable tokens whose cumulative probability exceeds 10%.

It starts from the most probable token and adds others in descending probabilities until the Top P is met. In some cases, this could involve many tokens if no single token has a very high probability and the distribution is relatively flat.
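
A minimal sketch of this selection process in plain Python, using made-up probabilities for five candidate tokens:

def top_p_filter(probs, top_p):
    # Sort candidate indices by probability, from most to least likely
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative > top_p:   # smallest set whose cumulative probability exceeds Top P
            break
    total = sum(probs[i] for i in kept)
    return kept, [probs[i] / total for i in kept]   # renormalized before sampling

# Hypothetical next-token probabilities
probs = [0.50, 0.25, 0.12, 0.08, 0.05]
print(top_p_filter(probs, 0.1))   # keeps only the single most likely token
print(top_p_filter(probs, 0.8))   # keeps the three most likely tokens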


(5) Frequency Penalty

The frequency penalty addresses a common problem in text generation: repetition. By applying penalties to frequently appearing words, the model is encouraged to diversify language use.

Positive frequency penalty values penalize new tokens based on their existing frequency in the text so far, decreasing the model’s likelihood of repeating the same line verbatim.

Based on API documentation, the frequency penalty ranges from -2 to 2 (default 0). However, the range available on the Playground is 0 to 2. We shall follow the API documentation’s range.

The following shows the outputs for different frequency penalties based on this prompt:

"Write a poem where every word starts with Z"

Comparing outputs of different frequency penalties | Image by author

The example above shows that a larger frequency penalty leads to fewer repeated words and greater diversity, such that we even get words that do not begin with ‘Z.’

Reasonable values for the frequency penalty are around 0.1 to 1. We can increase it further to suppress repetition strongly, but it can degrade output quality. Negative values can also be set to increase repetition instead of reducing it.


(6) Presence Penalty

Like the frequency penalty, the presence penalty aims to reduce token repetition.

Positive presence penalty values penalize new tokens based on whether they have appeared in the text so far, increasing the model’s likelihood of talking about new topics.

Based on API documentation, the presence penalty ranges from -2 to 2 (default 0), whereas the range on the Playground is 0 to 2.

What is the difference between a frequency penalty and a presence penalty?

The subtle difference lies mainly in the degree of penalty applied to repeated tokens. The frequency penalty is proportional (i.e., a relative marker) to how often a particular token has already been generated.

On the other hand, the presence penalty is a once-off (additive) penalty applied to a token that has appeared at least once, like a Boolean (1/0) marker.

The impact of these penalties is seen in the following equation for the logit (unnormalized log probability) μ of a token j:

Equation showing logit of the j-th token subtracted by two penalty terms | Image by author

c[j] refers to how often a token has been generated previously, and the α values are the penalty coefficients (i.e., between -2 and 2).
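
As an illustration, here is a minimal sketch of how the two penalties adjust the logits according to the formula above; the token logits and counts are made up:

def apply_penalties(logits, counts, alpha_frequency, alpha_presence):
    # mu[j] -> mu[j] - c[j] * alpha_frequency - 1(c[j] > 0) * alpha_presence
    adjusted = {}
    for token, mu in logits.items():
        c = counts.get(token, 0)   # how often the token has been generated so far
        adjusted[token] = mu - c * alpha_frequency - (1 if c > 0 else 0) * alpha_presence
    return adjusted

# Hypothetical logits and generation counts ("zigzag" has not appeared yet)
logits = {"zebra": 3.0, "zealous": 2.0, "zigzag": 1.5}
counts = {"zebra": 4, "zealous": 1}

print(apply_penalties(logits, counts, alpha_frequency=0.5, alpha_presence=0.5))
# {'zebra': 0.5, 'zealous': 1.0, 'zigzag': 1.5} -> frequently repeated tokens are pushed down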

Reasonable values for the presence penalty are the same as described for the frequency penalty.


Wrapping It Up

After understanding what each parameter does, we can tweak these advanced settings more confidently to meet our needs.

Tuning these parameters is a delicate blend of art and science, so it is recommended to play around with different configurations to see what works best for various use cases.

Before you go

I welcome you to join me on a journey of Data Science discovery! Follow this Medium page and visit my GitHub to stay updated with more engaging and practical content. Meanwhile, have fun experimenting with ChatGPT’s advanced settings!

Running Llama 2 on CPU Inference Locally for Document Q&A

Text-to-Audio Generation with Bark, Clearly Explained

