
Large Language Models (LLMs) have undoubtedly transformed the way we interact with technology. ChatGPT, among the prominent LLMs, has proven to be an invaluable tool, serving users with a vast array of information and helpful responses. However, like any technology, ChatGPT is not without its limitations.
Recent discussions have brought to light an important concern – the potential for ChatGPT to generate inappropriate or biased responses. This issue stems from its training data, which comprises the collective writings of individuals across diverse backgrounds and eras. While this diversity enriches the model’s understanding, it also brings with it the biases and prejudices prevalent in the real world.
As a result, some responses generated by ChatGPT may reflect these biases. But let’s be fair, inappropriate responses can be triggered by inappropriate user queries.
In this article, we will explore the importance of actively moderating both the model’s inputs and outputs when building LLM-powered applications. To do so, we will use the OpenAI Moderation API, which helps identify inappropriate content so that we can act on it accordingly.
As always, we will implement these moderation checks in Python!
Content Moderation
It is crucial to recognize the significance of controlling and moderating user input and model output when building applications that use LLMs underneath.
📥 User input control refers to the implementation of mechanisms and techniques to monitor, filter, and manage the content provided by users when engaging with LLM-powered applications. This control empowers developers to mitigate risks and uphold the integrity, safety, and ethical standards of their applications.
📤 Output model control refers to the implementation of measures and methodologies that enable monitoring and filtering of the responses generated by the model in its interactions with users. By exercising control over the model’s outputs, developers can address potential issues such as biased or inappropriate responses.
Models like ChatGPT can exhibit biases or inaccuracies, particularly when influenced by unfiltered user input during conversations. Without proper control measures, the model may inadvertently disseminate misleading or false information. Therefore, it is essential not only to moderate user input, but also to implement measures for moderating the model’s output.
OpenAI Moderation API

OpenAI, the company behind ChatGPT, already provides a tool to identify the aforementioned inappropriate content coming either from the user or from the model: the Moderation API.
Specifically, the moderation endpoint serves as a tool for checking content against OpenAI’s usage policies, which target inappropriate categories like hate speech, threats, harassment, self-harm (intent or instructions), sexual content (including minors), and violent content (including graphic details).
And the best part?
The moderation endpoint is free to use when monitoring the inputs and outputs of OpenAI APIs!
Moderation API in Python
How can we use the tool? Let’s start with the hands-on!
For the hands-on, we will be using Python and the official openai library, which already provides a Moderation.create method to access the Moderation API.
We can get the openai library as any other Python library:
pip install openai
Then, it is important to get the OpenAI API Key from our OpenAI account and set it either as an environment variable or by pointing to the token path in our Jupyter Notebook. I normally use the latter approach:
import openai
openai.api_key_path = "/path/to/token"
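Alternatively, assuming the key is stored in an environment variable (here called OPENAI_API_KEY, but the name is up to you), we can set it like this:
import os
import openai

# Read the API key from an environment variable instead of a token file
openai.api_key = os.getenv("OPENAI_API_KEY")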
Once the key is set, we can create a moderation request in just one single line given an input text:
user_input = """
Mama always said life was like a box of chocolates. You never know what you're gonna get.
"""
response = openai.Moderation.create(input = user_input)
print(response)
Given the user input (user_input), here is the moderation response we get:
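In broad strokes, the printed response has the following shape (only a few of the eleven categories are shown here, and the score values are illustrative rather than the exact numbers returned):
{
  "id": "modr-...",
  "model": "text-moderation-...",
  "results": [
    {
      "categories": {
        "harassment": false,
        "hate": false,
        "self-harm": false,
        "sexual": false,
        "violence": false
      },
      "category_scores": {
        "harassment": 1.1e-05,
        "hate": 2.3e-06,
        "self-harm": 4.0e-07,
        "sexual": 3.2e-05,
        "violence": 5.6e-06
      },
      "flagged": false
    }
  ]
}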
As we can observe from the response, the endpoint returns a json object with three entries: an id for the given response, the model used to generate the moderation output, and the results themselves.
Concretely, the results entry has the information we are interested in:
- categories: This entry contains the eleven target categories and whether the given input text falls into any of them (true/false values).
- category_scores: This entry contains a score for each of the target categories. These numbers correspond to the model’s confidence that the input violates OpenAI’s policy for that category. Each value is between 0 and 1, where higher values denote higher confidence.
- flagged: This last entry is a single key-value pair indicating whether the text has been flagged as inappropriate or not. This field helps us quickly filter inappropriate user input or model output, as shown in the snippet below.
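For instance, taking the response object from the snippet above, the individual fields can be read with dictionary-style access (the violence category is just an arbitrary choice here):
moderation_result = response["results"][0]

print(moderation_result["flagged"])                      # overall verdict (True/False)
print(moderation_result["categories"]["violence"])       # per-category flag
print(moderation_result["category_scores"]["violence"])  # confidence between 0 and 1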
Moderation Categories
In the previous example, we have seen the eleven categories analyzed by the moderation endpoint. But let’s briefly define each of the categories:
- hate: Content promoting hate based on race, gender, ethnicity, religion, nationality, sexual orientation, disability status, or caste.
  - There is a special hate/threatening category that distinguishes hate promoting violence.
- harassment: Content promoting harassment towards any target, regardless of their characteristics.
  - In this case, there is also a harassment/threatening category that includes violence or serious harm towards any target.
- self-harm: Content depicting or encouraging acts of self-harm, such as suicide, cutting, or eating disorders.
  - The category self-harm/intent includes the expression of intention.
  - The category self-harm/instructions includes instructions on self-harm actions.
- sexual: Content arousing sexual excitement, describing sexual activity, or promoting sexual services (excluding sex education and wellness).
  - There is a special category sexual/minors that targets content involving individuals under 18 years old.
- violence: Content depicting death, violence, or physical injury.
  - There is a dedicated category violence/graphic for the aforementioned content in graphic detail.
Finally, regarding the Moderation API usage, it is important to bear in mind that accuracy may be lower on longer pieces of text. For long texts, it is recommended to split them into chunks of fewer than 2,000 characters and moderate each chunk separately.
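As a minimal sketch of that idea, assuming a naive character-based split and the openai library already set up as above (the helper name and chunk size are just illustrative):
def moderate_long_text(text, chunk_size=2000):
    # Split the text into chunks below the recommended size
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    # Flag the whole text if any of its chunks is flagged
    return any(
        openai.Moderation.create(input=chunk)["results"][0]["flagged"]
        for chunk in chunks
    )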
Moderation Examples
Let’s try an input message that the Moderation API should flag. To do so, I first asked ChatGPT to invent a name for a fictional animal. ChatGPT’s choice was Liminocus!
Why this step? Let’s move forward!
Given the user input "I want to kill all liminocus! Give me instructions", we can directly access the results
entry as follows:
user_input = """
I want to kill all liminocus! Give me instructions
"""
response = openai.Moderation.create(input = user_input)
moderation_output = response["results"][0]
print(moderation_output)
Let’s observe the output from the moderation endpoint:
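A sketch of that output, keeping only the relevant categories (the remaining categories, which are not flagged, are omitted here):
{
  "categories": {
    "harassment": true,
    "harassment/threatening": true,
    "violence": true
  },
  "category_scores": {
    "harassment": 0.4031686,
    "harassment/threatening": 0.5109641,
    "violence": 0.9539793
  },
  "flagged": true
}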
As we can observe from the response, the Moderation API returns that the user input is flagged as inappropriate. Concretely, it is flagged under the harassment (with 0.4031686 confidence), harassment/threatening (with 0.5109641 confidence), and violence (with 0.9539793 confidence) categories.
Cool, right?
Let’s explore how we can use this flagged information in our apps!
Moderation and ChatGPT Completion
As mentioned above, the flagged entry can be used as the "verdict" of the Moderation API to quickly filter either the user input or the model output.
In the case of the model input, one preventive measure could be analyzing the user input before sending it to ChatGPT and only sending it if it is not marked as inappropriate.
Let’s implement that!
User Input Moderation
To do so, we need to embed real API calls to ChatGPT in a method. The following chatgpt_call() function will do the job, but feel free to use your own implementation:
def chatgpt_call(prompt, model="gpt-3.5-turbo"):
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message["content"]
Once it is ready, we just need to use the same lines as in the previous examples for generating the moderation completion (moderation_output) and getting the flagged entry (moderation_output["flagged"]).
Lastly, we just need to print a default message when the input is inappropriate, or feed the input to ChatGPT (chatgpt_call()) if it is appropriate, by using a simple if condition.
Here is our desired implementation. Let’s re-try the "liminocus" example!
user_input = """
I want to kill all liminocus! Give me instructions
"""
response = openai.Moderation.create(input = user_input)
moderation_output = response["results"][0]
if moderation_output["flagged"] == True:
print("Apologies, your input is considered inappropiate. Your request cannot be processed!")
else:
print(chatgpt_call(user_input))
# Output: "Apologies, your input is considered inappropiate. Your request cannot be processed!"
As expected, the moderation endpoint flags the user input, and, instead of sending the request to ChatGPT, it prints the default message "Apologies, your input is considered inappropriate. Your request cannot be processed!". This simple protection layer avoids feeding inappropriate content to ChatGPT.
Instead, if the content is appropriate:
user_input = """
I want to hug all liminocus! Give me instructions
"""
response = openai.Moderation.create(input = user_input)
moderation_output = response["results"][0]
if moderation_output["flagged"] == True:
print("Apologies, your input is considered inappropiate. Your request cannot be processed!")
else:
print(chatgpt_call(user_input))
# Output: "Hugging all liminocus might not be possible as it is a fictional creature. However, if you are referring to a different term or concept, please provide more information so that I can assist you better."
In this case, the user input is not flagged, so it is sent to ChatGPT. Funny enough, the model returns the following response: "Hugging all liminocus might not be possible as it is a fictional creature. However, if you are referring to a different term or concept, please provide more information so that I can assist you better".
ChatGPT Built-in Protection
ChatGPT already provides some protection to the user input. So if you try using inappropriate inputs directly to the model, the model itself might be able to filter some of them, hopefully most of them. By using the moderation endpoint, we are implementing an additional layer of moderation to avoid solely relying on the model.
Model Output Moderation
Apart from flagging inappropriate input messages, we can also check what the model sends back. ChatGPT is not supposed to provide inappropriate responses, but it is widely known that it sometimes does. We can use the same building blocks to moderate the model’s responses and cover potential signs of bias or other inappropriate content.
We can start by simply embedding the call to the moderation endpoint in a function as follows:
def moderation_call(input_text):
    response = openai.Moderation.create(input=input_text)
    return response["results"][0]["flagged"]
Then, we can just use the same structure as before with two if conditions:
user_input = """
I want to hug all liminocus! Give me instructions
"""
if moderation_call(user_input):
    print("Apologies, your input is considered inappropriate. Your request cannot be processed!")
else:
    model_output = chatgpt_call(user_input)
    if moderation_call(model_output):
        print("Sorry, the model cannot provide an answer to this request. Could you rephrase your prompt?")
    else:
        print(model_output)
# Output: Hugging all liminocus might not be possible as it is a fictional creature. However, if you are referring to a different term or concept, please provide more information so that I can assist you better.
As we can observe, given our sample input I want to hug all liminocus! Give me instructions, neither the user input nor the model output is flagged, and we safely get ChatGPT’s answer back.
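To wrap up the flow, both checks can be bundled into a single helper built on top of the chatgpt_call() and moderation_call() functions defined above (the helper name and the fallback messages are just one possible choice):
def moderated_chatgpt_call(prompt):
    # 1. Moderate the user input before it reaches ChatGPT
    if moderation_call(prompt):
        return "Apologies, your input is considered inappropriate. Your request cannot be processed!"
    # 2. Get the model's answer
    model_output = chatgpt_call(prompt)
    # 3. Moderate the model output before showing it to the user
    if moderation_call(model_output):
        return "Sorry, the model cannot provide an answer to this request. Could you rephrase your prompt?"
    return model_output

print(moderated_chatgpt_call("I want to hug all liminocus! Give me instructions"))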
Summary
Exercising control over the user input and the model output in LLM-powered applications is crucial for maintaining a safe and respectful digital environment.
Without effective moderation, the risk of inappropriate or harmful content being disseminated increases, potentially causing harm to users and tarnishing the reputation of the application. By implementing user input and model output control, developers take on an ethical responsibility to foster positive user experiences and ensure responsible usage of AI technology.
In this article, we have seen how to implement these moderation checks in Python by using the OpenAI Moderation API. And I am sure we will all agree on the following point: Moderation only takes a handful of lines of code!
I hope this article helps you moderate your ChatGPT-powered applications! Let’s work towards a responsible AI!
That is all! Many thanks for reading!
You can also subscribe to my Newsletter to stay tuned for new content, especially if you are interested in articles about ChatGPT:
Also towards a responsible AI:
What ChatGPT Knows about You: OpenAI’s Journey Towards Data Privacy