AI-powered Personal VoiceBot for Language Learning

Gamze Zorlubas
Towards Data Science
11 min read · Aug 28, 2023


Created with Stable Diffusion XL — Image by the author

What’s the most effective way to master a new language? Speaking it! But we all know how intimidating it can be to try out new words and phrases in front of others. What if you had a patient and understanding friend to practice with, free from judgment, free from shame?

That patient and understanding friend you’re looking for might just be a virtual language tutor powered by LLMs! This could be a game-changing way to master a language, all from the comfort of your own space.

Recently, large language models have come onto the scene, and they’re changing the way we do things. These powerful tools have created chatbots that can respond just like humans, and they’ve quickly integrated into various aspects of our lives, being used in hundreds of different ways. One particularly interesting use is in language learning, especially when it comes to speaking practice.

When I moved to Germany a while ago, I realized how challenging it can be to learn a new language and find opportunities to practice speaking it. Classes and language groups can be expensive or hard to fit into a busy schedule. As a person faced with these challenges, I had an idea: why not use chatbots for speaking practice? Texting alone wouldn’t be enough, though, since language learning involves more than just writing. Therefore, by combining an AI-powered chatbot with speech-to-text and text-to-speech technologies, I managed to create a learning experience that feels like talking to a real person.

In this article, I will share the tools I have chosen, explain the process, and introduce the concept of speaking practice with an AI chatbot through voice commands and voice responses. The pipeline of the project consists of three main sections: speech-to-text transcription, employing a language model, and text-to-speech conversion. These will be explained under the following three headings.

1. Speech-to-Text transcription

Speech recognition for my language tutor acts as the bridge between the user’s spoken input and the AI’s text-based understanding, from which a response is generated. It’s a critical component that enables voice-driven interaction, contributing to a more immersive and effective language learning experience.

Accurate transcription is crucial for smooth interaction with the chatbot, especially in a language learning context where pronunciation, accent, and grammar are key factors. There are various speech recognition tools that can be used to transcribe spoken input in Python, such as OpenAI’s Whisper and Google Cloud’s Speech-to-Text.

When selecting a speech recognition tool for the language tutor project, considerations such as accuracy, language support, cost, and whether an offline solution is required should be taken into account.

Google offers a Python API that requires an internet connection and provides 60 minutes of transcription per month free of charge. Unlike Google, OpenAI has published its Whisper model, which you can run locally without depending on internet speed, as long as you have enough computational power. That’s why I chose Whisper: to reduce the latency of the transcription as much as possible.
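
If you’d like to try Whisper on its own first, a minimal sketch looks like the following (the model size "base" and the file name are just placeholder choices; larger models are more accurate but slower):

import whisper

model = whisper.load_model("base")          # Downloads the model weights on the first run
result = model.transcribe("recording.wav")  # Path to a local audio file
print(result["text"])                       # The transcribed text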

2. Language model

The language model is the backbone of this project. As I am already well acquainted with ChatGPT and its API, I decided to use it for this project too. However, if you have enough computational power, you could also deploy Llama locally, which would be free of charge. ChatGPT costs a little money, but it is a lot more convenient, as you only need a couple of lines of code to run it.

To increase the consistency of the responses and to have them follow a specific template, you could also fine-tune the language model (e.g., how to fine-tune ChatGPT for your use case). You would need to generate exemplary sentences and corresponding optimal responses, and feed them into a fine-tuning run. However, the basic tutor I want to build doesn’t need fine-tuning, and I will use the general-purpose GPT-3.5-turbo in my project. A hypothetical example of what such training data could look like is shown below.
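
For illustration only, OpenAI’s chat fine-tuning format pairs exemplary learner sentences with optimal tutor responses; the sample below is entirely made up:

# Hypothetical fine-tuning sample in OpenAI's chat format: an exemplary
# (incorrect) learner sentence paired with the optimal tutor correction.
training_example = {
    "messages": [
        {"role": "system", "content": "Du bist eine geduldige deutsche Lernpartnerin."},  # "You are a patient German learning partner."
        {"role": "user", "content": "Ich habe gestern ins Kino gegangen."},  # A typical learner mistake
        {"role": "assistant", "content": "Fast! Richtig ist: 'Ich bin gestern ins Kino gegangen.'"}  # "Almost! Correct is: ..."
    ]
}
# Many such samples are written to a .jsonl file (one JSON object per line)
# and uploaded for the fine-tuning job.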

Below, I provide an example of an API call that facilitates a conversation between the user and ChatGPT via its API in Python. First, if you don’t already have one, you’ll need to open an OpenAI account and set up an API key to interact with ChatGPT. Instructions are given here.

Once you’ve set up your API key, you can begin generating text using the openai.ChatCompletion.create method. This method requires two parameters: the model parameter, which specifies the particular GPT model to be accessed via the API, and the messages parameter, which includes the structure for a conversation with ChatGPT. The messages parameter consists of two key components: role and content.

Here’s a code snippet to illustrate the process:

import openai

openai.api_key = "YOUR_API_KEY"  # Replace with the API key from your OpenAI account

# Initialize the messages with the desired behaviour(s) for the system role.
messages = [{"role": "system", "content": "Enter behaviour(s) here."}]

# Start an infinite loop to continue the conversation with the user.
while True:
    content = input("User: ")  # Get input from the user to respond to.
    messages.append({"role": "user", "content": content})  # Append the user's input to the messages.

    # Use the OpenAI GPT-3.5 model to generate a response to the user's input.
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=messages
    )

    chat_response = completion.choices[0].message.content  # Extract the chat response from the API response.
    print(f'ChatGPT: {chat_response}')  # Print the response.

    # Append the response with the role "assistant" to store the chat history.
    messages.append({"role": "assistant", "content": chat_response})
  • The system role is defined to determine the behaviour of ChatGPT by adding an instruction within the content at the beginning of the message list.
  • During the chat, the user message is received via the speech recognition model mentioned earlier and sent to ChatGPT for a response.
  • Finally, ChatGPT’s responses are appended to the message list with the assistant role to log the conversation history.

3. Text-to-Speech conversion

In the Speech-to-Text transcription section, I explained how the user utilizes voice commands to simulate a conversational experience, as if speaking with a real person. To further enhance this sensation and create a more dynamic and interactive learning experience, the next step involves converting the text output from GPT into audible speech using a text-to-speech tool like gTTS. This not only helps create a more engaging and easy-to-follow experience but also addresses a critical aspect of language learning: the challenge of comprehension through listening rather than reading. By integrating this auditory component, we facilitate a more comprehensive practice that closely mirrors real-world language use.

There are various TTS tools available, such as Google’s Text-to-Speech (gTTS) and IBM Watson’s Text to Speech. In this project, I preferred gTTS since it is super easy to use and offers natural voice quality without costing a penny. To use the gTTS library, you will need an internet connection, as the library requires access to Google’s servers to convert text to speech.
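
As a minimal, self-contained example of the library (the sentence and file name are placeholders):

from gtts import gTTS

# Convert a German sentence to speech and save it as an audio file;
# this call reaches Google's servers, so an internet connection is required.
speech = gTTS(text="Hallo! Wie geht es dir?", lang="de", slow=False)
speech.save("hello.mp3")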

Detailed Explanation of the Pipeline

Before we dive into the pipeline, you might want to take a look at the entire code on my GitHub page, as I will be referring to some sections of it.

The figure below explains the workflow of the AI-powered virtual language tutor, which is designed to enable a real-time, voice-based conversational learning experience:

Chart of the pipeline — Image by the author
  • The user begins the conversation by initiating a recording of their speech, which is temporarily saved as a .wav file. This is accomplished by pressing and holding the spacebar; the recording stops when the spacebar is released. The sections of the Python code that enable this press-and-talk functionality are explained below.

The following global variables are used to manage the state of the recording process:

recording = False       # Indicates whether the system is currently recording audio
done_recording = False  # Indicates that the user has completed recording a voice command
stop_recording = False  # Indicates that the user wants to exit the conversation

The listen_for_keys function checks for key presses and releases. It sets the global variables based on the state of the spacebar and the Esc key.

def listen_for_keys():
    # Function to listen for key presses to control recording
    global recording, done_recording, stop_recording
    while True:
        if keyboard.is_pressed('space'):  # Start recording on spacebar press
            stop_recording = False
            recording = True
            done_recording = False
        elif keyboard.is_pressed('esc'):  # Stop recording on 'esc' press
            stop_recording = True
            break
        elif recording:  # Stop recording on spacebar release
            recording = False
            done_recording = True
            break
        time.sleep(0.01)

The callback function is used to handle the audio data when recording. It checks the recording flag to determine whether to record the incoming audio data.

def callback(indata, frames, time, status):
    # Function called for each audio block during recording.
    if recording:
        if status:
            print(status, file=sys.stderr)
        q.put(indata.copy())

The press2record function is the main function which is responsible for handling voice recording when the user presses and holds the spacebar.

It initialises the global variables that manage the recording state, determines the sample rate, and creates a temporary file to store the recorded audio.

The function then opens a SoundFile object to write the audio data and an InputStream object to capture the audio from the microphone, using the previously mentioned callback function. A thread is started to listen for key presses, specifically the spacebar for recording and the 'esc' key to stop. Inside a loop, the function checks the recording flag and writes the audio data to the file if recording is active. If the recording is stopped, the function returns -1; otherwise, it returns the filename of the recorded audio.

def press2record(filename, subtype, channels, samplerate):
    # Function to handle recording when a key is pressed
    global recording, done_recording, stop_recording
    stop_recording = False
    recording = False
    done_recording = False
    try:
        # Determine the samplerate if not provided
        if samplerate is None:
            device_info = sd.query_devices(None, 'input')
            samplerate = int(device_info['default_samplerate'])
            print(int(device_info['default_samplerate']))
        # Create a temporary filename if not provided
        if filename is None:
            filename = tempfile.mktemp(prefix='captured_audio',
                                       suffix='.wav', dir='')
        # Open the sound file for writing
        with sf.SoundFile(filename, mode='x', samplerate=samplerate,
                          channels=channels, subtype=subtype) as file:
            with sd.InputStream(samplerate=samplerate, device=None,
                                channels=channels, callback=callback, blocksize=4096) as stream:
                print('press Spacebar to start recording, release to stop, or press Esc to exit')
                listener_thread = threading.Thread(target=listen_for_keys)  # Start the listener on a separate thread
                listener_thread.start()
                # Write the recorded audio to the file
                while not done_recording and not stop_recording:
                    while recording and not q.empty():
                        file.write(q.get())
                # Return -1 if recording is stopped
                if stop_recording:
                    return -1

    except KeyboardInterrupt:
        print('Interrupted by user')

    return filename

Finally, the get_voice_command function calls press2record to record the user’s voice command.

def get_voice_command():
    # ...
    saved_file = press2record(filename="input_to_gpt.wav", subtype=args.subtype,
                              channels=args.channels, samplerate=args.samplerate)
    # ...
  • Having captured and saved the voice command in a temporary .wav file, we now enter the transcription phase. In this stage, the recorded audio is converted into text using Whisper. The corresponding script for simply running the transcription task on a .wav file is given below:
def get_voice_command():
    # ...
    result = audio_model.transcribe(saved_file, fp16=torch.cuda.is_available())
    # ...

This method takes two parameters: the path to the recorded audio file, saved_file, and an optional flag to use FP16 precision if CUDA is available, which enhances performance on compatible hardware. Here, audio_model is the Whisper model loaded earlier in the script; transcribe returns a dictionary whose "text" field contains the transcribed text.

  • Then, the transcribed text is sent to ChatGPT to generate an appropriate response in the interact_with_tutor() function. The corresponding code segment is as follows:
def interact_with_tutor():
    # Define the system role to set the behaviour of the chat assistant.
    # Translation: "You are Anna, my German conversation partner. You will chat
    # with me. Your answers will be short. My level is B1; adjust your sentence
    # complexity to my level. Always try to get me talking by asking questions,
    # and always deepen the chat."
    messages = [
        {"role": "system", "content": "Du bist Anna, meine deutsche Lernpartnerin. "
                                      "Du wirst mit mir chatten. Ihre Antworten werden kurz sein. "
                                      "Mein Niveau ist B1, stell deine Satzkomplexität auf mein Niveau ein. "
                                      "Versuche immer, mich zum Reden zu bringen, indem du Fragen stellst, und vertiefe den Chat immer."}
    ]
    while True:
        # Get the user's voice command
        command = get_voice_command()
        if command == -1:
            # Save the chat logs and exit if recording is stopped
            save_response_to_pkl(messages)
            return "Chat has been stopped."

        # Add the user's command to the message history
        messages.append({"role": "user", "content": command})

        # Generate a response from the chat assistant
        completion = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=messages
        )

        chat_response = completion.choices[0].message.content  # Extract the response from the completion
        print(f'ChatGPT: {chat_response} \n')  # Print the assistant's response
        messages.append({"role": "assistant", "content": chat_response})  # Add the response to the message history
        # ...

The function interact_with_tutor starts by defining the system role of ChatGPT to shape its behaviour throughout the conversation. Since my goal is to practice German, I set the system role accordingly. I named my virtual tutor “Anna” and set my language proficiency level so that she adjusts her responses. Additionally, I instructed her to keep the conversation engaging by asking questions.

Next, the user’s transcribed voice command is appended to the message list with the role of “user.” This message is then sent to ChatGPT. As the conversation continues within a while loop, the entire history of user commands and GPT responses is logged in the messages list.

  • After each response from ChatGPT, we convert the text message into speech using gTTS.
def interact_with_tutor():
    # ...
    # Convert the text response to speech
    speech_object = gTTS(text=messages[-1]['content'], tld="de", lang=language, slow=False)
    speech_object.save("GPT_response.wav")
    current_dir = os.getcwd()
    audio_file = "GPT_response.wav"
    # Play the audio response
    play_wav_once(audio_file, args.samplerate, 1.0)
    os.remove(audio_file)  # Remove the temporary audio file

The gTTS() function takes four parameters: text, tld, lang, and slow.
  • text: the content of the last message in the messages list (indicated by [-1]) that you want to convert into speech.
  • tld: the top-level domain for the Google Translate service. Setting it to "de" means the German domain is used, which can be significant for ensuring that pronunciation and intonation are appropriate for German.
  • lang: the language in which the text should be spoken. In this code, the language variable is set to 'de', so the text is spoken in German.
  • slow: controls the speed of the speech. Setting it to False means the speech is spoken at normal speed; if it were set to True, the speech would be spoken more slowly.

  • The converted speech of the ChatGPT response is then saved as a temporary .wav file, played back to the user, and removed; minimal sketches of the helper functions used here are given after this list.
  • The interact_with_tutor function keeps running as long as the user continues the conversation by pressing the spacebar again.
  • If the user presses “Esc”, the conversation ends and the entire chat is saved to a pickle file, chat_log.pkl, which you can use later for analysis.
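
The helpers play_wav_once and save_response_to_pkl are defined in the full script on my GitHub page; minimal sketches of what they might look like are below (only the names and call signatures come from the snippets above, the bodies are assumptions):

import pickle

import sounddevice as sd
import soundfile as sf

def play_wav_once(filename, samplerate, volume):
    # Read the audio file and play it once on the default output device.
    # The samplerate argument mirrors the call in interact_with_tutor;
    # here the file's own sample rate is used for playback.
    data, file_samplerate = sf.read(filename)
    sd.play(data * volume, file_samplerate)
    sd.wait()  # Block until playback has finished

def save_response_to_pkl(messages):
    # Persist the entire chat history for later analysis
    with open("chat_log.pkl", "wb") as f:
        pickle.dump(messages, f)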

Command line usage

To run the script, simply execute the Python code in the terminal as follows:

sudo python chat.py

sudo is needed because the script requires access to the microphone and utilizes the keyboard library. If you use Anaconda, you can also start the Anaconda terminal with “Run as administrator” to grant full access.

Here is a demo video that shows how the code runs on my laptop, so you can get a feel for the performance:

Demo video created by the author

Final Remarks

I set the language of the tutor to German simply by setting ChatGPT’s system role and adjusting the parameters within the gTTS function to align with the German language. However, you could easily switch it to another language; it would only take seconds to configure it for your target language.

If you’d like to chat about a specific topic, you can also add it to the system role of ChatGPT. For example, practicing for interviews might be a nice use case. You can also specify your language level to adjust its responses.

One important remark is that the overall speed of the chat depends on your internet connection (due to the ChatGPT API and gTTS) as well as your hardware (due to the local deployment of Whisper). In my case, the overall response time after my input ranges between 4 and 10 seconds.
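
If you want to see where the time goes in your own setup, a simple timer around each stage shows whether the network calls or the local Whisper inference dominate; a minimal sketch (the wrapped calls in the comments are the ones from the snippets above):

import time
from contextlib import contextmanager

@contextmanager
def timed(stage):
    # Print how long the wrapped pipeline stage takes
    start = time.perf_counter()
    yield
    print(f"{stage}: {time.perf_counter() - start:.2f} s")

# Example usage inside the pipeline:
# with timed("Whisper transcription"):
#     result = audio_model.transcribe(saved_file)
# with timed("ChatGPT response"):
#     completion = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
# with timed("gTTS synthesis"):
#     gTTS(text=chat_response, tld="de", lang="de").save("GPT_response.wav")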
