In this tutorial we will create a personal, local LLM assistant that you can talk to. You will be able to record your voice with your microphone and send it to the LLM, and the LLM will return its answer as text AND speech.
You can find the code in this GitHub repo: https://github.com/amirarsalan90/personal_llm_assistant
The main components of the app include:
- Local LLM (hosted by llama-cpp-python)
- Speech-to-text (Whisper)
- Text-to-speech (Bark)
llama-cpp-python
llama-cpp-python is a Python binding for the great llama.cpp, which implements many large language models in C/C++. Because of its wide adoption by the open-source community, I decided to use it in this tutorial.
Note: I have tested this app on a system with an NVIDIA RTX 4090 GPU. The NVIDIA CUDA toolkit needs to be installed on your system and on your PATH before installing llama-cpp-python.
First things first, let's create a new conda environment:
conda create --name assistant python=3.10
conda activate assistant
Next we need to install llama-cpp-python. As mentioned in the llama-cpp-python documentation, llama.cpp supports a number of hardware acceleration backends to speed up inference. To run the LLM on the GPU, we will build the package with cuBLAS support. I had some issues getting the model to offload to the GPU, and this post finally showed me how to install it properly:
export CMAKE_ARGS="-DLLAMA_CUBLAS=on"
export FORCE_CMAKE=1
pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
pip install llama-cpp-python[server]
We also need to install a couple of other packages for this app:
pip install gradio
pip install openai
pip install "huggingface_hub[cli]"
pip install torch
pip install transformers
pip install nltk
pip install optimum
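Before moving on, it's worth a quick sanity check (a small snippet of my own, not part of the app) that PyTorch can actually see your GPU, since Whisper and Bark will run on it:
import torch

# Verify that PyTorch was installed with CUDA support and can see the GPU
print(torch.cuda.is_available())      # should print True
print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA GeForce RTX 4090"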
The next step is to download the model weights we want to serve. For this tutorial, I decided to go with Mistral-7B-Instruct-v0.2 (in fact, Mistral 7B is my go-to model among all the 7B models). llama.cpp expects model weights in the GGUF format. In case you are not aware, TheBloke is the go-to Hugging Face account for quantized and GGUF-converted models.
We can download the model weights with this command:
mkdir models
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF mistral-7b-instruct-v0.2.Q4_K_M.gguf --local-dir ./models/ --local-dir-use-symlinks False
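If you prefer to stay in Python, the same file can be fetched with the huggingface_hub library; this is just an equivalent alternative to the CLI command above:
from huggingface_hub import hf_hub_download

# Download the same quantized GGUF file into ./models/
hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    filename="mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    local_dir="./models/",
)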
After downloading the model weights, we are ready to test our LLM. For this purpose, we can run an LLM server (we previously installed llama-cpp-python[server]). Open a terminal, activate your assistant conda environment, and run the server:
conda activate assistant
python3 -m llama_cpp.server --model ./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf --n_gpu_layers -1 --chat_format chatml
In this command, n_gpu_layers controls how many layers of the model are offloaded to the GPU. Because I have a 4090 GPU with 24 GB of VRAM, which is more than enough to hold this quantized model, I set it to -1, meaning all layers are offloaded to the GPU. The second parameter, chat_format, specifies which chat template to use for our model; since we are using a Mistral model, we choose the chatml template. You can read more about chat templates here.
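To make chat_format a bit more concrete, here is roughly what a chatml-formatted prompt looks like once the server wraps your messages (shown only for illustration; the server builds this string for you):
# Illustrative only: approximately how the chatml template lays out a conversation
chatml_prompt = (
    "<|im_start|>system\n"
    "You are a helpful AI.<|im_end|>\n"
    "<|im_start|>user\n"
    "In what city were the 2000 Olympics held?<|im_end|>\n"
    "<|im_start|>assistant\n"
)
print(chatml_prompt)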
Now it is time to use python and send a request to your model. One more thing to mention is that llama-cpp-python implements an OpenAI like API. This means you can send your requests to your local LLM similar to the way you send requests to OpenAI API to use GPT models like GPT-3.5 or GPT-4.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-xxx")
response = client.chat.completions.create(
    model="mistral",
    messages=[
        {"role": "system", "content": "You are a helpful AI."},
        {"role": "user", "content": "In what city were the 2000 Olympics held?"},
    ],
)
print(response)
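The print above dumps the entire response object; the generated text itself is nested inside it:
# Extract just the assistant's reply text
print(response.choices[0].message.content)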
Speech to Text (Whisper)
For the speech-to-text component, we will use the famous Whisper model, an open-source transformer-based speech-to-text model. Whisper takes an audio file as input and returns a transcript of the words spoken. I use the Hugging Face Transformers implementation, so it is very straightforward. Using a Hugging Face pipeline, inference with Whisper is as simple as:
import torch
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v2",
    torch_dtype=torch.float16,
    device="cuda:0",
)

transcription = pipe(audio_file_path)["text"]
where transcription includes the transcript of the input audio file.
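As a side note, for recordings longer than about 30 seconds you can let the pipeline chunk the audio itself; if I recall the Transformers API correctly, that looks like this:
# Optional: chunk long recordings and transcribe several windows in parallel on the GPU
transcription = pipe(
    audio_file_path,
    chunk_length_s=30,
    batch_size=8,
)["text"]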
Text to Speech (Bark)
To convert text to speech, I use the Bark model, a transformer-based text-to-speech model. It can generate realistic, multilingual speech as well as other audio, including music, background noise, and simple sound effects. The model can also produce nonverbal communication like laughing, sighing, and crying.
Again, we will use the Hugging Face implementation of Bark. Using it with Hugging Face Transformers is as simple as:
from transformers import AutoProcessor, AutoModel

processor = AutoProcessor.from_pretrained("suno/bark")
model = AutoModel.from_pretrained("suno/bark")

inputs = processor(
    text=["Hello, my name is Suno. And, uh - and I like pizza. [laughs] But I also have other interests such as playing tic tac toe."],
    return_tensors="pt",
)

speech_values = model.generate(**inputs, do_sample=True)

from IPython.display import Audio

sampling_rate = model.generation_config.sample_rate
Audio(speech_values.cpu().numpy().squeeze(), rate=sampling_rate)
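Outside a notebook you won't have the IPython audio player, so you may prefer to write the generated waveform to a WAV file instead, for example with scipy:
import scipy.io.wavfile

# Save the generated speech to disk; Bark outputs float32 samples
scipy.io.wavfile.write(
    "bark_sample.wav",
    rate=model.generation_config.sample_rate,
    data=speech_values.cpu().numpy().squeeze(),
)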
Web UI (Gradio)
To create the simple UI, we use Gradio. Gradio is a great library that lets you build a UI for your data science projects easily and within minutes.
Here is the complete code for the Gradio app:
import gradio as gr
from transformers import pipeline
from transformers import AutoProcessor, BarkModel
import torch
from openai import OpenAI
import numpy as np
from IPython.display import Audio, display
import re
from nltk.tokenize import sent_tokenize

WORDS_PER_CHUNK = 25

def split_sentence_into_chunks(sentence, n):
    words = sentence.split()
    if len(words) <= n:
        return [sentence]
    else:
        chunks = [' '.join(words[i:i+n]) for i in range(0, len(words), n)]
        return chunks

# Setup Whisper client
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v2",
    torch_dtype=torch.float16,
    device="cuda:0"
)

voice_processor = AutoProcessor.from_pretrained("suno/bark")
voice_model = BarkModel.from_pretrained("suno/bark", torch_dtype=torch.float16).to("cuda:0")
voice_model = voice_model.to_bettertransformer()
voice_preset = "v2/en_speaker_9"

system_prompt = "You are a helpful AI"

client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-xxx")  # Placeholder, replace
sample_rate = 48000
def transcribe_and_query_llm_voice(audio_file_path):
    transcription = pipe(audio_file_path)['text']
    response = client.chat.completions.create(
        model="mistral",
        messages=[
            {"role": "system", "content": system_prompt},  # Update this as per your needs
            {"role": "user", "content": transcription + "\n Answer briefly."}
        ],
    )
    llm_response = response.choices[0].message.content

    sampling_rate = voice_model.generation_config.sample_rate
    silence = np.zeros(int(0.25 * sampling_rate))
    BATCH_SIZE = 12
    model_input = sent_tokenize(llm_response)

    pieces = []
    for i in range(0, len(model_input), BATCH_SIZE):
        # i already advances in steps of BATCH_SIZE, so slice directly
        inputs = model_input[i:min(i + BATCH_SIZE, len(model_input))]
        if len(inputs) != 0:
            inputs = voice_processor(inputs, voice_preset=voice_preset)
            speech_output, output_lengths = voice_model.generate(**inputs.to("cuda:0"), return_output_lengths=True, min_eos_p=0.2)
            speech_output = [output[:length].cpu().numpy() for (output, length) in zip(speech_output, output_lengths)]
            pieces += [*speech_output, silence.copy()]

    whole_output = np.concatenate(pieces)
    audio_output = (sampling_rate, whole_output)
    return llm_response, audio_output
def transcribe_and_query_llm_text(text_input):
    transcription = text_input
    response = client.chat.completions.create(
        model="mistral",
        messages=[
            {"role": "system", "content": system_prompt},  # Update this as per your needs
            {"role": "user", "content": transcription + "\n Answer briefly."}
        ],
    )
    llm_response = response.choices[0].message.content

    sampling_rate = voice_model.generation_config.sample_rate
    silence = np.zeros(int(0.25 * sampling_rate))
    BATCH_SIZE = 12
    model_input = sent_tokenize(llm_response)

    pieces = []
    for i in range(0, len(model_input), BATCH_SIZE):
        # i already advances in steps of BATCH_SIZE, so slice directly
        inputs = model_input[i:min(i + BATCH_SIZE, len(model_input))]
        if len(inputs) != 0:
            inputs = voice_processor(inputs, voice_preset=voice_preset)
            speech_output, output_lengths = voice_model.generate(**inputs.to("cuda:0"), return_output_lengths=True, min_eos_p=0.2)
            speech_output = [output[:length].cpu().numpy() for (output, length) in zip(speech_output, output_lengths)]
            pieces += [*speech_output, silence.copy()]

    whole_output = np.concatenate(pieces)
    audio_output = (sampling_rate, whole_output)
    return llm_response, audio_output
with gr.Blocks() as demo:
    with gr.Row():
        with gr.Column():
            text_input = gr.Textbox(label="Type your request", placeholder="Type here or use the microphone...")
            audio_input = gr.Audio(sources=["microphone"], type="filepath", label="Or record your speech")
        with gr.Column():
            output_text = gr.Textbox(label="LLM Response")
            output_audio = gr.Audio(label="LLM Response as Speech", type="numpy")
    submit_btn_text = gr.Button("Submit Text")
    submit_btn_voice = gr.Button("Submit Voice")

    submit_btn_voice.click(fn=transcribe_and_query_llm_voice, inputs=[audio_input], outputs=[output_text, output_audio])
    submit_btn_text.click(fn=transcribe_and_query_llm_text, inputs=[text_input], outputs=[output_text, output_audio])

demo.launch(ssl_verify=False,
            share=False,
            debug=False)
A few notes on the python code above:
- Bark has several voices you can choose among. We are using "v2/en_speaker_9". You can find the full list of options here: https://huggingface.co/suno/bark/tree/main/speaker_embeddings/v2
- We assign the transcribe_and_query_llm_voice function to submit_btn_voice to run the model on the user's voice input.
- We assign the transcribe_and_query_llm_text function to submit_btn_text to run the model on the user's text input.
- Inside both functions, we split the LLM response into sentence chunks, run the Bark model on each chunk, and then concatenate the pieces into a single audio output, since Bark does not work very well on very long text inputs.
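One practical note: sent_tokenize depends on NLTK's punkt tokenizer data, which is downloaded separately from the nltk package itself. If you hit a LookupError the first time you run the app, download it once:
import nltk

# sent_tokenize needs the punkt sentence tokenizer data; a one-time download
nltk.download("punkt")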
You must have a microphone on your PC to be able to submit voice to the model. Otherwise, you can still type your text in the textbox and get back the text and audio output from the model.
Finally, to run the UI, after you have started your LLM server, run this in a new terminal:
python gradio_tts.py
You will see a link for the Gradio app in the terminal, most likely http://127.0.0.1:7860, and your app will be ready to use!
How to access the app from your phone on your home wifi
If you want to access this app through your phone, there’s an additional step involved. Assuming you’re running the app on your local PC, which is connected to your home WiFi, and you wish to access this app on your phone while it’s hosted by your PC, follow these instructions:
To obtain the local IP address of your PC on Ubuntu, you can use the ip addr command in the terminal:
ip addr | grep 'inet ' | grep -v ' lo' | awk '{print $2}' | cut -d'/' -f1
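If you'd rather not parse the ip addr output (or you're not on Ubuntu), a small generic Python snippet reports the same local IP:
import socket

# Open a UDP socket toward a public address just to learn which local IP would be used;
# no packets are actually sent.
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.connect(("8.8.8.8", 80))
print(s.getsockname()[0])  # e.g. 192.168.0.231
s.close()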
For example, in my case, the local IP address is 192.168.0.231. With the Gradio app running, I can access it by navigating to the following URL in the Chrome browser on my iPhone:
http://192.168.0.231:7860
Here, 192.168.0.231 is the local IP address of my PC and 7860 is the port. However, there's a caveat: accessing the app over HTTP means Chrome won't allow the use of your phone's microphone due to security restrictions. To use the microphone, you must access the app over HTTPS.
To set up HTTPS for your Gradio app and access it securely over your home WiFi, especially from your iPhone, you’ll need to follow a series of steps. This involves generating a self-signed SSL certificate, configuring your Gradio app to use this certificate, and then accessing the app from your iPhone.
First, ensure OpenSSL is installed on your PC, as it's needed to generate the SSL certificate. OpenSSL is usually pre-installed on Linux and macOS. Open a terminal and run the following command, which generates a new self-signed certificate (cert.pem) and private key (key.pem):
openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -sha256 -days 365 -nodes
When you run this command, OpenSSL will ask you several questions (like your country, state, and organization). You can fill these out, but since this is for personal use within your home network, the specifics aren’t crucial.
After creating cert.pem and key.pem, you can point the demo.launch call at the end of your Gradio app file to them. So you should replace demo.launch with this:
demo.launch(ssl_verify=False,
share=False,
debug=False,
server_name="0.0.0.0",
ssl_certfile="cert.pem",
ssl_keyfile="key.pem")
After the app is up and running, open Chrome on your cell phone and navigate to this address:
https://192.168.0.231:7860
Then you will be able to use your phone's microphone with the app.
Don’t forget to follow for more AI content. More hands-on articles are on the way!