Building an All-In-One Audio Analysis Toolkit in Python

Analyze your audio file in a single place

Avi Chawla
Towards Data Science


Photo by Kelly Sikkema on Unsplash

Language forms the basis of every conversation between humans. As a result, the field of Natural Language Processing (or NLP for short) holds immense potential for assisting humans in their day-to-day lives.

In simple words, the domain of NLP comprises a set of techniques that aim to comprehend human language data and accomplish a downstream task.

NLP techniques encompass numerous areas such as Question Answering (QA), Named Entity Recognition (NER), Text Summarization, Natural Language Generation (NLG), and many more.

A few sub-domains of Natural Language Processing (Image by Author)

While most prior research and development in NLP has focused on applying various techniques to textual data, the community has recently witnessed tremendous adoption of speech-based interaction, prompting machine learning engineers to experiment and innovate in the speech space as well.

Therefore, in this blog, I will demonstrate an all-encompassing audio analysis application in Streamlit that takes an audio file as input and:

1. Transcribes the audio
2. Performs sentiment analysis on the audio
3. Summarizes the audio
4. Identifies named entities mentioned in the audio
5. Extracts broad ideas from the audio

To achieve this, we will use the AssemblyAI API to transcribe the audio file and Streamlit to build the web application in Python.

The image below depicts what this application will look like once it is ready.

Overview of Audio Analysis Toolkit (Image by Author)

Let’s begin 🚀!

App Workflow

Before building the application, it is worth outlining its workflow and how it will function.

A high-level overview of the application is depicted in the diagram below:

Transcription Service Workflow of AssemblyAI (Image by Author)

The Streamlit web application will first take an audio file as input, as described above.

Next, we will upload it to AssemblyAI’s server to obtain a URL for the audio file. Once the URL is available, we shall send a POST request to the transcription endpoint of AssemblyAI and specify the downstream tasks we wish to perform on the input audio.

Lastly, we will send a GET request to retrieve the transcription results from AssemblyAI and display them in our Streamlit application.

Project Requirements

This section will highlight some prerequisites/dependencies for building the audio toolkit.

#1 Install Streamlit

Building web applications in Streamlit requires installing the Streamlit Python package locally.
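Assuming you have pip available, you can install it from the terminal:

pip install streamlit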

#2 Get the AssemblyAI API Access Token

To access the transcription services of AssemblyAI, you should obtain an API access token from their website. For this project, let’s define it as auth_key.
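For illustration, a placeholder assignment might look like the line below. The token string is, of course, hypothetical; in practice, prefer loading it from an environment variable or Streamlit secrets rather than hardcoding it.

auth_key = "<your-assemblyai-api-token>"  # replace with your own token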

#3 Import Dependencies

Lastly, we will import the Python libraries required for this project.
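A minimal set of imports, assuming we use the requests library for the HTTP calls and Python’s built-in time module for polling, is:

import time

import requests
import streamlit as st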

With this, we are ready to build our audio analysis web application.

Building the Streamlit Application

Next, let’s proceed with building the web application in Streamlit.

Our application, as discussed above, will comprise four steps. These are:

1. Uploading the file to AssemblyAI
2. Sending the Audio for transcription through a POST request
3. Retrieving the transcription results with a GET request
4. Displaying the results in the web application

To achieve this, we shall define four different methods, each dedicated to one of the four objectives above.

However, before we proceed, we should declare the headers for our request and define the transcription endpoints of AssemblyAI.
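A sketch of these declarations, using AssemblyAI’s standard v2 endpoints, is shown below:

# AssemblyAI v2 endpoints
upload_endpoint = "https://api.assemblyai.com/v2/upload"
transcription_endpoint = "https://api.assemblyai.com/v2/transcript"

# Authorization header attached to every request
headers = {"authorization": auth_key}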

  • Method 1: upload_audio(audio_file)

The objective of this method is to accept the audio file obtained from the user and upload it to AssemblyAI to obtain a URL for the file.

Note that it is not necessary to upload the audio file to AssemblyAI as long as you can access it via a URL. Therefore, if the audio file is already accessible with a URL, you can skip implementing this method.

The implementation of the upload_audio() method is shown below:
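(The snippet below is a sketch reconstructing this step; the chunked read_file helper follows AssemblyAI’s documented upload pattern.)

def upload_audio(audio_file):
    """Upload the audio file to AssemblyAI and return its URL."""

    def read_file(file, chunk_size=5242880):
        # Stream the file in ~5 MB chunks so that large files
        # are not loaded into memory all at once
        while True:
            data = file.read(chunk_size)
            if not data:
                break
            yield data

    upload_response = requests.post(
        upload_endpoint, headers=headers, data=read_file(audio_file)
    )
    return upload_response.json()["upload_url"]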

The function accepts audio_file as an argument and sends a POST request to the upload_endpoint of AssemblyAI. We then fetch the upload_url from the JSON response returned by AssemblyAI.

  • Method 2: transcribe(upload_url)

As the name suggests, this method will accept the URL of the audio file obtained from upload_audio() method above and send it for transcription to AssemblyAI.
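A sketch of this function, with the service flags named as in AssemblyAI’s API, is shown below:

def transcribe(upload_url):
    """Request a transcription with all downstream services enabled."""
    transcription_request = {
        "audio_url": upload_url,
        "sentiment_analysis": True,   # sentence-level sentiment
        "iab_categories": True,       # topic detection
        "auto_chapters": True,        # chapter-wise summarization
        "entity_detection": True,     # named entity recognition
        "speaker_labels": True,       # identify the speakers
    }
    transcription_response = requests.post(
        transcription_endpoint, json=transcription_request, headers=headers
    )
    return transcription_response.json()["id"]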

In the JSON object above, we specify the URL of the audio and the downstream services we wish to invoke at AssemblyAI’s transcription endpoint.

For this project, these services include sentiment analysis, topic detection, summarization, entity recognition, and identifying all the speakers in the file.

After sending a POST request to the transcription_endpoint, we return the transcription id provided by AssemblyAI, which we can later use to fetch the transcription results.

  • Method 3: get_transcription_result(transcription_id)

The penultimate step is to retrieve the transcription results from AssemblyAI. To achieve this, we send a GET request, this time providing the unique identifier (transcription_id) received from AssemblyAI in the previous step.

The implementation is demonstrated below:
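(A sketch, assuming a five-second polling interval:)

def get_transcription_result(transcription_id):
    """Poll AssemblyAI until the transcription completes or errors out."""
    polling_endpoint = f"{transcription_endpoint}/{transcription_id}"
    while True:
        response = requests.get(polling_endpoint, headers=headers).json()
        if response["status"] in ("completed", "error"):
            return response
        time.sleep(5)  # wait before polling again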

As the transcription time depends on the duration of the input audio file, we define a while loop that issues repeated GET requests until the status of our request changes to completed or the transcription request indicates an error.

The transcription response received for a particular audio file is shown below:
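(Truncated and illustrative rather than verbatim; only the keys used in this project are listed:)

{
    "id": "...",
    "status": "completed",
    "text": "...",
    "iab_categories_result": {...},
    "chapters": [...],
    "sentiment_analysis_results": [...],
    "entities": [...]
}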

  • Method 4: print_results(results)

The final method in this application prints the results obtained from AssemblyAI on the Streamlit application.

To avoid clutter and textual chaos on the application's front end, we shall encapsulate each of the services within a Streamlit expander, as sketched after the key list below.

The keys from the transcription response that are pertinent to this project are:

  • text: This contains the transcription text of the audio.
  • iab_categories_result: The value corresponding to this key is a list of topics identified in the audio file.
  • chapters: This key indicates the summary of the audio file as different chapters.
  • sentiment_analysis_results: As the name suggests, this key holds the sentence-wise sentiment results for the audio file.
  • entities: Lastly, this key stores the entities identified in the audio file.
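A sketch of this method, using one st.expander per service (the field names follow AssemblyAI’s response keys listed above), could be:

def print_results(results):
    """Render each service's output inside its own expander."""
    with st.expander("Transcription"):
        st.write(results["text"])

    with st.expander("Topics"):
        # The "summary" field maps each detected topic to a relevance score
        st.write(results["iab_categories_result"]["summary"])

    with st.expander("Summary"):
        for chapter in results["chapters"]:
            st.write(chapter["summary"])

    with st.expander("Sentiment Analysis"):
        for sentence in results["sentiment_analysis_results"]:
            st.write(f"{sentence['sentiment']}: {sentence['text']}")

    with st.expander("Entities"):
        for entity in results["entities"]:
            st.write(f"{entity['entity_type']}: {entity['text']}")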

Integrating the Functions in Main Method

As the final step in building our Streamlit application, we integrate the functions defined above in the main() method.

First, we create a file uploader for the user to upload the audio file.

Once the audio file is available, we pass it to Method 1 (upload_audio), then transcribe the audio (transcribe), retrieve the results (get_transcription_result), and finally display them (print_results) to the user on the Streamlit application.
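Putting the pieces together, a minimal sketch of the main() method might look like this (the title text and accepted file types are illustrative choices):

def main():
    st.title("Audio Analysis Toolkit")
    audio_file = st.file_uploader("Upload an audio file", type=["mp3", "mp4", "wav"])

    if audio_file is not None:
        upload_url = upload_audio(audio_file)
        transcription_id = transcribe(upload_url)
        results = get_transcription_result(transcription_id)
        print_results(results)

if __name__ == "__main__":
    main()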

Executing the Application

Our audio analysis application is ready, and now it’s time to run it!

To do so, open a new terminal session. Next, navigate to your working directory and execute the following command after replacing file-name.py with the name of your Python file:

streamlit run file-name.py
Uploading Audio to the App (Image by Author)

Demo Walkthrough

The uploader above asks you to upload an audio file. Once you do that, the functions defined above will be executed sequentially to generate the final results.

The transcription results on the uploaded file are shown below:

Walkthrough of the Audio Analysis App (Gif by Author)

Results

In this section, we will discuss the results obtained from the transcription models of AssemblyAI.

Audio Transcription

A part of the transcription of the input audio is shown in the image below.

Full Audio Transcription (Image by Author)

Topics

The broad topics discussed in the entire audio by the speaker(s) are shown in the image below.

Topics found in the Audio (Image by Author)

Summary

To generate a summary, AssemblyAI’s transcription service first breaks the audio into different chapters and then summarizes each chapter individually.

The summary of the input audio file is shown below.

Summary of the Audio (Image by Author)

Sentiment Analysis

AssemblyAI classifies each sentence into one of three sentiment categories: Positive, Negative, and Neutral.

The sentiment of the first three sentences in the audio is shown below. They were correctly classified as Neutral by the transcription module.

The sentiment of the sentences in the Audio (Image by Author)

Entity Detection

Finally, the entities identified in the audio and their corresponding entity tags are shown below.

Entities in the audio (Image by Author)

Conclusion

To conclude, in this post, we built a comprehensive application to analyze audio files using the AssemblyAI API and Streamlit.

Specifically, I demonstrated how to perform various downstream NLP tasks on the input audio, such as transcription, summarization, sentiment analysis, entity detection, and topic classification.

Thanks for reading!

🧑‍💻 Become a Data Science PRO! Get the FREE Data Science Mastery Toolkit with 450+ Pandas, NumPy, and SQL questions.

✉️ Sign up for my Email list to never miss another article on data science guides, tricks and tips, Machine Learning, SQL, Python, and more. Medium will deliver my next articles right to your inbox.


👉 Get a Free Data Science PDF (550+ pages) with 320+ tips by subscribing to my daily newsletter today: https://bit.ly/DailyDS.