
How to Perform Speech-to-Text and Topic Detection with Python

Performing text transcription and topic detection over audio files with Python and AssemblyAI

Photo by Volodymyr Hryshchenko on Unsplash

Introduction

In one of my recent articles, I discussed Speech Recognition and how to implement it in Python. In today’s article we will go a step further and explore how to perform topic detection over audio and video files.

As an example, let’s consider podcasts, which are becoming more and more popular over time. Imagine how many podcasts are created on a daily basis; wouldn’t it be useful for the recommendation engines of platforms such as Spotify, YouTube or Apple Podcasts to somehow categorise all these podcasts based on the content discussed?


Performing Speech-to-Text and Topic Detection with Python

In this tutorial, we will be using the AssemblyAI API in order to label the topics that are spoken about in audio and video files. Therefore, if you want to follow along, you first need to obtain an AssemblyAI access token (which is absolutely free) that we will be using when calling the API.

Now that we have an access token, let’s start by preparing the headers that we will be using when sending requests to the various AssemblyAI endpoints.
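A minimal sketch is shown below; the API_KEY placeholder is illustrative, so substitute your own token:

```python
# Placeholder: substitute your own AssemblyAI access token here
API_KEY = "<your AssemblyAI token>"

# Headers sent along with every request to the AssemblyAI endpoints
headers = {
    "authorization": API_KEY,
    "content-type": "application/json",
}
```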

Next, we need to upload our audio (or video) file to the hosting service of AssemblyAI. The endpoint will then return the URL of the uploaded file, which we will be using in subsequent requests.
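A sketch of the upload step could look as follows, assuming a local file named podcast.mp3 (the file name and helper function are illustrative; the endpoint URL and the upload_url response field follow AssemblyAI’s public API):

```python
import requests

UPLOAD_ENDPOINT = "https://api.assemblyai.com/v2/upload"

def read_audio_file(file_path, chunk_size=5_242_880):
    """Stream the local file in chunks so large files are not loaded into memory at once."""
    with open(file_path, "rb") as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            yield data

# Upload the file; the response body contains the URL of the hosted file
upload_response = requests.post(
    UPLOAD_ENDPOINT,
    headers={"authorization": API_KEY},
    data=read_audio_file("podcast.mp3"),  # illustrative file name
)
audio_url = upload_response.json()["upload_url"]
```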

The next step is the most interesting part, where we perform Speech-to-Text over the uploaded audio file. All we need to pass in the POST request is the audio_url that we received from the previous step, along with the iab_categories parameter set to True. The latter is going to trigger topic detection over the text transcription. An example response from the TRANSCRIPT_ENDPOINT is shown in the “Interpreting the response” section further below.
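A sketch of the request, building on the snippets above:

```python
TRANSCRIPT_ENDPOINT = "https://api.assemblyai.com/v2/transcript"

# Submit the hosted file for transcription, with Topic Detection enabled
transcript_request = {
    "audio_url": audio_url,   # the URL returned by the upload step
    "iab_categories": True,   # triggers topic detection over the transcription
}
transcript_response = requests.post(
    TRANSCRIPT_ENDPOINT, json=transcript_request, headers=headers
)
# The returned id is used to fetch the result in the next step
transcript_id = transcript_response.json()["id"]
```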

In order to get the transcription result (along with the topic detection results), we need to make one more request. This is because transcription is asynchronous: when a file is submitted for transcription, it takes some time before the result becomes available (typically around 15–30% of the overall duration of the audio file).

Therefore, we need to make repeated GET requests until we receive a success (or failure) response, as illustrated below.
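A simple polling loop could look like the following (the five-second interval is an arbitrary choice):

```python
import time

polling_endpoint = f"https://api.assemblyai.com/v2/transcript/{transcript_id}"

# Keep polling until the transcription either completes or fails
while True:
    transcript = requests.get(polling_endpoint, headers=headers).json()
    if transcript["status"] in ("completed", "error"):
        break
    time.sleep(5)  # wait a few seconds before asking again
```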

Finally, let’s write the received result into a text file so that it will be easier for us to inspect the output and interpret the response received from the transcription endpoint:
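For example (the response.json file name is an arbitrary choice):

```python
import json

# Persist the full response to disk so it can be inspected at leisure
with open("response.json", "w") as f:
    json.dump(transcript, f, indent=2)
```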


Interpreting the response

An example response from the transcription endpoint is shown below:
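The response below is abridged to the fields discussed next, and its values are illustrative rather than real API output:

```json
{
  "id": "<transcript id>",
  "status": "completed",
  "text": "Smoke from hundreds of wildfires in Canada is triggering air quality alerts...",
  "categories_iab_result": {
    "status": "success",
    "results": [
      {
        "text": "Smoke from hundreds of wildfires in Canada is triggering air quality alerts...",
        "labels": [
          {"relevance": 0.988, "label": "Science>Environment"}
        ],
        "timestamp": {"start": 250, "end": 28920}
      }
    ],
    "summary": {
      "Science>Environment": 0.911
    }
  }
}
```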

The outer text key contains the result of the text transcription over the input audio file. Let’s focus, however, on the content of categories_iab_result, which contains the Topic Detection results.

  • status: Contains the status of the topic detection. Normally, this will be success. If for any reason the Topic Detection model has failed, the value will be unavailable.
  • results: This key will include a list of topics that were detected over the input audio file, including the precise text that influenced each prediction. Additionally, it includes some metadata about relevance and timestamps. We will discuss both below.
  • results.text: This key includes the precise transcription text for the portion of audio that has been classified with a particular topic label.
  • results.timestamp: This key indicates the starting and ending time (recorded in milliseconds) for where the results.text was spoken in the input audio file.
  • results.labels: This is a list containing all the labels that were predicted by the Topic Detection model for the portion of text in results.text. The relevance key corresponds to a score that can take any value between 0 and 1.0 and indicates how relevant each predicted label is in relation to results.text.
  • summary: For every unique label detected by the Topic Detection model in the results array, the summary key will include the relevancy for that label across the entire length of the input audio file. For example, if the Science>Environment label is detected only once in a 60-minute long audio file, the summary key will include a relatively low relevancy score for that label, since the entire transcription was not found to be consistently relevant to that topic label.

In order to see the full list of topic labels that the Topic Detection model is capable of predicting, make sure to check the relevant section in the official documentation.


Full Code

The full code used as part of this tutorial is shown below:
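The following is a consolidated sketch of the snippets shown throughout the tutorial; the endpoint URLs follow AssemblyAI’s public API, while the token, file names and polling interval are illustrative placeholders:

```python
import json
import time

import requests

# Placeholders: substitute your own token and audio file
API_KEY = "<your AssemblyAI token>"
AUDIO_FILE = "podcast.mp3"

UPLOAD_ENDPOINT = "https://api.assemblyai.com/v2/upload"
TRANSCRIPT_ENDPOINT = "https://api.assemblyai.com/v2/transcript"

headers = {
    "authorization": API_KEY,
    "content-type": "application/json",
}


def read_audio_file(file_path, chunk_size=5_242_880):
    """Stream the local file in chunks so large files are not loaded into memory at once."""
    with open(file_path, "rb") as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            yield data


# 1. Upload the audio (or video) file to AssemblyAI's hosting service
upload_response = requests.post(
    UPLOAD_ENDPOINT,
    headers={"authorization": API_KEY},
    data=read_audio_file(AUDIO_FILE),
)
audio_url = upload_response.json()["upload_url"]

# 2. Submit the hosted file for transcription, with Topic Detection enabled
transcript_response = requests.post(
    TRANSCRIPT_ENDPOINT,
    json={"audio_url": audio_url, "iab_categories": True},
    headers=headers,
)
transcript_id = transcript_response.json()["id"]

# 3. Poll until the transcription either completes or fails
polling_endpoint = f"{TRANSCRIPT_ENDPOINT}/{transcript_id}"
while True:
    transcript = requests.get(polling_endpoint, headers=headers).json()
    if transcript["status"] in ("completed", "error"):
        break
    time.sleep(5)

# 4. Write the full response to disk for inspection
with open("response.json", "w") as f:
    json.dump(transcript, f, indent=2)
```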


Final Thoughts

In today’s article we explored how to perform Speech-to-Text and Topic Detection over the generated text transcription using Python and the AssemblyAI API. We went through a step-by-step guide and explained in detail how to use the various API endpoints in order to perform topic detection over audio and video files.



