“Chain-linking” NLP tasks With Wav2Vec2 & Transformers

Going straight from audio to a range of text-based NLP tasks such as translation, summarisation and sentiment analysis

Chua Chin Hon
Towards Data Science


Chart: Chua Chin Hon

The addition of the Wav2Vec2 model to Hugging Face’s transformers library has been one of the more exciting developments in NLP in recent months. Before its arrival, it wasn’t easy to run tasks like machine translation or sentiment analysis if all you had to work with was a long audio clip.

But now you can link up an interesting combination of NLP tasks in one go: transcribe the audio clip with Wav2Vec2, then use a variety of transformer models to summarise or translate the transcript. The possible permutations and combinations of NLP tasks you can link up are pretty mind-boggling.

Results could be patchy, of course. Some NLP tasks, such as summarisation, are inherently hard to crack.

This post, a follow-up to my earlier one on Wav2Vec2 trials, will outline 2 trials:

  • #1: Speech-to-text-to-translation & sentiment analysis
  • #2: Speech-to-text-to-summarisation

REPO, REQUIREMENTS, AND REFERENCES

The notebooks and audio files needed for the trials are in my repo. Additionally, you’ll need these to run the notebooks:

The code in most of the notebooks has been updated to use a better approach for transcribing long audio files, based on a post on Github by Hugging Face machine learning engineer Lysandre Jik.

In my earlier work-around, I used Audacity to manually split long audio clips into smaller, more manageable pieces (clips longer than ~90s tend to crash local machines and Colab). Lysandre’s code eliminates this step by using Librosa to stream the long audio clip in shorter, fixed-length “chunks”.

There are several versions of the Wav2Vec2 model on Hugging Face’s model hub. For this post, I’ll be using the wav2vec2-large-960h-lv60-self model.
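Here’s a minimal sketch of that chunked transcription approach (not Lysandre’s exact snippet). It assumes the audio has already been converted to a 16 kHz mono WAV file, and the file name and 25-second block size are illustrative:

```python
import torch
import librosa
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

MODEL = "facebook/wav2vec2-large-960h-lv60-self"
processor = Wav2Vec2Processor.from_pretrained(MODEL)
model = Wav2Vec2ForCTC.from_pretrained(MODEL)

# Stream the long clip in fixed-length blocks instead of loading it all at once.
# With frame_length = hop_length = 16000 samples (1s of 16 kHz audio),
# a block_length of 25 yields roughly 25-second chunks.
stream = librosa.stream(
    "speech.wav",        # hypothetical path; assumes a 16 kHz mono WAV file
    block_length=25,
    frame_length=16000,
    hop_length=16000,
)

chunks = []
for block in stream:
    inputs = processor(block, sampling_rate=16000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    pred_ids = torch.argmax(logits, dim=-1)
    chunks.append(processor.batch_decode(pred_ids)[0].lower())

# Crude sentence breaks: one full stop at the end of every 25s chunk
transcript = ". ".join(chunks)
```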

TRIAL #1: TRANSCRIBE + TRANSLATE + SENTIMENT ANALYSIS

Screen-grab via CNBC Television’s YouTube channel

For this trial, I opted for US President Joe Biden’s first prime-time speech on Mar 11/12, 2021 (depending on which time zone you are in).

His speech was about 24 minutes long, and I streamed it to Wav2Vec2 in 25-second chunks. You can opt for longer or shorter chunks; I found 25s to give decent results. Details are in notebook3.0 in my repo, and I won’t repeat them here.

The speech is complex and Biden is known to be fighting a stutter. The transcript (download directly here) is pretty rough in some parts, especially towards the end:


To be fair, Biden spoke pretty clearly in my view, and this is an example where the Wav2Vec2 model struggled. But the model took just 12 minutes or so on my late-2015 iMac to produce the transcript, which is way, way faster than transcribing the speech manually.

Next, I wanted to pipe the transcript of Biden’s speech to a machine translation model and see what the quality of a Chinese translation would be like. I used Hugging Face’s implementation of MarianMT for this, as I’ve tried it on a few speeches previously.

The results are borderline unusable, unfortunately. Download the full Chinese translation here:

The “chain-linking” process works well from a technical perspective, and you just need about 10 additional lines of code after the Wav2Vec2 process.
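A rough sketch of what those extra lines could look like, assuming the Helsinki-NLP/opus-mt-en-zh checkpoint for English-to-Chinese (the exact checkpoint and settings in my notebook may differ):

```python
from transformers import MarianMTModel, MarianTokenizer

# Assumed English-to-Chinese MarianMT checkpoint
MT_MODEL = "Helsinki-NLP/opus-mt-en-zh"
mt_tokenizer = MarianTokenizer.from_pretrained(MT_MODEL)
mt_model = MarianMTModel.from_pretrained(MT_MODEL)

# Translate chunk by chunk to stay within the model's maximum input length.
# `chunks` is the list of 25s text segments from the Wav2Vec2 step above.
translated = []
for chunk in chunks:
    batch = mt_tokenizer([chunk], return_tensors="pt", padding=True, truncation=True)
    generated = mt_model.generate(**batch)
    translated.append(mt_tokenizer.decode(generated[0], skip_special_tokens=True))

chinese_translation = "".join(translated)
```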

But the quality of the translation clearly suffers when the problematic parts of the raw transcript are fed to the MarianMT model without a thorough clean-up of the missing or wrong words, and without proper punctuation. I added full stops (“.”) at the end of each 25s-worth of text from Wav2Vec2, but that clearly doesn’t capture the correct start and end of each sentence in the original speech.

So while “chain-linking” the NLP tasks would seem to save a lot of time, the problems in one area can and will compound, resulting in lower-quality output for tasks further downstream.

Clearly it will take additional time to clean up the raw English transcript, but I believe doing so would improve the quality of the Chinese translation considerably. As for accurate punctuation, there is no quick way to do it within Wav2Vec2 right now. I suspect future versions of the model will resolve this issue, seeing that a number of automatic speech recognition services on the market already offer an “add punctuation” feature.

Next, I wanted to try applying sentiment analysis to Biden’s speech. The combination of speech-to-text with sentiment analysis can be really useful in political or business settings, where a quick understanding of the speaker’s sentiment can affect certain decisions.

The raw transcript was turned into a simple dataframe, and I then used Hugging Face’s transformers pipeline and Plotly to generate a “sentiment-structure” chart like the one below. I have been experimenting with these sentiment charts for a while; my earlier experiments can be found here.
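Below is a minimal sketch of that step. It scores each 25-second chunk with the pipeline’s default sentiment model (a DistilBERT checkpoint fine-tuned on SST-2) and flips the sign of negative scores so the chart traces the speech’s sentiment structure. The column names and chart settings are illustrative, not the ones used in my notebook:

```python
import pandas as pd
import plotly.express as px
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")   # default DistilBERT SST-2 checkpoint

# `chunks` is the list of 25s text segments from the Wav2Vec2 step
df = pd.DataFrame({"text": chunks})
scores = df["text"].apply(lambda t: sentiment(t)[0])
df["label"] = scores.apply(lambda s: s["label"])
df["score"] = scores.apply(lambda s: s["score"])

# Signed score: positive chunks plot above zero, negative chunks below
df["signed_score"] = df.apply(
    lambda row: row["score"] if row["label"] == "POSITIVE" else -row["score"], axis=1
)

fig = px.bar(df, x=df.index, y="signed_score", title="Sentiment structure of the speech")
fig.show()
```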

This worked better than the attempt at Chinese translation, though I’m sure the sentiment analysis would also benefit from a thorough clean-up of the raw English transcript and the inclusion of proper punctuation.

There’s technically no reason to stop at just three tasks. You can easily write a few more lines of code and add summarisation as a fourth NLP task in the chain.

But auto-summarisation of long text documents, including speeches, remains a highly challenging task even for transformer models. The results aren’t ready for prime time, in my view. Let’s take a closer look in the next trial.

TRIAL #2: TRANSCRIBE + SUMMARISE

Screen-grab via the YouTube channel of the Singapore Prime Minister’s Office.

For this I picked a shorter audio clip: a 4-minute video of Singapore Prime Minister Lee Hsien Loong responding to a question on populism at a business conference in 2019. The clip focuses on a single issue, but has enough complexity to be challenging for any auto-summarisation model.

See notebook3.1 in my repo for details.

The Wav2Vec2 output is excellent, as you can see below (or download here). There are some minor trip-ups here and there, but nothing you can’t clean up really quickly.

I ran the raw transcript through two transformer models fine-tuned for summarisation: FB’s “bart-large-cnn” and Google’s “pegasus-large”.
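Below is a minimal sketch of how the two summaries could be generated via the transformers summarisation pipeline. The transcript filename and generation parameters are illustrative, not necessarily what’s in my notebook:

```python
from transformers import pipeline

bart = pipeline("summarization", model="facebook/bart-large-cnn")
pegasus = pipeline("summarization", model="google/pegasus-large")

with open("pm_lee_transcript.txt") as f:   # hypothetical filename for the cleaned transcript
    text = f.read()

# max_length / min_length are illustrative; truncation trims inputs beyond each model's limit
bart_summary = bart(text, max_length=150, min_length=60, truncation=True)[0]["summary_text"]
pegasus_summary = pegasus(text, max_length=150, min_length=60, truncation=True)[0]["summary_text"]

print(bart_summary)
print(pegasus_summary)
```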

Here’s the result from Bart:

This is the summary by Pegasus:

Both models did well in first capturing the speaker’s broad definition of populism, and then including his elaboration on how Singapore has tried to avoid the problem.

But the FB-Bart model didn’t capture any details from the second part of Mr Lee’s comments, while the Pegasus model captured too much from the first part and not enough of the second half. Neither version would pass for a good summary, in my view, though to be fair, neither model was trained on political speeches.

So yet again, we see that the chain-linking of NLP tasks via Wav2Vec2 and transformers is technically viable, but the results are not always satisfactory.

CONCLUSION

These are exciting times for folks working in NLP, and Wav2Vec2 looks set to open up a whole new range of possibilities. With Hugging Face initiating a sprint to extend Wav2Vec2 to other languages (beyond English), the scope for “chain-linking” NLP tasks can only grow.

But given the existing limitations of Wav2Vec2 and the inherent difficulties in many NLP tasks such as summarisation, it is probably wiser to add a “pause” button to the process.

By that I mean results will be better if the raw Wav2Vec2 transcripts are cleaned up before they are fed downstream to other NLP tasks. Otherwise the problematic areas in the transcript compound, resulting in sub-par results further down the chain.

As always, if you spot mistakes in this or any of my earlier posts, ping me at:

The repo for this post, containing the data and notebooks for the charts, can be found here.
