Photo credit: Jason Rosewell

Speech as Input in Virtual Reality

Using speech & NLP for more dynamic virtual environments.

Daniel Rothmann
Towards Data Science
Nov 20, 2018


People have been wanting to talk to computers for a long time. Thanks to deep learning, speech recognition has become significantly more robust and significantly less frustrating — even for less popular languages.

At Kanda, we set out to examine speech recognition and natural language processing (NLP) techniques to make fluid conversational interfaces for augmented reality (AR) & virtual reality (VR). In Danish.

The Danish language is notoriously hard to learn. That goes for both humans and machines. Here are a couple of insights we gained along the way.

The state of speech recognition

Speech recognition is a task that humans are really quite good at. Human-level speech recognition is often cited at a measured 4% word error rate, a figure based on Richard Lippmann’s 1997 paper, “Speech recognition by machines and humans” [1].

Another study found an average of 4.5% disagreement between human transcribers using careful multiple transcriptions. When asked to perform quick transcription (transcribing approx. 1 hour of English audio in only 5 hours), that disagreement rose to 9.6% [2]. These are baselines we can use to assess the effectiveness of an automatic speech recognition system.
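As a quick refresher on how such numbers are produced: word error rate (WER) is essentially the word-level edit distance (substitutions, insertions and deletions) between the recognizer’s output and a reference transcript, divided by the number of words in the reference. Here is a minimal Python sketch of that calculation (our own illustration, not code from the cited studies):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions, insertions, deletions)
    divided by the number of words in the reference transcript."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One missed word out of six reference words -> roughly 17% WER.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```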

When Lippmann released his paper in 1997, he noted that speech recognition technology had come a long way, having achieved error rates less than 10% under ideal circumstances. With spontaneous speech or in noisy conditions, however, error rates would increase to 30–40%.

Fast forward to 2018. (Whoosh!)

At this point, we are getting used to talking to our devices every single day. Think Alexa, think Cortana, think Siri. In 2017, Microsoft announced that they had reached a machine transcription error rate slightly better than the human level — 5.8% vs 5.9% on the Switchboard dataset [3].

Photo credit: Bence Boros

Niche language

All that is pretty great. But these results are for English, one of the most spoken languages in the world. Kanda is based in Denmark and our clients are from here. And we speak Danish. So there is that.

Speech-to-text (STT) is one of those tasks you want to avoid building your own models for — the generality of the task, combined with the sheer amount of data you need to succeed, makes it ideal to rely on a common provider.

Luckily, both Microsoft Azure and Google Cloud provide Danish STT APIs. But absent comprehensive academic studies and test results, we needed to assess the accuracy of these APIs ourselves.

For this task, we devised our own little test.

We prepared and transcribed a number of Danish speech recordings. Some were news articles and movie reviews read aloud to a laptop microphone, some were movie dialogue read aloud to a headset, while others were free-form video blogs with slurred pronunciation or degraded audio quality.

We tried to cover a wide range of scenarios, though the test was admittedly very small in scale. The first takeaway: the Google Cloud STT API outperformed the Azure STT API by a significant margin, so we’ll be relying on the Google API for the time being.

Even then, the results were far from the 5-6% error rates we saw for STT in English. In our Danish tests, the Google Cloud STT API had an average error rate of 28%, with structured content under good recording conditions at around 10% and unstructured speech in noisy conditions approaching a whopping 60%. That’s actually comparable to the 1997 STT performance for English. For the curious, a selection of the test results is illustrated below:

A selection of results from our Google Cloud STT tests.

Command and control

So, maybe fluid conversational speech interfaces in Danish are still some way off. But if we reduce our scope to a keyword-driven approach and take proper measures to harden and validate the input, “command and control” type interfaces are possible in VR with broadly available tools and services.

This approach might seem a bit ancient when you consider that this is the way text-based narrative games have handled interaction since the ’70s. However, being able to speak as a way of interacting with a virtual environment can add a new layer of physicality, familiarity and credibility to a VR experience.

The interaction doesn’t have to be explicit, as in “go north”, “play Nickelback” or “pick up item”, either. We could listen for spontaneous utterances by the player during the experience, such as named entities, areas of interest or emotional indicators, and have the virtual environment react to them.
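As a toy illustration of what a keyword-driven interaction layer could look like (the Danish keywords and event names below are hypothetical placeholders, not our production setup), a transcript can simply be scanned for known command words and mapped to environment events:

```python
# Hypothetical keyword-to-event mapping for a "command and control" style
# VR interaction. Keywords and event names are placeholders.
KEYWORD_EVENTS = {
    "nord": "move_north",       # "north"
    "musik": "play_music",      # "music"
    "saml op": "pick_up_item",  # "pick up"
}

def detect_commands(transcript: str) -> list[str]:
    """Scan an STT transcript for known keywords and return the
    virtual-environment events they should trigger."""
    text = transcript.lower()
    return [event for keyword, event in KEYWORD_EVENTS.items() if keyword in text]

# "I walk north and pick up the key"
print(detect_commands("Jeg går mod nord og samler nøglen op"))
# ['move_north'] -- naive substring matching misses the inflected
# "samler ... op", which is where context and fuzzy matching come in.
```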

Photo credit: Lux Interaction

It’s all about context

When you start transcribing speech and observing the errors made by STT models, you realize that many words that sound very much alike mean entirely different things depending on context. Words that sound the same but differ in meaning are called homophones.

Context could mean the words that came before, the topic we’re talking about, or the person we’re talking to. As a result, context determines which words we’re listening for and which we expect to hear.

The Google STT API supports an optional SpeechContext input, which can contain a number of words and phrases to prioritize in the result. When listening, the Google STT model produces a number of possible options for what could’ve been said. The SpeechContext is used to determine which of these options is most likely to be correct given what we expect to hear.

Context is used to select the best candidate among words that sound the same.
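For the curious, here is a minimal sketch of how such phrase hints can be passed with the Python google-cloud-speech client (the file name and the Danish phrase list are placeholders, and client versions may differ in detail):

```python
from google.cloud import speech

client = speech.SpeechClient()

# Placeholder audio file with a short Danish utterance.
with open("utterance.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="da-DK",  # Danish
    # Phrase hints: words we expect in the current scene get a boost
    # when the model ranks its candidate transcriptions.
    speech_contexts=[speech.SpeechContext(phrases=["nord", "nøgle", "dør", "kiste"])],
)

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)
```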

Another problem when looking for keywords in STT transcriptions is the small grammatical errors that can occur, such as mistakenly transcribing “computer’s” as “computer” or “computers”. If the change does not alter the intent of the word, we should still try to detect and react to it.

Our list of available NLP tools is restricted, since most do not support Danish. However, one remedy for small grammatical and typing errors is fuzzy string searching: measuring for approximate (rather than exact) keyword matches. One metric for approximate keyword matching is Levenshtein distance, a way of measuring edit distance between words.

Levenshtein distance measures the number of edits necessary to change one string to another.

The Levenshtein method establishes the distance between two strings by measuring how many edits must be made to change one string into the other. An edit can be a character substitution, insertion or deletion. Dividing the edit count by the maximum number of possible edits (equal to the length of the longer string), a normalized word distance metric can be calculated.
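Here is a minimal sketch of that idea in Python (our own illustration; the 0.25 match threshold is an arbitrary choice, not a recommended value):

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character edits (substitutions, insertions,
    deletions) needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[len(b)]

def normalized_distance(a: str, b: str) -> float:
    """Levenshtein distance scaled by the longer string's length:
    0.0 means identical, 1.0 means completely different."""
    return levenshtein(a, b) / max(len(a), len(b), 1)

def fuzzy_keyword_match(word: str, keyword: str, threshold: float = 0.25) -> bool:
    """Accept approximate matches such as 'computers' vs 'computer'."""
    return normalized_distance(word.lower(), keyword.lower()) <= threshold

print(fuzzy_keyword_match("computers", "computer"))  # True (distance 1/9 ≈ 0.11)
```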

Using context and fuzzy string search, we were able to significantly reduce the error rate for predetermined keywords. Across a number of tests, we were able to detect Danish keywords with a 12% error rate, less than half the 28% average error rate from our initial tests.

Based on these results, we assess that Danish speech recognition services are now viable for practical application. At least for keyword-based interfaces.

Photo credit: Adam Solomon

Moving forward

With all the major steps being taken in speech recognition, NLP and deep learning, it’s nice to see less popular languages starting to become supported.

“Foreign” language model performance is lagging behind what is currently possible in English, but with a couple of extra tools and hacks, Danish speech recognition can be made viable for production.

In languages like Danish, we’re still waiting to be able to reliably apply the really clever NLP techniques. Personally, I wonder if improved machine translation can help us do better language understanding by making the English NLP toolbox available in other languages.

On a positive note, it seems like fluid conversational speech interfaces for VR are not that far off in the future. Even for obscure, weird, difficult-to-learn languages like Danish.

For now, we can start using keyword detection to make virtual environments more dynamic and reactive.

References

[1] Lippmann, R.P., 1997. Speech recognition by machines and humans. Speech communication, 22(1), pp.1–15.

[2] Glenn, M.L., Strassel, S., Lee, H., Maeda, K., Zakhary, R. and Li, X., 2010, May. Transcription Methods for Consistency, Volume and Efficiency. In LREC.

[3] Xiong, W., Droppo, J., Huang, X., Seide, F., Seltzer, M., Stolcke, A., Yu, D. and Zweig, G., 2016. Achieving human parity in conversational speech recognition. arXiv preprint arXiv:1610.05256.

