Deep Learning summary for 2017: Text and Speech Applications

Vladimir Fedak
Jan 15, 2018

Deep Learning is disrupting many industries, and yours might not be an exception. Learn about the most notable deep learning projects of 2017 and ride the wave, or risk being rolled over…

Deep Learning (DL) has long since crossed its traditional boundaries. DL projects are being launched in domains ranging from medical services to insurance and from banking to marketing. For example, China aims to become the world leader in AI and to build a $150 billion AI industry by 2030, while researchers at Baidu report that experiments with datasets composed of billions of samples are becoming routine for them.

That said, every business should pay close attention to possible Deep Learning applications in its industry. Below we list the most discussed text- and speech-related DL accomplishments of 2017, aimed both at Machine Learning professionals and at decision-makers looking to improve their bottom line.

Text-related Deep Learning applications

One of the most important areas of DL application is working with text: translation, chatbots, text analysis and a plethora of other tasks.

From Google Translate…

A year ago Google announced that Google Translate had switched to a new neural machine translation system built on recurrent neural networks (RNNs). Over the year, Translate has progressed from producing an unreadable salad of words when fed large blocks of text to producing remarkably fluent translations. The results are impressive, and Google's recurrent model keeps improving.
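At its core, neural machine translation of this kind relies on a sequence-to-sequence setup: an encoder RNN compresses the source sentence into a state vector, and a decoder RNN unrolls the translation from it token by token. Here is a minimal PyTorch sketch of that pattern, with made-up vocabulary sizes and no attention mechanism; it only illustrates the idea, not Google's production system:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder RNN for translation (toy sketch, no attention)."""
    def __init__(self, src_vocab=1000, tgt_vocab=1000, emb=64, hidden=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode the source sentence into a final hidden state.
        _, h = self.encoder(self.src_emb(src_ids))
        # Decode the target sentence conditioned on that state (teacher forcing).
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), h)
        return self.out(dec_out)  # (batch, tgt_len, tgt_vocab) logits

# Toy usage: a batch of 2 sentences, 5 source tokens, 6 target tokens.
model = Seq2Seq()
src = torch.randint(0, 1000, (2, 5))
tgt = torch.randint(0, 1000, (2, 6))
print(model(src, tgt).shape)  # torch.Size([2, 6, 1000])
```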

…to Facebook negotiator chatbot

You might have heard the sensationalized story of how Facebook developed a chatbot and shut it down because it invented a new language. Truth be told, the DL algorithm did drift into a non-human lexicon, yet that did not stop it from accomplishing its goal. The actual goal was for the AI to become good at negotiation: splitting an inventory with a counterpart (one gets the books, the other gets the hats, and so on) purely through textual conversation.

The bot was first trained as a supervised recurrent network on a large dataset of transcripts of real human negotiations, and then fine-tuned with reinforcement learning while two instances of the system negotiated with one another. Along the way, the chatbot picked up a real-life negotiation technique: feigned interest. It showed interest in an item it did not actually want and agreed to hand it over only in exchange for the item it actually required.

When the restriction to stick to human language was lifted, the system started inventing terms of its own. Feel free to play with the code yourself and see what happens in your case!
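To make the two-phase recipe above more concrete, here is a toy PyTorch sketch of the idea: supervised training on dialogue transcripts followed by REINFORCE-style fine-tuning on sampled utterances. It is my own illustration with arbitrary sizes and random stand-in data, not Facebook's released code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, EMB, HID, MAX_LEN = 50, 32, 64, 8

class DialogueLM(nn.Module):
    """Recurrent language model over negotiation dialogue tokens."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def forward(self, tokens, h=None):
        x, h = self.rnn(self.emb(tokens), h)
        return self.out(x), h

model = DialogueLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# --- Phase 1: supervised learning on (stand-in) human transcripts -----------
transcripts = torch.randint(0, VOCAB, (16, MAX_LEN))
logits, _ = model(transcripts[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, VOCAB), transcripts[:, 1:].reshape(-1))
opt.zero_grad()
loss.backward()
opt.step()

# --- Phase 2: REINFORCE-style fine-tuning on sampled turns ------------------
def sample_turn(m, start):
    """Sample a short utterance, keeping log-probs of the chosen tokens."""
    tok, h, logps = start, None, []
    for _ in range(MAX_LEN):
        logits, h = m(tok.view(1, 1), h)
        dist = torch.distributions.Categorical(logits=logits[0, -1])
        tok = dist.sample()
        logps.append(dist.log_prob(tok))
    return torch.stack(logps)

# In the real setup, two copies of the agent alternate turns and the reward
# comes from the value of the agreed deal; a random reward stands in here.
logps = sample_turn(model, torch.tensor(0))
reward = torch.rand(())
rl_loss = -(reward * logps.sum())  # reinforce utterances that score well
opt.zero_grad()
rl_loss.backward()
opt.step()
```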

Speech processing and generation

Another important field of DL application is speech processing. It includes the generation of speech and music, lip reading, synchronization of lip movements with audio, and more.

DeepMind WaveNet

Google DeepMind, the company behind AlphaGo, is currently developing WaveNet, an algorithm that transforms input text into raw audio. It produces noticeably more natural speech than previous text-to-speech approaches. Listen to the English example.

As of now, the main flaw of this network is its speed: generating 1 second of audio takes 1–2 minutes, yet the progress is astonishing. What is more, the same algorithm can also generate piano music. More details are available in the PDF here.
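The architectural trick that lets WaveNet model raw audio is a stack of dilated causal convolutions, whose receptive field grows exponentially with depth. The PyTorch snippet below sketches that building block with arbitrary channel counts and without WaveNet's gated activations and skip connections; it is an illustration of the idea, not DeepMind's implementation:

```python
import torch
import torch.nn as nn

class CausalDilatedConv(nn.Module):
    """One WaveNet-style layer: causal 1-D convolution with dilation."""
    def __init__(self, channels=32, dilation=1, kernel_size=2):
        super().__init__()
        # Left-pad so the convolution never sees future samples (causality).
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                      # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.pad, 0))
        return torch.tanh(self.conv(x))        # the real model uses a gated unit

# Stack layers with exponentially growing dilations (1, 2, 4, ..., 128) so the
# receptive field covers hundreds of past samples, which is what makes
# sample-level audio modelling feasible.
layers = nn.Sequential(*[CausalDilatedConv(dilation=2 ** i) for i in range(8)])
audio = torch.randn(1, 32, 16000)              # one second at 16 kHz, 32 channels
print(layers(audio).shape)                     # torch.Size([1, 32, 16000])
```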

Lip reading from Google DeepMind and Oxford University

Yet another initiative comes from Google DeepMind, working in conjunction with specialists from Oxford University: a lip-reading algorithm described in depth in their joint paper. The model was trained on a dataset of more than 100,000 sentences with accompanying video and audio, using an LSTM to process the audio, a CNN+LSTM to process the video, and a combination of the two resulting state vectors to generate the output characters.

The system works with different types of input: audio only, video only, or audio plus video, which makes the algorithm genuinely multimodal.
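A rough way to picture the architecture described above is a model with two encoders, an LSTM over audio features and a CNN+LSTM over video frames, whose final states are concatenated before predicting characters. The sketch below follows that outline with invented dimensions and a simplified per-clip classifier instead of the paper's sequence decoder; it is not the actual DeepMind/Oxford model:

```python
import torch
import torch.nn as nn

class AudioVisualReader(nn.Module):
    """Sketch of audio-visual fusion: LSTM over audio features, CNN+LSTM over
    video frames, concatenated final states decoded into character logits."""
    def __init__(self, n_chars=40, audio_dim=40, hid=128):
        super().__init__()
        self.audio_lstm = nn.LSTM(audio_dim, hid, batch_first=True)
        self.frame_cnn = nn.Sequential(              # per-frame mouth-crop encoder
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())    # -> 16 features per frame
        self.video_lstm = nn.LSTM(16, hid, batch_first=True)
        self.classifier = nn.Linear(2 * hid, n_chars)

    def forward(self, audio, video):
        # audio: (batch, time, audio_dim)   video: (batch, frames, 1, H, W)
        _, (h_a, _) = self.audio_lstm(audio)
        b, t = video.shape[:2]
        frames = self.frame_cnn(video.flatten(0, 1)).view(b, t, -1)
        _, (h_v, _) = self.video_lstm(frames)
        fused = torch.cat([h_a[-1], h_v[-1]], dim=-1)  # combine both modalities
        # Simplified: one prediction per clip; the real model decodes a
        # full character sequence from the fused representation.
        return self.classifier(fused)

model = AudioVisualReader()
logits = model(torch.randn(2, 100, 40), torch.randn(2, 25, 1, 48, 48))
print(logits.shape)   # torch.Size([2, 40])
```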

Synchronizing lip movements with an audio stream

Researchers at the University of Washington processed many hours of HD recordings of President Obama's speeches and developed a DL algorithm capable of synchronizing lip movements with an audio track.

This creates immense possibilities for the gaming industry and CGI movies… yet it also poses a disturbing concern: the next presidential speech you watch might be computer-generated footage rather than a real recording.
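Conceptually, audio-driven lip sync of this kind boils down to learning a mapping from audio features to mouth shapes over time, which then drive the rendering stage. The sketch below shows such a mapping as a small recurrent network from MFCC frames to 2-D mouth landmarks; the feature choice, dimensions and model are my own assumptions, not the University of Washington pipeline:

```python
import torch
import torch.nn as nn

class AudioToMouth(nn.Module):
    """Map a sequence of audio features to mouth-landmark positions per frame."""
    def __init__(self, audio_dim=13, hid=128, n_landmarks=18):
        super().__init__()
        self.rnn = nn.LSTM(audio_dim, hid, batch_first=True)
        self.to_landmarks = nn.Linear(hid, n_landmarks * 2)  # (x, y) per point

    def forward(self, mfcc):                 # mfcc: (batch, frames, audio_dim)
        h, _ = self.rnn(mfcc)
        return self.to_landmarks(h)          # (batch, frames, n_landmarks * 2)

# Toy usage: roughly 10 seconds of audio features at 25 frames per second.
model = AudioToMouth()
mouth = model(torch.randn(1, 250, 13))
print(mouth.shape)                           # torch.Size([1, 250, 36])
```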

Conclusions

Deep Learning is on a roll, and exciting new projects are revealed in various domains on a regular basis. We are going to describe the advancements in machine perception, reinforcement learning and various other applications over the course of the next couple of weeks, so stay tuned for updates!

The article was originally published here.
