
How I taught my air conditioner some Hebrew


Thoughts and Theory

Story time – Not your usual "how to" article

Who’d have thought air conditioners could learn a language? (image: pixabay.com)

It started as a side project while I was teaching another neural network how to paint. I had just received a shiny new smart home sensor as a gift, which was supposed to turn on my home air conditioner when I was coming back from work or the gym.

It turned out to be a neat toy, with a very sleek app to control it from my phone. But I found that the major drawback was that I always forgot to actually press the "on" button before I started driving home. Being the lazy person I am, I never stopped at the side of the highway just to toggle that button. So I was thinking to myself: "What if I just taught that air conditioner to recognize my speech?"

Sensibo IoT sensors let you toggle anything to do with your air conditioner from your phone. (image: screenshot, under fair-use)

Checking the public documentation of these Sensibo guys, I was able to figure out the outlines of how their sensors work. When the user clicks that button in the app, the phone sends what is called an HTTP request to a Sensibo server somewhere far away. That’s the same thing your browser sends for you when you want to see the latest on Instagram or post a new comment on YouTube. The server then commands the sensor with a second HTTP request, and the sensor relays a matching command to the air conditioner using an infrared light emitter identical to the one in your remote control. All of the above meant that I only needed to write my own "button clicking" app, which instead of waiting for my finger tap would listen to my voice to toggle the air conditioner.
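
To make this concrete, here is roughly what that "button clicking" boils down to in code. This is a minimal Python sketch based on my reading of Sensibo’s public API docs; the exact endpoint paths, field names, and the API key placeholder are illustrative, so double-check them against the official documentation before relying on them.

```python
import requests

API_KEY = "your-sensibo-api-key"   # issued from the Sensibo web dashboard
BASE_URL = "https://home.sensibo.com/api/v2"

def list_pods():
    """Ask the Sensibo server which sensors ("pods") belong to this account."""
    resp = requests.get(f"{BASE_URL}/users/me/pods", params={"apiKey": API_KEY})
    resp.raise_for_status()
    return resp.json()["result"]

def set_ac_state(pod_uid, on=True):
    """Send the same kind of HTTP request the official app sends when you tap "on"."""
    resp = requests.post(f"{BASE_URL}/pods/{pod_uid}/acStates",
                         params={"apiKey": API_KEY},
                         json={"acState": {"on": on}})
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    pod = list_pods()[0]            # assume a single sensor in the house
    set_ac_state(pod["id"], on=True)
```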

I started designing a system that would allow me to do just that. Practicing Machine Learning for the past year made it obvious to me I was going to train a model of my own. But I also needed that model available from my phone. I wasn’t going to pay anyone to store it in the cloud and charge me exorbitant amounts of money to access it, so the vote was quickly cast for a locally deployed model. The problem is, my phone is a 5-year-old antique with hardware that was considered on the weak side even back in the time I bought it. All those new fancy models with billions of parameters running on monster GPUs weren’t going to work here. I needed something simple. So I started reading all the literature I could find.

It turns out that a really big percentage of speech recognition models listen to the user speaking and try to classify the speech as whole words. But the number of words in any language is always huge. Why not use a smaller unit of speech? I was thinking of phonemes. Phonemes are among the most basic units making up human languages. Every unique sound is considered a phoneme, no matter how many ways you can spell it. English, for example, is estimated to have 171,000 words but only 42 phonemes.

To follow me from this point you will need to know some math. The meaning of that last paragraph is that if a model wants to classify whole words in English, the output layer (the classification layer) needs to have at least 171,000 distinct outputs. Just think of the size of that weight matrix and the dimension of the output vector. If the same model were to classify phonemes instead, that number would drop to just 42. Remembering that even naïve matrix multiplication has a time complexity of O(n³), just think how much faster a phoneme classifier is compared to a word classifier.
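
To put numbers on that, here is a quick comparison of the two output layers in PyTorch. The hidden size of 256 is an arbitrary placeholder, not a value from my actual model; the point is only the ratio between the two parameter counts.

```python
import torch.nn as nn

HIDDEN = 256            # size of the feature vector feeding the classifier (placeholder)
N_WORDS = 171_000       # one output per English word
N_PHONEMES = 42         # one output per phoneme

word_head = nn.Linear(HIDDEN, N_WORDS)
phoneme_head = nn.Linear(HIDDEN, N_PHONEMES)

def n_params(layer: nn.Module) -> int:
    return sum(p.numel() for p in layer.parameters())

print(f"word classifier head:    {n_params(word_head):,} parameters")     # ~44 million
print(f"phoneme classifier head: {n_params(phoneme_head):,} parameters")  # ~11 thousand
```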

Next: it seems that a lot of models dealing with speech prefer to learn from spectrograms rather than from the raw audio. There have been recent successful models working without spectrograms (check Facebook’s wav2vec 2.0, for example), but these are again at the top end of model size and inference time. If the user waits too long before the air conditioner does anything, he’ll just park his car and click the normal button.

In essence, spectrograms are the result of a transform applied to the raw audio signal that creates a matrix in which each column holds the amplitudes of the discrete frequency bins within one short frame of the original signal. If the passage above is Chinese to you (it was to me, until recently), you needn’t worry much. When we speak, our vocal cords vibrate the air to produce sound. The phone microphone converts these vibrations to electrical amplitude readings (amplitude = how loud we just spoke) and stores them in an array. Creating a spectrogram turns that array into an image. Why is that good for us? Raw audio gives us only the amplitude, while spectrograms also give us the frequencies the signal is made of. Even better – since spectrograms are matrices, they can also be displayed as images. This means we can use methods from the image processing world to analyze our voice. Neat!
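
If you want to see what that transform looks like in practice, here is a short Python sketch using librosa. The file name and the STFT window settings are placeholders I picked for illustration, not the exact values used in this project.

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load a short voice command; sr is the sample rate, y the raw amplitude array.
y, sr = librosa.load("command.wav", sr=16_000)

# Short-time Fourier transform: slice the signal into overlapping frames
# and measure how much of each frequency is present in every frame.
stft = librosa.stft(y, n_fft=512, hop_length=160)
spectrogram = np.abs(stft)                       # keep magnitudes only

# Decibel scaling makes the quiet parts visible, which helps the model too.
spec_db = librosa.amplitude_to_db(spectrogram, ref=np.max)

librosa.display.specshow(spec_db, sr=sr, hop_length=160,
                         x_axis="time", y_axis="hz")
plt.colorbar(format="%+2.0f dB")
plt.show()
```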

Three renditions of the same spectrogram, in different amplitude and pitch scales (image by author)

The most important of these methods when it comes to machine learning is called the convolution layer. Modern programming libraries used for writing models break them down into layers. Layers are like the Lego blocks models are made of. Convolution layers use a mathematical operation of the same name (not exactly, but a similar one) to analyze image data. Their strength comes from a mathematical property called "invariance": convolutions are invariant to the location of image features. This means that no matter what we are trying to predict or analyze about an image, the position of a feature within the image will have very little effect on the final result. A convolution layer can find a dog no matter where it is positioned in an image. In our case – it can find a phoneme no matter where it is positioned within a spectrogram.
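
Here is a toy demonstration of that property (not taken from the project itself): a random, untrained convolution layer followed by global max pooling produces the same summary no matter where a small pattern sits inside the input. Strictly speaking, the convolution itself is shift-equivariant, and it is the pooling at the end that turns this into position independence.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 8, kernel_size=3, padding=1)   # a single untrained convolution layer

def response(spec):
    """Global max over the feature map: one summary number per filter."""
    return conv(spec).amax(dim=(2, 3))

# The same small "phoneme-like" patch pasted at two different time positions.
patch = torch.randn(1, 1, 20, 10)
early = torch.zeros(1, 1, 128, 100)
late = torch.zeros(1, 1, 128, 100)
early[..., 50:70, 10:20] = patch
late[..., 50:70, 80:90] = patch

print(torch.allclose(response(early), response(late)))   # True
```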

That is good for us, because we’ll find all the phonemes, but not perfect, because we won’t know their order. The convolutions’ invariance to location means they cannot efficiently determine phoneme order, making "poultry" (pl-tri) and "triple" (tri-pl) effectively interchangeable. To overcome that, we apply sequence modeling to the convolution outputs with something called a "recurrent neural network". Recurrent neural networks (RNNs) usually either read a sequence of inputs or produce a sequence of outputs (sometimes both). Their strength comes from a neat design feature, in which previous inputs influence the next outputs. This is in contrast with standard neural networks, in which each input is independent of the rest of the sequence. This feature has made them very prominent in research for the last 35 years, showing very good results.

Recent research has tried to combine the two. When applied to this project, it works as follows: we take the output of the convolution layers, which is a matrix with the same dimensions as the input spectrogram, and split it into a sequence of column vectors. Each column vector is fed in sequence to the RNN layer, which should use them all to output the phonemes in their correct order. Models combining both convolution layers and recurrent layers are called (unimaginatively) convolutional-recurrent neural networks (CRNNs).
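
Here is a minimal PyTorch sketch of that idea. This is not my actual model – the layer counts and sizes are placeholders – but it shows the convolution stack, the split into column vectors, and the recurrent layer reading them in order.

```python
import torch
import torch.nn as nn

class TinyCRNN(nn.Module):
    """A minimal convolutional-recurrent sketch: conv layers read the spectrogram,
    a GRU reads the resulting columns left to right, a linear head scores phonemes."""

    def __init__(self, n_mels=64, n_phonemes=42, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
        )
        self.rnn = nn.GRU(input_size=16 * n_mels, hidden_size=hidden,
                          batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_phonemes)

    def forward(self, spec):                 # spec: (batch, 1, n_mels, time)
        feats = self.conv(spec)              # (batch, 16, n_mels, time)
        # Split the feature map into a sequence of column vectors, one per time frame.
        cols = feats.permute(0, 3, 1, 2).flatten(2)   # (batch, time, 16 * n_mels)
        out, _ = self.rnn(cols)              # (batch, time, 2 * hidden)
        return self.head(out)                # per-frame phoneme scores

model = TinyCRNN()
scores = model(torch.randn(2, 1, 64, 100))   # two fake spectrograms, 100 frames each
print(scores.shape)                          # torch.Size([2, 100, 42])
```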

There are a few other building blocks used to construct the model, namely an attention block, sequence-to-sequence RNNs, and Gated Recurrent Units (GRUs), but these are really just buzzwords describing better implementations of the above components. If you are really interested in them, you can just Google the terms, which will turn up plenty of resources on each. This is also the place to share that, for those of you who are technically oriented, the full model is available at the following link as a Google Colaboratory notebook:

This is the layer diagram for the model, for those of you who didn’t open the link to that Colab notebook. (image by author)

Now that we’ve designed the model, we need to train it. The problem is, Hebrew doesn’t have much speech data publicly available. Here’s where phonemes come in handy again: even in a small speech dataset, where we record every word in a given language once, the entire set of phonemes will be repeated a hundred times over. In fact, we might be able to work with a dataset small enough that we can create it ourselves.

For three days I was pacing around the house, recording myself repeating the same voice commands and driving everyone around me crazy. For my research, this was more than enough, although for any serious project you’d definitely want some friends to add their own recordings (this helps the model generalize to unknown voices). This resulted in around 45 minutes of audio, which I grouped by utterance (same words in the same file) and tagged accordingly. I then wrote some code to extract the spoken commands from the files, convert them to spectrograms, and group them into a training set and a test set. After adding some spectrograms of background noise (to teach the model to separate speech from silence), I ended up with ~3,450 spectrograms for training. If you have ever trained a model before, you know that’s not a lot of data.
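
For the technically curious, the preprocessing amounted to something like the following Python sketch. The folder layout (recordings/<utterance label>/<take>.wav), the silence-trimming threshold, and the mel settings are assumptions for illustration rather than my exact pipeline.

```python
import glob
import random

import librosa
import numpy as np

def wav_to_spectrogram(path, sr=16_000, n_mels=64):
    """Load a clip, trim leading/trailing silence, return a dB-scaled mel spectrogram."""
    y, _ = librosa.load(path, sr=sr)
    y, _ = librosa.effects.trim(y, top_db=25)       # rough "extract the spoken command"
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)

examples = []
for path in glob.glob("recordings/*/*.wav"):        # recordings/<utterance label>/<take>.wav
    label = path.split("/")[-2]                     # the folder name doubles as the tag
    examples.append((wav_to_spectrogram(path), label))

random.seed(42)
random.shuffle(examples)
split = int(0.8 * len(examples))                    # simple 80/20 train/test split
train_set, test_set = examples[:split], examples[split:]
```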

Then came the training. I tried many augmentations I thought were cool, but they only ended up reducing the model’s accuracy. In the end, I just trained it on the un-augmented dataset for 1,200 epochs, which took just under half an hour. This meant I could quickly try new configurations, datasets, and hyper-parameters and check which ones worked best for me. Once I’d chosen my favourite, I calculated all the metrics data scientists use to check that their models are healthy, and proceeded to write the mobile app once the metrics gave me a green light.
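
The training loop itself was nothing exotic. Below is a minimal sketch reusing the TinyCRNN toy from earlier; the fake tensors and the per-frame cross-entropy loss are placeholders standing in for the real spectrograms, labels, and loss, so treat it as a skeleton rather than my exact setup.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Fake stand-ins: in the real project these come from the preprocessing step above.
specs = torch.randn(3450, 1, 64, 100)               # ~3,450 spectrograms
labels = torch.randint(0, 42, (3450, 100))          # a phoneme index per time frame

loader = DataLoader(TensorDataset(specs, labels), batch_size=32, shuffle=True)
model = TinyCRNN()                                  # the toy model sketched earlier
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(1200):                           # 1,200 epochs, as in the article
    for batch_specs, batch_labels in loader:
        scores = model(batch_specs)                 # (batch, time, n_phonemes)
        loss = loss_fn(scores.reshape(-1, 42), batch_labels.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```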

Checking that the metrics are all OK is important for the health of your model. In this image: a metric called a confusion matrix, showing where the model made the most mistakes on the test set. An ideal confusion matrix is all black, except for the main diagonal, which is all white. (image by author)

If you remember from earlier in this article, my phone is an antique – and it runs Android. This means the app needs to be written in Java (or Kotlin). The problem is that nearly the entire data science ecosystem is built around Python. Models are written in Python, preprocessing pipelines are written in Python, scientific programming methods and linear algebra routines are written in C with Python wrappers, everything is in Python – and none of that is available in Java.

I literally had to spend weeks writing everything from scratch. Most programmers would just scavenge GitHub or Maven Central for packages and libraries someone else wrote, but in my experience these never perform as fast, or work as seamlessly, as your own code.

The app that I wrote was really just the voice-activated equivalent of the button-press app. It listened through the phone’s microphone, detected speech, converted it to a spectrogram, and forwarded it through the model. If the output was something intelligible, an appropriate HTTP request was sent to the Sensibo server, which then activated my IoT sensor just as their own app would have done.

After a lot of fine-tuning and some tricks applied to the entire process described in this article, I managed to reduce the time it takes to convert speech to phonemes to just under 1,810 milliseconds. On newer phones (unlike my antique device), prediction times were much better (usually under 1 s), which was more than enough for any practical need.

All the speech processing is local, as opposed to the common voice recognition assistants that send your voice to a server and wait for an answer. And it actually works! The assistants were never good at recognizing my voice, and now I finally have an app that is fine-tuned just for me, in my native language. Finally, checking the APK details (that’s the file extension for every Android app), it turns out that the entire thing weighs only 35 MB, of which the model is just 1,217 KB. How cool is that?

File composition of the final APK. (image by author)

My project is definitely just a showcase and is not a commercially ready product. I think it would be nice to do some further research in this field since I still have some open questions I’d like to answer. How well does the model scale? How many distinct commands can you teach it? What will be the effect on accuracy once we add in more speakers? What other neat tricks can be applied to reduce memory and disk space requirements from the model? Are there any data augmentations that can be used with this model without significantly harming real-time accuracy rates? How much can we reduce the sizes of the training and test sets before accuracy loss is too significant to allow for proper Speech Recognition?

Many questions to answer, and much more work to do. But for the past couple of months, it was really fun joking with my friends that I’m busy teaching my air conditioner to speak Hebrew 🙂

Screenshot from the Android application (image by Author)

Feel free to ask me any questions in the comments below. If you are interested in all the boring details, there is actually a research paper and accompanying code I wrote where you can check them out. And of course, if you got this far – thanks for reading my story!

