Torch: Spoken digits recognition from features to model
Explore the features extracted from voice data and the different approaches to building a model based on the features.
The spoken digits dataset is a subset of the TensorFlow Speech Commands dataset, which also includes sound recordings other than the digits 0–9. Here, we focus only on identifying the spoken digit.
The dataset can be downloaded as follows.
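A minimal sketch of one way to fetch the corpus and keep only the digit recordings, assuming torchaudio is installed; the `root` path and the helper names here are illustrative, not from the original:

```python
from pathlib import Path

# The ten digit classes kept from the Speech Commands dataset.
DIGIT_LABELS = ["zero", "one", "two", "three", "four",
                "five", "six", "seven", "eight", "nine"]

def is_digit_recording(path):
    """True if a recording's parent folder names one of the spoken digits."""
    return Path(path).parent.name in DIGIT_LABELS

def download_speech_commands(root="data"):
    """Fetch the full Speech Commands corpus via torchaudio (if installed)."""
    from torchaudio.datasets import SPEECHCOMMANDS
    return SPEECHCOMMANDS(root=root, download=True)
```

After downloading, `is_digit_recording` can filter the file list down to the digit subset, since each class lives in its own folder.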
Metrics for evaluation
The subset of digit recordings is fairly balanced, with around 2,300 samples per class, so accuracy is a suitable measure of the model’s performance. Accuracy is the ratio of correct predictions to the total number of predictions. It is not a good measure for unbalanced datasets, since high accuracy on the majority class can mask poor performance on the minority classes.
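The metric itself is a one-liner; a minimal sketch:

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Three of four predictions correct -> accuracy 0.75.
accuracy([1, 2, 3, 3], [1, 2, 3, 4])
```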
Cyclic learning rate
While training a model, the learning rate is typically decreased gradually to fine-tune the later stages of training. A cyclic learning rate can improve on this: the learning rate oscillates between a minimum and a maximum value over epochs instead of decreasing monotonically.
The initial learning rate is crucial to the model’s performance: a low starting rate prevents the model from getting stuck at the start of training, and the subsequent fluctuations help it escape local minima and plateaus.
The learning rates used in each epoch can be monitored using the optimizer instance.
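PyTorch provides `torch.optim.lr_scheduler.CyclicLR` for this; the triangular policy it implements can be sketched in plain Python (the `base_lr`, `max_lr`, and `step_size` values below are illustrative, not the article's):

```python
import math

def triangular_lr(iteration, base_lr=1e-3, max_lr=1e-2, step_size=4):
    """Triangular cyclic learning rate: rises linearly from base_lr to
    max_lr over step_size iterations, falls back, and repeats."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)
```

With a PyTorch optimizer, the rate actually applied in each epoch can be read from `optimizer.param_groups[0]["lr"]`, which is how the monitoring mentioned above is usually done.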
The project has two approaches to classifying the recordings:
- Logistic Regression using five extracted features — 76.19% accuracy.
- CNN using Mel spectrogram — 95.81% accuracy.
The models were trained repeatedly, varying the number of epochs and the learning rates. The number of hidden layers and the nodes in each of them were also varied. The best architecture and hyperparameters for each approach are described here. The accuracies may vary slightly on retraining due to the randomness in the train-validation split.
The source code for the project is here.
There are four .ipynb files:
- Feature extraction — The necessary CSV files and features used by the three approaches are extracted.
- Feature visualization — The features are plotted for two examples in each class.
- Spokendigit-Five features — Implementation of logistic regression using five extracted features.
- Spokendigit-CNN — Implementation of CNN using Mel spectrogram.
1. Logistic Regression using five extracted features
Features
The features extracted include:
- Mel Frequency Cepstral Coefficients (MFCCs) — Coefficients that make up the spectral representation of sound based on frequency bands spaced according to the human auditory system’s response (Mel scale).
- Chroma — Related to the 12 different pitch classes.
- Mean of Mel spectrogram — Spectrogram based on the Mel scale.
- Spectral Contrast — Measures the difference in amplitude between peaks and valleys in the spectrum.
- Tonnetz — Represents tonal space.
These features are NumPy arrays of sizes (20,), (12,), (128,), (7,), and (6,). They are concatenated to form a feature array of size (173,). The label is prepended to the head of the array, and the result is written to the CSV file for each recording.
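A sketch of this extraction, assuming librosa is used (its `librosa.feature` functions return one row per coefficient and one column per frame, so averaging over the time axis yields the sizes above); the helper names are illustrative:

```python
import numpy as np

def extract_features(y, sr):
    """Mean-pooled features for one recording (requires librosa)."""
    import librosa
    return [
        np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20), axis=1),    # (20,)
        np.mean(librosa.feature.chroma_stft(y=y, sr=sr), axis=1),        # (12,)
        np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1),     # (128,)
        np.mean(librosa.feature.spectral_contrast(y=y, sr=sr), axis=1),  # (7,)
        np.mean(librosa.feature.tonnetz(y=y, sr=sr), axis=1),            # (6,)
    ]

def to_csv_row(label, features):
    """Prepend the label to the concatenated 173-dim feature vector."""
    return np.concatenate([[label], *features])
```

Each resulting CSV row therefore has 174 values: the label followed by the 173 features.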
Model
The logistic regression model has 1 input layer, 2 hidden layers, and 1 output layer, with ReLU activations in all.
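A minimal PyTorch sketch of that layout; the hidden sizes (256 and 64) are assumptions, since the article does not list them:

```python
import torch
from torch import nn

# Hidden sizes are illustrative; the article states the layer count
# and ReLU activations (including after the output layer) but not widths.
model = nn.Sequential(
    nn.Linear(173, 256), nn.ReLU(),  # input layer over the 173 features
    nn.Linear(256, 64), nn.ReLU(),   # hidden layers
    nn.Linear(64, 10), nn.ReLU(),    # output layer: one score per digit
)
```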
Train
The model took about 3 minutes to train on CPU and has an accuracy of 76.19%.
The final validation loss increases considerably from its minimum value.
2. CNN using Mel spectrogram images
Features
This model uses Mel spectrogram images of the recordings. A Mel spectrogram is a spectrogram in which the frequencies are converted to the Mel scale. The features were extracted from the recordings and stored on the drive, which took over 4.5 hours.
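The Hz-to-Mel conversion underlying these spectrograms is commonly given by the O'Shaughnessy formula, sketched here:

```python
import math

def hz_to_mel(f_hz):
    """Convert a frequency in Hz to the Mel scale.
    Higher frequencies are compressed, mirroring human pitch perception."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)
```

The compression is visible in the numbers: doubling a high frequency adds far fewer Mels than doubling a low one.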
Model
Train
The model took about 5 hours to train on Colab GPU and has an accuracy of 95.81%.
The high accuracy can again be attributed to the Mel scale.