
Torch: Spoken digits recognition from features to model

Explore the features extracted from voice data and the different approaches to building a model based on the features.

Ayisha D
Towards Data Science
3 min read · Jul 30, 2020


The spoken digits dataset is a subset of the TensorFlow Speech Commands dataset, which also contains sound recordings other than the digits 0–9. Here, we focus only on identifying the spoken digit.

The dataset can be downloaded as follows.

Spokendigit feature extraction.ipynb

Metrics for evaluation

The subset of digit recordings is fairly balanced, with around 2,300 samples in each class. Thus, accuracy is a good measure of the model’s performance. Accuracy is the ratio of correct predictions to the total number of predictions. It is not a good measure for unbalanced datasets, as the accuracy on the majority class can overshadow the minority classes.
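As a quick illustration (with made-up tensors), accuracy is simply the fraction of matching predictions:

```python
import torch

preds = torch.tensor([3, 1, 4, 1, 5])    # hypothetical predicted digits
labels = torch.tensor([3, 1, 4, 0, 5])   # hypothetical true digits
accuracy = (preds == labels).float().mean().item()  # 0.8
```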

Cyclic learning rate

While training a model, the learning rate is typically decreased gradually to fine-tune the training. To make better use of the learning rate, a cyclic learning rate schedule can be applied. Here, the learning rate fluctuates between a minimum and a maximum value over the epochs instead of decreasing monotonically.

The initial learning rate is crucial to the model’s performance: starting low prevents the model from getting stuck at the start of training, and the subsequent fluctuations help it escape local minima and plateaus.

The learning rates used in each epoch can be monitored using the optimizer instance.
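A minimal sketch of how this might look in PyTorch, using torch.optim.lr_scheduler.CyclicLR on synthetic data (the feature sizes, learning-rate bounds, and step size here are assumptions, not the notebooks’ actual values):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-ins for the extracted features; the real notebooks load these from CSV.
features = torch.randn(64, 173)
labels = torch.randint(0, 10, (64,))
train_loader = DataLoader(TensorDataset(features, labels), batch_size=16)

model = torch.nn.Linear(173, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

# Cycle the learning rate between base_lr and max_lr; the bounds are assumptions.
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=1e-4, max_lr=1e-2, step_size_up=20
)

for epoch in range(3):
    for xb, yb in train_loader:
        loss = torch.nn.functional.cross_entropy(model(xb), yb)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        scheduler.step()  # advance the cyclic schedule once per batch
    # The last learning rate used in the epoch can be read off the optimizer instance.
    print(epoch, optimizer.param_groups[0]["lr"])
```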

The project takes two approaches to classifying the recordings:

  1. Logistic Regression using five extracted features — 76.19% accuracy.
  2. CNN using Mel spectrogram — 95.81% accuracy.

The models were trained repeatedly by varying the number of epochs and the learning rates. The number of hidden layers and the nodes in each of them were also varied. The best architecture and hyperparameters for each approach are described here. The accuracies may vary slightly on retraining due to the randomness in the train-validation split.

The source code for the project is here.

There are four .ipynb files:

  1. Feature extraction — The necessary CSV files and features used by both approaches are extracted.
  2. Feature visualization — The features are plotted for two examples in each class.
  3. Spokendigit-Five features — Implementation of logistic regression using five extracted features.
  4. Spokendigit-CNN — Implementation of CNN using Mel spectrogram.

1. Logistic Regression using five extracted features

Features

The features extracted include:

  • Mel Frequency Cepstral Coefficients (MFCCs) — Coefficients that make up the spectral representation of sound based on frequency bands spaced according to the human auditory system’s response (Mel scale).
  • Chroma — Related to the 12 different pitch classes.
  • Mean of Mel spectrogram — Spectrogram based on the Mel scale.
  • Spectral Contrast — Measures the difference in amplitude between peaks and valleys in each frequency sub-band.
  • Tonnetz — Represents tonal space.

These features are NumPy arrays of sizes (20,), (12,), (128,), (7,), and (6,). They are concatenated to form a feature array of size (173,). The label is prepended to the array, and a row is written to the CSV file for each recording.

Spokendigit-feature-extraction.ipynb
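A minimal sketch of what this extraction step might look like with librosa (the parameter choices here are assumptions, not necessarily those of the notebook):

```python
import numpy as np
import librosa

def extract_row(path, label):
    """One CSV row: the label followed by the 173 averaged features."""
    y, sr = librosa.load(path, sr=None)
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20), axis=1)        # (20,)
    chroma = np.mean(librosa.feature.chroma_stft(y=y, sr=sr), axis=1)          # (12,)
    mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)          # (128,)
    contrast = np.mean(librosa.feature.spectral_contrast(y=y, sr=sr), axis=1)  # (7,)
    tonnetz = np.mean(librosa.feature.tonnetz(y=y, sr=sr), axis=1)             # (6,)
    features = np.concatenate([mfcc, chroma, mel, contrast, tonnetz])          # (173,)
    return [label] + features.tolist()
```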

Model

The logistic regression model has 1 input layer, 2 hidden layers, and 1 output layer, with ReLU activations.

Spokendigit — Five features.ipynb
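A sketch of such a feed-forward classifier in PyTorch; the hidden-layer widths (128 and 64) are assumptions, not the notebook’s actual values:

```python
import torch.nn as nn

class FiveFeatureModel(nn.Module):
    """173 concatenated features in, 10 digit classes out."""
    def __init__(self, in_features=173, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, x):
        return self.net(x)
```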

Train

Spokendigit — Five features.ipynb
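The training loop follows the cyclic-learning-rate sketch shown earlier; the per-epoch validation loss and accuracy (the quantities plotted below) can be computed with a helper along these lines (a sketch, not the notebook’s exact fit/evaluate functions):

```python
import torch

@torch.no_grad()
def evaluate(model, val_loader):
    """Average validation loss and accuracy over one pass through the loader."""
    model.eval()
    losses, correct, total = [], 0, 0
    for xb, yb in val_loader:
        out = model(xb)
        losses.append(torch.nn.functional.cross_entropy(out, yb).item())
        correct += (out.argmax(dim=1) == yb).sum().item()
        total += yb.size(0)
    return sum(losses) / len(losses), correct / total
```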

The model took about 3 minutes to train on CPU and has an accuracy of 76.19%.

The plot of validation losses

The final validation loss rises considerably above its minimum value.

The plot of validation accuracies
The plot of last learning rates in each epoch

2. CNN using Mel spectrogram images

Features

This model uses Mel spectrogram images of the recordings. A Mel spectrogram is a spectrogram in which the frequencies are converted to the Mel scale. The images are extracted from the recordings and stored on the drive, which took over 4.5 hours.

Spokendigit-feature-extraction.ipynb
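A minimal sketch of rendering one recording as a Mel spectrogram image with librosa and matplotlib (figure size and output format are assumptions):

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

def save_mel_spectrogram(wav_path, png_path):
    """Render a recording as a Mel spectrogram image and save it to disk."""
    y, sr = librosa.load(wav_path, sr=None)
    mel = librosa.feature.melspectrogram(y=y, sr=sr)
    mel_db = librosa.power_to_db(mel, ref=np.max)   # convert power to decibels
    plt.figure(figsize=(3, 3))
    librosa.display.specshow(mel_db, sr=sr)
    plt.axis("off")
    plt.savefig(png_path, bbox_inches="tight", pad_inches=0)
    plt.close()
```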

Model

Spokendigit-CNN.ipynb
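The CNN is defined in the notebook above. Since its exact architecture is not described in this post, the following is only an illustrative sketch of a small convolutional classifier over Mel spectrogram images; the channel counts, kernel sizes, and the assumption of 3-channel input are all mine:

```python
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    """Small CNN over RGB Mel spectrogram images, 10 digit classes out."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, num_classes)
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```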

Train

Spokendigit-CNN.ipynb

The model took about 5 hours to train on Colab GPU and has an accuracy of 95.81%.

As with the first approach, the high accuracy can be attributed to features based on the Mel scale.

The plot of validation losses
The plot of validation accuracies
The plot of last learning rates in each epoch
