Generating Chirps with Neural Networks

Grafting models together and iteratively calling a generator

Jason O. Jensen
Towards Data Science


The sound of birdsong is varied, beautiful, and relaxing. In the pre-Covid times, I made a focus timer which would play some recorded bird sounds during breaks, and I always wondered whether such sounds could be generated. After some trial and error, I landed on a proof-of-concept architecture which both successfully reproduces a single chirp and exposes parameters that can be adjusted to alter the generated sound.

Since generating bird sounds seems like a somewhat novel application, I think it is worth sharing this approach. Along the way, I also learned how to take TensorFlow models apart and graft parts of them together. The code blocks below show how this is done. The full code can be found here.

The approach in theory

The generator will be composed of two parts. The first part will take the entire sound and encode key pieces of information about its overall shape in a small number of parameters.

The second part will take a small bit of sound, along with the information about the overall shape, and predict the next little bit of sound.

The second part can be called iteratively on itself with adjusted parameters to produce an entirely new chirp!

Encoding the parameters

An autoencoder structure is used for deriving the key parameters of the sound. This structure takes the entire soundwave and reduces it, through a series of (encoding) layers, down to a small number of components (the waist), before reproducing the sound in full from a series of expanding (decoding) layers. Once trained, the autoencoder model is cut off at the waist so that all it does is reduce the full sound down to the key parameters.

For the proof of concept, a single chirp was used; this chirp:

Soundwave representation of the chirp used.

It comes from the Cornell Guide to Bird Sounds: Essential Set for North America, the same set used for the Bird Sounds Chrome Experiment.

One problem with using just a single sound is that the autoencoder might simply hide all the information about the sound in the biases of the decoding layers, leaving the waist with all-zero weights. To mitigate this, the sound was morphed during training by altering its amplitude and shifting it around a little.
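The full code holds the exact augmentation; a minimal sketch might look like the following, where chirp is assumed to be the chirp loaded as a NumPy array of samples in [-1, 1], and the scaling range and maximum shift are illustrative choices:

    import numpy as np

    rng = np.random.default_rng(42)

    def morph(wave):
        """Randomly rescale the chirp's amplitude and shift it in time."""
        scaled = wave * rng.uniform(0.6, 1.0)  # vary the amplitude
        shift = int(rng.integers(-100, 101))   # shift by up to 100 samples
        return np.roll(scaled, shift)

    # Build a training set of morphed copies of the single chirp.
    X_train = np.stack([morph(chirp) for _ in range(2000)])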

The encoder portion of the autoencoder consists of a series of convolutional layers which compress a roughly 3,000-sample sound wave down to around 20 numbers, hopefully retaining important information along the way. Since sounds are composed of many different sine waves, allowing many convolutional filters of different sizes to pass over the sound can in theory capture key information about the composite waves. A waist size of 20 was chosen mainly because this seems like a somewhat surmountable number of adjustable parameters.

In this first approach, the layers are stacked sequentially. In a future version, it may be advantageous to use a structure akin to inception-net blocks to run convolutions of different sizes in parallel.

The decoder portion of the model consists of two dense layers, one of length 400, and one of length 3000 — the same length as the input sound. The activation function of the final layer is tanh, as the sound wave representations have values between -1 and 1.

Here is what this looks like visualized:

Representation of the autoencoder network. Produced with PlotNeuralNet.

And here is the code:
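This is a minimal sketch of that architecture; the filter counts, kernel sizes, and strides are illustrative stand-ins (the full code linked above has the exact values), and it trains on the X_train of morphed chirps built earlier:

    import tensorflow as tf
    from tensorflow.keras import layers

    SOUND_LEN = 3000  # approximate length of the sound wave
    WAIST = 20        # number of encoded parameters

    # Encoder: stacked 1D convolutions squeeze the wave down to the waist.
    encoder_input = layers.Input(shape=(SOUND_LEN, 1))
    x = layers.Conv1D(16, 64, strides=4, padding="same", activation="relu")(encoder_input)
    x = layers.Conv1D(8, 32, strides=4, padding="same", activation="relu")(x)
    x = layers.Conv1D(4, 16, strides=4, padding="same", activation="relu")(x)
    x = layers.Flatten()(x)
    waist = layers.Dense(WAIST, activation="relu", name="waist")(x)

    # Decoder: two dense layers; the final one matches the input length and
    # uses tanh because the wave values lie between -1 and 1.
    x = layers.Dense(400, activation="relu")(waist)
    decoded = layers.Dense(SOUND_LEN, activation="tanh")(x)

    autoencoder = tf.keras.Model(encoder_input, decoded)
    autoencoder.compile(optimizer="adam", loss="mse")
    autoencoder.fit(X_train[..., np.newaxis], X_train, epochs=30, batch_size=32)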

Training the Generator

The structure of the generator begins with the encoding portion of the autoencoder network. The output at the waist is combined with some fresh input representing the bit of the sound wave immediately preceding that which is to be predicted. In this case, the previous 200 values of the sound wave are used as input, and the next 10 are predicted.
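The training pairs can be sliced out of each morphed wave with a rolling window, sketched below (the step size is an assumption):

    PREV_LEN, PRED_LEN = 200, 10

    def make_windows(wave, prev_len=PREV_LEN, pred_len=PRED_LEN):
        """Slice a wave into (previous 200 samples, next 10 samples) pairs."""
        prevs, nexts = [], []
        for i in range(prev_len, len(wave) - pred_len + 1, pred_len):
            prevs.append(wave[i - prev_len:i])
            nexts.append(wave[i:i + pred_len])
        return np.array(prevs), np.array(nexts)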

The combined inputs are fed into a series of dense layers. The sequential dense layers allow the network to learn the relationship between the previous values, information on the overall shape of the sound, and the following values. The final dense layer is of length 10 and activated with a tanh function.

Here is what this network looks like:

Generator network with grafted-on portion of autoencoder network. Produced with PlotNeuralNet.

The layers coming from the autoencoder network are frozen so that additional training resources are not spent on them.
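In code, the graft might look like this, cutting the trained autoencoder off at the waist and reusing it; the sizes of the fresh dense layers are illustrative:

    # Cut the trained autoencoder off at the waist to get the encoder part.
    encoder = tf.keras.Model(autoencoder.input,
                             autoencoder.get_layer("waist").output)
    encoder.trainable = False  # freeze the grafted layers

    full_sound = layers.Input(shape=(SOUND_LEN, 1))  # whole chirp, to be encoded
    prev_values = layers.Input(shape=(PREV_LEN,))    # the preceding 200 samples

    # Combine the encoded shape information with the recent samples.
    combined = layers.Concatenate()([encoder(full_sound), prev_values])
    x = layers.Dense(256, activation="relu")(combined)
    x = layers.Dense(64, activation="relu")(x)
    next_values = layers.Dense(PRED_LEN, activation="tanh")(x)

    generator = tf.keras.Model([full_sound, prev_values], next_values)
    generator.compile(optimizer="adam", loss="mse")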

Generating some sounds

Training this network takes only a couple of minutes as the data is not very varied and therefore relatively easy to learn, particularly for the autoencoder network. One final flourish is to produce two new networks from the trained models.

The first is simply the encoder portion of the autoencoder, now separated out. We need this part to produce a good initial set of parameters.

The second model is the same as the generator network, but with the parts from the autoencoder network replaced by a new input source. This is done so that the trained generator no longer requires the entire soundwave as input, but only the encoded parameters capturing the key information about the sound. With these separated out as a new input, we can freely manipulate them when generating chirps.
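One way to do this rewiring in Keras is to call the generator's trained dense layers on fresh input tensors, which reuses their weights in a new graph. A sketch, assuming the models built above:

    # The stand-alone encoder was already cut out above as `encoder`.
    # Rebuild the generator with the encoder replaced by a direct
    # parameter input, reusing the trained dense layers' weights.
    params_input = layers.Input(shape=(WAIST,))
    prev_input = layers.Input(shape=(PREV_LEN,))

    x = layers.Concatenate()([params_input, prev_input])
    for layer in generator.layers:
        if isinstance(layer, layers.Dense):  # skip inputs, encoder, concat
            x = layer(x)                     # calling a layer reuses its weights

    standalone_generator = tf.keras.Model([params_input, prev_input], x)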

The following sounds were generated without modifying the parameters; they are very close to the original sound but are not perfect reproductions. The generator network only reaches an accuracy of between 60% and 70%, so some variability is to be expected.
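Generation itself is a simple loop: keep the last 200 samples, predict the next 10, append them, and repeat. A sketch, where the seed and the number of steps are assumptions:

    def generate_chirp(params, seed, n_steps=300):
        """Iteratively extend a 200-sample seed into a full chirp,
        conditioning every step on the encoded parameters."""
        wave = list(seed)
        for _ in range(n_steps):
            prev = np.array(wave[-PREV_LEN:])[np.newaxis, :]
            nxt = standalone_generator.predict(
                [params[np.newaxis, :], prev], verbose=0)[0]
            wave.extend(nxt)  # append the 10 newly predicted samples
        return np.array(wave)

    # Reproduce the original chirp: encode it, then generate from its parameters.
    params = encoder.predict(chirp[np.newaxis, :, np.newaxis], verbose=0)[0]
    reproduction = generate_chirp(params, seed=chirp[:PREV_LEN])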

Sounds generated without modifying the encoded parameters.

Modifying the parameters

The advantage of generating bird sounds is in part that new variations on a theme can be produced. This can be done by modifying the parameters produced by the encoder network. In the above case, the encoder produced these parameters:

Not all of the 20 nodes produced non-zero parameters, but there are enough of them to experiment with. Twelve adjustable parameters, each of which can be pushed to an arbitrary degree in either direction, leave plenty of complexity to explore. Since this is a proof of concept, it will suffice to present some choice sounds, each generated by adjusting a single parameter:
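In code, one such adjustment is just a nudge to the encoded vector before generating; which index to nudge and by how much are arbitrary choices here:

    tweaked = params.copy()
    tweaked[5] *= 2.0  # scale one encoded parameter (index chosen arbitrarily)
    variant = generate_chirp(tweaked, seed=chirp[:PREV_LEN])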

Sounds generated after modifying one of the encoded parameters in each case.

Here are the soundwave representations of the three examples:

Soundwave representation of generated chirps.

Next Steps

It seems that generating bird sounds with a neural network is possible, although it remains to be seen how practicable it is. The above approach uses just a single sound, so a natural next step would be to attempt to train the model on multiple different sounds. It is not clear from the outset that this would work. However, if the model as constructed fails on multiple sounds, it would still be possible to train different models on different sounds and simply stack them to produce different sounds.

A larger problem is that not all produced sounds are viable, particularly when modifying the parameters. A fair share of produced sounds are more akin to computer beeps than bird song. Some sound like an angry computer that really doesn’t want you to do what you just tried to do. One way to mitigate this would be to train a separate model to detect bird sounds (perhaps along these lines), and use that to reject or accept generated output.

Computational costs are also a constraint with the current approach; generating a chirp takes an order of magnitude longer than playing the sound, which is not ideal if the idea is to generate beautiful soundscapes on the fly. The main mitigation which comes to mind here is to increase the length of each prediction, possibly at the cost of accuracy. One could also, of course, simply spend the time to pre-generate acceptable soundscapes.

Conclusion

An autoencoder network and a short-term prediction network can be grafted together to produce a bird sound generator with adjustable parameters, which can be manipulated to create new and interesting bird sounds.

As with many projects, part of the motivation is to learn in the process. In particular, I did not know how to pull apart trained models and graft parts of them together. The models used above can be used as an example to guide other learners who want to experiment with such approaches.
