A Multitask Music Model with BERT, Transformer-XL and Seq2Seq

Andrew Shaw
Towards Data Science
7 min read · Aug 13, 2019


Buzzwordy clickbait title intended, but still a simple concept.

This is Part III of the “Building An A.I. Music Generator” series. I’ll be covering the basics of Multitask training with Music Models — which we’ll use to do really cool things like harmonization, melody generation, and song remixing. We’ll be building off of Part I and Part II.

Background

As you know, Transformers have recently revolutionized the NLP space. As you also probably know, there are several different transformer variations. They all have the same basic attentional layers, but some are specialized for different tasks. Here are 3 of the coolest:

Seq2Seq (Sequence to Sequence Translation) — uses an encoder-decoder architecture to translate between languages. This is the OG transformer that started the revolution.

TransformerXL — this forward-directional decoder is an amazing text generator. Memory and relative positional encoding enable super fast and accurate predictions. We used this model in Part II.

BERT — this bi-directional encoder produced SOTA results in answering questions and filling in the blanks. Token masking and bi-directionality allow for exceptional context.

Dang, all of these variations are just so cool! Instead of having to pick one, why not just combine them?

Wait.. What?!?

Bear with me for a second.

Multitask Model To Rule Them All

If you are training a language model, combining all 3 models doesn’t make sense.

TransformerXL is great for generating text. Seq2Seq is great for language translation. BERT excels at filling in the blanks. No reason you’d ever want to do all three at once.

For music generation, it’s a different story.

Let’s look at a few scenarios you might run into when composing a song:

Task 1. I have a couple of notes I want to make into a song.

TransformerXL is great at sequence generation. Let’s use it to autocomplete your song idea.

Task 2a. My melody needs some harmony.
Task 2b. I have a chord progression. Now I need a hook.

Seq2Seq is great for translation tasks. We can use this to translate melody to chords. Or vice versa.

Task 3a. I’ve got a song now, but something doesn’t sound right.
Task 3b. This song needs a better rhythm.

BERT is great for filling in the blanks. We can erase certain parts of the song and use BERT to generate a new variation.

As you can see, each of these transformer variations can be helpful in generating song ideas. We’re going to train a model that can solve all these tasks.

Play First, Comprehend Later

To understand what we’re trying to do here, it might be helpful to play around with the multitask model we’re trying to create.

Each demo has been generated to solve a specific task. Toggle between the prediction (red notes) and the original (green notes) to hear the difference.

Task 1. Song Generation
“Canon in D” by Pachelbel

Task 2a. Harmonizing the Melody
“Where Is The Love” by Black Eyed Peas

Task 2b. New melody with existing chord progression
“Scary Monsters and Nice Sprites” by Skrillex

Task 3a. Same Beat, Different Song
“Levels” by Avicii

Task 3b. Same Song, Remixed Beat
“Fur Elise” by Beethoven

Building the Monster

You’re back. Now let’s build this thing.

It might sound a little daunting at first, but it’s honestly not complicated. Our multitask model is essentially the Seq2Seq architecture. All we are doing is modifying it to train on different tasks.

I’m going to assume you already know how Seq2Seq translation works. If you don’t, please visit this incredible transformer illustration.

Seq2Seq

Alright, now let’s picture how a Seq2Seq Transformer might work for music.

It’s very similar to translating between languages: chords (the input language) are translated into melodies (the target language).

“Previous output” is fed back into the decoder so it knows what has already been translated.
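
To make this concrete, here is a toy sketch of a single chord-to-melody training pair after tokenization. The token names, IDs, and vocabulary below are invented for illustration; the real project uses its own music encoding.

    # Toy illustration only - not the actual musicautobot encoding.
    vocab = {'<pad>': 0, '<start>': 1, 'C': 2, 'G': 3, 'Am': 4, 'F': 5,
             'e4': 6, 'g4': 7, 'c5': 8, 'b4': 9}

    chord_tokens  = ['C', 'G', 'Am', 'F']                # encoder input (chords)
    melody_tokens = ['<start>', 'e4', 'g4', 'c5', 'b4']  # decoder "previous output" (melody)

    encoder_input = [vocab[t] for t in chord_tokens]     # [2, 3, 4, 5]
    decoder_input = [vocab[t] for t in melody_tokens]    # [1, 6, 7, 8, 9]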

Encoder vs Decoder

The model is split into blue encoder blocks and red decoder blocks. It’s important to know the difference between the two, because we’ll be re-purposing them for our other tasks.

Blue Encoder Blocks are single bi-directional attention layers. They are able to see both previous and future tokens.

Red Decoder Blocks are double-stacked forward attention layers. The double-stacked blocks use both the encoder output and the “previous output” as context to predict the melody. Forward layers are unable to see the future tokens — only the previous tokens.

As you might have guessed, the blue bi-directional encoders are a perfect match for training BERT models. Likewise, the forward decoder layers can be reused for training TransformerXL tasks.
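
To see that difference concretely, here is a small PyTorch sketch of the two attention masks. This is a simplified illustration, not the exact masking code used in the project.

    import torch

    seq_len = 5

    # Encoder (bi-directional): every position may attend to every other position.
    encoder_mask = torch.ones(seq_len, seq_len).bool()

    # Decoder (forward-only): position i may only attend to positions <= i.
    decoder_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

    # decoder_mask looks like:
    # 1 0 0 0 0
    # 1 1 0 0 0
    # 1 1 1 0 0
    # 1 1 1 1 0
    # 1 1 1 1 1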

Let’s take a look.

BERT

Erase some notes and BERT will fill in the blanks:

Masked notes prevent bidirectional layers from cheating

We train the encoder to predict the correct notes (in red) whenever it sees a masked token (in blue).
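
In data terms, that training pair can be built roughly like this. The mask probability and the special token IDs are placeholders, not the values the project actually uses.

    import torch

    MASK_IDX = 3        # placeholder ID for the mask token
    IGNORE_IDX = -100   # positions ignored by the cross-entropy loss
    mask_prob = 0.15    # placeholder masking rate

    def mask_tokens(tokens):
        """Return (masked input, target) for BERT-style training."""
        mask = torch.rand(tokens.shape) < mask_prob
        masked_input = tokens.clone()
        masked_input[mask] = MASK_IDX      # hide the chosen notes
        target = tokens.clone()
        target[~mask] = IGNORE_IDX         # only score the masked positions
        return masked_input, target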

TransformerXL — Next Word

You’ll recognize this diagram from the previous posts.

Next tokens are attention masked to prevent cheating

We train the decoder to predict the next token by shifting the target over by one.
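
“Shifting the target over by one” is the standard language-modeling setup. A minimal sketch:

    import torch

    # A short encoded note sequence (IDs are made up for illustration).
    tokens = torch.tensor([5, 8, 2, 9, 4, 7])

    decoder_input = tokens[:-1]   # [5, 8, 2, 9, 4]
    target        = tokens[1:]    # [8, 2, 9, 4, 7] - predict the next token at each step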

Putting it all together

As you can see, the Seq2Seq model is a combination of the BERT encoder and the TransformerXL decoder. This means we can reuse the encoder and decoder from the Seq2Seq model to train on the BERT and TransformerXL tasks. The only thing that changes is the input and the target.

Here’s a reminder of our 3 tasks from before:

Task 1. Music generation with TransformerXL

Task 2a/2b. Melody-to-Chords/Chords-to-Melody translation with Seq2Seq

Task 3a/3b. Song remixing with BERT

Solving Task #2

Earlier you saw how to do Chords-to-Melody translation (2b). The Melody-to-Chords task (2a) is exactly the same, but with the input and target flipped.
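
In training-data terms, flipping the task just means swapping which sequence goes to the encoder and which goes to the decoder. A schematic sketch, where melody_ids and chord_ids stand in for tokenized sequences and the 'enc'/'dec' keys match the forward pass described below:

    batch_2b = {'enc': chord_ids,  'dec': melody_ids}   # 2b: chords -> melody
    batch_2a = {'enc': melody_ids, 'dec': chord_ids}    # 2a: melody -> chords (flipped)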

Solving Tasks #1 and #3

Since BERT only uses the encoder layers and TransformerXL only uses the decoder layers, Task #1 and Task #3 can be trained at the same time. On the BERT side, we mask the input and send it through the encoder. In parallel, we’ll feed a shifted input to the decoder to train TransformerXL.

Here’s what that looks like:

  • Encoder is trained on the masked task.
  • Separately and in parallel, the decoder is trained on the next token task.

Note that the decoder only has a single arrow as input. It doesn’t use the encoder output for this task.

Voila! Here’s our model in code

Hopefully this model is pretty clear to those of you who’ve used PyTorch.

Model architecture:
  • Encoder - bi-directional attention layers
  • Decoder - uni-directional, double-stacked attention layers
  • Head - dense layer for token decoding

Forward prop:
  1. If the input is masked ('msk'), train the encoder.
  2. If the input is shifted ('lm'), train the decoder.
  3. If the input contains both translation input ('enc') and previous tokens ('dec'), use both the encoder and the decoder.
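
A minimal PyTorch sketch of that dispatch logic might look like the following. The class, its constructor arguments, and the decoder’s optional context argument are illustrative, not the exact musicautobot code; see the linked notebook and repo for the real implementation.

    import torch.nn as nn

    class MultitaskTransformer(nn.Module):
        """Sketch: one shared encoder/decoder pair serving all three tasks."""

        def __init__(self, encoder, decoder, head):
            super().__init__()
            self.encoder = encoder   # bi-directional attention layers
            self.decoder = decoder   # forward-only, double-stacked attention layers
            self.head = head         # dense layer mapping hidden states to tokens

        def forward(self, batch):
            out = {}
            if 'msk' in batch:                      # BERT task: masked input
                out['msk'] = self.head(self.encoder(batch['msk']))
            if 'lm' in batch:                       # TransformerXL task: shifted input
                out['lm'] = self.head(self.decoder(batch['lm']))
            if 'enc' in batch and 'dec' in batch:   # Seq2Seq task: translation pair
                context = self.encoder(batch['enc'])
                # assumes the decoder can take the encoder output as extra context
                out['s2s'] = self.head(self.decoder(batch['dec'], context))
            return out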

Run this notebook for actual training.

That’s all there is to training a multitask model.

It really is just an encoder/decoder model trained on various types of inputs and outputs.

Masked tokens train the encoder (BERT). Shifted tokens train the decoder (TransformerXL). Paired sequences train both (Seq2Seq).

On to the Results

If you played with the examples at the beginning of this post, then you’ve already seen the results. The musicautobot web app is powered by the multitask transformer. Rather than listening to me drone on about the results, you’ll have more fun if you head back over and generate them yourself!

To enable Tasks 1, 2a, 2b, 3a, and 3b, just toggle this switch:

An Alternative Playground

If you’re feeling more hands-on, this Python notebook generates all the same examples in code. You’ll get a better sense of how predictions work by running through that notebook.

Taking the results even further

Woohoo! We can generate some cool results, but… it seems to be missing some of that pop music magic. In our next post, we’ll uncover some of that magic sauce.

Part IV. Using a Music Bot to Remix The Chainsmokers — We’re down to my final and favorite post of the series!

Thanks for bearing with me!

Special Thanks to Kenneth, Jeremy Howard, SPC and PalapaVC for support.
