
Everything Product People Need to Know About Transformers (Part 1)

Or, How to Act Like You Know About the Biggest AI Development since CNNs

This is Part 1 of a three-part series on Transformers for Product People. Click here for Part 2.

Source: https://thehustle.co/07202020-gpt-3/

Pay attention. Natural language processing (NLP) has passed an industry-changing inflection point. More than 20 long-standing NLP challenges have been solved with near-human results in the past year, all by a single model: the attention-based transformer. The transformer was developed and published in December 2017, and it has since kicked off an arms race between Google and OpenAI, with both labs shattering state-of-the-art results with each new model release. With models like GPT-3 making a splash in the media, decision makers are wondering just how big this development is. This series is designed to walk product managers, founders, and investors through new developments in NLP research, explaining only the essential technical aspects while focusing mainly on the business and user-experience implications.

Part 1: Transformers

First thing to know: text data is sequential (you’re reading this left to right, right?). Traditionally, NLP models tackle language problems recursively, i.e. moving along the text from left to right, step by step. You can imagine that each recursive step involves handing off some of the information from what the model just read that might be relevant for the current task. After all, the model can only retain so much information at each step. Going word by word requires that the model make the right hand-off 50 or 100 times in order to remember you mentioned you were vegetarian at the beginning of the conversation.
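
To make the hand-off problem concrete, here is a toy sketch in Python. Everything in it (the sizes, the random weights, the helper name) is invented for illustration and is not any real model, but it shows the core constraint: the only thing that survives from one word to the next is a small, fixed-size state vector.

```python
# Toy, made-up sketch of word-by-word ("recursive") reading.
# The only thing carried between steps is a small state vector, so anything
# mentioned early must survive many hand-offs to still matter at the end.
import numpy as np

rng = np.random.default_rng(0)

def toy_recurrent_read(word_vectors, state_size=8):
    """Read a sequence one word at a time, handing off a fixed-size state."""
    W_state = 0.1 * rng.normal(size=(state_size, state_size))
    W_input = 0.1 * rng.normal(size=(state_size, word_vectors.shape[1]))
    state = np.zeros(state_size)
    for x in word_vectors:                     # one hand-off per word
        state = np.tanh(W_state @ state + W_input @ x)
    return state                               # everything the model "remembers"

# 100 words in, one small vector out: "I'm vegetarian" at word 3 has to
# survive roughly 97 hand-offs to influence the final answer.
sequence = rng.normal(size=(100, 16))
print(toy_recurrent_read(sequence).shape)      # (8,)
```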

Now, a transformer model is an encoder-decoder framework for sequence to sequence modeling that "relies entirely on an attention mechanism to draw global dependencies between input and output" (Vaswani et al., 2017) [1]. The important thing to know about the transformer model is that it shifted NLP models from tackling language problems word-by-word to tackling them sequence-by-sequence. Going sequence by sequence requires making the correct hand-off maybe once. All of a sudden, long range dependencies can be effectively captured. But that’s just the tip of the iceberg.
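
If you want to poke at one of these encoder-decoder models yourself, PyTorch ships a stock implementation. The sketch below is purely illustrative, with layer counts and embedding sizes I picked rather than the paper’s configuration, but it shows the shape of the idea: the model takes in a whole source sequence and a whole target sequence at once, rather than one word at a time.

```python
# Minimal sketch of the encoder-decoder transformer using PyTorch's built-in
# nn.Transformer. The sizes below are illustrative choices, not the paper's.
import torch
import torch.nn as nn

model = nn.Transformer(d_model=64, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2)

src = torch.rand(20, 1, 64)   # source sequence: 20 tokens, batch of 1, 64-dim embeddings
tgt = torch.rand(12, 1, 64)   # target sequence so far: 12 tokens

out = model(src, tgt)         # every output position can draw on every input position
print(out.shape)              # torch.Size([12, 1, 64])
```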


More important than what transformers model between sequences is what they model within sequences.

My first encounter with encoder-decoder models came from the domain of computer vision, where Convolutional Neural Networks (CNNs) are leveraged to abstract meaningful features from the input image. Attention is driving a development in NLP very similar to the one CNNs drove in computer vision. What makes attention similar to convolutions is that both are techniques for distilling features from a tensor.

Technically speaking, convolutions create a logarithmic path length between any two pixels in an input image, while attention allows for a constant path length between any two words in the input sequence. Both create efficient ways to model dependencies within the input. Additionally, both are trivial to parallelize, which enables modeling of numerous independent features.
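
To see that constant path length in code, here is a back-of-the-envelope sketch of single-head scaled dot-product attention in plain NumPy (the sequence length and dimensions are made up for illustration). The thing to notice is the sequence-by-sequence score matrix: every word interacts with every other word in a single matrix multiply, no matter how far apart they sit.

```python
# Back-of-the-envelope single-head scaled dot-product attention in NumPy.
# Sizes are illustrative only.
import numpy as np

def attention(Q, K, V):
    """Q, K, V: (sequence_length, d) arrays of query/key/value vectors."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (seq, seq): every word scores every other word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the whole sequence
    return weights @ V                               # each output mixes information from all positions

seq_len, d = 6, 8
X = np.random.default_rng(0).normal(size=(seq_len, d))
out = attention(X, X, X)                             # self-attention: the sequence attends to itself
print(out.shape)                                     # (6, 8)
```

Make the sequence ten times longer and the score matrix grows, but the number of steps between any two words stays at exactly one.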

Practically speaking, CNNs take in a full image and distill meaningful characteristics of the image like colors, textures, shadows, and outlines, all at the same time. Attention does the same thing for a sentence or paragraph, with similarly powerful results. Transformers distill meaningful information from a sequence far better than any previous NLP model.


How Big a Deal is This Really?

Attention brought to text a lot of the same capabilities that CNNs brought to vision. To understand the coming impact of attention-based transformers, just look at what CNNs did to the AI market in 2012, when AlexNet smashed the ImageNet image classification challenge and computer vision suddenly became commercially viable.

Source: CB Insights

AI acquisitions saw a more than 6x uptick from 2013 to 2018, including 2018’s record of 166 AI acquisitions – up 38% year-over-year [2]. With new developments in computer vision like self-driving cars and automated drones on the horizon, the computer vision market looks like it still has room to run.

Basically, NLP market growth should look a lot like computer vision market growth. Of course, the two markets also have their differences. Want to understand them? Just answer a few massive questions:

  1. What is the market size and growth potential of text processing compared to image processing?
  2. What does the existence of a robust computer vision market with powerful CNN technology mean for the adoption and development of new NLP technology?
  3. How do market conditions for the coming 10 years compare to market conditions for the past 10 years?

Applications

What’s the transformer for? How do you use it? What are the applications? In addition to the original transformer, two models built on major modifications to it are making waves in the space: GPT, developed by OpenAI, and BERT, developed by Google Research. Each has its strengths and weaknesses, unless you ask someone from Google Research or OpenAI. Breaking down their architectures will clearly show their applications, strengths, and weaknesses. (My next post will be devoted to GPT, so stay tuned.)

While each model will likely capture a niche over the next few years, Raffel et al. (2020) [3] recently demonstrated that:

  1. The original transformer model is better suited for text-to-text tasks than either BERT or GPT architectures if scaled up.
  2. All NLP tasks can be framed as text-to-text tasks.

Takeaway: If you’re on a budget, the transformer model is best suited to translating from one sequence to another. The diagram below gives examples of this kind of translation.

Source: Exploring Transfer Learning with T5: the Text-To-Text Transfer Transformer

Note: Green and blue flows are true translation, which the transformer model is best optimized for. Red and yellow flows can instead be framed as text-to-scalar tasks (e.g. outputs of 1 or 0 for the red flow).

For non-translation tasks (like the red and yellow flows), other model architectures (which I will cover in future posts) are better suited. If you’re not on a budget, just use a huge transformer for everything.

As shown in the above example, translation is broad. It can be between languages (English to German) as well as between styles (long-form to summarized form). Translation runs the gamut across languages, styles, syntaxes, genres, you name it. You can turn natural language into code, formal language into slang, or articles into titles and descriptions. Since most of these exciting experiments have actually been carried out with GPT-3, however, I will save a more in-depth walk-through for the next post.
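
To make the text-to-text framing concrete, here is a short sketch using the HuggingFace transformers library with the public t5-small checkpoint. The checkpoint choice and the prompts are mine, picked for illustration, not the authors’ setup; the point is that the task prefix is the only thing that changes between translating and summarizing.

```python
# Hedged sketch of T5-style text-to-text usage via HuggingFace transformers.
# "t5-small" is a small public checkpoint chosen here for illustration.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def text_to_text(prompt):
    """One interface for every task: a string in, a string out."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_length=50)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Same model, same call; only the prefix changes the task.
print(text_to_text("translate English to German: The house is wonderful."))
print(text_to_text("summarize: Natural language processing has passed an "
                   "industry-changing inflection point, driven by the "
                   "attention-based transformer and its descendants."))
```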

References

[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.

[2] CB Insights. "The Race For AI: Here Are The Tech Giants Rushing To Snap Up Artificial Intelligence Startups." CB Insights Research, CB Insights, 14 Aug. 2020, www.cbinsights.com/research/top-acquirers-ai-startups-ma-timeline/.

[3] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.

