Longformer: The Long-Document Transformer

Understanding Transformer-Based Self-Supervised Architectures

Rohan Jagtap
Towards Data Science
7 min read · Dec 1, 2020



Transformer-based language models have been leading the NLP benchmarks lately. Models like BERT and RoBERTa have been state-of-the-art for a while. However, one major drawback of these models is that they cannot “attend” to longer sequences. For example, BERT is limited to a maximum of 512 tokens.
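To make that limit concrete, here is a minimal sketch (assuming the Hugging Face transformers library) that tokenizes a long document with a standard BERT tokenizer; anything past the 512-token cap is simply cut off.

```python
# Minimal sketch, assuming the Hugging Face "transformers" library is installed.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.model_max_length)  # 512 for bert-base-uncased

# A long document gets truncated to the model's maximum length.
long_text = "word " * 10_000
encoded = tokenizer(long_text, truncation=True, max_length=512)
print(len(encoded["input_ids"]))  # 512 -- everything beyond is discarded
```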

