Beyond Word Embeddings Part 3

Aaron (Ari) Bornstein
Towards Data Science
8 min read · Oct 31, 2018


Four Common Flaws in State of the Art Neural NLP Models.

TLDR;

Since the advent of word2vec, neural word embeddings have become a go-to method for encapsulating distributional semantics in NLP applications. This series will review the strengths and weaknesses of using pre-trained word embeddings and demonstrate how to incorporate more complex semantic representation schemes such as Semantic Role Labeling, Abstract Meaning Representation and Semantic Dependency Parsing into your applications.

Introduction

The previous posts in this series reviewed some of the recent milestones in neural NLP as well as methods for representing words as vectors and the progression of the architectures for making use of them.

Even though neural NLP has led to many breakthroughs, as seen in the previous posts, state of the art models built on pre-trained word embeddings and attention mechanisms still exhibit many flaws. These flaws have dangerous implications for production systems that allow little to no margin for imprecision and unpredictability.

This post will dive deeper into the pitfalls of modern state of the art NLP systems.

Flaw #1: Lexical Inference

Despite improvements from attention based models and better embeddings, state of the art neural NLP models still struggle to infer missing information. Let’s look at the neural natural language inference model below.

This model struggles to recognize that the text “Jack and Jill climb Everest” entails that two hikers are climbing a mountain, and incorrectly labels the pair as neutral.

The NLI model’s failure to capture this relationship is counterintuitive, since in theory the embeddings for (Everest, mountain) and (hikers, climbers) should sit close together in the embedding space. In practice, it is hard to predict in advance exactly which features the attention model will align on and why.
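As a quick sanity check on that intuition, you can measure how close those word pairs actually sit in a pretrained embedding space. The sketch below is a minimal illustration using gensim’s downloadable GloVe vectors; the specific vector set (glove-wiki-gigaword-100) is just an assumption, and any pretrained embeddings would do.

```python
# A minimal sketch: check how close the relevant word pairs actually sit
# in a pretrained embedding space (here, GloVe vectors via gensim's downloader).
import gensim.downloader as api

# Downloads the vectors on first use.
vectors = api.load("glove-wiki-gigaword-100")

for a, b in [("everest", "mountain"), ("hikers", "climbers"), ("climb", "hike")]:
    print(f"cosine({a}, {b}) = {vectors.similarity(a, b):.3f}")
```

Even when these pairs score highly, there is no guarantee the attention mechanism will actually align on them at inference time, which is exactly the unpredictability described above.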

An additional challenge related to lexical inference is modeling arithmetic relations. Though the recent development of mechanisms such as Neural Arithmetic Logic Units (NALU) has improved the ability to model basic numerical relations, state of the art models still tend to struggle with numerical concepts, such as recognizing that when “Jack and Jill” appear together they should align with the word “two”.
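To make the NALU mechanism concrete, here is a minimal single-cell sketch in PyTorch following the formulation in Trask et al. (2018): an additive path biased toward exact addition and subtraction, a log-space path for multiplication and division, and a learned gate that mixes the two. The layer sizes, initialization, and epsilon are illustrative choices, not a reference implementation.

```python
import torch
from torch import nn

class NALUCell(nn.Module):
    """Minimal Neural Arithmetic Logic Unit cell (Trask et al., 2018).

    An additive path (the NAC) handles +/- style relations, a log-space
    path handles multiplicative relations, and a learned gate mixes them.
    """
    def __init__(self, in_dim: int, out_dim: int, eps: float = 1e-8):
        super().__init__()
        self.eps = eps
        self.W_hat = nn.Parameter(torch.empty(out_dim, in_dim))
        self.M_hat = nn.Parameter(torch.empty(out_dim, in_dim))
        self.G = nn.Parameter(torch.empty(out_dim, in_dim))
        for p in (self.W_hat, self.M_hat, self.G):
            nn.init.xavier_uniform_(p)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Weights constrained toward {-1, 0, 1}, which biases the cell
        # toward exact addition/subtraction rather than arbitrary scaling.
        W = torch.tanh(self.W_hat) * torch.sigmoid(self.M_hat)
        add = x @ W.t()                                # additive (NAC) path
        log_x = torch.log(torch.abs(x) + self.eps)
        mul = torch.exp(log_x @ W.t())                 # multiplicative path
        g = torch.sigmoid(x @ self.G.t())              # learned gate
        return g * add + (1 - g) * mul
```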

For more information on NALU, see the original paper, Neural Arithmetic Logic Units (Trask et al., 2018).

Flaw #2: Superficial Correlations

Another challenge with current state of the art models is that they learn superficial correlations in their training data, which leads them to capture implicit social biases.

Below are a couple examples of models that capture such implicit biases.

Higher numbers correlate with positive sentiment, whereas lower numbers correlate with negative sentiment.

Note how, even though the “my name is ___” and “let’s go get ____ food” statements should produce near equivalent sentiment regardless of ethnicity, the neural models above capture the implicit bias of the datasets they were trained on. Imagine if these models were applied to tasks such as processing resumes or modeling insurance risk.
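One simple way to surface this behavior is a template probe: score a set of sentences that differ only in the identity term and look at the spread. The sketch below assumes a hypothetical sentiment_score callable standing in for whatever model you want to audit; the templates and fillers are illustrative.

```python
# Templated probes: the only thing that varies is the identity term,
# so any systematic score gap is evidence of learned bias.
PROBES = {
    "My name is {}.": ["Emily", "Shaniqua", "Mohammed", "Chen"],
    "Let's go get {} food.": ["Italian", "Mexican", "Ethiopian", "Chinese"],
}

def audit(sentiment_score):
    """sentiment_score: hypothetical callable mapping a sentence to a float,
    higher = more positive. A stand-in for the model under audit."""
    for template, fillers in PROBES.items():
        scores = {f: sentiment_score(template.format(f)) for f in fillers}
        spread = max(scores.values()) - min(scores.values())
        print(template, scores, f"spread={spread:.2f}")
```

Any non-trivial spread across fillers that should be sentiment-neutral is a warning sign before deploying the model on tasks like resume screening or risk scoring.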

Recently, Elazar and Goldberg have done some early work on removing bias with adversarial training; however, even though some progress has been made, this is still very much not a solved problem.
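One common building block in this line of adversarial training (popularized in the domain adaptation literature and used in several debiasing setups) is a gradient reversal layer: an adversary tries to predict the protected attribute from the encoder’s representation, and the reversed gradient pushes the encoder to discard that information. Below is a minimal PyTorch sketch of the layer itself; how it is wired between an encoder and an attribute classifier is an assumption of your architecture.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips and scales the gradient on the
    backward pass, so the encoder is trained to *hurt* the adversary."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd: float = 1.0):
    # Insert between the encoder output and the adversary's attribute head.
    return GradReverse.apply(x, lambd)
```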

Beyond the issue of bias, superficial correlation leads to inflated estimates of the performance of state of the art models. On many large datasets, such as SQuAD or SNLI, the high performance of many neural attention models is more a result of learning superficial correlations shared between the training and test sets than of actually solving the underlying problem.

When claims are made that a model outperforms humans on a test set, this is a potential reason why. One example of a superficial correlation is that a strong baseline on the SNLI dataset can be built just by looking at the sentence lengths of the premise and hypothesis.
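Here is a hedged sketch of what such a length-only baseline looks like: featurize each premise/hypothesis pair by token counts alone and fit a linear classifier. The handful of pairs below are toy stand-ins; a real experiment would load the actual SNLI training and test splits.

```python
from sklearn.linear_model import LogisticRegression

def length_features(premise: str, hypothesis: str):
    # No lexical content at all: just lengths and their relationship.
    p, h = len(premise.split()), len(hypothesis.split())
    return [p, h, h - p, h / max(p, 1)]

# Toy stand-ins for (premise, hypothesis, label) triples from SNLI.
train = [
    ("A man inspects the uniform of a figure.", "The man is sleeping.", "contradiction"),
    ("Two women are embracing.", "Two women are showing affection.", "entailment"),
    ("A soccer game with multiple males playing.", "Some men are playing a sport.", "entailment"),
    ("An older man drinks juice.", "A man is at a cafe ordering juice.", "neutral"),
]

X = [length_features(p, h) for p, h, _ in train]
y = [label for _, _, label in train]

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict([length_features("A dog runs through a field.", "An animal is outside.")]))
```

That a classifier with no access to the words at all can beat chance by a wide margin says more about the dataset than about language understanding.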

Flaw #3: Adversarial Evaluation

A by-product of these superficial correlations is that they expose models to adversarial attack. Take, for example, the BiDAF model that, as we saw in the first post, correctly returns “Prague” for the question “What city did Tesla move to in 1880?”.

When we add a more explicit but clearly out of place reference to a city that someone who is not Tesla moved to at the end of the paragraph, the BiDAF model returns the incorrect answer “Chicago”.

From Iyyer and collaborators.
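Reproducing this style of attack is straightforward: append a distractor sentence that mimics the surface form of the question but does not answer it. The sketch below assumes a hypothetical answer(question, context) callable wrapping whatever reading comprehension model is under test (a BiDAF predictor, for example); the passage and distractor text are illustrative.

```python
def adversarial_probe(answer, question: str, context: str, distractor: str):
    """answer: hypothetical callable (question, context) -> str wrapping
    the QA model under test. Returns the clean and attacked predictions."""
    clean = answer(question, context)
    attacked = answer(question, context + " " + distractor)
    return clean, attacked

question = "What city did Tesla move to in 1880?"
context = (
    "Tesla moved to the city of Prague in 1880 to attend lectures at the "
    "Charles-Ferdinand University."
)
# A sentence that mimics the question's surface form but is about someone else.
distractor = "Tadakatsu moved to the city of Chicago in 1881."

# clean, attacked = adversarial_probe(bidaf_answer, question, context, distractor)
# A robust model returns "Prague" both times; a brittle one flips to "Chicago".
```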

The implications of this are quite shocking, as one could imagine using adversarial examples to manipulate and exploit critical production NLP systems in industries such as digital healthcare, digital insurance, and digital law, among others.

Flaw #4: Semantic Variability

In natural language there are many ways, or paraphrases, to express the same concept, relation, or idea. The next post in this series will dive deeper into this challenge.

One potential indicator that superficial correlations are driving many state of the art results is that, even though a model seems to perform as expected on a given test example, it can fail when the same example is expressed with a different syntactic structure.

The example below is a great demonstration of this: even though the sentences “There is no pleasure in watching a child suffer” and “In watching the child suffer, there is no pleasure” express the same sentiment, one of the state of the art tree-structured bi-LSTM models for sentiment classification gives them two completely different classifications.

Iyyer and collaborators broke the tree-structured bidirectional LSTM sentiment classification model.
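A practical way to catch this failure mode before it reaches production is a paraphrase consistency check: run the classifier on pairs of sentences that express the same meaning and flag any disagreement. The sketch below assumes a hypothetical classify(sentence) callable for the model under test; the paraphrase pairs are illustrative.

```python
PARAPHRASE_PAIRS = [
    ("There is no pleasure in watching a child suffer.",
     "In watching the child suffer, there is no pleasure."),
    ("The movie was a waste of two hours.",
     "Two hours of my life, wasted on this movie."),
]

def consistency_check(classify):
    """classify: hypothetical callable mapping a sentence to a label.
    Reports every pair where syntax alone changed the prediction."""
    failures = []
    for a, b in PARAPHRASE_PAIRS:
        la, lb = classify(a), classify(b)
        if la != lb:
            failures.append((a, b, la, lb))
    return failures
```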

To be robust, a model needs to generalize beyond memorizing correlations tied to common syntactic structures.

Innate Prior Debate

The challenges outlined above have contributed to an ongoing debate on whether introducing structure to guide neural models is a necessary good or a necessary evil.

“Is structure a necessary good or evil?”

Yann LeCun and Christopher Manning

Two leading figures in this debate are Christopher Manning from the Stanford NLP Group and Yann LeCun from Facebook AI Research.

While both Manning and LeCun believe that introducing better structure is necessary to mitigate some of the flaws in neural NLP systems showcased above, they disagree on how that structure should be introduced.

Manning argues that introducing structural bias is a necessary good that enables better generalization from less data and higher-order reasoning. He argues that the manner in which many researchers represent structure with neural methods reduces to “glorified nearest neighbor learning”, which disincentivizes researchers from building better architected learning systems that require less data.

LeCun, on the other hand, feels that introducing structure into neural models is a “necessary evil” that forces model designers to make limiting assumptions. He views structure as merely “a meta-level substrate that is required for optimization to work”, and believes that with more representative datasets and more computational power, more naive models should be able to learn equally strong generalizations directly from the data.

For more on this debate, see the recorded Stanford discussion between LeCun and Manning, “Deep Learning, Structure and Innate Priors.”

Next Post

The truth most likely falls somewhere between LeCun’s and Manning’s positions. Meanwhile, most NLP researchers in the field agree that addressing the challenges outlined in this post requires better mechanisms to model and incorporate semantic understanding.

Most state of the art models in NLP process text sequentially and attempt to model complex relations within the data implicitly. Though it is true that humans read text sequentially, the relations between individual words and concepts are not sequential.

While formalisms for representing linguistic structure computationally are not new to the field, in the past the cost of formalizing rules and annotating relations has been prohibitively expensive. Advances in sequential parsing provided by neural NLP enable new avenues for practical applications and for the development of these systems.

Regardless of which side of the implicit vs. structural learning debate you fall on, the trend towards combining semantic structure with better implicit textual representations is a promising way to address the pitfalls of current state of the art systems.

If you side with LeCun, the hope is that as more generalizable, dynamic, and data driven formalisms are developed, they will provide better insight into how to build better implicit models; if you side with Manning, the goal is to use modeling to create more robust semantic formalisms that enable better modeling with less data.

In the next post in this series, we will look at some promising approaches for modeling semantic structure and walk through tools to incorporate this information into practical NLP applications.

Call To Action

Below are some resources to get a better understanding of the challenges above.

If you have any questions, comments, or topics you would like me to discuss, feel free to follow me on Twitter. If there is a milestone you feel I missed, please let me know.

About the Author

Aaron (Ari) Bornstein is an avid AI enthusiast with a passion for history, engaging with new technologies, and computational medicine. As an Open Source Engineer on Microsoft’s Cloud Developer Advocacy team, he collaborates with the Israeli hi-tech community to solve real world problems with game changing technologies that are then documented, open sourced, and shared with the rest of the world.
