NLP, Machine Learning

Cross-Topic Argument Mining: Learning How to Classify Texts

Classifying cross-topic natural language texts based on their argumentative structure using deep learning

Stephen Adhisaputra
Towards Data Science
10 min read · Jan 26, 2021



Argument mining, or argumentation mining, is a research topic in natural language processing (NLP) and knowledge representation learning. Argumentation deals with logical reasoning and is inherent to human intelligence. The goal of argument mining (AM) is to teach machines about argumentative structures: AM automates the identification and classification of arguments within texts, which enables targeted searching for arguments related to a certain topic. Powerful AM systems can play an important role in decision making, persuasive writing, and legal reasoning.

This article shows a straightforward, easy-to-understand, and quick implementation of AM in Python using machine learning libraries. The goal of this work is to create a classification pipeline for mining argumentative texts, given a dataset of cross-topic sentences from various online sources.

Formal definition:

“Argumentation is a verbal, social, and rational activity aimed at convincing a reasonable critic of the acceptability of a standpoint by putting forward a constellation of propositions justifying or refuting the proposition expressed in the standpoint” — [3].

Dataset

The UKP Sentential AM Corpus [1] was introduced by the Computer Science Department at TU Darmstadt in 2018. The dataset consists of texts covering controversial topics, collected with the Google Search API from 400 online articles. Each sentence was annotated via crowdsourcing (Amazon Mechanical Turk). Because of copyright restrictions, the authors did not release the complete dataset. However, the annotations and a Java program for collecting the sentences from the online articles are available under a free license; you can download the program from their website to obtain the complete dataset. Later on, the corpus will be explored alongside my own AM pipeline implementation.

“For each topic, we made a Google query for the topic name, removed results not archived by the Wayback Machine, and truncated the list to the top 50 results.” — [1]

Implementation

Assume that the data has already been cleaned and stored in Excel tables after pre-processing (i.e., removing HTML tags and separating sentences). The data consists of rows of sentences and annotations, totaling about 25,000 input-output pairs.

The data consists of eight controversial topics: abortion, cloning, death penalty, gun control, marijuana legalization, minimum wage, nuclear energy, and school uniforms.

Topic distribution — Image by Author

Exploring the data, I found that more than half of the labeled sentences are not arguments. This makes sense, because argumentative sentences are standpoints (claims) that are supported, disproved, or elaborated by surrounding non-argument statements (premises). The distribution of the labels can be seen in the graph below. There are three labels for the text classification problem:

  • Not argument
  • Argument against
  • Argument for
Label distribution — Image by Author

My data pipeline for this AM task is as follows:

Argument mining pipeline — Image by Author

Now, let’s jump straight into coding.

Import libraries:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import string
from string import punctuation
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
import tensorflow as tf

Read the data from the Excel table and plot the topic distribution:

data_raw= pd.read_excel('data_raw.xlsx')     
fig, ax = plt.subplots(figsize=(16,4))
sns.countplot(x='topic',ax=ax,data=data_raw)

The training data will include only seven topics; school uniforms is held out as the topic of the test set.

train_raw = data_raw[data_raw.topic != 'school uniforms']
test_set = data_raw[data_raw.topic == 'school uniforms']

The objective is to validate the prediction capability of the model.

For cross-topic: given the school uniforms test set, the task of the model is to classify every sentence into one of the three labels without ever having seen the topic before.

Pre-processing

Many words are commonly used in English sentences regardless of argumentative structure. Additionally, models operate on numbers rather than raw text. Removing stop words and one-hot encoding the labels are therefore common steps in natural language processing pipelines.

Removing stop words

Stop words are English words that add little meaning to a sentence. They can be filtered without significantly changing the essence of the sentence.

def remove_stopwords(text):
    stpword = stopwords.words('english')
    #Strip punctuation characters from the sentence
    no_punctuation = ''.join([char for char in text if char not in string.punctuation])
    #Drop English stop words (case-insensitive)
    return ' '.join([word for word in no_punctuation.split() if word.lower() not in stpword])

train_raw['data'] = train_raw['sentence'].apply(remove_stopwords)
test_set['data'] = test_set['sentence'].apply(remove_stopwords)

For example, the words the, he, you, we, as, for, have, this, that, do, and in are considered stop words in English. Removing them decreases the size of the data and improves training efficiency later on.

Example sentences — Image by Author

One-hot encoding

This technique represents categorical data as binary vectors. Since we have three labels, each annotation is encoded into a three-digit binary vector, where ‘1’ marks the true label and every other position is zero.
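
For intuition, here is a toy illustration of what get_dummies produces on a small made-up series of the three labels (the strings below are placeholders, not the exact annotation values in the corpus):

#Toy example: one-hot encoding three placeholder labels with pandas
#Each row gets a single 1 in the column of its own label and 0 elsewhere
toy = pd.Series(['Not argument', 'Argument for', 'Argument against'])
print(pd.get_dummies(toy))

The actual training and test labels are encoded the same way below.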

#Training data
train_df = train_raw.copy()
train_one_hot = pd.get_dummies(train_df['annotation'])
train_df = pd.concat([train_df['data'], train_one_hot], axis=1)
y_train = train_df.drop('data', axis=1).values
#Test data
test_df = test_set.copy()
test_one_hot = pd.get_dummies(test_df['annotation'])
test_df = pd.concat([test_df['data'], test_one_hot], axis=1)

This experiment deals with three labels, so there are three columns representing them. A sample of the training data is shown below.

One-hot encoding — Image by Author

Feature Extraction

Before extracting features, the input sentences must be converted into vectors. For this purpose, a vocabulary based on the training data is built using CountVectorizer() from the scikit-learn library. This process of converting text into vectors is sometimes loosely referred to as “tokenization.”

from sklearn.feature_extraction.text import CountVectorizer
#Define input: use the cleaned 'data' column for both sets
sentences_train = train_df['data'].values
sentences_test = test_df['data'].values
#Convert sentences into vectors using the training vocabulary
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(sentences_train)
X_test = vectorizer.transform(sentences_test)
List of vocabulary — Image by Author
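
To see concretely what the count matrix looks like, here is a small self-contained sketch on a made-up three-sentence corpus (the sentences are invented for illustration only):

#Toy example: building a count matrix for a tiny made-up corpus
from sklearn.feature_extraction.text import CountVectorizer
toy_corpus = ['nuclear energy is clean',
              'nuclear energy is dangerous',
              'school uniforms limit expression']
toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_corpus)
#Each row is a sentence, each column a vocabulary word, each cell a raw count
print(sorted(toy_vectorizer.vocabulary_))
print(toy_counts.toarray())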

The training dataset is then fitted to this vocabulary. The method fit_transform() converts each sentence into a vector of length 21,920, so X_train becomes a matrix of size (number of sentences × 21,920). This is called the count matrix, since the frequency of each vocabulary word is counted per sentence to represent the data mathematically. To get a better understanding of the data, let’s visualize the token frequency.

#Visualize word frequency
from yellowbrick.text import FreqDistVisualizer
features = vectorizer.get_feature_names()
visualizer = FreqDistVisualizer(features=features, size=(800, 1000))
visualizer.fit(X_train)
visualizer.finalize()
Word frequency distribution — Image by Author

hapax (noun)

a word that appears only once in a work or genre of literature, or in the body of work of a particular author [www.collinsdictionary.com].

It can be seen that the token “would” appears most frequently, followed by “death” and “people”. Unsurprisingly, controversial topic words such as nuclear, cloning, abortion, and marijuana all feature in the list of the most common vocabulary.

Tf-idf

Term frequency-inverse document frequency (tf-idf) assigns a weight to each word to extract features. This statistical measure is common in information retrieval applications such as search engines and recommender systems.

from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer()
X_train = tfidf.fit_transform(X_train).toarray()
#Apply the same weighting to the test counts
X_test = tfidf.transform(X_test).toarray()
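
To sanity-check the weighting, one can peek at the terms with the highest tf-idf scores in a single training sentence. A minimal sketch, reusing the vectorizer and X_train defined above:

#Inspect the five top-weighted terms of the first training sentence
import numpy as np
feature_names = vectorizer.get_feature_names()
top_idx = np.argsort(X_train[0])[::-1][:5]
for idx in top_idx:
    print(feature_names[idx], round(float(X_train[0][idx]), 3))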

Model

The deep learning model in my pipeline is simply a fully-connected neural network. The hidden layers use L2 regularization and the ReLU activation function, while the output layer uses softmax. Dropout is applied after each hidden layer to prevent the model from memorizing the training data.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.regularizers import l2
def create_deep_model(factor, rate):
    model = Sequential()
    #Hidden layers: L2 regularization, ReLU activation, dropout after each layer
    #The input dimension equals the vocabulary size (21,920 here)
    model.add(Dense(units=4096, input_dim=X_train.shape[1], kernel_regularizer=l2(factor), activation='relu'))
    model.add(Dropout(rate))
    model.add(Dense(units=512, kernel_regularizer=l2(factor), activation='relu'))
    model.add(Dropout(rate))
    model.add(Dense(units=512, kernel_regularizer=l2(factor), activation='relu'))
    model.add(Dropout(rate))
    #Output layer: one unit per label
    model.add(Dense(units=3, activation='softmax'))
    return model

Let’s create the model with an L2 factor of 0.0001 and a dropout rate of 0.2.

model= create_deep_model(factor=0.0001, rate=0.2)
model.summary()
Deep learning model in Tensorflow — Image by Author

The model is trained with the Adam optimizer, a learning rate of 0.001, and a batch size of 128. Categorical cross-entropy is used as the loss function. Early stopping is set up to prevent overfitting. In practice, this model does not reach convergence at all: with the constructed vocabulary and heterogeneous data, it is extremely difficult for the model to distinguish supporting from opposing arguments.

from tensorflow.keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=5)
opt = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])

The data is split into training (75%) and validation (25%) before fitting.

#Split data into training (75%) and validation (25%), matching the split described above
from sklearn.model_selection import train_test_split
X_train_enc, X_val, y_train_enc, y_val = train_test_split(X_train, y_train, test_size=0.25, shuffle=False)
#Fit the model; the epoch count is only an upper bound, early stopping usually ends training earlier
batchsize, epochs = 128, 50
history = model.fit(x=X_train_enc, y=y_train_enc, batch_size=batchsize, epochs=epochs, validation_data=(X_val, y_val), verbose=1, callbacks=[early_stop])
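
To check whether early stopping kicked in and how far the model is from converging, it helps to plot the loss curves stored in the history object. A minimal sketch with matplotlib (imported earlier):

#Plot training and validation loss per epoch
hist = pd.DataFrame(history.history)
plt.figure(figsize=(8, 4))
plt.plot(hist['loss'], label='training loss')
plt.plot(hist['val_loss'], label='validation loss')
plt.xlabel('epoch')
plt.ylabel('categorical cross-entropy')
plt.legend()
plt.show()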

Evaluation

The “expert” annotators were two graduate-level language technology researchers who were fully briefed on the nature and purpose of the argument model. They achieved an agreement ratio of 0.862, slightly higher than the agreement ratio between expert and crowd annotators (0.844). The reliability of the results was validated using kappa statistics for inter-annotator agreement [2].

There are two prediction settings in my three-label AM experiment: in-topic and cross-topic. For easier evaluation and benchmarking, I revert the one-hot encoding back to integers [0, 1, 2] using the following code:

y_train= np.argmax(y_train, axis=1)
y_test= np.argmax(y_test, axis=1)

The benchmark is based on accuracy, precision, recall, and f1-score, summarized in the classification reports below; the predictions can also be presented as a confusion matrix (shown after the reports).

  • Trained on seven topics, tested on the validation set (in-topic):
y_test = y_val
y_test = np.argmax(y_test, axis=1)
y_pred = np.argmax(model.predict(X_val), axis=-1)
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, target_names=['No Argument', 'Argument For', 'Argument Against']))
In-topic classification report— Image by Author
  • Trained on seven topics, tested on the unseen topic (cross-topic):
y_test = test_df.drop('data',axis=1).values
y_test=np.argmax(y_test, axis=1)
y_pred = np.argmax(model.predict(X_test), axis=-1)
Cross-topic classification report — Image by Author

The term Support in this context is the number of sentences carrying each true label, which is used to calculate the weighted average. The accuracy, as well as the macro average of the f1-score, is commonly used to benchmark classification results. Based on precision, recall, and f1-score, this model performs well at predicting “Argument Against” for a given sentence, but it performs very poorly at mining “Argument For”, recalling barely 10% of the true positives. This shows that the developed model largely lacks the ability to understand arguments, tending to predict everything as “Argument Against.” The effect is less pronounced when the data is tested in-topic instead of cross-topic, as seen in the first classification report. As expected, in-topic prediction yields higher accuracy, precision, recall, and f1-scores than cross-topic prediction.
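
The tendency to predict almost everything as “Argument Against” can be made visible with the confusion matrix mentioned above. A minimal sketch with scikit-learn, reusing y_test and y_pred from the cross-topic prediction and assuming the same label order as in the classification report:

#Confusion matrix for the cross-topic prediction
from sklearn.metrics import confusion_matrix
labels = ['No Argument', 'Argument For', 'Argument Against'] #order assumed as in the report above
cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(cm, index=labels, columns=labels) #rows: true labels, columns: predictions
print(cm_df)
sns.heatmap(cm_df, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()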

Talking points

  • According to the authors of the research paper [1], the human upper-bound f1-score in AM is 0.861 (two labels). Their best f1-score for two labels (“Argument” or “Not argument”) is 0.662. For the three-label experiment setup, their best score is 0.4285. My pipeline achieved an f1-score macro average of 0.37 without any transfer learning. Of course, more experiments are needed to validate the result, but this is already a relatively good outcome considering the quick implementation. The accuracy of 57% is almost twice as good as a random guess (~33%). However, the scores are in general not good enough, as we want the model to perform similarly to humans. Cross-topic argument mining therefore remains a challenge to address in NLP.
  • Pre-processing the training set without applying the same procedure to the test set results in very poor performance; this is the effect of stop-word removal and tokenization. Future work would integrate knowledge representation learning methods as well as word embeddings for better feature extraction; the sentence encoding and importance weighting built into those methods would make the manual pre-processing in my pipeline redundant.
  • The next step for this work would be to make the model aware of topics and context. In the original paper, the authors attempted to enhance the cross-topic prediction capability of their model by using transfer learning (parameter transfer). They used bidirectional LSTMs, pre-trained from another corpus (Google News Dataset and SemEval), and combined the method with word embeddings. They also showed that integrating topic information when training the deep learning model leads to better generalization and recall.

Outlook

Argumentation is in many ways connected to artificial intelligence. Computational linguistics applications range from detecting controversial tweets to identifying topic-specific allegations in legal documents. The state of the art of this research can, for example, be found at www.argumentsearch.com. This website functions as a search engine for documents, using text mining, big data analytics, and deep learning to return the pros and cons of your topic input in real time.

References

[1] C. Stab, T. Miller, B. Schiller, P. Rai, and I. Gurevych. Cross-topic argument mining from heterogeneous sources, Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2018).

[2] C. Stab, T. Miller, and I. Gurevych, Cross-topic Argument Mining from Heterogeneous Sources Using Attention-based Neural Networks, arXiv e-prints (2018).

[3] F. H. van Eemeren and R. Grootendorst, A Systematic Theory of Argumentation, Cambridge University Press (2004).

Disclaimer: The annotations are available online under the Creative Commons Attribution-NonCommercial license (CC BY-NC). Views are my own.
