Auto Tagging Stack Overflow Questions

Published in

Towards Data Science

9 min read · Mar 14, 2018

One of the most interesting applications of NLP is automatically inferring and tagging the topic of a question. In this post, we’ll start with an exploratory analysis of Stack Overflow questions and answers, and then build a simple model to predict the tag of a Stack Overflow question. We’ll solve this text classification problem using Scikit-Learn. Let’s get started.

The Data

For this project, we’ll use the text of 10% of Stack Overflow questions and answers on programming topics. The dataset is freely available on Kaggle.

Exploratory Data Analysis (EDA)

ggplot is one of our favourite data visualization tools, so we will do the EDA in R.

Load the necessary packages

library(readr)
library(dplyr)
library(ggplot2)
library(lubridate)
library(tidytext)
library(tidyverse)
library(broom)
library(purrr)
library(scales)
theme_set(theme_bw())

The questions data and tags data are stored separately, so we will read them separately.

questions <- read_csv("Questions.csv")
question_tags <- read_csv("Tags.csv")

Tags Data

So, what are the most popular tags?

question_tags %>%
  count(Tag, sort = TRUE)

Questions Data

The number of questions asked per week:

questions <- questions[ -c(8:29)]
questions %>%
  count(Week = round_date(CreationDate, "week")) %>%
  ggplot(aes(Week, n)) +
  geom_line() + 
  ggtitle('The Number of Questions Asked Per Week')

Compare the growth or shrinking of particular tags over time:

tags <- c("c#", "javascript", "python", "r", "php")

# Total number of questions asked each year
q_per_year <- questions %>%
  count(Year = year(CreationDate)) %>%
  rename(YearTotal = n)

# Number of questions per year for each tag of interest
tags_per_year <- question_tags %>%
  filter(Tag %in% tags) %>%
  inner_join(questions, by = "Id") %>%
  count(Year = year(CreationDate), Tag) %>%
  inner_join(q_per_year, by = "Year")

ggplot(tags_per_year, aes(Year, n / YearTotal, color = Tag)) +
  geom_line() +
  scale_y_continuous(labels = scales::percent_format()) +
  ylab("% of Stack Overflow questions with this tag") +
  ggtitle('Growth or Shrinking of Particular Tags Over Time')

What are the most common words in the titles?

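# Assumption: the post never shows how title_words is first built. A table with
# one row per question and title word, in a column named Word, is assumed here.
title_words <- questions %>%
  select(Id, Title) %>%
  unnest_tokens(Word, Title)

# Count title words after removing stop words
title_word_counts <- title_words %>%
  anti_join(stop_words, c(Word = "word")) %>%
  count(Word, sort = TRUE)

title_word_counts %>%
  head(20) %>%
  mutate(Word = reorder(Word, n)) %>%
  ggplot(aes(Word, n)) +
  geom_col(fill = "cyan4", alpha = 0.8, width = 0.6) +
  ylab("Number of appearances in question titles") +
  ggtitle('The most common words in the question titles') +
  coord_flip()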

Finding tf-idf within tag categories

We’d expect the tag categories to differ in the content of their titles, and therefore in the frequency of the words they use. We will use tf-idf to find the title words that are most strongly associated with particular tags.

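# Keep only tags that appear on at least 100 questions
common_tags <- question_tags %>%
    group_by(Tag) %>%
    mutate(TagTotal = n()) %>%
    ungroup() %>%
    filter(TagTotal >= 100)

# Count each title word within each tag and compute tf-idf
tag_word_tfidf <- common_tags %>%
    inner_join(title_words, by = "Id") %>%
    count(Tag, Word, TagTotal, sort = TRUE) %>%
    ungroup() %>%
    bind_tf_idf(Word, Tag, n)

# Highest tf-idf words among tags with more than 1,000 questions
tag_word_tfidf %>%
    filter(TagTotal > 1000) %>%
    arrange(desc(tf_idf)) %>%
    head(10)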
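To make the weighting concrete, here is a minimal Python sketch of the tf-idf calculation on a toy table of per-tag word counts. It assumes the definitions that tidytext's bind_tf_idf uses: tf is a word's share of all the words counted for a tag, and idf is the natural log of the number of tags divided by the number of tags that contain the word. The counts and the helper name tf_idf_table are made up for illustration.

```python
import math
from collections import defaultdict

def tf_idf_table(counts):
    """counts maps (tag, word) -> n; returns a dict mapping (tag, word) -> tf-idf."""
    words_per_tag = defaultdict(int)   # total word count within each tag
    tags_per_word = defaultdict(set)   # the tags in which each word occurs
    for (tag, word), n in counts.items():
        words_per_tag[tag] += n
        tags_per_word[word].add(tag)
    n_tags = len(words_per_tag)
    scores = {}
    for (tag, word), n in counts.items():
        tf = n / words_per_tag[tag]
        idf = math.log(n_tags / len(tags_per_word[word]))
        scores[(tag, word)] = tf * idf
    return scores

# Toy counts, invented for this example.
toy_counts = {
    ("python", "pandas"): 3, ("python", "error"): 1,
    ("java", "spring"): 2, ("java", "error"): 2,
}
scores = tf_idf_table(toy_counts)
for key in sorted(scores, key=scores.get, reverse=True):
    print(key, round(scores[key], 3))
```

A word that occurs under every tag, such as "error" here, gets an idf of zero, which is why generic words drop out of the top of the ranking.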

We will examine the words with the highest tf-idf in six of the top tag categories to extract words specific to those tags.

tag_word_tfidf %>%
  filter(Tag %in% c("c#", "python", "java", "php", "javascript", "android")) %>%
  group_by(Tag) %>%
  top_n(12, tf_idf) %>%
  ungroup() %>%
  mutate(Word = reorder(Word, tf_idf)) %>%
  ggplot(aes(Word, tf_idf, fill = Tag)) +
  geom_col(show.legend = FALSE, width = 0.6) +
  facet_wrap(~ Tag, scales = "free") +
  ylab("tf-idf") +
  coord_flip() +
  ggtitle('The 12 terms with the highest tf-idf within each of the top tag categories')

Change over time

What words and terms have become more frequent, or less frequent, over time? These could give us a sense of the changing software ecosystem, and let us predict what words will continue to grow in relevance. To achieve that, we need to get the slope of each word.

questions$month <- month(questions$CreationDate)
questions$year <- year(questions$CreationDate)

# Number of questions asked in each calendar month of each year
titles_per_month <- questions %>%
  group_by(year, month) %>%
  summarize(month_total = n()) %>%
  ungroup()

# One row per distinct word in each title, with stop words and pure numbers removed
title_words <- questions %>%
  arrange(desc(Score)) %>%
  distinct(Title, .keep_all = TRUE) %>%
  unnest_tokens(word, Title, drop = FALSE) %>%
  distinct(Id, word, .keep_all = TRUE) %>%
  anti_join(stop_words, by = "word") %>%
  filter(str_detect(word, "[^\\d]")) %>%
  group_by(word) %>%
  mutate(word_total = n()) %>%
  ungroup()

# Monthly counts for words that appear in at least 1,000 titles
word_month_counts <- title_words %>%
  filter(word_total >= 1000) %>%
  count(word, month, year) %>%
  complete(word, month, year, fill = list(n = 0)) %>%
  inner_join(titles_per_month, by = c("year", "month")) %>%
  mutate(percent = n / month_total)

# Binomial regression of a word's share of titles on the year
mod <- ~ glm(cbind(n, month_total - n) ~ year, ., family = "binomial")

# Fit one model per word and keep the coefficient on year (the slope)
slopes <- word_month_counts %>%
  nest(data = -word) %>%
  mutate(model = map(data, mod)) %>%
  mutate(tidied = map(model, tidy)) %>%
  unnest(tidied) %>%
  select(-data, -model) %>%
  filter(term == "year") %>%
  arrange(desc(estimate))

slopes
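For readers who want to see what the slope means, the coefficient on year in a binomial model is the change in the log-odds of a title containing the word per year. The following is a rough Python sketch that fits the same kind of model with Newton's method on invented counts. It is an illustration only, not what R's glm runs internally, and the function name binomial_slope and the numbers are hypothetical.

```python
import math

def binomial_slope(years, hits, totals, iterations=25):
    """Slope of logit(hits / totals) on year, fitted by Newton's method."""
    mean_year = sum(years) / len(years)
    xs = [year - mean_year for year in years]   # centring leaves the slope unchanged
    overall = sum(hits) / sum(totals)
    a = math.log(overall / (1 - overall))       # start at the overall log-odds
    b = 0.0
    for _ in range(iterations):
        # Gradient (g) and information matrix (h) of the binomial log-likelihood
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, k, n in zip(xs, hits, totals):
            p = 1 / (1 + math.exp(-(a + b * x)))
            w = n * p * (1 - p)
            g_a += k - n * p
            g_b += (k - n * p) * x
            h_aa += w
            h_ab += w * x
            h_bb += w * x * x
        # Newton step: solve the 2x2 system h * step = g
        det = h_aa * h_bb - h_ab ** 2
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return b

# Invented data: titles containing a word out of 1,000 titles per year.
years = [2008, 2009, 2010, 2011, 2012]
totals = [1000] * 5
growing = binomial_slope(years, [10, 20, 40, 80, 160], totals)
shrinking = binomial_slope(years, [160, 80, 40, 20, 10], totals)
print(growing, shrinking)
```

A positive slope marks a word whose share of titles is growing and a negative slope one that is shrinking, which is how the table of slopes is ranked.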

Then plot the top 16 fastest growing words:

slopes %>%
  head(16) %>%
  inner_join(word_month_counts, by = "word") %>%
  mutate(word = reorder(word, -estimate)) %>%
  ggplot(aes(year, n / month_total, color = word)) +
  geom_point(show.legend = FALSE) +
  geom_smooth(show.legend = FALSE) +
  scale_y_continuous(labels = percent_format()) +
  facet_wrap(~ word, scales = "free_y") +
  expand_limits(y = 0) +
  labs(x = "Year",
       y = "Percentage of titles containing this term",
       title = "16 fastest growing words in Stack Overflow question titles")

And top 16 fastest shrinking words:

slopes %>%
  tail(16) %>%
  inner_join(word_month_counts, by = "word") %>%
  mutate(word = reorder(word, -estimate)) %>%
  ggplot(aes(year, n / month_total, color = word)) +
  geom_point(show.legend = FALSE) +
  geom_smooth(show.legend = FALSE) +
  scale_y_continuous(labels = percent_format()) +
  facet_wrap(~ word, scales = "free_y") +
  expand_limits(y = 0) +
  labs(x = "Year",
       y = "Percentage of titles containing this term",
       title = "16 fastest shrinking words in Stack Overflow question titles")

N-gram Analysis

N-grams let us look beyond single words (unigrams) to sequences of two adjacent words (bigrams) and three adjacent words (trigrams). A bigram is an n-gram with n = 2. The following are the most common bigrams in the question titles.

title_bigrams <- questions %>%
  unnest_tokens(bigram, Title, token = "ngrams", n = 2)

title_bigrams %>%
  count(bigram, sort = TRUE)

I am sure you find them meaningless. Let’s remove the bigrams that contain stop words and find the most common meaningful bigrams.

# Split each bigram into its two words
bigrams_separated <- title_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

# Drop bigrams in which either word is a stop word
bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

bigram_counts <- bigrams_filtered %>%
  count(word1, word2, sort = TRUE)

# Join the two words back into a single bigram column
bigrams_united <- bigrams_filtered %>%
  unite(bigram, word1, word2, sep = " ")

bigrams_united %>%
  count(bigram, sort = TRUE)
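The same idea can be sketched in a few lines of plain Python. This is an approximation of the R steps above, under two assumptions: titles are split on whitespace only (unnest_tokens also strips punctuation), and a tiny hand-written stop-word set stands in for tidytext's much larger stop_words table. The titles are invented.

```python
from collections import Counter

# A small stand-in stop-word list, assumed for this example only.
STOP_WORDS = {"how", "to", "a", "in", "the", "is", "of", "with", "i", "do"}

def ngrams(text, n):
    """Return the list of n-word tuples found in a lower-cased title."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def meaningful_ngrams(titles, n):
    """Count n-grams that contain no stop words."""
    counts = Counter()
    for title in titles:
        for gram in ngrams(title, n):
            if not any(word in STOP_WORDS for word in gram):
                counts[gram] += 1
    return counts

titles = [
    "How to parse a json file in python",
    "Parse json file with jquery",
    "How do I read a json file",
]
bigram_counts = meaningful_ngrams(titles, 2)
print(bigram_counts.most_common(3))
```

Passing n = 3 to the same function gives the trigram version.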

And most common trigrams:

questions %>%
  unnest_tokens(trigram, Title, token = "ngrams", n = 3) %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         !word3 %in% stop_words$word) %>%
  count(word1, word2, word3, sort = TRUE)

That was fun!

Now we are going to develop a predictive model to automatically tag Stack Overflow questions. We will do that in Python, so we first write the combined question and tag table to a CSV file from R.

write.csv(total, file = "/Users/sli/Documents/total.csv", row.names = FALSE)

Here are the first five rows of the combined question and tag table:

import pandas as pd
total = pd.read_csv('total.csv', encoding='latin-1')
total.head()

Below is the full text of the first question:

total['Body'][0]

Text Preprocessing

The raw text data is messy and needs to be cleaned up for any further analysis. We exclude HTML tags, links and code snippets from the data.

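from collections import Counter
import numpy as np
import string
import re

EMPTY = ''

def clean_text(text):
    """Remove code snippets, links and HTML tags from a post body."""
    if not isinstance(text, str):
        return text
    # Drop code snippets; re.DOTALL lets '.' match the newlines inside them
    text = re.sub('<pre><code>.*?</code></pre>', EMPTY, text, flags=re.DOTALL)

    def replace_link(match):
        # Keep the link text unless it is itself a URL
        return EMPTY if re.match('[a-z]+://', match.group(1)) else match.group(1)

    # Non-greedy match so that each link is handled on its own
    text = re.sub('<a[^>]+>(.*?)</a>', replace_link, text)
    # Strip any remaining HTML tags
    return re.sub('<[^>]+>', EMPTY, text)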

Then we create a new “Text” column holding the cleaned text from the “Body” column.

total['Text'] = total['Body'].apply(clean_text).str.lower()
total.Text = total.Text.apply(lambda x: x.replace('"','').replace("\n","").replace("\t",""))
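A quick way to see what these cleaning steps do is to run them on a single made-up question body. The sketch below repeats a version of clean_text so that it runs on its own; the sample HTML is invented, and the use of re.DOTALL and a non-greedy link pattern are assumptions about how the function should behave on multi-line code blocks and multiple links.

```python
import re

def clean_text(text):
    """Strip code blocks, link markup and remaining HTML tags from a post body."""
    if not isinstance(text, str):
        return text
    # Drop code snippets; DOTALL lets '.' span the newlines inside them.
    text = re.sub(r'<pre><code>.*?</code></pre>', '', text, flags=re.DOTALL)

    def replace_link(match):
        # Keep the link text unless it is itself a bare URL.
        label = match.group(1)
        return '' if re.match(r'[a-z]+://', label) else label

    text = re.sub(r'<a[^>]+>(.*?)</a>', replace_link, text)
    return re.sub(r'<[^>]+>', '', text)

# An invented question body with two links and a code block.
sample = (
    '<p>See <a href="https://example.com/docs">the docs</a> or '
    '<a href="https://example.com">https://example.com</a>.</p>\n'
    '<pre><code>df = pd.read_csv("x.csv")\nprint(df)</code></pre>\n'
    '<p>Why does "this" fail?</p>'
)
cleaned = clean_text(sample).lower()
cleaned = cleaned.replace('"', '').replace('\n', '').replace('\t', '')
print(cleaned)  # see the docs or .why does this fail?
```

Note that deleting newlines outright joins the end of one paragraph to the start of the next; replacing them with a space instead would keep the words apart.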

There are more than 20,000 unique tags in our data.

total['Tag'].nunique()

21981

To simplify the problem, we will only work on the top 10 most frequently used tags, as shown below:

import collections
import matplotlib.pyplot as plt

def plot_tags(tagCount):
    """Plot each tag as a marker whose size reflects its question count."""
    x, y = zip(*tagCount)
    area = [i / 4000 for i in y]  # marker sizes in points, scaled down from the counts
    plt.figure(figsize=(10, 6))
    plt.ylabel("Number of question associations")
    for i in range(len(y)):
        plt.plot(i, y[i], marker='o', linestyle='', ms=area[i], label=x[i])
    plt.legend(numpoints=1)
    plt.show()

tagCount = collections.Counter(list(total['Tag'])).most_common(10)
print(tagCount)
plot_tags(tagCount)

top_tags = ['c#', 'java', 'php', 'javascript', 'jquery', 'android', 'c++', 'iphone', 'python', 'asp.net']
total = total[total.Tag.isin(top_tags)]

Classification of text documents

We will use scikit-learn’s bag-of-words approach to classify text by tag. So, we are only interested in two columns: “Text” and “Tag”.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(total['Text'], total['Tag'], random_state=42, test_size=0.2, shuffle=True)

We are going to try various classifiers that can efficiently handle our text data once it has been transformed into sparse matrices.
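The code in this section refers to X_train_1 and X_test_1, but the post does not show how they are created. One plausible way, assumed here rather than taken from the original, is to fit a TfidfVectorizer on the training text and reuse it for the test text. The toy sentences and the vectorizer settings below are illustrative only.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for X_train and X_test (the real ones are pandas Series of text).
X_train = [
    "how do i read a csv file with pandas",
    "null pointer exception in my spring controller",
    "jquery click handler is not firing",
]
X_test = ["pandas read csv raises an exception"]

# sublinear_tf and stop_words are assumed settings, not the author's.
vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words="english")

# Learn the vocabulary and idf weights from the training text only...
X_train_1 = vectorizer.fit_transform(X_train)
# ...then apply the same mapping to the test text.
X_test_1 = vectorizer.transform(X_test)

print(issparse(X_train_1), X_train_1.shape, X_test_1.shape)
```

Fitting on the training split alone keeps information about the test questions out of the vocabulary and the idf weights, and both matrices end up with the same number of columns.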

The bar plot indicates the accuracy, training time (normalized) and test time (normalized) of each classifier.

from time import time
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_selection import SelectFromModel
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.neighbors import NearestCentroid
from sklearn.utils.extmath import density
from sklearn import metrics

# Sorted so that the names line up with the class order scikit-learn uses
target_names = sorted(total['Tag'].unique())

# The original snippet reads these switches from an `opts` object that the post
# never defines; plain flags are assumed here instead. To print the top keywords,
# set feature_names to a NumPy array of the vectorizer's feature names.
print_top10 = False
print_report = False
print_cm = False
feature_names = None

def trim(s):
    """Shorten a string to fit an 80-column line (assumed helper, not shown in the post)."""
    return s if len(s) <= 80 else s[:77] + "..."

def benchmark(clf):
    """Fit clf, time training and prediction, and return its accuracy and timings."""
    print('_' * 80)
    print("Training: ")
    print(clf)
    t0 = time()
    clf.fit(X_train_1, y_train)
    train_time = time() - t0
    print("train time: %0.3fs" % train_time)

    t0 = time()
    pred = clf.predict(X_test_1)
    test_time = time() - t0
    print("test time:  %0.3fs" % test_time)

    score = metrics.accuracy_score(y_test, pred)
    print("accuracy:   %0.3f" % score)

    if hasattr(clf, 'coef_'):
        print("dimensionality: %d" % clf.coef_.shape[1])
        print("density: %f" % density(clf.coef_))

        if print_top10 and feature_names is not None:
            print("top 10 keywords per class:")
            for i, label in enumerate(target_names):
                top10 = np.argsort(clf.coef_[i])[-10:]
                print(trim("%s: %s" % (label, " ".join(feature_names[top10]))))
        print()

    if print_report:
        print("classification report:")
        print(metrics.classification_report(y_test, pred,
                                            target_names=target_names))

    if print_cm:
        print("confusion matrix:")
        print(metrics.confusion_matrix(y_test, pred))

    print()
    clf_descr = str(clf).split('(')[0]
    return clf_descr, score, train_time, test_time

results = []
# max_iter replaces the n_iter argument that older scikit-learn releases accepted
for clf, name in (
        (RidgeClassifier(tol=1e-2, solver="lsqr"), "Ridge Classifier"),
        (Perceptron(max_iter=50), "Perceptron"),
        (PassiveAggressiveClassifier(max_iter=50), "Passive-Aggressive")):
    print('=' * 80)
    print(name)
    results.append(benchmark(clf))

print('=' * 80)
print("Elastic-Net penalty")
results.append(benchmark(SGDClassifier(alpha=.0001, max_iter=50,
                                       penalty="elasticnet")))

print('=' * 80)
print("NearestCentroid (aka Rocchio classifier)")
results.append(benchmark(NearestCentroid()))

print('=' * 80)
print("Naive Bayes")
results.append(benchmark(MultinomialNB(alpha=.01)))
results.append(benchmark(BernoulliNB(alpha=.01)))

print('=' * 80)
print("LinearSVC with L1-based feature selection")
results.append(benchmark(Pipeline([
  ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False,
                                                  tol=1e-3))),
  ('classification', LinearSVC(penalty="l2"))])))

indices = np.arange(len(results))

# Transpose the list of result tuples into four parallel lists
results = [[x[i] for x in results] for i in range(4)]

clf_names, score, training_time, test_time = results
# Normalize the timings so that the slowest classifier has a value of 1
training_time = np.array(training_time) / np.max(training_time)
test_time = np.array(test_time) / np.max(test_time)

plt.figure(figsize=(12, 8))
plt.title("Score")
plt.barh(indices, score, .2, label="score", color='navy')
plt.barh(indices + .3, training_time, .2, label="training time",
         color='c')
plt.barh(indices + .6, test_time, .2, label="test time", color='darkorange')
plt.yticks(())
plt.legend(loc='best')
plt.subplots_adjust(left=.25)
plt.subplots_adjust(top=.95)
plt.subplots_adjust(bottom=.05)

for i, c in zip(indices, clf_names):
    plt.text(-.3, i, c)

plt.show()

The Ridge classifier achieved the best results so far. Therefore, we print out the precision and recall for each tag.

from sklearn.metrics import classification_report

model = RidgeClassifier(tol=1e-2, solver="lsqr")
model.fit(X_train_1, y_train)
predicted = model.predict(X_test_1)
# Labels are inferred from y_test and predicted, so each row is reported under the right tag
print(classification_report(y_test, predicted))
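For reference, the two numbers reported for each tag can be computed by hand. Precision is the share of questions predicted as a tag that really carry it, and recall is the share of questions carrying the tag that the model found. The sketch below shows the arithmetic on a handful of invented labels; the helper name precision_recall is hypothetical.

```python
def precision_recall(y_true, y_pred, tag):
    """Return (precision, recall) for one tag from parallel label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == tag and p == tag for t, p in pairs)   # correctly predicted as tag
    fp = sum(t != tag and p == tag for t, p in pairs)   # wrongly predicted as tag
    fn = sum(t == tag and p != tag for t, p in pairs)   # tag missed by the model
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Invented true and predicted tags for five questions.
y_true = ["python", "python", "java", "java", "php"]
y_pred = ["python", "java", "java", "java", "python"]

for tag in sorted(set(y_true)):
    print(tag, precision_recall(y_true, y_pred, tag))
```

Here "java" has perfect recall (both java questions were found) but a precision of only 2/3, because one python question was also labelled java.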

We can probably achieve a better result with parameter tuning, but I will leave that to you.

Source code can be found on GitHub. I look forward to hearing any feedback or questions.

References:

Scikit-Learn

Text Mining with R


Written by Susan Li