
What Makes a Question Helpful?

Topic Modeling Exploration of Stack Overflow Text

Social media plays an ever-increasing role in our lives, and without discussing its flaws, one of its positives is that it gives its users the ability to communicate to a wide audience. This makes social media a powerful tool for anyone who wants to learn more about a certain topic or discuss it with others to refine everyone’s understanding. Stack Overflow is a site dedicated to helping people grow in their coding expertise, share​ ​their programming ​knowledge, and build their careers in a community setting. As part of the 12-week curriculum at the Metis bootcamp that I’m currently enrolled in, I completed a data science project using natural language processing tools to mine text on Stack Overflow to explore terms and topics related to the most helpful questions on the site.

I started my project with a simple SQL query on Google Cloud Platform’s BigQuery using the public Stack Overflow dataset that’s hosted on the site to bring in 500,000 Stack Overflow questions and other related features such as the score associated with that post, the number of views, and related tags, among others. Before continuing with discussing the overall process of completing the project, I’ll share some basic information about some of the features.
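A query along these lines can pull the data described above. This is a sketch, not the exact query from the project: the table and column names come from the public `bigquery-public-data.stackoverflow` dataset, and the filters are illustrative.

```python
# Sketch of the BigQuery pull. Column names (title, body, score,
# view_count, tags) are from the public Stack Overflow dataset's
# posts_questions table; the 500,000-row limit matches the sample
# size used in this project.
QUERY = """
SELECT
  title,
  body,
  score,
  view_count,
  tags
FROM `bigquery-public-data.stackoverflow.posts_questions`
LIMIT 500000
"""

# Actually running it requires GCP credentials, e.g.:
# from google.cloud import bigquery
# client = bigquery.Client()
# df = client.query(QUERY).to_dataframe()
```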

Sample question similar to those on Stack Overflow site. Image by author, inspired by Stack Overflow post.

At the top of a post is the question’s title, and below that is the body of the post: the question itself. You’ll also notice the number to the left of the question body, flanked by up and down arrows. This is the "score" for the post. It’s the sum of the upvotes and downvotes a post receives, and one way to interpret the score is as a measure of helpfulness. Answers on the site work the same way, but for the purposes of this project, I stuck to looking at just the questions. You may be asking yourself, "What is the range for the scores? Is there a number that indicates a good or bad post?" All posts start at a score of 0. A single upvote or downvote tells you whether the one person who viewed and voted on that post thought it was helpful or not. Thus the overall score reflects not only how helpful the post was judged to be, but also how many people engaged with it. At the very bottom of the post, outlined in blue, are the tags, which help categorize the question. They’re useful if you’re trying to search for posts related to a certain programming concept.

To start at the beginning of this process in detail, I used the SQL query I discussed earlier to bring in the relevant features. Next, I selected only the most helpful questions by filtering for scores in the top ten percent of posts. To dial in the helpfulness factor further, I kept only the 100 most frequently used tags on the site. There are over 54,000 unique tags on Stack Overflow, and since tags are user-generated, some may be used only once. Filtering in this way ensures that we’re looking at the posts that are the most helpful, attract the most attention, and are probably associated with more experienced users.
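The two filters above can be sketched in pandas. The toy DataFrame stands in for the BigQuery result; the pipe-delimited `tags` format matches the public dataset.

```python
import pandas as pd

# Toy stand-in for the 500k-question pull.
df = pd.DataFrame({
    "title": ["q1", "q2", "q3", "q4"],
    "score": [120, 3, 0, 57],
    "tags":  ["python|pandas", "java", "css", "python"],
})

# Keep only questions whose score is in the top ten percent.
cutoff = df["score"].quantile(0.90)
top_questions = df[df["score"] >= cutoff]

# Count tag usage (tags are pipe-delimited) and keep questions that
# use at least one of the 100 most frequent tags.
tag_counts = df["tags"].str.split("|").explode().value_counts()
top_tags = set(tag_counts.head(100).index)
uses_top_tag = df["tags"].str.split("|").apply(
    lambda ts: bool(set(ts) & top_tags)
)
helpful = top_questions[uses_top_tag.loc[top_questions.index]]
```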

After the initial data cleaning step, text preprocessing is usually next in the topic modeling process. This is to ensure the text only contains meaningful words without punctuation, HTML, or stop words (e.g. like, at, which).
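A minimal preprocessing function along these lines handles all three: the stop-word list here is just an illustrative subset, and the regex deliberately keeps `+` and `#` so that code terms like c++ and c# survive.

```python
import re

# Illustrative subset; in practice you'd use a full stop-word list
# (e.g. from scikit-learn or NLTK).
STOP_WORDS = {"like", "at", "which", "the", "a", "is", "to", "i"}

def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)                 # strip HTML tags
    text = re.sub(r"[^a-z0-9+#\s]", " ", text.lower())   # drop punctuation, keep c++/c#
    return [t for t in text.split() if t not in STOP_WORDS]

preprocess("<p>Which IDE is like PyCharm?</p>")
# → ['ide', 'pycharm']
```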

Preprocessed text from Topic 6 before removing stop words. Image by author using text from a Stack Overflow post under Creative Commons license.

Preprocessing in data science is an iterative process that often requires a lot of domain knowledge. For example, since much of the verbiage on Stack Overflow is related to code, some tokens that aren’t defined in the English language, like git or c++, actually carry meaning. You may have to go back to the tokenization step, or even revisit it after modeling, once you realize the topics being displayed aren’t helpful or are missing words you know should be included. In this project I wasn’t able to catch all of these terms before they were removed during tokenization, but doing so is definitely a future goal for refining my model.

Once the text was cleaned in the preprocessing pipeline and tokenized using a CountVectorizer, I performed topic modeling on the questions in my dataset using Latent Dirichlet Allocation (LDA). Also, since tags are themselves a way of classifying questions by topic, I connected the tags to each of my 20 topics using Pandas.
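The CountVectorizer → LDA → tag-tally flow looks roughly like this. The four-document corpus and its tags are toy stand-ins, and 2 topics replace the project’s 20.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy stand-ins for the cleaned question texts and a tag per question.
docs = [
    "git merge branch commit push remote",
    "server client request session connection",
    "git rebase branch conflict history",
    "http server request response client",
]
tags = ["git", "networking", "git", "http"]

vectorizer = CountVectorizer()
doc_term = vectorizer.fit_transform(docs)

# The full project used 20 topics; 2 suffice for this toy corpus.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(doc_term)  # each row is P(topic | doc)

# Assign each question its dominant topic, then tally tags per topic.
dominant = doc_topic.argmax(axis=1)
tag_by_topic = pd.crosstab(dominant, pd.Series(tags, name="tag"))
```

The `tag_by_topic` table is the kind of per-topic tag frequency shown in the figure below.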

Frequency of the top tags in topics 6 and 8. Image by author.

The topic modeling step also requires some domain knowledge, which in this case means knowing terms related to various topics within computer programming. That knowledge is helpful when settling on the optimal number of topics for your model. For this project, I relied on my domain knowledge and settled on 20 topics once I noticed a clear distinction in the terms included in each topic, but there are also tools such as grid search for finding that sweet spot.
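One way to automate that choice is scikit-learn’s GridSearchCV, which scores each candidate topic count by LDA’s held-out log-likelihood. A toy-sized sketch (the corpus and the candidate counts are illustrative, not the project’s actual search space):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import GridSearchCV

docs = [
    "git merge branch commit", "server client request session",
    "git rebase branch conflict", "http server request response",
    "branch commit push git", "session connection client server",
]
doc_term = CountVectorizer().fit_transform(docs)

# GridSearchCV scores each n_components by LDA's approximate
# log-likelihood on held-out folds.
search = GridSearchCV(
    LatentDirichletAllocation(random_state=0, max_iter=10),
    param_grid={"n_components": [2, 3, 4]},
    cv=2,
)
search.fit(doc_term)
print(search.best_params_)
```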

Word cloud of 9 out of the 20 topics selected in the LDA model. Image by author.

I also came away with some interesting insights in my project that aren’t just related to natural language processing and topic modeling. Stack Overflow is a huge site with a large community and a lot of data within it. For example, the scores of posts in my dataset ranged from -49 to 14,772. Posts could have as few as 0 views or as many as over 2 million! And in my model, Topic 19 was associated with over 4,000 questions on Stack Overflow. It included terms like session, server, client, connection, and request, and is arguably associated with questions about the browser.

In the future, the model could be refined further, perhaps with more text preprocessing, as well as by exploring other models like Non-negative Matrix Factorization (NMF), which is beneficial for processing documents of shorter length. A model that identifies helpful topics is beneficial not only to the user who wants to post a question that’s likely to get answered, or who wants to build their reputation on Stack Overflow, but also to educators. These helpful questions are the ones users have found useful for learning, and their high scores, as I mentioned at the beginning, mean many people have paid attention to them. In other words, these are commonly the questions people find most difficult. That can help a teacher, professor, or tutor decide which topics deserve more time so that their students walk away with knowledge they can apply in their homework, projects, and careers.
