Natural Language Processing on multiple columns in Python

Rowan Langford
Towards Data Science
5 min read · Apr 19, 2017


When I first heard about NLP, I was amazed and a little overwhelmed. In the most basic terms, NLP, or Natural Language Processing, turns text data into numerical features (words or characters, in this case) so that a model can examine how predictive each feature is of our dependent variable. Given the importance that language has in our world, the ability to extend data science to qualitative variables seems essential. Depending on the size of your dataset, this could mean working through thousands of features, so it can be computationally expensive, but there are ways to limit the processing needs.

There are a few different vectorizers used in NLP; the main ones that we’ve worked with in my bootcamp are CountVectorizer (CVec) and the TF-IDF Vectorizer (Term Frequency-Inverse Document Frequency). Choosing between them is often just a case of testing both on your data (or a subset of your data) and seeing which one performs better. The simplest way to think about the difference is that the CVec counts the number of times a word shows up within a document, while TF-IDF weighs the number of times a word appears within a document against the number of documents in which that word appears. If a word shows up in only a small percentage of the documents, TF-IDF gives it a higher weight.
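Here’s a tiny toy example of the difference, using three made-up documents (none of this comes from the Indeed data):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["data scientist new york",
        "data engineer new york",
        "data scientist chicago"]

# CVec: raw counts of each word in each document
cvec = CountVectorizer()
print(cvec.fit_transform(docs).toarray())

# TF-IDF: counts reweighted by how rare each word is across documents.
# "chicago" and "engineer" appear in only one document each, so they get
# higher weights than "data", which appears in all three.
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(docs).toarray())
```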

The main drawback I saw from working with NLP, aside from the fact that it can be computationally expensive, is that you can only run a CVec or a TF-IDF vectorizer on one column at a time. I was working with data that I scraped from Indeed.com, and I wanted to examine whether the job title AND the city would have a predictive effect on whether the salary was above or below the median. I chose a CVec purely because it was more predictive on the data I was working with.

Here’s the head of my data. Avg is the salary, averaged if a range was given; Dumsal is a dummy variable for whether the average salary was above or below my median salary of $106,750.
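The dataset itself isn’t reproduced here, but the target could be built with something along these lines. The dataframe below is a made-up miniature of the scraped postings; Title and City are hypothetical names for the other columns used later.

```python
import pandas as pd

# Made-up stand-in for the scraped Indeed postings
df = pd.DataFrame({
    'Title': ['data scientist', 'data engineer', 'senior data scientist',
              'machine learning engineer', 'data analyst',
              'research scientist', 'lead data scientist',
              'software engineer'],
    'City':  ['New York', 'Chicago', 'New York', 'San Francisco',
              'Austin', 'Boston', 'Seattle', 'Chicago'],
    'Avg':   [95000, 88000, 135000, 140000, 72000, 118000, 150000, 104000],
})

median_salary = df['Avg'].median()   # $106,750 in the original data
df['Dumsal'] = (df['Avg'] > median_salary).astype(int)
```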

I started by importing several things at the top of my notebook. Since I’m going to be doing cross-validation and a train-test split, there were a few things I needed to import that you may not.
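The original import cell isn’t shown, but for the steps that follow it would look something like this:

```python
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
```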

After I finished importing, I set my X and y variables and completed my train-test split. I chose a test size of 0.33, which is a common convention. Since I’m trying to classify my dummy salary variable based on my job title, I set those as my X and y. I originally tried to put in multiple columns, and when I found that I couldn’t, I decided to write this blog post!
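Continuing from the sketch above (Title is my hypothetical name for the job-title column; the stratify argument is an addition so that both classes land in each split of the tiny toy sample):

```python
X = df['Title']
y = df['Dumsal']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y)
```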

When you make a CountVectorizer, there are a few things to remember:

  1. You need to fit AND transform on the X_train
  2. You ONLY transform on the X_test
  3. After transforming, convert both of them to dataframes so you can run a model on them later

We only fit on our X_train, just as we do for the other models we have worked with in Python. We want to train the model on just the training data, and then all we want to do with the testing data is transform it based on our training data. We still need both of these to be dataframes, so we convert the sparse matrices that the vectorizer returns into dense arrays, which can be made into dataframes easily.
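A minimal sketch of those three rules, continuing with the toy data from above:

```python
cvec = CountVectorizer()
X_train_counts = cvec.fit_transform(X_train)   # fit AND transform
X_test_counts = cvec.transform(X_test)         # transform only

# The transforms return sparse matrices; toarray() densifies them so
# they drop straight into dataframes. (On scikit-learn versions before
# 1.0, get_feature_names_out was called get_feature_names.)
X_train_df = pd.DataFrame(X_train_counts.toarray(),
                          columns=cvec.get_feature_names_out())
X_test_df = pd.DataFrame(X_test_counts.toarray(),
                         columns=cvec.get_feature_names_out())
```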

After transforming my X_test to a dense matrix, I wanted to check and make sure that my X_train dataframe and my y_train had the same number of rows, and the same for my testing set. When I found that they did, I could run my regression on these terms!
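The check itself is just a matter of comparing shapes:

```python
# Rows (samples) should match between the features and the target, and
# the train and test dataframes should share one vocabulary (columns).
print(X_train_df.shape, y_train.shape)
print(X_test_df.shape, y_test.shape)
assert list(X_train_df.columns) == list(X_test_df.columns)
```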

I performed logistic regression on these dataframes and found that my score was around 0.77. While this isn’t bad, I was hoping to do better, so I decided to do NLP on my city column as well!
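The fit-and-score step might look like this (the ~0.77 is from the original run and won’t reproduce on the toy data above):

```python
# score() returns mean accuracy for a classifier
logreg = LogisticRegression()
logreg.fit(X_train_df, y_train)
print(logreg.score(X_test_df, y_test))
```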

I followed the same steps as above and fit/transformed a CVec on my city column as well.
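Something like this, where City is again a hypothetical column name. Reusing the same random_state (and stratify labels) as before keeps these rows aligned with the title split:

```python
city_train, city_test = train_test_split(
    df['City'], test_size=0.33, random_state=42, stratify=df['Dumsal'])

city_cvec = CountVectorizer()
city_train_df = pd.DataFrame(city_cvec.fit_transform(city_train).toarray(),
                             columns=city_cvec.get_feature_names_out())
city_test_df = pd.DataFrame(city_cvec.transform(city_test).toarray(),
                            columns=city_cvec.get_feature_names_out())
```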

This next step is the most important for completing NLP on multiple columns. After you fit/transform the CVecs for every column you choose, you need to concatenate them so you can run a classification algorithm on them as one dataframe. Since we want them as features, we concatenate them along the columns (axis=1) rather than stacking them at the bottom of the dataframe.
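With pandas this is one call per split:

```python
# axis=1 concatenates along the columns, so each document keeps one row
# and gains the city features next to the title features. Both count
# dataframes were built fresh above, so their row indices already line up.
X_train_combined = pd.concat([X_train_df, city_train_df], axis=1)
X_test_combined = pd.concat([X_test_df, city_test_df], axis=1)
```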

We can see that my new X_train dataframe and my y_train have the same number of rows, so we can successfully run regression on them. The last step is just to run the logistic regression again and see if this model is better than my first.
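The refit mirrors the first model, just on the combined features:

```python
logreg_combined = LogisticRegression()
logreg_combined.fit(X_train_combined, y_train)
print(logreg_combined.score(X_test_combined, y_test))
```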

Since this score was significantly worse than my original score, I chose not to use the concatenated dataframes, but hey, I learned something new along the way!
