Textfeatures: Library for extracting basic features from text data

Amey Band
Towards Data Science
5 min read · Jul 23, 2020



Introduction: textfeatures

When we work with text data, we are always concerned with its features, its pre-processing, and ultimately the predictions we can make from it. To improve a model, it is important to understand the data and find its more interesting features, such as hashtags, links, and many more.

What is textfeatures?

textfeatures is a Python package that extracts basic features from text data, such as hashtags, stopwords, and numerics, helping you understand your data and improve your model more effectively.

Function call structure:

function_name(dataframe, "text_column", "new_column")

where:

dataframe: the DataFrame holding the text data.

text_column: the name of the column from which features are extracted.

new_column: the name of a new column that will hold the features extracted from text_column.

What will textfeatures serve you?

1. word_count(): gives the total word count of the text data.

2. char_count(): gives the character count.

3. avg_word_length(): gives the average word length.

4. stopwords_count(): gives the stopword count.

5. stopwords(): extracts the stopwords from the text data.

6. hashtags_count(): gives the hashtag count.

7. hashtags(): extracts the hashtags from the text data.

8. links_count(): gives the count of embedded links in the text data.

9. links(): extracts the links from the text data.

10. numerics_count(): gives the count of numeric digits.

11. user_mentions_count(): gives the count of user mentions in the text data.

12. user_mentions(): extracts the user mentions from the text data.

13. clean(): returns pre-processed data after removing unnecessary material from the text.

Let’s understand the syntax and functionalities provided by textfeatures package.

We are using the COVID-19 Tweets dataset hosted on Kaggle.

The best way to install the textfeatures package is by using pip.

pip install textfeatures

Let’s import the libraries we need.

import textfeatures as tf
import pandas as pd

Read the CSV file with pandas into a DataFrame and take a preview of the dataset.

# encoding is required for this dataset
df = pd.read_csv("COVID-19_Tweets.csv",encoding="latin")
df.head()
Figure 1: Preview of Dataset

1. word_count()

  • Counting words is the very first feature-extraction task.
  • We calculate the word count for every row of the dataset.
tf.word_count(df,"Tweets","word_cnt")
df[["Tweets","word_cnt"]].head()
Figure 2: Word Count
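
For intuition, the same word count can be reproduced with plain pandas. This is an illustrative sketch on made-up rows, not the package’s actual implementation:

```python
import pandas as pd

# Hypothetical sample rows; the column name "Tweets" matches the article's dataset.
df = pd.DataFrame({"Tweets": ["stay home stay safe", "wash your hands"]})

# Word count: split each tweet on whitespace and count the tokens.
df["word_cnt"] = df["Tweets"].str.split().str.len()
```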

2. char_count()

  • We calculate the number of characters in every row of the dataset.
  • This can be accomplished by calculating the length of the tweet.
tf.char_count(df,"Tweets","char_len")
df[["Tweets","char_len"]].head()
Figure 3: Character Length
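
The character count is likely just the string length, spaces included. A minimal sketch on hypothetical rows:

```python
import pandas as pd

# Hypothetical sample rows standing in for the tweets.
df = pd.DataFrame({"Tweets": ["stay home", "wash your hands"]})

# Character count: the length of the raw string, spaces included.
df["char_len"] = df["Tweets"].str.len()
```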

3. avg_word_length()

  • To understand more about the data, we will find the average word length.
  • We calculate it by summing the lengths of all the words and dividing by the number of words in the tweet.
tf.avg_word_length(df,"Tweets","avg_wrd_length")
df[["Tweets","avg_wrd_length"]].head()
Figure 4: Average Word Length
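
The formula above (total word length divided by word count) can be sketched directly in pandas; the sample rows are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"Tweets": ["stay home", "wash hands now"]})

def avg_word_length(text: str) -> float:
    """Sum of the word lengths divided by the number of words."""
    words = text.split()
    return sum(len(w) for w in words) / len(words)

df["avg_wrd_length"] = df["Tweets"].apply(avg_word_length)
```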

4. stopwords_count()

  • For any natural language processing problem, we first try to clean our data, and identifying stopwords is a primary step.
  • We will find the count of stopwords present in the text data.
tf.stopwords_count(df,"Tweets","stopwords_cnt")
df[["Tweets","stopwords_cnt"]].head()
Figure 5: Stopwords Count

5. stopwords()

  • We find the stopwords in the text data and store them in a list, so you can inspect the noise in the data.
tf.stopwords(df,"Tweets","stopwords")
df[["Tweets","stopwords"]].head()
Figure 6: Stopwords
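
Both stopword functions amount to filtering tokens against a stopword list. The set below is a tiny illustrative sample (the package presumably ships a much fuller list), and the tweet is made up:

```python
import pandas as pd

# Tiny illustrative stopword set; a real implementation would use a full list.
STOPWORDS = {"the", "is", "a", "an", "and", "of", "to", "in"}

df = pd.DataFrame({"Tweets": ["the virus is spreading in the city"]})

# Lowercase, split on whitespace, and keep only the stopword tokens.
tokens = df["Tweets"].str.lower().str.split()
df["stopwords"] = tokens.apply(lambda ws: [w for w in ws if w in STOPWORDS])
df["stopwords_cnt"] = df["stopwords"].str.len()
```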

6. hashtags_count()

  • Hashtags are among the most interesting features to extract: they help a post or tweet reach the widest audience and draw the most responses.
  • We will calculate the hashtag count first.
tf.hashtags_count(df,"Tweets","hashtags_count")
df[["Tweets","hashtags_count"]].head()
Figure 7: Hashtags Count

7. hashtags()

  • Now we will extract the hashtags and store them in a list for further pre-processing and visualization of the data.
tf.hashtags(df,"Tweets","hashtags")
df[["Tweets","hashtags"]].head()
Figure 8: Hashtags
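
Hashtag extraction and counting are typically a regex match on `#` followed by word characters. A sketch under that assumed pattern, on made-up rows:

```python
import re
import pandas as pd

df = pd.DataFrame({"Tweets": ["Stay safe #covid19 #StayHome", "no tags here"]})

# Hashtags: '#' followed by one or more word characters (assumed pattern).
df["hashtags"] = df["Tweets"].apply(lambda t: re.findall(r"#\w+", t))
df["hashtags_count"] = df["hashtags"].str.len()
```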

8. links_count()

  • To draw more insight from the data, we also look at embedded links.
  • We will find the link count first.
tf.links_count(df,"Tweets","links_count")
df[["Tweets","links_count"]].head()
Figure 9: Links Count

9. links()

  • Let’s find the links embedded in the text data, store them in a list, and use them for further analysis.
tf.links(df,"Tweets","Links")
df[["Tweets","Links"]].head()
Figure 10: Links
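
Link extraction can likewise be approximated with a rough URL regex (match `http://` or `https://` up to the next whitespace); the rows are hypothetical:

```python
import re
import pandas as pd

df = pd.DataFrame({"Tweets": ["read this https://example.com/a now", "no link"]})

# URLs: http:// or https:// followed by non-whitespace (rough assumed pattern).
df["Links"] = df["Tweets"].apply(lambda t: re.findall(r"https?://\S+", t))
df["links_count"] = df["Links"].str.len()
```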

10. numerics_count()

  • Just as we searched for words, hashtags, and links, we will also find the count of numeric values.
  • This will help us in processing the text data.
tf.numerics_count(df,"Tweets","num_len")
df[["Tweets","num_len"]].head()
Figure 11: Numerics Count
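
One way to count numerics is to count the tokens that consist entirely of digits. A sketch of that approach on made-up rows:

```python
import pandas as pd

df = pd.DataFrame({"Tweets": ["100 new cases in 3 cities", "stay safe"]})

# Numeric count: whitespace-split tokens made up entirely of digits.
df["num_len"] = df["Tweets"].apply(lambda t: sum(w.isdigit() for w in t.split()))
```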

11. user_mentions_count()

  • While handling Twitter data, we constantly come across user mentions (@). These features help us analyze the data and understand its importance more effectively.
  • We are finding the count of user mentions here.
tf.user_mentions_count(df,"Tweets","user_mentions_cnt")
df[["Tweets","user_mentions_cnt"]].head()
Figure 12: User Mentions Count

12. user_mentions()

  • Let’s find the user mentions, store them in a list, and use them for information visualization.
tf.user_mentions(df,"Tweets","user_mentions")
df[["Tweets","user_mentions"]].head()
Figure 13: User Mentions
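
Mentions follow the same shape as hashtags: a regex on `@` followed by word characters (an assumed pattern), sketched on hypothetical rows:

```python
import re
import pandas as pd

df = pd.DataFrame({"Tweets": ["thanks @WHO and @CDCgov", "hello world"]})

# User mentions: '@' followed by one or more word characters (assumed pattern).
df["user_mentions"] = df["Tweets"].apply(lambda t: re.findall(r"@\w+", t))
df["user_mentions_cnt"] = df["user_mentions"].str.len()
```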

13. clean()

  • After extracting all the meaningful features, we need to clean the data for further sentiment analysis.
  • The clean() function returns pre-processed data with unwanted material such as numerics, stopwords, punctuation, and links removed.
tf.clean(df,"Tweets","Clean_tweets")
df[["Tweets","Clean_tweets"]].head()
Figure 14: Clean()
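
A cleaning pipeline like the one described (drop links, mentions, hashtags, punctuation, numerics, and stopwords) can be sketched as below. The stopword set, regex patterns, and sample tweet are illustrative assumptions, not the package’s actual internals:

```python
import re
import string
import pandas as pd

# Tiny illustrative stopword set; a real implementation would use a full list.
STOPWORDS = {"the", "is", "a", "an", "and", "of", "to", "in"}

def clean(text: str) -> str:
    text = re.sub(r"https?://\S+", "", text)   # drop links first, before punctuation
    text = re.sub(r"[@#]\w+", "", text)        # drop mentions and hashtags
    text = text.translate(str.maketrans("", "", string.punctuation))
    words = [w for w in text.lower().split()
             if w not in STOPWORDS and not w.isdigit()]  # drop stopwords/numerics
    return " ".join(words)

df = pd.DataFrame({"Tweets": ["The cases hit 100 today! see https://example.com #covid"]})
df["Clean_tweets"] = df["Tweets"].apply(clean)
```

Removing links before stripping punctuation matters here; otherwise the URL’s slashes and dots would be stripped and its remains would leak into the cleaned text.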

Conclusion

I hope you now understand the basic functionality this library provides. Now it’s time for you to design some interesting implementations of your own. If you want to contribute, fork the repository on GitHub and keep up the good work.

Enjoy Learning!

