Feature Selection in Text Classification

AC
Towards Data Science
4 min read · Nov 15, 2018


When building a machine learning model for text classification, there are a lot of features. Because the features are built from words, the broader the context of the corpus, the higher the dimensionality of the feature space becomes. This happens whenever I build models for news classification, sentiment analysis, web page classification, and so on.

Those features consume a lot of time and computing power. For instance, I needed half an hour to get results for sentiment analysis while benchmarking 3 algorithms. Then something came to mind: is it possible to select only the words that are important to certain classes?

Feature selection has been a research topic for decades, and it is used in many fields such as bioinformatics, image recognition, image retrieval, text mining, etc. Theoretically, feature selection methods can be based on statistics, information theory, manifold learning, and rough set theory.

Feature selection methods can be classified into 4 categories: filter, wrapper, embedded, and hybrid methods. Filter methods perform a statistical analysis over the feature space to select a discriminative subset of features. In the wrapper approach, on the other hand, various subsets of features are first identified and then evaluated using classifiers. In the embedded approach, the feature selection process is built into the training phase of the classifier. The hybrid approach takes advantage of both the filter and wrapper approaches.

In practice, some feature selection methods are available in Scikit-learn, such as chi-squared, variance threshold, and mutual information. Sklearn also provides SelectKBest to keep however many features you want. But I have recently been reading a lot of papers on this topic and found numerous feature selection techniques, especially filter methods; I counted more than 30 filter methods during my research, and I am still counting.
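To make this concrete, here is a minimal sketch of filter selection with Scikit-learn: score every term with chi-squared against the class labels and keep only the top k. The dataset (20 Newsgroups) and k=1000 are illustrative choices on my part, not recommendations.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

news = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
X = CountVectorizer(stop_words="english").fit_transform(news.data)
y = news.target

# Keep the 1000 terms with the highest chi-squared score w.r.t. the class labels.
selector = SelectKBest(chi2, k=1000)   # swap in mutual_info_classif for MI scoring
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)
```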

Some algorithms for the wrapper approach have also grown to solve this problem. Bio-inspired algorithms, also known as meta-heuristic algorithms, are applied to this method, such as the Genetic Algorithm, Particle Swarm Optimization, the Firefly Algorithm, Ant Colony Optimization, Artificial Bee Colony, and so on. If you want to apply the wrapper method, you can use Inspyred or metaheuristic_algorithms_python; a from-scratch sketch of the idea follows below.
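The sketch below is a toy genetic-algorithm wrapper written from scratch (it does not use Inspyred's API, and ga_wrapper is just a name I made up for illustration): each individual is a binary mask over the features, and its fitness is the cross-validated accuracy of a classifier trained on the masked features. It assumes X is a nonnegative count matrix, e.g. from CountVectorizer.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

def ga_wrapper(X, y, pop_size=20, generations=10, mutation_rate=0.01, seed=0):
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n_features))   # random binary masks

    def fitness(mask):
        if mask.sum() == 0:                                  # empty subset: worst score
            return 0.0
        X_sub = X[:, np.flatnonzero(mask)]
        return cross_val_score(MultinomialNB(), X_sub, y, cv=3).mean()

    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in pop])
        parents = pop[np.argsort(scores)[-pop_size // 2:]]   # keep the best half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n_features)                # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(n_features) < mutation_rate    # bit-flip mutation
            child = np.where(flip, 1 - child, child)
            children.append(child)
        pop = np.vstack([parents, children])

    scores = np.array([fitness(ind) for ind in pop])
    return pop[scores.argmax()].astype(bool)                 # best feature mask found
```

Because every fitness evaluation trains and cross-validates a classifier, wrapper methods like this are expensive; in practice they are usually run after a cheap filter has already trimmed the vocabulary.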

There are a lot of recent papers that improve on previous feature selection methods; some of them tackle feature redundancy and some maximize relevancy. Here are recent feature selection algorithms that I want you to know.

Multivariate Relative Discrimination Criterion [1]

The purpose of feature selection is to select a compact feature subset with maximal discriminative capability, which requires high relevance to the class label and low redundancy within the selected feature subset. MRDC is proposed to consider both the relevancy and the redundancy concepts in its evaluation process. MRDC first computes the relevancy of each feature using the Relative Discrimination Criterion measure, and then Pearson correlation is used to compute correlation values between features.
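To illustrate the relevance-versus-redundancy trade-off (not the paper's exact algorithm), here is a hypothetical greedy sketch: chi-squared stands in for the Relative Discrimination Criterion relevance measure, redundancy is the mean absolute Pearson correlation with the already-selected features, and features are picked one at a time by relevance minus redundancy. It assumes a small, dense, nonnegative feature matrix, since the full correlation matrix is quadratic in the number of features.

```python
import numpy as np
from sklearn.feature_selection import chi2

def greedy_relevance_redundancy(X, y, k):
    X = np.asarray(X, dtype=float)
    relevance, _ = chi2(X, y)                        # stand-in for RDC relevance
    relevance = relevance / relevance.max()          # put relevance on a 0-1 scale
    corr = np.nan_to_num(np.abs(np.corrcoef(X, rowvar=False)))
    selected = [int(np.argmax(relevance))]           # start with the most relevant term
    while len(selected) < k:
        redundancy = corr[:, selected].mean(axis=1)  # mean |corr| with chosen features
        score = relevance - redundancy
        score[selected] = -np.inf                    # never re-pick a chosen feature
        selected.append(int(np.argmax(score)))
    return selected                                  # indices of the k chosen features
```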

Minimal Redundancy-Maximal New Classification Information [2]

There are two groups of feature selection methods: one focuses on minimizing redundancy, and the other on maximizing new classification information. The methods that focus on minimizing feature redundancy do not consider new classification information, and vice versa, thereby resulting in selected features with large amounts of new classification information but high redundancy, or features with low redundancy but little new classification information.

Gao et al. proposed a hybrid feature selection method that integrates the two groups of feature selection methods by considering two types of feature redundancy, overcoming the limitations mentioned above.

MR-MNCI considers both new classification information and feature redundancy. Feature redundancy can be divided into two categories: class-dependent feature redundancy and class-independent feature redundancy. These two types of feature redundancy are both significant for feature selection.

Distinguishing Feature Selector [3]

Uysal et al. proposed a novel filter-based probabilistic feature selection method. DFS selects distinctive features while eliminating uninformative ones, considering certain requirements on term characteristics. This method tries to answer a common need: users are looking for new techniques to select distinctive features so that classification accuracy can be improved and processing time reduced as well.

In their paper, they compare DFS with some popular filter methods such as Chi-squared, Information Gain, Gini Index, and Deviation from Poisson distribution in a timing analysis. The result gives DFS (0.0343 s), GI (0.0371 s), CHI2 (0.0632 s), IG (0.0693 s), and DP (0.0797 s); lower is better.

There are some popular datasets for the text classification task. For instance, Amazon reviews, movie reviews, 20 Newsgroups, and Reuters-21578 are great places to start.

In addition, popular algorithms are used for benchmarking feature selection, such as Naive Bayes, KNN, Decision Tree, SVM, K-Means, and hierarchical clustering. Naive Bayes is the most popular one for text classification.
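As a minimal benchmark sketch, you can compare the same Naive Bayes classifier with and without feature selection on 20 Newsgroups using cross-validated accuracy; the choice of chi-squared and k=2000 here is illustrative, not a recommendation.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

news = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))

baseline = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
selected = make_pipeline(TfidfVectorizer(stop_words="english"),
                         SelectKBest(chi2, k=2000), MultinomialNB())

for name, model in [("all features", baseline), ("top 2000 by chi2", selected)]:
    acc = cross_val_score(model, news.data, news.target, cv=3).mean()
    print(f"{name}: {acc:.3f}")
```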

[1] Labani, M., Moradi, P., Ahmadizar, F., Jalili, M., 2018. A novel multivariate filter method for feature selection in text classification problems. Engineering Applications of Artificial Intelligence.

[2] Gao, W., Hu, L., Zhang, P., Wang, F., 2018. Feature selection by integrating two groups of feature evaluation criteria. Expert Systems with Applications.

[3] Uysal, A., Gunal, S., 2012. A novel probabilistic feature selection method for text classification. Knowledge-Based Systems.

If you enjoyed this post, feel free to hit the clap button 👏🏽 and if you’re interested in posts to come, make sure to follow me on Medium.
