Genetic Algorithms for Natural Language Processing

Why GAs are effective for preprocessing NLP data

Michael Berk
Towards Data Science


Figure 1: genetic algorithm training a red square to avoid blue rectangles. Image by author.

“Data preparation accounts for about 80% of the work of data scientists.” — Forbes

NLP modeling projects are no different — often the most time-consuming step is wrangling data and then developing features from the cleaned data. There are many tools that facilitate this process, but it’s still laborious.

To aid in the feature engineering step, researchers at the University of Central Florida published a 2021 paper that leverages genetic algorithms to remove unimportant tokenized text. Genetic algorithms (GAs) are evolution-inspired optimizations that perform well on complex data, so they naturally lend themselves well to NLP data. They’re also easily parallelized and straightforward to implement.

Let’s start with a technical overview, then do a deep dive to really understand what’s going on…

Technical TLDR

  1. Tokenize your data and build a vocabulary. The paper cited uses the Python package mosestokenizer to split sentences into grams, which are individual symbols or words (steps 1 and 2 are sketched in code after this list).
  2. Assign random weights (between 0 and 1) to each gram. Each gram represents an individual in our population.
  3. Tune the model’s hyperparameters and run. Tuning GAs is more of an art than a science, so just play around with numbers that make sense.
  4. Determine accuracy using the ROUGE-N model. The paper cited uses the F1 score from ROUGE-N, which is the harmonic mean of precision and recall, but you can use other objective functions.
  5. Tokenize new data and start developing your NLP model. Now that your GA is trained, you can tokenize an unseen training set and develop your NLP model.
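To make steps 1 and 2 concrete, here’s a minimal sketch. The sentences and population size are placeholders, the list-of-weight-vectors representation is just one convenient way to store the population, and it assumes mosestokenizer’s context-manager interface.

```python
import random
from mosestokenizer import MosesTokenizer

# Step 1: tokenize sentences into grams and build a vocabulary.
# (These sentences are placeholders for your corpus.)
sentences = [
    "Genetic algorithms are evolution-inspired optimizations.",
    "They perform well on complex, non-differentiable problems.",
]

with MosesTokenizer("en") as tokenize:
    grams = [gram for sentence in sentences for gram in tokenize(sentence)]

vocabulary = sorted(set(grams))

# Step 2: assign each gram a random weight between 0 and 1. Here an
# individual is one full weight vector (one weight per vocabulary gram),
# and the population is a collection of such vectors.
population_size = 30
population = [[random.random() for _ in vocabulary] for _ in range(population_size)]
```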

Cool, but how does this actually work?

Let’s slow down a bit and develop some intuition around genetic algorithms.

What are genetic algorithms?

First, let’s discuss how GAs work.

Figure 2: flow chart of a genetic algorithm generation. Image by author.

In figure 2, we can see the flow of a genetic algorithm — it’s not as complex as it looks. We initialize our population (yellow box) to be a weighted vector of grams, where each gram’s value is a word or symbol. The weights are initialized randomly to a value between 0 and 1.

Next, we feed our initial population into the algorithm. Values in dark blue correspond to a single generation in our algorithm. Crossover and Mutation in light blue are types of variations. Let’s take a look at each one in turn:

  1. Fitness Function: first, we define our fitness function, which calculates the probability that an individual will reproduce. In our case, we’ll be using ROUGE-N’s F1 score (a minimal ROUGE-1 version is sketched after this list). Higher scores mean an individual is more likely to reproduce.
  2. Parents Selection: next, we define who can and cannot breed. The example in the paper uses something called tournament selection, where we take random samples of individuals and select a winner. The winners get to breed.
  3. Mating Pool: in this step, we aggregate the winning parents into random pairs.
  4. Variations: next, we randomly vary our data in a predefined manner to hopefully create a more fit population (both operations are sketched in code after this list). The first method, called crossover, involves two parents exchanging some percentage of their weights and returning those two new weight vectors as children. Often, this percentage is determined by a randomly selected crossover index; all values past this index are exchanged. The second, called mutation, involves randomly varying the weights in each parent with low probability.
  5. Offspring: from here, we can aggregate the parents and offspring. If there is a population cap, we remove the least fit individuals until we are under that cap.
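Here’s a rough sketch of a ROUGE-1 fitness score and the two variation operators described above. The function names and the default mutation rate are illustrative, and individuals are represented as plain lists of gram weights.

```python
import random
from collections import Counter

def rouge_1_f1(candidate_tokens, reference_tokens):
    """ROUGE-1 F1: the harmonic mean of unigram precision and recall."""
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)

def crossover(parent_a, parent_b):
    """Single-point crossover: swap all weights past a random index."""
    point = random.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:], parent_b[:point] + parent_a[point:]

def mutate(individual, rate=0.01):
    """With low per-weight probability, replace a weight with a fresh random value."""
    return [random.random() if random.random() < rate else w for w in individual]
```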

With one generation complete, we evaluate the fitness of our population. If our stopping criterion is met, we return the population of weights and average them to get a single value for each gram. If our stopping criterion is not met, our population is deemed unfit and we feed the individuals into another generation.
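Putting those pieces together with the crossover and mutate helpers sketched above, here’s a minimal, hypothetical sketch of one generation. The tournament size, population cap, and stand-in fitness function are illustrative; in the article’s setup, fitness would be ROUGE-N’s F1 score for the summary produced from an individual’s gram weights.

```python
import random

def fitness(individual):
    # Stand-in objective so the sketch runs end to end. In practice this would
    # score the summary produced with these gram weights (e.g., via rouge_1_f1)
    # against a reference summary.
    return sum(individual) / len(individual)

def tournament_select(population, k=3):
    """Sample k individuals at random and return the fittest one."""
    return max(random.sample(population, k), key=fitness)

def run_generation(population, max_size=30, mutation_rate=0.01):
    # Parents selection: fill a mating pool with tournament winners.
    pool = [tournament_select(population) for _ in range(len(population))]
    random.shuffle(pool)  # mating pool: pair the winners at random

    # Variations: pair parents, then apply crossover and mutation.
    offspring = []
    for parent_a, parent_b in zip(pool[0::2], pool[1::2]):
        for child in crossover(parent_a, parent_b):
            offspring.append(mutate(child, mutation_rate))

    # Offspring: merge parents and children, then drop the least fit
    # individuals until we're back under the population cap.
    combined = sorted(population + offspring, key=fitness, reverse=True)
    return combined[:max_size]
```

Repeating run_generation on the result until the stopping criterion is met, then averaging each gram’s weight across the final population, yields the single per-gram values described above.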

Ethics aside, genetic algorithms are pretty cool, right?

If you’re still not convinced, here’s a GA that tries to guess a sentence’s contents through trial and error…

Figure 3: genetic algorithm trying to guess a sentence’s value. Image by author.

If we had used the brute force method, we would’ve needed to try k^n combinations, where k is the total number of letters in our alphabet and n is the total number of characters in our string. This example would take upwards of 53²⁷ possible combinations (lower/uppercase characters and a space, guessing a 27-character string).

The genetic algorithm guessed our string in 51 generations with a population size of 30, meaning it tested at most 1,530 combinations to arrive at the correct result. And it also tried symbols.
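For a rough sense of scale, taking the numbers above at face value (a 53-symbol alphabet, a 27-character target string, and 51 generations of 30 individuals):

```python
# Back-of-the-envelope comparison of the two search strategies.
brute_force = 53 ** 27    # every possible 27-character string over a 53-symbol alphabet
ga_evaluations = 51 * 30  # 51 generations with a population of 30

print(f"{brute_force:.1e}")  # on the order of 10**46 candidate strings
print(ga_evaluations)        # 1530 candidate strings
```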

But, why are they effective?

Genetic algorithms perform well on complex data. Traditional gradient-based optimizations use a model’s derivatives to determine which direction to search, so they require the model to have derivatives in the first place. If the model isn’t differentiable, we unfortunately can’t use gradient-based optimizations. Furthermore, if the objective surface is very “bumpy”, basic gradient optimizations, such as stochastic gradient descent, may not find the global optimum.

GAs, on the other hand, don’t require a differentiable model. Their random nature also helps them avoid getting stuck in local optima, which suits “bumpy”, complex objectives such as our gram weights. They’re also easily parallelized and tend to work well out of the box with some minor tweaks.

The main downside of GAs is that they don’t guarantee an optimal solution, whether it be local or global. For cases where optimality is required, this can be a deal-breaker.

A Visual Example

In our final section, let’s take a look at a genetic algorithm that trains “ships” to avoid obstacles and find a red square.

Figure 4: visualization of a genetic algorithm that trains ships to avoid obstacles and find a red square. Image by author.

If we take a look at figure 4 above, we’ll quickly notice that from generation 3 to 5, the ships improve at avoiding the green rectangle. That’s because if their “genetics” make them more likely to die, they do not reproduce.

However, in this configuration, the ships have no concept of sight; they just randomly move in a direction and remember what worked in the past. Because the feature space is so poor, this configuration took another 8 generations for ships to accidentally land on the red square. And if we gave them a completely new map, it would take another full training cycle.

To improve the ships’ ability to both optimize quickly and generalize to new problems, we’d need a better feature space and more environments to learn from.

Likewise with NLP, simple tokenization often does not create a sufficiently robust model, no matter how well the GA performs. More complex features, such as gram counts and prior/subsequent grams, are necessary to develop effective models.
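As a hypothetical illustration of such features (reusing the grams list from the earlier tokenization sketch; the feature names here are made up):

```python
from collections import Counter

# Richer per-gram features: corpus counts plus the immediate
# prior/subsequent grams, as suggested above.
gram_counts = Counter(grams)
gram_features = [
    {
        "gram": gram,
        "count": gram_counts[gram],
        "prior": grams[i - 1] if i > 0 else None,
        "subsequent": grams[i + 1] if i < len(grams) - 1 else None,
    }
    for i, gram in enumerate(grams)
]
```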

Genetic algorithms offer an effective and efficient method to develop a vocabulary of tokenized grams. No more. No less.

Implementation Notes

  • There is no rule for the best hyperparameters for a GA. It’s common practice to just try several combinations of parameters while using the largest population size and number of generations that your machine will allow (an illustrative starting configuration is sketched after these notes).
  • Convert your tokens to lowercase if the case of the token is not being used as a feature. This can significantly reduce vocabulary size.
  • An area of personal interest is using GAs (and other optimization algorithms) to handle the feature engineering portion of our modeling as well. Got any ideas?
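For the hyperparameter note above, a hypothetical starting configuration might look like the following; every value is an illustrative default rather than a recommendation from the paper.

```python
# Illustrative GA hyperparameters: try several combinations, using the
# largest population size and generation count your machine will allow.
ga_config = {
    "population_size": 100,  # as large as memory and runtime permit
    "max_generations": 200,  # hard cap if the stopping criterion never fires
    "tournament_size": 3,    # individuals compared per tournament
    "crossover_rate": 0.9,   # probability a selected pair is crossed over
    "mutation_rate": 0.01,   # per-weight probability of random mutation
    "patience": 10,          # stop after this many generations without F1 improvement
}
```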

Thanks for reading! I’ll be writing 45 more posts that bring “academic” research to the DS industry. Check out my comments for links/ideas on applying genetic algorithms to NLP data.
