Genetic Algorithm (GA): A Simple and Intuitive Guide

Learn what metaheuristics are and why we sometimes use them instead of traditional optimization algorithms. Then learn how the metaheuristic Genetic Algorithm (GA) works through a simple step-by-step guide.

Dana Bani-Hani
Towards Data Science


(Image by Freepik)

Whether you are a data scientist, a data analyst, or a machine learning engineer, operations research and optimization should be part of your toolbox. Before diving into the Genetic Algorithm (GA), I will explain what metaheuristic algorithms are and why we sometimes use them instead of traditional optimization algorithms. Afterwards, I will introduce the metaheuristic GA and explain how it works and the intuition behind it.

What Are Metaheuristic Algorithms?

Metaheuristics are powerful optimization algorithms that are not problem-dependent: their frameworks are not designed for one specific problem, so you can use them for pretty much any optimization problem (unlike heuristics, which are usually adapted to the problem at hand).

Many metaheuristics are inspired by nature. For example, the algorithm simulated annealing is inspired by the process of heating then slowly cooling metal or glass. Another example is ant colony optimization which mimics the way ants behave when travelling to a food source and communicating with each other through pheromones.

Why Do We Use Metaheuristics Instead of Traditional Optimization Algorithms?

Traditional optimization algorithms such as greedy algorithms, branch and bound, and Dantzig’s simplex algorithm, among others, have shortcomings. Here are just a few of the important reasons why we tend to prefer metaheuristics over traditional optimization algorithms:

#1- Speed and problem size:

Short answer: metaheuristics are faster than traditional algorithms and can handle larger problems.

To explain this point, let’s take the example of the Travelling Salesman Problem (TSP). The TSP is a widely used optimization problem in which a salesman has to travel to n cities, passing through each city only once, and then return to the origin city from which they left. The question is: what route should they travel (from which city to which city) to minimize the total distance traveled?

Figure 1: a TSP with 6 cities (image by author)

Let’s say that you want to solve a TSP with 6 cities, with the distances given in Figure 1. As can be seen, the distance between city A and city B is 4 units.

Here, you need to find the route that minimizes the distance traveled. The optimal route in this case is D to C to A to B to E, then back to D (since the salesman returns to the origin city), which comes out to 41 units (7+11+4+6+13=41). If the salesman were to take the worst possible route instead, E to A to D to B to C, then back to E, the total distance would be 53 units, longer than the optimal 41 units. Keep in mind that D to C to A to B to E, then back to D, is the same as going from A to B to E to D to C, then back to A, because a single route is a closed loop; you can start from any point in the loop, end at that same point, and the distance traveled stays the same.

For n cities, the number of possible routes is ((n-1)!)/2. It is (n-1)! rather than n! because it does not matter which city you start at, and it is divided by 2 only if the distances between the cities are symmetrical (i.e. the distance between city A and city B equals the distance between city B and city A). In the case of asymmetry, the number of possible routes is (n-1)!. In our example of 6 cities, where all distances are symmetrical, the total number of routes is ((6-1)!)/2, which is 60.

Now imagine a scenario where you had 25 cities instead of just 6. The total number of routes in this case would be ((25-1)!)/2, which is 3.102242*10²³ possible paths. To put things into perspective, if you had a computer that could evaluate 10,000,000 routes per second, it would need 3.102242*10¹⁶ seconds to check them all, which is around 983,041,000 years!
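If you want to check these numbers yourself, here is a small Python snippet that reproduces the arithmetic above (the 10,000,000 routes per second is the same assumed evaluation speed):

```python
import math

# Number of possible routes for a symmetric TSP with n cities: ((n - 1)!) / 2
def num_routes(n):
    return math.factorial(n - 1) // 2

print(num_routes(6))    # 60 routes for the 6-city example
print(num_routes(25))   # 310224200866619719680000, i.e. ~3.102242 * 10^23 routes

# Assuming a computer that evaluates 10,000,000 routes per second:
seconds = num_routes(25) / 10_000_000
print(seconds)                              # ~3.102242 * 10^16 seconds
print(seconds / (60 * 60 * 24 * 365.25))    # ~983 million years
```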

Metaheuristic algorithms find ways to go from one solution to a better one without considering every combination out there; they choose better solutions than the one in hand and update it according to some rules (points #2 and #3 explain this better).

#2- Local minima and global minima:

Short answer: metaheuristics can avoid getting stuck in a local minimum, and escape one, because they use randomness (random numbers) to accept worse moves (a larger f(x) for minimization problems), since such moves may lead to better answers (i.e. a better local minimum or the global minimum).

When you are tackling an optimization problem, you are always either trying to minimize an objective function such as total distance traveled, or maximize an objective function such as profit.

Figure 2: local minima (x1) and global minima (x2) in the search space of f(x) (image by anotherbookondatascience)

For minimization problems, you will often hear the terms local minimum and global minimum (local optimum and global optimum for maximization problems). Take a look at Figure 2, where x1 is a local minimum and x2 is the global minimum. Assume we have a function f(x) where inputting x1 gives f(x)=10 and inputting x2 gives f(x)=6; clearly x2 is a better input than x1, because 6 is lower than 10, which is better for our minimization problem. Whatever the function f(x) is, the question is simply: what should x be to get the lowest possible f(x)?

Figure 3: our current value of x is 2 (image edited by author)

Imagine you started at a point to the left of x1, where x=2 (Figure 3), and you would like to use a greedy algorithm to minimize f(x). Greedy algorithms only update x if the new value gives a better answer, in our case a lower f(x). We try x=2.1; f(x=2.1) is lower than f(x=2), so our new x becomes 2.1. There is still room for improvement, so we continue doing this until we reach x1, where x=3.

Figure 4: a- our current value of x is 3. b- f(x) gets worse (larger) if we increase x to 3.1. c- x2 is a better choice than x1 because f(x=x2) is smaller than f(x=x1) (image edited by author)

However, if we increase x to 3.1, you can see that it gives a higher f(x), as shown in Figure 4. Because of this, the best x the greedy algorithm can find is x=3 (at x1).

By looking at Figure 4, it is clear that x2 is an even better choice than x1 for lowering f(x). But when dealing with real problems, the search space is not known, and you would need to try many, many values of x to be able to draw a search space like the one in the figure. Even then, since x can take on an infinite number of values, finding the value of x that yields the best result is sometimes impossible. However, if we have an upper bound and a lower bound for x, finding the global minimum (the best solution) can be possible.

Going back to why we use metaheuristics: with a greedy algorithm, once our x became x1 (which has a value of 3), any further update would give a worse f(x), so the best solution found would be f(x=3) and the algorithm would terminate. With metaheuristics, we sometimes accept worse moves (in the figure, the move to x=3.1). The reason we are okay with accepting worse moves is that they have the potential to lead us to even better solutions, such as x=x2. If we keep updating x toward the top of the hill, we can descend the other side and find x2, where f(x2)<f(x1).
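To make the idea of accepting worse moves concrete, here is a small, purely illustrative Python sketch (the function f and every parameter value are assumptions for this toy example, not the function from the figures): a greedy descent stops at the first local minimum it reaches, while a search that occasionally accepts worse moves can climb over the hill and find a deeper minimum.

```python
import random

# Toy function with a local minimum near x = 1.9 and a deeper one near x = -2.1.
def f(x):
    return x**4 - 8*x**2 + 3*x

def greedy_descent(x, step=0.1):
    # Move to a neighboring x only if it lowers f(x); stop when neither side helps.
    while True:
        best_neighbor = min((x - step, x + step), key=f)
        if f(best_neighbor) >= f(x):
            return x                      # stuck in a (possibly local) minimum
        x = best_neighbor

def search_accepting_worse(x, step=0.5, iters=2000, p_accept_worse=0.5):
    best = x
    for _ in range(iters):
        candidate = x + random.choice([-step, step])
        # Always accept improving moves; accept worse moves with some probability,
        # which is what allows the search to escape a local minimum.
        if f(candidate) < f(x) or random.random() < p_accept_worse:
            x = candidate
        if f(x) < f(best):
            best = x                      # keep track of the best point seen so far
    return best

print(greedy_descent(3.0))           # stops near x = 1.9, the first local minimum it reaches
print(search_accepting_worse(3.0))   # typically ends up near x = -2, the deeper minimum
```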

Figure 5: a search space with 2 input variables (image by mathworks)

Note: if we could calculate all possible solutions to a problem, we would not care about getting stuck in a local minimum, because we could just pick the best answer out of all combinations and that would be the global minimum. However, as discussed earlier, as a problem grows larger, so does the time it takes to compute all possible solutions.

If you look back at Figures 2-4, there is only one variable (x) for which we are trying to find the value that gives the lowest f(x). In real-life problems, however, you will have a much larger number of variables to optimize (to find the best value for each). In Figure 5, we have 2 variables that need optimizing (x and y), with f(x) on the z-axis. This case, and the previous one, can be solved just by looking at the plots. But when you have more than 2 variables, it becomes impossible to draw and solve by inspection, because you cannot visualize more than 3 dimensions.

#3- Neighborhood search and population-based search:

Short answer: especially with a population-based search method, you will have many solutions at once, spread all over your search space. This means you get to evaluate many points or solutions rather than having one solution and updating only that.

Figure 6: neighborhood search algorithm converging to a local optima (image by CMU)

With metaheuristics, your algorithm is either a neighborhood (or local) search algorithm or a population-based one. A neighborhood search algorithm keeps one point, or solution, throughout the search and updates it as the algorithm runs. Since it is a metaheuristic, the solution is able to escape if it gets stuck in a local minimum, and this keeps going until the algorithm terminates and you converge to a final solution, as seen in Figure 6. As the algorithm approaches the end of its search, the probability of accepting worse moves becomes smaller, because the algorithm needs to converge toward a better solution. Examples of neighborhood search algorithms are simulated annealing and tabu search.

Figure 7: solutions in a population-based search algorithm (image edited by author)

With population-based search algorithms, you have multiple solutions at once in your search space, like a “population” of them (Figure 7), and with each iteration the algorithm takes, a certain number of them are updated and become new solutions. Each population-based search algorithm has its own rules on how it selects solutions to update and how that update is carried out. Population-based search algorithms tend to be very powerful because you explore various points in your search space rather than one, and when the algorithm terminates you can either select the best solution from the last update (the one it converged to) or keep track of all updates and select the best out of all of them. Examples of population-based search algorithms are GA (which will be discussed shortly), evolutionary strategies, particle swarm optimization, and ant colony optimization.

To Sum up Before Diving into GA …

While traditional algorithms work just fine on small problems, we tend to prefer metaheuristics when the problem is large and thus requires more time. Metaheuristics are able to avoid getting stuck in a local minimum. That said, for very large problems, such as unbounded problems where each decision variable can take on any real number, your search space is infinitely large, so you may not be able to find the global minimum even with metaheuristics, but you will find a good local minimum that suffices. Lastly, population-based search algorithms have the great advantage of searching various points in the search space, which increases the algorithm’s chance of finding a better local minimum or the global minimum.

So, What is Genetic Algorithm (GA)?

GA is a population-based metaheuristic developed by John Holland in the 1970s. GA uses techniques inspired by nature, more specifically evolution, to find an optimal or near-optimal solution to a problem. It applies evolutionary concepts such as reproduction and survival of the fittest. GA belongs to the larger class of evolutionary algorithms.

If I were to try and explain GA in a few sentences, I would say that GA is a way to optimize a problem by creating many solutions and updating those solutions in certain ways related to evolution concepts to reach a “good enough” solution or the best solution possible.

In continuous problems, where the decision variable you want to find the optimal value for takes on real numbers, a solution in GA would look like what is shown in Figure 8. This is called a chromosome (sometimes referred to as a string). Each square below is called a “gene” while the value each gene takes is called an “allele”.

Figure 8: an example of a chromosome in GA (image by author)

In real-life problems, most decision variables are constrained and bounded so that they cannot take a value of infinity (which is not a real number). For example, if you had an optimization problem that maximizes profit (or minimizes cost) for a certain company, it may ask what the maximum number of labor hours should be, usually given other conditions. You may first be inclined to say we can have an infinite number of hours, but since that is not feasible, the company would set a maximum limit; for example, it may say that the number of hours per week should not exceed 120. Therefore, the lower bound of the number of hours (x) would be 0 and the upper bound would be 120 (0≤x≤120).

Now that the decision variable x is placed between bounds, we can decode a chromosome into a real value for the number of hours.

Figure 9: each gene with its corresponding 2^i (image by author)

For the chromosome in Figure 9, or any other GA chromosome encoded with 0s and 1s (in the case of continuous problems), each gene is associated with 2^i (2 to the power of i), where i starts at 0 at the far right and increases by 1 with each gene as you move left (i.e. 2⁰, 2¹, 2², 2³, …).

To decode this chromosome (transform it into a real value), we use Formula 1, where the summation of bits * 2^i is calculated by multiplying each gene value by its corresponding 2^i and summing the results. Looking at Figure 9 and starting from the right, this equals (0*2⁰)+(1*2¹)+(0*2²)+(0*2³)+(0*2⁴)+(1*2⁵)+(0*2⁶)+(1*2⁷)+(1*2⁸), which gives us a value of 418.

Formula 1: decoding a GA chromosome

For the second part of Formula 1, the precision, the equation is given in Formula 2 below, where l is the length of the chromosome (in our case 9, since we have 9 genes), b is the upper bound, which is 120 (the maximum number of hours allowed in the example), and a is the lower bound, which is 0. The precision is therefore (120-0)/(2⁹-1), which equals 0.234833, so decoding the chromosome gives 418*0.234833+0, which is 98.16 hours, a value between 0 and 120.

Formula 2: GA precision

Important note: if we replace all the genes in our 9-gene chromosome with 1s, we should get the maximum value allowed, which is 120 (the upper bound), and if we replace all the genes with 0s, we should get the minimum allowed, which is 0 (the lower bound). Let’s try decoding the chromosome when all genes are 1s. The summation of bits multiplied by 2^i is (1*2⁰)+(1*2¹)+(1*2²)+(1*2³)+(1*2⁴)+(1*2⁵)+(1*2⁶)+(1*2⁷)+(1*2⁸), no zeros this time, which gives a total of 511. If we multiply 511 by the precision 0.234833 and add a (which is 0), we get 120. Note: you will actually get 119.99963, because the precision has more digits and I only kept the first 6 after the decimal point; if you include all digits, you get exactly 120.
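Here is a minimal Python sketch of this decoding, following Formula 1 and Formula 2 above (the decode function name and the list-of-genes representation are just illustrative choices):

```python
def decode(chromosome, a, b):
    """Decode a list of 0/1 genes into a real value between a and b."""
    l = len(chromosome)
    precision = (b - a) / (2**l - 1)      # Formula 2: (b - a) / (2^l - 1)
    # The rightmost gene corresponds to 2^0, so reverse before summing bit * 2^i.
    bit_sum = sum(bit * 2**i for i, bit in enumerate(reversed(chromosome)))
    return bit_sum * precision + a        # Formula 1: (sum of bits * 2^i) * precision + a

# The 9-gene chromosome from Figure 9 (written left to right), with 0 <= x <= 120:
print(decode([1, 1, 0, 1, 0, 0, 0, 1, 0], a=0, b=120))   # ~98.16 hours (bit sum = 418)
print(decode([1] * 9, a=0, b=120))                        # ~120, the upper bound
print(decode([0] * 9, a=0, b=120))                        # 0.0, the lower bound
```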

The chromosome representation (encoding) is called a “genotype” while the decoded value of the chromosome is called a “phenotype”.

Note: the length of the chromosome (l) is set by the user. The longer the string the more accurate your decoded value is, but there is a trade-off with computational time.

What if we had more than one decision variable?

Say we had a minimization problem where we need to find the values of x and y that give us the lowest f(x). How would you represent that? Simple: we use a chromosome where half of the genes represent x and the other half represent y. Keep in mind that for each of x and y, the gene that represents the start of each one has a corresponding value of 2⁰.

See Figure 10, where the green genes represent x and the blue genes represent y. They both occupy the same string, but they should be treated separately when decoding each to its corresponding value. You decode each one exactly the same way shown above. The same principle applies when you have more than two decision variables.

Figure 10: a chromosome with 2 decision variables (x and y) (image by author)
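As a small sketch of the two-variable case, reusing the hypothetical decode() helper from the sketch above (the 12-gene chromosome and the bounds are made up for illustration): you simply split the string in half and decode each part on its own.

```python
# 12 genes: the first 6 (the "green" genes) encode x, the last 6 (the "blue" genes) encode y.
chromosome = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0]
half = len(chromosome) // 2

x = decode(chromosome[:half], a=-10, b=10)   # decode the x half with its own bounds
y = decode(chromosome[half:], a=-10, b=10)   # decode the y half with its own bounds
print(x, y)
```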

Terminologies you need to understand before going deeper into GA:

Besides the terms above, such as chromosome, gene, allele, genotype, and phenotype, here I explain a few others, whose explanations overlap somewhat:

1- Population: a population is the set of solutions at a certain generation (gen). Remember that GA is a population-based search algorithm, which means that at every generation we have several solutions in our search space, all with the potential of being updated into new solutions. In GA, the population size remains constant throughout the search, so if your population size is 100, you will have 100 solutions in each gen, each a chromosome or string that can be decoded to a certain value.

Figure 11: a population size of 10, with the first 4 shown to the reader (image by author)

Figure 11 shows an example of a population size of 10 (population size is usually much higher than 10, but I used 10 for illustration purposes), each with a chromosome size of 8. All of the 10 chromosomes are filled with 0s and 1s (I only show the first 4 filled, also for illustration purposes, but they will all be filled with 0s and 1s).

Important note: the initial population you start your algorithm with is chosen at random. You set the length of the chromosome and how many chromosomes you want in the population (the population size), and randomly fill them with 0s and 1s; this is your first generation. The population of each subsequent generation is the result of manipulating and updating the previous generation’s population, i.e. you create a random population for gen #1, its resulting offspring become the population for gen #2, their offspring become the population for gen #3, and so on.
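A minimal sketch of creating that random initial population, assuming NumPy and illustrative values for the population size and chromosome length:

```python
import numpy as np

pop_size, chrom_length = 10, 8              # the sizes used in Figure 11
rng = np.random.default_rng(seed=42)        # seed chosen arbitrarily

# Each row is one chromosome: a string of randomly chosen 0s and 1s.
population = rng.integers(0, 2, size=(pop_size, chrom_length))
print(population)
```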

2- Generation: your algorithm will run for gen generations before terminating. Each generation will have the same population size but with different solutions (some might stay the same as the previous generation). Before going to the next generation (e.g. from gen #1 to gen #2), you will need to update your current population with the crossover and mutation operators to create new solutions (offspring) in the search space. After you get a new set of solutions equal to the population size, you then go to the next generation and start over using the new solutions as your population.

3- Parents and offspring: in each generation, every chromosome in the current population is a potential parent with some chance to “mate” and produce “offspring”. These offspring are the new set of solutions that become the population for the next generation, then their offspring become the population for the generation after that, and so on. Think of it as how we have generations in real life.

How Do We Update the Population and Create Offspring?

GA is inspired by evolution and survival of the fittest, which means that in order to produce offspring, 2 chromosomes or “parents” will reproduce and create 2 offspring or “children”.

Parent selection:

Since we are looking for good or the best solution(s) to a problem, we need to try and make our new and updated solutions come from good ones.

There will be a parent selection method that we will apply to the population to find 2 good solutions, which will be the 2 parents, and apply an operator called crossover on the selected parents to produce the children. So how can we make sure we select 2 good solutions to be the parents?

First, you need to know that just because a solution (chromosome) is not good compared to the rest of the population, it does not mean it will not get picked as a parent; it only means it has a lower chance of getting picked. Many parent selection methods have been developed, but two widely used ones are tournament selection and roulette wheel selection.

1- Tournament selection: the idea is straightforward. You randomly select k solutions from the population (with or without replacement), where you set the value of k, and pick the best one to be parent #1, then repeat by picking another k solutions and choosing the best one to be parent #2. Simple, right? But how do you know which is best? You compare their objective function values to see which chromosome gives the best f(x).

Example: let’s take the Booth function f(x,y)=(x+2y-7)²+(2x+y-5)² as a minimization problem in which x and y are both bounded by -10≤x,y≤10, where -10 and 10 are the lower bound (a) and upper bound (b), respectively. We want to know which values of x and y give us the lowest f(x).

Figure 12: parent #1 selection using the tournament selection method (image by author)

Now we want to select parents to produce offspring, and we set k=3. We randomly select 3 chromosomes from the population, then decode each chromosome into actual values of x and y. Starting with the first randomly selected chromosome, if x after decoding is 2.1, for example, and y after decoding is 3.6, we plug these values into f(x) and get f(x)=(2.1+2*3.6-7)²+(2*2.1+3.6-5)², which equals 13.13. We then take another selected chromosome and find, through the same procedure, that its f(x) is, say, 8.32, and the last selected chromosome’s f(x) is 16.84. Since this is a minimization problem, the lowest f(x) is the better solution; the chromosome that gave us 8.32 therefore outperformed the other two (13.13 and 16.84), so it becomes parent #1, as illustrated in Figure 12. To find parent #2, we repeat the exact same procedure. The f(x) of a chromosome is called its fitness, fitness value, or objective function value.

So, the tournament selection method is applied using the following steps (a short code sketch follows the list):

  • Select k solutions at random
  • Find f(x) for each of the k chromosomes
  • Pick the chromosome that gives the best f(x), if more than one solution tied for best, pick any
  • Set as parent #1
  • Repeat to find parent #2
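Here is a minimal sketch of these steps in Python, using the Booth function from the example above. To keep it short, the population is assumed to hold already-decoded (x, y) pairs rather than binary chromosomes; in a full GA you would decode each chromosome before evaluating it.

```python
import random

def booth(x, y):
    # The Booth function from the example; lower is better (minimization).
    return (x + 2*y - 7)**2 + (2*x + y - 5)**2

def tournament_select(population, k=3):
    # Randomly pick k solutions and return the one with the best (lowest) fitness.
    candidates = random.sample(population, k)
    return min(candidates, key=lambda sol: booth(*sol))

# A toy population of already-decoded (x, y) solutions, each bounded by -10 and 10.
random.seed(0)
population = [(random.uniform(-10, 10), random.uniform(-10, 10)) for _ in range(10)]

parent1 = tournament_select(population, k=3)
parent2 = tournament_select(population, k=3)
print(parent1, parent2)
```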

2- Roulette wheel selection: the fitness (f(x)) of a chromosome is used to assign it a probability of being selected as a parent. For maximization problems, the higher the f(x), the higher the probability of being selected. See Figure 13 as an example, where there are 5 solutions in the population to select from. The probability is based on how good a solution is compared to the rest: the higher the probability, the higher the chance of being selected, which reflects survival of the fittest while still giving worse solutions some chance of being selected.

Figure 13: roulette wheel selection method (image by NCL)
Figure 14: cumulative sum of probabilities (image by author)

To get the probabilities for each solution, you add all the fitness values (f(x)) together to get a total, then divide each f(x) by that total to get the fraction it occupies. Then you draw a random number between 0 and 1 and use it to select parent #1 from the probabilities. If the 5 solutions get probabilities 5%, 12%, 14%, 31%, and 38% based on their fitness, and your random number comes out to 0.89, it falls in the range 0.62 to 1.00, which makes us choose chromosome #3, the one with a 38% probability of being selected. If our random number came out to 0.16, we would choose chromosome #4, which has a probability of 12%. As you can see in Figure 14, each range is based on the cumulative sum of the probabilities.

Table 1: example of 5 chromosomes and their corresponding probabilities (table by author)

Another example: say we had 5 chromosomes with fitness values #1: 3.12, #2: 5.18, #3: 6.46, #4: 4.23, and #5: 3.66 for a maximization problem. We first add them to get the total, 22.65. Now we divide each one by the total (e.g. 3.12/22.65) to get percentages: 0.13775, 0.2287, 0.28521, 0.18675, and 0.16159, respectively, which add up to 1.00. Afterwards, we find the cumulative probabilities, which create the ranges from which our random number selects. In Table 1, chromosome #1 is chosen as a parent if the random number falls between 0 and 0.137748344, and chromosome #4 is chosen if the random number is higher than 0.651655629 and lower than 0.838410596.

Okay, but what happens if we have a minimization problem?

With maximization problems, the probability of a chromosome is proportional to its fitness value: the higher the fitness, the better, and the higher the probability. With a minimization problem, the smaller the fitness the better, yet we still need to associate it with a higher probability. How? There are many ways to do this; the one I choose is to take the inverse of each fitness value, f(x), so that the inverse of the lowest number becomes the highest when compared to the inverses of the others.

For example, say we have a minimization problem and the fitness values of 3 chromosomes are 2, 4, and 6. We would like 2 (the lowest f(x)) to get the highest probability and 6 to get the lowest. So we invert them: 2 becomes 1/2 (0.5), 4 becomes 1/4 (0.25), and 6 becomes 1/6 (0.16667). Adding 0.5, 0.25, and 0.16667 gives a fitness total of 0.91667. The associated probabilities are therefore 0.54545, 0.27273, and 0.18182, respectively. You can see that 2, the lowest f(x), has the highest probability, 0.54545.
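A small Python sketch of roulette wheel selection covering both cases, using the fitness values from Table 1 and from the minimization example above (the function name and structure are illustrative choices):

```python
import random

def roulette_select(fitnesses, minimize=False):
    # For minimization, invert each fitness so that a smaller f(x) gets a larger share.
    scores = [1.0 / f for f in fitnesses] if minimize else list(fitnesses)
    total = sum(scores)
    probabilities = [s / total for s in scores]

    # Build the cumulative probabilities and "spin the wheel" with one random number.
    r, cumulative = random.random(), 0.0
    for index, p in enumerate(probabilities):
        cumulative += p
        if r <= cumulative:
            return index               # index of the chromosome chosen as a parent
    return len(probabilities) - 1      # guard against floating-point rounding

random.seed(4)
print(roulette_select([3.12, 5.18, 6.46, 4.23, 3.66]))   # maximization (Table 1 values)
print(roulette_select([2, 4, 6], minimize=True))          # minimization (2 is most likely)
```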

Crossover:

Now that we have 2 parents, the next step is to perform the crossover operator and create offspring or children. There are many ways to do crossover, here I will explain the two methods commonly used:

Important note: I will be using letters instead of 0s and 1s to show you how it works because explaining with 0s and 1s could cause confusion.

#1- Single-point crossover:

Figure 15: single-point crossover example (image by author)

When we have the 2 parents, we apply single-point crossover by randomly selecting a cutoff point. You can do this by drawing a random integer to specify where the crossover will happen. For example, let’s say our random integer comes out to 6 (this depends on your programming language; Python, for example, starts indexing at 0). The genes after the cutoff point are then exchanged between the two parents. See Figure 15: the [R, Z, P] from parent #1 is exchanged with the [A, F, W] from parent #2, creating the offspring child #1 and child #2.

#2- Two-point crossover:

Figure 16: two-point crossover example (image by author)

The idea is the same as single-point crossover, but instead of randomly selecting one integer, we select two and exchange what lies between them. In Figure 16, the two-point crossover exchanges [T, B, L] with [P, S, D] to create the offspring. You can either force the two random integers to differ (with an if or while statement) or allow them, by chance, to be the same; if that happens, it becomes a single-point crossover. This is up to the user’s preference.

Note: the crossover operator can be applied based on a probability, so that you do not perform crossover every time; however, most studies apply a probability of 1 (100%), ensuring crossover happens every time parents are selected.
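Here is a minimal Python sketch of both crossover methods (the parent strings are hypothetical letter chromosomes, in the same spirit as the figures, and the cut points are drawn at random):

```python
import random

def single_point_crossover(parent1, parent2):
    # Pick one random cutoff point and swap everything after it.
    point = random.randint(1, len(parent1) - 1)
    return parent1[:point] + parent2[point:], parent2[:point] + parent1[point:]

def two_point_crossover(parent1, parent2):
    # Pick two distinct random cutoff points and swap the segment between them.
    p1, p2 = sorted(random.sample(range(1, len(parent1)), 2))
    child1 = parent1[:p1] + parent2[p1:p2] + parent1[p2:]
    child2 = parent2[:p1] + parent1[p1:p2] + parent2[p2:]
    return child1, child2

random.seed(3)
parent1, parent2 = list("ABCDEFGHI"), list("RSTUVWXYZ")   # hypothetical 9-gene parents
c1, c2 = single_point_crossover(parent1, parent2)
print("".join(c1), "".join(c2))
c3, c4 = two_point_crossover(parent1, parent2)
print("".join(c3), "".join(c4))
```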

Mutation:

To create diversity in the search space, the mutation operator is applied with a certain (usually low) probability, e.g. 0.1 or 0.01. Mutation is applied to the children after they are created by crossing over the two parents.

Figure 17: mutation example (image by author)

The way to do this is straightforward: start with child #1 and go through each gene one by one. At the first gene, draw a random number; if the random number is less than the probability of mutation, you mutate, otherwise you move to the next gene, draw another random number, and so on. But what happens when we do mutate? Simple: if the gene to be mutated has a value of 0, it becomes 1, and if it has a value of 1, it becomes 0. Look at Figure 17: the red boxes indicate which genes will be mutated and the green boxes show the result of mutation. Since the probability of mutation is low, few (possibly zero) genes will mutate.
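A minimal sketch of bit-flip mutation in Python (the child chromosome and the p_m value are made up for illustration):

```python
import random

def mutate(chromosome, p_m=0.1):
    mutated = []
    for gene in chromosome:
        if random.random() < p_m:
            mutated.append(1 - gene)   # flip the gene: 0 becomes 1, 1 becomes 0
        else:
            mutated.append(gene)       # leave the gene unchanged
    return mutated

random.seed(5)
child = [0, 1, 1, 0, 1, 0, 0, 1]
print(mutate(child, p_m=0.1))   # with a low p_m, most (or even all) genes stay the same
```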

Putting it All Together

Figure 18 below is a flowchart I developed for GA with tournament selection as the parent selection method; I found almost all of the flowcharts available online a bit difficult for beginners to understand. Notation in the flowchart is as follows: M is the number of generations, N is the population size, p_c is the probability of crossover, p_m is the probability of mutation, and k is the number of chromosomes selected at random during tournament selection.

Figure 18: GA flowchart with tournament selection (image by author)

First, we choose the number of generations (M), population size (N), probability of crossover (p_c), probability of mutation (p_m), and number of chromosomes selected during tournament selection (k). Let’s say we set M=100, N=120, p_c=1.00, p_m=0.10, and k=3. The problem we want to solve is a continuous minimization problem. The steps are listed below, and a short code sketch of the whole loop follows the list.

  1. We randomly create a population of 120 chromosomes (each chromosome is randomly created with 0s and 1s, at a length (l) you choose). At this moment, we are at gen #1.
  2. Time to select 2 parents using tournament selection. We randomly select 3 solutions from the population and we find the one with the best fitness value (lowest f(x), since we are dealing with a minimization problem), and that will be parent #1. We then do this again by selecting 3 random solutions and finding the best one, that would be parent #2.
  3. At p_c, we crossover the 2 parents to get 2 children.
  4. At p_m, we mutate the 2 children to get 2 mutated children.
  5. We store these 2 mutated children in an array.
  6. Do steps 2 to 5 N/2 times (60 times, because 120/2 is 60). We need to recreate a population of 120 for each future generation to keep the population size consistent; since each pass through steps 2 to 5 produces 2 mutated children, performing them 60 times gives a population of 120 for the next generation (60*2=120). Store ALL resulting mutated children in an array. At this point we have a population of 120, which are the mutated offspring of gen #1.
  7. We move to the next generation, gen #2, and repeat steps 2 to 6. The resulting offspring will become the population of gen #3.
  8. Keep doing this until you finish all 100 (M) generations.
  9. You have two options: 1) select the best solution (mutated child chromosome) out of gen #100, i.e. the converged solution your algorithm reached right before termination, or 2) out of each generation’s mutated children, right before you move to the next generation, store the best mutated child in a separate array. If you do this for every generation, at the end of the algorithm you will have an array holding the best solution of each generation, i.e. the best 100 solutions your algorithm has seen, and you pick the best of these 100 chromosomes as your final solution. I choose option #2 because it guarantees you keep the best answer the algorithm has come across: entry #100 of that array is exactly the solution option #1 would give you (the best of gen #100 only), but the array also lets you select the best across all 100 generations rather than only the final one. It is quite possible that the best solution appears at an earlier generation (e.g. gen #42) and the GA then wanders to worse regions of the search space and converges to a worse final solution.
  10. Now that you have the best chromosome, decode it to get real values and there you go, you solved the problem!
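To tie the steps together, below is a compact, illustrative Python sketch of the loop in Figure 18, applied to minimizing the Booth function with the parameter values above (M=100, N=120, p_c=1.00, p_m=0.10, k=3). The 10-bit encoding per variable and the random seed are assumptions made for this sketch, not part of the walkthrough.

```python
import random

M, N, P_C, P_M, K = 100, 120, 1.0, 0.10, 3
BITS, A, B = 10, -10, 10                                 # 10 genes per variable, bounds -10 and 10

def decode(genes, a=A, b=B):
    bit_sum = sum(bit * 2**i for i, bit in enumerate(reversed(genes)))
    return bit_sum * (b - a) / (2**len(genes) - 1) + a

def fitness(chrom):
    x, y = decode(chrom[:BITS]), decode(chrom[BITS:])
    return (x + 2*y - 7)**2 + (2*x + y - 5)**2           # Booth function (minimize)

def tournament(pop):
    return min(random.sample(pop, K), key=fitness)

def crossover(p1, p2):
    if random.random() < P_C:
        cut = random.randint(1, len(p1) - 1)             # single-point crossover
        return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
    return p1[:], p2[:]

def mutate(chrom):
    return [1 - g if random.random() < P_M else g for g in chrom]

random.seed(7)
population = [[random.randint(0, 1) for _ in range(2 * BITS)] for _ in range(N)]
best = min(population, key=fitness)                      # best chromosome seen so far

for gen in range(M):
    offspring = []
    for _ in range(N // 2):                              # N/2 pairs produce N children
        parent1, parent2 = tournament(population), tournament(population)
        child1, child2 = crossover(parent1, parent2)
        offspring += [mutate(child1), mutate(child2)]
    population = offspring                               # the offspring become the next gen
    best = min(population + [best], key=fitness)         # option #2: track the overall best

x, y = decode(best[:BITS]), decode(best[BITS:])
print(x, y, fitness(best))   # typically close to the optimum x = 1, y = 3, f = 0
```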

Congratulations! You now know what GA is and how it works!

Summary

GA is a powerful population-based search metaheuristic algorithm. It is inspired by evolution and its concepts such as reproduction and survival of the fittest.

In this explanation, I covered how GA is applied to continuous optimization problems where the chromosomes are represented (encoded) with 0s and 1s. We learned how to decode a chromosome into an actual value, how to select parents, how to perform crossover to produce children, and how to mutate them. A flowchart and a step-by-step guide on how the GA algorithm is executed were also explained.

Final note: the same principles of crossover and mutation can be applied to combinatorial problems, such as the TSP, where what you optimize are not decision variables that take on real numbers but discrete elements, such as traveler or vehicle routes. Although the idea behind crossover and mutation is the same for combinatorial problems, the way we apply them is slightly different: you do not want to end up with a traveler’s path like [A, D, B, D], where city D appears twice and city C is missing from the path, which can happen if you use the continuous crossover or mutation methods on discrete problems.

© 2020 Dana Bani-Hani — articles, codes, and images developed by Dana Bani-Hani should not be used without permission and proper citation.

References

Holland, John Henry. Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence. MIT press, 1992.

Intro Image: https://www.freepik.com/free-photo/dna-bacterias_923816.htm#page=3&query=dna+genetics&position=21

Figure 2: https://www.anotherbookondatascience.com/chapter6.html

Figure 5: https://www.mathworks.com/matlabcentral/mlc-downloads/downloads/submissions/27178/versions/6/previews/html/gaPeaksExample.html

Figure 6: https://www.cs.cmu.edu/afs/cs/project/jair/pub/volume15/ambite01a-html/node9.html

Figure 13: http://www.edc.ncl.ac.uk/highlight/rhjanuary2007g02.php
