Machine Learning in Bioinformatics: Genome Geography

From raw sequencing reads to a machine learning model that infers an individual's geographical origin from their genomic variation.

Nico Alavi
Towards Data Science


In recent years, companies like 23andMe have gained traction by feeding our desire to understand the roots of our ancestry. They promise insights into the geographical origins of our genetic makeup, using just a droplet of saliva.

In this article, we are going to have a (rather simplistic) look at the bioinformatics aspect of such an analysis. We will walk through all the necessary preprocessing steps to get from raw sequencing reads to a machine learning model which aims to cluster human samples according to their geographical origin.

But before we get to the coding, let’s quickly clarify why this works.

Talking Genomes

Our genomes have a lot to say about who we are. They tell us what colour our eyes are, reveal our sex, and indicate how susceptible we are to certain diseases. But they also have something to say about where we were born.

That makes it even more surprising that all human beings are 99.9% identical in their DNA sequence. The remaining 0.1% is what we call human genomic variation. Genomic variation comes in many different flavours depending on its size and type. For the sake of simplicity, we are only going to consider one type of variation: Single Nucleotide Polymorphisms (SNPs). SNPs are substitutions of single bases, e.g. changing an adenine to a cytosine.

Essentially, the reason why geographical clustering based on genomic variation works boils down to two basic principles. First, SNPs which occur in the germline of an individual have a chance of being passed down to their offspring. Second, the genome of the offspring is usually a combination of their parents' genomes. Under the assumption that individuals separated by a smaller geographical distance have a higher chance of producing offspring, emerging clusters of similar genetic variation should show some resemblance to a geographical map.

In 2008, Novembre et al. demonstrated the feasibility of inferring an individual's country of birth from their genomic sequence [1]. The group genotyped over 3,000 European individuals at half a million genomic loci. The resulting plot of the first two principal components bears an astonishing resemblance to a geographical map of Europe.

Genomic variation mirrors the geographical map of Europe, Figure by Novembre et al. [1]

(I) Installing Tools

For the subsequent analysis, we are going to need a number of command-line tools and Python packages. I recommend installing miniconda3 and setting up a separate environment using the env.yml file in the GitHub repository linked at the end of this article.

conda env create -f env.yml
source activate geogenome

(II) Obtaining Data

The 1000 Genomes Project is an incredible resource for publicly available next-generation sequencing (NGS) data. We are going to use their phase 3 variant call set, incorporating SNPs of around 2,500 human samples. In order to reduce the computational cost of our analysis, we are only going to consider SNPs on chromosome 1. The corresponding VCF file can be downloaded with:

mkdir vcf; cd vcf
wget http://hgdownload.cse.ucsc.edu/gbdb/hg38/1000Genomes/ALL.chr1.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz
wget http://hgdownload.cse.ucsc.edu/gbdb/hg38/1000Genomes/ALL.chr1.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz.tbi
cd ..

At this point, I should mention that with this data alone we are able to build our model. So if you're solely interested in that, you can jump straight to step (VI). However, for the sake of a more comprehensive picture, we are going to go through the basic preprocessing steps of NGS data for one sample, namely HG01571. The corresponding paired-end FASTQ files can be obtained with:

mkdir fastq; cd fastq
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR764/SRR764764/SRR764764_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR764/SRR764764/SRR764764_2.fastq.gz
cd ..

The last puzzle piece we need is a reference genome to which we can map our reads. Let's go for GRCh38:

mkdir ref; cd ref
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa
cd ..

(III) Quality Control

A crucial step in every NGS-preprocessing pipeline is making sure that your reads have good quality. This includes, among other things, checking per base sequence quality, adapter content and the length distribution of your reads.

mkdir qc
fastqc -o qc/ fastq/SRR764764_1.fastq.gz fastq/SRR764764_2.fastq.gz

After investigating the quality report generated by the tool FastQC we can start to map our reads to the reference genome.

Per base sequence quality generated with FastQC, Figure by author

(IV) Mapping

The Burrows-Wheeler Aligner (bwa) relies on specific data structures, namely the Burrows-Wheeler transform and the suffix array, to efficiently map reads to a reference. So we first need to build those data structures for our reference FASTA file.

bwa index ref/GRCh38_full_analysis_set_plus_decoy_hla.fa

Mapping can be quite computationally expensive which is why I recommend allowing bwa to use multiple threads.

mkdir map
bwa mem -t 20 -o map/HG01571.sam ref/GRCh38_full_analysis_set_plus_decoy_hla.fa fastq/SRR764764_1.fastq.gz fastq/SRR764764_2.fastq.gz

Subsequently, we use samtools to convert our SAM file to BAM format, filter out alignments with a mapping quality below 20, and then sort and index the BAM file.

samtools view -b -h -q 20 -o map/HG01571.bam map/HG01571.sam
samtools sort -o map/HG01571.sorted.bam map/HG01571.bam
samtools index map/HG01571.sorted.bam

(V) Variant Calling

Now that our reads are mapped to the reference, we can identify SNPs in our sample. We are using the variant caller freebayes, which applies Bayesian inference to identify possible small genomic variants.

freebayes -f ref/GRCh38_full_analysis_set_plus_decoy_hla.fa map/HG01571.sorted.bam > vcf/HG01571.vcf

To filter out possible false-positive variant calls, we only keep variants that are covered by more than 10 reads and have a quality score above 20.

bcftools view -i 'FORMAT/DP>10 & QUAL>20' vcf/HG01571.vcf > vcf/HG01571.filtered.vcf

Technically, we could now integrate the variant calls for HG01571 into our 1000 Genomes Variant Call Format (VCF) file. But since our sample was already part of the cohort, we are simply going to use the existing variant calls.

(VI) Building the Model

Minimalistic Data Exploration

Before building our machine learning model, let's have a look at what our data looks like. A VCF file can be simplified into the following columns: chromosome, position, reference allele and alternative allele, followed by genotype information for all samples.
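
A simplified excerpt could look like this (the values are hypothetical, chosen only to match the examples discussed in the text and in the Hamming distance section below):

CHROM  POS    REF  ALT  Sample 1  Sample 2
chr1   10456  A    T    1|1       0|0
chr1   10523  G    C    0|1       1|0
chr1   11012  T    A    0|0       0|0
chr1   11310  C    G    1|1       0|1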

The reference allele describes the most common nucleotide in the population at a specific genomic position, and, as the name suggests, the alternative allele represents a variation with respect to the reference. The information about which variants occur in a specific sample is encoded in the genotype. Here, a 0 stands for the reference allele and a 1 for the alternative. Humans are diploid organisms, meaning we have two copies of each non-sex chromosome. This is why, for every position in the genome, each sample carries the information for both homologous chromosomes. If we have a look at the example shown above, we can see that at position chr1:10456 the first sample carries the alternative allele, T, on both copies of chromosome 1.

Feature Engineering

Currently, our dataset consists of roughly 6 million variants and 2,500 samples. We could train a model on the whole dataset, but in an effort to keep computational costs low, we are only going to work with a subset of 100 samples.

With that many features, our model is prone to overfitting. So, additionally, we are going to further reduce the complexity of our model through some feature engineering. If we look back at our initial hypothesis that samples with similar ethnicities should cluster together, then what comes to mind is defining the genomic similarity between two individuals using some kind of distance metric. Here, we use the normalized Hamming distance, which is defined as

ham_dist = num_mismatches / num_snps

where a mismatch is defined as two genotypes not being the same. It should be noted that in our context we don't care on which copy of chromosome 1 a variant is located, so the genotypes 0|1 and 1|0 are considered a match. Going back to our example, the Hamming distance between the first two samples is 2/4 = 0.5.

After calculating the Hamming distance between all possible sample pairs, each individual can be described through their genomic similarity to all other samples.
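
A minimal Python sketch of this computation could look as follows (function and variable names are my own, and the genotypes are assumed to have been parsed from the VCF beforehand; the actual code lives in the repository linked at the end):

import itertools
import numpy as np

def normalized_hamming(gt_a, gt_b):
    # gt_a, gt_b: lists of phased genotype strings such as '0|1', one per SNP.
    # Phase is ignored, so '0|1' and '1|0' count as a match.
    mismatches = sum(
        sorted(a.split('|')) != sorted(b.split('|'))
        for a, b in zip(gt_a, gt_b)
    )
    return mismatches / len(gt_a)

def distance_matrix(genotypes):
    # genotypes: dict mapping sample name -> list of genotype strings
    samples = list(genotypes)
    dist = np.zeros((len(samples), len(samples)))
    for i, j in itertools.combinations(range(len(samples)), 2):
        d = normalized_hamming(genotypes[samples[i]], genotypes[samples[j]])
        dist[i, j] = dist[j, i] = d
    return samples, dist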

Clustering

According to our initial hypothesis, individuals with a similar ethnic background should present a similar SNP profile. Thus, we should observe smaller Hamming distances between these samples than across the rest of the cohort. A heatmap of the Hamming distances between all samples hints at some underlying structure in the data. However, just by looking at the raw distances, it is quite hard to pinpoint exact clusters.
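
Such a heatmap can be drawn with matplotlib, for instance, using the dist matrix from the sketch above:

import matplotlib.pyplot as plt

plt.imshow(dist, cmap='viridis')
plt.colorbar(label='normalized Hamming distance')
plt.xlabel('sample index')
plt.ylabel('sample index')
plt.show()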

Heatmap of normalized Hamming Distances between all samples, Figure by author

In order to make possible clusters more visible, we apply a dimensionality reduction algorithm. Unlike the authors of the paper mentioned at the beginning [1], who performed a Principal Component Analysis (PCA), we are using t-distributed stochastic neighbour embedding (t-SNE). I will not go into too much detail here, but the key idea of t-SNE is to preserve any clustering that exists in the high-dimensional space: points are projected into a lower-dimensional space in such a way that points which were in close proximity in the high-dimensional space remain in close proximity in the low-dimensional space.
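
As a sketch (not necessarily the exact code behind the figures), this can be done with scikit-learn's TSNE on the precomputed distance matrix from above; the parameter values here are illustrative:

from sklearn.manifold import TSNE

embedding = TSNE(
    n_components=2,
    metric='precomputed',  # treat the input as a pairwise distance matrix
    init='random',         # required by scikit-learn when the metric is precomputed
    random_state=42,
).fit_transform(dist)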

The resulting scatterplot of the two components shows distinct clustering of samples by their ancestry. Each colour represents a so-called superpopulation; a small plotting sketch follows the list below.

  • EUR - European
  • EAS - East Asian
  • AMR - Admixed American
  • AFR - African
  • SAS - South Asian
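
Assuming a list superpopulations with one of these codes per sample (in the same order as the distance matrix), the scatterplot can be sketched like this:

import numpy as np
import matplotlib.pyplot as plt

labels = np.array(superpopulations)  # one code per sample, e.g. 'EUR'
for pop in ['EUR', 'EAS', 'AMR', 'AFR', 'SAS']:
    mask = labels == pop
    plt.scatter(embedding[mask, 0], embedding[mask, 1], label=pop, s=15)
plt.legend()
plt.xlabel('t-SNE component 1')
plt.ylabel('t-SNE component 2')
plt.show()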

What I found particularly interesting is the close proximity of the European cluster to the American one. A reason for the mixture between these two gene pools might be the European colonization of the Americas, starting in the late 1400s.

Clustering using t-SNE, Figure by author

Model Selection

The only thing left to do now is decide what kind of model we want to train. Since there are 5 superpopulations in our dataset, the model needs to be able to perform multinomial classification. I decided to evaluate the performance of a Random Forest (RF) model, a multinomial Naive Bayes (NB) model, a K-Nearest-Neighbours (KNN) model, and a multinomial Logistic Regression (LGR) model.

In general, when evaluating a machine learning model, one needs to find the right balance between having a large training set and maintaining an appropriately sized test set to guard against overfitting. With just 100 samples, our dataset is fairly small, which makes this problem even harder. Therefore, we apply a method called cross-validation. The idea is to divide the data into k equal parts, train a model on k-1 of them, and test it on the remaining one, iterating over all k parts such that every part acts as the test set exactly once.
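
In scikit-learn, such a comparison could be sketched as follows, where X holds each sample's row of the distance matrix and y the superpopulation labels (the hyperparameters are library defaults, not necessarily those behind the figure below):

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

classifiers = {
    'RF': RandomForestClassifier(random_state=42),
    'NB': MultinomialNB(),
    'KNN': KNeighborsClassifier(),
    'LGR': LogisticRegression(max_iter=1000),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validation
    print(f'{name}: mean accuracy {scores.mean():.2f}')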

For example, here we used k=5, so for each classifier we trained 5 models and obtained 5 accuracy measurements. The resulting performances of all four classifiers can be seen in the figure below. All four classifiers perform reasonably well, with Random Forest being the best and the Logistic Regression model being the worst-performing one.

Classification accuracy according to algorithm, Figure by author

(VII) Discussion

Limitations

Even though we used rather basic methods, we were able to accurately predict an individual's superpopulation based on their SNP profile. However, a number of limitations prevent this analysis from being used in a professional setting.

First, it is quite costly to add new samples to an existing database, since every time we add a sample, we need to compute its normalized Hamming distance to all other samples. Furthermore, this pipeline does not take individual haplotypes into account, but rather only the genotype at a given site. A recent paper by Freyman et al. demonstrates an efficient and haplotype-aware method for identity-by-descent inference using the Burrows-Wheeler transform [2]. Finally, the four simple multinomial classifiers used in this analysis are only capable of assigning one superpopulation to each sample. A more sophisticated method would be able to assign certain proportions of a genome to a population. In that way, the information about the heterogeneous ethnic backgrounds that exist in most humans would not be lost.

Summary

  • SNPs are single basepair variations in a genome, which can spontaneously emerge.
  • A human individual inherits roughly half of their genome (including its SNPs) from their mother and the other half from their father.
  • Individuals from similar regions on Earth have a higher probability of producing offspring together than individuals from different parts of the planet.
  • Therefore, characteristic clusters of SNPs emerge based on ethnicity.
  • Here, a set of 100 individuals from five different superpopulations was chosen. Each individual is represented by their SNP profile.
  • As a distance metric, the normalized Hamming distance was calculated between each pair of samples.
  • After applying t-SNE, distinct clusters for each superpopulation could be observed.
  • Four multinomial classifiers were trained to predict a sample's superpopulation based on its normalized Hamming distances to all other samples. All classifiers showed reasonably high accuracy when evaluated using cross-validation.

Code Availability

All scripts, as well as supplementary data, are available at https://github.com/burgshrimps/geogenome.

Sources

[1] Novembre, J., Johnson, T., Bryc, K. et al. Genes mirror geography within Europe. Nature 456, 98–101 (2008). https://doi.org/10.1038/nature07331

[2] Freyman, W., McManus, K., Shringarpure, S. et al. Fast and robust identity-by-descent inference with the templated positional Burrows-Wheeler transform. bioRxiv 2020.09.14.296939 (2020). https://doi.org/10.1101/2020.09.14.296939
