Hands-on Tutorials

Predicting Age with DNA methylation data

Building models to predict chronological age from DNA methylation data, including comparing performance across tissues and disease cohorts.

Eleonora Shantsila
Towards Data Science
14 min read · May 14, 2021

This project was conducted as part of the Harvard Capstone IACS course

Group members: Daniel Cox, Yaxin Lei, Eleonora Shantsila, Aaron Jacobson

Special thanks to our course instructor Dr. Chris Tanner and TF Phoebe Wong for their guidance and support.

Problem description

How old are you …biologically? Scientists have found that the process of aging causes observable changes, not only in our bodies as a whole, but also in each of our cells, and that this process does not necessarily run at the same speed from person to person. Some people of a given chronological age may be biologically younger or older than others. How can we tell if someone is aging biologically faster or slower than average? And to what degree does accelerated aging reflect current or future disease? These are questions that the team aimed to answer with biological data to assess the degree to which a given individual’s cells are aging in an aberrant manner, perhaps signaling the need for therapeutic intervention.

The approach we took to this problem was to build models to predict true chronological age and then use these models as a baseline against which specific individuals might be compared.

At the outset, it was not clear what sort of biological data would be good markers of aging, and we considered several: blood levels of various biomolecules, MRI scans of the brain, gene expression levels in various tissues, etc. Ultimately, after preliminary experiments and literature review, we settled on DNA methylation as the most promising predictor of age.

What’s DNA methylation?

In humans (nearly) all cells have a nucleus containing long double strands of DNA called chromosomes. Each chromosome is composed of a backbone with the nitrogenous bases A, G, T and C affixed to it in pairs repeated in varying order like beads on a string (Figure 1). At certain places along this beaded string, there are extra decorations called methyl groups that are put there by enzymes specific for the job, and, more interestingly, where these methyl groups are attached changes with age.

Figure 1. DNA methylation. Image by Author.

In general, the DNA of older individuals is less methylated than of those who are younger, but at any particular site, the degree of methylation can go either way. Methylation may increase with age at some positions and decrease at others. How many positions are we talking about? Millions. Typically, DNA is methylated at sites where a G follows a C with an intervening phosphate group. These sites are referred to as CpG sites, and there are ~20 million of them in our DNA, so the question quickly becomes which are most indicative of age?

Data

To examine this question, we started with data from the EWAS Data Hub, a DNA methylation repository built for epigenome-wide association studies (EWAS). We began by working with data taken from blood cells of healthy individuals. The number of CpG sites represented in this data is around 480,000. Figure 2 below shows a small extract of the processed data.

Figure 2. A subsection of EWAS healthy cohort data for whole blood tissue samples. Image by Author.

Each row corresponds to a sample from one individual and each column to a CpG site. The values in the table indicate the probability of a particular site being methylated in the DNA in that sample. After initial preprocessing through selecting individuals between the ages 20–110 and dropping columns with over 10% missing values, the dataset contained 1066 rows and 375,603 columns.

This data was divided into the train (75%) and test sets (25%), with both being imputed with the column means of the training set. Then we started modeling.
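The split-then-impute step can be sketched as follows. This is a minimal, standard-library-only illustration on a toy matrix, not the project's actual code: the function names, the use of `None` for missing values, and the toy numbers are all assumptions. The point it shows is that the column means are computed on the training rows only and then reused for the test rows, so no information from the test set leaks into training.

```python
import random

def train_test_split_rows(rows, test_fraction=0.25, seed=0):
    """Shuffle row indices and split them into train and test index lists."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    return indices[n_test:], indices[:n_test]

def column_means(rows, row_indices):
    """Mean of each column over the given rows, ignoring missing values (None)."""
    means = []
    for j in range(len(rows[0])):
        observed = [rows[i][j] for i in row_indices if rows[i][j] is not None]
        # Fall back to 0.0 only if a column has no observed value at all.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return means

def impute(rows, row_indices, means):
    """Copy the selected rows, replacing None with the supplied column means."""
    return [[means[j] if v is None else v for j, v in enumerate(rows[i])]
            for i in row_indices]

# Toy methylation matrix: 8 samples x 3 CpG sites (values are made up).
beta = [
    [0.10, 0.80, None],
    [0.20, None, 0.55],
    [0.15, 0.75, 0.50],
    [None, 0.70, 0.45],
    [0.30, 0.65, 0.40],
    [0.35, 0.60, None],
    [0.40, None, 0.30],
    [0.45, 0.50, 0.25],
]

train_idx, test_idx = train_test_split_rows(beta, test_fraction=0.25, seed=42)
train_means = column_means(beta, train_idx)   # computed on training rows only
X_train = impute(beta, train_idx, train_means)
X_test = impute(beta, test_idx, train_means)  # test rows reuse the training means
```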

Baseline models: Linear and XGBoost model using all features

Our initial approach was to determine how accurately we could predict human chronological age from this healthy control, whole blood data using all of the CpG sites (features) available. To investigate this, we performed linear regression with all of the features using age as the dependent variable. Throughout our analysis, we use the mean absolute error (MAE) as the accuracy metric. MAE refers to the average absolute difference between the predicted ages and the true ages. The results are shown in Figure 3. Encouragingly, with or without regularization, linear models predict age fairly well over the entire lifespan, with the best model achieving an MAE of 4.43 years on the test dataset (267 individuals). We also used a nonlinear tree-based regression method, XGBoost, on this data and found some improvement, achieving an MAE of 4.32 years.
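For concreteness, the MAE metric defined above can be written in a few lines. This is an illustrative helper with made-up ages, not code from the project.

```python
def mean_absolute_error(true_ages, predicted_ages):
    """Average absolute difference between predicted and true ages."""
    if len(true_ages) != len(predicted_ages):
        raise ValueError("inputs must have the same length")
    return sum(abs(p - t) for t, p in zip(true_ages, predicted_ages)) / len(true_ages)

# Made-up example: absolute errors are 3, 4, 2 and 1 years.
true_ages = [25, 40, 63, 71]
predicted_ages = [28, 36, 65, 70]
mae = mean_absolute_error(true_ages, predicted_ages)  # (3 + 4 + 2 + 1) / 4 = 2.5
```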

Figure 3. Predicting age from whole-blood DNA methylation data using all 375,603 features, 1066 samples. Image by Author.

The very large number of features compared to the number of samples makes the models prone to overfitting, leading to the next natural question: which CpG sites are the most relevant for age prediction?

Feature selection

To try to answer this question, we attempted to reduce the number of features in several ways: statistical testing of linear fits, a bootstrapped correlation analysis, the ranking of Shapley scores, and feature importances from XGBoost regression. Of these, statistical testing of linear fits, Shapley scores and XGBoost regression yielded similar results. We selected the XGBoost method described below (Figure 4) for all subsequent model building. We made an 80/20 train-test split of the data. We fit an XGBoost model to the split and recorded which CpGs fell into the top 100 importance scores. This was repeated 50 times, and the frequency with which each CpG site appeared in the top 100 importance scores was then used to rank them in order of importance.

Figure 4. Workflow for feature selection using XGBoost. Image by Author.
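The repeated-split ranking procedure can be sketched as below. To keep the example self-contained it does not call XGBoost: the importance score is a stand-in (absolute Pearson correlation with age), and the function names, the toy data, and the choice to score on the training portion of each split are assumptions made for illustration. Any function that returns one importance score per feature, such as the feature importances of a fitted gradient-boosted model, could be passed in instead.

```python
import random
from collections import Counter

def abs_correlation_importance(X, y):
    """Stand-in importance score: |Pearson correlation| of each column with y."""
    n = len(y)
    mean_y = sum(y) / n
    var_y = sum((v - mean_y) ** 2 for v in y)
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        mean_x = sum(col) / n
        cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(col, y))
        var_x = sum((a - mean_x) ** 2 for a in col)
        denom = (var_x * var_y) ** 0.5
        scores.append(abs(cov) / denom if denom > 0 else 0.0)
    return scores

def rank_by_selection_frequency(X, y, importance_fn, top_k, n_repeats,
                                train_fraction=0.8, seed=0):
    """Count how often each feature lands in the top_k over repeated random splits."""
    rng = random.Random(seed)
    counts = Counter()
    n_train = int(len(y) * train_fraction)
    for _ in range(n_repeats):
        idx = rng.sample(range(len(y)), n_train)  # a fresh random training subset
        scores = importance_fn([X[i] for i in idx], [y[i] for i in idx])
        top = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:top_k]
        counts.update(top)
    return counts.most_common()  # [(feature index, frequency), ...]

# Toy data: columns 0 and 1 track age, the other 8 columns are pure noise.
rng = random.Random(1)
ages = [rng.uniform(20, 90) for _ in range(60)]
X = [[age / 100 + rng.gauss(0, 0.02), 1 - age / 100 + rng.gauss(0, 0.02)]
     + [rng.random() for _ in range(8)] for age in ages]

ranking = rank_by_selection_frequency(X, ages, abs_correlation_importance,
                                      top_k=2, n_repeats=50)
```

In the article's setting, `top_k` would be 100 and `n_repeats` 50, and the resulting frequencies are what Figure 5 plots.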

Figure 5 below shows a histogram of these frequencies for the top 100 CpG sites. In other words, it shows how often each CpG site appeared in the top 100 importance scores in our 50 trials. For instance, the top 6 features in the figure have a frequency of 50, meaning they appeared in the top 100 in all 50 random trials. This result is extremely unlikely to occur by chance. Indeed, the probability of any CpG showing up by chance more than 4 times in the 50 trials is p = 7.66e-7. Thus, this method is selective and presumably, it selects those CpGs whose methylation is most associated with aging.

Figure 5. Frequency of CpGs occurring in the top 100 importance scores. Image by Author.

With this procedure in hand, we then moved on to the question of just how many of these top features should be used.

Model 1: Linear and XGBoost models using select features

To get a sense of how well this selection procedure picked out the features most associated with age we first reduced the number of features from just under 400,000 to the top 100. The age-predictive baseline models were repeated with these 100 CpG sites, yielding the results in Figure 6.

Figure 6. Predicting age using the top 100 CpG sites ranked by XGBoost cross-validation. Image by Author.

Remarkably, after cutting the models’ features from just under 400,000 to 100, the smaller models performed comparably. This was not the case when a random set of 100 CpGs was used, indicating that our CpG-ranking method has some merit and that many of the CpG sites in the dataset are likely not relevant to age prediction.

Next, to find the optimal number of features to use in our models, we fit the data with varying numbers of the top-ranked CpG sites. We did this 50 times, with a different 80/20 train-validation split each time, and then determined what number of CpGs was optimal. The average MAE values over the 50 experiments for each condition are plotted in Figure 7.

Figure 7. Mean absolute error as a function of the number of ranked CpGs used. Image by Author.
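A sweep of this kind can be sketched as follows. The sketch assumes the columns of `X` are already ordered by rank, so "the top k CpGs" are simply the first k columns. The model passed in here is a deliberately simple stand-in (a least-squares line on the mean of the first k columns), not the linear, ridge, lasso or XGBoost models used in the article, and all names and the toy data are hypothetical.

```python
import random
from statistics import mean

def fit_predict_mean_signal(X_train, y_train, X_test, k):
    """Toy stand-in model: least-squares line on the mean of the first k columns."""
    def signal(row):
        return sum(row[:k]) / k
    s = [signal(r) for r in X_train]
    s_bar, y_bar = mean(s), mean(y_train)
    var = sum((v - s_bar) ** 2 for v in s)
    cov = sum((a - s_bar) * (b - y_bar) for a, b in zip(s, y_train))
    slope = cov / var if var > 0 else 0.0
    intercept = y_bar - slope * s_bar
    return [intercept + slope * signal(r) for r in X_test]

def sweep_feature_counts(X, y, candidate_ks, fit_predict, n_repeats=50,
                         train_fraction=0.8, seed=0):
    """Average validation MAE for each candidate k over repeated random splits."""
    rng = random.Random(seed)
    n_train = int(len(y) * train_fraction)
    maes = {k: [] for k in candidate_ks}
    for _ in range(n_repeats):
        idx = list(range(len(y)))
        rng.shuffle(idx)
        tr, va = idx[:n_train], idx[n_train:]
        for k in candidate_ks:
            preds = fit_predict([X[i] for i in tr], [y[i] for i in tr],
                                [X[i] for i in va], k)
            maes[k].append(mean(abs(p - y[i]) for p, i in zip(preds, va)))
    return {k: mean(v) for k, v in maes.items()}

# Toy data: the first 4 columns carry a noisy age signal, the last 8 are noise.
rng = random.Random(7)
ages = [rng.uniform(20, 90) for _ in range(100)]
X = [[a / 100 + rng.gauss(0, 0.05) for _ in range(4)]
     + [rng.random() for _ in range(8)] for a in ages]

results = sweep_feature_counts(X, ages, candidate_ks=[1, 4, 12],
                               fit_predict=fit_predict_mean_signal, n_repeats=20)
best_k = min(results, key=results.get)  # k with the lowest average validation MAE
```

Averaging over many splits matters here because a single 80/20 split gives a noisy MAE estimate, and the differences between neighbouring feature counts can be small.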

Interestingly, the MAE starts to plateau at ~100 CpGs for unregularized linear regression (Figure 7-A) and at ~1000 CpGs for ridge, lasso, and XGBoost regression (Figure 7 B-D). Repeating the modeling with the top 1000 CpGs gave a modest improvement over the top 100: the best model, a ridge regression using the top-ranked 1000 CpGs, achieved an MAE of 3.73 years, as shown in Figure 8.

Figure 8. Predicting age using the top 1000 CpG sites ranked by XGBoost cross-validation and ridge regression. Image by Author.

Is this the best accuracy we can achieve? Or could we do better with more complex models like neural networks?

Model 2: Neural Networks

Analogous to the linear and XGBoost model analysis, the first step in the neural network (NN) modeling was to test how many features would be optimal. We started with two NN structures: NN A), containing 3 hidden layers (node number 128->56->28), and NN B), containing 2 hidden layers (node number 128->56). Again we varied the number of CpGs, looking now for the optimum number to use for neural network modeling (Figure 9).

Figure 9. Mean MAE for varying number of top features for NN A) with 3 hidden layers (left) and NN B) with 2 hidden layers (right). Image by Author.

From Figure 9, we can see that NN A), with 3 hidden layers, performs best with 300 to 700 CpGs, and the performance of NN B) plateaus at around 400 CpGs. With this information, we then varied other model hyperparameters — hidden layer node number, activation function — to tune the NN models for optimal performance. The best model we obtained was a NN with 2 hidden layers (hidden layer node number 128->64) that used the top 700 CpGs and achieved an MAE of 3.597 years (Figure 10).

Figure 10. Predicting age using Neural Network with 2 hidden layers with 700 top CpGs. Image by Author.
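To get a feel for the size of such a network, the number of trainable parameters can be counted directly. The sketch below assumes a plain fully connected network with bias terms and a single output unit for age, which the article does not state explicitly. The layer sizes 700, 128 and 64 come from the best model described above.

```python
def count_dense_parameters(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer sizes."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus one bias per output node
    return total

# 700 input CpGs -> 128 -> 64 -> 1 output (the single output unit is an assumption).
n_params = count_dense_parameters([700, 128, 64, 1])
# (700*128 + 128) + (128*64 + 64) + (64*1 + 1) = 89728 + 8256 + 65 = 98049
```

Under these assumptions the network has far more parameters than the roughly 800 training samples implied by a 75% split of 1066 rows, which is one reason limiting the number of input CpGs is worthwhile.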

The table in Figure 11 below summarizes our modeling results with DNA methylation data from whole blood from a healthy cohort. Ridge and Lasso models with either 1000 or 100 CpGs did well (MAE = 3.73 and 3.88 years respectively) but the neural network performed best (MAE = 3.60 years). Comparing these results to the literature, the error of our neural network model is comparable to that of Horvath (2013)[1] and Hannum (2013)[2] and not as good as that of Zhang et al. (2019)[3], who reported an RMSE on some datasets as low as 2.04 years.

Figure 11. Summary of models fitted to DNA methylation data from blood (test data). Image by Author.

Having built these models using DNA methylation data from whole blood, the next question we considered was whether these models could be used, unchanged, with data from other tissues.

Transferability to other tissues

To examine this, the two best blood models were applied to brain and breast data without re-training. The results are shown in Figure 12 for the ridge regression model and in Figure 13 for the NN. The answer is clear: no, the models are not transferable between tissues.

When our blood-fitted ridge model is applied to methylation data from the brain, its age predictions are flat, always close to 40 years of age (Figure 12-A). And when it is applied to breast-tissue data, its predictions are again flat but now close to 80 years of age (Figure 12-B).

Figure 12. Applying ridge models developed with whole-blood DNA methylation data to data from other tissues (1000 CpGs). Image by Author.

Similar systematic prediction shifts are also observed when predicting age using blood-trained neural network models (Figure 13). We see a general underprediction when the blood-fitted neural network is applied to methylation data from brain tissue and an overprediction when it is applied to breast data.

Figure 13. Applying neural network models developed with whole-blood DNA methylation data to data from other tissues. Image by Author.
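Flat or shifted predictions of this kind are easy to miss if only the MAE is reported. One way to make them visible, sketched here with made-up numbers and hypothetical names, is to report the mean signed error (the bias) and the slope of predicted age against true age alongside the MAE. A slope near 0 means the predictions barely respond to true age, and a bias far from 0 means systematic over- or underprediction.

```python
def transfer_diagnostics(true_ages, predicted_ages):
    """MAE, mean signed error, and slope of predicted vs. true age."""
    n = len(true_ages)
    errors = [p - t for t, p in zip(true_ages, predicted_ages)]
    mae = sum(abs(e) for e in errors) / n
    bias = sum(errors) / n  # > 0: overprediction, < 0: underprediction
    mean_t = sum(true_ages) / n
    mean_p = sum(predicted_ages) / n
    var_t = sum((t - mean_t) ** 2 for t in true_ages)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(true_ages, predicted_ages))
    slope = cov / var_t if var_t > 0 else 0.0
    return {"mae": mae, "bias": bias, "slope": slope}

# Made-up example: predictions hover around 40 regardless of the true age.
true_ages = [25, 40, 55, 70, 85]
flat_predictions = [41, 39, 40, 42, 38]
diag = transfer_diagnostics(true_ages, flat_predictions)
# The slope is close to 0 (flat predictions) and the bias is negative (underprediction).
```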

This lack of model transferability may be because 1) different CpG sites may be most relevant for age prediction in different tissues, or 2) there is something special about DNA methylation in blood cells that makes it more predictive of age than DNA methylation in other tissues. These points are investigated below.

1. Transferability of features

We first asked if the top-ranked blood features could be used to predict age with methylation data from other tissues. The answer, as it turns out, is yes. We found that the features most important for predicting age with blood data could also be used effectively in age-predictive modeling with leukocyte, breast and brain methylation data. That is, while the models are not directly transferable, the features are.

For the leukocyte data, a NN with 2 hidden layers (hidden layer node number 128->56) that uses 782 of the top-ranked blood CpGs achieved an MAE of 3.51 (Figure 14), in fact slightly better than the analogous model trained on data from whole blood. This impressive performance, however, didn’t hold for all tissues, as the best breast NN trained with blood-ranked CpGs achieved an MAE of 5.97, and the best brain NN trained with blood-ranked CpGs achieved an MAE of 6.02. But still, these results do demonstrate a degree of transferability of features between tissues that could be useful.

Figure 14. Predicting age using Neural Network fitted on leukocyte data with 2 hidden layers with 782 top CpGs generated from whole blood cross-validation with XGBoost. Image by Author.

2. Accuracy of prediction from other tissues

The next question we considered is whether DNA methylation data from tissues other than whole blood are as good for predicting age. This we examined by repeating the feature selection process we used with whole-blood on data from other tissues. Then we made models for each tissue separately, considering only the top-ranked CpG sites specific for each tissue. The ridge regression results for brain and breast tissue are shown in Figure 15 below.

Figure 15. Linear models developed with the top 1000 CpGs from brain and breast data. Image by Author.

Interestingly, models built with data from these tissues are not as good at predicting age as are models built from blood data. Their MAEs are substantially larger, so perhaps there is something unique to blood cells that makes for good age prediction.

Transferability to unhealthy cohorts

Now that we know that models are not transferable between tissues but that the features are to some extent, the next question we asked was whether models built with data from healthy individuals are transferable to unhealthy individuals. For the purposes of this project, we define “unhealthy” as individuals with neurodegenerative diseases, for instance, Huntington’s, Parkinson’s and Alzheimer’s. Data for these cohorts are also available on EWAS, although only around 225,000 CpG sites were present in the downloaded data. For this analysis, we used brain DNA methylation data from healthy controls and Huntington’s and Alzheimer’s patients.

Linear models were trained on the healthy cohort using the 55 CpG sites in the top 100 for healthy individuals (as selected by XGBoost) that were available in the unhealthy cohort. The best of these models was a lasso regressor, achieving an MAE of 5.431.
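Restricting a ranked feature list to the sites that another dataset actually contains is a small but easy-to-get-wrong step, since the ranking order should be preserved. A minimal sketch, with hypothetical CpG identifiers and function name:

```python
def ranked_sites_available(ranked_sites, available_sites, top_n=100):
    """Keep the top_n ranked sites that also exist in another dataset, in rank order."""
    available = set(available_sites)
    return [site for site in ranked_sites[:top_n] if site in available]

# Hypothetical identifiers: a ranked list from the healthy cohort and the
# columns present in a disease-cohort download.
healthy_ranked = ["cg_a", "cg_b", "cg_c", "cg_d", "cg_e"]
disease_columns = ["cg_e", "cg_a", "cg_x", "cg_c"]
shared = ranked_sites_available(healthy_ranked, disease_columns, top_n=4)
# The top 4 are cg_a..cg_d, of which cg_a and cg_c are available.
```

In the article's case the same filtering leaves 55 of the top 100 healthy-cohort sites.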

Applying these models directly to the Alzheimer’s and Huntington’s patients, we see that the healthy model performs very well on both of the unhealthy cohorts. The best performing linear model (lasso regression) achieves an MAE of 4.771 for Alzheimer’s patients and 4.471 for Huntington’s (Figure 16). In other words, the brain tissue model that uses the 55 healthy CpGs is transferable to the unhealthy cohorts and achieves an even better test accuracy on them than it does on the healthy cohort.

Figure 16. Brain tissue healthy control models for 55 CpG sites applied to the A) Alzheimer’s brain tissue data (811) and B) Huntington’s brain tissue data (270). Image by Author.

The transferability of the model suggests that the CpG sites most highly correlated with age in healthy individuals also correlate with aging in unhealthy individuals, and it raises the question of whether the weights associated with these CpG sites differ between healthy and unhealthy cohorts. Retraining the model using the same 55 CpGs, but now training on the unhealthy cohorts (separately), we get improvements in MAE across the three models. Figure 17 shows the results for lasso regression, which again is our best model, achieving an MAE of 4.171 for Alzheimer’s patients and 4.184 for Huntington’s.

Figure 17. Results of the brain tissue model on unhealthy cohorts A) Alzheimer’s B) Huntington’s using the 55 significant CpG sites from healthy cohorts applied to the test set. Image by Author.

Figure 18 plots the weights of these three models (healthy, Alzheimer’s and Huntington’s) for each CpG site. For most of the CpG sites, the magnitudes of the weights change between the three cohorts, but their directions (signs) do not.

Figure 18. A plot of the lasso regression model weights for each of the 55 CpG sites for each of the healthy control (HC), Alzheimer’s and Huntington’s models. Image by Author.
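The comparison read off the plot can also be made numerically. The sketch below takes two sets of model weights keyed by CpG site, lists the sites where the sign agrees or flips, and reports how much the magnitude changes. The site names and weight values are made up, and skipping zero weights is an assumption motivated by lasso setting some coefficients exactly to zero.

```python
def compare_weights(weights_a, weights_b):
    """Split shared sites by sign agreement and report |b| / |a| per site."""
    same_sign, flipped, ratios = [], [], {}
    for site in weights_a:
        if site not in weights_b:
            continue
        a, b = weights_a[site], weights_b[site]
        if a == 0 or b == 0:
            continue  # a zeroed-out weight has no sign to compare
        (same_sign if a * b > 0 else flipped).append(site)
        ratios[site] = abs(b) / abs(a)
    return same_sign, flipped, ratios

# Made-up weights for a healthy-cohort model and a disease-cohort model.
healthy = {"cg_a": 12.0, "cg_b": -8.0, "cg_c": 3.0, "cg_d": 0.0}
disease = {"cg_a": 6.0, "cg_b": -16.0, "cg_c": -1.5, "cg_d": 2.0}
same_sign, flipped, ratios = compare_weights(healthy, disease)
```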

Classification of healthy vs unhealthy

Given how much some of the weight magnitudes differ between the cohorts, we trained a logistic regression classifier to determine if we could distinguish between the healthy and Alzheimer’s cohorts using the CpG sites most strongly associated with aging. Using the class accuracy (proportion of points assigned to their true class) as an evaluation metric we trained a number of classifiers.

These included: using the top 55 healthy brain CpGs and age as features (class accuracy of 0.73); using age and the residual values from applying the brain model trained on the healthy cohort as features (class accuracy of 0.73); and using age and the residual values from applying the brain model trained on the Alzheimer’s cohort using CpG sites most associated with aging in that cohort (class accuracy of 0.69). Given the class accuracy values, we conclude that the CpG sites most associated with aging only distinguish between the healthy and Alzheimer’s cohorts with moderate success.
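Two pieces of this setup are simple enough to spell out: the class accuracy metric, and the "age plus residual" feature rows fed to the classifier. The sketch below uses made-up values and hypothetical names, and it assumes the residual is defined as predicted age minus chronological age, which the article does not specify. The classifier itself is omitted.

```python
def class_accuracy(true_labels, predicted_labels):
    """Proportion of samples assigned to their true class."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

def age_residual_features(ages, predicted_ages):
    """One feature row per sample: [chronological age, predicted minus true age]."""
    return [[age, pred - age] for age, pred in zip(ages, predicted_ages)]

# Made-up labels: 0 = healthy, 1 = Alzheimer's.
true_labels = [0, 1, 1, 0]
predicted_labels = [0, 1, 0, 0]
accuracy = class_accuracy(true_labels, predicted_labels)  # 3 of 4 correct = 0.75

# Made-up ages and age-model predictions for two samples.
features = age_residual_features([70, 82], [74, 79])  # [[70, 4], [82, -3]]
```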

Biological significance

With regard to our results, a natural question is which genes methylation might be affecting, and thereby perhaps affecting aging. Figure 19 below shows a mapping of the top 23 blood CpG sites to genes. Some genes are associated with more than one of the top-ranked CpG sites. One example is KLF14, a transcription factor thought to be a master regulator of gene expression in adipose tissue.

Figure 19. Genes associated with the top 23 ranked CpG sites, blood data. Image by Author.

KLF14 and two other genes (ELOVL2 and ZNF423), shown in purple, are associated with fat cells or fat metabolism. Thus, it may be that processes involving fat metabolism and storage have an important influence on aging. Also, four genes (red), OTUD7A, TRIM59, RNF180, and NHLRC1, are associated with the ubiquitin-proteasome pathway, an important pathway for protein degradation. In fact, three of these genes are E3 ubiquitin ligases, which are responsible for marking proteins for degradation. Thus, in terms of looking for interventions in the aging process, targeting this pathway may be a promising avenue of investigation. Indeed, irrespective of DNA methylation, several studies have identified this pathway as having an important influence on aging (Bergsma and Rogaeva (2020)[4], Kevei and Hoppe (2014)[5]).

Conclusion

Conclusions drawn from the above analysis are:

  • We have been able to build models to predict age with a mean error of 3.6 years across the entire adult lifespan.
  • From the ~400,000 DNA methylation sites (CpG sites) we started with, we have identified ~700 that are optimal for age predictive modeling.
  • Models are not transferable across tissues, but many CpGs are.
  • Models developed with brain tissue from healthy individuals can also predict the ages of patients with neurodegenerative diseases.
  • Our top-ranked CpGs are often associated with genes that regulate adipose-tissue gene expression and the ubiquitin-proteasome protein degradation pathway.

References

  1. Horvath, S., DNA methylation age of human tissues and cell types. Genome Biol, 2013. 14(10): p. R115.
  2. Hannum, G., et al., Genome-wide methylation profiles reveal quantitative views of human aging rates. Mol Cell, 2013. 49(2): p. 359–367.
  3. Zhang, Q., et al., Improved precision of epigenetic clock estimates across tissues and its implication for biological ageing. Genome Med, 2019. 11(1): p. 54.
  4. Bergsma, T. and E. Rogaeva, DNA Methylation Clocks and Their Predictive Capacity for Aging Phenotypes and Healthspan. Neurosci Insights, 2020. 15: p. 2633105520942221.
  5. Kevei, E. and T. Hoppe, Ubiquitin sets the timer: impacts on aging and longevity. Nat Struct Mol Biol, 2014. 21(4): p. 290–2.
