Machine learning methods for biological age estimation
HIGHLIGHTS
INTRODUCTION
Biological age (BA), a measure derived from the physiological condition of an individual’s systems and organs, represents a comprehensive assessment of aging. It surpasses chronological age in its ability to predict age-related diseases and mortality risk. Whereas chronological age simply counts the time elapsed since birth, BA integrates complex factors such as genetic predisposition, environmental influences, and lifestyle choices that collectively shape the aging trajectory.[1] It has been widely hypothesized that an age estimated from these factors can effectively measure an individual’s biological age. The difference between estimated age and chronological age is referred to as the “age gap”. Individuals with positive age gaps are considered to be aging at an accelerated rate, face a higher risk of mortality, and are more susceptible to age-related diseases. Numerous studies have demonstrated that BA is a robust predictor of diseases such as cardiovascular disease and diabetes, highlighting its potential to reflect the cumulative impact of these diverse factors on an individual’s health span.[2-3] Therefore, BA can be used as a quantitative indicator of an individual’s functional status, facilitating the implementation of early health interventions and reducing the disease burden.
Aging is a complex process driven by the interactions of countless molecules and various biological mechanisms. Identifying molecular biomarkers and other factors that reflect biological age not only enhances our understanding of the mechanisms underlying aging but also provides drug targets for addressing age-related diseases. Furthermore, integrating multiple factors and developing explainable prediction models of biological age offers personalized insight into an individual’s aging process, guiding tailored interventions and treatments such as medication and lifestyle changes.
As the global population ages, the prevalence of age-related diseases and mortality rates have risen significantly. This personalized healthcare approach promotes early detection and intervention, leading to more precise treatments and a reduction in the overall disease burden.
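The age gap defined in the introduction reduces to a subtraction followed by a sign check. The following is a minimal sketch of that arithmetic only; the function names, the example ages, and the `tolerance` band around zero are illustrative assumptions and do not come from any of the cited studies.

```python
def age_gap(estimated_age, chronological_age):
    """Age gap in years: positive when the model estimates an older age."""
    return estimated_age - chronological_age


def aging_status(gap, tolerance=1.0):
    """Label a gap. `tolerance` (years) is an illustrative assumption."""
    if gap > tolerance:
        return "accelerated"
    if gap < -tolerance:
        return "decelerated"
    return "on pace"


if __name__ == "__main__":
    # Hypothetical (estimated age, chronological age) pairs, in years.
    cohort = {"A": (57.5, 52.0), "B": (44.0, 49.0), "C": (61.0, 60.5)}
    for person, (estimated, chronological) in cohort.items():
        gap = age_gap(estimated, chronological)
        print(person, gap, aging_status(gap))
```

In practice the estimated age would come from a trained aging clock, and the threshold separating accelerated from decelerated aging would need to be justified against health outcomes rather than fixed in advance.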
Table 1. The overview of different ML methods
| ML method | Definition | Applicable data types | Strengths |
| --- | --- | --- | --- |
| Linear regression | A statistical method that models the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data. | Numerical, structured data | Simple to implement and interpret. Efficient for linear relationships. Works well with small datasets. |
| Support vector machines (SVM) | Supervised learning models for classification and regression tasks that work by finding the optimal hyperplane separating data points of different classes with the maximum margin. | Numerical, categorical, text | Effective in high-dimensional spaces. Robust against overfitting with appropriate kernels. Good for binary classification tasks. |
| Decision trees | A supervised learning algorithm for classification and regression tasks that models decisions and their possible consequences in a tree-like structure, where each internal node represents a feature or attribute, each branch represents a decision rule, and each leaf node represents an outcome or class label. | Numerical, categorical | Intuitive and easy to interpret. Good for feature selection. |
| Neural networks | A class of machine learning models inspired by the structure and function of the human brain, consisting of interconnected layers of nodes that process and transform input data. Each connection has an associated weight, which is adjusted as the network learns from training data. | Numerical, image, text | Captures complex non-linear relationships. Can handle large datasets and multiple features. Flexible architecture allows for various applications. |
| Deep neural networks (DNNs) | A type of artificial neural network characterized by multiple layers of neurons between the input and output layers. This depth allows DNNs to learn complex representations and to capture intricate features and hierarchies in the data. | Numerical, image, text, audio | Highly effective for complex patterns and large datasets. Can integrate various data types effectively. |
OVERVIEW OF ML METHODS APPLIED TO BA ESTIMATION
Aging clocks are ML models that measure the biological aging state of the body by combining various types of clinical data with ML algorithms. An ML model estimates age by taking aging-related features as input and producing an estimated age as output. The advantage of ML is that it enables computers to analyze the features of large samples and make corresponding inferences in a short time.[5] The most commonly used model is linear regression, which fits an optimal line or hyperplane to multidimensional aging data by estimating its parameters. For complex high-dimensional omics data, penalized linear regression models, such as Lasso regression with L1 regularization,[6] Ridge regression with L2 regularization,[7] and Elastic Net regression,[8] are widely employed to reduce the number of features and to limit the impact of correlations between features. Other common ML methods include support vector machines,[9] decision trees,[10] and neural networks.[11] For large samples, deep neural networks, which learn more complex nonlinear relationships from data by connecting numerous nodes, have shown excellent performance in various fields.[12] The difference between the estimated age and the chronological age serves as an indicator of an individual’s aging rate. It is widely acknowledged that individuals with a positive difference are aging faster than their peers, while those with a negative difference are aging more slowly.[13-14] Additionally, it has been confirmed that individuals with a positive difference are at higher risk of age-related events.[15-17]
ML COMBINED WITH OMICS DATA FOR BA ESTIMATION
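Penalized linear regression is the most common way to turn high-dimensional omics features into an aging clock. The sketch below fits an Elastic Net by coordinate descent on a synthetic cohort, using only the Python standard library. It is an illustration under stated assumptions, not the procedure of any cited study: the cohort generator, the three "informative" features and their slopes, and the values of `alpha` and `l1_ratio` are all hypothetical.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_elastic_net(X, y, alpha=0.5, l1_ratio=0.9, n_sweeps=200):
    """Coordinate descent for the objective
    (1/2n)*||y - b - Xw||^2 + alpha*l1_ratio*||w||_1
    + 0.5*alpha*(1 - l1_ratio)*||w||^2.
    Returns (weights, intercept)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centered
    col_sq = [sum(row[j] ** 2 for row in Xc) / n for j in range(p)]
    w = [0.0] * p
    residual = [yi - y_mean for yi in y]  # residual for w = 0
    l1 = alpha * l1_ratio
    l2 = alpha * (1.0 - l1_ratio)
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] + l2 == 0.0:
                continue  # constant feature with no ridge term: skip
            # Correlation of feature j with the residual that excludes j.
            rho = sum(Xc[i][j] * residual[i] for i in range(n)) / n
            rho += col_sq[j] * w[j]
            new_w = soft_threshold(rho, l1) / (col_sq[j] + l2)
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * Xc[i][j]
                w[j] = new_w
    intercept = y_mean - sum(wj * mj for wj, mj in zip(w, x_mean))
    return w, intercept


def make_synthetic_cohort(n=200, n_noise=7, seed=0):
    """Hypothetical data: 3 age-related features followed by pure noise."""
    rng = random.Random(seed)
    slopes = [1.0, -0.8, 0.8]  # assumed age effects, for illustration only
    X, ages = [], []
    for _ in range(n):
        age = rng.uniform(20.0, 80.0)
        z = (age - 50.0) / 17.3  # roughly standardized age
        informative = [s * z + rng.gauss(0.0, 0.3) for s in slopes]
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
        X.append(informative + noise)
        ages.append(age)
    return X, ages


def predict(X, w, intercept):
    return [intercept + sum(wj * xj for wj, xj in zip(w, row)) for row in X]


if __name__ == "__main__":
    X, ages = make_synthetic_cohort()
    w, b = fit_elastic_net(X, ages)
    estimated = predict(X, w, b)
    mae = sum(abs(e - a) for e, a in zip(estimated, ages)) / len(ages)
    print("coefficients:", [round(wj, 2) for wj in w])
    print("training MAE (years):", round(mae, 2))
```

The L1 part of the penalty pushes the coefficients of uninformative features toward zero, while the L2 part stabilizes the fit when informative features are strongly correlated with one another. A real clock would be evaluated on held-out samples rather than on the training data used here.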
DNA methylation data for BA assessment
Epigenetic change is a hallmark of aging, and DNA methylation is a widely recognized indicator for assessing BA,[18] manifesting mainly as increases or decreases in CpG methylation. In 2011, Bocklandt et al. built the first methylation clock from 100 saliva samples using an elastic net model, which predicted chronological age (CA) with an error within 5 years.[19] The concept of the ‘epigenetic aging clock’ was proposed by Hannum et al. in 2013 using DNA methylation data from 71 CpG sites,[13] and Horvath then analyzed whole-genome DNA methylation data to develop a BA estimation model based on 353 CpG sites.[13]
The correlation between methylation-based predictive models and mortality has been investigated further. Zhang et al. used Lasso Cox regression to identify 10 CpG sites highly associated with mortality.[17] In addition, Levine et al. established the DNAm PhenoAge clock by combining chronological age with 9 clinical indicators associated with mortality.[15] Other ML models have shown strong associations with mortality and with multiple forms of age-related functional decline, including heart disease and cognitive and physical decline.[20-23]
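Statements such as "an error within 5 years" refer to the agreement between predicted and chronological age. As a hedged sketch of how such agreement is commonly summarized, the code below computes a median absolute error and a Pearson correlation. The choice of these two metrics, the function names, and the example ages are assumptions made for illustration; the cited studies may define their reported error differently.

```python
import math
import statistics


def median_absolute_error(predicted, actual):
    """Median of |predicted - actual|, in the same units as the inputs."""
    return statistics.median(abs(p - a) for p, a in zip(predicted, actual))


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical predicted and chronological ages (years) for five people.
    predicted = [30, 41, 52, 58, 75]
    chronological = [28, 40, 55, 60, 70]
    print("median absolute error:",
          median_absolute_error(predicted, chronological))
    print("Pearson r:", round(pearson_r(predicted, chronological), 3))
```

A high correlation alone does not imply a small error, since a clock can track age closely while being offset by several years, which is why both quantities are usually reported together.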
Transcriptomics data for BA assessment
Using RNA gene expression to build aging clocks links aging to specific genes, which enhances the interpretability and experimental testability of ML models. In 2015, Peters et al. trained an ML model on peripheral blood mononuclear cell gene expression data from multiple large cohorts, the first exploration of a transcriptomics clock.[24] In 2018, Fleischer et al. established a model using human skin fibroblast transcriptomics data.[25] In 2021, Holzscheck et al. developed a transcriptomics age prediction model using deep neural networks and reported its correlation with skin aging measured by various methods.[26] However, the age prediction accuracy of transcriptomics clocks varies significantly across cohorts, possibly because of noise in microarray and sequencing data from different platforms.
Proteomics data for BA assessment
Studies demonstrate that the composition and quantity of thousands of proteins in human plasma and cerebrospinal fluid vary with age.[27-28] Consequently, it is feasible to evaluate BA by establishing an age-related protein expression pattern. Baird et al. and Menni et al. developed the earliest proteomics aging clocks.[28-29] In 2018, Tanaka et al. explored the correlation between plasma proteins and biological aging and selected 76 proteins to develop a BA model.[30] In 2019, Lehallier et al. developed an ML model based on 373 plasma proteins that estimated age accurately across multiple independent cohorts, and found three significant peaks in the quantities of plasma proteins at 34, 60, and 78 years of age, indicating that aging is a nonlinear process with phased changes in aging features.[31] Furthermore, Wolf et al. developed an artificial intelligence (AI) model that predicts eye aging from aqueous humor liquid biopsy proteomics.[32]
Metabolomics data for BA assessment
A large-scale study used plasma nuclear magnetic resonance analysis to identify metabolites that predict mortality,[33] demonstrating that multiple metabolites, including albumin, very low-density lipoprotein, and amino acids, were correlated with all-cause mortality. Another study developed a metabolomics clock based on 56 plasma metabolites and explored the association between metabolic age differences and cardiovascular phenotypes or mortality.[34] In that study, accelerated metabolic aging was associated with cardiovascular risk factors, cardiovascular disease risk, and all-cause mortality. Moreover, other metabolomics predictive models, constructed from plasma and urine metabolites using multi-targeted and untargeted mass spectrometry and nuclear magnetic resonance methods, have been shown to be related to risk factors for disease, such as hypertension, diabetes, and obesity.[35-36]
Other omics data for BA assessment
Furthermore, other omics data including glycomics and microbiome data have been used for BA model development. For example, Krištić et al. predicted age in multiple European cohorts using N-glycosylation patterns of serum IgG proteins,[37] and Galkin et al. predicted age from the taxonomic composition of the gut microbiome using deep neural networks.[38]
ML COMBINED WITH TISSUE OR ORGAN-BASED INDICATORS FOR BA ASSESSMENT
Blood biochemical indicators
Many blood biochemical biomarkers change with age and can be used to assess health and disease status. Putin et al. developed an ML model using 46 indicators, including albumin and alkaline phosphatase, from 62,419 patients.[39] Another ML model was constructed from 19 blood biochemical indicators, including albumin and urea, along with gender and race, from 142,379 individuals across multiple countries, and was validated in 55,751 individuals from the National Health and Nutrition Examination Survey (NHANES) database.[40] This research indicated a positive correlation between blood biochemical age and all-cause mortality. Combining 23 blood biochemical indicators, including glycosylated hemoglobin and urea, with gender and lifestyle factors such as smoking in 149,000 Canadians, Mamoshina et al. developed an ML model and suggested that lifestyle potentially affects an individual’s aging status and BA.[41] In addition, Levine et al. established a model based on multiple indicators, including albumin, alkaline phosphatase, and chronological age, from 9,926 individuals in the NHANES III database.[15] Based on the NHANES IV database, Liu et al. showed that the age predicted by a BA model was highly accurate in predicting all-cause and disease-specific mortality in populations.[42]
Brain indicators
Brain tissue undergoes gradual, characteristic changes during aging, and some studies indicate that brain aging parallels cognitive decline. Brain aging is associated with an increasing risk of neurodegenerative diseases and dementia.[43] Cole et al. established a model for evaluating brain aging using T1-weighted MRI from 2,001 healthy individuals, suggesting that increased brain age correlates with mortality and with a gradual decline in cognitive and respiratory function.[44] A similar correlation between increased brain age and cognitive impairment was demonstrated by Elliott et al.[45]
Vascular indicators
Vascular tissues also exhibit characteristic changes during aging, such as decreased elasticity and increased stiffness of blood vessel walls. In addition, the incidence of diseases such as carotid artery stenosis and vascular embolism increases with age.[46] McClelland et al. established an arterial age model based on coronary artery calcification scores, and this model accurately predicted the risk of coronary heart disease.[47] Nilsson et al. used aortic pulse wave velocity and carotid intima-media thickness to evaluate vascular aging, revealing a positive correlation with cardiovascular disease and frailty risks.[48] Gale et al. also showed a similar correlation between cardiovascular disease risk and frailty.[49]
Chest X-ray indicators
Based on chest X-rays from 116,035 individuals, Raghu et al. constructed a chest X-ray age prediction model and validated its performance in two validation sets.[50] In the first stage of this research, chest X-ray images from 24,934 individuals in the CheXpert, National Institutes of Health Chest X-ray 14, and PadCHEST cohorts were used to develop a deep learning model that predicts chronological age. In the second stage, the model was fine-tuned on chest X-ray images labeled with time to death, from 13,657 individuals making up 25% of the PLCO (Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial) cohort, to construct the final BA assessment model, which was validated in 75% of the populations of the PLCO and NLST (National Lung Screening Trial) cohorts. The results suggested that chest X-ray age reflected the risks of all-cause mortality and cardiovascular disease mortality better than chronological age.
Facial features
Clinicians estimate patients’ age and health status based on facial features.[49] By following 1,826 twins over the age of 70 for seven years, Christensen et al. revealed that increased perceived age was correlated with reduced physical activity, cognitive function, and telomere length.[51] In 2020, Xia et al. captured facial features from 3D images of 4,719 Chinese individuals and developed deep neural networks to assess facial age.[52] The facial age was highly correlated with chronological age, and the difference between predicted perceived age and chronological age was strongly associated with obesity and blood pressure.[52] This research indicated that changing unhealthy living habits might slow biological aging. In addition, based on the age dependency of facial temperature, Yu et al. collected facial images of 2,811 Han Chinese individuals aged 20–90 years and developed age and disease prediction models.[53] The difference between the age predicted in this study and chronological age was associated with metabolic parameters, sleep time, gene expression pathways, and exercise.
Ocular features
For ocular features, researchers have constructed age prediction models using retinal images or lens images. Several models have been developed from the public UK Biobank database. Zhu et al. built a machine learning-based retinal age prediction model using 80,169 retinal images from 46,969 individuals, finding that the difference between retinal age and chronological age was correlated with an increased risk of all-cause and cause-specific mortality.[54] Hu et al. constructed another retinal age prediction model based on 19,200 retinal images from 46,969 individuals, and derived retinal age differences for 35,834 participants using the trained deep learning model.[55] This research showed that each one-year increase in the retinal age difference was associated with a 10% increase in Parkinson’s disease risk, and that the risk of Parkinson’s disease was significantly higher in the third and fourth quartiles of retinal age difference than in the lowest quartile. ‘RetiAGE’ is the probability of being ≥65 years old as predicted by a deep learning model trained on 129,236 retinal images from 40,480 participants. When applied to 56,301 participants in the UK Biobank cohort, individuals in the fourth quartile of RetiAGE had a 67% higher 10-year all-cause mortality risk and a 39% higher risk of cardiovascular disease and cancer than those in the first quartile.[56] Liu et al. developed a CNN-based BA assessment model using 12,000 fundus images from healthy individuals.[57] Based on the aging characteristics of the lens, Li et al. used slit lamp-captured lens images from the Chinese population to develop the “LensAge” model, which predicts age-related diseases and mortality. Moreover, the “LensAge” model shows promise for smartphone-based self-assessment of aging, introducing an innovative mode of measuring aging.[58]
ML COMBINED WITH COMPOSITE INDICATORS FOR BA ASSESSMENT
Different individuals age at different rates, and different tissues and organs within the same individual also age at different rates. Most BA estimation models developed with the above methods are based on a single type of omics data or a single tissue or organ. They differ in their sensitivity to different aging indicators and therefore reflect different aspects of aging, or the aging level of a specific body part, given the close relationship between organ aging and diseases of that organ. A comprehensive assessment of an individual’s aging level can be achieved by constructing aging models that combine multiple indicators. Levine et al. developed a model combining indicators from multiple systems, including blood pressure, lung capacity indicators, and blood biochemical indicators.[2] Li et al. combined 37 indicators reflecting multiple functions of the body and calculated system homeostasis imbalance scores, which increased during aging and could predict the incidence of age-related events such as death and diabetes.[59] Furthermore, Belsky et al. suggested a stronger association between composite indicators and health outcomes.[60]
CONCLUSIONS
At present, although many studies combining machine learning with multiple types of data have been reported, there is still no recognized gold standard for BA. Different types of BA models reflect different aging indicators, and different degrees of noise exist in the measurement of omics data and of system or organ parameters. To make these models better reflect the actual, complex state of aging, more indicators and aspects should be considered. For example, the data used should combine multi-omics aging indicators and incorporate the aging features of multiple organs. Furthermore, lifestyle and environmental exposures, such as physical activity and dietary factors, may affect the process and outcomes of aging and should be taken into account. More importantly, large-scale longitudinal cohort data will improve the prediction of age-related events from aging indicators. Therefore, studies should not be limited to cross-sectional designs. With the continuous expansion of data dimensions and sample sizes, it is difficult for traditional statistical methods and algorithms to fit higher-dimensional features and select meaningful aging indicators. Selecting analysis methods and deep learning algorithms suited to big data, according to the research goals and needs, is therefore also necessary to improve the performance of BA estimation models. In addition, more convenient and accurate BA methods need to be explored further. Future directions in biological age assessment should emphasize a deeper understanding of aging processes and their underlying mechanisms, the development of more accurate and accessible models for real-world applications, and the inclusion of larger population samples to create more robust and generalizable biological age models.
However, measuring biological age presents several challenges, including data availability, ethical considerations, and clinical implementation. Data availability is often limited by fragmented datasets and by the difficulty of obtaining the large sample sizes necessary for robust analyses. Ethical considerations are critical, as handling sensitive genetic and health data raises privacy concerns and necessitates clear informed consent processes. Moreover, clinical implementation faces a translational gap, as many potential biomarkers require rigorous validation before widespread adoption. To address these challenges, promoting open-access databases with standardized protocols and implementing data protection policies and anonymization techniques may be beneficial for data availability and privacy. Additionally, conducting longitudinal studies can enhance accuracy, while initiating pilot programs in clinical settings will facilitate practical applications in the real world.
Biological aging is affected by many factors. Constructing ML models of BA is useful for monitoring the aging state of the body in real time and for exploring in depth the molecular regulatory mechanisms related to aging. This is expected to reveal targets for delaying aging and thereby to help prevent the occurrence and development of age-related diseases.
Correction notice
None
Acknowledgement
None
Author Contributions
(Ⅰ) Conception and design: Haotian Lin
(Ⅱ) Administrative support: Haotian Lin
(Ⅲ) Provision of study materials or patients: Ruiyang Li and Wenben Chen
(Ⅳ) Collection and assembly of data: Ruiyang Li, Wenben Chen, and Jialing Chen
(Ⅴ) Data analysis and interpretation: Ruiyang Li, Wenben Chen, and Jialing Chen
(Ⅵ) Manuscript writing: All authors
(Ⅶ) Final approval of manuscript: All authors