Application and performance of artificial intelligence in screening retinopathy of prematurity from 2018 to 2024: a meta-analysis and systematic review
INTRODUCTION
The clinical diagnosis of ROP relies solely on the appearance of retinal vessels, assessed through dilated ophthalmoscopic examination by retinal specialists, making it highly subjective.[6] Research has highlighted the imbalance between the limited number of experienced ophthalmologists and the large number of preterm infants needing ROP screening and treatment, particularly in developing countries.[7-8] In China, uneven development of the pediatric care system, inadequately trained pediatricians, and unmet demand for pediatric care are major challenges: there are approximately 4 pediatricians per 10 000 children.[9] Given China's huge population, the ROP screening workload of pediatric ophthalmologists is especially heavy.[10] Moreover, personalized screening and accurate diagnosis are crucial because each newborn's condition varies.[1,11-13] Therefore, developing efficient and accurate diagnostic tools is vital.
Since the emergence of the MYCIN system, artificial intelligence (AI) has played a significant role in areas such as medical diagnosis, treatment planning, and drug discovery.[14-15] In recent years, AI-based automatic screening systems have developed rapidly, with the advantages of saving time and reducing subjectivity.[16] AI has demonstrated significant capabilities in diagnosing ocular diseases such as age-related macular degeneration (AMD),[17-18] glaucoma,[19] and diabetic retinopathy (DR).[20] Similarly, AI has been applied to research on ROP. According to Brown et al, an increasing number of studies have focused on AI-based screening of ROP, which might become a valuable tool in the future.[6]
In April 2018, the FDA authorized the first AI diagnostic system, IDx-DR, for diagnosing DR from color fundus photographs, with a sensitivity of 87.4% and a specificity of 89.5% among 900 patients with diabetes at ten primary care sites.[20] This approval marked a significant milestone in the application of AI to diagnosing retinal diseases.
Previous meta-analyses have concentrated either on binary screening for ROP or on the identification of PLUS disease; a comprehensive evaluation of the potential of AI to screen for both ROP and PLUS is essential for guiding clinical practice. Considering the remarkable progress of AI in retinal disease diagnostics since 2018 and the promising future of AI applications in ROP, we collected studies published from 2018 to 2024 and performed the first meta-analysis to comprehensively and objectively appraise the current diagnostic performance of AI for ROP and PLUS simultaneously.
METHODS
Protocol
This meta-analysis was conducted following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA),[21] with a standardized review and data extraction protocol. The study protocol was registered on the PROSPERO platform under entry number CRD42024564204.
Search strategy and selection criteria
We searched PubMed, Embase, Medline, Web of Science, and Ovid for studies published between January 2018 and July 2024. The full search strategy for each database is available in Appendix 1. Manual searches of the bibliographies and citations of included studies were also completed to identify potentially missed articles. Only studies evaluating AI for detecting the presence of ROP were included. We accepted standard-of-care diagnosis, expert opinion, or consensus as adequate reference standards for classifying the disease. We excluded studies that did not test diagnostic performance or that only investigated the accuracy of image segmentation.
Inclusion and exclusion criteria
The inclusion criteria were: (1) studies using AI for ROP diagnosis or PLUS classification; (2) studies using clinical diagnosis as the reference standard; (3) original scientific articles; and (4) sufficient data for reconstructing 2×2 diagnostic accuracy tables. The exclusion criteria were: (1) duplicate publications; (2) non-original studies, including editorials, letters to the editor, review articles, and case reports; (3) non-English articles; and (4) studies without sufficient information for reconstructing a 2×2 table.
Data extraction
We extracted the following data from the included studies using a standardized form: (1) true positives, false negatives, true negatives, and false positives; (2) study characteristics, including first author, publication year, country, camera, reference standard, model, algorithm evaluation, screening criteria, source of data (public database or private dataset; data from a hospital was defined as a private dataset), number of centers, number of doctors, doctors' years of experience, gestational age (GA), birth weight (BW), gender (M/F), dataset (validation dataset), classification, outcome, sensitivity, specificity, accuracy, and AUROC.
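As an illustration of item (1), a study's 2×2 table can be back-calculated when it reports sensitivity, specificity, and the composition of its validation set. The sketch below is a minimal example of this reconstruction; the function name and the input values are illustrative, not taken from any particular included study.

```python
# Minimal sketch: back-calculating a 2x2 contingency table from a study's
# reported sensitivity, specificity, and validation-set composition.
def reconstruct_2x2(sensitivity: float, specificity: float,
                    n_diseased: int, n_healthy: int) -> dict:
    """Recover TP/FN/TN/FP counts, rounded to whole images or cases."""
    tp = round(sensitivity * n_diseased)
    fn = n_diseased - tp
    tn = round(specificity * n_healthy)
    fp = n_healthy - tn
    return {"TP": tp, "FN": fn, "TN": tn, "FP": fp}

# Hypothetical validation set: 150 diseased and 150 healthy images,
# with a reported sensitivity of 0.96 and specificity of 0.98.
print(reconstruct_2x2(0.96, 0.98, 150, 150))
# -> {'TP': 144, 'FN': 6, 'TN': 147, 'FP': 3}
```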
Quality assessment
The methodological quality of the studies was assessed using QUADAS-2[22] and QUADAS-AI.[23] Each study was rated in the following domains: patient selection, index test, reference standard, and flow and timing. Every domain was assessed for risk of bias, and the first three domains were also assessed for applicability concerns.
Statistical analysis
We created 2×2 tables to calculate the pooled sensitivity, specificity, and corresponding 95% confidence intervals (CIs) using a bivariate random-effects model. We also calculated the diagnostic odds ratio (DOR), a single indicator combining sensitivity and specificity, chosen for its capacity to summarize the overall accuracy of a diagnostic test across all threshold settings while being less affected by disease prevalence in the included samples. The positive likelihood ratio (LR+) and negative likelihood ratio (LR-) indicate how much the probability of disease increases or decreases with a positive or negative test result, respectively. The results are shown graphically in forest plots. We constructed hierarchical summary receiver operating characteristic (HSROC) curves and calculated the area under the curve (AUC). We performed Deeks' funnel plot asymmetry test to evaluate possible publication bias, with P < 0.1 indicating the possibility of publication bias.[24] The heterogeneity of the included studies was evaluated using the inconsistency index (I²) and the Q statistic of the chi-square test.[25] Heterogeneity was further explored through meta-regression by adding the following covariates to the bivariate model: (1) country (China vs. other countries), (2) number of centers (=1 vs. >1), (3) data source (public database vs. private data), and (4) number of doctors (<3 vs. ≥3).
Statistical analyses were performed using STATA 17.0 and RevMan 5.3. Statistical significance was set at P < 0.05.
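To make the indicators above concrete, the sketch below computes per-study sensitivity, specificity, LR+, LR-, and DOR from a single 2×2 table, with a log-scale Wald interval for the DOR. This is an illustrative calculation only; the pooled estimates reported below come from the bivariate random-effects model fitted in STATA, not from this formula, and the example counts are hypothetical.

```python
import math

def diagnostic_metrics(tp: int, fn: int, tn: int, fp: int, z: float = 1.96) -> dict:
    """Per-study accuracy indicators from one 2x2 table.

    Assumes no zero cells; a continuity correction (e.g., adding 0.5
    to every cell) would be needed otherwise.
    """
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    lr_pos = sens / (1 - spec)   # how much a positive result raises the odds of disease
    lr_neg = (1 - sens) / spec   # how much a negative result lowers the odds of disease
    dor = lr_pos / lr_neg        # equivalently (tp * tn) / (fp * fn)
    # Wald interval for ln(DOR) from the four cell counts
    se_ln_dor = math.sqrt(1/tp + 1/fn + 1/tn + 1/fp)
    dor_ci = (math.exp(math.log(dor) - z * se_ln_dor),
              math.exp(math.log(dor) + z * se_ln_dor))
    return {"sensitivity": sens, "specificity": spec,
            "LR+": lr_pos, "LR-": lr_neg, "DOR": dor, "DOR 95% CI": dor_ci}

# Illustrative counts only, not from any included study:
print(diagnostic_metrics(tp=144, fn=6, tn=147, fp=3))
```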
RESULTS
Selection and data extraction
A total of 186 studies were identified, and 147 were excluded according to the exclusion criteria. Thirty-nine full-text articles were assessed for eligibility, and 14 studies were finally included in the meta-analysis.[6,26-38] Twenty-five full-text articles were excluded for various reasons, mainly the lack of an extractable 2×2 table or the use of AI for purposes other than diagnosing ROP or PLUS (Figure 1).
Figure 1 PRISMA flow chart of article selection process.
Data characteristics and demographics
The characteristics of the included studies are summarized in Table 1. All studies were retrospective, with data collected from 2006 to 2018. Most studies were conducted in China, while the remaining five were from other countries: India, America, New Zealand, Japan, and the UK. All studies used expert judgement as the reference standard. Nine studies provided information on their screening criteria, while five did not. Twelve studies reported their algorithm evaluation, while two did not. Data were obtained from private datasets or from public databases such as i-ROP, KIDROP, the ROP Group, and ART-ROP. Among the 14 studies, nine were single-center and five were multicenter (ranging from 2 to 30 centers). For the labeling process, most studies reported the number of doctors, but one study did not. Over half of the studies provided information on the doctors' experience. Only six studies provided the GA and BW of the included premature infants. Three studies classified the material as "cases", while the others classified it as "images". Moreover, eleven studies reported AI for diagnosing the presence of ROP, six reported AI for diagnosing the presence of PLUS, and three reported AI for distinguishing severe ROP requiring treatment in clinical practice.

Table 1 Characteristics of the 14 included studies
| First author | Publication year | Country | Camera | Reference standard | Model | Algorithm evaluation | Screening criteria | Source of data | Number of centers | Number of doctors | Doctors' experience (years) | GA (weeks) | Birth weight (kg) | Gender (M/F) | Dataset size (validation) | Classification | Validation set composition | Outcome | Sensitivity/Specificity | Accuracy | AUROC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Brown[6] | 2018 | America | RetCam | Clinical diagnosis | CNN: U-Net and Inception V1 | 5-fold cross-validation | NA | i-ROP | 8 | 3 | 2 ophthalmologists and 1 coordinator | NA | NA | NA | 5511 (100) | cases | 54 normal, 31 pre-plus, 15 PLUS | Normal vs. pre-plus and PLUS | 0.93/0.94 | 0.91 | 0.98* |
| | | | | | | | | | | | | | | | | | | Normal and pre-plus vs. PLUS | 1/0.94 | - | |
| Wang[26] | 2018 | China | RetCam3 | ICROP, CRYO-ROP, and ETROP | Id-Net; Gr-Net | NR | NA | Hospital | 1 | 4 | NA | NA | NA | 93/78 | 2226 (298) | cases | 149 normal, 149 ROP; 52 minor ROP, 52 severe ROP | ROP vs. no ROP | 0.97/0.99 | NA | NA |
| | | | | | | | | | | | | | | | (520) 104 | | | Minor vs. severe ROP# | 0.88/0.92 | NA | NA |
| Hu[27] | 2019 | China | RetCam3 | ICROP | CNNs: VGG-16, Inception-V2, ResNet-50 | Select the best model | NA | Hospital | 1 | 3 | 1 chief physician and 2 doctors with 5+ years | 32 (25-41) | 1.994 (0.7-4.25) | NA | 2068 (300) | images | 150 ROP, 150 no ROP | ROP vs. no ROP | 0.96/0.98 | 0.97 | 0.99 |
| | | | | | | | | | | | | | | | 466 (100) | | 50 mild, 50 severe | Mild vs. severe ROP× | 0.82/0.86 | 0.84 | 0.92 |
| Tan[28] | 2019 | New Zealand | RetCam | ETROP | ROP.AI | NA | <1250 g birth weight or <30 weeks gestational age | ART-ROP | 4 | NA | NA | NA | NA | NA | 3487 (116) | images | 33 PLUS, 26 pre-plus, 57 normal | PLUS vs. not PLUS | 0.94/0.81 | 0.86 | 0.98 |
| | | | | | | | | | | | | | | | | | | Pre-plus vs. normal | 0.81/0.81 | 0.81 | - |
| Zhang[29] | 2019 | China | RetCam 2/3 | Clinical diagnosis | DNN: AlexNet, VGG16, GoogLeNet | Select the best model | (1) birth weight <2,000 g; (2) preterm infants with birth weight ≥2,000 g but severe systemic disorders (per pediatricians' assessment) | Hospital | 1 | 5 | 2 chief physicians, 2 attending physicians, 1 resident | 32.0 (25-36.2) | 1.50 (0.78-2.00) | 10075/7726 | 19543 (17801) | images | 8090 ROP, 9711 without ROP | ROP vs. no ROP | 0.941/0.993 | 0.9 | 0.998 |
| Huang[30] | 2020 | China and Japan | RetCam | ICROP | DNN: VGG19*, VGG16, InceptionV3, DenseNet, and MobileNet | Select the best model, then 5-fold cross-validation | Born within 37 weeks of gestation and/or weighing ≤1500 g at birth | Hospital | 2 | 3 | 10 years of working experience | NA | NA | NA | 267 (101) | cases | 59 ROP, 42 no ROP | ROP vs. no ROP | 0.97/0.95 | 0.96 | 0.97 |
| | | | | | | | | | | | | | | | 254 (85) | | 63 mild ROP, 22 severe ROP | Mild vs. severe | 0.99/0.99 | 0.99 | 0.99 |
| Mao[31] | 2020 | China | RetCam | Clinical diagnosis | U-Net, DenseNet | Select the best model based on 5-fold cross-validation | NA | Hospital | 1 | 1 | NA | 31.0±2.0 | 1.5833±0.4016 | NA | 5711 (450) | images | 305 normal, 104 pre-plus, 41 PLUS | PLUS vs. not PLUS | 0.95/0.98 | - | 0.93 |
| | | | | | | | | | | | | | | | | | | Pre-plus vs. not pre-plus | 0.92/0.97 | - | 0.99 |
| Tong[32] | 2020 | China | RetCam | Clinical diagnosis | ResNet; Faster-RCNN | 10-fold cross-validation | NA | Hospital | 1 | 13 | Junior (11), 10 years (2) | NA | NA | NA | 36231 (9772) | images | 519 grading, 261 PLUS, 8992 normal | Grading vs. others | 0.78/0.93 | 0.90 | - |
| | | | | | | | | | | | | | | | | | | PLUS vs. not PLUS | 0.71/0.91 | 0.90 | - |
| Huang[33] | 2021 | China | RetCam | Clinical diagnosis | CNN | 5-fold cross-validation | Infants with a BW of 1500-2000 g or a GA above 32 weeks with any unstable clinical condition | Hospital | 3 | 3 | At least 3 years | NA | NA | NA | 1975 (244) | images | 94 no ROP, 44 stage 1, 106 stage 2 | ROP vs. no ROP | 0.96/0.96 | 0.92 | 0.96 |
| | | | | | | | | | | | | | | | | | | Stage 1 vs. others | 0.92/0.95 | - | 0.93 |
| | | | | | | | | | | | | | | | | | | Stage 2 vs. others | 0.90/0.99 | - | 0.92 |
| Lei[34] | 2021 | China | RetCam2 or 3 | ICROP | CASA, Grad-CAM, ResNet-50 | Select the best model | Birth weight ≤2000 g and gestational age ≤36.5 weeks | ROP Group | 1 | 5 | 2 chief physicians, 2 attending physicians, 1 junior ophthalmologist | NA | NA | NA | 22961 (5160) | images | 3078 ROP, 2082 no ROP | ROP vs. no ROP | 0.95/0.99 | 0.99 | 0.99 |
| Ramachandran[35] | 2021 | India | RetCam3 | ICROP | U-COSFIRE; Darknet-53 | Select the best model | BW <2000 g or GA <34 weeks | KIDROP | 1 | 3 | NA | No PLUS: 32.4±1.1; PLUS: 30.9±1.8 | No PLUS: 1.350±0.240; PLUS: 1.925±0.774 | NA | 289 (161) | images | 94 normal, 67 PLUS | PLUS vs. no PLUS | 0.99/0.98 | 0.97 | 0.99 |
| Li[36] | 2022 | China | RetCam3 | Clinical diagnosis | Retina U-Net; DenseNet | Select the best model based on 5-fold cross-validation | BW <2000 g and GA <37 weeks | Hospital | 1 | 3 | NA | 30.43±5.80 | 1.44203±0.51703 | NA | 18827 (3680) | images | 2893 no ROP, 378 stage I, 262 stage II, 147 stage III | Stage I vs. others | 0.90/0.98 | 0.98 | 0.9663 |
| | | | | | | | | | | | | | | | | | | Stage II vs. others | 0.93/0.99 | | |
| | | | | | | | | | | | | | | | | | | Stage III vs. others | 0.92/0.99 | | |
| | | | | | | | | | | | | | | | | | | Normal vs. others | 0.96/0.96 | | |
| Attallah[37] | 2023 | China | RetCam2/3 | Clinical diagnosis | ResNet-50; DarkNet-53; MobileNet | 5-fold cross-validation | <2000 g at birth, or ≥2000 g premature neonates with significant systemic diseases at birth | Hospital | 30 | 5 | 2 chief physicians, 2 attending physicians, 1 resident | 31.9 (24-36.4) | 1.49 (0.63-2.00) | 10075/7726 | 17801 (1742) | images | 155 ROP, 1587 no ROP | ROP vs. no ROP | 0.90/0.97 | 0.94 | 0.98 |
| | | | | | | | | | | | | 32.0 (25-36.2) | 1.50 (0.78-2.00) | 988/754 | | | | | | | |
| Wagner[38] | 2023 | UK | RetCam Version 2 | Clinical diagnosis | Bespoke and CFDL models | Select the best model | BW <1501 g or GA ≤32 weeks | Hospital | 1 | 4 | 3 years | NA | NA | NA | 6141 (200) | images | 111 no ROP, 43 pre-plus, 46 PLUS | ROP vs. no ROP | 0.973/0.900 | - | 0.986 |
| | | | | | | | | | | | | | | | | | | Pre-plus vs. others | 0.860/0.860 | - | 0.927 |
| | | | | | | | | | | | | | | | | | | PLUS vs. others | 0.522/0.981 | - | 0.974 |
Quality Assessment
Figure 2 shows the quality assessment of the 14 included studies. All studies had a low risk of bias in the index test and reference standard domains. However, excluding low-definition images without ensuring a consecutive or random sample might introduce bias into the patient selection process. Consequently, studies that excluded low-quality images were considered to have a high risk of bias, while those that did not specify whether such images were excluded were rated as unclear risk. In addition, the results of the index test and the reference standard should ideally be collected simultaneously to prevent misclassification due to disease progression.[22] Because ROP is a progressive disease, clear time intervals between the selection of images and their validation are necessary. However, none of the included studies addressed this, leading to an unclear risk of bias in the flow and timing domain.
Figure 2 Risk of bias and applicability concerns graph according to the quality assessment of diagnostic accuracy studies-2 (QUADAS-2) and QUADAS-AI criteria for the 14 included studies.
Diagnostic accuracy
For the 11 studies using AI to diagnose ROP, the pooled sensitivity and specificity were 0.95 (95% CI 0.93-0.96) and 0.97 (95% CI 0.94-0.98), respectively (Figure 3a). The AUC was 0.97 (95% CI 0.95-0.98) (Figure 4a). The DOR of AI for diagnosing ROP was 611 (95% CI 300-1244). The LR+ was 31.7 (95% CI 16.7-59.9), and the LR- was 0.05 (95% CI 0.04-0.07) (Table 2). There was considerable among-study heterogeneity according to Cochran's Q test (P < 0.01) and the I² heterogeneity index (Figure 3a). Deeks' funnel plot revealed publication bias in AI diagnosis of ROP, with statistical significance (P = 0.02) (Figure 5a).
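For reference, the I² index reported alongside Cochran's Q in the forest plots follows the standard definition for k studies:

$$ I^{2} = \max\!\left(0,\ \frac{Q - (k-1)}{Q}\right) \times 100\% $$

so a Q statistic well above its degrees of freedom (k-1) yields an I² close to 100%, signalling substantial among-study heterogeneity.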
Figure 3 Coupled forest plots of the pooled sensitivity and specificity of AI detection in ROP patients (a) and in the PLUS cohort (b).

Figure 4 Hierarchical summary receiver operating characteristic (HSROC) curve of the diagnostic performance of AI detection in patients with ROP (a) and in patients with PLUS (b).

Figure 5 Deeks’ funnel plot used to evaluate the potential publication bias of AI detection in ROP patients (a) and in patients with PLUS (b).
Table 2 Sensitivity, specificity, LR+, LR-, and DORs of AI detection in the ROP and PLUS cohorts

| | AI detection in ROP | AI detection in PLUS |
|---|---|---|
| Sensitivity (95% CI) | 0.95 (0.93, 0.96) | 0.92 (0.80, 0.97) |
| Specificity (95% CI) | 0.97 (0.94, 0.98) | 0.95 (0.91, 0.97) |
| LR+ (95% CI) | 31.7 (16.7, 59.9) | 18.5 (9.9, 34.8) |
| LR- (95% CI) | 0.05 (0.04, 0.07) | 0.09 (0.03, 0.22) |
| DOR (95% CI) | 611 (300, 1244) | 218 (58, 815) |
Meta-regression
Meta-regression was used to explore the causes of heterogeneity among patients who received an AI-based diagnosis of ROP and those diagnosed with PLUS (Table 3). Study heterogeneity was independently associated with country, number of centers, data source, and the number of doctors responsible for the initial screening.

Table 3 Results of the meta-regression analysis of AI for the detection of ROP and PLUS
| AI detection | Covariates | Category | Studies (n) | Sensitivity (95% CI) | P | Specificity (95% CI) | P |
|---|---|---|---|---|---|---|---|
| ROP | Country | China | 8 | 0.95 (0.95-0.96) | <0.01 | 0.98 (0.97-0.99) | 0.60 |
| | | Other country | 3 | 0.94 (0.90-0.98) | | 0.89 (0.82-0.97) | |
| | Centers | >1 | 5 | 0.93 (0.91-0.96) | <0.01 | 0.94 (0.90-0.99) | <0.01 |
| | | =1 | 6 | 0.95 (0.95-0.96) | | 0.98 (0.97-0.99) | |
| | Data source | Hospitals | 8 | 0.95 (0.94-0.97) | <0.01 | 0.97 (0.96-0.99) | 0.59 |
| | | Database | 3 | 0.93 (0.89-0.97) | | 0.95 (0.90-1.00) | |
| | Doctors | ≥3 | 5 | 0.94 (0.92-0.97) | <0.01 | 0.96 (0.92-0.99) | 0.01 |
| | | <3 | 6 | 0.95 (0.94-0.97) | | 0.98 (0.96-1.00) | |
| PLUS | Country | China | 5 | 0.91 (0.80-1.00) | 0.85 | 0.95 (0.90-0.99) | 0.06 |
| | | Other country | 4 | 0.93 (0.83-1.00) | | 0.95 (0.91-1.00) | |
| | Centers | >1 | 3 | 0.98 (0.95-1.00) | 0.03 | 0.93 (0.86-1.00) | 0.05 |
| | | =1 | 6 | 0.86 (0.75-0.98) | | 0.96 (0.93-0.98) | |
| | Data source | Hospitals | 6 | 0.86 (0.75-0.98) | 0.06 | 0.96 (0.92-0.99) | 0.40 |
| | | Database | 3 | 0.98 (0.95-1.00) | | 0.94 (0.87-1.00) | |
| | Doctors | ≥3 | 3 | 0.73 (0.53-0.93) | <0.01 | 0.95 (0.90-1.00) | 0.11 |
| | | <3 | 6 | 0.96 (0.90-1.00) | | 0.95 (0.92-0.99) | |
Among the 9 studies in which AI was used to distinguish PLUS, multicenter studies had greater sensitivity than single-center studies (0.98, 95% CI 0.95-1.00 vs. 0.86, 95% CI 0.75-0.98; P = 0.03). Screening by three or more doctors had lower sensitivity than screening by fewer than three doctors (0.73, 95% CI 0.53-0.93 vs. 0.96, 95% CI 0.90-1.00; P < 0.01).
DISCUSSION
AI has been widely used in medical diagnostic research, but little of it has actually reached clinical practice. In this meta-analysis, we evaluated the performance of AI in the diagnosis of ROP and PLUS; to our knowledge, this is the first study to comprehensively assess the performance of AI for ROP and PLUS simultaneously. The results demonstrated that AI systems achieved high sensitivity and specificity in identifying both ROP and PLUS. For diagnosing ROP, high sensitivity is emphasized to ensure that no patients at potential risk are overlooked. The pooled sensitivity across 11 studies was 0.95 (95% CI 0.93-0.96), with individual values ranging from 0.81 to 0.98. The AUC of AI for diagnosing ROP was 0.97 (95% CI 0.95-0.98), indicating outstanding diagnostic efficiency.[39] Moreover, the LR- of 0.05 indicates that infants classified as ROP-free by AI had a low residual risk of ROP. For diagnosing PLUS, high specificity is critical to accurately identify those needing potential therapy to prevent the adverse outcomes of ROP. The pooled specificity across 9 studies was 0.95 (95% CI 0.91-0.97), with individual values ranging from 0.82 to 1.00. The AUC of AI for diagnosing PLUS was also outstanding, at 0.98 (95% CI 0.96-0.99). The LR+ of 18.5 indicates that patients classified as PLUS by AI are highly likely to actually have PLUS and to require close attention and therapy. Consequently, the included studies suggest that AI detection of ROP and PLUS is effective, with remaining room to improve sensitivity for ROP and specificity for PLUS, given that the highest reported sensitivity for ROP reached 0.98[38] and the highest specificity for PLUS reached 1.00.[6]
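As a worked illustration of these likelihood ratios (with an assumed, purely illustrative pre-test probability of 20%, not a figure reported by the included studies), the post-test probability follows from Bayes' theorem in odds form:

$$ \text{post-test odds} = \text{pre-test odds} \times \mathrm{LR}, \qquad \frac{0.20}{1-0.20} \times 18.5 = 4.625 \;\Rightarrow\; p = \frac{4.625}{1+4.625} \approx 0.82, $$

so a positive PLUS call would raise the probability from 20% to about 82%, while the LR- of 0.05 for ROP would lower the odds from 0.25 to $0.25 \times 0.05 = 0.0125$, a post-test probability of about 1.2%.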
Multiple factors contribute to heterogeneity, including patient selection, time intervals, and publication bias. Excluding low-quality fundus photographs or those taken from peripheral angles may artificially inflate the reported sensitivity of AI diagnoses.[40-41] Additionally, variations in AI algorithms across studies cause discrepancies in how disease parameters such as vessel tortuosity, direction, or ridge position are assessed, leading to inconsistencies even within the same study when different AI tools are used.[6,26,34,37] The choice of time interval between image captures is another critical factor affecting results. Researchers might choose higher-resolution images to obtain more favorable results, thereby introducing a selection bias based on the timing of the photographs.[42-43] Similarly, most included studies presented a high risk of bias or provided unclear information regarding time intervals, yet selecting appropriate intervals is crucial because ROP progresses rapidly. If the time interval is unsuitable, AI diagnoses based on earlier fundus photographs might differ significantly from specialist diagnoses based on current ophthalmoscopic examinations.[44-45] Therefore, bias in flow and timing is essential to address in diagnostic models, and future studies should perform the AI and reference-standard diagnoses synchronously. Moreover, publication bias contributed to the heterogeneity; this phenomenon may be linked to enterprise support. Publication bias can lead to overestimated or underestimated effects, potentially resulting in inappropriate therapies in clinical practice.[46-47] Given these factors, emphasis should be placed on appropriate patient selection and study design, as well as on improving the adaptability of AI to varying fundus photo quality and disease parameters.[48-49] Overall, risk of bias may introduce high heterogeneity, which ultimately reduces the diagnostic effectiveness of AI for ROP and PLUS.
Meta-regression analysis was employed to explore study heterogeneity based on country, number of centers, data sources, and number of doctors. Significant differences were observed in the sensitivity of ROP diagnosis across countries, numbers of centers, and data sources.[50-52] Variability in sensitivity across studies may result from the varying number of research centers and the diversity of data sources. Regarding the number of research centers, AI applications in single-center studies typically encounter a less heterogeneous population and simpler data collection, thereby reducing confounding factors.[53-54] In contrast, multicenter studies exhibit significant variation among patients, image data, and instruments.[37-38] Therefore, binary diagnosis of ROP in a single-center setting tended to achieve higher sensitivity. Regarding the data source, private data demonstrated greater sensitivity than data from public databases. Private data, sourced exclusively from a single hospital and acquired with the same equipment, ensure consistent data quality; this consistency enables AI to focus effectively on training and recognition, thus enhancing sensitivity. In contrast, public data pose more challenges due to variable quality and a lack of timely updates; however, they can be useful for training AI to be robust in diagnosis, given their diverse patient populations and varied photo quality.[55-56] Finally, to advance AI models for diagnosing ROP, timely data sharing from various research centers is encouraged.
The performance of AI in diagnosing PLUS has also been investigated. Unlike the diagnosis of ROP, the identification of PLUS was more effective in multicenter studies than in single-center studies. Several reasons might explain this. Firstly, AI identification currently relies primarily on vascular segmentation, and the vascular curvature characteristic of PLUS is easy to identify.[57-58] Therefore, despite significant variation in image quality across sources, AI can still effectively extract and identify vascular curvature features.[59] Secondly, AI can quantify vasodilation and tortuosity in PLUS, overcoming differences in image quality between multicenter studies as well as the subjectivity of doctors' diagnoses.[6,32,54] Thirdly, PLUS primarily affects the quadrants around the optic disc, where the identification range is more concentrated, reducing the need for information from peripheral vessels.[31,60] Moreover, the optic disc serves as a reference point for position recognition,[61-62] aiding AI in identifying features, thereby improving training and increasing the applicability of AI. Thus, the difficulties that multicenter research posed for binary ROP diagnosis were better tolerated in PLUS recognition. Moreover, the number of doctors responsible for the initial screening also introduced bias: in both ROP diagnosis and PLUS identification, AI demonstrated lower sensitivity when three or more doctors were responsible for the initial screening. This result highlights the fragility of the current gold standard, which relies heavily on the experience and diagnostic consistency of the practitioners involved.[63-64] To mitigate potential biases, establishing a more robust gold standard for AI training is crucial, and doing so could significantly enhance the effectiveness and reliability of AI applications in clinical practice.[63]
Although AI presents promising potential for clinical applications, several significant challenges remain. AI models trained on homogeneous datasets may struggle to generalize across diverse populations, particularly in low- and middle-income regions with distinct ROP phenotypes.[65] Additionally, the lack of transparency in AI decision-making processes, often referred to as "black box" algorithms, raises concerns regarding explainability, which is critical in fields like ophthalmology, where diagnostic accuracy is paramount.[66] Furthermore, liability when AI assists in clinical decision-making remains unclear, and over-reliance on AI could hinder the development of essential clinical skills.[67] Thus, optimizing the integration of AI into clinical practice should be the focus of future research.
This study has several limitations. Firstly, the included studies exhibited considerable heterogeneity; although we conducted a meta-regression analysis, the exploration of factors contributing to heterogeneity might have been insufficient. Secondly, none of the included studies adequately specified the time intervals between assessments, leading to a high risk of bias in the flow and timing domain. Thirdly, some studies did not provide information on patient selection or photo inclusion criteria, also leading to a high risk of bias. Finally, many studies lacked detailed information on patient characteristics, which is crucial for diagnosing ROP in clinical practice.
In conclusion, AI demonstrated excellent performance in diagnosing ROP and PLUS. Given the current shortage of pediatric ophthalmologists, AI could serve as a valuable tool for ROP screening. However, heterogeneity poses a significant challenge to the use of AI in clinical practice. More well-designed studies are recommended to enhance the generalizability of AI in diagnosing ROP and PLUS.
Correction notice
None
Acknowledgement
None
Author Contributions
(Ⅰ) Conception and design: Rui Liu, Guina Liu
(Ⅱ) Administrative support: Fang Lu, Jiang Xiaoshuang
(Ⅲ) Provision of study materials or patients: Rui Liu
(Ⅳ) Collection and assembly of data: Rui Liu, Guina Liu
(Ⅴ) Data analysis and interpretation: Ruiyang Li, Wenben Chen, and Jialing Chen
(Ⅵ) Manuscript writing: All authors
(Ⅶ) Final approval of manuscript: All authors