Oral cancer (OC) and oral potentially malignant disorders (OPMDs) remain major global public health challenges, particularly in low- and middle-income countries. Although early detection substantially improves prognosis, limited healthcare infrastructure restricts timely diagnosis. Artificial intelligence (AI)-enabled, mobile phone-based diagnostic systems offer a promising, accessible solution, and multiple systematic reviews have demonstrated their potential. However, uncertainty persists regarding the comparative performance of AI models across diverse real-world settings.
This umbrella review aimed to evaluate the comparative performance of different AI models in detecting OC and OPMDs.
Six systematic reviews were identified from databases including Medline (via PubMed), Web of Science, Scopus, and EMBASE, searched through October 2024, and were screened at the title, abstract, and full-text levels. The risk of bias (ROB) was then assessed using the Joanna Briggs Institute’s ROB assessment tool.
Across included reviews, pooled sensitivity and specificity for AI-based detection ranged from 88% to 92%, with reported diagnostic odds ratios ranging from 114 to 2549, indicating strong discriminatory performance. Deep learning architectures such as EfficientNet and ResNet consistently demonstrated high diagnostic accuracy, while hybrid approaches (e.g., MLSO + SVM) showed promising performance in selected analyses. However, substantial heterogeneity was observed across studies (
Deep learning models like EfficientNet and ResNet are favored in clinical diagnostics for their exceptional performance and adaptability. Hybrid approaches, such as MLSO + SVM, also show great potential by combining the strengths of traditional and modern methods effectively.
Oral cancer (OC) and oral potentially malignant disorders (OPMDs) pose significant global health challenges due to high incidence rates, delayed diagnosis, and associated mortality.[
The rapid development of artificial intelligence (AI) and deep learning has created new opportunities to address these diagnostic challenges. In particular, neural network models, with their ability to process large datasets and detect subtle patterns, have shown promise in identifying early signs of malignancy in various medical imaging applications.[
Mobile phone-based neural networks, with their capacity to analyze vast datasets and recognize intricate patterns, present an innovative pathway to identify early signs of OPMDs and OCs, aiming to assist healthcare providers and even laypersons in identifying suspicious lesions.[
Many published studies have evaluated the diagnostic accuracy of neural network models in detecting OPMD and OC using mobile phone photographs. Systematic reviews have also examined the diagnostic performance of these models using images from various modalities, such as computed tomography, histopathological slide images, spectral images, autofluorescence images, and clinical intra-oral photographs.[
Despite the growing interest in AI for healthcare, there remains significant uncertainty regarding the specific AI models that deliver optimal performance across diverse clinical and operational settings. Several systematic reviews have highlighted the use of AI in diagnosing OC and OPMDs but fall short of identifying the most effective models tailored to specific contexts, such as resource-limited environments. This lack of clarity hinders the translation of AI-based solutions from research to real-world applications, particularly in regions where early detection could have the greatest impact.
To bridge this knowledge gap, we performed an umbrella review with the primary objective of systematically evaluating the comparative performance of various AI models in detecting OC and OPMD. By evaluating deep learning frameworks such as EfficientNet and ResNet, alongside hybrid models like MLSO + SVM, we aim to pinpoint algorithms that deliver high diagnostic accuracy while remaining adaptable across diverse clinical contexts.
The findings of this study will have broad implications for the integration of AI into public health strategies. By identifying the most efficient models, our study aims to facilitate the implementation of AI-driven diagnostic solutions, ensuring their dependability and suitability for application in resource-limited environments. This is particularly crucial for rural India, where access to conventional healthcare facilities is limited, and innovative, cost-effective diagnostic solutions are urgently needed.
This umbrella review was conducted in alignment with PRISMA guidelines to address the following research question: “Which AI models demonstrate the highest diagnostic accuracy and adaptability for detecting OC and OPMD in resource-constrained settings?”
The criteria for eligibility were established based on the following PICOS framework:
(P) Population: Patients undergoing evaluation for OPMD and OC.
(I) Intervention: Digital tools that use neural networks for the detection of OPMD and OC (including smartphone or AI-integrated smartphone applications).
(C) Comparison: Traditional diagnostic techniques considered the gold standard (such as visual assessment).
(O) Outcomes: Diagnostic accuracy (sensitivity, specificity), overall accuracy, and diagnostic odds ratio (DOR).
(S) Study design: Systematic reviews of observational studies.
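The outcome measures named in the PICOS framework can all be derived from a 2 × 2 table of index-test results against the reference standard. The following is a minimal sketch for illustration only, not code used in this review; the names `DiagnosticMetrics` and `diagnostic_metrics` are hypothetical, and the 0.5 continuity correction for zero cells is an assumed convention rather than something reported by the included reviews.

```python
from dataclasses import dataclass


@dataclass
class DiagnosticMetrics:
    sensitivity: float
    specificity: float
    accuracy: float
    dor: float  # diagnostic odds ratio


def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> DiagnosticMetrics:
    """Compute diagnostic accuracy measures from 2x2 counts.

    tp/fn: diseased cases classified as positive/negative by the index test.
    fp/tn: non-diseased cases classified as positive/negative.
    """
    if min(tp, fp, fn, tn) < 0:
        raise ValueError("counts must be non-negative")
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one diseased and one non-diseased case")

    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + fn + tn)

    # Assumed convention: add 0.5 to every cell when any cell is zero,
    # so that the odds ratio stays finite.
    a, b, c, d = float(tp), float(fp), float(fn), float(tn)
    if 0 in (tp, fp, fn, tn):
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    dor = (a * d) / (b * c)

    return DiagnosticMetrics(sensitivity, specificity, accuracy, dor)
```

For example, `diagnostic_metrics(tp=90, fp=10, fn=10, tn=90)` gives a sensitivity and specificity of 0.90 and a DOR of 81, which shows how quickly the DOR grows as both measures approach 1.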
Research studies adhering to the PICO framework’s “Subjects, Intervention, Control, Outcome” criteria were deemed eligible for inclusion. Systematic reviews of
This umbrella review encompassed systematic reviews of observational studies that analyzed the diagnostic accuracy of various neural networks in identifying OPMD and OC in digital photographs, in contrast to conventional clinical assessments.
Two independent researchers (L.C. and R.S.) performed an electronic search for systematic reviews across several databases, including Medline (via PubMed), Web of Science, Scopus, EMBASE, and Google Scholar, covering publications up to October 2024. Detailed search strategies for each database are presented in
Search strategy for the study
To compile relevant articles, duplicate entries were systematically identified and removed based on title, author, and publication year after consolidating results from various databases. Two independent reviewers (L.C. and R.S.) conducted an initial screening of titles and abstracts, excluding studies that did not meet the predefined PICO framework criteria through a consensus-based approach. Any disagreements regarding article selection were resolved via discussion or, if necessary, by consulting a third and fourth reviewer (M.K. and R.K.). Following this preliminary filtering, each reviewer undertook a rigorous evaluation of the full texts of the remaining articles, with any residual discrepancies resolved through further dialogue [
PRISMA flow diagram illustrating the study selection process for the umbrella review. The figure depicts the identification, screening, eligibility assessment, and inclusion of systematic reviews evaluating artificial intelligence-based detection of oral potentially malignant disorders and oral cancer using mobile photographs. The process followed PRISMA guidelines, including removal of duplicates, title and abstract screening, full-text assessment, and final inclusion of articles.
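The duplicate-removal step described above, matching records on title, author, and publication year after merging database exports, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the procedure the reviewers actually ran: the record fields `title`, `first_author`, and `year` and the function names are hypothetical, and real reference managers apply more elaborate matching.

```python
import re
from typing import Iterable


def _normalize(text: str) -> str:
    """Lowercase and collapse punctuation/whitespace so trivial differences match."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def deduplicate(records: Iterable[dict]) -> list[dict]:
    """Keep the first occurrence of each (title, first author, year) combination."""
    seen: set[tuple[str, str, int]] = set()
    unique: list[dict] = []
    for rec in records:
        key = (
            _normalize(rec["title"]),
            _normalize(rec["first_author"]),
            int(rec["year"]),
        )
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```

Exact matching on normalized fields is deliberately conservative: it removes only clear duplicates and leaves near matches for manual screening by the reviewers.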
For studies that satisfied the inclusion criteria, data extraction was performed by two reviewers (L.C. and R.S.), adhering to the Joanna Briggs Institute (JBI) guidelines for umbrella reviews.[
General characteristics of included systematic reviews
Inference from included studies
The risk of bias (ROB) for each study was independently evaluated by two reviewers (L.C. and R.S.) using the JBI systematic review critical appraisal tool [
Risk of bias assessment of included systematic reviews using the Joanna Briggs Institute critical appraisal tool. The figure summarizes domain-wise risk of bias judgments across included reviews, including clarity of the research question, search strategy, study appraisal, data extraction, publication bias assessment, and applicability. Visualization was generated using the Robvis tool, highlighting variability in methodological quality that contributes to heterogeneity.
The study selection process was carried out independently by two authors (L.C. and R.S.). This involved an initial screening of titles and abstracts, followed by a detailed evaluation of full texts according to set inclusion and exclusion criteria. In the end, data extraction was conducted on six articles [
The results of this umbrella review demonstrate that the six systematic reviews and meta-analyses evaluated provide compelling evidence on the effectiveness of AI in clinical diagnostics across diverse settings and methodologies. The studies span geographical regions including Iran, the UK, India, Mexico, Thailand, and France, and cover publication years from 2000 to 2023. They collectively employed various AI models, traditional machine learning techniques, and robust appraisal frameworks to compare diagnostic performance against conventional methods.
Although several reviews reported pooled diagnostic metrics, a quantitative meta–meta-analysis was not performed due to substantial methodological heterogeneity, including differences in study designs, outcome definitions, AI architectures, validation strategies, and pooling methods. Overlap of primary studies and inconsistent reporting of pooled estimates further precluded reliable secondary quantitative synthesis. Accordingly, findings were narratively synthesized in line with JBI guidance, emphasizing comparative trends rather than pooled effect estimates.
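Overlap of primary studies across systematic reviews is often quantified in umbrella reviews with the corrected covered area (CCA), defined as (N − r) / (r·c − r), where N is the total number of inclusions summed over all reviews, r is the number of unique primary studies, and c is the number of reviews. The sketch below shows that calculation; it is an illustration only, since this review does not report a CCA, and the function name and input layout are hypothetical.

```python
def corrected_covered_area(inclusion: dict[str, set[str]]) -> float:
    """Compute the corrected covered area (CCA) as a proportion.

    `inclusion` maps each review identifier to the set of primary-study
    identifiers that the review includes.
    """
    c = len(inclusion)  # number of reviews (columns)
    if c < 2:
        raise ValueError("CCA requires at least two reviews")

    unique_studies = set().union(*inclusion.values())
    r = len(unique_studies)  # number of unique primary studies (rows)
    if r == 0:
        raise ValueError("no primary studies supplied")

    n = sum(len(studies) for studies in inclusion.values())  # total inclusions
    return (n - r) / (r * c - r)
```

A value of 0 means that no primary study appears in more than one review, and a value of 1 means that every review includes every primary study. For three hypothetical reviews including {s1, s2, s3}, {s2, s3, s4}, and {s3, s5}, N = 8, r = 5, and c = 3, giving a CCA of 3/10 = 0.30.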
Rokhshad
Ferro
Kavyashree
Warin and Suebnukarn[
Overall, the results affirm the potential of AI, particularly deep learning and hybrid architectures, in enhancing diagnostic accuracy and reliability across clinical domains. However, the variability in study quality, significant heterogeneity, and frequent methodological limitations highlight the need for standardized approaches and more rigorous evaluation in future research.
The findings of the reviewed studies highlight the growing potential of AI in clinical diagnostics, aligning with existing literature on the topic. Previous research has consistently demonstrated the ability of AI models, particularly deep learning architectures like CNNs, to surpass traditional diagnostic methods in terms of accuracy and efficiency. For example, studies outside this review have reported diagnostic accuracies exceeding 90% for AI systems in detecting malignancies and other complex conditions, verifying the high accuracies noted by Rokhshad
Substantial heterogeneity was consistently observed across the included reviews, with several meta-analyses reporting
Some results differ from broader findings in the literature. While most studies reviewed here report high accuracy and low false positive rates for AI models, heterogeneity in sensitivity and specificity metrics was a recurring issue. For instance, Rokhshad
The strengths of this analysis lie in its comprehensive approach, incorporating diverse geographic settings, a wide range of AI models, and robust appraisal tools such as the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2). This breadth provides a holistic view of AI’s applicability in clinical diagnostics. Nevertheless, certain limitations must be acknowledged. Many studies exhibited unclear or high risk of bias, as noted by Ferro
Another significant limitation is the lack of detailed subgroup analyses or stratification by factors such as population characteristics, imaging modalities, or model architectures. For instance, while modern AI models consistently outperformed classical approaches, their performance varied significantly across different diagnostic tasks and imaging techniques. Including such stratifications could provide deeper insights into the specific conditions or scenarios where AI excels or underperforms. In addition, certain studies, such as those by Kavyashree
Future research should address these limitations by adopting standardized methodologies and reporting guidelines to improve the consistency and reliability of results. Greater emphasis on external validation using independent datasets is also essential to assess the generalizability of AI models. Studies should also focus on exploring hybrid models that combine the strengths of classical and modern techniques, as suggested by the promising performance of hybrid architectures like MLSO + SVM reported by de Chauveron
In conclusion, while the findings affirm AI’s transformative potential in clinical diagnostics, they also highlight significant gaps in methodology and reporting. Addressing these gaps will be critical in realizing the full potential of AI in delivering accurate, reliable, and scalable diagnostic solutions across diverse clinical settings.
Based on the comprehensive analysis of the studies, modern deep learning architectures consistently demonstrated superior performance compared to classical machine learning models, making them the most recommended for clinical diagnostics. Among these, models such as EfficientNet, ResNet, and hybrid configurations like MLSO + SVM showed the highest levels of accuracy, sensitivity, and specificity across various tasks and datasets.
EfficientNet and ResNet, particularly their advanced versions such as EfficientNet-B4 and ResNet-101, excelled in diverse applications, achieving accuracy rates above 90% in multiple studies. These models also exhibited robust adaptability to different imaging modalities and datasets, underscoring their versatility and reliability in clinical diagnostics. The hybrid MLSO + SVM model emerged as particularly noteworthy, combining the strengths of classical and modern techniques, delivering the best performance in its respective study with strong generalization capabilities.
In contrast, classical models such as SVM and logistic regression, although achieving high accuracy on certain datasets (e.g., 100% in specific tasks), generally lagged behind deep learning architectures in overall diagnostic power and scalability. Their limited ability to handle complex, high-dimensional data makes them less suitable for modern clinical applications than more advanced neural networks.
Therefore, based on current evidence, deep learning models like EfficientNet and ResNet are highly recommended for clinical diagnostics due to their superior performance and adaptability. Hybrid models, such as MLSO + SVM, also hold significant promise, particularly in scenarios where leveraging the complementary strengths of classical and modern methods is beneficial. Future efforts should focus on refining these models further and validating their performance across diverse clinical contexts to ensure consistent and reliable diagnostic outcomes.
The protocol for this study can be reviewed on the International Prospective Register of Systematic Reviews (PROSPERO) database, referenced under the registration number: CRD420250656539.
This research was conducted, reported, and disseminated without direct engagement or input from patients or the general public.
Data can be obtained through a reasonable request process.
This study is a part of an ICMR project (Project ID IIRP-2023-1049) funded by Small extramural grants – 2023.
The authors of this manuscript declare that they have no conflicts of interest, real or perceived, financial or non-financial, related to this article.
