
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery


This approach signifies the beginning of a new era in scientific discovery in machine learning: bringing the transformative benefits of AI agents to the entire research process of AI itself, and taking us closer to a world where endless affordable creativity and innovation can be unleashed on the world's most challenging problems.


LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs

By incorporating this dataset into model training, we successfully scale the output length of existing models to over 10,000 words while maintaining output quality.

ControlNeXt: Powerful and Efficient Control for Image and Video Generation

In this paper, we propose ControlNeXt: a powerful and efficient method for controllable image and video generation.

Zero-Shot Surgical Tool Segmentation in Monocular Video Using Segment Anything Model 2

AngeLouCN/SAM-2_Surgical_Video • 3 Aug 2024

The Segment Anything Model 2 (SAM 2) is the latest generation foundation model for image and video segmentation.

OpenResearcher: Unleashing AI for Accelerated Scientific Research

gair-nlp/openresearcher • 13 Aug 2024

The rapid growth of scientific literature imposes significant challenges for researchers endeavoring to stay updated with the latest advancements in their fields and delve into new areas.

FruitNeRF: A Unified Neural Radiance Field based Fruit Counting Framework


We introduce FruitNeRF, a unified novel fruit counting framework that leverages state-of-the-art view synthesis methods to count any fruit type directly in 3D.


Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling

When trained on 4K length sequences, Samba can be efficiently extrapolated to 256K context length with perfect memory recall and show improved token predictions up to 1M context length.

MindSearch: Mimicking Human Minds Elicits Deep AI Searcher

internlm/mindsearch • 29 Jul 2024

Inspired by the cognitive process when humans solve these problems, we introduce MindSearch to mimic the human minds in web information seeking and integration, which can be instantiated by a simple yet effective LLM-based multi-agent framework.

Qwen2-Audio Technical Report

We introduce the latest progress of Qwen-Audio, a large-scale audio-language model called Qwen2-Audio, which is capable of accepting various audio signal inputs and performing audio analysis or direct textual responses with regard to speech instructions.

AgileCoder: Dynamic Collaborative Agents for Software Development based on Agile Methodology

fsoft-ai4code/agilecoder • 16 Jun 2024

Software agents have emerged as promising tools for addressing complex software engineering tasks.

  • Open access
  • Published: 29 September 2020

Machine learning prediction in cardiovascular diseases: a meta-analysis

  • Chayakrit Krittanawong,
  • Hafeez Ul Hassan Virk,
  • Sripal Bangalore,
  • Zhen Wang,
  • Kipp W. Johnson,
  • Rachel Pinotti,
  • HongJu Zhang,
  • Scott Kaplin,
  • Bharat Narasimhan,
  • Takeshi Kitai,
  • Usman Baber,
  • Jonathan L. Halperin &
  • W. H. Wilson Tang

Scientific Reports volume 10, Article number: 16057 (2020)


  • Cardiovascular diseases
  • Computational biology and bioinformatics
  • Machine learning

Several machine learning (ML) algorithms have been increasingly utilized for cardiovascular disease prediction. We aim to assess and summarize the overall predictive ability of ML algorithms in cardiovascular diseases. A comprehensive search strategy was designed and executed within the MEDLINE, Embase, and Scopus databases from database inception through March 15, 2019. The primary outcome was a composite of the predictive ability of ML algorithms for coronary artery disease, heart failure, stroke, and cardiac arrhythmias. Of 344 total studies identified, 103 cohorts, with a total of 3,377,318 individuals, met our inclusion criteria. For the prediction of coronary artery disease, boosting algorithms had a pooled area under the curve (AUC) of 0.88 (95% CI 0.84–0.91), and custom-built algorithms had a pooled AUC of 0.93 (95% CI 0.85–0.97). For the prediction of stroke, support vector machine (SVM) algorithms had a pooled AUC of 0.92 (95% CI 0.81–0.97), boosting algorithms had a pooled AUC of 0.91 (95% CI 0.81–0.96), and convolutional neural network (CNN) algorithms had a pooled AUC of 0.90 (95% CI 0.83–0.95). For both heart failure and cardiac arrhythmias there were too few studies per algorithm for meta-analysis, and the overlapping confidence intervals across methods showed no clear difference, although SVM may outperform other algorithms in these areas. The predictive ability of ML algorithms in cardiovascular diseases is promising, particularly for SVM and boosting algorithms. However, there is heterogeneity among ML algorithms in terms of multiple parameters. This information may assist clinicians in interpreting the data and implementing optimal algorithms for their datasets.


Introduction

Machine learning (ML) is a branch of artificial intelligence (AI) that is increasingly utilized within the field of cardiovascular medicine. It is essentially how computers make sense of data and decide or classify a task with or without human supervision. The conceptual framework of ML is based on models that receive input data (e.g., images or text) and, through a combination of mathematical optimization and statistical analysis, predict outcomes (e.g., favorable, unfavorable, or neutral). Several ML algorithms have been applied to daily activities. As an example, a common ML algorithm designated the support vector machine (SVM) can recognize non-linear patterns for use in facial recognition, handwriting interpretation, or detection of fraudulent credit card transactions 1,2. So-called boosting algorithms, used for prediction and classification, have been applied to the identification and processing of spam email. Another algorithm, denoted random forest (RF), can facilitate decisions by averaging across several decision nodes, while convolutional neural network (CNN) processing combines several layers and applies to image classification and segmentation 3,4,5. We have previously described the technical details of each of these algorithms 6,7,8, but no consensus has emerged to guide the selection of specific algorithms for clinical application within the field of cardiovascular medicine. Although selecting optimal algorithms for research questions and reproducing algorithms in different clinical datasets is feasible, the clinical interpretation and judgement required to implement algorithms remain very challenging, as does ensuring that ML practitioners have a deep understanding of both statistics and clinical medicine. Most ML studies report a discrimination measure such as the area under the ROC curve (AUC) instead of p values. Most importantly, an acceptable AUC cutoff for use in clinical practice, the interpretation of that cutoff, and the most appropriate algorithms for cardiovascular datasets remain to be evaluated. We previously proposed a methodology for conducting ML research in medicine 6. Systematic review and meta-analysis, the foundation of modern evidence-based medicine, therefore have to be performed in order to evaluate the existing ML algorithms in cardiovascular disease prediction. Here, we performed the first systematic review and meta-analysis of ML research in cardiovascular diseases, encompassing over a million patients.

This study is reported in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) recommendations. Ethical approval was not required for this study.

Search strategy

A comprehensive search strategy was designed and executed within the MEDLINE, Embase, and Scopus databases from database inception through March 15, 2019. One investigator (R.P.) designed and conducted the search strategy using input from the study’s principal investigator (C.K.). Controlled vocabulary, supplemented with keywords, was used to search for studies of ML algorithms and coronary heart disease, stroke, heart failure, and cardiac arrhythmias. The detailed strategy is available from the reprint author. The full search strategies can be found in the supplementary documentation.

Study selection

Search results were exported from all databases and imported into Covidence 9, an online systematic review tool, by one investigator (R.P.). Duplicates were identified and removed using Covidence's automated de-duplication functionality. The de-duplicated set of results was screened independently by two reviewers (C.K. and H.V.) in two successive rounds to identify studies that met the pre-specified eligibility criteria. In the initial screening, the two investigators (C.K. and H.V.) independently examined the titles and abstracts of the records retrieved from the search via the Covidence portal and used a standard extraction form. Conflicts were resolved through consensus and reviewed by other investigators. We included abstracts with sufficient evaluation data, including the methodology, the definition of outcomes, and appropriate evaluation metrics. Studies without any kind of validation (external or internal) were excluded. We also excluded reviews, editorials, non-human studies, and letters without sufficient data.

Data extraction

We extracted the following information, where possible, from each study: authors, year of publication, study name, test types, testing indications, analytic models, number of patients, endpoints (CAD, AMI, stroke, heart failure, and cardiac arrhythmias), and performance measures (AUC, sensitivity, specificity, positive cases (the number of patients who used the AI and were positively diagnosed with the disease), negative cases (the number of patients who used the AI and were negative on the AI test), true positives, false positives, true negatives, and false negatives). CAD was defined as coronary artery stenosis > 70% on angiography or FFR-based significance. Cardiac arrhythmias included studies involving bradyarrhythmias, tachyarrhythmias, and atrial and ventricular arrhythmias. Data extraction was conducted independently by at least two investigators for each paper, and extracted data were compared and reconciled through consensus. For studies that did not report positive and negative cases, we calculated them manually by standard formulae using the statistics available in the manuscripts or provided by the authors. We contacted the authors if the data of interest were not reported in the manuscripts or abstracts; the order of contact began with the corresponding author, followed by the first author and then the last author. If we were unable to contact the authors as specified above, the associated studies were excluded from the meta-analysis (but still included in the systematic review). We also excluded manuscripts or abstracts without sufficient evaluation data after contacting the authors.
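The standard formulae themselves are not spelled out in the text; the sketch below illustrates one common reconstruction in Python, assuming a study reports sensitivity, specificity, and the numbers of participants with and without the outcome (the function and variable names are illustrative, not taken from the paper):

```python
def confusion_counts(sensitivity, specificity, n_with_outcome, n_without_outcome):
    """Recover confusion-matrix counts from reported summary statistics.

    Assumes sensitivity, specificity, and the counts of participants with and
    without the outcome are available; exact inputs vary between studies.
    """
    tp = round(sensitivity * n_with_outcome)      # sensitivity = TP / (TP + FN)
    fn = n_with_outcome - tp
    tn = round(specificity * n_without_outcome)   # specificity = TN / (TN + FP)
    fp = n_without_outcome - tn
    return tp, fp, tn, fn

# Example: sensitivity 0.86, specificity 0.70, 200 with and 300 without the outcome
print(confusion_counts(0.86, 0.70, 200, 300))  # -> (172, 90, 210, 28)
```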

Quality assessment

We created a proposed quality assessment for clinical ML research based on our previous recommendation (Table 1) 6. Two investigators (C.K. and H.V.) independently assessed the quality of each ML study using our proposed guideline for reporting ML in the medical literature (Supplementary Table S1). We resolved disagreements through discussion amongst the primary investigators or by involving additional investigators to adjudicate and establish a consensus. We scored study quality as low (0–2), moderate (2.5–5), or high (5.5–8).

Statistical analysis

We used symmetrical, hierarchical summary receiver operating characteristic (HSROC) models to jointly estimate sensitivity, specificity, and AUC 10. \({Sen}_{i}\) and \({Spc}_{i}\) denote the sensitivity and specificity of the i th study, \({\sigma }_{Sen}^{2}\) is the variance of \({\mu }_{Sen}\) (the mean logit sensitivity), and \({\sigma }_{Spc}^{2}\) is the variance of \({\mu }_{Spc}\) (the mean logit specificity).

The HSROC model for study i fits

\(\mathrm{logit}\left({\pi }_{ij}\right)=\left({\theta }_{i}+{\alpha }_{i}{X}_{ij}\right){e}^{-\beta {X}_{ij}}\),

where \({\pi }_{i1}={Sen}_{i}\) and \({\pi }_{i0}=1-{Spc}_{i}\), \({X}_{ij}=-\frac{1}{2}\) for participants without disease and \({X}_{ij}=\frac{1}{2}\) for those with disease, \({\theta }_{i}\) (the study cutpoint) and \({\alpha }_{i}\) (the study accuracy) follow normal distributions, and \(\beta\) is the shape parameter of the summary curve (\(\beta =0\) for a symmetrical model).

We conducted subgroup analyses stratified by ML algorithm, assessing subgroup-specific performance and statistical tests of interaction among subgroups. We performed all statistical analyses using OpenMetaAnalyst for 64-bit (Brown University), R version 3.2.3 (metafor and phia packages), and Stata version 16.1 (StataCorp, College Station, Texas). The meta-analysis has been reported in accordance with the Meta-analysis of Observational Studies in Epidemiology (MOOSE) guidelines 11.

Study search

The database searches between 1966 and March 15, 2019, yielded 15,025 results. 3,716 duplicates were removed by algorithms. After the screening process, we selected 344 articles for full-text review. After full text and supplementary review, we excluded 289 studies due to insufficient data to perform meta-analytic approaches despite contacting corresponding authors. Overall, 103 cohorts (55 studies) met our inclusion criteria. The disposition of studies excluded after the full-text review is shown in Fig.  1 .

figure 1

Study design. This flow chart illustrates the selection process for published reports.

Study characteristics

Table 2 shows the basic characteristics of the included studies. In total, our meta-analysis of ML and cardiovascular diseases included 103 cohorts (55 studies) with a total of 3,377,318 individuals: 12 cohorts assessed cardiac arrhythmias (3,144,799 individuals), 45 were CAD-related (117,200 individuals), 34 were stroke-related (5,577 individuals), and 12 were HF-related (109,742 individuals). We performed a post hoc sensitivity analysis excluding each study in turn and found no difference in the results.

ML algorithms and prediction of CAD

For CAD, 45 cohorts reported a total of 116,227 individuals: 10 cohorts used CNN algorithms, 7 used SVM, 13 used boosting algorithms, 9 used custom-built algorithms, and 2 used RF. CAD prediction was associated with a pooled AUC of 0.88 (95% CI 0.84–0.91), sensitivity of 0.86 (95% CI 0.77–0.92), and specificity of 0.70 (95% CI 0.51–0.84) for boosting algorithms, and a pooled AUC of 0.93 (95% CI 0.85–0.97), sensitivity of 0.87 (95% CI 0.74–0.94), and specificity of 0.86 (95% CI 0.73–0.93) for custom-built algorithms (Fig. 2).

figure 2

ROC curves comparing different machine learning models for CAD prediction. CAD prediction was associated with a pooled AUC of 0.87 (95% CI 0.76–0.93) for CNN algorithms, 0.88 (95% CI 0.84–0.91) for boosting algorithms, and 0.93 (95% CI 0.85–0.97) for others (custom-built algorithms).

ML algorithms and prediction of stroke

For stroke, 34 cohorts reported a total of 7,027 individuals: 14 cohorts used CNN algorithms, 4 used SVM, 5 used boosting algorithms, 2 used decision trees, 2 used custom-built algorithms, and 1 used random forest (RF). For the prediction of stroke, SVM algorithms had a pooled AUC of 0.92 (95% CI 0.81–0.97), sensitivity of 0.57 (95% CI 0.26–0.96), and specificity of 0.93 (95% CI 0.71–0.99); boosting algorithms had a pooled AUC of 0.91 (95% CI 0.81–0.96), sensitivity of 0.85 (95% CI 0.66–0.94), and specificity of 0.85 (95% CI 0.67–0.94); and CNN algorithms had a pooled AUC of 0.90 (95% CI 0.83–0.95), sensitivity of 0.80 (95% CI 0.70–0.87), and specificity of 0.91 (95% CI 0.77–0.97) (Fig. 3).

figure 3

ROC curves comparing different machine learning models for stroke prediction. The prediction in stroke was associated with pooled AUC of 0.90 (95% CI 0.83–0.95) for CNN, pooled AUC of 0.92 (95% CI 0.81–0.97) for SVM algorithms, and pooled AUC of 0.91 (95% CI 0.81–0.96) for boosting algorithms.

ML algorithms and prediction of HF

For HF, 12 cohorts reported a total of 51,612 individuals: 3 cohorts used CNN algorithms, 4 used logistic regression, 2 used boosting algorithms, 1 used SVM, 1 used an in-house algorithm, and 1 used RF. We could not perform pooled analyses because there were too few studies (≤ 5) for each model.

ML algorithms and prediction of cardiac arrhythmias

For cardiac arrhythmias, 12 cohorts reported a total of 3,204,837 individuals: 2 cohorts used CNN algorithms, 2 used logistic regression, 3 used SVM, 1 used a k-NN algorithm, and 4 used RF. We could not perform pooled analyses because there were too few studies (≤ 5) for each model.

To the best of our knowledge, this is the first and largest meta-analytic approach in ML research to date, drawing on an extensive number of studies that included over one million participants and reporting the predictive performance of ML algorithms in cardiovascular diseases. Risk assessment is crucial for reducing the worldwide burden of CVD. Traditional prediction models, such as the Framingham risk score 12, the PCE model 13, SCORE 14, and QRISK 15, have been derived from multiple predictive factors. These prediction models have been implemented in guidelines; specifically, the 2010 American College of Cardiology/American Heart Association (ACC/AHA) guideline 16 recommended the Framingham Risk Score, the United Kingdom National Institute for Health and Care Excellence (NICE) guidelines recommend the QRISK3 score 17, and the 2016 European Society of Cardiology (ESC) guidelines recommended the SCORE model 18. These traditional CVD risk scores have several limitations, including variation across validation cohorts, particularly in specific populations such as patients with rheumatoid arthritis 19,20. Under some circumstances, the Framingham score overestimates CVD risk, potentially leading to overtreatment 20. In general, these risk scores encompass a limited number of predictors and omit several important variables. Given the limitations of the most widely accepted risk models, more robust prediction tools are needed to predict CVD burden more accurately. Advances in computational power to process large amounts of data have accelerated interest in ML-based risk prediction, but clinicians typically have limited understanding of this methodology. Accordingly, we have taken a meta-analytic approach to clarify the insights that ML modeling can provide for CVD research.

Unfortunately, we do not know how or why the authors of the analyzed studies selected the chosen algorithms from the large array of options available. Researchers may have tried several candidate models for their databases (e.g., running models in parallel, hyperparameter tuning) while reporting only the best model, resulting in overfitting to their data. We therefore assume that the AUC of each study is based upon the best possible algorithm available to the associated researchers. Most importantly, pooled analyses indicate that, in general, ML algorithms are accurate (AUC in the 0.8–0.9 range) in overall cardiovascular disease prediction. In subgroup analyses of each ML algorithm, ML algorithms are accurate (AUC in the 0.8–0.9 range) for CAD and stroke prediction. To date, only one other meta-analysis of the ML literature has been reported, and its underlying concept was similar to ours: the investigators compared the diagnostic performance of various deep learning models and clinicians based on medical imaging (2 studies pertained to cardiology) 21. They concluded that deep learning algorithms were promising but identified several methodological barriers to matching clinician-level accuracy 21. Although our work suggests that boosting models and support vector machine (SVM) models are promising for predicting CAD and stroke risk, further studies comparing human experts and ML models are needed.

First, the results showed that custom-built algorithms tend to perform better than boosting algorithms for CAD prediction in terms of AUC. However, there is significant heterogeneity among custom-built algorithms, which do not disclose their details. Boosting algorithms have been increasingly utilized in modern biomedicine 22,23. For implementation in clinical practice, the essential stages of model design and interpretation need to be uniform 24, and custom-built algorithms must be transparent and replicated in multiple studies using the same set of independent variables.

Second, the results showed that boosting algorithms and SVM provide similar pooled AUCs for stroke prediction; both approaches use a margin-based formulation to address the clinical question. SVM seems to perform better than boosting algorithms in patients with stroke, perhaps owing to discrete, linear data or a proper non-linear kernel that fits the data better with improved generalization. SVM is an algorithm designed to maximize a particular mathematical function with respect to a given collection of data. Compared with other ML methods, SVM is more powerful at recognizing hidden patterns in complicated clinical datasets 2,25. Both boosting and SVM algorithms have been widely used in biomedicine, and prior studies showed mixed results 26,27,28,29,30: SVM seems to outperform boosting in image recognition tasks 28, while boosting seems to be superior in omics tasks 27. However, subgroup analysis by research question or by type of protocol or image showed no difference in algorithm predictions.

Third, for heart failure and cardiac arrhythmias, we could not perform meta-analytic approaches because of the small number of studies for each model. However, based on our observations in the systematic review, SVM seems to outperform other predictive algorithms in detecting cardiac arrhythmias, especially in one large study 31. Interestingly, in HF the results are inconclusive: one small study showed promising results for SVM 32, and CNN seems to outperform the other algorithms, but the results are suboptimal 33. Although we assumed all reported algorithms used optimal variables, technical heterogeneity exists among ML algorithms (e.g., number of folds for cross-validation, bootstrapping techniques, number of epochs, multiple parameter adjustments). In addition, the optimal cutoff for AUC remains unclear in clinical practice; whether high or low sensitivity/specificity is acceptable for a given test depends on clinical judgement and clinical correlation. In general, very high AUCs (0.95 or higher) are recommended, and an AUC of 0.50 indicates no ability to distinguish true from false. In some fields, such as applied psychology 34, with several influential variables, AUC values of 0.70 and higher would be considered strong effects. Moreover, standard practice among ML practitioners is to report certain measures (e.g., AUC, c-statistics) without the corresponding optimal sensitivity and specificity or model calibration, which makes interpretation in clinical practice challenging. For example, a difference in the BNP cutoff for HF patients could result in a difference in volume management between diuresis and IV fluids in pneumonia with septic shock.
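As an illustration of the point about pairing a reported AUC with an explicit operating point, the following is a minimal Python sketch on invented data; it computes an AUC and then selects one common operating point (the Youden index) to obtain a paired sensitivity and specificity. This is not the procedure used in the included studies, only an example of how such reporting can be done.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Invented predicted probabilities and true labels for illustration only
rng = np.random.default_rng(2)
y_true = rng.integers(0, 2, size=200)
y_score = np.clip(0.6 * y_true + rng.normal(0.3, 0.25, size=200), 0, 1)

auc = roc_auc_score(y_true, y_score)
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# One common way to choose an operating point: maximize Youden's J = sens + spec - 1
j = tpr - fpr
best = int(np.argmax(j))
print(f"AUC = {auc:.2f}")
print(f"threshold = {thresholds[best]:.2f}, "
      f"sensitivity = {tpr[best]:.2f}, specificity = {1 - fpr[best]:.2f}")
```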

Compared to conventional risk scores, most ML models shared a common set of independent demographic variables (e.g., age, sex, smoking status) and include laboratory values. Although those variables are not well-validated individually in clinical studies, they may add predictive value in certain circumstances. Head-to-head studies comparing ML algorithms and conventional risk models are needed. If these studies demonstrate an advantage of ML-based prediction, the optimal algorithms could be implemented through electronic health records (EHR) to facilitate application in clinical practice. The EHR implementation is well poised for ML based prediction since the data are readily accessible, mitigating dependency on a large number of variables, such as discrete laboratory values. While it may be difficult for physicians in resource-constrained practice settings to access the input data necessary for ML algorithms, it is readily implemented in more highly developed clinical environments.

To this end, the selection of an ML algorithm should be based on the research question and the structure of the dataset (how large the population is, how many cases exist, how balanced the dataset is, how many variables are available, whether the data are longitudinal, whether the clinical outcome is binary or time-to-event, etc.). For example, CNN is particularly powerful for image data, while SVM can handle the high dimensionality of a dataset if the kernel is chosen correctly; when the sample size is not large enough, deep learning methods will likely overfit the data. Most importantly, this study's intent is not to identify one algorithm that is superior to the others.

Limitations

Although the performance of ML-based algorithms seems satisfactory, it is far from optimal, and several methodological barriers can confound results and increase heterogeneity. First, technical parameters such as hyperparameter tuning are usually not disclosed publicly, leading to high statistical heterogeneity. Heterogeneity measures the difference in effect size between studies, and in the present study it is inevitable because several factors can contribute to it (e.g., model fine-tuning, hyperparameter selection, number of epochs); it is also not a good indicator here because our HSROC model largely controlled for heterogeneity. Second, data partitioning is arbitrary because there are no standard guidelines; most included studies used 80/20 or 70/30 splits for the training and validation sets. In addition, since the sample size for each type of CVD is small, the pooled results could be biased. Third, feature selection methodologies and techniques are arbitrary and heterogeneous. Fourth, owing to the ambiguity of custom-built algorithms, we could not classify their type. Fifth, studies report different evaluation metrics (e.g., some did not report positive or negative cases, sensitivity/specificity, F-score, etc.). We did not report a confusion matrix for this meta-analysis because it would require aggregating raw numbers from studies without adjusting for between-study differences, which could introduce bias; instead, we presented pooled sensitivity and specificity using the HSROC model. Although ML algorithms are robust, several studies did not report complete evaluation metrics, such as positive or negative cases, balanced accuracy, or analyses in the validation cohort, since there are many ways to interpret the data depending on the clinical context. Most importantly, some analyses did not correlate with the clinical context, which made them more difficult to interpret. The value of meta-analysis lies in increasing the power of the study by pooling studies that use the same algorithms; in addition, clinical data are heterogeneous and usually imbalanced, and most ML research did not report balanced accuracy, which could mislead readers. Sixth, we did not register the analysis in PROSPERO. Finally, some studies reported only technical aspects without clinical aspects, likely owing to a lack of clinician supervision.

Although there are several limitations to overcome before ML algorithms can be implemented in clinical practice, overall they showed promising results. SVM and boosting algorithms are widely used in cardiovascular medicine with good results. However, selecting the proper algorithm for the appropriate research question, comparison with human experts, validation cohorts, and reporting of all relevant evaluation metrics are needed to interpret studies in the correct clinical context. Most importantly, prospective studies comparing ML algorithms with conventional risk models are needed. Once validated in that way, ML algorithms could be integrated with electronic health record systems and applied in clinical practice, particularly in high-resource areas.

References

Noble, W. S. Support vector machine applications in computational biology. Kernel Methods Comput. Biol. 71, 92 (2004).


Aruna, S. & Rajagopalan, S. A novel SVM based CSSFFS feature selection algorithm for detecting breast cancer. Int. J. Comput. Appl. 31 , 20 (2011).

Lakhani, P. & Sundaram, B. Deep learning at chest radiography: Automated classification of pulmonary tuberculosis by using convolutional neural networks. Radiology 284 , 574–582 (2017).


Yasaka, K. & Akai, H. Deep learning with convolutional neural network for differentiation of liver masses at dynamic contrast-enhanced CT: A preliminary study. Radiology 286 , 887–896 (2018).

Christ, P. F. et al. Automatic Liver and Lesion Segmentation in CT Using Cascaded Fully Convolutional Neural Networks and 3D Conditional Random Fields. International Conference on Medical Image Computing and Computer-Assisted Intervention 415–423 (Springer, Berlin, 2016).

Krittanawong, C. et al. Deep learning for cardiovascular medicine: A practical primer. Eur. Heart J. 40 , 2058–2073 (2019).


Krittanawong, C., Zhang, H., Wang, Z., Aydar, M. & Kitai, T. Artificial intelligence in precision cardiovascular medicine. J. Am. Coll. Cardiol. 69 , 2657–2664 (2017).

Krittanawong, C. et al. Future direction for using artificial intelligence to predict and manage hypertension. Curr. Hypertens. Rep. 20 , 75 (2018).

Covidence systematic review software. Veritas Health Innovation, Melbourne, Australia. Available at www.covidence.org.

Rutter, C. M. & Gatsonis, C. A. A hierarchical regression approach to meta-analysis of diagnostic test accuracy evaluations. Stat. Med. 20 , 2865–2884 (2001).


Stroup, D. F. et al. Meta-analysis of observational studies in epidemiology: A proposal for reporting. Meta-analysis Of Observational Studies in Epidemiology (MOOSE) group. JAMA 283 , 2008–2012 (2000).

Wilson, P. W. et al. Prediction of coronary heart disease using risk factor categories. Circulation 97 , 1837–1847 (1998).

Goff, D. C. Jr. et al. 2013 ACC/AHA guideline on the assessment of cardiovascular risk: A report of the American College of Cardiology/American Heart Association Task Force on Practice Guidelines. J. Am. Coll. Cardiol. 63 , 2935–2959 (2014).

Conroy, R. M. et al. Estimation of ten-year risk of fatal cardiovascular disease in Europe: The SCORE project. Eur. Heart J. 24 , 987–1003 (2003).

Hippisley-Cox, J. et al. Predicting cardiovascular risk in England and Wales: Prospective derivation and validation of QRISK2. BMJ (Clinical research ed) 336 , 1475–1482 (2008).

Greenland, P. et al. 2010 ACCF/AHA guideline for assessment of cardiovascular risk in asymptomatic adults: A report of the American College of Cardiology Foundation/American Heart Association Task Force on Practice Guidelines. Circulation 122 , e584-636 (2010).

Hippisley-Cox, J., Coupland, C. & Brindle, P. Development and validation of QRISK3 risk prediction algorithms to estimate future risk of cardiovascular disease: Prospective cohort study. BMJ (Clinical research ed) 357 , j2099 (2017).

Piepoli, M. F. et al. 2016 European Guidelines on cardiovascular disease prevention in clinical practice: The Sixth Joint Task Force of the European Society of Cardiology and Other Societies on Cardiovascular Disease Prevention in Clinical Practice (constituted by representatives of 10 societies and by invited experts) Developed with the special contribution of the European Association for Cardiovascular Prevention & Rehabilitation (EACPR). Eur. Heart J. 37 , 2315–2381 (2016).

Kremers, H. M., Crowson, C. S., Therneau, T. M., Roger, V. L. & Gabriel, S. E. High ten-year risk of cardiovascular disease in newly diagnosed rheumatoid arthritis patients: A population-based cohort study. Arthritis Rheum. 58 , 2268–2274 (2008).

Damen, J. A. et al. Performance of the Framingham risk models and pooled cohort equations for predicting 10-year risk of cardiovascular disease: A systematic review and meta-analysis. BMC Med. 17 , 109 (2019).

Liu, X. et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: A systematic review and meta-analysis. Lancet Digit. Health 1 , e271–e297 (2019).

Mayr, A., Binder, H., Gefeller, O. & Schmid, M. The evolution of boosting algorithms. From machine learning to statistical modelling. Methods Inf. Med. 53 , 419–427 (2014).

Buhlmann, P. et al. Discussion of “the evolution of boosting algorithms” and “extending statistical boosting”. Methods Inf. Med. 53 , 436–445 (2014).

Natekin, A. & Knoll, A. Gradient boosting machines, a tutorial. Front. Neurorobot. 7 , 21–21 (2013).

Noble, W. S. What is a support vector machine?. Nat. Biotechnol. 24 , 1565–1567 (2006).

Zhang H, & Gu C. Support vector machines versus Boosting.

Ogutu, J. O., Piepho, H. P. & Schulz-Streeck, T. A comparison of random forests, boosting and support vector machines for genomic selection. BMC Proc. 5 (Suppl 3), S11 (2011).

Sun, T. et al. Comparative evaluation of support vector machines for computer aided diagnosis of lung cancer in CT based on a multi-dimensional data set. Comput. Methods Programs Biomed. 111 , 519–524 (2013).

Huang, M.-W., Chen, C.-W., Lin, W.-C., Ke, S.-W. & Tsai, C.-F. SVM and SVM ensembles in breast cancer prediction. PLoS One 12 , e0161501–e0161501 (2017).

Caruana, R., Karampatziakis, N. & Yessenalina, A. An empirical evaluation of supervised learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning 96–103 (ACM, 2008).

Hill, N. R. et al. Machine learning to detect and diagnose atrial fibrillation and atrial flutter (AF/F) using routine clinical data. Value Health 21 , S213 (2018).

Rossing, K. et al. Urinary proteomics pilot study for biomarker discovery and diagnosis in heart failure with reduced ejection fraction. PLoS One 11 , e0157167 (2016).

Golas, S. B. et al. A machine learning model to predict the risk of 30-day readmissions in patients with heart failure: A retrospective analysis of electronic medical records data. BMC Med. Inform. Decis. Mak. 18 , 44 (2018).

Rice, M. E. & Harris, G. T. Comparing effect sizes in follow-up studies: ROC Area, Cohen’s d, and r. Law Hum Behav. 29 , 615–620 (2005).


There was no funding for this work.

Author information

Authors and affiliations

Section of Cardiology, Baylor College of Medicine, Houston, TX, USA

Chayakrit Krittanawong

Harrington Heart & Vascular Institute, Case Western Reserve University, University Hospitals Cleveland Medical Center, Cleveland, OH, USA

Hafeez Ul Hassan Virk

Department of Cardiovascular Diseases, New York University School of Medicine, New York, NY, USA

Sripal Bangalore

Robert D. and Patricia E. Kern Center for the Science of Health Care Delivery, Rochester, MN, USA

Division of Health Care Policy and Research, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA

Department of Genetics and Genomic Sciences, Institute for Next Generation Healthcare, Icahn School of Medicine at Mount Sinai, New York, NY, USA

Kipp W. Johnson

Levy Library, Icahn School of Medicine at Mount Sinai, New York, NY, USA

Rachel Pinotti

Division of Cardiovascular Diseases, Mayo Clinic, Rochester, MN, USA

HongJu Zhang

Department of Cardiovascular Diseases, Icahn School of Medicine at Mount Sinai, Mount Sinai Hospital, Mount Sinai Heart, New York, NY, USA

Chayakrit Krittanawong, Scott Kaplin, Bharat Narasimhan, Usman Baber & Jonathan L. Halperin

Department of Cardiovascular Medicine, Heart and Vascular Institute, Cleveland Clinic, Cleveland, OH, USA

Takeshi Kitai & W. H. Wilson Tang


Contributions

C.K., H.H., S.B., Z.W., K.W.J., R.P., H.Z., S.K., B.N., T.K., U.B., J.L.H., W.T. had full access to all of the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis. Study concept and design: C.K., H.H., K.W.J., Z.W. Acquisition of data: C.K., H.H., R.P., H.J., T.K. Analysis and interpretation of data: B.N., Z.W. Drafting of the manuscript: C.K., H.H., S.B., U.B., J.L.H., T.W. Critical revision of the manuscript for important intellectual content: T.W., Z.W. Study supervision: C.K., T.W.

Corresponding author

Correspondence to Chayakrit Krittanawong .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary file 1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


Cite this article

Krittanawong, C., Virk, H.U.H., Bangalore, S. et al. Machine learning prediction in cardiovascular diseases: a meta-analysis. Sci Rep 10 , 16057 (2020). https://doi.org/10.1038/s41598-020-72685-1


Received: 27 April 2020

Accepted: 24 August 2020

Published: 29 September 2020

DOI: https://doi.org/10.1038/s41598-020-72685-1

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

This article is cited by

Using machine learning-based algorithms to construct cardiovascular risk prediction models for taiwanese adults based on traditional and novel risk factors.

  • Chien-Hsiang Cheng
  • Bor-Jen Lee
  • Yung-Po Liaw

BMC Medical Informatics and Decision Making (2024)

Predicting severity of acute appendicitis with machine learning methods: a simple and promising approach for clinicians

  • Hilmi Yazici
  • Onur Ugurlu
  • Mehmet Yildirim

BMC Emergency Medicine (2024)

Causal machine learning for predicting treatment outcomes

  • Stefan Feuerriegel
  • Dennis Frauen
  • Mihaela van der Schaar

Nature Medicine (2024)

Detection and classification of diabetic retinopathy based on ensemble learning

  • Ankur Biswas

Advances in Computational Intelligence (2024)

Meta-learning in Healthcare: A Survey

  • Alireza Rafiei
  • Ronald Moore
  • Rishikesan Kamaleswaran

SN Computer Science (2024)

By submitting a comment you agree to abide by our Terms and Community Guidelines . If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.

Quick links

  • Explore articles by subject
  • Guide to authors
  • Editorial policies

Sign up for the Nature Briefing newsletter — what matters in science, free to your inbox daily.

research paper using machine learning

  • Open access
  • Published: 29 August 2023

Healthcare predictive analytics using machine learning and deep learning techniques: a survey

  • Mohammed Badawy (ORCID: orcid.org/0000-0001-9494-1386),
  • Nagy Ramadan &
  • Hesham Ahmed Hefny

Journal of Electrical Systems and Information Technology volume 10, Article number: 40 (2023)


Healthcare prediction has been a significant factor in saving lives in recent years. In the healthcare domain, intelligent systems are being rapidly developed to analyze complicated data relationships and transform them into real information for use in the prediction process. Consequently, artificial intelligence is rapidly transforming the healthcare industry, and with it comes the role of machine learning and deep learning systems in diagnosing and predicting diseases, whether from clinical data or from images; such systems provide tremendous clinical support by simulating human perception and can even diagnose diseases that are difficult for human intelligence to detect. Predictive analytics is a critical imperative in the healthcare industry. It can significantly affect the accuracy of disease prediction, which may save patients' lives in the case of an accurate and timely prediction but may endanger them in the case of an incorrect prediction. Diseases must therefore be accurately predicted and estimated, and reliable, efficient methods for healthcare predictive analysis are essential. This paper therefore aims to present a comprehensive survey of existing machine learning and deep learning approaches utilized in healthcare prediction and to identify the inherent obstacles to applying these approaches in the healthcare domain.

Introduction

Each day, human existence evolves, yet the health of each generation either improves or deteriorates, and there are always uncertainties in life. We occasionally encounter individuals with fatal health problems caused by the late detection of disease. Chronic liver disease alone affects more than 50 million adults worldwide, yet if the sickness is diagnosed early it can be stopped. Disease prediction based on machine learning can be utilized to identify common diseases at an earlier stage. Currently, health is often treated as a secondary concern, which has led to numerous problems: many patients cannot afford to see a doctor, and others are extremely busy and on a tight schedule, yet ignoring recurring symptoms for an extended length of time can have significant health repercussions [ 1 ].

Diseases are a global issue; thus, medical specialists and researchers are exerting their utmost efforts to reduce disease-related mortality. In recent years, predictive analytic models have played a pivotal role in the medical profession because of the increasing volume of healthcare data from a wide range of disparate and incompatible data sources. Nonetheless, processing, storing, and analyzing the massive amount of historical data, together with the constant inflow of streaming data created by healthcare services, has become an unprecedented challenge for traditional database storage [ 2 , 3 , 4 ]. A medical diagnosis is a form of problem-solving and a crucial and significant issue in the real world. Illness diagnosis is the process of translating observational evidence into disease names; the evidence comprises data received from evaluating a patient and substances derived from the patient, while illnesses are the conceptual medical entities that account for anomalies in the observed evidence [ 5 ].

Healthcare is the collective effort of society to ensure, provide, finance, and promote health. In the twentieth century, there was a significant shift toward the ideal of wellness and the prevention of sickness and incapacity. The delivery of healthcare services entails organized public or private efforts to aid persons in regaining health and preventing disease and impairment [ 6 ]. Health care can be described as standardized rules that help evaluate actions or situations that affect decision-making [ 7 ]. Healthcare is a multi-dimensional system. The basic goal of health care is to diagnose and treat illnesses or disabilities. A healthcare system’s key components are health experts (physicians or nurses), health facilities (clinics and hospitals that provide medications and other diagnostic services), and a funding institution to support the first two [ 8 ].

With the introduction of systems based on computers, the digitalization of all medical records and the evaluation of clinical data in healthcare systems have become widespread routine practices. The phrase "electronic health records" was chosen by the Institute of Medicine, a division of the National Academies of Sciences, Engineering, and Medicine, in 2003 to define the records that continued to enhance the healthcare sector for the benefit of both patients and physicians. Electronic Health Records (EHR) are "computerized medical records for patients that include all information in an individual's past, present, or future that occurs in an electronic system used to capture, store, retrieve, and link data primarily to offer healthcare and health-related services," according to Murphy, Hanken, and Waters [ 8 ].

Daily, healthcare services produce an enormous amount of data, making it increasingly complicated to analyze and handle it in "conventional ways." Using machine learning and deep learning, this data may be properly analyzed to generate actionable insights. In addition, genomics, medical data, social media data, environmental data, and other data sources can be used to supplement healthcare data. Figure  1 provides a visual picture of these data sources. The four key healthcare applications that can benefit from machine learning are prognosis, diagnosis, therapy, and clinical workflow, as outlined in the following section [ 9 ].

figure 1

Illustration of heterogeneous sources contributing to healthcare data [ 9 ]

The long-term investment in developing novel technologies based on machine learning as well as deep learning techniques to improve the health of individuals via the prediction of future events reflects the increased interest in predictive analytics techniques to enhance healthcare. Clinical predictive models, as they have been formerly referred to, assisted in the diagnosis of people with an increased probability of disease. These prediction algorithms are utilized to make clinical treatment decisions and counsel patients based on some patient characteristics [ 10 ].

The concept of medical care is used to stress the organization and administration of curative care, which is a subset of health care. The ecology of medical care was first introduced by White in 1961. White also proposed a framework for perceiving patterns of health concerning symptoms experienced by populations of interest, along with individuals’ choices in getting medical treatment. In this framework, it is possible to calculate the proportion of the population that used medical services over a specific period of time. The "ecology of medical care" theory has become widely accepted in academic circles over the past few decades [ 6 ].

Medical personnel usually face new problems, changing tasks, and frequent interruptions because of the system's dynamism and scalability. This variability often makes disease recognition a secondary concern for medical experts. Moreover, the clinical interpretation of medical data is a challenging task from an epistemological point of view. This not only applies to professionals with extensive experience but also to representatives, such as young physician assistants, with varied or little experience [ 11 ]. The limited time available to medical personnel, the speedy progression of diseases, and the fluctuating patient dynamics make diagnosis a particularly complex process. However, a precise method of diagnosis is critical to ensuring speedy treatment and, thus, patient safety [ 12 ].

Predictive analytics for health care is a critical industry requirement. It can have a significant impact on the accuracy of disease prediction, which can save patients' lives in the case of an accurate and timely prediction but can also endanger them in the case of an incorrect prediction. Diseases must therefore be accurately predicted and estimated. As a result, dependable and efficient methods for healthcare predictive analysis are required.

The purpose of this paper is to present a comprehensive review of common machine learning and deep learning techniques that are utilized in healthcare prediction, in addition to identifying the inherent obstacles that are associated with applying these approaches in the healthcare domain.

The rest of the paper is organized as follows: Section "Background" gives a theoretical background on artificial intelligence, machine learning, and deep learning techniques. Section "Disease prediction with analytics" outlines the survey methodology and presents a literature review of machine learning as well as deep learning approaches employed in healthcare prediction. Section "Results and Discussion" discusses the results of previous works related to healthcare prediction. Section "Challenges" covers the existing challenges related to the topic of this survey. Finally, Section "Conclusion" concludes the paper.

The extensive research and development of cutting-edge tools based on machine learning and deep learning for predicting individual health outcomes demonstrate the increased interest in predictive analytics techniques to improve health care. Clinical predictive models assisted physicians in better identifying and treating patients who were at a higher risk of developing a serious illness. Based on a variety of factors unique to each individual patient, these prediction algorithms are used to advise patients and guide clinical practice.

Artificial intelligence (AI) is the ability of a system to interpret data, and it makes use of computers and machines to improve humans' capacity for decision-making, problem-solving, and technological innovation [ 13 ]. Figure  2 depicts machine learning and deep learning as subsets of AI.

figure 2

AI, ML, and DL

Machine learning

Machine learning (ML) is a subfield of AI that aims to develop predictive algorithms based on the idea that machines should have the capability to access data and learn on their own [ 14 ]. ML utilizes algorithms, methods, and processes to detect basic correlations within data and create descriptive and predictive tools that process those correlations. ML is usually associated with data mining, pattern recognition, and deep learning. Although there are no clear boundaries between these areas and they often overlap, it is generally accepted that deep learning is a relatively new subfield of ML that uses extensive computational algorithms and large amounts of data to define complex relationships within data. As shown in Fig.  3 , ML algorithms can be divided into three categories: supervised learning, unsupervised learning, and reinforcement learning [ 15 ].

figure 3

Different types of machine learning algorithms

Supervised learning

Supervised learning is an ML model for investigating the input–output correlation information of a system depending on a given set of training examples that are paired between the inputs and the outputs [ 16 ]. The model is trained with a labeled dataset. It matches how a student learns fundamental math from a teacher. This kind of learning requires labeled data with predicted correct answers based on algorithm output [ 17 ]. The most widely used supervised learning-based techniques include linear regression, logistic regression, decision trees, random forests, support vector machines, K-nearest neighbor, and naive Bayes.

A. Linear regression

Linear regression is a statistical method commonly used in predictive investigations. It forecasts the dependent (output) variable Y from the independent (input) variable X. Assuming continuous, real, numeric parameters, the connection between X and Y is represented as shown in Eq. 1:

Y = mX + c  (1)

where m indicates the slope and c indicates the intercept. According to Eq. 1, the association between the independent parameter (X) and the dependent parameter (Y) can be inferred [ 18 ].

The advantages of linear regression are that it is straightforward to learn and that overfitting is easy to eliminate through regularization. A drawback of linear regression is that it is not convenient for nonlinear relationships; because it greatly simplifies real-world problems, it is not recommended for most practical applications [ 19 ]. The implementation tools utilized for linear regression are Python, R, MATLAB, and Excel.

As shown in Fig. 4, observations are highlighted in red, and random deviations (shown in green) displace them from the basic relationship (shown in yellow) between the independent variable (x) and the dependent variable (y) [ 20 ].

figure 4

Linear regression model
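As a concrete illustration of Eq. 1, the following is a minimal Python/scikit-learn sketch using synthetic data invented for this example; it fits a line to noisy observations and recovers estimates of the slope m and intercept c.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y depends roughly linearly on x, with random deviations
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))              # independent variable (x)
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1, 50)    # dependent variable (y) plus noise

model = LinearRegression().fit(X, y)
print("estimated slope m:", model.coef_[0])       # estimate of m in Y = mX + c
print("estimated intercept c:", model.intercept_) # estimate of c
print("prediction at x = 4:", model.predict([[4.0]])[0])
```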

B. Logistic regression

Logistic regression, also known as the logistic model, investigates the correlation between several independent variables and a categorical dependent variable and calculates the probability of an event by fitting the data to a logistic curve [ 21 ]. The dependent variable must be binary, i.e., have only two outcomes: true or false, 0 or 1, yes or no. Logistic regression is used when categorical variables need to be predicted and classification problems need to be solved. It can be implemented using various tools such as R, Python, Java, and MATLAB [ 18 ]. Logistic regression has many benefits; for example, it captures the relationship between the dependent and independent variables well and is simple to understand. On the other hand, it can only predict a discrete output, is not suited to nonlinear data, and is sensitive to outliers [ 22 ].
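A minimal scikit-learn sketch of the same idea, using invented binary-outcome data: predict_proba returns the event probability given by the fitted logistic curve, and predict applies the default 0.5 threshold to produce the categorical label.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented binary-outcome data (labels 0/1), e.g. "disease" vs. "no disease"
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                     # three independent variables
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, 100) > 0).astype(int)

clf = LogisticRegression().fit(X, y)
probs = clf.predict_proba(X[:5])[:, 1]   # P(y = 1) from the fitted logistic curve
labels = clf.predict(X[:5])              # class labels, thresholded at 0.5 by default
print(probs, labels)
```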

C. Decision tree

The decision tree (DT) is the supervised learning technique used for classification. It combines the values of attributes based on their order, either ascending or descending [ 23 ]. As a tree-based strategy, DT defines each path starting from the root using a data-separating sequence until a Boolean conclusion is attained at the leaf node [ 24 , 25 ]. DT is a hierarchical representation of knowledge interactions that contains nodes and links. When relations are employed to classify, nodes reflect purposes [ 26 , 27 ]. An example of DT is presented in Fig.  5 .

figure 5

Example of a DT

DTs have various drawbacks, such as increased complexity with increasing nomenclature, small modifications that may lead to a different architecture, and more processing time to train data [ 18 ]. The implementation tools used in DT are Python (Scikit-Learn), RStudio, Orange, KNIME, and Weka [ 22 ].
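The following is a minimal scikit-learn sketch of a DT classifier on a small public dataset chosen only for illustration; export_text prints the root-to-leaf splitting sequence described above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Small public dataset used purely for illustration
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each internal node splits on one attribute; the leaves hold the class decisions
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print(export_text(tree))                          # root-to-leaf decision paths
print("test accuracy:", tree.score(X_test, y_test))
```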

D. Random forest

Random forest (RF) is a basic technique that produces correct results most of the time. It may be utilized for classification and regression. The program produces an ensemble of DTs and blends them [ 28 ].

In the RF classifier, the higher the number of trees in the forest, the more accurate the results. So, the RF has generated a collection of DTs called the forest and combined them to achieve more accurate prediction results. In RF, each DT is built only on a part of the given dataset and trained on approximations. The RF brings together several DTs to reach the optimal decision [ 18 ].

As indicated in Fig.  6 , RF randomly selects a subset of features from the data, and from each subset it generates n random trees [ 20 ]. RF will combine the results from all DTs and provide them in the final output.

figure 6

Random forest architecture

Two parameters are used for tuning RF models: mtry, the number of randomly selected features considered at each split, and ntree, the number of trees in the model. The mtry parameter involves a trade-off: large values raise the correlation between trees but enhance per-tree accuracy [ 29 ].

RF works with a labeled dataset to make predictions and build a model. The final model is utilized to classify unlabeled data. The model integrates the concept of bagging with a random selection of features to build variance-controlled DTs [ 30 ].

RF offers significant benefits. First, it can be utilized for determining the relevance of the variables in a regression or classification task [ 31 , 32 ]. This relevance is measured on a scale based on the impurity drop at each node used for data segmentation [ 33 ]. Second, it handles missing values in the data automatically and resolves the overfitting problem of DT. Finally, RF can efficiently handle huge datasets. On the other hand, RF suffers from drawbacks; for example, it needs more computation and resources to generate the output results, and it requires more training effort due to the multiple DTs involved. The implementation tools used in RF are Python Scikit-Learn and R [ 18 ].
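The following minimal scikit-learn sketch mirrors the tuning parameters discussed above, with n_estimators and max_features playing the roles of ntree and mtry; the dataset and parameter values are illustrative assumptions.

```python
# Minimal random forest sketch with ntree/mtry-style tuning (illustrative only).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,      # "ntree": number of trees in the forest
    max_features="sqrt",   # "mtry": features considered at each split
    random_state=0,
).fit(X_tr, y_tr)

print("test accuracy:", rf.score(X_te, y_te))
# Impurity-based variable relevance, as described above.
print("largest feature importance:", rf.feature_importances_.max())
```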

E. Support vector machine

The support vector machine (SVM) is a supervised ML technique for classification problems and regression models. SVM is a linear model that can also offer solutions to nonlinear problems, as shown in Fig. 7. Its foundation is the idea of margin calculation: the dataset is divided into several classes, and the margins between them are computed to build relations between them [ 18 ].

figure 7

Support vector machine

SVM is a statistics-based learning method that follows the principle of structural risk minimization and aims to locate decision bounds, also known as hyperplanes, that can optimally separate classes by finding a hyperplane in a usable N-dimensional space that explicitly classifies data points [ 34 , 35 , 36 ]. SVM indicates the decision boundary between two classes by defining the value of each data point, in particular the support vector points placed on the boundary between the respective classes [ 37 ].

SVM has several advantages; for example, it works well with both semi-structured and unstructured data, and the kernel trick is a strong point of SVM. Moreover, it can handle complex problems with the right kernel function and can also handle high-dimensional data. Furthermore, SVM generalizes well, with a lower risk of overfitting. On the other hand, SVM has several downsides: training time increases on large datasets, choosing the right kernel function is a difficult process, and it does not work well with noisy data. Implementation tools used in SVM include SVMlight with C, LibSVM with Python, MATLAB or Ruby, SAS, Kernlab, Scikit-Learn, and Weka [ 22 ].
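A minimal scikit-learn sketch of an SVM classifier with an RBF kernel (the "kernel trick" mentioned above) follows; the dataset, the scaling step, and the hyperparameters are illustrative assumptions.

```python
# Minimal SVM classification sketch with an RBF kernel (illustrative only).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Scaling the features first is standard practice for margin-based models.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X_tr, y_tr)
print("test accuracy:", svm.score(X_te, y_te))
```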

F. K-nearest neighbor

K-nearest neighbor (KNN) is an "instance-based" or non-generalizing learning algorithm, often known as a "lazy learning" algorithm [ 38 ]. KNN is used for solving classification problems. To anticipate the target label of new test data, KNN determines the distance between the new test point and the nearest training data points for a given value of K, as shown in Fig. 8. It then counts the nearest data points using the K value and determines the label of the new test data. To set the number of nearest training data points, KNN usually chooses K as k = √n, where n is the size of the dataset [ 22 ].

figure 8

K-nearest neighbor

KNN has many benefits; for example, it is sufficiently powerful if the training data are large, it is simple and flexible with respect to attributes and distance functions, and it can handle multi-class datasets. KNN also has drawbacks: choosing an appropriate K value is difficult, selecting the type of distance function for a particular dataset is tedious, and the computation cost is fairly high because the distance to all training data points must be calculated. The implementation tools used in KNN are Python (Scikit-Learn), WEKA, R, KNIME, and Orange [ 22 ].
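The following minimal scikit-learn sketch applies the k = √n heuristic mentioned above; the iris dataset is used purely for illustration.

```python
# Minimal KNN sketch using the k = sqrt(n) heuristic (illustrative only).
import math
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

k = max(1, int(math.sqrt(len(X_tr))))   # k = sqrt(n), where n is the training size
knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
print("chosen K:", k)
print("test accuracy:", knn.score(X_te, y_te))
```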

G. Naive Bayes

Naive Bayes (NB) is based on the probabilistic model of Bayes' theorem and is simple to build, as it requires no complex iterative parameter estimation, making it suitable for huge datasets [ 39 ]. NB determines the degree of class membership based on a given class designation [ 40 ]. It scans the data once, and thus classification is easy [ 41 ]. Simply put, the NB classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. It is mainly used for text classification [ 42 ].

NB has notable benefits: it is easy to implement, it can provide good results even with little training data, it can manage both continuous and discrete data, it is well suited to multi-class prediction problems, and irrelevant features do not strongly affect its predictions. On the other hand, NB has the following drawbacks: it assumes that all features are independent, which is not always viable in real-world problems; it suffers from the zero-frequency problem; and its predictions are not always accurate. Implementation tools are WEKA, Python, RStudio, and Mahout [ 22 ].
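A minimal scikit-learn sketch of a Gaussian naive Bayes classifier follows; the built-in breast cancer dataset and the use of the Gaussian variant (rather than the multinomial variant typically used for text) are illustrative assumptions.

```python
# Minimal Gaussian naive Bayes sketch (illustrative only).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_tr, y_tr)                       # per-class Gaussian features
print("test accuracy:", nb.score(X_te, y_te))
print("class posteriors for first test sample:", nb.predict_proba(X_te[:1])[0])
```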

To summarize the previously discussed models, Table 1 demonstrates the advantages and disadvantages of each model.

Unsupervised learning

Unlike supervised learning, there are no correct answers and no teachers in unsupervised learning [ 42 ]. It follows the concept that a machine can learn to understand complex processes and patterns on its own without external guidance. This approach is particularly useful in cases where experts have no knowledge of what to look for in the data and the data itself do not include the objectives. The machine predicts the outcome based on past experiences and learns to predict the real-valued outcome from the information previously provided, as shown in Fig.  9 .

figure 9

Workflow of unsupervised learning [ 23 ]

Unsupervised learning is widely used in the processing of multimedia content, as clustering and partitioning of data in the lack of class labels is often a requirement [ 43 ]. Some of the most popular unsupervised learning-based approaches are k-means, principal component analysis (PCA), and apriori algorithm.

A. K-means clustering

The k-means algorithm is a common partitioning method [ 44 ] and one of the most popular unsupervised learning algorithms that deal with the well-known clustering problem. The procedure classifies a particular dataset into a certain number of preselected (assume k) clusters [ 45 ]. The pseudocode of the k-means algorithm is shown in Pseudocode 1.

Pseudocode 1. K-means algorithm
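Since the pseudocode itself appears only as a figure, the following minimal NumPy sketch reproduces the same procedure under illustrative assumptions (Euclidean distance, random initialization, synthetic two-cluster data); it is not taken from any of the surveyed studies.

```python
# Minimal k-means sketch: assign points to nearest centroid, update centroids.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialize centroids with k randomly chosen data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

data_rng = np.random.default_rng(1)
X = np.vstack([data_rng.normal(0, 1, (50, 2)), data_rng.normal(5, 1, (50, 2))])
labels, centroids = kmeans(X, k=2)
print("cluster centers:\n", centroids)
```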

K-means has several benefits: it is more computationally efficient than hierarchical clustering when the number of variables is large, it provides more compact clusters than hierarchical clustering when a small k is used, and it is easy to implement and its clustering results are easy to interpret. However, k-means also has disadvantages, such as the difficulty of predicting the value of K. In addition, different starting centroids lead to different final clusters, so performance is affected by initialization, and the algorithm converges only to a local optimum. Because there is no single solution for a given K value, k-means is usually run multiple times (20–100 times) and the result with the minimum objective J is selected [ 19 ].

B. Principal component analysis

In modern data analysis, principal component analysis (PCA) is an essential tool as it provides a guide for extracting the most important information from a dataset, compressing the data size by keeping only those important features without losing much information, and simplifying the description of a dataset [ 46 , 47 ].

PCA is frequently used to reduce data dimensions before applying classification models. Moreover, unsupervised methods, such as dimensionality reduction or clustering algorithms, are commonly used for data visualizations, detection of common trends or behaviors, and decreasing the data quantity to name a few only [ 48 ].

PCA converts the 2D data into 1D data. This is done by changing the set of variables into new variables known as principal components (PC) which are orthogonal [ 23 ]. In PCA, data dimensions are reduced to make calculations faster and easier. To illustrate how PCA works, let us consider an example of 2D data. When these data are plotted on a graph, it will take two axes. Applying PCA, the data turn into 1D. This process is illustrated in Fig.  10 [ 49 ].

figure 10

Visualization of data before and after applying PCA [ 49 ]
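As a concrete illustration of the 2D-to-1D reduction described above, the following scikit-learn sketch projects synthetic correlated 2D data onto its first principal component; the data are illustrative assumptions.

```python
# Minimal PCA sketch: reduce correlated 2D data to 1D (illustrative only).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1 = rng.normal(0, 1, 200)
X = np.column_stack([x1, 2.0 * x1 + rng.normal(0, 0.1, 200)])  # correlated 2D data

pca = PCA(n_components=1)
X_1d = pca.fit_transform(X)            # project onto the first principal component
print("explained variance ratio:", pca.explained_variance_ratio_[0])
print("original shape:", X.shape, "-> reduced shape:", X_1d.shape)
```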

C. Apriori algorithm

The apriori algorithm is considered an important algorithm, which was first introduced by R. Agrawal and R. Srikant, and published in [ 50 , 51 ].

The principle of the apriori algorithm is its candidate generation strategy: it creates candidate (k + 1)-itemsets from frequent k-itemsets. Apriori uses an iterative strategy called level-wise search, where frequent k-itemsets are employed to explore (k + 1)-itemsets. First, the set of frequent 1-itemsets is produced by scanning the dataset to count each item and then collecting the items that meet the minimum support; the resulting set is called L1. Then L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on until no more frequent k-itemsets are found. Finding each Lk requires a full scan of the dataset. To improve the efficiency of the level-wise generation of frequent itemsets, a key property called the apriori property is used to reduce the search space: all non-empty subsets of a frequent itemset must also be frequent. A two-step technique is used to identify frequent itemsets: join and prune operations [ 52 ].

Although it is simple, the apriori algorithm suffers from several drawbacks. The main limitation is the time wasted in handling a large number of candidate sets containing many redundant itemsets. It also performs poorly under low minimum support or with large itemsets, and multiple passes over the data are needed for mining, which often yields irrelevant items; in addition, it has difficulty discovering individual elements of events [ 53 , 54 ].
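The following minimal Python sketch illustrates the level-wise search with join and prune steps described above on a toy transaction list; the transactions and the minimum-support threshold are illustrative assumptions.

```python
# Minimal apriori sketch: level-wise frequent-itemset mining (illustrative only).
from itertools import combinations

def apriori(transactions, min_support):
    transactions = [set(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # L1: frequent 1-itemsets that meet the minimum support.
    items = {i for t in transactions for i in t}
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    frequent = list(level)
    k = 2
    while level:
        # Join step: build candidate k-itemsets from frequent (k-1)-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        level = [c for c in candidates if support(c) >= min_support]
        frequent.extend(level)
        k += 1
    return frequent

transactions = [["milk", "bread"], ["milk", "eggs"], ["milk", "bread", "eggs"], ["bread"]]
for itemset in apriori(transactions, min_support=0.5):
    print(set(itemset))
```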

To summarize the previously discussed models, Table 2 demonstrates the advantages and disadvantages of each model.

Reinforcement learning

Reinforcement learning (RL) is different from supervised learning and unsupervised learning. It is a goal-oriented learning approach. RL is closely related to an agent (controller) that takes responsibility for the learning process to achieve a goal. The agent chooses actions, and as a result, the environment changes its state and returns rewards, which are positive or negative numerical values. The agent's goal is to maximize the rewards accumulated over time. A task is a complete specification of an environment, which identifies how rewards are generated [ 55 ]. Some of the most popular reinforcement learning-based algorithms are the Q-learning algorithm and the Monte Carlo tree search (MCTS).

A. Q-learning

Q-learning is a type of model-free RL. It can be considered an asynchronous dynamic programming approach. It enables agents to learn how to act optimally in Markovian domains by experiencing the consequences of actions, without the need to build maps of the domains [ 56 ]. It represents an incremental method of dynamic programming that imposes low computing requirements. It works through the successive improvement of the assessment of the quality of individual actions in particular states [ 57 ].

Q-learning is strongly employed in information theory, and other related investigations are underway. Recently, Q-learning combined with information theory has been employed in different disciplines such as natural language processing (NLP), pattern recognition, anomaly detection, and image classification [ 58 , 59 , 60 ]. Moreover, a framework has been created to provide a satisfying response based on the user’s utterance using RL in a voice interaction system [ 61 ]. Furthermore, a high-resolution deep learning-based prediction system for local rainfall has been constructed [ 62 ].

The advantage of variants such as ant Q-learning is that the reward value can be identified effectively in a multi-agent environment, as the agents interact with each other. The problem with Q-learning is that its output can get stuck in a local minimum because agents simply take the shortest path [ 63 ].
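As a concrete illustration of the update rule behind Q-learning, the following minimal tabular sketch learns to move along a toy five-state chain; the environment, the reward of 1 at the goal state, and the hyperparameters (alpha, gamma, epsilon) are illustrative assumptions, not drawn from the cited works.

```python
# Minimal tabular Q-learning sketch on a toy chain environment (illustrative only).
import random

N_STATES, GOAL = 5, 4            # states 0..4, reward when reaching state 4
ACTIONS = [-1, +1]               # move left or right
alpha, gamma, epsilon = 0.1, 0.9, 0.2

Q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]

for episode in range(500):
    s = 0
    while s != GOAL:
        # Epsilon-greedy action selection balances exploration and exploitation.
        if random.random() < epsilon:
            a = random.randrange(len(ACTIONS))                       # explore
        else:
            a = max(range(len(ACTIONS)), key=lambda i: Q[s][i])      # exploit
        s_next = min(max(s + ACTIONS[a], 0), N_STATES - 1)
        r = 1.0 if s_next == GOAL else 0.0
        # Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next

print("learned greedy action per non-goal state:",
      [("left", "right")[max(range(2), key=lambda i: Q[s][i])]
       for s in range(N_STATES - 1)])
```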

B. Monte Carlo tree search

Monte Carlo tree search (MCTS) is an effective technique for solving sequential selection problems. Its strategy is based on a smart tree search that balances exploration and exploitation. MCTS performs random samples in the form of simulations and keeps statistics of actions to make better-informed choices in each future iteration. MCTS is a decision-making algorithm that is employed to search huge, complex, tree-like spaces. In such trees, each node refers to a state, also referred to as a problem configuration, while edges represent transitions from one state to another [ 64 ].

The MCTS is related directly to cases that can be represented by a Markov decision process (MDP), which is a type of discrete-time random control process. Some modifications of the MCTS make it possible to apply it to partially observable Markov decision processes (POMDP) [ 65 ]. Recently, MCTS coupled with deep RL became the base of AlphaGo developed by Google DeepMind and documented in [ 66 ]. The basic MCTS method is conceptually simple, as shown in Fig.  11 .

figure 11

Basic MCTS process

The tree is constructed progressively and asymmetrically. For each iteration of the method, the tree policy is utilized to find the most important node of the current tree; the tree policy seeks to strike a balance between exploration and exploitation. Then, a simulation is run from the selected node, and the search tree is updated according to the obtained result. This comprises adding a child node that matches the action taken from the selected node and updating the statistics of its ancestors. During the simulation, moves are made based on some default policy, which in the simplest case is to make uniform random moves. The benefit of MCTS is that there is no need to evaluate the values of intermediate states, which significantly reduces the amount of domain knowledge required [ 67 ].
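As a concrete illustration of how a tree policy balances exploration and exploitation, the following sketch implements UCB1-based child selection and backpropagation of simulation statistics; the Node class, the constant c, and the toy usage at the end are illustrative assumptions and do not constitute a full MCTS implementation.

```python
# Minimal sketch of the MCTS tree-policy ingredients (illustrative only).
import math

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = []       # expanded child nodes
        self.visits = 0          # how often this node was visited
        self.total_reward = 0.0  # accumulated simulation reward

def ucb1(child, c=1.41):
    # Exploitation term (average reward) plus exploration bonus.
    exploit = child.total_reward / child.visits
    explore = c * math.sqrt(math.log(child.parent.visits) / child.visits)
    return exploit + explore

def select_child(node):
    # Unvisited children are tried first; otherwise pick the best UCB1 score.
    unvisited = [ch for ch in node.children if ch.visits == 0]
    if unvisited:
        return unvisited[0]
    return max(node.children, key=ucb1)

def backpropagate(node, reward):
    # Update statistics from the simulated node up to the root.
    while node is not None:
        node.visits += 1
        node.total_reward += reward
        node = node.parent

# Tiny usage example: a root with two children and fake simulation rewards.
root = Node("root")
root.children = [Node("a", root), Node("b", root)]
for reward, child in [(1.0, root.children[0]), (0.0, root.children[1]),
                      (1.0, root.children[0])]:
    backpropagate(child, reward)
print("next node to explore:", select_child(root).state)
```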

To summarize the previously discussed models, Table 3 demonstrates the advantages and disadvantages of each model.

Deep learning

Over the past decades, ML has had a significant impact on our daily lives, with examples including efficient computer vision, web search, and optical character recognition. In addition, by applying ML approaches, human-level AI has also been improved [ 68 , 69 , 70 ]. However, when it comes to the mechanisms of human information processing (such as sound and vision), the performance of traditional ML algorithms is far from satisfactory. The idea of deep learning (DL) was formed in the late twentieth century, inspired by the deep hierarchical structures of human speech recognition and production systems. DL achieved a breakthrough in 2006 when Hinton built a deep-structured learning architecture called the deep belief network (DBN) [ 71 ].

Compared with classical learning methods, the performance of DL-based classifiers improves substantially as the complexity of the data increases. Figure 12 shows the performance of classic ML algorithms and DL methods [ 72 ]. The performance of typical ML algorithms plateaus once a certain amount of training data is reached, whereas DL continues to improve as the complexity and volume of the data increase [ 73 ].

figure 12

Performance of deep learning concerning the complexity of data

DL (deep ML, or deep-structured learning) is a subset of ML that involves a collection of algorithms attempting to represent high-level abstractions of data through a model that has complicated structures or is otherwise composed of numerous nonlinear transformations. The most important characteristic of DL is the depth of the network. Another essential aspect of DL is its ability to replace handcrafted features with features generated by efficient algorithms for unsupervised or semi-supervised feature learning and hierarchical feature extraction [ 74 ].

DL has significantly advanced the latest technologies in a variety of applications, including machine translation, speech, and visual object recognition, NLP, and text automation, using multilayer artificial neural networks (ANNs) [ 15 ].

Different DL designs in the past two decades give enormous potential for employment in various sectors such as automatic voice recognition, computer vision, NLP, and bioinformatics. This section discusses the most common architectures of DL such as convolutional neural networks (CNNs), long short-term memory (LSTM), and recurrent convolution neural networks (RCNNs) [ 75 ].

A. Convolutional neural network

CNNs are special types of neural networks inspired by the human visual cortex and used in computer vision. A CNN is a feed-forward neural network in which information flows exclusively in the forward direction [ 76 ]. CNN is frequently applied in face recognition, human organ localization, text analysis, and biological image recognition [ 77 ].

Since CNN was first created in 1989, it has performed well in disease diagnosis over the past three decades [ 78 ]. Figure 13 depicts the general architecture of a CNN composed of feature extractors and a classifier. In the feature-extraction layers, each layer of the network accepts the output of the previous layer as input and passes its output on to the next layer. A typical CNN architecture consists of three types of layers: convolution, pooling, and classification. There are two types of layers at the network's low and middle levels: convolutional layers and pooling layers; even-numbered layers are used for convolutions, while odd-numbered layers are used for pooling operations. The output nodes of the convolution and pooling layers are organized into two-dimensional planes called feature maps. Each layer level is typically generated by combining one or more previous layers [ 79 ].

figure 13

Architecture of CNN [ 79 ]

CNN has many benefits: it resembles the human visual processing system, it is well suited to processing 2D and 3D images, and it is effective in learning and extracting abstract information from 2D data. The max-pooling layer in CNN is efficient in absorbing shape variations. Furthermore, CNNs are constructed from sparse connections with tied weights and contain far fewer parameters than a fully connected network of equal size. CNNs are trained using a gradient-based learning algorithm and are less susceptible to the vanishing gradient problem because the gradient-based approach trains the entire network to directly reduce the error criterion, allowing CNNs to provide highly optimized weights [ 79 ].
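The following minimal PyTorch sketch illustrates the convolution, pooling, and classification layer pattern described above; the layer sizes, channel counts, and the random input image are illustrative assumptions.

```python
# Minimal CNN sketch: feature-extraction layers followed by a classifier.
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(                 # feature-extraction layers
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                           # pooling layer
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 7 * 7, n_classes)  # classification layer

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

model = SimpleCNN()
dummy = torch.randn(1, 1, 28, 28)                      # one grayscale 28x28 image
print("class scores shape:", model(dummy).shape)       # -> torch.Size([1, 10])
```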

B. Long short-term memory

LSTM is a special type of recurrent neural network (RNN) with internal memory and multiplicative gates. Since the original LSTM was introduced in 1997 by Sepp Hochreiter and Jürgen Schmidhuber, a variety of LSTM cell configurations have been described [ 80 ].

LSTM has contributed to the development of well-known software such as Alexa, Siri, Cortana, Google Translate, and Google voice assistant [ 81 ]. LSTM is an implementation of RNN with a special connection between nodes. The special components within the LSTM unit include the input, output, and forget gates. Figure  14 depicts a single LSTM cell.

figure 14

LSTM unit [ 82 ]

x_t = input vector at time t
h_{t-1} = previous hidden state
c_{t-1} = previous memory state
h_t = current hidden state
c_t = current memory state
[×] = multiplication operation
[+] = addition operation

LSTM is an RNN module that handles the vanishing gradient problem. In general, RNNs use LSTM to eliminate propagation errors, which allows the RNN to learn over multiple time steps. LSTM is characterized by cells that hold information outside the recurrent network. The basic principle of LSTM is the cell state, which contains information outside the recurrent network; the cell is like a memory in a computer, deciding when data should be stored, written, read, or erased via the LSTM gates [ 82 ]. Many network architectures use LSTM, such as bidirectional LSTM, hierarchical and attention-based LSTM, convolutional LSTM, autoencoder LSTM, grid LSTM, and cross-modal and associative LSTM [ 83 ].
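To make the gate notation above concrete, the following minimal NumPy sketch performs single steps of one LSTM cell using x_t, h_{t-1}, and c_{t-1}; the weight matrix is random and purely illustrative, and this is a simplified sketch rather than the exact formulation used in any cited work.

```python
# Minimal single LSTM-cell step with input, forget, and output gates.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, c_prev, W, b):
    # One affine transform produces the input (i), forget (f), output (o)
    # gates and the candidate cell update (g).
    z = W @ np.concatenate([x_t, h_prev]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c_t = f * c_prev + i * g          # [+] and [x] operations on the cell state
    h_t = o * np.tanh(c_t)            # current hidden state
    return h_t, c_t

n_in, n_hidden = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(4 * n_hidden, n_in + n_hidden))
b = np.zeros(4 * n_hidden)

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for t in range(5):                     # unroll over five time steps
    x_t = rng.normal(size=n_in)
    h, c = lstm_cell_step(x_t, h, c, W, b)
print("h_t:", h, "\nc_t:", c)
```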

Bidirectional LSTM networks move the state vector forward and backward in both directions. This implies that dependencies must be considered in both temporal directions. As a result of inverse state propagation, the expected future correlations can be included in the network's current output [ 84 ]. Bidirectional LSTM investigates and analyzes this because it encapsulates spatially and temporally scattered information and can tolerate incomplete inputs via a flexible cell state vector propagation communication mechanism. Based on the detected gaps in data, this filtering mechanism reidentifies the connections between cells for each data sequence. Figure  15 depicts the architecture. A bidirectional network is used in this study to process properties from multiple dimensions into a parallel and integrated architecture [ 83 ].

figure 15

(left) Bidirectional LSTM and (right) filter mechanism for processing incomplete data [ 84 ]

Hierarchical LSTM networks solve multi-dimensional problems by breaking them down into subproblems and organizing them in a hierarchical structure. This has the advantage of focusing on a single or multiple subproblems. This is accomplished by adjusting the weights within the network to generate a certain level of interest [ 83 ]. A weighting-based attention mechanism that analyzes and filters input sequences is also used in hierarchical LSTM networks for long-term dependency prediction [ 85 ].

Convolutional LSTM reduces and filters input data collected over a longer period using convolutional operations applied in LSTM networks or the LSTM cell architecture directly. Furthermore, due to their distinct characteristics, convolutional LSTM networks are useful for modeling many quantities such as spatially and temporally distributed relationships. However, many quantities can be expected collectively in terms of reduced feature representation. Decoding or decoherence layers are required to predict different output quantities not as features but based on their parent units [ 83 ].

The LSTM autoencoder solves the problem of predicting high-dimensional parameters by shrinking and expanding the network [ 86 ]. The autoencoder architecture is separately trained with the aim of accurate reconstruction of the input data as reported in [ 87 ]. Only the encoder is used during testing and commissioning to extract the low-dimensional properties that are transmitted to the LSTM. The LSTM was extended to multimodal prediction using this strategy. To compress the input data and cell states, the encoder and decoder are directly integrated into the LSTM cell architecture. This combined reduction improves the flow of information in the cell and results in an improved cell state update mechanism for both short-term and long-term dependency [ 83 ].

Grid long short-term memory is a network of LSTM cells organized into a multi-dimensional grid that can be applied to sequences, vectors, or higher-dimensional data like images [ 88 ]. Grid LSTM has connections to the spatial or temporal dimensions of input sequences. Thus, connections of different dimensions within cells extend the normal flow of information. As a result, grid LSTM is appropriate for the parallel prediction of several output quantities that may be independent, linear, or nonlinear. The network's dimensions and structure are influenced by the nature of the input data and the goal of the prediction [ 89 ].

A novel method for the collaborative prediction of numerous quantities is the cross-modal and associative LSTM. It uses several standard LSTMs to separately model different quantities. To calculate the dependencies of the quantities, these LSTM streams communicate with one another via recursive connections. The chosen layers' outputs are added as new inputs to the layers before and after them in other streams. Consequently, a multimodal forecast can be made. The benefit of this approach is that the correlation vectors that are produced have the same dimensions as the input vectors. As a result, neither the parameter space nor the computation time increases [ 90 ].

C. Recurrent convolution neural network

CNN is a key method for handling various computer vision challenges. In recent years, a new generation of CNNs has been developed, the recurrent convolutional neural network (RCNN), which is inspired by the large-scale recurrent connections in the visual systems of animals. The recurrent convolutional layer (RCL) is the main feature of RCNN, which integrates recurrent connections among neurons in the normal convolutional layer. As the number of recurrent computations increases, the receptive fields (RFs) of neurons in the RCL expand without bound, which is contrary to biological facts [ 91 ].

The RCNN prototype was proposed by Ming Liang and Xiaolin Hu [ 92 , 93 ], and the structure is illustrated in Fig. 16, in which both the feed-forward and recurrent connections have local connectivity and weights shared between distinct sites. This design is quite similar to the recurrent multilayer perceptron (RMLP) concept, which is often used for dynamic control [ 94 , 95 ] (Fig. 16, middle). As with the distinction between MLP and CNN, the primary difference is that the full connections in RMLP are replaced by shared local connections. For this reason, the proposed model is known as RCNN [ 96 ].

figure 16

Illustration of the architectures of CNN, RMLP, and RCNN [ 85 ]

figure 17

Illustration of the total number of reviewed papers

The main unit of RCNN is the RCL. RCLs evolve over discrete time steps. RCNN offers three basic advantages. First, it allows each unit to incorporate context information from an arbitrarily large region in the current layer. Second, recurrent connections increase the depth of the network while keeping the number of adjustable parameters constant through weight sharing; this is consistent with the trend of modern CNN architectures to grow deeper with a relatively limited number of parameters. Third, the time-unfolded RCNN is a CNN with many paths between the input layer and the output layer, which makes learning easier: on one hand, longer paths make it possible for the model to learn very complex features; on the other hand, shorter paths may improve gradient backpropagation during training [ 91 ].
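A minimal PyTorch sketch of a recurrent convolutional layer is given below: a feed-forward convolution provides the static input, and a weight-shared recurrent convolution is unrolled for a fixed number of time steps; the channel counts and the step count are illustrative assumptions rather than the exact configuration of the cited works.

```python
# Minimal recurrent convolutional layer (RCL) sketch (illustrative only).
import torch
import torch.nn as nn

class RCL(nn.Module):
    def __init__(self, in_ch, out_ch, steps=3):
        super().__init__()
        self.steps = steps
        self.feed_forward = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.recurrent = nn.Conv2d(out_ch, out_ch, 3, padding=1)  # shared weights

    def forward(self, x):
        ff = self.feed_forward(x)          # feed-forward input, fixed over time
        state = torch.relu(ff)
        for _ in range(self.steps):        # unrolled recurrent iterations
            state = torch.relu(ff + self.recurrent(state))
        return state

layer = RCL(in_ch=3, out_ch=8, steps=3)
out = layer(torch.randn(1, 3, 32, 32))
print("output feature map shape:", out.shape)   # -> torch.Size([1, 8, 32, 32])
```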

To summarize the previously discussed models, Table 4 demonstrates the advantages and disadvantages of each model.

Disease prediction with analytics

The studies discussed in this paper have been presented and published in high-quality journals and international conferences by IEEE, Springer, Elsevier, and other major scientific publishers such as Hindawi, Frontiers, Taylor & Francis, and MDPI. The search engines used are Google Scholar, Scopus, and ScienceDirect. All selected papers cover the period from 2019 to 2022. Machine learning, deep learning, health care, surgery, cardiology, radiology, hepatology, and nephrology are some of the terms used to search for these studies. The studies chosen for this survey are concerned with the use of machine learning as well as deep learning algorithms in healthcare prediction, and both empirical and review articles on these topics were considered. This section discusses existing research efforts that address healthcare prediction using various ML and DL techniques, with a detailed discussion of the methods and algorithms used for prediction, the performance metrics, and the tools of each model.

ML-based healthcare prediction

To predict diabetes patients, the authors of [ 97 ] utilized a framework to develop and evaluate ML classification models such as logistic regression, KNN, SVM, and RF. The ML methods were implemented on the Pima Indian Diabetes Database (PIDD), which has 768 rows and 9 columns, and the forecast accuracy reached 83%. The results indicate that logistic regression outperformed the other ML algorithms. However, only a structured dataset was selected and unstructured data were not considered; the model should also be implemented in other healthcare domains such as heart disease and COVID-19, and other factors should be considered for diabetes prediction, such as family history of diabetes, smoking habits, and physical inactivity.

The authors created a diagnosis system in [ 98 ] that uses two different datasets (from Frankfurt Hospital in Germany and the PIDD provided by the UCI ML repository) and four prediction models (RF, SVM, NB, and DT) to predict diabetes. The SVM algorithm performed with an accuracy of 83.1 percent. Some aspects of this study could be improved: using a DL approach to predict diabetes may achieve better results, and the model should be tested in other healthcare domains such as heart disease and COVID-19 prediction datasets.

In [ 99 ], the authors proposed three ML methods (logistic regression, DT, and boosted RF) to assess COVID-19 using open data resources from Mexico and Brazil. To predict recovery and death, the proposed model incorporates only the COVID-19 patient's geographical, social, and economic conditions, as well as clinical risk factors, medical reports, and demographic data. On the dataset utilized, the model for Mexico has a 93 percent accuracy and an F1 score of 0.79, while the Brazil model has a 69 percent accuracy and an F1 score of 0.75. The three ML algorithms were examined, and the acquired results showed that logistic regression is the best way of processing the data. The authors should also consider authentication and privacy management of the created data.

A new model for predicting type 2 diabetes using a network approach and ML techniques (logistic regression, SVM, NB, KNN, DT, RF, XGBoost, and ANN) was presented by the authors in [ 100 ]. To predict the risk of type 2 diabetes, healthcare data of 1,028 type 2 diabetes patients and 1,028 non-type 2 diabetes patients were extracted from de-identified data. The experimental findings reveal the models’ effectiveness, with an area under the curve (AUC) varying from 0.79 to 0.91, and the RF model achieved the highest accuracy. This study relies only on a dataset providing hospital admission and discharge summaries from one insurance company; external hospital visits and information from other insurance companies are missing for people with multiple insurance providers.

The authors of [ 101 ] proposed a healthcare management system that can be used by patients to schedule appointments with doctors and verify prescriptions. It uses ML to detect ailments and suggest medicines. ML models including DT, RF, logistic regression, and NB classifiers are applied to diabetes, heart disease, chronic kidney disease, and liver datasets. The results showed that, among all the models, logistic regression had the highest accuracy of 98.5 percent on the heart dataset, while the DT classifier had the lowest accuracy at 92 percent. On the liver dataset, logistic regression achieved the maximum accuracy of 75.17%. On the chronic renal disease dataset, logistic regression, RF, and Gaussian NB all performed well with an accuracy of 1; such 100% accuracy should be verified by using k-fold cross-validation to test the reliability of the models. On the diabetes dataset, random forest achieved the maximum accuracy of 83.67 percent. The authors should include a hospital directory so that various hospitals and clinics can be accessed through a single portal. Additionally, image datasets could be included to allow image processing of reports and the deployment of DL to detect diseases.

In [ 102 ], the authors developed an ML model to predict the occurrence of type 2 diabetes in the following year (Y + 1) using factors from the present year (Y). The dataset was obtained as electronic health records from a private medical institute between 2013 and 2018. The authors applied logistic regression, RF, SVM, XGBoost, and ensemble ML algorithms to predict non-diabetic, prediabetes, and diabetes outcomes. Feature selection was applied to distinguish the three classes efficiently; FPG, HbA1c, triglycerides, BMI, gamma-GTP, gender, age, uric acid, smoking, drinking, physical activity, and family history were among the features selected. According to the experimental results, the maximum accuracy was 73% from RF, while the lowest was 71% from the logistic regression model. The authors used only one dataset, so additional data sources should be applied to verify the models developed in this study.

The authors of [ 103 ] classified the diabetes dataset using SVM and NB algorithms with feature selection to improve model accuracy. The PIDD was taken from the UCI repository for analysis, and k-fold cross-validation was employed for training and testing. The SVM classifier performed better than the NB method, offering around 91% correct predictions; however, the authors acknowledge that they need to extend the work to newer datasets containing additional attributes and rows.

K-means clustering, an unsupervised ML algorithm, was used by the authors of [ 104 ] to detect heart disease in its earliest stages using the UCI heart disease dataset, with PCA used for dimensionality reduction. The method demonstrates early cardiac disease prediction with 94.06% accuracy. The authors should apply the proposed technique using more than one algorithm and more than one dataset.

In [ 105 ], the authors constructed a predictive model for the classification of diabetes data using the logistic regression classification technique. The dataset includes 459 patients for training and 128 cases for testing. The prediction accuracy using logistic regression reached 92%. The main limitation of this research is that the authors did not compare the model with other diabetes prediction algorithms, so its performance cannot be confirmed.

The authors of [ 106 ] developed a prediction model that analyzes the user's symptoms and predicts the disease using ML algorithms (DT classifier, RF classifier, and NB classifier). The purpose of this study was to solve health-related problems by allowing medical professionals to predict diseases at an early stage. The dataset is a sample of 4920 patient records with 41 illnesses diagnosed. A total of 41 disorders were included as a dependent variable. All algorithms achieved the same accuracy score of 95.12%. The authors noticed that overfitting occurred when all 132 symptoms from the original dataset were assessed instead of 95 symptoms. That is, the tree appears to remember the dataset provided and thus fails to classify new data. As a result, just 95 symptoms were assessed during the data-cleansing process, with the best ones being chosen.

In [ 107 ], the authors built a decision-making system that assists practitioners in anticipating cardiac problems through exact classification with a simple method, delivering automated predictions about the condition of the patient’s heart. Four algorithms (KNN, RF, DT, and NB) were implemented on the Cleveland Heart Disease dataset. The accuracy varies across classification methods; the maximum accuracy, almost 94 percent, was obtained with the KNN algorithm combined with the correlation factor. The authors should extend the presented technique to leverage more than one dataset and forecast different diseases.

The authors of [ 108 ] used the Cleveland dataset, which includes 303 cases and 76 attributes, to test four different classification strategies: NB, SVM, DT, and KNN. Only 14 of the 76 attributes were used in the testing process, and the authors performed data preprocessing to remove noisy data. KNN obtained the greatest accuracy at 90.79 percent. The authors need to use more sophisticated models to improve the accuracy of early heart disease prediction.

The authors of [ 109 ] proposed a model to predict heart disease by making use of a cardiovascular dataset, which was then classified through the application of supervised machine learning algorithms (DT, NB, logistic regression, RF, SVM, and KNN). The results reveal that the DT classification model predicted cardiovascular disorders better than other algorithms with an accuracy of 73 percent. The authors highlighted that the ensemble ML techniques employing the CVD dataset can generate a better illness prediction model.

In [ 110 ], the authors attempted to increase the accuracy of heart disease prediction by applying logistic regression to a healthcare dataset to determine whether patients have heart illness or not. The dataset was acquired from an ongoing cardiovascular study of residents of the town of Framingham, Massachusetts. The model reached a prediction accuracy of 87 percent. The authors acknowledge that the model could be improved with more data and the use of more ML models.

Because breast cancer affects one in every 28 women in India, the author of [ 111 ] presented an accurate classification technique to examine a breast cancer dataset containing 569 rows and 32 columns. Also employing a heart disease dataset and a lung cancer dataset, this research offered a novel approach to feature selection based on genetic algorithms combined with SVM classification. The classifier results were 81.8182 for lung cancer and 78.9272 for diabetes. Note that the size, kind, and source of the data used are not indicated.

In [ 112 ], the authors predicted the risk factors that cause heart disease using the k-means clustering algorithm, a common unsupervised ML technique, and analyzed the results with a visualization tool. They used the Cleveland heart disease dataset, which has 76 features for 303 patients; the subset analyzed holds 209 records with 8 attributes, such as age, chest pain type, blood pressure, blood glucose level, resting ECG, and heart rate, as well as four types of chest pain. The authors forecast cardiac disease by considering only the primary characteristics of the four types of chest pain.

The aim of the article [ 113 ] was to report the advantages of using a variety of data mining (DM) methods and validated heart disease survival prediction models. From the observations, the authors proposed that logistic regression and NB achieve the highest accuracy on the high-dimensional Cleveland hospital dataset, while DT and RF produce better results on low-dimensional datasets. RF delivers higher accuracy than the DT classifier because it is an optimized ensemble learning algorithm. The authors mentioned that this work could be extended to other ML algorithms and that the model could be developed in a distributed environment such as Map–Reduce, Apache Mahout, and HBase.

In [ 114 ], the authors proposed a single algorithm, named hybridization, to predict heart disease by combining the used techniques into one algorithm. The presented method has three phases: a preprocessing phase, a classification phase, and a diagnosis phase. They employed the Cleveland database with the NB, SVM, KNN, NN, J4.8, RF, and GA algorithms. NB and SVM consistently perform better than the others, whereas the rest depend on the specified features. The results attained an accuracy of 89.2 percent. Note that the dataset is small; hence, the system could not be trained adequately, which limited the accuracy of the method.

Using six algorithms (logistic regression, KNN, DT, SVM, NB, and RF), the authors of [ 115 ] explored different data representations to better understand how to use clinical data for predicting liver disease. The original dataset was taken from the northeast of Andhra Pradesh, India, and includes 583 liver patient records, of which 75.64 percent are male and 24.36 percent are female. The analysis indicated that the logistic regression classifier delivers the highest accuracy of 75 percent based on the F1 measure for forecasting liver illness, while NB gives the lowest accuracy of 53 percent. The authors studied only a few prominent supervised ML algorithms; more algorithms could be used to create a more exact model of liver disease prediction and steadily improve performance.

In [ 116 ], the authors aimed to predict coronary heart disease (CHD) based on historical medical data using ML technology. The goal of this study was to use three supervised learning approaches, NB, SVM, and DT, to find correlations in CHD data that could help improve prediction rates. The dataset, from KEEL, contains a retrospective sample of males from a high-risk heart disease region in the Western Cape of South Africa. NB achieved the highest accuracy among the three models, while SVM and DT J48 outperformed NB with a specificity rate of 82 percent but showed an inadequate sensitivity rate of less than 50 percent.

With the help of DM and network analysis methods, the authors of [ 117 ] developed a chronic disease risk prediction framework, created and evaluated in the Australian healthcare system, to predict type 2 diabetes risk. Using a private healthcare funds dataset from Australia that spans six years and three different predictive algorithms (regression, parameter optimization, and DT), the prediction accuracy ranges from 82 to 87 percent. The dataset's source is hospital admission and discharge summaries; as a result, it does not provide information about general physician visits or future diagnoses.

DL-based healthcare prediction

With the help of DL algorithms such as CNN for automatic feature extraction and illness prediction, and KNN for distance calculation to locate the exact match in the dataset and produce the final disease prediction, the authors of [ 118 ] proposed a system for predicting patients with the more common chronic diseases. The dataset structure combines disease symptoms, a person's living habits, and details attached to doctor consultations, which is acceptable for this general disease prediction. In this study, the Indian chronic kidney disease dataset, comprising 400 instances, 24 attributes, and 2 classes, was retrieved from the UCI ML repository. Finally, a comparative study of the proposed system with other algorithms such as NB, DT, and logistic regression was demonstrated. The findings showed that the proposed system gives an accuracy of 95%, which is higher than the other methods, but the proposed technique should be applied to more than one dataset.

In [ 119 ], the authors developed a DL approach that uses chest radiography images to differentiate between patients with mild, pneumonia, and COVID-19 infections, providing a valid mechanism for COVID-19 diagnosis. To increase the intensity of the chest X-ray images and eliminate noise, image-enhancement techniques were used in the proposed system. Two distinct DL approaches based on a pretrained neural network model (ResNet-50) for COVID-19 identification utilizing chest X-ray (CXR) pictures are proposed in this work to minimize overfitting and increase the overall capabilities of the suggested DL systems. The authors emphasized that tests using a vast and challenging dataset encompassing several COVID-19 cases are necessary to establish the efficacy of the suggested system.

Diabetes disease prediction was the topic of [ 120 ], in which the authors presented a cuckoo search-based deep LSTM classifier for prediction. The deep convLSTM classifier is used with cuckoo search optimization, a nature-inspired method, to accurately predict disease by transferring information and thereby reducing time consumption. The PIMA dataset is used to predict the onset of diabetes; the data were provided by the National Institute of Diabetes and Digestive and Kidney Diseases and consist of independent variables, including insulin level, age, and BMI index, as well as one dependent variable. The new technique was compared to traditional methods, and the results showed that the proposed method achieved 97.591 percent accuracy, 95.874 percent sensitivity, and 97.094 percent specificity. The authors noted that more datasets are needed, as well as new approaches to improve the classifier's effectiveness.

In [ 121 ], the authors presented a wavelet-based convolutional neural network to handle data limitations during the rapid emergence of COVID-19. By investigating the influence of discrete wavelet transform decomposition up to 4 levels, the model demonstrated the capability of multi-resolution analysis for detecting COVID-19 in chest X-rays; the wavelet sub-bands are the CNN’s inputs at each decomposition level. COVID-19 chest X-ray-12 (COVID-CXR-12) is a collection of 1,944 chest X-ray pictures divided into 12 groups, compiled from two open-source datasets (one from the National Institutes of Health containing X-rays of pneumonia-related diseases, and a COVID-19 dataset collected from the Radiological Society of North America). COVID-Neuro wavelet, the suggested model, was trained alongside other well-known ImageNet pre-trained models on COVID-CXR-12. The authors acknowledge that they hope to investigate the effects of other wavelet functions besides the Haar wavelet.

A CNN framework for COVID-19 identification that makes use of computed tomography (CT) images was developed by the authors of [ 122 ]. The proposed framework employs a public CT dataset of 2482 CT images from patients of both classes. The system attained an accuracy of 96.16 percent and a recall of 95.41 percent after training using only 20 percent of the dataset. The authors stated that the use of the framework should be extended to multimodal medical images in the future.

Using an LSTM network enhanced by two processes to perform multi-label classification based on patients' clinical visit records, the authors of [ 123 ] performed multi-disease prediction for intelligent clinical decision support. A massive dataset of electronic health records was collected from a prominent hospital in southeast China. The suggested LSTM approach outperforms several standard and DL models in predicting future disease diagnoses, according to the model evaluation results: the F1 score rises from 78.9 percent with state-of-the-art conventional models and 86.4 percent with DL models to 88.0 percent with the suggested technique. The authors stated that the model's prediction performance may be enhanced further by including new input variables and that, to reduce computational complexity, the method uses only one data source.

In [ 124 ], the authors introduced an approach to creating a supervised ANN structure based on subnets (groups of neurons) instead of layers, which effectively predicts disease in the case of small datasets. The model was evaluated using textual data and compared to multilayer perceptrons (MLPs) as well as LSTM recurrent neural network models using three small-scale publicly accessible benchmark datasets. On the Iris dataset, the experimental findings for classification reached 97% accuracy, compared to 92% for a three-layer RNN (LSTM). On the diabetes dataset, the model had a lower error rate (81) than RNN (LSTM) and MLP, while RNN (LSTM) had a higher error rate of 84. However, this method is not suitable for larger datasets, and it has not been implemented on large textual and image datasets.

The authors of [ 125 ] presented a novel AI and Internet of Things (IoT) convergence-based disease detection model for a smart healthcare system. Data collection, reprocessing, categorization, and parameter optimization are all stages of the proposed model. IoT devices, such as wearables and sensors, collect data, which AI algorithms then use to diagnose diseases. The forest technique is then used to remove any outliers found in the patient data. Healthcare data were used to assess the performance of the CSO-LSTM model. During the study, the CSO-LSTM model had a maximum accuracy of 96.16% on heart disease diagnoses and 97.26% on diabetes diagnoses. This method offered a greater prediction accuracy for heart disease and diabetes diagnosis, but there was no feature selection mechanism; hence, it requires extensive computations.

The global health crisis posed by coronaviruses was the subject of [ 126 ]. The authors aimed at detecting disease in people whose X-rays had been selected as potential COVID-19 candidates. Chest X-rays of people with COVID-19, viral pneumonia, and healthy people are included in the dataset. The study compared the performance of two DL approaches, namely CNN and RNN, using a total of 657 chest X-ray images for the diagnosis of COVID-19. VGG19 is the most successful model, with a 95% accuracy rate; it successfully categorizes COVID-19 patients, healthy individuals, and viral pneumonia cases. InceptionV3 performed worst on the dataset. According to the authors, the success percentage can be improved by improving data collection; in addition to chest radiography, lung tomography can be used, and the success ratio and performance can be enhanced by creating numerous DL models.

In [ 127 ], the authors developed a method based on the RNN algorithm for predicting blood glucose levels for diabetics a maximum of one hour in the future, which required the patient's glucose level history. The Ohio T1DM dataset for blood glucose level prediction, which included blood glucose level values for six people with type 1 diabetes, was used to train and assess the approach. The distribution features were further honed with the use of studies that revealed the procedure's certainty estimate nature. The authors point out that they can only evaluate prediction goals with enough glucose level history; thus, they cannot anticipate the beginning levels after a gap, which does not improve the prediction's quality.

To build a new deep anomaly detection model for fast, reliable screening, the authors of [ 128 ] used an 18-layer residual CNN pre-trained on ImageNet with a different anomaly detection mechanism for the classification of COVID-19. On the X-ray dataset, which contains 100 images from 70 COVID-19 subjects and 1431 images from 1008 non-COVID-19 pneumonia subjects, the model obtains a sensitivity of 90.00 percent with a specificity of 87.84 percent, or a sensitivity of 96.00 percent with a specificity of 70.65 percent. The authors noted that the model still has certain flaws, such as missing 4% of COVID-19 cases and having a 30% false positive rate. In addition, more clinical data are required to confirm and improve the model's usefulness.

In [ 129 ], the authors developed COVIDX-Net, a novel DL framework that allows radiologists to diagnose COVID-19 in X-ray images automatically. Seven algorithms (MobileNetV2, ResNetV2, VGG19, DenseNet201, InceptionV3, Inception, and Xception) were evaluated using a small dataset of 50 photographs. Each deep neural network model can classify the patient's status as a negative or positive COVID-19 case based on the normalized intensities of the X-ray image. The F1 scores for the VGG19 and dense convolutional network (DenseNet) models were 0.89 and 0.91, respectively, while the InceptionV3 model had the weakest classification performance with an F1 score of 0.67.

The authors of [ 130 ] designed a DL approach for delivering 30-min predictions of future glucose levels based on a dilated RNN (DRNN). The performance of the DRNN models was evaluated using data from two electronic health record datasets: OhioT1DM from clinical trials and an in silico dataset from the UVA-Padova simulator. It outperformed established glucose prediction approaches such as neural networks (NNs), support vector regression (SVR), and autoregressive models (ARX). The results demonstrated that it significantly improved glucose prediction performance, although there are still some limits: the authors created a data-driven model that heavily relies on past EHR, and the quality of the data has a significant impact on the accuracy of the prediction. The number of clinical datasets is limited and access is often restricted, and because certain data fields are manually entered, they are occasionally incorrect.

In [ 131 ], the authors utilized a deep neural network (DNN) on data from 15,099 stroke patients to predict stroke death based on medical history and human behaviors using large-scale electronic health information. The Korea Centers for Disease Control and Prevention collected data from 2013 to 2016 from around 150 hospitals in the country, all having more than 100 beds. Gender, age, type of insurance, mode of admission, necessary brain surgery, area, length of hospital stay, hospital location, number of hospital beds, stroke kind, and CCI were among the 11 variables in the DL model. To automatically create features from the data and identify risk factors for stroke, the researchers used a DNN with scaled principal component analysis (PCA). The 15,099 people with a history of stroke were divided into a training set (66%) and a testing set (34%), with 30 percent of the training samples used for validation. The DNN is used to examine the variables of interest, while scaled PCA is utilized to improve the DNN's continuous inputs. The sensitivity, specificity, and AUC values were 64.32%, 85.56%, and 83.48%, respectively.

The authors of [ 132 ] proposed GluNet, an approach to glucose forecasting. This method uses a personalized DNN to forecast the probabilistic distribution of short-term measurements for people with type 1 diabetes based on their historical data, including insulin doses, meal information, glucose measurements, and a variety of other factors. It utilizes recent DL techniques consisting of four components: data preprocessing, label recovery/transform, dilated CNN, and post-processing. The authors ran the models on subjects from the OhioT1DM dataset. The outcomes revealed significant improvements over previous procedures in a comprehensive comparison of root mean square error (RMSE) and time lag at 60-min prediction horizons (PH), with small time lag at shorter prediction horizons, for the virtual adult participants. If the PH is properly matched to the lag between input and output, the user may learn to control the system more effectively and achieve good performance. Additionally, GluNet was validated on two clinical datasets, where RMSE and time lag were reported for 60-min and 30-min PHs. The authors point out that the model does not consider physiological knowledge and that they need to test GluNet with larger prediction horizons and use it to predict overnight hypoglycemia.

The authors of [ 133 ] proposed the short-term blood glucose prediction model VMD-IPSO-LSTM. Initially, the intrinsic mode functions (IMF) in various frequency bands were obtained using the variational mode decomposition (VMD) technique, which decomposed the blood glucose signal. Long short-term memory networks then constructed a prediction mechanism for the intrinsic mode functions of each blood glucose component. Because the time window length, learning rate, and neuron count are difficult to set, an improved PSO approach optimized these parameters. The improved LSTM network predicted each IMF, and the predicted subsequences were superimposed in the final step to arrive at the ultimate prediction result. Data from 56 participants were chosen as experimental data from among 451 diabetes mellitus patients. The experiments revealed improved prediction accuracy at 30, 45, and 60 min; the RMSE and MAPE were lower than those of VMD-PSO-LSTM, VMD-LSTM, and LSTM, indicating that the suggested model is effective. The longer prediction horizon and higher accuracy of the predictions gave patients and doctors more time to improve the effectiveness of diabetes therapy and manage blood glucose levels. The authors noted that they still faced challenges, such as increased calculation volume and operation time, and that the time needed to estimate short-term glucose levels should be reduced.

To speed up diagnosis and reduce errors, the authors of [ 134 ] proposed a new paradigm for primary COVID-19 detection based on a radiological review of chest radiographs (chest X-rays). The authors used a dataset of chest X-rays from verified COVID-19 patients (408 images), confirmed pneumonia patients (4273 images), and healthy people (1590 images), 6271 people in total, to perform a three-class image classification. To address this image-classification problem, the authors used a CNN with transfer learning. Across all folds of data, the model's accuracy ranged from 93.90 to 98.37 percent; even the lowest accuracy, 93.90 percent, is still quite good. The authors note a limitation, however, in adopting such a model on a large scale for practical use.
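A hedged sketch of this kind of transfer-learning setup is shown below; the backbone (ResNet50), image size, and directory layout are assumptions, not the authors' exact configuration.

```python
# Illustrative CNN transfer learning for three-class chest X-ray classification
# (COVID-19 / pneumonia / healthy) in the spirit of [134]. Backbone, image size,
# and the "chest_xray/train" folder layout are assumptions.
import tensorflow as tf

IMG_SIZE = (224, 224)

# Assumes an image folder with one subdirectory per class.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "chest_xray/train", image_size=IMG_SIZE, batch_size=32)

base = tf.keras.applications.ResNet50(include_top=False, weights="imagenet",
                                      input_shape=IMG_SIZE + (3,))
base.trainable = False                             # freeze pretrained features

model = tf.keras.Sequential([
    tf.keras.Input(shape=IMG_SIZE + (3,)),
    tf.keras.layers.Lambda(tf.keras.applications.resnet50.preprocess_input),
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(3, activation="softmax"),   # three classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, epochs=5)
```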

In [ 135 ], the authors proposed DL models for predicting the number of COVID-19-positive cases in Indian states. The Ministry of Health and Family Welfare dataset contains time series of confirmed COVID-19 cases for each of the 32 individual states (28) and union territories (4) since March 14, 2020. This dataset was used to conduct an exploratory analysis of the growth in the number of positive cases in India. RNN-based LSTMs were used as prediction models. Deep LSTM, convolutional LSTM, and bidirectional LSTM models were tested on the 32 states/union territories, and the model with the best accuracy was chosen based on absolute error. Bidirectional LSTM produced the best performance in terms of prediction error, while convolutional LSTM produced the worst. Daily and weekly forecasts were calculated for all states, and bi-LSTM produced accurate results (error less than 3%) for short-term prediction (1–3 days).
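The bidirectional LSTM forecaster used for this kind of short-term case prediction can be sketched as follows; the window length, layer size, and synthetic case series are illustrative assumptions.

```python
# Illustrative bidirectional LSTM for short-term case-count forecasting in the
# spirit of [135]: past k days of confirmed cases -> next day's count.
# Window length, units, and the synthetic series are assumptions.
import numpy as np
import tensorflow as tf

window = 7
cases = np.cumsum(np.random.poisson(50, 200)).astype("float32")   # placeholder cumulative series

X = np.array([cases[i:i + window] for i in range(len(cases) - window)])[..., None]
y = cases[window:]

model = tf.keras.Sequential([
    tf.keras.Input(shape=(window, 1)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mae")    # absolute error, as used for model selection
model.fit(X, y, epochs=10, verbose=0)

next_day = model.predict(cases[-window:][None, :, None], verbose=0)[0, 0]
print("forecast for next day:", next_day)
```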

With the goal of increasing the reliability and precision of type 1 diabetes predictions, the authors of [ 136 ] proposed a new method based on CNNs and DL, focused on extracting behavioral patterns. Numerous observations of identical behaviors were used to fill in the gaps in the data. The proposed model was trained and validated using data from 759 people with type 1 diabetes who visited Sheffield Teaching Hospitals between 2013 and 2015. Each item in the training set comprised a subject's type 1 diabetes test, demographic data (age, gender, years with diabetes), and the final 84 days (12 weeks) of self-monitored blood glucose (SMBG) measurements preceding the test. According to the authors, prediction accuracy deteriorates in the presence of insufficient data and certain physiological specificities.

The authors of [ 137 ] constructed a framework using the PIDD. The PIDD participants are all female and at least 21 years old. The dataset comprises 768 instances, with 268 samples diagnosed as diabetic and 500 not diagnosed as diabetic, described by the eight characteristics most relevant to diabetes prediction. The accuracy of classifiers such as ANN, NB, DT, and DL ranged between 90 and 98 percent. On the PIMA dataset, DL achieved the best results for diabetes onset among the four, with an accuracy rate of 98.07 percent. The technique uses a variety of classifiers to accurately predict the disease, but it failed to diagnose it at an early stage.
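A minimal sketch of this kind of classifier comparison on the PIDD is shown below, assuming a local CSV copy of the dataset with the usual eight features and an Outcome column; scikit-learn's MLPClassifier stands in for the ANN/DL models reported in [ 137 ].

```python
# Hedged sketch of comparing classifiers on the Pima Indians Diabetes Database
# as in [137]; the CSV path and column names are assumptions about a locally
# available copy of the dataset.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

df = pd.read_csv("pima_diabetes.csv")    # 768 rows, 8 features + "Outcome" (assumed layout)
X, y = df.drop(columns="Outcome"), df["Outcome"]

models = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Neural Net": make_pipeline(StandardScaler(),
                                MLPClassifier(hidden_layer_sizes=(32, 16),
                                              max_iter=1000, random_state=0)),
}
for name, clf in models.items():
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} accuracy")
```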

To summarize all previous works discussed in this section, we will categorize them according to the diseases along with the techniques used to predict each disease, the datasets used, and the main findings, as shown in Table 5 .

Results and discussion

This study conducted a systematic review to examine the latest developments in ML and DL for healthcare prediction. It focused on healthcare forecasting and how the use of ML and DL can be relevant and robust. A total of 41 papers were reviewed, 21 in ML and 20 in DL as depicted in Fig.  17 .

In this study, the reviewed papers were classified by the disease predicted; as a result, five diseases were discussed: diabetes, COVID-19, heart disease, liver disease, and chronic kidney disease. Table 6 illustrates the number of reviewed papers for each disease, in addition to the prediction techniques adopted for each disease.

Table 6 provides a comprehensive summary of the various ML and DL models used for disease prediction. It indicates the number of studies conducted on each disease, the techniques employed, and the highest level of accuracy attained. As shown in Table 6 , the optimal diagnostic accuracy varies by disease. For diabetes, the DL model achieved a 98.07% accuracy rate. For COVID-19, the accuracy of the logistic regression model was 98.5%. The CSO-LSTM model achieved an accuracy of 96.16% for heart disease. For liver disease, the accuracy of the logistic regression model was 75%. The accuracy of the logistic regression model for predicting multiple diseases was 98.5%. It is essential to note that these are merely the best accuracies reported in this survey. In addition, it is essential to consider the size and quality of the datasets used to train and validate the models: models trained on larger and more diverse datasets are more likely to generalize well to new data. Overall, the results presented in Table 6 indicate that ML and DL models can be used to accurately predict disease. When selecting a model for a specific disease, it is essential to carefully weigh the various models and techniques.

Although ML and DL have made incredible strides in recent years, they still have a long way to go before they can effectively be used to solve the fundamental problems plaguing the healthcare systems. Some of the challenges associated with implementing ML and DL approaches in healthcare prediction are discussed here.

The Biomedical Data Stream is the primary challenge that needs to be handled. Significant amounts of new medical data are being generated rapidly, and the healthcare industry as a whole is evolving rapidly. Some examples of such real-time biological signals include measurements of blood pressure, oxygen saturation, and glucose levels. While some variants of DL architecture have attempted to address this problem, many challenges remain before effective analyses of rapidly evolving, massive amounts of streaming data can be conducted. These include problems with memory consumption, feature selection, missing data, and computational complexity. Another challenge for ML and DL is tackling the complexity of the healthcare domain.

Healthcare and biomedical research present more intricate challenges than other fields. There is still a lot we do not know about the origins, transmission, and cures for many of these incredibly diverse diseases. It is hard to collect sufficient data because there are not always enough patients. A solution to this issue may be found, however. The small number of patients necessitates exhaustive patient profiling, innovative data processing, and the incorporation of additional datasets. Researchers can process each dataset independently using the appropriate DL technique and then represent the results in a unified model to extract patient data.

The use of ML and DL techniques for healthcare prediction has the potential to change the way traditional healthcare services are delivered. In ML and DL applications, healthcare data are deemed the most significant component of medical care systems. This paper presented a comprehensive review of the most significant ML and DL techniques employed in healthcare predictive analytics and discussed the obstacles and challenges of applying these techniques in the healthcare domain. As a result of this survey, a total of 41 papers covering the period from 2019 to 2022 were selected and thoroughly reviewed, and the methodology of each paper was discussed in detail. The reviewed studies have shown that AI techniques (ML and DL) play a significant role in accurately diagnosing diseases and help anticipate and analyze healthcare data by linking hundreds of clinical records and rebuilding a patient's history from these data. This work advances research in the field of healthcare predictive analytics using ML and DL approaches and contributes to the literature and future studies by serving as a resource for other academics and researchers.

Availability of data and materials

Not applicable.

Abbreviations

AI: Artificial Intelligence

ML: Machine Learning

DT: Decision Tree

EHR: Electronic Health Records

RF: Random Forest

SVM: Support Vector Machine

KNN: K-Nearest Neighbor

NB: Naive Bayes

RL: Reinforcement Learning

NLP: Natural Language Processing

MCTS: Monte Carlo Tree Search

POMDP: Partially Observable Markov Decision Processes

DL: Deep Learning

DBN: Deep Belief Network

ANN: Artificial Neural Networks

CNN: Convolutional Neural Networks

LSTM: Long Short-Term Memory

RCNN: Recurrent Convolution Neural Networks

RNN: Recurrent Neural Networks

RCL: Recurrent Convolutional Layer

RD: Receptive Domains

RMLP: Recurrent Multilayer Perceptron

PIDD: Pima Indian Diabetes Database

CHD: Coronary Heart Disease

CXR: Chest X-Ray

MLP: Multilayer Perceptrons

IoT: Internet of Things

DRNN: Dilated RNN

NN: Neural Networks

SVR: Support Vector Regression

PCA: Principal Component Analysis

DNN: Deep Neural Network

PH: Prediction Horizons

RMSE: Root Mean Square Error

IMF: Intrinsic Modal Functions

VMD: Variational Modal Decomposition

SMBG: Self-Monitored Blood Glucose

Latha MH, Ramakrishna A, Reddy BSC, Venkateswarlu C, Saraswathi SY (2022) Disease prediction by stacking algorithms over big data from healthcare communities. Intell Manuf Energy Sustain: Proc ICIMES 2021(265):355


Van Calster B, Wynants L, Timmerman D, Steyerberg EW, Collins GS (2019) Predictive analytics in health care: how can we know it works? J Am Med Inform Assoc 26(12):1651–1654

Sahoo PK, Mohapatra SK, Wu SL (2018) SLA based healthcare big data analysis and computing in cloud network. J Parallel Distrib Comput 119:121–135

Thanigaivasan V, Narayanan SJ, Iyengar SN, Ch N (2018) Analysis of parallel SVM based classification technique on healthcare using big data management in cloud storage. Recent Patents Comput Sci 11(3):169–178

Elmahdy HN (2014) Medical diagnosis enhancements through artificial intelligence

Xiong X, Cao X, Luo L (2021) The ecology of medical care in Shanghai. BMC Health Serv Res 21:1–9

Donev D, Kovacic L, Laaser U (2013) The role and organization of health care systems. Health: systems, lifestyles, policies, 2nd edn. Jacobs Verlag, Lage, pp 3–144

Murphy G F, Hanken M A, & Waters K A (1999) Electronic health records: changing the vision

Qayyum A, Qadir J, Bilal M, Al-Fuqaha A (2020) Secure and robust machine learning for healthcare: a survey. IEEE Rev Biomed Eng 14:156–180

El Seddawy AB, Moawad R, Hana MA (2018) Applying data mining techniques in CRM

Wang Y, Kung L, Wang WYC, Cegielski CG (2018) An integrated big data analytics-enabled transformation model: application to health care. Inform Manag 55(1):64–79

Mirbabaie M, Stieglitz S, Frick NR (2021) Artificial intelligence in disease diagnostics: a critical review and classification on the current state of research guiding future direction. Heal Technol 11(4):693–731

Tang R, De Donato L, Besinović N, Flammini F, Goverde RM, Lin Z, Wang Z (2022) A literature review of artificial intelligence applications in railway systems. Transp Res Part C: Emerg Technol 140:103679

Singh G, Al’Aref SJ, Van Assen M, Kim TS, van Rosendael A, Kolli KK, Dwivedi A, Maliakal G, Pandey M, Wang J, Do V (2018) Machine learning in cardiac CT: basic concepts and contemporary data. J Cardiovasc Comput Tomograph 12(3):192–201

Kim KJ, Tagkopoulos I (2019) Application of machine learning in rheumatic disease research. Korean J Intern Med 34(4):708

Liu B (2011) Web data mining: exploring hyperlinks, contents, and usage data. Spriger, Berlin


Haykin S, Lippmann R (1994) Neural networks, a comprehensive foundation. Int J Neural Syst 5(4):363–364

Gupta M, Pandya SD (2022) A comparative study on supervised machine learning algorithm. Int J Res Appl Sci Eng Technol (IJRASET) 10(1):1023–1028

Ray S (2019) A quick review of machine learning algorithms. In: 2019 international conference on machine learning, big data, cloud and parallel computing (COMITCon) (pp 35–39). IEEE

Srivastava A, Saini S, & Gupta D (2019) Comparison of various machine learning techniques and its uses in different fields. In: 2019 3rd international conference on electronics, communication and aerospace technology (ICECA) (pp 81–86). IEEE

Park HA (2013) An introduction to logistic regression: from basic concepts to interpretation with particular attention to nursing domain. J Korean Acad Nurs 43(2):154–164

Obulesu O, Mahendra M, & Thrilok Reddy M (2018) Machine learning techniques and tools: a survey. In: 2018 international conference on inventive research in computing applications (ICIRCA) (pp 605–611). IEEE

Dhall D, Kaur R, & Juneja M (2020) Machine learning: a review of the algorithms and its applications. Proceedings of ICRIC 2019: recent innovations in computing 47–63

Yang F J (2019) An extended idea about Decision Trees. In: 2019 international conference on computational science and computational intelligence (CSCI) (pp 349–354). IEEE

Eesa AS, Orman Z, Brifcani AMA (2015) A novel feature-selection approach based on the cuttlefish optimization algorithm for intrusion detection systems. Expert Syst Appl 42(5):2670–2679

Shamim A, Hussain H, & Shaikh M U (2010) A framework for generation of rules from Decision Tree and decision table. In: 2010 international conference on information and emerging technologies (pp 1–6). IEEE

Eesa AS, Abdulazeez AM, Orman Z (2017) A dids based on the combination of cuttlefish algorithm and Decision Tree. Sci J Univ Zakho 5(4):313–318

Bakyarani ES, Srimathi H, Bagavandas M (2019) A survey of machine learning algorithms in health care. Int J Sci Technol Res 8(11):223

Resende PAA, Drummond AC (2018) A survey of random forest based methods for intrusion detection systems. ACM Comput Surv (CSUR) 51(3):1–36

Breiman L (2001) Random forests. Mach learn 45:5–32

Ho TK (1998) The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell 20(8):832–844

Hofmann M, & Klinkenberg R (2016) RapidMiner: data mining use cases and business analytics applications. CRC Press

Chow CKCN, Liu C (1968) Approximating discrete probability distributions with dependence trees. IEEE Trans Inf Theory 14(3):462–467

Burges CJ (1998) A tutorial on support vector machines for pattern recognition. Data Min Knowl Disc 2(2):121–167

Han J, Pei J, Kamber M (1999) Data mining: concepts and techniques. 2011

Cortes C, Vapnik V (1995) Support-vector networks. Mach learn 20:273–297

Aldahiri A, Alrashed B, Hussain W (2021) Trends in using IoT with machine learning in health prediction system. Forecasting 3(1):181–206

Sarker IH (2021) Machine learning: Algorithms, real-world applications and research directions. SN Comput Sci 2(3):160

Ting K M, & Zheng Z (1999) Improving the performance of boosting for naive Bayesian classification. In: Methodologies for knowledge discovery and data mining: third Pacific-Asia conference, PAKDD-99 Beijing, China, Apr 26–28, 1999 proceedings 3 (pp 296–305). Springer Berlin Heidelberg

Oladipo ID, AbdulRaheem M, Awotunde JB, Bhoi AK, Adeniyi EA, Abiodun MK (2022) Machine learning and deep learning algorithms for smart cities: a start-of-the-art review. In: IoT and IoE driven smart cities, pp 143–162

Shailaja K, Seetharamulu B, & Jabbar M A Machine learning in healthcare: a review. In: 2018 second international conference on electronics, communication and aerospace technology (ICECA) 2018 Mar 29 (pp 910–914)

Mahesh B (2020) Machine learning algorithms-a review. Int J Sci Res (IJSR) 9:381–386

Greene D, Cunningham P, & Mayer R (2008) Unsupervised learning and clustering. Mach learn Techn Multimed: Case Stud Organ Retriev 51–90

Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice-Hall, Inc, USA

Kodinariya TM, Makwana PR (2013) Review on determining number of cluster in K-means clustering. Int J 1(6):90–95

Smith LI (2002) A tutorial on principal components analysis

Mishra SP, Sarkar U, Taraphder S, Datta S, Swain D, Saikhom R, Laishram M (2017) Multivariate statistical data analysis-principal component analysis (PCA). Int J Livestock Res 7(5):60–78

Kamani M, Farzin Haddadpour M, Forsati R, and Mahdavi M (2019) "Efficient Fair Principal Component Analysis." arXiv e-prints: arXiv-1911.

Dey A (2016) Machine learning algorithms: a review. Int J Comput Sci Inf Technol 7(3):1174–1179

Agrawal R, Imieliński T, & Swami A (1993) Mining association rules between sets of items in large databases. In: proceedings of the 1993 ACM SIGMOD international conference on Management of data (pp 207–216)

Agrawal R, & Srikant R (1994) Fast algorithms for mining association rules. In: Proceeding of 20th international conference very large data bases, VLDB (Vol 1215, pp 487-499)

Singh J, Ram H, Sodhi DJ (2013) Improving efficiency of apriori algorithm using transaction reduction. Int J Sci Res Publ 3(1):1–4

Al-Maolegi M, & Arkok B (2014) An improved Apriori algorithm for association rules. arXiv preprint arXiv:1403.3948

Abaya SA (2012) Association rule mining based on Apriori algorithm in minimizing candidate generation. Int J Sci Eng Res 3(7):1–4

Coronato A, Naeem M, De Pietro G, Paragliola G (2020) Reinforcement learning for intelligent healthcare applications: a survey. Artif Intell Med 109:101964

Watkins CJ, Dayan P (1992) Q-learning. Mach Learn 8:279–292

Jang B, Kim M, Harerimana G, Kim JW (2019) Q-learning algorithms: a comprehensive classification and applications. IEEE access 7:133653–133667

Achille A, Soatto S (2018) Information dropout: Learning optimal representations through noisy computation. IEEE Trans Pattern Anal Mach Intell 40(12):2897–2905

Williams G, Wagener N, Goldfain B, Drews P, Rehg J M, Boots B, & Theodorou E A (2017) Information theoretic MPC for model-based reinforcement learning. In: 2017 IEEE international conference on robotics and automation (ICRA) (pp 1714–1721). IEEE

Wilkes JT, Gallistel CR (2017) Information theory, memory, prediction, and timing in associative learning. Comput Models Brain Behav 29:481–492

Ning Y, Jia J, Wu Z, Li R, An Y, Wang Y, Meng H (2017) Multi-task deep learning for user intention understanding in speech interaction systems. In: Proceedings of the AAAI conference on artificial intelligence (Vol 31, No. 1)

Shi X, Gao Z, Lausen L, Wang H, Yeung DY, Wong WK, Woo WC (2017) Deep learning for precipitation nowcasting: a benchmark and a new model. In: Guyon I, Von Luxburg U, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (Eds) Advances in neural information processing systems, vol 30. Curran Associates, Inc.,. https://proceedings.neurips.cc/paper_files/paper/2017/file/a6db4ed04f1621a119799fd3d7545d3d-Paper.pdf

Juang CF, Lu CM (2009) Ant colony optimization incorporated with fuzzy Q-learning for reinforcement fuzzy control. IEEE Trans Syst, Man, Cybernet-Part A: Syst Humans 39(3):597–608

Świechowski M, Godlewski K, Sawicki B, Mańdziuk J (2022) Monte Carlo tree search: a review of recent modifications and applications. Artif Intell Rev 56:1–66

Lizotte DJ, Laber EB (2016) Multi-objective Markov decision processes for data-driven decision support. J Mach Learn Res 17(1):7378–7405


Silver D, Huang A, Maddison CJ, Guez A, Sifre L, Van Den Driessche G, Hassabis D (2016) Mastering the game of go with deep neural networks and tree search. Nature 529(7587):484–489

Browne CB, Powley E, Whitehouse D, Lucas SM, Cowling PI, Rohlfshagen P, Colton S (2012) A survey of monte carlo tree search methods. IEEE Trans Comput Intell AI Games 4(1):1–43

Ling ZH, Kang SY, Zen H, Senior A, Schuster M, Qian XJ, Deng L (2015) Deep learning for acoustic modeling in parametric speech generation: a systematic review of existing techniques and future trends. IEEE Signal Process Magaz 32(3):35–52

Schmidhuber J (2015) Deep learning in neural networks: an overview. Neural Netw 61:85–117

Yu D, Deng L (2010) Deep learning and its applications to signal and information processing [exploratory dsp]. IEEE Signal Process Mag 28(1):145–154

Hinton GE, Osindero S, Teh YW (2006) A fast learning algorithm for deep belief nets. Neural Comput 18(7):1527–1554

Goyal P, Pandey S, Jain K, Goyal P, Pandey S, Jain K (2018) Introduction to natural language processing and deep learning. Deep Learn Nat Language Process: Creat Neural Netw Python 1–74. https://doi.org/10.1007/978-1-4842-3685-7

Mathew A, Amudha P, Sivakumari S (2021) Deep learning techniques: an overview. Adv Mach Learn Technol Appl: Proc AMLTA 2020:599–608

Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT press, USA

Gomes L (2014) Machine-learning maestro Michael Jordan on the delusions of big data and other huge engineering efforts. IEEE Spectrum 20. https://spectrum.ieee.org/machinelearning-maestro-michael-jordan-on-the-delusions-of-big-data-and-other-huge-engineering-efforts

Huang G, Liu Z, Van Der Maaten L, & Weinberger K Q (2017) Densely connected convolutional networks. In: proceedings of the IEEE conference on computer vision and pattern recognition (pp 4700–4708)

Yap MH, Pons G, Marti J, Ganau S, Sentis M, Zwiggelaar R, Marti R (2017) Automated breast ultrasound lesions detection using convolutional neural networks. IEEE J Biomed Health Inform 22(4):1218–1226

Hayashi Y (2019) The right direction needed to develop white-box deep learning in radiology, pathology, and ophthalmology: a short review. Front Robot AI 6:24

Alom MZ, Taha TM, Yakopcic C, Westberg S, Sidike P, Nasrin MS, Asari VK (2019) A state-of-the-art survey on deep learning theory and architectures. Electronics 8(3):292

Schmidhuber J, Hochreiter S (1997) Long short-term memory. Neural Comput 9(8):1735–1780

Smagulova K, James AP (2019) A survey on LSTM memristive neural network architectures and applications. Eur Phys J Spec Topics 228(10):2313–2324

Setyanto A, Laksito A, Alarfaj F, Alreshoodi M, Oyong I, Hayaty M, Kurniasari L (2022) Arabic language opinion mining based on long short-term memory (LSTM). Appl Sci 12(9):4140

Lindemann B, Müller T, Vietz H, Jazdi N, Weyrich M (2021) A survey on long short-term memory networks for time series prediction. Procedia CIRP 99:650–655

Cui Z, Ke R, Pu Z, & Wang Y (2018) Deep bidirectional and unidirectional LSTM recurrent neural network for network-wide traffic speed prediction. arXiv preprint arXiv:1801.02143

Villegas R, Yang J, Zou Y, Sohn S, Lin X, & Lee H (2017) Learning to generate long-term future via hierarchical prediction. In: international conference on machine learning (pp 3560–3569). PMLR

Gensler A, Henze J, Sick B, & Raabe N (2016) Deep learning for solar power forecasting—an approach using autoencoder and LSTM neural networks. In: 2016 IEEE international conference on systems, man, and cybernetics (SMC) (pp 002858–002865). IEEE

Lindemann B, Fesenmayr F, Jazdi N, Weyrich M (2019) Anomaly detection in discrete manufacturing using self-learning approaches. Procedia CIRP 79:313–318

Kalchbrenner N, Danihelka I, & Graves A (2015) Grid long short-term memory. arXiv preprint arXiv:1507.01526

Cheng B, Xu X, Zeng Y, Ren J, Jung S (2018) Pedestrian trajectory prediction via the social-grid LSTM model. J Eng 2018(16):1468–1474

Veličković P, Karazija L, Lane N D, Bhattacharya S, Liberis E, Liò P & Vegreville M (2018) Cross-modal recurrent models for weight objective prediction from multimodal time-series data. In: proceedings of the 12th EAI international conference on pervasive computing technologies for healthcare (pp 178–186)

Wang J, Hu X (2021) Convolutional neural networks with gated recurrent connections. IEEE Trans Pattern Anal Mach Intell 44(7):3421–3435

Liang M, & Hu X (2015) Recurrent convolutional neural network for object recognition. In: proceedings of the IEEE conference on computer vision and pattern recognition (pp 3367–3375)

Liang M, Hu X, Zhang B (2015) Convolutional neural networks with intra-layer recurrent connections for scene labeling. In: Cortes C, Lawrence N, Lee D, Sugiyama M, Garnett R (Eds) Advances in Neural Information Processing Systems, vol 28. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2015/file/9cf81d8026a9018052c429cc4e56739b-Paper.pdf

Fernandez B, Parlos A G, & Tsai W K (1990) Nonlinear dynamic system identification using artificial neural networks (ANNs). In: 1990 IJCNN international joint conference on neural networks (pp 133–141). IEEE

Puskorius GV, Feldkamp LA (1994) Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks. IEEE Trans Neural Netw 5(2):279–297

Rumelhart DE (1986) Learning representations by error propagation. In: DE Rumelhart and JL McClelland & PDP Research Group, eds, Parallel distributed processing: explorations in the microstructure of cognition. Bradford Books MITPress, Cambridge, Mass

Krishnamoorthi R, Joshi S, Almarzouki H Z, Shukla P K, Rizwan A, Kalpana C, & Tiwari B (2022) A novel diabetes healthcare disease prediction framework using machine learning techniques. J Healthcare Eng. https://doi.org/10.1155/2022/1684017

Edeh MO, Khalaf OI, Tavera CA, Tayeb S, Ghouali S, Abdulsahib GM, Louni A (2022) A classification algorithm-based hybrid diabetes prediction model. Front Publ Health 10:829510

Iwendi C, Huescas C G Y, Chakraborty C, & Mohan S (2022) COVID-19 health analysis and prediction using machine learning algorithms for Mexico and Brazil patients. J Experiment Theor Artif Intell 1–21. https://doi.org/10.1080/0952813X.2022.2058097

Lu H, Uddin S, Hajati F, Moni MA, Khushi M (2022) A patient network-based machine learning model for disease prediction: the case of type 2 diabetes mellitus. Appl Intell 52(3):2411–2422

Chugh M, Johari R, & Goel A (2022) MATHS: machine learning techniques in healthcare system. In: international conference on innovative computing and communications: proceedings of ICICC 2021, Volume 3 (pp 693–702). Springer Singapore

Deberneh HM, Kim I (2021) Prediction of type 2 diabetes based on machine learning algorithm. Int J Environ Res Public Health 18(6):3317

Gupta S, Verma H K, & Bhardwaj D (2021) Classification of diabetes using Naive Bayes and support vector machine as a technique. In: operations management and systems engineering: select proceedings of CPIE 2019 (pp 365–376). Springer Singapore

Islam M T, Rafa S R, & Kibria M G (2020) Early prediction of heart disease using PCA and hybrid genetic algorithm with k-means. In: 2020 23rd international conference on computer and information technology (ICCIT) (pp 1–6). IEEE

Qawqzeh Y K, Bajahzar A S, Jemmali M, Otoom M M, Thaljaoui A (2020) Classification of diabetes using photoplethysmogram (PPG) waveform analysis: logistic regression modeling. BioMed Res Int. https://doi.org/10.1155/2020/3764653

Grampurohit S, Sagarnal C (2020) Disease prediction using machine learning algorithms. In: 2020 international conference for emerging technology (INCET) (pp 1–7). IEEE

Moturi S, Srikanth Vemuru DS (2020) Classification model for prediction of heart disease using correlation coefficient technique. Int J 9(2). https://doi.org/10.30534/ijatcse/2020/185922020

Barik S, Mohanty S, Rout D, Mohanty S, Patra A K, & Mishra A K (2020) Heart disease prediction using machine learning techniques. In: advances in electrical control and signal systems: select proceedings of AECSS 2019 (pp 879–888). Springer, Singapore

Princy R J P, Parthasarathy S, Jose P S H, Lakshminarayanan A R, & Jeganathan S (2020) Prediction of cardiac disease using supervised machine learning algorithms. In: 2020 4th international conference on intelligent computing and control systems (ICICCS) (pp 570–575). IEEE

Saw M, Saxena T, Kaithwas S, Yadav R, & Lal N (2020) Estimation of prediction for getting heart disease using logistic regression model of machine learning. In: 2020 international conference on computer communication and informatics (ICCCI) (pp 1–6). IEEE

Soni VD (2020) Chronic disease detection model using machine learning techniques. Int J Sci Technol Res 9(9):262–266

Indrakumari R, Poongodi T, Jena SR (2020) Heart disease prediction using exploratory data analysis. Procedia Comput Sci 173:130–139

Wu C S M, Badshah M, & Bhagwat V (2019) Heart disease prediction using data mining techniques. In: proceedings of the 2019 2nd international conference on data science and information technology (pp 7–11)

Tarawneh M, & Embarak O (2019) Hybrid approach for heart disease prediction using data mining techniques. In: advances in internet, data and web technologies: the 7th international conference on emerging internet, data and web technologies (EIDWT-2019) (pp 447–454). Springer International Publishing

Rahman AS, Shamrat FJM, Tasnim Z, Roy J, Hossain SA (2019) A comparative study on liver disease prediction using supervised machine learning algorithms. Int J Sci Technol Res 8(11):419–422

Gonsalves A H, Thabtah F, Mohammad R M A, & Singh G (2019) Prediction of coronary heart disease using machine learning: an experimental analysis. In: proceedings of the 2019 3rd international conference on deep learning technologies (pp 51–56)

Khan A, Uddin S, Srinivasan U (2019) Chronic disease prediction using administrative data and graph theory: the case of type 2 diabetes. Expert Syst Appl 136:230–241

Alanazi R (2022) Identification and prediction of chronic diseases using machine learning approach. J Healthcare Eng. https://doi.org/10.1155/2022/2826127

Gouda W, Almurafeh M, Humayun M, Jhanjhi NZ (2022) Detection of COVID-19 based on chest X-rays using deep learning. Healthcare 10(2):343

Kumar A, Satyanarayana Reddy S S, Mahommad G B, Khan B, & Sharma R (2022) Smart healthcare: disease prediction using the cuckoo-enabled deep classifier in IoT framework. Sci Progr. https://doi.org/10.1155/2022/2090681

Monday H N, Li J P, Nneji G U, James E C, Chikwendu I A, Ejiyi C J, & Mgbejime G T (2021) The capability of multi resolution analysis: a case study of COVID-19 diagnosis. In: 2021 4th international conference on pattern recognition and artificial intelligence (PRAI) (pp 236–242). IEEE

Al Rahhal MM, Bazi Y, Jomaa RM, Zuair M, Al Ajlan N (2021) Deep learning approach for COVID-19 detection in computed tomography images. Cmc-Comput Mater Continua 67(2):2093–2110

Men L, Ilk N, Tang X, Liu Y (2021) Multi-disease prediction using LSTM recurrent neural networks. Expert Syst Appl 177:114905

Ahmad U, Song H, Bilal A, Mahmood S, Alazab M, Jolfaei A & Saeed U (2021) A novel deep learning model to secure internet of things in healthcare. Mach Intell Big Data Anal Cybersec Appl 341–353

Mansour RF, El Amraoui A, Nouaouri I, Díaz VG, Gupta D, Kumar S (2021) Artificial intelligence and internet of things enabled disease diagnosis model for smart healthcare systems. IEEE Access 9:45137–45146

Sevi M, & Aydin İ (2020) COVID-19 detection using deep learning methods. In: 2020 international conference on data analytics for business and industry: way towards a sustainable economy (ICDABI) (pp 1–6). IEEE

Martinsson J, Schliep A, Eliasson B, Mogren O (2020) Blood glucose prediction with variance estimation using recurrent neural networks. J Healthc Inform Res 4:1–18

Zhang J, Xie Y, Pang G, Liao Z, Verjans J, Li W, Xia Y (2020) Viral pneumonia screening on chest X-rays using confidence-aware anomaly detection. IEEE Trans Med Imaging 40(3):879–890

Hemdan E E D, Shouman M A, & Karar M E (2020) Covidx-net: a framework of deep learning classifiers to diagnose covid-19 in x-ray images. arXiv preprint arXiv:2003.11055

Zhu T, Li K, Chen J, Herrero P, Georgiou P (2020) Dilated recurrent neural networks for glucose forecasting in type 1 diabetes. J Healthc Inform Res 4:308–324

Cheon S, Kim J, Lim J (2019) The use of deep learning to predict stroke patient mortality. Int J Environ Res Public Health 16(11):1876

Li K, Liu C, Zhu T, Herrero P, Georgiou P (2019) GluNet: a deep learning framework for accurate glucose forecasting. IEEE J Biomed Health Inform 24(2):414–423

Wang W, Tong M, Yu M (2020) Blood glucose prediction with VMD and LSTM optimized by improved particle swarm optimization. IEEE Access 8:217908–217916

Rashid N, Hossain M A F, Ali M, Sukanya M I, Mahmud T, & Fattah S A (2020) Transfer learning based method for COVID-19 detection from chest X-ray images. In: 2020 IEEE region 10 conference (TENCON) (pp 585–590). IEEE

Arora P, Kumar H, Panigrahi BK (2020) Prediction and analysis of COVID-19 positive cases using deep learning models: a descriptive case study of India. Chaos, Solitons Fractals 139:110017


Zaitcev A, Eissa MR, Hui Z, Good T, Elliott J, Benaissa M (2020) A deep neural network application for improved prediction of in type 1 diabetes. IEEE J Biomed Health Inform 24(10):2932–2941

Naz H, Ahuja S (2020) Deep learning approach for diabetes prediction using PIMA Indian dataset. J Diabetes Metab Disord 19:391–403


Acknowledgements

Author information

Authors and Affiliations

Department of Information Systems and Technology, Faculty of Graduate Studies for Statistical Research, Cairo University, Giza, Egypt

Mohammed Badawy & Nagy Ramadan

Department of Computer Sciences, Faculty of Graduate Studies for Statistical Research, Cairo University, Giza, Egypt

Hesham Ahmed Hefny


Contributions

MB wrote the main text of the manuscript; NR and HAH revised the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Mohammed Badawy .

Ethics declarations

Competing interests.

The authors declare that they have no competing interests. All authors approved the final manuscript.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Badawy, M., Ramadan, N. & Hefny, H.A. Healthcare predictive analytics using machine learning and deep learning techniques: a survey. Journal of Electrical Systems and Inf Technol 10 , 40 (2023). https://doi.org/10.1186/s43067-023-00108-y


Received : 27 December 2022

Accepted : 31 July 2023

Published : 29 August 2023

DOI : https://doi.org/10.1186/s43067-023-00108-y


  • Healthcare prediction
  • Artificial intelligence (AI)
  • Machine learning (ML)
  • Deep learning (DL)
  • Medical diagnosis

J Biomed Phys Eng. 2022 Jun; 12(3)

Prediction of Breast Cancer using Machine Learning Approaches

Reza Rabiei

1 PhD, Department of Health Information Technology and Management, School of Allied Medical Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran

Seyed Mohammad Ayyoubzadeh

2 PhD, Department of Health Information Technology and Management, School of Allied Medical Sciences, Tehran University of Medical Science, Tehran, Iran

Solmaz Sohrabei

3 MSc, Department Deputy of Development, Management and Resources, Office of Statistic and Information Technology Management, Zanjan University of Medical Sciences, Zanjan, Iran

Marzieh Esmaeili

Alireza Atashi

4 PhD, Department of E-Health, Virtual School, Tehran University of Medical Sciences, Medical Informatics Research Group, Clinical Research Department, Breast Cancer Research Center, Motamed Cancer Institute, ACECR, Tehran, Iran

Background:

Breast cancer is considered one of the most common cancers in women caused by various clinical, lifestyle, social, and economic factors. Machine learning has the potential to predict breast cancer based on features hidden in data.

Objective:

This study aimed to predict breast cancer using different machine-learning approaches applying demographic, laboratory, and mammographic data.

Material and Methods:

In this analytical study, a database of 5,178 independent records, 25% of which belonged to breast cancer patients, with 24 attributes in each record, was obtained from the Motamed Cancer Institute (ACECR), Tehran, Iran. Random forest (RF), neural network (MLP), gradient boosting trees (GBT), and genetic algorithms (GA) were used in this study. Models were initially trained with demographic and laboratory features (20 features). The models were then trained with all demographic, laboratory, and mammographic features (24 features) to measure the effectiveness of mammography features in predicting breast cancer.

Results:

RF presented higher performance compared to other techniques (accuracy 80%, sensitivity 95%, specificity 80%, and the area under the curve (AUC) 0.56). Gradient boosting (AUC=0.59) showed a stronger performance compared to the neural network.

Conclusion:

Combining multiple risk factors in modeling for breast cancer prediction could help the early diagnosis of the disease with necessary care plans. Collection, storage, and management of different data and intelligent systems based on multiple factors for predicting breast cancer are effective in disease management.

Introduction

Breast cancer is considered a multifactorial disease and the most common cancer in women worldwide [ 1 , 2 ], accounting for approximately 30% of all female cancers [ 3 , 4 ]: about 1.5 million women are diagnosed with breast cancer each year, and roughly 500,000 women die from the disease worldwide. Over the past 30 years, the incidence of this disease has increased, while the death rate has decreased; the reduction in mortality attributed to mammography screening is estimated at 20%, and that attributed to improved cancer treatment at 60% [ 5 , 6 ].

Diagnostic mammography can assess abnormal breast tissue in patients with subtle and inconspicuous signs of malignancy. Because of the large number of images, however, this method cannot be used effectively to assess cancer-suspected areas. According to one report, approximately 50% of breast cancers were not detected in screenings of women with very dense breast tissue [ 7 ]. Moreover, about a quarter of women with breast cancer receive a negative result within two years of screening. Therefore, the early and timely diagnosis of breast cancer is crucial [ 8 ].

Most mammography-based breast cancer screening is performed at regular intervals, usually annually or every two years, for all women. This "one fixed screening program for everyone" is not effective in diagnosing cancer at the individual level and may impair the effectiveness of screening programs [ 9 ]. On the other hand, experts suggest that considering other risk factors along with mammography screening can help diagnose women at risk more accurately [ 9 - 11 ]. Moreover, effective risk prediction through modeling can not only help radiologists set up personalized screening for patients and encourage them to participate in programs for early detection but also help identify high-risk patients [ 12 , 13 ].

Machine learning, as a modeling approach, represents the process of extracting knowledge from data and discovering hidden relationships [ 14 ]; it has been widely used in healthcare in recent years [ 15 ] to predict different diseases [ 16 - 18 ]. Some studies used only demographic risk factors (lifestyle and laboratory data) to predict breast cancer [ 19 , 20 ], several studies made predictions based on mammographic images [ 21 ] or used data from patient biopsies [ 22 ], and others showed the application of genetic data in predicting breast cancer [ 23 ].

A major challenge in predicting breast cancer is creating a model that addresses all known risk factors [ 24 - 26 ]. Current prediction models might focus only on the analysis of mammographic images or demographic risk factors without other critical factors. In addition, these models, which are accurate enough for identifying high-risk women, could result in multiple screenings and invasive sampling with magnetic resonance imaging (MRI) and ultrasound, with the attendant financial and psychological burden for patients [ 27 - 29 ].

The effective prediction of breast cancer risk requires different factors, including demographic, laboratory, and mammographic risk factors [ 24 , 25 , 30 , 31 ]. Therefore, multifactorial models with many risk factors in their analysis can be effective in assessing the risk of breast cancer through more accurate analysis [ 32 , 33 ]. The current study aimed to predict breast cancer using different machine learning approaches considering various factors in modeling.

Material and Methods

In this analytical study, the database was obtained from a clinical breast cancer research center (Motamed cancer institute) in Tehran, Iran. The research was conducted in 4 stages: data collection, data pre-processing, modeling, and model evaluation.

Data Collection

In the first stage, 5178 records of people, referred to the research center over the past 10 years (2011-2021), were prepared retrospectively. Each record covered 24 features (11 demographic features, 9 laboratory features, and 4 mammography features) ( Table 1 ), all labeled to indicate the presence or absence of breast cancer, of which 1,295 records (25%) were identified as breast cancer.

Table 1. The relevant features of breast cancer

Feature name | Description | Type | Values
Age | Age at diagnosis | Demographic | <100 years
Age.menop | Age of menopause | Demographic | 38-65 years
First pregnancy | Age at first pregnancy | Demographic | 13-42 years
Age.menarch | Age of menarche | Demographic | 11-18 years
BMI | Body mass index | Demographic | Underweight (below 18.5)=0, Normal (18.5-24.9)=1, Overweight (25.0-29.9)=2, Obese (30.0 and above)=3
Lactation | Breastfeeding status | Demographic | 0-96 months
Physical Activity | Has regular physical activity | Demographic | Yes=1, No=0
Education | Academic education | Demographic | Illiterate=1, primary=2, high school=3, university=4
Life event stress | Life event status | Demographic | No=0, death of father=1, family problems=2, death of mother=3, death of child=4, death of husband=5, divorced=6
Smoking | Smoking status | Demographic | Yes=1, No=0
Marital | Marital status | Demographic | Single=0, other=1
Duration Ocp.used | Duration of oral contraceptive pill use | Laboratory | 0-120 months
Duration HRT used | Duration of hormone replacement therapy use | Laboratory | 0-120 months
Personal. Other. Cancer | Personal history of other cancer | Laboratory | No=0, ovary=1, endometrium=2, colon=3, meningioma=4, lymphoma=5
Family.BC | Family history of breast cancer | Laboratory | Yes=1, No=0
Exposure X-ray | X-ray exposure to chest | Laboratory | Negative=0, positive=1
Vitamin D3 | Amount of vitamin D in body | Laboratory | >10 mg=0 deficiency, 10-30 mg=1 insufficiency, 30-100 mg=2 sufficient, >100 mg=3 overdose
Biopsy | Pathology of biopsy | Laboratory | No malignancy detected=0, lobular carcinoma in situ=1, ductal carcinoma in situ=2, ductal carcinoma in situ=3, invasive lobular carcinoma=4, medullary=5, microinvasion=6
Hysterectomy | History of hysterectomy | Laboratory | Yes=1, No=0
Personal.BC | Personal breast cancer history | Laboratory | Yes=1, No=0, surgery=2, RT (Radio Therapy)=3
Breast density | Screening | Mammography | Fatty tissue=0, glandular and fibrous tissue=1, dense=2, heterogeneously dense/extremely dense=3
Micro lobulated | Screening | Mammography | None=0, Fibroadenoma=1, Papilloma=2, Phyllodes tumor=3, DCIS=4, IDC=5, ILC=6, Lactating and tubular adenomas=7
Circumscribed | Screening | Mammography | None=0, cysts=1, complicated cyst=2, clustered microcyst=3, solid mass=4
Micro calcification, Macro calcification | Screening | Mammography | Probably benign: punctate, intermediate=1; concern: coarse heterogeneous, amorphous=2; higher probability of malignancy: fine pleomorphic, fine linear/branching=3
Class | Breast Cancer |  | Malignant=1, benign=0

DCIS: Ductal carcinoma in situ, IDC: Invasive ductal carcinoma, ILC: Invasive lobular carcinoma

Data preprocessing

The second step was associated with data preprocessing, in which five records related to men were removed, and a total of 1290 records remained. Laboratory features that fell outside the considered range were corrected from the central registry, as the patients' laboratory results were available there. In addition, for records with missing values, the mode (most frequent value) was used. Finally, the Synthetic Minority Oversampling Technique (SMOTE) was used to balance the training data due to the difference in the number of records per study class.
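A minimal sketch of the balancing step, assuming placeholder data with the same shape as the study dataset and applying imbalanced-learn's SMOTE to the training split only:

```python
# Minimal sketch of the SMOTE balancing step described above; X and y are
# random placeholders standing in for the preprocessed breast-cancer records.
import numpy as np
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

X = np.random.rand(1290, 24)                    # 24 demographic/laboratory/mammography features
y = (np.random.rand(1290) < 0.25).astype(int)   # roughly 25% positive class, as in the dataset

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_train, y_train)
print("class counts after SMOTE:", np.bincount(y_bal))
```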

Modeling for breast cancer prediction

In the third step, the Scikit-Learn 0.18.2 library, NumPy v1.20, TPOT, and the open-source Python programming language were used for modeling. Three learners, i.e., random forest (RF), gradient boosting trees (GBT), and multi-layer perceptron (MLP), were applied to the dataset. In addition, K-fold cross-validation (K=3) was used to obtain the optimized hyper-parameters of each model in the genetic algorithm step. In the final evaluation, the train-test split method (75% for training and 25% for testing) was used to estimate the performance of the models more accurately. In this study, a genetic algorithm (GA) with a population size of 5, 50 offspring, and 10 generations, with the highest accuracy as the model-selection criterion, was used to optimize the parameter values. These models were first trained with demographic and laboratory features (20 features); the models were then trained with all demographic, laboratory, and mammography features (24 features) to measure the effect of mammography features in predicting breast cancer. The MLP hidden layer size was set to 10, and the alpha value for the training rate ranged from 0.01 to 0.2. The sigmoid and hyperbolic tangent functions were selected as activation functions, and the solver was set to a gradient-based optimization method, such as Adam or stochastic gradient descent (SGD), to find the optimal weights. In the GBT model, the learning rate ranged from 0.01 to 0.2, and the maximum depth was set to 3, 5, or 8; the learning level was set to 0.1, and the number of estimators for gradient boosting was 10. In the random forest (RF) model, the minimum number of samples required at a leaf node was set to 4 or 12, the number of estimators was 151, and the minimum number of samples required to split a node (min_samples_split) was set to 5 or 10. The block diagram of the methods is shown in Figure 1 .
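The sketch below illustrates this modeling stage with the three classifiers and the hyperparameter ranges quoted above, tuned with 3-fold cross-validation. The study used a genetic algorithm (TPOT) for this search; plain grid search stands in here only to keep the example self-contained, and the placeholder data and exact grids are assumptions.

```python
# Hedged sketch of the modeling stage: RF, GBT, and MLP with the hyperparameter
# ranges quoted above, tuned with 3-fold CV. Grid search stands in for the GA
# (TPOT) used in the paper; data are random placeholders.
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier

X = np.random.rand(1290, 24)                     # placeholder feature matrix
y = (np.random.rand(1290) < 0.25).astype(int)    # placeholder labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

search_spaces = {
    "Random Forest": (RandomForestClassifier(n_estimators=151, random_state=0),
                      {"min_samples_leaf": [4, 12], "min_samples_split": [5, 10]}),
    "Gradient Boosting": (GradientBoostingClassifier(n_estimators=10, random_state=0),
                          {"learning_rate": [0.01, 0.1, 0.2], "max_depth": [3, 5, 8]}),
    "MLP": (MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, random_state=0),
            {"alpha": [0.01, 0.1, 0.2], "activation": ["logistic", "tanh"],
             "solver": ["adam", "sgd"]}),
}

for name, (estimator, grid) in search_spaces.items():
    search = GridSearchCV(estimator, grid, cv=3, scoring="accuracy")
    search.fit(X_train, y_train)
    print(name, search.best_params_, f"test acc={search.score(X_test, y_test):.3f}")
```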

Figure 1. Block diagram of methods

Random Forest (RF)

As a non-parametric approach, RF performs classification at high speed for each set of data by building a large number of decision trees [ 34 ]. Each tree is grown on a random subset of input variables, and all the trees are then combined to draw better inferences from the variables [ 35 ].

Gradient Boosting Trees (GBT)

This algorithm belongs to the family of gradient boosting algorithms, with very good classification performance [ 36 ]. In this method, the trees are trained one after another; each subsequent tree is trained primarily on the data erroneously predicted by the previous tree. This process continuously reduces the model error, since each model sequentially corrects the weaknesses of the previous one [ 37 , 38 ].

Multi-Layer Perceptron (MLP)

As a deep artificial neural network, the MLP is composed of an input layer for receiving the signal, an output layer used for prediction, and, in between those two, hidden layers acting as the computation engine. The MLP is trained with a backpropagation algorithm and belongs to the family of supervised networks. In this network, data flow from the input nodes to the output nodes; if there is an error at the output, the error is propagated back from the output toward the input, and the weights are corrected accordingly. The most commonly used method for this is the backpropagation algorithm [ 39 , 40 ].

Genetic Algorithm (GA)

As a subset of evolutionary computing algorithms, the GA is directly associated with artificial intelligence and is used for solving optimization problems through a process of evolution [ 41 , 42 ]. To obtain the best answer, the GA applies the survival-of-the-fittest rule to a series of candidate solutions to pattern the best solution to the problem [ 43 , 44 ]. In each generation, the optimal solution is approached through a natural, biologically inspired process by selecting the best chromosomes to create the subsequent generation [ 45 ].

Model Evaluation

The test results on the database samples (confusion matrix) are shown in Table 2 . In the final stage, the performance of the created models was measured with different criteria. The classification of samples is one of the common bases for evaluating and measuring the ability of classifiers, including accuracy and the degree of separation between classes [ 46 ]. In this study, accuracy, sensitivity, specificity, and the area under the receiver operating characteristic (ROC) curve were used to measure the overall performance of the classifiers; a sketch of how these metrics follow from the confusion matrix is given after Table 2.

Table 2. Confusion matrix of a binomial classifier

                 | Predicted Negative | Predicted Positive
Actual Negative  | TN                 | FP
Actual Positive  | FN                 | TP

TN: True Negative, FN: False Negative, FP: False Positive, TP: True Positive
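As a reference for how the reported metrics follow from the confusion matrix in Table 2, a short sketch is given below, with toy data standing in for a fitted model's predictions.

```python
# Accuracy, sensitivity, specificity, and AUC derived from a binary confusion
# matrix (generic sketch; replace the toy data with a fitted model's output).
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def report(y_true, y_pred, y_score):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)      # true positive rate (recall)
    specificity = tn / (tn + fp)      # true negative rate
    return accuracy, sensitivity, specificity, roc_auc_score(y_true, y_score)

y_true = np.random.randint(0, 2, 100)     # toy labels
y_score = np.random.rand(100)             # toy predicted probabilities
print(report(y_true, (y_score > 0.5).astype(int), y_score))
```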

A total of 1290 records containing 24 demographic, laboratory, and mammographic features related to breast cancer were used in the study; the weights of the features, based on their degree of importance, are shown in Figure 2 (weights range between 0.0 and 1). Family history of breast cancer, personal history of breast cancer, breast density, and age at diagnosis are among the five most important factors in the diagnosis of this disease.

Figure 2. The weight of the features in breast cancer prediction

Based on the area under the ROC curve, the Gradient Boosting Trees (GBT) model showed the highest performance. The modeling results using RF, GBT, and MLP are shown in Table 3 , and their ROC curves are compared in Figure 3 and Table 4 .

Table 3. Performance comparison of the breast cancer prediction models

Model                  | Features                   | AUC  | Sensitivity (%) | Specificity (%) | Accuracy (%)
Random Forest          | Demographics               | 0.53 | 93              | 83              | 79
Random Forest          | Demographics + Mammography | 0.53 | 95              | 83              | 80
Gradient Boosting      | Demographics               | 0.59 | 63              | 87              | 62
Gradient Boosting      | Demographics + Mammography | 0.59 | 82              | 86              | 74
Multi-Layer Perceptron | Demographics               | 0.56 | 78              | 85              | 71
Multi-Layer Perceptron | Demographics + Mammography | 0.56 | 82              | 84              | 73

AUC: Area under the ROC curve, ROC: Receiver operating characteristic

Figure 3. Receiver operating characteristic (ROC) curve of models

Table 4. Area under the receiver operating characteristic (ROC) curve

Test Result Model(s) | Area
GBT                  | 0.59
MLP                  | 0.56
RF                   | 0.53

GBT: Gradient Boosting Tree, MLP: Multi-Layer-Perceptron, RF: Random Forest

According to the findings of the current study, the mammographic features, along with the other features, could improve the performance of the models. The RF model showed the highest sensitivity (95%), but given the sensitive nature of breast cancer diagnosis, models with higher specificity, such as gradient boosting (86%), may be more efficient.

In studies by Rosner et al. [ 47 , 48 ], the findings showed that family and personal history of breast cancer were two of the key influential factors in breast cancer, which is consistent with the findings of the current study, as these two factors had the highest weights (0.92 and 0.89) compared to other factors. Breast density and age influence tumor appearance and increase the proportion of breast cancers [ 49 ], with weights of 0.80 and 0.80, respectively. The hysterectomy feature was also used along with other risk factors that could influence the performance of the models; the study by Chow et al. assessed the risk of breast cancer after hysterectomy and showed a statistically significant association between hysterectomy and breast cancer [ 50 ].

The use of optimization algorithms with feature weighting and the proper adjustment of classification parameters can improve the performance of classification algorithms [ 51 ]. Studies have reported that classifiers that used a GA for feature selection performed better than those that did not. For the prediction of breast cancer, Bhattacharya et al. [ 52 ] applied three machine learning algorithms and used a GA for feature selection; the findings showed that the GA improved the performance of the models created. In a study by Sakri et al. [ 53 ] to predict breast cancer recurrence in 198 instances with 34 clinical attributes, a GA was used for optimization; the Naive Bayes accuracy, sensitivity, specificity, and area under the ROC curve were reported at 70%, 81%, 79%, and 0.82, respectively. Kumar et al. [ 54 ] used a GA on a breast cancer dataset containing 611 records with 10 features to predict breast cancer survival; the reported accuracy and ROC were 88% and 0.966 for the GA, showing better performance than Naive Bayes, DT, and K-nearest neighbor (KNN). In a study conducted to classify the masses observed in mammographic images, Thawkar and Ingolikar [ 55 ] used a dataset composed of 651 records with 25 mammography features; in their study, the models were optimized by a GA, and the ROC, accuracy, sensitivity, and specificity were 0.974, 95%, 96.14%, and 93.94% for RF, respectively. In the studies noted above, the modeling was performed using one set of influencing factors.

Some machine-learning studies [ 56 - 62 ] reported higher accuracy (100%) and sensitivity (100%) for breast cancer prediction compared to the present study, which is likely due to the use of different databases, such as "Wisconsin" and "SEER". Similar to the database used in the current study, some studies used databases from specific medical or research centers. Behravan and Hartikainen [ 33 ] predicted breast cancer using a database containing 695 records, including demographic risk factors and genetic data; their findings suggested that an XGBoost model using different factors showed improved performance (AUC=0.788) compared to a model with just one set of factors (AUC=0.678). In a study by Feld et al. [ 10 ] to predict breast cancer, the modeling was performed on 738 records, including demographic, genetic, and abnormal mammographic data, and the reported AUC was 0.75. Other studies also suggest that considering different factors in modeling improves performance. For example, in Ayvaci et al. [ 63 ], the analysis of demographic, mammography, and biopsy data using logistic regression resulted in an AUC of 0.84. Rajendran et al. [ 64 ] analyzed 2.4 million records of mammography screening and demographic risk factors associated with breast cancer to predict breast cancer using Naive Bayes, RF, and C4.5; the findings indicated the highest AUC (0.993) for Naive Bayes.

The findings of a study by Atashi et al. [ 65 ], conducted on a database with 4004 records including demographic risk factors, showed a higher performance of the neural network (sensitivity=80.9%, specificity=99.8%, accuracy=62.8%) compared to other approaches, such as C5.0. The study by Mosayebi et al. [ 66 ] was conducted on a database with 5471 records, including demographic and laboratory features, and reported for C5.0 an accuracy of 82%, sensitivity of 86%, and specificity of 77%. In a study by Jalali et al. [ 67 ] performed on 644 records (with 10 clinical features), the support vector machine (SVM) showed the highest sensitivity (94.33%), accuracy (93.72%), and specificity (92.26%). Afshar et al. [ 68 ] studied the survival of breast cancer patients using a dataset with 856 records and 15 clinical features with machine learning models; in this study, C5.0 showed the highest sensitivity (92.21%) and accuracy (84%). In a similar study by Nourelahi et al. [ 69 ] to predict patient survival on a database consisting of 5673 cases and 41 clinical features, logistic regression achieved a sensitivity of 71.85%, specificity of 72.83%, and accuracy of 72.49%. Tapak et al. [ 70 ] performed a study on a database with 550 records to predict the survival and metastasis of breast cancer and reported a sensitivity and specificity of 99% for AdaBoost. The findings of the current study suggest that modeling with a variety of related risk factors from different sources could improve the performance of models in breast cancer prediction.

The limitations of the current study are as follows: the modeling was based on records from only one database, and genetic data that could influence the findings were not available. Nevertheless, different machine learning approaches were applied to demographic, laboratory, and mammography features, which made it possible to compare the performance of these approaches in predicting breast cancer.

The proposed machine-learning approaches could support the prediction of breast cancer; early detection of this disease could help slow its progression and reduce mortality through appropriate therapeutic interventions at the right time. Applying different machine learning approaches, accessing larger datasets from different institutions (multi-center studies), and considering key features from a variety of relevant data sources could further improve modeling performance.

Authors’ Contribution

R. Rabiei proposed the conceptualization and design, supervised the modeling, and contributed to manuscript drafting, editing, and critical review. Data modeling, interpretation, and manuscript drafting were done by SM. Ayyoubzadeh. S. Sohrabei contributed to the conceptualization and design, data modeling and interpretation, manuscript drafting, and editing. M. Esmaeili contributed to data interpretation and manuscript drafting. A. Atashi contributed to data collection and manuscript drafting. All the authors read, modified, and approved the final version of the manuscript.

Ethical Approval

This study was approved by the Clinical Research Department, Breast Cancer Research Center, Motamed Cancer Institute (ACECR), Tehran, Iran, with Approval ID IR.ACECR.IBCRC.REC.1394.68.

Informed consent

We used anonymous data for modeling and no consent was required for conducting this study.

Funding

There was no funding for conducting this study.

Conflict of Interest


Machine Learning Methods in Weather and Climate Applications: A Survey


1. Introduction

  • Limited Scope: Existing surveys predominantly focus either on short-term weather forecasting or on medium-to-long-term climate predictions. There is a notable absence of comprehensive surveys that endeavour to bridge these two time scales. In addition, current investigations tend to focus narrowly on specific methods, such as simple neural networks, thereby neglecting combinations of methods.
  • Lack of Model Details: Many existing studies offer only generalized viewpoints and lack a systematic analysis of the specific models employed in weather and climate prediction. This absence creates a barrier for researchers aiming to understand the intricacies and efficacy of individual methods.
  • Neglect of Recent Advances: Despite rapid developments in machine learning and computational techniques, existing surveys have not kept pace with these advancements. The paucity of information on cutting-edge technologies stymies the progression of research in this interdisciplinary field.
  • Comprehensive scope: Unlike research endeavors that restrict their inquiry to a singular temporal scale, our survey provides a comprehensive analysis that amalgamates short-term weather forecasting with medium- and long-term climate predictions. In total, 20 models were surveyed, of which a select subset of eight were chosen for in-depth scrutiny. These models are discerned as the industry’s avant-garde, thereby serving as invaluable references for researchers. For instance, the PanGu model exhibits remarkable congruence with actual observational results, thereby illustrating the caliber of the models included in our analysis.
  • In-Depth Analysis: Breaking new ground, this study delves into the operational mechanisms of the eight focal models, distinguishing the differences in their approaches and summarizing the commonalities in their methods through comparison. This comparison helps readers gain a deeper understanding of the efficacy and applicability of each model and provides a reference for choosing the most appropriate model for a given scenario.
  • Identification of Contemporary Challenges and Future Work: The survey identifies pressing challenges currently facing the field, such as the limited dataset of chronological seasons and complex climate change effects, and suggests directions for future work, including simulating datasets and physics-based constraint models. These recommendations not only add a forward-looking dimension to our research but also act as a catalyst for further research and development in climate prediction.

2. Background

3. Related Work

3.1. Statistical Method

3.2. Physical Models

4. Taxonomy of Climate Prediction Applications

4.1. Climate Prediction Milestone Based on Machine Learning

4.2. Classification of Climate Prediction Methods

5. Short-Term Weather Forecast

5.1. Model Design

  • The Navier-Stokes Equations [ 73 ]: Serving as the quintessential descriptors of fluid motion, these equations delineate the fundamental mechanics underlying atmospheric flow.
    $\nabla \cdot \mathbf{v} = 0$ (3)
    $\rho \left( \frac{\partial \mathbf{v}}{\partial t} + \mathbf{v} \cdot \nabla \mathbf{v} \right) = -\nabla p + \mu \nabla^2 \mathbf{v} + \rho \mathbf{g}$ (4)
  • The Thermodynamic Equations [ 74 ]: These equations intricately interrelate the temperature, pressure, and humidity within the atmospheric matrix, offering insights into the state and transitions of atmospheric energy.
    $\frac{\partial \rho}{\partial t} + \nabla \cdot (\rho \mathbf{v}) = 0$ (Continuity equation) (5)
    $\frac{\partial T}{\partial t} + \mathbf{v} \cdot \nabla T = \frac{q}{c_p}$ (Energy equation) (6)
    $\frac{Dp}{Dt} = -\rho c_p \nabla \cdot \mathbf{v}$ (Pressure equation) (7)
  • The Cloud Microphysics Parameterization Scheme is instrumental for simulating the life cycles of cloud droplets and ice crystals, thereby affecting precipitation [ 75 , 76 ] and the atmospheric energy balance.
  • Shortwave and Longwave Radiation Transfer Equations elucidate the absorption, scattering, and emission of both solar and terrestrial radiation, which in turn influence atmospheric temperature and dynamics.
  • Empirical or Semi-Empirical Convection Parameterization Schemes simulate vertical atmospheric motions initiated by local instabilities, facilitating the capture of weather phenomena like thunderstorms.
  • Boundary-Layer Dynamics concentrates on the exchanges of momentum, energy, and matter between the Earth’s surface and the atmosphere which are crucial for the accurate representation of surface conditions in the model.
  • Land Surface and Soil/Ocean Interaction Modules simulate the exchange of energy, moisture, and momentum between the surface and the atmosphere, while also accounting for terrestrial and aquatic influences on atmospheric conditions.
  • Encoder: The encoder component maps the local region of the input data (on the original latitude-longitude grid) onto the nodes of the multigrid graphical representation. It maps two consecutive input frames of the latitude-longitude input grid, with numerous variables per grid point, into a multi-scale internal mesh representation. This mapping process helps the model better capture and understand spatial dependencies in the data, allowing for more accurate predictions of future weather conditions.
  • Processor: This part performs several rounds of message-passing on the multi-mesh, where the edges can span short or long ranges, facilitating efficient communication without necessitating an explicit hierarchy. More specifically, this component uses a multi-mesh graph representation, a special graph structure that is able to represent the spatial structure of the Earth's surface in an efficient way. In a multi-mesh graph representation, nodes may represent specific regions of the Earth's surface, while edges may represent spatial relationships between these regions. In this way, models can capture spatial dependencies on a global scale and are able to utilize the power of GNNs to analyze and predict weather changes.
  • Decoder: It then maps the multi-mesh representation back to the latitude-longitude grid as a prediction for the next time step. A toy sketch of this encode-process-decode pattern is given after this list.
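The toy sketch below walks through the encode-process-decode pattern described above with plain NumPy: grid features are aggregated onto a few mesh nodes, messages are passed along mesh edges, and the result is mapped back to the grid. The tiny grid, fixed node assignment, and random weights are illustrative assumptions, not the architecture of any surveyed model.

```python
# Toy encode-process-decode message passing on a mesh graph (NumPy only).
# Grid/mesh sizes, edges, and random weights are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_grid, n_mesh, n_feat, hidden = 16, 4, 3, 8           # tiny sizes for illustration

# Two consecutive input frames on the "latitude-longitude grid", flattened per point.
grid_x = rng.normal(size=(n_grid, 2 * n_feat))

# Encoder: assign each grid point to a mesh node (a crude stand-in for the learned
# grid-to-mesh mapping) and aggregate its encoded features there.
assign = np.arange(n_grid) % n_mesh
W_enc = rng.normal(size=(2 * n_feat, hidden))
mesh_h = np.zeros((n_mesh, hidden))
for m in range(n_mesh):
    mesh_h[m] = np.tanh(grid_x[assign == m] @ W_enc).mean(axis=0)

# Processor: a few rounds of message passing along (bidirectional) mesh edges.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]        # short- and long-range links
W_msg = rng.normal(size=(hidden, hidden))
W_upd = rng.normal(size=(2 * hidden, hidden))
for _ in range(3):
    messages = np.zeros_like(mesh_h)
    for s, r in edges + [(r, s) for s, r in edges]:     # both directions
        messages[r] += np.tanh(mesh_h[s] @ W_msg)
    mesh_h = np.tanh(np.concatenate([mesh_h, messages], axis=1) @ W_upd)

# Decoder: map mesh features back to the grid as the next-step prediction.
W_dec = rng.normal(size=(hidden, n_feat))
prediction = mesh_h[assign] @ W_dec
print(prediction.shape)                                 # (16, 3): one state per grid point
```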

5.2. Result Analysis

6. Medium-to-Long-Term Climate Prediction

6.1. Model Design

  • Problem Definition: The goal is to approximate $p(Y \mid X, M)$, a task challenged by high-dimensional geospatial data, data inhomogeneity, and a large dataset.
  • Random Variable $z$: A latent variable with a fixed standard Gaussian distribution.
  • Parametric Functions $p_\theta$, $q_\phi$, $p_\psi$: Neural networks for transforming $z$ and approximating the target and posterior distributions.
  • Objective Function: Maximization of the Evidence Lower Bound (ELBO).
  • Initialize: Define the random variable $z \sim \mathcal{N}(0, 1)$ [ 96 , 97 ] and the parametric functions $p_\theta(z, X, M)$, $q_\phi(z \mid X, Y, M)$, $p_\psi(Y \mid X, M, z)$.
  • Training Objective (Maximize ELBO) [ 98 ]: The ELBO is defined as
    $\mathrm{ELBO} = \mathbb{E}_{z \sim q_\phi} \left[ \log p_\psi(Y \mid X, M, z) \right] - D_{\mathrm{KL}}\left( q_\phi \,\|\, p(z \mid X, M) \right) - D_{\mathrm{KL}}\left( q_\phi \,\|\, p(z \mid X, Y, M) \right)$ (8)
    with terms for reconstruction, regularization, and residual error.
  • Optimization: Utilize variational inference, Monte Carlo reparameterization, and Gaussian assumptions.
  • Forecasting: Generate forecasts by sampling from $p(z \mid X, M)$ and the likelihood $p_\psi$, and using the mean of $p_\psi$ for an average estimate.
  • Two Generators : The CycleGAN model includes two generators. Generator G learns the mapping from the simulated domain to the real domain, and generator F learns the mapping from the real domain to the simulated domain [ 100 ].
  • Two Discriminators: There are two discriminators, one for the real domain and one for the simulated domain. Discriminator $D_x$ encourages generator G to generate samples that look similar to samples in the real domain, and discriminator $D_y$ encourages generator F to generate samples that look similar to samples in the simulated domain.
  • Cycle Consistency Loss: To ensure that the mappings are consistent, the model enforces the following condition through a cycle consistency loss: if a sample is mapped from the simulated domain to the real domain and then mapped back to the simulated domain, it should yield a sample similar to the original simulated sample. Similarly, if a sample is mapped from the real domain to the simulated domain and then mapped back to the real domain, it should yield a sample similar to the original real sample.
    $\mathcal{L}_{\mathrm{cyc}}(G, F) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)} \left[ \| F(G(x)) - x \|_1 \right] + \mathbb{E}_{y \sim p_{\mathrm{data}}(y)} \left[ \| G(F(y)) - y \|_1 \right]$ (10)
  • Training Process: The model is trained to learn the mapping between these two domains by minimizing the adversarial loss and cycle consistency loss between the generators and discriminators. A numerical sketch of the cycle-consistency term is given after this list.
    $\mathcal{L}_{\mathrm{Gen}}(G, F) = \mathcal{L}_{\mathrm{GAN}}(G, D_y, X, Y) + \mathcal{L}_{\mathrm{GAN}}(F, D_x, Y, X) + \lambda \mathcal{L}_{\mathrm{cyc}}(G, F)$ (11)
  • Application to Prediction : Once trained, these mappings can be used for various tasks, such as transforming simulated precipitation data into forecasts that resemble observed data.
  • Reference Model: SPCAM. SPCAM serves as the foundational GCM and is embedded with Cloud-Resolving Models (CRMs) to simulate microscale atmospheric processes like cloud formation and convection. SPCAM is employed to generate “target simulation data”, which serves as the training baseline for the neural networks. The use of CRMs is inspired by recent advancements in data science, demonstrating that machine learning parameterizations can potentially outperform traditional methods in simulating convective and cloud processes.
  • Neural Networks: ResDNNs, a specialized form of deep neural networks, are employed for their ability to approximate complex, nonlinear relationships. The network comprises multiple residual blocks, each containing two fully connected layers with Rectified Linear Unit (ReLU) activations. ResDNNs are designed to address the vanishing and exploding gradient problems in deep networks through residual connections, offering a stable and effective gradient propagation mechanism. This makes them well-suited for capturing the complex and nonlinear nature of atmospheric processes.
  • Subgrid-Scale Physical Simulator. Traditional parameterizations often employ simplified equations to model subgrid-scale processes, which might lack accuracy. In contrast, the ResDNNs are organized into a subgrid-scale physical simulator that operates independently within each model grid cell. This simulator takes atmospheric states as inputs and outputs physical quantities at the subgrid scale, such as cloud fraction and precipitation rate.
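As a concrete reading of Equations (10) and (11), the sketch below evaluates the cycle-consistency term for a pair of toy generators. The linear maps and random batches are placeholders standing in for the trained CycleGAN networks and the simulated/observed precipitation fields.

```python
# Sketch: cycle-consistency loss from Eq. (10) with toy linear "generators"
# (placeholders for the trained CycleGAN mappings between simulated and real data).
import numpy as np

rng = np.random.default_rng(0)
d = 5                                  # dimensionality of a (flattened) sample
A, B = rng.normal(size=(d, d)), rng.normal(size=(d, d))

def G(x):   # simulated -> real (toy linear map)
    return x @ A

def F(y):   # real -> simulated (toy linear map)
    return y @ B

def cycle_consistency_loss(x_sim, y_real):
    """L1 reconstruction error after a simulated->real->simulated cycle
    and a real->simulated->real cycle, averaged over the batch (Eq. 10)."""
    forward = np.abs(F(G(x_sim)) - x_sim).sum(axis=1).mean()
    backward = np.abs(G(F(y_real)) - y_real).sum(axis=1).mean()
    return forward + backward

x_sim = rng.normal(size=(32, d))       # batch of simulated samples
y_real = rng.normal(size=(32, d))      # batch of observed samples
print(cycle_consistency_loss(x_sim, y_real))
```

In the full objective of Equation (11), this term is weighted by λ and added to the two adversarial losses produced by the discriminators.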

6.2. Result Analysis

7. Discussion

7.1. Overall Comparison

7.2. Challenge

7.3. Future Work

  • Simulate the dataset using statistical methods or physical methods.
  • Combine statistical knowledge with machine learning methods to enhance the interpretability of patterns.
  • Consider the introduction of physics-based constraints into deep learning models to produce more accurate and reliable results (a minimal sketch of such a constraint is given after this list).
  • Accelerate physical model prediction with machine learning knowledge.
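For the physics-based constraints mentioned in the list above, one common pattern is to add a penalty for violating a known relation such as the incompressibility condition of Equation (3). The finite-difference grid, toy fields, and weighting factor below are illustrative assumptions rather than a method proposed in the surveyed papers.

```python
# Sketch: adding a physics-based penalty (divergence-free constraint, Eq. (3))
# to a data-fit loss. Grid spacing and the weight `lam` are illustrative.
import numpy as np

def divergence(u, v, dx=1.0, dy=1.0):
    """Finite-difference divergence of a 2D velocity field (u, v)."""
    du_dx = np.gradient(u, dx, axis=1)
    dv_dy = np.gradient(v, dy, axis=0)
    return du_dx + dv_dy

def physics_constrained_loss(pred_u, pred_v, target_u, target_v, lam=0.1):
    """Mean-squared data-fit term plus a penalty on non-zero divergence."""
    data_term = np.mean((pred_u - target_u) ** 2 + (pred_v - target_v) ** 2)
    physics_term = np.mean(divergence(pred_u, pred_v) ** 2)
    return data_term + lam * physics_term

rng = np.random.default_rng(0)
shape = (8, 8)
pred_u, pred_v = rng.normal(size=shape), rng.normal(size=shape)
target_u, target_v = rng.normal(size=shape), rng.normal(size=shape)
print(physics_constrained_loss(pred_u, pred_v, target_u, target_v))
```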

8. Conclusions

Author Contributions

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

Abbreviations

Symbol | Meaning
v | velocity vector
t | time
ρ | fluid density
p | pressure
μ | dynamic viscosity
g | gravitational acceleration vector
E_{z∼q_φ} | expectation under the variational distribution
z | latent variable
  | observed data
  | joint distribution of observed and latent variables
q_φ | variational distribution
G, F | generators for the mappings from the simulated to the real domain and vice versa
D_x, D_y | discriminators for the real and simulated domains
L_cyc, L_GAN | cycle consistency loss and Generative Adversarial Network loss
X, Y | data distributions for the simulated and real domains
λ | weighting factor for the cycle consistency loss
  • Abbe, C. The physical basis of long-range weather. Mon. Weather Rev. 1901 , 29 , 551–561. [ Google Scholar ] [ CrossRef ]
  • Zheng, Y.; Capra, L.; Wolfson, O.; Yang, H. Urban computing: Concepts, methodologies, and applications. Acm Trans. Intell. Syst. Technol. TIST 2014 , 5 , 1–55. [ Google Scholar ]
  • Gneiting, T.; Raftery, A.E. Weather forecasting with ensemble methods. Science 2005 , 310 , 248–249. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Agapiou, A. Remote sensing heritage in a petabyte-scale: Satellite data and heritage Earth Engine applications. Int. J. Digit. Earth 2017 , 10 , 85–102. [ Google Scholar ] [ CrossRef ]
  • Bendre, M.R.; Thool, R.C.; Thool, V.R. Big data in precision agriculture: Weather forecasting for future farming. In Proceedings of the 2015 1st International Conference on Next Generation Computing Technologies (NGCT), Dehradun, India, 4–5 September 2015; pp. 744–750. [ Google Scholar ]
  • Zavala, V.M.; Constantinescu, E.M.; Krause, T. On-line economic optimization of energy systems using weather forecast information. J. Process Control 2009 , 19 , 1725–1736. [ Google Scholar ] [ CrossRef ]
  • Nurmi, V.; Perrels, A.; Nurmi, P.; Michaelides, S.; Athanasatos, S.; Papadakis, M. Economic value of weather forecasts on transportation–Impacts of weather forecast quality developments to the economic effects of severe weather. EWENT FP7 Project . 2012, Volume 490. Available online: http://virtual.vtt.fi/virtual/ewent/Deliverables/D5/D5_2_16_02_2012_revised_final.pdf (accessed on 8 September 2023).
  • Russo, J.A., Jr. The economic impact of weather on the construction industry of the United States. Bull. Am. Meteorol. Soc. 1966 , 47 , 967–972. [ Google Scholar ] [ CrossRef ]
  • Badorf, F.; Hoberg, K. The impact of daily weather on retail sales: An empirical study in brick-and-mortar stores. J. Retail. Consum. Serv. 2020 , 52 , 101921. [ Google Scholar ] [ CrossRef ]
  • De Freitas, C.R. Tourism climatology: Evaluating environmental information for decision making and business planning in the recreation and tourism sector. Int. J. Biometeorol. 2003 , 48 , 45–54. [ Google Scholar ] [ CrossRef ]
  • Smith, K. Environmental Hazards: Assessing Risk and Reducing Disaster ; Routledge: London, UK, 2013. [ Google Scholar ]
  • Hammer, G.L.; Hansen, J.W.; Phillips, J.G.; Mjelde, J.W.; Hill, H.; Love, A.; Potgieter, A. Advances in application of climate prediction in agriculture. Agric. Syst. 2001 , 70 , 515–553. [ Google Scholar ] [ CrossRef ]
  • Guedes, G.; Raad, R.; Raad, L. Welfare consequences of persistent climate prediction errors on insurance markets against natural hazards. Estud. Econ. Sao Paulo 2019 , 49 , 235–264. [ Google Scholar ] [ CrossRef ]
  • McNamara, D.E.; Keeler, A. A coupled physical and economic model of the response of coastal real estate to climate risk. Nat. Clim. Chang. 2013 , 3 , 559–562. [ Google Scholar ] [ CrossRef ]
  • Kleerekoper, L.; Esch, M.V.; Salcedo, T.B. How to make a city climate-proof, addressing the urban heat island effect. Resour. Conserv. Recycl. 2012 , 64 , 30–38. [ Google Scholar ] [ CrossRef ]
  • Kaján, E.; Saarinen, J. Tourism, climate change and adaptation: A review. Curr. Issues Tour. 2013 , 16 , 167–195. [ Google Scholar ]
  • Dessai, S.; Hulme, M.; Lempert, R.; Pielke, R., Jr. Climate prediction: A limit to adaptation. Adapt. Clim. Chang. Threshold. Values Gov. 2009 , 64 , 78. [ Google Scholar ]
  • Ham, Y.-G.; Kim, J.-H.; Luo, J.-J. Deep Learning for Multi-Year ENSO Forecasts. Nature 2019 , 573 , 568–572. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Howe, L.; Wain, A. Predicting the Future ; Cambridge University Press: Cambridge, UK, 1993; Volume V, pp. 1–195. [ Google Scholar ]
  • Hantson, S.; Arneth, A.; Harrison, S.P.; Kelley, D.I.; Prentice, I.C.; Rabin, S.S.; Archibald, S.; Mouillot, F.; Arnold, S.R.; Artaxo, P.; et al. The status and challenge of global fire modelling. Biogeosciences 2016 , 13 , 3359–3375. [ Google Scholar ]
  • Racah, E.; Beckham, C.; Maharaj, T.; Ebrahimi Kahou, S.; Prabhat, M.; Pal, C. ExtremeWeather: A large-scale climate dataset for semi-supervised detection, localization, and understanding of extreme weather events. Adv. Neural Inf. Process. Syst. 2017 , 30 , 3402–3413. [ Google Scholar ]
  • Gao, S.; Zhao, P.; Pan, B.; Li, Y.; Zhou, M.; Xu, J.; Zhong, S.; Shi, Z. A nowcasting model for the prediction of typhoon tracks based on a long short term memory neural network. Acta Oceanol. Sin. 2018 , 37 , 8–12. [ Google Scholar ]
  • Ren, X.; Li, X.; Ren, K.; Song, J.; Xu, Z.; Deng, K.; Wang, X. Deep Learning-Based Weather Prediction: A Survey. Big Data Res. 2021 , 23 , 100178. [ Google Scholar ]
  • Reichstein, M.; Camps-Valls, G.; Stevens, B.; Jung, M.; Denzler, J.; Carvalhais, N.; Prabhat, F. Deep learning and process understanding for data-driven Earth system science. Nature 2019 , 566 , 195–204. [ Google Scholar ] [ CrossRef ]
  • Stockhause, M.; Lautenschlager, M. CMIP6 data citation of evolving data. Data Sci. J. 2017 , 16 , 30. [ Google Scholar ] [ CrossRef ]
  • Hsieh, W.W. Machine Learning Methods in the Environmental Sciences: Neural Networks and Kernels ; Cambridge University Press: Cambridge, UK, 2009. [ Google Scholar ]
  • Krasnopolsky, V.M.; Fox-Rabinovitz, M.S.; Chalikov, D.V. New Approach to Calculation of Atmospheric Model Physics: Accurate and Fast Neural Network Emulation of Longwave Radiation in a Climate Model. Mon. Weather Rev. 2005 , 133 , 1370–1383. [ Google Scholar ] [ CrossRef ]
  • Krasnopolsky, V.M.; Fox-Rabinovitz, M.S.; Belochitski, A.A. Using ensemble of neural networks to learn stochastic convection parameterizations for climate and numerical weather prediction models from data simulated by a cloud resolving model. Adv. Artif. Neural Syst. 2013 , 2013 , 485913. [ Google Scholar ] [ CrossRef ]
  • Chevallier, F.; Morcrette, J.-J.; Chéruy, F.; Scott, N.A. Use of a neural-network-based long-wave radiative-transfer scheme in the ECMWF atmospheric model. Q. J. R. Meteorol. Soc. 2000 , 126 , 761–776. [ Google Scholar ]
  • Krasnopolsky, V.M.; Fox-Rabinovitz, M.S.; Hou, Y.T.; Lord, S.J.; Belochitski, A.A. Accurate and fast neural network emulations of model radiation for the NCEP coupled climate forecast system: Climate simulations and seasonal predictions. Mon. Weather Rev. 2010 , 138 , 1822–1842. [ Google Scholar ] [ CrossRef ]
  • Tolman, H.L.; Krasnopolsky, V.M.; Chalikov, D.V. Neural network approximations for nonlinear interactions in wind wave spectra: Direct mapping for wind seas in deep water. Ocean. Model. 2005 , 8 , 253–278. [ Google Scholar ] [ CrossRef ]
  • Markakis, E.; Papadopoulos, A.; Perakakis, P. Spatiotemporal Forecasting: A Survey. arXiv 2018 , arXiv:1808.06571. [ Google Scholar ]
  • Box, G.E.; Jenkins, G.M.; Reinsel, G.C.; Ljung, G.M. Time Series Analysis: Forecasting and Control ; John Wiley & Sons: Hoboken, NJ, USA, 2015. [ Google Scholar ]
  • He, Y.; Kolovos, A. Spatial and Spatio-Temporal Geostatistical Modeling and Kriging. In Wiley StatsRef: Statistics Reference Online ; John Wiley & Sons: Hoboken, NJ, USA, 2015. [ Google Scholar ]
  • Lu, H.; Fan, Z.; Zhu, H. Spatiotemporal Analysis of Air Quality and Its Application in LASG/IAP Climate System Model. Atmos. Ocean. Sci. Lett. 2011 , 4 , 204–210. [ Google Scholar ]
  • Chatfield, C. The Analysis of Time Series: An Introduction , 7th ed.; CRC Press: Boca Raton, FL, USA, 2016. [ Google Scholar ]
  • Stull, R. Meteorology for Scientists and Engineers , 3rd ed.; Brooks/Cole: Pacific Grove, CA, USA, 2015. [ Google Scholar ]
  • Yuval, J.; O’Gorman, P.A. Machine Learning for Parameterization of Moist Convection in the Community Atmosphere Model. Proc. Natl. Acad. Sci. USA 2020 , 117 , 12–20. [ Google Scholar ]
  • Gagne, D.J.; Haupt, S.E.; Nychka, D.W. Machine Learning for Spatial Environmental Data. Meteorol. Monogr. 2020 , 59 , 9.1–9.36. [ Google Scholar ]
  • Xu, Z.; Li, Y.; Guo, Q.; Shi, X.; Zhu, Y. A Multi-Model Deep Learning Ensemble Method for Rainfall Prediction. J. Hydrol. 2020 , 584 , 124579. [ Google Scholar ]
  • Kuligowski, R.J.; Barros, A.P. Localized precipitation forecasts from a numerical weather prediction model using artificial neural networks. Weather. Forecast. 1998 , 13 , 1194–1204. [ Google Scholar ] [ CrossRef ]
  • Shi, X.; Chen, Z.; Wang, H.; Yeung, D.Y.; Wong, W.K.; Woo, W.C. Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting. arXiv 2015 , arXiv:1506.04214. [ Google Scholar ]
  • Qiu, M.; Zhao, P.; Zhang, K.; Huang, J.; Shi, X.; Wang, X.; Chu, W. A short-term rainfall prediction model using multi-task convolutional neural networks. In Proceedings of the 2017 IEEE International Conference on Data Mining (ICDM), New Orleans, LA, USA, 18–21 November 2017; IEEE: New York, NY, USA, 2017; pp. 395–404. [ Google Scholar ]
  • Karevan, Z.; Suykens, J.A. Spatio-temporal stacked lstm for temperature prediction in weather forecasting. arXiv 2018 , arXiv:1811.06341. [ Google Scholar ]
  • Chattopadhyay, A.; Nabizadeh, E.; Hassanzadeh, P. Analog Forecasting of extreme-causing weather patterns using deep learning. J. Adv. Model. Earth Syst. 2020 , 12 , e2019MS001958. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Sønderby, C.K.; Espeholt, L.; Heek, J.; Dehghani, M.; Oliver, A.; Salimans, T.; Alchbrenner, N. MetNet: A Neural Weather Model for Precipitation Forecasting. arXiv 2020 , arXiv:2003.12140. [ Google Scholar ]
  • Pathak, J.; Subramanian, S.; Harrington, P.; Raja, S.; Chattopadhyay, A.; Mardani, M.; Anandkumar, A. FourCastNet: A Global Data-Driven High-Resolution Weather Model Using Adaptive Fourier Neural Operators. arXiv 2022 , arXiv:2202.11214. [ Google Scholar ]
  • Lam, R.; Sanchez-Gonzalez, A.; Willson, M.; Wirnsberger, P.; Fortunato, M.; Pritzel, A.; Battaglia, P. GraphCast: Learning skillful medium-range global weather forecasting. arXiv 2022 , arXiv:2212.12794. [ Google Scholar ]
  • Bi, K.; Xie, L.; Zhang, H.; Chen, X.; Gu, X.; Tian, Q. Accurate Medium-Range Global Weather Forecasting with 3D Neural Networks. Nature 2023 , 619 , 533–538. [ Google Scholar ] [ CrossRef ]
  • Nguyen, T.; Brandstetter, J.; Kapoor, A.; Gupta, J.K.; Grover, A. ClimaX: A foundation model for weather and climate. arXiv 2023 , arXiv:2301.10343. [ Google Scholar ]
  • Gangopadhyay, S.; Clark, M.; Rajagopalan, B. Statistical Down-scaling using K-nearest neighbors. In Water Resources Research ; Wiley Online Library: Hoboken, NJ, USA, 2005; Volume 41. [ Google Scholar ]
  • Tripathi, S.; Srinivas, V.V.; Nanjundiah, R.S. Down-scaling of precipitation for climate change scenarios: A support vector machine approach. J. Hydrol. 2006 , 330 , 621–640. [ Google Scholar ] [ CrossRef ]
  • Krasnopolsky, V.M.; Fox-Rabinovitz, M.S. Complex hybrid models combining deterministic and machine learning components for numerical climate modeling and weather prediction. Neural Netw. 2006 , 19 , 122–134. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Raje, D.; Mujumdar, P.P. A conditional random field–based Down-scaling method for assessment of climate change impact on multisite daily precipitation in the Mahanadi basin. In Water Resources Research ; Wiley Online Library: Hoboken, NJ, USA, 2009; Volume 45. [ Google Scholar ]
  • Zarei, M.; Najarchi, M.; Mastouri, R. Bias correction of global ensemble precipitation forecasts by Random Forest method. Earth Sci. Inform. 2021 , 14 , 677–689. [ Google Scholar ] [ CrossRef ]
  • Andersson, T.R.; Hosking, J.S.; Pérez-Ortiz, M.; Paige, B.; Elliott, A.; Russell, C.; Law, S.; Jones, D.C.; Wilkinson, J.; Phillips, T.; et al. Seasonal Arctic Sea Ice Forecasting with Probabilistic Deep Learning. Nat. Commun. 2021 , 12 , 5124. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Wang, X.; Han, Y.; Xue, W.; Yang, G.; Zhang, G. Stable climate simulations using a realistic general circulation model with neural network parameterizations for atmospheric moist physics and radiation processes. Geosci. Model Dev. 2022 , 15 , 3923–3940. [ Google Scholar ] [ CrossRef ]
  • Baño-Medina, J.; Manzanas, R.; Cimadevilla, E.; Fernández, J.; González-Abad, J.; Cofiño, A.S.; Gutiérrez, J.M. Down-scaling Multi-Model Climate Projection Ensembles with Deep Learning (DeepESD): Contribution to CORDEX EUR-44. Geosci. Model Dev. 2022 , 15 , 6747–6758. [ Google Scholar ] [ CrossRef ]
  • Hess, P.; Lange, S.; Boers, N. Deep Learning for bias-correcting comprehensive high-resolution Earth system models. arXiv 2022 , arXiv:2301.01253. [ Google Scholar ]
  • Wang, F.; Tian, D. On deep learning-based bias correction and Down-scaling of multiple climate models simulations. Clim. Dyn. 2022 , 59 , 3451–3468. [ Google Scholar ] [ CrossRef ]
  • Pan, B.; Anderson, G.J.; Goncalves, A.; Lucas, D.D.; Bonfils, C.J.W.; Lee, J. Improving Seasonal Forecast Using Probabilistic Deep Learning. J. Adv. Model. Earth Syst. 2022 , 14 , e2021MS002766. [ Google Scholar ] [ CrossRef ]
  • Hu, Y.; Chen, L.; Wang, Z.; Li, H. SwinVRNN: A Data-Driven Ensemble Forecasting Model via Learned Distribution Perturbation. J. Adv. Model. Earth Syst. 2023 , 15 , e2022MS003211. [ Google Scholar ] [ CrossRef ]
  • Chen, L.; Zhong, X.; Zhang, F.; Cheng, Y.; Xu, Y.; Qi, Y.; Li, H. FuXi: A cascade machine learning forecasting system for 15-day global weather forecast. arXiv 2023 , arXiv:2306.12873. [ Google Scholar ]
  • Lin, H.; Gao, Z.; Xu, Y.; Wu, L.; Li, L.; Li, S.Z. Conditional local convolution for spatio-temporal meteorological forecasting. Proc. Aaai Conf. Artif. Intell. 2022 , 36 , 7470–7478. [ Google Scholar ] [ CrossRef ]
  • Chen, K.; Han, T.; Gong, J.; Bai, L.; Ling, F.; Luo, J.J.; Chen, X.; Ma, L.; Zhang, T.; Su, R.; et al. FengWu: Pushing the Skillful Global Medium-range Weather Forecast beyond 10 Days Lead. arXiv 2023 , arXiv:2304.02948. [ Google Scholar ]
  • De Burgh-Day, C.O.; Leeuwenburg, T. Machine Learning for numerical weather and climate modelling: A review. EGUsphere 2023 , 2023 , 1–48. [ Google Scholar ]
  • LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998 , 86 , 2278–2324. [ Google Scholar ] [ CrossRef ]
  • Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012 , 25 , 1097–1105. [ Google Scholar ] [ CrossRef ]
  • Scherer, D.; Müller, A.; Behnke, S. Evaluation of pooling operations in convolutional architectures for object recognition. In Proceedings of the International Conference on Artificial Neural Networks 2010, Thessaloniki, Greece, 15–18 September 2010; pp. 92–101. [ Google Scholar ]
  • LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015 , 521 , 436–444. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Liu, Y.; Racah, E.; Correa, J.; Khosrowshahi, A.; Lavers, D.; Kunkel, K.; Wehner, M.; Collins, W. Application of deep convolutional neural networks for detecting extreme weather in climate datasets. arXiv 2016 , arXiv:1605.01156. [ Google Scholar ]
  • Goodfellow, I.; Warde-Farley, D.; Mirza, M.; Courville, A.; Bengio, Y. Maxout networks. In Proceedings of the International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; pp. 1319–1327. [ Google Scholar ]
  • Marion, M.; Roger, T. Navier-Stokes equations: Theory and approximation. Handb. Numer. Anal. 1998 , 6 , 503–689. [ Google Scholar ]
  • Iacono, M.J.; Mlawer, E.J.; Clough, S.A.; Morcrette, J.-J. Impact of an improved longwave radiation model, RRTM, on the energy budget and thermodynamic properties of the NCAR community climate model, CCM3. J. Geophys. Res. Atmos. 2000 , 105 , 14873–14890. [ Google Scholar ] [ CrossRef ]
  • Guo, Y.; Shao, C.; Su, A. Comparative Evaluation of Rainfall Forecasts during the Summer of 2020 over Central East China. Atmosphere 2023 , 14 , 992. [ Google Scholar ] [ CrossRef ]
  • Guo, Y.; Shao, C.; Su, A. Investigation of Land–Atmosphere Coupling during the Extreme Rainstorm of 20 July 2021 over Central East China. Atmosphere 2023 , 14 , 1474. [ Google Scholar ] [ CrossRef ]
  • Bauer, P.; Thorpe, A.; Brunet, G. The Quiet Revolution of Numerical Weather Prediction. Nature 2015 , 525 , 47–55. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All You Need. In Proceedings of the NeurIPS, Long Beach, CA, USA, 4–9 December 2017. [ Google Scholar ]
  • Wang, H.; Zhu, Y.; Green, B.; Adam, H.; Yuille, A.; Chen, L.C. Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation. arXiv 2019 , arXiv:2003.07853. [ Google Scholar ]
  • Schmit, T.J.; Griffith, P.; Gunshor, M.M.; Daniels, J.M.; Goodman, S.J.; Lebair, W.J. A closer look at the ABI on the GOES-R series. Bull. Am. Meteorol. Soc. 2017 , 98 , 681–698. [ Google Scholar ] [ CrossRef ]
  • Li, Z.; Kovachki, N.; Azizzadenesheli, K.; Liu, B.; Bhattacharya, K.; Stuart, A.; Anandkumar, A. Fourier Neural Operator for Parametric Partial Differential Equations. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual Event. 3–7 May 2021. [ Google Scholar ]
  • Guibas, J.; Mardani, M.; Li, Z.; Tao, A.; Anandkumar, A.; Catanzaro, B. Adaptive Fourier Neural Operators: Efficient token mixers for transformers. In Proceedings of the International Conference on Representation Learning, Virtual Event. 25–29 April 2022. [ Google Scholar ]
  • Rasp, S.; Thuerey, N. Purely data-driven medium-range weather forecasting achieves comparable skill to physical models at similar resolution. arXiv 2020 , arXiv:2008.08626. [ Google Scholar ]
  • Weyn, J.A.; Durran, D.R.; Caruana, R.; Cresswell-Clay, N. Sub-seasonal forecasting with a large ensemble of deep-learning weather prediction models. arXiv 2021 , arXiv:2102.05107. [ Google Scholar ] [ CrossRef ]
  • Rasp, S.; Dueben, P.D.; Scher, S.; Weyn, J.A.; Mouatadid, S.; Thuerey, N. Weatherbench: A benchmark data set for data-driven weather forecasting. J. Adv. Model. Earth Syst. 2020 , 12 , e2020MS002203. [ Google Scholar ] [ CrossRef ]
  • Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the International Conference on Computer Vision, Virtual. 11–17 October 2021; IEEE: New York, NY, USA, 2021; pp. 10012–10022. [ Google Scholar ]
  • Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020 , arXiv:2010.11929. [ Google Scholar ]
  • Váňa, F.; Düben, P.; Lang, S.; Palmer, T.; Leutbecher, M.; Salmond, D.; Carver, G. Single precision in weather forecasting models: An evaluation with the IFS. Mon. Weather Rev. 2017 , 145 , 495–502. [ Google Scholar ] [ CrossRef ]
  • IPCC. Climate Change 2013: The Physical Science Basis. Contribution of Working Group I to the Fifth Assessment Report of the Intergovernmental Panel on Climate Change ; Cambridge University Press: Cambridge, UK; New York, NY, USA, 2013. [ Google Scholar ]
  • Flato, G.; Marotzke, J.; Abiodun, B.; Braconnot, P.; Chou, S.C.; Collins, W.; Cox, P.; Driouech, F.; Emori, S.; Eyring, V.; et al. Evaluation of Climate Models. In Climate Change 2013: The Physical Science Basis. Contribution of Working Group I to the Fifth Assessment Report of the Intergovernmental Panel on Climate Change ; Cambridge University Press: Cambridge, UK; New York, NY, USA, 2013. [ Google Scholar ]
  • Washington, W.M.; Parkinson, C.L. An Introduction to Three-Dimensional Climate Modeling ; University Science Books: Beijing, China, 2005. [ Google Scholar ]
  • Giorgi, F.; Gutowski, W.J. Regional Dynamical Down-scaling and the CORDEX Initiative. Annu. Rev. Environ. Resour. 2015 , 40 , 467–490. [ Google Scholar ] [ CrossRef ]
  • Randall, D.A.; Wood, R.A.; Bony, S.; Colman, R.; Fichefet, T.; Fyfe, J.; Kattsov, V.; Pitman, A.; Shukla, J.; Srinivasan, J.; et al. Climate Models and Their Evaluation. In Climate Change 2007: The Physical Science Basis. Contribution of Working Group I to the Fourth Assessment Report of the Intergovernmental Panel on Climate Change ; Cambridge University Press: Cambridge, UK; New York, NY, USA, 2007. [ Google Scholar ]
  • Taylor, K.E.; Stouffer, R.J.; Meehl, G.A. An overview of CMIP5 and the experiment design. Bull. Am. Meteorol. Soc. 2012 , 93 , 485–498. [ Google Scholar ] [ CrossRef ]
  • Miao, C.; Shen, Y.; Sun, J. Spatial–temporal ensemble forecasting (STEFS) of high-resolution temperature using machine learning models. J. Adv. Model. Earth Syst. 2019 , 11 , 2961–2973. [ Google Scholar ]
  • Mukkavilli, S.; Perone, C.S.; Rangapuram, S.S.; Müller, K.R. Distribution regression forests for probabilistic spatio-temporal forecasting. In Proceedings of the International Conference on Machine Learning (ICML), Vienna, Austria, 12–18 July 2020. [ Google Scholar ]
  • Walker, G.; Charlton-Perez, A.; Lee, R.; Inness, P. Challenges and progress in probabilistic forecasting of convective phenomena: The 2016 GFE/EUMETSAT/NCEP/SPC severe convective weather workshop. Bull. Am. Meteorol. Soc. 2016 , 97 , 1829–1835. [ Google Scholar ]
  • Kingma, D.P.; Welling, M. Auto-encoding variational bayes. arXiv 2013 , arXiv:1312.6114. [ Google Scholar ]
  • Krasting, J.P.; John, J.G.; Blanton, C.; McHugh, C.; Nikonov, S.; Radhakrishnan, A.; Zhao, M. NOAA-GFDL GFDL-ESM4 model output prepared for CMIP6 CMIP. Earth Syst. Grid Fed. 2018 , 10 . [ Google Scholar ] [ CrossRef ]
  • Zhu, J.-Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232. [ Google Scholar ]
  • Brands, S.; Herrera, S.; Fernández, J.; Gutiérrez, J.M. How well do CMIP5 Earth System Models simulate present climate conditions in Europe and Africa? Clim. Dynam. 2013 , 41 , 803–817. [ Google Scholar ] [ CrossRef ]
  • Vautard, R.; Kadygrov, N.; Iles, C. Evaluation of the large EURO-CORDEX regional climate model ensemble. J. Geophys. Res.-Atmos. 2021 , 126 , e2019JD032344. [ Google Scholar ] [ CrossRef ]
  • Boé, J.; Somot, S.; Corre, L.; Nabat, P. Large discrepancies in summer climate change over Europe as projected by global and regional climate models: Causes and consequences. Clim. Dynam. 2020 , 54 , 2981–3002. [ Google Scholar ] [ CrossRef ]
  • Baño-Medina, J.; Manzanas, R.; Gutiérrez, J.M. Configuration and intercomparison of deep learning neural models for statistical Down-scaling. Geosci. Model Dev. 2020 , 13 , 2109–2124. [ Google Scholar ] [ CrossRef ]
  • Lecun, Y.; Bengio, Y. Convolutional Networks for Images, Speech, and Time-Series. Handb. Brain Theory Neural Netw. 1995 , 336 , 1995. [ Google Scholar ]
  • Dee, D.P.; Uppala, S.M.; Simmons, A.J.; Berrisford, P.; Poli, P.; Kobayashi, S.; Andrae, U.; Balmaseda, M.A.; Balsamo, G.; Bauer, D.P.; et al. The ERA-Interim reanalysis: Configuration and performance of the data assimilation system. Q. J. Roy Meteor. Soc. 2011 , 137 , 553–597. [ Google Scholar ] [ CrossRef ]
  • Cornes, R.C.; van der Schrier, G.; van den Besselaar, E.J.M.; Jones, P.D. An Ensemble Version of the E-OBS Temperature and Precipitation Data Sets. J. Geophys. Res.-Atmos. 2018 , 123 , 9391–9409. [ Google Scholar ] [ CrossRef ]
  • Baño-Medina, J.; Manzanas, R.; Gutiérrez, J.M. On the suitability of deep convolutional neural networks for continentalwide Down-scaling of climate change projections. Clim. Dynam. 2021 , 57 , 1–11. [ Google Scholar ] [ CrossRef ]
  • Maraun, D.; Widmann, M.; Gutiérrez, J.M.; Kotlarski, S.; Chandler, R.E.; Hertig, E.; Wibig, J.; Huth, R.; Wilcke, R.A. VALUE: A framework to validate Down-scaling approaches for climate change studies. Earths Future 2015 , 3 , 1–14. [ Google Scholar ] [ CrossRef ]
  • Vrac, M.; Ayar, P. Influence of Bias Correcting Predictors on Statistical Down-scaling Models. J. Appl. Meteorol. Clim. 2016 , 56 , 5–26. [ Google Scholar ] [ CrossRef ]
  • Williams, P.M. Modelling Seasonality and Trends in Daily Rainfall Data. In Advances in Neural Information Processing Systems 10, Proceedings of the Neural Information Processing Systems (NIPS): Denver, Colorado, USA, 1997 ; MIT Press: Cambridge, MA, USA, 1998; pp. 985–991. ISBN 0-262-10076-2. [ Google Scholar ]
  • Cannon, A.J. Probabilistic Multisite Precipitation Down-scaling by an Expanded Bernoulli–Gamma Density Network. J. Hydrometeorol. 2008 , 9 , 1284–1300. [ Google Scholar ] [ CrossRef ]
  • Schoof, J.T. and Pryor, S.C. Down-scaling temperature and precipitation: A comparison of regression-based methods and artificial neural networks. Int. J. Climatol. 2001 , 21 , 773–790. [ Google Scholar ] [ CrossRef ]
  • Maraun, D.; Widmann, M. Statistical Down-Scaling and Bias Correction for Climate Research ; Cambridge University Press: Cambridge, UK, 2018; ISBN 9781107588783. [ Google Scholar ]
  • Vrac, M.; Stein, M.; Hayhoe, K.; Liang, X.-Z. A general method for validating statistical Down-scaling methods under future climate change. Geophys. Res. Lett. 2007 , 34 , L18701. [ Google Scholar ] [ CrossRef ]
  • San-Martín, D.; Manzanas, R.; Brands, S.; Herrera, S.; Gutiérrez, J.M. Reassessing Model Uncertainty for Regional Projections of Precipitation with an Ensemble of Statistical Down-scaling Methods. J. Clim. 2017 , 30 , 203–223. [ Google Scholar ] [ CrossRef ]
  • Quesada-Chacón, D.; Barfus, K.; Bernhofer, C. Climate change projections and extremes for Costa Rica using tailored predictors from CORDEX model output through statistical Down-scaling with artificial neural networks. Int. J. Climatol. 2021 , 41 , 211–232. [ Google Scholar ] [ CrossRef ]


Time Scale | Domains | Applications
Short Term | Agriculture | The timing for sowing and harvesting; irrigation and fertilization plans [ ].
 | Energy | Predicts output for wind and solar energy [ ].
 | Transportation | Road traffic safety; rail transport; aviation and maritime industries [ ].
 | Construction | Project plans and timelines; safe operations [ ].
 | Retail and Sales | Adjusts inventory based on weather forecasts [ ].
 | Tourism and Entertainment | Operations of outdoor activities and tourist attractions [ ].
 | Environment and Disaster Management | Early warnings for floods, fires, and other natural disasters [ ].
Medium-to-Long Term | Agriculture | Long-term land management and planning [ ].
 | Insurance | Preparations for future increases in types of disasters, such as floods and droughts [ ].
 | Real Estate | Assessment of future sea-level rise or other climate-related factors [ ].
 | Urban Planning | Water resource management [ ].
 | Tourism | Long-term investments and planning, such as deciding which regions may become popular tourist destinations in the future [ ].
 | Public Health | Long-term climate changes may impact the spread of diseases [ ].
Time Scale | Spatial Scale | Type | Model | Technology | Name | Event
Short-term weather prediction | Global | ML | Special DNN Models | AFNO | FourCastNet [ ] | Extreme Events
 | | | | 3D Neural Network | PanGu [ ] |
 | | | | Vision Transformers | ClimaX [ ] | Temperature & Extreme Event
 | | | | SwinTransformer | SwinVRNN [ ] | Temperature & Precipitation
 | | | | U-Transformer | FuXi [ ] |
 | | | Single DNNs Model | GNN | CLCRN [ ] | Temperature
 | | | | GNN | GraphCast [ ] |
 | | | | Transformer | FengWu [ ] | Extreme Events
 | Regional | | | | CapsNet [ ] |
 | | | | CNN | Precipitation Convolution prediction [ ] | Precipitation
 | | | | ANN | Precipitation Neural Network prediction [ ] |
 | | | | LSTM | Stacked-LSTM-Model [ ] | Temperature
 | | | Hybrid DNNs Model | LSTM + CNN | ConvLSTM [ ] | Precipitation
 | | | | | MetNet [ ] |
Medium-to-long-term climate prediction | Global | | Single DNN models | Probabilistic deep learning | Conditional Generative Forecasting [ ] | Temperature & Precipitation
 | | | ML Enhanced | CNN | CNN-Bias-correction model [ ] | Temperature & Extreme Event
 | | | | GAN | Cycle GAN [ ] | Precipitation
 | | | | NN | Hybrid-GCM-Emulation [ ] |
 | | | | ResDNN | NNCAM-emulation [ ] |
 | Regional | | | CNN | DeepESD-Down-scaling model [ ] | Temperature
 | | | Non-Deep-Learning Model | Random forest (RF) | RF-bias-correction model [ ] | Precipitation
 | | | | Support vector machine (SVM) | SVM-Down-scaling model [ ] |
 | | | | K-nearest neighbor (KNN) | KNN-Down-scaling model [ ] |
 | | | | Conditional random field (CRF) | CRF-Down-scaling model [ ] |
Model | Forecast-Timeliness | Z500 RMSE (7 Days) | Z500 ACC (7 Days) | Training-Complexity | Forecasting-Speed
MetNet [ ] | 8 h | - | - | 256 Google TPU accelerators (16 days of training) | A few seconds
FourCastNet [ ] | 7 days | 595 | 0.76 | 24 A100 GPUs | 24-h forecast for 100 members in 7 s
GraphCast [ ] | 9.75 days | 460 | 0.825 | 32 Cloud TPU v4 (21 days of training) | 10-day prediction within 1 min
PanGu [ ] | 7 days | 510 | 0.87 | 2192 V100 GPUs (16 days of training) | 24-h global prediction in 1.4 s per GPU
IFS [ ] | 8.5 days | 439 | 0.85 | - | -
Name | Categories | Metrics | ESM | This Model
CycleGAN [ ] | Bias correction | MAE | 0.241 | 0.068
DeepESD [ ] | Down-scaling | Euclidean distance to observations in PDF | 0.5 | 0.03
CGF [ ] | Prediction | ACC | 0.31 | 0.4
NNCAM [ ] | Emulation | Speed | - | 130 times speed-up
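The Z500 RMSE and ACC columns in the comparison above are standard forecast verification scores. The sketch below computes a latitude-weighted RMSE and the anomaly correlation coefficient for a toy gridded field; the grid, climatology, and weighting are illustrative assumptions and not the exact evaluation protocol used for the cited models.

```python
# Sketch: latitude-weighted RMSE and anomaly correlation coefficient (ACC)
# for a gridded forecast field such as Z500. Toy data; not the exact protocol
# used to evaluate the models in the table above.
import numpy as np

def lat_weights(lats_deg):
    """Cosine-of-latitude weights, normalized to mean 1."""
    w = np.cos(np.deg2rad(lats_deg))
    return w / w.mean()

def weighted_rmse(forecast, truth, lats_deg):
    w = lat_weights(lats_deg)[:, None]            # broadcast over longitudes
    return np.sqrt(np.mean(w * (forecast - truth) ** 2))

def acc(forecast, truth, climatology, lats_deg):
    """Anomaly correlation coefficient against a climatological reference."""
    w = lat_weights(lats_deg)[:, None]
    fa, ta = forecast - climatology, truth - climatology
    num = np.sum(w * fa * ta)
    den = np.sqrt(np.sum(w * fa ** 2) * np.sum(w * ta ** 2))
    return num / den

rng = np.random.default_rng(0)
lats = np.linspace(-90, 90, 19)
clim = rng.normal(5500, 100, size=(19, 36))       # toy Z500 climatology (m)
truth = clim + rng.normal(0, 50, size=clim.shape)
forecast = truth + rng.normal(0, 30, size=clim.shape)
print(weighted_rmse(forecast, truth, lats), acc(forecast, truth, clim, lats))
```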
Chen, L.; Han, B.; Wang, X.; Zhao, J.; Yang, W.; Yang, Z. Machine Learning Methods in Weather and Climate Applications: A Survey. Appl. Sci. 2023, 13, 12019. https://doi.org/10.3390/app132112019


Cyberbullying detection and machine learning: a systematic literature review

  • Published: 24 July 2023
  • Volume 56, pages 1375–1416 (2023)


  • Vimala Balakrisnan 1 &
  • Mohammed Kaity 1  


The rise in research work focusing on the detection of cyberbullying incidents on social media platforms reflects how dire the consequences of cyberbullying are, regardless of age, gender, or location. This paper examines scholarly publications (2011–2022) on cyberbullying detection using machine learning through a systematic literature review approach. Specifically, articles were sought from six academic databases (Web of Science, ScienceDirect, IEEE Xplore, Association for Computing Machinery, Scopus, and Google Scholar), resulting in the identification of 4126 articles. A redundancy check followed by eligibility screening and quality assessment resulted in 68 articles being included in this review. The review focused on three key aspects, namely, the machine learning algorithms used to detect cyberbullying, features, and performance measures, further supported by classification roles, language of study, data source, and type of media. The findings are discussed, and research challenges and future directions are provided for researchers to explore.
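Although the review itself does not prescribe a particular pipeline, the kind of setup it surveys can be illustrated with a short text-classification sketch. The toy messages, TF-IDF features, and linear SVM below are generic placeholder choices, not the configuration of any reviewed study.

```python
# Sketch of a generic cyberbullying text-classification pipeline (TF-IDF + linear
# SVM) with the performance measures typically reported in the reviewed studies.
# The toy messages and labels are placeholders, not data from any cited work.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

texts = [
    "you are so stupid nobody likes you",
    "great job on the presentation today",
    "shut up loser everyone hates you",
    "see you at practice tomorrow",
    "you should just disappear",
    "thanks for helping me with homework",
] * 10                                   # repeat so the split has enough samples
labels = [1, 0, 1, 0, 1, 0] * 10         # 1 = bullying, 0 = non-bullying

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=0, stratify=labels)

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))   # precision/recall/F1
```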


https://trends.google.com/trends/explore?date=2011-01-01%202022-12-31&q=deep%20learning&hl=en .


Niu M, Yu L, Tian S, Wang X, Zhang Q (2020) Personal-bullying detection based on Multi-Attention and Cognitive Feature. Autom Control Comput Sci 54(1):52–61

Noviantho, Isa SM, Ashianti L (2018) Cyberbullying classification using text mining. In Proceedings - 2017 1st International Conference on Informatics and Computational Sciences , ICICoS 2017

Patil C, Salmalge S, Nartam P (2020) Cyberbullying detection on multiple SMPs using modular neural network. Advances in Cybernetics, Cognition, and machine learning for Communication Technologies. Springer, Singapore, pp 181–188

Chapter   Google Scholar  

Pawar R, Raje RR (2019) Multilingual Cyberbullying Detection System. In 2019 IEEE International Conference on Electro Information Technology (EIT) (pp. 040–044). IEEE

Pires TM, Nunes IL (2019) Support vector machine for human activity recognition: a comprehensive review. Artif Intell Rev 52(3):1925–1962

Pradhan A, Yatam VM, Bera P (2020) Self-Attention for Cyberbullying Detection. In 2020 International Conference on Cyber Situational Awareness, Data Analytics and Assessment (CyberSA) (pp. 1–6). IEEE

Pérez PJC, Valdez CJL, Ortiz MDGC, Barrera JPS, Pérez PF (2012) MISAAC: Instant messaging tool for cyberbullying detection. In Proceedings of the 2012 International Conference on Artificial Intelligence , ICAI 2012 (pp. 1049–1052)

Rafiq RI, Hosseinmardi H, Han R, Lv Q, Mishra S (2018) Scalable and timely detection of cyberbullying in online social networks. In Proceedings of the 33rd Annual ACM Symposium on Applied Computing (pp. 1738–1747)

Raisi E, Huang B (2018) Weakly supervised cyberbullying detection with participant-vocabulary consistency. Social Netw Anal Min 8(1):38

Reynolds K, Kontostathis A, Edwards L (2011) Using machine learning to detect cyberbullying. In 2011 10th International Conference on Machine learning and applications and workshops (Vol. 2, pp. 241–244). IEEE

Rosa H, Matos D, Ribeiro R, Coheur L, Carvalho JP (2018) A “deeper” look at detecting cyberbullying in social networks. In 2018 International Joint Conference on Neural Networks (IJCNN) (pp. 1–8). IEEE

Rosa H, Pereira N, Ribeiro R, Ferreira PC, Carvalho JP, Oliveira S, Coheur L, Paulino P, Veiga Simão AM, Trancoso I (2019) Automatic cyberbullying detection: A systematic review. Computers in Human Behavior, 93, 333–345

Salawu S, He Y, Lumsden J (2017) Approaches to automated detection of cyberbullying: a survey. IEEE Trans Affect Comput.

Sanchez H, Kumar S (2011) Twitter bullying detection. ser. NSDI , 12 (2011), 15

Shah N, Maqbool A, Abbasi AF (2021) Predictive modeling for cyberbullying detection in social media. J Ambient Intell Humaniz Comput 12(6):5579–5594

Singh A, Kaur, M (2020) Intelligent content-based cybercrime detection in online social networks using cuckoo search metaheuristic approach [Article]. J Supercomput 76(7):5402–5424

Singh VK, Ghosh S, Jose C (2017) Toward multimodal cyberbullying detection. In Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems (pp. 2090–2099)

Soni D, Singh VK (2018) See no evil, hear no evil: Audio-visual-textual cyberbullying detection. Proceedings of the ACM on Human-Computer Interaction , 2 (CSCW), 1–26

Squicciarini A, Rajtmajer S, Liu Y, Griffin C (2015) Identification and characterization of cyberbullying dynamics in an online social network. In Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015 (pp. 280–285)

Sugandhi R, Pande A, Agrawal A, Bhagat H (2016) Automatic monitoring and prevention of cyberbullying. Int J Comput Appl 8:17–19

Tahmasbi N, Rastegari E (2018) A socio-contextual approach in automated detection of public cyberbullying on Twitter. ACM Trans Social Comput 1(4):1–22

Tan SH, Zou W, Zhang J, Zhou Y (2020) Evaluation of machine learning algorithms for prediction of ground-level PM2.5 concentration using satellite-derived aerosol optical depth over China. Environ Sci Pollut Res 27(29):36155–36170

Tarwani S, Jethanandani M, Kant V (2019) Cyberbullying Detection in Hindi-English Code-Mixed Language Using Sentiment Classification. In International Conference on Advances in Computing and Data Sciences (pp. 543–551). Springer, Singapore

Tomkins S, Getoor L, Chen Y, Zhang Y (2018) A socio-linguistic model for cyberbullying detection. In 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM) (pp. 53–60). IEEE

van Geel M, Goemans A, Toprak F, Vedder P (2017) Which personality traits are related to traditional bullying and cyberbullying? A study with the big five, Dark Triad and sadism. Pers Indiv Differ 106:231–235

Van Hee C, Jacobs G, Emmery C, Desmet B, Lefever E, Verhoeven B, …, Hoste V (2018) Automatic detection of cyberbullying in social media text. PLoS ONE, 13(10), e0203794

Van Hee C, Lefever E, Verhoeven B, Mennes J, Desmet B, De Pauw G, …, Hoste V (2015) Detection and fine-grained classification of cyberbullying events. In International Conference Recent Advances in Natural Language Processing (RANLP) (pp. 672–680)

Wang W, Xie X, Wang X, Lei L, Hu Q, Jiang S (2019) Cyberbullying and depression among chinese college students: a moderated mediation model of social anxiety and neuroticism. J Affect Disord 256:54–61

Whiting P, Savović J, Higgins JP et al (2016) ROBIS: a new tool to assess risk of bias in systematic reviews was developed. J Clin Epidemiol 69:225–234

Witten IH, Frank E, Hall MA (2016) Data Mining: practical machine learning tools and techniques, 4th edn. Morgan Kaufmann Publishers

Wright MF (2017) Cyberbullying in cultural context. J Cross-Cult Psychol 48(8):1136–1137

Wu J, Wen M, Lu R, Li B, Li J (2020) Toward efficient and effective bullying detection in online social network. Peer-to-Peer Netw Appl, 1–10

Wu T, Wen S, Xiang Y, Zhou W (2018) Twitter spam detection: survey of new approaches and comparative study. Computers & Security 76:265–284

Yin D, Xue Z, Hong L, Davison BD, Kontostathis A, Edwards L (2009) Detection of harassment on web 2.0. Proceedings of the Content Analysis in the WEB , 2 , 1–7

Zhang X, Tong J, Vishwamitra N, Whittaker E, Mazer JP, Kowalski R, Hu H, Luo F, Macbeth J, Dillon E (2017) Cyberbullying detection with a pronunciation based convolutional neural network. In Proceedings - 2016 15th IEEE International Conference on Machine Learning and Applications, ICMLA 2016

Zhao R, Mao K (2017) Cyberbullying detection based on semantic-enhanced marginalized denoising auto-encoder. IEEE Trans Affect Comput 8(3), 328–339. Article 7412690

Zhao R, Zhou A, Mao K (2016) Automatic detection of cyberbullying on social networks based on bullying features. In Proceedings of the 17th international conference on distributed computing and networking (pp. 1–6)

Zhong H, Li H, Squicciarini AC, Rajtmajer SM, Griffin C, Miller DJ, Caragea C (2016) Content-Driven Detection of Cyberbullying on the Instagram Social Network. In IJCAI (pp. 3952–3958)

Download references

Author information

Authors and Affiliations

Faculty of Computer Science and Information Systems, Universiti Malaya, Kuala Lumpur, 50603, Malaysia

Vimala Balakrisnan & Mohammed Kaity


Contributions

VB wrote the original draft, performed analysis, and revised the article; MK performed data collection, performed analysis, and revised the article. All authors reviewed the manuscript.

Corresponding author

Correspondence to Vimala Balakrisnan .

Ethics declarations

Competing Interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Balakrisnan, V., Kaity, M. Cyberbullying detection and machine learning: a systematic literature review. Artif Intell Rev 56 (Suppl 1), 1375–1416 (2023). https://doi.org/10.1007/s10462-023-10553-w


Accepted: 08 July 2023

Published: 24 July 2023

Issue Date: October 2023

DOI: https://doi.org/10.1007/s10462-023-10553-w


Keywords:
  • Cyberbullying
  • Machine learning
  • Systematic literature review


August 6, 2024


Researchers identify over 2,000 potential toxins using machine learning

by Hebrew University of Jerusalem

Identification of novel toxins using machine learning

In a novel study, researchers have unveiled new secrets about a fascinating bacterial weapon system that acts like a microscopic syringe. The research paper, titled "Identification of novel toxins associated with the extracellular contractile injection system using machine learning," is published in Molecular Systems Biology.

Led by Dr. Asaf Levy of the Hebrew University, together with collaborators from the Hebrew University and the University of Illinois Urbana-Champaign, the team has made significant strides in understanding the extracellular contractile injection system (eCIS), a unique mechanism used by bacteria and archaea to inject toxins into other organisms.

Cracking the bacterial code with AI

The eCIS is a 100-nanometer-long weapon that evolved from viruses that previously attacked microbes (phages). During evolution, these viruses lost their ability to infect microbes and turned into syringes that inject toxins into different organisms, such as insects.

Previously, the Levy group identified eCIS as a weapon carried by more than 1,000 microbial species. Interestingly, these microbes rarely attack humans, and the role of eCIS in nature remains mostly unknown. What is known is that it loads and injects protein toxins.

The specific proteins injected by eCIS and their functions have long remained a mystery; before this study, only about 20 toxins that eCIS can load and inject were known.

To solve this biological puzzle, the research team developed an innovative machine learning tool that combines genetic and biochemical data of different genes and proteins to accurately identify these elusive toxins. The project resulted in identification of over 2,000 potential toxin proteins.

"Our discovery not only sheds light on how microbes interact with their hosts and maybe with each other, but also demonstrates the power of machine learning in uncovering new gene functions," explains Dr. Levy. "This could open up new avenues for developing antimicrobial treatments or novel biotechnological tools."

New toxins with enzymatic activities against different molecules

Using AI technology, the researchers analyzed 950 microbial genomes and identified an impressive 2,194 potential toxins. Among these, four new toxins (named EAT14-17) were experimentally validated by demonstrating that they can inhibit the growth of bacteria or yeast cells.

Remarkably, one of these toxins, EAT14, was found to inhibit cell signaling in human cells, showcasing its potential impact on human health. The group showed that the new toxins likely act as enzymes that damage target cells by attacking proteins, DNA, or a molecule critical to energy metabolism. Moreover, the group was able to decipher the protein sequence code that allows toxins to be loaded into the eCIS syringe.

Recently, it was demonstrated that eCIS can be used as a programmable syringe that can be engineered for injection into various cell types, including brain cells. The new findings from the current paper leverage this ability by providing thousands of toxins that are naturally injected by eCIS and the code that facilitates their loading into the eCIS syringe. The code can be transferred into other proteins of interest.

From microscopic battles to medical breakthroughs

The study's findings could have far-reaching applications in medicine, agriculture, and biotechnology. The newly identified toxins might be used to develop new antibiotics or pesticides, to produce efficient enzymes for various industries, or to engineer microbes that target specific pathogens.

This research highlights the incredible potential of combining biology with artificial intelligence to solve complex problems that could ultimately benefit human health.

"We're essentially deciphering the weapons that bacteria evolved and keep evolving to compete over resources in nature," adds Dr. Levy. "Microbes are creative inventors and it is fulfilling to be part of a group that discovers these amazing and surprising inventions."

The study was led by two students, Aleks Danov and Inbal Pollin, from the Department of Plant Pathology and Microbiology, the Institute of Environmental Sciences.

Journal information: Molecular Systems Biology

Provided by Hebrew University of Jerusalem







Title: TabularBench: Benchmarking Adversarial Robustness for Tabular Deep Learning in Real-World Use-Cases

Abstract: While adversarial robustness in computer vision is a mature research field, fewer researchers have tackled the evasion attacks against tabular deep learning, and even fewer investigated robustification mechanisms and reliable defenses. We hypothesize that this lag in the research on tabular adversarial attacks is in part due to the lack of standardized benchmarks. To fill this gap, we propose TabularBench, the first comprehensive benchmark of robustness of tabular deep learning classification models. We evaluated adversarial robustness with CAA, an ensemble of gradient and search attacks which was recently demonstrated as the most effective attack against a tabular model. In addition to our open benchmark ( this https URL ) where we welcome submissions of new models and defenses, we implement 7 robustification mechanisms inspired by state-of-the-art defenses in computer vision and propose the largest benchmark of robust tabular deep learning, with over 200 models across five critical scenarios in finance, healthcare and security. We curated real datasets for each use case, augmented with hundreds of thousands of realistic synthetic inputs, and trained and assessed our models with and without data augmentations. We open-source our library that provides API access to all our pre-trained robust tabular models, and the largest datasets of real and synthetic tabular inputs. Finally, we analyze the impact of various defenses on the robustness and provide actionable insights to design new defenses and robustification mechanisms.
Subjects: Machine Learning (cs.LG)



Published on 14.8.2024 in Vol 26 (2024)

Cancer Prevention and Treatment on Chinese Social Media: Machine Learning–Based Content Analysis Study

Authors of this article:


Original Paper

  • Keyang Zhao 1*, DPhil
  • Xiaojing Li 1,2*, Prof Dr
  • Jingyang Li 3, DPhil

1 School of Media & Communication, Shanghai Jiao Tong University, Shanghai, China

2 Institute of Psychology and Behavioral Science, Shanghai Jiao Tong University, Shanghai, China

3 School of Software, Shanghai Jiao Tong University, Shanghai, China

*these authors contributed equally

Corresponding Author:

Xiaojing Li, Prof Dr

School of Media & Communication

Shanghai Jiao Tong University

800 Dongchuan Rd.

Minhang District

Shanghai, 200240

Phone: 86 13918611103

Fax: 86 21 34207088

Email: [email protected]

Background: Nowadays, social media plays a crucial role in disseminating information about cancer prevention and treatment. A growing body of research has focused on assessing access and communication effects of cancer information on social media. However, there remains a limited understanding of the comprehensive presentation of cancer prevention and treatment methods across social media platforms. Furthermore, research comparing the differences between medical social media (MSM) and common social media (CSM) is also lacking.

Objective: Using big data analytics, this study aims to comprehensively map the characteristics of cancer treatment and prevention information on MSM and CSM. This approach promises to enhance cancer coverage and assist patients in making informed treatment decisions.

Methods: We collected all posts (N=60,843) from 4 medical WeChat official accounts (accounts with professional medical backgrounds, classified as MSM in this paper) and 5 health and lifestyle WeChat official accounts (accounts with nonprofessional medical backgrounds, classified as CSM in this paper). We applied latent Dirichlet allocation topic modeling to extract cancer-related posts (N=8427) and identified 6 cancer themes separately in CSM and MSM. After manually labeling posts according to our codebook, we used a neural-based method for automated labeling. Specifically, we framed our task as a multilabel task and utilized different pretrained models, such as Bidirectional Encoder Representations from Transformers (BERT) and Global Vectors for Word Representation (GloVe), to learn document-level semantic representations for labeling.

Results: We analyzed a total of 4479 articles from MSM and 3948 articles from CSM related to cancer. Among these, 35.52% (2993/8427) contained prevention information and 44.43% (3744/8427) contained treatment information. Themes in CSM were predominantly related to lifestyle, whereas MSM focused more on medical aspects. The most frequently mentioned prevention measures were early screening and testing, healthy diet, and physical exercise. MSM mentioned vaccinations for cancer prevention more frequently compared with CSM. Both types of media provided limited coverage of radiation prevention (including sun protection) and breastfeeding. The most mentioned treatment measures were surgery, chemotherapy, and radiotherapy. Compared with MSM (1137/8427, 13.49%), CSM (1856/8427, 22.02%) focused more on prevention.

Conclusions: The information about cancer prevention and treatment on social media revealed a lack of balance. The focus was primarily limited to a few aspects, indicating a need for broader coverage of prevention measures and treatments in social media. Additionally, the study’s findings underscored the potential of applying machine learning to content analysis as a promising research approach for mapping key dimensions of cancer information on social media. These findings hold methodological and practical significance for future studies and health promotion.

Introduction

In 2020, 4.57 million new cancer cases were reported in China, accounting for 23.7% of the world’s total [ 1 ]. Many of these cancers, however, can be prevented [ 2 , 3 ]. According to the World Health Organization (WHO), 30%-50% of cancers could be avoided through early detection and by reducing exposure to known lifestyle and environmental risks [ 4 ]. This underscores the imperative to advance education on cancer prevention and treatment.

Mass media serves not only as a primary channel for disseminating cancer information but also as a potent force in shaping the public health agenda [ 5 , 6 ]. Previous studies have underscored the necessity of understanding how specific cancer-related content is presented in the media. For example, the specific cancer types frequently mentioned in news reports have the potential to influence the public’s perception of the actual incidence of cancer [ 7 ].

Nowadays, social media plays an essential role in disseminating health information, coordinating resources, and promoting health campaigns aimed at educating individuals about prevention measures [ 8 ]. Additionally, it influences patients’ decision-making processes regarding treatment [ 9 ]. A study revealed that social media use correlates with increased awareness of cancer screening in the general population [ 10 ]. In recent years, there has been a notable surge in studies evaluating cancer-related content on social media. However, previous studies often focused on specific cancer types [ 11 ] and limited aspects of cancer-related issues [ 12 ]. The most recent comprehensive systematic content analysis of cancer coverage, conducted in 2013, indicated that cancer news coverage has heavily focused on treatment, while devoting very little attention to prevention, detection, or coping [ 13 ].

Evaluating cancer prevention information on social media is crucial for future efforts by health educators and cancer control organizations. Moreover, providing reliable medical information to individuals helps alleviate feelings of fear and uncertainty [ 14 ]. Specifically, patients often seek information online when making critical treatment decisions, such as chemotherapy [ 15 ]. Therefore, it is significant to comprehensively evaluate the types of treatment information available on social media.

Although many studies have explored cancer-related posts from the perspectives of patients with cancer [ 16 ] and caregivers [ 17 ], the analysis of posts from medical professionals has been found to be inadequate [ 18 ]. This paradox arises from the expectation that medical professionals, given their professional advantages, should take the lead in providing cancer education on social media. Nevertheless, a significant number of studies have highlighted the prevalence of unreliable medical information on social media [ 19 ]. A Japanese study highlighted a concerning phenomenon: despite efforts by medical professionals to promote cancer screening online, a significant number of antiscreening activists disseminated contradictory messages on the internet, potentially undermining the effectiveness of cancer education initiatives [ 20 ]. Hence, there is an urgent need for the accurate dissemination of health information on social media, with greater involvement from scientists or professional institutions, to combat the spread of misinformation [ 21 ]. Despite efforts to study professional medical websites [ 22 ] and apps [ 23 ], there remains a lack of comprehensive understanding of the content posted on medical social media (MSM). Further study is thus needed to compare the differences between cancer information on social media from professional medical sources and nonprofessional sources to enhance cancer education.

For this study, we defined social media as internet-based platforms characterized by social interactive functions such as reading, commenting, retweeting, and timely interaction [ 24 ]. Based on this definition, we further classified 2 types of media based on ownership, content, and contributors: common social media (CSM) and MSM. MSM refers to social media platforms owned by professional medical institutions or organizations. It primarily provides medical and health information by medical professionals, including medical-focused accounts on social media and mobile health apps. CSM refers to social media owned or managed by individuals without medical backgrounds. It mainly provides health and lifestyle content.

Similar to Facebook (Meta Platforms, Inc.), WeChat (Tencent Holdings Limited) is the most popular social media platform in China, installed on more than 90% of smartphones. Zhang et al [ 25 ] have indicated that 63.26% of people prefer to obtain health information from WeChat. Unlike other Chinese social media platforms, WeChat has a broader user base that spans various age groups [ 26 ]. WeChat Public Accounts (WPAs) operate within the WeChat platform, offering services and information to the public. Many hospitals and primary care institutions in China have increasingly registered WPAs to provide health care services, medical information, health education, and more [ 27 ]. Therefore, this study selected WPA as the focus of research.

Based on big data analytics, this study aims to comprehensively map the characteristics of cancer treatment and prevention information on MSM and CSM, which could significantly enhance cancer coverage and assist patients in treatment decision-making. To address the aforementioned research gaps, 2 research questions were formulated.

  • Research question 1: What are the characteristics of cancer prevention information discussed on social media? What are the differences between MSM and CSM?
  • Research question 2: What are the characteristics of cancer treatment information discussed on social media? What are the differences between MSM and CSM?

Data Collection and Processing

We selected representative WPAs based on the reports from the “Ranking of Influential Health WeChat Public Accounts” [ 28 ] and the “2021 National Rankings of Best Hospitals by Specialty” [ 29 ]. In this study, we focused on 4 medical WPAs within MSM: Doctor Dingxiang (丁香医生), 91Huayi (华医网), The Cancer Hospital of Chinese Academy of Medical Sciences (中国医学科学院肿瘤医院), and Fudan University Shanghai Cancer Center (复旦大学附属肿瘤医院). We also selected 5 health and lifestyle WeChat Official Accounts classified as CSM for this study: Health Times (健康时报), Family Doctor (家庭医生), CCTV Lifestyle (CCTV 生活圈), Road to Health (健康之路), and Life Times (生命时报).

We implemented a Python-based (Python Foundation) crawler to retrieve posts from the aforementioned WPAs. We then applied a filtration process to eliminate noisy and unreliable data. Because our focus was on posts providing substantial information, documents containing fewer than 100 Chinese characters were deleted, and figures and videos were removed from the remaining documents. The analysis was then conducted at the paragraph level. Random sampling showed that noise in WPA articles mostly originates from advertisements, which are typically confined to specific paragraphs; we therefore retained only paragraphs that did not contain advertising keywords. In total, we collected 60,843 posts from these WPAs, comprising 20,654 articles from MSM and 40,189 articles from CSM.
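
As a rough illustration of this cleaning step, the sketch below drops short documents and advertisement-like paragraphs. The 100-Chinese-character threshold follows the description above, while the keyword list (AD_KEYWORDS), helper names, and the raw_posts input are hypothetical assumptions rather than the authors' actual implementation.

```python
import re
from typing import Optional

MIN_CHINESE_CHARS = 100                          # threshold described above
AD_KEYWORDS = ["广告", "优惠", "扫码", "购买链接"]    # hypothetical advertising markers

def count_chinese_chars(text: str) -> int:
    """Count CJK unified ideographs in a string."""
    return len(re.findall(r"[\u4e00-\u9fff]", text))

def clean_post(raw_text: str) -> Optional[str]:
    """Drop posts with too few Chinese characters, then strip ad-like paragraphs."""
    if count_chinese_chars(raw_text) < MIN_CHINESE_CHARS:
        return None
    kept = [p for p in raw_text.split("\n")
            if p.strip() and not any(kw in p for kw in AD_KEYWORDS)]
    return "\n".join(kept) if kept else None

# Usage (raw_posts is an assumed list of crawled post texts):
# cleaned_docs = [c for c in (clean_post(p) for p in raw_posts) if c]
```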

The workflow chart in Figure 1 depicts all procedures following data collection and preprocessing. After obtaining meaningful raw documents, we performed word-level segmentation on the texts. We then removed insignificant stopwords and replaced specific types of cancers with a general term to facilitate coarse-grained latent Dirichlet allocation (LDA)–based filtering. Subsequently, we conducted fine-grained LDA topic modeling on the filtered documents without replacing keywords to visualize the topics extracted from the WPAs. Furthermore, we utilized a manually labeled codebook to train a long short-term memory (LSTM) network for document classification into various categories. Finally, we performed data analysis using both the topic distribution derived from fine-grained LDA and the classified documents.

[Figure 1. Workflow of the procedures following data collection and preprocessing.]

Latent Dirichlet Allocation Topic Modeling

LDA is a generative statistical model that explains sets of observations by latent groups, revealing why some parts of the data are similar [ 30 ]. The LDA algorithm can infer the topic distribution of a document.

When comparing LDA with other natural language processing methods such as LSTM-based deep learning, it is worth noting that LDA stands out as an unsupervised learning algorithm. Unlike its counterparts, LDA has the ability to uncover hidden topics without relying on labeled training data. Its strength lies in its capability to automatically identify latent topics within documents by analyzing statistical patterns of word co-occurrences. In addition, LDA provides interpretable outcomes by assigning a probability distribution to each document, representing its association with various topics. Similarly, it assigns a probability distribution to each topic, indicating the prevalence of specific words within that topic. This feature enables researchers to understand the principal themes present in their corpus and the extent to which these themes are manifested in individual documents.

The foundational principle of LDA involves using probabilistic inference to estimate the distribution of topics and word allocations. Specifically, LDA assumes that each document is composed of a mixture of a small number of topics, and each word’s presence can be attributed to one of these topics. This approach allows for overlapping content among documents, rather than strict categorization into separate groups. For a deeper understanding of the technical and theoretical aspects of the LDA algorithm, readers are encouraged to refer to the research conducted by Blei et al [ 30 ]. In this context, our primary focus was on the application of the algorithm to our corpus, and the procedure is outlined in the following sections.

Document Selection

Initially, document selection involves using a methodological approach to sample documents from the corpus, which may include random selection or be guided by predetermined criteria such as document relevance or popularity within the social media context.

Topic Inference

Utilizing LDA or a similar topic modeling technique, we infer the underlying topical structure within each document. This involves modeling documents as mixtures of latent topics represented by a Dirichlet distribution, from which topic proportions are sampled.

Topic Assignment to Words

After determining topic proportions, we proceed to assign topics to individual words in the document. Using a multinomial distribution, each word is probabilistically associated with one of the inferred topics based on the previously derived topic proportions.

Word Distribution Estimation

Each topic is characterized by a distinct distribution over the vocabulary, representing the likelihood of observing specific words within that topic. Using a Dirichlet distribution, we estimate the word distribution for each inferred topic.

Word Generation

Finally, using the multinomial distribution again, we generate words for the document by sampling from the estimated word distribution corresponding to the topic assigned to each word. This iterative process produces synthetic text that mirrors the statistical properties of the original corpus.
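
Taken together, the steps above correspond to the standard LDA generative process of Blei et al [ 30 ]. In conventional notation, with K topics, D documents, and Dirichlet priors α (document-topic) and η (topic-word), the model can be written as:

```latex
\begin{aligned}
\varphi_k &\sim \operatorname{Dirichlet}(\eta), && k = 1, \dots, K && \text{(word distribution of topic } k\text{)} \\
\theta_d  &\sim \operatorname{Dirichlet}(\alpha), && d = 1, \dots, D && \text{(topic proportions of document } d\text{)} \\
z_{d,n}   &\sim \operatorname{Multinomial}(\theta_d), && n = 1, \dots, N_d && \text{(topic assignment of word } n\text{)} \\
w_{d,n}   &\sim \operatorname{Multinomial}(\varphi_{z_{d,n}}) && && \text{(observed word)}
\end{aligned}
```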

To filter out noncancer-related documents in our case, we replaced cancer-related words with “癌症” (cancer or tumor in Chinese) in all documents. We then conducted an LDA analysis to compute the topic distribution of each document and retained documents related to topics where “癌症” appears among the top 10 words.
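
A minimal sketch of this filtering step is shown below. It assumes a trained gensim LdaModel (lda), its Dictionary (dictionary), the bag-of-words corpus, and the cleaned_docs list already exist (see the model-selection sketch further below for how they might be built), and the 0.1 topic-probability cutoff is an illustrative assumption rather than the authors' setting.

```python
def topic_mentions_cancer(lda, dictionary, topic_id, topn=10):
    """True if the normalized cancer term appears among the topic's top-n words."""
    top_words = [dictionary[wid] for wid, _ in lda.get_topic_terms(topic_id, topn=topn)]
    return "癌症" in top_words

def is_cancer_related(lda, dictionary, bow, min_prob=0.1):
    """Keep a document if any sufficiently probable topic features the cancer term."""
    return any(prob >= min_prob and topic_mentions_cancer(lda, dictionary, tid)
               for tid, prob in lda.get_document_topics(bow, minimum_probability=min_prob))

cancer_docs = [doc for doc, bow in zip(cleaned_docs, corpus)
               if is_cancer_related(lda, dictionary, bow)]
```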

In our study, we used Python packages such as jieba and gensim for document segmentation and extracting per-topic-per-word probabilities from the model. During segmentation, we applied a stopword dictionary to filter out meaningless words and transformed each document into a cleaned version containing only meaningful words.

During the LDA analysis, to determine the optimal number of topics, our main goal was to compute the topic coherence for various numbers of topics and select the model that yielded the highest coherence score. Coherence measures the interpretability of each topic by assessing whether the words within the same topic are logically associated with each other. The higher the score for a specific number k, the more closely related the words are within that topic. In this phase, we used the Python package pyLDAvis to compare coherence scores with different numbers of topics. Subsequently, we filtered and retained only the documents related to cancer topics, resulting in 4479 articles from MSM and 3948 articles from CSM.
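
The segmentation and coherence-based model selection could be sketched as follows with jieba and gensim. Note that this sketch uses gensim's CoherenceModel to compute the coherence score; the stopword file, the candidate range of k, the c_v measure, and the reuse of cleaned_docs from the cleaning sketch above are illustrative assumptions, not the authors' exact settings.

```python
import jieba
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

def segment(text, stopwords):
    """Word-level segmentation with stopword removal."""
    return [w for w in jieba.lcut(text) if w.strip() and w not in stopwords]

stopwords = set(open("stopwords.txt", encoding="utf-8").read().split())  # assumed stopword list
texts = [segment(doc, stopwords) for doc in cleaned_docs]                # documents from the cleaning step

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

best_k, best_score, best_model = None, float("-inf"), None
for k in range(4, 13):  # candidate numbers of topics (illustrative range)
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                   passes=10, random_state=42)
    coherence = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                               coherence="c_v").get_coherence()
    if coherence > best_score:
        best_k, best_score, best_model = k, coherence, lda

print(f"Selected {best_k} topics (c_v coherence = {best_score:.3f})")
```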

Among the filtered articles, we conducted another LDA analysis to extract topics from the original articles without replacing cancer-related words. Using pyLDAvis, we calculated the coherence score and identified 6 topics for both MSM and CSM articles.

To visualize the topic modeling results, we created bar graphs where the y-axis indicates the top 10 keywords associated with each topic, and the x-axis represents the weight of each keyword (indicating its contribution to the topic). At the bottom of each graph ( Figures 2 and 3 ), we generalized and presented the name of each topic based on the top 10 most relevant keywords.

[Figures 2 and 3. Top 10 keywords and their weights for each topic identified on MSM and CSM.]

Manual Content Analysis: Coding Procedure

Based on the codebook, 2 independent coders (KZ and JL) engaged in discussions regarding the coding rules to ensure a shared understanding of the conceptual and operational distinctions among the coding items. To ensure the reliability of the coding process, both coders independently coded 100 randomly selected articles. Upon completion of the pilot coding, any disagreements were resolved through discussion between the 2 coders.

For the subsequent coding phase, each coder was assigned an equitable proportion of articles, with 10% of the cancer-related articles randomly sampled from both MSM samples (450/4479) and CSM samples (394/3948). Manual coding was performed on a total of 844 articles, which served as the training data set for the machine learning model. The operational definitions of each coding variable are detailed in Multimedia Appendix 1 .

Coding Measures

Cancer Prevention Measures

Coders identified whether an article mentioned any of the following cancer prevention measures [ 31 - 35 ]: (1) avoid tobacco use, (2) maintain a healthy weight, (3) healthy diet, (4) exercise regularly, (5) limit alcohol use, (6) get vaccinated, (7) reduce exposure to ultraviolet radiation and ionizing radiation, (8) avoid urban air pollution and indoor smoke from household use of solid fuels, (9) early screening and detection, (10) breastfeeding, (11) controlling chronic infections, and (12) other prevention measures.

Cancer Treatment Measures

Coders identified whether an article mentioned any of the following treatments [ 36 ]: (1) surgery (including cryotherapy, lasers, hyperthermia, photodynamic therapy, cuts with scalpels), (2) radiotherapy, (3) chemotherapy, (4) immunotherapy, (5) targeted therapy, (6) hormone therapy, (7) stem cell transplant, (8) precision medicine, (9) cancer biomarker testing, and (10) other treatment measures.

Neural-Based Machine Learning

In this part, we attempted to label each article using a neural network. As mentioned earlier, we manually labeled 450 MSM articles and 394 CSM articles. We divided the labeled data into a training set and a test set with a ratio of 4:1. We adopted the pretrained Bidirectional Encoder Representations from Transformers (BERT) model. As BERT can only accept inputs of at most 512 tokens [ 37 ], we segmented each document into pieces of 510 tokens (accounting for BERT's automatic [CLS] and [SEP] tokens, where [CLS] denotes the start of a sentence or a document, and [SEP] denotes the end of a sentence or a document) with an overlap of 384 tokens between adjacent pieces. We first used a BERT-based encoder to encode each piece and predict its labels using a multioutput decoder. After predicting labels for each piece, we pooled the outputs for all pieces within the same document and used an LSTM network to predict final labels for each document.
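
The chunk-then-pool architecture described above might look roughly like the sketch below, using Hugging Face Transformers and PyTorch. The bert-base-chinese checkpoint, the use of each chunk's [CLS] vector, the LSTM hidden size, and the number of labels are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint

def chunk_token_ids(text, max_len=510, overlap=384):
    """Split a document into overlapping 510-token pieces (special tokens added later)."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    stride = max_len - overlap  # 126 tokens between chunk starts
    return [ids[i:i + max_len] for i in range(0, max(len(ids) - overlap, 1), stride)]

class ChunkedBertLstm(nn.Module):
    """Encode each chunk with BERT, pool chunk vectors with an LSTM, output multilabel probabilities."""
    def __init__(self, num_labels, lstm_hidden=256):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")
        self.lstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden, batch_first=True)
        self.head = nn.Linear(lstm_hidden, num_labels)

    def forward(self, chunks):  # chunks: list of token-id lists for one document
        cls_id, sep_id = tokenizer.cls_token_id, tokenizer.sep_token_id
        vecs = []
        for ids in chunks:
            input_ids = torch.tensor([[cls_id] + ids + [sep_id]])   # shape (1, <=512)
            out = self.bert(input_ids=input_ids)
            vecs.append(out.last_hidden_state[:, 0])                # chunk-level [CLS] vector
        seq = torch.stack(vecs, dim=1)                              # (1, num_chunks, hidden_size)
        _, (h_n, _) = self.lstm(seq)
        return torch.sigmoid(self.head(h_n[-1]))                    # one probability per label

# Usage (22 labels, e.g. the 12 prevention + 10 treatment categories, is an assumption):
# probs = ChunkedBertLstm(num_labels=22)(chunk_token_ids(document_text))
```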

Ethical Considerations

This study did not require institutional research board review as it did not involve interactions with humans or other living entities, private or personally identifiable information, or any pharmaceuticals or medical devices. The data set consists solely of publicly available social media posts.

Cancer Topics on Social Media

Applying LDA, we identified 6 topics each for MSM and CSM articles. The distribution of topics among MSM and CSM is presented in Table 1 , while the keyword weights for each topic are illustrated in Figures 2 and 3 .

Table 1. Cancer-related topics identified on MSM and CSM (topic description; number of articles, n (%); top 10 keywords).

MSM
  • Topic 1, Liver cancer and stomach cancer: 1519 (18.03%). Keywords: Cancer (癌症), liver cancer (肝癌), stomach cancer (胃癌), factors (因素), food (食物), disease (疾病), pylorus (幽门), exercise (运动), patient (患者), and diet (饮食)
  • Topic 2, Female and cancer: 1611 (19.12%). Keywords: Breast cancer (乳腺癌), female (女性), patient (患者), lung cancer (肺癌), surgery (手术), tumor (肿瘤), mammary gland (乳腺), expert (专家), ovarian cancer (卵巢癌), and lump (结节)
  • Topic 3, Breast cancer: 1093 (12.97%). Keywords: Breast cancer (乳腺癌), surgery (手术), thyroid (甲状腺), lump (结节), breast (乳房), patient (患者), female (女性), screening and testing (检查), mammary gland (乳腺), and tumor (肿瘤)
  • Topic 4, Cervical cancer: 1019 (12.09%). Keywords: Vaccine (疫苗), cervical cancer (宫颈癌), virus (病毒), cervix (宫颈), patient (患者), nation (国家), female (女性), nasopharynx cancer (鼻咽癌), medicine (药品), and hospital (医院)
  • Topic 5, Clinical cancer treatment: 2548 (30.24%). Keywords: Tumor (肿瘤), patient (患者), screening (检查), chemotherapy (化疗), clinic (临床), symptom (症状), hospital (医院), surgery (手术), medicine (药物), and disease (疾病)
  • Topic 6, Diet and cancer risk: 1741 (20.66%). Keywords: Patient (患者), tumor (肿瘤), food (食物), polyp (息肉), professor (教授), nutrition (营养), expert (专家), surgery (手术), cancer (癌症), and disease (疾病)

CSM
  • Topic 1, Cancer-causing substances: 1136 (13.48%). Keywords: Foods (食物), nutrition (营养), carcinogen (致癌物), food (食品), ingredient (含量), vegetable (蔬菜), cancer (癌症), body (人体), lump (结节), and formaldehyde (甲醛)
  • Topic 2, Cancer treatment: 1319 (15.65%). Keywords: Patient (患者), cancer (癌症), hospital (医院), lung cancer (肺癌), tumor (肿瘤), medicine (药物), disease (疾病), professor (教授), surgery (手术), and clinic (临床)
  • Topic 3, Female and cancer risk: 1599 (18.97%). Keywords: Screening and testing (检查), female (女性), disease (疾病), breast cancer (乳腺癌), cancer (癌症), lung cancer (肺癌), patient (患者), body (身体), tumor (肿瘤), and risk (风险)
  • Topic 4, Exercise, diet, and cancer risk: 1947 (23.10%). Keywords: Cancer (癌症), exercise (运动), food (食物), risk (风险), body (身体), disease (疾病), suggestion (建议), patient (患者), fat (脂肪), and hospital (医院)
  • Topic 5, Screening and diagnosis of cancer: 1790 (21.24%). Keywords: Screening and testing (检查), disease (疾病), hospital (医院), stomach cancer (胃癌), symptom (症状), patient (患者), cancer (癌症), liver cancer (肝癌), female (女性), and suggestion (建议)
  • Topic 6, Disease and body parts: 869 (10.31%). Keywords: Disease (疾病), intestine (肠道), food (食物), hospital (医院), oral cavity (口腔), patient (患者), teeth (牙齿), cancer (癌症), ovary (卵巢), and garlic (大蒜)

a In each article, different topics may appear at the same time. Therefore, the total frequency of each topic did not equate to the total number of 8427 articles.

b To ensure the accuracy of the results, directly translating sampled texts from Chinese into English posed challenges due to differences in semantic elements. In English, cancer screening refers to detecting the possibility of cancer before symptoms appear, while diagnostic tests confirm the presence of cancer after symptoms are observed. However, in Chinese, the term “检查” encompasses both meanings. Therefore, we translated it as both screening and testing.


Among MSM articles, topic 5 was the most frequent (2548/8427, 30.24%), followed by topic 6 (1741/8427, 20.66%) and topic 2 (1611/8427, 19.12%). Both topics 5 and 6 focused on clinical treatments, with topic 5 specifically emphasizing cancer diagnosis. The keywords in topic 6, such as “polyp,” “tumor,” and “surgery,” emphasized the risk and diagnosis of precancerous lesions. Topic 2 primarily focused on cancer surgeries related to breast cancer, lung cancer, and ovarian cancer. The results indicate that MSM articles concentrated on specific cancers with higher incidence in China, including stomach cancer, liver cancer, lung cancer, breast cancer, and cervical cancer [ 10 ].

On CSM, topic 4 (1947/8427, 23.10%) had the highest proportion, followed by topic 5 (1790/8427, 21.24%) and topic 3 (1599/8427, 18.97%). Topic 6 had the smallest proportion. Topics 1 and 4 were related to lifestyle. Topic 1 particularly focused on cancer-causing substances, with keywords such as “food,” “nutrition,” and “carcinogen” appearing most frequently. Topic 4 was centered around exercise, diet, and their impact on cancer risk. Topics 3 and 5 were oriented toward cancer screening and diagnosis. Topic 3 specifically focused on female-related cancers, with discussions prominently featuring breast cancer screening and testing. Topic 5 emphasized early detection and diagnosis of stomach and lung cancers, highlighting keywords such as “screening” and “symptom.”

Cancer Prevention Information

Our experiment on the test set showed that the machine learning model achieved F1-scores above 85 for both prevention and treatment categories in both MSM and CSM. For subclasses within prevention and treatment, we achieved F1-scores of at least 70 for dense categories (with an occurrence rate >10%, ie, occurs in >1 of 10 entries) and at least 50 for sparse categories (with an occurrence rate <10%, ie, occurs in <1 of 10 entries). Subsequently, we removed items labeled as “other prevention measures” and “other treatment measures” due to semantic ambiguity.
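
For context, per-category F1-scores of this kind are typically computed label by label on the binary indicator matrix of predictions. A brief scikit-learn sketch is shown below; the y_true and y_prob values, the label names, and the 0.5 decision threshold are toy assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy multilabel outputs for 4 documents and 3 labels (illustrative only)
y_true = np.array([[1, 0, 1],
                   [0, 1, 1],
                   [1, 1, 0],
                   [0, 0, 1]])
y_prob = np.array([[0.92, 0.15, 0.71],
                   [0.10, 0.80, 0.44],
                   [0.85, 0.60, 0.05],
                   [0.30, 0.20, 0.66]])
y_pred = (y_prob >= 0.5).astype(int)                     # threshold the sigmoid outputs

per_label_f1 = f1_score(y_true, y_pred, average=None)    # one F1 per label, on a 0-1 scale
for label, score in zip(["early screening and testing", "healthy diet", "surgery"], per_label_f1):
    print(f"{label}: F1 = {100 * score:.1f}")             # reported here on a 0-100 scale
```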

Table 2 presents the distribution of cancer prevention information across MSM (n=4479) and CSM (n=3948).

Table 2. Cancer prevention measures mentioned on MSM (n=4479) and CSM (n=3948), n (%).
  • Articles containing prevention information: MSM 1137 (25.39%); CSM 1856 (47.01%)
  • Early screening and testing: MSM 737 (16.45%); CSM 1085 (27.48%)
  • Healthy diet: MSM 278 (6.21%); CSM 598 (15.15%)
  • Get vaccinated: MSM 261 (5.83%); CSM 113 (2.86%)
  • Avoid tobacco use: MSM 186 (4.15%); CSM 368 (9.32%)
  • Exercise regularly: MSM 135 (3.01%); CSM 661 (16.74%)
  • Limit alcohol use: MSM 128 (2.86%); CSM 281 (7.12%)
  • Avoid urban air pollution and indoor smoke from household use of solid fuels: MSM 19 (0.42%); CSM 64 (1.62%)
  • Maintain a healthy weight: MSM 18 (0.40%); CSM 193 (4.89%)
  • Practice safe sex: MSM 12 (0.27%); CSM 4 (0.10%)
  • Controlling chronic infections: MSM 3 (0.07%); CSM 32 (0.81%)
  • Reduce exposure to radiation: MSM 2 (0.04%); CSM 1 (0.03%)
  • Breastfeeding: MSM 1 (0.02%); CSM 1 (0.03%)

a MSM: medical social media.

b CSM: common social media.

Cancer Prevention Information on MSM

The distribution of cancer prevention information on MSM (n=4479) is as follows: articles discussing prevention measures accounted for 25.39% (1137/4479) of all MSM cancer-related articles. The most frequently mentioned measure was “early screening and testing” (737/4479, 16.45%). The second and third most frequently mentioned prevention measures were “healthy diet” (278/4479, 6.21%) and “get vaccinated” (261/4479, 5.83%). The least mentioned prevention measures were “controlling chronic infections” (3/4479, 0.07%), “reduce exposure to radiation” (2/4479, 0.04%), and “breastfeeding” (1/4479, 0.02%), each appearing in only 1-3 articles.

Cancer Prevention Information on CSM

As many as 1856 out of 3948 (47.01%) articles on CSM referred to cancer prevention information. Among these, “early screening and testing” (1085/3948, 27.48%) was the most commonly mentioned prevention measure. “Exercise regularly” (661/3948, 16.74%) and “healthy diet” (598/3948, 15.15%) were the 2 most frequently mentioned lifestyle-related prevention measures. Additionally, “avoid tobacco use” accounted for 9.32% (368/3948) of mentions. Other lifestyle-related prevention measures were “limit alcohol use” (281/3948, 7.12%) and “maintain a healthy weight” (193/3948, 4.89%). The least mentioned prevention measures were “practice safe sex” (4/3948, 0.10%), “reduce exposure to radiation” (1/3948, 0.03%), and “breastfeeding” (1/3948, 0.03%), each appearing in only 1-4 articles.

Cancer Prevention Information on Social Media

Table 3 presents the overall distribution of cancer prevention information on social media (N=8427). Notably, CSM showed a stronger focus on cancer prevention (1856/3948, 47.01%) compared with MSM (1137/4479, 25.39%). Both platforms highlighted the importance of early screening and testing. However, MSM placed greater emphasis on vaccination as a prevention measure. In addition to lifestyle-related prevention measures, both CSM and MSM showed relatively less emphasis on avoiding exposure to environmental carcinogens, such as air pollution, indoor smoke, and radiation. “Breastfeeding” was the least mentioned prevention measure (2/8427, 0.02%) on both types of social media.

Table 3. Cancer prevention measures mentioned on MSM, CSM, and overall (N=8427), n (%).
  • Articles containing prevention information: MSM 1137 (13.49%); CSM 1856 (22.02%); overall 2993 (35.52%)
  • Early screening and testing: MSM 737 (8.75%); CSM 1085 (12.88%); overall 1822 (21.62%)
  • Healthy diet: MSM 278 (3.30%); CSM 598 (7.10%); overall 876 (10.40%)
  • Get vaccinated: MSM 261 (3.10%); CSM 113 (1.34%); overall 374 (4.44%)
  • Avoid tobacco use: MSM 186 (2.21%); CSM 368 (4.37%); overall 554 (6.57%)
  • Exercise regularly: MSM 135 (1.60%); CSM 661 (7.84%); overall 796 (9.45%)
  • Limit alcohol use: MSM 128 (1.52%); CSM 281 (3.33%); overall 409 (4.85%)
  • Avoid urban air pollution and indoor smoke from household use of solid fuels: MSM 19 (0.23%); CSM 64 (0.76%); overall 83 (0.98%)
  • Maintain a healthy weight: MSM 18 (0.21%); CSM 193 (2.29%); overall 211 (2.50%)
  • Practice safe sex: MSM 12 (0.14%); CSM 4 (0.05%); overall 16 (0.19%)
  • Controlling chronic infections: MSM 3 (0.04%); CSM 32 (0.38%); overall 35 (0.42%)
  • Reduce exposure to radiation: MSM 2 (0.02%); CSM 1 (0.01%); overall 3 (0.04%)
  • Breastfeeding: MSM 1 (0.01%); CSM 1 (0.01%); overall 2 (0.02%)

Cancer Treatment Information

Table 4 presents the distribution of cancer treatment information on MSM (n=4479) and CSM (n=3948).

Table 4. Cancer treatment measures mentioned on MSM (n=4479) and CSM (n=3948), n (%).
  • Articles containing treatment information: MSM 2966 (66.22%); CSM 778 (19.71%)
  • Surgery: MSM 2045 (45.66%); CSM 419 (10.61%)
  • Chemotherapy: MSM 1122 (25.05%); CSM 285 (7.22%)
  • Radiation therapy: MSM 1108 (24.74%); CSM 232 (5.88%)
  • Cancer biomarker testing: MSM 380 (8.48%); CSM 55 (1.39%)
  • Targeted therapy: MSM 379 (8.46%); CSM 181 (4.58%)
  • Immunotherapy: MSM 317 (7.08%); CSM 22 (0.56%)
  • Hormone therapy: MSM 47 (1.05%); CSM 14 (0.35%)
  • Stem cell transplantation therapy: MSM 5 (0.11%); CSM 0 (0%)

Cancer Treatment Information on MSM

Cancer treatment information appeared in 66.22% (2966/4479) of MSM posts. “Surgery” was the most frequently mentioned treatment measure (2045/4479, 45.66%), followed by “chemotherapy” (1122/4479, 25.05%) and “radiation therapy” (1108/4479, 24.74%). The proportions of “cancer biomarker testing” (380/4479, 8.48%), “targeted therapy” (379/4479, 8.46%), and “immunotherapy” (317/4479, 7.08%) were comparable. Only a minimal percentage of articles (47/4479, 1.05%) addressed “hormone therapy.” Furthermore, “stem cell transplantation therapy” was mentioned in just 5 out of 4479 (0.11%) articles.

Cancer Treatment Information on CSM

Cancer treatment information accounted for only 19.71% (778/3948) of CSM posts. “Surgery” was the most frequently mentioned treatment measure (419/3948, 10.61%), followed by “chemotherapy” (285/3948, 7.22%) and “radiation therapy” (232/3948, 5.88%). Relatively, the frequency of “targeted therapy” (181/3948, 4.58%) was similar to that of the first 3 types. However, “cancer biomarker testing” (55/3948, 1.39%), “immunotherapy” (22/3948, 0.56%), and “hormone therapy” (14/3948, 0.35%) appeared rarely on CSM. Notably, there were no articles on CSM mentioning stem cell transplantation.

Cancer Treatment Information on Social Media

Table 5 shows the overall distribution of cancer treatment information on social media (N=8427). A total of 44.43% (3744/8427) of articles contained treatment information. MSM (2966/8427, 35.20%) discussed treatment much more frequently than CSM (778/8427, 9.23%), and every type of treatment measure was mentioned more often on MSM than on CSM. The 3 most frequently mentioned treatment measures were surgery (2464/8427, 29.24%), chemotherapy (1407/8427, 16.70%), and radiation therapy (1340/8427, 15.90%). MSM (380/8427, 4.51%) also gave noticeably more attention to cancer biomarker testing than CSM (55/8427, 0.65%).

Table 5. Overall distribution of cancer treatment information on social media (N=8427).

Type of cancer treatment measure | MSM, n (%) | CSM, n (%) | Overall (N=8427), n (%)
Articles containing treatment information | 2966 (35.20) | 778 (9.23) | 3744 (44.43)
Surgery | 2045 (24.27) | 419 (4.97) | 2464 (29.24)
Radiation therapy | 1108 (13.15) | 232 (2.75) | 1340 (15.90)
Chemotherapy | 1122 (13.31) | 285 (3.38) | 1407 (16.70)
Immunotherapy | 317 (3.76) | 22 (0.26) | 339 (4.02)
Targeted therapy | 379 (4.50) | 181 (2.15) | 560 (6.65)
Hormone therapy | 47 (0.56) | 14 (0.17) | 61 (0.72)
Stem cell transplantation therapy | 5 (0.06) | 0 (0.00) | 5 (0.06)
Cancer biomarker testing | 380 (4.51) | 55 (0.65) | 435 (5.16)
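
As a cross-check of the pooled figures in Table 5, the short Python sketch below recombines the per-platform counts reported in Table 4 and recomputes each percentage against the full sample of 8427 articles. The counts are copied from the tables above; everything else (variable names, output format) is illustrative only.

```python
# Recompute the pooled treatment-information figures in Table 5 from the
# per-platform counts in Table 4 (MSM n=4479, CSM n=3948, overall N=8427).
MSM_N, CSM_N = 4479, 3948
TOTAL_N = MSM_N + CSM_N  # 8427

msm_counts = {
    "Articles containing treatment information": 2966,
    "Surgery": 2045,
    "Chemotherapy": 1122,
    "Radiation therapy": 1108,
    "Cancer biomarker testing": 380,
    "Targeted therapy": 379,
    "Immunotherapy": 317,
    "Hormone therapy": 47,
    "Stem cell transplantation therapy": 5,
}
csm_counts = {
    "Articles containing treatment information": 778,
    "Surgery": 419,
    "Chemotherapy": 285,
    "Radiation therapy": 232,
    "Cancer biomarker testing": 55,
    "Targeted therapy": 181,
    "Immunotherapy": 22,
    "Hormone therapy": 14,
    "Stem cell transplantation therapy": 0,
}

for measure in msm_counts:
    pooled = msm_counts[measure] + csm_counts[measure]
    # Percentages are expressed relative to all 8427 articles, as in Table 5.
    print(f"{measure}: {pooled} ({pooled / TOTAL_N * 100:.2f}%)")
```

Running the sketch reproduces, for example, "Surgery: 2464 (29.24%)" and "Articles containing treatment information: 3744 (44.43%)", matching Table 5.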

Cancer Topics on MSM and CSM

On MSM, treatment-related topics constituted the largest proportion, featuring keywords related to medical examinations. Conversely, on CSM, the distribution of topics appeared more balanced, with keywords frequently associated with cancer risk and screening. Overall, the distribution of topics on MSM and CSM revealed that CSM placed greater emphasis on lifestyle factors and early screening and testing. Specifically, CSM topics focused more on early cancer screening and addressed cancer types with high incidence rates. By contrast, MSM topics centered more on clinical treatment, medical testing, and the cervical cancer vaccine for prevention. Additionally, MSM focused on cancer types that are easier to screen for and prevent, including liver cancer, stomach cancer, breast cancer, cervical cancer, and colon cancer.
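
The abbreviation list and the citation of Blei et al indicate that the topic comparison above rests on latent Dirichlet allocation (LDA). As a hedged sketch only, not the authors' actual pipeline, the snippet below shows one common way to extract per-platform topics from Chinese articles with jieba and gensim; the tokenizer, stopword handling, and topic count are all assumptions.

```python
# Minimal LDA topic-modeling sketch (assumptions: jieba for Chinese word
# segmentation, gensim's LdaModel, and 10 topics per platform; none of these
# settings are specified in the article itself).
import jieba
from gensim import corpora
from gensim.models import LdaModel

def extract_topics(articles, num_topics=10, stopwords=frozenset()):
    """articles: list of raw article strings from one platform (MSM or CSM)."""
    # Segment each article into tokens, dropping stopwords and single characters.
    docs = [
        [tok for tok in jieba.lcut(text) if tok not in stopwords and len(tok) > 1]
        for text in articles
    ]
    dictionary = corpora.Dictionary(docs)
    dictionary.filter_extremes(no_below=5, no_above=0.5)  # prune rare/ubiquitous terms
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=num_topics, passes=10, random_state=42)
    return lda.print_topics(num_words=10)

# Hypothetical usage: compare keyword profiles across the two platforms.
# for topic in extract_topics(msm_articles): print(topic)
# for topic in extract_topics(csm_articles): print(topic)
```

Fitting the same pipeline separately to the MSM and CSM corpora would yield the kind of per-platform keyword profiles discussed above.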

Cancer Prevention Information on MSM and CSM

Through content analysis, it was found that 35.52% (2993/8427) of articles on social media contained prevention information and 44.43% (3744/8427) contained treatment information. Compared with MSM (1137/8427, 13.49%), CSM (1856/8427, 22.02%) focused more on prevention.

Primary prevention mainly involves adopting healthy behaviors that lower the risk of developing cancer and has been shown to have long-term preventive effects. Secondary prevention focuses on inhibiting or reversing carcinogenesis, including early screening and detection, as well as the treatment or removal of precancerous lesions [ 38 ]. Compared with cancer screening and treatment, primary prevention is considered the most cost-effective approach to reducing the cancer burden.

From our results, “early screening and testing” (1822/8427, 21.62%) was the most frequently mentioned prevention measure on both MSM and CSM. According to a cancer study from China, behavioral risk factors were identified as the primary cause of cancer [ 10 ]. However, measures related to primary prevention were not frequently mentioned. Additionally, lifestyle-related measures such as “healthy diet,” “regular exercise,” “avoiding tobacco use,” and “limiting alcohol use” were mentioned much less frequently on MSM compared with CSM.

Furthermore, “avoiding tobacco use” (554/8427, 6.57%) and “limiting alcohol use” (409/8427, 4.85%) were rarely mentioned, despite tobacco and alcohol being the leading causes of cancer. In China, public policies on the production, sale, and consumption of alcohol are weaker compared with Western countries. Notably, traditional Chinese customs often promote the belief that moderate drinking is beneficial for health [ 39 ]. Moreover, studies indicated that the smoking rate among adult men exceeded 50% in 2015. By 2018, 25.6% of Chinese adults aged 18 and above were smokers, totaling approximately 282 million smokers in China (271 million males and 11 million females) [ 40 ]. These statistics align with the consistently high incidence of lung cancer among Chinese men [ 41 ]. Simultaneously, the incidence and mortality of lung cancer in Chinese women were more likely associated with exposure to second-hand smoke or occupation-related risk factors.

Although MSM (261/8427, 3.10%) mentioned vaccination more frequently than CSM (113/8427, 1.34%), vaccination was not widely discussed on social media overall (374/8427, 4.44%). The introduction of human papillomavirus vaccination in China has lagged more than 10 years behind Western countries. A bivalent vaccine was approved by the Chinese Food and Drug Administration in 2017 but has still not been included in the national immunization schedule [ 42 ].

According to the “European Code Against Cancer” [ 43 ], breastfeeding is recommended as a measure to prevent breast cancer. However, articles mentioning breastfeeding as a cancer prevention measure were virtually absent from social media (2/8427, 0.02%).

One of the least frequently mentioned measures was “radiation protection,” which includes sun protection. Although skin cancer is not as common in China as in Western countries, China has the largest population in the world. A study showed that only 55.2% of Chinese people knew that ultraviolet radiation causes skin cancer [ 33 ]. Additional efforts should be made to enhance public awareness of skin cancer prevention through media campaigns.

Overall, our results indicate that social media, especially MSM, focused more on secondary prevention. The outcomes of primary prevention are difficult to observe at the individual level, which, as studies on cancer education suggest, may partly explain why primary prevention is often overlooked [ 44 ].

Cancer Treatment Information on MSM and CSM

Consistent with a related content analysis study in the United States [ 45 ], our findings indicate that the media placed greater emphasis on treatment. Treatment information on MSM was more diverse than on CSM, with a higher proportion of the 3 most common cancer treatments—surgery, chemotherapy, and radiation therapy—mentioned on MSM compared with CSM. Notably, CSM (232/8427, 2.75%) mentioned radiation therapy less frequently than MSM (1108/8427, 13.15%), despite it being one of the most common cancer treatment measures in clinical practice.

In addition to common treatment methods, other approaches such as targeted therapy (560/8427, 6.65%) and immunotherapy (339/8427, 4.02%) were rarely discussed. This could be attributed to the high costs associated with these treatments. A study revealed that each newly diagnosed patient with cancer in China faced out-of-pocket expenses of US $4947, amounting to 57.5% of the family’s annual income, a burden that 77.6% of affected families found unaffordable [ 46 ]. In 2017, the Chinese government released the National Health Insurance Coverage (NHIC) policy to improve the accessibility and affordability of innovative anticancer medicines, leading to reduced prices and increased availability and utilization of 15 negotiated drugs. However, a study indicated that the availability of these innovative anticancer drugs remained limited. By 2019, the NHIC policy had benefited 44,600 people, while the number of new cancer cases in China in 2020 was 4.57 million [ 47 ]. Promoting information about innovative therapies helps patients gain a better understanding of their cancer treatment options [ 48 ].

Practical Implications

This research highlighted that MSM did not fully leverage its professional background in providing comprehensive cancer information to the public; in fact, MSM holds substantial potential for contributing to cancer education. The findings from the content analysis also have practical implications for practitioners: they offer experts a way to assess the effectiveness of social media, to monitor the types of information reaching the public and patients with cancer, and to guide communication and medical professionals in crafting educational and persuasive messages around content that is heavily covered or underrepresented.

Limitations and Future Directions

This study had some limitations. First, we collected only 60,843 articles from 9 WPAs in China; future research could broaden the scope by collecting data from diverse countries and social media platforms. Second, our manual labeling covered only 10% of the samples (450/4479 for MSM and 394/3948 for CSM); the accuracy of the machine learning model could be improved by training it on a larger set of labeled articles. Finally, our results reflect only how the media presented cancer information, and the impact of this information on individuals remains unclear. Further work could examine its influence on audiences’ behavioral intentions or actions related to cancer prevention.
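
To make the labeling-related limitation concrete, the sketch below outlines how a manually labeled subset could be used to fine-tune a whole-word-masking Chinese BERT model (the Cui et al model cited in the references) as a multi-label classifier for prevention and treatment mentions. The checkpoint name (hfl/chinese-bert-wwm-ext), the two-label scheme, and the hyperparameters are assumptions for illustration, not the article’s reported configuration.

```python
# Hedged sketch: fine-tuning a Chinese BERT model as a multi-label classifier
# for cancer information categories. Checkpoint, label set, and hyperparameters
# are illustrative assumptions, not the article's actual setup.
import torch
from torch.utils.data import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

LABELS = ["prevention", "treatment"]          # assumed coarse label set
MODEL_NAME = "hfl/chinese-bert-wwm-ext"       # whole-word-masking Chinese BERT

class ArticleDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=512):
        # Tokenize all articles up front; labels are multi-hot vectors, e.g. [1, 0].
        self.enc = tokenizer(texts, truncation=True, padding="max_length",
                             max_length=max_len)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[idx], dtype=torch.float)
        return item

def train(train_texts, train_labels, eval_texts, eval_labels):
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=len(LABELS),
        problem_type="multi_label_classification")
    args = TrainingArguments(output_dir="bert-cancer-coder",
                             num_train_epochs=3,
                             per_device_train_batch_size=16,
                             learning_rate=2e-5)
    trainer = Trainer(model=model, args=args,
                      train_dataset=ArticleDataset(train_texts, train_labels, tokenizer),
                      eval_dataset=ArticleDataset(eval_texts, eval_labels, tokenizer))
    trainer.train()
    return trainer
```

With more labeled articles, the same routine can simply be rerun on the larger training split, which is the improvement path the limitation above points to.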

Conclusions

The analysis of cancer-related information on social media revealed an imbalance between prevention and treatment content. Overall, there was more treatment information than prevention information, and CSM mentioned prevention more often than MSM. On MSM, treatment information outweighed prevention information, whereas on CSM the reverse was true. The focus of cancer prevention and treatment information was largely limited to a few aspects, with a predominant emphasis on secondary rather than primary prevention. Coverage of cancer prevention measures and treatments on social media therefore needs further improvement. Additionally, the findings underscored the potential of applying machine learning to content analysis as a promising research paradigm for mapping key dimensions of cancer information on social media. These findings offer methodological and practical significance for future studies and health promotion.

Acknowledgments

This study was funded by The Major Program of the Chinese National Foundation of Social Sciences under the project “The Challenge and Governance of Smart Media on News Authenticity” (grant number 23&ZD213).

Conflicts of Interest

None declared.

Definitions and descriptions of coding items.

References

1. International Agency for Research on Cancer (IARC), World Health Organization (WHO). Cancer today: the global cancer observatory. Geneva, Switzerland: WHO; 2020. URL: https://gco.iarc.who.int/today/en [accessed 2023-12-25]
2. Yu S, Yang CS, Li J, You W, Chen J, Cao Y, et al. Cancer prevention research in China. Cancer Prev Res (Phila). Aug 2015;8(8):662-674.
3. Xia C, Dong X, Li H, Cao M, Sun D, He S, et al. Cancer statistics in China and United States, 2022: profiles, trends, and determinants. Chin Med J (Engl). Feb 09, 2022;135(5):584-590.
4. World Health Organization (WHO). Cancer. WHO. 2023. URL: https://www.who.int/news-room/facts-in-pictures/detail/cancer [accessed 2023-12-27]
5. Pagoto S, Waring ME, Xu R. A call for a public health agenda for social media research. J Med Internet Res. Dec 19, 2019;21(12):e16661.
6. Tekeli-Yesil S, Tanner M. Understanding the contribution of conventional media in earthquake risk communication. J Emerg Manag Disaster Commun. Jun 01, 2024;05(01):111-133.
7. Jensen JD, Scherr CL, Brown N, Jones C, Christy K, Hurley RJ. Public estimates of cancer frequency: cancer incidence perceptions mirror distorted media depictions. J Health Commun. 2014;19(5):609-624.
8. Banaye Yazdipour A, Niakan Kalhori SR, Bostan H, Masoorian H, Ataee E, Sajjadi H. Effect of social media interventions on the education and communication among patients with cancer: a systematic review protocol. BMJ Open. Nov 30, 2022;12(11):e066550.
9. Wallner LP, Martinez KA, Li Y, Jagsi R, Janz NK, Katz SJ, et al. Use of online communication by patients with newly diagnosed breast cancer during the treatment decision process. JAMA Oncol. Dec 01, 2016;2(12):1654-1656.
10. Sun D, Li H, Cao M, He S, Lei L, Peng J, et al. Cancer burden in China: trends, risk factors and prevention. Cancer Biol Med. Nov 15, 2020;17(4):879-895.
11. Basch CH, Menafro A, Mongiovi J, Hillyer GC, Basch CE. A content analysis of YouTube videos related to prostate cancer. Am J Mens Health. Jan 2017;11(1):154-157.
12. Vasconcelos Silva C, Jayasinghe D, Janda M. What can Twitter tell us about skin cancer communication and prevention on social media? Dermatology. 2020;236(2):81-89.
13. Hurley RJ, Riles JM, Sangalang A. Online cancer news: trends regarding article types, specific cancers, and the cancer continuum. Health Commun. 2014;29(1):41-50.
14. Mishel MH, Germino BB, Lin L, Pruthi RS, Wallen EM, Crandell J, et al. Managing uncertainty about treatment decision making in early stage prostate cancer: a randomized clinical trial. Patient Educ Couns. Dec 2009;77(3):349-359.
15. Brown P, Kwan V, Vallerga M, Obhi HK, Woodhead EL. The use of anecdotal information in a hypothetical lung cancer treatment decision. Health Commun. Jun 2019;34(7):713-719.
16. Crannell WC, Clark E, Jones C, James TA, Moore J. A pattern-matched Twitter analysis of US cancer-patient sentiments. J Surg Res. Dec 2016;206(2):536-542.
17. Gage-Bouchard EA, LaValley S, Mollica M, Beaupin LK. Cancer communication on social media: examining how cancer caregivers use Facebook for cancer-related communication. Cancer Nurs. 2017;40(4):332-338.
18. Reid BB, Rodriguez KN, Thompson MA, Matthews GD. Cancer-specific Twitter conversations among physicians in 2014. JCO. May 20, 2015;33(15_suppl):e17500.
19. Warner EL, Waters AR, Cloyes KG, Ellington L, Kirchhoff AC. Young adult cancer caregivers' exposure to cancer misinformation on social media. Cancer. Apr 15, 2021;127(8):1318-1324.
20. Okuhara T, Ishikawa H, Okada M, Kato M, Kiuchi T. Assertions of Japanese websites for and against cancer screening: a text mining analysis. Asian Pac J Cancer Prev. Apr 01, 2017;18(4):1069-1075.
21. Qin L, Zhang X, Wu A, Miser JS, Liu Y, Hsu JC, et al. Association between social media use and cancer screening awareness and behavior for people without a cancer diagnosis: matched cohort study. J Med Internet Res. Aug 27, 2021;23(8):e26395.
22. Denecke K, Nejdl W. How valuable is medical social media data? Content analysis of the medical web. Information Sciences. May 30, 2009;179(12):1870-1880.
23. Bender JL, Yue RYK, To MJ, Deacken L, Jadad AR. A lot of action, but not in the right direction: systematic review and content analysis of smartphone applications for the prevention, detection, and management of cancer. J Med Internet Res. Dec 23, 2013;15(12):e287.
24. Li X, Liu Q. Social media use, eHealth literacy, disease knowledge, and preventive behaviors in the COVID-19 pandemic: cross-sectional study on Chinese netizens. J Med Internet Res. Oct 09, 2020;22(10):e19684.
25. Zhang X, Wen D, Liang J, Lei J. How the public uses social media WeChat to obtain health information in China: a survey study. BMC Med Inform Decis Mak. Jul 05, 2017;17(Suppl 2):66.
26. Elad B. WeChat statistics by device allocation, active users, country wise traffic, demographics and marketing channels, social media traffic. EnterpriseAppsToday. 2023. URL: https://www.enterpriseappstoday.com/stats/wechat-statistics.html [accessed 2023-12-26]
27. Liang X, Yan M, Li H, Deng Z, Lu Y, Lu P, et al. WeChat official accounts' posts on medication use of 251 community healthcare centers in Shanghai, China: content analysis and quality assessment. Front Med (Lausanne). 2023;10:1155428.
28. NewRank. Ranking of influential health WeChat public accounts (中国健康类微信影响力排行榜). NewRank (新榜). 2018. URL: https://newrank.cn/public/info/rank_detail.html?name=health [accessed 2021-04-30]
29. Hospital Management Institute of Fudan University. 2021 national rankings of best hospitals by oncology specialty (2021年度肿瘤科专科声誉排行榜). 2021. URL: https://rank.cn-healthcare.com/fudan/specialty-reputation/year/2021/sid/2 [accessed 2021-05-01]
30. Blei D, Ng A, Jordan M. Latent Dirichlet allocation. Journal of Machine Learning Research. 2003;3:993-1022.
31. World Health Organization (WHO). Health topic: cancer. URL: https://www.who.int/health-topics/cancer#tab=tab_2 [accessed 2023-12-27]
32. Moore SC, Lee I, Weiderpass E, Campbell PT, Sampson JN, Kitahara CM, et al. Association of leisure-time physical activity with risk of 26 types of cancer in 1.44 million adults. JAMA Intern Med. Jun 01, 2016;176(6):816-825.
33. Stephens P, Martin B, Ghafari G, Luong J, Nahar V, Pham L, et al. Skin cancer knowledge, attitudes, and practices among Chinese population: a narrative review. Dermatol Res Pract. 2018;2018:1965674.
34. International Agency for Research on Cancer (IARC). Agents classified by the IARC monographs, volumes 1–136. URL: https://monographs.iarc.who.int/agents-classified-by-the-iarc/ [accessed 2023-12-25]
35. Han CJ, Lee YJ, Demiris G. Interventions using social media for cancer prevention and management. Cancer Nurs. 2018;41(6):E19-E31.
36. National Institutes of Health (NIH), National Cancer Institute (NCI). Types of cancer treatment. URL: https://www.cancer.gov/about-cancer/treatment/types [accessed 2021-03-15]
37. Cui Y, Che W, Liu T, Qin B, Yang Z. Pre-training with whole word masking for Chinese BERT. IEEE/ACM Trans Audio Speech Lang Process. 2021;29:3504-3514.
38. Loomans-Kropp HA, Umar A. Cancer prevention and screening: the next step in the era of precision medicine. NPJ Precis Oncol. 2019;3:3.
39. Tang Y, Xiang X, Wang X, Cubells JF, Babor TF, Hao W. Alcohol and alcohol-related harm in China: policy changes needed. Bull World Health Organ. Jan 22, 2013;91(4):270-276.
40. Zhang M, Yang L, Wang L, Jiang Y, Huang Z, Zhao Z, et al. Trends in smoking prevalence in urban and rural China, 2007 to 2018: findings from 5 consecutive nationally representative cross-sectional surveys. PLoS Med. Aug 2022;19(8):e1004064.
41. Li J, Wu B, Selbæk G, Krokstad S, Helvik A. Factors associated with consumption of alcohol in older adults - a comparison between two cultures, China and Norway: the CLHLS and the HUNT-study. BMC Geriatr. Jul 31, 2017;17(1):172.
42. Feng R, Zong Y, Cao S, Xu R. Current cancer situation in China: good or bad news from the 2018 Global Cancer Statistics? Cancer Commun (Lond). Apr 29, 2019;39(1):22.
43. Scoccianti C, Key TJ, Anderson AS, Armaroli P, Berrino F, Cecchini M, et al. European code against cancer 4th edition: breastfeeding and cancer. Cancer Epidemiol. Dec 2015;39 Suppl 1:S101-S106.
44. Espina C, Porta M, Schüz J, Aguado IH, Percival RV, Dora C, et al. Environmental and occupational interventions for primary prevention of cancer: a cross-sectorial policy framework. Environ Health Perspect. Apr 2013;121(4):420-426.
45. Jensen JD, Moriarty CM, Hurley RJ, Stryker JE. Making sense of cancer news coverage trends: a comparison of three comprehensive content analyses. J Health Commun. Mar 2010;15(2):136-151.
46. Huang H, Shi J, Guo L, Zhu X, Wang L, Liao X, et al. Expenditure and financial burden for common cancers in China: a hospital-based multicentre cross-sectional study. The Lancet. Oct 2016;388:S10.
47. People's Daily. 17 cancer drugs included in medical insurance at reduced prices, reducing medication costs by over 75% (17种抗癌药降价进医保减轻药费负担超75%). 2019. URL: http://www.gov.cn/xinwen/2019-02/13/content_5365211.htm [accessed 2023-12-25]
48. Fang W, Xu X, Zhu Y, Dai H, Shang L, Li X. Impact of the National Health Insurance Coverage policy on the utilisation and accessibility of innovative anti-cancer medicines in China: an interrupted time-series study. Front Public Health. 2021;9:714127.

Abbreviations

BERT: Bidirectional Encoder Representations from Transformers
CSM: common social media
GloVe: Global Vectors for Word Representation
LDA: latent Dirichlet allocation
LSTM: long short-term memory
MSM: medical social media
NHIC: National Health Insurance Coverage
WHO: World Health Organization
WPA: WeChat public account

Edited by S Ma; submitted 02.01.24; peer-reviewed by F Yang, D Wawrzuta; comments to author 20.03.24; revised version received 19.04.24; accepted 03.06.24; published 14.08.24.

©Keyang Zhao, Xiaojing Li, Jingyang Li. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 14.08.2024.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research (ISSN 1438-8871), is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
