Deep Learning

AI Conversations Help Conspiracy Theorists Change Their Views

AI Determines How the Brain Predicts and Processes Thoughts

Unlocking the Brain’s “Neural Code” Could Lead to Superhuman AI

Robot Deception: Some Lies Accepted, Others Rejected

AI Models Complex Molecular States with Precision

AI Model Predicts Autism in Toddlers with 80% Accuracy

AI Lacks Independent Learning, Poses No Existential Threat

Can We Hear Temperature? New Study Says Yes

AI Helps Decode the Language of DNA

Study Finds Faces Evolve to Match Names Over Time

Consciousness in AI: Distinguishing Reality from Simulation

AI Identifies Three Parkinson’s Subtypes

Manipulating Brain Waves During Sleep With Sound

High Doses of ADHD Meds Linked to Increased Psychosis Risk

Key Neurons Found to Predict Memory of People and Places

Best Deep Learning Research of 2021 So Far

Deep Learning, Modeling, Research. Posted by Daniel Gutierrez, ODSC, August 2, 2021

The discipline of AI most often mentioned these days is deep learning (DL), along with its many incarnations implemented with deep neural networks. DL is also a rapidly accelerating area of research, with papers being published at a fast clip by research teams from around the globe.

I enjoy keeping a pulse on deep learning research, and so far in 2021 research innovations have propagated at a quick pace. Some of the top topical areas for deep learning research are causality, explainability/interpretability, transformers, NLP, GPT, language models, GANs, deep learning for tabular data, and many others.

In this article, we’ll take a brief tour of my top picks for deep learning research papers (in no particular order) that I found to be particularly compelling. I’m pretty attached to this leading-edge research. I’m known to carry a thick folder of recent research papers around in my backpack and consume all the great developments when I have a spare moment. Enjoy!

Check out my previous lists: Best Machine Learning Research of 2021 So Far, Best of Deep Reinforcement Learning Research of 2019, Most Influential NLP Research of 2019, and Most Influential Deep Learning Research of 2019.

Cause and Effect: Concept-based Explanation of Neural Networks

In many scenarios, human decisions are explained based on some high-level concepts. This paper takes a step toward the interpretability of neural networks by examining their internal representations, or neurons’ activations, against concepts. A concept is characterized by a set of samples that have specific features in common. A framework is proposed to check the existence of a causal relationship between a concept (or its negation) and task classes. While previous methods focus on the importance of a concept to a task class, the paper goes further and introduces four measures to quantitatively determine the order of causality. Through experiments, the effectiveness of the proposed method is demonstrated in explaining the relationship between a concept and the predictive behavior of a neural network.

Pretrained Language Models for Text Generation: A Survey

Text generation has become one of the most important yet challenging tasks in natural language processing (NLP). The resurgence of deep learning has greatly advanced this field through neural generation models, especially the paradigm of pretrained language models (PLMs). This paper presents an overview of the major advances achieved in PLMs for text generation. As preliminaries, the paper presents the general task definition and briefly describes the mainstream architectures of PLMs for text generation. As its core content, the paper discusses how to adapt existing PLMs to model different input data and satisfy special properties in the generated text.

A Short Survey of Pre-trained Language Models for Conversational AI-A NewAge in NLP

Building a dialogue system that can communicate naturally with humans is a challenging yet interesting problem of agent-based computing. The rapid growth in this area is usually hindered by the long-standing problem of data scarcity as these systems are expected to learn syntax, grammar, decision making, and reasoning from insufficient amounts of task-specific data sets. The recently introduced pre-trained language models have the potential to address the issue of data scarcity and bring considerable advantages by generating contextualized word embeddings. These models are considered counterparts of ImageNet in NLP and have demonstrated the ability to capture different facets of language such as hierarchical relations, long-term dependency, and sentiment. This short survey paper discusses the recent progress made in the field of pre-trained language models. 

TrustyAI Explainability Toolkit

AI is becoming increasingly popular and can be found in workplaces and homes around the world. However, how do we ensure trust in these systems? Regulation changes such as the GDPR mean that users have a right to understand how their data has been processed as well as saved. Therefore, if you are denied a loan, for example, you have the right to ask why. This can be hard if the method for working this out uses “black box” machine learning techniques such as neural networks. TrustyAI is a new initiative which looks into explainable artificial intelligence (XAI) solutions to address trustworthiness in ML as well as decision services landscapes. This deep learning research paper looks at how TrustyAI can support trust in decision services and predictive models. The paper investigates techniques such as LIME, SHAP and counterfactuals, benchmarking both LIME and counterfactual techniques against existing implementations.

Generative Adversarial Network: Some Analytical Perspectives

Ever since their debut, generative adversarial networks (GANs) have attracted a tremendous amount of attention. Over the past years, different variations of GAN models have been developed and tailored to different applications in practice. Meanwhile, some issues regarding the performance and training of GANs have been noticed and investigated from various theoretical perspectives. This paper starts with an introduction to GANs from an analytical perspective, then moves on to the training of GANs via SDE approximations, and finally discusses some applications of GANs in computing high-dimensional mean-field games (MFGs) as well as tackling mathematical finance problems.

PyTorch Tabular: A Framework for Deep Learning with Tabular Data

In spite of showing unreasonable effectiveness in modalities like text and images, deep learning has always lagged behind gradient boosting on tabular data, both in popularity and performance. Recently, however, newer models have been created specifically for tabular data, and these are pushing the performance bar. Popularity is still a challenge because there is no easy, ready-to-use library like scikit-learn for deep learning. PyTorch Tabular is a new deep learning library which makes working with deep learning and tabular data easy and fast. It is a library built on top of PyTorch and PyTorch Lightning and works on Pandas dataframes directly. Many SOTA models like NODE and TabNet are already integrated and implemented in the library with a unified API. PyTorch Tabular is designed to be easily extensible for researchers, simple for practitioners, and robust in industrial deployments.

A Survey of Quantization Methods for Efficient Neural Network Inference

As soon as abstract mathematical computations were adapted to computation on digital computers, the problem of efficient representation, manipulation, and communication of the numerical values in those computations arose. Strongly related to the problem of numerical representation is the problem of quantization: in what manner should a set of continuous real-valued numbers be distributed over a fixed discrete set of numbers to minimize the number of bits required and also to maximize the accuracy of the attendant computations? This perennial problem of quantization is particularly relevant whenever memory and/or computational resources are severely restricted, and it has come to the forefront in recent years due to the remarkable performance of neural network models in computer vision, natural language processing, and related areas. Moving from floating-point representations to low-precision fixed integer values represented in four bits or less holds the potential to reduce the memory footprint and latency by a factor of 16x; and, in fact, reductions of 4x to 8x are often realized in practice in these applications. Thus, it is not surprising that quantization has emerged recently as an important and very active sub-area of research in the efficient implementation of computations associated with neural networks. This paper surveys approaches to the problem of quantizing the numerical values in deep neural network computations, covering the advantages/disadvantages of current methods.
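To make the idea of quantization concrete, here is a minimal sketch of uniform (affine) quantization in plain Python. It is only one of the many schemes such a survey covers, and the function names, the toy weights, and the 32-bit baseline are assumptions chosen for illustration rather than anything taken from the paper.

```python
def quantize(values, num_bits=4):
    """Map floats onto evenly spaced integer codes in [0, 2**num_bits - 1]."""
    lo, hi = min(values), max(values)
    levels = 2 ** num_bits - 1                  # e.g. 15 steps for 4 bits
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Map integer codes back to approximate real values."""
    return [c * scale + lo for c in codes]


# Hypothetical weights, chosen only for the demonstration.
weights = [-1.0, -0.5, 0.0, 0.25, 0.5]
codes, scale, lo = quantize(weights, num_bits=4)
restored = dequantize(codes, scale, lo)

# Rounding to the nearest level costs at most half a step per value.
max_error = max(abs(w - r) for w, r in zip(weights, restored))

# Storage ratio if each value was originally held as a 32-bit float.
compression = 32 / 4
```

With 4-bit codes, a 32-bit float shrinks by a factor of 8, and the rounding error per value is bounded by half the step size. Fewer bits give more compression at the price of a larger step.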

How to decay your learning rate

Complex learning rate schedules have become an integral part of deep learning. This research finds empirically that common fine-tuned schedules decay the learning rate after the weight norm bounces. This leads to the proposal of ABEL: an automatic scheduler which decays the learning rate by keeping track of the weight norm. ABEL’s performance matches that of tuned schedules and is more robust with respect to its parameters. Through extensive experiments in vision, NLP, and RL, it is shown that if the weight norm does not bounce, it is possible to simplify schedules even further with no loss in performance. In such cases, a complex schedule has similar performance to a constant learning rate with a decay at the end of training.
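The core intuition, decaying the learning rate once the weight norm bounces, can be sketched in a few lines. This is a simplified illustration and not the paper's exact ABEL algorithm: the class name, the bounce test (norm falling, then rising), the decay factor, and the weight norms are all assumptions made for the example.

```python
class WeightNormBounceScheduler:
    """Decay the learning rate when the weight norm stops falling and rises."""

    def __init__(self, lr, decay_factor=0.1):
        self.lr = lr
        self.decay_factor = decay_factor
        self.history = []                # one weight norm per epoch

    def step(self, weight_norm):
        """Record this epoch's weight norm and return the current lr."""
        self.history.append(weight_norm)
        if len(self.history) >= 3:
            prev2, prev1, cur = self.history[-3:]
            # A bounce: the norm was falling and has now started rising.
            if prev1 < prev2 and cur > prev1:
                self.lr *= self.decay_factor
        return self.lr


scheduler = WeightNormBounceScheduler(lr=0.1)
# Made-up norms: they fall for three epochs, then bounce upward.
norms = [10.0, 9.0, 8.5, 8.7, 9.0]
lrs = [scheduler.step(n) for n in norms]
```

In this toy run the learning rate stays at its initial value while the norm falls, and is multiplied by the decay factor in the epoch where the norm first turns upward.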

GPT Understands, Too

While GPTs with traditional fine-tuning fail to achieve strong results on natural language understanding (NLU), this paper shows that GPTs can be better than or comparable to similar-sized BERTs on NLU tasks with a novel method, P-tuning, which employs trainable continuous prompt embeddings. On the knowledge probing (LAMA) benchmark, the best GPT recovers 64% (P@1) of world knowledge without any additional text provided during test time, which substantially improves the previous best by 20+ percentage points. On the SuperGLUE benchmark, GPTs achieve performance comparable to, and sometimes better than, that of similar-sized BERTs in supervised learning. Importantly, it is found that P-tuning also improves BERTs’ performance in both few-shot and supervised settings while largely reducing the need for prompt engineering. Consequently, P-tuning outperforms the state-of-the-art approaches on the few-shot SuperGLUE benchmark.

Understanding Robustness of Transformers for Image Classification

Deep Convolutional Neural Networks (CNNs) have long been the architecture of choice for computer vision tasks. Recently, Transformer-based architectures like Vision Transformer (ViT) have matched or even surpassed ResNets for image classification. However, details of the Transformer architecture (such as the use of non-overlapping patches) lead one to wonder whether these networks are as robust. This paper performs an extensive study of a variety of different measures of robustness of ViT models and compares the findings to ResNet baselines. It investigates robustness to input perturbations as well as robustness to model perturbations. The paper finds that when pre-trained with a sufficient amount of data, ViT models are at least as robust as their ResNet counterparts on a broad range of perturbations. It also finds that Transformers are robust to the removal of almost any single layer, and that while activations from later layers are highly correlated with each other, they nevertheless play an important role in classification.

Improving DeepFake Detection Using Dynamic Face Augmentation

The creation of altered and manipulated faces has become more common due to the improvement of DeepFake generation methods. Simultaneously, we have seen the development of detection models for differentiating between a manipulated and an original face in image or video content. Most publicly available DeepFake detection datasets have limited variation, with a single face used in many videos, resulting in an oversampled training dataset. Because of this, deep neural networks tend to overfit to the facial features instead of learning to detect the manipulation features of DeepFake content. As a result, most detection architectures perform poorly when tested on unseen data. This paper provides a quantitative analysis to investigate this problem and presents a solution to prevent model overfitting due to the high volume of samples generated from a small number of actors.

An Evaluation of Edge TPU Accelerators for Convolutional Neural Networks

Edge TPUs are a domain of accelerators for low-power edge devices and are widely used in various Google products such as Coral and Pixel devices. This paper first discusses the major microarchitectural details of Edge TPUs. This is followed by an extensive evaluation of three classes of Edge TPUs, covering different computing ecosystems that are either currently deployed in Google products or are in the product pipeline. Building upon this extensive study, the paper discusses critical and interpretable microarchitectural insights about the studied classes of Edge TPUs. Mainly discussed is how Edge TPU accelerators perform across CNNs with different structures. Finally, the paper presents ongoing efforts in developing high-accuracy learned machine learning models to estimate the major performance metrics of accelerators such as latency and energy consumption. These learned models enable significantly faster (on the order of milliseconds) evaluations of accelerators as an alternative to time-consuming cycle-accurate simulators and establish an exciting opportunity for rapid hardware/software co-design.

Attention Models for Point Clouds in Deep Learning: A Survey

Recently, the advancement of 3D point clouds in deep learning has attracted intensive research in different application domains such as computer vision and robotic tasks. However, creating robust, discriminative feature representations from unordered and irregular point clouds is challenging. The goal of this paper is to provide a comprehensive overview of point cloud feature representations that use attention models. More than 75 key contributions from the past three years are summarized in this survey, including 3D object detection, 3D semantic segmentation, 3D pose estimation, point cloud completion, etc. Also provided is a detailed characterization of (i) the role of attention mechanisms, (ii) the usability of attention models in different tasks, and (iii) the development trend of key technology.

Constrained Optimization for Training Deep Neural Networks Under Class Imbalance

Deep neural networks (DNNs) are notorious for making more mistakes for the classes that have substantially fewer samples than the others during training. Such class imbalance is ubiquitous in clinical applications and very crucial to handle because the classes with fewer samples most often correspond to critical cases (e.g., cancer) where misclassifications can have severe consequences. Not to miss such cases, binary classifiers need to be operated at high True Positive Rates (TPR) by setting a higher threshold but this comes at the cost of very high False Positive Rates (FPR) for problems with class imbalance. Existing methods for learning under class imbalance most often do not take this into account. This paper argues that prediction accuracy should be improved by emphasizing reducing FPRs at high TPRs for problems where misclassification of the positive samples is associated with higher cost. To this end, the paper poses the training of a DNN for binary classification as a constrained optimization problem and introduces a novel constraint that can be used with existing loss functions to enforce maximal area under the ROC curve (AUC). The resulting constrained optimization problem is solved using an Augmented Lagrangian method (ALM), where the constraint emphasizes reduction of FPR at high TPR. Results demonstrate that the proposed method almost always improves the loss functions it is used with by attaining lower FPR at high TPR and higher or equal AUC.
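The quantity this paper cares about, FPR at a required TPR, is easy to compute from a classifier's scores. The sketch below is only an evaluation helper and does not implement the paper's constrained training. The function name, the convention that a sample is predicted positive when its score is at or above the threshold, and the toy scores are assumptions for illustration.

```python
def fpr_at_tpr(scores, labels, target_tpr):
    """Return the lowest FPR achievable while keeping TPR >= target_tpr.

    scores: predicted score for the positive class, one per sample.
    labels: 1 for positive, 0 for negative.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best_fpr = 1.0
    # Try every observed score as the threshold (predict 1 if score >= t).
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        if tp / positives >= target_tpr:
            best_fpr = min(best_fpr, fp / negatives)
    return best_fpr


# Imbalanced toy data: 3 positives, 5 negatives (made-up scores).
scores = [0.9, 0.8, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]
labels = [1, 1, 1, 0, 0, 0, 0, 0]

fpr_all_positives = fpr_at_tpr(scores, labels, target_tpr=1.0)
fpr_two_thirds = fpr_at_tpr(scores, labels, target_tpr=0.6)
```

On this toy data, catching every positive forces the threshold down far enough to let two of the five negatives through, while settling for two of the three positives lets none through. That gap is the cost of operating at a high TPR.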

Deep Convolutional Neural Networks with Unitary Weights

While normalizations aim to fix the exploding and vanishing gradient problem in deep neural networks, they have drawbacks in speed or accuracy because of their dependency on the data set statistics. This paper is a comprehensive study of a novel method based on unitary synaptic weights derived from Lie groups to construct intrinsically stable neural systems. The paper shows that unitary convolutional neural networks deliver up to 32% faster inference speeds while maintaining competitive prediction accuracy. Unlike prior art restricted to square synaptic weights, the paper expands unitary networks to weights of any size and dimension.

TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up

The recent explosive interest in transformers has suggested their potential to become powerful “universal” models for computer vision tasks, such as classification, detection, and segmentation. An important question is how much further transformers can go: are they ready to take on some more notoriously difficult vision tasks, e.g., generative adversarial networks (GANs)? Driven by that curiosity, this paper conducts the first pilot study in building a GAN completely free of convolutions, using only pure transformer-based architectures. The proposed vanilla GAN architecture, dubbed TransGAN, consists of a memory-friendly transformer-based generator that progressively increases feature resolution while decreasing embedding dimension, and a patch-level discriminator that is also transformer-based. TransGAN is seen to notably benefit from data augmentations (more than standard GANs), a multi-task co-training strategy for the generator, and a locally initialized self-attention that emphasizes the neighborhood smoothness of natural images. Equipped with those findings, TransGAN can effectively scale up with bigger models and high-resolution image datasets. Specifically, the architecture achieves highly competitive performance compared to current state-of-the-art GANs based on convolutional backbones. The GitHub repo associated with this paper can be found HERE.

Deep Learning for Scene Classification: A Survey

Scene classification, aiming at classifying a scene image to one of the predefined scene categories by comprehending the entire image, is a longstanding, fundamental and challenging problem in computer vision. The rise of large-scale datasets, which constitute a dense sampling of diverse real-world scenes, and the renaissance of deep learning techniques, which learn powerful feature representations directly from big raw data, have been bringing remarkable progress in the field of scene representation and classification. To help researchers master needed advances in this field, the goal of this paper is to provide a comprehensive survey of recent achievements in scene classification using deep learning. More than 260 major publications are included in this survey covering different aspects of scene classification, including challenges, benchmark datasets, taxonomy, and quantitative performance comparisons of the reviewed methods. In retrospect of what has been achieved so far, this paper is concluded with a list of promising research opportunities.

Introducing and assessing the explainable AI (XAI) method: SIDU

Explainable Artificial Intelligence (XAI) has in recent years become a well-suited framework to generate human-understandable explanations of black box models. This paper presents a novel XAI visual explanation algorithm denoted SIDU that can effectively localize entire object regions responsible for prediction. The paper analyzes its robustness and effectiveness through various computational and human subject experiments. In particular, the SIDU algorithm is assessed using three different types of evaluations (Application, Human and Functionally-Grounded) to demonstrate its superior performance. The robustness of SIDU is further studied in the presence of adversarial attacks on black box models to better understand its performance.

Evolving Reinforcement Learning Algorithms

This paper proposes a method for meta-learning reinforcement learning algorithms by searching over the space of computational graphs which compute the loss function for a value-based model-free RL agent to optimize. The learned algorithms are domain-agnostic and can generalize to new environments not seen during training. The method can both learn from scratch and bootstrap off known existing algorithms, like DQN, enabling interpretable modifications which improve performance. Learning from scratch on simple classical control and gridworld tasks, the method rediscovers the temporal-difference (TD) algorithm. Bootstrapped from DQN, two learned algorithms are highlighted which obtain good generalization performance over other classical control tasks, gridworld type tasks, and Atari games. The analysis of the learned algorithm behavior shows resemblance to recently proposed RL algorithms that address overestimation in value-based methods.

RepVGG: Making VGG-style ConvNets Great Again

VGG-style ConvNets, although now considered a classic architecture, were attractive due to their simplicity. In contrast, ResNets have become popular due to their high accuracy but are more difficult to customize and display undesired inference drawbacks. To address these issues, Ding et al. propose RepVGG – the return of the VGG! 

RepVGG is an efficient and simple architecture using plain VGG-style ConvNets. It decouples the inference-time and training-time architecture through a structural re-parameterization technique. The researchers report a favorable speed-accuracy tradeoff compared to state-of-the-art models, such as EfficientNet and RegNet. RepVGG achieves 80% top-1 accuracy on ImageNet and is benchmarked as being 83% faster than ResNet-50. This research is part of a broader effort to build more efficient models using simpler architectures and operations. The GitHub repo associated with this paper can be found HERE.
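Structural re-parameterization is easier to appreciate with a toy example. The sketch below assumes a training-time block with three parallel branches (a 3x3 convolution, a 1x1 convolution, and an identity shortcut) and shows that they collapse into a single 3x3 kernel for inference. It is heavily simplified: a single channel, no batch normalization, and made-up kernel values, so treat it as an illustration of the idea rather than the authors' implementation.

```python
def conv3x3(image, kernel):
    """Single-channel 3x3 convolution with zero padding (stride 1)."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:
                        out[i][j] += kernel[di + 1][dj + 1] * image[y][x]
    return out


def training_block(image, k3, k1):
    """Training-time block: 3x3 branch + 1x1 branch + identity branch."""
    y3 = conv3x3(image, k3)
    return [
        [y3[i][j] + k1 * image[i][j] + image[i][j] for j in range(len(image[0]))]
        for i in range(len(image))
    ]


def reparameterize(k3, k1):
    """Fold the 1x1 and identity branches into the centre of the 3x3 kernel."""
    fused = [row[:] for row in k3]
    fused[1][1] += k1 + 1.0
    return fused


# Hypothetical input and weights, chosen only for the demonstration.
image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
k3 = [[0.5, 0.0, -0.5], [0.25, 1.0, -0.25], [0.0, 0.5, 0.0]]
k1 = 2.0

multi_branch = training_block(image, k3, k1)
single_conv = conv3x3(image, reparameterize(k3, k1))
```

Both paths produce the same output, because a 1x1 convolution and an identity mapping only ever touch the centre tap of a 3x3 window. The inference-time network can therefore be a plain stack of 3x3 convolutions.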

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model with outrageous numbers of parameters but a constant computational cost. However, despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs and training instability; this paper addresses these with the Switch Transformer. The Google Brain researchers simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs. The proposed training techniques help wrangle the instabilities, and it is shown that large sparse models may be trained, for the first time, with lower precision (bfloat16) formats. They design models based on T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources. These improvements extend into multilingual settings, where gains are measured over the mT5-Base version across all 101 languages. Finally, the paper advances the current scale of language models by pre-training up to trillion parameter models on the “Colossal Clean Crawled Corpus” and achieves a 4x speedup over the T5-XXL model. The GitHub repo associated with this paper can be found HERE.
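A toy version of sparse routing shows why the parameter count can grow while the cost per input stays flat. The sketch assumes top-1 routing, where a small router sends each token to exactly one expert, which is the simplification the Switch Transformer is known for. The experts, router weights, and function names are hypothetical stand-ins rather than the released code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top1(token, router_weights):
    """Pick one expert for a token using a linear router (top-1 routing)."""
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    expert = max(range(len(probs)), key=probs.__getitem__)
    return expert, probs[expert]


# Hypothetical toy experts: each one just scales its input differently.
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [-1.0 * v for v in x],
    lambda x: [0.5 * v for v in x],
]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]   # one row per expert


def moe_layer(token):
    """Run only the selected expert and scale its output by the gate value."""
    expert, gate = route_top1(token, router_weights)
    return expert, [gate * v for v in experts[expert](token)]


chosen_a, out_a = moe_layer([3.0, 0.0])
chosen_b, out_b = moe_layer([0.0, 3.0])
```

Adding more experts to the list increases the number of parameters, but each token still runs through exactly one of them, so the work per token does not grow.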

How to Learn More about Deep Learning Research

At our upcoming event this November 16th-18th in San Francisco, ODSC West 2021 will feature a plethora of talks, workshops, and training sessions on deep learning and deep learning research. You can register now for 60% off all ticket types before the discount drops to 40% in a few weeks. Some highlighted sessions on deep learning include:

Sessions on Deep Learning and Deep Learning Research:

  • GANs: Theory and Practice, Image Synthesis With GANs Using TensorFlow: Ajay Baranwal | Center Director | Center for Deep Learning in Electronic Manufacturing, Inc
  • Machine Learning With Graphs: Going Beyond Tabular Data: Dr. Clair J. Sullivan | Data Science Advocate | Neo4j
  • Deep Dive into Reinforcement Learning with PPO using TF-Agents & TensorFlow 2.0: Oliver Zeigermann | Software Developer | embarc Software Consulting GmbH
  • Get Started with Time-Series Forecasting using the Google Cloud AI Platform: Karl Weinmeister | Developer Relations Engineering Manager | Google

Sessions on Machine Learning:

  • Towards More Energy-Efficient Neural Networks? Use Your Brain!: Olaf de Leeuw | Data Scientist | Dataworkz
  • Practical MLOps: Automation Journey: Evgenii Vinogradov, PhD | Head of DHW Development | YooMoney
  • Applications of Modern Survival Modeling with Python: Brian Kent, PhD | Data Scientist | Founder The Crosstab Kite
  • Using Change Detection Algorithms for Detecting Anomalous Behavior in Large Systems: Veena Mendiratta, PhD | Adjunct Faculty, Network Reliability and Analytics Researcher | Northwestern University

Sessions on MLOps:

  • Tuning Hyperparameters with Reproducible Experiments: Milecia McGregor | Senior Software Engineer | Iterative
  • MLOps… From Model to Production: Filipa Peleja, PhD | Lead Data Scientist | Levi Strauss & Co
  • Operationalization of Models Developed and Deployed in Heterogeneous Platforms: Sourav Mazumder | Data Scientist, Thought Leader, AI & ML Operationalization Leader | IBM
  • Develop and Deploy a Machine Learning Pipeline in 45 Minutes with Ploomber: Eduardo Blancas | Data Scientist | Fidelity Investment

Daniel Gutierrez, ODSC

Daniel D. Gutierrez is a practicing data scientist who has been working with data since long before the field came into vogue. As a technology journalist, he enjoys keeping a pulse on this fast-paced industry. Daniel is also an educator, having taught data science, machine learning and R classes at the university level. He has authored four computer industry books on database and data science technology, including his most recent title, “Machine Learning and Data Science: An Introduction to Statistical Learning Methods with R.” Daniel holds a BS in Mathematics and Computer Science from UCLA.

Google Research, 2022 & beyond: Algorithms for efficient deep learning

February 7, 2023

Posted by Sanjiv Kumar, VP and Google Fellow, Google Research

The explosion in deep learning a decade ago was catapulted in part by the convergence of new algorithms and architectures, a marked increase in data, and access to greater compute. In the last 10 years, AI and ML models have become bigger and more sophisticated — they’re deeper, more complex, with more parameters, and trained on much more data, resulting in some of the most transformative outcomes in the history of machine learning.

As these models increasingly find themselves deployed in production and business applications, their efficiency and costs have gone from a minor consideration to a primary constraint. In response, Google has continued to invest heavily in ML efficiency, taking on the biggest challenges in (a) efficient architectures, (b) training efficiency, (c) data efficiency, and (d) inference efficiency. Beyond efficiency, there are a number of other challenges around factuality, security, privacy and freshness in these models. Below, we highlight an array of works that demonstrate Google Research’s efforts in developing new algorithms to address the above challenges.

Efficient architectures

A fundamental question is “Are there better ways of parameterizing a model to allow for greater efficiency?” In 2022, we focused on new techniques for infusing external knowledge by augmenting models via retrieved context; mixture of experts; and making transformers (which lie at the heart of most large ML models) more efficient.

Context-augmented models

In the quest for higher quality and efficiency, neural models can be augmented with external context from large databases or trainable memory. By leveraging retrieved context, a neural network may not have to memorize the huge amount of world knowledge within its internal parameters, leading to better parameter efficiency, interpretability and factuality.

In “Decoupled Context Processing for Context Augmented Language Modeling”, we explored a simple architecture for incorporating external context into language models based on a decoupled encoder-decoder architecture. This led to significant computational savings while giving competitive results on auto-regressive language modeling and open domain question answering tasks. Pre-trained large language models (LLMs) consume a significant amount of information through self-supervision on big training sets, but it is unclear precisely how the “world knowledge” of such models interacts with the presented context. With knowledge aware fine-tuning (KAFT), we strengthen both controllability and robustness of LLMs by incorporating counterfactual and irrelevant contexts into standard supervised datasets.

An encoder-decoder cross-attention mechanism for context incorporation that allows decoupling of context encoding from language model inference, leading to efficient context-augmented models.

One of the questions in the quest for a modular deep network is how a database of concepts with corresponding computational modules could be designed. We proposed a theoretical architecture that would “remember events” in the form of sketches stored in an external LSH table with pointers to modules that process such sketches.

Another challenge in context-augmented models is fast retrieval on accelerators of information from a large database. We have developed a TPU-based similarity search algorithm that aligns with the performance model of TPUs and gives analytical guarantees on expected recall, achieving peak performance. Search algorithms typically involve a large number of hyperparameters and design choices that make it hard to tune them on new tasks. We have proposed a new constrained optimization algorithm for automating hyperparameter tuning. Fixing the desired cost or recall as input, the proposed algorithm generates tunings that empirically are very close to the speed-recall Pareto frontier and give leading performance on standard benchmarks.

Mixture-of-experts models

Mixture-of-experts (MoE) models have proven to be an effective means of increasing neural network model capacity without overly increasing their computational cost. The basic idea of MoEs is to construct a network from a number of expert sub-networks, where each input is processed by a suitable subset of experts. Thus, compared to a standard neural network, MoEs invoke only a small portion of the overall model, resulting in high efficiency as shown in language model applications such as GLaM.

The architecture of GLaM where each input token is dynamically routed to two selected expert networks out of 64 for prediction.
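As a rough illustration of the routing described above, the sketch below sends each token to its two highest-scoring experts and mixes their outputs with renormalized gate weights. The function names, the softmax gate, and the toy scalar experts are assumptions made for illustration; this is not GLaM's actual implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top2_route(gate_logits):
    """Return [(expert_index, weight), ...] for the two best experts of one token."""
    probs = softmax(gate_logits)
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top2)
    # Renormalize so the two selected gate weights sum to one.
    return [(i, probs[i] / norm) for i in top2]


def moe_forward(x, gate_logits, experts):
    """Weighted combination of the two selected experts; the rest are never called."""
    return sum(w * experts[i](x) for i, w in top2_route(gate_logits))


# Hypothetical toy experts: each one is just a scalar function.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
gate_logits = [0.1, 2.0, -1.0, 2.0]
routes = top2_route(gate_logits)
output = moe_forward(3.0, gate_logits, experts)
```

Only two of the four toy experts run for this token, which is where the compute saving comes from when the expert count is large.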

Which experts should be active for a given input is determined by a routing function, the design of which is challenging, since one would like to prevent both under- and over-utilization of each expert. In a recent work, we proposed Expert Choice Routing, a new routing mechanism that, instead of assigning each input token to the top-k experts, assigns each expert to the top-k tokens. This automatically ensures load-balancing of experts while also naturally allowing for an input token to be handled by multiple experts.

Expert Choice Routing. Experts with predetermined buffer capacity are assigned top-k tokens, thus guaranteeing even load balancing. Each token can be processed by a variable number of experts.
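The inversion at the heart of this routing scheme can be sketched in a few lines: each expert, rather than each token, does the choosing. The affinity scores, the `capacity` parameter name, and the helper functions below are hypothetical and only illustrate the idea.

```python
def expert_choice_route(scores, capacity):
    """scores[t][e] is the affinity of token t for expert e.

    Each expert selects its `capacity` highest-scoring tokens, so every
    expert processes exactly the same number of tokens.
    """
    num_tokens = len(scores)
    num_experts = len(scores[0])
    assignment = {}
    for e in range(num_experts):
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][e], reverse=True)
        assignment[e] = ranked[:capacity]
    return assignment


def experts_per_token(assignment, num_tokens):
    """Count how many experts picked each token (may be zero, one, or several)."""
    counts = [0] * num_tokens
    for tokens in assignment.values():
        for t in tokens:
            counts[t] += 1
    return counts


scores = [
    [0.9, 0.8],  # token 0: attractive to both experts
    [0.7, 0.1],  # token 1
    [0.2, 0.3],  # token 2: attractive to neither
    [0.1, 0.6],  # token 3
]
assignment = expert_choice_route(scores, capacity=2)
counts = experts_per_token(assignment, num_tokens=4)
```

In this toy input both experts receive two tokens, while token 0 is handled by two experts and token 2 by none, matching the "variable number of experts" behavior described in the caption.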

Efficient transformers

Transformers are popular sequence-to-sequence models that have shown remarkable success in a range of challenging problems from vision to natural language understanding. A central component of such models is the attention layer, which identifies the similarity between “queries” and “keys”, and uses these to construct a suitable weighted combination of “values”. While effective, attention mechanisms have poor (i.e., quadratic) scaling with sequence length.
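To make the query, key, and value roles concrete, here is a minimal pure-Python sketch of softmax attention. The scaling by the square root of the key dimension follows the standard Transformer formulation and is an assumption here, since the text does not spell it out. The nested loop over queries and keys is the source of the quadratic cost.

```python
import math


def attention(queries, keys, values):
    """Softmax attention: one output vector per query.

    Every query is compared with every key, so the work grows with
    len(queries) * len(keys).
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity between this query and each key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted combination of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs


# Two identical keys receive equal weight, so the output is the mean of the values.
result = attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 0.0], [1.0, 0.0]],
    values=[[0.0, 2.0], [4.0, 6.0]],
)
```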

As the scale of transformers continues to grow, it is interesting to study if there are any naturally occurring structures or patterns in the learned models that may help us decipher how they work. Towards that, we studied the learned embeddings in intermediate MLP layers, revealing that they are very sparse; for example, T5-Large models have <1% nonzero entries. Sparsity further suggests that we can potentially reduce FLOPs without affecting model performance.

We recently proposed Treeformer, an alternative to standard attention computation that relies on decision trees. Intuitively, this quickly identifies a small subset of keys that are relevant for a query and only performs the attention operation on this set. Empirically, the Treeformer can lead to a 30x reduction in FLOPs for the attention layer. We also introduced Sequential Attention, a differentiable feature selection method that combines attention with a greedy algorithm. This technique has strong provable guarantees for linear models and scales seamlessly to large embedding models.

In Treeformer, attention computation is modeled as a retrieval problem. Hierarchical decision trees are used to find which keys to pay attention to for each query, reducing the quadratic cost of classical attention substantially.

Another way to make transformers efficient is by making the softmax computations faster in the attention layer. Building on our previous work on low-rank approximation of the softmax kernel, we proposed a new class of random features that provides the first “positive and bounded” random feature approximation of the softmax kernel and is computationally linear in the sequence length. We also proposed the first approach for incorporating various attention masking mechanisms, such as causal and relative position encoding, in a scalable manner (i.e., sub-quadratic with relation to the input sequence length).
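For intuition about random feature approximations of the softmax kernel, the sketch below uses the well-known positive random feature construction, in which exp(q·k) equals the expectation over Gaussian w of exp(w·q − |q|²/2) · exp(w·k − |k|²/2). This is the older, unbounded construction, not the new "positive and bounded" features mentioned above, and the function names, feature count, and seed are assumptions for illustration.

```python
import math
import random


def sample_projections(m, d, seed=0):
    """Draw m Gaussian projection vectors of dimension d."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]


def positive_random_features(x, projections):
    """phi(x)_i = exp(w_i . x - |x|^2 / 2) / sqrt(m); every entry is positive."""
    sq_norm = sum(v * v for v in x)
    m = len(projections)
    return [
        math.exp(sum(wi * xi for wi, xi in zip(w, x)) - sq_norm / 2.0) / math.sqrt(m)
        for w in projections
    ]


q = [0.3, -0.2, 0.1, 0.4]
k = [0.1, 0.5, -0.3, 0.2]
W = sample_projections(m=20000, d=4)

phi_q = positive_random_features(q, W)
phi_k = positive_random_features(k, W)
approx = sum(a * b for a, b in zip(phi_q, phi_k))   # estimate of exp(q . k)
exact = math.exp(sum(a * b for a, b in zip(q, k)))  # true softmax kernel value
```

Because the kernel factorizes into a dot product of feature vectors, the sum over keys can be accumulated once and reused for every query, which is what makes the cost linear rather than quadratic in sequence length.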

Training efficiency

Efficient optimization methods are the cornerstone of modern ML applications and are particularly crucial in large scale settings. In such settings, even first order adaptive methods like Adam are often expensive, and training stability becomes challenging. In addition, these approaches are often agnostic to the architecture of the neural network, thereby ignoring its rich structure, which leads to inefficient training. This motivates new techniques to more efficiently and effectively optimize modern neural network models. We are developing new architecture-aware training techniques, e.g., for training transformer networks, including new scale-invariant transformer networks and novel clipping methods that, when combined with vanilla stochastic gradient descent (SGD), result in faster training. Using this approach, for the first time, we were able to effectively train BERT using simple SGD without the need for adaptivity.

Moreover, with LocoProp we proposed a new method that achieves performance similar to that of a second-order optimizer while using the same computational and memory resources as a first-order optimizer. LocoProp takes a modular view of neural networks by decomposing them into a composition of layers. Each layer is then allowed to have its own loss function as well as output target and weight regularizer. With this setup, after a suitable forward-backward pass, LocoProp proceeds to perform parallel updates to each layer’s “local loss”. In fact, these updates can be shown to resemble those of higher-order optimizers, both theoretically and empirically. On a deep autoencoder benchmark, LocoProp achieves performance comparable to that of higher-order optimizers while being significantly faster.

Similar to backpropagation, LocoProp applies a forward pass to compute the activations. In the backward pass, LocoProp sets per neuron "targets" for each layer. Finally, LocoProp splits model training into independent problems across layers where several local updates can be applied to each layer's weights in parallel.

One key assumption in optimizers like SGD is that each data point is sampled independently and identically from a distribution. This is unfortunately hard to satisfy in practical settings such as reinforcement learning, where the model (or agent) has to learn from data generated based on its own predictions. We proposed a new algorithmic approach named SGD with reverse experience replay, which finds optimal solutions in several settings like linear dynamical systems, non-linear dynamical systems, and Q-learning for reinforcement learning. Furthermore, an enhanced version of this method, IER, turns out to be the state of the art and is the most stable experience replay technique on a variety of popular RL benchmarks.
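One way to see why replay order matters is a toy tabular example. The sketch below replays the same three transitions of a chain in forward and in reverse order and compares the resulting value estimates. It illustrates only the reverse-ordering idea; the single-action chain, the step size of 1.0, and the function names are simplifying assumptions, and this is not the published algorithm.

```python
def q_update(q, transition, lr, gamma):
    """One temporal-difference update on a tabular value estimate."""
    state, reward, next_state, done = transition
    target = reward if done else reward + gamma * q[next_state]
    q[state] += lr * (target - q[state])


def replay(transitions, order, num_states, lr=1.0, gamma=0.9):
    """Apply one pass of updates over the buffer in the given index order."""
    q = [0.0] * num_states
    for idx in order:
        q_update(q, transitions[idx], lr, gamma)
    return q


# Chain 0 -> 1 -> 2 -> 3 with a single action per state (so Q equals V here)
# and a reward of 1.0 only on the final step.
transitions = [
    (0, 0.0, 1, False),
    (1, 0.0, 2, False),
    (2, 1.0, 3, True),
]
forward = replay(transitions, order=[0, 1, 2], num_states=4)
reverse = replay(transitions, order=[2, 1, 0], num_states=4)
```

After one pass, forward replay has only updated the state next to the reward, whereas reverse replay has propagated the discounted reward all the way back to state 0 (approximately 0.81, 0.9, and 1.0 for states 0, 1, and 2).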

Data efficiency

For many tasks, deep neural networks heavily rely on large datasets. In addition to the storage costs and potential security/privacy concerns that come along with large datasets, training modern deep neural networks on such datasets incurs high computational costs. One promising way to solve this problem is with data subset selection, where the learner aims to find the most informative subset from a large number of training samples to approximate (or even improve upon) training with the entire training set.

We analyzed a subset selection framework designed to work with arbitrary model families in a practical batch setting. In such a setting, a learner can sample examples one at a time, accessing both the context and true label, but in order to limit overhead costs, is only able to update its state (i.e., further train model weights) once a large enough batch of examples is selected. We developed an algorithm, called IWeS, that selects examples by importance sampling where the sampling probability assigned to each example is based on the entropy of models trained on previously selected batches. We provide a theoretical analysis, proving generalization and sampling rate bounds.
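The general shape of entropy-based importance sampling can be sketched as follows. The exact probability used by IWeS is not given in the text, so the normalized-entropy rule, the `floor` parameter, and the function names are assumptions; the only properties carried over from the description are that uncertain examples are sampled more often and that selected examples are reweighted by the inverse of their sampling probability.

```python
import math
import random


def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sampling_probability(probs, floor=0.05):
    """Normalized entropy in [0, 1], floored so no example has zero probability."""
    max_entropy = math.log(len(probs))
    return min(1.0, max(floor, entropy(probs) / max_entropy))


def select_batch(predictions, rng):
    """Return (index, importance_weight) pairs for the selected examples."""
    selected = []
    for i, probs in enumerate(predictions):
        p = sampling_probability(probs)
        if rng.random() < p:
            # Weight by 1/p so that the weighted loss stays unbiased.
            selected.append((i, 1.0 / p))
    return selected


# Predictions from a hypothetical model trained on earlier batches.
predictions = [[0.5, 0.5], [0.99, 0.01], [1.0, 0.0]]
probabilities = [sampling_probability(p) for p in predictions]
batch = select_batch(predictions, random.Random(0))
```

The maximally uncertain first example is always kept, while the two confident examples are kept only rarely and carry a large weight when they are.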

Another concern with training large networks is that they can be highly sensitive to distribution shifts between training data and data seen at deployment time, especially when working with limited amounts of training data that might not cover all deployment-time scenarios. A recent line of work has hypothesized “extreme simplicity bias” as the key issue behind this brittleness of neural networks. Our latest work makes this hypothesis actionable, leading to two new complementary approaches, DAFT and FRR, that when combined provide significantly more robust neural networks. In particular, these two approaches use adversarial fine-tuning along with inverse feature predictions to make the learned network robust.

Inference efficiency

Increasing the size of neural networks has proven surprisingly effective in improving their predictive accuracy. However, it is challenging to realize these gains in the real world, as the inference costs of large models may be prohibitively high for deployment. This motivates strategies to improve the serving efficiency without sacrificing accuracy. In 2022, we studied different strategies to achieve this, notably those based on knowledge distillation and adaptive computation.

Distillation

Distillation is a simple yet effective method for model compression, which greatly expands the potential applicability of large neural models. Distillation has proved widely effective in a range of practical applications, such as ads recommendation. Most use-cases of distillation involve a direct application of the basic recipe to the given domain, with limited understanding of when and why this ought to work. Our research this year has looked at tailoring distillation to specific settings and formally studying the factors that govern the success of distillation.
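For readers unfamiliar with the "basic recipe", a common formulation trains the student on a mix of the teacher's temperature-softened predictions and the true label. The sketch below assumes that formulation; the `temperature` and `alpha` values and the function names are illustrative choices, and the temperature-squared scaling sometimes applied to the soft term is omitted for brevity.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled, numerically stable softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=2.0, alpha=0.5):
    """alpha * soft cross-entropy to the teacher + (1 - alpha) * hard cross-entropy."""
    teacher = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    soft = -sum(t * math.log(s) for t, s in zip(teacher, student_soft))
    hard = -math.log(softmax(student_logits)[true_label])
    return alpha * soft + (1.0 - alpha) * hard


teacher_logits = [2.0, 0.0, -1.0]
agreeing = distillation_loss([2.0, 0.0, -1.0], teacher_logits, true_label=0)
disagreeing = distillation_loss([-1.0, 0.0, 2.0], teacher_logits, true_label=0)
```

A student that agrees with the teacher incurs a lower loss than one that ranks the classes in the opposite order.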

On the algorithmic side, by carefully modeling the noise in the teacher labels, we developed a principled approach to reweight the training examples, and a robust method to sample a subset of data to have the teacher label. In “Teacher Guided Training”, we presented a new distillation framework: rather than passively using the teacher to annotate a fixed dataset, we actively use the teacher to guide the selection of informative samples to annotate. This makes the distillation process shine in limited data or long-tail settings.

We also researched new recipes for distillation from a cross-encoder (e.g., BERT) to a factorized dual-encoder, an important setting for the task of scoring the relevance of a [query, document] pair. We studied the reasons for the performance gap between cross- and dual-encoders, noting that this can be the result of generalization rather than capacity limitation in dual-encoders. The careful construction of the loss function for distillation can mitigate this and reduce the gap between cross- and dual-encoder performance. Subsequently, in EmbedDistill, we looked at further improving dual-encoder distillation by matching embeddings from the teacher model. This strategy can also be used to distill from a large to small dual-encoder model, wherein inheriting and freezing the teacher’s document embeddings can prove highly effective.

In EmbedDistill, teacher to student distillation is done by designing new loss functions that match the geometry of student embeddings with that of the teacher in addition to matching the final predictions.

On the theoretical side, we provided a new perspective on distillation through the lens of supervision complexity, a measure of how well the student can predict the teacher labels. Drawing on neural tangent kernel (NTK) theory, this offers conceptual insights, such as the fact that a capacity gap may affect distillation because such teachers’ labels may appear akin to purely random labels to the student. We further demonstrated that distillation can cause the student to underfit points the teacher model finds “hard” to model. Intuitively, this may help the student focus its limited capacity on those samples that it can reasonably model.

Adaptive computation

While distillation is an effective means of reducing inference cost, it does so uniformly across all samples. Intuitively however, some “easy” samples may inherently require less compute than the “hard” samples. The goal of adaptive compute is to design mechanisms that enable such sample-dependent computation.

Confident Adaptive Language Modeling (CALM) introduced a controlled early-exit functionality to Transformer-based text generators such as T5. In this form of adaptive computation, the model dynamically modifies the number of transformer layers that it uses per decoding step. The early-exit gates use a confidence measure with a decision threshold that is calibrated to satisfy statistical performance guarantees. In this way, the model needs to compute the full stack of decoder layers for only the most challenging predictions. Easier predictions only require computing a few decoder layers. In practice, the model uses about a third of the layers for prediction on average, yielding 2–3x speed-ups while preserving the same level of generation quality.

Text generation with a regular language model and with CALM. CALM attempts to make early predictions. Once confident enough (darker blue tones), it skips ahead and saves time.
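The early-exit mechanism can be sketched as a loop over decoder layers that stops as soon as a confidence score clears a threshold. In this sketch the confidence measure is simply the top softmax probability and the threshold is hand-picked; both are assumptions, whereas CALM calibrates its threshold to meet statistical guarantees. The per-layer logits are hypothetical stand-ins for what each layer would predict.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(per_layer_logits, threshold):
    """Return (predicted_token, layers_used) for one decoding step.

    Layers are evaluated in order; decoding stops at the first layer whose
    top probability reaches `threshold`. If none does, the last layer is used.
    """
    for depth, logits in enumerate(per_layer_logits, start=1):
        probs = softmax(logits)
        confidence = max(probs)
        if confidence >= threshold:
            break
    return probs.index(confidence), depth


# Hypothetical logits over a 3-token vocabulary after layers 1, 2, and 3.
per_layer_logits = [
    [0.1, 0.2, 0.0],
    [0.5, 2.0, 0.1],
    [0.2, 5.0, 0.1],
]
easy = early_exit_predict(per_layer_logits, threshold=0.7)
strict = early_exit_predict(per_layer_logits, threshold=0.9)
```

Raising the threshold trades speed for caution: the lenient setting exits after two layers, while the strict one runs all three.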

One popular adaptive compute mechanism is a cascade of two or more base models. A key issue in using cascades is deciding whether to simply use the current model’s predictions, or whether to defer prediction to a downstream model. Learning when to defer requires designing a suitable loss function, which can leverage appropriate signals to act as supervision for the deferral decision. We formally studied existing loss functions for this goal, demonstrating that they may underfit the training sample owing to an implicit application of label smoothing. We showed that one can mitigate this with post-hoc training of a deferral rule, which does not require modifying the model internals in any way.
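A two-model cascade with a deferral rule fitted after training might look like the sketch below. The rule here is a plain confidence threshold chosen on held-out data to hit a target deferral rate, which is a simple baseline and an assumption, not the learned post-hoc rule studied in the work above. The toy models and their outputs are hypothetical.

```python
def fit_threshold(validation_confidences, deferral_rate):
    """Pick a threshold so that roughly `deferral_rate` of held-out examples fall below it."""
    ranked = sorted(validation_confidences)
    cut = int(len(ranked) * deferral_rate)
    return ranked[cut] if cut < len(ranked) else float("inf")


def cascade_predict(example, small_model, large_model, threshold):
    """Use the small model when it is confident, otherwise defer to the large one."""
    label, confidence = small_model(example)
    if confidence >= threshold:
        return label, "small"
    return large_model(example), "large"


# Hypothetical stand-ins for real models.
SMALL_OUTPUTS = {"easy": ("cat", 0.95), "hard": ("cat", 0.55)}


def small_model(example):
    return SMALL_OUTPUTS[example]


def large_model(example):
    return "dog" if example == "hard" else "cat"


threshold = fit_threshold([0.9, 0.6, 0.8, 0.55], deferral_rate=0.5)
easy = cascade_predict("easy", small_model, large_model, threshold)
hard = cascade_predict("hard", small_model, large_model, threshold)
```

Because the threshold is fitted on the small model's outputs alone, neither model's internals need to change.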

For retrieval applications, standard semantic search techniques use a fixed representation for each embedding generated by a large model. That is, irrespective of the downstream task and its associated compute environment or constraints, the representation size and capability are mostly fixed. Matryoshka representation learning (MRL) introduces flexibility to adapt representations according to the deployment environment. That is, it forces representations to have a natural ordering within their coordinates such that for resource constrained environments, we can use only the top few coordinates of the representation, while for richer and precision-critical settings, we can use more coordinates of the representation. When combined with standard approximate nearest neighbor search techniques like ScaNN, MRL is able to provide up to 16x lower compute with the same recall and accuracy metrics.
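One way such ordered coordinates can be exploited at query time is a two-stage search: shortlist candidates using only the leading coordinates, then rerank the shortlist with the full vectors. The sketch below assumes embeddings that already have the Matryoshka ordering property; the toy vectors, the function names, and the brute-force search (standing in for an approximate nearest neighbor index) are all illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity, returning 0.0 for zero-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def two_stage_search(query, database, coarse_dims, shortlist_size):
    """Shortlist with the first `coarse_dims` coordinates, rerank with all of them."""
    coarse_query = query[:coarse_dims]
    shortlist = sorted(
        range(len(database)),
        key=lambda i: cosine(coarse_query, database[i][:coarse_dims]),
        reverse=True,
    )[:shortlist_size]
    # Only the shortlisted items pay for a full-dimensional comparison.
    return max(shortlist, key=lambda i: cosine(query, database[i]))


database = [
    [1.0, 0.1, -0.5, -0.5],
    [0.9, 0.0, 0.5, 0.4],
    [0.0, 1.0, 0.5, 0.5],
    [-1.0, 0.0, 0.5, 0.5],
]
query = [1.0, 0.0, 0.5, 0.5]
best = two_stage_search(query, database, coarse_dims=2, shortlist_size=2)
full = max(range(len(database)), key=lambda i: cosine(query, database[i]))
```

In this toy case the two-stage search returns the same item as a full-dimensional scan while comparing full vectors for only half of the database.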

Concluding thoughts

Large ML models are showing transformational outcomes in several domains, but efficiency in both training and inference is emerging as a critical need to make these models practical in the real world. Google Research has been investing significantly in making large ML models efficient by developing new foundational techniques. This is an ongoing effort, and over the next several months we will continue to explore core challenges to make ML models even more robust and efficient.

Acknowledgements

The work in efficient deep learning is a collaboration among many researchers from Google Research, including Amr Ahmed, Ehsan Amid, Rohan Anil, Mohammad Hossein Bateni, Gantavya Bhatt, Srinadh Bhojanapalli, Zhifeng Chen, Felix Chern, Gui Citovsky, Andrew Dai, Andy Davis, Zihao Deng, Giulia DeSalvo, Nan Du, Avi Dubey, Matthew Fahrbach, Ruiqi Guo, Blake Hechtman, Yanping Huang, Prateek Jain, Wittawat Jitkrittum, Seungyeon Kim, Ravi Kumar, Aditya Kusupati, James Laudon, Quoc Le, Daliang Li, Zonglin Li, Lovish Madaan, David Majnemer, Aditya Menon, Don Metzler, Vahab Mirrokni, Vaishnavh Nagarajan, Harikrishna Narasimhan, Rina Panigrahy, Srikumar Ramalingam, Ankit Singh Rawat, Sashank Reddi, Aniket Rege, Afshin Rostamizadeh, Tal Schuster, Si Si, Apurv Suman, Phil Sun, Erik Vee, Ke Ye, Chong You, Felix Yu, Manzil Zaheer, and Yanqi Zhou.

Google Research, 2022 & beyond

This was the fourth blog post in the “Google Research, 2022 & Beyond” series. Other posts in this series are listed in the table below:

  • Algorithms & Theory
  • Machine Intelligence
  • Year in Review

A decade in deep learning, and what's next

Nov 18, 2021

Twenty years ago, Google started using machine learning, and 10 years ago, it helped spur rapid progress in AI using deep learning. Jeff Dean and Marian Croak of Google Research take a look at how we’ve innovated on these techniques and applied them in helpful ways, and look ahead to a responsible and inclusive path forward.

From research demos to AI that really works

I was first introduced to neural networks (computer systems that roughly imitate how biological brains accomplish tasks) as an undergrad in 1990. I did my senior thesis on using parallel computation to train neural networks. In those early days, I thought if we could get 32X more compute power (using 32 processors at the time!), we could get neural networks to do impressive things. I was way off. It turns out we would need about 1 million times as much computational power before neural networks could scale to real-world problems.

A decade later, as an early employee at Google, I became reacquainted with machine learning when the company was still just a startup. In 2001 we used a simpler version of machine learning, statistical ML, to detect spam and suggest better spellings for people’s web searches. But it would be another decade before we had enough computing power to revive a more computationally-intensive machine learning approach called deep learning. Deep learning uses neural networks with multiple layers (thus the “deep”), so it can learn not just simple statistical patterns, but can learn subtler patterns of patterns — such as what’s in an image or what word was spoken in some audio. One of our first publications in 2012 was on a system that could find patterns among millions of frames from YouTube videos. That meant, of course, that it learned to recognize cats.

To get to the helpful features you use every day — searchable photo albums, suggestions on email replies, language translation, flood alerts, and so on — we needed to make years of breakthroughs on top of breakthroughs, tapping into the best of Google Research in collaboration with the broader research community. Let me give you just a couple examples of how we’ve done this.

A big moment for image recognition

In 2012, a paper wowed the research world for making a huge jump in accuracy on image recognition using deep neural networks, leading to a series of rapid advances by researchers outside and within Google. Further advances led to applications like Google Photos in 2015, letting you search photos by what’s in them. We then developed other deep learning models to help you find addresses in Google Maps, make sense of videos on YouTube, and explore the world around you using Google Lens. Beyond our products, we applied these approaches to health-related problems, such as detecting diabetic retinopathy in 2016, and then cancerous cells in 2017, and breast cancer in 2020. Better understanding of aerial imagery through deep learning let us launch flood forecasting in 2018, now expanded to cover more than 360 million people in 2021. It’s been encouraging to see how helpful these advances in image recognition have been.

Similarly, we’ve used deep learning to accelerate language understanding. With sequence-to-sequence learning in 2014, we began looking at how to understand strings of text using deep learning. This led to neural machine translation in Google Translate in 2016, which was a massive leap in quality, particularly for less prevalent languages. We developed neural language models further for Smart Reply in Gmail in 2017, which made it easier and faster for you to knock through your email, especially on mobile. That same year, Google invented Transformers, leading to BERT in 2018, then T5, and in 2021 MUM, which lets you ask Google much more nuanced questions. And with “sparse” models like GShard, we can dramatically improve on tasks like translation while using less energy.

We’ve driven a similar arc in understanding speech. In 2012, Google used deep neural networks to make major improvements to speech recognition on Android. We kept advancing the state of the art with higher-quality, faster, more efficient speech recognition systems. By 2019, we were able to put the entire neural network on-device so you could get accurate speech recognition even without a connection. And in 2021, we launched Live Translate on the Pixel 6 phone, letting you speak and be translated in 48 languages, all on-device, while you’re traveling with no Internet.

Project Relate: A communication tool for people with speech impairments.

ML-based flood forecasting helps equip those in harm’s way with accurate and detailed alerts.

Google Health's AI system helps radiologists identify cancer in mammograms with greater accuracy.

More invention ahead

As our research goes forward, we’re balancing more immediately applied research with more exploratory fundamental research. So we’re looking at how, for example, AI can aid scientific discovery, with a project like mapping the brain of a fly, which could one day help better understand and treat mental illness in people. We’re also pursuing quantum computing, which will likely take a decade or longer to reach wide-scale applications. This is why we publish nearly 1000 papers a year, including around 200 related to responsible AI, and we’ve given over 6500 grants to external researchers over the past decade and a half.

Looking ahead from 2021 to 2031, I'm excited about the next-generation AI systems we can build, and how much more helpful they’ll be. We’re planting the seeds today with new architectures like Pathways, with more to come.

Marian Croak

Minding the gap(s)

As we develop these lines of research and turn them into useful technologies, we’re mindful of the broader societal impact of AI, and especially that technology has not always had an equitable impact. This is personal for me — I care deeply about ensuring that people from all different backgrounds and circumstances have a good experience.

So we’re increasing the depth and rigor of how we review and evaluate our research to ensure we’re developing it responsibly. We’re also scaling up what we learn by inventing new tools to understand and calibrate critical AI systems across Google's products. We’re growing our organization to 200 experts in Responsible AI and Human Centered Technology, and working with hundreds of partners in product, privacy, security, and other teams across Google.

As one example of our work on responsible AI, Google Research began exploring the nascent field of ML fairness in 2016. The teams realized that on top of publishing papers, they could have a greater impact by teaching ML practitioners how to build with fairness in mind, as with the course we launched in 2018. We also started building interactive tools that coders and researchers could use, from the What-If Tool in 2018 to the 2019 launch of our Fairness Indicators tool, all the way to Know Your Data in 2021. All of these are concrete ways that AI developers can test their datasets and models to see what kind of biases and gaps there are, and start to work on mitigations to prevent unfair outcomes.

A principled approach

In fact, fairness is one of the key tenets of our AI Principles. We developed these principles in 2017 and published them in 2018, announcing not only the Principles themselves but a set of responsible AI practices with practical organizational and technical advice from what we’ve learned along the way. I was proud to be involved in the AI Principles review process from early on — I’ve seen firsthand how rigorous the teams at Google are on evaluating the technology we’re developing and deciding how best to deploy it in the real world.

Indeed, there are paths we’ve chosen not to go down — the AI Principles describe a number of areas we avoid. In line with our principles, we’ve taken a very cautious approach on face recognition. We recognize how fraught this area is not only in terms of privacy and surveillance concerns, but also its potential for unfair bias and impacts on historically marginalized groups. I’m glad that we’re taking this so thoughtfully and carefully.

We’re also developing technologies that help engineers apply the AI Principles directly — for example, incorporating privacy design principles. We invented Federated Learning in 2017 as a way to train ML models without your personal data leaving your phone. In 2018 we showed how well this works on Gboard, the free keyboard you can download for your phone — it learns to provide you more useful suggestions, while keeping what you type private on your device.

If you’re curious, you can learn more about all these veins of research, product impact, processes, and external engagement in our 2021 AI Principles Progress Update.

AI by everyone, for everyone

As we look to the decade ahead, it’s incredibly important that AI be built in a way that works well for everyone. That means building as inclusive a team as we can ourselves at Google. It also means ensuring the field as a whole increasingly represents the people whose lives it aims to improve.

I’m proud to lead the Black Leadership Advisory Group (BLAG) at Google. We helped craft and drive programs included in Google’s recent update on racial equity work. For example, we paired up new director-level hires with BLAG members, and the feedback has been really positive, with 80% of respondents saying they'd recommend the program. We’re looking at extending this to other groups, including for Latinx+ and Asian+ Googlers. We’re holding ourselves accountable as leaders too: we now evaluate all VPs and above at Google on progress on diversity, equity, and inclusion. This is crucial if we’re going to have a more representative set of researchers and engineers building future technologies.

For the broader research and computer science communities, we’re providing a wide variety of grants, programs, and collaborations that we hope will welcome a more representative range of researchers. Our Research Scholar Program, begun in 2021, gave grants to more than 50 universities in 15+ countries, and 43% of the principal investigators identify as part of a group that’s been historically marginalized in tech. Similarly, our exploreCSR and CS Research Mentorship programs support thousands of undergrads from marginalized groups. And we’re partnering with groups like the National Science Foundation on their new Institute for Human-AI Collaborations.

We’re doing everything we can to make AI work well for all people. We’ll not only help ensure products across Google are using the latest practices in responsible AI; we’ll also encourage new products and features that serve those who’ve historically missed out on helpful new technologies. One example is Project Relate, which uses machine learning to help people with speech impairments communicate and use technology more easily. Another is Real Tone, which helps our imaging products like our Pixel phone camera and Google Photos more accurately and beautifully represent a diverse range of skin tones. These are just the start.

We’re excited for what’s ahead in AI, for everyone.

  • Survey Paper
  • Open access
  • Published: 31 March 2021

Review of deep learning: concepts, CNN architectures, challenges, applications, future directions

  • Laith Alzubaidi   ORCID: orcid.org/0000-0002-7296-5413 1 , 5 ,
  • Jinglan Zhang 1 ,
  • Amjad J. Humaidi 2 ,
  • Ayad Al-Dujaili 3 ,
  • Ye Duan 4 ,
  • Omran Al-Shamma 5 ,
  • J. Santamaría 6 ,
  • Mohammed A. Fadhel 7 ,
  • Muthana Al-Amidie 4 &
  • Laith Farhan 8  

Journal of Big Data volume 8, Article number: 53 (2021)

In the last few years, the deep learning (DL) computing paradigm has been deemed the gold standard in the machine learning (ML) community. Moreover, it has gradually become the most widely used computational approach in the field of ML, achieving outstanding results on several complex cognitive tasks, matching or even beating human performance. One of the benefits of DL is the ability to learn from massive amounts of data. The DL field has grown fast in the last few years and has been extensively used to successfully address a wide range of traditional applications. More importantly, DL has outperformed well-known ML techniques in many domains, e.g., cybersecurity, natural language processing, bioinformatics, robotics and control, and medical information processing, among many others. Although several works have reviewed the state of the art in DL, each of them tackled only one aspect of DL, which leads to an overall lack of knowledge about it. Therefore, in this contribution, we propose using a more holistic approach in order to provide a more suitable starting point from which to develop a full understanding of DL. Specifically, this review attempts to provide a more comprehensive survey of the most important aspects of DL, including those enhancements recently added to the field. In particular, this paper outlines the importance of DL and presents the types of DL techniques and networks. It then presents convolutional neural networks (CNNs), which are the most utilized DL network type, and describes the development of CNN architectures together with their main features, starting with the AlexNet network and closing with the High-Resolution network (HR.Net). Finally, we present the challenges and suggested solutions to help researchers understand the existing research gaps, followed by a list of the major DL applications. Computational tools including FPGA, GPU, and CPU are summarized along with a description of their influence on DL. The paper ends with the evolution matrix, benchmark datasets, and summary and conclusion.

Introduction

Recently, machine learning (ML) has become very widespread in research and has been incorporated in a variety of applications, including text mining, spam detection, video recommendation, image classification, and multimedia concept retrieval [1, 2, 3, 4, 5, 6]. Among the different ML algorithms, deep learning (DL) is very commonly employed in these applications [7, 8, 9]. Another name for DL is representation learning (RL). The continuing appearance of novel studies in the fields of deep and distributed learning is due to both the unpredictable growth in the ability to obtain data and the amazing progress made in hardware technologies, e.g., high-performance computing (HPC) [10].

DL is derived from the conventional neural network but considerably outperforms its predecessors. Moreover, DL employs transformations and graph technologies simultaneously in order to build up multi-layer learning models. The most recently developed DL techniques have obtained outstanding performance across a variety of applications, including audio and speech processing, visual data processing, and natural language processing (NLP), among others [ 11 , 12 , 13 , 14 ].

Usually, the effectiveness of an ML algorithm is highly dependent on the integrity of the input-data representation. It has been shown that a suitable data representation provides improved performance when compared to a poor data representation. Thus, a significant research trend in ML for many years has been feature engineering, which has informed numerous research studies. This approach aims at constructing features from raw data. In addition, it is extremely field-specific and frequently requires sizable human effort. For instance, several types of features were introduced and compared in the computer vision context, such as histogram of oriented gradients (HOG) [ 15 ], scale-invariant feature transform (SIFT) [ 16 ], and bag of words (BoW) [ 17 ]. As soon as a novel feature is introduced and found to perform well, it becomes a new research direction that is pursued over multiple decades.

By contrast, DL algorithms perform feature extraction automatically. This allows researchers to extract discriminative features using the smallest possible amount of human effort and field knowledge [ 18 ]. These algorithms have a multi-layer data representation architecture, in which the first layers extract the low-level features while the last layers extract the high-level features. Note that this type of architecture was originally inspired by artificial intelligence (AI) that simulates the process occurring in core sensorial regions within the human brain. From different scenes, the human brain can automatically extract a data representation: the received scene information is the input, while the classified objects are the output. DL follows this working methodology of the human brain, which underlines its main benefit.

In the field of ML, DL is currently one of the most prominent research trends owing to its considerable success. In this paper, an overview of DL is presented from various perspectives, such as the main concepts, architectures, challenges, applications, computational tools, and evaluation metrics. The convolutional neural network (CNN) is one of the most popular and widely used DL networks [ 19 , 20 ], and it is largely because of CNN that DL is so popular nowadays. The main advantage of CNN compared to its predecessors is that it automatically detects the significant features without any human supervision, which has made it the most used. Therefore, we examine CNN in depth by presenting its main components. Furthermore, we describe in detail the most common CNN architectures, starting with the AlexNet network and ending with the High-Resolution network (HR.Net).

Several DL review papers have been published in the last few years. However, each of them addresses only one side of the field, focusing on a single application or topic, such as the review of CNN architectures [ 21 ], DL for classification of plant diseases [ 22 ], DL for object detection [ 23 ], and DL applications in medical image analysis [ 24 ]. Although these reviews present good topics, they do not provide a full understanding of DL topics such as concepts, detailed research gaps, computational tools, and DL applications. It is first necessary to understand the aspects of DL, including concepts, challenges, and applications, before going deep into the applications. Achieving that requires extensive time and a large number of research papers to learn about DL, including research gaps and applications. Therefore, we propose a deep review of DL to provide a more suitable starting point from which to develop a full understanding of DL from one review paper. The motivation behind our review was to cover the most important aspects of DL, including open challenges, applications, and computational tools. Furthermore, our review can be the first step towards other DL topics.

The main aim of this review is to present the most important aspects of DL so that researchers and students can easily obtain a clear picture of DL from a single review paper. This review will further advance DL research by helping people discover more about recent developments in the field. Researchers will be able to decide on the most suitable direction of work in order to provide more accurate alternatives to the field. Our contributions are outlined as follows:

This is the first review that provides a nearly comprehensive survey of the most important aspects of deep learning. It helps researchers and students gain a good understanding from one paper.

We explain CNN, the most popular deep learning algorithm, in depth by describing its concepts, theory, and state-of-the-art architectures.

We review current challenges (limitations) of deep learning, including lack of training data, imbalanced data, interpretability of data, uncertainty scaling, catastrophic forgetting, model compression, overfitting, the vanishing gradient problem, the exploding gradient problem, and underspecification. We additionally discuss the proposed solutions tackling these issues.

We provide an exhaustive list of medical imaging applications of deep learning, categorized by task, starting with classification and ending with registration.

We discuss the computational approaches (CPU, GPU, FPGA) by comparing the influence of each tool on deep learning algorithms.

The rest of the paper is organized as follows: “Survey methodology” section describes the survey methodology. “Background” section presents the background. “Classification of DL approaches” section defines the classification of DL approaches. “Types of DL networks” section displays types of DL networks. “CNN architectures” section shows CNN architectures. “Challenges (limitations) of deep learning and alternate solutions” section details the challenges of DL and alternate solutions. “Applications of deep learning” section outlines the applications of DL. “Computational approaches” section explains the influence of computational approaches (CPU, GPU, FPGA) on DL. “Evaluation metrics” section presents the evaluation metrics. “Frameworks and datasets” section lists frameworks and datasets. “Summary and conclusion” section presents the summary and conclusion.

Survey methodology

We have reviewed the significant research papers in the field published during 2010–2020, mainly from 2019 and 2020, together with some papers from 2021. The main focus was papers from the most reputable publishers, such as IEEE, Elsevier, MDPI, Nature, ACM, and Springer. Some papers were selected from arXiv. We have reviewed more than 300 papers on various DL topics. There are 108 papers from the year 2020, 76 papers from the year 2019, and 48 papers from the year 2018. This indicates that this review focused on the latest publications in the field of DL. The selected papers were analyzed and reviewed to (1) list and define the DL approaches and network types, (2) list and explain CNN architectures, (3) present the challenges of DL and suggest alternate solutions, (4) assess the applications of DL, and (5) assess computational approaches. The main keywords used as search criteria for this review paper were (“Deep Learning”), (“Machine Learning”), (“Convolution Neural Network”), (“Deep Learning” AND “Architectures”), ((“Deep Learning”) AND (“Image”) AND (“detection” OR “classification” OR “segmentation” OR “Localization”)), (“Deep Learning” AND “detection” OR “classification” OR “segmentation” OR “Localization”), (“Deep Learning” AND “CPU” OR “GPU” OR “FPGA”), (“Deep Learning” AND “Transfer Learning”), (“Deep Learning” AND “Imbalanced Data”), (“Deep Learning” AND “Interpretability of data”), (“Deep Learning” AND “Overfitting”), (“Deep Learning” AND “Underspecification”). Figure  1 shows the search structure of our survey. Table  1 presents the details of some of the journals that have been cited in this review paper.

figure 1

Search framework

Background

This section presents the background of DL. We begin with a quick introduction to DL, followed by the difference between DL and ML. We then show the situations that require DL. Finally, we present the reasons for applying DL.

DL, a subset of ML (Fig.  2 ), is inspired by the information processing patterns found in the human brain. DL does not require any human-designed rules to operate; rather, it uses a large amount of data to map the given input to specific labels. DL is designed using numerous layers of algorithms (artificial neural networks, or ANNs), each of which provides a different interpretation of the data that has been fed to them [ 18 , 25 ].

figure 2

Deep learning family

Achieving the classification task using conventional ML techniques requires several sequential steps, specifically pre-processing, feature extraction, careful feature selection, learning, and classification. Furthermore, feature selection has a great impact on the performance of ML techniques. Biased feature selection may lead to incorrect discrimination between classes. Conversely, DL has the ability to automate the learning of feature sets for several tasks, unlike conventional ML methods [ 18 , 26 ]. DL enables learning and classification to be achieved in a single shot (Fig.  3 ). DL has become an incredibly popular type of ML algorithm in recent years due to the huge growth and evolution of the field of big data [ 27 , 28 ]. It continues to reach new levels of performance on several ML tasks [ 22 , 29 , 30 , 31 ] and has driven improvements in many learning fields [ 32 , 33 ], such as image super-resolution [ 34 ], object detection [ 35 , 36 ], and image recognition [ 30 , 37 ]. Recently, DL performance has come to exceed human performance on tasks such as image classification (Fig.  4 ).

figure 3

The difference between deep learning and traditional machine learning

figure 4

Deep learning performance compared to human

Nearly all scientific fields have felt the impact of this technology. Most industries and businesses have already been disrupted and transformed through the use of DL. The leading technology and economy-focused companies around the world are in a race to improve DL. Even now, human performance cannot match that of DL in many areas, such as predicting the time taken to make car deliveries, deciding whether to approve loan requests, and predicting movie ratings [ 38 ]. The Turing Award announced in 2019, often called the “Nobel Prize” of computing, went to three pioneers in the field of DL (Yann LeCun, Geoffrey Hinton, and Yoshua Bengio) [ 39 ]. Although a large number of goals have been achieved, there is further progress to be made in the DL context. In fact, DL has the ability to enhance human lives by providing additional accuracy in tasks such as estimating natural disasters [ 40 ], discovering new drugs [ 41 ], and diagnosing cancer [ 42 , 43 , 44 ]. Esteva et al. [ 45 ] found that a DL network, using 129,450 images of 2032 diseases, performed on par with twenty-one board-certified dermatologists in diagnosing disease. Furthermore, in grading prostate cancer, US board-certified general pathologists achieved an average accuracy of 61%, while the Google AI [ 44 ] outperformed these specialists by achieving an average accuracy of 70%. In 2020, DL is playing an increasingly vital role in early diagnosis of the novel coronavirus (COVID-19) [ 29 , 46 , 47 , 48 ]. DL has become the main tool in many hospitals around the world for automatic COVID-19 classification and detection using chest X-ray images or other types of images. We end this section with the words of AI pioneer Geoffrey Hinton: “Deep learning is going to be able to do everything”.

When to apply deep learning

Machine intelligence is useful in many situations and in some cases is equal to or better than human experts [ 49 , 50 , 51 , 52 ], meaning that DL can be a solution to the following problems:

Cases where human experts are not available.

Cases where humans are unable to explain decisions made using their expertise (language understanding, medical decisions, and speech recognition).

Cases where the problem solution updates over time (price prediction, stock preference, weather prediction, and tracking).

Cases where solutions require adaptation based on specific cases (personalization, biometrics).

Cases where the size of the problem is extremely large and exceeds our limited reasoning abilities (sentiment analysis, matching ads on Facebook, calculating webpage ranks).

Why deep learning?

Several performance features may answer this question:

Universal Learning Approach: Because DL has the ability to perform in almost all application domains, it is sometimes referred to as universal learning.

Robustness: In general, precisely designed features are not required in DL techniques. Instead, the optimized features are learned in an automated fashion related to the task under consideration. Thus, robustness to the usual changes of the input data is attained.

Generalization: Different data types or different applications can use the same DL technique, an approach frequently referred to as transfer learning (TL), which is explained in a later section. Furthermore, it is a useful approach in problems where data is insufficient.

Scalability: DL is highly scalable. ResNet [ 37 ], which was developed at Microsoft, has been scaled to as many as 1202 layers and is frequently applied at a supercomputing scale. Lawrence Livermore National Laboratory (LLNL), a large enterprise working on evolving frameworks for networks, adopted a similar approach, where thousands of nodes can be implemented [ 53 ].

Classification of DL approaches

DL techniques are classified into three major categories: unsupervised, partially supervised (semi-supervised) and supervised. Furthermore, deep reinforcement learning (DRL), also known as RL, is another type of learning technique, which is mostly considered to fall into the category of partially supervised (and occasionally unsupervised) learning techniques.

Deep supervised learning

Deep semi-supervised learning

In this technique, the learning process is based on semi-labeled datasets. Occasionally, generative adversarial networks (GANs) and DRL are employed in the same way as this technique. In addition, RNNs, which include GRUs and LSTMs, are also employed for partially supervised learning. One of the advantages of this technique is that it minimizes the amount of labeled data needed. On the other hand, one of its disadvantages is that irrelevant input features present in the training data could lead to incorrect decisions. Text document classification is one of the most popular applications of semi-supervised learning: because a large amount of labeled text documents is difficult to obtain, semi-supervised learning is well suited to this task.

Deep unsupervised learning

This technique makes it possible to implement the learning process in the absence of available labeled data (i.e. no labels are required). Here, the agent learns the significant features or interior representation required to discover the unidentified structure or relationships in the input data. Techniques of generative networks, dimensionality reduction and clustering are frequently counted within the category of unsupervised learning. Several members of the DL family have performed well on non-linear dimensionality reduction and clustering tasks; these include restricted Boltzmann machines, auto-encoders and GANs as the most recently developed techniques. Moreover, RNNs, which include GRUs and LSTM approaches, have also been employed for unsupervised learning in a wide range of applications. The main disadvantages of unsupervised learning are that it cannot provide accurate information concerning data sorting and that it is computationally complex. One of the most popular unsupervised learning approaches is clustering [ 54 ].
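As a small illustration of clustering without labels, the sketch below runs a plain k-means loop in NumPy. k-means is used here only as a familiar example of a clustering method; the function name `kmeans`, the toy data, and the choice of initial centers are assumptions for illustration and are not taken from the reviewed works.

```python
import numpy as np

def kmeans(points, init_centers, n_iters=10):
    """Plain k-means: alternate nearest-center assignment and center updates."""
    centers = np.asarray(init_centers, dtype=float).copy()
    for _ in range(n_iters):
        # Distance from every point to every center, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(len(centers)):
            members = points[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return labels, centers

# Hypothetical toy data: two well-separated groups, no labels provided.
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centers = kmeans(points, init_centers=points[[0, 3]])
print(labels.tolist())   # cluster index assigned to each point
print(centers)           # learned cluster centers
```

The structure is discovered from the geometry of the data alone, which is the sense in which no labels are required.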

Deep reinforcement learning

The type of reinforcement learning chosen for a task depends on the space or scope of the problem. For example, DRL is best suited to problems in which many parameters must be optimized. By contrast, derivative-free reinforcement learning is a technique that performs well for problems with limited parameters. Some of the applications of reinforcement learning are business strategy planning and robotics for industrial automation. The main drawback of reinforcement learning is that parameters may influence the speed of learning. Here are the main motivations for utilizing reinforcement learning:

It helps to identify which action produces the highest reward over a longer period.

It helps to discover which situation requires action.

It also enables the agent to figure out the best approach for reaching large rewards.

Reinforcement Learning also gives the learning agent a reward function.
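The idea of the highest reward over a longer period is commonly formalized as a discounted return. The sketch below computes it for a list of rewards; the discount factor `gamma`, the function name, and the two example reward sequences are assumptions for illustration.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards, each weighted by gamma raised to its time step."""
    total = 0.0
    # Work backwards so each step needs one multiply and one add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# Hypothetical reward sequences produced by two different actions.
greedy = [1.0, 0.0, 0.0, 0.0]    # small reward now, nothing later
patient = [0.0, 0.0, 0.0, 5.0]   # nothing now, larger reward later

print(discounted_return(greedy))
print(discounted_return(patient))
```

Under these assumed values the second sequence has the larger return even though its reward arrives later, which is the kind of comparison a reward function lets the agent make.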

Reinforcement learning is not suitable in every situation, for example:

When there is sufficient data to solve the problem with supervised learning techniques.

When computation and time are limited: reinforcement learning is computing-heavy and time-consuming, especially when the workspace is large.

Types of DL networks

The most famous types of deep learning networks are discussed in this section: these include recursive neural networks (RvNNs), recurrent neural networks (RNNs), and CNNs. RvNNs and RNNs are explained briefly, while CNNs are explained in depth because of their importance; they are also the most widely used networks across applications.

Recursive neural networks

RvNN can make predictions over a hierarchical structure and classify the outputs using compositional vectors [ 57 ]. Recursive auto-associative memory (RAAM) [ 58 ] is the primary inspiration for the development of RvNN. The RvNN architecture is designed for processing objects with arbitrarily shaped structures, such as graphs or trees. This approach generates a fixed-width distributed representation from a variable-size recursive data structure. The network is trained using the back-propagation through structure (BTS) learning system [ 58 ]. The BTS system follows the same technique as the general back-propagation algorithm and can support a tree-like structure. Auto-association trains the network to regenerate the input-layer pattern at the output layer. RvNN is highly effective in the NLP context. Socher et al. [ 59 ] introduced an RvNN architecture designed to process inputs from a variety of modalities. These authors demonstrate two applications: classifying natural language sentences, where each sentence is split into words, and classifying natural images, where each image is separated into various segments of interest. To construct a syntactic tree, RvNN calculates a score for the plausibility of merging every pair of units. Next, the pair with the largest score is merged into a compositional vector. Following every merge, RvNN generates (a) a larger area of numerous units, (b) a compositional vector of the area, and (c) a label for the class (for instance, a noun phrase will become the class label for the new area if two units are noun words). The compositional vector for the entire area is the root of the RvNN tree structure. An example RvNN tree is shown in Fig.  5 . RvNN has been employed in several applications [ 60 , 61 , 62 ].

figure 5

An example of RvNN tree
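The merge procedure described above can be sketched as a greedy loop over adjacent units. This is a simplified illustration under stated assumptions, not the implementation of Socher et al.: the composition function tanh(W[c1; c2] + b), the linear scoring vector `u`, the random parameters, and all names are hypothetical, and the class-label prediction step is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4
W = rng.standard_normal((dim, 2 * dim)) * 0.1   # composition weights (assumed)
b = np.zeros(dim)                               # composition bias (assumed)
u = rng.standard_normal(dim)                    # scoring vector (assumed)

def compose(left, right):
    """Compositional vector for a merged pair of units."""
    return np.tanh(W @ np.concatenate([left, right]) + b)

def greedy_parse(vectors, names):
    """Repeatedly merge the adjacent pair with the largest plausibility score."""
    vectors, trees = list(vectors), list(names)
    while len(vectors) > 1:
        # Candidate parent vector and score for every adjacent pair.
        parents = [compose(vectors[i], vectors[i + 1])
                   for i in range(len(vectors) - 1)]
        scores = [float(u @ p) for p in parents]
        best = int(np.argmax(scores))
        # Replace the two merged units by their parent.
        vectors[best:best + 2] = [parents[best]]
        trees[best:best + 2] = [(trees[best], trees[best + 1])]
    return trees[0], vectors[0]

words = ["the", "cat", "sat", "down"]
embeddings = [rng.standard_normal(dim) for _ in words]   # toy word vectors
tree, root_vector = greedy_parse(embeddings, words)
print(tree)               # nested tuples describing the merge order
print(root_vector.shape)  # fixed width regardless of sentence length
```

The root vector has the same width as each word vector, which mirrors the fixed-width representation of a variable-size structure mentioned in the text.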

Recurrent neural networks

RNNs are a commonly employed and familiar algorithm in the discipline of DL [ 63 , 64 , 65 ]. RNN is mainly applied in speech processing and NLP [ 66 , 67 ]. Unlike conventional networks, RNN uses sequential data in the network. Since the embedded structure in the sequence of the data delivers valuable information, this feature is fundamental to a range of different applications. For instance, it is important to understand the context of the sentence in order to determine the meaning of a specific word in it. Thus, it is possible to consider the RNN as a unit of short-term memory, where x represents the input layer, y is the output layer, and s represents the state (hidden) layer. For a given input sequence, a typical unfolded RNN diagram is illustrated in Fig.  6 . Pascanu et al. [ 68 ] introduced three different types of deep RNN techniques, namely “Hidden-to-Hidden”, “Hidden-to-Output”, and “Input-to-Hidden”. Based on these three techniques, they introduced a deep RNN that lessens the learning difficulty of the deep network while bringing the benefits of a deeper RNN.

figure 6

Typical unfolded RNN diagram
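The unfolded view can be sketched in a few lines of NumPy, using the same symbols x, s, and y as the text. The weight names `U`, `W`, and `V`, the tanh nonlinearity, the layer sizes, and the random values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, state_dim, output_dim = 3, 5, 2
U = rng.standard_normal((state_dim, input_dim)) * 0.1   # input -> state
W = rng.standard_normal((state_dim, state_dim)) * 0.1   # state -> state
V = rng.standard_normal((output_dim, state_dim)) * 0.1  # state -> output

def rnn_forward(xs):
    """Run the RNN over a sequence, carrying the state s between steps."""
    s = np.zeros(state_dim)          # initial state
    ys = []
    for x in xs:
        s = np.tanh(U @ x + W @ s)   # new state mixes input and old state
        ys.append(V @ s)             # output at this time step
    return np.array(ys), s

xs = rng.standard_normal((4, input_dim))   # a sequence of 4 input vectors
ys, final_state = rnn_forward(xs)
print(ys.shape)            # one output vector per time step
print(final_state.shape)   # the state carried out of the last step
```

Because the state is passed from one step to the next, the output at the last step depends on every earlier input, which is the short-term memory behavior described above.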

However, RNN’s sensitivity to the exploding and vanishing gradient problems represents one of the main issues with this approach [ 69 ]. More specifically, during the training process, the repeated multiplication of several large or small derivatives may cause the gradients to explode or decay exponentially. As new inputs arrive, the network stops taking the initial ones into account; that is, its sensitivity to earlier inputs decays over time. This issue can be handled using LSTM [ 70 ]. This approach offers recurrent connections to memory blocks in the network. Every memory block contains a number of memory cells, which have the ability to store the temporal states of the network. In addition, it contains gated units for controlling the flow of information. In very deep networks [ 37 ], residual connections can also considerably reduce the impact of the vanishing gradient issue, which is explained in later sections. CNN is considered to be more powerful than RNN, and RNN offers less feature compatibility than CNN.
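A toy calculation illustrates why repeated multiplication of derivatives is a problem. The sketch below multiplies the same scalar derivative over many time steps; the values 0.9 and 1.1 and the step count are arbitrary assumptions, and a real RNN multiplies Jacobian matrices rather than scalars.

```python
def backprop_factor(step_derivative, n_steps):
    """Product of n_steps identical per-step derivatives."""
    factor = 1.0
    for _ in range(n_steps):
        factor *= step_derivative
    return factor

vanishing = backprop_factor(0.9, 100)   # shrinks toward zero
exploding = backprop_factor(1.1, 100)   # grows very large
print(f"{vanishing:.2e}  {exploding:.2e}")
```

Even derivatives that are only slightly below or above one lead to gradients that differ by many orders of magnitude after 100 steps.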

Convolutional neural networks

In the field of DL, the CNN is the most famous and commonly employed algorithm [ 30 , 71 , 72 , 73 , 74 , 75 ]. The main benefit of CNN compared to its predecessors is that it automatically identifies the relevant features without any human supervision [ 76 ]. CNNs have been extensively applied in a range of different fields, including computer vision [ 77 ], speech processing [ 78 ], and face recognition [ 79 ]. The structure of CNNs was inspired by neurons in human and animal brains, similar to a conventional neural network. More specifically, in a cat’s brain, a complex sequence of cells forms the visual cortex; this sequence is simulated by the CNN [ 80 ]. Goodfellow et al. [ 28 ] identified three key benefits of the CNN: equivariant representations, sparse interactions, and parameter sharing. Unlike conventional fully connected (FC) networks, the CNN employs shared weights and local connections to make full use of 2D input-data structures such as image signals. This operation uses an extremely small number of parameters, which both simplifies the training process and speeds up the network. The same holds for visual cortex cells: only small regions of a scene are sensed by these cells rather than the whole scene (i.e., these cells spatially extract the local correlation available in the input, like local filters over the input).

A commonly used type of CNN, which is similar to the multi-layer perceptron (MLP), consists of numerous convolution layers preceding sub-sampling (pooling) layers, while the ending layers are FC layers. An example of CNN architecture for image classification is illustrated in Fig.  7 .

figure 7

An example of CNN architecture for image classification

The input x of each layer in a CNN model is organized in three dimensions: height, width, and depth, or \(m \times m \times r\), where the height (m) is equal to the width. The depth is also referred to as the channel number. For example, in an RGB image, the depth (r) is equal to three. The kernels (filters) available in each convolutional layer are denoted by k and also have three dimensions (\(n \times n \times q\)), similar to the input image; here, however, n must be smaller than m, while q is either equal to or smaller than r. In addition, the kernels are the basis of the local connections, which share the same parameters (bias \(b^{k}\) and weight \(W^{k}\)); convolving them with the input, as mentioned above, generates k feature maps \(h^{k}\), each of size \(m-n+1\). The convolution layer calculates a dot product between its input and the weights as in Eq. 1, similar to an MLP, except that the inputs are small regions of the original image. Next, by applying the nonlinearity or an activation function f to the convolution-layer output, we obtain the following:

\[h^{k} = f(W^{k} * x + b^{k}) \qquad (1)\]

The next step is down-sampling every feature map in the sub-sampling layers. This leads to a reduction in the network parameters, which accelerates the training process and in turn helps to control overfitting. For all feature maps, the pooling function (e.g. max or average) is applied to an adjacent area of size \(p \times p\), where p is the kernel size. Finally, the FC layers receive the mid- and low-level features and create the high-level abstraction, which represents the last-stage layers as in a typical neural network. The classification scores are generated by the final layer [e.g. support vector machines (SVMs) or softmax]. For a given instance, every score represents the probability of a specific class.
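When softmax is used as the final layer, the raw scores can be turned into class probabilities as sketched below. The example scores are hypothetical, and subtracting the maximum before exponentiating is a common numerical-stability step rather than something the text prescribes.

```python
import numpy as np

def softmax(logits):
    """Convert raw class scores into probabilities that sum to one."""
    shifted = logits - np.max(logits)   # avoids overflow in exp
    exps = np.exp(shifted)
    return exps / exps.sum()

logits = np.array([2.0, 1.0, 0.1])      # hypothetical outputs of the last FC layer
probs = softmax(logits)
print(probs, probs.sum())
```

The class with the largest raw score also receives the largest probability, so the predicted class is unchanged by the normalization.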

Benefits of employing CNNs

The benefits of using CNNs over other traditional neural networks in the computer vision environment are listed as follows:

The main reason to consider CNN is the weight sharing feature, which reduces the number of trainable network parameters and in turn helps the network to enhance generalization and to avoid overfitting.

Concurrently learning the feature extraction layers and the classification layer causes the model output to be both highly organized and highly reliant on the extracted features.

Large-scale network implementation is much easier with CNN than with other neural networks.

The CNN architecture consists of a number of layers (or so-called multi-building blocks). Each layer in the CNN architecture, including its function, is described in detail below.

Convolutional Layer: In CNN architecture, the most significant component is the convolutional layer. It consists of a collection of convolutional filters (so-called kernels). The input image, expressed as N-dimensional matrices, is convolved with these filters to generate the output feature map.

Kernel definition: A grid of discrete numbers or values describes the kernel. Each value is called the kernel weight. Random numbers are assigned to act as the weights of the kernel at the beginning of the CNN training process. In addition, there are several different methods used to initialize the weights. Next, these weights are adjusted at each training epoch; thus, the kernel learns to extract significant features.

Convolutional Operation: We first describe the CNN input format. The vector format is the input of the traditional neural network, while the multi-channeled image is the input of the CNN. For instance, single-channel is the format of the gray-scale image, while the RGB image format is three-channeled. To understand the convolutional operation, let us take an example of a \(4 \times 4\) gray-scale image with a \(2 \times 2\) random weight-initialized kernel. First, the kernel slides over the whole image horizontally and vertically. At each position, the dot product between the kernel and the underlying image region is determined: their corresponding values are multiplied and then summed up to create a single scalar value. The whole process is then repeated until no further sliding is possible. Note that the calculated dot product values represent the feature map of the output. Figure  8 graphically illustrates the primary calculations executed at each step. In this figure, the light green color represents the \(2 \times 2\) kernel, while the light blue color represents the same-size area of the input image. Both are multiplied; the end result after summing up the resulting product values (marked in a light orange color) represents an entry value to the output feature map.

figure 8

The primary calculations executed at each step of convolutional layer

In the previous example, no padding is applied to the input image, while a stride of one (the step size of the kernel over all vertical or horizontal locations) is used. Note that it is also possible to use another stride value; increasing the stride yields a feature map of lower dimensions.
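The sliding-window computation and the effect of the stride can be sketched as follows. This is a minimal NumPy illustration for square inputs, not an efficient implementation; the function name `conv2d`, the optional zero-padding argument, and the example pixel and kernel values are assumptions.

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Slide the kernel over the image; each output is one dot product."""
    if padding > 0:
        image = np.pad(image, padding)           # zero padding on all sides
    m, n = image.shape[0], kernel.shape[0]
    out_size = (m - n) // stride + 1
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            r, c = i * stride, j * stride
            patch = image[r:r + n, c:c + n]      # region under the kernel
            out[i, j] = np.sum(patch * kernel)   # multiply, then sum
    return out

image = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 gray-scale image
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])                   # toy 2x2 kernel

print(conv2d(image, kernel).shape)                 # (3, 3): 4 - 2 + 1 = 3
print(conv2d(image, kernel, stride=2).shape)       # (2, 2): larger stride, smaller map
print(conv2d(image, kernel, padding=1).shape)      # (5, 5): padding enlarges the map
```

As in most DL libraries, the kernel is applied without flipping, so strictly speaking this computes a cross-correlation.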

On the other hand, padding is important for preserving the border information of the input image. Without it, the features at the borders are washed away very quickly. By applying padding, the size of the input image will increase, and in turn, the size of the output feature map will also increase.

Core Benefits of Convolutional Layers

Sparse Connectivity: Each neuron of a layer in FC neural networks links with all neurons in the following layer. By contrast, in CNNs, only a few weights are available between two adjacent layers. Thus, the number of required weights or connections is small, while the memory required to store these weights is also small; hence, this approach is memory-effective. In addition, matrix operation is computationally much more costly than the dot (.) operation in CNN.

Weight Sharing: In a CNN, no weights are allocated to a specific pair of neurons in neighboring layers; instead, the same set of weights is applied across all pixels of the input matrix. Learning a single group of weights for the whole input significantly decreases the required training time and various costs, as it is not necessary to learn additional weights for each neuron.
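A quick parameter count makes the effect of sparse connectivity and weight sharing concrete. The layer sizes below (a 32 x 32 x 3 input, 16 output maps, 3 x 3 kernels) and the helper names are arbitrary assumptions chosen for illustration.

```python
def fc_params(n_inputs, n_outputs):
    """Fully connected layer: one weight per input-output pair, plus biases."""
    return n_inputs * n_outputs + n_outputs

def conv_params(kernel_size, in_channels, out_channels):
    """Convolutional layer: one shared kernel per output map, plus biases."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels

height, width, channels, maps = 32, 32, 3, 16

# FC layer producing as many output values as 16 feature maps of 32 x 32.
fc = fc_params(height * width * channels, height * width * maps)
conv = conv_params(3, channels, maps)

print(fc)     # 50348032
print(conv)   # 448
```

Under these assumptions the convolutional layer needs roughly five orders of magnitude fewer parameters than the FC layer of comparable output size.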

Pooling Layer: The main task of the pooling layer is the sub-sampling of the feature maps. These maps are generated by following the convolutional operations. In other words, this approach shrinks large-size feature maps to create smaller feature maps. Concurrently, it maintains the majority of the dominant information (or features) in every step of the pooling stage. In a similar manner to the convolutional operation, both the stride and the kernel are initially size-assigned before the pooling operation is executed. Several types of pooling methods are available for utilization in various pooling layers. These methods include tree pooling, gated pooling, average pooling, min pooling, max pooling, global average pooling (GAP), and global max pooling. The most familiar and frequently utilized pooling methods are the max, min, and GAP pooling. Figure  9 illustrates these three pooling operations.

figure 9

Three types of pooling operations

Pooling can sometimes decrease the overall CNN performance, which is the main shortfall of the pooling layer: it helps the CNN to determine whether or not a certain feature is present in the input image, but it does not retain the precise location of that feature. Thus, the CNN model may miss relevant information.
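The max, min, and global average pooling operations can be sketched in NumPy as follows. The helper name `pool2d`, the 2 x 2 window with stride 2, and the example feature map are assumptions for illustration.

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, reduce=np.max):
    """Apply `reduce` (e.g. np.max or np.min) to each size x size window."""
    out_size = (feature_map.shape[0] - size) // stride + 1
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            r, c = i * stride, j * stride
            out[i, j] = reduce(feature_map[r:r + size, c:c + size])
    return out

feature_map = np.array([[1.0, 3.0, 2.0, 4.0],
                        [5.0, 6.0, 7.0, 8.0],
                        [3.0, 2.0, 1.0, 0.0],
                        [1.0, 2.0, 3.0, 4.0]])

print(pool2d(feature_map, reduce=np.max).tolist())   # [[6.0, 8.0], [3.0, 4.0]]
print(pool2d(feature_map, reduce=np.min).tolist())   # [[1.0, 2.0], [1.0, 0.0]]
print(float(feature_map.mean()))                     # global average pooling: 3.25
```

Each output keeps only one summary value per window, so the exact position of a feature inside the window is no longer available to later layers.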

Activation Function (non-linearity): Mapping the input to the output is the core function of all types of activation function in all types of neural network. The input value is determined by computing the weighted summation of the neuron input along with its bias (if present). This means that the activation function makes the decision as to whether or not to fire a neuron with reference to a particular input by creating the corresponding output.

Non-linear activation layers are employed after all layers with weights (so-called learnable layers, such as FC layers and convolutional layers) in CNN architecture. The non-linear behavior of the activation layers means that the mapping of input to output will be non-linear; moreover, these layers give the CNN the ability to learn more complex mappings. The activation function must also be differentiable, which is an extremely significant feature, as it allows error back-propagation to be used to train the network. The following types of activation functions are most commonly used in CNN and other deep neural networks.

Sigmoid: The input of this activation function is real numbers, while the output is restricted to between zero and one. The sigmoid function curve is S-shaped and can be represented mathematically by Eq. 2 .

Tanh: It is similar to the sigmoid function, as its input is real numbers, but the output is restricted to between − 1 and 1. Its mathematical representation is in Eq. 3 .

ReLU: The most commonly used function in the CNN context. It sets all negative input values to zero and passes positive values through unchanged. Lower computational load is the main benefit of ReLU over the others. Its mathematical representation is in Eq. 4 .

Occasionally, a few significant issues may occur during the use of ReLU. For instance, consider error back-propagation in which a large gradient flows through a ReLU neuron. This gradient can update the weights in such a way that the neuron is never activated again. This issue is referred to as “Dying ReLU”. Some ReLU alternatives exist to solve such issues. The following discusses some of them.

Leaky ReLU: Instead of setting negative inputs to zero as ReLU does, this activation function down-scales them, which ensures these inputs are never ignored. It is employed to solve the Dying ReLU problem. Leaky ReLU can be represented mathematically as in Eq. 5 .

Note that the leak factor is denoted by m. It is commonly set to a very small value, such as 0.001.

Noisy ReLU: This function employs a Gaussian distribution to make ReLU noisy. It can be represented mathematically as in Eq. 6 .

Parametric Linear Units: This is mostly the same as Leaky ReLU. The main difference is that the leak factor in this function is updated through the model training process. The parametric linear unit can be represented mathematically as in Eq. 7 .

Note that the learnable weight is denoted as a.
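The activation functions listed above can be written down directly. The sketch below assumes the standard textbook definitions; the noise scale `sigma` for Noisy ReLU and the function names are illustrative assumptions, while the leak factor default of 0.001 follows the value quoted in the text.

```python
import math
import random


def sigmoid(x):
    """Squash a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Squash a real number into (-1, 1)."""
    return math.tanh(x)


def relu(x):
    """Zero out negative inputs, pass positive inputs through."""
    return max(0.0, x)


def leaky_relu(x, m=0.001):
    """Scale negative inputs by the fixed leak factor m instead of zeroing them."""
    return x if x > 0 else m * x


def noisy_relu(x, sigma=0.1):
    """ReLU applied after adding zero-mean Gaussian noise (sigma is an assumed scale)."""
    return max(0.0, x + random.gauss(0.0, sigma))


def prelu(x, a):
    """Same form as Leaky ReLU, but the slope a is a weight learned during training."""
    return x if x > 0 else a * x


for value in (-2.0, 0.0, 2.0):
    print(value, sigmoid(value), tanh(value), relu(value), leaky_relu(value), prelu(value, 0.25))
```

The only difference between `leaky_relu` and `prelu` in this sketch is where the slope comes from: a fixed constant in the first case, a trained parameter in the second.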

Fully Connected Layer: Commonly, this layer is located at the end of each CNN architecture. Inside this layer, each neuron is connected to all neurons of the previous layer, the so-called Fully Connected (FC) approach. It is utilized as the CNN classifier. It follows the basic method of the conventional multiple-layer perceptron neural network, as it is a type of feed-forward ANN. The input of the FC layer comes from the last pooling or convolutional layer. This input is in the form of a vector, which is created from the feature maps after flattening. The output of the FC layer represents the final CNN output, as illustrated in Fig.  10 .

figure 10

Fully connected layer

Loss Functions: The previous section has presented various layer-types of CNN architecture. The final classification is achieved from the output layer, which represents the last layer of the CNN architecture. Loss functions are utilized in the output layer to calculate the prediction error across the training samples in the CNN model. This error reveals the difference between the actual output and the predicted one. It is then minimized through the CNN learning process.

The loss function uses two parameters to calculate the error. The CNN estimated output (referred to as the prediction) is the first parameter. The actual output (referred to as the label) is the second parameter. Several types of loss function are employed in various problem types. The following concisely explains some of the loss function types.

Cross-Entropy or Softmax Loss Function: This function is commonly employed for measuring the CNN model performance. It is also referred to as the log loss function. Its output is the probability \(p \in [0, 1]\) . In addition, it is usually employed as a substitute for the square error loss function in multi-class classification problems. In the output layer, it employs the softmax activations to generate the output as a probability distribution. The mathematical representation of the output class probability is Eq. 8 .

Here, \(e^{a_{i}}\) represents the non-normalized output from the preceding layer, while N represents the number of neurons in the output layer. Finally, the mathematical representation of cross-entropy loss function is Eq. 9 .

Euclidean Loss Function: This function is widely used in regression problems. It is also known as the mean square error. The mathematical expression of the estimated Euclidean loss is Eq. 10 .

Hinge Loss Function: This function is commonly employed in problems related to binary classification. It relates to maximum-margin-based classification and is particularly important for SVMs, which use the hinge loss function, wherein the optimizer attempts to maximize the margin between the two target classes. Its mathematical formula is Eq. 11 .

The margin m is commonly set to 1. Moreover, the predicted output is denoted as \(p_{i}\) , while the desired output is denoted as \(y_{i}\) .
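As a rough illustration of the three loss functions above, the following sketch uses their common textbook forms. Several details are assumptions rather than statements from the text: the Euclidean loss is taken as the plain mean of squared differences (some formulations add a factor of 1/2), and the hinge loss assumes labels encoded as −1 or +1.

```python
import math


def softmax(logits):
    """Turn non-normalized outputs into a probability distribution."""
    peak = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(a - peak) for a in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, one_hot_label):
    """Negative log-probability assigned to the true class."""
    return -sum(y * math.log(p) for p, y in zip(probs, one_hot_label) if y > 0)


def euclidean_loss(preds, labels):
    """Mean of squared differences (assumed scaling; see the note above)."""
    return sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(preds)


def hinge_loss(preds, labels, m=1.0):
    """Sum of margin violations, assuming labels are -1 or +1."""
    return sum(max(0.0, m - y * p) for p, y in zip(preds, labels))


probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
print(cross_entropy(probs, [1, 0, 0]))
print(euclidean_loss([1.0, 2.0], [1.0, 4.0]))  # 2.0
print(hinge_loss([0.5, 2.0], [1, 1]))          # 0.5
```

In the hinge example, the second prediction already lies beyond the margin and therefore contributes nothing to the loss.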

Regularization to CNN

For CNN models, over-fitting represents the central issue associated with obtaining well-behaved generalization. The model is said to be over-fitted when it performs especially well on training data but fails on test data (unseen data), as explained further in a later section. An under-fitted model is the opposite; this case occurs when the model does not learn a sufficient amount from the training data. The model is referred to as “just-fitted” if it performs well on both training and testing data. These three types are illustrated in Fig.  11 . Various intuitive concepts are used in regularization to help avoid over-fitting; more details about over-fitting and under-fitting are discussed in later sections.

Dropout: This is a widely utilized technique for improving generalization. During each training epoch, neurons are randomly dropped. In doing this, the feature selection power is distributed equally across the whole group of neurons, and the model is forced to learn different independent features. During the training process, a dropped neuron takes no part in back-propagation or forward-propagation. By contrast, the full-scale network is utilized to perform prediction during the testing process.
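A minimal sketch of dropout applied to one layer of activations is given below. It assumes the common "inverted dropout" variant, in which the surviving activations are rescaled during training so that the full network can be used unchanged at test time; this rescaling, the `drop_prob` parameter, and the function name are assumptions for illustration and are not described in the text.

```python
import random


def dropout(activations, drop_prob, training, rng=random):
    """Randomly zero activations during training; pass them through at test time."""
    if not training or drop_prob == 0.0:
        return list(activations)  # the full network is used for prediction
    keep_prob = 1.0 - drop_prob
    # Assumed "inverted" scaling: divide survivors by keep_prob so the
    # expected activation is the same in training and testing.
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the run is repeatable
layer_output = [0.2, 1.5, 0.7, 0.9]
print(dropout(layer_output, 0.5, training=True, rng=rng))
print(dropout(layer_output, 0.5, training=False))
```

A fresh random mask is drawn on every call, so a different subset of neurons is dropped at each training step.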

Drop-Weights: This method is highly similar to dropout. In each training epoch, the connections between neurons (weights) are dropped rather than dropping the neurons; this represents the only difference between drop-weights and dropout.

Data Augmentation: Training the model on a sizeable amount of data is the easiest way to avoid over-fitting. To achieve this, data augmentation is used. Several techniques are utilized to artificially expand the size of the training dataset. More details can be found in a later section, which describes the data augmentation techniques.

Batch Normalization: This method normalizes the output activations [ 81 ] so that they follow a unit Gaussian distribution. Subtracting the mean and dividing by the standard deviation normalizes the output at each layer. While it is possible to consider this as a pre-processing task at each layer in the network, the operation is also differentiable and can be integrated into other networks. In addition, it is employed to reduce the “internal covariate shift” of the activation layers. In each layer, the variation in the activation distribution defines the internal covariate shift. This shift becomes very high due to the continuous weight updating through training, which may occur if the samples of the training data are gathered from numerous dissimilar sources (for example, day and night images). Thus, the model takes extra time to converge, and in turn, the time required for training also increases. To resolve this issue, a layer representing the operation of batch normalization is applied in the CNN architecture.

The advantages of utilizing batch normalization are as follows:

It prevents the problem of vanishing gradient from arising.

It can effectively mitigate the effects of poor weight initialization.

It significantly reduces the time required for network convergence (for large-scale datasets, this will be extremely useful).

It attempts to decrease the dependency of training on hyper-parameters.

Chances of over-fitting are reduced, since it has a minor influence on regularization.
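The normalization step of batch normalization (subtract the mean, divide by the standard deviation) can be sketched for a single feature across one mini-batch as follows. The scale `gamma`, shift `beta`, and stabilizing constant `eps` come from the usual formulation of the technique and are assumptions here, as the text does not mention them.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift it."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    # eps guards against division by zero when the batch has no variance.
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


activations = [1.0, 2.0, 3.0, 4.0]  # assumed values of one feature in a batch of four
normalized = batch_norm(activations)
print(normalized)
print(sum(normalized) / len(normalized))  # close to 0
```

With the default `gamma` and `beta`, the output has approximately zero mean and unit variance regardless of the scale of the incoming activations.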

figure 11

Over-fitting and under-fitting issues

Optimizer selection

This section discusses the CNN learning process. Two major issues are included in the learning process: the first issue is the learning algorithm selection (optimizer), while the second issue is the use of many enhancements (such as AdaDelta, Adagrad, and momentum) along with the learning algorithm to enhance the output.

Minimizing a loss function (the error, i.e., the variation between actual and predicted output) with respect to numerous learnable parameters (e.g. biases, weights, etc.) is the core purpose of all supervised learning algorithms. Gradient-based learning techniques are the usual choice for a CNN. The network parameters are updated throughout all training epochs, while the network searches for a locally optimal solution in order to minimize the error.

The learning rate is defined as the step size of the parameter updating. A training epoch represents one complete pass of parameter updating over the complete training dataset. Note that the learning rate is a hyper-parameter and must be selected wisely so that it does not adversely affect the learning process.

Gradient Descent or Gradient-based learning algorithm: To minimize the training error, this algorithm repetitively updates the network parameters through every training epoch. More specifically, to update the parameters correctly, it computes the gradient (slope) of the objective function by applying a first-order derivative with respect to the network parameters. Next, each parameter is updated in the direction opposite to the gradient to reduce the error. The parameter updating process is performed through network back-propagation, in which the gradient at every neuron is back-propagated to all neurons in the preceding layer. The mathematical representation of this operation is given in Eq. 12 .

The final weight in the current training epoch is denoted by \(w_{ij}^{t}\) , while the weight in the preceding \((t-1)\) training epoch is denoted \(w_{ij}^{t-1}\) . The learning rate is \(\eta \) and the prediction error is E . Different alternatives of the gradient-based learning algorithm are available and commonly employed; these include the following:

Batch Gradient Descent: During the execution of this technique [ 82 ], the network parameters are updated only once after the entire training dataset has been passed through the network. In more depth, it calculates the gradient over the whole training set and subsequently uses this gradient to update the parameters. For a small-sized dataset, the CNN model converges faster and produces a more stable gradient using BGD. However, since the whole dataset must be processed for the single parameter update made in each training epoch, it requires a substantial amount of resources. For a large training dataset, additional time is required for converging, and it could converge to a local optimum (for non-convex instances).

Stochastic Gradient Descent: The parameters are updated at each training sample in this technique [ 83 ]. It is preferable to randomly shuffle the training samples in every epoch before training. For a large-sized training dataset, this technique is both more memory-efficient and much faster than BGD. However, because the parameters are updated so frequently, it takes extremely noisy steps in the direction of the solution, which in turn causes the convergence behavior to become highly unstable.

Mini-batch Gradient Descent: In this approach, the training samples are partitioned into several mini-batches, where every mini-batch is a small collection of samples with no overlap between them [ 84 ]. Next, parameter updating is performed following gradient computation on every mini-batch. The advantage of this method comes from combining the advantages of both BGD and SGD techniques. Thus, it has steady convergence, better computational efficiency, and better memory efficiency. The following describes several enhancement techniques in gradient-based learning algorithms (usually in SGD), which further enhance the CNN training process.
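The three variants differ only in how many samples contribute to each parameter update, which the sketch below makes explicit with a single `batch_size` argument. It is an illustrative sketch under stated assumptions: a one-variable linear model fitted with mean squared error stands in for a CNN, and the data, learning rate, and epoch count are arbitrary choices.

```python
import random


def gradient_descent(xs, ys, batch_size, lr=0.05, epochs=500, seed=0):
    """Fit y = w * x + b by minimizing mean squared error.

    batch_size == len(xs) gives batch gradient descent (one update per epoch),
    batch_size == 1 gives stochastic gradient descent (one update per sample),
    and anything in between is mini-batch gradient descent.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # reshuffle the samples before every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over this batch.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # Step against the gradient, scaled by the learning rate.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1

print("BGD       ", gradient_descent(xs, ys, batch_size=4))
print("mini-batch", gradient_descent(xs, ys, batch_size=2))
print("SGD       ", gradient_descent(xs, ys, batch_size=1))
```

On this noise-free toy problem all three settings approach w = 2 and b = 1; the instability of SGD described above shows up mainly on noisy or non-convex problems.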

Momentum: For neural networks, this technique is employed in the objective function. It enhances both the accuracy and the training speed by adding the gradient computed at the preceding training step, weighted via a factor \(\lambda \) (known as the momentum factor). Without it, a gradient-based learning algorithm can easily become stuck in a local minimum rather than a global minimum, which represents the main disadvantage of such algorithms. Issues of this kind frequently occur if the problem has a non-convex surface (or solution space).

Together with the learning algorithm, momentum is used to solve this issue, which can be expressed mathematically as in Eq. 13 .

The weight increment in the current \(t\text{th}\) training epoch is denoted as \(\Delta w_{ij}^{t}\) , \(\eta \) is the learning rate, and \(\Delta w_{ij}^{t-1}\) is the weight increment in the preceding \((t-1)\text{th}\) training epoch. The momentum factor value is maintained within the range 0 to 1; in turn, the step size of the weight updating increases in the direction of the minimum to minimize the error. If the value of the momentum factor is very low, the model loses its ability to escape local minima. By contrast, if the momentum factor value is high, the model converges much more rapidly. If a high value of the momentum factor is used together with a high learning rate (LR), then the model could miss the global minimum by overshooting it.

However, when the gradient changes its direction continually throughout the training process, a suitable value of the momentum factor (which is a hyper-parameter) smooths the variations in the weight updates.

Adaptive Moment Estimation (Adam): It is another widely used optimization technique or learning algorithm. Adam [ 85 ] represents the latest trends in deep learning optimization. It relies only on first-order gradients, from which it estimates the first and second moments of the gradient, rather than on the Hessian matrix of second-order derivatives. Adam is a learning strategy that has been designed specifically for training deep neural networks. Higher memory efficiency and lower computational cost are two advantages of Adam. The mechanism of Adam is to calculate an adaptive LR for each parameter in the model. It integrates the pros of both Momentum and RMSprop: like RMSprop, it utilizes the squared gradients to scale the learning rate, and like momentum, it uses the moving average of the gradient. The equation of Adam is represented in Eq. 14 .
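The momentum update can be sketched as follows, using the symbols from the text (learning rate \(\eta \) as `lr`, momentum factor \(\lambda \) as `momentum`). The sign convention (the increment is subtracted from the weight) and the quadratic toy error are assumptions made for this illustration.

```python
def momentum_step(w, grad, prev_delta, lr, momentum):
    """One update: the new increment blends the current gradient with the previous increment."""
    delta = lr * grad + momentum * prev_delta
    return w - delta, delta


def minimize(grad_fn, w0, lr, momentum, steps):
    """Run repeated momentum updates starting from w0."""
    w, delta = w0, 0.0
    for _ in range(steps):
        w, delta = momentum_step(w, grad_fn(w), delta, lr, momentum)
    return w


def toy_gradient(w):
    """Gradient of the assumed toy error E(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


print(minimize(toy_gradient, w0=0.0, lr=0.1, momentum=0.0, steps=200))  # plain gradient descent
print(minimize(toy_gradient, w0=0.0, lr=0.1, momentum=0.9, steps=200))  # with momentum
```

Setting `momentum` to zero recovers plain gradient descent, which makes it easy to compare the two behaviors on the same problem.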
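A single Adam update for one scalar parameter can be sketched as below. The default values of `lr`, `beta1`, `beta2`, and `eps` are the commonly quoted defaults for the method and are assumptions here, since the text does not state them; the toy error function is likewise hypothetical.

```python
import math


def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad         # moving average of the gradient (momentum part)
    v = beta2 * v + (1 - beta2) * grad * grad  # moving average of the squared gradient (RMSprop part)
    m_hat = m / (1 - beta1 ** t)               # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)  # per-parameter adaptive step
    return w, m, v


# Minimize the assumed toy error E(w) = (w - 3)^2.
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 5001):
    grad = 2.0 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.01)
print(w)
```

Because the step is divided by the root of the squared-gradient average, the first update has a size close to `lr` regardless of how large the gradient is.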

Design of algorithms (backpropagation)

Let’s start with a notation that refers to weights in the network unambiguously. We denote by \(w_{ij}^{h}\) the weight for the connection from the \(i\text{th}\) input (or neuron in the \((h-1)\text{th}\) layer) to the \(j\text{th}\) neuron in the \(h\text{th}\) layer. Fig. 12 shows the weight on a connection from a neuron in the first layer to a neuron in the next layer of the network.

figure 12

MLP structure

Here, \(w_{11}^{2}\) represents the weight from the first neuron in the first layer to the first neuron in the second layer. Accordingly, the second weight for the same neuron is \(w_{21}^{2}\) , which is the weight from the second neuron in the previous layer to the first neuron in the next layer (the second layer in this network). The bias is not a connection between neurons of different layers, so it is easily handled: each neuron has its own bias, although in some networks each layer shares a single bias. In the network above, each layer has its own bias. Each network is characterized by parameters such as the number of layers, the number of neurons in each layer, and the number of weights (connections) between the layers. The number of connections can be easily determined from the number of neurons in each layer; for example, if ten inputs are fully connected to two neurons in the next layer, then the number of connections (weights) between them is \(10 \times 2 = 20\) . To see how the error is defined and how the weights are updated, consider a neural network with two layers,

where \(d\) is the label of the \(i\text{th}\) individual input and \(y\) is the output for the same individual input. Backpropagation is about understanding how to change the weights and biases in a network based on the changes of the cost function (error). Ultimately, this means computing the partial derivatives \(\partial E / \partial w_{ij}^{h}\) and \(\partial E / \partial b_{j}^{h}\) . To compute those, a local variable \(\delta _{j}^{1}\) is introduced, which is called the local error in the \(j\text{th}\) neuron in the \(h\text{th}\) layer. Based on that local error, backpropagation gives the procedure to compute \(\partial E / \partial w_{ij}^{h}\) and \(\partial E / \partial b_{j}^{h}\) for the two-layer neural network shown in Fig. 13 .

figure 13

Neuron activation functions

Output error \(\delta _{j}^{1}\) for each \(1 = 1:L\) , where \(L\) is the number of neurons in the output layer

where \(e(k)\) is the error of epoch \(k\) as shown in Eq. ( 2 ) and \(\vartheta ^{\prime }(v_{j}(k))\) is the derivative of the activation function for \(v_{j}\) at the output.

Backpropagate the error through all the remaining layers except the output layer

where \(\delta _{j}^{1}(k)\) is the output error and \(w_{jl}^{h+1}(k)\) represents the weight after the layer where the error needs to be obtained.

After finding the error at each neuron in each layer, we can update the weights in each layer based on Eqs. ( 16 ) and ( 17 ).
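The procedure described above (compute the local error at the output, propagate it back through the outgoing weights, then update each weight) can be sketched for a small two-layer network. Several choices here are assumptions rather than the formulation of the text: sigmoid activations, the error \(E = \tfrac{1}{2}\sum (d - y)^2\), the sign convention of the update, and all numeric values.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def forward(x, W1, b1, W2, b2):
    """Forward pass; W[i][j] is the weight from input i to neuron j of that layer."""
    hidden = [sigmoid(sum(x[i] * W1[i][j] for i in range(len(x))) + b1[j])
              for j in range(len(b1))]
    output = [sigmoid(sum(hidden[j] * W2[j][l] for j in range(len(hidden))) + b2[l])
              for l in range(len(b2))]
    return hidden, output


def backprop_step(x, d, W1, b1, W2, b2, lr):
    """One in-place update for a single sample; returns the error before the update."""
    hidden, y = forward(x, W1, b1, W2, b2)
    # Local error at the output: error times the activation derivative,
    # which for the sigmoid is y * (1 - y).
    delta_out = [(d[l] - y[l]) * y[l] * (1 - y[l]) for l in range(len(y))]
    # Local error at the hidden layer: back-propagate through the outgoing weights.
    delta_hid = [hidden[j] * (1 - hidden[j]) *
                 sum(delta_out[l] * W2[j][l] for l in range(len(y)))
                 for j in range(len(hidden))]
    # Each weight moves by lr * (local error of its target neuron) * (its input).
    for j in range(len(hidden)):
        for l in range(len(y)):
            W2[j][l] += lr * delta_out[l] * hidden[j]
    for l in range(len(y)):
        b2[l] += lr * delta_out[l]
    for i in range(len(x)):
        for j in range(len(hidden)):
            W1[i][j] += lr * delta_hid[j] * x[i]
    for j in range(len(hidden)):
        b1[j] += lr * delta_hid[j]
    return 0.5 * sum((d[l] - y[l]) ** 2 for l in range(len(y)))


# Hypothetical 2-2-1 network and a single training sample.
W1 = [[0.1, -0.2], [0.3, 0.4]]
b1 = [0.0, 0.0]
W2 = [[0.5], [-0.5]]
b2 = [0.0]
errors = [backprop_step([1.0, 0.5], [1.0], W1, b1, W2, b2, lr=0.5) for _ in range(50)]
print(errors[0], errors[-1])
```

The hidden-layer local errors are computed before any weight is changed, so the backward pass uses the same weights as the forward pass.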

Improving performance of CNN

Based on our experiments in different DL applications [ 86 , 87 , 88 ], we can conclude that the most effective solutions for improving the performance of CNN are:

Expand the dataset with data augmentation or use transfer learning (explained in later sections).

Increase the training time.

Increase the depth (or width) of the model.

Add regularization.

Increase hyperparameter tuning.

CNN architectures

Over the last 10 years, several CNN architectures have been presented [ 21 , 26 ]. Model architecture is a critical factor in improving the performance of different applications. Various modifications have been made to CNN architecture from 1989 until today. Such modifications include structural reformulation, regularization, parameter optimizations, etc. It should be noted, however, that the key upgrade in CNN performance occurred largely due to the processing-unit reorganization, as well as the development of novel blocks. In particular, the most novel developments in CNN architectures concerned the use of network depth. In this section, we review the most popular CNN architectures, beginning from the AlexNet model in 2012 and ending at the High-Resolution (HR) model in 2020. Studying the features of these architectures (such as input size, depth, and robustness) is the key to helping researchers choose the suitable architecture for their target task. Table  2 presents a brief overview of CNN architectures.

The history of deep CNNs began with the appearance of LeNet [ 89 ] (Fig.  14 ). At that time, CNNs were restricted to handwritten digit recognition tasks and could not be scaled to all image classes. Among deep CNN architectures, AlexNet is highly respected [ 30 ], as it achieved groundbreaking results in the fields of image recognition and classification. Krizhevsky et al. [ 30 ] first proposed AlexNet and improved the CNN learning ability by increasing its depth and implementing several parameter optimization strategies. Figure  15 illustrates the basic design of the AlexNet architecture.

figure 14

The architecture of LeNet

figure 15

The architecture of AlexNet

The learning ability of the deep CNN was limited at this time due to hardware restrictions. To overcome these hardware limitations, two GPUs (NVIDIA GTX 580) were used in parallel to train AlexNet. Moreover, in order to enhance the applicability of the CNN to different image categories, the number of feature extraction stages was increased from five in LeNet to seven in AlexNet. Although depth enhances generalization across image resolutions, overfitting was the main drawback related to the depth. Krizhevsky et al. used Hinton’s idea to address this problem [ 90 , 91 ]: to ensure that the features learned by the algorithm were more robust, their algorithm randomly skips several transformational units throughout the training stage. Moreover, ReLU [ 92 ] was utilized as a non-saturating activation function to enhance the rate of convergence by reducing the vanishing gradient problem [ 93 ]. Local response normalization and overlapping subsampling were also performed to enhance generalization by decreasing overfitting. To improve on the performance of previous networks, other modifications were made by using large-size filters \((5\times 5 \; \text{and}\; 11 \times 11)\) in the earlier layers. AlexNet has considerable significance for recent CNN generations, and it began an innovative research era in CNN applications.

Network-in-network

This network model, which has some slight differences from the preceding models, introduced two innovative concepts [ 94 ]. The first was employing multilayer perceptron convolution layers. These convolutions are executed using a 1×1 filter, which supports the addition of extra nonlinearity in the networks. Moreover, this supports enlarging the network depth, which may later be regularized using dropout. For DL models, this idea is frequently employed in the bottleneck layer. The second novel concept is the use of GAP as a substitute for a FC layer, which enables a significant reduction in the number of model parameters. In addition, GAP considerably changes the network architecture. Applying GAP to a large feature map makes it possible to generate a final low-dimensional feature vector without first reducing the dimension of the feature maps [ 95 , 96 ]. Figure  16 shows the structure of the network.

figure 16

The architecture of network-in-network

Before 2013, the CNN learning mechanism was basically constructed on a trial-and-error basis, which precluded an understanding of the precise reason behind any improvement. This issue restricted deep CNN performance on complex images. In response, Zeiler and Fergus introduced DeconvNet (a multilayer de-convolutional neural network) in 2013 [ 97 ]. This method later became known as ZefNet, which was developed in order to quantitatively visualize the network. The purpose of the network activity visualization was to monitor CNN performance by understanding the neuron activations. Erhan et al. also utilized this concept to optimize deep belief network (DBN) performance by visualizing the features of the hidden layers [ 98 ]. Similarly, Le et al. assessed deep unsupervised auto-encoder (AE) performance by visualizing the image classes created by the output neurons [ 99 ]. DeconvNet operates like a forward-pass CNN, but with the operation order of the convolutional and pooling layers reversed. Reverse mapping of this kind projects the convolutional layer output backward to create visually observable image shapes, which give a neural interpretation of the internal feature representation learned at each layer [ 100 ]. Monitoring the learning scheme through the training stage was the key concept underlying ZefNet. In addition, it utilized the outcomes to recognize capacity issues in the model. This concept was experimentally validated on AlexNet by applying DeconvNet. This indicated that only certain neurons were active, while the others were inactive in the first two layers of the network. Furthermore, it indicated that the features extracted via the second layer contained aliasing artifacts. Thus, Zeiler and Fergus changed the CNN topology in light of these outcomes.
In addition, they executed parameter optimization and maximized the CNN learning by decreasing the stride and the filter sizes in order to retain all features of the initial two convolutional layers. An improvement in performance was accordingly achieved due to this rearrangement in CNN topology. This rearrangement suggested that the visualization of the features could be employed to identify design weaknesses and conduct appropriate parameter alteration. Figure  17 shows the structure of the network.

figure 17

The architecture of ZefNet

Visual geometry group (VGG)

After CNN was determined to be effective in the field of image recognition, an easy and efficient design principle for CNN was proposed by Simonyan and Zisserman. This innovative design was called Visual Geometry Group (VGG). A multilayer model [ 101 ], it featured up to nineteen layers, deeper than ZefNet [ 97 ] and AlexNet [ 30 ], to investigate the relation between depth and the representational capacity of the network. In the 2013-ILSVRC competition, ZefNet had been the frontier network, and it suggested that filters with small sizes could enhance the CNN performance. With reference to these results, VGG used a stack of \(3\times 3\) filters rather than the \(5\times 5\) and \(11\times 11\) filters in ZefNet. This showed experimentally that stacking these small-size filters could produce the same effect as the large-size filters. In other words, these small-size filters made the receptive field as effective as that of the large-size filters \((7 \times 7 \; \text{and}\; 5 \times 5)\) . Using small-size filters also decreased the number of parameters, which brought the extra advantage of reducing computational complexity. These outcomes established a novel research trend for working with small-size filters in CNN. In addition, by inserting \(1\times 1\) convolutions in the middle of the convolutional layers, VGG regulates the network complexity. These convolutions learn a linear combination of the resulting feature maps. With respect to network tuning, a max pooling layer [ 102 ] is inserted following the convolutional layer, while padding is implemented to maintain the spatial resolution. In general, VGG obtained significant results for localization problems and image classification. While it did not achieve first place in the 2014-ILSVRC competition, it acquired a reputation due to its enlarged depth, homogeneous topology, and simplicity.
However, VGG’s computational cost was excessive due to its utilization of around 140 million parameters, which represented its main shortcoming. Figure  18 shows the structure of the network.
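The claim that stacked small filters match the receptive field of a large filter with fewer parameters can be checked with simple arithmetic. The sketch below assumes stride-1 convolutions that keep the number of channels constant and ignores biases; the channel count of 64 is a hypothetical value.

```python
def stacked_receptive_field(kernel, layers):
    """Receptive field of `layers` stacked kernel x kernel convolutions with stride 1."""
    return 1 + layers * (kernel - 1)


def conv_weights(kernel, channels, layers=1):
    """Weights (no biases) of stacked convolutions that keep `channels` channels."""
    return layers * kernel * kernel * channels * channels


channels = 64  # hypothetical channel count

# Two stacked 3x3 layers versus one 5x5 layer.
print(stacked_receptive_field(3, 2), conv_weights(3, channels, 2), conv_weights(5, channels))
# 5 73728 102400

# Three stacked 3x3 layers versus one 7x7 layer.
print(stacked_receptive_field(3, 3), conv_weights(3, channels, 3), conv_weights(7, channels))
# 7 110592 200704
```

Under these assumptions, the stacked 3 × 3 layers cover the same 5 × 5 or 7 × 7 input region with roughly 28% and 45% fewer weights, respectively, while also inserting extra non-linearities between the layers.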

figure 18

The architecture of VGG

In the 2014-ILSVRC competition, GoogLeNet (also called Inception-V1) emerged as the winner [ 103 ]. Achieving high-level accuracy with decreased computational cost is the core aim of the GoogLeNet architecture. It proposed a novel inception block (module) concept in the CNN context, which combines multiple-scale convolutional transformations by employing split, transform, and merge functions for feature extraction. Figure  19 illustrates the inception block architecture. This architecture incorporates filters of different sizes ( \(5\times 5, 3\times 3, \; \text{and} \; 1\times 1\) ) to capture channel information together with spatial information at diverse ranges of spatial resolution. The common convolutional layer of GoogLeNet is substituted by small blocks using the same concept as the network-in-network (NIN) architecture [ 94 ], which replaced each layer with a micro-neural network. The GoogLeNet concepts of split, transform, and merge were used to address the problem of learning the diverse variations that exist within a single class of images. The motivation of GoogLeNet was to improve the efficiency of CNN parameters, as well as to enhance the learning capacity. In addition, it regulates the computation by inserting a \(1\times 1\) convolutional filter, as a bottleneck layer, ahead of using large-size kernels. GoogLeNet employed sparse connections to overcome the redundant information problem. It decreased cost by neglecting the irrelevant channels. It should be noted here that only some of the input channels are connected to some of the output channels. By employing a GAP layer as the end layer, rather than utilizing a FC layer, the density of connections was decreased. The number of parameters was also significantly decreased from 40 to 5 million parameters due to these parameter tunings. The additional regularization factors used included the employment of RmsProp as optimizer and batch normalization [ 104 ].
Furthermore, GoogLeNet proposed the idea of auxiliary learners to speed up the rate of convergence. However, the main shortcoming of GoogLeNet was its heterogeneous topology, which requires customization from one module to another. Other shortcomings of GoogLeNet include the representation bottleneck, which substantially decreased the feature space in the following layer, and in turn occasionally leads to valuable information loss.

figure 19

The basic structure of Google Block

Highway network

Increasing the network depth enhances its performance, mainly for complicated tasks. At the same time, the network training becomes more difficult. The presence of many layers in deeper networks may result in small gradient values when the error is back-propagated to the lower layers. In 2015, Srivastava et al. [ 105 ] suggested a novel CNN architecture, called Highway Network, to overcome this issue. This approach is based on the cross-connectivity concept. The unhindered information flow in Highway Network is enabled by introducing two gating units inside the layer. The gate mechanism concept was motivated by LSTM-based RNN [ 106 , 107 ]. The information aggregation was conducted by merging the information of the \((i-k)\text{th}\) layer with the next \(i\text{th}\) layer to generate a regularization impact, which makes the gradient-based training of the deeper network much simpler. This enables the training of networks with more than 100 layers, such as a deeper network of 900 layers, with the SGD algorithm. A Highway Network with a depth of fifty layers presented an improved rate of convergence, which is better than that of thin and deep architectures [ 108 ]. By contrast, [ 69 ] empirically demonstrated that plain network performance declines when more than ten hidden layers are inserted. It should be noted that even a Highway Network 900 layers in depth converges much more rapidly than the plain network.
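The gating idea can be sketched for a single fully connected highway layer. This is a simplified illustration under stated assumptions: the carry gate is taken as one minus the transform gate, the transformation uses tanh, dense layers stand in for convolutions, and all weights are hypothetical.

```python
import math


def affine(x, W, b):
    """Compute W x + b for a list vector x and a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bj for row, bj in zip(W, b)]


def highway_layer(x, W_h, b_h, W_t, b_t):
    """Mix a transformed input with the untouched input using a learned gate."""
    h = [math.tanh(v) for v in affine(x, W_h, b_h)]                # candidate transformation H(x)
    t = [1.0 / (1.0 + math.exp(-v)) for v in affine(x, W_t, b_t)]  # transform gate T(x) in (0, 1)
    # Assumed coupling: the carry gate is 1 - T(x).
    return [hi * ti + xi * (1.0 - ti) for hi, ti, xi in zip(h, t, x)]


x = [1.0, -2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]

# A strongly negative gate bias closes the transform gate, so the input is carried through.
print(highway_layer(x, identity, [0.0, 0.0], zeros, [-50.0, -50.0]))
# A strongly positive gate bias opens it, so the layer behaves like a plain tanh layer.
print(highway_layer(x, identity, [0.0, 0.0], zeros, [50.0, 50.0]))
```

When the gate is closed the layer passes its input along unchanged, which is what allows information and gradients to flow through very deep stacks.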

He et al. [ 37 ] developed ResNet (Residual Network), which was the winner of ILSVRC 2015. Their objective was to design an ultra-deep network free of the vanishing gradient issue that affected previous networks. Several types of ResNet were developed based on the number of layers (starting with 34 layers and going up to 1202 layers). The most common type was ResNet50, which comprised 49 convolutional layers plus a single FC layer. The overall number of network weights was 25.5 M, while the overall number of MACs was 3.9 G. The novel idea of ResNet is its use of the bypass pathway concept, which was employed in Highway Nets in 2015 to address the problem of training a deeper network. This is illustrated in Fig. 20 , which contains the fundamental ResNet block diagram: a conventional feedforward network plus a residual connection. The input to the residual block is the output of the \((l-1)\text{th}\) layer, \(x_{l-1}\) , which is delivered from the preceding layer. After executing different operations on \(x_{l-1}\) [such as convolution using variable-size filters, or batch normalization, before applying an activation function like ReLU], the output is \(F(x_{l-1})\) . The final residual output is \(x_{l}\) , which can be mathematically represented as in Eq. 18 : \(x_{l} = F(x_{l-1}) + x_{l-1}\) .

There are numerous basic residual blocks included in the residual network. Based on the type of the residual network architecture, operations in the residual block are also changed [ 37 ].
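A residual block can be sketched as a transformation whose output is added back to its input. The example assumes the usual form \(x_{l} = F(x_{l-1}) + x_{l-1}\); the transformation `example_transform` is a hypothetical stand-in for the convolution, batch normalization, and ReLU operations of a real block.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def residual_block(x, transform):
    """Return F(x) + x, i.e. the transformed signal plus the identity shortcut."""
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("F(x) must have the same shape as x for an identity shortcut")
    return [f + s for f, s in zip(fx, x)]


def example_transform(x):
    """Hypothetical F: a fixed scaling followed by ReLU (stands in for conv + BN + ReLU)."""
    return relu([0.5 * v for v in x])


def zero_transform(x):
    """An F that has learned nothing; the block then reduces to the identity."""
    return [0.0 for _ in x]


print(residual_block([1.0, -2.0, 4.0], example_transform))
print(residual_block([1.0, -2.0, 4.0], zero_transform))
```

The second call shows why the shortcut helps very deep networks: a block that contributes nothing still passes its input forward unchanged.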

figure 20

The block diagram for ResNet

In comparison to the highway network, ResNet introduced shortcut connections inside layers to enable cross-layer connectivity; these shortcuts are parameter-free and data-independent. Note that the layers characterize non-residual functions when a gated shortcut is closed in the highway network. By contrast, the identity shortcuts in ResNet are never closed, so the residual information is always passed. Furthermore, ResNet can mitigate the vanishing gradient problem, as the shortcut connections (residual links) accelerate the deep network convergence. ResNet was the winner of the 2015-ILSVRC championship with 152 layers of depth; this represents 8 times the depth of VGG and 20 times the depth of AlexNet. In comparison with VGG, it has lower computational complexity, even with enlarged depth.

Inception-ResNet and Inception-V3/4

Szegedy et al. [ 103 , 109 , 110 ] proposed Inception-ResNet and Inception-V3/4 as upgraded types of Inception-V1/2. The concept behind Inception-V3 was to minimize the computational cost without affecting the generalization of the deeper network. Thus, Szegedy et al. used asymmetric small-size filters ( \(1\times 5\) and \(1\times 7\) ) rather than large-size filters ( \( 7\times 7\) and \(5\times 5\) ); moreover, they placed a \(1\times 1\) convolution as a bottleneck prior to the large-size filters [ 110 ]. These changes make the traditional convolution operation behave much like a cross-channel correlation. Previously, Lin et al. exploited the potential of the \(1\times 1\) filter in the NIN architecture [ 94 ]. Subsequently, [ 110 ] applied the same idea in an intelligent manner. By using the \(1\times 1\) convolutional operation in Inception-V3, the input data are mapped into three or four separate spaces, which are smaller than the initial input space. Next, all of the correlations are mapped in these smaller spaces through regular \(5\times 5\) or \(3\times 3\) convolutions. By contrast, in Inception-ResNet, Szegedy et al. combined the inception block with the power of residual learning by replacing the filter concatenation with the residual connection [ 111 ]. Szegedy et al. empirically demonstrated that Inception-ResNet (Inception-V4 with residual connections) can achieve a generalization power similar to that of Inception-V4 with enlarged width and depth and without residual connections. This result shows that training with residual connections significantly accelerates the training of Inception networks. Figure 21 shows the basic block diagram for the Inception Residual unit.

figure 21

The basic block diagram for Inception Residual unit

To solve the problem of the vanishing gradient, DenseNet was presented, following the same direction as ResNet and the Highway network [ 105 , 111 , 112 ]. One of the drawbacks of ResNet is that it explicitly preserves information by means of additive identity transformations, even though several layers contribute very little or no information. In addition, ResNet has a large number of weights, since each layer has a separate group of weights. DenseNet employed cross-layer connectivity in an improved way to address this problem [ 112 , 113 , 114 ]. It connected each layer to all subsequent layers in the network in a feed-forward fashion. Therefore, the feature maps of all previous layers were used as inputs to all of the following layers. A traditional CNN with l layers has l connections, one between each layer and the next, while DenseNet has \(\frac{l(l+1)}{2}\) direct connections. DenseNet demonstrates the influence of cross-layer depth-wise convolutions. The network gains the ability to discriminate clearly between the added and the preserved information, since DenseNet concatenates the features of the preceding layers rather than adding them. However, despite its narrow layer structure, DenseNet becomes expensive in terms of parameters as the number of feature maps increases. The direct access of all layers to the gradients through the loss function enhances the information flow all across the network. In addition, this has a regularizing effect, which reduces overfitting on tasks with smaller training sets. Figure 22 shows the architecture of DenseNet Network.

figure 22

(adopted from [ 112 ])

The architecture of DenseNet Network
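The connection count \(\frac{l(l+1)}{2}\) and the effect of concatenating feature maps can be illustrated with a short calculation. In this sketch, `growth_rate` (the number of feature maps each layer adds) and `initial_channels` are hypothetical parameters introduced for illustration.

```python
def dense_connections(num_layers: int) -> int:
    """Direct connections in a densely connected block: l(l+1)/2."""
    return num_layers * (num_layers + 1) // 2


def input_channels(initial_channels: int, growth_rate: int, num_layers: int):
    """Channels seen by each layer when all earlier feature maps are concatenated.

    Assumption: every layer adds `growth_rate` new feature maps.
    """
    return [initial_channels + i * growth_rate for i in range(num_layers)]


for l in (1, 5, 10):
    print(l, "layers: plain CNN has", l, "connections, dense block has", dense_connections(l))

# Concatenation makes the input of each layer wider than that of the previous one.
print(input_channels(initial_channels=16, growth_rate=12, num_layers=4))
```

The steadily growing input width is the reason the parameter cost rises with the number of feature maps, even when each individual layer is narrow.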

ResNext is an enhanced version of the Inception Network [ 115 ]. It is also known as the Aggregated Residual Transform Network. Xie et al. [ 115 ] introduced the term cardinality and applied the split, transform, and merge topology in an easy and effective way. Cardinality denotes the size of the transformation set and acts as an extra dimension [ 116 , 117 , 118 ]. The Inception network manages network resources efficiently and enhances the learning ability of the conventional CNN; however, its transformation branches use different spatial embeddings (e.g. \(5\times 5\) , \(3\times 3\) , and \(1\times 1\) ), so each layer must be customized separately. By contrast, ResNext derives its characteristic features from ResNet, VGG, and Inception. It combined the deep homogeneous topology of VGG with the basic architecture of GoogLeNet by fixing the spatial resolution to \(3\times 3\) filters inside the split, transform, and merge blocks. Figure 23 shows the ResNext building blocks. ResNext applied multiple transformations inside the split, transform, and merge blocks and defined these transformations in terms of cardinality. As Xie et al. showed, increasing the cardinality significantly improves the performance. The complexity of ResNext was regulated by employing \(1\times 1\) filters (low embeddings) ahead of a \(3\times 3\) convolution, while skip connections are used for optimized training [ 115 ].

figure 23

The basic block diagram for the ResNext building blocks
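One way to see how cardinality can be raised at roughly constant complexity is to count weights, treating the parallel transformations as a grouped \(3\times 3\) convolution. This is a rough sketch: the channel widths (256, 64, 128) and the cardinality of 32 are illustrative assumptions, not values stated in the text.

```python
def conv_weights(in_ch: int, out_ch: int, k: int, groups: int = 1) -> int:
    """Weights of a k x k convolution split into `groups` independent paths."""
    if in_ch % groups or out_ch % groups:
        raise ValueError("channel counts must be divisible by the number of groups")
    return (in_ch // groups) * out_ch * k * k


def bottleneck_weights(channels: int, width: int, cardinality: int) -> int:
    """1x1 reduce -> 3x3 (grouped) transform -> 1x1 expand."""
    return (
        conv_weights(channels, width, 1)
        + conv_weights(width, width, 3, groups=cardinality)
        + conv_weights(width, channels, 1)
    )


# Illustrative settings: a single wide path versus 32 narrow parallel paths.
single_path = bottleneck_weights(channels=256, width=64, cardinality=1)
aggregated = bottleneck_weights(channels=256, width=128, cardinality=32)
print(single_path, aggregated)
```

With these settings the two blocks have almost the same number of weights, although the second one applies 32 transformations in parallel.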

The feature reuse problem is the core shortcoming of deep residual networks, since certain feature blocks or transformations contribute very little to learning. Zagoruyko and Komodakis [ 119 ] accordingly proposed WideResNet to address this problem. These authors argued that the residual units convey the core learning ability of deep residual networks, while depth has only a supplemental influence. WideResNet exploited the power of the residual block by making ResNet wider instead of deeper [ 37 ]. It enlarged the width by introducing an extra factor, k, which controls the network width. In other words, it showed that widening the layers is a more effective way to improve performance than deepening the residual network. While deep residual networks achieve enhanced representational capacity, they also have certain drawbacks, such as the exploding and vanishing gradient problems, the feature reuse problem (inactivation of several feature maps), and time-intensive training. He et al. [ 37 ] tackled the feature reuse problem by including dropout in each residual block to regularize the network efficiently. In a similar manner, utilizing dropout, Huang et al. [ 120 ] presented the stochastic depth concept to solve the slow learning and vanishing gradient problems. Earlier research focused on increasing the depth; thus, any small enhancement in performance required the addition of several new layers. An experimental study showed that WideResNet has twice as many parameters as ResNet. By contrast, WideResNet can be trained more effectively than very deep networks [ 119 ]. Note that most architectures prior to residual networks (including the highly effective VGG and Inception) were wider than ResNet. Thus, wider residual networks were established once this was determined. 
However, inserting dropout between the convolutional layers (as opposed to within the residual block) made the learning more effective in WideResNet [ 121 , 122 ].

Pyramidal Net

The depth of the feature map increases in the succeeding layers due to the deep stacking of multiple convolutional layers, as seen in previous deep CNN architectures such as ResNet, VGG, and AlexNet. By contrast, the spatial dimension reduces, since a sub-sampling follows each convolutional layer. Thus, the enriched feature representation is offset by the decreasing size of the feature map. The extreme expansion in the depth of the feature map, alongside the loss of spatial information, interferes with the learning ability of deep CNNs. ResNet obtained notable results in image classification. However, deleting a convolutional block in which both the number of channels and the spatial dimension vary (channel depth enlarges, while spatial dimension reduces) commonly results in decreased classifier performance. Accordingly, the stochastic ResNet enhanced the performance by decreasing the information loss that accompanies dropping a residual unit. Han et al. [ 123 ] proposed Pyramidal Net to address the ResNet learning interference problem. In contrast to the sharp depth enlargement and spatial width reduction of ResNet, Pyramidal Net gradually enlarges the width of each residual unit to cover all feasible locations, rather than keeping the same dimension inside all residual blocks until down-sampling occurs. It was referred to as Pyramidal Net due to the gradual, top-down enlargement in the feature map depth. The depth of the feature map is regulated by the factor l and determined by Eq. 19 .

Here, the dimension of the \(l\)th residual unit is indicated by \(d_{l}\) ; moreover, n indicates the overall number of residual units, the step factor is indicated by \(\lambda \) , and the depth increase is regulated by the factor \(\frac{\lambda }{n}\) , which distributes the increase uniformly across the dimension of the feature map. Zero-padded identity mapping is used to insert the residual connections among the layers. In comparison to projection-based shortcut connections, zero-padded identity mapping requires fewer parameters, which in turn leads to enhanced generalization [ 124 ]. Pyramidal Nets use two different approaches to network widening: multiplication-based and addition-based. More specifically, the first approach (multiplication) enlarges the width geometrically, while the second one (addition) enlarges it linearly [ 92 ]. The main problem associated with width enlargement is the quadratic growth in the time and space required.
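The two widening schemes can be compared numerically. The sketch below assumes an additive recurrence of the form \(d_{l} = \lfloor d_{l-1} + \lambda / n \rfloor\), which matches the symbols defined above, with a hypothetical starting width \(d_{1} = 16\). The multiplicative variant is parameterized by a per-unit `ratio`, which is an illustrative simplification rather than the notation of the source.

```python
import math


def additive_widths(d1: int, lam: float, n: int):
    """Linear widening: each unit adds lam / n channels (rounded down)."""
    widths = [d1]
    current = float(d1)
    for _ in range(n):
        current += lam / n
        widths.append(math.floor(current))
    return widths


def multiplicative_widths(d1: int, ratio: float, n: int):
    """Geometric widening: each unit multiplies the width by a fixed ratio.

    `ratio` is a hypothetical parameter used here for illustration.
    """
    widths = [d1]
    current = float(d1)
    for _ in range(n):
        current *= ratio
        widths.append(math.floor(current))
    return widths


print(additive_widths(d1=16, lam=48, n=4))          # equal steps of lam / n = 12
print(multiplicative_widths(d1=16, ratio=2.0, n=4))  # width doubles at every unit
```

The additive schedule grows in equal steps, while the multiplicative schedule concentrates most of the growth in the last units.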

Xception can be regarded as an extreme Inception architecture. Its main idea is the depthwise separable convolution [ 125 ]. The Xception model modified the original inception block by making it wider and replacing the different spatial dimensions with a single dimension ( \(3 \times 3\) ) followed by a \(1 \times 1\) convolution to reduce computational complexity. Figure 24 shows the Xception block architecture. The Xception network becomes more computationally efficient by decoupling channel and spatial correspondence. It first maps the convolved output to a low-dimensional embedding by applying \(1 \times 1\) convolutions. It then performs k spatial transformations. Note that k here represents the width-defining cardinality, which is determined by the number of transformations in Xception. The computations are simplified in Xception by convolving each channel separately along the spatial axes. The results are then passed to \(1 \times 1\) convolutions (pointwise convolutions), which perform cross-channel correspondence. The \(1 \times 1\) convolution is utilized in Xception to regulate the depth of the channel. The convolutional operation in Xception utilizes a number of transformation segments equal to the number of channels, whereas Inception utilizes three transformation segments and the traditional CNN architecture utilizes only a single transformation segment. The Xception transformation approach achieves extra learning efficiency and better performance, but it does not reduce the number of parameters [ 126 , 127 ].

figure 24

The basic block diagram for the Xception block architecture
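The decoupling of spatial and cross-channel correspondence can be illustrated by counting the weights of a single layer. This is a per-layer sketch with hypothetical channel counts (128 in, 256 out); it compares one depthwise separable convolution with one standard convolution of the same shape and says nothing about the parameter count of a whole network.

```python
def standard_conv_weights(in_ch: int, out_ch: int, k: int) -> int:
    """Every output channel mixes all input channels over a k x k window."""
    return in_ch * out_ch * k * k


def separable_conv_weights(in_ch: int, out_ch: int, k: int) -> int:
    """Depthwise k x k filter per channel, then a 1x1 pointwise convolution."""
    depthwise = in_ch * k * k    # spatial correspondence, one filter per channel
    pointwise = in_ch * out_ch   # cross-channel correspondence via 1x1 convolution
    return depthwise + pointwise


# Hypothetical layer shape, used only for illustration.
standard = standard_conv_weights(128, 256, 3)
separable = separable_conv_weights(128, 256, 3)
print(standard, separable)
```

Almost all of the weights of the separable layer sit in the pointwise step, which is why the \(1 \times 1\) convolution controls the channel depth.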

Residual attention neural network

To improve the network feature representation, Wang et al. [ 128 ] proposed the Residual Attention Network (RAN). The main purpose of incorporating attention into the CNN is to enable the network to learn object-aware features. The RAN is a feed-forward CNN that consists of stacked residual blocks combined with attention modules. Each attention module is divided into two branches, namely the mask branch and the trunk branch. These branches adopt a top-down and bottom-up learning strategy respectively. Encapsulating the two strategies in the attention module supports fast feed-forward processing and top-down attention feedback within a single feed-forward process. More specifically, the top-down architecture generates dense features to make inferences about every aspect, while the bottom-up feed-forward architecture generates low-resolution feature maps with strong semantic information. A top-down bottom-up strategy was previously employed in restricted Boltzmann machines [ 129 ]. During the reconstruction phase of training, Goh et al. [ 130 ] used top-down attention in deep Boltzmann machines (DBMs) as a regularizing factor. Note that the network can be globally optimized using a top-down learning strategy in a similar manner, where the maps are progressively passed from the output back to the input throughout the learning process [ 129 , 130 , 131 , 132 ].

The transformation network in a previous study [ 133 ] also incorporated the attention concept with convolutional blocks in a simple way. Unfortunately, its attention modules are inflexible, which is the main problem, and they cannot adapt to varying surroundings. By contrast, stacking multiple attention modules has made RAN very effective at recognizing noisy, complex, and cluttered images. RAN’s hierarchical organization gives it the capability to adaptively allocate a weight to every feature map depending on its importance within the layers. Furthermore, incorporating three distinct levels of attention (spatial, channel, and mixed) enables the model to capture object-aware features at these distinct levels.

Convolutional block attention module

The importance of feature map utilization and of the attention mechanism has been confirmed by SE-Network and RAN [ 128 , 134 , 135 ]. The convolutional block attention module (CBAM), a novel attention-based CNN, was first developed by Woo et al. [ 136 ]. This module is simple in design and similar to SE-Network. SE-Network disregards the object’s spatial locality in the image and considers only the channels’ contribution during image classification. In object detection, however, the spatial location of the object plays a significant role. CBAM infers the attention maps sequentially: it applies channel attention first and spatial attention second to obtain the refined feature maps. Spatial attention is performed using 1 × 1 convolution and pooling functions, as in the literature. An effective feature descriptor can be generated by pooling features along the spatial axis. In addition, CBAM generates a robust spatial attention map by concatenating the max pooling and average pooling operations. In a similar manner, a combination of GAP and max pooling operations is used to model the feature map statistics. Woo et al. [ 136 ] demonstrated that using GAP alone yields a sub-optimal inference of channel attention, whereas max pooling provides an indication of the distinguishing object features. Thus, using max pooling and average pooling together enhances the network’s representational power. The refined feature maps improve the representational power and also help the network to focus on the significant portion of the chosen features. As Woo et al. [ 136 ] showed experimentally, expressing 3D attention maps through a serial learning procedure helps to decrease the computational cost and the number of parameters. Note that CBAM can be easily integrated with any CNN architecture.
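The sequential order of the two attention steps can be sketched on a tiny feature map stored as nested lists (channels × height × width). This is a heavily simplified illustration and not the published module: the shared network that normally processes the pooled channel descriptors is replaced by the identity, and the convolution over the pooled spatial maps is reduced to a 1 × 1 weighting with hypothetical weights `w_avg` and `w_max`.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def channel_attention(fmap):
    """One weight per channel from average- and max-pooled descriptors.

    Simplification: the pooled values are summed directly (no learned layers).
    """
    weights = []
    for channel in fmap:
        values = [v for row in channel for v in row]
        weights.append(sigmoid(sum(values) / len(values) + max(values)))
    return weights


def spatial_attention(fmap, w_avg=1.0, w_max=1.0):
    """One weight per location from channel-wise average and max pooling."""
    height, width = len(fmap[0]), len(fmap[0][0])
    attention = []
    for i in range(height):
        row = []
        for j in range(width):
            column = [channel[i][j] for channel in fmap]
            pooled = w_avg * sum(column) / len(column) + w_max * max(column)
            row.append(sigmoid(pooled))
        attention.append(row)
    return attention


def cbam(fmap):
    """Apply channel attention first, then spatial attention."""
    cw = channel_attention(fmap)
    refined = [[[v * cw[c] for v in row] for row in channel]
               for c, channel in enumerate(fmap)]
    sa = spatial_attention(refined)
    return [[[v * sa[i][j] for j, v in enumerate(row)]
             for i, row in enumerate(channel)]
            for channel in refined]


fmap = [[[1.0, 2.0], [3.0, 4.0]],
        [[0.0, 0.5], [0.5, 0.0]]]
out = cbam(fmap)
print(out)
```

The output has the same shape as the input, which is what allows such a module to be inserted between existing convolutional blocks.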

Concurrent spatial and channel excitation mechanism

To extend this work to segmentation tasks, Roy et al. [ 137 , 138 ] expanded the effort of Hu et al. [ 134 ] by adding the influence of spatial information to the channel information. Roy et al. [ 137 , 138 ] presented three types of modules: (1) concurrent spatial and channel squeeze and excitation (scSE); (2) squeezing channel-wise and exciting spatially (sSE); (3) squeezing spatially and exciting channel-wise (cSE). For segmentation purposes, they employed auto-encoder-based CNNs and suggested inserting the modules after the encoder and decoder layers. In the first module (scSE), to highlight the object-specific feature maps, attention is allocated to every channel by deriving a scaling factor from both the channel and spatial information. In the second module (sSE), the spatial locality is given more importance than the feature map information, as spatial information plays a significant role during the segmentation process. Therefore, several channel collections are spatially divided and developed so that they can be employed in segmentation. In the final module (cSE), a concept similar to the SE-block is used, and the scaling factor is derived based on the contribution of the feature maps to object detection [ 137 , 138 ].

CNN is an efficient technique for detecting object features and achieves good recognition performance in comparison with handcrafted feature detectors. However, CNNs have a number of restrictions: they do not consider certain relations between features, such as orientation, size, and perspective. For instance, when considering a face image, the CNN does not take into account the positions of the various face components (such as mouth, eyes, and nose); its neurons may be activated incorrectly, and it may recognize a face without taking specific relations (such as size and orientation) into account. Now consider a neuron that outputs a probability together with feature properties such as size, orientation, and perspective. A neuron/capsule of this type can effectively detect the face along with these different types of information. A capsule network is constructed from many layers of such capsule nodes. The CapsuleNet or CapsNet (the initial version of the capsule networks) is formed by an encoding unit that contains three layers of capsule nodes.

For example, in the MNIST architecture the input is a \(28\times 28\) image, to which 256 filters of size \(9\times 9\) are applied with stride 1. The output therefore has a spatial size of \(28-9+1=20\) with 256 feature maps. Next, these outputs are input to the first capsule layer, which produces an 8D vector rather than a scalar; in fact, this is a modified convolution layer. Note that \(9\times 9\) filters with stride 2 are employed in this capsule layer. Thus, the dimension of the output is \(\lfloor (20-9)/2\rfloor +1=6\) . The initial capsules employ \(8\times 32\) filters, which generate 32 × 8 × 6 × 6 (32 for groups, 8 for neurons, while 6 × 6 is the neuron size).
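The layer sizes quoted above follow from the standard output-size formula for a convolution without padding. The short check below assumes no padding and integer (floor) division; the helper name is illustrative.

```python
def conv_output_size(input_size: int, kernel: int, stride: int = 1) -> int:
    """Spatial output size of a convolution without padding."""
    return (input_size - kernel) // stride + 1


# First convolution: 28x28 input, 9x9 filters, stride 1.
conv1 = conv_output_size(28, kernel=9, stride=1)

# First capsule layer: 9x9 filters, stride 2, applied to the 20x20 maps.
primary = conv_output_size(conv1, kernel=9, stride=2)

groups, capsule_dim = 32, 8
num_capsules = groups * primary * primary   # number of 8D capsule vectors
num_values = num_capsules * capsule_dim     # 32 x 8 x 6 x 6 scalar outputs

print(conv1, primary, num_capsules, num_values)
```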

Figure 25 represents the complete CapsNet encoding and decoding processes. In the CNN context, a max-pooling layer is frequently employed to handle translation changes: it can still detect a feature that moves, provided that the feature remains within the max-pooling window. By contrast, because a capsule takes the weighted sum of the features from the preceding layer, the capsule approach is able to detect overlapping features, which is highly significant in detection and segmentation operations.

figure 25

The complete CapsNet encoding and decoding processes

In conventional CNNs, a single cost function is employed to evaluate the global error, which is propagated backward during the training process. However, in such cases, the activation of a neuron is not propagated further once the weight between two neurons becomes zero. In iterative dynamic routing with agreement, by contrast, the signal is routed based on the feature parameters rather than through a single cost function applied to the whole network. Sabour et al. [ 139 ] provide more details about this architecture. When applied to MNIST handwritten digit recognition, this innovative CNN architecture gives superior accuracy. From the application perspective, this architecture is better suited to segmentation and detection than to classification [ 140 , 141 , 142 ].

High-resolution network (HRNet)

High-resolution representations are necessary for position-sensitive vision tasks, such as semantic segmentation, object detection, and human pose estimation. In current state-of-the-art frameworks, the input image is encoded as a low-resolution representation by a subnetwork constructed as a connected series of high-to-low resolution convolutions, such as VGGNet and ResNet. The low-resolution representation is then recovered to a high-resolution one. Alternatively, a novel network referred to as the High-Resolution Network (HRNet) [ 143 , 144 ] maintains high-resolution representations throughout the entire process. This network has two principal features. First, the high-to-low resolution convolution series are connected in parallel. Second, information is repeatedly exchanged across the resolutions. The resulting representation is more accurate in the spatial domain and richer in the semantic domain. Moreover, HRNet has several applications in the fields of object detection, semantic segmentation, and human pose prediction. For computer vision problems, HRNet represents a more robust backbone. Figure 26 illustrates the general architecture of HRNet.

figure 26

The general architecture of HRNet

Challenges (limitations) of deep learning and alternate solutions

Several difficulties must often be taken into consideration when employing DL. The more challenging ones are listed next, together with several possible alternatives.

Training data

DL is extremely data-hungry, since it also involves representation learning [ 145 , 146 ]. DL demands an extensively large amount of data to achieve a well-performing model; that is, as the amount of data increases, a better-performing model can be achieved (Fig. 27 ). In most cases, the available data are sufficient to obtain a good performance model. However, sometimes there is a shortage of data for using DL directly [ 87 ]. Three methods are suggested to address this issue. The first involves employing the transfer-learning concept after data is collected from similar tasks. Note that while the transferred data will not directly augment the actual data, it will help to enhance both the original input representation of the data and its mapping function [ 147 ]. In this way, the model performance is boosted. A related technique involves taking a well-trained model from a similar task and fine-tuning its last two layers, or even only its last layer, on the limited original data. Refer to [ 148 , 149 ] for a review of different transfer-learning techniques applied in the DL approach. In the second method, data augmentation is performed [ 150 ]. This is very helpful for image data, since image translation, mirroring, and rotation commonly do not change the image label. However, it is important to take care when applying this technique in some cases, such as with bioinformatics data. For instance, when mirroring an enzyme sequence, the output data may not represent the actual enzyme sequence. In the third method, simulated data can be used to increase the volume of the training set. If the issue is well understood, it is occasionally possible to create simulators based on the physical process, which can then generate as much simulated data as needed. Ref. [ 151 ] gives an example of meeting the data requirement of DL through simulation.

figure 27

The performance of DL regarding the amount of data

Transfer learning

Recent research has revealed a widespread use of deep CNNs, which offer ground-breaking support for answering many classification problems. Generally speaking, deep CNN models require a sizable volume of data to obtain good performance. The common challenge associated with using such models concerns the lack of training data. Indeed, gathering a large volume of data is an exhausting job, and no fully successful solution is available at this time. The undersized dataset problem is therefore currently addressed using the TL technique [ 148 , 149 ], which is highly efficient in addressing the lack of training data. The mechanism of TL involves first training the CNN model on large volumes of data. In the next step, the model is fine-tuned on the small dataset of the target task.

The student-teacher relationship is a suitable way to explain TL. The first step is for the teacher to gather detailed knowledge of the subject [ 152 ]. Next, the teacher provides a “course” by conveying the information within a “lecture series” over time. Put simply, the expert (teacher) transfers the knowledge (information) to the learner (student). Similarly, the DL network is trained using a vast volume of data and learns the weights and biases during the training process. These weights are then transferred to a different network for retraining or testing on a similar new task. Thus, the new model starts from pre-trained weights rather than requiring training from scratch. Figure 28 illustrates the conceptual diagram of the TL technique.
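The transfer-and-fine-tune mechanism can be sketched without any DL framework. In the toy example below, a model is a list of layers with weights; the layer names, weight values, and gradients are all hypothetical, and the update rule is plain gradient descent. Only the last `n_trainable` layers are updated, while the transferred weights of the earlier layers stay frozen.

```python
import copy


def transfer_weights(pretrained, n_trainable):
    """Copy pre-trained layers and mark only the last n_trainable as trainable."""
    model = copy.deepcopy(pretrained)
    for index, layer in enumerate(model):
        layer["trainable"] = index >= len(model) - n_trainable
    return model


def sgd_step(model, grads, lr=0.1):
    """Update trainable layers only; frozen layers keep their transferred weights."""
    for layer, grad in zip(model, grads):
        if layer["trainable"]:
            layer["weights"] = [w - lr * g for w, g in zip(layer["weights"], grad)]


# Hypothetical pre-trained network (toy weights, for illustration only).
pretrained = [
    {"name": "conv1", "weights": [0.5, -0.2]},
    {"name": "conv2", "weights": [0.1, 0.3]},
    {"name": "fc", "weights": [1.0, -1.0]},
]

model = transfer_weights(pretrained, n_trainable=1)
sgd_step(model, grads=[[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]])

for layer in model:
    print(layer["name"], layer["trainable"], layer["weights"])
```

Setting `n_trainable=2` would correspond to fine-tuning the last two layers instead of only the last one.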

Pre-trained models: Many CNN models, e.g. AlexNet [ 30 ], GoogLeNet [ 103 ], and ResNet [ 37 ], have been trained on large datasets such as ImageNet for image recognition purposes. These models can then be employed for a different task without the need to train from scratch. Furthermore, the weights remain the same apart from a few learned features. In cases where data samples are lacking, these models are very useful. There are many reasons for employing a pre-trained model. First, training large models on sizeable datasets requires expensive computational power. Second, training large models can be time-consuming, taking up to multiple weeks. Finally, a pre-trained model can assist with network generalization and speed up the convergence.

A research problem using pre-trained models: Training a DL approach requires a massive number of images. Thus, obtaining good performance is a challenge when such data are not available. With deep convolutional neural networks (DCNNs) comprising several layers, excellent outcomes in image classification or recognition applications, occasionally superior to human performance, become possible if a huge amount of data is available [ 37 , 148 , 153 ]. However, avoiding overfitting in such applications requires sizable datasets and DCNN models that generalize properly. When training a DCNN model, the dataset size has no lower limit. However, the accuracy of the model becomes insufficient if the model has few layers, or if a small dataset is used for training, due to under- or over-fitting problems. Models with fewer layers have poor accuracy because they are unable to exploit the hierarchical features of sizable datasets. It is difficult to acquire sufficient training data for DL models. For example, in medical imaging and environmental science, gathering labelled datasets is very costly [ 148 ]. Moreover, the majority of crowdsourcing workers are unable to annotate medical or biological images accurately due to their lack of medical or biological knowledge. Thus, ML researchers often rely on field experts to label such images; however, this process is costly and time consuming. Therefore, producing the large volume of labels required to develop successful deep networks turns out to be unfeasible. Recently, TL has been widely employed to address the latter issue. Nevertheless, although TL enhances the accuracy of several tasks in the fields of pattern recognition and computer vision [ 154 , 155 ], there is an essential issue related to the type of source data used by TL as compared to the target dataset. 
For instance, attempts to enhance the medical image classification performance of CNN models typically train the models on the ImageNet dataset, which contains natural images [ 153 ]. However, such natural images are completely dissimilar from raw medical images, meaning that the model performance is not enhanced. It has further been shown that TL from a different domain does not significantly affect performance on medical imaging tasks, as lightweight models trained from scratch perform nearly as well as standard ImageNet-transferred models [ 156 ]. Therefore, there exist scenarios in which using pre-trained models is not a worthwhile solution. In 2020, some researchers utilized same-domain TL and achieved excellent results [ 86 , 87 , 88 , 157 ]. Same-domain TL is an approach that uses images similar to the target dataset for pre-training; for example, a model is first trained on X-ray images of different chest diseases and then fine-tuned on chest X-ray images for COVID-19 diagnosis. More details about same-domain TL and how to implement the fine-tuning process can be found in [ 87 ].

figure 28

The conceptual diagram of the TL technique

Data augmentation techniques

If the goal is to increase the amount of available data and avoid the overfitting issue, data augmentation techniques are one possible solution [ 150 , 158 , 159 ]. These techniques are data-space solutions for any limited-data problem. Data augmentation incorporates a collection of methods that improve the attributes and size of training datasets. Thus, DL networks can perform better when these techniques are employed. Next, we list some data augmentation alternate solutions.

Flipping: Vertical flipping is less common than horizontal flipping. Flipping has been verified as valuable on datasets like ImageNet and CIFAR-10, and it is very simple to implement. Note, however, that it is not a label-preserving transformation on datasets that involve digit or text recognition (such as SVHN and MNIST).

Color space: Digital image data are commonly encoded as a three-dimensional tensor ( \(height \times width \times color channels\) ). Performing augmentations in the color space of the channels is an alternative technique that is very practical to implement. A very simple color augmentation isolates a single color channel, such as red, green, or blue. An image can be rapidly converted to its representation in one color channel by keeping that channel's matrix and filling the other two channels with zero matrices. Furthermore, the image brightness can be increased or decreased by manipulating the RGB values with straightforward matrix operations. More advanced color augmentations can be obtained by deriving a color histogram that describes the image. Adjusting the intensity values in such histograms enables lighting alterations similar to those used in photo-editing applications.

Cropping: Cropping a central patch of every image is a practical processing step for image data with mixed height and width dimensions. Furthermore, random cropping may be employed to produce an effect similar to translations. The difference is that translations preserve the spatial dimensions of the image, while random cropping reduces the input size [for example from (256, 256) to (224, 224)]. Depending on the reduction threshold selected for cropping, the transformation may not be label-preserving.

Rotation: Rotation augmentations are obtained by rotating the image to the left or right about an axis by an angle between 0 and 360 degrees. The suitability of rotation augmentations depends heavily on the rotation degree parameter. In digit recognition tasks, small rotations (from 0 to 20 degrees) are very helpful. By contrast, as the rotation degree increases, the data label may no longer be preserved after the transformation.

Translation: Shifting the image up, down, left, or right is a very useful transformation for avoiding positional bias in the image data. For instance, if all images in a dataset are centered, the model would also have to be tested on perfectly centered images. Note that when translating the original images in a particular direction, the vacated space should be filled with Gaussian or random noise, or with a constant value such as 0 or 255. This padding preserves the spatial dimensions of the image after augmentation.

Noise injection: This approach involves injecting a matrix of random values, commonly drawn from a Gaussian distribution. Moreno-Barea et al. [ 160 ] tested noise injection on nine datasets taken from the UCI repository [ 161 ]. Injecting noise into images helps the CNN learn more robust features.
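
The augmentations listed above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in any cited work: the function names, the 8-bit (height, width, channels) image layout, and the restriction of rotation to multiples of 90 degrees (which avoids interpolation) are all assumptions made for this example.

```python
import numpy as np

def horizontal_flip(img):
    """Mirror an (H, W, C) image left to right."""
    return img[:, ::-1, :]

def isolate_channel(img, channel):
    """Keep one color channel and fill the other two with zeros."""
    out = np.zeros_like(img)
    out[..., channel] = img[..., channel]
    return out

def adjust_brightness(img, delta):
    """Add a constant to every pixel and clip to the 8-bit range."""
    return np.clip(img.astype(np.int16) + delta, 0, 255).astype(np.uint8)

def random_crop(img, size, rng):
    """Cut a random (size, size) patch; the output is smaller than the input."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return img[top:top + size, left:left + size]

def rotate90(img, k):
    """Rotate by k * 90 degrees (arbitrary angles would need interpolation)."""
    return np.rot90(img, k=k, axes=(0, 1))

def translate(img, dy, dx, fill=0):
    """Shift down by dy and right by dx pixels; pad the vacated area with `fill`."""
    h, w = img.shape[:2]
    ay, ax = abs(dy), abs(dx)
    padded = np.pad(img, ((ay, ay), (ax, ax), (0, 0)), constant_values=fill)
    return padded[ay - dy:ay - dy + h, ax - dx:ax - dx + w]

def add_gaussian_noise(img, sigma, rng):
    """Inject zero-mean Gaussian noise and clip back to the 8-bit range."""
    noise = rng.normal(0.0, sigma, size=img.shape)
    return np.clip(img + noise, 0, 255).astype(np.uint8)
```

Note that `translate` keeps the spatial dimensions of the image, whereas `random_crop` reduces them, which mirrors the distinction drawn in the text.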

Overall, geometric transformations are very good solutions for positional biases present in the training data. Several potential sources of bias can separate the distribution of the testing data from that of the training data. For instance, positional bias emerges when all faces are perfectly centered within the frames, as in facial recognition datasets. In such cases, geometric transformations are a good solution. They are helpful because they are simple to implement and effective at removing positional biases. Several image processing libraries are available, which makes it easy to begin with simple operations such as rotation or horizontal flipping. Additional training time, higher computational costs, and additional memory are some shortcomings of geometric transformations. Furthermore, some geometric transformations (such as arbitrary cropping or translation) must be inspected manually to ensure that they do not change the image label. Finally, the biases that separate the test data from the training data are often more complicated than translational and positional changes. Hence, deciding when and where geometric transformations are suitable is not trivial.

Imbalanced data

Commonly, biological data tend to be imbalanced, as negative samples are much more numerous than positive ones [ 162 , 163 , 164 ]. For example, normal X-ray images are far more plentiful than COVID-19-positive X-ray images. Training a DL model on imbalanced data may produce undesirable results. The following techniques are used to address this issue. First, it is necessary to employ the correct criteria for evaluating the loss as well as the prediction result. With imbalanced data, the model should perform well on small classes as well as on large ones. Thus, the area under the curve (AUC) should be employed as both the loss and the evaluation criterion [ 165 ]. Second, if the cross-entropy loss is still preferred, the weighted cross-entropy loss should be employed, which ensures that the model performs well on small classes. In addition, during model training, it is possible either to down-sample the large classes or to up-sample the small classes. Finally, because a biological system frequently has a hierarchical label space, it is possible to construct models for every hierarchical level to balance the data, as in Ref. [ 166 ]. The effect of imbalanced data on the performance of DL models has been comprehensively investigated, and the most frequently used techniques for lessening the problem have also been compared. Note, however, that these techniques are not specific to biological problems.
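
Two of the remedies above, weighted cross-entropy and up-sampling, can be sketched as follows. The inverse-frequency weighting scheme and all function names are assumptions chosen for illustration; the text does not prescribe a particular weighting, and the sketch assumes every class appears at least once in the labels.

```python
import numpy as np

def class_weights(labels, num_classes):
    """Inverse-frequency weights; a perfectly balanced dataset gives all ones."""
    counts = np.bincount(labels, minlength=num_classes)
    return len(labels) / (num_classes * counts)

def weighted_cross_entropy(probs, labels, weights):
    """Mean of -w[y] * log p(y); probs is an (N, K) array of class probabilities."""
    picked = probs[np.arange(len(labels)), labels]
    return float(np.mean(-weights[labels] * np.log(picked + 1e-12)))

def upsample_indices(labels, rng):
    """Indices that resample every class (with replacement) up to the largest class size."""
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    parts = [rng.choice(np.flatnonzero(labels == c), size=target, replace=True)
             for c in classes]
    return np.concatenate(parts)
```

With nine negative samples and one positive sample, for example, `class_weights` assigns the positive class a weight nine times larger than the negative class, so an error on the rare class contributes far more to the loss.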

Interpretability of data

DL techniques are often regarded as black boxes, although they can in fact be interpreted. A method for interpreting DL models, which reveals the valuable motifs and patterns recognized by the network, is needed in many fields, such as bioinformatics [ 167 ]. In disease diagnosis, it is important to know not only the diagnosis or prediction produced by a trained DL model, but also the evidence on which the model bases its decision, as this increases confidence in the prediction outcomes [ 168 ]. To achieve this, an importance score can be assigned to every portion of a particular example. For this purpose, back-propagation-based techniques or perturbation-based approaches are used [ 169 ]. In perturbation-based approaches, a portion of the input is changed and the effect of this change on the model output is observed [ 170 , 171 , 172 , 173 ]. This concept is simple to understand, but it has high computational complexity. In back-propagation-based techniques, on the other hand, the signal from the output is propagated back to the input layer to obtain the importance scores of the various input portions. These techniques have been proven valuable in [ 174 ]. Note that model interpretability can have different meanings in different scenarios.
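
The perturbation-based idea can be illustrated with a simple occlusion test: mask one patch of the input at a time and record how much the model output drops. The sketch below assumes a 2D input and a hypothetical `predict` callable that returns a single score; it is not the procedure of any specific cited paper.

```python
import numpy as np

def occlusion_importance(predict, x, patch=2, baseline=0.0):
    """Score each patch of a 2D input by the output drop caused by masking it."""
    reference = predict(x)
    scores = np.zeros_like(x, dtype=float)
    h, w = x.shape
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            perturbed = x.copy()
            perturbed[top:top + patch, left:left + patch] = baseline
            # One forward pass per patch: this is the source of the high cost.
            scores[top:top + patch, left:left + patch] = reference - predict(perturbed)
    return scores

# Toy "model" that only looks at the top-left 2 x 2 corner of the input.
toy_predict = lambda x: float(x[:2, :2].sum())
importance = occlusion_importance(toy_predict, np.ones((4, 4)))
```

For the toy model, only the top-left patch receives a non-zero score. The number of forward passes grows with the number of patches, which reflects the computational cost mentioned in the text.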

Uncertainty scaling

Commonly, the final prediction label is not the only output required when employing DL techniques; a confidence score for every query to the model is also desired. The confidence score expresses how confident the model is in its prediction [ 175 ]. Because it guards against trusting unreliable and misleading predictions, the confidence score is a significant attribute regardless of the application scenario. In biology, the confidence score reduces the resources and time spent verifying the outcomes of misleading predictions. In healthcare and similar applications, uncertainty scaling is frequently very significant; it helps in evaluating automated clinical decisions and the reliability of machine learning-based disease diagnosis [ 176 , 177 ]. Because DL models can produce overconfident predictions, the probability score obtained directly from the softmax output is often not on the correct scale [ 178 ]. The softmax output therefore requires post-scaling to yield a reliable probability score. Several techniques have been introduced for this purpose, including Bayesian Binning into Quantiles (BBQ) [ 179 ], isotonic regression [ 180 ], histogram binning [ 181 ], and the well-known Platt scaling [ 182 ]. More specifically for DL techniques, temperature scaling was recently introduced, and it achieves superior performance compared to the other techniques.
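
As a rough illustration of temperature scaling, the sketch below divides the logits by a single scalar T before the softmax and picks the T that minimizes the negative log-likelihood on held-out data. The grid search stands in for a proper optimizer, and the function names and grid range are assumptions made for this example.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Row-wise softmax of logits / T (numerically stabilised)."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, T):
    """Mean negative log-likelihood of the true labels at temperature T."""
    p = softmax(logits, T)
    return float(-np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12)))

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
    """Choose T on validation data by grid search over the NLL."""
    losses = [nll(logits, labels, T) for T in grid]
    return float(grid[int(np.argmin(losses))])

# Overconfident toy validation set: very large margins, but one of four is wrong.
val_logits = np.array([[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 0.0]])
val_labels = np.array([0, 0, 0, 0])
T = fit_temperature(val_logits, val_labels)
calibrated = softmax(val_logits, T)
```

Because every logit is divided by the same positive T, the predicted class never changes; only the confidence attached to it is rescaled.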

Catastrophic forgetting

Catastrophic forgetting occurs when incorporating new information into a plain DL model interferes with the information it has already learned. For instance, consider a model trained to classify 1000 types of flowers, after which a new type of flower is introduced; if the model is fine-tuned only on this new class, its performance on the older classes will degrade [ 183 , 184 ]. Data are continually collected and renewed in many fields, e.g. biology, so this is in fact a highly typical scenario. A direct solution is to train an entirely new model from scratch on both the old and the new data. This solution is time-consuming and computationally intensive; furthermore, it leads to an unstable state for the learned representation of the initial data. At present, three types of ML techniques that mitigate catastrophic forgetting are available, founded on neurophysiological theories of the human brain [ 185 , 186 ]. Techniques of the first type are founded on regularization, such as EWC [ 183 ]. Techniques of the second type employ rehearsal training and dynamic neural network architectures, such as iCaRL [ 187 , 188 ]. Finally, techniques of the third type are founded on dual-memory learning systems [ 189 ]. Refer to [ 190 , 191 , 192 ] for more details.
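
To make the regularization-based family concrete, the following sketch shows the quadratic penalty commonly associated with EWC: parameters that were important for the old task (large Fisher information) are discouraged from moving away from their old values. The flat parameter vectors, the diagonal Fisher estimate passed in as an array, and the names used are simplifying assumptions for this example.

```python
import numpy as np

def ewc_penalty(theta, theta_old, fisher, lam):
    """(lam / 2) * sum_i F_i * (theta_i - theta_old_i)^2."""
    return 0.5 * lam * float(np.sum(fisher * (theta - theta_old) ** 2))

def ewc_loss(new_task_loss, theta, theta_old, fisher, lam=1.0):
    """Loss on the new task plus the penalty that protects the old task."""
    return new_task_loss + ewc_penalty(theta, theta_old, fisher, lam)

theta_old = np.array([0.0, 2.0])   # parameters after training on the old task
theta = np.array([1.0, 2.0])       # candidate parameters for the new task
fisher = np.array([4.0, 100.0])    # per-parameter importance for the old task
penalty = ewc_penalty(theta, theta_old, fisher, lam=2.0)
```

In this toy case only the first parameter has moved, so the penalty depends only on its (comparatively low) importance; moving the second parameter by the same amount would cost 25 times more.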

Model compression

DL models have intensive memory and computational requirements due to their huge complexity and large numbers of parameters, which makes it hard to deploy well-trained models productively [ 193 , 194 ]. Healthcare and environmental science are examples of data-intensive fields. These requirements limit the deployment of DL on machines with limited computational power, mainly in the healthcare field. The numerous methods of assessing human health and the heterogeneity of the data have made the data far more complicated and vastly larger in size [ 195 ]; thus, the issue requires additional computation [ 196 ]. Novel hardware-based parallel processing solutions such as FPGAs and GPUs [ 197 , 198 , 199 ] have been developed to address the computation issues associated with DL. Recently, numerous techniques for compressing DL models, designed to reduce their computational cost from the outset, have also been introduced. These techniques can be classified into four classes. In the first class, redundant parameters (which have no significant impact on model performance) are removed. This class, which includes the famous deep compression method, is called parameter pruning [ 200 ]. In the second class, a larger model uses its distilled knowledge to train a more compact model; thus, it is called knowledge distillation [ 201 , 202 ]. In the third class, compact convolution filters are used to reduce the number of parameters [ 203 ]. In the final class, low-rank factorization is used to estimate the informative parameters that should be preserved [ 204 ]. These four classes cover the most representative techniques for model compression. A more comprehensive discussion of the topic is provided in [ 193 ].
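
Two of the four classes lend themselves to a short sketch on a single weight matrix: magnitude-based pruning as a simple instance of parameter pruning, and truncated SVD as a simple instance of low-rank factorization. These are generic illustrations under assumed function names, not the deep compression method of [ 200 ] or the specific factorization of [ 204 ].

```python
import numpy as np

def magnitude_prune(W, sparsity):
    """Zero the fraction `sparsity` of weights with the smallest magnitude
    (ties at the threshold are pruned as well)."""
    k = int(sparsity * W.size)
    if k == 0:
        return W.copy()
    threshold = np.sort(np.abs(W), axis=None)[k - 1]
    return np.where(np.abs(W) <= threshold, 0.0, W)

def low_rank_factors(W, rank):
    """Approximate an (m, n) matrix W by A @ B with A (m, rank) and B (rank, n)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]
    B = Vt[:rank]
    return A, B

W = np.array([[0.1, -2.0], [3.0, -0.05]])
pruned = magnitude_prune(W, sparsity=0.5)   # the two smallest weights become 0

M = np.outer(np.arange(1.0, 9.0), np.arange(1.0, 7.0))   # an 8 x 6 rank-1 matrix
A, B = low_rank_factors(M, rank=1)                        # 8 + 6 numbers instead of 48
```

Storing `A` and `B` requires rank × (m + n) values instead of m × n, which is where the saving comes from when the chosen rank is small.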

Overfitting

DL models are highly prone to overfitting the data at the training stage due to the vast number of parameters involved, which are correlated in a complex manner. Overfitting reduces the model’s ability to achieve good performance on the test data [ 90 , 205 ]. This problem is not limited to a specific field, but arises across different tasks. Therefore, when proposing DL techniques, this problem should be fully considered and carefully handled. Recent studies suggest that the implicit bias of the training process enables DL models to overcome severe overfitting problems [ 205 , 206 , 207 , 208 ]. Even so, it is still necessary to develop techniques that handle the overfitting problem. The available DL algorithms that ease overfitting can be categorized into three classes. The first class acts on both the model architecture and the model parameters and includes the most familiar approaches, such as weight decay [ 209 ], batch normalization [ 210 ], and dropout [ 90 ]. In DL, the default technique is weight decay [ 209 ], which is used extensively in almost all ML algorithms as a universal regularizer. The second class works on the model inputs and includes data corruption and data augmentation [ 150 , 211 ]. One reason for overfitting is a lack of training data, which causes the learned distribution to deviate from the real distribution. Data augmentation enlarges the training data. By contrast, marginalized data corruption improves the solution without explicitly enlarging the dataset. The final class works on the model output. A recently proposed technique regularizes the model by penalizing over-confident outputs [ 178 ]. This technique has demonstrated the ability to regularize RNNs and CNNs.
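
Two members of the first class, weight decay and dropout, can be written down compactly. The sketch below assumes plain SGD with an L2 penalty and the "inverted" form of dropout (survivors are rescaled at training time); the names and default hyperparameters are illustrative assumptions.

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad, lr=0.1, weight_decay=1e-2):
    """One SGD step on loss + (weight_decay / 2) * ||w||^2."""
    return w - lr * (grad + weight_decay * w)

def dropout(x, p, rng, training=True):
    """Inverted dropout: zero each unit with probability p, rescale the rest by 1 / (1 - p)."""
    if not training or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

rng = np.random.default_rng(0)
activations = np.ones(1000)
dropped = dropout(activations, p=0.5, rng=rng)               # values are 0 or 2
unchanged = dropout(activations, p=0.5, rng=rng, training=False)
```

Even with a zero data gradient, the weight decay term shrinks each weight slightly at every step, which is what discourages large weights.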

Vanishing gradient problem

In general, when backpropagation and gradient-based learning techniques are used to train ANNs, a problem called the vanishing gradient problem arises [ 212 , 213 , 214 ]. More specifically, in each training iteration, every weight of the neural network receives an update proportional to the partial derivative of the error function with respect to the current weight. However, this update may be negligible in some cases due to a vanishingly small gradient, which in the worst case means that no further training is possible and the neural network stops learning completely. The sigmoid function, like some other activation functions, squashes a large input space into a small output space. Thus, the derivative of the sigmoid function is small, because a large variation at the input produces only a small variation at the output. In a shallow network, only a few layers use these activations, so this is not a significant issue. With more layers, however, the gradient becomes very small during training, and the network can no longer be trained efficiently. The back-propagation technique is used to determine the gradients of the neural network. It computes the derivatives of each layer in the reverse direction, starting from the last layer and progressing back to the first layer, and multiplies the derivatives of successive layers together along the way. For instance, when N hidden layers employ an activation function such as the sigmoid function, N small derivatives are multiplied together. Hence, the gradient declines exponentially while propagating back to the first layer. As a result, the biases and weights of the first layers cannot be updated efficiently during the training stage because the gradient is small. 
Moreover, this condition decreases the overall network accuracy, as these first layers are frequently critical to recognizing the essential elements of the input data. However, such a problem can be avoided by employing activation functions that lack the squashing property, i.e., that do not squash the input space into a small range. The ReLU [ 91 ], which maps x to max(0, x), is the most popular choice in the field, as it does not yield a small derivative for positive inputs. Another solution involves employing the batch normalization layer [ 81 ]. As mentioned earlier, the problem occurs when a large input space is squashed into a small space, causing the derivative to vanish. Batch normalization mitigates this issue by simply normalizing the input, so that | x | does not reach the outer edges of the sigmoid function. The normalization process makes most of the input fall in the green area, which ensures that the derivative is large enough for further actions. Furthermore, faster hardware, e.g. that provided by GPUs, can alleviate the issue: it makes standard back-propagation feasible for networks many layers deeper than those in which the vanishing gradient problem was first recognized [ 215 ].
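
The exponential decay described above can be checked numerically. The sketch below considers a deliberately simplified chain of scalar units with all weights fixed at 1, so that the back-propagated gradient is just the product of the activation derivatives; this simplification and the function names are assumptions made for the illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid; never larger than 0.25."""
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return (z > 0).astype(float)

def chained_gradient(activation_grad, pre_activations):
    """Product of per-layer activation derivatives along a chain of scalar units."""
    return float(np.prod(activation_grad(pre_activations)))

z = np.full(20, 0.5)                                   # 20 layers, same pre-activation
sigmoid_chain = chained_gradient(sigmoid_grad, z)      # below 1e-12
relu_chain = chained_gradient(relu_grad, z)            # exactly 1.0
```

Since each sigmoid derivative is at most 0.25, twenty layers shrink the gradient by a factor of at least 0.25 to the power 20, whereas the ReLU chain passes it through unchanged for positive pre-activations.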

Exploding gradient problem

The exploding gradient problem is the opposite of the vanishing gradient problem. Specifically, large error gradients accumulate during back-propagation [ 216 , 217 , 218 ]. These lead to extremely large updates to the weights of the network, meaning that the system becomes unstable. Thus, the model loses its ability to learn effectively. Roughly speaking, as the gradient moves backward through the network during back-propagation, it grows exponentially through repeated multiplication of gradients. The weight values could thus become incredibly large and may overflow to become a not-a-number (NaN) value. Some potential solutions include:

Using different weight regularization techniques.

Redesigning the architecture of the network model.
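
The exponential growth and the eventual overflow can be reproduced with a toy linear network. In this sketch every layer uses the same weight matrix, taken to be a scaled orthogonal matrix so that the gradient norm changes by exactly `weight_scale` per layer; this construction and the function names are assumptions chosen to make the effect easy to see.

```python
import numpy as np

def backprop_norms(weight_scale, depth, width=4, seed=0):
    """Gradient norm after each of `depth` multiplications by W^T,
    where W = weight_scale * (random orthogonal matrix)."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.normal(size=(width, width)))
    W = weight_scale * q
    g = np.ones(width) / np.sqrt(width)      # unit-norm gradient at the output
    norms = []
    for _ in range(depth):
        g = W.T @ g
        norms.append(float(np.linalg.norm(g)))
    return norms

def overflow_demo():
    """A float32 value that exceeds its range becomes inf, and inf - inf is NaN."""
    with np.errstate(over="ignore", invalid="ignore"):
        w = np.array([3e38], dtype=np.float32) * np.float32(10.0)
        update = w - w
    return w, update

exploding = backprop_norms(weight_scale=1.5, depth=10)   # grows like 1.5 ** k
shrinking = backprop_norms(weight_scale=0.5, depth=10)   # decays like 0.5 ** k
```

Keeping the effective per-layer scale close to 1, for example through weight regularization or a different architecture, is what the listed remedies aim for.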

Underspecification

In 2020, a team of computer scientists at Google identified a new challenge called underspecification [ 219 ]. ML models, including DL models, often show surprisingly poor behavior when they are tested in real-world applications such as computer vision, medical imaging, natural language processing, and medical genomics. The reason behind this weak performance is underspecification. It has been shown that small modifications can force a model towards a completely different solution and lead to different predictions in deployment domains. There are different techniques for addressing the underspecification issue. One of them is to design “stress tests” to examine how well a model works on real-world data and to uncover possible issues. Nevertheless, this demands a reliable understanding of the ways in which the model can fail. The team stated that “Designing stress tests that are well-matched to applied requirements, and that provide good “coverage” of potential failure modes is a major challenge”. Underspecification puts major constraints on the credibility of ML predictions and may require reconsidering certain applications. Since ML serves humans through applications such as medical imaging and self-driving cars, this issue requires proper attention.

Applications of deep learning

Presently, DL applications are widespread around the world. These applications include healthcare, social network analysis, audio and speech processing (such as recognition and enhancement), visual data processing (such as multimedia data analysis and computer vision), and NLP (such as translation and sentence classification), among others (Fig. 29 ) [ 220 , 221 , 222 , 223 , 224 ]. These applications have been classified into five categories: classification, localization, detection, segmentation, and registration. Although each of these tasks has its own target, there is fundamental overlap in the pipeline implementation of these applications, as shown in Fig. 30 . Classification categorizes a set of data into classes. Detection locates objects of interest in an image against the background; multiple objects, which could be from different classes, are each surrounded by a bounding box. Localization locates a single object, which is surrounded by a single bounding box. In segmentation (semantic segmentation), the edges of the target objects are traced by outlines, which also label them. Registration refers to fitting one image (which could be 2D or 3D) onto another. One of the most important and wide-ranging DL application areas is healthcare [ 225 , 226 , 227 , 228 , 229 , 230 ]. This area of research is critical due to its relation to human lives. Moreover, DL has shown tremendous performance in healthcare. Therefore, we take DL applications in medical image analysis as an example to describe the DL applications.

figure 29

Examples of DL applications

figure 30

Workflow of deep learning tasks

Classification

Computer-Aided Diagnosis (CADx) is another name sometimes used for classification. Bharati et al. [ 231 ] used a chest X-ray dataset for detecting lung diseases based on a CNN. Another study attempted to read X-ray images by employing a CNN [ 232 ]. In this modality, the comparative accessibility of these images has likely accelerated the progress of DL. The authors of [ 233 ] used an improved pre-trained GoogLeNet CNN with more than 150,000 images for the training and testing processes. This dataset was augmented from 1850 chest X-rays. The authors classified the image orientation into lateral and frontal views and achieved approximately 100% accuracy. Orientation classification by itself has limited clinical use. However, as part of an ultimately fully automated diagnosis workflow, this work demonstrated the efficiency of data augmentation and pre-training in learning the metadata of relevant images. Pneumonia, a chest infection, is a commonly occurring health problem worldwide and is highly treatable. Rajpurkar et al. [ 234 ] utilized CheXNet, which is an improved version of DenseNet [ 112 ] with 121 convolution layers, for classifying fourteen types of disease. These authors used the ChestX-ray14 dataset [ 235 ], which comprises 112,000 images. This network achieved excellent performance in recognizing fourteen different diseases. In particular, pneumonia classification reached an AUC score of 0.7632 using receiver operating characteristic (ROC) analysis. In addition, the network performed as well as or better than both a three-radiologist panel and four individual radiologists. Zuo et al. [ 236 ] adopted a CNN for candidate classification of lung nodules. Shen et al. [ 237 ] employed both Random Forest (RF) and SVM classifiers with CNNs to classify lung nodules. They used three parallel CNNs, each with two convolutional layers. 
The LIDC-IDRI (Lung Image Database Consortium) dataset, which contains 1010 labeled CT lung scans, was used to classify the two types of lung nodules (malignant and benign). Each CNN extracted features from image patches at a different scale, and the learned features were combined to construct the output feature vector. Next, these vectors were classified as malignant or benign using either the RF classifier or an SVM with a radial basis function (RBF) kernel. The model was robust to various levels of input noise and achieved an accuracy of 86% in nodule classification. By contrast, the model of [ 238 ] uses 3D CNNs to complete image data that are missing between PET and MRI images. The Alzheimer Disease Neuroimaging Initiative (ADNI) database, containing 830 PET and MRI patient scans, was utilized in their work. The 3D CNNs were trained with the MRI images as input and the PET images as output. Furthermore, for patients who have no PET images, the trained 3D CNNs were used to reconstruct the PET images. The disease recognition outcomes obtained with these reconstructed images approximately matched the actual ones. However, this approach did not address overfitting, which in turn restricted its potential capacity for generalization. Diagnosing normal versus Alzheimer’s disease patients has been achieved by several CNN models [ 239 , 240 ]. Hosseini-Asl et al. [ 241 ] attained state-of-the-art results, with 99% accuracy, in diagnosing normal versus Alzheimer’s disease patients. These authors applied an auto-encoder architecture using 3D CNNs. The generic brain features were pre-trained on the CADDementia dataset. Subsequently, the outputs of these learned features became inputs to higher layers, which were fine-tuned with deep supervision on the ADNI dataset to differentiate between scans of patients with Alzheimer’s disease, mild cognitive impairment, or normal brains. 
VGGNet and residual neural networks, respectively, were the basis of the VoxCNN and ResNet models developed by Korolev et al. [ 242 ]. They also discriminated between Alzheimer’s disease and normal patients using the ADNI database. Accuracy was 79% for VoxCNN and 80% for ResNet. Both models achieved lower accuracies than Hosseini-Asl’s work. On the other hand, as Korolev et al. stated, the algorithms were simpler to implement and did not require hand-crafted features. In 2020, Mehmood et al. [ 240 ] trained a CNN-based network called “SCNN” on MRI images for the classification of Alzheimer’s disease. They achieved state-of-the-art results with an accuracy of 99.05%.
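
Several of the results above are reported as AUC scores. As a reminder of what that number measures, the sketch below computes the area under the ROC curve through its rank interpretation: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name and the toy scores are assumptions for illustration only.

```python
import numpy as np

def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]          # every positive against every negative
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))

auc = roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])   # 3 of 4 pairs ordered correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation. The pairwise formulation is quadratic in the number of samples, so it is suited to small illustrations rather than large datasets.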

Recently, CNNs have taken some medical imaging classification tasks to a different level, from traditional diagnosis to automated diagnosis with tremendous performance. Examples of these tasks are diabetic foot ulcer (DFU) classification (into normal and abnormal (DFU) classes) [ 87 , 243 , 244 , 245 , 246 ], sickle cell anemia (SCA) classification (into normal, abnormal (SCA), and other blood components) [ 86 , 247 ], breast cancer classification of hematoxylin–eosin-stained breast biopsy images into four classes (invasive carcinoma, in-situ carcinoma, benign tumor, and normal tissue) [ 42 , 88 , 248 , 249 , 250 , 251 , 252 ], and multi-class skin cancer classification [ 253 , 254 , 255 ].

In 2020, CNNs played a vital role in the early diagnosis of the novel coronavirus disease (COVID-19). CNNs have become the primary tool for automatic COVID-19 diagnosis from chest X-ray images in many hospitals around the world [ 256 , 257 , 258 , 259 , 260 ]. More details about classification in medical imaging applications can be found in [ 226 , 261 , 262 , 263 , 264 , 265 ].

Localization

Although applications in anatomy education could increase, the practicing clinician is more likely to be interested in the localization of normal anatomy. Localization could also be applied in completely automatic end-to-end applications, in which radiological images are examined and described without human intervention [ 266 , 267 , 268 ]. Zhao et al. [ 269 ] introduced a new deep learning-based approach to localize pancreatic tumors in projection X-ray images for image-guided radiation therapy without the need for fiducials. Roth et al. [ 270 ] constructed and trained a CNN with five convolutional layers to classify around 4000 transverse-axial CT images. These authors used five categories for classification: legs, pelvis, liver, lung, and neck. After data augmentation techniques were applied, they achieved an AUC score of 0.998, and the classification error rate of the model was 5.9%. To detect the positions of the spleen, kidney, heart, and liver, Shin et al. [ 271 ] employed stacked auto-encoders on 78 contrast-enhanced MRI scans of the stomach area containing the kidneys or liver. Temporal and spatial domains were used to learn the hierarchical features. Depending on the organ, these approaches achieved detection accuracies of 62–79%. Sirazitdinov et al. [ 268 ] presented an ensemble of two convolutional neural networks, namely RetinaNet and Mask R-CNN, for pneumonia detection and localization.

Detection

Computer-Aided Detection (CADe) is another name sometimes used for detection. For both the clinician and the patient, overlooking a lesion on a scan may have dire consequences. Thus, detection is a field of study requiring both accuracy and sensitivity [ 272 , 273 , 274 ]. Chouhan et al. [ 275 ] introduced an innovative deep learning framework for the detection of pneumonia by adopting the idea of transfer learning. Their approach obtained an accuracy of 96.4% with a recall of 99.62% on unseen data. In the area of COVID-19 and pulmonary disease, several convolutional neural network approaches have been proposed for automatic detection from X-ray images, and these have shown excellent performance [ 46 , 276 , 277 , 278 , 279 ].

In the area of skin cancer, several applications have been introduced for the detection task [ 280 , 281 , 282 ]. Thurnhofer-Hemsi et al. [ 283 ] introduced a deep learning approach for skin cancer detection by fine-tuning five state-of-the-art convolutional neural network models. They addressed the lack of training data by adopting transfer learning and data augmentation techniques. The DenseNet201 network showed superior results compared to the other models.

Another interesting area is that of histopathological images, which are progressively being digitized. Several papers have been published in this field [ 284 , 285 , 286 , 287 , 288 , 289 , 290 ]. Human pathologists read these images laboriously; they search for malignancy markers, such as a high index of cell proliferation using molecular markers (e.g. Ki-67), signs of cellular necrosis, abnormal cellular architecture, increased numbers of mitotic figures denoting increased cell replication, and enlarged nucleus-to-cytoplasm ratios. Note that a histopathological slide may contain a huge number of cells (up to the thousands). Thus, the risk of overlooking abnormal neoplastic regions is high when wading through these cells at high levels of magnification. Ciresan et al. [ 291 ] employed CNNs of 11–13 layers for identifying mitotic figures. Fifty breast histology images from the MITOS dataset were used. Their technique attained recall and precision scores of 0.7 and 0.88, respectively. Sirinukunwattana et al. [ 292 ] utilized 100 histology images of colorectal adenocarcinoma to detect cell nuclei using CNNs. Roughly 30,000 nuclei were hand-labeled for training purposes. The novelty of this approach was the use of a Spatially Constrained CNN, which detects the centers of nuclei using the surrounding spatial context and spatial regression. Instead of a CNN, Xu et al. [ 293 ] employed a stacked sparse auto-encoder (SSAE) to identify nuclei in histological slides of breast cancer, achieving recall and precision scores of 0.83 and 0.89, respectively. They thereby showed that unsupervised learning techniques can also be used effectively in this field. Albarqouni et al. [ 294 ] investigated the problem of insufficient labeling in medical images. They crowd-sourced the labeling of mitoses in histology images of breast cancer from amateurs online. 
Feeding the crowd-sourced input labels into the CNN offers a way to address the recurrent issue of inadequate labeling in medical image analysis. This method represents a remarkable proof-of-concept effort. In 2020, Lei et al. [ 285 ] employed deep convolutional neural networks for the automatic identification of mitotic candidates from histological sections for mitosis screening. They obtained state-of-the-art detection results on the dataset of the International Conference on Pattern Recognition (ICPR) 2012 Mitosis Detection Competition.

Segmentation

Although MRI and CT image segmentation research covers different organs such as knee cartilage, prostate, and liver, most research work has concentrated on brain segmentation, particularly of tumors [ 295 , 296 , 297 , 298 , 299 , 300 ]. This issue is highly significant in surgical preparation, where the precise tumor boundaries are needed to keep the surgical resection as small as possible. During surgery, excessive sacrificing of key brain regions may lead to neurological deficits including cognitive damage, emotionlessness, and limb difficulty. Conventionally, medical anatomical segmentation was done by hand; more specifically, the clinician draws outlines slice by slice through the complete stack of the CT or MRI volume. Thus, a solution that automates this painstaking work is highly desirable. Wadhwa et al. [ 301 ] presented a brief overview of brain tumor segmentation of MRI images. Akkus et al. [ 302 ] wrote a thorough review of brain MRI segmentation that addressed the different metrics and CNN architectures employed. Moreover, they explained several competitions in detail, as well as their datasets, which included Ischemic Stroke Lesion Segmentation (ISLES), Mild Traumatic brain injury Outcome Prediction (MTOP), and Brain Tumor Segmentation (BRATS).

Chen et al. [ 299 ] proposed convolutional neural networks for precise brain tumor segmentation. Their method combines several techniques for better feature learning, including the DeepMedic model, a novel dual-force training scheme, a label distribution-based loss function, and Multi-Layer Perceptron-based post-processing. They evaluated their method on two recent brain tumor segmentation datasets, BRATS 2017 and BRATS 2015. Hu et al. [ 300 ] introduced a brain tumor segmentation method that adopts a multi-cascaded convolutional neural network (MCCNN) and fully connected conditional random fields (CRFs). The achieved results were excellent compared with the state-of-the-art methods.

Moeskops et al. [ 303 ] employed three parallel-running CNNs, each of which had a 2D input patch of a different size, for segmenting and classifying MRI brain images. These images, from 35 adults and 22 pre-term infants, were classified into various tissue categories such as cerebrospinal fluid, grey matter, and white matter. Using three different patch sizes lets each network capture different image aspects: the larger patches incorporated spatial features, while the smallest patches concentrated on local textures. In general, the algorithm achieved Dice coefficients in the range of 0.82–0.87, a satisfactory accuracy. Although 2D image slices are employed in the majority of segmentation research, Milletari et al. [ 304 ] implemented a 3D CNN for segmenting MRI prostate images. They used the PROMISE2012 challenge dataset, from which fifty MRI scans were used for training and thirty for testing. Their V-Net was inspired by the U-Net architecture of Ronneberger et al. [ 305 ]. This model attained a Dice coefficient of 0.869, the same as the winning teams in the competition. To reduce overfitting while building a deeper CNN with 11 convolutional layers, Pereira et al. [ 306 ] deliberately used small 3×3 filters. Their model used MRI scans of 274 gliomas (a type of brain tumor) for training. They achieved first place in the 2013 BRATS challenge, as well as second place in the 2015 BRATS challenge. Havaei et al. [ 307 ] also considered gliomas using the 2013 BRATS dataset. They investigated different 2D CNN architectures. Compared to the winner of BRATS 2013, their algorithm was faster, requiring only 3 min to execute rather than 100 min. Their model is based on a cascaded architecture and is therefore referred to as InputCascadeCNN. Chen et al. [ 308 ] introduced fully connected conditional random fields (CRFs), atrous spatial pyramid pooling, and up-sampled filters. These authors aimed to enhance the accuracy of localization and enlarge the field of view of every filter at multiple scales. Their model, DeepLab, attained 79.7% mIOU (mean Intersection Over Union) and obtained an excellent performance in the PASCAL VOC-2012 image segmentation task.
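The Dice coefficient and mIOU figures quoted above are both overlap measures between a predicted and a reference segmentation. The following is a minimal sketch of how they can be computed, assuming the masks are given as flat lists of labels; the function names and the toy masks are illustrative and are not taken from any of the cited works.

```python
def dice_coefficient(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two flat binary masks."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    return 2.0 * intersection / total if total else 1.0


def iou(pred, target):
    """Intersection over union for two flat binary masks."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return intersection / union if union else 1.0


def mean_iou(pred_labels, target_labels, num_classes):
    """Average the per-class IoU over classes present in either label map."""
    scores = []
    for c in range(num_classes):
        p = [label == c for label in pred_labels]
        t = [label == c for label in target_labels]
        if any(p) or any(t):
            scores.append(iou(p, t))
    return sum(scores) / len(scores) if scores else 0.0


# Toy 6-pixel masks (hypothetical data): 1 = tumor, 0 = background.
pred_mask = [1, 1, 0, 0, 1, 0]
target_mask = [1, 0, 0, 0, 1, 1]

dice = dice_coefficient(pred_mask, target_mask)
miou = mean_iou(pred_mask, target_mask, num_classes=2)
```

For any pair of masks the Dice value is at least as large as the IoU, which is one reason scores reported with the two measures are not directly comparable.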

Recently, several deep learning techniques have been employed for the automatic segmentation of COVID-19 lung infection from CT images, which helps to detect the development of COVID-19 infection [ 309 , 310 , 311 , 312 ].

Registration

Given two input images, the canonical image registration procedure usually comprises four main stages [ 313 , 314 ]:

Target Selection: determines the input image onto which the second input image must be accurately superimposed.

Feature Extraction: computes the set of features extracted from each input image.

Feature Matching: finds similarities between the previously obtained features.

Pose Optimization: minimizes the distance between both input images.

The result of the registration procedure is then the geometric transformation (e.g. translation, rotation, scaling, etc.) that places both input images within the same coordinate system such that the distance between them is minimal, i.e. their level of superimposition/overlapping is optimal. An extensive review of this topic is beyond the scope of this work; nevertheless, a short summary follows.
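To make the Pose Optimization stage concrete, the sketch below estimates the 2D rotation and translation that best align one point set with another in the least-squares sense. It assumes that the Feature Extraction and Feature Matching stages have already produced matched point pairs, and it uses a classical closed-form solution rather than a DL-based method; all names and the toy points are hypothetical.

```python
import math


def estimate_rigid_2d(src, dst):
    """Least-squares rotation angle and translation mapping src onto dst.

    src and dst are equal-length lists of matched (x, y) points.
    """
    n = len(src)
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(q[0] for q in dst) / n
    dy = sum(q[1] for q in dst) / n
    s_cos = 0.0
    s_sin = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        ax, ay = px - sx, py - sy      # centered source point
        bx, by = qx - dx, qy - dy      # centered target point
        s_cos += ax * bx + ay * by     # dot product term
        s_sin += ax * by - ay * bx     # 2D cross product term
    theta = math.atan2(s_sin, s_cos)
    c, s = math.cos(theta), math.sin(theta)
    # Translation moves the rotated source centroid onto the target centroid.
    tx = dx - (c * sx - s * sy)
    ty = dy - (s * sx + c * sy)
    return theta, (tx, ty)


def apply_rigid_2d(points, theta, translation):
    """Rotate each point by theta, then translate it."""
    c, s = math.cos(theta), math.sin(theta)
    tx, ty = translation
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]


def mean_squared_distance(a, b):
    """Average squared Euclidean distance between matched points."""
    return sum((ax - bx) ** 2 + (ay - by) ** 2
               for (ax, ay), (bx, by) in zip(a, b)) / len(a)


# Toy example: the target is the moving set rotated by 30 degrees and shifted.
moving = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (1.5, 1.0)]
target = apply_rigid_2d(moving, math.radians(30), (2.0, -1.0))

theta, shift = estimate_rigid_2d(moving, target)
registered = apply_rigid_2d(moving, theta, shift)
residual = mean_squared_distance(registered, target)
```

The returned angle and translation together form the geometric transformation described above; with noisy or partially wrong matches the residual would be non-zero, and a robust estimator would normally be needed.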

Commonly, the input images for a DL-based registration approach can take various forms, e.g. point clouds, voxel grids, and meshes. Additionally, some techniques accept as input the result of the Feature Extraction or Feature Matching steps of the canonical scheme. Specifically, the outcome could be some data in a particular form as well as the result of steps from the classical pipeline (feature vector, matching vector, and transformation). With the newest DL-based methods, however, a novel conceptual type of ecosystem emerges. It contains acquired characteristics about the target, materials, and their behavior that can be registered with the input data. Such a conceptual ecosystem is formed by a neural network and the manner of its training, and it can be counted as an input to the registration approach. However, it is not an input that one might adopt in every registration situation, since it corresponds to an internal data representation.

From a DL viewpoint, the interpretation of the conceptual design enables differentiating the input data of a registration approach into defined or non-defined models. In particular, defined models depict particular spatial data (e.g. 2D or 3D), while a non-defined model is a generalization of a data set created by a learning system. Yumer et al. [ 315 ] developed a framework in which the model acquires characteristics of objects, so that it can identify what a sportier car or a more comfortable chair looks like, and can adjust a 3D model to fit those characteristics while maintaining the main characteristics of the original data. Likewise, a fundamental aspect of the unsupervised learning method introduced by Ding et al. [ 316 ] is that there is no target for the registration approach. In this instance, the network is capable of placing each input point cloud in a global space, solving SLAM problems in which many point clouds have to be registered rigidly. On the other hand, Mahadevan [ 317 ] proposed combining two conceptual models, building on the development of Imagination Machines, to provide flexible artificial intelligence systems and relationships between the learned phases through training schemes that are not based on labels and classifications. Another practical application of DL, especially CNNs, to image registration is the 3D reconstruction of objects. Wang et al. [ 318 ] applied an adversarial approach using CNNs to rebuild a 3D model of an object from its 2D image. The network learns many objects and implicitly accomplishes the registration between the image and the conceptual model. Similarly, Hermoza et al. [ 319 ] used a GAN to predict the missing geometry of damaged archaeological objects, providing the reconstructed object in a voxel grid format together with a label identifying its class.

DL for medical image registration has numerous applications, which are listed in several review papers [ 320 , 321 , 322 ]. Yang et al. [ 323 ] implemented stacked convolutional layers as an encoder-decoder approach to predict the morphing of each input pixel into its final configuration, using MRI brain scans from the OASIS dataset. They employed a registration model known as Large Deformation Diffeomorphic Metric Mapping (LDDMM) and attained remarkable reductions in computation time. Miao et al. [ 324 ] used synthetic X-ray images to train a five-layer CNN to register 3D models of a trans-esophageal probe, a hand implant, and a knee implant onto 2D X-ray images for pose estimation. They determined that their model achieved an execution time of 0.1 s, an important improvement over conventional intensity-based registration techniques; moreover, it achieved effective registrations 79–99% of the time. Li et al. [ 325 ] introduced a neural network-based approach for the non-rigid 2D–3D registration of lateral cephalogram and volumetric cone-beam CT (CBCT) images.

Computational approaches

For computationally intensive applications, complex ML and DL approaches have rapidly been identified as the most significant techniques and are widely used in different fields. Improved algorithms, combined with strong computational performance and large datasets, make it possible to effectively run applications that were previously either impossible or difficult to consider.

Currently, several standard DNN configurations are available. The interconnection patterns between layers and the total number of layers represent the main differences between these configurations. Table 2 illustrates the growth rate of the overall number of layers over time, which seems to be far faster than the “Moore’s Law growth rate”. In standard DNNs, the number of layers grew by around 2.3× each year in the period from 2012 to 2016. Recent investigations of future ResNet versions reveal that the number of layers can be extended up to 1000. An SGD technique is employed to fit the weights (or parameters), while different optimization techniques are employed to update the parameters during the DNN training process. Repeated updates are required to improve network accuracy, with each update contributing only a small improvement. For example, training ResNet on ImageNet, a large dataset containing more than 14 million images, takes around 30K to 40K iterations to converge to a steady solution. In addition, the overall computational load, as an upper-level estimate, may exceed \(10^{20}\) FLOPS when both the training set size and the DNN complexity increase.
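As a rough check on the growth figure above, compounding 2.3× per year over the four years from 2012 to 2016 gives roughly a 28-fold increase in layer count. The sketch below compares this with a doubling every two years, which is an assumed reading of the “Moore’s Law growth rate” and is not a value taken from Table 2.

```python
def compound_growth(rate_per_year, years):
    """Total growth factor after compounding a yearly rate."""
    return rate_per_year ** years


years = 2016 - 2012

# Layer-count growth quoted in the text: about 2.3x per year.
layer_growth = compound_growth(2.3, years)          # roughly 28x

# Assumption: Moore's Law taken as a doubling every two years,
# i.e. a yearly factor of sqrt(2).
moore_growth = compound_growth(2 ** 0.5, years)     # roughly 4x

ratio = layer_growth / moore_growth                 # roughly 7x faster
```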

Prior to 2008, boosting the training to a satisfactory extent was achieved by using GPUs. Usually, days or weeks are needed for a training session, even with GPU support. Consequently, several optimization strategies were developed to reduce the extensive learning time. The computational requirements are expected to grow as DNNs continue to increase in both complexity and size.

In addition to the computational load, the memory bandwidth and capacity have a significant effect on the overall training performance and, to a lesser extent, on inference. More specifically, in convolutional layers the parameters are shared across every position of the input data, there is a sizeable amount of reused data, and the computation exhibits a high computation-to-bandwidth ratio. By contrast, in FC layers there are no shared parameters, the amount of reused data is extremely small, and the computation-to-bandwidth ratio is extremely small. Table 3 presents a comparison between different aspects related to the devices. The table is intended to clarify the trade-offs involved in choosing the best way to configure a system based on FPGA, GPU, or CPU devices. It should be noted that each has corresponding weaknesses and strengths; accordingly, there is no clear one-size-fits-all solution.
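The difference in computation-to-bandwidth ratio between layer types can be illustrated with a back-of-the-envelope estimate. The sketch below counts floating-point operations per byte moved for one convolutional layer and one FC layer, assuming a batch size of 1, 32-bit values, and that every input, weight, and output value is transferred exactly once. The layer sizes are hypothetical examples and are not taken from Table 3.

```python
def conv_layer_intensity(h, w, c_in, c_out, k, bytes_per_value=4):
    """FLOPs per byte for a stride-1, same-padded k x k convolution."""
    macs = h * w * c_in * c_out * k * k          # multiply-accumulates
    flops = 2 * macs                             # 1 MAC = 2 FLOPs
    values = (h * w * c_in                       # input activations
              + k * k * c_in * c_out             # shared weights
              + h * w * c_out)                   # output activations
    return flops / (values * bytes_per_value)


def fc_layer_intensity(n_in, n_out, bytes_per_value=4):
    """FLOPs per byte for a fully connected layer (matrix-vector product)."""
    flops = 2 * n_in * n_out
    values = n_in + n_in * n_out + n_out         # input + weights + output
    return flops / (values * bytes_per_value)


# Hypothetical layer shapes, chosen only for illustration.
conv_ratio = conv_layer_intensity(h=56, w=56, c_in=64, c_out=64, k=3)
fc_ratio = fc_layer_intensity(n_in=4096, n_out=4096)
```

Under these assumptions the convolutional layer performs on the order of a hundred operations per byte, while the FC layer performs about half an operation per byte, because each FC weight is used only once per input.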

Although GPU processing has enhanced the ability to address the computational challenges related to such networks, the maximum GPU (or CPU) performance is not achieved, and several techniques or models have turned out to be strongly bound by bandwidth. In the worst cases, GPU efficiency is between 15 and 20% of the maximum theoretical performance. Addressing this issue requires enlarging the memory bandwidth using high-bandwidth stacked memory. Next, different approaches based on FPGA, GPU, and CPU are detailed.

CPU-based approach

CPU nodes usually offer robust network connectivity, storage capabilities, and large memory. Although CPU nodes are more general-purpose than FPGA or GPU nodes, they cannot match them in raw computational capability; they are instead typically used where greater network capability and larger memory capacity are required.

GPU-based approach

GPUs are extremely effective for several basic DL primitives, which include highly parallel operations such as activation functions, matrix multiplication, and convolutions [ 326 , 327 , 328 , 329 , 330 ]. Incorporating HBM-stacked memory into up-to-date GPU models significantly enhances the bandwidth. This enhancement allows numerous primitives to efficiently utilize all computational resources of the available GPUs. For dense linear algebra operations, the improvement in GPU performance over CPU performance is usually 10–20:1.

Maximizing parallel processing is the basis of the GPU programming model. For example, a GPU model may involve up to sixty-four computational units, with four SIMD engines per computational unit and sixteen floating-point computation lanes per SIMD engine. The peak performance is 25 TFLOPS (fp16) and 10 TFLOPS (fp32) as utilization approaches 100%. Additional GPU performance may be achieved if the vector add and multiply functions are combined into inner-product instructions that match the primitives of matrix operations.
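The relationship between the lane count and the quoted peak throughput can be worked through as follows. The sketch assumes that each lane completes one fused multiply-add (counted as 2 floating-point operations) per clock cycle; the clock frequency is not stated in the text and is derived here only as the value implied by the fp32 figure under that assumption.

```python
def peak_flops(compute_units, simd_per_unit, lanes_per_simd,
               flops_per_lane_per_cycle, clock_hz):
    """Theoretical peak throughput at 100% utilization."""
    lanes = compute_units * simd_per_unit * lanes_per_simd
    return lanes * flops_per_lane_per_cycle * clock_hz


# Figures quoted in the text: 64 units x 4 SIMD engines x 16 lanes.
total_lanes = 64 * 4 * 16                       # 4096 lanes

# Assumption: one fused multiply-add per lane per cycle = 2 FLOPs.
flops_per_lane = 2

# Clock implied by the quoted 10 TFLOPS fp32 peak (not given in the text).
implied_clock_hz = 10e12 / (total_lanes * flops_per_lane)   # about 1.22 GHz

fp32_peak = peak_flops(64, 4, 16, flops_per_lane, implied_clock_hz)
```

Real workloads reach only a fraction of this figure whenever the lanes wait on memory.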

For DNN training, the GPU is usually considered to be an optimized design, while for inference operations, it may also offer considerable performance improvements.

FPGA-based approach

FPGAs are widely utilized in various tasks, including deep learning [ 199 , 247 , 331 , 332 , 333 , 334 ]. Inference accelerators are commonly implemented using FPGAs. An FPGA can be effectively configured to reduce the unnecessary or overhead functions involved in GPU systems. Compared to the GPU, the FPGA is restricted to integer inference and relatively weak floating-point performance. The main FPGA feature is the capability to dynamically reconfigure the array characteristics (at run-time), as well as the capability to configure the array by means of an effective design with little or no overhead.

As mentioned earlier, the FPGA offers better performance and latency per watt than the GPU and CPU in DL inference operations. Custom high-performance hardware, pruned networks, and reduced arithmetic precision are three factors that enable the FPGA to implement DL algorithms at this level of efficiency. In addition, FPGAs may be employed to implement CNN overlay engines with over 80% efficiency, 8-bit precision, and over 15 TOPS peak performance; this has been used for a few conventional CNNs, as Xilinx and partners demonstrated recently. Pruning techniques, by contrast, are mostly employed in the LSTM context. Model sizes can be reduced by up to 20×, which provides an important benefit when implementing the optimal solution, as MLP neural processing demonstrated. A recent study of fixed-point precision and custom floating-point implementations has revealed that going below 8 bits is extremely promising; moreover, it helps to further advance peak-performance FPGA implementations of DNN models.
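Reduced arithmetic precision can be illustrated with a simple 8-bit quantization of a weight vector. The sketch below uses symmetric per-tensor quantization, which is one common scheme chosen here for illustration; it is not claimed to be the scheme used in the FPGA work cited above, and the weight values are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to signed 8-bit integers."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range used.
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate floating-point values."""
    return [q * scale for q in quantized]


weights = [0.42, -1.27, 0.03, 0.9, -0.51]       # hypothetical weights
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)

# The rounding error of each value is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight then occupies one byte instead of four, and the multiply-accumulate operations can run on integer hardware, at the cost of the bounded rounding error computed above.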

Evaluation metrics

Evaluation metrics adopted within DL tasks play a crucial role in achieving the optimized classifier [ 335 ]. They are used in a typical data classification procedure in two main stages: training and testing. During the training stage, the evaluation metric is used to optimize the classification algorithm. This means that it acts as a discriminator, used to discriminate and select the optimized solution, i.e. the one that can generate a more accurate forecast of future evaluations of a specific classifier. During the testing stage, the evaluation metric acts as an evaluator, used to measure the efficiency of the created classifier on unseen data. As given in Eq. 20 , TN and TP are defined as the number of negative and positive instances, respectively, which are successfully classified. In addition, FN and FP are defined as the number of misclassified positive and negative instances, respectively. Some of the most well-known evaluation metrics are listed below.

Accuracy: Calculates the ratio of correctly predicted classes to the total number of samples evaluated (Eq. 20 ): \(\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}\).

Sensitivity or Recall: Utilized to calculate the fraction of positive patterns that are correctly classified (Eq. 21 ): \(\text{Sensitivity} = \frac{TP}{TP + FN}\).

Specificity: Utilized to calculate the fraction of negative patterns that are correctly classified (Eq. 22 ): \(\text{Specificity} = \frac{TN}{TN + FP}\).

Precision: Utilized to calculate the fraction of patterns predicted as positive that are actually positive (Eq. 23 ): \(\text{Precision} = \frac{TP}{TP + FP}\).

F1-Score: Calculates the harmonic mean of the recall and precision rates (Eq. 24 ): \(F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\).

J Score: This metric is also called Youden's J statistic (Eq. 25 ): \(J = \text{Sensitivity} + \text{Specificity} - 1\).

False Positive Rate (FPR): This metric refers to the probability of a false alarm (Eq. 26 ): \(\text{FPR} = 1 - \text{Specificity} = \frac{FP}{FP + TN}\).

Area Under the ROC Curve: AUC is a common ranking-type metric. It is utilized to conduct comparisons between learning algorithms [ 336 , 337 , 338 ], as well as to construct an optimal learning model [ 339 , 340 ]. In contrast to probability and threshold metrics, the AUC value exposes the entire classifier ranking performance. The following formula is used to calculate the AUC value for a two-class problem [ 341 ] (Eq. 27 ): \(\text{AUC} = \frac{S_{p} - n_{p}(n_{p}+1)/2}{n_{p} n_{n}}\).

Here, \(S_{p}\) represents the sum of the ranks of all positive samples. The number of negative and positive samples is denoted as \(n_{n}\) and \(n_{p}\) , respectively. The AUC value has been shown, both empirically and theoretically, to be better than the accuracy metric for identifying an optimized solution and evaluating the classifier performance during classification training.

When considering the discrimination and evaluation processes, the AUC performance was excellent. However, for multiclass problems, the AUC computation is costly, particularly when discriminating among a large number of created solutions. The time complexity for computing the AUC is \(O \left( |C|^{2} \; n\log n\right) \) with respect to the Hand and Till AUC model [ 341 ] and \(O \left( |C| \; n\log n\right) \) according to Provost and Domingos' AUC model [ 336 ].
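The metrics listed above can be sketched in a few lines using their standard definitions. The example assumes a two-class problem with non-zero denominators, takes the confusion-matrix counts TP, TN, FP, and FN as inputs, and computes the AUC from the rank sum of the positive samples, following the rank-based form attributed to Hand and Till [ 341 ]. The function names and the toy data are illustrative.

```python
def classification_metrics(tp, tn, fp, fn):
    """Threshold metrics computed from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
        "youden_j": sensitivity + specificity - 1,
        "fpr": fp / (fp + tn),
    }


def auc_from_ranks(scores, labels):
    """Two-class AUC from the rank sum S_p of the positive samples.

    Samples are ranked by ascending score (rank 1 = lowest score);
    tied scores receive the average of the ranks they span.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1           # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    n_p = sum(1 for y in labels if y == 1)
    n_n = len(labels) - n_p
    s_p = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (s_p - n_p * (n_p + 1) / 2) / (n_p * n_n)


# Toy counts and scores (hypothetical data).
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
auc = auc_from_ranks(scores=[0.1, 0.4, 0.35, 0.8], labels=[0, 0, 1, 1])
```

The sort dominates the cost of the AUC computation, which is where the \(n\log n\) factor in the complexities quoted above comes from.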

Frameworks and datasets

Several DL frameworks and datasets have been developed in the last few years. Various frameworks and libraries have also been used to expedite the work with good results, and their use has made the training process easier. Table 4 lists the most utilized frameworks and libraries.

Based on the star ratings on GitHub, as well as our own background in the field, TensorFlow is deemed the most effective and easy to use. It has the ability to work on several platforms. (GitHub is one of the biggest software hosting sites, while GitHub stars indicate how well-regarded a project is on the site.) Moreover, several benchmark datasets are employed for different DL tasks. Some of these are listed in Table 5 .

Summary and conclusion

Finally, a brief discussion is needed to gather the relevant findings of this extensive review. An itemized analysis follows to conclude our review and outline future directions.

DL still experiences difficulties in simultaneously modeling multiple complex modalities of data. Multimodal DL is another common approach in recent DL developments.

DL requires sizeable datasets (labeled data preferred) to predict unseen data and to train the models. This challenge turns out to be particularly difficult when real-time data processing is required or when the provided datasets are limited (such as in the case of healthcare data). To alleviate this issue, TL and data augmentation have been researched over the last few years.

Although ML slowly transitions to semi-supervised and unsupervised learning to manage practical data without the need for manual human labeling, many of the current deep-learning models utilize supervised learning.

The CNN performance is greatly influenced by hyper-parameter selection. Any small change in the hyper-parameter values will affect the general CNN performance. Therefore, careful parameter selection is an extremely significant issue that should be considered during optimization scheme development.

Powerful and robust hardware resources like GPUs are required for effective CNN training. Moreover, they are also required for exploring the efficiency of using CNNs in smart and embedded systems.

In the CNN context, ensemble learning [ 342 , 343 ] represents a prospective research area. Combining multiple, different architectures helps the model improve its generalizability across different image categories by extracting several levels of semantic image representation. Similarly, ideas such as new activation functions, dropout, and batch normalization also merit further investigation.

Exploiting depth and different structural adaptations significantly improves the CNN learning capacity. Substituting the traditional layer configuration with blocks results in significant advances in CNN performance, as has been shown in the recent literature. Currently, developing novel and efficient block architectures is the main trend in new research models of CNN architectures. HRNet is only one example that shows there are always ways to improve the architecture.

It is expected that cloud-based platforms will play an essential role in the future development of computational DL applications. Utilizing cloud computing offers a solution to handling the enormous amount of data. It also helps to increase efficiency and reduce costs. Furthermore, it offers the flexibility to train DL architectures.

With the recent development in computational tools including a chip for neural networks and a mobile GPU, we will see more DL applications on mobile devices. It will be easier for users to use DL.

Regarding the lack of training data, it is expected that various transfer learning techniques will be considered, such as training the DL model on large unlabeled image datasets and then transferring the knowledge to train the DL model on a small number of labeled images for the same task.

Lastly, this overview provides a starting point for members of the community who are interested in the field of DL. Furthermore, it should help researchers decide on the most suitable direction of work in order to provide more accurate alternatives to the field.

Availability of data and materials

Not applicable.

Rozenwald MB, Galitsyna AA, Sapunov GV, Khrameeva EE, Gelfand MS. A machine learning framework for the prediction of chromatin folding in Drosophila using epigenetic features. PeerJ Comput Sci. 2020;6:307.

Amrit C, Paauw T, Aly R, Lavric M. Identifying child abuse through text mining and machine learning. Expert Syst Appl. 2017;88:402–18.

Hossain E, Khan I, Un-Noor F, Sikander SS, Sunny MSH. Application of big data and machine learning in smart grid, and associated security concerns: a review. IEEE Access. 2019;7:13960–88.

Crawford M, Khoshgoftaar TM, Prusa JD, Richter AN, Al Najada H. Survey of review spam detection using machine learning techniques. J Big Data. 2015;2(1):23.

Deldjoo Y, Elahi M, Cremonesi P, Garzotto F, Piazzolla P, Quadrana M. Content-based video recommendation system based on stylistic visual features. J Data Semant. 2016;5(2):99–113.

Al-Dulaimi K, Chandran V, Nguyen K, Banks J, Tomeo-Reyes I. Benchmarking hep-2 specimen cells classification using linear discriminant analysis on higher order spectra features of cell shape. Pattern Recogn Lett. 2019;125:534–41.

Liu W, Wang Z, Liu X, Zeng N, Liu Y, Alsaadi FE. A survey of deep neural network architectures and their applications. Neurocomputing. 2017;234:11–26.

Pouyanfar S, Sadiq S, Yan Y, Tian H, Tao Y, Reyes MP, Shyu ML, Chen SC, Iyengar S. A survey on deep learning: algorithms, techniques, and applications. ACM Comput Surv (CSUR). 2018;51(5):1–36.

Alom MZ, Taha TM, Yakopcic C, Westberg S, Sidike P, Nasrin MS, Hasan M, Van Essen BC, Awwal AA, Asari VK. A state-of-the-art survey on deep learning theory and architectures. Electronics. 2019;8(3):292.

Potok TE, Schuman C, Young S, Patton R, Spedalieri F, Liu J, Yao KT, Rose G, Chakma G. A study of complex deep learning networks on high-performance, neuromorphic, and quantum computers. ACM J Emerg Technol Comput Syst (JETC). 2018;14(2):1–21.

Adeel A, Gogate M, Hussain A. Contextual deep learning-based audio-visual switching for speech enhancement in real-world environments. Inf Fusion. 2020;59:163–70.

Tian H, Chen SC, Shyu ML. Evolutionary programming based deep learning feature selection and network construction for visual data classification. Inf Syst Front. 2020;22(5):1053–66.

Young T, Hazarika D, Poria S, Cambria E. Recent trends in deep learning based natural language processing. IEEE Comput Intell Mag. 2018;13(3):55–75.

Koppe G, Meyer-Lindenberg A, Durstewitz D. Deep learning for small and big data in psychiatry. Neuropsychopharmacology. 2021;46(1):176–90.

Dalal N, Triggs B. Histograms of oriented gradients for human detection. In: 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05), vol. 1. IEEE; 2005. p. 886–93.

Lowe DG. Object recognition from local scale-invariant features. In: Proceedings of the seventh IEEE international conference on computer vision, vol. 2. IEEE; 1999. p. 1150–7.

Wu L, Hoi SC, Yu N. Semantics-preserving bag-of-words models and applications. IEEE Trans Image Process. 2010;19(7):1908–20.

LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436–44.

Yao G, Lei T, Zhong J. A review of convolutional-neural-network-based action recognition. Pattern Recogn Lett. 2019;118:14–22.

Dhillon A, Verma GK. Convolutional neural network: a review of models, methodologies and applications to object detection. Prog Artif Intell. 2020;9(2):85–112.

Khan A, Sohail A, Zahoora U, Qureshi AS. A survey of the recent architectures of deep convolutional neural networks. Artif Intell Rev. 2020;53(8):5455–516.

Hasan RI, Yusuf SM, Alzubaidi L. Review of the state of the art of deep learning for plant diseases: a broad analysis and discussion. Plants. 2020;9(10):1302.

Xiao Y, Tian Z, Yu J, Zhang Y, Liu S, Du S, Lan X. A review of object detection based on deep learning. Multimed Tools Appl. 2020;79(33):23729–91.

Ker J, Wang L, Rao J, Lim T. Deep learning applications in medical image analysis. IEEE Access. 2017;6:9375–89.

Zhang Z, Cui P, Zhu W. Deep learning on graphs: a survey. IEEE Trans Knowl Data Eng. 2020. https://doi.org/10.1109/TKDE.2020.2981333 .

Shrestha A, Mahmood A. Review of deep learning algorithms and architectures. IEEE Access. 2019;7:53040–65.

Najafabadi MM, Villanustre F, Khoshgoftaar TM, Seliya N, Wald R, Muharemagic E. Deep learning applications and challenges in big data analytics. J Big Data. 2015;2(1):1.

Goodfellow I, Bengio Y, Courville A. Deep learning, vol. 1. Cambridge: MIT Press; 2016.

Shorten C, Khoshgoftaar TM, Furht B. Deep learning applications for COVID-19. J Big Data. 2021;8(1):1–54.

Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. Commun ACM. 2017;60(6):84–90.

Bhowmick S, Nagarajaiah S, Veeraraghavan A. Vision and deep learning-based algorithms to detect and quantify cracks on concrete surfaces from uav videos. Sensors. 2020;20(21):6299.

Goh GB, Hodas NO, Vishnu A. Deep learning for computational chemistry. J Comput Chem. 2017;38(16):1291–307.

Li Y, Zhang T, Sun S, Gao X. Accelerating flash calculation through deep learning methods. J Comput Phys. 2019;394:153–65.

Yang W, Zhang X, Tian Y, Wang W, Xue JH, Liao Q. Deep learning for single image super-resolution: a brief review. IEEE Trans Multimed. 2019;21(12):3106–21.

Tang J, Li S, Liu P. A review of lane detection methods based on deep learning. Pattern Recogn. 2020;111:107623.

Zhao ZQ, Zheng P, Xu ST, Wu X. Object detection with deep learning: a review. IEEE Trans Neural Netw Learn Syst. 2019;30(11):3212–32.

He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2016. p. 770–8.

Ng A. Machine learning yearning: technical strategy for AI engineers in the era of deep learning. 2019. https://www.mlyearning.org .

Metz C. Turing award won by 3 pioneers in artificial intelligence. The New York Times. 2019;27.

Nevo S, Anisimov V, Elidan G, El-Yaniv R, Giencke P, Gigi Y, Hassidim A, Moshe Z, Schlesinger M, Shalev G, et al. Ml for flood forecasting at scale; 2019. arXiv preprint arXiv:1901.09583 .

Chen H, Engkvist O, Wang Y, Olivecrona M, Blaschke T. The rise of deep learning in drug discovery. Drug Discov Today. 2018;23(6):1241–50.

Benhammou Y, Achchab B, Herrera F, Tabik S. Breakhis based breast cancer automatic diagnosis using deep learning: taxonomy, survey and insights. Neurocomputing. 2020;375:9–24.

Wulczyn E, Steiner DF, Xu Z, Sadhwani A, Wang H, Flament-Auvigne I, Mermel CH, Chen PHC, Liu Y, Stumpe MC. Deep learning-based survival prediction for multiple cancer types using histopathology images. PLoS ONE. 2020;15(6):e0233678.

Nagpal K, Foote D, Liu Y, Chen PHC, Wulczyn E, Tan F, Olson N, Smith JL, Mohtashamian A, Wren JH, et al. Development and validation of a deep learning algorithm for improving Gleason scoring of prostate cancer. NPJ Digit Med. 2019;2(1):1–10.

Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, Thrun S. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017;542(7639):115–8.

Brunese L, Mercaldo F, Reginelli A, Santone A. Explainable deep learning for pulmonary disease and coronavirus COVID-19 detection from X-rays. Comput Methods Programs Biomed. 2020;196:105608.

Jamshidi M, Lalbakhsh A, Talla J, Peroutka Z, Hadjilooei F, Lalbakhsh P, Jamshidi M, La Spada L, Mirmozafari M, Dehghani M, et al. Artificial intelligence and COVID-19: deep learning approaches for diagnosis and treatment. IEEE Access. 2020;8:109581–95.

Shorfuzzaman M, Hossain MS. Metacovid: a siamese neural network framework with contrastive loss for n-shot diagnosis of COVID-19 patients. Pattern Recogn. 2020;113:107700.

Carvelli L, Olesen AN, Brink-Kjær A, Leary EB, Peppard PE, Mignot E, Sørensen HB, Jennum P. Design of a deep learning model for automatic scoring of periodic and non-periodic leg movements during sleep validated against multiple human experts. Sleep Med. 2020;69:109–19.

De Fauw J, Ledsam JR, Romera-Paredes B, Nikolov S, Tomasev N, Blackwell S, Askham H, Glorot X, O’Donoghue B, Visentin D, et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat Med. 2018;24(9):1342–50.

Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med. 2019;25(1):44–56.

Kermany DS, Goldbaum M, Cai W, Valentim CC, Liang H, Baxter SL, McKeown A, Yang G, Wu X, Yan F, et al. Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell. 2018;172(5):1122–31.

Van Essen B, Kim H, Pearce R, Boakye K, Chen B. Lbann: livermore big artificial neural network HPC toolkit. In: Proceedings of the workshop on machine learning in high-performance computing environments; 2015. p. 1–6.

Saeed MM, Al Aghbari Z, Alsharidah M. Big data clustering techniques based on spark: a literature review. PeerJ Comput Sci. 2020;6:321.

Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G, et al. Human-level control through deep reinforcement learning. Nature. 2015;518(7540):529–33.

Arulkumaran K, Deisenroth MP, Brundage M, Bharath AA. Deep reinforcement learning: a brief survey. IEEE Signal Process Mag. 2017;34(6):26–38.

Socher R, Perelygin A, Wu J, Chuang J, Manning CD, Ng AY, Potts C. Recursive deep models for semantic compositionality over a sentiment treebank. In: Proceedings of the 2013 conference on empirical methods in natural language processing; 2013. p. 1631–42.

Goller C, Kuchler A. Learning task-dependent distributed representations by backpropagation through structure. In: Proceedings of international conference on neural networks (ICNN’96), vol 1. IEEE; 1996. p. 347–52.

Socher R, Lin CCY, Ng AY, Manning CD. Parsing natural scenes and natural language with recursive neural networks. In: ICML; 2011.

Louppe G, Cho K, Becot C, Cranmer K. QCD-aware recursive neural networks for jet physics. J High Energy Phys. 2019;2019(1):57.

Sadr H, Pedram MM, Teshnehlab M. A robust sentiment analysis method based on sequential combination of convolutional and recursive neural networks. Neural Process Lett. 2019;50(3):2745–61.

Urban G, Subrahmanya N, Baldi P. Inner and outer recursive neural networks for chemoinformatics applications. J Chem Inf Model. 2018;58(2):207–11.

Hewamalage H, Bergmeir C, Bandara K. Recurrent neural networks for time series forecasting: current status and future directions. Int J Forecast. 2020;37(1):388–427.

Jiang Y, Kim H, Asnani H, Kannan S, Oh S, Viswanath P. Learn codes: inventing low-latency codes via recurrent neural networks. IEEE J Sel Areas Inf Theory. 2020;1(1):207–16.

John RA, Acharya J, Zhu C, Surendran A, Bose SK, Chaturvedi A, Tiwari N, Gao Y, He Y, Zhang KK, et al. Optogenetics inspired transition metal dichalcogenide neuristors for in-memory deep recurrent neural networks. Nat Commun. 2020;11(1):1–9.

Batur Dinler Ö, Aydin N. An optimal feature parameter set based on gated recurrent unit recurrent neural networks for speech segment detection. Appl Sci. 2020;10(4):1273.

Jagannatha AN, Yu H. Structured prediction models for RNN based sequence labeling in clinical text. In: Proceedings of the conference on empirical methods in natural language processing, vol. 2016. NIH Public Access; 2016. p. 856.

Pascanu R, Gulcehre C, Cho K, Bengio Y. How to construct deep recurrent neural networks. In: Proceedings of the second international conference on learning representations (ICLR 2014); 2014.

Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the thirteenth international conference on artificial intelligence and statistics; 2010. p. 249–56.

Gao C, Yan J, Zhou S, Varshney PK, Liu H. Long short-term memory-based deep recurrent neural networks for target tracking. Inf Sci. 2019;502:279–96.

Zhou DX. Theory of deep convolutional neural networks: downsampling. Neural Netw. 2020;124:319–27.

Jhong SY, Tseng PY, Siriphockpirom N, Hsia CH, Huang MS, Hua KL, Chen YY. An automated biometric identification system using CNN-based palm vein recognition. In: 2020 international conference on advanced robotics and intelligent systems (ARIS). IEEE; 2020. p. 1–6.

Al-Azzawi A, Ouadou A, Max H, Duan Y, Tanner JJ, Cheng J. Deepcryopicker: fully automated deep neural network for single protein particle picking in cryo-EM. BMC Bioinform. 2020;21(1):1–38.

Wang T, Lu C, Yang M, Hong F, Liu C. A hybrid method for heartbeat classification via convolutional neural networks, multilayer perceptrons and focal loss. PeerJ Comput Sci. 2020;6:324.

Li G, Zhang M, Li J, Lv F, Tong G. Efficient densely connected convolutional neural networks. Pattern Recogn. 2021;109:107610.

Gu J, Wang Z, Kuen J, Ma L, Shahroudy A, Shuai B, Liu T, Wang X, Wang G, Cai J, et al. Recent advances in convolutional neural networks. Pattern Recogn. 2018;77:354–77.

Fang W, Love PE, Luo H, Ding L. Computer vision for behaviour-based safety in construction: a review and future directions. Adv Eng Inform. 2020;43:100980.

Palaz D, Magimai-Doss M, Collobert R. End-to-end acoustic modeling using convolutional neural networks for HMM-based automatic speech recognition. Speech Commun. 2019;108:15–32.

Li HC, Deng ZY, Chiang HH. Lightweight and resource-constrained learning network for face recognition with performance optimization. Sensors. 2020;20(21):6114.

Hubel DH, Wiesel TN. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. J Physiol. 1962;160(1):106.

Ioffe S, Szegedy C. Batch normalization: accelerating deep network training by reducing internal covariate shift; 2015. arXiv preprint arXiv:1502.03167.

Ruder S. An overview of gradient descent optimization algorithms; 2016. arXiv preprint arXiv:1609.04747.

Bottou L. Large-scale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT’2010. Springer; 2010. p. 177–86.

Hinton G, Srivastava N, Swersky K. Neural networks for machine learning lecture 6a overview of mini-batch gradient descent. Cited on. 2012;14(8).

Zhang Z. Improved Adam optimizer for deep neural networks. In: 2018 IEEE/ACM 26th international symposium on quality of service (IWQoS). IEEE; 2018. p. 1–2.

Alzubaidi L, Fadhel MA, Al-Shamma O, Zhang J, Duan Y. Deep learning models for classification of red blood cells in microscopy images to aid in sickle cell anemia diagnosis. Electronics. 2020;9(3):427.

Alzubaidi L, Fadhel MA, Al-Shamma O, Zhang J, Santamaría J, Duan Y, Oleiwi SR. Towards a better understanding of transfer learning for medical imaging: a case study. Appl Sci. 2020;10(13):4523.

Alzubaidi L, Al-Shamma O, Fadhel MA, Farhan L, Zhang J, Duan Y. Optimizing the performance of breast cancer classification by employing the same domain transfer learning from hybrid deep convolutional neural network model. Electronics. 2020;9(3):445.

LeCun Y, Jackel LD, Bottou L, Cortes C, Denker JS, Drucker H, Guyon I, Muller UA, Sackinger E, Simard P, et al. Learning algorithms for classification: a comparison on handwritten digit recognition. Neural Netw Stat Mech Perspect. 1995;261:276.

Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res. 2014;15(1):1929–58.

Dahl GE, Sainath TN, Hinton GE. Improving deep neural networks for LVCSR using rectified linear units and dropout. In: 2013 IEEE international conference on acoustics, speech and signal processing. IEEE; 2013. p. 8609–13.

Xu B, Wang N, Chen T, Li M. Empirical evaluation of rectified activations in convolutional network; 2015. arXiv preprint arXiv:1505.00853.

Hochreiter S. The vanishing gradient problem during learning recurrent neural nets and problem solutions. Int J Uncertain Fuzziness Knowl Based Syst. 1998;6(02):107–16.

Lin M, Chen Q, Yan S. Network in network; 2013. arXiv preprint arXiv:1312.4400.

Hsiao TY, Chang YC, Chou HH, Chiu CT. Filter-based deep-compression with global average pooling for convolutional networks. J Syst Arch. 2019;95:9–18.

Li Z, Wang SH, Fan RR, Cao G, Zhang YD, Guo T. Teeth category classification via seven-layer deep convolutional neural network with max pooling and global average pooling. Int J Imaging Syst Technol. 2019;29(4):577–83.

Zeiler MD, Fergus R. Visualizing and understanding convolutional networks. In: European conference on computer vision. Springer; 2014. p. 818–33.

Erhan D, Bengio Y, Courville A, Vincent P. Visualizing higher-layer features of a deep network. Univ Montreal. 2009;1341(3):1.

Le QV. Building high-level features using large scale unsupervised learning. In: 2013 IEEE international conference on acoustics, speech and signal processing. IEEE; 2013. p. 8595–8.

Grün F, Rupprecht C, Navab N, Tombari F. A taxonomy and library for visualizing learned features in convolutional neural networks; 2016. arXiv preprint arXiv:1606.07757.

Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition; 2014. arXiv preprint arXiv:1409.1556.

Ranzato M, Huang FJ, Boureau YL, LeCun Y. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In: 2007 IEEE conference on computer vision and pattern recognition. IEEE; 2007. p. 1–8.

Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2015. p. 1–9.

Bengio Y, et al. RMSProp and equilibrated adaptive learning rates for nonconvex optimization; 2015. arXiv preprint arXiv:1502.04390.

Srivastava RK, Greff K, Schmidhuber J. Highway networks; 2015. arXiv preprint arXiv:1505.00387.

Kong W, Dong ZY, Jia Y, Hill DJ, Xu Y, Zhang Y. Short-term residential load forecasting based on LSTM recurrent neural network. IEEE Trans Smart Grid. 2017;10(1):841–51.

Ordóñez FJ, Roggen D. Deep convolutional and LSTM recurrent neural networks for multimodal wearable activity recognition. Sensors. 2016;16(1):115.

Cireşan D, Meier U, Masci J, Schmidhuber J. Multi-column deep neural network for traffic sign classification. Neural Netw. 2012;32:333–8.

Szegedy C, Ioffe S, Vanhoucke V, Alemi A. Inception-v4, inception-resnet and the impact of residual connections on learning; 2016. arXiv preprint arXiv:1602.07261.

Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2016. p. 2818–26.

Wu S, Zhong S, Liu Y. Deep residual learning for image steganalysis. Multimed Tools Appl. 2018;77(9):10437–53.

Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 4700–08.

Rubin J, Parvaneh S, Rahman A, Conroy B, Babaeizadeh S. Densely connected convolutional networks for detection of atrial fibrillation from short single-lead ECG recordings. J Electrocardiol. 2018;51(6):S18–21.

Kuang P, Ma T, Chen Z, Li F. Image super-resolution with densely connected convolutional networks. Appl Intell. 2019;49(1):125–36.

Xie S, Girshick R, Dollár P, Tu Z, He K. Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 1492–500.

Su A, He X, Zhao X. JPEG steganalysis based on ResNeXt with Gauss partial derivative filters. Multimed Tools Appl. 2020;80(3):3349–66.

Yadav D, Jalal A, Garlapati D, Hossain K, Goyal A, Pant G. Deep learning-based ResNeXt model in phycological studies for future. Algal Res. 2020;50:102018.

Han W, Feng R, Wang L, Gao L. Adaptive spatial-scale-aware deep convolutional neural network for high-resolution remote sensing imagery scene classification. In: IGARSS 2018-2018 IEEE international geoscience and remote sensing symposium. IEEE; 2018. p. 4736–9.

Zagoruyko S, Komodakis N. Wide residual networks; 2016. arXiv preprint arXiv:1605.07146.

Huang G, Sun Y, Liu Z, Sedra D, Weinberger KQ. Deep networks with stochastic depth. In: European conference on computer vision. Springer; 2016. p. 646–61.

Huynh HT, Nguyen H. Joint age estimation and gender classification of Asian faces using wide ResNet. SN Comput Sci. 2020;1(5):1–9.

Takahashi R, Matsubara T, Uehara K. Data augmentation using random image cropping and patching for deep CNNs. IEEE Trans Circuits Syst Video Technol. 2019;30(9):2917–31.

Han D, Kim J, Kim J. Deep pyramidal residual networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 5927–35.

Wang Y, Wang L, Wang H, Li P. End-to-end image super-resolution via deep and shallow convolutional networks. IEEE Access. 2019;7:31959–70.

Chollet F. Xception: Deep learning with depthwise separable convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 1251–8.

Lo WW, Yang X, Wang Y. An xception convolutional neural network for malware classification with transfer learning. In: 2019 10th IFIP international conference on new technologies, mobility and security (NTMS). IEEE; 2019. p. 1–5.

Rahimzadeh M, Attar A. A modified deep convolutional neural network for detecting COVID-19 and pneumonia from chest X-ray images based on the concatenation of xception and resnet50v2. Inform Med Unlocked. 2020;19:100360.

Wang F, Jiang M, Qian C, Yang S, Li C, Zhang H, Wang X, Tang X. Residual attention network for image classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 3156–64.

Salakhutdinov R, Larochelle H. Efficient learning of deep Boltzmann machines. In: Proceedings of the thirteenth international conference on artificial intelligence and statistics; 2010. p. 693–700.

Goh H, Thome N, Cord M, Lim JH. Top-down regularization of deep belief networks. Adv Neural Inf Process Syst. 2013;26:1878–86.

Guan J, Lai R, Xiong A, Liu Z, Gu L. Fixed pattern noise reduction for infrared images based on cascade residual attention CNN. Neurocomputing. 2020;377:301–13.

Bi Q, Qin K, Zhang H, Li Z, Xu K. RADC-Net: a residual attention based convolution network for aerial scene classification. Neurocomputing. 2020;377:345–59.

Jaderberg M, Simonyan K, Zisserman A, et al. Spatial transformer networks. In: Advances in neural information processing systems. San Mateo: Morgan Kaufmann Publishers; 2015. p. 2017–25.

Hu J, Shen L, Sun G. Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2018. p. 7132–41.

Mou L, Zhu XX. Learning to pay attention on spectral domain: a spectral attention module-based convolutional network for hyperspectral image classification. IEEE Trans Geosci Remote Sens. 2019;58(1):110–22.

Woo S, Park J, Lee JY, So Kweon I. CBAM: Convolutional block attention module. In: Proceedings of the European conference on computer vision (ECCV); 2018. p. 3–19.

Roy AG, Navab N, Wachinger C. Concurrent spatial and channel ‘squeeze & excitation’ in fully convolutional networks. In: International conference on medical image computing and computer-assisted intervention. Springer; 2018. p. 421–9.

Roy AG, Navab N, Wachinger C. Recalibrating fully convolutional networks with spatial and channel “squeeze and excitation” blocks. IEEE Trans Med Imaging. 2018;38(2):540–9.

Sabour S, Frosst N, Hinton GE. Dynamic routing between capsules. In: Advances in neural information processing systems. San Mateo: Morgan Kaufmann Publishers; 2017. p. 3856–66.

Arun P, Buddhiraju KM, Porwal A. Capsulenet-based spatial-spectral classifier for hyperspectral images. IEEE J Sel Topics Appl Earth Obs Remote Sens. 2019;12(6):1849–65.

Xinwei L, Lianghao X, Yi Y. Compact video fingerprinting via an improved capsule net. Syst Sci Control Eng. 2020;9:1–9.

Ma B, Li X, Xia Y, Zhang Y. Autonomous deep learning: a genetic DCNN designer for image classification. Neurocomputing. 2020;379:152–61.

Wang J, Sun K, Cheng T, Jiang B, Deng C, Zhao Y, Liu D, Mu Y, Tan M, Wang X, et al. Deep high-resolution representation learning for visual recognition. IEEE Trans Pattern Anal Mach Intell. 2020. https://doi.org/10.1109/TPAMI.2020.2983686 .

Cheng B, Xiao B, Wang J, Shi H, Huang TS, Zhang L. Higherhrnet: scale-aware representation learning for bottom-up human pose estimation. In: CVPR 2020; 2020. https://www.microsoft.com/en-us/research/publication/higherhrnet-scale-aware-representation-learning-for-bottom-up-human-pose-estimation/ .

Karimi H, Derr T, Tang J. Characterizing the decision boundary of deep neural networks; 2019. arXiv preprint arXiv:1912.11460.

Li Y, Ding L, Gao X. On the decision boundary of deep neural networks; 2018. arXiv preprint arXiv:1808.05385.

Yosinski J, Clune J, Bengio Y, Lipson H. How transferable are features in deep neural networks? In: Advances in neural information processing systems. San Mateo: Morgan Kaufmann Publishers; 2014. p. 3320–8.

Tan C, Sun F, Kong T, Zhang W, Yang C, Liu C. A survey on deep transfer learning. In: International conference on artificial neural networks. Springer; 2018. p. 270–9.

Weiss K, Khoshgoftaar TM, Wang D. A survey of transfer learning. J Big Data. 2016;3(1):9.

Shorten C, Khoshgoftaar TM. A survey on image data augmentation for deep learning. J Big Data. 2019;6(1):60.

Wang F, Wang H, Wang H, Li G, Situ G. Learning from simulation: an end-to-end deep-learning approach for computational ghost imaging. Opt Express. 2019;27(18):25560–72.

Pan W. A survey of transfer learning for collaborative recommendation with auxiliary data. Neurocomputing. 2016;177:447–53.

Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L. Imagenet: a large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. IEEE; 2009. p. 248–55.

Cook D, Feuz KD, Krishnan NC. Transfer learning for activity recognition: a survey. Knowl Inf Syst. 2013;36(3):537–56.

Cao X, Wang Z, Yan P, Li X. Transfer learning for pedestrian detection. Neurocomputing. 2013;100:51–7.

Raghu M, Zhang C, Kleinberg J, Bengio S. Transfusion: understanding transfer learning for medical imaging. In: Advances in neural information processing systems. San Mateo: Morgan Kaufmann Publishers; 2019. p. 3347–57.

Pham TN, Van Tran L, Dao SVT. Early disease classification of mango leaves using feed-forward neural network and hybrid metaheuristic feature selection. IEEE Access. 2020;8:189960–73.

Saleh AM, Hamoud T. Analysis and best parameters selection for person recognition based on gait model using CNN algorithm and image augmentation. J Big Data. 2021;8(1):1–20.

Hirahara D, Takaya E, Takahara T, Ueda T. Effects of data count and image scaling on deep learning training. PeerJ Comput Sci. 2020;6:312.

Moreno-Barea FJ, Strazzera F, Jerez JM, Urda D, Franco L. Forward noise adjustment scheme for data augmentation. In: 2018 IEEE symposium series on computational intelligence (SSCI). IEEE; 2018. p. 728–34.

Dua D, Karra Taniskidou E. UCI machine learning repository. Irvine: University of California, School of Information and Computer Science; 2017. http://archive.ics.uci.edu/ml

Johnson JM, Khoshgoftaar TM. Survey on deep learning with class imbalance. J Big Data. 2019;6(1):27.

Yang P, Zhang Z, Zhou BB, Zomaya AY. Sample subset optimization for classifying imbalanced biological data. In: Pacific-Asia conference on knowledge discovery and data mining. Springer; 2011. p. 333–44.

Yang P, Yoo PD, Fernando J, Zhou BB, Zhang Z, Zomaya AY. Sample subset optimization techniques for imbalanced and ensemble learning problems in bioinformatics applications. IEEE Trans Cybern. 2013;44(3):445–55.

Wang S, Sun S, Xu J. AUC-maximized deep convolutional neural fields for sequence labeling; 2015. arXiv preprint arXiv:1511.05265.

Li Y, Wang S, Umarov R, Xie B, Fan M, Li L, Gao X. Deepre: sequence-based enzyme EC number prediction by deep learning. Bioinformatics. 2018;34(5):760–9.

Li Y, Huang C, Ding L, Li Z, Pan Y, Gao X. Deep learning in bioinformatics: introduction, application, and perspective in the big data era. Methods. 2019;166:4–21.

Choi E, Bahadori MT, Sun J, Kulas J, Schuetz A, Stewart W. Retain: An interpretable predictive model for healthcare using reverse time attention mechanism. In: Advances in neural information processing systems. San Mateo: Morgan Kaufmann Publishers; 2016. p. 3504–12.

Ching T, Himmelstein DS, Beaulieu-Jones BK, Kalinin AA, Do BT, Way GP, Ferrero E, Agapow PM, Zietz M, Hoffman MM, et al. Opportunities and obstacles for deep learning in biology and medicine. J R Soc Interface. 2018;15(141):20170387.

Zhou J, Troyanskaya OG. Predicting effects of noncoding variants with deep learning-based sequence model. Nat Methods. 2015;12(10):931–4.

Pokuri BSS, Ghosal S, Kokate A, Sarkar S, Ganapathysubramanian B. Interpretable deep learning for guided microstructure-property explorations in photovoltaics. NPJ Comput Mater. 2019;5(1):1–11.

Ribeiro MT, Singh S, Guestrin C. “Why should I trust you?” explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining; 2016. p. 1135–44.

Wang L, Nie R, Yu Z, Xin R, Zheng C, Zhang Z, Zhang J, Cai J. An interpretable deep-learning architecture of capsule networks for identifying cell-type gene expression programs from single-cell RNA-sequencing data. Nat Mach Intell. 2020;2(11):1–11.

Sundararajan M, Taly A, Yan Q. Axiomatic attribution for deep networks; 2017. arXiv preprint arXiv:1703.01365.

Platt J, et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Adv Large Margin Classif. 1999;10(3):61–74.

Nair T, Precup D, Arnold DL, Arbel T. Exploring uncertainty measures in deep networks for multiple sclerosis lesion detection and segmentation. Med Image Anal. 2020;59:101557.

Herzog L, Murina E, Dürr O, Wegener S, Sick B. Integrating uncertainty in deep neural networks for MRI based stroke analysis. Med Image Anal. 2020;65:101790.

Pereyra G, Tucker G, Chorowski J, Kaiser Ł, Hinton G. Regularizing neural networks by penalizing confident output distributions; 2017. arXiv preprint arXiv:1701.06548.

Naeini MP, Cooper GF, Hauskrecht M. Obtaining well calibrated probabilities using Bayesian binning. In: Proceedings of the... AAAI conference on artificial intelligence, vol. 2015. NIH Public Access; 2015. p. 2901.

Li M, Sethi IK. Confidence-based classifier design. Pattern Recogn. 2006;39(7):1230–40.

Zadrozny B, Elkan C. Obtaining calibrated probability estimates from decision trees and Naive Bayesian classifiers. In: ICML, vol. 1, Citeseer; 2001. p. 609–16.

Steinwart I. Consistency of support vector machines and other regularized kernel classifiers. IEEE Trans Inf Theory. 2005;51(1):128–42.

Lee K, Lee K, Shin J, Lee H. Overcoming catastrophic forgetting with unlabeled data in the wild. In: Proceedings of the IEEE international conference on computer vision; 2019. p. 312–21.

Shmelkov K, Schmid C, Alahari K. Incremental learning of object detectors without catastrophic forgetting. In: Proceedings of the IEEE international conference on computer vision; 2017. p. 3400–09.

Zenke F, Gerstner W, Ganguli S. The temporal paradox of Hebbian learning and homeostatic plasticity. Curr Opin Neurobiol. 2017;43:166–76.

Andersen N, Krauth N, Nabavi S. Hebbian plasticity in vivo: relevance and induction. Curr Opin Neurobiol. 2017;45:188–92.

Zheng R, Chakraborti S. A phase II nonparametric adaptive exponentially weighted moving average control chart. Qual Eng. 2016;28(4):476–90.

Rebuffi SA, Kolesnikov A, Sperl G, Lampert CH. ICARL: Incremental classifier and representation learning. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 2001–10.

Hinton GE, Plaut DC. Using fast weights to deblur old memories. In: Proceedings of the ninth annual conference of the cognitive science society; 1987. p. 177–86.

Parisi GI, Kemker R, Part JL, Kanan C, Wermter S. Continual lifelong learning with neural networks: a review. Neural Netw. 2019;113:54–71.

Soltoggio A, Stanley KO, Risi S. Born to learn: the inspiration, progress, and future of evolved plastic artificial neural networks. Neural Netw. 2018;108:48–67.

Parisi GI, Tani J, Weber C, Wermter S. Lifelong learning of human actions with deep neural network self-organization. Neural Netw. 2017;96:137–49.

Cheng Y, Wang D, Zhou P, Zhang T. Model compression and acceleration for deep neural networks: the principles, progress, and challenges. IEEE Signal Process Mag. 2018;35(1):126–36.

Wiedemann S, Kirchhoffer H, Matlage S, Haase P, Marban A, Marinč T, Neumann D, Nguyen T, Schwarz H, Wiegand T, et al. Deepcabac: a universal compression algorithm for deep neural networks. IEEE J Sel Topics Signal Process. 2020;14(4):700–14.

Mehta N, Pandit A. Concurrence of big data analytics and healthcare: a systematic review. Int J Med Inform. 2018;114:57–65.

Esteva A, Robicquet A, Ramsundar B, Kuleshov V, DePristo M, Chou K, Cui C, Corrado G, Thrun S, Dean J. A guide to deep learning in healthcare. Nat Med. 2019;25(1):24–9.

Shawahna A, Sait SM, El-Maleh A. FPGA-based accelerators of deep learning networks for learning and classification: a review. IEEE Access. 2018;7:7823–59.

Min Z. Public welfare organization management system based on FPGA and deep learning. Microprocess Microsyst. 2020;80:103333.

Al-Shamma O, Fadhel MA, Hameed RA, Alzubaidi L, Zhang J. Boosting convolutional neural networks performance based on FPGA accelerator. In: International conference on intelligent systems design and applications. Springer; 2018. p. 509–17.

Han S, Mao H, Dally WJ. Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding; 2015. arXiv preprint arXiv:1510.00149.

Chen Z, Zhang L, Cao Z, Guo J. Distilling the knowledge from handcrafted features for human activity recognition. IEEE Trans Ind Inform. 2018;14(10):4334–42.

Hinton G, Vinyals O, Dean J. Distilling the knowledge in a neural network; 2015. arXiv preprint arXiv:1503.02531.

Lenssen JE, Fey M, Libuschewski P. Group equivariant capsule networks. In: Advances in neural information processing systems. San Mateo: Morgan Kaufmann Publishers; 2018. p. 8844–53.

Denton EL, Zaremba W, Bruna J, LeCun Y, Fergus R. Exploiting linear structure within convolutional networks for efficient evaluation. In: Advances in neural information processing systems. San Mateo: Morgan Kaufmann Publishers; 2014. p. 1269–77.

Xu Q, Zhang M, Gu Z, Pan G. Overfitting remedy by sparsifying regularization on fully-connected layers of CNNs. Neurocomputing. 2019;328:69–74.

Zhang C, Bengio S, Hardt M, Recht B, Vinyals O. Understanding deep learning requires rethinking generalization. Commun ACM. 2018;64(3):107–15.

Xu X, Jiang X, Ma C, Du P, Li X, Lv S, Yu L, Ni Q, Chen Y, Su J, et al. A deep learning system to screen novel coronavirus disease 2019 pneumonia. Engineering. 2020;6(10):1122–9.

Sharma K, Alsadoon A, Prasad P, Al-Dala’in T, Nguyen TQV, Pham DTH. A novel solution of using deep learning for left ventricle detection: enhanced feature extraction. Comput Methods Programs Biomed. 2020;197:105751.

Zhang G, Wang C, Xu B, Grosse R. Three mechanisms of weight decay regularization; 2018. arXiv preprint arXiv:1810.12281.

Laurent C, Pereyra G, Brakel P, Zhang Y, Bengio Y. Batch normalized recurrent neural networks. In: 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP), IEEE; 2016. p. 2657–61.

Salamon J, Bello JP. Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Process Lett. 2017;24(3):279–83.

Wang X, Qin Y, Wang Y, Xiang S, Chen H. ReLTanh: an activation function with vanishing gradient resistance for SAE-based DNNs and its application to rotating machinery fault diagnosis. Neurocomputing. 2019;363:88–98.

Tan HH, Lim KH. Vanishing gradient mitigation with deep learning neural network optimization. In: 2019 7th international conference on smart computing & communications (ICSCC). IEEE; 2019. p. 1–4.

MacDonald G, Godbout A, Gillcash B, Cairns S. Volume-preserving neural networks: a solution to the vanishing gradient problem; 2019. arXiv preprint arXiv:1911.09576.

Mittal S, Vaishay S. A survey of techniques for optimizing deep learning on GPUs. J Syst Arch. 2019;99:101635.

Kanai S, Fujiwara Y, Iwamura S. Preventing gradient explosions in gated recurrent units. In: Advances in neural information processing systems. San Mateo: Morgan Kaufmann Publishers; 2017. p. 435–44.

Hanin B. Which neural net architectures give rise to exploding and vanishing gradients? In: Advances in neural information processing systems. San Mateo: Morgan Kaufmann Publishers; 2018. p. 582–91.

Ribeiro AH, Tiels K, Aguirre LA, Schön T. Beyond exploding and vanishing gradients: analysing RNN training using attractors and smoothness. In: International conference on artificial intelligence and statistics, PMLR; 2020. p. 2370–80.

D’Amour A, Heller K, Moldovan D, Adlam B, Alipanahi B, Beutel A, Chen C, Deaton J, Eisenstein J, Hoffman MD, et al. Underspecification presents challenges for credibility in modern machine learning; 2020. arXiv preprint arXiv:2011.03395.

Chea P, Mandell JC. Current applications and future directions of deep learning in musculoskeletal radiology. Skelet Radiol. 2020;49(2):1–15.

Wu X, Sahoo D, Hoi SC. Recent advances in deep learning for object detection. Neurocomputing. 2020;396:39–64.

Kuutti S, Bowden R, Jin Y, Barber P, Fallah S. A survey of deep learning applications to autonomous vehicle control. IEEE Trans Intell Transp Syst. 2020;22:712–33.

Yolcu G, Oztel I, Kazan S, Oz C, Bunyak F. Deep learning-based face analysis system for monitoring customer interest. J Ambient Intell Humaniz Comput. 2020;11(1):237–48.

Jiao L, Zhang F, Liu F, Yang S, Li L, Feng Z, Qu R. A survey of deep learning-based object detection. IEEE Access. 2019;7:128837–68.

Muhammad K, Khan S, Del Ser J, de Albuquerque VHC. Deep learning for multigrade brain tumor classification in smart healthcare systems: a prospective survey. IEEE Trans Neural Netw Learn Syst. 2020;32:507–22.

Litjens G, Kooi T, Bejnordi BE, Setio AAA, Ciompi F, Ghafoorian M, Van Der Laak JA, Van Ginneken B, Sánchez CI. A survey on deep learning in medical image analysis. Med Image Anal. 2017;42:60–88.

Mukherjee D, Mondal R, Singh PK, Sarkar R, Bhattacharjee D. Ensemconvnet: a deep learning approach for human activity recognition using smartphone sensors for healthcare applications. Multimed Tools Appl. 2020;79(41):31663–90.

Zeleznik R, Foldyna B, Eslami P, Weiss J, Alexander I, Taron J, Parmar C, Alvi RM, Banerji D, Uno M, et al. Deep convolutional neural networks to predict cardiovascular risk from computed tomography. Nature Commun. 2021;12(1):1–9.

Wang J, Liu Q, Xie H, Yang Z, Zhou H. Boosted efficientnet: detection of lymph node metastases in breast cancer using convolutional neural networks. Cancers. 2021;13(4):661.

Yu H, Yang LT, Zhang Q, Armstrong D, Deen MJ. Convolutional neural networks for medical image analysis: state-of-the-art, comparisons, improvement and perspectives. Neurocomputing. 2021. https://doi.org/10.1016/j.neucom.2020.04.157 .

Bharati S, Podder P, Mondal MRH. Hybrid deep learning for detecting lung diseases from X-ray images. Inform Med Unlocked. 2020;20:100391.

Dong Y, Pan Y, Zhang J, Xu W. Learning to read chest X-ray images from 16000+ examples using CNN. In: 2017 IEEE/ACM international conference on connected health: applications, systems and engineering technologies (CHASE). IEEE; 2017. p. 51–7.

Rajkomar A, Lingam S, Taylor AG, Blum M, Mongan J. High-throughput classification of radiographs using deep convolutional neural networks. J Digit Imaging. 2017;30(1):95–101.

Rajpurkar P, Irvin J, Zhu K, Yang B, Mehta H, Duan T, Ding D, Bagul A, Langlotz C, Shpanskaya K, et al. Chexnet: radiologist-level pneumonia detection on chest X-rays with deep learning; 2017. arXiv preprint arXiv:1711.05225.

Wang X, Peng Y, Lu L, Lu Z, Bagheri M, Summers RM. ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 2097–106.

Zuo W, Zhou F, Li Z, Wang L. Multi-resolution CNN and knowledge transfer for candidate classification in lung nodule detection. IEEE Access. 2019;7:32510–21.

Shen W, Zhou M, Yang F, Yang C, Tian J. Multi-scale convolutional neural networks for lung nodule classification. In: International conference on information processing in medical imaging. Springer; 2015. p. 588–99.

Li R, Zhang W, Suk HI, Wang L, Li J, Shen D, Ji S. Deep learning based imaging data completion for improved brain disease diagnosis. In: International conference on medical image computing and computer-assisted intervention. Springer; 2014. p. 305–12.

Wen J, Thibeau-Sutre E, Diaz-Melo M, Samper-González J, Routier A, Bottani S, Dormont D, Durrleman S, Burgos N, Colliot O, et al. Convolutional neural networks for classification of Alzheimer’s disease: overview and reproducible evaluation. Med Image Anal. 2020;63:101694.

Mehmood A, Maqsood M, Bashir M, Shuyuan Y. A deep siamese convolution neural network for multi-class classification of Alzheimer disease. Brain Sci. 2020;10(2):84.

Hosseini-Asl E, Ghazal M, Mahmoud A, Aslantas A, Shalaby A, Casanova M, Barnes G, Gimel’farb G, Keynton R, El-Baz A. Alzheimer’s disease diagnostics by a 3d deeply supervised adaptable convolutional network. Front Biosci. 2018;23:584–96.

Korolev S, Safiullin A, Belyaev M, Dodonova Y. Residual and plain convolutional neural networks for 3D brain MRI classification. In: 2017 IEEE 14th international symposium on biomedical imaging (ISBI 2017). IEEE; 2017. p. 835–8.

Alzubaidi L, Fadhel MA, Oleiwi SR, Al-Shamma O, Zhang J. DFU_QUTNet: diabetic foot ulcer classification using novel deep convolutional neural network. Multimed Tools Appl. 2020;79(21):15655–77.

Goyal M, Reeves ND, Davison AK, Rajbhandari S, Spragg J, Yap MH. Dfunet: convolutional neural networks for diabetic foot ulcer classification. IEEE Trans Emerg Topics Comput Intell. 2018;4(5):728–39.

Yap MH., Hachiuma R, Alavi A, Brungel R, Goyal M, Zhu H, Cassidy B, Ruckert J, Olshansky M, Huang X, et al. Deep learning in diabetic foot ulcers detection: a comprehensive evaluation; 2020. arXiv preprint arXiv:2010.03341 .

Tulloch J, Zamani R, Akrami M. Machine learning in the prevention, diagnosis and management of diabetic foot ulcers: a systematic review. IEEE Access. 2020;8:198977–9000.

Fadhel MA, Al-Shamma O, Alzubaidi L, Oleiwi SR. Real-time sickle cell anemia diagnosis based hardware accelerator. In: International conference on new trends in information and communications technology applications, Springer; 2020. p. 189–99.

Debelee TG, Kebede SR, Schwenker F, Shewarega ZM. Deep learning in selected cancers’ image analysis—a survey. J Imaging. 2020;6(11):121.

Khan S, Islam N, Jan Z, Din IU, Rodrigues JJC. A novel deep learning based framework for the detection and classification of breast cancer using transfer learning. Pattern Recogn Lett. 2019;125:1–6.

Alzubaidi L, Hasan RI, Awad FH, Fadhel MA, Alshamma O, Zhang J. Multi-class breast cancer classification by a novel two-branch deep convolutional neural network architecture. In: 2019 12th international conference on developments in eSystems engineering (DeSE). IEEE; 2019. p. 268–73.

Roy K, Banik D, Bhattacharjee D, Nasipuri M. Patch-based system for classification of breast histology images using deep learning. Comput Med Imaging Gr. 2019;71:90–103.

Hameed Z, Zahia S, Garcia-Zapirain B, Javier Aguirre J, María Vanegas A. Breast cancer histopathology image classification using an ensemble of deep learning models. Sensors. 2020;20(16):4373.

Hosny KM, Kassem MA, Foaud MM. Skin cancer classification using deep learning and transfer learning. In: 2018 9th Cairo international biomedical engineering conference (CIBEC). IEEE; 2018. p. 90–3.

Dorj UO, Lee KK, Choi JY, Lee M. The skin cancer classification using deep convolutional neural network. Multimed Tools Appl. 2018;77(8):9909–24.

Kassem MA, Hosny KM, Fouad MM. Skin lesions classification into eight classes for ISIC 2019 using deep convolutional neural network and transfer learning. IEEE Access. 2020;8:114822–32.

Heidari M, Mirniaharikandehei S, Khuzani AZ, Danala G, Qiu Y, Zheng B. Improving the performance of CNN to predict the likelihood of COVID-19 using chest X-ray images with preprocessing algorithms. Int J Med Inform. 2020;144:104284.

Al-Timemy AH, Khushaba RN, Mosa ZM, Escudero J. An efficient mixture of deep and machine learning models for COVID-19 and tuberculosis detection using X-ray images in resource limited settings 2020. arXiv preprint arXiv:2007.08223 .

Abraham B, Nair MS. Computer-aided detection of COVID-19 from X-ray images using multi-CNN and Bayesnet classifier. Biocybern Biomed Eng. 2020;40(4):1436–45.

Nour M, Cömert Z, Polat K. A novel medical diagnosis model for COVID-19 infection detection based on deep features and Bayesian optimization. Appl Soft Comput. 2020;97:106580.

Mallio CA, Napolitano A, Castiello G, Giordano FM, D’Alessio P, Iozzino M, Sun Y, Angeletti S, Russano M, Santini D, et al. Deep learning algorithm trained with COVID-19 pneumonia also identifies immune checkpoint inhibitor therapy-related pneumonitis. Cancers. 2021;13(4):652.

Fourcade A, Khonsari R. Deep learning in medical image analysis: a third eye for doctors. J Stomatol Oral Maxillofac Surg. 2019;120(4):279–88.

Guo Z, Li X, Huang H, Guo N, Li Q. Deep learning-based image segmentation on multimodal medical imaging. IEEE Trans Radiat Plasma Med Sci. 2019;3(2):162–9.

Thakur N, Yoon H, Chong Y. Current trends of artificial intelligence for colorectal cancer pathology image analysis: a systematic review. Cancers. 2020;12(7):1884.

Lundervold AS, Lundervold A. An overview of deep learning in medical imaging focusing on MRI. Zeitschrift für Medizinische Physik. 2019;29(2):102–27.

Yadav SS, Jadhav SM. Deep convolutional neural network based medical image classification for disease diagnosis. J Big Data. 2019;6(1):113.

Nehme E, Freedman D, Gordon R, Ferdman B, Weiss LE, Alalouf O, Naor T, Orange R, Michaeli T, Shechtman Y. DeepSTORM3D: dense 3D localization microscopy and PSF design by deep learning. Nat Methods. 2020;17(7):734–40.

Zulkifley MA, Abdani SR, Zulkifley NH. Pterygium-Net: a deep learning approach to pterygium detection and localization. Multimed Tools Appl. 2019;78(24):34563–84.

Sirazitdinov I, Kholiavchenko M, Mustafaev T, Yixuan Y, Kuleev R, Ibragimov B. Deep neural network ensemble for pneumonia localization from a large-scale chest X-ray database. Comput Electr Eng. 2019;78:388–99.

Zhao W, Shen L, Han B, Yang Y, Cheng K, Toesca DA, Koong AC, Chang DT, Xing L. Markerless pancreatic tumor target localization enabled by deep learning. Int J Radiat Oncol Biol Phys. 2019;105(2):432–9.

Roth HR, Lee CT, Shin HC, Seff A, Kim L, Yao J, Lu L, Summers RM. Anatomy-specific classification of medical images using deep convolutional nets. In: 2015 IEEE 12th international symposium on biomedical imaging (ISBI). IEEE; 2015. p. 101–4.

Shin HC, Orton MR, Collins DJ, Doran SJ, Leach MO. Stacked autoencoders for unsupervised feature learning and multiple organ detection in a pilot study using 4D patient data. IEEE Trans Pattern Anal Mach Intell. 2012;35(8):1930–43.

Li Z, Dong M, Wen S, Hu X, Zhou P, Zeng Z. CLU-CNNs: object detection for medical images. Neurocomputing. 2019;350:53–9.

Gao J, Jiang Q, Zhou B, Chen D. Convolutional neural networks for computer-aided detection or diagnosis in medical image analysis: an overview. Math Biosci Eng. 2019;16(6):6536.

Article   MathSciNet   Google Scholar  

Lumini A, Nanni L. Review fair comparison of skin detection approaches on publicly available datasets. Expert Syst Appl. 2020. https://doi.org/10.1016/j.eswa.2020.113677 .

Chouhan V, Singh SK, Khamparia A, Gupta D, Tiwari P, Moreira C, Damaševičius R, De Albuquerque VHC. A novel transfer learning based approach for pneumonia detection in chest X-ray images. Appl Sci. 2020;10(2):559.

Apostolopoulos ID, Mpesiana TA. COVID-19: automatic detection from X-ray images utilizing transfer learning with convolutional neural networks. Phys Eng Sci Med. 2020;43(2):635–40.

Mahmud T, Rahman MA, Fattah SA. CovXNet: a multi-dilation convolutional neural network for automatic COVID-19 and other pneumonia detection from chest X-ray images with transferable multi-receptive feature optimization. Comput Biol Med. 2020;122:103869.

Tayarani-N MH. Applications of artificial intelligence in battling against COVID-19: a literature review. Chaos Solitons Fractals. 2020;142:110338.

Toraman S, Alakus TB, Turkoglu I. Convolutional capsnet: a novel artificial neural network approach to detect COVID-19 disease from X-ray images using capsule networks. Chaos Solitons Fractals. 2020;140:110122.

Dascalu A, David E. Skin cancer detection by deep learning and sound analysis algorithms: a prospective clinical study of an elementary dermoscope. EBioMedicine. 2019;43:107–13.

Adegun A, Viriri S. Deep learning techniques for skin lesion analysis and melanoma cancer detection: a survey of state-of-the-art. Artif Intell Rev. 2020;54:1–31.

Zhang N, Cai YX, Wang YY, Tian YT, Wang XL, Badami B. Skin cancer diagnosis based on optimized convolutional neural network. Artif Intell Med. 2020;102:101756.

Thurnhofer-Hemsi K, Domínguez E. A convolutional neural network framework for accurate skin cancer detection. Neural Process Lett. 2020. https://doi.org/10.1007/s11063-020-10364-y .

Jain MS, Massoud TF. Predicting tumour mutational burden from histopathological images using multiscale deep learning. Nat Mach Intell. 2020;2(6):356–62.

Lei H, Liu S, Elazab A, Lei B. Attention-guided multi-branch convolutional neural network for mitosis detection from histopathological images. IEEE J Biomed Health Inform. 2020;25(2):358–70.

Celik Y, Talo M, Yildirim O, Karabatak M, Acharya UR. Automated invasive ductal carcinoma detection based using deep transfer learning with whole-slide images. Pattern Recogn Lett. 2020;133:232–9.

Sebai M, Wang X, Wang T. Maskmitosis: a deep learning framework for fully supervised, weakly supervised, and unsupervised mitosis detection in histopathology images. Med Biol Eng Comput. 2020;58:1603–23.

Sebai M, Wang T, Al-Fadhli SA. Partmitosis: a partially supervised deep learning framework for mitosis detection in breast cancer histopathology images. IEEE Access. 2020;8:45133–47.

Mahmood T, Arsalan M, Owais M, Lee MB, Park KR. Artificial intelligence-based mitosis detection in breast cancer histopathology images using faster R-CNN and deep CNNs. J Clin Med. 2020;9(3):749.

Srinidhi CL, Ciga O, Martel AL. Deep neural network models for computational histopathology: a survey. Med Image Anal. 2020;67:101813.

Cireşan DC, Giusti A, Gambardella LM, Schmidhuber J. Mitosis detection in breast cancer histology images with deep neural networks. In: International conference on medical image computing and computer-assisted intervention. Springer; 2013. p. 411–8.

Sirinukunwattana K, Raza SEA, Tsang YW, Snead DR, Cree IA, Rajpoot NM. Locality sensitive deep learning for detection and classification of nuclei in routine colon cancer histology images. IEEE Trans Med Imaging. 2016;35(5):1196–206.

Xu J, Xiang L, Liu Q, Gilmore H, Wu J, Tang J, Madabhushi A. Stacked sparse autoencoder (SSAE) for nuclei detection on breast cancer histopathology images. IEEE Trans Med Imaging. 2015;35(1):119–30.

Albarqouni S, Baur C, Achilles F, Belagiannis V, Demirci S, Navab N. Aggnet: deep learning from crowds for mitosis detection in breast cancer histology images. IEEE Trans Med Imaging. 2016;35(5):1313–21.

Abd-Ellah MK, Awad AI, Khalaf AA, Hamed HF. Two-phase multi-model automatic brain tumour diagnosis system from magnetic resonance images using convolutional neural networks. EURASIP J Image Video Process. 2018;2018(1):97.

Thaha MM, Kumar KPM, Murugan B, Dhanasekeran S, Vijayakarthick P, Selvi AS. Brain tumor segmentation using convolutional neural networks in MRI images. J Med Syst. 2019;43(9):294.

Talo M, Yildirim O, Baloglu UB, Aydin G, Acharya UR. Convolutional neural networks for multi-class brain disease detection using MRI images. Comput Med Imaging Gr. 2019;78:101673.

Gabr RE, Coronado I, Robinson M, Sujit SJ, Datta S, Sun X, Allen WJ, Lublin FD, Wolinsky JS, Narayana PA. Brain and lesion segmentation in multiple sclerosis using fully convolutional neural networks: a large-scale study. Mult Scler J. 2020;26(10):1217–26.

Chen S, Ding C, Liu M. Dual-force convolutional neural networks for accurate brain tumor segmentation. Pattern Recogn. 2019;88:90–100.

Hu K, Gan Q, Zhang Y, Deng S, Xiao F, Huang W, Cao C, Gao X. Brain tumor segmentation using multi-cascaded convolutional neural networks and conditional random field. IEEE Access. 2019;7:92615–29.

Wadhwa A, Bhardwaj A, Verma VS. A review on brain tumor segmentation of MRI images. Magn Reson Imaging. 2019;61:247–59.

Akkus Z, Galimzianova A, Hoogi A, Rubin DL, Erickson BJ. Deep learning for brain MRI segmentation: state of the art and future directions. J Digit Imaging. 2017;30(4):449–59.

Moeskops P, Viergever MA, Mendrik AM, De Vries LS, Benders MJ, Išgum I. Automatic segmentation of MR brain images with a convolutional neural network. IEEE Trans Med Imaging. 2016;35(5):1252–61.

Milletari F, Navab N, Ahmadi SA. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In: 2016 fourth international conference on 3D vision (3DV). IEEE; 2016. p. 565–71.

Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation. In: International conference on medical image computing and computer-assisted intervention. Springer; 2015. p. 234–41.

Pereira S, Pinto A, Alves V, Silva CA. Brain tumor segmentation using convolutional neural networks in MRI images. IEEE Trans Med Imaging. 2016;35(5):1240–51.

Havaei M, Davy A, Warde-Farley D, Biard A, Courville A, Bengio Y, Pal C, Jodoin PM, Larochelle H. Brain tumor segmentation with deep neural networks. Med Image Anal. 2017;35:18–31.

Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans Pattern Anal Mach Intell. 2017;40(4):834–48.

Yan Q, Wang B, Gong D, Luo C, Zhao W, Shen J, Shi Q, Jin S, Zhang L, You Z. COVID-19 chest CT image segmentation—a deep convolutional neural network solution; 2020. arXiv preprint arXiv:2004.10987 .

Wang G, Liu X, Li C, Xu Z, Ruan J, Zhu H, Meng T, Li K, Huang N, Zhang S. A noise-robust framework for automatic segmentation of COVID-19 pneumonia lesions from CT images. IEEE Trans Med Imaging. 2020;39(8):2653–63.

Khan SH, Sohail A, Khan A, Lee YS. Classification and region analysis of COVID-19 infection using lung CT images and deep convolutional neural networks; 2020. arXiv preprint arXiv:2009.08864 .

Shi F, Wang J, Shi J, Wu Z, Wang Q, Tang Z, He K, Shi Y, Shen D. Review of artificial intelligence techniques in imaging data acquisition, segmentation and diagnosis for COVID-19. IEEE Rev Biomed Eng. 2020;14:4–5.

Santamaría J, Rivero-Cejudo M, Martos-Fernández M, Roca F. An overview on the latest nature-inspired and metaheuristics-based image registration algorithms. Appl Sci. 2020;10(6):1928.

Santamaría J, Cordón O, Damas S. A comparative study of state-of-the-art evolutionary image registration methods for 3D modeling. Comput Vision Image Underst. 2011;115(9):1340–54.

Yumer ME, Mitra NJ. Learning semantic deformation flows with 3D convolutional networks. In: European conference on computer vision. Springer; 2016. p. 294–311.

Ding L, Feng C. Deepmapping: unsupervised map estimation from multiple point clouds. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2019. p. 8650–9.

Mahadevan S. Imagination machines: a new challenge for artificial intelligence. AAAI. 2018;2018:7988–93.

Wang L, Fang Y. Unsupervised 3D reconstruction from a single image via adversarial learning; 2017. arXiv preprint arXiv:1711.09312 .

Hermoza R, Sipiran I. 3D reconstruction of incomplete archaeological objects using a generative adversarial network. In: Proceedings of computer graphics international 2018. Association for Computing Machinery; 2018. p. 5–11.

Fu Y, Lei Y, Wang T, Curran WJ, Liu T, Yang X. Deep learning in medical image registration: a review. Phys Med Biol. 2020;65(20):20TR01.

Haskins G, Kruger U, Yan P. Deep learning in medical image registration: a survey. Mach Vision Appl. 2020;31(1):8.

de Vos BD, Berendsen FF, Viergever MA, Sokooti H, Staring M, Išgum I. A deep learning framework for unsupervised affine and deformable image registration. Med Image Anal. 2019;52:128–43.

Yang X, Kwitt R, Styner M, Niethammer M. Quicksilver: fast predictive image registration—a deep learning approach. NeuroImage. 2017;158:378–96.

Miao S, Wang ZJ, Liao R. A CNN regression approach for real-time 2D/3D registration. IEEE Trans Med Imaging. 2016;35(5):1352–63.

Li P, Pei Y, Guo Y, Ma G, Xu T, Zha H. Non-rigid 2D–3D registration using convolutional autoencoders. In: 2020 IEEE 17th international symposium on biomedical imaging (ISBI). IEEE; 2020. p. 700–4.

Zhang J, Yeung SH, Shu Y, He B, Wang W. Efficient memory management for GPU-based deep learning systems; 2019. arXiv preprint arXiv:1903.06631 .

Zhao H, Han Z, Yang Z, Zhang Q, Yang F, Zhou L, Yang M, Lau FC, Wang Y, Xiong Y, et al. Hived: sharing a {GPU} cluster for deep learning with guarantees. In: 14th {USENIX} symposium on operating systems design and implementation ({OSDI} 20); 2020. p. 515–32.

Lin Y, Jiang Z, Gu J, Li W, Dhar S, Ren H, Khailany B, Pan DZ. DREAMPlace: deep learning toolkit-enabled GPU acceleration for modern VLSI placement. IEEE Trans Comput Aided Des Integr Circuits Syst. 2020;40:748–61.

Hossain S, Lee DJ. Deep learning-based real-time multiple-object detection and tracking from aerial imagery via a flying robot with GPU-based embedded devices. Sensors. 2019;19(15):3371.

Castro FM, Guil N, Marín-Jiménez MJ, Pérez-Serrano J, Ujaldón M. Energy-based tuning of convolutional neural networks on multi-GPUs. Concurr Comput Pract Exp. 2019;31(21):4786.

Gschwend D. Zynqnet: an fpga-accelerated embedded convolutional neural network; 2020. arXiv preprint arXiv:2005.06892 .

Zhang N, Wei X, Chen H, Liu W. FPGA implementation for CNN-based optical remote sensing object detection. Electronics. 2021;10(3):282.

Zhao M, Hu C, Wei F, Wang K, Wang C, Jiang Y. Real-time underwater image recognition with FPGA embedded system for convolutional neural network. Sensors. 2019;19(2):350.

Liu X, Yang J, Zou C, Chen Q, Yan X, Chen Y, Cai C. Collaborative edge computing with FPGA-based CNN accelerators for energy-efficient and time-aware face tracking system. IEEE Trans Comput Soc Syst. 2021. https://doi.org/10.1109/TCSS.2021.3059318 .

Hossin M, Sulaiman M. A review on evaluation metrics for data classification evaluations. Int J Data Min Knowl Manag Process. 2015;5(2):1.

Provost F, Domingos P. Tree induction for probability-based ranking. Mach Learn. 2003;52(3):199–215.

Rakotomamonyj A. Optimizing area under roc with SVMS. In: Proceedings of the European conference on artificial intelligence workshop on ROC curve and artificial intelligence (ROCAI 2004), 2004. p. 71–80.

Mingote V, Miguel A, Ortega A, Lleida E. Optimization of the area under the roc curve using neural network supervectors for text-dependent speaker verification. Comput Speech Lang. 2020;63:101078.

Fawcett T. An introduction to roc analysis. Pattern Recogn Lett. 2006;27(8):861–74.

Huang J, Ling CX. Using AUC and accuracy in evaluating learning algorithms. IEEE Trans Knowl Data Eng. 2005;17(3):299–310.

Hand DJ, Till RJ. A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach Learn. 2001;45(2):171–86.

Masoudnia S, Mersa O, Araabi BN, Vahabie AH, Sadeghi MA, Ahmadabadi MN. Multi-representational learning for offline signature verification using multi-loss snapshot ensemble of CNNs. Expert Syst Appl. 2019;133:317–30.

Coupé P, Mansencal B, Clément M, Giraud R, de Senneville BD, Ta VT, Lepetit V, Manjon JV. Assemblynet: a large ensemble of CNNs for 3D whole brain MRI segmentation. NeuroImage. 2020;219:117026.

Download references

Acknowledgements

We would like to thank the professors from the Queensland University of Technology and the University of Information Technology and Communications who gave their feedback on the paper.

This research received no external funding.

Author information

Authors and affiliations

School of Computer Science, Queensland University of Technology, Brisbane, QLD, 4000, Australia

Laith Alzubaidi & Jinglan Zhang

Control and Systems Engineering Department, University of Technology, Baghdad, 10001, Iraq

Amjad J. Humaidi

Electrical Engineering Technical College, Middle Technical University, Baghdad, 10001, Iraq

Ayad Al-Dujaili

Faculty of Electrical Engineering & Computer Science, University of Missouri, Columbia, MO, 65211, USA

Ye Duan & Muthana Al-Amidie

AlNidhal Campus, University of Information Technology & Communications, Baghdad, 10001, Iraq

Laith Alzubaidi & Omran Al-Shamma

Department of Computer Science, University of Jaén, 23071, Jaén, Spain

J. Santamaría

College of Computer Science and Information Technology, University of Sumer, Thi Qar, 64005, Iraq

Mohammed A. Fadhel

School of Engineering, Manchester Metropolitan University, Manchester, M1 5GD, UK

Laith Farhan

Contributions

Conceptualization: LA and JZ; methodology: LA, JZ, and JS; software: LA and MAF; validation: LA, JZ, MA, and LF; formal analysis: LA, JZ, YD, and JS; investigation: LA and JZ; resources: LA, JZ, and MAF; data curation: LA and OA; writing (original draft preparation): LA and OA; writing (review and editing): LA, JZ, AJH, AA, YD, OA, JS, MAF, MA, and LF; visualization: LA and MAF; supervision: JZ and YD; project administration: JZ, YD, and JS; funding acquisition: LA, AJH, AA, and YD. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Laith Alzubaidi.

Ethics declarations

Ethics approval and consent to participate, consent for publication, competing interests.

The authors declare that they have no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Alzubaidi, L., Zhang, J., Humaidi, A.J. et al. Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. J Big Data 8, 53 (2021). https://doi.org/10.1186/s40537-021-00444-8

Received: 21 January 2021

Accepted: 22 March 2021

Published: 31 March 2021

DOI: https://doi.org/10.1186/s40537-021-00444-8

  • Deep learning
  • Machine learning
  • Convolution neural network (CNN)
  • Deep neural network architectures
  • Deep learning applications
  • Image classification
  • Medical image analysis
  • Supervised learning

Research Topics & Ideas

PS – This is just the start…

We know it’s exciting to run through a list of research topics, but please keep in mind that this list is just a starting point. To develop a suitable research topic, you’ll need to identify a clear and convincing research gap, and a viable plan to fill that gap.

AI-Related Research Topics & Ideas

Below you’ll find a list of AI and machine learning-related research topic ideas. These are intentionally broad and generic, so keep in mind that you will need to refine them a little. Nevertheless, they should inspire some ideas for your project.

  • Developing AI algorithms for early detection of chronic diseases using patient data.
  • The use of deep learning in enhancing the accuracy of weather prediction models.
  • Machine learning techniques for real-time language translation in social media platforms.
  • AI-driven approaches to improve cybersecurity in financial transactions.
  • The role of AI in optimizing supply chain logistics for e-commerce.
  • Investigating the impact of machine learning in personalized education systems.
  • The use of AI in predictive maintenance for industrial machinery.
  • Developing ethical frameworks for AI decision-making in healthcare.
  • The application of ML algorithms in autonomous vehicle navigation systems.
  • AI in agricultural technology: Optimizing crop yield predictions.
  • Machine learning techniques for enhancing image recognition in security systems.
  • AI-powered chatbots: Improving customer service efficiency in retail.
  • The impact of AI on enhancing energy efficiency in smart buildings.
  • Deep learning in drug discovery and pharmaceutical research.
  • The use of AI in detecting and combating online misinformation.
  • Machine learning models for real-time traffic prediction and management.
  • AI applications in facial recognition: Privacy and ethical considerations.
  • The effectiveness of ML in financial market prediction and analysis.
  • Developing AI tools for real-time monitoring of environmental pollution.
  • Machine learning for automated content moderation on social platforms.
  • The role of AI in enhancing the accuracy of medical diagnostics.
  • AI in space exploration: Automated data analysis and interpretation.
  • Machine learning techniques in identifying genetic markers for diseases.
  • AI-driven personal finance management tools.
  • The use of AI in developing adaptive learning technologies for disabled students.

AI & ML Research Topic Ideas (Continued)

  • Machine learning in cybersecurity threat detection and response.
  • AI applications in virtual reality and augmented reality experiences.
  • Developing ethical AI systems for recruitment and hiring processes.
  • Machine learning for sentiment analysis in customer feedback.
  • AI in sports analytics for performance enhancement and injury prevention.
  • The role of AI in improving urban planning and smart city initiatives.
  • Machine learning models for predicting consumer behaviour trends.
  • AI and ML in artistic creation: Music, visual arts, and literature.
  • The use of AI in automated drone navigation for delivery services.
  • Developing AI algorithms for effective waste management and recycling.
  • Machine learning in seismology for earthquake prediction.
  • AI-powered tools for enhancing online privacy and data protection.
  • The application of ML in enhancing speech recognition technologies.
  • Investigating the role of AI in mental health assessment and therapy.
  • Machine learning for optimization of renewable energy systems.
  • AI in fashion: Predicting trends and personalizing customer experiences.
  • The impact of AI on legal research and case analysis.
  • Developing AI systems for real-time language interpretation for the deaf and hard of hearing.
  • Machine learning in genomic data analysis for personalized medicine.
  • AI-driven algorithms for credit scoring in microfinance.
  • The use of AI in enhancing public safety and emergency response systems.
  • Machine learning for improving water quality monitoring and management.
  • AI applications in wildlife conservation and habitat monitoring.
  • The role of AI in streamlining manufacturing processes.
  • Investigating the use of AI in enhancing the accessibility of digital content for visually impaired users.

Recent AI & ML-Related Studies

Below, we’ve included a selection of AI-related studies to help refine your thinking. These are actual studies, so they can provide some useful insight as to what a research topic looks like in practice.

  • An overview of artificial intelligence in diabetic retinopathy and other ocular diseases (Sheng et al., 2022)
  • How Does Artificial Intelligence Help Astronomy? A Review (Patel, 2022)
  • Editorial: Artificial Intelligence in Bioinformatics and Drug Repurposing: Methods and Applications (Zheng et al., 2022)
  • Review of Artificial Intelligence and Machine Learning Technologies: Classification, Restrictions, Opportunities, and Challenges (Mukhamediev et al., 2022)
  • Will digitization, big data, and artificial intelligence – and deep learning–based algorithm govern the practice of medicine? (Goh, 2022)
  • Flower Classifier Web App Using ML & Flask Web Framework (Singh et al., 2022)
  • Object-based Classification of Natural Scenes Using Machine Learning Methods (Jasim & Younis, 2023)
  • Automated Training Data Construction using Measurements for High-Level Learning-Based FPGA Power Modeling (Richa et al., 2022)
  • Artificial Intelligence (AI) and Internet of Medical Things (IoMT) Assisted Biomedical Systems for Intelligent Healthcare (Manickam et al., 2022)
  • Critical Review of Air Quality Prediction using Machine Learning Techniques (Sharma et al., 2022)
  • Artificial Intelligence: New Frontiers in Real–Time Inverse Scattering and Electromagnetic Imaging (Salucci et al., 2022)
  • Machine learning alternative to systems biology should not solely depend on data (Yeo & Selvarajoo, 2022)
  • Measurement-While-Drilling Based Estimation of Dynamic Penetrometer Values Using Decision Trees and Random Forests (García et al., 2022)
  • Artificial Intelligence in the Diagnosis of Oral Diseases: Applications and Pitfalls (Patil et al., 2022)
  • Automated Machine Learning on High Dimensional Big Data for Prediction Tasks (Jayanthi & Devi, 2022)
  • Breakdown of Machine Learning Algorithms (Meena & Sehrawat, 2022)
  • Technology-Enabled, Evidence-Driven, and Patient-Centered: The Way Forward for Regulating Software as a Medical Device (Carolan et al., 2021)
  • Machine Learning in Tourism (Rugge, 2022)
  • Towards a training data model for artificial intelligence in earth observation (Yue et al., 2022)
  • Classification of Music Generality using ANN, CNN and RNN-LSTM (Tripathy & Patel, 2022)

  • Print Friendly

IEEE Account

  • Change Username/Password
  • Update Address

Purchase Details

  • Payment Options
  • Order History
  • View Purchased Documents

Profile Information

  • Communications Preferences
  • Profession and Education
  • Technical Interests
  • US & Canada: +1 800 678 4333
  • Worldwide: +1 732 981 0060
  • Contact & Support
  • About IEEE Xplore
  • Accessibility
  • Terms of Use
  • Nondiscrimination Policy
  • Privacy & Opting Out of Cookies

A not-for-profit organization, IEEE is the world's largest technical professional organization dedicated to advancing technology for the benefit of humanity. © Copyright 2024 IEEE - All rights reserved. Use of this web site signifies your agreement to the terms and conditions.

U.S. flag

An official website of the United States government

The .gov means it’s official. Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

The site is secure. The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

  • Publications
  • Account settings

Preview improvements coming to the PMC website in October 2024. Learn More or Try it out now .

  • Advanced Search
  • Journal List
  • Springer Nature - PMC COVID-19 Collection

Logo of phenaturepg

Deep Learning: A Comprehensive Overview on Techniques, Taxonomy, Applications and Research Directions

Iqbal H. Sarker

1 Swinburne University of Technology, Melbourne, VIC 3122 Australia

2 Chittagong University of Engineering & Technology, Chittagong, 4349 Bangladesh

Deep learning (DL), a branch of machine learning (ML) and artificial intelligence (AI), is nowadays considered a core technology of today’s Fourth Industrial Revolution (4IR or Industry 4.0). Due to its ability to learn from data, DL technology, which originated from the artificial neural network (ANN), has become a hot topic in the context of computing, and is widely applied in various application areas like healthcare, visual recognition, text analytics, cybersecurity, and many more. However, building an appropriate DL model is a challenging task, due to the dynamic nature and variations in real-world problems and data. Moreover, the lack of core understanding turns DL methods into black-box machines that hamper development at the standard level. This article presents a structured and comprehensive view on DL techniques including a taxonomy considering various types of real-world tasks like supervised or unsupervised. In our taxonomy, we take into account deep networks for supervised or discriminative learning, unsupervised or generative learning, as well as hybrid learning and relevant others. We also summarize real-world application areas where deep learning techniques can be used. Finally, we point out ten potential aspects for future generation DL modeling with research directions. Overall, this article aims to draw a big picture on DL modeling that can be used as a reference guide for both academia and industry professionals.

Introduction

In the late 1980s, neural networks became a prevalent topic in the area of Machine Learning (ML) as well as Artificial Intelligence (AI), due to the invention of various efficient learning methods and network structures [ 52 ]. Multilayer perceptron networks trained by “Backpropagation” type algorithms, self-organizing maps, and radial basis function networks were such innovative methods [ 26 , 36 , 37 ]. While neural networks were successfully used in many applications, interest in researching this topic decreased later on. Then, in 2006, “Deep Learning” (DL) was introduced by Hinton et al. [ 41 ], based on the concept of the artificial neural network (ANN). Deep learning became a prominent topic after that, resulting in a rebirth of neural network research; hence, it is sometimes referred to as “new-generation neural networks”. This is because deep networks, when properly trained, have produced significant success in a variety of classification and regression challenges [ 52 ].

Nowadays, DL technology is considered as one of the hot topics within the area of machine learning, artificial intelligence as well as data science and analytics, due to its learning capabilities from the given data. Many corporations including Google, Microsoft, Nokia, etc., study it actively as it can provide significant results in different classification and regression problems and datasets [ 52 ]. In terms of working domain, DL is considered as a subset of ML and AI, and thus DL can be seen as an AI function that mimics the human brain’s processing of data. The worldwide popularity of “Deep learning” is increasing day by day, which is shown in our earlier paper [ 96 ] based on the historical data collected from Google trends [ 33 ]. Deep learning differs from standard machine learning in terms of efficiency as the volume of data increases, discussed briefly in Section “ Why Deep Learning in Today's Research and Applications? ”. DL technology uses multiple layers to represent the abstractions of data to build computational models. While deep learning takes a long time to train a model due to a large number of parameters, it takes a short amount of time to run during testing as compared to other machine learning algorithms [ 127 ].

While today’s Fourth Industrial Revolution (4IR or Industry 4.0) typically focuses on technology-driven “automation, smart and intelligent systems”, DL technology, which originated from ANN, has become one of the core technologies to achieve this goal [ 103 , 114 ]. A typical neural network is mainly composed of many simple, connected processing elements or processors called neurons, each of which generates a series of real-valued activations for the target outcome. Figure 1 shows a schematic representation of the mathematical model of an artificial neuron, i.e., processing element, highlighting input ( x_i ), weight ( w ), bias ( b ), summation function ( ∑ ), activation function ( f ) and corresponding output signal ( y ). Neural network-based DL technology is now widely applied in many fields and research areas such as healthcare, sentiment analysis, natural language processing, visual recognition, business intelligence, cybersecurity, and many more that are summarized in the latter part of this paper.

Figure 1: Schematic representation of the mathematical model of an artificial neuron (processing element), highlighting input ( x_i ), weight ( w ), bias ( b ), summation function ( ∑ ), activation function ( f ) and output signal ( y )
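
The neuron model described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the common formulation y = f(∑ w_i x_i + b) with a logistic sigmoid as the activation; the function names and the numeric values are hypothetical and not taken from the paper.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic activation, one common choice for f (an assumption here)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b, f=sigmoid):
    """Compute y = f(sum_i w_i * x_i + b) for a single artificial neuron."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # summation function
    return f(z)                                   # activation function

# Illustrative inputs, weights and bias (not from the paper).
y = neuron(x=[1.0, 2.0], w=[0.5, -0.25], b=0.0)
```

Swapping `f` for another activation (for example ReLU or tanh) changes only the last step; the weighted sum is the same in every case.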

Although DL models are successfully applied in the various application areas mentioned above, building an appropriate deep learning model is a challenging task, due to the dynamic nature and variations of real-world problems and data. Moreover, DL models are typically considered “black-box” machines, which hampers the standard development of deep learning research and applications. Thus, for clear understanding, in this paper we present a structured and comprehensive view on DL techniques considering the variations in real-world problems and tasks. To achieve our goal, we briefly discuss various DL techniques and present a taxonomy by taking into account three major categories: (i) deep networks for supervised or discriminative learning, which are utilized to provide a discriminative function in supervised deep learning or classification applications; (ii) deep networks for unsupervised or generative learning, which are used to characterize the high-order correlation properties or features for pattern analysis or synthesis, and thus can be used as preprocessing for a supervised algorithm; and (iii) deep networks for hybrid learning, which integrate both supervised and unsupervised models, and relevant others. We take into account such categories based on the nature and learning capabilities of different DL techniques and how they are used to solve problems in real-world applications [ 97 ]. Moreover, identifying key research issues and prospects, including effective data representation, new algorithm design, data-driven hyper-parameter learning and model optimization, integrating domain knowledge, adapting to resource-constrained devices, etc., is one of the key targets of this study, which can lead to “Future Generation DL-Modeling”. Thus the goal of this paper is to serve as a reference guide for those in academia and industry who want to research and develop data-driven smart and intelligent systems based on DL techniques.

The overall contribution of this paper is summarized as follows:

  • This article focuses on different aspects of deep learning modeling, i.e., the learning capabilities of DL techniques in different dimensions such as supervised or unsupervised tasks, to function in an automated and intelligent manner, which can serve as a core technology of today’s Fourth Industrial Revolution (Industry 4.0).
  • We explore a variety of prominent DL techniques and present a taxonomy by taking into account the variations in deep learning tasks and how they are used for different purposes. In our taxonomy, we divide the techniques into three major categories, namely deep networks for supervised or discriminative learning, deep networks for unsupervised or generative learning, and deep networks for hybrid learning and relevant others.
  • We have summarized several potential real-world application areas of deep learning, to assist developers as well as researchers in broadening their perspectives on DL techniques. Different categories of DL techniques highlighted in our taxonomy can be used to solve various issues accordingly.
  • Finally, we point out and discuss ten potential aspects with research directions for future generation DL modeling in terms of conducting future research and system development.

This paper is organized as follows. Section “ Why Deep Learning in Today's Research and Applications? ” motivates why deep learning is important for building data-driven intelligent systems. In Section “ Deep Learning Techniques and Applications ”, we present our DL taxonomy by taking into account the variations of deep learning tasks and how they are used in solving real-world issues, and briefly discuss the techniques while summarizing the potential application areas. In Section “ Research Directions and Future Aspects ”, we discuss various research issues of deep learning-based modeling and highlight the promising topics for future research within the scope of our study. Finally, Section “ Concluding Remarks ” concludes this paper.

Why Deep Learning in Today’s Research and Applications?

The main focus of today’s Fourth Industrial Revolution (Industry 4.0) is typically technology-driven automation and smart, intelligent systems in various application areas including smart healthcare, business intelligence, smart cities, cybersecurity intelligence, and many more [ 95 ]. Deep learning approaches have grown dramatically in terms of performance in a wide range of applications, including security technologies, particularly as an excellent solution for uncovering complex structure in high-dimensional data. Thus, DL techniques can play a key role in building intelligent data-driven systems according to today’s needs, because of their excellent learning capabilities from historical data. Consequently, DL can change the world as well as humans’ everyday life through its automation power and learning from experience. DL technology is therefore relevant to artificial intelligence [ 103 ], machine learning [ 97 ] and data science with advanced analytics [ 95 ], which are well-known areas in computer science, particularly in today’s intelligent computing. In the following, we first discuss the position of deep learning in AI, i.e., how DL technology is related to these areas of computing.

The Position of Deep Learning in AI

Nowadays, artificial intelligence (AI), machine learning (ML), and deep learning (DL) are three popular terms that are sometimes used interchangeably to describe systems or software that behave intelligently. In Fig. 2, we illustrate the position of deep learning in comparison with machine learning and artificial intelligence. According to Fig. 2, DL is a part of ML as well as a part of the broad area of AI. In general, AI incorporates human behavior and intelligence into machines or systems [ 103 ], while ML is the method to learn from data or experience [ 97 ], which automates analytical model building. DL also represents methods of learning from data, where the computation is done through multi-layer neural networks and processing. The term “Deep” in the deep learning methodology refers to the concept of multiple levels or stages through which data is processed for building a data-driven model.

Figure 2: An illustration of the position of deep learning (DL), compared with machine learning (ML) and artificial intelligence (AI)

Thus, DL can be considered one of the core technologies of AI, a frontier for artificial intelligence, which can be used for building intelligent systems and automation. More importantly, it pushes AI to a new level, termed “Smarter AI”. As DL is capable of learning from data, there is a strong relation between deep learning and “Data Science” [ 95 ] as well. Typically, data science represents the entire process of finding meaning or insights in data in a particular problem domain, where DL methods can play a key role in advanced analytics and intelligent decision-making [ 104 , 106 ]. Overall, we can conclude that DL technology is capable of changing the current world, particularly as a powerful computational engine, and of contributing to technology-driven automation and smart, intelligent systems, thereby meeting the goal of Industry 4.0.

Understanding Various Forms of Data

As DL models learn from data, an in-depth understanding and representation of data are important to build a data-driven intelligent system in a particular application area. In the real world, data can be in various forms, which typically can be represented as below for deep learning modeling:

  • Sequential Data Sequential data is any kind of data where the order matters, i.e., a set of sequences. A model needs to explicitly account for the sequential nature of the input data. Text streams, audio fragments, video clips, and time-series data are some examples of sequential data.
  • Image or 2D Data A digital image is made up of a matrix, i.e., a rectangular 2D array of numbers, symbols, or expressions arranged in rows and columns. Matrix, pixels, voxels, and bit depth are the four essential characteristics or fundamental parameters of a digital image.
  • Tabular Data A tabular dataset consists primarily of rows and columns. Thus tabular datasets contain data in a columnar format as in a database table. Each column (field) must have a name and each column may only contain data of the defined type. Overall, it is a logical and systematic arrangement of data in the form of rows and columns that are based on data properties or features. Deep learning models can learn efficiently on tabular data and allow us to build data-driven intelligent systems.
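
As a rough illustration of the three data forms listed above, the following sketch shows one plausible in-memory representation of each using plain Python containers. The variable names, column names and values are hypothetical; real DL pipelines would typically hold the same structures in tensor or data-frame types.

```python
# Sequential data: order matters, e.g. a short time series of readings.
sequence = [0.1, 0.4, 0.35, 0.8]              # shape: (time_steps,)

# Image or 2D data: a matrix of pixel intensities (rows x columns).
image = [
    [0, 255, 0],
    [255, 255, 255],
    [0, 255, 0],
]                                              # shape: (height, width)

# Tabular data: named columns with a fixed type; one dict per row.
table = [
    {"age": 34, "income": 52000.0, "label": "yes"},
    {"age": 51, "income": 61000.0, "label": "no"},
]

def shape_2d(matrix):
    """Return (rows, columns) of a rectangular nested list."""
    return (len(matrix), len(matrix[0]))

def column(rows, name):
    """Extract one named column from a list of row dicts."""
    return [row[name] for row in rows]
```

Reversing `sequence` yields a different sample, whereas reordering the rows of `table` does not change the dataset, which is the practical meaning of "order matters".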

The above-discussed data forms are common in the real-world application areas of deep learning. Different categories of DL techniques perform differently depending on the nature and characteristics of data, discussed briefly in Section “ Deep Learning Techniques and Applications ” with a taxonomy presentation. However, in many real-world application areas, the standard machine learning techniques, particularly logic-rule or tree-based techniques [ 93 , 101 ], can perform strongly, depending on the nature of the application. Figure 3 also shows the performance comparison of DL and ML modeling considering the amount of data. In the following, we highlight several cases where deep learning is useful to solve real-world problems, according to our main focus in this paper.

Figure 3: An illustration of the performance comparison between deep learning (DL) and other machine learning (ML) algorithms, where DL modeling from large amounts of data can increase the performance

DL Properties and Dependencies

A DL model typically follows the same processing stages as machine learning modeling. In Fig. 4, we show a deep learning workflow to solve real-world problems, which consists of three processing steps: data understanding and preprocessing, DL model building and training, and validation and interpretation. However, unlike in ML modeling [ 98 , 108 ], feature extraction in a DL model is automated rather than manual. K-nearest neighbor, support vector machines, decision tree, random forest, naive Bayes, linear regression, association rules, and k-means clustering are some examples of machine learning techniques that are commonly used in various application areas [ 97 ]. On the other hand, DL models include the convolutional neural network, recurrent neural network, autoencoder, deep belief network, and many more, discussed briefly with their potential application areas in Section “ Deep Learning Techniques and Applications ”. In the following, we discuss the key properties and dependencies of DL techniques that need to be taken into account before starting work on DL modeling for real-world applications.

Figure 4: A typical DL workflow to solve real-world problems, which consists of three sequential stages: (i) data understanding and preprocessing, (ii) DL model building and training, and (iii) validation and interpretation

  • Data Dependencies Deep learning is typically dependent on a large amount of data to build a data-driven model for a particular problem domain. The reason is that when the data volume is small, deep learning algorithms often perform poorly [ 64 ]. In such circumstances, however, the performance of the standard machine-learning algorithms will be improved if the specified rules are used [ 64 , 107 ].
  • Hardware Dependencies DL algorithms require a large number of computational operations while training a model with large datasets. Because the advantage of a GPU over a CPU grows with the amount of computation, GPUs are mostly used to carry out these operations efficiently. Thus, GPU hardware is typically necessary for deep learning training to work properly. Therefore, DL relies more on high-performance machines with GPUs than standard machine learning methods do [ 19 , 127 ].
  • Feature Engineering Process Feature engineering is the process of extracting features (characteristics, properties, and attributes) from raw data using domain knowledge. A fundamental distinction between DL and other machine-learning techniques is the attempt to extract high-level characteristics directly from data [ 22 , 97 ]. Thus, DL decreases the time and effort required to construct a feature extractor for each problem.
  • Model Training and Execution Time In general, training a deep learning algorithm takes a long time due to the large number of parameters in the DL algorithm. For instance, DL models can take more than one week to complete a training session, whereas training with ML algorithms takes relatively little time, from seconds to hours [ 107 , 127 ]. During testing, deep learning algorithms take extremely little time to run [ 127 ] when compared to certain machine learning methods.
  • Black-box Perception and Interpretability Interpretability is an important factor when comparing DL with ML. It is difficult to explain how a deep learning result was obtained, i.e., the model is a “black box”. On the other hand, machine-learning algorithms, particularly rule-based machine learning techniques [ 97 ], provide explicit logic rules (IF-THEN) for making decisions that are easily interpretable for humans. For instance, in our earlier works, we have presented several rule-based machine learning techniques [ 100 , 102 , 105 ], where the extracted rules are human-understandable and easier to interpret, update or delete according to the target applications.

The most significant distinction between deep learning and regular machine learning is how well it performs when data grows exponentially. An illustration of the performance comparison between DL and standard ML algorithms is shown in Fig. 3, where DL modeling can increase the performance with the amount of data. Thus, DL modeling is extremely useful when dealing with a large amount of data because of its capacity to process vast numbers of features to build an effective data-driven model. Developing and training DL models relies on parallelized matrix and tensor operations as well as computing gradients and optimization. Several DL libraries and resources [ 30 ] such as PyTorch [ 82 ] (with a high-level API called Lightning) and TensorFlow [ 1 ] (which also offers Keras as a high-level API) offer these core utilities, including many pre-trained models, as well as many other necessary functions for implementation and DL model building.

Deep Learning Techniques and Applications

In this section, we go through the various types of deep neural network techniques, which typically consider several layers of information-processing stages in hierarchical structures to learn. A typical deep neural network contains multiple hidden layers in addition to the input and output layers. Figure 5 shows a general structure of a deep neural network (number of hidden layers = N, with N ≥ 2) compared with a shallow network (number of hidden layers = 1). We also present our taxonomy on DL techniques based on how they are used to solve various problems in this section. However, before exploring the details of the DL techniques, it is useful to review various types of learning tasks such as (i) Supervised: a task-driven approach that uses labeled training data, (ii) Unsupervised: a data-driven process that analyzes unlabeled datasets, (iii) Semi-supervised: a hybridization of both the supervised and unsupervised methods, and (iv) Reinforcement: an environment-driven approach, discussed briefly in our earlier paper [ 97 ]. Thus, to present our taxonomy, we divide DL techniques broadly into three major categories: (i) deep networks for supervised or discriminative learning; (ii) deep networks for unsupervised or generative learning; and (iii) deep networks for hybrid learning combining both, and relevant others, as shown in Fig. 6. In the following, we briefly discuss each of these techniques that can be used to solve real-world problems in various application areas according to their learning capabilities.

Figure 5: A general architecture of (a) a shallow network with one hidden layer and (b) a deep neural network with multiple hidden layers

Figure 6: A taxonomy of DL techniques, broadly divided into three major categories: (i) deep networks for supervised or discriminative learning, (ii) deep networks for unsupervised or generative learning, and (iii) deep networks for hybrid learning and relevant others

Deep Networks for Supervised or Discriminative Learning

This category of DL techniques is utilized to provide a discriminative function in supervised or classification applications. Discriminative deep architectures are typically designed to give discriminative power for pattern classification by describing the posterior distributions of classes conditioned on visible data [ 21 ]. Discriminative architectures mainly include Multi-Layer Perceptron (MLP), Convolutional Neural Networks (CNN or ConvNet), Recurrent Neural Networks (RNN), along with their variants. In the following, we briefly discuss these techniques.

Multi-layer Perceptron (MLP)

Multi-layer Perceptron (MLP), a supervised learning approach [ 83 ], is a type of feedforward artificial neural network (ANN). It is also known as the foundation architecture of deep neural networks (DNN) or deep learning. A typical MLP is a fully connected network that consists of an input layer that receives input data, an output layer that makes a decision or prediction about the input signal, and one or more hidden layers between these two that are considered the network’s computational engine [ 36 , 103 ]. The output of an MLP network is determined using a variety of activation functions, also known as transfer functions, such as ReLU (Rectified Linear Unit), Tanh, Sigmoid, and Softmax [ 83 , 96 ]. To train an MLP, the most extensively used algorithm is “Backpropagation” [ 36 ], a supervised learning technique, which is also known as the most basic building block of a neural network. During the training process, various optimization approaches such as Stochastic Gradient Descent (SGD), Limited Memory BFGS (L-BFGS), and Adaptive Moment Estimation (Adam) are applied. MLP requires tuning of several hyperparameters such as the number of hidden layers, neurons, and iterations, which can make solving a complicated model computationally expensive. However, through partial fit, MLP offers the advantage of learning non-linear models in real-time or online [ 83 ].
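
The layer structure described above can be made concrete with a small forward pass. The following is a minimal sketch, assuming ReLU on the hidden layers and softmax on the output layer; the helper names and the weight values are hypothetical, and training (backpropagation with an optimizer such as SGD or Adam) is deliberately left out.

```python
import math

def relu(v):
    """Rectified Linear Unit applied element-wise."""
    return [max(0.0, a) for a in v]

def softmax(v):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(v)                                  # subtract max for numerical stability
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]

def dense(x, W, b):
    """Fully connected layer: W holds one row of weights per output unit."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def mlp_forward(x, layers):
    """layers is a list of (W, b) pairs; the last pair is the output layer."""
    h = x
    for W, b in layers[:-1]:
        h = relu(dense(h, W, b))                # hidden layers
    W_out, b_out = layers[-1]
    return softmax(dense(h, W_out, b_out))      # output layer

# Hypothetical weights: 2 inputs -> 2 hidden -> 2 hidden -> 2 classes.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
]
probs = mlp_forward([2.0, 1.0], layers)
```

With two hidden layers this counts as a "deep" network in the sense used earlier in the section; passing a list with a single hidden (W, b) pair plus the output pair gives the shallow case.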

Convolutional Neural Network (CNN or ConvNet)

The Convolutional Neural Network (CNN or ConvNet) [ 65 ] is a popular discriminative deep learning architecture that learns directly from the input without the need for human feature extraction. Figure 7 shows an example of a CNN including multiple convolution and pooling layers. As a result, the CNN enhances the design of traditional ANNs such as regularized MLP networks. Each layer in a CNN takes into account optimum parameters for a meaningful output and also reduces model complexity. CNNs also use ‘dropout’ [ 30 ], which can deal with the problem of over-fitting that may occur in a traditional network.

Figure 7: An example of a convolutional neural network (CNN or ConvNet) including multiple convolution and pooling layers

CNNs are specifically intended to deal with a variety of 2D shapes and are thus widely employed in visual recognition, medical image analysis, image segmentation, natural language processing, and many more [ 65 , 96 ]. The capability of automatically discovering essential features from the input without the need for human intervention makes a CNN more powerful than a traditional network. Several variants of CNN exist, including visual geometry group (VGG) [ 38 ], AlexNet [ 62 ], Xception [ 17 ], Inception [ 116 ], ResNet [ 39 ], etc., that can be used in various application domains according to their learning capabilities.
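
To make the convolution and pooling layers more tangible, here is a small sketch of both operations on a nested-list image. It assumes a stride-1 convolution without padding (implemented as cross-correlation, which is what most DL libraries compute) followed by non-overlapping 2×2 max pooling; the kernel and the image are hand-picked for illustration, whereas a real CNN learns its kernels during training.

```python
def conv2d(image, kernel):
    """Stride-1, no-padding cross-correlation of a 2D image with a 2D kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling; incomplete trailing windows are dropped."""
    return [
        [
            max(fmap[i + di][j + dj]
                for di in range(size) for dj in range(size))
            for j in range(0, len(fmap[0]) - size + 1, size)
        ]
        for i in range(0, len(fmap) - size + 1, size)
    ]

# 5x5 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
edge_kernel = [[-1, 1], [-1, 1]]         # responds to dark-to-bright transitions

features = conv2d(image, edge_kernel)    # 4x4 feature map
pooled = max_pool2d(features)            # 2x2 map after pooling
```

The feature map is non-zero only where the edge sits, and pooling keeps that response while halving each spatial dimension, which is how stacked convolution and pooling layers reduce model complexity.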

Recurrent Neural Network (RNN) and its Variants

A Recurrent Neural Network (RNN) is another popular neural network, which employs sequential or time-series data and feeds the output from the previous step as input to the current stage [ 27 , 74 ]. Like feedforward networks and CNNs, recurrent networks learn from training input; however, they are distinguished by their “memory”, which allows information from previous inputs to influence the current input and output. Unlike a typical DNN, which assumes that inputs and outputs are independent of one another, the output of an RNN depends on prior elements within the sequence. However, standard recurrent networks have the issue of vanishing gradients, which makes learning long data sequences challenging. In the following, we discuss several popular variants of the recurrent network that minimize these issues and perform well in many real-world application domains.
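
The "memory" of a recurrent network comes from feeding the previous hidden state back into the current step. A minimal scalar sketch follows, assuming the common Elman-style recurrence h_t = tanh(w_x x_t + w_h h_{t-1} + b); the function name and parameter values are hypothetical.

```python
import math

def rnn_forward(xs, w_x, w_h, b, h0=0.0):
    """Run a scalar recurrent unit over a sequence and return all hidden states."""
    h = h0
    states = []
    for x in xs:
        # The previous state h feeds into the current step.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

# Same values, different order: the final states differ because order matters.
early = rnn_forward([1.0, 0.0, 0.0], w_x=1.0, w_h=0.5, b=0.0)
late = rnn_forward([0.0, 0.0, 1.0], w_x=1.0, w_h=0.5, b=0.0)
```

In `early`, the hidden state stays positive after the input returns to zero but shrinks at every step. The same repeated shrinking applies to gradients propagated back through many steps, which is the vanishing gradient issue mentioned above.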

  • Long short-term memory (LSTM) This is a popular form of RNN architecture, introduced by Hochreiter et al. [ 42 ], that uses special units to deal with the vanishing gradient problem. A memory cell in an LSTM unit can store data for long periods, and the flow of information into and out of the cell is managed by three gates. For instance, the ‘Forget Gate’ determines what information from the previous cell state will be memorized and what information that is no longer useful will be removed, while the ‘Input Gate’ determines which information should enter the cell state and the ‘Output Gate’ determines and controls the outputs. As it solves the issues of training a recurrent network, the LSTM network is considered one of the most successful RNNs.
  • Bidirectional RNN/LSTM Bidirectional RNNs connect two hidden layers that run in opposite directions to a single output, allowing them to accept data from both the past and future. Bidirectional RNNs, unlike traditional recurrent networks, are trained to predict both positive and negative time directions at the same time. A Bidirectional LSTM, often known as a BiLSTM, is an extension of the standard LSTM that can increase model performance on sequence classification issues [ 113 ]. It is a sequence processing model consisting of two LSTMs: one takes the input forward and the other takes it backward. Bidirectional LSTM in particular is a popular choice in natural language processing tasks.
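
The gate mechanics described in the list above can be sketched with a scalar LSTM cell. This is a simplified illustration of the widely used formulation (sigmoid gates, tanh candidate, cell update c = f·c_prev + i·g), not code from the paper; the parameter dictionary `p`, its keys, and the numeric values are hypothetical. The bidirectional variant simply runs one cell forward and a second cell over the reversed sequence.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p maps a gate name to (w_x, w_h, b)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b
    f = sigmoid(pre("forget"))    # how much of the old cell state to keep
    i = sigmoid(pre("input"))     # how much of the candidate to write
    o = sigmoid(pre("output"))    # how much of the cell state to expose
    g = math.tanh(pre("cand"))    # candidate value for the cell state
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c

def run_lstm(xs, p):
    """Run the cell over a sequence, starting from zero states."""
    h, c = 0.0, 0.0
    hs = []
    for x in xs:
        h, c = lstm_step(x, h, c, p)
        hs.append(h)
    return hs

def run_bilstm(xs, p_fwd, p_bwd):
    """Pair forward states with states from a second cell run on the reversed input."""
    fwd = run_lstm(xs, p_fwd)
    bwd = run_lstm(xs[::-1], p_bwd)[::-1]
    return list(zip(fwd, bwd))

# Forget gate nearly open, input gate nearly closed: the cell state is preserved.
keep = {"forget": (0.0, 0.0, 10.0), "input": (0.0, 0.0, -10.0),
        "output": (0.0, 0.0, 0.0), "cand": (1.0, 0.0, 0.0)}
h_keep, c_keep = lstm_step(0.0, 0.0, 0.8, keep)

# Forget gate nearly closed: the previous cell state is discarded.
drop = dict(keep, forget=(0.0, 0.0, -10.0))
h_drop, c_drop = lstm_step(0.0, 0.0, 0.8, drop)
```

Because the cell state is carried forward by a gated copy rather than repeated squashing, information (and gradient) can survive across many steps when the forget gate stays close to 1.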

Figure 8: Basic structure of a gated recurrent unit (GRU) cell consisting of reset and update gates

Overall, the basic property of a recurrent network is that it has at least one feedback connection, which enables activations to loop. This allows the networks to do temporal processing and sequence learning, such as sequence recognition or reproduction, temporal association or prediction, etc. Some popular application areas of recurrent networks are prediction problems, machine translation, natural language processing, text summarization, speech recognition, and many more.

Deep Networks for Generative or Unsupervised Learning

This category of DL techniques is typically used to characterize the high-order correlation properties or features for pattern analysis or synthesis, as well as the joint statistical distributions of the visible data and their associated classes [ 21 ]. The key idea of generative deep architectures is that during the learning process, precise supervisory information such as target class labels is not of concern. As a result, the methods under this category are essentially applied for unsupervised learning, as they are typically used for feature learning or data generation and representation [ 20 , 21 ]. Thus generative modeling can also be used as preprocessing for supervised learning tasks, which can help improve the accuracy of the discriminative model. Commonly used deep neural network techniques for unsupervised or generative learning are Generative Adversarial Network (GAN), Autoencoder (AE), Restricted Boltzmann Machine (RBM), Self-Organizing Map (SOM), and Deep Belief Network (DBN), along with their variants.

Generative Adversarial Network (GAN)

A Generative Adversarial Network (GAN), designed by Ian Goodfellow [ 32 ], is a type of neural network architecture for generative modeling to create new plausible samples on demand. It involves automatically discovering and learning regularities or patterns in input data so that the model may be used to generate or output new examples from the original dataset. As shown in Fig. 9, GANs are composed of two neural networks: a generator G that creates new data having properties similar to the original data, and a discriminator D that predicts the likelihood of a subsequent sample being drawn from actual data rather than data provided by the generator. Thus in GAN modeling, both the generator and discriminator are trained to compete with each other. While the generator tries to fool and confuse the discriminator by creating more realistic data, the discriminator tries to distinguish the genuine data from the fake data generated by G.

Fig. 9 Schematic structure of a standard generative adversarial network (GAN)
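The competition between G and D described above can be made concrete with a small numerical sketch. The function names below are illustrative and not taken from the paper, and the generator loss shown is the commonly used non-saturating form, which is an assumption rather than something the text specifies. The inputs are the discriminator's output probabilities on real and on generated samples.

```python
import math

EPS = 1e-12  # guards against log(0); illustrative choice


def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy for D: push D(x) toward 1 on real data
    and D(G(z)) toward 0 on generated data."""
    real_term = sum(-math.log(p + EPS) for p in d_real) / len(d_real)
    fake_term = sum(-math.log(1.0 - p + EPS) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake):
    """Non-saturating generator loss (assumed variant): G is rewarded
    when D assigns a high 'real' probability to generated samples."""
    return sum(-math.log(p + EPS) for p in d_fake) / len(d_fake)


# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
undecided = discriminator_loss([0.5, 0.5], [0.5, 0.5])
# The generator's loss falls as it fools the discriminator more often.
fooled = generator_loss([0.9, 0.8])
caught = generator_loss([0.1, 0.2])
```

In an actual GAN both losses would be minimized in alternation by gradient descent on the parameters of D and G; the sketch only shows how the two objectives pull in opposite directions.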

Generally, GANs are designed for unsupervised learning tasks, but depending on the task they have also proven to be a good solution for semi-supervised and reinforcement learning [ 3 ]. GANs are also used in state-of-the-art transfer learning research to enforce the alignment of the latent feature space [ 66 ]. Inverse models, such as Bidirectional GAN (BiGAN) [ 25 ], can also learn a mapping from data to the latent space, similar to how the standard GAN model learns a mapping from a latent space to the data distribution. The potential application areas of GANs, which are growing rapidly, include healthcare, image analysis, data augmentation, video generation, voice generation, pandemics, traffic control, cybersecurity, and many more. Overall, GANs have established themselves as a general approach to data augmentation and to problems that call for a generative solution.

Auto-Encoder (AE) and Its Variants

An auto-encoder (AE) [ 31 ] is a popular unsupervised learning technique in which neural networks are used to learn representations. Typically, auto-encoders are applied to high-dimensional data, where the learned low-dimensional code describes how the data is represented. An autoencoder has three parts: the encoder, the code, and the decoder. The encoder compresses the input and generates the code, which the decoder subsequently uses to reconstruct the input. AEs have recently been used to learn generative data models [ 69 ]. The auto-encoder is widely used in many unsupervised learning tasks, e.g., dimensionality reduction, feature extraction, efficient coding, generative modeling, denoising, and anomaly or outlier detection [ 31 , 132 ]. Principal component analysis (PCA) [ 99 ], which is also used to reduce the dimensionality of huge data sets, is essentially similar to a single-layered AE with a linear activation function. Regularized autoencoders such as sparse, denoising, and contractive autoencoders are useful for learning representations for later classification tasks [ 119 ], while variational autoencoders can be used as generative models [ 56 ], as discussed below.
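As a minimal sketch of the encoder, code, and decoder roles, the following trains a linear autoencoder that compresses 2-D points into a 1-D code and reconstructs them. The toy data, initial weights, learning rate, and function names are all illustrative assumptions; the points lie on a line, so a single linear code is enough to reconstruct them, which mirrors the PCA analogy above.

```python
def train_linear_autoencoder(data, steps=2000, lr=0.02):
    """Fit a 2-D -> 1-D -> 2-D linear autoencoder by batch gradient descent
    on the mean squared reconstruction error."""
    w = [0.1, 0.1]  # encoder weights: code z = w . x
    v = [0.1, 0.1]  # decoder weights: reconstruction x_hat = v * z
    n = len(data)
    for _ in range(steps):
        grad_w = [0.0, 0.0]
        grad_v = [0.0, 0.0]
        for x in data:
            z = w[0] * x[0] + w[1] * x[1]             # encoder produces the code
            err = [v[0] * z - x[0], v[1] * z - x[1]]  # decoder output minus input
            back = 2.0 * (err[0] * v[0] + err[1] * v[1]) / n
            for k in range(2):
                grad_v[k] += 2.0 * err[k] * z / n
                grad_w[k] += back * x[k]
        w = [w[k] - lr * grad_w[k] for k in range(2)]
        v = [v[k] - lr * grad_v[k] for k in range(2)]
    return w, v


def reconstruction_error(data, w, v):
    """Mean squared error between each input and its reconstruction."""
    total = 0.0
    for x in data:
        z = w[0] * x[0] + w[1] * x[1]
        total += (v[0] * z - x[0]) ** 2 + (v[1] * z - x[1]) ** 2
    return total / len(data)


# Toy data on the line x2 = 2 * x1 (an assumption for illustration).
DATA = [[t, 2.0 * t] for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]
error_before = reconstruction_error(DATA, [0.1, 0.1], [0.1, 0.1])
w, v = train_linear_autoencoder(DATA)
error_after = reconstruction_error(DATA, w, v)
```

Practical autoencoders use nonlinear layers and a deep learning framework; the hand-written gradients here only expose the mechanics.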

Fig. 10 Schematic structure of a sparse autoencoder (SAE) with several active units (filled circle) in the hidden layer

  • Denoising Autoencoder (DAE) A denoising autoencoder is a variant of the basic autoencoder that attempts to improve the representation (to extract useful features) by altering the reconstruction criterion, and thus reduces the risk of learning the identity function [ 31 , 119 ]. In other words, it receives a corrupted data point as input and is trained to recover the original undistorted input as its output by minimizing the average reconstruction error over the training data, i.e., by cleaning the corrupted input, or denoising. Thus, in the context of computing, DAEs can be regarded as powerful filters for automatic pre-processing. A denoising autoencoder could, for example, be used to automatically pre-process an image, improving its quality and hence recognition accuracy.
  • Contractive Autoencoder (CAE) The idea behind a contractive autoencoder, proposed by Rifai et al. [ 90 ], is to make the autoencoder robust to small changes in the training dataset. A CAE includes an explicit regularizer in its objective function that forces the model to learn an encoding that is robust to small changes in input values. As a result, the sensitivity of the learned representation to the training input is reduced. While DAEs encourage the robustness of the reconstruction, as discussed above, CAEs encourage the robustness of the representation.
  • Variational Autoencoder (VAE) A variational autoencoder [ 55 ] has a property that fundamentally distinguishes it from the classical autoencoder discussed above and makes it effective for generative modeling. Unlike traditional autoencoders, which map the input onto a latent vector, VAEs map the input data onto the parameters of a probability distribution, such as the mean and variance of a Gaussian distribution. A VAE assumes that the source data has an underlying probability distribution and then tries to discover the parameters of that distribution. Although this approach was initially designed for unsupervised learning, its use has been demonstrated in other domains such as semi-supervised learning [ 128 ] and supervised learning [ 51 ].
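To illustrate the VAE idea of mapping an input to the mean and variance of a Gaussian, the sketch below shows two standard ingredients: drawing a latent sample from those parameters (the reparameterization trick) and the closed-form KL divergence to a standard normal prior. The text does not spell out either step, so treat this as an assumed, conventional formulation; the function names are hypothetical, and the encoder that would produce `mu` and `log_var` is omitted.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), per latent dimension.
    log_var holds log(sigma^2), a common parameterization (assumption)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return sum(0.5 * (m * m + math.exp(lv) - 1.0 - lv)
               for m, lv in zip(mu, log_var))


rng = random.Random(0)
mu, log_var = [0.5, -1.0], [0.0, -2.0]  # would come from the encoder network
z = reparameterize(mu, log_var, rng)     # latent sample passed to the decoder
penalty = kl_to_standard_normal(mu, log_var)
```

A VAE training objective typically adds this KL term to a reconstruction loss, so that the latent distribution stays close to the prior used for generating new samples.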

Although AEs were originally intended for dimensionality reduction or feature learning, as mentioned above, they have recently been brought to the forefront of generative modeling, alongside the generative adversarial network, which is one of the popular methods in this area. AEs have been effectively employed in a variety of domains, including healthcare, computer vision, speech recognition, cybersecurity, natural language processing, and many more. Overall, we can conclude that the auto-encoder and its variants can play a significant role in unsupervised feature learning with neural network architectures.

Kohonen Map or Self-Organizing Map (SOM)

A Self-Organizing Map (SOM) or Kohonen Map [ 59 ] is another form of unsupervised learning technique for creating a low-dimensional (usually two-dimensional) representation of a higher-dimensional data set while maintaining the topological structure of the data. SOM is also known as a neural network-based dimensionality reduction algorithm that is commonly used for clustering [ 118 ]. A SOM adapts to the topological form of a dataset by repeatedly moving its neurons closer to the data points, allowing us to visualize enormous datasets and find probable clusters. The first layer of a SOM is the input layer, and the second layer is the output layer or feature map. Unlike other neural networks that use error-correction learning, such as backpropagation with gradient descent [ 36 ], SOMs employ competitive learning, which uses a neighborhood function to retain the topological features of the input space. SOM is widely utilized in a variety of applications, including pattern identification, health or medical diagnosis, anomaly detection, and virus or worm attack detection [ 60 , 87 ]. The primary benefit of employing a SOM is that it makes high-dimensional data easier to visualize and analyze in order to understand the patterns. The reduction of dimensionality and grid clustering make it easy to observe similarities in the data. As a result, SOMs can play a vital role in developing an effective data-driven model for a particular problem domain, depending on the data characteristics.
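The competitive learning step described above can be sketched as follows. For brevity the map is a 1-D chain of neurons rather than the usual 2-D grid, and the Gaussian neighborhood function, learning rate, and radius are illustrative assumptions; the function names are hypothetical.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def best_matching_unit(weights, x):
    """Index of the neuron whose weight vector is closest to input x."""
    return min(range(len(weights)), key=lambda i: sq_dist(weights[i], x))


def som_step(weights, x, lr=0.5, radius=1.0):
    """One competitive-learning update: the winning neuron and its grid
    neighbours move toward x, weighted by a Gaussian neighbourhood."""
    bmu = best_matching_unit(weights, x)
    updated = []
    for i, w in enumerate(weights):
        h = math.exp(-((i - bmu) ** 2) / (2.0 * radius ** 2))
        updated.append([wi + lr * h * (xi - wi) for wi, xi in zip(w, x)])
    return updated


# Three neurons on a 1-D grid with 2-D weight vectors (toy values).
weights = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
sample = [2.2, 2.0]
weights_after = som_step(weights, sample)
```

Repeating this step over many samples, usually while shrinking `lr` and `radius`, is what lets neighbouring neurons end up representing neighbouring regions of the input space.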

Restricted Boltzmann Machine (RBM)

A Restricted Boltzmann Machine (RBM) [ 75 ] is a generative stochastic neural network capable of learning a probability distribution over its inputs. A general Boltzmann machine consists of visible and hidden nodes in which each node is connected to every other node; by learning how a system behaves under normal circumstances, it helps us detect irregularities. RBMs are a subset of Boltzmann machines in which connections are restricted to those between the visible and hidden layers, with no connections among nodes of the same layer [ 77 ]. This restriction permits training algorithms such as the gradient-based contrastive divergence algorithm to be more efficient than those for general Boltzmann machines [ 41 ]. RBMs have found applications in dimensionality reduction, classification, regression, collaborative filtering, feature learning, topic modeling, and many others. In deep learning modeling, they can be trained in either a supervised or an unsupervised manner, depending on the task. Overall, RBMs can recognize patterns in data automatically and build probabilistic or stochastic models, which are used for feature selection or extraction, as well as for forming a deep belief network.
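A small sketch may help show why the bipartite structure matters. It assumes binary units with the standard energy function E(v, h) = -a·v - b·h - vᵀWh, which the text does not state explicitly; the names and toy weights are illustrative. Because no two units in the same layer are connected, each hidden unit's activation probability depends only on the visible layer, and vice versa, so a whole layer can be sampled in one pass.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def energy(v, h, W, a, b):
    """Energy of a joint configuration: E = -a.v - b.h - v^T W h."""
    visible = sum(ai * vi for ai, vi in zip(a, v))
    hidden = sum(bj * hj for bj, hj in zip(b, h))
    interaction = sum(v[i] * W[i][j] * h[j]
                      for i in range(len(v)) for j in range(len(h)))
    return -(visible + hidden + interaction)


def p_hidden_given_visible(v, W, b):
    """P(h_j = 1 | v) for every hidden unit; units are conditionally independent."""
    return [sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b))]


def p_visible_given_hidden(h, W, a):
    """P(v_i = 1 | h) for every visible unit."""
    return [sigmoid(a[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(a))]


# Toy RBM with 2 visible and 2 hidden units (values are assumptions).
W = [[1.0, -1.0], [0.5, 0.0]]
a, b = [0.0, 0.0], [0.0, 0.0]
probs_h = p_hidden_given_visible([1, 0], W, b)
```

Contrastive divergence alternates between these two conditionals for a few steps to approximate the gradient of the log-likelihood.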

Deep Belief Network (DBN)

A Deep Belief Network (DBN) [ 40 ] is a multi-layer generative graphical model built by stacking several individual unsupervised networks, such as AEs or RBMs, which are connected sequentially so that each network's hidden layer serves as the input for the next layer. Thus, we can divide DBNs into (i) AE-DBN, known as stacked AE, and (ii) RBM-DBN, known as stacked RBM, where AE-DBN is composed of autoencoders and RBM-DBN is composed of restricted Boltzmann machines, both discussed earlier. The goal is a faster unsupervised training technique for each sub-network, which relies on contrastive divergence [ 41 ]. Owing to its deep structure, a DBN can capture a hierarchical representation of the input data. The primary idea behind DBN is to train unsupervised feed-forward neural networks with unlabeled data before fine-tuning the network with labeled input. One of the most important advantages of DBN over typical shallow learning networks is that it permits the detection of deep patterns, which allows for reasoning abilities and for capturing the deep difference between normal and erroneous data [ 89 ]. A continuous DBN is simply an extension of a standard DBN that allows a continuous range of decimals instead of binary data. Overall, the DBN model can play a key role in a wide range of high-dimensional data applications due to its strong feature extraction and classification capabilities, and it has become one of the significant topics in the field of neural networks.

In summary, the generative learning techniques discussed above typically allow us to generate a new representation of data through exploratory analysis. As a result, these deep generative networks can be used as preprocessing for supervised or discriminative learning tasks and can help improve model accuracy, since unsupervised representation learning can improve classifier generalization.

Deep Networks for Hybrid Learning and Other Approaches

In addition to the above-discussed deep learning categories, hybrid deep networks and several other approaches such as deep transfer learning (DTL) and deep reinforcement learning (DRL) are popular, which are discussed in the following.

Hybrid Deep Neural Networks

Generative models are adaptable, with the capacity to learn from both labeled and unlabeled data. Discriminative models, on the other hand, are unable to learn from unlabeled data yet outperform their generative counterparts in supervised tasks. A framework for training both deep generative and discriminative models simultaneously can enjoy the benefits of both models, which motivates hybrid networks.

Hybrid deep learning models are typically composed of two or more basic deep learning models, where each basic model is one of the discriminative or generative deep learning models discussed earlier. Based on how the different basic generative or discriminative models are integrated, the following three categories of hybrid deep learning models might be useful for solving real-world problems:

  • Hybrid Model_1: An integration of different generative or discriminative models to extract more meaningful and robust features. Examples could be CNN+LSTM, AE+GAN, and so on.
  • Hybrid Model_2: An integration of a generative model followed by a discriminative model. Examples could be DBN+MLP, GAN+CNN, AE+CNN, and so on.
  • Hybrid Model_3: An integration of a generative or discriminative model followed by a non-deep learning classifier. Examples could be AE+SVM, CNN+SVM, and so on.

Thus, in a broad sense, we can conclude that hybrid models can be either classification-focused or non-classification-focused, depending on the target use. However, most of the hybrid learning-related studies in the area of deep learning are classification-focused or supervised learning tasks, as summarized in Table 1. Unsupervised generative models with meaningful representations are employed to enhance the discriminative models. Generative models with useful representations can provide more informative and low-dimensional features for discrimination, and they can also enhance the quality and quantity of the training data, providing additional information for classification.

Table 1 A summary of deep learning tasks and methods in several popular real-world application areas

| Application areas | Tasks | Methods | References |
|---|---|---|---|
| Healthcare and Medical applications | Regular health factors analysis | CNN-based | Ismail et al. [ ] |
| | Identifying malicious behaviors | RNN-based | Xue et al. [ ] |
| | Coronary heart disease risk prediction | Autoencoder based | Amarbayasgalan et al. [ ] |
| | Cancer classification | Transfer learning based | Sevakula et al. [ ] |
| | Diagnosis of COVID-19 | CNN and BiLSTM based | Aslan et al. [ ] |
| | Detection of COVID-19 | CNN-LSTM based | Islam et al. [ ] |
| Natural Language Processing | Text summarization | Auto-encoder based | Yousefi et al. [ ] |
| | Sentiment analysis | CNN-LSTM based | Wang et al. [ ] |
| | Sentiment analysis | CNN and Bi-LSTM based | Minaee et al. [ ] |
| | Aspect-level sentiment classification | Attention-based LSTM | Wang et al. [ ] |
| Speech recognition | Distant speech recognition | Attention-based LSTM | Zhang et al. [ ] |
| | Speech emotion classification | Transfer learning based | Latif et al. [ ] |
| | Emotion recognition from speech | CNN and LSTM based | Satt et al. [ ] |
| Cybersecurity | Zero-day malware detection | Autoencoders and GAN based | Kim et al. [ ] |
| | Security incidents and fraud analysis | SOM-based | Lopez et al. [ ] |
| | Android malware detection | Autoencoder and CNN based | Wang et al. [ ] |
| | Intrusion detection classification | DBN-based | Wei et al. [ ] |
| | DoS attack detection | RBM-based | Imamverdiyev et al. [ ] |
| | Suspicious flow detection | Hybrid deep-learning-based | Garg et al. [ ] |
| | Network intrusion detection | AE and SVM based | Al et al. [ ] |
| IoT and Smart cities | Smart energy management | CNN and Attention mechanism | Abdel et al. [ ] |
| | Particulate matter forecasting | CNN-LSTM based | Huang et al. [ ] |
| | Smart parking system | CNN-LSTM based | Piccialli et al. [ ] |
| | Disaster management | DNN-based | Aqib et al. [ ] |
| | Air quality prediction | LSTM-RNN based | Kok et al. [ ] |
| | Cybersecurity in smart cities | RBM, DBN, RNN, CNN, GAN | Chen et al. [ ] |
| Smart Agriculture | A smart agriculture IoT system | RL-based | Bu et al. [ ] |
| | Plant disease detection | CNN-based | Ale et al. [ ] |
| | Automated soil quality evaluation | DNN-based | Sumathi et al. [ ] |
| Business and Financial Services | Predicting customers’ purchase behavior | DNN based | Chaudhuri [ ] |
| | Stock trend prediction | CNN and LSTM based | Anuradha et al. [ ] |
| | Financial loan default prediction | CNN-based | Deng et al. [ ] |
| | Power consumption forecasting | LSTM-based | Shao et al. [ ] |
| Virtual Assistant and Chatbot Services | An intelligent chatbot | Bi-RNN and Attention model | Dhyani et al. [ ] |
| | Virtual listener agent | GRU and LSTM based | Huang et al. [ ] |
| | Smart blind assistant | CNN-based | Rahman et al. [ ] |
| Object Detection and Recognition | Object detection in X-ray images | CNN-based | Gu et al. [ ] |
| | Object detection for disaster response | CNN-based | Pi et al. [ ] |
| | Medicine recognition system | CNN-based | Chang et al. [ ] |
| | Face recognition in IoT-cloud environment | CNN-based | Masud et al. [ ] |
| | Food recognition system | CNN-based | Liu et al. [ ] |
| | Affect recognition system | DBN-based | Kawde et al. [ ] |
| | Facial expression analysis | CNN and LSTM based | Li et al. [ ] |
| Recommendation and Intelligent system | Hybrid recommender system | DNN-based | Kiran et al. [ ] |
| | Visual recommendation and search | CNN-based | Shankar et al. [ ] |
| | Recommendation system | CNN and Bi-LSTM based | Rosa et al. [ ] |
| | Intelligent system for impaired patients | RL-based | Naeem et al. [ ] |
| | Intelligent transportation system | CNN-based | Wang et al. [ ] |

Deep Transfer Learning (DTL)

Transfer learning is a technique for effectively using previously learned model knowledge to solve a new task with minimal training or fine-tuning. Compared with typical machine learning techniques [ 97 ], DL requires a large amount of training data. As a result, the need for a substantial volume of labeled data is a significant barrier to addressing some essential domain-specific tasks, particularly in the medical sector, where creating large-scale, high-quality annotated medical or health datasets is both difficult and costly. Furthermore, the standard DL model demands a lot of computational resources, such as a GPU-enabled server, even though researchers are working hard to reduce this demand. Deep Transfer Learning (DTL), a DL-based transfer learning method, might therefore be helpful in addressing these issues. Figure 11 shows a general structure of the transfer learning process, where knowledge from the pre-trained model is transferred into a new DL model. It is especially popular in deep learning right now because it allows deep neural networks to be trained with very little data [ 126 ].

Fig. 11 A general structure of the transfer learning process, where knowledge from a pre-trained model is transferred into a new DL model

Transfer learning is a two-stage approach for training a DL model that consists of a pre-training step and a fine-tuning step in which the model is trained on the target task. Since deep neural networks have gained popularity in a variety of fields, a large number of DTL methods have been presented, making it crucial to categorize and summarize them. Based on the techniques used in the literature, DTL can be classified into four categories [ 117 ]. These are (i) instances-based deep transfer learning, which reuses instances from the source domain with appropriate weights, (ii) mapping-based deep transfer learning, which maps instances from the two domains into a new data space with better similarity, (iii) network-based deep transfer learning, which reuses part of a network pre-trained in the source domain, and (iv) adversarial-based deep transfer learning, which uses adversarial techniques to find transferable features that are suitable for both domains. Due to its high effectiveness and practicality, adversarial-based deep transfer learning has exploded in popularity in recent years. Transfer learning can also be classified into inductive, transductive, and unsupervised transfer learning, depending on the relationship between the source and target domains and tasks [ 81 ]. While most current research focuses on supervised learning, how deep neural networks can transfer knowledge in unsupervised or semi-supervised learning may gain further interest in the future. DTL techniques are useful in a variety of fields, including natural language processing, sentiment classification, visual recognition, speech recognition, spam filtering, and other related areas.
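The network-based category (iii) can be sketched in a few lines: the pre-trained part of the network is kept frozen and only a small new output layer is fine-tuned on the target task. Everything concrete below is a stand-in for illustration: `pretrained_features` is a hypothetical fixed function playing the role of the reused layers, the target data is a toy set, and the head is a logistic-regression layer trained by stochastic gradient descent.

```python
import math


def pretrained_features(x):
    """Hypothetical frozen feature extractor standing in for layers
    pre-trained on a source task; its parameters are never updated."""
    return [x[0] + x[1], x[0] - x[1]]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fine_tune_head(data, labels, epochs=50, lr=0.5):
    """Train only a new logistic output layer on top of the frozen features."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in zip(data, labels):
            f = pretrained_features(x)
            p = sigmoid(w[0] * f[0] + w[1] * f[1] + b)
            g = p - y  # gradient of the log loss with respect to the logit
            w = [wi - lr * g * fi for wi, fi in zip(w, f)]
            b -= lr * g
    return w, b


def predict(x, w, b):
    f = pretrained_features(x)
    return 1 if sigmoid(w[0] * f[0] + w[1] * f[1] + b) >= 0.5 else 0


# Tiny labelled target-task dataset (toy values): label 1 when x0 + x1 > 0.
TARGET_X = [[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0],
            [-0.5, -2.0], [0.5, 1.5], [-1.5, 0.0]]
TARGET_Y = [1, 1, 0, 0, 1, 0]
w, b = fine_tune_head(TARGET_X, TARGET_Y)
```

Because only three parameters are trained, a handful of labelled examples suffices here, which is the practical appeal of reusing a pre-trained network when target data is scarce.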

Deep Reinforcement Learning (DRL)

Reinforcement learning takes a different approach to solving the sequential decision-making problem than other approaches we have discussed so far. The concepts of an environment and an agent are often introduced first in reinforcement learning. The agent can perform a series of actions in the environment, each of which has an impact on the environment’s state and can result in possible rewards (feedback) - “positive” for good sequences of actions that result in a “good” state, and “negative” for bad sequences of actions that result in a “bad” state. The purpose of reinforcement learning is to learn good action sequences through interaction with the environment, typically referred to as a policy.

Deep reinforcement learning (DRL or deep RL) [ 9 ] integrates neural networks with a reinforcement learning architecture to allow agents to learn the appropriate actions in a virtual environment, as shown in Fig. 12. In the area of reinforcement learning, model-based RL learns a transition model that makes it possible to model the environment without interacting with it directly, whereas model-free RL methods learn directly from interactions with the environment. Q-learning is a popular model-free RL technique for determining the best action-selection policy for any (finite) Markov Decision Process (MDP) [ 86 , 97 ]. An MDP is a mathematical framework for modeling decisions based on states, actions, and rewards [ 86 ]. In addition, Deep Q-Networks, Double DQN, Bi-directional Learning, Monte Carlo Control, etc. are used in the area [ 50 , 97 ]. DRL methods incorporate DL models, e.g., deep neural networks (DNN), as policy and/or value function approximators based on the MDP principle [ 71 ]. A CNN, for example, can be used as a component of RL agents to learn directly from raw, high-dimensional visual inputs. In the real world, DRL-based solutions can be used in several application areas, including robotics, video games, natural language processing, computer vision, and other related areas.

Fig. 12 Schematic structure of deep reinforcement learning (DRL) highlighting a deep neural network
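As a minimal illustration of the model-free Q-learning mentioned above, the sketch below learns a tabular action-value function on a toy five-state chain. The environment, reward, hyper-parameters, and function names are assumptions made for this example. In DRL the table `q` would be replaced by a neural network that approximates the same values.

```python
import random

N_STATES = 5      # states 0..4; state 4 is the terminal goal state
ACTIONS = (0, 1)  # 0 = move left, 1 = move right


def step(state, action):
    """Deterministic toy environment: reward 1 only on reaching the goal."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning with a purely random behaviour policy.
    Q-learning is off-policy, so it still learns the greedy values."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice(ACTIONS)
            nxt, reward, done = step(state, action)
            target = reward + gamma * max(q[nxt])  # bootstrapped return
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
    return q


def greedy_policy(q):
    """Best action in every non-terminal state according to the table."""
    return [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]


q_table = q_learning()
policy = greedy_policy(q_table)
```

The learned policy is the mapping from states to actions that the agent would follow; here it should prefer moving right, toward the rewarding state.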

Deep Learning Application Summary

During the past few years, deep learning has been successfully applied to numerous problems in many application areas. These include natural language processing, sentiment analysis, cybersecurity, business, virtual assistants, visual recognition, healthcare, robotics, and many more. In Fig. 13, we have summarized several potential real-world application areas of deep learning. Various deep learning techniques from the taxonomy presented in Fig. 6, which includes discriminative learning, generative learning, and hybrid models, as discussed earlier, are employed in these application areas. In Table 1, we have also summarized various deep learning tasks and the techniques used to solve them in several real-world application areas. Overall, from Fig. 13 and Table 1, we can conclude that the future prospects of deep learning modeling in real-world application areas are substantial and that there is considerable scope for further work. In the next section, we also summarize the research issues in deep learning modeling and point out the potential aspects for future-generation DL modeling.

Fig. 13 Several potential real-world application areas of deep learning

Research Directions and Future Aspects

While existing methods have established a solid foundation for deep learning systems and research, this section outlines the following ten potential future research directions based on our study.

  • Automation in Data Annotation According to the existing literature, discussed in Section 3, most deep learning models are trained on publicly available datasets that are already annotated. However, to build a system for a new problem domain or a recent data-driven system, raw data must be collected from relevant sources. Thus, data annotation, e.g., categorization, tagging, or labeling of a large amount of raw data, is important for building discriminative deep learning models or supervised tasks, and it is challenging. A technique capable of automatic and dynamic data annotation, rather than manual annotation or hiring annotators, could be more effective for supervised learning and could minimize human effort, particularly for large datasets. Therefore, a more in-depth investigation of data collection and annotation methods, or the design of an unsupervised learning-based solution, could be one of the primary research directions in the area of deep learning modeling.
  • Data Preparation for Ensuring Data Quality As discussed throughout the paper, deep learning algorithms are highly affected by the quality and availability of the training data, and so is the resultant model for a particular problem domain. Deep learning models may become worthless or yield decreased accuracy if the training data is poor, for example because of data sparsity, non-representative samples, poor quality, ambiguous values, noise, data imbalance, irrelevant features, data inconsistency, insufficient quantity, and so on. Such issues in data can lead to poor processing and inaccurate findings, which is a major problem when discovering insights from data. Deep learning models therefore also need to adapt to such issues in data in order to capture approximate information from observations. Effective data pre-processing techniques need to be designed according to the nature and characteristics of the data problem to handle these emerging challenges, which could be another research direction in the area.
  • Black-box Perception and Proper DL/ML Algorithm Selection In general, it is difficult to explain how a deep learning result is obtained or how a particular model arrives at its final decisions. Although DL models achieve significant performance when learning from large datasets, as discussed in Section 2, this “black-box” perception of DL modeling typically reflects weak statistical interpretability, which could be a major issue in the area. On the other hand, ML algorithms, particularly rule-based machine learning techniques, provide explicit logic rules (IF-THEN) for making decisions that are easier to interpret, update, or delete according to the target applications [ 97 , 100 , 105 ]. If the wrong learning algorithm is chosen, unanticipated results may occur, resulting in wasted effort as well as a loss of model efficacy and accuracy. Thus, selecting an appropriate model for the target application while taking into account performance, complexity, model accuracy, and applicability is challenging, and in-depth analysis is needed for better understanding and decision making.
  • Deep Networks for Supervised or Discriminative Learning According to our designed taxonomy of deep learning techniques, as shown in Fig. 6, discriminative architectures mainly include MLP, CNN, and RNN, along with their variants, which are applied widely in various application domains. However, designing new discriminative techniques or variants that take into account model optimization, accuracy, and applicability, according to the target real-world application and the nature of the data, could be a novel contribution, and can also be considered a major future aspect in the area of supervised or discriminative learning.
  • Deep Networks for Unsupervised or Generative Learning As discussed in Section 3, unsupervised learning or generative deep learning modeling is one of the major tasks in the area, as it allows us to characterize the high-order correlation properties or features in data, or to generate a new representation of data through exploratory analysis. Moreover, unlike supervised learning [ 97 ], it does not require labeled data, since it can derive insights directly from the data and support data-driven decision making. Consequently, it can be used as preprocessing for supervised learning or discriminative modeling as well as for semi-supervised learning tasks, which can improve learning accuracy and model efficiency. According to our designed taxonomy of deep learning techniques, as shown in Fig. 6, generative techniques mainly include GAN, AE, SOM, RBM, DBN, and their variants. Thus, designing new techniques or variants for effective data modeling or representation according to the target real-world application could be a novel contribution, and can also be considered a major future aspect in the area of unsupervised or generative learning.
  • Hybrid/Ensemble Modeling and Uncertainty Handling According to our designed taxonomy of DL techniques, as shown in Fig. 6, this is considered another major category in deep learning tasks. As hybrid modeling enjoys the benefits of both generative and discriminative learning, an effective hybridization can outperform other approaches in terms of performance as well as uncertainty handling in high-risk applications. In Section 3, we have summarized various types of hybridization, e.g., AE+CNN/SVM. Since a group of neural networks can be trained with distinct parameters or with separate sub-sampled training datasets, hybridization or ensembles of such techniques, i.e., DL with DL/ML, can play a key role in the area. Thus, designing effective blended discriminative and generative models, rather than naive combinations, could be an important research opportunity to solve various real-world issues, including semi-supervised learning tasks and model uncertainty.
  • Dynamism in Selecting Threshold/Hyper-parameter Values and Network Structures with Computational Efficiency In general, the relationship among performance, model complexity, and computational requirements is a key issue in deep learning modeling and applications. Combining algorithmic advancements that improve accuracy with computational efficiency, i.e., achieving the maximum throughput while consuming the least amount of resources without significant information loss, can lead to a breakthrough in the effectiveness of deep learning modeling in future real-world applications. The concept of incremental approaches or recency-based learning [ 100 ] might be effective in several cases, depending on the nature of the target applications. Moreover, assuming network structures with a static number of nodes and layers, fixing hyper-parameter values or threshold settings, or selecting them by trial and error may not be effective in many cases, as the appropriate values can change when the data changes. Thus, a data-driven approach that selects them dynamically could be more effective when building a deep learning model, in terms of both performance and real-world applicability. Such data-driven automation can lead to future-generation deep learning modeling with additional intelligence, which could be a significant future aspect in the area as well as an important research direction.
  • Lightweight Deep Learning Modeling for Next-Generation Smart Devices and Applications In recent years, the Internet of Things (IoT), consisting of billions of intelligent and communicating things, and mobile communication technologies have become popular for detecting and gathering human and environmental information (e.g., geo-information, weather data, bio-data, human behaviors, and so on) for a variety of intelligent services and applications. Every day, these ubiquitous smart things or devices generate large amounts of data, requiring rapid data processing on a variety of smart mobile devices [ 72 ]. Deep learning technologies can be incorporated to discover underlying properties and to effectively handle such large amounts of sensor data for a variety of IoT applications, including health monitoring and disease analysis, smart cities, traffic flow prediction and monitoring, smart transportation, manufacturing inspection, fault assessment, smart industry or Industry 4.0, and many more. Although the deep learning techniques discussed in Section 3 are considered powerful tools for processing big data, their high computational cost and considerable memory overhead make lightweight modeling important for resource-constrained devices. Thus, several techniques such as optimization, simplification, compression, pruning, generalization, important feature extraction, etc. might be helpful in several cases. Therefore, constructing lightweight deep learning techniques based on a baseline network architecture, to adapt the DL model for next-generation mobile, IoT, or resource-constrained devices and applications, could be considered a significant future aspect in the area.
  • Incorporating Domain Knowledge into Deep Learning Modeling Domain knowledge, as opposed to general knowledge or domain-independent knowledge, is knowledge of a specific, specialized topic or field. For instance, in natural language processing, the properties of the English language typically differ from those of other languages such as Bengali, Arabic, French, etc. Thus, integrating domain-based constraints into the deep learning model could produce better results for such particular purposes. For instance, a task-specific feature extractor that considers domain knowledge in smart manufacturing for fault diagnosis can resolve the issues in traditional deep-learning-based methods [ 28 ]. Similarly, domain knowledge in medical image analysis [ 58 ], financial sentiment analysis [ 49 ], and cybersecurity analytics [ 94 , 103 ], as well as conceptual data models that include semantic information (i.e., information that is meaningful for a system, rather than merely correlational) [ 45 , 121 , 131 ], can play a vital role in the area. Transfer learning could be an effective way to get started on a new challenge with domain knowledge. Moreover, contextual information such as spatial, temporal, social, and environmental contexts [ 92 , 104 , 108 ] can also play an important role in combining context-aware computing with domain knowledge for smart decision making and for building adaptive and intelligent context-aware systems. Therefore, understanding domain knowledge and effectively incorporating it into the deep learning model could be another research direction.
  • Designing General Deep Learning Framework for Target Application Domains: One promising research direction for deep learning-based solutions is to develop a general framework that can handle data diversity, dimensions, stimulation types, etc. The general framework would require two key capabilities: an attention mechanism that focuses on the most valuable parts of input signals, and the ability to capture latent features, which enables the framework to pick out the distinctive and informative features. Attention models have been a popular research topic because of their intuition, versatility, and interpretability, and have been employed in various application areas like computer vision, natural language processing, text or image classification, sentiment analysis, recommender systems, user profiling, etc. [ 13 , 80 ]. An attention mechanism can be implemented based on learning algorithms such as reinforcement learning, which is capable of finding the most useful part through a policy search [ 133 , 134 ]. Similarly, a CNN can be integrated with suitable attention mechanisms to form a general classification framework, where the CNN is used as a feature learning tool for capturing features at various levels and ranges. Thus, designing a general deep learning framework that considers attention as well as latent features for target application domains could be another area to contribute.
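
As a rough illustration of what "focusing on the most valuable parts of input signals" can mean in practice, the sketch below implements scaled dot-product attention for a single query in plain Python. This is one common formulation chosen here for illustration, not necessarily the mechanism used in the cited works; the function names and the toy numbers are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector.

    Returns (weights, context): one weight per input position and the
    weighted sum of the value vectors.
    """
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys
    ]
    weights = softmax(scores)
    context = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, context


# Toy input: three positions with 2-d keys and values (made-up numbers).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, context = attend([1.0, 1.0], keys, values)
```

The third position matches the query best, so it receives the largest weight; the weights always sum to one, which is what makes them readable as a soft selection over the input.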

To summarize, deep learning is a fairly open topic to which academics can contribute by developing new methods or improving existing methods to handle the above-mentioned concerns and tackle real-world problems in a variety of application areas. This can also help the researchers conduct a thorough analysis of the application’s hidden and unexpected challenges to produce more reliable and realistic outcomes. Overall, we can conclude that addressing the above-mentioned issues and contributing to proposing effective and efficient techniques could lead to “Future Generation DL” modeling as well as more intelligent and automated applications.

Concluding Remarks

In this article, we have presented a structured and comprehensive view of deep learning technology, which is considered a core part of artificial intelligence as well as data science. It starts with a history of artificial neural networks and moves to recent deep learning techniques and breakthroughs in different applications. Then, the key algorithms in this area, as well as deep neural network modeling in various dimensions are explored. For this, we have also presented a taxonomy considering the variations of deep learning tasks and how they are used for different purposes. In our comprehensive study, we have taken into account not only the deep networks for supervised or discriminative learning but also the deep networks for unsupervised or generative learning, and hybrid learning that can be used to solve a variety of real-world issues according to the nature of problems.

Deep learning, unlike traditional machine learning and data mining algorithms, can produce extremely high-level data representations from enormous amounts of raw data. As a result, it has provided an excellent solution to a variety of real-world problems. A successful deep learning technique must rely on data-driven modeling suited to the characteristics of the raw data. The sophisticated learning algorithms then need to be trained on the collected data and knowledge related to the target application before the system can assist with intelligent decision-making. Deep learning has been shown to be useful in a wide range of applications and research areas such as healthcare, sentiment analysis, visual recognition, business intelligence, cybersecurity, and many more that are summarized in the paper.

Finally, we have summarized and discussed the challenges faced, the potential research directions, and future aspects in the area. Although deep learning is considered a black-box solution for many applications due to its poor reasoning and interpretability, addressing the challenges or future aspects that are identified could lead to future generation deep learning modeling and smarter systems. This can also help researchers carry out in-depth analysis to produce more reliable and realistic outcomes. Overall, we believe that our study on neural networks and deep learning-based advanced analytics points in a promising direction and can be utilized as a reference guide for future research and implementations in relevant application domains by both academic and industry professionals.

Declarations

The author declares no conflict of interest.

This article is part of the topical collection “Advances in Computational Approaches for Artificial Intelligence, Image Processing, IoT and Cloud Applications” guest edited by Bhanu Prakash K. N. and M. Shivakumar.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


50 Deep Learning Research Ideas

Deep learning Research and Project Ideas

Dr. Somasundaram R

Deep learning is a branch of artificial intelligence that uses algorithms to model high-level abstractions in data by using multiple layers of processing. It is a subset of machine learning, which is a broader field of artificial intelligence that uses algorithms to learn from data.

Deep learning algorithms are used to recognize patterns in large datasets and make predictions based on those patterns. Deep learning research focuses on developing algorithms that can learn from data in an unsupervised manner, allowing them to learn complex representations of data without relying on explicit instructions from humans.

Deep learning research also focuses on developing methods for improving the accuracy and speed of deep learning algorithms. Additionally, deep learning research explores ways to make deep learning algorithms more efficient and effective in a variety of applications.

In this article, iLovePhD lists 50 interesting research and project ideas in deep learning.

Deep Learning Research Ideas

1. Developing a deep learning model to detect and classify objects in images.

2. Developing a deep learning model to detect and classify objects in videos.

3. Developing a deep learning model to detect and classify objects in 3D scenes.

4. Developing a deep learning model to detect and classify objects in audio.

5. Developing a deep learning model to detect and classify objects in text.

6. Developing a deep learning model to generate new images from a given set of images.

7. Developing a deep learning model to generate new videos from a given set of videos.

8. Developing a deep learning model to generate new 3D scenes from a given set of 3D scenes.

9. Developing a deep learning model to generate new audio from a given set of audio.

10. Developing a deep learning model to generate new text from a given set of text.

11. Developing a deep learning model to detect and classify emotions in images.

12. Developing a deep learning model to detect and classify emotions in videos.

13. Developing a deep learning model to detect and classify emotions in audio.

14. Developing a deep learning model to detect and classify emotions in text.

15. Developing a deep learning model to detect and classify objects in medical images.

16. Developing a deep learning model to detect and classify objects in medical videos.

17. Developing a deep learning model to detect and classify objects in medical audio.

18. Developing a deep learning model to detect and classify objects in medical text.

19. Developing a deep learning model to detect and classify objects in satellite images.

20. Developing a deep learning model to detect and classify objects in aerial videos.

21. Developing a deep learning model to detect and classify objects in aerial audio.

22. Developing a deep learning model to detect and classify objects in aerial text.

23. Developing a deep learning model to detect and classify objects in street view images.

24. Developing a deep learning model to detect and classify objects in street view videos.

25. Developing a deep learning model to detect and classify objects in street view audio.

26. Developing a deep learning model to detect and classify objects in street view text.

27. Developing a deep learning model to detect and classify objects in industrial images.

28. Developing a deep learning model to detect and classify objects in industrial videos.

29. Developing a deep learning model to detect and classify objects in industrial audio.

30. Developing a deep learning model to detect and classify objects in industrial text.

31. Developing a deep learning model to detect and classify objects in autonomous vehicle images.

32. Developing a deep learning model to detect and classify objects in autonomous vehicle videos.

33. Developing a deep learning model to detect and classify objects in autonomous vehicle audio.

34. Developing a deep learning model to detect and classify objects in autonomous vehicle text.

35. Developing a deep learning model to detect and classify objects in robotics images.

36. Developing a deep learning model to detect and classify objects in robotics videos.

37. Developing a deep learning model to detect and classify objects in robotics audio.

38. Developing a deep learning model to detect and classify objects in robotics text.

39. Developing a deep learning model to detect and classify objects in natural language processing.

40. Developing a deep learning model to detect and classify objects in computer vision.

41. Developing a deep learning model to detect and classify objects in speech recognition.

42. Developing a deep learning model to detect and classify objects in natural language understanding.

43. Developing a deep learning model to detect and classify objects in facial recognition.

44. Developing a deep learning model to detect and classify objects in gesture recognition.

45. Developing a deep learning model to detect and classify objects in sentiment analysis.

46. Developing a deep learning model to detect and classify objects in time series analysis.

47. Developing a deep learning model to detect and classify objects in anomaly detection.

48. Developing a deep learning model to detect and classify objects in recommender systems.

49. Developing a deep learning model to detect and classify objects in medical diagnosis.

50. Developing a deep learning model to detect and classify objects in fraud detection.

I hope this article helps you explore various deep learning research and project ideas.

  • Open access
  • Published: 08 September 2024

Leveraging deep learning and computer vision technologies to enhance management of coastal fisheries in the Pacific region

  • George Shedrawi   ORCID: orcid.org/0000-0002-4507-7293 1 , 2 ,
  • Franck Magron 1 ,
  • Bernard Vigga 1 ,
  • Pauline Bosserelle 1 ,
  • Sebastien Gislard 1 ,
  • Andrew R. Halford 1 ,
  • Sapeti Tiitii 3 ,
  • Faasulu Fepuleai 3 ,
  • Chris Molai 4 ,
  • Manibua Rota 5 ,
  • Shivam Jalam 6 ,
  • Viliami Fatongiatau 7 ,
  • Abel P. Sami 8 ,
  • Beia Nikiari 5 ,
  • Ada H. M. Sokach 8 ,
  • Lucy A. Joy 8 ,
  • Owen Li 2 ,
  • Dirk J. Steenbergen 2 &
  • Neil L. Andrew 2  

Scientific Reports volume 14, Article number: 20915 (2024)

  • Conservation biology
  • Ecosystem services
  • Marine biology

This paper presents the design and development of a coastal fisheries monitoring system that harnesses artificial intelligence technologies. Application of the system across the Pacific region promises to revolutionize coastal fisheries management. The program is built on a centralized, cloud-based monitoring system to automate data extraction and analysis processes. The system leverages YoloV4, OpenCV, and ResNet101 to extract information from images of fish and invertebrates collected as part of in-country monitoring programs overseen by national fisheries authorities. As of December 2023, the system has facilitated automated identification of over six hundred nearshore finfish species, and automated length and weight measurements of more than 80,000 specimens across the Pacific. The system integrates other key fisheries monitoring data such as catch rates, fishing locations and habitats, volumes, pricing, and market characteristics. The collection of these metrics supports much needed rapid fishery assessments. The system’s co-development with national fisheries authorities and the geographic extent of its application enables capacity development and broader local inclusion of fishing communities in fisheries management. In doing so, the system empowers fishers to work with fisheries authorities to enable data-informed decision-making for more effective adaptive fisheries management. The system overcomes historically entrenched technical and financial barriers in fisheries management in many Pacific island communities.

Introduction

Although diverse, Small Island Developing States (SIDS) share many challenges in their journeys toward sustainable development 1 . Inherent constraints like remoteness and small size amplify the impacts from natural disasters 2 , disease outbreaks 3 , and climate change 4 , 5 , 6 , 7 , 8 . The unique and disproportionate nature of these vulnerabilities was internationally recognized in 1992 9 , which has driven considerable efforts to develop solutions in the decades since 10 , 11 , 12 .

These vulnerabilities are apparent in all Pacific SIDS 13 , 14 , 15 , where coastal fisheries are central to sustainable development for food security 16 . Fish, broadly defined to include invertebrates, provide critical livelihoods and food in the 22 countries and territories in the region 17 . In all Pacific SIDS, but particularly in the atoll countries of Micronesia, fish consumption rates are among the highest in the world and fish are the dominant animal source food, providing essential macro- and micronutrients 18 . Pacific Island peoples acquire fish from a variety of sources and with a diverse array of exchange mechanisms, ranging from home production through to cash-based markets 12 , 19 .

The central role of coastal fisheries in the lives and economies of Pacific peoples, which include nearshore commercial and subsistence fisheries, means that sustainability of fish catches is an important national policy priority in all jurisdictions in the region 20 , 21 . The importance of ensuring the sustainable supply of fish is also reflected in several international instruments 22 , including the Sustainable Development Goals of the United Nations (UN) 2030 Agenda 23 , 24 .

Agencies and organizations tasked with ensuring the sustainability of small-scale coastal fisheries often face historically insurmountable challenges 25 , 26 . Many of the constraints recognized in oceanic fisheries in Pacific SIDS have present day analogues in their coastal fisheries 27 , 28 , 29 , making conventional fisheries management challenging 30 , 31 , 32 . These are exacerbated by the lack of centralized and mandated coordination of scientific investigation and the diversity of coastal fisheries themselves in terms of individual stocks and fishing methods 33 . Consequently, Pacific coastal fisheries are usually categorized as data deficient despite many initiatives aimed at improving management and research to overcome these constraints 34 , 35 .

A major obstacle in addressing these challenges is the high cost of scalable methods to monitor the status of fisheries and performance of management 36 , 37 . Many of the methods used in the region were developed using twentieth century approaches for large scale fisheries 38 , 39 . The absence of affordable and accessible methods and centralized coordinated approaches creates a disconnect between local fisheries management, national policy development and global understandings of fishery status 40 , 41 .

The persistence of technical and financial barriers regularly inhibits expansion of science-based governance of fisheries in the Pacific. The real-world costs of collecting data in difficult, dynamic field conditions, means such science-based programs that depend on methodological rigor are often unattainable without highly trained personnel. The reliance on complex technical approaches limits the ability of people to adopt and action management independently, so that general involvement and acceptance of fisheries management remains poor 42 . In practice, resource-constrained fisheries managers often grapple with balancing the collection of reliable data with the human and financial resources required to effectively collect, curate and analyze data to manage their fishery 43 . Compounding this are the considerable time lags between data collection, analysis, reporting and application, which mean data are often not available, or made available too late, for it to effectively inform adaptive management 44 .

The high demand for rapid, easy-to-implement assessments that inform management of small-scale fisheries, particularly subsistence and artisanal fisheries, is reflected in many available toolkits, statistical methods, and field approaches 41 , 45 , 46 . These methods rely on relatively simple yet informative data with which to make broad management inferences, for example, comparing mean catch length with information on a species’ size at maturity. The utility of many such ‘rule of thumb’ methods, however, still relies on the capacity of agencies and communities to collect accurate data at the scales needed to make meaningful management decisions.
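
A length-based ‘rule of thumb’ of the kind mentioned above can be sketched in a few lines of Python. The indicator names, the catch lengths, and the 200 mm size at maturity are all made-up assumptions for illustration; they are not taken from the paper or from any specific toolkit.

```python
def length_indicators(lengths_mm, length_at_maturity_mm):
    """Summarize a catch sample against a species' size at maturity.

    Returns the mean catch length, its ratio to the size at maturity,
    and the share of specimens at or above the size at maturity.
    """
    if not lengths_mm:
        raise ValueError("need at least one measured specimen")
    mean_length = sum(lengths_mm) / len(lengths_mm)
    mature = sum(1 for length in lengths_mm if length >= length_at_maturity_mm)
    return {
        "mean_length_mm": mean_length,
        "mean_to_maturity_ratio": mean_length / length_at_maturity_mm,
        "share_at_or_above_maturity": mature / len(lengths_mm),
    }


# Hypothetical catch sample (mm) and hypothetical size at maturity (mm).
catch = [180, 210, 195, 240, 160]
summary = length_indicators(catch, 200)
```

A mean catch length below the size at maturity, as in this toy sample, is the kind of signal such methods flag for closer attention; the inference is only as good as the number and accuracy of the lengths collected.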

Digital technologies hold potential for transforming fishery livelihoods 47 , particularly in the culturally diverse and dispersed island context of the Pacific, where telecommunications infrastructure rollout is extending to ever more remote places. Policymakers, the private sector, scientists, and international development partners are increasingly co-leveraging digital technologies in the Pacific socio-economic development sector 48 . When these advances are paired with rapid evolution of computational, computer vision and artificial intelligence (AI) technologies there is an opportunity to automate repetitive time-consuming aspects of fishery evaluation. In doing so, information gathering exponentially expands and catalyzes a revolution in how coastal fisheries are managed 49 .

AI has to date been applied predominantly in large-scale commercial fisheries or aquaculture, where machine learning approaches are used to identify growth or disease in products 50 , 51 , predict population connectivity patterns 52 , track oceanic vessel movements 53 , identify species from commercial fishing catches 54 , 55 , 56 , 57 , 58 , or extract classifications from images uploaded by enlisted recreational fishers 59 , 60 . In these cases the scale and resolution at which AI is applied mean that these systems are not suited to monitoring subsistence and artisanal fisheries in Small Island Developing States 61 . For example, AI technologies designed to track large commercial vessels or automatically identify fish species based on commercial purse-seine catches are not designed to handle the diverse species, fishing methods, and landing practices characteristic of small-scale fisheries. To serve coastal fisheries, such programs, while appropriate and useful within their respective domains, would also need to capture simultaneously the information needed to inform management of complex multispecies coastal fisheries, including method-disaggregated catch per unit effort, specimen length, market or landing dynamics, catch methods, and other fishery associated socio-economic data. Considering the above, most AI applications in recreational or coastal fisheries primarily serve as tools to describe or characterize fisheries, rather than as integrated parts of the comprehensive analysis platforms that fisheries require in order to adapt and adjust management in response to change.

In this paper we describe the development of a state-of-the-art digital fishery monitoring and processing system for Pacific SIDS powered by AI and computer vision technology. This system provides an opportunity to design and build solutions that bridge conventional data collection programs and governance structures with people’s local (traditional) use of fishery resources 62 . While AI’s transformative impacts are growing exponentially across diverse fields 63 , its application in coastal fisheries is in its infancy. AI offers promising avenues to address current limitations in management of small-scale coastal fisheries. AI-powered systems demonstrate the potential to automate data collection and analysis, improving efficiency and cost-effectiveness 64 . The work presented here builds on this emerging trend, introducing an AI-based system specifically tailored for coastal fisheries monitoring. Unlike existing approaches, this system prioritizes affordability, versatility, and accessibility, to ensure usability in (and by) remote Pacific communities. Moreover, the system distinguishes itself in its capacity to process and integrate diverse monitoring data from multispecies coastal fisheries, including catch rates, fishing locations, and socio-economic information. Its compatibility with smaller existing programs that already use computer vision technologies offers room for learning and for expanding its application as technology advances. The platform is managed at the headquarters of the Pacific Community (SPC), the peak regional technical agency, with access by national fisheries agencies, non-government organizations, and communities in SPC member countries (Fig. 1).

Figure 1. The basic construct of the AI-enabled coastal fisheries monitoring system behind Ikasavea, showing flow of raw data from various field contexts and countries (e.g. landing sites, communities, and/or markets) to a central computing facility hosted at SPC. Automated analyses of imagery extract information that is then fed back to inform management practice.

Structural architecture of the AI fisheries monitoring system

The architecture of the AI monitoring system is comprised of four main stages: data acquisition (photo imagery and survey data) and upload, AI enabled image processing and data extraction, data analysis, and reporting. CV technology facilitates the extraction of data from images, and deep learning algorithms were applied to train the system to make decisions based on extracted information (e.g., image standardization, image and specimen orientations, specimen taxonomic classifications, and measurement types). These processes are integrated and advanced through an automated software pipeline to ensure data consistency and an efficient workflow.

Data acquisition (field methods)

Data and imagery are collected and uploaded in a variety of ways during fishery surveys in the region. Although the system can accommodate manual entry of data from paper records and images uploaded to a web browser from, for example, a camera SD card, the focus of data acquisition is through a smartphone or tablet-based application. The Android application, Ikasavea , was developed in collaboration with national agency partners ( ika-savea means ‘fish-survey’ in Polynesian languages). Ikasavea can be customized by users through a web portal to adapt the survey design to their needs. Users can, for example, set administrative regions, assign spatial management areas to sampling hierarchies (e.g., locally managed marine areas in Fiji), and set specific input metrics and data fields used for a wide range of fishery assessments. This information is then synchronized to Ikasavea to facilitate structured flows of data input.

Given the diversity in artisanal and subsistence fishing practices, the system allows for various forms of data input through a range of modules in the tablet application and on the web-accessible portal, including data collection from landing sites, community fisheries, and fish markets. Enumerators intercepting fish catches in any of these contexts use Ikasavea to collect information on a range of catch attributes, such as fishing methods, market details, and socio-economic dimensions (See Fig. S1 Supplementary materials). A central attribute of the system is the use of photographs taken of the catch, which are then processed centrally. The use of photographs significantly reduces the level of taxonomic knowledge and time required at the point of data acquisition and reduces disruption to fishers and retailers, thereby reducing refusals.

Two categories of photograph or ‘measurement type’ can be processed – either a single specimen on a calibrated measuring board, with or without a digital scale (hereafter ‘board’), or multiple specimens arranged on a standardized, calibrated mat (hereafter ‘mat’). Specific identifier patterns on boards (numbers) and mats (symbols) were used to calibrate measurements and orient images (see Fig. S2 Supplementary materials). Mats proved more appropriate for fishers taking part in community-based fisheries monitoring programs or by fisheries officers when expediting landing surveys. Once data are uploaded to a central online database, automated image processing and data analyses derive information useful for management (See Fig. S3 in Supplementary material). This information is then sent back to the field for application by managers, in near real-time.

In instances when both length and weight were required, the board was attached to a scale system. Length and weight information allowed the development of length–weight relationships useful in quantitative fishery assessments. In this paper, weight was estimated to the nearest gram. Once sufficient length–weight information is gathered for each species, the relationship may be used to estimate weight from lengths measured in the photographs.
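
The passage above does not state the functional form of the length–weight relationship. A minimal sketch, assuming the commonly used power law W = a·L^b fitted by least squares on log-transformed data, is given below; the function names and the synthetic data are illustrative assumptions, and the paper's own analyses were done in R.

```python
import math


def fit_lwr(lengths, weights):
    """Fit W = a * L**b by ordinary least squares on log(W) vs log(L).

    Returns the coefficients (a, b).
    """
    xs = [math.log(length) for length in lengths]
    ys = [math.log(weight) for weight in weights]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def predict_weight(length, a, b):
    """Estimate weight from a length measured in a photograph."""
    return a * length ** b


# Synthetic specimens generated from W = 2e-5 * L**3 (length mm, weight g).
lengths = [150.0, 200.0, 250.0, 300.0]
weights = [2e-5 * length ** 3 for length in lengths]
a, b = fit_lwr(lengths, weights)
estimated = predict_weight(220.0, a, b)
```

Because the synthetic data follow the power law exactly, the fit recovers the generating coefficients; real specimens scatter around the curve, which is why outlier screening matters before fitting.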

Fulton’s condition factor (K) and Tukey’s outlier detection method applied to K were used in combination to remove outliers in body condition and so improve length–weight relationship (LWR) estimates 65 , 66 , 67 (e.g., Fig. 7). Fulton’s condition factor is a standard method to identify individuals that deviate significantly from a base population 55 . K was calculated using Eq. (1):

$$K = 100 \times \frac{W}{L^{3}} \qquad (1)$$

where 100 is a scaling constant that brings K to manageable units, L is length in mm and W is weight in grams. Once K was calculated, Tukey’s IQR method 68 , 69 was used to detect outliers. The calculation of K and evaluation of outliers served three purposes. Firstly, it identified individuals that had a disproportionate influence on the predictive model over and above what a healthy individual would have for that species’ population. Secondly, it identified specimens that were unsuitable for inclusion in AI model training or for further data analysis. Thirdly, calculation of K allowed tracking of body condition of fished species populations between geographic areas and through time 70 . Once local LWR models are established, predicted specimen weights can be used to estimate production volumes of a fishery. Allometric relationships for many coastal fisheries species can vary seasonally and spatially, so using locally developed LWRs rather than those from global data repositories such as FishBase 71 increases the accuracy of length-based stock assessments 72 , 73 . All analyses were implemented in R and RStudio using a range of plotting and statistics packages 74 , 75 , 76 , 77 , 78 , 79 .
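
The outlier screening described above can be sketched as follows, assuming the standard form of Fulton’s condition factor, K = 100·W/L³, and Tukey’s fences at 1.5 times the interquartile range. The quartile convention, the function names, and the specimen values are assumptions made for illustration; the authors worked in R, so this Python version is not their code.

```python
import statistics


def fulton_k(weight_g, length_mm):
    """Fulton's condition factor with the scaling constant 100."""
    return 100.0 * weight_g / length_mm ** 3


def tukey_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR].

    Quartiles use the 'inclusive' method of statistics.quantiles; other
    quartile conventions can move the fences slightly.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [value < low or value > high for value in values]


# Made-up specimens as (length_mm, weight_g); the last one is implausibly heavy.
specimens = [(200, 160.0), (210, 185.0), (190, 137.0), (205, 172.0), (200, 400.0)]
condition = [fulton_k(weight, length) for length, weight in specimens]
flags = tukey_outliers(condition)
kept = [s for s, is_outlier in zip(specimens, flags) if not is_outlier]
```

Only the specimens in `kept` would then feed a length–weight fit, which keeps a single mis-weighed or mis-measured fish from dominating the relationship.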

AI enabled image processing and data extraction

Raw images are processed through a series of AI-enabled steps via an automated image processing pipeline (Fig. 2). Each step involves training a model before it is deployed for data extraction. As images are uploaded, they undergo automated corrections and are classified and partitioned into various categories. Every subsequent step through the processing phases uses a model developed specifically for the classification level of partitioned imagery (stored in a model library). This method of multi-stage classification decreased the number of distinct types of imagery that each model must classify, which helped avoid overfitting the models. Below we summarize this workflow and the validation and evaluation processes used to assess each model’s performance.

Figure 2. The multistage process for analyzing and classifying specimen images using convolutional neural network algorithms. A custom C# application accesses and updates the SQL database, fetches original images, creates image outputs, and runs YOLOv4/Darknet53 models through a wrapper and C++/CUDA implementation of YOLOv4 darknet and OpenCV libraries. It also runs ResNet101 models using Microsoft ML.NET. In Step 1, YOLOv4 is used to identify images, categorize them, and correct orientation. In Step 2, multiple YOLOv4 and OpenCV models adjust image properties and calibrate pixels to known dimensions. In Step 3, multiple YOLOv4 models detect and classify specimens based on a specimen's visible attributes. In Step 4, YOLOv4 extracts measurements for each taxonomic subclassification. In Step 5, a dual-stage process comparing YOLOv4 and ResNet101 detects the species name from a species detection model library, where the most accurate output is used.

Storage, analysis, and validation of specimen images are integrated into an ASP.NET web application and SQL Server backend. When photos are uploaded manually or during synchronization from Ikasavea, they are stored on the web server and registered for processing in the SQL Server database as photo tables. A custom C-sharp (C#) application, running on a machine equipped with a GPU, triggers the automated processing of images every 15 min. The AI detection and automatic measurement system drew from open-source libraries and models such as Open-Source Computer Vision (OpenCV) 80 , ResNet101 81 , and YOLOv4 82 , 83 . OpenCV provides tools and algorithms for image processing, including image enhancement and feature detection 59 . This open-source package offered a comprehensive set of functions for image processing, such as contrast enhancement, thresholding, histogram equalization, and adaptive histogram equalization 84 . These functions, triggered by the C# application when images are uploaded, were used to preprocess images to improve quality before they were passed to deep learning models for further analysis. ResNet101 is a deep convolutional neural network architecture that is widely used for image classification tasks. It is a variant of the Residual Network (ResNet) architecture, which uses residual (skip) connections to make very deep networks trainable, and is widely used as a backbone in CV classification tasks. YOLOv4, also widely used in CV applications, is a fast single-pass object detection algorithm, which divides the input image into a grid and predicts bounding boxes and class probabilities for each grid cell.
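
To make one of the preprocessing operations named above concrete, the sketch below applies textbook histogram equalization to a tiny 8-bit grayscale image in plain Python. OpenCV exposes this operation as `cv2.equalizeHist`; the version here is a dependency-free stand-in written for illustration and is not the authors' pipeline code. The function name and the sample pixel values are assumptions.

```python
def equalize_histogram(image):
    """Spread the intensities of an 8-bit grayscale image across 0..255.

    `image` is a list of rows of integers in 0..255. Each intensity is
    remapped through the normalized cumulative histogram.
    """
    pixels = [p for row in image for p in row]
    histogram = [0] * 256
    for p in pixels:
        histogram[p] += 1

    # Cumulative distribution of intensities.
    cdf = []
    running = 0
    for count in histogram:
        running += count
        cdf.append(running)

    total = len(pixels)
    cdf_min = next(c for c in cdf if c > 0)
    if total == cdf_min:
        # Every pixel has the same value; nothing to stretch.
        return [row[:] for row in image]

    lookup = [
        max(0, round((c - cdf_min) / (total - cdf_min) * 255)) for c in cdf
    ]
    return [[lookup[p] for p in row] for row in image]


# A low-contrast 2x2 patch: all values sit in a narrow dark band.
dark = [[50, 52], [54, 56]]
stretched = equalize_histogram(dark)
```

After equalization the four pixel values span the full 0 to 255 range, which is the kind of contrast gain that can help a downstream detector on under-exposed field photographs.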

The C# application processes images by initiating YOLOv4/Darknet53 models via a C++/CUDA implementation for object detection, and ResNet101 models via Microsoft ML.NET for tasks like image classification or feature extraction. It generates image outputs, stores them on an SQL file server, and updates the SQL Server database with processing status and results. The application uses flags and output data fields, such as image type, predicted species, and bounding box coordinates, to filter images for further processing. For instance, fish-on-mat detection is triggered by the C# application once the OpenCV model has successfully calibrated an image containing a mat photo. Model training and validation involved just over 32,000 images of reef-associated finfish and invertebrates from 13 Pacific SIDS and territories. The initial stages of development used images from Samoa, New Caledonia, Tonga, Papua New Guinea, Fiji, Kiribati, and Vanuatu. As images were uploaded and additional photographs were validated by trained observers, the models were retrained through a process of transfer learning 85. This involved using validated, corrected, and ‘failed’ images (where ‘failed’ images were corrected and annotated) to retrain the models. Adding failed images to the training dataset improved the fish detection and calibration models on images of varying quality. Data augmentation was used to accelerate model training, minimize overfitting, and enhance accuracy for species with fewer photographs 63. This periodic training and re-evaluation was needed to maintain accuracy and precision when dealing with highly variable imagery from the countries and programs using the system.

Preprocessing phase

The preprocessing phase proceeds as a series of steps powered by YOLOv4 and OpenCV to prepare images for data extraction. First, YOLOv4’s object detection capability is used to identify whether the image is from a board or a mat, and to categorize it accordingly. Then, using a combination of multiple YOLOv4 and OpenCV models, each image goes through quality enhancement and calibration to normalize the image projection to known real-world dimensions.

Lastly, a library of YOLOv4 models is used to identify, correct, and classify specimens based on their position, orientation, and broad taxonomic type. The ‘taxonomic type’ classification groups morphologically similar species into a single class: either finfish or, for invertebrates, lobsters, crabs, bivalves, or gastropods. This step provides critical information about specimen characteristics, aiding subsequent data extraction processes. Once the taxonomic type of the specimen is identified, the images are placed in their respective categories.

Data extraction phase

This two-step phase utilizes deep learning to build algorithms to automate data extraction. First, YOLOv4 is used to extract accurate lengths from a calibrated image after detecting the snout and fork or caudal fin margin of fishes. For invertebrates, learned morphometric features are used. For photographs with an electronic scale, digits are recognized and recorded to provide a weight of the specimen, while for photographs using mats, only the lengths of each of the specimens are estimated.

For the calibrated measuring board method, images were standardized in size and position, starting at zero, vertically centered, and scaled to 2 pixels per mm. The red lines on the image correspond to the theoretical position of the center line and black lines every 10 cm to check that calibration is correct (see Fig. S4 in supplementary material). The fish’s fork position on the board was used to determine its length. A YOLOv4 model, trained on 1600 images, normalized to 416 × 416 input for 80 epochs, was used to detect the fork, providing a bounding box centered on the fork. The vertical position of this bounding box center yielded the fork length of the fish.
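As a rough illustration of the board method, the sketch below converts the center of a fork bounding box into a fork length using the stated scale of 2 pixels per mm. The box layout `(x_min, y_min, x_max, y_max)`, the function name, and the assumption that the zero mark sits at pixel 0 of the vertical axis are ours; the paper states only that the vertical position of the box center yields the fork length.

```python
PIXELS_PER_MM = 2.0  # calibration scale stated in the text


def fork_length_mm(fork_box, pixels_per_mm=PIXELS_PER_MM):
    """Return the fork length in mm from a fork bounding box on a calibrated board image.

    fork_box is (x_min, y_min, x_max, y_max) in pixels (assumed layout).
    The measuring axis is assumed to be the image's vertical axis, with the
    board's zero mark at pixel 0.
    """
    _, y_min, _, y_max = fork_box
    center_y = (y_min + y_max) / 2.0
    return center_y / pixels_per_mm
```

For example, a fork box spanning pixels 590 to 610 along the measuring axis has its center at pixel 600, which corresponds to a fork length of 300 mm at this scale.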

For the calibrated mat method, the size and position of specimen images were standardized. A YOLOv4 model, trained on 1334 images and normalized as 512 × 512 input for approximately 287 epochs, was first applied to detect fish bounding boxes on the mat. Each detected specimen’s image was cropped according to the bounding box with an extra 10% margin, then processed through a YOLOv4 fish orientation detection model to standardize the orientation of the fish. This model was trained on 30,528 images, normalized as 416 × 416 input for approximately 33 epochs.
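The cropping step can be sketched as follows. This is a minimal illustration that assumes the 10% margin is taken relative to the box width and height on each side and that the crop is clamped to the image bounds; the paper does not spell out either detail, and `expand_box` is a hypothetical helper name.

```python
def expand_box(box, image_size, margin=0.10):
    """Grow a bounding box by `margin` of its size on every side, clamped to the image.

    box is (x_min, y_min, x_max, y_max) in pixels; image_size is (width, height).
    """
    x_min, y_min, x_max, y_max = box
    width, height = image_size

    # Padding is proportional to the detected box, not to the whole image (assumption).
    pad_x = (x_max - x_min) * margin
    pad_y = (y_max - y_min) * margin

    return (
        max(0, round(x_min - pad_x)),
        max(0, round(y_min - pad_y)),
        min(width, round(x_max + pad_x)),
        min(height, round(y_max + pad_y)),
    )
```

The extra margin keeps the snout and tail inside the crop even when the detector's box is slightly tight, which matters for the orientation and snout/fork models that run next.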

Measurement of the specimen was done using a YOLOv4 fish snout/fork detection model (Fig. 2), taking the distance between the center of the snout bounding box and the center of the fork bounding box of each detected specimen (see Fig. S5 in supplementary material). This model was trained on 4818 specimens, normalized as 512 × 512 input for approximately 80 epochs. During processing, the system may detect multiple potential snouts or forks within a single image, so a final disambiguation step in the image processing pipeline ensures that each detected fish is associated with only one snout and one fork. This step repeats until each specimen is associated with a single snout and fork, which is crucial for the accuracy of measurements and the overall effectiveness of the AI measurement system. The results are displayed to the user for validation or correction if required; the user interface allows editing of the measurement line and adding segments for length measurement along a curve. In the next step, pre-trained models for each measurement type and taxonomic group (e.g., mat or board having either a fish or invertebrate) extract a species classification of a specimen using both ResNet101 and YOLOv4 architectures. The species detection model library is used to assign a species name. This involves a dual-stage parallel process in which the C# application compares the YOLOv4 and ResNet101 outputs and reports only the most accurate classification in the user interface. This is achieved by comparing the models’ confidence scores, which range from 0 (no confidence) to 100 (highly confident).
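The two operations in this paragraph, measuring between box centers and keeping the more confident of two classifications, can be sketched as below. The function names, the `(species, confidence)` tuple layout, the `pixels_per_mm` parameter, and the tie-breaking rule are illustrative assumptions; the production system implements these steps in C#.

```python
import math


def box_center(box):
    """Center (x, y) of a box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def snout_fork_length_mm(snout_box, fork_box, pixels_per_mm):
    """Straight-line distance between snout and fork box centers, converted to mm."""
    sx, sy = box_center(snout_box)
    fx, fy = box_center(fork_box)
    return math.hypot(fx - sx, fy - sy) / pixels_per_mm


def pick_species(yolo_result, resnet_result):
    """Return the (species, confidence) pair with the higher confidence (0 to 100).

    On a tie the YOLOv4 result is kept; this tie rule is arbitrary and not
    described in the paper.
    """
    return max(yolo_result, resnet_result, key=lambda result: result[1])
```

With a snout box centered at (10, 10) and a fork box centered at (310, 410), the pixel distance is 500, so at 2 pixels per mm the estimated length is 250 mm.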

Model validation and addressing bias

Performance of AI length and weight detection

We tested the accuracy and precision of 869 AI-measured lengths from a suite of randomly selected specimen species and sizes sold at markets in Tonga. Depending on the species’ morphology, either the total or fork length of a specimen was used to validate the AI measurements; these reference lengths were collected in situ using the same board. An observer recorded the measurement beneath the posterior margin of the intersecting lobes of the caudal fin (fork length) or at the furthest margin of a straight-line distance from snout to tail (total length).

A linear regression model was used to evaluate the AI-enabled system’s ability to replicate a human measurement. The model was fitted between paired AI (independent variable) and human (dependent variable) measured lengths. We chose the human measurement as the dependent variable because the aim was to evaluate how well the AI measurement (the predictor) predicts a real-world human measurement. The dataset covered lengths ranging from 130 to 720 mm. Results of this analysis indicated that AI measurements were consistent with human measurements for both measurement types. The near one-to-one accuracy is verified by a strong positive linear relationship (R² = 0.99) (Fig. 3a). In six cases during this validation assessment, the AI detection revealed erroneous measurements by human observers, which were corrected post hoc.
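The regression described above can be reproduced with ordinary least squares on paired measurements. The sketch below is a minimal pure-Python version; the function name `fit_line` is hypothetical and the example data in the usage note are made up, not the 869 paired measurements from the study.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r_squared). Here x would be the AI-measured
    lengths and y the human-measured lengths.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # R^2 compares the residual variation with the total variation in y.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1.0 - ss_res / ss_tot

    return slope, intercept, r_squared
```

A slope near 1 and an intercept near 0, together with a high R², is what "near one-to-one" agreement looks like in this framing; calling `fit_line([130, 250, 400, 720], [130, 250, 400, 720])` returns a slope of 1, an intercept of 0, and an R² of 1.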

Figure 3. (a) Comparison of fish length measurements estimated by an AI system and those by a human observer. Each point (n = 869) represents an individual fish, with the x-coordinate being the AI measurement and the y-coordinate being the human measurement. The blue line represents the fitted linear regression and the orange shaded area represents the bootstrapped 95% confidence interval for the fit (note that, because the relationship is extremely close, the 95% CI is narrow). (b) Bland–Altman plot showing the agreement between fish length measurements taken by an AI system and a human observer for the mat measurement type. The x axis shows the mean of the AI and human measurements, and the y axis shows the difference between the two measurements. The solid blue line indicates the mean difference (bias), the dashed grey lines represent the upper and lower 95% confidence interval, and the grey shaded region indicates the 95% confidence interval of the fitted linear model. (c) The comparable Bland–Altman plot for the board measurement type.

A Bland–Altman plot was constructed to assess size-specific and measurement type bias that was not detected by the linear regression 86, 87. Overall, 92% of the AI-measured lengths were within 5 mm and 98% were within 10 mm of the human-measured lengths (Fig. 3b). The overall mean absolute error was 3.4 mm and the root mean square error was 4.9 mm. The mean square error for the board and mat measurement types was 2.7 and 4.2 mm, which equates to a 1.2 and 1.8% error, respectively. There was minimal positive bias apparent with increasing length. The mean bias was approximately 5 mm for specimens greater than 600 mm on both measurement types (slope coefficient b = 0.01 for both mat and board), which was considered acceptable.
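The agreement statistics reported here can be computed from the paired differences, as in the sketch below. It assumes differences are taken as AI minus human and uses the conventional Bland–Altman limits of agreement (bias ± 1.96 standard deviations); the paper does not state its sign convention or exactly how its interval lines were computed, and `agreement_summary` is a hypothetical name.

```python
import math
import statistics


def agreement_summary(ai_mm, human_mm):
    """Bland-Altman style agreement statistics for paired length measurements (mm)."""
    diffs = [a - h for a, h in zip(ai_mm, human_mm)]  # AI minus human (assumed convention)
    abs_diffs = [abs(d) for d in diffs]
    n = len(diffs)

    bias = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences

    return {
        "bias": bias,
        "loa_lower": bias - 1.96 * sd,  # conventional lower limit of agreement
        "loa_upper": bias + 1.96 * sd,  # conventional upper limit of agreement
        "mae": statistics.fmean(abs_diffs),
        "rmse": math.sqrt(statistics.fmean([d * d for d in diffs])),
        "within_5mm": sum(d <= 5 for d in abs_diffs) / n,
        "within_10mm": sum(d <= 10 for d in abs_diffs) / n,
    }
```

In a Bland–Altman plot the differences are drawn against the pairwise means, so a bias that grows with fish size shows up as a sloped cloud of points even when the regression of one measurement on the other looks nearly perfect.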

We assessed how often enumerators either validated or changed the length estimated by the AI system in a random selection of 80,446 finfish (Fig. 4). The Studentized deleted residuals method 69 was used to identify outliers in both x and y variables. Observers validated the AI-produced length estimates in all but 0.4% of instances. The relationship between these values showed a near-perfect fit, with model coefficients near zero for the intercept and near one for the slope. This high degree of agreement indicated that the AI length measurement system was robust over a range of lengths, species and contexts. We identified 278 (0.35%) instances where lengths were changed on specimens, mostly because the AI failed on poor-quality images (e.g., poor color, lighting, and/or three-dimensional orientation). In other cases, observers changed an AI identification or length measurement that was in fact correct (see Fig. S6 in supplementary material).
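For readers unfamiliar with Studentized deleted residuals, the sketch below computes them for a simple linear regression using the standard closed form, which avoids refitting the model once per point. The cutoff of 3.0 and both function names are assumptions for illustration; the paper does not report the threshold it used.

```python
import math


def studentized_deleted_residuals(x, y):
    """Studentized deleted residuals for the fit y = intercept + slope * x."""
    n = len(x)
    p = 2  # fitted parameters: intercept and slope
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x

    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sse = sum(e * e for e in residuals)

    result = []
    for xi, e in zip(x, residuals):
        leverage = 1.0 / n + (xi - mean_x) ** 2 / sxx
        denom = sse * (1.0 - leverage) - e * e
        if denom <= 0.0:
            # Degenerate case: every other point lies exactly on the line.
            result.append(math.copysign(math.inf, e) if e else 0.0)
        else:
            result.append(e * math.sqrt((n - p - 1) / denom))
    return result


def flag_outliers(x, y, cutoff=3.0):
    """Indices whose absolute Studentized deleted residual exceeds `cutoff` (assumed value)."""
    values = studentized_deleted_residuals(x, y)
    return [i for i, t in enumerate(values) if abs(t) > cutoff]
```

Because each residual is scaled by an error estimate that excludes the point itself, a single badly wrong length cannot hide by inflating the overall error variance.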

Figure 4. Human validation of AI length estimates. Blue circles indicate validated length estimates and red circles indicate estimates identified as outliers using the Studentized deleted residuals method. The orange dashed line indicates the 1:1 relationship and the black dashed line indicates the fitted linear model.

Performance of AI species identification

Correctly identifying species in multi-species fisheries requires specialist knowledge that is rarely available in the context of communities in the Pacific region. This limitation is more evident in SIDS where species names are often in local languages, and where many species are grouped into a single local name. Automating the identification process will significantly improve the likelihood that a sampling program will correctly identify species 88 .

We evaluated four computer vision models (m20, m21, m22, and m23) developed in successive years on their ability to identify species from 51,800 specimen images. The models’ performance was measured by their recall 89, 90, which is the proportion of specimens of a species that the model identified correctly: true positives divided by the sum of true positives and false negatives (specimens of that species that the model missed).
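Per-species recall as defined above can be tallied directly from paired true and predicted labels. This is a minimal sketch; the function name and the label values in the usage note are illustrative.

```python
from collections import Counter


def recall_by_species(true_labels, predicted_labels):
    """Recall per species: correct identifications divided by all specimens of that species."""
    true_positives = Counter()
    totals = Counter()
    for truth, predicted in zip(true_labels, predicted_labels):
        totals[truth] += 1  # true positives plus false negatives for this species
        if truth == predicted:
            true_positives[truth] += 1
    return {species: true_positives[species] / totals[species] for species in totals}
```

For true labels `["a", "a", "a", "b", "b"]` and predictions `["a", "a", "b", "b", "a"]`, species `a` has a recall of 2/3 and species `b` a recall of 1/2.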

Each model was trained in a different year, with an increasing number of species and images. Specifically, m20 was trained on 21 species using 1,817 images, m21 on 64 species using 5,210 images, m22 on 111 species using 10,191 images, and m23 on 264 species using 32,818 images. The training data for each model included images from its respective year and all preceding years, with the training datasets comprising 90% of the testing dataset.

Each model was tasked with classifying images of each species; the number of correct classifications was tallied and the proportion of correct classifications was calculated. To maintain a balanced learning system and to make meaningful assessments of each model’s performance, only species with more than twenty images were included. This criterion reduced the total number of finfish species that have been identified by the AI classification system from 612 to 264. The number of training images was capped at two hundred specimens per species to reduce the risk of overfitting. We found that models tend to perform better when trained on 150 to 200 unique specimens (Fig. S7 in supplementary material). When a model is overfitted, it may misclassify images by focusing on characteristics that are common between species but not related to the specimen’s taxonomy, such as the presence of a red pectoral fin margin, which could lead to an increase in misclassifications.
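The inclusion and capping rules can be sketched as a small selection function. The thresholds come from the text (more than twenty images to be included, at most two hundred per species); the random sampling used to apply the cap, the fixed seed, and the name `select_training_images` are assumptions, since the paper does not say how the capped subset was chosen.

```python
import random


def select_training_images(images_by_species, min_images=21, cap=200, seed=0):
    """Keep species with at least `min_images` images and cap each at `cap` images.

    images_by_species maps a species name to a list of image identifiers.
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    selected = {}
    for species, images in images_by_species.items():
        if len(images) < min_images:
            continue  # too few images for a meaningful assessment
        if len(images) > cap:
            selected[species] = rng.sample(images, cap)  # sampling rule is an assumption
        else:
            selected[species] = list(images)
    return selected
```

Capping keeps abundant species from dominating training, while the minimum count removes species for which a recall estimate would rest on a handful of images.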

Each model was able to recall species classifications with high accuracy, but only when applied to images taken in its development and preceding years (Fig. 5). When new species, or conspecifics with varied colorations or markings, were uploaded, performance declined. For example, model m22 (trained on imagery up to December 2022) was able to recall and accurately classify species 94%, 95%, 91%, and 35% of the time when assessed against images from 2019 to 2023, respectively (Fig. 5). Model m23 had a 79% successful recall rate on 264 species as opposed to m22’s recall rate of approximately 30% on the same dataset. Model m22 correctly identified species in images uploaded in 2022 at a rate of 97%, a drastic improvement over Model m21’s accuracy of 35% when evaluated against the same imagery (Fig. 5).

Figure 5. Recall performance (accuracy) of four models (m20 to m23) over 4 years (2020–2023). Each model was trained on imagery collected preceding 31st December of that year and preceding years. Individual data points within each panel represent the proportion of correct classifications (i.e., the number of correct specimen classifications/total number of specimens) for each species in the training dataset for that year. Only species that were represented by twenty or more images were included. Grey points represent the mean ± standard error of the mean number of correct classifications for each model. The number of species (n) in the training and performance evaluation datasets are annotated on each panel.

Model m23’s recall performance over imagery uploaded in its development year was lower than expected at 79%, but when evaluated against preceding years’ imagery (2020 to 2022) the average recall rate of m23 was over 91% (Fig. 5). In contrast, the recall score of models trained on imagery in their same or preceding years was over 90%. Model m23’s lower recall rate is due to the upload of 153 new species with only 20 to 40 specimen images each, whereas a maximum of only 47 new species were introduced in 2022 and 43 new species in 2021. The m23 model therefore had to learn to classify over three times as many species with too few images. This indicates that when models were applied to new imagery over time, they were exposed to new species that were not part of their training, or to distinct species that shared common characteristics, leading to misidentification. This signals that as the number of species increased, including distinct species with similar characteristics, misclassifications (i.e., false positives) increased as well.

Application and uptake across Pacific SIDS

The Ikasavea monitoring program has seen rapid uptake in the region. Following the first upload of images collected in Kiribati as part of landing surveys, other countries’ fisheries management authorities expressed interest. This initiated a deliberate effort to apply modifications to accommodate the tailored needs of Pacific SIDS programs. As programs were integrated, the volume of images increased. By March 2021, new and established fisheries monitoring programs across all three Pacific subregions of Micronesia, Melanesia, and Polynesia were integrated, thereby accelerating uptake towards what is now a regional AI-supported monitoring system. Larger countries like Papua New Guinea (PNG) have more recently contributed to the data pool, emphasizing the system’s usability, utility, and performance in diverse and extensive geographical contexts. Active participation by diverse national fisheries management agencies demonstrates the system’s ability to manage large-scale data, thereby validating its scalability. As of March 2024, 11 national fisheries authorities are using the system, with the Cook Islands, Nauru and Palau currently being onboarded for their national programs (Fig.  6 ). By December 2023, over 80,000 images had been uploaded, containing over 180,000 specimens (Fig.  7 ).

Figure 6. Extent of uptake by Pacific Island Countries and Territories (PICTs) of the Ikasavea system, allowing AI-supported data collection of key invertebrate and finfish landed by fishers or sold at markets. Three-letter ISO codes: American Samoa (ASM), Cook Islands (COK), Federated States of Micronesia (FSM), Fiji (FJI), French Polynesia (PYF), Guam (GUM), Kiribati (KIR), Marshall Islands (MHL), Nauru (NRU), New Caledonia (NCL), Niue (NIU), Northern Mariana Islands (MNP), Palau (PLW), Papua New Guinea (PNG), Pitcairn Islands (PCN), Samoa (WSM), Solomon Islands (SLB), Tokelau (TKL), Tonga (TON), Tuvalu (TUV), Vanuatu (VUT), and Wallis and Futuna (WLF). Extent of EEZs indicative only.

Figure 7. Cumulative count of fish species identifications and length measurements made by the AI system using images uploaded to the SPC web portal by national fisheries authorities. Red points indicate when a new country first uploaded images to the SPC web portal. Count does not include images from mats obtained by SPC partners. Three-letter ISO codes for PICTs as indicated in Fig. 6.

In the four years since the system became operational, over 50 finfish and invertebrate species have accumulated sufficient data for length–weight predictions across the region. The length–weight relationship (LWR) is a critical metric that can be used in analyses of life histories, population growth, and in comparative analyses between different populations from different regions, habitats and/or environmental conditions. Three species were selected here to demonstrate how the LWR models were developed using AI detection (Fig. 8). Of the 7885 measurements made, 313 (not all shown) were detected as outliers using Tukey’s IQR method on K, and were the result of deficient images (e.g., gutted, damaged or malformed specimens) or incorrect length or weight validations by observers (e.g., the AI’s detected weight being overridden by a human observer) (Fig. 8a). As the number of uploaded images with boards and scales increases, the number of locally derived allometric LWRs is also expected to increase, providing much needed allometric data across the region. Generalized linear model regression on log10-transformed data was used to build predictive LWR models 91.
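The outlier screen and LWR fit described above can be sketched as follows. Several details are assumptions rather than statements from the paper: Fulton's K is computed in its common form 100·W/L³ with weight in grams and length in centimeters, the Tukey fence multiplier is the conventional 1.5, and the fit is plain least squares on log10 values (equivalent to a Gaussian GLM with identity link). All function names are hypothetical.

```python
import math
import statistics


def fulton_k(weight_g, length_cm):
    """Fulton's condition factor, K = 100 * W / L^3 (grams and centimeters assumed)."""
    return 100.0 * weight_g / length_cm ** 3


def tukey_outlier_flags(values, k=1.5):
    """True for values outside [Q1 - k*IQR, Q3 + k*IQR]; k = 1.5 is the usual convention."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]


def fit_lwr(lengths_cm, weights_g):
    """Fit W = a * L^b by least squares on log10(W) = log10(a) + b * log10(L)."""
    xs = [math.log10(length) for length in lengths_cm]
    ys = [math.log10(weight) for weight in weights_g]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    a = 10 ** (mean_y - b * mean_x)
    return a, b
```

In use, K is computed for every specimen of a species, flagged records are dropped, and `fit_lwr` is run on the remainder; an exponent `b` near 3 indicates roughly isometric growth.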

Figure 8. Allometric length–weight relationships of three species from Samoa (WS) and Papua New Guinea (PNG). These relationships were derived from automated length classification and weight readings using CV technologies. Outliers (shown for Lutjanus gibbus as red circles) were identified and removed using Tukey’s outlier method on Fulton’s condition factor (K). The predicted weights, represented by red lines, were obtained through generalized linear regression on log-transformed data.

Implications for fisheries management

The system serves as a comprehensive platform that integrates the collection of multiple lines of fisheries data necessary for informing the sustainable management of fisheries, rather than as a single-purpose tool for automated species identification and/or length measurements. The unique ability of this system to collect and integrate AI-automated morphometric data simultaneously with other relevant fisheries data, over a broad range of coastal fisheries, enhances monitoring beyond conventional paper-based or single species stock assessment approaches 92 . As such, the system facilitates data collection to estimate volumes and pricing of fish products traded across market networks, volumes of landed catches, and catch per unit effort that is appropriately categorized for artisanal and commonly used subsistence fishing methods. In addition to data collection from fish markets and landings, the system integrates monitoring programs from coastal communities implementing community-based fisheries management 93 . Together these data support a broad range of fishing activities 94 , economic dynamics of the sector 18 , and management needs within coastal Pacific communities 13 .

In the absence of tailored fisheries management tools, broader standardized tools and indices are often applied 43. Data for context-specific determination of LWR are, for the first time, being collected at regional, national, and subnational scales, thereby revealing the spatial and temporal variability in complex multispecies coastal fisheries. Such measures of LWR can be used to more accurately assess local population condition and stock status. The temporal and spatial scales of data collected, including both length and weight, have important implications for the management of fish populations in and across Pacific SIDS. The continuous and autonomous tracking of body condition, using indices such as Fulton’s condition factor (K) and Le Cren’s relative condition factor (Kn), together with other length-based stock assessment indices (e.g., Length-Based Spawning Potential Ratio, LBSPR), can provide an indication of changes in stock status 67, 72, 73, 95. Collecting weight data in addition to length data can further improve these LBSPR indices. It can also help track changes in body condition and detect spawning individuals 55.
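Le Cren's relative condition factor compares an observed weight with the weight predicted by a length–weight relationship W = a·L^b. The sketch below assumes `a` and `b` come from a locally fitted LWR and that units match those used in the fit; the function name and the example coefficients are hypothetical.

```python
def relative_condition(weight_g, length_cm, a, b):
    """Le Cren's relative condition factor, Kn = observed weight / (a * L^b)."""
    expected_weight = a * length_cm ** b
    return weight_g / expected_weight
```

A value of 1 means the specimen weighs exactly what the fitted relationship predicts for its length; for instance, with made-up coefficients a = 0.02 and b = 3, a 20 cm fish weighing 160 g has Kn = 1, and values drifting below 1 over time would suggest declining body condition.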

The platform’s principal function is to assist with ‘data poor’ fisheries management, specifically in Pacific SIDS. It responds to the need for access, compatibility and functionality, both within the region and between regional and global systems. Regarding the former, it offers authenticated users tailored data exports, while for the latter it offers options for customized Application Programming Interfaces (APIs) to connect with alternative platforms aligned with SDG 14.4.1. This includes, for example, the Virtual Research Environment (VRE) of the Food and Agriculture Organization of the United Nations 96. Such interfaces allow state-of-the-art length-based stock assessment methods for SSF 97 to integrate into the AI data collection system, generating models and predictive models as needed and addressing the enduring and critical challenge of timely stock assessment reporting in coastal fisheries 98. These efficiencies thus enable better data collection and promote integrated assessment and advice in small-scale fisheries 42, 99.

While scientific support for decisions concerning highly migratory fish stocks in the Pacific SIDS region has expanded since 2004 under the formal mandate provided by the Western and Central Pacific Fisheries Commission (WCPFC) Convention, a critical gap remains. Unlike the WCPFC’s managed operations in the broader Pacific region, which drive informed policy development through science 100, there exists no comparable regional framework for the science and management of coastal fisheries in Pacific SIDS 101, despite the latter being a critical global indicator of improved fisheries management 102. This institutional void extends to the literature on applied science guiding national and regional policy development on artisanal and coastal fisheries 103. Consequently, there is relatively limited attention from national and international policymakers and funding bodies directed toward the national and regional challenges faced by Pacific coastal fisheries 104. The efficient means of collecting scientific evidence for fisheries management offered through this regional system stands to decrease the uncertainty in the status and dynamics of Pacific coastal fisheries 47.

The co-design of the monitoring system has strengthened collaborations among communities and national and regional organizations engaged in Pacific coastal fisheries 100. The development and rollout of the monitoring system has strengthened decentralization ambitions within national programs for data collection, while also integrating them into regional coordination efforts. Domestically, such collaborations can be leveraged to tailor harvesting strategies as part of national requirements for coastal fisheries management 104, 105, 106. Applying information that they collect and own, leveraged through regional partnerships, enables greater collaboration between fisheries authorities and communities 107. This allows national fisheries agencies, for example, to more efficiently gauge the effectiveness of their strategies within the context of local practices and traditions 108. Particularly in the context of community fisheries, where since 2015 regional policy and management has focused on scaling community-based fishery management through cost-effective support measures 109, this system functions to catalyze the kind of decentralized management solution that is required 93. Uninterrupted data flows between the centralized support platform and remote fisheries offices and/or villages and fishers enable the rapid return of results, the absence of which has thus far challenged local adaptive management 13.

Limitations in practice

A noteworthy limitation in later models trained with more than 140 new species was the reduction in maximum recall. This is a common occurrence in CV deep learning applications on large datasets 110, 111. While further training is expected to overcome this decline in precision, caution is needed as ‘excessive’ training can also lead to overly specific models and misclassifications (false positives), especially when species share similar characteristics. These issues can be overcome 112, 113, 114, 115 as the system and AI technologies rapidly develop. For example, current trials in integrating CSPDarkNet-53 backbone architecture for species detection 116, 117 promise to improve recall, and thus accuracy in species detection. Recognizing the need for retraining, models in the system were retrained multiple times following large influxes of new species or of conspecifics with varied visible characteristics. Detection rates were further improved by using multiple model algorithms simultaneously (e.g., YOLOv4, ResNet101). Other measures that integrated multiple approaches to preprocessing, like splitting classifications into grouped classification chains (Fig. 2), were also applied to improve recall and accuracy. This included, for example, partitioning input imagery into morphologically or geographically defined groups, training on each group, and comparing multiple model architectures within each group. As the system matures over time it is critical that new model capabilities are continually integrated into the system to further enhance its applicability, efficiency and usability 118.

A major obstacle to any such system is the need to continuously upgrade software and hardware. Both these needs have resourcing implications (e.g. budget and technical skills) that must be recognized for these endeavors to continuously deliver capacity enhancements. The growing magnitude of data input from users into the system, for example, resulted in the need to upgrade the infrastructure and invest in new and more powerful computing capacity. While these costs were low relative to those incurred from manual methods at this scale, implementation and support to these systems must consider ongoing costs, and with that have sustainable resourcing mechanisms in place.

This system has created much needed efficiencies in data collection and has closed a critical capacity gap by improving access to technology and monitoring science in artisanal and subsistence fisheries. However, there remains the need for continued investment in national fisheries monitoring budgets and the human resources to collect these data on the ground 119. Good data collection is quite simply contingent on the time and effort invested by fisheries authorities and community enumerators; without such functional programs in countries, monitoring could not occur, regardless of the improvement in technology. The projected increase in the frequency of assessment and reporting of the status of key coastal fisheries, together with the provision of regular informed policy briefs to decision makers, should help strengthen investment in these areas.

Conclusions

This paper describes the evolution, structural components and functionality of a comprehensive AI-enabled monitoring system that serves national and regional coastal fishery management needs. With the system’s development being integrated in national fisheries programs, it is making significant contributions to adaptive management cycles at various scales and demonstrates the value of digital and AI technology in addressing the enduring challenge of delivering scientific evidence for fisheries managers of tropical SSF. The system’s ability to incorporate a broad range of approaches in the fisheries management cycle and its versatility to integrate other programs, promises to revolutionize how institutions and communities communicate and use the information collected 120 .

The AI-enabled fisheries monitoring system represents a first on several fronts. Firstly, it enables near real-time transfer of data to diverse, data-poor coastal fisheries management contexts for evidence-based adaptive management. Secondly, it allows the application of state-of-the-art technology in remote fisheries contexts, empowering local actors with the information needed to make good management decisions. Thirdly, it supports the development of management metrics and tools that are tailored by and for Pacific people to specific conditions and geographies (e.g., country-specific LWRs). Fourthly, the system’s ability to interface with other learning platforms facilitates the continued improvement and refinement of supporting models, thereby keeping up with technical advances in the field. In addition to these benefits, the experience of co-developing and implementing the system with multiple stakeholders has strengthened cross-border collaboration of Pacific SIDS in the Western, Central and South Pacific Region.

Commitments by national fisheries authorities and stakeholders engaging with the system demonstrate the level of buy-in into the fisheries monitoring support platform. Its evolution and integration at scale serves to illustrate how this approach can be replicated among SIDS globally, thereby addressing critical barriers to achieving adaptive management 44 . The development of this innovation is a result of a technology implemented, tested, and refined as part of policy and practice—this ensures institutional fit by embeddedness as it matures, that it responds to existing and emerging needs, that it challenges, with evidence, entrenched management practices and approaches 121 , and that Pacific Island nations are on the forefront of the rapid advances in AI technologies.

The strategic design and execution of the Ikasavea system responds to an urgent need for timely management of fish stocks in a changing world 44 , 122 . Fish remain a critical resource of people in SIDS, many of whom live in remote communities far from capital cities and national fishery agencies. Projecting into the future, Ikasavea offers potential to develop into an exchange platform for fishery-related knowledge among connected but geographically distant communities and their supporting agencies 123 .

Data availability

The datasets analyzed during the current study are not publicly available due to privacy policy agreements between the Pacific Community (SPC) and its member Countries and Territories, but they are available from the corresponding author on reasonable request. The main models used in the image-processing chain to calibrate images and to detect, measure, and identify specimens are available at https://github.com/PacificCommunity/cfap-ai-models.

Béné, C. et al. Contribution of fisheries and aquaculture to food security and poverty reduction: Assessing the current evidence. World Dev. 79 , 177–196 (2016).

Lowitt, K., Ville, A. S., Lewis, P. & Hickey, G. M. Environmental change and food security: the special case of small island developing states. Reg. Environ. Change 15 , 1293–1298 (2015).

Bennett, N. J. et al. The COVID-19 Pandemic, small-scale fisheries and coastal fishing communities. Coast. Man. 48 , 336–347 (2020).

Gillett, R. & Cartwright, I. The Future of Pacific Island Fisheries (Pacific Community, 2010).

Bell, J. D. et al. Adapting tropical Pacific fisheries and aquaculture to climate change: management measures, policies and investments. in Vulnerability of Tropical Pacific Fisheries and Aquaculture to Climate Change (eds. Bell, J. D., Johnson, J. E. & Hobday, A. J.) 803–876 (Secretariat of the Pacific Community, Noumea, New Caledonia, 2011).

de Suarez, J. M., Cicin-Sain, B., Wowk, K., Payet, R. & Hoegh-Guldberg, O. Ensuring survival: Oceans, climate and security. Ocean Coast. Manag. 90 , 27–37 (2014).

Bahri, T. et al. Adaptive Management of Fisheries in Response to Climate Change: FAO Fisheries and Aquaculture Technical Paper No. 667 (FAO, 2021).

Leal Filho, W. et al. Climate change adaptation on small island states: An assessment of limits and constraints. J. Mar. Sci. Engin. 9 , 602 (2021).

Report of the United Nations Conference on Environment and Development. (United Nations, Rio de Janeiro, Brazil, 1992).

Report of the Global Conference on the Sustainable Development of Small Island Developing States. (United Nations, Bridgetown, Barbados, 1994).

Report of the International Meeting to Review the Implementation of the Programme of Action for the Sustainable Development of Small Island Developing States. (United Nations, Port Louis, Mauritius, 2005).

Friedman, R. S. et al. Scanning Models of Food Systems Resilience in the Indo-Pacific Region. Front. Sustain. Food Syst. https://doi.org/10.3389/fsufs.2022.714881 (2022).

Andrew, N. L. & Evans, L. Approaches and frameworks for management and research in small-scale fisheries. In Small-scale fisheries management: frameworks and approaches for the developing world (ed. Pomeroy, R. S.) (CABI, 2011).

Thomas, A. et al. Climate change and small island developing states. Annu. Rev. Environ. Resour. 45 , 1–27. https://doi.org/10.1146/annurevenviron-012320-083355 (2020).

Campbell, J. R. Development, global change and traditional food security in Pacific Island countries. Reg. Environ. Change 15 , 1313–1324 (2015).

Gillett, R. & Fong, M. Fisheries in the economies of Pacific Island countries and territories (Benefish Study 4). Noumea, New Caledonia: Pacific Community. 704 p. https://purl.org/spc/digilib/doc/ppizh (2023).

Bell, J. D. et al. Planning the use of fish for food security in the Pacific. Mar. Policy 33 , 64–76 (2009).

Gillett, R. E. Fisheries in the economies of Pacific Island countries and territories. Noumea, New Caledonia: Pacific Community 1–684 (2016).

Vaughan, M. B. & Vitousek, P. M. Mahele: Sustaining Communities through Small-Scale Inshore Fishery Catch and Sharing Networks. Pac. Sci. 67 , 329–344 (2013).

Gillett, R. & Lightfoot, C. The Contribution of Fisheries to the Economies of Pacific Island Countries: A Report Prepared for the Asian Development Bank, the Forum Fisheries Agency, and the World Bank (ADB, 2002).

Govan, H. & Lalavanua, W. The “Pacific Way” of Coastal Fisheries Management: Status and Progress of Community-Based Fisheries Management. 64 https://purl.org/spc/digilib/doc/ocw6w (2022).

Rice, J. Evolution of international commitments for fisheries sustainability. ICES J. Mar. Sci. 71 , 157–165 (2014).

United Nations. Transforming our world: the 2030 Agenda for Sustainable Development Department of Economic and Social Affairs. https://sdgs.un.org/2030agenda (2015).

United Nations. Sustainable Development Goals: 17 Goals to Transform Our World. https://sdgs.un.org/ (2023).

Hoelting, R. A. After Rio: The Sustainable Development Concept Following the United Nations Conference on Environment and Development. Ga. J. Intl. Comp. Law 24 , 117 (1994).

FAO and SPC. Report of the FAO/SPC Regional Workshop on Improving Information on Status and Trends of Fisheries in the Pacific Region. Apia, Samoa, 22–26 May 2006. FAO Fisheries and Aquaculture Report. No. 920. Rome, FAO. 2010. 70p

Dalzell, P., Adams, T. J. H. & Polunin, N. V. C. Coastal fisheries in the Pacific Islands. Oceanogr. Mar. Biol. 34 , 395–531 (1996).

Barclay, K. & Cartwright, I. Governance of tuna industries: The key to economic viability and sustainability in the Western and Central Pacific Ocean. Mar. Policy 31 , 348–358 (2007).

Cánovas-Molina, A. & García-Frapolli, E. A review of vulnerabilities in worldwide small-scale fisheries. Fish. Manage. Ecol. 29 , 491–501 (2022).

Cochrane, K. L., Andrew, N. L. & Parma, A. M. Primary fisheries management: a minimum requirement for provision of sustainable human benefits in small-scale fisheries. Fish Fish. 12 , 275–288 (2011).

Govan, H. The Pacific Islands and Biodiversity Beyond National Jurisdiction: Briefing Note of the Council of Regional Organisations in the Pacific Members of the Marine Sector Working Group, https://doi.org/10.13140/RG.2.1.1247.9527 . (2014).

Keen, M. R., Schwarz, A.-M. & Wini-Simeon, L. Towards defining the blue economy: Practical lessons from Pacific Ocean governance. Mar. Policy 88 , 333–341 (2018).

Anon. A new song for coastal fisheries pathways to change: the Noumea strategy. Future of coastal/Inshore fisheries management (Pacific Community (SPC), Noumea, New Caledonia, 2015).

Gillett, R. Marine fishery resources in the Pacific islands (FAO Fisheries and Aquaculture Reviews and Studies, 2011).

Adams, T. J. H. Modern institutional framework for reef fisheries management. In Reef Fisheries (eds. Polunin, N. V. C. & Roberts, C. M.) (Springer, 1996).

Punt, A. E. & Nolan, C. P. Evaluating the costs and benefits of alternative monitoring programmes for fisheries management. Proc. International Conference on Integrated Fisheries Monitoring, Sydney, Australia, 1–5 February 1999 (1999).

Hartill, B. W., Payne, G. W., Rush, N. & Bian, R. Bridging the temporal gap: Continuous and cost-effective monitoring of dynamic recreational fisheries by web cameras and creel surveys. Fish. Res. 183 , 488–497 (2016).

Honey, K. T., Moxley, J. H. & Fujita, R. M. From rags to fishes: data-poor methods for fishery managers. Managing Data-Poor Fish. Case Stud. Models Solut. 1 , 159–184 (2010).

Pons, M., Cope, J. M. & Kell, L. T. Comparing performance of catch-based and length-based stock assessment methods in data-limited fisheries. Can. J. Fish. Aquat. Sci. 77 , 1026–1037 (2020).

Chrysafi, A. & Kuparinen, A. Assessing abundance of populations with limited data: Lessons learned from data-poor fisheries stock assessment. Environ. Rev. 24 , 25–38 (2016).

Cope, J. M. et al. The stock assessment theory of relativity: Deconstructing the term “data-limited” fisheries into components and guiding principles to support the science of fisheries management. Rev Fish Biol. Fisheries https://doi.org/10.1007/s11160-022-09748-1 (2023).

Salpin, C., Onwuasoanya, V., Bourrel, M. & Swaddling, A. Marine scientific research in Pacific Small Island Developing States. Mar. Policy 95 , 363–371 (2018).

Parks, J. Adaptive management in small-scale fisheries: a practical approach. In Small scale fisheries management: frameworks and approaches for the developing world (ed. Pomeroy, R. S.) (CAB International, 2011).

Edmondson, E. & Fanning, L. Implementing adaptive management within a fisheries management context: A systematic literature review revealing gaps, challenges, and ways forward. Sustainability 14 , 7249 (2022).

Chong, L. et al. Performance evaluation of data-limited, length-based stock assessment methods. ICES J. Mar. Sci. 77 , 97–108 (2020).

Castello, L. et al. An approach to assess data-less small-scale fisheries: Examples from Congo rivers. Rev Fish Biol Fisheries 33 , 593–610 (2023).

Harden-Davies, H. R. Research for regions: strengthening marine technology transfer for Pacific Island Countries and biodiversity beyond national jurisdiction. Intl. J. Mar. Coast. Law 32 , 797–822 (2017).

UNCTAD. Digital Economy Report Pacific Edition 2022: Towards Value Creation and Inclusiveness (United Nations Publications, 2022).

Grosz, B. J. & Stone, P. A century-long commitment to assessing artificial intelligence and its impact on society. Commun. ACM 61 , 68–73 (2018).

Zion, B. The use of computer vision technologies in aquaculture–a review. Comput. Electron. Agricult. 88 , 125–132 (2012).

Aftab, K. et al. Intelligent fisheries: Cognitive solutions for improving aquaculture commercial efficiency through enhanced biomass estimation and early disease detection. Cogn. Comput. https://doi.org/10.1007/s12559-024-10292-2 (2024).

Lopez-Marcano, S., Brown, C. J., Sievers, M. & Connolly, R. M. The slow rise of technology: Computer vision techniques in fish population connectivity. Aquatic Conser. 31 , 210–217 (2021).

Signaroli, M., Lana, A. & Alós, J. Novel computer vision tools applied to marine recreational fisheries spatial planning. Fish Res 271 , 106924 (2024).

Bradley, D. et al. Opportunities to improve fisheries management through innovative technology and advanced data systems. Fish Fish. 20 , 564–583 (2019).

Vilas, C. et al. Use of computer vision onboard fishing vessels to quantify catches: The iObserver. Mar. Policy 116 , 103714 (2020).

Ovalle, J. C., Vilas, C. & Antelo, L. T. On the use of deep learning for fish species recognition and quantification on board fishing vessels. Mar. Policy 139 , 105015 (2022).

Palmer, M., Álvarez-Ellacuría, A., Moltó, V. & Catalán, I. A. Automatic, operational, high-resolution monitoring of fish length and catch numbers from landings using deep learning. Fish. Res. 246 , 106166 (2022).

Barbedo, J. G. A. A Review on the Use of Computer Vision and Artificial Intelligence for Fish Recognition, Monitoring, and Management. Fishes 7 , 335 (2022).

Silva, C. N. S., Dainys, J., Simmons, S., Vienožinskis, V. & Audzijonyte, A. A Scalable Open-Source Framework for Machine Learning-Based Image Collection, Annotation and Classification: A Case Study for Automatic Fish Species Identification. Sustainability 14 , 14324 (2022).

Lekunberri, X. et al. Identification and measurement of tropical tuna species in purse seiner catches using computer vision and deep learning. Ecol. Inform. 67 , 101495 (2022).

Atlas, W. I. et al. Wild salmon enumeration and monitoring using deep learning empowered detection and tracking. Front. Mar. Sci. https://doi.org/10.3389/fmars.2023.1200408 (2023).

Shorten, C. & Khoshgoftaar, T. M. A survey on image data augmentation for deep learning. J Big Data 6 , 60 (2019).

Beyan, C. & Browman, H. I. Setting the stage for the machine intelligence era in marine science. ICES J. Mar. Sci. 77 , 1267–1273 (2020).

Nash, R., Valencia, A. H. & Geffen, A. The origin of Fulton’s condition factor - Setting the record straight. Fisheries 31 , 236–238 (2006).

Tukey, J. W. Exploratory Data Analysis (Addison-Wesley Pub. Co., Reading, Mass., 1977).

Lloret, J., Shulman, G. & Love, R. M. Description of condition indicators. in Condition and Health Indicators of Exploited Marine Fishes 1–16 (John Wiley & Sons, Ltd, 2013). https://doi.org/10.1002/9781118752777.ch1 .

Iglewicz, B. & Hoaglin, D. C. How to Detect and Handle Outliers (Quality Press, 1993).

Sullivan, J. H., Warkentin, M. & Wallace, L. So many ways for assessing outliers: What really works and does it matter?. J. Bus. Res. 132 , 530–543 (2021).

Sánchez-González, J. R., Arbonés, A. & Casals, F. Variation over time of length–weight relationships and condition factors for four exotic fish species from a restored shallow lake in NE Iberian Peninsula. Fishes 5 , 7 (2020).

Froese, R. & Pauly, D. (eds.) FishBase 2000: Concepts, Design and Data Sources. ICLARM, Los Baños, Laguna, Philippines. 344 p. (2000).

Prince, J., Hordyk, A., Valencia, S. R., Loneragan, N. & Sainsbury, K. Revisiting the concept of Beverton-Holt life-history invariants with the aim of informing data-poor fisheries assessment. ICES J. Mar. Sci. 72 , 194–203 (2015).

Hordyk, A. R., Ono, K., Prince, J. D. & Walters, C. J. A simple length-structured model based on life history ratios and incorporating size-dependent selectivity: Application to spawning potential ratios for data-poor stocks. Can. J. Fish. Aquat. Sci. 73 , 1787–1799 (2016).

Auguie B. Package ‘gridExtra’. Miscellaneous functions for “grid” graphics. https://CRAN.R-project.org/package=gridExtra (2017).

RStudio Team. RStudio: Integrated Development for R. RStudio, PBC, Boston, MA. http://www.rstudio.com/ (2023).

Attali, D. & Baker, C. ggExtra: Add Marginal Histograms to ‘ggplot2’, and More ‘ggplot2’ Enhancements. R package version 0.10.1. https://CRAN.R-project.org/package=ggExtra (2023).

R Core Team. R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing). https://www.R-project.org/ (2023).

Robinson, D., Hayes, A. & Couch, S. Broom: Convert Statistical Objects into Tidy Tibbles . https://CRAN.R-project.org/package=broom (2023).

Wickham, H. et al. Welcome to the tidyverse. J. Open Source Softw. 4 , 1686 (2019).

Bradski, G., Kaehler, A. & Pisarevsky, V. Learning-based computer vision with Intel’s open source computer vision library. Intel. Tech. J. 9 , 119–130 (2005).

He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. Proc. IEEE conference on computer vision and pattern recognition. 770–778 (2016).

Redmon, J., Divvala, S., Girshick, R. & Farhadi, A. You only look once: Unified, real-time object detection. Proc. of the IEEE conference on computer vision and pattern recognition 779–788 (2016).

Bochkovskiy, A., Wang, C.-Y. & Liao, H.-Y. M. YOLOv4: Optimal Speed and Accuracy of Object Detection. Preprint at http://arxiv.org/abs/2004.10934 (2020).

Pulli, K., Baksheev, A., Kornyakov, K. & Eruhimov, V. Real-time computer vision with OpenCV. Commun. ACM 55 , 61–69 (2012).

Brodzicki, A., Piekarski, M., Kucharski, D., Jaworek-Korjakowska, J. & Gorgon, M. Transfer Learning Methods as a New Approach in Computer Vision Tasks with Small Datasets. Found. Comput. Decis. Sci. 45 , 179–193 (2020).

Bland, J. M. & Altman, D. G. Measuring agreement in method comparison studies. Stat. Methods Med. Res. 8 , 135–160 (1999).

Monkman, G. G., Hyder, K., Kaiser, M. J. & Vidal, F. P. Using machine vision to estimate fish length from images using regional convolutional neural networks. Methods Ecol. Evol. 10 , 2045–2056 (2019).

Wäldchen, J. & Mäder, P. Machine learning for image based species identification. Methods Ecol. Evol. 9 , 2216–2225 (2018).

Blair, J. D., Gaynor, K. M., Palmer, M. S. & Marshall, K. E. A gentle introduction to computer vision-based specimen classification in ecological datasets. J. Anim. Ecol. 93 , 147–158 (2024).

Khalid, M. M. & Karan, O. Deep learning for plant disease detection. Int. J. Math. Comput. Sci. 2 , 75–84 (2024).

De Robertis, A. & Williams, K. Weight-Length Relationships in Fisheries Studies: The Standard Allometric Model Should Be Applied with Caution. Trans. Am. Fish. Soc. 137 , 707–719 (2008).

Evans, K. et al. Optimising fisheries management in relation to tuna catches in the western central Pacific Ocean: A review of research priorities and opportunities. Mar. Policy 59 , 94–104 (2015).

Steenbergen, D. J., Song, A. M. & Andrew, N. A theory of scaling for community-based fisheries management. Ambio 51 , 666–677 (2022).

Stewart, K. R. et al. Characterizing Fishing Effort and Spatial Extent of Coastal Fisheries. PLoS One 5 , e14451 (2010).

Hordyk, A., Ono, K., Valencia, S., Loneragan, N. & Prince, J. A novel length-based empirical estimation method of spawning potential ratio (SPR), and tests of its performance, for small-scale, data-poor fisheries. ICES J. Mar. Sci. 72 , 217–231 (2015).

Taconet, M. et al. Virtual Research Environments supporting sustainability of global fisheries. In FAO Fisheries and Aquaculture - Abstracts (eds Taconet, M. et al. ) (FAO Fisheries and Aquaculture - Abstracts, 2024).

Hordyk, A. R. & Carruthers, T. R. A quantitative evaluation of a qualitative risk assessment framework: Examining the assumptions and predictions of the productivity susceptibility analysis (PSA). PLoS One 13 , e0198298 (2018).

Prince, J. D., Dowling, N. A., Davies, C. R., Campbell, R. A. & Kolody, D. S. A simple cost-effective and scale-less empirical approach to harvest strategies. ICES J. Mar. Sci. 68 , 947–960 (2011).

Garcia, S. et al. Towards Integrated Assessment and Advice in Small-Scale Fisheries: Principles and Processes (FAO Fisheries and Aquaculture, 2008).

Gillett, R. & Tauati, M. I. Fisheries of the Pacific Islands: Regional and National Information (FAO Fisheries and Aquaculture, 2018).

Govan, H., Kinch, J. & Brjosniovschi, A. Strategic Review of Inshore Fisheries Policies and Strategies in Melanesia-Fiji, New Caledonia, Papua New Guinea, Solomon Islands and Vanuatu-Part II: Country Reports. (Pacific Community, Noumea, New Caledonia, 2013).

Thomas Travaille, K. L., Crowder, L. B., Kendrick, G. A. & Clifton, J. Key attributes related to fishery improvement project (FIP) effectiveness in promoting improvements towards sustainability. Fish Fish. 20 , 452–465 (2019).

Batista, V. S., Fabré, N. N., Malhado, A. C. M. & Ladle, R. J. Tropical artisanal coastal fisheries: Challenges and future directions. Rev. Fish. Sci. Aqua. 22 , 1–15 (2014).

Ayilu, R. K., Fabinyi, M. & Barclay, K. Small-scale fisheries in the blue economy: Review of scholarly papers and multilateral documents. Ocean Coast. Manage. 216 , 105982 (2022).

Dowling, N. A. et al. Empirical harvest strategies for data-poor fisheries: A review of the literature. Fish. Res. 171 , 141–153 (2015).

Carruthers, T. R. & Hordyk, A. R. The Data-Limited Methods Toolkit (DLMtool): An R package for informing management of data-limited populations. Methods Ecol. Evol. 9 , 2388–2395 (2018).

Steenbergen, D. J. et al. Tracing innovation pathways behind fisheries co-management in Vanuatu. Ambio 51 , 2359–2375 (2022).

Kronen, M., Vunisea, A., Magron, F. & McArdle, B. Socio-economic drivers and indicators for artisanal coastal fisheries in Pacific island countries and territories and their use for fisheries management strategies. Mar. Policy 34 , 1135–1143 (2010).

Pacific Community. Pacific Framework for Action on Scaling up Community-based Fisheries Management: 2021–2025. in 20 (Pacific Community, 2021).

Masarczyk, W. & Tautkute, I. Reducing catastrophic forgetting with learning on synthetic data. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (ed. Masarczyk, W.) (IEEE, 2020).

Tian, J., Mithun, N. C., Seymour, Z., Chiu, H.-P. & Kira, Z. Striking the Right Balance: Recall Loss for Semantic Segmentation. 2022 International Conference on Robotics and Automation (ICRA) 5063–5069, https://doi.org/10.1109/ICRA46639.2022.9811702 . (2022).

Yao, X., Huang, T., Wu, C., Zhang, R.-X. & Sun, L. Adversarial Feature Alignment: Avoid catastrophic forgetting in incremental task lifelong learning. Neural Comput. 31 , 2266–2291 (2019).

Chen, S. et al. Recall and learn: fine-tuning deep pretrained language models with less forgetting. in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (Association for Computational Linguistics). https://doi.org/10.18653/v1/2020.emnlp-main.634 . (2020).

Koike, T., Qian, K., Schuller, B. W. & Yamamoto, Y. Learning higher representations from pre-trained deep models with data augmentation for the COMPARE 2020 Challenge Mask Task. in Interspeech 2020 (ISCA, 2020). https://doi.org/10.21437/interspeech.2020-1552

Maracani, A., Michieli, U., Toldo, M. & Zanuttigh, P. RECALL: Replay-based Continual Learning in Semantic Segmentation. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV) (ed. Maracani, A.) (IEEE, 2021).

Redmon, J. & Farhadi, A. YOLOv3: An Incremental Improvement. Arxiv abs/1804.02767 , (2018).

Roy, A. M., Bhaduri, J., Kumar, T. & Raj, K. WilDect-YOLO: An efficient and robust computer vision-based accurate object localization model for automated endangered wildlife detection. Ecol. Inform. 75 , 101919 (2023).

Zhang, Q. A novel ResNet101 model based on dense dilated convolution for image classification. SN Appl. Sci. 4 , 9 (2021).

Agapito, M. et al. Beyond the Basics: Improving Information About Small-Scale Fisheries. In Transdisciplinarity for Small-Scale Fisheries Governance (ed. Jentoft, S.) (Springer International Publishing, 2019).

Mease, L. A., Erickson, A. & Hicks, C. Engagement takes a (fishing) village to manage a resource: Principles and practice of effective stakeholder engagement. J. Environ. Manage. 212 , 248–257 (2018).

Cvitanovic, C., Hobday, A. J., McDonald, J., Van Putten, E. I. & Nash, K. L. Governing fisheries through the critical decade: the role and utility of polycentric systems. Rev. Fish Biol. Fish. 28 (1), 1–18 (2018).

Adams, T. J. H. Modern institutional framework for reef fisheries management. in Reef Fisheries (eds. Polunin, N. V. C. & Roberts, C. M.) 337–360 (Springer Netherlands, Dordrecht, 1996). https://doi.org/10.1007/978-94-015-8779-2_13 , (1996).

Johannes, R. E. The renaissance of community-based marine resource management in Oceania. Annu. Rev. Ecol. Syst. 33 , 317–340 (2002).

Acknowledgements

We are grateful to the many fishers and retailers who allowed their fish to be sampled, and to the fisheries officers and enumerators who recorded the information that fed into the development and implementation of the monitoring system. We acknowledge support from the European Union and the Government of Sweden under the Pacific European Union Marine Partnership programme through the Pacific Community (SPC), the Australian Department of Foreign Affairs and Trade, and the New Zealand Foreign Affairs and Trade Aid Program. NLA, OL, BN, AS, and DJS were supported by the Australian government through ACIAR project FIS/2020/172. GS acknowledges a University of Wollongong postgraduate scholarship. We are grateful to Kristel Steenbergen for Figure 1 and Eleanor McNeill for edits to graphics.

Author information

Authors and affiliations.

Pacific Community, Noumea, 98848, New Caledonia

George Shedrawi, Franck Magron, Bernard Vigga, Pauline Bosserelle, Sebastien Gislard & Andrew R. Halford

Australian National Centre for Ocean Resources and Security, University of Wollongong, Wollongong, 2522, Australia

George Shedrawi, Owen Li, Dirk J. Steenbergen & Neil L. Andrew

Ministry of Agriculture and Fisheries, Apia, Samoa

Sapeti Tiitii & Faasulu Fepuleai

National Fisheries Authority, Port Moresby, Papua New Guinea

Chris Molai

Ministry of Fisheries and Marine Resources Development, Tarawa, Kiribati

Manibua Rota & Beia Nikiari

Ministry of Fisheries, Suva, Fiji

Shivam Jalam

Ministry of Fisheries, Nukualofa, Tonga

Viliami Fatongiatau

Vanuatu Fisheries Department, Ministry of Agriculture, Livestock, Forestry, Fisheries and Biosecurity, Port Vila, Vanuatu

Abel P. Sami, Ada H. M. Sokach & Lucy A. Joy

Contributions

GS, FM, PB, AH, BV, NA, DS conceptualized the program. All authors collected and/or curated the data. GS, FM, SG, DS, NA conducted the formal analysis. All authors contributed to the methodology. FM and BV developed the software. GS, FM, SG validated the study. GS, FM, DS, NA visualized the data. GS, FM, DS, NA wrote the manuscript. All authors reviewed and approved the manuscript.

Corresponding author

Correspondence to George Shedrawi.

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary figures.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ .

About this article

Cite this article.

Shedrawi, G., Magron, F., Vigga, B. et al. Leveraging deep learning and computer vision technologies to enhance management of coastal fisheries in the Pacific region. Sci Rep 14 , 20915 (2024). https://doi.org/10.1038/s41598-024-71763-y

Received: 12 April 2024

Accepted: 29 August 2024

Published: 08 September 2024

DOI: https://doi.org/10.1038/s41598-024-71763-y

  • Artificial intelligence
  • Coastal fisheries management
  • Co-management
  • Artisanal fisheries
  • Data poor fisheries
  • Small Island Developing States

Machine Learning-Based Intrusion Detection Methods in IoT Systems: A Comprehensive Review

1. Introduction

2. Materials and Methods

2.1. Eligibility Criteria

2.2. Data Sources and Search Strategy

Data Sources

2.3. Search Strategy

  • IoT: “IoT”, “Internet of Things”, “IoT system”.
  • Intrusion detection: “intrusion detection”, “anomaly detection”, “cybersecurity”.
  • Machine learning: “machine learning”, “artificial intelligence”, “ML”, “AI”, “deep learning”, “supervised learning”, “unsupervised learning”, “neural network”, “random forest”, “support vector machine”, “SVM”, “Random Forest”, “Decision Tree”, “DNN”, “ANN”, “KNN”, “GAN”, “logistic regression”, “ANN”.
  • Challenges: “security challenges”, “IoT security issues”, “cybersecurity challenges”, “AI challenges”, “threat detection challenges”, “IoT vulnerabilities”, “AI limitations in IoT security”.
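The keyword groups above can be combined into a single database query. The sketch below is illustrative only: it assumes terms are joined with OR inside a group and groups are joined with AND, which is a common convention but is not stated in the text. The helper name `build_query` and the shortened term lists are hypothetical, and the Challenges group is left out for brevity.

```python
from typing import Dict, List


def build_query(groups: Dict[str, List[str]]) -> str:
    """Join terms with OR inside each group and join the groups with AND.

    The OR/AND combination is an assumption made for illustration; the
    review does not state how its keyword groups were combined.
    """
    clauses = []
    for terms in groups.values():
        # Drop case-insensitive duplicates (e.g. "random forest" and
        # "Random Forest") while keeping the first-seen spelling and order.
        seen = set()
        unique = []
        for term in terms:
            key = term.lower()
            if key not in seen:
                seen.add(key)
                unique.append(term)
        clauses.append("(" + " OR ".join(f'"{t}"' for t in unique) + ")")
    return " AND ".join(clauses)


# Shortened keyword groups taken from the list above.
groups = {
    "iot": ["IoT", "Internet of Things", "IoT system"],
    "intrusion_detection": ["intrusion detection", "anomaly detection", "cybersecurity"],
    "machine_learning": ["machine learning", "deep learning", "random forest", "Random Forest", "SVM"],
}

query = build_query(groups)
print(query)
```

Individual databases use their own field tags and quoting rules, so a string built this way would normally need adapting before it is pasted into a specific search interface.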

2.4. Study Selection

  • Title evaluation;
  • Abstract and keyword evaluation;
  • Full-text evaluation.

3. Internet of Things

3.1. Definition and Growth of IoT
3.2. Basic Architecture of IoT
4. Classification of IoT Attacks Based on Vulnerabilities and Layers
4.1. Categorization of Vulnerabilities
4.1.1. Physical Layer
4.1.2. Network Layer
4.1.3. Application Layer
4.2. Categorization of Attacks
4.2.1. Perception Layer
4.2.2. Network Layer
4.2.3. Application Layer
5. Traditional Intrusion Detection Methods
5.1. Classification Based on Data Source
5.2. Classification Based on Detection Method
5.3. Limits of Traditional Approaches and Comparison with Machine Learning-Based IDS
6. Machine Learning for Intrusion Detection
7. Supervised Learning
7.1. Popular Methods and Applications
7.1.1. Artificial Neural Networks (ANNs)
7.1.2. Support Vector Machine (SVM)
7.1.3. K-Nearest Neighbors (KNNs)
7.1.4. Logistic Regression (LR)
7.1.5. Decision Tree
8. Unsupervised Learning
8.1. Popular Methods and Applications
8.1.1. Clustering
8.1.2. Principal Component Analysis (PCA)
8.1.3. Autoencoders
8.1.4. Density-Based Anomaly Detection
8.2. Semi-Supervised Learning
9. Deep Learning
9.1. Popular Methods and Applications
9.1.1. Deep Neural Networks (DNNs)
9.1.2. Convolutional Neural Networks (CNNs)
9.1.3. Recurrent Neural Networks (RNNs)
9.1.4. Generative Adversarial Networks (GANs)
9.2. Review of Datasets
10. Discussion
10.1. Challenges and Current Limitations
10.2. Current Trends
11. Conclusions
Author Contributions
Conflicts of Interest

  • Ashton, K. That ’Internet Of Things’ Thing. RFID J. 2009 , 2 , 97–114. [ Google Scholar ]
  • Perera, C.; Liu, C.H.; Jayawardena, S.; Chen, M. A Survey on Internet of Things From Industrial Market Perspective. IEEE Access 2014 , 2 , 1660–1679. [ Google Scholar ] [ CrossRef ]
  • Islam, N.; Farhin, F.; Sultana, I.; Kaiser, M.S.; Rahman, M.S.; Mahmud, M.; Hosen, A.S.M.S.; Cho, G.H. Towards Machine Learning Based Intrusion Detection in IoT Networks. Comput. Mater. Contin. 2021 , 69 , 1801–1821. [ Google Scholar ] [ CrossRef ]
  • Ahmad, Z.; Khan, A.S.; Nisar, K.; Haider, I.; Hassan, R.; Haque, M.R.; Tarmizi, S.; Rodrigues, J.J.P.C. Anomaly Detection Using Deep Neural Network for IoT Architecture. Appl. Sci. 2021 , 11 , 7050. [ Google Scholar ] [ CrossRef ]
  • Union Internationale des Télécommunications. Infrastructure Mondiale de l’Information, Protocole Internet et RÉSeaux de Prochaine Génération ; UIT: Tromsø, Norway, 2012. [ Google Scholar ]
  • Corici, A.A.; Emmelmann, M.; Luo, J.; Shrestha, R.; Corici, M.; Magedanz, T. IoT inter-security domain trust transfer and service dispatch solution. In Proceedings of the 2016 IEEE 3rd World Forum on Internet of Things (WF-IoT), Reston, VA, USA, 12–14 December 2016; pp. 694–699. [ Google Scholar ] [ CrossRef ]
  • Sha, K.; Errabelly, R.; Wei, W.; Yang, T.A.; Wang, Z. EdgeSec: Design of an Edge Layer Security Service to Enhance IoT Security. In Proceedings of the 2017 IEEE 1st International Conference on Fog and Edge Computing (ICFEC), Madrid, Spain, 14–15 May 2017; pp. 81–88. [ Google Scholar ] [ CrossRef ]
  • Al-Sarawi, S.; Anbar, M.; Abdullah, R.; Al Hawari, A.B. Internet of Things Market Analysis Forecasts, 2020–2030. In Proceedings of the 2020 Fourth World Conference on Smart Trends in Systems, Security and Sustainability (WorldS4), London, UK, 27–28 July 2020; pp. 449–453. [ Google Scholar ] [ CrossRef ]
  • Görmüş, S.; Aydın, H.; Ulutaş, G. Security for the internet of things: A survey of existing mechanisms, protocols and open research issues. J. Fac. Eng. Archit. Gazi Univ. 2018 , 33 , 1247–1272. [ Google Scholar ]
  • Ibrahim, M.; Abdullah, M.T.; Abdullah, A.; Perumal, T. An Epidemic Based Model for the Predictions of OOFI in an IoT Platform. Int. J. Eng. Trends Technol. 2020 , 52–56. [ Google Scholar ] [ CrossRef ]
  • Rebah, H.B. Gateway IoT de pilotage et de surveillance des capteurs domestiques via le protocole MQTT. Researche Gate 2022 , 3. [ Google Scholar ]
  • Atzori, L.; Iera, A.; Morabito, G. The Internet of Things: A survey. Comput. Netw. 2010 , 54 , 2787–2805. [ Google Scholar ] [ CrossRef ]
  • Lombardi, M.; Pascale, F.; Santaniello, D. Internet of Things: A General Overview between Architectures, Protocols and Applications. Information 2021 , 12 , 87. [ Google Scholar ] [ CrossRef ]
  • Hasan, M.A.M.; Nasser, M.; Ahmad, S.; Molla, K.I. Feature Selection for Intrusion Detection Using Random Forest. J. Inf. Secur. 2016 , 7 , 129–140. [ Google Scholar ] [ CrossRef ]
  • Zhao, K.; Ge, L. A Survey on the Internet of Things Security. In Proceedings of the 2013 Ninth International Conference on Computational Intelligence and Security, Emeishan, China, 14–15 December 2013; pp. 663–667. [ Google Scholar ] [ CrossRef ]
  • Alaba, F.A.; Othman, M.; Hashem, I.A.T.; Alotaibi, F. Internet of Things security: A survey. J. Netw. Comput. Appl. 2017 , 88 , 10–28. [ Google Scholar ] [ CrossRef ]
  • Sicari, S.; Rizzardi, A.; Grieco, L.; Coen-Porisini, A. Security, privacy and trust in Internet of Things: The road ahead. Comput. Netw. 2015 , 76 , 146–164. [ Google Scholar ] [ CrossRef ]
  • Jing, D.; Chen, H.B. SVM Based Network Intrusion Detection for the UNSW-NB15 Dataset. In Proceedings of the 2019 IEEE 13th International Conference on ASIC (ASICON), Chongqing, China, 29 October–1 November 2019; pp. 1–4. [ Google Scholar ] [ CrossRef ]
  • Weber, R.H.; Studer, E. Cybersecurity in the Internet of Things: Legal aspects. Comput. Law Secur. Rev. 2016 , 32 , 715–728. [ Google Scholar ] [ CrossRef ]
  • Chen, Z.; Liu, J.; Shen, Y.; Simsek, M.; Kantarci, B.; Mouftah, H.T.; Djukic, P. Machine Learning-Enabled IoT Security: Open Issues and Challenges under Advanced Persistent Threats. ACM Comput. Surv. 2023 , 55 , 1–37. [ Google Scholar ] [ CrossRef ]
  • Rathore, S.; Park, J.H. Semi-supervised learning based distributed attack detection framework for IoT. Appl. Soft Comput. 2018 , 72 , 79–89. [ Google Scholar ] [ CrossRef ]
  • Mishra, A.K.; Tripathy, A.K.; Puthal, D.; Yang, L.T. Analytical Model for Sybil Attack Phases in Internet of Things. IEEE Internet Things J. 2019 , 6 , 379–387. [ Google Scholar ] [ CrossRef ]
  • Chen, B.; Ho, D.W.C.; Hu, G.; Yu, L. Secure Fusion Estimation for Bandwidth Constrained Cyber-Physical Systems under Replay Attacks. IEEE Trans. Cybern. 2018 , 48 , 1862–1876. [ Google Scholar ] [ CrossRef ]
  • Chen, L.; Kuang, X.; Xu, A.; Suo, S.; Yang, Y. A Novel Network Intrusion Detection System Based on CNN. In Proceedings of the 2020 Eighth International Conference on Advanced Cloud and Big Data (CBD), Taiyuan, China, 5–6 December 2020; pp. 243–247. [ Google Scholar ] [ CrossRef ]
  • Liu, H.; Lang, B. Machine Learning and Deep Learning Methods for Intrusion Detection Systems: A Survey. Appl. Sci. 2019 , 9 , 4396. [ Google Scholar ] [ CrossRef ]
  • Sundararajan, K.; Garg, L.; Srinivasan, K.; Bashir, A.K.; Kaliappan, J.; Ganapathy, G.P.; Selvaraj, S.K.; Meena, T. A Contemporary Review on Drought Modeling Using Machine Learning Approaches. Comput. Model. Eng. Sci. 2021 , 128 , 447–487. [ Google Scholar ] [ CrossRef ]
  • Anitha, A.A.; Arockiam, D.L. ANNIDS: Artificial Neural Network based Intrusion Detection System for Internet of Things. Int. J. Innov. Technol. Explor. Eng. 2019 , 8 , 2583–2588. [ Google Scholar ] [ CrossRef ]
  • Hanif, S.; Ilyas, T.; Zeeshan, M. Intrusion Detection In IoT Using Artificial Neural Networks On UNSW-15 Dataset. In Proceedings of the 2019 IEEE 16th International Conference on Smart Cities: Improving Quality of Life Using ICT & IoT and AI (HONET-ICT), Charlotte, NC, USA, 6–9 October 2019; pp. 152–156. [ Google Scholar ] [ CrossRef ]
  • Jamal, A.; Faisal, H.M.; Nasir, M. Malware Detection and Classification in IoT Network using ANN. Mehran Univ. Res. J. Eng. Technol. 2022 , 41 , 80–91. [ Google Scholar ] [ CrossRef ]
  • Goeschel, K. Reducing false positives in intrusion detection systems using data-mining techniques utilizing support vector machines, decision trees, and naive Bayes for off-line analysis. In Proceedings of the SoutheastCon 2016, Norfolk, VA, USA, 30 March–3 April 2016; pp. 1–6. [ Google Scholar ] [ CrossRef ]
  • Ioannou, C.; Vassiliou, V. Network Attack Classification in IoT Using Support Vector Machines. J. Sens. Actuator Netw. 2021 , 10 , 58. [ Google Scholar ] [ CrossRef ]
  • Pouyanfar, S.; Sadiq, S.; Yan, Y.; Tian, H.; Tao, Y.; Reyes, M.P.; Shyu, M.L.; Chen, S.C.; Iyengar, S.S. A Survey on Deep Learning. ACM Comput. Surv. 2019 , 51 , 1–36. [ Google Scholar ] [ CrossRef ]
  • Zhu, R.; Ji, X.; Yu, D.; Tan, Z.; Zhao, L.; Li, J.; Xia, X. KNN-Based Approximate Outlier Detection Algorithm over IoT Streaming Data. IEEE Access 2020 , 8 , 42749–42759. [ Google Scholar ] [ CrossRef ]
  • Abdaljabar, Z.H.; Ucan, O.N.; Alheeti, K.M.A. An Intrusion Detection System for IoT Using KNN and Decision-Tree Based Classification. In Proceedings of the 2021 International Conference of Modern Trends in Information and Communication Technology Industry (MTICTI), Sana’a, Yemen, 4–6 December 2021; pp. 1–5. [ Google Scholar ] [ CrossRef ]
  • Li, W.; Yi, P.; Wu, Y.; Pan, L.; Li, J. A New Intrusion Detection System Based on KNN Classification Algorithm in Wireless Sensor Network. J. Electr. Comput. Eng. 2014 , 2014 , 1–8. [ Google Scholar ] [ CrossRef ]
  • Govindarajan, M.; Chandrasekaran, R. Intrusion detection using k-Nearest Neighbor. In Proceedings of the 2009 First International Conference on Advanced Computing, Chennai, India, 13–15 December 2009; pp. 13–20. [ Google Scholar ] [ CrossRef ]
  • Aref, M.A.; Jayaweera, S.K.; Machuzak, S. Multi-Agent Reinforcement Learning Based Cognitive Anti-Jamming. In Proceedings of the 2017 IEEE Wireless Communications and Networking Conference (WCNC), San Francisco, CA, USA, 19–22 March 2017; pp. 1–6. [ Google Scholar ] [ CrossRef ]
  • Bapat, R.; Mandya, A.; Liu, X.; Abraham, B.; Brown, D.E.; Kang, H.; Veeraraghavan, M. Identifying malicious botnet traffic using logistic regression. In Proceedings of the 2018 Systems and Information Engineering Design Symposium (SIEDS), Charlottesville, VA, USA, 27–27 April 2018; pp. 266–271. [ Google Scholar ] [ CrossRef ]
  • Sambangi, S.; Gondi, L. A Machine Learning Approach for DDoS (Distributed Denial of Service) Attack Detection Using Multiple Linear Regression. Proceedings 2020 , 63 , 51. [ Google Scholar ] [ CrossRef ]
  • Qaddoura, R.; Al-Zoubi, A.M.; Almomani, I.; Faris, H. A Multi-Stage Classification Approach for IoT Intrusion Detection Based on Clustering with Oversampling. Appl. Sci. 2021 , 11 , 3022. [ Google Scholar ] [ CrossRef ]
  • Ingre, B.; Yadav, A.; Soni, A.K. Decision Tree Based Intrusion Detection System for NSL-KDD Dataset ; Springer: Berlin/Heidelberg, Germany, 2018; pp. 207–218. [ Google Scholar ] [ CrossRef ]
  • Rai, K.; Devi, M.S.; Guleria, A. Decision Tree Based Algorithm for Intrusion Detection. Adv. Netw. Appl. 2016 , 7 , 2828–2834. [ Google Scholar ]
  • Al-Jarrah, O.Y.; Al-Hammdi, Y.; Yoo, P.D.; Muhaidat, S.; Al-Qutayri, M. Semi-supervised multi-layered clustering model for intrusion detection. Digit. Commun. Netw. 2018 , 4 , 277–286. [ Google Scholar ] [ CrossRef ]
  • Muniyandi, A.P.; Rajeswari, R.; Rajaram, R. Network Anomaly Detection by Cascading K-Means Clustering and C4.5 Decision Tree Algorithm. Procedia Eng. 2012 , 30 , 174–182. [ Google Scholar ] [ CrossRef ]
  • Peng, K.; Leung, V.C.M.; Huang, Q. Clustering Approach Based on Mini Batch Kmeans for Intrusion Detection System over Big Data. IEEE Access 2018 , 6 , 11897–11906. [ Google Scholar ] [ CrossRef ]
  • Luo, T.; Nagarajan, S.G. Distributed Anomaly Detection Using Autoencoder Neural Networks in WSN for IoT. In Proceedings of the 2018 IEEE International Conference on Communications (ICC), Kansas City, MO, USA, 20–24 May 2018; pp. 1–6. [ Google Scholar ] [ CrossRef ]
  • Aboelwafa, M.M.N.; Seddik, K.G.; Eldefrawy, M.H.; Gadallah, Y.; Gidlund, M. A Machine-Learning-Based Technique for False Data Injection Attacks Detection in Industrial IoT. IEEE Internet Things J. 2020 , 7 , 8462–8471. [ Google Scholar ] [ CrossRef ]
  • Garg, S.; Kaur, K.; Batra, S.; Kaddoum, G.; Kumar, N.; Boukerche, A. A multi-stage anomaly detection scheme for augmenting the security in IoT-enabled applications. Future Gener. Comput. Syst. 2020 , 104 , 105–118. [ Google Scholar ] [ CrossRef ]
  • Al-Garadi, M.A.; Mohamed, A.; Al-Ali, A.K.; Du, X.; Ali, I.; Guizani, M. A Survey of Machine and Deep Learning Methods for Internet of Things (IoT) Security. IEEE Commun. Surv. Tutor. 2020 , 22 , 1646–1685. [ Google Scholar ] [ CrossRef ]
  • Brun, O.; Yin, Y.; Yin, Y.; Gelenbe, E. Deep Learning with Dense Random Neural Network for Detecting Attacks against IoT-Connected Home Environments. In Security in Computer and Information Sciences: First International ISCIS Security Workshop 2018, Euro-CYBERSEC 2018, London, UK, February 26–27 2018, Revised Selected Papers 1 ; Springer International Publishing: Berlin/Heidelberg, Germany, 2018; pp. 458–463. [ Google Scholar ]
  • Kim, J.; Shin, N.; Jo, S.Y.; Kim, S.H. Method of intrusion detection using deep neural network. In Proceedings of the 2017 IEEE International Conference on Big Data and Smart Computing (BigComp), Jeju, Republic of Korea, 13–16 February 2017; pp. 313–316. [ Google Scholar ] [ CrossRef ]
  • Kim, J.; Kim, J.; Kim, H.; Shim, M.; Choi, E. CNN-Based Network Intrusion Detection against Denial-of-Service Attacks. Electronics 2020 , 9 , 916. [ Google Scholar ] [ CrossRef ]
  • Park, S.H.; Park, H.J.; Choi, Y.J. RNN-Based Prediction for Network Intrusion Detection. In Proceedings of the 2020 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), Fukuoka, Japan, 19–21 February 2020; pp. 572–574. [ Google Scholar ] [ CrossRef ]
  • Tang, T.A.; Mhamdi, L.; McLernon, D.; Zaidi, S.A.R.; Ghogho, M. Deep Recurrent Neural Network for Intrusion Detection in SDN-Based Networks. In Proceedings of the 2018 4th IEEE Conference on Network Softwarization and Workshops (NetSoft), Montreal, QC, Canada, 25–29 June 2018; pp. 202–206. [ Google Scholar ] [ CrossRef ]
  • Torres, P.; Catania, C.; Garcia, S.; Garino, C.G. An analysis of Recurrent Neural Networks for Botnet detection behavior. In Proceedings of the 2016 IEEE Biennial Congress of Argentina (ARGENCON), Buenos Aires, Argentina, 15–17 June 2016; pp. 1–6. [ Google Scholar ] [ CrossRef ]
  • Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997 , 9 , 1735–1780. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Evaluation of gated recurrent neural networks on sequence modeling. arXiv 2014 , arXiv:1412.3555. [ Google Scholar ]
  • Ferdowsi, A.; Saad, W. Generative Adversarial Networks for Distributed Intrusion Detection in the Internet of Things. In Proceedings of the 2019 IEEE Global Communications Conference (GLOBECOM), Waikoloa, HI, USA, 9–13 December 2019; pp. 1–6. [ Google Scholar ] [ CrossRef ]
  • Liao, D.; Huang, S.; Tan, Y.; Bai, G. Network Intrusion Detection Method Based on GAN Model. In Proceedings of the 2020 International Conference on Computer Communication and Network Security (CCNS), Xi’an, China, 21–23 August 2020; pp. 153–156. [ Google Scholar ] [ CrossRef ]
  • Panda, M.; Patra, M.R. Network Intrusion Detection Using Naïve Bayes. IJCSNS Int. J. Comput. Sci. Netw. Secur. 2007 , 7 , 258–263. [ Google Scholar ]
  • Gumus, F.; Sakar, C.O.; Kursun, O. Network Intrusion Detection Using Naïve Bayes. In Proceedings of the IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, Beijing, China, 17–20 August 2014. [ Google Scholar ]
  • Kim, J.; Kim, J.; Thu, H.L.T.; Kim, H. Long Short Term Memory Recurrent Neural Network Classifier for Intrusion Detection. In Proceedings of the 2016 International Conference on Platform Technology and Service (PlatCon), Jeju, Republic of Korea, 15–17 February 2016; pp. 1–5. [ Google Scholar ] [ CrossRef ]
  • Zarpelão, B.B.; Miani, R.S.; Kawakani, C.T.; de Alvarenga, S.C. A survey of intrusion detection in Internet of Things. J. Netw. Comput. Appl. 2017 , 84 , 25–37. [ Google Scholar ] [ CrossRef ]
  • Cassales, G.W.; Senger, H.; de Faria, E.R.; Bifet, A. IDSA-IoT: An Intrusion Detection System Architecture for IoT Networks. In Proceedings of the 2019 IEEE Symposium on Computers and Communications (ISCC), Barcelona, Spain, 29 June–3 July 2019; pp. 1–7. [ Google Scholar ]
  • Yahyaoui, A.; Lakhdhar, H.; Abdellatif, T.; Attia, R. Machine learning based network intrusion detection for data streaming IoT applications. In Proceedings of the 2021 21st ACIS International Winter Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD-Winter), Ho Chi Minh City, Vietnam, 28–30 January 2021; pp. 51–56. [ Google Scholar ] [ CrossRef ]

| Criterion | Description |
|---|---|
| Language | Only articles written in English. |
| Period | Only articles published in the last 10 years (between January 2014 and May 2024) due to the rapid developments in the field of IoT security. |
| Main topic | Only articles with the main topic of intrusion detection in IoT systems. |
| Techniques | Only articles addressing the topic with machine learning (ML)-based techniques. |
| Evaluation | Only peer-reviewed articles published in recognized scientific journals. |

| Database | Search Query |
|---|---|
| IEEE Xplore | ("IoT" OR "Internet of Things") AND ("intrusion detection" OR "anomaly detection" OR "cybersecurity") AND ("machine learning" OR "artificial intelligence" OR "deep learning" OR "KNN" OR "SVM" OR "GAN" OR "ANN" OR "logistic regression" OR "Random Forest") AND (LIMIT-TO (PUBYEAR, 2014–2024)) AND (LIMIT-TO (LANGUAGE, "English")) |
| PubMed | (("IoT" OR "Internet of Things") AND ("intrusion detection" OR "anomaly detection" OR "cybersecurity") AND ("machine learning" OR "artificial intelligence" OR "deep learning" OR "KNN" OR "SVM" OR "GAN" OR "ANN" OR "logistic regression" OR "Random Forest") AND ("2014/01/01"[PDAT]: "2024/05/31"[PDAT]) AND English[lang]) |
| Scopus | TITLE-ABS-KEY (("IoT" OR "Internet of Things") AND ("intrusion detection" OR "anomaly detection" OR "cybersecurity") AND ("machine learning" OR "artificial intelligence" OR "deep learning" OR "KNN" OR "SVM" OR "GAN" OR "ANN" OR "logistic regression" OR "Random Forest")) AND NOT (DOCTYPE ("re")) AND PUBYEAR > 2013 AND PUBYEAR < 2025 AND (LIMIT-TO (LANGUAGE, "English")) |
| Google Scholar | ("IoT" AND "intrusion detection" AND "cybersecurity") AND ("machine learning" OR "artificial intelligence" OR "deep learning" OR "KNN" OR "SVM" OR "GAN" OR "ANN" OR "logistic regression" OR "Random Forest") AND (LIMIT-TO (PUBYEAR, 2014–2024)) AND (LIMIT-TO (LANGUAGE, "English")) |

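The keyword part of the IEEE Xplore query listed above follows one pattern: terms are OR-ed inside a group and the groups are AND-ed together. The sketch below shows how such a string could be assembled. The function name `build_query` is hypothetical, and the database-specific filters (publication year, language, document type) are deliberately left out because their syntax differs per database.

```python
def build_query(groups):
    """AND together keyword groups; OR the quoted terms inside each group."""
    clauses = []
    for terms in groups:
        quoted = " OR ".join(f'"{term}"' for term in terms)
        clauses.append(f"({quoted})")
    return " AND ".join(clauses)


# Keyword groups copied from the IEEE Xplore query listed above.
iot_terms = ["IoT", "Internet of Things"]
task_terms = ["intrusion detection", "anomaly detection", "cybersecurity"]
ml_terms = [
    "machine learning", "artificial intelligence", "deep learning",
    "KNN", "SVM", "GAN", "ANN", "logistic regression", "Random Forest",
]

query = build_query([iot_terms, task_terms, ml_terms])
print(query)
```

Keeping the groups as plain lists makes it easier to apply the same terms consistently across databases and to record exactly which keywords were used.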
| Criteria | NIDS | HIDS |
|---|---|---|
| Surveillance type | Global network traffic surveillance | Surveillance of specific host activities |
| Data source | Network traffic | Operating system or application program logs |
| Detection scope | Malicious or suspicious activities in network traffic | File modifications, unauthorized access attempts, abnormal system behaviors [ ] |
| Operating system independence | Independent of the host operating system | Dependent on the host operating system |
| Detection target | Attacks between IoT devices and network nodes | Attacks specifically targeting an IoT device or resource [ ] |
| Detection efficiency | High, can detect real-time attacks | Low, needs to process numerous logs [ ] |
| Intrusion traceability | Traces intrusion position and time based on IP addresses and timestamps | Traces intrusion process based on system call paths |
| Limitation | Monitors only traffic passing through a specific network segment | Cannot analyze network behaviors |

| Criteria | Anomaly-Based Method | Signature-Based Method (Misuse Detection) |
|---|---|---|
| Operating principle | Modeling normal behavior and detecting deviations | Representing attack behaviors as signatures |
| Detection approach | Monitoring data flows, traffic models, and communication patterns | Comparing samples with a signature database |
| Effectiveness against unknown attacks | High | Low |
| False positive management | High false alarm rate | Low false alarm rate |
| Attack information | Unable to provide precise reasons for detected anomalies | Provides detailed information on attack types and possible reasons |
| Main challenges | Clearly defining a normal behavior profile | Designing effective signatures |
| Advantages | High generalization capability, recognizes unknown attacks [ ] | Low false alarm rate, detailed information on attacks |
| Disadvantages | High false alarm rate, difficulty in identifying reasons for anomalies | High rate of missed alarms, unable to detect unknown attacks, need to maintain a large signature database |

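The two operating principles compared above can be contrasted with a minimal sketch. It is an illustration under stated assumptions, not a method from the reviewed studies: the signature strings, the single traffic feature (packets per second), and the three-standard-deviation threshold are all hypothetical choices.

```python
from statistics import mean, stdev


def signature_match(payload, signatures):
    """Signature-based: return the names of known attack patterns found in the payload."""
    return [name for name, pattern in signatures.items() if pattern in payload]


def fit_normal_profile(samples):
    """Anomaly-based: model normal behavior as the mean and standard deviation of one feature."""
    return mean(samples), stdev(samples)


def is_anomalous(value, profile, threshold=3.0):
    """Flag a value that deviates from the normal profile by more than `threshold` deviations."""
    mu, sigma = profile
    return abs(value - mu) > threshold * sigma


# Hypothetical signature database and hypothetical normal traffic (packets per second).
signatures = {"sql_injection": "' OR 1=1", "path_traversal": "../"}
normal_rate = [98, 102, 101, 97, 100, 103, 99, 100]
profile = fit_normal_profile(normal_rate)

known = signature_match("GET /index.php?id=1' OR 1=1", signatures)
unknown = signature_match("GET /totally-new-exploit", signatures)
print(known, unknown)              # the unseen request matches no signature
print(is_anomalous(450, profile))  # a traffic burst far from the normal profile
print(is_anomalous(101, profile))  # ordinary traffic
```

The signature matcher names the attack it recognizes but returns nothing for a request it has never seen, while the anomaly detector flags the burst without being able to say what kind of attack it is. This mirrors the "Attack information" and "Effectiveness against unknown attacks" rows.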
| Criteria | Traditional IDS | Machine Learning-Based IDS |
|---|---|---|
| Flexibility | Limited, depends on known signatures | High, can detect unknown behaviors |
| Scalability | Limited, performance issues with large data volumes | Good, handles large data with appropriate resources |
| Dependency on updates | High, requires manual signature updates | Low, learns continuously from new data |
| Detection of unknown attacks | Low, does not detect zero-day attacks | High, detects anomalies and new attacks |
| False positive rate | Low for known attacks, high for new ones | Variable, high for anomalies but manageable |
| Attack information | Detailed for known attacks | Limited but can be improved with interpretability techniques |

| Method | Study | Dataset | Attacks and Vulnerabilities Explored | Results |
|---|---|---|---|---|
| ANN | [ ] | UNSW-15 Dataset | Dos, Probe, U2R, R2L | Average precision of 84%, false positive rate < 8% |
| | [ ] | Simulated with Contiki OS/Cooja Simulator 3.0 | DIS attack, Version attack | Accurate classification, low error rate |
| SVM | [ ] | KDD Cup 99 | Various types of attacks | Significant reduction in false positives |
| | [ ] | UNSW-NB15 | Backdoor, DoS, Exploits, Fuzzers, Generic, Reconnaissance, Shellcode, Worms | 85.99% precision in binary classification, 75.77% in multi-classification |
| KNN | [ ] | DoH20 | Various types of attacks | 100% precision for KNN and DT |
| | [ , ] | University of New Mexico data | Dos, Probe, U2R, R2L | Reduced execution time up to 0.01%, decreased error rates up to 0.002% |
| Naive Bayes | [ ] | KDD Cup’99 | DoS, Probe, U2R, R2L | Improved false positive rates, cost, and calculation time |
| Logistic Regression | [ ] | Malware Capture Facility Project, Stratosphere IPS data | Traffic from 8 different botnet families | AUC of 0.985, precision of 95%, recall of 96.7% |
| | [ ] | CIC-IDS 2017 | DDoS and Bot | 73.79% precision with information gain-based feature selection |
| Decision Tree | [ ] | NSL-KDD | DOS and DDOS attacks | 73.79% precision in DDoS attack detection |
| | [ ] | NSL-KDD | Various IoT attacks | Improved precision and model construction time |

| Method | Study | Dataset | Attacks and Vulnerabilities Explored | Results |
|---|---|---|---|---|
| K-means | [ ] | MIT-DARPA 1999 network traffic data | DDOS attacks, code injection | Improved precision, reduced false positives |
| K-means and PCA | [ ] | KDD Cup 99 | Various attacks | - |
| Autoencoder | [ ] | Indoor WSN testbed | Various attacks | High detection accuracy, low false alert rate |
| | [ ] | IIoT industry-specific | False data injection attacks targeting IIoT | Significant improvement in attack detection compared to SVM-based methods |
| DBSCAN | [ ] | Not specified | Various varied attacks | Effective data clustering and identification of abnormal behavior |
| DNN | [ ] | IoT-Botnet 2020 | Various types of attacks | High precision, adaptability, detects complex patterns |
| | [ ] | KDD Cup 99 | - | - |
| CNN | [ ] | KDD CUP 1999 and CSE-CIC-IDS2018 | DoS attacks | Effective for malware detection on Android, superior to RNNs for DoS detection |
| | [ ] | CIC-IDS | Various attacks | - |
| RNN | [ ] | NSL-KDD | Various attacks | - |
| | [ ] | NSL-KDD | Attacks in SDN networks | 89% precision with only six raw features |
| GAN | [ ] | Bot-IoT Dataset | Botnet behaviors in network traffic | High attack detection rate with low false alarm rate, challenges with indistinguishable and unbalanced traffic |
| | [ ] | Daily activity recognition dataset collected from 30 subjects using a smartphone | Internal and external attacks, including false data injections | Distributed GAN shows up to 20% higher precision, 25% higher recall, and 60% lower false positive rate compared to standalone GAN |
| | [ ] | KDD Cup 99 | - | Excellent results for intrusion detection, with approximately 99% precision for all cases and high detection rate |

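The results listed above are reported as precision, recall, and false positive rate. As a reminder of how these quantities relate, here is a minimal sketch that derives them from binary labels (1 = attack, 0 = normal). The label vectors are made-up illustration data, not results from any of the cited studies, and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = attack)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def detection_metrics(y_true, y_pred):
    """Return precision, recall (detection rate), and false positive rate."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }


# Made-up example: 4 attack samples and 6 normal samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = detection_metrics(y_true, y_pred)
print(metrics)
```

On imbalanced intrusion datasets, precision and false positive rate can move quite differently, which is one reason the studies often report more than one figure.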
| Dataset | Attack | Data Size | Data Type | Study |
|---|---|---|---|---|
| DARPA1998 | Dos, Probe, U2R, R2L | Varies | Raw packets | [ ] |
| KDD Cup 99 | Dos, R2L, U2R, Probing | 4,730,503 packets | Network records | [ , , , , , ] |
| NSL-KDD | DoS, R2L, U2R, Probe with 22 types of subcategories of attacks | 149,470 | Network records | [ , , , ] |
| UNSW-NB15 | Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode, Worms | 2.5 GB | PCAP, CSV | [ , ] |
| CIDDS-001 | Port scan, Dos, Ping of Death, etc. | 700 MB | Data flows, CSV | [ , , ] |
| CIC-IDS-2017 and CSE-CIC-IDS2018 | Bot, brute force, DoS, Infiltration, SQL injection | Varies greatly | PCAP, CSV | [ , ] |
| BoT-IoT | DDoS, DoS, Reconnaissance, Theft | 5 GB | PCAP, CSV | [ , ] |
| Edge-IIoTset | DDoS_UDP, DDoS_ICMP, SQL_injection, Password, Vulnerability_scanner, DDoS_TCP, DDoS_HTTP, Uploading, Backdoor, Port_Scanning, XSS, Ransomware, MITM, Fingerprinting | Varies also | Network traffic flows, Security event logs, IoT device metrics, Specific attack records, Web traffic data, Communication metadata | No studying |

Share and Cite

Kikissagbe, B.R.; Adda, M. Machine Learning-Based Intrusion Detection Methods in IoT Systems: A Comprehensive Review. Electronics 2024 , 13 , 3601. https://doi.org/10.3390/electronics13183601

U.S. Food and Drug Administration

Development of neuroblastoma tissue diagnostic methods through deep learning-based image analytics

FDA Collaborators: Reena Phillip, PhD; Marc Theoret, MD; Diana Bradford, MD; Prakash Jha, MD; Fengmin Li, PhD; Arpita Roy, PhD; Martha Donoghue, MD

External Collaborators:

  • Stanford University (Awardee): Bill Chiu, MD; Hiroyuki Shimada, MD, PhD; Olivier Gevarert, PhD

Project Start Date: January 2024

Regulatory Science Challenge

Neuroblastoma (NB), a common pediatric solid tumor, is associated with significant clinical variability based on age and biological factors. Current treatment decisions rely on clinical and molecular prognostic factors to categorize patients into different risk groups. While risk-stratified treatment leads to favorable outcomes for low- and intermediate-risk patients (over 90% 5-year survival), the survival of high-risk patients remains below 50% despite aggressive multimodal therapy [1,2].

The International Neuroblastoma Pathology Classification System (based on the Shimada classification) utilizes microscopic examination of tumor specimens to identify specific histologic characteristics and provide the correct diagnosis that correlates with clinical outcome. However, the heterogeneity of neuroblastoma can lead to variations in pathologists’ interpretations, affecting the accuracy in classifying the tumor risk group. 

Project Description & Goals

The goal of this research is to improve neuroblastoma pathologic diagnosis and resulting risk classification by developing a novel diagnostic tool. To achieve this goal, a large collection of neuroblastoma tissue samples from across the US, Canada, Australia, and New Zealand will be analyzed using artificial intelligence (AI) to identify key histopathologic features in whole slide imaging (WSI). The AI-based approach will be used to develop a diagnostic algorithm for improved tumor grading and prognosis evaluation in neuroblastoma.

References:

  1. Whittle SB, Smith V, Doherty E, Zhao S, McCarty S, Zage PE (2017). Overview and recent advances in the treatment of neuroblastoma. Expert Review of Anticancer Therapy, 17(4), 369-386.
  2. Irwin MS, Naranjo A, Zhang FF, Cohn SL, London WB, Gastier-Foster JM, et al. (2021). Revised neuroblastoma risk classification system: a report from the Children's Oncology Group. Journal of Clinical Oncology, 39(29), 3229-3241.

Deep learning and machine learning techniques for head pose estimation: a survey

  • Open access
  • Published: 12 September 2024
  • Volume 57, article number 288 (2024)

Redhwan Algabri, Ahmed Abdu & Sungon Lee

Head pose estimation (HPE) has been extensively investigated over the past decade due to its wide range of applications across several domains of artificial intelligence (AI), resulting in progressive improvements in accuracy. The problem becomes more challenging when the application requires full-range angles, particularly in unconstrained environments, making HPE an active research topic. This paper presents a comprehensive survey of recent AI-based HPE tasks in digital images. We also propose a novel taxonomy based on the main steps to implement each method, broadly dividing these steps into eleven categories under four groups. Moreover, we provide the pros and cons of ten categories of the overall system. Finally, this survey sheds some light on the public datasets, available codes, and future research directions, helping readers and aspiring researchers identify robust methods that provide a strong baseline within each subcategory for further exploration in this area. The review compared and analyzed 113 articles published between 2018 and 2024, of which 70.5% use deep learning, 24.1% machine learning, and 5.4% hybrid approaches. Furthermore, it included 101 articles related to datasets, definitions, and other elements for AI-based HPE systems published over the last two decades. To the best of our knowledge, this is the first paper that aims to survey HPE strategies based on artificial intelligence, with detailed explanations of the main steps to implement each method. A regularly updated project page is provided: ( github ).

1 Introduction

AI is the ability of a digital computer to perform tasks commonly associated with intelligent beings, matching the performance levels of human experts in tasks that require everyday knowledge, such as discovering meaning, HPE, and the ability to reason. HPE is an important computer vision task that has garnered significant attention in recent years owing to its wide range of applications in various fields, including surveillance, robotics, and human-computer interaction (HCI) (Zhang et al. 2023). The goal of HPE is to determine the position and orientation of a person’s head relative to a fixed coordinate system, typically defined by a camera (Borghi et al. 2018). Pose is usually expressed by rotational (yaw, pitch, and roll) components (Cobo et al. 2024), translational (x, y, and z) components (Algabri and Choi 2021, 2022), or both. With the recent rapid development of computer vision technology, deep learning techniques have outperformed classical techniques in various tasks, including AI-based HPE systems. The field of HPE has seen significant progress in recent years, with many state-of-the-art techniques achieving high accuracy and robustness in various applications. However, many challenges and open problems in HPE still require further research, such as the gimbal lock and discontinuous representation problems, as well as handling dynamic head motions and achieving real-time performance. Moreover, there is a need for standardized evaluation protocols and benchmarks to enable fair comparisons between different HPE techniques. Most existing methods train their networks with the quaternion or Euler angle representation, which has led to high performance for deep learning-based HPE. More recently, attention has been paid to the use of the rotation matrix representation with convolutional networks to mitigate the discontinuous representation and gimbal lock problems.
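The gimbal lock problem mentioned above can be made concrete with a small sketch. It assumes the common Z-Y-X composition R = Rz(yaw) · Ry(pitch) · Rx(roll) with angles in degrees. HPE papers assign the axes in different ways, so this convention and the helper names are illustrative assumptions rather than the survey's definition.

```python
import math


def matmul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def euler_to_matrix(yaw, pitch, roll):
    """Rotation matrix for R = Rz(yaw) @ Ry(pitch) @ Rx(roll), angles in degrees (assumed convention)."""
    y, p, r = (math.radians(angle) for angle in (yaw, pitch, roll))
    rz = [[math.cos(y), -math.sin(y), 0.0],
          [math.sin(y), math.cos(y), 0.0],
          [0.0, 0.0, 1.0]]
    ry = [[math.cos(p), 0.0, math.sin(p)],
          [0.0, 1.0, 0.0],
          [-math.sin(p), 0.0, math.cos(p)]]
    rx = [[1.0, 0.0, 0.0],
          [0.0, math.cos(r), -math.sin(r)],
          [0.0, math.sin(r), math.cos(r)]]
    return matmul(matmul(rz, ry), rx)


def max_abs_diff(a, b):
    """Largest element-wise difference between two 3x3 matrices."""
    return max(abs(a[i][j] - b[i][j]) for i in range(3) for j in range(3))


# At pitch = 90 degrees only the difference (roll - yaw) matters, so two
# different angle triples describe the same physical orientation.
locked_a = euler_to_matrix(yaw=30, pitch=90, roll=50)
locked_b = euler_to_matrix(yaw=10, pitch=90, roll=30)
# Away from the singularity the same two changes give different orientations.
free_a = euler_to_matrix(yaw=30, pitch=0, roll=50)
free_b = euler_to_matrix(yaw=10, pitch=0, roll=30)

print(max_abs_diff(locked_a, locked_b))  # numerically zero
print(max_abs_diff(free_a, free_b))      # clearly non-zero
```

Because two distinct Euler labels map to one orientation at the singularity, a network regressing Euler angles can receive conflicting targets for visually identical poses. The rotation matrix itself is unique for each orientation, which is the motivation for regressing it instead.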

Given the rapid advancements in the HPE topic, this survey aims to trace the recent progress and recap these achievements to present a clear picture of existing research on deep learning techniques with advanced elements of the holistic system. Before diving deeper into HPE, it is essential to understand some related concepts:

Face detection is the process of locating and identifying faces of people within an image or video frame, as shown in Fig.  1 a. The main objective of facial detection is to identify the presence and location of faces within an image or video stream. It is a crucial step in various applications such as facial recognition and emotion analysis. Face detection techniques include traditional methods such as Haar cascades and more advanced approaches using deep learning models such as convolutional neural networks (CNNs). These methods help in identifying the bounding boxes (rectangles) around detected faces. See Minaee et al. ( 2021 ), Liu et al. ( 2023 ) for more details.

Facial landmark detection (Lee et al. 2020 ) is the process of identifying specific points (landmarks) on a person’s face, such as the corners of the eyes and mouth and the tip of the nose, as shown in Fig.  1 b. Facial landmark detection provides precise information about the locations of key facial features. These landmarks are often used as a foundation for various facial analysis tasks, including face alignment and emotion recognition (Tomar et al. 2023 ). Like face detection, facial landmark detection can be performed using deep learning models, especially those designed for keypoint detection. These models are trained to predict the coordinates of facial landmarks.

Head detection is a broader task than face detection and involves identifying and localizing the entire head in an image or video frame (Kumar et al. 2019 ). This can include detecting both the face and the surrounding head region, as shown in Fig. 1 c.

HPE refers to the process of determining the position and orientation of a person’s head in three-dimensional (3D) space. It provides information regarding the direction in which people are looking, which can be valuable for understanding user engagement or gaze tracking in applications such as virtual reality, driver monitoring systems (DMSs), or HCI. Estimating the head pose typically involves identifying facial landmarks and using geometric and trigonometric calculations to determine the head’s orientation, as shown in Fig. 1d. HPE is the topic covered in this survey.
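As a deliberately simplified illustration of the trigonometric step, the in-plane rotation (roll) of the head can be approximated from two eye landmarks. The function name, the landmark format, and the sign convention (image y axis pointing down) are assumptions made for this sketch; practical landmark-based systems use many more points and also recover yaw and pitch.

```python
import math

def roll_from_eyes(left_eye, right_eye):
    """Approximate roll in degrees from two (x, y) eye positions in image coordinates.

    left_eye is the eye that appears on the left of the image. With the image
    y axis pointing down, a positive result means the right side of the eye
    line is lower in the image.
    """
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))

# Level eyes give zero roll; a diagonal eye line gives about 45 degrees.
level = roll_from_eyes((100.0, 120.0), (160.0, 120.0))
tilted = roll_from_eyes((100.0, 100.0), (160.0, 160.0))
print(level, tilted)
```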

figure 1

Example images of four different setups

1.1 Previous surveys and our contributions

Several related reviews and surveys on HPE have been previously conducted (Murphy-Chutorian and Trivedi 2008; Shao et al. 2020; Khan et al. 2021; Abate et al. 2022; Asperti and Filippini 2023). Murphy-Chutorian and Trivedi (2008) reviewed research on HPE based on traditional learning approaches conducted before 2009. Shao et al. (2020), Khan et al. (2021), and Abate et al. (2022) surveyed both traditional and deep learning approaches to HPE. In particular, Shao et al. (2020) summarized HPE methods, from classical to recent deep learning-based ones, encompassing nearly 60 papers up to 2019. Asperti and Filippini (2023) surveyed deep learning-based methods for HPE. However, that survey focuses on the general field of deep learning-based HPE and ignores the effect of the individual elements on the holistic system.

An online search and the collection of literature were performed using multiple search engines, including IEEE Xplore, Google Scholar, Science Direct, PubMed, Springer, ACM, and others. The search process included a recursive review of the cited references. The keywords used for the search included HPE, head pose estimation, head poses, head orientation, AI-based head pose estimation, deep learning HPE, real-time head pose estimation, HPE metrics, and HPE datasets. We included papers published in recent years that focused on AI-based methods, selecting those with high citations, available code, or those presenting novel methods or significant improvements in the field of HPE.

This survey covers recently published papers, given the rapid development of HPE over the past few years. Moreover, it offers a comprehensive discussion of how each element of the framework, together with its preprocessing, affects the subsequent steps and the overall system, addressing aspects neglected in prior surveys. As indicated by detailed analysis and existing experiments, the holistic system’s performance relies on each element. Consequently, a comprehensive review of these elements is essential for readers aspiring to construct a state-of-the-art HPE system from the ground up. This survey seeks to offer valuable insights for a comprehensive understanding of the broader context of end-to-end HPE and to facilitate a systematic exploration. The primary contributions can be summarized as follows:

This article comprehensively surveys the elements of HPE, encompassing over 210 papers published until 2024. We review and categorize the recent advancements of each element in detail so that readers can systematically understand them.

We survey the eleven reviewed elements from many aspects: choice of application or academic research to improve the performance, multi-task or single task, environment, dataset type, range angle, representation way, freedom degrees, techniques used, landmark method, rotation type, and metrics to evaluate the model as well as challenges (Table  3 ). Moreover, we point out the influence of each element on the system’s overall performance by highlighting the weaknesses and strengths of the different methods for each element.

We gather the current challenges for each element and its sub-categories, aiming to understand their history and status and then support future research from the point of view of the holistic framework (Table  1 ).

We present an overview of current applications of HPE, such as DMSs, surveillance, virtual try-on (VTON), augmented reality (AR), and healthcare.

We provide a comprehensive comparison of several publicly available datasets in tabular form under a summary of datasets of HPE and their annotations (Table  2 ).

1.2 Article organization

figure 2

Structure of this survey

This paper is divided into five main sections: introduction (Sect. 1), main steps of HPE frameworks (Sect. 2), datasets and ground-truth techniques (Sect. 3), discussion, challenges, and future directions (Sect. 4), and conclusions (Sect. 5). Figure 2 shows the structure of this survey. HPE frameworks (Sect. 2) comprise eleven main steps under four broader groups (application context, data handling and preparation, techniques and methodologies, and evaluation metrics), as shown in Fig. 2 (right side). Every step is divided into subcategories. The steps are the choice of application, environment, task type, dataset type, range angle, rotation representation, number of degrees of freedom (DoF), techniques used, landmark-based or landmark-free method, rotation type, and evaluation metrics. In datasets and ground-truth techniques (Sect. 3), the main characteristics of the available datasets are discussed, including the number of participants and their gender (i.e., female and male), angles (i.e., yaw, pitch, and roll) and range (i.e., full or narrow range), the environment in which the dataset was captured (i.e., indoor or outdoor), data type (i.e., two-dimensional (2D) image or depth), resolution, and ground-truth tools. Moreover, the discussion, challenges, and future research directions are presented briefly (Sect. 4). Finally, the article provides conclusions from this survey (Sect. 5).

2 Main steps of head pose estimation frameworks

In this section, we categorize the process for HPE into eleven major steps: application choice, multi-task or single task, environment, dataset type, range angle, representation way, freedom degrees, techniques used, landmark method, rotation type, and metrics to evaluate the model. These categories can be organized into four broader groups that reflect the logical sequence of steps and aspects involved in developing an HPE system: (1) Application context: this group includes the choice of application, the specific tasks, and the environment in which the system will operate. (2) Data handling and preparation: this group encompasses the type of dataset, the range of angles, the representation method, and the degrees of freedom involved. (3) Techniques and methodologies: this group includes the techniques used, the approach to landmark detection, and the type of rotation. (4) Evaluation metrics: this group covers the metrics used to evaluate the system’s performance. Moreover, we summarize the advantages and disadvantages of all these categories in Table 1.

figure 3

Taxonomy of steps order for head pose estimation

Organizing the diverse categorizations and process orders for HPE and unifying them under a single taxonomy presents both a challenge and an ambition. We explored the idea of employing a functional categorization, grouping each method based on the process order and its respective operating domain. This approach makes a clear distinction between different methods. Through this organization, we facilitate discussions on the progression of various techniques while also steering clear of ambiguities that may appear when these techniques are applied beyond their original functional boundaries. As shown in Fig. 3, our evolutionary taxonomy comprises eleven steps describing the conceptual approaches utilized for HPE, where the small gray circles numbered 1 through 4 refer to the groups, each with its respective subcategories.

2.1 Application context

This group includes the choice of application, the specific tasks, and the environment in which the system will operate.

2.1.1 Application

HPE has extensive applications, such as HCI (Liu et al. 2021a , b ; Madrigal and Lerasle 2020 ), DMS (Lu et al. 2023 ; Chai et al. 2023 ; Wang et al. 2022 ; Jha and Busso 2022 ; Hu et al. 2020 ), examinee monitoring system (EMS) (Chuang et al. 2017 ), VTON (Shao 2022 ), AR (Huang et al. 2012 ), gaming and entertainment (Kulshreshth and LaViola Jr 2013 ; Malek and Rossi 2021 ), audience analysis (Alghowinem et al. 2013 ), emotion recognition (Mellouk and Handouzi 2020 ), head fake detection (Becattini et al. 2023 ), health and medicine (Hammadi et al. 2022 ; Ritthipravat et al. 2024 ), security (Bisogni et al. 2024 ), surveillance (Rahmaniar et al. 2022 ), age estimation (Zhang and Bao 2022 ), human-robot interaction (HRI) and control (Mogahed and Ibrahim 2023 ; Wang and Li 2023 ; Edinger et al. 2023 ; Hwang et al. 2023 ), sports analysis (Kredel et al. 2017 ), attention span prediction (Singh et al. 2021 ; Xu and Teng 2020 ), and human behavior analysis (Liu et al. 2021 ; Baltrusaitis et al. 2018 ). Overall, HPE has diverse applications across industries and domains, enabling more sophisticated and natural interactions between humans and machines.

2.1.2 Task method

The task method can be approached based on single-task (i.e., HPE only) or multi-task (that is, HPE with face detection, gaze detection, facial landmark detection, or other tasks) methods, as shown in Fig.  4 .

figure 4

Single task It focuses on solving the HPE problem as a standalone task. Most recent landmark-free approaches have been designed to estimate head pose as a single task by employing deep-learning models to address the occlusion problem. For example, latent space regression (LSR) (Celestino et al. 2023 ), FSA-Net (Yang et al. 2019 ), 6DHPENet (Chen et al. 2022 ), and 6DoF-HPE (Algabri et al. 2024 ) are single task methods.

LSR (Celestino et al. 2023) is designed to estimate head poses under occlusions based on a multi-loss scheme and the ResNet-50 backbone. However, it requires substantial graphics processing unit (GPU) power, as it has over 23 million parameters, and it needs to generate its own occluded and unoccluded datasets using the same annotations. Yang et al. (2019) proposed FSA-Net, a single-task HPE model based on a feature aggregation and soft-stage regression architecture. Chen et al. (2022) proposed a single-task approach named 6DHPENet, which uses a 6D rotation representation with a multi-regression loss for fine-grained HPE. 6DoF-HPE (Algabri et al. 2024) is designed to estimate the head pose in real time using a RealSense D435 camera. This method reported results for full-range and narrow-range angles using various datasets.
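To make the idea of a 6D rotation representation concrete, the sketch below maps six unconstrained numbers to a valid rotation matrix with a Gram-Schmidt step, which is the construction commonly used with this representation. Whether 6DHPENet uses exactly this mapping is an assumption, and the function names are illustrative.

```python
import math

def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def _cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

def six_d_to_matrix(six_d):
    """Map a 6D vector (two raw 3D vectors) to a 3x3 rotation matrix.

    The first vector is normalized, the second is made orthogonal to it
    (Gram-Schmidt), and the third column is their cross product.
    """
    a1, a2 = six_d[:3], six_d[3:]
    b1 = _normalize(a1)
    proj = _dot(b1, a2)
    b2 = _normalize([a - proj * b for a, b in zip(a2, b1)])
    b3 = _cross(b1, b2)
    # b1, b2, b3 are the columns of the rotation matrix.
    return [[b1[i], b2[i], b3[i]] for i in range(3)]

# Any non-degenerate network output yields an orthonormal matrix.
R = six_d_to_matrix([1.0, 0.2, -0.1, 0.3, 2.0, 0.5])
for row in R:
    print(row)
```

Because the mapping is continuous, a network can regress the six values directly without the wrap-around discontinuities of Euler angles.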

The advantages of these methods with a single task are as follows: (1) Simplicity: single-task models are often easier to develop and train because they have a single objective, simplifying the training process. (2) Specificity: these models are designed for a specific task, making them well-suited for applications where HPE is the primary concern. However, these methods with a single task have the following disadvantages: (1) Lack of context: single-task models may not take advantage of additional information in the image or video that could improve the accuracy of pose estimation. (2) Limited use cases: they are suitable only for applications that require a single task.

Multiple tasks In contrast to single-task methods, multi-task networks are employed to estimate head pose together with other tasks, such as face detection, face alignment, emotion recognition, and gender and age estimation.

HyperFace (Ranjan et al. 2017) is a landmark-based approach for gender recognition, landmark localization, HPE, and face detection using a CNN. The advantage of HyperFace is that it employs the intermediate layers for fusion to boost the performance of the landmark localization task. The lower layers of the CNN contain local information, which becomes relatively invariant as the depth increases. Zhang et al. (2018) address the challenge of facial expression recognition (FER) under arbitrary head poses through a generative adversarial network (GAN) model using the geometry of the face. This method is a landmark-based end-to-end deep-learning model. Xia et al. (2022) presented an alignment, tracking, and pose network (ATPN), a multi-task neural network specifically developed for face tracking, face alignment, and HPE. ATPN improves face alignment by integrating a shortcut connection between deep and shallow layers, effectively utilizing structural facial information. Additionally, ATPN generates a heatmap from face alignment to boost the performance of HPE and provides attention cues for face tracking. Thai et al. (2022) introduced MHPNet, a lightweight multi-task model. This end-to-end deep model is designed to address both HPE and masked face classification problems simultaneously. The authors adjusted the narrow-range angles from [-99, 99] to [-93, 93] degrees based on their observation that the ground-truth Euler angles predominantly fall within this specific range. Consequently, the backbone generated a 62-dimensional distribution vector for each angle, representing 62 bins instead of the 66 bins used in HopeNet. Bafti et al. (2022) proposed an architecture for improving the performance of dense prediction tasks using a multi-task learning model named MBMT-Net. The proposed architecture is a multi-backbone (Mask R-CNN and ResNet-50), multi-task deep CNN that simultaneously estimates head poses and age. Malakshan et al. (2023) presented an approach that includes task-related components such as classification, representation alignment, head pose adversarial, and regression losses. These components work together to improve the accuracy of HPE for low-resolution faces. Basak et al. (2021) presented a methodology based on a semi-supervised approach for learning 3D head poses from synthetic data. This method also generates synthetic head pose data with a diverse range of variations in gender, race, and age. Fard et al. (2021) presented the active shape model network (ASMNet), a CNN designed for the multiple tasks of face alignment and pose estimation. ASMNet is engineered to be efficient and lightweight to improve the performance in detecting facial landmarks and estimating the pose of a human face. Chen et al. (2021) presented an end-to-end multi-task method called TRFH to estimate head poses and detect faces simultaneously. The authors adopted the DLA-34 deep network as the backbone. Wu et al. (2021) introduced the Synergy method, which employs 3D facial landmarks and a 3D morphable model (3DMM) to estimate the 3D face mesh, landmarks, texture, and head pose. The 3DMM has some advantages in tasks such as face analysis because the semantic meaning is well-known and it prevents possible tracking failures caused by the sudden emergence of face regions (Yu et al. 2018). Khan et al. (2021) presented a face segmentation and 3D HPE approach based on deep learning. This approach involves segmenting a face image into seven distinct classes. The proposed method uses a probabilistic classification method and creates probability maps (PMAPS) for HPE along with segmentation results to extract features from the CNNs and build a soft-max classifier for face parsing. Dapogny et al. (2020) introduced an approach that combines HPE and facial landmark alignment inside an attentional cascade. This cascade employs a geometry transfer network (GTN) to enhance landmark localization accuracy by integrating diverse annotations. Additionally, the authors introduced a doubly conditional fusion method to select relevant feature maps and regions based on both head pose and landmark estimates, creating a single deep network for these tasks. Valle et al. (2020) presented an architecture combining landmark-based face alignment and HPE, called a multi-task neural network (MNN), based on bottleneck residual blocks with a U-Net encoder-decoder and using four in-the-wild landmark-related datasets and one dataset acquired in laboratory conditions. Jha et al. (2023) proposed a framework that depends on CNNs to take the driver’s head pose and eye appearance as inputs, creating a fusion model for estimating probabilistic gaze maps of the driver. The data used for this study was obtained from the multimodal driver monitoring (MDM) corpus (Jha et al. 2021).
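The bin counts quoted for MHPNet and HopeNet can be reproduced with a small sketch of binned angle decoding. It assumes a bin width of 3 degrees, which is consistent with 66 bins over [-99, 99] and 62 bins over [-93, 93], and it decodes an angle as the expectation over softmax probabilities in the style commonly associated with HopeNet. The function names and the exact decoding rule are illustrative assumptions, not the authors' code.

```python
import math

def num_bins(lo_deg, hi_deg, width_deg=3):
    """Number of classification bins covering [lo_deg, hi_deg]."""
    return (hi_deg - lo_deg) // width_deg

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode_angle(logits, lo_deg, width_deg=3):
    """Turn per-bin logits into a continuous angle in degrees.

    The expected bin index under the softmax distribution is scaled by the
    bin width and shifted by the lower bound of the range.
    """
    probs = softmax(logits)
    expected_index = sum(i * p for i, p in enumerate(probs))
    return expected_index * width_deg + lo_deg

print(num_bins(-99, 99), num_bins(-93, 93))

# A distribution peaked on the middle bin of the 62-bin range decodes to 0 degrees.
logits = [-1e9] * num_bins(-93, 93)
logits[31] = 0.0
print(decode_angle(logits, lo_deg=-93))
```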

Multi-task methods aim to collectively address HPE alongside other correlated tasks to improve overall performance. The advantages of these methods with multiple tasks are as follows: (1) Information fusion: multi-task models can take advantage of additional information, such as facial landmarks, which can improve HPE accuracy. (2) Resource efficiency: by sharing features across tasks, multi-task models can be more resource-efficient than training separate models for each task. However, these methods have the following disadvantages: (1) Task interference: tasks can potentially interfere with each other, and optimizing one task can negatively impact the performance of another. (2) Increased complexity: multi-task models are more complex than single-task models and may require additional data and training, which makes them more challenging to develop and more computationally intensive.
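One common way to train such a model is to minimize a weighted sum of per-task losses, which is also where task interference shows up, since the weights trade one task against another. The sketch below assumes two tasks (pose regression and landmark regression); the loss choices, the weights, and the function names are hypothetical and not taken from any of the cited methods.

```python
def mean_abs_error(pred, target):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def mean_sq_error(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def multi_task_loss(pose_pred, pose_gt, lmk_pred, lmk_gt, w_pose=1.0, w_lmk=0.5):
    """Weighted sum of a pose loss and a landmark loss (weights are assumptions)."""
    pose_loss = mean_abs_error(pose_pred, pose_gt)
    lmk_loss = mean_sq_error(lmk_pred, lmk_gt)
    return w_pose * pose_loss + w_lmk * lmk_loss

# Pose in degrees (yaw, pitch, roll), landmarks as normalized coordinates.
loss = multi_task_loss([10.0, 0.0, 0.0], [7.0, 0.0, 0.0], [0.5, 0.5], [0.5, 0.7])
print(loss)
```

Raising `w_lmk` pushes the shared features toward landmark accuracy, possibly at the expense of pose accuracy, which is the trade-off described above.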

The choice between single-task and multi-task HPE depends on the specific requirements of the application. If HPE is the primary focus, a single-task model might suffice. However, if the application needs additional information or resource usage optimization, a multi-task approach could be more suitable. Ultimately, the choice should be driven by a project’s specific needs and trade-offs.

2.1.3 Environment

HPE systems are designed to determine the orientation or position of an individual’s head in a given environment. These systems find applications in various fields, as mentioned in Sect.  2.1.1 . The challenges and requirements for HPE can differ between indoor and outdoor environments.

Indoor environments Indoor environments, such as museums (Zhao et al. 2024 ), laboratories, and offices, which are characterized by controlled settings and predictable conditions, serve as compelling spaces for various technological applications. For example, Algabri et al. ( 2024 ) propose a novel real-time HPE framework using red, green, blue, and depth (RGB-D) data and deep learning without relying on facial landmark localization in an indoor environment. Bafti et al. ( 2022 ) introduced a multi-backbone architecture for simultaneously estimating head poses and age. Celestino et al. ( 2023 ) introduced a deep learning-based methodology to autonomously assist in feeding disabled people with a robotic arm. Song et al. ( 2023 ) proposed an algorithm that combines HPE and human pose tracking with automatic name detection for autistic children. The authors claimed that the experimental results demonstrated high consistency between the proposed approach and clinical diagnosis.

The advantages of these approaches in indoor environments are as follows: (1) Controlled lighting: Controlled and consistent lighting conditions indoors facilitate more reliable and accurate HPE. (2) Structured backgrounds: indoor environments often have less complex and more structured backgrounds, making it easier to distinguish the head from the surroundings and track the head. (3) Calibration: camera calibration is generally more straightforward indoors, allowing for accurate geometric transformations. However, these methods have the following disadvantages in indoor environments: (1) Limited realism: indoor environments may not completely capture the diversity of real scenarios, limiting the realism of the training dataset. (2) Restricted applications: indoor HPE may not directly apply to outdoor scenarios, limiting the system’s versatility.

Outdoor environments Roth and Gavrila (2023) introduced a technique named IntrApose for an in-car application using a BBox camera. Fard et al. (2021) proposed a lightweight CNN for face pose estimation and facial landmark point detection in the wild. Bisogni et al. (2021) proposed a fractal-based technique called FASHE. For HPE in the wild, this method uses the Hamming distance to find the closest match between the fractal code of the target image and a reference array.
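The matching step described for FASHE can be sketched as a nearest-neighbor search under the Hamming distance. The binary codes, the reference poses, and the function names below are invented for illustration; the fractal encoding that would produce real codes is not shown.

```python
def hamming(a, b):
    """Number of positions at which two equal-length codes differ."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return sum(x != y for x, y in zip(a, b))

def closest_pose(target_code, reference):
    """Return the pose whose reference code is nearest to target_code.

    reference is a list of (pose, code) pairs, where pose is a
    (yaw, pitch, roll) tuple in degrees.
    """
    best_pose, _ = min(reference, key=lambda item: hamming(target_code, item[1]))
    return best_pose

# Hypothetical reference array with three known poses.
reference = [
    ((0, 0, 0), [0, 0, 0, 0, 1, 1]),
    ((30, 0, 0), [1, 1, 0, 0, 1, 0]),
    ((-30, 10, 0), [0, 1, 1, 1, 0, 0]),
]
estimate = closest_pose([1, 1, 0, 1, 1, 0], reference)
print(estimate)
```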

The advantages of these methods in outdoor environments are as follows: (1) Real-world scenarios: outdoor HPE directly applies to real-world scenarios, such as surveillance in public spaces or AR experiences. (2) Diverse backgrounds: outdoor scenes provide more diverse and dynamic backgrounds, challenging the system to adapt to various environments. (3) Adaptability: systems designed for outdoor use are often more adaptable to lighting conditions and scenarios. (4) Practical applications: outdoor HPE is crucial for applications such as AR navigation, HCI in public spaces, and crowd monitoring. However, these methods have the following disadvantages in outdoor environments: (1) Uncontrolled lighting: outdoor environments experience variable and sometimes unpredictable lighting conditions, thus requiring robust algorithms to handle changes in illumination. (2) Background complexity: outdoor scenes may have complex and dynamic backgrounds, posing challenges in accurately isolating and tracking the head. (3) Camera calibration challenges: calibrating cameras in outdoor environments is more challenging owing to the absence of fixed calibration patterns and larger distances.

Both environments Other studies have addressed both indoor and outdoor environments. For example, Berral-Soler et al. (2021) proposed a single-stage model called RealHePoNet, a landmark-free method that works in indoor and outdoor environments based on single-channel (i.e., grayscale) images.

Common considerations for both environments (1) Robustness to occlusions: HPE systems should be robust to partial occlusions of the head. (2) Adaptability to different head movements: systems need to accommodate a wide range of head movements, including rotations, tilts, and nods. (3) Integration with other systems: HPE is often part of a broader system, and integration with other components, such as gaze tracking or facial expression analysis, may be important. (4) Privacy concerns: in both indoor and outdoor scenarios, privacy concerns should be considered, and systems should adhere to ethical guidelines.

Developers of HPE systems need to carefully consider these factors based on the specific requirements and challenges posed by the target environment, whether indoor or outdoor.

2.2 Data handling and preparation

2.2.1 Dataset type

Selecting the dataset type is the most important step for HPE frameworks. RGB images and RGB-D images are the two main types of data used in HPE. These data are sometimes videos, either RGB video or depth video. A video is a sequence of images displayed in rapid succession to create the illusion of motion. The survey by Abate et al. (2022) places its main emphasis on the category of dataset types. RGB images provide surface appearance and texture details of the head but lack depth information. HPE using RGB images relies on analyzing facial features and their relative positions to estimate head orientation. RGB images are more appropriate for applications that do not require 6DoF and that operate in almost uniform lighting environments.

RGB-D data consist of red, green, and blue (RGB) color images along with depth information for each pixel in the image. This additional depth channel provides information about the distance of each pixel from the camera. When used for HPE, RGB-D images can offer more accurate and robust results, as they enable the system to account for the 3D structure of the head, making it less susceptible to lighting variations and occlusions. RGB-D images can be better for applications that require 6DoF.
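The way a depth channel yields 3D structure can be sketched with the standard pinhole camera model, which back-projects a pixel and its depth into camera coordinates. The intrinsic parameters (`fx`, `fy`, `cx`, `cy`) and the pixel values below are hypothetical; a real sensor provides its own calibrated intrinsics.

```python
def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Convert pixel (u, v) with depth in meters to a 3D point (x, y, z).

    fx and fy are focal lengths in pixels, and (cx, cy) is the principal point.
    The returned point is expressed in the camera coordinate frame.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)

# Hypothetical intrinsics for a 640x480 depth image.
FX, FY, CX, CY = 600.0, 600.0, 320.0, 240.0

# A pixel at the principal point lies on the optical axis.
center = backproject(320.0, 240.0, 1.0, FX, FY, CX, CY)

# A pixel 300 columns to the right, seen at 2 m, lies 1 m to the side.
side = backproject(620.0, 240.0, 2.0, FX, FY, CX, CY)
print(center, side)
```

Applying this to a detected head location gives the translational components (x, y, and z) of a 6DoF pose.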

RGB images These images, whether color or grayscale, represent visual data in two dimensions: width and height. Most HPE methods use RGB images because they are plentiful. 2DHeadPose (Wang et al. 2023), HeadPosr (Dhingra 2021), and other methods use RGB images. Dhingra (2021) proposed HeadPosr, an end-to-end trainable model that estimates head poses from a single RGB image based on transformer encoders. Hu et al. (2021) presented a Bernoulli heatmap method to create CNNs without fully connected layers from a single RGB image for HPE. Liu et al. (2021) presented an approach that eliminates the need for training with head pose labels. Instead, it relies on matches between the single 2D input RGB image and a reconstructed 3D face model for HPE. It uses CNNs that jointly optimize asymmetric Euclidean and keypoint losses. This method comprises two main components: 3D face reconstruction and 3D-2D keypoint matching. The authors conducted a comparative analysis of their method against those utilizing different data types, asserting that their approach demonstrated superior performance.

The primary benefit of employing the HPE from RGB images as the ground truth in numerous HPE datasets is that a large dataset can be amassed with a relatively straightforward and cost-effective process without requiring special setups. Conversely, the most significant disadvantage is that landmark-free HPE methods may be trained and evaluated using an imperfect ground truth derived from head poses estimated from RGB images (Li et al. 2022 ). Moreover, this data category relies on active sensing, and its application outdoors and in brightly lit environments poses challenges owing to the potential exposure of active lighting (Shao et al. 2020 ). Although RGB images are widely employed and understood in computer vision and image processing, these images are weak under severe illumination conditions and are inaccurate for 6DoF systems.

RGB-D images RGB-D images provide information about the distance of the corresponding point in a 3D scene from a camera or sensor (Du et al. 2021). The 6DoF-HPE (Algabri et al. 2024) and Face-from-Depth (Borghi et al. 2018) methods use RGB-D images with depth (D) information, point clouds, or both. For example, 6DoF-HPE (Algabri et al. 2024) used depth data to estimate the translational components (x, y, and z) and RGB data to estimate the rotational components (yaw, pitch, and roll) relative to the camera pose in real time. However, it required more GPU power because it used deep learning models, and the mean absolute error (MAE) was still above 5% for the full range dataset. Borghi et al. (2018) employed a deterministic conditional GAN model to transform RGB-D images into grayscale images. Consequently, the authors introduced a complete end-to-end model called Face-from-Depth based on the \(POSEidon^{+}\) network, designed for tracking driver body posture, with a primary focus on estimating head and shoulder poses based on gestures captured in depth images only. Ma et al. (2021) presented an end-to-end framework based on the PointNet network and DRF to estimate the head pose from a single depth image. Luo et al. (2019) presented a system based on an iterative closest point (ICP) algorithm that can estimate head pose and generate a realistic face model in real time from a single depth image. The system used a deformable face model aligned to the RGB-D image to generate the face model. Hu et al. (2021) adopted a method to extract discriminative head pose information, leveraging temporal information across frames without handcrafted features using point cloud data. Xu et al. (2022) proposed a network architecture that takes a 3D point cloud generated from depth as input rather than RGB or RGB-D images. Chen et al. (2023) introduced a technique for HPE that leverages asymmetry-aware bilinear pooling on RGB-D feature pairs. This approach aims to capture asymmetry patterns and multi-modal interactions within head pose-related regions. This bilinear pooling approach is effective for merging multi-modal information to support various tasks. Nevertheless, the high memory requirements of bilinear pooling can pose limitations, particularly on devices with low computational power, as highlighted in López-Sánchez et al. (2020). Importantly, the performance of the proposed method degrades notably when dealing with large pose angles where crucial facial features may become occluded. Wang et al. (2023a) developed a complete HDPNet pipeline utilizing RGB-D images for head detection and HPE in complicated environments. The proposed method is similar to the HopeNet architecture in terms of HPE.

In contrast to RGB images, depth data maps may exhibit reduced texture detail (Wang et al. 2023b ). Depth data offers a solution to mitigate certain limitations of RGB data, such as issues related to illumination, while also providing more dependable facial landmark detection (Drouard et al. 2017 ). However, RGB-D images require special sensors, leading to a high-cost process.

Other datasets Liu et al. (2022) presented a Gaussian mixed distribution learning (GMDL) model for HPE to understand student attention using infrared (IR) data. However, the authors reported results for only two angles (yaw and pitch). Another study (Kim et al. 2023) proposed a real-time DMS based on HPE and facial landmark-based eye closure detection to monitor driver behavior using IR data.

In summary, RGB images are based solely on color and texture, while RGB-D images incorporate depth information, enhancing the precision and reliability of HPE algorithms. Given the importance of this step, in Sect.  3 we describe in detail 24 public datasets published after 2010, namely, 2DHeadPose (Wang et al. 2023 ), Dad-3dheads (Martyniuk et al. 2022 ), avatars in geography optimized for regression analysis (AGORA) (Patel et al. 2021 ), MDM corpus (Jha et al. 2021 ), ETH-XGaze (Zhang et al. 2020 ), GOTCHA-I (Barra et al. 2020 ), DD-Pose (Roth and Gavrila 2019 ), VGGFace2 (Cao et al. 2018 ), WFLW (Wu et al. 2018 ), SynHead (Gu et al. 2017 ), DriveAHead (Schwarz et al. 2017 ), SASE (Lüsi et al. 2017 ), Pandora (Borghi et al. 2017 ), 300W across Large Pose (300W-LP)  (Zhu et al. 2016 ), AFLW2000 (Zhu et al. 2016 ), CCNU (Liu et al. 2016 ), UPNA (Ariz et al. 2016 ), WIDER FACE (Yang et al. 2016 ), Carnegie Mellon university (CMU) Panoptic (Joo et al. 2015 ), Dali3DHP (Tulyakov et al. 2014 ), EYEDIAP (Funes Mora et al. 2014 ), McGill (Demirkus et al. 2014 ), Biwi (Fanelli et al. 2013 ), ICT-3DHP (Baltrušaitis et al. 2012 ), annotated faces in the wild (AFW) (Zhu and Ramanan 2012 ) and NIMH-ChEFS (Egger et al. 2011 ).

2.2.2 Range angle method

The choice between narrow-range and full-range angles for HPE relies on the specific requirements and constraints of the application. Figure  5 shows the range of narrow and full angles.

Fig. 5 Different range angles

Narrow-range angle It typically refers to estimating the head pose within a limited field of view, such as a smaller range of yaw, pitch, and roll angles. For example, narrow-range angle methods might focus on head poses that are primarily within a \(\pm 90^{\circ }\) range around the frontal view (Figs. 5 and 6). The THESL-Net (Zhu et al. 2022), AGCNNs (Ju et al. 2022), DADL (Zhao et al. 2024), HPNet-RF (Thai et al. 2023), DSFNet (Li et al. 2023), and DS-HPE (Menan et al. 2023) methods use narrow-range angles.

Dhingra (2022) proposed LwPosr, a lightweight network that combines transformer encoder layers and depthwise separable convolution (DSC). Organized in three stages and two streams, these layers collectively aim to deliver precise regression for HPE. According to Li et al. (2022), landmark-free techniques fail to address perspective distortion in facial images, which arises from the misalignment of the face with the camera’s coordinate system. To mitigate this problem and enhance the accuracy of HPE using a lightweight network, the authors introduced an image rectification approach within narrow-range head pose angles. Liu et al. (2022) proposed a method that estimates only two narrow-range angles (yaw and pitch) for attention understanding in learning and instruction scenarios. Huang et al. (2020) used narrow-range datasets with a framework similar to HopeNet, combined with average top-k regression. Although HeadFusion (Yu et al. 2018) was designed to track \(360^{\circ }\) head poses, the angles in the data used ranged from \(0^{\circ }\) to \(80^{\circ }\).

Fig. 6 Narrow-range angles (Biwi dataset)

Wang et al. (2022) presented a method that includes a regional information exchange fusion network and a four-branch feature selective extraction network (FSEN). The proposed approach aims to address the challenges posed by complex environments, such as occlusions, lighting variations, and cluttered backgrounds. The four-branch FSEN uses three branches to extract independent discriminative features for the three pose angles and a fourth branch to extract features corresponding to multiple pose angles from the input images. The regional information exchange fusion network is then used to fuse the extracted features to estimate the head pose.

The advantages of these methods with narrow-range angles are as follows: (1) Reduced computational complexity: narrow-range angles, often limited to a specific range (e.g., \(-45^{\circ }\) to \(45^{\circ }\)), can simplify the computational process, making it faster and more efficient. (2) Simplified classification: in applications like facial expression analysis or gaze tracking, focusing on a narrow range of angles can simplify classification tasks, as it reduces the number of possible pose categories. (3) Improved accuracy: narrow-range angles can lead to improved estimation accuracy because the model can focus on a specific subset of possible poses, reducing ambiguity (Guo et al. 2020). (4) Reduced noise: by excluding extreme angles, narrow-range approaches may be less sensitive to noise or outliers in the input data. However, these methods have the following disadvantages: (1) Limited coverage: narrow-range angles may not provide a complete representation of head pose, limiting the applicability of the system in scenarios where a wider range of poses needs to be detected (Zhou and Gregson 2020), such as security cameras (Viet et al. 2021). (2) Loss of information: by discarding angles outside the narrow range, valuable information about head orientation may be lost, which can be crucial in some applications (Zhou and Gregson 2020).

Full-range angle It involves estimating the head pose across a wider range of yaw, pitch, and roll angles. This approach aims to cover all possible head orientations or one head orientation at least (the frontal and back views, see Figs.  5 and  7 ).

Few studies in the field of HPE have focused on predicting head poses across the full range of angles, because full-range views lack the rich visual features of the narrow range, which focuses on frontal or large-angle faces and yields satisfactory performance in most scenarios. Zhou and Gregson (2020) introduced WHENet, the first HPE approach to encompass the full range of head angles by combining the 300W-LP and CMU Panoptic datasets. WHENet is an extension of HopeNet (Ruiz et al. 2018) that expands the number of bins for yaw prediction within the full range using EfficientNet as a backbone. Zhou et al. (2023) proposed DirectMHP, a one-stage network architecture based on YOLOv5 that is trained end-to-end and predicts full-range angle head poses. However, DirectMHP and WHENet face challenges related to the gimbal lock problem owing to their use of Euler angles for HPE, as mentioned in Zhou et al. (2023). Viet et al. (2021) introduced the multitask-net model to estimate full-range angle head poses. To improve HPE, the authors replaced Euler angles with vectors of the rotation matrix as a representation of the human face. Hempel et al. (2024) extended their previous method (Hempel et al. 2022) to cover full-range angles using the CMU Panoptic dataset. The limitation of this work is that robustness and accuracy may decrease in application scenarios with unusual head poses and camera angles. Viet et al. (2021) developed FSANet-Wide with a full-range angle dataset called UET-Headpose. However, this dataset is not available.
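The bin-based yaw prediction mentioned above can be illustrated with a small sketch. It assumes a classification head whose logits are decoded by taking the expectation over bin centres, in the spirit of bin-based HPE; the 3-degree bin width, the two layouts, and all function names are hypothetical choices for illustration and are not taken from WHENet or HopeNet.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def angle_to_bin(angle_deg, min_deg, bin_width_deg, n_bins):
    """Index of the bin containing angle_deg, clamped to the covered range."""
    index = int((angle_deg - min_deg) // bin_width_deg)
    return max(0, min(n_bins - 1, index))


def expected_angle(logits, min_deg, bin_width_deg):
    """Expectation of the bin centres under the softmax distribution."""
    probs = softmax(logits)
    centres = [min_deg + (i + 0.5) * bin_width_deg for i in range(len(logits))]
    return sum(p * c for p, c in zip(probs, centres))


# Hypothetical layouts (assumptions, not from any cited paper): 3-degree bins.
NARROW = {"min_deg": -90.0, "bin_width_deg": 3.0, "n_bins": 60}   # [-90, 90)
FULL = {"min_deg": -180.0, "bin_width_deg": 3.0, "n_bins": 120}   # [-180, 180)


def decode_peaked(angle_deg, layout, peak=20.0):
    """Simulate a confident classifier for angle_deg and decode it back."""
    target = angle_to_bin(angle_deg, **layout)
    logits = [peak if i == target else 0.0 for i in range(layout["n_bins"])]
    return expected_angle(logits, layout["min_deg"], layout["bin_width_deg"])
```

Under these assumptions, a yaw of 135 degrees is recovered (up to bin resolution) by the full-range layout but saturates at the last bin of the narrow-range layout, which is one way to see why covering back views requires more yaw bins.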

Fig. 7 Full-range angles (CMU dataset)

The advantages of these methods with full-range angles are as follows: (1) Comprehensive pose estimation: full-range angles offer a more comprehensive representation of head pose, allowing the system to estimate head orientation across a wide spectrum of positions (Zhou and Gregson 2020). (2) Versatility: full-range approaches are suitable for applications that require detecting head poses in diverse scenarios, including extreme angles (Zhou and Gregson 2020). (3) Flexibility: a full-range system can handle variations in head pose that may occur in unconstrained environments, making it adaptable to different real-world situations (Zhou et al. 2023). However, these methods have the following disadvantages: (1) Increased computational complexity: estimating full-range angles can be more computationally demanding, especially when dealing with a wide range of pose possibilities. (2) Potentially lower accuracy or complex classification: the larger number of possible pose categories can complicate classification tasks, potentially leading to reduced accuracy due to increased ambiguities in the estimation. (3) Greater sensitivity to noise: full-range angle estimation may be more sensitive to noise, outliers, or inaccuracies in the input data (Zhou et al. 2023).

In summary, the choice between narrow-range and full-range angles for HPE should consider the specific application requirements, computational constraints, and trade-off between accuracy and coverage. Narrow-range angles may be suitable for scenarios where simplicity, speed, and specific pose categories are prioritized, whereas full-range angles offer a more comprehensive solution for applications demanding versatility and adaptability across a wide range of head poses.

2.2.3 Rotation representation method

HPE requires a rotation representation to estimate the orientation of the head. Different representations offer distinct advantages and disadvantages in capturing complex head movements, and the choice affects HPE accuracy and robustness. Kim and Kim (2023) conducted a thorough examination of rotation representations frequently employed in industry and academia, including rotation matrices, Euler angles, rotation axis angles, unit complex numbers, and unit quaternions, and elucidated rotations in both 2D and 3D spaces. Common representations include Euler angles, quaternions, and rotation matrices, as shown in Fig. 8.

Fig. 8 Different rotation representations

Euler angles They represent rotations using a set of three angles, typically denoted as ( \(\alpha , \beta , \gamma\) ), which describe rotations around the X, Y, and Z axes, respectively. The order and direction of rotations can vary, and several different conventions exist, such as XYZ, XZY, and YXZ. Euler angles are a widely used representation for describing HPE. They offer an intuitive way to understand how the head is oriented in 3D space. Euler angles decompose rotations into three sequential angles around fixed axes (yaw, pitch, and roll). The DirectMHP (Zhou et al. 2023 ), 2DHeadPose (Wang et al. 2023 ), OsGG-Net (Mo and Miao 2021 ), Hopenet (Ruiz et al. 2018 ), WHENet (Zhou and Gregson 2020 ), FSA-Net (Yang et al. 2019 ), LSR (Celestino et al. 2023 ), HeadDiff (Wang et al. 2024 ), and HHP-Net (Cantarini et al. 2022 ) methods use the Euler angle representation.
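As a minimal sketch of how three Euler angles determine an orientation, the following code composes elementary rotations about the X, Y, and Z axes. The composition order \(R = R_z(\gamma ) R_y(\beta ) R_x(\alpha )\) is an assumption made here for illustration; as noted above, several conventions exist, and the cited methods may use a different one. All function names are hypothetical.

```python
import math


def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rot_x(alpha):
    c, s = math.cos(alpha), math.sin(alpha)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def rot_y(beta):
    c, s = math.cos(beta), math.sin(beta)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def rot_z(gamma):
    c, s = math.cos(gamma), math.sin(gamma)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def euler_to_matrix(alpha, beta, gamma):
    """Assumed convention: rotate about X, then Y, then Z (fixed axes)."""
    return matmul(rot_z(gamma), matmul(rot_y(beta), rot_x(alpha)))
```

Swapping the order of two elementary rotations, for example `matmul(rot_x(a), rot_y(b))` versus `matmul(rot_y(b), rot_x(a))`, generally yields different matrices, which is why the convention must be fixed before angles from different methods or datasets are compared.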

Barra et al. ( 2022 ) presented a method based on a previously partitioned iterated function system (PIFS) using gradient boosting regression for HPE by Euler angles. The proposed method aims to decrease the computational cost while maintaining an acceptable accuracy. In Kuhnke and Ostermann ( 2023 ), the authors introduced a semi-supervised learning approach called relative pose consistency. This method utilized Euler angles to represent the HPE.

The advantages of these methods with Euler angles are as follows: 1) Intuitive interpretation: Euler angles provide a straightforward and intuitive understanding of head orientation by representing rotations around distinct axes (Toso et al. 2015). 2) Compact: Euler angles only require storing three numbers to represent a rotation (Bernardes and Viollet 2022). This makes them memory-efficient compared to other representations like quaternions or rotation matrices. 3) Suitable for a single degree of freedom: in cases with only one degree of freedom, Euler angles can be sufficient to represent the rotation (Bernardes and Viollet 2022). This makes them a practical choice for problems with limited rotational complexity. 4) Compatibility with legacy systems: Euler angles have been used extensively for many years, and their continued prevalence in legacy systems and algorithms ensures compatibility with a substantial body of existing work. 5) Visualization: Euler angles are easy to visualize, as they describe rotations in a fixed reference frame. However, these methods have the following disadvantages: 1) Gimbal lock: in certain orientations, Euler angles can encounter a gimbal lock, causing a loss of one degree of freedom and inaccuracies in tracking (Liu et al. 2021). Near a gimbal lock, a small change in orientation can require a large, sudden change in the Euler angles, causing a discontinuity in the rotation representation. 2) Sequence dependency: the order of rotations significantly affects the final orientation, leading to complexities in calculations and potential confusion (Hsu et al. 2018). 3) Limited range: Euler angles might exhibit limitations when dealing with complex or extreme rotations, reducing their suitability for certain applications. 4) Discontinuity: this representation is discontinuous, which makes it difficult for neural networks to learn (Zhou et al. 2019). 5) Difficulty in vector transformation: Euler angles lack a simple algorithm for vector transformation (Janota et al. 2015). This makes them less suitable for applications requiring frequent vector transformations. 6) Singularities: Euler angles suffer from singularities, where two or more sets of angles can produce the same orientation. This can lead to ambiguity in the representation of orientation and problems with numerical stability (Hsu et al. 2018).
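The gimbal lock and singularity issues listed above can be reproduced numerically. The sketch below assumes the convention \(R = R_z(\gamma ) R_y(\beta ) R_x(\alpha )\) (an illustrative choice, not necessarily that of any cited method) and uses the corresponding closed-form matrix. When the middle angle \(\beta\) equals \(90^{\circ }\), the matrix depends only on the difference \(\alpha - \gamma\), so distinct angle triples collapse onto the same orientation.

```python
import math


def euler_to_matrix(alpha, beta, gamma):
    """Closed form of Rz(gamma) @ Ry(beta) @ Rx(alpha); convention is assumed."""
    ca, sa = math.cos(alpha), math.sin(alpha)
    cb, sb = math.cos(beta), math.sin(beta)
    cg, sg = math.cos(gamma), math.sin(gamma)
    return [
        [cg * cb, cg * sb * sa - sg * ca, cg * sb * ca + sg * sa],
        [sg * cb, sg * sb * sa + cg * ca, sg * sb * ca - cg * sa],
        [-sb, cb * sa, cb * ca],
    ]


def max_abs_diff(a, b):
    """Largest element-wise difference between two 3x3 matrices."""
    return max(abs(a[i][j] - b[i][j]) for i in range(3) for j in range(3))


def deg(*angles):
    """Convert a tuple of angles from degrees to radians."""
    return tuple(math.radians(v) for v in angles)


# Two different triples with the same alpha - gamma = 20 degrees.
locked_a = euler_to_matrix(*deg(30.0, 90.0, 10.0))
locked_b = euler_to_matrix(*deg(50.0, 90.0, 30.0))

# The same (alpha, gamma) pairs away from the singular middle angle.
free_a = euler_to_matrix(*deg(30.0, 45.0, 10.0))
free_b = euler_to_matrix(*deg(50.0, 45.0, 30.0))
```

Here `max_abs_diff(locked_a, locked_b)` is zero up to rounding, whereas `max_abs_diff(free_a, free_b)` is clearly nonzero: at the singular configuration one degree of freedom is lost, and a regression target expressed in Euler angles becomes ambiguous.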

Euler angles provide a straightforward interpretation of head pose rotations, making them easy to understand and implement. However, their susceptibility to gimbal lock and potential complexity in certain scenarios might lead to considering alternative representations for more accurate and robust HPE in applications requiring precise orientation analysis.

Quaternions They are a mathematical extension of complex numbers and are used to represent rotations in 3D space. A quaternion is represented as q = [w, x, y, z], where w is the scalar part and (x, y, z) is the vector part. Quaternions have gained prominence as an alternative representation for conveying head pose rotations across various applications. These four-dimensional (4D) mathematical constructs provide an efficient means of representing orientation, avoiding the gimbal lock issue associated with Euler angles. Quaternions offer seamless interpolation and are well-suited for tasks involving smooth motion tracking. Zhu et al. (2017) extended their previous 3D dense face alignment (3DDFA) study (Zhu et al. 2016), in which ambiguity was a major limitation when the yaw angle reached \(90^{\circ }\), by replacing Euler angles with quaternions to eliminate the ambiguity and improve the performance of face alignment in large poses. Hsu et al. (2018) proposed a landmark-free method called QuatNet and conducted a study of HPE using quaternions to mitigate the ambiguity of the Euler angle representation. The study used a CNN with a multi-regression loss approach that combines ordinal regression and L2 regression losses to train a dedicated CNN using RGB images. The ordinal regression loss addresses changing facial features with varying head angles, enhancing feature robustness. The L2 regression loss then utilizes these features to achieve accurate angle predictions. As an advantage of this method, QuatNet shows resilience to nonstationary cases of head poses (Nejkovic et al. 2022). Zeng et al. (2022) introduced an approach called structural relation-aware network (SRNet) for HPE by transforming the problem into the quaternion representation space. The proposed approach explicitly investigates the correlation among various face regions to extract global facial structural information. In Höffken et al. (2014), the authors proposed a tracking module to estimate the head poses based on an extended Kalman filter (EKF) using the quaternion space.

The advantages of these methods with quaternions are as follows: (1) No gimbal lock: quaternions avoid gimbal lock, a limitation faced by Euler angles, ensuring an accurate representation of orientation (Hsu et al. 2018). (2) Smooth interpolation: they enable smooth interpolation (spherical linear interpolation, slerp) between orientations, resulting in seamless animations and tracking transitions (Hsu et al. 2018; Peretroukhin et al. 2020). By contrast, interpolating rotation matrices involves more complex calculations, including matrix multiplication and normalization. (3) Efficient computations: quaternion operations, such as conjugation, normalization, and multiplication, are computationally efficient using simple formulas, making them appropriate for real-time applications like gaming and robotics (Dantam 2021). (4) Normalization: quaternions can be normalized trivially, which is much more efficient than having to cope with the corresponding matrix orthogonalization problem (Bernardes and Viollet 2022). (5) Compact representation: quaternions are more compact than rotation matrices. While a quaternion comprises merely four components (a scalar and a three-component vector), a 3D rotation matrix requires nine components. This compactness contributes to the memory efficiency of quaternions and accelerates computation. (6) Numerical stability: quaternions are more numerically stable than matrices, particularly when combining multiple rotations or handling small rotations. Quaternion operations, such as normalization and multiplication, involve fewer floating-point operations and are less prone to accumulated numerical errors. However, these methods have the following disadvantages: (1) Complexity: quaternions are not as intuitive as Euler angles, making it challenging for non-experts, e.g., beginners in robotics and computer vision, to grasp their meaning (Toso et al. 2015). This can make them less suitable for applications where simplicity and ease of use are important. (2) Storage: quaternions consist of four parameters, requiring more memory and computation compared to Euler angles’ three parameters (Holzinger and Gerstmayr 2021). This can be a disadvantage in applications with limited memory resources. (3) Antipodal problem: quaternions have an antipodal problem, where two quaternions with opposite signs (q and -q) represent the same rotation (Roth and Gavrila 2023). This can lead to ambiguity in certain calculations and applications. (4) Discontinuity: the quaternion representation is discontinuous, and neural networks find it challenging to learn when faced with discontinuities (Zhou et al. 2019).
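The interpolation and antipodal properties listed above can be made concrete with a short sketch. It assumes unit quaternions stored as [w, x, y, z], as in the definition given earlier; the function names and the near-parallel threshold are illustrative choices rather than part of any cited method.

```python
import math


def normalize(q):
    """Scale a quaternion [w, x, y, z] to unit length."""
    norm = math.sqrt(sum(c * c for c in q))
    return [c / norm for c in q]


def quat_to_matrix(q):
    """Rotation matrix of a quaternion; q and -q give the same matrix."""
    w, x, y, z = normalize(q)
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def slerp(q0, q1, t):
    """Spherical linear interpolation along the shorter arc, t in [0, 1]."""
    q0, q1 = normalize(q0), normalize(q1)
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:
        # Antipodal handling: flip one quaternion so the shorter arc is used.
        q1 = [-c for c in q1]
        dot = -dot
    if dot > 0.9995:
        # Nearly identical orientations: fall back to normalized lerp.
        return normalize([a + t * (b - a) for a, b in zip(q0, q1)])
    theta = math.acos(dot)
    s0 = math.sin((1.0 - t) * theta) / math.sin(theta)
    s1 = math.sin(t * theta) / math.sin(theta)
    return [s0 * a + s1 * b for a, b in zip(q0, q1)]


def quat_about_z(angle_deg):
    """Unit quaternion for a rotation of angle_deg about the Z axis."""
    half = math.radians(angle_deg) / 2.0
    return [math.cos(half), 0.0, 0.0, math.sin(half)]
```

For example, `slerp(quat_about_z(0), quat_about_z(90), 0.5)` matches `quat_about_z(45)` up to rounding, and negating either endpoint leaves the interpolated rotation unchanged because of the sign flip. The same sign ambiguity is what makes a naive L2 loss on raw quaternion components problematic for regression.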

Quaternions have proven valuable for accurate HPE owing to their efficiency, lack of gimbal lock, smooth interpolation, and suitability for continuous motion analysis, making them a compelling choice for representing complex head movements in dynamic scenarios. While their interpretation and computation might pose challenges, their benefits often outweigh the complexities, especially in applications where precision and continuity are paramount.

Rotation matrix It is a \(3 \times 3\) matrix that describes a rotation in 3D space. Each column of the matrix is the image of one basis vector of the coordinate system under the rotation. Rotation matrices are a fundamental approach employed to represent head pose rotations across various applications. These matrices describe orientation using a set of orthonormal vectors, offering a robust and straightforward representation of head movement within 3D space. The TriNet (Cao et al. 2021), MFDNet (Liu et al. 2021), and TokenHPE-E (Liu et al. 2023) methods, among others, use the rotation matrix representation.

Cao et al. ( 2021 ) employed a \(3 \times 3\) orthogonal rotation matrix within the framework to estimate head pose. They evaluated the performance of TriNet using the mean absolute error of vectors (MAEV). MFDNet (Liu et al. 2021 ) has been designed to tackle the challenge of low pose tolerance amidst various disturbances by introducing an exponential probability density model. This model uses a rotation matrix and matrix Fisher distribution. TokenHPE-E (Liu et al. 2023 ) is developed to steer the orientation tokens in their acquisition of the desired regional relationships and similarities using rotation matrix representation. This study is designed to address extreme orientations and low illumination, as stated by the authors. In Kim et al. ( 2023 ), the authors presented a multi-task network architecture to estimate landmarks and head poses using rotation matrix representation. The proposed method followed the 6DRepNet for HPE. Hempel et al. ( 2022 ) proposed a landmark-free technique called 6DRepNet, an end-to-end network and continuous representation for HPE. The final output of this method is a rotation matrix representation. The 6D rotation representation is used as part of deep learning architecture to improve performance and achieve the continuous rotation representation that will be discussed in detail in Sect.  2.3.3 .
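To illustrate how a 6D output can be turned into a valid rotation matrix, the sketch below applies Gram-Schmidt orthogonalization to two 3D vectors and completes the frame with a cross product, following the general idea of the continuous 6D representation (Zhou et al. 2019). It is a minimal illustration under that assumption, not the implementation used in 6DRepNet; the function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [c / norm for c in v]


def six_d_to_matrix(six):
    """Map six raw numbers (two 3D vectors) to a 3x3 rotation matrix."""
    a1, a2 = six[:3], six[3:]
    b1 = normalize(a1)
    # Remove the component of a2 along b1, then normalize (Gram-Schmidt).
    overlap = dot(b1, a2)
    b2 = normalize([a - overlap * b for a, b in zip(a2, b1)])
    # The third axis follows from the right-hand rule.
    b3 = cross(b1, b2)
    # b1, b2, b3 become the columns of the matrix.
    return [[b1[i], b2[i], b3[i]] for i in range(3)]


def is_rotation(matrix, tol=1e-9):
    """Check that columns are orthonormal and the determinant is +1."""
    cols = [[matrix[i][j] for i in range(3)] for j in range(3)]
    orthonormal = all(
        abs(dot(cols[i], cols[j]) - (1.0 if i == j else 0.0)) < tol
        for i in range(3) for j in range(3)
    )
    det = dot(cols[0], cross(cols[1], cols[2]))
    return orthonormal and abs(det - 1.0) < tol
```

Any six numbers whose two vectors are nonzero and not parallel map to a proper rotation, so a network can regress them without explicit orthogonality constraints; for instance, `is_rotation(six_d_to_matrix([2.0, 0.1, -0.3, 0.4, 1.5, 0.2]))` holds.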

The advantages of these methods with rotation matrices are as follows: 1) Precise representation: rotation matrices provide an accurate and unambiguous representation of orientation in 3D space. 2) Orthogonality: rotation matrices are orthonormal, meaning their columns are orthogonal unit vectors (Evans 2001 ). This property ensures that the length of any vector and the angle between any pair of vectors remain unchanged during the rotation. 3) Applicable to compositions: rotation matrices can be easily combined to compose multiple rotations without encountering a gimbal lock or singularities. However, these methods have the following disadvantages: 1) Complex calculations: computation of rotation matrices involves matrix multiplications and trigonometric functions, which can be computationally intensive (Hsu et al. 2018 ). This can make them less suitable for applications where simplicity and ease of use are important. 2) Numerical stability: small numerical errors during calculations can accumulate, potentially leading to inaccuracies over time. 3) Inefficient storage: rotation matrices require nine values, resulting in higher memory requirements than quaternion or Euler angle representations.

It is important to note that the advantages and disadvantages mentioned above are general characteristics of these rotation representation methods. The choice of representation depends on the specific requirements and limitations of the application.

Other rotation representations Rodrigues’ formula, axis-angle, and Lie algebra are other rotation representations (Cao et al. 2021). Researchers are continuously exploring different rotation representations to improve the accuracy and efficiency of HPE in various applications. The choice of head pose rotation representation depends on the specific application’s requirements. Each representation balances factors such as accuracy, efficiency, and handling of complex rotations. By selecting the appropriate representation, HPE systems can accurately interpret and analyze human head movements in diverse scenarios.
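For the axis-angle representation, Rodrigues’ formula gives the rotation matrix as \(R = I + \sin \theta \, K + (1 - \cos \theta ) K^{2}\), where \(K\) is the skew-symmetric matrix of the unit rotation axis. The sketch below is a direct transcription of this standard formula; the function names are hypothetical, and the angle recovery uses the identity \(\mathrm{tr}(R) = 1 + 2\cos \theta\).

```python
import math


def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def axis_angle_to_matrix(axis, theta):
    """Rodrigues' formula: R = I + sin(theta) K + (1 - cos(theta)) K^2."""
    norm = math.sqrt(sum(c * c for c in axis))
    x, y, z = (c / norm for c in axis)
    k = [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]
    k2 = matmul(k, k)
    s, c1 = math.sin(theta), 1.0 - math.cos(theta)
    return [[(1.0 if i == j else 0.0) + s * k[i][j] + c1 * k2[i][j]
             for j in range(3)] for i in range(3)]


def rotation_angle(matrix):
    """Recover the rotation angle in [0, pi] from the matrix trace."""
    trace = matrix[0][0] + matrix[1][1] + matrix[2][2]
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.acos(cos_theta)
```

A rotation of \(90^{\circ }\) about the Z axis, `axis_angle_to_matrix([0, 0, 1], math.pi / 2)`, maps the X axis onto the Y axis, and `rotation_angle` returns the input angle for any axis as long as the angle lies in \([0, \pi ]\).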

2.2.4 Degrees of Freedom (DoF)

In the context of motion and mechanics, “Degrees of Freedom (DoF)” refers to the number of independent parameters or ways in which a mechanical system can move or rotate in 3D space. It is a concept used in physics, engineering, and computer graphics to describe the flexibility and mobility of an object or a system. Figure  9 shows the number of degrees of freedom.

Fig. 9 Number of degrees of freedom

6DoF The term 6DoF refers to the ability of an object or system to move freely in 3D space along three rotational axes (pitch, yaw, and roll) and three translational axes (up-down, left-right, and forward-backward). In virtual reality and robotics, 6DoF allows for a more immersive and realistic experience by enabling users or objects to move and interact with their environment naturally and unconstrained, as shown in Fig.  9 b.

Roth and Gavrila (2023) presented a method called IntrApose. This approach relies on intensity information along with camera intrinsics using a single camera image without landmark localization or prior detection for continuous 6DoF HPE. Kao et al. (2023) utilized the perspective-n-point (PnP) method with facial landmarks to estimate a 6DoF face pose from monocular images. To support this approach, they curated a comprehensive 3D face dataset called the ARKitFace dataset, which consists of 902,724 2D facial images from 500 subjects. This dataset encompasses a diverse range of expressions, poses, and ages. It is worth noting that, despite employing deep learning techniques for HPE, Kao et al. (2023) adopted a landmark-based methodology. Luo et al. (2019) estimated rotation angles and position using an RGB-D image. These systems track a head’s position and orientation, enhancing realism and enabling greater interaction in 3D space. However, 6DoF systems typically have higher costs and more complex space requirements.

3DoF Three degrees of freedom (3DoF) describe the movement capability along three specific axes in 3D space. In the context of virtual reality, 3DoF typically refers to rotational movement around the pitch, yaw, and roll axes (Ma et al. 2024; Yao and Huang 2024), as shown in Fig. 9a. Unlike 6DoF, which allows for both rotational and translational movement, 3DoF only captures rotational changes (Tomenotti et al. 2024), limiting the range of motion and spatial interaction. This technology is often used in simpler VR systems or devices requiring constrained movement.
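The difference between 3DoF and 6DoF poses can be sketched with homogeneous transforms: a 6DoF pose combines a rotation with a translation in a \(4 \times 4\) matrix, whereas a 3DoF pose is the special case with zero translation. The helper names and example values below are hypothetical and serve only as an illustration.

```python
import math


def rot_z(theta):
    """Rotation about the Z axis (used here only as an example rotation)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def make_pose(rotation, translation=(0.0, 0.0, 0.0)):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a translation."""
    pose = [list(rotation[i]) + [translation[i]] for i in range(3)]
    pose.append([0.0, 0.0, 0.0, 1.0])
    return pose


def transform_point(pose, point):
    """Apply the pose to a 3D point (rotation first, then translation)."""
    homogeneous = [point[0], point[1], point[2], 1.0]
    return [sum(pose[i][k] * homogeneous[k] for k in range(4)) for i in range(3)]


# 3DoF: orientation only. 6DoF: orientation plus position (example values).
pose_3dof = make_pose(rot_z(math.pi / 2))
pose_6dof = make_pose(rot_z(math.pi / 2), (0.1, 0.0, 0.6))
```

Applying both poses to the point (1, 0, 0) gives approximately (0, 1, 0) for the 3DoF pose and (0.1, 1, 0.6) for the 6DoF pose: the orientation is identical, and only the 6DoF pose also carries the position of the head.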

Bisogni et al. ( 2021 ) presented a PIFS model to improve face recognition. The method aims to estimate head pose simultaneously with face recognition. The major limitation of the proposed method is that the system performance could drop drastically if a few frames are very different from those inserted in the input. Another study (Hu et al. 2022 ) introduced an integrated framework that comprises yawning, blinking, gaze, and head rotation using attention-based feature decoupling and heteroscedastic loss for multi-state driver monitoring. These systems are simple and easy to use compared to 6DoF systems. However, 3DoF systems have some limitations, such as reduced realism due to the limited range of movement.

Other DoF Liu et al. ( 2022 ) presented an asymmetric relation-based head pose estimation (ARHPE) model, an asymmetric relation cue between pitch and yaw angles that is powerful in learning the discriminative representations of adjacent head pose images for HPE. The ARHPE model computed the overall MAE for only two degrees of freedom along the pitch and yaw angles. Different weights are allocated to the pitch and yaw orientations using the half-at-half maximum of the 2-D Lorentz distribution. Vo et al. ( 2019 ) introduced a method based on an extreme gradient boosting neural network and histogram of oriented gradients (HOG) in multi-stacked autoencoders to predict yaw and pitch for HPE. Berral-Soler et al. ( 2021 ) estimated head poses by 2DoF using the pitch and yaw angles. Hsu and Chung ( 2020 ) presented a complete representation (CR) pipeline that adaptively learns and generates two comprehensive representations (CR-region and CR-center) of the same individual. This study focuses on eye center localization for head poses based on geometric transformations with CR-center and image translation learning with the CR-region using five image databases tested only with two DoFs (yaw and roll angles). Liu et al. ( 2021a ) estimated head poses by 2D Euler angles using the pitch and yaw angles.
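As a small illustration of how an overall MAE over two degrees of freedom might be computed, the sketch below averages absolute angular errors for pitch and yaw. Wrapping differences into \([-180^{\circ }, 180^{\circ })\) is an assumption added here so that errors near the range boundary are not overstated; it is not described in the cited works, and the function names are hypothetical.

```python
def angular_diff(a_deg, b_deg):
    """Signed difference a - b wrapped into [-180, 180) degrees."""
    return (a_deg - b_deg + 180.0) % 360.0 - 180.0


def mae_two_dof(predictions, ground_truth):
    """Per-angle and overall MAE for lists of (pitch, yaw) pairs in degrees."""
    n = len(predictions)
    pitch_mae = sum(abs(angular_diff(p[0], g[0]))
                    for p, g in zip(predictions, ground_truth)) / n
    yaw_mae = sum(abs(angular_diff(p[1], g[1]))
                  for p, g in zip(predictions, ground_truth)) / n
    return {"pitch": pitch_mae, "yaw": yaw_mae,
            "overall": (pitch_mae + yaw_mae) / 2.0}


# Toy values for illustration only (not results from any paper).
example = mae_two_dof(
    predictions=[(10.0, 20.0), (-5.0, 170.0)],
    ground_truth=[(12.0, 25.0), (-1.0, -175.0)],
)
```

With these toy values, the yaw error of the second sample is \(15^{\circ }\) rather than \(345^{\circ }\) because of the wrapping, giving a pitch MAE of 3, a yaw MAE of 10, and an overall MAE of 6.5 degrees. Results reported over two angles are therefore not directly comparable with three-angle MAE values.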

The choice of the number of DoF relies on the specific requirements and goals of the application or system, as well as factors like budget, hardware complexity, and interaction needed. Each has its strengths and weaknesses, and the decision should align with the intended use case.

2.3 Techniques and methodologies

2.3.1 Landmark-based vs. landmark-free method

In the realm of facial analysis and computer vision, two fundamental approaches have emerged for detecting and analyzing facial features: landmark-free and landmark-based methods (Yan and Zhang 2024 ). Each approach offers distinct advantages and disadvantages, catering to different application needs and challenges. In this discussion, we will delve into the benefits and drawbacks of both approaches, shedding light on their respective strengths and limitations.

Landmark-based methods These methods rely on the detection and localization of specific keypoints or landmarks on the person’s face, such as the positions of the eyebrows, eyes, lips, nose, and mouth (Zhao et al. 2024). Figure 10 shows the visualization of facial landmarks of different versions. For HPE, 5 to 7 landmark points are sufficient. These keypoints act as reference points for determining the head’s orientation in 3D space. By analyzing the spatial arrangement of these landmarks, the head pose angles (e.g., roll, pitch, and yaw) can be estimated. This approach benefits from using easily recognizable and distinct facial features, and it is commonly utilized in applications such as facial recognition, gaze tracking, and AR. These approaches first detect either dense or sparse facial landmarks and subsequently estimate the head pose based on these keypoints. For instance, solvePnP (Gao et al. 2003) may be employed for this purpose.

Fig. 10 Different versions of facial landmarks representation

Wu and Ji ( 2019 ) provide a literature survey for facial landmark detection, which classifies it into three major categories: regression-based, constrained local model (CLM), and holistic methods. In Gupta et al. ( 2019 ), the authors presented a deep-learning method to estimate the head pose. The proposed method used the uncertainty maps in the form of 2D soft localization heatmap images over five facial key points: the nose, right eye, right ear, left eye, and left ear, which pass through a CNN to obtain Euler angles. Common methods used for detecting facial landmarks include face alignment network (FAN) (Bulat and Tzimiropoulos 2017 ), Dlib (Kazemi and Sullivan 2014 ), MediaPipe (Lugaresi et al. 2019 ), 3DDFA (Zhu et al. 2016 ), and 3DDFA_v2 (Guo et al. 2020 ). FAN (Bulat and Tzimiropoulos 2017 ) is a DNN technique designed to detect facial landmarks, with the added potential for use in HPE applications. The proposed method utilizes detected facial landmarks to map RGB images into 3D head poses. This approach converted 12 facial point coordinates to a 3D head pose, yielding MAE of \(9.12^{\circ }\) and \(8.84^{\circ }\) on the annotated facial landmarks in the wild (AFLW) (Koestinger et al. 2011 ) and AFW (Zhu and Ramanan 2012 ) datasets, respectively (Huang et al. 2020 ). Dlib (Kazemi and Sullivan 2014 ) is a C++ toolkit that includes classical algorithms and tools to create complex software for solving real-world problems. One of the applications of Dlib is HPE, which is widely used in diverse computer vision applications such as VR, hands-free gesture-controlled, driver’s attention detection, and gaze estimation applications. Dlib provides 68 informative landmarks that facilitate the estimation of the head pose (Al-Nuimi and Mohammed 2021 ). MediaPipe (Lugaresi et al. 2019 ) is a framework for building perception pipelines, which can be used for various computer vision applications, including HPE. 
MediaPipe detects the face area with 468 landmarks (Al-Nuimi and Mohammed 2021). 3DDFA (Zhu et al. 2016) is a cascaded CNN based on a regression method. This approach fits a dense 3D facial model onto an image and concurrently predicts the head pose. 3DDFA_v2 (Guo et al. 2020) is the second version of 3DDFA. It is designed to balance speed, stability, and accuracy using a lightweight backbone. Ariz et al. (2019) improved a method named weighted POSIT (wPOSIT) based on 2D tracking of the face to enhance the performance of both 2D point tracking and 3D HPE using the BU (La Cascia et al. 2000) and UPNA (Ariz et al. 2016) databases. A set of 12 facial landmarks was selected to track a 2D face.

The previous methods suffer from inevitable difficulties with large angles, low resolution, or partial occlusion. In particular, when the detected landmarks become erratic, the precision of HPE is significantly compromised. Consequently, researchers frequently refrain from directly employing the generated landmarks (Zhou et al. 2023). For example, Mo and Miao (2021) introduced a one-step graph generation network (OsGG-Net). This model is an end-to-end integration of the graph convolutional network (GCN) and CNN to estimate head poses based on Euler angle representation and the landmark-based method. HHP-Net (Cantarini et al. 2022) introduces a technique for estimating head pose angles from single images using a small set of automatically computed head keypoints. The authors adopted OpenPose (Cao et al. 2017) as a keypoint extractor, given its ability to balance effectiveness and efficiency. The approach utilizes a carefully designed loss function to quantify heteroscedastic uncertainties related to the three angles. The correlation between these uncertainty values and the error offers supplementary information for subsequent computational processes. KEPLER (Kumar et al. 2017) presents an iterative method to learn local and global features simultaneously using a Heatmap-CNN with a modified GoogLeNet architecture for joint keypoint estimation and HPE in unconstrained facial scenarios.

Nevertheless, landmark-based approaches suffer from a significant limitation in terms of model expressive capability, making it challenging for them to attain performance levels comparable to landmark-free methods. To address this challenge, Xin et al. ( 2021 ) introduced an approach that involves the creation of a landmark-connection graph and the utilization of GCN to capture intricate nonlinear relationships between the graph typologies and head pose angles. Moreover, to address the challenge of unstable landmark detection, this approach incorporates the edge-vertex attention (EVA) mechanism and further enhances performance through the introduction of densely-connected architecture (DCA) and adaptive channel attention (ACA).

The benefits of the landmark-based approaches are as follows: 1) Precise localization: landmark-based approaches provide accurate and specific facial feature localization, making them suitable for tasks requiring detailed facial structure information. 2) Multiple tasks applications: these methods are well-suited for multiple tasks such as facial expression analysis, head motion, HPE, facial deformations, and facial animation due to their ability to provide detailed landmark information (Çeliktutan et al. 2013 ). 3) Widely known and used: landmark-based methods are probably the most commonly used. They can bring the face into a canonical configuration, typically a frontal head pose (Belmonte et al. 2021 ). 4) Interpretable results: explicit landmarks allow for easier interpretation of the model’s decisions and behavior. Moreover, the head can be segmented into small partitions (Zubair et al. 2022 ). However, these methods have the following drawbacks: 1) Sensitivity to landmark quality: landmark-based methods can struggle with variations in facial expressions, poses, occlusion, extreme rotation, and lighting conditions, potentially leading to less robust performance (Hempel et al. 2022 ; Roth and Gavrila 2023 ). 2) Data requirements: training landmark-based models often requires large annotated datasets with accurately labeled landmarks. The primary issue arises from the data itself. Manually annotating landmarks on faces with significant poses is an extremely tedious task, especially when occluded landmarks need to be estimated. This is an insurmountable task for the majority of individuals (Zhu et al. 2016 ). 3) Computational complexity: detecting and tracking multiple landmarks can be computationally intensive, affecting real-time performance (Xia et al. 2022 ). 4) Limited generalization, applicability, and range angles: landmark-based methods might struggle to generalize to unseen variations or populations that differ significantly from the training data. 
The landmark-based method has a significant drawback: it becomes challenging to locate these landmarks when the face orientation exceeds \(60^{\circ }\) . Consequently, to circumvent this critical limitation, other models directly carried out HPE models from images without relying on landmark detection (Viet et al. 2021 ).

Landmark-free method Conversely, these methods do not depend on the explicit detection and usage of predefined facial keypoints for HPE. Instead, they utilize other facial features or the full image to infer the head’s orientation. These methods analyze head contours and head patterns (see Fig.  11 ) or utilize depth information from depth sensors or depth estimation algorithms to estimate the head pose.

Fig. 11 Head pose using a landmark-free method

Landmark-free techniques can be advantageous when dealing with challenging conditions, such as partial occlusion of facial features or when keypoints are difficult to detect accurately.

HopeNet (Ruiz et al. 2018 ), LSR (Celestino et al. 2023 ), 6DRepNet (Hempel et al. 2022 ), and TokenHPE (Zhang et al. 2023 ), among others, are landmark-free methods. HopeNet, introduced by Ruiz et al. ( 2018 ), was among the pioneering landmark-free techniques. It incorporates multiple loss functions to estimate Euler angles, initially employing a cross-entropy function to classify the angles and subsequently refining the fine-grained predictions by minimizing the mean-squared error between the predicted pose and the ground truth labels. In Celestino et al. ( 2023 ), the authors presented a deep-learning approach relying on LSR with multi-loss for HPE under occlusions, which was designed as a new application to autonomously feed disabled people with a robotic arm. The authors generated synthetic, occluded datasets by covering part of the human faces using six levels of occlusions without changing the labels of the original datasets for head poses. Hempel et al. ( 2022 ) employed the approach outlined in Zhou et al. ( 2019 ) to introduce 6DRepNet, an end-to-end network without landmarks. In Zhang et al. ( 2023 ), the authors presented a landmark-free method called TokenHPE for efficient HPE using vision transformers (ViT) that allow the model to better capture the spatial relationships between different face parts. This method was designed to address serious occlusions and extreme head pose randomness, as stated by the authors.
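The multi-loss idea described for HopeNet, a classification term over angle bins combined with a regression term on the expected angle, can be illustrated with a small sketch. The bin layout (66 bins of 3 degrees starting at -99 degrees), the weight `alpha`, and all function names are assumptions made for illustration; they are not specified in the text above, and mapping each bin index to its lower edge is a simplification.

```python
import math

# Assumed bin layout (illustrative, not taken from the text above).
BIN_WIDTH = 3.0    # degrees per bin
NUM_BINS = 66      # number of angle bins
ANGLE_MIN = -99.0  # lower edge of the first bin, in degrees


def softmax(logits):
    """Numerically stable softmax over a list of bin scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def angle_to_bin(angle_deg):
    """Index of the bin containing the angle, clamped to the valid range."""
    idx = int((angle_deg - ANGLE_MIN) // BIN_WIDTH)
    return min(max(idx, 0), NUM_BINS - 1)


def expected_angle(probs):
    """Continuous angle as the expectation over bin indices, in degrees."""
    return sum(p * i for i, p in enumerate(probs)) * BIN_WIDTH + ANGLE_MIN


def multi_loss(logits, true_angle_deg, alpha=1.0):
    """Cross-entropy on the bin label plus alpha times the squared error
    between the expected angle and the ground-truth angle."""
    probs = softmax(logits)
    cross_entropy = -math.log(probs[angle_to_bin(true_angle_deg)])
    squared_error = (expected_angle(probs) - true_angle_deg) ** 2
    return cross_entropy + alpha * squared_error
```

In a full model, one such loss would be computed for each of the three Euler angles; the classification term gives a stable coarse signal while the regression term refines the fine-grained prediction.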

The advantages of the landmark-free approaches are: 1) Flexibility: landmark-free methods do not rely on predefined facial landmarks, making them suitable for facial expressions, a wide range of head poses, and variations in appearance. 2) Robustness: they can handle partial occlusions and variations in lighting and background, making them robust for real-world scenarios. 3) Reduced annotation effort: landmark-free methods typically require less manual annotation of facial landmarks, simplifying data preparation. However, these methods have the following disadvantages: 1) Coarser estimation: they may provide coarser head pose estimates compared to landmark-based methods, which can limit their precision for some applications. 2) Limited spatial information: landmark-free methods might not capture fine-grained spatial information about facial features, which can be important in certain contexts. 3) Performance variability: the accuracy of landmark-free methods can vary depending on the head pose’s complexity and the input data’s quality.

Hybrid methods These methods refer to approaches that combine both landmark-based and landmark-free techniques for estimating the head pose. These methods leverage the strengths of each approach to improve accuracy, robustness, and generalization in HPE tasks. Xia et al. ( 2022 ) proposed a collaborative learning framework based on CNNs that utilizes two branches, one based on landmarks and the other landmark-free, as shown in Fig.  12 for HPE. These two branches work together, engaging in both implicit and explicit interactions with information for mutual promotion and complementary semantic learning.

Fig. 12 Head pose using landmark-free (bottom backbone network) and landmark-based (top backbone network) methods

Fu et al. ( 2023 ) proposed an adaptive occlusion hybrid second-order attention network containing a second-order attention module, occlusion-aware module, and exponential map-based pose prediction. The proposed method is implemented using the PyTorch deep learning framework. ResNet50 is used as a feature extractor backbone. It has some limitations due to the challenges of extracting key points in the presence of occlusions and extensive poses in real-world scenarios. Additionally, landmark detection adds computational cost.

In conclusion, both landmark-free and landmark-based approaches exhibit unique strengths and limitations. The choice between these approaches should be driven by the specific requirements of the application, available resources, and desired level of precision. While landmark-free methods offer adaptability and robustness, landmark-based methods excel in the precision and use of established techniques. As the field of facial analysis continues to evolve, a balanced consideration of the advantages and disadvantages will pave the way for more effective and accurate facial analysis solutions.

2.3.2 Technique

Classical techniques Early approaches employed classical models for HPE, such as template matching, quad-tree, random forest (RF), HOG, Haar cascades, and support vector machine (SVM) (Abate et al. 2022 ).

The method proposed in Abate et al. ( 2019 ) described a quad-tree adaptation for facial landmark representation, enabling HPE with a discrete angular resolution of \(5^{\circ }\) in terms of pitch, yaw, and roll angles. In Abate et al. ( 2020 ), a method for HPE based on overlaying a web-shaped model (WSM) on the output of landmark detection models was proposed. The WSM is employed to identify the corresponding sector for each of the 68 landmarks, thereby facilitating accurate landmark position prediction within images. Höffken et al. ( 2014 ) employed a synchronized submanifold embedding (SSE), EKF, and quaternion representation to estimate the head poses. SSE is a nonlinear regression method, which comprises a barycentric coordinate estimation, k-nearest neighbor search, and dimensionality reduction. Benini et al. ( 2019 ) proposed a multi-feature framework based on SVM and RF to perform classification tasks on expression, gender, and head pose. However, these traditional techniques are typically trained on small to medium-sized datasets and require manual extraction of pertinent features from the data. Barra et al. ( 2022 ) presented a fractal method based on extreme gradient boosting regressor and gradient boosting regressor for HPE by Euler angles. The work of Höffken et al. achieved a high MAE, though it was trained and tested on the same dataset (Biwi and AFLW2000). Most methods were trained on the 300W-LP dataset and then tested on the Biwi and AFLW2000 datasets for fair comparison.

Deep learning techniques The field of deep learning has garnered significant research interest, driven by its diverse applications in online retail, art/film production, video conferencing, and virtual agents. The progress in deep learning has facilitated the on-demand generation of a person’s visual attributes, including their face and pose. The major focus of this survey (Asperti and Filippini 2023 ) is on deep learning techniques.

6DoF-HPE (Algabri et al. 2024 ), 6DHPENet (Chen et al. 2022 ) and 6DRepNet360 (Hempel et al. 2024 ) were implemented on a RepVGG backbone (Ding et al. 2021 ) with 6D continuous rotation representation. In Celestino et al. ( 2023 ), Wang et al. ( 2023 ), Zhu et al. ( 2022 ), the proposed methods were implemented on a ResNet backbone (He et al. 2016 ) with Euler angle representation. In Roth and Gavrila ( 2023 ), the proposed method was implemented on a ResNet backbone (He et al. 2016 ) with a differentiable rotation representation. In Zhou et al. ( 2023 ), the authors proposed a one-stage network architecture, which is built upon the YOLOv5 framework and Euler angle representation. In Chen et al. ( 2023 ), the authors designed a network architecture with multi-modal attention using asymmetry-aware bilinear pooling to estimate head pose. In Chen et al. ( 2023 ), the authors proposed a lightweight network called Neko-Net for HPE without relying on facial keypoints. Neko-Net consists of a soft stagewise regression (SSR), external attention modules, and a dual-stream lightweight backbone network. It aims to achieve high precision with low computational cost and model parameters.

These techniques are advanced models with multiple layers of interconnected nodes that are trained effectively on large datasets to achieve high performance and are designed to extract pertinent features from the data automatically. However, these techniques require large datasets, demand high computational power, and are difficult to interpret.

2.3.3 Rotation type method

In the context of HPE, the choice of rotation representation can significantly impact the accuracy and performance of the methods. Some representations are more suitable for capturing the full range of head motions and providing smooth, continuous estimates of pitch, yaw, and roll, while others may suffer from discontinuities and limitations in certain scenarios. Continuous and discontinuous representations refer to different ways of representing the head pose. Zhou et al. ( 2019 ) explore the concept of continuous representations in the context of DNNs. The authors proved that any rotation representation with four or fewer dimensions is discontinuous based on topological concepts. Based on these concepts, we categorized works into continuous and discontinuous representations.

Continuous representation A rotation matrix is a continuous representation that can accurately capture the full range of head motions without suffering from discontinuities or gimbal lock. Six- and five-dimensional (6D and 5D) and vector-based representations are also continuous, as reported in Zhou et al. ( 2019 ). 6DoF-HPE (Algabri et al. 2024 ), 6DHPENet (Chen et al. 2022 ), 6DRepNet360 (Hempel et al. 2024 ), TriNet (Cao et al. 2021 ), MFDNet (Liu et al. 2021 ), TokenHPE-E (Liu et al. 2023 ), and TokenHPE (Zhang et al. 2023 ) use continuous rotation representations.

6DoF-HPE (Algabri et al. 2024 ), 6DHPENet (Chen et al. 2022 ), and 6DRepNet (Hempel et al. 2022 ) predict a 6D representation that is then mapped to the nine parameters of a rotation matrix using the Gram–Schmidt process. TriNet (Cao et al. 2021 ) predicts the three vectors of a ( \(3 \times 3\) ) orthogonal rotation matrix. TokenHPE-E (Liu et al. 2023 ) and TokenHPE (Zhang et al. 2023 ) employ a transformer architecture to predict the final ( \(3 \times 3\) ) rotation matrix. Roth and Gavrila ( 2023 ) adopted a continuous, differentiable rotation representation based on singular value decomposition ( \(SVDO^{+}\) ), which maps an unconstrained representation of nine values directly into SO(n). \(SVDO^{+}\) is a well-known symmetric orthogonalization (Levinson et al. 2020 ).
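The mapping from a 6D output to a rotation matrix via the Gram–Schmidt process can be sketched as follows. This is a minimal illustration under the assumption that the six values are read as two 3D vectors that become the first two columns of the matrix; the function names are hypothetical and the cited networks may order or normalize the vectors differently.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def _cross(u, v):
    return [
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    ]


def rotation_from_6d(x):
    """Map six unconstrained values to a 3x3 rotation matrix (nested lists)."""
    a1, a2 = list(x[:3]), list(x[3:6])
    b1 = _normalize(a1)                      # first column: normalized a1
    proj = _dot(b1, a2)                      # component of a2 along b1
    b2 = _normalize([a - proj * b for a, b in zip(a2, b1)])  # remove it
    b3 = _cross(b1, b2)                      # third column completes the frame
    # b1, b2, b3 are the columns; return the matrix row by row.
    return [[b1[i], b2[i], b3[i]] for i in range(3)]
```

Because every step is differentiable wherever the two input vectors are non-zero and not parallel, a network can regress the six values freely and still output a valid rotation.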

In HPE, this continuity can be beneficial when needing precise and smooth tracking of head movements. The researchers and developers can use 6D and 5D and vector-based or rotation matrix-based representations to continuously estimate the head’s orientation.

The benefits of the continuous representation are as follows: (1) Smooth transformations: it allows smooth and continuous transformations between angles, making them suitable for applications where gradual changes are essential, such as animations or robotic control systems. (2) Differentiability: it is often differentiable, which is crucial for optimizing algorithms and neural networks. This enables efficient gradient-based training. (3) Better performance: compared to discontinuous representations, it performs better. However, continuous representation has the following drawbacks: (1) Ambiguity: it can lead to unexpected behavior or difficulties in interpretation. (2) Computational complexity: some continuous rotation representations, especially those involving trigonometric functions, can be computationally expensive, particularly in high-dimensional spaces.

Discontinuous representation Quaternions, Lie algebra, and Euler angle representations are discontinuous for 3D rotations, including HPE, because they have three or four dimensions (Cao et al. 2021 ). Indeed, Zhou et al. ( 2019 ) demonstrated that a continuous representation of rotations in 3D space requires a minimum of five dimensions. However, if researchers and developers discretize the Euler angles into bins or use them carefully, they can still be useful for certain applications that do not require a perfectly smooth transition between poses. HopeNet (Ruiz et al. 2018 ), QuatNet (Hsu et al. 2018 ), Lie algebra residual architecture (LARNeXt) (Yang et al. 2023 ), WHENet (Zhou and Gregson 2020 ), FSA-Net (Yang et al. 2019 ), LSR (Celestino et al. 2023 ), and HHP-Net (Cantarini et al. 2022 ) use discontinuous rotation representations.

HopeNet (Ruiz et al. 2018 ), WHENet (Zhou and Gregson 2020 ), FSA-Net (Yang et al. 2019 ), LSR (Celestino et al. 2023 ), and HHP-Net (Cantarini et al. 2022 ) predict only three dimensions because they use the Euler angle representation, whereas QuatNet (Hsu et al. 2018 ) predicts four dimensions because it uses the quaternion representation. LARNeXt (Yang et al. 2023 ) is an integrated embedded end-to-end method on the ResNet-50 backbone that achieves head pose and face recognition using Lie algebra representation. The limitation of this method is that the accuracy dramatically decreases when the face has occluded parts.

The advantages of the discontinuous representation are as follows: 1) Simplicity: it is often simpler to implement and understand, which can be advantageous when transparency and interpretability are important. 2) Intuitive interpretation: quaternions and Euler angles may offer greater intuitiveness to certain users as they can be associated with terms like roll, pitch, and yaw, which find common usage in disciplines such as robotics and aviation. However, discontinuous representation has the following disadvantages: 1) Gimbal lock and ambiguity: Euler angles can encounter gimbal lock, a phenomenon in which the representation becomes singular and loses the ability to depict specific rotations precisely; quaternions do not suffer from gimbal lock, but a quaternion and its negation encode the same rotation, which makes the regression target ambiguous. 2) Difficulty in interpolation: interpolating discontinuous representations can pose difficulties in achieving smooth transitions, potentially causing problems in animation and rendering processes.
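Gimbal lock in the Euler representation can be made concrete with a short numerical sketch. It assumes the composition order R = Rz(yaw) Ry(pitch) Rx(roll), which is only one of several conventions used in the HPE literature; the function names are hypothetical. At a pitch of \(90^{\circ }\), two different (yaw, roll) pairs with the same difference produce the same rotation matrix, so the angles can no longer be recovered uniquely.

```python
import math


def euler_to_matrix(yaw_deg, pitch_deg, roll_deg):
    """Rotation matrix for R = Rz(yaw) * Ry(pitch) * Rx(roll), in degrees.

    The axis assignment is an assumed convention for this illustration.
    """
    y, p, r = (math.radians(a) for a in (yaw_deg, pitch_deg, roll_deg))
    cy, sy = math.cos(y), math.sin(y)
    cp, sp = math.cos(p), math.sin(p)
    cr, sr = math.cos(r), math.sin(r)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp, cp * sr, cp * cr],
    ]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two 3x3 matrices."""
    return max(abs(a[i][j] - b[i][j]) for i in range(3) for j in range(3))


# At pitch = 90 degrees only (roll - yaw) matters: both pairs differ by 20.
locked_a = euler_to_matrix(10.0, 90.0, 30.0)
locked_b = euler_to_matrix(40.0, 90.0, 60.0)

# Away from the singularity the same two pairs give different rotations.
free_a = euler_to_matrix(10.0, 20.0, 30.0)
free_b = euler_to_matrix(40.0, 20.0, 60.0)

print("difference at pitch 90:", max_abs_diff(locked_a, locked_b))
print("difference at pitch 20:", max_abs_diff(free_a, free_b))
```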

The choice between continuous and discontinuous rotation representations should be made based on the application’s specific needs. Continuous representations are better for smooth transitions, whereas discontinuous representations are appropriate for discrete decision points or categories. Careful consideration of the trade-offs and suitability of each representation is essential for effectively addressing the requirements of a problem involving angles.

2.4 Evaluation metrics

Some researchers used two or more metrics to evaluate their work. For example, Firintepe et al. ( 2020 ) adopted several metrics to evaluate HPE, including MAE, balanced mean angular error (BMAE), root mean squared error (RMSE), and standard deviation (STD). Cao et al. ( 2021 ) adopted the MAE and MAEV metrics. WHENet (Zhou and Gregson 2020 ) was evaluated using MAE and mean absolute wrapped error (MAWE). Meanwhile, Khan et al. ( 2020 ) proposed a framework that evaluated HPE on four databases, namely AFLW, ICT-3DHP, BU, and Pointing’04 (Gourier 2004 ), using MAE and accuracy. The details of these metrics are presented in the following subsections.

2.4.1 Mean absolute error (MAE)

MAE is a widely adopted metric for assessing HPE frameworks and is frequently employed across the papers reviewed in this study. Its popularity stems from its ability to offer a concise and informative performance evaluation, encompassing all three angles (i.e., roll, pitch, and yaw). Its mathematical equation is as follows:

$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| {\hat{\theta }}_i - \theta _i \right|$$

where \({\hat{\theta }}_i\) is the predicted angle, \(\theta _i\) is the ground truth angle, and n is the number of test images. A smaller MAE value indicates superior performance when comparing methods.
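As a small worked illustration, MAE can be computed per angle and then averaged over yaw, pitch, and roll. The sketch below assumes angles in degrees and poses given as (yaw, pitch, roll) tuples; the function names are hypothetical.

```python
def mean_absolute_error(pred_deg, true_deg):
    """Mean of |prediction - ground truth| over all samples of one angle."""
    if len(pred_deg) != len(true_deg) or not pred_deg:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(pred_deg, true_deg)) / len(pred_deg)


def pose_mae(pred_poses, true_poses):
    """Per-angle MAE and their mean for lists of (yaw, pitch, roll) tuples."""
    per_angle = [
        mean_absolute_error(
            [pose[k] for pose in pred_poses],
            [pose[k] for pose in true_poses],
        )
        for k in range(3)
    ]
    return per_angle, sum(per_angle) / 3.0


per_angle, overall = pose_mae(
    [(10.0, 0.0, 5.0), (20.0, 10.0, -5.0)],
    [(12.0, 0.0, 5.0), (16.0, 10.0, -2.0)],
)
print(per_angle, overall)
```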

2.4.2 Pose estimation accuracy (PEA)

The PEA is another metric used to assess HPE. As an accuracy metric, PEA relies on pose counts, providing limited insights into actual system performance. Notably, most recent research has not employed PEA as the second metric in the context of HPE. A high PEA value signifies better performance when comparing methods (Asperti and Filippini 2023 ).

2.4.3 Mean absolute error of vectors (MAEV)

Cao et al. ( 2021 ) raised concerns about the suitability of using the MAE of Euler angles as an evaluation metric, particularly for profile images, arguing that it may not accurately measure the performance of networks. Instead, they advocate for adopting the MAEV for evaluation. In their approach, three vectors derived from the rotation matrix are utilized to characterize head poses, and the disparity between the predicted vectors and the ground-truth vectors is computed. Their findings demonstrated the enhanced consistency of this representation and established MAEV as a more dependable indicator for assessing pose estimation outcomes. Its mathematical equation is as follows:

where \(v_p\) and \(v_g\) are the predicted and the ground truth head orientation vectors (Hempel et al. 2024 ).
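One plausible reading of MAEV, assumed here rather than taken from the text, measures the angle in degrees between each predicted column vector of the rotation matrix and its ground-truth counterpart, and averages over the three vectors and all samples. The sketch below follows that reading; the function names are hypothetical and other works may define the vector disparity differently.

```python
import math


def column(matrix, k):
    """k-th column of a 3x3 matrix given as nested lists."""
    return [matrix[0][k], matrix[1][k], matrix[2][k]]


def angle_between_deg(u, v):
    """Angle between two 3D vectors in degrees, clamped for numerical safety."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))


def maev(pred_matrices, true_matrices):
    """Mean angular error over the three orientation vectors and all samples."""
    if len(pred_matrices) != len(true_matrices) or not pred_matrices:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for pred, true in zip(pred_matrices, true_matrices):
        total += sum(
            angle_between_deg(column(pred, k), column(true, k)) for k in range(3)
        ) / 3.0
    return total / len(pred_matrices)
```

Under this reading, a prediction that is off by a \(90^{\circ }\) rotation about the third axis misplaces two of the three vectors by \(90^{\circ }\) each and leaves the third unchanged.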

2.4.4 Mean absolute wrapped error (MAWE)

The MAWE is a metric used to assess the accuracy of models for full-range angle data. It measures the mean absolute difference between the predicted and the ground truth values while considering the shortest angular path between them. However, MAWE is not a widely recognized metric in statistics or machine learning. MAWE is particularly useful in applications where the direction or phase is critical, such as methods of HPE for full-range angles. When applied to measure narrow-range angles, it yields identical results to MAE (Asperti and Filippini 2023 ). Its mathematical equation is:

$$\text{MAWE} = \frac{1}{n}\sum_{i=1}^{n} \min \left( \left| {\hat{\theta }}_i - \theta _i \right| ,\; 360^{\circ } - \left| {\hat{\theta }}_i - \theta _i \right| \right)$$

where \({\hat{\theta }}_i\) is the predicted angle, \(\theta _i\) is the ground truth angle, and n is the number of test images. A lower MAWE value signifies better performance when comparing methods (Viet et al. 2021 ).
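The shortest-angular-path idea behind MAWE can be sketched in a few lines. The sketch assumes angles in degrees and reduces the difference modulo 360 before taking the shorter way around the circle; the function names are hypothetical.

```python
def wrapped_abs_error(pred_deg, true_deg):
    """Absolute angular error along the shortest path, in [0, 180] degrees."""
    diff = abs(pred_deg - true_deg) % 360.0
    return min(diff, 360.0 - diff)


def mean_absolute_wrapped_error(pred_deg, true_deg):
    """Mean of the wrapped absolute errors over all samples."""
    if len(pred_deg) != len(true_deg) or not pred_deg:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(
        wrapped_abs_error(p, t) for p, t in zip(pred_deg, true_deg)
    ) / len(pred_deg)
```

For a prediction of \(179^{\circ }\) against a ground truth of \(-179^{\circ }\), the plain absolute error is \(358^{\circ }\) whereas the wrapped error is \(2^{\circ }\), which is why the two metrics only coincide for narrow-range angles.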

2.4.5 Balanced mean angular error (BMAE)

In driving scenarios, there is a bias towards frontal orientations, resulting in an uneven distribution of various head orientations. Schwarz et al. ( 2017 ) proposed the BMAE metric to tackle this issue. Its mathematical equation is:

$$\text{BMAE} = \frac{d}{k}\sum_{i} \phi _{i, i+d}, \quad i \in d\,{\mathbb {N}} \cap [0, k]$$

where \(\phi _{i, i+d}\) is the average angular error between the ground truth and predicted angles over all samples whose absolute angular distance from the frontal pose lies between i and \(i+d\) , d is the bin width, and k is the maximum angle considered. The original study set d and k to \(5^{\circ }\) and \(75^{\circ }\) , respectively (Schwarz et al. 2017 ). A lower BMAE value signifies superior performance when comparing methods.
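The balancing effect of BMAE can be sketched as a binned average. The sketch assumes that each sample comes with its angular error and with the absolute angular distance of its ground-truth pose from the frontal pose, that bins are half-open intervals [i, i + d) for i = 0, d, ..., k - d, and that empty bins contribute zero; these choices and the function name are illustrative assumptions, not details taken from the original study.

```python
def balanced_mean_angular_error(errors_deg, gt_distance_deg, d=5.0, k=75.0):
    """Average the per-bin mean error so that every pose range weighs equally.

    errors_deg:      angular error of each sample, in degrees
    gt_distance_deg: absolute angular distance of each ground-truth pose
                     from the frontal pose, in degrees
    """
    if len(errors_deg) != len(gt_distance_deg):
        raise ValueError("inputs must have equal length")
    n_bins = int(k // d)
    total = 0.0
    for b in range(n_bins):
        low, high = b * d, (b + 1) * d
        in_bin = [
            e for e, g in zip(errors_deg, gt_distance_deg) if low <= g < high
        ]
        if in_bin:  # assumption: empty bins add nothing to the sum
            total += sum(in_bin) / len(in_bin)
    return (d / k) * total
```

With many near-frontal samples and few extreme ones, the plain mean is dominated by the frontal bin, whereas this binned average gives the rare large-angle samples the same weight as the frequent frontal ones.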

2.4.6 Root mean squared error (RMSE)

The RMSE is a widely employed metric in machine learning and statistics to measure the average magnitude of errors between predicted \({\hat{\theta }}_i\) and actual \({\theta }_i\) values. It is computed as the square root of the mean of the squared differences between actual and predicted values. RMSE measures how well a predictive model or algorithm performs, with smaller values indicating better predictive accuracy. However, RMSE is not a widely used metric in HPE frameworks. Its mathematical equation is as follows:

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( {\hat{\theta }}_i - \theta _i \right) ^2}$$

where n is the number of images for testing (Firintepe et al. 2020 ).
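A minimal sketch of RMSE for a single angle follows, assuming predictions and ground truth in degrees; the function name is hypothetical. Because the errors are squared before averaging, a few large errors raise RMSE more than they raise MAE.

```python
import math


def root_mean_squared_error(pred_deg, true_deg):
    """Square root of the mean squared difference between two angle lists."""
    if len(pred_deg) != len(true_deg) or not pred_deg:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - t) ** 2 for p, t in zip(pred_deg, true_deg)]
    return math.sqrt(sum(squared) / len(squared))


# Two error profiles with the same MAE (2 degrees) but different RMSE.
even_errors = root_mean_squared_error([2.0, 2.0], [0.0, 0.0])
uneven_errors = root_mean_squared_error([0.0, 4.0], [0.0, 0.0])
print(even_errors, uneven_errors)
```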

3 Datasets and ground-truth techniques

This section discusses the datasets commonly used for HPE research. Table  2 presents an overview of currently available datasets. These datasets vary regarding the number of images, variety of head poses, annotations available, and other characteristics.

3.1 Datasets for head pose estimation

Many public datasets for HPE have been published since 2010. Ordered from the most recent to the oldest, the following datasets are included in the literature:

2DHeadPose dataset (Wang et al. 2023 ) contains 10,376 RGB images, a rich set of angles, dimensions, and attributes, in 19 scenes, such as masked, obscured, dark, and blurred. A 3D virtual human head was used to simulate the head pose in the images, and Gaussian label smoothing was used to suppress annotation noises, resulting in the 2DHeadPose dataset.

Dad-3dheads dataset (Martyniuk et al. 2022 ) is an accurate, dense, large-scale, and diverse dataset for 3D head alignment from a single image. This database is an in-the-wild collection featuring a diverse range of head poses, facial expressions, challenging lighting conditions, image quality, age groups, and instances of occlusions. It comprises 44,898 images annotated utilizing a 3D head model. It includes annotations for more than 3.5K landmarks that provide precise representations of 3D head shapes compared to ground-truth scans.

AGORA (Patel et al. 2021 ) is a 3D synthetic dataset for the 3D human pose and shape (3DHPS) estimation task. The dataset consists of training, validation, and test images. Corresponding Masks, SMPL-X (Pavlakos et al. 2019 )/SMPL (Loper et al. 2015 ) ground truth, and camera information are also provided for training and validation images. The AGORA dataset contains 3K test and 14K training images rendering between 5 and 15 subjects per image, and it includes 173K individual person crops. The dataset provides high-realism images, with two resolutions of ( \(3840 \times 2160\) ) and ( \(1280 \times 720\) ) for multi-person with ground truth 3D bodies under the complexity of clothing, environmental conditions, and occlusion. The original dataset does not provide head pose labels. DirectMHP (Zhou et al. 2023 ) generated head pose labels for each person in an image based on SMPL-X (Pavlakos et al. 2019 ) by extracting camera parameters and 3D face landmarks.

MDM corpus dataset (Jha et al. 2021 ) was gathered by recording 59 participants while driving cars and engaging in various tasks. The Fi-Cap helmet device, which continuously tracks head movement using fiducial markers, was employed to capture head poses. This dataset comprises 50.23 h of recordings at various frame rates (approximately one million frames) and encompasses a wide range of head poses across all three rotational axes. This diversity arises from including numerous subjects and considering various main and secondary driving activities in the data collection process. Specifically, yaw angles span \(\pm 80^{\circ }\) around the origin, whereas pitch angles exhibit an asymmetric span ranging from \(-50^\circ\) to \(100^\circ\) .

ETH-XGaze dataset (Zhang et al. 2020 ) is a substantial dataset created for the purpose of gaze estimation, particularly under challenging illumination conditions involving extreme head poses and variations in gaze direction. This dataset contains 1,083,492 images collected from 110 participants with varying head orientations and gaze directions using 18 digital SLR cameras. The age range of participants was 19-41 years.

GOTCHA-I dataset (Barra et al. 2020 ) is a large-scale dataset for face, gait, and HPE, containing 493 videos with an average duration of 4 min (i.e., approximately 137,826 images). The videos were captured using multiple mobile and body-worn cameras with 11 different video modes in both indoor and outdoor environments for 62 subjects, 47 male and 15 female, with an average age between 18 and 20 years. The dataset was extracted in the range of \(\pm 20^{\circ }\) in roll, \(\pm 30^{\circ }\) in pitch, and \(\pm 40^{\circ }\) in yaw with \(5^{\circ }\) deviations.

DD-Pose dataset (Roth and Gavrila 2019 ) is a large-scale driver benchmark consisting of \(2 \times 330k\) images of drivers in a car. The dataset was captured using stereo and RGB cameras with a \(2048 \times 2048\) resolution. Six DoF continuous head pose annotations were acquired by a motion capture system that estimates the pose of a marker fixed at the back of the person’s head to a reference coordinate. This coordinate was calibrated by estimating the position of eight facial landmarks for each subject’s face. The dataset included 27 subjects, 21 male and 6 female, with an average age of 36 years. The oldest and youngest drivers were 64 and 20 years old. The dataset was extracted in the range of \(\pm 100^{\circ }\) in yaw, \(\pm 40^{\circ }\) in pitch, and \(\pm 60^{\circ }\) in roll.

VGGFace2 (Cao et al. 2018 ) is a dataset for recognizing faces across pose and age collected at the University of Oxford by the visual geometry group. The dataset includes over 3.31 million images, with a total of 9,131 subjects (i.e., an average of 362.6 images per subject). The dataset is designed to be used for face recognition tasks and includes a wide variety of poses (yaw, pitch, and roll), expressions, and lighting conditions. Gender information is balanced, with 59.3% males and the remaining females.

SynHead dataset (Gu et al. 2017 ) is a collection of 3D synthetic images of human heads. The dataset includes over 510,960 images, 10 head models (5 male and 5 female), and 70 motion tracks. It is designed to support research in the area of HPE and related applications. The dataset includes annotations for the yaw, pitch, and roll angles of the heads in each image.

DriveAHead (Schwarz et al. 2017 ) is a driver’s head pose dataset that contains images of drivers captured from the interior of a car. The dataset consists of one million images captured from 20 drivers (16 male and 4 female), with each driver captured under realistic driving conditions. The images were captured using a Kinect V2 sensor with a resolution of \(512 \times 424\) .

SASE (Lüsi et al. 2017 ) is a 3D dataset for HPE and emotion. It consists of 30,000 annotated images with their head pose labeled, including 50 subjects (18 female and 32 male). The dataset includes images captured using a Microsoft Kinect 2, with resolutions of ( \(1080 \times 1920\) for RGB frames) and ( \(424 \times 512\) for 16-bit depth frames). The dataset includes images of males and females aged 7-35. The range of yaw angles is \(-75^{\circ }\) to \(75^{\circ }\) , whereas pitch and roll angles are \(-45^{\circ }\) to \(45^{\circ }\) .

Pandora (Borghi et al. 2017 ) is a 3D dataset for the driver’s HPE and upper body under severe illumination changes, occlusions, and extreme poses. It contains more than 250k images, with the corresponding annotations: 110 annotated sequences for 22 subjects (12 female and 10 male). Each participant was recorded five times. The first Kinect version captured images with full-resolution RGB-D images ( \(512 \times 424\) ) and RGB ( \(1920 \times 1080\) pixels). The dataset includes annotations for the yaw, pitch, and roll angles in the range of \(\pm 125^{\circ }\) , \(\pm 100^{\circ }\) , and \(\pm 70^{\circ }\) , respectively.

300W-LP (Zhu et al. 2016 ) is a widely used synthetic dataset, which was generated from the 300W dataset (Sagonas et al. 2013a ), standardizing multiple alignment databases with 68 landmarks. The authors used their face profiling method to generate 61,225 images across large poses collected from multiple datasets: 1,786 from IBUG (Sagonas et al. 2013b ), 5,207 from AFW (Zhu and Ramanan 2012 ), 16,556 from LFPW (Belhumeur et al. 2013 ), and 37,676 from HELEN (Le et al. 2012 ); XM2VTSDB (Messer et al. 1999 ) was not used. It was further expanded to 122,450 images by image flipping. The dataset provides annotations for three angles: yaw, pitch, and roll. The ground truth is provided in the Euler angle format (Hempel et al. 2022 ).

AFLW2000 (Zhu et al. 2016 ) is another commonly used dataset. It consists of the first 2,000 images of the AFLW dataset (Koestinger et al. 2011 ) and includes ground truth 3D faces along with their corresponding 68 facial landmarks. The samples cover various in-the-wild settings with varying lighting and occlusion conditions.

Valle and colleagues (Valle et al. 2020 ) re-annotated the AFLW2000-3D dataset, incorporating poses estimated from accurate landmarks. This revised dataset is named AFLW2000-3D-POSIT. As a result, the mean MAE of their approach decreased to 1.71. This improvement in performance is significant and demonstrates the importance of accurate landmark annotation in HPE. The AFLW2000-3D dataset is commonly employed to evaluate 3D facial landmark detection models.

CCNU (Liu et al. 2016 ) is a 2D dataset for HPE that consists of 4,350 images of 58 human subjects in 75 different poses. The images were captured indoors, covering a range of yaw from \(-90^{\circ }\) to \(90^{\circ }\) and pitch angles from \(-45^{\circ }\) to \(90^{\circ }\) with various illuminations, expressions, low resolution ( \(70 \times 80\) ), and poses. Orientation and position of all head pose ground-truth images were labeled using Senso Motoric instruments (SMI) eye-tracking glasses.

UPNA (Ariz et al. 2016 ) is a dataset designed for HPE. The database includes both 2D and 3D images of faces and features automatic annotation (roll, yaw, and pitch) based on 54 face landmarks using a magnetic sensor. The dataset consists of 120 videos of 10 individuals (6 male and 4 female), with 12 videos per individual. Every video is 10 s long and contains 300 frames. The videos were recorded using a standard webcam with a resolution of \(1280 \times 720\) pixels.

The WIDER FACE dataset (Yang et al. 2016) is a large-scale face detection benchmark. It contains 32,203 images with a total of 393,703 annotated faces. The images were selected for a high degree of variability in scale, pose, and occlusion. WIDER FACE provided labels for the human body, which were annotated manually, but no labels for the head pose. Researchers have applied existing HPE methods (e.g., RetinaFace (Deng et al. 2020) + PnP or FSA-Net (Yang et al. 2019)) to label the angles for each face in the images. Img2Pose (Albiero et al. 2021) annotated the WIDER FACE dataset with head poses in a semi-supervised manner. However, this weakly supervised learning approach leaves considerable room for improvement, particularly in handling small faces and automated labeling, and the labeled WIDER FACE contains no head samples with invisible faces.

CMU Panoptic (Joo et al. 2015) contains 65 sequences (5.5 h) captured as synchronized HD video streams by a massive multi-view camera system. Some of the videos focus primarily on a single person, and others cover multi-person scenarios in a hemispherical device. The original dataset was not designed to provide head pose labels. However, other authors later used a software technique to obtain full-range head pose labels.

Dali3DHP (Tulyakov et al. 2014) is a 3D dataset for HPE research. It consists of 60,000 depth and color images of 33 subjects. The head poses in the dataset cover a range of yaw, pitch, and roll angles, including extreme poses. Specifically, the yaw angle ranges from \(-89.29^{\circ }\) to \(75.57^{\circ }\), the pitch angle from \(-52.6^{\circ }\) to \(65.76^{\circ }\), and the roll angle from \(-29.85^{\circ }\) to \(27.09^{\circ }\). Ground-truth labels were obtained by a Shimmer sensor.

EYEDIAP database (Funes Mora et al. 2014 ) contains 94 sessions for gaze estimation tasks from RGB and RGB-D data. Each session was recorded for 2 to 3 minutes using a Kinect camera, with resolutions of ( \(1920 \times 1080\) ) and ( \(640 \times 480\) ) for a total of more than 4 hours. The participants were 16 people (12 male and 4 female). Several sessions were recorded twice for participants 14, 15, and 16 under various lighting, distances relative to the camera pose, and day conditions. The range of yaw covered \(40^{\circ }\) , which was recorded manually. This data did not provide any annotations for pitch and roll angles.

The McGill database (Demirkus et al. 2014) consists of 60 real-world face and head videos of 60 subjects. The database was recorded indoors and outdoors using a Canon PowerShot SD770 camera at \(640 \times 480\) resolution. Yaw angles varied in the interval \(\pm 90^{\circ }\) and were labeled by a semi-automatic labeling framework. In this database, the gender and face location were also labeled. For each participant, a 60-s video with 30 fps (i.e., 1,800 frames per participant) was recorded under free behavior. Therefore, background clutter and arbitrary illumination conditions are present, especially outdoors, owing to this free behavior. An expert manually labeled the 18,000 frames with the head pose angle to obtain ground truth annotations.

Biwi (Fanelli et al. 2013) is commonly employed in 3D face analysis and comprises 24 sequences featuring 20 individuals (6 female and 14 male, 4 of whom wear glasses). This dataset encompasses a total of 15,000 images and depth data records. The dataset was recorded indoors while people were sitting and turning their heads in front of a Kinect camera positioned approximately one meter away. The head pose varies by \(\pm 50^{\circ }\) in roll, \(\pm 60^{\circ }\) in pitch, and \(\pm 75^{\circ }\) in yaw. Person-specific templates and ICP tracking were employed to annotate the data, providing information in the form of 3D head location and rotations.

The ICT-3DHP dataset (Baltrušaitis et al. 2012) was collected using a Kinect camera and contains 10 RGB-D videos (both RGB and depth data) for a total of approximately 1400 images. The ground truth was obtained using a Polhemus FASTRAK, which is based on electromagnetic technology. The dataset was evaluated for all three angles: yaw, pitch, and roll. Ten people participated (6 male, 4 female).

The AFW dataset (Zhu and Ramanan 2012) is a collection of face images designed to evaluate facial detection and head pose. It contains a total of 205 RGB images of 468 faces, with annotations for facial landmarks. The images were collected from the internet at various resolutions and annotated with a range of yaw, pitch, and roll angles.

In summary, Table 2 presents all the characteristics of the HPE datasets. These characteristics are the number of images or videos and their dimensions, the number of participants and their gender (female and male), the angles (yaw, pitch, and roll), their range (full or narrow), the environment in which the dataset was captured (indoor or outdoor), the data type (i.e., RGB image or depth), and the publication year. Moreover, some datasets provide the age of participants, such as ETH-XGaze, GOTCHA-I, DD-Pose, and SASE, and some provide the translational (x, y, and z) components, such as DD-Pose, Biwi, DriveAHead, SASE, CCNU, and UPNA. Most studies used two or more datasets to evaluate the performance of their methods. However, some works used only one dataset, such as Biwi in Chen et al. (2023) and AFLW2000 in Zhang and Yu (2022). Other HPE datasets published before 2010 are still employed by some researchers to evaluate their work, such as Bosphorus (Savran et al. 2008), CAS-PEAL (Gao et al. 2007), Pointing’04 (PRIMA) (Gourier 2004), BU (La Cascia et al. 2000), and FERET (Phillips et al. 2000). Overall, this discussion of datasets aims to provide a comprehensive understanding of the benchmarking and evaluation practices in HPE research.

3.2 Ground-truth dataset

Ground-truth annotation is an important step in several areas of computer vision, including semantic segmentation, object detection, and pose estimation. A ground-truth dataset is the annotated data employed to train and evaluate HPE techniques. The ground-truth dataset’s accuracy and completeness significantly affect the HPE algorithm’s performance. Creating a high-quality ground-truth dataset is a time-consuming and iterative process that needs careful planning and attention to detail. The most common techniques for creating ground truth data for HPE are described in the following subsections.

3.2.1 Software

In recent years, ground-truth datasets have increasingly been created using software that labels or annotates the data automatically (Wang et al. 2023; Martyniuk et al. 2022; Valle et al. 2020; Yang et al. 2016; Patel et al. 2021). These methods are often employed for annotation tasks in HPE. Software-based approaches can provide accurate and consistent annotations and can be faster and more cost-effective than other methods. However, the effectiveness of such approaches relies on the quality of the software algorithms and the availability of appropriate training data. Moreover, software methods may not handle nuanced or complex annotation tasks, as shown in Fig. 13. The CMU Panoptic (Joo et al. 2015) and AGORA (Patel et al. 2021) datasets originally did not provide head pose annotations. However, AGORA provides 3D facial landmarks and body poses; Zhou et al. (2023) leveraged this information and employed the software technique to annotate the head pose in full-range angles.

Fig. 13 Examples of incorrectly detected and annotated heads using a software technique (AGORA dataset)

3.2.2 Optical motion capture systems (Om-cap)

Om-caps are expensive but robust deployments primarily employed in professional cinematography to capture articulated body movements. A set of near-infrared cameras, typically calibrated by software algorithms, uses multiview stereo to track reflective markers affixed to an individual (Liu et al. 2016). In the context of HPE, these markers can be applied to the rear of a subject’s head (Roth and Gavrila 2019; Schwarz et al. 2017), allowing for accurate position and orientation tracking. This method has facilitated the collection of diverse datasets varying in scope, accuracy, and availability. The Fi-Cap helmet device is similar to an Om-cap system but does not need expensive sensors (Jha et al. 2021).

3.2.3 Magnetic sensors

Sensors like the Flock of Birds or Polhemus FASTRAK operate by emitting and measuring a magnetic field. These sensors can be attached to a person’s head, providing measurements of the head’s orientation angles (Baltrušaitis et al. 2012; Ariz et al. 2016) and position. This method is relatively cost-effective and can collect accurate ground truth; thus, it has been widely adopted as a source of objective pose estimates. However, magnetic sensors are susceptible to noise when metal is present in the surroundings. Consequently, data collection with these sensors is severely constrained, making certain applications, such as automotive HPE, impractical.

3.2.4 Inertial sensors

These sensors employ components such as gyroscopes, accelerometers, or other motion-sensing devices, frequently incorporating a Kalman filter to mitigate noise (Tulyakov et al. 2014). Low-cost commercial sensors, such as the Shimmer sensor, provide orientation angle measurements. However, these sensors do not provide position measurements. Their advantage over magnetic sensors is that they are immune to metallic interference. In HPE methods, inertial sensors can be attached to a person’s head to capture data (Borghi et al. 2017; Tulyakov et al. 2014).
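To illustrate how a Kalman filter can mitigate noise in an orientation reading, the following minimal sketch filters a simulated, noisy yaw signal with a scalar Kalman filter. It is not the filter used by any particular sensor or dataset: the random-walk state model, the variances `process_var` and `meas_var`, and the simulated 20° yaw are all assumptions chosen for illustration.

```python
import math
import random


def kalman_smooth(measurements, process_var=1e-3, meas_var=4.0):
    """Scalar Kalman filter with a random-walk state model (illustrative)."""
    estimate = measurements[0]   # initialise from the first reading
    error_var = meas_var         # initial uncertainty of that estimate
    smoothed = [estimate]
    for z in measurements[1:]:
        # Predict: the angle is assumed to stay put, so only uncertainty grows.
        error_var += process_var
        # Update: blend prediction and new measurement using the Kalman gain.
        gain = error_var / (error_var + meas_var)
        estimate += gain * (z - estimate)
        error_var *= 1.0 - gain
        smoothed.append(estimate)
    return smoothed


def rms_error(values, reference):
    """Root-mean-square deviation of values from a reference angle."""
    return math.sqrt(sum((v - reference) ** 2 for v in values) / len(values))


random.seed(0)
TRUE_YAW = 20.0  # simulated constant head yaw in degrees (assumed)
noisy = [TRUE_YAW + random.gauss(0.0, 2.0) for _ in range(200)]
filtered = kalman_smooth(noisy)

# Compare the second half of the signal, once the filter has settled.
print(rms_error(noisy[100:], TRUE_YAW), rms_error(filtered[100:], TRUE_YAW))
```

A practical inertial pipeline would typically also use the gyroscope rate in the prediction step; the constant-angle model is kept here only to make the predict and update steps easy to follow.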

3.2.5 Camera arrays

This approach uses multiple cameras positioned at various fixed locations and simultaneously captures head images from diverse angles (Zhang et al. 2020 ). The approach offers an accurate ground truth dataset when subjects’ head positions remain consistent during acquisition. However, it is limited to near-field images and unsuitable for real-world video scenarios or fine poses.

3.2.6 Manual annotation

Probably the earliest way of generating a ground-truth dataset involves human observers who assign pose labels based on their subjective perception when viewing head pose images. This method may suffice for a basic set of poses in a single DoF. Nevertheless, it is inadequate for precise HPE, particularly for finer variations, owing to the increased likelihood of human errors (Zhu and Ramanan 2012; Funes Mora et al. 2014). This method is sometimes used with other methods to improve the quality of the dataset annotation, as was done in Demirkus et al. (2014) and Cao et al. (2018).

In summary, Table 2 lists the techniques used for obtaining ground-truth datasets, how the poses are measured (relative or absolute), and the sensors used to capture the datasets with their resolution and ranges. In addition, other older methods were used to annotate head poses, such as directional suggestion with a laser pointer. Most datasets cover only narrow-range angles, as can be observed in Table 2; this limitation should be addressed in future work. For example, Fig. 14 shows the distribution of the angles of full- and narrow-range datasets and their averages.

Fig. 14 Pose label distribution of the three angles

4 Discussion, challenges, and future directions

Each element of HPE discussed above still has open problems. In this section, we first compare the different HPE methods. We then discuss the challenges of HPE, future research directions, and the advantages and limitations of this work.

4.1 Discussion

Table 3 compares different HPE methods, ordered from most recent to oldest. The comparison includes the choice of application, environment, number of tasks, dataset type, angle range, rotation representation, number of DoF, techniques used, landmark-based or landmark-free, rotation type, evaluation metrics, and challenges. The table omits some application papers, such as Yu et al. (2021), Ye et al. (2021), Perdana et al. (2021), and Indi et al. (2021), among others, because they rely on other works without providing further details. We compare 113 papers published in the last seven years: 4, 28, 29, 25, 14, 8, and 5 papers from 2024, 2023, 2022, 2021, 2020, 2019, and 2018, respectively.

We found that papers aimed at improving performance accuracy (Re.) made up around 67% of the total, whereas papers targeting different applications (App) made up around 33%; work implemented in indoor (I) and outdoor (O) environments accounted for 80.6 and 19.4%, respectively, as shown in Fig. 15. Notably, we considered any work implemented without a camera to be work implemented in indoor environments. The accuracy-focused majority reflects how critical precise estimation is for applications in DMSs, AR/VR environments, and so on. The application papers illustrate the versatility of HPE technologies across various domains, including healthcare, entertainment, and security. The strong focus on performance enhancement suggests a mature understanding of core algorithms. However, the relatively smaller proportion of application-focused studies indicates a potential gap in translating these advancements into real-world scenarios. Future research could benefit from a balanced approach that prioritizes both algorithmic improvements and practical applications. The large indoor majority leverages controlled conditions that simplify the estimation process, whereas outdoor environments present significant hurdles such as fluctuating lighting and complex backgrounds. This imbalance highlights an opportunity for future research to develop robust methods capable of handling uncontrolled outdoor settings, which is essential for applications like DMS or public safety systems. This would involve improving models to cope with varying illumination, occlusions, and background clutter.

Furthermore, we found that single-task methods accounted for 25% and multi-task methods for 75%, while methods implemented under narrow- and full-range angles accounted for 94.6 and 5.4%, respectively. The prevalence of multi-task methods indicates a growing trend towards integrating HPE with other tasks, such as facial expression recognition or gaze estimation. This multi-tasking capability is beneficial for comprehensive systems that require simultaneous analysis of multiple aspects of human behavior. Few methods handle full-range angles, which cover the complete spectrum of possible head orientations, because few suitable datasets were available before 2020.

Moreover, methods that adopted Euler angles and rotation matrices accounted for 72.3 and 16.1%, respectively, and other rotation representations for 11.6%; methods implemented under 6 and 3 DoF accounted for 6.2 and 77.7%, respectively, and other methods for 16.1%. Euler angles dominate because they are simple and give an intuitive picture of how the head is oriented in 3D space, while rotation matrices and continuous representations (17.6%) are less common. The limited use of continuous representations, despite their potential to avoid gimbal lock and provide smooth transitions, suggests an area for further innovation. Most methods focus on 3 DoF, which covers basic head movements but does not capture full spatial dynamics; only a few tackle 6 DoF, which includes translational movements. There is a critical need to explore 6 DoF representations, especially for applications like robotics or immersive VR, where understanding full head and body movement is crucial. Additionally, adopting continuous rotation representations could improve model robustness and accuracy.
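The gimbal-lock issue mentioned above can be made concrete with a small numerical sketch. It assumes the composition \(R = R_z(\text{roll})\,R_y(\text{yaw})\,R_x(\text{pitch})\); this axis assignment and order are an assumption for illustration, since conventions differ between datasets and methods. When yaw reaches \(90^{\circ }\), two different (pitch, roll) pairs with the same difference produce the same rotation matrix, so one degree of freedom is lost.

```python
import math


def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def euler_to_matrix(pitch_deg, yaw_deg, roll_deg):
    """R = Rz(roll) * Ry(yaw) * Rx(pitch); the convention is an assumption."""
    p, y, r = (math.radians(v) for v in (pitch_deg, yaw_deg, roll_deg))
    rx = [[1.0, 0.0, 0.0],
          [0.0, math.cos(p), -math.sin(p)],
          [0.0, math.sin(p), math.cos(p)]]
    ry = [[math.cos(y), 0.0, math.sin(y)],
          [0.0, 1.0, 0.0],
          [-math.sin(y), 0.0, math.cos(y)]]
    rz = [[math.cos(r), -math.sin(r), 0.0],
          [math.sin(r), math.cos(r), 0.0],
          [0.0, 0.0, 1.0]]
    return matmul(rz, matmul(ry, rx))


def max_abs_diff(a, b):
    """Largest element-wise difference between two 3x3 matrices."""
    return max(abs(a[i][j] - b[i][j]) for i in range(3) for j in range(3))


# At yaw = 90 degrees only (pitch - roll) matters: both pairs differ by 20.
lock_gap = max_abs_diff(euler_to_matrix(30, 90, 10), euler_to_matrix(50, 90, 30))
# Away from the singularity the same two label pairs give distinct rotations.
free_gap = max_abs_diff(euler_to_matrix(30, 45, 10), euler_to_matrix(50, 45, 30))

print(lock_gap, free_gap)
```

The first gap is zero up to floating-point rounding, while the second is clearly non-zero, which is why Euler-angle labels become ambiguous near the singular pose.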

Fig. 15 Proportion of different HPE methods

Besides, methods based on deep learning, classical learning, and hybrids of the two accounted for 70.5, 24.1, and 5.4%, respectively. The number of deep learning-based HPE publications has continuously increased in recent years. This dominance reflects deep learning’s superior ability to model complex patterns and handle large datasets.

Landmark-based, landmark-free, and hybrid (landmark-based and landmark-free) methods accounted for 71, 26.2, and 2.8%, respectively. Landmark-based methods are predominant, leveraging specific facial features for pose estimation. However, landmark-free methods have gained traction in recent years due to their flexibility and reduced dependency on precise landmark detection. The shift towards landmark-free approaches is promising, particularly for scenarios where landmarks are not easily detectable. Future work should investigate hybrid approaches that combine the strengths of both methodologies to enhance accuracy and robustness across varied conditions.

Finally, methods based on discontinuous representations accounted for 82.4%, whereas methods based on continuous representations accounted for 17.6%, as shown in Fig. 15. The limited use of continuous representations, such as 5D, 6D, and rotation-matrix representations (named for the number of dimensions of the model output), is an area for further research (Zhou et al. 2019), as these can provide a more accurate and robust handling of head poses.
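As an illustration of a continuous representation, the sketch below maps a 6D network output to a rotation matrix with a Gram-Schmidt step, in the spirit of the 6D representation of Zhou et al. (2019). It is a minimal pure-Python sketch rather than any published implementation; the function name `sixd_to_matrix` and the sample input are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def sixd_to_matrix(six):
    """Map six raw numbers to a rotation matrix via Gram-Schmidt."""
    a1, a2 = six[:3], six[3:]
    b1 = normalize(a1)                                   # first basis vector
    proj = dot(b1, a2)
    b2 = normalize([a2[i] - proj * b1[i] for i in range(3)])  # remove b1 part
    b3 = cross(b1, b2)                                   # completes the frame
    # The basis vectors become the columns of the rotation matrix.
    return [[b1[i], b2[i], b3[i]] for i in range(3)]


def determinant(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


R = sixd_to_matrix([1.0, 0.2, -0.1, 0.3, 0.9, 0.4])  # hypothetical raw output
columns = [[R[i][j] for i in range(3)] for j in range(3)]
print(determinant(R), dot(columns[0], columns[1]))
```

Any six numbers whose two halves are not parallel yield a valid rotation, so a regression network does not need to satisfy orthogonality constraints on its raw output.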

On the other hand, the methods that employed the Biwi, AFLW2000, 300W-LP, POINTING’04, and AFLW datasets to train their models accounted for 23.4, 20.3, 17.6, 6.2, and 5.5%, respectively, whereas the authors’ own datasets and other datasets accounted for 6.6 and 20.3%, respectively. This distribution reflects a reliance on a few narrow-range-angle datasets, which may limit the diversity and comprehensiveness of the training data. As a result, there is a risk that these models will not generalize to applications that require a wide range of angles. Therefore, future work should focus on expanding the dataset variety, including more diverse and real-world data, to enhance the robustness and applicability of HPE models.

Moreover, most methods evaluated their performance using MAE, as shown in Fig. 16. Furthermore, only around 25.9% of the methods provide publicly available code, as shown in Table 3. To foster transparency and accelerate progress in this field, the community should prioritize open-source contributions, including sharing datasets and model implementations.
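For reference, the MAE over the three Euler angles can be computed as in the following minimal sketch. The wrap-around handling in `angular_difference` is an assumption that matters mainly for full-range labels; evaluations on narrow-range datasets often use the plain absolute difference. The sample predictions and labels are made up for illustration.

```python
def angular_difference(pred_deg, true_deg):
    """Smallest absolute difference between two angles, in degrees."""
    diff = (pred_deg - true_deg) % 360.0
    return min(diff, 360.0 - diff)


def mean_absolute_error(predictions, ground_truth):
    """Per-angle MAE and their mean for lists of (yaw, pitch, roll) triples."""
    n = len(predictions)
    per_angle = [
        sum(angular_difference(p[k], g[k])
            for p, g in zip(predictions, ground_truth)) / n
        for k in range(3)
    ]
    return per_angle, sum(per_angle) / 3.0


# Made-up sample: the third yaw pair straddles the +/-180 degree boundary.
preds = [(10.0, 5.0, 0.0), (-20.0, 0.0, 3.0), (179.0, 0.0, 0.0)]
labels = [(12.0, 4.0, 0.0), (-25.0, 2.0, 1.0), (-179.0, 1.0, 0.0)]

per_angle_mae, overall_mae = mean_absolute_error(preds, labels)
print(per_angle_mae, overall_mae)
```

With the wrap-around, the yaw error of the third sample is 2° rather than 358°, which keeps a single boundary case from dominating the average.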

Fig. 16 Proportion of different datasets and metrics

4.2 Challenges of HPE

HPE holds immense potential across the aforementioned diverse applications. Despite great success, HPE remains an open research topic, especially in unconstrained environments with complex human motion. Some of the challenges in HPE for real-time applications are as follows:

Accuracy: Achieving accurate HPE in real-time can be difficult because of factors such as variations in lighting conditions, occlusions, and complex head movements. Ensuring high accuracy is crucial for applications such as facial recognition and gaze estimation. MAE should be \(5^{\circ }\) or less (Asperti and Filippini 2023 ).

Speed: Real-time applications require fast (30 fps or faster) and efficient HPE algorithms to process video streams in real-time (Murphy-Chutorian and Trivedi 2008 ). The challenge lies in developing algorithms that can provide accurate results within the limited processing time available.

Robustness: HPE algorithms should be robust to variations in head appearance, such as strong illumination conditions, large head pose variations, and occlusions (Asperti and Filippini 2023 ; Xu et al. 2022 ). Robustness is essential to ensure accurate estimation regardless of the individual’s appearance.

Variability: People have different head shapes, sizes, and orientations, which adds to the challenge of HPE (Murphy-Chutorian and Trivedi 2008 ; Baltanas et al. 2020 ). Algorithms need to handle this variability and adapt to different individuals to provide accurate and consistent results.

Real-world conditions: Real-time HPE should be able to handle challenging real-world conditions, such as different light conditions, varying camera viewpoints, cluttered backgrounds, and noisy environments (Fanelli et al. 2012 ; Madrigal and Lerasle 2020 ). These factors can affect the accuracy and reliability of the estimation.

Computational resources: Real-time HPE requires efficient utilization of computational resources, especially in resource-constrained environments, such as embedded systems or mobile devices (Fanelli et al. 2011 ). Balancing accuracy and computational efficiency is a challenge in developing real-time algorithms.

Researchers and developers are continuously working on addressing these challenges by exploring advanced techniques, such as deep learning models, optimization algorithms, and data augmentation methods, to improve the accuracy, speed, and robustness of HPE for real-time applications. Murphy-Chutorian and Trivedi (2008) proposed design criteria that an HPE method must satisfy: it must be accurate, autonomous, multi-person, monocular, resolution independent, and invariant to lighting, allow the full range of head motion, and provide real-time results.

In summary, advanced deep learning techniques have significantly enhanced the performance of HPE methods. In recent years, the utilization of neural networks with continuous rotation representations has resulted in notable improvements in landmark-free methods, exemplified by TriNet (Cao et al. 2021 ) and 6DRepNet (Hempel et al. 2022 ). Table  1 presents the advantages and disadvantages of the main steps of HPE.

4.3 Future research directions

In this survey, we identified several potential research directions for HPE. One area of interest is integrating multimodal information, such as audio and gaze cues, to improve the accuracy and robustness of HPE. For example, audio-visual synchronization techniques can enhance HPE in noisy environments, and integrating gaze tracking with HPE can be crucial in scenarios like DMSs, where both head orientation and gaze direction are critical. Another direction is the exploration of self-supervised and unsupervised learning methods to decrease the dependence on annotated data and improve generalization to novel scenarios. For example, contrastive learning could be used to learn robust feature representations from video sequences, allowing HPE models to adapt to new environments without extensive retraining. Additionally, developing more effective and efficient training strategies, such as curriculum learning or adversarial training, could lead to better performance and scalability of HPE models. For example, in curriculum learning, training begins with simpler tasks, such as predicting head rotation angles in a controlled environment, and progresses to more complex tasks, such as predicting in dynamic, multi-person scenarios. Another promising direction is incorporating attention mechanisms and spatial reasoning to enable more fine-grained localization and understanding of head pose. For example, attention mechanisms could be integrated into the model architecture to selectively focus on key facial regions or contextual features, thereby improving the model’s ability to differentiate subtle head movements. Furthermore, developing HPE techniques for specific applications, such as real-time tracking in VR or surveillance scenarios, could provide new opportunities for the practical deployment of HPE models. To meet the requirements of various applications, design criteria should be used as a roadmap for future developments.
For example, the MAE should be \(5^\circ\) or less under the full range of head motion; the process should be autonomous, with no manual initialization; and the method should estimate the poses of multiple people in a single image, at both high and low resolution, from a monocular camera, under the dynamic lighting found in many environments, and in real time at a high frame rate (30 fps or faster), so as to address all the challenges mentioned in Sect. 4.2. Moreover, the relationship between human pose and head pose can be explored through various methodologies to analyze the orientation and position of a person’s body and head. This relationship is significant in applications such as HCI and surveillance because understanding the head pose within the context of the overall body pose can provide valuable insights into a person’s actions and intentions. Finally, the ethical and social implications of HPE, such as privacy concerns and potential biases, warrant further investigation and consideration in future research. Therefore, research is needed into privacy-preserving techniques, such as training models locally on devices without sharing sensitive data, together with guidelines for data collection and user consent. Additionally, bias mitigation strategies in HPE models should be explored to ensure fair and equitable outcomes across diverse populations. Overall, these potential research directions highlight the exciting opportunities and challenges in the field of HPE, and we hope that this survey paper will contribute to advancing the state-of-the-art and facilitating further research in this important area of computer vision.

4.4 Advantages and limitations of this work

The advantages and limitations of this work can be outlined as follows:

4.4.1 Advantages

Comprehensive coverage: The survey encompasses over 214 papers published until 2024, providing a broad overview of advancements in HPE, which is crucial given the rapid development in this field. This extensive review allows readers to gain insights into various methodologies and applications of HPE.

Detailed categorization: The survey classifies HPE techniques into categories. This structured approach aids in understanding the relationships between different components of HPE systems and facilitates systematic exploration of the field.

Discussion of state-of-the-art techniques: The survey includes descriptions of the latest advancements in HPE, such as the use of continuous rotation representations and attention mechanisms. This focus on cutting-edge methods makes it a valuable resource for researchers looking to understand current trends and innovations.

Comparison of datasets: The survey includes a thorough comparison of publicly available datasets relevant to HPE, summarizing their characteristics and annotations. This information is crucial for researchers when selecting appropriate datasets for training and evaluation.

Identification of challenges and future directions: This work identifies current challenges in HPE, such as handling occlusions, achieving real-time performance, and dealing with diverse conditions, and suggests future research directions. This forward-looking perspective is valuable for guiding ongoing and future studies in the field.

4.4.2 Limitations

Focus more on recent developments: The emphasis on recent advancements may overlook foundational techniques that still hold relevance. A more balanced view that includes both historical and contemporary methods could enhance the survey’s comprehensiveness.

A short explanation of applications: Although the survey identifies HPE applications, the explanation may be perceived as relatively brief due to space limitations. A more in-depth exploration of these applications could provide better guidance for future research initiatives. The HPE applications may require a separate survey paper.

5 Conclusions

In this survey paper, we have provided a comprehensive overview of recent techniques for AI-based HPE systems. The survey paper included 214 articles related to AI-based HPE systems published over the last two decades. We compared and analyzed 113 articles published between 2018 and 2024, with 70.5% focusing on deep learning, 24.1% on classical machine learning, and 5.4% on hybrid approaches. We have categorized the steps of HPE frameworks into eleven main categories, discussed the available datasets and evaluation metrics, and identified potential future research directions. The eleven steps were organized into four groups: (1) application context, including the choice of application, the specific tasks, and the environment in which the system will operate; (2) data handling and preparation, covering the type of dataset, the range of angles, the representation method, and the degrees of freedom; (3) techniques and methodologies, covering the techniques used, the approach to landmark detection, and the type of rotation; and (4) evaluation metrics. Moreover, we provided a comprehensive comparison of several publicly available datasets and a visualization of each category’s proportion of different HPE methods. Through this survey, we have highlighted the strengths and limitations of different approaches and provided insights into the challenges and opportunities in this area of computer vision. Overall, our analysis shows that HPE is a challenging and important problem with many potential applications. While significant progress has been made in recent years, there are still many open research questions and practical challenges to be addressed, such as robustness to occlusion and lighting variation, scalability, and efficiency of models for applications that require full-range angles, particularly in unconstrained environments.
We hope that this survey paper will provide a useful resource for researchers and practitioners in the field, facilitating a better understanding and comparison of the different approaches and stimulating further research and development in this important area of computer vision.

Data Availability

No datasets were generated or analysed during the current study.

References

Abate AF, Barra P, Bisogni C, Nappi M, Ricciardi S (2019) Near real-time three axis head pose estimation without training. IEEE Access 7:64256–64265

Abate AF, Barra P, Pero C, Tucci M (2020) Head pose estimation by regression algorithm. Pattern Recogn Lett 140:179–185

Abate AF, Bisogni C, Castiglione A, Nappi M (2022) Head pose estimation: an extensive survey on recent techniques and applications. Pattern Recogn 127:108591

Ahuja K, Kim D, Xhakaj F, Varga V, Xie A, Zhang S, Townsend JE, Harrison C, Ogan A, Agarwal Y (2019) Edusense: Practical classroom sensing at scale. Proc ACM Interac Mob Wear Ubiquitous Technol 3(3):1–26

Al-Nuimi AM, Mohammed GJ (2021) Face direction estimation based on mediapipe landmarks. In: 2021 7th International Conference on Contemporary Information Technology and Mathematics (ICCITM), pp 185–190. IEEE, Mosul, Iraq

Albiero V, Chen X, Yin X, Pang G, Hassner T (2021) img2pose: Face alignment and detection via 6dof, face pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 7617–7627. IEEE, Nashville, TN, USA

Algabri R, Choi M-T (2021) Target recovery for robust deep learning-based person following in mobile robots: online trajectory prediction. Appl Sci 11(9):4165

Algabri R, Choi M-T (2022) Online boosting-based target identification among similar appearance for person-following robots. Sensors 22(21):8422

Algabri R, Shin H, Lee S (2024) Real-time 6dof full-range markerless head pose estimation. Expert Syst Appl 239:122293

Alghowinem S, Goecke R, Wagner M, Parkerx G, Breakspear M (2013) Head pose and movement analysis as an indicator of depression. In: 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, pp 283–288. IEEE, Geneva, Switzerland

Ariz M, Bengoechea JJ, Villanueva A, Cabeza R (2016) A novel 2d/3d database with automatic face annotation for head tracking and pose estimation. Comput Vis Image Underst 148:201–210

Ariz M, Villanueva A, Cabeza R (2019) Robust and accurate 2d-tracking-based 3d positioning method: application to head pose estimation. Comput Vis Image Underst 180:13–22

Asperti A, Filippini D (2023) Deep learning for head pose estimation: a survey. SN Comput Sci 4(4):349

Bafti SM, Chatzidimitriadis S, Sirlantzis K (2022) Cross-domain multitask model for head detection and facial attribute estimation. IEEE Access 10:54703–54712

Baltanas S-F, Ruiz-Sarmiento J-R, Gonzalez-Jimenez J (2020) A face recognition system for assistive robots. In: Proceedings of the 3rd International Conference on Applications of Intelligent Systems, pp 1–6. ACM, Las Palmas de Gran Canaria, Spain

Baltrusaitis T, Zadeh A, Lim YC, Morency L-P (2018) Openface 2.0: Facial behavior analysis toolkit. In: 2018 13th IEEE international conference on Automatic Face & Gesture Recognition (FG 2018), pp 59–66. IEEE, Xi’an, China

Baltrušaitis T, Robinson P, Morency L-P (2012) 3d constrained local model for rigid and non-rigid facial tracking. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp 2610–2617. IEEE, Providence, RI, USA

Barra P, Barra S, Bisogni C, De Marsico M, Nappi M (2020) Web-shaped model for head pose estimation: an approach for best exemplar selection. IEEE Trans Image Process 29:5457–5468

Barra P, Bisogni C, Nappi M, Freire-Obregón D, Castrillón-Santana M (2020) Gotcha-i: A multiview human videos dataset. Security in Computing and Communications: 7th International Symposium. SSCC 2019, Trivandrum, India, December 18–21, 2019, Revised Selected Papers. Springer, Springer, Singapore, pp 213–224

Barra P, Distasi R, Pero C, Ricciardi S, Tucci M (2022) Gradient boosting regression for faster partitioned iterated function systems-based head pose estimation. IET Biometr 11(4):279–288

Basak S, Corcoran P, Khan F, Mcdonnell R, Schukat M (2021) Learning 3d head pose from synthetic data: a semi-supervised approach. IEEE Access 9:37557–37573

Becattini F, Bisogni C, Loia V, Pero C, Hao F (2023) Head pose estimation patterns as deepfake detectors. ACM Transactions on Multimedia Computing, Communications and Applications

Book   Google Scholar  

Belhumeur PN, Jacobs DW, Kriegman DJ, Kumar N (2013) Localizing parts of faces using a consensus of exemplars. IEEE Trans Pattern Anal Mach Intell 35(12):2930–2940

Belmonte R, Allaert B, Tirilly P, Bilasco IM, Djeraba C, Sebe N (2021) Impact of facial landmark localization on facial expression recognition. IEEE Trans Affect Comput 14(2):1267–1279

Benini S, Khan K, Leonardi R, Mauro M, Migliorati P (2019) Face analysis through semantic face segmentation. Signal Process 74:21–31

Google Scholar  

Bernardes E, Viollet S (2022) Quaternion to euler angles conversion: A direct, general and computationally efficient method. PLoS ONE 17(11):0276302

Berral-Soler R, Madrid-Cuevas FJ, Muñoz-Salinas R, Marín-Jiménez MJ (2021) Realheponet: a robust single-stage convnet for head pose estimation in the wild. Neural Comput Appl 33(13):7673–7689

Bisogni C, Nappi M, Pero C, Ricciardi S (2021) Pifs scheme for head pose estimation aimed at faster face recognition. IEEE Trans Biometr Behav Identity Sci 4(2):173–184

Bisogni C, Nappi M, Pero C, Ricciardi S (2021) Fashe: A fractal based strategy for head pose estimation. IEEE Trans Image Process 30:3192–3203

Bisogni C, Cascone L, Nappi M, Pero C (2024) Iot-enabled biometric security: enhancing smart car safety with depth-based head pose estimation. ACM Transactions on Multimedia Computing, Communications and Applications

Borghi G, Venturelli M, Vezzani R, Cucchiara R (2017) Poseidon: Face-from-depth for driver pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4661–4670. IEEE, Honolulu, HI, USA

Borghi G, Fabbri M, Vezzani R, Calderara S, Cucchiara R (2018) Face-from-depth for head pose estimation on depth images. IEEE Trans Pattern Anal Mach Intell 42(3):596–609

Bulat A, Tzimiropoulos G (2017) How far are we from solving the 2d & 3d face alignment problem?(and a dataset of 230,000 3d facial landmarks). In: Proceedings of the IEEE International Conference on Computer Vision, pp 1021–1030. IEEE, Venice, Italy

Cantarini G, Tomenotti FF, Noceti N, Odone F (2022) Hhp-net: A light heteroscedastic neural network for head pose estimation with uncertainty. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp 3521–3530. IEEE, Waikoloa, HI, USA

Cao Z, Simon T, Wei S-E, Sheikh Y (2017) Realtime multi-person 2d pose estimation using part affinity fields. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 7291–7299. IEEE, Honolulu, HI, USA

Cao Q, Shen L, Xie W, Parkhi OM, Zisserman A (2018) Vggface2: A dataset for recognising faces across pose and age. In: 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pp 67–74. IEEE, Xi’an, China

Cao Z, Chu Z, Liu D, Chen Y (2021) A vector-based representation to enhance head pose estimation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp 1188–1197. IEEE, Waikoloa, HI, USA

Celestino J, Marques M, Nascimento JC, Costeira JP (2023) 2d image head pose estimation via latent space regression under occlusion settings. Pattern Recogn 137:109288

Chai W, Chen J, Wang J, Velipasalar S, Venkatachalapathy A, Adu-Gyamfi Y, Merickel J, Sharma A (2023) Driver head pose detection from naturalistic driving data. IEEE Trans Intell Transp Syst 24(9):9368–9377

Chen S, Zhang Y, Yin B, Wang B (2021) Trfh: towards real-time face detection and head pose estimation. Pattern Anal Appl 24:1745–1755

Chen J, Xu H, Bian M, Shi J, Huang Y, Cheng C (2022) Fine-grained head pose estimation based on a 6d rotation representation with multiregression loss. In: International conference on Collaborative Computing: Networking. Applications and Worksharing. Springer, Cham, pp 231–249

Chapter   Google Scholar  

Chen J, Li Q, Ren D, Cao H, Ling H (2023) Asymmetry-aware bilinear pooling in multi-modal data for head pose estimation. Signal Process 110:116895

Chen X, Lu Y, Cao B, Lin D, Ahmad I (2023) Lightweight head pose estimation without keypoints based on multi-scale lightweight neural network. Vis Comput 39:1–15

Chen K, Wu Z, Huang J, Su Y (2023) Self-attention mechanism-based head pose estimation network with fusion of point cloud and image features. Sensors 23(24):9894

Chuang CY, Craig SD, Femiani J (2017) Detecting probable cheating during online assessments based on time delay and head pose. High Educ Res Dev 36(6):1123–1137

Cobo A, Valle R, Buenaposada JM, Baumela L (2024) On the representation and methodology for wide and short range head pose estimation. Pattern Recogn 149:110263

Çeliktutan O, Ulukaya S, Sankur B (2013) A comparative study of face landmarking techniques. EURASIP J Image Video Process 2013(1):1–27

Dantam NT (2021) Robust and efficient forward, differential, and inverse kinematics using dual quaternions. Int J Robot Res 40(10–11):1087–1105

Dapogny A, Bailly K, Cord M (2020) Deep entwined learning head pose and face alignment inside an attentional cascade with doubly-conditional fusion. In: 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), pp 192–198. IEEE, Buenos Aires, Argentina

Demirkus M, Clark JJ, Arbel T (2014) Robust semi-automatic head pose labeling for real-world face video sequences. Multimed Tools Appl 70:495–523

Deng J, Guo J, Ververas E, Kotsia I, Zafeiriou S (2020) Retinaface: Single-shot multi-level face localisation in the wild. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 5203–5212. IEEE, Seattle, WA, USA

Dhingra N (2021) Headposr: End-to-end trainable head pose estimation using transformer encoders. In: 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021), pp 1–8. IEEE, Jodhpur, India

Dhingra N (2022) Lwposr: Lightweight efficient fine grained head pose estimation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp 1495–1505. IEEE, Waikoloa, HI, USA

Ding X, Zhang X, Ma N, Han J, Ding G, Sun J (2021) Repvgg: Making vgg-style convnets great again. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 13733–13742. IEEE, Nashville, TN, USA

Drouard V, Horaud R, Deleforge A, Ba S, Evangelidis G (2017) Robust head-pose estimation based on partially-latent mixture of linear regressions. IEEE Trans Image Process 26(3):1428–1440

Article   MathSciNet   Google Scholar  

Du G, Wang K, Lian S, Zhao K (2021) Vision-based robotic grasping from object localization, object pose estimation to grasp estimation for parallel grippers: a review. Artif Intell Rev 54(3):1677–1734

Edinger J, Heck M., Lummer L, Wachner A, Becker C (2023) Hands-free mobile device control through head pose estimation. In: 2023 IEEE International Conference on Pervasive Computing and Communications Workshops and Other Affiliated Events (PerCom Workshops), pp 367–373. IEEE, Atlanta, GA, USA

Egger HL, Pine DS, Nelson E, Leibenluft E, Ernst M, Towbin KE, Angold A (2011) The nimh child emotional faces picture set (nimh-chefs): a new set of children’s facial emotion stimuli. Int J Methods Psychiatr Res 20(3):145–156

Evans PR (2001) Rotations and rotation matrices. Acta Crystallogr D 57(10):1355–1359

Fanelli G, Gall J, Van Gool L (2011) Real time head pose estimation with random regression forests. In: CVPR 2011, pp 617–624. IEEE, Colorado Springs, CO, USA

Fanelli G, Gall J, Van Gool L (2012) Real time 3d head pose estimation: Recent achievements and future challenges. 2012 5th International Symposium on Communications. Control and Signal Processing. IEEE, Rome, Italy, pp 1–4

Fanelli G, Dantone M, Gall J, Fossati A, Van Gool L (2013) Random forests for real time 3d face analysis. Int J Comput Vision 101:437–458

Fard AP, Abdollahi H, Mahoor M (2021) Asmnet: A lightweight deep neural network for face alignment and pose estimation. In: Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pp 1521–1530. IEEE, Nashville, TN, USA

Firintepe A, Selim M, Pagani A, Stricker D (2020) The more, the merrier? a study on in-car ir-based head pose estimation. In: 2020 IEEE Intelligent Vehicles Symposium (IV), pp 1060–1065. IEEE, Las Vegas, NV, USA

Fu Q, Xie K, Wen C, He J, Zhang W, Tian H, Yang S (2023) Adaptive occlusion hybrid second-order attention network for head pose estimation. Int J Mach Learn Cyber 1:1–17

Funes Mora KA, Monay F, Odobez J-M (2014) Eyediap: A database for the development and evaluation of gaze estimation algorithms from rgb and rgb-d cameras. In: Proceedings of the symposium on Eye Tracking Research and Applications, pp 255–258. ACM, Safety Harbor, Florida, USA

Gao X-S, Hou X-R, Tang J, Cheng H-F (2003) Complete solution classification for the perspective-three-point problem. IEEE Trans Pattern Anal Mach Intell 25(8):930–943

Gao W, Cao B, Shan S, Chen X, Zhou D, Zhang X, Zhao D (2007) The cas-peal large-scale chinese face database and baseline evaluations. IEEE Trans Syst Man Cybern Part A 38(1):149–161

Gourier N (2004) Estimating face orientation from robust detection of salient facial features. In: Proceedings of Pointing 2004, ICPR, International Workshop on Visual Observation of Deictic Gestures, Cambridge, UK

Gu J, Yang X, De Mello S, Kautz J (2017) Dynamic facial analysis: From bayesian filtering to recurrent neural network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1548–1557. IEEE, Honolulu, HI, USA

Guo J, Zhu X, Yang Y, Yang F, Lei Z, Li SZ (2020) Towards fast, accurate and stable 3d dense face alignment. In: European Conference on Computer Vision, pp 152–168. Springer, Glasgow, UK

Gupta A, Thakkar K, Gandhi V, Narayanan P (2019) Nose, eyes and ears: Head pose estimation by locating facial keypoints. ICASSP 2019–2019 IEEE International Conference on Acoustics. Speech and Signal Processing (ICASSP). IEEE, Brighton, UK, pp 1977–1981

Hammadi Y, Grondin F, Ferland F, Lebel K (2022) Evaluation of various state of the art head pose estimation algorithms for clinical scenarios. Sensors 22(18):6850

He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 770–778. IEEE, Las Vegas, NV, USA

Hempel T, Abdelrahman AA, Al-Hamadi A (2024) Toward robust and unconstrained full range of rotation head pose estimation. IEEE Trans Image Process 33:2377–2387

Hempel T, Abdelrahman AA, Al-Hamadi A (2022) 6d rotation representation for unconstrained head pose estimation. In: 2022 IEEE International Conference on Image Processing (ICIP), pp 2496–2500. IEEE, Bordeaux, France

Holzinger S, Gerstmayr J (2021) Time integration of rigid bodies modelled with three rotation parameters. Multibody SysDyn 53:1–34

MathSciNet   Google Scholar  

Hsu W-Y, Chung C-J (2020) A novel eye center localization method for head poses with large rotations. IEEE Trans Image Process 30:1369–1381

Hsu H-W, Wu T-Y, Wan S, Wong WH, Lee C-Y (2018) Quatnet: Quaternion-based head pose estimation with multiregression loss. IEEE Trans Multimed 21(4):1035–1046

Hu T, Jha S, Busso C (2020) Robust driver head pose estimation in naturalistic conditions from point-cloud data. In: 2020 IEEE Intelligent Vehicles Symposium (IV), pp 1176–1182. IEEE, Las Vegas, NV, USA

Hu T, Jha S, Busso C (2021) Temporal head pose estimation from point cloud in naturalistic driving conditions. IEEE Trans Intell Transp Syst 23(7):8063–8076

Hu Z, Xing Y, Lv C, Hang P, Liu J (2021) Deep convolutional neural network-based bernoulli heatmap for head pose estimation. Neurocomputing 436:198–209

Hu Z, Zhang Y, Xing Y, Li Q, Lv C (2022) An integrated framework for multi-state driver monitoring using heterogeneous loss and attention-based feature decoupling. Sensors 22(19):7415

Huang S-H, Yang Y-I, Chu C-H (2012) Human-centric design personalization of 3d glasses frame in markerless augmented reality. Adv Eng Inform 26(1):35–45

Huang B, Chen R, Xu W, Zhou Q (2020) Improving head pose estimation using two-stage ensembles with top-k regression. Image Vis Comput 93:103827

Hwang G, Hong S, Lee S, Park S, Chae G (2023) Discohead: audio-and-video-driven talking head generation by disentangled control of head pose and facial expressions. In: ICASSP 2023–2023 IEEE International Conference on Acoustics. Speech and Signal Processing (ICASSP). IEEE, Rhodes Island, Greece, pp 1–5

Höffken M, Tarayan E, Kreßel U, Dietmayer K (2014) Stereo vision-based driver head pose estimation. In: 2014 IEEE Intelligent Vehicles Symposium Proceedings, pp 253–260. IEEE, Dearborn, MI, USA

Indi CS, Pritham V, Acharya V, Prakasha K (2021) Detection of malpractice in e-exams by head pose and gaze estimation. Int J Emerg Technol Learn 16(8):47

Janota A, Šimák V, Nemec D, Hrbček J (2015) Improving the precision and speed of euler angles computation from low-cost rotation sensor data. Sensors 15(3):7016–7039

Jha S, Busso C (2022) Estimation of driver’s gaze region from head position and orientation using probabilistic confidence regions. IEEE Trans Intell Vehicles 8(1):59–72

Jha S, Marzban MF, Hu T, Mahmoud MH, Al-Dhahir N, Busso C (2021) The multimodal driver monitoring database: a naturalistic corpus to study driver attention. IEEE Trans Intell Transp Syst 23(8):10736–10752

Jha S, Al-Dhahir N, Busso C (2023) Driver visual attention estimation using head pose and eye appearance information. IEEE Open J Intell Transport Syst 4:216–231

Joo H, Liu H, Tan L, Gui L, Nabbe B, Matthews I, Kanade T, Nobuhara S, Sheikh Y (2015) Panoptic studio: A massively multiview system for social motion capture. In: Proceedings of the IEEE International Conference on Computer Vision, pp 3334–3342. IEEE, Santiago, Chile

Ju J, Zheng H, Li C, Li X, Liu H, Liu T (2022) Agcnns: Attention-guided convolutional neural networks for infrared head pose estimation in assisted driving system. Infrared Phys Tachnol 123:104146

Kao Y, Pan B, Xu M, Lyu J, Zhu X, Chang Y, Li X, Lei Z (2023) Towards 3d face reconstruction in perspective projection: Estimating 6dof face pose from monocular image. IEEE Trans Image Process 32:3080–3091

Kazemi V, Sullivan J (2014) One millisecond face alignment with an ensemble of regression trees. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1867–1874. IEEE, Columbus, OH, USA

Khan K, Ahmad N, Khan F, Syed I (2020) A framework for head pose estimation and face segmentation through conditional random fields. SIViP 14:159–166

Khan K, Khan RU, Leonardi R, Migliorati P, Benini S (2021) Head pose estimation: A survey of the last ten years. Signal Process 99:116479

Khan K, Ali J, Ahmad K, Gul A, Sarwar G, Khan S, Ta QTH, Chung T-S, Attique M (2021) 3d head pose estimation through facial features and deep convolutional neural networks. Comput Mater Continua 66:3

Kim S, Kim M (2023) Rotation representations and their conversions. IEEE Access 11:6682–6699

Kim D, Park H, Kim T, Kim W, Paik J (2023) Real-time driver monitoring system with facial landmark-based eye closure detection and head pose recognition. Sci Rep 13(1):18264

Kim Y, Roh J-H, Kim S (2023) Facial landmark, head pose, and occlusion analysis using multitask stacked hourglass. IEEE Access 11:30970–30981

Koestinger M, Wohlhart P, Roth PM, Bischof H (2011) Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization. In: 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pp 2144–2151. IEEE, Barcelona, Spain

Kredel R, Vater C, Klostermann A, Hossner E-J (2017) Eye-tracking technology and the dynamics of natural gaze behavior in sports: a systematic review of 40 years of research. Front Psychol 8:1845

Kuhnke F, Ostermann J (2023) Domain adaptation for head pose estimation using relative pose consistency. IEEE `Trans Biometr Behav Identity Sci 5(3):348–359

Kulshreshth A, LaViola Jr JJ (2013) Evaluating performance benefits of head tracking in modern video games. In: Proceedings of the 1st symposium on spatial user interaction, pp 53–60. ACM, Los Angeles California USA

Kumar A, Kaur A, Kumar M (2019) Face detection techniques: a review. Artif Intell Rev 52:927–948

Kumar A, Alavi A, Chellappa R (2017) Kepler: Keypoint and pose estimation of unconstrained faces by learning efficient h-cnn regressors. In: 2017 12th Ieee International Conference on Automatic Face & Gesture Recognition (fg 2017), pp 258–265. IEEE, Washington, DC, USA

La Cascia M, Sclaroff S, Athitsos V (2000) Fast, reliable head tracking under varying illumination: an approach based on registration of texture-mapped 3d models. IEEE Trans Pattern Anal Mach Intell 22(4):322–336

Le V, Brandt J, Lin Z, Bourdev L, Huang TS (2012) Interactive facial feature localization. Computer Vision-ECCV 2012: 12th European Conference on Computer Vision. Florence, Italy, October 7–13, 2012, Proceedings, Part III 12. Springer, Florence, Italy, pp 679–692

Lee C-H, Liu Z, Wu L, Luo P (2020) Maskgan: towards diverse and interactive facial image manipulation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 5549–5558. IEEE, Seattle, WA, USA

Levinson J, Esteves C, Chen K, Snavely N, Kanazawa A, Rostamizadeh A, Makadia A (2020) An analysis of svd for deep rotation estimation. Adv Neural Inf Process Syst 33:22554–22565

Li X, Zhang D, Li M, Lee D-J (2022) Accurate head pose estimation using image rectification and a lightweight convolutional neural network. IEEE Trans Multimed 25:2239–2251

Li H, Wang B, Cheng Y, Kankanhalli M, Tan RT (2023) Dsfnet: Dual space fusion network for occlusion-robust 3d dense face alignment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 4531–4540. IEEE, Vancouver, BC, Canada

Liu Y, Chen J, Su Z, Luo Z, Luo N, Liu L, Zhang K (2016) Robust head pose estimation using dirichlet-tree distribution enhanced random forests. Neurocomputing 173:42–53

Liu H, Fang S, Zhang Z, Li D, Lin K, Wang J (2021) Mfdnet: collaborative poses perception and matrix fisher distribution for head pose estimation. IEEE Trans Multimed 24:2449–2460

Liu L, Ke Z, Huo J, Chen J (2021) Head pose estimation through keypoints matching between reconstructed 3d face model and 2d image. Sensors 21(5):1841

Liu H, Li D, Wang X, Liu L, Zhang Z, Subramanian S (2021) Precise head pose estimation on hpd5a database for attention recognition based on convolutional neural network in human-computer interaction. Infrared Phys Technol 116:103740

Liu H, Nie H, Zhang Z, Li Y-F (2021) Anisotropic angle distribution learning for head pose estimation and attention understanding in human-computer interaction. Neurocomputing 433:310–322

Liu T, Wang J, Yang B, Wang X (2021) Ngdnet: Nonuniform gaussian-label distribution learning for infrared head pose estimation and on-task behavior understanding in the classroom. Neurocomputing 436:210–220

Liu H, Liu T, Zhang Z, Sangaiah AK, Yang B, Li Y (2022) Arhpe: asymmetric relation-aware representation learning for head pose estimation in industrial human-computer interaction. IEEE Trans Ind Inf 18(10):7107–7117

Liu T, Yang B, Liu H, Ju J, Tang J, Subramanian S, Zhang Z (2022) Gmdl: toward precise head pose estimation via gaussian mixed distribution learning for students’ attention understanding. Infrared Phys Technol 122:104099

Liu F, Chen D, Wang F, Li Z, Xu F (2023) Deep learning based single sample face recognition: a survey. Artif Intell Rev 56(3):2723–2748

Liu H, Zhang C, Deng Y, Liu T, Zhang Z, Li Y-F (2023) Orientation cues-aware facial relationship representation for head pose estimation via transformer. IEEE Trans Image Process 32:6289–6302

Loper M, Mahmood N, Romero J, Pons-Moll G, Black MJ (2015) Smpl: a skinned multi-person linear model. ACM Trans Graph 34(6):1–16

Lu Y, Liu C, Chang F, Liu H, Huan H (2023) Jhpfa-net: Joint head pose and facial action network for driver yawning detection across arbitrary poses in videos. IEEE Trans Intell Transp Syst 24(11):11850–11863

Lugaresi C, Tang J, Nash H, McClanahan C, Uboweja E, Hays M, Zhang F, Chang C-L, Yong MG, Lee J, et al (2019) Mediapipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172

Luo C, Zhang J, Yu J, Chen CW, Wang S (2019) Real-time head pose estimation and face modeling from a depth image. IEEE Trans Multimed 21(10):2473–2481

López-Sánchez D, Arrieta AG, Corchado JM (2020) Compact bilinear pooling via kernelized random projection for fine-grained image categorization on low computational power devices. Neurocomputing 398:411–421

Lüsi I, Junior JCJ, Gorbova J, Baró X, Escalera S, Demirel H, Allik J, Ozcinar C, Anbarjafari G (2017) Joint challenge on dominant and complementary emotion recognition using micro emotion features and head-pose estimation: Databases. In: 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pp 809–813. IEEE, Washington, DC, USA

Ma X, Sang N, Xiao S, Wang X (2021) Learning a deep regression forest for head pose estimation from a single depth image. J Circuits Syst Comput 30(08):2150139

Ma D, Fu T, Yang Y, Cao K, Fan J, Xiao D, Song H, Gu Y, Yang J (2024) Fusion-competition framework of local topology and global texture for head pose estimation. Pattern Recogn 149:110285

Madrigal F, Lerasle F (2020) Robust head pose estimation based on key frames for human-machine interaction. EURASIP J Image Video Process 2020:1–19

Malakshan SR, Saadabadi MSE, Mostofa M, Soleymani S, Nasrabadi NM (2023) Joint super-resolution and head pose estimation for extreme low-resolution faces. IEEE Access 11:11238–11253

Malek S, Rossi S (2021) Head pose estimation using facial-landmarks classification for children rehabilitation games. Pattern Recogn Lett 152:406–412

Martyniuk T, Kupyn O, Kurlyak Y, Krashenyi I, Matas J, Sharmanska V (2022) Dad-3dheads: A large-scale dense, accurate and diverse dataset for 3d head alignment from a single image. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 20942–20952. IEEE, New Orleans, LA, USA

Mellouk W, Handouzi W (2020) Facial emotion recognition using deep learning: review and insights. Proc Comput Sci 175:689–694

Menan, V., Gawesha, A., Samarasinghe, P., Kasthurirathna, D.: Ds-hpe: Deep set for head pose estimation. In: 2023 IEEE 13th Annual Computing and Communication Workshop and Conference (CCWC), pp 1179–1184. IEEE, Las Vegas, NV, USA (2023)

Messer K, Matas J, Kittler J, Luettin J, Maitre G et al (1999) Xm2vtsdb: The extended m2vts database. In: Second International Conference on Audio and Video-based Biometric Person Authentication, vol 964, pp 965–966. Citeseer, Washington D.C, USA

Minaee S, Luo P, Lin Z, Bowyer K (2021) Going deeper into face detection: a survey. arXiv preprint arXiv:2103.14983

Mo S, Miao X (2021) Osgg-net: One-step graph generation network for unbiased head pose estimation. In: Proceedings of the 29th ACM International Conference on Multimedia, pp 2465–2473. ACM, Virtual Event, China

Mogahed HS, Ibrahim MM (2023) Development of a motion controller for the electric wheelchair of quadriplegic patients using head movements recognition. IEEE Embed Syst Lett 1:1–1

Murphy-Chutorian E, Trivedi MM (2008) Head pose estimation in computer vision: a survey. IEEE Trans Pattern Anal Mach Intell 31(4):607–626

Nejkovic V, Öztürk MM, Petrovic N (2022) Head pose healthiness prediction using a novel image quality based stacked autoencoder. Dig Signal Process 130:103696

Patel P, Huang C-HP, Tesch J, Hoffmann DT, Tripathi S, Black M.J (2021) Agora: Avatars in geography optimized for regression analysis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 13468–13478. IEEE, Nashville, TN, USA

Pavlakos G, Choutas V, Ghorbani N, Bolkart T, Osman AA, Tzionas D, Black MJ (2019) Expressive body capture: 3d hands, face, and body from a single image. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10975–10985. IEEE, Long Beach, CA, USA

Perdana MI, Anggraeni W, Sidharta HA, Yuniarno EM, Purnomo MH (2021) Early warning pedestrian crossing intention from its head gesture using head pose estimation. In: 2021 International Seminar on Intelligent Technology and Its Applications (ISITIA), pp 402–407. IEEE, Surabaya, Indonesia

Peretroukhin V, Giamou M, Rosen DM, Greene WN, Roy N, Kelly J (2020) A smooth representation of belief over so (3) for deep rotation learning with uncertainty. arXiv preprint arXiv:2006.01031

Phillips PJ, Moon H, Rizvi SA, Rauss PJ (2000) The feret evaluation methodology for face-recognition algorithms. IEEE Trans Pattern Anal Mach Intell 22(10):1090–1104

Rahmaniar W, Haq QM, Lin T-L (2022) Wide range head pose estimation using a single rgb camera for intelligent surveillance. IEEE Sens J 22(11):11112–11121

Ranjan R, Patel VM, Chellappa R (2017) Hyperface: a deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. IEEE Trans Pattern Anal Mach Intell 41(1):121–135

Ritthipravat P, Chotikkakamthorn K, Lie W-N, Kusakunniran W, Tuakta P, Benjapornlert P (2024) Deep-learning-based head pose estimation from a single rgb image and its application to medical crom measurement. Multimed Tools Appl 1:1–20

Roth M, Gavrila DM (2023) Monocular driver 6 dof head pose estimation leveraging camera intrinsics. IEEE Trans Intell Vehicles 8(8):4057–4068

Roth, M., Gavrila, D.M.: Dd-pose-a large-scale driver head pose benchmark. In: 2019 IEEE Intelligent Vehicles Symposium (IV), pp 927–934. IEEE, Paris, France (2019)

Ruiz N, Chong E, Rehg JM (2018) Fine-grained head pose estimation without keypoints. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp 2074–2083. IEEE, Salt Lake City, UT, USA

Sagonas C, Tzimiropoulos G, Zafeiriou S, Pantic M (2013) 300 faces in-the-wild challenge: The first facial landmark localization challenge. In: Proceedings of the IEEE International Conference on Computer Vision Workshops, pp 397–403. IEEE, Sydney, NSW, Australia

Sagonas C, Tzimiropoulos G, Zafeiriou S, Pantic M (2013) A semi-automatic methodology for facial landmark annotation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp 896–903. IEEE, Portland, OR, USA

Savran A, Alyüz N, Dibeklioğlu H, Çeliktutan O, Gökberk B, Sankur B, Akarun L (2008) Bosphorus database for 3d face analysis. In: Biometrics and Identity Management: First European Workshop, BIOID 2008, May 7-9, 2008. Revised Selected Papers 1, pp 47–56. Springer, Roskilde, Denmark

Schwarz A, Haurilet M, Martinez M, Stiefelhagen R (2017) Driveahead-a large-scale driver head pose dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp 1–10. IEEE, Honolulu, HI, USA

Shao X, Qiang Z, Lin H, Dong Y, Wang X (2020) A survey of head pose estimation methods. 2020 International Conferences on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cybernetics Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData) and IEEE Congress on Cybermatics (Cybermatics). IEEE, Rhodes, Greece, pp 787–796

Shao X (2022) Research on face pose estimation method for virtual try-on system. In: 2022 international seminar on Computer Science and Engineering Technology (SCSET), pp 148–151. IEEE, Indianapolis, IN, USA

Shen J, Qin X, Zhou Z (2022) Head pose estimation in classroom scenes. In: 2022 4th international conference on Artificial Intelligence and Advanced Manufacturing (AIAM), pp 343–349. IEEE, Hamburg, Germany

Singh T, Mohadikar M, Gite S, Patil S, Pradhan B, Alamri A (2021) Attention span prediction using head-pose estimation with deep neural networks. IEEE Access 9:142632–142643

Song C, Wang S, Chen M, Li H, Jia F, Zhao Y (2023) A multimodal discrimination method for the response to name behavior of autistic children based on human pose tracking and head pose estimation. Displays 76:102360

Thai C, Tran V, Bui M, Nguyen D, Ninh H, Tran H (2022) Real-time masked face classification and head pose estimation for rgb facial image via knowledge distillation. Inf Sci 616:330–347

Thai C, Nham N, Tran V, Bui M, Ninh H, Tran H (2023) Multiple teacher knowledge distillation for head pose estimation without keypoints. SN Comput Sci 4(6):758

Tomar V, Kumar N, Srivastava AR (2023) Single sample face recognition using deep learning: a survey. Artif Intell Rev 56(Suppl 1):1063–1111

Tomenotti FF, Noceti N, Odone F (2024) Head pose estimation with uncertainty and an application to dyadic interaction detection. Comput Vis Image Underst 243:103999

Toso M, Pennestrì E, Rossi V (2015) Esa multibody simulator for spacecrafts’ ascent and landing in a microgravity environment. CEAS Space J 7:335–346

Tulyakov S, Vieriu R-L, Semeniuta S, Sebe N (2014) Robust real-time extreme head pose estimation. In: 2014 22nd International Conference on Pattern Recognition, pp 2263–2268. IEEE, Stockholm, Sweden

Valle R, Buenaposada JM, Baumela L (2020) Multi-task head pose estimation in-the-wild. IEEE Trans Pattern Anal Mach Intell 43(8):2874–2881

Viet LN, Dinh TN, Minh DT, Viet HN, Tran QL (2021) Uet-headpose: A sensor-based top-view head pose dataset. In: 2021 13th International Conference on Knowledge and Systems Engineering (KSE), pp 1–7. IEEE, Bangkok, Thailand

Viet HN, Viet LN, Dinh TN, Minh DT, Quac LT (2021) Simultaneous face detection and 360 degree head pose estimation. In: 2021 13th International Conference on Knowledge and Systems Engineering (KSE), pp 1–7. IEEE, Bangkok, Thailand

Vo MT, Nguyen T, Le T (2019) Robust head pose estimation using extreme gradient boosting machine on stacked autoencoders neural network. IEEE Access 8:3687–3694

Wang L, Li S (2023) Wheelchair-centered omnidirectional gaze-point estimation in the wild. IEEE Trans Hum Mach Syst 53(3):466–478

Wang B-Y, Xie K, He S-T, Wen C, He J-B (2022) Head pose estimation in complex environment based on four-branch feature selective extraction and regional information exchange fusion network. IEEE Access 10:41287–41302

Wang Y, Yuan G, Fu X (2022) Driver’s head pose and gaze zone estimation based on multi-zone templates registration and multi-frame point cloud fusion. Sensors 22(9):3154

Wang Q, Lei H, Qian W (2023) Siamese pointnet: 3d head pose estimation with local feature descriptor. Electronics 12(5):1194

Wang Y, Zhou W, Zhou J (2023) 2dheadpose: a simple and effective annotation method for the head pose in rgb images and its dataset. Neural Netw 160:50–62

Wang Q, Lei H, Li G, Wang X, Chen L (2023) A novel convolutional neural network for head detection and pose estimation in complex environments from single-depth images. Cogn Comput 1:1–14

Wang Y, Liu H, Feng Y, Li Z, Wu X, Zhu C (2024) Headdiff: Exploring rotation uncertainty with diffusion models for head pose estimation. In: IEEE Transactions on Image Processing

Wu Y, Ji Q (2019) Facial landmark detection: a literature survey. Int J Comput Vision 127:115–142

Wu C-Y, Xu Q, Neumann U (2021) Synergy between 3dmm and 3d landmarks for accurate 3d facial geometry. In: 2021 international conference on 3D Vision (3DV), pp 453–463. IEEE, London, UK

Wu W, Qian C, Yang S, Wang Q, Cai Y, Zhou Q (2018) Look at boundary: A boundary-aware face alignment algorithm. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 2129–2138. IEEE, Salt Lake City, UT, USA

Xia H, Liu G, Xu L, Gan Y (2022) Collaborative learning network for head pose estimation. Image Vis Comput 127:104555

Xia J, Zhang H, Wen S, Yang S, Xu M (2022) An efficient multitask neural network for face alignment, head pose estimation and face tracking. Expert Syst Appl 205:117368

Xin M, Mo S, Lin Y (2021) Eva-gcn: Head pose estimation based on graph convolutional networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 1462–1471. IEEE, Nashville, TN, USA

Xu Y, Jung C, Chang Y (2022) Head pose estimation using deep neural networks and 3d point clouds. Pattern Recogn 121:108210

Xu X, Teng X (2020) Classroom attention analysis based on multiple euler angles constraint and head pose estimation. In: MultiMedia Modeling: 26th International Conference. MMM 2020, Daejeon, South Korea, January 5–8, 2020, Proceedings, Part I 26. Springer, Daejeon, South Korea, pp 329–340

Yan C, Zhang X (2024) Head pose estimation based on multi-level feature fusion. Int J Pattern Recogni Artif Intell 1:1

Yang S, Luo P, Loy C-C, Tang X (2016) Wider face: A face detection benchmark. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 5525–5533. IEEE, Las Vegas, NV, USA

Yang T-Y, Chen Y-T, Lin Y-Y, Chuang Y-Y (2019) Fsa-net: Learning fine-grained structure aggregation for head pose estimation from a single image. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 1087–1096. IEEE, Long Beach, CA, USA

Yao S-N, Huang C-W (2024) Head-pose estimation based on lateral canthus localizations in 2-d images. In: IEEE Transactions on Human-Machine Systems

Ye M, Zhang W, Cao P, Liu K (2021) Driver fatigue detection based on residual channel attention network and head pose estimation. Appl Sci 11(19):9195

Yu Y, Mora KAF, Odobez J-M (2018) Headfusion: 360 head pose tracking combining 3d morphable model and 3d reconstruction. IEEE Trans Pattern Anal Mach Intell 40(11):2653–2667

Yu H, Gupta A, Lee W, Arroyo I, Betke M, Allesio D, Murray T, Magee J, Woolf BP (2021) Measuring and integrating facial expressions and head pose as indicators of engagement and affect in tutoring systems. In: International Conference on Human-Computer Interaction, pp 219–233. Springer, Virtual Event

Yang X, Jia X, Gong D, Yan D-M, Li Z, Liu W (2023) Larnext: End-to-end lie algebra residual network for face recognition. IEEE Trans Pattern Anal Mach Intell 45(10):11961–11976

Zeng Z, Zhu D, Zhang G, Shi W, Wang L, Zhang X, Li J (2022) Srnet: Structural relation-aware network for head pose estimation. In: 2022 26th International Conference on Pattern Recognition (ICPR), pp 826–832. IEEE, Montreal, QC, Canada

Zhang B, Bao Y (2022) Age estimation of faces in videos using head pose estimation and convolutional neural networks. Sensors 22(11):4171

Zhang J, Yu H (2022) Collaborative 3d face alignment and head pose estimation with frontal face constraint based on rgb and sparse depth. Electron Lett 58(21):801–803

Zhang F, Zhang T, Mao Q, Xu C (2018) Joint pose and expression modeling for facial expression recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3359–3368. IEEE, Salt Lake City, UT, USA

Zhang X, Park S, Beeler T, Bradley D, Tang S, Hilliges O (2020) Eth-xgaze: A large scale dataset for gaze estimation under extreme head pose and gaze variation. Computer Vision-ECCV 2020: 16th European Conference. Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16. Springer, Glasgow, UK, pp 365–381

Zhang C, Liu H, Deng Y, Xie B, Li Y (2023) Tokenhpe: learning orientation tokens for efficient head pose estimation via transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 8897–8906. IEEE, Vancouver, BC, Canada

Zhao W, Wang S, Wang X, Li D, Wang J, Lai C, Li X (2024) Dadl: double asymmetric distribution learning for head pose estimation in wisdom museum. J King Saud Univ Comput Inf Sci 36(1):101869

Zhao N, Ma Y, Li X, Lee S-J, Wang J (2024) 6dflrnet: 6d rotation representation for head pose estimation based on facial landmarks and regression. Multimed Tools Appl, 1–20

Zhou Y, Gregson J (2020) Whenet: Real-time fine-grained estimation for wide range head pose. arXiv preprint arXiv:2005.10353

Zhou Y, Barnes C, Lu J, Yang J, Li H (2019) On the continuity of rotation representations in neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 5745–5753. IEEE, Long Beach, CA, USA

Zhou H, Jiang F, Lu H (2023) A simple baseline for direct 2d multi-person head pose estimation with full-range angles. arXiv preprint arXiv:2302.01110

Zhu X, Ramanan D (2012) Face detection, pose estimation, and landmark localization in the wild. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp 2879–2886. IEEE, Providence, RI, USA

Zhu X, Lei Z, Liu X, Shi H, Li SZ (2016) Face alignment across large poses: A 3d solution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 146–155. IEEE, Las Vegas, NV, USA

Zhu X, Liu X, Lei Z, Li SZ (2017) Face alignment in full pose range: a 3d total solution. IEEE Trans Pattern Anal Mach Intell 41(1):78–92

Zhu X, Yang Q, Zhao L, Dai Z, He Z, Rong W (2022) Dual-position features fusion for head pose estimation for complex scene. Optik 270:169986

Zhu X, Yang Q, Zhao L, Dai Z, He Z, Rong W, Sun J, Liu G (2022) An improved tiered head pose estimation network with self-adjust loss function. Entropy 24(7):974

Zubair M, Kansal S, Mukherjee S (2022) Vision-based pose estimation of craniocervical region: experimental setup and saw bone-based study. Robotica 40(6):2031–2046

Download references

Acknowledgements

This work was supported in part by the research fund of Hanyang University (HY-2023-3239) and in part by the National Research Foundation of Korea (NRF) grant funded by the Korea Government (MSIT) No. 2022R1A4A3033961.

Author information

Authors and affiliations.

Research Institute of Engineering and Technology, Hanyang University, ERICA Campus, Ansan, 15588, Republic of Korea

Redhwan Algabri

School of Software, Northwestern Polytechnical University, Xi’an, Xian, 710072, China

Department of Robotics, Hanyang University, ERICA Campus, Ansan, 15588, Republic of Korea


Contributions

All authors contributed to the conception of the study. Material collection was done by Redhwan Algabri, Ahmed Abdu, and Sungon Lee; arrangement of materials was done by Redhwan Algabri and Ahmed Abdu; the first draft of the manuscript was written by Redhwan Algabri and Ahmed Abdu. Sungon Lee reviewed and advised on this article. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Sungon Lee.

Ethics declarations

Conflict of interest.

The authors declare no conflict of interest.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This appendix contains Table 3 because it is too long.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article

Algabri, R., Abdu, A. & Lee, S. Deep learning and machine learning techniques for head pose estimation: a survey. Artif Intell Rev 57, 288 (2024). https://doi.org/10.1007/s10462-024-10936-7


Accepted: 28 August 2024

Published: 12 September 2024

DOI: https://doi.org/10.1007/s10462-024-10936-7


  • Deep learning
  • Machine learning
  • Head pose estimation
  • Head pose datasets

IMAGES

  1. PhD Research Topics in Deep Learning

  2. ORIGINAL AND HOT RESEARCH TOPICS IN DEEP LEARNING

  3. Machine Learning Research Topics in Deep Learning Models

  4. Deep Learning Research ideas

  5. Top 10 Interesting Deep Learning Thesis Topics (Research Guidance)

  6. Phd Research Proposal Topics in Deep Learning Algorithms

VIDEO

  1. New measurement methods to improve design and safety of hydraulic structures

  2. Samvaad-Talk by Prof. Amit Chattopadhyay (January 15, 2018)

  3. A New Perspective on Complex Network Representaion

  4. Current Research Topics in Finance and Accounting

  5. Deep Learning(CS7015): Lec 3.4 Learning Parameters: Gradient Descent

  6. 35 RESEARCH TOPICS IN PHARMACY

COMMENTS

  1. Deep learning: systematic review, models, challenges, and research

    The current development in deep learning is witnessing an exponential transition into automation applications. This automation transition can provide a promising framework for higher performance and lower complexity. This ongoing transition undergoes several rapid changes, resulting in the processing of the data by several studies, while it may lead to time-consuming and costly models. Thus ...

  2. Machine learning

    Machine learning is the ability of a machine to improve its performance based on previous results. Machine learning methods enable computers to learn without being explicitly programmed and have ...

  3. Deep Learning News

    Their method, IF-COMP, uses the minimum description length principle to provide more reliable confidence measures for AI decisions, crucial in high-stakes settings like healthcare. Deep Learning articles from Neuroscience News cover research from science labs, university research departments and science sources around the world.

  4. Current progress and open challenges for applying deep learning across

    Deep Learning (DL) has recently enabled unprecedented advances in one of the grand challenges in computational biology: the half-century-old problem of protein structure prediction. In this paper ...

  5. Best Deep Learning Research of 2021 So Far

    2021 has been a great year for deep learning research already, including topics like deep reinforcement learning, training deep neural networks, and others. ... the paper advances the current scale of language models by pre-training up to trillion parameter models on the "Colossal Clean Crawled Corpus" and achieve a 4x speedup over the T5 ...

  6. Recent advances in deep learning models: a systematic ...

    In recent years, deep learning has evolved as a rapidly growing and stimulating field of machine learning and has redefined state-of-the-art performances in a variety of applications. There are multiple deep learning models that have distinct architectures and capabilities. Up to the present, a large number of novel variants of these baseline deep learning models is proposed to address the ...

  7. Recent advances and applications of deep learning methods in materials

    Deep learning (DL) is one of the fastest-growing topics in materials data science, with rapidly emerging applications spanning atomistic, image-based, spectral, and textual data modalities. DL ...

  8. Google Research, 2022 & beyond: Algorithms for efficient deep learning

    The explosion in deep learning a decade ago was catapulted in part by the convergence of new algorithms and architectures, a marked increase in data, and access to greater compute. In the last 10 years, AI and ML models have become bigger and more sophisticated — they're deeper, more complex, with more parameters, and trained on much more ...

  9. A decade in deep learning, and what's next

    In 2012, a paper wowed the research world for making a huge jump in accuracy on image recognition using deep neural networks, leading to a series of rapid advances by researchers outside and within Google. Further advances led to applications like Google Photos in 2015, letting you search photos by what's in them. We then developed other deep learning models to help you find addresses in ...

  10. Deep Learning: A Comprehensive Overview on Techniques, Taxonomy

    Deep learning (DL), a branch of machine learning (ML) and artificial intelligence (AI) is nowadays considered as a core technology of today's Fourth Industrial Revolution (4IR or Industry 4.0). Due to its learning capabilities from data, DL technology originated from artificial neural network (ANN), has become a hot topic in the context of computing, and is widely applied in various ...

  11. Deep Learning: Current State

    Deep Learning: Current State. Abstract: Deep learning, a field derived from machine learning, has grown into widespread use with applications as diverse as cancer detection, elephant spotting, and game development. The number of published studies shows an increasing interest by researchers because of its demonstrated ability to achieve high ...

  12. A Thorough Review on Recent Deep Learning Methodologies for Image

    The current research on the field is mostly focused on deep learning-based methods, where attention mechanisms along with deep reinforcement and adversarial learning appear to be in the forefront of this research topic. In this paper, we review recent methodologies such as UpDown, OSCAR, VIVO, Meta Learning and a model that uses conditional ...

  13. Review of deep learning: concepts, CNN architectures, challenges

    In the last few years, the deep learning (DL) computing paradigm has been deemed the Gold Standard in the machine learning (ML) community. Moreover, it has gradually become the most widely used computational approach in the field of ML, thus achieving outstanding results on several complex cognitive tasks, matching or even beating those provided by human performance. One of the benefits of DL ...

  14. AI & Machine Learning Research Topics (+ Free Webinar)

    AI-Related Research Topics & Ideas. Below you'll find a list of AI and machine learning-related research topics ideas. These are intentionally broad and generic, so keep in mind that you will need to refine them a little. Nevertheless, they should inspire some ideas for your project.

  15. Deep Reinforcement Learning: Opportunities and Challenges

    Deep learning and reinforcement learning are underlying techniques. Besides games, reinforcement learning has been making tremendous progress ... To share knowledge and lessons, as well as to identify key research challenges, for the topic of RL for real life, we organized workshops in ICML 2019 and ICML 2021, as well as a virtual workshop in ...

  16. A Survey of Deep Learning: Platforms, Applications and Emerging

    Deep learning has exploded in the public consciousness, primarily as predictive and analytical products suffuse our world, in the form of numerous human-centered smart-world systems, including targeted advertisements, natural language assistants and interpreters, and prototype self-driving vehicle systems. Yet to most, the underlying mechanisms that enable such human-centered smart products ...

  17. Understanding the Research Landscape of Deep Learning in Biomedical

    In the process, we identified the current leading fields, major research topics and techniques, knowledge diffusion, and research collaboration. There was a predominant focus on applying deep learning, especially convolutional neural networks, to radiology and medical imaging, whereas a few studies focused on protein or genome analysis ...

  18. Scaling deep learning for materials discovery

    Discovered stable crystals. Using the described process of scaling deep learning for materials exploration, we increase the number of known stable crystals by almost an order of magnitude. In ...

  19. Deep Learning for Network Intrusion Detection in Virtual Networks

    As organizations increasingly adopt virtualized environments for enhanced flexibility and scalability, securing virtual networks has become a critical part of current infrastructures. This research paper addresses the challenges related to intrusion detection in virtual networks, with a focus on various deep learning techniques. Since physical networks do not use encapsulation, but virtual ...

  20. Deep learning for healthcare: review, opportunities and challenges

    Deep learning framework. Machine learning is a general-purpose method of artificial intelligence that can learn relationships from the data without the need to define them a priori. The major appeal is the ability to derive predictive models without a need for strong assumptions about the underlying mechanisms, which are usually unknown or insufficiently defined.

  21. 7 Best Research Papers To Read To Get Started With Deep Learning

    The modern quality of research has risen to reach greater heights. Each of them contains large amounts of knowledge for an individual to enlighten themselves with. The quality of the high-level research papers is especially true for deep learning, which involves tons of research and time investment.

  22. Deep learning: emerging trends, applications and research challenges

    Jing et al. (2019) evaluated three kinds of deep learning algorithms on the Chinese capital market. Lu (2019) proposed an object-region-enhanced deep learning network, including an object area enhancement strategy and a black-hole-filling strategy. This model can serve as a reference for future research on robust and practical applications.

  23. Deep Learning: A Comprehensive Overview on Techniques, Taxonomy

    Designing General Deep Learning Framework for Target Application Domains One promising research direction for deep learning-based solutions is to develop a general framework that can handle data diversity, dimensions, stimulation types, etc. The general framework would require two key capabilities: the attention mechanism that focuses on the ...

  24. 50 Deep Learning Research Ideas

    3. Developing a deep learning model to detect and classify objects in 3D scenes. 4. Developing a deep learning model to detect and classify objects in audio. 5. Developing a deep learning model to detect and classify objects in text. 6. Develop a deep learning model to generate new images from a given set of images. 7.

  25. Leveraging deep learning and computer vision technologies to enhance

    This paper presents the design and development of a coastal fisheries monitoring system that harnesses artificial intelligence technologies. Application of the system across the Pacific region ...

  26. Electronics

    The rise of the Internet of Things (IoT) has transformed our daily lives by connecting objects to the Internet, thereby creating interactive, automated environments. However, this rapid expansion raises major security concerns, particularly regarding intrusion detection. Traditional intrusion detection systems (IDSs) are often ill-suited to the dynamic and varied networks characteristic of the ...

  27. Development of neuroblastoma tissue diagnostic methods through deep


  28. Knee Osteoporosis Diagnosis Based on Deep Learning

    Osteoporosis, a silent yet debilitating disease, presents a significant challenge due to its asymptomatic nature until fractures occur. Rapid bone loss outpaces regeneration, leading to pain, disability, and loss of independence. Early detection is pivotal for effective management and fracture risk reduction, yet current diagnostic methods are time-consuming. Despite its importance, research ...

  29. Improving Performance in Colorectal Cancer Histology Decomposition

    In routine colorectal cancer management, histologic samples stained with hematoxylin and eosin are commonly used. Nonetheless, their potential for defining objective biomarkers for patient stratification and treatment selection is still being explored. The current gold standard relies on expensive and time-consuming genetic tests. However, recent research highlights the potential of ...

  30. Deep learning and machine learning techniques for head pose ...

    Deep learning techniques The field of deep learning has garnered significant research interest, driven by its diverse applications in online retail, art/film production, video conferencing, and virtual agents. The progress in deep learning has facilitated the on-demand generation of a person's visual attributes, including their face and pose.