Image Classification

4093 papers with code • 152 benchmarks • 251 datasets

Image Classification is a fundamental task in vision recognition that aims to understand and categorize an image as a whole under a specific label. Unlike object detection, which involves classification and location of multiple objects within an image, image classification typically pertains to single-object images. When the classification becomes highly detailed or reaches instance-level, it is often referred to as image retrieval, which also involves finding similar images in a large database.

Source: Metamorphic Testing for Object Detection Systems


Benchmarks

The interactive benchmark table lists, for each dataset, the current best-performing model with links to the corresponding paper and code; it is not reproduced here.


Most implemented papers

Deep Residual Learning for Image Recognition

Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

Very Deep Convolutional Networks for Large-Scale Image Recognition

In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting.

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

We present a class of efficient models called MobileNets for mobile and embedded vision applications.

MobileNetV2: Inverted Residuals and Linear Bottlenecks

In this paper we describe a new mobile architecture, MobileNetV2, that improves the state of the art performance of mobile models on multiple tasks and benchmarks as well as across a spectrum of different model sizes.

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale


While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited.

Densely Connected Convolutional Networks


Recent work has shown that convolutional networks can be substantially deeper, more accurate, and efficient to train if they contain shorter connections between layers close to the input and those close to the output.

EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

Convolutional Neural Networks (ConvNets) are commonly developed at a fixed resource budget, and then scaled up for better accuracy if more resources are available.

Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization

For captioning and VQA, we show that even non-attention based models can localize inputs.

CSPNet: A New Backbone that can Enhance Learning Capability of CNN

Neural networks have enabled state-of-the-art approaches to achieve incredible results on computer vision tasks such as object detection.

Rethinking the Inception Architecture for Computer Vision

Convolutional networks are at the core of most state-of-the-art computer vision solutions for a wide variety of tasks.


Deep learning models for digital image processing: a review

  • Published: 07 January 2024
  • Volume 57, article number 11 (2024)


  • R. Archana
  • P. S. Eliahim Jeevaraj


Within the domain of image processing, a wide array of methodologies is dedicated to tasks including denoising, enhancement, segmentation, feature extraction, and classification. These techniques collectively address the challenges and opportunities posed by different aspects of image analysis and manipulation, enabling applications across various fields. Each of these methodologies contributes to refining our understanding of images, extracting essential information, and making informed decisions based on visual data. Traditional image processing methods and Deep Learning (DL) models represent two distinct approaches to tackling image analysis tasks. Traditional methods often rely on handcrafted algorithms and heuristics, involving a series of predefined steps to process images. DL models learn feature representations directly from data, allowing them to automatically extract intricate features that traditional methods might miss. In denoising, techniques like Self2Self NN, Denoising CNNs, DFT-Net, and MPR-CNN stand out, offering reduced noise while grappling with challenges of data augmentation and parameter tuning. Image enhancement, facilitated by approaches such as R2R and LE-net, showcases potential for refining visual quality, though complexities in real-world scenes and authenticity persist. Segmentation techniques, including PSPNet and Mask-RCNN, exhibit precision in object isolation, while handling complexities like overlapping objects and robustness concerns. For feature extraction, methods like CNN and HLF-DIP showcase the role of automated recognition in uncovering image attributes, with trade-offs in interpretability and complexity. Classification techniques span from Residual Networks to CNN-LSTM, spotlighting their potential in precise categorization despite challenges in computational demands and interpretability. This review offers a comprehensive understanding of the strengths and limitations across methodologies, paving the way for informed decisions in practical applications. As the field evolves, addressing challenges like computational resources and robustness remains pivotal in maximizing the potential of image processing techniques.


1 Introduction

Image Processing (IP) stands as a multifaceted field encompassing a range of methodologies dedicated to gleaning valuable insights from images. Concurrently, the landscape of Artificial Intelligence (AI) has burgeoned into an expansive realm of exploration, serving as the conduit through which intelligent machines strive to replicate human cognitive capacities. Within the expansive domain of AI, Machine Learning (ML) emerges as a pivotal subset, empowering models to autonomously extrapolate outcomes from structured datasets, effectively diminishing the need for explicit human intervention in the decision-making process. At the heart of ML lies Deep Learning (DL), a subset that transcends conventional techniques, particularly in handling unstructured data. DL boasts an unparalleled potential for achieving remarkable accuracy, at times even exceeding human-level performance. This prowess, however, hinges on the availability of copious data to train intricate neural network architectures, characterized by their multilayered composition. Unlike their traditional counterparts, DL models exhibit an innate aptitude for feature extraction, a task that historically posed challenges. This proficiency can be attributed to the architecture's capacity to inherently discern pertinent features, bypassing the need for explicit feature engineering. Rooted in the aspiration to emulate cognitive processes, DL strives to engineer learning algorithms that faithfully mirror the intricacies of the human brain. In this paper, a diverse range of deep learning methodologies, contributed by various researchers, is elucidated within the context of Image Processing (IP) techniques.

This comprehensive compendium delves into the diverse and intricate landscape of Image Processing (IP) techniques, encapsulating the domains of image restoration, enhancement, segmentation, feature extraction, and classification. Each domain serves as a cornerstone in the realm of visual data manipulation, contributing to the refinement, understanding, and utilization of images across a plethora of applications.

Image restoration techniques constitute a critical first step in rectifying image degradation and distortion. These methods, encompassing denoising, deblurring, and inpainting, work tirelessly to reverse the effects of blurring, noise, and other forms of corruption. By restoring clarity and accuracy, these techniques lay the groundwork for subsequent analyses and interpretations, essential in fields like medical imaging, surveillance, and more.

The purview extends to image enhancement, where the focus shifts to elevating image quality through an assortment of adjustments. Techniques that manipulate contrast, brightness, sharpness, and other attributes enhance visual interpretability. This enhancement process, applied across diverse domains, empowers professionals to glean finer details, facilitating informed decision-making and improved analysis.

The exploration further extends to image segmentation, a pivotal process for breaking down images into meaningful regions. Techniques such as clustering and semantic segmentation aid in the discernment of distinct entities within images. The significance of image segmentation is particularly pronounced in applications like object detection, tracking, and scene understanding, where it serves as the backbone of accurate identification and analysis.

Feature extraction emerges as a fundamental aspect of image analysis, entailing the identification of crucial attributes that pave the way for subsequent investigations. While traditional methods often struggle to encapsulate intricate attributes, deep learning techniques excel in autonomously recognizing complex features, contributing to a deeper understanding of images and enhancing subsequent analysis.

Image classification, a quintessential task in the realm of visual data analysis, holds prominence. This process involves assigning labels to images based on their content, playing a pivotal role in areas such as object recognition and medical diagnosis. Both machine learning and deep learning techniques are harnessed to automate the accurate categorization of images, enabling efficient and effective decision-making.

Section 1 introduces the image processing operations considered. Section 2 provides a comprehensive overview of the evaluation metrics employed for the various image processing operations. Section 3 explores the range of Deep Learning (DL) models tailored for image preprocessing tasks. Section 4 outlines the DL methods harnessed for image segmentation, covering their techniques and applications.

Section 5 examines DL strategies for feature extraction, along with their significance and effectiveness. Section 6 turns to DL models designed for image classification, covering their architectures and performance characteristics. The significance of each model is discussed in Sect. 7. Concluding this analysis, Sect. 8 encapsulates the synthesized findings and key takeaways of the study.

The array of papers discussed in this paper collectively present a panorama of DL methodologies spanning various application domains. Notably, these domains encompass medical imagery, satellite imagery, botanical studies involving flower images, as well as fruit images, and even real-time image scenarios. Each domain's unique challenges and intricacies are met with tailored DL approaches, underscoring the adaptability and potency of these methods across diverse real-world contexts.

2 Metrics for image processing operations

Evaluation metrics serve as pivotal tools in the assessment of the efficacy and impact of diverse image processing techniques. These metrics serve the essential purpose of furnishing quantitative measurements that empower researchers and practitioners to undertake an unbiased analysis and facilitate meaningful comparisons among the outcomes yielded by distinct methods. By employing these metrics, the intricate and often subjective realm of image processing can be rendered more objective, leading to informed decisions and advancements in the field.

2.1 Metrics for image preprocessing

2.1.1 Mean squared error (MSE)

The average of the squared differences between predicted and actual values. It penalizes larger errors more heavily.

\(MSE=\frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}{\left({Original}_{(i,j)}-{Denoised}_{(i,j)}\right)}^{2}\), where M and N are the dimensions of the image, and \({Original}_{(i,j)}\) and \({Denoised}_{(i,j)}\) are the pixel values at position (i, j) in the original and denoised images, respectively.

2.1.2 Peak signal-to-noise ratio (PSNR)

PSNR is commonly used to measure the quality of restored images. It compares the original and restored images by considering the mean squared error between their pixel values.

\(PSNR=10\,{log}_{10}\left(\frac{{MAX}^{2}}{MSE}\right)\), where MAX is the maximum possible pixel value (255 for 8-bit images) and MSE is the mean squared error between the original and denoised images.
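
To make the two metrics above concrete, the following sketch computes MSE and PSNR for a pair of 8-bit images with NumPy; the function names and the default 255 maximum pixel value are illustrative rather than taken from any particular library.

```python
import numpy as np

def mse(original: np.ndarray, denoised: np.ndarray) -> float:
    """Mean squared error between two images of identical shape."""
    diff = original.astype(np.float64) - denoised.astype(np.float64)
    return float(np.mean(diff ** 2))

def psnr(original: np.ndarray, denoised: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in decibels (higher is better)."""
    err = mse(original, denoised)
    if err == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10((max_val ** 2) / err)
```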

2.1.3 Structural similarity index (SSIM)

SSIM is applicable to image restoration as well. It assesses the similarity between the original and restored images in terms of luminance, contrast, and structure. Higher SSIM values indicate better restoration quality.

\({SSIM}_{(x,y)}=\frac{\left(2{\mu }_{x}{\mu }_{y}+{c}_{1}\right)\left(2{\sigma }_{xy}+{c}_{2}\right)}{\left({\mu }_{x}^{2}+{\mu }_{y}^{2}+{c}_{1}\right)\left({\sigma }_{x}^{2}+{\sigma }_{y}^{2}+{c}_{2}\right)}\), where \({\mu }_{x}\) and \({\mu }_{y}\) are the mean values of the original and denoised images, \({\sigma }_{x}^{2}\) and \({\sigma }_{y}^{2}\) are their variances, \({\sigma }_{xy}\) is the covariance between them, and \({c}_{1}\) and \({c}_{2}\) are small constants that avoid division by zero.
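
The SSIM expression can be illustrated with a simplified implementation that uses global image statistics; the standard metric evaluates the same expression over local windows and averages the results (as MSSIM below does), and the constants \(c_1=(0.01\,MAX)^2\) and \(c_2=(0.03\,MAX)^2\) are commonly used values assumed here, not values specified in this review.

```python
import numpy as np

def ssim_global(x: np.ndarray, y: np.ndarray, max_val: float = 255.0) -> float:
    """SSIM from global image statistics (the full metric averages the
    same expression over local windows)."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    c1 = (0.01 * max_val) ** 2
    c2 = (0.03 * max_val) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```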

2.1.4 Mean structural similarity index (MSSIM)

MSSIM extends SSIM to multiple patches of the image and calculates the mean SSIM value over those patches.

\(MSSIM(X,Y)=\frac{1}{M}\sum_{i=1}^{M}SSIM({x}_{i},{y}_{i})\), where \({x}_{i}\) and \({y}_{i}\) are corresponding patches of the original and enhanced images and M is the number of patches.

2.1.5 Mean absolute error (MAE)

The average of the absolute differences between predicted and actual values. It provides a more robust measure against outliers.

\(MAE=\frac{1}{n}\sum_{i=1}^{n}\left|{y}_{i}-{\widehat{y}}_{i}\right|\), where n is the number of samples and \({y}_{i}\) and \({\widehat{y}}_{i}\) are the actual and predicted values.

2.1.6 NIQE (Naturalness image quality evaluator)

NIQE quantifies the naturalness of an image by measuring the deviation of local statistics from natural images. It calculates the mean of the local differences in luminance and contrast.

2.1.7 FID (Fréchet inception distance)

FID measures the distance between two distributions (real and generated images) using the Fréchet distance between their feature representations calculated by a pre-trained neural network.

2.2 Metrics for image segmentation

2.2.1 Intersection over union (IoU)

IoU measures the overlap between the predicted bounding box and the ground truth bounding box, defined as the area of their intersection divided by the area of their union. It is commonly used to evaluate object detection models.
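
A minimal sketch of IoU for axis-aligned bounding boxes, assuming boxes are given as (x1, y1, x2, y2) corner coordinates:

```python
def bbox_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: two partially overlapping 10x10 boxes
print(bbox_iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```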

2.2.2 Average precision (AP)

AP measures the precision at different recall levels and computes the area under the precision-recall curve. It is used to assess object detection and instance segmentation models.

2.2.3 Dice similarity coefficient

The Dice similarity coefficient is another measure of similarity between the predicted segmentation and ground truth. It considers both false positives and false negatives.

The Dice Similarity Coefficient, also known as the Sørensen–Dice coefficient, is a common metric for evaluating the similarity between two sets. In the context of image segmentation, it quantifies the overlap between the predicted segmentation and the ground truth, penalizing both false positives and false negatives: \(DSC=\frac{2\left|A\cap B\right|}{\left|A\right|+\left|B\right|}\), where A and B are the sets of pixels in the predicted and ground-truth segmentations. DSC ranges from 0 to 1, where higher values indicate better overlap between the predicted and ground truth segmentations; a DSC of 1 corresponds to a perfect match.
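
For binary segmentation masks, the coefficient can be computed directly from pixel counts; this short sketch assumes masks stored as 0/1 NumPy arrays and adds a small epsilon to guard against empty masks.

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Dice similarity coefficient between two binary masks (values 0/1)."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
```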

2.2.4 Average accuracy (AA)

Average Accuracy measures the overall accuracy of the segmentation by calculating the percentage of correctly classified pixels across all classes.

\(AA=\frac{1}{N}\sum_{i=1}^{N}\frac{{True\,Positives}_{i}+{True\,Negatives}_{i}}{{Total\,Pixels}_{i}}\), where N is the number of classes, \({True\,Positives}_{i}\) and \({True\,Negatives}_{i}\) are the true positives and true negatives for class i, and \({Total\,Pixels}_{i}\) is the total number of pixels in class i.

2.3 Metrics for feature extraction and classification

2.3.1 Accuracy

The ratio of correctly predicted instances to the total number of instances. It's commonly used for balanced datasets but can be misleading for imbalanced datasets.

2.3.2 Precision

The ratio of true positive predictions to the total number of positive predictions. It measures the model’s ability to avoid false positives.

2.3.3 Recall (Sensitivity or true positive rate)

The ratio of true positive predictions to the total number of actual positive instances. It measures the model’s ability to correctly identify positive instances.

2.3.4 F1-Score

The harmonic mean of precision and recall. It provides a balanced measure between precision and recall.

2.3.5 Specificity (True negative rate)

The ratio of true negative predictions to the total number of actual negative instances.

2.3.6 ROC curve (receiver operating characteristic curve)

A graphical representation of the trade-off between the true positive rate and the false positive rate as the classification threshold varies. These metrics are commonly used in binary classification; the ROC curve plots the trade-off, and the area under the curve (AUC) summarizes its performance.
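
The classification metrics in this subsection all derive from the entries of the confusion matrix; the sketch below computes them for binary 0/1 labels with NumPy (ROC and AUC are typically delegated to a library such as scikit-learn and are omitted here).

```python
import numpy as np

def binary_classification_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Accuracy, precision, recall, specificity and F1 for binary labels (0/1)."""
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1}
```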

3 Image preprocessing

Image preprocessing is a fundamental step in the field of image processing that involves a series of operations aimed at preparing raw or unprocessed images for further analysis, interpretation, or manipulation. This crucial phase helps enhance the quality of images, mitigate noise, correct anomalies, and extract relevant information, ultimately leading to more accurate and reliable results in subsequent tasks such as image analysis, recognition, and classification.

Image preprocessing is broadly categorized into image restoration, which removes noise and blurring from images, and image enhancement, which improves the contrast, brightness, and detail of images.

3.1 Image restoration

Image restoration serves as a pivotal process aimed at reclaiming the integrity and visual quality of images that have undergone degradation or distortion. Its objective is to transform a degraded image into a cleaner, more accurate representation, thereby revealing concealed details that may have been obscured. This process is particularly vital in scenarios where images have been compromised due to factors like digital image acquisition issues or post-processing procedures such as compression and transmission. By rectifying these issues, image restoration contributes to enhancing the interpretability and utility of visual data.

A notable adversary in the pursuit of pristine images is noise, an unintended variation in pixel values that introduces unwanted artifacts and can lead to the loss of important information. Different types of noise, such as Gaussian noise characterized by its random distribution, salt and pepper noise causing sporadic bright and dark pixels, and speckle noise resulting from interference, can mar the quality of images. These disturbances often originate from the acquisition process or subsequent manipulations of the image data.

Historically, traditional image restoration techniques have included an array of methods to mitigate the effects of degradation and noise. These techniques encompass constrained least square filters, blind deconvolution methods that aim to reverse the blurring effects, Wiener and inverse filters for enhancing signal-to-noise ratios, as well as Adaptive Mean, Order Statistic, and Alpha-trimmed mean filters that tailor filtering strategies based on the local pixel distribution. Additionally, algorithms dedicated to deblurring counteract motion or optical-induced blurriness, restoring sharpness. Denoising techniques (Tian et al. 2018 ; Peng et al. March 2020 ; Tian and Fei 2020 ) such as Total Variation Denoising (TVD) and Non-Local Means (NLM) further contribute by effectively reducing random noise while preserving essential image details, collectively advancing the field's capacity to improve image integrity and visual clarity. In Table 1 , a summary of deep learning models for image restoration is provided, including their respective advantages and disadvantages.

Recent advancements in deep learning, particularly through Convolutional Neural Networks (CNN), have revolutionized the field of image restoration. CNNs are adept at learning and extracting complex features from images, allowing them to recognize patterns and nuances that may be challenging for traditional methods to discern. Through extensive training on large datasets, these networks can significantly enhance the quality of restored images, often surpassing the capabilities of conventional techniques. This leap in performance is attributed to the network's ability to implicitly understand the underlying structures of images and infer optimal restoration strategies.
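
As an illustration of this class of models, the sketch below shows a small residual denoiser in the spirit of DnCNN, written in PyTorch; the depth, width, and training details are placeholders, not the configurations used in the studies discussed below.

```python
import torch
import torch.nn as nn

class SmallDenoiser(nn.Module):
    """DnCNN-style residual denoiser: the network predicts the noise,
    which is subtracted from the input (illustrative depth and width)."""
    def __init__(self, channels: int = 1, features: int = 64, depth: int = 8):
        super().__init__()
        layers = [nn.Conv2d(channels, features, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(features, features, 3, padding=1),
                       nn.BatchNorm2d(features),
                       nn.ReLU(inplace=True)]
        layers.append(nn.Conv2d(features, channels, 3, padding=1))
        self.body = nn.Sequential(*layers)

    def forward(self, noisy: torch.Tensor) -> torch.Tensor:
        # Residual learning: output = input - predicted noise
        return noisy - self.body(noisy)

# Typical training step: minimise MSE between the network output and the clean image.
model = SmallDenoiser()
noisy = torch.rand(4, 1, 64, 64)
clean = torch.rand(4, 1, 64, 64)
loss = nn.functional.mse_loss(model(noisy), clean)
```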

Chunwei Tian et al. (Tian and Fei 2020 ) provided an overview of deep network utilization in denoising images to eliminate Gaussian noise. They explored deep learning techniques for various noisy tasks, including additive white noisy images, blind denoising, and real noisy images. Through benchmark dataset analysis, they assessed the denoising outcomes, efficiency, and visual effects of distinct networks, followed by cross-comparisons of different image denoising methods against diverse types of noise. They concluded by addressing the challenges encountered by deep learning in image denoising.

Quan et al. ( 2020 ) introduced a self-supervised deep learning method named Self2Self for image denoising. Their study demonstrated that the denoising neural network trained with the Self2Self scheme outperformed non-learning-based denoisers and single-image-learning denoisers.

Yan et al. ( 2020 ) proposed a novel technique for removing speckle noise in digital holographic speckle pattern interferometry (DHSPI) wrapped phase. Their method employed improved denoising convolutional neural networks (DnCNNs) and evaluated noise reduction using Mean Squared Error (MSE) comparisons between noisy and denoised data.

Sori et al. ( 2021 ) presented lung cancer detection from denoised Computed Tomography images using a two-path convolutional neural network (CNN). They employed the denoised image by DR-Net as input for lung cancer detection, achieving superior results in accuracy, sensitivity, and specificity compared to recent approaches.

Pang et al. ( 2021 ) implemented an unsupervised deep learning method for denoising using unmatched noisy images, with a loss function analogous to supervised training. Their model, based on the Additive White Gaussian Noise model, attained competitive outcomes against unsupervised methods.

Hasti and Shin ( 2022 ) proposed a deep learning approach to denoise fuel spray images derived from Mie scattering and droplet center detection. A comprehensive comparison of diverse algorithms—standard CNN, modified ResNet, and modified U-Net—revealed the superior performance of the modified U-Net architecture in terms of Mean Squared Error (MSE) and Peak Signal-to-Noise Ratio (PSNR).

Niresi and Chi et al. ( 2022 ) employed an unsupervised HSI denoising algorithm under the DIP framework, which minimized the Half-Quadratic Lagrange Function (HLF) without regularizers, effectively removing mixed types of noises like Gaussian noise and sparse noise while preserving edges.

Zhou et al. ( 2022 ) introduced a novel bearing fault diagnosis model called deep network-based sparse denoising (DNSD). They addressed the challenges faced by traditional sparse theory algorithms, demonstrating that DNSD overcomes issues related to generalization, parameter adjustment, and data-driven complexity.

Tawfik et al. ( 2022 ) conducted a comprehensive evaluation of image denoising techniques, categorizing them as traditional (user-based) non-learnable denoising filters and DL-based methods. They introduced semi-supervised denoising models and employed qualitative and quantitative assessments to compare denoising performance.

Meng and Zhang et al. ( 2022 ) proposed a gray image denoising method utilizing a constructed symmetric and dilated convolutional residual network. Their technique not only effectively removed noise in high-noise settings but also achieved higher SSIM, PSNR, FOM, and improved visual effects, offering valuable data for subsequent applications like target detection, recognition, and tracking.

In essence, image restoration encapsulates a continuous endeavor to salvage and improve the visual fidelity of images marred by degradation and noise. As technology advances, the integration of deep learning methodologies promises to propel this field forward, ushering in new standards of image quality and accuracy.

3.2 Image enhancement

Image enhancement refers to the process of manipulating an image to improve its visual quality and interpretability for human perception. This technique involves various adjustments that aim to reveal hidden details, enhance contrast, and sharpen edges, ultimately resulting in an image that is clearer and more suitable for analysis or presentation. The goal of image enhancement is to make the features within an image more prominent and recognizable, often by adjusting brightness, contrast, color balance, and other visual attributes.

Standard image enhancement methods encompass a range of techniques, including histogram matching to adjust the pixel intensity distribution, contrast-limited adaptive histogram equalization (CLAHE) to enhance local contrast, and filters like the Wiener filter and median filter to reduce noise. Linear contrast adjustment and unsharp mask filtering are also commonly employed to boost image clarity and sharpness.
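
A brief sketch of two of these classical enhancement operations, CLAHE and unsharp masking, using OpenCV; the file paths and parameter values (clip limit, tile size, sharpening weights) are illustrative assumptions.

```python
import cv2

# Load a grayscale image (path is illustrative).
img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)

# Contrast-limited adaptive histogram equalization (CLAHE).
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
equalized = clahe.apply(img)

# Unsharp masking: subtract a blurred copy to re-emphasize high frequencies.
blurred = cv2.GaussianBlur(equalized, (0, 0), sigmaX=3)
sharpened = cv2.addWeighted(equalized, 1.5, blurred, -0.5, 0)

cv2.imwrite("enhanced.png", sharpened)
```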

In recent years, deep learning methods have emerged as a powerful approach for image enhancement. These techniques leverage large datasets and complex neural network architectures to learn patterns and features within images, enabling them to restore and enhance images with impressive results. Researchers have explored various deep learning models for image enhancement, each with its strengths and limitations. These insights are summarized in Table 2 .

The study encompasses an array of innovative techniques, including the integration of Retinex theory and deep image priors in the Novel RetinexDIP method, robustness-enhancing Fuzzy operation to mitigate overfitting, and the fusion of established techniques like Unsharp Masking, High-Frequency Emphasis Filtering, and CLAHE with EfficientNet-B4, ResNet-50, and ResNet-18 architectures to bolster generalization and robustness. Among these, FCNN Mean Filter exhibits computational efficiency, while CV-CNN leverages the capabilities of complex-valued convolutional networks. Additionally, the versatile pix2pixHD framework and the swift convergence of LE-net (Light Enhancement Net) contribute to the discourse. Deep Convolutional Neural Networks demonstrate robust enhancements, yet require meticulous hyperparameter tuning. Finally, MSSNet-WS (Multi-Scale-Stage Network) efficiently converges and addresses overfitting. This analysis systematically highlights their merits, encompassing improved convergence rates, overfitting mitigation, robustness, and computational efficiency.

Gao et al. ( 2022 ) proposed an inventive approach for enhancing low-light images by leveraging Retinex decomposition after initial denoising. In their method, the Retinex decomposition technique was applied to restore brightness and contrast, resulting in images that are clearer and more visually interpretable. Notably, their method underwent rigorous comparison with several other techniques, including LIME, NPE, SRIE, KinD, Zero-DCE, and RetinexDIP, showcasing its superior ability to enhance image quality while preserving image resolution and minimizing memory usage (Tables  1 , 2 , 3 , 4 and 5 ).

Liu et al. ( 2019 ) explored the application of deep learning in iris recognition, utilizing Fuzzy-CNN (F-CNN) and F-Capsule models. What sets their approach apart is the integration of Gaussian and triangular fuzzy filters, a novel enhancement step that contributes to improving the clarity of iris images. The significance lies in the method’s practicality, as it smoothly integrates with existing networks, offering a seamless upgrade to the recognition process.

Munadi et al. ( 2020 ) combined deep learning techniques with image enhancement methodologies to tackle tuberculosis (TB) image classification. Their innovative approach involved utilizing Unsharp Masking (UM) and High-Frequency Emphasis Filtering (HEF) in conjunction with EfficientNet-B4, ResNet-50, and ResNet-18 models. By evaluating the performance of three image enhancement algorithms, their work demonstrated remarkable accuracy and Area Under Curve (AUC) scores, revealing the potential of their method for accurate TB image diagnosis.

Lu et al. ( 2021 ) introduced a novel application of deep learning, particularly the use of a fully connected neural network (FCNN), to address impulse noise in degraded images with varying noise densities. What's noteworthy about their approach is the development of an FCNN mean filter that outperformed traditional mean/median filters, especially when handling low-noise density environments. Their study thus highlights the promising capabilities of deep learning in noise reduction scenarios. Quan et al. ( 2020 ) presented a non-blind image deblurring technique employing complex-valued CNN (CV-CNN). The uniqueness of their approach lies in incorporating Gabor-domain denoising as a prior step in the deconvolution model. By evaluating their model using quantitative metrics like Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM), their work showcased effective deblurring outcomes, reaffirming the potential of complex-valued CNNs in image restoration.

Jin et al. ( 2021 ) harnessed the power of deep learning, specifically the pix2pixHD model, to enhance multidetector computed tomography (MDCT) images. Their focus was on accurately measuring vertebral bone structure. By utilizing MDCT images, their approach demonstrated the potential of deep learning techniques in precisely enhancing complex medical images, which can play a pivotal role in accurate clinical assessments.

Li et al. ( 2021a ) introduced a CNN-based LE-net tailored for image recovery in low-light conditions, catering to applications like driver assistance systems and connected autonomous vehicles (CAV). Their work highlighted the significance of their model in outperforming traditional approaches and even other deep learning models. The research underscores the importance of tailored solutions for specific real-world scenarios.

Mehranian et al. ( 2022 ) ventured into the realm of Time-of-Flight (ToF) enhancement in positron emission tomography (PET) images using deep convolutional neural networks. Their innovative use of the block-sequential-regularized-expectation–maximization (BSREM) algorithm for PET data reconstruction in combination with DL-ToF(M) demonstrated superior diagnostic performance, measured through metrics like SSIM and Fréchet Inception Distance (FID).

Kim et al. ( 2022 ) introduced the Multi-Scale-Stage Network (MSSNet), a pioneering deep learning-based approach for single image deblurring. What sets their work apart is their meticulous analysis of previous deep learning-based coarse-to-fine approaches, leading to the creation of a network that achieves state-of-the-art performance in terms of image quality, network size, and computation time.

In the core, image enhancement plays a crucial role in improving the visual quality of images, whether for human perception or subsequent analytical tasks. The combination of traditional methods and cutting-edge deep learning techniques continues to advance our ability to reveal and amplify important information within images. Each of these studies contributes to the expanding landscape of image enhancement and restoration, showcasing the immense potential of deep learning techniques in various domains, from medical imaging to low-light scenarios, while addressing specific challenges and advancing the state-of-the-art in their respective fields.

However, the study recognizes inherent limitations, including constrained adaptability, potential loss of intricate details, and challenges posed by complex scenes or real-world images. Through a meticulous exploration of these advantages and disadvantages, the study endeavors to offer a nuanced perspective on the diverse applicability of these methodologies across various image enhancement scenarios.

4 Image segmentation

Image segmentation is a pivotal process that involves breaking down an image into distinct segments based on certain discernible characteristics such as intensity, color, texture, or spatial proximity. This technique is classified into two primary categories: Semantic segmentation and Instance segmentation. Semantic segmentation assigns each pixel to a specific class within the input image, enabling the identification of distinct object regions. On the other hand, instance segmentation takes a step further by not only categorizing pixels into classes but also differentiating individual instances of those classes within the image.

Traditional segmentation methodologies entail the partitioning of data, such as images, into well-defined segments governed by predetermined criteria. This approach predates the era of deep learning and relies on techniques rooted in expert-designed features or domain-specific knowledge. Common techniques encompass thresholding, which categorizes pixels into object and background regions using specific intensity thresholds, region-based segmentation that clusters pixels with similar attributes into coherent regions, and edge detection to identify significant intensity transitions that might signify potential boundaries. Nonetheless, traditional segmentation techniques grapple with inherent complexities when it comes to handling intricate shapes, dynamic backgrounds, and noise within the data. Moreover, the manual craftsmanship of features for various scenarios can be laborious and might not extend well to different contexts.

In contrast, deep learning has ushered in a paradigm shift in segmentation by introducing automated feature learning. Deep neural networks have the remarkable ability to extract intricate features directly from raw data, negating the necessity for manual feature engineering. This empowers them to capture nuanced spatial relationships and adapt to variations, effectively addressing the limitations inherent in traditional methods. This transformation, especially pronounced in image segmentation tasks, has opened doors to unprecedented possibilities in the field of computer vision and image analysis. Table 3 encapsulates the strengths and limitations of various explored deep learning models.

Ahmed et al. ( 2020 ) conducted a comprehensive exploration of deep learning-based semantic segmentation models for the challenging task of top-view multiple person segmentation. They assessed the performance of key models, including Fully Convolutional Neural Network (FCN), U-Net, and DeepLabV3. This investigation is particularly important as accurate segmentation of multiple individuals in top-view images holds significance in various applications like surveillance, crowd monitoring, and human–computer interaction. The researchers found that DeepLabV3 and U-Net outperformed FCN in terms of accuracy. These models achieved impressive accuracy and mean Intersection over Union (mIoU) scores, indicating the precision of segmentation, with DeepLabV3 and U-Net leading the way. The results underscore the value of utilizing advanced deep learning models for complex segmentation tasks involving multiple subjects.
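
For readers who want to reproduce this kind of comparison, semantic segmentation with a pretrained DeepLabV3 model is readily available in torchvision; the sketch below assumes torchvision ≥ 0.13 (for the weights argument) and an illustrative input path, and it is not the exact training setup used in the study above.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Pretrained DeepLabV3 with a ResNet-50 backbone.
model = models.segmentation.deeplabv3_resnet50(weights="DEFAULT").eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

img = Image.open("people.jpg").convert("RGB")   # path is illustrative
batch = preprocess(img).unsqueeze(0)

with torch.no_grad():
    out = model(batch)["out"]                    # shape: (1, num_classes, H, W)
mask = out.argmax(dim=1).squeeze(0)              # per-pixel class indices
```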

Wang et al. ( 2020 ) proposed an adaptive segmentation algorithm employing the UNet structure, which is adept at segmenting both shallow and deep features. Their study addressed the challenge of segmenting complex boundaries within images, a crucial task in numerous medical imaging and computer vision applications. They validated their model's effectiveness on natural scene images and liver cancer CT images, highlighting its advantages over existing segmentation methods. This research contributes to the field by showcasing the potential of adaptive segmentation algorithms, emphasizing their superiority in handling intricate boundaries in diverse image datasets.

Ahammad et al. ( 2020 ) introduced a novel deep learning framework based on Convolutional Neural Networks (CNNs) for diagnosing Spinal Cord Injury (SCI) features through segmentation. This study's significance lies in its application to medical imaging, specifically spinal cord disease prediction. Their model’s high computational efficiency and remarkable accuracy underscore its potential clinical utility. The CNN-based framework leveraged sensor SCI image data, demonstrating the capacity of deep learning to contribute to accurate diagnosis and prediction in medical scenarios, enhancing patient care.

Lorenzoni et al. ( 2020 ) employed Deep Learning techniques based on Convolutional Neural Networks (CNNs) to automate the segmentation of microCT images of distinct cement-based composites. This research is essential in materials science and civil engineering, where automated segmentation can aid in understanding material properties. Their study emphasizes the adaptability of Deep Learning models, showcasing the transferability of network parameters optimized on high-strength materials to other related contexts. This work demonstrates the potential of CNN-based methodologies for advancing materials characterization and analysis.

Mahajan et al. ( 2021 ) introduced a clustering-based profound iterating Deep Learning model (CPIDM) for hyperspectral image segmentation. This research addresses the challenge of segmenting hyperspectral images, which are prevalent in fields like remote sensing and environmental monitoring. The proposed approach's superiority over state-of-the-art methods indicates its potential for enhancing the accuracy of hyperspectral image analysis. The study contributes to the field by providing an innovative methodology to tackle the unique challenges posed by hyperspectral data.

Jalali et al. ( 2021 ) designed a novel deep learning-based approach for segmenting lung regions from CT images using Bi-directional ConvLSTM U-Net with densely connected convolutions (BCDU-Net). This research is critical for medical image analysis, specifically lung-related diagnoses. Their model's impressive accuracy on a large dataset indicates its potential for aiding radiologists in identifying lung regions accurately. The application of advanced deep learning architectures to medical imaging tasks underscores the transformative potential of such technologies in healthcare.

Bouteldja et al. ( 2020 ) developed a CNN-based approach for accurate multiclass segmentation of stained kidney images from various species and renal disease models. This research’s significance lies in its potential contribution to histopathological analysis and disease diagnosis. The model's high performance across diverse species and disease models highlights its robustness and utility for aiding pathologists in accurate image-based diagnosis.

Liu et al. ( 2021 ) proposed a novel convolutional neural network architecture incorporating cross-connected layers and multi-scale feature aggregation for image segmentation. The research addresses the need for advanced segmentation techniques that can capture intricate features and relationships within images. Their model's impressive performance metrics underscore its potential for enhancing segmentation accuracy, which is pivotal in diverse fields, including medical imaging, robotics, and autonomous systems.

Saood and Hatem et al. ( 2021 ) introduced deep learning networks, SegNet and U-Net, for segmenting COVID-19-infected areas in CT scan images. This research's timeliness is evident, as it contributes to the fight against the global pandemic. Their comparison of network performance provides insights into the effectiveness of different deep learning architectures for accurately identifying infected regions in lung images. This work showcases the agility of deep learning in addressing real-world challenges.

In Nurmain et al. ( 2020 ), a novel approach employing Mask-RCNN is introduced for accurate fetal septal defect detection. Addressing limitations in previous methods, the model demonstrates multiclass heart chamber detection with high accuracy: right atrium (97.59%), left atrium (99.67%), left ventricle (86.17%), right ventricle (98.83%), and aorta (99.97%). Competitive results are shown for defect detection in atria and ventricles, with MRCNN achieving around 99.48% mAP compared to 82% for FRCNN. The study concludes that the proposed MRCNN model holds promise for aiding cardiologists in early fetal congenital heart disease screening.

Park et al. ( 2021a ) propose a method for intelligently segmenting food in images using deep neural networks. They address labor-intensive data collection by utilizing synthetic data through 3D graphics software Blender, training Mask R-CNN for instance segmentation. The model achieves 52.2% on real-world food instances with only synthetic data, and + 6.4%p performance improvement after fine-tuning compared to training from scratch. Their approach shows promise for healthcare robot systems like meal assistance robots.

The work of Pérez-Borrero et al. ( 2020 ) underscores the significance of fruit instance segmentation, specifically within autonomous fruit-picking systems. It highlights the adoption of deep learning techniques, particularly Mask R-CNN, as a benchmark. The review justifies the proposed methodology's alterations to address limitations, emphasizing its efficiency gains. Additionally, the introduction of the Instance Intersection Over Union (I2oU) metric and the StrawDI_Db1 dataset creation are positioned as contributions with real-world implementation potential.

These studies collectively highlight the transformative impact of deep learning in various segmentation tasks, ranging from medical imaging to materials science and computer vision. By leveraging advanced neural network architectures and training methodologies, researchers are pushing the boundaries of what is achievable in image segmentation, ultimately contributing to advancements in diverse fields and applications.

5 Feature extraction

Feature extraction is a fundamental process in image processing and computer vision that involves transforming raw pixel data into a more compact and informative representation, often referred to as features. These features capture important characteristics of the image, making it easier for algorithms to understand and analyze images for various tasks like object recognition, image classification, and segmentation. Traditional methods of feature extraction were prevalent before the rise of deep learning and involved techniques that analyzed pixel-level information.

Some traditional methods are explained here. Principal Component Analysis (PCA) is a statistical technique that reduces the dimensionality of the data while retaining as much of the original variance as possible. It identifies the orthogonal axes (principal components) along which the data varies the most. Independent Component Analysis (ICA) aims to find a linear transformation of the data into statistically independent components. It is often used for separating mixed sources in images, such as separating different image sources from a single mixed image. Locally Linear Embedding (LLE) is a nonlinear dimensionality reduction technique that aims to preserve the local structure of data points. It finds a low-dimensional representation of the data while maintaining the neighborhood relationships.
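
As a small illustration of the first of these techniques, PCA can be implemented in a few lines with an SVD of the centered data matrix; the patch size and component count below are arbitrary examples.

```python
import numpy as np

def pca_transform(X: np.ndarray, n_components: int) -> np.ndarray:
    """Project rows of X (samples x features) onto the top principal components."""
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

# Example: reduce 100 flattened 32x32 grayscale patches to 16 features each.
patches = np.random.rand(100, 32 * 32)
features = pca_transform(patches, n_components=16)   # shape (100, 16)
```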

These traditional methods of feature extraction have been widely used and have provided valuable insights and representations for various image analysis tasks. However, they often rely on handcrafted features designed by experts or domain knowledge, which can be labor-intensive and may not generalize well across different types of images or tasks.

Conventional methods of feature extraction encompass the conversion of raw data into a more concise and insightful representation by pinpointing specific attributes or characteristics. These selected features are chosen to encapsulate vital insights and patterns inherent in the data. This procedure often involves a manual approach guided by domain expertise or specific insights. For example, within image processing, methods like Histogram of Oriented Gradients (HOG) might extract insights about gradient distributions, while in text analysis, features such as word frequencies could be selected.

Despite the effectiveness of traditional feature extraction for particular tasks and its ability to provide data insights, it comes with inherent limitations. Conventional techniques frequently necessitate expert intervention to craft features, which can be a time-intensive process and might overlook intricate relationships or patterns within the data. Moreover, traditional methods might encounter challenges when dealing with data of high dimensionality or scenarios where features are not easily definable.

In contrast, the ascent of deep learning approaches has revolutionized feature extraction by automating the process. Deep neural networks autonomously learn to extract meaningful features directly from raw data, eliminating the need for manual feature engineering. This facilitates the capture of intricate relationships, patterns, and multifaceted interactions that traditional methods might overlook. Consequently, deep learning has showcased exceptional achievements across various domains, particularly in tasks involving intricate data, such as image and speech recognition. Table 4 succinctly outlines the metrics, strengths and limitations of diverse deep learning models explored for feature enhancement.

Magsi et al. ( 2020 ) embarked on a significant endeavor in the realm of disease identification within date palm trees by harnessing the power of deep learning techniques. Their study centered around texture and color extraction methods from images of various date palm diseases. Through the application of Convolutional Neural Networks (CNNs), they effectively created a system that could discern diseases based on specific visual patterns. The achieved accuracy of 89.4% signifies the model's proficiency in accurately diagnosing diseases within this context. This approach not only showcases the potential of deep learning in addressing agricultural challenges but also emphasizes the importance of automated disease detection for crop management and security.

Sharma et al. ( 2020 ) delved into the domain of medical imaging with a focus on chest X-ray images. They introduced a comprehensive investigation involving different deep Convolutional Neural Network (CNN) architectures to facilitate the extraction of features from these images. Notably, the study evaluated the impact of dataset size on CNN performance, highlighting the scalability of their approach. By incorporating augmentation and dropout techniques, the model achieved a high accuracy of 0.9068, suggesting its ability to accurately classify and diagnose chest X-ray images. This work underscores the potential of deep learning in aiding medical professionals in diagnosing diseases and conditions through image analysis.

Zhang et al. ( 2020 ) offered a novel solution to the challenge of distinguishing between genuine and counterfeit facial images generated using deep learning methods. Their approach relied on a Counterfeit Feature Extraction Method that employed a Convolutional Neural Network (CNN) model. This model demonstrated remarkable accuracy, achieving a rate of 97.6%. Beyond the impressive accuracy, the study also addressed a crucial aspect of computational efficiency, highlighting the potential for reducing the computational demands associated with counterfeit image detection. This research is particularly relevant in today's digital landscape where ensuring the authenticity of images has become increasingly vital.

Simon and V et al. ( 2020 ) explored the fusion of deep learning and feature extraction in the context of image classification and texture analysis. Their study involved Convolutional Neural Networks (CNNs) including popular architectures like AlexNet, VGG19, Inception, InceptionResNetV3, ResNet, and DenseNet201. These architectures were employed to extract meaningful features from images, which were then fed into a Support Vector Machine (SVM) for texture classification. The results were promising, with the model achieving good to superior accuracy levels ranging from 85 to 95% across different pretrained models and datasets. This approach showcases the ability of deep learning to contribute to image analysis tasks, particularly when combined with traditional machine learning techniques.
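
The pipeline described above, deep features from a pretrained CNN fed into an SVM, can be sketched as follows; the ResNet-50 backbone, the RBF kernel, and the random stand-in data are assumptions for illustration and do not reproduce the cited study's exact architectures or datasets.

```python
import torch
import torch.nn as nn
from torchvision import models
from sklearn.svm import SVC

# Pretrained backbone with the classification head removed -> 2048-d features.
backbone = models.resnet50(weights="DEFAULT")
backbone.fc = nn.Identity()
backbone.eval()

def extract_features(images: torch.Tensor) -> torch.Tensor:
    """images: (N, 3, 224, 224), already normalized with ImageNet statistics."""
    with torch.no_grad():
        return backbone(images)

# Illustrative data: random tensors stand in for preprocessed image batches.
train_images, train_labels = torch.rand(32, 3, 224, 224), torch.randint(0, 4, (32,))
test_images = torch.rand(8, 3, 224, 224)

svm = SVC(kernel="rbf")
svm.fit(extract_features(train_images).numpy(), train_labels.numpy())
predictions = svm.predict(extract_features(test_images).numpy())
```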

Sungheetha and Sharma et al. ( 2021 ) addressed the critical challenge of detecting diabetic conditions through the identification of specific signs within blood vessels of the eye. Their approach relied on a deep feature Convolutional Neural Network (CNN) designed to spot these indicators. With an impressive accuracy of 97%, the model demonstrated its efficacy in accurately identifying diabetic conditions. This work not only showcases the potential of deep learning in medical diagnostics but also highlights its ability to capture intricate visual patterns that are indicative of specific health conditions.

Devulapalli et al. ( 2021 ) proposed a hybrid feature extraction method that combined Gabor transform-based texture features with automated high-level features using the Googlenet architecture. By utilizing pre-trained models such as Alexnet, VGG 16, and Googlenet, the study achieved exceptional accuracy levels. Interestingly, the hybrid feature extraction method outperformed the existing pre-trained models, underscoring the potential of combining different feature extraction techniques to achieve superior performance in image analysis tasks. Shankar et al. ( 2022 ) embarked on the critical task of COVID-19 diagnosis using chest X-ray images. Their approach involved a multi-step process that encompassed preprocessing through Weiner filtering, fusion-based feature extraction using GLCM, GLRM, and LBP, and finally, classification through an Artificial Neural Network (ANN). By carefully selecting optimal feature subsets, the model exhibited the potential for robust classification between infected and healthy patients. This study showcases the versatility of deep learning in medical diagnostics, particularly in addressing urgent global health challenges.

Ahmad et al. ( 2022 ) made significant strides in breast cancer detection by introducing a hybrid deep learning model, AlexNet-GRU, capable of autonomously extracting features from the PatchCamelyon benchmark dataset. The model demonstrated its prowess in accurately identifying metastatic cancer in breast tissue. With superior performance compared to state-of-the-art methods, this research emphasizes the potential of deep learning in medical imaging, specifically for cancer detection and classification. Sharif et al. ( 2019 ) ventured into the complex field of detecting gastrointestinal tract (GIT) infections using wireless capsule endoscopy (WCE) images. Their innovative approach combined deep convolutional (CNN) and geometric features to address the intricate challenges posed by lesion attributes. The fusion of contrast-enhanced color features and geometric characteristics led to exceptional classification accuracy and precision, showcasing the synergy between deep learning and traditional geometric features. This approach is particularly promising in enhancing medical diagnostics through the integration of multiple information sources.

Aarthi and Rishma ( 2023 ) responded to the pressing challenges of waste management by introducing a real-time automated waste detection and segregation system using deep learning. Leveraging the Mask R-CNN architecture, their model demonstrated the capability to identify and classify waste objects in real time. Additionally, the study explored the extraction of geometric features for more effective object manipulation by robotic arms. This innovative approach not only addresses environmental concerns related to waste but also showcases the potential of deep learning in practical applications beyond traditional image analysis, with the aim of enhancing efficiency and reducing pollution risks.

These studies showcase the efficacy of methods like CNNs, hybrid approaches, and novel architectures in achieving high accuracies and improved performance metrics in applications such as disease identification, image analysis, counterfeit detection, and more. While these methods automate the extraction of meaningful features, they also encounter challenges like computational complexity, dataset quality, and real-world variability, which should be carefully considered in their practical implementation.

6 Image classification

Image classification is a fundamental task in computer vision that involves categorizing images into predefined classes or labels. The goal is to enable machines to recognize and differentiate objects, scenes, or patterns within images.

Traditional classification is a fundamental data analysis technique that categorizes data points into specific classes based on predetermined rules and established features. Before the advent of deep learning, several conventional methods were widely used for this purpose, including Decision Trees, Support Vector Machines (SVM), Naive Bayes, and k-Nearest Neighbors (k-NN). In traditional classification, experts carefully design and select features that encapsulate relevant information from the data. These features are typically chosen based on domain knowledge and insight, aiming to capture distinguishing characteristics that help discriminate between different classes, and they act as inputs to classification algorithms that assign data points to specific classes according to predefined criteria. While effective in many scenarios, traditional classification methods require manual feature engineering, which can be time-consuming and may not fully capture the intricate patterns and relationships present in complex datasets. Table 5 provides a compact overview of strengths and limitations in the realm of image classification by examining various deep learning models.
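To make the traditional pipeline concrete, the following is a minimal, hypothetical sketch of hand-designed features fed into an SVM using scikit-learn; the features (an intensity histogram plus a crude edge-density measure), the synthetic data, and the two classes are purely illustrative and are not taken from any of the reviewed studies.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def handcrafted_features(image):
    """Hand-designed features: a 16-bin intensity histogram plus a crude edge-density measure."""
    hist, _ = np.histogram(image, bins=16, range=(0, 256), density=True)
    gy, gx = np.gradient(image.astype(float))
    edge_density = np.hypot(gx, gy).mean()
    return np.append(hist, edge_density)

# Synthetic stand-in data: "smooth" images (class 0) vs "textured" images (class 1).
smooth   = [rng.normal(128, 5,  (32, 32)).clip(0, 255) for _ in range(100)]
textured = [rng.normal(128, 60, (32, 32)).clip(0, 255) for _ in range(100)]
X = np.array([handcrafted_features(img) for img in smooth + textured])
y = np.array([0] * 100 + [1] * 100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)   # classical classifier on engineered features
print("held-out accuracy:", clf.score(X_test, y_test))
```

The key point of the sketch is that the discriminative power lives entirely in the hand-crafted `handcrafted_features` function, which is exactly the step deep learning automates.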

In the realm of medical image analysis, Ismael et al. ( 2020 ) introduced an advanced approach that harnesses the power of Residual Networks (ResNets) for brain tumor classification. Their study involved a comprehensive evaluation on a benchmark dataset comprising 3064 MRI images of three distinct brain tumor types. Impressively, their model achieved a remarkable accuracy of 99%, surpassing previous works in the same domain. Shifting focus to the domain of remote sensing, Xu et al. ( 2020 ) embarked on a deep learning journey for remote sensing image classification. Their methodology combined Recurrent Neural Networks (RNN) with Random Forest, aiming to optimize cross-validation on the UC Merced dataset. Through rigorous experimentation and comparison with various deep learning techniques, their approach achieved a commendable accuracy of 87%.

Texture analysis and classification hold significant implications, as highlighted by Aggarwal and Kumar ( 2020 ). Their study introduced a novel deep learning-based model centered around Convolutional Neural Networks (CNNs), specifically composed of two sub-models. The outcomes were noteworthy, with model-1 achieving an accuracy of 92.42%, while model-2 further improved the accuracy to an impressive 96.36%.

Abdar et al. ( 2021 ) unveiled a pioneering hybrid dynamic Bayesian Deep Learning (BDL) model that leveraged the Three-Way Decision (TWD) theory for skin cancer diagnosis. By incorporating different uncertainty quantification (UQ) methods and deep neural networks within distinct classification phases, they attained substantial accuracy and F1-score percentages on two skin cancer datasets.

The landscape of medical diagnostics saw another stride forward with Ibrahim et al. ( 2021 ), who explored a deep learning approach based on a pretrained AlexNet model for classifying COVID-19, pneumonia, and healthy CXR scans. Their model exhibited notable performance in both three-way and four-way classifications, achieving high accuracy, sensitivity, and specificity percentages.

In the realm of image classification under resource constraints, Ma et al. ( 2022 ) introduced a novel deep CNN classification method with knowledge transfer. This method showcased superior performance compared to traditional histogram-based techniques, achieving an impressive classification accuracy of 93.4%.

Diving into agricultural applications, Gill et al. ( 2022 ) devised a hybrid CNN-RNN approach for fruit classification. Their model demonstrated remarkable efficiency and accuracy in classifying fruits, showcasing its potential for aiding in quality assessment and sorting.

Abu-Jamie et al. ( 2022 ) turned their attention to fruit classification as well, utilizing a deep learning-based approach. By employing the VGG16 CNN model, they managed to achieve a remarkable 100% accuracy, underscoring the potential of such methodologies in real-world applications.

Medical imaging remained a prominent field of exploration, as Sharma et al. ( 2022 ) explored breast cancer diagnosis through Convolutional Neural Networks (CNN) with transfer learning. Their study showcased a promising accuracy of 98.4%, reinforcing the potential of deep learning in augmenting medical diagnostics.

Beyond the realm of medical imagery, Yang et al. ( 2022 ) applied diverse CNN models to an urban wetland identification framework, with DenseNet121 emerging as the top-performing model. The achieved high Kappa and OA values underscore the significance of deep learning in land cover classification.

Hussain et al. ( 2020 ) delved into Alzheimer's disease detection using a 12-layer CNN model. Their approach showcased a remarkable accuracy of 97.75%, surpassing existing CNN models on the OASIS dataset. Their study also provided a head-to-head comparison with pre-trained CNNs, solidifying the efficacy of their proposed approach in enhancing Alzheimer's disease detection.

In the textile industry, Gao et al. ( 2019 ) addressed fabric defect detection using deep learning. Their novel approach, involving a convolutional neural network with multi-convolution and max-pooling layers, showcased promising results with an overall detection accuracy of 96.52%, offering potential implications for real-world practical applications.

Expanding the horizon to neurological disorders, the study by Khullar et al. ( 2021 ) pioneered ADHD classification from resting-state functional MRI (rs-fMRI) data. Employing a hybrid 2D CNN–LSTM model, the study achieved remarkable improvements in accuracy, specificity, sensitivity, F1-score, and AUC when compared to existing methods. The integration of deep learning with rs-fMRI holds the promise of a robust model for effective ADHD diagnosis and differentiation from healthy controls.

The work of Skouta et al. ( 2021 ) focused on retinal image classification. By harnessing the capabilities of convolutional neural networks (CNNs), their approach achieved an impressive classification accuracy of 95.5% for distinguishing between normal and proliferative diabetic retinas. The inclusion of an expanded dataset contributed to capturing intricate features and ensuring accurate classification outcomes. These studies collectively illuminate the transformative influence of deep learning techniques across diverse classification tasks, spanning medical diagnoses, texture analysis, image categorization, and neurological disorder identification.

While traditional methods have their merits, they heavily rely on domain expertise for feature selection and algorithm tuning. However, these traditional classification approaches encounter limitations. They might struggle with complex and high-dimensional data, where identifying important features becomes intricate. Additionally, they demand substantial manual effort in feature engineering, making them less adaptable to evolving data distributions or novel data types. The emergence of deep learning has revolutionized classification by automating the process of feature extraction. Deep neural networks directly learn hierarchical representations from raw data, eliminating the need for manually crafted features. This enables them to capture intricate patterns and relationships that traditional methods might miss. Notably, Convolutional Neural Networks (CNNs) have excelled in image classification tasks, while Recurrent Neural Networks (RNNs) demonstrate proficiency in handling sequential data. These deep learning models often surpass traditional methods in tackling complex tasks across various domains.
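By contrast, a deep model learns its features directly from pixels. The following minimal PyTorch sketch of a tiny CNN classifier (an illustrative toy, not any of the reviewed architectures) shows how stacked convolution and pooling layers replace hand-engineered features, with a single fully connected layer producing class scores.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Minimal CNN: stacked conv/pool blocks learn features directly from pixels,
    and a final fully connected layer maps them to class scores."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)   # assumes 32x32 RGB inputs

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

model = TinyCNN(num_classes=10)
logits = model(torch.randn(4, 3, 32, 32))     # a dummy batch of four 32x32 RGB images
print(logits.shape)                           # torch.Size([4, 10])
```

Training such a model end to end (for example with cross-entropy loss and stochastic gradient descent) optimizes the convolutional filters themselves, so no feature-engineering step is needed.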

7 Discussion

Among the deep learning models for image denoising, Self2Self NN reduces cost but depends on data augmentation, denoising CNNs enhance accuracy but face resource challenges, and DFT-Net manages image label imbalance while risking detail loss. Robustness and hyperparameter tuning characterize MPR-CNN, while R2R noise reduction balances results and computational demands. CNN architectures prevent overfitting in denoising, and HLF-DIP achieves high performance despite its complexity. Noise2Noise models exhibit efficiency and generalization trade-offs, and ConvNet enhances receptive fields while grappling with interpretability. This collection offers insights into the evolving landscape of image processing techniques.

This compilation of studies showcases a variety of image enhancement techniques. Ming Liu et al. employ Fuzzy-CNN and F-Capsule for iris recognition, ensuring robustness and avoiding overfitting. Khairul Munadi combines various methods with EfficientNet and ResNets for tuberculosis image enhancement, enhancing generalization while facing time and memory challenges. Ching Ta Lu employs FCNN mean filters for noise reduction, addressing noise while considering potential detail loss. Yuhui Quan implements CV-CNN for image deblurring, providing an efficient model with overfitting prevention. Dan Jin employs pix2pixHD for high-quality MDCT image enhancement, achieving quality improvement with possible overfitting concerns. Guofa Li introduces LE-net for low-light image recovery, emphasizing generalization and robustness with real-world limitations. Xianjie Gao introduces RetinexDIP for image enhancement, offering faster convergence and reduced runtime, despite challenges in complex scenes. Kiyeon Kim unveils MSSNet-WS for single image deblurring, prioritizing computational efficiency in real-world scenarios.

This compilation of research papers presents a comprehensive exploration of deep learning methodologies applied to two prominent types of image segmentation: semantic segmentation and instance segmentation. In the realm of semantic segmentation, studies utilize architectures like FCN, U-Net, and DeepLabV3 for tasks such as efficient detection of multiple persons and robust object recognition in varying lighting and background conditions. These approaches achieve notable performance metrics, with IoU and mIoU ranging from 80 to 86%. Meanwhile, in the context of instance segmentation, methods like Mask-RCNN and AFD-UNet are employed to precisely delineate individual object instances within an image, contributing to efficient real-time waste collection, accurate medical image interpretation, and more. The papers highlight the benefits of these techniques, including enhanced boundary delineation, reduced manual intervention, and substantial time savings, while acknowledging challenges such as computational complexity, model customization, and hardware limitations. This compilation provides a comprehensive understanding of the strengths and challenges of deep learning-based semantic and instance segmentation techniques across diverse application domains.

This review explores deep learning methodologies tailored to different types of image feature extraction across varied application domains. Texture/color-based approaches encompass studies like Aurangzeb Magsi et al.'s disease classification achieving 89.4% ACC, and Weiguo Zhang's counterfeit detection at 97% accuracy. Pattern-based analysis includes Akey Sungheetha's 97% class score for retinal images, K. Shankar et al.'s 95.1%–95.7% accuracy using FM-ANN, GLCM, GLRM, and LBP for chest X-rays, and Shahab Ahmad's 99.5% accuracy with AlexNet-GRU for PCam images. Geometric feature extraction is demonstrated by Muhammad Sharif's 99.4% accuracy in capsule endoscopy images and Aarthi R et al.'s 97% accuracy in real-time waste image analysis using MRCNN. This comprehensive review showcases deep learning's adaptability in extracting diverse image features for various applications.

This compilation of research endeavors showcases diverse deep learning models applied to distinct types of image classification tasks. For multiclass classification, studies like Sarah Ali et al.'s employment of Residual Networks attains 99% accuracy in MRI image classification, while Akarsh Aggarwal et al.'s CNN approach achieves 92.42% accuracy in Kylberg Texture datasets. Abdullahi Umar Ibrahim's utilization of an AlexNet model records a 94% accuracy rate for lung conditions. In multiclass scenarios, Harmandeep Singh Gill's hybrid CNN-RNN attains impressive results in fruit classification, and Tanseem N et al. achieve 100% accuracy with VGG16 on fruit datasets. For binary classification, Emtiaz Hussain et al.'s CNN achieves 97.75% accuracy in OASIS MRI data, while Can Gao et al. achieve 96.52% accuracy in defect detection for fabric images. Vikas Khullar et al.'s CNN-LSTM hybrid records 95.32% accuracy for ADHD diagnosis, and Ayoub Skouta's CNN demonstrates 95.5% accuracy in diabetic retinopathy detection. These studies collectively illustrate the efficacy and adaptability of deep learning techniques across various types of classification tasks while acknowledging challenges such as dataset biases, computational intensity, and interpretability.

8 Conclusions

This comprehensive review paper embarks on an extensive exploration across the diverse domains of image denoising, enhancement, segmentation, feature extraction, and classification. By meticulously analyzing and comparing these methodologies, it offers a panoramic view of the contemporary landscape of image processing. In addition to highlighting the unique strengths of each technique, the review shines a spotlight on the challenges that come hand in hand with their implementation.

In the realm of image denoising, the efficacy of methods like Self2Self NN, DnCNNs, and DFT-Net is evident in noise reduction, although challenges such as detail loss and hyperparameter optimization persist. Transitioning to image enhancement, strategies like RetinexDIP, unsharp masking, and LE-net excel in enhancing visual quality but face complexities in handling intricate scenes and maintaining image authenticity.

Segmentation techniques span the gamut from foundational models to advanced ones, providing precise object isolation. Yet, challenges arise in scenarios with overlapping objects and the need for robustness. Feature extraction methodologies encompass a range from CNNs to LSTM-augmented CNNs, unveiling crucial image characteristics while requiring careful consideration of factors like efficiency and adaptability.

Within classification, architectures ranging from Residual Networks to CNN-LSTM hybrids show potential for accurate categorization. However, data dependency, computational complexity, and model interpretability remain challenges. The review's contributions extend to the broader image processing field, providing a nuanced understanding of each methodology's traits and limitations. By offering such insights, it empowers researchers to make informed decisions regarding technique selection for specific applications. As the field evolves, addressing challenges like computation demands and interpretability will be pivotal to fully realizing the potential of these methodologies.

The scope of papers discussed in this review offers a panorama of DL methodologies that traverse diverse application domains. These domains encompass medical and satellite imagery, botanical studies featuring flower and fruit images, as well as real-time scenarios. The tailored DL approaches for each domain underscore the adaptability and efficacy of these methods across multifaceted real-world contexts.

Aarthi R, Rishma G (2023) A Vision based approach to localize waste objects and geometric features exaction for robotic manipulation. Int Conf Mach Learn Data Eng Procedia Comput Sci 218:1342–1352. https://doi.org/10.1016/j.procs.2023.01.113


Abdar M, Samami M, Mahmoodabad SD, Doan T, Mazoure B, Hashemifesharaki R, Liu L, Khosravi A, Acharya UR, Makarenkov V, Nahavandi S (2021) Uncertainty quantification in skin cancer classification using three-way decision-based Bayesian deep learning. Comput Biol Med 135:104418. https://doi.org/10.1016/j.compbiomed.2021.104418

Aggarwal A, Kuma M (2020) Image surface texture analysis and classification using deep learning. Multimed Tools Appl 80(1):1289–1309. https://doi.org/10.1007/s11042-020-09520-2

Ahammad SH, Rajesh V, Rahman MZU, Lay-Ekuakille A (2020) A hybrid CNN-based segmentation and boosting classifier for real time sensor spinal cord injury data. IEEE Sens J 20(17):10092–10101. https://doi.org/10.1109/jsen.2020.2992879

Ahmad S, Ullah T, Ahmad I, Al-Sharabi A, Ullah K, Khan RA, Rasheed S, Ullah I, Uddin MN, Ali MS (2022) A novel hybrid deep learning model for metastatic cancer detection. Comput Intell Neurosci 2022:14. https://doi.org/10.1155/2022/8141530

Ahmed I, Ahmad M, Khan FA, Asif M (2020) Comparison of deep-learning-based segmentation models: using top view person images. IEEE Access 8:136361–136373. https://doi.org/10.1109/access.2020.3011406

Aish MA, Abu-Naser SS, Abu-Jamie TN (2022) Classification of pepper using deep learning. Int J Acad Eng Res (IJAER) 6(1):24–31.


Ashraf H, Waris A, Ghafoor MF et al (2022) Melanoma segmentation using deep learning with test-time augmentations and conditional random fields. Sci Rep 12:3948. https://doi.org/10.1038/s41598-022-07885-y

Bouteldja N, Klinkhammer BM, Bülow RD et al (2020) Deep learning based segmentation and quantification in experimental kidney histopathology. J Am Soc Nephrol. https://doi.org/10.1681/ASN.2020050597

Cheng G, Xie X, Han J, Guo L, Xia G-S (2020) Remote sensing image scene classification meets deep learning: challenges, methods, benchmarks, and opportunities. IEEE J Select Topics Appl Earth Observ Remote Sens 13:3735–3756. https://doi.org/10.1109/JSTARS.2020.3005403

Devulapalli S, Potti A, Rajakumar Krishnan M, Khan S (2021) Experimental evaluation of unsupervised image retrieval application using hybrid feature extraction by integrating deep learning and handcrafted techniques. Mater Today: Proceed 81:983–988. https://doi.org/10.1016/j.matpr.2021.04.326

Dey S, Bhattacharya R, Malakar S, Schwenker F, Sarkar R (2022) CovidConvLSTM: a fuzzy ensemble model for COVID-19 detection from chest X-rays. Exp Syst Appl 206:117812. https://doi.org/10.1016/j.eswa.2022.117812

Gao C, Zhou J, Wong WK, Gao T (2019) Woven Fabric Defect Detection Based on Convolutional Neural Network for Binary Classification. In: Wong W (ed) Artificial Intelligence on Fashion and Textiles AITA 2018 Advances in Intelligent Systems and Computing. Springer, Cham. https://doi.org/10.1007/978-3-319-99695-0_37


Gao X, Zhang M, Luo J (2022) Low-light image enhancement via retinex-style decomposition of denoised deep image prior. Sensors 22:5593. https://doi.org/10.3390/s22155593

Gill HS, Murugesan G, Mehbodniya A, Sajja GS, Gupta G, Bhatt A (2023) Fruit Type Classification using Deep Learning and Feature Fusion. Comput Electronic Agric 211:107990 https://doi.org/10.1016/j.compag.2023.107990

Gite S, Mishra A, Kotecha K (2022) Enhanced lung image segmentation using deep learning. Neural Comput and Appl. https://doi.org/10.1007/s00521-021-06719-8

Hasti VR, Shin D (2022) Denoising and fuel spray droplet detection from light-scattered images using deep learning. Energy and AI 7:100130. https://doi.org/10.1016/j.egyai.2021.100130

Hedayati R, Khedmati M, Taghipour-Gorjikolaie M (2021) Deep feature extraction method based on ensemble of convolutional auto encoders: Application to Alzheimer’s disease diagnosis. Biomed Signal Process Control 66:102397. https://doi.org/10.1016/j.bspc.2020.102397

Hussain E, Hasan M, Hassan SZ, Azmi TH, Rahman MA, Parvez MZ (2020) Deep learning based binary classification for Alzheimer's disease detection using brain MRI images. In: 2020 15th IEEE Conference on Industrial Electronics and Applications (ICIEA), Kristiansand, Norway, pp 1115–1120. https://doi.org/10.1109/iciea48937.2020.9248213

Ibrahim AU, Ozsoz M, Serte S, Al-Turjman F, Yakoi PS (2021) Pneumonia classification using deep learning from chest X-ray images during COVID-19. Cognitive Computation. Springer, Berlin. https://doi.org/10.1007/s12559-020-09787-5

Ismael SAA, Mohammed A, Hefny H (2020) An enhanced deep learning approach for brain cancer MRI images classification using residual networks. Artif Intell Med 102:101779. https://doi.org/10.1016/j.artmed.2019.101779

Jalali Y, Fateh M, Rezvani M, Abolghasemi V, Anisi MH (2021) ResBCDU-Net: a deep learning framework for lung CT image segmentation. Sensors. https://doi.org/10.3390/s21010268

Jiang X, Zhu Y, Zheng B et al (2021) Images denoising for COVID-19 chest X-ray based on multi-resolution parallel residual CNN. Mach Vis Appl 32(4). https://doi.org/10.1007/s00138-021-01224-3

Jin D, Zheng H, Zhao Q, Wang C, Zhang M, Yuan H (2021) Generation of vertebra micro-CT-like image from MDCT: a deep-learning-based image enhancement approach. Tomography 7:767–782. https://doi.org/10.3390/tomography7040064

Kasongo SM, Sun Y (2020) A deep learning method with wrapper based feature extraction for wireless intrusion detection system. Comput Secur 92:101752. https://doi.org/10.1016/j.cose.2020.101752

Khullar V, Salgotra K, Singh HP, Sharma DP (2021) Deep learning-based binary classification of ADHD using resting state MR images. Augment Hum Res. https://doi.org/10.1007/s41133-020-00042-y

Kim K, Lee S, Cho S (2023) MSSNet: Multi-Scale-Stage Network for Single Image Deblurring. In: Karlinsky, L., Michaeli, T., Nishino, K. (eds) Computer Vision – ECCV 2022 Workshops. ECCV 2022. Lecture Notes in Computer Science, vol 13802. Springer, Cham. https://doi.org/10.1007/978-3-031-25063-7_32

Kim B, Ye JC (2019) Mumford-Shah Loss functional for image segmentation with deep learning. IEEE Trans Image Process. https://doi.org/10.1109/TIP.2019.2941265

Kong Y, Ma X, Wen C (2022) A new method of deep convolutional neural network image classification based on knowledge transfer in small label sample environment. Sensors 22:898. https://doi.org/10.3390/s22030898

Li G, Yang Y, Xingda Q, Cao D, Li K (2021a) A deep learning based image enhancement approach for autonomous driving at night. Knowl-Based Syst 213:106617. https://doi.org/10.1016/j.knosys.2020.106617

Li W, Raj ANJ, Tjahjadi T, Zhuang Z (2021b) Digital hair removal by deep learning for skin lesion segmentation. Pattern Recog 117:107994. https://doi.org/10.1016/j.patcog.2021.107994

Liu M, Zhou Z, Shang P, Xu D (2019) Fuzzified image enhancement for deep learning in iris recognition. IEEE Trans Fuzzy Syst 2019:2912576. https://doi.org/10.1109/TFUZZ.2019.2912576

Liu D, Wen B, Jiao J, Liu X, Wang Z, Huang TS (2020) Connecting image denoising and high-level vision tasks via deep learning. IEEE Trans Image Process 29:3695–3706. https://doi.org/10.1109/TIP.2020.2964518

Liu L, Tsui YY, Mandal M (2021) Skin lesion segmentation using deep learning with auxiliary task. J Imag 7:67. https://doi.org/10.3390/jimaging7040067

Lorenzoni R, Curosu I, Paciornik S, Mechtcherine V, Oppermann M, Silva F (2020) Semantic segmentation of the micro-structure of strain-hardening cement-based composites (SHCC) by applying deep learning on micro-computed tomography scans. Cement Concrete Compos 108:103551. https://doi.org/10.1016/j.cemconcomp.2020.103551

Lu CT, Wang LL, Shen JH et al (2021) Image enhancement using deep-learning fully connected neural network mean filter. J Supercomput 77:3144–3164. https://doi.org/10.1007/s11227-020-03389-6

Ma S, Li L, Zhang C (2022) Adaptive image denoising method based on diffusion equation and deep learning. Internet of Robotic Things-Enabled Edge Intelligence Cognition for Humanoid Robots, vol 2022, Article ID 7115551. https://doi.org/10.1155/2022/7115551

Magsi A, Mahar JA, Razzaq MA, Gill SH (2020) Date Palm Disease Identification Using Features Extraction and Deep Learning Approach. 2020 IEEE 23rd International Multitopic Conference (INMIC). https://doi.org/10.1109/INMIC50486.2020.9318158

Mahajan K, Garg U, Shabaz M (2021) CPIDM: a clustering-based profound iterating deep learning model for HSI segmentation Hindawi. Wireless Commun Mobile Comput 2021:12. https://doi.org/10.1155/2021/7279260

Mahmoudi O, Wahab A, Chong KT (2020) iMethyl-deep: N6 methyladenosine identification of yeast genome with automatic feature extraction technique by using deep learning algorithm. Genes 11(5):529. https://doi.org/10.3390/genes11050529

Mehranian A, Wollenweber SD, Walker MD et al (2022) Deep learning–based time-of-flight (ToF) image enhancement of non-ToF PET scans. Eur J Nucl Med Mol Imag 49:3740–3749. https://doi.org/10.1007/s00259-022-05824-7

Meng Y, Zhang J (2022) A novel gray image denoising method using convolutional neural network. IEEE Access 10:49657–49676. https://doi.org/10.1007/s00259-022-05824-7

Munadi K, Muchtar K, Maulina N, Pradhan B (2020) Image enhancement for tuberculosis detection using deep learning. IEEE Access 8:217897. https://doi.org/10.1109/ACCESS.2020.3041867

Niresi FK, Chi C-Y (2022) Unsupervised hyperspectral denoising based on deep image prior and least favorable distribution. IEEE J Sel Topics Appl Earth Observ Remote Sens 15:5967–5983. https://doi.org/10.1109/JSTARS.2022.3187722

Nurmaini S, Rachmatullah MN, Sapitri AI, Darmawahyuni A, Jovandy A, Firdaus F, Tutuko B, Passarella R (2020) Accurate detection of septal defects with fetal ultrasonography images using deep learning-based multiclass instance segmentation. IEEE Access 8:196160–196174. https://doi.org/10.1109/ACCESS.2020.3034367

Pang T, Zheng H, Quan Y, Ji H (2021) Recorrupted-to-recorrupted: unsupervised deep learning for image denoising. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). https://doi.org/10.1109/CVPR46437.2021.00208

Park KH, Batbaatar E, Piao Y, Theera-Umpon N, Ryu KH (2021b) Deep learning feature extraction approach for hematopoietic cancer subtype classification. Int J Environ Res Public Health 18:2197. https://doi.org/10.3390/ijerph18042197

Park D, Lee J, Lee J, Lee K (2021) Deep Learning based Food Instance Segmentation using Synthetic Data, IEEE, 18th International Conference on Ubiquitous Robots (UR). https://doi.org/10.1109/UR52253.2021.9494704

Peng Z, Peng S, Fu L, Lu B, Tang J, Wang K, Li W (2020) A novel deep learning ensemble model with data denoising for short-term wind speed forecasting. Energy Convers Manag 207:112524. https://doi.org/10.1016/j.enconman.2020.112524

Pérez-Borrero I, Marín-Santos D, Gegúndez-Arias ME, Cortés-Ancos E (2020) A fast and accurate deep learning method for strawberry instance segmentation. Comput Electron Agric 178:105736. https://doi.org/10.1016/j.compag.2020.105736

Picon A, San-Emeterio MG, Bereciartua-Perez A, Klukas C, Eggers T, Navarra-Mestre R (2022) Deep learning-based segmentation of multiple species of weeds and corn crop using synthetic and real image datasets. Comput Electron Agric 194:10671. https://doi.org/10.1016/j.compag.2022.106719

Quan Y, Lin P, Yong X, Nan Y, Ji H (2021) Nonblind image deblurring via deep learning in complex field. IEEE Trans Neural Netw Learn Syst 33(10):5387–5400. https://doi.org/10.1109/TNNLS.2021.3070596

Quan Y, Chen M, Pang T, Ji H (2020) Self2Self with dropout: learning self-supervised denoising from single image. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, pp 1887–1895. https://doi.org/10.1109/CVPR42600.2020.00196

Robiul Islam Md, Nahiduzzaman Md (2022) Complex features extraction with deep learning model for the detection of COVID19 from CT scan images using ensemble based machine learning approach. Exp Syst Appl 195:116554. https://doi.org/10.1016/j.eswa.2022.116554

Saood A, Hatem I (2021) COVID-19 lung CT image segmentation using deep learning methods: U-Net versus SegNet. BMC Med Imaging 21:19. https://doi.org/10.1186/s12880-020-00529-5

Sarki R, Ahmed K, Wang H et al (2020) Automated detection of mild and multi-class diabetic eye diseases using deep learning. Health Inf Sci Syst 8:32. https://doi.org/10.1007/s13755-020-00125-5

Shankar K, Perumal E, Tiwari P et al (2022) Deep learning and evolutionary intelligence with fusion-based feature extraction for detection of COVID-19 from chest X-ray images. Multimedia Syst 28:1175–1187. https://doi.org/10.1007/s00530-021-00800-x

Sharif M, Attique Khan M, Rashid M, Yasmin M, Afza F, Tanik UJ (2019) Deep CNN and geometric features-based gastrointestinal tract diseases detection and classification from wireless capsule endoscopy images. J Exp Theor Artif Intell 33:1–23. https://doi.org/10.1080/0952813X.2019.1572657

Sharma A, Mishra PK (2022) Image enhancement techniques on deep learning approaches for automated diagnosis of COVID-19 features using CXR images. Multimed Tools Appl 81:42649–42690. https://doi.org/10.1007/s11042-022-13486-8

Sharma T, Nair R, Gomathi S (2022) Breast cancer image classification using transfer learning and convolutional neural network. Int J Modern Res 2(1):8–16

Sharma H, Jain JS, Bansal P, Gupta S (2020) Feature extraction and classification of chest X-ray images using CNN to detect pneumonia. In: 2020 10th International Conference on Cloud Computing, Data Science and Engineering (Confluence), Noida, India, pp 227–231. https://doi.org/10.1109/Confluence47617.2020.9057809

Simon P, Uma V (2020) Deep learning based feature extraction for texture classification. Procedia Comput Sci 171:1680–1687. https://doi.org/10.1016/j.procs.2020.04.180

Skouta A, Elmoufidi A, Jai-Andaloussi S, Ochetto O (2021) Automated Binary Classification of Diabetic Retinopathy by Convolutional Neural Networks. In: Saeed F, Al-Hadhrami T, Mohammed F, Mohammed E (eds) Advances on Smart and Soft Computing, Advances in Intelligent Systems and Computing. Springer, Singapore. https://doi.org/10.1007/978-981-15-6048-4_16

Sori WJ, Feng J, Godana AW et al (2021) DFD-Net: lung cancer detection from denoised CT scan image using deep learning. Front Comput Sci 15:152701. https://doi.org/10.1007/s11704-020-9050-z

Sungheetha A, Rajesh Sharma R (2021) Design an early detection and classification for diabetic retinopathy by deep feature extraction based convolution neural network. J Trends Comput Sci Smart Technol (TCSST) 3(2):81–94. https://doi.org/10.36548/jtcsst.2021.2.002

Tang H, Zhu H, Fei L, Wang T, Cao Y, Xie C (2023) Low-Illumination image enhancement based on deep learning techniques: a brief review. Photonics 10(2):198. https://doi.org/10.3390/photonics10020198

Abu-Jamie TN, Abu-Naser SS, Alkahlout MA, Aish MA (2022) Six fruits classification using deep learning. Int J Acad Inf Syst Res (IJAISR), ISSN 2643-9026, 6(1):1–8

Tawfik MS, Adishesha AS, Hsi Y, Purswani P, Johns RT, Shokouhi P, Huang X, Karpyn ZT (2022) Comparative study of traditional and deep-learning denoising approaches for image-based petrophysical characterization of porous media. Front Water 3:800369 https://doi.org/10.3389/frwa.2021.800369

Tian C, Xu Y, Fei L, Yan K (2019) Deep Learning for Image Denoising: A Survey. In: Pan JS, Lin JW, Sui B, Tseng SP (eds) Genetic and Evolutionary Computing. ICGEC 2018. Advances in Intelligent Systems and Computing. Springer, Singapore. https://doi.org/10.48550/arXiv.1810.05052

Tian C, Fei L, Zheng W, Xu Y, Zuof W, Lin CW (2020) Deep Learning on Image Denoising: An Overview. Neural Networks 131:251-275 https://doi.org/10.1016/j.neunet.2020.07.025

Wang D, Su J, Yu H (2020) Feature Extraction and analysis of natural language processing for deep learning english language. IEEE Access 8:46335–46345. https://doi.org/10.1109/ACCESS.2020.2974101

Wang EK, Chen CM, Hassan MM, Almogren A (2020) A deep learning based medical image segmentation technique in Internet-of-Medical-Things domain. Future Gen Comput Syst 108:135–144. https://doi.org/10.1016/j.future.2020.02.054

Xiaowei Xu, Chen Y, Junfeng Zhang Y, Chen PA, Manickam A (2020) A novel approach for scene classification from remote sensing images using deep learning methods. Eur J Remote Sens 54:383–395. https://doi.org/10.1080/22797254.2020.1790995

Yan K, Chang L, Andrianakis M, Tornari V, Yu Y (2020) Deep learning-based wrapped phase denoising method for application in digital holographic speckle pattern interferometry. Appl Sci 10:4044. https://doi.org/10.3390/app10114044

Yang R, Luo F, Ren F, Huang W, Li Q, Du K, Yuan D (2022) Identifying urban wetlands through remote sensing scene classification using deep learning: a case study of Shenzhen. China ISPRS Int J Geo-Inf 11:131. https://doi.org/10.3390/ijgi11020131

Yoshimura N, Kuzuno H, Shiraishi Y, Morii M (2022) DOC-IDS: a deep learning-based method for feature extraction and anomaly detection in network traffic. Sensors 22:4405. https://doi.org/10.3390/s22124405

Zhang W, Zhao C, Li Y (2020) A novel counterfeit feature extraction technique for exposing face-swap images based on deep learning and error level analysis. Entropy 22(2):249. https://doi.org/10.3390/e22020249


Zhou Y, Zhang C, Han X, Lin Y (2021) Monitoring combustion instabilities of stratified swirl flames by feature extractions of time-averaged flame images using deep learning method. Aerospace Sci Technol 109:106443. https://doi.org/10.1016/j.ast.2020.106443

Zhou X, Zhou H, Wen G, Huang X, Le Z, Zhang Z, Chen X (2022) A hybrid denoising model using deep learning and sparse representation with application in bearing weak fault diagnosis. Measurement 189:110633. https://doi.org/10.1016/j.measurement.2021.110633


Author information

Authors and affiliations.

Department of Computer Science, Bishop Heber College (Affiliated to Bharathidasan University), Tiruchirappalli, Tamil Nadu, India

R. Archana & P. S. Eliahim Jeevaraj


Contributions

All authors reviewed the manuscript.

Corresponding author

Correspondence to P. S. Eliahim Jeevaraj.

Ethics declarations

Conflict of interest.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Archana, R., Jeevaraj, P.S.E. Deep learning models for digital image processing: a review. Artif Intell Rev 57 , 11 (2024). https://doi.org/10.1007/s10462-023-10631-z


Accepted : 17 December 2023

Published : 07 January 2024

DOI : https://doi.org/10.1007/s10462-023-10631-z


Keywords

  • Image processing
  • Deep learning models
  • Convolutional neural networks (CNN)
Open access

Published: 10 May 2021

Comparative analysis of deep learning image detection algorithms

  • Shrey Srivastava 1 ,
  • Amit Vishvas Divekar 1 ,
  • Chandu Anilkumar 1 ,
  • Ishika Naik 1 ,
  • Ved Kulkarni 1 &
  • V. Pattabiraman 1  

Journal of Big Data volume 8, Article number: 66 (2021)


A computer views all kinds of visual media as an array of numerical values. As a consequence of this approach, computers require image processing algorithms to inspect the contents of images. This project compares three major image processing algorithms: Single Shot Detection (SSD), Faster Region based Convolutional Neural Networks (Faster R-CNN), and You Only Look Once (YOLO), to find the fastest and most efficient of the three. In this comparative analysis, using the Microsoft COCO (Common Objects in Context) dataset, the performance of these three algorithms is evaluated and their strengths and limitations are analysed based on parameters such as accuracy, precision and F1 score. From the results of the analysis, it can be concluded that the suitability of any of the algorithms over the other two is dictated to a great extent by the use cases they are applied in. In an identical testing environment, YOLO-v3 outperforms SSD and Faster R-CNN, making it the best of the three algorithms.

Introduction

In recent times, the industrial revolution has made use of computer vision. Automation industries, robotics, the medical field, and surveillance sectors make extensive use of deep learning [ 1 ]. Deep learning has become the most talked-about technology owing to its results, which are mainly obtained in applications involving language processing, object detection and image classification. Market forecasts predict outstanding growth over the coming years. The main reasons cited for this are primarily the accessibility of both strong Graphics Processing Units (GPUs) and many datasets [ 1 ]. In recent times, both these requirements have become easily available [ 1 ].

Image classification and detection are the most important pillars of object detection. There is a plethora of datasets available. Microsoft COCO is one such widely used dataset in the image classification domain and a benchmark dataset for object detection. It is a large-scale dataset available for image detection and classification [ 2 ].

This review article aims to make a comparative analysis of SSD, Faster R-CNN, and YOLO. The first algorithm in the comparison is SSD, which adds several feature layers to the end of a base network and facilitates detection [ 3 ]. Faster R-CNN is a unified, faster, and accurate method of object detection that uses a convolutional neural network. YOLO, developed by Joseph Redmon, offers an end-to-end detection network [ 3 ].

In this paper, by using the Microsoft COCO dataset as a common factor of the analysis and measuring the same metrics across all the implementations mentioned, the respective performances of the three above-mentioned algorithms, which use different architectures, have been made comparable to each other. The results obtained by comparing the effectiveness of these algorithms on the same dataset can help gain insight into the unique attributes of each algorithm, understand how they differ from one another and determine which method of object recognition is most effective for any given scenario.

Literature survey

Object detection has been an important topic of research in recent times. With powerful learning tools available, deeper features can be easily detected and studied. This work is an attempt to compile information on various object detection tools and algorithms used by different researchers so that a comparative analysis can be done and meaningful conclusions can be drawn for applying them to object detection. The literature survey provides the context and insight needed for our work.

The work done by Ross Girshick introduced the Fast R-CNN model as a method of object detection [ 3 ]. It makes use of the CNN method in the target detection field. The novelty of the method proposed by Girshick is a window (region proposal) extraction algorithm in place of the conventional sliding window extraction procedure of the R-CNN model; in R-CNN, the deep convolutional network for feature isolation and the support vector machines for categorization are trained separately [ 4 ]. The Fast R-CNN method combines feature extraction and classification into a single classification framework [ 3 ]. Training is nine times faster in Fast R-CNN than in R-CNN. In the Faster R-CNN method, the proposal isolation region and part of Fast R-CNN are put into a network module referred to as the region proposal network (RPN). The accuracy of Fast R-CNN and Faster R-CNN is the same. The research concludes that the method is a unified, deep learning-based object detection system that works at 5–7 fps (frames per second) [ 4 ]. Basic knowledge about R-CNN, Fast R-CNN and Faster R-CNN was acquired from this paper. The training of the respective model was also inspired by this paper.

Another research work, by Kim et al., is discussed here. This research work uses CNN with background subtraction to build a framework that detects and recognizes moving objects using CCTV (Closed Circuit Television) cameras. It is based on the application of the background subtraction algorithm to each frame [ 5 ]. An architecture similar to the one in this paper was used in our work.

Another detection network is YOLO. Joseph Redmon et al. proposed You Only Look Once (YOLO), a single-pass convolutional neural network that predicts frame positions and classifies multiple candidates at once. End-to-end target detection can be achieved this way. It formulates object detection as a regression problem: a single end-to-end system maps the original image directly to object categories and positions [ 6 ]. Bounding box prediction and feature extraction in the YOLO architecture used in our work were inspired by the technique discussed in this paper.

Tanvir Ahmed et al. proposed a modified method that uses an advanced YOLO v1 network model: it optimizes the YOLO v1 loss function, adopts a new Inception-style model structure, adds a specialized pyramid pooling layer, and achieves better performance. The advanced application of YOLO is taken from this research paper. It is also an end-to-end process, evaluated through extensive experiments on the PASCAL VOC (Visual Object Classes) dataset. The network is an improved version and also shows high effectiveness [ 7 ]. The training of our YOLO model using PASCAL VOC was done using the technique proposed in this paper.

Wei Liu et al. came up with a new method of detecting objects in images using a single deep neural network, which they named the Single Shot MultiBox Detector (SSD). According to the team, SSD is a simple method that does not require object proposals, as it completely eliminates the proposal generation process and the subsequent pixel and feature resampling stages. It therefore combines everything into a single step. SSD is also very easy to train and straightforward to integrate into a system, which makes detection easier. The primary feature of SSD is the use of multiscale convolutional bounding box outputs attached to several feature maps [ 8 ]. Training and model analysis of the SSD model in our work were inspired by the work discussed here.

Another paper is based on an advanced type of SSD. In it, the authors introduce Tiny SSD, a single shot detection deep convolutional neural network aimed at easing real-time embedded object detection. It comprises a highly optimized, non-uniform Fire sub-network stack and a non-uniform stack of SSD-based auxiliary convolutional feature layers. The best feature of Tiny SSD is its size of 2.3 MB, which is even smaller than Tiny YOLO. The results of this work show that Tiny SSD is well suited for embedded detection [ 9 ]. A similar SSD model was used for the purpose of comparison.

The paper by Pathak et al. describes the role of deep learning techniques, using CNNs, for object detection. The paper also assesses some deep learning techniques for object detection systems. It states that deep CNNs work on the principle of weight sharing and gives information about some crucial points of CNNs.

The features of CNNs depicted in this paper are: [ 1 ]

  • Convolution is an integral operation that involves the multiplication of two overlapping functions.
  • Feature maps are abstracted (pooled) to reduce their complexity in terms of space.
  • The process is repeated with different filters to produce multiple feature maps.
  • CNNs utilize different types of pooling layers.

This paper was used as the basis for understanding Convolutional Neural Networks and their role in deep learning.

In a recent research work, Chen et al. used anchor boxes for face detection along with a more precise regression loss function. They proposed a face detector termed YOLO-face, based on YOLOv3, that aims at resolving detection problems for varying face scales. The authors concluded that their algorithm outperformed previous YOLO versions and their variants [ 10 ]. YOLOv3 was used in our work for comparison with other models.

In their research work, Fan et al. proposed an improved system for pedestrian detection based on the SSD object detection model. In this work, they introduced the Squeeze-and-Excitation module as an additional layer in the multi-layered SSD model. The improved model employed self-learning, which further enhanced the accuracy of the system for small-scale pedestrian detection. Experiments on the INRIA dataset showed high accuracy [ 11 ]. This paper was used for the purpose of understanding the SSD model.

In a recent survey, Mittal et al. discussed algorithms such as Faster RCNN, Cascade RCNN, R-FCN, YOLO and its variants, SSD, RetinaNet, CornerNet, and Objects as Points as advanced phases in deep learning-based detectors. This paper provides a comprehensive summary of low-altitude datasets and the algorithms used for the respective work [ 12 ]. Our comparison work was done using COCO metrics, similar to the comparison performed in that paper. The paper also discusses several other comparison techniques which were considered in our work.

Artificial Intelligence (AI): It is a system's ability to correctly interpret external data, to learn from such data, and to use those learnings to achieve specific goals and tasks through flexible adaptation [ 13 ].

Machine Learning (ML): It is the study of algorithms that improve automatically through experience [ 14 ]. ML algorithms build a training model based on sample data, and using it, make predictions or decisions without being ‘explicitly programmed to do so’.

Deep Learning (DL): It is the most used and most preferred approach to machine learning. It is inspired by the working of the biological brain—how individual neurons firing on receiving input only see a very small part of the total input/processed data. It has multiple layers. Upper layers build on the outputs from lower layers. Thus, the higher the layer, the more complex is the data it processes [ 15 ].

Higher layers identify more complex patterns such as animals, faces, objects, and skies. A CNN consists of alternating convolutional and pooling layers with at least one fully connected layer at the end.

Evolution of CNNs

Convolutional Neural Network (CNN): It is a type of artificial neural network that is mainly used to analyse images. It was inspired by the neurological experiments conducted by Hubel and Wiesel on the visual cortex [ 17 ]. The visual cortex is the primary region processing visual sensory information in the brain. A CNN extracts features from images and detects patterns and structures to detect objects in the images. Its distinct feature is the presence of hidden convolutional layers. These layers apply filters to extract patterns from images; each filter moves over the image to generate its output, and different filters recognize different patterns. Initial layers have filters that recognize simple patterns, and these become more complex through the layers. CNNs themselves have evolved over time as follows:

Origin (Late 1980s–1990s): The first popular CNN was LeNet-5 developed in 1998 by LeCun et al. [ 18 ]. It was in development for almost a decade. Its purpose was to detect handwritten digits. It is credited for sparking R&D of efficient CNNs in the field of deep learning. Banks started using it in ATMs.

Stagnation (Early 2000s): The internal working of CNNs was not yet understood during this period. Also, there was no dataset of a variety of images like Google’s Open Images or Microsoft’s COCO. Hence, most CNNs were only focused on optical character recognition (OCR). CNNs also required high computational time; increasing operating cost. Support Vector Machine (SVM), a machine learning model was showing better results than CNN.

Revival (2006–2011): Ranzato et al. in their paper demonstrated that using the max-pooling algorithm for feature extraction instead of the sub-sampling algorithm used earlier results in significant improvement [ 19 ]. Researchers had started using GPUs to accelerate training of CNNs. Around the same time, NVIDIA introduced the CUDA platform that allowed and facilitated parallel processing, thus speeding up CNN training and validation [ 20 ]. This re-sparked research. In 2010, Stanford University established a large image dataset called Pattern Analysis, Statistical modelling and Computational Learning Visual Object Classes (PASCAL VOC), removing yet another hurdle.

Rise (2012–2013): AlexNet was a major breakthrough for accuracy of CNNs. It achieved an error rate of just 15.3% in the 2012 ILSVR challenge. The second-place network had an error rate of 26.2% [ 21 ]. So, AlexNet was better by a large margin of 10.8% than any other network known at the time. AlexNet achieved this accuracy by having a total of 8 layers [ 21 ], thus truly realizing ‘deep’ learning. This required greater computational power, but the advances in GPU technology made it possible. AlexNet, like LeNet is one of the most influential papers to ever be published on CNNs.

Architectural Innovations (2014–2020): The well-known and widely used VGG architecture was developed in 2014 [ 22 ]. RCNN, based on VGG like many others, introduced the idea that objects are located in certain regions of the image; hence the name: region-based CNN [ 23 ]. Improved versions of RCNN—Fast RCNN [ 24 ] and Faster RCNN [ 3 ] came out in the subsequent years. Both of these reduced computation time, while maintaining the accuracy that RCNN is known for. Single Shot Multibox Detector (SSD), also based on VGG was developed around 2016 [ 8 ]. Another algorithm, You Only Look Once (YOLO), based on an architecture called DarkNet was first published in 2016 [ 6 ]. It is in active development; its third version was released in 2018 [ 25 ].

Existing methodologies

Other object detection models such as YOLO or Faster R-CNN perform their operations at a much lower speed than SSD, making SSD a much more favourable object detection method.

Before the development of SSD, several attempts had been made to design a faster detector by modifying each stage of the detection pipeline. However, any significant increase in speed by such modifications only resulted in a decrease in the detection’s accuracy and hence researchers concluded that rather than altering an existing model, they would have to come up with a fundamentally different object detection model, and hence, the creation of the SSD model [ 8 ].

SSD does not resample pixels or features for bounding box hypotheses and is as accurate as models that do. In addition to this, it is quite straightforward compared to methods that require object proposals, because it completely eliminates proposal generation and the subsequent pixel or feature resampling stages by encompassing all computation in a single network. Therefore, SSD is very simple to train and can be easily integrated into systems that perform detection as one of their functions [ 8 ].

It’s architecture heavily depends on the generation of bounding boxes and the extraction of feature maps, which are also known as default bounding boxes. Loss is calculated by the network, using comparisons of the offsets of the predicted classes and the default bounding boxes with the training samples’ ground truth values, using different filters for every iteration. Using the back-propagation algorithm and the calculated loss value, all the parameters are updated. This way, SSD is able to learn the most optimal filter structures that can accurately identify the object features and generalize the given training samples in order to minimize the loss value, resulting in high accuracy during the evaluation phase [ 26 ].

Analysis of the functions

SSD is built on a feed-forward convolutional network that produces a fixed-size collection of bounding boxes and, for each occurrence of an object in those boxes, a respective score. After score generation, non-maximum suppression is used to generate the final detection results. The preliminary network layers are based on a standard architecture used for high-quality image classification (truncated before any classification layers), which is a VGG-16 network. An auxiliary structure, such as conv6, is added to the truncated base network to produce detections.

Extracting feature maps: SSD uses the VGG-16 architecture to extract feature maps because it performs very well on high-quality image classification. Auxiliary layers are used because they allow the required features to be extracted at multiple scales while reducing the size of the input with each layer that is traversed [ 8 ]. For each cell in the feature map, the layer makes a certain number of predictions. Each prediction consists of a boundary box together with scores for all the classes that may appear in that box, including a score for no object at all. The algorithm 'guesses' what is in the boundary box by choosing the class with the highest score. These scores are called 'confidence scores', and making such predictions is called 'MultiBox'. Figure 1 depicts the SSD model with the extra feature layers.

Convolutional predictors for object detection: Every feature layer produces a fixed number of predictions by utilising convolutional filters. For every feature layer of size x × y having n channels, the rudimentary component for generating the prediction variables of a potential detection result is a 3 × 3 × n small kernel that produces either a confidence score for every class or a shape offset calculated with respect to the default bounding box coordinates, which are provided by the COCO dataset, at every one of the x × y locations [ 8 ].

Default boxes and aspect ratios: By now, you may be able to infer that every single feature map cell is associated with a corresponding set of default bounding boxes across the multiple feature maps in the network. The default boxes tile the feature map so that the placement of each box with respect to its corresponding cell is fixed. At each feature map cell, we predict the offsets with respect to the default box shapes in the cell and the scores for each class, which tell us about the class of object present inside the bounding box. In more detail, for every box out of the b boxes at a given location, s class scores and 4 offsets relative to the original default box shape are calculated. This results in a total of (s + 4) × b filters applied at every location in the feature map, yielding (s + 4) × b × x × y outputs for an x × y feature map [ 8 ].
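To make the output arithmetic concrete, the sketch below applies a single 3 × 3 convolutional prediction head to a hypothetical feature map and checks that the number of outputs matches the (s + 4) × b × x × y count given above; the values of s, b, x, y and the channel count are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

s, b = 80, 6             # s classes (e.g., COCO's 80) and b default boxes per cell -- illustrative values
x, y, n = 10, 10, 512    # a hypothetical 10x10 feature map with 512 channels

# One 3x3 convolution produces (s + 4) predictions per default box at every cell.
head = nn.Conv2d(n, b * (s + 4), kernel_size=3, padding=1)
feature_map = torch.randn(1, n, x, y)
out = head(feature_map)

print(out.shape)               # torch.Size([1, b*(s+4), x, y]) -> 504 channels here
print(out.shape[1] * x * y)    # (s + 4) * b * x * y outputs in total, as stated in the text
```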

Figure 1: Deep Learning Layers illustration [ 15 ]

SSD Training Process

Matching Process: All SSD predictions are divided into two types: negative matches or positive matches. SSD uses only positive matches to calculate the localization cost, which is the misalignment of the boundary box with the default box. A match is positive only if the corresponding default boundary box's IoU with the ground truth is greater than 0.5; in any other case, it is negative. IoU stands for 'intersection over union', the ratio between the intersected area and the joined area of two regions. IoU is also referred to as the Jaccard index, and using this condition makes the learning process much easier [ 8 ].
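A minimal implementation of the IoU (Jaccard index) test used in the matching step might look like the following; the box coordinates are hypothetical, and boxes are assumed to be given as (x1, y1, x2, y2) corner pairs.

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A default box counts as a positive match only if its IoU with the ground truth exceeds 0.5.
default_box, ground_truth = (50, 50, 150, 150), (60, 60, 160, 160)
print(iou(default_box, ground_truth), iou(default_box, ground_truth) > 0.5)
```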

Hard negative mining: After the matching step, almost all of the default boxes are negatives, especially when the total number of possible default boxes is high. This causes a large imbalance between positive and negative training examples. Rather than using all the negative examples, SSD sorts them by their confidence loss and keeps only the highest-loss ones, so that at any point in time the ratio of negatives to positives is at most 3:1. This leads to faster optimization and more stable training [ 8 ].
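A minimal sketch of this selection step is given below; it assumes per-box confidence losses and positive/negative match flags are already available, and simply keeps all positives plus the hardest negatives up to a 3:1 negative-to-positive ratio.

```python
def hard_negative_mining(conf_losses, is_positive, neg_pos_ratio=3):
    """Keep all positives and only the hardest negatives, at most neg_pos_ratio per positive."""
    positives = [i for i, pos in enumerate(is_positive) if pos]
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    # Sort negatives by confidence loss, highest (hardest) first
    negatives.sort(key=lambda i: conf_losses[i], reverse=True)
    # Keep at most neg_pos_ratio negatives per positive (at least one slot if no positives)
    keep_negatives = negatives[: neg_pos_ratio * max(len(positives), 1)]
    return positives, keep_negatives
```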

Data augmentation: This is crucial for increasing accuracy. There are several data augmentation techniques that may be employed, such as color distortion, flipping, and cropping. To deal with a variety of object sizes and shapes, each training image is randomly sampled using one of the options listed below [ 8 ]:

Use the original image,

Sample a patch with IoU of 0.1, 0.3, 0.5, 0.7 or 0.9,

Sample a patch randomly.

Final detection: The results are generated by performing NMS on the multi-scale refined bounding boxes. Using the above-mentioned methods such as hard negative mining and data augmentation, among others, SSD's accuracy is much greater than that of Faster R-CNN on the PASCAL VOC and COCO datasets, while being three times faster [ 26 ]. SSD300, where the input image size is 300 × 300, runs at 59 FPS and is more efficient and accurate than YOLO. However, SSD is not as effective at detecting smaller objects; this can be addressed by using a better feature extractor backbone (e.g., ResNet101), adding deconvolution layers with skip connections to introduce additional large-scale context, and designing a better network structure [ 27 ].
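The NMS step used for final detection can be sketched as a plain greedy procedure over scored boxes, reusing the iou helper above; the 0.45 overlap threshold is an illustrative assumption, not a value taken from the paper.

```python
def non_max_suppression(boxes, scores, iou_threshold=0.45):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop boxes overlapping it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```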

Complexity analysis

For most algorithms, time complexity depends on the size of the input and can be expressed in big-O notation. However, for deep learning models, time complexity is usually evaluated in terms of the total time taken to train the model and the inference time when the model is run on specific hardware (Fig. 2).

Figure 2: Evolution of CNNs from 1979 through 2018 [ 16 ]

Deep learning models have to carry out millions of calculations, which can be quite expensive computationally; however, most of these calculations end up being performed in parallel by the thousands of identical neurons in each layer of the artificial neural network. Due to this parallel nature, it has been observed that training an SSD model on an Nvidia GeForce GTX 1070i GPU reduces the training time by a factor of ten [ 28 ].

When it comes to time complexity, the matrix multiplications in the forward pass of the base CNN take up the most time. The total number of multiplications depends on the number of layers in the CNN along with more specific details such as the number of filters and their respective sizes, the size of the output feature map, and the image's resolution. The activation function used at each layer is a ReLU, which has been found to run in quadratic time per neuron in each layer. Taking all these factors into account, the time complexity of the forward pass of the base CNN is:

O( ∑_{b=1}^{B} x_{b−1} · h² · x_b · s_b² )

Here, b denotes the index of the CNN layer, B is the total number of CNN layers, x_b is the number of filters in the b-th layer, h is the filter width and height, x_{b−1} is the number of input channels of the b-th layer, and s_b is the size of the output feature map.
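Under the same notation, the multiplication count can be tallied layer by layer, as in the short sketch below; the three-layer configuration is purely illustrative and does not correspond to the actual VGG-16 or ZF settings.

```python
def conv_forward_multiplies(layers):
    """Sum x_{b-1} * h^2 * x_b * s_b^2 over all convolutional layers.

    Each layer is a tuple (in_channels, filter_size, out_filters, out_feature_map_size).
    """
    total = 0
    for in_ch, h, out_filters, fmap in layers:
        total += in_ch * (h ** 2) * out_filters * (fmap ** 2)
    return total

# Illustrative 3-layer configuration (not a real network)
print(conv_forward_multiplies([(3, 3, 64, 150), (64, 3, 128, 75), (128, 3, 256, 38)]))
```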

It should be noted that operations such as dropout, regression, batch normalisation, and classification also take up roughly five to ten percent of the training time.

As for SSD's accuracy, it is measured by mean average precision (mAP), which is simply the average over all classes of the AP obtained from the area under each class's precision-recall curve. A higher mAP indicates a more accurate model [ 28 ].
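The relationship between per-class AP and mAP can be sketched as follows; this assumes precision-recall points sorted by increasing recall and uses a simple trapezoidal area, whereas the official VOC/COCO evaluations use interpolated precision, so it is only an approximation.

```python
def average_precision(precisions, recalls):
    """Approximate AP as the area under a precision-recall curve (trapezoidal rule)."""
    area = 0.0
    for i in range(1, len(recalls)):
        area += (recalls[i] - recalls[i - 1]) * (precisions[i] + precisions[i - 1]) / 2.0
    return area

def mean_average_precision(per_class_aps):
    """mAP is simply the mean of the per-class AP values."""
    return sum(per_class_aps) / len(per_class_aps)
```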

Faster R-CNN

R-CNN stands for Region-based Convolutional Neural Network. This method combines region proposals for object segmentation with high-capacity CNNs for object detection [ 28 ].

The algorithm of the original R-CNN technique is as follows [ 29 ]:

Using a Selective Search Algorithm, several candidate region proposals are extracted from the input image. In this algorithm, numerous candidate regions are generated in an initial sub-segmentation. Similar regions are then combined into bigger regions using a greedy algorithm, and these make up the final region proposals.

The CNN component warps the proposals and extracts distinct features as a vector output.

The features which are extracted are fed into an SVM (Support Vector Machine) for recognizing objects of interest in the proposal.

Figure 4 given below explains the features and working of R-CNN.

This technique was plagued by several drawbacks. The requirement to classify ~2000 region proposals makes training the CNN a very time-consuming process. This makes real-time implementation impossible, as each test image takes close to 47 seconds to process.

Furthermore, no learning takes place at the proposal stage, since the Selective Search Algorithm is fixed. This can result in non-ideal candidate region proposals being generated [ 29 ].

Fast R-CNN is an algorithm for object detection that solves some of the drawbacks of R-CNN. It uses an approach similar to that of its predecessor, but instead of feeding region proposals to the CNN, the CNN takes the image itself to create a convolutional feature map, from which region proposals are then determined and warped. An RoI (Region of Interest) pooling layer reshapes the warped regions to a predefined size so that a fully connected layer can accept them. The region class is then predicted from the RoI vector with the help of a SoftMax layer [ 24 ].

Fast R-CNN is faster than its predecessor because feeding ~2000 proposals to the CNN per execution is not required; the convolution operation is performed only once per image to generate a feature map [ 24 ]. Fig. 3 below describes the features and working of Fast R-CNN.

Figure 3: SSD model [ 8 ]

This algorithm shows a significant reduction in the time required for both training and testing when compared to R-CNN. However, it was noticed that generating region proposals still significantly bottlenecks the algorithm, reducing its performance [ 3 ].

Both Fast R-CNN and its predecessor used Selective Search to determine region proposals. Since this is a very time-consuming algorithm, Faster R-CNN eliminated it and instead let the proposals be learned by the network. Just as in Fast R-CNN, a convolutional feature map is obtained from the image, but a separate network replaces the Selective Search algorithm to predict proposals. These proposals are then reshaped and classified using RoI (Region of Interest) pooling. Refer to Fig. 4 for the working of Faster R-CNN.

Figure 4: R-CNN model [ 15 ]

Faster R-CNN offers an improvement over its predecessors so significant that it is now capable of being implemented for real-time object detection.

Architecture of faster R-CNN

The original implementation of the Faster Region-based Convolutional Neural Network (Faster R-CNN) algorithm was experimented with on two convolutional network architectures: the ZF (Zeiler and Fergus) model, which shares 5 convolutional layers with a Fast R-CNN network, and the VGG-16 (Simonyan and Zisserman) model, which shares 13 convolutional layers [ 3 ].

The ZF model is based on an earlier convolutional network model (by Krizhevsky, Sutskever and Hinton) [ 30 ]. That model consisted of eight layers, of which five were convolutional and the remaining three were fully connected [ 21 ].

That architecture exhibited a few problems. The first-layer filters captured mostly very high and very low frequency information, with little coverage of the mid frequencies, and the large stride of 4 used in the first layer caused aliasing artifacts in the second layer. The ZF model fixed these issues by reducing the first-layer filter size and lowering the convolution stride to 2, allowing it to retain more information in the first and second layers and improving classification performance [ 30 ].

Region-based Convolutional Neural Network (RCNN) and Fast-RCNN both use Selective Search. Selective Search is a greedy algorithm, and greedy algorithms do not always return the best result [ 31 ]. In RCNN, the CNN must then be run on each of the roughly 2000 proposed regions of the image, whereas Fast-RCNN computes the feature map once and extracts all the region features from it, running Selective Search just once; this reduces time complexity by a large factor [ 3 ]. Faster RCNN (FRCNN) removes the final bottleneck, Selective Search, by instead using the Region Proposal Network (RPN). The RPN fixes the regions as an n × n grid and needs to run far fewer times than Selective Search [ 3 ].

As shown in the diagram above, FRCNN consists of a Deep Fully Convolutional Network (DFCN), a Region Proposal Network, ROI pooling, Fully Connected (FC) networks, a Bounding Box Regressor, and a Classifier.

We will consider the DFCN to be ZF-5 for consistency of calculation [ 30 ]. First, a feature map M of dimensions 256 × n × n is extracted from the input image P [ 33 ]. It is then fed to the RPN and ROI pooling.

RPN: There are k anchors for each point on M, so the total number of anchors is n × n × k. Anchors are ranked according to score, and 2000 anchors are retained after Non-Maximum Suppression [ 3 ]. The complexity comes out to be O(N²/2).

ROI: Anchors are divided into an H × W grid of sub-windows based on M, and the output grid is obtained by max-pooling the values in the corresponding sub-windows. ROI pooling is a special case of the spatial pyramid pooling layer used in SPP-net, with just one pyramid level [ 24 ]. Hence, its complexity becomes O(1).
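A minimal sketch of RoI max-pooling over a single-channel 2-D feature map is shown below; the integer region coordinates and plain Python lists are simplifying assumptions for illustration, not the actual Faster R-CNN implementation.

```python
def roi_max_pool(feature_map, x1, y1, x2, y2, out_h, out_w):
    """Max-pool the region [x1:x2, y1:y2] of a 2-D feature map into an out_h x out_w grid."""
    region_h, region_w = y2 - y1, x2 - x1
    pooled = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            # Bounds of this sub-window inside the region (at least one cell each)
            ys = y1 + i * region_h // out_h
            ye = max(y1 + (i + 1) * region_h // out_h, ys + 1)
            xs = x1 + j * region_w // out_w
            xe = max(x1 + (j + 1) * region_w // out_w, xs + 1)
            pooled[i][j] = max(feature_map[y][x] for y in range(ys, ye) for x in range(xs, xe))
    return pooled
```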

In modern times, YOLO (You Only Look Once) is one of the most precise and accurate object detection algorithms available. It is built on a newly customized architecture named Darknet [ 25 ]. The first version was inspired by GoogLeNet; it used a tensor to downsample the image and predict with maximum accuracy. The tensor is generated by a procedure similar in structure to the Region of Interest pooling used in the Faster R-CNN network, which reduces the number of individual computations and makes the analysis swifter. The following generation used an architecture with 30 convolutional layers, consisting of the 19 layers of DarkNet-19 plus an extra 11 layers for detecting objects in natural context, as the COCO dataset and metrics were used. It provided more precise detection at good speed, although it struggled with pictures of small objects and few pixels. Version 3 has been the most accurate version of YOLO and is widely used because of its high precision; its deeper, multi-layer architecture has made detection more precise [ 26 ].

YOLOv3 makes use of the latest Darknet features, such as 53 layers, and has been trained on one of the most reliable datasets, ImageNet. The layers come from the convolutional Darknet-53 architecture. For detection, these 53 layers replaced the pre-existing 19, and the enhanced architecture was trained on PASCAL VOC. Even with the additional layers, the architecture maintains one of the best response times for the accuracy it offers. It is also very helpful for analysing live video feeds because of its swift upsampling and object detection techniques. This version is one of the best enhancements in machine learning (ML) using neural networks. The previous version did not work well with low-pixel images, but the updates in v3 have made it very useful for analysing satellite imagery, even for the defence departments of some countries. The architecture makes predictions at three different scales, which makes it more effective, although the process is a little slower; it remains state-of-the-art. For an overview of the framework, refer to Fig. 5 below.

Figure 5: Fast R-CNN [ 16 ]; feature extraction and analysis [ 34 ]

1. Forecasting: This model uses boxes of different widths and heights to produce the weights and frames that establish a strong foundation. It is a single-stage technique in which the network determines objectness and localization independently. YOLOv3 uses logistic regression to predict the objectness score, which is projected onto the selection frame placed over the object established as ground truth in the picture by pre-trained models [ 35 ]. This gives a single bounding box, and any error in this step would cause mistakes both in the allocation and accuracy of these boxes and in the detection thereafter. Bounding box forecasting is depicted in the equation given below and in Fig. 6.

Figure 6: Faster R-CNN [ 3 ]; equations for bounding box forecasting [ 34 ]

2. Class Prediction: YOLOv3 does not rely on a soft-max function to convert the scores into probabilities that sum to 1. Instead, it performs multi-label classification with custom, non-exclusive tags; for example, 'man' and 'woman' are not mutually exclusive. The architecture replaces the soft-max with independent logistic classifiers and uses a binary loss function during training, which reduces complexity by avoiding the soft-max computation [ 36 ].
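The difference between a softmax and independent logistic classifiers can be illustrated with the minimal sketch below; the raw class scores are made-up numbers, and the class names in the comment are only an example of non-exclusive tags.

```python
import math

def softmax(scores):
    """Mutually exclusive class probabilities; they sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def independent_sigmoids(scores):
    """Per-class probabilities from independent logistic classifiers; several can exceed 0.5."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]

scores = [2.0, 1.5, -1.0]            # e.g., raw scores for person, woman, car
print(softmax(scores))               # exclusive: probabilities sum to 1
print(independent_sigmoids(scores))  # non-exclusive: person and woman can both be positive
```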

3. Predictions: Bounding boxes are predicted at three distinct scales and dimensions, in combination with the feature extractor, DarkNet-53. The last layers perform detection and categorization into object classes. Three boxes are predicted at each scale on the COCO dataset, which leads to more than 70 class predictions (COCO's 80 classes) in the output tensor. The overall design follows the classic coder-decoder structure introduced in the Single-Shot Detector. K-means clustering is also used for finding the best prior bounding boxes; for the COCO dataset, dimensions such as 10 × 13 and 62 × 45 are used, and in total there are 9 distinct dimensions, including the aforementioned.

4. DarkNet-53, the feature extractor: YOLOv2 used DarkNet-19, but the most recent YOLO model uses Darknet-53, where 53 refers to its 53 convolutional layers. Both speed and accuracy are improved in Darknet-53, making it about 1.5 times quicker. When this architecture is compared with ResNet-152, it shows almost the same accuracy and precision but is twice as fast [ 37 ]. Fig. 7 shows the YOLO model.

Figure 7: CNN of the Krizhevsky model [ 21 ]

The YOLO network is based on a systematic division of the given image into a grid. The grids are of three types, which will be mentioned later. Each grid cell serves as a separate image for the algorithm and undergoes further division. YOLO uses boundaries called bounding boxes, which act as anchors for the analysis of an image. Only a small number of these boxes are accepted as results; thousands of others are ignored because of low probability scores and treated as false positives. These boxes are the manifestation of the rigorous breaking down of an image into grids of cells [ 38 , 39 , 40 ].

For determining suitable anchor box sizes, YOLO uses K-means clustering to cluster the boxes found in the training data. These prior boxes act as guidelines for the algorithm: after receiving them, the algorithm looks for objects of similar shape and size. YOLO uses 3 anchors, so each grid cell puts out 3 boxes, and further predictions and analysis are based on these 3 anchor boxes. Some cases and studies use 2 anchor boxes, leading to 2 boxes per grid cell [ 39 ].
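A minimal sketch of deriving anchor box priors by clustering (width, height) pairs is shown below; it uses scikit-learn's KMeans with plain Euclidean distance and invented box sizes, whereas the YOLO authors cluster with an IoU-based distance, so this is only an approximation of the idea.

```python
from sklearn.cluster import KMeans

def anchor_boxes_from_data(box_sizes, k=9):
    """Cluster (width, height) pairs from the training boxes into k anchor box priors."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(box_sizes)
    return km.cluster_centers_

# Illustrative (width, height) pairs in pixels -- not real training data
sizes = [[12, 15], [60, 45], [110, 130], [10, 14], [58, 48], [300, 220],
         [20, 30], [150, 90], [90, 200]]
print(anchor_boxes_from_data(sizes, k=3))
```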

In Fig. 8, the anchor box is shown as the dashed box and the forecast of the ground truth, or bounding box, is the box with the highlighted borders. Several input image sizes are in common use, each with a distinctive grid cell size and shape. For our model we have taken the standard 448 × 448 image size. The other sizes used for analysis are 416 × 416 and 608 × 608, with grid sizes of 13 × 13, 26 × 26 & 52 × 52 and 19 × 19, 38 × 38 & 76 × 76, respectively [ 40 , 41 ].

Figure 8: Bounding box forecasting [ 34 ]

For the first step, the image is resized to 448 × 448 and then divided into a 7 × 7 grid, which implies that each grid cell has a size of 64 × 64 pixels. Every one of these grid cells produces a certain number of bounding boxes; the number may vary from version to version (there are multiple versions within YOLOv3). For our model we are using 2 boxes per grid cell. This gives us 4 coordinates per bounding box: x_center, y_center, width, and height. There is also a corresponding confidence value [ 32 ].
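Following the grid arithmetic above, the sketch below computes the size of the output for the classic YOLO layout of B boxes per cell (4 coordinates plus a confidence each) and C class scores per cell; the value C = 20 is an assumption for illustration, not a figure from the paper.

```python
def yolo_output_size(grid=7, boxes_per_cell=2, num_classes=20):
    """Each cell predicts B * (4 coords + 1 confidence) values plus C class scores."""
    per_cell = boxes_per_cell * 5 + num_classes
    return grid * grid * per_cell

print(yolo_output_size())  # 7 * 7 * 30 = 1470 values in total
print(448 // 7)            # each grid cell covers 64 x 64 pixels
```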

Use of the K-means clustering algorithm gives an exponential worst-case time complexity of O(n^(kd)), where n is the number of samples, k is the number of clusters, and d is the dimension of the data. After thorough optimisation, the creators have made YOLOv3 the fastest object detection algorithm among the ones discussed in this paper.

Microsoft COCO

In the search for a perfect combination of algorithm and dataset, practitioners have recently used the top-rated deep learning architectures and datasets to reach the best possible precision and accuracy. The most commonly used datasets are PASCAL VOC and Microsoft COCO. For this review, COCO is used both as the dataset and as the evaluation metric. Different forms of analysis were applied, tweaking and calibrating the base networks and adjusting the software, which leads not only to better precision but also to improved accuracy, speed, and local split performance [ 26 ].

For object detection with computationally costly architectures and algorithms such as RCNN and SPP-NET (Spatial Pyramid Pooling Network), smart datasets containing varied objects of different dimensions have become a necessity, and given the growing scope of live video feed monitoring, the cost of detection becomes very high. Recent advancements in deep learning architectures have led to algorithms like YOLO and the SSD network that detect objects with a single neural network (NN). The introduction of the latest architectures has increased the competition between the various techniques [ 26 ]. Recently, COCO has emerged as the most used dataset for training and classification, and further developments have made it possible to add new classes [ 2 ].

Furthermore, according to some research papers, COCO is better than other popular, widely used datasets [ 2 ], namely PASCAL VOC (Pattern Analysis, Statistical Modelling and Computational Learning Visual Object Classes), ImageNet, and SUN (Scene Understanding). These datasets vary hugely in size, categories, and types. ImageNet was made to target a much larger number of categories, many of which are fine-grained. SUN follows a more modular approach, with regions of interest based on how frequently they occur in the dataset. Finally, PASCAL VOC takes an approach that is similar to, yet different from, COCO: it uses a wide range of images taken from the environment and nature. Microsoft Common Objects in Context is made for the detection and classification of objects in their natural context [ 2 ].

Annotation pipeline [ 2 ]

As seen in Fig. 9, an annotation pipeline explains the identification and categorization of a particular image.

Figure 9: The ZF model [ 30 ]

This type of annotation pipeline gives object detection algorithms a better perspective. Training algorithms benefit from these diverse images and from advanced concepts like crowd scheduling and visual segmentation. Fig. 10 gives the detailed categories available in MS COCO. The 11 super-categories are Person and Accessories, Animal, Vehicle, Outdoor Objects, Sports, Kitchenware, Food, Furniture, Appliance, Electronics, and Indoor Objects [ 42 ].

Figure 10: FRCNN Architecture [ 32 ]

Pascal VOC (Visual Object Classes)

The challenge.

The Pascal VOC (Visual Object Classes) Challenges were a series of challenges that took place from 2005 to 2012, consisting of two components: a public dataset containing images from the Flickr website, their annotations, and software for evaluation; and a yearly event consisting of a competition and a workshop. The main objectives of the challenge were classification, detection, and segmentation of the images. There were also two additional challenges of action classification and person layout [ 43 ].

The Datasets

The datasets used in the Pascal VOC Challenges consist of two subsets: a trainval dataset, further split into separate training and validation sets, and a test dataset. For the classification and detection challenges, all of the contained images are fully annotated with bounding boxes for every instance of the object classes considered [ 43 ].

Along with these annotations, attributes such as viewpoint, truncation, and difficulty were specified, and the annotations were designed to be consistent, accurate, and exhaustive; some of these attributes were added in later editions of the challenge [ 44 ].

Experimental setup

The hardware comprised 8 GB of DDR5 Random Access Memory, a 1 TB Hard Disk Drive, a 256 GB Solid State Drive, and an 8th-generation Intel Core i5 processor clocked at 1.8 GHz (Figs. 11, 12, 13, 14, 15, 16, 17, 18, 19, and 20).

Figure 11: YOLO architecture [ 26 ]

Figure 12: YOLO model ConvNet [ 37 ]

Figure 13: Categories of images [ 42 ]

Figure 15: The classes of objects considered in the challenge [ 43 ]

Figure 16: Statistics of the VOC2012 datasets [ 43 ]

Figure 17: Graph for SSD [ 26 ]

Figure 18: Graph for faster RCNN [ 26 ]

Figure 19: Graph for YOLO [ 26 ]

Figure 20: Compared with YOLOv3, the new version's AP (average precision) and FPS (frames per second) are improved by 10% and 12%, respectively [ 46 ]

The software configuration used was Google Colab with the built-in Python 3 Google Compute Engine backend. It provides 12.72 GB of RAM, of which 3.54 GB was used on average, and 107.77 GB of disk space, of which 74.41 GB was used, including the training and validation datasets. The hardware accelerator used was the GPU offered by Google Colab (Tables 1 and 2).

Results and discussions

Two performance metrics are applied to the object detection models for testing: average precision and the F1 score. The predicted bounding boxes are compared with the ground truth bounding boxes according to IoU (Intersection over Union). 'True Positives', 'False Negatives', and 'False Positives' are determined and then used to calculate precision and recall, which in turn are used to calculate the F1 score. The formulae are as follows [ 42 ]:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1 score = 2 × Precision × Recall / (Precision + Recall)
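These formulae translate directly into code; the sketch below assumes raw true-positive, false-positive, and false-negative counts are already available and guards against division by zero.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 score from raw detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

print(precision_recall_f1(tp=80, fp=20, fn=40))  # (0.8, 0.666..., 0.727...)
```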

Apart from these two, the performance of the models is also measured using the metrics provided by the COCO metrics API [ 42 ].

Using all of these, the outcomes for the three algorithms were compared in order to evaluate their performance. The outcomes were as follows:

Results comparison

The following limitations were observed in the three models:

When it comes to smaller objects, SSD's performance is much worse than that of Faster R-CNN. The main reason for this drawback is that, in SSD, the higher-resolution layers are responsible for detecting small objects. However, these layers are less useful for classification as they contain lower-level features such as colour patches or edges, thereby reducing the overall performance of SSD [ 8 ].

Another limitation, which can be inferred from the complexity of SSD's data augmentation, is that SSD requires a large amount of data for training. This can be quite expensive and time-consuming depending on the application [ 8 ].

Faster R-CNN's accuracy comes at the cost of time complexity: it is significantly slower than the likes of YOLO.

Despite its improvements over RCNN and Fast RCNN, it still requires multiple passes over a single image, unlike YOLO [ 3 ].

FRCNN has many components: the convolutional network, the Regions of Interest (ROI) pooling layer, and the Region Proposal Network (RPN). Any of these can become a bottleneck for the others [ 3 ].

YOLOv3 was one of the best modifications made to an object detection system since the introduction of Darknet-53, and this update was received very well by critics and industry professionals. But it had its own shortcomings: although YOLOv3 is still considered a veteran, complexity analysis revealed flaws and a lack of optimal solutions to the loss function. This was later rectified in an optimized model, which was subsequently used and tested for functionality enhancements [ 45 ].

A newer version of a given piece of software is the best way to analyse the faults of the former. From the YOLOv4 paper, we can see that version 3 used to fail when the image had multiple features to be analysed that were not the highlight of the picture. Lack of accuracy was always an issue with smaller images; it was practically useless to use version 3 to analyse small images because the accuracy was around 16% (borne out by our data). Another matter is the use of Darknet-53: YOLOv4 introduced CSPDarknet-53, which is better than Darknet-53 as it uses only 66% of the parameters of version 3 while giving better results with enhanced speed and accuracy [ 46 ].

The precision-recall curves plotted using the COCO metrics API allowed us to form proper deductions about the efficiency with which these three models perform object detection. Graphs were plotted for each model based on different object sizes.

The area shaded in orange indicates the precision-recall curve without any errors, the area shaded in violet indicates objects that were falsely detected, and the area shaded in blue indicates localisation errors (Loc). Lastly, the white areas under the precision-recall curve indicate an IoU value greater than 0.75, and the area shaded in grey indicates an IoU value greater than 0.5.

From the graphs of the three models, it is evident that both F R-CNN and SSD have lower accuracy, as shown by their relatively larger violet areas. Between the two, F R-CNN is more accurate than SSD, while SSD is better suited to real-time processing applications. YOLO is clearly the most efficient of them all, as is evident from its almost non-existent violet regions.

This review article compared the latest and most advanced CNN-based object detection algorithms. Without object detection, it would be impossible to analyse the hundreds of thousands of images uploaded to the internet every day [ 42 ]. Technologies like self-driving vehicles that depend on real-time analysis are also impossible to realize without object detection. All the networks were trained with the open-source COCO dataset by Microsoft to ensure a homogeneous baseline. It was found that Yolo-v3 is the fastest, with SSD following closely and Faster RCNN coming in last. However, the use case influences which algorithm is picked: if you are dealing with a relatively small dataset and do not need real-time results, it is best to go with Faster RCNN; Yolo-v3 is the one to pick if you need to analyse a live video feed; and SSD provides a good balance between speed and accuracy. Additionally, Yolo-v3 is the most recently released of the three and is actively being contributed to by the vast open-source community. Hence, in conclusion, out of the three object detection convolutional neural networks analysed, Yolo-v3 shows the best overall performance. This result is similar to what some of the previous reports have obtained.

A great deal of work can still be done in this field in the future. Every year, either new algorithms or updates to existing ones are published. Also, each field, such as aviation, autonomous vehicles (aerial and terrestrial), and industrial machinery, is suited to different algorithms.

These subjects can be explored in detail in the future.

Availability of data and materials

The COCO dataset used in this paper is available from https://cocodataset.org/#explore .

Abbreviations

FRCNN: Faster Region-based Convolutional Neural Network

SSD: Single Shot Detector

YOLOv3: You Only Look Once version 3

COCO: Common Objects in Context

VGG-16: Visual Geometry Group 16

Pathak AR, Pandey M, Rautaray S. Application of deep learning for object detection. Procedia Comput Sci. 2018;132:1706–17.


Palop JJ, Mucke L, Roberson ED. Quantifying biomarkers of cognitive dysfunction and neuronal network hyperexcitability in mouse models of Alzheimer’s disease: depletion of calcium-dependent proteins and inhibitory hippocampal remodeling. In: Alzheimer's Disease and Frontotemporal Dementia. Humana Press, Totowa, NJ; 2010, p. 245–262.

Ren S, He K, Girshick R, Sun J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell. 2016;39(6):1137–49.

Ding S, Zhao K. Research on daily objects detection based on deep neural network. IOP Conf Ser Mater Sci Eng. 2018;322(6):062024.

Kim C, Lee J, Han T, Kim YM. A hybrid framework combining background subtraction and deep neural networks for rapid person detection. J Big Data. 2018;5(1):22.

Redmon J, Divvala S, Girshick R, Farhadi A. You only look once: Unified, real-time object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2016, pp. 779–788.

Ahmad T, Ma Y, Yahya M, Ahmad B, Nazir S. Object detection through modified YOLO neural network. Scientific Programming, 2020.

Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, Berg AC. Ssd: single shot multibox detector. In: European conference on computer vision. Cham: Springer; 2016, p. 21–37.

Womg A, Shafiee MJ, Li F, Chwyl B. Tiny SSD: a tiny singleshot detection deep convolutional neural network for real-time embedded object detection. In: 2018 15th conference on computer and robot vision (CRV). IEEE; 2018, p. 95101

Chen W, Huang H, Peng S, Zhou C, Zhang C. YOLO-face: a real-time face detector. The Visual Computer 2020:1–9.

Fan D, Liu D, Chi W, Liu X, Li Y. Improved SSD-based multi-scale pedestrian detection algorithm. In: Advances in 3D image and graphics representation, analysis, computing and information technology. Springer, Singapore; 2020, p. 109–118.

Mittal P, Sharma A, Singh R. Deep learning-based object detection in low-altitude UAV datasets: a survey. Image and Vision Computing 2020:104046.

Kaplan A, Haenlein M. Siri, Siri, in my hand: Who’s the fairest in the land? On the interpretations, illustrations, and implications of artificial intelligence. Bus Horiz. 2019;62(1):15–25.

Mitchell T. Machine learning. New York: McGraw Hill; 1997.


Schulz H, Behnke S. Deep learning. KI-Künstliche Intelligenz. 2012;26(4):357–63.

Khan A, Sohail A, Zahoora U, Qureshi AS. A survey of the recent architectures of deep convolutional neural networks. Artif Intell Rev. 2020;53(8):5455–516.

Hubel DH, Wiesel TN. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. J Physiol. 1962;160(1):106–54.

LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE. 1998;86(11):2278–324.

Ranzato MA, Huang FJ, Boureau YL, LeCun Y. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In: 2007 IEEE conference on computer vision and pattern recognition. IEEE; 2007, p. 1–8.

Nickolls J, Buck I, Garland M, Skadron K. Scalable parallel programming with cuda: Is cuda the parallel programming model that application developers have been waiting for? Queue. 2008;6(2):40–53.

Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. Adv Neural Inf Process Syst. 2012;25:1097–105.


Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556; 2014.

Girshick R, Donahue J, Darrell T, Malik J. Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2014, p. 580–7.

Girshick R. Fast r-cnn. In: Proceedings of the IEEE international conference on computer vision; 2015, p. 1440–8.

Redmon J, Farhadi A. Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767; 2018.

Alganci U, Soydas M, Sertel E. Comparative research on deep learning approaches for airplane detection from very high-resolution satellite images. Remote Sensing. 2020;12(3):458.

Zhao ZQ, Zheng P, Xu ST, Wu X. Object detection with deep learning: a review. IEEE Trans Neural Netw Learn Syst. 2019;30(11):32123232.

Reza Z. N. (2019). Real-time automated weld quality analysis from ultrasonic B-scan using deep learning (Doctoral dissertation, University of Windsor (Canada)).

Shen X, Wu Y. A unified approach to salient object detection via low rank matrix recovery. In: 2012 IEEE conference on computer vision and pattern recognition. IEEE; 2012, p. 853–60.

Zeiler MD, Fergus R. Visualizing and understanding convolutional networks. In: European conference on computer vision. Cham: Springer, 2014, p. 818–33.

Uijlings JR, Van De Sande KE, Gevers T, Smeulders AW. Selective search for object recognition. Int J Comput Vision. 2013;104(2):154–71. https://doi.org/10.1007/s11263-013-0620-5 .

Wu J. Complexity and accuracy analysis of common artificial neural networks on pedestrian detection. In MATEC Web of Conferences, Vol. 232. EDP Science; 2018, p. 01003.

He K, Zhang X, Ren S, Sun J. Identity mappings in deep residual networks. In: European conference on computer vision. Cham: Springer; 2016, p. 630–45.

Xu D, Wu Y. Improved YOLO-V3 with DenseNet for multi-scale remote sensing target detection. Sensors. 2020;20(15):4276.

Butt UA, Mehmood M, Shah SBH, Amin R, Shaukat MW, Raza SM, Piran M. A review of machine learning algorithms for cloud computing security. Electronics. 2020;9(9):1379.

Ketkar N, Santana E. Deep learning with Python, vol. 1. Berkeley: Apress; 2017.


Jiang R, Lin Q, Qu S. Let blind people see: real-time visual recognition with results converted to 3D audio. Report No. 218, Stanford University, Stanford, USA; 2016.

Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Rabinovich A. Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2015, p. 1–9.

Zhao L, Li S. Object detection algorithm based on improved YOLOv3. Electronics. 2020;9(3):537.

Syed NR. A PyTorch implementation of YOLOv3 for real time object detection (Part 1). [Internet] [Updated Jun 30 2020]. https://nrsyed.com/2020/04/28/a-pytorch-implementation-of-yolov3-for-real-time-object-detection-part-1/ . Accessed 02 Feb 2021.

Ethan Yanjia Li. Dive really deep into YOLOv3: a beginner’s guide. [Internet][Posted on December 30 2019] Available at https://yanjia.li/dive-really-deep-into-yolo-v3-a-beginners-guide/ . Accessed 31 Jan 2021.

COCO. [Internet]. https://cocodataset.org/#explore . Accessed 28 Oct 2020.

Everingham M, Eslami SA, Van Gool L, Williams CK, Winn J, Zisserman A. The pascal visual object classes challenge: a retrospective. Int J Comput Vision. 2015;111(1):98–136.

Everingham M, Van Gool L, Williams CK, Winn J, Zisserman A. The pascal visual object classes (voc) challenge. Int J Comput Vision. 2010;88(2):303–38.

Huang YQ, Zheng JC, Sun SD, Yang CF, Liu J. Optimized YOLOv3 algorithm and its application in traffic flow detections. Appl Sci. 2020;10(9):3079.

Bochkovskiy A, Wang CY, Liao HYM. Yolov4: optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020.


Acknowledgements

Not applicable.

Author information

Authors and affiliations.

Vellore Institute of Technology (Chennai Campus), Kelambakkam - Vandalur Rd, Rajan Nagar, Chennai, Tamil Nadu, 600127, India

Shrey Srivastava, Amit Vishvas Divekar, Chandu Anilkumar, Ishika Naik, Ved Kulkarni & V. Pattabiraman


Contributions

SS: Research and Implementation of YOLO Algorithm. Comparative Analysis. AVD: Research and Implementation of Faster RCNN Algorithm. Comparative Analysis. CA: Research and Implementation on Faster RCNN Algorithm. Comparative Analysis. IN: Research and Implementation of SSD Algorithm. Comparative Analysis. VK: Research and Implementation on SSD Algorithm. Comparative Analysis. VP: Verification of results obtained through implementations. Approval of final manuscript.

Corresponding author

Correspondence to Shrey Srivastava .

Ethics declarations

Ethics approval and consent to participate, consent for publication, competing interests.

The authors declare that they have no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Srivastava, S., Divekar, A.V., Anilkumar, C. et al. Comparative analysis of deep learning image detection algorithms. J Big Data 8 , 66 (2021). https://doi.org/10.1186/s40537-021-00434-w


Received : 12 December 2020

Accepted : 22 February 2021

Published : 10 May 2021

DOI : https://doi.org/10.1186/s40537-021-00434-w


Keywords: Object detection; COCO dataset


EDITORIAL article

Editorial: Current Trends in Image Processing and Pattern Recognition

KC Santosh

  • PAMI Research Lab, Computer Science, University of South Dakota, Vermillion, SD, United States

Editorial on the Research Topic Current Trends in Image Processing and Pattern Recognition

Technological advancements in computing have opened up multiple opportunities in a wide variety of fields that range from document analysis ( Santosh, 2018 ), biomedical and healthcare informatics ( Santosh et al., 2019 ; Santosh et al., 2021 ; Santosh and Gaur, 2021 ; Santosh and Joshi, 2021 ), and biometrics to intelligent language processing. These applications primarily leverage AI tools and/or techniques, covering topics such as image processing, signal and pattern recognition, machine learning, and computer vision.

With this theme, we opened a call for papers on Current Trends in Image Processing & Pattern Recognition that directly followed the third International Conference on Recent Trends in Image Processing & Pattern Recognition (RTIP2R), 2020 (URL: http://rtip2r-conference.org ). Our call was not limited to RTIP2R 2020; it was open to all. Altogether, 12 papers were submitted and seven of them were accepted for publication.

In Deshpande et al. , the authors addressed the use of global fingerprint features (e.g., ridge flow, frequency, and other interest/key points) for matching. With a Convolutional Neural Network (CNN) matching model, which they called "Combination of Nearest-Neighbor Arrangement Indexing (CNNAI)", they achieved a highest rank-1 identification rate of 84.5% on the FVC2004 and NIST SD27 datasets. The authors claimed that their results are comparable with state-of-the-art algorithms and that their approach is robust to rotation and scale. Similarly, in Deshpande et al. , using the exact same datasets, the same set of authors addressed the importance of minutiae extraction and matching on low-quality latent fingerprint images. Their minutiae extraction technique showed a remarkable improvement in their results and, as claimed by the authors, their results were comparable to state-of-the-art systems.

In Gornale et al. , the authors extracted distinguishing features that were geometrically distorted or transformed by taking Hu's Invariant Moments into account. With this, they focused on early detection and gradation of knee osteoarthritis, and they claimed that their results were validated by orthopaedic surgeons and rheumatologists.

In Tamilmathi and Chithra , the authors introduced a new deep-learned quantization-based coding for 3D airborne LiDAR point cloud images. In their experiments, the model compressed an image into a constant 16 bits of data and decompressed it with approximately 160 dB of PSNR, a 174.46 s execution time, and 0.6 s of execution speed per instruction. The authors claimed that their method is comparable with previous algorithms/techniques when space and time are considered.

In Tamilmathi and Chithra , the authors carefully inspected possible signs of plant leaf diseases. They employed feature learning and observed the correlations and/or similarities between symptoms related to the diseases, so that disease identification becomes possible.

In Das Chagas Silva Araujo et al. , the authors proposed a benchmark environment for comparing multiple algorithms for depth reconstruction from two event-based sensors. In their evaluation, a stereo matching algorithm was implemented, and multiple experiments were run with multiple camera settings and parameters. The authors claimed that this work could be considered a benchmark for robust evaluation of the multitude of new techniques in event-based stereo vision.

In Steffen et al. and Gornale et al. , the authors employed handwritten signatures to better understand this behavioral biometric trait for document authentication/verification, such as in letters, contracts, and wills. They used handcrafted features such as LBP and HOG extracted from 4,790 signatures so that shallow learning could be applied efficiently. Using k-NN, decision tree, and support vector machine classifiers, they reported promising performance.

Author Contributions

The author confirms being the sole contributor of this work and has approved it for publication.

Conflict of Interest

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Santosh, KC, Antani, S., Guru, D. S., and Dey, N. (2019). Medical Imaging Artificial Intelligence, Image Recognition, and Machine Learning Techniques . United States: CRC Press . ISBN: 9780429029417. doi:10.1201/9780429029417


Santosh, KC, Das, N., and Ghosh, S. (2021). Deep Learning Models for Medical Imaging, Primers in Biomedical Imaging Devices and Systems . United States: Elsevier . eBook ISBN: 9780128236505.


Santosh, KC (2018). Document Image Analysis - Current Trends and Challenges in Graphics Recognition . United States: Springer . ISBN 978-981-13-2338-6. doi:10.1007/978-981-13-2339-3

Santosh, KC, and Gaur, L. (2021). Artificial Intelligence and Machine Learning in Public Healthcare: Opportunities and Societal Impact . Spain: SpringerBriefs in Computational Intelligence Series . ISBN: 978-981-16-6768-8. doi:10.1007/978-981-16-6768-8

Santosh, KC, and Joshi, A. (2021). COVID-19: Prediction, Decision-Making, and its Impacts, Book Series in Lecture Notes on Data Engineering and Communications Technologies . United States: Springer Nature . ISBN: 978-981-15-9682-7. doi:10.1007/978-981-15-9682-7

Keywords: artificial intelligence, computer vision, machine learning, image processing, signal processing, pattern recognition

Citation: Santosh KC (2021) Editorial: Current Trends in Image Processing and Pattern Recognition. Front. Robot. AI 8:785075. doi: 10.3389/frobt.2021.785075

Received: 28 September 2021; Accepted: 06 October 2021; Published: 09 December 2021.

Edited and reviewed by:

Copyright © 2021 Santosh. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: KC Santosh, [email protected]



Putting GPT-4o to the Sword: A Comprehensive Evaluation of Language, Vision, Speech, and Multimodal Proficiency


1. Introduction

1.1. Research Purpose
1.2. Related Work
2. Language Capacity of GPT-4o
2.1. Performance on Exams
2.1.1. Performance on USMLE
2.1.2. Performance on CFA
2.1.3. Performance on SAT
2.1.4. Performance on MBE
2.2. Reasoning
2.3. Language Translation
2.3.1. Data
2.3.2. Evaluation Method

  • 1 indicates that the vectors are identical.
  • 0 indicates that the vectors are orthogonal (i.e., no similarity).
  • −1 indicates that the vectors are opposed.
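As a minimal sketch of this kind of scoring, the function below computes the cosine similarity between two embedding vectors; the vectors themselves are made-up numbers, and how the translations are turned into embeddings is outside the scope of this illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity: 1 = identical direction, 0 = orthogonal, -1 = opposed."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

print(cosine_similarity([0.2, 0.7, 0.1], [0.25, 0.65, 0.05]))  # close to 1
```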

2.3.3. Results

2.3.4. Limitations
3. Vision Capacity of GPT-4o
3.1. Image Classification: Fruits Classification
3.2. Image Classification: Driver Drowsiness Detection
3.3. Image Classification: Crop Disease Classification
3.4. Image Classification: Glaucoma Detection
3.5. Image Classification: Cancer, Tumor, and Aneurysm Detection
3.6. Image Captioning
4. Speech Capacity of GPT-4o
4.1. Emotion Detection
4.2. Accent Detection
5. Multimodal Capacity of GPT-4o
5.1. Visual Question Answering
5.2. Vision-Language Capabilities
6. Implications, Limitations, and Future Work
6.1. Implications
6.2. Limitations
6.3. Future Work
7. Conclusions
Author Contributions
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest

  • Bhagavatula, C.; Bras, R.L.; Malaviya, C.; Sakaguchi, K.; Holtzman, A.; Rashkin, H.; Downey, D.; Yih, W.; Choi, Y. Abductive Commonsense Reasoning. International Conference on Learning Representations. 2019. Available online: https://openreview.net/forum?id=Byg1v1HKDB (accessed on 15 June 2024).
  • López Espejel, J.; Ettifouri, E.H.; Yahaya Alassan, M.S.; Chouham, E.M.; Dahhane, W. GPT-3.5, GPT-4, or BARD? Evaluating LLMs reasoning ability in zero-shot setting and performance boosting through prompts. Nat. Lang. Process. J. 2023 , 5 , 100032. [ Google Scholar ] [ CrossRef ]
  • Khoshafah, F. ChatGPT for Arabic-English Translation: Evaluating the Accuracy. Res. Sq. 2023 . [ Google Scholar ] [ CrossRef ]
  • Tiedemann, J. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey, 23–25 May 2012; pp. 2214–2218. [ Google Scholar ]
  • Kunchukuttan, A.; Mehta, P.; Bhattacharyya, P. The IIT bombay english-hindi parallel corpus. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7–12 May 2018. [ Google Scholar ]
  • Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1 (long and short papers), pp. 4171–4186. [ Google Scholar ]
  • Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; Inui, K., Jiang, J., Ng, V., Wan, X., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 3982–3992. [ Google Scholar ] [ CrossRef ]
  • Cubero, S.; Aleixos, N.; Moltó, E.; Gómez-Sanchis, J.; Blasco, J. Advances in machine vision applications for automatic inspection and quality evaluation of fruits and vegetables. Food Bioprocess Technol. 2011 , 4 , 487–504. [ Google Scholar ] [ CrossRef ]
  • Jebraeily, Y.; Sharafi, Y.; Teshnehlab, M. Driver drowsiness detection based on convolutional neural network architecture optimization using genetic algorithm. IEEE Access 2024 , 12 , 45709–45726. [ Google Scholar ] [ CrossRef ]
  • Hung, L.P.; Alias, S. Beyond Sentiment Analysis: A Review of Recent Trends in Text Based Sentiment Analysis and Emotion Detection. J. Adv. Comput. Intell. Intell. Inform. 2023 , 27 , 84–95. [ Google Scholar ] [ CrossRef ]
  • Shahriar, S. GAN computers generate arts? A survey on visual arts, music, and literary text generation using generative adversarial network. Displays 2022 , 73 , 102237. [ Google Scholar ] [ CrossRef ]
  • Graham, C.; Roll, N. Evaluating OpenAI’s Whisper ASR: Performance analysis across diverse accents and speaker traits. JASA Express Lett. 2024 , 4 , 025206. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Ahamad, A.; Anand, A.; Bhargava, P. AccentDB: A database of non-native english accents to assist neural speech recognition. In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; European Language Resources Association: Marseille, France, 2020; pp. 5351–5358. Available online: https://www.aclweb.org/anthology/2020.lrec-1.659 (accessed on 15 June 2024).
  • Yu, W.; Yang, Z.; Li, L.; Wang, J.; Lin, K.; Liu, Z.; Wang, X.; Wang, L. MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities. arXiv 2023 , arXiv:2308.02490. [ Google Scholar ]
  • Zhu, N.; Zhang, N.; Shao, Q.; Cheng, K.; Wu, H. OpenAI’s GPT-4o in surgical oncology: Revolutionary advances in generative artificial intelligence. Eur. J. Cancer 2024 , 206 , 114132. [ Google Scholar ] [ CrossRef ]
  • Zhu, H.; Wang, Z.; Shi, Y.; Hua, Y.; Xu, G.; Deng, L. Multimodal Fusion Method Based on Self-Attention Mechanism. Wirel. Commun. Mob. Comput. 2020 , 2020 , 8843186. [ Google Scholar ] [ CrossRef ]
  • Zhou, K.; Hassan, F.H.; Hoon, G.K. The State of the Art for Cross-Modal Retrieval: A Survey. IEEE Access 2023 , 11 , 138568–138589. [ Google Scholar ] [ CrossRef ]
  • Hayawi, K.; Shahriar, S.; Mathew, S.S. The imitation game: Detecting human and AI-generated texts in the era of ChatGPT and BARD. J. Inf. Sci. 2024 , 01655515241227531. [ Google Scholar ] [ CrossRef ]
  • Shahriar, S.; Al Roken, N.; Zualkernan, I. Classification of Arabic poetry emotions using deep learning. Computers 2023 , 12 , 89. [ Google Scholar ] [ CrossRef ]
  • Sivarajkumar, S.; Kelley, M.; Samolyk-Mazzanti, A.; Visweswaran, S.; Wang, Y. An Empirical Evaluation of Prompting Strategies for Large Language Models in Zero-Shot Clinical Natural Language Processing: Algorithm Development and Validation Study. JMIR Med. Inform. 2024 , 12 , e55318. [ Google Scholar ] [ CrossRef ]
  • Shahriar, S.; Al Roken, N. How can generative adversarial networks impact computer generated art? Insights from poetry to melody conversion. Int. J. Inf. Manag. Data Insights 2022 , 2 , 100066. [ Google Scholar ] [ CrossRef ]

Click here to enlarge figure

Model | Total Questions | Correct Answers | Accuracy
GPT-3.5 | 389 | 201 | 51.67%
GPT-4 | 80 | 72 | 90.00%
GPT-4o | 118 | 98 | 83.05%

Model | Accuracy
GPT-3.5 | 58.80%
GPT-4 | 73.20%
GPT-4o | 85.39%

Test | Total Questions | Correct Answers | Accuracy
Reading & Writing M1 | 33 | 31 | 93.94%
Reading & Writing M2 | 33 | 29 | 87.88%
Math M1 | 27 | 25 | 92.59%
Math M2 | 27 | 22 | 81.48%

Model | Reading & Writing | Math
GPT-3.5 | 83.75% | 73.75%
GPT-4 | 88.75% | 87.50%
GPT-4 (no vision) | 88.75% | 86.25%
GPT-4o | 90.91% | 87.04%

Model | Accuracy
GPT-3.5 | 45.10%
GPT-4 | 75.70%
GPT-4o | 75.00%

Model | Deductive Reasoning: Entailment Bank | Deductive Reasoning: bAbI (Task 15) | Inductive Reasoning: CLUTRR | Inductive Reasoning: bAbI (Task 15) | Abductive Reasoning: αNLI
GPT-3.5 | 25/30 | 26/30 | 2/30 | 14/30 | 19/30
GPT-4 | 27/30 | 30/30 | 11/30 | 28/30 | 25/30
GPT-4o | 29/30 | 30/30 | 17/30 | 30/30 | 27/30

Language | Translation Accuracy (%)
Spanish | 88
Arabic | 78
Hindi | 82
French | 75
Portuguese | 86
Russian | 80

Class | Precision | Recall | F1-Score
Banana | 0.91 | 1.00 | 0.95
Papaya | 1.00 | 1.00 | 1.00
Apple | 1.00 | 1.00 | 1.00
Litchi | 1.00 | 1.00 | 1.00
Jackfruit | 1.00 | 0.90 | 0.95
Hog Plum | 1.00 | 1.00 | 1.00
Grapes | 1.00 | 1.00 | 1.00
Guava | 1.00 | 1.00 | 1.00
Mango | 0.91 | 1.00 | 0.95
Orange | 1.00 | 0.90 | 0.95

Class | Precision | Recall | F1-Score
Drowsy | 0.8 | 0.8 | 0.8
Natural | 0.8 | 0.8 | 0.8

Class | Precision | Recall | F1-Score
Anthracnose | 0.60 | 0.60 | 0.60
Apple Scab | 1.00 | 0.80 | 0.89
Black Spot | 0.67 | 1.00 | 0.80
Blight | 0.38 | 0.75 | 0.50
Blossom End Rot | 1.00 | 1.00 | 1.00
Botrytis | 1.00 | 0.20 | 0.33
Brown Rot | 1.00 | 0.20 | 0.33
Canker | 0.25 | 0.25 | 0.25
Cedar Apple Rust | 0.83 | 1.00 | 0.91
Clubroot | 1.00 | 1.00 | 1.00
Crown Gall | 1.00 | 1.00 | 1.00
Downy Mildew | 1.00 | 0.20 | 0.33
Fire Blight | 0.80 | 0.80 | 0.80
Fusarium | 1.00 | 0.60 | 0.75
Gray Mold | 0.43 | 0.75 | 0.55
Leaf Spots | 0.40 | 0.80 | 0.53
Mosaic Virus | 0.67 | 0.80 | 0.72
Nematodes | 1.00 | 1.00 | 1.00
Powdery Mildew | 0.71 | 1.00 | 0.83
Verticillium | 0.67 | 0.40 | 0.50

Class | Precision | Recall | F1-Score
Glaucoma | 0.58 | 0.86 | 0.69
Normal | 0.73 | 0.38 | 0.50

Class | Precision | Recall | F1-Score
Aneurysm | 0.3725 | 0.5758 | 0.4524
Cancer | 0.0000 | 0.0000 | 0.0000
Tumor | 0.2708 | 0.3939 | 0.3210

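The per-class precision, recall, and F1 values in the tables above follow the standard definitions; as a rough, hypothetical sketch (illustrative label lists, not the study's data), such per-class scores can be reproduced with scikit-learn:

```python
# Sketch only: hypothetical ground-truth and predicted class labels,
# not the data behind the tables above.
from sklearn.metrics import classification_report

y_true = ["Banana", "Mango", "Banana", "Orange", "Mango", "Banana"]
y_pred = ["Banana", "Banana", "Banana", "Orange", "Mango", "Banana"]

# Prints per-class precision, recall, and F1-score, as reported above.
print(classification_report(y_true, y_pred, zero_division=0))
```
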
Metric | Score
BLEU-1 | 0.193
BLEU-2 | 0.095
BLEU-3 | 0.058
BLEU-4 | 0.031

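For context, BLEU-1 through BLEU-4 are cumulative n-gram overlap scores; a minimal sketch with NLTK on hypothetical caption tokens (not the evaluation data used here) illustrates how such values are typically computed:

```python
# Sketch only: hypothetical reference and candidate captions, tokenized.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "a man rides a bicycle down the street".split()
candidate = "a person rides a bike on the street".split()
smooth = SmoothingFunction().method1  # avoids zero scores when an n-gram order has no matches

for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))  # cumulative BLEU-n weights
    score = sentence_bleu([reference], candidate, weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```
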
Model | Recognition | OCR | Knowledge | Language Generation | Spatial Awareness | Math | Total
GPT-3.5 | 22.3 | 31.4 | 15.6 | 16.6 | 32.9 | 24.0 | 27.6
GPT-4 | 34.3 | 66.3 | 25.6 | 36.6 | 60.6 | 72.0 | 48.1
GPT-4V | 67.5 | 68.3 | 56.2 | 60.7 | 69.4 | 58.6 | 67.7
GPT-4o | 80.6 | 87.5 | 75.7 | 77.1 | 84.7 | 88.1 | 83.9

Input Image | Questions / Assessed Capabilities | GPT-4o Responses

(a) Q: How many gallons of supreme gasoline can I get with $50?
GPT-4o: (a) With $50, you can get approximately 13.70 gallons of supreme gasoline.
(b) Q: What is the price difference between diesel and regular gasoline?

GPT-4o: The price difference between diesel and regular gasoline is $0.11 per gallon.
(a) Q: Based on the image, what is the total international awareness of Tina Fey?
GPT-4o: Based on the image, the total international awareness of Tina Fey is 35%. This percentage is located at the bottom row under the column labeled “Total” for Tina Fey.
(b) Q: Based on the image, in which country does Tina Fey have the highest level of total awareness?

GPT-4o: Based on the image, Tina Fey has the highest level of total awareness in Australia, with an awareness level of 62%.
(a) Q: What is the price for tomatoes?
GPT-4o: The price for tomatoes is 8.00, as indicated by the sign placed above the pile of tomatoes in the image.
(b) Q: What fruit is the right of plums?

GPT-4o: The fruit to the right of the plums in the image is pears.
(a) Q: Which car is on the parking spot 33?
GPT-4o: The parking spot labeled 33 is empty.
(b) Q: Which spot is the red car parking on?

GPT-4o: The red car is parked in spot number 30.
(a) Q: Which category did the worst job in plastic recycling, considering the amount of plastic made and thrown out?
GPT-4o: The category that did the worst job in plastic recycling, considering the amount of plastic made and thrown out, is “Packaging”. In 2015, 146 million metric tonnes of plastic were made for packaging, and 141 million metric tonnes were thrown out. This shows that almost all the plastic made for packaging was discarded as waste.
(b) Q: What is the difference in metric tonnes between the amount of plastic produced and the amount discarded?

The difference between the amount of plastic produced (407 million metric tonnes) and the amount discarded (302 million metric tonnes) is:

407 million metric tonnes − 302 million metric tonnes = 105 million metric tonnes

So, the difference is 105 million metric tonnes.
Q: What does this factory make?

GPT-4o: The factory in the image belongs to Leica, which is a company known for manufacturing cameras, lenses, and other optical equipment.
Q: In which country was this photo taken?

GPT-4o: The photo shows Novak Djokovic kissing a trophy that appears to be the Australian Open trophy. Therefore, this photo was likely taken in Australia.
Q: Which are producers in this food web?

GPT-4o: In this food web, the producers are:

Phytoplankton
Seaweed
These are the organisms that produce energy through photosynthesis and serve as the base of the food web.
Q: Is it summertime in the scene? Answer it and give the rationale.

GPT-4o: No, it is not summertime in the scene. The presence of large pumpkins suggests that it is likely fall, as pumpkins are typically associated with the autumn harvest season. Additionally, the characters are wearing long-sleeved clothing, which is more common in cooler weather.
Q: Can you explain this meme?

GPT-4o: This meme humorously illustrates the common tendency to procrastinate. The text shows an intention to start a task at a specific time (8:00), but when the time is slightly missed (8:05), the person decides to delay the task further, waiting until the next hour (9:00) to start. The image of Kermit the Frog relaxing on a bed adds to the humor by depicting the relaxed attitude towards procrastination. The meme pokes fun at the irrational yet relatable habit of postponing tasks due to missing an arbitrary starting time by a few minutes.
Q: What is funny about this image?

GPT-4o: This image is a popular meme that combines two unrelated photos to create a humorous juxtaposition.
On the left side, there is an image of a woman (Taylor Armstrong from “The Real Housewives of Beverly Hills”) angrily pointing and shouting, while another woman (Kyle Richards) tries to calm her down.
On the right side, there is a photo of a white cat (known as “Smudge the Cat”) sitting at a dinner table with a confused and displeased expression in front of a plate of vegetables.
The humor comes from the contrast between the woman’s intense emotion and the cat’s indifferent and bemused reaction, as if the cat is unbothered by the woman’s outburst. This combination creates a relatable and funny depiction of exaggerated reactions and misunderstandings.
Q: How to cook this dish?

GPT-4o: This image shows a pie with a lattice crust. Here is a basic recipe to make a lattice-top pie, such as a strawberry or mixed berry pie:
Ingredients: For the Filling: 4 cups fresh strawberries or mixed berries, 3/4 cup granulated sugar, 1/4 cup cornstarch, 1 tablespoon lemon juice, 1 teaspoon vanilla extract, A pinch of salt For the Crust: 2 1/2 cups all-purpose flour, 1 teaspoon salt, 1 teaspoon sugar, 1 cup unsalted butter, chilled and cut into small pieces, 1/4 to 1/2 cup ice water
Instructions: In a large bowl, combine the flour, salt, and sugar. Add the chilled butter pieces and cut them into the flour mixture using a pastry blender or your fingers until the mixture resembles coarse crumbs. Gradually add ice water, a tablespoon at a time, mixing until the dough begins to come together. Do not overwork the dough. Divide the dough into two equal parts, shape them into disks, wrap in plastic wrap, and refrigerate for at least 1 h. In a large bowl, combine the berries, sugar, cornstarch, lemon juice, vanilla extract, and salt. Stir gently to combine. On a lightly floured surface, roll out one disk of dough to fit a 9-inch pie pan. Place the rolled-out dough into the pie pan and trim the edges. Pour the berry mixture into the crust-lined pie pan. Roll out the second disk of dough and cut it into 1-inch-wide strips. Lay 4 to 5 strips over the filling, parallel to each other. Lay another 4 to 5 strips perpendicular to the first set, weaving them to create a lattice pattern. Trim and crimp the edges to seal. Preheat the oven to 375°F (190 °C). Bake: Place the pie on a baking sheet to catch any drips and bake for 45–55 min, or until the crust is golden brown and the filling is bubbly. Cool: Let the pie cool completely before serving to allow the filling to set. Enjoy your homemade lattice-top pie!
Q: Can you give a short introduction to this painting?

GPT-4o: This painting is a historical Chinese artwork titled “Emperor Taizong Receiving the Tibetan Envoy”, created by Yan Liben during the Tang Dynasty. The piece illustrates a significant diplomatic event where Emperor Taizong of the Tang Dynasty meets a Tibetan envoy. This event highlights the political and cultural exchanges between the Tang Dynasty and Tibet. The painting is notable for its detailed depiction of the figures, the hierarchical arrangement, and the ceremonial aspects of the meeting. Yan Liben, a renowned court artist of the Tang Dynasty, is celebrated for his ability to capture the grandeur and solemnity of imperial events. This artwork not only serves as a historical record but also exemplifies the artistic sophistication of the Tang Dynasty.

Share and Cite

Shahriar, S.; Lund, B.D.; Mannuru, N.R.; Arshad, M.A.; Hayawi, K.; Bevara, R.V.K.; Mannuru, A.; Batool, L. Putting GPT-4o to the Sword: A Comprehensive Evaluation of Language, Vision, Speech, and Multimodal Proficiency. Appl. Sci. 2024 , 14 , 7782. https://doi.org/10.3390/app14177782

Press Release

Automatic Image Recognition to Identify Types of Wood: Harnessing AI to Fight Global Deforestation

Research News / September 02, 2024

The new European Deforestation Regulation (EUDR) is intended to prevent goods marketed in the EU from contributing to the spread of deforestation. When a wood product is brought into the EU market, for example, there must be documentation of which types of wood were used to produce it, along with proof of their legal origin. Depending on the material, even the initial review of the declared type of wood is no easy task. Paper, for example, requires time-consuming examination by specialists. Now, a new AI-based analytical tool for determining wood types is being developed to simplify and accelerate this process. Researchers from the Fraunhofer Institute for Industrial Mathematics ITWM are working closely with the Thünen Institute of Wood Research to develop the automated image recognition system for large-scale review of declarations of wood type.

Dr. Stephanie Helmling of the Thünen Institute examining samples under the microscope.

Illegal logging is one of the consequences of rising global demand for lumber. The European Union Timber Regulation (EUTR), the predecessor of the EUDR, entered into force back in 2013 with the aim of curbing the unlawful use of wood. Since then, commercial enterprises have been required to document the types of wood used in their products and their origins, thereby demonstrating the legal origins of the wood used in goods they import into the EU market. The same applies to wood products such as particle board, fiberboard, paper, and cardboard. But how can the types of wood used in fiber materials be identified without a doubt?

As things currently stand, responsibility for examining wood products falls to people such as the employees of the Thünen Institute, a research institute in the portfolio of the Federal Ministry of Food and Agriculture (BMEL). They receive numerous product samples from industry and government agencies so they can check the types of wood used — and the numbers are rising. The samples are then sent for expert analysis under a microscope, which is an extremely time-consuming process. With paper and fiberboard, the wood cells are separated from the material, dyed and then prepared on a slide. The cells can then be classified based on their appearance when viewed through a microscope. But because this preparation and examination process is so time-consuming and more and more samples are coming in for testing, the specialists can only handle a limited number of expert reports. A new AI-based analytical tool is being developed to help with this situation by relieving some of the workload on highly qualified experts, accelerating and automating the examination process, and enabling fast, efficient controls. Researchers from Fraunhofer ITWM and the Thünen Institute of Wood Research in Hamburg have teamed up in the KI_Wood-ID project, using machine learning to develop the new automated image recognition system to identify types of wood. The project is funded by the German Federal Ministry of Food and Agriculture (BMEL).

Algorithms to curb the illegal lumber trade

The researchers’ first area of focus in the project is hardwoods, especially those originating from plantations grown worldwide for cellulose production. Artificial intelligence can be used to determine the type of wood based on vascular tissue, which varies in cell structure, shape, and size. Using reference preparations from the Thünen Institute’s vast collection of wood samples, the researchers from Fraunhofer ITWM are training neural networks until the AI is capable of independently identifying and classifying the characteristic features of particular species so it can detect the types of wood present in the microscopic image of an unknown sample. Training for each different category, such as birch, beech, and poplar, takes place separately. In the individual images, the analytical tool first marks the cells that are key to identifying the particular types of wood. “A sample is considered anomalous if it contains characteristics that don’t match the declared types of wood,” says Dr. Henrike Stephani, the KI_Wood-ID project manager and deputy head of the Image Processing department at Fraunhofer ITWM in Kaiserslautern.
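
To make the per-species training and the anomaly rule concrete, here is a minimal, hypothetical sketch in Python (PyTorch): one small classifier per declared species is applied to cell crops from the microscopic image, and the sample is flagged when cells match a species that was not declared. The model size, crop size, and threshold are illustrative assumptions, not details of the KI_Wood-ID system.

```python
# Illustrative sketch only (not the Fraunhofer ITWM implementation):
# one binary classifier per wood species, applied to cell crops; a sample
# is flagged as anomalous if cells match a species not in the declaration.
import torch
import torch.nn as nn

class CellClassifier(nn.Module):
    """Tiny CNN scoring one species per model (each species trained separately)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Linear(32 * 16 * 16, 1)  # assumes 64x64 grayscale cell crops

    def forward(self, x):
        return torch.sigmoid(self.head(self.features(x).flatten(1)))

def check_declaration(cell_crops, classifiers, declared, threshold=0.9):
    """cell_crops: tensor (N, 1, 64, 64); classifiers: dict species -> trained model."""
    detected = set()
    for species, model in classifiers.items():
        with torch.no_grad():
            scores = model(cell_crops).squeeze(1)
        if (scores > threshold).any():
            detected.add(species)
    undeclared = detected - set(declared)
    return {"detected": detected, "undeclared": undeclared, "anomalous": bool(undeclared)}
```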

The overarching goal, she says, is to use AI to strengthen the legal wood trade and consequently protect consumers. Ramin is one kind of wood that is under special protection. Ramin forests are a key habitat for orangutans in places like Borneo. “A few years back, chips of this type of wood were sent to the Thünen Institute for identification after Greenpeace collected them from the warehouse at a pulp mill in Asia,” Stephani explains. “If not only wood from plantations but also protected or endangered species like ramin are processed into pulp and then paper, the goal is to use our method to identify those cells in the final product and hold the manufacturers responsible.”

Preventing deforestation around the world

An initial prototype of the analysis system has been trained on reference samples to the point that it can already identify eleven types of hardwood. The next step will involve identifying softwood. The prototype, which features a graphical user interface (GUI), is initially available to the Thünen Institute. In the long term, the AI-based image recognition tool is to be rolled out to support testing labs and government agencies worldwide in monitoring the international wood trade. “Ultimately, deforestation and the illegal timber trade can only be prevented at the global level, so we hope testing organizations that are approved worldwide will be able to benefit from our system in the future,” Stephani says.

  • Fraunhofer Institute for Industrial Mathematics ITWM  (itwm.fraunhofer.de)
  • Research News September 2023 — Harnessing AI to Fight Global Deforestation [ PDF  0,39 MB ]


Title: Text-to-Image Generation Via Energy-Based CLIP

Abstract: Joint Energy Models (JEMs), while drawing significant research attention, have not been successfully scaled to real-world, high-resolution datasets. We present EB-CLIP, a novel approach extending JEMs to the multimodal vision-language domain using CLIP, integrating both generative and discriminative objectives. For the generative objective, we introduce an image-text joint-energy function based on Cosine similarity in the CLIP space, training CLIP to assign low energy to real image-caption pairs and high energy otherwise. For the discriminative objective, we employ contrastive adversarial loss, extending the adversarial training objective to the multimodal domain. EB-CLIP not only generates realistic images from text but also achieves competitive results on the compositionality benchmark, outperforming leading methods with fewer parameters. Additionally, we demonstrate the superior guidance capability of EB-CLIP by enhancing CLIP-based generative frameworks and converting unconditional diffusion models to text-based ones. Lastly, we show that EB-CLIP can serve as a more robust evaluation metric for text-to-image generative tasks than CLIP.
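
As a rough illustration of the generative objective described above, the CLIP-space energy can be viewed as a negative cosine similarity between image and text embeddings, with real pairs pushed toward low energy. The following is a hedged sketch with placeholder embeddings and an illustrative margin-style loss, not the authors' exact formulation:

```python
# Hedged sketch: an image-text energy in CLIP embedding space, low for
# matching pairs and high otherwise. The embeddings stand in for CLIP's
# image and text towers; this is not the EB-CLIP reference code.
import torch
import torch.nn.functional as F

def clip_energy(image_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Energy = negative cosine similarity of L2-normalized embeddings, shape (batch,)."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    return -(image_emb * text_emb).sum(dim=-1)

def energy_gap_loss(real_energy, mismatched_energy, margin: float = 0.5):
    """Illustrative training signal: push real pairs below mismatched pairs by a margin."""
    return F.relu(margin + real_energy - mismatched_energy).mean()

# Example with random placeholder embeddings (dimension 512 is an assumption).
img, txt = torch.randn(4, 512), torch.randn(4, 512)
print(clip_energy(img, txt))
```
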
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2408.17046 [cs.CV]


IMAGES

  1. Research paper on character recognition
  2. (PDF) Image Recognition Technology Based on Machine Learning
  3. (PDF) A Review of Image Recognition Technology
  4. (PDF) Research and design of intelligent image recognition mechanism
  5. (PDF) A Review Paper on Face Recognition Techniques
  6. (PDF) Research Progress of Intelligent Image Recognition

VIDEO

  1. UGC NET Paper-1 Batch 🎯🔔 II Hindi Medium II Syllabus & Strategies II Target JRF 🎯 June & Dec 2024 II

  2. Mock Paper-2 II Research Methodology II PhD Entrance Exam II Dr. Bhimrao Ambedkar University, Agra I

  3. Bibliographic Refererence Recognition

  4. Face detection And Recognition based Attendance System Using Raspberry Pi

  5. Building Image Recognition API with Python TensorFlow and FastAPI

  6. Innovate with AI Services: Perform image recognition with Oracle Cloud Infrastructure OCI Vision

COMMENTS

  1. IMAGE RECOGNITION USING MACHINE LEARNING

    Abstract. Image recognition is an important aspect of image processing for machine learning, carried out without human support at any step. In this paper we study how image classification is completed ...

  2. Image recognition based on lightweight convolutional neural network

    This paper describes recent advances in image recognition techniques based on lightweight CNN. Firstly, the classical lightweight CNN models are introduced. Based on the different optimization strategies for image recognition tasks, the state-of-the-art techniques of lightweight CNN for image recognition are summarized into three categories ...

  3. (Pdf) Deep Learning Architectures for Image Recognition: a

    This research paper presents a comprehensive review of various deep learning architectures developed for image recognition tasks. The paper explores the evolution of deep learning models, starting ...

  4. An Analysis Of Convolutional Neural Networks For Image Classification

    Abstract. This paper presents an empirical analysis of the performance of popular convolutional neural networks (CNNs) for identifying objects in real-time video feeds. The most popular convolutional neural networks for object detection and object category classification from images are Alex Nets, GoogLeNet, and ResNet50.

  5. Image Recognition Technology Based on Machine Learning

    Despite decades of development in machine learning, many problems remain unsolved, such as image recognition and location detection, image classification, image generation, speech recognition, natural language processing, and so on. In the field of deep learning research, image classification has always been the most basic, traditional, and urgent research direction ...

  6. Image Recognition Using Machine Learning Techniques

    Image recognition has become a prominent area of research in recent years, and the development of deep learning models has significantly improved the accuracy of image classification tasks. This paper provides an overview of deep learning techniques using two models in image recognition, including deep belief network and convolutional neural network. Additionally, the paper examines some of ...

  7. Research and Application of Deep Learning in Image Recognition

    Abstract: Deep learning is a technical tool with broad application prospects and has an important role in the field of image recognition. In view of the theoretical value and practical significance of image recognition technology in promoting the development of computer vision and artificial intelligence, this paper will review and study the application of deep learning in image recognition.

  8. Image Classification

    **Image Classification** is a fundamental task in vision recognition that aims to understand and categorize an image as a whole under a specific label. Unlike object detection, which involves classification and location of multiple objects within an image, image classification typically pertains to ...

  9. [1512.03385] Deep Residual Learning for Image Recognition

    Deep Residual Learning for Image Recognition. Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are ...

  10. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy and 10 other authors. While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited.

  11. 25426 PDFs

    Explore the latest full-text research PDFs, articles, conference papers, preprints and more on IMAGE RECOGNITION. Find methods information, sources, references or conduct a literature review on ...

  12. Very Deep Convolutional Networks for Large-Scale Image Recognition

    Karen Simonyan and 1 other author. In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing ...

  13. Deep learning models for digital image processing: a review

    Within the domain of image processing, a wide array of methodologies is dedicated to tasks including denoising, enhancement, segmentation, feature extraction, and classification. These techniques collectively address the challenges and opportunities posed by different aspects of image analysis and manipulation, enabling applications across various fields. Each of these methodologies ...

  14. Deep learning in computer vision: A critical review of emerging

    We identify eight emerging techniques, investigate their origins and updates, and finally emphasize their applications in four key scenarios, including recognition, visual tracking, semantic segmentation, and image restoration. We recognize three development stages in the past decade and emphasize research trends for future works.

  15. 10 Papers You Should Read to Understand Image Classification in the

    With these papers, we can see how this field has evolved and how researchers brought up new ideas based on previous research outcomes. It is still helpful to sort out the big picture even if you have already worked in this area for a while. So, let's get started. 1998: LeNet. Gradient-based Learning Applied to Document ...

  16. Comparative analysis of deep learning image detection algorithms

    A computer views all kinds of visual media as an array of numerical values. As a consequence of this approach, they require image processing algorithms to inspect contents of images. This project compares 3 major image processing algorithms: Single Shot Detection (SSD), Faster Region based Convolutional Neural Networks (Faster R-CNN), and You Only Look Once (YOLO) to find the fastest and most ...

  17. PDF Siamese Neural Networks for One-shot Image Recognition

    In general, we learn image representations via a supervised metric-based approach with siamese neural networks, then reuse that network's features for one-shot learning without any retraining. In our experiments, we restrict our attention to character recognition, although the basic approach can be replicated for almost any modality (Figure 2).

  18. Deep Residual Learning for Image Recognition

    Deep Residual Learning for Image Recognition. Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun (Microsoft Research). Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used ... (a minimal residual-block sketch in this spirit appears after this list).

  19. Editorial: Current Trends in Image Processing and Pattern Recognition

    This article is part of the Research Topic Current Trends in Image Processing and Pattern Recognition ... we opened a call for papers on Current Trends in Image Processing & Pattern Recognition that exactly followed third ... The author declares that the research was conducted in the absence of any commercial or financial ...

  20. Convolutional Neural Network (CNN) for Image Detection and Recognition

    ... image detection algorithms. In this paper, convolutional neural network models are implemented for image recognition on the MNIST dataset and object detection on the CIFAR-10 dataset. The ...

  21. Image Recognition Using Artificial Intelligence

    The leading intention of the project is to provide a new approach for image recognition using Python and its libraries, in which we extensively use libraries like numpy, Bing image downloader, matplotlib, sklearn, and several others for machine learning methods such as support vector machines (SVM). An image recognition technique utilizing the info of an image ...

  22. (PDF) Artificial Intelligence Image Recognition Method Based on

    Experimental results show that the convolutional neural network algorithm proposed in this paper can learn the diverse features of the image, and improve the accuracy of feature extraction and ...

  23. Putting GPT-4o to the Sword: A Comprehensive Evaluation of ...

    Additionally, GPT-4o's vision and speech capabilities are tested through image classification and object-recognition tasks, as well as accent classification. The multimodal evaluation assesses the model's performance in integrating visual and linguistic data. ... Feature papers represent the most advanced research with significant potential ...

  24. Harnessing AI to Fight Global Deforestation

    Researchers from Fraunhofer ITWM and the Thünen Institute of Wood Research in Hamburg have teamed up in the KI_Wood-ID project, using machine learning to develop the new automated image recognition system to identify types of wood. The project is funded by the German Federal Ministry of Food and Agriculture (BMEL).

  25. Research on image Recognition Technology Based on Artificial

    Image recognition technology is an important field in artificial intelligence. With the development of society and the progress of technology, image recognition technology has gradually been more widely used. This paper introduces the principle of image recognition technology and discusses the advantages of image recognition technology in artificial intelligence. The main function of image ...

  26. [2408.17046] Text-to-Image Generation Via Energy-Based CLIP

    Joint Energy Models (JEMs), while drawing significant research attention, have not been successfully scaled to real-world, high-resolution datasets. We present EB-CLIP, a novel approach extending JEMs to the multimodal vision-language domain using CLIP, integrating both generative and discriminative objectives. For the generative objective, we introduce an image-text joint-energy function ...
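
As flagged in entry 18 above, the core idea of residual learning is a block that adds its input back to a learned transformation, so very deep networks only need to learn residuals. The following is a minimal illustrative PyTorch sketch, not the authors' reference implementation:

```python
# Minimal residual block in the spirit of He et al. (2015); illustrative only.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)  # skip connection: F(x) + x

# Example: a 64-channel feature map passes through with its shape unchanged.
block = ResidualBlock(64)
y = block(torch.randn(1, 64, 56, 56))
print(y.shape)  # torch.Size([1, 64, 56, 56])
```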