Human Factors in Phishing Attacks: A Systematic Literature Review

Index Terms

Human-centered computing

Human computer interaction (HCI)

Security and privacy

Human and societal aspects of security and privacy

Intrusion/anomaly detection and malware mitigation

Social engineering attacks

Published in: ACM Computing Surveys

University of Sydney, Australia

Association for Computing Machinery

New York, NY, United States

Author tags

  • human factors
  • cybersecurity

Funding Sources

  • Italian Ministry of University and Research (MUR)
  • PON projects LIFT, TALIsMAn, and SIMPLe
  • “Dipartimento di Eccellenza”
  • DATACLOUD, DESTINI, and FIRST
  • RoMA—Resilience of Metropolitan Areas

The COVID‐19 scamdemic: A survey of phishing attacks and their countermeasures during COVID‐19

Ali F. Al‐Qahtani

1 College of Science and Engineering, Hamad Bin Khalifa University (HBKU), Doha, Qatar

Stefano Cresci

2 Institute of Informatics and Telematics (IIT), National Research Council (CNR), Pisa, Italy

Associated Data

Data sharing is not applicable to this article as no new data were created or analyzed in this study.

The COVID‐19 pandemic coincided with an equally threatening scamdemic: a global epidemic of scams and frauds. The unprecedented cybersecurity concerns that emerged during the pandemic sparked a torrent of research to investigate cyber‐attacks and to propose solutions and countermeasures. Within the scamdemic, phishing was by far the most frequent type of attack. This survey paper reviews, summarises, compares and critically discusses 54 scientific studies and many reports by governmental bodies, security firms and the grey literature that investigated phishing attacks during COVID‐19, or that proposed countermeasures against them. Our analysis identifies the main characteristics of the attacks and the main scientific trends for defending against them, thus highlighting current scientific challenges and promising avenues for future research and experimentation.

1. INTRODUCTION

The COVID‐19 pandemic had a dramatic worldwide impact on all aspects of our lives, including the way business and social interactions are conducted and the overall organisation of our work. Regarding the latter, lockdowns and the enforcement of social distancing measures resulted in an unprecedented number of people experiencing changes in their working habits. Many employees had to adapt, often abruptly, to using digital platforms, messaging apps and novel communication channels for their everyday activities [1]. In short, we witnessed a huge worldwide shift from office work to remote (home) work.

This high‐level change implied several lower‐level fundamental shifts in how work was conducted, especially from a security perspective. While office work was characterised by a mixture of physical and digital interactions, the latter occurring in relatively secure and monitored environments, remote working necessarily involved the use of digital systems operated in largely insecure and unmanaged environments. Moreover, many people were not used to remote working and did not receive any specific training on how to work remotely in a secure way. The inevitable consequence was an increase in cyber‐risks, which eventually resulted in a massive escalation of cyber‐attacks [2, 3]. An early report from the International Association of IT Asset Managers (IAITAM) warned that working from home during the COVID‐19 pandemic could allow for plentiful data breaches. The warnings from the IAITAM report were later confirmed when a large‐scale survey involving more than 3000 employees across 12 countries found that 94% of them experienced data breaches via cyber‐attacks during the course of the pandemic, with an average of more than 2 breaches suffered per employee [1].

In addition to the regular and well‐known security risks of remote working, other peculiar risks arose as a consequence of the chaos induced by the pandemic. As a paramount example, the widespread fear and uncertainty that followed the diffusion of COVID‐19 resulted in a huge demand for information (e.g., how to protect from, or cure, the infection), which set the stage for the emergence of a COVID‐19 infodemic [4, 5]. As part of this uncontrolled flow of information, a surge in the registration of covid‐related domains was observed. Several investigations demonstrated that a large share of these new domains were outright malicious or, at the very least, suspected of serving as threat vectors for cyber‐attacks [6]. The most common type of attack related to the newly created COVID‐19 websites was phishing.
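As an illustration of how covid‐related domains might be singled out from a feed of newly registered domain names, the following minimal sketch applies simple keyword matching. The keyword list and the function names are assumptions made for this example; they are not taken from the studies cited above, which may rely on broader term sets and additional signals.

```python
import re

# Hypothetical keyword list, chosen only for illustration.
COVID_TERMS = ("covid", "corona", "coronavirus", "pandemic", "vaccine")


def covid_related(domain: str) -> bool:
    """Return True if the domain name contains a pandemic-related term."""
    name = domain.lower().rstrip(".")
    # Drop separators so that 'covid-19' and 'covid19' are treated alike.
    compact = re.sub(r"[-_]", "", name)
    return any(term in compact for term in COVID_TERMS)


def filter_covid_domains(domains):
    """Keep only the domain names flagged as covid-related, preserving order."""
    return [d for d in domains if covid_related(d)]
```

A match only indicates that a domain name refers to the pandemic. Deciding whether the domain is malicious requires further evidence, such as the hosted content or reputation data.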

Many cyber‐attacks involve social engineering techniques to boost their chances of success. In this regard, the increased anxiety caused by the pandemic resulted in a higher success rate for the cyber‐attacks that occurred during COVID‐19 [7]. Coupled with the overall increase in cyber‐attacks, this finding depicts a worrying scenario. Moreover, for workers employed in critical sectors, such as healthcare professionals, the pandemic also meant exceptional workloads, with a consequent increase in stress that also affects the success rate of cyber‐attacks. Indeed, a statistically significant positive correlation was measured in [8] between workload and the probability of a healthcare staff member opening a phishing email. Finally, the steep increase in demand for certain goods such as personal protective equipment (e.g., masks and gloves) exposed health services and even governments to a plethora of digital scams, especially in the form of phishing attacks [9, 10]. Given this picture, it comes as little surprise that critical national infrastructures such as healthcare services and hospitals were among the most frequent targets of cyber‐attacks during COVID‐19 [1, 7].

1.1. Scope and contributions

The COVID‐19 pandemic was accompanied by an equally dangerous epidemic of frauds and manipulations, as noted by the United Nations and the World Health Organization (WHO) [11]. When referring to the manipulation of online information, this digital epidemic was dubbed infodemic. In addition to this, the pandemic also created the conditions for the rise of a multitude of cyber‐attacks and cybersecurity issues: the COVID‐19 scamdemic. Within the scamdemic, phishing was by far the most frequent type of attack [7]. Phishing attacks that occurred during the pandemic also featured unique characteristics aimed at exploiting the peculiarities of COVID‐19 in order to increase their chances of success. The combination of the large number of phishing attacks, together with their new characteristics, attracted scholarly attention and many dedicated studies. This survey reviews, summarises, compares and critically discusses 54 scientific studies and many reports by governmental bodies and security firms that investigated phishing attacks during COVID‐19, or that proposed solutions and countermeasures against them. Our analysis identifies the main characteristics of the attacks and the main scientific trends for defending against them, thus highlighting current scientific challenges and promising avenues for future research and experimentation.

1.2. Significance

The rise of cyber‐attacks, and particularly of phishing attacks, that occurred during COVID‐19, combined with the increased vulnerability of critical systems and of people who underwent extreme levels of stress, holds the potential to cause serious real‐world consequences and motivates research on this important topic.

1.3. Organization

The remainder of our survey is organised as follows. In Section 2 we briefly discuss the recent meta‐analyses, surveys and review articles that are most closely related to our present work. In doing so, we position our survey with respect to the existing ones. Then, we introduce the problem of phishing attacks during COVID‐19. The results of our literature review are presented in Sections 3 and 4, which respectively focus on phishing attacks and countermeasures. Both sections are structured according to a top‐down approach, where we first present the overall synthesis and the main themes that emerged from the surveyed studies, followed by the detailed discussion of each analysed study. Next, in Section 5 we critically discuss the main findings of our survey, also highlighting challenges and promising directions of future research and experimentation. Finally, in Section 6 we conclude our work by summarising the results of our literature review.

2. BACKGROUND AND PRELIMINARIES

This section begins with a brief critical review of the existing surveys that are most closely related to our present work. Our analysis allows positioning this survey with respect to existing ones, highlighting the novelty and contributions of focussing on phishing attacks that occurred during the pandemic. Subsequently, we introduce the problem of phishing and we highlight its importance within the broader landscape of COVID‐19 cyber‐attacks.

2.1. Differences with existing surveys

The unprecedented consequences brought about by COVID‐19 resulted in a remarkable wave of research produced to counter the many covid‐induced issues. Among this new wave of research are a number of studies that focussed on cyber‐crime, cyber‐attacks and cybersecurity issues that occurred during the pandemic. Original research in this direction was complemented by a few survey papers. Here, we briefly review the existing surveys that investigated the relationship between cyber‐attacks and COVID‐19, highlighting their differences with respect to our present survey. Table 1 provides an overview of the existing surveys that are most closely related to our work. In the following we briefly describe each of these works.

Table 1. Overview of recent related surveys and differences with this survey. Related surveys are listed in reverse chronological order.

| Survey | Year | Relatedness: Phishing | Relatedness: COVID‐19 | Analysis |
|---|---|---|---|---|
| Hijji & Alam [ ] | 2021 | | | High‐level/descriptive |
| Lallie et al. [ ] | 2021 | | | High‐level/descriptive |
| He et al. [ ] | 2021 | | | High‐level/descriptive |
| Valiyaveedu et al. [ ] | 2021 | | | In‐depth/technical |
| Basit et al. [ ] | 2021 | | | In‐depth/technical |
| Salloum et al. [ ] | 2021 | | | In‐depth/technical |
| Alkhalil et al. [ ] | 2021 | | | High‐level/descriptive |
| Hakak et al. [ ] | 2020 | | | High‐level/descriptive |
| Korkmaz et al. [ ] | 2020 | | | In‐depth/technical |
| This survey | | | | In‐depth/technical |

Note: ◯: unrelated; ◐: partially related; ⬤: related.

The analysis presented in [7] describes COVID‐19 from a cyber‐crime perspective and highlights the range of cyber‐attacks experienced globally during the pandemic. Cyber‐attacks were analysed and considered within the context of key global events to reveal the modus operandi of cyber‐attack campaigns. Results of the systematic and longitudinal analysis revealed that cyber‐criminals leveraged salient events and governmental announcements to carefully craft and execute more effective cyber‐crime campaigns. This work represents a useful introductory study on cyber‐attacks and COVID‐19, without however going into the details of either attacks or countermeasures. The survey in [1] presents the results of a systematic multivocal (i.e., grey and scientific) literature review of social engineering‐based cyber‐attacks during the COVID‐19 pandemic. The survey covers 52 studies that investigated attacks such as phishing, scamming, spamming, smishing, and vishing, perpetrated via fake emails, websites, mobile apps, trojans, bots, and ransomware. This survey only discusses the high‐level characteristics of the cyber‐attacks, without providing a technical analysis of the techniques proposed for defending against them. The survey is also heavily focussed on grey literature and only to a lesser extent on scientific literature. The study presented in [15] briefly reviews some of the malicious cyber activities associated with COVID‐19 and the potential mitigation solutions. Being published in 2020, the analysis only covers attacks that occurred within the first months of the pandemic. Among the surveyed types of cyber‐attacks are denial of service, ransomware, spyware, phishing and vishing, and other digital frauds. There is neither a specific focus on phishing nor a detailed analysis of the detection techniques.

The meta‐analysis presented in [10] identifies the key cybersecurity challenges, the solutions, and the areas of improvement in the health sector, with respect to the cyber‐attacks that occurred during the COVID‐19 pandemic. The review highlighted a recent increase in cyber‐attacks (e.g., phishing campaigns and ransomware attacks) that exploit new vulnerabilities in technology and people. In turn, such vulnerabilities are due to changes in habits, behaviours, and working conditions caused by the COVID‐19 pandemic. This meta‐analysis is non‐technical and not specific to phishing, but provides interesting insights into the challenges faced in the healthcare sector.

In addition to the above studies that heavily focussed on COVID‐19, there also exist a few recent surveys, mostly technical ones, that investigated certain attacks and countermeasures without however considering the context of the pandemic. The recent study discussed in [3] critically reviews AI‐based approaches for defending against phishing attacks. This survey exclusively focuses on Web phishing attacks and categorises defensive techniques as either: (i) URL‐based, (ii) HTML‐based, or (iii) visual similarity‐based. Similarly to [3], the survey in [16] analyses machine learning‐based phishing detection systems that classify Web pages. In particular, it focuses on the main machine learning features used in such systems and on their impact on classifier accuracy. The survey presented in [13] reviews works based on natural language processing techniques for detecting phishing emails, while the survey in [12] focuses on applications of artificial intelligence to detect phishing attacks.
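To make the notion of URL‐based features more concrete, the following minimal sketch extracts a handful of lexical features from a URL using only the Python standard library. The chosen features and the function name are assumptions made for illustration; they are commonly discussed in the phishing detection literature, but they are not the specific feature sets used in the surveys cited above.

```python
import ipaddress
from urllib.parse import urlparse


def url_features(url: str) -> dict:
    """Extract a few illustrative lexical features from a URL."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    # A raw IP address in place of a domain name is a frequently used signal.
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False
    return {
        "url_length": len(url),
        "num_dots_in_host": host.count("."),
        "num_hyphens_in_host": host.count("-"),
        "host_is_ip": host_is_ip,
        "uses_https": parsed.scheme == "https",
        "has_at_symbol": "@" in url,
    }
```

In a machine learning‐based system, a dictionary of this kind would typically be converted into a numeric vector and passed to a classifier, whose choice is outside the scope of this sketch.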

The previous overview of the existing surveys, literature reviews and meta‐analyses reveals that the majority of studies that considered the context of the pandemic focussed on high‐level, preliminary investigations rather than in‐depth, technical analyses. In other words, such surveys mainly provided scoping reviews instead of systematic, detailed reviews. Conversely, several technical surveys focussed on phishing attacks, but without specific reference to the pandemic. In the present survey we contribute to filling this gap by providing a detailed analysis of the phishing attacks that occurred during COVID‐19, and of their countermeasures.

2.2. Phishing attacks during COVID‐19

As sketched in Figure 1, phishing is a type of cyber‐attack, heavily grounded in social engineering, where an attacker sends a maliciously designed message with the goal of tricking the victim into performing a specific action. Oftentimes, the malicious message points the victim to a system that is controlled by the attacker, from where the victim unwittingly downloads malicious software or simply discloses sensitive information to the attacker. Both the initial message and the system used to collect the victim's information are carefully crafted so as to resemble those of legitimate, authoritative and trustworthy entities (e.g., the WHO or the National Health Service (NHS)). Successful phishing attacks can be perpetrated by exploiting a number of media, channels and technologies, including emails, websites and mobile devices. Moreover, both the messages and the systems used to mount phishing attacks can be personalised so as to allow gathering potentially any kind of personal and sensitive information. For these reasons, phishing attacks are extremely common and widespread, and often represent the first step of complex frauds and network infiltrations, including Advanced Persistent Threats (APT) [17, 18]. Indeed, the first mandatory step in an APT kill chain involves gaining access to the target network, which can be achieved via phishing. After a successful phishing attack, the next step typically involves deploying a payload, such as carefully crafted malware designed to be stealthy, which persists in the network for long periods of time, exfiltrating data or otherwise helping to fulfil the APT objective for as long as it remains undetected.

As practical examples from the COVID‐19 pandemic, phishing emails and text messages (e.g., SMSs or WhatsApp messages) were used to lure victims to fraudulent websites. The websites gathered personal data which was used to commit financial fraud or, in other cases, to install malware (e.g., ransomware) which was then used to commit extortion [7]. In this regard, phishing attacks represent common entry points for a broad array of cyber‐attack sequences [7, 12]. Phishing attacks can manifest in several different ways, which led scholars to identify a few noteworthy subcategories of attacks. Among these, smishing refers to phishing attacks that exploit mobile phone text messages (i.e., SMSs) to lure victims. Instead, vishing (i.e., voice phishing) refers to the use of telephony, robocalls and voice over IP to mount attacks. Finally, the term pharming is used when attackers rely on compromising systems (e.g., user devices or DNS servers) to redirect victims to malicious websites.

Figure 1. Complexity and dimensions of phishing attacks. Attacks can exploit several vectors, including websites, emails and Online Social Networks (OSNs), as well as SMSs, robocalls and malware. As such, defensive techniques leverage a large set of different features to detect possible attacks. Phishing attacks can be perpetrated for a wide array of malicious goals, such as for stealing sensitive information and for financial fraud. This diversity of goals and techniques poses challenges to the detection of phishing attacks.

Based on the above, promptly detecting phishing attacks and reducing their efficacy represents a critical step for defending against many cyber‐attacks. At the same time, the multitude of subtypes, media and technologies exploited in phishing attacks poses challenges to their detection, since detection techniques must adapt and be effective across a broad spectrum of possible scenarios. As shown in Figure 1, current phishing attacks mainly leverage two media: the Internet (and especially the Web), and telephony. Within these media, a large number of different vectors can be used to perpetrate the attack (e.g., to deliver the malicious message). Among the most used attack vectors are emails, instant messages and messaging apps, online social networks (OSNs), websites, SMSs, robocalls and malware apps for mobile devices. The goals of the attackers can also be diverse and manifold, with attacks aimed at stealing personal credentials (e.g., usernames and passwords) for certain services, data exfiltration, financial fraud, extortion, or at installing malicious software such as ransomware, trojans and key loggers. The multitude of ways in which phishing attacks can be mounted demands the research and development of different techniques. As such, existing phishing detection systems are designed to leverage the combination of several different kinds of information for uncovering attacks, including IP addresses; email and SMS texts; website text, URL, HTML code, images and metadata; voice transcripts; mobile app permissions; and more. The detailed literature analysis presented in the two following sections highlights attacks, and recent progress for defending against them, along these lines.
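As a complement to network‐level signals, the text of an email or SMS can itself be turned into simple features. The following minimal sketch shows one way this could be done; the list of urgency terms, the regular expression and the function name are assumptions introduced for this example and do not reproduce any specific system from the surveyed literature.

```python
import re

# Simple pattern for http(s) links; an illustrative assumption, not a full URL grammar.
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)

# Hypothetical urgency vocabulary, chosen only for illustration.
URGENCY_TERMS = ("urgent", "immediately", "verify", "suspended", "act now")


def message_features(text: str) -> dict:
    """Compute a few illustrative text features of an email or SMS body."""
    lowered = text.lower()
    return {
        "num_urls": len(URL_RE.findall(text)),
        "num_urgency_terms": sum(term in lowered for term in URGENCY_TERMS),
        "num_digits": sum(ch.isdigit() for ch in text),
        "length": len(text),
    }
```

Features of this kind are usually combined with those extracted from other sources (e.g., URLs or sender metadata) before classification, which reflects the multi‐source design described above.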

This section investigates the phishing attacks that occurred during the COVID‐19 pandemic. We begin by discussing the main peculiar characteristics of covid‐related phishing attacks in the broader context of COVID‐19, and we conclude by presenting a detailed literature review of the many studies that investigated such attacks.

3.1. Overview and synthesis

3.1.1. The rise of phishing attacks

As anticipated, phishing was by far the most frequent type of cyber‐attack that occurred during the pandemic. Evidence for this emerges from virtually all studies and reports that investigated covid‐related cyber‐attacks. Examples include measurements for March 2020 indicating an increase in phishing attacks of around 600% with respect to the previous month. To give a sense of the scale of such attacks, Google reportedly blocked 18 million phishing emails related to the virus in April 2020 [19].

Results reported in scientific literature corroborate the above findings. The analysis presented in [7] shows that phishing, including its subcategories such as smishing and vishing, was involved in 86% of the attacks identified. Moreover, in the context of UK‐specific cyber‐attacks, [7] analysed 17 different attacks, all of which involved phishing at some stage of the attack sequence. Similarly, [1] found that the most frequent social engineering‐based attacks were phishing, scamming, spamming, smishing, and vishing. Pharming attacks were much less common but did occur in 13% of cases [7]. Figure 2 shows the relative frequency of all types of cyber‐attacks during the COVID‐19 pandemic, highlighting the massive frequency of phishing attacks, while Figure 3 drills down into the most frequent subcategories of phishing attacks. Scholars attribute the widespread occurrence of phishing attacks to their high cost‐efficiency: attacks are relatively low‐cost and have a reasonable success rate. In this respect, analyses also show that the relatively high likelihood of success of phishing attacks during COVID‐19 depended on the strategy of exploiting salient events, media and governmental announcements to the advantage of the attackers [7]. Regarding the platforms most used to perpetrate these attacks, emails accounted for 25% of the attacks, followed by websites (20%) and mobile apps (13%) [1].

Figure 2. Frequency of the different techniques used for the cyber‐attacks that occurred during COVID‐19, over the total number of attacks. The sum of the frequencies exceeds 100% since some attacks used multiple techniques. Phishing includes all its subcategories: smishing, vishing and spear‐phishing.

Figure 3. Relative frequency of the prevalent subcategories of phishing attacks that occurred during COVID‐19.
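The observation that per‐technique frequencies can sum to more than 100% follows from counting each attack once for every technique it used. The following minimal sketch illustrates the computation on toy data; the four attacks listed are invented for this example and do not correspond to the figures reported in [7].

```python
from collections import Counter


def technique_frequencies(attacks):
    """Percentage of attacks that used each technique.

    Each attack is a collection of technique labels, so a single attack
    contributes to the count of every technique it involved.
    """
    counts = Counter(t for techniques in attacks for t in set(techniques))
    total = len(attacks)
    return {t: 100.0 * n / total for t, n in counts.items()}


# Toy data, invented for illustration only.
attacks = [
    {"phishing", "malware"},
    {"phishing"},
    {"phishing", "pharming"},
    {"malware"},
]
freqs = technique_frequencies(attacks)
```

With these toy attacks, phishing appears in three out of four attacks (75%), malware in two (50%) and pharming in one (25%), for a total of 150%.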

3.1.2. Vulnerability to phishing

The main vulnerabilities to covid‐related phishing attacks derive from the changes induced by the COVID‐19 pandemic. The scoping review presented in [10] identifies five main changes responsible for the increased vulnerability to phishing and other cyber‐attacks during COVID‐19. The first is the decreased mobility and the national border closures, which demanded increased reliance on remote work [9, 15, 20]. The second is that the shift to remote work often occurred abruptly, with little planning, and involved employees with limited previous experience or training [8, 21, 22, 23]. The third is the necessary use of digital communication systems for personal interactions, which exposed both workers and users of given services to a variety of attacks [24].

The three previous causes of increased vulnerability affect nearly every sector of our society. However, some sectors, such as healthcare and governmental services, were affected even more because of their peculiar conditions and critical role in the pandemic. In these sectors, additional vulnerabilities arose. In particular, the healthcare sector significantly lags behind other industrial sectors in terms of cybersecurity and digital literacy [25, 26]. This made attacks against these targets more valuable and, consequently, more frequent. Finally, the increased demand for certain goods, above all personal protective equipment, made healthcare and governmental services increasingly exposed to scams [9]. Typical phishing attacks of this kind involved lure emails purportedly selling goods in high demand, with the goal of tricking victims into disclosing sensitive information.

Other causes of the increased vulnerability to phishing attacks during COVID‐19 are related to the extreme levels of stress, anxiety and uncertainty experienced during the peaks of the pandemic [7, 8]. While these conditions were experienced by everyone in covid‐stricken countries, workers in the healthcare and governmental sectors suffered them even more. Finally, several studies highlighted that fraudsters systematically created ad‐hoc phishing messages that echoed official announcements by governmental organisations, in order to boost their credibility and their chances of success [7, 27]. In many cases, the delay between an official announcement and the attack exploiting it was remarkably short, for example around a couple of days, which helped lure more victims and reduced their capacity to detect the scam.

3.1.3. Notable phishing attacks

The majority of the notable phishing attacks that occurred during COVID‐19 revolved around impersonating government organisations, the WHO, the US Centers for Disease Control and Prevention (CDC), the UK NHS, airlines, supermarkets and communication platforms [7]. Table 2 reports a list of some of the noteworthy attacks detected and documented in both scientific and grey literature, also describing their main characteristics, including the target, vector (e.g., website, email, SMS), goal and date of each phishing attack.

Noteworthy phishing attacks detected and described in literature in the first months of the pandemic. Attacks are listed in reverse chronological order, whenever the date of the attack is available

| Reference | Country | Target | Goal | Vector | Date |
|---|---|---|---|---|---|
| Xia et al. [ ] | USA, Netherlands | Citizens | Credential theft | Website | 17/04/2020 |
| Xia et al. [ ] | Malaysia | ATB, Bell, Canadian Government | Malware, espionage | Website | 14/04/2020 |
| O’Donell [ ] | World | Citizens | Credential theft | Email | 31/03/2020 |
| Rodger [ ] | UK | Citizens | Credential theft | SMS | 24/03/2020 |
| Lallie et al. [ ] | USA | Citizens | Malware | SMS | 24/03/2020 |
| Lallie et al. [ ] | World | Citizens | Extortion | Email | 20/03/2020 |
| Pilkey [ ] | Spain | Citizens | Malware | Email | 10/03/2020 |
| Pilkey [ ] | USA | Citizens | Malware | Email | 08/03/2020 |
| Pilkey [ ] | Italy | Citizens | Malware | Email | 02/03/2020 |
| Lallie et al. [ ] | China | Citizens | Ransomware | Email | 09/02/2020 |
| Patranobis [ ] | India | Chinese medical institutes | Credential theft | Email | 06/02/2020 |
| Pilkey [ ] | Vietnam | Citizens | Malware | Email | 03/02/2020 |
| Lallie et al. [ ] | China | Citizens | Credential theft | Email | 02/02/2020 |
| Vergelis [ ] | USA | Citizens | Credential theft | Email | 31/01/2020 |
| Lallie et al. [ ] | China | Citizens | Malware | Email | 29/01/2020 |
| Walter [ ] | Japan | Citizens | Malware | Email | 28/01/2020 |
| Pilkey [ ] | Philippines | Citizens | Malware | Email | 23/01/2020 |
| Doffman [ ] | China | Mongolian Ministry of Foreign Affairs | Malware | Email | 20/01/2020 |
| Henderson et al. [ ] | Vietnam | Chinese Government | Espionage | Email | 06/01/2020 |
| Del Rosso [ ] | Libya | Citizens | Malware, data theft | Email | |
| Greig [ ] | World | Global shipping firms | Malware, espionage | Email | |
| Lallie et al. [ ] | World | Canadian businesses, citizens | Malware | Email | |
| Lallie et al. [ ] | Spain | Spanish medical institutes | Ransomware | Email | |
| Lallie et al. [ ] | UK | Citizens | Malware | SMS | |
| Lallie et al. [ ] | Spain | Citizens | Credential theft | SMS | |
| Smithers [ ] | UK | Citizens | Credential theft | Email, website | |
| Vergelis [ ] | Singapore | Citizens | Credential theft | Email | |
| Xia et al. [ ] | USA, Japan, Singapore | BOA, PayPal, Apple, Chase | | Website | |
| Xia et al. [ ] | Russia | Banco de Chile | | Website | |

Among the attacks that have been thoroughly studied is one in which attackers impersonated the WHO. The attack vector was a WHO‐branded email containing useful and legitimate guidance on how to protect oneself from, and curb the spread of, the COVID‐19 infection. Notably, the text of the email contained some grammatical errors and misspellings, and also made use of propaganda techniques [ 28 ] appealing to the reader's emotions by emphasising the value of human lives [ 29 ]. In addition to the useful recommendations, the email also carried an attached ZIP file, purportedly containing an e‐book about ‘the complete research/origin of the coronavirus and the recommended guide to follow to protect yourselves and others’. Upon execution of the file contained in the archive, the GuLoader malware downloaded FormBook, a popular trojan used to collect data from the Windows clipboard, to log keystrokes, and to steal Web browser data. Stolen data was sent back to a command‐and‐control (C&C) server operated by the attackers [ 30 ]. Notably, the tactic of alternating legitimate and malicious information during an attack, in order to increase its credibility, is well‐known and also used in other online scams, such as in the activity of social bots spreading untrustworthy information (e.g., fake news) [ 31 ]. Similar techniques were used in other attacks aimed at downloading malware onto the victim's system. These attacks were based on a fake NHS website [ 32 ] and on a malicious website imitating the Johns Hopkins University COVID‐19 dashboard [ 33 ], a Web resource that was widely used during the course of the pandemic. The WHO was also targeted by another attack. This time, it was reported that a group of hackers created a malicious website posing as an email login portal for WHO employees, in an attempt to steal their passwords. The attempt was declared to be largely unsuccessful by the WHO itself [ 10 ].
Nonetheless, the increase in phishing attacks targeting the WHO and its partners led the WHO to issue a warning to the general public to raise awareness of these threats. 5 The warning page featured a dedicated section on phishing attacks.

In the US, another attack was based on emails impersonating the CDC and asking for donations to develop a COVID‐19 vaccine. Donations were expected to be made in Bitcoin. In addition to the typical techniques used to convince victims, the attackers also asked recipients to share the message as much as possible, thus aiming to exploit the increased perceived trustworthiness of messages vetted by close ones [ 7 , 46 ]. Finally, communication platforms such as Zoom, Microsoft Teams and Google Meet were also impersonated in emails and through fake websites. The latter led to a surge of Web domain registrations, a significant share of which was later labelled as outright malicious or suspicious [ 47 ].

3.1.4. The surge of covid‐related domain registrations

The registration of covid‐related domains was a prominent phenomenon during the first months of the pandemic. This phenomenon is neither new nor peculiar to COVID‐19. In fact, it is widely recognized that malicious campaigns, including phishing, benefit from the prompt exploitation of salient events [ 48 ]. In the case of COVID‐19, the surge in domain registrations was so significant and abrupt as to motivate targeted scientific studies and even investigations by law enforcement agencies [ 49 , 50 ]. Among these, a statistical report from Palo Alto researchers published at the end of March 2020 showed that a total of 116,357 new domain names related to COVID‐19 had been registered since the start of the year. Their results showed that 2% of such domains were clearly malicious and 34% were considered to be high‐risk [ 6 ]. A subsequent analysis by the security firm Check Point reached similar results in May 2020, with 17% of the analysed domains deemed malicious or suspicious [ 47 ]. An investigation by INTERPOL puts the rise of malicious domain registrations into context. INTERPOL measured, from February to March 2020, a 569% growth in malicious registrations, including malware and phishing, and a 788% growth in high‐risk registrations. 6 The rationale for exploiting covid‐related domains in phishing attacks is straightforward. As identified in [ 7 , 10 ], domains using keywords such as ‘covid’, ‘coronavirus’ and ‘corona’ are likely to appear believable, and thus to be massively accessed. To boost accesses, fraudsters also included the names of reputable organisations such as WHO and CDC, or registered appealing domain names such as ‘corona‐virusapps.com’, ‘anticovid19‐pharmacy.com’, and more.
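The keyword rationale described above can be illustrated with a minimal Python sketch that checks whether a domain name contains pandemic‐related keywords or the name of a reputable organisation. The keyword lists, the function name `flag_domain` and the `.example` sample domains are assumptions made for illustration only; this is not the method used in the cited reports, and a keyword match alone does not indicate that a domain is malicious.

```python
import re

# Keywords quoted in the text. Organisation names are matched as whole
# tokens so that, e.g., "whois" does not count as "who". Illustrative only.
COVID_KEYWORDS = ("coronavirus", "covid", "corona")
ORG_TOKENS = {"who", "cdc"}


def flag_domain(domain: str) -> dict:
    """Report which illustrative keyword groups appear in a domain name."""
    name = domain.lower()
    # Split on anything that is not a letter or digit (dots, hyphens, ...).
    tokens = [t for t in re.split(r"[^a-z0-9]+", name) if t]
    return {
        "covid_keyword": any(k in name for k in COVID_KEYWORDS),
        "org_token": any(t in ORG_TOKENS for t in tokens),
    }


if __name__ == "__main__":
    samples = [
        "corona-virusapps.com",      # domain quoted in the text
        "anticovid19-pharmacy.com",  # domain quoted in the text
        "who-covid-relief.example",  # hypothetical
        "whois.example",             # hypothetical, must not match "who"
    ]
    for d in samples:
        print(d, flag_domain(d))
```

Substring matching is used for the pandemic keywords because they are often embedded in longer labels (as in ‘anticovid19’), whereas short organisation names are matched as whole tokens to limit false positives.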

The remaining share of domains that was not involved in phishing attacks or in other scams was related to non‐malicious yet nonetheless shady and lucrative practices. The study discussed in [ 49 ] investigated the rationales for such covid‐related domain registrations. The authors concluded that such domains were registered mainly for two reasons: (i) for attracting and then redirecting traffic to other, often totally unrelated, commercial services; or (ii) for domain parking – that is, the practice of registering a high‐demand domain in advance, thus netting profits when reselling the domain later on, once the demand curve is at its peak.

3.1.5. Consequences and economic impact

The majority of assessments about the consequences of phishing attacks derive from governmental bodies and security firms, with only a small minority of scientific studies covering this area. Independently of the source, all reports testify to a sharp increase in costs and losses due to cyber‐crime since the start of the COVID‐19 pandemic. Overall, companies spent $110B worldwide on protecting against cyber‐attacks in 2020, according to Accenture's annual security report [ 1 ]. A survey by BAE Systems highlighted the main factors that contributed to cyber‐crime losses registered during the pandemic [ 51 ]. The main losses derived from: (i) IT overtime for incident response, remediation and clean‐up; (ii) payments for ransomware attacks; (iii) operational outages; (iv) legal costs following a major attack (e.g., in cases of class action lawsuits); and (v) customer churn, with its associated financial costs.

Investigations by the US Federal Bureau of Investigation (FBI) help to quantify the overall losses and to estimate the trend with respect to pre‐pandemic conditions. For instance, the FBI estimated that spear phishing cost US businesses more than $1.8B in 2020, up from $1.7B in 2019. In a notorious case, a US business specialising in hand sanitizers wired nearly $1M to hackers pretending to sell ventilators. Conversely, losses associated with generic phishing attacks decreased slightly, with $54M in losses in 2020, down from $57M in 2019. 7 This trend testifies to the increased personalisation and sophistication of recent attacks, which, in turn, calls for more advanced detection techniques to keep up with the rapid pace of the attackers.

According to the FBI, ransomware attacks were also a major source of losses, as already highlighted in the BAE Systems survey [ 51 ]. The average ransomware payment reported in Q4 2020 was in the region of $154,000. Often, however, more severe losses derived from downtime and customer churn rather than from the direct ransomware payment. In a notable case involving a large US healthcare provider, losses due to losing customers to rival providers during a ransomware attack summed up to $67M. Moreover, while extortion and high ransomware demands were previously reserved for big‐budget enterprises, such attacks also hit the small and medium business (SMB) sector during the pandemic. The average ransomware payment demand for SMBs in 2020 was $5600, while the costs of the incurred downtime reached $247,000, which represents a 94% increase with respect to 2019. According to the FBI's Internet Crime Complaint Center (IC3), the overall cost of ransomware in the US tripled in 2020, with $29.1M in losses compared to just $8.9M in 2019. Notably, the FBI found that phishing emails were the primary cause of ransomware attacks, underlining the importance of defending against phishing for reducing the efficacy of many cyber‐attacks.
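To make the year‐over‐year trends quoted above easier to compare, the following Python sketch computes the percentage change between the 2019 and 2020 figures reported in the text. The function and variable names are hypothetical, and the input values are simply the rounded figures quoted in this section rather than data taken from the original FBI reports.

```python
def pct_change(old: float, new: float) -> float:
    """Percentage change from `old` to `new`."""
    return (new - old) / old * 100.0


# (2019, 2020) losses in millions of US dollars, as quoted in the text.
LOSSES_MUSD = {
    "spear phishing": (1700.0, 1800.0),
    "generic phishing": (57.0, 54.0),
    "ransomware (IC3)": (8.9, 29.1),
}


def summarise(losses: dict) -> dict:
    """Map each category to its rounded year-over-year percentage change."""
    return {
        name: round(pct_change(before, after), 1)
        for name, (before, after) in losses.items()
    }


if __name__ == "__main__":
    for name, change in summarise(LOSSES_MUSD).items():
        print(f"{name}: {change:+.1f}%")
```

With these rounded inputs, the ransomware figure corresponds to a growth factor of roughly 3.3, which is consistent with the statement that the overall cost tripled.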

Among the few scientific studies that reported results on the economic consequences of phishing attacks is the work in [ 7 ]. The analysis focussed on UK firms and revealed that by early May 2020, more than 160,000 suspect emails had been reported to the UK National Cyber Security Centre (NCSC). By the end of May, £4.6M had been lost to COVID‐19 related scams, with around 11,206 victims of phishing campaigns. In response, the NCSC took down 471 fake online shops and Her Majesty's Revenue and Customs (HMRC) took down 292 fake websites [ 7 ].

3.2. Detailed literature review

While the previous section highlighted the rise and the main characteristics of covid‐related phishing attacks, this section summarises and presents the results of each study that investigated such attacks.

3.2.1. Early and introductory works

Out of all the research published on cyber‐attacks and COVID‐19 – and specifically phishing attacks – the vast majority of existing studies focussed on providing descriptions and characterisations of the types of attacks. This large stream of research is characterised by relatively general, descriptive and high‐level analyses, rather than by technical and detailed discussions. These contributions were among the first to be made in the aftermath of the pandemic, and served as initial assessments of such an unprecedented situation. Their utility was in raising awareness of the increased cybersecurity issues and in guiding subsequent, more technical and specific, research. An example of this kind is the work presented in [ 52 ], where the authors made a first step towards fully characterising the landscape of COVID‐19 themed attacks. In detail, they considered five classes of attacks – namely, malicious websites, malicious emails, malicious mobile apps, malicious messaging, and misinformation. Then, they proposed mapping them to Lockheed Martin's Cyber Kill Chain (LMCKC) [ 53 ], which is a model consisting of 7 stages: (i) reconnaissance, which corresponds to pre‐attack planning; (ii) weaponization, which corresponds to setting up attack propagation mediums; (iii) delivery, which corresponds to the attacker's penetration into a victim's system; (iv) exploitation, which corresponds to the waging of actual attacks; (v) installation, which corresponds to installation of malicious payloads; (vi) command‐and‐control, which corresponds to the attacker's use of remote access to victims' systems; and (vii) objectives, which corresponds to the accomplishment of the attacker's pre‐determined goal. Finally, they discussed the defence space, with recommendations on how to defend against malicious websites, malicious emails, malicious mobile apps, malicious messaging, and malicious misinformation.
Similarly, the work presented in [ 54 ] provided a detailed review of COVID‐19 cybersecurity attacks, together with a critical analysis. The paper also showed the latest research contributions to cybersecurity during COVID‐19, in the form of a literature review corroborated by examples of how Google and Microsoft managed their privacy and cybersecurity, as well as the resulting limitations. Then, the authors discussed the reasons why people are vulnerable to cyber‐attacks, especially with the increase in online activities brought about by the pandemic, and proposed unique solutions to those problems. The goal of the study reported in [ 55 ] was to examine the shift from physical‐ to cyber‐crime at the onset of the COVID‐19 pandemic. Thus, this work aimed to shed more light on how crime initially moved to cyberspace and what the implications were for organisations and individuals. The author's hypothesis was that there was a shift from physical to cyber‐crime as a result of the mass quarantine around the world at the beginning of the pandemic. The author used data from news articles, government reports, private sector publications, FBI data, and press releases. The results showed that the United States Secret Service Cyber‐Fraud Task Force actually registered an increase in frauds, and that, according to the FBI, cyber‐crime had increased by 300% since the start of the pandemic. The analysis reported in [ 56 ] identified the top‐ten cybersecurity threats that took place during the pandemic. Phishing emerged as one of the top threats, linked to many frequent attack vectors such as malicious domain attacks, malicious websites, malicious emails, malicious social media messaging, business email compromise and malicious mobile apps.
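The seven stages of the Cyber Kill Chain listed earlier in this subsection form an ordered sequence, which can be captured in a small data structure. The Python sketch below is only an illustration under stated assumptions: the class name `KillChainStage`, the helper `remaining_stages` and the short descriptions are hypothetical conveniences paraphrasing the text, and the actual mapping of attack classes to stages proposed in [ 52 ] is not reproduced here.

```python
from enum import IntEnum


class KillChainStage(IntEnum):
    """The seven stages named in the text, in their stated order."""
    RECONNAISSANCE = 1
    WEAPONIZATION = 2
    DELIVERY = 3
    EXPLOITATION = 4
    INSTALLATION = 5
    COMMAND_AND_CONTROL = 6
    OBJECTIVES = 7


# Short descriptions paraphrasing the definitions given in the text.
DESCRIPTIONS = {
    KillChainStage.RECONNAISSANCE: "pre-attack planning",
    KillChainStage.WEAPONIZATION: "setting up attack propagation mediums",
    KillChainStage.DELIVERY: "penetration into a victim's system",
    KillChainStage.EXPLOITATION: "waging of actual attacks",
    KillChainStage.INSTALLATION: "installation of malicious payloads",
    KillChainStage.COMMAND_AND_CONTROL: "remote access to victims' systems",
    KillChainStage.OBJECTIVES: "accomplishment of the attacker's goal",
}


def remaining_stages(stage: KillChainStage) -> list:
    """Stages that still follow `stage` in the chain (hypothetical helper)."""
    return [s for s in KillChainStage if s > stage]


if __name__ == "__main__":
    for s in KillChainStage:
        print(f"({s.value}) {s.name.lower()}: {DESCRIPTIONS[s]}")
```

Using an ordered enumeration makes it straightforward to reason about how early in the chain an attack is interrupted, for example by listing the stages that follow delivery.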

Instead, in [ 57 ] the authors discussed the types of phishing attacks and their impact during the COVID‐19 lockdown. Specifically, they discussed different types and sub‐types of attacks, such as deceptive phishing, whaling, spear‐phishing, and pharming, also proposing some general recommendations for thwarting them. Similarly, the analysis presented in [ 58 ] discussed the security risks associated with working from home due to the COVID‐19 pandemic and the imposed lockdown. It discussed the increase in cyber‐attacks due to the pandemic and provided a number of general recommendations, including those directed to businesses for backing‐up their data in case of a ransomware attack, recommendations for secure remote networks for employees working from home, encouraging employees to communicate with the IT department regarding any concerns, periodic penetration testing, and educating employees. The paper also discussed the challenges related to dealing with an attack. The author of [ 9 ] discussed how the pandemic‐driven disappearance of home‐work boundaries expanded the cyber‐attack surface area. The study also gave recommendations for employers to encourage employees to use strong encryption on their home routers and strong passwords on personal accounts, and to be extremely vigilant with respect to their personal information. In addition to [ 9 , 58 ], some other works also focussed on the security challenges introduced by the shift to remote working. Among these, [ 59 ] discussed how the sudden change to remote work impacted the security of many organisations. The author described how the pandemic left many organisations with neither the time nor the resources to instal extra security measures on work‐issued devices. The study recommended that organisations utilise multi‐factor authentication instead of just passwords, and rely on end‐to‐end encryption and virtual private networks (VPNs) for handling company data.
The discussion in [ 60 ] also outlined the many challenges and security concerns caused by the pandemic, and specifically, by the shift to remote working. The author discussed the increase in phishing scams preying on COVID‐19 fears and panic, and how cyber‐crime was expected to cost the world $6 trillion annually by 2021. Similarly to many other papers surveyed in this section, this article also ends with some general recommendations, including the use of multi‐factor authentication, the use of a VPN with an encrypted network connection, updating the cybersecurity policies, and communication between employees and their IT department. The work in [ 61 ] discussed the cybersecurity issues that have occurred during the COVID‐19 pandemic. The authors emphasised that there was a correlation between the pandemic and the increase in cyber‐attacks. Furthermore, they also highlighted that healthcare organisations were one of the main victims of cyber‐attacks during the pandemic. The pandemic has also raised the issue of cybersecurity in relation to: (i) the ‘new normal’ of expecting staff to work from home, (ii) the possibility of state‐sponsored attacks, and (iii) increases in phishing and ransomware. According to the authors, mitigation techniques for these issues include raising user awareness, utilising VPNs and multi‐factor authentication, ensuring firmware and antiviruses are updated, and adopting a strong cybersecurity policy. The authors of [ 62 ] presented a discussion on the vulnerabilities caused by the pandemic and on the many types of cyber‐attacks experienced worldwide. The ultimate goal of their analysis was to raise awareness of these issues, and of cybersecurity in general, as a mandatory defensive step in order to reduce the number and impact of the cyber‐attacks that occurred as a consequence of the COVID‐19 pandemic. The purpose of [ 63 ] was to raise awareness of the exploitation of the pandemic as a cyber‐attack tool and to discuss possible remediation strategies.
The research was conducted through a review of existing literature from websites and reputable databases, including Google Scholar and IEEE Xplore. The themes from the literature sources included the prevalence of phishing, scamming, spamming, and malware as the common attack vectors. Business enterprises, including operators in healthcare, finance, and Internet service provision, were advised to actively implement risk management plans to monitor attack vectors and to secure their systems, clients, and users from the COVID‐19 attack tools.

Still within the large body of initial research on phishing attacks and COVID‐19, other papers investigated a number of more specific issues. For example, the work in [ 20 ] focussed on challenges of the healthcare sector, by outlining why cyber‐attacks have been particularly problematic during COVID‐19 and by defining the ways in which healthcare industries could better protect patients' data. The paper reported that the number of cyber‐attacks increased five‐fold after COVID‐19, and that 90% of healthcare providers had already encountered data breaches. Among the proposed mitigation recommendations were penetration testing, well‐defined software upgrade procedures, and the utilization of secure networks like virtual local area networks. Other scholars focussed instead on analysing and describing national experiences. Among them, [ 64 ] examined the extent to which organisations in the UK and their staff were likely to have been prepared for the unplanned outbreak of home working, along with the increased cyber‐threats that they had to face. The preparedness of businesses was evaluated along the following directions: secure configuration, malware protection, network security, managing user privileges, incident management, monitoring, information risk management regime, user education and awareness, home and mobile working, and removable media controls. The share of businesses undertaking actions in each of these directions was as follows: 90% for secure configuration, 88% for malware protection, 83% for network security, 80% for managing user privileges, 68% for incident management, 57% for monitoring, 35% for information risk management regime, 30% for user education and awareness, 25% for home and mobile working, and 23% for removable media controls. Results of this analysis were useful for promptly identifying those security directions requiring additional efforts.
Instead, the author of [ 65 ] discussed how Croatia dealt with the pandemic‐related cybersecurity concerns. The analysis revealed that Croatia has stayed completely silent with regards to cybersecurity hazards, and it has left companies to figure out their own ways of reacting to the increased cyber‐threats, without even warning individuals. The analysis then moved on to discuss the cybersecurity threats associated with remote working, the Croatian cybersecurity legal regulation, Croatia's (lack of) response to the increased cybersecurity threats, and liability for personal data breaches arising from cybersecurity attacks. The author concluded by making some recommendations such as cybersecurity auditing, use of multi‐factor authentication, and use of VPN solutions for connecting to the workplace. In [ 66 ], the authors conducted a study by identifying cyber‐incidents in Indonesia that exploited COVID‐19. The analysis made use of a timeline that mapped key events and cyber‐attacks to analyse targeted sectors and their cybersecurity issues. The study illustrated how cyber‐criminals artfully exploited pandemic issues and situations as baits for social engineering techniques. In the analysed cyber‐incidents, criminals using social engineering techniques took advantage of the issue of COVID‐19 by not having a specific target so that anyone could become a victim of their attacks. Finally, differently from all works described above, the analysis presented in [ 67 ] focussed on the skills needed by the cybersecurity workforce in relation to the novel situation caused by the pandemic. Specifically, the authors argued that the cybersecurity workforce, which was already suffering a digital skills crisis, also lacked the adequate soft skills required to effectively tackle the insider threat that was exacerbated by the pandemic. The work first examined the insider threat, and why it became so much more insidious because of COVID‐19. 
Then, it looked into the essential soft skills required to tackle this threat, before examining how organisations could effectively implement an apprenticeship strategy capable of generating professionals with both hard and soft skills. The authors concluded that many of the covid‐related issues could have been avoided if the industry had not relied so heavily on recruiting graduates rather than apprentices – that is, people trained directly in cybersecurity by the company itself.
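The UK preparedness percentages reported for [ 64 ] earlier in this paragraph can be used to show how the weakest security directions might be singled out. The following Python sketch is a minimal illustration: the 50% cut‐off and the function name `weakest_directions` are assumptions introduced here, not criteria used in the cited study.

```python
# Share of UK businesses (in %) undertaking actions in each direction,
# as quoted in the text for the study in [64].
PREPAREDNESS = {
    "secure configuration": 90,
    "malware protection": 88,
    "network security": 83,
    "managing user privileges": 80,
    "incident management": 68,
    "monitoring": 57,
    "information risk management regime": 35,
    "user education and awareness": 30,
    "home and mobile working": 25,
    "removable media controls": 23,
}


def weakest_directions(scores: dict, threshold: int = 50) -> list:
    """Directions adopted by fewer than `threshold`% of businesses,
    sorted from least to most adopted. The threshold is an assumption."""
    below = [(name, pct) for name, pct in scores.items() if pct < threshold]
    return sorted(below, key=lambda item: item[1])


if __name__ == "__main__":
    for name, pct in weakest_directions(PREPAREDNESS):
        print(f"{pct:>3}%  {name}")
```

Under the assumed 50% threshold, the four directions flagged are exactly those most closely tied to remote working and user behaviour, which is where the pandemic placed the greatest strain.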

3.2.2. Systematic analyses

Following the first wave of introductory research, some scholars carried out systematic and large‐scale analyses of some of the attacks that occurred during the first months of the pandemic. For instance, in [ 68 ] the authors carried out a comprehensive measurement study of online social engineering attacks, with specific references to phishing. By collecting, synthesising, and analysing DNS records, Transport Layer Security (TLS) certificates, phishing URLs, phishing website source code, phishing emails, web traffic to phishing websites, news articles, and government announcements, they tracked trends of phishing activity between January and May 2020 and sought to understand the key implications of the underlying trends. They found that phishing attack traffic in March and April 2020 skyrocketed up to 220% of its pre‐COVID‐19 rate, far exceeding typical seasonal spikes. The results also showed that there was a record high of phishing victims during this period, and that attackers remained several steps ahead of typical modern anti‐phishing defenses. Findings from this analysis could be used to develop more effective phishing detection techniques. Then, the study in [ 27 ] developed a multi‐level influence model to explore how cyber‐criminals exploited the COVID‐19 pandemic by assessing situational factors, identifying victims, impersonating trusted sources, electing attack methods, and employing social engineering techniques. A content and thematic analysis was conducted on 185 distinct COVID‐19 cyber‐crime scam incident documents, including text, images and photos provided by FraudWatch, a global online fraud and cybersecurity company tracking worldwide COVID‐19 related cyber‐crime. The analysis revealed interesting patterns about the sheer breadth and diversity of COVID‐19 related cyber‐crime incidents and how these crimes were continually evolving in response to changing situational factors related to the pandemic.
Similarly, the aim of [ 69 ] was to contribute to users' protection by exploring the modus operandi that online perpetrators applied to exploit Internet users' coronavirus fears through phishing emails. To that end, the content of 208 coronavirus‐themed phishing emails was examined. The data was collected by searching for variations of the terms ‘COVID‐19 phishing emails’ from search engines, and then using the images from official websites such as Action Fraud, the FBI, or web pages of universities or companies' IT departments. In total, 2372 images were collected in this way. The results showed that phishers mostly employed social engineering methods to coerce individuals into providing sensitive information. The authors also identified 9 main variations of phishing emails. While the previous work focussed on phishing emails, the authors of [ 70 ] presented a systematic study of coronavirus‐themed Android malware. First, they built a daily growing COVID‐19 themed mobile app dataset, which contained 4322 COVID‐19 themed apk samples (2500 unique apps) and 611 potential malware samples (370 unique malicious apps) as of mid‐November 2020. The authors then presented an analysis of them from multiple perspectives including trends and statistics, installation methods, malicious behaviours and malicious actors behind them. The authors observed that the COVID‐19 themed apps as well as malicious ones began to flourish almost as soon as the pandemic broke out worldwide. Most malicious apps were camouflaged as benign apps using the same app identifiers (e.g., app name, package name and app icon). Their main purposes were either stealing users' private information or making profit by using tricks like phishing and extortion. Notably, several of the characteristics identified in this study are currently exploited as part of many detection techniques for protecting against phishing attacks mounted by means of malicious apps [ 52 ].
Moving on with relevant systematic analyses, in [ 71 ] the authors presented the first measurement study of COVID‐19 themed cryptocurrency scams. They first created a comprehensive taxonomy of COVID‐19 scams by manually analysing the existing scams reported by users from online resources. Then, they proposed a hybrid approach to perform the investigation by (i) collecting reported scams in the wild, and by (ii) detecting undisclosed ones based on information collected from suspicious entities (e.g., domains, tweets, etc.). In total, 195 confirmed COVID‐19 cryptocurrency scams were collected, covering many well‐known types of cryptocurrency scams [ 72 ]: 91 token scams, 19 giveaway scams, 9 blackmail scams, 14 crypto malware scams, 9 Ponzi scheme scams, and 53 donation scams. Over 200 blockchain addresses associated with these scams were then identified, which led to at least $330K in losses from 6329 victims. For each type of scam, the tricks and social engineering techniques used were further investigated. To facilitate future research, the authors released all the well‐labelled scams to the research community. 8 The data for COVID‐19 scams were obtained from BitcoinAbuse, CryptoScamDB, Threat Intelligence Platforms (e.g. AlienVault, McAfee), and StopScamFraud. The authors also obtained data about COVID‐19 themed cryptocurrency scams using a semi‐automated analysis on Etherscan to search for scam tokens, URLScan, RiskIQ, VirusTotal to search for scam domains, Koodous, VirusTotal, AVClass to find Android apps and label the app malware families, and Twitter and Telegram to identify more scams. Given the surge of covid‐related malicious domain registrations, the authors of [ 34 ] focussed on identifying and characterising COVID‐themed malicious domain campaigns, including the evolution of such campaigns, their underlying infrastructures and the different strategies taken by attackers behind these campaigns.
Their exploration uncovered some common features of malicious domains, which can help to identify new malicious domains and to raise alarms at the early stage of their deployment. The results also showed peaks in malicious domain registrations in March 2020, indicating bulk registrations that accounted for 73.2% of all malicious domains. The first registered domain was ‘clientdoc.us’, which hosted multiple COVID‐19 related phishing subdomains like ‘banking.covid19.hsbc.clientdocs.us’ and ‘covid19update.hsbc.clientdocs.us’. The authors also identified 15 verified attack campaigns that were used for phishing, malware, and domain squatting. Finally, similarly to the previous study, [ 49 ] also performed an Internet‐scale analysis of COVID‐19 domain name registrations during the early stages of the virus' spread. The authors leveraged the DomainTools COVID‐19 Threat List and additional measurements to analyse over 150,000 domains registered between 1 January 2020 and 1 May 2020. They identified two key rationales for covid‐related domain registrations: (i) online marketing, by either redirecting traffic or hosting a commercial service on the domain; and (ii) domain parking, by registering domains containing popular COVID‐19 keywords, presumably anticipating a profit when reselling the domain later on.
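As a simple consistency check on the cryptocurrency scam figures reported for [ 71 ] earlier in this paragraph, the Python sketch below sums the per‐type counts and derives an average loss per victim. The variable and function names are hypothetical, and the average is only a rough lower‐bound estimate, since the text reports losses of at least $330K.

```python
# Per-type counts of confirmed COVID-19 cryptocurrency scams, as quoted
# in the text for the study in [71].
SCAM_COUNTS = {
    "token": 91,
    "giveaway": 19,
    "blackmail": 9,
    "crypto malware": 14,
    "Ponzi scheme": 9,
    "donation": 53,
}
REPORTED_TOTAL = 195   # total stated in the text
MIN_LOSSES_USD = 330_000  # "at least $330K" in the text
VICTIMS = 6329


def scam_shares(counts: dict) -> dict:
    """Percentage share of each scam type, rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100.0 * n / total, 1) for name, n in counts.items()}


if __name__ == "__main__":
    total = sum(SCAM_COUNTS.values())
    print("sum of per-type counts:", total, "| reported:", REPORTED_TOTAL)
    print("shares (%):", scam_shares(SCAM_COUNTS))
    print("average loss per victim (USD, lower bound):",
          round(MIN_LOSSES_USD / VICTIMS, 2))
```

The per‐type counts add up to the reported total, and token and donation scams together account for roughly three quarters of the confirmed cases.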

3.2.3. Studies based on questionnaires, surveys and interviews

Another remarkable body of work about phishing and other cyber‐attacks in relation to COVID‐19 relied on the use of questionnaires, surveys and interviews as tools for assessing the perception, readiness and effect of such attacks on those that experienced them. As part of this literature, the idea of the study presented in [ 73 ] was to examine how teleworking affected employee perceptions of organizational efficiency and cybersecurity, before and during the COVID‐19 pandemic. The research was based on an analytical and empirical approach. The quantitative approach involved the design of a structural equation model, one of the most widely‐used approaches to causal inference [ 74 ], on a sample of 1101 respondents from the category of employees in Montenegro. Within the model, the authors examined simultaneously the impact of the employees' perceptions on the risks of teleworking, changes in cyber‐attacks during teleworking, organisations' capacity to respond to cyber‐attacks, key challenges in achieving an adequate response, as well as the perceptions of key challenges related to cybersecurity. Perhaps surprisingly, the main findings of the research were that teleworking had no impact on digital information security, and that teleworking had a positive and significant impact on organizational efficiency perceptions. Similar conclusions were reached in [ 24 ], where authors discussed how the pandemic impacted the IT industry in terms of the IT security implications, the impact on global IT, and the increase in COVID‐19 phishing attacks and malware. The authors used a survey to demonstrate how the industry was able, for the most part, to cope with and address the challenges brought by the COVID‐19 crisis. With similar techniques, the analysis carried out in [ 75 ] evaluated the cybersecurity culture readiness of organisations from different countries and business domains, when teleworking became a necessity due to the COVID‐19 crisis. 
The authors designed a targeted questionnaire and conducted a web‐based survey of employees working from home while COVID‐19 spread across the globe. The questionnaire contained 23 questions and was available for almost a month, between April and May 2020. During that period, 264 participants from 13 European countries spent approximately 8 min each answering it. The gathered data were analysed from different perspectives, making it possible to answer questions about the information security readiness and resilience of both individuals and organisations. Among the results, 53% of employees reported not having received any cybersecurity guidance with regard to working from home, 44.44% had no possibility of working from home, and about 15% reported having faced some kind of cyber‐threat. Still related to perceptions, the research in [ 76 ] examined the relationship between teleworking cybersecurity protocols during the COVID‐19 era and employee perception of their efficiency and performance predictability. The premise of this research project was that teleworking could transform employees into unintentional insider threats. Interviews were conducted through video conferencing with nine employees in Virginia, USA to examine the problem and collect data. The data from the interviews were then analysed using narrative analysis to unpack some of the common themes [ 77 ]. The major findings demonstrated that employees trusted the cybersecurity protocols that their organisations implemented, but that they also believed they were vulnerable, and that the protocols were not as reliable as in‐person working arrangements. While the respondents perceived that the cybersecurity protocols contributed to performance predictability, they also appeared to think that the protocols disrupted their efficiency.

Other studies focussed instead on the effects of cyber‐attacks and of the specific techniques used to carry them out. The experiment described in [ 78 ] examined the effects of persuasive appeals in phishing messages on judgements of credibility. Participants were tasked with reading a combination of legitimate and phishing e‐mails to determine whether each message was legitimate or a scam. When phishing messages included more appeals to authority and likability, phishing susceptibility increased. However, as the number of fear and urgency appeals in the message increased, phishing susceptibility decreased, as it was easier for participants to detect the phishing attempt. Interestingly, results showed that appeals to authority and likability increased credibility, while appeals to fear, urgency, and social proof decreased judgements of credibility. Moving on, in [ 79 ], the authors investigated how the pandemic affected rates of cyber‐victimization. The study considered the pandemic as a natural experiment, thus allowing the comparison between pre‐pandemic rates of victimization and post‐pandemic ones, leveraging datasets originally designed to track cyber‐crime. In particular, the authors built two samples that they used to conduct a survey: (i) one related to the pre‐COVID‐19 situation consisting of 1109 participants, and (ii) another one for the post‐COVID‐19 situation counting 1021 participants. After considering how the pandemic may have altered routines and affected cyber‐victimization, the study found that the pandemic did not radically alter cyber‐routines nor changed cyber‐victimization rates.

The last study that we reviewed in relation to attacks made use of a simulation to evaluate the vulnerability of different groups of employees to phishing during COVID‐19 [ 80 ]. In particular, the authors performed a comparative study of the cybersecurity awareness of employees working in different departments within the same organisation in Bangkok, Thailand. In their experiment, they exposed different employees to simulated phishing attacks and evaluated their actions. After data collection and analysis, the authors found significant differences in cybersecurity awareness between Thai employees from technology‐based departments (e.g., IT department) and social‐based departments (e.g., HR department) within the same organisation, with the latter group proving more vulnerable to phishing attacks than the former. Simulations such as the one described in [ 80 ] have recently been regarded as a promising tool for training staff in preparation for future cyber‐attacks. For instance, in the context of healthcare professionals, [ 81 ] proposed to carry out cybersecurity campaigns in which members of the IT departments send out fake phishing emails to the rest of the staff and provide further training to those who fail to identify them. However, in spite of the widespread awareness of the cybersecurity limitations of the healthcare sector [ 25 , 26 ] and of advice, such as that of [ 81 ], given several months before the outbreak of COVID‐19, few enterprises enacted significant changes, which worsened the impact of the massive wave of phishing attacks that occurred in the aftermath of the pandemic.

4. COUNTERMEASURES

While the previous section focussed on the drivers and the characteristics of the phishing attacks that occurred during the COVID‐19 pandemic, this section discusses the defenses and countermeasures proposed against such attacks.

4.1. Overview and synthesis

The multitude of ways in which phishing attacks were mounted demanded the development of a broad array of different solutions. Each solution surveyed and described here exploits some characteristics of COVID‐19 phishing attacks, such as those that we discussed in Section  3 . First, we summarise the main approaches adopted for detecting phishing attacks during COVID‐19. Then, we focus on the key factors that influence the effectiveness of machine learning solutions, that is: data, methods (i.e., algorithms) and features. Hence, we highlight the available datasets for this task, as well as the methods and the features used for developing detectors. Table  3 supports and complements this discussion by presenting a detailed classification and comparison of the techniques that were recently proposed for detecting COVID‐19 phishing attacks.

Table 3. Detailed classification and comparison of some recently proposed techniques for detecting COVID‐19 phishing, smishing and vishing attacks

| Reference | Year | Focus | Dataset | Target | Method | Features | Evaluation |
|---|---|---|---|---|---|---|---|
| Mishra & Soni [ ] | 2021 | Smishing | [ ] + Pinterest | SMSs | Deep learning, RF, NB, DT | SMS text | Test accuracy = 0.98 |
| Biswal [ ] | 2021 | Vishing | [ ] | Calls | SVM, LR, MP | Call transcript text | Test accuracy = 0.65 |
| Wu & Guo [ ] | 2021 | Phishing | Own (unreleased) | Emails | Document embeddings, anomaly detection | SMTP headers | Case‐study and comparison against commercial solutions |
| Sarma [ ] | 2021 | Phishing | Mendeley | Websites | kNN, RF, SVM, LR | URL, website content, website metadata | Test F1 = 0.98 |
| Mukhopadhyay & Prajwal [ ] | 2021 | Phishing | Own (unreleased) | Emails, websites, malware | Blacklists, heuristics | IP, URL, email attachments | Case‐study and comparison against commercial solutions |
| Ispahany & Islam [ ] | 2021 | Phishing | DomainTools | URLs | SVM, kNN, NB | URL | Test accuracy = 0.99 |
| Xia [ ] | 2021 | Phishing | Own (unreleased) | Websites, URLs | Knowledge graphs, graph representation learning, graph clustering | IP, URL | Qualitative and case‐study |
| Tawalbeh [ ] | 2020 | Phishing | Own (unreleased) | Malware | Deep learning | Email attachments | Training accuracy = 0.85 |
| Saha [ ] | 2020 | Phishing | Kaggle | Websites | MP | IP, URL, website metadata | Test accuracy = 0.93 |
| Basit [ ] | 2020 | Phishing | UCI machine learning repository | Websites | Ensemble of classifiers (RF, kNN, DT) | URL | Test accuracy = 0.97 |
| Pritom [ ] | 2020 | Phishing | CheckPhish, DomainTools | Websites | RF, kNN, DT, LR, SVM | URL, website metadata | Test accuracy = 0.98 |

Abbreviations: RF, random forest; NB, naïve Bayes; DT, decision tree; SVM, support vector machine; LR, logistic regression; MP, multilayer perceptron; kNN, k‐nearest neighbours.

4.1.1. Approaches

Among all solutions that were recently proposed to defend from phishing attacks, the vast majority was aimed at detecting phishing websites , as also shown in Table  3 . This finding is perhaps unsurprising, considering that emails and websites were the most frequent attack vectors exploited during the COVID‐19 pandemic [ 1 ]. The most straightforward way to tackle the task of detecting COVID‐19 phishing websites is by analysing website contents. Approaches of this kind typically revolve around assessing the presence or absence of covid‐specific keywords in website names and contents (e.g., coronavirus, COVID‐19, masks, n95, and more) [ 52 ]. Another frequent approach to the detection of phishing websites is based on the analysis of the website's URL. To this end, it was observed that attackers frequently used cybersquatting and typosquatting techniques, or techniques to obtain homograph domain names, to make COVID‐19 themed malicious websites mimic legitimate ones [ 94 ], which highlights the importance and usefulness of detecting such modified URLs. Other approaches focus instead on the website's age, since malicious websites tend to be more recent than authoritative ones [ 91 ]. The works in Table  3 that target phishing websites represent notable examples of the combination of the aforementioned approaches.
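The keyword and homograph checks described above can be illustrated with a minimal sketch. The keyword list reuses the examples given in the text and is not exhaustive, the function names are hypothetical, and the homograph test is only a coarse heuristic (any non‐ASCII character or punycode label) that would also flag legitimate internationalised domains. It is not the method of any of the surveyed systems.

```python
from urllib.parse import urlsplit

# Illustrative keywords based on the examples in the text; not exhaustive.
COVID_KEYWORDS = ("coronavirus", "covid", "masks", "n95")


def hostname_of(url: str) -> str:
    """Return the lower-cased hostname of a URL, or an empty string."""
    return (urlsplit(url).hostname or "").lower()


def covid_keywords_in(url: str) -> list:
    """List the keywords that occur in the hostname or path of the URL."""
    parts = urlsplit(url)
    haystack = ((parts.hostname or "") + parts.path).lower()
    return [kw for kw in COVID_KEYWORDS if kw in haystack]


def looks_like_homograph(url: str) -> bool:
    """Flag hostnames with non-ASCII characters or punycode ("xn--") labels."""
    host = hostname_of(url)
    has_non_ascii = any(ord(ch) > 127 for ch in host)
    has_punycode = any(label.startswith("xn--") for label in host.split("."))
    return has_non_ascii or has_punycode
```

In practice such signals would be combined with others (for example the age of the domain) rather than used on their own, since covid‐related keywords also appear in many legitimate URLs.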

The second most‐common approach for detecting phishing attacks is grounded in the analysis of emails , another frequently used attack vector. As with systems for detecting phishing websites, many systems for phishing email detection are based on the analysis of email contents. For instance, covid‐related keywords – such as those related to cures, guidelines, or offers – can be searched for in subject lines and in the textual contents [ 52 ]. Other techniques based on email content analysis focus instead on the links contained in the email, or on its attachments. The former typically analyse the URLs of the links by means of the same techniques already described for the analysis of website URLs. The latter are instead aimed at assessing the harmfulness of any file attached to the email, for instance by means of static and dynamic analyses of the file's content. Finally, another common approach to the detection of COVID‐19 phishing emails is based on spotting email spoofing or masquerading attacks. Here, the analysis is aimed at verifying the identity of the sender, for example, by analysing the headers of the email [ 86 ].
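As a rough illustration of header‐based sender verification, the sketch below compares the domain of the From header with that of the Return-Path header. This is an assumed, simplistic heuristic chosen for brevity, not the approach of [ 86 ]; legitimate bulk‐mailing services often produce such mismatches, so the result is at best a weak signal. The function names are hypothetical.

```python
from email import message_from_string
from email.utils import parseaddr


def domain_of(address_header: str) -> str:
    """Extract the lower-cased domain from an address header value."""
    _, addr = parseaddr(address_header or "")
    if "@" not in addr:
        return ""
    return addr.rpartition("@")[2].lower()


def sender_mismatch(raw_email: str) -> bool:
    """True when From and Return-Path are both present and their domains differ."""
    msg = message_from_string(raw_email)
    from_domain = domain_of(msg.get("From", ""))
    return_domain = domain_of(msg.get("Return-Path", ""))
    return bool(from_domain and return_domain and from_domain != return_domain)
```

A production system would more likely rely on authentication results such as SPF, DKIM and DMARC, which are outside the scope of this sketch.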

COVID‐19 themed malicious apps are another common vector for phishing attacks. A set of approaches for defending against this threat is based on computer vision techniques that assess the visual similarity of new app logos with those of legitimate existing apps [ 52 ]. Other techniques are instead based on static and/or dynamic analyses of the apps, in order to detect malicious ones (e.g., repackaged apps). A minority of approaches also aims at detecting spoofed app names, for example, by computing string edit distances between the names of new apps with respect to existing and popular ones.
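The string edit distance mentioned above can be computed with the classic Levenshtein dynamic programme. The following is a minimal sketch; the helper `closest_known_app`, the threshold of 2 and the app names in the usage example are illustrative assumptions, not values taken from the surveyed works.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest_known_app(name, known_apps, max_distance=2):
    """Return (known_name, distance) for the nearest near-miss, else None.

    Exact matches (distance 0) are not reported, since they are not spoofs.
    """
    candidate = name.lower()
    best = None
    for known in known_apps:
        d = edit_distance(candidate, known.lower())
        if 0 < d <= max_distance and (best is None or d < best[1]):
            best = (known, d)
    return best
```

For example, `closest_known_app("Covid Trackr", ["Covid Tracker", "Health Pass"])` reports the first known name at distance 1, which would mark the new app name for closer inspection.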

Smishing and vishing attacks represent a minority of all phishing attacks that occurred during the pandemic. As such, only a few works specifically targeted these attacks, as also shown in Table 3 . Textual analysis of the content of the messages (in the case of smishing) or of the call transcripts (in the case of vishing) is by far the most common approach for detecting these types of attacks. Such analyses can be carried out with natural language processing techniques, for instance with the goal of spotting suspicious content, such as the presence of spoofed URLs, special characters, and COVID‐19 themed keywords [ 52 ]. Other, more sophisticated approaches are also based on natural language processing, but aim at detecting persuasive messages that make use of propaganda techniques [ 28 ] or other social engineering techniques [ 95 ]. These latter works lie at the intersection of cyberpsychology and natural language processing [ 96 ].

4.1.2. Datasets

High quality and reference datasets represent an important resource to foster research and experimentation on novel scientific issues [ 97 ]. However, building such resources is notoriously challenging and time‐demanding [ 98 ]. In the case of COVID‐19 phishing attacks, publicly available reference datasets are few and far between. In addition to the aforementioned generic challenges, scholars interested in building a scientific dataset for covid‐themed phishing attacks also had to account for the recency and unpredictability of the pandemic (and its associated scamdemic), and for its rapidly evolving nature. As a result, at the time of writing no reference dataset for COVID‐19 phishing attacks exists and scholars tackling phishing detection either had to build their own dataset or to rely on existing, yet older, ones.

The only partial exceptions to the above consideration are the datasets released by DomainTools 9 and CheckPhish. 10 Both companies were extremely rapid to intervene against the deluge of malicious domains that plagued the Web during the first months of the pandemic. They curated and periodically updated lists of scam covid‐themed websites and made such lists publicly available. As also shown in Table  3 , datasets from DomainTools and CheckPhish were used by a subset of the papers that proposed website and URL COVID‐19 phishing detection systems, such as [ 89 , 93 ]. Unfortunately, as of now both the DomainTools and the CheckPhish datasets appear to be no longer publicly available. To partially ameliorate this issue, DomainTools suggested another publicly available dataset, 11 curated by the COVID‐19 Cyber Threat Coalition. To the best of our knowledge, no scientific study has been conducted on such dataset.

The novelty of the issue and the lack of reference datasets forced many scholars interested in experimenting with COVID‐19 phishing detection to build their own dataset. For instance, this route has been chosen for the development of the HOLMES [ 86 ] and EDITH [ 88 ] systems, and for the systems presented in [ 34 , 90 ]. This approach has however several drawbacks. First, none of the datasets built in this way were made publicly available by the respective authors, thus hindering replicability and future research along this direction. Second, the datasets are related to very specific issues and have been collected with ad‐hoc methodologies. As a practical example, the dataset used in [ 86 ] was obtained from the SMTP server of an unspecified firm. As a consequence of these limitations, datasets built ad‐hoc for a specific study are often small, which raises concerns about the validity and generality of the results obtained from their analysis.

An alternative to building an ad‐hoc dataset is the use of well‐known existing datasets. For instance, the datasets originally used in [ 83 , 85 ] were also used to train and evaluate the systems recently proposed in [ 82 , 84 ]. Similarly, other scholars used data published in well‐known scientific repositories such as Mendeley, Kaggle and the UCI collection of machine learning datasets. However, this solution also presents an important drawback. Some systems were designed with COVID‐19 phishing attacks in mind, but the lack of specific reference datasets forced authors to evaluate their proposed system against other attacks. In many cases, the attacks contained in the datasets used occurred well before the start of the COVID‐19 pandemic. Again, the concern is about the reliability of the results of such systems – some of which are remarkably good, as visible in Table 3 – that were designed and proposed for the COVID‐19 scenario, but were evaluated otherwise.

4.1.3. Methods and features

The previous section highlighted the limitations of current research with respect to the choice of datasets for training and evaluating detectors. Similar considerations also apply to the choice of machine learning algorithms and features. Indeed, the choice of a machine learning algorithm strongly depends on the characteristics of the available data [ 99 ]. To this regard, the most powerful and advanced analytical methods currently available are based on deep learning. Deep learning algorithms, however, require massive datasets for training, which are not yet available for the task of COVID‐19 phishing detection. As such, and also due to the relatively limited time passed since the start of the pandemic, the vast majority of existing detectors are based on simpler, general‐purpose and off‐the‐shelf classification algorithms. Table  3 shows that nearly all traditional classification algorithms were tested for the detection of COVID‐19 phishing attacks. These include algorithms such as decision trees and random forests, logistic regression, k‐nearest neighbours and support vector machines. Clearly, these represent the quickest and most straightforward way of tackling a classification task, such as that of phishing detection. Simplicity, scalability and mild data requirements however come at the cost of predictive power and generalisability. The adoption of more complex methods, such as those based on deep learning that were used in [ 82 , 90 ], is still largely overlooked. A minority of systems are also based on ensembles of supervised classifiers—such as [ 92 ], or on unsupervised machine learning—such as [ 86 ].

The machine learning features used by the existing detectors are mainly based on the textual content of the item under investigation (e.g., a website, email, SMS, etc.). In fact, the text has long been the most widely used data modality in many detection tasks [ 100 ]. In the context of phishing attacks, textual content can be found in emails, websites, OSNs, text messages (e.g., SMSs or any other message in instant messaging apps), call transcripts and app information. In addition, the analysis of URLs can also be considered as a form of text analysis. Because of the ubiquity of text, almost all COVID‐19 phishing detection systems leverage textual features. The current state‐of‐the‐art for extracting textual features is based on deep learning, and particularly, on artificial intelligence methods for natural language understanding [ 101 ]. However, the solutions exploited in the surveyed phishing detection systems are again largely based on more traditional and less powerful approaches. For example, bag‐of‐words features or simple sequences of characters and words (i.e., character and word n ‐grams) were used as text features in [ 84 ]. As such, the application of more recent and powerful text feature extraction techniques is still unexplored, with the exception of the HOLMES system that uses unsupervised word embeddings as text features [ 86 ]. The issue related to the use of simple and ‘shallow’ features also emerges when surveying systems that also leverage other data modalities. For example, many different features can be used for the detection of phishing websites, thus going beyond the mere analysis of the textual content of the website. Among such features are images, links, the HTML code and CSS documents of the website, JavaScript features, ActiveX Objects and forms [ 16 ]. 
However, the website classification systems reported in Table  3 almost exclusively rely on the analysis of the website's URL and on the assessment of the presence of certain covid‐related keywords. Similarly, assessing the validity of URLs could involve querying DNS services and retrieving WHOIS and web traffic data [ 16 ], which is seldom done in the case of the analysed COVID‐19 phishing detectors.
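The bag‐of‐words and n‐gram features mentioned above are simple enough to sketch in a few lines. The tokenisation used here (lower‐casing and whitespace splitting) is an assumption made for brevity; the surveyed systems do not document their preprocessing in this section.

```python
from collections import Counter


def word_ngrams(text: str, n: int = 1) -> Counter:
    """Count word n-grams; n=1 yields a plain bag of words."""
    tokens = text.lower().split()
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams of the lower-cased text."""
    s = text.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))
```

Counts of this kind ignore word order beyond the n‐gram window and carry no notion of meaning, which is why they are described above as 'shallow' compared with learned embeddings.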

4.2. Detailed literature review

The literature discussing countermeasures to phishing attacks is mainly organised in two large bodies of work. The first body of work proposes general and long‐known recommendations, and discusses their application to the specific and novel situation caused by the COVID‐19 pandemic. Part of the literature in this body of work overlaps with, or is similar to, the introductory works already discussed in Section 3.2.1 . The second body of work instead takes an orthogonal approach to the problem of phishing attacks during COVID‐19 and proposes ad‐hoc technical solutions, the majority of which are based on machine learning, for automatically and promptly detecting such attacks.

4.2.1. Works proposing general recommendations

Our analysis of the papers that provided actual recommendations to defend against phishing attacks reveals that the majority of works suggested a combination of the following three general strategies: (i) increasing user awareness of phishing attacks, which was suggested in [ 57 , 61 , 62 , 63 , 102 ]; (ii) resorting to multi‐factor authentication, proposed in [ 57 , 59 , 60 , 61 , 65 , 102 ]; and (iii) resorting to the use of VPNs, which was proposed in [ 59 , 60 , 61 , 65 , 102 ]. Among these works, the authors of [ 61 , 102 ] provided all three aforementioned recommendations. In particular, [ 102 ] first conducted a survey to investigate the types of cyber‐attacks that users suffered during COVID‐19, as well as the level of knowledge and the technical challenges faced by users who switched to remote services during the pandemic. The survey highlighted phishing emails as the most common type of attack, corroborating previous findings [ 1 , 7 ]. Part of the survey was also targeted at understanding victim behaviours when they were attacked. Surprisingly, as much as 62.5% of respondents admitted that they did not take any specific countermeasure because of a lack of awareness and understanding of the type of attack. Results such as those presented in this study motivate this body of research – namely, studies that analysed the initial situation of the pandemic and that rapidly intervened to provide simple, yet relatively effective, recommendations such as those listed above.

In addition to the previous ‘horizontal’ works that provided general recommendations, some scholars also carried out ‘vertical’ analyses by focussing on specific issues and relevant case‐studies. As a notable example of this kind, [ 103 ] investigated the task of measuring cyber‐resilience, a preliminary – yet mandatory – step towards the development of better countermeasures to cyber‐attacks. The paper highlighted common misunderstandings in the definition and notion of cyber‐resilience, which impair our capacity to measure it. They stressed the importance of considering systems' abilities to recover and to adapt, and not just to resist to cyber‐attacks. The paper also proposed different methods for measuring cyber‐resilience, taking into account cyber‐security implementations as well as adversarial models. Still related to the analysis of cyber‐resilience, [ 104 ] analysed how a global financial institution (GFI) dealt with the cybersecurity challenges posed by COVID‐19. Authors conducted semi‐structured in‐depth interviews with 11 key actors from the GFI and leveraged Hollnagel's four abilities for resilient performance as a theoretical lens for their evaluation [ 105 ]. Among the main findings of the research was that the organisation performed well in terms of cyber‐resilience, in the sense that the number and impact of cyber‐incidents did not significantly increase after the COVID outbreak. The interviews also revealed that all four abilities of resilience were formally developed prior to the COVID‐19 outbreak. The analysis however also showed that the favourable performance was obtained through many actions undertaken reactively rather than proactively, as it is instead advisable for a number of cybersecurity issues [ 106 ]. As such, [ 104 ] leaves open the question as to whether the four potentials should be developed beforehand, in order to perform resiliently during crises.

4.2.2. Works proposing technical solutions

The general advice discussed in the previous section can be beneficial in reducing the frequency of successful phishing attacks [ 58 ]. This is the reason why so many researchers and practitioners rushed to make these recommendations in the first months of the COVID‐19 pandemic. However, none of these countermeasures is capable of completely solving the problem. For instance, studies that analysed advanced phishing attacks made via sophisticated phishing toolkits or via phishing‐as‐a‐service showed that such attacks are capable of evading two‐factor authentication schemes [ 107 ]. The same result can also be achieved simply by mounting more elaborate social engineering attacks. 12 As such, the need for technical and intelligent systems for detecting such phishing attacks remains. In the remainder of this section we discuss relevant works that provided this kind of contribution.

As anticipated, the majority of technical countermeasures to phishing attacks are based on machine learning. Accordingly, the main goal of the work discussed in [ 108 ] was to identify and propose ways in which machine learning techniques could be deployed for the detection of diverse types of cyber‐crimes, such as phishing, identity theft, hacking, distributed denial of service, email bombing, and digital stalking. The authors discussed different types of machine learning‐based implementations in cyber‐crime mitigation, including ways in which machine learning could contribute to phishing detection, with particular reference to the detection of phishing emails via analysis of the headers and body of the emails. The techniques proposed in [ 108 ] are effectively used in the following systems. In [ 86 ], the authors introduced HOLMES, a novel AI‐based detector of anomalous emails. HOLMES uses the email headers as input for the machine learning algorithm. Furthermore, it combines word embeddings with novelty detection to discover anomalous behaviours from a high volume of mirrored SMTP traffic in a large‐scale enterprise environment. Its performance was measured in a limited number of case‐studies, and its detection capability was compared with several well‐known commercial detectors. The evaluation showed that HOLMES significantly outperformed those commercial products in all considered attack scenarios. During the development of the system, emphasis was also placed on its efficiency and on its capacity to run in environments characterised by a limited availability of computational resources. The EDITH system, proposed in [ 88 ], was also designed to detect phishing emails. Specifically, EDITH (the Email Disintegration Intrusion‐Detection of Trojan Hacktool) aims at identifying the embedded malware files and fake websites that are often present in phishing emails. 
EDITH takes emails exported from Thunderbird or Gmail and scans them for URLs and attachments. It compares these to the VirusTotal database and applies a blacklist approach and heuristics to detect possible phishing and malicious emails. The peculiarity of this system is its capacity to simultaneously scan for phishing links and malware attachments. However, from the analytical perspective the system relies on rather simple methods (i.e., blacklists and heuristics). In the future it could thus be advisable to retain the combined detection of phishing links and malware, while adopting more powerful methods based on machine learning and AI. Like [ 88 ], the system proposed in [ 90 ] is also designed to detect malicious emails. This time however, only the content of email attachments is analysed and, as such, the system is specifically focussed on detecting malware. The authors of [ 90 ] proposed to rely on deep learning for performing the detection. However, some important details of their methodology are undisclosed, including the type of deep learning architecture and the types of features used by their system.
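To make the idea of jointly scanning links and attachments concrete, the sketch below extracts URLs and attachment file names from a raw message and checks them against a local blacklist and a list of risky file extensions. It is only loosely inspired by the description of EDITH: the function names, the exact‐match blacklist and the extension list are all assumptions, and no call to VirusTotal is made.

```python
import re
from email import message_from_string

URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def extract_indicators(raw_email: str):
    """Return (urls, attachment_names) found in a raw RFC 822 message."""
    msg = message_from_string(raw_email)
    urls, attachments = [], []
    for part in msg.walk():
        filename = part.get_filename()
        if filename:
            # Parts carrying a file name are treated as attachments.
            attachments.append(filename)
        elif part.get_content_maintype() == "text":
            payload = part.get_payload(decode=True) or b""
            text = payload.decode("utf-8", errors="replace")
            urls.extend(URL_PATTERN.findall(text))
    return urls, attachments


def flag_email(raw_email, url_blacklist, risky_extensions=(".exe", ".js", ".scr")):
    """Report blacklisted URLs and attachments with a risky extension."""
    urls, attachments = extract_indicators(raw_email)
    bad_urls = [u for u in urls if u in url_blacklist]
    bad_files = [f for f in attachments if f.lower().endswith(tuple(risky_extensions))]
    return {"urls": bad_urls, "attachments": bad_files}
```

Exact string matching against a blacklist is easy to evade with small changes to a URL, which illustrates why the text above suggests moving towards learning‐based methods.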

Several systems were also developed to detect phishing websites. To this end, the analysis presented in [ 87 ] experimented with various machine learning classifiers, including k‐nearest neighbors (kNN), random forest, support vector machines, and logistic regression. The authors relied on a public dataset available on Mendeley, comprising 5000 phishing websites and 5000 real websites, described by 48 machine learning features mostly based on website content and metadata. Results of the evaluation campaign in [ 87 ] showed that the random forest classifier achieved the best performance, with F1‐score = 0.98. Comparable approaches were discussed in [ 92 , 93 ]. In particular, [ 92 ] proposed an ensemble method to effectively detect website phishing attacks. The authors selected three well‐known machine learning classifiers, namely an artificial neural network (ANN), kNN, and a decision tree (DT), to use in an ensemble method together with a random forest classifier (RF). The authors used a dataset from the UCI machine learning repository with 11,055 instances and 30 features. As in [ 87 ], the dataset is almost balanced, with 4898 legitimate instances and 6157 phishing instances. The results show that the ensemble with kNN + RF achieved the best results, with accuracy = 0.97 and TP rate = 0.983, followed by the ANN + RF with TP rate = 0.981 and by the DT + RF with TP rate = 0.977. In [ 91 ] the authors proposed an ANN model that assigns websites to one of three categories: (i) phishing websites, (ii) suspicious websites, and (iii) legitimate websites. To perform the detection, the system leverages a publicly available Kaggle dataset comprising more than 10,000 instances of legitimate and phishing websites, described by features extracted from the IP address, the website's URL and its metadata. The ANN model used is the multilayer perceptron, a very simple kind of ANN architecture. 
As such, better results are foreseeable with the adoption of more sophisticated classification algorithms or ANN architectures. A somewhat simpler approach to the detection of phishing websites is the detection of phishing URLs. For this latter task, only the URL string of a website is considered, which inevitably leads to a much narrower array of possible features to leverage for the detection. Among the systems that tackled this task is [ 89 ]. The authors proposed a classification approach that exploits only 5 features extracted from URLs. In addition to traditional and widely used features such as the length of the URL and the number of hyphens it contains, [ 89 ] also used a feature computed as the Shannon entropy of the URL. The experiments involved support vector machines, kNN and naïve Bayes classifiers. The best classification results were achieved by kNN with accuracy = 0.99 on the test‐set. Surprisingly, the authors measured no gain in detection performance when adding the entropy feature to the set of more traditional features, a finding that contrasts with earlier results [ 109 , 110 ]. This result could however be due to the simplicity of the task tackled in [ 89 ], which could already be addressed with remarkable accuracy by leveraging traditional URL features alone.
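The URL features named above are straightforward to compute. The sketch below covers only the three features explicitly mentioned in the text (length, number of hyphens, and Shannon entropy of the character distribution); the remaining features used in [ 89 ] are not specified here, and the function names are hypothetical. For a string in which character c occurs with relative frequency p(c), the entropy is H = −Σ p(c) log2 p(c).

```python
import math
from collections import Counter


def shannon_entropy(s: str) -> float:
    """Shannon entropy, in bits per character, of the characters of s."""
    if not s:
        return 0.0
    counts = Counter(s)
    total = len(s)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def url_features(url: str) -> dict:
    """Compute the three URL features named in the text."""
    return {
        "length": len(url),
        "hyphens": url.count("-"),
        "entropy": shannon_entropy(url),
    }
```

A URL made of a single repeated character has entropy 0, while randomly generated domain names tend to have higher entropy than dictionary words, which is the intuition behind using this feature.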

The work presented in [ 34 ] dealt with the proliferation of malicious domain campaigns. Differently from previous works that tackled the classification of individual websites, the goal of this work was the detection of malicious campaigns. The authors defined malicious domain campaigns as groups of related malicious websites. At first, they demonstrated the widespread presence of such campaigns, especially in the first months of the pandemic, which were characterized by a surge of covid‐related domain registrations. Then, they proposed a detection strategy based on 3 steps: (i) the construction of a knowledge graph of domains, where related domains are linked together; (ii) a graph representation learning step, where an informative representation is computed for each node in the graph (i.e., each domain), in the form of a feature vector; and (iii) a graph clustering step, where similar domains are clustered together, based on their representation. In [ 34 ], the clustering step was used to group together the domains belonging to the same malicious campaign, thus effectively leading to the discovery and characterization of malicious campaigns. Based on its characteristics, [ 34 ] represents one of the most advanced solutions to the detection of phishing (websites) in the context of COVID‐19. First of all, it employs state‐of‐the‐art methods, such as knowledge graphs, graph representation learning and graph clustering, instead of traditional classification algorithms. Then, it proposes a solution based on unsupervised machine learning, which was recently proven to be more resilient to the inevitable evolution of cyber‐attacks [ 31 , 111 ]. Finally, it focuses on the detection of groups of malicious websites, rather than individual websites, thus leveraging the inherent relationships between phishing websites and the additional information available in this way.
Again, focusing on group analyses instead of the classification of individual entities is a promising direction of research in several areas of cybersecurity [ 31 ]. Among the other advantages of this work is the construction of a large and detailed dataset, which however has not been publicly released to the scientific community.
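
To make the idea of group-level analysis concrete, the sketch below groups domains that are linked through shared attributes. It is a deliberately simplified stand-in and not the method of [ 34 ]: it builds the links of step (i) but replaces the representation learning and clustering of steps (ii) and (iii) with plain connected components computed by union-find. The attribute strings (e.g., a shared hosting IP or name server) and the function name are hypothetical.

```python
from collections import defaultdict


def group_campaigns(domain_attrs: dict[str, set[str]]) -> list[set[str]]:
    """Group domains that are connected through shared attribute values."""
    parent = {domain: domain for domain in domain_attrs}

    def find(domain: str) -> str:
        # Follow parent links to the group representative (with path halving).
        while parent[domain] != domain:
            parent[domain] = parent[parent[domain]]
            domain = parent[domain]
        return domain

    # Link every domain to the first domain seen with the same attribute value.
    first_seen: dict[str, str] = {}
    for domain, attrs in domain_attrs.items():
        for attr in attrs:
            if attr in first_seen:
                parent[find(domain)] = find(first_seen[attr])
            else:
                first_seen[attr] = domain

    # Collect the connected components as candidate campaigns.
    groups: dict[str, set[str]] = defaultdict(set)
    for domain in domain_attrs:
        groups[find(domain)].add(domain)
    return list(groups.values())
```

Connected components are far coarser than learned representations, since a single shared attribute merges two groups, but the sketch shows why relationships between domains carry information that per-website classification ignores.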

To conclude our detailed analysis of proposed phishing countermeasures, we discuss systems for defending against smishing [ 82 ] and vishing [ 84 ] attacks. In detail, [ 82 ] proposed the DSmishSMS system, targeted at the detection of smishing SMSs. The system aimed to address some of the typical challenges of smishing detection, including the brevity of text messages, which limits the number of available features, and the scarcity of labeled datasets for training a detector. To overcome these limitations, DSmishSMS leveraged only 5 features extracted from the text of the SMSs, including features aimed at encoding the authenticity of the URLs contained in the analyzed SMSs. The classification was obtained with an ANN trained with the backpropagation algorithm, which achieved accuracy = 0.98. Classifications from the ANN were also compared to those obtained with traditional algorithms, such as random forest, naïve Bayes and DT. The comparison showed that the ANN beat its competitors by a tiny margin, at the expense of a slightly longer execution time. The RIVPAM system is instead aimed at the detection of vishing attacks [ 84 ]. Specifically, RIVPAM (Real‐Time Vishing Prediction and Awareness Model) was designed to alert unwary potential vishing targets in real‐time, during vishing attacks. The system uses a combination of natural language processing and machine learning to analyze conversations in real‐time and is capable of issuing warning messages when it detects a possible ongoing attack. The classification is performed with algorithms such as support vector machine, logistic regression and multilayer perceptron, which analyze some simple linguistic features (e.g., n‐grams) extracted from the conversations. Vishing detection results achieved by RIVPAM are rather low, with the best reported accuracy = 0.65 on the test‐set. 
Similarly to other surveyed systems that adopted shallow features and traditional classification algorithms, better results are foreseeable by the adoption of more advanced techniques for both the feature extraction and the classification steps.
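
For readers unfamiliar with the shallow linguistic features mentioned above, the following sketch counts word n-grams in a transcribed utterance. The whitespace tokenization, lowercasing and function name are assumptions made here; the preprocessing actually used by RIVPAM is not described in this section.

```python
from collections import Counter


def word_ngrams(text: str, n: int = 2) -> Counter:
    """Count the word n-grams of a text after lowercasing and whitespace splitting."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
```

Such counts can be turned into a feature vector for a traditional classifier, but they capture only local word order, which is one plausible reason why richer representations are expected to perform better.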

5. DISCUSSION: CHALLENGES AND FUTURE DIRECTIONS

Thoroughly investigating a problem represents the first step for reaching a satisfactory solution. This simple consideration and the relatively limited time passed since the start of the pandemic motivate and explain the first finding of our literature review. That is, the landscape of research on COVID‐19 phishing attacks and their countermeasures is made of a majority of studies aimed at investigating attacks, with only a relative minority of works that proposed specific solutions to them. The analysis of the literature that investigated attacks revealed that scholars already explored different directions of research and evaluated different aspects of the attacks. For instance, while some papers provided a general (i.e., horizontal) overview of the cyber‐attacks that occurred during COVID‐19, out of which phishing represents the utmost example, others carried out more constrained yet detailed (i.e., vertical) analyses of specific issues. Among them are papers that investigated (i) the causes of vulnerability to phishing attacks during COVID‐19 [ 7 , 10 ], (ii) the rise of malicious domain registrations [ 34 , 49 ], (iii) the economic impact of phishing attacks [ 7 ], (iv) the responses enacted by some countries to fight the rampaging COVID‐19 scamdemic [ 64 , 65 , 66 ], (v) the peculiar cybersecurity challenges faced by the healthcare sector [ 10 , 20 , 22 , 25 ], and more. As such, the body of research on covid‐related phishing attacks appears to be diversified, dense and overall already mature. On the contrary, our detailed analysis of the proposed countermeasures to such attacks revealed a number of challenges and drawbacks.

5.1. Current challenges

In Section 4 we identified a lack of reference datasets and we highlighted that the majority of proposed COVID‐19 phishing detectors are based on simple and traditional classification algorithms and on small sets of shallow features. The first issue, that is, the limited availability of reference datasets, can be traced back to a combination of long‐known and covid‐related challenges. Firstly, building high‐quality scientific datasets has always been a very demanding task [ 98 ]. In addition, the impact and the recency of the pandemic left even less time and fewer resources for scholars to tackle this task. As such, a general lack of extensive, high‐quality data on the novel problem of covid‐related phishing attacks is somewhat expected at this stage. Nonetheless, this is causing several problems for the scholars working in this field. One general problem is that this lack of resources inevitably hinders research on covid‐related cyber‐attacks. Another problem is related to the capability of training and evaluating automatic systems for phishing detection. In particular, the current situation, in which each detector was evaluated on a different dataset, many of which are small and not publicly available, inevitably raises concerns about the validity and generality of the evaluations reported in the existing papers.

The second issue unveiled by our analysis is related to the use of traditional (i.e., not state‐of‐the‐art) machine learning algorithms and of shallow features. As shown, the majority of proposed phishing detectors was based on classification algorithms such as decision trees, random forests and support vector machines, instead of more recent and better performing solutions, such as those based on deep learning [ 112 ]. The same considerations can be made for the choice of machine learning features, which is not on par with current state‐of‐the‐art solutions [ 113 ]. Notably, the issue with the choice of algorithms and features strictly depends on the lack of reference datasets. This is particularly true for the possible application of deep learning to the task of phishing detection, for which large datasets are needed in order to train and optimize deep neural networks that easily involve millions of parameters [ 114 ].

5.2. Future directions of research

As anticipated, the body of research on covid‐related phishing attacks is overall mature. However, some specific areas could nonetheless benefit from additional research. One of such areas is that related to the quantification of the effects (or impact) of the attacks. This task has been mostly left to cybersecurity firms and governmental agencies, but it could instead see a deeper academic involvement. Notably, measuring the effects of cyber‐attacks currently represents an open and promising research direction that goes beyond phishing and COVID‐19. In fact, quantifying effects is meaningful and needed in all those areas of cybersecurity that deal with relatively new types of attacks (e.g., fake news and all forms of online information manipulation [ 31 ]) and countermeasures [ 115 ]. Here, a better assessment and quantification of the consequences of phishing attacks during a major crisis could inform decisions for a broad array of stakeholders, including policymakers, law enforcement personnel, as well as all those scholars and practitioners actively involved in developing effective countermeasures.

Since each challenge comes with opportunities, the area related to the development of countermeasures to COVID‐19 phishing attacks is the one that currently presents the majority of opportunities for future research. For example, the aforementioned lack of reference datasets for training and validating detectors mandates additional work in this important direction. In fact, works aimed at collecting, developing and sharing scientific resources, including datasets but also tools, software and benchmark platforms/frameworks, are much needed and are likely to have a strong impact on the scientific community. As such, this scientific endeavour represents a low‐hanging fruit. Then, with more and better data it is foreseeable that more sophisticated and powerful detectors will be developed. In other words, we envision that the greater availability of resources will bootstrap the next wave of research on covid‐related cyber‐attacks, including experimentation with those algorithms and techniques whose application was daunting or infeasible until now. Notably, not only does this direction of research involve new experimentation with deep learning‐based methods for feature extraction and attack detection, but it also opens up the possibility to experiment with feature selection techniques [ 116 ] and with techniques for combining simple classifiers, such as ensemble methods [ 3 ]. All these techniques have seen very limited application until now, because of the limitations that we previously discussed. However, they have already proven their efficacy in related tasks and are thus likely to provide favourable results also for the detection of COVID‐19 phishing attacks.

Another challenge that we highlighted in the previous section is the difficulty of assessing the validity of the experimental results of phishing detectors. In this regard, another much needed direction of research is the development of systematic evaluation campaigns of the existing detectors. As typically happens with many detection tasks [ 31 ], the majority of efforts are devoted to developing new detectors and only a small minority of works focus on evaluating and comparing the different detectors. With the foreseen increase in the development of state‐of‐the‐art phishing detectors, the latter task will become even more important. Systematic evaluations of the existing detectors should not only involve comparisons between the detectors, but should also include experiments aimed at evaluating the generalizability of the different detectors, that is, their capacity to detect attacks for which they were not trained. The latter test in particular has proven valuable in other tasks for identifying detectors' generalization deficiencies and for estimating their capacity to thwart future and unforeseen attacks [ 111 ].
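
One possible way to set up such a generalizability test is a leave-one-group-out protocol, sketched below: every attack group (for instance a campaign or an attack type) is held out in turn, so that the detector is always tested on attacks it never saw during training. This is only an illustration of the idea; the grouping function and the sample layout are hypothetical, and in practice benign samples would also have to be distributed across the splits.

```python
def leave_one_group_out(samples, group_of):
    """Yield (held_out_group, train, test) with one split per distinct group."""
    groups = sorted({group_of(sample) for sample in samples})
    for held_out in groups:
        # Training data excludes the held-out group entirely.
        train = [s for s in samples if group_of(s) != held_out]
        # Test data contains only the unseen group.
        test = [s for s in samples if group_of(s) == held_out]
        yield held_out, train, test
```

A large gap between the scores obtained with a random split and those obtained with this protocol would point to a generalization deficiency of the kind discussed above.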

5.3. Final remarks

COVID‐19 has been one of the deadliest pandemics in the history of humanity and the first to occur in a massively digitized and hyperconnected world. Withstanding its spread and impact required drastic changes that gave rise to a plethora of problems. One such problem, phishing attacks, has been the subject of this survey.

The long‐term effects of the pandemic on our society are still unclear. However, it is already evident that some changes are bound to stay. As an example, the sudden shift to remote work represented a unique opportunity to reimagine and reorganise businesses, jobs and work habits. The world after COVID‐19 will never be the same. Moreover, more and worse pandemics are expected to strike in the coming years [ 117 ].

What all of this means is that at least some of the problems that we faced during COVID‐19 will remain for a long time and will probably reappear and intensify over and over again. Gunther Eysenbach – the father of infodemiology – stressed in 2009 the need to ‘build tools now to manage future infodemics’ [ 118 ]. In retrospect, we clearly see that his warning call went unheeded [ 119 ]. For all of these reasons, it is of the utmost importance to capitalize on the lessons learnt in this pandemic, for such experiences will be decisive to withstand the future infodemics and scamdemics.

6. CONCLUSIONS

In this survey we focussed on the most frequent type of cyber‐attack perpetrated during the COVID‐19 pandemic: phishing. We systematically analysed and discussed both scientific studies and reports by cybersecurity firms and governmental agencies that investigated phishing attacks or that proposed solutions against them.

Our analysis highlighted that many works investigated the drivers and the characteristics of phishing attacks. Instead, only a minority of scholars worked to build and share resources for the community (e.g., reference datasets) and to propose specific solutions against phishing. Moreover, the existing solutions are mostly based on traditional machine learning techniques, thus largely overlooking the state‐of‐the‐art methods for both the classification and feature extraction steps.

Given this picture, the most favourable directions for future research and experimentation revolve around building resources and sharing them with the community, such as large datasets and evaluation campaigns. Once more resources are available, efforts should be directed towards applying state‐of‐the‐art techniques, such as those based on deep learning, to the task of phishing detection. The lessons learnt from countering phishing and other cyber‐attacks during the COVID‐19 pandemic will be valuable for responding to the cybersecurity concerns that are growing with each passing year.

CONFLICT OF INTEREST

The authors declare that they have no conflicts of interest.

ACKNOWLEDGEMENTS

The publication of this article was funded by the Qatar National Library (QNL), Doha, Qatar and partially supported by award QRLP10‐G‐1803022 from the Qatar National Research Fund (QNRF), member of Qatar Foundation.

Al‐Qahtani, A.F., Cresci, S.: The COVID‐19 scamdemic: a survey of phishing attacks and their countermeasures during COVID‐19. IET Inf. Secur. 16(5), 324–345 (2022). 10.1049/ise2.12073

1 https://www.techrepublic.com/article/covid‐19‐lockdowns‐are‐causing‐a‐huge‐spike‐in‐data‐breaches

2 https://www.zdnet.com/article/thousands‐of‐covid‐19‐scam‐and‐malware‐sites‐are‐being‐created‐on‐a‐daily‐basis/

3 https://us.norton.com/internetsecurity‐online‐scams‐coronavirus‐phishing‐scams.html

4 https://blog.barracuda.com/2020/03/26/threat‐spotlight‐coronavirus‐related‐phishing/

5 https://www.who.int/about/cyber‐security

6 https://www.interpol.int/en/News‐and‐Events/News/2020/INTERPOL‐report‐shows‐alarming‐rate‐of‐cyberattacks‐during‐COVID‐19

7 https://www.vadesecure.com/en/blog/cybercrime‐statistics‐top‐threats‐and‐costliest‐scams‐of‐2020

8 https://covid19scam.github.io/

9 https://www.domaintools.com/resources/blog/free‐covid‐19‐threat‐list‐domain‐risk‐assessments‐for‐coronavirus‐threats

10 https://checkphish.ai/coronavirus‐scams‐tracker

11 https://www.cyberthreatcoalition.org/blocklist

12 https://www.agari.com/email‐security‐blog/phishing‐attacks‐two‐factor‐authentication/

DATA AVAILABILITY STATEMENT


  • Open access
  • Published: 25 May 2022

An effective detection approach for phishing websites using URL and HTML features

  • Ali Aljofey 1 , 2 ,
  • Qingshan Jiang 1 ,
  • Abdur Rasool 1 , 2 ,
  • Hui Chen 1 , 2 ,
  • Wenyin Liu 3 ,
  • Qiang Qu 1 &
  • Yang Wang 4  

Scientific Reports volume 12, Article number: 8842 (2022)


  • Computer science
  • Information technology
  • Scientific data

Today's growing number of phishing websites poses a significant threat because such sites are extremely hard to detect. They rely on internet users mistaking them for genuine websites and revealing private information, such as login IDs, passwords, credit card numbers, etc., without noticing. This paper proposes a new approach to the anti-phishing problem. The new features of this approach are the URL character sequence, which requires no prior phishing knowledge, various hyperlink information, and the textual content of the webpage; these are combined and used to train an XGBoost classifier. One of the major contributions of this paper is the selection of new features that are capable of detecting zero-hour (0-h) attacks and that do not depend on any third-party services. In particular, we extract character-level Term Frequency-Inverse Document Frequency (TF-IDF) features from the noisy parts of the HTML and from the plaintext of the given webpage. Moreover, our proposed hyperlink features capture the relationship between the content and the URL of a webpage. Due to the absence of large publicly available phishing data sets, we created our own data set of 60,252 webpages to validate the proposed solution. This data set contains 32,972 benign webpages and 27,280 phishing webpages. For evaluation, the performance of each category of the proposed feature set is assessed, and various classification algorithms are employed. The empirical results show that the proposed features are individually valuable for phishing detection, and that integrating all of them further improves the detection accuracy. The proposed approach achieved an accuracy of 96.76% with only a 1.39% false-positive rate on our dataset, and an accuracy of 98.48% with a 2.09% false-positive rate on a benchmark dataset, which outperforms the existing baseline approaches.


Introduction

Phishing offenses are increasing, resulting in billions of dollars in losses 1 . In these attacks, users enter their critical information (i.e., credit card details, passwords, etc.) into a forged website that appears to be legitimate. Software-as-a-Service (SaaS) and webmail sites are the most common targets of phishing 2 . The phisher builds websites that look very similar to the benign ones. The link to the phishing website is then sent to millions of internet users via email and other communication media. These types of cyber-attacks are usually initiated through emails, instant messages, or phone calls 3 . The aim of a phishing attack is not only to steal the victim's identity; it can also be carried out to spread other types of malware such as ransomware, to exploit system weaknesses, or to obtain monetary profit 4 . According to the Anti-Phishing Working Group (APWG) report for the 3rd quarter of 2020, the number of phishing attacks has grown since March, and 28,093 unique phishing sites were detected between July and September 2 . The average amount demanded in wire transfer Business E-mail Compromise (BEC) attacks was $48,000 in the third quarter, down from $80,000 in the second quarter and $54,000 in the first.

Detecting and preventing phishing offenses is a significant challenge for researchers because phishers design their attacks to bypass existing anti-phishing techniques. Moreover, phishers can target even educated and experienced users by using new phishing scams. Thus, software-based phishing detection techniques are preferred for fighting phishing attacks. The most common methods for detecting phishing attacks are blacklists/whitelists 5 , natural language processing 6 , visual similarity 7 , rules 8 , machine learning techniques 9 , 10 , etc. Techniques based on blacklists/whitelists fail to detect unlisted phishing sites (i.e., 0-h attacks), and they also fail when a blacklisted URL is encountered with minor changes. In machine learning based techniques, a classification model is trained using various heuristic features (i.e., URL, webpage content, website traffic, search engine, WHOIS record, and Page Rank) in order to improve detection efficiency. However, these heuristic features are not guaranteed to be present in all phishing websites and might also be present in benign websites, which may cause classification errors. Moreover, some of the heuristic features are hard to access and depend on third parties. Some third-party services (i.e., page rank, search engine indexing, WHOIS, etc.) may not be sufficient to identify phishing websites hosted on hacked servers; such websites are inaccurately identified as benign because they appear in search results. Websites hosted on compromised servers are usually more than a day old, unlike other phishing websites, which are only a few hours old. Also, these services may inaccurately identify a new benign website as a phishing site because of its low domain age. The visual similarity-based heuristic techniques compare the new website with the pre-stored signature of the website. 
The website’s visual signature includes screenshots, font styles, images, page layouts, logos, etc. Thus, these techniques cannot identify fresh phishing websites and generate a high false-negative rate (phishing classified as benign). The URL-based technique does not consider the HTML of the webpage and may misjudge some malicious websites hosted on free or compromised servers. Many existing approaches 11 , 12 , 13 extract hand-crafted URL-based features, e.g., number of dots, presence of special symbols such as “@”, “#”, and “–”, URL length, brand names in the URL, position of the top-level domain (TLD), whether the hostname is an IP address, presence of multiple TLDs, etc. However, extracting URL features manually remains a hurdle, because it requires human effort, time, and extra maintenance costs. Detecting and preventing phishing offenses is a major challenge for researchers because scammers carry out these offenses in ways that evade current anti-phishing methods. Hence, the use of hybrid methods rather than a single approach is highly recommended by network security managers.
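
To illustrate the kind of hand-crafted URL features listed above, the following minimal sketch extracts a few of them with the Python standard library. It is not taken from any of the cited approaches; the feature names are hypothetical, and the hyphen check uses the ASCII "-" character.

```python
import ipaddress
from urllib.parse import urlsplit


def handcrafted_url_features(url: str) -> dict:
    """Extract a few illustrative hand-crafted features from a URL string."""
    host = urlsplit(url).hostname or ""
    try:
        ipaddress.ip_address(host)  # raises ValueError unless host is an IP literal
        host_is_ip = True
    except ValueError:
        host_is_ip = False
    return {
        "url_length": len(url),
        "num_dots": url.count("."),
        "has_at": "@" in url,
        "has_hash": "#" in url,
        "has_hyphen": "-" in url,
        "host_is_ip": host_is_ip,
    }
```

Each new heuristic of this kind has to be designed, coded and maintained by hand, which is the maintenance cost that the paragraph above refers to.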

This paper provides an efficient solution for phishing detection that extracts features from a website's URL and HTML source code. Specifically, we propose a hybrid feature set including URL character sequence features that require no expert knowledge, various hyperlink information, and features based on plaintext and noisy HTML data within the HTML source code. These features are then used to create the feature vector required for training the proposed approach with an XGBoost classifier. Extensive experiments show that the proposed anti-phishing approach attains competitive performance on a real dataset in terms of different evaluation statistics.

Our anti-phishing approach has been designed to meet the following requirements.

High detection efficiency: To provide high detection efficiency, incorrect classification of benign sites as phishing (false-positive) should be minimal and correct classification of phishing sites (true-positive) should be high.

Real-time detection: The prediction of the phishing detection approach must be provided before exposing the user's personal information on the phishing website.

Target independent: Due to the features extracted from both URL and HTML the proposed approach can detect new phishing websites targeting any benign website (zero-day attack).

Third-party independent: The feature set defined in our work is lightweight and client-side adaptable, and does not rely on third-party services such as blacklist/whitelist, Domain Name System (DNS) records, WHOIS record (domain age), search engine indexing, network traffic measures, etc. Though third-party services may raise the effectiveness of the detection approach, they might misclassify a benign website that is newly registered. Furthermore, the DNS database and domain age record may be poisoned and lead to false negative results (phishing classified as benign).
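
The detection-efficiency requirement above is expressed in terms of false positives and true positives. As a reminder of how these rates follow from raw counts, here is a minimal sketch using the standard definitions, with phishing as the positive class; the function name and the zero-division handling are choices made here for illustration.

```python
def detection_rates(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute accuracy, true-positive rate and false-positive rate from counts.

    tp: phishing pages flagged as phishing; fn: phishing pages missed;
    tn: benign pages passed as benign;      fp: benign pages flagged as phishing.
    """
    def ratio(numerator: int, denominator: int) -> float:
        # Return 0.0 rather than failing when a class is absent.
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "true_positive_rate": ratio(tp, tp + fn),
        "false_positive_rate": ratio(fp, fp + tn),
    }
```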

Hence, a light-weight technique is needed for phishing websites detection adaptable at client side. The major contributions in this paper are itemized as follows.

We propose a phishing detection approach, which extracts efficient features from the URL and HTML of the given webpage without relying on third-party services. Thus, it can be deployed at the client side and provides better privacy.

We proposed eight novel features including URL character sequence features (F1), textual content character level (F2), various hyperlink features (F3, F4, F5, F6, F7, and F14) along with seven existing features adopted from the literature.

We conducted extensive experiments using various machine learning algorithms to measure the efficiency of the proposed features. Evaluation results show that the proposed approach accurately identifies legitimate websites, as it has a high true negative rate and a very low false positive rate.

We release a real phishing webpage detection dataset to be used by other researchers on this topic.

The rest of this paper is structured as follows: The "Related work" section first reviews the related works about phishing detection. Then the "Proposed approach" section presents an overview of our proposed solution and describes the proposed feature set used to train the machine learning algorithms. The "Experiments and result analysis" section introduces extensive experiments, including the experimental dataset and the evaluation of results. Furthermore, the "Discussion and limitation" section contains a discussion and limitations of the proposed approach. Finally, the "Conclusion" section concludes the paper and discusses future work.

Related work

This section provides an overview of the phishing detection techniques proposed in the literature. Anti-phishing methods are divided into two categories: raising user awareness so that users can distinguish the characteristics of phishing and benign webpages 14 , and using additional software. Software-based techniques are further categorized into list-based detection and machine learning-based detection. However, the problem of phishing is so sophisticated that there is no definitive solution that efficiently counters all threats; thus, multiple techniques are often dedicated to restraining particular phishing offenses.

List-based detection

List-based phishing detection methods use either a whitelist or a blacklist. A blacklist contains suspicious domains, URLs, and IP addresses, and is used to check whether a URL is fraudulent. Conversely, a whitelist is a list of legitimate domains, URLs, and IP addresses used to validate a suspected URL. Wang et al. 15 , Jain and Gupta 5 and Han et al. 16 use whitelist-based methods for the detection of suspected URLs. Blacklist-based methods are widely used in openly available anti-phishing toolbars, such as Google Safe Browsing, which maintains a blacklist of URLs and warns users once a URL is considered to be phishing. Prakash et al. 17 proposed a technique to predict phishing URLs called Phishnet. In this technique, phishing URLs are identified from the existing blacklisted URLs using the directory structure, equivalent IP address, and brand name. Felegyhazi et al. 18 developed a method that compares the domain name and name server information of new suspicious URLs to the information of blacklisted URLs for the classification process. Sheng et al. 19 demonstrated that forged domains were added to the blacklist only after a considerable amount of time, and that approximately 50–80% of the forged domains were appended after the attack had been carried out. Since thousands of deceptive websites are launched every day, the blacklist needs to be updated periodically from its source. Thus, machine learning-based detection techniques are more efficient in dealing with phishing offenses.
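
The weakness of list-based detection can be shown with a toy example. The sketch below is not the mechanism of any real blacklist service; the list contents, helper names and URLs (which use the reserved ".example" domain) are hypothetical. An exact-match URL blacklist misses a trivially modified URL, and a host-level list catches that variant but still misses a freshly registered host.

```python
from urllib.parse import urlsplit


def url_is_listed(url: str, listed_urls: set[str]) -> bool:
    """Exact-match lookup: any change to the URL string evades it."""
    return url in listed_urls


def host_is_listed(url: str, listed_hosts: set[str]) -> bool:
    """Host-level lookup: robust to path and query changes, not to new hosts."""
    host = urlsplit(url).hostname
    return host is not None and host in listed_hosts


# Hypothetical list contents, for illustration only.
listed_urls = {"http://phish.example/login"}
listed_hosts = {"phish.example"}
variant = "http://phish.example/login?session=1"
```

With these definitions, the variant URL is not found by the exact-match lookup but is found by the host-level one, while a URL on a new host such as "phish2.example" is found by neither.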

Machine learning-based detection

Data mining techniques have provided outstanding performance in many applications, e.g., data security and privacy 20 , game theory 21 , blockchain systems 22 , healthcare 23 , etc. With the recent development of phishing detection methods, various machine learning-based techniques have also been employed 6 , 9 , 10 , 13 to investigate the legitimacy of websites. The effectiveness of these methods relies on the feature collection, the training data, and the classification algorithm. The features are extracted from different sources, e.g., URL, webpage content, third-party services, etc. However, some of the heuristic features are hard to access and time-consuming to obtain, so some machine learning approaches require heavy computation to extract them.

Jain and Gupta 24 proposed an anti-phishing approach that extracts features from the URL and source code of the webpage and does not rely on any third-party services. Although the proposed approach attained high accuracy in detecting phishing webpages, it used a limited dataset (2141 phishing and 1918 legitimate webpages). The same authors 9 present a phishing detection method that can identify phishing attacks by analyzing the hyperlinks extracted from the HTML of the webpage. The proposed method is a client-side and language-independent solution. However, it depends entirely on the HTML of the webpage and may incorrectly classify phishing webpages if the attacker changes all webpage resource references (i.e., JavaScript, CSS, images, etc.). Rao and Pais 25 proposed a two-level anti-phishing technique called BlackPhish. At the first level, a blacklist of signatures is created using visual similarity based features (i.e., file names, paths, and screenshots) rather than a blacklist of URLs. At the second level, heuristic features are extracted from the URL and HTML to identify the phishing websites that evade the first-level filter. However, legitimate websites always undergo both levels of filtering. In another study 26 , the authors used a search engine-based mechanism as the first level of authentication of the webpage. At the second level, various hyperlinks within the HTML of the website are processed to detect phishing websites. Although the use of search engine-based techniques increases the number of legitimate websites correctly identified as legitimate, it also increases the number of legitimate websites incorrectly identified as phishing when newly created authentic websites are not found in the top results of the search engine. Search-based approaches assume that a genuine website appears in the top search results.

In a recent study, Rao et al. 27 proposed a new phishing website detection method with word embeddings extracted from the plain text and domain-specific text of the HTML source code. They implemented different word embeddings to evaluate their model using ensemble and multimodal techniques. However, the proposed method is entirely dependent on plain text and domain-specific text, and may fail when the text is replaced with images. Some researchers have tried to identify phishing attacks by extracting different hyperlink relationships from webpages. Guo et al. 28 proposed a phishing webpage detection approach which they called HinPhish. The approach establishes a heterogeneous information network (HIN) based on domain nodes and loading resources nodes and establishes three relationships between the four hyperlink types: external link, empty link, internal link and relative link. Then, they applied an authority ranking algorithm to calculate the effect of different relationships and obtain a quantitative score for each node.
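
The four hyperlink types named above can be counted with a few lines of standard-library Python, as sketched below. This is not the HinPhish implementation: the source only names the categories, so the classification rules used here (for instance treating "#" and "javascript:" targets as empty links) are assumptions, and only anchor tags are inspected.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # A missing or valueless href is recorded as an empty string.
            self.hrefs.append(dict(attrs).get("href") or "")


def classify_link(href: str, page_host: str) -> str:
    """Assumed rules: empty, relative (no host), internal (same host), external."""
    href = href.strip()
    if href in ("", "#") or href.lower().startswith("javascript:"):
        return "empty"
    host = urlsplit(href).hostname
    if host is None:
        return "relative"
    return "internal" if host == page_host else "external"


def count_links(html: str, page_host: str) -> Counter:
    """Count the links of each type found in the given HTML."""
    parser = LinkCollector()
    parser.feed(html)
    return Counter(classify_link(h, page_host) for h in parser.hrefs)
```

A page whose links are mostly empty or point to a different host than the one serving it is the kind of signal that hyperlink-based detectors try to exploit.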

In the work of Sahingoz et al. 6, a distributed representation of the words within a URL is adopted, and seven machine learning classifiers are then employed to decide whether a suspicious URL is a phishing website. Rao et al. 13 proposed an anti-phishing technique called CatchPhish. They extracted hand-crafted and Term Frequency-Inverse Document Frequency (TF-IDF) features from URLs and trained a random forest classifier on them. Although these methods have shown satisfactory performance, they suffer from the following restrictions: (1) they cannot handle unobserved characters, because URLs usually contain meaningless and unknown words that are not in the training set; (2) they do not consider the content of the website. Accordingly, some URLs that differ from others but imitate legitimate sites may not be identified from the URL string alone. Because these methods rely only on URL features, they are not sufficient to detect phishing websites. Our approach addresses this gap by using three types of features to detect phishing websites more effectively. Specifically, we propose a hybrid feature set consisting of the URL character sequence, hyperlink information, and textual content-based features.

Deep learning methods such as the Convolutional Neural Network (CNN), Deep Neural Network (DNN), Recurrent Neural Network (RNN), and Recurrent Convolutional Neural Network (RCNN) have been used for phishing detection, motivated by the success of these techniques in Natural Language Processing (NLP). However, deep learning methods are not widely employed in phishing detection because of their long training time. Aljofey et al. 3 proposed a phishing detection approach based on a character-level convolutional neural network applied to the URL. The approach was compared against various machine learning and deep learning algorithms and different types of features, such as TF-IDF characters, count vectors, and manually crafted features. Le et al. 29 presented URLNet, a method to detect phishing webpages from the URL. They extract character-level and word-level features from URL strings and employ CNNs for training and testing. Chatterjee and Namin 30 introduced a phishing detection technique based on deep reinforcement learning to identify phishing URLs. They applied their model to a balanced, labeled dataset of benign and phishing URLs, extracting 14 hand-crafted features from the given URLs to train it. More recently, Xiao et al. 31 proposed a phishing website detection approach named CNN–MHSA. A CNN extracts character features from URLs, while a multi-head self-attention (MHSA) mechanism calculates the corresponding weights for the features learned by the CNN. Zheng et al. 32 proposed the Highway Deep Pyramid Neural Network (HDP-CNN), a deep convolutional network that integrates character-level and word-level embedding representations to decide whether a given URL is phishing or legitimate. Although these approaches have shown valuable performance, they might misclassify phishing websites hosted on compromised servers, since the features are extracted only from the URL of the website.

The features extracted in some previous studies are hand-crafted and require additional effort, since they need to be reset according to the dataset, which may lead to overfitting of anti-phishing solutions. Motivated by the above-mentioned studies, we propose our approach, in which character sequence features are extracted from the URL without manual intervention. Moreover, our approach employs the noisy data of the HTML, the plaintext, and the hyperlink information of the website, with the benefit of identifying new phishing websites. Table 1 presents a detailed comparison of existing machine learning-based phishing detection approaches.

Proposed approach

Our approach extracts and analyzes different features of suspected webpages for effective identification of large-scale phishing offenses. The main contribution of this paper is the combined use of these feature sets. To improve the detection accuracy of phishing webpages, we propose eight new features. Our proposed features capture the relationship between the URL of the webpage and the webpage content.

System architecture

The overall architecture of the proposed approach is divided into three phases. In the first phase, the HTML source code is crawled and all the essential features are extracted. The second phase applies feature vectorization to generate a feature vector for each webpage. The third phase determines whether the given webpage is phishing. Figure 1 shows the system structure of the proposed approach. Details of each phase are described as follows.

Figure 1. General architecture of the proposed approach.

Feature generation

The features are generated in this component. Our features are based on the URL and HTML source code of the webpage. A Document Object Model (DOM) tree of the webpage is used to extract the hyperlink and textual content features using a web crawler automatically. The features of our approach are categorized into four groups as depicted in Table 2 . In particular, features F1–F7, and F14 are new and proposed by us; Features F8–F13, and F15 are taken from other approaches 9 , 11 , 12 , 24 , 33 but we adjusted them for better results. Moreover, the observational method and strategy regarding the interpretation of these features are applied differently in our approach. A detailed explanation of the proposed features is provided in the feature extraction section of this paper.

Feature vectorization

After the features are extracted, we apply feature vectorization to generate a feature vector for each webpage and so create a labeled dataset. We integrate the URL character sequence features with the textual content TF-IDF features and the hyperlink information features to create the feature vector required for training the proposed approach. The hyperlink features form a 13-dimensional feature vector \(F_{H} = \left\langle {f_{3} ,f_{4} ,f_{5} , \ldots ,f_{{15}} } \right\rangle\), and the URL character sequence features form a 200-dimensional feature vector \(F_{U} = \left\langle {c_{1} ,c_{2} ,c_{3} , \ldots ,c_{{200}} } \right\rangle\). We set a fixed URL length of 200: if the URL is longer than 200 characters, the additional part is ignored; otherwise, the remainder of the URL string is filled with 0. This value was chosen from the distribution of URL lengths within our dataset, in which most URLs are shorter than 200 characters. A vector that is too long may contain useless information, whereas one that is too short may contain insufficient features. The character-level TF-IDF features form a \(D\)-dimensional feature vector \(F_{T} = \left\langle {t_{1} ,t_{2} ,t_{3} , \ldots ,t_{D} } \right\rangle\), where \(D\) is the size of the dictionary computed from the textual content corpus. In our experiments the dictionary size was \(D\) = 20,332, and it grows as the corpus grows. The three feature vectors are combined to generate the final feature vector \(F_{V} = F_{T} \cup F_{U} \cup F_{H} = \left\langle {t_{1} ,t_{2} , \ldots ,t_{D} ,c_{1} ,c_{2} , \ldots ,c_{{200}} ,f_{3} ,f_{4} ,f_{5} , \ldots ,f_{{15}} } \right\rangle\), which is fed as input to the machine learning algorithms to classify the website.
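The combination step can be illustrated with a minimal sketch. It assumes that the union symbol denotes plain concatenation in the order text, URL, hyperlink, and it uses dense Python lists for readability (a real pipeline would more likely keep the TF-IDF block sparse). The function name `build_feature_vector` is hypothetical.

```python
D = 20332         # dictionary size reported for the textual content corpus
URL_LEN = 200     # fixed URL length
N_HYPERLINK = 13  # hyperlink features f3 ... f15


def build_feature_vector(f_t, f_u, f_h):
    """Concatenate the three feature blocks in the order F_T, F_U, F_H."""
    if len(f_t) != D or len(f_u) != URL_LEN or len(f_h) != N_HYPERLINK:
        raise ValueError("unexpected feature block length")
    return list(f_t) + list(f_u) + list(f_h)


# Placeholder blocks filled with zeros, only to show the resulting dimension.
f_v = build_feature_vector([0.0] * D, [0] * URL_LEN, [0.0] * N_HYPERLINK)
print(len(f_v))  # 20545 = 20332 + 200 + 13
```

With the reported sizes, each webpage is therefore described by 20,545 values.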

Detection module

The Detection phase includes building a strong classifier by using the boosting method, XGBoost classifier. Boosting integrates many weak and relatively accurate classifiers to build a strong and therefore robust classifier for detecting phishing offences. Boosting also helps to combine diverse features resulting in improved classification performance 34 . Here, XGBoost classifier is employed on integrated feature sets of URL character sequence \({F}_{U}\) , various hyperlinks information \({F}_{H}\) , login form features \({F}_{L}\) , and textual content-based features \({F}_{T}\) to build a strong classifier for phishing detection. In the training phase, XGBoost classifier is trained using the feature vector \(({F}_{U}\cup {F}_{H} \cup {F}_{L} \cup {F}_{T})\) collected from each record in the training dataset. At the testing phase, the classifier detects whether a particular website is a malicious website or not. The detailed description is shown in Fig.  2 .

Figure 2. Phishing detection algorithm.

Features extraction

Because of the limitations of the search engine and third-party methods discussed in the literature, our approach extracts its features on the client side. We introduce eleven hyperlink features (F3–F13), two login form features (F14 and F15), character-level TF-IDF features (F2), and URL character sequence features (F1). All these features are discussed in the following subsections.

URL character sequence features (F1)

URL stands for Uniform Resource Locator. It provides the location of resources on the web such as images, files, hypertext, and video. Each URL starts with the protocol (e.g., http, https, or ftp) used to access the requested resource. In this part, we extract character sequence features from the URL. We employ the method used in 35 to process the URL at the character level, where more information is contained. Phishers imitate the URLs of legitimate websites by changing characters in ways that are hard to notice, e.g., “ www.icbc.com ” becomes “ www.1cbc.com ”. Character-level URL processing is a solution to the out-of-vocabulary problem. Character-level sequences capture substantial information from specific groups of characters that appear together, which could be a symptom of phishing. In general, a URL is a string of characters or words, some of which have little semantic meaning. Character sequences help find this sensitive information and improve the efficiency of phishing URL detection. During the learning task, machine learning techniques can be applied directly to the extracted character sequence features without expert intervention. The main steps in generating character sequences are: preparing the character vocabulary; creating a tokenizer object using the Keras preprocessing package ( https://Keras.io ) to process URLs at the character level, adding a “UNK” token to the vocabulary after the maximum value of the character dictionary; transforming the text of the URLs into sequences of tokens; and padding the sequences to ensure equal-length vectors. The URL feature extraction is described in Algorithm 1.

figure a
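The character sequence steps described above can be sketched in plain Python. This is an illustration only, not a reproduction of Algorithm 1: it does not use the Keras tokenizer, and the alphabet, the lowercasing of URLs, and the names `build_vocab` and `encode_url` are assumptions.

```python
import string

MAX_LEN = 200  # fixed URL length used for the character sequence vector
PAD = 0        # value used to fill the remainder of short URLs
UNK = "UNK"    # token for characters outside the vocabulary


def build_vocab(alphabet=None):
    """Map each known character to a positive integer; UNK gets the next index."""
    if alphabet is None:
        # Assumed alphabet: lowercase letters, digits, and ASCII punctuation.
        alphabet = string.ascii_lowercase + string.digits + string.punctuation
    vocab = {ch: i + 1 for i, ch in enumerate(alphabet)}  # 0 is reserved for padding
    vocab[UNK] = max(vocab.values()) + 1
    return vocab


def encode_url(url, vocab, max_len=MAX_LEN):
    """Turn a URL into a fixed-length list of integer tokens."""
    tokens = [vocab.get(ch, vocab[UNK]) for ch in url.lower()]
    tokens = tokens[:max_len]                        # ignore characters beyond max_len
    return tokens + [PAD] * (max_len - len(tokens))  # zero-fill the remainder


vocab = build_vocab()
vec = encode_url("http://www.1cbc.com/login", vocab)
```

Because every character has its own token, the look-alike URLs “www.icbc.com” and “www.1cbc.com” produce different vectors even though neither string needs to appear in the training vocabulary as a word.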

HTML features

The webpage source code is the code behind any webpage. For websites, this code can be viewed by anyone using various tools, even in the web browser itself. In this section, we extract the textual and hyperlink features present in the HTML source code of the webpage.

Textual content-based features (F2)

TF-IDF stands for Term Frequency-Inverse Document Frequency. The TF-IDF weight is a statistical measure of the importance of a term in a corpus of documents 36. TF-IDF vectors can be created at various levels of input tokens (words, characters, n-grams) 37. The TF-IDF technique has been used in many approaches to detect phishing webpages by inspecting URLs 13, obtaining indirectly associated links 38, identifying the target website 11, and checking the validity of a suspected website 39. Although the TF-IDF technique extracts salient keywords from the text content of the webpage, it has some limitations. One is that it fails when the extracted keywords are meaningless, misspelled, skipped, or replaced with images. Since our approach extracts both plaintext and noisy data (i.e., attribute values for div, h1, h2, body and form tags) from the given webpage using the BeautifulSoup parser, we apply the TF-IDF technique at the character level with max features set to 25,000. To obtain valid textual information, extra portions of the webpage (i.e., JavaScript code, CSS code, punctuation symbols, and numbers) are removed through regular expressions, and Natural Language Processing packages ( http://www.nltk.org/nltk_data/ ) are applied for sentence segmentation, word tokenization, text lemmatization and stemming, as shown in Fig. 3.

Figure 3. The process of generating text features.

Phishers usually mimic the textual content of the target website to trick the user. Moreover, phishers may alter or override some texts (i.e., title, copyright, metadata, etc.) and tags in phishing webpages to avoid revealing the actual identity of the webpage. However, tag attributes stay the same to preserve the visual similarity between the phishing site and the targeted site, which uses the same style and theme as the benign webpage. Therefore, it is necessary to extract the text features (plaintext and noisy part of the HTML) of the webpage. The basis of this step is to extract a vectorized representation of the text and the effective webpage content. A TF-IDF object is employed to vectorize the text of the webpage. The text vector generation algorithm is detailed as follows.

figure b
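The cleaning and vectorization steps can be illustrated with a small standard-library sketch. It is not the algorithm in the figure: it covers only the plaintext (not the tag attribute values), skips lemmatization and stemming, uses character trigrams, and applies a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, without normalization. All of these choices and the function names are assumptions.

```python
import math
import re
from collections import Counter


def clean_html(html):
    """Strip script/style blocks, tags, punctuation, and digits from HTML."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)  # JavaScript and CSS
    text = re.sub(r"(?s)<[^>]+>", " ", html)                         # remaining tags
    text = re.sub(r"[^A-Za-z\s]", " ", text)                         # punctuation, digits
    return re.sub(r"\s+", " ", text).strip().lower()


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(docs, n=3):
    """Character-level TF-IDF: raw n-gram counts weighted by a smoothed IDF."""
    counts = [Counter(char_ngrams(d, n)) for d in docs]
    vocab = sorted(set().union(*counts))
    df = Counter(g for c in counts for g in c)  # document frequency per n-gram
    n_docs = len(docs)
    idf = {g: math.log((1 + n_docs) / (1 + df[g])) + 1 for g in vocab}
    return vocab, [[c[g] * idf[g] for g in vocab] for c in counts]


page = ("<html><head><style>p{color:red}</style><script>var a=1;</script></head>"
        "<body><h1>Sign in 2 PayPal!</h1></body></html>")
text = clean_html(page)
vocab, vectors = tfidf_vectors([text, "sign up today"])
```

The length of `vocab` plays the role of the dictionary size \(D\); a cap such as the 25,000 max features mentioned above would keep only the most frequent n-grams.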

Script, CSS, img, and anchor files (F3, F4, F5, and F6)

External JavaScript and Cascading Style Sheets (CSS) files are separate files that are accessed through a link within the head section of a webpage. JavaScript, CSS, image, and other files may contain malicious code that runs while a webpage loads or when a specific link is clicked. Moreover, phishing websites have fragile and unprofessional content as the number of hyperlinks referring to a different domain name increases. The <img> and <script> tags that have the "src" attribute can be used to extract the images and external JavaScript files in the website. Similarly, CSS and anchor files are found in the "href" attribute of <link> and <a> tags. In Eqs. ( 1 – 4 ), we calculate the ratio of img and script tags that have the “src” attribute, and of link and anchor tags that have the “href” attribute, to the total number of hyperlinks available in a webpage. These tags usually link to the image, JavaScript, anchor, and CSS files required for a website.

where \({\text{F}}_{\text{Script}\_\text{files}}\) , \({\text{F}}_{\text{CSS}\_\text{files}}\) , \({\text{F}}_{\text{Img}\_\text{files}}\) , \({\text{F}}_{\text{a}\_\text{files}}\) are the numbers of Javascript, CSS, image, anchor files existing in a webpage, and \({\text{F}}_{\text{Total}}\) is the total hyperlinks available in a webpage.
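The four ratios can be sketched with Python's built-in HTML parser. The sketch assumes that the total number of hyperlinks is the sum of the four counted tag types and that every ratio is 0 when the page has no hyperlinks; neither detail is confirmed by the text, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class LinkCounter(HTMLParser):
    """Count script/img tags with "src" and link/a tags with "href"."""

    LINK_ATTR = {"script": "src", "img": "src", "link": "href", "a": "href"}

    def __init__(self):
        super().__init__()
        self.counts = {tag: 0 for tag in self.LINK_ATTR}

    def handle_starttag(self, tag, attrs):
        attr = self.LINK_ATTR.get(tag)
        if attr is not None and attr in dict(attrs):
            self.counts[tag] += 1


def file_ratio_features(html):
    """Return the share of script, img, link (CSS), and anchor hyperlinks."""
    parser = LinkCounter()
    parser.feed(html)
    total = sum(parser.counts.values())
    # Assumption: all ratios are 0.0 for a page without any hyperlink.
    return {tag: (n / total if total else 0.0) for tag, n in parser.counts.items()}


sample = ('<html><head><link href="s.css"><script src="a.js"></script></head>'
          '<body><img src="i.png"><a href="/x">x</a><a>no href</a></body></html>')
features = file_ratio_features(sample)
```

In the sample page each of the four tag types contributes one hyperlink, so each ratio is 0.25; the anchor without "href" is not counted.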

Empty hyperlinks (F7 and F8)

In the empty hyperlink, the “href” or “src” attributes of anchor, link, script, or img tags do not contain any URL. The empty link returns on the same webpage again when the user clicks on it. A benign website contains many webpages; thus, the scammer does not place any values in hyperlinks to make a phishing website behave like the benign website, and the hyperlinks look active on the phishing website. For example, <a href = “#”>, <a href = “#content”> and <a href = “javascript:void(0);”> HTML coding are used to design null hyperlinks 24 . To establish the empty hyperlink features, we define the rate of empty hyperlinks to the total number of hyperlinks available in a webpage, and the rate of anchor tag without “href” attribute to the total number of hyperlinks in a webpage. Following formulas are used to compute empty hyperlink features

where \({\text{F}}_{\text{a}\_\text{null}}\) and \({\text{F}}_{\text{null}}\) are the numbers of anchor tags without href attribute, and null hyperlinks in a webpage.
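A minimal sketch of the two empty hyperlink ratios follows. The set of patterns treated as empty is taken from the examples in the text (`#`, `#content`, `javascript:void(0);`) plus the missing or blank value; treating every fragment-only link as empty, returning 0 for a page without hyperlinks, and the function names are assumptions.

```python
def is_empty_link(value):
    """Return True for a link value that does not point to another resource."""
    if value is None:
        return True
    v = value.strip().lower()
    return v == "" or v.startswith("#") or v.startswith("javascript:void(0)")


def empty_link_features(links, anchors_without_href):
    """links: href/src values collected from a, link, script, and img tags."""
    total = len(links)
    if total == 0:
        # Assumption: both ratios are 0.0 when the page has no hyperlinks.
        return {"anchor_without_href_ratio": 0.0, "null_link_ratio": 0.0}
    return {
        "anchor_without_href_ratio": anchors_without_href / total,
        "null_link_ratio": sum(is_empty_link(v) for v in links) / total,
    }


links = ["#", "#content", "javascript:void(0);", "https://example.com/a", "/about"]
ratios = empty_link_features(links, anchors_without_href=1)
```

For this example three of the five links are null, giving a null link ratio of 0.6.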

Total hyperlinks feature (F9)

Phishing websites usually contain minimal pages as compared to benign websites. Furthermore, sometimes the phishing webpage does not contain any hyperlink because the phishers usually only create a login page. Equation ( 7 ) computes the number of hyperlinks in a webpage by extracting the hyperlinks from an anchor, link, script, and img tags in the HTML source code.

Internal and external hyperlinks (F10, F11, and F12)

In an external hyperlink, the base domain name differs from the website domain name, whereas in an internal hyperlink the base domain name is the same as the website domain name. Phishing websites may contain many external hyperlinks that point to the target websites, because cybercriminals commonly copy the HTML code from the targeted legitimate websites to create their phishing websites. Most hyperlinks in a benign website share the same base domain name, whereas many hyperlinks in a phishing site may include the domain of the corresponding benign website. In our approach, the internal and external hyperlinks are extracted from the “src” attribute of img, script, and frame tags, the “action” attribute of the form tag, and the “href” attribute of the anchor and link tags. We compute the ratio of internal hyperlinks to the total links available in a webpage (Eq. 8 ) as the internal hyperlink feature, and the ratio of external hyperlinks to the total links (Eq. 9 ) as the external hyperlink feature. Moreover, as the external/internal hyperlink feature, we compute the ratio of external hyperlinks to internal hyperlinks (Eq. 10 ). Some previous studies 5 , 9 , 24 that used these features for classification applied a fixed threshold to detect suspected websites. For example, if the ratio of external hyperlinks to total links is greater than 0.5, the website is flagged as phishing. However, using a fixed threshold as the detection parameter may cause classification errors.

where \({\text{F}}_{\text{Internal}}\) , \({\text{F}}_{\text{External}}\) , and \({\text{F}}_{\text{Total}}\) are the numbers of internal, external, and total hyperlinks in a website.
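The three ratios can be sketched with `urllib.parse`. The base domain here is approximated by the last two labels of the host name, which is a simplifying assumption (it ignores multi-part suffixes such as `co.uk`); the zero-denominator conventions and the function names are also assumptions.

```python
from urllib.parse import urljoin, urlparse


def base_domain(url):
    """Naive base domain: the last two labels of the host name."""
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def link_ratio_features(page_url, links):
    """Compute internal, external, and external-to-internal hyperlink ratios."""
    site = base_domain(page_url)
    internal = external = 0
    for link in links:
        absolute = urljoin(page_url, link)  # resolve relative links against the page
        if base_domain(absolute) == site:
            internal += 1
        else:
            external += 1
    total = internal + external
    return {
        "internal_ratio": internal / total if total else 0.0,
        "external_ratio": external / total if total else 0.0,
        # Assumption: the ratio is 0.0 when there are no internal links.
        "external_to_internal": external / internal if internal else 0.0,
    }


page = "http://paypa1-login.example.net/index.html"
links = ["/css/a.css", "https://www.paypal.com/logo.png",
         "https://www.paypal.com/help", "https://cdn.paypal.com/x.js"]
ratios = link_ratio_features(page, links)
```

In this made-up page, three of the four links point to the imitated brand's domain, so the external ratio is 0.75 and the external-to-internal ratio is 3.0.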

Error in hyperlinks (F13)

Phishers sometimes add hyperlinks to the fake website that are dead or broken. For the hyperlink error feature, we check whether each hyperlink in the website is a valid URL. We do not consider the 403 and 404 error response codes of hyperlinks, because fetching the response code of each link over the internet would take too much time. The hyperlink error is defined as the total number of invalid links divided by the total number of links, as represented in Eq. ( 11 )

where \({\text{F}}_{\text{Error}}\) is the total invalid hyperlinks.

Login form features (F14 and F15)

In a fraudulent website, the common trick for acquiring the user's personal information is to include a login form. In a benign webpage, the action attribute of the login form commonly includes a hyperlink that has the same base domain as appears in the browser address bar 24. In phishing websites, however, the form action attribute includes a URL that has a different base domain (external link), an empty link, or an invalid URL (Eq. 13 ). The suspicious form feature (Eq. 14 ) is defined as the total number of suspicious forms S divided by the total number of forms available in a webpage (Eq. 12 )

where \({\text{F}}_{\text{S}}\) and \({\text{L}}_{\text{Total}}\) are the number of suspicious forms and total forms present in a webpage.
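The suspicious form ratio can be sketched as follows. What counts as an invalid URL is not spelled out in the text, so this sketch assumes that any resolved action without an http or https scheme and a host name is invalid; the two-label base domain heuristic and the function names are likewise assumptions.

```python
from urllib.parse import urljoin, urlparse


def _base_domain(url):
    """Naive base domain: the last two labels of the host name."""
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def is_suspicious_action(page_url, action):
    """Flag a form action that is empty, invalid, or on a different base domain."""
    if action is None or action.strip() in ("", "#"):
        return True  # missing or empty action
    target = urlparse(urljoin(page_url, action.strip()))
    if target.scheme not in ("http", "https") or not target.hostname:
        return True  # treated as "not a valid URL" (assumption)
    return _base_domain(target.geturl()) != _base_domain(page_url)


def suspicious_form_ratio(page_url, actions):
    """Share of forms on the page whose action attribute looks suspicious."""
    if not actions:
        return 0.0  # assumption: no forms means a ratio of 0.0
    return sum(is_suspicious_action(page_url, a) for a in actions) / len(actions)


page = "https://www.example.com/login"
actions = ["/auth", "http://evil.test/steal.php", "", "javascript:void(0)"]
ratio = suspicious_form_ratio(page, actions)
```

Only the first form posts back to the page's own domain, so three of the four forms are flagged and the ratio is 0.75.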

Figure 4 compares benign and phishing hyperlink features based on the average occurrence rate per feature within each website in our dataset. From the figure, we noticed that the ratios of external to internal hyperlinks and of null hyperlinks are higher in phishing websites than in benign websites, whereas benign sites contain more anchor files, internal hyperlinks, and total hyperlinks.

Figure 4. Distribution of hyperlink-based features in our data.

Classification algorithms

To measure the effectiveness of the proposed features, we trained our approach with various machine learning classifiers: eXtreme Gradient Boosting (XGBoost), Random Forest, Logistic Regression, Naïve Bayes, and an ensemble of Random Forest and AdaBoost classifiers. The aim of comparing different classifiers is to find the one best suited to our feature set. The scikit-learn package ( Scikit-learn.org ) is used to apply the machine learning classifiers, and Python is employed for feature extraction. The empirical results show that XGBoost outperformed the other classifiers. XGBoost is an ensemble classifier that turns weak learners into a robust one and suits our proposed feature set, which explains its high performance.

XGBoost (extreme gradient boosting) is a scalable machine learning system for tree boosting proposed by Chen and Guestrin 40. Suppose there are \(N\) websites in the dataset \(\left\{ {\left( {x_{i} ,y_{i} } \right)|i = 1,2,...,N} \right\}\), where \(x_{i} \in R^{d}\) is the feature vector extracted from the \(i\)-th website and \(y_{i} \in \{0, 1\}\) is the class label, such that \(y_{i} = 1\) if and only if the website is labelled as phishing. The final output \(f_{K} \left( x \right)\) of the model is as follows 41 , 46 :

where \(l\) is the training loss function and \(\Omega \left( {G_{k}} \right) = \gamma T + \frac{1}{2}\lambda \sum\limits_{t = 1}^{T} {\omega_{t}^{2} }\) is the regularization term. Since XGBoost uses additive training, all previous \(k-1\) base learners are fixed, and we assume that we are at step \(k\), which optimizes the function \(f_{k} \left( x \right)\). \(T\) is the number of leaf nodes in the base learner \(G_{k}\), \(\gamma\) is the complexity of each leaf, \(\lambda\) is a parameter that scales the penalty, and \(\omega_{t}\) is the output value at each final leaf node. Applying a Taylor expansion to the loss function at \(f_{k-1} \left( x \right)\) gives 41 :

where \(g_{i} = \frac{{\partial l\left( {y_{i} ,f_{k - 1} \left( {x_{i} } \right)} \right)}}{{\partial f_{k - 1} \left( {x_{i} } \right)}}\) and \(h_{i} = \frac{{\partial^{2} l\left( {y_{i} ,f_{k - 1} \left( {x_{i} } \right)} \right)}}{{\partial f_{k - 1} \left( {x_{i} } \right)^{2} }}\) are, respectively, the first and second derivatives of the loss function.
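To make \(g_{i}\) and \(h_{i}\) concrete, the sketch below evaluates them for the logistic loss, a common choice for binary classification with boosted trees. The text does not state which loss was used, so this choice is an assumption. For \(p = \sigma(f_{k-1}(x_i))\) the derivatives are \(g_i = p - y_i\) and \(h_i = p(1 - p)\), and with the regularization term given above the optimal output of a leaf is \(\omega^{*} = -\sum g_i / (\sum h_i + \lambda)\), as derived by Chen and Guestrin.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_hess(y, score):
    """g_i and h_i of the logistic loss w.r.t. the previous raw score f_{k-1}(x_i)."""
    p = sigmoid(score)
    return p - y, p * (1.0 - p)


def optimal_leaf_weight(samples, lam=1.0):
    """w* = -sum(g) / (sum(h) + lambda) over the (label, score) pairs in one leaf."""
    g_sum = sum(grad_hess(y, s)[0] for y, s in samples)
    h_sum = sum(grad_hess(y, s)[1] for y, s in samples)
    return -g_sum / (h_sum + lam)


# Three websites in one leaf, all with previous raw score 0 (probability 0.5):
# two labelled phishing (y = 1) and one benign (y = 0).
leaf = [(1, 0.0), (1, 0.0), (0, 0.0)]
w = optimal_leaf_weight(leaf, lam=1.0)
```

Here the gradients sum to -0.5 and the second derivatives to 0.75, so the leaf output is 0.5 / 1.75 ≈ 0.286, which pushes the raw score of these websites toward the phishing class.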

As an ensemble classifier, XGBoost suits our proposed feature set for the prediction of phishing websites. Moreover, XGBoost provides a number of advantages, including: (i) the ability to handle missing values in the training set, (ii) the ability to handle huge datasets that do not fit into memory, and (iii) the use of multiple CPU cores for faster computing. The websites are classified into two categories, phishing and benign, using a binary classifier. When a user requests a new site, the trained XGBoost classifier determines the validity of the webpage from the generated feature vector.

Experiments and result analysis

In this section, we describe the training and testing dataset, performance metrics, implementation details, and outcomes of our approach. The proposed features described in the “ Features extraction ” section are used to build a binary classifier, which classifies phishing and benign websites accurately.

We collected the dataset from two sources for our experiments. The benign webpages were collected in February 2020 from Stuff Gate 42, whereas the phishing webpages were collected from PhishTank 43 and had been validated between August 2016 and April 2020. Our dataset consists of 60,252 webpages and their HTML source code, of which 27,280 are phishing and 32,972 are benign. Table 3 provides the distribution of the benign and phishing instances. We use two datasets, where D1 is our dataset and D2 is a dataset used in the existing literature 6. A database management tool (pgAdmin) was used with Python to import and pre-process the data. The datasets were randomly split in an 80:20 ratio for training and testing, respectively.
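A random 80:20 split can be sketched as below. The seed, the function name, and the choice to round the test set size up are assumptions; rounding up does give 12,051 test webpages for D1, which matches the total in the confusion matrix reported later (5512 phishing plus 6539 benign).

```python
import math
import random


def split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle the indices 0..n-1 and cut off a test set of ceil(n * test_fraction)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n * test_fraction)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(60252)
print(len(train_idx), len(test_idx))  # 48201 12051
```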

Performance metrics

To measure the performance of the proposed anti-phishing approach, we used several statistical metrics: true-positive rate (TPR), true-negative rate (TNR), false-positive rate (FPR), false-negative rate (FNR), sensitivity or recall, accuracy (Acc), precision (Pre), F-Score, and AUC, which are presented in Table 4. \({N}_{B}\) and \({N}_{P}\) denote the total number of benign and phishing websites, respectively. \({N}_{B\to B}\) is the number of benign websites correctly marked as benign, \({N}_{B\to P}\) is the number of benign websites incorrectly marked as phishing, \({N}_{P\to P}\) is the number of phishing websites correctly marked as phishing, and \({N}_{P\to B}\) is the number of phishing websites incorrectly marked as benign. The receiver operating characteristic (ROC) curve and AUC are commonly used to evaluate a binary classifier. The horizontal coordinate of the ROC curve is the FPR, the probability that a benign website is misclassified as phishing; the vertical coordinate is the TPR, the probability that a phishing website is identified as phishing.

Evaluation of features

In this section, we evaluate the performance of our proposed features (URL and HTML). We implemented different machine learning (ML) classifiers to evaluate the features used in our approach. In Table 5, we extracted various text features, namely TF-IDF word level, TF-IDF n-gram level (n-gram length between 2 and 3), TF-IDF character level, count vectors (bag-of-words), word sequence vectors, Global Vectors (GloVe) pre-trained word embeddings, trained word embeddings, and character sequence vectors, and implemented various classifiers: XGBoost, Random Forest, Logistic Regression, Naïve Bayes, Deep Neural Networks (DNN), Convolutional Neural Networks (CNN), and Long Short-Term Memory (LSTM) networks. The aim of this experiment was to find the textual content features best suited to our data. The experimental results show that TF-IDF character-level features outperformed the other features in accuracy, precision, F-Score, recall, and AUC using the XGBoost and DNN classifiers. Hence, we used the TF-IDF character-level technique to generate the text features (F2) of the webpage. Figure 5 presents the performance of textual content-based features. As shown in the figure, text features can correctly filter a large share of phishing websites and achieved an accuracy of 88.82%.

Figure 5. Performance of textual content features.

Table 6 shows the experimental results with hyperlink features. The Random Forest classifier is superior to the other classifiers, with an accuracy of 82.27%, precision of 77.59%, F-Measure of 81.63%, recall of 86.10%, and AUC of 82.57%. The ensemble and XGBoost classifiers also attained good accuracy, 82.18% and 80.49%, respectively. Figure 6 presents the classification results of hyperlink-based features (F3–F15). As shown in the figure, hyperlink-based features correctly classify 79.04% of benign websites and 86.10% of phishing websites.

Figure 6. Performance of hyperlink based features.

In Table 7 , we integrated features of URL and HTML (hyperlink and text) using various classifiers to verify complementary behavior in phishing websites detection. From the empirical results, it is noticed that LR classifier has sufficient accuracy, precision, F-Score, AUC, and recall in terms of the HTML features. In contrast, NB classifier has good accuracy, precision, F-Score, AUC, and recall with respect to combining all the features. RF and ensemble classifiers achieved high accuracy, recall, F-Score, and AUC with respect to URL based features. XGBoost classifier outperformed the others with an accuracy of 96.76%, F-Score of 96.38%, AUC of 96.58% and recall of 94.56% with respect to combining all the features. It is observed that URL and HTML features are valuable in phishing detection. However, one type of feature is not suitable to identify all kinds of phishing webpages and does not result in high accuracy. Thus, we have combined all features to get more comprehensive features. The results on various classifiers of combined feature set are also shown in Fig.  7 . In Fig.  8 we compare the three feature sets in terms of accuracy, TNR, FPR, FNR, and TPR.

Figure 7. Test results of various classifiers with respect to combined features.

Figure 8. Performance of different feature combinations using XGBoost on dataset D1.

The confusion matrix is used to measure results where each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class (or vice versa). The confusion matrix of the proposed approach is created as represented in Table 8 . From the results, combining all kind of features together as an entity correctly identified 5212 out of 5512 phishing webpages and 6448 out of 6539 benign webpages and attained an accuracy of 96.76%. Our approach results in low false positive rate (i.e., less than 1.39% of benign webpages incorrectly classified as phishing), and high true positive rate (i.e., more than 94.56% of phishing webpages accurately classified). We have also tested our feature sets (URL and HTML) on the existing dataset D2. Since dataset D2 only contains legitimate and malicious URLs, we needed to extract the HTML source code features for these URLs. The results are given in Table 9 and Fig.  9 . From the results, it is noticed that combining all kinds of features had outperformed other feature sets with a significant accuracy of 98.48%, TPR of 99.04%, and FPR of 2.09%.
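The D1 figures quoted above can be re-derived from the four confusion matrix counts. The sketch uses the standard definitions of the metrics; the function and argument names are hypothetical, and the F-Score is taken to be the F1 score (harmonic mean of precision and recall).

```python
def metrics(n_pp, n_pb, n_bb, n_bp):
    """Metrics from counts: n_pp phishing->phishing, n_pb phishing->benign,
    n_bb benign->benign, n_bp benign->phishing."""
    tpr = n_pp / (n_pp + n_pb)  # recall / sensitivity
    tnr = n_bb / (n_bb + n_bp)
    precision = n_pp / (n_pp + n_bp)
    return {
        "TPR": tpr,
        "FNR": 1.0 - tpr,
        "TNR": tnr,
        "FPR": 1.0 - tnr,
        "accuracy": (n_pp + n_bb) / (n_pp + n_pb + n_bb + n_bp),
        "precision": precision,
        "f_score": 2 * precision * tpr / (precision + tpr),
    }


# Counts reported for the combined feature set on D1.
m = metrics(n_pp=5212, n_pb=5512 - 5212, n_bb=6448, n_bp=6539 - 6448)
for name, value in m.items():
    print(f"{name}: {100 * value:.2f}%")
```

With these counts the accuracy is 96.76%, the TPR 94.56%, the FPR 1.39%, the precision 98.28%, and the F-Score 96.38%, in agreement with the values reported in the text.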

Figure 9. Performance of the proposed approach on dataset D2.

Comparison with existing approaches

In this experiment, we compare our approach with existing anti-phishing approaches. Notice that we have applied Le et al. 29 and Aljofey et al. 3 works on dataset D1 to evaluate the efficiency of the proposed approach. While for comparison of the proposed approach with Sahingoz et al. 6 , Rao et al. 13 , Chatterjee and Namin 30 works, we evaluated our approach on benchmark dataset D2 6 , 13 , 30 based on the four-statistics metrics used in the papers. The comparison results are shown in Table 10 . From the results, it is observed that our approach gives better performance than other approaches discussed in the literature, which shows the efficiency of detecting phishing websites over the existing approaches.

In Table 11, we applied the methods of Le et al. 29 and Aljofey et al. 3 to our dataset D1. Our approach outperformed both, with an accuracy of 96.76%, precision of 98.28%, and F-Score of 96.38%. It should be noted that the method of Aljofey et al. achieved 97.86% recall, which is 3.3 percentage points higher than our method, whereas our approach gives a TNR that is higher by 4.97 percentage points and an FPR that is lower by 4.96 percentage points. Our approach accurately identifies legitimate websites, with a high TNR and low FPR. Some phishing detection methods achieve high recall; however, misclassifying legitimate websites is more serious than misclassifying phishing sites.

Discussion and limitations

A phishing website looks similar to its benign official counterpart, and the challenge is how to distinguish between them. This paper proposed a novel anti-phishing approach, which involves different features (URL, hyperlink, and text) that have not been taken into consideration before. The proposed approach is a completely client-side solution. We applied these features to various machine learning algorithms and found that XGBoost attained the best performance. Our main aim is to design a real-time approach that has a high true-negative rate and a low false-positive rate. The results show that our approach correctly filtered the benign webpages, with few benign webpages incorrectly classified as phishing. In the process of phishing webpage classification, we construct the dataset by extracting the relevant and useful features from benign and phishing webpages.

A desktop machine with a Core™ i7 processor running at 3.4 GHz and 16 GB of RAM was used to execute the proposed anti-phishing approach. Since Python provides excellent library support and has a reasonable compile time, the proposed approach is implemented in the Python programming language. The BeautifulSoup library is employed to parse the HTML of the specified URL. The detection time is the time between entering the URL and generating the output. When the URL is entered as a parameter, the approach attempts to fetch all the specified features from the URL and the HTML code of the webpage, as discussed in the feature extraction section. The URL is then classified as benign or phishing based on the values of the extracted features. The total execution time of our approach in phishing webpage detection is around 2–3 s, which is quite low and acceptable in a real-time environment. Response time depends on different factors, such as input size, internet speed, and server configuration. Using our dataset D1, we also computed the time taken for training, testing, and detection by the proposed approach (all feature combinations) for webpage classification. The results are given in Table 12 .
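The extraction-and-timing step described above can be sketched as follows. This is an illustration only: it swaps the BeautifulSoup parser for Python's standard-library `html.parser` so that it runs without dependencies. The feature names, the helper `timed`, and the sample page are hypothetical and are not the paper's actual feature set.

```python
import time
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect the href values of all anchor tags in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []  # list of href strings, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def hyperlink_features(page_url, html):
    """Count total, internal, and external links (hypothetical features)."""
    collector = LinkCollector()
    collector.feed(html)
    page_host = urlparse(page_url).netloc
    internal = 0
    for href in collector.hrefs:
        host = urlparse(href).netloc
        # Relative links carry no host and are counted as internal.
        if host == "" or host == page_host:
            internal += 1
    total = len(collector.hrefs)
    return {
        "total_links": total,
        "internal_links": internal,
        "external_links": total - internal,
    }


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Hypothetical page content; a real run would first download the HTML.
sample_html = (
    '<a href="/login">Login</a>'
    '<a href="https://example.com/help">Help</a>'
    '<a href="https://evil.test/x">Offsite</a>'
)
features, seconds = timed(hyperlink_features, "https://example.com/index.html", sample_html)
```

The timing here covers parsing only. The 2–3 s reported above also includes fetching the page, which this sketch leaves out.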

To further understand the learning capabilities, we also present the classification error and the log loss as a function of the number of iterations performed by XGBoost. Log loss, short for logarithmic loss, is a loss function for classification that quantifies the price paid for inaccurate predictions. Figure 10 shows the logarithmic loss and the classification error of the XGBoost approach for each epoch on the training and test portions of dataset D1. From the figure, we note that the learning algorithm converges after approximately 100 iterations.
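For reference, the two quantities plotted in the learning curve can be computed as in the sketch below. It uses the standard binary log loss, the mean of -[y log p + (1 - y) log(1 - p)], and an error rate at an assumed decision threshold of 0.5. The function names and the clipping constant are illustrative choices, not details taken from the paper.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary logarithmic loss; p_pred holds predicted P(y = 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip probabilities so log() never receives exactly 0 or 1.
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def classification_error(y_true, p_pred, threshold=0.5):
    """Fraction of samples whose thresholded prediction disagrees with the label."""
    wrong = sum(
        1 for y, p in zip(y_true, p_pred) if (p >= threshold) != bool(y)
    )
    return wrong / len(y_true)
```

A confident wrong prediction is penalized far more heavily by log loss than by classification error, which is why the two curves can flatten at different iteration counts.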

Figure 10. XGBoost learning curve of logarithmic loss and classification error on dataset D1.

Limitations

Although our proposed approach has attained outstanding accuracy, it has some limitations. The first limitation is that the textual features of our phishing detection approach depend on the English language. This may cause classification errors when the suspicious webpage includes text in a language other than English. More than half (60.5%) of websites use English as their text language 44 . However, our approach also employs URL, noisy HTML, and hyperlink-based features, which are language-independent. The second limitation is that, although the proposed approach uses URL-based features, it may fail to identify phishing websites when the phishers use embedded objects (e.g., JavaScript, images, Flash) to obscure the textual content and HTML code from anti-phishing solutions. Many attackers use server-side scripting to hide the HTML source code. Based on our experiments, we noticed that legitimate pages usually contain rich textual content and a high number of hyperlinks (at least one hyperlink in the HTML source code). At present, some phishing webpages include malware, for example, a Trojan horse that installs itself on the user's system when the user opens the website. Hence, the next limitation of this approach is that it cannot adequately detect attached malware, because it does not read and process content from the webpage's external files, whether they are cross-domain or not. Finally, our approach's training time is relatively long due to the high-dimensional vector generated by the textual content features. However, the trained model is considerably more accurate than the existing baseline methods.

Conclusion and future work

Phishing website attacks are a massive challenge for researchers, and they have continued to rise in recent years. Blacklist/whitelist techniques are the traditional way to alleviate such threats. However, these methods fail to detect non-blacklisted phishing websites (i.e., zero-day attacks). As an improvement, machine learning techniques are being used to increase detection efficiency and reduce the misclassification ratio. However, some of them extract features from third-party services, search engines, website traffic, etc., which are complicated and difficult to access. In this paper, we propose a machine learning-based approach that can quickly and precisely detect phishing websites using URL and HTML features of the given webpage. The proposed approach is a completely client-side solution and does not rely on any third-party services. It uses URL character sequence features without expert intervention, and hyperlink-specific features that determine the relationship between the content and the URL of a webpage. Moreover, our approach extracts character-level TF-IDF features from the plaintext and noisy part of the given webpage's HTML.
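To make the character-level TF-IDF idea concrete, the sketch below computes textbook TF-IDF weights over character trigrams using only the standard library. The n-gram size, the unsmoothed log(N/df) weighting, and the function names are assumptions chosen for illustration. The exact configuration used in the paper is not restated here.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(documents, n=3):
    """Map each document to a {ngram: tf * idf} dictionary."""
    counts = [Counter(char_ngrams(doc, n)) for doc in documents]
    # Document frequency: in how many documents each n-gram occurs.
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())
    num_docs = len(documents)
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append({
            gram: (freq / total) * math.log(num_docs / doc_freq[gram])
            for gram, freq in c.items()
        })
    return vectors


# Hypothetical snippets standing in for page text.
vectors = tfidf_vectors(["login secure", "login bank"])
```

An n-gram shared by every document receives zero weight under this scheme, so the vector emphasizes character sequences that distinguish one page from the rest.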

A new dataset is constructed to measure the performance of the phishing detection approach, and various classification algorithms are employed. Furthermore, the performance of each category of the proposed feature set is also evaluated. According to the empirical and comparison results from the implemented classification algorithms, the XGBoost classifier with all feature types combined provides the best performance. It achieved a 1.39% false-positive rate and 96.76% overall detection accuracy on our dataset, and an accuracy of 98.48% with a 2.09% false-positive rate on a benchmark dataset.

In future work, we plan to include some new features to detect phishing websites that contain malware. As noted in the “ Limitations ” section, our approach cannot detect malware attached to a phishing webpage. Blockchain technology is now increasingly popular and appears to be an attractive target for phishing attacks, such as phishing scams on the blockchain. Blockchain is an open and distributed ledger that can record transactions between sending and receiving parties efficiently, verifiably, and permanently, which has made it popular among investors 45 . Thus, detecting phishing scams in the blockchain environment is a challenge that calls for further research and development. Moreover, detecting phishing attacks on mobile devices is another important topic in this area due to the popularity of smartphones 47 , which has made them a common target of phishing offenses.

Data availability

The dataset generated during the current study is available in the Google Drive repository: https://drive.google.com/file/d/18ZZHsCeMmF9HKTaL_yd41oJ_3Fgk0gWE/view?usp=sharing .

RSA. Rsa fraud report. https://go.rsa.com/l/797543/2020-07-08/3njln/797543/48525/RSA_Fraud_Report_Q1_2020.pdf (2020) (Accessed 14 January 2021).

APWG. Phishing Attack Trends Reports, 24, November 2020. https://docs.apwg.org/reports/apwg_trends_report_q3_2020.pdf (2020) (Accessed 14 January 2021).

Aljofey, A., Jiang, Q., Qu, Q., Huang, M. & Niyigena, J.-P. An effective phishing detection model based on character level convolutional neural network from URL. Electronics 9 , 1514 (2020).

Dhamija, R., Tygar, J.D., & Hearst, M. Why phishing works. in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Montreal, QC, Canada, 22–27 April 2006 , 581–590 (2006).

Jain, A. K. & Gupta, B. B. A novel approach to protect against phishing attacks at client side using auto-updated white-list. EURASIP J. on Info. Security. 9 , 1–11. https://doi.org/10.1186/s13635-016-0034-3 (2016).

Sahingoz, O. K., Buber, E., Demir, O. & Diri, B. Machine learning based phishing detection from URLs. Expert Syst. Appl. 2019 (117), 345–357 (2019).

Haruta, S. , Asahina, H., & Sasase, I. Visual Similarity-based Phishing Detection Scheme using Image and CSS with Target Website Finder. 978-1-5090-5019-2/17/$31.00 ©2017 IEEE (2017).

Cook, D. L., Gurbani, V. K., & Daniluk, M. Phishwish: A stateless phishing filter using minimal rules. in Financial Cryptography and Data Security , (ed. Gene Tsudik) 324, (Berlin, Heidelberg, Springer-Verlag, 2008).

Jain, A. K. & Gupta, B. B. A machine learning based approach for phishing detection using hyperlinks information. J. Ambient. Intell. Humaniz. Comput. https://doi.org/10.1007/s12652-018-0798-z (2018).

Li, Y., Yang, Z., Chen, X., Yuan, H. & Liu, W. A stacking model using URL and HTML features for phishing webpage detection. Futur. Gener. Comput. Syst. 94 , 27–39 (2019).

Xiang, G., Hong, J., Rose, C. P. & Cranor, L. CANTINA+: a feature rich machine learning framework for detecting phishing web sites. ACM Trans. Inf. Syst. Secur. 14 (2), 1–28. https://doi.org/10.1145/2019599.2019606 (2011).

Zhang, W., Jiang, Q., Chen, L. & Li, C. Two-stage ELM for phishing Web pages detection using hybrid features. World Wide Web 20 (4), 797–813 (2017).

Rao, R. S., Vaishnavi, T. & Pais, A. R. CatchPhish: Detection of phishing websites by inspecting URLs. J. Ambient. Intell. Humanized Comput. 11 , 813–825 (2019).

Arachchilage, N. A. G., Love, S. & Beznosov, K. Phishing threat avoidance behaviour: An empirical investigation. Comput. Hum. Behav. 60 , 185–197 (2016).

Wang, Y., Agrawal, R., & Choi, B.Y. Light weight anti-phishing with user whitelisting in a web browser. in Region 5 conference, 2008 IEEE, IEEE , 1–4 (2008).

Han, W., Cao, Y., Bertino, E. & Yong, J. Using automated individual white-list to protect web digital identities. Expert Syst. Appl. 39 (15), 11861–11869 (2012).

Prakash, P., Kumar, M., Kompella, R.R., Gupta, M. Phishnet: Predictive blacklisting to detect phishing attacks. in INFOCOM, 2010 Proceedings IEEE, IEEE , 1–5. https://doi.org/10.1109/INFCOM.2010.5462216 (2010)

Felegyhazi, M., Kreibich, C. & Paxson, V. On the potential of proactive domain blacklisting. LEET 10 , 6–6 (2010).

Sheng, S., Wardman, B., Warner, G., Cranor, L.F., Hong, J., & Zhang, C. An empirical analysis of phishing blacklists. in Proceedings of the 6th Conference on Email and Anti-Spam (CEAS’09) (2010).

Qi, L. et al. Privacy-aware data fusion and prediction with spatial-temporal context for smart city industrial environment. IEEE Trans. Ind. Inform. 17 (6), 4159–4167. https://doi.org/10.1109/TII.2020.3012157 (2021).

Liu, Y. et al. A label noise filtering and label missing supplement framework based on game theory. Digital Commun. Netw. https://doi.org/10.1016/j.dcan.2021.12.008 (2022).

Muzammal, M., Qu, Q. & Nasrulin B. Renovating blockchain with distributed databases: An open source system. Future Gener. Comput. Syst. 90 , 105–117. https://doi.org/10.1016/j.future.2018.07.042 (2019).

Liu, Y. et al. Bidirectional GRU networks-based next POI category prediction for healthcare. Int. J. Intell. Syst. https://doi.org/10.1002/int.22710 (2021).

Jain, A. K. & Gupta, B. B. Towards detection of phishing websites on client-side using machine learning based approach. Telecommun. Syst. https://doi.org/10.1007/s11235-017-0414-0 (2017).

Rao, R. S. & Pais, A. R. Two level filtering mechanism to detect phishing sites using lightweight visual similarity approach. J. Ambient. Intell. Humaniz. Comput. https://doi.org/10.1007/s12652-019-01637-z (2019).

Jain, A. K. & Gupta, B. B. Two-level authentication approach to protect from phishing attacks in real time. J. Ambient. Intell. Human Comput. https://doi.org/10.1007/s12652-017-0616-z (2017).

Rao, R. S., Umarekar, A. & Pais, A. R. Application of word embedding and machine learning in detecting phishing websites. Telecommun. Syst. 79 , 33–45. https://doi.org/10.1007/s11235-021-00850-6 (2022).

Guo, B. et al. HinPhish: An effective phishing detection approach based on heterogeneous information networks. Appl. Sci. 11 (20), 9733. https://doi.org/10.3390/app11209733 (2021).

Le, H., Pham, Q., Sahoo, D., & Hoi, S.C.H. Urlnet: Learning a URL representation with deep learning for malicious URL detection. arXiv 2018, arXiv: 1802.03162 (2018).

Chatterjee, M., & Namin, A.S. Detecting phishing websites through deep reinforcement learning. in 2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC) . 978-1-7281-2607-4/19/$31.00 ©2019 IEEE. (IEEE Computer Society, 2019). https://doi.org/10.1109/COMPSAC.2019.10211 .

Xiao, X., Zhang, D., Hu, G., Jiang, Y. & Xia, S. CNN-MHSA: A convolutional neural network and multi-head self- attention combined approach for detecting phishing websites. Neural Netw. 125 , 303–312. https://doi.org/10.1016/j.neunet.2020.02.013 (2020).

Zheng, F., Yan, Q., Leung, V. C. M., Yu, F. R. & Ming, Z. HDP-CNN: Highway deep pyramid convolution neural network combining word-level and character-level representations for phishing website detection. Comput. Secur. https://doi.org/10.1016/j.cose.2021.102584 (2021).

Mohammad, R. M., Thabtah, F. & McCluskey, L. Predicting phishing websites based on self-structuring neural network. Neural Comput. Appl. 25 (2), 443–458 (2014).

Ramanathan, V. & Wechsler, H. Phishing detection and impersonated entity discovery using Conditional Random Field and Latent Dirichlet Allocation. Comput. Security. 34 , 123–139 (2013).

Zhang, X., Zhao, J., & LeCun, Y. Character-level convolutional networks for text classification. in Proceedings of the Advances in Neural Information Processing Systems 28 (NIPS 2015), Montreal, QC, Canada, 7–12 December 2015 (2015).

Stecanella, B. What is TF-IDF? https://monkeylearn.com/blog/what-is-tf-idf/ . (2019) (Accessed 20 December 2020).

Bansal, S.A. Comprehensive guide to understand and implement text classification in python. https://www.analyticsvidhya.com/blog/2018/04/a-comprehensive-guide-to-understand-andimplement-text-classification-in-python/ (2018) (Accessed 1 July 2020).

Ramesh, G., Krishnamurthi, I. & Kumar, K. S. S. An efficacious method for detecting phishing webpages through target domain identification. Decis. Support Syst. 2014 (61), 12–22 (2014).

Zhang, Y., Hong, J.I., & Cranor, L.F. Cantina: A content- based approach to detecting phishing websites. in Proceedings of the 16th International Conference on World Wide Web, Banff, AB, Canada, 8–12 May 2007 , 639–648 (2007).

Chen, T., & Guestrin, C. Xgboost: A scalable tree boosting system. in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM , 785–794 (2016).

Aljofey, A., Jiang, Q. & Qu, Q. A supervised learning model for detecting Ponzi contracts in Ethereum Blockchain. In Big Data and Security. ICBDS 2021. Communications in Computer and Information Science Vol. 1563 (eds Tian, Y. et al. ) (Springer, 2022). https://doi.org/10.1007/978-981-19-0852-1_52 .

http://stuffgate.com/stuff/website/ . (Accessed February 2020).

http://www.phishtank.com . (Accessed April 2020).

Usage of content languages for websites. https://w3techs.com/technologies/overview/content_language/all . (2021) (Accessed 19 January 2021).

Iansiti, M. & Lakhani, K. R. The truth about blockchain. Harvard Bus. Rev. 95 (1), 118–127 (2017).

https://github.com/YC-Coder-Chen/Tree-Math/blob/master/XGboost.md . (Accessed September 2021).

Qu, Q., Liu, S., Yang, B. & Jensen, C. S. Efficient top-k spatial locality search for co-located spatial web objects. 2014 IEEE 15th International Conference on Mobile Data Management. 1 , 269–278 (2014).

Acknowledgements

This research work is supported by the National Key Research and Development Program of China Grant nos. 2021YFF1200104 and 2021YFF1200100.

Author information

Authors and affiliations.

Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China

Ali Aljofey, Qingshan Jiang, Abdur Rasool, Hui Chen & Qiang Qu

Shenzhen College of Advanced Technology, University of Chinese Academy of Sciences, Beijing, 100049, China

Ali Aljofey, Abdur Rasool & Hui Chen

Department of Computer Science, Guangdong University of Technology, Guangzhou, China

Cloud Computing Center, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China

Contributions

Data curation, A.A. and Q.J.; Funding acquisition, Q.J. and Q.Q.; Investigation, Q.J. and Q.Q.; Methodology, A.A. and Q.J.; Project administration, Q.J.; Software, A.A.; Supervision, Q.J.; Validation, A.R. and H.C.; Writing—original draft, A.A.; Writing—review & editing, Q.J., W.L, Y.W, and Q.Q; All authors reviewed the manuscript.

Corresponding author

Correspondence to Qingshan Jiang .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article.

Aljofey, A., Jiang, Q., Rasool, A. et al. An effective detection approach for phishing websites using URL and HTML features. Sci Rep 12 , 8842 (2022). https://doi.org/10.1038/s41598-022-10841-5

Received : 17 December 2021

Accepted : 06 April 2022

Published : 25 May 2022

DOI : https://doi.org/10.1038/s41598-022-10841-5

A systematic review on deep-learning-based phishing email detection.

1. Introduction

1.1. Our Contribution

1.2. Organization of the Document

2. Methodology

2.1. Research Question and Search Strategy

2.2. Study Selection

2.3. Data Extraction and Analysis

2.4. Quality Assessment

2.5. Inclusion and Exclusion Criteria

2.5.1. Inclusion Criteria

  • The paper must contain empirical results on deep-learning-based phishing detection.
  • The paper must be published in the English language.

2.5.2. Exclusion Criteria

  • The paper is not available in full-text format.
  • The paper is not related to the research question.
  • The paper is a duplicate publication.
  • The paper is a review article or a meta-analysis.
  • The paper is a conference abstract or poster presentation.
  • The paper is a book, book chapter, or thesis.
  • The paper is of low quality, as determined by the QATQS.

3. Literature Survey and Findings

3.1. Research Papers Published in 2017 and Before

3.2. Research Papers Published in 2018

3.3. Research Papers Published in 2019

3.4. Research Papers Published in 2020

3.5. Research Papers Published in 2021

3.6. Research Papers Published in 2022

3.7. Research Papers Published in 2023

4. Results and Analysis

4.1. Findings of Data Analysis

4.2. Limitations Found

4.3. Future Direction

4.3.1. Privacy Preservation

4.3.2. Increasing Dataset Size and Optimizing Feature Selection

4.3.3. Broader Email Content Analysis

4.3.4. Handling Modern Phishing Techniques

4.3.5. Handling Concept Drift

4.3.6. Consideration of Additional Factors

4.3.7. Comparison with State-of-the-Art Techniques

4.3.8. Hyperparameter Optimization and More Deep Learning Architectures

4.3.9. Real-Time Dataset and Processing

4.3.10. Exploration of Other Machine Learning Techniques

4.3.11. Incorporating Additional Data Sources

4.3.12. Enriching the Dataset

4.3.13. Exploring Attackers’ Behavior and Modus Operandi

4.3.14. Testing on Other Domains

5. Conclusions

Author Contributions

Data Availability Statement

Conflicts of Interest

  • Alshingiti, Z.; Alaqel, R.; Al-Muhtadi, J.; Haq, Q.E.U.; Saleem, K.; Faheem, M.H. A Deep Learning-Based Phishing Detection System Using CNN, LSTM, and LSTM-CNN. Electronics 2023 , 12 , 232. [ Google Scholar ] [ CrossRef ]
  • Tsohou, A.; Diamantopoulou, V.; Gritzalis, S.; Lambrinoudakis, C. Cyber insurance: State of the art, trends and future directions. Int. J. Inf. Secur. 2023 , 22 , 737–748. [ Google Scholar ] [ CrossRef ]
  • Sheng, S.; Wardman, B.; Warner, G.; Cranor, L.; Hong, J.; Zhang, C. An Empirical Analysis of Phishing Blacklists. In Proceedings of the Sixth Conference on Email and Anti-Spam, Mountain View, CA, USA, 16–17 July 2009. [ Google Scholar ]
  • Edge, M.E.; Sampaio, P.R.F. A survey of signature based methods for financial fraud detection. Comput. Secur. 2009 , 28 , 381–394. [ Google Scholar ] [ CrossRef ]
  • Safi, A.; Singh, S. A systematic literature review on phishing website detection techniques. J. King Saud Univ. Comput. Inf. Sci. 2023 , 35 , 590–611. [ Google Scholar ] [ CrossRef ]
  • Aldawood, H.; Skinner, G. An Advanced Taxonomy for Social Engineering Attacks. Int. J. Comput. Appl. 2020 , 177 , 1–11. [ Google Scholar ] [ CrossRef ]
  • Aleroud, A.; Zhou, L. Phishing environments, techniques, and countermeasures: A survey. Comput. Secur. 2017 , 68 , 160–196. [ Google Scholar ] [ CrossRef ]
  • Kocher, G.; Kumar, G. Machine learning and deep learning methods for intrusion detection systems: Recent developments and challenges. Soft Comput. 2021 , 25 , 9731–9763. [ Google Scholar ] [ CrossRef ]
  • Chen, D.; Wawrzynski, P.; Lv, Z. Cyber security in smart cities: A review of deep learning-based applications and case studies. Sustain. Cities Soc. 2021 , 66 , 102655. [ Google Scholar ] [ CrossRef ]
  • Adebowale, M.A.; Lwin, K.T.; Hossain, M.A. Deep learning with convolutional neural network and long short-term memory for phishing detection. In Proceedings of the 2019 13th International Conference on Software, Knowledge, Information Management and Applications (SKIMA), Island of Ulkulhas, Maldives, 26–28 August 2019; pp. 1–8. [ Google Scholar ]
  • Thomas, B.; Ciliska, D.; Dobbins, M.; Micucci, S. A Process for Systematically Reviewing the Literature: Providing the Research Evidence for Public Health Nursing Interventions. Worldviews Evid.-Based Nurs. 2004 , 1 , 176–184. [ Google Scholar ] [ CrossRef ]
  • Nosseir, A.; Nagati, K.; Taj-Eddin, I. Intelligent word-based spam filter detection using multi-neural networks. Int. J. Comput. Sci. Issues (IJCSI) 2013 , 10 Pt 1 , 17. [ Google Scholar ]
  • Almomani, A.; Gupta, B.B.; Wan, T.C.; Altaher, A.; Manickam, S. Phishing dynamic evolving neural fuzzy framework for online detection zero-day phishing email. Indian J. Sci. Technol. 2013 , 6 , 3960–3964. [ Google Scholar ] [ CrossRef ]
  • Hamid, I.R.A.; Abawajy, J.; Kim, T.H. Using feature selection and classification scheme for automating phishing email detection. Stud. Inform. Control. 2013 , 22 , 61–70. [ Google Scholar ] [ CrossRef ]
  • Jameel, N.G.M.; George, L.E. Detection of phishing emails using feed forward neural network. Int. J. Comput. Appl. 2013 , 77 , 10–15. [ Google Scholar ]
  • Soni, A.N. Spam-e-mail-detection-using-advanced-deep-convolution-neuralnetwork-algorithms. J. Innov. Dev. Pharm. Tech. Sci. 2019 , 2 , 74–80. [ Google Scholar ]
  • Zhang, N.; Yuan, Y. Phishing Detection Using Neural Network. Available online: http://cs229.stanford.edu/proj2012/ZhangYuan-PhishingDetectionUsingNeuralNetwork.pdf (accessed on 1 October 2023).
  • Kufandirimbwa, O.; Gotora, R. Spam detection using artificial neural networks (perceptron learning rule). Online J. Phys. Environ. Sci. Res. 2012 , 1 , 22–29. [ Google Scholar ]
  • Abu-Nimeh, S.; Nappa, D.; Wang, X.; Nair, S. A comparison of machine learning techniques for phishing detection. In Proceedings of the Anti-Phishing Working Groups 2nd Annual eCrime Researchers Summit, Pittsburgh, PA, USA, 4–5 October 2007; pp. 60–69. [ Google Scholar ]
  • Chandan, C.J.; Chheda, H.P.; Gosar, D.M.; Shah, H.R.; Bhave, P.U. A Machine learning approach for detection of phished websites using neural networks. Int. J. Recent Innov. Trends Comput. Commun. 2014 , 2 , 42054209. [ Google Scholar ]
  • Alkaht, I.J.; Al Khatib, B. Filtering SPAM Using Several Stages Neural Networks. Int. Rev. Comput. Softw. (IRECOS) 2016 , 11 , 123–132. [ Google Scholar ] [ CrossRef ]
  • Coyotes, C.; Mohan, V.S.; Naveen, J.; Vinayakumar, R.; Soman, K.P.; Verma, A.D.R. ARES: Automatic rogue email spotter. In Proceedings of the 1st AntiPhishing Shared Pilot at 4th ACM International Workshop on Security and Privacy Analytics (IWSPA), Tempe, AZ, USA, 1–11 March 2018. [ Google Scholar ]
  • Smadi, S.; Aslam, N.; Zhang, L. Detection of online phishing email using dynamic evolving neural network based on reinforcement learning. Decis. Support Syst. 2018 , 107 , 88–102. [ Google Scholar ] [ CrossRef ]
  • Hiransha, M.; Unnithan, N.A.; Vinayakumar, R.; Soman, K.; Verma, A.D.R. Deep learning based phishing e-mail detection. In Proceedings of the 1st AntiPhishing Shared Pilot at 4th ACM International Workshop Security Privacy Analytics (IWSPA), Tempe, AZ, USA, 1–11 March 2018; pp. 1–5. [ Google Scholar ]
  • Barushka, A.; Hajek, P. Spam filtering using integrated distribution-based balancing approach and regularized deep neural networks. Appl. Intell. 2018 , 48 , 3538–3556. [ Google Scholar ] [ CrossRef ]
  • Fang, Y.; Zhang, C.; Huang, C.; Liu, L.; Yang, Y. Phishing Email Detection Using Improved RCNN Model With Multilevel Vectors and Attention Mechanism. IEEE Access 2019 , 7 , 56329–56340. [ Google Scholar ] [ CrossRef ]
  • Harikrishnan, N.B.; Vinayakumar, R.; Soman, K.P.; Poornachandran, P. Time split based pre-processing with a data-driven approach for malicious url detection. Cybersecur. Secur. Inf. Syst. Chall. Solut. Smart Environ. 2019 , 43–65. [ Google Scholar ] [ CrossRef ]
  • Ali, W.; Ahmed, A.A. Hybrid intelligent phishing website prediction using deep neural networks with genetic algorithm-based feature selection and weighting. IET Inf. Secur. 2019 , 13 , 659–669. [ Google Scholar ] [ CrossRef ]
  • Oña, D.; Zapata, L.; Fuertes, W.; Rodríguez, G.; Benavides, E.; Toulkeridis, T. Phishing attacks: Detecting and preventing infected e-mails using machine learning methods. In Proceedings of the 2019 3rd Cyber Security in Networking Conference (CSNet), IEEE, Quito, Ecuador, 23–25 October 2019; pp. 161–163. [ Google Scholar ]
  • Nguyen, M.; Nguyen, T.; Nguyen, T.H. A deep learning model with hierarchical lstms and supervised attention for anti-phishing. CEUR Workshop Proc. 2018 , 2124 , 29–38. [ Google Scholar ]
  • Wei, B.; Hamad, R.A.; Yang, L.; He, X.; Wang, H.; Gao, B.; Woo, W.L. A deep-learning-driven light-weight phishing detection sensor. Sensors 2019 , 19 , 4258. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Vinayakumar, R.; Soman, K.P.; Poornachandran, P.; Akarsh, S.; Elhoseny, M. Deep learning framework for cyber threat situational awareness based on email and url data analysis. In Cybersecurity and Secure Information Systems: Challenges and Solutions in Smart Environments ; Springer: Berlin/Heidelberg, Germany, 2019; pp. 87–124. [ Google Scholar ]
  • Yang, P.; Zhao, G.; Zeng, P. Phishing Website Detection Based on Multidimensional Features Driven by Deep Learning. IEEE Access 2019 , 7 , 15196–15209. [ Google Scholar ] [ CrossRef ]
  • Saha, I.; Sarma, D.; Chakma, R.J.; Alam, M.N.; Sultana, A.; Hossain, S. Phishing attacks detection using deep learning approach. In Proceedings of the 2020 Third International Conference on Smart Systems and Inventive Technology (ICSSIT), IEEE, Tirunelveli, India, 20–22 August 2020; pp. 1180–1185. [ Google Scholar ]
  • Thapa, C.; Tang, J.W.; Abuadbba, A.; Gao, Y.; Camtepe, S.; Nepal, S.; Almashor, M.; Zheng, Y. Evaluation of Federated Learning in Phishing Email Detection. Sensors 2023 , 23 , 4346. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Adebowale, M.A.; Lwin, K.T.; Hossain, M.A. Intelligent phishing detection scheme using deep learning algorithms. J. Enterp. Inf. Manag. 2020 , 36 , 747–766. [ Google Scholar ] [ CrossRef ]
  • Alotaibi, R.; Al-Turaiki, I.; Alakeel, F. Mitigating email phishing attacks using convolutional neural networks. In Proceedings of the 2020 3rd International Conference on Computer Applications & Information Security (ICCAIS), IEEE, Riyadh, Saudi Arabia, 19–21 March 2020; pp. 1–6. [ Google Scholar ]
  • Baccouche, A.; Ahmed, S.; Sierra-Sosa, D.; Elmaghraby, A. Malicious text identification: Deep learning from public comments and emails. Information 2020 , 11 , 312. [ Google Scholar ] [ CrossRef ]
  • Soon, G.K.; On, C.K.; Rusli, N.M.; Fun, T.S.; Alfred, R.; Guan, T.T. Comparison of simple feedforward neural network, recurrent neural network and ensemble neural networks in phishing detection. J. Phys. Conf. Ser. 2020 , 1502 , 012033. [ Google Scholar ] [ CrossRef ]
  • Alauthman, M. Botnet Spam E-Mail Detection Using Deep Recurrent Neural Network. Int. J. Emerg. Trends Eng. Res. 2020 , 8 , 1979–1986. [ Google Scholar ] [ CrossRef ]
  • Eryılmaz, E.E.; Şahin, D.Ö.; Kılıç, E. Filtering turkish spam using LSTM from deep learning techniques. In Proceedings of the 2020 8th International Symposium on Digital Forensics and Security, ISDFS, IEEE, Beirut, Lebanon, 1–2 June 2020; pp. 1–6. [ Google Scholar ]
  • Halgaš, L.; Agrafiotis, I.; Nurse, J.R. Catching the Phish: Detecting phishing attacks using recurrent neural networks (RNNs). In Proceedings of the Information Security Applications: 20th International Conference, WISA 2019, Jeju Island, Republic of Korea, 21–24 August 2019; pp. 219–233. [ Google Scholar ]
  • Isik, S.; Kurt, Z.; Anagun, Y.; Ozkan, K. Spam E-mail Classification Recurrent Neural Networks for Spam E-mail Classification on an Agglutinative Language. Int. J. Intell. Syst. Appl. Eng. 2020 , 8 , 221–227. [ Google Scholar ] [ CrossRef ]
  • AlEroud, A.; Karabatis, G. Bypassing detection of URL-based phishing attacks using generative adversarial deep neural networks. In Proceedings of the Sixth International Workshop on Security and Privacy Analytics, New Orleans, LA, USA, 18 March 2020; pp. 53–60. [ Google Scholar ]
  • Castillo, E.; Dhaduvai, S.; Liu, P.; Thakur, K.S.; Dalton, A.; Strzalkowski, T. Email threat detection using distinct neural network approaches. In Proceedings of the First International Workshop on Social Threats in Online Conversations: Understanding and Management, Marseille, France, 11–16 May 2020; pp. 48–55. [ Google Scholar ]
  • Kumar, A.; Chatterjee, J.M.; Díaz, V.G. A novel hybrid approach of SVM combined with NLP and probabilistic neural network for email phishing. Int. J. Electr. Comput. Eng. (IJECE) 2020 , 10 , 486–493. [ Google Scholar ] [ CrossRef ]
  • Opara, C.; Wei, B.; Chen, Y. HTMLPhish: Enabling phishing web page detection by applying deep learning techniques on HTML analysis. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–8. [ Google Scholar ]
  • AbdulNabi, I.; Yaseen, Q. Spam Email Detection Using Deep Learning Techniques. Procedia Comput. Sci. 2021 , 184 , 853–858. [ Google Scholar ] [ CrossRef ]
  • Otter, D.W.; Medina, J.R.; Kalita, J.K. A Survey of the Usages of Deep Learning for Natural Language Processing. IEEE Trans. Neural Networks Learn. Syst. 2020 , 32 , 604–624. [ Google Scholar ] [ CrossRef ]
  • Alhogail, A.; Alsabih, A. Applying machine learning and natural language processing to detect phishing email. Comput. Secur. 2021 , 110 , 102414. [ Google Scholar ] [ CrossRef ]
  • Bagui, S.; Nandi, D.; Bagui, S.; White, R.J. Machine learning and deep learning for phishing email classification using one-hot encoding. J. Comput. Sci. 2021 , 17 , 610–623. [ Google Scholar ] [ CrossRef ]
  • Lee, J.; Tang, F.; Ye, P.; Abbasi, F.; Hay, P.; Divakaran, D.M. D-Fence: A flexible, efficient, and comprehensive phishing email detection system. In Proceedings of the 2021 IEEE European Symposium on Security and Privacy (EuroS&P), IEEE, Vienna, Austria, 7–11 September 2021; pp. 578–597. [ Google Scholar ]
  • Manaswini, M.; Srinivasu, D.N. Phishing Email Detection Model using Improved Recurrent Convolutional Neural Networks and Multilevel Vectors. Ann. Rom. Soc. Cell Biol. 2021 , 25 , 16674–16681. [ Google Scholar ]
  • Ghaleb, S.A.A.; Mohamad, M.; Fadzli, S.A.; Ghanem, W.A.H.M. Training Neural Networks by Enhance Grasshopper Optimization Algorithm for Spam Detection System. IEEE Access 2021 , 9 , 116768–116813. [ Google Scholar ] [ CrossRef ]
  • Eckhardt, R.; Bagui, S. Convolutional Neural Networks and Long Short Term Memory for Phishing Email Classification. Int. J. Comput. Sci. Inf. Secur. 2021 , 19 , 27–35. [ Google Scholar ]
  • Sheneamer, A. Comparison of Deep and Traditional Learning Methods for Email Spam Filtering. Int. J. Adv. Comput. Sci. Appl. 2021 , 12 , 560–565. [ Google Scholar ] [ CrossRef ]
  • Dubey, K.A.; Ganesh, K.B.; Gowtham, V.; Balakrishnan, M.D. Phishing email detection. Int. J. Emerg. Technol. Comput. Sci. Electron. (IJETCSE) 2021 , 28 , 1–4. [ Google Scholar ]
  • Samarthrao, K.V.; Rohokale, V.M. Enhancement of email spam detection using improved deep learning algorithms for cyber security. J. Comput. Secur. 2022 , 30 , 231–264. [ Google Scholar ] [ CrossRef ]
  • Dewis, M.; Viana, T. Phish Responder: A Hybrid Machine Learning Approach to Detect Phishing and Spam Emails. Appl. Syst. Innov. 2022 , 5 , 73. [ Google Scholar ] [ CrossRef ]
  • Khan, S.A.; Iqbal, K.; Mohammad, N.; Akbar, R.; Ali, S.S.A.; Siddiqui, A.A. A Novel Fuzzy-Logic-Based Multi-Criteria Metric for Performance Evaluation of Spam Email Detection Algorithms. Appl. Sci. 2022 , 12 , 7043. [ Google Scholar ] [ CrossRef ]
  • Malhotra, P.; Malik, S. Spam Email Detection Using Machine Learning and Deep Learning Techniques. In Proceedings of the International Conference on Innovative Computing & Communication (ICICC), Delhi, India, 24 June 2022. [ Google Scholar ] [ CrossRef ]
  • Korkmaz, M.; Koçyiğit, E.; Şahingöz, Ö.; Diri, B. A Hybrid Phishing Detection System by Using Deep Learning-Based URL and Content Analysis. Elektron. Ir Elektrotechnika 2022 , 28 , 80–89. [ Google Scholar ] [ CrossRef ]
  • Zhu, E.; Yuan, Q.; Chen, Z.; Li, X.; Fang, X. CCBLA: A Lightweight Phishing Detection Model Based on CNN, BiLSTM, and Attention Mechanism. Cogn. Comput. 2022 , 15 , 1320–1333. [ Google Scholar ] [ CrossRef ]
  • Nooraee, M.; Ghaffari, H. Optimization and Improvement of Spam Email Detection Using Deep Learning Approaches. J. Comput. Robot. 2022 , 15 , 61–70. [ Google Scholar ]
  • Prosun, P.R.K.; Alam, K.S.; Bhowmik, S. Improved Spam Email Filtering Architecture Using Several Feature Extraction Techniques. In Proceedings of the International Conference on Big Data, IoT, and Machine Learning: BIM 2021, Cox’s Bazar, Bangladesh, 23–25 September 2021; Springer: Singapore, 2021; pp. 665–675. [ Google Scholar ]
  • Jafar, M.T.; Al-Fawa’reh, M.; Barhoush, M.; Alshira’H, M.H. Enhanced Analysis Approach to Detect Phishing Attacks During COVID-19 Crisis. Cybern. Inf. Technol. 2022 , 22 , 60–76. [ Google Scholar ] [ CrossRef ]
  • Do, N.Q.; Selamat, A.; Krejcar, O.; Herrera-Viedma, E.; Fujita, H. Deep Learning for Phishing Detection: Taxonomy, Current Challenges and Future Directions. IEEE Access 2022 , 10 , 36429–36463. [ Google Scholar ] [ CrossRef ]
  • Zhou, M.-G.; Liu, Z.-P.; Yin, H.-L.; Li, C.-L.; Xu, T.-K.; Chen, Z.-B. Quantum Neural Network for Quantum Neural Computing. Research 2023 , 6 , 0134. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Rafat, K.F.; Xin, Q.; Javed, A.R.; Jalil, Z.; Ahmad, R.Z. Evading obscure communication from spam emails. Math. Biosci. Eng. 2021 , 19 , 1926–1943. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Rathee, D.; Mann, S. Detection of E-Mail Phishing Attacks – using Machine Learning and Deep Learning. Int. J. Comput. Appl. 2022 , 183 , 1–7. [ Google Scholar ] [ CrossRef ]
  • Mughaid, A.; AlZu’bi, S.; Hnaif, A.; Taamneh, S.; Alnajjar, A.; Abu Elsoud, E. An intelligent cyber security phishing detection system using deep learning techniques. Clust. Comput. 2022 , 25 , 3819–3828. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Butt, U.A.; Amin, R.; Aldabbas, H.; Mohan, S.; Alouffi, B.; Ahmadian, A. Cloud-based email phishing attack using machine and deep learning algorithm. Complex Intell. Syst. 2022 , 9 , 3043–3070. [ Google Scholar ] [ CrossRef ]
  • Logavarshini, G.; Yogalakshmi, S. E-Mail Spam Classification Via Deep Learning and Natural Language Processing. Int. J. Res. Publ. Rev. 2022 , 2582 , 7421. [ Google Scholar ]
  • Ghaleb, S.A.A.; Mohamad, M.; Ghanem, W.A.H.M.; Nasser, A.B.; Ghetas, M.; Abdullahi, A.M.; Saleh, S.A.M.; Arshad, H.; Omolara, A.E.; Abiodun, O.I. Feature Selection by Multiobjective Optimization: Application to Spam Detection System by Neural Networks and Grasshopper Optimization Algorithm. IEEE Access 2022 , 10 , 98475–98489. [ Google Scholar ] [ CrossRef ]
  • Babu, D.K. Phishing Detection in Emails Using Multi-Convolutional Neural Network Fusion. Ph.D. Thesis, National College of Ireland, Dublin, Ireland, 2022. [ Google Scholar ]
  • Shmalko, M.; Abuadbba, A.; Gaire, R.; Wu, T.; Paik, H.Y.; Nepal, S. Profiler: Profile-Based Model to Detect Phishing Emails. arXiv 2022 , arXiv:2208.08745. [ Google Scholar ]
  • Muralidharan, T.; Nissim, N. Improving malicious email detection through novel designated deep-learning architectures utilizing entire email. Neural Networks 2023 , 157 , 257–279. [ Google Scholar ] [ CrossRef ]
  • Bountakas, P.; Xenakis, C. HELPHED: Hybrid Ensemble Learning PHishing Email Detection. J. Netw. Comput. Appl. 2023 , 210 , 103545. [ Google Scholar ] [ CrossRef ]
  • Wen, T.; Xiao, Y.; Wang, A.; Wang, H. A novel hybrid feature fusion model for detecting phishing scam on Ethereum using deep neural network. Expert Syst. Appl. 2023 , 211 , 118463. [ Google Scholar ] [ CrossRef ]
  • Liu, Z.-P.; Zhou, M.-G.; Liu, W.-B.; Li, C.-L.; Gu, J.; Yin, H.-L.; Chen, Z.-B. Automated machine learning for secure key rate in discrete-modulated continuous-variable quantum key distribution. Opt. Express 2022 , 30 , 15024–15036. [ Google Scholar ] [ CrossRef ] [ PubMed ]

Ref | Method | Data | Result | Innovations | Limitations
[ ] | CNN, MLP, RNN | Self-generated emails dataset | Accuracy: 93.1% | Highlighted issues related to imbalanced data | Highly imbalanced nature of the dataset
[ ] | NN | SpamAssassin | Accuracy: 99.07% | Provided guidelines to improve offline data | Needed to enrich the offline dataset to enhance model performance
[ ] | CEN-Deepspam | Self-generated emails dataset | Accuracy: 95.5% | Larger dataset could improve accuracy | Additional dataset required to validate the result
[ ] | DBB-RDNN-ReL | Enron, SpamAssassin, SMS Spam Collection | Accuracy: 96.1% | DBB-RDNN-ReL model outperformed other models | Slow processing

Ref | Method | Data | Result | Innovations | Limitations
[ ] | THEMIS | Enron and SpamAssassin | Accuracy: 99.85% | Utilized unbalanced dataset | Limited to detecting phishing emails with header
[ ] | NB, DT, AB, RF, DNN, RNN, CNN | PhishTank | Accuracy: 88.5% | Tf-idf representation is better than feature hashing and embedding | Limited real-time dataset
[ ] | DNN | UCI phishing websites | Accuracy: 95% | Hybrid model performs better for classification | Feature selection requires longer time
[ ] | NN | Debian and PhishTank | Accuracy: 93.9% | Better accuracy | Limited use of deep learning
[ ] | LSTM | Data-no-header and data-full-header | Accuracy: 89.34% | - | Low effectiveness
[ ] | Multi-spatial CNN | Self-generated emails dataset | Accuracy: 86.63% | 30% reduction in the execution time | Did not compare model's performance with other state-of-the-art methods
[ ] | CNN, RNN, CNN-RNN, CNN-LSTM | Spam dataset, URL dataset | Recall: 99% | Better performance in detecting malware | Performance could be improved by adding sub-modules
[ ] | CNN, RNN, LSTM, CNN-RNN | Self-generated emails dataset | Accuracy: 98.99% | High accuracy and low FPR | Focused on a single type of phishing attack

Ref | Method | Data | Result | Innovations | Limitations
[ ] | IPDS | URLs | Accuracy: 93.28% | Novel approach to differentiate phishing and legitimate URLs | Ensuring the availability of the dataset would be challenging
[ ] | CNN | PhishingCorpus and SpamAssassin | Accuracy: 99.42% | Used a huge dataset to detect phishing emails | Used a smaller dataset
[ ] | Multi-label LSTM | Self-generated emails dataset | Accuracy: 92.7% | Used combined dataset | No comparison of the results
[ ] | GRU-RNN+SVM | Spambase dataset | Accuracy: 98.7% | Claimed higher accuracy | Limited to one dataset
[ ] | LSTM+Keras | 800 Turkish emails dataset | Accuracy: 100% | Proposed hybrid model | Limited dataset
[ ] | RNNs | SA-JN and En-JN datasets | Accuracy: 98.91% and 96.74% | Outperformed state-of-the-art systems | Unrealistically hard
[ ] | ANN, LSTM, and BiLSTM | Self-generated Turkish emails dataset | Accuracy: 100% | Highest accuracy | Focused on the Turkish language only
[ ] | GAN-based | PhishTank and MillerSmiles | TPR: 97% | Has used actual phishing dataset | Controlled environment
[ ] | ML, DL, NLP | Enron, APWG | Accuracy: 93% | - | Limited dataset
[ ] | SVM combined with NLP and PNN | Self-generated emails dataset | Accuracy: 89% | Probabilistic NN would be more accurate in phishing detection | Only works on a small phishing dataset
[ ] | CNN | HTML documents | Accuracy: 93% | Automatic phishing web page detection | Limited to HTML document analysis

Ref | Method | Data | Result | Innovations | Limitations
[ ] | GCN+NLP | Self-generated email body text dataset | Accuracy: 98.2% | Enhanced phishing detection on the email body text | Tested only English corpus
[ ] | CNN and LSTM | Self-generated emails dataset | Accuracy: 96.34% | CNN with word embedding is most accurate | Tested only English corpus
[ ] | D-Fence | Self-generated emails dataset | Accuracy: 99% | D-Fence maintained a high detection rate | Relied on multiple modules
[ ] | Themis | Self-generated emails dataset | Accuracy: 99.87% | Combined email head and body | Focused only on analyzing the email structure
[ ] | MLP | SpamBase, SpamAssassin, UK-2011 Webspam | Accuracy: 98.1% | Used several datasets and features | Spam detection study is inadequate
[ ] | CNN and LSTM | Two datasets | Accuracy: 98.3% | Adam optimizer outperformed the SGD optimizer | Comparison limited to textual data classification
[ ] | CNN | Self-generated emails dataset | Accuracy: 96.52% | Automated feature extraction | Limited datasets

Ref | Method | Data | Result | Innovations | Limitations
[ ] | Fitness-oriented, Levy improvement-based Dragonfly | N/A | Accuracy: 14.93% | Better performance than DT, KNN, and SVM | Misclassification existed
[ ] | DL+NLP | Text-based and numerical-based datasets | Accuracy: 99% (text-based) and 94% (numerical-based) | Phish Responder better than other models | Limited data used; no explanation on the dataset employed
[ ] | ML and DL | N/A | Accuracy: 98.5% | BiLSTM classifier performed better | Dataset did not contain variety of spam emails
[ ] | TshPhish | PhishTank | Accuracy: 98.37% | Improved feature selection through evolutionary algorithms | Low recall rate
[ ] | CCBLA | Two datasets | Accuracy: 99.85% | Combined CNN, bi-directional LSTM, and attention mechanism | Huge time consumption
[ ] | LSTM and GloVe word embedding | Two datasets | Accuracy: 98.39% and 99.49% | Used multiple datasets | Limited to one language
[ ] | ML-based voting model | N/A | Accuracy: 98% | Used various feature retrieval algorithms | Lack of benchmark datasets
[ ] | GRU-based phishing URL detection | Phishing URLs | Accuracy: 98.30% | Highly accurate classifier | Limited detection of phishing attacks during COVID-19
[ ] | Deep learning | N/A | Accuracy: 92% | Incorporated less explored DL techniques | No details of empirical analysis
[ ] | ML and DL | SpamAssassin | Precision: 95.26%, recall: 97.18%, F1-score: 96% | Focused on the limitations of ML and DL algorithms | Broader email content analysis
[ ] | DL | Email text | Accuracy: 88–100% | - | Cannot effectively handle modern phishing techniques
[ ] | RCNN | Email structure | N/A | Examined emails at multiple levels, including the header, body, character, and words | Limited to detecting phishing emails with header
[ ] | Multiobjective optimization | SpamBase, SpamAssassin, and UK-2011 datasets | Accuracy: 97.5%, 98.3%, and 96.4% | - | Limited to detecting spam

Ref | Method | Data | Result | Innovations | Limitations
[ ] | Deep ensemble learning | Email segments | AUC of 0.993 and TPR of 5% | Higher AUC result | Focus on privacy preservation in future work.
[ ] | HELPHED | Imbalanced | F1-score: 99.42% | Superior result in the imbalanced dataset | Focused on the detection and did not address prevention or mitigation of attacks. The dataset was imbalanced.
[ ] | LBPS | Ethereum data | F1-score: 97.86% | Phishing scam account detection model | Tested the LBPS model only on Ethereum data.
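
The tables above report results as accuracy, precision, recall, F1-score, and TPR, and several rows list an imbalanced dataset as a limitation. The following is a minimal sketch of how these metrics relate, using their standard definitions; it is not taken from any of the reviewed studies. The function names and the confusion-matrix counts (950 legitimate and 50 phishing emails) are hypothetical and chosen only for illustration.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall (fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def metrics_from_counts(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard metrics from confusion-matrix counts (phishing = positive class)."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0  # recall is also the TPR
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1_from_precision_recall(precision, recall),
    }


# Check the row that reports precision 95.26% and recall 97.18%:
# the implied F1 is about 0.962, consistent with the reported 96%.
reported_f1 = f1_from_precision_recall(0.9526, 0.9718)

# Hypothetical imbalanced case: a model that labels every email as legitimate
# scores 95% accuracy while detecting no phishing emails at all.
all_legit = metrics_from_counts(tp=0, fp=0, tn=950, fn=50)

if __name__ == "__main__":
    print(f"F1 from reported precision/recall: {reported_f1:.4f}")
    print(f"All-legitimate baseline: {all_legit}")
```

The second case suggests why accuracy figures from imbalanced datasets are hard to compare across rows: recall and F1 expose a failure that accuracy alone hides.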

Thakur, K.; Ali, M.L.; Obaidat, M.A.; Kamruzzaman, A. A Systematic Review on Deep-Learning-Based Phishing Email Detection. Electronics 2023 , 12 , 4545. https://doi.org/10.3390/electronics12214545

Defending Against Vishing Attacks: A Comprehensive Review for Prevention and Mitigation Techniques

  • Conference paper
  • First Online: 11 March 2024

  • Shaikh Ashfaq
  • Pankaj Chandre
  • Shafi Pathan
  • Uday Mande
  • Madhukar Nimbalkar
  • Parikshit Mahalle

Part of the book series: Lecture Notes in Networks and Systems ((LNNS,volume 896))

Included in the following conference series:

  • International Conference on Recent Developments in Cyber Security

Vishing attacks, or voice phishing attacks, are a type of social engineering attack in which attackers use voice communication channels, such as phone calls, to trick victims into divulging sensitive information or performing actions that compromise their security. Vishing attacks have become increasingly common in recent years and can have severe consequences for individuals, businesses, and organizations. In this review, we examine various prevention and mitigation techniques that can be used to defend against vishing attacks. We start by discussing the common tactics used by attackers and how they exploit vulnerabilities in human behaviour to achieve their goals. We then present an overview of different prevention and mitigation techniques, including education and awareness campaigns, technical solutions such as authentication mechanisms and policy-based approaches. We also evaluate the effectiveness of different techniques, highlighting their strengths and weaknesses, and provide guidance on how to implement a comprehensive vishing defence strategy. Finally, we discuss emerging trends and challenges in vishing attacks and defence, such as the use of artificial intelligence and deepfake technology, and suggest directions for future research and development. Overall, this review provides a comprehensive and up-to-date overview of the current state of vishing attacks and defence, and should be of interest to researchers, practitioners, and individuals who want to enhance their knowledge and understanding of this important topic.

Nmachi WP, Win T (2021) Phishing mitigation techniques: a literature survey. Int J Netw Secur Appl 13(2):63–72. https://doi.org/10.5121/ijnsa.2021.13205

Juan C, Chuanxiong G (2007) Online detection and prevention of phishing attacks (invited paper). In: 2006 first international conference on communications and networking in China, ChinaCom’06. https://doi.org/10.1109/CHINACOM.2006.344718

Andronova IV, Belova IN, Ganeeva MV, Moseykin YN (2018) Scientific technical cooperation within the EAEU as a key factor of the loyalty of the participating countries’ population to the integration and of its attractiveness for new members. RUDN J Sociol 18(1):117–130. https://doi.org/10.22363/2313-2272-2018-18-1-117-130

Gautam H, Kumar V, Sharma V (2021) Phishing prevention techniques: past, present and future, pp 83–98. https://doi.org/10.1007/978-981-33-6307-6_10

Bhusal CS (2021) Systematic review on social engineering: hacking by manipulating humans. J Inf Secur 12(01):104–114. https://doi.org/10.4236/jis.2021.121005

Al-Qahtani AF, Cresci S (2022) The COVID-19 scamdemic: a survey of phishing attacks and their countermeasures during COVID-19. IET Inf Secur 16(5):324–345. https://doi.org/10.1049/ise2.12073

Chandre PR, Mahalle PN, Shinde GR (2018) Machine learning based novel approach for intrusion detection and prevention system: a tool based verification. In: 2018 IEEE global conference on wireless computing and networking (GCWCN), Nov 2018, pp 135–140. https://doi.org/10.1109/GCWCN.2018.8668618

Perwej DY, Abbas SQ, Dixit JP, Akhtar DN, Jaiswal AK (2021) A systematic literature review on the cyber security. Int J Sci Res Manag 9(12):669–710. https://doi.org/10.18535/ijsrm/v9i12.ec04

Chandre P, Mahalle P, Shinde G (2022) Intrusion prevention system using convolutional neural network for wireless sensor network. IAES Int J Artif Intell 11(2):504–515. https://doi.org/10.11591/ijai.v11.i2.pp504-515

Do NQ, Selamat A, Krejcar O, Herrera-Viedma E, Fujita H (2022) Deep learning for phishing detection: taxonomy, current challenges and future directions. IEEE Access 10:36429–36463. https://doi.org/10.1109/ACCESS.2022.3151903

Alkhalil Z, Hewage C, Nawaf L, Khan I (2021) Phishing attacks: a recent comprehensive study and a new anatomy. Front Comput Sci 3:1–23. https://doi.org/10.3389/fcomp.2021.563060

Chandre PR (2021) Intrusion prevention framework for WSN using deep CNN. Turk J Comput Math Educ 12(6):3567–3572

Arshad A, Rehman AU, Javaid S, Ali TM, Sheikh JA, Azeem M (2021) A systematic literature review on phishing and anti-phishing techniques, pp 163–168. [Online]. Available: http://arxiv.org/abs/2104.01255

Salahdine F, Kaabouch N (2019) Social engineering attacks: a survey. Future Internet 11(4). https://doi.org/10.3390/FI11040089

Sadiq A et al (2021) A review of phishing attacks and countermeasures for internet of things-based smart business applications in industry 4.0. Hum Behav Emerg Technol 3(5):854–864. https://doi.org/10.1002/hbe2.301

Ahsan M, Nygard KE, Gomes R, Chowdhury MM, Rifat N, Connolly JF (2022) Cybersecurity threats and their mitigation approaches using machine learning—a review. J Cybersecur Priv 2(3):527–555. https://doi.org/10.3390/jcp2030027

Bhuvana, Bhat AS, Shetty T, Naik MP (2021) A study on various phishing techniques and recent phishing attacks. Int J Adv Res Sci Commun Technol 11(1):142–148. https://doi.org/10.48175/ijarsct-2094

Chawla M, Chouhan SS (2014) A survey of phishing attack techniques. Int J Comput Appl 93(3):32–35. https://doi.org/10.5120/16197-5460

Shankar A, Shetty R, Nath B (2019) A review on phishing attacks. Int J Appl Eng Res 14(9):2171–2175. [Online]. Available: http://www.ripublication.com

Bhavsar V, Kadlak A, Sharma S (2018) Study on phishing attacks. Int J Comput Appl 182(33):27–29. https://doi.org/10.5120/ijca2018918286

Priestman W, Anstis T, Sebire IG, Sridharan S, Sebire NJ (2019) Phishing in healthcare organisations: threats, mitigation and approaches. BMJ Health Care Inform 26(1):1–6. https://doi.org/10.1136/bmjhci-2019-100031

Bojjagani S, Brabin DRD, Rao PVV (2020) PhishPreventer: a secure authentication protocol for prevention of phishing attacks in mobile environment with formal verification. Procedia Comput Sci 171(2019):1110–1119. https://doi.org/10.1016/j.procs.2020.04.119

Mahalakshmi A, Goud NS, Murthy GV (2018) A survey on phishing and it’s detection techniques based on support vector method (SVM) and software defined networking (SDN). Int J Eng Adv Technol 8(2):498–503

Deloitte (2014) Fraud risk management—providing insight into fraud prevention, detection and response, pp 1–12. [Online]. Available: http://www2.deloitte.com/content/dam/Deloitte/in/Documents/finance/Forensic-Proactive-services/in-fa-frm-noexp.pdf

Chin T, Xiong K, Hu C (2018) PhishLimiter: a phishing detection and mitigation approach using software-defined networking. IEEE Access 6:42513–42531. https://doi.org/10.1109/ACCESS.2018.2837889

Abbas SG et al (2021) Identifying and mitigating phishing attack threats in IoT use cases using a threat modelling approach. Sensors 21(14):1–25. https://doi.org/10.3390/s21144816

FireEye Inc. (2016) Spear-phishing attacks why they are successful and how to stop them, pp 1–9. [Online]. Available: https://www.fireeye.com/content/dam/fireeye-www/global/en/products/pdfs/wp-fireeye-how-stop-spearphishing.pdf

Author information

Authors and affiliations.

Information Technology Department, M H Saboo Siddik College of Engineering, Mumbai, India

Shaikh Ashfaq

Computer Science and Engineering Department, MIT School of Computing, MIT Art Design and Technology University, Loni Kalbhor, Pune, India

Pankaj Chandre, Shafi Pathan, Uday Mande & Madhukar Nimbalkar

Artificial Intelligence and Data Science Department, Vishwakarma Institute of Information Technology, Kondhwa, Pune, India

Parikshit Mahalle

Corresponding author

Correspondence to Pankaj Chandre .

Editor information

Editors and affiliations.

Center for Cyber Security and Cryptology, Sharda University, Greater Noida, Uttar Pradesh, India

Nihar Ranjan Roy

Department of Computer Science and Engineering, Institute of Technology, Nirma University, Ahmedabad, Gujarat, India

Sudeep Tanwar

Department of Computer Science Engineering, Shri Vishwakarma Skill University, Gurugram, Haryana, India

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper.

Ashfaq, S., Chandre, P., Pathan, S., Mande, U., Nimbalkar, M., Mahalle, P. (2024). Defending Against Vishing Attacks: A Comprehensive Review for Prevention and Mitigation Techniques. In: Roy, N.R., Tanwar, S., Batra, U. (eds) Cyber Security and Digital Forensics. REDCYSEC 2023. Lecture Notes in Networks and Systems, vol 896. Springer, Singapore. https://doi.org/10.1007/978-981-99-9811-1_33

DOI : https://doi.org/10.1007/978-981-99-9811-1_33

Published : 11 March 2024

Publisher Name : Springer, Singapore

Print ISBN : 978-981-99-9810-4

Online ISBN : 978-981-99-9811-1

eBook Packages : Engineering Engineering (R0)

COMMENTS

  1. Phishing Attacks: A Recent Comprehensive Study and a New Anatomy

    However, info-security professionals reported a higher frequency of all types of social engineering attacks year-on-year according to a report presented by Proofpoint. Spear phishing increased to 64% in 2018 from 53% in 2017, Vishing and/or SMishing increased to 49% from 45%, and USB attacks increased to 4% from 3%.

  2. Phishing Attacks: A Recent Comprehensive Study and a New Anatomy

    Phishing is an example of a highly effective form of cybercrime that enables criminals to deceive users and steal important data. Since the first reported phishing attack in 1990, it has been ...

  3. A systematic literature review on phishing website detection techniques

    Phishing is a social engineering attack (Paliath et al., 2020, Nakamura and Dobashi, 2019, Zabihimayvan and Doran, 2019) identified as the most common method used by cybercriminals to get access to an internet user's personal information such as credit card information, usernames, and passwords (Ramana et al., 2021, Faris and Yazid, 2021). Sometimes, attackers perform phishing attacks to ...

  4. How Good Are We at Detecting a Phishing Attack? Investigating the

    Phishing attacks have been the most common crime since 2020, with phishing incidents nearly doubling in frequency. Business Email Compromise attacks are the most common and amount to huge losses. These phishing attacks come in the form of a request that is urgent, important, and attention-seeking, and that often requires some form of payment.

  5. Mitigation strategies against the phishing attacks: A systematic

    The paper presents the outcomes of SLR conducted while focusing on four research questions. The paper advocates that technology-only solutions are never going to be enough to protect against attacks targeted toward human users, therefore, there is a need to consider the role and abilities of human users in the development of anti-phishing ...

  6. A COMPREHENSIVE STUDY OF PHISHING ATTACKS AND THEIR ...

    This research paper presents a comprehensive study of phishing attacks and their countermeasures. Phishing attacks are a major threat to individuals and organizations worldwide, and understanding ...

  7. Human Factors in Phishing Attacks: A Systematic Literature Review

    Abstract. Phishing is the fraudulent attempt to obtain sensitive information by disguising oneself as a trustworthy entity in digital communication. It is a type of cyber attack often successful because users are not aware of their vulnerabilities or are unable to understand the risks. This article presents a systematic literature review ...

  8. The COVID‐19 scamdemic: A survey of phishing attacks and their

    Still within the large body of initial research on phishing attacks and COVID‐19, other papers investigated a number of more specific issues. For example, the work in [ 20 ] focussed on challenges of the healthcare sector, by outlining why cyber‐attacks have been particularly problematic during COVID‐19 and by defining the ways in which ...

  9. (PDF) Study on Phishing Attacks

    Phishing is one such methodology used to acquire information. Phishing is a cyber crime in which emails, telephone, text messages, personally identifiable information ...

  10. A survey of phishing attack techniques, defence mechanisms and open

    Therefore, this paper presents a detailed analysis of phishing attack methods and defense techniques. This survey is presented in five parts. First, we discuss in detail the lifecycle of a phishing attack, its history, and the motivation behind it. Second, we present various distribution methods that are used to spread phishing attacks.

  11. An Evaluation and Comparison for Phishing Attack Detection using

    The persistent and evolving threat of phishing attacks demands effective and adaptive detection techniques. This research paper presents a comprehensive evaluation and comparison of various machine learning approaches to detect phishing attacks. We investigated five prominent algorithms: Logistic Regression, Support Vector Machine (SVM), K-Nearest Neighbors (K-NN), Naive Bayes, and Extreme ...

  12. Life-long phishing attack detection using continual learning

    This paper explores continual learning (CL) techniques for sustained phishing detection performance over time. To demonstrate this behavior, we collect phishing and benign samples for three ...

  13. A Systematic Literature Review on Phishing and Anti-Phishing Techniques

    ... to find out different types of phishing and anti-phishing techniques. The research study found that spear phishing, email spoofing, email manipulation and phone phishing are the most commonly used phishing techniques. On the other hand, according to the SLR, machine learning approaches have the highest accuracy of preventing ...

  14. A comprehensive survey of phishing: mediums, intended targets, attack

    Diverse phishing-related literature is available throughout various libraries. Some of the earliest works in this domain were presented by [16, 17], who were among the first to enlighten researchers on various aspects of phishing. Authors in [] have discussed different conventional phishing attack techniques employed by the threat actors, along with a methodology for preventing them.

  15. All About Phishing Exploring User Research through a Systematic

    these attacks use social engineering techniques to deceive end users, indicating the importance of user-focused studies to help prevent future attacks. We provide a detailed overview of phishing research that has focused on users by conducting a systematic literature review of peer-reviewed academic papers published in ACM Digital Library.

  16. An effective detection approach for phishing websites using URL and

    Phishing offenses are increasing, resulting in billions of dollars in loss [1]. In these attacks, users enter their critical information (i.e., credit card details, passwords, etc.) into the forged website which ...

  17. Phishing Attacks: A Recent Comprehensive Study and a New Anatomy

    anatomy of phishing which involves attack phases, attacker types, vulnerabilities, threats, targets, attack mediums, and attacking techniques. Moreover, the proposed anatomy will help readers understand the process lifecycle of a phishing attack, which in turn will increase awareness of these phishing attacks and the techniques being used;

  18. How Good Are We at Detecting a Phishing Attack ...

    Phishing attacks are on the increase. The fact that our ways of living, studying and working have drastically changed as a result of the COVID pandemic (i.e., almost everything being done online) has created many new cyber security concerns. In particular, with the move to remote working, the number of phishing emails threatening employees has increased. The 2020 Phishing Attack Landscape ...

  19. A comprehensive survey of AI-enabled phishing attacks detection

    This paper also presents the comparison of different studies detecting the phishing attack for each AI technique and examines the qualities and shortcomings of these methodologies. Furthermore, this paper provides a comprehensive set of current challenges of phishing attacks and future research direction in this domain.

  20. Phishing in Organizations: Findings from a Large-Scale and Long-Term Study

    To summarize, this paper makes the following contributions: 1) Extensive measurement study on human factors of phishing and phishing prevention in large organizations. 2) Supportive results for several previous research findings with improved ecological validity. 3) Contradicting findings that challenge the conclusions of

  21. A Systematic Review on Deep-Learning-Based Phishing Email Detection

    Phishing attacks are a growing concern for individuals and organizations alike, with the potential to cause significant financial and reputational damage. Traditional methods for detecting phishing attacks, such as blacklists and signature-based techniques, have limitations that have led to developing more advanced techniques. In recent years, machine learning and deep learning techniques have ...

  22. Defending against Phishing Attacks- Taxonomy of Methods, Current Issues

    of phishing attacks is then discussed in Section III. Section IV presents a taxonomy of defence solutions. Phishing attacks in the Internet of Things (IoT) are discussed in Section V. Current issues and challenges are discussed in Section VI. Finally, Section VII concludes the paper and discusses the scope for future research work. II. Phishing Attack ...

  23. Defending Against Vishing Attacks: A Comprehensive Review for

    The paper titled "A Systematic Literature Review on Phishing and Anti-Phishing Techniques" by Arshad et al. [] provides a comprehensive overview of the current state of research on phishing and anti-phishing techniques. The authors discuss the various kinds of phishing attacks, like spear phishing and whaling, and the dangers they could pose to people and businesses.