• Open access
  • Published: 10 August 2020

Using deep learning to solve computer security challenges: a survey

  • Yoon-Ho Choi 1,2,
  • Peng Liu 1,
  • Zitong Shang 1,
  • Haizhou Wang 1,
  • Zhilong Wang 1,
  • Lan Zhang 1,
  • Junwei Zhou 3 &
  • Qingtian Zou 1

Cybersecurity volume 3, Article number: 15 (2020)


Although using machine learning techniques to solve computer security challenges is not a new idea, the rapidly emerging Deep Learning technology has recently triggered a substantial amount of interest in the computer security community. This paper seeks to provide a dedicated review of the very recent research works on using Deep Learning techniques to solve computer security challenges. In particular, the review covers eight computer security problems being solved by applications of Deep Learning: security-oriented program analysis, defending return-oriented programming (ROP) attacks, achieving control-flow integrity (CFI), defending network attacks, malware classification, system-event-based anomaly detection, memory forensics, and fuzzing for software security.

Introduction

Using machine learning techniques to solve computer security challenges is not a new idea. For example, in 1998, Ghosh and others ( Ghosh et al. 1998 ) proposed to train a (traditional) neural-network-based anomaly detection scheme (i.e., detecting anomalous and unknown intrusions against programs); in 2003, Hu and others ( Hu et al. 2003 ) and Heller and others ( Heller et al. 2003 ) applied Support Vector Machines to anomaly detection schemes (e.g., detecting anomalous Windows registry accesses).

The machine-learning-based computer security research investigations during 1990-2010, however, have not been very impactful. For example, to the best of our knowledge, none of the machine learning applications proposed in ( Ghosh et al. 1998 ; Hu et al. 2003 ; Heller et al. 2003 ) has been incorporated into a widely deployed intrusion-detection commercial product.

Regarding why they were not very impactful, although researchers in the computer security community seem to have different opinions, the following remarks by Sommer and Paxson ( Sommer and Paxson 2010 ) (in the context of intrusion detection) have resonated with many researchers:

Remark A: “It is crucial to have a clear picture of what problem a system targets: what specifically are the attacks to be detected? The more narrowly one can define the target activity, the better one can tailor a detector to its specifics and reduce the potential for misclassifications.” ( Sommer and Paxson 2010 )

Remark B: “If one cannot make a solid argument for the relation of the features to the attacks of interest, the resulting study risks foundering on serious flaws.” ( Sommer and Paxson 2010 )

These insightful remarks, though well aligned with the machine learning techniques used by security researchers during 1990-2010, could become a less significant concern with Deep Learning (DL), a rapidly emerging machine learning technology, due to the following observations. First, Remark A implies that even if the same machine learning method is used, one algorithm employing a cost function that is based on a more specifically defined target attack activity could perform substantially better than another algorithm deploying a less specifically defined cost function. This could be a less significant concern with DL, since a few recent studies have shown that even if the target attack activity is not narrowly defined, a DL model could still achieve very high classification accuracy. Second, Remark B implies that if feature engineering is not done properly, the trained machine learning models could be plagued by serious flaws. This could be a less significant concern with DL, since many deep learning neural networks require less feature engineering than conventional machine learning techniques.

As stated in the NSCAI Interim Report for Congress (2019 ), “DL is a statistical technique that exploits large quantities of data as training sets for a network with multiple hidden layers, called a deep neural network (DNN). A DNN is trained on a dataset, generating outputs, calculating errors, and adjusting its internal parameters. Then the process is repeated hundreds of thousands of times until the network achieves an acceptable level of performance. It has proven to be an effective technique for image classification, object detection, speech recognition, and natural language processing–problems that challenged researchers for decades. By learning from data, DNNs can solve some problems much more effectively, and also solve problems that were never solvable before.”
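
The training loop quoted above (generate outputs, calculate errors, adjust internal parameters, repeat) can be made concrete with a minimal sketch. This is a generic illustration, not any system from the survey: a toy two-layer network trained on XOR with plain numpy; all sizes, the seed, and the learning rate are illustrative choices.

```python
import numpy as np

# Toy DNN training loop: forward pass (outputs), loss (errors),
# backprop (parameter adjustments), repeated many times.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(size=(2, 8)); b1 = np.zeros(8)
W2 = rng.normal(size=(8, 1)); b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
losses = []
for step in range(10000):
    h = np.tanh(X @ W1 + b1)          # hidden layer (forward pass)
    p = sigmoid(h @ W2 + b2)          # output layer
    loss = np.mean((p - y) ** 2)      # error on the training set
    losses.append(loss)
    # gradients via the chain rule, then a small parameter adjustment
    dp = 2 * (p - y) / len(X) * p * (1 - p)
    dW2 = h.T @ dp; db2 = dp.sum(0)
    dh = dp @ W2.T * (1 - h ** 2)
    dW1 = X.T @ dh; db1 = dh.sum(0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```

After training, the loss has dropped far below its initial value, which is exactly the "acceptable level of performance" criterion described in the quote.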

Now let’s take a high-level look at how DL could make it substantially easier to overcome the challenges identified by Sommer and Paxson ( Sommer and Paxson 2010 ). First, one major advantage of DL is that it makes learning algorithms less dependent on feature engineering. This characteristic of DL makes it easier to overcome the challenge indicated by Remark B. Second, another major advantage of DL is that it could achieve high classification accuracy with minimum domain knowledge. This characteristic of DL makes it easier to overcome the challenge indicated by Remark A.

Key observation. The above discussion indicates that DL could be a game changer in applying machine learning techniques to solving computer security challenges.

Motivated by this observation, this paper seeks to provide a dedicated review of the very recent research works on using Deep Learning techniques to solve computer security challenges. It should be noticed that since this paper aims to provide a dedicated review, non-deep-learning techniques and their security applications are out of the scope of this paper.

The remainder of the paper is organized as follows. In the “ A four-phase workflow framework can summarize the existing works in a unified manner ” section, we present a four-phase workflow framework which we use to summarize the existing works in a unified manner. In the “ A closer look at applications of deep learning in solving security-oriented program analysis challenges ” through “ A closer look at applications of deep learning in security-oriented fuzzing ” sections, we provide a review of the eight computer security problems being solved by applications of Deep Learning, respectively. In the “ Discussion ” section, we discuss certain similarities and dissimilarities among the existing works. In the “ Further areas of investigation ” section, we mention four further areas of investigation. In the “ Conclusion ” section, we conclude the paper.

A four-phase workflow framework can summarize the existing works in a unified manner

We found that a four-phase workflow framework can provide a unified way to summarize all the research works surveyed by us. In particular, we found that each work surveyed by us employs a particular workflow when using machine learning techniques to solve a computer security challenge, and we found that each workflow consists of two or more phases. By “a unified way”, we mean that every workflow surveyed by us is essentially an instantiation of a common workflow pattern which is shown in Fig.  1 .

figure 1

Overview of the four-phase workflow

Definitions of the four phases

The four phases, shown in Fig.  1 , are defined as follows. To make the definitions of the four phases more tangible, we use a running example to illustrate each of them. Phase I. (Obtaining the raw data)

In this phase, certain raw data are collected. Running Example: When Deep Learning is used to detect suspicious events in a Hadoop distributed file system (HDFS), the raw data are usually the events (e.g., a block is allocated, read, written, replicated, or deleted) that have happened to each block. Since these events are recorded in Hadoop logs, the log files hold the raw data. Since each event is uniquely identified by a particular (block ID, timestamp) tuple, we could simply view the raw data as n event sequences. Here n is the total number of blocks in the HDFS. For example, the raw data collected in Xu et al. (2009) in total consists of 11,197,954 events. Since 575,139 blocks were in the HDFS, there were 575,139 event sequences in the raw data, and on average each event sequence had 19 events. One such event sequence is shown as follows:

(an example event sequence: four raw log-entry events recorded for a single block)

Phase II.(Data preprocessing)

Both Phase II and Phase III aim to properly extract and represent the useful information held in the raw data collected in Phase I. Both Phase II and Phase III are closely related to feature engineering. A key difference between Phase II and Phase III is that Phase III is completely dedicated to representation learning, while Phase II is focused on all the information extraction and data processing operations that are not based on representation learning. Running Example: Let’s revisit the aforementioned HDFS. Each recorded event is described by unstructured text. In Phase II, the unstructured text is parsed to a data structure that shows the event type and a list of event variables in (name, value) pairs. Since there are 29 types of events in the HDFS, each event is represented by an integer from 1 to 29 according to its type. In this way, the aforementioned example event sequence can be transformed to:

22, 5, 5, 7
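
The Phase II transformation just described can be sketched in a few lines. This is a hedged illustration of the idea, not the actual Hadoop log parser: the log-line formats and the template-to-ID table below are hypothetical stand-ins chosen so the output matches the transformed sequence above.

```python
import re

# Illustrative event templates mapping log-line patterns to event-type IDs
# (the real HDFS parser has 29 templates; these three are made up).
TEMPLATES = {
    r"Received block (\S+) of size (\d+)": 22,
    r"Verification succeeded for (\S+)": 5,
    r"Deleting block (\S+)": 7,
}

def event_id(line):
    """Return the integer event-type ID for one unstructured log line."""
    for pattern, eid in TEMPLATES.items():
        if re.search(pattern, line):
            return eid
    return None  # unknown event type

lines = [
    "Received block blk_1 of size 67108864",
    "Verification succeeded for blk_1",
    "Verification succeeded for blk_1",
    "Deleting block blk_1",
]
sequence = [event_id(l) for l in lines]
# sequence == [22, 5, 5, 7]
```

In a real pipeline the regex capture groups would also populate the (name, value) pairs for each event's variables; here only the event type is kept, since that is what Phases III and IV consume.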

Phase III.(Representation learning)

As stated in Bengio et al. (2013) , representation learning means “learning representations of the data that make it easier to extract useful information when building classifiers or other predictors.” Running Example: Let’s revisit the same HDFS. Although DeepLog ( Du et al. 2017 ) directly employed one-hot vectors to represent the event types without representation learning, if we view an event type as a word in a structured language, one may actually use the word embedding technique to represent each event type. It should be noticed that the word embedding technique is a representation learning technique.
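
The two representations just contrasted can be sketched side by side for the 29 HDFS event types. Note the embedding table below is filled with random stand-in values; in practice its entries would be learned (e.g., by word2vec), which is the whole point of representation learning.

```python
import numpy as np

NUM_TYPES = 29  # event types are numbered 1..29

def one_hot(event_type):
    """DeepLog-style representation: sparse 29-dim indicator vector."""
    v = np.zeros(NUM_TYPES)
    v[event_type - 1] = 1.0
    return v

# Word-embedding-style representation: a dense lookup table.
# Random values here; a learned table would place related event
# types (e.g., read vs. write) near each other.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(NUM_TYPES, 8))

def embed(event_type):
    return embedding_table[event_type - 1]

# one_hot(22) is 29-dimensional and sparse; embed(22) is 8-dimensional
# and dense, and its geometry can carry learned semantic similarity.
```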

Phase IV.(Classifier learning)

This phase aims to build specific classifiers or other predictors through Deep Learning. Running Example: Let’s revisit the same HDFS. DeepLog ( Du et al. 2017 ) used Deep Learning to build a stacked LSTM neural network for anomaly detection. For example, let’s consider event sequence {22,5,5,5,11,9,11,9,11,9,26,26,26} in which each integer represents the event type of the corresponding event in the event sequence. Given a window size h = 4, the input sample and the output label pairs to train DeepLog will be: {22,5,5,5 → 11 }, {5,5,5,11 → 9 }, {5,5,11,9 → 11 }, and so forth. In the detection stage, DeepLog examines each individual event. It determines if an event is treated as normal or abnormal according to whether the event’s type is predicted by the LSTM neural network, given the history of event types. If the event’s type is among the top g predicted types, the event is treated as normal; otherwise, it is treated as abnormal.
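
The windowing and top-g detection rule described for DeepLog can be sketched directly. The LSTM itself is replaced here by a hypothetical `predict_topg` callable; in DeepLog that role is played by the trained stacked LSTM's output distribution over event types.

```python
def make_training_pairs(sequence, h):
    """Slide a window of size h to form (history, next-event-label) pairs."""
    return [(sequence[i:i + h], sequence[i + h])
            for i in range(len(sequence) - h)]

seq = [22, 5, 5, 5, 11, 9, 11, 9, 11, 9, 26, 26, 26]
pairs = make_training_pairs(seq, h=4)
# pairs[0] == ([22, 5, 5, 5], 11), pairs[1] == ([5, 5, 5, 11], 9), ...

def is_anomalous(history, event, predict_topg, g=9):
    """DeepLog's rule: an event is normal iff its type is among the
    top g types predicted from the history of event types."""
    return event not in predict_topg(history, g)
```

With a stub predictor that returns, say, `[11, 9]` for any history, event type 11 after `[22, 5, 5, 5]` is classified normal, while an unpredicted type such as 7 is flagged abnormal.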

Using the four-phase workflow framework to summarize some representative research works

In this subsection, we use the four-phase workflow framework to summarize two representative works for each security problem. System security includes many research sub-topics; however, not every topic is suitable for deep-learning-based methods, due to its intrinsic characteristics. Among the security research subjects that can be combined with deep learning, some have undergone intensive research in recent years, while others are just emerging. This paper mainly focuses on system security, so other mainstream research directions (e.g., deepfake detection) are out of scope. Accordingly, we choose five widely noticed research directions and three emerging research directions for our survey:

In security-oriented program analysis, malware classification (MC), system-event-based anomaly detection (SEAD), memory forensics (MF), and defending network attacks, deep learning based methods have already undergone intensive research.

In defending return-oriented programming (ROP) attacks, Control-flow integrity (CFI), and fuzzing, deep learning based methods are emerging research topics.

We select two representative works for each research topic in our survey. Our criteria for selecting papers include: 1) pioneer (one of the first papers in the field); 2) top venue (published in a top conference or journal); 3) novelty; 4) citation count (the paper is highly cited); 5) effectiveness (the paper reports strong results); 6) representativeness (the paper is a representative work for a branch of the research direction). Table  1 lists the reasons why we chose each paper, ordered by importance.

The summary is shown in Table  2 . There are three columns in the table. In the first column, we list the eight security problems: security-oriented program analysis, defending return-oriented programming (ROP) attacks, control-flow integrity (CFI), defending network attacks (NA), malware classification (MC), system-event-based anomaly detection (SEAD), memory forensics (MF), and fuzzing for software security. In the second column, we list the two very recent representative works for each security problem. In the “Summary” column, we sequentially describe how the four phases are deployed in each work; then, we list the evaluation results for each work in terms of accuracy (ACC), precision (PRC), recall (REC), F1 score (F1), false-positive rate (FPR), and false-negative rate (FNR), respectively.

Methodology for reviewing the existing works

Data representation (or feature engineering) plays an important role in solving security problems with Deep Learning. This is because data representation is a way to take advantage of human ingenuity and prior knowledge to extract and organize the discriminative information from the data. Many efforts in deploying machine learning algorithms in the security domain actually go into the design of preprocessing pipelines and data transformations that result in a representation of the data that supports effective machine learning.

In order to expand the scope and ease of applicability of machine learning in the security domain, it would be highly desirable to find a proper way to represent the data, one which can disentangle the different explanatory factors of variation behind the data. To let this survey adequately reflect the important role played by data representation, our review will focus on how the following three questions are answered by the existing works:

Question 1: Is Phase II pervasively done in the literature? When Phase II is skipped in a work, are there any particular reasons?

Question 2: Is Phase III employed in the literature? When Phase III is skipped in a work, are there any particular reasons?

Question 3: When solving different security problems, is there any commonality in terms of the (types of) classifiers learned in Phase IV? Among the works solving the same security problem, is there dissimilarity in terms of classifiers learned in Phase IV?

To group the Phase III methods used in different applications of Deep Learning for solving the same security problem, we introduce a classification tree as shown in Fig.  2 . The classification tree categorizes the Phase III methods in our selected survey works into four classes. First, class 1 includes the Phase III methods which do not consider representation learning. Second, class 2 includes the Phase III methods which consider representation learning but do not adopt it. Third, class 3 includes the Phase III methods which consider and adopt representation learning but do not compare the performance with other methods. Finally, class 4 includes the Phase III methods which consider and adopt representation learning and compare the performance with other methods.

figure 2

Classification tree for different Phase III methods. Here, consideration, adoption, and comparison indicate that a work considers Phase III, adopts Phase III, and makes comparison with other methods, respectively

In the remainder of this paper, we take a closer look at how each of the eight security problems is being solved by applications of Deep Learning in the literature.

A closer look at applications of deep learning in solving security-oriented program analysis challenges

In recent years, security-oriented program analysis has been widely used in software security. For example, symbolic execution and taint analysis are used to discover, detect, and analyze vulnerabilities in programs. Control flow analysis, data flow analysis, and pointer/alias analysis are important components when enforcing many security strategies, such as control flow integrity, data flow integrity, and dangling pointer elimination. Reverse engineering is used by defenders and attackers to understand the logic of a program without source code.

In security-oriented program analysis, there are many open problems, such as precise pointer/alias analysis, accurate and complete reverse engineering, complex constraint solving, program de-obfuscation, and so on. Some problems have been theoretically proven to be NP-hard, and others still need a great deal of human effort to solve. Each of them requires substantial domain knowledge and expert experience to develop better solutions. Essentially, the main challenge in solving them through traditional approaches lies in the sophisticated rules between the features and labels, which may change in different contexts. Therefore, on the one hand, it takes a large amount of human effort to develop rules to solve the problems; on the other hand, even the most experienced expert cannot guarantee completeness. Fortunately, deep learning methods are adept at finding relations between features and labels given a large amount of training data. They can quickly and comprehensively find these relations if the training samples are representative and effectively encoded.

In this section, we review four very recent representative works that use Deep Learning for security-oriented program analysis. We observed that they focus on different goals. Shin et al. designed a model ( Shin et al. 2015 ) to identify function boundaries. EKLAVYA ( Chua et al. 2017 ) was developed to learn function types. Gemini ( Xu et al. 2017 ) was proposed to detect similarity among functions. DEEPVSA ( Guo et al. 2019 ) was designed to learn the memory region of an indirect memory access from the code sequence. Among these works, we select two representative works ( Shin et al. 2015 ; Chua et al. 2017 ) and summarize the analysis results in detail in Table  2 .

Our review is centered around the three questions described in the “ Methodology for reviewing the existing works ” section. In the remainder of this section, we first provide a set of observations, then the indications, and finally some general remarks.

Key findings from a closer look

From a close look at the very recent applications using Deep Learning for solving security-oriented program analysis challenges, we observed the following:

Observation 3.1: All of the works in our survey used binary files as their raw data. Phase II in all of them had one similar and straightforward goal – extracting code sequences from the binary. The difference among them was that the code sequence was extracted directly from the binary file when solving static program analysis problems, while it was extracted from the program execution when solving dynamic program analysis problems.

Observation 3.2: Most data representation methods generally took the domain knowledge into account.

Most data representation methods took into account the domain knowledge, i.e., what kind of information they wanted to preserve when processing their data. Note that the feature selection has a wide influence on Phase II and Phase III, for example, on embedding granularities and representation learning methods. Gemini ( Xu et al. 2017 ) selected function-level features, while the other works in our survey selected instruction-level features. Specifically, all the works except Gemini ( Xu et al. 2017 ) vectorized code sequences at the instruction level.

Observation 3.3: To better support data representation for high performance, some works adopted representation learning.

For instance, DEEPVSA ( Guo et al. 2019 ) employed a representation learning method, i.e., bi-directional LSTM, to learn data dependency within instructions. EKLAVYA ( Chua et al. 2017 ) adopted a representation learning method, i.e., the word2vec technique, to extract inter-instruction information. It is worth noting that Gemini ( Xu et al. 2017 ) adopts the Structure2vec embedding network in its siamese architecture in Phase IV (see details in Observation 3.7). The Structure2vec embedding network learned information from an attributed control flow graph.
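
The core idea behind treating instructions as "words" and learning dense vectors from their contexts, as EKLAVYA does with word2vec, can be illustrated with a deliberately simplified stand-in: a co-occurrence matrix factorized with SVD. This is not EKLAVYA's actual pipeline; the two-sequence corpus and window size are made-up, and real systems use word2vec on millions of disassembled instructions. The point is that instructions appearing in similar contexts end up with similar vectors.

```python
import numpy as np

# Tiny hypothetical "corpus" of normalized instruction sequences.
corpus = [
    ["push ebp", "mov ebp,esp", "sub esp,0x10", "ret"],
    ["push ebp", "mov ebp,esp", "xor eax,eax", "ret"],
]
vocab = sorted({ins for seq in corpus for ins in seq})
idx = {ins: i for i, ins in enumerate(vocab)}

# Count co-occurrences within a +/-1 instruction window.
C = np.zeros((len(vocab), len(vocab)))
for seq in corpus:
    for i, ins in enumerate(seq):
        for j in (i - 1, i + 1):
            if 0 <= j < len(seq):
                C[idx[ins], idx[seq[j]]] += 1

# Factorize to get dense 3-dim instruction embeddings.
U, S, _ = np.linalg.svd(C, full_matrices=False)
vectors = U[:, :3] * S[:3]
```

Here `sub esp,0x10` and `xor eax,eax` occur in identical contexts (between `mov ebp,esp` and `ret`), so their co-occurrence rows, and hence their embeddings, coincide, which is the property downstream classifiers in Phase IV exploit.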

Observation 3.4: According to our taxonomy, most works in our survey were classified into class 4.

To compare Phase III methods, we introduced a classification tree with three layers, as shown in Fig.  2 , to group different works into four categories. The classification tree groups our surveyed works into four classes according to whether they considered representation learning, whether they adopted representation learning, and whether they compared their methods with others’, respectively, when designing their frameworks. According to our taxonomy, EKLAVYA ( Chua et al. 2017 ) and DEEPVSA ( Guo et al. 2019 ) were grouped into class 4 shown in Fig.  2 . Gemini’s work ( Xu et al. 2017 ) and Shin et al.’s work ( Shin et al. 2015 ) belonged to class 1 and class 2 shown in Fig.  2 , respectively.

Observation 3.5: All the works in our survey explain why they adopted or did not adopt a representation learning algorithm.

Two works in our survey adopted representation learning for different reasons: to enhance the model’s ability to generalize ( Chua et al. 2017 ); and to learn the dependency within instructions ( Guo et al. 2019 ). It is worth noting that Shin et al. did not adopt representation learning because they wanted to preserve an “attractive” feature of neural networks over other machine learning methods – simplicity. As they stated, “first, neural networks can learn directly from the original representation with minimal preprocessing (or “feature engineering”) needed.” and “second, neural networks can learn end-to-end, where each of its constituent stages are trained simultaneously in order to best solve the end goal.” Although Gemini ( Xu et al. 2017 ) did not adopt representation learning when processing their raw data, the Deep Learning model in its siamese structure consisted of two graph embedding networks and one cosine function.

Observation 3.6: The analysis results showed that a suitable representation learning method could improve the accuracy of Deep Learning models.

DEEPVSA ( Guo et al. 2019 ) designed a series of experiments to evaluate the effectiveness of its representation learning method. Combining with domain knowledge, EKLAVYA ( Chua et al. 2017 ) employed t-SNE plots and analogical reasoning to explain the effectiveness of its representation learning method in an intuitive way.

Observation 3.7: Various Phase IV methods were used.

In Phase IV, Gemini ( Xu et al. 2017 ) adopted a siamese architecture model which consisted of two Structure2vec embedding networks and one cosine function. The siamese architecture took two functions as input and produced a similarity score as output. The other three works ( Shin et al. 2015 ; Chua et al. 2017 ; Guo et al. 2019 ) adopted a bi-directional RNN, an RNN, and a bi-directional LSTM, respectively. Shin et al. adopted a bi-directional RNN because they wanted to combine both past and future information when making a prediction for the present instruction ( Shin et al. 2015 ). DEEPVSA ( Guo et al. 2019 ) adopted a bi-directional RNN to enable their model to infer memory regions in both forward and backward directions.
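
Why a bi-directional recurrent model suits per-position prediction can be shown with a minimal sketch: each position's feature vector concatenates a forward state (summarizing the past) and a backward state (summarizing the future). A toy tanh recurrence with random stand-in weights replaces the LSTM/RNN cells used in the actual works; dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, H = 6, 4, 5                 # sequence length, input dim, state dim
x = rng.normal(size=(T, D))       # e.g. embedded instruction bytes
Wf, Uf = rng.normal(size=(D, H)), rng.normal(size=(H, H))
Wb, Ub = rng.normal(size=(D, H)), rng.normal(size=(H, H))

fwd = np.zeros((T, H)); bwd = np.zeros((T, H))
h = np.zeros(H)
for t in range(T):                 # left-to-right pass: past context
    h = np.tanh(x[t] @ Wf + h @ Uf)
    fwd[t] = h
h = np.zeros(H)
for t in reversed(range(T)):       # right-to-left pass: future context
    h = np.tanh(x[t] @ Wb + h @ Ub)
    bwd[t] = h

# Per-position features combining both directions, as fed to a
# classification layer in the surveyed models.
states = np.concatenate([fwd, bwd], axis=1)   # shape (T, 2H)
```

Note that `bwd[0]` is nonzero even at the first position: it has absorbed information from the entire future of the sequence, which is exactly what a forward-only model cannot provide.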

The above observations seem to indicate the following indications:

Indication 3.1: Phase III is not always necessary.

Not all authors regard representation learning as a good choice, even though some experiments show that it can improve the final results. They value the simplicity of Deep Learning methods more, and hold that adopting representation learning weakens that simplicity.

Indication 3.2: Even though the ultimate objective of Phase III in the four surveyed works is to train a model with better accuracy, they have different specific motivations as described in Observation 3.5.

When authors choose representation learning, they usually try to convince readers of the effectiveness of their choice through empirical or theoretical analysis.

Indication 3.3: Observation 3.7 indicates that authors usually refer to domain knowledge when designing the architecture of a Deep Learning model.

For instance, the works we reviewed commonly adopt a bi-directional RNN when their prediction is partly based on future information in the data sequence.

Despite the effectiveness and agility of deep-learning-based methods, there are still challenges in developing schemes with high accuracy, due to the hierarchical data structures, abundant noise, and unbalanced data composition in program analysis. For instance, an instruction sequence, a typical data sample in program analysis, contains a three-level hierarchy: sequence–instruction–opcode/operand. To make things worse, each level may contain many different structures, e.g., one-operand instructions and multi-operand instructions, which makes it harder to encode the training data.

A closer look at applications of deep learning in defending ROP attacks

A return-oriented programming (ROP) attack is one of the most dangerous code reuse attacks; it allows attackers to launch control-flow hijacking attacks without injecting any malicious code. Rather, it leverages particular instruction sequences (called “gadgets”) widely existing in the program space to achieve Turing-complete attacks ( Shacham and et al. 2007 ). Gadgets are instruction sequences that end with a RET instruction; therefore, they can be chained together by specifying the return addresses on the program stack. Many traditional techniques could be used to detect ROP attacks, such as control-flow integrity (CFI, Abadi et al. (2009) ), but many of them either have a low detection rate or high runtime overhead. ROP payloads do not contain any code; in other words, analyzing an ROP payload without the context of the program’s memory dump is meaningless. Thus, the most popular way of detecting and preventing ROP attacks is control-flow integrity. The challenge after acquiring the instruction sequences is that it is hard to recognize whether the control flow is normal. Traditional methods use the control flow graph (CFG) to identify whether the control flow is normal, but attackers can design instruction sequences that follow the normal control flow defined by the CFG. In essence, it is very hard to design a CFG that excludes every single possible combination of instructions that can be used to launch ROP attacks. Therefore, data-driven methods could help eliminate such problems.

In this section, we review three very recent representative works that use Deep Learning for defending ROP attacks: ROPNN ( Li et al. 2018 ), HeNet ( Chen et al. 2018 ) and DeepCheck ( Zhang et al. 2019 ). ROPNN ( Li et al. 2018 ) aims to detect ROP attacks, HeNet ( Chen et al. 2018 ) aims to detect malware using CFI, and DeepCheck ( Zhang et al. 2019 ) aims at detecting all kinds of code reuse attacks.

Specifically, ROPNN is designed to protect one single program at a time, and its training data are generated from real-world programs along with their execution. First, after the memory dumps of programs are created, it generates benign and malicious data by “chaining up” normally executed instruction sequences and by “chaining up” gadgets with the help of a gadget generation tool, respectively. Each data sample is a byte-level instruction sequence labeled as “benign” or “malicious”. Second, ROPNN is trained using both malicious and benign data. Third, the trained model is deployed on a target machine. After the protected program starts, the executed instruction sequences are traced and fed into the trained model, and the protected program is terminated once the model finds the instruction sequences are likely to be malicious.

HeNet is also proposed to protect a single program. Its malicious and benign data are generated by collecting trace data through Intel PT from malware and normal software, respectively. Besides, HeNet preprocesses its dataset and shapes each data sample in the format of an image, so that it can apply transfer learning from a model pre-trained on ImageNet. Then, HeNet is trained and deployed on machines with Intel PT support to collect and classify the program’s execution trace online.
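
HeNet's image-style preprocessing, each trace byte becoming one pixel with intensity 0-255 and the stream being sliced into a 2-D grid, can be sketched as follows. The row width and zero-padding policy here are illustrative assumptions, not HeNet's exact choices.

```python
import numpy as np

def trace_to_image(trace_bytes, width=16):
    """Turn a raw byte trace into a 2-D uint8 'image' (one pixel per byte)."""
    buf = np.frombuffer(trace_bytes, dtype=np.uint8)
    pad = (-len(buf)) % width                       # zero-pad the last row
    buf = np.concatenate([buf, np.zeros(pad, dtype=np.uint8)])
    return buf.reshape(-1, width)

img = trace_to_image(bytes(range(40)))
# 40 bytes at width 16 -> a 3x16 image; img[0, 5] holds byte value 5
```

Once the trace is in image form, it can be resized and fed to an ImageNet-pretrained convolutional backbone, which is what makes the transfer-learning step described above possible.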

The training data for DeepCheck are acquired from CFGs, which are constructed by disassembling the programs and using information from Intel PT. After the CFG of a protected program is constructed, the authors sample benign instruction sequences by chaining up basic blocks that are connected by edges, and sample malicious instruction sequences by chaining up basic blocks that are not connected by edges. Although a CFG is needed during training, there is no need to construct a CFG after the training phase. After deployment, instruction sequences are constructed by leveraging Intel PT on the protected program; the trained model then classifies whether the instruction sequences are malicious or benign.

We observed that none of the works considered Phase III, so all of them belong to class 1 according to our taxonomy, as shown in Fig.  2 . The analysis results of ROPNN ( Li et al. 2018 ) and HeNet ( Chen et al. 2018 ) are shown in Table  2 . Also, we observed that the three works had different goals.

From a close look at the very recent applications using Deep Learning for defending return-oriented programming attacks, we observed the following:

Observation 4.1: All the works ( Li et al. 2018 ; Zhang et al. 2019 ; Chen et al. 2018 ) in this survey focused on data generation and acquisition.

In ROPNN ( Li et al. 2018 ), malicious samples (gadget chains) were generated using an automated gadget generator (i.e., ROPGadget ( Salwant 2015 )) and a CPU emulator (i.e., Unicorn ( Unicorn-The ultimate CPU emulator 2015 )). ROPGadget was used to extract instruction sequences that could be used as gadgets from a program, and Unicorn was used to validate the instruction sequences. Corresponding benign samples (gadget-chain-like instruction sequences) were generated by disassembling a set of programs. DeepCheck ( Zhang et al. 2019 ) refers to the key idea of control-flow integrity ( Abadi et al. 2009 ). It generates the program’s run-time control flow through a new feature of Intel CPUs (Intel Processor Trace), then compares the run-time control flow with the program’s control-flow graph (CFG) generated through static analysis. Benign instruction sequences are those within the program’s CFG, and vice versa. In HeNet ( Chen et al. 2018 ), the program’s execution trace was extracted in a similar way as in DeepCheck. Then, each byte was transformed into a pixel with an intensity between 0-255. Known malware samples and benign software samples were used to generate malicious data and benign data, respectively.

Observation 4.2: None of the ROP works in this survey deployed Phase III.

Both ROPNN ( Li et al. 2018 ) and DeepCheck ( Zhang et al. 2019 ) used binary instruction sequences for training. In ROPNN ( Li et al. 2018 ), one byte was used as the very basic element for data pre-processing. Bytes were formed into one-hot matrices and flattened for a 1-dimensional convolutional layer. In DeepCheck ( Zhang et al. 2019 ), a half-byte was used as the basic unit. Each half-byte (4 bits) was transformed into its decimal form, ranging from 0 to 15, as the basic element of the input vector, and then fed into a fully-connected input layer. On the other hand, HeNet ( Chen et al. 2018 ) used a different kind of data. By the time this survey was drafted, the source code of HeNet was not available to the public, and thus the details of its data pre-processing could not be investigated. However, it is still clear that HeNet used binary branch information collected from Intel PT rather than binary instructions. In HeNet, each byte was converted to a decimal number ranging from 0 to 255. Byte sequences were sliced and formed into image sequences (each pixel representing one byte) for a fully-connected input layer.
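
The two input encodings just described can be sketched side by side. This is an illustration of the encoding schemes as described above, not the projects' actual code; the sample byte string is arbitrary.

```python
import numpy as np

def ropnn_encode(instruction_bytes):
    """ROPNN-style: each byte becomes a 256-dim one-hot row; the matrix
    is flattened into the input vector for a 1-D convolutional layer."""
    m = np.zeros((len(instruction_bytes), 256))
    m[np.arange(len(instruction_bytes)), list(instruction_bytes)] = 1.0
    return m.flatten()

def deepcheck_encode(instruction_bytes):
    """DeepCheck-style: each half-byte (nibble) becomes a decimal value
    0-15, fed to a fully-connected input layer."""
    out = []
    for b in instruction_bytes:
        out.extend([b >> 4, b & 0x0F])  # high nibble, then low nibble
    return out

v = ropnn_encode(b"\x55\x8b")            # two bytes -> 512-dim vector
nibbles = deepcheck_encode(b"\x55\x8b")  # [5, 5, 8, 11]
```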

Observation 4.3: Fully-connected neural networks were widely used.

Only ROPNN ( Li et al. 2018 ) used a 1-dimensional convolutional neural network (CNN) to extract features. Both HeNet ( Chen et al. 2018 ) and DeepCheck ( Zhang et al. 2019 ) used fully-connected neural networks (FCNs). None of the works used recurrent neural networks (RNNs) or their variants.

Indication 4.1: It seems that one of the most important factors in the ROP problem is feature selection and data generation.

All three works use very different methods to collect/generate data, and all the authors provide very strong evidence and/or arguments to justify their approaches. ROPNN ( Li et al. 2018 ) was trained on malicious and benign instruction sequences. However, there is no clear boundary between benign instruction sequences and malicious gadget chains. This weakness may impair the performance when applying ROPNN to real-world ROP attacks. As opposed to ROPNN, DeepCheck ( Zhang et al. 2019 ) utilizes the CFG to generate training basic-block sequences. However, since the malicious basic-block sequences are generated by randomly connecting nodes without edges, it is not guaranteed that all the malicious basic-block sequences are executable. HeNet ( Chen et al. 2018 ) generates its training data from malware. Technically, HeNet could be used to detect any binary exploit, but its experiment focuses on ROP attacks and achieves 100% accuracy. This shows that the source of data in the ROP problem does not need to be related to ROP attacks to produce very impressive results.
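DeepCheck’s negative-sample generation (connecting basic blocks that share no CFG edge) might look roughly like the sketch below; the CFG encoding and the rejection-sampling policy are our assumptions for illustration, not the paper’s exact procedure.

```python
import random

def sample_noncfg_sequence(nodes, edges, length, rng):
    """Build a basic-block sequence whose consecutive pairs are never
    CFG edges; such a sequence cannot occur in a legal run, so it is
    labeled malicious (rejection sampling is an assumed policy)."""
    edge_set = set(edges)
    seq = [rng.choice(nodes)]
    while len(seq) < length:
        candidate = rng.choice(nodes)
        if (seq[-1], candidate) not in edge_set:
            seq.append(candidate)
    return seq

nodes = ["B0", "B1", "B2", "B3"]
edges = [("B0", "B1"), ("B1", "B2"), ("B2", "B3")]
seq = sample_noncfg_sequence(nodes, edges, length=5, rng=random.Random(7))
```

Note that the sketch also exposes the caveat discussed above: nothing checks that the sampled blocks can actually execute back-to-back.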

Indication 4.2: Representation learning does not seem critical when solving ROP problems using Deep Learning.

Minimal processing of data in binary form seems to be enough to transform the data into a representation that is suitable for neural networks. Certainly, it is also possible to represent the binary instructions at a higher level, such as opcodes, or to use embedding learning. However, as stated in ( Li et al. 2018 ), it appears that the performance would not change much by doing so. The only benefit of representing input data at a higher level is to reduce irrelevant information, but it seems that the neural network by itself is good enough at extracting features.

Indication 4.3: Different neural network architectures do not have much influence on the effectiveness of defending against ROP attacks.

Both HeNet ( Chen et al. 2018 ) and DeepCheck ( Zhang et al. 2019 ) utilize standard DNNs and achieved comparable results on ROP problems. One can infer that the input data can be easily processed by neural networks, and that the features can be easily detected after proper pre-processing.

It is not surprising that researchers are not very interested in representation learning for ROP problems, as stated in Observation 4.1. Since ROP attacks focus on gadget chains, it is straightforward for researchers to choose the gadgets as their training data directly. It is easy to map the data into a numerical representation with minimal processing. For example, one can map a binary executable to its hexadecimal ASCII representation, which can be a good representation for a neural network.
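The hexadecimal ASCII mapping mentioned above takes only a couple of lines; the sample bytes are an arbitrary x86-64 function prologue chosen for illustration.

```python
import binascii

def to_hex_tokens(code: bytes):
    """Render raw machine code as hexadecimal ASCII, one
    two-character token per byte."""
    hex_str = binascii.hexlify(code).decode("ascii")
    return [hex_str[i:i + 2] for i in range(0, len(hex_str), 2)]

tokens = to_hex_tokens(b"\x55\x48\x89\xe5")  # push rbp; mov rbp, rsp
```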

Instead, researchers focus more on data acquisition and generation. In ROP problems, the amount of data is very limited. Unlike malware and logs, ROP payloads normally contain only addresses rather than code, which carry no information unless the instructions at the corresponding addresses are also provided. It is thus meaningless to collect the payloads alone. To the best of our knowledge, all the previous works pick instruction sequences rather than payloads as their training data, even though they are harder to collect.

Even though Deep Learning based methods no longer face the challenge of designing a very complex, fine-grained CFG, they suffer from a limited number of data sources. Generally, Deep Learning based methods require lots of training data. However, real-world malicious data for ROP attacks is very hard to find because, compared with benign data, malicious data needs to be carefully crafted, and there is no existing database that collects all the ROP attacks. Without a sufficiently representative training set, the accuracy of the trained model cannot be guaranteed.

A closer look at applications of deep learning in achieving CFI

The basic ideas of control-flow integrity (CFI) techniques, proposed by Abadi in 2005 ( Abadi et al. 2009 ), can be dated back to 2002, when Vladimir Kiriansky and his fellow researchers proposed an idea called program shepherding ( Kiriansky et al. 2002 ), a method of monitoring the execution flow of a program at run time by enforcing security policies. The goal of CFI is to detect and prevent control-flow hijacking attacks by restricting every critical control-flow transfer to a set that can only appear in correct program executions, according to a pre-built CFG. Traditional CFI techniques typically leverage knowledge gained from either dynamic or static analysis of the target program, combined with code instrumentation methods, to ensure the program runs on a correct track.

However, the problems of traditional CFI are: (1) existing CFI implementations are not compatible with some important code features ( Xu et al. 2019 ); (2) CFGs generated by static, dynamic or combined analysis cannot always be precise and complete due to some open problems ( Horwitz 1997 ); (3) there always exists a certain level of compromise among accuracy, performance overhead, and other important properties ( Tan and Jaeger 2017 ; Wang and Liu 2019 ). Recent research has proposed to apply Deep Learning to detecting control-flow violations. The results show that, compared with traditional CFI implementations, security coverage and scalability were enhanced in this fashion ( Yagemann et al. 2019 ). Therefore, we argue that Deep Learning could be another approach that deserves more attention from CFI researchers who aim at achieving control-flow integrity more efficiently and accurately.

In this section, we will review the three very recent representative papers that use Deep Learning for achieving CFI. Among the three, two representative papers ( Yagemann et al. 2019 ; Phan et al. 2017 ) are already summarized phase-by-phase in Table  2 . We refer interested readers to Table  2 for a concise overview of those two papers.

Our review will be centered around the three questions described in Section 3 . In the remainder of this section, we will first provide a set of observations, then the indications, and finally some general remarks.

From a close look at the very recent applications using Deep Learning for achieving control-flow integrity, we made the following observations:

Observation 5.1: None of the related works achieves preventive Footnote 1 defense against control-flow violations.

After a thorough literature search, we observed that security researchers are quite behind the trend of applying Deep Learning techniques to solve security problems. We found only one paper that uses Deep Learning techniques to directly enhance the performance of CFI ( Yagemann et al. 2019 ). This paper leveraged Deep Learning to detect document malware by checking programs’ execution traces generated by hardware. Specifically, the CFI violations were checked in an offline mode. So far, no work has realized Just-In-Time checking of a program’s control flow.

In order to provide more insightful results, in this section we do not narrow our focus to CFI works detecting attacks at run time, but extend our scope to papers that make good use of control-flow-related data combined with Deep Learning techniques ( Phan et al. 2017 ; Nguyen et al. 2018 ). In one work, researchers used a self-constructed instruction-level CFG to detect program defects ( Phan et al. 2017 ). In another work, researchers used a lazy-binding CFG to detect sophisticated malware ( Nguyen et al. 2018 ).

Observation 5.2: Diverse raw data were used for evaluating CFI solutions.

In all surveyed papers, two kinds of control-flow-related data are used: program instruction sequences and CFGs. Barnum et al. ( Yagemann et al. 2019 ) employed statically and dynamically generated instruction sequences acquired by program disassembly and Intel ® Processor Trace. CNNoverCFG ( Phan et al. 2017 ) used a self-designed algorithm to construct an instruction-level control-flow graph. Minh Hai Nguyen et al. ( Nguyen et al. 2018 ) used their proposed lazy-binding CFG to reflect the behavior of malware DEC.

Observation 5.3: All the papers in our survey adopted Phase II.

All the related papers in our survey employed Phase II to process their raw data before sending them into Phase III. In Barnum ( Yagemann et al. 2019 ), the instruction sequences from program run-time tracing were sliced into basic blocks. Then, each basic block was assigned a unique basic-block ID (BBID). Finally, due to the nature of control-flow hijacking attacks, the sequences ending with an indirect branch instruction (e.g., indirect call/jump, return, and so on) were selected as the training data. In CNNoverCFG ( Phan et al. 2017 ), each instruction in the CFG was labeled with its attributes from multiple perspectives, such as opcode, operands, and the function it belongs to. The training data are sequences generated by traversing the attributed control-flow graph. Nguyen and others ( Nguyen et al. 2018 ) converted the lazy-binding CFG to the corresponding adjacency matrix and treated the matrix as an image for their training data.
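Barnum’s slicing-plus-BBID step can be approximated as below. The branch mnemonics and the toy trace are illustrative assumptions; the real system operates on Intel PT data rather than textual mnemonics.

```python
def to_bbid_sequence(instructions, branch_ops=("jmp", "call", "ret")):
    """Cut an instruction stream into basic blocks at branch
    instructions and map each distinct block to a unique BBID."""
    blocks, current = [], []
    for ins in instructions:
        current.append(ins)
        if ins.split()[0] in branch_ops:  # block ends at a branch
            blocks.append(tuple(current))
            current = []
    if current:                           # trailing partial block
        blocks.append(tuple(current))
    ids = {}
    return [ids.setdefault(b, len(ids)) for b in blocks]

trace = ["mov eax, 1", "ret", "mov eax, 1", "ret", "xor eax, eax", "jmp 0x40"]
bbids = to_bbid_sequence(trace)  # repeated blocks share one BBID
```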

Observation 5.4: None of the papers in our survey adopted Phase III. Instead, they directly used a numerical representation as their training data. Specifically, Barnum ( Yagemann et al. 2019 ) grouped instructions into basic blocks, then represented the basic blocks with uniquely assigned IDs. In CNNoverCFG ( Phan et al. 2017 ), each instruction in the CFG was represented by a vector associated with its attributes. Nguyen and others directly used the hashed value of a bit-string representation.

Observation 5.5: Various Phase IV models were used. Barnum ( Yagemann et al. 2019 ) utilized BBID sequences to monitor the execution flow of the target program, which are sequence-type data. Therefore, they chose an LSTM architecture to better learn the relationship between instructions. In the other two papers ( Phan et al. 2017 ; Nguyen et al. 2018 ), the authors trained a CNN and a directed-graph-based CNN to extract information from the control-flow graph and the image, respectively.
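The CFG-to-image conversion used by Nguyen and others amounts to building an adjacency matrix; the node names and the 0/1 pixel values below are simplifications for illustration.

```python
def cfg_to_matrix(nodes, edges):
    """Convert a CFG (node list plus edge list) into its adjacency
    matrix; each cell acts as a pixel when the matrix is treated
    as an image."""
    index = {n: i for i, n in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for src, dst in edges:
        matrix[index[src]][index[dst]] = 1
    return matrix

matrix = cfg_to_matrix(["a", "b", "c"], [("a", "b"), ("b", "c"), ("a", "c")])
```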

Indication 5.1: None of the existing works achieves Just-In-Time CFI violation detection.

It is still a challenge to tightly embed a Deep Learning model in program execution. All existing works adopted lazy checking: checking the program’s execution trace after its execution.

Indication 5.2: There is no unified opinion on how to generate malicious samples.

Data are hard to collect for control-flow hijacking attacks. Researchers must carefully craft malicious samples. It is not clear whether such “handcrafted” samples can reflect the nature of control-flow hijacking attacks.

Indication 5.3: The choice of methods in Phase II is based on researchers’ security domain knowledge.

The strength of using deep learning to solve CFI problems is that it avoids the complicated process of developing algorithms to build acceptable CFGs for the protected programs. Compared with traditional approaches, a DL-based method spares the CFI designer from studying the language features of the targeted program and also avoids an open problem (pointer analysis) in control-flow analysis. Therefore, DL-based CFI provides a more generalized, scalable, and secure solution. However, since using DL for the CFI problem is still at an early stage, it is unclear which kinds of control-flow-related data are more effective in this research area. Additionally, applying DL to real-time control-flow violation detection remains an untouched area and needs further research.

A closer look at applications of deep learning in defending network attacks

Network security is becoming more and more important as we depend increasingly on networks for our daily lives, work, and research. Common network attack types include probe, denial of service (DoS), remote-to-local (R2L), etc. Traditionally, people try to detect those attacks using signatures, rules, and unsupervised anomaly detection algorithms. However, signature-based methods can be easily fooled by slightly changing the attack payload; rule-based methods need experts to regularly update the rules; and unsupervised anomaly detection algorithms tend to raise many false positives. Recently, people have been trying to apply Deep Learning methods to network attack detection.

In this section, we will review the very recent seven representative works that use Deep Learning for defending network attacks. Millar et al. (2018 ); Varenne et al. (2019 ); Ustebay et al. (2019 ) build neural networks for multi-class classification, whose class labels include one benign label and multiple malicious labels for different attack types. Zhang et al. (2019 ) ignores normal network activities and proposes a parallel cross convolutional neural network (PCCN) to classify the type of malicious network activities. Yuan et al. (2017 ) applies Deep Learning to detecting a specific attack type, the distributed denial of service (DDoS) attack. Yin et al. (2017 ); Faker and Dogdu (2019 ) explore both binary classification and multi-class classification of benign and malicious activities. Among these seven works, we select two representative works ( Millar et al. 2018 ; Zhang et al. 2019 ) and summarize the main aspects of their approaches regarding whether the four phases exist in their works, and what exactly is done in each phase that exists. We direct interested readers to Table  2 for a concise overview of these two works.

From a close look at the very recent applications using Deep Learning for solving network attack challenges, we made the following observations:

Observation 6.1: All the seven works in our survey used public datasets, such as UNSW-NB15 ( Moustafa and Slay 2015 ) and CICIDS2017 ( IDS 2017 Datasets 2019 ).

The public datasets were all generated in test-bed environments, with unbalanced simulated benign and attack activities. For the attack activities, the dataset providers launched multiple types of attacks, and the amounts of malicious data for those attack types were also unbalanced.

Observation 6.2: The public datasets were given in one of two data formats, i.e., PCAP and CSV.

One was raw PCAP or parsed CSV format containing network-packet-level features; the other was also CSV format, containing network-flow-level features, which summarize statistical information over many network packets. Out of all the seven works, ( Yuan et al. 2017 ; Varenne et al. 2019 ) used packet information as raw inputs, ( Yin et al. 2017 ; Zhang et al. 2019 ; Ustebay et al. 2019 ; Faker and Dogdu 2019 ) used flow information as raw inputs, and ( Millar et al. 2018 ) explored both cases.

Observation 6.3: In order to parse the raw inputs, preprocessing methods, including one-hot vectors for categorical texts, normalization on numeric data, and removal of unused features/data samples, were commonly used.

Commonly removed features include IP addresses and timestamps. Faker and Dogdu (2019 ) also removed port numbers from the used features. By doing this, they claimed that they could “avoid over-fitting and let the neural network learn characteristics of packets themselves”. One outlier was that, when using packet-level features in one experiment, ( Millar et al. 2018 ) blindly chose the first 50 bytes of each network packet, without any feature extraction process, and fed them into the neural network.
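A minimal sketch of these common Phase II steps is shown below; the field names and the single categorical feature are illustrative and do not reflect the actual schemas of UNSW-NB15 or CICIDS2017.

```python
def preprocess(flows, drop=("src_ip", "dst_ip", "timestamp")):
    """Drop identifying fields, one-hot encode a categorical field,
    and min-max normalize numeric fields (field names are assumed)."""
    protocols = sorted({f["proto"] for f in flows})
    numeric_keys = [k for k in flows[0] if k not in drop and k != "proto"]
    lo = {k: min(f[k] for f in flows) for k in numeric_keys}
    hi = {k: max(f[k] for f in flows) for k in numeric_keys}
    vectors = []
    for f in flows:
        onehot = [1 if f["proto"] == p else 0 for p in protocols]
        nums = [(f[k] - lo[k]) / ((hi[k] - lo[k]) or 1) for k in numeric_keys]
        vectors.append(onehot + nums)
    return vectors

flows = [
    {"src_ip": "10.0.0.1", "timestamp": 1, "dst_ip": "10.0.0.9",
     "proto": "tcp", "bytes": 100, "pkts": 2},
    {"src_ip": "10.0.0.2", "timestamp": 2, "dst_ip": "10.0.0.9",
     "proto": "udp", "bytes": 300, "pkts": 6},
]
vecs = preprocess(flows)
```

Dropping identifying fields before encoding mirrors the rationale quoted above: the network should learn traffic characteristics, not endpoint identities.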

Observation 6.4: Using image representation improved the performance of security solutions using Deep Learning.

After preprocessing the raw data, ( Zhang et al. 2019 ) transformed the data into an image representation, while ( Yuan et al. 2017 ; Varenne et al. 2019 ; Faker and Dogdu 2019 ; Ustebay et al. 2019 ; Yin et al. 2017 ) directly used the original vectors as input data. Also, ( Millar et al. 2018 ) explored both cases and reported better performance using the image representation.
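One common way to convert a feature vector into an image, assumed here for illustration (the surveyed papers do not all specify their exact layouts), scales normalized features to 0-255 intensities and pads to a square grid:

```python
import math

def vector_to_image(features):
    """Turn a normalized (0..1) feature vector into a square
    grayscale 'image': scale to 0-255 and zero-pad to the next
    perfect square."""
    side = math.ceil(math.sqrt(len(features)))
    pixels = [int(x * 255) for x in features]
    pixels += [0] * (side * side - len(pixels))
    return [pixels[i * side:(i + 1) * side] for i in range(side)]

img = vector_to_image([0.0, 0.5, 1.0])  # 3 features -> 2x2 image
```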

Observation 6.5: None of the seven surveyed works considered representation learning.

All the seven surveyed works belonged to class 1 shown in Fig.  2 . They either directly fed the processed vectors into the neural networks, or changed the representation without explanation. One work ( Millar et al. 2018 ) provided a comparison of two different representations (vectors and images) for the same type of raw input. However, the other works applied different preprocessing methods in Phase II. Since the different preprocessing methods generated different feature spaces, it was difficult to compare the experimental results.

Observation 6.6: Binary classification models showed better results in most experiments.

Among all the seven surveyed works, ( Yuan et al. 2017 ) focused on one specific attack type and only did binary classification to decide whether the network traffic was benign or malicious. ( Millar et al. 2018 ; Ustebay et al. 2019 ; Zhang et al. 2019 ; Varenne et al. 2019 ) included more attack types and did multi-class classification to classify the type of malicious activities, and ( Yin et al. 2017 ; Faker and Dogdu 2019 ) explored both cases. As for multi-class classification, the accuracy for some classes was good, while the accuracy for other classes, usually classes with much fewer data samples, suffered degradations of up to 20%.

Observation 6.7: Data representation influenced the choice of neural network model.

Indication 6.1: All works in our survey adopt some kind of preprocessing method in Phase II, because the raw data provided in the public datasets are either not ready for neural networks, or of too low quality to be directly used as data samples.

Preprocessing methods can increase neural network performance by improving the quality of the data samples. Furthermore, by reducing the feature space, pre-processing can also improve the efficiency of neural network training and testing. Thus, Phase II should not be skipped. If Phase II is skipped, the performance of the neural network is expected to drop considerably.

Indication 6.2: Although Phase III is not employed in any of the seven surveyed works, none of them explains the reason. Also, none of them takes representation learning into consideration.

Indication 6.3: Because no work uses representation learning, its effectiveness is not well studied.

Among the other factors, the choice of pre-processing methods seems to have the largest impact, because it directly affects the data samples fed to the neural network.

Indication 6.4: There is no guarantee that CNN also works well on images converted from network features.

Some works that use the image data representation use CNNs in Phase IV. Although CNNs have been proven to work well on image classification problems in recent years, there is no guarantee that CNNs also work well on images converted from network features.

From the observations and indications above, we hereby present two recommendations: (1) Researchers can try to generate their own datasets for the specific network attack they want to detect. As stated, the public datasets have a highly unbalanced number of data samples across classes. Doubtlessly, such imbalance is the nature of the real-world network environment, in which normal activities are the majority, but it is not good for Deep Learning. ( Varenne et al. 2019 ) tries to solve this problem by oversampling the malicious data, but it is better to start with a balanced data set. (2) Representation learning should be taken into consideration. Some possible ways to apply representation learning include: (a) applying the word2vec method to packet binaries, categorical numbers and texts; (b) using K-means as a one-hot vector representation instead of randomly encoding texts. We suggest that any change of data representation be justified by explanations or comparison experiments.
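The oversampling idea in ( Varenne et al. 2019 ) amounts to duplicating minority-class samples; the naive random version below is our sketch for illustration, not their exact procedure.

```python
import random

def oversample(samples, labels, rng):
    """Duplicate minority-class samples at random until every class
    matches the majority-class count."""
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    target = max(len(group) for group in by_class.values())
    out = []
    for y, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out.extend((s, y) for s in group + extra)
    return out

balanced = oversample(["a", "b", "c", "d"],
                      ["benign", "benign", "benign", "dos"],
                      rng=random.Random(0))
```

As the recommendation notes, this only rebalances class counts; starting from a balanced capture avoids training on many duplicated samples.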

One critical challenge in this field is the lack of high-quality datasets suitable for applying deep learning. Also, there is no agreement on how to incorporate domain knowledge into training deep learning models for network security problems. Researchers have been using different pre-processing methods, data representations, and model types, but few of them provide sufficient explanation of why those methods/representations/models were chosen, especially for data representation.

A closer look at applications of deep learning in malware classification

The goal of malware classification is to identify malicious behaviors in software using static and dynamic features such as control-flow graphs and system API calls. Malware and benign programs can be collected from open datasets and online websites. Both industry and academia have provided approaches to detect malware with static and dynamic analyses. Traditional methods such as behavior-based signatures, dynamic taint tracking, and static data-flow analysis require experts to manually investigate unknown files. However, those hand-crafted signatures are not sufficiently effective because attackers can rewrite and reorder the malware. Fortunately, neural networks can automatically detect large-scale malware variants with superior classification accuracy.

In this section, we will review the very recent twelve representative works that use Deep Learning for malware classification ( De La Rosa et al. 2018 ; Saxe and Berlin 2015 ; Kolosnjaji et al. 2017 ; McLaughlin et al. 2017 ; Tobiyama et al. 2016 ; Dahl et al. 2013 ; Nix and Zhang 2017 ; Kalash et al. 2018 ; Cui et al. 2018 ; David and Netanyahu 2015 ; Rosenberg et al. 2018 ; Xu et al. 2018 ). De La Rosa et al. (2018 ) selects three different kinds of static features to classify malware. Saxe and Berlin (2015 ); Kolosnjaji et al. (2017 ); McLaughlin et al. (2017 ) also use static features from PE files to classify programs. ( Tobiyama et al. 2016 ) extracts behavioral feature images using an RNN to represent the behaviors of the original programs. ( Dahl et al. 2013 ) transforms malicious behaviors using representation learning without a neural network. Nix and Zhang (2017 ) explores an RNN model with API call sequences as programs’ features. Cui et al. (2018 ); Kalash et al. (2018 ) skip Phase II by directly transforming the binary file into an image to classify the file. ( David and Netanyahu 2015 ; Rosenberg et al. 2018 ) apply dynamic features to analyze malicious behaviors. Xu et al. (2018 ) combines static and dynamic features to represent programs’ features. Among these works, we select two representative works ( De La Rosa et al. 2018 ; Rosenberg et al. 2018 ) and identify the four phases in their works, as shown in Table  2 .

From a close look at the very recent applications using Deep Learning for solving malware classification challenges, we made the following observations:

Observation 7.1: Features selected in malware classification were grouped into three categories: static features, dynamic features, and hybrid features.

Typical static features include metadata, PE import features, Byte/Entropy features, string features, and assembly opcode features derived from the PE files ( Kolosnjaji et al. 2017 ; McLaughlin et al. 2017 ; Saxe and Berlin 2015 ). De La Rosa et al. (2018 ) took three kinds of static features: byte-level, basic-level (strings in the file, the metadata table, and the import table of the PE header), and assembly-level features. Some works directly considered binary code as static features ( Cui et al. 2018 ; Kalash et al. 2018 ).

Different from static features, dynamic features were extracted by executing the files to retrieve their behaviors during execution. The behaviors of programs, including API function calls, their parameters, files created or deleted, websites and ports accessed, etc., were recorded by a sandbox as dynamic features ( David and Netanyahu 2015 ). Process behaviors, including operation names and their result codes, were extracted ( Tobiyama et al. 2016 ). The process memory, tri-grams of system API calls, and one corresponding input parameter were chosen as dynamic features ( Dahl et al. 2013 ). An API call sequence for an APK file was another representation of dynamic features ( Nix and Zhang 2017 ; Rosenberg et al. 2018 ).

Static features and dynamic features were combined as hybrid features ( Xu et al. 2018 ). For static features, Xu and others in ( Xu et al. 2018 ) used permissions, networks, calls, and providers, etc. For dynamic features, they used system call sequences.

Observation 7.2: In most works, Phase II was inevitable because the extracted features needed to be vectorized for Deep Learning models.

The one-hot encoding approach was frequently used to vectorize features ( Kolosnjaji et al. 2017 ; McLaughlin et al. 2017 ; Rosenberg et al. 2018 ; Tobiyama et al. 2016 ; Nix and Zhang 2017 ). Bag-of-words (BoW) and n -gram representations were also considered ( Nix and Zhang 2017 ). Some works brought the concept of word frequency from NLP to convert the sandbox file into fixed-size inputs ( David and Netanyahu 2015 ). Hashing features into a fixed-size vector was used as an effective representation method ( Saxe and Berlin 2015 ). A bytes histogram using byte analysis and a bytes-entropy histogram with a sliding-window method were also considered ( De La Rosa et al. 2018 ). In ( De La Rosa et al. 2018 ), De La Rosa and others embedded strings by hashing the ASCII strings to a fixed-size feature vector. For assembly features, they extracted four different levels of granularity: operation level (instruction-flow-graph), block level (control-flow-graph), function level (call-graph), and global level (graphs summarized). Bigram, trigram and four-gram vectors and n -gram graphs were used for the hybrid features ( Xu et al. 2018 ).
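The feature-hashing trick mentioned above can be sketched as follows; ( Saxe and Berlin 2015 ) use a much larger vector, and the API names and 16-bucket dimension here are illustrative assumptions.

```python
import hashlib

def hash_features(tokens, dim=16):
    """Hash arbitrary string features (imports, strings, n-grams)
    into a fixed-size count vector (the 'hashing trick')."""
    vec = [0] * dim
    for token in tokens:
        # A stable hash picks the bucket; collisions simply add up.
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1
    return vec

vec = hash_features(["CreateFileA", "WriteFile", "CreateFileA"])
```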

Observation 7.3: Most Phase III methods were classified into class 1.

Following the classification tree shown in Fig.  2 , most works fell into class 1, except two works ( Dahl et al. 2013 ; Tobiyama et al. 2016 ) that belonged to class 3. To reduce the input dimension, Dahl et al. (2013 ) performed feature selection using mutual information and random projection. Tobiyama et al. generated behavioral feature images using an RNN ( Tobiyama et al. 2016 ).

Observation 7.4: After extracting features, two kinds of neural network architectures, i.e., one single neural network and multiple neural networks with a combined loss function, were used.

Hierarchical structures with convolutional layers, fully connected layers, and classification layers were used to classify programs ( McLaughlin et al. 2017 ; Dahl et al. 2013 ; Nix and Zhang 2017 ; Saxe and Berlin 2015 ; Tobiyama et al. 2016 ; Cui et al. 2018 ; Kalash et al. 2018 ). A deep stack of denoising autoencoders was also introduced to learn programs’ behaviors ( David and Netanyahu 2015 ). De La Rosa and others ( De La Rosa et al. 2018 ) trained three different models with different features to compare which static features are relevant to the classification model. Some works investigated LSTM models for sequential features ( Nix and Zhang 2017 ; Rosenberg et al. 2018 ).

Two networks with different features as inputs were used for malware classification by combining their outputs with a dropout layer and an output layer ( Kolosnjaji et al. 2017 ). In ( Kolosnjaji et al. 2017 ), one network transformed PE metadata and import features using feedforward neurons, while the other leveraged convolutional network layers over opcode sequences. Lifan Xu et al. ( Xu et al. 2018 ) constructed several networks and combined them using a two-level multiple kernel learning algorithm.

Indication 7.1: Except for two works that transform binaries into images ( Cui et al. 2018 ; Kalash et al. 2018 ), most of the surveyed works need to adopt methods to vectorize the extracted features.

The vectorization methods should not only keep the syntactic and semantic information in the features, but also fit the input requirements of the Deep Learning model.

Indication 7.2: Only limited works have shown how to transform features using representation learning.

Because some works assume that dynamic and static sequences, like API calls and instructions, have syntactic and semantic structure similar to natural language, representation learning techniques like word2vec may be useful in malware detection. In addition, for the control-flow graph, call graph, and other graph representations, graph embedding is a potential method to transform those features.

Though several pieces of research have been done on malware detection using Deep Learning, it is hard to compare their methods and performance because of two uncertainties in their approaches. First, a Deep Learning model is a black box; researchers cannot detail which kinds of features the model learned or explain why their model works. Second, feature selection and representation affect the model’s performance. Because they do not use the same datasets, researchers cannot prove that their approaches, including the selected features and Deep Learning models, are better than others. The reason why few researchers use open datasets is that existing open malware datasets are outdated and limited. Also, researchers need to crawl benign programs from app stores themselves, so their raw programs will be diverse.

A closer look at applications of Deep Learning in system-event-based anomaly detection

System logs record significant events at various critical points, which can be used to debug a system’s performance issues and failures. Moreover, log data are available in almost all computer systems and are a valuable resource for understanding system status. There are a few challenges in anomaly detection based on system logs. First, the raw log data are unstructured, and their formats and semantics can vary significantly. Second, logs are produced by concurrently running tasks; such concurrency makes it hard to apply workflow-based anomaly detection methods. Third, logs contain rich information of diverse types, including text, real values, IP addresses, timestamps, and so on; the information contained in each log also varies. Finally, there are massive numbers of logs in every system, and each anomalous event usually involves a large number of logs generated over a long period.

Recently, many researchers have employed deep learning techniques ( Du et al. 2017 ; Meng et al. 2019 ; Das et al. 2018 ; Brown et al. 2018 ; Zhang et al. 2019 ; Bertero et al. 2017 ) to detect anomalous events in system logs and diagnose system failures. Because the raw log data are unstructured, they usually must first be parsed into structured data, which can then be transformed into a representation that supports an effective deep learning model. Finally, anomalous events can be detected by a deep-learning-based classifier or predictor.

In this section, we will review the very recent six representative papers that use deep learning for system-event-based anomaly detection ( Du et al. 2017 ; Meng et al. 2019 ; Das et al. 2018 ; Brown et al. 2018 ; Zhang et al. 2019 ; Bertero et al. 2017 ). DeepLog ( Du et al. 2017 ) utilizes LSTM to model the system log as a natural language sequence, which automatically learns log patterns from normal events and detects anomalies when log patterns deviate from the trained model. LogAnom ( Meng et al. 2019 ) employs Word2vec to extract the semantic and syntactic information from log templates. Moreover, it uses sequential and quantitative features simultaneously. Das et al. (2018 ) uses LSTM to predict node failures that occur in supercomputing systems from HPC logs. Brown et al. (2018 ) presented RNN language models augmented with attention for anomaly detection in system logs. LogRobust ( Zhang et al. 2019 ) uses FastText to represent the semantic information of log events, which can identify and handle unstable log events and sequences. Bertero et al. (2017 ) map log words to a high-dimensional metric space using Google’s word2vec algorithm and use them as features for classification. Among these six papers, we select two representative works ( Du et al. 2017 ; Meng et al. 2019 ) and summarize the four phases of their approaches. We direct interested readers to Table  2 for a concise overview of these two works.

From a close look at the very recent applications using deep learning for solving system-event-based anomaly detection challenges, we made the following observations:

Observation 8.1: Most of the surveyed works evaluated their performance using public datasets.

At the time of our survey, only two works ( Das et al. 2018 ; Bertero et al. 2017 ) used private datasets.

Observation 8.2: Most works in this survey adopted Phase II when parsing the raw log data.

After reviewing the six recently proposed works, we found that five ( Du et al. 2017 ; Meng et al. 2019 ; Das et al. 2018 ; Brown et al. 2018 ; Zhang et al. 2019 ) employed parsing techniques, while only one ( Bertero et al. 2017 ) did not.

DeepLog ( Du et al. 2017 ) parsed the raw logs into log types using Spell ( Du and Li 2016 ), which is based on the longest common subsequence. Desh ( Das et al. 2018 ) parsed the raw logs into constant messages and variable components. Loganom ( Meng et al. 2019 ) parsed the raw logs into log templates using FT-Tree ( Zhang et al. 2017 ) according to frequent combinations of log words. Brown et al. ( Brown et al. 2018 ) parsed the raw logs into word and character tokenizations. LogRobust ( Zhang et al. 2019 ) extracted log events by abstracting away the parameters in each message. Bertero et al. (2017 ) treated logs as regular text without parsing.

Observation 8.3: Most works have considered and adopted Phase III.

Among these six works, only DeepLog represented the parsed data using one-hot vectors without learning. Moreover, Loganom ( Meng et al. 2019 ) compared its results with DeepLog. That is, DeepLog belongs to class 1 and Loganom belongs to class 4 in Fig.  2 , while the other four works fall into class 3.

Four works ( Meng et al. 2019 ; Das et al. 2018 ; Zhang et al. 2019 ; Bertero et al. 2017 ) used word embedding techniques to represent the log data, while Brown et al. ( Brown et al. 2018 ) employed attention vectors to represent the log messages.

DeepLog ( Du et al. 2017 ) employed one-hot vectors to represent log types without learning. We conducted an experiment replacing the one-hot vectors with trained word embeddings.

Observation 8.4: Evaluation results were not compared using the same dataset.

DeepLog ( Du et al. 2017 ) employed one-hot vectors to represent log types without learning, i.e., it used Phase II without Phase III. In contrast, Bertero et al. ( Bertero et al. 2017 ) treated logs as regular text without parsing, i.e., Phase III without Phase II. Both methods report very high precision, above 95%. Unfortunately, the two evaluations used different datasets.

Observation 8.5: Most works employed LSTM in Phase IV.

Five works ( Du et al. 2017 ; Meng et al. 2019 ; Das et al. 2018 ; Brown et al. 2018 ; Zhang et al. 2019 ) employed LSTM in Phase IV, while Bertero et al. (2017 ) tried several classifiers, including naive Bayes, neural networks, and random forests.

Indication 8.1: Phase II has a positive effect on accuracy if being well-designed.

Since Bertero et al. (2017 ) treat logs as regular text without parsing, Phase II is evidently not strictly required. Nevertheless, most researchers employed parsing techniques to extract structural information and remove useless noise.

Indication 8.2: Most of the recent works use trained representation to represent parsed data.

As shown in Table  3 , Phase III is very useful and can improve detection accuracy.

Indication 8.3: Phase II and Phase III cannot be skipped simultaneously.

Neither Phase II nor Phase III is individually required; however, every surveyed method employed at least one of the two.

Indication 8.4: Observation 8.3 indicates that the trained word embedding format can improve the anomaly detection accuracy as shown in Table  3 .

Indication 8.5: Observation 8.5 indicates that most of the works adopt LSTM to detect anomaly events.

Most of the works adopt LSTM to detect anomalous events because log data can be treated as sequences, and there can be lags of unknown duration between important events in a time series. LSTM has feedback connections and can process not only single data points but also entire sequences of data.
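DeepLog's detection criterion, in particular, can be illustrated with a minimal sketch: an event is anomalous if it is not among the top-k most probable next events given recent history. A real system trains an LSTM for this conditional distribution; the bigram count model below is only a hypothetical stand-in to show the criterion itself, with invented event IDs.

```python
from collections import Counter, defaultdict

# Sketch of a next-event anomaly criterion (DeepLog-style): flag an event if
# it is not among the top-k predicted successors of the previous event. In
# practice the distribution comes from a trained LSTM, not bigram counts.
class NextEventModel:
    def __init__(self, k=1):
        self.k = k
        self.counts = defaultdict(Counter)

    def fit(self, sequences):
        for seq in sequences:
            for prev, nxt in zip(seq, seq[1:]):
                self.counts[prev][nxt] += 1

    def is_anomalous(self, prev, observed):
        top_k = [e for e, _ in self.counts[prev].most_common(self.k)]
        return observed not in top_k

model = NextEventModel(k=1)
model.fit([[1, 2, 3, 1, 2, 3], [1, 2, 3]])
assert not model.is_anomalous(1, 2)  # event 2 normally follows event 1
assert model.is_anomalous(2, 1)      # event 3, not 1, follows event 2
```

The LSTM's advantage over such a count model is exactly what the paragraph above notes: it can condition on long histories with lags of unknown duration, not just the immediately preceding event.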

In our view, neither Phase II nor Phase III is strictly required in system-event-based anomaly detection. However, Phase II can remove noise from the raw data, and Phase III can learn a proper representation of the data; both have a positive effect on anomaly detection accuracy. Since an event log is text data that cannot be fed into a deep learning model directly, Phase II and Phase III cannot be skipped simultaneously.

Deep learning can capture the potentially nonlinear and high-dimensional dependencies among log entries that correspond to abnormal events, thereby alleviating the challenges mentioned above. However, open challenges remain, such as how to represent unstructured data accurately and automatically without human knowledge.

A closer look at applications of deep learning in solving memory forensics challenges

In the field of computer security, memory forensics is the security-oriented forensic analysis of a computer’s memory dump. Memory forensics can be conducted against OS kernels, user-level applications, and mobile devices. Memory forensics outperforms traditional disk-based forensics because, although secrecy attacks can erase their footprints on disk, they still have to appear in memory ( Song et al. 2018 ). A memory dump can be considered a sequence of bytes; thus, memory forensics usually needs to extract security-relevant semantic information from the raw dump to find attack traces.

Traditional memory forensic tools fall into two categories: signature scanning and data structure traversal. These methods have several limitations. First, they require expert knowledge of the relevant data structures to create signatures or traversal rules. Second, attackers may directly manipulate data and pointer values in kernel objects to evade detection, which makes it even more challenging to create signatures and traversal rules that cannot be violated by malicious manipulation, system updates, or random noise. Finally, the high-efficiency requirement often sacrifices robustness. For example, an efficient signature scanning tool usually skips large memory regions that are unlikely to contain the relevant objects and relies on simple but easily tampered string constants; an important clue may hide in an ignored region.

In this section, we will review the very recent four representative works that use Deep Learning for memory forensics ( Song et al. 2018 ; Petrik et al. 2018 ; Michalas and Murray 2017 ; Dai et al. 2018 ). DeepMem ( Song et al. 2018 ) recognized the kernel objects from raw memory dumps by generating abstract representations of kernel objects with a graph-based Deep Learning approach. MDMF ( Petrik et al. 2018 ) detected OS and architecture-independent malware from memory snapshots with several pre-processing techniques, domain unaware feature selection, and a suite of machine learning algorithms. MemTri ( Michalas and Murray 2017 ) predicts the likelihood of criminal activity in a memory image using a Bayesian network, based on evidence data artefacts generated by several applications. Dai et al. (2018 ) monitor the malware process memory and classify malware according to memory dumps, by transforming the memory dump into grayscale images and adopting a multi-layer perception as the classifier.

Among these four works ( Song et al. 2018 ; Petrik et al. 2018 ; Michalas and Murray 2017 ; Dai et al. 2018 ), two representative works (i.e., ( Song et al. 2018 ; Petrik et al. 2018 )) are already summarized phase-by-phase in Table 1. We direct interested readers to Table  2 for a concise overview of these two works.

Our review is centered around the three questions raised in Section 3 . In the remainder of this section, we first provide a set of observations, then the indications, and finally some general remarks.

From a close look at the very recent applications using Deep Learning for solving memory forensics challenges, we made the following observations:

Observation 9.1: Most methods used their own datasets for performance evaluation, while none of them used a public dataset.

DeepMem was evaluated on a dataset generated by the authors, who collected a large number of diverse memory dumps and labeled the kernel objects in them using existing memory forensic tools such as Volatility. MDMF employed the MalRec dataset from Georgia Tech to generate malicious snapshots and created its own dataset of benign memory snapshots running normal software. MemTri ran several Windows 7 virtual machine instances with self-designed suspect activity scenarios to gather memory images. Dai et al. used the Procdump program in the Cuckoo sandbox to extract malware memory dumps. In short, each of the four works generated its own dataset, and none was evaluated on a public dataset.

Observation 9.2: Among the four works ( Song et al. 2018 ; Michalas and Murray 2017 ; Petrik et al. 2018 ; Dai et al. 2018 ), two ( Song et al. 2018 ; Michalas and Murray 2017 ) employed Phase II while the other two ( Petrik et al. 2018 ; Dai et al. 2018 ) did not.

DeepMem ( Song et al. 2018 ) devised a graph representation for a sequence of bytes, taking into account both adjacency and points-to relations, to better model the contextual information in memory dumps. MemTri ( Michalas and Murray 2017 ) first identified the running processes within the memory image that match the target applications, then employed regular expressions to locate evidence artefacts. MDMF ( Petrik et al. 2018 ) and Dai et al. (2018 ) directly transformed the memory dump into an image.
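The dump-to-image conversion can be sketched as follows; the row width of 4 and the sample bytes are arbitrary illustrative choices, not values taken from MDMF or Dai et al., and real systems feed much larger grids to a CNN or MLP.

```python
import math

# Sketch of "memory dump as grayscale image": the raw byte sequence is
# reshaped into a 2D grid of pixel intensities (0-255) that an image model
# can consume. The width parameter is a free choice; 4 is for illustration.
def dump_to_image(dump: bytes, width: int = 4):
    height = math.ceil(len(dump) / width)
    padded = dump + b"\x00" * (height * width - len(dump))  # pad last row
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]

img = dump_to_image(b"\x00\x10\x20\x30\x40\x50", width=4)
assert img == [[0, 16, 32, 48], [64, 80, 0, 0]]
```

Because this conversion needs no knowledge of kernel data structures, it explains how these two works could skip Phase II's domain-specific modeling entirely.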

Observation 9.3: Among four works ( Song et al. 2018 ; Michalas and Murray 2017 ; Petrik et al. 2018 ; Dai et al. 2018 ), only DeepMem ( Song et al. 2018 ) employed Phase III for which it used an embedding method to represent a memory graph.

MDMF ( Petrik et al. 2018 ) directly fed the generated memory images into the training of a CNN model. Dai et al. (2018 ) used the HOG feature descriptor for detecting objects, while MemTri ( Michalas and Murray 2017 ) extracted evidence artefacts as the input of a Bayesian network. In summary, DeepMem belongs to class 3 in Fig.  2 , while the other three works belong to class 1.

Observation 9.4: All the four works ( Song et al. 2018 ; Petrik et al. 2018 ; Michalas and Murray 2017 ; Dai et al. 2018 ) have employed different classifiers even when the types of input data are the same.

DeepMem chose a fully connected network (FCN) model with multiple hidden layers of ReLU-activated neurons, followed by a softmax output layer. MDMF ( Petrik et al. 2018 ) evaluated both traditional machine learning algorithms and Deep Learning approaches, including CNN and LSTM; their results showed that the accuracy of the different classifiers did not differ significantly. MemTri employed a Bayesian network model designed with three layers, i.e., a hypothesis layer, a sub-hypothesis layer, and an evidence layer. Dai et al. used a multi-layer perceptron comprising an input layer, a hidden layer, and an output layer as the classifier.

Indication 9.1: Public datasets for evaluating the performance of different Deep Learning methods in memory forensics are lacking.

From Observation 9.1, we find that none of the four works surveyed was evaluated on public datasets.

Indication 9.2: From Observation 9.2, we find that it is disputable whether one should employ Phase II when solving memory forensics problems.

Since both ( Petrik et al. 2018 ) and ( Dai et al. 2018 ) directly transformed a memory dump into an image, Phase II is not required in these two works. However, since a memory dump contains a large amount of useless information, we argue that appropriate preprocessing could improve the accuracy of the trained models.

Indication 9.3: From Observation 9.3, we find that Phase III has received little attention in memory forensics.

Most works did not employ Phase III. Among the four works, only DeepMem ( Song et al. 2018 ) employed Phase III during which it used embeddings to represent a memory graph. The other three works ( Petrik et al. 2018 ; Michalas and Murray 2017 ; Dai et al. 2018 ) did not learn any representations before training a Deep Learning model.

Indication 9.4: For Phase IV in memory forensics, different classifiers can be employed.

Which kind of classifier to use seems to be determined by the features used and their data structures. From Observation 9.4, we find that the four works employed different kinds of classifiers even when the types of input data are the same. Interestingly, MDMF obtained similar results with different classifiers, including traditional machine learning and Deep Learning models. However, the other three works did not discuss why they chose a particular kind of classifier.

Since a memory dump can be considered as a sequence of bytes, the data structure of a training data example is straightforward. If the memory dump is transformed into a simple form in Phase II, it can be directly fed into the training process of a Deep Learning model, and as a result Phase III can be ignored. However, if the memory dump is transformed into a complicated form in Phase II, Phase III could be quite useful in memory forensics.

Regarding the answer to Question 3 in the “ Methodology for reviewing the existing works ” section, it is very interesting that different classifiers can be employed in Phase IV for memory forensics. Moreover, MDMF ( Petrik et al. 2018 ) has shown that similar results can be obtained with different kinds of classifiers. Nevertheless, its authors also admit that with a larger amount of training data, Deep Learning could improve the performance.

An end-to-end deep learning model can learn a precise representation of a memory dump automatically, relaxing the requirement for expert knowledge. However, expert knowledge is still needed to represent data and attacker behavior, and attackers may directly manipulate data and pointer values in kernel objects to evade detection.

A closer look at applications of deep learning in security-oriented fuzzing

Fuzzing is one of the state-of-the-art techniques used to detect software vulnerabilities. The goal of fuzzing is to find all the vulnerabilities in a program by testing as much program code as possible. Due to its nature, this technique works best at finding vulnerabilities in programs that take input files, such as PDF viewers ( Godefroid et al. 2017 ) or web browsers. A typical fuzzing workflow is as follows: given several seed input files, the fuzzer mutates (fuzzes) the seeds to obtain more input files, with the aim of expanding the overall code coverage of the target program as it executes the mutated files. Although various popular fuzzers exist ( Li et al. 2018 ), fuzzing still suffers from redundantly testing input files that cannot improve the code coverage rate ( Shi and Pei 2019 ; Rajpal et al. 2017 ). Some input files mutated by the fuzzer cannot even pass the well-formed file structure test ( Godefroid et al. 2017 ). Recent research has applied Deep Learning to the fuzzing process to address these problems.

In this section, we review three recent representative works that use Deep Learning in fuzzing for software security. Among the three, two representative works ( Godefroid et al. 2017 ; Shi and Pei 2019 ) are already summarized phase-by-phase in Table  2 . We direct interested readers to Table  2 for a concise overview of those two works.

Observation 10.1: Deep Learning has only been applied in mutation-based fuzzing.

Although various fuzzing techniques have been proposed, including symbolic-execution-based fuzzing ( Stephens et al. 2016 ), taint-analysis-based fuzzing ( Bekrar et al. 2012 ), and hybrid fuzzing ( Yun et al. 2018 ), we observed that all the works we surveyed employed Deep Learning to assist the primitive form of fuzzing, namely mutation-based fuzzing. Specifically, they adopted Deep Learning to assist the fuzzing tool’s input mutation, commonly in two ways: 1) training Deep Learning models to tell how to efficiently mutate the input to trigger more execution paths ( Shi and Pei 2019 ; Rajpal et al. 2017 ); 2) training Deep Learning models to tell how to keep the mutated files compliant with the program’s basic semantic requirements ( Godefroid et al. 2017 ). Moreover, all three works trained different Deep Learning models for different programs, which means that knowledge learned from one program cannot be applied to other programs.

Observation 10.2: All the works in our survey chose their training samples similarly in Phase I.

The works in this survey shared a common practice, i.e., using the input files directly as training samples for the Deep Learning model. Learn&Fuzz ( Godefroid et al. 2017 ) used character-level PDF object sequences as training samples. Neuzz ( Shi and Pei 2019 ) treated input files directly as byte sequences and fed them into the neural network model. Rajpal et al. (2017 ) also used byte-level representations of input files as training samples.

Observation 10.3: The works in our survey differed in the training labels they assigned in Phase I.

Despite the similarity of the training samples, there was a notable difference in the training labels each work chose. Learn&Fuzz ( Godefroid et al. 2017 ) directly used the character sequences of PDF objects as labels, identical to the training samples but shifted by one position, a common generative-model technique already broadly used in speech and handwriting recognition. Unlike Learn&Fuzz, Neuzz ( Shi and Pei 2019 ) and Rajpal et al. ( Rajpal et al. 2017 ) used a bitmap and a heatmap, respectively, as training labels, with the bitmap indicating the code coverage achieved by a certain input and the heatmap indicating the efficacy of flipping one or more bytes of the input file. The bitmap, a terminology well known among fuzzing researchers, was gathered directly from the results of AFL. The heatmap used by Rajpal et al. was generated by comparing the coverage bitmap of one seed file with the bitmaps of its mutated variants: if executing the mutated seed files expands code coverage to an acceptable level, demonstrated by more “1”s instead of “0”s in the corresponding bitmaps, the byte-level differences between the original seed file and the mutated seed files are highlighted. Since those bytes should be the focus of later mutation, the heatmap denotes their locations.
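The bitmap label can be illustrated with a minimal sketch. The edge IDs below are hypothetical; AFL's actual bitmap is a fixed-size hit-count array in shared memory, simplified here to 0/1 flags.

```python
# Simplified sketch of a coverage bitmap label (Neuzz-style): position e is 1
# iff the input exercised program edge e. Edge IDs are invented; AFL's real
# bitmap stores hit counts, reduced here to binary coverage flags.
def coverage_bitmap(covered_edges, n_edges):
    return [1 if e in covered_edges else 0 for e in range(n_edges)]

seed_bitmap    = coverage_bitmap({0, 3}, 5)
mutated_bitmap = coverage_bitmap({0, 3, 4}, 5)

# A mutation is useful when it sets bits the seed did not cover; these newly
# covered edges are what heatmap-style labels trace back to specific bytes.
new_edges = [e for e, (s, m) in enumerate(zip(seed_bitmap, mutated_bitmap))
             if m and not s]
assert new_edges == [4]
```

Comparing seed and mutated bitmaps in exactly this way is the intuition behind Rajpal et al.'s heatmap construction described above.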

The different label choices reflect the different kinds of knowledge each work wants to learn. For a better understanding, note that a Deep Learning model can simply be regarded as the simulation of a “function”. Learn&Fuzz ( Godefroid et al. 2017 ) wanted to learn valid mutations of a PDF file that remain compliant with the syntax and semantic requirements of PDF objects. Its model can be seen as a simulation of f ( x , θ )= y , where x denotes a sequence of characters in PDF objects and y represents the same sequence shifted by one position; once the model was trained, new PDF object character sequences were generated from a given starting prefix. In Neuzz ( Shi and Pei 2019 ), an NN (neural network) model was used for program smoothing: it simulated a smooth surrogate function that approximated the discrete branching behavior of the target program, f ( x , θ )= y , where x denotes the program’s byte-level input and y the corresponding edge coverage bitmap. Because NNs support efficient computation of gradients and higher-order derivatives, the gradient of the surrogate function is easy to compute and can guide the direction of mutation toward greater code coverage. Rajpal et al. ( Rajpal et al. 2017 ) designed a model to predict good (and bad) locations to mutate in input files, based on past mutations and the corresponding code coverage information; here x again denotes the program’s byte-level input, but y represents the corresponding heatmap.
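The gradient-guidance idea can be illustrated with a toy sketch. The quadratic surrogate and finite-difference gradient below are hypothetical stand-ins for Neuzz's trained NN surrogate and its backpropagated gradients; only the selection principle (mutate where the gradient magnitude is largest) matches the described technique.

```python
# Toy illustration of gradient-guided byte selection: a smooth surrogate f
# maps a byte vector to a (scalarized) coverage score, and the bytes with the
# largest gradient magnitude are the most promising mutation targets. This
# quadratic f is NOT a trained model, and the finite-difference gradient
# stands in for backpropagation through a real NN.
def surrogate(x):
    return sum((b - 128) ** 2 for b in x)  # hypothetical coverage predictor

def gradient(f, x, eps=1):
    base = f(x)
    return [f(x[:i] + [b + eps] + x[i + 1:]) - base for i, b in enumerate(x)]

x = [0, 128, 255]
g = gradient(surrogate, x)
# Pick the byte whose gradient magnitude is largest as the mutation target.
target = max(range(len(x)), key=lambda i: abs(g[i]))
assert target in (0, 2)  # the extreme bytes dominate; byte 1 is near-flat
```

In Neuzz itself the surrogate outputs a whole edge-coverage bitmap rather than a scalar, and gradients are taken with respect to chosen output edges, but the byte-selection principle is the same.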

Observation 10.4: Various lengths of input files were handled in Phase II.

Deep Learning models typically accept fixed-length input, whereas the input files for fuzzers often have different lengths. Two approaches were used among the works we surveyed: splitting and padding. Learn&Fuzz ( Godefroid et al. 2017 ) dealt with this mismatch by concatenating all the PDF object character sequences together and then splitting the large character sequence into multiple training samples of a fixed size. Neuzz ( Shi and Pei 2019 ) set a maximum input file size threshold and padded smaller input files with null bytes; from additional experiments, they found that a modest threshold gave the best result and that enlarging the input file size did not grant additional accuracy. Besides preprocessing training samples, Neuzz also preprocessed training labels, reducing label dimensionality by merging the edges that always appear together into one edge, in order to prevent the multicollinearity problem that could keep the model from converging to a small loss value. Rajpal et al. ( Rajpal et al. 2017 ) used a splitting mechanism similar to Learn&Fuzz to split their input files into either 64-bit or 128-bit chunks; the chunk size was determined empirically, treated as a trainable parameter of their Deep Learning model, and their approach did not require concatenating sequences at the beginning.
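The two length-normalization strategies can be sketched as follows; the chunk size and threshold values are illustrative only, not those used in the surveyed papers.

```python
# Two Phase II length-normalization strategies observed in the survey:
# Learn&Fuzz-style splitting into fixed-size chunks, and Neuzz-style padding
# with null bytes up to a size threshold (with truncation beyond it).
def split_chunks(data: bytes, size: int):
    # Full chunks only; a trailing partial chunk is dropped in this sketch.
    return [data[i:i + size] for i in range(0, len(data) - size + 1, size)]

def pad_to(data: bytes, threshold: int):
    return (data + b"\x00" * threshold)[:threshold]

assert split_chunks(b"abcdefg", 3) == [b"abc", b"def"]
assert pad_to(b"ab", 4) == b"ab\x00\x00"
assert pad_to(b"abcdef", 4) == b"abcd"  # truncated at the threshold
```

Splitting preserves every byte at the cost of breaking cross-chunk context, while padding preserves file-level context at the cost of a hard size cap; the surveyed works picked between these trade-offs empirically.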

Observation 10.5: All the works in our survey skipped Phase III.

According to our definition of Phase III, none of the works in our survey considered representation learning; therefore, all three works ( Godefroid et al. 2017 ; Shi and Pei 2019 ; Rajpal et al. 2017 ) fall into class 1 shown in Fig.  2 . However, Rajpal et al. did consider the numerical representation of byte sequences. They argued that since one byte of binary data does not always represent a magnitude but sometimes a state, representing a byte as a value from 0 to 255 could be suboptimal, and they used a lower-level 8-bit representation instead.

Indication 10.1: Making no alteration to the input files seems to be the correct approach. In our view, this is due to the nature of fuzzing: since every bit of an input file matters, any slight alteration could either lose important information or add redundant information for the neural network model to learn.

Indication 10.2: Evaluation criteria should be chosen carefully when judging mutation.

Input files are always used as training samples when applying Deep Learning techniques to fuzzing problems. Through this common practice, researchers share the desire to let the neural network model learn what mutated input files should look like. But the criterion for judging an input file has two levels: on the one hand, a good input file should be correct in syntax and semantics; on the other hand, a good input file should be the product of a useful mutation, one that triggers the program to behave differently from previous execution paths. The observation that a fuzzer generating semantically correct input files could still be bad at triggering new execution paths was first raised in Learn&Fuzz ( Godefroid et al. 2017 ). Later works tried to solve this problem by using different training labels ( Rajpal et al. 2017 ) or by using a neural network for program smoothing ( Shi and Pei 2019 ). We encourage fuzzing researchers using Deep Learning techniques to keep this problem in mind in order to get better fuzzing results.

Indication 10.3: The works in our survey focus only on local knowledge. In brief, some of the existing works ( Shi and Pei 2019 ; Rajpal et al. 2017 ) leveraged the Deep Learning model to learn the relation between a program’s input and its behavior, and used the knowledge learned from history to guide future mutation. For clarity, we define knowledge that applies only to one program as local knowledge . In other words, local knowledge cannot direct fuzzing on other programs.

Corresponding to the problems of conventional fuzzing, the advantages of applying DL to fuzzing are that DL’s learning ability helps mutated input files follow the designated grammar rules more closely, and that input file generation becomes more directed, helping the fuzzer increase its code coverage with each mutation. However, even though these advantages are clearly demonstrated by the papers discussed above, some challenges still exist, including the mutation-judgment challenge faced by both traditional fuzzing and fuzzing with DL, and the scalability of fuzzing approaches.

We would like to raise several interesting questions for future researchers: 1) Can the knowledge learned from the fuzzing history of one program be applied to direct testing on other programs? 2) If so, does global knowledge across different programs exist, and can we train a model to extract this global knowledge ? 3) Is it possible to combine global knowledge and local knowledge when fuzzing programs?

Using high-quality data in Deep Learning is as important as using well-structured deep neural network architectures. That is, obtaining quality data is an essential step that should not be skipped, even when solving security problems with Deep Learning. This study has demonstrated how recent security papers using Deep Learning have adopted data conversion (Phase II) and data representation (Phase III) for different security problems. Our observations and indications provide a clear understanding of how security experts generate quality data when using Deep Learning.

Since we did not review all the existing security papers using Deep Learning, the generality of our observations and indications is somewhat limited. Note, however, that the selected papers were published recently at prestigious security and reliability conferences such as USENIX SECURITY and ACM CCS ( Shin et al. 2015 )-( Das et al. 2018 ), ( Brown et al. 2018 ; Zhang et al. 2019 ), ( Song et al. 2018 ; Petrik et al. 2018 ), ( Wang et al. 2019 )-( Rajpal et al. 2017 ). Thus, our observations and indications help to explain how most security experts have used Deep Learning to solve the eight well-known security problems, from program analysis to fuzzing.

Our observations show that raw data should be transformed, through data cleaning, data augmentation, and similar steps, into formats ready for solving security problems with Deep Learning. Specifically, we observe that Phase II and III methods have mainly been used for the following purposes:

To clean the raw data to make the neural network (NN) models easier to interpret

To reduce the dimensionality of data (e.g., principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE))

To scale input data (e.g., normalization)

To make NN models understand more complex relationships depending on security problems (e.g. memory graphs)

To simply change various raw data formats into a vector format for NN models (e.g. one-hot encoding and word2vec embedding)

In the following, we further discuss the question “What if Phase II is skipped?” rather than “Is Phase III always necessary?”, because most of the selected papers either do not consider Phase III methods (76%) or adopt them without concrete reasoning (19%). Specifically, we examine in depth how Phase II has been adopted across the eight security problems, the different types of data, the various NN models, and the various outputs of the NN models. Our key findings are summarized as follows:

How to fit security domain knowledge into raw data has not been well-studied yet.

While raw text data are commonly parsed after embedding, raw binary data are converted using various Phase II methods.

Raw data are commonly converted into a vector format to fit well to a specific NN model using various Phase II methods.

Various Phase II methods are used according to the relationship between output of security problem and output of NN models.

What if Phase II is skipped?

From the analysis results of our selected papers for review, we roughly classify Phase II methods into the following four categories.

Embedding: The data conversion methods that intend to convert high-dimensional discrete variables into low-dimensional continuous vectors ( Google Developers 2016 ).

Parsing combined with embedding: The data conversion methods that decompose input data into syntactic components in order to test conformity after embedding.

One-hot encoding: A simple encoding in which each datum belonging to a specific category is mapped to a vector of 0s with a single 1. Here, no learned low-dimensional vector is produced.

Domain-specific data structures: A set of data conversion methods which generate data structures capturing domain-specific knowledge for different security problems, e.g., memory graphs ( Song et al. 2018 ).
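Categories 1 and 3 can be contrasted in a few lines. The protocol vocabulary and the embedding table below are hypothetical; in practice the dense vectors of Category 1 are learned during training (e.g., by word2vec or an NN embedding layer), not written by hand.

```python
# Minimal illustration of Category 3 (one-hot encoding) versus Category 1
# (embedding). Vocabulary and embedding values are invented for this sketch.
vocabulary = ["tcp", "udp", "icmp"]

def one_hot(value):
    # Category 3: sparse 0/1 vector, dimension = vocabulary size, no learning.
    vec = [0] * len(vocabulary)
    vec[vocabulary.index(value)] = 1
    return vec

# Category 1: dense, low-dimensional, normally learned rather than hand-set.
embedding_table = {
    "tcp": [0.9, 0.1],
    "udp": [0.8, 0.2],
    "icmp": [0.1, 0.9],
}

assert one_hot("udp") == [0, 1, 0]
assert len(embedding_table["udp"]) < len(one_hot("udp"))  # lower dimension
```

The dimensionality gap is the practical distinction: one-hot vectors grow with the vocabulary and encode no similarity, while embeddings stay compact and can place related categories (here, the two transport protocols) close together.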

Findings on eight security problems

We observe that over 93% of the papers use one of the Phase II methods classified above; the remaining 7%, mostly papers solving a software fuzzing problem, use none of them. Specifically, 35% of the papers use a Category 1 (i.e., embedding) method; 30% use a Category 2 (i.e., parsing combined with embedding) method; 15% use a Category 3 (i.e., one-hot encoding) method; and 13% use a Category 4 (i.e., domain-specific data structures) method. Regarding why one-hot encoding is not more widely used, we found that most security data include categorical input values that cannot be directly analyzed by Deep Learning models.

From Fig.  3 , we also observe that different Phase II methods are used for different security problems. First, PA, ROP, and CFI convert raw data into a vector format using embedding because they commonly collect instruction sequences from binary data. Second, NA and SEAD use parsing combined with embedding because raw data such as network traffic and system logs consist of complex attributes in different formats, with both categorical and numerical input values. Third, MF uses various data structures because memory dumps of the memory layout are unstructured. Fourth, fuzzing generally uses no data conversion, since the Deep Learning models are used to generate new input data with the same format as the original raw data. Finally, MC commonly uses one-hot encoding and embedding because malware binaries and well-structured security log files generally include categorical, numerical, and unstructured data. These observations indicate that the type of data strongly influences the choice of Phase II methods. We also observe that, among the eight security problems, only MF commonly transforms raw data into well-structured data embedding specialized security domain knowledge. This indicates that methods for converting raw data into well-structured data embedding various kinds of security domain knowledge have not yet been studied in depth.

Fig. 3. Statistics of Phase II methods for eight security problems

Findings on different data types

Note that, depending on the type of data, one NN model works better than the others. For example, CNN works well with images but does not work with text. From Fig.  4 , for raw binary data, we observe that 51.9%, 22.3% and 11.2% of security papers use embedding, one-hot encoding and Others , respectively. Only 14.9% of security papers, mainly those related to fuzzing, use none of the Phase II methods. This observation indicates that binary input data, which come in various binary formats, should be converted into an input data type that works well with a specific NN model. From Fig.  4 , for raw text data, we also observe that 92.4% of papers use parsing combined with embedding as the Phase II method. Note that, compared with raw binary data, whose formats are unstructured, raw text data generally have a well-structured format. Raw text data collected from network traffic may also have various types of attribute values. Thus, raw text data are commonly parsed before embedding to reduce the redundancy and dimensionality of the data.
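For raw binary data, embedding typically treats each byte value as a token, which avoids any format-specific parsing. A minimal sketch, assuming a randomly initialized 256-entry table (in practice the table is learned during training) and an illustrative byte string:

```python
import numpy as np

# Hypothetical instruction bytes (here, an x86-64 function prologue).
raw_bytes = bytes.fromhex("554889e58b45fc")

# Byte-level embedding: every possible byte value 0..255 gets a row in
# the table, so any binary input can be looked up without parsing.
table = np.random.default_rng(2).normal(size=(256, 16))
sequence = table[list(raw_bytes)]     # one 16-dim vector per input byte
```

Because the vocabulary is fixed at 256 byte values, the same table handles binaries of any format, which is one reason embedding dominates for raw binary inputs.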

Fig. 4. Statistics of Phase II methods on type of data

Findings on various models of NN

Depending on the type of the converted data, a specific NN model works better than the others. For example, CNN works well with images but does not work with raw text. From Fig.  5 a, we observe that the use of embedding for the DNN (42.9%), RNN (28.6%) and LSTM (14.3%) models adds up to about 85%. This observation indicates that embedding methods are commonly used to generate sequential input data for DNN, RNN and LSTM models. Also, we observe that one-hot encoded data are commonly used as input data for the DNN (33.4%), CNN (33.4%) and LSTM (16.7%) models. This observation indicates that one-hot encoding is a common Phase II method for generating numerical values for image and sequential input data, because many raw input data for security problems have categorical features. We observe that the CNN model (66.7%) uses input data converted with the Others methods, which express specific domain knowledge in the input data structure of the NN. This is because general vector formats, including graphs, matrices and so on, can also be used as input values of the CNN model.

From Fig.  5 b, we observe that the DNN, RNN and LSTM models commonly use embedding, one-hot encoding and parsing combined with embedding. For example, we observe that 54.6%, 18.2% and 18.2% of security papers use embedding, one-hot encoding and parsing combined with embedding for these models, respectively. We also observe that the CNN model is used with various Phase II methods, because any vector format, such as an image, can generally be used as input data of the CNN model.

Fig. 5. Statistics of Phase II methods for various types of NNs

Fig. 6. Statistics of Phase II methods for various output of NN

Findings on output of NN models

Depending on the relationship between the output of the security problem and the output of the NN, a specific Phase II method may be used. For example, if the output of the security problem is a class (e.g., normal or abnormal), the output of the NN should also be a classification.
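A minimal sketch of this correspondence: when the security problem's output is a class such as normal vs. abnormal, the NN's final layer produces one logit per class, and a softmax maps the logits to class probabilities. The logit values below are hypothetical stand-ins for a real model's output.

```python
import numpy as np

# Classification head: convert per-class logits into a probability
# distribution; the predicted class is the most probable one.
def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, -1.0])            # hypothetical NN output
probs = softmax(logits)
classes = ["normal", "abnormal"]
label = classes[int(probs.argmax())]
```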

From Fig.  6 a, we observe that embedding is used exclusively to support security problems whose output is classification (100%). Parsing combined with embedding is used to support security problems whose output is object detection (41.7%) or classification (58.3%). One-hot encoding is used only for classification (100%). These observations indicate that classification of a given input is the most common output obtained using Deep Learning under the various Phase II methods.

From Fig.  6 b, we observe that security problems whose outputs are classification commonly use embedding (43.8%) and parsing combined with embedding (21.9%) as the Phase II method. We also observe that security problems whose outputs are object detection commonly use parsing combined with embedding (71.5%). However, security problems whose outputs are data generation commonly do not use the Phase II methods. These observations indicate that a specific Phase II method is chosen according to the relationship between the output of the security problem and the use of NN models.

Further areas of investigation

Since Deep Learning models are stochastic, each time the same Deep Learning model is fit, even on the same data, it may give different outcomes. This is because deep neural networks use random values, such as random initial weights. If we had all possible data for every security problem, predictions would not be random. Since in practice we have only limited sample data, we need to obtain best-effort prediction results from the given Deep Learning model that fits the given security problem.
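One concrete source of this stochasticity is random weight initialization. The sketch below (with an illustrative 4×4 weight matrix standing in for a layer's parameters) shows that fixing the random seed makes the "random" initial weights repeatable across runs, while different seeds give different starting points:

```python
import numpy as np

# Frameworks draw initial weights from a random number generator;
# fixing its seed makes the initialization (and hence the training
# trajectory) reproducible.
def init_weights(seed, shape=(4, 4)):
    rng = np.random.default_rng(seed)
    return rng.normal(size=shape)         # stand-in for layer weights

run1 = init_weights(seed=42)
run2 = init_weights(seed=42)              # same seed: identical weights
run3 = init_weights(seed=7)               # different seed: different weights
```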

How can we get the best-effort prediction results of Deep Learning models for different security problems? We begin by discussing the stability of the evaluation results in our selected papers. Next, we elaborate on the influence of security domain knowledge on the prediction results of Deep Learning models. Finally, we discuss some common issues in these fields.

How stable are evaluation results?

When evaluating neural network models, three methods are commonly used: train-test split, train-validation-test split, and k -fold cross validation. A train-test split splits the data into two parts, i.e., training and test data. Although a train-test split yields stable predictions with a large amount of data, predictions vary with a small amount of data. A train-validation-test split splits the data into three parts, i.e., training, validation and test data; validation data are used to estimate predictions over unknown data. k -fold cross validation produces k different sets of predictions from k different evaluation sets. Since k -fold cross validation averages the expected performance of the NN model over the k validation folds, its evaluation result is closer to the actual performance of the NN model.
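The k-fold procedure described above can be sketched as follows. The toy `evaluate` function (always predicting the majority class) is a hypothetical stand-in for training and testing a real NN on each split:

```python
import numpy as np

# Minimal k-fold cross validation: split the indices into k folds,
# hold each fold out once as evaluation data, and average the scores.
def k_fold_scores(X, y, k, evaluate):
    folds = np.array_split(np.arange(len(X)), k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(evaluate(X[train], y[train], X[test], y[test]))
    return np.mean(scores), np.std(scores)  # expected performance and spread

# Toy stand-in for a trained model: always predict the majority class.
X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 12 + [1] * 8)
majority = lambda Xtr, ytr, Xte, yte: np.mean(yte == np.bincount(ytr).argmax())
mean, std = k_fold_scores(X, y, 5, majority)
```

The returned standard deviation is exactly the stability signal discussed here: a single train-test split would have reported only one of the k per-fold scores.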

From the analysis of our selected papers, we observe that 40.0% and 32.5% of the selected papers are evaluated using a train-test split and a train-validation-test split, respectively. Only 17.5% of the selected papers are evaluated using k -fold cross validation. This observation implies that even though the selected papers report almost 99% accuracy or an F1 score of 0.99, most solutions using Deep Learning might not show the same performance on noisy data with randomness.

To get stable prediction results of Deep Learning models for different security problems, we might reduce the influence of the randomness of data on Deep Learning models. At least, it is recommended to consider the following methods:

Do experiments using the same data many times : To get a stable prediction with a small amount of sample data, we might control the randomness of data by repeating experiments on the same data many times.

Use cross validation methods, e.g., k -fold cross validation : The expected average and variance from k -fold cross validation estimate how stable the proposed model is.

How does security domain knowledge influence the performance of security solutions using deep learning?

When selecting a NN model to analyze an application dataset, e.g., the MNIST dataset ( LeCun and Cortes 2010 ), we should understand that the problem is to classify a handwritten digit given as a 28×28 grayscale image. Also, to solve the problem with high classification accuracy, it is important to know which part of each handwritten digit mainly influences the outcome of the problem, i.e., to have domain knowledge.

When solving a security problem, knowing and using security domain knowledge for each security problem is also important, for the following reasons (we label the observations and indications related to domain knowledge with ‘ ∗ ’):

Firstly, dataset generation, preprocessing and feature selection highly depend on domain knowledge. Unlike image classification and natural language processing, raw data in the security domain cannot be fed into the NN model directly. Researchers need strong domain knowledge to generate, extract, or clean the training set. Also, in some works, domain knowledge is used in data labeling because labels for data samples are not straightforward.

Secondly, domain knowledge helps with the selection of DL models and their hierarchical structure. For example, the neural network architecture (hierarchical and bi-directional LSTM) designed in DEEPVSA ( Guo et al. 2019 ) is based on domain knowledge of instruction analysis.

Thirdly, domain knowledge helps speed up the training process. For instance, adopting strong domain knowledge to clean the training set speeds up training while keeping the same performance. However, due to the influence of data randomness on Deep Learning models, domain knowledge should be adopted carefully to avoid a potential decrease in accuracy.

Finally, domain knowledge helps with the interpretability of a model’s predictions. Recently, researchers have tried to explore the interpretability of deep learning models in security areas. For instance, LEMNA ( Guo et al. 2018 ) and EKLAVYA ( Chua et al. 2017 ) explain, from different perspectives, how the models’ predictions were made. By enhancing the trained models’ interpretability, they can improve their approaches’ accuracy and security. The explanation of the relation between the input, the hidden state, and the final output is based on domain knowledge.

Common challenges

In this section, we discuss the common challenges of applying DL to security problems. These challenges are shared at least by the majority of works, if not by all. Generally, we observe the following common challenges in our survey:

The raw data collected from software or systems usually contain a lot of noise.

The collected raw data are untidy. For instance, instruction traces are variable-length sequences.

Hierarchical data syntax/structure. As discussed in Section 3 , the information may not simply be encoded in a single layer; rather, it is encoded hierarchically, and the syntax is complex.

Dataset generation is challenging in some scenarios. Therefore, the generated training data might be less representative or unbalanced.

Unlike the application of DL to image classification and natural language processing, where the data are visible or understandable, the relation between a data sample and its label is not intuitive and is hard to explain.

Availability of trained model and quality of dataset.

Finally, we investigate the availability of the trained models and the quality of the datasets. Generally, the availability of the trained models affects their adoption in practice, and the quality of the training and testing sets affects the credibility of testing results and comparisons between different works. Therefore, we collect relevant information to answer the following four questions and show the statistics in Table  4 :

Is the paper’s source code publicly available?

Is the raw data used to generate the dataset publicly available?

Is the dataset itself publicly available?

What is the quality of the dataset?

We observe that the percentage of open-source code and datasets in our surveyed fields is low, which makes it challenging to reproduce proposed schemes, compare different works, and adopt them in practice. Specifically, the statistics show that: 1) the percentage of open-source code in our surveyed fields is low; only 6 out of 16 papers published their model’s source code; 2) the percentage of public datasets is low; even though the raw data in half of the works are publicly available, only 4 out of 16 fully or partially published their dataset; and 3) the quality of the datasets is not guaranteed; for instance, most of the datasets are unbalanced.

The performance of security solutions, even those using Deep Learning, may vary according to the dataset. Traditionally, when evaluating different NN models in image classification, standard datasets such as MNIST for recognizing 10 handwritten digits and CIFAR10 ( Krizhevsky et al. 2010 ) for recognizing 10 object classes are used for performance comparison. However, there are no known standard datasets for evaluating NN models on different security problems. Due to this limitation, we observe that most security papers using Deep Learning do not compare the performance of different security solutions even when they consider the same security problem. Thus, it is recommended to generate and use a standard dataset for each specific security problem to enable comparison. In conclusion, we think three aspects need to be improved in future research:

Developing standard datasets.

Publishing source code and datasets.

Improving the interpretability of models.

This paper seeks to provide a dedicated review of the very recent research works on using Deep Learning techniques to solve computer security challenges. In particular, the review covers eight computer security problems being solved by applications of Deep Learning: security-oriented program analysis, defending ROP attacks, achieving CFI, defending network attacks, malware classification, system-event-based anomaly detection, memory forensics, and fuzzing for software security. Our observations of the reviewed works indicate that the literature on using Deep Learning techniques to solve computer security challenges is still at an early stage of development.

Availability of data and materials

Not applicable.

We refer readers to ( Wang and Liu 2019 ) which systemizes the knowledge of protections by CFI schemes.

Abadi, M, Budiu M, Erlingsson Ú, Ligatti J (2009) Control-Flow Integrity Principles, Implementations, and Applications. ACM Trans Inf Syst Secur (TISSEC) 13(1):4.


Bao, T, Burket J, Woo M, Turner R, Brumley D (2014) BYTEWEIGHT: Learning to Recognize Functions in Binary Code In: 23rd USENIX Security Symposium (USENIX Security 14), 845–860.. USENIX Association, San Diego.


Bekrar, S, Bekrar C, Groz R, Mounier L (2012) A Taint Based Approach for Smart Fuzzing In: 2012 IEEE Fifth International Conference on Software Testing, Verification and Validation.. IEEE. https://doi.org/10.1109/icst.2012.182 .

Bengio, Y, Courville A, Vincent P (2013) Representation Learning: A Review and New Perspectives. IEEE Trans Pattern Anal Mach Intell 35(8):1798–1828.

Bertero, C, Roy M, Sauvanaud C, Tredan G (2017) Experience Report: Log Mining Using Natural Language Processing and Application to Anomaly Detection In: 2017 IEEE 28th International Symposium on Software Reliability Engineering (ISSRE).. IEEE. https://doi.org/10.1109/issre.2017.43 .

Brown, A, Tuor A, Hutchinson B, Nichols N (2018) Recurrent Neural Network Attention Mechanisms for Interpretable System Log Anomaly Detection In: Proceedings of the First Workshop on Machine Learning for Computing Systems, MLCS’18, 1:1–1:8.. ACM, New York.

Böttinger, K, Godefroid P, Singh R (2018) Deep Reinforcement Fuzzing In: 2018 IEEE Security and Privacy Workshops (SPW), pages 116–122.. IEEE. https://doi.org/10.1109/spw.2018.00026 .

Chen, L, Sultana S, Sahita R (2018) Henet: A Deep Learning Approach on Intel Ⓡ Processor Trace for Effective Exploit Detection In: 2018 IEEE Security and Privacy Workshops (SPW).. IEEE. https://doi.org/10.1109/spw.2018.00025 .

Chua, ZL, Shen S, Saxena P, Liang Z (2017) Neural Nets Can Learn Function Type Signatures from Binaries In: 26th USENIX Security Symposium (USENIX Security 17), 99–116.. USENIX Association. https://dl.acm.org/doi/10.5555/3241189.3241199 .

Cui, Z, Xue F, Cai X, Cao Y, Wang GG, Chen J (2018) Detection of Malicious Code Variants Based on Deep Learning. IEEE Trans Ind Inform 14(7):3187–3196.

Dahl, GE, Stokes JW, Deng L, Yu D (2013) Large-scale Malware Classification using Random Projections and Neural Networks In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).. IEEE. https://doi.org/10.1109/icassp.2013.6638293 .

Dai, Y, Li H, Qian Y, Lu X (2018) A Malware Classification Method Based on Memory Dump Grayscale Image. Digit Investig 27:30–37.

Das, A, Mueller F, Siegel C, Vishnu A (2018) Desh: Deep Learning for System Health Prediction of Lead Times to Failure in HPC In: Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing, HPDC ’18, 40–51.. ACM, New York.


David, OE, Netanyahu NS (2015) DeepSign: Deep Learning for Automatic Malware Signature Generation and Classification In: 2015 International Joint Conference on Neural Networks (IJCNN).. IEEE. https://doi.org/10.1109/ijcnn.2015.7280815 .

De La Rosa, L, Kilgallon S, Vanderbruggen T, Cavazos J (2018) Efficient Characterization and Classification of Malware Using Deep Learning In: 2018 Resilience Week (RWS).. IEEE. https://doi.org/10.1109/rweek.2018.8473556 .

Du, M, Li F (2016) Spell: Streaming Parsing of System Event Logs In: 2016 IEEE 16th International Conference on Data Mining (ICDM).. IEEE. https://doi.org/10.1109/icdm.2016.0103 .

Du, M, Li F, Zheng G, Srikumar V (2017) DeepLog: Anomaly Detection and Diagnosis from System Logs Through Deep Learning In: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS ’17, 1285–1298.. ACM, New York.

Faker, O, Dogdu E (2019) Intrusion Detection Using Big Data and Deep Learning Techniques In: Proceedings of the 2019 ACM Southeast Conference (ACM SE ’19), 86–93.. ACM. https://doi.org/10.1145/3299815.3314439 .

Ghosh, AK, Wanken J, Charron F (1998) Detecting Anomalous and Unknown Intrusions against Programs In: Proceedings 14th annual computer security applications conference (Cat. No. 98Ex217), 259–267.. IEEE, Washington, DC.

Godefroid, P, Peleg H, Singh R (2017) Learn&Fuzz: Machine Learning for Input Fuzzing In: 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE).. IEEE. https://doi.org/10.1109/ase.2017.8115618 .

Google Developers (2016) Embeddings . https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture .

Guo, W, Mu D, Xu J, Su P, Wang G, Xing X (2018) Lemna: Explaining deep learning based security applications In: Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, pages 364–379. https://doi.org/10.1145/3243734.3243792 .

Guo, W, Mu D, Xing X, Du M, Song D (2019) DEEPVSA: Facilitating Value-set Analysis with Deep Learning for Postmortem Program Analysis In: 28th USENIX Security Symposium (USENIX Security 19), 1787–1804.. USENIX Association, Santa Clara, CA. https://www.usenix.org/conference/usenixsecurity19/presentation/guo .

Heller, KA, Svore KM, Keromytis AD, Stolfo SJ (2003) One Class Support Vector Machines for Detecting Anomalous Windows Registry Accesses In: Proceedings of the Workshop on Data Mining for Computer Security.. IEEE, Dallas, TX.

Horwitz, S (1997) Precise Flow-insensitive May-alias Analysis is NP-hard. ACM Trans Program Lang Syst 19(1):1–6.

Hu, W, Liao Y, Vemuri VR (2003) Robust Anomaly Detection using Support Vector Machines In: Proceedings of the international conference on machine learning, 282–289.. Citeseer, Washington, DC.

IDS 2017 Datasets (2019). https://www.unb.ca/cic/datasets/ids-2017.html .

Kalash, M, Rochan M, Mohammed N, Bruce NDB, Wang Y, Iqbal F (2018) Malware Classification with Deep Convolutional Neural Networks In: 2018 9th IFIP International Conference on New Technologies, Mobility and Security (NTMS), 1–5. https://doi.org/10.1109/NTMS.2018.8328749 .

Kiriansky, V, Bruening D, Amarasinghe SP, et al. (2002) Secure Execution via Program Shepherding In: USENIX Security Symposium, volume 92, page 84.. USENIX Association, Monterey, CA.

Kolosnjaji, B, Eraisha G, Webster G, Zarras A, Eckert C (2017) Empowering Convolutional Networks for Malware Classification and Analysis. Proc Int Jt Conf Neural Netw 2017-May:3838–3845.

Krizhevsky, A, Nair V, Hinton G (2010) CIFAR-10 (Canadian Institute for Advanced Research). https://www.cs.toronto.edu/~kriz/cifar.html .

LeCun, Y, Cortes C (2010) MNIST Handwritten Digit Database. http://yann.lecun.com/exdb/mnist/ .

Li, J, Zhao B, Zhang C (2018) Fuzzing: A Survey. Cybersecurity 1(1):6.

Li, X, Hu Z, Fu Y, Chen P, Zhu M, Liu P (2018) ROPNN: Detection of ROP Payloads Using Deep Neural Networks. arXiv preprint arXiv:1807.11110.

McLaughlin, N, Martinez Del Rincon J, Kang BJ, Yerima S, Miller P, Sezer S, Safaei Y, Trickel E, Zhao Z, Doupe A, Ahn GJ (2017) Deep Android Malware Detection In: Proceedings of the 7th ACM Conference on Data and Application Security and Privacy, 301–308. https://doi.org/10.1145/3029806.3029823 .

Meng, W, Liu Y, Zhu Y, Zhang S, Pei D, Liu Y, Chen Y, Zhang R, Tao S, Sun P, Zhou R (2019) Loganomaly: Unsupervised Detection of Sequential and Quantitative Anomalies in Unstructured Logs In: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence.. International Joint Conferences on Artificial Intelligence Organization. https://doi.org/10.24963/ijcai.2019/658 .

Michalas, A, Murray R (2017) MemTri: A Memory Forensics Triage Tool Using Bayesian Network and Volatility In: Proceedings of the 2017 International Workshop on Managing Insider Security Threats, MIST ’17, pages 57–66.. ACM, New York.

Millar, K, Cheng A, Chew HG, Lim C-C (2018) Deep Learning for Classifying Malicious Network Traffic In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, 156–161.. Springer. https://doi.org/10.1007/978-3-030-04503-6_15 .

Moustafa, N, Slay J (2015) UNSW-NB15: A Comprehensive Data Set for Network Intrusion Detection Systems (UNSW-NB15 Network Data Set) In: 2015 Military Communications and Information Systems Conference (MilCIS).. IEEE. https://doi.org/10.1109/milcis.2015.7348942 .

Nguyen, MH, Nguyen DL, Nguyen XM, Quan TT (2018) Auto-Detection of Sophisticated Malware using Lazy-Binding Control Flow Graph and Deep Learning. Comput Secur 76:128–155.

Nix, R, Zhang J (2017) Classification of Android Apps and Malware using Deep Neural Networks. Proc Int Jt Conf Neural Netw 2017-May:1871–1878.

NSCAI Intern Report for Congress (2019). https://drive.google.com/file/d/153OrxnuGEjsUvlxWsFYauslwNeCEkvUb/view .

Petrik, R, Arik B, Smith JM (2018) Towards Architecture and OS-Independent Malware Detection via Memory Forensics In: Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, CCS ’18, pages 2267–2269.. ACM, New York.

Phan, AV, Nguyen ML, Bui LT (2017) Convolutional Neural Networks over Control Flow Graphs for Software defect prediction In: 2017 IEEE 29th International Conference on Tools with Artificial Intelligence (ICTAI), 45–52.. IEEE. https://doi.org/10.1109/ictai.2017.00019 .

Rajpal, M, Blum W, Singh R (2017) Not All Bytes are Equal: Neural Byte Sieve for Fuzzing. arXiv preprint arXiv:1711.04596.

Rosenberg, I, Shabtai A, Rokach L, Elovici Y (2018) Generic Black-box End-to-End Attack against State of the Art API Call based Malware Classifiers In: Research in Attacks, Intrusions, and Defenses, 490–510.. Springer. https://doi.org/10.1007/978-3-030-00470-5_23 .

Salwant, J (2015) ROPGadget. https://github.com/JonathanSalwan/ROPgadget .

Saxe, J, Berlin K (2015) Deep Neural Network based Malware Detection using Two Dimensional Binary Program Features In: 2015 10th International Conference on Malicious and Unwanted Software (MALWARE).. IEEE. https://doi.org/10.1109/malware.2015.7413680 .

Shacham, H, et al. (2007) The Geometry of Innocent Flesh on the Bone: Return-into-libc without Function Calls (on the x86) In: ACM conference on Computer and communications security, pages 552–561. https://doi.org/10.1145/1315245.1315313 .

Shi, D, Pei K (2019) NEUZZ: Efficient Fuzzing with Neural Program Smoothing. IEEE Secur Priv.

Shin, ECR, Song D, Moazzezi R (2015) Recognizing Functions in Binaries with Neural Networks In: 24th USENIX Security Symposium (USENIX Security 15).. USENIX Association. https://dl.acm.org/doi/10.5555/2831143.2831182 .

Sommer, R, Paxson V (2010) Outside the Closed World: On Using Machine Learning For Network Intrusion Detection In: 2010 IEEE Symposium on Security and Privacy (S&P).. IEEE. https://doi.org/10.1109/sp.2010.25 .

Song, W, Yin H, Liu C, Song D (2018) DeepMem: Learning Graph Neural Network Models for Fast and Robust Memory Forensic Analysis In: Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, CCS ’18, 606–618.. ACM, New York.

Stephens, N, Grosen J, Salls C, Dutcher A, Wang R, Corbetta J, Shoshitaishvili Y, Kruegel C, Vigna G (2016) Driller: Augmenting Fuzzing Through Selective Symbolic Execution In: Proceedings 2016 Network and Distributed System Security Symposium.. Internet Society. https://doi.org/10.14722/ndss.2016.23368 .

Tan, G, Jaeger T (2017) CFG Construction Soundness in Control-Flow Integrity In: Proceedings of the 2017 Workshop on Programming Languages and Analysis for Security - PLAS ’17.. ACM. https://doi.org/10.1145/3139337.3139339 .

Tobiyama, S, Yamaguchi Y, Shimada H, Ikuse T, Yagi T (2016) Malware Detection with Deep Neural Network Using Process Behavior. Proc Int Comput Softw Appl Conf 2:577–582.

Unicorn-The ultimate CPU emulator (2015). https://www.unicorn-engine.org/ .

Ustebay, S, Turgut Z, Aydin MA (2019) Cyber Attack Detection by Using Neural Network Approaches: Shallow Neural Network, Deep Neural Network and AutoEncoder In: Computer Networks, 144–155.. Springer. https://doi.org/10.1007/978-3-030-21952-9_11 .

Varenne, R, Delorme JM, Plebani E, Pau D, Tomaselli V (2019) Intelligent Recognition of TCP Intrusions for Embedded Micro-controllers In: International Conference on Image Analysis and Processing, 361–373.. Springer. https://doi.org/10.1007/978-3-030-30754-7_36 .

Wang, Z, Liu P (2019) GPT Conjecture: Understanding the Trade-offs between Granularity, Performance and Timeliness in Control-Flow Integrity. arXiv preprint arXiv:1911.07828.

Wang, Y, Wu Z, Wei Q, Wang Q (2019) NeuFuzz: Efficient Fuzzing with Deep Neural Network. IEEE Access 7:36340–36352.

Xu, W, Huang L, Fox A, Patterson D, Jordan MI (2009) Detecting Large-Scale System Problems by Mining Console Logs In: Proceedings of the ACM SIGOPS 22Nd Symposium on Operating Systems Principles SOSP ’09, 117–132.. ACM, New York.

Xu, X, Liu C, Feng Q, Yin H, Song L, Song D (2017) Neural Network-Based Graph Embedding for Cross-Platform Binary Code Similarity Detection In: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, 363–376.. ACM. https://doi.org/10.1145/3133956.3134018 .

Xu, L, Zhang D, Jayasena N, Cavazos J (2018) HADM: Hybrid Analysis for Detection of Malware 16:702–724.

Xu, X, Ghaffarinia M, Wang W, Hamlen KW, Lin Z (2019) CONFIRM: Evaluating Compatibility and Relevance of Control-flow Integrity Protections for Modern Software In: 28th USENIX Security Symposium (USENIX Security 19), pages 1805–1821.. USENIX Association, Santa Clara.

Yagemann, C, Sultana S, Chen L, Lee W (2019) Barnum: Detecting Document Malware via Control Flow Anomalies in Hardware Traces In: Lecture Notes in Computer Science, 341–359.. Springer. https://doi.org/10.1007/978-3-030-30215-3_17 .

Yin, C, Zhu Y, Fei J, He X (2017) A Deep Learning Approach for Intrusion Detection using Recurrent Neural Networks. IEEE Access 5:21954–21961.

Yuan, X, Li C, Li X (2017) DeepDefense: Identifying DDoS Attack via Deep Learning In: 2017 IEEE International Conference on Smart Computing (SMARTCOMP).. IEEE. https://doi.org/10.1109/smartcomp.2017.7946998 .

Yun, I, Lee S, Xu M, Jang Y, Kim T (2018) QSYM : A Practical Concolic Execution Engine Tailored for Hybrid Fuzzing In: 27th USENIX Security Symposium (USENIX Security 18), pages 745–761.. USENIX Association, Baltimore.

Zhang, S, Meng W, Bu J, Yang S, Liu Y, Pei D, Xu J, Chen Y, Dong H, Qu X, Song L (2017) Syslog Processing for Switch Failure Diagnosis and Prediction in Datacenter Networks In: 2017 IEEE/ACM 25th International Symposium on Quality of Service (IWQoS).. IEEE. https://doi.org/10.1109/iwqos.2017.7969130 .

Zhang, J, Chen W, Niu Y (2019) DeepCheck: A Non-intrusive Control-flow Integrity Checking based on Deep Learning. arXiv preprint arXiv:1905.01858.

Zhang, X, Xu Y, Lin Q, Qiao B, Zhang H, Dang Y, Xie C, Yang X, Cheng Q, Li Z, Chen J, He X, Yao R, Lou J-G, Chintalapati M, Shen F, Zhang D (2019) Robust Log-based Anomaly Detection on Unstable Log Data In: Proceedings of the 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2019, pages 807–817.. ACM, New York.

Zhang, Y, Chen X, Guo D, Song M, Teng Y, Wang X (2019) PCCN: Parallel Cross Convolutional Neural Network for Abnormal Network Traffic Flows Detection in Multi-Class Imbalanced Network Traffic Flows. IEEE Access 7:119904–119916.


Acknowledgments

We are grateful to the anonymous reviewers for their useful comments and suggestions.

This work was supported by ARO W911NF-13-1-0421 (MURI), NSF CNS-1814679, and ARO W911NF-15-1-0576.

Author information

Authors and affiliations.

The Pennsylvania State University, Pennsylvania, USA

Yoon-Ho Choi, Peng Liu, Zitong Shang, Haizhou Wang, Zhilong Wang, Lan Zhang & Qingtian Zou

Pusan National University, Busan, Republic of Korea

Yoon-Ho Choi

Wuhan University of Technology, Wuhan, China

Junwei Zhou


Contributions

All authors read and approved the final manuscript.

Corresponding author

Correspondence to Peng Liu .

Ethics declarations

Competing interests.

PL is currently serving on the editorial board for Journal of Cybersecurity.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Choi, YH., Liu, P., Shang, Z. et al. Using deep learning to solve computer security challenges: a survey. Cybersecur 3 , 15 (2020). https://doi.org/10.1186/s42400-020-00055-5


Received : 11 March 2020

Accepted : 17 June 2020

Published : 10 August 2020


Keywords

  • Deep learning
  • Security-oriented program analysis
  • Return-oriented programming attacks
  • Control-flow integrity
  • Network attacks
  • Malware classification
  • System-event-based anomaly detection
  • Memory forensics
  • Fuzzing for software security


  • Survey Paper
  • Open access
  • Published: 01 July 2020

Cybersecurity data science: an overview from machine learning perspective

  • Iqbal H. Sarker   ORCID: orcid.org/0000-0003-1740-5517 1 , 2 ,
  • A. S. M. Kayes 3 ,
  • Shahriar Badsha 4 ,
  • Hamed Alqahtani 5 ,
  • Paul Watters 3 &
  • Alex Ng 3  

Journal of Big Data volume  7 , Article number:  41 ( 2020 ) Cite this article

139k Accesses

233 Citations

51 Altmetric

Metrics details

In a computing context, cybersecurity is undergoing massive shifts in technology and operations, and data science is driving the change. Extracting security-incident patterns or insights from cybersecurity data and building a corresponding data-driven model is the key to making a security system automated and intelligent. To understand and analyze the actual phenomena with data, various scientific methods, machine learning techniques, processes, and systems are used; this is commonly known as data science. In this paper, we focus on and briefly discuss cybersecurity data science , where the data is gathered from relevant cybersecurity sources and the analytics complement the latest data-driven patterns for providing more effective security solutions. The concept of cybersecurity data science makes the computing process more actionable and intelligent compared to traditional approaches in the domain of cybersecurity. We then discuss and summarize a number of associated research issues and future directions . Furthermore, we provide a machine-learning-based multi-layered framework for the purpose of cybersecurity modeling. Overall, our goal is not only to discuss cybersecurity data science and relevant methods but also to focus on their applicability towards data-driven intelligent decision making for protecting systems from cyber-attacks.

Introduction

Due to the increasing dependency on digitalization and the Internet-of-Things (IoT) [ 1 ], various security incidents such as unauthorized access [ 2 ], malware attacks [ 3 ], zero-day attacks [ 4 ], data breaches [ 5 ], denial of service (DoS) [ 2 ], and social engineering or phishing [ 6 ] have grown at an exponential rate in recent years. For instance, in 2010 there were fewer than 50 million unique malware executables known to the security community; by 2012 that number had roughly doubled to around 100 million, and in 2019 there were more than 900 million known malicious executables, a number that is likely to keep growing, according to statistics from the AV-TEST institute in Germany [ 7 ]. Cybercrime and attacks can cause devastating financial losses to organizations and individuals alike. It is estimated that a data breach costs 8.19 million USD in the United States and 3.9 million USD on average [ 8 ], and that the annual cost of cybercrime to the global economy is 400 billion USD [ 9 ]. According to Juniper Research [ 10 ], the number of records breached each year is expected to nearly triple over the next 5 years. Thus, it is essential that organizations adopt and implement a strong cybersecurity approach to mitigate these losses. According to [ 11 ], the national security of a country depends on business, government, and individual citizens having access to applications and tools that are highly secure, along with the capability to detect and eliminate such cyber-threats in a timely way. Therefore, effectively identifying various cyber incidents, either previously seen or unseen, and intelligently protecting the relevant systems from such cyber-attacks is a key issue that must be solved urgently.

Fig. 1 Popularity trends of data science, machine learning and cybersecurity over time, where the x-axis represents timestamp information and the y-axis represents the corresponding popularity values

Cybersecurity is a set of technologies and processes designed to protect computers, networks, programs and data from attack, damage, or unauthorized access [ 12 ]. In recent days, cybersecurity has been undergoing massive shifts in technology and operations in the context of computing, and data science (DS) is driving the change, with machine learning (ML), a core part of "Artificial Intelligence" (AI), playing a vital role in discovering insights from data. Machine learning can significantly change the cybersecurity landscape, and data science is leading a new scientific paradigm [ 13 , 14 ]. The popularity of these related technologies is increasing day by day, as shown in Fig.  1 , based on data from the last five years collected from Google Trends [ 15 ]. The figure plots dates on the x-axis against popularity values ranging from 0 (minimum) to 100 (maximum) on the y-axis. As shown in Fig.  1 , the popularity values of these areas were below 30 in 2014 but exceeded 70 in 2019, i.e., they more than doubled. In this paper, we focus on cybersecurity data science (CDS), which is broadly related to these areas in terms of security-data processing techniques and intelligent decision making in real-world applications. Overall, CDS is security data-focused, applies machine learning methods to quantify cyber risks, and ultimately seeks to optimize cybersecurity operations. This paper is thus intended for those in academia and industry who want to study and develop data-driven smart cybersecurity models based on machine learning techniques. Accordingly, great emphasis is placed on a thorough description of various types of machine learning methods and their relations and usage in the context of cybersecurity. This paper does not describe all of the different techniques used in cybersecurity in detail; instead, it gives an overview of cybersecurity data science modeling based on artificial intelligence, particularly from a machine learning perspective.

The ultimate goal of cybersecurity data science is data-driven intelligent decision making from security data for smart cybersecurity solutions. CDS represents a partial paradigm shift from traditional well-known security solutions such as firewalls, user authentication and access control, and cryptography systems, which might not be effective given today's needs in the cyber industry [ 16 , 17 , 18 , 19 ]. The problem is that these are typically handled statically by a few experienced security analysts, with data management done in an ad-hoc manner [ 20 , 21 ]. However, as an increasing number of cybersecurity incidents in the different formats mentioned above continuously appear over time, such conventional solutions have encountered limitations in mitigating cyber risks. As a result, numerous advanced attacks are created and spread very quickly throughout the Internet. Although several researchers use various data analysis and learning techniques to build cybersecurity models, as summarized in " Machine learning tasks in cybersecurity " section, a comprehensive security model based on the effective discovery of security insights and the latest security patterns could be more useful. To address this issue, we need to develop more flexible and efficient security mechanisms that can respond to threats and update security policies to mitigate them intelligently and in a timely manner. To achieve this goal, it is inherently required to analyze a massive amount of relevant cybersecurity data generated from various sources, such as network and system sources, and to discover insights or proper security policies with minimal human intervention in an automated manner.

Analyzing cybersecurity data and building the right tools and processes to successfully protect against cybersecurity incidents goes beyond a simple set of functional requirements and knowledge about risks, threats, or vulnerabilities. For effectively extracting insights or the patterns of security incidents, several machine learning techniques, such as feature engineering, data clustering, classification, and association analysis, or neural-network-based deep learning techniques, can be used; these are briefly discussed in " Machine learning tasks in cybersecurity " section. These learning techniques are capable of finding anomalies or malicious behavior and the data-driven patterns of associated security incidents needed to make an intelligent decision. Thus, based on the concept of data-driven decision making, we focus on cybersecurity data science , where the data is gathered from relevant cybersecurity sources such as network activity, database activity, application activity, or user activity, and the analytics complement the latest data-driven patterns for providing corresponding security solutions.

The contributions of this paper are summarized as follows.

We first briefly discuss the concept of cybersecurity data science and relevant methods to understand its applicability towards data-driven intelligent decision making in the domain of cybersecurity. For this purpose, we also review and briefly discuss different machine learning tasks in cybersecurity, and summarize various cybersecurity datasets, highlighting their usage in different data-driven cyber applications.

We then discuss and summarize a number of associated research issues and future directions in the area of cybersecurity data science that could help both academia and industry pursue further research and development in relevant application areas.

Finally, we provide a generic multi-layered framework for the cybersecurity data science model based on machine learning techniques. Within this framework, we briefly discuss how the cybersecurity data science model can be used to discover useful insights from security data and to make data-driven intelligent decisions for building smart cybersecurity systems.

The remainder of the paper is organized as follows. " Background " section summarizes the background of our study and gives an overview of the technologies related to cybersecurity data science. " Cybersecurity data science " section defines and briefly discusses cybersecurity data science, including various categories of cyber-incident data. In "  Machine learning tasks in cybersecurity " section, we briefly discuss various categories of machine learning techniques, including their relation to cybersecurity tasks, and summarize a number of machine-learning-based cybersecurity models in the field. " Research issues and future directions " section briefly discusses and highlights various research issues and future directions in the area of cybersecurity data science. In "  A multi-layered framework for smart cybersecurity services " section, we suggest a machine-learning-based framework for building a cybersecurity data science model and discuss its various layers and their roles. In "  Discussion " section, we highlight several key points regarding our study. Finally, " Conclusion " section concludes this paper.

Background

In this section, we give an overview of the technologies related to cybersecurity data science, including various types of cybersecurity incidents and defense strategies.

  • Cybersecurity

Over the last half-century, the information and communication technology (ICT) industry has evolved greatly; it is now ubiquitous and closely integrated with modern society. Protecting ICT systems and applications from cyber-attacks has therefore become a major concern for security policymakers in recent days [ 22 ]. The act of protecting ICT systems from various cyber-threats or attacks has come to be known as cybersecurity [ 9 ]. Several aspects are associated with cybersecurity: measures to protect information and communication technology; the raw data and information it contains and their processing and transmission; the associated virtual and physical elements of the systems; the degree of protection resulting from the application of those measures; and, eventually, the associated field of professional endeavor [ 23 ]. Craigen et al. defined "cybersecurity as a set of tools, practices, and guidelines that can be used to protect computer networks, software programs, and data from attack, damage, or unauthorized access" [ 24 ]. According to Aftergood et al. [ 12 ], "cybersecurity is a set of technologies and processes designed to protect computers, networks, programs and data from attacks and unauthorized access, alteration, or destruction". Overall, cybersecurity concerns the understanding of diverse cyber-attacks and the devising of corresponding defense strategies that preserve the properties defined below [ 25 , 26 ].

Confidentiality is a property used to prevent the access and disclosure of information to unauthorized individuals, entities or systems.

Integrity is a property used to prevent any modification or destruction of information in an unauthorized manner.

Availability is a property used to ensure timely and reliable access to information assets and systems for an authorized entity.

The term cybersecurity applies in a variety of contexts, from business to mobile computing, and can be divided into several common categories: network security, which focuses on securing a computer network from cyber attackers or intruders; application security, which is concerned with keeping software and devices free of risks or cyber-threats; information security, which mainly considers the security and privacy of relevant data; and operational security, which includes the processes of handling and protecting data assets. Typical cybersecurity systems are composed of network security systems and computer security systems containing a firewall, antivirus software, or an intrusion detection system [ 27 ].

Cyberattacks and security risks

The risks typically associated with any attack involve three security factors: threats, i.e., who is attacking; vulnerabilities, i.e., the weaknesses being attacked; and impacts, i.e., what the attack does [ 9 ]. A security incident is an act that threatens the confidentiality, integrity, or availability of information assets and systems. Several types of cybersecurity incidents may result in security risks to an organization's systems and networks or to an individual [ 2 ]. These are:

Unauthorized access , which describes the act of accessing a network, systems, or data without authorization, resulting in a violation of a security policy [ 2 ];

Malware , known as malicious software, is any program or software intentionally designed to cause damage to a computer, client, server, or computer network, e.g., botnets. Examples of different types of malware include computer viruses, worms, Trojan horses, adware, ransomware, spyware, and malicious bots [ 3 , 26 ]. Ransom malware, or ransomware , is an emerging form of malware that prevents users from accessing their systems, personal files, or devices, and then demands an anonymous online payment to restore access;

Denial-of-Service , an attack meant to shut down a machine or network, making it inaccessible to its intended users by flooding the target with traffic that triggers a crash. A Denial-of-Service (DoS) attack typically uses one computer with an Internet connection, while a distributed denial-of-service (DDoS) attack uses multiple computers and Internet connections to flood the targeted resource [ 2 ];

Phishing , a type of social engineering , is used for a broad range of malicious activities accomplished through human interaction, in which a fraudulent attempt is made to obtain sensitive information such as banking and credit card details, login credentials, or personally identifiable information by disguising oneself as a trusted individual or entity via electronic communication such as email, text, or instant message [ 26 ];

Zero-day attack , a term used to describe the threat of an unknown security vulnerability for which either a patch has not been released or the application developers were unaware [ 4 , 28 ].

Besides the attacks mentioned above, privilege escalation [ 29 ], password attacks [ 30 ], insider threats [ 31 ], man-in-the-middle attacks [ 32 ], advanced persistent threats [ 33 ], SQL injection attacks [ 34 ], cryptojacking attacks [ 35 ], and web application attacks [ 30 ] are well-known security incidents in the field of cybersecurity. A data breach, also known as a data leak, is another type of security incident that involves the unauthorized access of data by an individual, application, or service [ 5 ]. Thus, all data breaches are security incidents, but not all security incidents are data breaches. Most data breaches occur in the banking industry, involving credit card numbers and personal information, followed by the healthcare sector and the public sector [ 36 ].

Cybersecurity defense strategies

Defense strategies are needed to protect data or information, information systems, and networks from cyber-attacks or intrusions. More granularly, they are responsible for preventing data breaches or security incidents and for monitoring and reacting to intrusions, which can be defined as any kind of unauthorized activity that causes damage to an information system [ 37 ]. An intrusion detection system (IDS) is typically defined as "a device or software application that monitors a computer network or systems for malicious activity or policy violations" [ 38 ]. The traditional well-known security solutions, such as anti-virus, firewalls, user authentication, access control, data encryption, and cryptography systems, might not be effective given today's needs in the cyber industry [ 16 , 17 , 18 , 19 ]. An IDS, on the other hand, addresses these issues by analyzing security data from several key points in a computer network or system [ 39 , 40 ]. Moreover, intrusion detection systems can be used to detect both internal and external attacks.

Intrusion detection systems fall into different categories according to their usage scope. For instance, a host-based intrusion detection system (HIDS) and a network intrusion detection system (NIDS) are the most common types, spanning the scope from single computers to large networks. A HIDS monitors important files on an individual system, while a NIDS analyzes and monitors network connections for suspicious traffic. Similarly, based on methodology, signature-based IDS and anomaly-based IDS are the most well-known variants [ 37 ].

Signature-based IDS : A signature can be a predefined string, pattern, or rule that corresponds to a known attack. In a signature-based IDS, detecting a particular pattern is identified as detecting the corresponding attack. An example of a signature is a known pattern or byte sequence in network traffic, or a sequence used by malware. Anti-virus software uses such sequences or patterns as signatures when performing matching operations to detect attacks. Signature-based IDS is also known as knowledge-based or misuse detection [ 41 ]. This technique can efficiently process a high volume of network traffic, but it is strictly limited to known attacks. Thus, detecting new or unseen attacks is one of the biggest challenges faced by a signature-based system.
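To make the matching idea concrete, the following minimal Python sketch scans a payload against a dictionary of signatures. The byte patterns, names, and payload here are illustrative assumptions, not real IDS rules.

```python
# Hypothetical signature database: byte pattern -> human-readable rule name.
SIGNATURES = {
    b"\x90\x90\x90\x90": "NOP sled (possible shellcode)",
    b"' OR '1'='1": "SQL injection probe",
    b"../../etc/passwd": "Path traversal attempt",
}

def match_signatures(payload):
    """Return the names of all known signatures found in a payload."""
    return [name for pattern, name in SIGNATURES.items() if pattern in payload]

alerts = match_signatures(b"GET /index.php?id=' OR '1'='1 HTTP/1.1")
print(alerts)  # ['SQL injection probe']
```

As the text notes, this style of matching is fast and precise for known attacks, but a payload that matches no stored pattern passes silently, which is exactly the unseen-attack limitation.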

Anomaly-based IDS : The concept of anomaly-based detection overcomes the issues of signature-based IDS discussed above. In an anomaly-based intrusion detection system, the behavior of the network is first examined to find dynamic patterns and to automatically create a data-driven model that profiles normal behavior; deviations from this profile are then detected as anomalies [ 41 ]. Anomaly-based IDS can thus be treated as a dynamic, behavior-oriented approach to detection. Its main advantage is the ability to identify unknown or zero-day attacks [ 42 ]. However, an identified anomaly or abnormal behavior is not always an indicator of intrusion; it may instead result from factors such as policy changes or the offering of a new service.
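The profile-and-deviate idea can be sketched with a simple statistical baseline. The traffic counts (requests per minute) and the 3-sigma threshold below are illustrative assumptions, not a production model.

```python
import statistics

# Hypothetical baseline of "normal" requests-per-minute observations.
baseline = [52, 48, 50, 47, 53, 49, 51, 50, 46, 54]
mean = statistics.mean(baseline)
stdev = statistics.stdev(baseline)

def is_anomalous(observation, threshold=3.0):
    """Flag an observation whose z-score against the profile exceeds the threshold."""
    return abs(observation - mean) / stdev > threshold

print(is_anomalous(51))   # False: within the normal profile
print(is_anomalous(500))  # True: possible flooding / DoS behavior
```

Note that a flagged observation is only a deviation from the learned profile, not proof of intrusion, which mirrors the false-positive caveat above.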

In addition, a hybrid detection approach [ 43 , 44 ] that takes into account both the misuse-based and anomaly-based techniques discussed above can be used to detect intrusions. In a hybrid system, the misuse detection component detects known types of intrusions while the anomaly detection component handles novel attacks [ 45 ]. Besides these approaches, stateful protocol analysis can also be used to detect intrusions; it identifies deviations of protocol state similarly to the anomaly-based method, but uses predetermined universal profiles based on accepted definitions of benign activity [ 41 ]. In Table 1 , we summarize these common approaches, highlighting their pros and cons. Once detection has been completed, an intrusion prevention system (IPS), which is intended to prevent malicious events, can be used to mitigate the risks in different ways, such as through a manual process, by providing a notification, or automatically [ 46 ]. Among these, an automatic response system could be most effective, as it does not require a human interface between the detection and response systems.
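A hybrid pipeline can be sketched by chaining the two checks: known attacks are caught first by signature matching (misuse detection), and traffic that matches no signature is then screened by an anomaly check. The signature, size threshold, and labels below are hypothetical.

```python
# Hypothetical signature set and an assumed upper bound for benign request size.
SIGNATURES = {b"' OR '1'='1": "SQL injection probe"}
NORMAL_MAX_BYTES = 4096

def classify(payload):
    """Misuse detection first, then a simple anomaly check on request size."""
    for pattern, name in SIGNATURES.items():
        if pattern in payload:
            return f"known attack: {name}"   # misuse (signature) detection
    if len(payload) > NORMAL_MAX_BYTES:
        return "anomaly: oversized request"  # anomaly detection
    return "benign"

print(classify(b"id=' OR '1'='1"))  # known attack: SQL injection probe
print(classify(b"A" * 10000))       # anomaly: oversized request
print(classify(b"GET / HTTP/1.1"))  # benign
```

Ordering the checks this way keeps known attacks labeled precisely while still flagging novel behavior, which is the division of labor the hybrid approach describes.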

  • Data science

We are living in the age of data, advanced analytics, and data science, all of which relate to data-driven intelligent decision making. Although the process of searching for patterns or discovering hidden and interesting knowledge from data is known as data mining [ 47 ], in this paper we use the broader term "data science" rather than data mining. The reason is that data science, in its most fundamental form, is all about understanding data: it involves studying, processing, and extracting valuable insights from a set of information. In addition to data mining, data analytics is also related to data science. The development of data mining, knowledge discovery, and machine learning, which refers to creating algorithms and programs that learn on their own, together with the original data analysis and descriptive analytics from the statistical perspective, forms the general concept of "data analytics" [ 47 ]. Nowadays, many researchers use the term "data science" to describe the interdisciplinary field of data collection, preprocessing, inference, and decision making by analyzing data. To understand and analyze the actual phenomena with data, various scientific methods, machine learning techniques, processes, and systems are used, which is commonly known as data science. According to Cao et al. [ 47 ], "data science is a new interdisciplinary field that synthesizes and builds on statistics, informatics, computing, communication, management, and sociology to study data and its environments, to transform data to insights and decisions by following a data-to-knowledge-to-wisdom thinking and methodology". As a high-level statement in the context of cybersecurity, we can conclude that it is the study of security data to provide data-driven solutions for given security problems, also known as "the science of cybersecurity data". Figure 2 shows the typical data-to-insight-to-decision transfer at different periods and the general analytic stages in data science, in terms of a variety of analytics goals (G) and approaches (A) to achieve the data-to-decision goal [ 47 ].

Fig. 2 Data-to-insight-to-decision analytic stages in data science [ 47 ]

Based on the analytic power of data science, including machine learning techniques, it can be a viable component of security strategies. By using data science techniques, security analysts can manipulate and analyze security data more effectively and efficiently, uncovering valuable insights from the data. Thus, data science methodologies, including machine learning techniques, can be well utilized in the context of cybersecurity, in terms of problem understanding, gathering security data from diverse sources, preparing data to feed into a model, and data-driven model building and updating, in order to provide smart security services; this motivates us to define cybersecurity data science and to work in this research area.

Cybersecurity data science

In this section, we briefly discuss cybersecurity data science, including various categories of cyber-incident data and their usage in different application areas, as well as the key terms and areas related to our study.

Understanding cybersecurity data

Data science is largely driven by the availability of data [ 48 ]. Datasets typically represent collections of information records consisting of several attributes or features and related facts, on which cybersecurity data science is based. It is therefore important to understand the nature of cybersecurity data containing various types of cyber-attacks and relevant features. The reason is that raw security data collected from relevant cyber sources can be used to analyze various patterns of security incidents or malicious behavior and to build a data-driven security model to achieve our goal. Several datasets exist in the area of cybersecurity, including those for intrusion analysis, malware analysis, and anomaly, fraud, or spam analysis, which are used for various purposes. In Table 2 , we summarize several such datasets, including their various features and attacks, that are accessible on the Internet, and highlight their usage by machine learning techniques in different cyber applications. Effectively analyzing and processing these security features, building a target machine-learning-based security model according to the requirements, and eventually making data-driven decisions could play a role in providing the intelligent cybersecurity services that are discussed briefly in " A multi-layered framework for smart cybersecurity services " section.

Defining cybersecurity data science

Data science is transforming the world's industries. It is critically important for the future of intelligent cybersecurity systems and services because "security is all about data". When we seek to detect cyber-threats, we analyze security data in the form of files, logs, network packets, or other relevant sources. Traditionally, security professionals did not use data science techniques to make detections based on these data sources; instead, they used file hashes, custom-written rules such as signatures, or manually defined heuristics [ 21 ]. Although these techniques have their own merits in several cases, they require too much manual work to keep up with the changing cyber-threat landscape. In contrast, data science can bring about a massive shift in technology and its operations, where machine learning algorithms can be used to learn or extract security-incident patterns from training data for detection and prevention, for instance, to detect malware or suspicious trends, or to extract policy rules.

In recent days, the entire security industry has been moving towards data science because of its capability to transform raw data into decision making. Several data-driven tasks are associated with this shift: (i) data engineering, focusing on practical applications of data gathering and analysis; (ii) reducing data volume, which deals with filtering out the significant and relevant data for further analysis; (iii) discovery and detection, which focuses on extracting insights, incident patterns, or knowledge from data; (iv) automated models, which focus on building data-driven intelligent security models; (v) targeted security alerts, focusing on generating notable security alerts based on discovered knowledge while minimizing false alerts; and (vi) resource optimization, which deals with using the available resources to achieve the target goals in a security system. While making data-driven decisions, behavioral analysis could also play a significant role in the domain of cybersecurity [ 81 ].

Thus, the concept of cybersecurity data science incorporates the methods and techniques of data science and machine learning as well as the behavioral analytics of various security incidents. The combination of these technologies has given birth to the term "cybersecurity data science", which refers to collecting a large amount of security event data from different sources and analyzing it using machine learning technologies to detect security risks or attacks, either through the discovery of useful insights or through the latest data-driven patterns. It is, however, worth remembering that cybersecurity data science is not just a collection of machine learning algorithms; rather, it is a process that can help security professionals or analysts scale and automate their security activities in a smart and timely manner. Therefore, the formal definition can be stated as follows: "Cybersecurity data science is a research or working area existing at the intersection of cybersecurity, data science, and machine learning or artificial intelligence, which is mainly security data-focused, applies machine learning methods, attempts to quantify cyber-risks or incidents, and promotes inferential techniques to analyze behavioral patterns in security data. It also focuses on generating security response alerts, and eventually seeks to optimize cybersecurity solutions, to build automated and intelligent cybersecurity systems."

Table  3 highlights some key terms associated with cybersecurity data science. Overall, the outputs of cybersecurity data science are typically security data products, which can be a data-driven security model, discovered policy rules, a risk or attack prediction, a potential security service and recommendation, or a corresponding security system, depending on the given security problem in the domain of cybersecurity. In the next section, we briefly discuss various machine learning tasks with examples within the scope of our study.

Machine learning tasks in cybersecurity

Machine learning (ML) is typically considered a branch of “Artificial Intelligence” that is closely related to computational statistics, data mining and analytics, and data science, focusing particularly on enabling computers to learn from data [ 82 , 83 ]. Thus, machine learning models typically comprise a set of rules, methods, or complex “transfer functions” that can be applied to find interesting data patterns or to recognize or predict behavior [ 84 ], which can play an important role in the area of cybersecurity. In the following, we discuss the different methods that can be used to solve machine learning tasks and how they relate to cybersecurity tasks.

Supervised learning

Supervised learning is performed when specific targets are defined to be reached from a certain set of inputs, i.e., a task-driven approach. In the area of machine learning, the most popular supervised learning techniques are classification and regression methods [ 129 ]. These techniques are widely used to classify events or predict future outcomes for a particular security problem. For instance, classification techniques can be used in the cybersecurity domain to predict a denial-of-service attack (yes, no) or to identify different classes of network attacks such as scanning and spoofing. ZeroR [ 83 ], OneR [ 130 ], Naive Bayes [ 131 ], decision trees [ 132 , 133 ], K-nearest neighbors [ 134 ], support vector machines [ 135 ], adaptive boosting [ 136 ], and logistic regression [ 137 ] are well-known classification techniques. In addition, Sarker et al. have recently proposed the BehavDT [ 133 ] and IntruDtree [ 106 ] classification techniques, which can effectively build data-driven predictive models. On the other hand, regression techniques are useful for predicting a continuous or numeric value, e.g., the total number of phishing attacks in a certain period, or network packet parameters. Regression analysis can also be used to detect the root causes of cybercrime and other types of fraud [ 138 ]. Linear regression [ 82 ] and support vector regression [ 135 ] are popular regression techniques. The main difference between classification and regression is that the output variable in regression is numerical or continuous, while the predicted output in classification is categorical or discrete. Ensemble learning is an extension of supervised learning that mixes different simple models; for example, random forest learning [ 139 ] generates multiple decision trees to solve a particular security task.
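As a minimal illustration of the classification idea above, the sketch below applies a plain K-nearest-neighbor vote to a toy network-traffic dataset. All feature values, labels, and sample points are invented for illustration and are not drawn from any real dataset.

```python
# Minimal k-nearest-neighbor sketch for a toy network-attack
# classification task. Features: (packets_per_sec, distinct_ports).
from collections import Counter
import math

train = [
    ((5.0, 2), "normal"),
    ((6.0, 3), "normal"),
    ((90.0, 40), "scanning"),
    ((85.0, 35), "scanning"),
    ((400.0, 1), "dos"),
    ((380.0, 2), "dos"),
]

def knn_predict(x, k=3):
    """Label x by majority vote among the k closest training samples."""
    nearest = sorted(train, key=lambda s: math.dist(s[0], x))
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

print(knn_predict((395.0, 1)))  # a flood-like sample
```

The same structure carries over to any of the classifiers listed above; only the decision rule inside `knn_predict` changes.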

Unsupervised learning

In unsupervised learning problems, the main task is to find patterns, structures, or knowledge in unlabeled data, i.e., a data-driven approach [ 140 ]. In the area of cybersecurity, cyber-attacks like malware stay hidden in various ways, including changing their behavior dynamically and autonomously to avoid detection. Clustering techniques, a type of unsupervised learning, can help to uncover the hidden patterns and structures in datasets and thereby identify indicators of such sophisticated attacks. Similarly, clustering techniques can be useful for identifying anomalies and policy violations, and for detecting and eliminating noisy instances in data. K-means [ 141 ] and K-medoids [ 142 ] are popular partitioning clustering algorithms, and single linkage [ 143 ] and complete linkage [ 144 ] are well-known hierarchical clustering algorithms used in various application domains. Moreover, the bottom-up clustering approach proposed by Sarker et al. [ 145 ] can also be used by taking the data characteristics into account.
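The partitioning idea behind K-means can be sketched in a few lines. The following toy implementation groups security events by two numeric features (e.g., bytes transferred and connection count); the points are synthetic, chosen only to show two well-separated groups.

```python
# A bare-bones k-means sketch: assign each point to its nearest
# center, then recompute centers as cluster means, and repeat.
import math
import random

points = [(1, 2), (2, 1), (1.5, 1.8), (50, 48), (52, 51), (49, 50)]

def kmeans(points, k=2, iters=20, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        centers = [
            tuple(sum(v) / len(c) for v in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return clusters

for cluster in kmeans(points):
    print(sorted(cluster))
```

On this data the two groups of points end up in separate clusters, which is the behavior one would hope for when, say, anomalous connections form their own group.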

Besides, feature engineering tasks such as optimal feature selection or extraction related to a particular security problem can be useful for further analysis [ 106 ]. Recently, Sarker et al. [ 106 ] proposed an approach for selecting security features according to their importance scores. Moreover, principal component analysis, linear discriminant analysis, Pearson correlation analysis, and non-negative matrix factorization are popular dimensionality reduction techniques for such issues [ 82 ]. Association rule learning is another example, where machine-learning-based policy rules can prevent cyber-attacks. In an expert system, the rules are usually defined manually by a knowledge engineer working in collaboration with a domain expert [ 37 , 140 , 146 ]. Association rule learning, on the contrary, is the discovery of rules or relationships among a set of available security features or attributes in a given dataset [ 147 ]. To quantify the strength of such relationships, correlation analysis can be used [ 138 ]. Many association rule mining algorithms have been proposed in the machine learning and data mining literature, such as logic-based [ 148 ], frequent-pattern-based [ 149 , 150 , 151 ], and tree-based [ 152 ] algorithms. Recently, Sarker et al. [ 153 ] proposed an association rule learning approach based on non-redundant generation, which can be used to discover a set of useful security policy rules. Moreover, AIS [ 147 ], Apriori [ 149 ], Apriori-TID and Apriori-Hybrid [ 149 ], FP-Tree [ 152 ], RARM [ 154 ], and Eclat [ 155 ] are well-known association rule learning algorithms that are capable of solving such problems by generating a set of policy rules in the domain of cybersecurity.
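To make the association-rule idea concrete, the sketch below performs a single Apriori-style counting pass over toy alert records, keeping feature pairs whose support exceeds a threshold; such frequent pairs are the raw material from which rules like "src=external ⇒ action=deny" would be derived. The records, attribute names, and threshold are invented for illustration, and a full Apriori implementation would iterate over larger itemsets as well.

```python
# Toy Apriori-style pass: count co-occurring feature pairs in alert
# records and keep those whose support meets a minimum threshold.
from collections import Counter
from itertools import combinations

alerts = [
    {"src=external", "port=23", "action=deny"},
    {"src=external", "port=23", "action=deny"},
    {"src=internal", "port=443", "action=allow"},
    {"src=external", "port=22", "action=deny"},
]

def frequent_pairs(records, min_support=0.5):
    counts = Counter()
    for record in records:
        for pair in combinations(sorted(record), 2):
            counts[pair] += 1
    n = len(records)
    return {p: c / n for p, c in counts.items() if c / n >= min_support}

print(frequent_pairs(alerts))
```

Here the pair ("action=deny", "src=external") appears in three of four records (support 0.75), making it a candidate antecedent/consequent for a policy rule.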

Neural networks and deep learning

Deep learning is a part of machine learning within the area of artificial intelligence; it is a computational model inspired by the biological neural networks in the human brain [ 82 ]. The Artificial Neural Network (ANN) is frequently used in deep learning, and the most popular neural network algorithm is backpropagation [ 82 ]. It performs learning on a multi-layer feed-forward neural network consisting of an input layer, one or more hidden layers, and an output layer. The main difference between deep learning and classical machine learning lies in how performance scales with the amount of security data: deep learning algorithms typically perform well when the data volumes are large, whereas machine learning algorithms perform comparatively better on small datasets [ 44 ]. In our earlier work [ 129 ], we illustrated the effectiveness of these approaches on contextual datasets. Deep learning approaches mimic the mechanisms of the human brain to interpret large amounts of data or complex data such as images, sounds, and texts [ 44 , 129 ]. In terms of feature extraction for model building, deep learning reduces the effort of designing a feature extractor for each problem compared to classical machine learning techniques. On the other hand, deep learning typically takes much longer to train than a classical machine learning algorithm, although the test time is often shorter [ 44 ]. Thus, deep learning relies more on high-performance machines with GPUs than classical machine learning algorithms do [ 44 , 156 ]. The most popular deep neural network learning models include the multi-layer perceptron (MLP) [ 157 ], convolutional neural network (CNN) [ 158 ], and recurrent neural network (RNN) or long short-term memory (LSTM) network [ 121 , 158 ]. In recent years, researchers have used these deep learning techniques for different purposes in the domain of cybersecurity, such as detecting network intrusions and detecting and classifying malware traffic [ 44 , 159 ].
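As a concrete (toy) illustration of the feed-forward structure described above, the sketch below runs a forward pass through a tiny MLP that maps two input features to an "attack score" in (0, 1). The weights are fixed invented values rather than trained parameters; a real network would learn them via backpropagation.

```python
# Forward pass of a tiny multi-layer perceptron in plain Python:
# input layer (2 units) -> hidden layer (3 units, ReLU) -> output (sigmoid).
import math

def relu(v):
    return [max(0.0, x) for x in v]

def dense(x, weights, bias):
    """One fully connected layer: y_j = sum_i x_i * W[i][j] + b_j."""
    return [
        sum(xi * w for xi, w in zip(x, col)) + b
        for col, b in zip(zip(*weights), bias)
    ]

W1 = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]   # shape (2, 3)
b1 = [0.0, 0.1, 0.0]
W2 = [[1.0], [-1.0], [0.5]]                  # shape (3, 1)
b2 = [0.0]

def score(x):
    h = relu(dense(x, W1, b1))
    logit = dense(h, W2, b2)[0]
    return 1.0 / (1.0 + math.exp(-logit))   # sigmoid output in (0, 1)

s = score([1.0, 2.0])
print(round(s, 3))  # ≈ 0.401
```

Deep learning frameworks stack many such layers and optimize the weights automatically, but the per-layer computation is exactly this matrix-vector product plus nonlinearity.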

Other learning techniques

Semi-supervised learning can be described as a hybridization of the supervised and unsupervised techniques discussed above, as it works on both labeled and unlabeled data. In the area of cybersecurity, it can be useful when data needs to be labeled automatically, without human intervention, to improve the performance of cybersecurity models. Reinforcement learning is another type of machine learning, in which an agent creates its own learning experiences by interacting directly with the environment, i.e., an environment-driven approach; the environment is typically formulated as a Markov decision process, and decisions are made based on a reward function [ 160 ]. Monte Carlo learning, Q-learning, and Deep Q-Networks are the most common reinforcement learning algorithms [ 161 ]. For instance, in a recent work [ 126 ], the authors present an approach for detecting botnet traffic or malicious cyber activities using reinforcement learning combined with a neural network classifier. In another work [ 128 ], the authors discuss the application of deep reinforcement learning to intrusion detection for supervised problems, obtaining the best results with the Deep Q-Network algorithm. In the context of cybersecurity, genetic algorithms, which use fitness, selection, crossover, and mutation for optimization, can also be used to solve a similar class of learning problems [ 119 ].
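The Q-learning idea mentioned above can be shown on a deliberately tiny decision problem: an agent chooses to "allow" or "block" traffic and is rewarded for blocking in an "attack" state. States, actions, and reward values are all invented, and the task is one step long, so the update has no discounted next-state term.

```python
# Tabular Q-learning sketch with an epsilon-greedy policy on a toy
# one-step traffic-handling task.
import random

states, actions = ["benign", "attack"], ["allow", "block"]
reward = {("benign", "allow"): 1, ("benign", "block"): -1,
          ("attack", "allow"): -5, ("attack", "block"): 5}

Q = {(s, a): 0.0 for s in states for a in actions}
alpha, epsilon = 0.1, 0.2
random.seed(0)

for _ in range(2000):
    s = random.choice(states)                 # episodes are one step long
    if random.random() < epsilon:             # epsilon-greedy exploration
        a = random.choice(actions)
    else:
        a = max(actions, key=lambda act: Q[(s, act)])
    # one-step task: target is just the immediate reward
    Q[(s, a)] += alpha * (reward[(s, a)] - Q[(s, a)])

policy = {s: max(actions, key=lambda act: Q[(s, act)]) for s in states}
print(policy)
```

After training, the greedy policy allows benign traffic and blocks attack traffic; a Deep Q-Network replaces the table `Q` with a neural network when the state space is too large to enumerate.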

The various types of machine learning techniques discussed above can be useful in the domain of cybersecurity for building an effective security model. In Table  4 , we summarize several machine learning techniques that are used to build various types of security models for various purposes. Although these models typically represent a learning-based security model, in this paper we aim to focus on a comprehensive cybersecurity data science model and the relevant issues, in order to build a data-driven intelligent security system. In the next section, we highlight several research issues and potential solutions in the area of cybersecurity data science.

Research issues and future directions

Our study opens several research issues and challenges in the area of cybersecurity data science to extract insight from relevant data towards data-driven intelligent decision making for cybersecurity solutions. In the following, we summarize these challenges ranging from data collection to decision making.

Cybersecurity datasets : Source datasets are the primary component for work in the area of cybersecurity data science. Most of the existing datasets are old and might be insufficient for understanding the recent behavioral patterns of various cyber-attacks. Although the data can be transformed into a meaningful form after several processing tasks, there is still a lack of understanding of the characteristics of recent attacks and of how they occur. Thus, further processing or machine learning algorithms may yield a low accuracy rate for the target decisions. Therefore, establishing a large number of recent datasets for a particular problem domain, such as cyber-risk prediction or intrusion detection, is needed; this could be one of the major challenges in cybersecurity data science.

Handling quality problems in cybersecurity datasets : Cyber datasets might be noisy, incomplete, insignificant, or imbalanced, or may contain inconsistent instances related to a particular security incident. Such problems in a dataset may affect the quality of the learning process and degrade the performance of machine-learning-based models [ 162 ]. To make data-driven intelligent decisions for cybersecurity solutions, such data problems need to be dealt with effectively before building the cyber models. Therefore, understanding these problems in cyber data and handling them effectively, using existing algorithms or newly proposed algorithms for a particular problem domain such as malware analysis or intrusion detection and prevention, could be another research issue in cybersecurity data science.

Security policy rule generation : Security policy rules reference security zones and enable a user to allow, restrict, and track traffic on the network based on the corresponding user or user group, service, or application. The policy rules, including both general and more specific rules, are compared against the incoming traffic in sequence during execution, and the rule that matches the traffic is applied. The policy rules used in most cybersecurity systems are static and are generated by human expertise or are ontology-based [ 163 , 164 ]. Although association rule learning techniques produce rules from data, there is a problem of redundant rule generation [ 153 ] that makes the policy rule-set complex. Therefore, understanding such problems in policy rule generation and handling them effectively, using existing or newly proposed algorithms for a particular problem domain such as access control [ 165 ], could be another research issue in cybersecurity data science.

Hybrid learning method : Most commercial products in the cybersecurity domain contain signature-based intrusion detection techniques [ 41 ]. However, missing features or insufficient profiling can cause these techniques to miss unknown attacks. In that case, anomaly-based detection techniques, or a hybrid technique combining signature-based and anomaly-based detection, can be used to overcome such issues. A hybrid technique combining multiple learning techniques, or a combination of deep learning and classical machine learning methods, can be used to extract the target insights for a particular problem domain such as intrusion detection, malware analysis, or access control, and to make intelligent decisions for the corresponding cybersecurity solutions.
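The hybrid structure can be sketched as a two-stage check: a signature match runs first, and a simple anomaly score catches events that match no known signature. The signature strings, the length-based "profile", and the threshold below are illustrative placeholders; a real system would use a learned anomaly model in the second stage.

```python
# Sketch of a hybrid detector: signature-based first, anomaly-based fallback.
SIGNATURES = {"sql_injection": "' OR 1=1", "path_traversal": "../"}
NORMAL_PROFILE = {"mean_len": 60.0, "max_dev": 200.0}

def detect(payload):
    # 1) signature-based: exact known patterns
    for name, pattern in SIGNATURES.items():
        if pattern in payload:
            return ("signature", name)
    # 2) anomaly-based: flag payloads far from the normal length profile
    dev = abs(len(payload) - NORMAL_PROFILE["mean_len"])
    if dev > NORMAL_PROFILE["max_dev"]:
        return ("anomaly", f"length deviation {dev:.0f}")
    return ("benign", None)

print(detect("id=1' OR 1=1 --"))
print(detect("A" * 1000))
```

The design choice here is ordering: signatures give precise, low-false-positive verdicts for known attacks, while the anomaly stage trades precision for coverage of unknown ones.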

Protecting valuable security information : Another issue of a cyber data attack is the loss of extremely valuable data and information, which can be damaging for an organization. With the use of encryption or highly complex signatures, one can stop others from probing into a dataset. In such cases, cybersecurity data science can be used to build a data-driven, impenetrable protocol to protect such security information. To achieve this goal, cyber analysts can develop algorithms that analyze the history of cyberattacks to detect the most frequently targeted chunks of data. Thus, understanding such data protection problems and designing corresponding algorithms to handle them effectively could be another research issue in the area of cybersecurity data science.

Context-awareness in cybersecurity : Existing cybersecurity work mainly originates from the relevant cyber data containing several low-level features. When data mining and machine learning techniques are applied to such datasets, a related pattern can be identified that describes them properly. However, broader contextual information [ 140 , 145 , 166 ], such as temporal or spatial context, relationships among events or connections, and dependencies, can be used to decide whether a suspicious activity exists or not. For instance, some approaches may flag individual connections as DoS attacks, while security experts might not treat them as malicious by themselves. Thus, a significant limitation of existing cybersecurity work is the lack of contextual information for predicting risks or attacks. Therefore, context-aware adaptive cybersecurity solutions could be another research issue in cybersecurity data science.

Feature engineering in cybersecurity : The efficiency and effectiveness of a machine-learning-based security model have always been a major challenge due to the high volume of network data with a large number of traffic features. The large dimensionality of data has been addressed using several techniques such as principal component analysis (PCA) [ 167 ] and singular value decomposition (SVD) [ 168 ]. In addition to the low-level features in the datasets, the contextual relationships between suspicious activities might be relevant. Such contextual data can be stored in an ontology or taxonomy for further processing. Thus, how to effectively select the optimal features, or extract the significant features considering both the low-level and the contextual features, for effective cybersecurity solutions could be another research issue in cybersecurity data science.

Remarkable security alert generation and prioritization : In many cases, the cybersecurity system may not be well defined and may cause a substantial number of false alarms, which are unexpected in an intelligent system. For instance, an IDS deployed in a real-world network can generate around nine million alerts per day [ 169 ]. A network-based intrusion detection system typically inspects incoming traffic for matches against associated patterns to detect risks, threats, or vulnerabilities, and generates security alerts. However, responding to each such alert may not be effective, as it consumes relatively large amounts of time and resources and may consequently result in a self-inflicted DoS. To overcome this problem, high-level management is required that correlates the security alerts considering the current context and their logical relationships, including their prioritization, before reporting them to users; this could be another research issue in cybersecurity data science.
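One very simple form of such alert correlation is grouping alerts by a shared attribute and scoring each group before reporting. The sketch below groups alerts by source address and ranks groups by a severity-times-volume score; the fields, weights, and scoring rule are invented for illustration, not a proposed design.

```python
# Minimal alert-correlation sketch: group alerts by source, score
# each group, and report groups in priority order.
from collections import defaultdict

alerts = [
    {"src": "10.0.0.5", "severity": 3},
    {"src": "10.0.0.5", "severity": 4},
    {"src": "10.0.0.5", "severity": 5},
    {"src": "192.168.1.9", "severity": 2},
]

groups = defaultdict(list)
for alert in alerts:
    groups[alert["src"]].append(alert)

def priority(group):
    # toy score: worst severity in the group, amplified by volume
    return max(a["severity"] for a in group) * len(group)

ranked = sorted(groups.items(), key=lambda kv: priority(kv[1]), reverse=True)
for src, group in ranked:
    print(src, priority(group))
```

An analyst would then handle only the top-ranked groups instead of nine million individual alerts.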

Recency analysis in cybersecurity solutions : Machine-learning-based security models typically use a large amount of static data to generate data-driven decisions. Anomaly detection systems rely on constructing such a model of normal behavior and anomalies according to their patterns. However, normal behavior in a large and dynamic security system is not well defined and may change over time, which can be viewed as incremental growth of the dataset. The patterns in incremental datasets may change in several cases, which often results in a substantial number of false alarms, known as false positives. A recent malicious behavioral pattern is thus more likely to be interesting and significant than an older one for predicting unknown attacks. Therefore, effectively using the concept of recency analysis [ 170 ] in cybersecurity solutions could be another issue in cybersecurity data science.

The most important work for an intelligent cybersecurity system is to develop an effective framework that supports data-driven decision making. In such a framework, advanced data analysis based on machine learning techniques needs to be considered, so that the framework is capable of minimizing these issues and providing automated and intelligent security services. Thus, a well-designed security framework for cybersecurity data, along with its experimental evaluation, is a very important direction and a big challenge as well. In the next section, we suggest and discuss a data-driven cybersecurity framework based on machine learning techniques that considers multiple processing layers.

A multi-layered framework for smart cybersecurity services

As discussed earlier, cybersecurity data science is data-focused, applies machine learning methods, attempts to quantify cyber risks, promotes inferential techniques to analyze behavioral patterns, focuses on generating security response alerts, and eventually seeks to optimize cybersecurity operations. Hence, we briefly discuss a multi-layered data processing framework that can potentially be used to discover security insights from raw data in order to build smart cybersecurity systems, e.g., dynamic policy-rule-based access control or intrusion detection and prevention systems. To make data-driven intelligent decisions in the resultant cybersecurity system, an understanding of the security problems and the nature of the corresponding security data, together with their thorough analysis, is needed. For this purpose, our suggested framework not only considers machine learning techniques to build the security model but also takes into account incremental learning and dynamism to keep the model up-to-date, as well as the corresponding response generation, which could make it more effective and intelligent in providing the expected services. Figure 3 shows an overview of the framework, involving several processing layers, from raw security event data to services. In the following, we briefly discuss the working procedure of the framework.

figure 3

A generic multi-layered framework based on machine learning techniques for smart cybersecurity services

Security data collecting

Collecting valuable cybersecurity data is a crucial step that forms a connecting link between security problems in the cyberinfrastructure and the corresponding data-driven solution steps of this framework, shown in Fig.  3 . The reason is that cyber data serves as the source for establishing the ground truth of the security model, which affects the model's performance. The quality and quantity of the cyber data decide the feasibility and effectiveness of solving the security problem according to our goal. Thus, the concern is how to collect valuable data that meets the unique needs of building the data-driven security models.

The general step of collecting and managing security data from diverse data sources depends on the particular security problem and project within the enterprise. Data sources can be classified into several broad categories, such as network, host, and hybrid [ 171 ]. Within the network infrastructure, the security system can leverage different types of security data, such as IDS logs, firewall logs, network traffic data, packet data, and honeypot data, for providing the target security services. For instance, whether a given IP is malicious or not can be determined by analyzing the data on IP addresses and their cyber activities. In the domain of cybersecurity, the network sources mentioned above are considered the primary security event sources to analyze. In the host category, data is collected from an organization’s host machines, where the data sources can be operating system logs, database access logs, web server logs, email logs, application logs, etc. Collecting data from both the network and the host machines is considered the hybrid category. Overall, in the data collection layer, network activity, database activity, application activity, and user activity can be the possible security event sources in the context of cybersecurity data science.

Security data preparing

After collecting the raw security data from various sources according to the problem domain discussed above, this layer is responsible for preparing the raw data for model building by applying the necessary processes. However, not all of the collected data contributes to the model building process in the domain of cybersecurity [ 172 ]; the useless data should therefore be removed from the data captured by the network sniffer. Moreover, data might be noisy, have missing or corrupted values, or have attributes of widely varying types and scales. High-quality data is necessary for achieving higher accuracy in a data-driven model, which is the result of learning a function that maps an input to an output based on example input-output pairs. Thus, procedures for data cleaning and for handling missing or corrupted values may be required. Moreover, security data features or attributes can be of different types, such as continuous, discrete, or symbolic [ 106 ]. Beyond a solid understanding of these types of data and attributes and their permissible operations, it is necessary to preprocess the data and attributes to convert them into the target type. Besides, the raw data can be of different forms, such as structured, semi-structured, or unstructured. Thus, normalization, transformation, or collation can be useful to organize the data in a structured manner. In some cases, natural language processing techniques might be useful depending on the data type and characteristics, e.g., textual contents. As both the quality and quantity of data decide the feasibility of solving the security problem, effective pre-processing and management of data and their representation can play a significant role in building an effective security model for intelligent services.
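The cleaning and conversion steps above can be sketched on a handful of toy connection records: drop records with missing fields, map a symbolic attribute to a numeric code, and min-max normalize a numeric one. The field names and values are invented for illustration.

```python
# Small data-preparation sketch: clean, encode symbolic attributes,
# and min-max normalize a numeric attribute.
raw = [
    {"proto": "tcp", "bytes": 120},
    {"proto": "udp", "bytes": 4000},
    {"proto": None,  "bytes": 90},      # corrupted record, dropped below
    {"proto": "tcp", "bytes": 560},
]

PROTO_CODES = {"tcp": 0, "udp": 1}      # symbolic -> numeric encoding

clean = [r for r in raw if r["proto"] in PROTO_CODES]
lo = min(r["bytes"] for r in clean)
hi = max(r["bytes"] for r in clean)

prepared = [
    {"proto": PROTO_CODES[r["proto"]],
     "bytes": (r["bytes"] - lo) / (hi - lo)}   # min-max scaling to [0, 1]
    for r in clean
]
print(prepared)
```

Real pipelines add imputation, outlier handling, and per-feature scaling statistics saved for reuse at prediction time, but the shape of the transformation is the same.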

Machine learning-based security modeling

This is the core step where insights and knowledge are extracted from data through the application of cybersecurity data science. In this section, we particularly focus on machine-learning-based modeling, as machine learning techniques can significantly change the cybersecurity landscape. The security features or attributes and their patterns in the data are of high interest to be discovered and analyzed to extract security insights. To achieve this goal, a deeper understanding of the data and machine-learning-based analytical models utilizing a large amount of cybersecurity data can be effective. Thus, various machine learning tasks can be involved in this model building layer according to the solution perspective. One such task is security feature engineering, which is mainly responsible for transforming raw security data into informative features that effectively represent the underlying security problem to the data-driven models. Several data-processing tasks may be involved in this module according to the security data characteristics: feature transformation and normalization; feature selection, which takes into account a subset of the available security features according to their correlations or importance in modeling; and feature generation and extraction, which creates brand-new principal components. For instance, the chi-squared test, analysis of variance, correlation coefficient analysis, feature importance, as well as discriminant and principal component analysis or singular value decomposition, can be used to analyze the significance of the security features and perform the security feature engineering tasks [ 82 ].
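As a small example of one of the significance measures named above, the following sketch scores each feature by the absolute value of its Pearson correlation with the label and ranks the features accordingly. The tiny dataset is synthetic, invented only to make one feature clearly more informative than the other.

```python
# Feature scoring via absolute Pearson correlation with the label.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# columns: connection duration, failed logins; label: 1 = attack
features = {
    "duration":      [1, 2, 1, 3, 2, 1],
    "failed_logins": [0, 0, 1, 9, 8, 7],
}
label = [0, 0, 0, 1, 1, 1]

scores = {name: abs(pearson(col, label)) for name, col in features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Here `failed_logins` scores far higher than `duration`, so a selection step keeping the top-k features would retain it first; chi-squared or ANOVA scores would slot into the same ranking loop.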

Another significant module is security data clustering, which uncovers hidden patterns and structures in huge volumes of security data to identify where new threats exist. It typically involves grouping security data with similar characteristics and can be used to solve several cybersecurity problems such as detecting anomalies and policy violations. The malicious behavior or anomaly detection module is typically responsible for identifying deviations from known behavior, where clustering-based analysis and techniques can also be applied. In the cybersecurity area, attack classification or prediction is treated as one of the most significant modules; it is responsible for building a prediction model that classifies attacks or threats and predicts future events for a particular security problem. Predicting a denial-of-service attack, or a spam filter separating spam from other messages, are relevant examples. The association learning or policy rule generation module can play a role in building an expert security system comprising several IF-THEN rules that define attacks. Thus, for the problem of generating policy rules for a rule-based access control system, association learning can be used, as it discovers the associations or relationships among a set of available security features in a given security dataset. The popular machine learning algorithms in these categories are briefly discussed in the “  Machine learning tasks in cybersecurity ” section. The model selection or customization module is responsible for choosing whether to use an existing machine learning model or to customize one. Analyzing data and building models based on traditional machine learning or deep learning methods can achieve acceptable results in certain cases in the domain of cybersecurity. However, in terms of effectiveness and efficiency, or other performance measures considering time complexity, generalization capacity, and, most importantly, the impact of the algorithm on the detection rate of a system, machine learning models may need to be customized for a specific security problem. Moreover, customizing the related techniques and data can improve the performance of the resultant security model and make it better applicable in the cybersecurity domain. The modules discussed above can work separately or in combination depending on the target security problems.

Incremental learning and dynamism

In our framework, this layer is concerned with finalizing the resultant security model by incorporating additional intelligence according to the needs. This can be achieved by further processing in several modules. For instance, the post-processing and improvement module in this layer can simplify the extracted knowledge according to the particular requirements by incorporating domain-specific knowledge. As attack classification or prediction models based on machine learning techniques strongly rely on the training data, they can hardly be generalized to other datasets, which can be significant for some applications. To address such limitations, this module is responsible for utilizing domain knowledge in the form of a taxonomy or ontology to improve attack correlation in cybersecurity applications.

Another significant module, recency mining and security model updating, is responsible for keeping the security model up-to-date for better performance by extracting the latest data-driven security patterns. The knowledge extracted in the earlier layer is based on a static initial dataset and considers the overall patterns in that dataset. However, such knowledge might not guarantee high performance in several cases, because of incremental security data with recent patterns. In many cases, such incremental data may contain new patterns that conflict with existing knowledge. Thus, applying the concept of RecencyMiner [ 170 ] to incremental security data and extracting new patterns can be more effective than relying on the existing old patterns, because recent security patterns and rules are more likely to be significant than older ones for predicting cyber risks or attacks. Rather than processing the whole security data again, recency-based dynamic updating according to the new patterns is more efficient in terms of processing and outcome. This can make the resultant cybersecurity model intelligent and dynamic. Finally, the response planning and decision making module is responsible for making decisions based on the extracted insights and taking the necessary actions to prevent cyber-attacks on the system, in order to provide automated and intelligent services. The services may differ depending on the particular requirements of a given security problem.
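The recency-based updating idea can be sketched with a sliding window: instead of retraining on all history, the model keeps only the most recent events and recomputes its normal-behavior baseline from that window, so old patterns age out automatically. The window size and event values below are illustrative, and a real recency miner would track patterns or rules rather than a single mean.

```python
# Sliding-window sketch of recency-based model updating.
from collections import deque

class RecencyBaseline:
    def __init__(self, window=5):
        self.events = deque(maxlen=window)   # oldest events fall out

    def observe(self, value):
        self.events.append(value)

    def baseline(self):
        return sum(self.events) / len(self.events)

model = RecencyBaseline(window=5)
for v in [10, 11, 9, 10, 10]:      # old behavior
    model.observe(v)
for v in [50, 52, 49, 51, 50]:     # behavior drifts; window forgets old data
    model.observe(v)
print(model.baseline())            # reflects only the recent pattern
```

Because each update touches only the window, the cost per new event is constant, in contrast to reprocessing the entire accumulated dataset.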

Overall, this framework is a generic description that can potentially be used to discover useful insights from security data, to build smart cybersecurity systems, and to address complex security challenges such as intrusion detection, access control management, anomaly and fraud detection, or denial-of-service attacks in the area of cybersecurity data science.

Although several research efforts in different directions have been devoted to cybersecurity solutions, as discussed in the “ Background ”, “ Cybersecurity data science ”, and “ Machine learning tasks in cybersecurity ” sections, this paper presents a comprehensive view of cybersecurity data science. To this end, we have conducted a literature review to understand cybersecurity data, various defense strategies including intrusion detection techniques, and the different types of machine learning techniques used in cybersecurity tasks. Based on our discussion of existing work, several research issues related to security datasets, data quality problems, policy rule generation, learning methods, data protection, feature engineering, security alert generation, recency analysis, etc. have been identified that require further research attention in the domain of cybersecurity data science.

The scope of cybersecurity data science is broad. Several data-driven tasks, such as intrusion detection and prevention, access control management, security policy generation, anomaly detection, spam filtering, fraud detection and prevention, and various types of malware attack detection and defense strategies, can be considered within the scope of cybersecurity data science. Such a task-based categorization could be helpful for security professionals, including researchers and practitioners, who are interested in the domain-specific aspects of security systems [ 171 ]. The output of cybersecurity data science can be used in many application areas such as Internet of Things (IoT) security [ 173 ], network security [ 174 ], cloud security [ 175 ], mobile and web applications [ 26 ], and other relevant cyber areas. Moreover, intelligent cybersecurity solutions are important for the banking industry, the healthcare sector, and the public sector, where data breaches typically occur [ 36 , 176 ]. Besides, data-driven security solutions could also be effective in AI-based blockchain technology, where AI works with huge volumes of security event data to extract useful insights using machine learning techniques, and blockchain serves as a trusted platform to store such data [ 177 ].

Although in this paper we discuss cybersecurity data science with a focus on turning raw security data into data-driven decisions for intelligent security solutions, it is also related to big data analytics in terms of data processing and decision making. Big data deals with data sets that are too large or complex to handle conventionally, characterized by high volume, velocity, and variety. Big data analytics has two main parts: data management, which involves data storage, and analytics [ 178 ]. The analytics part describes the process of analyzing such datasets to discover patterns, unknown correlations, rules, and other useful insights [ 179 ]. Thus, several advanced data analysis techniques, such as AI, data mining, and machine learning, could play an important role in processing big data by converting big problems into small ones [ 180 ]. To do this, strategies such as parallelization, divide-and-conquer, incremental learning, sampling, granular computing, and feature or instance selection can be used to make better decisions, reduce costs, or enable more efficient processing. In such cases, the concept of cybersecurity data science, particularly machine learning-based modeling, could be helpful for process automation and decision making in intelligent security solutions. Moreover, researchers could consider modified algorithms or models for handling big data on parallel computing platforms like Hadoop, Storm, etc. [ 181 ].
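The incremental-learning strategy mentioned above can be illustrated with a small sketch. The following is a hypothetical example, not part of any surveyed framework: it maintains running per-feature statistics over a stream of security events using Welford's online algorithm, so unusual events can be flagged without ever storing the full dataset. The class name, the toy traffic features, and the 3-sigma threshold are all illustrative assumptions.

```python
# Illustrative sketch (not from the paper): incremental anomaly scoring
# over a stream of numeric security-event features using Welford's
# online mean/variance algorithm, so no batch of data is ever stored.

class IncrementalAnomalyDetector:
    def __init__(self, n_features, threshold=3.0):
        self.n = 0                      # events seen so far
        self.mean = [0.0] * n_features  # running per-feature mean
        self.m2 = [0.0] * n_features    # running sum of squared deviations
        self.threshold = threshold      # z-score cutoff for "anomalous"

    def score(self, event):
        """Return the largest per-feature z-score for this event."""
        if self.n < 2:
            return 0.0
        zs = []
        for i, x in enumerate(event):
            var = self.m2[i] / (self.n - 1)
            std = var ** 0.5
            zs.append(abs(x - self.mean[i]) / std if std > 0 else 0.0)
        return max(zs)

    def update(self, event):
        """Fold one event into the running statistics (Welford's update)."""
        self.n += 1
        for i, x in enumerate(event):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.n
            self.m2[i] += delta * (x - self.mean[i])

    def process(self, event):
        """Score first, then learn: returns True if the event looks anomalous."""
        anomalous = self.score(event) > self.threshold
        self.update(event)
        return anomalous


# Example: a stream of (packet_count, byte_count) observations.
det = IncrementalAnomalyDetector(n_features=2)
for ev in [(100 + i % 5, 5000 + 10 * (i % 7)) for i in range(50)]:
    det.process(ev)
print(det.process((100000, 9000000)))  # a burst far outside the baseline; prints True
```

Because the statistics are updated one event at a time, the same code works unchanged whether events arrive from a log file, a socket, or a message queue, which is exactly the appeal of the incremental strategy for large-scale security data.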

Based on the concept of cybersecurity data science discussed in this paper, future work could include building a data-driven security model for a particular security problem, together with an empirical evaluation measuring the effectiveness and efficiency of the model and assessing its usability in a real-world application domain.

Motivated by the growing significance of cybersecurity, data science, and machine learning technologies, in this paper we have discussed how cybersecurity data science applies to data-driven intelligent decision making in smart cybersecurity systems and services. We have also discussed how it can impact security data, both in terms of extracting insights from security incidents and in terms of the datasets themselves. We surveyed the state of the art concerning security incident data and the corresponding security services, discussed how machine learning techniques can impact the domain of cybersecurity, and examined the security challenges that remain. In existing research, much focus has been placed on traditional security solutions, with less work available on machine learning-based security systems. For each common technique, we have discussed relevant security research. The purpose of this article is to share an overview of the conceptualization, understanding, modeling, and thinking about cybersecurity data science.

We have further identified and discussed various key issues in security analysis to signpost future research directions in the domain of cybersecurity data science. Based on this knowledge, we have also provided a generic multi-layered framework of a cybersecurity data science model based on machine learning techniques, in which data is gathered from diverse sources and the analytics complement the latest data-driven patterns to provide intelligent security services. The framework consists of several main phases: security data collection, data preparation, machine learning-based security modeling, and incremental learning and dynamism for smart cybersecurity systems and services. We specifically focused on extracting insights from security data, from setting up a research design through to concepts for data-driven intelligent security solutions.
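The main phases listed above (security data collection, data preparation, machine learning-based security modeling, and incremental learning) can be sketched as a minimal pipeline. The sketch below is purely illustrative and not the framework's actual implementation; the toy "user,action" log format, the frequency-based model, and all function names are invented for illustration.

```python
# Illustrative sketch (not from the paper): the main phases of a
# data-driven security pipeline -- collection, preparation, modeling,
# and incremental updating -- wired together as plain functions.

from collections import Counter

def collect(raw_log_lines):
    """Security data collection: ingest raw log lines from a source."""
    return [line.strip() for line in raw_log_lines if line.strip()]

def prepare(events):
    """Data preparation: parse 'user,action' records, drop malformed ones."""
    parsed = []
    for line in events:
        parts = line.split(",")
        if len(parts) == 2:
            parsed.append((parts[0], parts[1]))
    return parsed

def build_model(records):
    """Security modeling: learn how often each action is observed."""
    return Counter(action for _, action in records)

def update_model(model, new_records):
    """Incremental learning: fold newly collected records into the model."""
    model.update(action for _, action in new_records)
    return model

def decide(model, action, min_seen=2):
    """Intelligent service: flag actions rarely seen during training."""
    return "alert" if model[action] < min_seen else "allow"

# End-to-end run over a toy log.
log = ["alice,login", "bob,login", "alice,read", "bob,login", "eve,format_disk"]
model = build_model(prepare(collect(log)))
print(decide(model, "login"))        # prints allow
print(decide(model, "format_disk"))  # prints alert
```

In a real deployment each phase would of course be far richer (feature engineering, a trained classifier, alert correlation), but the separation of phases and the `update_model` hook for incremental learning are the structural points the framework emphasizes.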

Overall, this paper aimed not only to discuss cybersecurity data science and relevant methods but also the applicability of these methods to data-driven intelligent decision making in cybersecurity systems and services from a machine learning perspective. Our analysis and discussion have several implications for both security researchers and practitioners. For researchers, we have highlighted several issues and directions for future research; other areas for potential research include empirical evaluation of the suggested data-driven model and comparative analysis against other security systems. For practitioners, the multi-layered machine learning-based model can serve as a reference in designing intelligent cybersecurity systems for organizations. We believe that our study on cybersecurity data science opens a promising path and can serve as a reference guide for both academia and industry in future research and applications in the area of cybersecurity.

Availability of data and materials

Not applicable.

Abbreviations

ML: Machine learning

AI: Artificial Intelligence

ICT: Information and communication technology

IoT: Internet of Things

DDoS: Distributed Denial of Service

IDS: Intrusion detection system

IPS: Intrusion prevention system

HIDS: Host-based intrusion detection system

NIDS: Network intrusion detection system

SIDS: Signature-based intrusion detection system

AIDS: Anomaly-based intrusion detection system

Li S, Da Xu L, Zhao S. The internet of things: a survey. Inform Syst Front. 2015;17(2):243–59.

Sun N, Zhang J, Rimba P, Gao S, Zhang LY, Xiang Y. Data-driven cybersecurity incident prediction: a survey. IEEE Commun Surv Tutor. 2018;21(2):1744–72.

McIntosh T, Jang-Jaccard J, Watters P, Susnjak T. The inadequacy of entropy-based ransomware detection. In: International conference on neural information processing. New York: Springer; 2019. p. 181–189

Alazab M, Venkatraman S, Watters P, Alazab M, et al. Zero-day malware detection based on supervised learning algorithms of api call signatures (2010)

Shaw A. Data breach: from notification to prevention using pci dss. Colum Soc Probs. 2009;43:517.

Gupta BB, Tewari A, Jain AK, Agrawal DP. Fighting against phishing attacks: state of the art and future challenges. Neural Comput Appl. 2017;28(12):3629–54.

Av-test institute, germany, https://www.av-test.org/en/statistics/malware/ . Accessed 20 Oct 2019.

Ibm security report, https://www.ibm.com/security/data-breach . Accessed on 20 Oct 2019.

Fischer EA. Cybersecurity issues and challenges: In brief. Congressional Research Service (2014)

Juniper research. https://www.juniperresearch.com/ . Accessed on 20 Oct 2019.

Papastergiou S, Mouratidis H, Kalogeraki E-M. Cyber security incident handling, warning and response system for the european critical information infrastructures (cybersane). In: International Conference on Engineering Applications of Neural Networks, p. 476–487 (2019). New York: Springer

Aftergood S. Cybersecurity: the cold war online. Nature. 2017;547(7661):30.

Hey AJ, Tansley S, Tolle KM, et al. The fourth paradigm: data-intensive scientific discovery. 2009.

Cukier K. Data, data everywhere: A special report on managing information, 2010.

Google trends. In: https://trends.google.com/trends/ , 2019.

Anwar S, Mohamad Zain J, Zolkipli MF, Inayat Z, Khan S, Anthony B, Chang V. From intrusion detection to an intrusion response system: fundamentals, requirements, and future directions. Algorithms. 2017;10(2):39.

Mohammadi S, Mirvaziri H, Ghazizadeh-Ahsaee M, Karimipour H. Cyber intrusion detection by combined feature selection algorithm. J Inform Sec Appl. 2019;44:80–8.

Tapiador JE, Orfila A, Ribagorda A, Ramos B. Key-recovery attacks on kids, a keyed anomaly detection system. IEEE Trans Depend Sec Comput. 2013;12(3):312–25.

Tavallaee M, Stakhanova N, Ghorbani AA. Toward credible evaluation of anomaly-based intrusion-detection methods. IEEE Trans Syst Man Cybern C. 2010;40(5):516–24.

Foroughi F, Luksch P. Data science methodology for cybersecurity projects. arXiv preprint arXiv:1803.04219 , 2018.

Saxe J, Sanders H. Malware data science: Attack detection and attribution, 2018.

Rainie L, Anderson J, Connolly J. Cyber attacks likely to increase. Digital Life in. 2014, vol. 2025.

Fischer EA. Creating a national framework for cybersecurity: an analysis of issues and options. Library of Congress, Washington DC: Congressional Research Service; 2005.

Craigen D, Diakun-Thibault N, Purse R. Defining cybersecurity. Technology Innovation. Manag Rev. 2014;4(10):13–21.

Council NR. et al. Toward a safer and more secure cyberspace, 2007.

Jang-Jaccard J, Nepal S. A survey of emerging threats in cybersecurity. J Comput Syst Sci. 2014;80(5):973–93.

Mukkamala S, Sung A, Abraham A. Cyber security challenges: designing efficient intrusion detection systems and antivirus tools. In: Vemuri VR, editor. Enhancing computer security with smart technology. Auerbach; 2006. p. 125–63.

Bilge L, Dumitraş T. Before we knew it: an empirical study of zero-day attacks in the real world. In: Proceedings of the 2012 ACM conference on computer and communications security. ACM; 2012. p. 833–44.

Davi L, Dmitrienko A, Sadeghi A-R, Winandy M. Privilege escalation attacks on android. In: International conference on information security. New York: Springer; 2010. p. 346–60.

Jovičić B, Simić D. Common web application attack types and security using asp .net. ComSIS, 2006.

Warkentin M, Willison R. Behavioral and policy issues in information systems security: the insider threat. Eur J Inform Syst. 2009;18(2):101–5.

Kügler D. “man in the middle” attacks on bluetooth. In: International Conference on Financial Cryptography. New York: Springer; 2003, p. 149–61.

Virvilis N, Gritzalis D. The big four-what we did wrong in advanced persistent threat detection. In: 2013 International Conference on Availability, Reliability and Security. IEEE; 2013. p. 248–54.

Boyd SW, Keromytis AD. Sqlrand: Preventing sql injection attacks. In: International conference on applied cryptography and network security. New York: Springer; 2004. p. 292–302.

Sigler K. Crypto-jacking: how cyber-criminals are exploiting the crypto-currency boom. Comput Fraud Sec. 2018;2018(9):12–4.

2019 data breach investigations report, https://enterprise.verizon.com/resources/reports/dbir/ . Accessed 20 Oct 2019.

Khraisat A, Gondal I, Vamplew P, Kamruzzaman J. Survey of intrusion detection systems: techniques, datasets and challenges. Cybersecurity. 2019;2(1):20.

Johnson L. Computer incident response and forensics team management: conducting a successful incident response, 2013.

Brahmi I, Brahmi H, Yahia SB. A multi-agents intrusion detection system using ontology and clustering techniques. In: IFIP international conference on computer science and its applications. New York: Springer; 2015. p. 381–93.

Qu X, Yang L, Guo K, Ma L, Sun M, Ke M, Li M. A survey on the development of self-organizing maps for unsupervised intrusion detection. Mobile Networks and Applications. 2019. p. 1–22.

Liao H-J, Lin C-HR, Lin Y-C, Tung K-Y. Intrusion detection system: a comprehensive review. J Netw Comput Appl. 2013;36(1):16–24.

Alazab A, Hobbs M, Abawajy J, Alazab M. Using feature selection for intrusion detection system. In: 2012 International symposium on communications and information technologies (ISCIT). IEEE; 2012. p. 296–301.

Viegas E, Santin AO, Franca A, Jasinski R, Pedroni VA, Oliveira LS. Towards an energy-efficient anomaly-based intrusion detection engine for embedded systems. IEEE Trans Comput. 2016;66(1):163–77.

Xin Y, Kong L, Liu Z, Chen Y, Li Y, Zhu H, Gao M, Hou H, Wang C. Machine learning and deep learning methods for cybersecurity. IEEE Access. 2018;6:35365–81.

Dutt I, Borah S, Maitra IK, Bhowmik K, Maity A, Das S. Real-time hybrid intrusion detection system using machine learning techniques. 2018, p. 885–94.

Ragsdale DJ, Carver C, Humphries JW, Pooch UW. Adaptation techniques for intrusion detection and intrusion response systems. In: Smc 2000 conference proceedings. 2000 IEEE international conference on systems, man and cybernetics.’cybernetics evolving to systems, humans, organizations, and their complex interactions’(cat. No. 0). IEEE; 2000. vol. 4, p. 2344–2349.

Cao L. Data science: challenges and directions. Commun ACM. 2017;60(8):59–68.

Rizk A, Elragal A. Data science: developing theoretical contributions in information systems via text analytics. J Big Data. 2020;7(1):1–26.

Lippmann RP, Fried DJ, Graf I, Haines JW, Kendall KR, McClung D, Weber D, Webster SE, Wyschogrod D, Cunningham RK, et al. Evaluating intrusion detection systems: The 1998 darpa off-line intrusion detection evaluation. In: Proceedings DARPA information survivability conference and exposition. DISCEX’00. IEEE; 2000. vol. 2, p. 12–26.

Kdd cup 99. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html . Accessed 20 Oct 2019.

Tavallaee M, Bagheri E, Lu W, Ghorbani AA. A detailed analysis of the kdd cup 99 data set. In: 2009 IEEE symposium on computational intelligence for security and defense applications. IEEE; 2009. p. 1–6.

Caida ddos attack 2007 dataset. http://www.caida.org/data/passive/ddos-20070804-dataset.xml/ . Accessed 20 Oct 2019.

Caida anonymized internet traces 2008 dataset. https://www.caida.org/data/passive/passive-2008-dataset . Accessed 20 Oct 2019.

Isot botnet dataset. https://www.uvic.ca/engineering/ece/isot/datasets/index.php/ . Accessed 20 Oct 2019.

The honeynet project. http://www.honeynet.org/chapters/france/ . Accessed 20 Oct 2019.

Canadian institute of cybersecurity, university of new brunswick, iscx dataset, http://www.unb.ca/cic/datasets/index.html/ . Accessed 20 Oct 2019.

Shiravi A, Shiravi H, Tavallaee M, Ghorbani AA. Toward developing a systematic approach to generate benchmark datasets for intrusion detection. Comput Secur. 2012;31(3):357–74.

The ctu-13 dataset. https://stratosphereips.org/category/datasets-ctu13 . Accessed 20 Oct 2019.

Moustafa N, Slay J. Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In: 2015 Military Communications and Information Systems Conference (MilCIS). IEEE; 2015. p. 1–6.

Cse-cic-ids2018 [online]. available: https://www.unb.ca/cic/datasets/ids-2018.html/ . Accessed 20 Oct 2019.

Cic-ddos2019 [online]. available: https://www.unb.ca/cic/datasets/ddos-2019.html/ . Accessed 28 Mar 2019.

Jing X, Yan Z, Jiang X, Pedrycz W. Network traffic fusion and analysis against ddos flooding attacks with a novel reversible sketch. Inform Fusion. 2019;51:100–13.

Xie M, Hu J, Yu X, Chang E. Evaluating host-based anomaly detection systems: application of the frequency-based algorithms to adfa-ld. In: International conference on network and system security. New York: Springer; 2015. p. 542–49.

Lindauer B, Glasser J, Rosen M, Wallnau KC, ExactData L. Generating test data for insider threat detectors. JoWUA. 2014;5(2):80–94.

Glasser J, Lindauer B. Bridging the gap: A pragmatic approach to generating insider threat data. In: 2013 IEEE Security and Privacy Workshops. IEEE; 2013. p. 98–104.

Enronspam. https://labs-repos.iit.demokritos.gr/skel/i-config/downloads/enron-spam/ . Accessed 20 Oct 2019.

Spamassassin. http://www.spamassassin.org/publiccorpus/ . Accessed 20 Oct 2019.

Lingspam. https://labs-repos.iit.demokritos.gr/skel/i-config/downloads/lingspampublic.tar.gz/ . Accessed 20 Oct 2019.

Alexa top sites. https://aws.amazon.com/alexa-top-sites/ . Accessed 20 Oct 2019.

Bambenek consulting—master feeds. available online: http://osint.bambenekconsulting.com/feeds/ . Accessed 20 Oct 2019.

Dgarchive. https://dgarchive.caad.fkie.fraunhofer.de/site/ . Accessed 20 Oct 2019.

Zago M, Pérez MG, Pérez GM. Umudga: A dataset for profiling algorithmically generated domain names in botnet detection. Data in Brief. 2020;105400.

Zhou Y, Jiang X. Dissecting android malware: characterization and evolution. In: 2012 IEEE Symposium on security and privacy. IEEE; 2012. p. 95–109.

Virusshare. http://virusshare.com/ . Accessed 20 Oct 2019.

Virustotal. https://virustotal.com/ . Accessed 20 Oct 2019.

Comodo. https://www.comodo.com/home/internet-security/updates/vdp/database . Accessed 20 Oct 2019.

Contagio. http://contagiodump.blogspot.com/ . Accessed 20 Oct 2019.

Kumar R, Xiaosong Z, Khan RU, Kumar J, Ahad I. Effective and explainable detection of android malware based on machine learning algorithms. In: Proceedings of the 2018 international conference on computing and artificial intelligence. ACM; 2018. p. 35–40.

Microsoft malware classification (big 2015). arXiv:1802.10135 . Accessed 20 Oct 2019.

Koroniotis N, Moustafa N, Sitnikova E, Turnbull B. Towards the development of realistic botnet dataset in the internet of things for network forensic analytics: bot-iot dataset. Future Gen Comput Syst. 2019;100:779–96.

McIntosh TR, Jang-Jaccard J, Watters PA. Large scale behavioral analysis of ransomware attacks. In: International conference on neural information processing. New York: Springer; 2018. p. 217–29.

Han J, Pei J, Kamber M. Data mining: concepts and techniques, 2011.

Witten IH, Frank E. Data mining: Practical machine learning tools and techniques, 2005.

Dua S, Du X. Data mining and machine learning in cybersecurity, 2016.

Kotpalliwar MV, Wajgi R. Classification of attacks using support vector machine (svm) on kddcup’99 ids database. In: 2015 Fifth international conference on communication systems and network technologies. IEEE; 2015. p. 987–90.

Pervez MS, Farid DM. Feature selection and intrusion classification in nsl-kdd cup 99 dataset employing svms. In: The 8th international conference on software, knowledge, information management and applications (SKIMA 2014). IEEE; 2014. p. 1–6.

Yan M, Liu Z. A new method of transductive svm-based network intrusion detection. In: International conference on computer and computing technologies in agriculture. New York: Springer; 2010. p. 87–95.

Li Y, Xia J, Zhang S, Yan J, Ai X, Dai K. An efficient intrusion detection system based on support vector machines and gradually feature removal method. Expert Syst Appl. 2012;39(1):424–30.

Raman MG, Somu N, Jagarapu S, Manghnani T, Selvam T, Krithivasan K, Sriram VS. An efficient intrusion detection technique based on support vector machine and improved binary gravitational search algorithm. Artificial Intelligence Review. 2019, p. 1–32.

Kokila R, Selvi ST, Govindarajan K. Ddos detection and analysis in sdn-based environment using support vector machine classifier. In: 2014 Sixth international conference on advanced computing (ICoAC). IEEE; 2014. p. 205–10.

Xie M, Hu J, Slay J. Evaluating host-based anomaly detection systems: Application of the one-class svm algorithm to adfa-ld. In: 2014 11th international conference on fuzzy systems and knowledge discovery (FSKD). IEEE; 2014. p. 978–82.

Saxena H, Richariya V. Intrusion detection in kdd99 dataset using svm-pso and feature reduction with information gain. Int J Comput Appl. 2014;98:6.

Chandrasekhar A, Raghuveer K. Confederation of fcm clustering, ann and svm techniques to implement hybrid nids using corrected kdd cup 99 dataset. In: 2014 international conference on communication and signal processing. IEEE; 2014. p. 672–76.

Shapoorifard H, Shamsinejad P. Intrusion detection using a novel hybrid method incorporating an improved knn. Int J Comput Appl. 2017;173(1):5–9.

Vishwakarma S, Sharma V, Tiwari A. An intrusion detection system using knn-aco algorithm. Int J Comput Appl. 2017;171(10):18–23.

Meng W, Li W, Kwok L-F. Design of intelligent knn-based alarm filter using knowledge-based alert verification in intrusion detection. Secur Commun Netw. 2015;8(18):3883–95.

Dada E. A hybridized svm-knn-pdapso approach to intrusion detection system. In: Proc. Fac. Seminar Ser., 2017, p. 14–21.

Sharifi AM, Amirgholipour SK, Pourebrahimi A. Intrusion detection based on joint of k-means and knn. J Converg Inform Technol. 2015;10(5):42.

Lin W-C, Ke S-W, Tsai C-F. Cann: an intrusion detection system based on combining cluster centers and nearest neighbors. Knowl Based Syst. 2015;78:13–21.

Koc L, Mazzuchi TA, Sarkani S. A network intrusion detection system based on a hidden naïve bayes multiclass classifier. Exp Syst Appl. 2012;39(18):13492–500.

Moon D, Im H, Kim I, Park JH. Dtb-ids: an intrusion detection system based on decision tree using behavior analysis for preventing apt attacks. J Supercomput. 2017;73(7):2881–95.

Ingre B, Yadav A, Soni AK. Decision tree based intrusion detection system for nsl-kdd dataset. In: International conference on information and communication technology for intelligent systems. New York: Springer; 2017. p. 207–18.

Malik AJ, Khan FA. A hybrid technique using binary particle swarm optimization and decision tree pruning for network intrusion detection. Cluster Comput. 2018;21(1):667–80.

Relan NG, Patil DR. Implementation of network intrusion detection system using variant of decision tree algorithm. In: 2015 international conference on nascent technologies in the engineering field (ICNTE). IEEE; 2015. p. 1–5.

Rai K, Devi MS, Guleria A. Decision tree based algorithm for intrusion detection. Int J Adv Netw Appl. 2016;7(4):2828.

Sarker IH, Abushark YB, Alsolami F, Khan AI. Intrudtree: a machine learning based cyber security intrusion detection model. Symmetry. 2020;12(5):754.

Puthran S, Shah K. Intrusion detection using improved decision tree algorithm with binary and quad split. In: International symposium on security in computing and communication. New York: Springer; 2016. p. 427–438.

Balogun AO, Jimoh RG. Anomaly intrusion detection using an hybrid of decision tree and k-nearest neighbor, 2015.

Azad C, Jha VK. Genetic algorithm to solve the problem of small disjunct in the decision tree based intrusion detection system. Int J Comput Netw Inform Secur. 2015;7(8):56.

Jo S, Sung H, Ahn B. A comparative study on the performance of intrusion detection using decision tree and artificial neural network models. J Korea Soc Dig Indus Inform Manag. 2015;11(4):33–45.

Zhan J, Zulkernine M, Haque A. Random-forests-based network intrusion detection systems. IEEE Trans Syst Man Cybern C. 2008;38(5):649–59.

Tajbakhsh A, Rahmati M, Mirzaei A. Intrusion detection using fuzzy association rules. Appl Soft Comput. 2009;9(2):462–9.

Mitchell R, Chen R. Behavior rule specification-based intrusion detection for safety critical medical cyber physical systems. IEEE Trans Depend Secure Comput. 2014;12(1):16–30.

Alazab M, Venkataraman S, Watters P. Towards understanding malware behaviour by the extraction of api calls. In: 2010 second cybercrime and trustworthy computing Workshop. IEEE; 2010. p. 52–59.

Yuan Y, Kaklamanos G, Hogrefe D. A novel semi-supervised adaboost technique for network anomaly detection. In: Proceedings of the 19th ACM international conference on modeling, analysis and simulation of wireless and mobile systems. ACM; 2016. p. 111–14.

Ariu D, Tronci R, Giacinto G. Hmmpayl: an intrusion detection system based on hidden markov models. Comput Secur. 2011;30(4):221–41.

Årnes A, Valeur F, Vigna G, Kemmerer RA. Using hidden markov models to evaluate the risks of intrusions. In: International workshop on recent advances in intrusion detection. New York: Springer; 2006. p. 145–64.

Hansen JV, Lowry PB, Meservy RD, McDonald DM. Genetic programming for prevention of cyberterrorism through dynamic and evolving intrusion detection. Decis Supp Syst. 2007;43(4):1362–74.

Aslahi-Shahri B, Rahmani R, Chizari M, Maralani A, Eslami M, Golkar MJ, Ebrahimi A. A hybrid method consisting of ga and svm for intrusion detection system. Neural Comput Appl. 2016;27(6):1669–76.

Alrawashdeh K, Purdy C. Toward an online anomaly intrusion detection system based on deep learning. In: 2016 15th IEEE international conference on machine learning and applications (ICMLA). IEEE; 2016. p. 195–200.

Yin C, Zhu Y, Fei J, He X. A deep learning approach for intrusion detection using recurrent neural networks. IEEE Access. 2017;5:21954–61.

Kim J, Kim J, Thu HLT, Kim H. Long short term memory recurrent neural network classifier for intrusion detection. In: 2016 international conference on platform technology and service (PlatCon). IEEE; 2016. p. 1–5.

Almiani M, AbuGhazleh A, Al-Rahayfeh A, Atiewi S, Razaque A. Deep recurrent neural network for iot intrusion detection system. Simulation Modelling Practice and Theory. 2019;102031.

Kolosnjaji B, Zarras A, Webster G, Eckert C. Deep learning for classification of malware system call sequences. In: Australasian joint conference on artificial intelligence. New York: Springer; 2016. p. 137–49.

Wang W, Zhu M, Zeng X, Ye X, Sheng Y. Malware traffic classification using convolutional neural network for representation learning. In: 2017 international conference on information networking (ICOIN). IEEE; 2017. p. 712–17.

Alauthman M, Aslam N, Al-kasassbeh M, Khan S, Al-Qerem A, Choo K-KR. An efficient reinforcement learning-based botnet detection approach. J Netw Comput Appl. 2020;150:102479.

Blanco R, Cilla JJ, Briongos S, Malagón P, Moya JM. Applying cost-sensitive classifiers with reinforcement learning to ids. In: International conference on intelligent data engineering and automated learning. New York: Springer; 2018. p. 531–38.

Lopez-Martin M, Carro B, Sanchez-Esguevillas A. Application of deep reinforcement learning to intrusion detection for supervised problems. Exp Syst Appl. 2020;141:112963.

Sarker IH, Kayes A, Watters P. Effectiveness analysis of machine learning classification models for predicting personalized context-aware smartphone usage. J Big Data. 2019;6(1):1–28.

Holte RC. Very simple classification rules perform well on most commonly used datasets. Mach Learn. 1993;11(1):63–90.

John GH, Langley P. Estimating continuous distributions in bayesian classifiers. In: Proceedings of the eleventh conference on uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc.; 1995. p. 338–45.

Quinlan JR. C4.5: Programs for machine learning. Machine Learning, 1993.

Sarker IH, Colman A, Han J, Khan AI, Abushark YB, Salah K. Behavdt: a behavioral decision tree learning to build user-centric context-aware predictive model. Mobile Networks and Applications. 2019, p. 1–11.

Aha DW, Kibler D, Albert MK. Instance-based learning algorithms. Mach Learn. 1991;6(1):37–66.

Keerthi SS, Shevade SK, Bhattacharyya C, Murthy KRK. Improvements to platt’s smo algorithm for svm classifier design. Neural Comput. 2001;13(3):637–49.

Freund Y, Schapire RE, et al. Experiments with a new boosting algorithm. In: ICML, vol. 96; 1996. p. 148–56.

Le Cessie S, Van Houwelingen JC. Ridge estimators in logistic regression. J Royal Stat Soc C. 1992;41(1):191–201.

Watters PA, McCombie S, Layton R, Pieprzyk J. Characterising and predicting cyber attacks using the cyber attacker model profile (camp). J Money Launder Control. 2012.

Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.

Sarker IH. Context-aware rule learning from smartphone data: survey, challenges and future directions. J Big Data. 2019;6(1):95.

MacQueen J. Some methods for classification and analysis of multivariate observations. In: Fifth Berkeley symposium on mathematical statistics and probability, vol. 1, 1967.

Rokach L. A survey of clustering algorithms. In: Data Mining and Knowledge Discovery Handbook. New York: Springer; 2010. p. 269–98.

Sneath PH. The application of computers to taxonomy. J Gen Microbiol. 1957;17:1.

Sorensen T. A method of establishing groups of equal amplitude in plant sociology based on similarity of species. Biol Skr. 1948;5.

Sarker IH, Colman A, Kabir MA, Han J. Individualized time-series segmentation for mining mobile phone user behavior. Comput J. 2018;61(3):349–68.

Kim G, Lee S, Kim S. A novel hybrid intrusion detection method integrating anomaly detection with misuse detection. Exp Syst Appl. 2014;41(4):1690–700.

Agrawal R, Imieliński T, Swami A. Mining association rules between sets of items in large databases. In: ACM SIGMOD Record. ACM; 1993. vol. 22, p. 207–16.

Flach PA, Lachiche N. Confirmation-guided discovery of first-order rules with tertius. Mach Learn. 2001;42(1–2):61–95.

Agrawal R, Srikant R, et al. Fast algorithms for mining association rules. In: Proc. 20th Int. Conf. Very Large Data Bases, VLDB; 1994. vol. 1215, p. 487–99.

Houtsma M, Swami A. Set-oriented mining for association rules in relational databases. In: Proceedings of the eleventh international conference on data engineering. IEEE; 1995. p. 25–33.

Ma BLWHY. Integrating classification and association rule mining. In: Proceedings of the fourth international conference on knowledge discovery and data mining, 1998.

Han J, Pei J, Yin Y. Mining frequent patterns without candidate generation. In: ACM Sigmod Record. ACM; 2000. vol. 29, p. 1–12.

Sarker IH, Salim FD. Mining user behavioral rules from smartphone data through association analysis. In: Proceedings of the 22nd Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Melbourne, Australia. New York: Springer; 2018. p. 450–61.

Das A, Ng W-K, Woon Y-K. Rapid association rule mining. In: Proceedings of the tenth international conference on information and knowledge management. ACM; 2001. p. 474–81.

Zaki MJ. Scalable algorithms for association mining. IEEE Trans Knowl Data Eng. 2000;12(3):372–90.

Coelho IM, Coelho VN, Luz EJS, Ochi LS, Guimarães FG, Rios E. A gpu deep learning metaheuristic based model for time series forecasting. Appl Energy. 2017;201:412–8.

Van Efferen L, Ali-Eldin AM. A multi-layer perceptron approach for flow-based anomaly detection. In: 2017 International symposium on networks, computers and communications (ISNCC). IEEE; 2017. p. 1–6.

Liu H, Lang B, Liu M, Yan H. Cnn and rnn based payload classification methods for attack detection. Knowl Based Syst. 2019;163:332–41.

Berman DS, Buczak AL, Chavis JS, Corbett CL. A survey of deep learning methods for cyber security. Information. 2019;10(4):122.

Bellman R. A markovian decision process. J Math Mech. 1957;1:679–84.

Acknowledgements

The authors would like to thank all the reviewers for their rigorous review and comments in several revision rounds. The reviews are detailed and helpful to improve and finalize the manuscript. The authors are highly grateful to them.

Author information

Authors and affiliations

Swinburne University of Technology, Melbourne, VIC, 3122, Australia

Iqbal H. Sarker

Chittagong University of Engineering and Technology, Chittagong, 4349, Bangladesh

La Trobe University, Melbourne, VIC, 3086, Australia

A. S. M. Kayes, Paul Watters & Alex Ng

University of Nevada, Reno, USA

Shahriar Badsha

Macquarie University, Sydney, NSW, 2109, Australia

Hamed Alqahtani

Contributions

This article provides not only a discussion of cybersecurity data science and relevant methods but also a discussion of their applicability to data-driven intelligent decision making in cybersecurity systems and services. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Iqbal H. Sarker .

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

Reprints and permissions

About this article

Cite this article

Sarker, I.H., Kayes, A.S.M., Badsha, S. et al. Cybersecurity data science: an overview from machine learning perspective. J Big Data 7 , 41 (2020). https://doi.org/10.1186/s40537-020-00318-5

Received : 26 October 2019

Accepted : 21 June 2020

Published : 01 July 2020

DOI : https://doi.org/10.1186/s40537-020-00318-5

Keywords

  • Decision making
  • Cyber-attack
  • Security modeling
  • Intrusion detection
  • Cyber threat intelligence

A survey on security challenges in cloud computing: issues, threats, and solutions

  • Published: 28 February 2020
  • Volume 76 , pages 9493–9532, ( 2020 )

  • Hamed Tabrizchi 1 &
  • Marjan Kuchaki Rafsanjani   ORCID: orcid.org/0000-0002-3220-4839 1  

15k Accesses

217 Citations

3 Altmetric

Cloud computing has gained huge attention over the past decades because of continuously increasing demands. There are several advantages to organizations moving toward cloud-based data storage solutions, including simplified IT infrastructure and management, remote access from effectively anywhere in the world with a stable Internet connection, and the cost efficiencies that cloud computing can bring. The associated security and privacy challenges in the cloud, however, require further exploration. Researchers from academia, industry, and standards organizations have provided potential solutions to these challenges in previously published studies. The narrative review presented in this survey covers cloud security issues and requirements, identified threats, and known vulnerabilities. In fact, this work aims to analyze the different components of cloud computing as well as to present the security and privacy problems that these systems face. Moreover, it presents a new classification of recent security solutions that exist in this area. Additionally, this survey introduces various types of security threats against cloud computing services and discusses open issues and future directions. The paper focuses on and explores in detail the security challenges faced by cloud entities such as the cloud service provider, the data owner, and the cloud user.

Author information

Authors and affiliations

Department of Computer Science, Faculty of Mathematics and Computer, Shahid Bahonar University of Kerman, Kerman, Iran

Hamed Tabrizchi & Marjan Kuchaki Rafsanjani

Corresponding author

Correspondence to Marjan Kuchaki Rafsanjani .

About this article

Tabrizchi, H., Kuchaki Rafsanjani, M. A survey on security challenges in cloud computing: issues, threats, and solutions. J Supercomput 76 , 9493–9532 (2020). https://doi.org/10.1007/s11227-020-03213-1

Published : 28 February 2020

Issue Date : December 2020

DOI : https://doi.org/10.1007/s11227-020-03213-1

Keywords

  • Cloud computing
  • Vulnerabilities
  • Data protection

Sensors (Basel)
The Impact of Artificial Intelligence on Data System Security: A Literature Review

Ricardo Raimundo

1 ISEC Lisboa, Instituto Superior de Educação e Ciências, 1750-142 Lisbon, Portugal; [email protected]

Albérico Rosário

2 Research Unit on Governance, Competitiveness and Public Policies (GOVCOPP), University of Aveiro, 3810-193 Aveiro, Portugal

Associated Data

Not applicable.

Diverse forms of artificial intelligence (AI) are at the forefront of triggering digital security innovations based on the threats arising in this post-COVID world. On the one hand, companies are experiencing difficulty in dealing with security challenges with regard to a variety of issues ranging from system openness, decision making, and quality control to the web domain, to mention a few. On the other hand, in the last decade, research has focused on security capabilities based on tools such as platform complacency, intelligent trees, modeling methods, and outage management systems in an effort to understand the interplay between AI and those issues. The dependence on AI in running industries and shaping the education, transport, and health sectors is now well documented in the literature. AI is increasingly employed in managing data security across economic sectors. Thus, a literature review of AI and system security within the current digital society is opportune. This paper aims at identifying research trends in the field through a systematic bibliometric literature review (LRSB) of research on AI and system security. The review entails 77 articles published in the Scopus ® database, presenting up-to-date knowledge on the topic. The LRSB results were synthesized across current research subthemes, and findings are presented. The originality of the paper relies on its LRSB method, together with an extant review of articles that have not been categorized so far. Implications for future research are suggested.

1. Introduction

The assumption that the human brain may be deemed quite comparable to computers in some ways offers the spontaneous basis for artificial intelligence (AI), which is supported by psychology through the idea of humans and animals operating like machines that process information by devices of associative memory [ 1 ]. Nowadays, researchers are working on the possibilities of AI to cope with varying issues of systems security across diverse sectors. Hence, AI is commonly considered an interdisciplinary research area that attracts considerable attention both in economics and social domains as it offers a myriad of technological breakthroughs with regard to systems security [ 2 ]. There is a universal trend of investing in AI technology to face security challenges of our daily lives, such as statistical data, medicine, and transportation [ 3 ].

Some claim that specific data from key sectors have supported the development of AI, namely the availability of data from e-commerce [ 4 ], businesses [ 5 ], and government [ 6 ], which provided substantial input to improve diverse machine-learning solutions and algorithms, in particular with respect to systems security [ 7 ]. Additionally, China and Russia have acknowledged the importance of AI for systems security and competitiveness in general [ 8 , 9 ]. Similarly, China has recognized the importance of AI in terms of housing security, aiming at becoming an authority in the field [ 10 ]. Those efforts are already being carried out in some leading countries in order to profit the most from its substantial benefits [ 9 ]. In spite of the huge development of AI in the last few years, the discussion around the topic of systems security remains sparse [ 11 ]. Therefore, it is opportune to review the latest developments on the theme in order to map the advancements in the field and their ensuing outcomes [ 12 ]. In view of this, we intend to identify the principal trends of issues discussed on the topic these days in order to answer the main research question: What is the impact of AI on data system security?

The article is organized as follows. In Section 2 , we put forward diverse theoretical concepts related to AI in systems security. In Section 3 , we present the methodological approach. In Section 4 , we discuss the main fields of use of AI with regard to systems security, which came out from the literature. Finally, we conclude this paper by suggesting implications and future research avenues.

2. Literature Trends: AI and Systems Security

The concept of AI was introduced following the creation of the notion of the digital computing machine in an attempt to ascertain whether a machine is able to “think” [ 1 ] or whether a machine can carry out humans’ tasks [ 13 ]. AI is a vast domain of information and computer technologies (ICT), which aims at designing systems that can operate autonomously, analogously to individuals’ decision-making processes [ 14 ]. In terms of AI, a machine may learn from experience through processing an immeasurable quantity of data while distinguishing patterns in it, as in the case of Siri [ 15 ] and image recognition [ 16 ], technologies based on machine learning, which is a subtheme of AI defined as intelligent systems with the capacity to think and learn [ 1 ].

Furthermore, AI entails a myriad of related technologies, such as neural networks [ 17 ] and machine learning [ 18 ], just to mention a few, and we can identify some research areas of AI:

  • (I) Machine learning covers a myriad of technologies that allow computers to carry out algorithms based on gathered data and distinct orders, giving the machine the capability to learn without instructions from humans, adjusting its own algorithm to the situation while learning and recoding itself, as Google and Siri do when performing distinct tasks ordered by voice [ 19 ], as well as video surveillance that tracks unusual behavior [ 20 ];
  • (II) Deep learning constitutes the ensuing progress of machine learning, in which the machine carries out tasks directly from pictures, text, and sound through a wide set of data architectures that entail numerous layers, in order to learn and characterize data with several levels of abstraction, imitating how the natural brain processes information [ 21 ]. This is illustrated, for example, in forming a certificate database structure of university performance key indicators in order to fix issues such as identity authentication [ 21 ];
  • (III) Neural networks are composed of a pattern recognition system that machine/deep learning operates to learn from observational data, figuring out its own solutions, such as an auto-steering gear system with a fuzzy regulator, which makes it possible to select optimal neural network models of vessel paths and thereby obtain control activity [ 22 ];
  • (IV) Natural language processing machines analyze language and speech as it is spoken, resorting to machine learning and natural language processing, such as developing a swarm-intelligence active system while mounting friendly human-computer interface software for users, to be implemented in educational and e-learning organizations [ 23 ];
  • (V) Expert systems are composed of software arrangements that assist in achieving answers to distinct inquiries provided either by a customer or by another software set, in which expert knowledge is stored for a particular application area, together with a reasoning component that accesses answers in view of the environmental information and subsequent decision making [ 24 ].
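As a concrete, deliberately tiny illustration of the machine-learning subtheme above, the following sketch trains a perceptron, the simplest neural-network building block, to separate toy “benign” connection records from “suspicious” ones. The feature choice (failed logins per minute, megabytes sent), the data, and the hyperparameters are illustrative assumptions, not material from the surveyed studies.

```python
# Hypothetical toy data: (failed logins per minute, megabytes sent).
# Real deployments learn from large labelled corpora of traffic features.
samples = [(0.0, 0.2), (1.0, 0.1), (9.0, 5.0), (8.0, 7.0)]
labels = [0, 0, 1, 1]  # 0 = benign, 1 = suspicious

def train_perceptron(samples, labels, epochs=20, lr=0.1):
    """Classic perceptron rule: nudge weights toward misclassified points."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = y - pred  # -1, 0, or +1
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

def predict(w, b, x):
    """Apply the learned linear decision boundary to a new record."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

w, b = train_perceptron(samples, labels)
```

On this linearly separable toy set the learned boundary flags a new record such as (10.0, 6.0) as suspicious while leaving (0.0, 0.3) as benign; the deep-learning and neural-network subthemes (II) and (III) stack many such units into multi-layer architectures.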

Those subthemes of AI are applied to many sectors, such as health institutions, education, and management, through varying applications related to systems security. These abovementioned processes have been widely deployed to solve important security issues such as the following application trends ( Figure 1 ):

  • (a) Cyber security, in terms of computer crime, behavior research, access control, and surveillance, as for example in computer vision, in which an algorithm analyzes images, and in CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) techniques [ 6 , 7 , 12 , 19 , 25 , 26 , 27 , 28 , 29 , 30 , 31 , 32 , 33 , 34 , 35 , 36 , 37 , 38 ];
  • (b) Information management, namely in supporting decision making, business strategy, and expert systems, for example, by improving the quality of the relevant strategic decisions by analyzing big data, as well as in the management of the quality of complex objects [ 2 , 4 , 5 , 11 , 14 , 24 , 39 , 40 , 41 , 42 , 43 , 44 , 45 , 46 , 47 , 48 , 49 , 50 , 51 , 52 , 53 , 54 , 55 , 56 , 57 , 58 , 59 , 60 ];
  • (c) Societies and institutions, regarding computer networks, privacy, and digitalization, legal and clinical assistance, for example, in terms of legal support of cyber security, digital modernization, systems to support police investigations and the efficiency of technological processes in transport [ 8 , 9 , 10 , 15 , 17 , 18 , 20 , 21 , 23 , 28 , 61 , 62 , 63 , 64 , 65 , 66 , 67 , 68 , 69 , 70 , 71 , 72 , 73 ];
  • (d) Neural networks, for example, in terms of designing a model of human personality for use in robotic systems [ 1 , 13 , 16 , 22 , 74 , 75 ].

Figure 1. Subthemes/network of all keywords of AI. Source: own elaboration.

Through these streams of research, we will explain how the huge potential of AI can be deployed to enhance the systems security in use both in states and organizations, to mitigate risks and increase returns by identifying and averting cyber attacks and determining the best course of action [ 19 ]. AI could even prove more effective than humans in averting potential threats through various security solutions such as redundant video surveillance systems, VOIP voice network technology security strategies [ 36 , 76 , 77 ], and reliance upon diverse platforms for protection (platform complacency) [ 30 ].

The abovementioned conceptual and technological framework was not designed randomly: we first ran a preliminary search on Scopus with the keywords “Artificial Intelligence” and “Security”.

3. Materials and Methods

We carried out a systematic bibliometric literature review (LRSB) of the “Impact of AI on Data System Security”. The LRSB is a study design based on a detailed, thorough process of recognizing and synthesizing information. It is an alternative to traditional literature reviews that improves: (i) the validity of the review, by providing a set of steps that can be followed if the study is replicated; (ii) accuracy, by providing and demonstrating arguments strictly related to the research questions; and (iii) the generalization of the results, by allowing the synthesis and analysis of accumulated knowledge [ 78 , 79 , 80 ]. Thus, the LRSB is a “guiding instrument” that steers the review according to its objectives.

The study follows the suggestions of Raimundo and Rosário: (i) definition of the research question; (ii) location of the studies; (iii) selection and evaluation of studies; (iv) analysis and synthesis; (v) presentation of results; and finally (vi) discussion and conclusion of results. This methodology ensures a comprehensive, auditable, replicable review that answers the research questions.

The review was carried out in June 2021, with a bibliographic search in the Scopus database of scientific articles published until June 2021. The search proceeded in three phases: (i) using the keyword “Artificial Intelligence”, 382,586 documents were obtained; (ii) adding the keyword “Security” narrowed the set to 15,916 documents, and limiting the subject area to Business, Management, and Accounting yielded 401 documents; and finally (iii) filtering by the exact keywords “Data security” and “Systems security” produced a total of 77 documents ( Table 1 ).

Table 1. Screening methodology. Source: own elaboration.

The search strategy resulted in 77 academic documents. This set of eligible documents was assessed for academic and scientific relevance and quality, comprising conference papers (43), articles (29), reviews (3), a letter (1), and a retracted document (1).

Peer-reviewed academic documents on the impact of artificial intelligence on data system security were selected up to June 2021. In the period under review, 2020 was the year with the highest number of peer-reviewed academic documents on the subject, with 18 publications, and 7 publications were already confirmed for 2021. Figure 2 reviews the peer-reviewed publications published until 2021.

Figure 2. Number of documents by year. Source: own elaboration.

The publications were distributed across the following venues: 2011 2nd International Conference on Artificial Intelligence Management Science and Electronic Commerce Aimsec 2011 Proceedings (14); Proceedings of the 2020 IEEE International Conference Quality Management Transport and Information Security Information Technologies IT and Qm and Is 2020 (6); Proceedings of the 2019 IEEE International Conference Quality Management Transport and Information Security Information Technologies IT and Qm and Is 2019 (5); Computer Law and Security Review (4); Journal of Network and Systems Management (4); Decision Support Systems (3); Proceedings 2021 21st Acis International Semi Virtual Winter Conference on Software Engineering Artificial Intelligence Networking and Parallel Distributed Computing Snpd Winter 2021 (3); IEEE Transactions on Engineering Management (2); Ictc 2019 10th International Conference on ICT Convergence ICT Convergence Leading the Autonomous Future (2); Information and Computer Security (2); Knowledge Based Systems (2); and with 1 publication each (2013 3rd International Conference on Innovative Computing Technology Intech 2013; 2020 IEEE Technology and Engineering Management Conference Temscon 2020; 2020 International Conference on Technology and Entrepreneurship Virtual Icte V 2020; 2nd International Conference on Current Trends In Engineering and Technology Icctet 2014; ACM Transactions on Management Information Systems; AFE Facilities Engineering Journal; Electronic Design; Facct 2021 Proceedings of the 2021 ACM Conference on Fairness Accountability and Transparency; HAC; ICE B 2010 Proceedings of the International Conference on E Business; IEEE Engineering Management Review; Icaps 2008 Proceedings of the 18th International Conference on Automated Planning and Scheduling; Icaps 2009 Proceedings of the 19th International Conference on Automated Planning and Scheduling; Industrial Management and Data Systems; Information and Management; Information Management and Computer Security; Information Management Computer Security; Information Systems Research; International Journal of Networking and Virtual Organisations; International Journal of Production Economics; International Journal of Production Research; Journal of the Operational Research Society; Proceedings 2020 2nd International Conference on Machine Learning Big Data and Business Intelligence Mlbdbi 2020; Proceedings Annual Meeting of the Decision Sciences Institute; Proceedings of the 2014 Conference on IT In Business Industry and Government An International Conference By Csi on Big Data Csibig 2014; Proceedings of the European Conference on Innovation and Entrepreneurship Ecie; TQM Journal; Technology In Society; Towards the Digital World and Industry X 0 Proceedings of the 29th International Conference of the International Association for Management of Technology Iamot 2020; Wit Transactions on Information and Communication Technologies).

In recent years, there has been growing research interest in the impact of artificial intelligence on data system security.

In Table 2 , we analyze, for each publication, the SCImago Journal & Country Rank (SJR), the best quartile, and the H index.

Table 2. SCImago Journal & Country Rank impact factor. Note: * data not available. Source: own elaboration.

Information Systems Research is the most cited publication, with 3510 (SJR), Q1, and H index 159.

There are in total 11 journals in Q1, 3 in Q2, 2 in Q3, and 2 in Q4. Q1 journals represent 27% of the 41 journal titles; Q2 represents 7%; and Q3 and Q4 represent 5% each. Finally, for the remaining 23 publications (56%), the data are not available.

As evident from Table 2 , the majority of the ranked journals publishing on artificial intelligence and data system security fall in the Q1 quartile.

The subject areas covered by the 77 scientific documents were: Business, Management and Accounting (77); Computer Science (57); Decision Sciences (36); Engineering (21); Economics, Econometrics, and Finance (15); Social Sciences (13); Arts and Humanities (3); Psychology (3); Mathematics (2); and Energy (1).

The most cited article was “CANN: An intrusion detection system based on combining cluster centers and nearest neighbors” by Lin, Ke, and Tsai, with 290 citations, published in Knowledge-Based Systems with 1590 (SJR), best quartile Q1, and H index 121. The article proposes a new feature representation approach based on cluster centers and nearest neighbors.
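The CANN idea can be illustrated with a minimal sketch (the data and helper names below are hypothetical, and the published approach additionally runs a k-NN classifier on the result): each sample is collapsed into a single feature that sums its distance to the nearest cluster center and its distance to its nearest neighbor.

```python
import numpy as np

def cann_features(X, centers):
    """CANN-style representation: for each sample, sum the distance
    to its nearest cluster center and the distance to its nearest
    neighbor, yielding a one-dimensional feature per sample."""
    # distance of every sample to every cluster center
    d_centers = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    dist_center = d_centers.min(axis=1)

    # pairwise distances between samples; mask out self-distances
    d_pairs = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d_pairs, np.inf)
    dist_neighbor = d_pairs.min(axis=1)

    return dist_center + dist_neighbor

# two tight groups with one center each (hypothetical toy data)
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
centers = np.array([[0.05, 0.0], [5.05, 5.0]])
feats = cann_features(X, centers)
print(feats)  # one scalar per sample
```

A standard k-NN classifier trained on `feats` instead of the original feature vectors is what makes detection cheaper in the cited approach.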

In Figure 3 , we can analyze the evolution of citations of documents published between 2010 and 2021, showing a growing number of citations with an R² of 0.45.
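The R² reported for the citation trend can be reproduced for any year/citation series as the coefficient of determination of a least-squares line; the sketch below uses hypothetical yearly counts, not the study's actual data.

```python
import numpy as np

def r_squared(x, y):
    """Coefficient of determination of a linear (degree-1) fit."""
    slope, intercept = np.polyfit(x, y, 1)
    y_hat = slope * x + intercept
    ss_res = np.sum((y - y_hat) ** 2)       # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

# hypothetical yearly citation counts, 2010-2021
years = np.arange(2010, 2022, dtype=float)
cites = np.array([1, 2, 2, 4, 3, 6, 5, 9, 8, 12, 15, 14], dtype=float)
print(round(r_squared(years, cites), 2))
```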

Figure 3. Evolution and number of citations between 2010 and 2021. Source: own elaboration.

The h index was used to verify the productivity and impact of the documents: it is the largest number h such that h of the included documents have each been cited at least h times. Of the documents considered, 11 have been cited at least 11 times, giving an h index of 11.
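The h index definition above is simple to state precisely in code; the sketch below (with made-up citation counts) returns the largest h such that h documents each have at least h citations.

```python
def h_index(citations):
    """Largest h such that h documents have >= h citations each."""
    h = 0
    for rank, c in enumerate(sorted(citations, reverse=True), start=1):
        if c >= rank:
            h = rank  # the rank-th best document still has enough citations
        else:
            break
    return h

print(h_index([10, 8, 5, 4, 3]))  # 4: four documents have >= 4 citations each
```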

In Appendix A , Table A1 , the citations of all scientific articles up to 2021 are analyzed; 35 documents had not been cited by 2021.

Appendix A , Table A2 , examines the self-citation of documents up to 2021; a total of 16 self-citations were identified.

In Figure 4 , a bibliometric analysis was performed to identify indicators of the dynamics and evolution of scientific information based on the main keywords. The analysis, carried out with the scientific software VOSviewer, aims to identify the main research keywords around “Artificial Intelligence” and “Security”.

Figure 4. Network of linked keywords. Source: own elaboration.

The linked keywords can be analyzed in Figure 4 , making it possible to clarify the network of keywords that appear together/linked in each scientific article, allowing us to know the topics analyzed by the research and to identify future research trends.

4. Discussion

By examining the selected pieces of literature, we have identified four principal areas that deserve further investigation with regard to cyber security in general: business decision making, electronic commerce business, AI social applications, and neural networks ( Figure 4 ). There is a myriad of areas in which AI cyber security can be applied throughout the social, private, and public domains of our daily lives, from Internet banking to digital signatures.

First, the literature has discussed reducing unnecessary leakage of accounting information [ 27 ], mainly by addressing security drawbacks of VOIP technology in IP network systems and subsequent safety measures [ 77 ], which include a secure dynamic password used in Internet banking [ 29 ].

Second, research has examined computer users’ cyber security behaviors, which include a naïve lack of concern about the likelihood of facing security threats, dependence upon specific platforms for protection, and dependence on guidance from trusted social others [ 30 ]. These issues have been partly addressed through mobile agent (MA) management systems in distributed networks, operating an open management framework that provides a broad range of processes to enforce security policies [ 31 ].

Third, AI cyber systems security aims at stabilizing programming and analysis procedures by clarifying, in detail, the relationship between fault-tolerant programming and code security in order to strengthen it [ 33 ], and by offering an overview of existing cyber security tasks and a roadmap [ 32 ].

Fourth, numerous AI tools have been developed to achieve a multi-stage security task approach covering the full security life cycle [ 38 ]. New digital signature technology of increasing reliability has been built on elliptic curve cryptography [ 28 ]; new experimental CAPTCHAs with more interference characters and colorful backgrounds have been developed [ 8 ] to provide better protection against spambots, while allowing people with little knowledge of sign languages to recognize gestures on video relatively fast [ 70 ]; novel detection approaches beyond traditional firewall systems have been developed (e.g., cluster center and nearest neighbor—CANN) with higher efficiency in detecting attacks [ 71 ]; AI security solutions for the IoT (e.g., blockchain) have been proposed to address the security flaws of its centralized architecture [ 34 ]; and an integrated AI algorithm has been devised to identify malicious web domains for the security protection of Internet users [ 19 ].

In sum, AI has progressed lately through advances in machine learning, providing multilevel solutions to security problems in both operating systems and networks, comprehending algorithms, methods, and tools long used by security experts for the betterment of those systems [ 6 ]. In what follows, we present a detailed overview of the impacts of AI on each of those fields.

4.1. Business Decision Making

AI has an increasing impact on systems security aimed at supporting decision making at the management level. Expert systems are increasingly discussed that, along with the evolution of computers, are able to integrate into corporate culture [ 24 ]. Such systems are expected to maximize benefits against costs in situations where a decision-making agent has to decide between a limited set of strategies under sparse information [ 14 ], while a strategic decision of quality is demanded in a relatively short period of time, for example through intelligent analysis of big data [ 39 ].

Secondly, distributed decision models coordinated toward an overall solution have been adopted, relying on a decision support platform [ 40 ], either as mathematical/modeling support of a situational approach to complex objects [ 41 ] or as a web-based multi-perspective decision support system (DSS) [ 42 ].

Thirdly, the problem of software support for management decisions was resolved by combining a systematic approach with heuristic methods and game-theoretic modeling [ 43 ], which, in the case of industrial security, reduces the subsequent number of incidents [ 44 ].

Fourthly, in terms of industrial management and ISO information security control, a semantic decision support system increases the automation level and supports the decision-maker in identifying the most appropriate strategy against a modeled environment [ 45 ], while providing understandable technology that is based on the decisions and interacts with the machine [ 46 ].

Finally, with respect to teamwork, AI validates a theoretical model of behavioral decision theory to assist organizational leaders in deciding on strategic initiatives [ 11 ] while allowing understanding who may have information that is valuable for solving a collaborative scheduling problem [ 47 ].

4.2. Electronic Commerce Business

The second research stream focuses on e-commerce solutions to improve systems security. It centers on business, principally on security measures for electronic commerce (e-commerce), in order to avoid cyber attacks, innovate, obtain information, and ultimately win clients [ 5 ].

First, intelligent models have been built around the factors that induce Internet users to make an online purchase, in order to build effective strategies [ 48 ], while diverse AI models for controlling unauthorized intrusion have been discussed [ 49 ], particularly in countries such as China, to resolve drawbacks in firewall technology, data encryption [ 4 ], and qualification [ 2 ].

Second, e-commerce has had to adapt to the increasingly demanding environment of a world pandemic by finding new revenue sources for business [ 3 ] and restructuring digital business processes to promote new products and services with sufficient privacy and with manpower qualified to deal with the AI involved [ 50 ].

Third, AI has been developed to intelligently protect business, either through a distinct model of decision trees amidst the Internet of Things (IoT) [ 51 ] or by ameliorating network management through active networks technology with a multi-agent architecture able to imitate the reactive behavior and logical inference of a human expert [ 52 ].

Fourth, the role of AI has been reconceptualized within the spatial and non-spatial dimensions of proximity in a new digital industry framework, aiming to connect the physical and digital production spaces in both traditional and new technology-based approaches (e.g., Industry 4.0), thereby promoting innovation partnerships and efficient technology and knowledge transfer [ 53 ]. In this vein, there is an attempt to move management systems from a centralized to a distributed paradigm along the network, based on criteria such as the delegation degree [ 54 ], which even allows the transition from Industry 4.0 to Industry 5.0 through AI in the form of the Internet of everything, multi-agent systems, emergent intelligence, and enterprise architecture [ 58 ].

Fifth, following that networking paradigm, there is also an attempt to manage agent communities in distributed and varied manufacturing environments through an AI multi-agent virtual manufacturing system (e.g., MetaMorph) that optimizes real-time planning and security [ 55 ]. In addition, smart factories have been built to mitigate the security vulnerabilities of intelligent manufacturing process automation through AI security measures and devices [ 56 ], for example in the design of a mine security monitoring configuration software platform on a real-time framework (e.g., the device management class diagram) [ 26 ]. Smart buildings have also been adopted in manufacturing and nonmanufacturing environments, aiming at reducing costs and building height and minimizing the space required for users [ 57 ].

Finally, aiming to augment the cyber security of e-commerce and business in general, other projects have been put in place, such as computer-assisted audit tools (CAATs), able to carry out continuous auditing and allowing auditors to increase their productivity amidst real-time accounting and electronic data interchange [ 59 ], alongside a surge in the demand for high-tech/AI jobs [ 60 ].

4.3. AI Social Applications

As seen, AI systems security can be widely deployed across almost all domains of society, be it regulation, Internet security, computer networks, privacy, digitalization, health, or numerous other fields (see Figure 4 ).

First, there have been attempts to regulate cyber security, namely in terms of its legal support with regard to the application of artificial intelligence technology [ 61 ], in an innovative and economically/politically friendly way [ 9 ]. Examples span infrastructures, by improving the efficiency of technological processes in transport and reducing, for example, inter-train stops [ 63 ], and education, by improving the cyber security of university e-government, for example by forming a certificate database structure of university performance key indicators [ 21 ], supporting e-learning organizations through swarm intelligence [ 23 ], assessing the risks a digital campus faces according to ISO series standards and risk-level criteria [ 25 ], and suggesting relevant solutions to key issues in network information safety [ 12 ].

Second, some moral and legal issues have arisen, particularly in relation to privacy, sex, and childhood. These include the ethical/legal legitimacy of publishing open-source dual-purpose machine-learning algorithms [ 18 ]; the need for a legislated framework comprising regulatory agencies and representatives of all stakeholder groups gathered around AI [ 68 ]; the gendering of VPAs as female (e.g., Siri), which replicates normative assumptions about the role of women as secondary to men [ 15 ]; the need to include communities so that they uphold their own codes [ 35 ]; and the need to improve the legal position of people, and children in particular, who are exposed to AI-mediated risk-profiling practices [ 7 , 69 ].

Third, traditional industry also benefits from AI. It can improve, for example, the safety of coal mines, by analyzing the coal mine safety scheme storage structure and building a data warehouse and analysis [ 64 ]; the security of smart cities and their intelligent devices and networks, through AI frameworks (e.g., the Unified Theory of Acceptance and Use of Technology—UTAUT) [ 65 ]; housing [ 10 ] and building [ 66 ] security systems in terms of energy balance (e.g., Direct Digital Control Systems), implying fuzzy logic as a non-precise programming tool that allows the systems to function well [ 66 ]; and the detection and mitigation of data integrity attacks on outage management systems (OMSs) [ 67 ].

Fourth, citizens in general have reaped benefits from AI in areas such as police investigation, through expert systems that support the profiling and tracking of criminals based on machine-learning and neural network techniques [ 17 ]; video surveillance systems of real-time accuracy [ 76 ], resorting to models that detect moving objects while keeping up with environment changes [ 36 ] and to dynamic sensor selection for processing the image streams of all cameras simultaneously [ 37 ]; and ambient intelligence (AmI) spaces, in which devices, sensors, and wireless networks combine data from diverse sources and monitor user preferences, with the resulting effects on users’ privacy handled under a regulatory privacy framework [ 62 ].

Finally, AI has granted society noteworthy progress in clinical assistance, such as an electronic health record system integrated into existing risk management software to monitor sepsis in the intensive care unit (ICU) through a peer-to-peer VPN connection with a fast and intuitive user interface [ 72 ]. It has also offered an innovative organizational housing model that combines remote surveillance, diagnostics, and the use of sensors and video to detect anomalies in the behavior and health of the elderly [ 20 ], together with a case-based decision support system for the automatic real-time surveillance and diagnosis of health-care-associated infections using diverse machine-learning techniques [ 73 ].

4.4. Neural Networks

Neural networks, the process through which machines learn from observational data and come up with their own solutions, have lately been discussed along several streams of issues.

First, it has been argued that it is opportune to develop a software library for creating artificial neural networks for machine learning to solve non-standard tasks [ 74 ], alongside a decentralized and integrated AI environment that can accommodate video data storage and event-driven video processing gathered from varying sources, such as video surveillance systems [ 16 ], whose images can be improved through AI [ 75 ].

Second, such neural network architectures have progressed to huge numbers of neurons, with associative memory devices designed with a number of neurons comparable to the human brain within supercomputers [ 1 ]. Such neural networks can subsequently be modeled on the basis of switch architectures that interconnect neurons and store the training results in memory, and on the basis of genetic algorithms they can be exported to other robotic systems, for example a model of human personality for use in robotic systems in medicine and biology [ 13 ].

Finally, the neural network is quite representative of AI in the attempt, once trained through human learning and self-learning, to operate without human guidance, as in the case of current vessel seaway positioning systems, which involve a fuzzy logic regulator and a neural network classifier that selects optimal neural network models of the vessel paths to obtain control activity [ 22 ].

4.5. Data Security and Access Control Mechanisms

Access control can be deemed a classic security model that is pivotal to any security and privacy protection process, supporting data access from different environments as well as protecting against unauthorized access according to a given security policy [ 81 ]. In this vein, data security and access control mechanisms have been widely debated, particularly with regard to their distinct contextual conditions in terms, for example, of spatial and temporal environs that differ across diverse, decentralized networks. Those networks constitute a major challenge because they are dynamically located in “cloud” or “fog” environments rather than fixed desktop structures, thus demanding innovative approaches to access security, such as fog-based context-aware access control (FB-CAAC) [ 81 ]. Context-awareness is therefore an important characteristic of changing environs, where users access resources anywhere and anytime. As a result, it is paramount to highlight the interplay between the information, now based on fuzzy sets, and its situational context in order to implement context-sensitive access control policies, for example following subject- and action-specific attributes. In this way, different contextual conditions, such as user profile information and social relationship information, need to be added to the traditional spatial and temporal approaches to sustain these dynamic environments [ 81 ]. In the end, the corresponding policies should define the security and privacy requirements through a fog-based context-aware access control model that must be respected for distributed cloud and fog networks.
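The kind of context-sensitive policy check discussed above can be sketched as follows; the roles, attributes, and thresholds are purely illustrative assumptions, not part of the FB-CAAC model itself.

```python
from dataclasses import dataclass

@dataclass
class Request:
    user_role: str       # subject attribute
    location: str        # spatial context
    hour: int            # temporal context (0-23)
    device_trust: float  # fuzzy contextual attribute in [0, 1]

def is_authorized(req: Request) -> bool:
    """Grant access only when role, spatial, temporal, and
    fuzzy-trust conditions all hold (illustrative policy)."""
    role_ok = req.user_role in {"doctor", "nurse"}
    place_ok = req.location == "ward"
    time_ok = 8 <= req.hour < 20
    trust_ok = req.device_trust >= 0.7  # crisp cut on a fuzzy score
    return role_ok and place_ok and time_ok and trust_ok

print(is_authorized(Request("nurse", "ward", 10, 0.9)))  # True
print(is_authorized(Request("nurse", "cafe", 10, 0.9)))  # False: wrong spatial context
```

A real FB-CAAC deployment would evaluate such conditions at fog nodes close to the user and combine fuzzy membership degrees, rather than applying a single crisp threshold as here.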

5. Conclusion and Future Research Directions

This piece of literature allowed illustrating the AI impacts on systems security, which influence our daily digital life, business decision making, e-commerce, diverse social and legal issues, and neural networks.

First, AI will potentially impact our digital and Internet lives in the future, as the major trend is the emergence of ever newer malicious threats from the Internet environment; accordingly, greater attention should be paid to cyber security. Likewise, the progressively greater complexity of the business environment will demand more and more AI-based decision support systems that enable management to adapt in a faster and more accurate way, while requiring unique digital e-manpower.

Second, e-commerce and manufacturing, particularly amidst the COVID-19 world pandemic, tend to grow exponentially, as already observed, demanding corresponding progress in cyber security measures and strategies. The same applies to the social applications of AI which, following the increase in distance services, will also tend to adopt this model, applied to improved e-health, e-learning, and e-elderly monitoring systems.

Third, subsequent divisive issues are being brought to the academic arena, which demands progress in terms of a legal framework, able to comprehend all the abovementioned issues in order to assist the political decisions and match the expectations of citizens.

Lastly, further progress in neural network platforms is inevitable, as they represent the cutting edge of AI in terms of human-thinking imitation technology, the main goal of AI applications.

To summarize, we have presented useful insights into the impact of AI on systems security, illustrating its influence both on the delivery of services to people, particularly in the security domains of their daily matters, health, and education, and on the business sector, through systems capable of supporting decision making. In addition, we have advanced the state of the art in terms of AI innovations applied to varying fields.

Future Research Issues

Due to the aforementioned scenario, we also suggest further research avenues to reinforce existing theories and develop new ones, in particular the deployment of AI technologies in small medium enterprises (SMEs), of sparse resources and from traditional sectors that constitute the core of intermediate economies and less developed and peripheral regions. In addition, the building of CAAC solutions constitutes a promising field in order to control data resources in the cloud and throughout changing contextual conditions.

Acknowledgments

We would like to express our gratitude to the Editor and the Referees, who offered extremely valuable suggestions for improvement. The authors were supported by the GOVCOPP Research Unit of Universidade de Aveiro and ISEC Lisboa, Higher Institute of Education and Sciences.

Overview of document citations period ≤ 2010 to 2021.

Overview of document self-citation period ≤ 2010 to 2020.

Author Contributions

Conceptualization, R.R. and A.R.; data curation, R.R. and A.R.; formal analysis, R.R. and A.R.; funding acquisition, R.R. and A.R.; investigation, R.R. and A.R.; methodology, R.R. and A.R.; project administration, R.R. and A.R.; software, R.R. and A.R.; validation, R.R. and A.R.; resources, R.R. and A.R.; writing—original draft preparation, R.R. and A.R.; writing—review and editing, R.R. and A.R.; visualization, R.R. and A.R.; supervision, R.R. and A.R. All authors have read and agreed to the published version of the manuscript.

This research received no external funding.

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.


