
Prioritizing tasks in software development: A systematic literature review

  • Yegor Bugayenko
  • Ayomide Bakare
  • Arina Cheverda
  • Mirko Farina
  • Artem Kruglov
  • Yaroslav Plaksin
  • Witold Pedrycz
  • Giancarlo Succi

Affiliations: Huawei, Moscow, Russia; Institute of Software Development and Engineering, Innopolis University, Innopolis, Russia; Institute of Human and Social Sciences, Innopolis University, Innopolis, Russia; Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Canada; Department of Computer Science and Engineering, University of Bologna, Bologna, Italy

* E-mail: [email protected] (MF); [email protected] (AK)

  • Published: April 6, 2023
  • https://doi.org/10.1371/journal.pone.0283838

Abstract

Task prioritization is one of the most researched areas in software development. Given the huge number of papers written on the topic, it might be challenging for IT practitioners (software developers and IT project managers) to find the most appropriate tools or methods developed to date to deal with this important issue. The main goal of this work is therefore to review the current state of research and practice on task prioritization in the Software Engineering domain and to identify the most effective ranking tools and techniques used in the industry. For this purpose, we conducted a systematic literature review guided and inspired by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses, otherwise known as the PRISMA statement. Based on our analysis, we can make a number of important observations for the field. Firstly, we found that most of the task prioritization approaches developed to date involve a specific type of prioritization strategy: bug prioritization. Secondly, the most recent works we review investigate task prioritization in terms of “pull request prioritization” and “issue prioritization” (and we speculate that the number of such works will significantly increase due to the explosion of version control and issue management software systems). Thirdly, we remark that the most frequently used metrics for measuring the quality of a prioritization model are f-score, precision, recall, and accuracy.

Citation: Bugayenko Y, Bakare A, Cheverda A, Farina M, Kruglov A, Plaksin Y, et al. (2023) Prioritizing tasks in software development: A systematic literature review. PLoS ONE 18(4): e0283838. https://doi.org/10.1371/journal.pone.0283838

Editor: Bilal Alatas, Firat Universitesi, TURKEY

Received: January 4, 2023; Accepted: March 17, 2023; Published: April 6, 2023

Copyright: © 2023 Bugayenko et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: Data are all contained within the paper and/or Supporting information files.

Funding: This research was supported by Huawei Technologies. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

In software development, the vast majority of tasks do not have mandatory dependencies and it is up to the project manager to decide which task should be completed first. The proper continuous prioritization of tasks (known as backlog refinement in agile terminology) becomes a critical success factor for any software development project, as it guarantees that the company’s crucial goals are in focus and can be met [ 1 ].

What is a task, though? The term “task” in software engineering refers to the smallest unit of work subject to management accountability that needs to be completed as part of a software development project [ 2 ]. In the context of software development, then, “task” is an umbrella term that encompasses concepts such as “pull request” and “issue,” commonly found in development platforms such as GitHub and GitLab [ 3 ], as well as ideas such as “bug,” “feature,” and “improvement,” commonly used in task management. Although these concepts and ideas are considered conceptually independent, they often overlap in practice.

In an attempt to optimize the process and practice of task prioritization, researchers approached the problem from a bug-fixing perspective; that is, in terms of selecting the most appropriate developer for a given task [ 4 ]. Cubranic and Murphy were among the first to analyze the problem of task prioritization in terms of Machine Learning (ML), namely as a classification problem [ 5 ]. The datasets provided in their research, Eclipse (see https://bugs.eclipse.org/bugs/ ) and Mozilla (see http://www.mozilla.org/projects/bugzilla ), have become the de facto standard for training and testing ML models in this problem domain.

However, it is worth noting that other researchers developed alternative methods and approaches to improve the process of prioritizing and assigning bug fixes. For example, Zimmermann et al. [ 6 ] provided a series of recommendations for formulating and better classifying bug reports, while Anvik et al. [ 7 ] proposed an effective strategy for developer selection. Panjer [ 8 ] formulated a method capable of predicting bugs’ lifetime, and Wang et al. [ 9 ] suggested a new technique for identifying bug duplicates.

Menzies and Marcus [ 10 ] adopted another conceptual framework for dealing with the problem of task prioritization and proposed a solution based on the prediction of the severity of bug reports. Their work formed the conceptual palette necessary for the development of further research on bug priorities prediction, such as the works by Sharma et al. [ 11 ] and Tian et al. [ 12 ].

The importance, urgency, and significance of this problem for the Software Engineering community is also attested by the recent publication of several surveys, such as [ 13 – 16 ]. Among them, the work of Gousios et al. showed that the issue of task prioritization is particularly sensitive for development teams that follow a pull-based development model [ 16 – 18 ].

The considerations made above clearly demonstrate that task prioritization has become an active research topic in software engineering. On the one hand, its growth signals a positive trend: the more people get involved in the discussion of these issues, the more ideas are generated and accumulated in the scientific community. On the other hand, though, wide participation poses potentially insurmountable challenges for researchers and developers in terms of understanding the current state and capabilities of the field. Therefore, we believe that a comprehensive systematic literature review (SLR) carried out on this topic is going to be highly beneficial for researchers, project managers, developers, scrum masters, and other industry practitioners.

Research problem and objectives

Taking this important observation as a starting point, this work reviews how the IT industry addresses the problem of task prioritization and attempts to produce a state-of-the-art summary of tools and techniques used for this purpose. Although we do not limit our work to specific methods, we expect to mostly gather Machine Learning (ML)-based approaches. This is because of the recent successes of ML in software engineering and computer science [ 19 , 20 ].

The objectives of the SLR are therefore to:

  • present our readership (mostly IT practitioners) with newly-developed techniques for ranking tasks that they can reliably use in their work,
  • develop new strategies in ranking and prioritizing tasks, thus filling current gaps in the relevant literature and
  • identify possible directions for future research.

The scientific contribution of this paper includes structured information on task prioritization, a survey of existing tools and approaches, methods, and metrics, as well as some estimates about their effectiveness and reliability.

Structure of the paper

This paper is organized as follows. Section Related Works provides an overview of current research on task prioritization and a helpful comparison between such research and the focus and scope of our work. Section SLR Protocol Development describes the protocol used in this systematic literature review. Section Results presents the results of our work, while section Discussion contextualizes our findings and section Critical Review of our Research Question provides their critical interpretation. Section Limitations, Threats to Validity, and Review Assessment evaluates the limitations and various other shortcomings potentially affecting our study, while section Conclusion summarizes what we achieved and points out future research directions.

Related works

There exist a number of studies devoted to requirements prioritization techniques. For example, Achimugu et al. [ 21 ] found that the most cited techniques for requirements prioritization include Analytical Hierarchy Process (AHP), Pairwise Comparison, Cost-Value Prioritization, and Cumulative Voting. More recent trends in prioritizing requirements include ML techniques (such as Case Base Ranking and Fuzzy AHP). Bukhsh et al. [ 22 ] also identified a trend toward fuzzy logic and machine learning methods. Somohano-Murrieta et al. [ 23 ] investigated the most documented techniques with regard to scalability and time consumption problems. Rashdan [ 24 ] found evidence of a shift towards computer-assisted/algorithmic methods, while Sufian et al. [ 25 ] analyzed factors that influence prioritization and identified commonly used techniques and tools aimed at improving the process. These studies underline the importance and evolution of requirements prioritization techniques and, at the same time, emphasize the need for real-world evaluations and scalability solutions.

There are also studies aimed at analyzing other aspects of software engineering, which are typically connected with prioritization issues (such as analysis of non-functional requirements, code smells, technical debt, and software bugs). For example, Kaur et al. [ 26 ] identified existing techniques for code smell prioritization and introduced different tools for prioritizing code smells (such as Fusion, ConQAT, SpIRIT, JSpIRIT, PMD, Fica, JCodeOdor, and DT-SOA). Alfayez et al. [ 27 ] investigated technical debt prioritization and identified a number of important techniques used, which include: Cost-Benefit Analysis, Ranking, Predictive Analytics, Real Options Analysis, Analytic Hierarchy Process, Modern Portfolio Theory, Weighted Sum Model, Business Process Management, Reinforcement Learning, and Software Quality Assessment Based on Lifecycle Expectations (SQALE). However, the researchers concluded that more research is needed to develop technical debt prioritization approaches capable of effectively considering costs, values, and resource constraints. Ijaz et al. [ 28 ] looked at non-functional requirements prioritization techniques and found that AI techniques can potentially handle uncertainties in requirements while helping to overcome the most common limitations characterizing standard approaches (such as AHP). Pasikanti and Kawaf [ 29 ] studied the latest trends in software bug prioritization and identified a series of ML techniques (such as Naive Bayes, Support Vector Machines, Random Forest, and Multinomial Naive Bayes) that are most commonly used for prioritizing software bugs.

Several SLRs were also conducted to identify the most commonly used techniques for test case selection and prioritization in software testing. For example, Pan et al. [ 30 ] found that Supervised Learning, Unsupervised Learning, Reinforcement Learning, and NLP-based methods have been applied to test case prioritization; yet, due to a lack of standard evaluation procedures, the authors couldn’t draw reliable conclusions on their effective performance. Bajaj and Sangwan [ 31 ] observed that genetic algorithms bear great potential for solving test case prioritization problems, while nevertheless noting that the design of parameter settings, type of operators, and fitness function significantly affects the quality of the solutions obtained.

Another important area of research focuses on aspects of task assignment and allocation in software development projects. Filho et al. [ 32 ] reviewed works on multicriteria models for task assignment in distributed software development projects with a special focus on qualitative decision-making methods. TAMRI emerged as the most efficient and widely used approach, while McDSDS, Global Studio Project, and 24-Hour Development Model received lower scores. Fatima et al. [ 33 ] studied the models used for task assignment and scheduling in software projects. The review found that static models are the most widely used for task scheduling, while the Support Vector Machine algorithm is the most widely used for task assignment. Both papers demonstrated that specific factors (such as personal aspects, team skills, labor cost, geographic issues, and task granularity) are crucial for the practice of software management.

However, the contribution of our SLR is unique and different from that of the above-mentioned studies because ( Table 1 ):

  • Unlike other SLRs, which have focused, as we have seen above, on prioritization techniques for requirements, test cases, bugs, and/or other artifacts of software development, our own review provides comprehensive coverage of the problem at stake. Crucially, it does so by addressing the broad category of “task”, without focusing on a specific type of prioritized item.
  • In addition, our research differs from prior studies on task allocation/assignment in several respects. Firstly, the problem of assignment/allocation involves distributing tasks based on various factors (such as skills, availability, and workload), whereas the problem of prioritization focuses on determining which tasks should be completed first. Secondly, prior research has predominantly relied on qualitative analyses of algorithms, methods, and tools for task allocation/assignment, without conducting detailed quantitative analyses of their effectiveness. Our research aims to bridge these important gaps in the literature by conducting a comprehensive quantitative analysis of task prioritization techniques in order to determine their effectiveness in different contexts.


https://doi.org/10.1371/journal.pone.0283838.t001

SLR protocol development

SLRs offer a comprehensive analysis of the research conducted in the field while also providing critical, original insights [ 34 ]. They are of paramount importance for scientific progress and, for this reason, represent one of the preferred methods used by researchers to investigate the state of the art of a particular research topic [ 35 ].

The quality of SLRs can vary greatly and it is important to ensure that an SLR is conducted in a rigorous and systematic manner [ 36 , 37 ]. Thus, to ensure the comprehensiveness and soundness of our work we followed the PRISMA Statement [ 38 ], which is essentially a checklist, conventionally adopted by researchers worldwide, to guide, orient, and inform the development of any SLR. The PRISMA 2020 checklist adopted for this study is included as S1 Table .

Since the PRISMA checklist mentioned above is not, strictly speaking, a methodological framework but rather a series of suggestions, or better, recommendations to be implemented for the sound development of any SLR (even beyond computer science), we decided to integrate and complement it with a more specific methodological framework: the one recently developed by Kitchenham and Charters [ 39 ]. This framework was chosen due to its focus on software engineering and because its effectiveness has been amply demonstrated in previous studies [ 40 – 42 ]. We believe that complementing the general indications or recommendations outlined in the PRISMA checklist (which are valid for any field) with a framework specifically designed for research in software engineering is highly beneficial for this study, as it guarantees better accuracy. In addition, since the checklist and the framework partially overlap (despite being also complementary), one can use them to mutually strengthen each other. The stages of the methodological framework adopted in this SLR are:

  • Specification of the research questions.
  • Development of the review protocol.
  • Formulation of the literature log.
  • Performance of quality assessment.
  • Data extraction.
  • Data synthesis.
  • Formulation of the main report.
  • Evaluation of the review and of the report.

Research questions

The first step in any SLR involves the formulation of a series of research questions that can guide and inform its development.

To formulate the most appropriate research questions, we adopted the Goal Question Metric (GQM) model developed by Basili et al. [ 43 ]. This model requires specifying up front the purpose of analysis, the objects and the issues to analyze, as well as the standpoints from which the analysis is performed. The Goal Question Metric model for this work is the following:

  • Purpose Systematic literature review.
  • Object Peer review publications in computer science and software engineering.
  • Issue Approaches for ranking tasks in software development.
  • Viewpoint Software engineers and industry practitioners.

With the GQM model in place, we then formulated the Research Questions (RQs) that characterized this work:

  • RQ1 What are the existing approaches for automatic task ranking in software development?
  • RQ2 Which methods are used in automatic task ranking models and approaches and how is their effectiveness assessed?
  • RQ3 What are the most effective and versatile models for automatic task ranking developed so far?

The motivation for RQ1 is to gain a clear understanding of existing research on the topic. Then, moving from general to more specific tasks, we formulate RQ2 with the intent of finding out which methods for task ranking are currently the most popular in the software development industry and how their effectiveness can be assessed. Further research along these lines leads to RQ3, through which we try to rank such methods in terms of effectiveness, accuracy, fidelity, and reliability. This could help develop new ranking strategies and remedial approaches for the field.

Literature search process

Following the best practices in the field [ 42 ], we selected the following databases for our searches: Google Scholar, Microsoft Academic, ScienceDirect, IEEE Xplore, and ACM digital library.

We then extracted a set of basic keywords, which describe our research questions. The keywords are: a) manage, b) backlog, c) priority, d) task, e) job, f) commit, g) bug, h) pull request, i) issue, j) feature, k) software, l) rank, m) distributed software development, n) machine learning.

Searches via keywords yielded a very large number of papers. Thus, to screen out irrelevant documents and add focus and precision to our work, we formulated a set of search queries by using Boolean operators, as is common in the literature. Upon conducting an initial screening of papers, we discovered that there were more papers focused on prioritizing bug reports than on prioritizing pull requests and issues. Because of this, we decided to model a series of search queries around these themes for better coverage. The list of queries we used for our searches is reported below (a minimal sketch of how such query strings can be assembled programmatically follows the list):

  • (((pull OR merge) AND request) OR GitHub issue) AND (prioritization OR priority OR rank OR order OR ranking OR ordering)
  • (task OR bug OR defect OR feature) AND (prioritization OR priority OR rank OR order OR ranking OR ordering)
  • bug severity AND priority AND (machine learning OR neural network)
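The following minimal Python sketch illustrates, for the reader’s convenience, how Boolean query strings of this kind can be assembled from keyword groups. The grouping shown is illustrative only and does not reproduce the exact query syntax accepted by each database.

```python
# A minimal sketch: combine two keyword groups into a Boolean query string.
# The keyword groups below are taken from the query list above; the helper
# function and its name are illustrative, not part of any database API.
def build_query(subject_terms, ranking_terms):
    """Join terms with OR inside each group and AND between the two groups."""
    subjects = " OR ".join(subject_terms)
    rankings = " OR ".join(ranking_terms)
    return f"({subjects}) AND ({rankings})"

query = build_query(
    ["task", "bug", "defect", "feature"],
    ["prioritization", "priority", "rank", "order", "ranking", "ordering"],
)
print(query)
# (task OR bug OR defect OR feature) AND (prioritization OR priority OR ...)
```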

We performed our searches by using these queries on the selected databases. Table 2 displays the results we obtained.


Acronyms used: GS—Google Scholar, IEEE—IEEE Xplore, MA—Microsoft Academic, SD—ScienceDirect.

https://doi.org/10.1371/journal.pone.0283838.t002

Inclusion and exclusion criteria

Next, we specified inclusion (IC) and exclusion (EC) criteria as recommended by Patino and Ferreira [ 44 ]. IC and EC help the authors decide which articles found through seminal searches deserve to be considered for further analysis. In this study we used the following IC and EC:

  • IC1 The paper is written in English.
  • IC2 The paper is peer-reviewed and published by a reputable publisher.
  • IC3 The paper was published in 2006 or later*.
  • IC4 The paper uses ML techniques to deal with backlog systems or tasks/todos.
  • IC5 The paper compares different ML models or compares ML models with other learning models.
  • EC1 The paper does not satisfy at least one of the ICs.
  • EC2 The paper is a duplicate or contains duplicate information.
  • EC3 The paper is an editorial, an opinion piece, or an introduction. In general, the paper is excluded if it does not contain any original insight.
  • EC4 The paper does not present any type of experimentation or comparison or results.

*Shoham et al. [ 45 ] noted that around 2006 there was a peak of interest in ML in the software engineering community. We thus selected this year as the starting point for our systematic review.

Search results by sources

In this subsection, we offer to our readers a detailed description of the process that led to the inclusion of preliminary selected papers in our final reading log ( Table 3 ).


The table shows the procedure through which potentially relevant papers were screened out through the adoption of IC and EC criteria. The number of papers included in the final reading log is shown in the column “Selected papers”.

https://doi.org/10.1371/journal.pone.0283838.t003

We note that we only considered the first 100 results displayed in the relevant databases for each of the four queries we formulated. This is justified by the fact that the databases we used normally sort results by relevance and credibility (e.g., h-index, number of citations, impact factor) and by the observation that usually no relevant paper is found after the first 100 results.

The PRISMA flow chart diagram shown in Fig 1 represents the process of inclusion/exclusion visually for the reader.


It shows the stages of the search process as a flowchart diagram [ 38 ].

https://doi.org/10.1371/journal.pone.0283838.g001

Quality assessment

To assess the quality of the manuscripts, we defined a set of criteria and applied them to all the papers selected for inclusion in our reading log:

  • QA1 Were the objectives and the research questions clearly specified?
  • QA2 Were the results evaluated critically and comprehensively?
  • QA3 Was the research process transparent and reproducible?
  • QA4 Are there comparisons with alternatives?

We then determined whether the papers we selected matched the criteria and, if so, the extent to which they did. We assigned 1 if a paper fully matched the criterion, 0.5 if it partially matched the criterion, and 0 otherwise.

The criteria used for QA1 are:

  • Fully matched The objectives and research questions were explicitly stated.
  • Partially matched The goals of the paper and its research questions were sufficiently clear but could be improved.
  • Not matched No objectives were stated, the research questions were hard to determine, or they did not relate to the research being carried out.

The criteria used for QA2 are:

  • Fully matched The authors of the paper provided a critical, balanced, and fair analysis of their results.
  • Partially matched The results were only partly (sufficiently) scrutinized and a comprehensive critical analysis was missing.
  • Not matched The authors did not evaluate their results.

The criteria used for QA3 are:

  • Fully matched The paper specified the methodology and the technologies used as well as the data gathered.
  • Partially matched Minor details were lacking (for example, a dataset is not readily available).
  • Not matched It was impossible to reconstruct the sequence of actions, or other critical details (such as the algorithm or technologies used) were missing.

The criteria used for QA4 are:

  • Fully matched A comparison with other solutions was offered; advantages and limitations were clearly stated.
  • Partially matched A comparison was offered, but it was not comprehensively discussed.
  • Not matched No comparison was provided.

The resulting scores are shown in S2 Table , and their distribution can be found in Fig 2 .


Each paper was evaluated on a scale from 0 to 1 as per QA1-QA4. The bars display the number of papers with their respective quality score.

https://doi.org/10.1371/journal.pone.0283838.g002

Many of the papers we selected for inclusion in our final log were of high quality, as demonstrated by the scores reported in Fig 2 . The average quality score was 2.9 out of 4. This supports the reliability of the findings on which we based our SLR.
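As a minimal illustration of the scoring scheme described above, the Python sketch below aggregates hypothetical per-criterion scores (QA1–QA4, valued 1/0.5/0) into a per-paper quality score and an overall mean. The paper names and values are invented for the example and do not reproduce the actual scores in S2 Table.

```python
# Hypothetical quality scores; each paper gets 1 / 0.5 / 0 for QA1-QA4.
papers = {
    "paper_A": [1, 1, 0.5, 0.5],
    "paper_B": [1, 0.5, 1, 0],
    "paper_C": [0.5, 1, 1, 1],
}

# Per-paper total (out of 4) and the overall mean, analogous to the 2.9/4 reported above.
totals = {name: sum(scores) for name, scores in papers.items()}
average = sum(totals.values()) / len(totals)

print(totals)
print(f"average quality score: {average:.1f} / 4")
```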

Results

This section presents the findings gathered from the papers we included in our final reading log. More specifically, in this section, we use a series of statistical tools to cluster and organize the papers we selected in meaningful ways. Such clustering is beneficial for our readers because it provides some background for the conclusions we will draw in subsequent sections.

Preliminary clustering

We start this process of clustering by summarizing the potential advantages and disadvantages of the databases we used to perform our searches.

  • Microsoft Academic has the advantage of extensive coverage of scientific research, including patents. Its limitation is that some of the papers it lists are not peer-reviewed.
  • IEEE Xplore provides peer-reviewed publications, generally of high quality. Its limitation is that its full functionality requires a subscription, which is pricey.
  • ScienceDirect offers comprehensive coverage with tools for statistical analysis. However, it is behind a paywall and has limitations for query building.
  • ACM provides comprehensive coverage with a particular emphasis on IT. Its major limitation is that it requires a subscription.
  • Google Scholar is one of the best database aggregators. It provides comprehensive coverage and tools for statistical analysis. However, it includes grey literature and non-peer-reviewed publications.

While not of crucial importance for the development of this work, we notice that such–complementary–information can be useful to ensure the academic integrity and scientific soundness of our approach.

The distribution of the papers included in the final reading log by database is presented in Table 3 . Fig 3 presents the information contained in Table 3 in the form of a pie chart, which is probably more accessible for the reader. For convenience, we attributed papers to single repositories (even though some papers could be found across different databases). The attribution was subjective in character and determined by the chronological order of the searches we performed.


The pie chart shows the percentage of papers found in the databases we considered in this study. Acronyms used: GS—Google Scholar, IEEE—IEEE Xplore, SD—ScienceDirect. The number of papers is given in brackets.

https://doi.org/10.1371/journal.pone.0283838.g003

To give the reader a fuller picture of our results, we added information about the distribution of papers by publisher. This information can be found in Fig 4 . We note that the following journals and publishers fall under the label “others,” which accounts for about 14% of selected studies: ASTL (SERSC), CES (hikari), EISEJ, IJACSA, IJARCS, IJCNIS, IJCSE, IJOSSP, JATIT, Sensors (MDPI), and TIIS (KSII).


The pie chart shows the distribution (in percentages) of the papers we considered in our reading log by publishers. Acronyms used: IEEE—IEEE Xplore, WS—World Scientific.

https://doi.org/10.1371/journal.pone.0283838.g004

Studies classification

In this subsection, we present a series of statistical data that can be used to cluster our findings. Firstly, we identified two major topics characterizing the studies included in our reading log, “bug prioritization” and “bug severity prediction”, and two minor topics, “issue prioritization” and “pull request prioritization”. It is worth noting that even though bug severity [ 46 ] and bug priority [ 47 ] are two different theoretical entities (often treated as such even by project managers), a few works [ 12 , 48 , 49 ] demonstrated that severity can sometimes help predict priority. This is why, in this study, we consider not only papers concerned with bug priority but also those related to bug severity. Table 4 shows the distribution of publications across these topics.

The table shows the number (column “Quantity”) of papers devoted to a particular key topic (column “Topic”). Note: two papers contain content relevant to both topics 1 and 2.

https://doi.org/10.1371/journal.pone.0283838.t004

Secondly, we clustered the distribution of topics by year of publication (Figs 5 and 6 ). The dynamics of growth for the key topics underlying this study are roughly the same. This suggests that the scientific community is equally interested in both topics. As we noted above, this demonstrates their close interrelation.

The bars show the number of papers related to the key topic published in a particular year. Black bars show the number of papers related to “bug prioritization”. White bars show the number of papers related to “bug severity prediction”.

https://doi.org/10.1371/journal.pone.0283838.g005


The bars show the number of papers related to the key topic published in a particular year. Black bars show the number of papers related to “issue prioritization”. White bars show the number of papers related to “pull request prioritization”.

https://doi.org/10.1371/journal.pone.0283838.g006

Thirdly, building and expanding on this classification, we clustered the papers we selected by year of publication. Fig 7 shows our results. The same information is presented in Table 5 , where it is aggregated and visualized over 4-year periods.

The bars show the number of papers published on the topic between 2010 and 2021. No papers were found for the period 2006–2009; 2006 was the starting year for our SLR (IC3).

https://doi.org/10.1371/journal.pone.0283838.g007


The table shows the number (column “Quantity”) and percentage (column “Percentage”) of papers for the specified period (column “Years”).

https://doi.org/10.1371/journal.pone.0283838.t005


The Pearson correlation coefficient is 0.9 with a p-value of 6.4e-06. This confirms our assumption that there is a significant synergy between the growth in the number of ML tools and their application to our problem domain.
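The sketch below shows how such a correlation can be computed with SciPy. The yearly counts are hypothetical placeholders, not the actual series behind the coefficient reported above.

```python
from scipy.stats import pearsonr

# Hypothetical yearly counts: ML publications overall vs. task-prioritization papers.
ml_papers = [120, 180, 260, 400, 650, 900, 1300, 1800, 2500, 3300, 4200, 5100]
prio_papers = [1, 2, 2, 3, 5, 6, 8, 10, 12, 14, 15, 17]

# pearsonr returns the correlation coefficient and the two-sided p-value.
r, p_value = pearsonr(ml_papers, prio_papers)
print(f"r = {r:.2f}, p = {p_value:.1e}")
```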

Fourthly, we also collected some statistics related to the tags used in the papers we included in the final reading log. Information about this point is presented in Fig 8 .


The bars show the number of papers related to a specific tag. Note: several tags can be assigned to one paper.

https://doi.org/10.1371/journal.pone.0283838.g008

We note that the information presented in Fig 8 can be used to:

  • add more substance to the conclusions related to algorithm distribution we made in Further Clustering;
  • validate the relevance and the significance of our selection (the papers included in the final reading log);
  • characterize the most popular “subtopics” investigated by researchers worldwide in the selected domain.

Further clustering

We next proceed to further cluster our results and we do so along three dimensions: a) algorithms, b) datasets, and c) metrics. Fig 9 shows the algorithms used for training the models. Naive Bayes [ 51 ] is the most popular method among the models observed in the papers we reviewed.


The bars show the number of papers in which the specified algorithms were considered. Note: several algorithms can be considered in one paper.

https://doi.org/10.1371/journal.pone.0283838.g009
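As a minimal illustration of how Naive Bayes is typically applied in this problem domain, the sketch below trains a Multinomial Naive Bayes classifier on TF-IDF features of hypothetical bug-report summaries to predict a priority label. It is not a reproduction of any specific model from the reviewed papers; the reports and labels are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical bug-report summaries and priority labels (P1 = highest priority).
reports = [
    "Crash on startup when config file is missing",
    "Typo in the about dialog",
    "Memory leak in the rendering loop",
    "Button label slightly misaligned",
]
priorities = ["P1", "P4", "P1", "P4"]

# TF-IDF features feed a Multinomial Naive Bayes classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(reports, priorities)

print(model.predict(["Application crashes when saving a large file"]))
```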

In addition, we clustered the datasets used in the papers included in the final reading log. The most frequently used datasets are presented in Fig 10 .

The bars show the number of papers in which the specified datasets were considered. Note: several datasets could be considered in one paper.

https://doi.org/10.1371/journal.pone.0283838.g010

Fig 10 shows the datasets most frequently used, which account for 48.7% of all datasets. The remaining datasets, accounting for 51.3% of the total, were each used only once (for example, the bug repository of HDFS). It is also worth noting that a single dataset can appear in many articles; the total number of dataset occurrences is calculated on the basis of this observation.

Finally, we collected statistics about the metrics used in the papers we included in our reading log ( Fig 11 ). We did not plot the metrics reported only once. Nevertheless, we believe that such metrics are important because they can contribute to a comprehensive overview of metric usage, which can be instrumental in evaluating the effectiveness of task prioritization models. These metrics include: average percentage of faults detected (APFD), normalized discounted cumulative gain (NDCG), mean squared error (MSE), Cohen’s kappa coefficient, nearest false negatives, nearest false positives, adjusted r squared, prediction time, training time, and robustness.


The bars show the number of papers in which the specified metrics were used. Note: several metrics could be used in one paper. Acronyms used: AUC—area under the ROC curve, MCC—Matthews Correlation Coefficient, MRR—mean reciprocal rank, MAE—mean absolute error, MAP—mean average precision, ROC—receiver operating characteristic curve.

https://doi.org/10.1371/journal.pone.0283838.g011

We note that some papers may contain multiple metrics, which might be jointly used to assess and more comprehensively evaluate the quality of a model. The f-score, as shown in Fig 11 , is the most commonly used metric in the papers we reviewed.
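For reference, the sketch below computes these four metrics with scikit-learn on a handful of hypothetical priority labels. Macro averaging is one common choice for multi-class bug priorities, although the reviewed papers differ in how they average; the labels are invented for the example.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical true vs. predicted priority labels for a handful of bug reports.
y_true = ["P1", "P2", "P2", "P3", "P1", "P3", "P2"]
y_pred = ["P1", "P2", "P3", "P3", "P2", "P3", "P2"]

# Macro-averaged precision, recall, and f-score across the three priority classes.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
accuracy = accuracy_score(y_true, y_pred)

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f-score={f1:.2f} accuracy={accuracy:.2f}")
```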

Discussion

In this section, we contextualize and critically discuss the data presented in Results, while also highlighting their significance and relevance for the field.

RQ1. What are the existing approaches for automatic task ranking in software development?

As discussed in the Introduction, task prioritization can be divided into three subtopics: issue prioritization, PR prioritization, and bug prioritization. In this study, we treat these topics as independent and discuss them below in order of appearance in the literature. Bug prioritization originated first, and it involves prioritizing bugs based on their severity and impact on the software system [ 52 ]. Issue prioritization is the process of selecting and ranking issues based on factors such as their importance and urgency [ 53 ]. Finally, pull request prioritization is the process of selecting and ranking pull requests based on their impact on the software system and their relationship with other pull requests [ 54 ]. Overall trends in the task prioritization field are shown in Fig 12 .


The graph illustrates the chronological development of approaches to task prioritization from 2009 to 2022, specifically with respect to bugs, issues, and pull requests (PR). Each subtopic is denoted by bold text, followed by the total number of associated works. The figure is divided into three vertical fragments, each representing one of the subtopics. Green and red arrows indicate an increase or decrease in the number of publications, respectively.

https://doi.org/10.1371/journal.pone.0283838.g012

Bug prioritization

In this work, we understand bugs as reports by users and developers about program components that do not function properly. The collection of attributes used to describe the reports is usually determined by the platform on which the report was created; in most cases, this platform is Bugzilla (see https://www.bugzilla.org/ ). One of the first studies on bug prioritization, by Cubranic and Murphy [ 5 ], was conducted in 2004. Although this paper has been highly influential in the literature, we did not include it in our main log because it failed to satisfy one of our inclusion criteria (namely, the third criterion).

However, since 2004, the year in which this paper was published, the field of bug prioritization has boomed, giving rise to many fruitful investigations of: prioritization on imbalanced datasets, prioritization in case of scarce datasets, and analyses concerning relevant features in datasets. Each of these areas deals with separate problems inherent to the field of bug prioritization.

Since we were unable to find precise causal relationships in the development of each of these directions, we describe them below on the basis of their popularity. The popularity of each direction is here determined by the number of scientific articles we found to be related to that specific direction. Fig 13 provides information about the popularity of each direction.


The numbers in brackets indicate the number of publications found for each branch.

https://doi.org/10.1371/journal.pone.0283838.g013

Prioritization on imbalanced datasets.

According to Fig 13 , the most popular subtopic within bug prioritization is prioritization on imbalanced datasets. We believe that the popularity of this area stems from the underlying problem it addresses.

In brief, machine learning systems trained on imbalanced data will only perform well on samples with a dominant label. A model trained on a dataset with a high number of low-priority bugs, for example, is more likely to classify subsequent bugs, even high-priority ones, as low priority. One of the researchers who noted the importance of balancing the dataset is Thabtah [ 55 ]. Having explained the possible reasons for the popularity of this topic, we next move on to analyze its general structure as well as some of the most representative works we found that relate to it.

In light of the data we gathered, we can divide bug prioritization on imbalanced datasets into two categories, based on the specific (machine learning) techniques used: those that use one predictor and those that use several predictors (what is known as the ensemble approach).

An example of work belonging to the former category is that of Singha and Rossi [ 56 ]. The authors used a modified version of Support Vector Machine (SVM) to weight classes based on the inverse of their class frequencies. The results suggest that the model provides better prediction quality than standard SVM. Another example of the former category is the work of Guo et al. [ 57 ]. In this study, the authors used Extreme Learning Machine (ELM) as a predictor. Several oversampling strategies were also tested.
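The general idea of weighting classes by inverse class frequency can be sketched with scikit-learn's class_weight="balanced" option, as shown below. This illustrates the principle only, not the authors' exact formulation, and the toy data are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical, heavily imbalanced toy data: many low-priority reports, few high-priority.
reports = [
    "minor typo in docs", "cosmetic issue in footer", "small alignment glitch",
    "slow tooltip animation", "crash with data loss on save", "security hole in login",
]
labels = ["low", "low", "low", "low", "high", "high"]

# class_weight="balanced" reweights each class by the inverse of its frequency,
# in the spirit of the modified SVM described above.
model = make_pipeline(TfidfVectorizer(), LinearSVC(class_weight="balanced"))
model.fit(reports, labels)

print(model.predict(["crash when opening a corrupted project"]))
```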

The findings of the latter study are interesting for the purposes of our review because they indicate that the suggested approach can effectively balance an imbalanced dataset, which can help increase the accuracy of bug prioritization.

As previously indicated, methods using a single predictor are not the only ones available for dealing with bug prioritization on imbalanced datasets. An example of work using multiple predictors (hence belonging to the latter category mentioned above) is that of Awad et al. [ 58 ].

The authors of this work proposed using the so-called ensemble method, in which each category of bug has its own predictor plus an additional general predictor for any type of bug. The topic of ensemble methods was described in more detail by Opitz and Maclin [ 59 ]. The peculiarity of the method is that any machine learning technique can be used as a predictor; the authors of the paper tested several techniques (such as Naive Bayes Multinomial (NBM), Random Forest (RF), and SVM). They also evaluated their proposed approach on both textual and non-textual datasets. Results showed that the proposed method can be successfully used to improve on classical methods; however, this was only the case in the presence of a textual dataset.
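A generic ensemble in this spirit can be sketched as follows: three heterogeneous predictors (NBM, RF, and SVM) vote on the priority of hypothetical bug reports. The sketch uses a plain majority vote rather than the per-category predictors proposed by Awad et al., so it illustrates the ensemble idea only; the data are invented.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# Hypothetical bug-report texts with coarse priority labels.
reports = [
    "null pointer exception crashes the parser",
    "UI freezes when opening very large files",
    "wrong colour used in the dark theme",
    "documentation link on the home page is broken",
]
labels = ["high", "high", "low", "low"]

X = CountVectorizer().fit_transform(reports)

# Hard (majority) voting over three different predictors.
ensemble = VotingClassifier(
    estimators=[
        ("nbm", MultinomialNB()),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("svm", SVC(random_state=0)),
    ],
    voting="hard",
)
ensemble.fit(X, labels)

print(ensemble.predict(X[:1]))  # prediction for the first report
```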

Bug prioritization on scarce datasets.

A second subtopic we found within bug prioritization is prioritization in case of scarce data. Research in this area typically attempts to formulate methods capable of showing consistent and accurate results despite the availability of a relatively small amount of training data. The first work we found on bug prioritization in case of scarce data is that of Sharma et al. [ 11 ], published in 2012. Several machine learning techniques, such as SVM, Naive Bayes (NB), Neural Networks (NN), and K-Nearest Neighbours (KNN), were tested to ascertain the most suitable and most accurate among them. The authors showed that, overall, SVM and NN produce better results.
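A comparison of this kind can be sketched as a cross-validated benchmark of the four families of techniques on a small synthetic dataset standing in for a scarce bug-report corpus. The data and the f-score comparison below are illustrative only and do not reproduce the benchmarks in the cited papers.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Small synthetic dataset standing in for a scarce, numerically encoded bug-report corpus.
X, y = make_classification(n_samples=80, n_features=10, n_informative=5, random_state=0)

models = {
    "SVM": SVC(),
    "NB": GaussianNB(),
    "NN": MLPClassifier(max_iter=2000, random_state=0),
    "KNN": KNeighborsClassifier(),
}

# 5-fold cross-validated f-score for each technique.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: mean f-score = {scores.mean():.2f}")
```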

It is nevertheless worth noting that M. Sharma is the primary contributor to the topic, having published 5 of the 8 papers we found. We note that this research group repeatedly utilized the same set of machine learning techniques (such as SVM, NB, NN, and KNN) [ 11 , 60 , 61 ]. We therefore acknowledge that this may lead to biased conclusions.

We also note that the techniques listed above are not all the techniques that are currently applied, used, or tested in the literature. For example, Zhang et al. [ 62 ] proposed using ELM, while Hernández-González et al. [ 63 ] used the Expectation Maximization (EM) Algorithm.

Analyses of relevant features in datasets.

This brings us to the discussion of the last subtopic within bug prioritization: analyses of relevant features in datasets. The goal of researchers in this field is to identify a set of features within a dataset that will yield the highest accuracy for a model trained on such data.

Although the topic is well-researched, there is still no consensus on the optimal set of attributes to be used. For example, the first publication in the domain by Alenezi and Banitaan [ 64 ] indicates that meta-data attributes are more relevant than textual description features. Sharmin et al. [ 65 ] also investigate the significance of features; however, they only compare two fields (text description and text conclusion).

Another perspective on the relevance and significance of features/attributes is offered by Sabor et al. [ 66 ]. The authors of this article proposed using stack traces in addition to the attributes discussed by Alenezi and Banitaan [ 64 ]. More recently, a few works have explored new ways of supplementing datasets with social-media information, for example the work of Zhang et al. [ 67 ].

In light of the evidence reviewed above, we believe that the wide range of techniques and opinions developed in the literature thus far makes the task of identifying optimal qualities considerably challenging. Hence, we fear we are not in a position to make any specific recommendation with respect to this subtopic.

Issue prioritization

An issue (see https://docs.github.com/en/issues ) is an object that describes the work and the prerequisites for completing it. Any member of the open-source community can create an issue in order to enhance any given product. Issues in software development are typically found in platforms like GitHub, GitLab, or Bitbucket. Because of the novelty of these platforms, the subject has received little attention from the research community.

The main approach for dealing with issue prioritization has been that of predicting the lifetime of the issue itself. This approach was initially discussed by Murgia et al. [ 53 ]. The authors of this paper also investigate the impact of different types of developers’ activities (such as maintenance type, adding a new feature, refactoring, etc) on issue resolution time (or issue lifetime). These activities are often represented with labels. The results of this study show that fixing defects and implementing/improving new features is more effective and typically less time-consuming than other activities (such as testing or documenting).

The idea of using labels to represent activities also inspired other authors, such as Kikas et al. [ 68 ]. Subsequent work by Kallis et al. [ 69 ] confirmed the potential of this research direction and analyzed the relationship between static/dynamic features and issues’ lifetime.

Static features are those that remain consistent over time (for example, the number of issues created by the issue submitter in the three months before opening the issue). Dynamic features, on the contrary, are those that change depending on when an observation is made (for example, if we look at the number of comments on an issue, we can see how it changes over time).

Another work that attempts to resolve issues prioritization by utilizing the concept of issue-lifetime prediction is the work of Dhasade et al. [ 70 ]. The authors of this work continued to use both static and dynamic attributes. They also expanded the previously developed approach by including in the model (and subsequently testing within it) various other hyperparameters (such as time and hotness). The changes implemented in the model by these researchers made the model more flexible and therefore capable of being adjusted to the needs of different teams.
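A minimal sketch of lifetime-based prioritization, assuming a hand-crafted mix of static and dynamic features, is given below. The feature names, values, and the choice of a random forest regressor are illustrative and do not reproduce the models discussed above.

```python
from sklearn.ensemble import RandomForestRegressor

# Hypothetical issues, each described by static features (fixed at creation time)
# and dynamic features (observed some time after creation).
# Columns: [submitter_prior_issues, title_length, comments_so_far, participants_so_far]
X = [
    [12, 45, 0, 1],
    [3, 80, 5, 3],
    [25, 30, 2, 2],
    [1, 60, 9, 6],
]
y = [4.0, 30.0, 7.0, 90.0]  # observed issue lifetime in days

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# A shorter predicted lifetime can be read as a hint to prioritize the issue sooner.
print(model.predict([[5, 50, 3, 2]]))
```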

As previously stated, the majority of articles predict priority based on the expected issue lifetime. A slightly different strategy is demonstrated by Kallis et al. [ 69 ], where labels are used to assist developers in the organization of their work (hence prioritizing their tasks). The method developed in this study can correctly and reliably predict one of three labels: bug, enhancement, or question.

PR prioritization

We cannot fully understand task prioritization if we do not discuss the third category that falls within it, namely PR prioritization. Research on PR prioritization may be divided into two sub-topics, as shown in Fig 13 : integrator-oriented research and contributor-oriented research. An integrator is someone who is in charge of reviewing PRs, whereas a contributor is someone who creates PRs.

In the former category (integrator-oriented research), we may include the works of [ 54 , 71 – 73 ]. Van der Veen [ 71 ] offered a tool for prioritizing PRs based on static and dynamic attributes. This type of approach is quite similar to the work of Dhasade et al. [ 70 ]. In fact, Dhasade et al. [ 70 ] were inspired by this work and used it, as we have seen above, as a conceptual palette for their investigation.

A study by Yu et al. [ 54 ] proposes another approach to improving PR prioritization. The approach revolves around the idea of recommending appropriate reviewers for PRs. A description of the PR and a comment network are two of the most crucial features used in this model. The comment network is a graph constructed on the basis of the developers’ shared interests. Results from this study show that the method achieves 71% precision in predicting the appropriate reviewer.
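To make the comment-network idea concrete, the sketch below builds a small hypothetical graph in which edge weights count how often two developers commented on the same pull requests, and ranks a submitter's neighbours as candidate reviewers. This is a simplification of the concept, not the authors' algorithm; the developer names and weights are invented.

```python
import networkx as nx

# Hypothetical comment network: an edge weight counts how often two developers
# commented on the same pull requests (a proxy for shared interests).
G = nx.Graph()
G.add_weighted_edges_from([
    ("alice", "bob", 5),
    ("alice", "carol", 2),
    ("bob", "dave", 4),
    ("carol", "dave", 1),
])

def recommend_reviewers(graph, author, k=2):
    """Rank the author's neighbours by edge weight as candidate reviewers."""
    neighbours = graph[author]
    ranked = sorted(neighbours, key=lambda n: neighbours[n]["weight"], reverse=True)
    return ranked[:k]

print(recommend_reviewers(G, "alice"))  # e.g. ['bob', 'carol']
```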

Another method with similar goals is discussed by Yu et al. [ 72 ]. The researchers developed an approach intended to aid the prioritization of PRs by forecasting their latency (i.e., evaluation time). To make such a prediction, the researchers took into consideration numerous socio-technical parameters (such as project age, team size, and total CI). The findings demonstrated that the length of the dialogue (the number of comments under the PR) had a substantial impact on latency.

With respect to the latter category we introduced above (contributor-oriented research), we only found one relevant study by Azeem et al. [ 74 ]. In this study, the authors not only investigated the impact of each individual variable on the probability of a PR being merged, but they also formulated and developed a model capable of automatically estimating such a probability.

To obtain these results, the researchers used the XGBoost algorithm and over 50 different attributes. The mean average precision of their model for the first five recommended PRs was 95.3%, hovered at 89.6% for the first ten PRs, and eventually decreased to 79.6% for the first twenty PRs. The results show that the technique outperformed the baseline model presented by Gousios et al. [ 75 ] at all levels (for the first five, the first ten, and the first twenty PRs).
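A minimal sketch of this kind of merge-probability ranking, assuming a small set of invented PR features, is given below; the real model used more than 50 attributes, so this only illustrates the mechanics of training a gradient-boosted classifier and ranking open PRs by predicted probability.

```python
import numpy as np
from xgboost import XGBClassifier

# Hypothetical PR features:
# [files_changed, lines_added, author_prior_merged_prs, ci_passed]
X = np.array([
    [2, 40, 15, 1],
    [30, 2500, 0, 0],
    [5, 120, 7, 1],
    [18, 900, 1, 0],
])
y = np.array([1, 0, 1, 0])  # 1 = merged, 0 = not merged

model = XGBClassifier(n_estimators=50, max_depth=3)
model.fit(X, y)

# Rank two (hypothetical) open PRs by their estimated merge probability, highest first.
open_prs = np.array([[3, 60, 10, 1], [25, 2000, 0, 0]])
probs = model.predict_proba(open_prs)[:, 1]
for idx in np.argsort(-probs):
    print(f"PR #{idx}: estimated merge probability {probs[idx]:.2f}")
```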

RQ2. Which methods are used in automatic task ranking models and approaches and how is their effectiveness assessed?

As we noted in the previous section, giving an exact definition of a task can be quite challenging, at least in software development. In our research, we found that the same approaches are usually employed to solve prioritization tasks for each of these different senses of the term. In other words, the observations we made for one meaning of the term generally apply to the others. We speculate that the reason for this might be the presence of common fields or attributes shared across these different meanings.

On these grounds and in light of the data presented in Fig 9 , we can conclude that Naive Bayes is the most frequently used technique for solving the problem of predicting bug severity and priority.

In this context, it is essential to note that, despite the growing popularity of neural network approaches in other areas of computer science, we did not observe the same popularity in the studies we selected. We believe that the relative unpopularity of neural networks and deep neural networks [ 76 ] might be caused by the relatively small size of the datasets. Even though neural networks are very powerful tools, they require a lot of training data to perform well, which was not always available, for different reasons, in the papers we reviewed.

As shown in Fig 11 , the most used metrics for assessing the effectiveness of models are: f-score, precision, recall, and accuracy. This may indicate that the skewness of the datasets was relatively low [ 77 ]; however, closer scrutiny [ 60 , 64 , 78 ] reveals that this is not the case. Only nine studies employ the f-score as their sole assessment metric. We can infer that in most cases additional metrics such as precision and recall are used in combination with the f-score to provide a complete and fair characterization of the model’s quality.

Based on the results shown in Further Clustering we can make another important observation. Among the most popular metrics observed in our review, we did not notice any of those typically used for recommender systems [ 79 , 80 ]. Even though the question of task priority has a recommender nature, i.e., we want to know which tasks/features/bugs should be addressed next and how, recommender ML techniques are not popular in this area. We believe that one of the reasons for this is that recommender system approaches [ 81 ] have only recently captured researchers’ attention. This speculation is reinforced by the number of papers published under the search query “recommender system” for the period 2010–2022 (data gathered from Scopus, Fig 14 ). The development of recommender systems can be a profitable way to guide existing works, and their orientation can be used to gauge and support effective decision-making. A model trained in this fashion cares about the ordering of output variables; so, if we have multiple entities and need only a small subset of the best of them, the solutions offered by recommender systems are most probably where we should look.


The bars show the number of papers on the topic published between 2010 and 2022.

https://doi.org/10.1371/journal.pone.0283838.g014
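As an example of the kind of recommender-oriented metric we have in mind, the sketch below computes NDCG for a hypothetical ranking of five candidate tasks using scikit-learn. The relevance values and model scores are invented for illustration.

```python
import numpy as np
from sklearn.metrics import ndcg_score

# Hypothetical ground-truth relevance of five candidate tasks (higher = more urgent).
true_relevance = np.asarray([[3, 2, 3, 0, 1]])
# Scores produced by a hypothetical prioritization model for the same five tasks.
model_scores = np.asarray([[2.1, 0.4, 1.9, 0.2, 0.7]])

# NDCG rewards rankings that place the most relevant tasks near the top.
print(f"NDCG = {ndcg_score(true_relevance, model_scores):.3f}")
```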

RQ3. What are the most effective and versatile models for automatic task ranking developed so far?

Giving a proper answer to this research question proved challenging due to the high variability of the datasets and metrics used, and the answer may also depend on the output variables selected (e.g., the datasets). To obtain reliable results, we should therefore compare results obtained on the same dataset [ 82 ]. Eclipse was the dataset of our choice; we chose it because it is the most popular dataset, according to the findings presented in our Results section. We must also note that if the authors made a comparison and/or presented multiple models in one of their works, we decided to consider only one model, the one with the better performance and the highest results. The results of our comparisons are given in Tables 6 and 7 .


Priority levels are from Bugzilla [ 47 ]. The performance is described as precision/recall/f-score with the best results highlighted in bold. All data are shown as percentages.

https://doi.org/10.1371/journal.pone.0283838.t006


Severity levels are from Bugzilla [ 46 ]. The performance is described as precision/recall/f-score with the best results highlighted in bold. All data are shown as percentages.

https://doi.org/10.1371/journal.pone.0283838.t007

Table 6 shows a comparison between models with respect to their capacity for predicting bugs’ priority. We note that even though the number of papers using the same dataset is much higher, the comparison table has only five entries. That is because some papers either used different levels of priorities and metrics or provided only graphical information, so no accurate value for the metrics could be extracted.

As Table 6 shows, the work by Pushpalatha et al. [ 85 ] has the highest number of highlighted cells, which makes it the best-performing approach on the Eclipse dataset.

Table 7 shows a severity-wise comparison among all the approaches reviewed.

We note that the highest score is attributed to the approach proposed by [ 87 ], while the second-highest result is achieved by the model proposed by [ 90 ]. The latter paper, based on the Naive Bayes ML algorithm, is interesting because it formulated a method capable of ensuring the most accurate result for blocker bugs, one of the most severe types of bugs typically found. We also note that the work by [ 87 ] demonstrated better outcomes for critical bugs.

Critical review of our research question

The analysis we conducted above revealed a limited number of research articles related to task prioritization in software development, which prompted us to make several assumptions and partially expand the scope of our study. Since the word “task” can refer to multiple concepts, we decided to consider it in its most general meaning. This allowed us to gather more articles for our analysis. However, the fact that we gathered only a relatively limited number of scientific articles related to prioritization may indicate a significant gap in the research field. On the one hand, the presence of this gap may be taken as a sign of the relevance and novelty of this study for the research field. On the other hand, our findings raise several significant and pressing concerns. For example, why has so little research been conducted in this area? What are the pitfalls in task prioritization research? Why has there been little demand for such systems? While we do not have a clear answer to all these questions, we can nevertheless assert that this SLR highlighted the need for more research in this area (task prioritization in software development) while also forming a solid basis for future progress in the field.

Analysis of RQ1: What are the existing approaches for automatic task ranking in software development?

We can make several important observations about the results we obtained. Firstly, earlier work dealt mainly with the problem of “bug” prioritization, which, albeit useful, is neither exhaustive nor comprehensive, since we are interested in a broader understanding of task prioritization.

Secondly, only recently (especially in the last five years, as shown in Fig 12) have researchers begun to pay attention to “pull request” prioritization and “issue” prioritization, which substantially expanded the original research conducted on the prioritization of software development metrics. As we discussed earlier in this paper, this may well result in substantial growth of the literature in the near future, as was the case with “bug prioritization” in the past.

Analysis of RQ2: Which methods are used in automatic task ranking models and approaches and how is their effectiveness assessed?

The number of methods for task prioritization described in this SLR is rather limited. The most popular method we observed is Naive Bayes, which is notable because it provides the most accurate results for blocker bugs. We also analyzed the various metrics found in the papers we reviewed, the most popular of which are: a) f-score, b) precision, c) recall, and d) accuracy.
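As an illustration of the kind of pipeline several of the reviewed papers build around Naive Bayes, the following sketch classifies bug-report summaries into priority levels using a TF-IDF representation and a multinomial Naive Bayes model. The toy reports, labels, and hyperparameters are assumptions made for illustration; they do not reproduce the exact pipeline of any reviewed study.

```python
# Illustrative sketch (not any reviewed paper's exact pipeline):
# TF-IDF features + multinomial Naive Bayes to predict bug priority
# from report summaries. The toy data and labels are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

reports = [
    "crash on startup when workspace is missing",
    "typo in preferences dialog label",
    "memory leak while indexing large projects",
    "UI freezes when opening the search view",
]
priorities = ["P1", "P5", "P2", "P2"]  # hypothetical Bugzilla-style labels

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(reports, priorities)

print(model.predict(["editor crashes when file is deleted externally"]))
```

In practice, the reviewed studies train such classifiers on large collections of labeled reports from trackers such as Bugzilla and then evaluate them with the metrics listed above.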

Across the whole set of metrics found in the papers we reviewed for this SLR, CPU-cost-related metrics are the rarest, which means that the question of computational cost has not been a priority. This may signal a potential future research direction for task prioritization.
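Reporting computational cost alongside predictive quality requires little extra effort. The sketch below, which uses a synthetic dataset and a generic classifier purely as placeholders, records wall-clock and CPU time for training together with accuracy.

```python
# Sketch: recording wall-clock and CPU time alongside accuracy, so that
# computational cost can be reported next to predictive quality.
# The synthetic dataset and classifier choice are placeholders.
import time

from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
model = GaussianNB()

t_wall, t_cpu = time.perf_counter(), time.process_time()
model.fit(X, y)
train_wall = time.perf_counter() - t_wall
train_cpu = time.process_time() - t_cpu

accuracy = model.score(X, y)
print(f"accuracy={accuracy:.3f}, "
      f"training time: {train_wall:.3f}s wall / {train_cpu:.3f}s CPU")
```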

Based on the results presented above, we can argue that the absence of the metrics commonly used for recommender systems represents an interesting research gap in the field, as also shown in [ 93 ]. Our explanation for this gap lies in the observation that there are still very few studies on recommender systems in this context [ 81 ], presumably because such systems have only recently attracted researchers’ attention.
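Ranking metrics from the recommender-systems literature, such as precision at k, could be applied to prioritized task lists with little effort. The sketch below computes precision@k for a hypothetical ranking of task identifiers against a hypothetical set of tasks that were actually addressed first; both are placeholders, not data from the reviewed papers.

```python
# Sketch: precision@k for a ranked task list, a metric common in the
# recommender-systems literature but rare in the reviewed papers.
# Task identifiers and the "relevant" set are hypothetical.
def precision_at_k(ranked_tasks, relevant_tasks, k):
    """Fraction of the top-k ranked tasks that turn out to be relevant."""
    top_k = ranked_tasks[:k]
    return sum(task in relevant_tasks for task in top_k) / k

ranking = ["T42", "T7", "T13", "T99", "T5"]  # model's prioritized order
relevant = {"T7", "T99", "T3"}               # tasks actually addressed first

print(precision_at_k(ranking, relevant, k=3))  # 1 of the top 3 is relevant
```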

Analysis of RQ3: What are the most effective and versatile models for automatic task ranking developed so far?

As we have shown above, the quality of task prioritization in software development has improved over the years as new and more accurate estimation methods have been deployed [ 87 ]. However, when it comes to prioritizing “pull requests” and “issues,” there are many conflicting strategies and ideas about what to prioritize [ 50 ]. Unfortunately, we have to admit that there is no standardized or universally agreed-upon approach for prioritizing such items. Because of that, the only way to compare the currently available approaches is to fully reproduce them and evaluate them on the same dataset and the same set of metrics. This, however, is an extremely complicated task. Nevertheless, new work along this research direction may open up vistas of vital importance for further progress in the field.
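In its simplest form, such a reproduction-based comparison would evaluate all candidate models on identical cross-validation folds and with an identical metric, as sketched below. The synthetic dataset and the three example classifiers are stand-ins for a real, labeled task dataset and for the reviewed approaches.

```python
# Sketch: comparing several candidate models on the same data, the same
# folds, and the same metric. The synthetic dataset and classifiers are
# placeholders for a real labeled task dataset and the reviewed approaches.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, n_features=15, random_state=1)
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

candidates = {
    "naive_bayes": GaussianNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=1),
}

for name, clf in candidates.items():
    scores = cross_val_score(clf, X, y, cv=folds, scoring="f1_macro")
    print(f"{name}: mean f1_macro = {scores.mean():.3f}")
```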

A synoptic summary

A brief synoptic summary of our results with respect to each of the research questions tackled in this study follows:

  • RQ1 The number of articles that consider the problem of automatic task prioritization is still fairly small.
  • RQ2 Only a few of the reviewed articles assessed models in terms of time or CPU cost. There is also a notable lack of the metrics commonly used to analyze recommender systems.
  • RQ3 So far, there is no standardized approach or universal agreement on defining prioritization strategies for “pull requests” and “issues.” Presumably, this is because datasets are not marked up, meaning that there are no labels on data samples (see https://www.ibm.com/cloud/learn/data-labeling ).

Limitations, threats to validity, and review assessment

Limitations.

We start this subsection by briefly reviewing some factors that may have limited the objectivity of our review.

In this study, we used five different databases, namely Google Scholar, Microsoft Academic, ScienceDirect, IEEE Xplore, and ACM. A skeptical reader may point out that we should have used more databases (such as Springer Link, Web of Science, Scopus, ProQuest, or Academic OneFile). We are certainly aware of the existence of many more databases besides those we selected and could even agree that a larger set of databases might have broadened the scope of our work and increased the diversity of our searches. However, we picked the databases that are most commonly used by software engineers worldwide and that are known to aggregate the largest variety of papers. So, even though the list of databases used to perform our searches could have been extended, we are confident that our searches were scientifically sound.

One may also consider it problematic that we did not use any kind of grey literature in this work. Although there is a tendency to advocate for multivocal approaches (which include grey literature) in software engineering, we believe that, given the dubious nature of such literature, this practice should be limited to cases where there is a notable lack of secondary sources, which was not our case.

Finally, one may rightly claim that, by selecting only works written in English, we constrained the scope and breadth of this research. Cross-cultural issues are emerging as vitally important in ensuring universalism in science, and we agree on the importance of including multicultural perspectives and under-represented works in any study. However, most of the literature in the field is in English, and the leading journals accept submissions only in English. We therefore deem that the language requirement we adopted in this SLR is standard for the field and relatively unproblematic. Nevertheless, we note that our team of researchers is culturally very diverse, as it includes people from four different continents.

Threats to validity

In this subsection, we discuss a series of biases that might have affected the development and production of our review.

  • Bias towards Primary Sources —SLRs are usually performed on secondary sources to maximize objectivity. In this work, we relied exclusively on secondary sources; hence, we avoided this potential bias.
  • Selection Bias —A major risk involved in any SLR is what we may call “selective outcome reporting” or “selection bias.” This typically occurs when authors present only a selection of outcomes and/or results based on their statistical significance. We note that our reading log consists only of peer-reviewed, high-quality papers that, as noted above, were published by world-leading publishers (such as Springer, Elsevier, ACM, and IEEE). In addition, we selected papers (methodologies, datasets, and metrics) from several journals as well as from reputable conference proceedings, which ensured a variety of levels of analysis and experimental protocols.
  • Bias in Synthesis —To avoid this type of bias, which can be considered an extension of the above-mentioned selection bias, we carefully assessed our methodological protocol and, by extension, our findings. All the researchers involved in this study actively and consistently monitored each other’s activity to maximize objectivity and minimize mistakes.

Review assessment

Finally, we want to reflect on the overall quality of our work. To do so, we formulated, inspired by Kitchenham [ 94 ], a set of questions, which we critically applied to our results and findings. These questions, together with our answers, follow below.

  • Are the inclusion / exclusion criteria objective and reasonable? Following the best norms in our discipline, we formulated, before conducting our searches, a set of inclusion and exclusion criteria, which we subsequently applied to finalize the reading log. The criteria we formulated are congruent with those generally used in the field and are clearly relevant to the topic of our work.
  • Has there been a quality review? We developed a metric to assess the papers’ quality (Quality Assessment) and showed that the quality of the papers we included in the log was relatively high. Thus, we can confidently assert that the results that informed our work were scientifically sound and academically grounded.
  • Were the basic data / studies adequately described? We built a comprehensive literature log consisting of all the relevant information we extracted from the papers we analyzed. This allowed us to process our data transparently and comprehensively, and it ensured the replicability of our findings, which is another key trait of any SLR.

This SLR investigated the problem of task prioritization in software development and focused on: a) identifying existing approaches for automatic task prioritization (RQ1), b) further investigating methods and metrics for task prioritization as developed in the literature (RQ2), and c) analyzing the effectiveness and reliability of these methods and metrics (RQ3).

Concerning RQ1, our results showed that earlier work mainly dealt with bug prioritization, while more recent work has expanded to consider the prioritization of pull requests and issues; we speculate that this may lead to substantial growth of the literature in the future. RQ2 revealed that the most popular method used for task prioritization is Naive Bayes, while the most popular metrics used (in descending order) are f-score, precision, recall, and accuracy; however, there is a lack of the metrics commonly used for recommender systems, which may indicate a potential direction for future research. RQ3 showed that the quality of task prioritization in software development has improved over time, but there is still a notable lack of standardized approaches for prioritizing pull requests and issues.

In light of these findings, we can assert that this SLR contributed to broadening the field of research on task prioritization in software development, while also providing a solid basis for future research. Our mid-term goal is to develop an empirical study based on the topic of this SLR, aimed at finding a practical way to implement the findings of this review. To this end, we shall consider whether it is possible to develop algorithms for predicting task prioritization in a project using ML methods. This may well lead to novel AI-based management strategies, which could improve people’s well-being at work as well as foster moral and social good.

Nevertheless, IT practitioners should be cognizant of the relatively scarce amount of research conducted on task prioritization to date. They should also be aware of the absence of established methods for prioritizing pull requests and issues. They should therefore use the results of this SLR as a springboard for further explorations aimed at the development of such methods and tools.

Supporting information

S1 Table. PRISMA 2020 checklist.

Template is taken from: www.prisma-statement.org/documents/PRISMA_2020_checklist.pdf .

https://doi.org/10.1371/journal.pone.0283838.s001

S2 Table. Quality scores assigned to the papers.

https://doi.org/10.1371/journal.pone.0283838.s002

  • 2. IEEE Standard for Software Project Management Plans;.
  • 3. About issues; 2023. Available from: https://docs.github.com/en/issues/tracking-your-work-with-issues/about-issues .
  • 4. Anvik J. Automating Bug Report Assignment. In: Proceedings of the 28th International Conference on Software Engineering. ICSE’06. New York, NY, USA: Association for Computing Machinery; 2006. p. 937–940.
  • 7. Anvik J, Hiew L, Murphy GC. Who Should Fix This Bug? In: Proceedings of the 28th International Conference on Software Engineering. ICSE’06. New York, NY, USA: Association for Computing Machinery; 2006. p. 361–370.
  • 8. Panjer LD. Predicting Eclipse Bug Lifetimes. In: Fourth International Workshop on Mining Software Repositories (MSR’07:ICSE Workshops 2007); 2007. p. 29–29.
  • 9. Wang X, Zhang L, Xie T, Anvik J, Sun J. An approach to detecting duplicate bug reports using natural language and execution information. In: Proceedings of the 13th international conference on Software engineering—ICSE '08. ACM Press; 2008.
  • 10. Menzies T, Marcus A. Automated severity assessment of software defect reports. In: 2008 IEEE International Conference on Software Maintenance; 2008. p. 346–355.
  • 11. Sharma M, Bedi P, Chaturvedi KK, Singh VB. Predicting the priority of a reported bug using machine learning techniques and cross project validation. In: 2012 12th International Conference on Intelligent Systems Design and Applications (ISDA). IEEE; 2012.
  • 12. Tian Y, Lo D, Sun C. DRONE: Predicting Priority of Reported Bugs by Multi-factor Analysis. In: 2013 IEEE International Conference on Software Maintenance. IEEE; 2013.
  • 13. Illes-Seifert T, Herrmann A, Geisser M, Hildenbrand T. The Challenges of Distributed Software Engineering and Requirements Engineering: Results of an Online Survey. In: Proceedings: 1st Global Requirements Engineering Workshop—Grew’07, in conjunction with the IEEE Conference on Global Software Engineering (ICGSE), Munich, Germany, 27th August 2007; 2007. p. 55–65.
  • 15. Demir K. A Survey on Challenges of Software Project Management. In: Software Engineering Research and Practice; 2009. p. 579–585.
  • 16. Gousios G, Storey MA, Bacchelli A. Work Practices and Challenges in Pull-Based Development: The Contributor’s Perspective. In: 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE); 2016. p. 285–296.
  • 17. Gousios G, Zaidman A, Storey MA, Deursen Av. Work Practices and Challenges in Pull-Based Development: The Integrator’s Perspective. In: 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering. vol. 1; 2015. p. 358–368.
  • 18. Barr ET, Bird C, Rigby PC, Hindle A, German DM, Devanbu P. Cohesive and Isolated Development with Branches. In: Fundamental Approaches to Software Engineering. Springer Berlin Heidelberg; 2012. p. 316–331.
  • 20. Wang H, Ma C, Zhou L. A Brief Review of Machine Learning and Its Application. In: 2009 International Conference on Information Engineering and Computer Science. IEEE; 2009.
  • 23. Somohano-Murrieta JCB, Ocharan-Hernandez JO, Sanchez-Garcia AJ, de los Angeles Arenas-Valdes M. Requirements Prioritization Techniques in the last decade: A Systematic Literature Review. In: 2020 8th International Conference in Software Engineering Research and Innovation (CONISOFT). IEEE; 2020.
  • 25. Sufian M, Khan Z, Rehman S, Butt WH. A Systematic Literature Review: Software Requirements Prioritization Techniques. In: 2018 International Conference on Frontiers of Information Technology (FIT). IEEE; 2018.
  • 27. Alfayez R, Alwehaibi W, Winn R, Venson E, Boehm B. A systematic literature review of technical debt prioritization. In: Proceedings of the 3rd International Conference on Technical Debt. ACM; 2020.
  • 28. Ijaz KB, Inayat I, Bukhsh FA. Non-functional Requirements Prioritization: A Systematic Literature Review. In: 2019 45th Euromicro Conference on Software Engineering and Advanced Applications (SEAA). IEEE; 2019.
  • 33. Fatima T, Azam F, Anwar MW, Rasheed Y. A Systematic Review on Software Project Scheduling and Task Assignment Approaches. In: Proceedings of the 2020 6th International Conference on Computing and Artificial Intelligence. ACM; 2020.
  • 39. Kitchenham BA, Charters S. Guidelines for performing Systematic Literature Reviews in Software Engineering. Keele University and Durham University Joint Report; 2007. Available from: https://www.elsevier.com/__data/promis_misc/525444systematicreviewsguide.pdf .
  • 45. Shoham Y, Perrault R, Brynjolfsson E, Clark J, Manyika J, Niebles JC, et al.. The AI Index 2018 Annual Report; 2018. Available from: https://hai.stanford.edu/sites/default/files/2020-10/AI_Index_2018_Annual_Report.pdf .
  • 46. Lauhakangas I, Raal R, Iversen J, Tryon R, Goncharuk L, Roczek D. QA/Bugzilla/Fields/Severity; 2021. Available from: https://wiki.documentfoundation.org/QA/Bugzilla/Fields/Severity .
  • 47. Kanat-Alexander M, Miller D, Humphries E. Bugzilla:Priority System; 2021. Available from: https://wiki.mozilla.org/Bugzilla:Priority_System .
  • 49. Yang G, Zhang T, Lee B. Towards Semi-automatic Bug Triage and Severity Prediction Based on Topic Model and Multi-feature of Bug Reports. In: 2014 IEEE 38th Annual Computer Software and Applications Conference. IEEE; 2014.
  • 50. Freedman D, Pisani R, Purves R. Statistics (international student edition). 4th edn. New York: WW Norton & Company; 2007.
  • 52. Jeong G, Kim S, Zimmermann T. Improving bug triage with bug tossing graphs. In: Proceedings of the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering. ACM; 2009.
  • 53. Murgia A, Concas G, Tonelli R, Ortu M, Demeyer S, Marchesi M. On the influence of maintenance activity types on the issue resolution time. In: Proceedings of the 10th International Conference on Predictive Models in Software Engineering. ACM; 2014.
  • 54. Yu Y, Wang H, Yin G, Ling CX. Reviewer Recommender of Pull-Requests in GitHub. In: 2014 IEEE International Conference on Software Maintenance and Evolution. IEEE; 2014.
  • 56. Roy NKS, Rossi B. Cost-Sensitive Strategies for Data Imbalance in Bug Severity Classification: Experimental Results. In: 2017 43rd Euromicro Conference on Software Engineering and Advanced Applications (SEAA). IEEE; 2017.
  • 58. Awad MA, ElNainay MY, Abougabal MS. Predicting bug severity using customized weighted majority voting algorithms. In: 2017 Japan-Africa Conference on Electronics, Communications and Computers (JAC-ECC). IEEE; 2017.
  • 61. Sharma M, Kumari M, Singh VB. Bug Priority Assessment in Cross-Project Context Using Entropy-Based Measure. In: Algorithms for Intelligent Systems. Springer Singapore; 2020. p. 113–128.
  • 64. Alenezi M, Banitaan S. Bug Reports Prioritization: Which Features and Classifier to Use? In: 2013 12th International Conference on Machine Learning and Applications. IEEE; 2013.
  • 65. Sharmin S, Aktar F, Ali AA, Khan MAH, Shoyaib M. BFSp: A feature selection method for bug severity classification. In: 2017 IEEE Region 10 Humanitarian Technology Conference (R10-HTC). IEEE; 2017.
  • 67. Zhang Y, Yin G, Wang T, Yu Y, Wang H. Evaluating Bug Severity Using Crowd-based Knowledge. In: Proceedings of the 7th Asia-Pacific Symposium on Internetware. ACM; 2015.
  • 68. Kikas R, Dumas M, Pfahl D. Using dynamic and contextual features to predict issue lifetime in GitHub projects. In: 2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR). IEEE; 2016. p. 291–302.
  • 70. Dhasade AB, Venigalla ASM, Chimalakonda S. Towards Prioritizing GitHub Issues. In: Proceedings of the 13th Innovations in Software Engineering Conference on Formerly known as India Software Engineering Conference. ACM; 2020.
  • 71. van der Veen E, Gousios G, Zaidman A. Automatically Prioritizing Pull Requests. In: 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories. IEEE; 2015.
  • 72. Yu Y, Wang H, Filkov V, Devanbu P, Vasilescu B. Wait for It: Determinants of Pull Request Evaluation Latency on GitHub. In: 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories. IEEE; 2015.
  • 73. Yu S, Xu L, Zhang Y, Wu J, Liao Z, Li Y. NBSL: A Supervised Classification Model of Pull Request in Github. In: 2018 IEEE International Conference on Communications (ICC). IEEE; 2018.
  • 74. Azeem MI, Peng Q, Wang Q. Pull Request Prioritization Algorithm based on Acceptance and Response Probability. In: 2020 IEEE 20th International Conference on Software Quality, Reliability and Security (QRS). IEEE; 2020.
  • 75. Gousios G, Pinzger M, van Deursen A. An exploratory study of the pull-based software development model. In: Proceedings of the 36th International Conference on Software Engineering. ACM; 2014.
  • 77. Jeni LA, Cohn JF, Torre FDL. Facing Imbalanced Data–Recommendations for the Use of Performance Metrics. In: 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction. IEEE; 2013.
  • 78. Zhang W, Challis C. Automatic Bug Priority Prediction Using DNN Based Regression. In: Advances in Natural Computation, Fuzzy Systems and Knowledge Discovery. Springer International Publishing; 2019. p. 333–340.
  • 81. Ricci F, Rokach L, Shapira B. Introduction to Recommender Systems Handbook. In: Recommender Systems Handbook. Springer US; 2010. p. 1–35.
  • 82. Lamkanfi A, Perez J, Demeyer S. The Eclipse and Mozilla defect tracking dataset: A genuine dataset for mining bug information. In: 2013 10th Working Conference on Mining Software Repositories (MSR). IEEE; 2013.
  • 87. Zhang T, Yang G, Lee B, Chan ATS. Predicting severity of bug report by mining bug repository with concept profile. In: Proceedings of the 30th Annual ACM Symposium on Applied Computing. ACM; 2015.
  • 88. Tian Y, Lo D, Sun C. Information Retrieval Based Nearest Neighbor Classification for Fine-Grained Bug Severity Prediction. In: 2012 19th Working Conference on Reverse Engineering. IEEE; 2012.
  • 93. Happel HJ, Maalej W. Potentials and challenges of recommendation systems for software development. In: Proceedings of the 2008 international workshop on Recommendation systems for software engineering. ACM; 2008.

Academia.edu no longer supports Internet Explorer.

To browse Academia.edu and the wider internet faster and more securely, please take a few seconds to  upgrade your browser .

Enter the email address you signed up with and we'll email you a reset link.

  • We're Hiring!
  • Help Center

paper cover thumbnail

Software Development Project Management: A Literature Review

Profile image of Kevin Adams

Related Papers

International Journal of Information Technology Project Management (IJITPM)

Iatrellis Omiros

This article aims to provide the reader with a comprehensive background for understanding current knowledge and research works on ontologies for software project management (SPM). It constitutes a systematic literature review behind key objectives of the potential adoption of ontologies in PM. Ontology development and engineering could facilitate substantially the software development process and improve knowledge management, software and artifacts reusability, internal consistency within project management processes of various phases of software life cycle. The authors examined the literature focusing on software project management ontologies and analyzed the findings of these published papers and categorized them accordingly. They used qualitative methods to evaluate and interpret findings of the collected studies. The literature review, among others, has highlighted lack of standardization in …

literature review for software development project

Panos Fitsilis , Vassilis C Gerogiannis , LEONIDAS ANTHOPOULOS

Software Project Management is a knowledge intensive process that can benefit substantially from ontology development and ontology engineering. Ontology development could facilitate or improve substantially the software development process through the improvement of knowledge management, the increase of software and artefacts reusability, and the establishment of internal consistency within project management processes of various phases of software life cycle. A large number of ontologies have been developed attempting to address various software engineering aspects, such as requirements engineering, components reuse, domain modelling, etc. In this paper, we present a systematic literature review focusing on software project management ontologies. The literature review, among other, has identified lack of standardization in terminology and concepts, lack of systematic domain modelling and use of ontologies mainly in prototype ontology systems that address rather limited aspects of software project management processes

Louise Reid

This paper describes research into the development of a quality plan for the management of software in an Irish Hospital. It studies relevant standards, models and legal acts. Synergies between the Irish Health Service Executive's Quality and Risk Management Standard and the Capability Maturity Model Integration are utilised to build and study a quality plan. While exploring the possibility of utilising software engineering quality standards to improve the quality standards within health care, this has also led to a greater understanding of the ...

Danne Silva Oliveira

Paul Bannerman

Mark Harman

The Importance of Metrics in Search Based Software Engineering Mark Harman King's College London, Strand, London, WC2R 2LS Abstract. This paper was written to accompany the author's keynote talk at the Mensura Conference 2006. The keynote will present an overview of Search Based Software Engineering (SBSE) and explain how metrics play a crucial role in SBSE. 1 Search Based Software Engineering (SBSE) The aim of SBSE research is to move software engineering problems from human based search to machine-based search [15, 27].

juan felipe

Patrick Arana

Alain Abran

Software maintenance function suffers from a scarcity of management models that would facilitate its evaluation, management and continuous improvement. This paper is part of a series of papers that presents a software maintenance capability maturity model (SMCMM). The contributions of this specific paper are: 1) to describe the key references of software maintenance; 2) to present the model update process conducted during 2003; and 3) to present, for the first time, the updated architecture of the model.

RELATED PAPERS

Maurice Frayssinet

33rd Annual Frontiers in Education, 2003. FIE 2003.

Monica Villavicencio

Journal of Software Maintenance and Evolution: Research and Practice

International Journal of Software Engineering and Knowledge Engineering

Kashif Kamran

International Journal of Information Technology Project Management

Clifford Maurer

Proceedings of ONTOSE

daniel rodriguez

Journal of Systems and Software

Ivan Garcia

ahmed I Mohamed

Journal of Software Engineering and Applications

Daniel Rodriguez

Omar Badreddin

INCOSE International Symposium

Alice Squires , Joseph Ekstrom

Sandra H Cleland

talha javeed

andres camilo

Syeda Uzma Gardazi

Ricardo Cespedes

16th International Conference on Evaluation & Assessment in Software Engineering (EASE 2012)

J. Verner , Mahmood Niazi

Information Systems Journal

Rudy Hirschheim , Juhani Iivari

Evan Duggan

Information and Software Technology

Paul Clarke

Charles Betz

SOFTWARE ENGINEERING: METHODS, …

Marcel Jacques Simonette , Edison Spina

2013 20th Asia-Pacific Software Engineering Conference (APSEC)

Sofia Ouhbi , José Luis Fernández Alemán

Felix Garcia

Ricardo Ramos

Empirical Software Engineering

Claes Wohlin

Miguel-Angel Sicilia

IEEE Transactions on Software Engineering

Bayu Permadi

Alfredo Armijos

Tony Gorschek , Chow Cheng

Australasian Computing Education Conference

John Hamer , Andrew Luxton-reilly

ronin monfaredi

International Journal of Information Technologies and Systems Approach

Manuel Mora

Jean-Christophe Deprez , Claude Laporte , Kenneth Crowder

  •   We're Hiring!
  •   Help Center
  • Find new research papers in:
  • Health Sciences
  • Earth Sciences
  • Cognitive Science
  • Mathematics
  • Computer Science
  • Academia ©2024

Software Practices For Agile Developers: A Systematic Literature Review

Ieee account.

  • Change Username/Password
  • Update Address

Purchase Details

  • Payment Options
  • Order History
  • View Purchased Documents

Profile Information

  • Communications Preferences
  • Profession and Education
  • Technical Interests
  • US & Canada: +1 800 678 4333
  • Worldwide: +1 732 981 0060
  • Contact & Support
  • About IEEE Xplore
  • Accessibility
  • Terms of Use
  • Nondiscrimination Policy
  • Privacy & Opting Out of Cookies

A not-for-profit organization, IEEE is the world's largest technical professional organization dedicated to advancing technology for the benefit of humanity. © Copyright 2024 IEEE - All rights reserved. Use of this web site signifies your agreement to the terms and conditions.

AIP Publishing Logo

  • Previous Article
  • Next Article

Lean practices in software development projects: A literature review

Email: [email protected]

  • Article contents
  • Figures & tables
  • Supplementary Data
  • Peer Review
  • Reprints and Permissions
  • Cite Icon Cite
  • Search Site

RamKaran Yadav , M. L. Mittal , Rakesh Jain; Lean practices in software development projects: A literature review. AIP Conf. Proc. 3 September 2019; 2148 (1): 030044. https://doi.org/10.1063/1.5123966

Download citation file:

  • Ris (Zotero)
  • Reference Manager

Rapidly increasing customer demands, competition, continuous changing scenario and accelerating pace of technological developments have put tremendous pressure on the business organization to deliver quality products at lower cost. On the same lines the software development (SD) companies need to deliver quality codes with new features at reduced cost. This can be achieved through Lean to software development projects. As the lean has been considered in different ways and has been implemented to varying extent in different sectors of the economy this paper aims to investigate as to how “lean” is viewed in software development projects and status of implementation in software development projects. First, application of lean in different types of projects viz. construction, healthcare, aerospace, new product development and service is discussed. Secondly, application of lean to SD projects is investigated at three levels: philosophy, principles (value, value stream, flow, pull and perfection) and practices/tools. The effect of lean on performance (inventory, lead time, customer satisfaction, cost, and business value) of SD projects is also analyzed. Further, “Leagile” software development and agile dominance is explored through this paper.

Sign in via your Institution

Citing articles via, publish with us - request a quote.

literature review for software development project

Sign up for alerts

  • Online ISSN 1551-7616
  • Print ISSN 0094-243X
  • For Researchers
  • For Librarians
  • For Advertisers
  • Our Publishing Partners  
  • Physics Today
  • Conference Proceedings
  • Special Topics

pubs.aip.org

  • Privacy Policy
  • Terms of Use

Connect with AIP Publishing

This feature is available to subscribers only.

Sign In or Create an Account

Risk factors in software development projects: a systematic literature review

  • Published: 07 November 2018
  • Volume 27 , pages 1149–1174, ( 2019 )

Cite this article

literature review for software development project

  • Júlio Menezes Jr   ORCID: orcid.org/0000-0002-2460-1148 1 ,
  • Cristine Gusmão 2 &
  • Hermano Moura 1  

3050 Accesses

26 Citations

Explore all metrics

Risks are an inherent part of any software project. The presence of risks in environments of software development projects requires the perception so that the associated factors do not lead projects to failure. The correct identification and monitoring of these factors can be decisive for the success of software development projects and software quality. However, in practice, risk management in software development projects is still often neglected and one of the reasons is due to the lack of knowledge of risk factors that promoted a low perception of them in the environment. This paper aims to identify and to map risk factors in environments of software development projects. We conducted a systematic literature review through a database search, as well as we performed an assessment of quality of the selected studies. All this process was conducted through a research protocol. We identified 41 studies. In these works, we extracted and classified risk factors according to the software development taxonomy developed by Software Engineering Institute (SEI). In total, 148 different risk factors were categorized. The found evidences suggest that risk factors relating to software requirements are the most recurrent and cited. In addition, we highlight that the most mentioned risk factors were the lack of technical skills by the staff. Therefore, the results converged to the need for more studies on these factors as fundamental items for reduction of failure level of a software development project.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price includes VAT (Russian Federation)

Instant access to the full article PDF.

Rent this article via DeepDyve

Institutional subscriptions

literature review for software development project

Similar content being viewed by others

literature review for software development project

Effective Risk Management of Software Projects (ERM): An Exploratory Literature Review of IEEE and Scopus Online Databases

literature review for software development project

Risk Management During Software Development: Results of a Survey in Software Houses from Germany, Austria and Switzerland

literature review for software development project

Systematic Literature Review of the Risk Management Process Literature for the Public Sector

Alam, A. U., Khan, S. U., & Ali, I. (2012). Knowledge sharing management risks in outsourcing from various continents perspective: a systematic literature review. International Journal of Digital Content Technology and its Applications, 6 (21), 27–33.

Article   Google Scholar  

Bannerman, P. L. (2015). A reassessment of risk management in software projects. In: Handbook on Project Management and scheduling , vol. 2 (pp. 1119–1134). Springer International Publishing.

Biolchini, J., Mian, P. G., Natali, A. C. C., & Travassos, G. H. (2005). Systematic review in software engineering. In: System engineering and computer science department COPPE/UFRJ, Technical Report ES , vol 679(05) (p. 45).

Boehm, B. W. (1989). Software risk management . Piscataway: Software risk management.

Book   Google Scholar  

Boehm, B. W. (1991). Software risk management: principles and practices. IEEE Software, 8 (1), 32–41. https://doi.org/10.1109/52.62930 .

Brasiliano, A. (2009). Método Brasiliano avançado – Gestão e análise de risco corporativo . Sicurezza.

Carr, M. J., Konda, S. L., Monarch, I., Ulrich, F. C., & Walker, C. F. (1993). Taxonomy-based risk identification (No. CMU/SEI-93-TR-06) . Carnegie-Mellon Univ Pittsburgh Pa Software Engineering Inst.

Charette, R. N. (1989). Software engineering risk analysis and management . New York: Intertext Publications.

Google Scholar  

Charette, R. N. (2005). Why software fails. IEEE Spectrum, 42 (9), 42–49.

De Bakker, K., Boonstra, A., & Wortmann, H. (2010). Does risk management contribute to IT project success? A meta-analysis of empirical evidence. International Journal of Project Management, 28 (5), 493–503.

De Marco, T. (1997). The deadline: a novel about project management . Dorset House.

DoD, U. S. (2006). Risk management guide for DoD acquisition . USA: Department of Defense.

Dorofee, A. J., Walker, J. A., Alberts, C. J., Higuera, R. P., & Murphy, R. L. (1996). Continuous risk management guidebook . Carnegie-Mellon Univ, Pittsburgh.

Fairley, R. (1994). Risk management for software projects. IEEE Software, 11 (3), 57–67.

Fan, C. F., & Yu, Y. C. (2004). BBN-based software project risk management. Journal of Systems and Software, 73 (2), 193–203.

Article   MathSciNet   Google Scholar  

Fu, Y., Li, M., & Chen, F. (2012). Impact propagation and risk assessment of requirement changes for software development projects based on design structure matrix. International Journal of Project Management, 30 (3), 363–373.

Gerrard, P., & Thompson, N. (2002). Risk-based E-business testing . Artech House.

Goguen, A., Stoneburner, G., & Feringa, A. (2002). Risk management guide for information technology systems and underlying technical models for information technology security .

Google Scholar citations. (2017). https://scholar.google.com/intl/en/scholar/citations.html . Accessed May 2017.

Hall, E. M. (1998). Managing risk: methods for software systems development . Pearson Education.

Han, W. M., & Huang, S. J. (2007). An empirical analysis of risk components and performance on software projects. Journal of Systems and Software, 80 (1), 42–50.

Heldman, K. (2010). Project manager’s spotlight on risk management . John Wiley & Sons.

Higgins, J. P., & Green, S. (Eds.). (2011). Cochrane handbook for systematic reviews of interventions . http://handbook.cochrane.org/chapter_6/6_4_4_sensitivity_versus_precision.htm . Accessed May 2017.

Hillson, D. (2002). The Risk Breakdown Structure (RBS) as an aid to effective risk management. In: 5th European Project Management conference . Cannes, France (pp. 1–11).

Ivarsson, M., & Gorschek, T. (2011). A method for evaluating rigor and industrial relevance of technology evaluations. Empirical Software Engineering, 16 (3), 365–395.

Jiang, J., & Klein, G. (2000). Software development risks to project effectiveness. The Journal of Systems and Software, 52 (1), 3–10.

Jiang, J., Klein, G., & Discenza, R. (2001). Information systems success as impacted by risks and development strategies. IEEE Transactions on Engineering Management, 48 (1), 46–55.

Jorgensen, M. (1999). Software quality measurement. Advances in Engineering Software, 30 (12), 907–912.

Kerzner, H. (2017). Project management: a systems approach to planning, scheduling, and controlling . Hoboken: John Wiley & Sons.

Khan, A. A., Basri, S., & Dominic, P. D. D. (2014). Communication risks in GSD during RCM: results from SLR. In: Computer and Information Sciences (ICCOINS), 2014 International Conference on (pp. 1–6). IEEE.

Kitchenham, B & Charters, S., 2007 . Guidelines for performing systematic literature reviews in software engineering . Technical report. EBSE.

Kontio, J. (2001). Software engineering risk management: a method, improvement framework, and empirical evaluation . Helsinki University of Technology.

López, C., & Salmeron, J. L. (2012). Risks response strategies for supporting practitioners decision-making in software projects. Procedia Technology, 5 , 437–444.

March, J. G., & Shapira, Z. (1987). Managerial perspectives on risk and risk taking. Management Science, 33 (11), 1404–1418.

Munir, H., Wnuk, K., & Runeson, P. (2016). Open innovation in software engineering: a systematic mapping study. Empirical Software Engineering, 21 (2), 684–723.

Neves, S. M., da Silva, C. E. S., Salomon, V. A. P., da Silva, A. F., & Sotomonte, B. E. P. (2014). Risk management in software projects through knowledge management techniques: cases in Brazilian incubated technology-based firms. International Journal of Project Management, 32 (1), 125–138.

Nurdiani, I., Jabangwe, R., Šmite, D., & Damian, D. (2011). Risk identification and risk mitigation instruments for global software development: systematic review and survey results. In: Global Software Engineering Workshop (ICGSEW), 2011 Sixth IEEE International Conference on (pp. 36–41). IEEE.

Oliveira, K. A., Gusmão, C. M., & de Barros Carvalho Filho, E. C. (2012). Mapeamento de Riscos em Projetos de Desenvolvimento Distribuído de Software. In: CONTECSI-international conference on information systems and technology management (vol. 9, no. 1, pp. 3837–3866).

Pa, N. C., & Jnr, B. A. (2015). A review on decision making of risk mitigation for software management. Journal of Theoretical & Applied Information Technology, 76 (3).

Pfleeger, S. L., Hatton, L., & Howell, C. C. (2001). Solid software . Prentice Hall PTR.

Pressman, R. S. (2005). Software engineering: a practitioner’s approach . Palgrave Macmillan.

Qinghua, P. (2009). A model of risk assessment of software project based on grey theory. In: Computer Science & Education, 2009. ICCSE'09. 4th International Conference on (pp. 538–541). IEEE.

Raz, T., Shenhar, A. J., & Dvir, D. (2002). Risk management, project success, and technological uncertainty. R&D Management, 32 (2), 101–109.

Reeves, J. D., Eveleigh, T., Holzer, T. H., & Sarkani, S. (2013). Identification biases and their impact to space system development project performance. Engineering Management Journal, 25 (2), 3–12.

Ren, F. (2016) Understanding Pareto’s principle - the 80-20 rule . https://www.thebalance.com/pareto-s-principle-the-80-20-rule-2275148 . Accessed May 2017.

Salmeron, J. L., & Lopez, C. (2012). Forecasting risk impact on ERP maintenance with augmented fuzzy cognitive maps. IEEE Transactions on Software Engineering, 38 (2), 439–452.

Sarigiannidis, L., & Chatzoglou, P. D. (2014). Quality vs risk: an investigation of their relationship in software development projects. International Journal of Project Management, 32 (6), 1073–1082.

Savolainen, P., Ahonen, J. J., & Richardson, I. (2012). Software development project success and failure from the supplier’s perspective: a systematic literature review. International Journal of Project Management, 30 (4), 458–469.

Silva, S. (2011). Proposta de tratamento de fatores de riscos em desenvolvimento de software para uma organização no setor público . Federal University of Permambuco.

SJR. (2017). Scimago Journal & Country Rank (SJR) . http://www.scimagojr.com/aboutus.php . Accessed May 2017.

Subramanian, G. H., Jiang, J. J., & Klein, G. (2007). Software quality and IS project performance improvements from software development process maturity and IS implementation strategies. Journal of Systems and Software, 80 (4), 616–627.

Tang, A. G., & Wang, R. L. (2010, June). Software project risk assessment model based on fuzzy theory. In: Computer and Communication Technologies in Agriculture Engineering (CCTAE), 2010 International Conference On (vol. 2, pp. 328–330). IEEE.

Trigo, T. R., Gusmão, C., & Lins, A. (2008). CBR risk – risk identification method using case based reasoning. In: International Conference on Information Systems and Technology Management (vol. 5, No. 2008).

Van Loon, H. (2007). A management methodology to reduce risk and improve quality. IT Professional, 9 (6), 30–35.

Vasconcellos, F. J., Landre, G. B., Cunha, J. A. O., Oliveira, J. L., Ferreira, R. A., & Vincenzi, A. M. (2017). Approaches to strategic alignment of software process improvement: a systematic literature review. Journal of Systems and Software, 123 , 45–63.

Wallace, L., & Keil, M. (2004). Software project risks and their effect on outcomes. Communications of the ACM, 47 (4), 68–73.

Wallace, L., Keil, M., & Rai, A. (2004a). Understanding software project risk: a cluster analysis. Information Management, 42 (1), 115–125.

Wallace, L., Keil, M., & Rai, A. (2004b). How software project risk affects project performance: an investigation of the dimensions of risk and an exploratory model. Decision Sciences, 35 (2), 289–321.

Wysocki, R. K. (2011). Effective project management: traditional, agile, extreme . John Wiley & Sons.

Zhang, H., Babar, M. A., & Tell, P. (2011). Identifying relevant studies in software engineering. Information and Software Technology, 53 (6), 625–637.

Download references

Acknowledgements

The authors would like to thank the Brazilian Ministry of Health for the support given to this work.

Author information

Authors and affiliations.

Center of Informatics (CIn), Federal University of Pernambuco (UFPE), Recife, PE, Brazil

Júlio Menezes Jr & Hermano Moura

Department of Biomedical Engineering (DEBM) – Center of Technology and Geosciences (CTG), Federal University of Pernambuco (UFPE), Recife, PE, Brazil

Cristine Gusmão

You can also search for this author in PubMed   Google Scholar

Corresponding author

Correspondence to Júlio Menezes Jr .

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Menezes, J., Gusmão, C. & Moura, H. Risk factors in software development projects: a systematic literature review. Software Qual J 27 , 1149–1174 (2019). https://doi.org/10.1007/s11219-018-9427-5

Download citation

Published : 07 November 2018

Issue Date : September 2019

DOI : https://doi.org/10.1007/s11219-018-9427-5

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

  • Software risk management
  • Risk factors
  • Project management
  • Systematic literature review
  • Find a journal
  • Publish with us
  • Track your research
  • Reference Manager
  • Simple TEXT file

People also looked at

Original research article, a systematic literature review on the impact of ai models on the security of code generation.

literature review for software development project

  • 1 Security and Trust, University of Luxembourg, Luxembourg, Luxembourg
  • 2 École Normale Supérieure, Paris, France
  • 3 Faculty of Humanities, Education, and Social Sciences, University of Luxembourg, Luxembourg, Luxembourg

Introduction: Artificial Intelligence (AI) is increasingly used as a helper to develop computing programs. While it can boost software development and improve coding proficiency, this practice offers no guarantee of security. On the contrary, recent research shows that some AI models produce software with vulnerabilities. This situation leads to the question: How serious and widespread are the security flaws in code generated using AI models?

Methods: Through a systematic literature review, this work reviews the state of the art on how AI models impact software security. It systematizes the knowledge about the risks of using AI in coding security-critical software.

Results: It reviews what security flaws of well-known vulnerabilities (e.g., the MITRE CWE Top 25 Most Dangerous Software Weaknesses) are commonly hidden in AI-generated code. It also reviews works that discuss how vulnerabilities in AI-generated code can be exploited to compromise security and lists the attempts to improve the security of such AI-generated code.

Discussion: Overall, this work provides a comprehensive and systematic overview of the impact of AI in secure coding. This topic has sparked interest and concern within the software security engineering community. It highlights the importance of setting up security measures and processes, such as code verification, and that such practices could be customized for AI-aided code production.

1 Introduction

Despite initial concerns, increasingly, many organizations rely on artificial intelligence (AI) to enhance the operational workflows in their software development life cycle and to support writing software artifacts. One of the most well-known tools is GitHub Copilot. It is created by Microsoft relies on OpenAI's Codex model, and is trained on open-source code publicly available on GitHub ( Chen et al., 2021 ). Like many similar tools—such as CodeParrot, PolyCoder, StarCoder—Copilot is built atop a large language model (LLM) that has been trained on programming languages. Using LLMs for such tasks is an idea that dates back at least as far back as the public release of OpenAI's ChatGPT.

However, using automation and AI in software development is a double-edged sword. While it can improve code proficiency, the quality of AI-generated code is problematic. Some models introduce well-known vulnerabilities, such as those documented in MITRE's Common Weakness Enumeration (CWE) list of the top 25 “most dangerous software weaknesses.” Others generate so-called “stupid bugs,” naïve single-line mistakes that developers would qualify as “stupid” upon review ( Karampatsis and Sutton, 2020 ).

This behavior was identified early on and is supported to a varying degree by academic research. Pearce et al. (2022) concluded that 40% of the code suggested by Copilot had vulnerabilities. Yet research also shows that users trust AI-generator code more than their own ( Perry et al., 2023 ). These situations imply that new processes, mitigation strategies, and methodologies should be implemented to reduce or control the risks associated with the participation of generative AI in the software development life cycle.

It is, however, difficult to clearly attribute the blame, as the tooling landscape evolves, different training strategies and prompt engineering are used to alter LLMs behavior, and there is conflicting if anecdotal, evidence that human-generated code could be just as bad as AI-generated code.

This systematic literature review (SLR) aims to critically examine how the code generated by AI models impacts software and system security. Following the categorization of the research questions provided by Kitchenham and Charters (2007) on SLR questions, this work has a 2-fold objective: analyzing the impact and systematizing the knowledge produced so far. Our main question is:

“ How does the code generation from AI models impact the cybersecurity of the software process? ”

This paper discusses the risks and reviews the current state-of-the-art research on this still actively-researched question.

Our analysis shows specific trends and gaps in the literature. Overall, there is a high-level agreement that AI models do not produce safe code and do introduce vulnerabilities , despite mitigations. Particular vulnerabilities appear more frequently and prove to be more problematic than others ( Pearce et al., 2022 ; He and Vechev, 2023 ). Some domains (e.g., hardware design) seem more at risk than others, and there is clearly an imbalance in the efforts deployed to address these risks.

This work stresses the importance of relying on dedicated security measures in current software production processes to mitigate the risks introduced by AI-generated code and highlights the limitations of AI-based tools to perform this mitigation themselves.

The article is divided as follows: we first introduce the reader to AI models and code generation in Section 2 to proceed to explain our research method in Section 3. We then present our results in Section 4. In Section 5 we discuss the results, taking in consideration AI models, exploits, programming languages, mitigation strategies and future research. We close the paper by addressing threats to validity in Section 6 and conclusion in Section 7.

2 Background and previous work

2.1 ai models.

The sub-branch of AI models that is relevant to our discussion are generative models, especially large-language models (LLMs) that developed out of the attention-based transformer architecture ( Vaswani et al., 2017 ), made widely known and available through pre-trained models (such as OpenAI's GPT series and Codex, Google's PaLM, Meta's LLaMA, or Mistral's Mixtral).

In a transformer architecture, inputs (e.g., text) are converted to tokens 1 which are then mapped to an abstract latent space, a process known as encoding ( Vaswani et al., 2017 ). Mapping back from the latent space to tokens is accordingly called decoding , and the model's parameters are adjusted so that encoding and decoding work properly. This is achieved by feeding the model with human-generated input, from which it can learn latent space representations that match the input's distribution and identify correlations between tokens.

Pre-training amortizes the cost of training, which has become prohibitive for LLMs. It consists in determining a reasonable set of weights for the model, usually through autocompletion tasks, either autoregressive (ChatGPT) or masked (BERT) for natural language, during which the model is faced with an incomplete input and must correctly predict the missing parts or the next token. This training happens once, is based on public corpora, and results in an initial set of weights that serves as a baseline ( Tan et al., 2018 ). Most “open-source” models today follow this approach. 2

It is possible to fine-tune parameters to handle specific tasks from a pre-trained model, assuming they remain within a small perimeter of what the model was trained to do. This final training often requires human feedback and correction ( Tan et al., 2018 ).

The output of a decoder is not directly tokens, however, but a probability distribution over tokens. The temperature hyperparameter of LLMs controls how much the likelihood of less probable tokens is amplified: a high temperature would allow less probable tokens to be selected more often, resulting in a less predictable output. This is often combined with nucleus sampling ( Holtzman et al., 2020 ), i.e., requiring that the total sum of token probabilities is large enough and various penalty mechanisms to avoid repetition.

Finally, before being presented to the user, an output may undergo one or several rounds of (possibly non-LLM) filtering, including for instance the detection of foul language.

2.2 Code generation with AI models

With the rise of generative AI, there has also been a rise in the development of AI models for code generation. Multiple examples exist, such as Codex, Polycoder, CodeGen, CodeBERT, and StarCoder, to name a few (337, Xu, Li). These new tools should help developers of different domains be more efficient when writing code—or at least expected to ( Chen et al., 2021 ).

The use of LLMs for code generation is a domain-specific application of generative methods that greatly benefit from the narrower context. Contrary to natural language, programming languages follow a well-defined syntax using a reduced set of keywords, and multiple clues can be gathered (e.g., filenames, other parts of a code base) to help nudging the LLM in the right direction. Furthermore, so-called boilerplate code is not project-specific and can be readily reused across different code bases with minor adaptations, meaning that LLM-powered code assistants can already go a long way simply by providing commonly-used code snippets at the right time.

By design, LLMs generate code based on their training set ( Chen et al., 2021 ). 3 In doing so, there is a risk that sensitive, incorrect, or dangerous code is uncritically copied verbatim from the training set or that the “minor adaptations” necessary to transfer code from one project to another introduces mistakes ( Chen et al., 2021 ; Pearce et al., 2022 ; Niu et al., 2023 ). Therefore, generated code may include security issues, such as well-documented bugs, malpractices, or legacy issues found in the training data. A parallel issue often brought up is the copyright status of works produced by such tools, a still-open problem that is not the topic of this paper.

Similarly, other challenges and concerns have been highlighted by different academic research. From an educational point of view, some concerns are that using AI code generation models may impact acquiring bad security habits between novice programmers or students ( Becker et al., 2023 ). However, the usage of such models can also help lower the entry barrier to the field ( Becker et al., 2023 ). Similarly, cite337 has suggested that using AI code generation models does not output secure code all the time, as they are non-deterministic, and future research on mitigation is required ( Pearce et al., 2022 ). For example, Pearce et al. (2022) was one of the first to research this subject.

There are further claims that it may be possible to use by cyber criminal ( Chen et al., 2021 ; Natella et al., 2024 ). In popular communication mediums, there are affirmations that ChatGPT and other LLMs will be “useful” for criminal activities, for example Burgess (2023) . However, these tools can be used defensively in cyber security, as in ethical hacking ( Chen et al., 2021 ; Natella et al., 2024 ).

3 Research method

This research aims to systematically gather and analyze publications that answer our main question: “ How does the code generation of AI models impact the cybersecurity of the software process? ” Following Kitchenham and Charters (2007) classification of questions for SLR, our research falls into the type of questions of “Identifying the impact of technologies” on security, and “Identifying cost and risk factors associated with a technology” in security too.

To carry out this research, we have followed different SLR guidelines, most notably Wieringa et al. (2006) , Kitchenham and Charters (2007) , Wohlin (2014) , and Petersen et al. (2015) . Each of these guidelines was used for different elements of the research. We list out in a high-level approach which guidelines were used for each element, which we further discuss in different subsections of this article.

• For the general structure and guidelines on how to carry out the SLR, we used Kitchenham and Charters (2007) . This included the exclusion and inclusion criteria, explained in Section 3.2 ;

• The identification of the Population, Intervention, Comparison, and Outcome (PICO) is based on both Kitchenham and Charters (2007) and Petersen et al. (2015) , as a framework to create our search string. We present and discuss this framework in Section 3.1 ;

• For the questions and quality check of the sample, we used the research done by Kitchenham et al. (2010) , which we describe in further detail in Section 3.4 ;

• The taxonomy of research types is from Wieringa et al. (2006) , used as a strategy to identify whether a paper falls under our exclusion criteria. We present and discuss this taxonomy in Section 3.2. Although their taxonomy focuses on requirements engineering, it is broad enough to be used in other areas, as recognized by Wohlin et al. (2013) ;

• For the snowballing technique, we used the method presented in Wohlin (2014) , which we discuss in Section 3.3 ;

• Mitigation strategies from Wohlin et al. (2013) are used, aiming to increase the reliability and validity of this study. We further analyze the threats to validity of our research in Section 6.

In the following subsections, we explain our approach to the SLR in more detail. The results are presented in Section 4.

3.1 Search planning and string

To answer our question systematically, we need to create a search string that reflects the critical elements of our questions. To achieve this, we thus need to frame the question in a way that allows us to (1) identify keywords, (2) identify synonyms, (3) define exclusion and inclusion criteria, and (4) answer the research question. One common strategy is the PICO (population, intervention, comparison, outcome) approach ( Petersen et al., 2015 ). Originally from medical sciences, it has been adapted for computer science and software engineering ( Kitchenham and Charters, 2007 ; Petersen et al., 2015 ).

To frame our work with the PICO approach, we follow the methodologies outlined in Kitchenham and Charters (2007) and Petersen et al. (2015) . By identifying these four elements, we can identify the set of keywords and their synonyms; the elements are explained in detail in the following bullet points.

• Population: Cybersecurity.

• Following Kitchenham and Charters (2007) , a population can be an area or domain of technology. A population can be very specific.

• Intervention: AI models.

• Following Kitchenham and Charters (2007) , “The intervention is the software methodology/tool/technology, such as the requirement elicitation technique.”

• Comparison: we compare the security issues identified in the code generated in the research articles. In Kitchenham and Charters' (2007) words, “This is the software engineering methodology/tool/technology/procedure with which the intervention is being compared. When the comparison technology is the conventional or commonly-used technology, it is often referred to as the ‘control' treatment.”

• Outcomes: A systematic list of security issues of using AI models for code generation and possible mitigation strategies.

• Context: Although this element is not mandatory (per Kitchenham and Charters, 2007 ), our general context is code generation.

With the PICO elements defined, it is possible to determine specific keywords to generate our search string. We identified three specific sets: security, AI, and code generation. Consequently, we need to include synonyms of these three sets when generating the search string, taking a similar approach to Petersen et al. (2015) . The importance of including different synonyms arises from the fact that different research papers refer to the same phenomena differently; if synonyms are not included, essential papers may be missed from the final sample. The three sets are explained in more detail below:

• Set 1: search elements related to security and insecurity, based on our population of interest and comparison.

• Set 2: AI-related elements based on our intervention. This set should include LLMs, generative AI, and related terms.

• Set 3: the research should focus on code generation.

With these three sets of critical elements defined, a search string was created. We constructed the search string by including synonyms from the three sets (as seen in Table 1 ), building the string concurrently with the identification of synonyms. Through different iterations, we aimed at achieving the “golden” string, following the test-retest approach of Kitchenham et al. (2010) . In every iteration, we checked whether the key papers of our study were in the sample, and a new synonym was kept only if it added meaningful results. For example, one of the iterations included “ hard* ,” which did not add any extra article; hence, it was excluded. Due to space constraints, the different iterations are available in the public repository of this research. The final string, with the specific query per database, is presented in Table 2 .
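As a minimal illustration of how such a string can be assembled from the three sets, the sketch below joins synonyms with OR within a set and AND across sets. The keyword lists shown here are examples of ours; the actual lists and per-database queries are those in Tables 1 , 2 and the public repository.

```python
# Illustrative sketch only: example synonym lists, not the exact ones from Table 1.
SETS = {
    "security": ["security", "insecure", "vulnerab*", "CWE"],
    "ai": ["\"large language model\"", "LLM", "\"generative AI\"", "GPT"],
    "code_generation": ["\"code generation\"", "\"code completion\"", "Copilot"],
}

def build_query(sets: dict[str, list[str]]) -> str:
    # OR within a set, AND across sets, as described above.
    groups = ["(" + " OR ".join(terms) + ")" for terms in sets.values()]
    return " AND ".join(groups)

print(build_query(SETS))
```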


Table 1 . Keywords and synonyms.


Table 2 . Search string per database.

For this research, we selected the following databases to gather our sample: IEEE Xplore, ACM, and Scopus (which includes Springer and ScienceDirect). The databases were selected based on their relevance for computer science research, their publication of peer-reviewed research, and their alignment with this research's objective. Although databases from other domains could have been selected, the chosen ones are well-known in computer science.

3.2 Exclusion and inclusion criteria

The exclusion and inclusion criteria were defined to align with our research objectives. Our interest in excluding unranked venues is to avoid literature that is not peer-reviewed and to act as a first quality check. This decision also applies to gray literature and book chapters. Finally, we excluded opinion and philosophical papers, as they do not carry out primary research. Table 3 shows our inclusion and exclusion criteria.


Table 3 . Inclusion and exclusion criteria.

We excluded articles that address AI models or AI technology in general since our interest, based on PICO, is in the security issues of AI models for code generation. Although such research is interesting, it does not align with our main objective.

For identifying the secondary research, opinion, and philosophical papers, which are all part of our exclusion criteria in Table 3 , we follow the taxonomy provided by Wieringa et al. (2006) . Although this classification was written for the requirements engineering domain, it can be generalized to other domains ( Wieringa et al., 2006 ). Apart from helping us identify whether a paper falls under our exclusion criteria, this taxonomy also allows us to assess how complete the research might be. The classification is as follows:

• Solution proposal: Proposes a solution to a problem ( Wieringa et al., 2006 ). “The solution can be novel or a significant extension of an existing technique ( Petersen et al., 2015 ).”

• Evaluation research: “This is the investigation of a problem in RE practice or an implementation of an RE technique in practice [...] novelty of the knowledge claim made by the paper is a relevant criterion, as is the soundness of the research method used ( Petersen et al., 2015 ).”

• Validation research: “This paper investigates the properties of a solution proposal that has not yet been implemented... ( Wieringa et al., 2006 ).”

• Philosophical papers: “These papers sketch a new way of looking at things, a new conceptual framework ( Wieringa et al., 2006 ).”

• Experience papers: The authors report their experience on a matter. “In these papers, the emphasis is on what and not on why ( Wieringa et al., 2006 ; Petersen et al., 2015 ).”

• Opinion papers: “These papers contain the author's opinion about what is wrong or good about something, how we should do something, etc. ( Wieringa et al., 2006 ).”

3.3 Snowballing

Furthermore, to increase the reliability and validity of this research, we applied a forward snowballing technique ( Wohlin et al., 2013 ; Wohlin, 2014 ). Once the first sample (start set) had passed the exclusion and inclusion criteria based on title, abstract, and keywords, we forward snowballed the whole start set ( Wohlin et al., 2013 ). That is to say, we checked which papers were citing the papers from our start set, as suggested by Wohlin (2014) . For this step, we used Google Scholar.

In the snowballing phase, we analyzed the title, abstract, and keywords of each possible candidate ( Wohlin, 2014 ). In addition, we did an inclusion/exclusion analysis based on the title, abstract, and publication venue. If there was insufficient information, we analyzed the full text to make a decision, following the recommendations by Wohlin (2014) .

Our objective with the snowballing is to increase the reliability and validity of this research. Furthermore, some articles found through the snowballing had been accepted at peer-reviewed venues but had not yet been indexed in the corresponding database. This is a situation we address in Section 6.
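A minimal sketch of the forward snowballing loop is shown below. It assumes a caller-supplied get_citing_papers function (e.g., wrapping a Google Scholar lookup) and a passes_screening check on title, abstract, and venue; both are hypothetical helpers of ours, not tooling from the cited guidelines.

```python
from typing import Callable, Iterable

# Hypothetical helpers: `get_citing_papers` would wrap a citation lookup (e.g., Google
# Scholar), and `passes_screening` encodes the title/abstract/venue criteria.
def forward_snowball(
    start_set: Iterable[str],
    get_citing_papers: Callable[[str], list[str]],
    passes_screening: Callable[[str], bool],
) -> set[str]:
    accepted: set[str] = set()
    for paper in start_set:
        for candidate in get_citing_papers(paper):   # papers that cite the start set
            if candidate not in accepted and passes_screening(candidate):
                accepted.add(candidate)
    return accepted
```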

3.4 Quality analysis

Once the final sample of papers is collected, we proceed with the quality check, following the procedures of Kitchenham and Charters (2007) and Kitchenham et al. (2010) . The objective behind a quality checklist is 2-fold: “to provide still more detailed inclusion/exclusion criteria” and to act “as a means of weighting the importance of individual studies when results are being synthesized ( Kitchenham and Charters, 2007 ).” We followed the approach taken by Kitchenham et al. (2010) for the quality check, taking their questions and categorization. In addition, to further adapt the questionnaire to our objectives, we added one question on security and adapted another one. The questionnaire is described in Table 4 . Each question was scored according to the scoring scale defined in Table 5 .


Table 4 . Quality criteria questionnaire.


Table 5 . Quality criteria assessment.

The quality analysis is done by at least two authors of this research, for reliability and validity purposes ( Wohlin et al., 2013 ).

3.5 Data extraction

To extract the data and answer the main question, we subdivided it into more specific sub-questions. This allows us to extract and summarize information systematically; we created an extraction form in line with Kitchenham and Charters (2007) and Carrera-Rivera et al. (2022) . The data extraction form is presented in Table 6 .


Table 6 . Data extraction form and type of answer.

The data extraction was done by at least two researchers per article. Afterward, the results were compared, and if there were “disagreements, [they must be] resolved either by consensus among researchers or arbitration by an additional independent researcher ( Kitchenham and Charters, 2007 ).”

4 Results

4.1 Search results

The search and collection of papers were done during the last week of November 2023. Table 7 shows the total number of articles gathered per database. The selection process for our final sample is shown in Figure 1 .


Table 7 . Search results per database.


Figure 1 . Selection of sample papers for this SLR.

The total number of articles in our first round, across all the databases, was 95. We then removed duplicates and applied our inclusion and exclusion criteria to this first round of papers. This process left us with a sample of 21 articles.

These first 21 articles are our starting set, from which we proceeded with forward snowballing. We snowballed each paper of the starting set by searching Google Scholar to find where it had been cited. Papers at this phase were selected based on title and abstract, following Wohlin (2014) . From this step, 22 more articles were added to the sample, giving 43 articles. We then applied the inclusion and exclusion criteria to the newly snowballed papers, which left us with 35 papers. We discuss this high number of snowballed papers in Section 6.

At this point, we read all the articles to analyze whether they should pass to the final phase. In this phase, we discarded 12 articles deemed out of scope for this research (for example, because they did not focus on cybersecurity, code generation, or the usage of AI models for code generation), leaving us with 23 articles for the quality check.

At this phase, three particular articles sparked discussion between the first and fourth authors regarding whether they were within the scope of this research. We defined AI code generation as artifacts that suggest or produce code. Hence, artifacts that use AI to check and/or verify code, or to detect vulnerabilities without suggesting new code, are not within scope. In addition, the article's main focus should be on code generation and not other areas, such as code verification; so, although an article might discuss code generation, it was not accepted if code generation was not its main topic. As a result, two of the three discussed articles were accepted, and one was rejected.

4.2 Quality evaluation

We carried out a quality check for our preliminary sample of papers ( N = 23) as detailed in Section 3.4. Based on the indicated scoring system, we discarded articles that did not reach 50% of the total possible score (four points). If there were disagreements in the scoring, these were discussed and resolved between authors. Each paper's score details are provided in Table 8 , for transparency purposes ( Carrera-Rivera et al., 2022 ). Quality scores guide us on where to place more weight and on which articles to focus ( Kitchenham and Charters, 2007 ). The final sample is N = 19.
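As a small sketch of this selection rule (our illustration; the actual questions and scores are those in Tables 4 , 5 , 8 ), the snippet below keeps papers that reach at least half of the maximum score and ranks the rest by score for weighting.

```python
# Illustrative only: paper identifiers and scores are made up.
MAX_SCORE = 8              # total possible points across the quality questions
THRESHOLD = MAX_SCORE / 2  # 50% rule, i.e., four points

scores = {"paper_a": 7.5, "paper_b": 3.5, "paper_c": 6.0}

kept = {paper: score for paper, score in scores.items() if score >= THRESHOLD}
ranked = sorted(kept.items(), key=lambda item: item[1], reverse=True)  # heavier weight first
print(ranked)  # [('paper_a', 7.5), ('paper_c', 6.0)]
```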


Table 8 . Quality scores of the final sample.

4.3 Final sample

The quality check discarded three papers, which left us with 19 papers as our final sample, as seen in Table 9 . The first article in this sample was published in 2022, and the number of publications has been increasing every year. This situation is not surprising, as generative AI rose in popularity in 2020 and expanded into widespread awareness with the release of ChatGPT 3.5.


Table 9 . Sample of papers, with the main information of interest ( † means no parameter or base model was specified in the article).

5 Discussion

5.1 About AI model comparisons and methods of investigation

The majority of the papers (14 papers, 73%) research at least one OpenAI model, with Codex being the most popular option. OpenAI also owns ChatGPT, which was massively adopted by the general public; hence, it is not surprising that most articles focus on OpenAI models. However, AI models from other organizations are also studied: Salesforce's CodeGen and CodeT5, both open-source, are prime examples. Similarly, Polycoder ( Xu et al., 2022 ) was a popular selection in the sample. Finally, several authors benchmarked in-house AI models against popular models, for example Tony et al. (2022) with DeepAPI-plusSec and DeepAPI-onlySec, and Pearce et al. (2023) with gpt2-csrc. Figure 3 shows the LLM instances researched by two or more articles, grouped by family.

As the different papers researched different vulnerabilities, it remains difficult to compare their results. Some articles researched specific CWEs, others the MITRE Top-25, the impact of AI on code, the quality of the generated code, or malware generation, among other topics. It was also challenging to find a common methodological approach for comparing results; therefore, we can only infer certain tendencies. For this reason, future research could focus on a standardized approach to analyzing vulnerabilities and assessing the security quality of generated code. Furthermore, it would be interesting to have more comparisons between open-source and proprietary models.

Having stated this, two articles with similar approaches, topics, and vulnerabilities are Pearce et al. (2022 , 2023) . Both papers share authors, which can help explain the similarity in approach. Both reach similar conclusions on the security of the output of different OpenAI models: they can generate functional and safe code, but the percentage varies between CWEs and programming languages ( Pearce et al., 2022 , 2023 ). In both studies, the security of the code generated in C was inferior to that in Python ( Pearce et al., 2022 , 2023 ). For example, Pearce et al. (2022) indicates that 39% of the suggested Python code is vulnerable, vs. 50% of the C code. Pearce et al. (2023) highlights that the models they studied struggled with fixes for certain CWEs, such as CWE-787 in C. So even though they compared different models of the OpenAI family, they obtained similar results (albeit some models performed better than others).

Based on the work of Pearce et al. (2023) , when comparing OpenAI's models to others (such as the AI21 family, Polycoder, and gpt2-csrc) in C and Python over CWE vulnerabilities, OpenAI's models perform better than the rest, and in the majority of cases code-davinci-002 outperforms the others. Furthermore, when applying the AI models to other programming languages, such as Verilog, not all models (namely Polycoder and gpt2-csrc) supported it ( Pearce et al., 2023 ). We cannot fully compare these results with other research articles, as they focused on different CWEs, but we can identify tendencies:

• He and Vechev (2023) study mainly CodeGen and mention that Copilot can help with CWE-089, 022, and 798. They do not compare the two AI models but compare CodeGen with SVEN. They use scenarios to evaluate CWEs, adopting the method from Pearce et al. (2022) . CodeGen does seem to show similar tendencies to those reported by Pearce et al. (2022) : certain CWEs appeared more recurrently than others. For example, comparing Pearce et al. (2022) and He and Vechev (2023) , CWE-787, 089, 079, and 125 in Python and C appeared in most scenarios at a similar rate. 4

• These data show that OpenAI's and CodeGen models have similar outputs. When He and Vechev (2023) present the “overall security rate” of CodeGen at different temperatures, the rates are comparable: 42% of the suggested code is vulnerable in He and Vechev (2023) vs. 39% in Python and 50% in C in Pearce et al. (2022) .

• Nair et al. (2023) also study CWE vulnerabilities for Verilog code. Pearce et al. (2022 , 2023) also analyze Verilog with OpenAI's models, but with very different research methods. Furthermore, their objectives are different: Nair et al. (2023) focus on prompting and how to modify prompts to obtain a secure output. What can be compared is that both Nair et al. (2023) and Pearce et al. (2023) highlight the importance of prompting.

• Finally, Asare et al. (2023) also study OpenAI models, but from a very different perspective: human-computer interaction (HCI). Therefore, we cannot compare their results with those of Pearce et al. (2022 , 2023) .

Regarding malware code generation, both Botacin (2023) and Pa Pa et al. (2023) study OpenAI's models, but with different base models. Both conclude that AI models can help generate malware, although to different degrees. Botacin (2023) indicates that ChatGPT cannot create malware from scratch but can create snippets and help less-skilled malicious actors with the learning curve. Pa Pa et al. (2023) experiment with different jailbreaks and suggest that the different models can create malware of up to 400 lines of code. In contrast, Liguori et al. (2023) research Seq2Seq and CodeBERT and highlight that, for malicious actors, it is important that AI models output correct code; if not, their attack fails. Therefore, human review is still necessary to fulfill the goals of malicious actors ( Liguori et al., 2023 ). Future work could benefit from comparing these results with other AI code generation models to understand whether they produce similar outputs and how they can be jailbroken.

The last element we can compare is the HCI aspect, specifically Asare et al. (2023) , Perry et al. (2023) , and Sandoval et al. (2023) , who all researched C. Both Asare et al. (2023) and Sandoval et al. (2023) agree that AI code generation models do not seem to be worse than humans, and may be comparable, at generating insecure code and introducing vulnerabilities. In contrast, Perry et al. (2023) conclude that developers who used AI assistants generated more insecure code (although this is inconclusive for the C language) while believing they had written more secure code. Perry et al. (2023) suggest a relationship between how much developers trust the AI model and the security of the code. All three agree that AI assistant tools should be used carefully, particularly by non-experts ( Asare et al., 2023 ; Perry et al., 2023 ; Sandoval et al., 2023 ).

5.2 New exploits

Firstly, Niu et al. (2023) hand-crafted prompts that seemed likely to leak personal data, which yielded 200 prompts. They then queried each of these prompts, obtaining five responses per prompt, giving 1,000 responses. Two authors then looked through the outputs to identify whether the prompts had leaked personal data. The authors then refined the prompt set based on those identified as leaking data, tweaking elements such as context, prefixes, the natural language (English and Chinese), and meta-variables such as the prompt's programming-language style for the final data set.

With the final set of prompts, the model was queried for privacy leaks. Before querying the model, the authors also tuned specific parameters, such as temperature. “Using the BlindMI attack allowed filtering out 20% of the outputs, with the high recall ensuring that most of the leakages are classified correctly and not discarded ( Niu et al., 2023 ).” Once the outputs had been labeled as members, a human checked if they contained “sensitive data” ( Niu et al., 2023 ). The human could categorize such information as a targeted leak, indirect leak, or uncategorized leak.

When applying the exploit to Codex Copilot and verifying against GitHub, the results show that there is indeed a leakage of information ( Niu et al., 2023 ): 2.82% of the outputs contained identifiable information such as addresses, emails, and dates of birth; 0.78% contained private information such as medical records or identities; and 0.64% contained secret information such as private keys, biometric authentication data, or passwords ( Niu et al., 2023 ). The rate at which data was leaked varied; specific categories, such as bank statements, had much lower leak rates than passwords, for example ( Niu et al., 2023 ). Furthermore, most of the leaks tended to be indirect rather than direct. This finding implies that “the model has a tendency to generate information pertaining to individuals other than the subject of the prompt, thereby breaching privacy principles such as contextual agreement ( Niu et al., 2023 ).”

Their research proposes a scalable and semi-automatic method to leak personal data from the training data of a code-generation AI model. The authors do note that the outputs are not verbatim or memorized data.

To build SVEN, He and Vechev (2023) curated a dataset of vulnerabilities from CrossVul ( Nikitopoulos et al., 2021 ) and Big-Vul ( Fan et al., 2020 ), which focus on C/C++, and from VUDENC ( Wartschinski et al., 2022 ) for Python. In addition, they included data from GitHub commits, taking special care that these were true security commits, to avoid SVEN learning “undesirable behavior.” In the end, they target 9 CWEs from the MITRE Top 25.

Through benchmarking, they evaluate the security (and functional correctness) of SVEN's output against CodeGen (350M, 2.7B, and 6.1B). They follow a scenario-based approach “that reflect[s] real-world coding ( He and Vechev, 2023 ),” with each scenario targeting one CWE. They measure the security rate, defined as “the percentage of secure programs among valid programs ( He and Vechev, 2023 ),” and set the temperature at 0.4 for the samples.

Their results show that SVEN can significantly increase or decrease (depending on the controlled generation output) the code security rate. “CodeGen LMs have a security rate of ≈60%, which matches the security level of other LMs [...] SVEN sec significantly improves the security rate to >85%. The best-performing case is 2.7B, where SVENsec increases the security rate from 59.1 to 92.3% ( He and Vechev, 2023 ).” Similar results are obtained for SVEN vul , which degrades the “security rate greatly by 23.5% for 350M, 22.3% for 2.7B, and 25.3% for 6.1B ( He and Vechev, 2023 )”. 5 When analyzed per CWE, in almost all cases (except CWE-416 in C) SVEN sec increases the security rate. Finally, even when tested with 4 CWEs that were not included in the original training set of 9, SVEN had positive results.

Although the authors aim at evaluating and validating SVEN as an artifact for cybersecurity, they also recognize its potential use as a malicious tool. They suggest that SVEN could be inserted into open-source projects and distributed ( He and Vechev, 2023 ). Future work could focus on how to integrate SVEN, or similar approaches, as plug-ins into AI code generators to lower the security of the generated code. Furthermore, replication of this approach could raise security alarms. Other research could focus on ways to lower the security score while keeping the functionality, and on how such a tool could be distributed across targeted actors.

Jha and Reddy (2023) benchmark CodeAttack against TextFooler and BERT-Attack, two other adversarial attacks, on three tasks: code translation (translating code between different programming languages, in this case between C# and Java), code repair (fixing bugs in Java), and code summarization (a summary of the code in natural language). The authors also applied the benchmark to different AI models (CodeT5, CodeBERT, GraphCode-BERT, and RoBERTa) and different programming languages (C#, Java, Python, and PHP). In the majority of the tests, CodeAttack had the best results.

5.3 Performance per programming language

Different programming languages are studied. Python and the C family, including C, C++, and C#, are the most common languages (as seen in Figure 2 ). To a lesser extent, Java and Verilog are tested. Finally, some articles study more specific programming languages, such as Solidity, Go, or PHP. Figure 2 offers a graphical representation of the distribution of the programming languages.


Figure 2 . Number of articles that research specific programming languages. An article may research 2 or more programming languages.


Figure 3 . Number of times each LLM instance was researched by two or more articles, grouped by family. One paper might study several instances of the same family (e.g., Code-davinci-001 and Code-davinci-002), therefore counting twice. Table 9 offers details on exactly which AI models are studied per article.

5.3.1 Python

Python is the second most used programming language 6 as of today. As a result, most publicly available training corpora include Python, and it is therefore reasonable to assume that AI models can more easily be tuned to handle this language ( Pearce et al., 2022 , 2023 ; Niu et al., 2023 ; Perry et al., 2023 ). Being a rather high-level, interpreted language, Python should also expose a smaller attack surface. As a result, AI-generated Python code has fewer avenues to cause issues to begin with, and this is indeed backed up by evidence ( Pearce et al., 2022 , 2023 ; Perry et al., 2023 ).

In spite of this, issues still occur: Pearce et al. (2022) experimented with 29 scenarios, producing 571 Python programs. Out of these, 219 (38.35%) presented some kind of Top-25 MITRE (2021) vulnerability, with 11 (37.92%) scenarios having a top-vulnerable score. Unaccounted for in these statistics are the situations where generated programs fail to achieve functional correctness ( Pearce et al., 2023 ), which could yield different conclusions. 7

Pearce et al. (2023) , building from Pearce et al. (2022) , study to what extent post-processing can automatically detect and fix bugs introduced during code generation. For instance, on CWE-089 (SQL injection) they found that “29.6% [3197] of the 10,796 valid programs for the CWE-089 scenario were repaired” by an appropriately-tuned LLM ( Pearce et al., 2023 ). In addition, they claim that AI models can generate bug-free programs without “additional context ( Pearce et al., 2023 ).”

It is, however, difficult to support such claims, which need to be nuanced. Depending on the class of vulnerability, AI models varied in their ability to produce secure Python code ( Pearce et al., 2022 ; He and Vechev, 2023 ; Perry et al., 2023 ; Tony et al., 2023 ). Tony et al. (2023) experimented with code generation from natural language prompts, finding that, indeed, Codex output included vulnerabilities. In other research, Copilot exhibited only rare occurrences of CWE-079 or CWE-020, but common occurrences of CWE-798 and CWE-089 ( Pearce et al., 2022 ). Pearce et al. (2022) report a 75% vulnerable score for scenario 1, 48% for scenario 2, and 65% for scenario 3 with regard to the CWE-089 vulnerability. In February 2023, Copilot launched a prevention system for CWEs 089, 022, and 798 ( He and Vechev, 2023 ), the exact mechanism of which is unclear; at the time of writing, it falls behind other approaches such as SVEN ( He and Vechev, 2023 ).
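For readers less familiar with these identifiers, the snippet below illustrates what the two most frequently reported classes above (CWE-798, hard-coded credentials, and CWE-089, SQL injection) typically look like in Python. It is our own illustration, not code taken from the reviewed studies.

```python
import sqlite3

DB_PASSWORD = "hunter2"  # CWE-798: a hard-coded credential checked into the source

def find_user_vulnerable(conn: sqlite3.Connection, username: str):
    # CWE-089: user input concatenated into the SQL statement, so an input like
    # "x' OR '1'='1" changes the meaning of the query.
    query = "SELECT id, email FROM users WHERE name = '" + username + "'"
    return conn.execute(query).fetchall()

def find_user_safe(conn: sqlite3.Connection, username: str):
    # Parameterized query: the driver treats the input strictly as data.
    return conn.execute(
        "SELECT id, email FROM users WHERE name = ?", (username,)
    ).fetchall()
```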

Perhaps surprisingly, there is not much variability across different AI models: CodeGen-2.7B has comparable vulnerability rates ( He and Vechev, 2023 ), with CWE-089 still on top. CodeGen-2.7B also produced code that exhibited CWE-078, 476, 079, or 787, which are considered more critical.

One may think that using AI as an assistant to a human programmer could alleviate some of these issues. Yet evidence points to the opposite: when using AI models as pair programmers, developers consistently deliver more insecure Python code ( Perry et al., 2023 ). Perry et al. (2023) led a user-oriented study on how the usage of AI models for programming affects the security and functionality of code, focusing on Python, C, and SQL. For Python, they asked participants to write functions that performed basic cryptographic operations (encryption, signature) and file manipulation. 8 They show a statistically significant difference between subjects that used AI models (experimental group) and those that did not (control group), with the experimental group consistently producing less secure code ( Perry et al., 2023 ). For instance, for task 1 (encryption and decryption), 21% of the responses of the experimental group were secure and correct vs. 43% for the control group ( Perry et al., 2023 ). In comparison, 36% of the experimental group provided insecure but correct code, compared to 14% of the control group.
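As a minimal sketch of the kind of encryption task involved, the snippet below contrasts a weak pattern (a key derived from a hard-coded password with a fast, unsalted hash) with authenticated encryption under a freshly generated key. It assumes the third-party cryptography package and is our illustration, not material from the study.

```python
import base64
import hashlib
from cryptography.fernet import Fernet

def encrypt_insecure(plaintext: bytes) -> bytes:
    # Anti-pattern: secret derived from a hard-coded password with a fast,
    # unsalted hash (in the spirit of CWE-798 and weak key management).
    key = base64.urlsafe_b64encode(hashlib.sha256(b"letmein").digest())
    return Fernet(key).encrypt(plaintext)

def encrypt_secure(plaintext: bytes) -> tuple[bytes, bytes]:
    # Fresh random key and authenticated encryption; the key is returned so the
    # caller can store it in a proper secret store rather than in the source.
    key = Fernet.generate_key()
    return key, Fernet(key).encrypt(plaintext)
```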

Even if AI models occasionally produce bug-free and secure code, evidence indicates that this cannot be guaranteed. In this light, both Pearce et al. (2022 , 2023) recommend deploying additional security-aware tools and methodologies whenever using AI models. Moreover, Perry et al. (2023) suggest a relationship between security awareness and trust in AI models on the one hand, and the security of the AI-(co)generated code on the other.

Another point of agreement in our sample is that prompting plays a crucial role in producing vulnerabilities, which can be introduced or avoided depending on the prompt and the adjustment of parameters (such as temperature). Pearce et al. (2023) observe that AI models can generate code that repairs an issue when they are given a suitable repair prompt. Similarly, Pearce et al. (2022) analyzed how meta-type changes and comments (documentation) can have varying effects on security. An extreme example is the difference between SQL code generated with different prompts: the prompt that “adds a separate non-vulnerable SQL function above a task function” (identified as variation C-2, as it is a code change) would never produce vulnerable code, whereas the one that “adds a separate vulnerable SQL function above the task function” (identified as variation C-3) returns vulnerable code 94% of the time ( Pearce et al., 2022 ). Such results may not be surprising if we expect the AI model to closely follow instructions, but they suffice to show the effect that even minor prompt variations can have on security.

Lastly, Perry et al. (2023) observe in the experimental group a relationship between parameters of the AI model (such as temperature) and code quality. They also observe a relationship between education, security awareness, and trust ( Perry et al., 2023 ). Because of this, there could be spurious correlations in their analysis; for instance, the variable measuring AI model parameter adjustments could, in reality, be measuring education or something else.

On another security topic, Siddiq et al. (2022) study code and security “smells.” Smells are hints, not necessarily actual vulnerabilities, but they can open the door for developers to make mistakes that lead to security flaws that attackers exploit. Siddiq et al. (2022) reported on the following CWE vulnerabilities: 078, 703, and 330. They concluded that bad code patterns can (and will) leak into the output of models, and that code generated with these tools should be taken with a “grain of salt” ( Siddiq et al., 2022 ). Furthermore, the identified vulnerabilities may be severe (not merely functional issues) ( Siddiq et al., 2022 ). However, as they only researched OpenAI's AI models, their conclusion may lack external validity and generalization.
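The smell categories reported there map onto familiar Python patterns; the snippet below gives our own illustrative instances of CWE-330 (weak randomness) and CWE-078 (OS command injection), not examples drawn from Siddiq et al.'s (2022) dataset.

```python
import random
import secrets
import subprocess

def make_token_weak() -> str:
    # CWE-330: `random` is a non-cryptographic PRNG, unsuitable for secrets.
    return "".join(random.choice("0123456789abcdef") for _ in range(32))

def make_token_strong() -> str:
    return secrets.token_hex(16)  # cryptographically secure randomness

def ping_vulnerable(host: str) -> int:
    # CWE-078: the string is interpreted by the shell, so an input such as
    # "8.8.8.8; cat /etc/passwd" executes an attacker-controlled command.
    return subprocess.call("ping -c 1 " + host, shell=True)

def ping_safe(host: str) -> int:
    # Argument list, no shell: the host is passed as a single literal argument.
    return subprocess.call(["ping", "-c", "1", host])
```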

Finally, some authors explore the possibility of using AI models to deliberately produce malicious code ( He and Vechev, 2023 ; Jha and Reddy, 2023 ; Jia et al., 2023 ; Niu et al., 2023 ). This is interesting to the extent that it facilitates the work of attackers, and therefore affects cybersecurity as a whole, but it does not (in this form at least) affect the software development process or deployment per se, and is therefore outside the scope of our discussion.

5.3.2 The C family

The C programming language family is considered in 10 (52%) papers of our final sample, with C being the most common, followed by C++ and C#. Unlike Python, C is a low-level, compiled language that puts the programmer in charge of many security-sensitive tasks (such as memory management). The vast majority of native code today is written in C. 9

The consensus is that AI generation of C programs yields insecure code ( Pearce et al., 2022 , 2023 ; He and Vechev, 2023 ; Perry et al., 2023 ; Tony et al., 2023 ) and can readily be used to develop malware ( Botacin, 2023 ; Liguori et al., 2023 ; Pa Pa et al., 2023 ). However, it is unclear whether AI code generation introduces more or new vulnerabilities compared to humans ( Asare et al., 2023 ; Sandoval et al., 2023 ), or to what extent it influences developers' trust in the security of the code ( Perry et al., 2023 ).

Multiple authors report that common and identified vulnerabilities are regularly found in AI-generated C code ( Pearce et al., 2022 , 2023 ; Asare et al., 2023 ; He and Vechev, 2023 ; Perry et al., 2023 ; Sandoval et al., 2023 ). Pearce et al. (2022) obtained 513 C programs, 258 of which (50.29%) had a top-scoring vulnerability. He and Vechev (2023) provide a similar conclusion.

Regarding automated code-fixing, Asare et al. (2023) and Pearce et al. (2023) report timid scores, with, for example, only 2.2% for CWE-787 in C code.

On the question of human- vs. AI-generated code, Asare et al. (2023) used 152 scenarios to conclude that AI models in fact make fewer mistakes. Indeed, when prompted with the same scenario as a human, in 33% of cases the model suggested the original vulnerability, and in 25% it provided a bug-free output. Yet, when testing code replication and automated vulnerability fixing, the authors do not recommend the usage of such a model by non-experts. For example, in code replication, the AI models would always replicate code regardless of whether it had a vulnerability, and CWE-20 would consistently be replicated ( Asare et al., 2023 ).

Sandoval et al. (2023) experimentally compared the security of code produced by AI-assisted students to code generated by Codex. They had 58 participants and studied memory-related CWEs, given that these are in the MITRE Top-25 list ( Sandoval et al., 2023 ). Although there were differences between groups, these were not bigger than 10% and differed between metrics ( Sandoval et al., 2023 ). In other words, depending on the chosen metric, sometimes AI-assisted subjects perform better in security and vice versa ( Sandoval et al., 2023 ). For example, the rate of CWE-787 was almost the same for the control and experimental groups, whereas it was more prevalent in the code generated by Codex alone. Therefore, they conclude that the impact on “cybersecurity is less conclusive than the impact on functionality ( Sandoval et al., 2023 ).” Depending on the security metric, it may be beneficial to use AI-assisted tools, which the authors recognize goes against standard literature ( Sandoval et al., 2023 ). They go so far as to conclude that there is “no conclusive evidence to support the claim LLM assistant increase CWE incidence in general, even when we looked only at severe CWEs ( Sandoval et al., 2023 ).”

Regarding AI-assisted malware generation, there seem to be fundamental limitations preventing current AI models from writing self-contained software from scratch ( Botacin, 2023 ; Liguori et al., 2023 ; Pa Pa et al., 2023 ), although they are adequate for creating smaller blocks of code which, strung together, produce complete malware ( Botacin, 2023 ). It is also possible to bypass models' limitations by leveraging basic obfuscation techniques ( Botacin, 2023 ). Pa Pa et al. (2023) experiment with prompts and jailbreaks in ChatGPT to produce code (specifically, fileless malware for C++), which was only obtained with the 2 jailbreaks they chose. Meanwhile, Liguori et al. (2023) reflect on how best to optimize AI code-generating tools to assist attackers in producing code, since a failure or incorrect code means the attack fails.

Regarding CWEs, the MITRE Top-25 is a concern across multiple authors ( Pearce et al., 2022 , 2023 ; He and Vechev, 2023 ; Tony et al., 2023 ). CWE-787 is a common concern across articles, as it is the #1 vulnerability in the MITRE Top-25 list ( Pearce et al., 2022 ; Botacin, 2023 ; He and Vechev, 2023 ). In the three scenarios experimented with by Pearce et al. (2022) , on average ~34% of the output is vulnerable code. He and Vechev (2023) tested two scenarios, the first receiving a security rate of 33.7% and the second one 99.6%. What was interesting in their experiment is that they were not able to obtain lower security rates for SVEN vul than the originals ( He and Vechev, 2023 ). Other vulnerabilities had varying results but with a similar trend. Overall, it seems that AI code generation models produce more vulnerable code in C than in other programming languages, possibly due to the quality and type of data in the training data set ( Pearce et al., 2022 , 2023 ).

Finally, regarding human-computer interaction, Perry et al. (2023) suggest that subjects “with access to an AI assistant often produced more security vulnerabilities than those without access [...] overall.” However, they highlight that the difference is not statistically significant and is inconclusive for the C case they study. So even if the claim applies to Python, Perry et al. (2023) indicate this is not the case for the C language. Asare et al. (2023) and Sandoval et al. (2023) , as discussed previously, both conclude that AI models do not introduce more vulnerabilities than humans into code: “This means that in a substantial number of scenarios we studied where the human developer has written vulnerable code, Copilot can avoid the detected vulnerability ( Asare et al., 2023 ).”

5.3.3 Java

Java 10 is a high-level programming language that runs atop a virtual machine and is today primarily used for the development of mobile applications. Vulnerabilities can therefore arise from the programs themselves, from calls to vulnerable (native) libraries, or from problems within the Java virtual machine. Only the first category of issues is discussed here.

In our sample, four articles ( Tony et al., 2022 ; Jesse et al., 2023 ; Jha and Reddy, 2023 ; Wu et al., 2023 ) analyzed code generation AI models for Java. Each study focused on very different aspects of cybersecurity, and they did not analyze the same vulnerabilities. Tony et al. (2022) investigated the dangers of incorrect API calls for cryptographic protocols. Their conclusion is that generative AI might not be at all optimized for generating cryptographically secure code ( Tony et al., 2022 ). The accuracy of the generated code was significantly lower on cryptographic tasks than the accuracy the AI is advertised to have on regular code ( Tony et al., 2022 ).

Jesse et al. (2023) experiment with generating simple, stupid bugs (SStuBs) with different AI models. They provide six main findings, which can be summarized as follows: AI models propose twice as many SStuBs as correct code, although they also seem to help with other SStuBs ( Jesse et al., 2023 ). 11 One of the issues with SStuBs is that “where Codex wrongly generates simple, stupid bugs, these may take developers significantly longer to fix than in cases where Codex does not ( Jesse et al., 2023 ).” In addition, different AI models behave differently with respect to the SStuBs generated ( Jesse et al., 2023 ). Finally, Jesse et al. (2023) found that commenting the code leads to fewer SStuBs and more patches, even if the code is misleading.

Wu et al. (2023) (1) analyze and compare the capabilities of different LLMs, fine-tuned LLMs, and automated program repair (APR) techniques for repairing vulnerabilities in Java; (2) propose VJBench and VJBench-trans as a “new vulnerability repair benchmark;” and (3) evaluate the studied AI models on the proposed VJBench and VJBench-trans. VJBench aims to extend the work of Vul4J and thus proposes 42 vulnerabilities, including 12 new CWEs that were not included in Vul4J ( Wu et al., 2023 ). Therefore, their study assessed 35 vulnerabilities proposed by Vul4J and 15 by the authors ( Wu et al., 2023 ). On the other hand, VJBench-trans is composed of “150 transformed Java vulnerabilities ( Wu et al., 2023 ).” Overall, they concluded that the AI models fix very few Java vulnerabilities, with Codex fixing 20.4% of them ( Wu et al., 2023 ). Indeed, “large language models and APR techniques, except Codex, only fix vulnerabilities that require simple changes, such as deleting statements or replacing variable/method names ( Wu et al., 2023 ).” On the other hand, fine-tuning seems to help the LLMs improve at the task of fixing vulnerabilities ( Wu et al., 2023 ).

However, the four APR techniques and nine LLMs studied did not fix the new CWEs introduced by VJBench ( Wu et al., 2023 ). Some CWEs that are not tackled are “CWE-172 (Encoding error), CWE-325 (Missing cryptographic step), CWE-444 (HTTP request smuggling; Wu et al., 2023 ),” which can have considerable cybersecurity impacts. For example, CWE-325 can weaken a cryptographic protocol, thus lowering the overall security. Furthermore, apart from Codex, the other AI models and APR techniques studied did not apply complex vulnerability repairs but focused on “simple changes, such as deletion of a statement ( Wu et al., 2023 ).”

Jia et al. (2023) study the possibility that a code-generation AI model is manipulated by “adversarial inputs,” in other words, user inputs designed to trick the model into either misunderstanding code or producing code that behaves in an adversarially controlled way. They tested Claw, M1, and ContraCode in both Python and Java on the following tasks: code summarization, code completion, and code clone detection ( Jia et al., 2023 ).

Finally, Jha and Reddy (2023) proposes CodeAttack , which is implemented in different programming languages, including Java. 12 When tested in Java, their results show that 60% of the adversarial code generated is syntactically correct ( Jha and Reddy, 2023 ).

5.3.4 Verilog

Verilog is a hardware-description language. Unlike the other programming languages discussed so far, its purpose is not to describe software but to design and verify digital circuits (at the register-transfer level of abstraction).

The articles that researched Verilog generally conclude that the AI models they studied are less effective in this language than in Python or C ( Pearce et al., 2022 , 2023 ; Nair et al., 2023 ). Different articles research different vulnerabilities, with two specific CWEs standing out: 1271 and 1234. Pearce et al. (2022) summarize the difficulty of deciding which CWE vulnerabilities to study for Verilog, as there is no Top-25 CWE list for hardware; hence, their research selected vulnerabilities that could be analyzed ( Pearce et al., 2022 ). This situation makes it difficult to compare research and results, as different authors can select different focuses. The different approaches to vulnerabilities in Verilog can be seen in Table 9 , where only two CWEs are common across all studies (1271 and 1234), while others, such as 1221 ( Nair et al., 2023 ) or 1294 ( Pearce et al., 2022 ), are researched by only one article.

Note that, unlike for software vulnerabilities, it is much harder to agree on a list of the most relevant hardware vulnerabilities; to the best of our knowledge, there is no consensus on the matter today.

Regarding the security concern, both Pearce et al. (2022 , 2023) , studying OpenAI models, indicated that these models generally struggled to produce correct, functional, and meaningful Verilog code, being less effective at the task. For example, Pearce et al. (2022) generated “198 programs. Of these, 56 (28.28%) were vulnerable. Of the 18 scenarios, 7 (38.89 %) had vulnerable top-scoring options.” Pearce et al. (2023) observe that, when using these AI models to generate repair code, they firstly had to experiment with the temperature of the AI model (compared to C and Python), as it produced different results. Secondly, they conclude that the models behaved differently with Verilog than with other languages and “seemed [to] perform better with less context provided in the prompt ( Pearce et al., 2023 ).” The hypothesis for why there is a difference between Verilog and other programming languages is that less training data is available ( Pearce et al., 2022 ).

5.4 Mitigation strategies

There have been several attempts, or suggestions, to mitigate the negative effects on security when using AI to code. Despite being reasonable, not all of them are necessarily effective, as we discuss in the remainder of this section. Overall, the attempts we have surveyed discuss how to modify the different elements that can affect the quality of the AI models or the quality of the user's control over the AI-generated code. Table 10 summarizes the suggested mitigation strategies.


Table 10 . Summary of the mitigation strategies.

5.4.1 Dataset

Part of the issue is that LLMs are trained on code that is itself rife with vulnerabilities and bad practices. As a number of the AI models are not open-source or their training corpora are not available, different researchers hypothesize that the security issues arise from the training dataset ( Pearce et al., 2022 ). Adding datasets that include different programming languages with different vulnerabilities may help reduce the vulnerabilities in the output ( Pearce et al., 2022 ). This is why, to mitigate the problems with dataset security quality, He and Vechev (2023) manually curated the training data for fine-tuning, which improved the output performance against the studied CWEs.

By carefully selecting training corpora that are of higher quality, which can be partially automated, there is hope that fewer issues would arise ( He and Vechev, 2023 ). However, a consequence of such a mitigation is that the size of the training set would be much reduced, which weakens the LLM's ability to generate code and generalize ( Olson et al., 2018 ). Therefore one may expect that being too picky with the training set would result, paradoxically, in a reduction in code output quality. A fully fledged study of this trade-off remains to be done.

5.4.2 Training procedure

During the training process, LLMs are scored on their ability to autoencode, that is, to accurately reproduce their input (in the face of a partially occluded input). In the context of natural language, minor errors are often acceptable and almost always have little to no impact on the meaning or understanding of a sentence. Such is not the case for code, which can be particularly sensitive to minor variations, especially for low-level programming languages. A stricter training regimen could score an LLM based not only on syntactic correctness, but on (some degree of) semantic correctness, to limit the extent to which the model wanders away from a valid program. Unfortunately, experimental data from Liguori et al. (2023) suggests that currently no single metric succeeds at that task.
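To make the idea concrete, the sketch below, our illustration under the assumption that candidate programs come with a small test suite (it is not a metric proposed in the reviewed papers), scores a generated snippet first on syntactic validity and then on how many tests it passes, a crude proxy for semantic correctness.

```python
import ast

def score_candidate(source: str, tests: list[tuple[tuple, object]], entry: str = "solution") -> float:
    """Return 0.0 for invalid syntax, up to 1.0 when all tests pass."""
    try:
        ast.parse(source)              # syntactic validity
    except SyntaxError:
        return 0.0
    namespace: dict = {}
    try:
        exec(source, namespace)        # NOTE: run untrusted code only in a sandbox
        func = namespace[entry]
        passed = sum(1 for args, expected in tests if func(*args) == expected)
    except Exception:
        return 0.1                     # syntactically valid but not executable
    return 0.1 + 0.9 * passed / max(len(tests), 1)
```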

Alternatively, since most LLMs today come pre-trained, a better fine-tuning step could reduce the risks associated with incorrect code generation. He and Vechev (2023) took this approach and had promising results on the CWEs they investigated. However, there is conflicting evidence: results from Wu et al. (2023) seem to indicate that this approach is inherently limited to fixing a very narrow and simple class of bugs. More studies analyzing the impact of fine-tuning models with curated security datasets are needed to assess this mitigation strategy.

5.4.3 Generation procedure

Code quality is improved by collecting more context than what the user typically provides through their prompts ( Pearce et al., 2022 ; Jesse et al., 2023 ). The ability to use auxiliary data, such as other project files, file names, etc., seems to explain the significant difference in code acceptance between GitHub Copilot and its bare model, OpenAI Codex. Exploring guidelines and best practices for writing effective prompts may also be worthwhile: Nair et al. (2023) explored the possibility of creating prompt strategies and techniques for ChatGPT that output secure code.
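A minimal sketch of such a prompt strategy is shown below; the generate callable and the wording of the security preamble are our own assumptions, in the spirit of Nair et al. (2023), not artifacts from that work.

```python
from typing import Callable

SECURITY_PREAMBLE = (
    "You are generating production code. Use parameterized SQL queries, never build "
    "queries by string concatenation, never hard-code credentials, and validate all "
    "external input before use."
)

def secure_prompt(task_description: str) -> str:
    # Prepend explicit security requirements so they become part of the model's context.
    return f"{SECURITY_PREAMBLE}\n\nTask:\n{task_description}\n"

def generate_with_guidance(task_description: str, generate: Callable[[str], str]) -> str:
    # `generate` is any text-in/text-out code model client supplied by the caller.
    return generate(secure_prompt(task_description))
```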

From an adversarial point of view, Niu et al. (2023) provide evidence of the impact of context and prompts on exploiting AI models. There are ongoing efforts to limit which prompts are accepted by AI systems by safeguarding them ( Pa Pa et al., 2023 ). However, Pa Pa et al. (2023) showed, with mixed results, how to bypass these limitations, a practice called “jailbreaking.” Further work is needed on this mitigation strategy and its effectiveness.

Independently, post-processing the output (SVEN is one example; He and Vechev, 2023 ) has a measurable impact on code quality and is LLM-agnostic, operating without the need for re-training or fine-tuning. Presumably, non-LLM static analyzers or linters may be integrated as part of the code generation procedure to provide checks along the way and avoid producing code that is visibly incorrect or dangerous.
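One way to realize this, sketched below under the assumption that the Bandit security linter for Python is installed, is to gate every generated snippet through the linter before accepting it; the wiring is our illustration, not a pipeline from the reviewed papers.

```python
import subprocess
import tempfile

def passes_security_lint(generated_code: str) -> bool:
    # Write the candidate snippet to a temporary file and run Bandit on it.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as handle:
        handle.write(generated_code)
        path = handle.name
    result = subprocess.run(["bandit", "-q", path], capture_output=True, text=True)
    # Bandit exits with a non-zero status when it reports security issues.
    return result.returncode == 0
```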

5.4.4 Integration of AI-generated code into software

Even after all technical countermeasures have been taken to avoid producing code that is obviously incorrect, there remain situations where AI-generated programs contain (non-obvious) vulnerabilities. To a degree, such vulnerabilities could also appear in human-generated code, and there should in any case be procedures to try to catch them as early as possible, through unit, functional, and integration testing, fuzzing, or static analysis. The implementation of security policies and processes remains vital.

However, AI models are specifically trained to produce code that looks correct, meaning that their mistakes may be of a different nature or appearance than those typically made by human programmers, and may be harder to spot. At the same time, the very reason why code generation is appealing is that it increases productivity, and hence the amount of code to be reviewed.

It is therefore essential that software developers who rely on AI code generation keep a level of mistrust with regard to these tools ( Perry et al., 2023 ). It is also likely that code review methodologies should be adjusted in the face of AI-generated code, to look for the specific kinds of mistakes or vulnerabilities that this approach produces.

5.4.5 End-user education

One straightforward suggestion is educating users to assess the quality of software generated with AI models. Among the works we have reviewed, we found no studies that specifically discuss the quality and efficacy of this potential mitigation strategy, so we can only speculate about it from related works. For instance, Moradi Dakhel et al. (2023) compare the code produced by human users with the code generated by GitHub Copilot. The study is not about security; it is about the correctness of implementations of quite well-known algorithms. Still, human users (students with an education in algorithms) performed better than their AI counterparts, but the buggy solutions generated by Copilot were easily fixable by the users. Relevantly, the AI-generated bugs were more easily recognizable and fixable than those produced by other human developers performing the same task.

This observation suggests that using AI could help programmers skilled in debugging write code faster, and that this task should not hide particular complexity for them. As Chen et al. (2021) suggested, “human oversight and vigilance is required for safe use of code generation systems like Codex.” However, removing obvious errors from buggy implementations of well-known algorithms is not the same as spotting security vulnerabilities: the latter task is complex and error-prone, even for experts. Here we speculate that, if AI-generated flaws are naïve, programmers can still gain from using AI provided they back up coding with other instruments used in security engineering (e.g., property checking, code inspection, and static analysis). Possible design changes or decisions at the user interface may also have an impact. However, we have no evidence of whether this speculative idea can work in practice. The question remains open and calls for future research.

6 Threats to validity and future work

Previous literature ( Wohlin et al., 2013 ; Petersen et al., 2015 ) has identified different reliability and validity issues in systematic literature reviews. One of the first elements that needs to be noted is the sample of papers. As explained by Petersen et al. (2015) , the difference between systematic mapping studies and systematic literature reviews lies in the sample's representativeness; mappings do not necessarily need to obtain the whole universe of papers, in contrast with literature reviews. Nevertheless, previous research has found that even two literature reviews on exactly the same subject do not end up with the same sample of papers, which affects reliability. Consequently, to increase reliability, we identified the PICO of our research and used gold-standard research methods for SLRs, such as Kitchenham and Charters (2007) . This strategy helped us develop and test different strings for the databases to obtain the best possible result. Furthermore, aiming to obtain a complete sample, we applied forward snowballing to the whole sample obtained in the first round, as suggested by Wohlin et al. (2013) and Petersen et al. (2015) .

However, there may still be reliability issues with the sample. Firstly, the number of publications on the subject increases daily; therefore, the total number would change depending on the day the sample was obtained. Furthermore, some research on open repositories (such as arXiv) did not explicitly indicate whether it was peer-reviewed; hence, the authors manually checked whether it had been accepted at a peer-reviewed venue. This is why we hypothesize that the snowballing phase provided many more papers, as these had yet to be indexed in the databases and were only available on such platforms. Therefore, the final sample of this research may grow and change depending on the day the data is gathered.

In addition, the sample may differ based on the definition of “code generation.” For this research, and as explained in Section 4 , we worked with the idea that AI models should suggest code (working or not). Some papers fell under our scope even if their main topic was “verification and validation,” as the AI tools they proposed would suggest code. Hence, we focused not only on the development phase of the SDLC but on any phase in which code is suggested. A different handling of “code generation” may provide different results.

On another note, the background and expertise of the researchers affect how papers are classified and how information is extracted (Wohlin et al., 2013). For this reason, we used well-known taxonomies and definitions for our classification schemes, such as Wieringa et al. (2006) for the type of research and MITRE's top vulnerabilities to identify the most commonly discussed vulnerabilities. The objective of using well-known classification schemes and methodologies is to reduce bias (Petersen et al., 2015). However, residual bias cannot be ruled out.

Moreover, to counter authors' bias, every article was reviewed and its data extracted by at least two authors, using a pairing strategy. If, due to time constraints, an article was initially reviewed by only one author, another author would subsequently review the work (Wohlin et al., 2013). If disagreements appeared at any phase, such as inclusion/exclusion or data gathering, a meeting was held to discuss and resolve them (Wohlin et al., 2013). For example, for a couple of papers Author #1 was unsure whether they should be included or excluded based on the quality review, and this was discussed with Author #4. Our objective in using a pairing strategy was to diminish authors' bias throughout the SLR.

Regarding the analysis and comparison of the different articles, one threat to the validity of this SLR is that not all articles use the same taxonomy for vulnerabilities, so they could not be classified under a single scheme. Some articles studied MITRE's CWEs or the Top-25, while others tackled more specific issues (such as jailbreaking, malware creation, single stupid bugs, and human programming behavior). Therefore, comparing vulnerabilities across articles is, at best, complicated and, at worst, a threat to our conclusions. Given the lack of a classification scheme for the wide range of security issues tackled in our sample, we (1) classified the papers based on their own claims, and (2) compared papers on the basis of the programming language used and between papers that researched similar subjects, such as MITRE's CWEs. In this manner, we avoided comparing completely different subjects. As recognized by Petersen et al. (2015), the lack of a classification scheme for specific subjects is a common challenge for systematic mapping studies and literature reviews. Nevertheless, future studies would benefit from a better classification approach if the sample permits.

We have provided the whole sample at https://doi.org/10.5281/zenodo.10666386 for replication and transparency, with the process explained in detail. Each paper has details on why it was included or excluded, at which phase, and with comments to help readers understand and replicate our research. Likewise, we have explained our research methods in as much detail as possible in the paper. Providing these details and open access to the data helps mitigate validity issues that may be present in this study.

Nonetheless, even when using well-known strategies both for conducting the SLR and for mitigating known issues, we cannot rule out the validity and reliability threats inherent to all SLRs. We did our best to mitigate them.

7 Conclusion

By systematically reviewing the state of the art, we aimed to provide insight into the question, "How does the code generation from AI models impact the cybersecurity of the software process?" We can confirm that there is enough evidence to say, unsurprisingly, that code generated by AI is not necessarily secure and does contain security flaws. But, as often happens with AI, the real question is not whether AI is infallible but whether it performs better than humans doing the same task. Unfortunately, the conclusions we gathered from the literature diverge on whether AI-generated security flaws should be approached with particular caution, for instance because of their severity or because they are tricky to spot. Some works report them as naïve and easily detectable, but this result cannot be generalized. Overall, neither hypothesis is clearly favored, because of incomparable differences between the papers' experimental setups, the data sets used for training, the programming languages considered, the types of flaws, and the experimental methodologies followed.

Generally speaking, and regardless of the code production activity (code generation from scratch, code repair, or code suggestion), our analysis reveals that well-documented vulnerabilities have been found in AI-suggested code a non-negligible number of times. Among them, specific vulnerabilities, such as those in MITRE's CWE Top-25, have received special attention in current research, and for a reason. For instance, CWE-787 and CWE-89 received particular attention, as they are in the top 3 of the MITRE CWE ranking. Furthermore, the CWE security scores of code suggested by AI models vary, with some CWEs being more prevalent than others.
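
To illustrate what one of these frequently reported weaknesses looks like in practice, the short sketch below contrasts the CWE-89 (SQL injection) pattern with its parameterized counterpart, using Python's standard sqlite3 module. The snippet is ours and is written for illustration only; it does not reproduce code from any paper in the sample, and CWE-787 (out-of-bounds write) is omitted because it concerns memory-unsafe languages rather than Python.

    # Illustration of the CWE-89 (SQL injection) pattern versus a parameterized query.
    # Purely illustrative; not taken from any paper in the sample.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
    conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

    def find_user_vulnerable(name: str):
        # CWE-89: user input concatenated into the SQL string.
        # An input such as "' OR '1'='1" returns every row in the table.
        query = f"SELECT * FROM users WHERE name = '{name}'"
        return conn.execute(query).fetchall()

    def find_user_safe(name: str):
        # Parameterized query: the driver escapes the value, defusing the injection.
        return conn.execute("SELECT * FROM users WHERE name = ?", (name,)).fetchall()

    print(find_user_vulnerable("' OR '1'='1"))  # leaks the whole table
    print(find_user_safe("' OR '1'='1"))        # returns no rows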

Other works report having found naïve, easy-to-fix bugs; others discovered malware hidden among benign lines; and still others reported an unjustified trust by humans in the quality of AI-generated code, an issue that raises concerns of a more socio-technical nature.

Similarly, when generated with AI support, code in different programming languages shows different security performance. AI-generated Python code seemed to be more secure (i.e., to have fewer bugs) than AI-generated code of the C family. Different authors have hypothesized that this is a consequence of the training data set and its quality. Verilog seems to suffer from shortcomings similar to C's: when comparing the security of AI-generated Verilog to that of C or Python, the literature converges on reporting that the former is worse. Once again, the suggested reason is that the available training data sets for Verilog are smaller and of lower quality than those available for training AI models to generate C or Python code. In addition, there is no identified Top-25 CWE list for Verilog. Java is another commonly studied programming language, with conclusions similar to those stated above. Other programming languages were studied to a lesser extent and could be investigated further.

Looking at security exploits enabled by AI code generation, the most frequently reported are SVEN, CodeAttack, and Codex Leaks. These attacks are reported to be used for decreasing code security, creating adversarial code, and leaking personal data from automatically generated code.

What can be done to mitigate the severity of flaws introduced by AI? Does the literature suggest giving up on AI entirely? No, this is not what anyone suggests: AI is considered an instrument that, despite being imperfect, has a clear advantage in terms of speeding up code production. Instead, different mitigation strategies are suggested, although more research is required to assess their effectiveness and efficacy.

• Modifications to the training dataset are a possibility, but the impacts and trade-offs of such an approach need to be studied;

• Raising awareness of the context of prompts and of how to increase their quality seems to positively affect the security of the generated code;

• Security processes, policies, and a degree of mistrust of AI-generated code could help with security. In other words, AI-generated code should pass specific processes, such as testing and security verification, before being accepted (see the sketch after this list);

• Educating end users of AI models (including models for code generation) about their limits could help. Future research is required in this area.
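
A minimal sketch of the third point, i.e., gating an AI-suggested change behind testing and security verification, is given below. It is ours and purely conceptual: it assumes pytest and Bandit are installed, and the Bandit exit-code convention (non-zero when findings are reported) is an assumption to verify against the installed version.

    # Conceptual gate: merge an AI-suggested change only if the test suite passes
    # and the security scan reports nothing. Assumes pytest and Bandit are installed.
    import subprocess
    import sys

    def gate_ai_change(project_dir: str) -> bool:
        """Accept an AI-suggested change only if tests pass and the scan is clean."""
        tests = subprocess.run(["pytest", project_dir], capture_output=True)
        # Assumption: Bandit exits with a non-zero code when it reports findings.
        scan = subprocess.run(["bandit", "-r", project_dir], capture_output=True)
        return tests.returncode == 0 and scan.returncode == 0

    if __name__ == "__main__":
        accepted = gate_ai_change(sys.argv[1] if len(sys.argv) > 1 else ".")
        print("accept AI-suggested change" if accepted else "send back for human review")
        sys.exit(0 if accepted else 1)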

As a closing remark, we welcome the fact that the study of the security impact of AI models is taking off, and we greet the increased attention that the community is dedicating to the question of how insecure our systems will become as developers continue to resort to AI support in their work. However, it is still premature to draw conclusions on the impact of the flaws introduced by AI models and, in particular, on how those flaws compare with the ones introduced by human programmers. Although several mitigation techniques have been suggested, which combination of them is effective or practical is a question that still needs experimental data.

We have to accept that AI will be used more and more to produce code, and that both the practice and the tools are still far from flawless. Until more evidence is available, the general agreement is to exercise caution: AI models for code generation need to be approached with due care.

Data availability statement

The dataset comprising the sample of papers for this study can be found at: https://zenodo.org/records/11092334 .

Author contributions

CN-R: Conceptualization, Data curation, Investigation, Methodology, Project administration, Resources, Validation, Writing—original draft, Writing—review & editing. RG-S: Investigation, Visualization, Writing—original draft, Writing—review & editing. AS: Conceptualization, Investigation, Methodology, Writing—review & editing. GL: Conceptualization, Funding acquisition, Investigation, Writing—original draft, Writing—review & editing.

Funding

The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. This research was funded in whole, or in part, by the Luxembourg National Research Fund (FNR), grant: NCER22/IS/16570468/NCER-FT.

Acknowledgments

The authors thank Marius Lombard-Platet for his feedback, comments, and proofreading of the paper.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Footnotes

1. ^ The number of different tokens that a model can handle, and their internal representation, is a design choice.

2. ^ This is the meaning of "GPT": generative pre-trained transformer.

3. ^ Some authors claim that, because there is an encoding-decoding step and the output is probabilistic, data is not directly copy-pasted. Whatever the merits of this argument, LLMs can and do reproduce parts of their training set ( Huang et al., 2023 ).

4. ^ We would like to note that certain CWE prompting scenarios, when compared across authors, showed dissimilar security rates.

5. ^ The authors do highlight that their proposal is not a poisoning attack.

6. ^ In reality, multiple (broadly incompatible) versions of Python coexist, but this is unimportant in the context of our discussion and we refer to them collectively as “Python.”

7. ^ One could argue for instance that the vulnerabilities occur in large proportions in generated code that fails basic functional testing, and would never make it into production because of this. Or, the other way around, that code without security vulnerabilities could still be functionally incorrect, which also causes issues. A full study of these effects remains to be done.

8. ^ They were tasked to write a program that “takes as input a string path representing a file path and returns a File object for the file at 'path' ( Perry et al., 2023 ).”

9. ^ Following the authors of our sample, we use “C” to refer to the various versions of the C standard, indiscriminately.

10. ^ Here again we conflate all versions of Java together.

11. ^ The authors define single stupid bugs as “...bugs that have single-statement fixes that match a small set of bug templates. They are called 'simple' because they are usually fixed by small changes and 'stupid' because, once located, a developer can usually fix them quickly with minor changes ( Jesse et al., 2023 ).”

12. ^ The attack is explained in detail in Section 5.2.

References

Ahmad, W., Chakraborty, S., Ray, B., and Chang, K.-W. (2021). "Unified pre-training for program understanding and generation," in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, eds. K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, et al. (Association for Computational Linguistics), 2655–2668.

Asare, O., Nagappan, M., and Asokan, N. (2023). Is GitHub's Copilot as bad as humans at introducing vulnerabilities in code? Empir. Softw. Eng . 28:129. doi: 10.48550/arXiv.2204.04741

Becker, B. A., Denny, P., Finnie-Ansley, J., Luxton-Reilly, A., Prather, J., and Santos, E. A. (2023). “Programming is hard-or at least it used to be: educational opportunities and challenges of ai code generation,” in Proceedings of the 54th ACM Technical Symposium on Computer Science Education V.1 (New York, NY), 500–506.

Botacin, M. (2023). “GPThreats-3: is automatic malware generation a threat?” in 2023 IEEE Security and Privacy Workshops (SPW) (San Francisco, CA: IEEE), 238–254.

Britz, D., Goldie, A., Luong, T., and Le, Q. (2017). Massive exploration of neural machine translation architectures. ArXiv e-prints . doi: 10.48550/arXiv.1703.03906

Burgess, M. (2023). Criminals Have Created Their Own ChatGPT Clones . Wired.

Carrera-Rivera, A., Ochoa, W., Larrinaga, F., and Lasa, G. (2022). How-to conduct a systematic literature review: a quick guide for computer science research. MethodsX 9:101895. doi: 10.1016/j.mex.2022.101895

Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., et al. (2021). Evaluating large language models trained on code. CoRR abs/2107.03374. doi: 10.48550/arXiv.2107.03374

Fan, J., Li, Y., Wang, S., and Nguyen, T. N. (2020). "A C/C++ code vulnerability dataset with code changes and CVE summaries," in Proceedings of the 17th International Conference on Mining Software Repositories, MSR '20 (New York, NY: Association for Computing Machinery), 508–512.

Feng, Z., Guo, D., Tang, D., Duan, N., Feng, X., Gong, M., et al. (2020). “CodeBERT: a pre-trained model for programming and natural languages,” in Findings of the Association for Computational Linguistics: EMNLP 2020 , eds. T. Cohn, Y. He, and Y. Liu (Association for Computational Linguistics), 1536–1547.

Fried, D., Aghajanyan, A., Lin, J., Wang, S. I., Wallace, E., Shi, F., et al. (2022). InCoder: a generative model for code infilling and synthesis. ArXiv abs/2204.05999. doi: 10.48550/arXiv.2204.05999

Guo, D., Ren, S., Lu, S., Feng, Z., Tang, D., Liu, S., et al. (2021). “GraphCodeBERT: pre-training code representations with data flow,” in 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3–7, 2021 . OpenReview.net .

He, J., and Vechev, M. (2023). “Large language models for code: Security hardening and adversarial testing,” in Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (New York, NY), 1865–1879.

Henkel, J., Ramakrishnan, G., Wang, Z., Albarghouthi, A., Jha, S., and Reps, T. (2022). “Semantic robustness of models of source code,” in 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER) (Honolulu, HI), 526–537.

Holtzman, A., Buys, J., Du, L., Forbes, M., and Choi, Y. (2020). “The curious case of neural text degeneration,” in 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020 . OpenReview.net .

Huang, Y., Li, Y., Wu, W., Zhang, J., and Lyu, M. R. (2023). Do Not Give Away My Secrets: Uncovering the Privacy Issue of Neural Code Completion Tools .

HuggingFaces (2022). Codeparrot. Available online at: https://huggingface.co/codeparrot/codeparrot (accessed February, 2024).

Jain, P., Jain, A., Zhang, T., Abbeel, P., Gonzalez, J., and Stoica, I. (2021). “Contrastive code representation learning,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , eds. M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih (Punta Cana: Association for Computational Linguistics), 5954–5971.

Jesse, K., Ahmed, T., Devanbu, P. T., and Morgan, E. (2023). “Large language models and simple, stupid bugs,” in 2023 IEEE/ACM 20th International Conference on Mining Software Repositories (MSR) (Los Alamitos, CA: IEEE Computer Society), 563–575.

Jha, A., and Reddy, C. K. (2023). “CodeAttack: code-based adversarial attacks for pre-trained programming language models,” in Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37 , 14892–14900.

Jia, J., Srikant, S., Mitrovska, T., Gan, C., Chang, S., Liu, S., et al. (2023). “CLAWSAT: towards both robust and accurate code models,” in 2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER) (Los Alamitos, CA: IEEE), 212–223.

Karampatsis, R.-M., and Sutton, C. (2020). “How often do single-statement bugs occur? The manySStuBs4J dataset,” in Proceedings of the 17th International Conference on Mining Software Repositories, MSR '20 (Seoul: Association for Computing Machinery), 573–577.

Kitchenham, B., and Charters, S. (2007). Guidelines for performing systematic literature reviews in software engineering. Tech. Rep. Available online at: https://scholar.google.com/citations?view_op=view_citation&hl=en&user=CQDOm2gAAAAJ&citation_for_view=CQDOm2gAAAAJ:d1gkVwhDpl0C

Kitchenham, B., Sjøberg, D. I., Brereton, O. P., Budgen, D., Dybå, T., Höst, M., et al. (2010). “Can we evaluate the quality of software engineering experiments?,” in Proceedings of the 2010 ACM-IEEE International Symposium on Empirical Software Engineering and Measurement (New York, NY), 1–8.

Li, R., Allal, L. B., Zi, Y., Muennighoff, N., Kocetkov, D., Mou, C., et al. (2023). StarCoder: may the source be with you! arXiv preprint arXiv:2305.06161 . doi: 10.48550/arXiv.2305.06161

Liguori, P., Improta, C., Natella, R., Cukic, B., and Cotroneo, D. (2023). Who evaluates the evaluators? On automatic metrics for assessing AI-based offensive code generators. Expert Syst. Appl. 225:120073. doi: 10.48550/arXiv.2212.06008

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., et al. (2019). RoBERTa: a robustly optimized BERT pretraining approach. CoRR abs/1907.11692. doi: 10.48550/arXiv.1907.11692

Moradi Dakhel, A., Majdinasab, V., Nikanjam, A., Khomh, F., Desmarais, M. C., and Jiang, Z. M. J. (2023). GitHub Copilot AI pair programmer: asset or liability? J. Syst. Softw . 203:111734. doi: 10.48550/arXiv.2206.15331

Multiple authors (2021). GPT Code Clippy: The Open Source Version of GitHub Copilot .

Nair, M., Sadhukhan, R., and Mukhopadhyay, D. (2023). “How hardened is your hardware? Guiding ChatGPT to generate secure hardware resistant to CWEs,” in International Symposium on Cyber Security, Cryptology, and Machine Learning (Berlin: Springer), 320–336.

Natella, R., Liguori, P., Improta, C., Cukic, B., and Cotroneo, D. (2024). AI code generators for security: friend or foe? IEEE Secur. Priv . 2024:1219. doi: 10.48550/arXiv.2402.01219

Nijkamp, E., Pang, B., Hayashi, H., Tu, L., Wang, H., Zhou, Y., et al. (2023). CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis . ICLR.

Nikitopoulos, G., Dritsa, K., Louridas, P., and Mitropoulos, D. (2021). “CrossVul: a cross-language vulnerability dataset with commit data,” in Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2021 (New York, NY: Association for Computing Machinery), 1565–1569.

Niu, L., Mirza, S., Maradni, Z., and Pöpper, C. (2023). “CodexLeaks: privacy leaks from code generation language models in GitHub's Copilot,” in 32nd USENIX Security Symposium (USENIX Security 23) , 2133–2150.

Olson, M., Wyner, A., and Berk, R. (2018). Modern neural networks generalize on small data sets. Adv. Neural Inform. Process. Syst . 31, 3623–3632. Available online at: https://proceedings.neurips.cc/paper/2018/hash/fface8385abbf94b4593a0ed53a0c70f-Abstract.html

Pa Pa, Y. M., Tanizaki, S., Kou, T., Van Eeten, M., Yoshioka, K., and Matsumoto, T. (2023). “An attacker's dream? Exploring the capabilities of chatgpt for developing malware,” in Proceedings of the 16th Cyber Security Experimentation and Test Workshop (New York, NY), 10–18.

Pearce, H., Ahmad, B., Tan, B., Dolan-Gavitt, B., and Karri, R. (2022). “Asleep at the keyboard? Assessing the security of GitHub Copilot's code contributions,” in 2022 IEEE Symposium on Security and Privacy (SP) (IEEE), 754–768.

Pearce, H., Tan, B., Ahmad, B., Karri, R., and Dolan-Gavitt, B. (2023). “Examining zero-shot vulnerability repair with large language models,” in 2023 IEEE Symposium on Security and Privacy (SP) (Los Alamitos, CA: IEEE), 2339–2356.

Perry, N., Srivastava, M., Kumar, D., and Boneh, D. (2023). “Do users write more insecure code with AI assistants?,” in Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (New York, NY), 2785–2799.

Petersen, K., Vakkalanka, S., and Kuzniarz, L. (2015). Guidelines for conducting systematic mapping studies in software engineering: an update. Inform. Softw. Technol . 64, 1–18. doi: 10.1016/j.infsof.2015.03.007

Sandoval, G., Pearce, H., Nys, T., Karri, R., Garg, S., and Dolan-Gavitt, B. (2023). “Lost at C: a user study on the security implications of large language model code assistants,” in 32nd USENIX Security Symposium (USENIX Security 23) (Anaheim, CA: USENIX Association), 2205–2222.

Siddiq, M. L., Majumder, S. H., Mim, M. R., Jajodia, S., and Santos, J. C. (2022). “An empirical study of code smells in transformer-based code generation techniques,” in 2022 IEEE 22nd International Working Conference on Source Code Analysis and Manipulation (SCAM) (Limassol: IEEE), 71–82.

Storhaug, A., Li, J., and Hu, T. (2023). “Efficient avoidance of vulnerabilities in auto-completed smart contract code using vulnerability-constrained decoding,” in 2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE) (Los Alamitos, CA: IEEE), 683–693.

Tan, C., Sun, F., Kong, T., Zhang, W., Yang, C., and Liu, C. (2018). “A survey on deep transfer learning,” in Artificial Neural Networks and Machine Learning – ICANN 2018 , eds. V. Kůrková, Y. Manolopoulos, B. Hammer, L. Iliadis, and I. Maglogiannis (Cham. Springer International Publishing), 270–279.

Tony, C., Ferreyra, N. E. D., and Scandariato, R. (2022). “GitHub considered harmful? Analyzing open-source projects for the automatic generation of cryptographic API call sequences,” in 2022 IEEE 22nd International Conference on Software Quality, Reliability and Security (QRS) (Guangzhou: IEEE), 270–279.

Tony, C., Mutas, M., Ferreyra, N. E. D., and Scandariato, R. (2023). “LLMSecEval: a dataset of natural language prompts for security evaluations,” in 20th IEEE/ACM International Conference on Mining Software Repositories, MSR 2023, Melbourne, Australia, May 15-16, 2023 (Los Alamitos, CA: IEEE), 588–592.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., et al. (2017). “Attention is all you need,” in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017 , eds. I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, et al. (Long Beach, CA), 5998–6008.

Wang, Y., Wang, W., Joty, S. R., and Hoi, S. C. H. (2021). “CodeT5: identifier-aware unified pre-trained encoder-decoder models for code understanding and generation,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , 8696–8708.

Wartschinski, L., Noller, Y., Vogel, T., Kehrer, T., and Grunske, L. (2022). VUDENC: vulnerability detection with deep learning on a natural codebase for Python. Inform. Softw. Technol . 144:106809. doi: 10.48550/arXiv.2201.08441

Wieringa, R., Maiden, N., Mead, N., and Rolland, C. (2006). Requirements engineering paper classification and evaluation criteria: a proposal and a discussion. Requir. Eng . 11, 102–107. doi: 10.1007/s00766-005-0021-6

Wohlin, C. (2014). “Guidelines for snowballing in systematic literature studies and a replication in software engineering,” in Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering (New York, NY), 1–10.

Wohlin, C., Runeson, P., Neto, P. A. d. M. S., Engström, E., do Carmo Machado, I., and De Almeida, E. S. (2013). On the reliability of mapping studies in software engineering. J. Syst. Softw . 86, 2594–2610. doi: 10.1016/j.jss.2013.04.076

Wu, Y., Jiang, N., Pham, H. V., Lutellier, T., Davis, J., Tan, L., et al. (2023). “How effective are neural networks for fixing security vulnerabilities,” in Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2023 (New York, NY: Association for Computing Machinery), 1282–1294.

Xu, F. F., Alon, U., Neubig, G., and Hellendoorn, V. J. (2022). “A systematic evaluation of large language models of code,” in Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming (New York, NY), 1–10.

Keywords: artificial intelligence, security, software engineering, programming, code generation

Citation: Negri-Ribalta C, Geraud-Stewart R, Sergeeva A and Lenzini G (2024) A systematic literature review on the impact of AI models on the security of code generation. Front. Big Data 7:1386720. doi: 10.3389/fdata.2024.1386720

Received: 15 February 2024; Accepted: 22 April 2024; Published: 13 May 2024.

Copyright © 2024 Negri-Ribalta, Geraud-Stewart, Sergeeva and Lenzini. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Claudia Negri-Ribalta, claudia.negriribalta@uni.lu

This article is part of the Research Topic "Cybersecurity and Artificial Intelligence: Advances, Challenges, Opportunities, Threats."
