Published on October 30, 2022 by Shona McCombes. Revised on October 19, 2023.
The research question is one of the most important parts of your research paper, thesis or dissertation. It’s important to spend some time assessing and refining your question before you get started.
The exact form of your question will depend on a few things, such as the length of your project, the type of research you’re conducting, the topic, and the research problem. However, all research questions should be focused, specific, and relevant to a timely social or scholarly issue.
Once you’ve read our guide on how to write a research question, you can use these examples to craft your own.
Research question | Explanation |
---|---|
The first question is not focused enough. The second question is more specific, using clearly defined concepts. | |
Starting with “why” often means that your question is not focused enough: there are too many possible answers. By targeting just one aspect of the problem, the second question offers a clear path for research. | |
The first question is too broad and subjective: there are no clear criteria for what counts as “better.” The second question is much more researchable. It uses clearly defined terms and narrows its focus to a specific population. | |
It is generally not feasible for academic research to answer broad normative questions. The second question is more specific, aiming to gain an understanding of possible solutions in order to make informed recommendations. | |
The first question is too simple: it can be answered with a simple yes or no. The second question is more complex, requiring in-depth investigation and the development of an original argument. | |
The first question is too broad and not very original. The second question identifies an underexplored aspect of the topic that requires investigation of various sources to answer. | |
The first question is not focused enough: it tries to address two different problems (the quality of sexual health services and LGBT support services). Even though the two issues are related, it’s not clear how the research will bring them together. The second integrates the two problems into one focused, specific question. | |
The first question is too simple, asking for a straightforward fact that can be easily found online. The second is a more complex question that requires analysis and detailed discussion to answer. | |
? dealt with the theme of racism through casting, staging, and allusion to contemporary events? | The first question is not original enough: it would be very difficult to contribute anything new. The second question takes a specific angle to make an original argument, and has more relevance to current social concerns and debates. |
The first question asks for a ready-made solution, and is not focused or researchable. The second question is a clearer comparative question, but note that it may not be practically feasible. For a smaller research project or thesis, it could be narrowed down further to focus on the effectiveness of drunk driving laws in just one or two countries. |
Note that the design of your research question can depend on what method you are pursuing. Here are a few options for qualitative, quantitative, and statistical research questions.
Type of research | Example question |
---|---|
Qualitative research question | |
Quantitative research question | |
Statistical research question | |
If you want to know more about the research process , methodology , research bias , or statistics , make sure to check out some of our other articles with explanations and examples.
Methodology
Statistics
Research bias
If you want to cite this source, you can copy and paste the citation below.
McCombes, S. (2023, October 19). 10 Research Question Examples to Guide your Research Project. Scribbr. Retrieved September 6, 2024, from https://www.scribbr.com/research-process/research-question-examples/
Business research: definition, types & methods.
What is business research and why does it matter? Here are some of the ways business research can be helpful to your company, whichever method you choose to carry it out.
Business research helps companies make better business decisions by gathering information. The scope of the term business research is quite broad – it acts as an umbrella that covers every aspect of business, from finances to advertising creative. It can include research methods which help a company better understand its target market. It could focus on customer experience and assess customer satisfaction levels. Or it could involve sizing up the competition through competitor research.
Often when carrying out business research, companies are looking at their own data, sourced from their employees, their customers and their business records. However, business researchers can go beyond their own company in order to collect relevant information and understand patterns that may help leaders make informed decisions. For example, a business may carry out ethnographic research where the participants are studied in the context of their everyday lives, rather than just in their role as consumer, or look at secondary data sources such as open access public records and empirical research carried out in academic studies.
There is also a body of knowledge about business in general that can be mined for business research purposes, for example, organizational theory and general studies of consumer behavior.
We live in a time of high speed technological progress and hyper-connectedness. Customers have an entire market at their fingertips and can easily switch brands if a competitor is offering something better than you are. At the same time, the world of business has evolved to the point of near-saturation. It’s hard to think of a need that hasn’t been addressed by someone’s innovative product or service.
The combination of ease of switching, high consumer awareness and a super-evolved marketplace crowded with companies and their offerings means that businesses must do whatever they can to find and maintain an edge. Business research is one of the most useful weapons in the fight against business obscurity, since it allows companies to gain a deep understanding of buyer behavior and stay up to date at all times with detailed information on their market.
Thanks to the standard of modern business research tools and methods, it’s now possible for business analysts to track the intricate relationships between competitors, financial markets, social trends, geopolitical changes, world events, and more.
Find out how to conduct your own market research and make use of existing market research data with our Ultimate guide to market research.
Business research methods vary widely, but they can be grouped into two broad categories: qualitative research and quantitative research.
Qualitative business research deals with non-numerical data such as people’s thoughts, feelings and opinions. It relies heavily on the observations of researchers, who collect data from a relatively small number of participants – often through direct interactions.
Qualitative research interviews take place one-on-one between a researcher and participant. In a business context, the participant might be a customer, a supplier, an employee or other stakeholder. Using open-ended questions, the researcher conducts the interview in either a structured or unstructured format. Structured interviews stick closely to a question list and scripted phrases, while unstructured interviews are more conversational and exploratory. As well as listening to the participant’s responses, the interviewer will observe non-verbal information such as posture, tone of voice and facial expression.
Like the qualitative interview, a focus group is a form of business research that uses direct interaction between the researcher and participants to collect data. In focus groups, a small number of participants (usually around 10) take part in a group discussion led by a researcher who acts as moderator. The researcher asks questions and takes note of the responses, as in a qualitative research interview. Sampling for focus groups is usually purposive rather than random, so that the group members represent varied points of view.
In an observational study, the researcher may not directly interact with participants at all, but will pay attention to practical situations, such as a busy sales floor full of potential customers, or a conference for some relevant business activity. They will hear people speak and watch their interactions, then record relevant data such as behavior patterns that relate to the subject they are interested in. Observational studies can be classified as a type of ethnographic research. They can be used to gain insight about a company’s target audience in their everyday lives, or study employee behaviors in actual business situations.
Ethnographic research is an immersive research design in which the researcher observes people’s behavior in their natural environment. Ethnography originated in anthropology and is now practiced across a wide range of social sciences.
Ethnography is used to support a designer’s deeper understanding of the design problem, including the relevant domain, audience(s), processes, goals and context(s) of use.
The ethnographic research process is a popular methodology in the software development lifecycle, where it helps create better UI/UX flows based on the real needs of end users.
If you truly want to understand your customers’ needs, wants, desires and pain points, “walking a mile” in their shoes enables this. Ethnographic research is the kind of deeply rooted research through which you truly learn your target audience’s problems so you can craft the right solution.
A case study is a detailed piece of research that provides in-depth knowledge about a specific person, place or organization. In the context of business research, case study research might focus on organizational dynamics or company culture in an actual business setting, and case studies have been used to develop new theories about how businesses operate. Proponents of case study research feel that it adds significant value in making theoretical and empirical advances. However, its detractors point out that it can be time-consuming and expensive, requiring highly skilled researchers to carry it out.
Quantitative research focuses on countable data that is objective in nature. It relies on finding the patterns and relationships that emerge from mass data – for example by analyzing the material posted on social media platforms, or via surveys of the target audience. Data collected through quantitative methods is empirical in nature and can be analyzed using statistical techniques. Unlike qualitative approaches, a quantitative research method is usually reliant on finding the right sample size, as this will determine whether the results are representative. These are just a few methods – there are many more.
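The sample-size point above can be sketched with the standard formula for estimating a population proportion. This is a minimal illustration, not from the article; the function name and default values are my own choices.

```python
import math

def sample_size_for_proportion(confidence_z: float = 1.96,
                               expected_p: float = 0.5,
                               margin_of_error: float = 0.05) -> int:
    """Minimum sample size for estimating a population proportion.

    Uses the standard formula n = z^2 * p * (1 - p) / e^2, with defaults
    of 95% confidence (z = 1.96), maximum variance (p = 0.5), and a
    +/-5% margin of error. Illustrative helper, not from the article.
    """
    n = (confidence_z ** 2) * expected_p * (1 - expected_p) / margin_of_error ** 2
    return math.ceil(n)

print(sample_size_for_proportion())                      # 385 respondents at +/-5%
print(sample_size_for_proportion(margin_of_error=0.03))  # 1068 respondents at +/-3%
```

Note how tightening the margin of error sharply increases the required sample, which is one reason quantitative studies must budget for sample size up front.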
Surveys are one of the most effective ways to conduct business research. They use a highly structured questionnaire which is distributed to participants, typically online (although in the past, face to face and telephone surveys were widely used). The questions are predominantly closed-ended, limiting the range of responses so that they can be grouped and analyzed at scale using statistical tools. However surveys can also be used to get a better understanding of the pain points customers face by providing open field responses where they can express themselves in their own words. Both types of data can be captured on the same questionnaire, which offers efficiency of time and cost to the researcher.
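As a minimal sketch of how closed-ended responses can be grouped and analyzed at scale, consider tallying a single survey item; the item and responses below are hypothetical.

```python
from collections import Counter

# Hypothetical closed-ended survey item: a 5-point satisfaction scale.
responses = ["Satisfied", "Very satisfied", "Neutral", "Satisfied",
             "Dissatisfied", "Satisfied", "Very satisfied", "Neutral"]

# Group identical responses and report counts and shares.
counts = Counter(responses)
total = len(responses)
for option, n in counts.most_common():
    print(f"{option:<16} {n:>2}  ({n / total:.0%})")
```

Open-field responses from the same questionnaire would instead go through qualitative coding before any counting is meaningful.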
Correlational research looks at the relationship between two variables, neither of which is manipulated by the researcher. For example, this might be the in-store sales of a certain product line and the proportion of female customers subscribed to a mailing list. Using statistical analysis methods, researchers can determine the strength of the correlation and even discover intricate relationships between the two variables. Compared with simple observation and intuition, correlation can reveal further information about business activity and its impact, pointing the way towards potential improvements and more revenue.
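A correlational analysis like the mailing-list example can be sketched as follows. The data are invented for illustration, and `pearson_r` is a hand-rolled helper rather than a library function.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Hypothetical weekly data: share of female subscribers on the mailing
# list (%) and in-store unit sales of the product line.
subscriber_share = [21, 24, 25, 28, 30, 33]
unit_sales = [130, 135, 148, 152, 161, 170]

r = pearson_r(subscriber_share, unit_sales)
print(f"r = {r:.2f}")  # close to +1, i.e. a strong positive correlation
```

Remember that even a strong correlation says nothing about which variable, if either, is driving the other.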
It may sound like something that is strictly for scientists, but experimental research is used by businesses and scholars alike. When conducted as part of the business intelligence process, experimental research is used to test different tactics to see which ones are most successful – for example one marketing approach versus another. In the simplest form of experimental research, the researcher identifies a dependent variable and an independent variable. The null hypothesis is that the independent variable has no effect on the dependent variable, and the researcher manipulates the independent variable to test this assumption. In a business context, the null hypothesis might be that price has no relationship to customer satisfaction. The researcher manipulates the price and observes the C-Sat scores to see if there’s an effect.
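The price-versus-satisfaction experiment could be analyzed with a two-sample test along these lines. The scores are hypothetical, and this helper computes only Welch's t statistic, not a p-value.

```python
import statistics

def welch_t(a, b):
    """Welch's t statistic for the difference between two sample means."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    se = (var_a / len(a) + var_b / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se

# Hypothetical C-Sat scores (1-10) collected under two price points.
csat_price_low = [8, 9, 7, 8, 9, 8, 7, 9]
csat_price_high = [6, 7, 7, 5, 6, 7, 6, 6]

t = welch_t(csat_price_low, csat_price_high)
print(f"t = {t:.2f}")
# A |t| well above ~2 suggests rejecting the null hypothesis that price
# has no effect on satisfaction. A proper analysis would compute a
# p-value, e.g. with scipy.stats.ttest_ind(equal_var=False).
```

With real data you would also check sample sizes and randomization before trusting the comparison.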
You can make the business research process much quicker and more efficient by selecting the right tools. Business research methods like surveys and interviews demand tools and technologies that can store vast quantities of data while making them easy to access and navigate. If your system can also carry out statistical analysis, and provide predictive recommendations to help you with your business decisions, so much the better.
[OBJECTIVE]
Subject: Business Research Methods
Time Allowed: 15 Minutes
Maximum Marks: 10
NOTE: Attempt this Paper on this Question Sheet only. Please encircle the correct option. Division of marks is given in front of each question. This Paper will be collected back after expiry of time limit mentioned above.
Part-I Encircle the right answer, cutting and overwriting is not allowed. (10)
1. The degree of exactness or exactitude in scientific research is known as
   a) Purposiveness b) Rigor c) Objectivity d) Testability
2. The artificial study setting is known as
   a) Artificial study b) Contrived c) Non-contrived d) Both a and b
3. A scale that measures both the direction and intensity of the attributes of a concept
   a) Staple scale b) Dichotomous scale c) Likert scale d) Constant sum rating scale
4. A subset or subgroup of the population chosen for study
   a) Subject b) Sample c) Population frame d) Element
5. The hypothesis “What is the distribution of hypertensive patients by income level?” is an example of a
   a) Descriptive hypothesis b) Relational hypothesis c) Correlational hypothesis d) Causal hypothesis
6. The most powerful scale:
   a) Nominal scale b) Ordinal scale c) Interval scale d) Ratio scale
7. The paired comparison scale is used when, among a small number of objects, respondents are asked to choose between ______ objects at a time.
   a) Two b) Three c) Four d) None of these
8. _____ is a test of how consistently a measuring instrument measures whatever concept it is measuring.
   a) Validity b) Reliability c) Content validity d) Construct validity
9. A question that lends itself to different possible responses to its subparts is called a:
   a) Loaded question b) Leading question c) Double-barreled question d) Ambiguous question
10. Collecting the necessary data without becoming an integral part of the organizational system:
   a) Participant-observer b) Non-participant observer c) Assistant observer d) None of these
[SUBJECTIVE]
Time Allowed: 2 Hours and 45 Minutes
Maximum Marks: 50
NOTE: ATTEMPT THIS (SUBJECTIVE) ON THE SEPARATE ANSWER SHEET PROVIDED.
Part-II Give Short answers, Each question carries equal marks. (20)
Q# 1: What is descriptive research?
Q# 2: Define Simple Random Sampling?
Q# 3: Define ratio scale with the help of an example.
Q# 4: Differentiate between cross-sectional and longitudinal research.
Q# 5: Explain semi structured interview.
Q# 6: What is meant by deductive reasoning?
Q# 7: Write down two advantages and two disadvantages of external researcher.
Q# 8: Explain the funneling technique of questioning.
Q# 9: Explain any two possible threats to internal validity in experimental design.
Q# 10: Discuss the pros and cons of observational studies.
Part-III Give detailed answers, Each question carries equal marks. (30)
Q# 1: What is hypothetical-deductive method of research? Explain the steps involved in this method of research with the help of an example.
Q# 2: What is reliability and validity in research? How can you assess the reliability and validity of qualitative research?
Q# 3: What is stratified sampling technique? What are its different types? Give an example of a situation where you would use stratified sampling.
B.Com Business Research Methods previous year question papers
Download Calicut University B.Com V semester Business Research Methods previous year question papers.
Download the B.Com Business Research Methods previous question paper of Nov 2022.
MBA - Business Research Methods - BA4205 (Anna University 2021 Regulation) - Notes, Important Questions, Semester Question Paper PDF Download
Peer Reviewed
Academic journals, archives, and repositories are seeing an increasing number of questionable research papers clearly produced using generative AI. They are often created with widely available, general-purpose AI applications, most likely ChatGPT, and mimic scientific writing. Google Scholar easily locates and lists these questionable papers alongside reputable, quality-controlled research. Our analysis of a selection of questionable GPT-fabricated scientific papers found in Google Scholar shows that many are about applied, often controversial topics susceptible to disinformation: the environment, health, and computing. The resulting enhanced potential for malicious manipulation of society’s evidence base, particularly in politically divisive domains, is a growing concern.
Swedish School of Library and Information Science, University of Borås, Sweden
Department of Arts and Cultural Sciences, Lund University, Sweden
Division of Environmental Communication, Swedish University of Agricultural Sciences, Sweden
The use of ChatGPT to generate text for academic papers has raised concerns about research integrity. Discussion of this phenomenon is ongoing in editorials, commentaries, opinion pieces, and on social media (Bom, 2023; Stokel-Walker, 2024; Thorp, 2023). There are now several lists of papers suspected of GPT misuse, and new papers are constantly being added (see, for example, Academ-AI, https://www.academ-ai.info/ , and Retraction Watch, https://retractionwatch.com/papers-and-peer-reviews-with-evidence-of-chatgpt-writing/ ). While many legitimate uses of GPT for research and academic writing exist (Huang & Tan, 2023; Kitamura, 2023; Lund et al., 2023), its undeclared use—beyond proofreading—has potentially far-reaching implications for both science and society, but especially for their relationship. It therefore seems important to extend the discussion to one of the most accessible and well-known intermediaries between science (but also certain types of misinformation) and the public, namely Google Scholar. This is also a response to legitimate concerns that the discussion of generative AI and misinformation needs to be more nuanced and empirically substantiated (Simon et al., 2023).
Google Scholar, https://scholar.google.com , is an easy-to-use academic search engine. It is available for free, and its index is extensive (Gusenbauer & Haddaway, 2020). It is also often touted as a credible source for academic literature and even recommended in library guides, by media and information literacy initiatives, and fact checkers (Tripodi et al., 2023). However, Google Scholar lacks the transparency and adherence to standards that usually characterize citation databases. Instead, Google Scholar uses automated crawlers, like Google’s web search engine (Martín-Martín et al., 2021), and the inclusion criteria are based on primarily technical standards, allowing any individual author—with or without scientific affiliation—to upload papers to be indexed (Google Scholar Help, n.d.). It has been shown that Google Scholar is susceptible to manipulation through citation exploits (Antkare, 2020) and by providing access to fake scientific papers (Dadkhah et al., 2017). A large part of Google Scholar’s index consists of publications from established scientific journals or other forms of quality-controlled, scholarly literature. However, the index also contains a large amount of gray literature, including student papers, working papers, reports, preprint servers, and academic networking sites, as well as material from so-called “questionable” academic journals, including paper mills. The search interface does not offer the possibility to filter the results meaningfully by material type, publication status, or form of quality control, such as limiting the search to peer-reviewed material.
To understand the occurrence of ChatGPT (co-)authored work in Google Scholar’s index, we scraped it for publications containing one of two common ChatGPT responses (see Appendix A) that we encountered on social media and in media reports (DeGeurin, 2024). The results of our descriptive statistical analyses showed that around 62% did not declare the use of GPTs. Most of these GPT-fabricated papers were found in non-indexed journals and working papers, but some cases included research published in mainstream scientific journals and conference proceedings. (Here, indexed journals are scholarly journals indexed by abstract and citation databases such as Scopus and Web of Science, where indexation implies high scientific quality; non-indexed journals fall outside this indexation.) More than half (57%) of these GPT-fabricated papers concerned policy-relevant subject areas susceptible to influence operations. To avoid increasing the visibility of these publications, we abstained from referencing them in this research note. However, we have made the data available in the Harvard Dataverse repository.
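The kind of phrase-based flagging the study describes can be sketched as follows. The telltale phrases and paper records below are hypothetical stand-ins, not the authors' actual search strings or data.

```python
# Phrases of the kind ChatGPT emits when it refuses or self-identifies;
# these examples are illustrative, not the study's actual search strings.
TELLTALE_PHRASES = [
    "as an ai language model",
    "i cannot fulfill that request",
]

# Hypothetical records of the sort a scrape might return.
papers = [
    {"title": "Sustainable Telehealth Futures",
     "text": "... as an AI language model, I cannot provide ...",
     "declares_gpt_use": False},
    {"title": "Climate Policy Review",
     "text": "We surveyed 40 municipalities about flood planning ...",
     "declares_gpt_use": False},
    {"title": "LLM-Assisted Coding Study",
     "text": "'As an AI language model' was among the phrases we studied ...",
     "declares_gpt_use": True},
]

# Flag papers containing a telltale phrase, then measure how many of
# those flagged never declared any GPT use.
flagged = [p for p in papers
           if any(ph in p["text"].lower() for ph in TELLTALE_PHRASES)]
undeclared = [p for p in flagged if not p["declares_gpt_use"]]
print(f"{len(flagged)} flagged, "
      f"{len(undeclared) / len(flagged):.0%} without declared GPT use")
```

A real pipeline would of course also need deduplication, context windows around each match, and manual coding of the hits, as the study's joint coding step suggests.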
The publications were related to three issue areas—health (14.5%), environment (19.5%) and computing (23%)—with key terms such as “healthcare,” “COVID-19,” or “infection” for health-related papers, and “analysis,” “sustainable,” and “global” for environment-related papers. In several cases, the papers had titles that strung together general keywords and buzzwords, thus alluding to very broad and current research. These terms included “biology,” “telehealth,” “climate policy,” “diversity,” and “disrupting,” to name just a few. While the study’s scope and design did not include a detailed analysis of which parts of the articles included fabricated text, our dataset did contain the surrounding sentences for each occurrence of the suspicious phrases that formed the basis for our search and subsequent selection. Based on that, we can say that the phrases occurred in most sections typically found in scientific publications, including the literature review, methods, conceptual and theoretical frameworks, background, motivation or societal relevance, and even discussion. This was confirmed during the joint coding, where we read and discussed all articles. It became clear that not just the text related to the telltale phrases was created by GPT; almost all articles in our sample of questionable articles likely contained traces of GPT-fabricated text throughout.
Evidence hacking and backfiring effects
Generative pre-trained transformers (GPTs) can be used to produce texts that mimic scientific writing. These texts, when made available online—as we demonstrate—leak into the databases of academic search engines and other parts of the research infrastructure for scholarly communication. This development exacerbates problems that were already present with less sophisticated text generators (Antkare, 2020; Cabanac & Labbé, 2021). Yet, the public release of ChatGPT in 2022, together with the way Google Scholar works, has increased the likelihood of lay people (e.g., media, politicians, patients, students) coming across questionable (or even entirely GPT-fabricated) papers and other problematic research findings. Previous research has emphasized that the ability to determine the value and status of scientific publications for lay people is at stake when misleading articles are passed off as reputable (Haider & Åström, 2017) and that systematic literature reviews risk being compromised (Dadkhah et al., 2017). It has also been highlighted that Google Scholar, in particular, can be and has been exploited for manipulating the evidence base for politically charged issues and to fuel conspiracy narratives (Tripodi et al., 2023). Both concerns are likely to be magnified in the future, increasing the risk of what we suggest calling evidence hacking —the strategic and coordinated malicious manipulation of society’s evidence base.
The authority of quality-controlled research as evidence to support legislation, policy, politics, and other forms of decision-making is undermined by the presence of undeclared GPT-fabricated content in publications professing to be scientific. Due to the large number of archives, repositories, mirror sites, and shadow libraries to which they spread, there is a clear risk that GPT-fabricated, questionable papers will reach audiences even after a possible retraction. There are considerable technical difficulties involved in identifying and tracing computer-fabricated papers (Cabanac & Labbé, 2021; Dadkhah et al., 2023; Jones, 2024), not to mention preventing and curbing their spread and uptake.
However, as the rise of the so-called anti-vaxx movement during the COVID-19 pandemic and the ongoing obstruction and denial of climate change show, retracting erroneous publications often fuels conspiracies and increases the following of these movements rather than stopping them. To illustrate this mechanism, climate deniers frequently question established scientific consensus by pointing to other, supposedly scientific, studies that support their claims. Usually, these are poorly executed, not peer-reviewed, based on obsolete data, or even fraudulent (Dunlap & Brulle, 2020). A similar strategy is successful in the alternative epistemic world of the global anti-vaccination movement (Carrion, 2018) and the persistence of flawed and questionable publications in the scientific record already poses significant problems for health research, policy, and lawmakers, and thus for society as a whole (Littell et al., 2024). Considering that a person’s support for “doing your own research” is associated with increased mistrust in scientific institutions (Chinn & Hasell, 2023), it will be of utmost importance to anticipate and consider such backfiring effects already when designing a technical solution, when suggesting industry or legal regulation, and in the planning of educational measures.
Recommendations
Solutions should be based on simultaneous considerations of technical, educational, and regulatory approaches, as well as incentives, including social ones, across the entire research infrastructure. Paying attention to how these approaches and incentives relate to each other can help identify points and mechanisms for disruption. Recognizing fraudulent academic papers must happen alongside understanding how they reach their audiences and what reasons there might be for some of these papers successfully “sticking around.” A possible way to mitigate some of the risks associated with GPT-fabricated scholarly texts finding their way into academic search engine results would be to provide filtering options for facets such as indexed journals, gray literature, peer-review, and similar on the interface of publicly available academic search engines. Furthermore, evaluation tools for indexed journals (such as LiU Journal CheckUp, https://ep.liu.se/JournalCheckup/default.aspx?lang=eng ) could be integrated into the graphical user interfaces and the crawlers of these academic search engines. To enable accountability, it is important that the index (database) of such a search engine is populated according to criteria that are transparent, open to scrutiny, and appropriate to the workings of science and other forms of academic research. Moreover, considering that Google Scholar has no real competitor, there is a strong case for establishing a freely accessible, non-specialized academic search engine that is not run for commercial reasons but for reasons of public interest. Such measures, together with educational initiatives aimed particularly at policymakers, science communicators, journalists, and other media workers, will be crucial to reducing the possibilities for and effects of malicious manipulation or evidence hacking.
It is important not to present this as a technical problem that exists only because of AI text generators but to relate it to the wider concerns in which it is embedded. These range from a largely dysfunctional scholarly publishing system (Haider & Åström, 2017) and academia’s “publish or perish” paradigm to Google’s near-monopoly and ideological battles over the control of information and ultimately knowledge. Any intervention is likely to have systemic effects; these effects need to be considered and assessed in advance and, ideally, followed up on.
Our study focused on a selection of papers that were easily recognizable as fraudulent. We used this relatively small sample as a magnifying glass to examine, delineate, and understand a problem that goes beyond the scope of the sample itself and points towards larger concerns that require further investigation. The work of ongoing whistleblowing initiatives (such as Academ-AI, https://www.academ-ai.info/, and Retraction Watch, https://retractionwatch.com/papers-and-peer-reviews-with-evidence-of-chatgpt-writing/), recent media reports of journal closures (Subbaraman, 2024), and GPT-related changes in word use and writing style (Cabanac et al., 2021; Stokel-Walker, 2024) suggest that we only see the tip of the iceberg. There are already more sophisticated cases (Dadkhah et al., 2023) as well as cases involving fabricated images (Gu et al., 2022). Our analysis shows that questionable and potentially manipulative GPT-fabricated papers permeate the research infrastructure and are likely to become a widespread phenomenon. Our findings underline that the risk of fake scientific papers being used to maliciously manipulate evidence (see Dadkhah et al., 2017) must be taken seriously. Manipulation may involve undeclared automatic summaries of texts, inclusion in literature reviews, explicit scientific claims, or the concealment of errors in studies so that they are difficult to detect in peer review. However, the mere possibility of these things happening is a significant risk in its own right that can be strategically exploited, and it will have ramifications for trust in and perception of science. Society's methods of evaluating sources and the foundations of media and information literacy are under threat, and public trust in science is at risk of further erosion, with far-reaching consequences for society's ability to deal with information disorders. To address this multifaceted problem, we first need to understand why it exists and proliferates.
Finding 1: 139 GPT-fabricated, questionable papers were found and listed as regular results on the Google Scholar results page. Non-indexed journals dominate.
Most of the questionable papers we found were in non-indexed journals or were working papers, but we also found some in established journals, publications, conferences, and repositories. We found a total of 139 papers with suspected deceptive use of ChatGPT or similar LLM applications (see Table 1). Of these, 19 were in indexed journals, 89 in non-indexed journals, 19 were student papers found in university databases, and 12 were working papers (mostly in preprint databases). Table 1 divides these papers into categories. Health and environment papers together made up around 34% (47) of the sample, and 66% of these appeared in non-indexed journals.
| Venue type | Computing | Environment | Health | Others | Total |
|---|---|---|---|---|---|
| Indexed journals* | 5 | 3 | 4 | 7 | 19 |
| Non-indexed journals | 18 | 18 | 13 | 40 | 89 |
| Student papers | 4 | 3 | 1 | 11 | 19 |
| Working papers | 5 | 3 | 2 | 2 | 12 |
| Total | 32 | 27 | 20 | 60 | 139 |
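As a quick sanity check, the marginal totals in Table 1 can be recomputed from the cell counts. This is a minimal sketch; the mapping of columns to the study's four subject categories (computing, environment, health, others) is inferred from the subtotals quoted in the surrounding text rather than stated in the table itself.

```python
# Cross-check the Table 1 counts. Rows are venue types; column order
# (computing, environment, health, others) is inferred from the text.
table = {
    "Indexed journals":     [5, 3, 4, 7],
    "Non-indexed journals": [18, 18, 13, 40],
    "Student papers":       [4, 3, 1, 11],
    "Working papers":       [5, 3, 2, 2],
}
row_totals = {venue: sum(counts) for venue, counts in table.items()}
col_totals = [sum(col) for col in zip(*table.values())]
grand_total = sum(col_totals)

print(row_totals)   # Total column: 19, 89, 19, 12
print(col_totals)   # [32, 27, 20, 60]
print(grand_total)  # 139

# The "around 34% (47)" figure for health + environment papers,
# 31 of which (66%) sit in non-indexed journals:
print(col_totals[1] + col_totals[2])  # 47
print(round(100 * 31 / 47))           # 66
```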
Finding 2: GPT-fabricated, questionable papers are disseminated online, permeating the research infrastructure for scholarly communication, often in multiple copies. Applied topics with practical implications dominate.
The 20 papers concerning health-related issues are distributed across 20 unique domains, accounting for 46 URLs. The 27 papers dealing with environmental issues can be found across 26 unique domains, accounting for 56 URLs. Most of the identified papers exist in multiple copies and have already spread to several archives, repositories, and social media. It would be difficult, or impossible, to remove them from the scientific record.
As apparent from Table 2, GPT-fabricated, questionable papers are seeping into most parts of the online research infrastructure for scholarly communication. Platforms on which identified papers have appeared include ResearchGate, ORCiD, Journal of Population Therapeutics and Clinical Pharmacology (JPTCP), Easychair, Frontiers, the Institute of Electrical and Electronics Engineers (IEEE), and X/Twitter. Thus, even if they are retracted from their original source, it will prove very difficult to track, remove, or even just mark them up on other platforms. Moreover, unless regulated, Google Scholar will enable their continued and most likely unlabeled discoverability.
| Category | Top domains (number of URLs) |
|---|---|
| Environment | researchgate.net (13), orcid.org (4), easychair.org (3), ijope.com* (3), publikasiindonesia.id (3) |
| Health | researchgate.net (15), ieee.org (4), twitter.com (3), jptcp.com** (2), frontiersin.org (2) |
A word rain visualization (Centre for Digital Humanities Uppsala, 2023), which combines word prominences through TF-IDF scores (term frequency–inverse document frequency, a measure of a word's significance in a document relative to its frequency across all documents in a collection) with the semantic similarity of the full texts of our sample of GPT-generated articles in the "Environment" and "Health" categories, reflects the two categories in question. However, as can be seen in Figure 1, it also reveals overlap and sub-areas. The y-axis shows word prominences through word positions and font sizes, while the x-axis indicates semantic similarity. In addition to a certain amount of overlap, this reveals sub-areas, which are best described as two distinct events within the word rain. The event on the left bundles terms related to the development and management of health and healthcare, with "challenges," "impact," and "potential of artificial intelligence" emerging as semantically related terms. Terms related to research infrastructures and environmental, epistemic, and technological concepts are arranged further down in the same event (e.g., "system," "climate," "understanding," "knowledge," "learning," "education," "sustainable"). A second distinct event further to the right bundles terms associated with fish farming and aquatic medicinal plants, highlighting the presence of an aquaculture cluster. Here, the prominence of groups of terms such as "used," "model," "-based," and "traditional" suggests the presence of applied research on these topics. The two events making up the word rain visualization are linked by a less dominant but overlapping cluster of terms related to "energy" and "water."
The bar chart of the terms in the paper subset (see Figure 2) complements the word rain visualization by depicting the most prominent terms in the full texts along the y-axis. Word prominences across the health and environment papers are arranged in descending order; values outside parentheses are TF-IDF scores (relative frequencies) and values inside parentheses are raw term frequencies (absolute frequencies).
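The TF-IDF weighting behind these word prominences can be illustrated with a minimal, stdlib-only sketch. This is not the authors' Word Rain implementation, and the toy corpus and tokenization below are invented for the example:

```python
import math
from collections import Counter

def tf_idf(term, doc, corpus):
    """TF-IDF: a term's frequency within one document, scaled down by
    how common the term is across the whole collection."""
    tf = Counter(doc)[term] / len(doc)
    df = sum(1 for d in corpus if term in d)          # document frequency
    idf = math.log(len(corpus) / df) if df else 0.0   # rarer terms weigh more
    return tf * idf

# Toy corpus: a ubiquitous term scores 0, a distinctive one scores high.
docs = [
    ["health", "aquaculture", "fish", "farming"],
    ["health", "climate", "energy"],
    ["health", "education", "learning"],
]
print(tf_idf("health", docs[0], docs))       # 0.0 (appears in every document)
print(tf_idf("aquaculture", docs[0], docs))  # > 0
```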
Finding 3: Google Scholar presents results from quality-controlled and non-controlled citation databases on the same interface, providing unfiltered access to GPT-fabricated questionable papers.
Google Scholar’s central position in the publicly accessible scholarly communication infrastructure, as well as its lack of standards, transparency, and accountability in terms of inclusion criteria, has potentially serious implications for public trust in science. This is likely to exacerbate the already-known potential to exploit Google Scholar for evidence hacking (Tripodi et al., 2023) and will have implications for any attempts to retract or remove fraudulent papers from their original publication venues. Any solution must consider the entirety of the research infrastructure for scholarly communication and the interplay of different actors, interests, and incentives.
We searched and scraped Google Scholar using the Python library Scholarly (Cholewiak et al., 2023) for papers that included specific phrases known to be common responses from ChatGPT and similar applications built on the same underlying models (GPT-3.5 or GPT-4): "as of my last knowledge update" and/or "I don't have access to real-time data" (see Appendix A). This facilitated the identification of papers that likely used generative AI to produce text, resulting in 227 retrieved papers. The papers' bibliographic information was automatically added to a spreadsheet and downloaded into Zotero, an open-source reference manager (https://zotero.org).
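The core selection criterion, matching text against the two telltale ChatGPT phrases, can be sketched as follows. The actual study queried Google Scholar through the Scholarly library; this sketch shows only the matching logic, and the example strings are invented:

```python
# Telltale ChatGPT boilerplate phrases used as search queries in the study.
FLAG_PHRASES = (
    "as of my last knowledge update",
    "i don't have access to real-time data",
)

def is_suspect(full_text: str) -> bool:
    """Flag a paper whose text contains a known GPT boilerplate phrase.
    A match is only a candidate: the study then manually assessed whether
    the tool's use was declared (legitimate) or undeclared (fraudulent)."""
    text = full_text.lower()
    return any(phrase in text for phrase in FLAG_PHRASES)

print(is_suspect("As of my last knowledge update in September 2021, ..."))  # True
print(is_suspect("We collected water samples from 12 sites."))              # False
```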
We employed multiple coding (Barbour, 2001) to classify the papers based on their content. First, we jointly assessed whether each paper was suspected of fraudulent use of ChatGPT (or similar) based on how the text was integrated into the paper and whether the paper was presented as original research output or the AI tool's role was acknowledged. Second, in analyzing the content of the papers, we continued the multiple coding by classifying the fraudulent papers into four categories identified during an initial round of analysis (health, environment, computing, and others) and then determining which subjects were most affected by this issue (see Table 1). Of the 227 retrieved papers, 88 were written with legitimate and/or declared use of GPTs (i.e., false positives, which were excluded from further analysis), and 139 were written with undeclared and/or fraudulent use (i.e., true positives, which were included in further analysis). The multiple coding was conducted jointly by all authors of the present article, who collaboratively coded and cross-checked each other's interpretations of the data in a shared spreadsheet file. This was done to single out coding discrepancies and settle coding disagreements, which in turn ensured methodological thoroughness and analytical consensus (see Barbour, 2001). Redoing the category coding later based on our established coding schedule, we achieved an intercoder reliability (Cohen's kappa) of 0.806 after resolving obvious discrepancies.
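The intercoder reliability statistic reported here, Cohen's kappa, corrects raw agreement for the agreement two coders would reach by chance. A stdlib-only sketch with invented labels (not the authors' code):

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa: observed agreement corrected for the agreement
    expected by chance, given each coder's marginal label frequencies."""
    n = len(coder_a)
    observed = sum(x == y for x, y in zip(coder_a, coder_b)) / n
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)

# Two coders agreeing on 3 of 4 invented category labels:
coder_a = ["health", "health", "environment", "environment"]
coder_b = ["health", "health", "environment", "health"]
print(cohens_kappa(coder_a, coder_b))  # 0.5
```

With 75% raw agreement but 50% chance-expected agreement, kappa comes out at 0.5; the study's 0.806 indicates substantially better-than-chance consistency.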
The ranking algorithm of Google Scholar prioritizes highly cited and older publications (Martín-Martín et al., 2016). Therefore, the position of the articles on the search engine results pages was not particularly informative, considering the relatively small number of results in combination with the recency of the publications. Only the query “as of my last knowledge update” had more than two search engine result pages. On those, questionable articles with undeclared use of GPTs were evenly distributed across all result pages (min: 4, max: 9, mode: 8), with the proportion of undeclared use being slightly higher on average on later search result pages.
To understand how the papers making fraudulent use of generative AI were disseminated online, we programmatically searched for the paper titles (with exact string matching) in Google Search from our local IP address (see Appendix B) using the googlesearch-python library (Vikramaditya, 2020). We manually verified each search result to filter out false positives (results that were not related to the paper) and then compiled the most prominent URLs by field. This enabled the identification of other platforms through which the papers had spread. We did not, however, investigate whether copies had spread into Sci-Hub or other shadow libraries, or whether they were referenced in Wikipedia.
We used descriptive statistics to count the prevalence of GPT-fabricated papers across topics and venues, as well as the top domains by subject. The pandas software library for the Python programming language (The pandas development team, 2024) was used for this part of the analysis. Based on the multiple coding, paper occurrences were counted in relation to their categories, divided into indexed journals, non-indexed journals, student papers, and working papers. The schemes, subdomains, and subdirectories of the URL strings were filtered out while top-level and second-level domains were kept, thereby normalizing the domain names. This, in turn, allowed the counting of domain frequencies in the environment and health categories. To distinguish word prominences and meanings in the environment- and health-related GPT-fabricated questionable papers, a semantically aware word cloud visualization was produced using a word rain (Centre for Digital Humanities Uppsala, 2023) for full-text versions of the papers. Font size and y-axis position indicate word prominence through TF-IDF scores for the environment and health papers (also visualized in a separate bar chart with raw term frequencies in parentheses), and words are positioned along the x-axis to reflect semantic similarity (Skeppstedt et al., 2024), using an English word2vec skip-gram model space (Fares et al., 2017). An English stop word list was used, along with a manually produced list including terms such as "https," "volume," or "years."
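The domain normalization step described above (dropping scheme, subdomains, and subdirectories while keeping the second-level plus top-level domain) might look like this. This is a simplified sketch, not the authors' code: it ignores multi-part public suffixes such as .ac.uk, and the example URLs are invented:

```python
from urllib.parse import urlparse

def normalize_domain(url: str) -> str:
    """Reduce a URL to its second-level plus top-level domain:
    scheme, subdomains, port, path, and query string are all dropped.
    Simplification: multi-part public suffixes (e.g., .ac.uk) are not handled."""
    host = urlparse(url).netloc.lower().split(":")[0]
    return ".".join(host.split(".")[-2:])

# Invented example URLs:
print(normalize_domain("https://www.researchgate.net/publication/123_example"))
# researchgate.net
print(normalize_domain("http://mirror.publikasiindonesia.id/archive/paper.pdf"))
# publikasiindonesia.id
```

After normalization, counting domain frequencies per category reduces to a simple group-by on the normalized strings.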
Haider, J., Söderström, K. R., Ekström, B., & Rödl, M. (2024). GPT-fabricated scientific papers on Google Scholar: Key features, spread, and implications for preempting evidence manipulation. Harvard Kennedy School (HKS) Misinformation Review . https://doi.org/10.37016/mr-2020-156
Antkare, I. (2020). Ike Antkare, his publications, and those of his disciples. In M. Biagioli & A. Lippman (Eds.), Gaming the metrics (pp. 177–200). The MIT Press. https://doi.org/10.7551/mitpress/11087.003.0018
Barbour, R. S. (2001). Checklists for improving rigour in qualitative research: A case of the tail wagging the dog? BMJ , 322 (7294), 1115–1117. https://doi.org/10.1136/bmj.322.7294.1115
Bom, H.-S. H. (2023). Exploring the opportunities and challenges of ChatGPT in academic writing: A roundtable discussion. Nuclear Medicine and Molecular Imaging , 57 (4), 165–167. https://doi.org/10.1007/s13139-023-00809-2
Cabanac, G., & Labbé, C. (2021). Prevalence of nonsensical algorithmically generated papers in the scientific literature. Journal of the Association for Information Science and Technology , 72 (12), 1461–1476. https://doi.org/10.1002/asi.24495
Cabanac, G., Labbé, C., & Magazinov, A. (2021). Tortured phrases: A dubious writing style emerging in science. Evidence of critical issues affecting established journals . arXiv. https://doi.org/10.48550/arXiv.2107.06751
Carrion, M. L. (2018). “You need to do your research”: Vaccines, contestable science, and maternal epistemology. Public Understanding of Science , 27 (3), 310–324. https://doi.org/10.1177/0963662517728024
Centre for Digital Humanities Uppsala (2023). CDHUppsala/word-rain [Computer software]. https://github.com/CDHUppsala/word-rain
Chinn, S., & Hasell, A. (2023). Support for “doing your own research” is associated with COVID-19 misperceptions and scientific mistrust. Harvard Kennedy School (HKS) Misinformation Review, 4 (3). https://doi.org/10.37016/mr-2020-117
Cholewiak, S. A., Ipeirotis, P., Silva, V., & Kannawadi, A. (2023). SCHOLARLY: Simple access to Google Scholar authors and citation using Python (1.5.0) [Computer software]. https://doi.org/10.5281/zenodo.5764801
Dadkhah, M., Lagzian, M., & Borchardt, G. (2017). Questionable papers in citation databases as an issue for literature review. Journal of Cell Communication and Signaling , 11 (2), 181–185. https://doi.org/10.1007/s12079-016-0370-6
Dadkhah, M., Oermann, M. H., Hegedüs, M., Raman, R., & Dávid, L. D. (2023). Detection of fake papers in the era of artificial intelligence. Diagnosis , 10 (4), 390–397. https://doi.org/10.1515/dx-2023-0090
DeGeurin, M. (2024, March 19). AI-generated nonsense is leaking into scientific journals. Popular Science. https://www.popsci.com/technology/ai-generated-text-scientific-journals/
Dunlap, R. E., & Brulle, R. J. (2020). Sources and amplifiers of climate change denial. In D.C. Holmes & L. M. Richardson (Eds.), Research handbook on communicating climate change (pp. 49–61). Edward Elgar Publishing. https://doi.org/10.4337/9781789900408.00013
Fares, M., Kutuzov, A., Oepen, S., & Velldal, E. (2017). Word vectors, reuse, and replicability: Towards a community repository of large-text resources. In J. Tiedemann & N. Tahmasebi (Eds.), Proceedings of the 21st Nordic Conference on Computational Linguistics (pp. 271–276). Association for Computational Linguistics. https://aclanthology.org/W17-0237
Google Scholar Help. (n.d.). Inclusion guidelines for webmasters . https://scholar.google.com/intl/en/scholar/inclusion.html
Gu, J., Wang, X., Li, C., Zhao, J., Fu, W., Liang, G., & Qiu, J. (2022). AI-enabled image fraud in scientific publications. Patterns , 3 (7), 100511. https://doi.org/10.1016/j.patter.2022.100511
Gusenbauer, M., & Haddaway, N. R. (2020). Which academic search systems are suitable for systematic reviews or meta-analyses? Evaluating retrieval qualities of Google Scholar, PubMed, and 26 other resources. Research Synthesis Methods , 11 (2), 181–217. https://doi.org/10.1002/jrsm.1378
Haider, J., & Åström, F. (2017). Dimensions of trust in scholarly communication: Problematizing peer review in the aftermath of John Bohannon’s “Sting” in science. Journal of the Association for Information Science and Technology , 68 (2), 450–467. https://doi.org/10.1002/asi.23669
Huang, J., & Tan, M. (2023). The role of ChatGPT in scientific communication: Writing better scientific review articles. American Journal of Cancer Research , 13 (4), 1148–1154. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10164801/
Jones, N. (2024). How journals are fighting back against a wave of questionable images. Nature , 626 (8000), 697–698. https://doi.org/10.1038/d41586-024-00372-6
Kitamura, F. C. (2023). ChatGPT is shaping the future of medical writing but still requires human judgment. Radiology , 307 (2), e230171. https://doi.org/10.1148/radiol.230171
Littell, J. H., Abel, K. M., Biggs, M. A., Blum, R. W., Foster, D. G., Haddad, L. B., Major, B., Munk-Olsen, T., Polis, C. B., Robinson, G. E., Rocca, C. H., Russo, N. F., Steinberg, J. R., Stewart, D. E., Stotland, N. L., Upadhyay, U. D., & Ditzhuijzen, J. van. (2024). Correcting the scientific record on abortion and mental health outcomes. BMJ , 384 , e076518. https://doi.org/10.1136/bmj-2023-076518
Lund, B. D., Wang, T., Mannuru, N. R., Nie, B., Shimray, S., & Wang, Z. (2023). ChatGPT and a new academic reality: Artificial Intelligence-written research papers and the ethics of the large language models in scholarly publishing. Journal of the Association for Information Science and Technology, 74 (5), 570–581. https://doi.org/10.1002/asi.24750
Martín-Martín, A., Orduna-Malea, E., Ayllón, J. M., & Delgado López-Cózar, E. (2016). Back to the past: On the shoulders of an academic search engine giant. Scientometrics , 107 , 1477–1487. https://doi.org/10.1007/s11192-016-1917-2
Martín-Martín, A., Thelwall, M., Orduna-Malea, E., & Delgado López-Cózar, E. (2021). Google Scholar, Microsoft Academic, Scopus, Dimensions, Web of Science, and OpenCitations’ COCI: A multidisciplinary comparison of coverage via citations. Scientometrics , 126 (1), 871–906. https://doi.org/10.1007/s11192-020-03690-4
Simon, F. M., Altay, S., & Mercier, H. (2023). Misinformation reloaded? Fears about the impact of generative AI on misinformation are overblown. Harvard Kennedy School (HKS) Misinformation Review, 4 (5). https://doi.org/10.37016/mr-2020-127
Skeppstedt, M., Ahltorp, M., Kucher, K., & Lindström, M. (2024). From word clouds to Word Rain: Revisiting the classic word cloud to visualize climate change texts. Information Visualization , 23 (3), 217–238. https://doi.org/10.1177/14738716241236188
Swedish Research Council. (2017). Good research practice. Vetenskapsrådet.
Stokel-Walker, C. (2024, May 1). AI chatbots have thoroughly infiltrated scientific publishing. Scientific American. https://www.scientificamerican.com/article/chatbots-have-thoroughly-infiltrated-scientific-publishing/
Subbaraman, N. (2024, May 14). Flood of fake science forces multiple journal closures: Wiley to shutter 19 more journals, some tainted by fraud. The Wall Street Journal . https://www.wsj.com/science/academic-studies-research-paper-mills-journals-publishing-f5a3d4bc
The pandas development team. (2024). pandas-dev/pandas: Pandas (v2.2.2) [Computer software]. Zenodo. https://doi.org/10.5281/zenodo.10957263
Thorp, H. H. (2023). ChatGPT is fun, but not an author. Science , 379 (6630), 313–313. https://doi.org/10.1126/science.adg7879
Tripodi, F. B., Garcia, L. C., & Marwick, A. E. (2023). ‘Do your own research’: Affordance activation and disinformation spread. Information, Communication & Society , 27 (6), 1212–1228. https://doi.org/10.1080/1369118X.2023.2245869
Vikramaditya, N. (2020). Nv7-GitHub/googlesearch [Computer software]. https://github.com/Nv7-GitHub/googlesearch
This research has been supported by Mistra, the Swedish Foundation for Strategic Environmental Research, through the research program Mistra Environmental Communication (Haider, Ekström, Rödl) and the Marcus and Amalia Wallenberg Foundation [2020.0004] (Söderström).
The authors declare no competing interests.
The research described in this article was carried out under Swedish legislation. According to the relevant EU and Swedish legislation (2003:460) on the ethical review of research involving humans (“Ethical Review Act”), the research reported on here is not subject to authorization by the Swedish Ethical Review Authority (“etikprövningsmyndigheten”) (SRC, 2017).
This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided that the original author and source are properly credited.
All data needed to replicate this study are available at the Harvard Dataverse: https://doi.org/10.7910/DVN/WUVD8X
The authors wish to thank two anonymous reviewers for their valuable comments on the article manuscript as well as the editorial group of Harvard Kennedy School (HKS) Misinformation Review for their thoughtful feedback and input.
Published on 4.9.2024 in Vol 12 (2024)
Authors of this article:
1 Golpazari Family Health Center, Bilecik, Turkey
2 SafeVideo AI, San Francisco, CA, United States
3 Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
4 Department of Internal Medicine, Ankara Etlik City Hospital, Ankara, Turkey
5 Faculty of Medicine, Ankara Yildirim Beyazit University, Ankara, Turkey
6 Department of Computer Science, Istanbul Technical University, Istanbul, Turkey
7 Department of Pediatric Gastroenterology, Children's Hospital, Ankara Bilkent City Hospital, Ankara Yildirim Beyazit University, Ankara, Turkey
Seyma Handan Akyon, MD
Golpazari Family Health Center
Istiklal Mahallesi Fevzi Cakmak Caddesi No:23 Golpazari
Bilecik, 11700
Phone: 90 5052568096
Email: [email protected]
Background: Reading medical papers is a challenging and time-consuming task for doctors, especially when the papers are long and complex. A tool that can help doctors efficiently process and understand medical papers is needed.
Objective: This study aims to critically assess and compare the comprehension capabilities of large language models (LLMs) in accurately and efficiently understanding medical research papers using the STROBE (Strengthening the Reporting of Observational Studies in Epidemiology) checklist, which provides a standardized framework for evaluating key elements of observational studies.
Methods: This is a methodological study that evaluates the capabilities of new generative artificial intelligence tools in understanding medical papers. A novel benchmark pipeline processed 50 medical research papers from PubMed, comparing the answers of 6 LLMs (GPT-3.5-Turbo, GPT-4-0613, GPT-4-1106, PaLM 2, Claude v1, and Gemini Pro) to the benchmark established by expert medical professors. Fifteen questions, derived from the STROBE checklist, assessed the LLMs’ understanding of different sections of a research paper.
Results: LLMs exhibited varying performance, with GPT-3.5-Turbo achieving the highest percentage of correct answers (n=3916, 66.9%), followed by GPT-4-1106 (n=3837, 65.6%), PaLM 2 (n=3632, 62.1%), Claude v1 (n=2887, 58.3%), Gemini Pro (n=2878, 49.2%), and GPT-4-0613 (n=2580, 44.1%). Statistical analysis revealed significant differences between LLMs (P<.001), with older models showing inconsistent performance compared to newer versions. The LLMs showed distinct performance on each question across the different parts of a scholarly paper, with certain models like PaLM 2 and GPT-3.5 showing remarkable versatility and depth in understanding.
Conclusions: This study is the first to evaluate the performance of different LLMs in understanding medical papers using the retrieval augmented generation method. The findings highlight the potential of LLMs to enhance medical research by improving efficiency and facilitating evidence-based decision-making. Further research is needed to address limitations such as the influence of question formats, potential biases, and the rapid evolution of LLM models.
Artificial intelligence (AI) has revolutionized numerous fields, including health care, with its potential to enhance patient outcomes, increase efficiency, and reduce costs [1]. AI devices fall into 2 main categories: one uses machine learning techniques to analyze structured data for medical applications, while the other uses natural language processing methods to extract information from unstructured data, such as clinical notes, thereby improving the analysis of structured medical data [2]. A key development within natural language processing has been the emergence of large language models (LLMs), advanced systems trained on vast amounts of text data to generate human-like language and perform a variety of language-based tasks [3]. While deep learning models recognize patterns in data [4], LLMs are trained to predict the probability of a word sequence based on context. By training on large amounts of text data, LLMs can generate new and plausible sequences of words that the model has not previously observed [4]. ChatGPT, an advanced conversational AI technology developed by OpenAI in late 2022, is a general-purpose LLM [5]. GPT is part of a growing landscape of conversational AI products, with other notable examples including Llama (Meta), Jurassic (AI21), Claude (Anthropic), Command (Cohere), and Gemini (formerly Bard) and PaLM (Google) [5]. The potential of AI systems to enhance medical care and health outcomes is highly promising [6]. Therefore, it is essential to ensure that the creation of AI systems in health care adheres to the principles of trust and explainability. Evaluating the medical knowledge of AI systems against that of expert clinicians is a vital first step in assessing these qualities [5,7,8].
Reading medical papers is a challenging and time-consuming task for doctors, especially when the papers are long and complex. This poses a significant barrier to efficient knowledge acquisition and evidence-based decision-making in health care, so a tool that helps doctors process and understand medical papers more efficiently and accurately is needed. Although LLMs show promise in patient evaluation, diagnosis, and treatment processes [9], studies on their use for reading academic papers are limited. LLMs can be questioned directly and can generate answers from their own memory [10,11], an approach that has been studied extensively. However, this approach is prone to artificial hallucinations, that is, inaccurate outputs. The retrieval augmented generation (RAG) method, which addresses this knowledge gap by conditioning language models on relevant documents retrieved from an external knowledge source, can be used to overcome this issue [12].
The STROBE (Strengthening the Reporting of Observational Studies in Epidemiology) checklist provides a standardized framework for evaluating key elements of observational studies and ensuring sufficient information for critical evaluation. These guidelines consist of 22 items that authors should adhere to before submitting their manuscripts for publication [13-15]. This study aims to address this gap by evaluating the comprehension capabilities of LLMs in accurately and efficiently understanding medical research papers. We use the STROBE checklist to assess the LLMs’ ability to understand different sections of research papers. This study uses a novel benchmark pipeline that can process PubMed papers regardless of their length using various generative AI tools. This research will provide critical insights into the strengths and weaknesses of different LLMs in enhancing medical research paper comprehension. To overcome the problem of artificial hallucinations, we implement the RAG method: the LLMs are given a prompt that instructs them to answer while staying relevant to the given document, ensuring responses align with the provided information. The results of this study will provide valuable information for medical professionals, researchers, and developers seeking to leverage the potential of LLMs to improve medical literature comprehension and ultimately enhance patient care and research efficiency.
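The RAG setup described here, conditioning the model on retrieved document text and instructing it to answer only from that text, can be sketched as prompt assembly. The wording and structure below are illustrative assumptions, not the study's actual prompt:

```python
def build_rag_prompt(document_text: str, question: str) -> str:
    """Assemble a retrieval augmented generation (RAG) prompt: the
    retrieved paper text is inlined, and the model is instructed to
    answer only from that text, which mitigates hallucinated answers."""
    return (
        "Answer the question using ONLY the document below. "
        "If the document does not contain the answer, say so.\n\n"
        f"Document:\n{document_text}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Hypothetical paper excerpt and one of the STROBE-derived questions:
prompt = build_rag_prompt(
    "Methods: We conducted a cross-sectional study of 1,204 adults...",
    "Q2. What is the observational study type: cohort, case-control, "
    "or cross-sectional studies?",
)
print(prompt.splitlines()[0])  # the grounding instruction line
```

In practice, long papers are first split into chunks and only the chunks most relevant to the question are retrieved and inlined, which is what lets a pipeline like this handle papers of arbitrary length.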
This study uses a methodological research design to evaluate the comprehension capabilities of generative AI tools using the STROBE checklist.
We included the first 50 observational studies conducted within the past 5 years that were retrieved through an advanced search on PubMed on December 19, 2023, using “obesity” in the title as the search term. The included studies were limited to those written in English, available as free full text, and focusing specifically on human participants (Figure 1). On detailed statistical examination, 11 papers were excluded because they were not observational studies, leaving 39 papers. A post hoc power analysis was conducted to assess the statistical power of our study based on the total correct responses across all repetitions. The analysis excluded GPT-4-1106 and GPT-3.5-Turbo-1106 due to their similar performance and the significant differences observed between the other models. The power analysis, conducted using G*Power (version 3.1.9.7; Heinrich-Heine-Universität Düsseldorf), indicated that all analyses exceeded 95% power. Thus, the study was completed with the 39 selected papers, ensuring sufficient statistical power to detect meaningful differences in LLM performance.
This study used a novel benchmark pipeline to evaluate the understanding capabilities of LLMs when processing medical research papers. To establish a reference standard for evaluating the LLMs’ comprehension, we relied on the expertise of an experienced medical professor and an epidemiology expert doctor. The professor, with their extensive medical knowledge, was tasked with answering 15 questions derived from the STROBE checklist, designed to assess key elements of observational studies and cover different sections of a research paper (Table 1). The epidemiology expert doctor, with their specialized knowledge in statistical analysis and epidemiological methods, provided verification and validation of the professor’s answers, ensuring the rigor of the benchmark. The combined expertise of both professionals provided a robust and reliable reference standard against which the LLMs’ responses were compared.
| Questions | Answers |
|---|---|
| Q1. Does the paper indicate the study’s design with a commonly used term in the title or the abstract? | |
| Q2. What is the observational study type: cohort, case-control, or cross-sectional studies? | |
| Q3. Were settings or locations mentioned in the method? | |
| Q4. Were relevant dates mentioned in the method? | |
| Q5. Were eligibility criteria for selecting participants mentioned in the method? | |
| Q6. Were sources and methods of selection of participants mentioned in the method? | |
| Q7. Were any efforts to address potential sources of bias described in the method or discussion? | |
| Q8. Which program was used for statistical analysis? | |
| Q9. Were report numbers of individuals at each stage of the study (eg, numbers potentially eligible, examined for eligibility, confirmed eligible, included in the study, completing follow-up, and analyzed) mentioned in the results? | |
| Q10. Was a flowchart used to show the reported numbers of individuals at each stage of the study? | |
| Q11. Were the study participants’ demographic characteristics (eg, age and sex) given in the results? | |
| Q12. Does the discussion part summarize key results concerning study objectives? | |
| Q13. Are the limitations of the study discussed in the paper? | |
| Q14. Is the generalizability of the study discussed in the discussion part? | |
| Q15. Is the funding of the study mentioned in the paper? | |
This list of 15 questions, 2 multiple-choice and 13 yes or no questions, has been prepared by selecting the STROBE checklist items that can be answered definitively and have clear, nonsubjective responses. Question 1, related to title and abstract, examines the LLMs’ ability to identify and understand research designs and terms that are commonly used, evaluating the model’s comprehension of the concise language typically used in titles and abstracts. Questions 2-8, related to methods, cover various aspects of the study’s methodology, from the type of observational study to the statistical analysis programs used. They test the model’s understanding of the detailed and technical language often found in this section. Questions 9-11, related to results, focus on the accuracy and completeness of reported results, such as participant numbers at each study stage and demographic characteristics. These questions gauge the LLMs’ capability to parse and summarize factual data. Questions 12-14, related to the discussion, involve summarizing key results, discussing limitations, and addressing the study’s generalizability. These questions assess the LLMs’ ability to engage with more interpretive and evaluative content, showcasing their understanding of research impacts and contexts. Question 15, related to funding, tests the LLMs’ attentiveness to specific yet crucial details that could influence the interpretation of research findings.
The methodology incorporated a novel web application specifically designed for this purpose to assess the understanding capabilities of generative AI tools in medical research papers ( Figure 2 ). To mitigate the problem of “artificial hallucinations” inherent to LLMs, this study implemented the RAG method, which involves using a web application to dissect PDF-format medical papers from PubMed into text chunks ready to be processed by various LLMs. This approach guides the LLMs to provide answers grounded in the provided information by supplying them with relevant text chunks retrieved from the target paper.
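The RAG step described above can be sketched as follows. This is a simplified stand-in, assuming a keyword-overlap retriever; the study's actual chunking and retrieval implementation is not specified in detail and would typically use a PDF parser and embedding similarity.

```python
# Minimal sketch of the RAG idea: split a paper's text into overlapping
# chunks, then retrieve the chunks most relevant to a question by simple
# word overlap (a stand-in for embedding-based retrieval).

def chunk_text(text, size=40, overlap=10):
    """Split text into overlapping chunks of roughly `size` words."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + size]))
        start += size - overlap
    return chunks

def retrieve(chunks, question, k=2):
    """Return the k chunks sharing the most words with the question."""
    q = set(question.lower().split())
    scored = sorted(chunks,
                    key=lambda c: len(q & set(c.lower().split())),
                    reverse=True)
    return scored[:k]
```

Retrieved chunks are then supplied to the LLM alongside the question so that answers stay grounded in the target paper rather than in the model's memory.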
The benchmark pipeline itself is designed to process PubMed papers of varying lengths and extract relevant information for analysis. This pipeline operates as follows:
Using this benchmark pipeline, we compared the answers of 6 generative AI tools (GPT-3.5-Turbo-1106 [November 6, 2023 version], GPT-4-0613 [June 13, 2023 version], GPT-4-1106 [November 6, 2023 version], PaLM 2 [chat-bison], Claude v1, and Gemini Pro) with the benchmark answers to the 15 questions for 39 medical research papers (Table 2). Each of the 15 questions selected from the STROBE checklist was posed 10 times per paper to each of the 6 LLMs.
Table 2. Generative AI tools evaluated.

| Generative AI tool | Version | Company | Cutoff date |
| --- | --- | --- | --- |
| GPT-3.5-Turbo | November 6, 2023 | OpenAI | September 2021 |
| GPT-4-0613 | June 13, 2023 | OpenAI | September 2021 |
| GPT-4-1106 | November 6, 2023 | OpenAI | April 2023 |
| Claude v1 | Version 1 | Anthropic | —a |
| PaLM 2 | Chat-bison | Google | —a |
| Gemini Pro | 1.0 | Google | —a |

a The company does not explicitly state a cutoff date.
Access issues with Claude v1, specifically restrictions on its ability to process certain medical information, resulted in the exclusion of data from 6 papers, limiting the study’s scope to 33 papers. LLMs commonly provide a “knowledge-cutoff” date, indicating the point at which their training data ends and they may not have access to the most up-to-date information. With some LLMs, however, the company does not explicitly state a cutoff date. The explicitly stated cutoff dates are given in Table 2 , based on the publicly available information for each LLM.
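The knowledge-cutoff check above can be sketched as a date comparison. The publication dates below are illustrative, and the exact cutoff days are assumptions since only month-level cutoffs are publicly stated.

```python
# Sketch: count which papers predate a model's stated knowledge cutoff.
# End-of-month days are an assumption; sources state only month and year.
from datetime import date

cutoffs = {  # explicitly stated cutoffs (see Table 2)
    "GPT-3.5-Turbo-1106": date(2021, 9, 30),
    "GPT-4-0613": date(2021, 9, 30),
    "GPT-4-1106": date(2023, 4, 30),
}

def before_cutoff(pub_dates, model):
    """Count papers published before the model's knowledge cutoff."""
    return sum(d < cutoffs[model] for d in pub_dates)

# Illustrative publication dates, not the study's actual 39 papers
papers = [date(2020, 5, 1), date(2022, 1, 15), date(2023, 3, 1)]
```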
A chatbot conversation begins when a user enters a query (the user prompt). The chatbot responds in natural language within seconds, creating an interactive, conversation-like exchange; this is possible because the model keeps track of the conversational context. In addition to the RAG method, a well-designed system prompt (standing instructions that guide the model to stay relevant to a given document) can help LLMs generate responses that align with the provided information. We used the following system prompt for all LLMs:
You are an expert medical professor specialized in pediatric gastroenterology hepatology and nutrition, with a detailed understanding of various research methodologies, study types, ethical considerations, and statistical analysis procedures. Your task is to categorize research articles based on information provided in query prompts. There are multiple options for each question, and you must select the most appropriate one based on your expertise and the context of the research article presented in the query.
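A sketch of how the system prompt and retrieved chunks might be combined into a single request. The message structure follows the common chat-completions convention ("system"/"user" roles) and is an assumption for illustration, not the authors' actual implementation; the prompt text is truncated.

```python
# Hypothetical request assembly: fixed system prompt + retrieved context
# + one STROBE question, in the common chat-completions message format.

SYSTEM_PROMPT = (
    "You are an expert medical professor specialized in pediatric "
    "gastroenterology hepatology and nutrition..."  # truncated for brevity
)

def build_messages(question, chunks):
    """Combine the system prompt, retrieved chunks, and one question."""
    context = "\n\n".join(chunks)
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",
         "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]

msgs = build_messages(
    "Q8. Which program was used for statistical analysis?",
    ["Statistical analyses were performed using SPSS 29."],
)
```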
The language models used in this study rely on statistical models that incorporate random seeds to facilitate the generation of diverse outputs. However, the companies behind these LLMs do not offer a stable way to fix these seeds, meaning that a degree of randomness is inherent in their responses. To further control this randomness, we used the “temperature” parameter within the language models. This parameter allows for adjustment of the level of randomness, with a lower temperature setting generally producing more deterministic outputs. For this study, we opted for a low-temperature parameter setting of 0.1 to minimize the impact of randomness. Despite these efforts, complete elimination of randomness is not possible. To further mitigate its effects and enhance the consistency of our findings, we repeated each question 10 times for the same language model. By analyzing the responses across these 10 repetitions, we could determine the frequency of accurate and consistent answers. This approach helped to identify instances where the LLM’s responses were consistently aligned with the benchmark answers, highlighting areas of strength and consistency in comprehension.
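The repetition strategy described above can be sketched as a simple tally: each question is asked 10 times at temperature 0.1, and the frequency of benchmark-matching answers is recorded. The responses below are illustrative data, not study output.

```python
# Sketch: tally benchmark-matching answers across repeated queries and
# report the majority answer, as in the 10-repetition design above.
from collections import Counter

def consistency(responses, benchmark):
    """Return (correct_count, most_common_answer) over repeated queries."""
    correct = sum(r == benchmark for r in responses)
    most_common = Counter(responses).most_common(1)[0][0]
    return correct, most_common

responses = ["yes"] * 8 + ["no"] * 2   # 10 illustrative repetitions
correct, majority = consistency(responses, "yes")
```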
Each question was posed 10 times within the same time period to each LLM to ensure the consistency and reliability of responses. The repeated responses to each question were then analyzed to determine how many aligned with the benchmark. Only answers that were correct and followed the instructions provided in the question text were considered “correct”; ambiguous answers, evident mistakes, and responses listing an excessive number of candidate answers were considered incorrect. Each question and its response formed the unit of analysis, and the data were summarized as numbers and percentages using descriptive statistics. The Shapiro-Wilk test was used to assess whether the data were normally distributed, and the Kruskal-Wallis and Pearson chi-square tests were used for group comparisons. The type I error level was set at 5%, and analyses were performed using SPSS (version 29.0; IBM Corp).
This study only used information that had already been published on the internet. Ethics approval is not required for this study since it did not involve any human or animal research participants. This study did not involve a clinical trial, as it focused on evaluating the capabilities of AI tools in understanding medical papers.
In this study, 15 questions selected from the STROBE checklists were posed 10 times each for 39 papers to 6 different LLMs. Access issues with Claude v1, specifically restrictions on its ability to process certain medical information, resulted in the exclusion of data from 6 papers, limiting the study’s scope to 33 papers. The percentage of correct answers for each LLM is shown in Table 3 , with GPT-3.5-Turbo achieving the highest rate (n=3916, 66.9%), followed by GPT-4-1106 (n=3837, 65.6%), PaLM 2 (n=3632, 62.1%), Claude v1 (n=2887, 58.3%), Gemini Pro (n=2878, 49.2%), and GPT-4-0613 (n=2580, 44.1%).
Table 3. Correct answers by LLM.

| LLM | Total questions asked | Correct answers, n (%) |
| --- | --- | --- |
| GPT-3.5-Turbo-1106 | 5850 | 3916 (66.9) |
| GPT-4-0613 | 5850 | 2580 (44.1) |
| GPT-4-1106 | 5850 | 3837 (65.6) |
| Claude v1 | 4950 | 2887 (58.3) |
| PaLM 2-chat-bison | 5850 | 3632 (62.1) |
| Gemini Pro | 5850 | 2878 (49.2) |
Each LLM was compared with the LLM that achieved the next-lower percentage of correct answers. Statistical analysis using the Kruskal-Wallis test revealed statistically significant differences between the LLMs (P<.001). The lowest correct answer percentage was provided by GPT-4-0613, at 44.1% (n=2580). Gemini Pro yielded 49.2% (n=2878) correct answers, significantly higher than GPT-4-0613 (P<.001). Claude v1 yielded 58.3% (n=2887) correct answers, significantly higher than Gemini Pro (P<.001). PaLM 2 achieved 62.1% (n=3632) correct answers, significantly higher than Claude v1 (P<.001). GPT-4-1106 achieved 65.6% (n=3837) correct answers, significantly higher than PaLM 2 (P<.001). The difference between GPT-4-1106 and GPT-3.5-Turbo-1106 was not statistically significant (P=.06). Of the 39 papers analyzed, 28 (71.8%) were published before the training data cutoff date for GPT-3.5-Turbo and GPT-4-0613, while all 39 (100%) papers were published before the cutoff date for GPT-4-1106. Explicit cutoff dates for the remaining LLMs (Claude, PaLM 2, and Gemini Pro) were not publicly available and therefore could not be assessed in this study. When all LLMs are considered collectively, the 3 questions receiving the highest percentage of correct answers were question 12 (n=4025, 68.3%), question 13 (n=3695, 62.8%), and question 10 (n=3565, 60.5%). Conversely, the 3 questions with the lowest percentage of correct responses were question 8 (n=1971, 33.5%), question 15 (n=2107, 35.8%), and question 1 (n=2147, 36.5%; Table 4).
Table 4. Correct answers by question across all LLMs.

| Question | Correct answers (across all LLMs), n (%) |
| --- | --- |
| Q1 | 2147 (36.5) |
| Q2 | 3061 (52.0) |
| Q3 | 2953 (50.2) |
| Q4 | 2713 (46.2) |
| Q5 | 3353 (57.1) |
| Q6 | 3132 (53.3) |
| Q7 | 2530 (43.0) |
| Q8 | 1971 (33.5) |
| Q9 | 2288 (38.9) |
| Q10 | 3565 (60.5) |
| Q11 | 3339 (56.9) |
| Q12 | 4025 (68.3) |
| Q13 | 3695 (62.8) |
| Q14 | 2578 (43.8) |
| Q15 | 2107 (35.8) |
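The pairwise comparisons reported above can be illustrated with a 2×2 Pearson chi-square statistic over correct/incorrect counts, here using the GPT-4-0613 and Gemini Pro totals. This is an illustrative recomputation, not the study's exact procedure; the p-value itself would come from the chi-square distribution with 1 df (eg, via scipy) and is omitted.

```python
# Sketch: 2x2 Pearson chi-square statistic comparing two LLMs'
# correct/incorrect counts, using the shortcut formula
# chi2 = n(ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)).

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# correct / incorrect out of 5850 questions each (totals from Table 3)
gpt4_0613 = (2580, 5850 - 2580)
gemini = (2878, 5850 - 2878)
stat = chi_square_2x2(*gpt4_0613, *gemini)  # well above the 3.84 critical value
```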
The percentages of correct answers given by all LLMs for each question are depicted in Figure 3 . The median values for questions 7, 8, 9, 10, and 14 were similar across all LLMs, indicating a general consistency in performance for these specific areas of comprehension. However, significant differences were observed in the performance of different LLMs for other questions. The statistical tests used in this analysis were the Kruskal-Wallis test for comparing the medians of multiple groups and the chi-square test for comparing categorical data. For question 1, the fewest correct answers were provided by Claude (n=124, 24.8%) and Gemini Pro (n=197, 39.5%), while the most correct answers were provided by PaLM 2 (n=301, 60.3%; P =.01). In question 2, Claude v1 (n=366, 73.3%) achieved the highest median correct answer count (10.0, IQR 5.0-10.0), while Gemini Pro provided the fewest correct answers (n=237, 47.4%; P =.03). For question 3, GPT-3.5 (n=425, 85.1%) and PaLM 2 (n=434, 86.8%) had the highest median correct answer counts, while GPT-4-0613 (n=164, 32.8%) and Gemini Pro (n=189, 37.9%) had the lowest ( P <.001). In the fourth question, PaLM 2 (n=369, 73.8%), GPT-3.5 (n=293, 58.7%), and GPT-4-1106 (n=336, 67.2%) performed best, while GPT-4-0613 (n=187, 37.4%) showed the lowest performance ( P <.001). For questions 5 and 6, GPT-4-0613 (n=209, 41.8%) and Gemini Pro (n=186, 37.2%) provided fewer correct answers compared to the other LLMs ( P <.001 and P =.001, respectively). In question 11, GPT-4-1106 (n=406, 81.2%), Claude (n=347, 69.4%), and PaLM 2 (n=406, 81.2%) performed well, while Gemini Pro (n=264, 52.8%) had the fewest correct answers ( P =.001). For questions 12 and 13, all LLMs, except GPT-4-0613, performed well in these areas ( P <.001). In question 15, GPT-3.5 (n=368, 73.6%) showed the highest number of correct answers ( P <.001; Multimedia Appendix 1 ).
AI can improve the data analysis and publication process in scientific research, but it can also be used to generate fraudulent medical papers [ 16 ]. Although such papers may appear well-crafted, their semantic inaccuracies and errors can be detected by expert readers upon closer examination [ 11 , 17 ]. The impact of LLMs on health care is often discussed in terms of their ability to replace health professionals, while their applications and limitations in medical and research writing are often overlooked. Physicians involved in research therefore need to be cautious and verify information when using LLMs. Because overreliance on these tools can lead to ethical concerns and inaccuracies, the scientific community should remain vigilant, using AI tools as aids rather than replacements and understanding their limitations and biases [ 10 , 18 ]. With millions of papers published annually, AI could generate summaries or recommendations, simplifying evidence gathering and enabling researchers to grasp important aspects of scientific results more efficiently [ 18 ]. However, little research has assessed how well LLMs comprehend academic papers.
This study aimed to evaluate the ability of 6 different LLMs to understand medical research papers using the STROBE checklist. We used a novel benchmark pipeline that processed 39 PubMed papers, posing 15 questions derived from the STROBE checklist to each model. The benchmark was established using the answers provided by an experienced medical professor and validated by an epidemiologist, serving as a reference standard against which the LLMs’ responses were compared. To mitigate the problem of “artificial hallucinations” inherent to LLMs, our study implemented the RAG method, which involves using a web application to dissect PDF-format medical papers into text chunks and present them to the LLMs.
Our findings reveal significant variation in the performance of different LLMs, suggesting that LLMs are capable of understanding medical papers to varying degrees. While newer models like GPT-3.5-Turbo and GPT-4-1106 generally demonstrated better comprehension, GPT-3.5-Turbo outperformed even the more recent GPT-4-0613 in certain areas. This unexpected finding highlights the complexity of LLM performance, indicating that simple assumptions about newer models consistently outperforming older ones may not always hold true. The impact of training data cutoffs on LLM performance is a critical consideration in evaluating their ability to understand medical research [ 19 ]. While we were able to obtain explicitly stated cutoff dates for GPT-3.5-Turbo, GPT-4-1106, and GPT-4-0613, this information was not readily available for the remaining models. This lack of transparency regarding training data limits our ability to definitively assess the impact of knowledge cutoffs on model performance. The observation that all 39 papers were published before the cutoff date for GPT-4-1106, while only 28 papers were published before the cutoff date for GPT-3.5-Turbo and GPT-4-0613, suggests that the knowledge cutoff may play a role in the observed performance differences. GPT-4-1106, with a more recent knowledge cutoff, has access to a larger data set, potentially including information from more recently published research. This could contribute to its generally better performance compared to GPT-3.5-Turbo. However, it is important to note that GPT-3.5-Turbo still outperformed GPT-4-0613 in specific areas, even with a similar knowledge cutoff. This suggests that factors beyond training data (eg, the number of layers, the type of attention mechanism, or the use of transformers) and compression techniques (eg, quantization, pruning, or knowledge distillation) may also play a significant role in LLM performance. 
Future research should prioritize transparency regarding training data cutoffs and aim to standardize how LLMs communicate these crucial details to users.
This study evaluated the performance of various LLMs in accurately answering specific questions related to different sections of a scholarly paper: title and abstract, methods, results, discussion, and funding. The results shed light on which LLMs excel in specific areas of comprehension and information retrieval from academic texts. PaLM 2 (n=219, 60.3%) showed superior performance in question 1, identifying the study design from the title or abstract, suggesting an enhanced capability to understand and identify specific terminologies. Claude (n=82, 24.8%) and Gemini Pro (n=154, 39.5%), however, lagged behind, indicating a potential area for improvement in terminology recognition and interpretation. Claude v1 (n=242, 73.3%) and PaLM 2 (n=295, 86.8%) exhibited strong capabilities in identifying methodological details, such as observational study types and settings or locations (questions 2-8), suggesting a robust understanding of complex methodological descriptions and the ability to distinguish between different study frameworks. For questions regarding the results section (questions 9-11), models such as GPT-4-1106 (n=317, 81.3%), Claude (n=229, 69.4%), and PaLM 2 (n=276, 81.2%) showed superior performance in providing correct answers related to the study participants’ demographic characteristics and the use of flowcharts. All LLMs except GPT-4-0613 (n=89, 22.8%) exhibited remarkable competence in summarizing key results, discussing limitations, and addressing the generalizability of the study (questions 12-14), which are critical aspects of the discussion section. GPT-3.5 (n=287, 73.6%) particularly excelled in identifying the mention of funding (question 15), indicating a nuanced understanding of acknowledgments and funding disclosures, which are often embedded toward the end of papers.
Across the array of tested questions, both GPT-3.5 and PaLM 2 exhibited remarkable strengths in understanding and analyzing scholarly papers, with PaLM 2 generally showing a slight edge in versatility, especially in interpreting methodological details and study design. GPT-3.5, while strong in discussing study limitations, generalizability, and funding details, showed room for improvement in extracting complex methodological information. Different models excelled in different areas, indicating that no single LLM currently demonstrates universal dominance in medical paper understanding. This suggests that factors such as training data, model architecture, and question complexity influence performance, and further research is needed to understand the specific contribution of each factor.
LLMs can be directly questioned and can generate answers from their own memory [ 11 ], an approach that has been extensively studied in the medical literature. In one study, ChatGPT was evaluated on the United States Medical Licensing Examination; it performed at or near the passing threshold without any specialized training, demonstrating a high level of concordance and insight in its explanations. These findings suggest that LLMs have the potential to aid in medical education and potentially assist with clinical decision-making [ 5 , 20 ]. Another study aimed to evaluate the knowledge level of GPT in medical education by assessing its performance in a multiple-choice question examination and its potential impact on the medical examination system. The results indicated that GPT achieved a satisfactory score in both basic and clinical medical sciences, highlighting its potential as an educational tool for medical students and faculties [ 21 ]. Furthermore, GPT offers information and aids health care professionals in diagnosing patients by analyzing symptoms and suggesting appropriate tests or treatments, although advancements are required to ensure AI’s interpretability and practical implementation in clinical settings [ 8 ]. A study conducted in October 2023 explored the diagnostic capabilities of GPT-4V, a multimodal AI model, in complex clinical scenarios involving medical imaging and textual patient data. GPT-4V had the highest diagnostic accuracy when provided with multimodal inputs, aligning with confirmed diagnoses in 80.6% of cases [ 22 ]. In another study, which evaluated the effectiveness of the newly developed GPT-4 in solving complex medical case challenges, GPT-4 was instructed to address each case through multiple-choice questions followed by an unedited clinical case report.
GPT-4 correctly diagnosed 57% of the cases, outperforming 99.98% of human readers who were also tasked with the same challenge [ 23 ]. These studies highlight the potential of multimodal AI models like GPT-4 in clinical diagnostics, but further investigation is needed to uncover biases and limitations due to the model’s proprietary training data and architecture.
Few studies have directly questioned LLMs and compared their capacity to produce answers from their own memories with one another and with expert clinicians. In one study, GPT-3.5 and GPT-4 were compared with orthopedic residents on the American Board of Orthopaedic Surgery written examination: residents scored higher overall, but a subgroup analysis revealed that GPT-3.5 and GPT-4 outperformed residents on text-only questions, while residents scored higher on image interpretation questions; GPT-4 scored higher than GPT-3.5 [ 24 ]. Another study evaluated and compared the recommendations provided by GPT-3.5 and GPT-4 with those of primary care physicians for the management of depressive episodes. Both GPT-3.5 and GPT-4 largely aligned with accepted guidelines for treating mild and severe depression while not exhibiting the gender or socioeconomic biases observed among primary care physicians; however, further research is needed to refine the AI recommendations for severe cases and address potential ethical concerns and risks associated with their use in clinical decision-making [ 25 ]. Another study assessed the accuracy and comprehensiveness of health information regarding urinary incontinence generated by various LLMs: by inputting selected questions into GPT-3.5, GPT-4, and Gemini, the researchers found that GPT-4 performed the best in terms of accuracy and comprehensiveness, surpassing GPT-3.5 and Gemini [ 26 ]. Similarly, in a study evaluating the performance of 2 GPT models (GPT-3.5 and GPT-4) and human professionals in answering ophthalmology questions from the StatPearls question bank, GPT-4 outperformed both GPT-3.5 and human professionals on most ophthalmology questions, showing significant performance improvements and emphasizing the potential of advanced AI technology in the field of ophthalmology [ 27 ].
Some studies have shown that GPT-4 is more proficient than GPT-3.5, scoring higher in both multiple-choice dermatology examinations and non–multiple-choice cardiology heart failure questions from various sources and outperforming GPT-3.5 and Flan-PaLM 540B on medical competency assessments and benchmark data sets [ 28 - 30 ]. In a study of the nephrology multiple-choice test-taking ability of various open-source and proprietary LLMs, the performance of the open-source models on 858 nephSAP questions ranged from 17.1% to 30.6%, whereas Claude 2 achieved 54.4% accuracy and GPT-4 achieved 73.3%, highlighting the potential for adaptation in medical training and patient care scenarios [ 31 ]. To our knowledge, ours is the first study to assess the capabilities of different LLMs in evaluating and understanding medical papers. The findings reveal that the performance of LLMs varies across different questions, with some LLMs showing superior understanding and answer accuracy in certain areas. Comparative analysis across different LLMs showcases a gradient of capabilities. The results revealed a hierarchical performance ranking as follows: GPT-4-1106 equals GPT-3.5-Turbo, which is superior to PaLM 2, followed by Claude v1, then Gemini Pro, and finally, GPT-4-0613. Consistent with the literature, GPT-4-1106 and GPT-3.5 showed improved accuracy and understanding compared with the other LLMs, mirroring wider trends that indicate LLMs’ rapid evolution and increasing sophistication in handling complex medical queries. Notably, GPT-3.5-Turbo performed better than GPT-4-0613, which may be counterintuitive given the tendency to assume that newer iterations naturally perform better. This anomaly can be attributed to the application of compression techniques in developing new models to reduce computational costs.
While these advancements make deploying LLMs more cost-effective and thus accessible, they can inadvertently compromise the performance of LLMs. The notable absence of responses from PaLM in certain instances, actually stemming from Google’s policy to restrict the use of its medical information, presents an intriguing case within the scope of our discussion. Despite these constraints, PaLM’s demonstrated high performance in other areas is both surprising and promising. This suggests that even when faced with limitations on accessing a vast repository of medical knowledge, PaLM’s underlying architecture and algorithms enable it to make effective use of the information it can access, showcasing the robust potential of LLMs in medical settings even under restricted conditions.
While LLMs can be directly questioned and generate answers from their own memory, as demonstrated in the studies above, this approach can lead to inaccuracies known as hallucinations. Hallucinations in LLMs have diverse origins spanning the entire capability acquisition process and are primarily categorized into 3 aspects: data, training, and inference; architecture flaws, exposure bias, and misalignment issues in both the pretraining and alignment phases can all induce hallucinations. To address this challenge, our study used the RAG method, ensuring that the LLMs’ responses were grounded in factual information retrieved from the target paper. The RAG method intuitively addresses the knowledge gap by conditioning language models on relevant documents retrieved from an external knowledge source [ 12 , 32 ]: it provides the LLM with relevant text chunks extracted from the specific paper being analyzed, ensuring that the LLM’s responses are directly supported by the provided information and reducing the risk of hallucination. While a few studies have explored the use of RAG to compare LLMs, such as one demonstrating GPT-4’s improved accuracy with RAG for interpreting oncology guidelines [ 33 ], our study is the first to evaluate LLM comprehension of medical research papers using this method. The design of system prompts is also crucial for LLMs, as it provides context, instructions, and formatting guidelines to ensure the desired output [ 34 ]. In this study, we empirically determined that a foundational system prompt universally enhanced response quality across all language models tested. This approach was designed to optimize the comprehension and summarization capabilities of each generative AI tool when processing medical research papers.
The specific configuration of system settings and query structures we identified significantly contributed to improving the accuracy and relevance of the models’ answers. These optimized parameters were crucial in achieving a more standardized and reliable evaluation of each model’s ability to understand complex medical texts. While further research is needed to fully understand the effectiveness of RAG across different medical scenarios, our findings demonstrate its potential to enhance the reliability and accuracy of LLMs in medical research comprehension.
This study, while offering valuable insights, is subject to several limitations. The selection of papers focused solely on obesity and the use of a specific set of 15 STROBE-derived questions might not fully capture the breadth of medical research. Additionally, the reliance on binary and multiple-choice questions restricts the evaluation of LLMs’ ability to provide nuanced answers. The rapid evolution of LLMs means that the findings might not apply to future versions, and potential biases within the training data have not been systematically assessed. Furthermore, the study’s reliance on a single highly experienced medical professor as the benchmark, while practical, might limit the generalizability of the findings; a larger panel of experts with diverse areas of specialization might provide a more comprehensive reference standard for evaluating LLM performance. Further investigation with a wider scope and more advanced methodologies is needed to fully understand the potential of LLMs in medical research.
In conclusion, LLMs show promise for transforming medical research, potentially enhancing research efficiency and evidence-based decision-making. This study demonstrates that LLMs exhibit varying capabilities in understanding medical research papers. While newer models generally demonstrate better comprehension, no single LLM currently excels in all areas. This highlights the need for further research to understand the complex interplay of factors influencing LLM performance. Continued research is crucial to address these limitations and ensure the safe and effective integration of LLMs in health care, maximizing their benefits while mitigating risks.
The authors gratefully acknowledge Dr Hilal Duzel for her invaluable assistance in validating the reference standard used in this study. Dr Duzel’s expertise in epidemiology and statistical analysis ensured the accuracy and robustness of the benchmark against which the LLMs were evaluated. We would also like to thank Ahmet Hamza Dogan, a promising future engineer, for his contributions to the LLM analysis.
None declared.
Multimedia Appendix 1: Percentages of correct answers by large language models for each question.
Abbreviations:
AI: artificial intelligence
LLM: large language model
RAG: retrieval augmented generation
STROBE: Strengthening the Reporting of Observational Studies in Epidemiology
Edited by A Castonguay; submitted 07.04.24; peer-reviewed by C Wang, S Mao, W Cui; comments to author 04.06.24; revised version received 16.06.24; accepted 05.07.24; published 04.09.24.
©Seyma Handan Akyon, Fatih Cagatay Akyon, Ahmet Sefa Camyar, Fatih Hızlı, Talha Sari, Şamil Hızlı. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 04.09.2024.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.
Background: Reading medical papers is a challenging and time-consuming task for doctors, especially when the papers are long and complex. A tool that can help doctors efficiently process and understand medical papers is needed. Objective: This study aims to critically assess and compare the comprehension capabilities of large language models (LLMs) in accurately and efficiently understanding ...