An Empirical Study on Using Large Language Models to Analyze Software Supply Chain Security Failures

As we increasingly depend on software systems, the consequences of breaches in the software supply chain become more severe. High-profile cyber attacks like those on SolarWinds and ShadowHammer have resulted in significant financial and data losses, underlining the need for stronger cybersecurity. One way to prevent future breaches is by studying past failures. However, traditional methods of analyzing these failures require manually reading and summarizing reports about them. Automated support could reduce costs and allow analysis of more failures. Natural Language Processing (NLP) techniques such as Large Language Models (LLMs) could be leveraged to assist the analysis of failures. In this study, we assessed the ability of LLMs to analyze historical software supply chain breaches. We used LLMs to replicate the manual analysis of 69 software supply chain security failures performed by members of the Cloud Native Computing Foundation (CNCF). We developed prompts for LLMs to categorize these by four dimensions: type of compromise, intent, nature, and impact. GPT 3.5's categorizations had an average accuracy of 68% and Bard's had an average accuracy of 58% over these dimensions. We report that LLMs effectively characterize software supply chain failures when the source articles are detailed enough for consensus among manual analysts, but cannot yet replace human analysts. Future work can improve LLM performance in this context, and study a broader range of articles and failures.


INTRODUCTION
Software mediates almost all aspects of modern life [58]. To reduce development time, software applications integrate dependencies both directly (e.g., importing a library) and indirectly (e.g., that library's dependencies). These dependencies may come to dominate the application's risk profile: it has been estimated that the source code of a typical web application consists of 80% dependencies and only 20% custom business logic [1,75]. The owners of these dependencies may be external to the organization developing the application, and thus the reduction of development time comes with an increase in risks associated with this software supply chain [30,49]. One potential risk is a software supply chain attack: actors insert or exploit vulnerable logic in dependencies, these dependencies are integrated into applications, and the vulnerability becomes exploitable in application deployments [71].
In a failure-aware engineering process, engineers study past failures to prevent future ones [8,77]. Although organizations may be unwilling to publicly disclose their own failures, news articles and other kinds of grey literature could provide sufficient information on failures [7]. Such data constitute "Open-Source Intelligence" [85], and are used by governmental bodies, military institutions, and law enforcement agencies [48] to design security offenses and defenses. Current approaches to gathering open-source intelligence, e.g., studying news articles of failures, require costly expert manual analysis. For example, the Cloud Native Computing Foundation (CNCF) maintains a database of software supply chain security failures through manual analysis [51]. This database has been further analyzed manually [45].
With the goal of reducing the costs of manual analysis, we assess the effectiveness of Large Language Models (LLMs) in gathering open-source intelligence. We explored the effectiveness of LLMs at replicating the classifications of the CNCF database [51] made by Geer et al. [45] and the CNCF database maintainers. We conducted prompt engineering to iteratively develop prompts that performed well on a sample of 20% of the articles and then evaluated performance on the remaining 80%. In addition, we introduced a new category of analysis, "Lessons learned", to assess the usefulness of an LLM's recommendations.
We compared the performance of two state-of-the-art LLMs, OpenAI's GPT and Google's Bard, on these prompts. GPT outperformed Bard in all cases. GPT's accuracy ranged from 52% to 88% on the pre-defined dimensions. On the open-ended "Lessons learned", our research team rated GPT's performance as reasonable but not excellent, with an average helpfulness score of 3.83/5. Not surprisingly, the quality of the LLMs' outputs depends on the level of detail provided in the source articles: more comprehensive articles lead to higher-quality responses, as well as less disagreement among the manual raters. Lastly, we note that sometimes we preferred GPT's rating over that provided by the CNCF, suggesting that ground truth may be difficult to establish in this context.
Our contributions are:
• An extended analysis of a catalog of software supply chain failures
• An evaluation of LLMs at replicating manual characterization of software supply chain failures
• An evaluation of LLMs at extracting lessons learned from software supply chain failures

2 BACKGROUND AND RELATED WORK

Software Supply Chain
Over the years, software production has changed significantly. Early software engineers wrote most code from scratch, increasing production costs [88]. As reusable libraries and frameworks became more available, software engineers shifted to more software reuse [83]. Software applications now commonly rely on external code components, often referred to as dependencies. These dependencies, including packages, libraries, frameworks, and other artifacts, serve as building blocks in modern software development [83].
This paradigm shift leads to the software supply chain: the collection of systems, devices, and people which result in a final software product [26]. Figure 2 provides an illustration. According to Google [18], the constituents of a software supply chain include: (1) The code developed by teams, its dependencies, and the various internal and external software applications utilized in the development, compilation, packaging, and installation of the software; (2) The rules and procedures used in all stages of the process; and (3) The systems used for the development of the software and its dependencies. A software supply chain can also be viewed as a network linking actors who perform operations on artifacts [30,69,71]. The popularity of, and reliance on, third-party dependencies have been reported in various studies. For example, a 2012 study by Nikiforakis et al. [68] showed that 88% of the Alexa top 10,000 websites included at least one remote JavaScript library. Also, according to a 2019 Synopsys Black Duck report, over 96% of the applications they analyzed include some OSS libraries. These libraries often make up more than 50% of the average code-base [79]. In the 2023 version of this report, the percentage of code in codebases that was open source had risen to about 80% [1,75].
Software supply chains come with a tradeoff. Costs are reduced during product development and maintenance, but harm may result due to a mismatch between the desired integrity level of a product and the integrity level achieved by one's dependencies. Defects in dependencies may cause an application to fail, as we discuss next.

Software Supply Chain Attacks
Faults in software supply chains leave applications vulnerable to attack [86]. Attacks on software supply chains (or records about them) are a recent trend, following the industry shift to relying on third-party components (§2.1).
According to a 2021 Sonatype report [83], from February 2015 to June 2019 only 216 software supply chain attacks were recorded, then from July 2019 to May 2020 there were 929 attacks recorded, and from 2020 to 2021 there were over 12,000 attacks recorded. In their 2022 report, this number skyrocketed to 88,000 [84]. Some high-profile attacks, such as SolarWinds [53] and ShadowHammer [59], threatened US national security.
These and similar attacks have inspired comments from many organizations. Governmental organizations such as the Cybersecurity and Infrastructure Security Agency (CISA), the National Security Agency (NSA), and the European Union Agency for Cybersecurity (ENISA) have published threat reports and guidance for securing software supply chains [31,35-38]. Industry organizations such as the Cloud Native Computing Foundation (CNCF) have also published their own findings and suggestions [52]. These findings have led to the development of security frameworks such as the widely-recognized Supply-chain Levels for Software Artifacts (SLSA) [43].
Academics have also begun to focus on software supply chain attacks. Ohm et al. [70], Ladisa et al. [62], Zimmerman et al. [95], and Zahan et al. [92] studied and characterized attacks on the software supply chain. Okafor et al. [71] condensed existing knowledge about software supply chain attacks into a four-stage attack pattern consisting of initial compromise, alteration, propagation, and exploitation. Table 1 summarizes many avenues for these attacks.

Failure Studies in Software Engineering
Software engineers have finite resources to produce software [82]. Engineers accept some defects [23,61], but try to eliminate severe defects that may cause incidents: undesired, unplanned, software-induced events that incur substantial loss [63]. Whether severe defects are caught internally or result in incidents, their presence is a failure indicating a flawed software engineering process.
All engineered systems will fail, regardless of the process (e.g., Agile or Plan-based) and methods (e.g., test-driven development or formal methods). For example, Fonseca et al. identified 16 defects across three formally verified systems [42] due to invalid assumptions about the software environment. Across all schools of software engineering thought, from ISO to Agile, guidelines agree that software engineers should analyze failures to improve for next time [2,12,13,20,40,41,47,55,56,60]. In light of this, techniques to learn from failures [17] as well as to manage the resulting knowledge [29] are important software engineering knowledge.
Many researchers have studied software failures in an effort to learn from them [5,7,25,45]. This failure analysis research has advanced the software engineering field [7,24,63]. However, the high costs associated with failure analysis methods, which rely on manual analysis, deter many organizations from undertaking failure analysis [76]. In their literature review, Amusuo et al. noted that the typical methodology of academic failure analysis is also manual analysis, and recommended the evaluation of Natural Language Processing (NLP) tools to assist in these tasks [6]. Our study responds by evaluating NLP tools in the context of analyzing cybersecurity failures in the software supply chain.
Table 1 (fragment). Types of compromise.

Malicious Maintainer
Occurs when a maintainer, or an entity posing as a maintainer, deliberately injects a vulnerability somewhere in the supply chain or in the source code. This kind of compromise could have great consequences because usually the individual executing the attack is considered trustworthy by many. This category includes attacks from experienced maintainers going rogue, account compromise, and new personas performing an attack soon after they have acquired responsibilities.

Attack Chaining
Sometimes a breach may be attributed to multiple lapses, with several compromises chained together to enable the attack. The attack chain may include types of supply chain attacks as defined here. However, catalogued attack chains often include other types of compromise, such as social engineering or a lack of adherence to best practices for securing publicly accessible infrastructure components.

2.4 Natural Language Processing in Support of Software Engineering
2.4.1 NLP to Analyze Supply Chain Failures. In §2.2 we noted that many governments, companies, and academics are studying software supply chain failures. To the best of our knowledge, these studies are conducted manually. This reduces the number of organizations that can gather such intelligence, and we expect that manual efforts will not scale as the number of software supply chain attacks continues to increase.
We believe that recent progress in NLP (Natural Language Processing) could enable large-scale analysis of supply chain failures. Specifically, recent advancements in Large Language Models (LLMs) could aid in studying supply chain failures. LLMs are neural network-based language models that are capable of "understanding" natural language and extracting structured information from unstructured text data [14]. We therefore hypothesize they could extract relevant failure information from software supply chain failure data sources. We are not aware of prior work on this topic.

Other Applications of NLP in SE. Natural Language Processing (NLP) has been leveraged for various phases of the Software Development Life-Cycle (SDLC). NLP tools have been proposed for detecting, extracting, modeling, tracing, classifying, and searching tasks in the specification phase [94]. NLP tools have been proposed for modeling software systems during the design phase [11]. NLP tools have been proposed to assist with the development phase by helping detect vulnerabilities and generating code [34]. NLP tools have been proposed to assist during the testing phase [44]. NLP tools have been proposed to identify risks during the deployment phase [87]. NLP tools have been proposed to classify user feedback to assist during the maintenance phase [74]. In this paper, we apply NLP tools to learn from failures.

RESEARCH QUESTIONS
To reduce the costs of analyzing software supply chain failures, we explore the effectiveness of Large Language Models (LLMs) in automating the analysis of these failures. Towards this goal, we used LLMs to replicate a manual study of software supply chain failures [51]. Specifically, we investigate:
• RQ1: How effective are LLMs in replicating manual analysis of software supply chain failures?
• RQ2: Do LLMs suggest viable mitigation strategies for preventing future failures?

METHODOLOGY
Fig. 3. Overview of experiment design. The CNCF catalog manually characterizes software supply chain failures from the news and blogs. We extended this catalog with additional characteristics. We conducted prompt engineering to leverage LLMs to automatically analyze the news and blogs. We compare an LLM's analysis against the manual analysis.
An overview of our methodology is illustrated in Figure 3. To assess the effectiveness of LLMs at replicating manual analysis of software supply chain failures, we compare the analysis of a manually generated catalog against the responses generated by two popular LLMs: ChatGPT [57] and Bard [78]. Specifically, to replicate the catalog, we engineered prompts for the LLMs to extract type of compromise, intent, nature, and impact information from the source blogs and news reports. Additionally, we constructed a prompt to gather lessons learned, similar to a postmortem [10]. We evaluate the LLM-generated catalog for correctness against the CNCF baseline manual catalog. We manually extract the intent, nature, and impact information and compare it against the LLM's extraction, to evaluate the LLM's effectiveness at conducting an extended failure analysis.

Articles for analysis
The CNCF's "Catalog of Supply Chain Compromises" was used as the baseline dataset [51]. We are not aware of an alternative dataset. This is a catalog of 69 software supply chain security failures analyzed from news articles and blogs from 1984-2022. Each entry describes the failure and its impacts. Some examples are in Table 2.

Dimensions of analysis
The dimensions of analysis that we replicate and conduct for the software supply chain failures are outlined in Table 3.
Additionally, we extend the analysis of the articles in the catalog to explore the capabilities of LLMs at analyzing failures based on data commonly collected to classify and analyze failures [10]. We constructed prompts to extract the intent [10], nature [10], impacts [25], and lessons learned [67] from the failures.
Table 3. Dimensions used to analyze the capabilities of LLMs. The CNCF database includes "Type of compromise". Our research team labeled each catalog entry for the next three dimensions. The final dimension was assessed via a Likert scale.

Dimension | Description
Type of compromise | What kind of failure occurred [51]? See Table 1 for types.
Intent | Was the "software root cause" of the failure accidental or deliberate? [10]
Nature | Was the failure a vulnerability or an exploit? For exploits, was the actor an insider or outsider?

The CNCF catalog provides the type of compromise for the failures, stated in Table 2. By manually analyzing the articles, we extend this catalog with three additional dimensions of analysis: intent, nature, and impacts.
For the dimension of Type of Compromise, the CNCF catalog provides a label (from analysis conducted by members of the CNCF organization), and we used their label. We used existing taxonomies for the dimensions of Intent, Nature, and Impacts, drawing from related works [10,25].
We had 3 pairs of analysts manually analyze 23 sources per pair for these additional dimensions. They were trained on articles until consistent agreement and definitions were reached. Table 4 shows the inter-rater agreement for these dimensions, measured using Cohen's kappa score. The accuracy for these dimensions was computed in a similar manner. In the case of the "Impacts" dimension, we observed a low inter-rater agreement (κ = 0.34). Given the substantial judgment (or uncertainty) in this dimension, we adopted a "union" strategy of accepting the assessment of either rater to determine accuracy. For all other dimensions, disagreements were resolved by the authors.
See §9 for summary distributions of the labels per dimension.
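To make the agreement measure concrete, the following is a minimal sketch of Cohen's kappa for two raters who label the same items. The function name and the toy labels are illustrative assumptions, not the study's data or tooling.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for the "Intent" dimension on four articles.
rater_1 = ["accidental", "deliberate", "deliberate", "deliberate"]
rater_2 = ["accidental", "deliberate", "accidental", "deliberate"]
kappa = cohens_kappa(rater_1, rater_2)  # observed 0.75, chance 0.50
```

In this toy case the raters agree on 3 of 4 items, chance agreement is 0.5, and kappa is (0.75 - 0.5) / (1 - 0.5) = 0.5.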
4.3.2 For RQ2. For RQ2, we opted not to build a controlled taxonomy of "lessons learned" due to the open-ended nature of the prompt. Instead, we had human raters evaluate the recommendations using a 5-point Likert scale, ranging from "Strongly disagree" to "Strongly agree". The raters scored each response on its quality and on whether it would mitigate a future attack.

We used OpenAI's GPT model [57] and Google's Bard model [78]. Their properties are summarized in Table 5. Other large language models are available, e.g., Claude [9] and Cohere [19], but GPT and Bard are the most widely used due to their user-friendly interfaces.
ChatGPT-3.5-turbo, OpenAI's LLM. GPT-3.5-turbo is a large language model created by OpenAI. It uses a deep learning method known as transformers. It is currently one of the most popular and accurate LLMs [91]. GPT-3.5 uses 175B parameters and is trained on the same datasets used by GPT-3 but with a fine-tuning process called Reinforcement Learning from Human Feedback (RLHF) [3].
Bard, Google's LLM. Bard is another popular and accurate LLM created by Google. Bard also uses transformers. It uses an optimized version of Language Models for Dialogue Applications (LaMDA) and was pre-trained on a variety of publicly available data [66] including dialogue [46].

Prompt engineering.
A prompt is the specific query (instructions or questions) given to an LLM. The behavior of an LLM varies widely as a result of seemingly minor tweaks to its prompt [65]. Prompt engineering is the process of crafting a prompt for an LLM to increase the quality of its response [89].
We used prompt engineering to iteratively develop prompts. We referred to various studies on prompt engineering [72,89,90]. For each dimension, we refined the prompt by issuing a basic query, then applying each prompt engineering technique in a cumulative sequence until the performance peaked, preserving any changes that improved on the best observed performance. Table 6 describes our approach using the first dimension, "Type of Compromise", as an example.
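The cumulative refinement loop described above can be sketched as a greedy search. This is one plausible reading of the procedure, not the authors' tooling: `refine_prompt`, the technique functions, and the toy scoring function are all hypothetical, and in practice `evaluate` would measure accuracy on the development subset of articles.

```python
from typing import Callable, Sequence, Tuple


def refine_prompt(base_prompt: str,
                  techniques: Sequence[Callable[[str], str]],
                  evaluate: Callable[[str], float]) -> Tuple[str, float]:
    """Apply techniques in order, keeping only those that beat the best score."""
    best_prompt, best_score = base_prompt, evaluate(base_prompt)
    for apply_technique in techniques:
        candidate = apply_technique(best_prompt)  # build on the best prompt so far
        score = evaluate(candidate)
        if score > best_score:  # preserve only changes that improve performance
            best_prompt, best_score = candidate, score
    return best_prompt, best_score


# Hypothetical techniques and a toy scoring function standing in for
# "accuracy on the development articles".
def add_role(prompt):
    return "You are a security analyst. " + prompt


def add_verbosity(prompt):
    return prompt + " Answer at length."


def add_choices(prompt):
    return prompt + " Choose one option."


def toy_score(prompt):
    return (0.5 + 0.2 * ("analyst" in prompt)
            + 0.1 * ("Choose one" in prompt)
            - 0.3 * ("at length" in prompt))


final_prompt, final_score = refine_prompt(
    "Classify the attack.", [add_role, add_verbosity, add_choices], toy_score)
```

Here the role and answer-choice techniques are kept, while the verbosity change is discarded because it lowers the toy score.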
This prompt engineering phase was conducted on a subset of 20% of the dataset; we used the most recently published articles. Table 9 lists the final version of each prompt.

Table 5 (fragment).
Bard [78] | Free | Unknown (estimate: 2K tokens per prompt and 50-100 prompts per 9 hours) [39] | 137 billion | None available to users

4.5 Experimental Setup
4.5.1 Order of prompts. We prompted LLMs in the order of Table 9.
4.5.2 Parameterization of LLMs. We focused on the two primary adjustable parameters of GPT-3.5, namely "temperature" and "top_p", as outlined in Table 5. According to the literature, when one of the parameters is tuned, the other should be maintained at its default setting [73]. Our preliminary tests, as shown in Table 6, were conducted with a temperature of 0 and a default top_p value of 1.
After finalizing the prompt, we examined the effect of the parameters on accuracy for the "Type of compromise". For this article, accuracy decreased as the temperature increased. The accuracy was 78% at a temperature of 0, which declined to 64% at a temperature of 0.5, and further reduced to 50% at a temperature of 1. A similar trend was noted for the top_p parameter.
The optimal performance, with an accuracy of 78%, was achieved with a temperature of 0 and the top_p parameter at its default value of 1. We retained these parameter settings for the remainder of our analysis. This decision aligns with the guidelines provided in OpenAI's documentation [73], which suggests that a lower temperature results in more focused and deterministic responses, a characteristic that is beneficial for article analysis.

4.5.3 Number of trials. We noted that the responses of GPT-3.5, configured with Temperature=0, exhibited consistent behavior. Consequently, a single trial was conducted to evaluate GPT's accuracy across the dataset. Bard's responses were less consistent, but the rate limit was low so we could only conduct one trial.
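A parameter sweep of this kind can be sketched as follows. The sketch assumes a hypothetical `classify(article, temperature=...)` callable that wraps whatever LLM client is in use; no real API is called here, and the stub classifier and toy articles exist only to make the example runnable.

```python
def sweep_temperature(classify, articles, gold, temperatures=(0.0, 0.5, 1.0)):
    """Return the best temperature and the accuracy observed at each setting."""
    accuracy = {}
    for t in temperatures:
        predictions = [classify(article, temperature=t) for article in articles]
        accuracy[t] = sum(p == g for p, g in zip(predictions, gold)) / len(gold)
    best = max(accuracy, key=accuracy.get)
    return best, accuracy


# Stub standing in for an LLM call (assumption): each toy article carries a
# threshold, and the stub answers incorrectly once temperature reaches it.
def stub_classify(article, temperature):
    _name, threshold = article
    return "Source Code" if temperature < threshold else "Negligence"


toy_articles = [("a", 0.2), ("b", 0.7), ("c", 1.5), ("d", 1.5)]
toy_gold = ["Source Code"] * 4
best_temperature, accuracy_by_temperature = sweep_temperature(
    stub_classify, toy_articles, toy_gold)
```

With this stub, accuracy falls from 1.0 to 0.75 to 0.5 as temperature rises, mirroring the direction (not the values) of the trend reported above.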

Data Analysis
We compared the results of the manual analysis against the automated analysis by the LLMs.
For RQ1, we treated each LLM as another analyst and found how accurate it is at classifying various dimensions. We quantitatively report the LLM's accuracy to measure its correctness for each dimension of analysis. In cases where the LLM's analysis disagreed with the manual analysis, we examined its justifications. We qualitatively report some of our observations. For RQ2, many distinct "lessons learned" are possible. We had analysts review each article and then the recommendations by GPT. The analysts rated the recommendations on whether the recommendations were appropriate to the article on a 5-point Likert scale: "Strongly disagree", "Disagree", "Neither disagree nor agree", "Agree", and "Strongly agree". We did not experiment with Bard for this research question due to its rate limits.
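The RQ1 accuracy computation can be sketched as below. The data layout, function name, and label strings are assumptions for illustration. A gold entry may be a set of acceptable labels, which models the "union" strategy used for the "Impacts" dimension, where either rater's assessment counts as correct.

```python
def accuracy_by_dimension(llm_labels, gold_labels):
    """Per-dimension accuracy of LLM labels against manual labels.

    llm_labels:  {article_id: {dimension: label}}
    gold_labels: {article_id: {dimension: label or set of acceptable labels}}
    """
    correct, total = {}, {}
    for article_id, gold in gold_labels.items():
        for dimension, accepted in gold.items():
            if isinstance(accepted, str):
                accepted = {accepted}  # a single resolved ground-truth label
            predicted = llm_labels.get(article_id, {}).get(dimension)
            total[dimension] = total.get(dimension, 0) + 1
            correct[dimension] = correct.get(dimension, 0) + (predicted in accepted)
    return {dimension: correct[dimension] / total[dimension] for dimension in total}


# Hypothetical labels for two articles.
toy_gold = {
    1: {"intent": "deliberate", "impacts": {"data theft", "multiple"}},
    2: {"intent": "accidental", "impacts": {"service disruption"}},
}
toy_llm = {
    1: {"intent": "deliberate", "impacts": "data theft"},
    2: {"intent": "deliberate", "impacts": "data theft"},
}
toy_accuracy = accuracy_by_dimension(toy_llm, toy_gold)
```

A missing LLM answer is counted as incorrect, which is a design choice of this sketch rather than something the study specifies.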

RESULTS AND ANALYSIS
5.1 RQ1: How effective are LLMs at replicating analysis of SW supply chain failures?
Table 7 summarizes the accuracy of GPT and Bard for the type of compromise, intent, nature, and impacts. GPT consistently outperformed Bard. We therefore focus our detailed analysis on GPT.
For most articles, GPT performed well on most dimensions. As depicted in Figure 4, GPT demonstrates an accuracy of at least 75% (indicating correct responses in three out of four dimensions) in the majority of instances (62%).
When the manual raters had higher agreement, GPT tended to agree with them. GPT had high accuracy in the "Intent" and "Nature" dimensions, with accuracies of 88% and 74%, respectively. These dimensions exhibit Cohen's κ values of 0.87 and 0.58, respectively (Table 4), demonstrating substantial agreement between the analysts. In the "Impacts" dimension, the LLM produced an accuracy of 52%, as indicated in Table 7. The Cohen's κ was also low, at 0.34, as shown in Table 4. We conjecture that GPT agrees with analysts when there is a consensus amongst analysts regarding the labeling.
GPT had trouble when offered multi-answer as an option. For example, for the "Impacts" dimension it could choose from 4 specific impacts, or "All of the above/Multiple", or "Unknown/Unclear". In 87% of the cases, raters chose one of the multi-answer options, while GPT chose one of the specific options. GPT only selected "All of the above" three times and "Unknown/Unclear" once. We conjecture that when GPT was uncertain about the impacts, it opted for the most probable outcome of software supply chain failures in these articles (which focus on IT software). That option is data and financial theft, which it chose 49 times out of 65.
We observe that for the "Type of compromise" dimension (ground truth provided by CNCF), we sometimes agreed with GPT over the CNCF. Figure 5 represents the distribution of GPT's choices and when they were incorrect according to the CNCF ground truth. We examined the 14 articles where both the type of compromise and impacts were incorrectly identified. For these instances, two raters with an inter-rater agreement (κ) of 0.82 found that most of the time, if they disagreed with CNCF, they concurred with GPT and vice versa. In the 8 instances where raters disagreed with CNCF, they agreed with GPT 6 times; the same ratio was observed when they disagreed with GPT and agreed with CNCF. For 2/14 articles they disagreed with both GPT and CNCF.

RQ2: Do LLMs suggest viable mitigation strategies for preventing future failures?
To address our second research question, we asked raters to evaluate GPT's proposed solutions/learnings using a 5-point Likert scale. The average ratings are depicted in Table 8. The mean score across all three questions is 3.83. The raters generally held a positive or neutral view of GPT's "Lessons learned": 42% of the ratings were above 4 (agree), and only 5% of the ratings fell below 2 (disagree). For further analysis, we randomly selected two articles where the raters' average score was > 4, and two where it was < 2. See Table 10 for the full "Lessons Learned" for these cases.
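The summary statistics above can be reproduced with a few lines of arithmetic. The sketch below first checks that the three per-question means reported in Table 8 (3.72, 4.15, 3.62) average to 3.83, then summarizes a hypothetical list of individual ratings. The cutoffs used for "agree" (4 or higher) and "disagree" (2 or lower) are assumptions, since the text's "above 4" and "below 2" can be read either way.

```python
from statistics import mean

# Per-question means as reported in Table 8.
table_8_means = [3.72, 4.15, 3.62]
overall_mean = round(mean(table_8_means), 2)  # (3.72 + 4.15 + 3.62) / 3


def summarize_likert(ratings, agree_at=4, disagree_at=2):
    """Mean rating plus the share of ratings at or beyond each cutoff."""
    n = len(ratings)
    return {
        "mean": mean(ratings),
        "share_agree": sum(r >= agree_at for r in ratings) / n,
        "share_disagree": sum(r <= disagree_at for r in ratings) / n,
    }


# Hypothetical individual ratings, not the study's data.
toy_summary = summarize_likert([5, 4, 4, 3, 2, 1, 4, 3])
```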
Factors for strong ratings (average score ≥ 4). We believe the LLM demonstrated good performance in these cases due to the depth of the articles. Article 7 [15] describes the PHP Supply Chain Attack on PEAR, and includes technical details of the failures, the exploitation method, and the patch. GPT utilizes the information provided in the blog, combined with its own knowledge, to suggest suitable solutions, e.g., "encouraging companies and developers to transition from PEAR to Composer". Article 35 [54] describes a compromised npm package. It contains technical details of the failure and information on prevention. GPT offers specific solutions, such as "encouraging the use of Intrinsic or similar Node.js packages to whitelist and control access to sensitive resources and APIs".
Factors for weak ratings (average score ≤ 2). We believe the LLM demonstrated poor performance in these cases because the articles had few details. Articles 65 [21] and 67 [80] are brief and lack substantial technical details of the failures. Article 67 discusses remote exploitation of a Gentoo server and mentions ongoing forensics. It primarily serves as a notice to users. Article 65 discusses the backdooring of WordPress but provides little information that could inform solutions/learnings. The advice given by GPT is hence generic, such as "investigate the incident and address the vulnerability" and "conduct code audits".

DISCUSSION
Is using LLMs worth it in this context? We found that both LLMs in our experiment were capable of simpler forms of analysis, such as distinguishing whether a vulnerability was actually exploited. However, for more complex questions that require some amount of context or judgment, neither LLM achieved a high level of agreement with the CNCF analysts or our manual raters. We believe the current generation of off-the-shelf LLMs does not offer a high enough level of agreement with expert judgment to make it a useful assistant in this context. One potential path to improving performance is fine-tuning the LLM using baseline knowledge such as this catalog, and then applying it on future issues [28].
Will LLMs be a viable alternative to manual analysis in the future? In the past few years, OpenAI's GPT models have advanced from simple tasks (GPT-1, GPT-2) to the performance reported here (GPT-3.5). The recent GPT-4 model is more impressive still [32]. We expect the next generation of LLMs will be suitable aids or replacements for this class of manual analysis.
Future Work. The scope of this analysis could be broadened to encompass additional LLMs, such as Claude [9] and Cohere [19]. Additional prompt engineering, and tailoring the prompts per LLM, might improve the accuracy of the results. Lastly, the analysis could be extended to include a wider range of articles and failures beyond those found in the CNCF catalog [6,7].

THREATS TO VALIDITY
Internal: Prompt engineering was conducted with only one of the LLMs (ChatGPT), using literature from its parent organization (OpenAI); the same prompts were used with the other LLM (Bard). The performance of Bard as reported in our study might be misrepresented due to this bias in prompt engineering. Additionally, we relied on manual analysis as the ground truth for our evaluation. We used multiple raters reaching agreement to mitigate bias. We measured an average inter-rater agreement of κ = 0.6, indicating that independent judgments were generally consistent.
Several issues were identified with the catalog and its articles. (1) Three articles were inaccessible due to broken URLs or PDF formats that were incompatible with LLMs, and were excluded from the analysis [16,33,50]. (2) Three articles [27,80,81] announced a failure but offered no analysis, leaving too little information to answer our RQs. (3) Some of the CNCF article labels did not match the CNCF taxonomy. For example, Article 56 [64] was categorized as a "Fake toolchain", and Article 63 [93] was labeled as a "Watering-hole attack". (4) One article [22] was not relevant.
Bard's low performance could be due to methodological bias. Lacking resources on Bard prompt engineering, we used available guidance for GPT. Bard's limit of 2000 tokens per prompt was below some prompt lengths, potentially reducing accuracy.
External: Constructed prompts could be over-fitted to analysis in the catalog. Replication of the catalog might not represent failure analysis of incidents in practice. Replication of a single catalog might not generalize to all incidents.

CONCLUSION
We evaluate the ability of Large Language Models (LLMs) to characterize software supply chain failures. Our study revealed that LLMs are particularly effective when manual analysts are able to reach a consensus on the characteristics of the failure. In contrast, their performance tends to deteriorate when the agreement among raters is low. The quality of the LLMs' outputs also depends on the level of detail provided in the source articles, with more comprehensive articles leading to higher-quality responses. We conjecture that while LLMs offer a valuable tool for rapidly analyzing large volumes of text, they have not yet reached a stage where they can replace human analysts or manual classification.
LLMs should be considered a supplementary tool that can assist human analysts, rather than a replacement for human input. As the depth of detail in postmortems and articles increases, and as LLMs continue to improve, they may evolve into viable analytical resources.

ACKNOWLEDGMENTS
OpenAI's ChatGPT model (v4) was used during manuscript preparation. Prompt: Can you make the following clearer? "TEXT SNIPPET". We reviewed answers to ensure they did not change the ideas.

Table 8. Average ratings, over all articles, of GPT's responses to the solutions/learnings prompt (Likert scale, 1-5, "strongly disagree" to "strongly agree").

Question | Rating
Is the advice helpful in general for software supply chain failures? | 3.72
Is the advice related to the specific failure mentioned in the article? | 4.15
Can the advice be used to solve/mitigate the failure mentioned in the article? | 3.62

APPENDIX
Table 9 presents the finalized prompts used to query the Large Language Models (LLMs) across various dimensions. These prompts were derived using a range of prompt engineering techniques, as detailed in Table 6.
Table 10 gives the full set of solutions/learnings proposed by GPT for the four articles discussed in detail in §5.2.
We wondered whether software supply chain reporting quality has improved over the years. If this were the case, we would expect to see an increase in LLM performance for newer articles. Figure 6 shows no such trend.
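The per-year check described above can be sketched as follows. This is a minimal illustration, not the authors' analysis script: the function name `accuracy_by_year` and the `(year, is_correct)` record layout are assumptions made for the example, and the sample records are invented placeholders rather than study data.

```python
from collections import defaultdict

def accuracy_by_year(records):
    """Group (year, is_correct) pairs by year and return the accuracy per year.

    `records` is a hypothetical layout: one pair per LLM answer, where
    `is_correct` says whether the answer matched the manual label.
    """
    totals = defaultdict(lambda: [0, 0])  # year -> [correct answers, all answers]
    for year, is_correct in records:
        totals[year][0] += int(is_correct)
        totals[year][1] += 1
    return {year: hits / count for year, (hits, count) in sorted(totals.items())}

# Placeholder records, for illustration only.
sample = [(2019, True), (2019, False), (2021, True)]
per_year = accuracy_by_year(sample)
```

If reporting quality had improved over time, the values in `per_year` would be expected to rise with the year.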
Figure 7, Figure 8, and Figure 9 show the ground truth for various dimensions. The ground truth for the dimension "Impact" is not presented as the disagreements among the raters were not resolved. In total, there were 65 articles analyzable for the "Intent", "Nature" and "Impacts" dimensions. For "Type of Compromise", there were analyzable articles. The failures that were not included were those with non-functioning URLs or PDF formats, and those where the manual labeling of the type of compromise by CNCF was not in the taxonomy.

Table 9. The final prompts for each dimension.

Dimension: Type of compromise
Prompt: Classify the attack from the following choices
Choice 1: Dev Tooling - This occurs when the development machine, SDK, toolchains, or build kit has been exploited. These exploits often result in the introduction of a backdoor by an attacker to own the development environment.
Choice 2: Negligence - Occurs due to a lack of adherence to best practices. TypoSquatting attacks are a common type of attack associated with negligence, such as when a developer fails to verify the requested dependency name was correct (spelling, name components, glyphs in use, etc).
Choice 3: Publishing Infrastructure - Occurs when the integrity or availability of shipment, publishing, or distribution mechanisms and infrastructure are affected. This can result from a number of attacks that permit access to the infrastructure.
Choice 4: Source Code - Occurs when a source code repository (public or private) is manipulated intentionally by the developer or through a developer or repository credential compromise. Source Code compromise can also occur with intentional introduction of security backdoors and bugs in Open Source code contributions by malicious actors.
Choice 5: Trust and Signing - Occurs when the signing key used is compromised, resulting in a breach of trust of the software from the open source community or software vendor. This kind of compromise results in the legitimate software being replaced with a malicious, modified version.
Choice 6: Malicious Maintainer - Occurs when a maintainer, or an entity posing as a maintainer, deliberately injects a vulnerability somewhere in the supply chain or in the source code. This kind of compromise could have great consequences because usually the individual executing the attack is considered trustworthy by many. This category includes attacks from experienced maintainers going rogue, account compromise, and new personas performing an attack soon after they have acquired responsibilities.
Choice 7: Attack Chaining - Sometimes a breach may be attributed to multiple lapses, with several compromises chained together to enable the attack. The attack chain may include types of supply chain attacks as defined here. However, catalogued attack chains often include other types of compromise, such as social engineering or a lack of adherence to best practices for securing publicly accessible infrastructure components.
Explain your answer using the given definitions and return the option. Use JSON format with the keys: 'explanation', 'choice' Based on the information provided in the Article delimited by triple backticks. Article: "'{article}"'
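The prompt above is a template: the article text replaces the {article} placeholder, and the model is asked to reply in JSON with the keys 'explanation' and 'choice'. The sketch below illustrates one way such a template could be filled and the reply parsed. It is an illustration under stated assumptions, not the authors' code: `build_prompt`, `parse_reply`, and `PROMPT_TEMPLATE` are hypothetical names, the choice definitions are abbreviated to their titles, and no LLM API is called.

```python
import json

FENCE = "`" * 3  # triple backticks used to delimit the article text

# Abbreviated stand-in for the Table 9 prompt; the full definitions are omitted.
PROMPT_TEMPLATE = (
    "Classify the attack from the following choices {choices} "
    "Explain your answer using the given definitions and return the option. "
    "Use JSON format with the keys: 'explanation', 'choice' "
    "Based on the information provided in the Article delimited by triple backticks. "
    "Article: " + FENCE + "{article}" + FENCE
)

CHOICES = [
    "Dev Tooling", "Negligence", "Publishing Infrastructure", "Source Code",
    "Trust and Signing", "Malicious Maintainer", "Attack Chaining",
]

def build_prompt(article: str) -> str:
    """Insert the numbered choice titles and the article text into the template."""
    choices = " ".join(f"Choice {i}: {name}" for i, name in enumerate(CHOICES, start=1))
    return PROMPT_TEMPLATE.format(choices=choices, article=article)

def parse_reply(reply: str) -> dict:
    """Pull the JSON object out of a model reply; fall back to an empty result."""
    empty = {"explanation": "", "choice": None}
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        return empty
    try:
        data = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return empty
    return {"explanation": data.get("explanation", ""), "choice": data.get("choice")}
```

Parsing defensively matters here because a model may wrap the JSON object in extra prose or return malformed output; such replies are mapped to an empty result rather than raising an error.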

Fig. 1. Proposed use of Large Language Models (LLMs) to analyze software supply chain failures. Failures are often reported in articles and blogs. Organizations concerned with cybersecurity (e.g., governments, corporations) manually analyze failure reports. We evaluate LLMs as an aid.

Fig. 2. A Software Ecosystem's Supply Chain Component and Dependency Vulnerability Flow.
[10] Impacts: What kind(s) of impact resulted? The options are taken from [25]: (1) Data or financial theft, (2) Disabling networks or systems, (3) Monitoring organizations or individuals, (4) Causing physical harm or death, (5) All of the above are possible, (6) Unknown or unclear.
Solutions/learnings: What was the quality of the solutions/learnings from the failure, that the LLM provided [67]?
4.3 Baseline: Manual Analysis
4.3.1 For RQ1.

Fig. 6. The average accuracy of the articles for all the dimensions over the years. The graph shows no specific trend.

Fig. 7. Categorization of articles for the dimension "Type of Compromise" by the CNCF catalog.

Table 1. Types of software supply chain attacks, according to the Cloud Native Computing Foundation (CNCF) [51].
4 | Source Code | Occurs when a source code repository (public or private) is manipulated intentionally by the developer or through a developer or repository credential compromise. Source Code compromise can also occur with intentional introduction of security backdoors and bugs in Open Source code contributions by malicious actors.
5 | Trust and Signing | Occurs when the signing key used is compromised, resulting in a breach of trust of the software from the open source community or software vendor. This kind of compromise results in the legitimate software being replaced with a malicious, modified version.

Table 2. Failure classification examples from CNCF catalog and LLMs.

Table 4. Inter-rater agreement for the dimensions. Cohen's kappa (κ) was calculated for each group of raters (3 groups in total) and then the average κ was calculated.
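The agreement computation named in the Table 4 caption can be sketched as follows. Cohen's kappa is κ = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The sketch assumes each group consists of two raters labeling the same articles, which the caption does not state; the function names and the example labels are hypothetical.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:  # both raters used one identical label throughout
        return 1.0
    return (p_o - p_e) / (1 - p_e)

def average_kappa(groups):
    """Mean kappa over groups, each given as a (labels_a, labels_b) pair."""
    kappas = [cohens_kappa(a, b) for a, b in groups]
    return sum(kappas) / len(kappas)

# Invented labels for two hypothetical groups, for illustration only.
group_1 = (["x", "x", "y", "y"], ["x", "x", "y", "x"])
group_2 = (["x", "y"], ["x", "y"])
mean_kappa = average_kappa([group_1, group_2])
```

Averaging the per-group values weights every group equally, regardless of how many articles each group rated.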
4.4.1 LLM selection. We used two popular, state-of-the-art LLMs that are publicly available at time of writing (June 2023): OpenAI's ChatGPT model

Table 5. Specifications of the LLMs used in the evaluation: GPT-3.5 and Bard. GPT's tuning knobs use a 0-1 scale.

Table 6. Techniques used to improve the prompts, illustrated for the prompt associated with the dimension of type of compromise. 'ID' denotes the order in which the techniques were used. The accuracy column contains the change in accuracy from the previous technique and the final accuracy in brackets. Accuracy was measured over 20% of the labelled data (we repeatedly analyzed the 14 most recent articles). Prompt 3 was chosen as it had the highest accuracy of 78%.
Use JSON format with the keys: 'explanation', 'choice'. Based on the information provided in the Article delimited by triple backticks. Article: "'{article}"'" in the end.

Table 7. Total accuracy over all the articles for each LLM.