Leveraging Large Language Models to Boost Dafny’s Developers Productivity

This research idea paper proposes leveraging Large Language Models (LLMs) to enhance the productivity of Dafny developers. Although the use of verification-aware languages, such as Dafny, has increased considerably in the last decade, these languages are still not widely adopted. Often the cost of using them is too high, due to the level of expertise required from developers and the challenges they face when trying to prove a program correct. Even though Dafny automates much of the verification process, some steps are too complex for Dafny to perform on its own. One such case is that of missing lemmas, i.e., Dafny is unable to prove a result without further help in the form of a theorem that assists in proving the step. In this paper, we describe preliminary work on using LLMs to assist developers by generating suggestions for relevant lemmas that Dafny is unable to discover and use. Moreover, for the lemmas that cannot be proved automatically, we attempt to provide accompanying calculational proofs. We also discuss ideas for future work by describing a research agenda on using LLMs to increase the adoption of verification-aware languages in general, by increasing developers' productivity and by reducing the level of expertise required for crafting formal specifications and proving program properties.


INTRODUCTION
As software is becoming increasingly pervasive in our daily lives, it is more important than ever to ensure that it does not contain bugs. Particularly in critical systems, software testing is not enough as stronger assurances are needed. These assurances can be achieved through software verification, which mathematically ensures that software behaves exactly as intended, with respect to a formal specification. Several verification-aware programming languages and systems, where logical constructs such as pre- and postconditions, invariants, and assertions provide assurances about the correctness of the program, are available and enable verification alongside code development. One such language is Dafny [21].
Dafny is known for its advanced deductive verification support: its backend includes a compiler capable of producing executable binaries and a verifier that checks that code conforms to its specification. While Dafny is a state-of-the-art tool in software verification, using it demands a solid grasp of concepts that are mostly encountered when formalizing specifications and that are not commonplace in conventional programming languages. This complexity can hinder software development productivity in Dafny. In addition, although Dafny has considerable adoption in industry (e.g. at Amazon Web Services and at Consensys [5][6][7]), it is not widely used, most likely due to the cost in effort and time and, as stated above, the need for expert knowledge when using Dafny.
Large Language Models (LLMs) have recently demonstrated extraordinary capabilities, exhibiting proficiency in diverse tasks such as engaging in conversations, retrieving and summarizing extensive information, and even generating and explaining text and code [2]. Their utility is further underscored by their application as code suggestion tools, exemplified by GitHub Copilot [28].
In this research idea paper, we describe how we plan to explore LLMs' capabilities to boost Dafny developers' productivity and to support Dafny's wider adoption. First, we describe some of the challenges identified in the use of Dafny and how we intend to address them. Second, we showcase preliminary work on prompting GPT-4 [2,4] to support the automated inference of lemmas and their proofs. Finally, we discuss implications of these results, challenges in using these tools within the context of Dafny, and the next steps of our research agenda. Our proposed solution integrates these features into the Dafny plugin for VS Code, to assist developers in using Dafny to write specifications and their corresponding implementations.
Our overarching goal is to enhance the accessibility of Dafny for new users and cultivate increased autonomy and productivity in Dafny's use.
There are several challenges that Dafny users face when using the language. An important challenge is the common case that non-trivial Dafny developments entail substantial effort in writing lemmas and proofs. For example, in Cassez's verification of the Incremental Merkle Tree Algorithm [5], almost 90% of the lines of code are proofs and function definitions used in the proofs. Moreover, an experience report on using Dafny at the VerifyThis 2021 verification competition shows that the interpretability of Dafny's error messages is challenging as these are often not informative [9].
In this section, we describe four challenges that we intend to tackle using LLMs.

Predicate and Lemma Inference
Even though Dafny automates a lot of the verification process, sometimes there are steps that are too complex for Dafny to perform on its own. One such case is that of missing lemmas, i.e. Dafny is unable to prove a result without being given further help in the form of a theorem that can assist it in the proof of the step. The challenge lies not only in proving the lemmas and theorems but also in determining the specifications of lemmas and predicates capable of solving the problem at hand.
Various approaches have been explored in the literature to address these challenges across a diverse range of verification tools. Examples include inferring inductive invariants in TLA+ [34] and inferring lemmas in Coq [33,35] and in symbolic-heap separation logic [37].
Our objective is to address lemma inference and general predicate inference using LLMs. Our approach involves using prompting to infer lemmas and predicates, and fine-tuning LLMs to infer them, particularly if the zero-shot or few-shot learning approaches yield unsatisfactory results. Section 3 provides an illustrative example of lemma inference using prompting and GPT-4 [2].

Proof Inference
Dafny supports calculational proofs (aka verified calculations), which are proofs by stepwise formula manipulation. As pointed out by Leino and Polikarpova [23], calculational proofs are praised for their rigor, readability, and elegance. Indeed, it has been shown that calculational proofs can greatly improve on traditional verbose proofs in natural language [10][11][12].
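To make the calculational format concrete, the following is a small illustrative derivation written in LaTeX; it is not taken from the paper or from Dafny's test suite. Each step rewrites the previous formula, and the hint in braces justifies the step, which is the general shape that a verified calculation also follows.

```latex
\begin{align*}
      & (a + b)^2 \\
  ={} & \quad \{\ \text{definition of squaring}\ \} \\
      & (a + b)(a + b) \\
  ={} & \quad \{\ \text{distributivity of } \cdot \text{ over } +\ \} \\
      & a(a + b) + b(a + b) \\
  ={} & \quad \{\ \text{distributivity of } \cdot \text{ over } + \text{, twice}\ \} \\
      & a^2 + ab + ba + b^2 \\
  ={} & \quad \{\ \text{commutativity of } \cdot \text{, then collecting like terms}\ \} \\
      & a^2 + 2ab + b^2
\end{align*}
```

In a verified calculation, the verifier checks each individual step, so a hint only has to be strong enough to justify the step it is attached to.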
Writing calculational proofs is challenging, and any method that helps infer these proofs, or steps of these proofs, can boost Dafny developers' productivity. Moreover, proof inference can complement the lemma inference described in subsection 2.1: when lemmas are added to the code, it is often the case that a proof needs to be provided by the user. LLMs can assist in this process by providing the full proof or by giving hints to the user regarding the proof steps. This also applies to lemmas inferred by the LLM, as proofs will also need to be provided. In addition, even though Dafny can often prove the lemmas on its own, it has been shown that, even in those cases, adding proof steps that are not strictly necessary can considerably reduce the verification time [5].
To infer proofs, we plan to use models fine-tuned on proof data, but we will also explore few-shot prompted and even zero-shot prompted approaches. Section 3 provides an illustrative example of few-shot prompted proof generation.

Automated Repair
Most programmers make mistakes when writing code. Automated Program Repair (APR) can help by supporting developers with the automatic fixing of software bugs. Several existing works have successfully applied LLMs to automated program repair [19,29,40,41]. However, as far as we are aware, there are no previous developments on APR for Dafny leveraging LLMs. In addition, in the context of Dafny, when a bug is detected, i.e. the program fails to verify, the issue can be due to an incorrect specification or an incorrect implementation. Although most current research assumes that the specification is correct and focuses on repairing the implementation, it is known that many issues with software stem from incorrect specifications [24]. Previous work on specification repair in Dafny relies on Daikon [8] for dynamic invariant inference, which is then used for generating weakening and strengthening candidate fixes (and their combination) [1].
We are not aware of any work on Dafny specification or program repair using LLMs. Our goal is to explore existing approaches that use LLMs for automated repair and adapt/improve them to be used for Dafny. We also intend to tackle proof repair and to do so in at least two contexts: when the code changes and existing proofs become invalid, and when the user or the LLM suggests a proof that does not verify. We will follow a continuous feedback loop between the LLM and Dafny's verifier, where the LLM produces proofs or fixes to proofs and, should these not verify, the feedback produced by Dafny is fed into the LLM to produce another suggestion.
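The feedback loop described above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: `suggest` and `verify` are hypothetical callables standing in for an LLM call and a run of Dafny's verifier, and `max_rounds` is an assumed retry budget; none of these names come from Dafny or from any LLM API.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interfaces (assumptions, not part of Dafny or any LLM API):
#   suggest(program, feedback) -> candidate program text with a proof or fix
#   verify(candidate)          -> (ok, verifier_message)
Suggest = Callable[[str, Optional[str]], str]
Verify = Callable[[str], Tuple[bool, str]]


def repair_loop(program: str, suggest: Suggest, verify: Verify,
                max_rounds: int = 5) -> Optional[str]:
    """Return the first candidate that verifies, or None if the budget runs out."""
    feedback: Optional[str] = None  # no verifier feedback before the first attempt
    for _ in range(max_rounds):
        candidate = suggest(program, feedback)
        ok, message = verify(candidate)
        if ok:
            return candidate
        feedback = message  # feed the verifier output into the next suggestion
    return None
```

Bounding the number of rounds keeps the loop from running indefinitely when the model keeps producing candidates that do not verify; what to do in that case (for example, falling back to the developer) is left open here.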

Summarization and Natural Language Specs
Code summarization, the task of generating summaries that accurately describe the functionalities of the code, can reduce developers' efforts in interpreting the goals of a program or snippet of code. In the context of Dafny, code summarization can also potentially help with understanding the intentions set out in specifications. Further, as detailed above, Dafny's error messages can be challenging to understand; code summarization has the potential to enhance error messages with further information about the error and about the mismatch between the specifications and the code. As developers spend around 58% of their time on program comprehension activities [42], features that assist them in code comprehension seem valuable to enhance their productivity.
We intend to explore LLMs' capabilities in code summarization to enhance error messages in Dafny, to provide explanations of specs/code, and to complement and update the specs/code comments that are helpful in code understanding activities. Previous works have shown the enormous potential of LLMs in code summarization tasks [16,20,39]. Another feature that we intend to explore is the use of LLMs to support the translation of natural language specifications into Dafny, assisting developers in a task that requires knowledge that is not commonplace.

PRELIMINARY EXPERIMENTS
This section describes preliminary experiments on using LLMs to infer lemmas and calculational proofs. In particular, we used the latest version of GPT-4 Turbo, gpt-4-1106-preview, trained on data up to April 2023 and supporting contexts of up to 128,000 tokens.
Prompt (Figure 1b): You are a software expert specializing in formal methods using the Dafny programming language. You receive the following program where a loop invariant could not be proven. The verifier error message is inside // VERIFIER ERROR ... //. Your task is to create lemmas and insert them into the code to facilitate verification.
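As an illustration of how a prompt of this kind could be assembled programmatically, the Python sketch below tags a failing line with the verifier error and attaches the task description quoted in this section. The function name `build_prompt`, the split into `system` and `user` parts, and the choice to append the error as a trailing comment on the failing line are all assumptions made for the example, not a description of the setup actually used in the experiments.

```python
from typing import Dict

SYSTEM_MESSAGE = (
    "You are a software expert specializing in formal methods using the "
    "Dafny programming language."
)

TASK_MESSAGE = (
    "You receive the following program where a loop invariant could not be "
    "proven. The verifier error message is inside // VERIFIER ERROR ... //. "
    "Your task is to create lemmas and insert them into the code to "
    "facilitate verification."
)


def build_prompt(program: str, failing_line: int, error: str) -> Dict[str, str]:
    """Tag the failing line (0-based index) with the error and attach the task."""
    lines = program.splitlines()
    # Assumption: the error is appended as a trailing comment on the failing line.
    lines[failing_line] = f"{lines[failing_line]}  // VERIFIER ERROR {error} //"
    annotated = "\n".join(lines)
    return {"system": SYSTEM_MESSAGE, "user": f"{TASK_MESSAGE}\n\n{annotated}"}
```

Keeping the error next to the line it refers to means the model does not have to map line numbers in a separate message back onto the program text.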

Lemma Inference
We use an example taken from Leino's book [22] and shown in Figure 1a. Given two sorted arrays, the method CoincidenceCount computes how many elements they have in common. The postcondition is expressed in terms of multisets (note that * denotes multiset intersection and the vertical bars denote the cardinality of a multiset). Since the two arrays are sorted, the algorithm can be efficiently implemented using two indices (m and n) to keep track of how many elements of a and b have been processed. Dafny's verifier is able to prove the method postcondition from the loop invariants. However, it cannot automatically prove the second invariant (identified by the red comment), because it is not able to automatically prove relevant properties about multisets. The typical approach to solve this is to annotate the program with lemmas that provide enough information for the proof to be completed. To determine whether GPT-4 can assist us by inferring useful lemmas, we used the code shown in Figure 1a and the prompt shown in Figure 1b. GPT-4 inferred the lemmas shown in Figure 1c and placed them in the code as shown in Figure 1a (in blue). Only a small correction was required to make the program verify using these lemmas as axioms (highlighted in blue). However, note that these lemmas cannot be proved as stated, since they require more information in the precondition (e.g. to prove LemmaIntersectionAfterIncrease_mn, it is required to add a[m] == b[n] as a precondition; similar annotations are needed for the other lemmas).
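For readers without access to Figure 1a, the two-index algorithm can be sketched in Python. This is an illustrative transcription, not the Dafny code from the figure: the multiset intersection is modelled with `collections.Counter`, and the runtime assertion encodes an invariant of the form we assume the method uses, relating the running count `c` to the intersection of the unprocessed suffixes. A runtime check only tests the property on given inputs, whereas Dafny must prove it for all inputs.

```python
from collections import Counter
from typing import List


def multiset_intersection_size(xs: List[int], ys: List[int]) -> int:
    """|multiset(xs) * multiset(ys)|: size of the multiset intersection."""
    return sum((Counter(xs) & Counter(ys)).values())


def coincidence_count(a: List[int], b: List[int]) -> int:
    """Count common elements of two sorted lists using indices m and n."""
    total = multiset_intersection_size(a, b)
    c, m, n = 0, 0, 0
    while m < len(a) and n < len(b):
        # Assumed invariant: c plus the intersection of the suffixes is the total.
        assert c + multiset_intersection_size(a[m:], b[n:]) == total
        if a[m] == b[n]:
            c, m, n = c + 1, m + 1, n + 1
        elif a[m] < b[n]:
            m += 1  # a[m] is smaller than everything left in b
        else:
            n += 1  # b[n] is smaller than everything left in a
    return c
```

Each branch of the loop changes the suffixes in a different way, which is why one supporting multiset fact per branch is a natural shape for the missing lemmas.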

Proof Inference
Our experiments with prompt engineering to infer proofs for the three lemmas shown in Figure 1c were not as successful. In general, there were many syntactic errors and we had to provide many examples of proofs to generate plausible solutions. The best result we obtained was when we asked to prove one of the lemmas, but gave a complete proof of one of the other similar lemmas.
To test whether GPT-4 would be able to help Dafny developers when they need to prove less domain-specific results, we also attempted to infer proofs for statements that are more widely known. An example is shown in Figure 2, where a proof for the lemma Factor0 was generated by GPT-4. The lemma is a version of a basic property of number theory:  is a factor of any linear combination  *  +  * . The predicate is a variation of the lemmas required to prove correctness of Euclid's algorithm as implemented in Dafny's integration tests. The best attempt by GPT-4 to prove the lemma is shown in blue in Figure 2. To achieve this, we designed a prompt incorporating extensive Dafny code and examples; in particular, we provided all the contents of the integration test file for calculations. Even though the hints are correct and make sense, they are not accepted by Dafny. Nevertheless, once we comment them out, the proof is accepted. Moreover, in the version of Dafny that we used, 4.3.0, we had to change the variable definitions to:

RELATED WORK
Clover [36] is the only work that we are aware of on using LLMs in the context of Dafny. Clover consists of a checker that performs consistency checks among code, docstrings, and formal annotations. The main idea is to reduce correctness checking to a problem of consistency checking. However, the reduction from correctness to consistency is not mathematically complete. Our goals are different and more fine-grained: we aim to prove correctness and to contribute features that continuously focus on parts of the program, not necessarily the program as a whole, to enhance developers' productivity in achieving program correctness.
Regarding lemma inference, we are not aware of any work that focuses on Dafny or that uses LLMs, but there have been efforts to synthesize lemmas for Coq [33,35] and for symbolic-heap separation logic [37]. AdtInd automates proofs by induction over algebraic data types, where lemmas are synthesized by term enumeration guided by user-specified templates [45].
Recent work on neural methods to automate proof synthesis is related to our goals of inferring calculational proofs. Given a partial proof and the proof state, neural theorem provers use neural networks to predict the next likely proof steps, which are then evaluated by a proof assistant to return new proof states or errors. Several neural theorem provers have been proposed for Coq [3,13,14,17,31,32,38,43], for Lean [44], and for Isabelle [18,27]. Baldur uses LLMs to repair proofs as part of its proof synthesis approach [15]. Finally, regarding computer support for calculational proofs, there have been efforts to create structure editors that also support verified calculations [25,26], but none using LLMs.

CONCLUSION
The lemma inference experiment yielded promising results, demonstrating GPT-4's ability to infer lemmas with minor errors. However, throughout the experiments, several responses contained syntax errors that proved challenging to rectify through reprompting. The model struggled to complete lemma proofs or correct minor errors in lemma specifications.
On the other hand, the proof inference experiment revealed that only through a meticulously crafted prompt with rich examples could GPT-4 successfully complete the proofs. This underscores the need for improved LLMs that reduce the need for extensive manual prompt curation.
To achieve these improvements, our next steps will focus on prompt engineering and on fine-tuning models with well-crafted datasets containing relevant Dafny code, with a specific emphasis on lemma inference, proof inference, and suggestions that contain only correct syntax.
In terms of dataset creation, we plan to contribute to and extend CloverBench [36], the dataset used by Clover. At the time of writing, CloverBench contains only 60 small CS textbook examples. Many more, and more complex, examples will be required to ensure the practicality of our ideas. Moreover, we plan to create additional and specialized datasets for specific tasks; for example, we intend to create a Dafny dataset akin to Reichel et al.'s dataset for Coq [30], to assist developers in proof repair.
Finally, to increase the adoption of our ideas, we have started integrating them into Dafny's VS Code plugin.