Unlocking Practical Applications in Legal Domain: Evaluation of GPT for Zero-Shot Semantic Annotation of Legal Texts

We evaluated the capability of a state-of-the-art generative pre-trained transformer (GPT) model to perform semantic annotation of short text snippets (one to a few sentences) coming from legal documents of various types. Discussions of potential uses (e.g., document drafting, summarization) of this emerging technology in the legal domain have intensified, but to date there has not been a rigorous analysis of these large language models' (LLMs) capacity for sentence-level semantic annotation of legal texts in zero-shot learning settings. Yet, this particular type of use could unlock many practical applications (e.g., in contract review) and research opportunities (e.g., in empirical legal studies). We fill the gap with this study. We examined if and how successfully the model can semantically annotate small batches of short text snippets (10-50) based exclusively on concise definitions of the semantic types. We found that the GPT model performs surprisingly well in zero-shot settings on diverse types of documents (F1 = .73 on a task involving court opinions, .86 for contracts, and .54 for statutes and regulations). These findings can be leveraged by legal scholars and practicing lawyers alike to guide their decisions in integrating LLMs into a wide range of workflows involving semantic annotation of legal texts.


INTRODUCTION
We evaluate the effectiveness of GPT-3.5 (text-davinci-003) in tasks focused on (i) contract review, (ii) statutory/regulatory provisions investigation, and (iii) case-law analysis. We benchmark the performance of the general (not fine-tuned) GPT-3.5 model in annotating small batches of short text snippets coming from the aforementioned types of legal documents against the performance of a traditional statistical machine learning (ML) model (random forest) and a fine-tuned BERT model (RoBERTa). The GPT model's annotations are based on compact, one-sentence-long semantic type definitions provided to the model as a prompt. Specifically, we analyzed the following research question in the context of the three legal annotation tasks: Given brief type definitions from a single non-hierarchical type system describing short snippets of text, how successfully can a general GPT-3.5 model automatically classify such text spans in terms of the type system's categories?

RELATED WORK
Zero-shot GPT in AI & Law. Yu et al. applied GPT to the COLIEE entailment task, improving on the then-existing state-of-the-art [45]. Bommarito and Katz successfully applied GPT-3.5 and GPT-4 to the Bar Examination [4,18]. Other use cases include assessment of trademark distinctiveness [13], legal reasoning [3,22], and U.S. Supreme Court judgment modeling [14].
Rhetorical/Functional Segments in Adjudicatory Decisions. The task involves labeling of smaller textual snippets such as sentences [31] in terms of, e.g., rhetorical roles, functional or argument units. Examples include court [27] or administrative decisions from the U.S. [36], multi-domain court decisions from India [1] or Canada [43,44], international court [25] or arbitration decisions [6], or even multi-{domain,country} adjudicatory decisions [32]. Identifying a section that states an outcome of the case has also received considerable attention separately [24,42]. The task sometimes takes the form of identifying a small number of contiguous parts typically comprising multiple paragraphs. Different variations of this task were applied to several legal domains from countries such as Canada [11], the Czech Republic [15], France [5], the U.S. [28], or even in multi-jurisdictional settings [33].
Classification of Legal Norms. Researchers used traditional statistical supervised ML models to classify portions of Italian statutory texts with types, such as definition, prohibition, or obligation [2,12]. Other groups classified sentences from Dutch statutory texts in terms of categories such as definition, publication provision, or scope of change [10]. Some work focuses on fine-grained semantic analysis of statutory texts in terms of obligations, permissions, subject agents or themes [26,30,41], concepts or definitions [40].
Classification of Contractual Clauses. Chalkidis et al. analyzed contractual clauses in terms of types such as termination clause, governing law, or jurisdiction [8,9]. Leivaditi et al. released a benchmark data set of 179 lease agreement documents focusing on recognition of entities and red flags [20]. In this work we focus on twelve selected semantic types from the Contract Understanding Atticus Dataset (CUAD) [16]. Wang et al. assembled and released the Merger Agreement Understanding Dataset (MAUD) [37].

DATA
We use three existing manually annotated data sets. Each supports various tasks involving different types of legal documents. All of them are equipped with expert annotations attached to (usually) short pieces of text. We further filtered and processed the data sets to make them suitable for this work's experiments.
The U.S. Board of Veterans' Appeals (BVA) is an administrative body within the U.S. Department of Veterans Affairs (VA) responsible for hearing appeals from veterans who are dissatisfied with decisions made by VA regional offices. The BVA reviews a wide range of issues, including claims for disability compensation, survivor benefits, and other compensation and pension claims. Walker et al. [36] analyzed 50 BVA decisions issued between 2013 and 2017. The decisions were all arbitrarily selected cases dealing with claims by veterans for service-related post-traumatic stress disorder (PTSD). For each decision, the researchers manually extracted sentences addressing the factual issues. The sentences were then manually annotated with the rhetorical roles they play in the respective decisions [35]. Figure 1 (left) shows the distribution of the labels.
Contract Understanding Atticus Dataset (CUAD) is a corpus of 510 commercial legal contracts that have been manually labeled under the supervision of professional lawyers. This effort resulted in more than 13,000 annotations. The data set was released by Hendrycks et al. [16] and it identifies 41 types of legal clauses that are typically considered important in contract review in connection with corporate transactions. In this study, we decided to work with the 12 most common clause-level types present in the corpus, the distribution of which is shown in Figure 1 (center).
At the University of Pittsburgh's Graduate School of Public Health, researchers have manually coded federal, state, and local laws and regulations related to emergency preparedness and response of the public health system (PHS). They used the codes to analyze network diagrams representing various functional features of states' regulatory frameworks for public health emergency preparedness. They retrieved candidate sets of relevant statutes and regulations from a full-text legal information service and identified relevant spans of text [34]. They then coded the relevant spans as per the instructions in the codebook, representing relevant features of those spans. In this work we focus on the purpose of the legal provision in terms of the three categories the distribution of which is shown in Figure 1 (right). The statutory and regulatory texts were automatically divided into text units which are often non-contiguous spans of text referenceable with citations [29].

EXPERIMENTS
We use the Jaccard similarity measure as the baseline (with tokens treated as sets). Each text snippet is compared to the type definitions available in the respective task and is then assigned the label whose definition has the highest similarity score with the snippet. We benchmark the performance of GPT-3.5 [7,23] against a traditional statistical supervised learning algorithm (random forest [17]), which is evaluated using 10-fold cross-validation at the document level. Within each iteration of the cross-validation, we utilize grid search to select the best set of hyperparameters. The space that is considered is defined over the type of n-grams to be used, the number of estimators, and the maximum tree depth.
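The baseline described above can be sketched as follows (a minimal illustration with our own function names, not the original implementation; whitespace tokenization is an assumption):

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two texts."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

def predict_label(snippet: str, type_definitions: dict) -> str:
    """Assign the label whose one-sentence definition is most similar to the snippet."""
    return max(type_definitions,
               key=lambda label: jaccard(snippet, type_definitions[label]))
```

Because the baseline only matches surface tokens, a snippet sharing no vocabulary with the correct definition is necessarily misclassified, which is consistent with the low baseline scores reported below.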
To compare the performance of the zero-shot GPT-3.5 model against a fine-tuned LLM, we fine-tune the base RoBERTa model [21] for 10 epochs on the training set within each of the cross-validation folds. The same splits as for evaluating the performance of the random forest are used. We set the batch size to 16 and the length of the sequence to 512. As the optimizer we use the Adam algorithm [19] with the initial learning rate set to 4e-5.

Table 1: Experimental results in terms of micro-F1 scores. The Jaccard column shows the performance of the Jaccard similarity baseline. The @N labels denote how many data points were used in the training of the two supervised ML systems (RandF: random forest, BERT: base RoBERTa), where @Max means all the available data points were used. The GPT section reports the performance of the text-davinci-003 model. The blue cells signify the point at which the BERT system matched the performance of either one of the GPT-3.5 models. The green shaded cells do the same for the random forest.
To gauge how many labeled documents are needed by the supervised ML system to match and exceed the performance of GPT-3.5, we train the random forest and RoBERTa on training sets of varying sizes.We train the systems on the training sets with 20, 50, 100, 250, 500 and 1,000 data points.These documents are randomly sampled from the training set in each iteration of the cross-validation.
To test the performance of text-davinci-003, we submit a batch of text snippets using the openai Python library, which is a wrapper for OpenAI's REST API. We make the batches as large as possible to achieve maximum cost effectiveness. Their size is determined by the size of the evaluated text snippets that can fit into the prompt (4,097 tokens), leaving enough space for the completion (i.e., the predictions). For the BVA decisions' sentences the batch size was set to 50, for the CUAD's contractual clauses to 20, and for PHASYS' statutory and regulatory provisions to 10.
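A greedy packing scheme along these lines can be sketched as follows (a simplified illustration; the word-based token estimate and the overhead figures are our assumptions, not the exact accounting used in the experiments):

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: one token per whitespace-separated word,
    # plus a margin for sub-word splitting by the model's tokenizer.
    return int(len(text.split()) * 1.3) + 1

def pack_batches(snippets, prompt_window=4097,
                 completion_reserve=500, template_overhead=400):
    """Greedily pack snippets into batches that fit the model's prompt window,
    reserving room for the prompt template and the completion."""
    budget = prompt_window - completion_reserve - template_overhead
    batches, current, used = [], [], 0
    for snippet in snippets:
        cost = estimate_tokens(snippet)
        if current and used + cost > budget:
            batches.append(current)
            current, used = [], 0
        current.append(snippet)
        used += cost
    if current:
        batches.append(current)
    return batches
```

In practice the paper fixes the batch size per data set (50, 20, and 10), which amounts to the same trade-off: longer snippets, such as the PHASYS provisions, leave room for fewer items per prompt.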
We embed each batch in the prompt template shown in Figure 2. The model returns the list of predicted labels as the prompt completion. The construction of the prompt is focused on maximizing the cost effectiveness of the proposed approach, which may somewhat limit the performance of the evaluated GPT-3.5 model.
We set the temperature to 0.0, which corresponds to no randomness. The higher the temperature, the more creative the output, but it can also be less factual. We set max_tokens to 500 (a token roughly corresponds to a word). This parameter controls the maximum length of the output. We set top_p to 1, as is recommended when temperature is set to 0.0. This parameter is related to temperature and also influences the creativity of the output. We set frequency_penalty to 0, ensuring no penalty is applied to frequently occurring tokens. Finally, we set presence_penalty to 0, ensuring no penalty is applied to tokens appearing multiple times in the output, which is especially important for our use case since the same label may legitimately be predicted for many snippets in a batch.
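Putting the template and the decoding parameters together, a request might look roughly as follows (a sketch under our assumptions: the template wording paraphrases Figure 2 rather than reproducing it, and the legacy `openai.Completion` interface is shown as it existed for text-davinci-003):

```python
def build_prompt(doc_type, type_definitions, snippets):
    """Assemble a batch-annotation prompt from type definitions and snippets."""
    defs = "\n".join(f"- {name}: {definition}"
                     for name, definition in type_definitions.items())
    numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(snippets, 1))
    return (
        f"Assign each of the following {doc_type} snippets one of these "
        f"semantic types:\n{defs}\n\nSnippets:\n{numbered}\n\n"
        "Return one label per line, in order."
    )

# Hypothetical call via the legacy openai library (not executed here):
# import openai
# response = openai.Completion.create(
#     model="text-davinci-003",
#     prompt=build_prompt("court decision", definitions, batch),
#     temperature=0.0,       # deterministic output
#     max_tokens=500,        # room for one label per snippet
#     top_p=1,               # recommended with temperature 0
#     frequency_penalty=0,   # do not penalize frequent tokens
#     presence_penalty=0,    # labels may legitimately recur across snippets
# )
```

Returning one label per line keeps the completion short and trivially parseable, which is what allows the 500-token completion budget to cover batches of up to 50 snippets.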

RESULTS AND DISCUSSION
Table 1 shows the results of our experiments on applying the text-davinci-003 model to the three tasks involving adjudicatory opinions (BVA), contract clauses (CUAD), and statutory and regulatory provisions (PHASYS). Firstly, the GPT-3.5 model outperforms the baseline based on Jaccard similarity by a large margin on all three tasks (.35 vs .73 on BVA, .38 vs .86 on CUAD, and .24 vs .54 on PHASYS). While the magnitude of the difference might be somewhat surprising, the better performance of the GPT-3.5 model as compared to the baseline is to be expected. The baseline only matches the exact words from the type definitions to the exact words in the evaluated text snippets, whereas the sophisticated GPT-3.5 model has access to their semantics. When compared to the supervised algorithms trained on the in-domain data, the performance of the GPT-3.5 model is surprisingly high. While the GPT-3.5 model clearly falls short when compared to the supervised models trained on large portions of the available data, it is quite competitive, perhaps beyond what could be reasonably expected, when the size of the training data is limited. Overall, it appears that the RoBERTa model needs at least several hundred training data points to match the performance of the GPT-3.5 model (light-blue cells in Table 1). The random forest requires even more data, often close to a thousand (light-green cells in Table 1).

Table 2: Confusion Matrices of GPT-3.5 Predictions. The columns show the true labels as assigned by human experts, while the rows report the predictions of the system.
The three confusion matrices in Table 2 offer a detailed view into the performance of the GPT-3.5 model. The performance on CUAD's contractual clauses appears very promising overall. There is a small number of classes that the system struggles to distinguish from each other, such as Minimum Commitment, Profit Sharing, or Volume Restrictions. As for the BVA's adjudicatory documents, the Reasoning class appears to be the most problematic. There is an especially large number of Evidence sentences (654) that have been misclassified as Reasoning. Finally, the PHASYS' statutory and regulatory provisions seem to be the most challenging. A large number of Emergency Response provisions are labeled as Emergency Preparedness.
While the results of our experiments are promising, limitations clearly exist. First, the performance of the models is far from perfect, and there is a considerable gap between the performance of the zero-shot LLM and the performance of the supervised ML systems trained on hundreds or thousands of example data points. Hence, in workflows with low tolerance for inaccuracies in semantic annotation, the zero-shot LLM predictions may need to be subjected to human-expert QA. The outcome of such human-computer interaction may be a high-quality data set of a size that enables fine-tuning of a powerful domain-adapted LLM.
There are considerable differences in the performance of GPT-3.5 across the three data sets. While the performance on the CUAD data set seems to be very reasonable, there are some limitations when it comes to the performance on the BVA data set. The model struggles with the Reasoning type, mislabeling many sentences of other types as Reasoning as well as not recognizing many Reasoning sentences as such (Table 2). This is consistent with the performance of the supervised ML models. While the fine-tuned base RoBERTa is clearly more successful in handling this semantic type compared to GPT-3.5, it still struggles (F1 = .71). The random forest model under-performs GPT-3.5. Hence, the correct recognition of this type may require extremely nuanced notions that may be difficult to acquire through a compact one-sentence definition (GPT-3.5) or word occurrence features (random forest). For such situations, the proposed approach might not (yet) be powerful enough, and the only viable solution could be fine-tuning an LLM.
The performance of GPT-3.5 on the PHASYS data set is not satisfactory. We identified several challenges this data set poses that make it difficult even for the supervised ML models (Table 1). First, the data set is imbalanced, with the Response type constituting 62.4% of the available data points. Second, the definitions of the semantic types appear to be somewhat less clear and of lower quality than for the other two data sets. Hence, we hypothesize that the manual annotation of this data set heavily relied on the informal expertise of the human annotators, which was not fully captured in the annotation guidelines. Finally, the fine-grained distinctions between what counts as emergency Response as opposed to Preparedness may simply be too nuanced to be captured in a compact definition.

CONCLUSIONS
We evaluated text-davinci-003 on three legal annotation tasks, involving adjudicatory opinions, contractual clauses, and statutory and regulatory provisions. The model was provided with a list of compact definitions of the semantic types. The task was to label batches of short text snippets with the defined categories. The results of the experiment are very promising: the model achieved (micro) F1 = .73 for the rhetorical roles of sentences from adjudicatory decisions, .86 for the types of contractual clauses, and .54 for the purpose of public-health system's emergency response and preparedness statutory and regulatory provisions. Our findings are important for legal professionals, educators, and scholars who intend to leverage the capabilities of state-of-the-art LLMs to lower the cost of existing high-volume workloads involving semantic annotation of legal documents, or to unlock novel workflows that would not have been economically feasible to perform manually or using supervised ML. We also envision that the approach could be successfully combined with high-speed similarity annotation frameworks [38,39] to enable highly cost-efficient annotation in situations where resources are scarce.

Figure 1: Semantic Types Distribution. The figure shows the distribution of the semantic types across the three data sets.

Figure 2: Batch Prediction Prompt Template. The preamble (1) primes the model to generate semantic type predictions. The tokens surrounded by curly braces are replaced with the document type (2) according to the data set, the names of the semantic types (3), the corresponding definitions (4), and the analyzed text snippets (5).