MultiLegalSBD: A Multilingual Legal Sentence Boundary Detection Dataset

Sentence Boundary Detection (SBD) is one of the foundational building blocks of Natural Language Processing (NLP), with incorrectly split sentences heavily influencing the output quality of downstream tasks. It is a challenging task for algorithms, especially in the legal domain, considering the complex and different sentence structures used. In this work, we curated a diverse multilingual legal dataset consisting of over 130'000 annotated sentences in 6 languages. Our experimental results indicate that the performance of existing SBD models is subpar on multilingual legal data. We trained and tested monolingual and multilingual models based on CRF, BiLSTM-CRF, and transformers, demonstrating state-of-the-art performance. We also show that our multilingual models outperform all baselines in the zero-shot setting on a Portuguese test set. To encourage further research and development by the community, we have made our dataset, models, and code publicly available.

Figure 1: Mean performance comparison of baseline (green) and our multilingual models (violet) on multilingual legal text (French, Italian, Spanish, English, and German) and a zero-shot experiment on Portuguese data.

INTRODUCTION
Recent methodological advances, e.g., transformers [34], have led to substantial progress in the quality and performance of language models as well as growth in the general field of Natural Language Processing (NLP). This trend is also evident in legal NLP, with the number of research papers increasing drastically in recent years [14].
The Sentence Boundary Detection (SBD) task has received less attention and fewer resources. Some view it as solved, since high baseline performance can be achieved with simple lookup methods that capture frequent sentence-terminating characters such as periods, exclamation marks and question marks, combined with hand-crafted rules [26]. This approach is feasible when applied to well-formed and curated text such as news articles. Noisier domain-specific data containing differently structured text, combined with the ambiguity of many sentence-terminating characters [8,15] (e.g., the period occurring as a non-terminating character in abbreviations, ellipses, initials etc.), often overwhelms the aforementioned methods as well as more complicated off-the-shelf SBD systems. This has been illustrated in a number of specific SBD applications such as user-generated content [9,26] as well as in the clinical [20] and financial domains [7,19].
In legal documents, these difficulties are compounded because legal text consists of smaller parts such as paragraphs, clauses etc., making it quite different from standard text. Furthermore, sentences are long and may contain complex structures such as citations, parentheses, and lists. These structures are often used to convey additional information to the reader (e.g., citations referencing another text) or to format the text in a specific way (e.g., lists emphasizing ideas or increasing the readability of long paragraphs). However, these structures or special sentences do not follow a standard sentence structure, thus posing an additional challenge to SBD systems, as illustrated in several works on English [27,29] and German [10] legal documents.

Motivation
Having a reliable SBD system is crucial for accurate NLP analysis of text. Poor SBD can result in errors propagating into higher-level text processing tasks, which hinders overall performance. For instance, the curation of the multilingual EUROPARL corpus required proper SBD to align sentences in both languages for statistical machine translation. Koehn [16] noted the difficulty of SBD, as it requires specialized tools for each language, which are not readily available for all languages. Inadequate SBD weakens the performance of sentence alignment algorithms and reduces the quality of the corpus. Therefore, a high-quality SBD system, especially one customized for the legal domain, can significantly improve performance.
Another example is Negation Scope Resolution (NSR), which focuses on finding negation words (e.g., "not") in sentences and their impact on the meaning of surrounding words. Negations are vital to the semantic representation of a text because they reverse the truth value of propositions. This is particularly useful in the legal domain, where it enables models that extract information from documents to better understand the meaning of the input text, such as recognizing the outcome of a court decision based on its exact wording. NSR models often require data split into sentences, both for labeling training data and for application input, making a reliable SBD system crucial. Incorrect sentence predictions by the SBD system may significantly lower input data quality and model performance. Proper SBD is also crucial in other NLP tasks such as Text Summarization, Part-of-Speech Tagging, and Named Entity Recognition, all relevant in the legal domain.

Main Research Questions
In this work, we pose and examine three main research questions:
RQ1: What is the performance of existing SBD systems on legal data in French, Spanish, Italian, English, and German?
RQ2: To what extent can we improve upon this performance by training mono- and multilingual models based on CRF, BiLSTM-CRF, and transformers?
RQ3: What is the performance of the multilingual models on unseen Portuguese legal text, i.e., a zero-shot experiment?

Contributions
The contributions of this paper are twofold:
(1) We curate and publicly release a large, diverse, high-quality, multilingual legal dataset (see Section 3) containing over 130'000 annotated sentence spans for further research in the community.
(2) Using this dataset, we show that existing SBD systems exhibit suboptimal performance on legal text in French, Italian, Spanish, English, and German. We train and evaluate state-of-the-art monolingual SBD models based on Conditional Random Fields (CRF), BiLSTM-CRF and transformers, achieving F1-scores up to 99.6%. We demonstrate the performance and feasibility of multilingual SBD models, i.e., models trained on all languages, achieving F1-scores in the higher nineties, comparable to or better than our monolingual models on each aforementioned language. In a zero-shot experiment, we demonstrate that it is possible to achieve good cross-lingual transfer by testing the multilingual models on unseen Portuguese legal text. We publicly release the datasets, all of our monolingual and multilingual models (see Section 5) as well as our code for further use in the community.

RELATED WORK
In this section, we discuss the literature at our disposal. First, we look at works showcasing the need for more research on SBD. Second, we look at works tackling the problem of SBD in legal text in several languages. Lastly, we investigate SBD research in other domains and present multilingual datasets in the legal domain for thoroughness.
Read et al. [26] questioned the status quo of SBD being "solved", especially in more informal language and special domains, by reviewing the then state-of-the-art SBD systems on English news articles and user-generated content. The systems were able to reach F1-scores in the higher nineties for the former; however, the performance on user-generated content weakened perceptibly, with scores down to the lower nineties, showcasing the need for "a renewed research interest in this foundational first step in NLP." [26]

SBD in the Legal Domain
Savelka et al. [29] continued this research in the English language by curating a legal dataset consisting of adjudicatory decisions from the United States. When testing existing systems on the dataset, they report F1-scores between 75% and 78%. Training or adapting these systems to the dataset improved their F1-scores to the mid-eighties, which is still lower than their respective performance in more standard domains [26], showcasing the subpar performance of state-of-the-art SBD in the English legal domain. To address this issue, they trained a number of CRF models as well as a model based on hand-crafted rules, reporting F1-scores of 79% for the hand-crafted model and up to 96% for the CRFs. Additionally, they developed a publicly available, comprehensive set of annotation guidelines for sentence boundaries in legal texts, which we used as a foundation for our guidelines.
Sanchez [27] experimented on the same dataset, reporting an F1-score of 74% using the Punkt model [15]; adapting it to the dataset slightly improved performance. They also trained and evaluated CRF and Neural Network (NN) models, reporting F1-scores up to 98.5% and 98.4% respectively. Our multilingual models achieve F1-scores between 95.1% and 97% on the same dataset.
Similarly, Glaser et al. [10] curated a German legal dataset, split into laws and judgements; a similar distribution is used in our work. They established a baseline performance of existing SBD systems and compared it to CRF and NN models trained on the aforementioned dataset. Their findings show F1-scores between 70% and 78% for off-the-shelf systems, supporting the view that the performance of existing SBD systems is subpar on legal data. The CRF and NN models achieve F1-scores up to 98.5%. However, a significant decrease in performance was reported when applying them to previously unseen German legal texts, with scores down to 81.1%. Our multilingual models achieve F1-scores between 91.6% and 97.6% on the German dataset.

SBD in Other Domains
In the financial domain, Du et al. [7] experimented with Bidirectional Long Short-Term Memory (BiLSTM) models combined with a CRF layer as well as the transformer-based model BERT [6] and compared their performance, approaching SBD as a sequence labelling task to extract useful sentences from noisy financial texts. They demonstrate that BERT significantly outperforms BiLSTM-CRFs across all evaluation metrics, including F1-scores. In their work they also underline the fact that "SBD has received much less attention in the last few decades than some of the more popular subtasks and topics in NLP."
Schweter and Ahmed [31] compared the performance of Long Short-Term Memory networks (LSTMs), BiLSTMs and Convolutional Neural Networks (CNNs) to OpenNLP (https://opennlp.apache.org/) in an SBD task on the Europarl [16], SETimes [33] and Leipzig Corpora [11], which contain around 10 different languages, showcasing the use of their models as robust, language-independent SBD systems.

Multilingual Datasets in the Legal Domain
Niklaus et al. [23] present LEXTREME, a novel multilingual benchmark dataset containing 11 datasets in 24 languages, designed to evaluate natural language processing models on legal tasks. The authors assess five prevalent multilingual language models, providing a benchmark for researchers to use as a basis for comparison. Savelka et al. [30] investigate the application of multilingual sentence embeddings in sequence labeling models to facilitate transfer across languages, jurisdictions, and other legal domains. They demonstrate encouraging outcomes in allowing the reuse of annotated data across various contexts, which leads to the development of more resilient and generalizable models. Additionally, they create a vast dataset of newly annotated legal texts using these models. Chalkidis et al. [3] introduce MultiEURLEX, a multilingual and multilabel legal document classification dataset containing 65000 EU laws. Aumiller et al. [1] present EurLexSum, a multilingual summarization dataset curated from Eur-Lex data. Niklaus et al. [21,24] introduce Swiss-Judgment-Prediction, a multilingual judgment prediction dataset from the Federal Supreme Court of Switzerland.

DATASET
We annotated sentence spans for three diverse multilingual legal datasets in French, Italian, and Spanish, each containing approximately 20,000 sentences evenly split between judgments and laws. We chose a variety of legal areas to capture a broad selection. The laws included the Constitution, part of the Civil Code, and part of the Criminal Code, with the Constitution used only for evaluation. The judgments comprised court decisions from various legal areas and sources. We also annotated a smaller Portuguese dataset with approximately 1800 sentences, divided into the same subsets as the other datasets. This dataset was used for zero-shot experiments.
Additionally, we standardized and integrated two publicly available datasets into our dataset to further increase its diversity: an English collection of legal texts [29], consisting of adjudicatory decisions from the United States, and a German dataset [10], comprising laws and judgments.
Figure 2 illustrates the sentence length distribution of our dataset, showing the relative frequency of sentence length in tokens for laws and judgments, with a bin size of 5. We used an aggressive tokenizer, resulting in a larger number of tokens per sentence than usual.For clarity, we did not include sentences longer than 101 tokens, which comprised only ~2% (2634) of the sentences.Only 26 sentences were longer than 512 tokens.
For each language, we used random sampling to split the dataset into three parts: train, test and validation. The test and validation splits each contain 20% of the dataset. Every model is trained on the train split, and we report their performance on the test split. Selected statistics and information about the dataset are in Table 1.
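The split described above can be sketched as follows. This is a minimal illustration, not the procedure actually used: the function name `split_dataset`, the fixed seed, and the choice of whole documents as the sampling unit are all assumptions, since the text only states the 20% test and validation shares.

```python
import random


def split_dataset(documents, test_frac=0.2, val_frac=0.2, seed=42):
    """Randomly partition items into train, validation and test lists.

    The sampling unit (documents) and the seed are assumptions.
    """
    shuffled = list(documents)
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```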

Annotation
The human annotator was tasked with correcting the sentence spans predicted by an automatic SBD system [29] based on CRF, which was trained on data annotated using the annotation guidelines by Savelka et al. [29]. This helped improve the quality and consistency of our annotations. Furthermore, a practical rule set, heavily influenced by the aforementioned guidelines, was used to aid the annotator in the annotation process, reducing the complexity of the task and helping to provide dependable and well-founded data. The documents were annotated using Prodigy (https://prodi.gy/). Because Prodigy requires pre-tokenized text, a customized tokenizer was applied to the input text, further described in Section 3.2. The decision to annotate the full sentence span, in lieu of just the first and last token in the sequence, was made to incentivize the annotator to read the text instead of skimming it for sentence-terminating characters. To make the annotation easier, laws were split into smaller chunks with one to three articles per chunk, while judgments were only split if they surpassed ~15000 characters, since Prodigy was unable to handle longer documents.

Legal Sentence Structures.
In this section, we briefly describe the most important sentence structures in legal text, heavily influenced by Savelka et al. [29], followed by an example in French.
Standard Sentences have subject, object and verb in the correct order, and the last token in the sequence is a sentence-terminating character.
• Il s'est établi comme ingénieur indépendant.
Linguistically Transformed Sentences are similar to standard sentences, but slight transformations such as changes to the word order are applied.
• Tout porte à croire, en réalité, qu'elle est condamnée au surendettement, puis à la faillite.
Headlines determine the structure of the text and show relatedness between parts of the document and therefore convey important information about the overall structure of the text.
• Considérant en fait et en droit
• PAR CES MOTIFS
• DÉCLARATION
Data fields provide the name and data of a field. They are annotated as a sentence because, for example, in English "Civil Chamber: Madrid" has a similar meaning to "The civil chambers are located in Madrid".
• Numéro d'appel: 1231/2015
Parentheses appear frequently in legal text, often combined with citations. We annotate parentheses with the sentence they belong to. Sequences inside the parentheses are not annotated separately, as seen in the following example, which contains a single sentence:
• Ce dernier étant domicilié à l'étranger, il ne peut en effet prétendre à des mesures de réadaptation (art. 8a. 1er paragraphe. Convention de sécurité sociale entre la Suisse et la Yougoslavie du 8 juin 1962).
Colons should not be annotated as sentence-terminating characters, unless the colon is immediately followed by a newline. The reasoning is that a sequence ending in a colon followed by a line break usually introduces a list or block quote, which should be annotated separately from the introductory sentence.
Lists are annotated differently depending on their type. For lists with incomplete sentences as list items, often ended with a semicolon, the whole list is annotated as a sentence. The following example consists of 2 sentences, the introductory sentence up to the colon and 1° up to the period.
However, if the list items themselves are sentences, the list number (or letter) and items are both annotated as one sentence each, the reason being that they express separate thoughts. In the example below we have 3 sentences (introductory, list number, list item).
• Considérant en droit: 1.- En instance fédérale, peut seul être examiné le point de savoir si la commission de recours a exigé à bon droit de la recourante une avance de frais de 500 fr. pour la procédure de recours de première instance.
Ellipses are used to indicate that part of a sentence or part of the document is left out. The following example shows both use cases. The first ellipsis is annotated separately, as it indicates that sentences are missing. The second ellipsis indicates that part of that single sentence was left out and is therefore not annotated separately.
• (...) La faute de X. est d'une exceptionnelle gravité tant les faits qui lui sont reprochés (...), commis avec une certaine froideur sont insoutenables et comportent un caractère insupportable pour les victimes.
Footnotes / Endnotes convey additional information to the reader. Indicators for endnotes and footnotes such as numbers or letters should always be annotated as being inside the sentence span, even if they occur after the sentence-terminating character. As an example, the sequence below is just one sentence, with "(2)" as the indicator:
• La loi ne dispose que pour l'avenir; elle n'a point d'effet rétroactif. (2)
Furthermore, endnotes appearing as numbered lists should be annotated following the guidelines for lists. In the example below, (2) is one sentence, followed by a normal sentence:
• (2) Le remplacement des membres du Parlement a lieu conformément aux dispositions de l'article 25.

Tokenizer
We implemented an aggressive tokenizer based on Regex to segment text into tokens, also employed in other research [10,29]. This tokenizer was used for all languages. Words, numbers and special characters such as newlines and whitespace are separated into individual sequences. This was done to ensure that no information vital to the SBD process (e.g., a line break indicating a sentence boundary) was lost. An example is shown below; tokenized whitespace is left out for clarity's sake:
• D._ est entré à l'école le 16 juillet 1979.
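A tokenizer in this spirit might look like the sketch below. The regular expression is an assumption chosen to match the description (runs of letters, runs of digits, and every other character, including whitespace and newlines, as its own token); it is not the pattern used in this work.

```python
import re

# Assumed pattern: letter runs, digit runs, or any single other character.
# re.DOTALL lets "." match newlines so that line breaks survive as tokens.
TOKEN_RE = re.compile(r"[^\W\d_]+|\d+|.", re.DOTALL)


def tokenize(text):
    """Split text into tokens without discarding any character."""
    return [match.group() for match in TOKEN_RE.finditer(text)]


tokens = tokenize("D._ est entré à l'école le 16 juillet 1979.")
# Hide whitespace tokens for display only, as in the example above.
print([token for token in tokens if not token.isspace()])
```

Because every character ends up in exactly one token, joining the tokens reproduces the input text, which is what keeps line breaks available as boundary cues.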

EXPERIMENTAL SETUP
We conducted a series of experiments to answer our research questions posed in Section 1.2. Firstly, we compared selected existing models to establish a baseline performance. Secondly, we trained and evaluated various monolingual and multilingual models based on CRF, BiLSTM-CRF, and transformers, comparing them to baselines. Lastly, we evaluated the multilingual models' performance on unseen data in a zero-shot experiment.

Baseline Systems
We conducted a thorough evaluation of several widely used systems utilizing various technologies, including CoreNLP, NLTK, Stanza, and Spacy, which served as our baselines. In the following sections, we describe each system.

NLTK.
A fully unsupervised SBD system created by Kiss and Strunk [15].The main thought behind the system is that most falsely predicted sentence boundaries stem from periods after abbreviations.The system therefore discovers abbreviations by looking at the length, the collocational bond, internal periods and occurrences of abbreviations without an ending period of each token in the text.
We test a pre-trained model as well as a model trained on our data.

CoreNLP.
A rule-based system from the Stanford CoreNLP toolkit [18], which predicts sentence boundaries based on events like periods, question marks, or exclamation marks.
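To illustrate why purely terminator-based rules struggle on legal text, here is a deliberately naive splitter. It is a hypothetical sketch for illustration only and does not reproduce the CoreNLP implementation, which is considerably more elaborate.

```python
import re


def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


# Well-formed text is split as intended.
print(naive_split("The court agreed. The appeal was dismissed."))
# The abbreviation "art." is wrongly treated as a sentence end.
print(naive_split("See art. 8 of the Convention. It applies here."))
```

The second call yields three pieces instead of two, the kind of abbreviation error that abbreviation-aware systems such as the one used by NLTK are designed to avoid.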

Stanza.
A multilingual system based on a BiLSTM model [25]. We only use the first part of its NLP pipeline, the tokenizer. It addresses tokenization and sentence splitting jointly, treating it as a character sequence tagging problem, predicting if a character is the end of a token or sentence.

Spacy.
A multilingual system [12] with pre-trained models using technologies like CNN and transformers. For our purposes, only the tokenizer and sentence splitter were used.

Our Models
Following the works presented in Section 2, we chose to test models based on CRFs, BiLSTM-CRFs and transformers. We further describe these models in the following subsections. For testing, we trained and evaluated monolingual models for each language as well as multilingual models using all languages except Portuguese, once for laws, once for judgments, and once for both types together.

Conditional Random Fields.
The tokenizer described in Section 3.2 tokenized the input text, including whitespace. Each token was translated into a list of simple features representing the token, and the features of tokens within a pre-defined window around the token were added. Window sizes varied per feature, inspired by Glaser et al. [10] and Savelka et al. [29], as shown in Table 2. We labeled input data using the "BILOU" system following Lin et al. [17].
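The feature construction described above can be sketched as follows. The helper names (`signature`, `token_features`, `windowed_features`) are hypothetical, the "Special" feature is omitted because it needs an abbreviation list, and interpreting a window of size w as w tokens on each side is an assumption.

```python
def signature(token):
    """Map each character to c (lower), C (upper), N (digit) or S (other)."""
    out = []
    for ch in token:
        if ch.islower():
            out.append("c")
        elif ch.isupper():
            out.append("C")
        elif ch.isdigit():
            out.append("N")
        else:
            out.append("S")
    return "".join(out)


def token_features(token):
    """Simple per-token features, following the descriptions in Table 2."""
    return {
        "lowercase": token.lower(),
        "length": len(token),
        "signature": signature(token),
        "lower": token[:1].islower(),
        "upper": token[:1].isupper(),
        "digit": token.isdigit(),
    }


def windowed_features(tokens, index, windows):
    """Collect each feature for all tokens within its window around index.

    Assumption: a window of size w covers offsets -w..+w.
    """
    feats = {}
    for name, radius in windows.items():
        for offset in range(-radius, radius + 1):
            pos = index + offset
            if 0 <= pos < len(tokens):
                feats[f"{offset}:{name}"] = token_features(tokens[pos])[name]
    return feats


# Illustrative window sizes, not the values used in the experiments.
print(windowed_features(["art", ".", " ", "8"], 1, {"signature": 1, "digit": 1}))
```

Prefixing each feature name with its relative offset lets the CRF learn position-specific weights, e.g. that a digit directly after a period suggests a citation rather than a sentence end.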
For training our CRF models, we used the python-crfsuite implementation. We trained each model for 100 iterations, with regularization parameters 1 and 1e-3 for C1 and C2, L-BFGS as the algorithm, and including all possible feature transitions.

Bidirectional LSTM -CRF.
A BiLSTM connects two LSTMs with opposite directions to the same output, allowing it to capture information from past and future states at the same time.The outputs of each LSTM are concatenated into a representation of each input token.For a BiLSTM-CRF model, a CRF layer is connected to the output of the BiLSTM network, using the aforementioned representation as features to predict the final label.

Feature | Description | Window
Special | Each token is categorized using the following translation: sentence-terminating tokens as "End", opening and closing parentheses as "Open" and "Close" respectively, newline characters as "Newline", abbreviation characters as "Abbr" and the rest as "No". | 10
Lowercase | The token in lowercase. | 7
Length | The length of the token. | 7
Signature | Each character is represented using the following translation: lower case and upper case characters are rewritten as "c" and "C" respectively, digits are written as "N" and special characters as "S". |
Lower | Whether the first character is lower case. |
Upper | Whether the first character is upper case. | 3
Digit | Whether the token is a digit. | 3

We utilized the Bi-LSTM-CRF library to train our models. We used a word embedding dimension of 128, a hidden dimension of 256 and a maximum sequence length of 512. The batch size was 16, with a learning rate of 0.01 and a weight decay of 0.0001. We trained each model for 8 epochs and saved the model with the smallest validation loss. We extracted word embeddings for training from our documents. To label the training data, we used the "BILOU" labeling system described in Section 4.2.1. For training, gold sentences were put together into batches with a token limit of 512 to simulate longer paragraphs.
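For readers unfamiliar with the BILOU scheme mentioned above, the following sketch shows one way sentence spans could be turned into token labels. The function name and the span representation (token indices, end exclusive) are assumptions made for illustration.

```python
def bilou_labels(n_tokens, sentence_spans):
    """Label tokens as B(egin), I(nside), L(ast), O(utside) or U(nit).

    sentence_spans holds (start, end) token indices with end exclusive.
    """
    labels = ["O"] * n_tokens
    for start, end in sentence_spans:
        if end - start == 1:
            labels[start] = "U"  # single-token sentence
        else:
            labels[start] = "B"
            labels[end - 1] = "L"
            for i in range(start + 1, end - 1):
                labels[i] = "I"
    return labels


# Token 3 (e.g. a newline between sentences) belongs to no sentence.
print(bilou_labels(7, [(0, 3), (4, 5), (5, 7)]))
# ['B', 'I', 'L', 'O', 'U', 'B', 'L']
```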

Transformer.
Transformers are a type of NN that uses self-attention mechanisms to weigh the importance of different parts of the input when making predictions. Transformer models such as BERT use a multi-layer encoder [34] to pre-train deep bidirectional representations by jointly conditioning on both left and right context across all layers [6]. Thus, we can fine-tune transformer models to the SBD task by adding an additional output layer. In our case we used a pre-trained model based on DistilBERT [28], a smaller, more lightweight version of BERT, for all languages on our SBD task. We trained the models using PyTorch and Accelerate with the Adam optimizer for 5 epochs with a batch size of 8 and a learning rate of 2e-5.
A limitation of DistilBERT is the input length limit of 512 tokens, because the runtime of the self-attention mechanism scales quadratically with the sequence length. This issue is exacerbated because DistilBERT relies on a WordPiece tokenizer [32], which splits the text into subwords, resulting in a higher token count per sequence. Thus, to get around the 512-token limit, each document was split into sentences using the gold annotation. Each consecutive sentence was added to a collection until the total length was as close to the token limit as possible. Next, the model predicted the sentence boundaries for each collection. Sentences longer than 512 tokens were truncated. An obvious downside of this solution is that the input text already has to be split into sentences or short sections, making it difficult to apply BERT models to unknown text.
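The collection-building step can be sketched as a greedy packing of consecutive sentences. This is an illustrative sketch under assumptions: `pack_sentences` is a hypothetical helper, sentence lengths are given in subword tokens, and special tokens added by the model tokenizer are ignored.

```python
def pack_sentences(sentence_lengths, limit=512):
    """Greedily group consecutive sentences so each group fits the limit.

    Returns a list of groups, each a list of sentence indices.
    """
    groups, current, current_len = [], [], 0
    for idx, length in enumerate(sentence_lengths):
        length = min(length, limit)  # over-long sentences are truncated
        if current and current_len + length > limit:
            groups.append(current)  # close the group before it overflows
            current, current_len = [], 0
        current.append(idx)
        current_len += length
    if current:
        groups.append(current)
    return groups


print(pack_sentences([200, 250, 100, 600, 30]))
# [[0, 1], [2], [3], [4]]
```

Keeping sentences in document order preserves the context between neighbouring sentences, which is what the model has to learn to separate.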
For future work, it would be interesting to see whether it is feasible to chain SBD models (i.e., first apply a CRF model to the input text to split it into sections smaller than 512 tokens, and then apply a transformer-based model). Another solution might be to use pre-trained transformer models that support longer input text by using an attention mechanism that scales linearly with sequence length, such as Longformer [2].

Evaluation
A characteristic of the SBD task is the inherent imbalance towards non-sentence-boundary labels, as each sentence can have at most two sentence boundaries. Thus, to score our models more accurately, we used the commonly used measures Precision (P), Recall (R) and F1-score (F1). Although the SBD task is not yet solved in specialized domains, it is comparatively easier than other NLP tasks such as Question Answering or Summarization. Because SBD is a pre-processing task, it is necessary to achieve high scores to prevent the propagation of errors into downstream tasks. Thus, we expect state-of-the-art SBD models to exhibit F1-scores in the high nineties to be useful in practice.
For the evaluation process, we let the models predict the sentence spans of every document. These predicted spans are tokenized by our tokenizer (Section 3.2). Each token is then assigned a binary value, depending on whether it is a sentence boundary or not. This decouples the predicted sentence spans or boundaries from the tokenizer used, as the tokenizer of some models might designate a slightly different token as the first or last in a sentence. Consider the following French example: "C'est en outre ...". While our tokenizer would designate "C" as the first token in the sequence, a different tokenizer might designate "C'" or even "C'est". This would lead to a wrongly predicted sentence boundary when compared to the gold annotations, although the prediction was actually correct.
True and predicted labels for each document type are compared using Scikit-Learn to calculate binary F1-Scores.Scores are averaged for subsets: "Laws" encompass Criminal Code, Civil Code, and Constitution; "Judgments" include various court decisions.
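The token-level comparison can be sketched as follows. This is a minimal illustration under assumptions: sentence spans are given as character offsets (end exclusive), the regular expression is an assumed stand-in for our tokenizer, and the scores are computed by hand here so the sketch stays dependency-free rather than calling Scikit-Learn.

```python
import re

# Assumed stand-in for the tokenizer of Section 3.2.
TOKEN_RE = re.compile(r"[^\W\d_]+|\d+|.", re.DOTALL)


def boundary_labels(text, spans):
    """Return one 0/1 label per token: 1 if it starts or ends a sentence."""
    marks = set()
    for start, end in spans:
        marks.add(start)    # first character of the sentence
        marks.add(end - 1)  # last character of the sentence
    return [
        int(any(m.start() <= pos < m.end() for pos in marks))
        for m in TOKEN_RE.finditer(text)
    ]


def precision_recall_f1(gold, pred):
    """Binary precision, recall and F1 for the positive (boundary) class."""
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


text = "Il dort. Elle lit."
gold = boundary_labels(text, [(0, 8), (9, 18)])  # two gold sentences
pred = boundary_labels(text, [(0, 18)])          # the system missed the split
print(precision_recall_f1(gold, pred))
```

In this example the prediction finds two of the four gold boundary tokens and adds no false ones, so precision is 1.0 and recall is 0.5. Because a span start is mapped to whichever token contains it, a system whose own first token is "C'" or "C'est" still marks our token "C".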
We trained each CRF model once and the BiLSTM-CRF and transformer models 5 times with random seeds, reporting the mean performance including standard deviation.If not specified differently, reported values are binary F1-scores.

Baseline Models
The performance of baseline models in Section 4 on each language in our dataset is summarized in the upper section of Table 3.
The results for the baseline models are clearly lower than the reported scores for user-generated content by Read et al. [26], supporting the hypothesis that the performance of out-of-the-box models is subpar on legal data for all tested languages.The difference in performance could be explained in one part by the special sentence structures presented in Section 3.1, while the challenging nature of legal text accounts for another part.
Of interest is the gap between NLTK and NLTK-train in most languages, as training NLTK improves its ability to recognize and correctly predict abbreviations. This shows that abbreviations are one part of the challenging nature of legal texts. Note that Spacy uses a slightly different notion of a sentence compared to the other models: usually, when two sentences are separated by a newline character, the newline character is not part of any sentence span; Spacy, however, includes it in the span of the second sentence. This leads to a false prediction, even though Spacy correctly recognized that there are two sentences. Therefore, the scores Spacy achieves are lower than expected.

Monolingual Models
We report the performance of our trained monolingual models in Table 4. Each model was trained and tested on the same language.
We observe that each model's performance, when applied to its training subset, reaches the high nineties for almost all languages, significantly improving over the baseline models from Section 5.1 and comparable to reported SBD system performance on English news articles [26]. Our models also perform similarly to the reported performance of CRFs and CNNs on English [27,29], as well as CRFs and NNs on German datasets [10].
When the models are trained on one subset and evaluated on the other, unseen subset, i.e., a zero-shot experiment, the transformer model outperforms CRF and BiLSTM-CRF on most languages, dropping down to 81.8% on the Italian dataset, comparable to the best baseline models, when trained on judgements and evaluated on laws. Unsurprisingly, the models' performance in the zero-shot experiment is almost always lower than the performance on the subset they were trained on. This gap can be explained by the large difference in writing and formatting styles between judgements and laws, with the transformer model being the best at generalizing knowledge between the two subsets. We further hypothesize that it was easier for the models to generalize their knowledge to different domains when trained on judgements than when trained on laws, resulting in higher scores on unseen data. One factor here might be that legal text in judgements contains a higher variety of different sentence structures, while laws usually reuse the same structures.
The CRF and BiLSTM-CRF models show especially poor performance on the Spanish dataset when trained on laws and evaluated on judgements, with scores down to 43.4% and 54.3%. We hypothesize that both models possess a worse ability to generalize to different domains compared to transformer models.
To conclude, while training on both laws and judgments together does not always produce the absolute best performance, it is the most robust option and does not result in performance degradation.

Multilingual Models
The performance of our multilingual models trained on laws and judgements is reported in the lower section of Table 3. Each multilingual model was trained on all languages except Portuguese.
The multilingual models outperform the baseline models by a large margin, with F1-scores up to 99.2%. Both the BiLSTM-CRF and transformer models perform very well, with transformers performing slightly better on judgements and BiLSTM-CRFs on laws. The CRF model is close behind the other two, mostly reaching scores in the higher nineties. Comparing the multilingual models to the monolingual models shows that there is no loss of performance when training on a much larger dataset: the multilingual models perform comparably to, or in the case of the transformer and BiLSTM-CRF models even better than, the monolingual models on each respective language.

Zero-shot Experiment on Portuguese Data
We conducted a more challenging experiment, evaluating multilingual models on Portuguese data and comparing them to the baseline. Figure 1 provides an overview, while Table 3 details the differences in judgements and laws against the baseline. For judgements, performance is adequate with F1-scores between 90.2% and 93.6%, comparable to user-generated content [26], and outperforming most baselines. However, for laws, only the transformer model scores in the lower nineties, while CRF and BiLSTM-CRF drop to 78.6% and 73.2%, respectively, similar to our usual baseline values. The transformer model's large-scale multilingual pretraining likely makes it more robust to distribution shifts, leading to better cross-lingual transfer to unseen languages than CRFs or BiLSTM-CRFs.
The difficulty of the writing and formatting style in Portuguese law texts could explain the difference between laws and judgements, as indicated by the lower than usual Portuguese baseline performance. BiLSTM-CRF's reduced performance could also result from the lack of Portuguese word embeddings used in training, as we only extracted embeddings from our training data. To improve BiLSTM-CRF models, future research could explore adding Portuguese word embeddings or using larger, multilingual embedding vocabularies during training. To improve transformer models, fine-tuning larger pre-trained models like XLM-RoBERTa [5] on the SBD task could be a potential avenue, as they improve significantly in cross-lingual transfer compared to mBERT [6] or DistilBERT [28] models.
When evaluating the monolingual models (each trained on its entire monolingual dataset) and the multilingual models on previously unseen Portuguese data (Table 5), we observe that the multilingual models outperform the corresponding monolingual models in most languages, with Spanish being a notable exception. We hypothesize that the disparity in performance is due to close linguistic ties between Spanish and Portuguese, which enabled the Spanish monolingual models to excel in cross-lingual transfer. However, on other languages linguistically less close to Spanish, the multilingual model is expected to perform better than the monolingual ones.

Inference Time
Table 6 reports the inference times of our multilingual models trained on laws and judgements. We measured inference time three times on both a GPU (NVIDIA GeForce RTX 3060 Ti) and a CPU (Intel Core i5-8600K CPU @ 3.60GHz), and show the average. We did not report standard deviation since there were no significant outliers. Notably, the transformer model saw significant improvements in inference time on a GPU. However, CRF does not benefit from GPU evaluation as it uses sequential operations. Considering the results presented in Sections 5.2, 5.3 and 5.4, inference times, and ease of use, we recommend the multilingual transformer model for most cases, as long as a GPU is available for inference. For language-specific tasks or tasks requiring longer input texts, we recommend the CRF models for the respective language, although they have a longer setup time compared to the BiLSTM-CRF and transformer models.
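The measurement protocol described above (three runs, averaged) can be sketched as follows. This is a minimal illustration, not the timing code used for Table 6: the function `mean_inference_time` and the stand-in model `dummy_predict` are hypothetical names introduced here, and any callable that takes a list of texts could be substituted.

```python
import time
from statistics import mean


def mean_inference_time(predict, texts, runs=3):
    """Return the mean wall-clock time in seconds over `runs` full passes."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(texts)  # one full pass over the dataset
        durations.append(time.perf_counter() - start)
    return mean(durations)


def dummy_predict(texts):
    # Hypothetical stand-in for a real SBD model: naive split on periods.
    return [t.split(".") for t in texts]


sample = ["Art. 1. The court decides."] * 1000
elapsed = mean_inference_time(dummy_predict, sample)
print(f"mean over 3 runs: {elapsed * 1000:.2f} ms")
```

When timing a model on a GPU, the device usually needs to be synchronized before the clock is read, otherwise asynchronous kernels may make the measurement look shorter than it is.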

Error Analysis
We inspected random samples, amounting to two thirds of the Portuguese dataset (8 judgements, 20 laws), predicted by the multilingual transformer model for the zero-shot experiment on Portuguese texts. We selected the multilingual transformer following our recommendation in Section 5.5, and the Portuguese dataset because the model already performed very well on the other datasets.
Standard sentence boundaries are rarely missed and the model performs adequately in that regard; yet, we identified a few sources of common mistakes. We discuss examples with |T| and |P| indicating true and predicted sentence boundaries, respectively. Many errors stem from citations and parentheses as shown in the example below:
The model failed to predict a sentence boundary at the end of both sequences. The errors showcased in the examples above mainly stem from our particularly defined sentence structures (Section 3.1.1) as well as the challenging nature of the legal SBD task.
Another set of errors was caused by the different formatting styles and words used in the Portuguese language, unknown to the model, such as:
In (1), we have the abbreviation "Sr(a)", which the model did not recognise as such, thus marking the period as a sentence boundary. A similar mistake is shown in (2), with the abbreviation "Exmos".
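To make the |T| / |P| notation concrete, the sketch below scores predicted sentence-end offsets against true ones and lists the mismatches. It is one plausible way to compute boundary-level precision, recall, and F1; the exact scoring used for the reported results may differ, and the function name `boundary_scores` is an assumption introduced here. The example text is the citation from the error analysis with the markers removed.

```python
def boundary_scores(true_ends, pred_ends):
    """Compare sets of sentence-end character offsets."""
    true_set, pred_set = set(true_ends), set(pred_ends)
    tp = len(true_set & pred_set)  # boundaries found at the right place
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(true_set) if true_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_positives": sorted(pred_set - true_set),  # |P| without |T|
        "false_negatives": sorted(true_set - pred_set),  # |T| without |P|
    }


text = ("(Bittar, Carlos Alberto. Direito de autor. "
        "Rio de Janeiro: Forense Universitária, 2001, p. 143)")

# The whole citation is one sentence, so the only true boundary is the end.
true_ends = [len(text)]
# The model additionally split after two periods inside the citation.
pred_ends = [
    text.index("Alberto.") + len("Alberto."),
    text.index("autor.") + len("autor."),
    len(text),
]

scores = boundary_scores(true_ends, pred_ends)
print(scores)
```

Listing false positives and false negatives separately helps distinguish over-splitting (as in citations and abbreviations) from missed boundaries (as in headlines without a terminating character).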

Limitations
Due to the language skills of our annotator, we only annotated data from two language groups (Germanic and Italic). Therefore, our languages have high lexical overlap, making cross-lingual transfer comparatively easy. Future work may investigate legal text from additional diverse language groups to build systems even more robust towards language distribution shifts. The annotator is a native German speaker with intermediate French language skills. Due to the similarity of Italian, Spanish, and Portuguese to French, and because the SBD task is largely structural, the annotations were possible. However, having the annotations performed by a native speaker of the respective language may further increase annotation quality. On the other hand, having one annotator (as done in our case) annotate the entire dataset enables more consistency across languages.
Because of financial limitations, we performed the annotations using only one annotator. Having a second annotator validate the annotations may further increase annotation quality.
Augmenting the qualitative error analysis from Section 5.6 quantitatively may provide more concrete and actionable evidence for improving the systems further. To achieve this, a more detailed annotation of the sentence type would be helpful, so that statistics over the sentences can be computed to identify quantitatively which sentence types perform worst.
The monolingual models achieved state-of-the-art F1-scores in the high nineties for all tested languages, comparable to reported scores on news articles [26]. The multilingual models performed similarly to monolingual models, demonstrating the potential of training with larger datasets. The transformer model exhibited superior cross-domain transfer compared to CRF and BiLSTM-CRF models.
RQ3: What is the performance of the multilingual models on unseen Portuguese legal text, i.e., a zero-shot experiment?
The transformer model performs adequately on the judgements and laws subsets, reaching F1-scores in the lower nineties and demonstrating the best cross-lingual transfer, while the CRF and BiLSTM-CRF models perform decently at around 90% on judgements but drop to baseline values on the laws, most likely requiring additional optimization.

Conclusion
In this work, we curated and publicly released a diverse legal dataset with over 130'000 annotated sentences in 6 languages, enabling further research in the legal domain. Using this dataset, we showed that existing SBD methods perform poorly on multilingual legal data, at most reaching F1-scores in the low nineties. We trained and evaluated mono- and multilingual CRF, BiLSTM-CRF and transformer models, achieving binary F1-scores in the higher nineties on our dataset, demonstrating state-of-the-art performance. For a more challenging task, we tested our multilingual models in a zero-shot experiment on unseen Portuguese data, with the transformer model reaching scores in the lower nineties, outperforming the baseline trained on Portuguese texts as well as the CRF and BiLSTM-CRF models by a large margin. We publicly release these models and the code for further use and research in the community.

Future Work
Further improvement for all models might be achieved by preprocessing the input text more, e.g., replacing newlines with spaces, or special characters with more widely used equivalent characters, e.g., double quotes (") with single quotes ("). Furthermore, thorough hyperparameter optimization tailored to the specific dataset could improve multilingual CRF and BiLSTM-CRF models. Finally, transformer models may benefit from legal-oriented models [4,13,22], larger pre-trained models like BERT [6], or models designed for cross-lingual transfer tasks, like XLM-RoBERTa [5].
Augmenting the dataset with legal texts from multiple languages and documents from various sources like privacy policies and terms of service may improve multilingual models' performance, particularly in the zero-shot scenario. An interesting impact on the model performance could be observed if the sentence spans were labeled with their sentence structure type such as "Citation" (Section 3.1.1) during training instead of being assigned a single label.
Future work could also investigate whether the positive cross-lingual transfer observed in our study also applies to languages from a different family, such as Hungarian. This question arises from the common origin of the languages studied, as mentioned in Section 5.
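The preprocessing idea above can be sketched in a few lines. The character mapping below is an assumption chosen for illustration (typographic quotes and non-breaking spaces mapped to common ASCII equivalents), not a mapping specified in this work, and `normalize` is a hypothetical helper name.

```python
import re

# Assumed mapping from typographic characters to widely used equivalents.
CHAR_MAP = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u201e": '"',  # double low-9 quotation mark
    "\u00ab": '"',  # left-pointing guillemet
    "\u00bb": '"',  # right-pointing guillemet
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u00a0": " ",  # non-breaking space
})


def normalize(text):
    """Map special characters to common ones and replace newlines with spaces."""
    text = text.translate(CHAR_MAP)
    # A newline plus any surrounding whitespace becomes a single space.
    text = re.sub(r"\s*\n\s*", " ", text)
    return text.strip()


print(normalize("\u00abArt. 1\u00bb\nDer Richter entscheidet."))
```

Note that collapsing whitespace changes character offsets, so span annotations created on the raw text would need to be remapped if such a step were applied after annotation.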

Figure 2: Sentence length distribution in tokens

Table 1: Statistics on datasets per language and subset

Table 4: Mean (±std) F1 Score of monolingual models on their respective language. Best scores are in bold.

Table 5: Mean F1 Score of monolingual and multilingual models on unseen Portuguese data

Table 6: Mean inference time in minutes (min), seconds (s), milliseconds (ms) for each multilingual model to predict the entire dataset of ~130000 sentences and one sentence, measured on a GPU and CPU

• (Bittar, Carlos Alberto. |P| Direito de autor. |P| Rio de Janeiro: Forense Universitária, 2001, p. 143) |T| |P|
In this example, we have a citation sentence with periods being wrongly predicted as sentence boundaries inside the citation. Another source of errors is datafields and headlines, since there is often little indication, e.g., a sentence-terminating character, for the model to recognize them as such:

6 CONCLUSION AND FUTURE WORK
6.1 Answers to the Research Questions
RQ1: What is the performance of existing SBD systems on legal data in French, Spanish, Italian, English, and German?
Existing SBD systems are subpar in all tested languages, lower than reported scores by Read et al. [26] on user-generated content, indicating that SBD is not solved in the legal domain.
RQ2: To what extent can we improve upon this performance by training mono- and multilingual models based on CRF, BiLSTM-CRF and transformers?