sustain.AI: a Recommender System to analyze Sustainability Reports

We present sustain.AI, an intelligent, context-aware recommender system that assists auditors, financial investors and the general public in efficiently analyzing companies' sustainability reports. The tool leverages an end-to-end trainable architecture that couples a BERT-based encoding module with a multi-label classification head to match relevant text passages from sustainability reports to the corresponding regulatory requirements of the Global Reporting Initiative (GRI) standards. We evaluate our model on two novel German sustainability reporting data sets and consistently achieve a significantly higher recommendation performance compared to multiple strong baselines. Furthermore, sustain.AI is publicly available for everyone at https://sustain.ki.nrw/.


INTRODUCTION
In the face of climate change and environmental degradation, our society's expectations of sustainable and responsible entrepreneurial action have increased continuously over the past years. Legislators worldwide, and particularly in the EU, have become increasingly aware of the situation and have taken concrete political measures to enforce corporate social responsibility (CSR). In 2014, the EU approved the Non-Financial Reporting Directive (NFRD), which forces large companies to extend their reporting on policies, risks and key performance indicators regarding sustainability and social matters. Beginning in 2024, the NFRD will be updated with the stricter CSR-Directive, which applies to around 50,000 European companies and includes a wider catalog of reporting requirements covering environmental, social and governance aspects. The majority of these requirements are based on the popular regulatory framework from the Global Reporting Initiative (GRI). Its universal reporting standards provide a detailed set of indicators that address a company's impact on the economy, environment and people.

Figure 1: A screenshot of the sustain.AI recommender tool. After selecting a specific regulatory requirement from one of the categories, the system predicts the most relevant segments of a provided sustainability report. On the right side, the recommended segments are highlighted in the rendered report, fostering an efficient sustainability analysis.
In light of these more comprehensive and rigorous sustainability regulations and the public's growing interest in corporate social responsibility, it is of vital importance to make the disclosed information easily accessible and comparable. However, manually retrieving and analyzing the published reports concerning specific GRI indicators is practically infeasible, especially considering that the documents often span around a hundred pages or more. This is particularly true for the auditing domain, where auditors spend hours assuring a report's compliance with said CSR standards.
Hence, we introduce sustain.AI, a sophisticated, context-aware recommender system that utilizes modern techniques of natural language processing (NLP) and machine learning to process and analyze uploaded sustainability reports. Concretely, interested users like consumers or investors can query the recommender engine for specific GRI indicators, e.g. the company's emissions (see Figure 1), and the engine returns and renders the most relevant document segments related to the query. Thus, stakeholders are able to quickly assess investment risks and opportunities arising from social and environmental issues and to evaluate the sustainability performance of companies. Similarly, auditors significantly benefit from the automated matching of concrete regulatory requirements to the relevant text passages. In fact, a large part of the sustainability report audit is about ensuring the completeness and correctness of the report according to the specified GRI standards.
Our recommender system builds on a BERT-based [3] encoding module followed by a non-linear multi-label classification head. Both components are trained jointly in an end-to-end fashion, leveraging weighted random sampling (WRS) to counter the significant class label imbalance. We evaluate the model on two novel German sustainability reporting data sets, consistently outperforming a large set of strong baselines by more than 10 percentage points in mean average precision.
sustain.AI is released to the public as a KI-NRW demonstrator, which is available at https://sustain.ki.nrw/. First user tests already suggest significant efficiency gains for the analysis of sustainability reports in the context of auditing. Moreover, the continuous use in production will further improve the system's recommendation capabilities due to the integration of human feedback, e.g. in the form of correcting wrong predictions.

RELATED WORK
Before continuing with the description of the inner workings of sustain.AI, we review prior work related to this system.
In terms of facilitating the audit of annual financial statements, [16] presented the Automated List Inspection (ALI) tool, a recommender system that ranks textual elements of financial documents against associated requirements of predefined regulatory frameworks like IFRS (International Financial Reporting Standards) or HGB (Handelsgesetzbuch). For the ranking task, the authors used classical NLP techniques like Tf-Idf (term frequency-inverse document frequency), latent semantic indexing, neural networks and logistic regression (LR), with the combination of the first and last methods giving the best performance. In a follow-up work, [12] improved ALI by utilizing a pre-trained BERT [3] language model as the backbone to encode text segments. Our architecture extends this approach by including weighted random sampling in the training process, which speeds up the model convergence time and improves the overall performance. Concerning a more granular information extraction approach related to automatic consistency checks of financial disclosures, [8] introduced KPI-Check, a BERT-based system that makes use of a tailored named entity and relation extraction model [7] to automatically detect and validate semantically equivalent key performance indicators in financial reports.
When it comes to the NLP-based analysis of sustainability or Corporate Social Responsibility (CSR) reports, different aspects have been researched. [6] and [5] addressed the problem of automatically evaluating the GRI- and ESG (Environmental, Social and Governance) accordance of CSR reports. Both applied unsupervised text similarity measures building on GloVe (Global Vectors for word representation) embeddings. Similarly, [2] leveraged the language model RoBERTa [9] to predict the relevance of sustainability reports according to the sustainable development goals in the USA. Specifically targeting the banking sector, [11] developed a rule-based named entity recognition approach to estimate an index that displays the level of compliance of climate-related financial disclosures with the TCFD (Task Force on Climate-related Financial Disclosures) recommendations.

METHODOLOGY
In this section, we formally define the problem of matching text segments within documents to relevant legal requirements before turning to the in-depth analysis of our proposed architecture which is visualized in Figure 2.

Problem Formulation
Given a sustainability report consisting of n distinct text segments S, e.g. paragraphs, titles, tables or diagrams, and a set of m regulatory checklist requirements R, our goal is to identify all semantically relevant text segments for each requirement. Since the number of requirements m is static, but each document has a different length (number of text segments) n, we initially model the described matching task from a segment-to-requirements perspective as a multi-label classification problem. Formally, for every s_i ∈ S our recommender model assigns relevance scores to all r_j ∈ R.
However, from the users' point of view, the reverse direction of getting relevant segment recommendations for a specific requirement r_j (requirement-to-segments perspective) is far more beneficial. This is especially true because a significant amount of text segments within a sustainability report is unrelated to concrete requirements in R. This is why, based on the assigned relevance scores, our model ranks the text segments per requirement in descending order and subsequently recommends the top k relevant text blocks to the user.
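The requirement-to-segments view described above amounts to transposing the per-segment scores and ranking per requirement. A minimal, framework-free sketch (all names are ours, not from the paper):

```python
def recommend(scores, top_k=3):
    """Given scores[i][j] = relevance of segment i for requirement j
    (n segments x m requirements), return for each requirement j the
    indices of the top_k highest-scoring segments, best first."""
    n_requirements = len(scores[0])
    recommendations = []
    for j in range(n_requirements):
        # Rank all segment indices by their score for requirement j.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i][j], reverse=True)
        recommendations.append(ranked[:top_k])
    return recommendations
```

In the full system these scores come from the classification head described in Section 3; here any score matrix works.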

Document Parsing
Before we focus on the actual recommender module, the core component of sustain.AI, we briefly touch upon the non-negligible task of document parsing. The large majority of publicly available sustainability reports are published as PDF documents, an inherently difficult format to convert into a structured machine-readable form like XML or JSON. The latter is particularly true for scanned PDF reports that only contain image information.
To solve this issue, our system utilizes a custom PDF parser (see Figure 2) that is capable of parsing machine-created as well as scanned PDFs with arbitrarily complex formatting. The parser leverages a refined image segmentation technique by combining the powerful object detection network Faster R-CNN [14] with the density-based clustering algorithm DBSCAN [4]. It is also trained to recognize specific elements of a document, such as footers, headers or pagination. For further details about the parser's functionality, we refer to [1].
After the successful PDF parsing, we apply some basic textual preprocessing in the form of removing line break hyphens and filtering out irrelevant text segment types like footer, header and table of contents. Our final set of considered segments S consists of titles, paragraphs, enumerations, tables and diagrams.
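The described clean-up could look roughly like this (a simplified sketch; the segment-type labels and function names are hypothetical, not taken from the system):

```python
import re

# Hypothetical labels for segment types that are filtered out.
IRRELEVANT_TYPES = {"footer", "header", "table_of_contents"}

def preprocess(segments):
    """Drop irrelevant segment types and join words that were
    hyphenated across line breaks, e.g. 'sustain-\\nability' -> 'sustainability'."""
    kept = []
    for seg_type, text in segments:
        if seg_type in IRRELEVANT_TYPES:
            continue
        text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)  # remove line-break hyphens
        kept.append((seg_type, text))
    return kept
```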

Recommender System
Considering a parsed and processed sustainability report, we use a pretrained BERT [3] model to individually encode each text segment s_i ∈ S.
Formally, we first apply WordPiece [15] tokenization to transform an exemplary input segment s into a sequence of sub-word tokens t = ([CLS], t_1, . . ., t_L, [SEP]). Note that [CLS] denotes a BERT-specific special token that aggregates the content of the entire segment, while [SEP] simply highlights the end of the sequence.
Passing t to the BERT model with pretrained parameters θ_bert yields a sequence of contextual token embeddings e_[CLS], e_1, . . ., e_L, e_[SEP], where e_[CLS] represents the aggregated context hidden state for the whole segment s.
Subsequently, we employ a multi-layer perceptron (MLP) with trainable parameters θ_mlp to predict relevance probabilities ŷ = [ŷ_1, . . ., ŷ_m] ∈ R^m for all requirements in R. The classifying MLP consists of a fully-connected hidden layer followed by dropout and ReLU (Rectified Linear Unit) activation functions and a sigmoidal output layer.
During training, we jointly optimize and fine-tune the parameters of the BERT model θ_bert and the classification head θ_mlp to minimize the Binary Cross Entropy (BCE) loss between target labels y and predicted probabilities ŷ.
Finally, after assigning relevance scores over all requirements for every segment s_i ∈ S, we sort the segments for each requirement r_j in descending order and recommend the top k relevant text blocks.
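To make the classification head and loss concrete, here is a dependency-free sketch (illustrative only: in the real system the input is BERT's [CLS] embedding, everything runs in a deep-learning framework, and all names below are ours):

```python
import math

def mlp_head(cls_embedding, w1, b1, w2, b2):
    """Fully-connected hidden layer with ReLU, then a sigmoidal output
    layer producing one relevance probability per requirement.
    w1[h] / w2[j] hold the incoming weights of hidden unit h / output j."""
    hidden = [max(0.0, sum(x * w for x, w in zip(cls_embedding, row)) + b)
              for row, b in zip(w1, b1)]
    logits = [sum(h * w for h, w in zip(hidden, row)) + b
              for row, b in zip(w2, b2)]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

def bce_loss(y_true, y_pred, eps=1e-9):
    """Binary cross entropy between targets y and probabilities y_hat,
    averaged over the m requirements."""
    return -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for y, p in zip(y_true, y_pred)) / len(y_true)
```

Dropout is omitted here for brevity; it only acts at training time.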

EXPERIMENTS
In the following sections, we introduce our two custom data sets of German sustainability reports, define our evaluation metrics, discuss the overall training setup, describe the competing baseline methods, and finally, evaluate results.

Data
We train and evaluate our algorithms on two novel sustainability reporting data sets. The first data set, named GRI, consists of 92 published sustainability reports from major German companies. The reports have been sourced in PDF format from the companies' websites. After the parsing step, domain experts from the auditing industry annotated all text segments in accordance with the requirements of the Global Reporting Initiative (GRI) standards. Concretely, we consider the 89 indicators of the GRI topic standards, which cover the three main categories economy, environment and social, further split into granular topics like anti-corruption, energy consumption and human rights assessment. The annotation workload was equally split among three auditors who were supervised by a senior auditor. In multiple iterations, the created requirement labels have been validated and refined via double-checking randomly selected sample annotations and a qualitative inspection of the false positive and false negative model predictions.
The second data set, named DNK, leverages the public sustainability reporting database of the German Sustainability Code (DNK). The platform is used by the majority of German companies to annually disclose their sustainability activities with respect to 33 requirements from 20 DNK criteria, e.g. usage of natural resources and human rights. The categories and their requirements cover most of the GRI topics but are generally less granular. In contrast to the PDF documents of the GRI data set, the DNK reports in HTML format follow a predefined structure where each section of text segments answers a distinct requirement. Since the requirement descriptions precede their respective sections, we can automatically retrieve the ground truth annotations from the HTML during the parsing process.
Table 1 displays descriptive statistics for both data sets. Due to the smaller amount of training documents, the greater document size and the annotation sparsity, we consider the GRI data set the harder challenge for our models. We separately train, optimize and evaluate our models on both data sets, using distinct training, validation and held-out test splits.

Evaluation Metrics
We quantitatively evaluate all models by calculating modified mean sensitivity (MS) and mean average precision (MAP) scores for the top k recommendations. While MAP punishes the lower ranked recommendations of relevant segments, MS only considers whether the relevant segments are contained in the set of recommendations. For a single document and a concrete requirement r_j, the modified sensitivity S(k) from [16] and the average precision AP(k) are respectively defined as:

S(k) = ( Σ_{i=1}^{k} rel(i) ) / min(ρ, k),    AP(k) = ( Σ_{i=1}^{k} P(i) · rel(i) ) / min(ρ, k),

where ρ denotes the number of relevant segment annotations, rel(i) indicates whether the i-th recommendation is relevant (rel(i) = 1) or not (rel(i) = 0), and P(i) represents the precision score considering the top i recommendations. Averaging S(k) and AP(k) over all checklist requirements r_j ∈ R and documents yields the subsequently reported mean sensitivity MS(k) and mean average precision MAP(k) metrics.
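For a single requirement's ranked recommendation list, these two metrics can be sketched in code as follows (variable names are ours; rel is the per-rank relevance indicator, rho the number of relevant annotations):

```python
def sensitivity_at_k(rel, rho, k):
    """Modified sensitivity S(k): number of relevant hits among the top k
    recommendations, normalized by min(rho, k)."""
    return sum(rel[:k]) / min(rho, k)

def average_precision_at_k(rel, rho, k):
    """AP(k): the precision P(i) accumulated at every relevant rank i,
    normalized by min(rho, k)."""
    score = 0.0
    for i in range(1, k + 1):
        if rel[i - 1]:
            score += sum(rel[:i]) / i  # precision considering the top i recommendations
    return score / min(rho, k)
```

For example, with rel = [1, 0, 1] and rho = 2, S(3) = 1.0 (both relevant segments are found), while AP(3) < 1 because the second relevant segment only appears at rank 3.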

Training Setup
In this section, we shed light on the training process and the hyperparameter optimization of sustain.AI. For all evaluated models, we conduct an exhaustive grid search comparing various parameter combinations based on their validation set MAP(3) performance to determine the best training setup. Table 2 highlights the explored ranges and respective best values of sustain.AI's tuned model parameters. As encoding backbone we employ a BERT-Base model published by the MDZ Digital Library team (dbmdz). It mirrors the architectural setup of its English BERT-Base counterpart and is pre-trained on a large corpus of German books, news reports and Wikipedia articles. We train our model and all neural network based baselines via gradient descent utilizing the AdamW [10] optimizer with a linear warmup of 10% and a linearly decaying learning rate schedule. Additionally, we apply weight decay of 0.01 and gradient clipping with a maximum value of 1. We also analyze different learning rates, batch sizes, levels of dropout regularization, and MLP hidden dimensions, as can be seen in Table 2. For all training runs we set a random seed of 42 and fix the maximum number of epochs to 15 while applying early stopping with a patience of 3 epochs.
Due to the small percentage of annotated segments in the GRI data set (9%, see Table 1), we employ weighted random sampling (WRS) with replacement to expose these relevant segments more frequently during training. Concretely, we alter the originally uniform sampling probability of each segment to the normalized inverse frequency of relevant or irrelevant segment occurrences in the training set.
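The resulting sampling distribution can be computed as follows (a minimal sketch with our own names; in practice this kind of weighting is typically handed to the framework's weighted sampler):

```python
def wrs_weights(labels):
    """Weighted random sampling probabilities: every segment receives the
    inverse frequency of its class (relevant = 1 vs. irrelevant = 0),
    normalized so that all probabilities sum to one."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    raw = [1.0 / n_pos if y else 1.0 / n_neg for y in labels]
    total = sum(raw)
    return [w / total for w in raw]
```

With e.g. one relevant segment among four, that segment is drawn with probability 0.5 instead of 0.25, so relevant segments appear far more often per epoch when sampling with replacement.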
Figure 3 showcases the benefits of integrating WRS into the model training process for the GRI data set. We achieve a much faster training convergence and thus save a considerable amount of training time and compute power, benefiting from early stopping. At the same time, our model's MAP(3) score on the validation set increases by 3 percentage points.

Baselines
We compare sustain.AI's end-to-end recommender model from Section 3.3 with four competing baseline architectures. For a fair comparison, all baselines make use of weighted random sampling on the imbalanced GRI data set.
First, we utilize word frequency-based Tf-Idf [13] representations that have been fitted on our respective training corpora. Prior to training, all segments have been preprocessed in terms of lowercasing, punctuation and digit removal as well as stemming. The resulting 8000-dimensional segment vectors are then used as input for an ensemble of one-vs-rest binary logistic regression (LR) classifiers. Each classifier is trained for a specific requirement r_j and a maximum of 100 iterations using the "liblinear" solver from the scikit-learn Python library. Second, we pass the same Tf-Idf representations into an MLP with one hidden layer of dimensionality 1024. In contrast to the binary logistic regression heads, the MLP performs multi-label classification and predicts the relevant requirements simultaneously. We find an optimal batch size of 64 and a learning rate of 1e-3.
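For illustration, a bare-bones Tf-Idf vectorizer (a simplified textbook variant with smoothed idf; not the exact implementation fitted in the paper, and all names are ours):

```python
import math
from collections import Counter

def tfidf(corpus_tokens):
    """Return (vocabulary, vectors), where each vector holds the term
    frequency times log-scaled inverse document frequency per vocabulary term."""
    n_docs = len(corpus_tokens)
    df = Counter()                      # document frequency per term
    for doc in corpus_tokens:
        df.update(set(doc))
    vocab = sorted(df)
    idf = {t: math.log(n_docs / df[t]) + 1.0 for t in vocab}
    vectors = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        vectors.append([tf[t] / len(doc) * idf[t] for t in vocab])
    return vocab, vectors
```

Terms occurring in every document get the minimum idf weight, so frequent boilerplate words contribute little to the segment representation.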
Third, we exchange the Tf-Idf input vectors with frozen contextual embeddings from sustain.AI's BERT model. As classifiers, we evaluate the previously defined MLP and a GRU (Gated Recurrent Unit). While the MLP takes BERT's [CLS] output embedding as input, the bidirectional GRU processes the resulting token representations of the frozen BERT model. Specifically, the last hidden state of the forward GRU and the first hidden state of the backward GRU are concatenated and passed to a sigmoidal output layer. Optimal settings are obtained with a hidden size of 512 neurons, a batch size of 8 and a learning rate of 1e-5.

Results
We evaluate and compare sustain.AI and all baseline methods on the previously specified held-out test set for both the GRI and DNK data. Table 3 reports mean sensitivity (MS) and mean average precision (MAP) scores for the top 3 and top 5 recommendations.
First, it can be seen that the overall DNK performance across all methods is much better compared to the GRI data. This was expected, considering the reduced number of requirements and the larger amount of training documents and annotations.
Second, we find that the application of weighted random sampling (WRS) during training significantly improves the test set performance of our model. Compared to the version without WRS, all metrics have increased by more than 6 percentage points. To enable a fair comparison, we apply WRS during the training process of all baseline methods. Also, WRS is solely employed for the GRI data, since the DNK reports do not exhibit any annotation scarcity.
Finally, the results in Table 3 show the overall superiority of sustain.AI's end-to-end architecture, outperforming all baselines by a large margin.

CONCLUSION AND FUTURE WORK
We presented sustain.AI, an interactive, AI-powered tool for the semi-automated analysis of German sustainability reports. Our transformer-based model achieves promising results both on the well-structured DNK data set and on the real-world GRI data, compared to a number of strong baselines. Qualitative exploration of the results also suggests that it is indeed helpful in analyzing those long documents. The tool is planned to be deployed on an online platform soon and will then be openly accessible to the public.
Future work includes improving the current model with additional annotated data, which can easily be inferred from the user feedback we will collect through the tool. We also plan to extend the framework to English reports, as currently only the processing of German documents is possible. Another idea for improvement is to extract specific numeric key performance indicators from the reports, such as different types of CO2 emissions, water consumption or indicators for social welfare.

Figure 2 :
Figure 2: Schematic visualization of the recommender system and the data flow in sustain.AI. A custom PDF parser processes the raw sustainability reports. After some textual clean-ups, a fine-tuned BERT model encodes individual text segments that are subsequently matched to relevant regulatory requirements.

Figure 3 :
Figure 3: Positive impact of weighted random sampling (WRS) on training convergence and validation performance. We report the mean average precision considering the top 3 recommendations (MAP(3)) with and without WRS.

Table 1 :
Properties of our GRI and DNK data sets. We display the number of requirements and documents, the average number of segments per document, the average percentage of segments assigned to at least one requirement, and the average number of matched segments per requirement.

Table 2 :
Evaluated hyperparameter configurations of sustain.AI.The best configuration on the validation set is highlighted in boldface.

Table 3 :
Test set results for the recommendation of relevant segments in GRI and DNK sustainability reports. sustain.AI outperforms all competing baselines in top 3/5 mean sensitivity (MS) and mean average precision (MAP).