Making Sense of Citizens’ Input through Artificial Intelligence

Public sector institutions that consult citizens to inform decision-making face the challenge of evaluating the contributions made by citizens. This evaluation has important democratic implications but, at the same time, consumes substantial human resources. However, until now the use of artificial intelligence such as computer-supported text analysis has remained an under-studied solution to this problem. We identify three generic tasks in the evaluation process that could benefit from natural language processing (NLP). Based on a systematic literature search in two databases on computational linguistics and digital government, we provide a detailed review of existing methods and their performance. While some promising approaches exist, for instance to group data thematically and to detect arguments and opinions, we show that important challenges remain before these could offer any reliable support in practice. These include the quality of results, the applicability to non-English language corpora, and making algorithmic models available to practitioners through software. We discuss a number of avenues that future research should pursue that can ultimately lead to solutions for practice. The most promising of these bring in the expertise of human evaluators, for example through active learning approaches or interactive topic modelling.

00:2 • J. Romberg and T. Escher

Public participation can take various forms such as written statements to planning procedures, oral statements during a public hearing on a proposed development, or proposals located on a digital map through an interactive online platform. In contrast to citizen-led initiatives such as petitions, expressions of political opinions (through discussions and demonstrations) or political consumerism, these top-down consultations allow authorities substantial control of the process through determining the design and organizational framework. What is more, they have a specific (even if often weak) link to decision-making processes that is regularly codified in law. Nevertheless, given that contemporary large-scale democracies are representative in nature with only limited opportunities for citizens to engage in decision-making directly, the role of public participation remains largely consultative. Public participation acts primarily as one of many sources of input (albeit a particularly important one) for those people who are legitimized (e.g., through elections) to take final decisions. Public authorities may utilize public participation to elicit input for different stages of the policy-making process, most regularly for agenda-setting, policy formulation and decision-making [82]. Generally, they pursue two distinct but related aims [77]: On the one hand, the additional information acquired through such procedures should result in better informed policies and better adapted solutions, therefore ideally leading to more effective policies. On the other hand, enabling citizens to provide knowledge, voice their concerns and (to some degree) shape the final policies is expected to achieve higher acceptance of, if not satisfaction with, the decisions, hence ideally resulting in higher legitimacy of the policies. Especially in response to heightened concerns about citizens' (dis)satisfaction with the way democracy works, such public participation has been
increasingly used by authorities around the world and at all levels of government, taking various shapes, from simple invitations to comment, to large-scale deliberative events [22].
Policy-makers that aim to incorporate the knowledge and attitudes of citizens to inform their policy decisions face a number of challenges, such as whom to include in such consultations, how to design the process in order to achieve the desired outcomes, and how much control citizens should wield over the process and its results, many of which do not yet have conclusive answers. We focus on one particular challenge, which is the processing of the collected data by the authority responsible. Policy-makers and their administrations regularly face the problem of how to make sense of the diversity of statements that the public provides [1, 2, 37, 52, 54, 71, 82]. This involves both identifying overarching patterns and pinpointing individual statements that require further action, in order to ultimately prepare conclusions from the input [52]. We call this process the evaluation of public participation contributions.
The relevance of this evaluation process can hardly be overstated. For example, basic democratic norms require that all citizens and their contributions are treated equally and that the process of decision-making is fair and transparent [20]. The way in which public authorities evaluate the input from citizens has direct consequences for public perceptions of legitimacy [77]. Empirical research has shown that if the public believes that the evaluation fails these criteria, this translates into lower legitimacy perceptions of the resulting policies [24, 58, 87]. What is more, in more formal participation procedures, public sector authorities may face costly litigation if they fail to identify and respond to substantive input by the public [52].
Hence public authorities have to dedicate care to the evaluation process in order to ensure that these normative criteria are satisfied and all contributions by citizens receive equal scrutiny. However, authorities are faced with the problem that evaluation takes considerable effort. It is regularly time-consuming, often requires substantial resources in terms of staff and money, and can lead to information overload [2, 16, 52, 63]. When authorities do not have sufficient resources to engage in these efforts, they might choose evaluation strategies that do not satisfy democratic norms or decide to refrain from engaging the public altogether. Therefore, finding ways to support this evaluation process is of crucial importance, not least because it can be the decisive factor for authorities to engage or not to engage the public at all.
While there are different potential solutions to this problem, such as increasing staff or using more structured participation formats, we focus on technological solutions in the form of computer-supported analysis procedures. While we believe that, due to the often contested nature of public participation and its potentially far-reaching consequences, evaluation always requires some form of human assessment [56], the question is to what degree these human evaluators can be supported in their work. Technical means have long been proposed as one potential solution to this problem [63] and in the meantime, natural language processing (NLP) has made huge advances. These artificial intelligence (AI)-based techniques could be applied to the evaluation as the majority of contributions in public participation are in the form of textual data. However, despite early research efforts dating back almost 20 years, so far we lack an overview of which of the available computational methods have already been applied to the evaluation of public participation and how these have performed. What is more, within the burgeoning field of AI and public policy, supporting the evaluation of contributions by the public through NLP is not yet recognized as a research field in its own right and relevant research is widely dispersed across different fields and disciplines.
Therefore, the key objective of this article is (i) to identify generic tasks in the evaluation process and how these could be supported through the use of AI, (ii) to summarize which approaches relying on computational text analysis have been used so far and to provide an assessment of their performance, and (iii) to identify remaining gaps to inform future research efforts that could ultimately lead to solutions that offer reliable support in practice and hence make democratic participation possible. While we rely on a systematic literature review, our aim is not to conduct a detailed census but to provide an overview of the state of the field along with its strengths and weaknesses.
The remainder of this article is structured as follows. After briefly reviewing the state of the field (2), we describe our research methodology (3) and identify the tasks involved in the evaluation process and how these might be supported through automated procedures (4). The main body of this study focuses on reviewing approaches to topical grouping of content (5) and to the extraction of arguments and opinions (6). We then summarize and discuss the main findings of the review and identify gaps that should be addressed by future research (7) before we conclude and offer reasons for the existing gaps in research (8).

SUPPORTING EVALUATION THROUGH COMPUTER-SUPPORTED TEXT ANALYSIS: RELEVANCE AND STATE OF THE FIELD
The task of evaluation is to make sense of citizens' contributions. These contributions derive from different sources and can take different forms. In offline public participation procedures citizens are asked for their opinion within on-site events or with tools such as questionnaires. Another source of contributions are online public participation procedures in which citizens have the opportunity to communicate their viewpoints via internet platforms. While citizen contributions can take many different forms, we focus our attention on textual data that might be derived from written statements from citizens, either created digitally or later digitized. Although by no means the only format, we believe these to be the most regularly used. When public agencies have collected input from citizens, this needs to be analyzed. The overarching aim of such an analysis is to find out which issues are raised and to decide whether this input should trigger further action, such as a response or a change of the proposed plan. This requires reading each contribution, often several times, and as a result, the process of evaluating citizen contributions can take a significant amount of effort. How much effort depends on the number of citizens who participate as well as the amount and the length of contributions. Historically, there are numerous instances of offline participation that have resulted in large amounts of data. For example, when in 1997 the United States Department of Agriculture (USDA) launched its public comment period on standards for the marketing of organic agricultural products, the majority of the more than 277,000 comments were received via postal mail [79]. However, the ease and velocity afforded by information and communication technologies (ICT) has enabled more people to submit more statements in a shorter time. Coupled with the increasing relevance of public participation, there are now more instances of public participation that each tend to receive more comments than in
the pre-digital era. Livermore et al. [52] provide an overview of this development for the particular case of US e-rulemaking that eventually resulted in "megaparticipation" such as the US Environmental Protection Agency receiving more than 4 million comments on the proposed Clean Power Plan. This development has increased the administrative burden and hence the urgency of the problem [52, 80].
From early on, ICTs have not just been perceived as one cause of the problem but also as a possible solution to tackle the evaluation problem. For example, in 2003 an OECD report [63] highlighted the analysis of e-contributions as a challenge that might be solved through the use of content analysis techniques that help to structure contributions. Already in the late 1990s, in response to a growing number of comments on regulatory rules [19, 79], the National Science Foundation among others had funded research such as the Cornell eRulemaking Initiative (CeRI) or the Penn Program on Regulation that investigated the potential of text-processing techniques to sort through public comments [79, 81, 97]. This sparked remarkable research activity that resulted in the development of functionalities for searching similar content [79], duplicate detection [97], categorizing comments [12] and relating these to regulations [47]. Yet, as far as we know, these have not moved beyond the stage of prototypes and they have never experienced sustained use in public administration.
Since then, not only have government consultations and other instances of public participation increased, but the technology in the form of AI has also made vast improvements. In particular, the progress of machine learning algorithms has increased the capabilities of NLP, an "area of computer science that deals with methods to analyze, model, and understand human language" [91:4], which is of direct relevance to the evaluation task. As a matter of fact, the public sector now regularly applies AI to large amounts of data in order to derive insights for different stages of the policy-making process [95, 104]. However, so far we lack an overview of which specific technologies have been used to analyze citizen contributions and how these perform in comparison to established human evaluation. The only review that we know of that focuses on the technology is outdated and not comprehensive [55].
While more recent advances in NLP techniques such as pre-trained language models have shown remarkable results on a variety of application tasks such as text translation and conversational agents (most recently through the release of ChatGPT) and in different application domains, they have yet to demonstrate their value for the input from public participation, as these texts exhibit a number of differences from other domains. For example, tweets or other social media contents are not only shorter than citizen proposals but have also been shown to use a different vocabulary and syntax [31], not least demonstrated by the fact that specially trained models exist for this particular domain [e.g., 61]. Also, the contributions from public participation usually revolve around making proposals and deliberating about different possible solutions. As such their content differs from the contributions in comment sections on news portals, product reviews or online discussion groups, which are primarily used to voice opinions and sentiments. The specific properties of public participation data lead us to believe that existing breakthroughs do not necessarily deliver the same results in this domain.
Given the need for support of the evaluation process, we believe it is urgently required to take stock of this field by reviewing the strengths and weaknesses of those approaches that have been used and by offering guidance for further research. Here we focus mainly on the technological basis to offer an assessment of whether NLP technologies could support public authorities in reducing the burden on human resources or achieving more accurate results. Clearly, whether such technologies actually should be used depends on additional normative considerations, given that evaluation has important implications for the democratic process as outlined earlier.
The increasing use of AI in the public sector raises fundamental questions about transparency (e.g., what goes on inside the black box of the algorithm), accountability (e.g., who is responsible for decisions derived from AI), fairness (e.g., is the algorithm biased) and how these impact on the legitimacy of decisions, among others. There is now an established debate that focuses on these implications, which are different for governments than for businesses [21, 42, 86, 95].
However, we believe that questions about the ethics of AI use in government cannot be answered without a better understanding of the value that the technology could actually provide: If existing technologies cannot support the evaluation process, their implications would remain irrelevant. Conversely, if AI were able to support evaluation, it is necessary to assess the degree of efficiency gains and the risks involved (such as mislabeling) to weigh these up against normative requirements such as fairness and accountability. This review can offer the basis for such a normative judgment. Therefore, in contrast to current reviews of AI use in government [88, 95, 104], we focus explicitly on the technology used and its performance instead of more general implications of the use of AI in the public sector.

Digital Government: Research and Practice, Vol. 00, No. JA, Article 00. Publication date: June 2023.

METHODOLOGY
We have conducted a systematic literature review, following the basic steps as suggested by Kitchenham [39] including (i) identifying relevant research, (ii) selection of relevant studies, and (iii) quality assessment of the selected studies, followed by the actual analysis and synthesis of the data.
The major challenge to the identification of relevant research has been that the task of evaluating citizen contributions has so far not been recognized as a research problem in its own right, but that relevant research occurs in different research areas. The research area that focuses on the development as well as the implications of using AI for policy-making has been termed policy analytics [30, 82] but relevant research has also been undertaken under the heading of big data [29, 88], data science [8], artificial intelligence in government [104], or policy informatics [102].
We have addressed this challenge by combining two search strategies, namely a search of publication databases and a snowballing approach to identify additional studies of interest. We started by searching two publication databases that complemented each other, as one focused on the technology of interest, while the other focused on the application area of interest. On the one hand, we used the Association for Computational Linguistics Anthology as it offers a large collection of more than 80,000 articles from the field of computational linguistics. On the other hand, we drew on the almost 18,000 documents from the Digital Government Reference Library [78] to find peer-reviewed articles in the domain of digital government and democracy. Including all articles up until early 2023, the search resulted in 285 documents that were subsequently screened to select studies of relevance to the goal of this literature review. Articles were not only required to use NLP techniques but also had to rely on datasets from the field of public participation or to present the application of these techniques specifically to this domain. What is more, as the focus of this survey is explicitly on contributions generated directly by citizen participation processes, articles were excluded that related only to citizen contributions in a broader sense (such as citizen posts on Twitter about municipal issues). We further required that the studies either critically evaluated the results of the applied NLP techniques, or proposed particular software solutions for practitioners that used NLP for the analysis of contributions in citizen participation processes. This left a total of 27 studies. Because of this small number and the fact that these had all been peer-reviewed, no further assessment of the study quality was necessary.
As a second strategy to identify additional relevant literature, we employed a snowballing approach as defined by Wohlin [96]. Using these 27 publications as a starting set, we conducted backward snowballing by accessing the references cited in these publications, as well as forward snowballing by using Google Scholar to find more recent publications citing any of the publications in the starting set. To complement the snowballing approach, we followed the suggestion by Wohlin [96] and screened the entire list of publications of all authors that had (co-)authored several of the articles in our list of relevant documents. This strategy resulted in 28 additional studies.
Through this combination of strategies, which is visualized in Figure 1, we identified 55 studies. These offer a comprehensive overview of the diversity of existing approaches that have been in use for the particular domain of evaluating public participation contributions, and allow us to identify gaps that we will discuss in the next sections. Given the dispersed state of the field, it is almost impossible to provide a complete overview of all existing studies, but our strategy should allow us to offer a rather comprehensive overview of the state of the field.

TASKS IN THE PROCESS OF EVALUATING PUBLIC PARTICIPATION CONTRIBUTIONS
While consultation processes initiated by public authorities differ in the format of contributions citizens provide, the type of information the receiving authority is looking for, and the formal requirements for processing submissions, it is possible to recognize two broad evaluation requirements that are common across all of these types of processes. These are identifying substantive contributions on the individual level, and gaining insights into common themes and trends on the aggregate level. Livermore et al. [52:1015] term these the "haystack problem", i.e., to find the signal in the noise of mass contributions, and the "forest problem", i.e., to derive information from the whole corpus of contributions. While the analytical perspectives are different, the tasks necessary to achieve these insights are largely similar.
Based on the literature reviewed here [37, 55, 81, 82] and confirmed by our own interviews with practitioners [74], we can identify a number of generic tasks that need to be performed: (i) detecting (near) duplicates, (ii) grouping of contributions by topic and (iii) analyzing the individual contributions in depth, e.g., to identify arguments or other content of relevance. Each of these tasks can help to find the individual comment of relevance among a mass of comments, for example by removing duplicates, by grouping those with a particular content in one group (and disregarding others) or by providing a sentiment score for individual comments. In the same way, these tasks support the identification of themes on the aggregate level, by identifying different topics or providing sentiment distributions.
Figure 2 details these three tasks along with their specific subtasks that we introduce in this section.Tasks highlighted in green are those that have received most attention in the literature and which we subsequently focus on in this review.
The evaluation process often starts with the detection of exact duplicates or substantially identical proposals, even though this filtering can also occur in later stages of the process. Given that online comments in particular can be easily submitted, and campaigns might often invite the public to make use of preformulated statements, authorities might receive many comments that are identical or nearly identical. For example, Livermore et al. [52] assume that 99% of the 4 million comments to the EPA's Clean Power Plan were actually duplicates or near duplicates. For an earlier rulemaking, Shulman [80] reported that three quarters of comments related to copy-and-paste letters and not individually crafted statements. Identifying duplicate contributions is important for analysts to save time during the evaluation and to avoid undue influence on the process by individual stakeholder groups. At the same time, in the case of near-duplicates, care must be taken to ensure that no substantial information is lost.
The detection of (near) duplicates in the domain of online citizen participation has already been studied by Yang and colleagues [97, 98, 100], who released the DUplicate Removal In lArge collectioNs (DURIAN) system. Applying DURIAN to 3,000 English-language public comments from U.S. rulemaking showed that the system recognizes duplicates well and with an acceptable runtime. In particular, the high agreement with human ratings of near-duplicates is remarkable. The language-independent structure of the algorithm suggests that duplicates can be detected similarly well in other languages.
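To make the general idea of near-duplicate detection concrete, the following minimal sketch (our own illustration, not the DURIAN algorithm) compares comments by the Jaccard overlap of their word shingles; the example comments and the threshold are invented for illustration:

```python
def shingles(text, n=3):
    """The set of n-word shingles of a text (lowercased, whitespace-tokenized)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def find_near_duplicates(comments, threshold):
    """All index pairs (i, j) of comments whose shingle overlap reaches the threshold."""
    sets = [shingles(c) for c in comments]
    return [(i, j)
            for i in range(len(sets))
            for j in range(i + 1, len(sets))
            if jaccard(sets[i], sets[j]) >= threshold]
```

A campaign letter with an appended "Thank you." would still be flagged as a near duplicate of the original, while a comment on an unrelated issue would not; production systems would add hashing tricks (e.g., MinHash) to avoid the quadratic pairwise comparison.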
Notwithstanding the relevance of duplicate detection, the two remaining tasks are actually more important for the analysis of citizen input. The second task that occurs regularly is that the mass of contributions needs to be grouped thematically. This global structuring of all contributions provides the analyst with a quick overview of the topics which arose and in which contributions these can be found. We will provide a detailed overview of the approaches to grouping by topic in Section 5.
As a third task, contributions are analyzed in further depth, mainly for arguments or opinions. The analysis of arguments and certain aspects of discourse can support a more detailed assessment and indicate how certain issues are perceived by the public. Approaches to solving these tasks form the largest portion of the literature reviewed and will be discussed in Section 6. In addition, there are a number of other aspects for which automated solutions were considered useful in the evaluation of citizen participation processes. These include stakeholder identification [4], the recognition of citations in public comments [5], the estimation of the urgency of urban issues [57], a relatedness analysis of provisions in drafted regulations and public comments [47], and the summarization of comments [3].

GROUPING THE DATA COLLECTION BY TOPIC
There are two ways of addressing the task of sorting citizen contributions into topic groups, which are shown in Figure 3. In supervised machine learning, the goal is to predict the true label(s) for a given data point out of a set of predefined topics. To build such a machine learning model, labeled training data is required to fit a model to the task. In contrast, unsupervised machine learning does not need training data. The goal of these models is to find latent topics in the data to form clusters of topically similar data points. We review in turn how both approaches have been used to categorize contributions from citizens.

Supervised Approaches: Classification by Thematic Categories
We first concentrate on the classification (hereinafter also referred to as categorization) of textual content into appropriate content-based categories. This approach relies on a predefined set of (thematic) categories and uses supervised learning to train an algorithm which can subsequently classify citizen contributions and assign them to the predefined topic groups. Administrative staff or service providers usually categorize contributions according to various aspects when evaluating them. Assigning the contributions to the appropriate categories makes it easier to grasp and summarize the essential issues raised within each of the individual categories. It also makes it possible to focus on particular topics in order to identify individual contributions of relevance. Table 1 in the Appendix provides a systematic overview of the literature covered here, including information on the datasets, the categorization schemes, and the algorithms that have been applied in the studies.
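As a minimal illustration of this supervised setup (our own sketch, not a reconstruction of any of the reviewed systems), the following code trains a tiny bag-of-words Naïve Bayes classifier on hand-made example contributions and predicts a topic for a new one; all texts and category names are invented:

```python
import math
from collections import Counter

class NaiveBayesTopicClassifier:
    """Tiny multinomial Naive Bayes over bag-of-words features, Laplace-smoothed."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)                       # class frequencies
        self.word_counts = {c: Counter() for c in self.classes}
        self.vocab = set()
        for text, label in zip(texts, labels):
            words = text.lower().split()
            self.word_counts[label].update(words)
            self.vocab.update(words)
        self.total = {c: sum(self.word_counts[c].values()) for c in self.classes}
        return self

    def predict(self, text):
        words = text.lower().split()
        v = len(self.vocab)

        def log_score(c):
            # log P(c) + sum of log P(word | c), with add-one smoothing
            s = math.log(self.prior[c])
            for w in words:
                s += math.log((self.word_counts[c][w] + 1) / (self.total[c] + v))
            return s

        return max(self.classes, key=log_score)
```

The reviewed studies use considerably stronger models (SVMs, neural networks, pre-trained language models), but the workflow is the same: annotated examples in, a label out of the predefined scheme for each new contribution.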
The evaluation datasets range from formal processes such as U.S. eRulemaking to more informal civic participation projects (online and on-site) from Chile, Germany and South Korea. Thematically, the processes focused on transportation and environment, as well as on urban issues and a constitutional process. A variety of categorization schemes is used, which differ in the number and subject of categories as well as between hierarchical and non-hierarchical structures. Categorization is furthermore conducted on different levels of granularity: either on contributions in their entirety or on smaller units of analysis, e.g., sentences, ideas, or arguments.
Categorizing contributions [6, 38, 44, 45] yielded good results for the categories that occur frequently in the training datasets, while most categories with little support (i.e., few training examples) could only be recognized moderately well or poorly. Balta et al. [6] faced a further difficulty when working with a category that represents a collection of miscellaneous topics. In contrast to the more specific categories, it is difficult to find class-typical indicators for such a group (i.e., "other").
Cardie and her co-authors [12, 13] focus on sentence-level categorization. They compare a flat categorization approach with a hierarchical one that leads from main categories to more detailed subcategories. Surprisingly, the hierarchical approach cannot surpass the flat one. At the same time, however, neither approach performs convincingly. Aitamurto et al. [1] also categorize hierarchically and achieve good results for the main categories. At the lower levels of the hierarchy the performance is significantly weaker.
Fierro et al. [26] predict matching constitutional concepts for arguments with moderately good results. Interestingly, in addition to exact match performance, the authors also consider whether the correct concept is among the five most likely concepts identified by the algorithms. This is indeed almost always the case for the best-performing algorithm, fastText. Especially with regard to a software solution in which human and machine work together, these are promising results because human coders could be supported by restricting their choices from a large number of categories to a few most likely ones. Giannakopoulos et al. [28] enhance the exact classification performance with a neural network, but the algorithm takes more than seven hours to train.
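The idea of narrowing a large category set down to a handful of likely candidates can be sketched as follows; the helper names and the example scores are our own, not taken from Fierro et al.:

```python
def top_k_candidates(scores, k=5):
    """Return the k category labels with the highest predicted probability, best first."""
    return [label for label, _ in sorted(scores.items(), key=lambda kv: -kv[1])][:k]

def top_k_accuracy(predictions, gold_labels, k=5):
    """Fraction of items whose true label appears among the top-k candidates."""
    hits = sum(1 for scores, gold in zip(predictions, gold_labels)
               if gold in top_k_candidates(scores, k))
    return hits / len(gold_labels)
```

In a human-in-the-loop tool, `top_k_candidates` would populate a short suggestion list for the coder, and `top_k_accuracy` is the metric corresponding to the "correct concept among the five most likely" evaluation described above.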
Regardless of the classification quality of the approaches presented so far, all works used a substantial amount of data for training purposes, e.g., from several thousand to over a hundred thousand sentences, arguments or documents. At the same time, all works use categorization schemes that are tailored to the corpus in question, and hence a customized model must be trained for each dataset. This creates a tension because, in order to support an analyst's work, the additional workload caused by manual annotation of data must be kept low.
To address these problems and to provide a feasible solution, Purpura et al. [70] suggest the use of active learning. Active learning takes place in close collaboration with the user and consists of two steps: First, a fixed number of unlabeled data points are selected that are assumed to bring the highest gain for the training of a (classification) algorithm. Second, the selected unlabeled data points are manually labeled with the appropriate topic and the classifier is re-trained with all already labeled data. Both steps are repeated until the classifier is reliable. As expected, the evaluation shows that active learning tends to achieve good precision faster than non-active learning, but a closer look at the results highlights that the tested algorithms (Support Vector Machines (SVM), Naïve Bayes, and Maximum Entropy) must still be trained with about 1,000 data points to achieve good results. In a more recent article, however, Romberg & Escher [75] were able to show that the amount of training data can be significantly reduced to a few hundred data points when active learning is combined with current state-of-the-art approaches for text classification (i.e., pre-trained language models).
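The two-step loop described above can be sketched generically. This is our own illustration, not the setup of Purpura et al.: the uncertainty criterion (least-confident sampling) is one common choice among several, and `train`, `predict_proba` and `oracle` are placeholders for a real classifier and a human annotator:

```python
def select_uncertain(probs, batch_size):
    """Least-confident sampling: indices whose highest class probability is lowest."""
    ranked = sorted(range(len(probs)), key=lambda i: max(probs[i]))
    return ranked[:batch_size]

def active_learning_loop(pool, oracle, train, predict_proba, batch_size, rounds):
    """Pool-based active learning: select uncertain items, have them labeled, retrain."""
    pool = list(pool)
    labeled_x, labeled_y = [], []
    model = None
    for _ in range(rounds):
        if not pool:
            break
        if model is None:
            # seed round: no model yet, so take the first items
            picks = list(range(min(batch_size, len(pool))))
        else:
            probs = [predict_proba(model, x) for x in pool]
            picks = select_uncertain(probs, batch_size)
        for i in sorted(picks, reverse=True):   # pop from the back to keep indices valid
            item = pool.pop(i)
            labeled_x.append(item)
            labeled_y.append(oracle(item))      # the human annotator supplies the label
        model = train(labeled_x, labeled_y)     # retrain on all labeled data so far
    return model, labeled_x, labeled_y
```

In practice the loop would stop once validation performance plateaus rather than after a fixed number of rounds, and `train` would wrap whatever classifier is in use, from an SVM to a fine-tuned language model.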

Unsupervised Approaches: Topic Modeling and Clustering
In contrast to supervised procedures, unsupervised approaches that assume no prior knowledge of the data can be applied. Basically, there are two types of approaches, which are both unsupervised learning strategies: In topic modeling the latent topics of a collection of texts are explored and for each document the degree of membership to each topic is determined. In clustering, documents are grouped by similarities. If the similarities are determined on the basis of the content of the texts, the clusters can represent topics as in topic models. In the following we provide an overview of those works in which these algorithms are not only applied but also analyzed and evaluated. The existing works applied unsupervised approaches to eRulemaking processes as well as e-participation and e-petitioning data from the U.S., Austria, China, Spain, and Belgium. The detailed list of works and their characteristics can be found in Table 2 in the Appendix.
In contrast to supervised learning, the evaluation of unsupervised learning algorithms is more complex because there is no labeled ground truth to which the results can be compared. In the works reviewed here, either manual qualitative analysis or the measurement of agreement between algorithmic and human topic assignment is used to rate the algorithms' quality. Most works relied on the topic modeling method Latent Dirichlet Allocation (LDA) to find clusters of thematically similar contributions, which presupposes a fixed number of topics. Levy & Franklin [ 49 ] algorithmically detect eight topic clusters of which seven are confirmed by human review. Hagen et al.'s [ 36 ] best model, determined by experimenting with different values for the number of topics, consisted of 30 topics of which 21 had a coherent theme. Manual judgment also showed that labeling the topic clusters with the most probable topic term worked well for high-quality topics. Similar findings were reported by Arana-Catania et al. [ 2 ], but for the respective dataset the alternative method Non-Negative Matrix Factorization (NMF) was able to detect a higher number of relevant topics than LDA. In contrast to the manual analysis used in these studies, Ma et al. [ 53 ] estimate the best number of topics with the perplexity metric. In a user study, the LDA model outperformed a common public management search method.
An alternative approach to LDA is the use of associative networks, in which topically related concepts can be clustered based on activation patterns [ 89 ]. Manual comparison showed that the emerging clusters resemble the categories that are used by the citizens on the participation platform, e.g., environment, health or education. Simonofski et al. [ 82 ] proposed the use of k-means clustering which (similar to LDA) requires a predefined number (k) of clusters to be found. To overcome this limitation, the authors proposed the so-called elbow method to computationally determine an optimal value. A manual analysis with two practitioners made the limitations clear: both believed that the clusters must be checked manually. Nevertheless, they also acknowledged the helpfulness of the algorithm in avoiding fully manual clustering.
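The elbow heuristic can be automated in several ways; one common variant (Simonofski et al. do not specify their exact implementation, so this is an illustrative sketch) picks the k whose (k, inertia) point lies farthest from the straight line joining the endpoints of the curve:

```python
def elbow_k(ks, inertias):
    """Pick the k at the 'elbow' of a decreasing inertia curve:
    the point farthest from the chord joining the curve's first
    and last points (a simple knee-detection heuristic)."""
    x0, y0 = ks[0], inertias[0]
    x1, y1 = ks[-1], inertias[-1]
    best_k, best_d = ks[0], -1.0
    for k, inertia in zip(ks, inertias):
        # Unnormalized perpendicular distance to the chord.
        d = abs((y1 - y0) * k - (x1 - x0) * inertia + x1 * y0 - y1 * x0)
        if d > best_d:
            best_k, best_d = k, d
    return best_k
```

Given within-cluster inertias computed for a range of candidate k values, the function returns the candidate where additional clusters stop yielding large improvements.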
The abovementioned works show that unsupervised learning can identify topics, but with serious limitations. To address the challenges of interpretability and validity of LDA for content analyses, Hagen [ 34 ] makes three recommendations for the application of topic modeling: (1) Word stemming can enhance results, but further preprocessing of the data should be kept to a minimum. (2) The number of topics should be determined with a combination of the perplexity metric and human judgment. (3) The generated topics should always be validated (e.g., for topic quality, external validity and internal coherence).
Digital Government: Research and Practice, Vol. 00, No. JA, Article 00. Publication date: June 2023.
Topic models without strong human supervision tend to produce topics that have no clear meaning to analysts. This can be caused by inappropriate choices of model parameters, or by the deviation of the statistically meaningful model outcome from the outcome an analyst expects. To overcome this mismatch,
Cai, Sun & Sha [ 10 ] propose the use of interactive topic modeling. Similar to the active learning approach for supervised learning, in interactive learning the human user is directly involved in the model building process. In the first step, topics are discovered in an unsupervised manner. Then, the user investigates the clusters and refines them by merging or splitting topic clusters. The resulting topic model can be qualitatively inspected to decide whether further refinement is necessary. Evaluation on some example cases showed that the manual refinement operations improved the clustering and led to higher overall topic coherence. Yang & Callan [ 99 ] also use an interactive approach, based on clustering, and introduce the software OntoCop to construct topic ontologies in collaboration with a user. Human evaluation showed that the interactive setting can reduce the time needed to obtain a satisfactory topic clustering and that interactively constructed ontologies resemble manually constructed ones.
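The merge and split refinement operations are conceptually simple. The following sketch shows how such user feedback could be applied to a flat cluster assignment; it is illustrative only (the function names and the representation as a list of cluster labels are our assumptions, not the implementation of Cai, Sun & Sha):

```python
def merge_clusters(assignment, source, target):
    """User feedback 'these two topics are the same':
    relabel every document in cluster `source` as `target`."""
    return [target if c == source else c for c in assignment]

def split_cluster(assignment, cluster, doc_ids, new_cluster):
    """User feedback 'these documents form their own topic':
    move the given documents out of `cluster` into `new_cluster`."""
    return [new_cluster if (c == cluster and i in doc_ids) else c
            for i, c in enumerate(assignment)]
```

In an interactive system, the refined assignment would then seed the next round of model fitting rather than simply overwrite the labels.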

MINING ARGUMENTS AND OPINIONS IN CITIZEN CONTRIBUTIONS
After reviewing approaches to the second task of topical grouping, we now turn to technical solutions to support the third evaluation task, namely an in-depth analysis of individual contributions. While these include different tasks as outlined in Section 4, here we focus on the analysis of argumentation components, of discourse and of sentiments, as these are the tasks that have been most often addressed in the studies we review here.

Argument Mining
Public participation often takes place in a discursive format. Citizens can express their opinions and ideas on certain topics, have the possibility to refer to the contributions of others in their comments, and can argue for or against stances. In the evaluation, the analysis of arguments is important in order to make the different citizen opinions visible. The term argument mining refers to the automated identification and extraction of arguments from natural language data. Judging from the results of our literature review, it is one of the most prominent areas of research in the field of citizens' participation. Table 3 in the Appendix provides the details on the individual studies, which we summarize in the following subsections. As in topic grouping, many of the datasets originate from U.S. eRulemaking initiatives. Further data sources derive from German-language citizen participation on the restructuring of a former airport site as well as on transportation-related spatial planning processes, a Japanese-language online citizen discussion in the city of Nagoya, and citizen contributions from the 2016 Chilean constitutional process.
According to Peldszus & Stede [ 68 ], argument mining can be systematized as three consecutive subtasks: (1) segmentation, (2) segment classification, and (3) relation identification. While some of the reported approaches tackle multiple steps at once, where possible we nevertheless address the results separately for the three steps.

Segmentation.
In the segmentation step, citizen contributions are divided into units of argumentative content. All articles that we review here use sentences as units of information and classify them as either argumentative or not. A direct comparison of the results is hardly possible due to the differences in the datasets (i.e., language, specific properties of the processes analyzed, share of argumentative content). While Eidelman & Grom [ 23 ] work on a dataset consisting of almost 90 percent non-argumentative sentences, argumentative content prevails in the datasets introduced by Liebeck, Esau & Conrad [ 51 ], Morio & Fujita [ 59 ] and Romberg & Conrad [ 73 ]. This class distribution strongly influences the performance of the algorithms; accordingly, similar algorithms (such as SVM) produce divergent results on the different datasets. Overall, fastText and logistic regression with embedding features [ 23 ], SVMs with a combination of unigrams and grammatical features [ 51 ], BERT [ 73 ] and parallel constrained pointer architectures (PCPA) [ 60 ] lead to the best, but not yet sufficient, results in classifying argumentative sentences on the respective datasets.
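The sentence-level decision at the core of this step can be illustrated with a deliberately simple bag-of-words classifier. The sketch below is a toy multinomial Naive Bayes with invented example sentences; none of the reviewed systems works exactly this way, and real approaches use far richer features:

```python
from collections import Counter
import math

class TinyNB:
    """Minimal multinomial Naive Bayes over bag-of-words features,
    e.g., for labeling sentences as argumentative or not."""

    def fit(self, texts, labels):
        self.labels = sorted(set(labels))
        self.prior = {c: labels.count(c) / len(labels) for c in self.labels}
        self.counts = {c: Counter() for c in self.labels}
        for text, label in zip(texts, labels):
            self.counts[label].update(text.lower().split())
        self.vocab = {w for c in self.labels for w in self.counts[c]}
        return self

    def predict(self, text):
        words = text.lower().split()

        def log_prob(c):
            # Laplace-smoothed log likelihood plus log prior.
            total = sum(self.counts[c].values()) + len(self.vocab)
            return math.log(self.prior[c]) + sum(
                math.log((self.counts[c][w] + 1) / total) for w in words)

        return max(self.labels, key=log_prob)
```

With a handful of labeled sentences, cue words such as "because" and "should" already push unseen sentences towards the argumentative class, which hints at why class distribution and dataset properties influence the reported results so strongly.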

Segment Classification.
Following segmentation, the identified argument units need to be mapped to their function in the argument. The schemas used to capture the different functions of argumentative discourse units vary widely. Most works focus on recognizing the contextual function of the components of an argument. Additionally, there are a number of works that focus on intrinsic properties, i.e., the verifiability, evaluability, and concreteness of arguments.
Morio and Fujita [ 59 , 60 ] use a straightforward scheme of claim and premise. Claims are defined as the core component of an argument and consist of controversial statements. Premises are reasons supporting or opposing a claim. A related two-fold division is used by Kwon and co-authors [ 44 , 46 ], who distinguish main claims from sub-claims and main-supporting/opposing reasons of a main claim. Liebeck et al. [ 51 ] introduce major positions ("options for actions or decisions that occur in the discussion") as an additional component type for processes in which citizens can submit their own proposals for discussion. Romberg & Conrad [ 73 ] likewise differentiate between premises and major positions. Some works further differentiate into supporting or opposing arguments [ 23 , 51 ]. Another argumentation scheme [ 26 ] divides arguments according to whether a policy is being proposed, a fact is being stated, or a value-based statement is being made.
In addition to differing concepts of argument components, the various works approach the classification process differently. While some use a sequential approach in which several subtasks (e.g., the identification of claims and the classification of claim types) are solved successively [ 44 , 45 , 50 , 51 , 73 ], others attempt to solve the segment classification in a single step [ 26 , 28 , 59 , 60 ]. Eidelman & Grom [ 23 ] are the only ones who compare the results of a flat classification using all argument types with a sequential strategy combining stance (opposition, support) classification with a more precise classification into specific stance types.
How do these approaches perform? All evaluated approaches for argument component classification in Kwon et al. [ 45 ] and Kwon & Hovy [ 44 ] perform poorly. Liebeck et al.'s [ 51 ] best approach, an SVM with unigram and grammatical features, shows encouraging results but still leads to frequent misclassifications. In claim type classification, SVMs with character embeddings and Random Forests (RF) with unigrams show good results. Promising results are also shown by Morio & Fujita [ 60 ] using Pointer Networks (PN) and their own approach PCPA, and by Eidelman & Grom [ 23 ], who reported the best performance with logistic regression and word embeddings. Nevertheless, these approaches also require further improvement. The results obtained by Fierro et al. [ 26 ] and Giannakopoulos et al. [ 28 ] are strong, although the class distribution of the data is very imbalanced. It turns out that neural networks (convolutional and recurrent layers, attention mechanisms) can outperform classical approaches and fastText on this dataset. Encouragingly, Romberg & Conrad [ 73 ] were able to show that BERT can consistently provide a very good distinction between premises and major positions across a variety of processes.
One of the problems with the interpretation of arguments from citizens' contributions is that they often lack a justification or a supporting component that substantiates the statement. A number of studies [ 33 , 64 , 67 ] concentrate on developing NLP models to classify the verifiability of propositions. The comparison of their approaches shows that although networks with Long Short-Term Memory (LSTM) exceed other approaches in identifying unverifiable propositions with high quality, the prediction of the different types of verifiability (i.e., non-experiential and verifiable experiential propositions) seems more difficult. Niculae et al. [ 62 ] and Galassi et al. [ 27 ] focus on a more comprehensive argumentation model to assess the evaluability of citizens' contributions for eRulemaking within the Cornell eRulemaking Corpus (CDCP) [ 65 ]. Promising results suggest the use of structured learning approaches, which could not be surpassed by residual networks. Falk & Lapesa [ 25 ] highlight the role of personal experiences and stories in grounding arguments in political discourse. They show that BERT models can reliably find contributions that contain such a form of justification.
Another aspect that can aid evaluators in efficiently processing contributions is to assess the concreteness of citizens' proposals, as it is easier to derive measures for implementation from more specific proposals. Looking at a transport-related spatial planning process, Romberg [ 72 ] proposes a ranking based on three levels of concreteness; the results of the best-performing method, BERT, show that the prediction of concreteness is possible but needs to be improved.

Relation Identification.
In order to understand arguments in their entirety, it is also important to investigate the relationships between the previously identified components. Most related works focus on support relations [ 18 , 27 , 48 , 62 ]. The first three were tested on the CDCP corpus, which makes their results directly comparable. Unlike the other works, Cocarascu et al. [ 18 ] trained their models on further argument mining datasets that are not from the public participation domain. This has the advantage that a larger amount of training documents is available to build the classification model. While most approaches perform weakly, the use of additional training datasets shows strong results for all models evaluated. Surprisingly, simple RF and SVM approaches can compete with deep learning models if an appropriate training dataset is used. However, the results vary considerably depending on the choice of the training dataset.
In addition to support relations, Morio & Fujita [ 59 , 60 ] define an argumentation scheme specifically for discussion thread analysis and thus for the discursive reply-to structure that can be found in (online) citizen participation. In particular, they distinguish between inner-post relations between argumentative components within a post, and inter-post relations that link two distinct posts in a discussion thread. PCPA, an algorithm specifically designed for thread structures, clearly outperforms state-of-the-art baselines for identifying such relations.

Discourse and Sentiment Analysis
Based on argumentation structures, a discussion can be analyzed for certain characteristics such as the controversy, divisiveness, popularity and centrality of discussion points. Analyzing discursive elements allows tracking of how consensus decisions develop or where great disagreement between citizens exists. This information can support an analyst in the more in-depth analysis of data and in summarizing the important points of a debate. Approaches to determine controversial points in online discussions are presented in two works: Konat et al. [ 41 ] rely on argument graphs and apply two measures of divisiveness defined on graph properties, while Cantador, Cortés-Cediel & Fernández [ 11 ] propose a theory-based metric to measure controversy. The authors' review of selected examples suggests this is a reasonable approach. To determine the centrality of discussion points, Lawrence et al. [ 48 ] apply the mathematical concept of eigenvector centrality to an argument graph. A comparison of the results with human annotation shows a strong overlap, suggesting eigenvector centrality to be a suitable way to predict centrality.
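Eigenvector centrality itself is straightforward to compute via power iteration. The following is a minimal generic sketch on a toy argument graph (the graph and variable names are invented for illustration; Lawrence et al.'s actual graphs and tooling differ):

```python
def eigenvector_centrality(adj, iterations=100):
    """Power iteration for the dominant eigenvector of A + I.
    The identity shift avoids oscillation on bipartite graphs
    while preserving the eigenvector. Higher values mean a node
    is connected to other well-connected nodes."""
    n = len(adj)
    v = [1.0] * n
    for _ in range(iterations):
        nxt = [v[i] + sum(adj[i][j] * v[j] for j in range(n))
               for i in range(n)]
        norm = max(nxt)
        v = [x / norm for x in nxt]
    return v

# A toy argument graph: node 0 is a central claim referenced by
# three replies (1, 2, 3); reply 3 has one further reply (4).
graph = [[0, 1, 1, 1, 0],
         [1, 0, 0, 0, 0],
         [1, 0, 0, 0, 0],
         [1, 0, 0, 0, 1],
         [0, 0, 0, 1, 0]]
```

On this graph the central claim receives the highest score, and the reply that itself attracted a reply scores higher than a leaf, which matches the intuition of centrality as "importance by association".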
Sentiment Analysis, also referred to as Opinion Mining, is the process of detecting and categorizing opinions in order to determine the writer's attitude regarding a certain subject. This can be relevant for the evaluation of citizens' contributions as it enables officials to get a sense not only of what the key issues are, but also of how (positively or negatively) these are perceived by citizens. Maragoudakis et al. [ 55 ] provide a general overview of existing opinion mining techniques and assess whether and how they can be transferred to analyzing citizens' contributions. They formulate a basic framework for the use of opinion mining methods in e-participation and provide recommendations for use. In addition, there are various works that develop or apply sentiment analysis methods to public participation contributions, which we summarize here and which are listed in more detail in Table 4 in the Appendix.
Research has focused on the analysis of citizens' subjective claims and of public opinion in large data collections to support rule-writers, on the impact of the sentiments in public input on a policymaking process, and on the analysis and visualization of public opinion in open-ended survey questions and free texts from e-consultations. Except for one Greek-language dataset, all works rely on English datasets from the field of civic participation and eRulemaking.
Methods have been developed to analyze public opinion on different levels of granularity (single claims, comments/contributions, or topics) and with varying tonality scales. While most articles use discrete tonality scales, e.g., a distinction into negative or positive polarity of a comment or a distinction into supporting or opposing stance towards some claim, Aitamurto et al. [ 1 ] use a continuous scale with values ranging from −1 to 1, where −1 describes a wholly negative and +1 a wholly positive attitude.
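A continuous tonality score of this kind can be illustrated with a simple lexicon-based sketch. The mini-lexicon below is invented for illustration, and Aitamurto et al. do not describe their method in these terms; real systems use much larger lexicons or learned models:

```python
# A hypothetical mini-lexicon mapping opinion words to polarities.
LEXICON = {"support": 1.0, "great": 1.0, "helpful": 0.5,
           "oppose": -1.0, "terrible": -1.0, "concerned": -0.5}

def tonality(text):
    """Average polarity of the opinion words found in `text`, on a
    continuous scale from -1 (wholly negative) to +1 (wholly
    positive); 0.0 if no opinion word occurs."""
    hits = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0
```

A discrete scale, by contrast, would map such scores onto a fixed set of classes (e.g., supporting vs. opposing).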
The best results for classifying supporting or opposing opinions, achieved by Kwon and colleagues [ 44 -46 ], come from a boosting algorithm and are almost human-like. Although it is difficult to predict whether the approach can provide similar outcomes on other datasets, the results seem promising for the automated determination of stance positions. The additional distinction of neutral opinions, on the other hand, was harder and significantly lowered the prediction quality. The approach of Soulis et al. [ 85 ] scored worse, but considering the number of sentiment classes (four) and the small training dataset, these results are likewise positive. The results of the only approach with a continuous tonality scale appear more limited.
In contrast to previous work analyzing citizens' attitudes via sentiment (from positive to negative), Jasim et al. [ 37 ] propose analyzing the emotions behind them. This was prompted by findings from interviews with human evaluators, in which a division into positive and negative attitudes was considered insufficient. Rather, the evaluators expressed a desire to learn whether citizens were excited, happy, neutral, concerned, or angry regarding an issue. In a comparison of different classification algorithms, BERT was found to perform best, predicting the five emotions very well.

DISCUSSION
Based on the presented NLP approaches, we can assess how well the three generic evaluation tasks identified in Section 4 are already supported and what issues remain to be addressed in further research.

Summary of the Current State of Research on the Evaluation of Public Participation Contributions
Much to our surprise, with DURIAN we found only one approach that has been specifically developed for (near-)duplicate detection in the domain of public participation [ 97 , 98 , 100 ]. However, the developed solution achieves good results. There is considerably more research on the task of topical grouping. Overall, the different supervised learning approaches, varying in granularity of analysis and in categorization schemes, showed moderate to good results. However, so far identifying rarely occurring categories poses a challenge to all these efforts. What is more, the usability of these supervised learning approaches is hampered by categorization schemes tailored to individual datasets and the resulting additional effort required to manually categorize a considerable amount of contributions for the training of customized classification models. According to the reviewed articles, this implies several thousand data points (e.g., sentences, arguments, or contributions). Clearly, especially for small datasets, categorization approaches that need to be trained on such large datasets are not a relief, but rather an additional burden for authorities. It should also be noted that participation processes with fewer than a thousand contributions occur regularly. As a solution, the use of active learning was proposed to keep the required amount of training data as low as possible [ 70 ], and recent work has confirmed that combining it with modern language models can meaningfully support participation processes consisting of a small number of contributions [ 75 ]. Still, a manual labeling effort is required. What is more, in active learning the classification algorithms must be constantly retrained, posing limits to the use of complex (i.e., time-consuming) models. Unsupervised models avoid the manual effort of labeling training data. Most research projects rely on the topic modeling technique LDA and have achieved some promising results. However, the studies have shown that the
quality of the resulting topic clusters strongly depends on case-specific model settings, such as the initial choice of the number of topic clusters. In the reviewed articles, parameter selection is approached either by human judgment or by using metrics, but it is understood that the model outcome needs human validation. Therefore, again, a strong involvement of the analyst is needed. A further problem in the application of topic modeling methods is rooted in the statistical model itself: Although a resulting topic model might be correct from a mathematical point of view, it does not necessarily correspond to the perception of topics by a human
evaluator. The only solution to control the emerging topics and to approximate them to those desired by the user seems to be the direct involvement of the user through interactive topic modeling [ 10 , 99 ]. For the practical application of topic modeling in the public sector, this development is very promising but in need of further research.
Most of the literature in this review focused on the automated recognition and analysis of arguments, one particular aspect of the task of in-depth analysis of contributions. Overall, although promising approaches exist for each of the three consecutive subtasks (segmentation, segment classification, and relation identification), none of them has been solved satisfactorily. Good approaches for classifying argument components have relied on PN and PCPA [ 60 ] or BERT [ 25 , 73 ]. In addition to arguments, the analysis of discourse structures as well as sentiments has already produced good results.

Research Agenda
Considering the field as a whole, since the beginnings of using NLP to support the evaluation of public participation contributions, the technical possibilities in machine learning have steadily developed. In particular, the rise of pre-trained language models (PLM) in recent years has brought an unprecedented boost. Above all, models based on the transformer architecture, such as BERT and GPT-3, have been able to achieve considerable improvements over earlier algorithms in many supervised machine learning tasks [ 93 ], including topic classification, sentiment analysis and argument mining. However, despite this encouraging development, it remains to be tested whether these successful applications are transferable to our domain. This literature review has revealed that PLMs have rarely been applied to the evaluation of citizens' input from participation processes. So far, PLMs (i.e., BERT) have only been used in grouping input by topic [ 6 , 75 ], in the analysis of arguments for the detection and classification of argument components and properties [ 25 , 72 , 73 ], for the prediction of relations between argument components [ 18 ], and for emotion analysis [ 37 ]. These initial efforts are promising but need more systematic application and evaluation. In particular, the focus should be on the development of robust PLMs that perform reliably and consistently across different participation processes. Such important properties have so far remained a challenge for the practical applicability of algorithms [ 94 ], but are essential to ensure the value of automation and thus the benefit for practitioners.
Turning to the individual tasks discussed in this study, we identify the following promising avenues for further research. Duplicate and near-duplicate detection is a well-known task in data science for which a multitude of approaches are available [e.g., 90 ], but so far these have not been studied in detail beyond the early DURIAN approach. This obvious gap is waiting to be addressed in future work. Regarding topic classification as the supervised approach to thematic grouping, more recent work has shown the benefits of PLMs. Given the trade-offs between training and automation outlined above, active learning that incorporates human feedback in the training process offers the possibility to reduce training efforts. What is more, it also has the potential to increase trust in the AI-based classification process as it brings human and machine closer together. While existing efforts seem promising [ 70 , 75 ], the field of active learning is constantly evolving, and further research efforts should benefit from these advances [ 103 ]. An alternative to active learning could be provided by the development of categorization schemes that are universally applicable to particular types of content, such as different issues that are regularly subject to participation (e.g., infrastructure planning or regulation drafting). This would allow one-time training of arbitrary classification models, which could then be used directly in practice.
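As one illustration of the standard toolbox for near-duplicate detection, contributions can be compared via the Jaccard similarity of their character shingles. This is a generic sketch, unrelated to DURIAN's internals; the threshold and shingle size are assumptions:

```python
def shingles(text, k=5):
    """Set of overlapping character k-grams, after lowercasing
    and whitespace normalization."""
    t = " ".join(text.lower().split())
    return {t[i:i + k] for i in range(max(1, len(t) - k + 1))}

def near_duplicates(contributions, threshold=0.6):
    """Return index pairs of contributions whose shingle sets have
    Jaccard similarity at or above `threshold` (pairwise; real
    systems use MinHash/LSH to avoid the quadratic comparison)."""
    sets = [shingles(c) for c in contributions]
    pairs = []
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            sim = len(sets[i] & sets[j]) / len(sets[i] | sets[j])
            if sim >= threshold:
                pairs.append((i, j))
    return pairs
```

Character shingles make the comparison robust to small edits such as punctuation or casing changes, which is exactly what distinguishes near-duplicates from exact copies.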
Research has also progressed for unsupervised machine learning tasks such as topic modeling. Since the introduction of LDA, other topic modeling approaches have been introduced, such as word-embedding-based topic models or topic modeling with BERT [ 17 ]. Again, these novel techniques offer great potential for the automatic support of the evaluation of public participation data, especially when applied in interactive settings. A starting point for this is offered by various works on the support of content analysis by human-in-the-loop topic models
in recent years that focused on user needs and perceptions [e.g., 84 ] and on technical advancements [e.g., 43 , 101 ]. What is more, only in a few articles has an attempt been made to automatically provide labels for discovered topic clusters. These efforts should be pursued in order to aid the interpretation of the output from unsupervised methods, because a label can be extremely helpful for an analyst to understand the content of individual topics faster.
Two gaps have been revealed in the research on argument mining. First, further work is needed on techniques for identifying argument components and their relationships in participation data. After all, the mining of arguments is a very complex area that has developed rapidly in recent years [e.g., 83 ]. Second, the lack of standardized argumentation models became obvious. What should be prioritized is the (theoretical) development of uniform argumentation models for citizen participation procedures. For example, Liebeck et al. [ 51 ] and Fierro et al. [ 26 ] have tailored argumentation models to informal participation procedures. These models do not necessarily have to be highly complex: simply recognizing proposals and the respective rationales can already provide great support in the evaluation. Also worth exploring further is the idea of using additional argument mining training datasets from domains other than participation processes in order to improve the classification, as has been demonstrated for relation identification [ 18 ]. Regarding discourse and sentiment analysis, the application of PLMs has so far been neglected despite its obvious potential, not least illustrated by the successful analysis of emotions in citizen contributions [ 37 ]. An open question remains whether analysts are better supported by discrete tonality classes or by continuous values and, when choosing discrete categories, how many categories the polarity spectrum should comprise. We suspect that the use of a few meaningful categories, such as agreement and disagreement, might be better suited to quickly convey the essential points of the content to the analyst.
Apart from these specific gaps, there are a number of broader challenges that exist across all evaluation tasks. A large part of the research concentrates exclusively on English-language data; there is little research that focuses on other languages. As languages differ in their syntactic and semantic properties, more coded datasets in other languages are required to apply, adapt and test existing algorithms. Currently, only a few non-English language resources are publicly available [ 2 , 51 , 76 ].
Another overarching challenge is that in order to reap the benefits of such automated procedures, it is not enough to identify suitable mechanisms and algorithms; such procedures also need to be made available in ways that allow public officials to apply them to their data. For example, as multiple reviews highlight [ 92 , 95 , 104 ], there remains a significant lack of technological expertise in the public sector and among those tasked with implementing and using the technologies reviewed here. Hence, it is necessary to provide end-user software. This review has found that only little work has been devoted to the dedicated development of tools that implement such analysis technologies. These are listed in Table 5 in the Appendix. Given that most of these tools are no longer available or supported, cover only specific aspects (e.g., language, functionalities), and are restricted either by the underlying techniques or the visualization methods, we identify a clear need for (preferably open source) applications that make these algorithmic approaches accessible to public administration. A promising step in this direction is offered by CommunityPulse [ 37 ]. However, the development of suitable solutions and their integration into the everyday work of experts poses a number of challenges, as Hagen et al. [ 35 ] highlight.

CONCLUSION
While public authorities routinely consult citizens to inform decision-making processes, these procedures come with the challenge of evaluating the contributions made by citizens. This evaluation has important consequences for the effectiveness and legitimacy of policies deriving from public participation, but it is a resource-intensive process, so far requiring substantial human effort. We have argued that AI in the form of NLP could be one possibility to support this human evaluation process and could eventually be a decisive factor in whether the public sector engages the public at all. While the use of automated procedures in decision-making processes raises normative concerns such as transparency or accountability, here we have focused on assessing the state of the technology and its potential benefits to inform the debate on these important questions.
Overall, public authorities are still largely lacking reliable tools that could be used in practice to support their work. What is more, despite the fact that NLP has seen major advances in recent years, research on computer-supported text analysis to support the evaluation of citizen contributions is sparse and dispersed across different fields and disciplines. Therefore, this study set out to take stock of this field by reviewing the strengths and weaknesses of existing approaches in order to offer guidance for further research. Despite a number of promising approaches, we established that most of them are not yet ready for practical use. It remains to be seen whether this situation improves once current state-of-the-art NLP techniques are applied more frequently to this domain.
Among the approaches that are proposed as possible solutions to the problems identified, many draw upon the expertise of humans, for example through active learning or interactive topic modeling. While this suggests that human expertise can never be fully replaced, as asserted for example by Grimmer & Stewart [ 32 ], it has yet to be established whether such approaches would eventually still require less time for evaluation than human-only evaluation. Finally, it became clear that there remains a significant lack of non-English language datasets and models, as well as of software that would allow the application of the models in practice.
Taken together, this leads us to conclude that the evaluation of citizen contributions, despite the significance outlined above, has not received the scholarly attention it deserves. We hypothesize a number of reasons for this lack of interest. First, while interest in the utility of big data for policy has been great, citizen contributions do not fit the definition of big data: although occasionally numerous, they usually remain in the hundreds or thousands. What is more, compared to traffic or sensor data, instances of public participation are sporadic rather than continuous and hence might attract less interest in automation. Second, natural language data remains highly unstandardized, which makes automatic analysis more challenging. Third, further challenges arise from the exceptionally high requirements for transparency and due process in public participation that we outlined earlier, as failures in the evaluation process, such as omitting a relevant statement, can have serious consequences that might also prevent the adoption of automation. Fourth, the lack of technology expertise and capacity in public administration is a barrier to the uptake of advanced technologies [29, 69]. At the same time, despite these difficulties, the field of government technologies has been profitable ground for technology companies and consultancies that offer their technologies to support service provision, including dealing with citizen contributions. Due to their business model, these actors have few incentives to publicly share their technologies, making it more difficult to assess the state of the field.
Although we believe that an open-source solution is preferable (e.g., to facilitate deployment in communities or countries with low budgets), the lack of access to commercial solutions is one limitation of this study. Further limitations arise from the fact that the evaluation of citizen contributions is not a clearly demarcated field. As outlined earlier, this makes it possible that our review has missed individual studies. While we have justified our focus on studies that deal with contributions from participatory processes, this has excluded research that could potentially also provide relevant insights, e.g., in relation to social media data. Consequently, further research should try to use the lessons learned from these approaches and test whether they also perform well on public participation data despite the differences in domains. Similarly, in contrast to the top-down public participation that is the focus of this article, bottom-up participation such as petitions or, more broadly, online discussions (e.g., on social media) are more difficult to incorporate into the formalized decision-making processes of public authorities. Nevertheless, increasing efforts are made to analyze such exchanges to gauge public opinion outside of such formal arenas, as these could supplement the input from consultations [see for example 7, 15, 82]. Such studies can offer further insights on how to provide additional information for decision-making. Finally, we have focused on textual data only, but contributions might also include images, audio or even video. These would also benefit from automated support and would supplement the analysis of citizen contributions, but they were beyond the scope of this review.
Supporting the evaluation of contributions in public participation with computational text analysis is an exciting area of research. Still, more work is needed to turn approaches from research into tools that prove fruitful in practice. With the rapid progress in the fields of AI, NLP, and policy analytics, these gaps can hopefully be bridged in the near future.
Digital Government: Research and Practice, Vol. 00, No. JA, Article 00. Publication date: June 2023.

APPENDICES
A OVERVIEW OF SUPERVISED APPROACHES FOR THEMATIC CLASSIFICATION
The authors state that NB and CRF were also evaluated. However, the results are not reported further.
We could not find the total number of topics or an overview of all subtopics in the article. The authors do not specify the algorithm and refer to the tool's website for further information. As this webpage is no longer available, we are unfortunately unable to provide more detailed information about the type of classifier.

- OntoCop: Yang & Callan [99]
- System to support policy-making in online communities: Klinger et al. [40]
- Information visualization tool for surveys: Soulis et al. [85]
- PIERINO (PIattaforma per l'Estrazione e il Recupero di INformazione Online): Caselli et al. [14]
- Civic CrowdAnalytics: Aitamurto et al. [1]
- System for interactive topic modeling: Cai et al. [10]
- Interactive dashboard for the analysis of social media and e-participation data: Simonofski et al. [82]
- Information extraction and visualization modules for the open source platform Consul: Arana-Catania et al. [2]
- CommunityPulse: Jasim et al. [37]

F LITERATURE DATABASE SEARCH
In order to identify those studies in the Association for Computational Linguistics Anthology that applied NLP to the relevant application area, the anthology was searched with multiple search terms: "public participation", "online participation", "political participation", "civic participation", "e-participation", "public engagement", "online engagement", "political engagement", "civic engagement", "e-engagement", "e-government", "public consultation" and "e-rulemaking". The documents of the Digital Government Reference Library were narrowed to the application area of interest by using the search terms "participation", "engagement", "consultation" and "rulemaking", and subsequently searched for studies that utilized the relevant technology by searching for the terms "natural language processing", "nlp", "text mining", "text analysis", "machine learning" and the more specific machine learning tasks "topic modeling", "document categorization", "classification", "clustering", "argument mining" and "sentiment analysis".
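The two-stage strategy used for the Digital Government Reference Library (first narrow records to the application area, then require a technology term) can be sketched as a simple keyword filter. This is a hypothetical reconstruction for illustration only; the actual searches were performed through the databases' own search interfaces, and the sample records below are invented.

```python
# Illustrative two-stage keyword filter mirroring the search strategy above.
AREA_TERMS = ["participation", "engagement", "consultation", "rulemaking"]
TECH_TERMS = [
    "natural language processing", "nlp", "text mining", "text analysis",
    "machine learning", "topic modeling", "document categorization",
    "classification", "clustering", "argument mining", "sentiment analysis",
]

def matches(record: str, terms: list[str]) -> bool:
    """Case-insensitive substring match against any of the given terms."""
    text = record.lower()
    return any(term in text for term in terms)

def filter_records(records: list[str]) -> list[str]:
    # Stage 1: narrow to the application area of interest.
    in_area = [r for r in records if matches(r, AREA_TERMS)]
    # Stage 2: require a relevant NLP / machine-learning term.
    return [r for r in in_area if matches(r, TECH_TERMS)]

records = [
    "Topic modeling of public consultation responses",
    "Smart city sensor networks for traffic management",
    "Machine learning for e-participation platforms",
]
print(filter_records(records))
# → ['Topic modeling of public consultation responses',
#    'Machine learning for e-participation platforms']
```

A real screening pass would additionally deduplicate records and match on title, abstract and keyword fields separately, but the two-stage narrowing is the core of the strategy described above.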

Fig. 2. Overview of tasks in the evaluation of public participation contributions.

Fig. 3. Approaches to grouping the data collection by topic.

Table 1. Overview of Supervised Approaches for Thematic Classification

Table 2. Overview of Topic Modeling and Clustering Approaches

Table 3. Overview of Argument Mining Approaches

Table 4. Overview of Sentiment Analysis Approaches
E OVERVIEW OF SOFTWARE SOLUTIONS FOR PRACTITIONERS

Table 5. Overview of Software Solutions for Practitioners