FeedbackMap: a tool for making sense of open-ended survey responses

Analyzing open-ended survey responses is a crucial yet challenging task for social scientists, non-profit organizations, and educational institutions, as they often face the trade-off between obtaining rich data and the burden of reading and coding textual responses. This demo introduces FeedbackMap, a web-based tool that uses natural language processing techniques to facilitate the analysis of open-ended survey responses. FeedbackMap lets researchers generate summaries at multiple levels, identify interesting response examples, and visualize the response space through embeddings. We discuss the importance of examining survey results from multiple perspectives and the potential biases introduced by summarization methods, emphasizing the need for critical evaluation of the representation and omission of respondent voices.


INTRODUCTION
Open-ended survey questions can give richer information to researchers than closed-ended questions, with lower risks of certain kinds of bias [15]. But in deciding whether to add such a question, survey makers face a trade-off between obtaining rich data and the burden of reading and coding textual responses. We seek to reduce that burden and thereby make it more compelling for surveyors to add questions with free-form textual answers. We introduce FeedbackMap, a tool that summarizes a collection of open-ended responses in multiple ways.
Our goal is a kind of multi-document summarization, which is well-studied in the context of news stories and business communication. Survey responses differ from these kinds of document collections in that they are from individuals expressing different perspectives in response to a common question. A summary of any text loses some of the nuance of the input, but with survey responses this means that individual perspectives may be erased. A survey summary may systematically prefer the majority response, or exclude certain types of responses due to hidden biases in the summarization model. Without careful evaluation, the analyst may jump to conclusions that influence actions taken on behalf of the communities they serve. As such, there are three priorities that guide the development of FeedbackMap: (1) Give the researcher multiple perspectives on the data; (2) show interactions between open-ended responses and known categorical variables; and (3) connect findings back to individual responses to the survey.

RELATED WORK

Natural language processing for qualitative analysis
Qualitative researchers use tools like NVIVO [12] to support manual open-ended analyses, but these tools are often expensive to access and do little to reduce the time burden of qualitative analysis. In recent years, researchers have turned to automated methods to gather new insights from open-ended corpora in time-efficient ways [11, 16]. However, many of the most popular and frequently used methods (like Latent Dirichlet Allocation [2]) are dated and handle data sparsity poorly, rendering them less effective at surfacing nuanced and actionable findings from text corpora. Even when such tools make use of recent and more powerful methods (like BERTopic [6]), they often require writing code to model and analyze bespoke datasets, creating barriers for those who are not trained as engineers and analysts. Startups are jumping in to fill this void, but may charge fees that make them inaccessible to community organizations with limited budgets. There appears, then, to be a gap between effectiveness, user-friendliness, and cost in the domain of open-ended survey response analysis. Researchers are beginning to design and prototype platforms to bridge this gap [5, 10], yet more work is needed to make these tools general and accessible to wider audiences while remaining flexible enough to incorporate emerging capabilities from the NLP community.

Biases in open-ended survey response analysis
Bias is inherent in any type of voluntary survey, and in particular in open-ended survey questions. A recent study by Pew Research found that women, younger adults, Hispanic and Black adults, and individuals with less formal education were less likely to answer open-ended survey questions [4], reflecting findings from prior work [1]. Of course, even when people do respond, feelings of "question threat" may contribute to biased or misleading responses [3]. Bias stems both from who responds and from how responses are analyzed. Practices like concept mapping [7] and computing intercoder reliability across multiple coders for the same content [13] seek to account for researcher biases in the analysis process. FeedbackMap's multiple data presentations seek to mitigate biases that may be reinforced through a single analysis frame, yet the extent to which this succeeds, and whether it can help address response biases further upstream in the feedback-giving process, remain important open questions for our work.

SYSTEM OVERVIEW
FeedbackMap is a Web application implemented using the Streamlit framework [17]. Organizations that host the tool can customize the language models (LMs) used for writing summaries and computing embeddings. LMs may be API-based, such as the GPT-3 model from the OpenAI API [14] (the choice for summaries in our publicly deployed instance of the tool), or a local transformer-based model from PyTorch-transformers [18]. Other customizations include the types of summarization available to the end user and their corresponding LM prompts. The tabs shown to the end user as they use the Web application are described below.
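To make the customization points concrete, a hosting organization's configuration might look like the following sketch. The option names, file format, and model identifiers here are hypothetical; the paper does not specify FeedbackMap's actual configuration schema.

```python
# Hypothetical configuration sketch; the actual FeedbackMap option
# names and format are not specified in this paper.
CONFIG = {
    "summarization_lm": {
        "backend": "openai",          # API-based LM, e.g. GPT-3 [14]
        "model": "text-davinci-003",  # illustrative model name
    },
    "embedding_lm": {
        "backend": "local",           # local transformer-based model
        "model": "all-MiniLM-L6-v2",  # illustrative model name
    },
    "prompts": {  # instructional portions appended to sampled responses
        "top_level": "Briefly, what do these responses have in common?",
        "examples": "What are 3 interesting responses and why?",
    },
    "sample_threshold": 5000,  # rows before random sampling kicks in
}
```

A configuration of this shape would let each deployment swap summarization and embedding backends independently, which matches the paper's distinction between API-based and local models.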

Welcome tab
The Welcome tab invites the researcher to upload a comma-separated value (CSV) file containing the results of their survey. The tool is designed to work best with the format produced by Google Forms, but it will try to accommodate any CSV or JSONL-formatted file. While the input file may be arbitrarily large, the records will be randomly sampled for analysis if they exceed a threshold (5,000 rows, in the case of our publicly deployed instance).
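The sampling step above can be sketched in a few lines; the function name and seeding behavior are illustrative, not FeedbackMap's actual implementation.

```python
import random

def maybe_sample(rows, limit=5000, seed=0):
    """Cap the number of records analyzed: if the uploaded file
    exceeds the threshold, take a uniform random sample of rows."""
    if len(rows) <= limit:
        return rows
    return random.Random(seed).sample(rows, limit)
```

Fixing a seed (as sketched here) would make repeated analyses of the same file reproducible, though the tool could equally resample on each run.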

Summary tab
The Summary tab gives an overview of the selected data file and asks the researcher to pick one of the open-ended questions in the data to analyze. Here, by "open-ended" we mean that the survey asked for a free-form textual response. While we can't be certain of the question type from a schema-less CSV file, in practice it's usually easy to tell: FeedbackMap infers which columns of data are open-ended and which are categorical by analyzing the distribution of answer values for each question. The open-ended questions are shown alongside the rate of nonempty responses for the question, while the categorical questions are shown alongside information about the observed value distribution. For categorical questions identified as multi-select, each value is counted separately. Menus next to the categorical questions let the user constrain the analysis (for example, to consider only the responses from one U.S. state) prior to clicking on an open-ended question.
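One simple way to infer column types from the answer-value distribution is to treat columns with few distinct values as categorical and the rest as open-ended. The thresholds below are hypothetical; the paper does not disclose FeedbackMap's actual heuristic.

```python
def infer_column_type(values, max_unique=20, max_unique_ratio=0.2):
    """Guess whether a survey column is 'categorical' or 'open-ended'
    from the distribution of its non-empty answer values.
    Thresholds are illustrative, not FeedbackMap's actual settings."""
    nonempty = [v.strip() for v in values if v and v.strip()]
    if not nonempty:
        return "categorical"
    distinct = len(set(nonempty))
    # Few distinct values, absolutely or relative to response count,
    # suggests a closed-ended (categorical) question.
    if distinct <= max_unique or distinct / len(nonempty) <= max_unique_ratio:
        return "categorical"
    return "open-ended"
```

A heuristic of this shape naturally yields both signals the Summary tab displays: the nonempty-response rate for open-ended columns and the value distribution for categorical ones.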
Figure 1 shows an excerpt of a synthetic survey data file, and the corresponding Summary tab that FeedbackMap displays for the file.

Analysis tab
The Analysis tab contains the key offerings of the tool, letting the user explore the open-ended responses to a question of interest in various ways, each contained within a collapsible subsection. Figure 2 shows the subsections of the Analysis tab for our hypothetical input file. We discuss these in more detail below.

3.3.1 Top-level abstractive summary. This section shows a top-level summary of the responses to the selected open-ended question. It is generated using an LM and prompt that may be controlled in the code configuration. The prompt to the LM is a random sample of responses (with the sample size chosen to maximize the use of the LM's input context window), followed by an instructional statement such as the default, "Briefly, what do these responses have in common?"

3.3.2 Topic scatterplot. This section shows a two-dimensional interactive scatterplot that organizes the individual responses by topic, such that responses about similar topics are close to each other. Users may read the individual responses by hovering over one of the points. This section uses a familiar pipeline of techniques (used by, e.g., BERTopic [6]) to go from the raw text to the scatterplot: namely, it computes an embedding for each response according to a sentence embedding model, and then projects the set to two dimensions. By default, the points are clustered and color-coded according to the result of a clustering algorithm applied to these same embeddings. We use UMAP [9] for projection and HDBSCAN [8] for clustering. The user may choose to override this "auto-clustering" and instead group by one of the categorical variables in the survey data. Labels for the clusters are determined by terms in the responses that have high pointwise mutual information with respect to the other clusters.
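The cluster-labeling step can be sketched as follows: given responses and their cluster assignments (which FeedbackMap obtains from HDBSCAN over the sentence embeddings), score each term by the pointwise mutual information between term occurrence and cluster membership. The whitespace tokenization and tie-breaking here are illustrative assumptions, not the tool's exact procedure.

```python
import math
from collections import Counter

def label_clusters(responses, labels, top_k=3):
    """Label each cluster with the terms whose occurrence has the
    highest pointwise mutual information (PMI) with cluster
    membership.  Illustrative sketch with naive tokenization."""
    total = len(responses)
    term_counts = Counter()   # documents containing each term, overall
    cluster_terms = {}        # per-cluster document frequency of terms
    cluster_sizes = Counter(labels)
    for text, lab in zip(responses, labels):
        terms = set(text.lower().split())
        term_counts.update(terms)
        cluster_terms.setdefault(lab, Counter()).update(terms)
    result = {}
    for lab, counts in cluster_terms.items():
        def pmi(term):
            # PMI = log P(term, cluster) / (P(term) * P(cluster))
            p_joint = counts[term] / total
            p_term = term_counts[term] / total
            p_cluster = cluster_sizes[lab] / total
            return math.log(p_joint / (p_term * p_cluster))
        result[lab] = sorted(counts, key=pmi, reverse=True)[:top_k]
    return result
```

Terms that appear across many clusters (like stopwords) get a PMI near zero, so cluster-distinctive vocabulary rises to the top.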

3.3.3 Interesting examples. Here FeedbackMap uses the LM to ask for noteworthy responses. As in the top-level summary, this happens by prompting the LM with a random selection of responses, followed by a fixed instructional prompt. The default instructional portion is "What are 3 interesting responses and why?", which elicits rationales for the LM's choices. A "Pick again" prompt allows the user to rerun the generation based on a new random sample.

3.3.4 Cluster summaries. This section shows an LM-generated summary for each cluster, color-coded to match the scatterplot.
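The prompt-assembly pattern shared by the top-level summary and the interesting-examples feature, a random sample of responses followed by an instructional statement, can be sketched as below. The character budget stands in for the LM's context window; the function name, formatting, and budget are assumptions for illustration.

```python
import random

def build_prompt(responses, instruction, max_chars=8000, seed=None):
    """Assemble an LM prompt: a random sample of responses that fits
    a character budget (a crude stand-in for the model's context
    window), followed by the instructional statement."""
    pool = list(responses)
    random.Random(seed).shuffle(pool)
    lines, used = [], 0
    for r in pool:
        cost = len(r) + 3  # "- " prefix plus newline
        if used + cost > max_chars:
            break
        lines.append("- " + r)
        used += cost
    return "\n".join(lines) + "\n\n" + instruction
```

Re-invoking this with a fresh seed corresponds to the "Pick again" behavior: a new random sample yields a new generation.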
3.3.5 Top words and phrases. We identify words and collocations (multi-word terms) that are frequent in the data and show how they interact with the selected category in tabular form. Columns of the table correspond to categorical values and can be sorted to surface terms that are highly associated with the category.
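A minimal version of this cross-tabulation, counting unigrams and bigram collocations per category value, might look like the following. The tokenization and frequency cutoff are illustrative assumptions; FeedbackMap's collocation detection is not detailed in the paper.

```python
from collections import Counter, defaultdict

def term_category_table(responses, categories, min_count=2):
    """Cross-tabulate frequent unigrams and bigram collocations
    against a categorical variable, one column per category value.
    Illustrative sketch with naive whitespace tokenization."""
    def terms(text):
        toks = text.lower().split()
        return toks + [" ".join(b) for b in zip(toks, toks[1:])]
    overall = Counter()             # document frequency of each term
    by_cat = defaultdict(Counter)   # document frequency per category
    for text, cat in zip(responses, categories):
        ts = set(terms(text))
        overall.update(ts)
        by_cat[cat].update(ts)
    return {term: {cat: by_cat[cat][term] for cat in by_cat}
            for term, n in overall.items() if n >= min_count}
```

Sorting the resulting table by a category's column surfaces terms concentrated in that category, which is the interaction the tab is designed to expose.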

DISCUSSION AND EVALUATION PLAN
Since deploying FeedbackMap we have shared it with three users: a school district administrator, a non-profit founder, and a political scientist, all of whom had suitable datasets for it. Reactions were positive and varied with respect to which features were seen as most useful, with the top-level and category-specific summaries being highlights for all three users. We hope to run a more rigorous field study of the tool with these users and others, pending IRB approval. We also plan to measure the impact of survey summarization on users with respect to the kinds of knowledge it imparts and conceals about the underlying dataset.

Figure 1: An excerpt of the input file (left) and Summary tab (right) for our synthetic survey response data. The dataset consists of 1,020 GPT-4-generated responses to questions about age, state of residence, favorite meal, and the current weather, among others. The script to generate this data and its output are available in the demo repository.

Figure 2: Some elements of the Analysis tab when the user selects the question "What is your favorite meal?"