How Do Data Analysts Respond to AI Assistance? A Wizard-of-Oz Study

Data analysis is challenging as analysts must navigate nuanced decisions that may yield divergent conclusions. AI assistants have the potential to support analysts in planning their analyses, enabling more robust decision making. Though AI-based assistants that target code execution (e.g., Github Copilot) have received significant attention, limited research addresses assistance for both analysis execution and planning. In this work, we characterize helpful planning suggestions and their impacts on analysts' workflows. We first review the analysis planning literature and crowd-sourced analysis studies to categorize suggestion content. We then conduct a Wizard-of-Oz study (n=13) to observe analysts' preferences and reactions to planning assistance in a realistic scenario. Our findings highlight subtleties in contextual factors that impact suggestion helpfulness, emphasizing design implications for supporting different abstractions of assistance, forms of initiative, increased engagement, and alignment of goals between analysts and assistants.

Figure 1: Execution and planning assistance during data analysis.During an ongoing analysis, analysts may be focusing on the execution of an analysis decision (left) or planning their next decisions (right).During execution, analysts have a clear intent of what they want to do (e.g., specify a model formula).Meanwhile, during planning, analysts are reasoning about potential decisions they are considering ( ).Existing systems powered by Large Language Models (e.g., ChatGPT and Github Copilot), focus on providing execution assistance, helping the analyst carry out a decision (A).However, analysts can also benefit from planning assistance.Planning assistance helps analysts reason about the analysis decisions.This can occur as analysts are explicitly planning their decisions (B) or even during execution when analysts are unaware of plausible alternatives (C).more robust decision making.Though AI-based assistants that target code execution (e.g., Github Copilot) have received significant attention, limited research addresses assistance for both analysis execution and planning.In this work, we characterize helpful planning suggestions and their impacts on analysts' workflows.We first review the analysis planning literature and crowd-sourced analysis studies to categorize suggestion content.We then conduct a Wizardof-Oz study (n=13) to observe analysts' preferences and reactions to planning assistance in a realistic scenario.Our findings highlight subtleties in contextual factors that impact suggestion helpfulness, emphasizing design implications for supporting different abstractions of assistance, forms of initiative, increased engagement, and alignment of goals between analysts and assistants.

INTRODUCTION
Data analysis requires analysts to make numerous decisions regarding varied tasks, such as data collection, data wrangling, statistical modeling, and inference [79].Each choice involves analysis planning, wherein analysts reason through the effects of potential decisions by synthesizing their results, prior experiences, and domain and statistical knowledge.Decisions made, their focus shifts to enacting these choices using the correct code with appropriate software tools, which we refer to as analysis execution.After examining the outcome of their execution, the process repeats.
Inadequate analysis planning can complicate this flow and can cause analysts to risk making arbitrary and ill-founded decisions [108,109] for a host of reasons.Analysts often struggle to identify and address relevant decision points [78,79].They may overfocus on low-level details, such as tuning hyperparameters, or miss highlevel considerations like alternative statistical or mental models [78].They often center execution decisions tied to familiar workflows when approaches using other tools may be more applicable [57,78].Further, the unconsidered flexibility in decision-making (and the biases inherent therein) contributes to the scientific reproducibility crisis [1,9,31].For instance, given the same analysis task and dataset, analysts often derive divergent conclusions [12,15,18,35,86,103,107].
Data analysis assistants seek to bridge these gaps by helping analysts execute and plan their analyses.AI-based execution assistants (such as Copilot [42]) help analysts implement decisions (Fig. 1A) by suggesting code or tools.In contrast, planning assistance may help them reason about their decisions (Fig. 1B and C), articulate hypotheses and mental models, and identify overlooked alternative decisions or rationales (Fig. 1 ).Systematically considering alternatives can deepen analysts' understanding of the outcome variations latent to decision-making in data analysis [81,108,109].Therefore, planning assistance can be a transformative step towards more principled, reliable, and robust analyses.
The improved capabilities of large language models (LLMs) for coding, language, and visualization generation [19,20,27,29,34]   offer an opportunity to make such planning and execution data analysis assistants a reality.Prior LLM-based assistants have explored practical approaches to execution assistance [11,48,55,85,112,128], however planning remains unexplored (Fig. 2 bottom row).Effective planning suggestions can not be achieved by simply feeding the current analysis context into an LLM as it requires selective context, higher-level scaffolding, targeted prompts, and sensitive triggers.The iterative nature of data analysis demands the nuanced interpretation of prior results, statistics, and domain knowledge to inform the subsequent decisions [17,46,57,111], which is typically unaccounted for in execution-only assistants (Fig. 2A and B).Synthesis of such decisions may require backtracking to a prior choice and deviating from the current approach (Fig. 1).Authoring effective prompts can be difficult [87,125], and the right context to provide to not cause hallucinatory analyses [84] may be hard to identify.Finally, data analysis is a non-linear activity that involves iterated and inter-woven actions of typing, reading, and reflecting-making the optimal timing of suggestions less clear than in software engineering.
Yet, essential questions about the design of planning assistants remain unanswered.In this work, we explore the design of such AI-based assistants.To this end, we consider the following research questions: RQ 1 -Scope: What types of specific planning assistance content could an assistant provide?RQ 2 -Helpfulness: What data analysis suggestions do analysts find helpful and under what circumstances?RQ 3 -Impact: How do suggestions impact analysts' workflows?To what extent do analysts prioritize their own ideas or adopt suggestions and explore alternative approaches?
To characterize assistant suggestion content (RQ 1), we conducted a literature review and critically assessed data from a crowdsourced analysis study [107], wherein 29 analyst teams separately analyzed the same dataset to answer the same research question (Sec.3).We used these findings to develop categories of assistance content across points in the end-to-end analysis process (Table 1).
To observe analysts' preferences for (RQ2) and effects of (RQ3) different suggestions, we conducted a Wizard of Oz study, with 13 practicing data analysts (Sec.4)-our first contribution 1 .We designed the study around the scenario considered in the crowdsourced study [107] that we looked at in our previous step.Our participants worked through the same dataset and task while receiving suggestions from a behind-the-scenes experimenter acting as a wizard.To facilitate our study, we developed a JupyterLab extension that allowed the wizard to surface suggestions to the participant as they used the notebook in a familiar manner 2 .This allowed the wizard (aided by LLMs) to synthesize suggestions and manipulate the selection, quality, and timing of recommendations (Fig. 3).
We synthesize the findings of our study (Sec.5) into a set of design guidelines (Sec.6) to align analysis intent and goals, support varying levels of suggestion assistance, and increase analyst understanding and engagement with the assistant's suggestions-our second contribution.We observed that the data analysis process can benefit from diverse forms of planning assistance, but only with careful consideration of assistance context.Additional factors-such as the timing of suggestions or variations in analysts' statistical and domain backgrounds-played a critical role in a suggestion's perceived helpfulness.Finally, we found differences in goals between the analyst (i.e., speed and completion) and planning assistance (i.e., methodical and robust planning).
By characterizing the usability of planning-equipped analysis assistants, we seek to make data analysis more robust and reliable by enabling richer and more useful assistants.

BACKGROUND AND RELATED WORK 2.1 Planning in Data Analysis
Analysts often engage in planning activities [4].Such activities are closely related to cognitive theories of sensemaking, whereby analysts seek and integrate new observations to build mental models [67,94,101] that inform their decisions for collecting and analyzing data to elicit findings; these findings, in turn, inform their mental models in an iterative back-and-forth.Thus, data analysis planning can be highly iterative and often requires revisiting decisions at various points in the analysis [17,57,111].Moreover, these processes are integral to reduce human biases in decision-making and help form robust conclusions [46].In this work, we study how sensemaking activities in analysis planning can be improved with an AI assistant.
Similarly, analysis planning relates closely to multiverse analysis [108,109], a statistical analysis paradigm whereby analysts consider, specify, and report all reasonable decisions and combinations of decisions.Through assessing potentially thousands of analyses based on these choices, a multiverse approach highlights how different decisions can influence final outcomes.Despite the importance of analysis planning, empirical studies observed analysts struggle to reason about their analysis decisions [78,79].Often, they are unaware of potential alternative decisions in the end-to-end data analysis process.Our work aims to guide analysts in recognizing, understanding, and reflecting on these overlooked but important decisions.
Prior work developed systems to simplify the planning of statistical models and statistical tests [58,59].Data-driven systems (like Lodestar [96], EDAssistant [72], and ATENA [37]) add structure to analyses and offer recommendations for data science and exploratory data analysis (EDA) workflows.However, these systems focused on specific parts of the analysis process (e.g., modeling or EDA) or based their recommendations on predicting the next steps in an analysis.In contrast, we emphasize analysis planning for the entire end-to-end analysis and as a highly iterative process.
Specifically, we aim to support analysts in considering alternative decisions and revisiting prior decisions as new insights emerge.We view LLM-supported data analysis assistants as a promising solution.Furthermore, these assistants could be integrated with existing multiverse analysis systems [47,81,102].Thus, as planning assistants help analysts consider and structure alternative decisions, multiverse tools help them author, execute, and evaluate these decisions within a "multiverse" of alternative analyses.
For these benefits to be realized, we must learn what specific feedback analysts need and under what circumstances assistance would be most useful to them.To begin exploring factors that influence favorable suggestions, we first categorize suggestion content (Sec.3).Using this categorization, we develop a prototype interface and conduct a Wizard of Oz study (Sec.4) to investigate the helpfulness and impact of such suggestions.

LLM-based Assistants and Limitations for Planning Assistance
LLMs trained on large-scale code corpora [27,29,41,73,89] have shown proficiency in learning programming concepts, solving programming challenges [27], and covering various domains like web development [126] and data science problems [25,71].While these LLMs excel at precise code implementation tasks (e.g., Consider a (6,7,8) shape array, what is the index (x,y,z) of the 100th element?[71]), this work focuses on situations where the problem statement and methods to apply are less clear, unspecified by the user, or at a higher level of abstraction.LLMs are trained to predict the next token in a code/text sequence and thus form the back-end of many popular code assistants that offer execution assistance [5,42,44,88].Note that we distinguish between an LLM and an assistant; the latter is a complete system that acts as the interface between the user and the LLM.Existing code assistants typically pass the programmer's context (e.g., code, comments, etc.) to the LLM to then recommend entire statements or blocks of code.In this interaction, programmers can either initiate execution assistance with explicit comments and triggers (Fig. 2A) or passively let the assistant auto-complete their code (Fig. 2B).However, as discussed earlier (Sec.1), assistants offering data planning assistance cannot naively pass the current context to an LLM given the iterative and non-linear nature of analysis planning.This work explores the design of an LLM-supported complete system for planning assistance.
More recently, general-purpose LLMs, such as InstructGPT [92] and , have reached a level of performance that is now deployable in various contexts and domains [20].The assistant ChatGPT [90], supported by LLMs, has exploded in popularity [22].In the context of data analysis, ChatGPT can offer basic planning assistance, integrating broader domain expertise, software libraries, and statistical knowledge with code when given precise instructions (e.g., Show me the correlation between Height and Weight in a scatter plot) [28].
However, ChatGPT requires users to actively prompt it for information (Fig. 2C).Prior work found that non-AI experts often struggle to write effective prompts [125] and experience significant cognitive load [54].Data analysis planning can further exacerbate this issue since analysts must synthesize information from various points in the analysis to pass to the assistant.Thus, the burden is on the analyst to decide what they want, where to find relevant context in a likely messy analysis [49,63,100], and craft a prompt or series of prompts.In addition, since analysts are often unaware of the entire space of analytical decisions, they may also need planning assistance they do not directly ask for.However, current assistants like ChatGPT do not naturally support assistantinitiated planning assistance (Fig. 2D).Further, since analysts may be alternating between planning and execution, the optimal timing of planning assistance is unclear.
Our work aims to understand the design of data analysis assistants that involve all forms of assistance and initiation.Since planning assistance presents new and largely unexplored challenges, we focus our efforts here.Notably, we monitor analysts' reactions to an assistant that offers new planning information that possibly differs from their current analysis context (i.e., analysis code and notes).In accomplishing this, we also explore an assistant that raises suggestions without direct invocation from the analyst.

Designing for Human-AI Interaction
In general, designing AI-based user-facing systems is challenging since designs must face issues that include explainability [65,75], trust [69,76,110], user control [50,105], and user expectations [68,82].As a result, there is a rich history of literature on design guidelines for Human-AI interaction to address these challenges [6,50].Similar practitioner-facing guidelines have also been released in large companies [8,43,52].
Though these resources address common challenges in Human-AI interaction design, designing a system still requires support for domain-specific examples and proper problem formulation [123].In other words, these resources do not address the specific Human-AI interaction challenges surrounding the design of data analysis assistants (e.g., the timing of assistant-initiated planning assistance).Our work focuses exclusively on identifying precise goals and problems for designing analysis assistants.Based on findings from a Wizard of Oz study, we propose design guidelines specifically for developing these assistants (Table 3).
With respect to designing analysis assistants, McNutt et al. [85] studied computational notebooks as a medium to provide AI-based execution assistance for data scientists.They conducted an interview study and presented participants with slide-based prototypes of interfaces.In contrast, our work concentrates on the open challenges in understanding the scope of (Sec.3) and developing for (Sec.6) planning assistance.Instead of focusing on the medium where the assistant resides, we focus on identifying factors that make assistance helpful.

CATEGORIZING SUGGESTION CONTENT
To identify the circumstances in which planning suggestions are helpful (RQ 2), we first establish categories of relevant suggestion content (RQ 1).In our subsequent Wizard of Oz study (Sec.4), we use this categorization to derive suggestions and observe analysts' reactions and preferences.We conducted a literature review to identify existing challenges data analysts face.We then examined results from a crowd-sourced analysis study, where multiple independent analysts made decisions given the same data and research question.This helped us understand variations in decision-making, determine points where alternative approaches would be plausible, and observe what these approaches might look like.

Literature Review
We began by identifying works that study the data analysis process and present opportunities for assistance.We searched Google Scholar with the keywords "sensemaking data analysis" from which we iteratively snowball sampled [117].We sought papers that discussed the end-to-end data analysis process and challenges suitable for a text-based assistant.Therefore, we did not focus on problems with specific systems [3,106] or address common analysis challenges related to software and debugging [26,47,62], communication [115,127], and data provenance [63,85,121].

Behavior-Driven Analysis of a Crowd-Sourced Data Analysis Study
In crowd-sourced data analysis studies [12,15,18,35,86,103,107], numerous independent teams of analysts conduct analyses answering the same research question with the same dataset.Though originally intended by authors to study variations in analytical choices and how they impact analysis conclusions, these studies also offer high-fidelity data about code and rationales behind analysts' decisions.

Operationalizing Constructs
Skin Tone Operationalization Right now, skin tone rating is represented by rater1 and rater2.The rater1 and rater2 variables could be averaged and rounded to the most central value.We could also create a binary variable based on a rating threshold.This way, we could work with only one variable for player skin tone.It is up to you to decide which decision to make to ensure the validity and reliability of your results.This code shows a template for representing skin tone...

Choosing the Statistical Model
Using a Poisson Model Here, we want to investigate the relationship between a player's skin tone and the number of red cards they receive from referees.Since the outcome variable, i.e., the number of red cards received by a player is a count variable (i.e., it is a non-negative integer), it might make sense to consider a Poisson model to analyze this relationship...

Results Interpretation
The regression suggests that skin tone and the number of goals are significant predictors red cards.The R-squared value of 0.005 indicates that the model explains only a small proportion of the variance in the redCards variable...

High-Level Planning Consider Including Covariates
Covariates are additional variables that can affect the relationship between the predictor variable and the outcome variable in an analysis.To accurately understand the relationship between the predictor and outcome variable, it is important to consider the effects of these covariates.Including covariates in an analysis can help to control for potential confounding variables...

Execution Assistance
Answering: Is the number of red cards associated with some unique referee ID?
Table 1: Categories of suggestions informed by a literature review and a behavior-driven analysis of a crowd-sourced data analysis study.With the exception of high-level planning and execution assistance, the dimensions correspond to assistance during the stages in an end-to-end analysis.High-level planning involves more general considerations.
In this work, we chose a crowd-sourced study [107] that addresses the research question: Are soccer players with a dark skin tone more likely than those with a light skin tone to receive red cards from referees?The study provides a real-world dataset for answering the research question and includes analyses from 29 teams of analysts.The question and dataset are of moderate practical difficulty and complexity, encouraging different but reasonable analytical approaches.Furthermore, the domain of the study, soccer, is approachable to general audiences, reducing the likelihood of substantial differences in analysts' domain expertise.
We analyzed the soccer dataset for alternative approaches to an analysis decision at any point in the analysis process (e.g., data wrangling, statistical modeling, inference, etc.).To do so, we looked for decisions with high variance between analysis teams (e.g., choosing specific covariates).From these decisions, we observed approaches (e.g., choosing player position as a covariate) that could help introduce other analysts to alternative approaches and mental models.These decisions informed categories of suggestions that aid in the various stages of the end-to-end analysis process.
These categories both corroborate and make more specific those outlined in the literature review.For instance, guidance for statistical modeling from the literature review includes two aspects we uncovered in the crowd-sourced study: conceptual model formulation and operationalizing constructs.Likewise, alternative decisions captures both high-level planning (which helps analysts consider broader analysis decisions) and more specific planning decisions better summarized by the other categories.From studying independent analyses of the crowd-sourced data, new categories, such as data wrangling and model results interpretation, also became apparent.They are unaware of the existence of a human wizard, located in a separate room, operating the assistant backend (right).The wizard uses a general purpose LLM and integrates information prepared before the study.The wizard observes the analyst's notebook for opportunities for assistance and obtains the relevant notebook context (A).Next, the wizard either retrieves a prewritten suggestion developed prior to the study (B) or prompts the LLM (C) to generate a relevant suggestion (D).The wizard may further refine the suggestion (e.g., shorten its length) with additional prompting of the LLM (C-D Loop).Finally, after creating the suggestion, the wizard presents the suggestion back to the analyst (E).In this process, the wizard acts as the interface between the analyst and the LLM, deciding on the content and timing of the suggestion.

Results
A summary of our suggestion categorizations is shown in Table 1, which includes examples we later developed for our Wizard of Oz study (Sec. 4).Full examples of suggestions in each category are also in the appendix.The literature review spanned domains and tasks, covering broad categories of assistance (e.g., guidance for statistical modeling).Examination of the crowd-sourced study then helped facilitate more specific and new categories of help aimed at critical steps in the end-to-end analysis process (e.g., choosing the statistical model).All categories except execution assistance are intended as planning assistance, helping the analyst reason about and consider alternative analysis decisions.Likewise, while the categories may not cover every possible helpful suggestion, based on our research, we expect high coverage.
Overall, these categories encapsulate the collective approaches and rationale of expert-written analyses.Thus, they are likely to be helpful for analysts working on a variety of research problems.However, it is unclear to what degree suggestions from these categories would be helpful and what factors influence this judgment.
To answer this, we conducted a Wizard of Oz study.

WIZARD OF OZ STUDY
We conducted an exploratory lab study using a Wizard of Oz protocol [33,83] to understand the types of suggestions analysts prefer and how these suggestions might affect their workflow.In our study, participants interacted with a data analysis assistant housed in a customized JupyterLab interface which, unbeknownst to them, was controlled by a human "wizard".Our wizard (Fig. 3) simulated an LLM-based data analysis assistant.The wizard managed the LLM interaction manually, drawing from a predefined list of answers to make planning and execution suggestions while participants carried out a data analysis task.
Participants.We recruited participants who were already familiar with statistical analysis and programming experience.We contacted potential participants through analysis-related mailing lists at our institution.From a pool of 60 volunteers, we invited those who rated their programming and statistics proficiency 4 out of 5 or greater and were familiar with computational notebooks, selecting a final 13 on a first come basis (Table 2).We chose not to consider novices, as they would be too limited by execution challenges [62,78] to be able to benefit from our analysis planning.Since even trained analysts face difficulties in analysis planning [61], supporting those with expertise is a crucial step in supporting everyone.Participants received a $50 gift card as compensation.We denote participants by anonymous identifiers, like AX, and quote them, "like so".
Procedure.We conducted our study in a lab on a MacBook Pro on a 27-inch monitor.One author, in the lab with participants, served as the coordinator.Another author, the wizard, participated from a physically separate room over Zoom (which was also used to record the session with consent) under a pseudonym.Participants were unaware of the wizard and that the wizard was able to hear and see their screen.
The study lasted roughly 2 hours and consisted of three phases, consisting of a tutorial (∼15 minutes), the primary task detailed below (∼75 minutes), and a semi-structured interview (∼30 minutes).Table 2: Participant Information.Prior AI code assistant experience was gathered from our study interviews.
In the last of these phases, the coordinator asked participants to review each suggestion they saw, give a positive, neutral, or negative reaction to whether the suggestion was helpful, and explain why (Fig. 5).We then asked about the impact the assistant's recommendations had on the participant's process, what participants valued from the assistant, and what sorts of actions or information they would have liked the assistant to provide.At the end of the study, we revealed how the assistant actually worked.Prior to the study, participants were asked about their preferred programming language and the analysis libraries.We created a notebook for each participant with their preferred language and tools.We wanted to ensure that the lab environment was as familiar as possible.
The study coordinator took notes throughout the analysis phase and the semi-structured interview.One author viewed the recordings, transcribed relevant episodes, and logged whether suggestions were included in participants' working analyses.To define common themes that emerged, two of the authors conducted iterative open coding on the recordings.
Task.The bulk of the study consisted of a data analysis tasknamely the task conducted in the crowd-sourced study3 considered in Sec.3-in the context of a JupyterLab notebook customized to have a planning equipped assistant (Fig. 4).Participants were asked to imagine they were leading a research team that had collected the dataset, following a similar methodology from Jun et al. [57].To prime participants, we told them that their analysis results would impact a major policy decision-namely whether the soccer league invests money in bias training.We stressed the importance of having reasonable rationales for their analysis decisions and that their decisions must be robust against alternative assumptions.Likewise, to encourage comfort with the lab environment, we mentioned that we were not interested in the completion of the analysis but in their analysis process.
Once participants were familiar with the task, the coordinator toggled on the assistant and introduced how its suggestions were raised.We described what the system can do (e.g., make suggestions based on the notebook context), being careful not to anthropomorphize the assistant to facilitate accurate evaluations [64].We encouraged participants to think aloud or document their process in a notebook.They were free to use any external resources.
Assistant Design.The design of our assistant closely followed recent explorations for assistants in notebooks [85] and is shown in Fig. 4. We focused on notebooks because they are an extremely common medium for data analysis [100,104].We place our assistant within a side panel which affords an apparently global perspective across the notebook.This scope helpfully implies that the suggestions integrate information from across the notebook.By keeping the suggestions separate from the participant's working notebook, we minimize any interference with their code, notes, and overall process.As in other triggerless systems [53], participants received suggestions from the system without needing to take specific action.Given this design, there is no direct UI element for controlling the generated topics or to refine suggestions (similarly to Copilot's tab view).Also like Copilot, participants could influence the assistant by writing specific comments in their notebook.This comment writing behavior was not described in our introduction but emerged naturally across multiple participants.As suggestions from the wizard began to reference participant comments as the relevant notebook context, some participants took notice and wrote additional comments, expecting assistant help.
Raising and Creating Suggestions.When and what suggestions the wizard raised were guided by several principles.The wizard raised suggestions when they were relevant to the notebook, with priority given to the most recent additions [6].When possible, the wizard offered planning suggestions, but execution assistance was provided when it was deemed required.Examples include when participants were stuck debugging outputs or wrote notebook comments asking for specific help (Sec.5.5).The wizard attempted to strike a balance between a rich diversity of suggestions and not overwhelming the participant with excessive feedback.Finally, we simulated an assistant that could learn over time what suggestions the analyst found helpful-following common Human-AI guidelines [6].To wit, the categories of suggestions we raised depended When a new suggestion occurs, the side panel shows a loading animation of variable duration to call attention to it; variability in duration aims to indicate that the system is working and simulate a real system computing and generating a suggestion [40,98].New suggestions also have a yellow highlight around them for greater visibility (D).Likewise, each suggestion has a corresponding context (D1) in the notebook, which is automatically scrolled into view when the suggestion is clicked (D2).Analysts can provide feedback for each suggestion via the thumbs-up and thumbs-down buttons (E).Finally, analysts can copy recommended code using the copy button (F).Note that the width of the side panel can be adjusted and is typically smaller than is pictured here for illustration.
on whether the suggestions already given were considered or taken.If the participant disregarded many suggestions that addressed alternative variables, we raised these suggestions less often.
Creating suggestions followed the analysis conducted in the previous section (Sec.3).For categories such as domain background and operationalizing constructs (Table 1), the same or very similar suggestions could be given to different participants.To speed up the assistant's response time, we prepared a spreadsheet of suggestions (based on Table 1) in these categories before the study for the wizard to draw from (Fig. 3B).We created 32 suggestions across these categories.We deliberately crafted all suggestions with pertinent explanations and relevant domain and statistical knowledge.For most suggestions, we also included example execution assistance code to help participants grasp and implement the given recommendation.Though these planning suggestions featured execution assistance elements, their primary intent was to aid in analysis planning.
For categories execution assistance or model results interpretation (Table 1), suggestions were closely tied to the context of the working analysis which precluded previously created suggestions.For these categories, we leveraged ChatGPT [90] and constructed prompts to build suggestions.Prior to the study, we created a general prompt introducing the task and dataset.In addition to this general prompt, the wizard, who prepared the spreadsheet of suggestions and is experienced in using LLMs, crafted prompts in real-time using the notebook context (Fig. 3A) or parts of pre-written suggestions (Fig. 3B).The wizard then passed the prompt to ChatGPT to generate the suggestion of interest (Fig. 3C).ChatGPT was further used to refine some generated code-for example, to convert code between programming languages or shorten the length of a prior generation (Fig. 3C-D loop).The wizard sometimes manually corrected the output of a ChatGPT generation (e.g., changed the wording or variable names) to ensure high-quality suggestions.
Limitations.While able to elicit a variety of realistic user reactions to a new form of assistant this design has some limitations.While a ) of which were planning suggestions.All categories of intended planning suggestions (which may include code to help execute a suggestion) were found to be helpful by at least some analysts, highlighting the need for planning assistance that current assistants lack.For planning suggestions which analysts could reasonably incorporate into their notebooks (i.e., those that were not results interpretation and domain background), analysts integrated suggestions 51.6% of the time (47/91).Analyst reactions varied within each category, with responses ranging from finding the assistance helpful to neutral, or even unhelpful in certain instances.Our interviews and observations suggest that the effectiveness of a suggestion is not solely determined by its suggestion category but also depends on various nuanced factors (Sec.5).We provide full examples of the raised suggestions in the appendix.complete implementation of full system would be useful, our goals in this work are to better understand the design of such assistants rather than their implementation.Similarly, through this approach we are not limited by the capabilities of current LLMs (Sec.2.2) or participants non-expert prompt engineering abilities [87,125].Having a wizard create and deliver suggestions also raises a potential downside of being slower to respond than an automated system.However, we chose to conduct our study with a single analysis task and dataset so we could prepare high-quality suggestions in advance.This helped us both ensure a high quality and rich diversity of suggestion content and timing while being only slightly slower than a fully automated approach.

RESULTS
Our Wizard of Oz study explored two main questions: "what characterizes helpful suggestions?(RQ2), and how do these suggestions impact analysts' workflows?(RQ3).
With respect to RQ2, we found that analysts preferred suggestions that matched their analysis plan and their statistical and domain background (Sec.5.1).They preferred both execution and planning assistance, though planning required more cognitive effort (Sec.5.2.1).As a result, explanations provided in code, code comments, and natural language greatly assisted analysts in understanding the intent and rationale of suggestions (Sec.5.2.2).Due to the extra effort to consider planning assistance, analysts valued suggestions that were well-timed to their current task (Sec.5.3).
For RQ3, we found well-timed and contextual planning assistance often helped analysts consider and make alternative decisions (Secs.5.2 and 5.3).In contrast, analysts were reluctant to accept suggestions that mismatched with their expertise or task.Likewise, we observed potential drawbacks of AI-assistance: some analysts became distracted or overly reliant on the assistant's suggestions, often without taking the time to critically assess their relevance or correctness (Sec.5.4).Nevertheless, these impacts were not uniform across all scenarios and varied depending on the individual analyst (Sec.5.1), the type of assistance (Sec.5.2), and when the suggestion was made (Sec.5.3).
Triangulating their ratings with interview data revealed subtleties in analyst-assistant interaction dynamics, which we summarize in Fig. 6.Here, we consider suggestion helpfulness, their analysis impact, and the overall experience using a "Who, What, When, Why, and How" framework to organize our findings.While analysts expressed preferences for "where" the assistant should be placed on the UI, these preferences did not fundamentally impact their overall workflow, and so we include our observation to this effect in the appendix.

Who-Analyst Background
Our study revealed many aspects of an analyst's background that influenced their perception of suggestions and overall experience.

Analysis Plan.
Analysts exhibited varying levels of analysis forethought and rigidity in their analysis plans.While all analysts had a rough plan, some took extra time to explicitly detail their analysis steps [A3, A5, A9, A10, A11].For example, A9 spent the first 20 minutes planning on scratch paper before writing a single line of code.Others employed on familiar strategies, for instance, A4's initial plan drew on their familiarity with Species Distribution Modeling [39].Analysts (8/13) often liked suggestions that matched their own ideas or plans.They appreciated that the included code offered additional execution assistance to implement their plan.For example, in reference to a suggestion recommending Age as a variable, A12 noted "I really liked this one...I think it actually somehow knew that I was going to do this, and then it suggested a very efficient way to do so."Furthermore, for A5, such suggestions reaffirmed their decisions: "The assistant actually gave the same conclusion that I had typed, so that was helpful."Given these planning variations, it is unsurprising that participants also expressed interest in controlling the level of assistance received (Sec.5.5).

Statistical and Domain Knowledge
Expertise.Analysts' varying levels of statistical and domain expertise affected how they reacted to presented suggestions.Most analysts (11/13) had no trouble understanding the feedback for most suggestions, with many (10/13) finding them to be well-matched with their expertise and helped them consider new approaches.However, some analysts found suggestions overly basic and unnecessary [A1 A3, A5, A7, A8], due to their statistical expertise.This was most evident with the results interpretation suggestions created using ChatGPT.Some analysts liked these suggestions [A5, A6, A7, A10, A11, A12], while others adopted a more neutral stance [A8, A4].A1, a statistics professor with extensive expertise in computational social science, regarded the suggestions as unhelpful: "It was giving me comments on my results.It was not telling me whether it was good or bad which I knew already."Suggestions were occasionally unfamiliar since some analysts lacked an adequate statistical background [A3, A5, A6, A8, A9, A13].This led to varying behaviors: A3, A5, and A13 chose to ignore unfamiliar suggestions completely; A6 and A8 spent time trying to understand the suggestion but were left confused; and A9 considered referencing external sources to interpret the suggestions: "My first idea is to try and copy and search on the internet to see how this kind of generalized linear model could help me."Suggestions that were mismatched with the analyst's statistical background were unhelpful or required significant additional effort to understand, highlighting that assistant usage might usefully begin with an alignment phase.

5.1.3
Prior Experience with AI-based Assistants.We found that analysts' own past experiences with AI led to behaviors and reactions that diverged from the immediate analysis task.Analysts had mixed prior experiences working with AI-based code assistants (i.e., Github Copilot, ChatGPT).Over half (7/13) had no experience while others ranged from using code assistants a few times in total to regularly using them (Table 2).These prior experiences caused some analysts to be skeptical about suggestion correctness.For instance, A8 and A13 questioned the resources that informed it.Referring to a suggestion that referenced external studies, A8 mentioned, "I have less trust of something that is a more generated summary of stuff that exists outside of the notebook, that I didn't know where the information came from."Some analysts became distracted by the novelty of using an assistant.For instance, A8 was distracted from their task by their curiosity: "It made me want to understand how this (worked).I was getting distracted.What is it learning from?Is it from the data and knows all the variables or from what I am asking for?I feel like I was experimenting a lot with putting little notes and seeing what would happen."Our observations agree with existing work [76] that models trustworthy AI, where a user's attitude towards an AI system (and therefore subsequent reactions to the system) depends on individual, environmental, and cultural contexts in addition to the system's trustworthy cues.

What-Suggestion Content
Though we found that the exact category of planning content (e.g.Table 1's categories) was less important for depicting the helpfulness of a suggestion (Fig. 5), the type (planning and execution) and specificity of assistance were preferred in different situations (Sec.5.2.1).Moreover, we found well-reasoned explanations to be important to analysts (Sec.5.2.2).

Execution vs. Planning Assistance.
All analysts appreciated execution assistance, occasionally rating it negatively only when it came too late or differed from what they had intended.In our study, execution assistance occurred either on its own-when analysts were struggling with executing their plan-or as part of intended planning suggestions.Analysts (8/13) appreciated that execution assistance saved time and prevented them from having to search the internet.For example, A3 liked that a suggestion to consider a BMI variable included the code to calculate it, noting that "I don't know how to calculate BMI off the top of my head, and it was like 'Hey, here's how you calculate it'.Perfect, saves me a Google."These preferences are in line with prior observations [112,128] that expediency and reduced online searching are major perceived benefits of AI-based code assistants.Some analysts would have preferred even more execution assistance [A2, A3, A4, A5, A13].For example, A5 and A13 expected more help when faced with errors, while A3 wanted it to make visualization for them.
Reciprocally, most analysis found planning assistants to be helpful, although this utility depended less on suggestion type than on the timing (Sec.5.3).Most analysts (12/13) noted that planning suggestions often presented ideas that they had not previously considered.For instance, A10 contrasted our assistant's planning merits with other assistants: "Auto-complete code (assistants) are very good, but they are looking at what you are doing right now and maybe the last cell.This agent could take a more overall approach and help you think about the overall approach."In other situations, the suggestions also helped analysts realize key steps they would have otherwise missed.For instance, A2 found observed that "this is helpful.I didn't realize that skin tone rating was a combination of rater one and rater two." However, while most suggestions the assistant presented were tied to specific moments in the analysis process, some analysts wanted broader analysis planning help [A2, A3, A6, A7, A8].A3, for example, mentioned how at the beginning of the analysis, "recommendations for how to think about the problem would (have been) good."Similarly, A2 wanted suggestions that could provide a whole analysis plan: "Given the dependent variables and the data structures, what are some of the analyses I could run?If I need to run different analyses, how would I need to transform the data, with suggestions or sample code of how I can do that?"In these situations, analysts might benefit from workflow scaffolding recommendations similar to those in Lodestar [96] or litvis [119].
For A5 and A9, both the included code and comments acted as explanations that helped them understand what the assistant was trying to do.A9 felt that explanations helped bridge a lack of trust in the assistant about a new calls-per-game variable suggestion, noting that "I'm not trust(ing) with the calls-per-game, so I checked with every line of the code if they actually write what I'm thinking about."Meanwhile, explanations for why a planning suggestion should be considered were also preferred.A2 was especially enthusiastic: "I love the justification of why you would want to use age."In contrast, A8 connected their statistical expertise (Sec.5.1.2) to their willingness to accept a suggestion: "I don't know if I was being led astray by the assistant but I found this very convincing.But I don't have the fresh statistical knowledge to actually know if it was a good statistical approach or not." The explanations were also seen as detrimental at times [A2, A6, A8, A13], depending on the analyst's expertise (Sec.5.1.2) and focus on their own analysis plan (Sec.5.3).A13 ignored suggestions when "there is too much explanation".Similarly, A2, who appreciated the explanation for the inclusion of age, thought that some suggestions contained too much text.A6 noted that "Because I didn't understand why it was trying to include goals per game, I was kind of confused".This is akin to how programmers working with AI-based code assistants sometimes struggle to understand the suggested code [112], highlighting the importance and fragility of explanation design.

When-Timing of Suggestion
Timing has long been a central part of designing interactions with autonomous agents, and like previous studies [38,50], we find that inopportunely offered suggestions risk being ignored.
We observed that when engrossed in a particular analytical task, analysts were often resistant to diverting their focus toward distracting suggestions.For instance, A5 found that suggestions were often discordant with their current objectives: "very often the suggestions aren't in line with what you're currently typing.It goes in a whole different direction, and now you have to think of whether you want to take that direction or your direction."Similarly, A3 wanted to focus on their current step: ""I ignored it because I was focused on the variables at the time."For A5, poor timing also led them to ignore it: "I don't know if this was relevant or not.I didn't read it completely because I was trying to complete what I already started."Notably, most analysts (9/13) felt that "unhelpful" suggestions, would have been useful if they had been offered at a different moment.
When suggestions deviated only slightly from the current tasks analysts [A2, A5, A7, A9, A12] were open to them.For instance, while working through some statistical models, A7 took a recommendation to use the existing birthday column to create an age variable: "Honestly, birthday didn't hit me at that moment before the suggestion that popped up to calculate age.Without the suggestion, I wouldn't have thought of it unless I ended up reading about it and then including it a lot later."This suggests that while suggestions should be aware of the current task, they need not exclusively service that task as adjacent ones might usefully highlight alternative analysis paths-a key planning assistant goal.
Inherent in these observations is the tension between not disrupting the analyst and highlighting choices that might impact the analyst's conclusions.From a cost-benefit perspective [113], execution assistance is less intrusive, with its time-saving benefits more easily recognizable.Conversely, planning assistance demands more mental effort as analysts evaluate multiple analytical paths, underlying rationales, and the alignment with their prevailing strategies.This complicates the assessment of the utility of planning assistance and amplifies the drawbacks of sub-optimal suggestions.

Why-Reasons to Work with the Assistant
Most analysts (11/13) thought the assistant was helpful for both planning and execution-centric tasks, and expressed a variety of preferences as to why they would want to work with this style of assistant.
Analysts valued the planning assistance content provided by the assistant.For instance, A9 found it helped them to consider alternative decisions: "The (assistant) is helpful for me to indicate some aspects to think about.Though I did not take all of them, I see this is something that I may need to take a look at."Furthermore, A2 liked having a clear rationale for why they should incorporate a suggestion (Sec.5.2.2): "As compared to when I'm typing in Gmail where it (suggests) words, this I find is much more helpful because it not only gives me the code but also gives the justification and rationale."This suggests that planning assistance can be perceived as being of use to the data analysis process.
Even though our design primarily was centered on helping analysts consider alternative but reasonable approaches, they also frequently liked execution assistance.For instance, A4 appreciated that "it saves me from having to do a lot of mundane stuff like setting up a testing and training dataset for some sort of machine learning model."Similarly, A5 appreciated execution assistance for helping them move past roadblocks, noting that when "you are stuck... and (the assistant saved) you that trouble of going somewhere else and googling things."This is consistent with the preferences found in prior works that studied code execution assistants [74,99,114], and suggests that having both both planning and execution is critical for the design of this style of assistant.
However, given the quality and quantity of assistance provided by our assistant, A8 believed that it made them less engaged in critical thinking.For A8, the reason to use the assistant was to reduce their own cognitive load rather than to have an AI guide their decision making: "I feel like this 'friend' was great, but it also made me this clicking machine (rather) than an analytical thinker... Autopilot was a welcome path that I could choose."This follows the findings in empirical Human-AI collaboration studies in which people often over-rely on AI support [10,21,23,70].

How-Analyst or Assistant Initiative
Analysts espoused a variety of preferences about whether they wanted to initiate suggestions themselves or have the assistant do it for them.A few analysts [A4, A7, A9, A11] appreciated the assistant taking the initiative.For example, A11 liked that it "proactively goes ahead and gives you suggestions." Some analysts [A2, A3, A4, A9] wanted the assistant to take even more initiative.A2, for instance, wanted the assistant to automatically run code given their current workflow: "[I would want the assistant to get...] a sense of what my workflow is, has been, and is intended to be and pre-populate stuff for me.So if I normally run an OLS with varying degrees of variables, some included and some not included, generate a histogram for each.Do that for me." However, others felt that assistance would be more effective if it was delivered on demand.Given the preference for both assistant and analyst-initiated assistance, multiple analysts [A6, A7, A9, A13] suggested that this could be a configuration option.To wit, A9 proposed: "Maybe two modes: lazy mode and (control) mode.If I am in lazy mode, you should already construct things and add them (to the notebook).If I want to take over everything, please be quiet here.If I want you, I'll call you."This is akin to the two acceleration and exploration modes observed in prior work on AI-assisted programming [11], indicating that planning suggestions might operate in a similar way as those of execution.
Mediating these desires was the way in which the assistant was triggered.For instance, analysts [A3, A8, A10] liked that the assistant responded to requests made in comments.Some analysts wanted additional and different ways to directly ask the assistant for suggestions.A5, A6, and A7 wanted to ask the assistant for suggestions via a text prompt à la ChatGPT-style assistants.Similarly, A11 wondered if the assistant could provide a list of relevant next steps via a keyboard shortcut.
This diversity of preferences suggests that having multiple ways to interact with the assistant that can be modified to taste and task is essential for effective assistant design.

DISCUSSION
In this work, we explored the potential of AI-based data analysis assistants that incorporate execution and planning assistance.To identify the scope of suggestion content, we initially performed a literature review and a behavioral-driven analysis of independent analyses from a crowd-sourced analysis study (Sec.3) to define categories of relevant suggestions (Table 1).To explore the essential components of a helpful suggestion, we conducted a Wizard of Oz study to elicit the circumstances in which suggestions are helpful.We found that analysts had varying perspectives on what planning assistance was helpful.In some moments, planning assistance was about assistance for specific analysis steps; during others, it was about guidance for their overall workflow (Sec.5.2.1).Some analysts found suggestions occasionally distracting, preferring to focus on their own analysis plan (Sec.5.3).Similarly, we observed that there exists an inherent tension between the intended goals of the assistant and the goals of the analyst who might predominantly favor execution assistance, leading to divergent views on what constitutes "helpfulness." Though analysts' goals are primary, assistants can play a crucial role in creating higher quality and more robust data analyses.Thus, supporting analysts with only execution assistance may not lead to such outcomes.Balancing analysts' perspectives of helpfulness and the assistant's impact on the workflow is crucial for providing truly valuable suggestions.Given that planning suggestions often require deeper consideration, a mutual understanding of their appropriate timing is also imperative.Developing affordances to enable analysts understanding and usage of such assistants is essential to those the design of those assistant.
Here, we synthesize these and other findings as design guidelines (Table 3) to highlight design implications for future AI-based analysis assistants and reflect on limitations of our design.

Design Guidelines
Based on our results and discussion, we highlight seven design guidelines that extend Human-AI Interaction guidelines [6] for designing a data analysis assistant.
6.1.1Assistants should communicate with the analyst to align on analysis goals.During the analysis portion of our study, the wizard was occasionally misaligned with the analyst's goals (Sec.5.3).Likewise, analysts spent time figuring out the assistant's goals (Sec.5.1.3).Therefore, the analyst and assistant should establish shared goals at the beginning of the analysis, and this understanding should evolve over time.During its introduction, the assistant can describe its objectives for improving the analysis quality, highlight the value of planning assistance, and warn against the pitfalls of prioritizing execution assistance only.Furthermore, the assistant can communicate this information through interactive model cards [32].Although prior work on conversational agents has found users prioritize their goals and needs over the assistant [30], we posit that it remains vital to consider both parties in the interest of quality analytical results.Adhering only to analysts' execution preferences may perpetuate poor analysis decisions and practices, leading to misguided inferences or biased results [84].Understanding the best ways to establish, integrate, and uphold shared goals over time is an important area for future exploration.
6.1.2Assistants should provide suggestion content suited to the analyst's background.Analysts differ in their statistical, domain, and coding backgrounds.We observed that the same suggestion considered helpful by some participants, could be found by others to be too elementary, too distracting, too far from their expertise (making correctness auditing difficult), or too different from their usual analysis or coding practices to be useful (Sec.5.1).In situations when a suggestion is too difficult or different to comprehend, the assistant should provide options to explain further or offer additional resources-e.g., links to relevant documentation and tutorials.Likewise, the assistant could suggest more approachable alternatives on demand.In our study, the assistant did not gather any information on task or background from the analyst.A more explicit definition of the analyst's expertise, such as through analyst initiative (Fig. 6A), may help establish common ground.For instance, on firstusage analysts could have a quick conversation with chatbot to elicit their educational background, coding competency, and typical workflows (akin to how we asked participants about their preferred environments before the study).The assistant could then use this as relevant context to customize LLM-based assistant suggestions [7,66].

Assistants match the analyst's preferred level of planning assistance abstraction.
There are multiple perspectives on what consitutes helpful planning assistance.These can range from guiding hyper-parameter selections to providing a comprehensive plan with multiple analysis steps.Future assistants should account for analysts' preferences regarding desired levels of planning assistance and how these preferences may change throughout the analysis process.For instance, the assistant could offer explicit modal dials to guide suggestion content-such as an "execution" mode that focuses on code execution, a "think" mode for specific planning suggestions, a "reflection" mode for connecting decisions and highlighting potential missed steps, and an "exploration" mode for higher-level planning suggestions.Likewise, the level of planning abstraction also relates closely to the scope of the analysis code that the assistance affects.An analytical decision could be one parameter, line, function, code block, or even multiple chunks of code [81].Suggestions that impact large parts of the analyst's analysis require more willingness from the analyst to consider and adopt.Both sets of levels could be configured from the UI or learned through observation of the users across analyses.
6.1.4Assistants should time suggestions based on the analyst's openness to considering alternative approaches.Analysts in our study felt that some suggestions were good but were not well-timed based on their current task (Sec.5.3).When analysts are focused on executing a specific analysis decision or plan, they may prefer execution rather than planning suggestions.Conversely, when they are considering their next steps, they may be more open to guidance for analysis planning.Analysts could explicitly communicate their timing preferences to the assistant or use a "remind me later" response to raised suggestions.Additionally, organizing and tagging suggestions could help them easily find and revisit suggestions.Meanwhile, the assistant could also learn appropriate timing from analysts' behaviors.For instance, prior works have used probalistic utility modeling [50] and sensors to model and predict human interruptibility [38].We note that this may be challenging if analysts fixate on rigid analysis plans and highlight the importance of establishing shared goals between the analyst and assistant (Sec.6.1.1).

6.1.5
Assistants should provide multiple ways to request content at different levels of abstraction.We observed diverse opinions (Sec.5.2.1) about assistant-versus-analyst initiative and levels of planning assistance, a heterogeneity which might be met by supporting multiple means of requesting assistance.While existing systems usually offer one form of invocation, LLM-based analysis assistants should allow and interleave multiple forms of invocation and context for different levels of assistance.It is also essential to clarify the context for the LLM in each form of interaction.For example, a hotkey trigger at the line level could indicate a focus on execution assistance using only the current cell.On the other hand, analysts could specify broader suggestions when requesting feedback in the side panel and choose the LLM context by selecting relevant cells.Interactive visualizations could also be employed to show and help analysts make decisions and consider alternative choices [79,81].These visualizations (e.g., Fig 1), tied to the analysis code, could help analysts review their steps and select areas where additional assistance is needed.
6.1.6Assistants should mitigate suggestion overreliance by promoting engagement and critical thinking.As data analysis often informs high-stakes real-world decisions [1,9,31], it is essential that analysts be able to understand and think critically about their analyses, regardless whether they received execution or planning assistance.In our study, analysts showed varying levels of thoroughness in validating the model's decisions, code, and outputs (Sec.5.1.3)-sometimescritically disengaging and going on "autopilot".Over-reliance on AI is well documented in other scenarios [23,24,113,124], and developing mechanisms for skepticism is a growing concern [45].Designing assistants with critical affordances is essential to preventing them becoming AI-mediated p-hacking machines.One approach might be to reduce the cognitive effort of engaging with suggestions [113], freeing analysts to be more scrutinous.For example, the system could present a visualization mapping the overall decision paths (à la Fig. 1) to demonstrate how a given recommendation would influence the overall analysis.Similarly, assistants could introduce cognitive forcing functions [23] to increase the cost of relying on the assistant.For example, the assistant could require a scaffold of the analysis plan before providing suggestions or by requiring that code be typed out manually rather than automatically inserted.Forced low-level participation may prompt better engagement with the assistant-although it is crucial to ensure that these interventions do not deter analysts from utilizing assistance altogether.
6.1.7Assistants should adopt multiple explanation methods for increased understanding and trust.Analysts highly valued suggestion explanations, as they helped them better understand and trust the assistant's suggestions (Sec.5.2.2).Explanations provided both rationale for why an analysis decision should be considered and definitions of potentially unfamiliar concepts.Future assistants should incorporate multiple forms of explanations, that are sensitive to analysts' backgrounds and preferences.To avoid introducing excessive cognitive load [113], assistants could allow analysts to select their preferred types of explanation.Alternative explanation mediums, like visualizations or animations [95,120], could better contextualize suggestions.As LLMs continue to improve, assistants could also provide model-generated critiques of their own suggestions [34,77,80].Identifying explanation best practices remains an open area of research [10,23,36,113].Unlike many other tasks however, there are often multiple reasonable paths in data analysis [79] without an objectively "correct" answer or other measures besides accuracy-making evaluation of explanations especially troublesome.

Limitations and Future Work
This work has several limitations, relating to our participants, our study design, and our analysis of the results.
First, we selected analysts who self-identified as having a high proficiency in statistical analysis.As we wanted to understand the mechanisms surrounding planning assistance, we determined that analysts with sufficient programming and statistical knowledge would be more engaged with analysis planning and our assistant's suggestions; we found that even they were often unaware of the decisions and rationales our assistant offered.Analysts with less statistical and programming expertise may have different experiences and preferences when working with an analysis assistant.Future work should explore how this style of assistance can be support for novice data analysts.In addition, we focused on people willing to participate and perform data analysis with an AI assistant, which may positively bias our results.
Next, to facilitate a lab study of reasonable duration, we chose to conduct a same-day, in-person study of two hours.If analysts had more time with the assistant, say weeks, they may have calibrated their expectations for the behaviors of the assistant, developed a better sense of the assistant's timing and goals, and overcome any potential novelty effects latent to our one-shot design.In future work we intend to explore these issues by developing a tool that supports this style of longitudinal usage.Similarly, while our labbased Wizard of Oz study design afforded us substantial control over the assistant, it may not have presented the same stakes as a real-world one.As a result, participants may have been more willing to accept planning and execution assistance.We tried to mitigate this effect by priming analysts to consider the robustness of their analyses via our introductory material.
Finally, our study concentrated on analyst preferences, leaving evaluation of the resulting quality of analyses unaddressed.Determining analysis correctness is inherently challenging [12,15,18,35,86,103,107], with there often being no single "correct" answer, but rather multiple defensible interpretations.However, extensive literature on multiverse analysis [81,108,109] notes the importance of considering alternative approaches.This helps analysts appreciate uncertainties surrounding their decisions and enhances analysis robustness.This work takes a step toward this larger goal by showing how planning assistance can guide analysts toward assessing multiple analytical perspectives.

A EXAMPLES OF SUGGESTION CATEGORIES
We give a brief description of each category in Table 1 and illustrate full examples of suggestions for each of the suggestion categories.

A.1 Domain Background
While domain background content could be part of suggestions in other categories, this category is for when the main focus of the suggestion is to provide analysts with background on the problem domain.An example could be giving analysts suggestions about similar analyses done in the past: • Similar Analyses Done in the Past.
The question of whether skin tone influences the way referees make decisions in soccer games has been a topic of research for a while.One of the most well-known studies in this area is "Racial Bias in the Allocation of Fouls in Soccer" by Price and Wolfers (2010).They analyzed data from the 2004-2005 English Premier League season and found that black players were more likely to receive red and yellow cards than white players, even after controlling for a variety of factors such as position, nationality, and the competitiveness of the game.
Another study that investigated the issue is "Skin color and red card decisions in football: a preliminary study" by Diaz et al. (2016).They analyzed data from the Spanish league and found that darker-skinned players were more likely to receive red cards than lighter-skinned players, after controlling for various factors such as age, nationality, position, and the competitiveness of the game.
Both studies used regression analysis to examine the relationship between race and the likelihood of receiving red cards.Price and Wolfers included a wide range of control variables, while Diaz et al. included fewer variables but also examined the moderating effect of the referee's of Overall, the analyses focused on controlling for various factors that could influence the relationship between race and red cards, and both studies found evidence of racial bias in the allocation of fouls.

A.2 Data Wrangling
Offer assistance related to understanding, transforming, clearing, and preparing the data.
Example suggestions: • Reminder -Check for Missing Data.
Always check for missing data in your DataFrame before performing any data analysis or machine learning tasks.NaN values can cause unexpected errors and affect the accuracy of your results.
To help us understand the differences between raters better we can use the following code: # Compute the difference between the two raters ' ratings } data [ ' rating_diff '] = data [ ' rater1 ']data [ ' rater2 '] # Create a distribution plot of the rating differences using seaborn sns .histplot ( data = data , x = ' rating_diff ' , kde = True ) • Data Transformation -ZTransform Predictors.
In this dataset, we may want to z-transform predictors as it can improve the interpretability of regression coefficients by putting all predictors on the same scale.Standardization can also reduce the impact of outliers and extreme values that can skew the results of statistical analyses.Finally, standardization can make it easier to compare the relative importance of predictors in a model that has different units and scales.
A code template to z-transform a variable is shown: df $ z _ var <-scale ( x = df $ var , center = TRUE , scale = TRUE )

A.3 Conceptual Model Formulation
Inspired by Jun et al. [57], this category corresponds to assistance that helps analysts identify or think about factors that might influence an outcome and how these factors might interact to give rise to the outcome.Example suggestions: • Collapsing the Data by Player.
Based on the dataset description, each player-referee dyad has multiple games in which the player and referee appeared.
As we explore rating distributions, it may be reasonable to collapse the data across playerShort, keeping the ratings for the player.We can also create a variable sumGames that are aggregated across the variable games that provide the summed number of games played by the player across all player-referee dyads.For the outcome variable, we can create sumRedCards that represents the number of red cards a player received from all referees encountered.The code below does this transformation:

A.5 Choosing the Statistical Model
Analysis assistance which aids analysts in choosing or directly points analysts to a relevant statistical model.
Example suggestions: • Understanding Differences Between Raters.
We can start with a simple OLS model that includes the player's skin tone and the number of red cards as the dependent variable.
Here is an example of how we can create this model in R: # Create an OLS model model <-lm ( redCards ~avg _ rating , data = df ) # Summarize the model summary ( model ) • Modeling Strategy: Generalized Linear Models.
Beyond just using linear models, we may want to account for the random variance of the effects across players, referees, and referees' country of origin.We could use generalized linear mixed models (function glmer in R package lme4).
To start, you could fit three models of increasing complexity to explore the effects of skin tone on the likelihood of receiving a red card.An example template of this code is shown below: gm1 assumes that the effect skin tone has on the likelihood of receiving a red card is constant across all referees and players, but that the baseline rate of red cards varies across referees and players.gm2 allows the effect skin tone to vary across different referees, and gm3 further allows for systematic differences in the way that referees from different countries make biased decisions based on skin tone.All models assume that observations are independent and identically distributed (i.i.d.), the response variable follows a binomial distribution, and the random effects follow a normal distribution.They also assume that the random intercepts and slopes are uncorrelated with each other and with the fixed effects.Further adjustments can be made as appropriate, such as changing the family type or adding other fixed effects.
As the red cards are converted to a binary variable we can run a logistic regression using the statsmodels library.Here's some sample code to run: import statsmodels .api as sm X = sm .add_constant ( X ) # Add an intercept term to the model model = sm .Logit (y , X ) .fit () print ( model .summary () ) The resulting output will give you information about the coefficients for each variable in the model and whether they are statistically significant predictors of receiving a red card, while controlling for other variables in the model.You can also use the output to calculate odds ratios to quantify the association between skin tone and receiving red cards.

A.6 Model Results Interpretation
This category involves any suggestion that helps interpret the outputs of running a statistical model into a more readable and potentially actionable summary.These suggestions are primarily written by ChatGPT Example suggestions: • ANOVA Results Interpretation.
The output shows the results of a one-way ANOVA test that examines the relationship between rater1 and redCards.
The redCards variable is grouped by the levels of rater1, and an ANOVA test is performed to determine whether there is a statistically significant difference in the mean redCards scores between these groups.The F-statistic of 2.35 indicates the ratio of the variance between the groups to the variance within the groups.The p-value of 0.0523 indicates the probability of observing such an F-statistic value or higher under the null hypothesis that there is no significant difference in the mean redCards scores between the groups.Based on the p-value of 0.0523, we can conclude that there is a borderline statistically significant difference in the mean redCards scores between the groups defined by the rater1 variable.Specifically, we can reject the null hypothesis at a significance level of 0.05, but not at a significance level of 0.01.Therefore, there is some evidence to suggest that the rater1 variable is related to the redCards variable, but the evidence is not very strong.Given that the models have low R-squared values, it may be worth trying a different model or adding additional predictor variables to better explain the variance in the dependent variable.Additionally, it would be helpful to investigate the relationship between the predictor variables and the dependent variable further to better understand their impact on the model.

A.7 High-Level Planning
These are suggestions that point analysts to a general analysis decision, have a high level of abstraction, and are not specific or easily actionable enough to fit into the other planning categories discussed before.
Covariates are additional variables that can affect the relationship between the predictor variable and the outcome variable in an analysis.In order to accurately understand the relationship between the predictor and outcome variable, it is important to consider the effects of these covariates.
Including covariates in an analysis can help to control for potential confounding variables, which are variables that can affect both the predictor and outcome variables.By controlling for these confounding variables, the true relationship between the predictor and outcome variable can be more accurately estimated.Additionally, including covariates can help to increase the precision of the estimated effect of the predictor variable on the outcome variable.This can lead to more accurate predictions and better decision-making based on the analysis results.
In summary, considering covariates in an analysis is important because it helps to control for potential confounding variables and increase the precision of the estimated effects of predictor variables on outcome variables.
Based on the dataframe provided, here are some visualizations that can be considered to understand the relationship between variables:

A.8 Execution Assistance
This is a broad category to cover any suggestion that helps analysts carry out an analysis decision.This can involve providing code, aiding analysts in debugging errors, or helping implement analysts' intent.These suggestions are also primarily created using ChatGPT.
We can plot the percentages dataframe to see the differences better.In this example, leagueCountry is added as three separate dummy variables for England, Germany, and Germany, with Germany being the reference category.The drop_first=True argument removes one of the dummy variables to avoid multicollinearity.

B ASSISTANT INTERFACE IMPLEMENTATION
We implemented the wizard and user interface for the Jupyter environment [93] as a JupyterLab extension.We extended the Jupyter-Lab Commenting and Annotation extension [60] and incorporated two modes: wizard mode and analyst mode.Both modes work with the same underlying notebook file and share a similar side panel interface.In wizard mode, the wizard can highlight code or cells and add corresponding suggestions to the side panel.All suggestions were stored on disk as an underlying suggestion file.The extension, when in analyst mode, polled this file for changes and updated the side panel when new suggestions were added to the file.
In the study, both the analyst and wizard worked with the same notebook in the same file system.We hosted the notebook on the wizard's machine.Accordingly, we connected the analyst's machine remotely to the wizard's machine using SSH.To ensure the wizard did not update actual notebook contents, we gave wizard mode no edit permissions.

C NOTEBOOK DETAILS FOR LAB STUDY
The notebook included a description of the task, the dataset columns, and associated descriptive statistics.In addition, the notebook had a profile report of the dataset which we created using the ydataprofiling library [122].This report included univariate analysis (i.e., descriptive statistics and distribution histograms) and multivariate analysis (i.e., correlations, missing data, duplicate rows) of the dataset columns.We wanted to help analysts quickly familiarize themselves with the dataset and direct their focus toward reasoning about their analysis approach.

D STUDY OBSERVATIONS FOR "WHERE" -LOCATION OF ASSISTANT AND SUGGESTIONS
While we did not explicitly ask analysts their preferences for the location of the assistant in the user interface, 7/13 analysts did express what they preferred.Most of these analysts mentioned that they liked the assistant being to the side of the notebook [A4, A6, A8, A11, A12].For instance, A11 liked that the side panel allowed them to focus on the notebook: "It's pretty useful that it is on the side and it doesn't (interfere) with what you are thinking what you are typing."On the other hand, A3 and A10 preferred the suggestion closer to the cell.With respect to the location of the actual suggestions, analysts sometimes found it to be too cluttered [A5, A7, A9, A13].This partially led to some analysts missing suggestions [A5, A10].For A5, this became more difficult as it was "very hard to view when you have a lot of things."Thus, analysts suggested better ways of organizing the suggestions.For example, A13 wanted to tag variables with a categorization: "It would be great to have a category of suggestions so I can quickly see if it's about processing variables or debugging or explanation."A5 expressed similar views but would have liked "some kind of color coding like how in Stack Overflow you have questions and answers in different colors." The lack of consensus on assistant location supports prior findings [85], in that it depends on the needs and preferences of the analyst.
.fit() # Print the regression summary print(results.summary())Given my analysis code what should I do next?

Figure 2 :
Figure 2: Planning assistance has limited support and is unexplored.Existing assistants, such as Github Copilot, are well-suited for execution assistance, such as suggesting data wrangling or modeling code.

Figure 3 :
Figure3: In our Wizard of Oz study, analysts work with a JupyterLab notebook containing the assistant interface (left).They are unaware of the existence of a human wizard, located in a separate room, operating the assistant backend (right).The wizard uses a general purpose LLM and integrates information prepared before the study.The wizard observes the analyst's notebook for opportunities for assistance and obtains the relevant notebook context (A).Next, the wizard either retrieves a prewritten suggestion developed prior to the study (B) or prompts the LLM (C) to generate a relevant suggestion (D).The wizard may further refine the suggestion (e.g., shorten its length) with additional prompting of the LLM (C-D Loop).Finally, after creating the suggestion, the wizard presents the suggestion back to the analyst (E).In this process, the wizard acts as the interface between the analyst and the LLM, deciding on the content and timing of the suggestion.

Figure 4 :
Figure 4: Assistant User Interface.Our assistant is integrated into JupyterLab as a side panel.Analysts can open the side panel in the right sidebar (A) and turn the assistant on/off with a toggle (B).Suggestions are raised as cards in this side panel.When first turned on, the assistant welcomes the analyst with a startup message (C).When a new suggestion occurs, the side panel shows a loading animation of variable duration to call attention to it; variability in duration aims to indicate that the system is working and simulate a real system computing and generating a suggestion[40,98].New suggestions also have a yellow highlight around them for greater visibility (D).Likewise, each suggestion has a corresponding context (D1) in the notebook, which is automatically scrolled into view when the suggestion is clicked (D2).Analysts can provide feedback for each suggestion via the thumbs-up and thumbs-down buttons (E).Finally, analysts can copy recommended code using the copy button (F).Note that the width of the side panel can be adjusted and is typically smaller than is pictured here for illustration.

Figure 5 :
Figure 5: Analysts favor currently unsupported planning assistance.Analysts saw on average 11.85 (std=4.56)suggestions, 9.85 (std=4.16) of which were planning suggestions.All categories of intended planning suggestions (which may include code to help execute a suggestion) were found to be helpful by at least some analysts, highlighting the need for planning assistance that current assistants lack.For planning suggestions which analysts could reasonably incorporate into their notebooks (i.e., those that were not results interpretation and domain background), analysts integrated suggestions 51.6% of the time (47/91).Analyst reactions varied within each category, with responses ranging from finding the assistance helpful to neutral, or even unhelpful in certain instances.Our interviews and observations suggest that the effectiveness of a suggestion is not solely determined by its suggestion category but also depends on various nuanced factors (Sec.5).We provide full examples of the raised suggestions in the appendix.

Figure 6 :
Figure 6: Study-informed Model of Analyst-Assistant Interaction Dynamics.Based on our study findings, we model the underlying influences and interactions between an analyst and an assistant.The assistant can receive information about the analyst's initiative (A) and the analysis contents (B) to develop an understanding of the analyst's goals and background as well as the current analysis plan and context (C).Note that the analyst's initiative is optional.This, in conjunction with the assistant's programmed goals (D) (e.g., offering more planning assistance or doing only what analysts request), informs the assistance (i.e., timing, organization, location, execution vs. planning, and contents) provided to analysts (E).After a suggestion is raised to the analyst, how it is perceived (F) is impacted by the analyst's background (G) and their current analysis plan (H).This then determines the extent to which the analyst accepts the suggestion in their planning (H) and/or execution (I) processes.The loop then continues as analysts update their analysis (B) or require additional assistance (A).

Table 3 :
Assistants should ... Design implications and takeaways for building analysis assistants.

•
Age Variable Rationale.Beyond just skin tone we might want to consider other variables that might affect the analysis.As birthday is in the dataframe, we might want to control for age which is a common practice in research studies, as age can affect various aspects of behavior and cognition.One reason to control for age is to account for the tendency of impulsivity, which may be associated with receiving red cards, to decrease with age.Research has shown that sensation seeking and impulsivity tend to decrease with age, as indicated by both self-report and behavior measures.Therefore, controlling for age can help reduce this variable's impact on the study results.Additionally, age is a common demographic variable, and controlling for it can help to ensure that any observed effects are not simply due to age differences among participants.We can create a new column age based on the birthday column, we first need to convert the birthday column to a date format using the as.Date function.We can then calculate the age using the difftime function, which calculates the time difference between two dates in years.•NewVariableStarStatusand Rationale.We can create a variable to capture star status.More popular or famous players might potentially be given more lenient treatment by referees and be less likely to be ejected with a red card.Based on the data available, we can determine stat status by calculating the average number of goals scored by each player per game they played in the data set.Although this is imperfect in that it might not properly account for players who are defensive stars and score fewer goals, offensive players are typically the most famous, and thus goals scored should generally be a meaningful indicator of this status.The code below shows how we might do this:Help analysts with operationalizing variables involved in the analysis.This may involve data transformations but if the primary goal is to create new columns that can be worked with in the analysis, then the suggestion is about operationalizing constructs.

•
Comparing and Interpreting OLS Results.Looking at the three OLS regression results, we can see that the models have low R-squared values, indicating that the predictor variables explain very little of the variance in the dependent variable, redCards.Additionally, the F-statistic for each model is relatively low, indicating that the models may not be a good fit for the data.Model 1 has two predictor variables, skinTone and goals, with goals having a higher coefficient than skinTone.Model 2 has three predictor variables, skinTone, age, and Intercept, with Intercept having the highest coefficient, followed by age and skinTone.Model 3 only has two predictor variables, skinTone and Intercept, with Intercept having a higher coefficient than skinTone.
1.Histogram: A histogram of the redCards variable can be created to see the distribution of the number of red cards given by referees.This can help identify if there are any extreme values or outliers in the dataset.2.Scatter plot: A scatter plot of redCards againstDark_Skin_Cards can be created to visually see if there is any relationship between the two variables.This can help identify if there is any correlation between a player's skin tone and the number of red cards they receive.3. Box plot: A box plot of redCards grouped by Dark_Skin_Cards can be created to see if there is any difference in the distribution of red cards received by players with different skin tones.4. Bar chart: A bar chart of the total number of red cards given by each referee can be created to see if there are any referees who give out a disproportionately large number of red cards.

•
Controlling for LeagueCountry.To control for the leagueCountry variable in the OLS model, you can include it as a categorical variable in the model.In Python, you can create dummy variables for the leagueCountry variable using the get_dummies() function from pandas.Here's an example code to add leagueCountry as a categorical variable: