AI for Low-Code for AI

Low-code programming allows citizen developers to create programs with minimal coding effort, typically via visual (e.g., drag-and-drop) interfaces. In parallel, recent AI-powered tools such as Copilot and ChatGPT generate programs from natural language instructions. We argue that these modalities are complementary: tools like ChatGPT greatly reduce the need to memorize large APIs but still require their users to read (and modify) programs, whereas visual tools abstract away most or all programming but struggle to provide easy access to large APIs. At their intersection, we propose LowCoder, the first low-code tool for developing AI pipelines that supports both a visual programming interface (LowCoder_VP) and an AI-powered natural language interface (LowCoder_NL). We leverage this tool to provide some of the first insights into whether and how these two modalities help programmers by conducting a user study. We task 20 developers with varying levels of AI expertise with implementing four ML pipelines using LowCoder, replacing the LowCoder_NL component with a simple keyword search in half the tasks. Overall, we find that LowCoder is especially useful for (i) Discoverability: using LowCoder_NL, participants discovered new operators in 75% of the tasks, compared to just 32.5% and 27.5% using web search or scrolling through options, respectively, in the keyword-search condition, and (ii) Iterative Composition: 82.5% of tasks were successfully completed and many initial pipelines were further successfully improved. Qualitative analysis shows that AI helps users discover how to implement constructs when they know what to do, but still fails to support novices when they lack clarity on what they want to accomplish. Overall, our work highlights the benefits of combining the power of AI with low-code programming.


INTRODUCTION
Most AI development today involves Python programming with popular libraries such as scikit-learn (sklearn) [28]. Unfortunately, writing code, even in a language as high-level as Python, is hard for citizen developers [22], people who lack formal training in programming but nevertheless write programs as part of their everyday work. This is a fairly common situation for data scientists, among others. AI programming libraries also tend to be large and change regularly. Needing to remember hundreds of AI operators and their arguments slows down even professional developers.
Low-code programming [32] reduces the amount of textual code developers write by offering alternative programming interfaces. In recent years, it has been embraced by software vendors to both democratize software development and increase productivity. Most low-code offerings for building AI pipelines currently favor visual programming [6, 12, 18]. While visual programming helps users navigate complex pipelines, it poorly supports discoverability of API components in large APIs due to the large range of options and limited screen space. In parallel, programming by natural language (PBNL) has recently soared in popularity. Tools like Copilot [1] and ChatGPT [2] can generate code from natural language prompts in which users describe what they want to accomplish, which is especially helpful in ecosystems with large APIs. However, these tools still generate code, which can be complicated and hard to understand [34], especially without formal training in programming.
At the intersection of these two paradigms, we propose LowCoder, the first low-code tool to combine visual programming with PBNL. We conjecture that the respective strengths of these two low-code techniques can compensate for each other's weaknesses. PBNL uses AI to help users retrieve and use programming constructs based on natural language queries. This does not always return correct programs, necessitating a way to help users understand and fix generated programs. Visual programming complements PBNL by providing a clear, unambiguous representation of the program that users can directly manipulate to experiment with alternatives. Our goal is to help people who know what they want to accomplish (e.g., build a data processing pipeline) but face syntactic barriers from the programming language and library (the how part).
LowCoder's visual programming component, LowCoder_VP, lets users snap together visual blocks for AI operators into well-structured AI pipelines. It uses Blockly [27] to provide a Scratch-like [31] look-and-feel. The PBNL component, LowCoder_NL, lets users enter natural language queries and predicts relevant operators, optionally configured with hyper-parameters. It uses a fine-tuned variant of the CodeT5 model [41] that we developed through experiments with a variety of neural models for program generation, ranging from training models from scratch to few-shot prompting large language models [25]. We further noticed that it is common in this domain for queries to mention at most a subset of hyper-parameters for each pipeline step, so we developed a novel task formulation tailored to this use case that improved learning outcomes.
We leverage LowCoder to provide some of the first insights into both how and when low-code programming and PBNL help developers with various degrees of expertise. We conduct a user study with 20 participants with varying levels of AI expertise using LowCoder to complete four tasks, half of which with, and half without, the help of the AI-powered component LowCoder_NL. Overall, we find that the combination of visual programming along with the natural language interface helped both novice and non-novice users to successfully compose pipelines (85% of tasks) and then further refine their pipelines (72.5% of tasks) during the study when using LowCoder_NL. Additionally, LowCoder_NL helped users discover previously-unknown operators in 75% of the tasks, compared to just 32.5% using other methods like web search when LowCoder_NL was not available. In addition, despite being trained on a different dataset, LowCoder_NL accurately answered real user queries. In summary, this paper makes three main contributions: (i) Low-Code for AI: We introduce LowCoder, a new low-code tool that combines visual programming and PBNL to help develop AI pipelines. (ii) AI for Low-Code: We benchmark various AI models and develop a novel task formulation to build an AI-powered natural language interface for LowCoder. (iii) User Study: We analyze the trade-offs between the two modalities and study the effects of using AI for low-code programming through a user study involving 20 participants with varying levels of AI expertise using LowCoder.

RELATED WORK
In adopting a visual programming approach to low-code, we follow a long tradition [7]. We were particularly inspired by Scratch, a popular visual programming environment for children that uses lego-like connected blocks [31].
Our other inspiration came from projectional editors, where the visual programming interface is a projection, or view, over an internal domain-specific language (DSL) [37]. Our implementation uses Blockly, a meta-tool for creating block-based visual programming tools [27], and Lale, a DSL for AI pipelines [4].
Low-code for AI: Most low-code interfaces for programming AI pipelines use visual programming. Examples include WEKA [18], Orange [12], and KNIME [6]. Each has a palette of operators that can be dragged onto a canvas, where they can be connected into a boxes-and-arrows style diagram. Commercial low-code visual interfaces follow the same approach, such as Vertex AI, Sagemaker, AzureML, and Watson Studio. A related approach for low-code AI pipeline development is automated machine learning (AutoML), which is also used by many of the same commercial AI interfaces mentioned earlier. These tools tend to take a black-box approach where the user has little control over the AutoML search and may not even see the resulting pipeline. AutoML libraries such as auto-sklearn [16], TPOT [26], and hyperopt [5] provide a Python interface, which is intended for textual code development. There are also natural-language interfaces for professional developers based on large language models, such as GitHub Copilot, which uses Codex [10], and ChatGPT. Since these support APIs for which there is sufficient publicly available code to use as training data, they cover popular machine learning libraries such as sklearn. The main difference between these low-code for AI tools and our paper is that we combine the ease-of-use of visual programming with a natural language interface to help users discover and configure operators and, inspired by Scratch [31], our tool encourages liveness [33] through immediate user feedback for each user input into the system. This contrasts with most tools that require explicit training and scoring steps for feedback. Figure 1 summarizes the relationship between LowCoder and other low-code for AI tools.
AI for low-code: The most prominent AI technique for low-code is programming by natural language (PBNL). When Androutsopoulos et al. surveyed natural language interfaces to databases in 1995, it was already a well-established field [3]. Desai et al. treat PBNL as a program synthesis problem targeting a DSL designed for the purpose [13]. The Overnight paper addresses the problem of missing training data for PBNL interfaces by crowdsourcing [40]. And SwaggerBot lets users extend and customize a chatbot from within the chatbot itself [36]. Unlike these works, our paper uses large language models for PBNL, uses PBNL for creating AI pipelines, and integrates with a visual programming interface.
Combining low-code techniques: Our work combines visual programming with PBNL. In a similar vein, Rousillon combines visual programming with programming by demonstration [9] and Pumice combines programming by demonstration with PBNL [23]. Like Rousillon and Pumice, our goal in combining techniques is to use the strengths of each technique to mitigate weaknesses in the other. However, unlike Rousillon and Pumice, we choose different techniques to combine, and we target a different domain, namely AI.
User studies on AI tools: There are a few studies that aim to evaluate whether developers perform better on programming tasks when working with AI tools.
Vaithilingam et al. had developers use GitHub Copilot on three programming tasks and found that while neither task success rate nor completion time improved while using Copilot, developers preferred using it compared to standard code completion [34]. Similarly, Xu et al. had developers perform several programming tasks with and without the use of a natural-language-to-code generation model and found no significant differences with regards to code quality, task completion time, and program correctness [42]. Wang et al. interviewed several data scientists to better understand their perceptions of automated AI and found that they had mixed feelings [39]. However, nearly all of them felt that the future of data science involved collaboration between humans and AI systems. Unlike other work, which tends to focus on how AI supports software development by experienced developers, our paper focuses on AI tools in the context of low-code systems where developers have varying expertise levels in both building software and AI.

LOW-CODE FOR AI
This work explores the intersection of low-code and AI in an effort to understand the benefits and limitations of using low-code for AI. We accomplish this by implementing and studying LowCoder, a prototype low-code tool for building AI pipelines with sklearn operators for tabular data that includes both visual programming (VP) and natural language (NL) modalities, which complement each other by mitigating the limitations of either modality separately. Building this tool provided us with the opportunity to examine the impact of both modalities on users. Figure 2 highlights the main features and inputs of LowCoder.
To support multiple low-code modalities, we follow the lead of projectional code editors [37] by adopting the model-view-controller pattern. Specifically, we treat visual programming as a read-write view, PBNL as a write-only view, and let users inspect data in a read-only view. The tool keeps these three views in sync by representing the program in a domain-specific language (DSL). The domain for the DSL is AI pipelines. A corresponding practical desideratum is that the DSL is compatible with sklearn [28], the most popular library for building AI pipelines, and is a subset of the Python language in which sklearn is implemented, which also enables us to use AI models pretrained on Python code. The open-source Lale library [4] satisfies these requirements and, in addition, describes hyper-parameters in JSON Schema format [29], which our tool also uses. The current version of our tool supports 143 sklearn operators. LowCoder uses a client-server architecture with a Python Flask back-end server and a front-end based on the Blockly [27] meta-tool for creating block-based visual programming tools. The front-end converts the block-based representation to Lale, which is then sent to the back-end. The back-end validates the given Lale pipeline using internal schemas, then evaluates the pipeline against a given dataset. The results of this evaluation (including any error messages) are returned to the front-end and presented to the user.
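To make the DSL representation concrete, the following is a minimal sketch of the kind of Lale pipeline the front-end assembles from blocks; the dataset and operator choices here are illustrative, not taken from the paper, though the `>>` combinator and schema-based hyper-parameter validation are part of the Lale library itself.

```python
# A minimal sketch of a Lale pipeline like those LowCoder assembles
# from blocks (illustrative dataset and operator choices).
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from lale.lib.sklearn import SimpleImputer, StandardScaler, LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# `>>` is Lale's combinator for chaining pipeline steps; hyper-parameters
# are validated against Lale's JSON schemas before any training happens.
pipeline = SimpleImputer(strategy="mean") >> StandardScaler() >> LogisticRegression()

trained = pipeline.fit(X_train, y_train)
print(accuracy_score(y_test, trained.predict(X_test)))
```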

Visual Programming Interface
LowCoder_VP is our block-based visual programming interface for composing and modifying AI pipelines. One goal that this tool shares with other block-based visual tools such as Scratch [31] is to encourage a highly interactive experience. The block visual metaphor allows blocks that correspond to sklearn operators to be snapped together to form an AI pipeline. The shape of the blocks suggests how operators can connect. Their color indicates how they affect data: red for operators that transform data (with a transform() method) and purple for operators that make predictions, such as classifiers and regressors (with a predict() method).
A palette (1) on the left side of the interface contains all of the available operator blocks. Blocks can be dragged and dropped from the palette to the canvas (2). For ease of execution, our tool only allows one valid pipeline at a time, so blocks must be attached downstream of the pre-defined Start block to be considered part of the active pipeline. A hyper-parameter panel (3), shown in Figure 2, displays the hyper-parameters of the selected operator along with a description (when hovering over the hyper-parameter name) and default values, with input boxes to modify each hyper-parameter.
Our tool provides a stage (4) with Before and After tables to give immediate feedback with every input on how the current pipeline affects the given dataset. When a tabular dataset is loaded, the Before table displays its target column on the left and feature columns on the right. When a pipeline that transforms input data is executed, the After table shows the results of the transformations. At any time, a pipeline can be executed on the given dataset by pressing the "Run Pipeline" button. Executing a pipeline will attempt to train the given pipeline on the training portion of the given dataset and then return a preview of all data transformations on the training data in a second table. For instance, in the example shown in Figure 2, executing the pipeline with SimpleImputer and StandardScaler transforms data from the Before table by imputing missing values and standardizing all feature values in the After table. If training is successful, then the trained pipeline is scored against the test set and the score (usually accuracy) is displayed. LowCoder_VP also encourages liveness [33] by executing the pipeline when either the active pipeline is modified or hyper-parameters are configured. For example, adding a PCA operator and setting the n_components hyper-parameter to 2 for the prior example will reduce the feature columns in the After table to 2. This gives the user immediate feedback on the effect that pipeline changes have on the dataset without requiring separate training or scoring steps. This liveness encourages a high degree of interactivity [31].
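As a rough sketch of what executing a pipeline does behind the scenes, expressed here in plain sklearn terms (the actual back-end validates and runs a Lale pipeline, and the function name below is hypothetical): fit on the train split, preview the transformed features for the After table, and score the final predictor on the test split.

```python
# Hypothetical plain-sklearn analogue of the "Run Pipeline" behavior.
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def run_pipeline(steps, X_train, y_train, X_test, y_test):
    pipe = make_pipeline(*steps)
    pipe.fit(X_train, y_train)
    # "After" table: the training data as seen by the final predictor.
    after = pd.DataFrame(pipe[:-1].transform(X_train))
    return after, pipe.score(X_test, y_test)  # score shown to the user

after, score = run_pipeline(
    [SimpleImputer(), StandardScaler(), LogisticRegression()],
    X_train, y_train, X_test, y_test)
print(after.shape, score)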

Natural Language Interface
A potential weakness of visual low-code tools is that users have trouble discovering the right components to use [22]. For instance, the palette of LowCoder_VP contains more than a hundred operator blocks. Rather than requiring users to know the exact name of the operator or scroll through so many operators, we provide LowCoder_NL, which allows users to describe a desired operation in the NL interface (5) text box and press the "Predict Pipeline" button. The tool then infers relevant operator(s) and any applicable hyper-parameters using an underlying natural-language-to-code translation model and automatically adds the most relevant operator to the end of the pipeline. The palette is also filtered to only display the relevant operator(s), as in Figure 2. Pressing the "Reset Palette" button undoes the filtering (so the palette shows all available operators again) without clearing the active pipeline or canvas. Depending on the NL search, the automatically added operator may either have hyper-parameters explicitly defined or potentially relevant hyper-parameters highlighted. As an example, the NL search "PCA with 2 components" will automatically add the PCA operator with the n_components hyper-parameter set to 2 and may highlight other hyper-parameters, such as random_state, for the user to consider setting. Section 4 describes the design and implementation of this model in detail. A potential weakness of natural language low-code tools is that the generated programs can be incorrect, due to a lack of clarity, or ambiguity, in the query, or a lack of context for the model providing inferences [3]. In comparison, visual inputs and representations are unambiguous [20], requiring no probabilistic interpretation, so users can easily understand and manipulate the results returned by LowCoder_NL.
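The following hypothetical helper sketches how a predicted invocation could be turned into a configured block with highlighted hyper-parameters; the "[MASK]" placeholder format and the helper itself are illustrative assumptions, not the paper's code.

```python
import ast
import re

# Hypothetical sketch: explicitly stated hyper-parameter values are set
# on the block, masked ones are merely highlighted for the user to review.
def parse_prediction(pred):
    name, args = re.match(r"(\w+)\((.*)\)", pred).groups()
    set_params, highlight = {}, []
    for part in filter(None, (a.strip() for a in args.split(","))):
        key, value = (s.strip() for s in part.split("=", 1))
        if value == "[MASK]":
            highlight.append(key)                      # suggest, don't set
        else:
            set_params[key] = ast.literal_eval(value)  # e.g. 2, 'linear'
    return name, set_params, highlight

# "PCA with 2 components" -> PCA block with n_components set to 2 and
# random_state highlighted for the user to consider.
print(parse_prediction("PCA(n_components=2, random_state=[MASK])"))
# -> ('PCA', {'n_components': 2}, ['random_state'])
```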
To ground our evaluation of LowCoder_NL, we also provide a version of the tool without a trained language model to users in our study (described in Section 5). In this setting, the NL interface (5) text box becomes a simple substring keyword search that matches the query against operator names. For example, inputting "classifier" filters the palette to only display sklearn operators that contain 'classifier' in the name, such as RandomForestClassifier (but notably not all classifiers, e.g., SVC).
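The keyword condition thus reduces to a case-insensitive substring match, as in the short sketch below, which also illustrates why SVC is missed:

```python
# Sketch of the keyword condition's substring filter over operator names.
def keyword_filter(query, operators):
    return [op for op in operators if query.lower() in op.lower()]

print(keyword_filter("classifier", ["RandomForestClassifier", "SVC"]))
# -> ['RandomForestClassifier']  (SVC is a classifier but is not matched)
```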

AI FOR LOW-CODE
This section discusses the AI that went into LowCoder_NL.

Data Collection
Our goal is to make a large API accessible through a low-code tool by allowing users to describe what they want to do when they do not know how. More specifically, we want to enable users to build sklearn pipelines in a low-code setting, using a natural language interface that can be used as an intelligent search tool. This problem can be solved using language models that can be trained to translate a natural language query into the corresponding line of code [15]. However, such models rely heavily on data to learn this behaviour and would need to be trained on an aligned dataset of natural language queries and the corresponding sklearn line(s) of code demonstrating how a user would want to use such an intelligent search tool. Naturally, we cannot collect such a dataset without this tool, creating a circular dependency. To overcome this challenge, we curate a proxy dataset using 140K Python Kaggle notebooks that were collected as part of the Google AI4Code challenge. From these notebooks, we extracted aligned Natural Language (NL) and code pairs.

Data Preprocessing
We first filter out notebooks that do not contain any sklearn code. This leaves 84,783 notebooks; evidently, many notebooks involve sklearn. We further filter out notebooks with non-English descriptions in all of the markdown cells, resulting in 59,569 notebooks. We then create a proxy dataset by extracting all code cells containing sklearn code and pairing these with their preceding NL cell to get a total of 211,916 aligned NL-code pairs. We remove any duplicate NL-code pairs, leaving 102,750 unique pairs. For each code cell, we then extract the line(s) of code corresponding to an sklearn operator invocation statement.
We discard any code cells that do not include sklearn operator invocation statements but include other sklearn code, leaving a final total of 79,372 NL-code pairs. We separate these into train/validation/test splits, resulting in 64,779 train samples, 7,242 validation samples, and 7,351 test samples. See Section B in the appendix for more details.
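Sketched below is the pairing step under stated assumptions about the notebook layout (Jupyter's JSON format, with cells carrying "cell_type" and "source" fields); the paper's actual extraction scripts are not shown here.

```python
# Sketch of NL-code pairing: each sklearn code cell is paired with
# its preceding markdown cell. Assumes Jupyter's JSON notebook layout.
import json

def extract_pairs(notebook_path):
    cells = json.load(open(notebook_path))["cells"]
    pairs, last_markdown = [], None
    for cell in cells:
        source = "".join(cell["source"])
        if cell["cell_type"] == "markdown":
            last_markdown = source
        elif cell["cell_type"] == "code" and "sklearn" in source:
            if last_markdown:
                pairs.append((last_markdown, source))
    return pairs

# Deduplication across the corpus, mirroring the filtering described above:
# unique_pairs = set(pair for nb in notebooks for pair in extract_pairs(nb))
```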

Tasks
Given the NL query, our model aims to generate a line of sklearn code corresponding to an operator invocation that can be used to build the next step of the pipeline. We consider a range of formulations of the task with different levels of detail, as illustrated in Table 1. Additional examples can be found in Section A of the appendix.

Operator Name Generation.
The simplest task is generating only the operator name from the NL query. This alone can significantly help a developer navigate the extensive sklearn API. We process the aligned dataset to map the query to the name(s) of the operator(s) invoked in the code cell, discarding any other information such as hyper-parameters.

Complete Operator Invocation Generation.
At the other extreme, we task the model with synthesizing the complete operator invocation statement, including all the hyper-parameter names and values. Preliminary results (discussed in Section 5.1.4) show that the model often makes up arbitrary hyper-parameter values, resulting in lines of code that can rarely be used directly by developers.

Masked Operator Invocation Generation.
In this scenario, we mask out all the hyper-parameter values from the invocation statement, keeping only their names.The goal of this formulation is to ensure that the model learns to predict the specific invocation signature, even if it is unaware of the values to provide for the hyper-parameters.

Hybrid Operator Invocation Generation (HOI).
Manual inspection of the NL-code pairs revealed that the queries sometimes explicitly describe a subset of the hyper-parameter names and values to be used in the code. When this is the case, the model has the necessary context to predict at least those hyper-parameter values. Supporting this form of querying enables users to express the most salient hyper-parameters up front. Therefore, we formulated a new hybrid task, where we keep the hyper-parameter values if they are explicitly stated in the NL query and mask them otherwise. This gives the model an opportunity to learn the hyper-parameter names and values if they are explicitly stated in the description, and unburdens it from making up values that it lacks the context to predict by allowing it to generate placeholders (masks) for them.
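The sketch below shows one way to construct the hybrid target: keep a hyper-parameter value only if it literally appears in the NL query, otherwise replace it with a placeholder. The "[MASK]" token and the literal-substring matching heuristic are assumptions for illustration; the paper's exact preprocessing may differ.

```python
import re

# Sketch of building a Hybrid Operator Invocation (HOI) target.
def to_hoi(nl_query, invocation):
    name, args = re.match(r"(\w+)\((.*)\)", invocation).groups()
    out = []
    for part in filter(None, (a.strip() for a in args.split(","))):
        key, value = (s.strip() for s in part.split("=", 1))
        # Keep the value only if it is explicitly stated in the query.
        kept = value if value.strip("'\"") in nl_query else "[MASK]"
        out.append(f"{key}={kept}")
    return f"{name}({', '.join(out)})"

print(to_hoi("Train an SVC with a linear kernel",
             "SVC(kernel='linear', C=1.0)"))
# -> "SVC(kernel='linear', C=[MASK])"
```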
Evaluation: To evaluate the feasibility of predicting code using the different task formulations, we train a simple sequence-to-sequence model (detailed in Section 4.4.1) and compare the results for the various training tasks in Section 5.1.4. We find HOI to be the most accurate and reliable formulation for our setting. We therefore proceed to use this task formulation for training the models.

Modeling
All tasks from Section 4.3 are sequence-to-sequence tasks. We compare and contrast three different deep learning paradigms for this type of task, illustrated in Figure 3: 1) training a standard sequence-to-sequence transformer from scratch, 2) fine-tuning (calibrating) a pretrained "medium"-sized model, and 3) querying a Large Language Model (LLM) by means of few-shot prompting [30]. We elaborate on these models below. Note that we use top-k sampling for our top-5 results. (A comparison of results with other decoding strategies can be found in Sections 3 and 4 of the supplementary material.)

Transformer (from scratch).
We train a sequence-to-sequence Transformer model [35] with randomly initialized parameters on the training data. Our relatively small dataset of ca. 70K training samples limits the size of a model that can be trained in this manner.
We use a standard model size, with 6 encoder and decoder layers, 512-dimensional attention across 8 attention heads, and a batch size of 32 sequences with up to 512 tokens each. We use a SentencePiece tokenizer (trained on Python code) with a vocabulary size of 50K tokens. The model uses an encoder-decoder architecture that jointly learns to encode (extract a representation of) the natural language sequence and decode (generate) the corresponding sklearn operator sequences.
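A from-scratch configuration matching these stated dimensions could be sketched as follows (the training loop, embeddings wiring, and positional encodings are omitted; this is not the paper's code):

```python
import torch.nn as nn

# Seq2seq Transformer core matching the stated configuration.
model = nn.Transformer(
    d_model=512,            # 512-dimensional attention
    nhead=8,                # 8 attention heads
    num_encoder_layers=6,
    num_decoder_layers=6,
)

# Token embedding and output projection over the 50K-token
# SentencePiece vocabulary would wrap this core module.
vocab_size = 50_000
embed = nn.Embedding(vocab_size, 512)
project = nn.Linear(512, vocab_size)
```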

Fine-tuning CodeT5.
CodeT5 is a pretrained encoder-decoder transformer model [41] that has shown strong results when fine-tuned (calibrated) on various code understanding and generation tasks [24]. CodeT5 was pretrained on a corpus of six programming languages from the CodeSearchNet dataset [21] and fine-tuned on several tasks from the CodeXGLUE benchmark [24] in a multi-task learning setting, where the task type is prepended to the input string to inform the model of the task. We fine-tune CodeT5 on the HOI generation task by adding the 'Generate Python' prefix to all NL queries. We experiment with different sizes of CodeT5 models: codet5-small (60M parameters), base (220M), and large (770M).
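A single fine-tuning step could be sketched as below, using the public HuggingFace checkpoint names; the exact prefix formatting and the example NL-code pair are assumptions for illustration.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# "Salesforce/codet5-small" is the public 60M-parameter checkpoint;
# base (220M) and large (770M) scale up the same setup.
tok = AutoTokenizer.from_pretrained("Salesforce/codet5-small")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-small")

# Task prefix prepended to the NL query, as in CodeT5's multi-task setup.
nl = "Generate Python: scale features to zero mean and unit variance"
inputs = tok(nl, return_tensors="pt")
labels = tok("StandardScaler()", return_tensors="pt").input_ids

loss = model(**inputs, labels=labels).loss  # one fine-tuning step's loss
loss.backward()
```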

Few-Shot Learning with CodeGen.
Lastly, we explore large language models (LLMs) that are known to perform well in a task-agnostic few-shot setting [8]. More specifically, we look at CodeGen, a family of LLMs based on standard transformer-based autoregressive language modeling [25]. Pretrained CodeGen models are available in a broad range of sizes, including 350M, 2.7B, 6.1B, and 16.1B parameters. These were all trained on three different datasets, starting with a large, predominantly English corpus, followed by a multi-lingual programming language corpus, and concluding with fine-tuning on just Python data, which we use in this work. The largest model trained this way was shown to be competitive with Codex [10] on a Python benchmark [25]. Models at this scale are expensive to fine-tune and are instead commonly used for inference by means of "few-shot prompting" [30]. LLMs are remarkably capable of providing high-quality completions given an expanded prompt containing examples demonstrating the task [8]. We prompt our model with 5 such NL-code examples. Figure 4 illustrates an example prompt with 3 such pairs. The model learns from the examples in the prompt and completes the sequence, generating the HOI code.
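The few-shot setup could look like the sketch below, using the smallest public Python-finetuned CodeGen checkpoint; the prompt layout is a simplified stand-in for the template in Figure 4 (two example pairs shown rather than five), and the comment-style formatting is an assumption.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono")

# Few-shot prompt: NL queries as comments, HOI code as completions.
prompt = (
    "# scale the features\nStandardScaler()\n\n"
    "# pca with 2 components\nPCA(n_components=2)\n\n"
    "# fill in missing values\n"
)
ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=16, do_sample=True, top_k=5,
                     pad_token_id=tok.eos_token_id)
print(tok.decode(out[0][ids.shape[1]:]))  # model's continuation only
```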

EVALUATION
This section describes the evaluations for the AI modeling that enables LowCoder_NL, along with the user studies that we conducted to analyze the benefits and challenges of using low-code for developing AI pipelines with LowCoder.

Our models were trained on a single machine with multiple 48 GB NVIDIA Quadro RTX 8000 GPUs until they reached convergence on the validation loss. We clip input and output sequence lengths to 512 tokens, but reduce the latter to 64 when using the model in LowCoder to reduce inference time. We find in additional experiments that since few predictions are longer than this threshold, this incurs no significant decrease in accuracy, but speeds up inference by 34%. We use a batch size of 32 for training and fine-tuning all of our Transformer and CodeT5 models, except for CodeT5-large, for which we used a batch size of 64 to improve stability during training.

Test Datasets.
To ensure a well-rounded evaluation, we look at two different test datasets.
(i) Test data (from notebooks): We use the NL-code pairs from the Kaggle notebooks we created in Section 4.2, containing 7,351 samples. These are noisy: some samples contain vague and underspecified Natural Language (NL) queries, such as "Data preprocessing", "Build a model", or "Using a clustering model". Others contain multiple operator invocation statements corresponding to a single NL query, even though the NL description only mentions one of them; e.g., "Model # 2 - Decision Trees" corresponds to DecisionTreeClassifier() and confusion_matrix(y_true, y_pred). Furthermore, these samples were collected from Kaggle notebooks, so the distribution of the NL queries collected from the markdown cells is not necessarily representative of NL queries that real users may enter into LowCoder_NL.
(ii) Real user data: We log all the NL queries that users searched for in LowCoder during the user studies, along with the list of operators that the model returned. This gives us a more accurate distribution of the NL queries that developers use to search for operators in LowCoder_NL. We obtained a total of 218 samples in this way, which we then manually annotated to check whether (i) the predictions were accurate, that is, whether the operators in any of the predictions match the inferred intent in the query, and (ii) the NL query was clear, with an inter-rater agreement of 97.7% and a negotiated agreement [17] of 100%. (See Section E in the appendix for details on the annotation guidelines.)

Test Metrics.
We use both greedy (top-1) and top-K (top-5) decoding (see Section C in the appendix) when generating the operator invocation sequences for each NL query. We evaluate the models' ability to generate just the operator name as well as the entire operator invocation (including all the hyper-parameter names and values) based on the hybrid formulation.
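The two metrics could be computed along the lines of the sketch below, assuming an exact string match for the full invocation and a name-only match that checks the leading operator name (consistent with the "Names only" criterion described in the Figure 5 caption):

```python
# Sketch of top-k scoring against a list of generated candidates.
def name_hit(candidates, gold):
    gold_name = gold.split("(")[0]
    return any(c.split("(")[0] == gold_name for c in candidates)

def invocation_hit(candidates, gold):
    return gold in candidates  # exact-match criterion assumed

preds = ["PCA(n_components=[MASK])", "KernelPCA(n_components=[MASK])"]
print(name_hit(preds, "PCA(n_components=[MASK])"))        # True
print(invocation_hit(preds, "PCA(n_components=2)"))       # False
```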

Task Comparison.
We first train a series of randomly initialized 6-layer Transformer models from scratch on each task formulation from Section 4.3. We compare the models' ability to correctly generate the operator name and the operator invocation based on the formulation corresponding to the training task, using top-1 and top-5 accuracy, as shown in Figure 5. We find that the hybrid formulation of the operator invocation task, while challenging, is indeed feasible and allowed the model to achieve reasonably strong performance when generating the entire operator invocation statement. Contrary to the other task formulations, a model trained with the HOI signal also achieved comparable performance to the model trained solely on operator names when evaluated purely on operator name prediction (ignoring the generated hyper-parameter string). These results highlight that the hybrid representation helps the model learn by unburdening it from inferring values that it lacks the context to predict.

Model Comparison.
We next evaluate the performance of the trifecta of modeling strategies from Section 4.4 on the task of Hybrid Operator Invocation (HOI) generation. We benchmark across different model sizes and compare the performance for both operator name and operator invocation generation using top-5 accuracy in Figure 6. (See Section D in the appendix for additional results and ablation studies.) The results show that the 0.77B-parameter fine-tuned CodeT5 is the best-performing model, with an accuracy of 73.57% and 41.27% on the test data for operator name and operator invocation generation, respectively. The 0.22B-parameter fine-tuned CodeT5 model has comparable performance, but its inference time is approximately 2-3 seconds faster than the 0.77B fine-tuned CodeT5 model, making it more desirable for integration with the tool. To get an idea of the model's performance in the real world, we further evaluate the fine-tuned 0.22B-parameter CodeT5-base from the tool on real user data that was collected during the user studies. The distribution of NL queries collected from the user studies represents the "true" distribution of queries that can be expected from users in a low-code setting. Out of the 218 samples that were collected, we found only one sample in which a user explicitly specified a hyper-parameter value in their query.
We therefore only compute the accuracy of the generated operator name rather than the entire operator invocation (the remaining queries would use default values anyway, so the scores remain the same except for that one sample).
Out of 218 query requests, the fine-tuned CodeT5-base model that was used in our tool answered 150 queries correctly, which would suggest an overall accuracy of 68.8%. However, 33 of these requests targeted actions that are not supported by the sklearn API, such as dropping a column (commonly the territory of the Pandas library). Disregarding such unsupported usage, LowCoder_NL answered 141 out of 185 queries correctly, for an overall accuracy of 76.2%. For 33 additional samples, neither annotator could infer a reasonable ground truth since the prompt was unclear (e.g., "empty"). Leaving these out, i.e., when the prompt is both clear and the operator is supported by the tool, LowCoder_NL was accurate in over 90% (137/152) of completions (refer to Section F in the appendix for additional results).

User Study
We conducted a user study with 20 participants with varying levels of AI expertise to create AI pipelines using LowCoder across four tasks, replacing LowCoder_NL with a simple keyword search in half the tasks. We collect and analyze data to investigate the following research questions: RQ1: How do LowCoder_NL and other features help participants discover previously-unknown operators? RQ2: Are participants able to compose and then iteratively refine AI pipelines in our tool? RQ3: What are the benefits and challenges of using low-code for AI?
5.2.1 Study Methodology. We recruited 20 participants within the same large technology company via internal messaging channels. We expect that citizen developers without formal programming training may also have varying levels of AI expertise and intentionally solicited participants of all backgrounds. Potential participants filled out a short pre-study survey to self-report experience in the following: machine learning, data preprocessing, and sklearn, using a 1 (no experience) to 5 (expert) scale. Participants include a mix of roles, including developers, data scientists, and product managers, working in a variety of domains such as AI, business informatics, quantum computing, and software services. 25% of the participants are female and the remaining 75% are male. 40% of the participants self-reported being novices in machine learning by indicating a 1 or 2 in the pre-study survey.
The study design is within-subjects [11], where each participant was exposed to two conditions: using LowCoder with (NL condition) and without (keyword condition) the natural language (NL) interface powered by LowCoder_NL. The keyword condition used a simple substring filter for operator names. Each participant performed four tasks total across the two conditions. For each task, participants were instructed to create AI pipelines with data preprocessing and classifier steps on a sample dataset, with as high a score (accuracy on the test set) as possible, during a time period of five to ten minutes. Each sample dataset was split beforehand into separate train and test sets. Tasks were open-ended, with no guidance on what preprocessing steps or classifiers should be used.
There were four sample datasets in total and each participant was exposed to all four. The sample datasets are public tabular datasets from the UCI Machine Learning Repository [14]. Two of the tasks (A and D) require a specific data preprocessing step in order to successfully create a pipeline, while two (B and C) technically do not require preprocessing to proceed. For each participant, the order of the conditions and the order of the tasks were shuffled such that there is a uniform distribution of the order of conditions and tasks.
As our study included machine learning novices, we gave each participant a short overview of the basics of machine learning with tabular datasets and data preprocessing. We avoided using specific terms or names of operators in favor of more general descriptions of data-related problems.
We then gave each participant an overview of LowCoder. To mitigate potential biasing or priming, the tool overview used a fifth dataset from the UCI repository [14]. To avoid operators that were potentially useful in the user tasks, the overview used both a non-sklearn operator that was not available in the study versions of the tool as well as sklearn's DummyClassifier, which generates predictions without considering input features. Participants were allowed to use external resources such as web search engines or documentation pages. Nudges were given by the study administrators after five minutes if necessary to help participants progress in a task. Nudges were in the form of reminders to use tool features such as the NL interface. For each version of the tool, study administrators would describe the unique features of the particular version and then have participants perform tasks using two out of the four sample datasets. After performing tasks using both versions of the tool and all four sample datasets, participants were asked to provide open-ended feedback and/or reactions for both LowCoder and the comparison between the NL and keyword conditions.

Data Collection and Analysis.
To answer our research questions, for each participant we collect and analyze both quantitative and qualitative data. For quantitative data, we report on the incidence of participants discovering a previously-unknown operator (RQ1) and the incidence of completing the task and iterating on or improving the pipeline (RQ2). We consider an operator 'previously-unknown' if the participant found and used the operator without using the exact or a similar name. For example, using an NL query such as "deal with missing values" to find the SimpleImputer operator is considered discovering a previously-unknown operator, while a query such as "simpleimpute" is not. We report discovery using the following methods: through LowCoder_NL, a generic web search engine (Google), and scrolling through the palette. Participants may discover multiple unknown operators during the same task, possibly using different methods. For each participant's task, we consider it 'complete' if the composed pipeline successfully trains against the dataset's training set and returns a score against the test set. We consider the pipeline iterated if a participant modifies an already-complete pipeline. More specifically, we consider the following forms of iteration: a preprocessing operator block is added or swapped, a classifier block is swapped, or hyper-parameters are tuned. We report each of these as separate types of pipeline iteration. Participants may perform multiple types of iteration during the same task. Both sets of quantitative metrics are counted per task (80 tasks total for 20 participants, 40 tasks per condition).
We use qualitative data to answer RQ3. This data focuses on the participants' actions in LowCoder, commentary while using the tool and performing tasks, and answers to open-ended questions after the study. Specifically, the same two authors that administered the user study analyzed the notes generated by the study, along with the audio and screen recordings when the notes were insufficient, using discrete actions and/or quotations as the unit of analysis. The first round of analysis performed open coding [11] on data from 16 studies to elicit an initial set of 73 themes. The two authors then iteratively refined the initial themes through discussion, identifying 13 axial codes, which are summarized in Figure 7. The same authors then performed the same coding process on a hold-out set of 4 studies. No additional themes were derived from the hold-out set, suggesting saturation.

Study Results.
We answer RQ1 and RQ2 using quantitative data collected from observing participant actions per task and answer RQ3 through open coding of qualitative data. RQ1: How do LowCoder_NL and other features help participants discover previously-unknown operators?
Table 2 reports how often participants discovered previously-unknown operators during their tasks. 80% of the participants discovered an unknown operator, across 63.8% of all 80 tasks in the study. Participants discovered unknown operators in 82.5% of the 40 NL-condition tasks compared to 45% of the 40 keyword-condition tasks. The odds of discovering an unknown operator are significantly greater in the NL condition than in the keyword condition (p ≪ 0.001) using Barnard's exact test. We examine the methods of discovery in more detail, noting that LowCoder_NL is only available in the NL condition, whereas web search and scrolling through the operator palette are available in both conditions. We note that the participants were not able to use the keyword search to discover unknown operators, since it requires at least part of the exact name. Using LowCoder_NL, participants discovered unknown operators in 75% of tasks in the NL condition, as opposed to an average of 22.5% using web search engines (12.5% in the NL condition and 32.5% in the keyword condition) and an average of 20% by scrolling through the operator palette (12.5% in the NL condition and 27.5% in the keyword condition). Within the NL condition, the odds of an unknown operator being discovered are significantly greater using LowCoder_NL as opposed to both web search (p ≪ 0.001) and scrolling (p ≪ 0.001). When splitting on the experience of the participant, we find statistically greater chances of novices discovering operators in the NL condition using LowCoder_NL as opposed to web search (p = 0.013) but not scrolling (p = 0.086). Non-novices were significantly more likely to discover operators using LowCoder_NL compared to web search or scrolling (p ≪ 0.001, p ≪ 0.001). Results do not change if considering web searches or scrolling across all 80 tasks. These results suggest that LowCoder_NL is particularly helpful in discovering previously-unknown operators, especially compared to web search, but novices still face some challenges. We discuss these challenges in RQ3.
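The headline comparison can be reproduced as a sketch, with the 2x2 contingency counts derived from the reported percentages (82.5% vs. 45% of 40 tasks each):

```python
from scipy.stats import barnard_exact

#                 discovered  not discovered
table = [[33,  7],   # NL condition      (82.5% of 40 tasks)
         [18, 22]]   # keyword condition (45% of 40 tasks)

res = barnard_exact(table)
print(res.pvalue)  # very small, consistent with the reported p << 0.001
```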
RQ2: Are participants able to compose and then iteratively refine AI pipelines in our tool? Table 3 reports how often participants iterated on pipelines. Participants completed 82.5% of the 80 tasks in the study and further iterated their pipelines in 72.5% of the tasks. Splitting on condition, the NL condition has 85% task completion and 72.5% further iteration, while the keyword condition has 80% task completion and a 72.5% iteration rate. Swapping classifiers was the most common form of iteration at 48.8%, followed by adding or swapping preprocessors at 43.8% and setting hyper-parameters at 30%. Comparing novices to non-novices, both types of participants are mostly successful in iterating pipelines, with no significant differences in iteration rate using Barnard's exact test (p = 0.109). This result holds when iterating preprocessors (p = 0.664) but not classifiers (p = 0.038) nor hyper-parameters (p = 0.005). Non-novices are more likely to complete the task than novices (p = 0.002). Regardless of experience, both novices and non-novices are able to iteratively refine their pipelines, but novices face some challenges compared to non-novices regarding actually completing the task. These challenges are discussed in the next research question.
RQ3: What are the benefits and challenges of using low-code for AI? Figure 7 shows our 13 axial codes for answering RQ3. These codes broadly represent three overarching themes regarding working with low-code and machine learning: 1) Discovery of machine learning operators relevant for the task at hand, 2) Iterative Composition of the operators in the tool, and 3) Challenges that participants, particularly novices, face regarding working with machine learning and/or using low-code tools. We also collect Feedback from participants to inform future development of LowCoder. Due to space limitations, we only report on a selection of the 13 axial codes and 73 codes derived from open coding (refer to Section G in the appendix for the full list of codes).
For the first category of Discovery, our analysis derived two axial codes related to the participants' goal while attempting to discover operators: 1) Know "What" Not "How", where participants have a desired action in mind but do not know the exact operator that performs that action (19 out of 20 participants experienced this axial code), and 2) Know "What" And "How", where participants have a particular action and operator in mind (18/20). We dive deeper into Know "What" Not "How", which includes the code where participants Discover a previously-unknown operator using NL (16/20). We found in RQ1 that LowCoder_NL was helpful in finding unknown operators compared to other methods. The qualitative data suggests that participants were able to find unknown operators using LowCoder_NL in cases where they had an idea of the action to perform but did not know the exact operator name, for a variety of reasons. For example, when discovering SimpleImputer with LowCoder_NL, P11 noted that they "never used SimpleImputer but had an idea of what I wanted to do, even though I generally remove NaNs in Pandas." Another example is P16, who "preferred the [NL version of LowCoder], even when I was doing Google searches, they... didn't give me options, your tool at least returns some options that I can try out and swap out." As a novice, P16 had difficulties finding the names of useful operators from web search results, as opposed to LowCoder_NL, which directly returned actionable operators. We note that challenges regarding general web search is also an axial code.
For the second category of Iterative Composition, we derived four axial codes related to participant behaviors while attempting to compose and iterate on pipelines: 1) General Exploratory (13/20) iteration, 2) Exploratory iteration where participants select operators or hyper-parameters seemingly at Random (18/20), 3) Targeted (19/20) iteration, where participants select operators or hyper-parameters with a particular intent, and 4) Seeking Documentation (15/20), where participants search for documentation to inform iteration decisions. We note that for both forms of Exploratory iteration and for Targeted iteration, we find examples of participants iterating classifiers, preprocessors, and hyper-parameters. For the axial code of seemingly Random iteration, participants, especially (but not exclusively) novices, when unsure of how to proceed, tended to try out arbitrary preprocessors or classifiers. This was more common for more difficult tasks that required particular data preprocessing to proceed. For example, non-novice P9 remarked "I'm not familiar enough with it, so do I Google it or brute force it? [...] I don't even know what to Google to figure this out... I guess I'll do some light brute-forcing" and proceeded to swap preprocessors from the palette in and out. In contrast, the axial code of Targeted (19/20) iteration has codes that reflect particular intentions that participants derived from observations within the tool, such as Noticing error messages (10/20) or Making use of data tables in task (14/20). As an example of the data tables case, P11 realized through the Before data table that the given dataset had "too many columns" and added the IncrementalPCA operator along with setting its n_components hyper-parameter to 5. Upon seeing the change in data in the After data table, they remarked, "Wow... I really like that I can see all the hyper-parameters that I can play with" and proceeded to tune various hyper-parameters.
The third category is the variety of Challenges that participants faced while using LowCoder and performing the machine learning tasks, for which we derive six axial codes: 1) General challenges (10/20) faced by participants that are not particular to our tool or tasks, 2) Not Knowing "What" (15/20), where participants experienced difficulties due to knowing neither "what" nor "how" to begin, 3) General Discovery challenges (15/20), 4) Discovery challenges around using Web search (14/20), 5) Discovery challenges when using Tool search (17/20), or specifically using LowCoder_NL, and 6) Tool Functionality (19/20), which describes challenges participants faced using (or not using) LowCoder features. We dive deeper into the axial code of Not Knowing "What" and note its contrast to the Know "What" Not "How" axial code, where participants may have intentions but not know how to execute them, and the Exploratory iteration axial code, where participants may not have specific intentions but know how to iterate. All novices (8/8) and most non-novices (7/12) experienced this challenge. The primary code is that participants Did not know "what" they wanted to do (11/20). One possible cause of this lack of progression is choice paralysis; for example, on P17's first task: "first things first, I don't even know where to begin... right now it's super overwhelming, I guess I'll start throwing stuff in there." We also describe the axial code of Tool search (17/20), where participants had difficulties forming search queries for LowCoder_NL.
Participants noted that despite being intended for general natural language, the interface still Needed a specific vocabulary (8/20). As P19, a novice, described it, "I get the idea of how it's supposed to work but it's hit and miss... even if I use very layman's terms... it expects a non-naive explanation of what needs to be done." Part of this challenge may be due to a mismatch between the natural language in the Kaggle notebooks used to train LowCoder_NL and the language used by novices.

DISCUSSION
Our results show that both the LowCoder_VP and LowCoder_NL components were helpful with aspects like operator discovery (RQ1) and iteratively composing pipelines (RQ2), even for novice participants. This is useful for citizen developers who have an idea of what they would like to do but do not fully know how to accomplish it, perhaps due to a lack of formal programming training. In fact, our qualitative analysis (RQ3) reveals that a number of our participants (including all novices who participated) struggled with knowing what to do. End-users writing software face similar "design barriers" [22], where it is difficult for a non-programmer to even conceptualize a solution. In contrast to other popular low-code domains such as traditional software [31], the domain of developing machine learning systems is particularly difficult in this regard due to its experimental nature, where progress has a high degree of uncertainty [38]. This uncertainty then requires an abundance of judgment calls that rely heavily on prior machine learning experience [19], which novices lack. Some participants in our studies echo this, identifying that some ML knowledge is necessary to use our tool. This suggests that citizen developers who have some data science knowledge but lack programming training, such as statisticians, may benefit the most from our tool. A further improved low-code machine learning tool could thus be made more suitable for novice citizen developers by guiding them to discover the what along with the how, i.e., by helping developers acquire the necessary ML knowledge.
A potential extension, offered by a study participant, is to provide suggestions in the form of templates or recipes for pipelines.
These suggestions could also be contextual to the given dataset or active pipeline, for example automatically suggesting encoders when detecting categorical features. Ko et al. [22] also suggest templates as a possible solution for design barriers. A related suggestion made by a number of our study participants is to implement data visualization and summarization tools for the given dataset, such as plots, charts, confusion matrices, etc. These visualizations could themselves inform contextual suggestions; a histogram detecting a non-standard distribution may suggest the need for a StandardScaler. These contextual suggestions may also help guide developers in what to do, making for a more generally useful low-code tool for both citizen and experienced developers alike.
Threats to Validity: The user study for LowCoder has several limitations. The study focused on relatively small, public tabular datasets and scikit-learn operators and may not be indicative of other machine learning tasks, such as deep learning on large datasets. Participants also all come from the same large technology company and may not be representative of general users. However, we did intentionally elicit participation from a variety of groups and experience levels to mitigate this. As our user study has a within-subjects design, there may be potential learning effects between tasks and conditions. In fact, we observed some cases of this (8/20), with some participants explicitly mentioning selecting particular operators due to the previous task. We mitigated this learning effect by randomizing the order of tasks and conditions, as well as by having two tasks (A and D) require the use of preprocessing operators that were not applicable to the other tasks.

CONCLUSION
We developed LowCoder, a low-code tool that combines visual programming, LowCoder_VP, and programming by natural language (PBNL), LowCoder_NL, to help developers of all backgrounds create AI pipelines. We used LowCoder to provide some of the first insights into whether and how visual programming and PBNL help programmers by conducting user studies across four tasks with (NL condition) and without (keyword condition) LowCoder_NL. Overall, LowCoder helped developers compose (85% of tasks) and iterate (72.5% of tasks) over AI pipelines. Furthermore, LowCoder_NL helped users discover previously-unknown operators in 75% of tasks, compared to just 22.5% (12.5% in the NL condition and 32.5% in the keyword condition) using web search. Our qualitative analysis showed that PBNL helped users discover how to implement various parts of the pipeline when they knew what to do. However, it failed to support novices when they lacked clarity on what they wanted to accomplish, which may suggest a worthwhile target for improving AI-based program assistants. Our work demonstrates the promise of combining both an AI-powered natural language interface and a visual interface for helping developers of all backgrounds create AI pipelines without writing code.

Figure 1: Relationship between LowCoder and other low-code for AI tools.

Figure 2: LowCoder interface with labeled components, described in the text.

Figure 3: Overview of the "trifecta" of training approaches used in contemporary deep learning: smaller models are directly trained from scratch on downstream task data; medium-sized models (100M-1B parameters) are pretrained with a generic training signal and then fine-tuned on task data; large models (>1B parameters) are only pretrained on very large datasets and are prompted with examples from the training data as demonstrations, followed by the query.

Figure 4: Example of a few-shot prompting template with 3 example pairs for querying a large language model in our study.

Figure 5: Accuracy of Transformer models trained from scratch on various task formulations. 'Invocation' test results refer to the specific invocation formulation of the training task, while 'Names only' just considers whether the generated code starts with the correct operator name. Only the Hybrid Operator Invocation setting yields useful quality on both tasks.

Figure 6: Accuracy vs. model size based on top-5 sampling. (*The 16B CodeGen uses top-3 due to memory constraints.) We compare the three modeling paradigms, namely training a transformer from scratch, fine-tuning CodeT5, and few-shot prompting CodeGen, on both Operator Name generation and Hybrid Operator Invocation generation.

Figure 7: Axial codes from our qualitative analysis.

Table 1: Task formulations highlighting the code components: mask, operator name, hyper-parameter name, hyper-parameter value. The Hybrid Operator Invocation setting does not mask 'balanced' as it appears in the query.

Table 2: Incidence of tasks where participants find previously-unknown operators per condition (40 tasks for all, 16 tasks by novices, and 24 by non-novices). Note that rows may not sum to 100% as participants can use multiple methods to discover operators for a given task or not discover operators at all.

Table 3: Incidence of tasks where participants complete and iterate on preprocessors, classifiers, and hyper-parameters.

Table 4: Examples of NL-code pairs for different task formulations.

Table 5: Distribution of hyper-parameters in hybrid operator invocations.

Table 6: Accuracy scores for the Hybrid Operator Invocation task across different model variations (*due to memory constraints).

Table 7: Distribution of various properties annotated manually for real user data. (Columns: Task, Total, Accurate prediction, NL unclear, Partially correct, Not supported by tool, No output returned.)