Where to Hide a Stolen Elephant: Leaps in Creative Writing with Multimodal Machine Intelligence

While developing a story, novices and published writers alike have had to look outside themselves for inspiration. Language models have recently been able to generate text fluently, producing new stochastic narratives upon request. However, effectively integrating such capabilities with human cognitive faculties and creative processes remains challenging. We propose to investigate this integration with a multimodal writing support interface that offers writing suggestions textually, visually, and aurally. We conduct an extensive study that combines elicitation of prior expectations before writing, observation and semi-structured interviews during writing, and outcome evaluations after writing. Our results illustrate the individual and situational variation in machine-in-the-loop writing approaches, suggestion acceptance, and ways the system is helpful. Centrally, we report how participants perform integrative leaps, by which they do cognitive work to integrate suggestions of varying semantic relevance into their developing stories. We interpret these findings, offering modeling and design recommendations for future creative writing support technologies.

produce a good text progression. Anchoring and developing that "good idea" so that the reader can be invested takes a great deal of effort. Today's language generation models, such as GPT-2 [81], GPT-3 [15], and their successors, are typically silent on the inner processes of negotiation and decision that a human writer works through. Additionally, the forms that contributions from these systems might take to influence writing are not limited to text; writers engage multiple perceptual channels through their work: they may activate multisensory imagination through evocative imagery, invoking auditory and olfactory phenomena, and other forms of sensory description.
We propose to investigate how participants engage with a multimodal writing support interface that bridges generated writing suggestions with multimedia retrieval to produce concept representations simultaneously in sight, sound, and language. We pair this interface with an extensive study that combines surveys, interaction, and semi-structured interviews during observed, think-aloud writing sessions.
Through this study, we examine and report in detail how participants receive, consider, and integrate suggestions from an intelligent tool into their writing. We explore prominent axes of individual and situational variation in these integrative behaviors, noting the different kinds of "leaps" participants make to understand suggestions and to make the compositional decisions and actions needed to incorporate the new information they contain, ranging from copying and pasting to re-writing core aspects of their entire story. We are specifically motivated by the following questions: (RQ1) What kinds of assumptions, expectations, and understanding are brought into interacting with an AI creative writing system by different users? (RQ2) How do different users process and integrate different kinds of writing suggestions, and how and why do they accomplish this? (RQ3) How does this suggestion-informed interaction compare with unassisted (or potentially human-assisted) writing? (RQ4) What does the combination and interaction of these three factors mean for intelligent writing support tools?
We design our study from a hybrid Expectation-Process-Outcome model (a visual depiction is shown in Fig. 1). We seek to capture prior expectations, which we do through what we call explanatory models, combining aspects of mental models and folk theories of technology. We study the process by closely observing participants as they write with the interface, asking questions, and encouraging them to describe and reflect on their thoughts and decisions. Finally, we include an evaluative survey through which participants report on their experience both independently and in comparison to a "blank page" style version of our interface. By combining these sources of information, we seek to document and communicate a range of behaviors, needs, creative processes, and results.
Our findings suggest that (1) different interaction approaches affect writer needs from system suggestions, (2) varied prior assumptions and explanatory models exist and may be both anchored to and adjusted during the interaction process, (3) suggestions support writing in more and less visible and direct ways, and (4) participants perform different kinds of integrative leaps, involving cognitive work to make suggestions useful to their writing. We interpret these findings and make commensurate design recommendations for future creative writing support tools.
Fig. 1. Our Expectation-Process-Outcome study model. We seek to capture (A) each participant's "explanatory models" in areas relevant to our system, (B) the most salient features of their interaction and sense-making process in writing with it, and (C) their evaluation of the outcomes and experience.

Studying Writing
Flower and Hayes [36] describe what they term a cognitive process theory of writing. They model several components as part of this: the task environment includes the text produced up to a given point, as well as the rhetorical problem at hand, and the writing process(es) involve planning (generating ideas, organizing them, and setting goals), translating (transforming ideas into visible text), and reviewing (evaluating and revising). At a theoretical level, these components are of interest to us because what they seek to model is how writers make decisions while writing and what factors affect this. We similarly seek to understand how writers make decisions and meaning through interaction with a supporting AI tool.
At a methodological level, they rely on protocol analysis, wherein participants perform an assigned rhetorical task as they think about their actions out loud and are recorded doing so. They note that this avoids the drawbacks of introspective analysis, in which participants report on their actions after the fact, observing that this tends to be colored by what they think they should have done. Participants are also instructed not to self-analyze under this method. While this provides a helpful starting point, our circumstance is different: participants are not following a task they know how to do and reporting on it. Rather, they are interacting with a new system and engaging in a new process (or a new version of a familiar process of writing), and as such we need more information from them to adequately understand aspects of their relationship with this system and adapted process. For this, we turn to an interpretive methodology, informed by thick description, as we will describe later in this section.
More recent psychological approaches to studying creative writing have emphasized the role of retrieval, conceptual combination, and analogical mapping as some of the fundamental cognitive processes that explain creative cognition [35,48,95]. Other work on writing has emphasized social-interactive [69] and sociocultural models of writing, studied with a range of social scientific empirical methods that consider how writing activity is "situated in concrete actions that are simultaneously improvised locally and mediated by prefabricated, historically provided tools and practices" [78]. Robertson et al. characterized the conditions under which email reply suggestions generated by an AI system are perceived as problematic [84]. They highlight that social context, not just content, matters: "brief suggestion-like email replies" that ignore social context have the potential to turn otherwise appropriate replies into inappropriate ones. Recognizing that these factors are also essential for understanding writing, especially as writing is re-situated and re-mediated with new technologies, our approach is informed by heterogeneous empirical studies of writing.

Multimodal Feedback
By presenting various communication channels, multimodal systems are considered to support human information processing by using a range of cognitive resources. This assumption is largely based on cognitive theories proposing multiple, modality-specific processing resources [5,72]. One goal of a well-designed multimodal system is to integrate complementary input modes to create a synergistic blend, permitting each mode's strengths to overcome weaknesses in the others and support "mutual compensation" of feedback errors [71].
In addition to these cognitive benefits, multimodal feedback offers us a rich window into participants' reasoning and process of sense-making. While language processing alone demands high engagement to process and to make sense, we aim to study how a complementary blend of information representations can allow us to uncover varied aspects of participants' interaction with an intelligent system for creative enhancement. In this section, we look into our two non-textual modalities for feedback: still visual input (images) and auditory input (sound recordings). We review how each of these modalities has been used to support users on a given task, and consider how these approaches might indicate possible benefits for our task.
Imagery.
iTell [56] supports retrospective storytelling with digital photos. It employs a design process based on providing support to help novice storytellers engage in the composition process like experts. To assist users with creating retrospective narratives about their personal experiences, iTell presents the users with four steps to complete: Brainstorm, Organize, Writing, and Add Personal Media. The user must finish each step before proceeding to the next and cannot skip a step without completing it at least once. One interesting finding from the workshop conducted as part of their user study concerns the influence of the media modality on the novices' retrospective story development, how novices approach retrospective storytelling, and what is needed to make novices successful retrospective storytellers. In particular, the authors show benefits for novices of having access to mixed media in the story development process. One of the significant differences between iTell and our system is that iTell requires the user to gather any media material beforehand in order to retrieve and incorporate it during a writing session. Another significant difference is the lack of text suggestions to help the user in their writing.
Another example of a support tool for the development of new ideas is Design Daydreams (DD) [64]. DD is part of a suite of computational design tools that integrate ambiguity and juxtaposition into systems that users can use to discover new ideas. Using a low-tech augmented reality system to overlay digital images on top of objects visually, the Design Daydreams augmented "post-it note" fluidly extends the inspiration designers find online into the physically-interactive and collaborative brainstorming environment. Feedback suggested that the low fidelity of the tool provided a natural ambiguity that left room for interpretation as designers juxtaposed digital and physical concepts together to create new ideas. Like these projects, our visual feedback aims to discover mental constructs related to the story. It does this either indirectly, through the mood created by the image palette, or directly by layering diverse representations and allowing object or concept features to be distinguished and integrated into the developing story.

Audio.
Specific attributes of the surrounding environment have been shown to support memory, foster creativity, enhance sensitivity to details, and balance cognitive load [21]. For instance, Mehta et al. found that moderate noise levels, like a coffee shop's ambient sound, facilitate abstract processing [63]. Zhao et al. built a multimodal mediated work environment, where they demonstrated effects on occupants' ability to focus and recover from stressful situations [100]. Sounds with attributable causes (i.e., where humans are able to aurally discern the source) have also been shown to impact memory [82], language learning [99], and, as a feedback modality, attention and information communication [42]. Motivated by this, we integrate an audio feedback system that retrieves sound by concept (rather than by content), to offer a semantically relevant aural dimension that may confer these benefits in the process of writing a story.

Interpretive Approaches
We approach our observation of participants' interaction through the lens of interpretation. Interpretation as a concept has been used in a number of papers in HCI [8,55,65,89]. The interpretive perspective we maintain in this work is informed by anthropological approaches to make visible the alignments of factors of interaction that would otherwise go unnoticed due to common-sense understanding. Our theoretical approach is built on the dichotomy of social theory concepts of understanding as causal explanation (erklären) versus understanding as interpretation (verstehen).
Following Max Weber's distinction [96] between explanation that captures the causal sequence of actions and understanding that attends to the meaning of those actions, our research aims to analyze the interaction of the person with the AI system from the perspective of the latter (i.e., "meaning"). More specifically, we attend to the meaning of actions from the point of view of the participants, who organically construct meaning in the process of engaging with complex systems. As such, interpretation in this research is a form of understanding that makes it possible to discern the meaning production that occurs within the interaction between the human and the AI system. To that end, we aim to identify and observe how the interaction is influenced by the explanatory models of AI that users have. We look at what type of conceptualization work is done on the part of the users in the process of engaging with AI, both in the world (prior assumptions and expectations) and locally in our study (impressions, integrative processes, and interactive reasoning), and how they rely on such conceptions to navigate the process and products of co-writing with AI.
"Understanding" implies that the meaning of actions can be transferred through co-presence with a participant in one space, building rapport, and engaging with the participant so as to understand how people make sense of the world around them. For this research we aimed to complement the quantitative and qualitative data obtained from survey questions with qualitative data from semi-structured in-session interviews, observation of the participants, and what is called, in the social sciences, "thick description" [43]. Thick description allows us to go beyond the observation of causal actions and acquire interpretation by the actors of not only their own actions but also of the context within which they operate. We detail our specific approach in a later section on study design.

Explanatory Models and Expectations of AI
We use the term "explanatory models" to refer to the super-set of two kinds of conceptual representations of computational systems, commonly referred to as "mental models" and "folk theories" respectively. Here we describe each and outline our rationale for combining them in our work.

Mental and Conceptual Models.
Human-AI researchers often use the concept "mental model of AI," a term informed by psychology and cognitive science. In the context of human-machine collaboration, and even for human collaboration alone, a great deal of work has illuminated the importance of mental models in promoting team success [27]. In the case of AI, it has been shown that optimal inference does not necessarily yield optimal human-AI team performance. Bansal et al., for example, study mental models of AI performance in the context of human-AI teams [7]. They do note, however, the relevance of other types of mental models (such as those of how the system works) to collaborative settings.
Gero et al. study human-AI collaboration in a game setting, and their results suggest that understanding of the system alone does not sufficiently develop appropriate conceptual models [45]. The same authors distinguish between mental and conceptual models by indicating that the latter are held by those with extensive knowledge of the system, e.g., designers and experts. This follows Norman's formulation of these two terms, where he suggests that conceptual models are "invented by teachers, designers, scientists, and engineers" [68], noting that researchers then conceptualize the mental models through experiment and observation in order to produce systems and conceptual models that direct these mental models to be coherent and usable.
Folk Theories.
While mental models offer insight into cognitive representations of a system's operation developed through experience, intuitive theories about the world structure learning, understanding, and cognition more broadly, in diverse ways despite a common psychological substrate [44]. Folk theories are a form of expectations that are based on some experience, but are not necessarily systematically checked [83]. Mental models are structured accounts of a system's mechanics and behavior, but folk theories and implicit beliefs arise from a great many sources of information and interaction, and are not constrained to nor will they necessarily contain an understanding of "the relationship between inputs and outputs" [40]. Folk theories may be especially salient in the domain of AI systems, given their dramatic and continued impact on culture and society. Few kinds of technical systems are as pervasive in the collective consciousness, due to rapid advances, news reports, economic incentives and concerns, and potentially profound implications for human identity and activity.
Folk theories have been captured in the study of cyber-social systems, often relating to algorithms employed in social media platforms. These theories may be elicited through direct investigation by researchers, often through interviews and associated methods, or indirectly, through inferential procedures applied to data "in-the-wild", such as posts on a social media platform. As an example of the former, Eslami et al. elicit theories about the operation of Facebook's news feed algorithm and a designed alternative [32]. In contrast, DeVito et al. aggregate and analyze over 100,000 tweets to determine user folk theories that contribute to resistance against changes in Twitter's algorithmic content curation system [29].
Why combine user mental models and folk theories?
To understand the prior assumptions, expectations, and understanding that our participants brought into their interaction with our system, we captured their explanatory models in related areas. Specifically, we identified related areas as AI and AI creativity, human creativity in writing, and differences between humans and AI in creativity and writing. Our system is specific enough that they are unlikely to have encountered a substantially similar system before, and are accordingly unlikely to have developed intuitive theories or mental models of our system. As such, these contextually informative areas may have bearing on their experiences, writing and sense-making processes, and evaluations of the outcome.
Creative processes with AI allow varied creators to expressively produce diverse artifacts. We aim to similarly capture complex and multidimensional explanatory models, considering both cognitive and sociocultural factors to obtain extensive representations of participants' prior assumptions, expectations, and understanding. We build on the concept of mental models to consider aspects including the user's beliefs about and attitudes towards AI and creativity and about the production of creative writing artifacts, and consider how these might affect downstream evaluations of the process of interacting with our system. We believe this approach can inform design processes to yield tools that have clear affordances in creative contexts, and support a diversity of needs and practices.

SYSTEM PROTOTYPE
Our experimental prototype consists of two writing interfaces: Editor-Green, a minimal "blank page" tool, and Editor-Red, our augmented multimodal tool. To minimize cognitive bias when conducting our user study, we chose to give names to the editors that would seem roughly equivalent. The system also contains a server that runs language models, as well as a real-time database to track inputs, responses from the server, and interactions, e.g., interface settings. Fig. 2 shows both interfaces, including an active multimodal response in (B) with images and sounds. Fig. 3 shows the underlying data flow through the system architecture that makes these interfaces possible.

Writing Interface
Editor-Red contains a page-like typing environment, with two adjacent suggestion blocks (placed below the writing page on mobile devices). The two blocks offer two different types of suggestions, corresponding to text generation models fine-tuned on two different datasets. These are returned and presented through images and sounds in addition to suggested text. A control panel at the top contains basic formatting features (text styles including heading levels, as well as font formatting) and controls for invoking language model suggestions and switching multimodal response displays. One switch selects between two image presentations: by default (the "on" position), images are displayed as a "grid" with full opacity behind the writing and response display elements. When toggled off, images are displayed as an "overlay", with multiple images stacked on top of each other and their opacity set low enough that they form an environment together. Two modality switches, one each for images and sounds, turn the inclusion of these modalities on and off. A slider adjusts the volume of retrieved sounds. Finally, a "Suggestion" button launches a query for a new suggestion, and associated images and sounds, based on the current text of the user's story. Suggestions can also be invoked via the tab key, and after about 10 seconds without any writing activity, a hint regarding suggestion availability appears (indicating that tab can be pressed for suggestions). The suggestion texts are colored with a gradient to clearly distinguish them from user-written text, and are virtually "typed out" over a small amount of time to visually illustrate their narrative structure. A text field at the bottom includes credits for the presented photographs.
Several design features of our writing interface are based on popular word processing platforms, for example, a paper-shaped writing area and a toolbar at the top for text formatting and other controls. The other design choices we made, for the new features we proposed, were refined through early prototyping and pilot testing with fourteen users, through which we discovered several usability challenges and corresponding solutions. Based on observations made during these sessions, we added tab-based suggestion invocation and re-positioned the button to avoid accidental triggering. While we initially designed the image display as an overlaid blend underneath the writing area, we found that the grid-style display allowed for more explicit idea borrowing when desired, due to the increased clarity of individual images, and so made this display the default. We selected a number of images (20) that we found yielded sufficient variety from individual searches while maintaining individual clarity on typical screen sizes. Finally, we had originally placed suggestions at the bottom of the interface, but found that this constrained space for writing and required more scrolling. We moved them to the right of the writing area to both position them as secondary and allow quick glancing. We additionally added a gradient to this text to firmly distinguish it from user-written text.
To parallel this augmented interface, we provide another with the same core features, design, and layout that we call Editor-Green. This editor includes the text formatting features and the page-like writing environment. We use this interface as a point of comparison, based on the core features of common writing tools. The additional Editor-Red features are turned on and off, effectively switching between the interfaces, by clicking the interface title in the top left corner. We did this to allow flexible switching in the study context, while reducing the likelihood of accidental switching, which we observed in early iterations.

Language Models
In order to produce relevant suggestions, we expected that a pre-trained language model would need to be subsequently fine-tuned on a dataset containing many useful examples. However, narratives develop simultaneously at multiple hierarchical structural levels, and single suggestions do not capture any variation in this important property. As noted earlier, prior work has provided and investigated multiple simultaneous suggestions to demonstrate different directions, which points to multiple suggestions being helpful. However, stories allow us to make some domain-specific assumptions that can make these parallel suggestion channels semantically relevant. As such, we produce two variants of the base language model to capture overall plot and local description respectively, offering multiple semantically distinct channels of suggestions. Stochastic variation is also available by simply requesting additional suggestions in sequence without any additional writing (though we note this does introduce additional delay).
We fine-tuned the same language model on two different datasets, producing two final models. The base model is a medium-sized GPT-2 architecture with pre-trained weights obtained from Hugging Face. The first experimental model is fine-tuned on a corpus of movie summaries [6], which we observe tend to contain high-level plot components and event sequences. As such, we label suggestions arising from this model as "Plot" suggestions. The second is fine-tuned on a writing prompts dataset [33], which features prompts and story responses taken from a prominent online forum for amateur fiction. Following the observation by Fast et al. that amateur fiction "tends to be explicit about both scene-setting and emotion, with a higher density of adjective descriptors" [34], as well as our own review of this dataset and the fine-tuned model, we label this second experimental model's outputs as "Description" suggestions.
Fig. 2. Our experimental writing interfaces. (A) is a "blank page" editor with only basic formatting features, while (B) augments this with generated suggestions and multimodal feedback. In the second interface, users write text (C) and can request suggestions by invoking the Suggestion button (E) or using the tab key (a hint is shown after about 10 seconds of inactivity). Two types of suggestions, corresponding to text generation models fine-tuned on two different datasets, are returned (F) and presented through images and sounds in addition to suggested text. The user can turn these stimuli on or off, or change the image presentation to an overlay (D).
For each query, our system produces responses from both models. When sampling from the models, we employ a top-k sampling strategy, with k = 5, temperature = 0.5, and a repetition penalty of 1.0; these parameter settings were based on initial experiments, i.e., looking at the models' outputs for different combinations of parameter values and making subjective judgements about quality, consistency, and relevance given a variety of input prompts. We decided on a maximum output suggestion length of 40 tokens (tokens are sub-word units, so there is no direct relationship to the number of words), finding that this suffices for many scenarios, and weighing it against the time and computation needed for autoregressively sampling longer sequences. This cost-detail tradeoff is one of many we needed to address in the design process of this prototype. Others included model size, i.e., number of parameters, for which more parameters typically result in greater coherence in modeling long-term semantic consistency but slower performance and consequently significantly greater latency, and the number of model options, with similar constraints. While other work has hosted multiple different-sized models in order to propagate this tradeoff to a user decision at the interface's point of querying [9], we wanted to focus our approach on the specific semantic channels of plot and descriptive detail development to support story writing and further reason about suggestion incorporation. As such, we opted to fine-tune two medium-sized models, which balance interactive responsiveness with expressive language modeling.
Fig. 3. (A) The user enters text into the interface, which is, upon request, transmitted to (B) a backend application. This operates two causal language models, fine-tuned for plot-level and description-level suggestions respectively. The text is tokenized and input to both, and generated suggestions are captured. Keywords are then extracted (using the RAKE algorithm) for use in the multimedia search queries: (C) calls to the Unsplash and Freesound APIs retrieve semantically associated image and audio content respectively, and these are sent back to the interface along with the suggestions to be presented to the user. In parallel, all usage data is logged into (D) a real-time Firebase database. We track requests (including the state of the story at each request time), system responses (suggestions, links to media), the latest story state, and changes in settings (e.g., turning any specific modalities on and off). The logging system is replicated, for text only, in Editor-Green as well.
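To make the sampling settings described above concrete, the following is a minimal, library-free sketch of top-k sampling with temperature and a repetition penalty. It is not the prototype's code: `next_logits` is a hypothetical callable standing in for a fine-tuned model, the function names are ours, and the penalty rule (divide positive scores, multiply negative ones) is one common formulation that we assume here. With a penalty of 1.0, as in the text, it leaves the scores unchanged.

```python
import math
import random


def sample_top_k(logits, k=5, temperature=0.5, generated=(),
                 repetition_penalty=1.0, rng=random):
    """Sample one token id from raw `logits` using top-k sampling."""
    scores = list(logits)
    # Assumed penalty rule: down-weight tokens that were already generated.
    # A value of 1.0 is a no-op.
    for tok in set(generated):
        s = scores[tok]
        scores[tok] = s / repetition_penalty if s > 0 else s * repetition_penalty
    # Temperature below 1.0 sharpens the distribution.
    scores = [s / temperature for s in scores]
    # Keep only the k highest-scoring token ids.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the surviving candidates (shifted for numerical stability).
    peak = max(scores[i] for i in top)
    weights = [math.exp(scores[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


def generate(next_logits, prompt_ids, max_new_tokens=40, **sampling):
    """Autoregressively sample up to `max_new_tokens` token ids.

    `next_logits(ids)` is a hypothetical stand-in for a language model: it
    takes the ids so far and returns one score per vocabulary entry.
    """
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        ids.append(sample_top_k(next_logits(ids), generated=ids, **sampling))
    return ids[len(prompt_ids):]
```

In this sketch each of the 40 steps requires one model call, which illustrates why longer suggestions cost proportionally more time in an interactive setting.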

Multimedia Retrieval
Retrieving visual and auditory stimuli based on natural language descriptions is a challenging task. This is compounded for open-domain text, as in our case. Applications that do this typically need to defer to large internet media databases with search APIs to adequately support the range of possible queries with high-quality media objects. While some recent work focuses on learned approaches to semantic text-image similarity, these approaches are slow, require much data to train, and don't scale well to large databases, and so we opt for simple concept-based (rather than content-based) search of media platforms.
We use two such databases: Unsplash for images, and Freesound [37] for audio. Concept-based searches for media on these platforms are typically performed with keywords rather than long-form text, and so we use the RAKE algorithm for automatic keyword extraction [87] as a preprocessing step, pooling the keywords from both model outputs (plot and description). We then query Unsplash with the output keyword list. We observe that Freesound is sensitive to multi-keyword searches and often returns no sounds in these cases; to avoid this, and to avoid rate limitation problems, we supply only the first extracted keyword to its API to search for sounds. We apply three content filters to Freesound queries. First, we limit the duration to between 10 and 30 seconds to allow for sounds that are both long enough to contribute to an acoustic environment and not so long as to extend retrieval time. Second, we filter out results marked as containing explicit content, after noting that these sometimes appear even when not necessarily suggested by the query. Finally, we apply a filter on the "dissonance" feature, which is extracted directly from each audio signal, so that it is ≤ 0.4. Since sounds need to be combined together, constraining the sensory dissonance [77] of each independent element helps to layer them effectively into a coherent soundscape.
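The keyword step and the three sound filters described above can be sketched as follows. This is a simplified, self-contained approximation of RAKE (candidate phrases split at stopwords and punctuation, scored by word degree over frequency), not the implementation used in the prototype. The stopword list is a small illustrative subset, and the metadata field names in `keep_sound` (`duration`, `explicit`, `dissonance`) are hypothetical rather than the Freesound API's actual response schema.

```python
import re
from collections import defaultdict

# Small illustrative stopword list (an assumption; real RAKE lists are longer).
STOPWORDS = {"a", "an", "the", "and", "or", "of", "in", "on", "to", "with",
             "was", "is", "were", "it", "his", "her", "at", "by", "for", "as"}


def rake_keywords(text, stopwords=STOPWORDS):
    """Return candidate key phrases, highest RAKE-style score first."""
    phrases = []
    # Punctuation ends a candidate phrase; so does any stopword.
    for fragment in re.split(r"[^\w\s'-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in stopwords:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, including itself
    scored = {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}
    return sorted(scored, key=scored.get, reverse=True)


def keep_sound(meta, min_s=10.0, max_s=30.0, max_dissonance=0.4):
    """Apply the duration, explicit-content, and dissonance filters.

    `meta` is a hypothetical dict of per-sound metadata.
    """
    return (min_s <= meta["duration"] <= max_s
            and not meta["explicit"]
            and meta["dissonance"] <= max_dissonance)
```

Under these assumptions, the keywords from the pooled plot and description suggestions would form the image query as a list, while only the first entry would be sent as the sound query.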

Data Logging
To allow detailed logging of user interaction with our prototype as well as generated suggestions, we use a real-time Firebase 5 database. This database keeps track of user-typed text in real time as well as any changes to settings in the interface (e.g., turning sounds or images off or on), time-stamped requests with their associated input text, and time-stamped responses with text suggestions and links to retrieved media. This allows us to approximately reconstruct a sequence of writing events from a session, which we refer to in order to build the later section on integrative leaps.
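As a rough illustration of how a session might be reconstructed from such a log, the sketch below orders time-stamped events and pairs each suggestion request with the response that follows it. The `LogEvent` schema and the event kinds are assumptions made for this example; they are not the prototype's actual database layout.

```python
from dataclasses import dataclass, field


@dataclass(order=True)
class LogEvent:
    # Only the timestamp takes part in ordering comparisons.
    timestamp: float                    # seconds since session start (assumed unit)
    kind: str = field(compare=False)    # "text", "setting", "request", or "response"
    payload: dict = field(compare=False, default_factory=dict)


def reconstruct_session(events):
    """Return events in chronological order, plus (request, response) pairs
    where each request is matched with the first response that follows it."""
    ordered = sorted(events)
    pairs = []
    pending = None
    for event in ordered:
        if event.kind == "request":
            pending = event
        elif event.kind == "response" and pending is not None:
            pairs.append((pending, event))
            pending = None
    return ordered, pairs
```

The reconstruction is only approximate in the same sense as in the text: events that share a timestamp keep their logged order, and a response with no preceding request is left unpaired.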

STUDY
To investigate user interaction with our prototype and the role of user explanatory models of AI within that interaction, we designed a mixed-methods study consisting of two observed writing tasks, a four-part survey, and extensive logging of interaction data.

Formative Study
As part of our initial exploration, we conducted a formative study with 14 participants and an earlier version of the prototype. We called this interface MLVille, an homage to Herman Melville who famously struggled with writer's block before it was a well-documented phenomenon. With 4 out of 14 participants we conducted open-ended, interpretation-focused interviews, which allowed us to get in-depth data. These participants interpreted their actions and interactions with our prototype while performing the study tasks to provide additional context and insight. Through a broad set of survey questions, analysis of the produced text both computationally and qualitatively, and extensively documenting usability feedback, we performed several updates for our second study iteration, which is described in sections to follow. Specifically, we re-designed our interface, fine-tuned new language models on different datasets, re-oriented our study around thick description (noting the range of information and useful perspective it generated), and re-designed our survey to capture identified factors of variation and interest.

Recruitment
Potential participants were invited through large mailing lists associated with various departments and living groups at R1 universities, including one social sciences department and several Computer Science-adjacent lists, as well as a post on reddit. As a pre-condition for being recruited, applicants filled out a short survey to confirm that they met the requirements of being fluent in English and being over 18 years old. The survey also contained a video tutorial that explained the features of Editor-Red. To filter for applicants who watched and paid attention to the video, they were asked to answer two screening attention check questions about the system's features. Those applicants who met the requirements were invited so as to have a balanced pool of participants who identified as native and non-native English speakers. We also made sure to have a balanced pool of participants with and without Computer Science backgrounds.

Participant Demographics
27 participants completed the writing task. Data from 4 had to be excluded due to firewall-related issues, mid-session server problems, and unwillingness to complete the task as instructed. All participants reported having at least a high school diploma. When asked about their disciplinary affiliation, 35% replied Another STEM field, 21% Computer Science, 13% Life Science, 9% Business, 13% Humanities, 4% Social Sciences, 4% Medicine. Participants' ages ranged from 18 to 45, with 48% of participants in the range of 18-22. 65% of participants reported that English is their first language. When asked "Do you struggle with writing?", 78% of the participants responded yes.

Fig. 4. Study design. Our study consists of two writing tasks, one each with Editor-Green (no augmentation) and Editor-Red (with augmentation) for 20 minutes; which interface participants used first was counterbalanced across subjects. For each task, the participant is given one of two prompts (in randomized order) to then create a story with. The two writing tasks are interlaced with sections of a four-part survey, with introductory and background components, as well as one for each writing task. The study takes approximately 75 minutes in total.

Study Structure
Our study design follows the structure shown in Fig. 4. We began by sending each participant a consent form in advance of the scheduled session, allowing them enough time to read and ask questions. They gave verbal consent at the beginning of the session with the interviewer and were then given an introductory overview of the study procedure, which took 3-5 minutes. Participants then completed a 10-minute introductory survey (S I), designed to elicit the prior knowledge, conceptual frameworks, and beliefs that participants had about AI and its application in writing, human creative writing, and their own previous writing experience. Participants then began the first 20-minute writing task, with either Editor-Green or Editor-Red depending on their group assignment. They were instructed to write a story using one of the following prompts: The phone began to ring or A train arrives at the station (alternating prompts between groups to control for the effect of the prompt). Both prompts were designed to be short, somewhat vague, and contain the beginning of some action (phone call and train arrival).
Most participants, once given the task ("write a story using the following prompt"), began writing without asking any questions. Some participants asked if there were any requirements in terms of genre, structure, or length, and we informed them that there were none. Participants were informed that they should use suggestions only if they found them helpful. Deciding when to stop writing was completely up to participants and we clearly stated that at the very beginning of the task.
After each writing task, participants completed the corresponding follow-up survey, i.e., S G (< 5 minutes) for participants who wrote in Editor-Green or S R (∼ 10 minutes) for participants who wrote in Editor-Red. In accordance with standard order-counterbalancing, participants completed a second 20 minute writing task with whichever editor they had not yet experienced, followed by its corresponding survey.
Finally, all participants completed a survey that invited them to compare the two writing experiences they had during the session, as well as provide some additional demographics/background information (S C ). The overall duration was about 75 minutes, and participants were compensated with a $25 Amazon gift card.
Two researchers separately conducted study sessions via Zoom videoconferencing. The sessions were recorded with permission, and the researchers took notes throughout the session. Participants shared their screens during the writing sessions, and they were asked to switch this function off when answering survey questions. While writing, participants were explicitly encouraged to comment and react aloud as they wrote, processed information, and responded to incoming suggestions and media. During the sessions, interviewers observed participants' interactions with the prototype and writing process. Additionally, they periodically prompted participants to communicate about their thought processes and experiences throughout each writing session.
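One way to realize the counterbalancing described in this section is to rotate participants through the four combinations of editor order and prompt order. The sketch below assumes a simple deterministic rotation and hypothetical participant IDs; the study's actual assignment procedure (for example, how prompt order was randomized) is not specified beyond the description above.

```python
from itertools import product

EDITOR_ORDERS = [
    ("Editor-Green", "Editor-Red"),
    ("Editor-Red", "Editor-Green"),
]
PROMPT_ORDERS = [
    ("The phone began to ring", "A train arrives at the station"),
    ("A train arrives at the station", "The phone began to ring"),
]


def assign_conditions(participant_ids):
    """Rotate participants through every (editor order, prompt order) cell so
    that each editor and each prompt appears first equally often."""
    cells = list(product(EDITOR_ORDERS, PROMPT_ORDERS))
    schedule = {}
    for index, pid in enumerate(participant_ids):
        editors, prompts = cells[index % len(cells)]
        # Each participant gets two (editor, prompt) tasks, in order.
        schedule[pid] = list(zip(editors, prompts))
    return schedule
```

With a participant count that is a multiple of four, each cell is filled equally; otherwise the last few cells are short by one participant.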

Survey
The survey consisted of four blocks. The Introductory block of questions (S I) contained open questions and multiple choice questions. It was designed to elicit the prior assumptions, expectations, and understanding that participants had about Artificial Intelligence and the possibility of its application in writing and creative writing, specifically. Participants were also asked about their own writing and their thoughts about creativity in human writing.
The block of questions after writing in Editor-Red (S R) contained five open questions on the experience of the interaction, followed by a longer section that contained two grid sections with 7-point Likert-type items relating to general usability. After this, there were six multiple choice questions asking participants to provide more detailed information on their experience (e.g., "When Editor-Red was giving its ideas, what were you paying attention to? Text, Images, Sounds, None", "I think I will enjoy using Editor-Red more, if..."). Finally, there was one more 7-point Likert-type grid of items asking participants to rate statements on suggestions provided by Editor-Red (e.g., "The suggestions made by Editor-Red were creative", "The suggestions made by Editor-Red were coherent", "I enjoyed co-writing with Editor-Red", "I enjoyed collaborating with Editor-Red"). The block of questions after writing in Editor-Green (S G) contained two grid sections, with all items relating to the augmentation features omitted and the rest replicated.
The block on comparison (S C ) between Editor-Green and Editor-Red consisted of Yes/No and open questions on creative writing ("Do you consider the text that you wrote in Editor-Red/Editor-Green creative?", "If yes, in a few words, explain how it was creative. If no, explain in a few words, why not?"). There were also 7-point ordinal items asking to compare Editor-Green and Editor-Red in terms of creative writing (e.g. "In which editor was the text that you wrote more creative?") and four items on cognitive load adapted from the NASA TLX survey [47] (e.g. "Where did you feel more focused when writing a text?").
The final section contains demographic questions asking participants about their highest degree, disciplinary affiliation, age, and gender identification. Those participants who identify themselves as non-native speakers of English are asked to provide more information about levels of self-reported proficiency of various skills and depth of exposure, which we assess based on pre-existing instruments [49,61]. In this block, there are also supplementary questions asking participants what kinds of writing they struggle with and how often they do creative writing.
All the questions throughout the four blocks are meant to elicit data for the key phenomena we were interested in: participants' prior understanding and anticipations of AI and writing using AI, how participants understand creativity and creative writing, participants' interpretation of the system's work and their explanation of engagement with the system's suggestions. Additional concepts of usability of the system, cognitive load, and agency were also included. The questions were strategically phrased in different ways (open questions, closed questions, multiple choice, Likert-type items).

Observation and Thick Description
Participants spent about 20 minutes writing in each editor (within each 75 minute session). This gave researchers an opportunity to capture a wide range of phenomena: participants would comment on how they usually write outside the study and how they are writing within the study, explain their process of coming up with ideas, their opinions and judgements of the system's suggestions, talk about how they were making decisions to incorporate or not incorporate suggestions, and give their reasons. The ability to be there with participants when they were writing and to observe immediate reaction and, to the extent possible, raw and unmediated answers, allowed us to produce "thick description" [43], as noted earlier.
Observing the interaction with the system allows us to capture the reasoning of participants for incorporating or ignoring suggestions, and also glean how participants make sense of the interaction with the system and their strategy on structuring this interaction to support their writing. In describing the interaction with Editor-Red, we note that the interaction is not reduced to just getting suggestions from the system. Participants perform a task of writing a text while existing within a particular space, which is defined not only by the interface of the system and its multimodal suggestions, but also those expectations and explanatory models that participants had prior to the study and were constantly adjusting during the study. As such, we seek to capture detail about aspects of their experience that go beyond just suggestion incorporation behaviors.

Data Analysis
4.7.1 Writing sessions. The data from writing sessions consisted of (1) logged data of the texts participants wrote and suggestions they received, (2) transcripts of sessions where participants thought out loud during the interaction with the system and answered interviewers' questions, and (3) the notes that interviewers made during the sessions. After we finished running the study, interviewers watched the session videos, making additional notes and comparing them to the notes they made during the sessions. Then we entered all the data from the writing sessions into a shared document and MAXQDA 6 (software for qualitative analysis). In MAXQDA, we first used a deductive approach to code the data: we employed pre-existing concepts from research questions (such as conditions for acceptance of a suggestion, creativity, agency and ownership, etc.) as codes. In the second stage of the analysis, we applied an inductive approach to code the data: in particular, the in-vivo method (using the words of the participants to create codes), so as to let the voice of the participants and their actual concepts structure the themes. Two rounds of inductive coding were done, followed by a process of rearranging codes and either turning in-vivo codes into new themes or adding them to existing codes. At this point, a second researcher did their round of coding and partially re-coded the data. The two coders discussed and reached agreement on the codes. A third round of coding by a third researcher was done to align and streamline all the codes.

Survey responses.
For the answers to open-ended survey questions (expectations), one researcher performed an initial open coding (in-vivo method, i.e. using the words of the participants to create codes), followed by a second cycle that involved deductively applying domain concepts associated with posed questions to the initial codes. For example, in asking about differences between human and AI text production, we relied on some concepts from prior literature (such as statistical vs. symbolic processing or novelty, value, and surprise in creativity) that were closely related to the in-vivo codes. This was accompanied with a values orientation (i.e. trying to infer participants' values and beliefs). Then a second researcher reviewed and partially re-coded the same data, and disagreements were resolved through discussion of instances and codes themselves (labels and definitions), as well as including secondary codes for individual responses where appropriate.

RESULTS
We detail findings from the three primary components of our study: our survey to capture participants' prior assumptions and pre-existing explanatory models, observations and responses during the semi-structured interview process that accompanied their writing, and questions posed afterwards about their final thoughts and experiences during the sessions. In this section, we focus on detailing each independently before examining the synthesis of their respective data. We explicitly review specific examples of how these data interact, but note that our broader findings are informed by all three sources as they represent different means of inquiry and perspectives on the experiment.

Table 1
Label | N | Example
Sparse-Abstract | 14 | "I think AI is a software that attempt to resemble the way an intelligent brain works. I suppose it bases its decision on a set of situations that are used as possible scenarios."
Sophisticated-Operational | 5 | "Different kinds of AI work differently. If we're talking about machine learning systems, they are trained with large corpi of data that are curated by data scientists or machine learning engineers. The algorithms for these systems find and exploit patterns in the data to then accomplish tasks. If I ask an AI something, it will look for a way to take my input and compare it to patterns in the corpus of data it was trained on."
Sparse-Operational | 2 | "i think it works by first being given a set of instructions, or base algorithms, which are then trained by feeding it various data sets/ user inputs. For example I'm pretty sure visual captcha companies use user input data from instructions like 'select all the traffic lights in this image' to train or test their own image recognition algorithms. The data is relayed to the computer in a format it can interpret, such as code for matlab, and the computer then recognizes which configurations of code evaluate to true given the desired condition. For text, it can also scan the input for key words/tags, that make it branch down a certain path in the algorithm."
Sophisticated-Abstract | 2 | "There are different kinds of AI. The most simple is a series of if/else statements, more complicated AI might use neural networks and deep learning. AI could get information from any source a computer can: file input, cameras, microphones, etc. It produces information by taking some input, processing it in some way, and outputting it. It does not 'understand' anything in the same way a human does, but rather algorithmically processes data it is given."

Detail in explanation vs. technical depth and accuracy.
We assessed the structure of participants' prior explanatory models of AI through one open-ended question, i.e., "How do you think AI works? (For example, where does it get information? How does it produce information? How does it understand what you ask it?)", expecting a range in the responses. We observed during the first coding cycle that the responses varied along more than one dimension (rather than being more or less structured overall), and so we model this as a two-dimensional construct. During the second (deductive) coding cycle, we adjusted code labels to relate them to prior and parallel work:
(1) Type of Explanatory Model
(a) Abstract: What AI does
(b) Operational: How AI works
(2) Technical Depth
(a) Sparse: Vague or inaccurate description
(b) Sophisticated: Low-level and accurate description
Specifically, DeVito et al. describe abstract and operational algorithmic folk theories, noting that the former "do not include specific attempts to theorize how an algorithm might actually operate" [30] (their sub-codes for these are not applicable to our case). Interestingly, in a study of mental models of adversarial machine learning, Bieringer et al. found that their participants' prior knowledge did not necessarily determine the technical depth of elicited mental models, pointing to a possibly multidimensional space. They apply the labels sparse and sophisticated to describe the technical depth in these mental models [12].
The majority of participants in our study gave Sparse-Abstract models (N = 14). For example, P3 wrote "I think it gets info from devices and uses language features to understand us." See Table 1 for the full distribution over these label combinations, and additional examples.
The second most common explanatory type was Sophisticated-Operational (N = 5). For instance, P1 alluded to both generalization and optimization in their explanation: "I think that it works by feeding it data. It is then able to use the data that it is fed and apply the given outcomes for the provided data to novel situations. It is accurate as it continues to learn and reduce loss between the real and given answer." Sparse-Operational and Sophisticated-Abstract explanations each occurred twice. We identified P16's description as an instance of the former: "Usually it gets information from big datasets of training examples, similar to ones it needs to act on. There are different ways for it to produce information - it may do some clustering algorithm, use neural network, genetic algorithms or simply build decision tree based on previous answers. Most AI do limited number of tasks, so they recognise one of few commands. The ones which recognise speech likely try to represent sentence grammatical structure and use previous users' dictionaries and set of predetermined algorithms. But i really do not know" P16's explanation is long and contains examples of machine learning methods and speech recognition systems; the description of the latter, however, is ambiguous and likely inaccurate (by comparison to most existing speech recognition approaches). Moreover, the participant explicitly indicates that they don't know how it works despite offering an account.
By contrast, a Sophisticated-Abstract explanation provides an accurate description of what AI does but not how AI works (despite the question specifically asking "How"). For example, P15 wrote: "AI is trained on a large amount of data. The training will usually tell the machine what it needs to know and it will then produce information based on its training. It will identify similar features from the training dataset and the test dataset to make an analysis." This participant, like P1, refers to generalization (similarity between train and test features), which requires knowledge about how machine learning models can be useful for real-world tasks, but provides no theory about how this is accomplished despite the posed question explicitly asking for it.
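To make the two-dimensional construct concrete, the short sketch below tabulates the four reported label combinations and derives the totals along each dimension. The cell counts are the ones stated in this section; the function name and data layout are just illustrative choices.

```python
from collections import Counter

# Reported counts per (technical depth, explanatory model type) combination.
CELL_COUNTS = {
    ("Sparse", "Abstract"): 14,
    ("Sophisticated", "Operational"): 5,
    ("Sparse", "Operational"): 2,
    ("Sophisticated", "Abstract"): 2,
}


def marginals(cell_counts):
    """Sum the cell counts along each dimension of the construct."""
    by_depth, by_type = Counter(), Counter()
    for (depth, model_type), n in cell_counts.items():
        by_depth[depth] += n
        by_type[model_type] += n
    return dict(by_depth), dict(by_type)


by_depth, by_type = marginals(CELL_COUNTS)
print(by_depth)  # {'Sparse': 16, 'Sophisticated': 7}
print(by_type)   # {'Abstract': 16, 'Operational': 7}
```

The cells sum to 23, matching the number of participants retained after exclusions. The two dimensions have the same marginal split (16 vs. 7), and most responses fall in the Sparse-Abstract and Sophisticated-Operational cells.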

Human creativity in writing.
Our two questions relating to human creativity in writing allow us to elicit unconstrained thoughts through open responses as well as anchor to classical constructs such as Novelty (historically new), Surprise (unexpected), and Value (useful to people) [13] via a multiple-choice item. We additionally included an Other field in the multiple-choice item, to allow participants to specify a different dimension if they felt that their concept was not adequately represented by these three, especially given the domain constraint. A majority of participants (N = 12) indicated that Novelty was most important, followed by Value (N = 6), Surprise (N = 4), and one custom response: "evocative use of language", a domain-specific attribute.
In the open responses, participants identified several features of human creative writing they considered important, which we coded as follows:

Freedom/Expression. Five participants commented on personal expression and expressive freedom (P1, P3, P10, P20, P22). For example, P3 simply stated "freedom of expression", while P20 remarked on both aspects: "Creativity in writing for me is putting down a personal, immersive response to a prompt. So taking a spark of direction and going wherever I want from there."

Imagination/Fiction/Inspiration. Five participants commented on imagination and fictive writing (P5, P6, P8, P9, P11), such as P5 who wrote "Letting my imagination go free, creating worlds and scenarios that don't exist." P6 illustrated this by comparison: "Creativity writing is the type of writing used in stories, novels, poems, journals. I keep it separate from scientific writing, which I don't consider as creative writing."

Additional thoughts: Structure/Clarity/Goal-directedness, Novelty, Unexpectedness, and Truth. Still others commented on form, structure, flow, and direction. P2 indicated that creative writing involves "having a clear goal and many possible ways to accomplish that goal." The familiar dimensions of Novelty and Unexpectedness (surprise) also appeared in several comments. Finally, one participant alluded to "truth", perhaps indicating not the idea of verisimilitude (or similarity to reality) but "truth in fiction." These features of human creative writing participants considered important are also summarized in Table 2 with additional examples.

Table 2
Code | N | Example
Freedom/Expression | 5 | "Creative writing, to me, is writing that embodies one's own novel artistic expression and is not primarily functional."
Imagination/Fiction/Inspiration | 5 | "It is an enjoyable activity that involves imagination and allows one to express his/her feelings."
Structure/Clarity/Goal-directedness | 4 | "Having words flow out and describe things in a satisfying way"
Novelty | 4 | "For me its coming up with new, innovative and engaging ways to write a story."
Unexpectedness | 4 | "Creativity in writing is using tropes and ideas in uncommon ways."
Truth/Reality | 1 | "Being able to capture truths about the real world through words"

AI creativity.
When asked whether they thought AI could be creative, the majority of participants (N = 17) indicated Yes. We assessed this through analysis of an open-ended item, and some participants did indicate uncertainty by using words like "probably." We obtained the following codes through our analysis:

Human-based. Five participants (P3, P6, P8, P17, P19) noted that ostensibly creative AI is somehow modeling human creativity, and used this point to indicate that AI can be creative. For example, P3 wrote "probably, because it can mimic other human features."

Combinatorial Creativity and Uniqueness/Randomness. Others pointed to the notion of combinatorial creativity [13] (P4, P9, P16, P21), suggesting that "It can be creative if it happens to combine things in a way that people wouldn't naturally consider" (P9). Relatedly, participants noted the opportunity for creativity arising from randomness. P15 notes that AI-generated ideas "can be completely illogical which is sometimes the best creativity."

Novelty/Surprise. Three participants (P7, P10, P12) implicitly made the connection to novelty and surprise, remarking that AI "can be creative in the sense that it can produce novel solutions to problems," but also that "this presupposes a narrow conception of creativity" (P12).

Additional thoughts: Future creativity, and uncertainty. Still others pointed toward future creative ability, due to the improvement of AI, such as P5: "Eventually, yes, but I'm not sure it can make big leaps in novelty in a single go." P18 expressed uncertainty about the question of whether AI can be creative, writing that they were "unsure what makes humans creative." Other participants expressed different kinds of uncertainty; P1 wrote that they believe that AI "can be accurate" but they would have a hard time "imagining what a creative AI would look like." P22 emphasised the fact that AI depends on "the input humans give to it" and noted that "if humans don't keep updating the inputs, it may not be creative anymore." Table 3 provides other examples for codes mentioned above.

Table 3 (partial)
Code | N | Example
Gradually/Future | 3 | "Yes definitely! But to a certain extent. Because technology keeps on evolving and the internet is a very good example of it where it really helps in building creativity. But nonetheless, i think there is a limit to its creativity as compared to human but of course it will help a lot in enhancing creativity"

Human-AI Differences.
We also elicited participants' thoughts about qualitative differences between human and AI text production. We assessed this through two items: one multiple choice to indicate the presence of a difference ("Do you think the way AI produces text is different from humans?": Yes/No/Unsure), followed by an open-ended question prompting them to explain how and why (or why not) human and AI text production mechanisms are different. For the multiple choice item, 16 participants answered "Yes" (they think the way AI produces text is different from how humans do it), 5 answered "Unsure", and 2 participants answered "No." In surveying qualitative examples, we found diverse concepts of what differs between how humans and AI produce text, as well as effects of such differences.
Statistical/Data vs. Symbolic/Mind. Eight participants (P1, P2, P7, P8, P, P15, P17, P18) pointed to the contents and mechanics of the text producers. They differentiated between data-driven and mind-driven text production, for example P1 wrote: "AI is taught to produce text based off of given data, an algorithm is used to produce text while a human creates text based off of their mind." Participants phrased this by contrasting generating text "statistically and linearly" with "using mental hierarchy of words" (P1), or indicating rule-based vs. data-driven constraints. For example, P8 noted that "people can produce infinite correct and comprehensible outputs based on their knowledge of their native language's grammar and vocabulary, while AI will only be able to produce content based on its input." We label this by combining the cues from participant responses and the traditional AI notions of symbolic processing vs. statistical modeling, two dominant paradigms in natural language processing.
World model and understanding. Another thought from five participants (P4, P5, P9, P11, P12) was that AI is lacking sufficient understanding of context, which comes from experiences in the world. P5 points out that AI "has not learned language by interacting with society and cultures, learning from family and personal experiences, or have the ability to draw on memory when responding in the same way humans do" while P4 appeals to sensorimotor functions: "AI produces numbers based on drawing patterns and similarities from the numbers in the dataset that it has been fed. It doesn't understand and have visual representations in the brain that it then produces into motor action, it's just reproducing what it's already seen."

Not sure how humans do it. Four participants (P9, P12, P16, P18) remarked that either they were unsure, or not much is known, about how humans produce text. They differed in whether they thought this would extend to similarity or difference between humans and AI. For instance, P9 answered: "I'm not really sure how people 'produce text'" and continued saying that since "AI learns from previous patterns" then maybe people in a similar way "learn from past experiences and instructions." P16 noted: "Not that is known how humans produce texts, so mechanisms developed independently are unlikely to be the same."

Complexity, performance. Three participants (P6, P7, P11) commented on expected difference in the complexity of the produced text, or performance factors that might affect quality. P6 emphasized that "it depends on the level of development of the AI" explaining that "in the ideal case one should not be able to recognize a human produced text from a machine produced one." P11 used the example of Google Translate being unable to translate complex terms or understand context, to point out that the difference between human and AI text production might have to do with the complexity of the language.

Additional thoughts. P14 argued that AI lacks "intentionality" due to not having "desires or beliefs", while P19 relatedly noted that AI cannot be "spontaneous" or "irrational" in its behavior as compared with humans. Two participants also made comments about formality in language. For example, P21 noted that "AI can only use what it has been taught or can access via some database while humans may access more informal or colloquial writing patterns." Two comments noted that AI language systems are based on human-provided data. For example, P6 didn't expect a difference "because the information is mainly fed by humans." Finally, three participants made seemingly contradictory or unclear statements. For example, P22 indicated no difference, but then described what reads as a somewhat ontological difference: "I think in some ways, each AI and humans communicate with our own languages and it's a mean of mutual understanding between them, so it's not that different. It's just, humans don't operate the way AI does, and vice versa." Table 4 provides other examples for the codes and ratings mentioned above.

Interacting with the system
All the data discussed in this subsection was received through observation of participants' interaction with the system and through verbal comments that participants made during the study. The comments were made either when participants were thinking aloud during a writing task, or as a reply to the interviewer's questions. Throughout the session interviewers asked general questions, like, "What are you thinking?" and "How is the writing going?": either once every 2 minutes or, if the participant seemed to be disturbed by being interrupted often, when the participant stopped writing. Interviewers also asked more specific questions, like, "What do you think about that suggestion?", "Why are you laughing?", and "Tell me how you incorporated that suggestion."

Table 4. Codes re: expected differences between human and AI text production, before writing. N indicates number of participants, Example shows a corresponding quote, and Yes/No/Unsure are counts of categorical responses, for each label, from participants about whether they think AI text production is generally different than that of humans. Some responses are labeled with more than one.
Code | N | Example | Yes | No | Unsure
Not sure how humans do it | 4 | "Not that is known how humans produce texts, so mechanisms developed independently are unlikely to be the same. Also humans seem to be much better at text generating but may be it is because they have more and more diverse experience" | | |
Other | 3 | "In a sense, yes, because it's more or less about encoding connections in memory, then following the connections through to retrieve this memory, or making predictions, based on what you already know." | | |
Complexity, Performance | 3 | "I guess it depends on the level of development of the AI. In the ideal case one should not be able to recognize a human produced text from a machine produced one." | | |
No intentionality | 2 | "It seems not as structured as humans? And it doesn't seem to have 'intentionality' (e.g. they don't appear to have desires or beliefs when they try to make an argument)" | 2 | 0 | 0
Formal vs. informal language/behavior | 2 | "AI can only use what it has been taught or can access via some database while humans may access more informal or colloquial writing patterns" | 2 | 0 | 0
Human-based | 2 | "Because the information is mainly fed by humans" | 0 | 1 | 1
We noted broadly different styles of overall participant writing and engagement with suggestions, described in detail below. For each style, we provide examples of the participants who most clearly embodied its characteristics.
(1) Reactive writing. Through observation, interviewers identified four participants for whom suggestions were actively shaping their story and helping them decide where the story was going (P17, P7, P10, P23). They wrote in a way that looked like a reaction either to the suggestions of the system or to the task. There were clear effects from the pressure of time, the conditions of the task, and their writing habits. Some participants also mentioned that what they wrote was more like a "stream of consciousness" (P23).
(2) Proactive writing (with suggestions). Participants with clearly proactive writing (P2, P11, P15, P18, P20) wrote with a clear idea of what they wanted to write (having some horizon of their story) and how they wanted to do it (having their own process). They incorporated suggestions of Editor-Red at particular points of their stories, either at the end of a scene or after they had exhausted the story horizon they had in mind. They did not let Editor-Red take over their process of writing. This type of writing was characterized by longer writing periods and fewer requests for suggestions. For example, P18 requested suggestions only two times and wrote for 15 minutes non-stop. P19 requested suggestions three times and had longer writing periods, with one period lasting 9 minutes.
(3) Actively opposed to suggestions. Four participants (P6, P8, P16, P22) were generally not willing to incorporate the suggestions of Editor-Red. For example, P8 had their own idea of how they wanted to do things and explained their resistance to incorporating the system's suggestions, saying "I'm not a super suggestible person." P16 and P22 did not like the suggestions and did not include them in their writing. P6 kept writing in the hope that the system's suggestions would improve and, while waiting for this to occur, did not engage with the suggestions for the duration of writing their story.
We begin by exploring overall reasons, given certain types of suggestions and contexts, to incorporate or not to incorporate suggestions in the view of our participants. Subsequently, we describe and characterize instances of suggestion integration, often from more reactive and proactive writers (who did accept suggestions), in detail through the lens of integrative leaps.

Reasons to incorporate suggestions.
Judging whether the system's suggestions were in line with the participant's writing or too "out there" seemed to be an important axis along which participants, in the process of writing, were constantly making decisions about suggestion incorporation. Five participants specifically commented on the system's suggestions being in line with their writing (P1, P3, P4, P20, P21, P5). For example, P1 explained their decision to incorporate a suggestion because it was "thematically accurate and kind of good-to-keep-the-story-going description." P20 set a scene in their story and then requested suggestions, as they wanted to see how Editor-Red "would interpret that." One of the suggestions of Editor-Red was "...I'm calling to let you know that you've been selected to the next round of the lottery," and P20 exclaimed: "Wow... it's a bit scary because I had thought of the lottery idea or just some other kind of news... yeah, so it's interesting that it immediately followed that train of thought about a lottery." At the same time, some participants appreciated that the system was providing suggestions that were unexpected and not immediately related to their previous writing. For example, one participant explained that some suggestions, even though they seemed "absurd," were also so "detailed" and "specific" that it was "inspiring." As a result, even though some suggestions were "a little bit out there" to the extent that they would make them laugh, the suggestions would still give them "something to go" (P1).
We observed a subtle and variable trade-off between how creative or unusual participants thought suggestions were and how easy it was to incorporate them. The possibility of an easy transition towards incorporating the suggestions seemed to be a crucial factor. For example, P1 commented on one of the suggestions: "Some of the suggestions, would be either so similar to what I wrote that it doesn't seem worth incorporating or too creative, or I guess too hard to transition to. But that [suggestion] would logically be the next thing that I write about." Along these lines, some suggestions that might have been incorporated under the right circumstances by the corresponding writers were not integrated because of the time and effort it would require. P9, when choosing between suggestions of the two types (Plot and Description) that they liked equally, acknowledged that though they liked the suggestion that was "more fun, weird, crazy," it was such "a divergent shift" that the suggestion looked "too effortful to incorporate." Another participant explained that if they were to take "a much longer route" they might follow the system's suggestion of monster hunting (Description suggestion: "You are a member of a group of monster hunters.") as "it seems fun," but since they were short on time they decided not to explore this plot line (P5).

Reasons to not incorporate suggestions.
Participants gave a wide variety of reasons as to why they might have been unwilling to incorporate suggestions. Some participants commented that the textual suggestions of Editor-Red looked "basic" (P15), "plain" (P6), "redundant" (P4), or were not "picking up the tone" of the story they were writing (P6). At some points in their writing, six participants commented that suggestions were not in line with what they wrote (P2, P6, P7, P8, P12, P22), complaining that the system was not able to see that "this is not where I'm going" (P7). Some of the suggestions also did not make sense to participants and were repetitive (P2, P6, P17), whereas for some participants the fact that some of the suggestions were not coherent was not an obstacle to engaging with their content. For example, when P7 was interacting with the system, it experienced a number of delays in producing suggestions, and finally a plot suggestion came out as: "not sure where to end Train to MIT for the first time not sure where to end Train to MIT for the first time not sure where to end Train to MIT for the first time not." The participant said, laughing: "That's fine. I don't necessarily need it to be coherent" and expressed readiness to carry on writing.
One of the four participants who did not incorporate the system's suggestions found that the suggestions contained tropes and led to more "stereotype writing" (P8). Four participants also specifically pointed out that the system's suggestions distracted them from pursuing their own ideas (P2, P8, P12, P14). P8 explained that they felt they had "to tune out" the system's suggestions as they already had a picture of what they wanted to write in their head and it was easier for them to write "without the extra stuff." Some suggestions also did not come at the right time in the narrative: "Oh, wow, I would not think of cryogenic sleep! That's an interesting idea, but I didn't use it. I don't think it came at a good time in the story." (P1).
Visual suggestions were not incorporated if they were perceived as unrelated to the current writing (P14, P15, P6, P4). Participants also commented that some of the images were not only unrelated to the writing but also seemed arbitrarily constrained or homogeneous (e.g., demographically): "It's kind of strange, there's just a bunch of white guys staring at me and I don't know why" (P2) and "I'm confused ...And I'm curious as to why all the suggestions are very similar, and they are all images of straight blonde Caucasian women" (P5).
P14 considered the images "aesthetic" and "cute" but was following the idea they already had in mind. Some of the image suggestions, similarly to the text, did not come at the right time: "The pictures are cool...but this doesn't really fit with the character I have right now" (P15). P3 commented that even though they were not using visual suggestions, having them seemed "less daunting than having a white space in front of you."
Sound suggestions were the least used, sometimes due to a lack of relevance to the writing, either in content or in tone and style, and sometimes for other reasons. P5 described the sounds as not relevant and "random," and P9 explained that they were going to switch off the sounds. The sounds were also described as "aggressive," making it hard to focus (P8), "too much" (P9), and "distracting" (P12). All participants switched off the sounds at least once at some point in their writing, although we observed instances where sounds were related to the story and/or resulted in incorporated suggestions.

5.2.3 Editor-Red as "support system".
Some participants described their overall experience of the system, which was not limited to the relevance of the suggestions. For instance, some reported that they felt Editor-Red supported the writing process. P3 commented that the system was giving "good lines" and admitted that it helped "to continue along, where otherwise I think I will just stop writing." P3 continued, saying that when they had absolutely no idea what to write, taking a word or a line from Editor-Red gave them "something to add" and then they "kept going, and kinda went from there." Two participants explained that Editor-Red helped them to feel less stuck in their writing (P12, P16). Even though P16 did not incorporate any of the suggestions, and during the experiment described the suggestions as "dumb," they expressed their surprise that, in the end, writing in Editor-Red did seem to help them feel "less stuck." P12 explained their feelings about one of the suggestions: "Although I wouldn't word-for-word take that, it, at least, redirects my attention from just being stuck in the kind of the crucial little loop to having somewhere else to go. So that's helpful to get unstuck, I suppose." P23, reflecting on their experience of writing after the end of the session, admitted that they felt that inspiration came not directly from the suggestions; rather, the suggestions made them think of something else, and this is where ideas came from.
P6 admitted that even though they did not use the suggestions of Editor-Red, writing in it actually "relieved some of the stress of writing." P6 further explained that even though the suggestions were not helpful to them personally, the system was still "creating that distraction, that was good for making the task a little bit more relaxing." They noted that interacting with Editor-Red really helped in mitigating the stress of writing, comparing it to a feeling of "petting a cat." P1 described how interacting with the system changed their process of writing, as they would "write for the suggestions." P1 explained that when they didn't want to continue writing because they could not think of what to say, being in the system would be a motivation to "write a few additional sentences in order to get a better suggestion" and, if they were unhappy with the suggestions they received, to continue writing until they got a better one. P1 found it "very helpful."
All in all, observing participants' interaction with the system enables us to map out the multiplicity of tactics that participants engage in while performing the task of writing a text. Participants borrow words and lines from the system's suggestions, get inspired by the system's suggestions directly or indirectly, use ideas from the suggestions as reference points to produce their own ideas, or receive psychological support from the system (e.g., reduced stress of writing from having something to focus on besides their own feelings of getting stuck).

Willingness to cooperate with Editor-Red.
Participants practiced different patterns of engagement with the system, hoping that it would provide suggestions that better served their purposes. Some were willing to wait longer for the system to start giving better suggestions, occasionally under the assumption that allowing more time would give the system an opportunity to catch up with the participant's writing. For instance, P1 said: "This makes me think I didn't wait long enough because now everything's about phones. So maybe I should wait a little longer." Some also decided to keep writing in order to give more information to the system (P1, P19, P8). P1 reasoned that the information that they were "feeding it" might be "not substantial now". When P7 received another round of suggestions, one sentence that particularly got their attention was "I could feel the horn blaring in the distance." This phrase was almost identical to what the participant had already written, with the system having changed precisely one thing: the original sentence was "I hear the horn blaring in the distance." P7 pondered: "I see it changed some of the words around. I could feel the horn blaring, which is interesting. It's much more visceral than I can hear it blaring... Yeah, hear is not quite the right word... I will just put feel for now." This is an example of how a participant is willing to cooperate with the system and make sense of its contributions, even when someone else might consider the re-phrasing to be overly subtle (merely a lexical substitution) and not valuable to developing the story further.

Integrating the system's suggestions.
To provide a more granular exposition of the suggestion integration patterns we observed, we enumerate and detail a collection of integrative leaps. These leaps describe the different kinds of interpretation and expression involved in incorporating aspects of suggestions into the developing story, in particular how and how much participants alter the meaning and structure of their narratives when doing so. We use them as fine-grained windows into the mechanics of the most visible examples of this incorporative process, those we are able to access through our observational methodology.
Our data on suggestion integration contains 47 instances of integrative leaps from 19 (out of 23) participants (P6, P8, P16, and P22 did not appear to incorporate Editor-Red's suggestions in any identifiable way). These are instances, identified by the researchers conducting the study sessions, of participants engaging with and actively incorporating suggestions from the system. Participants often explicitly commented on and explained why and how they incorporated suggestions, as they were encouraged to do, and we report their interpretation of this process in addition to our observations and analysis.

5.2.6 Types of integrative leaps.
The integrative leaps can be analyzed along a number of axes. First, we consider the "edit" distance (e.g., lexical, semantic, etc.) between the suggestion as presented to the user and as incorporated into the story. We broadly characterize these as direct integration (N = 30; e.g., verbatim or restructured verbatim for a textual suggestion, or a textual analogue of the object or idea represented in a visual or auditory suggestion) or indirect integration (N = 17), where it often would be impossible to capture the integration without the participants' explanations, due to the modifications they made in the process of suggestion incorporation.
Second, we look at how incorporated suggestions relate to global aspects of the story's direction and most prominent elements. When participants used suggestions to explore new lines of narration, we call it exploratory integration (N = 28, shown on the left half of both figures), in contrast to taking suggestions to continue with their chosen narrative by adding more details, which we call confirmatory integration (N = 19, shown on the right half of both figures).
Finally, with the view that suggestions are intended to ease cognitive inertia in the writing process, we attend to the role they play in creative problem solving. Do they simply solve a localized problem by "closing" some aspect of the narrative in a necessary, analytical, or expected way (for example, naming a character that has already been described, or explaining why a character went from place A to place B if both of those events have been established)? Or do they "open" up options to consider, resulting in abstract, novel, or unexpected events, patterns, or directions? We describe these as convergent integration (N = 31, shown on the bottom half of both figures) and divergent integration (N = 16, shown on the top half of both figures), respectively. While these often overlap with confirmatory and exploratory integrations, respectively, there were a few cases in our coding process where we found it useful to explicitly distinguish between these two dimensions in order to better explain behaviors that we observed. For example, two of the six integrations we detail in the following section are ones we labeled, through an iterative process, as exploratory and convergent. In these leaps, participants may use suggestions both to pivot at a narrative level and to solve a local problem within this context. Although our categories are still relatively broad and cannot cover all the differences between integrations that we observed, we sought to sufficiently represent the most prominent aspects of integrations with these labels.
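The three axes can be read as a small coding scheme in which every leap receives one label per axis. The following Python sketch is purely illustrative and is not the study's analysis code: the class and variable names are hypothetical, and only the per-axis counts, the total of 47 leaps, and the labels for P3's leap are taken from the text. It encodes the axes as enumerations and checks that the counts reported for each axis account for all coded leaps.

```python
from dataclasses import dataclass
from enum import Enum


class Distance(Enum):
    """'Edit' distance between the suggestion and what was written."""
    DIRECT = "direct"
    INDIRECT = "indirect"


class Direction(Enum):
    """Relation of the suggestion to the story's global direction."""
    EXPLORATORY = "exploratory"
    CONFIRMATORY = "confirmatory"


class Role(Enum):
    """Role of the suggestion in creative problem solving."""
    CONVERGENT = "convergent"
    DIVERGENT = "divergent"


@dataclass(frozen=True)
class IntegrativeLeap:
    """One coded leap (hypothetical record layout): one label per axis."""
    participant: str
    distance: Distance
    direction: Direction
    role: Role


# Counts as reported in the text for the 47 coded leaps.
TOTAL_LEAPS = 47
REPORTED_COUNTS = {
    Distance: {Distance.DIRECT: 30, Distance.INDIRECT: 17},
    Direction: {Direction.EXPLORATORY: 28, Direction.CONFIRMATORY: 19},
    Role: {Role.CONVERGENT: 31, Role.DIVERGENT: 16},
}


def axis_totals(counts):
    """Sum the label counts of each axis, keyed by the axis name."""
    return {axis.__name__: sum(by_label.values()) for axis, by_label in counts.items()}


# P3's leap, labeled in the text as direct, exploratory, and convergent.
example = IntegrativeLeap("P3", Distance.DIRECT, Direction.EXPLORATORY, Role.CONVERGENT)

# Every axis partitions the same set of leaps, so each total should equal 47.
print(axis_totals(REPORTED_COUNTS))
```

Because each axis is a two-way partition of the same instances, a leap such as P3's can be exploratory and convergent at once; the axes are labeled independently rather than derived from one another.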

Integrative leaps.
In this section we review several examples of integrative leaps, identifying them along the aforementioned axes as well as describing the participants' interpretation and comments. We summarize each instance in a discrete box that clearly identifies the input text (before the suggestion), the suggestion at hand, the text after integration, the participants' explanation, and our labels (for example, Integration 1). When participants identified that they were prompted by visual or auditory suggestions, we include thumbnails or links for the reader to review.
P3, following the "The phone began to ring" prompt, was writing an intense story of a mother getting a phone call from her estranged son. Through a number of previous suggestion interactions, the participant wrote a story where the son on the phone call was in trouble, as some people were holding a gun to his head and demanding some information he didn't have. The next round of suggestions contained "I'm just a normal person who is in a hurry to get home." Following that, the participant wrote "She freezes. What is he talking about? This isn't making any sense... yes, she has an estranged relationship with her son, but they are normal people." As the participant explained, the phrase "I'm just a normal person" in the suggestion stood out to them and prompted them to develop it into the mother's inner thoughts as she tries to come to terms with the fact that she and her son are probably in big trouble. We labeled this example as direct (almost verbatim integration: normal person to normal people), exploratory (the participant did not have a clear idea of the narrative), and convergent (solving a local question of how the main character reacts to the news that her son is in trouble). See Integration 1 for more details.

Integration 1
Input (summary): [Emotional dialogue, son is held captive...] ..."What?" She replied back. "Who are you talking about?" "It's them," he whimpered. "But I-I don't have anything to tell them. I don't have the information they're looking for."
Suggestion: Plot. I'm just a normal person who is in a hurry to get home...
Integration: She freezes. What is he talking about? This isn't making any sense... yes, she has an estranged relationship with her son, but they are normal people. "You're not making any sense." "It's not normal. None of this is normal" he responds shakily. She hears a scream and the phone cuts out.
Explanation: "...I'm just thinking about how to continue this story but I don't really have much... but the suggestion under Plot is giving me some... you know, 'I'm just a normal person' line...I still don't have any sort of direction with the story... this feature seems to be good to help me, like, continue along, where otherwise I think I will just stop writing..." "...it just kinda stood out to me in relationship to this story... cause this story, it seems like again the mom is just a normal person, so if she is getting this phone call from her son, it doesn't make any sense, we are just normal people, so I thought I would incorporate that"
Our labels:
• Direct: the incorporation is almost verbatim (I'm just a normal person to they are normal people).
• Exploratory: the writer, from their own remarks, does not have a clear narrative direction that this suggestion would reinforce. Rather, it gives them a possible next step to build on.
• Convergent: the suggestion helps to solve a local problem in a concrete way (continuing the story further).

P21 was developing a story from the prompt "The phone began to ring" and was describing a call from the best friend of the main character. P21 wrote the first part of the dialogue, "'Wait why were you in the hospital?' I asked my friend," and the subsequent round of suggestions contained images of cars. The participant immediately took up the idea: "I'm seeing cars, so maybe he was in a car crash." To continue the dialogue, P21 wrote: "My sister was in a car crash. She's okay, but she broke a rib." Since the suggestion helped to keep the writing going and did not prompt the participant into a new avenue of thought, and since it was incorporated as a textual representation of a suggested visual object, this entry is labeled as direct (images of cars to car crash), confirmatory (reinforces the existing narrative), and convergent (closes a local question of why the person is in the hospital). We report details in Integration 2.

Integration 2
Input (summary): [Best friend phone call...] ..."I ran into your ex-boyfriend at the hospital". I was in shock. I hadn't seen him since 4 years ago when he left me to run away to Cuba with some new woman.
Suggestion: (Images)
Integration: "Wait why were you in the hospital?" I asked my friend. "My sister was in car crash. She's okay, but she broke a rib." I completely forgot about what she said about my ex being in the area, assuming it was hours ago, and rushed to the hospital. We were neighbors growing up, so I was pretty close with her sister too.
Explanation: "I'm seeing cars, so maybe she was in a car crash."
Our labels:
• Direct: direct representation of visually represented object.
• Convergent: closes a local question: i.e. "why?", "what happened?" regarding a character in the story.

P4, following the prompt "The phone began to ring," was developing a story about a police detective who called the main character and asked them to come to the police station because their sister was in trouble. P4 felt unsure how to continue and what the detective could have been accusing their sister of. This participant was really perplexed about what in their previous writing could have prompted the subsequent suggestions involving zoos, animals, and tropical places (these were in the retrieved images) but still decided to go ahead and integrate the suggestions into their story. In the end, P4 wrote (about the police detective): "He appeared a bit nervous. He told me that he suspects my sister may have stolen an elephant from the zoo when she was studying abroad in India. I felt shocked." P4 explained their reasoning in integrating the system's suggestion: "I don't know why these images popped up and how they are related to what I wrote before. But I saw the elephants and some kind of more tropical places and so it kind of made me think of ...I don't know I was thinking what could it she possibly have done wrong that she could be in trouble and so the elephant was standing out to me, so I chose to say that 'she stole an elephant' and I was thinking where elephants are and I know that there are a lot of elephants, like Indian elephants, so that's why I said that." See Integration 3 for more details.
The participant concluded their story with an attempt to rationalize and make sense of the participation of the elephant in it.

Integration 3
Suggestion: Plot. to be a little bit nervous but seemed to calm down when I asked him... (Images)
Integration: He appeared a bit nervous. He told me that he suspects my sister may have stolen an elephant from the zoo when she was studying abroad in India. I felt shocked.
Explanation: "I guess I used the first section of the plot to write 'he appeared a bit nervous'... these images I don't know why they popped up and how they are related to what I wrote before. But I saw the elephants and some kind of more tropical places and so it kind of made me think of ...I don't know I was thinking what could it she possibly have done wrong that she could be in trouble and so the elephant was standing out to me, so I chose to say that 'she stole an elephant' and I was thinking where elephants are and I know that there are a lot of elephants, like Indian elephants, so that's why I said that"
Our labels:
• Direct: elephant (from images), nervous (from text)
• Exploratory: the elephant, India, studying abroad are substantially new aspects of the plot at this point
• Divergent: does somewhat "close" a local question (what did she do?), however in a very unexpected way that raises many more questions than it answers

Later on in P21's story (previously described in Integration 2), they were describing a character driving to the hospital, and the system gave auditory suggestions that P21 described as chanting, explaining: "There is chanting happening, it makes me think she got into traffic because there's a protest happening, ...or a parade." So P21 wrote in their text: "In my mad dash to get to the hospital, I forgot that the 4th of July parade was happening today just blocks down from the hospital. I'm stuck at an intersection where the parade is passing by..." In this example, sound suggestions prompted the participant to think about what could have caused the traffic, so we call the integration indirect. The integration of this suggestion also significantly altered the course of the plot (exploratory), creating new avenues for story development (divergent). More details are in Integration 4.

Integration 4
Input (summary): [Continuation (see Integration 2)] ...We were neighbors growing up, so I was pretty close with her sister too.
Suggestion: Sound 1 (crowd call and response); Sound 2 (crowd cheering)
Integration: In my mad dash to get to the hospital, I forgot that the 4th of July parade was happening today just blocks down from the hospital. I'm stuck at an intersection where the parade is passing by, so I have no choice but to watch the high school band and floats made by various organizations go by.
Explanation: "There is chanting happening, um It makes me think she got into traffic because there's a protest happening, ...or a parade."
Our labels:
• Indirect: no words or concepts are directly applied; abstract link must be explained by participant (sound of crowd chanting to 4th of July parade)
• Exploratory: altered the course of the plot significantly; narrator eventually turned around and went home after several experiences in the parade
• Divergent: not necessary or expected; creates a twist to develop further; opens up new questions for story

P5 was writing a slow-paced descriptive story using the prompt "A train arrives at the station." At some point, the protagonist was stopped by an officer and told that the train would not be boarding as there were some issues. P5 requested suggestions, and one of them was "I had been waiting for this moment for years." The participant continued developing their story and wrote: "The train was already late and now this; who knows how long before I get on board?! I can't be late... maybe if I start now, I can drive over to... no, no, no. I'll never make it that way." To the interviewer who ran the session, there was no obvious connection between the suggestion and what the participant subsequently wrote. However, P5 explained that the suggestion "I had been waiting for this moment for years" made them think "more of a frustration for the train being late," and they imagined that there was something that the character was supposed to get to on time in another city. So this idea was translated into making the character impatient.

Integration 5
Input (summary): [Train is running late...] "...There is a matter we have to attend to first before we will let anyone be checked in," said the officer calmly.
Suggestion: Description. I had been waiting for this moment for years.
Integration: The train was already late and now this; who knows how long before I get on board?! I can't be late... maybe if I start now, I can drive over to... no, no, no. I'll never make it that way.
Explanation: "So in this case rather than 'waiting for this moment for years,' I'm thinking more of, like, a frustration for the train being late and now more delays and there's like something that character was supposed to be trying to get to on time in another city. So it's going to [make] the character impatient."
Our labels:
• Indirect: waiting for years to frustration, impatience
• Exploratory: switches from describing scene and events to narrating internal dialogue about the character's feelings
• Convergent: an expected reaction to the situation that describes the effect of the train's lateness

Following the prompt "A train arrives at the station," P9 started writing a fantasy story about frogs waiting for their tadpoles to get back from Tadpole Kindergarten. One round of suggestions read: "sound of a bell ringing and the frog who was holding the bell was holding a tray of frogs..." As the participant explained, the specific "sound of a bell ringing" in the suggestion made them think about sounds in general and what kind of sounds there could be in the setting of their story. The participant wrote "Once inside the parlour they were all taken back by the ringing and clattering of dishes and trays." Here, the participant took a concrete description of sound (sound of a bell ringing), shifted from that concrete description to the general concept of sound, and then decided what particular sound would be in their story ("clattering of dishes and trays").

Integration 6
Input (summary): [Tadpoles taking the train back home from Kindergarten...] ...Once inside the parlour they were all taken back by the
Suggestion: sound of a bell ringing and the frog who was holding the bell was holding a tray of frogs and he was holding a tray of tadpoles who were all waiting for the new tadpoles
Integration: Once inside the parlour they were all taken back by the ringing and clattering of dishes and trays. Frogs and toads were excitedly gulping down the various fly filled delights inside. "Georgia! Barry! Tadette!" beamed Mr Willeker. "You all look so well!" Please take a look at the menu.
Explanation: "I found that interesting as I guess it made me think more of like the sounds that could be inside this parlor or something ... because, basically, I was going to end up doing another long description that's probably quite boring. Probably similar to my previous thing I was writing, but I could then think about the sounds like clattering plates."
Our labels:
• Indirect: sound of a bell + tray to ringing and clattering of dishes and trays; tray is shared, but most of it is indirect
• Exploratory: participant uses it to lead in a different description (see their explanation)

We summarize these axes of integrative leaps in two figures. Fig. 5 shows direct integrative leaps, and the other figure shows indirect ones. In both, the left half is exploratory, the right half is confirmatory, the top half is divergent, and the bottom half is convergent. Participant IDs are noted along the horizontal axis, aligned to the corresponding instances.
We can see a few patterns when surveying these leaps in total. For example, participants generally made more direct leaps than indirect leaps, but these are also related to the other dimensions: most direct leaps were also convergent, addressing necessary and local narrative features, though there are several exceptions. Conversely, indirect leaps are slightly biased toward divergent integrations. Similarly, the exploratory label often coincides with divergent, but we can see several exceptions to this. On the sides, we include high-level descriptions of what each integration contributes to the developing story, with illustrative examples provided for each quadrant.

Outcome Evaluations
In addition to collecting participants' comments on their experience during the interaction with the system, we were interested in capturing their overall impressions and specific post-writing thoughts related to the prior assumptions and explanatory models we had captured.

General Impressions.
We obtained general impressions with one open-ended question ("What are your impressions from using Editor-Red?"), and subsequently tagged these with overall sentiment valence (summary in Table 5). We found that a majority of participants (N = 13) noted largely positive experiences with the interface, offering considerably different reasons. Some participants felt the suggestions were impressive or surprisingly relevant; for example, P20 noted that they were "pleasantly surprised by how spot on the predictions were at times," and P12 wrote "I was impressed by its ability to generate sentences based on the context." Some participants indicated that suggestions were directly helpful. P1 made an explicit connection to writer's block, saying "I really enjoyed the visuals and suggestions. I'm not a very visual person and struggle with writer's block, so this was very helpful." Others found suggestions indirectly helpful, due to mood enhancement or some other affective effects. P11 wrote that it was "Very helpful in bringing the mood up in writing. It helps create the ambience and emotions needed for the writing." P5 and P9 both noted that the interaction was "fun" in addition to noting other effects beyond helpfulness of suggestions. P9 wrote:
“It was fun/funny. The plot suggestions often didn't make sense but the description ones were either useful/thought-provoking or amusing to read even if I didn't use any part. The sound wasn't directly influencing my ideas but having background noise was relaxing. The pictures sometimes were relevant and sometimes not, so I didn't stare at them too long when they changed.”
P15 indicated that suggestions weren't always relevant, but that they were willing to do the work to find ways to incorporate them, which ultimately did make them helpful:
“It was definitely a place to draw inspiration from. I would not use the suggestions as they are, but with some modifications, the ideas presented were definitely helpful. The sounds are calming and peaceful. The background images are a hit or miss because they don't tell the same story as the text I am writing, but if I thought about the images a bit more creatively rather than literally, they were a bit more helpful. I don't ever see myself using the overlay version of the images though.”
We annotated several responses as being overall negative (N = 6). Most of these also included comments about suggestions being less relevant, but participants did not indicate that they were necessarily able to integrate them despite this (in contrast to P15). For example, P16 wrote:
“good idea but implementation worse than I expected. Images and text suggestions did not match storyline well, some proposed options were too exotic like "ustrophobia", selection tool to find images was highlighting parts of word not whole word, sounds were not very relevant,”
P18 also commented on suggestion relevance, but focused on the writing context and timing rather than the overall relevance, noting that "I think it's interesting, but I don't see the plot or descriptions as being particularly helpful unless it's for a writing prompt. Once you get into the story, the suggestions are not very relevant." Still other participants seemed neutral about their experience with the interface (N = 4), either providing little direction in terms of what did and did not work for them, or explicitly accounting for both strengths and weaknesses in a balanced manner. P14 wrote:
“It was sometimes helpful when I don't have any ideas, but not super helpful… Sometimes the text might not make sense… The images are usually slightly off topic, but that could be helpful in giving me new ideas (they are very aesthetic/instagramfeel)… I liked the sounds when they existed (it really does bring me to the place)”

5.3.2 Suggestion Helpfulness.
Participants' overall impressions contained general indications of suggestion helpfulness, but we were also interested in obtaining fine-grained reflections to better elucidate the conditions and motivations involved in this. Again, participants diverged in what they found helpful and why. Here, we coded responses as indicating suggestions were Definitely helpful (N = 2), Helpful (N = 5), Somewhat helpful (N = 11), Rarely helpful (N = 1), or Not helpful (N = 4); see Table 6 for full details. In addition to this, we found several different ways that suggestions were or were not helpful.
Table 5.

| Sentiment | N | Example |
| --- | --- | --- |
| Positive | 13 | "In Editor-Red I felt like I was writing in a time travel machine rather than staring at a blank page. I think it helped me feel a lot more grounded, present in the moment and in my body rather than just a disembodied brain trying to force words on a page." |
| Negative | 6 | "I didn't find it very helpful, but I could see how some features might be useful if developed further. The suggestions didn't seem to take into account the total content of what I had written, and so they seemed irrelevant and even distracting (for example, a phone scrolling instragram on a beautiful summer day when the phone in my story was a handset phone in a hotel room)" |
| Neutral | 4 | "It seemed fairly easy to use. I primarily was looking at the Plot section, possibly because it was more relevant to the given prompt while the Description section seemed to be more like different prompts." |

New ideas, phrases, words. A common indication was that suggestions yielded new ideas, possibly in phrase form but sometimes even as a word that was contextually useful. Although the suggestions typically contained full sentences, subsets of these were more often described by participants as being helpful. The helpfulness of these ideas also extended to their presentations in other modalities (images and sounds). P21:
“The suggestions were helpful. At one point, I felt that there had not yet been enough action in the story and the Editor suggested an event to happen. Another time, the sounds that were playing triggered an idea for the next setting of the story. Other times, even if I didn't like the suggestion I was given, it would still give me an idea for how to proceed.”

Relevance. Another reason suggestions were or were not helpful had to do with how relevant they were regarded as being in the context of writing. Most participants were mixed on this.
For example, P8 wrote that the sounds seemed in tune with the tone of the story they were writing, but described challenges to overall relevance of suggestions:
“I was impressed that the sound suggestions seemed to pick up on the creepy, suspenseful tone of the story right away, and it could be helpful if the image suggestions followed the tone more closely. It kept showing me pictures of smart phones, which was not helpful. It would have been more helpful to see images of places and people for inspiration about how to describe their features. it would also be more helpful if the suggested images were more varied rather than all being pretty similar. That way the writer could choose from potentially useful images (and maybe even indicate which ones were more useful to see more like that?)”
P2, by contrast, found the sounds very distracting but the plot-level suggestions occasionally helpful:
“the plot suggestions were occasionally helpful, but the descriptions were usually completely off the mark; the background images were aesthetic but not totally related, and the sounds were very distracting”

Mood, continuity, and flow. Some participants expressed effects on process rather than on content, as we also observed in their comments during the interaction. They suggested benefits beyond the direct application of suggestions; for example, P1 noted:
“For the most part I think that they were helpful. Even if I didn't use every idea that was suggested to me, I was inspired by their mood.”
P11 also pointed to this, writing about flow, mood, and ambience (the latter of which did seem to depend on relevance to the general topic of their progressing story):
“Yes! They are definitely helpful! I think it helps to prompt me to think about what to write next and keep the mood on writing. It helps to keep the ambience according to the topic Im writing as well.”

Table 6.

| Code | N | Example | D | H | S | R | Nt |
| --- | --- | --- | --- | --- | --- | --- | --- |
| New ideas/phrases/words | 10 | "The plot descriptions and text suggestions were helpful and creative. Some of the images, however were a bit generic. I mentioned a phone and the grid overlay just shoved several iterations of smartphones/, it would be nice if it could show different types of telephones, and in that way, allow the writer to have a visual reference, in case the writer wants to describe the object in more detail." | | | | | |

“The suggestions could definitely be improved. For example, the first suggestion I had kept repeating the same things in the plot box and the description box was very plain and essentially what I had already written. However, the second time I used a suggestion went better and I was able to draw inspiration from the images and plot box.”

Additional thoughts. Two participants found suggestions not or rarely helpful during the course of their writing, even when they might identify their potential value. Table 6 contains such an example, as well as a summary of all the other themes.

Outcome
Ownership. Almost all participants (N = 22) indicated that they felt the outcome text was primarily or entirely theirs, citing a few different factors to reason about their ownership. No participants reported not feeling ownership, while one expressed a little uncertainty.
Only took some phrases/words/ideas. Most participants pointed to their cognitive and creative work in absorbing suggestions in the form of phrases, words, and ideas into their stories. P5 made a comparison to a collaborative process with another writer:
“I would because I didn't take suggestions word-for-word except for one short phrase. Otherwise, it was like me bouncing ideas off a friend rather than the friend actually writing prose for me.”

Ideas were primarily theirs. Other participants made the perhaps related argument that general, global aspects of the stories were their own. From P13:
“Yes because while I used the suggestions somewhat, the general storyline was my own.”

Autonomy and authorial discretion. P22 pointed to authorial discretion, i.e. "final cut", as the source of their ownership over the outcome: "I think whatever platform you use to brainstorm, in the end of the day, you are the decision maker to put that into your writings or not." P21 referenced their work in deciding not only how a suggestion might be integrated, which we have discussed earlier, but also whether to integrate it at all:
“I would call it my own. While I did receive inspiration from the Editor, there was nothing that I took verbatim from the Editor. I didn't include anything without putting thought into whether or not it would add to the story”

Similar to real-world encounters. Some participants compared the references obtained through working with Editor-Red to real-world encounters or analogous explorations of open domains like the internet. P11 wrote:
“…Just like when we try to find ideas through browsing the internet, or just having a walk outside to create the mood and inspiration to write. But editor Red makes it more efficient and easier to find idea since it is all in one platform.”

Suggestions were not helpful. Some participants felt ownership due to not incorporating any suggestions. P8 very simply stated: "I didn't really use any of the suggestions."

Additional thoughts. P1 indicated "I would call it my own, but acknowledge the suggestions that were made to me." They were the only participant who suggested such an acknowledgement. All these codes are summarized in Table 7.

Differences from Initial Expectations.
To capture the ways in which writing with our interface qualitatively differed from their expectations of it, we posed an open-ended question. We coded the responses both for overall difference (Yes/No/Unsure) and themes that explain how or why, if so. A majority of participants (N = 13) indicated a difference from their expectations, with many also indicating that it was overall similar (N = 9). We coded one response as being unsure: P10 wrote "It was a novel experience. I had no expectations because I wasn't sure what to expect."

Similar. Six participants noted that the experience was overall similar to their expectations. P14 wrote:
“I have worked with language models before (I've played around with Writing with Transformer-type websites, using GPT2 for applications, and I've actually done an NLP externship that involved making image recommendations based on text haha using Unsplash too) so it was similar to what I expected in that it sometimes doesn't make sense, says things that are not super related, but could be coherent/interesting sometimes.”

Less relevant. Participants remarked that suggestions were less relevant than they imagined initially, if not always then some of the time at least. P4 remarked:
“Sometimes I had no idea where the pictures in the grid came from, because sometimes they seemed relevant to what I had written and then sometimes it seemed like they were completely random. I expected the plot suggestions to be a little less repetitive.”

Less out-there. P2, among others, noted that suggestions were less big-picture or less out-there (which we use to imply further from the written narrative) than they anticipated:
“fairly differently... I guess I was expecting some bigger-picture feedback, like larger plot suggestions or thematic images (like outer space for a space-related story ... though how clear would it be to AI that the story is about space?)”

More creative/intelligent. Two participants noted suggestions being more creative or intelligent than they expected. P1 wrote:
“It was surprising to see the intelligence of the AI and the creativeness of the suggestions, for example "cryogenic sleep" was a very novel idea suggested to me.”

ACM Trans. Comput.-Hum. Interact.

| Code | N | Example | Yes | Unsure |
| --- | --- | --- | --- | --- |
| Only took some phrases/words/ideas | 10 | "Since some of the suggestions were pretty far out there, I would say that what I wrote is my own. However, there were interesting moments where I did copy a phrase that the system proposed. This felt more like plagiarizing than, say, picking a different word from a thesaurus. But I don't think the system can take too much credit since it was generating ideas based on my writing." | 10 | 0 |
| Other | 3 | "It depends. If I only used it to provide visual references and sounds for describing a scene, then would still call it my own, but if i got vital plot points from the editor suggestions, then I wouldn't fully call it my own work." | 2 | 1 |
| Ideas were primarily theirs | 3 | "It definitely feels very much like my own because almost all the words were mine, the story and progression is mine, the tone is very much mine." | 3 | 0 |
| Autonomy and authorial discretion | 3 | "I would call it my own because even though I did use the features to brainstorm, but I mostly write it on my own and I think whatever platform you use to brainstorm, in the end of the day, you are the decision maker to put that into your writings or not." | 3 | 0 |
| Similar to real-world encounters | 2 | "Yes. Because it helps me find an idea, but I was the one who developed the story and make the story coherent. I think Editor Red is just a platform that helps a lot in creating ideas and mood in writing, not to help in writing the whole story itself. Just like when we try to find ideas through browsing the internet, or just having a walk outside to create the mood and inspiration to write. But editor Red makes it more efficient and easier to find idea since it is all in one platform." | 2 | 0 |
| Suggestions were not helpful | 2 | "Yes, because it was produced by me entirely" | 2 | 0 |

Table 7. Codes from open responses regarding ownership over outcomes after writing. N indicates number of participants, Example shows a corresponding quote, and Yes/Unsure are counts of participants reporting a feeling of ownership or being unsure (no participants responded no).
Less subtlety/control. P7 expected and desired differences in what parts of their writing the system attended to, indicating that they might like to do this through some control input:
“I thought it would just look at the last few words that I had written and it would ideate on those ideas. Sometimes that was the case, particularly for the image suggestions. However, the text suggestions would sometimes go all the way back to the beginning of the story. I think that I wanted more control over where the AI was paying attention, but it otherwise did what I thought it would do.”

Slower. P16 remarked that the interface was slower to respond than they expected, in addition to suggestions being less relevant: "its interface acted as expected but I expected it to give suggestions faster and those be more relevant."

Not as directly usable/helpful. P9 and P3 noted that suggestions were not as directly usable or helpful as expected. P3 wrote: "I thought it would give me more suggestions/sentences that I would just copy into my writing directly. It was more of abridged words/phrases."

5.3.5 Human-AI Differences.
We assessed differences in participants' practical expectations of our system and human writing partners with a counterfactual item: how did they think writing with such a partner might be different?

| Code | N | Example | Yes | No | Unsure |
| --- | --- | --- | --- | --- | --- |
| Similar | 6 | "Yes, it did. It is similar with the introduction video and I can use it easily by watching that." | | | |
| Less relevant | 4 | "Sometimes I had no idea where the pictures in the grid came from, because sometimes they seemed relevant to what I had written and then sometimes it seemed like they were completely random. I expected the plot suggestions to be a little less repetitive." | 4 | 0 | 0 |
| Less out-there | 3 | "I expected the sounds to be more ambient sounds that contributed to a certain vibe of the story, but they seemed much more random and chaotic, which maybe had to do with the content of my story. I didn't expect that the editor would be able to draw upon elements of the story I had already written. I expected it to imagine new storylines that may have taken me in a new direction." | | | |
| More creative/intelligent | 2 | "It was surprising to see the intelligence of the AI and the creativeness of the suggestions, for example "cryogenic sleep" was a very novel idea suggested to me." | | | |
| Less subtlety/control | 2 | "The plot suggestions were not as subtle as I expected, like I thought it could help lead into plot points but instead the suggestions were often far off things I'd have to work toward and might take a bit of time to write to that point to incorporate the plot suggestions and make it make sense." | 2 | 0 | 0 |
| Slower | 2 | "see answer above for details. its interface acted as expected but I expected it to give suggestions faster and those be more relevant" | | | |
| Not as directly usable/helpful | 2 | "The sounds were not what I was expecting. They sounded quite eery and scary, and I was trying to write something more lighthearted. Some of the suggestions were surprisingly good (like including characters that I'd mentioned and dialogue). Some of the suggestions also looped around though, and didn't really make sense (& I maybe expected them all to be kind of ok). I liked that the suggestions were slightly longer than I expected though." | 2 | 0 | 0 |

More collaborative/communicative. Most commonly, participants expected greater communication and a more collaborative interaction from a human co-writer. P7 noted this:
“I think I would want another human to be more coherent. I would expect them to make more meaningful contributions to the work and I think that would feel more like a collaboration. This felt like I was immersing myself in a writing environment where the things were tangentially related to what I was writing, but not exactly relevant.”

More understanding/experience. Other participants alluded to experience of the world or understanding about more general aspects of narrative development. P6 wrote that "I could explain to the person the story I am thinking about. I could convey the tone and the feelings I am trying to infuse in my text."

Speed (slower/faster) or effort. Participants thought writing with a human would be either slower or faster, or require either less or more effort in comparison. More commonly, participants thought it would be slower and require more effort. P22 reasoned: "I think, it will require more efforts because mutual agreements between the other humans are needed and the outcome really depends on how you and that person's relationship, their background, mutual understanding, etc." P11 similarly expected writing with a human to be slower and/or require more effort, especially one who isn't a professional author. They attributed this to time needed for information processing, searching with other tools, etc.:

| Code | N | Example |
| --- | --- | --- |
| More collaborative/communicative | 10 | "I think that another human being would have asked questions along with suggestions in order to better tailor their suggestions." |
| More understanding/experience | 8 | "Yes. I think the human would be more helpful because they could suggest if more description should be added to the setting or if the character needs to be developed more or suggest a direction for the plot, none of which I felt I got from the editor-red suggestions." |
| Speed (Slower/Faster) or Effort | 4 | "Slower going. Plot suggestions would have made more sense, but I don't think description ones would have been much difference in help. A human might have been more helpful with naming things in the story." |
| More questions | 3 | "I think we probably wouldn't asked each other questions back and forth like "why is the character doing this? What does it sound like?" etc. Writing with humans tends to involve a more "question-based" approach." |
| Other | 2 | "Another human might have challenges coming up with ideas just like the writer. The suggestions might have been different which would have geared my story in a different direction altogether." |
| Less self-driven | 2 | "Writing with another human would have definitely changed the course of the story. It also would have felt less like my writing because of the other person's ideas." |

Table 9. Codes re: expected differences from writing with a human co-writer, after writing. N indicates number of participants, Example shows a corresponding quote. N.B. some responses are labeled with more than one code.
“I think it would take more time if I write with another human since he/she would have to think for ideas and suggestions as well, or if not, he/she might also use another artificial intelligence like Google to find more ideas. Unless, the human is a professional author. If not, I think it will take more time to write as I would have to discuss as well. In addition, another human would not be able to create the mood/ ambience that I would like to have while writing.”

More questions. P3, among others, expected more questions along the writing process:
“I think we probably [would've] asked each other questions back and forth like "why is the character doing this? What does it sound like?" etc. Writing with humans tends to involve a more "question-based" approach.”

Less self-driven. P14 and P15 expected the process to be less self-driven or self-owned. P14 wrote: "It might also feel less introspective (I enjoy the space from being alone)," while P15 alluded to ownership (see Table 9).

Additional thoughts. P23 expected that another human might face similar challenges to the writer (see Table 9 for quote) and P16 simply indicated a general difference, without specifying their reasoning about why.

Relating participant expectations, processes, and outcomes
Our three-fold study data collection generated a great deal of information from relatively few participants, describing each one's interaction with our system in substantive detail. Although our study design's primary goal is for these three types of data to collectively provide insight, here we review a few examples of instances where explicitly combining these sources of data at the level of the participant or sample provides additional information.

Anchoring to prior expectations.
Regarding the possibility of AI creativity, P6 noted that they thought it "depends on the amount and type of data that will be available to the AI to create something new, " emphasizing that the "broader and [more] various the set of data the more creative the AI could be. " During the interaction, we observed P6 not engaging with the suggestions to advance their story, but rather, as discussed earlier, attempting to improve the system suggestions with their writing instead. In this case, an inaccurate expectation resulted in the system being unhelpful to them, due to their behavior anchoring to this expectation rather than adjusting to the system's behavior during the process.
By contrast, some participants who were optimistic about the ability for AI to be creative managed to find utility in suggestions that may even have reflected poor or less coherent language-modeling behavior. For example, P15 wrote that "…information/ideas provided by AI can be completely illogical which is sometimes the best creativity" and, after writing, indicated their willingness to make sense of and incorporate possibly irrelevant suggestions: "with some modifications, the ideas presented were definitely helpful… images are a hit or miss because they don't tell the same story as the text I am writing, but if I thought about the images a bit more creatively rather than literally, they were a bit more helpful."

5.4.2 Adjustments to prior expectations.
Some participants appeared to adjust their prior expectations after interacting with the system. A particularly clear case is P1, who initially expressed a belief that "AI can not be creative," but could be "accurate." During the interaction with the system, having received a suggestion, P1 was impressed with it and was contemplating whether to characterize it as “accurate” or "creative," finally coming to the conclusion that "this is a really good plot and it's creative enough." After the interaction, they noted that "It was surprising to see the intelligence of the AI and the creativeness of the suggestions." This participant initially identified that Value (useful to people) was the most important aspect of human creativity to them, and described the suggestions afterwards as "helpful. Even if I didn't use every idea that was suggested to me, I was inspired by their mood." The suggestions were useful to them in their process of writing, perhaps demonstrating the characteristics of Value.
At a sample level, a majority of participants (N = 14) initially responded that they would consider the final text to be "co-written by myself and AI"; afterwards, however, almost all participants (N = 22) indicated that they would call the written text their own (with one participant unsure). In addition, all those who initially responded Unsure or No to differences between human and AI text production (N = 7; P23, P17, P10, P6, P22, P9, P3) were able to communicate clear expected differences after the interaction. For example, P6 initially indicated that they were unsure, suggesting that "…it depends on the level of development of the AI", but finally wrote that with a human they "could explain to the person the story I am thinking about… convey the tone and the feelings I am trying to infuse," which is about explicit communication and intuitive influence rather than modeling performance. Participants had viewed a video of our system before the initial response, indicating that their changed perception was informed by actually interacting with the system rather than by its overall design and method of suggesting.

5.4.3 Do more accurate mental models of AI improve the experience or outcome?
To examine this question, we consider the two opposed categories of mental models of AI: Sparse-Abstract and Sophisticated-Operational. These groups respectively had the least and most detailed and accurate expectations of AI. We can examine how their outcome evaluations varied across the dimensions of overall experience, suggestion helpfulness, and differences from expectations.
Sparse-Abstract. 8/14 participants reported overall positive experiences, with 2 neutral and 4 negative. 10/14 total reported suggestions being at least sometimes helpful (1 rarely helpful, 3 not helpful). 9/14 indicated that the experience was different from their expectations, with 1 unsure and 4 reporting no or not much difference.
Sophisticated-Operational. 3/5 participants reported overall positive experiences, with 1 neutral and 1 negative. All 5 indicated that suggestions were at least sometimes helpful. 3/5 indicated that the experience was different from their expectations, with 2 reporting no or not much difference.
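To make the two groups easier to compare, the counts above can be converted to proportions. The following is a minimal sketch, assuming only the counts reported in the two preceding paragraphs; the names `groups` and `share` are hypothetical and introduced purely for illustration.

```python
# Counts transcribed from the two paragraphs above; the layout is hypothetical.
groups = {
    "Sparse-Abstract": {"n": 14, "positive": 8, "helpful": 10, "different": 9},
    "Sophisticated-Operational": {"n": 5, "positive": 3, "helpful": 5, "different": 3},
}

def share(group: str, outcome: str) -> float:
    """Return the fraction of a group's participants reporting an outcome."""
    g = groups[group]
    return g[outcome] / g["n"]

# Print each group's proportions side by side for the three outcomes.
for name in groups:
    row = ", ".join(
        f"{outcome}: {share(name, outcome):.0%}"
        for outcome in ("positive", "helpful", "different")
    )
    print(f"{name} -> {row}")
```

With only 5 participants in the Sophisticated-Operational group, each participant shifts a proportion by 20 percentage points, so these figures are best read descriptively.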
The data in this case indicate a complex relationship between the depth and accuracy in explanatory models of AI and experiences with our interface. A simple assumption might be that having more well-calibrated expectations of AI systems might arise from technically deeper and more accurate reasoning about how it generally works, which may appear to be supported by how helpful participants thought suggestions were.
However, our observations and participants' comments about suggestion helpfulness suggest that the differences have more to do with styles of writing and openness to narrative change than accurate expectations of the system's behavior. This is reinforced by the lack of clear difference in whether the system behaved as expected or not between these two groups.
Even so, we can explore this further by considering whether accurate expectations themselves were clearly associated with positive impressions or indicated suggestion helpfulness. 9/13 of those who reported a difference from expectations indicated an overall positive experience, compared with 4/9 who didn't. For helpfulness of suggestions, 9/13 who reported a difference from expectations found suggestions at least somewhat helpful, as compared with 8/9 who didn't. Again, there is a divergence between the two outcome variables, suggesting a complex relationship between expectations and outcomes.
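The divergence between the two outcome variables can be seen by comparing the sign of the gap in each. The sketch below is illustrative only and uses just the fractions quoted in the preceding paragraph; the names `reported` and `gap` are hypothetical.

```python
# (numerator, denominator) pairs transcribed from the paragraph above.
reported = {
    "positive experience": {"difference": (9, 13), "no difference": (4, 9)},
    "suggestions helpful": {"difference": (9, 13), "no difference": (8, 9)},
}

def gap(outcome: str) -> float:
    """Proportion among 'difference' minus proportion among 'no difference'."""
    d_num, d_den = reported[outcome]["difference"]
    n_num, n_den = reported[outcome]["no difference"]
    return d_num / d_den - n_num / n_den

# A positive gap means the outcome was more common among participants who
# reported a difference from expectations; a negative gap means the opposite.
for outcome in reported:
    print(f"{outcome}: gap = {gap(outcome):+.2f}")
```

The gap is positive for overall experience and negative for suggestion helpfulness, which is the divergence noted above. The sample is small, so the sketch describes the reported counts rather than testing for significance.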

Usability and overall experience
5.5.1 Usage data. In all, participants requested 165 suggestions and received 162 (3/165 requests were not resolved, for example due to requesting a subsequent suggestion while one was already processing). The median number of suggestions requested and received per participant over the 20-minute session was 7 (min = 2, max = 15). Participants wrote 229 words on average in Editor-Red compared with 296 in Editor-Green. With regard to suggestion modalities, on average participants had images toggled "on" for 99.8% (min 97.5%, max 100%) of the duration of their active engagement with the system, and sounds for 77.1% (min 25.2%, max 100%), with this duration measured from either the first suggestion request or alterations to these controls, to the last. During this period, participants toggled images on and off between 0 and 9 times (median 1) and sounds between 0 and 11 times (median 1.5).

System usability.
As part of the post-task survey, we presented the participants with a battery of questions tailored to understand their experience using our interface, as shown in Fig. 7. The post-task survey contained 42 questions and produced more data than could be analyzed within the scope of this paper. We report those questions that examined the issues that are the focus of this paper.
Finally, we were also curious to know how participants felt towards the different modalities available to them in the interface (Fig. 7B). When asked to rate Q14.5 "I mostly used the textual suggestions and not pictures or sounds," most participants (14/23) agreed (AAA=8, AA=2, A=4, N=1, D=3, DD=4, DDD=1).

Effort and cognitive load.
To understand the cognitive load imposed by writing with our system, we sub-sampled a group of 4 items from the NASA TLX (removing physical effort and performance, which are less

Evaluation of outcome creativity.
We captured the participants' perception of the creative output and of Editor-Red's creative support by asking them to rate the following statement: Q14.1 "I did most of the creative writing, using Editor-Red just for suggestions." Almost all the participants (22/23) agreed with the statement, with the exception of one participant (AAA=9, AA=7, A=6, DD=1). When asked to rate Q14.8 "The suggestions made by Editor-Red were creative," we found that 16/23 participants agreed the suggestions were creative, 2/23 were neutral, and 5/23 disagreed (AAA=4, AA=4, A=8, N=2, D=2, DD=3; Fig. 7C).
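The agree/neutral/disagree totals reported above follow directly from the per-level response counts. As a minimal sketch of that tally, assuming the scale labels AAA/AA/A (agree), N (neutral), and D/DD/DDD (disagree) used in the text, with function and variable names that are purely illustrative:

```python
# Scale levels as abbreviated in the text; the grouping into three bins
# is the one the prose uses when it reports "agreed", "neutral", "disagreed".
AGREE = ("AAA", "AA", "A")
NEUTRAL = ("N",)
DISAGREE = ("D", "DD", "DDD")

def summarize(responses):
    """Collapse per-level counts into (agree, neutral, disagree, total)."""
    agree = sum(responses.get(level, 0) for level in AGREE)
    neutral = sum(responses.get(level, 0) for level in NEUTRAL)
    disagree = sum(responses.get(level, 0) for level in DISAGREE)
    return agree, neutral, disagree, agree + neutral + disagree

# Per-level counts copied from the paragraph above.
q14_1 = {"AAA": 9, "AA": 7, "A": 6, "DD": 1}
q14_8 = {"AAA": 4, "AA": 4, "A": 8, "N": 2, "D": 2, "DD": 3}

print(summarize(q14_1))  # Q14.1: did most of the creative writing
print(summarize(q14_8))  # Q14.8: suggestions were creative
```

Both items sum to the 23 participants, and the collapsed counts match the 22/23 and 16/23 agreement figures given in the text.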
We further asked them to evaluate whether or not the final text created during the task using Editor-Red was creative: Q17 "Do you consider the text that you wrote in Editor-Red creative?" Most participants responded "Yes" (N=20), with 3 participants answering "No." Similarly, we asked them to rate the text they wrote with Editor-Green: Q19 "Do you consider the text that you wrote in Editor-Green creative?" 14/23 participants responded "Yes" and 9/23 responded "No." When asked Q21 "In which editor was the text that you wrote more creative?" (from Editor-Red=1 to Editor-Green=7), 13/23 participants leaned towards Editor-Red, 2/23 indicated Neither or Both, and 8/23 leaned towards Editor-Green (1: (N=1), 2: (N=4), 3: (N=8), 4: (N=2), 5: (N=5), 6: (N=1), 7: (N=2)). Participants also showed a preference for Editor-Red when asked Q25 "Which editor did you prefer for writing a creative text?"

Agency and ownership
We found that participants generally enjoyed writing with the help of suggestions from Editor-Red and were enthusiastic about the concept of writing with a "collaborator," especially once natural language generation capabilities improve and the suggestions are closer to their own writing style. From observing the writing session and post-survey conversation, it was unclear that the issue of ownership and agency in co-writing with AI was something that participants were at all concerned with or gave any thought to.
Two participants admitted they were primed by questions in the pre-session survey to think about agency and ownership in writing using the suggestions of the system. P3 explained they had thought that "using AI interface would make me feel that I wasn't even doing my own writing," but ultimately P3 felt that the system "helped along" but "didn't tell me what to write." P22 (one of the four participants who didn't visibly use Editor-Red's suggestions) explained that even if they had used the suggestions, they would still call it "using my own creativity" as they believed that "even deciding to use it or not, is actually really a choice for me" and is seen as a part of "creativeness." Participants seemed to think that this type of system can improve creative writing by offering support and being less cognitively demanding than a simple text editor.

6 DISCUSSION

6.1 Suggestion quality: relevance, coherence, and variety
Several participants rejected suggestions for a perceived lack of coherence or relevance to their developing texts, which comports with prior work on language model assisted writing [18,22]. Building on this, we have also shown that several others in our study did not see this as an obstacle to working with the system and in some cases appreciated less immediately semantically relevant suggestions and were able to incorporate ideas from less linguistically coherent suggested sentences. While we attempted to trace this difference to expectations, model behaviors, and other potential predictors, our expectation from observing participants is that this has primarily to do with a difference in participants' approach to creative writing. As such, the relevance of suggestions may not be a simple variable to always aim toward maximizing; rather, the optimal level of relevance might vary by writer.
Sometimes, it might also vary depending on other circumstances; for example, some participants noted that less relevant suggestions likely required more time to integrate, and that they might do so given additional time to write. This may also be reflected in the fact that on average, participants wrote less text in Editor-Red than in Editor-Green, though we note this is also related to other aspects of the interaction in our study (the novelty of the interface, participants talking more while using Editor-Red, etc.).
The ambiguity in assessing relevance extended to the multimodal concept representations; even when they were not used directly, their contribution to the environment might vary with their relevance. For example P8, who didn't visibly incorporate any suggestions, noted that they were "impressed that the sound suggestions seemed to pick up on the creepy, suspenseful tone of the story right away, and it could be helpful if the image suggestions followed the tone more closely" as compared with P5 who wrote that the "sound wasn't directly influencing my ideas but having background noise was relaxing." Balancing relevance with variety is likely to be important in making suggestions useful to participants, in our assessment. Participants especially noted the homogeneity of images: "I mentioned a phone and the grid overlay just shoved several iterations of smartphones, it would be nice if it could show different types of telephones" (P20). This homogeneity also extended to demographic factors: "there's just a bunch of white guys staring at me and I don't know why" (P2) and "they are all images of straight blonde Caucasian women" (P5). We noted that these instances were not directly related to query material, indicating that these might reflect broad biases in available images.
Technical approaches to generative modeling and information retrieval to support creative processes should, in our view, be intentional in handling these parameters (relevance, variety) and consider individual and situational variation in their optimality criteria. Modeling this is likely non-trivial and raises questions such as: what is relevant when and to whom? When are precise, logical suggestions needed, and when are surprising, unusual suggestions needed? The integrative leaps we have reported on suggest the practical challenges in automatically inferring this trade-off, or even reducing it to a simple, one-dimensional control. A helpful source of information in our case is the writers; they often have strong intuitions about both. Finding channels for writers to communicate their personal stylistic and contextual narrative needs to both interfaces and the underlying models, for example in natural language or by providing examples, may help these systems robustly support creative expression by both being flexible and allowing users to clearly and naturally communicate their needs and intentions.

6.2 Editor-Red beyond writing suggestions
6.2.1 A supportive writing environment. Though what Editor-Red seems to be doing on the surface is providing words, lines, and ideas to borrow and rely on, we observed much more than just that in the interaction settings we studied. A wide range of participants' comments highlight that the system acted as a support tool in diverse ways. Those participants who actively integrated the system's suggestions admitted that Editor-Red was structuring their process of writing. For instance, P1 admitted that they found themselves at a certain point "writing for the suggestions," seeing Editor-Red as "a form of motivation to continue writing" in order to get better suggestions. P3 commented that Editor-Red helped them "keep going" and "continue along" with their writing when they otherwise would have stopped.
Editor-Red redirected attention from being stuck (P12) and helped participants feel "less stuck" even when they were not taking the system's suggestions (P16). Writing as a process is fraught with self-doubt, anxiety, and feeling overwhelmed; systems like Editor-Red can mitigate stress by being a comforting distraction like "petting a cat" (P6).
Some participants used suggestions as just a starting point for their own individual creative journey. For example, P23 explained that Editor-Red suggestions often gave them a different idea rather than leading them to take the suggestion directly. As P23 further explained, getting inspiration from something can be unrelated to what that inspiration was. The integrative leaps (see §5.2.5) that participants made when engaging with Editor-Red illustrate a wide range of examples of what users are capable of and willing to do when integrating with a writing system.
6.2.2 Personal and cultural references versus AI-generated references. In the "blank page" writing with Editor-Green, 10 participants out of 23 visibly relied on cultural (books, TV shows, music videos) and personal references (memories, personal experiences, and immediate surroundings, e.g. describing what one can see from the window). For example, P8 writing in Editor-Green with the prompt "A train arrives at the station," explains that they are thinking about "the train station and Anna Karenina, kind of thing." P9 writing in Editor-Green with the prompt "The phone began to ring" explains that "the phone" made them think about a landline, a landline made them think about a hotel, and that, in turn, made them think about the last trip they had when they were staying in a hotel, which prompted a subsequent description they made in Editor-Green writing (after completing Editor-Red writing).
When writing in Editor-Red, 5 participants out of these 10 did not visibly use any cultural or personal references in their writing, but instead relied on the Editor-Red suggestions. Their story development was structured and oriented by the suggestions they incorporated. This is not that surprising on its own since participants were asked to use the features in Editor-Red. However, it is possible to imagine that writing with systems like Editor-Red allows a user to rely less on one's own cultural and personal references and one's "self" (an individual's interiority, a means and ends of one's own actions [51,52]). Rather, it provides an opportunity for a user to interact with the "self" of a system, deriving references from its suggestions.
We hypothesize that systems like Editor-Red can also be used when users, for situational or psychological reasons, do not want to engage with their own experience and inner thoughts, and that they can inform the building of systems that provide therapeutic support for a user. This type of psychological support function has been recently identified in other human-AI creative interaction domains [91].

Dynamics of suggestion integration
Ideas for writing often came not through directly applying Editor-Red's suggestions, but as a result of active engagement with the system from the participant's side and their readiness to do cognitive work in extending, adjusting, and altering suggestions and/or prior text to better suit the combination of text they had written and either any thoughts in their mind about how to proceed (confirmatory) or ideas about altering the narrative to lead in a new direction (exploratory).
The comments that participants made explaining these integration examples provide an insight into the multiplicity and multidimensionality of practices involved in human interaction with generative language systems, and especially how users create new meaning through this interaction. We specifically did not aim to do a linguistic or semiotic analysis of the integrative leaps that we documented, which we argue would require a great deal more data in order to yield generalizable insights. Instead, we aimed to document some orientation points that reflect structural differences in participant behaviors during our study.
6.3.1 Creativity, inefficiency, and synthesis. When interacting with Editor-Red, participants' concepts of creativity in suggestions often seem to be constrained by the possibility of an easy transition. The possibility of an easy transition, in turn, is individually and contextually varying. Those participants whom we identified as willing to cooperate with Editor-Red and incorporate its suggestions did not seem to mind suggestions being "absurd," "crazy," and "out there" (we will refer to these as "out of sync"). These suggestions sometimes led to considerable changes to the subsequent and prior narratives; participants made decisive creative moves when they were willing to engage in this way.
Monster hunting, cryogenic sleep, a detective in 1890 Austria, and the first human to die on Mars were just some of the ideas that participants received as suggestions. Incorporating these suggestions depended on whether participants were mentally and emotionally ready to make the necessary efforts to synthesize an easy transition from the prior text and the given suggestion. It was "a much longer route" for monster hunting (P5), and didn't come "at a good time for the story" for cryogenic sleep (P1), and so these suggestions were not integrated. However, "you are a detective in 1890 Austria" made the participant think about the concept of time in their story (P1). P2, having completed almost the whole story that took place on a London farm, received a suggestion saying "you are the first human to die on Mars." P2 did comment that the description "is not very accurate" to their situation but then changed their mind: "You know, let's make it about Mars, why not?" and rewrote the story in four places to fit the premise of taking place on Mars.
We conclude from this assortment of complementary and contradictory behaviors that the incorporation of a suggestion that is "too creative" does not depend just on the content of the suggestion, but rather on the possibility of transition, which is influenced by individual and situational factors. The transition towards a suggestion that is unexpected and unrelated to the input text is dependent on the readiness and motivation of a user to do the requisite cognitive and/or emotional work toward a meaningful synthesis of elements. These observations align with Freiman's characterization of the writer's drafting process, involving a "state of unknowing", a "kind of faith" that something will emerge from the drafting, and ultimately how "something that perhaps lacked cohesion or structure now becomes more concrete or coherent in the making of the text" [39]. Freiman suggests this happens by the writer making cognitive, affective, linguistic, and other creative decisions through a series of drafts and changes. We also expect that cognitive work done on drafting and revising to achieve such a synthesis may become a path to support ownership of the text and creative endeavor, in our context of AI-generated suggestions.
6.3.2 Cognitive reorganization and expectations of non-human writing systems. What are the underlying cognitive mechanisms by which distant suggestions are able to be meaningfully integrated into users' existing narratives? Participants of our study were actively aware of the task environment [36]: writing a story using Editor-Red (nonhuman, AI system), a rhetorical problem (write a story given a prompt), integrating Editor-Red suggestions (which they had preconceptions of being based on human language, rule-based, possibly random, illogical, and creative), and the text itself that is evolving and changing. Since it was possible to get suggestions multiple times and the suggestions were different every time (both in content but also in terms of the level of relevance or detail), every new suggestion created a micro-moment of interaction and adjustment.
Attending to Editor-Red suggestions, building up all the missing cognitive links or not immediately visible links so as to update the story sometimes involves a considerable amount of cognitive reorganization of narrative information, in the sense of reorganizing what one already knows (e.g. Piaget's equilibration [75]) or, in this case, has already written. One possible mechanism for this is self-explanation, which is an attempt to make sense of new information by explaining it to oneself [20]. Unlike self-explanation in learning, wherein the central inferential process needs to construct new knowledge at the level of "the world", here self-explanation may provide an inferential process to reorganize the narrative by finding possible connections and associations, similarity, extracting abstract properties, or making referential links (for example, as we described earlier with P4 having the precondition of a crime, seeing an elephant that seems irrelevant, and explaining the presence of the elephant by making it the object of the crime involved). Other possible mechanisms for combining distant concepts have also been described in prior literature, such as causal reasoning [54], comparison and construction [97], conceptual integration or "blending" [25,92], and satisfying constraints like diagnosticity, plausibility, and informativeness [24].
In the case of writing with our system, the willingness of participants to engage in this process may come from user expectations, due to the non-human source of the suggestions. For example, we noted earlier that participants may expect suggestions to be random, illogical, having connection to the real world yet often being situationally "out of sync." In that way, any faults of the suggestions and difficulties that arise from those become not only something that has to be looked past but actually act as inherent features of the system and are accepted as part and parcel of this interaction. These suggestions can be seen as not a bug but an inherent feature and a necessary condition of this interaction, as humans perform integrative leaps, engaging in cognitive reorganization of narrative because they accept the premise of interacting and co-writing with a non-human system, and the implications that come with it.
6.3.3 How can "out of sync" suggestions be helpful for writing? Earlier work has illustrated how completely unrelated ideas and unusual word combinations can be evocative and productive for creative writing [19,31,94]. In the case of causal reasoning, the surprisingness of combinations may provoke additional and exploratory processes and thereby the production of creative ideas [54]. We hypothesize that another mechanism by which semantically distant suggestions might be useful is by explicitly prompting more critical evaluations of written content, i.e. what Flower and Hayes call "evaluating" and "revising" [36]. By contrast, we might consider highly probable and user-adapted word predictions, which can be absorbed into a writing task with minimal effort (e.g. a click) to accomplish well-defined goals more efficiently (e.g. respond to a work email). We can model distant suggestions with such semantic difficulties as we observe as being useful inefficiencies which prompt critical evaluations of drafts and suggestions, metacognitive reflection about narrative development, and ultimately axes for more substantial narrative reorientation, where otherwise there would be no prompt or incentive to re-engage with and reconsider prior thoughts and writing. More work is needed to examine this possibility in detail.

Design recommendations
We observed that participants are capable of making leaps to integrate suggestions into their writing even when the suggestions were unrelated to their current writing. However, there seems to be a general need to have these suggestions build on, refer to, or otherwise be relatable to aspects of their writing/story for many writers to have a more helpful experience. There is a need for details and descriptions of objects and important places when developing the story, and for systems to attend to the right parts of stories, which vary, when making suggestions. In our study, we observed most users rely on the system to enhance their writing when adding supporting material. When the system was not helpful in either introducing supporting material or helping them think of new directions, frustration and lack of trust in the tool often began to arise. However, as we have explored, suggestion quality is a multidimensional property which varies individually and contextually. To make suggestions useful to participants does not always mean maximizing their immediate relevance, but rather requires supporting the process of suggestion integration. Here we consider what that may mean for different technical and design considerations for creative writing support tools at every level of the process.
6.4.1 Datasets for creative writing. Participants in our study had different experiences with the two suggestion channels (Plot and Description), despite the commonality in the modeling method. Mirroring calls from other domains for data-oriented rather than model-oriented progress in AI [28,88], we argue that well-curated datasets oriented towards domain constructs can support diversity and relevance, two factors we identified earlier as especially salient in machine contributions to creative writing work. Larger and more diverse pretraining sets can also result in greater coherence, if matched with an appropriately parameterized model, which we find would be helpful to several users in a variety of contexts.

Language modeling.
What can better models help with? As noted, more modeling power can result in increased coherence and relevance, especially as processed sequences get longer, if pretrained on appropriately large and diverse datasets, as well as fine-tuned on downstream datasets that provide creative value. These properties are desirable in many cases, as pointed out by our study participants. In parallel, models with implicitly richer knowledge bases [74] may also extend more diverse suggestions to users, finding interesting relations with aspects of their writing, and assisting them in performing contextually appropriate and creatively fulfilling integrations.
What can't better models help with? Larger models are typically slower, more difficult to fine-tune and host, and increasingly closed-source, expensive to obtain access to, and private. Additionally, we noted many instances in which the cognitive work done by participants was the operative force in making suggestions helpful and ultimately able to contribute to their writing. For these participants, writing styles, and situations, larger language models may not necessarily help much, but would incur costs in interactivity, which were already pointed out by some participants in our current prototype. In our case, suggestions typically took 3-5 seconds after requests (given that we were running two separate fine-tuned models, extracting keywords, etc.), depending on the length of the input text; larger models may take significantly longer (one writer estimates GPT-3's Davinci model's typical speed at 147 words per minute [14]) and are very challenging to host and serve interactive requests with due to the resources needed.
Even the best possible language models have an extremely limited capacity to understand our intentions. They cannot reason about human internal cognitive processes, implicit judgments, and novel forms of creative exploration and expression that intentionally disregard convention. Better language models, better for different purposes, can support the process, but a great deal of what makes human creativity successful is outside of their purview.
Semantic influence. Some participants indicated a desire to influence or control this facet of suggestions with prior information, e.g. high-level story goals, moods, feelings, and ideas. While relevance can already be expressed to language models at sampling time to some extent, through stochastic decoding methods and controls like temperature, the ability to semantically "steer" relevance towards more fruitful integrations, rather than expressing it as a numerical value, might also better support diverse writers' diverse needs. Such steering can be explicitly enabled [50,53,58], for example, by conditional modeling, or, in the absence of specialized approaches, even discovered by so-called "prompt engineering" which has been successfully used by many for language-controlled visual art generation [73] with general-purpose vision+language models [80].
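To illustrate what expressing relevance "as a numerical value" looks like in practice, the following is a minimal sketch of temperature scaling, the standard sampling-time control mentioned above. It is not the decoding code used in our system; the function name and the toy logits are hypothetical and chosen only for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw next-token scores into probabilities, scaled by temperature.

    Lower temperatures concentrate probability on the highest-scoring
    (most expected) tokens; higher temperatures flatten the distribution,
    making less expected continuations more likely to be sampled.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate continuations.
logits = [2.0, 1.0, 0.0]

focused = softmax_with_temperature(logits, temperature=0.5)
diverse = softmax_with_temperature(logits, temperature=2.0)

print([round(p, 3) for p in focused])
print([round(p, 3) for p in diverse])
```

A single scalar of this kind changes how surprising sampled text is, but it carries no information about which direction a writer wants the story to move in, which is the gap that semantic steering aims to address.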

Interface design.
Overall goals of interfaces. Based on the behaviors observed in our study, we recommend that creative writing suggestions be designed to prompt and support cognitive processes that lead to suggestion integration and narrative engagement, rather than auto-complete style continuation. This seems to additionally support participant ownership over the outcome, as we observed in our study. There is a great deal of cognitive effort involved in writing with external stimuli, in order to make sense of them, recognize the possibilities for their contributions to the work, and perform effective integrations.
We argue that the focus of designing new creative writing support tools with intelligent augmentation should be on supporting this cognitive effort while preserving writer autonomy, authorial discretion, and creative flow. In our interface, we do this by implicitly discouraging directly absorptive behaviors; suggestions are presented in a different graphical environment rather than overlaid on the text, and the familiar tab key invokes new suggestions rather than directly integrating them into the writing. The corresponding reduction in cognitive load for most participants (17/23) by a small amount overall may reflect both the helpfulness of external suggestions in easing the cognitive burden of blank-page style writing, as well as the additional load introduced by the additional stimuli in context.
Multimodal support for a unimodal task. Additionally, visual and auditory suggestions cannot be simply inserted into a textual story, and we expect that the process of resolving these morphological differences to create meaningful semantic connections may also contribute to making creative leaps in writing stories. Our results suggest these features be made easy to turn off: this was a feature our participants used extensively to account for both individual and situational variation. Future work might examine the methods for communicating these parallel channels of information.
6.4.4 Evaluation criteria and methodologies. Our Expectation-Process-Outcome model, which guided our study design that combined surveys, behavioral observation, and semi-structured interviews, allowed us to capture several things: a rich representation of conceptually relevant background which participants brought into their interaction with a novel system, their interpretive reasoning through the course of the interaction, and their evaluative judgements and impressions afterwards. Additionally, through capturing prior assumptions and explanatory models, we were able to begin to obtain a fuller picture of how the interaction is framed by and adjusts expectations, as well as some effects this may have on the experience and outcomes.
We recommend that designers of complex, novel tools to support open-ended creative tasks similarly consider the conceptual priors of their users in conjunction with evidence from their experiences, behaviors, and a posteriori thoughts. Through this, we might begin to better characterize the significant level of individual and situational variation, and design tools that not only practically accommodate this but actively benefit from it.

CONCLUSION
This research presents an extensive study of machine-in-the-loop creative writing, centered around a new interface that makes writing suggestions through sight, sound, and language. Through collecting data on participant expectations, processes, and outcomes of interacting with this system, we discussed how individual writing approaches and narrative circumstances influence the interaction. By eliciting user explanatory models of AI, human and AI creativity, and creative writing, we explored how expectations might influence and be influenced by the interaction. We additionally reported on users' responses to suggestions through the lens of integrative leaps, by which participants incorporate suggested ideas into their writing process by performing cognitive work to make transitions possible.
As AI-based systems increasingly engage in traditionally human creative capacities, building stronger and more adaptive human-centered foundations for human-AI creative interaction will be increasingly important. Modeling advances in the systems periphery of everyday life have made it increasingly plausible that AI can be creative, but the more challenging work is to make it plausible that it might broadly extend our creative faculties by understanding our needs diferently than other human creative partners. We believe that deep and wide-ranging investigations such as those we described in this work can inform design methodologies and yield powerful and useful tools that extend our abilities.

ID QUESTIONS
6.1 I quickly figured out how to use Editor-Red
6.2 It was easy to come up with ideas while writing
6.3 It was easy to decide how I will continue this story
6.4 The more time I spend writing with Editor-Red, the better it gets.
6.5 The pictures used in Editor-Red distracted me from my task
6.6 The pictures used in Editor-Red were helpful
6.7 The sounds used in Editor-Red distracted me from my task
6.8 The sounds used in Editor-Red were helpful
6.9 The story that I wrote in Editor-Red is coherent
6.10 The story that I wrote in Editor-Red is creative
6.11 Using Editor-Red felt intuitive
6.12 Using Editor-Red was easy
14.1 I did most of the creative writing, using Editor-Red just for suggestions.
14.2 I enjoyed co-writing with Editor-Red
14.3 I enjoyed collaborating with Editor-Red
14.4 I equally used textual suggestions and pictures and sounds
14.5 I mostly used the textual suggestions and not pictures or sounds
14.6 The final product of writing is a result of joint efforts of Editor-Red and myself
14.7 The suggestions made by Editor-Red were coherent
14.8 The suggestions made by Editor-Red were creative
14.9 The suggestions made by Editor-Red were grammatically correct
14.10 The suggestions made by Editor-Red were relevant
14.11 The suggestions made by Editor-Red were surprising
16.1 It was easy to come up with ideas while writing
16.2 It was easy to decide how I will continue this story
16.3 The story that I wrote in Editor-Green is coherent
16.4 The story that I wrote in Editor-Green is creative
16.5 Using Editor-Green felt intuitive
16.6 Using Editor-Green was easy
21 In which editor was the text that you wrote more creative?
22 In which editor was the text that you wrote more coherent?
23 Where did you feel more relaxed when writing a story?
24 Where was it easier to write a text?
25 Which editor did you prefer for writing a creative text?
26 Where did you feel more focused when writing a text?
27 Where did it feel more demanding when writing a text?
28 Where did it feel more rushed when writing a text?
29 Where did you feel you had to work harder when writing a text?
30 Which editor made you feel more discouraged or annoyed when writing a text?