Formulating or Fixating: Effects of Examples on Problem Solving Vary as a Function of Example Presentation Interface Design

Interactive systems that facilitate exposure to examples can augment problem solving performance. However designers of such systems are often faced with many practical design decisions about how users will interact with examples, with little clear theoretical guidance. To understand how example interaction design choices affect whether/how people benefit from examples, we conducted an experiment where 182 participants worked on a controlled analog to an exploratory creativity task, with access to examples of varying diversity and presentation interfaces. Task performance was worse when examples were presented in a list, compared to contextualized in the exploration space or shown in a dropdown list. Example lists were associated with more fixation, whereas contextualized examples were associated with using examples to formulate a model of the problem space to guide exploration. We discuss implications of these results for a theoretical framework that maps design choices to fundamental psychological mechanisms of creative inspiration from examples.


INTRODUCTION
Examples - descriptions or representations of possible solutions (or parts thereof) for the same or related problems [8,18,35,83,84] - are an integral part of the creative problem solving process. Examples can take many forms, such as previous physical prototypes brought to a brainstorm [33], search results for patents for related problems [25], spoken ideas from collaborators [66], UI designs in an accessible gallery [50], or references from memory to earth animals or science fiction creatures when inventing new fictional alien creatures [89]. Importantly, examples can substantially shape what ideas come to mind [84,89]. This "structuring of imagination" [89] is sometimes helpful "inspiration" that leads to more creative ideas [19,23,33,80]. But examples can also have harmful "fixation" effects that constrain novelty and innovation [40,54,84,89] (for recent reviews, see [83] and [18]). Importantly, examples can influence problem solving without conscious effort or recognition [54,60,61], and persist in spite of creators' explicit intentions not to be influenced by them [40,84,85]. Perhaps in recognition of these facts, effective creators take an active role in finding, structuring, and interacting with examples [23,35,79], using a variety of analog and digitally mediated systems and practices, such as search engines [35,68,69,97], design workbooks and commonplace notebooks [26], online whiteboards [58,86], mood boards [55], and wider interactions with their community of practice, such as trade publications and conventions [23,33].
An important area of HCI research on creativity support tools therefore studies the design of interactive systems that can assist creators in discovering examples [9,39,47,48,74,80,82,87,88]; structuring, analyzing, and exploring collections of examples [15,16,53,57,81,93]; and adapting and using examples [39,48,50]. Designers of such systems need to grapple with an array of very practical interaction design decisions. For example, how should we support interaction with examples over different screen sizes? Should examples be delivered via recommendation (in small sets), a feed, or some other interaction paradigm? What information should be presented alongside an example? We would like to have a consistent theory to draw from to make these decisions. Beyond considerations of usability, we conjecture that such a theory would need to map design decisions (or classes thereof) to creativity-relevant behaviors and outcomes, ideally with a nuanced specification of the precise benefits and costs of each design decision for these behaviors and outcomes. A theory of human-example interaction like this could help us design better systems with sensible defaults, prioritize and negotiate design requirements, and guide evaluation.
As a step towards developing such a theory, we conducted an experiment with 182 participants solving a controlled analog to an exploratory creativity task [7]. We varied both the diversity of examples and the type of presentation: overlaid on the search environment (the "In-Context" condition), presented in a list (the "List" condition), or in a dropdown selectable menu (the "Dropdown" condition). The "In-Context" design was inspired by an emerging pattern of contextualizing examples in the creator's workspace or problem in HCI systems for example-based creativity [39,47,78,91,93] on the one hand, and theoretical descriptions of the use of examples to (re)formulate problems [34,35,43,59,67,79] on the other; the "List" and "Dropdown" conditions were designed to be representative of common interfaces for interacting with examples (in search results lists and pages of recommendations).
Our primary results were threefold: 1) "List" presentation harmed solution quality compared with "In-Context" or "Dropdown" presentation; 2) each interface condition was associated with distinct self-reported example usage strategies (notably, more usage of examples to "model" the problem space to guide exploration in the In-Context vs. List or Dropdown conditions, and more usage of examples to "stimulate" a specific direction of exploration in the List condition); and 3) the List condition's propensity for stimulation-based strategies was corroborated by an increased usage of "hill-climbing" strategies early on, as evidenced by analyses of the Euclidean distance between participants' moves.
We discuss how these results, in conversation with the literature on example-based creativity support systems as well as psychological mechanisms of creativity with examples, could contribute to a theoretical framework for designing interactive systems for creative problem solving with examples.

RELATED WORK

2.1 Sources of empirical variability in effects of examples
Prior work has examined how the consequences of examples for creative problem solving outcomes are related to characteristics of the examples, such as their novelty [1,6,11,80] (generally positive effects), conceptual distance from the problem domain [10,19,25,31,90] (mixed or curvilinear effects), and example diversity [12,21,29,38,80,94,96] (generally positive or contingently positive effects). Our work contributes additional empirical results on the relative contributions of (and potential interactions, in the statistical sense, between) example characteristics and example interface characteristics. In particular, we explore how the example characteristic of diversity might interact with example interface characteristics, such as whether the examples are presented as a list vs. in the context of a representation of the design space. To do this, we need to also consider the cognitive mechanisms of inspiration or fixation from examples (of varying characteristics), which might be more or less afforded by example interfaces. We discuss this body of literature in the next section.

2.2 Theoretical insights into human-example interaction
A number of detailed in-situ studies of creators have documented a range of strategies for working with examples, ranging from simpler, more source-driven strategies like direct source adaptation [24], to more complex and reflective strategies associated with more radical transformation of examples, such as source analysis and schema-driven source selection [24], analogical reasoning [3,27,28,37], and generating novel emergent features that can connect disparate attributes across examples [92]. These "processing strategies" can be described by a variety of theoretical frames from the psychological literature on creativity. We believe this theoretical level of description could facilitate our goal of synthesizing mappings between interface characteristics and effects of examples on creative problem solving outcomes. Some notable examples include basic memory mechanisms such as priming [63] and spreading activation [17,73], and higher-level cognitive processes such as conceptual abstraction and analogical transfer [20,28], conceptual combination [92], and problem reformulation based on examples [22,34,57]. Of particular interest in our study is a contrast between priming and spreading activation mechanisms on the one hand, which are associated with lower-level conceptual influences, and problem (re)formulation on the other, which is associated with more complex, higher-order processing of examples. In this study, we extend this literature by exploring how two specific mechanisms of processing examples (stimulation and (re)formulation) might be helped or hindered by different example interaction interfaces. To set the context for our results, we briefly review the literature on each mechanism here.

2.2.1 Using examples to stimulate ideation. Spreading activation has been invoked to explain the impact of external stimuli on ideation. For instance, the "search for ideas in associative memory" (SIAM) model [66] proposes that when ideas come to mind, whether from memory, or through discussion with others or exposure to examples, they also raise the activation level of other associated concepts and features in memory, which can stimulate or inhibit ideation by making certain sets of ideas more or less likely to be generated based on the current network of associations in memory. For example, an example idea "use as paperweight" (for a design prompt to generate alternative uses for a brick) may activate related concepts such as "office", or "is heavy", along with their associated concepts; subsequent ideas such as "construct a table", or "prop up a bookshelf" may then be more likely to come to mind, compared to ideas like "use as a weapon" or "makeshift goalposts for soccer". In this way, exposure to examples can shape the trajectory of ideation, and the corresponding floor or ceiling of creativity [6,13]. In this paper, we discuss this set of mechanisms under the label "stimulation", to capture the intuition of examples stimulating ideation along a particular direction.
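To make this stimulation mechanism concrete, the following minimal sketch simulates spreading activation over a toy association network for the brick example above. The concepts, weights, decay rate, and number of steps are all hypothetical values we invented for illustration; they are not part of the SIAM model [66] or of our study.

```python
# Toy spreading-activation sketch: seeing the example "use as paperweight"
# raises activation of associated concepts, biasing which ideas come next.
import numpy as np

concepts = ["paperweight", "office", "heavy", "table", "bookshelf", "weapon", "goalpost"]
# assoc[i, j]: hypothetical association strength from concept i to concept j
assoc = np.array([
    [0.0, 0.6, 0.5, 0.3, 0.3, 0.0, 0.0],  # paperweight
    [0.6, 0.0, 0.1, 0.4, 0.4, 0.0, 0.0],  # office
    [0.5, 0.1, 0.0, 0.3, 0.2, 0.2, 0.0],  # heavy
    [0.3, 0.4, 0.3, 0.0, 0.3, 0.0, 0.0],  # table
    [0.3, 0.4, 0.2, 0.3, 0.0, 0.0, 0.0],  # bookshelf
    [0.0, 0.0, 0.2, 0.0, 0.0, 0.0, 0.1],  # weapon
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.0],  # goalpost
])

activation = np.zeros(len(concepts))
activation[concepts.index("paperweight")] = 1.0  # the example idea just seen

decay = 0.5
for _ in range(3):  # let activation spread for a few steps
    activation = activation + decay * (assoc.T @ activation)

# Concepts near the example ("office", "table", "bookshelf") end up far more
# active than distant ones ("weapon", "goalpost"), shaping the trajectory of ideation.
for concept, level in sorted(zip(concepts, activation), key=lambda x: -x[1]):
    print(f"{concept:12s} {level:.2f}")
```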

2.2.2 Using examples to (re)formulate problems. Past research has documented how people can use examples to
construct, refine, and even reformulate their understanding of the creative problem they are trying to solve [34,35,43,59,67,79], through processes such as intentional free-form curation of examples [56] or on mood boards [55]. For instance, Okada et al. documented how two artists used individual artworks to shape not just their ideas, but used a process of "analogical modification" to search for and modify higher-order concepts, and their creative vision, over the course of years [67]. This process of example-influenced problem formulation is related to computational and neurobiological models of cognitive search [36], which study the range of strategies that intelligent agents can use to structure their search processes: here, there is an important distinction between "model-free" search, where external feedback from the world on the agent's actions guides search in a simpler, more local, stimulus-response manner, and "model-based" search, where the agent constructs a model or representation of the task and environment (in the case of creative problem solving, this would be the problem and/or design space [22,30,65]), partly on the basis of reflection on its own actions and possibly observation of others' actions, and uses that model to decide where and when to explore in the task environment vs. continue to sample locally. Additionally, insight problem solving research has documented how people not only construct models, but also substantially revise them in radical ways, to solve difficult creative problems [43,45]; this process is a key source of difficulty for creative problems, where one's initial problem formulation (e.g., key constraints or requirements) may be unhelpful [45]. In this paper, we discuss this set of mechanisms under the label "(re)formulation", to capture the intuition of creators leveraging examples to (re)formulate their understanding of the design space.
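The model-free/model-based contrast can be made concrete with a small simulation. The sketch below is our own illustration on a toy one-dimensional landscape, not an implementation from the cognitive search literature [36]: the "model" here is a crude kernel-smoothed estimate of unseen locations, standing in for the richer task representations discussed above.

```python
# Model-free hill-climbing vs. a simple "model-based" searcher that builds a
# coarse estimate of the landscape from its observations and uses it to jump.
import numpy as np

rng = np.random.default_rng(0)
xs = np.arange(100)
# Two peaks: a low local optimum near x=20 and the global optimum near x=80.
landscape = 60 * np.exp(-((xs - 80) ** 2) / 40) + 40 * np.exp(-((xs - 20) ** 2) / 200)

def hill_climb(start, budget):
    x = start
    for _ in range(budget):
        neighbors = [max(0, x - 1), min(99, x + 1)]
        best = max(neighbors, key=lambda n: landscape[n])
        if landscape[best] <= landscape[x]:
            break  # stuck on a local peak: model-free, local search ends here
        x = best
    return landscape[x]

def model_based(start, budget):
    observed = {start: landscape[start]}
    for _ in range(budget):
        pts = np.array(list(observed))
        vals = np.array([observed[p] for p in pts])
        # Crude "model": kernel-smoothed estimate of every location's score.
        w = np.exp(-((xs[:, None] - pts[None, :]) ** 2) / 400)
        estimate = (w * vals).sum(axis=1) / w.sum(axis=1)
        x = int(np.argmax(estimate + rng.normal(0, 1, 100)))  # sample near the model's best guess
        observed[x] = landscape[x]
    return max(observed.values())

print("model-free from x=20: ", round(hill_climb(20, 60), 1))   # trapped near the lower peak
print("model-based from x=20:", round(model_based(20, 60), 1))  # can jump to the higher peak
```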

2.3 Interactive systems for interacting with examples
In this study, we are interested in understanding how the impact of examples on problem solving varies as a function of interaction design decisions for how creators will interact with examples. To set the context, we briefly review here some emerging interaction design patterns in HCI systems research into how (vs. when, as in recommendation systems) participants interact with sets of examples (vs. understand or modify a single example).
One higher-level design pattern involves explorable overviews of examples. For example, MetaMap [42] supports exploration of examples through keywords and colors, and offers a playground to curate examples; RecipeScape [16] uses a map UI to present recipe examples; Sifter [70] presents large collections of image manipulation tutorials in a faceted view based on their command-level structure; the Adaptive Ideas Web tool [53] enables designers to explore and structure collections of web design examples; the Freed system [64] empowers design students to spatially organize their digital collections of examples, define relations, and reflect on their interrelationships; and Cabinet [44] supports collecting and organizing visual examples for inspiration and reference.
Another emerging design pattern can be described as contextualizing examples in the creator's workspace or problem, enabling designers to curate and reflect on the examples to build an understanding of their design space. For example, ReflectionSpace [78] interactively contextualizes design artifacts in project timelines (and associated comments and reflections) to promote reflection and learning; MoodCubes [39] enables designers to curate, compare, and explore suggested 3D design elements in the context of an overall 3D "cube" room layout; IdeateRelate [93] locates design examples in coordinates of similarity relative to the user's original ideas; the IdeaMache system provides an environment for free-form visual curation and sensemaking of creative materials in the context of a project canvas [91]; and ImageSense embeds the process of searching for, exploring, and integrating examples into both individual and shared work spaces [47].
We build on this work by directly testing how the emerging pattern of contextualizing examples might impact the effects of examples on problem solving. To facilitate downstream theoretical development, we go beyond formulations of problem solving effects and outcomes that are task-specific - such as writing code [9], designing websites [50], or designing room layouts [39] - and/or removed from creativity-specific mechanisms, such as browsing, searching, and exploring, to more theoretically grounded descriptions of psychological mechanisms such as fixation and problem reformulation.

METHOD

3.1 The WildCat Wells Task as a Controlled Analog to Exploratory Creative Problem Solving
We experimentally investigated our research questions using a controlled analog to exploratory creativity, a term introduced by Margaret Boden's influential model of creativity [7] to describe a subset of creative problem solving processes that involve exploration within a conceptual space that is often large and complex. This conception of exploratory creative problem solving as search in a space has deep roots in research on search landscapes and innovation in organization and management science [5], as well as psychological models of problem solving [65] and creativity [71] (as reviewed in our discussion of model-based mechanisms for using examples in 2.2.2). A key insight from this literature is that local search and hill-climbing are insufficient in more rugged and complex search landscapes, because they can trap searchers in local optima; to overcome this, searchers need to find ways to explore or "jump" to new regions of the landscape [5], such as by guiding search through (re)modeling of the search space [36]. This contrast between local and distant exploration is often described in terms of the shift between exploitation and exploration [5], where the latter search dynamics are more associated with innovation and creativity [71]. Note that the notion of exploratory creativity is distinct from another important class of creative processes that involve what Boden [7] calls transformational creativity: in this form of creativity, creators search for or construct alternative problem spaces (as discussed in the related work) [22,43], rather than search within an existing problem space as given.
Our controlled analog is the WildCat Wells task. The name of the task takes inspiration from the real-world task of wildcat drilling, a form of exploratory drilling for oil and gas in an unfamiliar environment where the distribution of resource-rich locations is unknown. Accordingly, in this task, participants can "drill" for "resources" in a 2D grid by clicking on locations in the grid; clicking on a grid location then uncovers a score, analogous to the amount of oil/gas uncovered at a drilling site. Like its real-world counterpart, the distribution of resources in this task is unknown; in our particular instantiation, participants' goal is to uncover the most resource-rich drilling location (i.e., the grid location with the highest score). Following our conceptualization of examples as descriptions/representations of possible solutions to the same/similar problem, in this task we operationalized examples as possible grid locations and their associated scores.
We chose this task for several reasons. First, we had a high degree of parameter control over the properties of the task and examples, which allowed us to precisely control the ruggedness of the task structure and also employ a within-subjects design while mitigating learning effects, by constructing and sampling from a set of WildCat Wells tasks with isomorphic ruggedness/complexity properties (see Section 3.1.1). The task structure also gave us granular and precise measures of process and outcome dynamics. Finally, the simplicity of the task allowed us to minimize the impact of prior knowledge, because the task does not require specialized domain expertise. While the specific task structure, in terms of the distribution of rewards over the search space, is unknown to participants, the generic task structure of searching a space for rewards is probably not unfamiliar to most people. The WildCat Wells task and its operationalization of examples is also conceptually similar to other instances of exploratory creativity that may draw on example solutions from a very similar problem: for instance, when searching for effective parameter settings for wing airfoil designs, other airfoil designs - which, like our grid locations, are also combinations of parameters - may serve as relevant examples; when designing effective ads for a vaccine persuasion campaign, other vaccine persuasion ads - which are also combinations of design features - may serve as relevant examples; and when designing effective UI elements, other websites and their UI elements - which are also compositions of UI features - may serve as relevant examples. We adapted the WildCat Wells task specifically from a prior study [62] of the dynamics of exploration and exploitation in collaborative problem solving. However, because the WildCat Wells task is only analogous to exploratory creativity, our results here can only speak to effects of examples on exploratory, but not transformational, creative problem solving.
3.1.1 Search Environments. Our WildCat Wells search environments consisted of a 100x100 grid of points (with corresponding scores controlled by a synthetic objective function that determines the distribution of scores; see Algorithm 1 in Appendix A and our source code for generating search environments). Figure 1 (A) shows a representative search environment we used in our experiment.
Our goal was to more closely match the difficult search spaces that the creativity theorist David Perkins calls "Klondike spaces" [71], which are environments where simple "hill-climbing" exploration strategies are insufficient, and likely outperformed by other creative exploration strategies such as a mix of exploration and exploitation [5,71]. We describe the specific parameter settings and algorithm we used to generate these task environments in Appendix A (and share the code for generating the environments in the Supplementary Material); here, we note that we set the parameters to yield search environments that were fairly rugged (adding more false "peaks" that participants might incorrectly intuit as the location of the maximum score) and locally noisy (reducing the local correlation between scores in the grid, such that searchers would often be surprised by the score of nearby regions in the grid). The parameters to generate search environments were determined by a series of rubrics (e.g., more than one area with scores higher than 80), and by pilots for a qualitative sense of difficulty based on the topology of the solution space. To reduce the likelihood that our results were tied to a specific formulation of the search environment, we generated ten search environments with the same synthetic objective function parameters but different random seeds. The resulting search environments were qualitatively similar to those of Mason & Duncan [62].
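For intuition, the sketch below generates rugged, locally noisy 100x100 score surfaces from a handful of Gaussian peaks plus noise, with different random seeds yielding isomorphic variants. The actual procedure and parameter values are those of Algorithm 1 in Appendix A; the peak counts, widths, and noise level here are illustrative assumptions only.

```python
# Illustrative generator for rugged, locally noisy 100x100 search environments.
import numpy as np

def make_environment(seed, n_peaks=5, width=12.0, noise_sd=3.0):
    rng = np.random.default_rng(seed)
    yy, xx = np.mgrid[0:100, 0:100]
    scores = np.zeros((100, 100))
    peaks = rng.uniform(10, 90, size=(n_peaks, 2))
    heights = rng.uniform(60, 95, size=n_peaks)   # several "false peaks"
    heights[rng.integers(n_peaks)] = 100.0        # one global optimum
    for (py, px), h in zip(peaks, heights):
        bump = h * np.exp(-((yy - py) ** 2 + (xx - px) ** 2) / (2 * width ** 2))
        scores = np.maximum(scores, bump)
    scores += rng.normal(0, noise_sd, scores.shape)  # local noise lowers local score correlation
    return np.clip(scores, 0, 100)

# Ten isomorphic environments: same parameters, different random seeds.
environments = [make_environment(seed) for seed in range(10)]
print(environments[0].max(), round(environments[0].mean(), 1))
```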

3.1.2 Examples. To prepare sets of initial examples with High Diversity (HD) or Low Diversity (LD), we randomly generated 10,000 sets of 10 examples each (recall that each example is a "drilling location" point in the 100x100 WildCat Wells search environment, with a corresponding score) for each of our 10 search environments, and ranked the diversity of each example set with a close variant of the Determinantal Point Process (DPP) approach [49] described in Algorithm 2 (Appendix B), which scores the volume spanned by a selected set of points, such that larger volumes corresponded to higher levels of diversity, since these points spanned a larger set of the space of possible moves [49]. We then randomly picked three HD example sets with diversity greater than the 99th percentile of the distribution of diversity across the example sets, and three LD example sets with diversity lower than the 1st percentile of the diversity distribution. To ensure that examples would not directly reveal the location of the peak, or provide a high enough score that participants might simply stop after seeing the example instead of searching, we discarded example sets that had any point with a score over 80 (scores in the search environment ranged from 0 to 100), and resampled example sets as necessary (subject to the same low/high diversity sampling criteria) to construct our final example sets for each search environment. Figure 1 (B) shows an LD and HD example set used in our experiment.
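The sketch below conveys the flavor of this ranking procedure: candidate 10-point sets are scored with a determinant-based (DPP-style) volume measure, and only the extreme percentiles are kept. The RBF kernel and its bandwidth are assumptions made for illustration; the exact procedure is Algorithm 2 in Appendix B, and, as noted above, sets containing any point scoring over 80 were additionally discarded.

```python
# DPP-style diversity ranking of candidate example sets (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)

def dpp_volume(points, gamma=1e-3):
    # Determinant of an RBF-kernel Gram matrix: grows with the "volume"
    # spanned by the points, i.e., with the diversity of the set.
    sq_dists = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    return np.linalg.det(np.exp(-gamma * sq_dists))

candidate_sets = [rng.integers(0, 100, size=(10, 2)) for _ in range(10_000)]
volumes = np.array([dpp_volume(s) for s in candidate_sets])

hi_cut, lo_cut = np.quantile(volumes, [0.99, 0.01])
hd_sets = [s for s, v in zip(candidate_sets, volumes) if v > hi_cut]  # high diversity
ld_sets = [s for s, v in zip(candidate_sets, volumes) if v < lo_cut]  # low diversity
print(len(hd_sets), len(ld_sets))
```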

3.2 Experiment Design
We conducted a mixed design experiment. Example interface was a between-subjects factor, with three conditions: 1) a "parallel examples with context" interface, in which all 10 examples were shown in the 100x100 space with color coding to denote the score associated with each point, referred to as the "In-Context" interface; 2) a "parallel examples without context" interface (shown in Figure 2), in which all 10 examples were shown in a list, also with color coding to indicate example score, referred to as the "List" interface; and 3) a "serial examples without context" interface, in which only one example was shown at a time and the participant needed to use a dropdown button to see other examples, referred to as the "Dropdown" interface. Figure 2 shows the experimental interface of the List condition as an example, and Figure 3 shows the three example interfaces in the context of the WildCat Wells task. The In-Context interface was inspired by design patterns of example interfaces that contextualize examples in the creator's workspace or problem (e.g., [39,47,78,91]). The List interface was inspired by the familiar design pattern of a "list" of examples, often in the context of a search interface (as search results), or a list of recommendations in a recommender interface. The Dropdown interface was designed to approximate more constrained interfaces for interacting with examples, such as chat-based or recommendation systems (e.g., popping up one or two examples at a time). We conjectured that interfaces that allowed for comparison between examples (whether in the context of a task environment, as in the In-Context interface, or just with attributes shown for comparison, as in the List interface) might facilitate more model-based usage of examples (what we called a "(re)modeling" mechanism in 2.2). Since we designed our WildCat Wells task to be unsuitable for simpler hill-climbing (e.g., a "stimulation-based" mechanism as described in 2.2), we also expected that these interfaces might lead to better performance on the task, through, for example, model-based exploration strategies.
Example diversity was a within-subjects factor: each participant attempted the WildCat Wells task twice, once with a set of HD examples, and once with LD examples. Recall that we generated 10 variant search environments, each with its own HD and LD example sets. To approximate counterbalancing of our within-subjects factor, we created 2 "run" variants for each search environment, with each variant having either an HD or LD example set as the first trial. Participants were randomly assigned first to an example interface condition, and then randomly assigned to one of the 20 potential "runs" in each interface condition (constraining assignment such that participants would not see the same search environment twice). Based on prior research on example diversity, we expected that participants would perform better when given high vs. low diversity example sets.
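A minimal sketch of this assignment scheme, as we understand it from the description above, is shown below; the exact pairing of search environments across a participant's two trials is our own reconstruction, not taken from the study materials.

```python
# Illustrative reconstruction of condition and "run" assignment.
import random

interfaces = ["In-Context", "List", "Dropdown"]

# 10 environments x 2 diversity orders = 20 run variants per interface condition.
runs = []
for env in range(10):
    for first, second in [("HD", "LD"), ("LD", "HD")]:
        # the second trial uses a *different* environment, so no participant
        # sees the same search environment twice
        other_env = random.choice([e for e in range(10) if e != env])
        runs.append([(env, first), (other_env, second)])

def assign():
    interface = random.choice(interfaces)  # between-subjects factor
    run = random.choice(runs)              # within-subjects order variant
    return interface, run

print(assign())
```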

3.3 Participants
We recruited participants from the Amazon Mechanical Turk platform, limiting participation to U.S. residents with more than 500 completed HITs and at least a 99% approval rate. Each participant was paid US$1.30 for their participation, an effective rate of $10 per hour given the average task completion time of 8 minutes.
We aimed for a total sample size of 195 (65 in each of the three conditions), to achieve target statistical power of over 0.80 to detect medium-sized statistical effects in a mixed between-within design experiment analysis. After rejecting the invalid work of 42 participants for irrelevant responses (e.g., "nice") to the closing survey question about how they used examples, we obtained data from 182 participants (63 females, 118 males, 1 other; 65 In-Context, 56 List, and 61 Dropdown) in total, yielding an effective statistical power of 0.86 for medium-sized effects.
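For reference, the sketch below reproduces the spirit of this power calculation using a simplified one-way ANOVA approximation with a conventional medium effect size (f = 0.25); the mixed between-within calculation used to arrive at the target of 195 would differ somewhat, so the number printed here is only in the same ballpark.

```python
# Simplified power calculation: total N for a 3-group one-way ANOVA,
# medium effect size f = 0.25, alpha = .05, power = .80.
from statsmodels.stats.power import FTestAnovaPower

n_total = FTestAnovaPower().solve_power(effect_size=0.25, alpha=0.05, power=0.80, k_groups=3)
print(round(n_total))
```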

3.4 Experimental Procedures
Participants experienced the WildCat Wells task as a 100x100 space (see Figure 2). Their task was to find the square with the highest score. Participants explored squares by clicking on them to reveal their underlying score, shown in color coding, similar to the examples. To simulate the constrained nature of real creative tasks (which often have some time/budget pressure) and reduce the likelihood of ceiling effects, participants had a total budget of 60 moves for exploring squares. This budget was estimated from our pilot studies, where, on average, most participants found the highest scoring square within 50 moves. We also provided incentives to encourage participants' exploration: there was a $0.25 bonus for achieving a highest score greater than 95, and a $0.50 bonus for achieving the maximum score of 100.
The information panel on the right side of the experimental interface (see Figure 2) showed moves remaining, the score of the current exploration, and the maximum score the participant had achieved in the current round. Since the WildCat Wells task does not interact strongly with prior knowledge in any particular domain, we addressed potential pre-existing differences in ability by measuring participants' baseline divergent thinking ability, a correlate of creative ability [75]. Before the study, we asked participants to generate as many alternative uses of a coffee cup as they could in 2 minutes (an instance of the commonly used Alternative Uses task [32] for measuring divergent thinking [75]).
Participants were then given one trial round of the WildCat Wells task (without examples) to familiarize them with the interface and task. After that, participants completed two formal rounds of the WildCat Wells task, which constituted the main experimental trials in our study. Finally, participants completed a post-study questionnaire, with three free-response questions: 1) What strategy did you use for hunting? 2) How did you use the initial examples (the values of ten points given to you)? 3) What differences did you notice between the initial examples given in those two rounds? Which did you find helpful?
We obtained institutional IRB approval for the whole project prior to the study.

RESULTS: PLANNED ANALYSES

4.1 No significant differences in baseline divergent thinking ability across interface conditions
We first report the results of our check for random assignment with respect to divergent thinking ability and baseline performance on our task. We observed no statistically significant difference in the number of generated alternative uses across the three conditions (In-Context: M = 6.52, SD = 3.02; List: M = 5.75, SD = 3.58; Dropdown: M = 6.46, SD = 3.87; Kruskal-Wallis H = 2.40, p = 0.30). Similarly, we observed no statistically significant difference in participants' best score on the trial run of the WildCat Wells task across the conditions (In-Context: M = 90.97, SD = 7.52; List: M = 90.91, SD = 7.13; Dropdown: M = 91.18, SD = 7.37; Kruskal-Wallis H = 0.19, p = 0.91). This suggests that participants across the interface conditions were comparable in terms of baseline divergent thinking ability as well as baseline task performance.
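This randomization check can be reproduced in SciPy as sketched below; the arrays are simulated placeholders (with roughly the reported means), not the study data.

```python
# Kruskal-Wallis check for baseline differences across the three conditions.
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(1)
aut_in_context = rng.poisson(6.5, 65)  # placeholder Alternative Uses counts
aut_list = rng.poisson(5.8, 56)
aut_dropdown = rng.poisson(6.5, 61)

H, p = kruskal(aut_in_context, aut_list, aut_dropdown)
print(f"Kruskal-Wallis H = {H:.2f}, p = {p:.2f}")
```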

4.2 List presentation of examples and low diversity example sets associated with lower best scores
The List condition had slightly lower best scores on average compared to the other conditions (regardless of example diversity; Fig. 4, top right). There was also an overall slight advantage of HD examples over LD examples (Fig. 4, bottom right). A linear mixed effects model with best score as the dependent variable, interface condition and example diversity as factors, and random intercepts for participants (estimated with the 'lme4' package in 'R') showed significant main effects of interface condition and example diversity.
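A sketch of this analysis translated to Python's statsmodels is shown below (the analysis above was estimated with lme4 in R). The data frame is simulated, and the effect sizes baked into it are placeholders, not our estimates.

```python
# Mixed-effects model: best_score ~ interface + diversity, random intercept
# per participant (statsmodels translation of the lme4 analysis).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
interface = rng.choice(["In-Context", "List", "Dropdown"], 182)
rows = []
for pid in range(182):
    base = rng.normal(0, 3)  # participant-level random intercept
    for diversity in ["HD", "LD"]:
        score = (92 + base
                 - 3 * (interface[pid] == "List")   # placeholder List penalty
                 - 2 * (diversity == "LD")          # placeholder LD penalty
                 + rng.normal(0, 5))
        rows.append((pid, interface[pid], diversity, min(score, 100)))
df = pd.DataFrame(rows, columns=["participant", "interface", "diversity", "best_score"])

model = smf.mixedlm("best_score ~ C(interface, Treatment('In-Context')) + diversity",
                    data=df, groups=df["participant"])
print(model.fit().summary())
```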

RESULTS: EXPLORATORY ANALYSES
We conducted a set of exploratory analyses to better understand the results of our main planned analyses, focusing on understanding process effects of interface conditions that might plausibly explain performance differences.

5.1 In-Context presentation of examples associated with early performance advantages, and List presentation of examples with early and persistent performance disadvantages
First, for a more granular view of performance, we examined how participants' best score changed as a function of their move sequence. This analysis confirmed a cumulative disadvantage for participants in the List condition, but also showed an early advantage for the In-Context interface, particularly with LD examples (see Figure 5). Using a Kruskal-Wallis H-test on the current max score at each move from the 1st to the 30th, we observed statistically significant differences between the conditions from the 1st to the 8th move (p < 0.05), with the exception of the 7th move (see Table 1).

5.2 Variations in example presentation interfaces associated with different self-reported example usage strategies

Next, to understand how participants used the initial examples in their exploration, two researchers coded participants' responses to the question "How did you use initial examples (the values of ten points given to you)?" with three codes: not using, stimulation-based, and model-based. This classification was guided and refined by our initial theoretical interest in the contrast between stimulation-based and (re)modeling-based use of examples, as discussed in 2.2. Examples of responses coded as "not using" include "I did not give much thought to it", and "Not much to be honest"; examples of "stimulation-based" responses included "Start at the reddest one and explore its surroundings", and "I looked around the higher values for boxes that were darker"; examples of "model-based" responses include "To get an overview on which squares would be best", and "They gave a vague idea of whether or not there might be "hot" or "cool" zones around those points". When we could not infer how the participants used the initial examples, the answers were coded as "unclear".
The researchers were blinded to condition during coding. Inter-rater reliability was substantial, at Cohen's κ = 0.725 [51]; all disagreements were resolved by discussion.
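For reference, Cohen's kappa for two coders can be computed as in this sketch; the labels below are placeholders rather than the study's codes.

```python
# Inter-rater reliability between two coders' strategy labels.
from sklearn.metrics import cohen_kappa_score

coder_a = ["model", "stim", "not_using", "stim", "model", "unclear", "stim"]
coder_b = ["model", "stim", "not_using", "model", "model", "unclear", "stim"]
print(round(cohen_kappa_score(coder_a, coder_b), 3))
```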
The In-Context condition had the largest portion (30.8%) of participants who self-reported using the initial examples to model the space, compared to the List condition (10.7%) and the Dropdown condition (11.5%) (see Figure 6). In contrast, for self-reported use of initial examples to stimulate their exploration, the List condition had the highest percentage (42.9%), followed by the Dropdown condition (29.5%) and the In-Context condition (29.2%). Finally, 17/61 (27.9%) Dropdown participants self-reported that they were not using examples, which was higher than in the other two conditions. Our log data were consistent with this observation: 37/61 (60.7%) participants in the Dropdown condition never used the dropdown to view any examples beyond the first.

5.2.1 In-Context participants more likely than other interface conditions to self-report model-based example usage, and Dropdown participants more likely than other conditions to self-report not using examples. We first statistically tested these patterns with a series of logistic regressions, one for each example strategy (not-using, stimulation-based, and model-based) (see Table 2). We ran separate logistic regressions rather than a single multinomial regression given our interest at this step in the relative likelihood across interface conditions of self-reporting a particular example strategy, rather than relative differences across strategies within each condition (which are best answered by a multinomial logistic regression). We first observe that participants in the List and Dropdown conditions were less likely than In-Context participants to self-report using the examples, with the Dropdown contrast reaching statistical significance (p < .05); in more intuitive odds ratio terms, Dropdown participants were 2.7x more likely than In-Context participants to self-report a "not using" strategy (Odds Ratio = 2.75). The overall model fit, though better than a null model (LL = -86.62 vs. LL_null = -89.10), was only marginally significant, Likelihood Ratio χ²(2) = 5.14, p = .08. Finally, we observe that there were no significant differences across conditions in the likelihood of self-reporting a stimulation-based strategy.
5.2.2 List participants more likely to self-report a stimulation-based example usage strategy compared to not-using or model-based example usage. Next, we focus on statistically evaluating the apparent predominance of a stimulation-based self-reported example usage strategy for List participants. We fitted a multinomial logistic regression, with model-based usage as the reference outcome class. Participants in the List condition were significantly more likely to self-report using a stimulation-based strategy compared to a model-based one, β = -1.44, 95% CI = [-2.53, -0.34], z = -2.58, p < .01. In odds ratio terms, participants in the List interface condition were 4x more likely to self-report a stimulation-based vs. model-based strategy (Odds Ratio = 4.21). The overall model fit was statistically significantly better than a null model, Likelihood Ratio χ²(4) = 14.39, p < .01.
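Both regression steps are sketched below on simulated placeholder data whose proportions loosely echo Figure 6; none of the fitted numbers this produces are the study's estimates.

```python
# (1) Separate binary logistic regressions per strategy code, then
# (2) a multinomial model with "model-based" as the reference class.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
interface = rng.choice(["In-Context", "List", "Dropdown"], 182)
probs = {  # P(model, stim, not_using, unclear) per condition (placeholders)
    "In-Context": [0.31, 0.29, 0.15, 0.25],
    "List":       [0.11, 0.43, 0.16, 0.30],
    "Dropdown":   [0.12, 0.30, 0.28, 0.30],
}
strategy = [rng.choice(["model", "stim", "not_using", "unclear"], p=probs[i]) for i in interface]
df = pd.DataFrame({"interface": interface, "strategy": strategy})

for code in ["not_using", "stim", "model"]:
    df["y"] = (df["strategy"] == code).astype(int)
    fit = smf.logit("y ~ C(interface, Treatment('In-Context'))", data=df).fit(disp=0)
    print(code, np.exp(fit.params).round(2).to_dict())  # odds ratios vs. In-Context

subset = df.query("strategy != 'unclear'").copy()
# code 0 = "model" becomes the reference outcome class in MNLogit
subset["strategy_code"] = pd.Categorical(subset["strategy"],
                                         categories=["model", "not_using", "stim"]).codes
fit_mn = smf.mnlogit("strategy_code ~ C(interface, Treatment('In-Context'))",
                     data=subset).fit(disp=0)
print(fit_mn.summary())
```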

5.3 List presentation of examples associated with more local initial exploration of the solution space
Finally, we explored how log data might be consistent (or not) with participants' self-reported example usage strategies.
We wanted to study how initial examples would affect participants' exploration behaviors, especially at the beginning of exploration when the examples provided were a major source of information. To explore this, we first constructed an exploration graph for the first 30 moves of each participant trial by computing Euclidean distances between each pair of successive moves; the intuition was that long sequences of low distances between moves would suggest "hill-climbing", and large distances would suggest "jumps". We conjectured that a "hill-climbing"-like exploration graph would be consistent with a stimulation-based strategy, rather than a model-based strategy.
Two coders independently coded all 364 exploration graphs (each of the 182 participants had an HD plot and an LD plot), coding whether the exploration behaviors were hill-climbing (H) or not (N) for each sequence of 10 moves (the 0th-10th, the 10th-20th, and the 20th-30th). Two examples of this coding are shown in Figure 7. We coded 1092 10-move instances (3 10-move instances per round x 2 rounds per participant x 182 participants) with substantial inter-rater reliability, Cohen's κ = 0.78.
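The distance computation underlying these exploration graphs is sketched below: Euclidean distances between successive moves, chunked into 10-move blocks. In the study the H/N codes were assigned by human coders; the fixed distance threshold here is our illustrative stand-in for that judgment.

```python
# Exploration graph: distances between successive moves, coded per 10-move block.
import numpy as np

def move_distances(moves):
    moves = np.asarray(moves, dtype=float)  # shape (n_moves, 2): grid coordinates
    return np.linalg.norm(np.diff(moves, axis=0), axis=1)

def code_blocks(moves, threshold=5.0):
    d = move_distances(moves[:31])  # first 30 moves -> 30 transitions
    return ["H" if np.all(chunk <= threshold) else "N"  # consistently small steps = hill-climbing
            for chunk in np.split(d, [10, 20])]

rng = np.random.default_rng(3)
hill_climber = np.cumsum(rng.integers(-2, 3, size=(31, 2)), axis=0) + 50  # small local steps
print(code_blocks(hill_climber))  # likely ['H', 'H', 'H'], i.e., "all hill-climbing"
```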
For all trials, a higher proportion of participants in the List interface condition used an exclusively hill-climbing strategy for the first 30 moves (proportion = 0.161, lower bound = 0.112, upper bound = 0.210) compared to the In-Context (proportion = 0.046, lower bound = 0.020, upper bound = 0.072) and Dropdown conditions (proportion = 0.082, lower bound = 0.047, upper bound = 0.117; see Figure 8a). Similarly, for LD trials, the proportion of participants using an exclusively hill-climbing strategy for the first 30 moves was higher for the List condition (34%) compared to the In-Context (18%) and Dropdown conditions (18%) (Fig. 8b, left). In contrast, for HD trials, the proportion of participants using an exclusively hill-climbing strategy was lower overall, though still higher in the List condition than in the In-Context condition (Fig. 8b, right).

Fig. 7. Example exploration graphs used for coding hill-climbing strategies: not "all hill-climbing" (left) and "all hill-climbing" (right), where each transition between participant moves is plotted on the x-axis, and the Euclidean distance between each move and its immediately preceding move is plotted on the y-axis. In the not "all hill-climbing" example, the first 10 moves (the 0th-10th moves), where there are substantial variations in move distances across the sequence, would be coded as non-hill-climbing (N), while the 10th-20th moves and the 20th-30th moves, where move distances are consistently low, would be coded as hill-climbing (H). In the "all hill-climbing" example, all of the first 30 moves would be coded as hill-climbing (H).
To statistically test these observations, we fitted a series of logistic regressions, estimated with maximum likelihood, predicting P(all_hillclimbing), the probability of using an exclusively hill-climbing strategy in the first 30 moves, as a function of interface condition. Prior work suggests that the choice of exploration vs. exploitation is influenced by the "goodness" of the current region of the search space (better scores make hill-climbing more likely) [5,36]. Our data confirmed this pattern: the average score of the first move in each of the first three 10-move blocks was positively correlated with the likelihood of being all hill-climbing in the first 30 moves, Kendall's τ = .17, p < .01. Thus, we conditioned our logistic regression models on the average score at the beginning of each 10-move block.
We first analyzed P(all_hillclimbing) aggregated across both HD and LD trials (coded 1 only if both the LD and HD trials were all hill-climbing), and then HD and LD trials separately. Table 3 shows the coefficient estimates for each of these models. For all trials, the coefficient for the contrast between the List and In-Context conditions was β = -1.79, 95% CI = [-3.40, -0.45], z = -2.45, p < .05; in odds ratio terms, participants in the List condition were 6x more likely to use an exclusively hill-climbing strategy for the first 30 moves, compared to participants in the In-Context condition (Odds Ratio = 5.99).
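The structure of this analysis is sketched below on simulated placeholder data, including the Kendall's tau check and the score-conditioned logistic regression; the coefficients baked into the simulation are invented for illustration.

```python
# Logistic regression: P(all hill-climbing in first 30 moves) ~ interface,
# conditioned on the average score at the start of the block.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import kendalltau

rng = np.random.default_rng(4)
n = 364  # 182 participants x 2 trials
interface = rng.choice(["In-Context", "List", "Dropdown"], n)
start_score = rng.uniform(20, 90, n)
# placeholder outcome: hill-climbing more likely for List and for good starts
logit_p = -4 + 1.8 * (interface == "List") + 0.03 * start_score
all_hc = (rng.random(n) < 1 / (1 + np.exp(-logit_p))).astype(int)
df = pd.DataFrame({"interface": interface, "start_score": start_score, "all_hc": all_hc})

tau, p = kendalltau(df["start_score"], df["all_hc"])
print(f"Kendall's tau = {tau:.2f}, p = {p:.3f}")

fit = smf.logit("all_hc ~ C(interface, Treatment('List')) + start_score", data=df).fit(disp=0)
print(np.exp(fit.params).round(2))  # odds ratios, conditioned on start score
```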
Note that this effect of the List condition was independent of the significant positive coefficient for the average first score in the block. Because we were concerned that this pattern of differences might be driven by pre-existing individual differences in propensity to hill-climb, rather than by a shift due to the interface condition, we repeated our coding procedure for exploration graphs generated from the initial trial round of exploration, which did not include examples (described in 3.4). The proportion of participants who displayed a predominant hill-climbing strategy, as described above, was distributed across conditions as follows: In-Context = 0.05 (SE = .03), List = .07 (SE = .03), and Dropdown = 0.15 (SE = .05). A logistic regression predicting the probability of a predominant hill-climbing strategy (yes or no) as a function of interface did not improve fit over a null model with no predictors, χ²(2) = 4.17, p = .12; note, however, that the overall frequency of hill-climbing strategies was lower than in the main trials, and the List condition was not the condition with the highest frequency (in contrast to the main trials). There was also no significant correlation between the likelihood of hill-climbing predominance and either the number of Alternative Uses task responses, r = .04, p = .61, or the likelihood of hill-climbing predominance in the main trials, r = .03, p = .65. Altogether, these results are inconsistent with the alternative explanation that List participants were simply more likely (due to individual differences) to choose a predominantly hill-climbing strategy overall; instead, taken together with the survey results, we believe this set of results reflects a shift in example usage induced by the interface conditions.

DISCUSSION

In this paper, we aimed to contribute to a theory of human-example interaction to guide the design of example-based creativity support tools. Towards this goal, we conducted an experiment with a controlled analog to an exploratory creativity [7] task to investigate how example presentation interface variations influence whether/how people benefit from examples. We found evidence that List presentation of examples might harm the quality of final solutions: we observed differences in mean best score obtained at the end of trials across example interface and diversity conditions (List participants had worse best scores compared to In-Context and Dropdown participants; Section 4.2), as well as cumulative performance differences (an early and persistent disadvantage of the List condition compared to the other conditions, with an especially pronounced early disadvantage relative to the In-Context condition; Section 5.1). This result is conceptually significant because the "List" participants received more information (seeing all 10 examples at the same time) than the "Dropdown" participants (who only saw 1 example at a time, and over 60% of whom never checked other examples), and approximately equivalent information, but with different presentation, compared with the "In-Context" condition. This suggests that seemingly unimportant, low-level interaction design decisions with respect to the presentation of examples can have measurable consequences for creative problem solving performance. Separately, we also observed beneficial effects of diversity for final solution quality, in line with some previous work [2,4,29,38,80,96]; importantly, this effect was similar in magnitude to the example presentation effects, suggesting that example presentation considerations may be just as important as example characteristics when designing example-based creativity support systems.

Second, our exploratory analyses suggest that "In-Context" and "List" presentation of examples may lead to distinct patterns of example usage. "In-Context" presentation of initial examples was associated with a greater likelihood of a "model-based" strategy for using examples, where participants self-reported using the examples to gain an overall understanding of the distribution of scores in the search environment to guide their exploration, compared to List or Dropdown presentations. Conversely, the "List" presentation of initial examples seemed to encourage a predominant "stimulation-based" example usage strategy, where participants selected promising examples as starting points for their exploration. Importantly, this self-report data was consistent with patterns in our log data: we observed that List participants were more likely to use a predominantly "hill-climbing" strategy (with low Euclidean distance between their moves) early in their exploration, relative to the In-Context and Dropdown participants; this association was independent of the relationship between hill-climbing behavior and the "goodness" of initial moves (hill-climbing in a given block of moves was more likely when the initial move was higher-scoring, consistent with prior empirical work on exploration/exploitation decisions [5,36]). Considering these results alongside the performance results suggests that List participants were being fixated [40] by the examples.
A fruitful direction for further research would be to investigate the mechanisms that drive fixation in the List condition. One reason might be the upper limit in scores (no more than 80/100) on the examples presented to the participants; if taken as starting points to begin hill-climbing, those relatively low quality examples could be misleading, and block access to high-quality solutions. Another reason might be the increased effort needed to connect examples in the List condition to the search space, which would be consistent with past research on the cognitive load benefits of integrating diagrams and text (similar to integrating examples and the search space) in instructional design [14].
We view the difficulty of transferring from the text modality of lists of examples to the visuospatial modality of the In-Context solution space as a potential mechanism by which example presentation variations might shape their impact on ideation: future work could investigate in more detail how different example presentation designs might shift the cost structure of different processing strategies, in a similar way that variations in environment or interface structure have been shown to shape sensemaking by changing the cost structure of various crucial actions, such as skimming/previewing, moving documents, applying schemas to documents, or adjusting schemas [72,76,77].
Separately, we observed that the "Dropdown" presentation was associated with limited usage of the examples: many Dropdown participants self-reported not using examples (more so than In-Context participants, for example), and this was also corroborated in their log data (via a lack of interaction with the example interface). We do not think that this lack of example usage is indicative of a lack of engagement: recall, for instance, that performance in this condition was on par with the In-Context condition (i.e., higher than in the List condition). Post-survey comments indicating enjoyment and engagement (e.g., "Fun game. Thank you!") were also seen across conditions at similar rates, and there were no statistically significant mean differences across the conditions in the trial run of the task. For these reasons, we believe that - possibly due to its interaction affordances - the Dropdown condition acted similarly to a "no-examples" control condition, where participants used a wider mix of strategies vs. a particular set of example-based strategies tied to an experimental intervention. In light of this, the overall strong performance of the Dropdown condition is akin to past observations of strong performance by control "no-intervention" conditions in ideation experiments (see, e.g., [11,80]); we thus add to a growing body of evidence that it may be easier to harm rather than help creative ideation by intervening (as in the List condition).
Overall, our results suggest that interaction design considerations for human-example interaction go beyond usability: there is indeed a space of mappings to explore between design affordances and fundamental psychological mechanisms of creative inspiration from examples. From a practical standpoint, our empirical results suggest the limitations of only showing examples without the problem space as context, especially if the problem space is large (a common feature of real-world problems) and there exist some potential solutions far away from the initial examples. This implication is significant since the "List" view of examples - examples presented in a list - is commonly used in current creativity support tools, such as search engines and recommendation systems, yet was associated with substantial negative effects on the usage of examples and task performance relative to In-Context presentation of examples.

6.1 Limitations
The WildCat Wells task we used in our experiment is simpler than most real-world exploratory creativity tasks - such as airfoil design, ad design, or UI design - of which it is an analog. For instance, the task did not require any specialized domain knowledge, and the generic task structure of searching a space for rewards is probably familiar to most people: indeed, one participant in our study noted in the post-survey that the task was "very fun and somewhat similar to minesweeper." Additionally, although we carefully constructed our WildCat Wells task surfaces to be rugged, with multiple peaks of good solutions, our task technically has a single best solution; in contrast, many real-world creative problems - such as policy design - lack a single best solution, due to task factors such as intrinsic tradeoffs between different problem requirements; in these cases, creators often search for and construct "good enough" solutions under high uncertainty (though this might sometimes be a function of feasibility constraints rather than intrinsic properties of the task). It is unsurprising, then, that participants performed relatively well as a whole, and that "Dropdown" participants also had competitive performance even though they did not interact with or use the examples. We note, however, that performance was not quite at ceiling: only 16% of participants reached the global max in either trial (and 42% reached the threshold score of 95 for the first bonus). Still, caution is warranted when generalizing to other more complex instances of exploratory creativity; for instance, it may be that the effects of examples, and the corresponding effects of variations in their presentation interfaces, will become more pronounced in more sophisticated tasks. Relatedly, the WildCat Wells task captures aspects of search dynamics (exploration and exploitation) in exploratory creative problem solving quite well, but does not enable observation of more sophisticated psychological mechanisms for working with examples: for instance, it is unclear what it might mean to "combine" different problem solving moves for this task. Additionally, while participants engaged in modeling of the problem space, they were not able to make larger changes to the problem space, such as questioning assumptions or relaxing constraints [45], or even changing the goal/problem altogether [43], mechanisms that are common in real-world creative problem solving tasks, such as design [22]. Thus, we reiterate that our results cannot speak to how example presentation design decisions might influence example usage for transformational creativity [7] tasks.
Thus, more work is needed to extend our exploration of patterns in example interaction design choices to more complex settings: for example, what might it mean to design a "contextualized" presentation of examples for UI elements, more complex airfoil designs, ad persuasion campaigns, research papers, or policy ideas? We are keen to build on existing design patterns similar to this in previous systems such as ReflectionSpace [78], MoodCubes [39], and ImageSense [47], as discussed in Section 2.3. Our implementation of the Dropdown condition may also be quite different from other serial presentations of examples, such as forward/backward interfaces (e.g., image suggestions [46]). Future studies can explore the consequences of these differences. For now, we note that our main results on the contrast between In-Context and List conditions are independent of this limitation, and recommend caution in generalizing the results from the Dropdown condition around non-use of examples.
Finally, we did not measure demographic information that may have been correlated with task performance, example usage, and/or exploration patterns - for example, personality traits such as disagreeableness or extraversion may be correlated with real-world creative achievement [95]; and gender might interact with potential differences in visuospatial reasoning demands between example interfaces, given some existing research on gender differences in spatial ability [52].

6.2 Towards an interaction-oriented theory of creative inspiration from examples
Returning to our higher-level goal of constructing an interaction-oriented theory of human-example interaction, we now reflect on how the empirical results from our study, in conversation with theoretical mechanisms and design patterns from prior work, could contribute to an overall theory that bridges design patterns to psychological mechanisms.
We conjecture that a useful theory of human-example interaction could be conceptualized as paths through multiple coordinated spaces of example interaction patterns, example-ideation psychological mechanisms, ideation characteristics, and creative outcomes. Paths through this overall set of coordinated spaces could then represent a set of principled design hypotheses about how to best support creative work with examples. For instance, bringing our empirical results into conversation with the literature we reviewed in Sections 2.2 and 2.3, we could hypothesize that, given a particular exploratory creativity task environment like our instantiation of the WildCat Wells task - where the key creative outcome of solution quality is determined at least in part by the ideation characteristic of diversity of search, which is in turn positively influenced by the psychological mechanism of (re)modeling and negatively influenced by the mechanism of stimulation - it may be advantageous to choose example interaction patterns like contextualizing examples in the problem space (which is positively mapped to (re)modeling mechanisms) over patterns like List viewing of examples (which is positively mapped to stimulation mechanisms). Multiple other hypothesized paths could be generated and refined to map other example interaction patterns from prior work, such as faceted search systems or example dissection/analysis, to other psychological mechanisms, such as conceptual combination or analogical abstraction; each of these mechanisms might then in turn be contextually important for certain kinds of creative problems, such as policymaking or room layout design. We believe that fleshing out these paths through these coordinated spaces towards a theory of human-example interaction can contribute both to fundamental HCI theory - by enhancing synthesis of design knowledge about how to best support creative inspiration from examples - and to practice - by providing a principled framework that is sufficiently granular and directly connected to design decisions to guide effective choices when building example-based creativity support systems, and to help practicing creators who wish to more effectively leverage examples in their creative process. We invite the rest of the creativity support systems community to join us in these efforts.

B ALGORITHM FOR SAMPLING DIVERSE AND NON-DIVERSE EXAMPLE POINTS
Algorithm 2: Generating a ranked distribution of example sets with the Determinantal Point Process (DPP) approach [49], where N is the batch size (10,000) and S is a combinatorial set defined on a finite set X ⊂ R², with each element s_i ∈ S being k elements long.

Fig. 1 .
Fig. 1. Example WildCat Wells search environment with color coding of points to indicate their scores (0-50: dark blue to light blue; 50-100: light red to dark red) (A), and example sets of high and low diversity points in this search environment, which are given as examples (B).
1) "parallel examples with context" interface: all 10 examples were shown in the 100x100 space with color coding to denote the score associated with each point, referred to as the "In-Context" interface 2) "parallel examples without context" interface (shown in Figure 2): all 10 examples were shown in a list, also with color coding to indicate example score, referred to as the "List" interface 3) "serial examples without context" interface: only one example was shown at a time and the participant needed to use a dropdown button to see other examples, referred to as the "Dropdown" interface.Figure 2 shows the experimental interface of the List condition as an example.The In-Context interface was inspired by design patterns of example interfaces that contextualized examples in the creator's workspace or problem (e.g., [39, 47, 78, 91] The List interface was inspired by the familiar design pattern of a "list" of examples, often in the context of a search interface (as search results), or list of recommendations in a recommender interface.The Dropdown interface was designed to approximate more constrained interfaces for interacting with examples, such as through chat-based or recommendation systems (e.g., popping up one or two examples at a time).The three example interfaces were shown in the context of the WildCat Wells task in Figure 3.We conjectured that interfaces that allowed for comparison between examples (whether in the context of a task environment, as in the In-Context interface, or just with attributes shown for comparison, as in the List interface) might facilitate more model-based usage of examples

Fig. 2 .
Fig. 2. Screenshot of the experimental interface, shown for the List condition: the 100x100 grid, which constituted the search environment for the task, was shown on the left panel; participants explored the space by clicking anywhere on the 100x100 grid. The 10 initial examples, moves remaining, the score of the current move, the current max score, and the score legend were shown on the right panel. In the Dropdown condition, the dropdown menu as seen in Figure 3 was shown in the same position as the list of examples in the List condition. In the In-Context condition, examples were instead overlaid as points, with corresponding values, on the search grid, as shown in Figure 3.

Fig. 3 .
Fig. 3. Three conditions of presenting examples: "In-Context" (directly on the search environment grid), "List" (in a list) and "Dropdown" (in a clickable dropdown selector).

Fig. 4 .
Fig. 4. Distribution of best scores by interface and example diversity conditions. Participants in the List interface condition had lower best scores than participants in the other interface conditions, regardless of example diversity (top right). Best scores were also lower when participants were given low vs. high diversity examples (bottom right).

Fig. 5 .
Fig. 5. Maximum score at the n-th move for each participant: (left) HD; (right) LD. We observe (1) cumulative disadvantages for the List condition, as well as (2) early advantages for the In-Context interface, especially with LD examples.

Fig. 6 .
Fig. 6. Raw proportion of participants who expressed "not using (examples)", "stimulation-based", or "model-based" in their answer to "How did you use initial examples (the values of ten points given to you)?". Error bars show the standard error of the proportion. More participants self-reported using a model-based strategy in the In-Context condition compared to the other conditions.

Fig. 8 .
Fig. 8. Raw proportion of participants with a predominantly hill-climbing strategy in the first 30 moves, across interface conditions (a), and broken out by HD and LD example sets (b). More "List" participants than "In-Context" participants hill-climbed in the first 30 moves with both high and low diversity example sets, and the List condition with LD examples showed the highest proportion of hill-climbing of any combination of presentation and example set.

Table 1 .
Means (M) and results of Kruskal-Wallis H-tests on the current best score for the three interface conditions with LD examples.