LLMR: Real-time Prompting of Interactive Worlds using Large Language Models

We present Large Language Model for Mixed Reality (LLMR), a framework for the real-time creation and modification of interactive Mixed Reality experiences using LLMs. LLMR leverages novel strategies to tackle difficult cases where ideal training data is scarce, or where the design goal requires the synthesis of internal dynamics, intuitive analysis, or advanced interactivity. Our framework relies on text interaction and the Unity game engine. By incorporating techniques for scene understanding, task planning, self-debugging, and memory management, LLMR outperforms the standard GPT-4 by 4x in average error rate. We demonstrate LLMR's cross-platform interoperability with several example worlds, and evaluate it on a variety of creation and modification tasks to show that it can produce and edit diverse objects, tools, and scenes. Finally, we conducted a usability study (N=11) with a diverse set of participants, which revealed that they had positive experiences with the system and would use it again.


INTRODUCTION
Creating 3D virtual worlds is a challenging task that requires both artistic and technical skills. In addition, 3D content often becomes deprecated and has limited interoperability due to platform and device upgrades. Recently, generative AI models have made considerable progress in producing meshes for objects and scenes [17,18,22,25,26,41,43]. However, few works have ventured beyond visual appearances to bring, e.g., interactive and behavioral elements into the generated content. In addition, existing rendering-based methods require substantial compute and time to generate and render 3D objects, while the quality and resolution of these generations are limited [11,35].
On the other hand, the rapid advancement in Large Language Models (LLMs) like GPT has shown promise in code generation and reasoning [1,6,14,21,33]. An integration of LLMs with a game engine, like Unity [50], can enable faster 3D content development and spontaneous user creation, a core element of mixed reality since its inception. In addition, 3D mixed reality worlds offer rich, spatial, multimodal information (most of it post-symbolic or beyond language) that can potentially help LLMs to better situate their reasoning in the reality that humans live in.
This paper presents LLMR (Large Language Models for Mixed Reality), a framework that enables real-time creation and modification of interactive 3D scenes. LLMR can create objects that are rich in both visual and behavioral aspects, or make spontaneous and bespoke edits on an existing environment. For example, we leverage LLMR to spawn interactive tools that are self-contained units designed to perform specific functions in virtual and mixed-reality environments. They can be combined to form more complex interactive systems, extending the range and depth of user and AI-driven experiences. These configurations can be saved and transferred across various environments, serving as the building blocks for versatile interactive experiences.
LLMR is an orchestration of an ensemble of specialized GPTs. At its center is the Builder GPT, serving as an architect of C# Unity code for crafting interactive scenes. However, the multitude of tasks falling under virtual world creation renders a standalone coder insufficient. For instance, the ability to meaningfully modify an existing virtual world necessitates a profound semantic understanding of the scene. As humans, we have the ability to infer the properties of objects in the world and can refer to objects in the environment using demonstratives. To simulate the benefits of perceptual access, we incorporated the Scene Analyzer GPT. It generates a comprehensive summary of scene objects, offering detailed information when requested, including aspects like size, color, and the functionalities of interactive tools previously generated by LLMR. We also implemented the Skill Library GPT, which determines the relevant skills that the Builder needs to accomplish the user's request. In addition, we have observed that the code generated by the Builder lacks robustness and frequently contains bugs. To remedy this, we introduce the Inspector GPT, which evaluates the Builder's code against a predefined set of rules. This evaluation acts as a protective measure against compilation and run-time errors before the code is executed via the Compiler in the Unity Game Engine.
To illustrate the efficacy of our framework in the creation and editing of virtual scenes, we tested LLMR on two sets of 150 prompts encompassing a wide array of creation and modification tasks. Our findings demonstrate LLMR's superior performance in contrast to general-purpose LLMs while emphasizing the performance gain achieved with the addition of each module in our pipeline. In particular, LLMR exhibits a 4x reduction in code errors in both an empty and an existing scene when compared to off-the-shelf GPT-4. At the same time, LLMR can successfully complete sequences of tasks with varying complexities while keeping the completion time around a minute. These outcomes underscore LLMR's capacity to execute user instructions in real time with a higher degree of robustness.
To evaluate whether our framework can generate not only functional code but also interactive worlds that meet users' instructions, we evaluated LLMR with 11 participants who had varying levels of Unity experience. At a high level, participants found LLMR to be intuitive and easy to use, and they were able to iteratively achieve desired outputs without much manual scripting. The framework has limitations, such as its unpredictability due to the stochastic nature of generative models, and it is therefore not applicable to all contexts (especially ones that require precise and specific control). Nevertheless, the output generated by LLMR serves as a starting point for more complex scene generation.
Our paper is organized as follows: we begin by describing prior work and approaches to generating 3D objects and environments for mixed reality in Section 2. In Section 3, we first provide an overview of LLMR, followed by details of the function of each module of our framework. We then discuss important extensions of our framework, such as incorporating plugins, memory management, and cross-platform compatibility, in Sections 4, 5, and 6, respectively. We then present a series of exemplar applications in Section 7 to illustrate the wide range of creations enabled by LLMR. Section 8 (Numerical Study) includes a comprehensive evaluation of our framework against our design goals: high completion rate, real-time execution, robustness to complex tasks, and iterative fine-tuning ability. We follow the Numerical Study with Section 9 (User Study), which evaluates the quality of LLMR's output and presents usability feedback. Finally, in Section 10, we discuss the limitations and future work for others to build upon.
In summary, our main contributions are the following: (1) We introduced a versatile framework for real-time generation of interactive 3D objects and scenes using LLM modules, designed for easy setup with an OpenAI API key and adaptable across various mixed reality tools, environments, and devices. (2) We carried out extensive evaluations, including a technical ablation study to gauge the framework's performance and reliability, and a user study to derive design recommendations for optimizing the user experience. (3) We showcased the expanded capabilities of GPT beyond text inputs, illuminating the broader potential of LLM applications, and demonstrated the framework's broad applicability in domains such as remote training, creativity, and accessibility. (4) We advocate for the interoperability and longevity of mixed reality applications enabled by AI, and thus we openly share the installation package, code, and prompts used in our application and evaluation so that future work can build on top of our framework.

RELATED WORK
Our research on the creation and modification of interactive 3D scenes using natural language is situated at the intersection of large language models (LLMs) and 3D content generation. This section provides an overview of the related work in these areas, highlighting how our work builds upon and extends existing research.

Generative 3D Assets
The generation of 3D assets has been a significant focus in recent research. The work of Li et al. with 3DDesigner [25], Jun and Nichol with Shap•E [22], and Poole et al. with DreamFusion [35] have demonstrated the potential of text guidance and generative models in creating complex and diverse 3D objects. Lin et al. introduce Magic3D [26], a high-resolution text-to-3D content creation framework that addresses the limitations of slow optimization and low-resolution output inherent in existing methods like DreamFusion. Recently, Holodiffusion by Karnewar et al. [23] furthered the conversation by employing diffusion models for 3D generative modeling. The Instruct-NeRF2NeRF method [15] and advancements like Pointclip v2 [65], as well as the work of Roberts et al. [39], have explored the power of prompting techniques in 3D open-world learning. A comprehensive review of Neural Radiance Field (NeRF) models by Gao et al. [11] adds to our understanding of this rapidly growing field and aligns with our approach of enabling LLMs to interpret non-linguistic or non-symbolic information. Our approach extends beyond visual appearances to incorporate interactive and behavioral elements into the generated content.

Generative Interactive 3D Environments
In addition to generating objects, the creation of interactive 3D environments has been further explored, with contributions from Wang et al. with Voyager [53], Singer et al. with MAV3D [41], and Höllein et al. with Text2Room [17]. Volum et al. have shown that LLMs can be used to guide NPC interactions with a virtual environment [52]. Wang et al. also introduced Chat-3D [56], a system that focuses on universal dialogues for 3D scenes, which is further augmented by the work of Hong et al. with 3D-LLM [18]. New approaches like Oasis [43] and Procedurally Generated Virtual Reality [44] add novel perspectives. Recent advancements such as Interactive Example-Based Terrain Authoring with Conditional Generative Adversarial Networks by Guérin et al. [13] add a layer of complexity to how terrains can be generated from simple user inputs. Research by Freiknecht and Effelsberg [10], Cao et al. [4], and Song et al. [42] has focused on the balance between realism and algorithmic performance. DeepSpace introduced a novel method of mood-based texture generation from music [45], adding another layer of complexity to asset generation. While these contributions are significant in building interactive 3D spaces, the interplay between AI and mixed reality in these environments remains an open question. Our work tackles this gap by bringing the capabilities of LLMs to a real-time Unity editor for Mixed Reality applications.

Editor Support for Mixed Reality Development
Mixed Reality (XR) development has been explored by Hirzle et al. [16] and Fidalgo et al. [9], who provide comprehensive reviews at the intersection of AI and XR. Lindlbauer et al. [27] and Cheng et al. [5] focus on the automatic adaptation of MR interfaces, a line of work that is relevant for multi-user XR experiences, as shown by Mandi et al. with RoCo [30]. Thoravi Kumaravel et al. [51] complement these efforts by focusing on bi-directional mixed-reality telepresence. Compared with prior work, we allow users to directly author the environment using natural language.

LLMs Interpreting Spatial, Non-Linguistic Information
Lastly, many have pushed the boundary of LLMs by inputting non-linguistic information (which was not in the training set), such as for visual programming [64] or processing sensor data [28]. More related to our work is using LLMs to interface with spatial, embodied data.

Figure 2: The Planner decomposes the user prompt into a sequence of sub-prompts, while the Scene Analyzer (SA) summarizes the current scene elements. These are then integrated with a Skill Library (SL) to guide the Builder (B) module, which generates the appropriate code. The Inspector (I) module iteratively checks the generated code for compilation and run-time errors. Upon receiving the green light from the Inspector, the code is compiled using the Roslyn Compiler and executed in the Unity Engine to produce the desired 3D scene and functionalities as specified by the user.

LLMR: A FRAMEWORK FOR GENERATING REAL-TIME, INTERACTIVE 3D WORLDS USING LARGE LANGUAGE MODELS
Large language models are capable code generators, and their ability to synthesize programs has been extensively tested [1,6,14,21,33]. Scripting in a game engine, however, is especially challenging given the multitude of tasks and the complexity of the development environment. To give a non-comprehensive list, generating a realistic 3D world may involve object creation, texturing, behavior programming, event scripting, animations, particle effects, lighting, and user interface [3]. Prompting these elements in real time requires a framework that understands the virtual scene, interprets user intention, and generates high-quality code. To this end, we present Large Language Model for Mixed Reality (LLMR), a framework that enables real-time creation and modification of interactive 3D scenes using natural language.
LLMR is an orchestration of language models, each contextualized with a distinct metaprompt to outline its role, as illustrated in Figure 2 and Algorithm 1. A metaprompt is a specially crafted input sequence or context that guides an LLM's behavior or output, enabling more focused or nuanced responses than standard prompts. We start with the Planner, which breaks down the user's request into a sequence of appropriately scoped instructions. These instructions, along with a concise summary of the existing scene from the Scene Analyzer and extra knowledge for specialized skills from the Skill Library, are used as inputs to the central module called the Builder, which generates code to fulfill these instructions. In addition, we use a separate Inspector module to check the Builder's generated code against potential compilation and run-time errors before finally executing the code.
The task of generating interactive 3D scenes boils down to generating and executing appropriate code snippets to accomplish the user's prompt. Formally, denoting the user's request by $r$ and the current 3D world by $\Omega$ (which may be empty), we wish to draw a sample $c \sim \mathcal{P}(c \mid r, \Omega)$, where $\mathcal{P}$ is the distribution of syntactically valid, request-fulfilling code. We then compile and execute $c$ at run-time under the Unity Engine [50], a development platform for creating virtual scenes that suits our needs. Below, we detail each module and explain the design choices that enable various aspects of prompting a virtual world into existence.

Figure 3: The Planner and its role in breaking down a user's high-level request into a sequence of manageable subtasks $(r_1, r_2, \ldots, r_T)$. The Planner engages in a user-oriented conversation to determine the appropriate scope and granularity of each subtask. Following this, the Builder executes the plan by generating code $(c_1, c_2, \ldots, c_T)$ for each subtask, effectively carrying out the user's initial request.

Planner
Prompting a world into existence can be a hefty task. "Create a city and all its denizens" is a valid request, albeit one that is overly ambitious to achieve in a single step. Following the common wisdom "nothing is particularly hard if broken into small jobs", instead of directly sampling from $\mathcal{P}(c \mid r, \Omega)$, we propose a Planner $\Pi: r \mapsto (r_1, r_2, \ldots, r_T)$ to decompose each prompt into subtasks within an appropriate scope, then use autoregressive sampling to carry out these subtasks via a sequence of generated code $(c_1, c_2, \ldots, c_T)$:

$$\mathcal{P}(c_{1:T} \mid r_{1:T}, \Omega) = \prod_{t=0}^{T-1} \mathcal{P}(c_{t+1} \mid c_{1:t}, r_{1:T}, \Omega) = \prod_{t=0}^{T-1} \mathcal{P}(c_{t+1} \mid c_{1:t}, r_{t+1}, \Omega),$$

where $c_{1:t} = (c_1, \ldots, c_t)$. The second equality follows by assuming independence of code generations and requests at different steps, $c_i \perp\!\!\!\perp r_j$, $\forall i \neq j$. An illustration of this procedure is provided in Figure 3. However, sampling from $\mathcal{P}(c_{t+1} \mid c_{1:t}, r_{t+1}, \Omega)$ may be difficult for a language model, because it has to infer the effect of $(c_1, \ldots, c_t)$ on the initial world $\Omega$ before writing code $c_{t+1}$. To remove the guesswork, we leverage a runtime compiler $\mathcal{E}$ to execute $(c_1, \ldots, c_T)$ in order, each time getting a new world state $\Omega_{t+1} = \mathcal{E}(c_{t+1}, \Omega_t)$, with $\Omega_0 = \Omega$. We can then rewrite:

$$\mathcal{P}(c_{t+1} \mid c_{1:t}, r_{t+1}, \Omega) = \mathcal{P}(c_{t+1} \mid c_t, r_{t+1}, \Omega_t),$$

where we assume $\{c_t\}_{t=1}^{T}$ is Markovian when conditioned on $\Omega_t$. That is, the current world state is rich enough to capture all previous executions past the most recent one.
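The plan-then-execute loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_planner` and `toy_builder` are hypothetical stand-ins for the LLM-backed Planner and Builder, and the world is modeled as a plain dictionary rather than a Unity scene. The point it shows is that each step is generated against the world state produced by executing the previous step.

```python
from typing import Callable, Dict, List

World = Dict[str, dict]  # hypothetical world state: object name -> properties


def toy_planner(request: str) -> List[str]:
    """Stand-in for the Planner: split a request into subtasks on ' and '."""
    return [part.strip() for part in request.split(" and ") if part.strip()]


def toy_builder(subtask: str, world: World) -> Callable[[World], World]:
    """Stand-in for code generation: return a function that edits the world."""
    name = subtask.split()[-1]  # treat the last word as the object name

    def code(current: World) -> World:
        updated = dict(current)  # copy so earlier states stay intact
        updated[name] = {"created_by": subtask}
        return updated

    return code


def run_plan(request: str, world: World) -> List[World]:
    """Generate and execute code one subtask at a time.

    Each step conditions on the world produced by the previous step, so the
    generator never has to guess the effect of earlier code.
    """
    states = [world]
    for subtask in toy_planner(request):
        code = toy_builder(subtask, states[-1])
        states.append(code(states[-1]))  # "compile and execute" this step
    return states


states = run_plan("create a house and add a tree", {})
print(sorted(states[-1]))
```

Keeping every intermediate state in `states` is a choice made for the sketch; it makes it easy to see which subtask introduced which object.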
In principle, it is possible for the user to limit their prompts within a certain difficulty so that the decomposition is unnecessary. However, the user may not know the appropriate task scope a priori (if creating a city is too hard, how about a single house? Or a room in the house?). As a result, having a properly configured Planner makes the framework robust to prompts of varying difficulty. In addition, the user may provide different levels of detail in their prompt. For example, "Creating a car" is a valid request that nevertheless does not specify its appearance or functionality. Here, the Planner serves as a conversational assistant that interacts with the user to devise a plan with an appropriate scope and granularity, which significantly improves the user experience.

Figure 4: The virtual scene, depicted in the bottom-left corner, is converted into a parsed scene hierarchy in JSON format. This, along with the user request, serves as input to the Scene Analyzer. The output is a filtered, relevant summary of the scene, which is then used for conditioning subsequent modules like the Builder. The process optimizes the utilization of the language model's fixed context window and enhances focus on objects relevant to the user prompt.

Scene Analyzer
There are many possible representations of a virtual world $\Omega$ that may include visual, behavioral, and auditory elements. In this work, we derive $\Omega$ from the Unity scene hierarchy, which contains all existing game objects, their attached components, and their parent-child relations. The hierarchy is parsed into a JSON string and can then be used as input to language models. However, directly using the raw JSON string as input proves to be infeasible in practice. First, most prompts only require interactions with a small subset of $\Omega$, so it is unnecessary and even distracting to use its entirety as input. Second, LLMs have a fixed context window that serves as their short-term memory, which has to contain the metaprompt, few-shot examples, user prompt, and generated output [62]. For example, GPT-4 supports a maximum of either 8k or 32k tokens at a time [34], but even the 32k token limit can be insufficient, particularly for intricate scenes containing numerous objects, each consisting of multiple components.
To tackle these issues, we created a separate module termed the Scene Analyzer, which is a properly prompted LLM $\mathcal{A}(s \mid r, \Omega)$ that outputs a succinct summary $s$ of $\Omega$ conditional on the user request. At a high level, one can think of the Scene Analyzer as a means of perception that relays an abstraction of the environment for downstream processing. An illustration of the module is provided in Figure 4. Concretely, the output $s_t \sim \mathcal{A}(\cdot \mid r, \Omega_t)$ is used to reparametrize the density at each sampling step:

$$\mathcal{P}(c_{t+1} \mid c_t, r_{t+1}, \Omega_t) = \mathcal{P}(c_{t+1} \mid c_t, r_{t+1}, s_t).$$
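The filtering idea behind the Scene Analyzer can be illustrated with a small Python sketch. It assumes a hypothetical nested-dictionary scene format with `name`, `components`, and `children` keys (not LLMR's actual JSON schema), and it uses plain keyword matching as a crude placeholder for the LLM's relevance judgment. What it shows is the shape of the step: flatten the hierarchy, keep only what the request touches, and pass the short result downstream.

```python
import json
from typing import Dict, Iterator, List


def walk(node: Dict, path: str = "") -> Iterator[Dict]:
    """Yield a flat record for every object in a nested scene hierarchy."""
    full_path = f"{path}/{node['name']}" if path else node["name"]
    yield {"path": full_path, "components": node.get("components", [])}
    for child in node.get("children", []):
        yield from walk(child, full_path)


def summarize_scene(scene: Dict, request: str) -> List[Dict]:
    """Keep only objects whose name is mentioned in the request.

    Keyword matching is a placeholder for the LLM's relevance judgment.
    """
    words = set(request.lower().split())
    return [
        record for record in walk(scene)
        if record["path"].split("/")[-1].lower() in words
    ]


# Hypothetical scene used only for this illustration.
scene = {
    "name": "Kitchen",
    "children": [
        {"name": "Table", "components": ["MeshRenderer", "BoxCollider"],
         "children": [{"name": "Apple", "components": ["Rigidbody"]}]},
        {"name": "Lamp", "components": ["Light"]},
    ],
}
summary = summarize_scene(scene, "make the apple bounce")
print(json.dumps(summary))
```

In this toy scene the summary keeps a single record out of four, which is the kind of reduction that frees context-window space for the metaprompt and the generated code.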

Builder-Inspector
Central to LLMR is the Builder $\mathcal{B}(c \mid r, s)$, a module responsible for generating code conditional on the user prompt. It serves as our main apparatus for approximating $\mathcal{P}$. In other words, we hope that $\mathcal{B}(c \mid r, s) \approx \mathcal{P}(c \mid r, s)$ holds with a carefully crafted metaprompt and enough in-context demonstrations. In practice, however, the complex nature of creating a virtual world makes the approximation unsatisfactory even with as many examples as the context length allows. This is largely because the Builder module is asked to accomplish the instructions with some creativity while faithfully following an extensive list of specific guidelines that align the output, which causes the Builder to suffer a "cognitive overload". To ameliorate this, we introduce another module, the Inspector $\mathcal{I}$, which checks the Builder's generated code for compilation and run-time errors and returns a verdict $v$ together with a suggestion $m$. In the case of a failed inspection indicated by the verdict $v$, the Inspector outputs a suggestion $m$ for potential fixes and prompts the Builder to make another attempt. As a result, the Builder and Inspector work in tandem to write and self-debug code, forming a feedback system that significantly improves the quality of the generated scripts. We outline this paradigm in Algorithm 2 and illustrate it in Figure 5. Interestingly, the Inspector excels at catching errors even if the same guidelines in its metaprompt are present in the Builder. One possibility is that this is due to providing a more extensive list of negative and positive examples to the Inspector. Still, when the Builder is provided with the same examples, performance is not as high. Our intuition for this is that verifying a snippet of code is easier than writing the said code, or that the two tasks bear different failure modes that can be effectively hedged.
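The Builder-Inspector feedback loop can be sketched as follows. This is a minimal illustration, not Algorithm 2 itself: `toy_builder` and `toy_inspector` are hypothetical stand-ins for the two LLM modules, the single inspection rule is invented for the example, and the `max_attempts` cap is an assumption about how such a loop would be bounded.

```python
from typing import Callable, Optional, Tuple

Builder = Callable[[str, Optional[str]], str]   # (request, feedback) -> code
Inspector = Callable[[str], Tuple[bool, str]]   # code -> (passed, suggestion)


def build_with_inspection(request: str, builder: Builder,
                          inspector: Inspector, max_attempts: int = 3) -> str:
    """Regenerate code until the inspector approves or attempts run out."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        code = builder(request, feedback)
        passed, suggestion = inspector(code)
        if passed:
            return code
        feedback = suggestion  # fed back into the next attempt
    raise RuntimeError(f"inspection failed after {max_attempts} attempts")


def toy_builder(request: str, feedback: Optional[str]) -> str:
    """Forgets a required import on the first try, fixes it when told."""
    body = (f"// {request}\n"
            "var cube = GameObject.CreatePrimitive(PrimitiveType.Cube);")
    return ("using UnityEngine;\n" + body) if feedback else body


def toy_inspector(code: str) -> Tuple[bool, str]:
    """Applies one hypothetical rule: scripts must import UnityEngine."""
    if "using UnityEngine;" in code:
        return True, ""
    return False, "Add 'using UnityEngine;' at the top of the script."


script = build_with_inspection("create a cube", toy_builder, toy_inspector)
print(script.splitlines()[0])
```

Raising an error when the cap is reached, rather than returning unverified code, reflects the protective role the text assigns to inspection; a real system might instead surface the last suggestion to the user.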

Compilation, Save and Reload
After the Builder-generated script passes the inspection, we follow the approach in [39] to compile and execute the scripts at runtime through the Roslyn C# compiler [49]. The inclusion of run-time compilation elevates LLMR from an offline development tool to a real-time generative framework.
To enable iterative design, users can save their generations and selectively reload the saved generations in the existing or a new scene without having to repeat the prompting process. The generated output is saved as C# scripts and reattached to the Compiler to be compiled at runtime. A one-sentence summary of each script's function is also saved, so the output can alternatively be regenerated by the framework based on the summary.
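One way to picture the save-and-reload step is a folder of script files plus a small manifest of summaries. The sketch below is an assumption about a possible layout, not LLMR's actual storage format: the `manifest.json` file name, the helper functions, and the placeholder script contents are all hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterable

MANIFEST = "manifest.json"  # hypothetical file name, not taken from LLMR


def save_generation(folder: Path, name: str, source: str, summary: str) -> None:
    """Store a generated C# script and its one-sentence summary."""
    (folder / f"{name}.cs").write_text(source, encoding="utf-8")
    manifest_path = folder / MANIFEST
    manifest = (json.loads(manifest_path.read_text(encoding="utf-8"))
                if manifest_path.exists() else {})
    manifest[name] = summary
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")


def load_generations(folder: Path, names: Iterable[str]) -> Dict[str, str]:
    """Return the source of the selected scripts, ready to be recompiled."""
    return {n: (folder / f"{n}.cs").read_text(encoding="utf-8") for n in names}


def load_summaries(folder: Path) -> Dict[str, str]:
    """Return every saved summary, e.g. to regenerate a script from it."""
    return json.loads((folder / MANIFEST).read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    save_generation(folder, "Paintbrush", "// C# source here", "Draws lines in 3D.")
    save_generation(folder, "Magnifier", "// C# source here", "Enlarges nearby objects.")
    reloaded = load_generations(folder, ["Paintbrush"])  # selective reload
    summaries = load_summaries(folder)

print(sorted(reloaded), sorted(summaries))
```

Storing the source and the summary side by side supports both paths mentioned in the text: recompiling the saved script as is, or regenerating it from its description.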

Skill Library
The creation of the Skill Library module is motivated by two primary challenges. The first is the token size limitation imposed by the GPT architecture on the context, or the "metaprompt," provided to the Builder. Typically, the Builder is presented with a comprehensive list of various APIs and plug-ins that could be employed to meet the user's needs. As the range of available skills expands, this list lengthens, eventually surpassing GPT's token size limit for public users. The second challenge lies in the Builder's attention capacity, which appears to be limited. Even when we attempt to condense all the available skills into the Builder's metaprompt, it struggles to keep track of a specific skill when the list becomes too lengthy. This limitation is further exacerbated by the necessity to include precise coding examples for each plugin to ensure their effective utilization by GPT. To address these challenges, we created the Skill Library module, denoted as $\mathcal{L}(h \mid r)$, which serves as a centralized repository for all available skills and as an attention mechanism that retrieves only the skills relevant to a specific user prompt. We illustrate this module in Figure 6.
Formally, a specialized GPT is provided with a metaprompt containing two essential pieces of information: 1) a high-level summary of the available skills, and 2) the user's prompt. The GPT model is tasked with identifying either a single skill or a subset of skills that are most pertinent to the user's request. The Skill Library remains efficient and small in token size because it only needs the high-level descriptions of each skill, while the specific usage details are retrieved and supplied to the Builder only after a skill is selected.

Figure 6: Skill Library module workflow. On the left, the module receives inputs from the Scene Analyzer and a user prompt "create a whale and make it swim happily". A list of skills is provided to the SL GPT module in its metaprompt, which also contains a high-level summary of available skills such as object retrieval and animation. The module then identifies and outputs the most relevant skills (in this case, object retriever and animation) to the Builder, which subsequently utilizes these tools for implementation.
As an illustrative example, consider a skill we created for GPT's use, which leverages a combination of generative and contrastive models along with the Sketchfab API to source and integrate 3D models into a scene. We have also created skills that allow the generation of animation of a rigged object in real time [19]. While we delve into the specifics of a skill in the next section, it is worth noting that the Skill Library only receives a high-level summary of how this particular skill functions, along with similar descriptors for other skills. The actual examples needed to use this skill are then retrieved and supplied to the Builder for execution:

$$\mathcal{B}(c \mid r, s, h): \text{Builder}; \quad h = \text{retrieved skills from } \mathcal{L} \tag{6}$$

This approach ensures that the Skill Library and the Builder work in tandem to efficiently and effectively generate code that fulfills the user's request while overcoming the token size and attention capacity limitations of LLMs.
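The two-stage retrieval described above (select on short summaries, then hand over the long usage notes) can be sketched in Python. Everything named here is hypothetical: the `SKILLS` registry, its summaries, and the placeholder usage strings are invented for illustration, and word overlap stands in for the specialized GPT that performs the selection.

```python
from typing import Dict, List

# Hypothetical skill registry: a short summary used for selection, and longer
# usage notes that are handed to the code generator only once selected.
SKILLS: Dict[str, Dict[str, str]] = {
    "object_retriever": {
        "summary": "create load fetch a 3d model object",
        "usage": "<full usage examples for loading a model>",
    },
    "animation": {
        "summary": "animate move swim walk a rigged object",
        "usage": "<full usage examples for animating a rig>",
    },
    "speech_input": {
        "summary": "listen speech voice microphone input",
        "usage": "<full usage examples for speech input>",
    },
}

STOPWORDS = {"a", "an", "the", "and", "it"}


def select_skills(request: str, skills: Dict[str, Dict[str, str]]) -> List[str]:
    """Pick skills whose summary shares a content word with the request.

    Word overlap is a placeholder for the LLM-based relevance decision.
    """
    words = set(request.lower().split()) - STOPWORDS
    return [name for name, skill in skills.items()
            if words & (set(skill["summary"].split()) - STOPWORDS)]


def builder_context(request: str, skills: Dict[str, Dict[str, str]]) -> str:
    """Assemble only the usage notes of the selected skills."""
    return "\n".join(skills[name]["usage"] for name in select_skills(request, skills))


chosen = select_skills("create a whale and make it swim happily", SKILLS)
print(chosen)
```

Only the usage notes of the chosen skills end up in the generator's context, so adding more skills to the registry grows the selection prompt by a summary line each rather than by a full set of examples.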

INCORPORATING EXISTING OPEN-SOURCE 3D ASSETS
The process of generating interactive 3D scenes often involves the creation and placement of various objects. For instance, a request to create an office space might be decomposed into the generation of a desk, a chair, a lamp, and a clock. While it is possible to generate these objects using primitives, a method that works well even for composite objects like a car or an entire room (depicted in the car of Figure 8 and the kitchen of Figure 1), there is a need to leverage the intricate objects created by artists and 3D developers that exhibit high real-world fidelity. Previous work has utilized objects from Sketchfab [39,40] and used the priors of GPT to size them according to the real world. However, this approach encounters challenges when the user prompts for an object, say a clock, and Sketchfab offers 50 different clocks, only three of which are suitable for an office setting.
To address this issue, we introduce the Object Retriever, a skill that employs other AI models to identify the 3D object that the user most likely intended. The workflow of the Object Retriever can be formalized as follows: given a user prompt $r$, the Object Retriever identifies an object $o$ contained in $r$ and calls the Dall•E-2 [47] API for the object $o$, generating a "target image" $y$. Concurrently, the same object-prompt $o$ is used to download $n$ screenshots of 3D objects freely available on Sketchfab, denoted as $X = \{x_1, x_2, \ldots, x_n\}$. We then employ CLIP [36] to map out similarity spaces in the language domain $\mathcal{D}_\ell$ and the visual domain $\mathcal{D}_v$. We select the top 5 images $X' \subset X$ that are closest to the object-prompt $o$ in the language similarity space $\mathcal{D}_\ell$, and from these, we select the image $x^*$ that is closest to the target image $y$ in the visual similarity space $\mathcal{D}_v$. Formally, let $\sigma_\ell(o, x_i)$ and $\sigma_v(y, x_i)$ denote the language and visual similarity between the object-prompt $o$ and the screenshot $x_i$, and between the target image $y$ and the screenshot $x_i$, respectively. The Object Retriever operates as follows:

$$X' = \operatorname*{top\text{-}5}_{x_i \in X} \; \sigma_\ell(o, x_i), \qquad x^* = \operatorname*{arg\,max}_{x_i \in X'} \; \sigma_v(y, x_i).$$

This process is repeated to generate entire scenes. Algorithm 3 and Figure 7 describe this pipeline. There is potential for further exploration to improve this pipeline. For instance, selecting from the visual similarity space before the language similarity space might yield better results. Future work will involve human feedback to identify the workflow that maximizes the likeness between the 3D object loaded and the user's intended object.

Figure 7: Object Retriever pipeline for generating a 3D scene. The user provides a prompt for a scene containing a clock, a picture frame, a chair, and an apple on a table. For each object (e.g., clock), the pipeline uses DALL-E 2 to create a target 3D image. Concurrently, multiple screenshots of potential matches from open-source Sketchfab models are downloaded using the object label as the query. CLIP is employed to generate embeddings for these images, including the target image. The top 5 candidates in the language similarity space are selected. The final object is then chosen based on the highest visual similarity to the target image. This sequence is repeated for each object in the prompt to assemble the complete 3D scene, as shown on the far right.
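The two-stage selection in the Object Retriever (shortlist by language similarity, then pick by visual similarity) can be sketched with plain vectors. This is an illustration under stated assumptions: the embeddings below are made-up 2-D vectors rather than real CLIP outputs, cosine similarity is assumed as the similarity measure, and the candidate names are hypothetical.

```python
import math
from typing import Dict, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pick_screenshot(text_emb: Sequence[float], target_img_emb: Sequence[float],
                    candidates: Dict[str, Sequence[float]], top_k: int = 5) -> str:
    """Shortlist by similarity to the text, then choose by similarity to the image."""
    shortlist = sorted(candidates,
                       key=lambda name: cosine(text_emb, candidates[name]),
                       reverse=True)[:top_k]
    return max(shortlist, key=lambda name: cosine(target_img_emb, candidates[name]))


# Made-up embeddings in a shared space, for illustration only.
text_emb = [1.0, 0.0]          # embedding of the object label, e.g. "clock"
target_img_emb = [0.6, 0.8]    # embedding of the generated target image
candidates = {
    "alarm_clock": [1.0, 0.1],
    "office_clock": [0.7, 0.7],
    "grandfather_clock": [0.0, 1.0],
}
best = pick_screenshot(text_emb, target_img_emb, candidates, top_k=2)
print(best)
```

With `top_k=2`, the candidate that is closest to the target image overall (`grandfather_clock`) is never considered because it fails the language shortlist; this is the ordering effect the text flags when it suggests trying the visual stage first.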

MEMORY MANAGEMENT
By default, language models generate new words based on all previously sampled tokens, a configuration that may not be ideal due to their finite context length. For instance, this may hinder the model's ability to engage in extended conversations. To mitigate this, techniques such as dialogue summarization and distillation can be employed [2,20,54]. Additional research has delved into leveraging persistent memory and retrieving in-context examples from databases to enhance few-shot performance [55,63].
We sought to deploy a protocol that alters the contents within the LLM's context window while the framework is in continuous use. We explored three memory modes for each module within LLMR: full memory, limited memory, and memory-less. We document the memory modes used for each module in Table 1. These modes pertain to the retention of all, a few, or none of the historical instructions and generated code within the model's context. Define an episode of interaction as the input and output to the module for a single user prompt to LLMR. To implement a memory-limited module, for example, we clear its context of all but the most recent $k$ episodes after every prompt, where $k = 1$ typically.

Figure 9: Sketching objects into existence with LLMR. In the left panel, a user requests a "magic paintbrush" to be attached to a VR controller. The middle panel illustrates the automatic conversion of the line renderer into a paintbrush, where the user is shown drawing a chair. The right panel demonstrates the 2D-to-3D transformation using 2D-3D ControlNet [59] and our Dall•E-CLIP Sketchfab API. This enables the generation of multiple chair models that can then be transferred across different platforms using LLMR for further interaction.

| Module | Memory Mode |
|---|---|
| Planner | Memory-less |
| Scene Analyzer | Memory-less |
| Builder | Limited-memory |
| Inspector | Memory-less |
| Skill Library | Memory-less |

Table 1: Memory mode for each module. Note that no module uses full memory, the default GPT paradigm.
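The three memory modes can be expressed with a bounded queue of episodes. The sketch below is a hedged illustration: the `ModuleMemory` class and its `keep` parameter are hypothetical names, and messages are modeled as plain strings rather than a chat API's message objects.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple

Episode = Tuple[str, str]  # (input to the module, output of the module)


class ModuleMemory:
    """Context manager for one module (hypothetical helper for illustration).

    keep=None -> full memory, keep=0 -> memory-less, keep=k -> limited memory.
    """

    def __init__(self, keep: Optional[int]) -> None:
        # A deque with maxlen silently drops the oldest episode when full;
        # maxlen=0 discards everything, maxlen=None keeps everything.
        self._episodes: Deque[Episode] = deque(maxlen=keep)

    def record(self, module_input: str, module_output: str) -> None:
        self._episodes.append((module_input, module_output))

    def context(self, metaprompt: str) -> List[str]:
        """Messages to send next: the metaprompt plus retained episodes."""
        messages = [metaprompt]
        for module_input, module_output in self._episodes:
            messages += [module_input, module_output]
        return messages


builder = ModuleMemory(keep=1)     # limited memory, most recent episode only
inspector = ModuleMemory(keep=0)   # memory-less
for memory in (builder, inspector):
    memory.record("create a house", "<code 1>")
    memory.record("add a tree", "<code 2>")

print(builder.context("builder metaprompt"))
print(inspector.context("inspector metaprompt"))
```

Because trimming happens at record time, the context size of a limited-memory module stays bounded no matter how long the session runs.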
An effective memory management protocol offers three distinct advantages:
Token limit: Trimming old memory reduces token consumption and enables prolonged usage of LLMR, a critical feature for gradually constructing intricate scenes. Notably, the Scene Analyzer benefits from having no memory of prior interactions, as it is susceptible to token constraints. As an example, the first AI2-THOR scene hierarchy measures around 7k GPT-4 tokens [24]. Hence, a full-memory Scene Analyzer with 8k tokens can only fulfill a single instruction before its context is depleted, rendering the framework essentially unusable outside of a memory-less setting.
Performance: Certain modules perform better with reduced memory, as they may be prone to confusion from earlier interactions. For example, our empirical observations indicate that the Inspector module exhibits increased leniency in repeated inspections, allowing the proposed code to pass before all errors are rectified.
Interpretability: A memory-limited framework provides clearer error attribution. For instance, when a sequence of prompts is sent and the generation fails at the final step, maintaining all memory makes it challenging to discern whether the last prompt posed a unique challenge or if the framework became perplexed by aspects of an earlier task. Improved transparency facilitates swift debugging and iterating on our framework.
We believe the choice of memory mode is a crucial aspect of any LLM orchestration pipeline, and our design choices may offer insights for the development of LLM systems beyond the task of creating virtual worlds.
Figure 10: Accessible Interface Features in Action. A1 and A2 show how a user can prompt the system to adjust the color scheme of a kitchen scene for red-green color-blind compatibility. B1 and B2 demonstrate the activation of a magnifier tool. C1 and C2 reveal the option to hide objects deemed not kid-friendly.

CROSS-PLATFORM COMPATIBILITY AND INSTALLATION
We show that our framework can be deployed on various types of platforms (e.g., Web, Mobile, AR, and VR) and on various devices (e.g., Meta Quest, HoloLens 2). To keep the framework lightweight, we deploy our framework's run-time compiler on a PC that acts as the server, and we build upon existing remoting protocols and frameworks [32,46] to stream the generated results to the client device (e.g., holographic remoting for a HoloLens 2). Platform dependencies, such as namespaces and other packages, can be added as a "Skill" to the framework's Skill Library, which allows the user to quickly enable interaction modalities such as pinch and input modalities like speech and controller.
Interactive elements built within one scene can be saved as self-contained units by storing the source code that created them. We can then re-execute the cached code to load and adapt the prompted objects into novel scenarios, which can be as simple as a different scene with adjusted physics or a project with completely new APIs, as depicted in Figure 8. Our experiments with LLMR suggest that translating interactive elements between independent SDK platforms is possible, which points to an application: adapting existing pieces of software (perhaps ones written with obsolete, no longer working code) to newer SDKs. We leave this for future exploration.

Installation
Our framework can be easily added to any existing Unity scene. The framework consists of a Unity package and a few additional open-sourced packages (such as GLTF loader and OpenAI), and the installation process takes only a few steps. This enables anyone with an OpenAI API key to try our framework. We are strong proponents of the adaptability of our framework, and so we have open-sourced the foundational framework along with several examples on GitHub (https://llm4mr.github.io/). Readers who wish to try our framework can try out the example playground scenes or can easily add our Unity package to their existing Unity projects. They would need to obtain an OpenAI API key, a copy of the Roslyn compiler, and optionally an account for Sketchfab if they wish to automatically load existing assets. In the Appendix, we also provide the metaprompts used for each of LLMR's modules for transparency.

EXAMPLE PROMPTED INTERACTIVE WORLDS AND USES
In this section, we illustrate the wide range of objects, tools, and scenes one can construct with LLMR. We highlight that our framework is modular, real-time, adaptive, interactive, and multi-modal, which differentiates this approach from other generated 3D worlds that primarily focus on visual appearance. For all of the examples below, it is important to stress that all of the results are achieved simply by prompting the system, without the need for manual intervention.
Figure 11: Spontaneous Creation of Teaching Guides. A demonstration of creating a guide for operating a coffee machine in which LLMR animates a hand model to point out the various steps of the operation. Our framework allows for the rapid creation of such guides and furthermore allows users to ask questions that were not predicted by the instructor beforehand, with appropriate motions being animated on the fly.

Game Design and Creativity
An immediate application of our framework is the creation of games, in particular, scenes. A scene sets the context of a game, and it usually involves numerous assets that are difficult and tedious to set up manually. A game designer can use the Planner to create a draft environment, and add interactive components like "players" and "opponents" with responsive behaviors to mock up the gameplay logic. In addition, game designers can expand gameplay in multiple environments. For example, a toy car can be created and reloaded in a moon simulation environment in VR (Figure 8 B) or be spawned in the physical world and driven around with a mobile phone (Figure 8 C). Besides "prompting" objects into existence, we show that our framework also allows users to "draw" things into existence.
Here the user wishes to design a chair (Figure 9). They can do so by simply prompting "a magic paintbrush", which has functions similar to that of TiltBrush [12], a popular 3D drawing application, and then turn the drawing into a 3D model with the integration of Dall-E 2, CLIP, and Sketchfab, through a similar process illustrated in Figure 7.

Accessibility and Adaptive Interface
Similar to the accessibility features in 2D documents, our framework can also be prompted to make a 3D scene accessible and adaptive to different user needs and preferences. Figure 10 shows three examples of editing an existing virtual kitchen scene in response to different requests. For example, one can request to make the scene friendlier to red-green color-blind users. Someone who is near-sighted can prompt a magnifier tool that zooms into a particular part of the room. An architect can use our framework to figure out if the space is friendly for wheelchair users or make sure objects in the room are child-proof. These examples show how our framework leverages LLMs' prior knowledge and puts the knowledge into the context of a spatial world at a human scale.

Remote Assistance and Planning
In a remote training scenario, creating a 3D interactive training guide typically requires custom work, from rigging a gesture to placing a UI element. An instructor can use our framework to automate the generation of a training guide from a list of instructions (Figure 11). The trainee can then, for example, use an AR device that overlays information on the machine. As the trainee advances through the steps, they can ask questions directly to the guide, and answers can be generated in the context of the trainee's learning progress. In another scenario of remote rescue planning, helicopter operators can prompt a simulation of the flight path given several target locations and see how the flight path might be affected by different wind conditions (Figure 12).

NUMERICAL STUDY
As an orchestrated pipeline, LLMR augments an LLM coder with multiple modules to enhance its reliability. To empirically justify the inclusion of each module in our framework, we quantitatively evaluate LLMR's generative performance against a variety of prompts and baselines. In addition to success in compiling the generated code, we evaluate how our framework meets our design goals: real-timeness, complexity of interaction, and iterative fine-tuning ability. This section is organized as follows: we begin by evaluating LLMR on single prompts in an empty and existing scene, highlighting the impact of each module and overall performance compared to standard LLMs. We also discuss the framework's performance at completing tasks with different complexity. Then, we conduct a similar experiment on sequential prompts to illustrate LLMR's capacity for iterative designs. Lastly, we present an analysis of the real-time aspect of our framework.

Error Rate
8.1.1 Experiment Setup. We start by investigating LLMR's ability to carry out single, independent requests in either an empty or existing scene. To this end, we created two datasets each with 150 prompts. The first set is used as inputs in an empty scene and is mainly creative in nature as there is nothing to modify or interact with in the world. An example is "creating a cat and mouse out of primitives. The cat should chase the mouse, who flees in an erratic pattern." The second set is used as inputs in an existing scene shown in Figure 13. The scene was downloaded from Sketchfab [48] and was chosen as it is sufficiently complex (around 35 objects). A few example prompts are shown in Figure 13, which involve visual and semantic alteration of the space. To promote fairness and diversity in our test prompts, we use a separate, properly prompted GPT to generate two evaluation datasets. The authors created 15 prompts as demonstrations for the prompting GPT. The full evaluation datasets can be found in the Appendix.
To assess the efficacy of LLMR, a proper metric is required. Given the subjective nature of tasks such as "make the room more uplifting," it is difficult to systematically determine if a prompt has been met successfully. However, the presence of run-time or compilation errors in the generated code can be considered a clear indicator of failure. Therefore, we have selected the 'error rate', the proportion of outputs with bugs, as the criterion for assessing the framework's performance.
To evaluate the efficacy of each module of LLMR, we created three model conditions, each adding one more GPT module to the previous one, alongside two baselines: GPT-4 zero-shot and GPT-4 few-shot. This makes a total of 5 model conditions. We conducted 5 runs of 150 prompts with each model condition and for each scene condition (empty scene and existing scene).
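As a worked illustration of the metric, the sketch below computes an error rate per run and averages it over runs. The function names and the boolean outcome encoding are assumptions made for this example; the final line simply multiplies out the experiment grid described above, reading it as every model condition being run on every scene condition.

```python
from statistics import mean


def error_rate(outcomes):
    """Fraction of outputs with a compilation or run-time error.

    `outcomes` is a list of booleans, True meaning the output had a bug.
    """
    return sum(outcomes) / len(outcomes)


def mean_error_rate(runs):
    """Average the per-run error rate over repeated runs of the same prompts."""
    return mean(error_rate(run) for run in runs)


# Size of the experiment grid as described in the setup:
# 5 runs x 150 prompts x 5 model conditions x 2 scene conditions.
N_RUNS, N_PROMPTS, N_MODELS, N_SCENES = 5, 150, 5, 2
total_generations = N_RUNS * N_PROMPTS * N_MODELS * N_SCENES  # 7500
```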
Figure 13: An illustration of our experimental setup. We provide the bathroom scene (left) and a subset of the 150 prompts (right) used in this space for the evaluation provided in Figure 14.
8.1.2 Results and Discussion. We provide a summary of error rates for our model and various baselines in Figure 14. To underscore the benefit of each LLMR module, we add each component incrementally to tease out its marginal impact. Starting with the off-the-shelf GPT-4, we see that standard in-context learning techniques increase performance in both settings, yet only to the extent that roughly half of the requests fail. From here, we augment the standard GPT-4 with components developed in this work, starting with the Scene Analyzer, then the Skill Library, and finally the Inspector. As a result, the generated errors drop substantially to only 20.5% and 25.2% of the error rate observed in the original GPT-4 for the empty and existing scene, respectively, which attests to the effectiveness of our pipeline over standard, off-the-shelf LLMs for the task of generating interactive scenes.
We now discuss the impact of each LLMR module in detail. As explained in Section 3.2, the Scene Analyzer allows LLMR to parse and understand the virtual scene and is thus indispensable for meaningful manipulations of existing environments. Consequently, enhancing GPT-4 with the Scene Analyzer results in a significant performance enhancement in the Bathroom scene. Secondly, the Inspector module enables LLMR to perform self-debugging and effectively prevents the generation of erroneous code, further reducing the error rate in both scenarios. Although we integrated the Inspector at the final stage, it is compatible with any combination of modules and will consistently reduce the output error rate. As an example, we added the Inspector to GPT-4 with few-shot prompting in the empty scene and observed that the average error drops from 45.0% to 13.1%. We also observe that the Skill Library has a marginal impact on the error rates. This is expected, since the Skill Library is designed to handle more specialized tasks, which we discuss in more detail in the following subsections. Lastly, the Planner is not included as it alters the input prompt with step-by-step decomposition, making the results incommensurable. We include in the Appendix an example where the Planner is used to build a virtual kitchen, underscoring the benefit of decomposing difficult tasks into incremental steps.
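The self-debugging role of the Inspector can be pictured as a generate, inspect, and retry loop. The sketch below is only a schematic of that idea: `builder` and `inspector` are hypothetical callables standing in for the LLM-backed modules, and the `max_rounds` cap is an assumption rather than a documented setting of LLMR.

```python
def build_with_inspection(prompt, builder, inspector, max_rounds=3):
    """Ask the builder for code until the inspector accepts it.

    builder(prompt, feedback) -> code string (feedback is None on the first try)
    inspector(code) -> (ok, feedback) where feedback describes what to fix
    Returns (code, rounds_used), or (None, max_rounds) if no attempt passed.
    """
    feedback = None
    for round_index in range(1, max_rounds + 1):
        code = builder(prompt, feedback)
        ok, feedback = inspector(code)
        if ok:
            return code, round_index
    # Every attempt was rejected; the caller decides how to surface the failure.
    return None, max_rounds
```

Each extra round costs another generation, which is consistent with the latency tradeoff reported for the Inspector in the real-time analysis.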

Methodology.
To explore the relationship between the complexity of prompts and the completion rate of LLMR, we performed an ad-hoc analysis of the results from the previous section. We classified our prompts for the single-prompt task into levels of difficulty from 1 to 10. To achieve this, we utilized a non-contextualized LLM devoid of any meta-prompting, asking it to assign a difficulty level to each prompt. This process was repeated ten times for each prompt, and the average difficulty level was then calculated (the standard deviation across repetitions was small). The aggregated results, categorized by difficulty level, are illustrated in Figure 15. The prompt given to this LLM (GPT-4) was "The above are prompts that are given to a system that can code and execute commands inside of Unity. We want to measure how good this system is at coding in C# for Unity purposes. Given your knowledge of Unity, please rate all of the prompts above on a level of difficulty from 1 to 10". The rationale behind employing a non-contextualized LLM (without any meta-prompting) lies in the subjective nature of assessing difficulty levels. Being the developers of the system, our judgment might be inherently biased, influenced by our understanding of the system's capabilities and limitations. Furthermore, engaging Unity experts to determine the difficulty levels presents its own challenges. The variability in the expertise and experience levels among Unity developers could lead to inconsistent evaluations, and it would be difficult to standardize the experience of the evaluators without a comprehensive and uniform examination framework.
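The aggregation step can be sketched as follows. The averaging over ten ratings follows the description above; the mapping from an average score to a named category is an assumption made for illustration (two score points per category), since the exact bin edges are not stated. The category names are the ones used in the discussion of Figure 15.

```python
from statistics import mean, pstdev

# Category names as used when discussing Figure 15.
LABELS = ["Easy", "Somewhat Easy", "Medium", "Somewhat Hard", "Hard"]


def aggregate_difficulty(ratings):
    """Return the mean and population standard deviation of repeated ratings."""
    return mean(ratings), pstdev(ratings)


def bucket(average):
    """Map an average 1-10 score to a category.

    Assumed bin edges: [1,3) Easy, [3,5) Somewhat Easy, [5,7) Medium,
    [7,9) Somewhat Hard, [9,10] Hard.
    """
    index = int((average - 1) // 2)
    return LABELS[max(0, min(index, len(LABELS) - 1))]
```

A small standard deviation across the ten ratings indicates that the averaged score is stable enough to be used for bucketing.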

Results and Discussion.
In this section, we analyze the performance of various architectures in executing Unity tasks, differentiated by difficulty levels that range from Easy to Hard. These levels were determined based on a 1-10 scale assigned by GPT-4. Figure 15 shows the error rate of the different architectures on two panels. On the left, we have the results for the empty scene and on the right for a scene with a bathroom containing various objects.
Across all levels and scenes, LLMR (orange line) consistently outperforms other architectures, underscoring its robustness. In the empty scene on the left, a noticeable trend is that the error rate generally increases with the task difficulty. This trend aligns with expectations, except in the case of GPT-4 Zero Shot. A notable point here is that the Easy category only contains a single prompt, which is a basic "Hello World" console display. The simplicity of this task explains its solitary placement in this category. For the bathroom scene, the error rates for Medium and Somewhat Hard tasks show minimal variation, suggesting a plateau in difficulty perception. An interesting observation is the drop in error rates from Easy to Somewhat Easy tasks, although this is not consistent across all models. The integration of the Skill Library shows mixed effects (dark blue line). In some instances, it enhances performance, while in others it seems to hinder it.
Estimating the difficulty of tasks, especially in scenarios involving modifications to an existing scene rather than building from scratch, presents challenges. This is exemplified in the bathroom scene, where adding new objects (difficulty levels 3-4) did not require scene understanding, contrary to the tasks in the Easy category, which involved moving objects and thus relied more on scene comprehension. Our analysis of the prompts indicates that the nature of the scene significantly influences the perceived difficulty. For instance, in the bathroom scene, certain tasks categorized as Easy in theory turned out to be more challenging in practice. The Appendix offers a more comprehensive analysis, including variations in architecture, such as the combination of Scene Analyzer (SA) and Inspector modules.
In conclusion, LLMR demonstrates superior performance across various scenarios, underscoring its effectiveness in handling tasks of varying complexity in Unity environments. This analysis also highlights the intricate relationship between task difficulty, scene context, and architectural components, paving the way for further exploration in optimizing task-specific architectures.

Task Complexity
Complexity can manifest through different aspects. To supplement the ad-hoc analysis above, we now provide a more comprehensive discussion by breaking down the concept of complexity through the following aspects and share findings that emerged throughout our experimentation:
Specific Skill Requirement - Certain tasks are inherently more difficult. For example, deforming the mesh of an object is much more complicated than adding an object to the scene. A human developer may need to look up examples and documentation to achieve a complex task; LLMR can reduce the complexity of the task by starting a templated script. However, LLMR is not error-proof. As shown in the previous section, LLMR's error rate increases (but not above 40%) as a task becomes more difficult. An experienced developer can bring the error rate down further by adding relevant skills to the Skill Library, which helps LLMR achieve a higher success rate and reduces the rounds of iteration needed between the Builder and Inspector. LLMR can save the time of experienced developers by generalizing beyond the examples provided in the Skill Library.
Token Requirement - The number of tokens required grows as the scene or the object becomes more complex. For example, if the existing scene has a tree object with many leaves, where each leaf is considered a child game object, the scene summary could easily exceed the maximum number of tokens allowed (at the time of writing, the maximum number of tokens was 8k, though this has now grown significantly). In anticipation of this, our Scene Analyzer module only fetches the top-level game object names and filters them down to the relevant game objects based on the user query and task. This allows our framework to handle a scene as complex as the Kitchen scene (example in Figure 10) and the Bathroom scene (used in Numerical Study, Figure 13). Besides the complexity of objects and scenes, a task itself can also require a lot of tokens. One such example is generating an animation [19] (with the help of a skill written for the Skill Library) that involves generating time-series of numerous joint positions.
Memory Requirement - When a task requires prior knowledge of previous prompts (e.g., behaviors created by previous prompts), this requires previous prompts to have been successfully compiled and to be robust enough. We described approaches to managing the memory of the different modules in Section 5 to both conserve total memory consumed while preserving the necessary information for the framework to carry out complex tasks.
Quality Requirement - A user may request different levels of fidelity of the output. For example, the user could create a complex scene out of primitives only with the help of the Planner module (e.g., a full kitchen, Figure 1) instead of out of higher fidelity 3D models (see examples of participants' creations in the video figure). The flexibility to create visually simple yet functional and interactive scenes is akin to creating a lo-fi mockup that allows users to quickly prototype and iterate without waiting for the full generation of 3D scenes that are visually complex but cost a lot of compute and time and are not easy to modify.
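The token-saving idea described under Token Requirement, listing only top-level game objects before narrowing down to the relevant ones, can be sketched as below. The nested-dictionary scene format and all function names are hypothetical, and the keyword match in `select_relevant` is a deliberately simple stand-in for the relevance judgment that LLMR delegates to the Scene Analyzer.

```python
def top_level_names(scene):
    """Names of root-level game objects only; children are not traversed."""
    return [obj["name"] for obj in scene]


def count_nodes(obj):
    """Total number of game objects in one subtree, including the root."""
    return 1 + sum(count_nodes(child) for child in obj.get("children", []))


def select_relevant(scene, query):
    """Stand-in relevance filter: keep objects whose name appears in the query."""
    words = set(query.lower().split())
    return [obj for obj in scene if obj["name"].lower() in words]
```

For a scene holding a tree with 500 leaf children plus two other objects, `top_level_names` yields three entries, whereas a full hierarchy dump would enumerate 503 objects.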

Iterative, Incremental Design
8.4.1 Experimental Setup. In practice, creating content-rich virtual worlds requires incremental steps. Therefore, it is important to assess how LLMR performs in iterative scenarios, where requests are made and fulfilled one after another to gradually build and alter a virtual scene. We tested LLMR with 80 sequential prompts, each averaging 5 single prompts. These sequential prompts consist of a set of instructions aimed at completing a complex task. For instance, a sequential prompt for constructing a bedroom might include steps like "create an empty room with walls; add a bed with a lamp next to it; add a window on the wall." We use three metrics to evaluate performance in an iterative setting. First, the error rate on all individual prompts is considered and is the same as in single tasks. Second, we calculate the average degree of completion, measured as the number of completed single prompts over the sequence length for each sequential prompt. As the sequential prompts have varying lengths, assessing the completion average prevents "long and simple" sequences from flooding the error rate. Lastly, we define fulfilled prompts to be sequential prompts that are completed from start to finish and compute their percentage over the total number of prompts. This is a demanding metric that validates whether the model can manage extended use sessions gracefully. In extreme cases, a model excelling only in short sequences can have a reasonable error rate yet zero perfectly fulfilled prompts.
Table 3: Average time taken in seconds to generate and compile each prompt. SA stands for the Scene Analyzer, and SL stands for the Skill Library. LLMR is equivalent to GPT-4 augmented with the Analyzer, the Skill Library, and the Inspector.
the progressive creation and modification of virtual scenes, a scenario that resonates more closely with practical use cases. Lastly, we discuss in Section 9 how the users subjectively rate the iteration process working with our framework.
In general, sequential prompts are much more challenging than single prompts because they require the model to maintain and manage long-range dependencies, a task known to be challenging in sequence modeling [61]. To use the provided example, adding a window on the wall requires knowledge of the wall that was created a few prompts prior. From this perspective, the Scene Analyzer serves as an effective summarization [54] that helps the model redirect its attention to the part of the scene most relevant to the request, thereby reducing potential errors. In addition, the Inspector receives scene parsing from the Scene Analyzer and can thus effectively shield the generated code against potential errors in a sequential setting.
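The three metrics defined in the experimental setup for sequential prompts can be computed as in the following sketch. The data layout (one list of booleans per sequential prompt, True meaning the single prompt completed without error) and the function name are assumptions made for illustration.

```python
def sequential_metrics(sequences):
    """Compute the three metrics used for sequential prompts.

    `sequences` is a list of sequential prompts; each is a list of booleans,
    one per single prompt, True if that step completed without error.
    Returns (error_rate, average_completion, fulfilled_fraction).
    """
    total_steps = sum(len(seq) for seq in sequences)
    failed_steps = sum(1 for seq in sequences for ok in seq if not ok)
    # 1) Error rate pooled over all individual prompts.
    error_rate = failed_steps / total_steps
    # 2) Per-sequence completion ratio, averaged so long sequences do not dominate.
    average_completion = sum(sum(seq) / len(seq) for seq in sequences) / len(sequences)
    # 3) Share of sequences completed from start to finish.
    fulfilled_fraction = sum(1 for seq in sequences if all(seq)) / len(sequences)
    return error_rate, average_completion, fulfilled_fraction
```

With one fully completed four-step sequence and one two-step sequence that fails at its second step, the pooled error rate is 1/6 while the average completion is 0.75 and half of the sequences count as fulfilled, which shows how the three numbers can diverge.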

Real-time
8.5.1 Methodology. Last but not least, an important strength and design goal of our framework is the real-time creation and modification of objects and scenes, which is crucial to ensure the practicality of use. To evaluate the framework's real-timeness, we measured the time taken for task completion (from generation to compilation) with different combinations of modules. Once again, the results were from 5 runs of 150 prompts with each model condition. Note that we ran the experiment from August 2023 until December 2023, where the performance and latency of OpenAI's GPT-4 model varied slightly but the difference was marginal. The experiments were run on a PC with 32GB of RAM and an Nvidia RTX 3080 GPU.
8.5.2 Results and Discussion. Table 3 shows the average completion time (i.e., including generation and compilation time) for each model and condition. The off-the-shelf GPT-4 takes around half a minute to complete a single prompt, and the full LLMR framework on average takes a little over a minute, a timeframe we consider acceptable given the task complexity (Figure 15) and improved task completion rate (Figure 14 and Table 2). To put things into context, completing these complex tasks manually takes much longer even for someone reasonably familiar with Unity if we account for the time spent on looking up documentation and debugging.
There are a couple of factors that contribute to the additional operation time. First, the complexity of certain tasks, such as retrieving 3D assets from Sketchfab, requires extra time to download assets from third-party sites. The time needed to finish retrieving a 3D model varies a lot and depends on the size of the model, and thus we did not include this in our evaluation. In this case, our framework anticipates this by caching the previously saved model for faster reloading. Second, to ensure the success rate of our framework, the Inspector nearly doubles the code generation time (see the last two rows of Table 3). This is an inherent tradeoff, and the Inspector module can be turned off for simple tasks. Finally, back-and-forth interactions between LLMR and the user as well as iteration over the generated results contribute to the overall development time. It is worth noting that during our user study (Section 9), none of the participants mentioned or complained about generation time. Participants who are novice Unity users appreciated that LLMR saved them time from the steep learning curve. In addition, our "saving and reloading" capability (Section 3.4) allows users to iterate faster by reusing prior creations, which takes less than 10 seconds to recompile.

USABILITY STUDY
The ablation test focused on only the compilation and run-time errors in the code generated by our framework. We also wanted to evaluate the quality of the generated output with human users. In addition, we also wanted to understand how users with different levels of familiarity with Unity would use our framework.

Procedure
We recruited twelve users (1 pilot, 11 participants) with different levels of experience using Unity (5 participants had more than one year of Unity experience). The participants' backgrounds were software engineers, product managers, or researchers. Each session took around 2 hours, and each participant had at least 1.5 hours to experiment with the framework. We provided a Unity package that includes basic features (Scene Analyzer and Skill Library, Builder, and Inspector). Before the study, each participant downloaded the package to an empty or existing Unity scene and followed the instructions to set it up. Each participant went through a few rounds of interaction with the framework. A round of interaction could look like the following. The participant types: "Create a tool that changes the color of the car." The framework processed the prompt and generated scripts that were then automatically compiled at runtime. The participant looked at the generated output and decided on the next prompt. The investigator might suggest different things to try or remind the participant of the capabilities of the framework.
Participants were asked to think out loud throughout the study. At the end, the investigators conducted a semi-structured interview with the participant (see Appendix for the full list of questions). After the study, each participant filled out a seven-question questionnaire on a seven-point Likert scale about their experience using the framework.

Results and Design Recommendation
Participants were able to generate various outputs using our framework, such as cities and Asteroids-like games. Some even recreated their professional work, such as rigging camera angles and generating animations.
We used a mixed-methods approach to analyze the user study. We took into account the quantitative insights from the questionnaire responses, and we thematically grouped participants' think-aloud and semi-structured interview responses to identify patterns. These findings were then utilized to generate a set of design suggestions, which we will discuss in detail.
Questionnaire results revealed that participants generally had positive experiences with our framework in terms of achieving their goals, intuitiveness, and iterative use. However, there is room for improvement in reducing frustration and further enhancing user satisfaction (Figure 16). We also compared the responses between beginners and experienced Unity users. Beginners rate their experience with our framework more positively across most categories, as we will detail in the following sections.
9.2.1 Approach to Prompting and Instruction Strategies. We asked participants to describe their approach to prompting when using our framework. Participants emphasized the importance of ensuring that their prompts were easy for GPT to understand (P1, P2). Some participants treated the interaction with the framework as an experimental playground, experimenting with different prompts and refining them over time through trial-and-error (P0, P6). Many participants stressed the need to be highly specific in their instructions. This involved specifying object names, exact changes, and detailed parameters to achieve desired results and avoid unpredictability (P3, P4, P5, P9, P11). Many took the approach of breaking down tasks into smaller, more manageable steps. This included starting with simple components and gradually adding complexity (P4, P5, P6, P7, P8, P10). When creating environments or settings, participants often prioritized static elements before motion-centric ones and ensured that interactive elements responded to the environment (P7).

Comparison with Prior Approach to 3D World Creation.
When asked to compare our framework to their prior experience of creating 3D worlds, several participants appreciated the ease of describing their ideas directly to the model, eliminating the need for extensive manual scripting or documentation reference (P1, P3, P5, P6). Participants appreciated our framework's integration capabilities and its ability to automate certain aspects of 3D world creation, such as selecting and loading objects from 3D model repositories like Sketchfab and determining model placement (P4, P7). Some participants mentioned that our framework reduced the learning curve for newcomers, making it easier to get started with 3D world creation (P3). Participants appreciated the ability to directly intervene and manually adjust the 3D world generated by our framework, which they considered a powerful feature compared to other uses of generative models, which do not allow the user to adjust the final output (P9, P11).

Considerations and Challenges.
Participants noted that our framework's output was often unpredictable compared to traditional methods, and there was uncertainty about whether the model would understand the desired structure (P0). Some participants pointed out that the choice between traditional methods and our generative method of creating 3D worlds might depend on the artistic nature of the project and the need for creative input (P5, P7, P11). Other participants recognized that for projects requiring precise, structured, or rigid control, traditional tools might be preferred over our framework (P4, P8). Lastly, they also mentioned that for more complex tasks or as projects grew in size, manual code editing might still be necessary, as it could be faster than creating detailed descriptions of edits for the model (P1, P4, P8).
9.2.4 User Expectations and Surprises. We also probed participants' expectations and what they were surprised by during the user study. Many were surprised by our framework's ability to generate code effectively, helping them automate complex scripting tasks (P1, P4, P6, P8, P10). Specifically, P4 was impressed by the model's ability to handle complex structures like trees, despite token limits. P10 also highlighted the framework's ability to understand the hierarchy of game objects and utilize this information as input, especially in large and complex projects. Participants were impressed by the integration capabilities of our framework with Dall-E 2 and Sketchfab, which allowed for the creation of complex structures and the addition of 3D objects (P4, P6, P8, P10). It was surprising for participants to discover that our framework allowed for subjective queries and descriptions, accommodating plain language and euphemisms rather than strictly technical terms (P5). P1 was also amazed that the framework exhibited flexibility in understanding their Unity scripts and could even help resolve errors. Similarly, some were pleasantly surprised that, when prompted correctly, our framework could produce unconventional or unexpected results, such as unique player movement (P0, P3).

ETHICAL CONSIDERATIONS
While LLMR and other LLM tools can be transformative to many industries and applications, there exist risks with any AI-enabled systems. Firstly, the concern that developers and creators may be replaced has been at the forefront of discussion. However, these tools have not been proven to achieve end-to-end development. Participants of our user study commented that our framework is better at integrating human intervention and involvement, and thus our framework helps improve productivity and facilitate brainstorming, rather than completely automating the creation process. A more serious concern is the potential for individuals to generate harmful and inappropriate content with our framework. Despite the safeguards put in place by Sketchfab and OpenAI through content moderation and model alignment, it is still possible to creatively circumvent these safeguards [29]. While the Roslyn compiler can automatically check for unsafe code, research on how to moderate 3D content is still needed.

LIMITATIONS AND FUTURE WORK
Limitation in scene understanding - Currently, our framework requires access to a "scene graph" with descriptions and the hierarchy of the game objects. The scene graph provides the spatial relationships of game objects, and it assumes that the names of the game objects (and of their child game object components) are correct and unique. However, 3D models from repositories like Sketchfab often have random, non-descriptive component names. At the moment, the framework manipulates objects by finding the game objects with the exact name, which is not always reliable.
In addition, the scene hierarchy does not contain meta information about the objects, such as affordances and functions. Furthermore, a scene graph would not be readily available when the scene is a physical environment with augmented virtual objects. The natural next step is the incorporation of Large Vision Models (LVM) [18] to achieve tasks that require visual knowledge and semantic understanding of objects and environments. Our framework can benefit from the enhanced feedback and semantic information from these models, and can in turn enable more interactive editing of, and interaction with, a given 3D environment.
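To make the exact-name limitation concrete, the following is a minimal Python sketch, not our Unity implementation. The flattened scene list, the object names such as `Object_12`, and both helper functions are hypothetical and chosen only for illustration. It contrasts lookup by exact name with a simple keyword match against object descriptions.

```python
# Hypothetical flattened scene graph. The names and descriptions are
# illustrative assumptions, not data from our experiments.
scene = [
    {"name": "Bathtub", "description": "white ceramic bathtub"},
    {"name": "Object_12", "description": "folded blue towel"},
    {"name": "Object_12", "description": "wooden towel rack"},
]


def find_exact(scene, name):
    """Return the indices of all objects whose name equals `name` exactly."""
    return [i for i, obj in enumerate(scene) if obj["name"] == name]


def find_by_keyword(scene, query):
    """Return indices of objects whose description shares a word with `query`.

    Results are ordered by the number of shared words, most first.
    """
    words = set(query.lower().split())
    scored = []
    for index, obj in enumerate(scene):
        overlap = len(words & set(obj["description"].lower().split()))
        if overlap:
            scored.append((-overlap, index))
    return [index for _, index in sorted(scored)]


# A user who asks for "towel" gets nothing from exact-name lookup, and the
# non-descriptive name "Object_12" is ambiguous because it is not unique.
no_match = find_exact(scene, "towel")
ambiguous = find_exact(scene, "Object_12")
by_description = find_by_keyword(scene, "blue towel")
```

In this toy scene the exact-name lookup either fails or returns two candidates, while the description-based lookup ranks the towel first. Keyword overlap is only a stand-in here for the richer semantic matching that vision or language models could provide.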
Incorporating world feedback and direct user feedback - Similar to how the Builder-Inspector loop reduces code compilation errors, the framework's understanding of the world could be further improved by incorporating feedback from the virtual world and from the user. For example, if a 3D model is loaded from Sketchfab, the framework does not know the orientation and pivot point of the model (or of its subcomponents), and thus does not consistently produce the desired output when asked to rotate the 3D model.
Limitation in Memory and Traceability - Another limitation of the framework is the token limit. As mentioned in Section 5, we have optimized access to historical conversation and generation to reduce token usage. There is an inherent tradeoff: the user instruction might refer to something in a previous prompt exchange that is not exposed to the next exchange. For example, the Scene Analyzer has access to the name of the script, the summary of the script, and the public fields of the script, but if the user just wants to change a specific part of a script generated two prompts earlier, the framework would not know what to do.
Correspondingly, showing the generated code gives traceability and transparency to the results of our framework. At the moment, code written by our framework is stored locally in a cached folder and can be viewed within the Unity editor window. In addition to providing feedback via a follow-up prompt, the option to directly edit the code generated by our framework would give users more agency and help them achieve more complex, precise tasks (as mentioned in Section 9.2.3).
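The memory tradeoff described above can be illustrated with a small Python sketch. The `ScriptRecord` type, the `compact_view` helper, the whitespace-based token estimate, and the example script are all assumptions made for illustration; they do not reproduce our actual memory management.

```python
from dataclasses import dataclass


@dataclass
class ScriptRecord:
    """Hypothetical record of one previously generated script."""
    name: str
    summary: str
    public_fields: list
    source: str


def compact_view(record):
    """Expose only the name, summary, and public fields to the next exchange."""
    fields = ", ".join(record.public_fields)
    return f"{record.name}: {record.summary} [public: {fields}]"


def rough_token_count(text):
    """Crude token estimate (whitespace split), used only for comparison."""
    return len(text.split())


record = ScriptRecord(
    name="RotateCube",
    summary="Rotates the cube around the y axis.",
    public_fields=["speed"],
    source=(
        "public class RotateCube : MonoBehaviour {\n"
        "    public float speed = 30f;\n"
        "    void Update() { transform.Rotate(0f, speed * Time.deltaTime, 0f); }\n"
        "}"
    ),
)

view = compact_view(record)
saved = rough_token_count(record.source) - rough_token_count(view)
```

The compact view is shorter than the source, but it no longer contains the body of `Update`, so a request such as "change how the rotation is computed" cannot be resolved from the compact view alone.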
Automatic Skill Generation - At the moment, skills in the Skill Library are created by human users. For example, we incorporated the skill of loading assets from Sketchfab as well as the skill of making objects "grabbable" using MRTK's [31] namespaces. The ability to automatically generate new skills [53] based on a couple of examples would allow our framework to achieve more complex tasks (such as generating animations) and to be compatible with different platforms (such as Quest and ARKit).
Interoperability - We built upon the Unity engine for its robustness and the large number of existing examples of C# code that our LLMs have likely seen during training. Our work is independent of Unity's closed-beta AI tools, although our tool can be an add-on to Unity. We want to highlight that our framework can be adapted to any environment that supports run-time compilation. Unity is currently the baseline requirement for using our framework, and a web-based approach would further make prompt-based interactive 3D worlds easy to share and collaborate within. In fact, some of our user study participants work on web-based mixed reality development, and they commented that our framework can be easily adapted to their coding environment.
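As a rough illustration of what "run-time compilation" requires from a host environment, the sketch below uses Python's built-in `compile` and `exec` in place of Unity and the Roslyn compiler. The helper names `try_compile` and `run_generated` are hypothetical, and the sketch is not part of our implementation. Generated source is compiled while the program is running, and a compilation error is returned as text that could be passed back to a model.

```python
def try_compile(source, name="<generated>"):
    """Compile generated source at run time.

    Returns (code_object, None) on success or (None, error_message) on failure.
    """
    try:
        return compile(source, name, "exec"), None
    except SyntaxError as err:
        return None, f"{err.msg} (line {err.lineno})"


def run_generated(source):
    """Compile and execute generated source in a fresh namespace.

    Executing untrusted generated code is unsafe in general; a real system
    would need the kind of safety checks discussed under ethical considerations.
    """
    code, error = try_compile(source)
    if code is None:
        return {"ok": False, "error": error}
    namespace = {}
    exec(code, namespace)
    return {"ok": True, "namespace": namespace}


good_source = "def area(w, h):\n    return w * h\n"
bad_source = "def area(w, h:\n    return w * h\n"

good_result = run_generated(good_source)
bad_result = run_generated(bad_source)
```

Here `good_result` exposes the newly defined `area` function, while `bad_result` carries an error message instead of raising, which is the kind of signal a compile-and-repair loop would consume.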

CONCLUSION
In this paper, we have introduced a novel framework that addresses certain difficulties in applying LLMs to generate interactive 3D experiences. This framework leverages the abilities of multiple distinct and specialized LLM modules, orchestrated in a way that enhances their individual and collective performance on both coding and reasoning. Additionally, we have presented certain engineering aids, such as a skill that utilizes other AI models to add content into scenes, further expanding the capabilities of our framework.
Our research has demonstrated the benefits of each LLM-based module, providing a clear rationale for the inclusion of each module in our framework. By combining somewhat specialized components, our overall system became more robust and is significantly better than off-the-shelf LLMs. Through a user study, we have tested the quality and usability of our framework, allowing participants to challenge our framework with unprecedented prompts, thereby pushing the boundaries of the examples provided to LLMs.
The significance of this work lies in its potential to improve the generation of virtual world content with internal degrees of freedom and interactivity, and to improve the likelihood that such content will make sense intuitively to humans in a human-scale world. In turn, this shows a path to making LLMs more reliable in the domain of human-scale activity. The LLM does not merely incorporate what has been said about the world, but also tests results in a simulation of the world. The described framework operates across various devices and platforms; the present implementation does assume Unity.
We propose that this framework offers an opportunity for the HCI community studying LLMs. By providing virtual- or real-world data and the ability to act via code in such a world, our framework can serve as a platform to test and improve the limits of LLM reasoning capabilities when placed in 3D environments.
In conclusion, our work presents a significant step forward in the integration of LLMs with virtual world content and experience generation, offering a powerful tool for both developers and researchers. We look forward to seeing how this framework will be utilized and expanded upon by the wider community.

A NUMERICAL STUDY
For the numerical study, we run both the empty scene and bathroom scene experiments 5 times to reduce the effect of the nondeterminism of GPT-4. Even when we run each module with temperature 0, there is an inherent level of randomness present in LLMs, both due to sampling and the nondeterminism of GPU operations used for inference. We report in the tables below the average error rates and their standard deviations across these 5 runs, as well as the average time in seconds and the standard deviation in time over the 5 runs (not across the samples within each run).
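The aggregation described above can be sketched as follows. The outcome data are made up for illustration, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the text does not state which estimator was used. The point of the sketch is that the error rate is first computed per run, and the mean and standard deviation are then taken over the per-run rates.

```python
from statistics import mean, stdev


def error_rate(outcomes):
    """Fraction of prompts in one run that ended in an error (True = error)."""
    return sum(outcomes) / len(outcomes)


def summarize_runs(runs):
    """Mean and standard deviation of the per-run error rates.

    The spread is taken across runs, not across the prompts inside a run.
    """
    rates = [error_rate(run) for run in runs]
    return mean(rates), stdev(rates)


# Five illustrative runs over the same four prompts (invented data).
runs = [
    [True, False, False, False],   # error rate 0.25
    [True, True, False, False],    # error rate 0.50
    [False, True, False, False],   # error rate 0.25
    [False, False, False, False],  # error rate 0.00
    [False, False, False, True],   # error rate 0.25
]

average_error, spread = summarize_runs(runs)
```

The same pattern applies to completion time: average the times within a run first, then report the mean and standard deviation of those per-run averages.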

Table 5: Bathroom scene results averaged over 5 independent runs to reduce nondeterminism of GPT-4 run with 0 temperature. Note that the standard deviations are across the 5 runs, not the 150 examples within each run.
In addition to the results presented in the main text, we ran experiments on GPT-4 combined with the Scene Analyzer and the Inspector, that is, LLMR without the Skill Library. We find that this configuration on our datasets has similar performance to LLMR (same within the error bars), but in the case of scene editing it leads to longer completion times, which is consistent with the fact that it supplies more information in the context to the framework.

Create a cloud out of primitives that rains when I click on it
Create a firework out of primitives that explodes when I click on it
Create a teddy bear out of primitives that hugs me when I pick it up
Create a boat out of primitives that I can sail with the wind
Create a snowflake out of primitives that falls when I click on it
Create a donut out of primitives that I can eat with my mouse
Create a planet out of primitives that orbits around a sun
Create a flag out of primitives that waves when I blow on it
Create a spider out of primitives that crawls when I touch it
Create a fish out of primitives that swims when I feed it
Create a phone out of primitives that I can use to call a number
Create a clock out of primitives that I can set and alarm
Create a letter out of primitives that I can write and send
create a calico cat out of primitives
create a magic brush out of primitives that draws out of its brush tip
create a door out of primitives and add appropriate hinges such that it functions like a physical door
make the ball bounce in response to the environmental gravity
sort the objects in the scene from smallest to biggest in terms of their model size
make a 3D lever with appropriate joints and hinges, where the rotation of the lever controls the size of the cube
create a clock from primitives and add behaviors to the clock such that it functions like a real clock
create a button from primitives and add a behavior to the button such that it changes the color of the sphere when pressed
create a simple platformer game with a player, some platforms, and a goal
create a pendulum from primitives and add a behavior to the pendulum such that it swings back and forth
create a camera from primitives and add a behavior to the camera such that it follows the player
create a flashlight from primitives and add a behavior to the flashlight such that it emits a cone of light
create a slider from primitives and add a behavior to the slider such that it controls the volume of the audio source
create a dice from primitives and add a behavior to the dice such that it rolls randomly when clicked
create a fan from primitives and add a behavior to the fan such that it rotates and blows air
create a balloon from primitives and add a behavior to the balloon such that it floats and pops when touched
create a flower from primitives and add a behavior to the flower such that it grows and blooms over time
create a book from primitives and add a behavior to the book such that it opens and closes when clicked
create a water bottle from primitives and add a behavior to the water bottle such that it fills and empties when tilted
create a car from primitives and add a behavior to the car such that it moves and steers when the arrow keys are pressed
create a guitar from primitives and add a behavior to the guitar such that it plays different notes when the strings are plucked
create a telescope from primitives and add a behavior to the telescope such that it zooms in and out when the mouse wheel is scrolled
create a vending machine from primitives and add a behavior to the vending machine such that it dispenses different items when the buttons are pressed
create a calculator from primitives and add a behavior to the calculator such that it performs basic arithmetic operations when the keys are pressed
create a chess board from primitives and add a behavior to the chess board such that it allows two players to play chess
create a snowman from primitives and add a behavior to the snowman such that it melts when exposed to heat
create a toaster from primitives and add a behavior to the toaster such that it toasts bread when the lever is pushed down
create a radio from primitives and add a behavior to the radio such that it plays different stations when the dial is turned
create a microwave from primitives and add a behavior to the microwave such that it heats up food when the timer is set
create a fridge from primitives and add a behavior to the fridge such that it keeps the food inside cold and opens and closes when clicked
create a blender from primitives and add a behavior to the blender such that it blends the ingredients inside when the button is pressed
create a coffee maker from primitives and add a behavior to the coffee maker such that it brews coffee when the switch is flipped
create a lamp from primitives and add a behavior to the lamp such that it turns on and off when the switch is clicked
create a TV from primitives and add a behavior to the TV such that it displays different channels when the remote is used
create a phone from primitives and add a behavior to the phone such that it makes and receives calls when the buttons are pressed
create a keyboard from primitives and add a behavior to the keyboard such that it types letters when the keys are pressed
create a mouse from primitives and add a behavior to the mouse such that it moves the cursor when the mouse is moved
create a printer from primitives and add a behavior to the printer such that it prints the text on the screen when the button is pressed
create a scanner from primitives and add a behavior to the scanner such that it scans the image on the paper when the button is pressed
create a speaker from primitives and add a behavior to the speaker such that it plays the sound on the computer when the volume is adjusted
create a mailbox from primitives and add a behavior to the mailbox such that it opens and closes when the flag is raised and lowered
create a bicycle from primitives and add a behavior to the bicycle such that it moves and brakes when the pedals and the handle are used
create a skateboard from primitives and add a behavior to the skateboard such that it rolls and flips when the board and the wheels are used
create a roller coaster from primitives and add a behavior to the roller coaster such that it moves along the track and loops when the speed is controlled
create a ferris wheel from primitives and add a behavior to the ferris wheel such that it rotates and stops when the switch is used
create a merry-go-round from primitives and add a behavior to the merry-go-round such that it spins and plays music when the button is pressed
create a swing from primitives and add a behavior to the swing such that it swings back and forth when the rope is pulled
create a seesaw from primitives and add a behavior to the seesaw such that it balances and tilts when the weight is shifted
create a slide from primitives and add a behavior to the slide such that it slides down when the object is placed on top
create a sandbox from primitives and add a behavior to the sandbox such that it creates and destroys sandcastles when the shovel and the bucket are used
Create an iguana out of primitives and move it along a sinusoid trajectory
Create a tool that changes the color of whatever I right click
Create some stairs and drop a sphere from the top with physics so that it bounces down
Create a bird that follows the camera around as the camera moves
Simulate falling rain using many spheres
Create an office space with a desk, a lamp, a chair, and human that moves in circles around the desk
Create a frog that jumps every 2 seconds
Create a clock that comes quickly towards the camera view if the mouse is still for 40 seconds
Create a button that spawns a random animal when clicked
Create a cube that rotates and scales based on the mouse position
Create a script that prints "Hello World" to the console
Create a car that moves forward when I press W and turns when I press A or D
Create a flashlight that toggles on and off when I press F
Create a sphere that changes its material to match the color of the skybox
Create a terrain with hills and valleys
Create a painting tool that lets me draw on a canvas with different brushes and colors
Create a vending machine that dispenses a soda can when I insert a coin
Create a simple calculator that takes two numbers and an operator as input and displays the result
Create a flower that grows and blooms when I water it
Create a bouncing ball that changes color every time it hits the ground
Create a firework that explodes in the sky when I press space
Create a snowman that melts when I touch it
Create a maze with walls and a goal
Create a chess board with pieces that can move according to the rules
Create a flag that waves in the wind
Create a book that opens and closes when I click it
Create a dice that rolls and shows a random number when I throw it
Create a fan that spins and blows air when I turn it on
Create a sun that rises and sets according to the time of day
Create a balloon that inflates and deflates when I press a pump
Create a bridge that collapses when I walk on it
Create a camera that takes a picture when I press a button
Create a door that opens and closes when I approach it
Create a tree that grows leaves and fruits when I fertilize it
Create a rocket that launches and lands when I press a switch
Create a puzzle that reveals an image when I solve it
Create a map that shows my location and direction
Create a compass that points to the north
Create a stopwatch that starts and stops when I press a button
Create a slider that changes the volume of a sound
Create a toaster that pops out a toast when I press a lever
Create a microwave that heats up a food when I press a button
Create a radio that plays a station when I turn a knob
Create a telescope that zooms in and out when I scroll the mouse wheel
Create a skateboard that moves and flips when I press the arrow keys
Create a roller coaster that loops and twists when I press a button
Create a spaceship that flies and shoots lasers when I press the spacebar
Create a cat out of primitives that wanders around
Create a snake that slithers and eats apples
Create a guitar that plays a note when I strum a string

Listing 1: Prompts used in an empty scene

A.2 Bathroom Scene Prompts
Create a sphere on top of the bathtub
Change the toilet's color to gold
Make it so that when I click on the faucet, water flows out of it.
Turn my mouse into a texture stamp: when I right click on an object, its texture is captured. When I left click on another object, the captured texture is painted on the new object.
Modify the bathroom so that I can have nice bathing experience
Tell me what the biggest object in the bathroom is
Highlight the towels in the scene
Place objects in the room to make it more relaxing
Create a cat and put it on the floor
Create a whale that swims around in the bathroom
Change the color scheme of the room to be more relaxing
Allow me to click on the shower door and open it
Modify the trash to be twice as big
Make water flow out of the faucet
Create a button that lets me switch between day and night mode
Make the bathroom light dim when I enter the shower
Move the towel to a position right in front of me
Create a mirror effect on the window
Make the plant grow as time passes
Hide the towels
Tell me about the objects in this room
Create a soap dispenser and a soap bar on the vanity counter
Create a slider that lets me adjust the temperature of the water
Make the ceiling passage lead to another room
Create a mini game where I have to catch the towel before it falls on the floor
Make the mirror glass shatter when I click on it
Create a particle system that emits bubbles from the bathtub
Make the floor slippery when wet
Create a painting on the wall that changes every time I look at it
Create a spider that crawls on the ceiling
Make the toilet paper unroll when I drag it
Create a clock on the wall that shows the current time
Make the handwash dispense foam when I press a button
Create a hair dryer that blows hot air when I click on it
Make the flowerpot fall and break when I bump into it
Create a fog effect that fills the room when I turn on the shower
Create a rug that covers the floor
Make the window open and close when I double click on it
Make the plant wilt when I don't water it
Change the color of the showerhead to red
Modify the plants to change color every few seconds
Fill the bathtub with water
Create a toothbrush and place it on the counter
Put a music player next to the plants
Make the plants half as big
Add a UI slider that allows me to adjust the size of the bathtub
Create a sponge and place it on the shower panel
Create a trash bag and place it next to the dustbin
Create a tissue box and place it on the counter
Remove the glass door enclosure
create a mug inside the sink
turn off the shower light
throw away the handwash bottles in the trashcan
open the shower door
hang the towel on the bathtub
The user represented by the Camera GameObject uses a wheelchair. Are there any items that are too high, too low, or too far for the use to see or touch in the room? Important: Only consider the GameObjects in the Scene description.
How many tennis balls can fit in the room
Create a tool that picks up the color of the object that it touches and changes the color of the object that it is dragged over
Make the mirror reflect the image of the user
Create a button that toggles the water flow from the shower head and the faucet
Make the plant grow or shrink depending on the temperature of the room
Create a slider that controls the brightness of the lamp
Make the dustbin lid open and close
Create a digital clock that displays the current time on the UI canvas
Make the window glass breakable when hit by a forceful object
Create a soap dispenser that dispenses soap when the user presses the nozzle
Make the towel roll unroll
Create a spray bottle that sprays water when the user squeezes the trigger
Make the ceiling passage open and close when the user presses a hidden switch
Create a radio that plays music from a list of preset stations on the vanity counter
Make the flowerpot fall and shatter
Create a hair dryer that blows hot air
Make the toilet flush when the user presses the handle
Create a scale that measures the weight of the user when they step on it
Make the mirror glass foggy when the room is humid
Create a toothbrush that cleans the user's teeth when they move it over their mouth
Make the shower panel display the water temperature and pressure
Create a razor that shaves the user's beard when they drag it over their face
Make the handwash bottles refillable when the user places them under the faucet
Create a fan that cools the room when the user turns it on
Make the towel dry the user when they wrap it around themselves
Create a magnifying glass that zooms in on the object that it is held over
Make the lamp cloth catch fire when exposed to a flame
Create a perfume bottle that sprays fragrance when the user presses the cap
Create a sponge that absorbs water when it is dipped in the sink
Make the floor slippery when it is wet
Make the shower head rotate when the user twists it
Create a hair clip that attaches to the user's hair when they click on it
Make the mirror body rotate
Create a toothpaste tube that squeezes out toothpaste when the user presses it
Make the faucet and handle leak when they are damaged
Make the window open and close when the user clicks on it
Make the plant wilt
Create a nail clipper that cuts the user's nails when they squeeze it
Make the ceiling bathroom detachable when the user pulls on it
Create a tissue box that dispenses tissues when the user taps on it
Make the lamp metal shock the user when they touch it
Create a rubber duck in the bathtub
Make the mirror glass shatter when the user hits it with a hammer
Create a loofah that exfoliates the user's skin when they scrub it
Bring the jet spray right in front of the camera
Move the camera in front of the shower head
Make it seem like water comes out of the shower head
Add 3 spotlights of red color
Add a picture frame on the window
Clone the lamp and move it to new location
Make the mirror reflect the camera view
Make the dustbin fall over when the camera collides with it
Change the color of the towel to blue
Make the faucet and handle rotate when the camera clicks on them
Add a physics material to the floor that makes it slippery
Make the ceiling passage open and close when the camera presses a key
Add a particle system to the flowerpot that emits butterflies
Make the handwash dispense soap when the camera gets to it
Add a UI text that displays the current time on the mirror
Make the bathroom light flicker randomly
Add a collider to the ceiling that prevents the camera from going through it
Make the shower panel slide up and down when the camera drags it with the mouse
Add a sound effect to the toilet paper when the camera interacts with it
Make the towel roll unroll when the camera pulls it
Add a script to the lamp that makes it follow the camera
Make the mirror glass shatter when the camera shoots a raycast at it
Add a shader to the floor that makes it reflective
Make the plant grow and shrink when the camera scrolls the mouse wheel
Add a UI slider that controls the intensity of the shower light
Make the vanity counter have a marble texture
Make the passage light turn on and off when the camera enters and exits the bathroom
Add a script to the dustbin that makes it collect any objects that fall into it
Make the crome have a metallic effect
Make the window glass break when the camera applies a force to it
Add a script to the bathtub that makes it fill with water when the camera turns on the faucet
Make the towel01 have a cloth physics
Make the decoreplate spin when the camera hovers over it
Add a script to the mirror body that makes it change its shape when the camera presses a key
Make the enclosure glass foggy when the camera is inside the shower
Add a script to the lamp cloth that makes it change its color when the camera clicks on it
Make the floor have a random pattern of tiles
Make the shower head spray steam when the camera turns on the jetspray
Add a script to the plant that makes it sway when the camera blows into the microphone
Make the vanity side wall have a graffiti effect
Make the mirror glass have a crack effect when the camera hits it with an object
Add a script to the flowerpot that makes it attract or repel the butterflies when the camera presses a key
Make the handwash have a liquid physics
Make the bathroom wall have a wallpaper effect
Add a script to the towel that makes it dry the camera when it touches it
Make the window have a rain effect
Add a script to the shower panel that makes it change its temperature when the camera slides it
Make the lamp metal have a rust effect
add 3 lights to the theme and play them rhythmically
change the wall's color to yellow

Listing 2: Prompts used in the bathroom scene

A.3 Sequential Prompts
Create a car with a body, wheels, and doors; put four seats on the car; put a front wind shield, a ceiling, and a back cover on the car; put a spare tire on the back cover; program the car to be drivable with w,a,s,d
Make a torch out of primitives; Add a light to the torch and allow me to turn it on and off by clicking on the torch
Create water effect in the scene; create a burning cube that I can pick up; make it so that if I place the cube near the water, it stops burning
Create a cat that will attempt to catch any mouse it sees; create a mouse that moves randomly
Create a cube; create a UI slider that allows me to adjust the size of the cube
Create a sphere that serves as the nucleus of a hydrogen atom; create an electron density cloud that behaves stochastically
Make a physically realistic car that I can drive with w,a,s,d and jump with space; change the gravity to be closer to moon so that the car jumps much higher
Create a red cube and a green cube; change their colors to be more accessible for a red-green colorblind person
Create a trees out of primitives; turn my mouse into a flamethrower that can emit fire when I leftclick; make the cubes catch on fire if I use the flamethrower on them
Create two cubes that I can move around with my mouse; let cubes glue onto each other when they collide
Create a bouncing ball that changes color every time it hits the ground; make the ball leave a trail of its previous colors
Create a painting app that allows me to draw on a canvas with different brushes and colors; add an undo and redo button
Create a solar system with the sun and the planets; make the planets orbit the sun with realistic speeds and distances; add moons to some of the planets
Create a flock of birds that fly around the scene; make the birds avoid obstacles and follow a leader; make the leader change randomly every few seconds
Create a snowman out of spheres; add a hat, a scarf, a carrot nose, and buttons; make the snowman melt over time
Create a maze out of cubes; add a player that can move around the maze with arrow keys; add a goal that the player has to reach
Create a fireworks show that starts when I press space; use particle systems to create the fireworks; add different colors and shapes to the fireworks
Create a simulation of a pendulum that swings back and forth; use physics to calculate the motion of the pendulum; add a slider that lets me change the length and the mass of the pendulum
Create a cellular automaton that simulates the game of life; use a grid of cells that can be alive or dead; apply the rules of the game of life to update the cells; add a button that lets me pause and resume the simulation
Create a kaleidoscope that shows a symmetrical pattern of colors and shapes; use a camera that captures the scene; use a shader that applies the kaleidoscope effect to the camera
Create a dice out of a cube; add numbers to the dice; program the dice to roll when I click on it
Create a flower out of primitives; add petals, stem, and leaves to the flower; make the flower bloom when I hover over it with my mouse
Create a house out of primitives; add windows, doors, and a roof to the house; let me enter and exit the house with my mouse
Create a pendulum out of primitives; add a string and a weight to the pendulum; program the pendulum to swing back and forth
Create a Rubik's cube out of smaller cubes; add colors to the cubes; program the Rubik's cube to rotate when I drag it with my mouse
Create a spaceship out of primitives; add thrusters, wings, and a cockpit to the spaceship; program the spaceship to fly with w,a,s,d and space
Create a snake out of spheres; program the snake to move with the arrow keys; add food that the snake can eat; make the snake grow longer when it eats food
Create a basketball hoop out of primitives; add a net and a backboard to the hoop; create a basketball that I can throw with my mouse; program the basketball to bounce and roll
Create a candle out of primitives; add a wick and a flame to the candle; program the candle to burn down over time; make the flame flicker
Create a fan out of primitives; add blades and a base to the fan; program the fan to spin when I press a button; make the fan blow air
Create a submarine out of primitives; add a periscope, a propeller, and a hatch to the submarine; program the submarine to dive and surface with w and s; program the submarine to shoot torpedoes with space
Create a roller coaster out of primitives; add rails, carts, and seats to the roller coaster; program the roller coaster to move along the rails; let me ride the roller coaster with my mouse
Create a heart out of primitives; add veins and arteries to the heart; program the heart to beat and pump blood
Create a lamp out of primitives; add a bulb, a switch, and a cord to the lamp; program the lamp to turn on and off when I click on the switch
Create a teddy bear out of primitives; add fur, eyes, and a nose to the teddy bear; make the teddy bear hug me when I click on it
Create a map out of primitives; add continents, oceans, and borders to the map; let me zoom in and out of the map with my mouse; add labels to the map
Create a paintbrush that can draw on any surface; make it so that I can change the color and size of the brush by using a UI panel
Create a spaceship that can fly in any direction with w,a,s,d and rotate with q,e; add a laser cannon to the spaceship that can shoot projectiles with space; make it so that the projectiles explode on impact
Create a snowman out of spheres; add a hat, a scarf, a carrot nose, and buttons to the snowman; make it so that the snowman melts gradually over time
Create a clock that shows the current time; make it so that I can change the time zone and format of the clock by using a UI panel
create a cube; create a lever out of primitives; add appropriate joints and constraints to the lever; Use the lever to control the size of the cube
create a rotating cube; change the rotation speed to be slower
create an apple and a whale; duplicate the apple to have different sizes and with unique gameobject names; remove all apples that are smaller in size than the whale
create a room out of primitives with walls and floor only; shrink the size of the room by half and put the walls back together; add a door to the room; add a button to the wall that opens and closes the door
create a sphere; add a script to the sphere that makes it bounce on the floor; add gravity and physics to the scene; add a slider to the scene that controls the bounciness of the sphere
create a terrain; add some trees and rocks to the terrain; add a first-person controller to the scene; add a script to the controller that allows it to pick up and throw rocks
create a cylinder; add a script to the cylinder that makes it follow the mouse position; add a trail renderer to the cylinder; change the color and width of the trail
create a canvas; add a text element to the canvas; add a script to the text element that displays the current time; add a button to the canvas that toggles the visibility of the text element
create a plane; add a mesh collider to the plane; add a texture to the plane; add a script to the plane that changes the texture based on the mouse position
create a sphere; add a mesh renderer to the sphere; add a script to the sphere that changes the color of the mesh renderer based on the distance from the camera
create a cube; add a rigidbody to the cube; add a script to the cube that makes it explode into smaller cubes when clicked
create a camera; add a script to the camera that allows it to zoom in and out with the mouse wheel; add a script to the camera that allows it to take a screenshot with the space key
create a light; add a script to the light that makes it flicker randomly; add a script to the light that makes it change color based on 
the ambient temperature create a particle system ; add a script to the particle system that makes it emit fire particles ; add a script to the particle system that makes it follow the mouse position ; add a script to the particle system that makes it stop emitting when the mouse button is pressed create a cube ; add a script to the cube that makes it rotate around its own axis ; add a script to the cube that makes it rotate around another cube ; add a script to the cube that makes it stop rotating when the mouse button is pressed create a canvas ; add an image element to the canvas ; add a script to the image element that loads a random image from the internet ; add a script to the image element that makes it draggable and resizable create a sphere ; add a script to the sphere that makes it move along a predefined path ; add a script to the sphere that makes it change direction when it collides with another object ; add a script to the sphere that makes it emit a sound when it collides with another object create a plane ; add a script to the plane that makes it generate a random height map ; add a script to the plane that makes it apply the height map to its mesh ; add a script to the plane that makes it change the height map based on the mouse position create a cube ; add a script to the cube that makes it spawn a smaller cube every second ; add a script to the cube that makes it destroy the smaller cubes when they reach a certain number ; add a script to the cube that makes it change color based on the number of smaller cubes create a canvas ; add a input field element to the canvas ; add a script to the input field element that validates the input as a valid email address ; add a script to the input field element that displays a message when the input is valid or invalid Create a plant ; Move the camera in slow motion towards the plant and the stop ; turn off all the lights , create a spotlight that points at the plant Create a bathtub ; Make the bathtub float 
in space ; place a big rabbit under the bathtub ; make the sink pour water ; make the rabbit run when water touches it Create a dustbin and a sink ; Move the dustbin close to the sink ; create a toilet -paper 10 times ; make all toilet papers move around randomly ; make it look like there are infinite toilet papers raining Create a towel ; Make the towels roll back and forth ; make the towels jump if they are clicked on ; create plants 10 times ; move all plants to random new positions ; make it seem like the plants are watering each other Create a new camera and name it Camera2 ; place a human character in front of Camera2 ; add the view of Camera2 to be shown on the mirror ; make the humanoid wave when the mouse is over it ; make the humanoid follow the mouse movement Create a chessboard and chess pieces ; make the chess pieces move according to the rules of chess ; add a simple AI opponent that can play chess ; make the AI opponent move a piece every 5 seconds ; add a timer and a score board Create a bookshelf and a book ; make the bookshelf rotate slowly ; make the book fall from the bookshelf when clicked ; make the book open and show some text when it hits the ground ; make the text fade away after 3 seconds Create a ball and a basket ; make the ball bounce when it hits the ground ; make the ball change color randomly ; make the basket move left and right ; add a score counter that increases by one when the ball goes into the basket Create a guitar and a pick ; make the guitar strings vibrate when they are touched by the pick ; make the guitar produce different notes depending on which string is touched ; make the pick follow the mouse movement ; add a chord chart that shows the name of the chord being played Create a clock and a calendar ; make the clock show the current time ; make the calendar show the current date ; make the clock and the calendar change color depending on the time of day ; make the clock and the calendar disappear when clicked Create a 
car and a road ; make the car move forward along the road ; make the car steer left or right when the arrow keys are pressed ; make the car speed up or slow down when the up or down keys are pressed ; add some traffic lights and other cars on the road Create a pizza and a knife ; make the pizza look delicious ; make the knife cut the pizza into slices when dragged over it ; make the slices fall off the pizza when cut ; make the slices disappear when clicked Create a candle and a match ; make the candle stand on a table ; make the match light up when clicked ; make the match burn out after 5 seconds ; make the candle wick catch fire when touched by the match ; make the candle melt slowly Create a snowman and a hat ; make the snowman consist of three snowballs of different sizes ; make the hat sit on top of the snowman 's head ; make the snowman smile and have buttons for eyes and a carrot for a nose ; make the snowman wave when the mouse is over it Create a flower and a bee ; make the flower bloom when clicked ; make the bee fly around the flower ; make the bee collect pollen from the flower when it lands on it ; make the bee fly away to a hive when it has enough pollen Create a cube and a sphere ; make the cube and the sphere have different colors and textures ; make the cube and the sphere collide with each other when they are close ; make the cube and the sphere bounce off each other when they collide ; make the cube and the sphere spin when they bounce Create a bird and a tree ; make the bird fly in the sky ; make the tree have green leaves and brown branches ; make the bird land on a branch when it is tired ; make the bird sing when it is happy ; make the bird fly away when it is scared Create a map and a compass ; make the map show a terrain with mountains , rivers , and forests ; make the compass point to the north ; make the map and the compass move together when the mouse is dragged ; make the map and the compass zoom in or out when the mouse wheel is 
scrolled Create a pen and a paper ; make the pen write on the paper when the mouse is pressed ; make the pen write the text that is typed on the keyboard ; make the pen change color when the right mouse button is clicked ; make the paper crumple when the space bar is pressed Create a star and a planet ; make the star emit light and heat ; make the planet orbit around the star ; make the planet have a day and night cycle ; make the planet have clouds and oceans ; make the star explode when the planet is clicked Listing 3: Sequential prompts.Each individual prompt within a sequence is delimited by ";".

B PLANNER EXAMPLE: BUILDING A VIRTUAL KITCHEN
In this section, we take a classic example in VR-assisted design, building a virtual kitchen, and study the benefit of using the proposed Planner. Creating a kitchen from scratch is a task of considerable complexity: one has to conjure multiple objects of varying sizes and place them in the scene in a logical fashion. One approach to creating a kitchen in Unity with LLMs is to detail the specifications and have the model generate it in one shot. The result, shown in the bottom left of Figure 18, appears promising. However, upon closer inspection, we notice that the LLM has neglected several instructions and exhibits a few notable failure modes. First, the spatial placement of bigger objects seems reasonable (tables and chairs, fridge and counter, etc.), while the positioning of smaller objects is off. For instance, despite instructions to place multiple small appliances on the counter, only one is visible. Investigation revealed that the others were created but hidden inside the counter, indicating the LLM's difficulty in handling objects of vastly different sizes simultaneously. To address this, the Planner breaks down the complex task into subtasks, each dealing with objects of similar sizes, then uses the LLM to execute these instructions in order. This approach, as seen in the bottom right of Figure 18, leads to a more detailed and accurate arrangement, with appliances on the counter, salt and pepper shakers and fruits on the table, and stove tops on the oven.
Additionally, the original LLM ignored the request to make the sink faucet interactable, possibly because this request differs in nature from the other prompts. In contrast, the Planner-enhanced LLM successfully implemented this feature by attaching particle effects to the faucet, as evidenced by a shuriken-like pattern on the cylinder.
- Add a kitchen counter in the middle of the kitchen. Place a kettle, a coffee machine, a knife rack, and a toaster on the counter. Create the objects with primitives and give them appropriate colors in RGB.
- Place a fridge just to the left of the counter. Create the objects with primitives and give them appropriate colors in RGB.
- Place a sink just to the right of the counter. Use primitives and assign them appropriate colors in RGB.
- Add a faucet on top of the sink with two cylinders. Make the faucet functional, so that water flows out of it with particle effects when I click on it. Use primitives and assign them appropriate colors in RGB.
- Place a dishwater below the sink but above the ground. Create the objects with primitives and give them appropriate colors in RGB.
- Place an oven a few units in front of the counter. Add a towel rack in front of the oven and place a towel hanging from the towel rack. Also place four stove tops on top of the oven. Create the objects with primitives and give them appropriate colors in RGB.
- Place a cupboard above the counter. Create the objects with primitives and give them appropriate colors in RGB.
- Place a table a few units away on the opposite side of the room of the oven. Place four chairs around the table. Create the objects with primitives and give them appropriate colors in RGB.
- Add a fruit basket with fruits on the table. Add salt and pepper shakers on the table. Create the objects with primitives and give them appropriate colors in RGB.
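The in-order execution of Planner subtasks described in this section can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_plan`, `build_step`, and `fake_builder` are hypothetical names, the LLM call is replaced by a stub, and the "scene" is reduced to a list of strings standing in for a scene summary.

```python
from typing import Callable, List


def run_plan(steps: List[str], build_step: Callable[[str, List[str]], str]) -> List[str]:
    """Execute plan steps in order, passing what was built so far to each step."""
    scene: List[str] = []  # stand-in for the running scene summary (assumption)
    for step in steps:
        # Hypothetical LLM-backed call: receives the sub-prompt and a copy of the scene.
        created = build_step(step, list(scene))
        scene.append(created)
    return scene


def fake_builder(step: str, scene_so_far: List[str]) -> str:
    # Stub: a real builder would generate and run code; here we only record the step.
    return f"{len(scene_so_far) + 1}: {step}"


plan = [
    "Add a kitchen counter in the middle of the kitchen.",
    "Place a fridge just to the left of the counter.",
    "Place a sink just to the right of the counter.",
]
result = run_plan(plan, fake_builder)
```

Because each step sees the objects created by earlier steps, small items can be placed relative to the larger objects they sit on, which is the property the one-shot prompt lacked.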

C.2 Scene Analyzer
You will be given a JSON file with all white spaces and quotations removed, which contains all the game objects in a 3D Unity scene that may represent a particular object, widget, or even an entire 3D world. Your task is to interpret this JSON file and give a brief description of the scene.
# Guidelines
- Pay attention to the user request and only summarize the part of scene that is relevant for fulfilling the request. For example, suppose there are a lot of objects in the scene, among which is a bear. If the user only wants to change the color of the bear, only output information relevant to the bear.
- A child is usually a sub-component of its parent.
- The "Compiler" gameObject is a manager that contains the previously generated scripts.
- The "Main Camera" gameObject is me (the user). You don't need to include it in the description.
- If there is a child object, list the names only. Do not mention their relationship with the parent.
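The metaprompt above expects a scene hierarchy serialized as JSON "with all white spaces and quotations removed". A minimal sketch of such a compaction step is shown below; `compact_scene` and the example hierarchy are hypothetical, and the authors' actual serializer (implemented inside Unity) is not shown in the text.

```python
import json


def compact_scene(scene: dict) -> str:
    """Serialize a scene hierarchy, then drop every quote and whitespace character."""
    text = json.dumps(scene, separators=(",", ":"))
    # Note: this also removes spaces inside names, so "Main Camera" becomes MainCamera.
    return "".join(ch for ch in text if ch != '"' and not ch.isspace())


# Hypothetical scene hierarchy used only for illustration.
example = {"name": "Main Camera", "children": [{"name": "Bear"}]}
compacted = compact_scene(example)
```

The result is no longer valid JSON, which is acceptable here because it is read by a language model rather than a parser; the point is to spend fewer tokens of the fixed context window on punctuation.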

C.4 Builder
You are a code-writing assistant that is embedded in the Unity game engine equipped with a runtime C# compiler. I will ask you to generate content in 3D environments, and your task is to write C# code that will implement those requests. You can only respond in code that will compile in C#, and can only add other text in inline comments, like so: // comment
Do not put the code in a code block, just directly respond with it. That means it should not start with ```csharp using UnityEngine; but rather with just using UnityEngine;
You are attached to the GameObject "Compiler" in the scene that also has the runtime compiler attached to it. Your generated C# script is automatically attached to this object and immediately executed. Each of your output messages is compiled as a separate script. Make sure to make any elements you might want to reuse public.
# Guidelines you **must** follow
- Each C# script should be a complete and valid class that inherits from the Widgets class.
- Each C# script should have a variable called summary that briefly describes the purpose of the script. Write this summary as part of the Start() function.
- Do not apologize when the I point out a mistake and suggest a modification of the code you wrote. Only output the modified code.
- If you want to delete an existing script, use: Destroy(gameObject.GetComponent<script_name>()).
- If it is appropriate to respond to my request by displaying text on the screen, use: Utils.DisplayMessage(message).
- If you're using UI elements in your script, create them yourself as children of the "UI_Canvas" gameObject.
- Avoid using raw numbers in the code. Instead, create a public class variable with the desired value assigned to it.
- ALWAYS make class members public.
- Only write self-contained code. Do not leave anything for me to do.
- When defining more than one classes, they should all be contained within one overarching class.
- When the Library is available, put more considerations to the descriptions that guide you on how to write code that uses the relevant skills.

C.5 Inspector
You're a meticulous inspector who will evaluate a snippet of Unity C# code. The provided code will be automatically attached to the "Compiler" gameObject, and the Start() function for any classes defined will be executed. Note that "Compiler" contains scripts that are responsible for the behavior of objects in the scene, so it is okay for the code to try and locate the script on the "Compiler" instead of other gameObjects. You need to assess whether the provided code abides by the following set of guidelines.
# Guidelines
- The code assigns each created gameObject a unique name. Note that this doesn't mean the variable names are unique, but rather the gameObject names are unique. GameObjects are named in the following way: GameObject x = new GameObject([Name]), where [Name] is the unique name.
- The code does not add a component in the Start() function of the class with the same name, as that creates an infinite loop. For example, if the class name is Foo, DO NOT use AddComponent<Foo>() in the Start() function.
- The code works as-is without any additional actions from the Unity editor. For example, the code should not declare a public GameObject and expects it to be filled in from the editor screen. However, it is okay for the code to declare a public GameObject but assigns them values later through code. It is only an issue if there is a public variable declared but not assigned any values.
- The following components and functions are not used: NavMesh;
- The script always inherits from Widgets instead of MonoBehaviour. You can assume the Widgets class exists.
- The script always changes the "summary" variable describing its purpose. Assume it is already declared. It is not useless.
- The script shouldn't contain any code outside of the main class, except statements that start with "using". Ex: "using UnityEngine;" can exist outside of the main class.
- Make sure any user input gameobjects like hands and controllers have the right names. Always refer to the UserInput string.
- Make sure there are no placeholder functions.
- Unity tags are not used.
- Unity layers are not used.
# Notes
- You can assume "UI_Canvas" always exists in the scene.
- It is always acceptable to declare variables as public, even if they are assigned values in the code.
In addition, the code is allowed to use these methods from the Utils class
# Utils
You have access to a Utils class that supports various functionalities.
- AddPhysics(gameObject, mass): makes the object physically realistic by attaching colliders and rigidbody to it
- PlaceNextTo(base_obj, obj_to_place, offset): place obj_to_place next to base_obj in the direction of offset. The placement will be resolved such that the object's mesh bounding boxes do not overlap.
- DisplayMessage(message): Use this if it is appropriate to respond to the user's request by displaying text on the screen.
Where "Scene" is a brief summary of the existing objects in the Unity scene. "Library" is a relevant skill that is implemented and can be used in the code. "Code" contains the code you need to evaluate.
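In LLMR the Inspector is itself an LLM prompted with the guidelines above. To make a few of those guidelines concrete, the sketch below checks four of them with plain pattern matching. It is an illustrative assumption, not part of the described system: `check_guidelines` is a hypothetical helper, it only looks at the first class declaration, and it flags a self-referencing `AddComponent` anywhere in the script rather than only inside `Start()`.

```python
import re
from typing import List


def check_guidelines(code: str) -> List[str]:
    """Return human-readable violations for a few Inspector guidelines."""
    problems: List[str] = []
    # Find the first "class Name : Base" declaration in the C# source.
    class_match = re.search(r"class\s+(\w+)\s*:\s*(\w+)", code)
    if class_match is None:
        return ["no class with a base type found"]
    name, base = class_match.group(1), class_match.group(2)
    if base != "Widgets":
        problems.append(f"{name} inherits from {base}, not Widgets")
    # Over-approximation: flags AddComponent<Name> anywhere, not only in Start().
    if re.search(rf"AddComponent\s*<\s*{re.escape(name)}\s*>", code):
        problems.append(f"{name} adds itself as a component")
    if "NavMesh" in code:
        problems.append("NavMesh is used")
    if "summary" not in code:
        problems.append("summary is never set")
    return problems


good = 'using UnityEngine;\npublic class Spin : Widgets {\n  void Start() { summary = "Spins a cube"; }\n}'
bad = "public class Foo : MonoBehaviour {\n  void Start() { gameObject.AddComponent<Foo>(); }\n}"
good_report = check_guidelines(good)
bad_report = check_guidelines(bad)
```

Checks such as "works as-is without editor actions" or "no placeholder functions" need an understanding of intent that pattern matching cannot provide, which is one reason to delegate inspection to a language model.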

Figure 1: Examples of diverse use cases and functionalities enabled by the Large Language Model for Mixed Reality (LLMR) framework. A: Creation of a detailed kitchen scene from scratch using Unity primitives. B: Prompting and drawing objects into existence via multi-modal interactions. C: Integration with external plugins, like loading objects from Sketchfab to create high-fidelity scenes, and special skills like generating animations. D: Prompting edits of existing VR scenes, like changing the color of the objects. E: Automated generation of instructional guides and question answering about the scene. F: The framework is compatible across platforms and supports the integration of external sensor data.

Figure 2: Large Language Model for Mixed Reality (LLMR) architecture for real-time interactive 3D scene generation. Starting from the left, a user prompt and the existing 3D scene (Ω) are fed into the Planner (P) and Scene Analyzer (SA) modules, respectively. The Planner decomposes the user prompt into a sequence of sub-prompts, while the SA summarizes the current scene elements. These are then integrated with a Skill Library (SL) to guide the Builder (B) module, which generates the appropriate code. The Inspector (I) module iteratively checks the generated code for compilation and run-time errors. Upon receiving the green light from the Inspector, the code is compiled using the Roslyn Compiler and executed in the Unity Engine to produce the desired 3D scene and functionalities as specified by the user.

Figure 4: Scene Analyzer module. The virtual scene, depicted in the bottom-left corner, is converted into a parsed scene hierarchy in JSON format. This, along with the user request, serves as input to the Scene Analyzer. The output is a filtered, relevant summary of the scene, which is then used for conditioning subsequent modules like the Builder. The process optimizes the utilization of the language model's fixed context window and enhances focus on objects relevant to the user prompt.

Figure 5: Builder-Inspector paradigm in LLMR. The Builder module B generates code based on the user input and the current state. The generated code is then inspected by the Inspector module I for compilation and run-time errors. If errors are found, as indicated by the Inspector's verdict, the Inspector provides suggestions for corrections. The process iterates until either the code passes inspection or a maximum number of inspections is reached. This feedback loop significantly enhances the quality of the generated scripts.
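The feedback loop in this caption can be written down in a few lines. The sketch below is a minimal rendering under stated assumptions: `build_with_inspection`, `build`, `inspect`, and the default budget of 3 are hypothetical, and both modules are replaced by stubs instead of LLM calls.

```python
from typing import Callable, Optional, Tuple


def build_with_inspection(
    prompt: str,
    build: Callable[[str, Optional[str]], str],
    inspect: Callable[[str], Tuple[bool, str]],
    max_inspections: int = 3,
) -> Tuple[str, bool]:
    """Generate code, then revise it until it passes inspection or the budget runs out."""
    feedback: Optional[str] = None
    code = ""
    for _ in range(max_inspections):
        code = build(prompt, feedback)    # Builder: uses prior suggestions, if any
        passed, feedback = inspect(code)  # Inspector: verdict plus suggestions
        if passed:
            return code, True
    return code, False  # budget exhausted; caller decides what to do with the last draft


def stub_build(prompt: str, feedback: Optional[str]) -> str:
    # Stub Builder: the first draft is "v1"; any feedback yields "v2".
    return "v1" if feedback is None else "v2"


def stub_inspect(code: str) -> Tuple[bool, str]:
    # Stub Inspector: only the revised draft passes.
    return (code == "v2", "" if code == "v2" else "assign unique gameObject names")


accepted = build_with_inspection("create a cube", stub_build, stub_inspect)
rejected = build_with_inspection("create a cube", stub_build, stub_inspect, max_inspections=1)
```

Bounding the number of rounds keeps latency predictable for real-time use, at the cost of occasionally returning code that never passed inspection.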
as well as positive and negative examples, are stored separately. Once the relevant skills are identified, their detailed information and usage examples are fetched and passed on to the Builder for implementation.
ℎ ∼ (•| ) (Retrieve required skills, if any.)

Figure 8: Cross-Platform and Cross-Scene Transferability made possible by LLMR. The left panel shows a car automatically created by LLMR using Unity primitives, complete with color and composite features (e.g., wheels and headlights), controllable via keyboard inputs. The middle panel displays the same car transferred to a different Unity scene featuring moon-like gravity and terrain. The right panel showcases the framework's adaptability across platforms by illustrating how the car can collide with objects in the physical world and can be controlled using IMU data from a user's mobile phone.

Figure 12: Simulated Rescue Plan. The HoloLens 2 displays the automated generation of a simulation of a rescue plan using our framework. The guide shows an interactable 3D terrain, helicopter, and simulated wind, allowing rescue workers to visualize the flight path under different weather conditions.

Figure 14: Comparison for average compilation and run-time error rate. SA stands for the Scene Analyzer, SL stands for the Skill Library, and I stands for the Inspector. Overall, in both creating from scratch and editing existing scenes, LLMR outperforms GPT-4 by 3x in the case with few-shot prompting, and gives over 4x improvement compared to the performance of zero-shot GPT-4.

Figure 15: Performance Improvement of LLMR Modules Organized by Difficulty Level of Prompts. Comparing Error Rates of GPT-4 with Incremental LLMR Module Integration. The error rates for most methods increase with difficulty, but the LLMR method (in orange) maintains a consistently lower error rate than the others.

Figure 16: Results of the user study for both experienced and beginner users of Unity. Overall, the users found LLMR satisfactory and would recommend it to others.

Figure 17: Error rate for all architectures organized by level of prompt difficulty.

Figure 18: Creating a virtual kitchen without (left) and with (right) the proposed Planner. The colored boxes show the desired specifications for the kitchen (left) and its decomposition into smaller steps (right). From the snapshots, we see that the LLM can carry out the user request with significantly better attention to detail if we use a step-by-step plan.
the sphere twice as big
skill name: MeshSize, description: If the user requests the size of the object, this is file has the function to get the renderer bounds.
skill name: CreateAnimation, description: Creates an animation that fits the clip_name for object. You should not use the animation manager if you're moving simple objects like Unity primitives.
Instruction: animate the whale to flap its tail left and right, then play the clip
OUTPUT: CreateAnimation
INPUT:
skill name: MeshSize, description: If the user requests the size of the object, this is file has the function to get the renderer bounds.
skill name: CreateAnimation, description: Creates an animation that fits the clip_name for object. Important: You should not use the animation manager if you're moving simple objects like Unity primitives.
Instruction: Make the car in the scene move in response to w, a, s, d keys
OUTPUT: n/a
INPUT:
skill name: MeshSize, description: If the user requests the size of the object, this is file has the function to get the renderer bounds.
skill name: CreateAnimation, description: Creates an animation that fits the clip_name for object. You should not use the animation manager if you're moving simple objects like Unity primitives.
Instruction: Where is the top of the fridge
OUTPUT: MeshSize
Listing 6: Skill Library's metaprompt
Algorithm steps: take the top 5 candidates with the highest score; select the arg max over that shortlist; return it.
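The retrieval steps at the end of this listing (shortlist the top 5 candidates, then pick the arg max) can be sketched as a two-stage selection. Everything below is an assumption made for illustration: the text does not say which scoring functions are used, so cosine similarity over embedding vectors stands in for the first stage and a caller-supplied `rerank` function for the second; `retrieve_skill` and the toy vectors are hypothetical.

```python
import math
from typing import Callable, Dict, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_skill(
    query_vec: Sequence[float],
    skill_vecs: Dict[str, Sequence[float]],
    rerank: Callable[[str], float],
    k: int = 5,
) -> str:
    """Stage 1: keep the k most similar skills. Stage 2: return the best by rerank score."""
    shortlist = sorted(
        skill_vecs, key=lambda name: cosine(query_vec, skill_vecs[name]), reverse=True
    )[:k]
    if not shortlist:
        return "n/a"  # mirrors the metaprompt's answer when no skill applies
    return max(shortlist, key=rerank)


# Toy two-dimensional "embeddings" (assumption, for illustration only).
skills = {"MeshSize": [1.0, 0.0], "CreateAnimation": [0.0, 1.0]}
prefer_mesh = lambda name: 1.0 if name == "MeshSize" else 0.0
best = retrieve_skill([0.9, 0.1], skills, prefer_mesh)
only_nearest = retrieve_skill([0.1, 0.9], skills, prefer_mesh, k=1)
```

A cheap first stage keeps the number of skill descriptions shown to the more expensive second stage small, which matters when the library grows beyond what fits in one prompt.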

8.4.2 Results and Discussion. The numerical results, presented in Table 2, show that GPT-4's performance in sequential tasks improves significantly with the addition of each LLMR module. When all modules are integrated, LLMR surpasses the standard GPT-4 by approximately 2.5 times across all metrics, aligning with results from single prompt tests. Furthermore, LLMR's memory-efficient design maintains a constant context usage for arbitrary prompt sequences and thus removes token size limitations during prolonged sessions. As such, LLMR demonstrates promising performance in

Table 2: Numerical results for sequential prompts. The arrows next to the metrics point to their favored direction. For example, the down arrow next to error rate means a lower error rate should be preferred.

Table 4: Empty scene results averaged over 5 independent runs to reduce the effect of GPT-4's nondeterminism (GPT-4 is run with temperature 0). Note that the standard deviations are across the 5 runs, not the 150 examples within each run.
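The distinction in this caption, spread across runs rather than across examples, can be made concrete with a small calculation. The sketch below uses made-up data (3 runs of 4 examples rather than 5 runs of 150) and a hypothetical helper `summarize_runs`; each inner list holds per-example error flags for one run.

```python
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def summarize_runs(runs: Sequence[Sequence[int]]) -> Tuple[float, float]:
    """Average the error rate within each run, then take mean and std across runs."""
    per_run: List[float] = [mean(run) for run in runs]  # one error rate per run
    return mean(per_run), stdev(per_run)                # sample std over runs only


# Made-up error flags (1 = error, 0 = success), for illustration only.
toy_runs = [
    [1, 0, 0, 0],  # run 1: error rate 0.25
    [0, 0, 0, 0],  # run 2: error rate 0.00
    [1, 1, 0, 0],  # run 3: error rate 0.50
]
avg_error, run_std = summarize_runs(toy_runs)
```

Computed this way, the standard deviation reflects how much the model's output varies between repeated runs on the same prompts, not how much difficulty varies between prompts.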
Assistant: Alright. How about the table with chairs?
User: I want a fruit basket with fruits on the table. Salt and pepper shakers as well.
Assistant: Awesome. Lastly, anything special about the cupboard?
User: Nope.
Assistant: Okay. Here's the plan:
Plan:
- Create a kitchen with four walls and a floor. Create the objects with primitives and give them appropriate colors in RGB.
- Add a kitchen counter in the middle of the kitchen. Place a kettle, a coffee machine, a knife rack, and a toaster on the counter. Create the objects with primitives and give them appropriate colors in RGB.
- Place a fridge just to the left of the counter. Create the objects with primitives and give them appropriate colors in RGB.
- Place a sink just to the right of the counter. Use primitives and assign them appropriate colors in RGB.
- Add a faucet on top of the sink with two cylinders. Make the faucet functional, so that water flows out of it with particle effects when I click on it. Use primitives and assign them appropriate colors in RGB.
- Place a dishwater below the sink but above the ground. Create the objects with primitives and give them appropriate colors in RGB.
- Place an oven a few units in front of the counter. Add a towel rack in front of the oven and place a towel hanging from the towel rack. Also place four stove tops on top of the oven. Create the objects with primitives and give them appropriate colors in RGB.
- Place a cupboard above the counter. Create the objects with primitives and give them appropriate colors in RGB.
- Place a table a few units away on the opposite side of the room of the oven. Place four chairs around the table. Create the objects with primitives and give them appropriate colors in RGB.
- Add a fruit basket with fruits on the table. Add salt and pepper shakers on the table. Create the objects with primitives and give them appropriate colors in RGB.
- Create a kitchen with four walls and a floor. Create the objects with primitives and give them appropriate colors in RGB.
You have access to a Utils class that supports various functionalities.
- AddPhysics(gameObject, mass): makes the object physically realistic by attaching colliders and rigidbody to it
- DisplayMessage(message): Use this if it is appropriate to respond to the user's request by displaying text on the screen.
foreach (GameObject food in foodItems) {
    Bounds bounds = MeshSize.GetRenderBounds(food);
    float size = bounds.size.x * bounds.size.y * bounds.size.z;