ChatTwin: Toward Automated Digital Twin Generation for Data Center via Large Language Models

Digital twins have been applied in various industrial fields to represent physical systems. However, designing high-fidelity digital scenes is challenging because it often requires intensive manual work and domain expertise to edit the 3D models or description documents. To reduce human effort, this paper proposes ChatTwin, a conversational system that leverages GPT-4 to automate the generation of scene description documents for digital twins. ChatTwin assists scene generation by i) segmenting user-input prompts, ii) generating scenes with the segmented prompts, and iii) optimizing the generated content. Specifically, the Segment-and-Generate (SG) workflow decomposes the long-text generation into several subtasks and reduces the complexity of the original task. The evaluation on our data center digital twin system shows that ChatTwin outperforms the baselines in terms of generation accuracy and efficiency.


INTRODUCTION
A digital twin is a virtual model that represents an object or system throughout its lifecycle [4]. Built on physical data and prior knowledge, the digital model assists predictive analysis and decision making. These advantages have spawned various applications that create digital replicas of physical assets in industrial fields, such as data centers [15], manufacturing [9], and smart grids [6]. Existing industrial digital twins are built from textual description documents. The Azure Digital Twins Definition Language (DTDL) [10] uses a JSON-LD-based language to define contents such as properties, telemetry, and commands. The Universal Scene Description (USD) in Omniverse [11] provides a rich set of features for large-scale 3D content creation and collaboration. These description documents provide a standard and interpretable way to define digital twins. However, constructing a digital twin is an arduous process. Generating a reliable description file often requires several domain experts to manually operate the 3D models or edit the template files. As the scale of the designed system increases, as in industry-grade data centers with hundreds to thousands of pieces of equipment, manually editing the file becomes labor-intensive and error-prone. Therefore, novel methods to automate scene generation would be highly desirable.
Pretrained large language models (LLMs) have shown remarkable performance in a wide range of tasks [16]. A potential application of LLMs is to generate the hierarchical description document from textual prompts. However, directly applying LLMs to generate scene description documents for a digital twin faces two challenges. First, the length of the description document is proportional to the complexity of the target scene. The description file may consist of thousands of lines for a system with multiple components. Generating such a long text in a one-shot manner is difficult and inefficient with pretrained LLMs. Second, pretrained LLMs used for domain-specific tasks are prone to hallucination. Without verification, the generated contents may look correct but be physically implausible.
To address the above challenges, we present ChatTwin, which uses Generative Pretrained Transformer 4 (GPT-4) [12] to generate comprehensive scene description documents designed for the data center digital twin. ChatTwin features two important designs for generating more reliable results. First, we introduce a Segment-and-Generate (SG) workflow that decomposes the long-text generation into a series of subtasks and thus reduces the likelihood of generating erroneous or inconsistent scene descriptions. Second, we solve an integer programming problem to optimize the imperfect file based on several predefined rules. This post-process eliminates factual errors and improves the overall quality of the generated documents. We conduct experiments on our data center digital twin to evaluate the accuracy and efficiency of the SG workflow compared with baseline prompting methods. Experimental results show that ChatTwin improves the generation success rate by 55 and 23 percentage points compared with zero-shot and few-shot prompting, respectively. In terms of efficiency, the SG workflow also surpasses the two basic prompting methods by generating 45% and 12.5% more tokens per second.
The contributions of ChatTwin are summarized as follows: 1) We propose the SG workflow and develop ChatTwin for long-text description document generation; 2) We implement ChatTwin and demonstrate the ability of LLM technologies to empower the generation of digital twin description files.

RELATED WORK
Text-to-3D has become an active research field and has spawned various applications in digital twins. Set-the-Scene [3] develops an agent-based training framework to fill the gap between controllable text and 3D synthesis. Text2Room [8] generates immersive textured 3D meshes for room-scale environments from given text prompts. Unlike 3D object generation, the contents of a digital twin are predefined in a fixed structure and set with parameters for simulation purposes. Therefore, text-to-text technologies that generate scene description documents (e.g., DTDL, USD) are more suitable for generating digital twins.

CHATTWIN OVERVIEW
This paper focuses on the data center as the core application scenario of the ChatTwin design. A data center consists of components such as the room, air conditioner units (ACUs), racks, and servers. Figure 1 shows the overall workflow of ChatTwin, where the left part illustrates the document generation and the right part shows the post-process of document optimization. To address the challenges of long-text generation in LLMs, we introduce the SG workflow to empower automatic description file generation.
Segment. Instead of outputting all characters of a stringified JSON at once, the SG first cuts the natural language input $I$ into several parts denoted as $\mathcal{T} = \{t_1, t_2, t_3, \ldots, t_i, \ldots, t_n\}$, where $\mathcal{T}$ is the set of all subtasks requested by the user and $t_i$ represents each subtask. In the scene of the data center digital twin, the facilities are always in a nested structure. For example, the racks are placed in a room and the servers are installed in a rack slot. We denote the facilities as $F = \{f_1, f_2, f_3, \ldots, f_n\}$, where each facility $f_i$ is associated with its parent facility $\hat{f}_i$ and the directed edge $e_i = (\hat{f}_i, f_i)$. Their nested tree structure can be predefined by a hierarchical representation $G = \{F, E\}$, where $E = \{e_1, e_2, e_3, \ldots, e_i, \ldots, e_n\}$. Then, $t_i$ can be extracted from the original input $I$ by its corresponding node facility $f_i$ and its parent node facility $\hat{f}_i$ as $t_i = \phi(f_i, \hat{f}_i, I)$, where $t_i$ is also written in natural language. We continue to break down each subtask $t_i$ into units $u_i = \{u_{i,1}, u_{i,2}, \ldots, u_{i,k}\}$ and obtain the unit set $\mathcal{U} = \{u_1, u_2, \ldots, u_n\}$ over all $n$ subtasks. The segmented prompts reduce the complexity of the original task and are expected to improve the generation quality and efficiency.
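The segmentation step can be illustrated with a minimal sketch. It assumes the facility hierarchy is stored as a child-to-parent map and that the facility counts have already been extracted from the input; in ChatTwin the extraction is driven by GPT-4, which this sketch does not reproduce. The names `PARENT`, `Subtask`, and `segment` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Assumed hierarchy: each facility type maps to its parent type (None = root).
PARENT: Dict[str, Optional[str]] = {
    "room": None,
    "acu": "room",
    "rack": "room",
    "server": "rack",
}


@dataclass(frozen=True)
class Subtask:
    facility: str            # node facility type
    parent: Optional[str]    # parent facility type
    units: Tuple[str, ...]   # one natural-language unit per requested instance


def segment(counts: Dict[str, int]) -> List[Subtask]:
    """Turn facility counts into subtasks, each split into single-instance units."""
    subtasks = []
    for facility, parent in PARENT.items():
        n = counts.get(facility, 0)
        if n == 0:
            continue  # nothing requested for this facility type
        where = f" in one {parent}" if parent else ""
        units = tuple(f"one {facility}{where}" for _ in range(n))
        subtasks.append(Subtask(facility, parent, units))
    return subtasks


if __name__ == "__main__":
    for task in segment({"room": 1, "acu": 4, "rack": 10, "server": 2}):
        print(task.facility, task.parent, len(task.units))
```

Each unit describes exactly one instance, so every later generation call stays short regardless of how large the requested scene is.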
Generate. From the above task segmentation, we obtain a dictionary of subtasks $\mathcal{T}$ with each facility as the key and its description as the value. For example, for an original input "a data center with ten racks", the segmented subtasks are identified with the "rack" key associated with the value "a data center with one rack" repeated ten times. Using the parent-child relationships, we solve the tree-structured subtasks $\mathcal{T}$ in depth-first search (DFS) order. A prompt $p_{i,j}$ for each unit $u_{i,j}$ of subtask $t_i$ yields the corresponding string segment $s_{i,j}$. With predefined prompts for each type of facility, we can generate the segments concurrently using a thread pool. We then combine these segments into the JSON segment $j_i$ for the subtask $t_i$. After solving all subtasks, we obtain a set of JSON segments $\mathcal{J} = \{j_1, j_2, j_3, \ldots, j_i, \ldots, j_n\}$, which are concatenated according to the predefined file structure.
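A minimal sketch of the generation step follows: subtasks are visited in DFS order and the units of each subtask are sent to a thread pool. The function `fake_llm` is a placeholder standing in for the GPT-4 call, and `CHILDREN`, `dfs_order`, and `generate` are hypothetical names; the paper does not describe its prompt templates or pool configuration.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

# Assumed hierarchy: parent facility type -> child facility types.
CHILDREN: Dict[str, List[str]] = {
    "room": ["acu", "rack"],
    "acu": [],
    "rack": ["server"],
    "server": [],
}


def dfs_order(root: str) -> List[str]:
    """Return facility types in depth-first (pre-order) order."""
    order = [root]
    for child in CHILDREN[root]:
        order.extend(dfs_order(child))
    return order


def fake_llm(prompt: str) -> str:
    """Placeholder for the LLM call: returns one JSON string segment."""
    return json.dumps({"prompt": prompt, "position": [0, 0, 0]})


def generate(units_by_facility: Dict[str, List[str]],
             max_workers: int = 8) -> Dict[str, List[dict]]:
    """Generate one JSON segment per unit; units of a subtask run concurrently."""
    segments: Dict[str, List[dict]] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for facility in dfs_order("room"):
            units = units_by_facility.get(facility, [])
            # Executor.map preserves input order, so segments line up with units.
            segments[facility] = [json.loads(s) for s in pool.map(fake_llm, units)]
    return segments


if __name__ == "__main__":
    result = generate({"rack": ["a data center with one rack"] * 10})
    print({k: len(v) for k, v in result.items()})
```

Threads suit this workload because each unit spends most of its time waiting on a network response rather than computing.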
Optimization. Through the above SG process, we obtain a scene description document with predefined facility models in a hierarchical structure that fits the original input. The document is still imperfect, however: the detailed values are always unreasonably set to zero or a random value due to the poor mathematical reasoning ability of the language model [5]. To obtain a reasonable digital scene, we optimize the geometric values with several predefined rules.
Given the nested structure of the scene description file, the proposed method can be extended to the generation of other building-related scenes (e.g., data center buildings) with a proper description document template and pre-designed rules.

IMPLEMENTATION OF CHATTWIN
We apply ChatTwin to a self-developed data center digital twin. It can build a digital twin of a real-world data center hall that hosts ACUs and racks containing multiple servers. We first introduce the structure of the scene description document and then the heuristic approach used to optimize the imperfect file.

Document Generation
Similar to existing digital twin systems, the digital twin of a data center is predefined with a textual file termed the dcfile. The structure of the dcfile is illustrated in Figure 2; it uses three levels to represent all facilities in the data center. Specifically, the top level defines only one item, i.e., the "Room" object, which refers to the data hall. The "Room" has several children, i.e., "Racks", "ACUs", "Sensors", etc., which are defined at the second level. In this paper, we focus on the locations of racks and ACUs for optimization in the ChatTwin design. The third level defines the "Servers" object that is accommodated under the "Racks". We first generate the dcfile that satisfies the predefined structure and the NL-task requirements, i.e., the number of desired facilities to generate. Then, with the generated file, we obtain the geometry values for each facility and the data hall by solving a domain-specific optimization problem.
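To make the three-level structure concrete, the sketch below builds a dictionary with the object names quoted above ("Room", "Racks", "ACUs", "Sensors", "Servers"). All other field names (`id`, `position`, `width`, `length`) and the helper names are assumptions for illustration; the actual dcfile schema is not given in the text.

```python
import json
from typing import Dict


def make_dcfile(n_racks: int, n_acus: int, servers_per_rack: int) -> dict:
    """Build a three-level scene description with placeholder geometry."""

    def rack(i: int) -> dict:
        return {
            "id": f"rack-{i}",
            "position": [0, 0],  # placeholder, to be set by the optimization step
            "Servers": [  # level 3: servers nested under their rack
                {"id": f"rack-{i}-server-{j}"} for j in range(servers_per_rack)
            ],
        }

    return {
        "Room": {  # level 1: the single data hall
            "width": 0,
            "length": 0,
            # level 2: children of the room
            "Racks": [rack(i) for i in range(n_racks)],
            "ACUs": [{"id": f"acu-{i}", "position": [0, 0]} for i in range(n_acus)],
            "Sensors": [],
        }
    }


def count_facilities(dcfile: dict) -> Dict[str, int]:
    """Count racks, ACUs, and servers, e.g. to check a generated file."""
    room = dcfile["Room"]
    return {
        "racks": len(room["Racks"]),
        "acus": len(room["ACUs"]),
        "servers": sum(len(r["Servers"]) for r in room["Racks"]),
    }


if __name__ == "__main__":
    doc = make_dcfile(n_racks=10, n_acus=4, servers_per_rack=2)
    print(count_facilities(doc))
    print(len(json.dumps(doc)), "characters")
```

A counting helper of this kind is one plausible way to check whether a generated file contains the requested facilities.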

Heuristic Optimization
ChatTwin post-processes the imperfect dcfile by heuristic-based integer programming. In this paper, we consider a typical rectangular design of the data hall with width and length denoted by $R_{\mathrm{w}}$ and $R_{\mathrm{l}}$.
The bottom area of each hosted facility is denoted as $a_i$. We assume a hall room of a data center should be optimized to a maximum utilization denoted as $\eta = \frac{\sum_{i=1}^{N} a_i}{R_{\mathrm{w}} \times R_{\mathrm{l}}}$, where $N$ is the total number of facilities. Then, we denote the total number of racks as $N_{\mathrm{rack}}$ and the total number of ACUs as $N_{\mathrm{acu}}$, where we assume there is only one type of rack and one type of ACU. Given that each facility has four dimensions to determine its location, the number of variables grows with the number of facilities and makes the optimization problem highly complex. The feasible region is too broad if the locations of every rack and ACU are treated as variables. Because the problem is NP-complete [1], we solve it heuristically under the following rules, which act as constraints for the above problem.
Rule 1: We set the racks in columns with the same gap $g_{\mathrm{rack}}$ and use slots to determine the positions of the racks. The rack slot layout is shown in the left part of Figure 3. The layout of the rack slots is set by Eq. (1), where $x$ represents the number of racks in a column, $y$ represents the number of rows, and $z$ represents the number of racks in the extra column. Since $x$, $y$, and $z$ decide the layout of the racks, they are regarded as the variables of the optimization problem.
Rule 2: Racks should be placed in the central area as in Figure 3, while ACUs should be evenly arranged around the group of racks with gap $g_{\mathrm{acu}}$. The margin $m$ of the area of racks and the padding $p$ of the data hall should be adjusted by the operator's preferences. The data hall room width $R_{\mathrm{w}}$ and length $R_{\mathrm{l}}$ are calculated from the rack's width $w_{\mathrm{rack}}$ and length $l_{\mathrm{rack}}$ and the ACU's width $w_{\mathrm{acu}}$, as given in Eq. (2).
Rule 3: We ensure there is no facility placed out of the hall. The hall is rectangular, without any extra pillars in the room. Considering the number of ACUs $N_{\mathrm{acu}}$, we have constraints on the room width $R_{\mathrm{w}}$ and length $R_{\mathrm{l}}$ of Eq. (2), respectively, where $k = \lceil N_{\mathrm{acu}} / 4 \rceil$ denotes the largest possible number of ACUs on one side.
From the three rules, we formulate an integer programming problem. Since the facility areas $a_i$ are fixed, maximizing $\eta$ is equivalent to minimizing the room area:
$$\min_{x, y, z \in \mathbb{Z}^{+}} \; R_{\mathrm{w}} \times R_{\mathrm{l}} \quad \text{subject to (1) and the constraints from Rules 2 and 3.}$$
We apply grid search to solve the optimization since the dimension of this problem is only three. In other words, the time complexity of the algorithm is $O(n^3)$, with $n$ the number of candidate values per variable. The thermal effects are influenced by the hyperparameters, including the gap between racks ($g_{\mathrm{rack}}$), the gap between ACUs ($g_{\mathrm{acu}}$), the margin ($m$), and the padding ($p$).
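The grid search over the three layout variables can be sketched as follows. The exact forms of the slot rule and the room-size equations are not reproduced in this text, so the geometry below is an assumption made purely for illustration: the slot rule is taken as `x * y + z == n_rack` with `z < x`, racks in a column stand side by side, and ACUs occupy a border strip around the rack area. All default dimensions (in metres) and the function name `layout` are hypothetical.

```python
import math
from itertools import product
from typing import Optional, Tuple


def layout(n_rack: int, n_acu: int,
           rack_w: float = 0.6, rack_l: float = 1.2, acu_w: float = 1.0,
           gap_rack: float = 1.2, gap_acu: float = 0.5,
           margin: float = 1.0, padding: float = 0.5
           ) -> Optional[Tuple[float, int, int, int, float, float]]:
    """Grid-search (x, y, z) minimizing room area under ASSUMED geometry.

    Returns (area, x, y, z, width, length), or None if nothing is feasible.
    """
    k = math.ceil(n_acu / 4)  # largest number of ACUs on one side
    acu_span = k * acu_w + max(k - 1, 0) * gap_acu
    border = 2 * (margin + padding + acu_w)  # assumed strip on both sides
    best = None
    candidates = product(range(1, n_rack + 1), range(1, n_rack + 1), range(n_rack))
    for x, y, z in candidates:
        # Assumed slot rule: y full columns of x racks plus z extra racks.
        if x * y + z != n_rack or z >= x:
            continue
        cols = y + (1 if z else 0)
        width = cols * rack_l + (cols - 1) * gap_rack + border
        length = x * rack_w + border
        # Assumed Rule 3: the ACUs on one side must fit along that side.
        if width < acu_span or length < acu_span:
            continue
        area = width * length
        if best is None or area < best[0]:
            best = (area, x, y, z, width, length)
    return best


if __name__ == "__main__":
    print(layout(n_rack=50, n_acu=16))
```

The three nested ranges make the cubic cost of the search explicit; for the rack counts considered here this is inexpensive compared with the LLM calls.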

EVALUATION AND RESULTS
This section evaluates the effectiveness of ChatTwin and compares its performance with two baselines.
Baselines. For scene description document generation, we compare our solution with two common prompting baselines: 1) Zero-shot prompting: we provide the whole template for generating description documents without any additional hints or examples. 2) Few-shot prompting: besides providing the whole template of the scene description document, we provide several examples for different cases.
Implementations. We use GPT-4 through the API provided by OpenAI as the backbone language model. The same settings are applied to all baselines and our approach, i.e., zero-shot, few-shot, and SG prompting. We first create 100 data hall descriptions in natural language without labeling the corresponding dcfile. Then, we compare the SG with the two baseline prompting methods using the following metrics.
Metrics. In this paper, we report the success rate and generation efficiency across 100 descriptions as the metrics. The descriptions only focus on the rack, ACU, servers, and room in three levels. An example description is "I want a data hall room with four ACUs and ten racks. Each rack contains two servers". The success rate is defined as the ratio of generated files with correct facilities. The generation efficiency is denoted by $e = n_{\mathrm{token}} / \tau$, where $\tau$ is the makespan and $n_{\mathrm{token}}$ is the generated token length measured by TikToken [13].
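The two metrics reduce to simple ratios, sketched below. The `Run` record and function names are hypothetical, and the token counts are passed in as plain integers; in practice they could be obtained with the tiktoken package, for instance via `tiktoken.encoding_for_model("gpt-4").encode(text)`, which is an assumption about tooling rather than a detail stated in the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Run:
    correct: bool      # generated file contains the requested facilities
    n_tokens: int      # token length of the generated file
    makespan_s: float  # wall-clock time of the generation, in seconds


def success_rate(runs: List[Run]) -> float:
    """Fraction of runs whose generated file has the correct facilities."""
    return sum(r.correct for r in runs) / len(runs)


def tokens_per_second(run: Run) -> float:
    """Generation efficiency of one run: token length divided by makespan."""
    return run.n_tokens / run.makespan_s


def mean_efficiency(runs: List[Run]) -> float:
    """Average tokens per second over all runs."""
    return sum(tokens_per_second(r) for r in runs) / len(runs)


if __name__ == "__main__":
    runs = [Run(True, 1000, 10.0), Run(False, 500, 10.0), Run(True, 300, 2.0)]
    print(success_rate(runs), mean_efficiency(runs))
```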
Results. We next evaluate the ability of ChatTwin to generate the scene description document in the aforementioned digital twin system.
According to Table 1, the SG workflow achieves the highest success rate of 87%, compared with only 32% for zero-shot and 64% for few-shot prompting. The SG also generates longer and more complete files than the baselines, producing 124% and 70% more tokens on average, respectively. Figure 4 compares the efficiency of the SG workflow with the zero-shot and few-shot prompting methods. Some samples generated using the few-shot method outperform the SG method in efficiency. This is because the additional time required for the segment process is redundant when generating short textual documents. With the help of the thread pool in ChatTwin, task units can be executed concurrently, which brings 45.0% and 12.5% speedups compared to the zero-shot and few-shot methods, respectively. ChatTwin thus mitigates the overhead of the segmenting process in SG and generates the scene description file with a much higher success rate and completeness.
We also evaluate the efficiency of generating large digital twin scenes. Since zero-shot and few-shot prompting are already ineffective for small data halls, we only evaluate SG here. We gradually increase the number of racks from 10 to 50 and the number of servers accommodated by each rack from 5 to 20 in the description. The result, shown in Figure 5, indicates that the average makespan becomes longer as the numbers of racks and servers increase. The SG handles scenarios with $N_{\mathrm{rack}} \le 30$ well, since $N_{\mathrm{server}}$ has little effect on the makespan there. When the user wants to generate a large industrial scene, SG also performs well for $N_{\mathrm{rack}} > 30$. The maximum makespan is still under 350 seconds, which is taken by the task of building the scene of a huge data hall with 16 ACUs and 50 racks containing 1,000 servers in total. Such a scale captures the largest data hall rooms in current industry-grade data centers [14]. The corresponding dcfile consists of 289,127 characters.

CONCLUSION AND FUTURE WORK
In this paper, we design ChatTwin to efficiently generate description files and reveal the ability of LLMs to model and generate data center scenes for the digital twin. We found that there are many repetitive segments among units at the same level, as shown in Figure 1. To fix the scale issue and accelerate the generation process in the SG workflow, we will employ the concept of the program-aided language model (PAL) [7] in our future work. In ChatTwin, the heuristic optimization has four hyperparameters, which must be manually tuned for thermal constraints. In future work, the optimization objectives will incorporate both thermal and geometry factors based on the given parameters of each facility. The optimization module of ChatTwin will also be made compatible with the scale of a real-world data center that consists of multiple data halls, chiller plants, and other functional rooms. Besides, the performance of ChatTwin depends largely on the GPT-4 API, which is updated opaquely. Some researchers claim that the performance of GPT-4 declined from March to June 2023 [2]. To avoid network and performance fluctuations, we plan to fine-tune LLaMa-2 for our downstream tasks.

Figure 1: The workflow of ChatTwin in steps. 1) Segmentation: We process the natural language (NL) task into graph-based sub-tasks and later decompose them into units. 2) Generation: Following the instructions of each sub-task, we generate an imperfect file. 3) Optimization: We heuristically post-process it by rules.

Figure 3: (left) The designed layout for racks in rows and columns. (right) The recommended layout derived from a large number of real data hall room designs.

Figure 4: The efficiency of generating scene documents by zero-shot prompting, few-shot prompting, and SG.

Figure 5: The average makespan of generating various numbers of racks and inside servers by SG.
"I need a data hall room with one ACU and eight racks. Each rack contains three servers."

Table 1: The success rates and the average token lengths of generated files of zero-shot, few-shot, and SG.