COFS: COntrollable Furniture layout Synthesis

Realistic, scalable, and controllable generation of furniture layouts is essential for many applications in virtual reality, augmented reality, game development and synthetic data generation. The most successful current methods tackle this problem as a sequence generation problem which imposes a specific ordering on the elements of the layout, making it hard to exert fine-grained control over the attributes of a generated scene. Existing methods provide control through object-level conditioning, or scene completion, where generation can be conditioned on an arbitrary subset of furniture objects. However, attribute-level conditioning, where generation can be conditioned on an arbitrary subset of object attributes, is not supported. We propose COFS, a method to generate furniture layouts that enables fine-grained control through attribute-level conditioning. For example, COFS allows specifying only the scale and type of objects that should be placed in the scene and the generator chooses their positions and orientations; or the position that should be occupied by objects can be specified and the generator chooses their type, scale, orientation, etc. Our results show both qualitatively and quantitatively that we significantly outperform existing methods on attribute-level conditioning.


Introduction
Automatic generation of realistic assets enables content creation at a scale that is not possible with traditional manual workflows. It is driven by the growing demand for virtual assets in the creative industries, virtual worlds, and increasingly data-hungry deep model training. 3D scene and layout generation plays a central role in automatic asset generation, as much of the demand is for the types of real-world scenes we see and interact with every day, such as building interiors.
Deep generative models for assets like images, videos, 3D shapes, and 3D scenes have come a long way to meet this demand. In the context of 3D scene and layout modeling, auto-regressive models based on transformers in particular enjoy great success. Inspired by language modeling, these architectures treat layouts as sequences of tokens and are particularly well suited for modeling spatial relationships between elements of a layout. For example, Para et al. [24] generate two-dimensional interior layouts with two transformers, one for furniture objects and one for spatial constraints between these objects, while SceneFormer [37] extends interior layout generation to 3D.
A main limitation of these approaches is that they do not support scene completion from arbitrary partial scenes, due to their need for a consistent sequence ordering. In bedroom layouts, for example, the bed always needs to be generated before the nightstands, which precludes completing scenes that already have nightstands but are missing a bed. ATISS [26], which is the most recent layout generation approach, tackles this problem by randomly permuting the token sequence during training, enabling scene completion from arbitrary subsets of objects.
While ATISS works well, we aim to improve on these results to enable more fine-grained conditioning. We want to keep the advantage of conditioning on arbitrary subsets of objects, but we also want to extend to conditioning on arbitrary subsets of attributes. For example, a user might want to ask for a room with a table and two chairs, without specifying exactly where these objects should be located. Another example is to perform object queries for given geometry attributes. The user could specify the location of an object and query the most likely class, orientation, and size of an object at the given location. Our model thereby extends the baseline ATISS with new functionality while retaining all its existing properties and performance.
The main technical difficulty in achieving a more fine-grained conditioning is due to the autoregressive nature of the generative model. Tokens in the sequence that define a scene are generated iteratively, and each step only has information about the previously generated tokens. Thus, the condition can only be given at the start of the sequence, otherwise some generation steps will miss some of the conditioning information. The main idea of our work is to allow for fine-grained conditioning using two mechanisms: (i) Like ATISS, we train our generator to be approximately permutation-invariant and provide the condition as a partial sequence that needs to be completed by the generative model. Unlike previous work, the condition is not restricted to the start of the sequence, which means that some tokens do not have full information about the condition through the autoregressive generator alone. (ii) To give our autoregressive model knowledge of the entire conditioning information in each step, we additionally use a transformer encoder that provides cross-attention over the complete conditioning information in each step. These two mechanisms allow us to accurately condition on arbitrary subsets of the token sequence, for example, only on tokens corresponding to specific object attributes.
In our experiments, we demonstrate four applications: (i) outlier detection, (ii) unconditional generation, (iii) traditional scene completion from a partial set of objects, and (iv) fine-grained conditioning on a subset of object attributes. We compare to three current state-of-the-art layout generation methods [30,37,26] and show performance that is on par or superior, while also enabling fine-grained conditioning, which, to the best of our knowledge, is currently not supported by any existing layout generation method.

Related Work
We discuss recent work that we draw inspiration from. In particular, we build on previous work in Indoor Scene Synthesis, Masked Language Models, and Set Transformers.
Indoor Scene Synthesis: Before the rise of deep-learning methods, indoor scene synthesis methods relied on layout guidelines developed by skilled interior designers, and an optimization strategy that maximizes adherence to those guidelines [39,10,38]. Such optimization is usually based on sampling methods like simulated annealing, MCMC, or rjMCMC. Deep-learning-based methods, e.g., [36,30,37,26], are substantially faster and can better capture the variability of the design space.
The state-of-the-art methods among them are autoregressive in nature. Among these, PlanIT and FastSynth operate on a top-down view of a partially generated scene and autoregressively generate the rest of the scene. FastSynth uses separate CNNs and MLPs to create probability distributions over location, size, orientation, and category. PlanIT, on the other hand, generates graphs where nodes are objects and edges are constraints on those objects. A scene is then instantiated by solving a constraint satisfaction problem (CSP) on that graph.
Recent methods, SceneFormer [37] and ATISS [26], use transformer-based architectures to sidestep the problem of rendering a partial scene, which makes PlanIT and FastSynth slow. This is because using a transformer allows the model to accumulate information from previously generated objects using the attention mechanism. SceneFormer flattens the scene into a structured sequence of the object attributes, where the objects are ordered lexicographically in terms of their position. It then trains a separate model for each of the attributes. ATISS breaks the requirement of using a specific order by training on all possible permutations of the object order and removing the position encoding. In addition, it uses a single transformer model for all attributes and relies on different decoding heads, which makes it substantially faster than other models while also having significantly fewer parameters.
Masked Language Models: Masked Language Models (MLMs) like BERT [8], RoBERTa [20], and BART [19] have been very successful in pre-training for language models. These models are pretrained on large amounts of unlabeled data in an unsupervised fashion, and are then fine-tuned on a much smaller labeled dataset. These fine-tuned models show impressive performance on their corresponding downstream tasks. However, the generative capability of these models has not been much explored, except by Wang et al. [35], who use a Gibbs-sampling approach to sample from a pre-trained BERT model. Follow-up work by Mansimov et al. [22] proposes more general sampling approaches. However, the sample quality is still inferior to autoregressive models like GPT-2 [28] and GPT-3 [1]. More recently, MLMs have received renewed interest, especially in the context of image generation [14,3]. MaskGit [3] shows that with a carefully designed masking schedule, high-quality image samples can be generated from MLMs with parallel sampling, which makes them much faster than autoregressive models. Edi-BERT [14] shows that the BERT masking objective can be successfully used with a VQGAN [9] representation of an image to perform high-quality image editing. Our model most closely resembles BART when used as a generative model.
Set Transformers: Zaheer et al. [40] introduced a framework called DeepSets, providing a mathematical foundation for networks operating on set-structured data. A key insight is that operations in the network need to be permutation invariant. Methods based on such a formulation were extremely successful, especially in the context of point-cloud processing [4,29]. Transformer models without any form of positional encoding are permutation invariant by design. Yet, almost all the groundbreaking works in transformers use some form of positional encoding, as in object detection [2], language generation [28,1], and image generation [3]. One of the early attempts to use a truly permutation-invariant set transformer was Set Transformer [18], whose authors methodically designed principled operations that are permutation invariant but could only achieve respectable performance on toy problems. However, recent work based on [18] shows impressive performance in 3D object detection [5], 3D pose estimation [33], and SFM [23].

Method
Our goal is to design a generative model of object layouts that allows for fine-grained conditioning on individual object attributes. Fine-grained conditioning enables more flexible partial scene specification, for example specifying only the number and types of objects in a scene, but not their positions, or exploring suggestions for plausible objects at given positions in the layout.
Generative model. We use a transformer-based generative model, as these types of generative models have shown great performance in the current state of the art. Originally proposed as generative models for language, transformer-based generative models represent layouts as a sequence of tokens $S = (s_1, \dots, s_n)$ that are generated auto-regressively; one token is generated at a time, based on all previously generated tokens:

$$p(s_i \mid S_{<i}) = f_\theta(S_{<i}), \tag{1}$$

where $p(s_i \mid S_{<i})$ is the probability distribution over the value of token $s_i$, computed by the generative model $f_\theta$ given the previously generated tokens $S_{<i} = (s_1, \dots, s_{i-1})$. We sample from $p(s_i \mid S_{<i})$ to obtain the token $s_i$. Each token represents one attribute of an object, and groups of adjacent tokens correspond to objects. More details on the layout representation are described in Section 3.1.
Conditioning. To condition a transformer-based generative model on a partial sequence $C$, we can replace tokens of $S$ with the corresponding tokens of $C$, giving us the sequence $S^C$. This is done after each generation step, so that the probability for the token in each step is conditioned on $S^C_{<i}$ instead of $S_{<i}$:

$$p(s_i \mid S^C_{<i}) = f_\theta(S^C_{<i}). \tag{2}$$

Each generated token $s_i$ in $S^C$ (i.e., each token that is not replaced by a token in $C$) needs to have knowledge of the full condition during its generation step, otherwise the generated value may be incompatible with some part of the condition. Therefore, since each generated token $s_i$ only has information about the partial sequence $S^C_{<i}$, the condition can only be given as the start of the sequence:

$$s^C_i = \begin{cases} c_i & \text{if } i \le |C|, \\ s_i & \text{otherwise.} \end{cases} \tag{3}$$

Conditioning without permutation invariance. Typically both the objects and the attributes of the objects in the sequence are consistently ordered according to some strategy, for example based on a raster order of the object positions [24], or on the object size [36]. A generative model $f^{\text{ordered}}_\theta$ that is only trained to generate sequences in that order therefore cannot handle different orderings, so that in general:

$$f^{\text{ordered}}_\theta(S_{<i}) \neq f^{\text{ordered}}_\theta(\pi_o(S_{<i})), \tag{4}$$

where $\pi_o$ is a random permutation of the objects in sequence $S_{<i}$. The consistent ordering improves the performance of the generative model, but also presents a challenge for conditioning: the consistent sequence ordering limits the information that can appear in the condition. In a bedroom layout, for example, if beds are always generated before nightstands in the consistent ordering, the layout can never be conditioned on nightstands only, as this would preclude the following tokens from containing a bed.
Permutation invariance for more general conditioning. Recent work [26] tackles this issue by forgoing the consistent object ordering, and instead training the generator to be approximately invariant to permutations $\pi_o$ of objects in the sequence:

$$f_\theta(S_{<i}) \approx f_\theta(\pi_o(S_{<i})). \tag{5}$$

This makes generation more difficult, but allows conditioning on arbitrary subsets of objects, as now arbitrary objects can appear at the start of the sequence. However, since only objects are permuted and not their attributes, it does not allow conditioning on subsets of object attributes. Permuting object attributes to appear at arbitrary positions in the sequence is not a good solution to obtain a more fine-grained conditioning, as this would make it very hard for the generator to determine which attribute corresponds to which object.
Fine-grained conditioning. We propose to extend previous work to allow for fine-grained conditioning by using two different conditioning mechanisms, in addition to the approximate object permutation invariance. First, similar to previous work, we provide the condition as a partial sequence $C$. To allow conditioning on only a subset of the object attributes, we introduce special mask tokens $M$ in $C$ that denote tokens that are not constrained by $C$. The constrained sequence $S^C$ is then defined as:

$$s^C_i = \begin{cases} c_i & \text{if } c_i \neq M, \\ s_i & \text{otherwise.} \end{cases} \tag{6}$$

Second, to provide information about the full condition to each generated token, including those that replace mask tokens, we modify $f_\theta$ to use a transformer encoder $g_\phi$ that encodes the condition $C$ into a set of feature vectors that each generated token has access to:

$$p(s_i \mid S^C_{<i}, C) = f_\theta(S^C_{<i}, C^g), \quad \text{with } C^g = g_\phi(C), \tag{7}$$

where $C^g$ is the output of the encoder, a set of encoded condition tokens. We use a standard transformer encoder-decoder setup [34] for $f_\theta$ and $g_\phi$; implementation details are provided in Section 3.2, and the complete architecture is described in detail in the appendix.
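The mask-based definition of the constrained sequence can be illustrated with a minimal Python sketch. It assumes tokens are plain Python values and uses a sentinel object named `MASK` for the mask token; both are hypothetical choices made for illustration and are not taken from the authors' implementation.

```python
# Hypothetical sentinel standing in for the mask token M.
MASK = object()

def constrain(generated, condition):
    """Return S^C: condition tokens where given, generated tokens elsewhere."""
    if len(generated) != len(condition):
        raise ValueError("sequences must have equal length")
    return [s if c is MASK else c for s, c in zip(generated, condition)]

# The condition fixes the class and the size of one object and leaves
# its position and orientation unconstrained.
condition = ["bed", MASK, MASK, MASK, 2.0, 0.5, 1.6, MASK]
generated = ["sofa", 0.1, 0.0, -0.3, 1.0, 1.0, 1.0, 1.57]
constrained = constrain(generated, condition)
```

In this toy example the generated class and size are overwritten by the condition, while the generated position and orientation are kept.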

Layout Representation
Parameters. We focus on 3D layouts in our experiments. A 3D layout $L = (I, B)$ is composed of two elements: a top-down representation of the layout boundary $I$, such as the walls of a room, and a set of $k$ three-dimensional oriented bounding boxes $B = \{B_i\}_{i=1}^{k}$ of the objects in the layout. The boundary is given as a binary raster image and each bounding box is represented by four attributes: $B_i = (\tau_i, t_i, e_i, r_i)$, representing the object class, center position, size, and orientation, respectively. The orientation is a rotation about the up-axis, giving a total of 8 scalar values per bounding box. This setup is the same as in ATISS.
Layout sequence. A layout sequence $S$ is defined by randomizing the order of the bounding boxes, concatenating their parameters, and adding special start and stop tokens SOS and EOS to mark the start and the end of a sequence: $S = [\text{SOS}; B_{\pi_1}; \dots; B_{\pi_k}; \text{EOS}]$, where $\pi$ is a permutation of the object indices $1, \dots, k$ and $[;]$ denotes concatenation. Note that the objects are permuted in $S$, while the parameters of each object have a consistent order. The condition $C$ for a layout can be any sub-sequence $S_{<i}$ of $S$, where some tokens may be replaced by mask tokens $M$, leaving them unconstrained. During training and sampling, these tokens are replaced by generated tokens.
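As a rough sketch of this sequence construction, the following Python snippet flattens a list of bounding boxes into a token sequence. The string stand-ins for SOS and EOS, the tuple layout of a box, and the function name are assumptions made for illustration.

```python
import random

SOS, EOS = "SOS", "EOS"  # hypothetical stand-ins for the special tokens

def layout_to_sequence(boxes, rng):
    """Flatten boxes (class, position, size, angle) into one token sequence.

    Objects are shuffled; attributes inside each object keep a fixed order.
    """
    order = list(range(len(boxes)))
    rng.shuffle(order)
    sequence = [SOS]
    for index in order:
        cls, position, size, angle = boxes[index]
        sequence.append(cls)
        sequence.extend(position)  # 3 scalars
        sequence.extend(size)      # 3 scalars
        sequence.append(angle)     # rotation about the up-axis
    sequence.append(EOS)
    return sequence

boxes = [
    ("bed", (0.0, 0.0, 1.0), (2.0, 0.5, 1.6), 0.0),
    ("nightstand", (1.2, 0.0, 1.0), (0.4, 0.5, 0.4), 0.0),
]
sequence = layout_to_sequence(boxes, random.Random(0))
```

Each object contributes 8 tokens, so a layout with two objects yields 18 tokens including the start and stop tokens.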
Parameter probability distributions. The generative model outputs a probability distribution over one scalar component of the bounding box parameters in each step. Similar to previous work [26,32], we represent probability distributions over continuous parameters, like the center position, size, and orientation, as a mixture of $T$ logistic distributions. Probability distributions over the discrete object class $\tau$ are represented as vectors of logits $l^\tau$ over discrete choices that can be converted to probabilities with the softmax function:

$$p(b) = \sum_{j=1}^{T} \alpha_j \, \text{Logistic}(\mu_j, \sigma_j), \qquad p(\tau) = \text{softmax}(l^\tau), \tag{8}$$

where $b$ is a component of $t_i$, $e_i$, or $r_i$, and $\alpha$, $\mu$, $\sigma$ are, respectively, the mixture weight, mean, and variance of the component logistic distributions. Each probability distribution over a continuous scalar component is parameterized by a $3T$-dimensional vector, and probability distributions over the object class are represented as $n_\tau$-dimensional vectors, where $n_\tau$ is the number of object classes.
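To make the two kinds of distributions concrete, the sketch below evaluates and samples a mixture of logistic distributions and converts class logits into probabilities. It is a toy illustration in pure Python, not the authors' code; in particular, it treats the third parameter of each component as the scale of the logistic distribution, which is an assumption, and all function names are hypothetical.

```python
import math
import random

def logistic_pdf(x, mu, scale):
    """Density of a logistic distribution with location mu and scale > 0."""
    z = (x - mu) / scale
    return 1.0 / (4.0 * scale * math.cosh(z / 2.0) ** 2)

def mixture_pdf(x, weights, means, scales):
    """Density of a mixture of logistic components; weights must sum to 1."""
    return sum(w * logistic_pdf(x, m, s)
               for w, m, s in zip(weights, means, scales))

def sample_mixture(weights, means, scales, rng):
    """Pick a component by weight, then sample it via the inverse CDF."""
    j = rng.choices(range(len(weights)), weights=weights)[0]
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep log() finite
    return means[j] + scales[j] * math.log(u / (1.0 - u))

def softmax(logits):
    """Convert class logits into probabilities."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

weights, means, scales = [0.7, 0.3], [-1.0, 2.0], [0.2, 0.5]
density_at_zero = mixture_pdf(0.0, weights, means, scales)
position = sample_mixture(weights, means, scales, random.Random(0))
class_probs = softmax([2.0, 0.5, -1.0])
```

A mixture with several components can represent multi-modal placements, for example a nightstand on either side of a bed, which a single logistic component could not capture.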

Implementation
Condition encoder $g_\phi$: To encode the condition $C$ into a set of encoded condition tokens $C^g$, we use a Transformer encoder with full bidirectional attention. As positional encoding, we provide two additional sequences: object index tokens $O_i$ provide for each token the object index in the permuted sequence of objects; and relative position tokens $R_i$ provide for each token the element index inside the attribute tuple of an object. Since the attribute tuples are consistently ordered, the index can be used to identify the attribute type of a token. These sequences are used as additional inputs to the encoder. The encoder architecture is based on BART [19]; details are provided in the appendix.
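The two index sequences can be sketched as follows. This simplified illustration assumes 8 tokens per object and ignores the special tokens and the boundary feature; the function and constant names are hypothetical.

```python
ATTRS_PER_OBJECT = 8  # class + 3 position + 3 size + 1 orientation

def index_tokens(num_objects):
    """Return (object index, relative position) for every attribute token."""
    object_index = []
    relative_position = []
    for obj in range(num_objects):
        for attr in range(ATTRS_PER_OBJECT):
            object_index.append(obj)        # which object the token belongs to
            relative_position.append(attr)  # which attribute slot it occupies
    return object_index, relative_position

object_index, relative_position = index_tokens(2)
```

Because the attribute order inside an object is fixed, a relative position of 0 always marks a class token and a relative position of 7 always marks an orientation token.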
Boundary encoder $g^I_\psi$: To allow conditioning on the layout boundary $I$, we prepend a feature vector encoding of the boundary to the input of the condition encoder, as shown in Figure 1. Similar to ATISS, we use a ResNet-18 [12] that is not pretrained to encode a top-down view of the layout boundary into an embedding vector.
Generative model $f_\theta$: The generative model is implemented as a Transformer decoder with a causal attention mask. Each block of the decoder performs cross-attention over the encoded condition tokens $C^g$. As positional encoding, we provide absolute position tokens $P$, which provide for each token the absolute position in the sequence $S$. This sequence is used as additional input to the generative model. The output of the generative model in each step is one of the parametric probability distributions described in Eq. 8. Since the probability distributions for discrete and continuous values have different numbers of parameters, we use a different final linear layer in the generative model for continuous and discrete parameters. Similar to the encoder, the architecture of the generative model is based on BART [19].
[Figure caption fragment: ... with the desired value of the attribute in the sequence. Sampling is performed autoregressively. (Right): After a token has been sampled, both the encoder and decoder sequences are updated, and autoregressive generation continues.]
Training: During training, we create a ground truth sequence $S^{GT}$ with randomly permuted objects. We generate the condition $C$ as a copy of $S^{GT}$ and mask out a random percentage of the tokens by replacing them with the mask token $M$. The boundary encoder $g^I_\psi$, the condition encoder $g_\phi$, and the generative model $f_\theta$ are then trained jointly, with the task to generate the full sequence $S^{GT}$. For unmasked tokens in $C$, this is a copy task from $C$ to the output sequence $S$. For masked tokens, this is a scene completion task. We use the negative log-likelihood loss between the predicted probabilities $p(s_i)$ and ground truth values $s^{GT}_i$ for tokens corresponding to continuous parameters, and the cross-entropy loss for the object category $\tau$. The model is trained with teacher-forcing.
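The construction of a training condition can be sketched as below. The text only states that a random percentage of tokens is masked; drawing that percentage uniformly and choosing the masked positions uniformly without replacement are assumptions of this sketch, as are the names used.

```python
import random

MASK = "[MASK]"  # hypothetical string stand-in for the mask token

def make_condition(ground_truth, mask_fraction, rng):
    """Copy the ground truth and replace a fraction of its tokens by MASK."""
    n = len(ground_truth)
    num_masked = round(mask_fraction * n)
    masked_positions = set(rng.sample(range(n), num_masked))
    return [MASK if i in masked_positions else token
            for i, token in enumerate(ground_truth)]

rng = random.Random(0)
ground_truth = ["bed", 0.0, 0.0, 1.0, 2.0, 0.5, 1.6, 0.0]
half_masked = make_condition(ground_truth, 0.5, rng)
# Assumed choice: a fresh masking percentage is drawn for every sample.
random_masked = make_condition(ground_truth, rng.random(), rng)
```

A fraction of 1 corresponds to the fully unconstrained setting, and a fraction of 0 turns the task into a pure copy task.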
Sampling: We generate a sequence auto-regressively, one token at a time, by sampling the probability distribution predicted by the generative model (as defined in Eq. 7) in each step. We use the same model for both conditional and unconditional generation. For unconditional generation, we start with a condition $C$ where all tokens are mask tokens $M$. To provide more complete information about the partially generated scene to the encoded condition tokens $C^g$, we update the condition $C$ after each generation step by replacing mask tokens with the generated tokens. Empirically, we observed that this improves generation performance. An illustration of this approach is shown in Fig. 2. The function SAMPLE samples from the probability distributions described in Eq. 8.
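The sampling loop with condition updates can be sketched in a few lines. The callable `predict` is a hypothetical stand-in for encoding the condition, running the decoder, and sampling one token; the sketch also assumes a fixed sequence length instead of stopping at an end token.

```python
MASK = None  # hypothetical stand-in for the mask token

def generate(condition, predict):
    """Fill every masked position of the condition, one token at a time.

    predict(prefix, condition) stands in for the encoder, the decoder,
    and the sampling step; it sees the prefix and the full condition.
    """
    condition = list(condition)
    sequence = []
    for i in range(len(condition)):
        if condition[i] is MASK:
            token = predict(sequence, condition)
            condition[i] = token  # later steps see the generated token
        else:
            token = condition[i]  # constrained tokens are copied
        sequence.append(token)
    return sequence

# Toy predictor that returns the current prefix length.
result = generate([MASK, 5, MASK], lambda prefix, cond: len(prefix))
```

Passing the full condition to every step is what lets a token early in the sequence respect a constraint that appears later in the sequence.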
Once a layout has been generated, we can populate the bounding boxes with objects from the dataset. For each bounding box, we pick the object of the given category $\tau$ that best matches the size of the bounding box. In the supplementary, we will present an ablation of the tokens $O_i$, $R_i$, and $P$ that we add to the condition encoder and generative model.
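A minimal sketch of this retrieval step is shown below. The text does not specify how the match between sizes is measured; the squared difference of the three box dimensions used here is an assumption, and the catalog format and names are hypothetical.

```python
def pick_object(category, box_size, catalog):
    """Return the catalog entry of the category closest in size to the box."""
    candidates = [obj for obj in catalog if obj["category"] == category]
    if not candidates:
        return None

    def size_error(obj):
        # Assumed metric: squared difference of the three box dimensions.
        return sum((a - b) ** 2 for a, b in zip(obj["size"], box_size))

    return min(candidates, key=size_error)

catalog = [
    {"name": "bed_small", "category": "bed", "size": (1.0, 0.5, 2.0)},
    {"name": "bed_large", "category": "bed", "size": (1.8, 0.5, 2.1)},
    {"name": "chair_a", "category": "chair", "size": (0.5, 0.9, 0.5)},
]
chosen = pick_object("bed", (1.7, 0.5, 2.0), catalog)
```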

Experimental Setup
Datasets: We train and evaluate our model on the 3D-FRONT dataset. It consists of about 10k indoor scenes created by professional designers. We follow the ATISS preprocessing, which removes a few problematic scenes that have intersections between objects, mislabeled objects, or extremely large or small dimensions. For further details on the preprocessing, we refer the reader to ATISS [26]. We train on the following classes of rooms: BEDROOM, LIBRARY, DINING, and LIVING. The preprocessing yields approximately 6k, 0.6k, 3k, and 2.6k scenes for BEDROOM, LIBRARY, DINING, and LIVING, respectively. These classes have substantially different numbers of layouts and numbers of objects per layout.
Table 2: Comparison on Unconditional Generation: We provide floorplan boundaries from the Ground Truth as an input to the methods and compare the quality of the generated scenes. We measure the CAS at a resolution of 256 × 256. We retrain the ATISS model and report the new numbers. The retrained model is called ATISS*. The KL-Divergence follows from [30] and is the categorical KL-divergence between the distribution of classes in the GT and the layouts synthesised by our method. Columns: CAS ×10^2 (↓), KL-Divergence ×10
Hyperparameters: We implement our models in PyTorch [27] 1.7.0. We use standard transformer blocks for the encoder and decoder, except that the ReLU activation is replaced with GELU [13].
We use 4 encoder layers and 4 decoder layers, with a hidden dimension of 256 and 4 attention heads, yielding a query vector of dimension 64. We use a batch size of 128 sequences and train on a single NVIDIA A100 GPU with the AdamW [21] optimizer, which we found to be more stable than Adam [16]. We use a weight decay of 0.001 and clip the gradient norm to a maximum of 30. We found that the networks begin to overfit very early, especially for classes other than BEDROOM, because of the scarcity of data. Thus, for training networks on other classes, we pre-train on the BEDROOM class, and then reuse those weights as initialization. We do not use any form of learning rate scheduling, as our experiments did not suggest significant performance gains. We train for 1000 epochs and use early stopping.
Baselines: Our main comparison is with respect to ATISS, as it is the only other method that considers the layout generation problem in a permutation-invariant set-generation setting. We show that our model is competitive with ATISS and outperforms it on certain metrics. Moreover, we subsume all functionality of ATISS, including partial scene completion and outlier detection. Unlike ATISS, our model allows for fine-grained conditioning on arbitrary object parameters. ATISS does not provide pretrained models, hence we train their models using the official code and match their training settings as closely as possible. Further, we compare quantitatively to the previous state-of-the-art methods FastSynth [30] and SceneFormer [37].
Metrics: Our metrics mostly derive from [30,26]. Following [30], we report the KL-divergence between the distribution of the classes of generated objects and the distribution of classes of the objects in the test set. We further report the Classification Accuracy Score (CAS) [26]. We render the populated layout from a top-down view using an orthographic camera at a resolution of 256 × 256.
We report the FID computed between these rendered top-down images of sampled layouts and the renders of the ground truth layouts.

Applications
In this section, we discuss the application of our model to multiple interactive tasks. We show that our formulation a) can perform the same tasks as existing models and b) additionally enables fine-grained conditioning on arbitrary sequence subsets, like individual object parameters.
Scene Completion: In order to perform scene completion, we prepend both S and C with the tokens of the objects that already exist in the scene.Examples are shown in Figure 4. We can see that our method successfully generates plausible room layouts that respect the condition shown in the top row.
Outlier Detection: To estimate the likelihood of each token, we follow [31] and replace the token at the $i$-th position with [MASK]. This can be performed in parallel by creating a batch in which each sequence has exactly one token replaced with [MASK]. This is shown in Fig. 1. The likelihood of one bounding box is then the product of the likelihoods of all elements of the bounding box. Objects with low likelihood can then be resampled. Based on user constraints, only one parameter may be changed, as we allow for arbitrary conditioning as explained in the next paragraph.
Arbitrary Conditioning: To condition on an arbitrary subset of object attributes, we can construct a condition $C$ that is filled with [MASK] tokens, i.e., a fully unconstrained setting, and then only replace the tokens in $C$ corresponding to the attributes we want to constrain with constraint values. The non-mask tokens do not need to be contiguous in $C$. Any subset of $C$, for example only object class attributes or only object locations, can be used as unmasked condition. Given such a constraint $C$, the generator proceeds as described in Section 3.2 and fills in all mask tokens with generated values. The encoder allows each generation step to have full knowledge of the condition $C$. See Fig. 6 for examples. To the best of our knowledge, ours is the first model that allows for this form of conditioning without imposing an order on the generation of parameters.
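Both procedures in this section can be sketched with plain Python lists. The snippet assumes 8 attribute tokens per object and no special tokens; `toy_likelihood` is a hypothetical stand-in for the trained model, and the attribute names are invented for illustration. Log-likelihoods are summed per object, which is equivalent to taking the product of the likelihoods.

```python
import math

MASK = "[MASK]"
ATTRS = ("class", "x", "y", "z", "sx", "sy", "sz", "angle")  # hypothetical names

def masked_copies(tokens):
    """One copy of the sequence per position, with only that position masked."""
    copies = []
    for i in range(len(tokens)):
        row = list(tokens)
        row[i] = MASK
        copies.append(row)
    return copies

def object_log_likelihoods(tokens, token_likelihood):
    """Sum token log-likelihoods per object (log of the likelihood product)."""
    copies = masked_copies(tokens)
    scores = []
    for start in range(0, len(tokens), len(ATTRS)):
        score = 0.0
        for i in range(start, start + len(ATTRS)):
            score += math.log(token_likelihood(copies[i], i, tokens[i]))
        scores.append(score)
    return scores

def build_condition(num_objects, constraints):
    """Start fully masked, then fill in only the constrained attributes."""
    condition = [MASK] * (num_objects * len(ATTRS))
    for (obj, attr), value in constraints.items():
        condition[obj * len(ATTRS) + ATTRS.index(attr)] = value
    return condition

# Toy stand-in for the model: token 3 is implausible, all others are likely.
def toy_likelihood(masked_tokens, index, token):
    return 0.01 if index == 3 else 0.5

scores = object_log_likelihoods(list(range(16)), toy_likelihood)
condition = build_condition(2, {(0, "class"): "table", (1, "class"): "chair"})
```

In the toy run the first object receives the lower score and would be the candidate for resampling, while `condition` asks for a table and a chair without fixing any geometry.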

Conclusions
We proposed a new framework to produce layouts with auto-regressive transformers with arbitrary conditioning information.While previous work was only able to condition on a set of complete objects, we extend this functionality and also allow for conditioning on individual attributes of objects.
Our framework thereby enables several new modeling applications that cannot be achieved by any published framework.
Societal Impact. We do not expect any negative social impact that is directly linked to our work. We do acknowledge possible concerns about the energy consumption of training auto-regressive transformers.
Limitations and Future Work. Our work inherits some limitations of auto-regressive models (or even generative models in general) in that we require long training times and that setting up the code is laborious, time-consuming, and error-prone. In addition, it requires significant expert knowledge in training transformers. We will therefore release the code upon acceptance. Another limitation of the work is that all objects need to have the same number of attributes. An open challenge for future work is to extend our approach to layouts where each object is described by a token sequence of variable length. Further, we would like to work on scaling auto-regressive models to much larger sequences, e.g., by using hierarchical auto-regressive models. In the future, we would also like to explore parallel sampling strategies to sidestep the slow autoregressive generation process.

COFS: Controllable Furniture layout Synthesis, Supplementary Material

Abstract
In this supplementary document accompanying our main submission, we describe our system in more detail. In particular, we detail each of the components of our architecture, including the training protocol. We discuss our method in comparison to ATISS. We describe the metrics and how they were evaluated and compared against the baselines. We provide details on the sampling strategy that we employ.
We perform an ablation study justifying our design choices. We conclude with additional quantitative and qualitative results. We also provide a table of key notation used in the main paper. Additional details can be found on the project page.

A Detailed Architecture and Training Setup
We base our architecture on ATISS [26] in order to ensure a fair comparison to our closest competitor, using the same underlying library [15].Consequently, most of the building blocks are shared.However, we would like to point out major differences in this section.
Layout sequence: During training, we construct the sequence $S$ corresponding to the layout by arranging the object bounding boxes in a random order with a permutation $\pi$ and concatenating their bounding box attributes as individual tokens:

$$S = [\text{SOS};\ \tau_{\pi_1};\ (t_{\pi_1})_x;\ \dots;\ \tau_{\pi_2};\ \dots;\ \text{EOS}],$$

where $\tau_{\pi_1}$ and $(t_{\pi_1})_x$ represent the class and $x$-translation of the first object after permutation, $\tau_{\pi_2}$ represents the class of the second object after permutation, and so on.
Object attributes are always flattened the same way in our implementation, although in principle the attribute order can itself be permuted.We use the same attribute order for ease of implementation.
Embeddings: We describe how we generate embeddings for the tokens in $C$ and $S$. We use a learnable matrix $E_{class}$ of dimension $n_\tau \times 256$ to encode the type $\tau_i$, with each row corresponding to one class. We use an additional [MASK] class. For the other attributes of translation ($t_i$), size ($e_i$), and rotation ($r_i$), we use sinusoidal positional encodings [34,26] with 128 levels ($L = 128$). We call these embeddings $\gamma$. For the encoder, the embeddings of $R_i$ are a learned matrix $E_r$ of dimension $8 \times 256$. Each row corresponds to a different type of attribute: one for type, three each for translation and size, and one for the rotation. The embeddings of $O_i$ are again a learned matrix $E_o$ of dimension $k \times 256$, where $k$ is the maximum number of objects. For the decoder, the embeddings of $P_i$ are also a learned matrix $E_p$ of size $n \times 256$. The final embeddings are the sum of the corresponding embeddings: the encoder embedding $\gamma^e$ adds the embeddings of $R_i$ and $O_i$ to the token embedding, and the decoder embedding $\gamma^d$ adds the embedding of $P_i$ to the token embedding.
Optimizer: We use the PyTorch [27] implementation of the AdamW [21] optimizer with the default parameters for our model, with a constant learning rate of $10^{-4}$ and weight decay set to $10^{-3}$. We linearly warm up the learning rate for 2000 steps. In addition, we found gradient clipping to be necessary to ensure convergence. We set the maximum gradient norm to 30. Empirically, we found that setting the maximum gradient norm too low led to slower convergence.
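The warm-up schedule and the clipping rule can be sketched in pure Python as follows. The exact shape of the warm-up (starting at one step's worth of the base rate) is an assumption, and the functions are illustrative helpers rather than the optimizer code used by the authors.

```python
import math

def learning_rate(step, base_lr=1e-4, warmup_steps=2000):
    """Linear warm-up to base_lr over warmup_steps, then constant."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

def clip_by_norm(grads, max_norm=30.0):
    """Rescale a flat gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

first_lr = learning_rate(0)
final_lr = learning_rate(5000)
clipped = clip_by_norm([30.0, 40.0])  # norm 50 is rescaled to norm 30
```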
We train with a batch size of 128, and train for 1000 epochs.We perform validation every 5 epochs.We save the model with the best performance on the validation set.We use random rotation augmentation by randomly rotating each scene between 0 and 360 degrees.
We wish to clarify that while we used the AdamW optimizer for our model, we used the vanilla Adam optimizer for ATISS, as described in [26].
Parameter Probability Distributions: We need to predict object attributes from the final transformer decoder outputs. To this end, we use MLPs to go from the embedding dimension to the parameters of the distribution describing the attributes. For the class $\tau$, we use a linear layer from the embedding dimension to the number of classes. For the other attributes, we use MLPs with one input layer (256, 512), one hidden layer (512, 256), and one output layer (256, 30), with ReLU activations. The output size reflects that we use a mixture distribution with 10 components, and each component distribution is parameterized by 3 values.
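One way to read the 30-dimensional output is sketched below. The ordering of the three parameter groups, the softmax over the mixture weights, and the exponential used to keep the scales positive are all assumptions of this sketch; the text only states that each of the 10 components has 3 parameters.

```python
import math

def split_mixture_params(raw, num_components=10):
    """Split a raw 3T-vector into mixture weights, means, and positive scales."""
    t = num_components
    if len(raw) != 3 * t:
        raise ValueError("expected 3 * num_components values")
    logits, means, log_scales = raw[:t], raw[t:2 * t], raw[2 * t:]
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    weights = [e / total for e in exps]           # normalized mixture weights
    scales = [math.exp(value) for value in log_scales]  # strictly positive
    return weights, list(means), scales

raw_output = [0.0] * 30
weights, means, scales = split_mixture_params(raw_output)
```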
Transfer Learning: The LIVING, DINING and LIBRARY datasets are much smaller than the BEDROOM dataset. Thus, we use a transfer learning approach, where we first train on the BEDROOM dataset and use those weights as an initialization when training on the smaller datasets. This reduces the training time significantly and also combats overfitting on the smaller datasets.
We note that the datasets have slightly different numbers of classes. Thus, any weights whose shape depends on the number of classes are not transferred, but are instead sampled from a normal distribution with mean 0 and standard deviation 0.01.
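The shape-dependent initialization described above can be sketched as follows, with weights stored as plain nested lists. Parameter names such as `decoder.w` and `class_head.w` are hypothetical, and deciding by shape mismatch is an assumption about how class-dependent weights would be detected.

```python
import random


def shape_of(matrix):
    """Shape of a 2D nested list as (rows, cols)."""
    return (len(matrix), len(matrix[0]))


def transfer_weights(source, target_shapes, std=0.01, seed=0):
    """Copy source weights whose shape matches the target; re-sample the rest.

    Mismatched or missing weights are drawn from a normal distribution with
    mean 0 and standard deviation `std`.
    """
    rng = random.Random(seed)
    result = {}
    for name, shape in target_shapes.items():
        src = source.get(name)
        if src is not None and shape_of(src) == shape:
            result[name] = [row[:] for row in src]  # transferred as-is
        else:
            rows, cols = shape
            result[name] = [[rng.gauss(0.0, std) for _ in range(cols)]
                            for _ in range(rows)]
    return result
```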

A.1 Metrics
The evaluation protocol follows ATISS closely, but we describe it here for the sake of completeness.
To compute the FID, we render both the ground-truth and generated layouts from a top-down view into 256 × 256 images with an orthographic camera using Blender v3.1.0 [6]. Following ATISS, the FID is computed using the code from Parmar et al. [25]. We will release the .blend file used for rendering upon acceptance. To compute the Classifier Accuracy Score (CAS), we use an AlexNet [17] model pretrained on ImageNet [7] to classify the orthographic renderings as real or fake. The Synthesis Time numbers for the competing models are taken from ATISS [26]. We train our models on an NVIDIA A100 GPU. To ensure a fair comparison to the numbers in ATISS, we ran inference on a GTX 1080 GPU, which is the same GPU used in [26]. To compute the KL-divergence, we create a histogram of object categories in the generated layouts, $g_i$, and in the ground-truth layouts, $gt_i$, where $1 \le i \le n_{class}$, and use the formula for the categorical KL-divergence, where $\epsilon = 10^{-6}$ is a small constant for numerical stability.
We use the same train-val-test splits as ATISS. To compute the aforementioned metrics, we generate layouts for all floorplans in the test-set and compare each with the corresponding ground-truth layout.
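As an illustration of the category KL-divergence, the sketch below builds normalized category histograms and evaluates one common form of the categorical KL with the stabilizing constant. The direction of the divergence (ground truth relative to generated) and the placement of $\epsilon$ inside both logarithms are assumptions; the function names are hypothetical.

```python
import math
from collections import Counter

EPS = 1e-6  # small constant for numerical stability


def category_histogram(labels, num_classes):
    """Normalized frequency of each category index in `labels`."""
    counts = Counter(labels)
    total = len(labels)
    return [counts.get(c, 0) / total for c in range(num_classes)]


def categorical_kl(p, q, eps=EPS):
    """KL(p || q) for two categorical distributions of equal length."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q))


# Toy usage: category indices of all objects in ground-truth vs. generated layouts.
gt = category_histogram([0, 0, 1, 2], num_classes=3)
gen = category_histogram([0, 1, 1, 2], num_classes=3)
kl_value = categorical_kl(gt, gen)
```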

B 3D-Front Dataset
To the best of our knowledge, the 3D-Front [11] dataset is the largest collection of indoor furniture layouts in the public domain. Its large scale is obtained, in part, by employing a semi-automatic pipeline, where a machine-learning system places the objects roughly, and an optimization step [38] refines the layouts further to conform to design standards. The only human involvement is verification that the layouts are valid, i.e., that they do not have object intersections, objects that block doors, etc. However, in our exploration, we find that several inconsistencies still remain in the dataset. We mention a few: nightstands intersecting their nearest beds, nightstands obstructing wardrobes, chairs intersecting their closest tables, and chairs that face in the wrong direction. We point out a few of these examples in Fig. 7. Our method, like other data-driven methods, learns the placement of objects from data. Thus, any errors in the ground-truth data itself would also show up in the sampled layouts. This is especially true for the BEDROOM dataset, where the sampled nightstands often end up intersecting with beds.

C Ablations
In this section, we justify our design choices by conducting an ablation study. We train our model under different settings on the BEDROOM dataset, unless specified otherwise, and use the validation loss, the Negative Log-Likelihood (NLL), as the metric to judge performance. This is because we empirically found the validation loss to correlate directly with sample quality. In particular, we ablate the choice of our position encodings, the number of layers, and training with gradient clipping. We also include a discussion of the masking strategy and transfer learning.

C.1 Position Encodings
We consider the input conditioning to be a set. In contrast, the output is a sequence. Thus, the model needs additional information to align the input and the output. We use the object index tokens $O_i$ and the relative position tokens $R_i$ to provide this additional information.
During training, the objects themselves are permuted. The intuition is that $O_i$ injects information about how early or late each object must appear in the output sequence. However, this information alone is not enough to disambiguate where each of the attributes of the object must appear. Hence, we also add $R_i$ to the object attribute embeddings. Together, these embeddings localize the position of the attribute in the output sequence, given the current permutation.
In Fig. 8a, we progressively add our embeddings to the Baseline model, which is the model without any positional encodings on the encoder. It is clear that our embeddings help the model better align the set input and the sequence output. While each of $O_i$ and $R_i$ roughly aligns the input and the output, it is only when using both embeddings that the model can precisely locate the actual position of tokens in the output sequence.

C.2 Number of Layers
For all the experiments in the main paper, we used 4 transformer layers in both the encoder and the decoder. In Fig. 8b, we show how the model performance changes as we scale the number of layers. We see that the performance correlates strongly with the number of layers. However, the performance gains become marginal when going from 4 to 8 layers or from 8 to 16 layers. These larger models take longer to train and sample from. We believe our 4-layer models provide a good compromise between performance and speed.
Note that the values in Fig. 8b are smoothed by an interpolating spline to highlight the general trend.

C.3 Gradient Clipping
We found that the validation loss oscillated considerably during training. Upon further investigation, we noticed that the gradient norms tended to be unusually large, especially for the last layers in the parameter-generating MLPs. Thus, we train the final networks with gradient clipping. Surprisingly, we found that even without gradient clipping, if we retain the model with the best NLL on the validation set, the performance is the same. However, with gradient clipping, we found the training curves to be much smoother (Fig. 8d). Consequently, we were able to perform validation at less frequent intervals to select the best performing model, which sped up training.

C.4 Masking Strategy
MaskGIT [3] finds that using a robust masking strategy is important, as the usual 15% masking leads to a distribution shift between training and sampling. We see in Fig. 8e that masking with a uniform ratio of 15% leads to a better NLL, as the network is more confident in its predictions. However, we found that we could not sample from a network trained this way, as it would output a stop token after generating only a few objects. This makes intuitive sense, as the network would only see a few mask tokens during training.

C.5 Transfer Learning
We plot the validation loss in Fig. 8c on the LIBRARY and LIVING datasets under two configurations: No Transfer, where the models are trained from scratch, and Transfer, where the model is first trained on the BEDROOM dataset and these weights are used as initialization for training on the target dataset. We make a few observations: 1. The models begin to overfit fairly early. For the BEDROOM dataset, the loss continues to fall until epoch 1200, but in the No Transfer configuration we see overfitting at epoch 150 for the LIBRARY dataset and at epoch 600 for the LIVING dataset. We hypothesize that this is due to the small size of these datasets compared to the BEDROOM dataset. 2. The No Transfer configuration has a higher (worse) NLL than the Transfer configuration, even when trained for longer.
These observations led us to use the Transfer configuration for the LIBRARY, LIVING and DINING datasets.

D Sampling Details
We highlight the difference between our sampling algorithm and the standard conditional sampling algorithm in this section. These differences are highlighted in blue in Alg. 2. The primary difference is that in our sampling algorithm, a forward pass is made through the decoder every time a new token is sampled. This token then replaces the corresponding [MASK] token in both C and S.
In addition, our algorithm runs for a fixed number of iterations (until all [MASK] tokens are replaced), whereas the standard algorithm terminates when an EOS token is generated. This is both an advantage and a drawback: it is an advantage in the sense that a user can implicitly specify the number of objects by specifying the number of [MASK] tokens. It is a drawback in that the number of objects must be known before sampling can proceed.
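The fixed-length sampling loop described above can be sketched as follows. This is not the authors' Alg. 2 but a simplified illustration: `decoder_step` is a hypothetical callable standing in for a full forward pass of the model, and `toy_decoder_step` is a stand-in that copies specified tokens and invents placeholder values for masked ones.

```python
MASK = "[MASK]"
SOS = "[SOS]"


def sample_layout(condition, decoder_step):
    """Fill every position of the condition sequence, one token per step.

    The loop runs for exactly len(condition) iterations; there is no EOS token.
    After each step the sampled token is written back into both C and S.
    """
    C = list(condition)   # encoder-side condition sequence
    S = [SOS]             # decoder-side output sequence
    for i in range(len(C)):
        s = decoder_step(C, S, i)  # one forward pass per new token
        C[i] = s
        S.append(s)
    return S[1:]


def toy_decoder_step(C, S, i):
    """Hypothetical stand-in for the model: copy if specified, else 'sample'."""
    if C[i] != MASK:
        return C[i]
    return "sampled_%d" % i


layout = sample_layout(["bed", MASK, MASK, 0.5], toy_decoder_step)
```

Because the loop length equals the length of the condition sequence, the number of [MASK] tokens passed in directly determines how many attributes are generated.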

D.1 A Sampling Trick
For our outlier detection examples, we use a simple trick: if there is only a single object to be sampled, we can create a permutation so that the [MASK] tokens of the object to be sampled are toward the end of the sequence in C and S. With this permutation, we only have to make forward passes beginning from the first masked token. All the tokens before the first masked token can simply be copied. This leads to faster sampling.
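A minimal sketch of this permutation trick, assuming each object is stored as a list of 8 attribute tokens, is given below. The helper names `move_object_to_end` and `flatten_with_mask` are hypothetical.

```python
MASK = "[MASK]"
ATTRS_PER_OBJECT = 8  # 1 type + 3 translation + 3 size + 1 rotation


def move_object_to_end(objects, target):
    """Return a permutation (and the permuted objects) with `target` last."""
    order = [i for i in range(len(objects)) if i != target] + [target]
    return order, [objects[i] for i in order]


def flatten_with_mask(permuted_objects):
    """Flatten the objects into a token sequence and mask the last object.

    Returns the sequence and the index of the first masked token; every token
    before that index can be copied without a forward pass.
    """
    tokens = [tok for obj in permuted_objects[:-1] for tok in obj]
    first_masked = len(tokens)
    tokens += [MASK] * ATTRS_PER_OBJECT
    return tokens, first_masked
```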
[Algorithm 2, recoverable steps: 4: C[i] = s, S.append(s); 5: end for; 6: return S]
In all our experiments, we set the number of objects to be sampled to be the same as the number of objects in the ground-truth layout associated with the particular floorplan boundary.

E Arbitrary Conditioning
We first recap the sampling strategy of ATISS.
\begin{align}
q &\rightarrow \hat{c} \tag{13}\\
(q, \lambda(c)) &\rightarrow \hat{t} \tag{14}\\
(q, \lambda(c), \gamma(t)) &\rightarrow \hat{r} \tag{15}\\
(q, \lambda(c), \gamma(t), \gamma(r)) &\rightarrow \hat{s} \tag{16}
\end{align}
These equations say the following: From a query vector q, the model predicts a class. From the query and class, the model predicts the translation. From the query, class, and translation, the model predicts a rotation, and so on. This means that in ATISS, future attributes cannot affect the distribution of previous attributes. When conditioning, we can specify the class and then sample a translation, but we cannot specify a translation and let the model infer the most likely class for that given translation.
In contrast, COFS has bidirectional attention on the encoder side, enabling us to specify any subset of object attributes. This is done by replacing the [MASK] token corresponding to the object attribute by its actual value in C. The copy-paste objective ensures that the same attribute will be sampled at the desired location by the decoder. The mask-predict objective trains the model to get the most likely attributes for the unspecified tokens.
We describe the process using the following example: We start out with a layout, shown in Fig. 10a. If we mask out the table in cyan (Fig. 10b) and sample unconditionally, we get another similar table (Fig. 10c). We now wish to have some control over the generation process.
We now mask out a different object, the stool in the upper left corner. We have masked out a single object, thus we have 8 [MASK] tokens. Our sequences C and S look like Fig. 10d. If we want to specify the position of the next object, we simply set the token $c_i$ in C corresponding to the position attribute of the next object to the value we want. We show a few examples of this type of conditioning in Fig. 10f and Fig. 10g. In the rest of the figures, before beginning sampling, we set the class tokens. We see that the generated layouts follow the condition, while also generating plausible layouts, even if the classes of conditioning objects never occur together. As an example, there are only 5 examples of bedrooms with two beds, yet our model is able to reason about the placement of objects in such challenging layouts in Row 5.
We further see that the model is able to place other objects in such a manner that the constrained objects can still satisfy their constraints. In Row 4, we see that when we constrain the angle of the bed, the other objects move in tandem to create a plausible layout.
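Attribute-level conditioning as described in this section amounts to overwriting selected [MASK] tokens in C before sampling. The sketch below builds such a condition sequence, assuming 8 tokens per object in the order type, translation, size, rotation. The attribute names and the function `build_condition` are hypothetical.

```python
MASK = "[MASK]"
# Assumed per-object token order: type, 3 translation, 3 size, 1 rotation.
ATTRIBUTES = ("type", "tx", "ty", "tz", "sx", "sy", "sz", "rot")


def build_condition(num_objects, specified=None):
    """Create a condition sequence C with only the specified attributes filled in.

    `specified` maps (object_index, attribute_name) to the desired value;
    every other position stays [MASK] and is left for the model to sample.
    """
    specified = specified or {}
    C = [MASK] * (num_objects * len(ATTRIBUTES))
    for (obj, attr), value in specified.items():
        C[obj * len(ATTRIBUTES) + ATTRIBUTES.index(attr)] = value
    return C


# Fix the class of object 0 and part of the position of object 1.
C = build_condition(2, {(0, "type"): "bed", (1, "tx"): 1.5, (1, "tz"): -0.3})
```

An autoregressive model with a fixed attribute order could only honor the first kind of constraint here (class before position); the second requires that later tokens influence earlier ones, which is what the bidirectional encoder provides.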

F Limitations and Discussion
We now discuss limitations of our model. The first is related to our simple object retrieval scheme based only on bounding box sizes. This often leads to stylistically different objects in close proximity even if the bounding box dimensions are only slightly different. We show such an example in Fig. 9 (left). The second is related to the training objective of the model: we only consider the cross entropy/NLL. Thus, the network does not have explicit knowledge of design principles such as non-intersection or object co-occurrence. This means that the model relies completely on the data being high-quality to ensure such output. We highlighted the fact that certain scenes in the dataset have problematic layouts, and our method cannot filter them out. We show an example of intersections in Fig. 9 (center). Thirdly, the performance on the LIVING and DINING datasets is not as good as on the other datasets, which is clear from the CAS scores. This is in part because these datasets are small but also have significantly more objects than BEDROOM or LIBRARY. This leads to accumulated errors. We would like to explore novel sampling strategies to mitigate such errors. Lastly, while the conditioning works well, it is not guaranteed to generate a good layout. For example, in Fig. 9 (right) we set the condition to be two beds opposite each other, but the network is unable to place them in valid locations. Adding explicit design knowledge would help mitigate such arrangements, but we leave that extension to future work.

G Additional Results
We show additional results on unconditional sampling from our model in the concluding figures. Our synthesised layouts are novel and do not merely copy the ground-truth layout. In addition, we see that our layouts respect the floorplan boundary and mimic the underlying style of the datasets, in terms of object-object co-occurrence.
The Boundary Encoder. An untrained ResNet-18 model.

M/[MASK]: A learnable token representing a missing value which the Generative Model tries to predict.
C: The sequence of tokens describing the condition. It is the input to the Condition Encoder.
$c_i$: The i-th element of C.
$C_g$: The output of the last layer of the Condition Encoder. Encodes conditions from C and boundary I.
S: The sequence representing the layout.
$S_{GT}$: The sequence representation of the Ground Truth layout.
$s_i$: The i-th element of S.

Figure 1: COFS Overview. (Left): The model is a BART-like encoder-decoder model, with a bidirectional encoder and an autoregressive decoder. Note the attention matrices shown in grey. At training time, a permutation $\pi$ permutes the objects in a scene. The encoder receives additional information in the form of Relative Position Tokens $R_i$ and Object Index Tokens $O_i$. A random proportion of tokens is replaced with a [MASK] token. The decoder is trained with Absolute Position Tokens $P_i$ and performs two tasks: 1. copy-paste: the attributes should be sampled at the proper location; 2. mask-prediction: the decoder predicts the actual value of the token corresponding to a [MASK] token in the encoder sequence. (Right): During inference, to measure likelihood, we create a copy of the sequence with each token masked out. The decoder outputs a probability distribution over the possible values of the masked tokens.

Figure 2: Sampling Strategy. (Left): We start out with a sequence of all [MASK] tokens on the encoder, and the SOS token on the decoder. Conditions, if any, are specified by replacing the [MASK] with the desired value of the attribute in the sequence. Sampling is performed autoregressively. (Right): After a token has been sampled, both the encoder and decoder sequences are updated, and autoregressive generation continues.

Figure 3: Scene generation from scratch: We compare generated scenes from GT, ATISS, and our model. Both ATISS and our model are conditioned on the floorplan boundary (first column). In contrast to ATISS, we can see that our model consistently creates plausible layouts within the floorplan boundary while avoiding unnatural object intersections. These are results on the challenging LIVING (rows 1, 3) and DINING (rows 2, 4) categories. Additional results can be found in the supplementary. (Best viewed zoomed in, on a computer display.)

Figure 4: Partial Scene Completion: We show qualitative visualizations for the scene completion task.The generated layouts all respect the condition they were generated with (top row) and are all plausible room layouts.

Figure 5: Outlier detection: Our model can utilize bidirectional attention to reason about unlikely arrangements of furniture.We can then sample new attributes that create a more likely layout.Top row: An object is perturbed to create an outlier (highlighted in blue).Bottom row: The object can be identified by its low likelihood, and new attributes sampled which place it more naturally.Please zoom in for best viewing.

Figure 6: Arbitrary conditioning: We constrain the location of the next object to be sampled. The location is highlighted in pink. We sample location after the class. In a normal autoregressive model, we would not be able to constrain on the location. The bidirectional attention in the encoder allows the network to generate current tokens based on future tokens. In this example, the network automatically infers the proper class and size and even matches the style of the chairs in the example on the left.

Figure 7: We show a few examples of inconsistencies in the 3D-Front dataset. Top: Camera placement in 3D-Front layouts. Center: The corresponding regions show errors in the ground-truth data. Left: Chairs facing and intersecting a shelf. Right: Chairs in the correct orientation, but intersecting with a table. Bottom: Some more ground-truth errors. (From left to right:) Intersection. Blocking. Wrong Orientation and Intersection. Wrong Orientation.

Figure 8: Ablation Studies. We show the validation losses for the different architectural choices we make. (a) Adding additional tokens helps the decoder to better align the input and the output. (b) The smaller models perform worse, but adding more layers does not yield correspondingly larger gains. (c) Left: Transferring weights on the LIBRARY dataset. Right: Transferring weights on the LIVING dataset. (d) Left: Gradient clipping applied on the DINING dataset. Right: Gradient clipping applied on the LIVING dataset. (e) Using a uniform masking ratio of 0.15 shows good performance in terms of NLL, but the model is unable to sample owing to the large distribution shift between training and inference.

Figure 9: We show failure cases in the samples generated by our model.

Figure 11: Scene generation from scratch: We compare generated scenes from GT, ATISS, and our model on the LIBRARY class.

Table 1: Comparison of our proposed model to other state-of-the-art models: Our model is the only model that can perform non-autoregressive sampling, has a set representation, is lightweight, can estimate the likelihood of an existing scene, and also allows for arbitrary conditioning on any object parameter.

Table 3: Comparison on Synthesis Time: We compare the time required to synthesize a single scene for different categories. Our model outperforms all other models on this metric, and is competitive on FID in comparison to ATISS. Our model achieves this while being parameter efficient. We only have about half the parameters of the next closest competitor. About 10M of these parameters come from the ResNet-18 encoder.

Table 4: Summary of key notation used in the paper.
Symbol: Description
I: A binary image representation of the floorplan boundary.
$g_\phi$: The Condition Encoder. Implemented as a Transformer Encoder with Bidirectional Attention.
$f_\theta$: The Generative Model. Implemented as a Transformer Decoder with Causal Attention.