PSDR-Room: Single Photo to Scene using Differentiable Rendering

A 3D digital scene contains many components: lights, materials and geometries, interacting to reach the desired appearance. Staging such a scene is time-consuming and requires both artistic and technical skills. In this work, we propose PSDR-Room, a system that optimizes lighting as well as the pose and materials of individual objects to match a target image of a room scene, with minimal user input. To this end, we leverage a recent path-space differentiable rendering approach that provides unbiased gradients of the rendering with respect to geometry, lighting, and procedural materials, allowing us to optimize all of these components using gradient descent to visually match the input photo appearance. We use recent single-image scene understanding methods to initialize


INTRODUCTION
Progress in computer graphics over recent decades has led to algorithms that turn detailed scene descriptions into highly photorealistic renderings. Such a scene description is composed of many components: lights, materials and geometries (themselves separated into sub-geometries with different material assignments). These pieces all interact to reach the desired appearance. However, composing such scenes (especially with complex materials and lighting) is a manual undertaking that requires significant technical and artistic skill, creating a need for methods that automatically handle parts of the arduous process of 3D content creation.
Composing a scene from a single photograph with a high level of automation is a long-standing challenge [Roberts 1963]. Several recent works have made progress on parts of the problem, focusing either on geometry retrieval and placement [Izadinia et al. 2017; Nie et al. 2020] or on materials and lighting [Yeh et al. 2022] independently. Recently, physically-based differentiable rendering has reached a point where full light transport computation can also provide gradients of the final pixel values with respect to all relevant scene parameters: geometric information, material parameters, and lighting parameters. For the first time, this opens the exciting possibility of jointly refining the estimates of all scene components through optimization.
In this work, we propose a system, PSDR-Room, to match the appearance of a single photograph of an indoor scene by initializing and optimizing all of these components: lighting, geometry and materials. To do so, we leverage recent path-space differentiable rendering (PSDR) approaches [Zhang et al. 2020a, 2021; Yan et al. 2022]. We also take advantage of the recent progress in scene understanding, using powerful models for image segmentation [Cheng et al. 2022], depth estimation [Ranftl et al. 2020, 2021], FOV estimation [Jin et al. 2023], and image latent-space encoding [Radford et al. 2021]. Furthermore, given their importance in industry, we retrieve and optimize materials represented as procedural node graphs, using a recent differentiable approach [Shi et al. 2020; Li et al. 2023]. With our system, given an input photograph, users can generate a visually matching 3D scene within a few minutes, using geometry and procedural material assets from their preferred library. Once optimized, the scene can easily be edited to change its components, or to further modify the parameters of the selected procedural materials. Throughout our pipeline, differentiable rendering is crucial in adjusting initial guesses and reaching a high-quality match to the target image.
Our method involves several stages, as follows. First, the preprocessing step leverages state-of-the-art single-image estimators for camera intrinsics, depth and object segmentation, which combine to estimate a point cloud and divide the scene into its component objects. Based on this initial information, the next stage finds an approximate room shape, which we further refine through physically-based differentiable rendering. The object stage then aims at retrieving, positioning and orienting each object detected by the segmentation. Again, our method first estimates a rough position based on the estimated point cloud and further refines this prediction using differentiable rendering. In the material and lighting stage, we assign a material, possibly based on a complex procedural node graph retrieved from a database, to each object and generate lighting based on the layout of the scene. Once more, we leverage differentiable rendering to jointly refine all material properties as well as the scene lighting.
Our system also allows user guidance: the user can specify pairs of crops between the target image and a rendering of our optimized scene to enforce material similarity in these areas. As our method is retrieval-based, multiple objects or materials can be similar to the target. In our main (automatic) setting, we select the top-1 result, but a user can easily adjust the selected object or material among the closest matches.
We evaluate our method on real indoor scene photographs, obtaining visually close reconstructions. We further compare to recent work, showing that our approach is more accurate in jointly estimating lights and materials. In summary, we propose a method to go from a single photograph to a 3D scene with lighting and procedural materials in a few minutes with a minimal amount of user input. This is the first end-to-end system applying physically-based differentiable rendering to turn an input photo into a valid 3D scene with separate objects, capable of optimizing geometry poses, lighting and complex procedural materials.

RELATED WORKS
Scene-level inverse rendering. Creating entire 3D scenes from photographs alone is one of the big challenges at the intersection of Computer Graphics and Vision. Previous work leverages deep learning to directly reconstruct meshes [Nie et al. 2020] or infer per-pixel scene illumination, normal and material parameters [Li et al. 2020; Zhu et al. 2022]. Other approaches take advantage of the recent progress in differentiable rendering to optimize textures and materials [Azinovic et al. 2019; Nimier-David et al. 2021]. Reconstructing both geometries and materials with production-level quality is a particularly challenging task. Closer to our approach, other works instead retrieve geometries from a mesh database and optimize their positions given a target photograph, such as the IM2CAD system [Izadinia et al. 2017]. Other approaches [Nie et al. 2020; Huang et al. 2018] predict geometries from a single photo directly, without utilizing a database; they also predict poses and camera parameters. As opposed to our approach, these works do not match lighting or materials and do not benefit from the recent progress in differentiable rendering to enable fine-grained adjustment of the generated scene.
A recent method, PhotoScene [Yeh et al. 2022], retrieves and optimizes materials as well as lighting on a pre-existing geometry scene (manually created or predicted by a different method like Total3D [Nie et al. 2020]). This work is fairly close to our material and lighting stage, and also uses differentiable procedural materials as a texture prior. However, it uses simpler approximations for differentiable rendering. Our material and lighting results improve upon theirs, due to a more powerful full-scene differentiable rendering system and a CLIP-based search for materials. We also find that comparing texture on rectangular crops is more robust than the masked VGG losses used by PhotoScene.
Procedural material optimization. Procedural materials are an industry standard for material representation. The manipulation and generation of such materials has been an active field of research in recent years. The most recent work on procedural materials is starting to enable procedural graph generation [Hu et al. 2022c; Guerrero et al. 2022]; however, the quality of the generated materials does not yet match that of existing databases. Most relevant to our goals are therefore procedural material parameter estimation [Hu et al. 2019] and optimization [Shi et al. 2020; Hu et al. 2022a; Li et al. 2023] methods. Combined with a new material retrieval approach, we leverage this progress in procedural material optimization to better match the target scene appearance.
Single image scene understanding. Recent years saw significant progress in neural models for single image scene understanding. We utilize models for intrinsic camera parameter estimation [Lopez et al. 2019; Zhang et al. 2020b; Jin et al. 2023] and depth estimation [Ranftl et al. 2020, 2021], which let us better understand the photograph's FOV and scene geometry. To automate our method as much as possible, we benefit from recent improvements in instance-level segmentation of a single image [Cheng et al. 2022; Kirillov et al. 2023], allowing us to separate the different geometries both spatially and semantically. Our approach relies on these scene understanding components, but is not directly tied to any specific method, meaning that it would benefit from any future progress in this area.

PRELIMINARIES
In this section, we introduce several technical tools, recently made available by the research community, which are key enablers of our approach. We give a brief introduction to path-space differentiable rendering [Zhang et al. 2020a, 2021] used by our PSDR-CUDA system. We also discuss the inverse procedural material approach MATch and its library of differentiable procedural nodes, DiffMat [Shi et al. 2020], as well as losses based on Gram matrices of VGG layers [Gatys et al. 2016].

Path-Space Differentiable Rendering
Physically based rendering is frequently formalized using the path-integral formulation by Veach [1997]. The intensity $I$ of a final image pixel is an integral over all light paths through that pixel:

$$I = \int_{\Omega} f(\bar{x}) \, \mathrm{d}\mu(\bar{x}), \qquad (1)$$

where $\Omega$ is the path space comprised of light paths $\bar{x} = (x_0, \ldots, x_N)$ of all lengths $N \geq 1$, connecting the camera to the light source through a number of scene interaction events, and $\mu$ is an appropriate measure on the space of paths. Finally, $f$ is the contribution function, defined as the product of terms on each path vertex: source emission on the light source vertex; bidirectional scattering distribution functions (BSDFs) on vertices corresponding to scene geometry reflection and transmission events; geometric terms on the path segments; and a detector response term on the camera vertex (typically based on a pixel reconstruction filter such as a box or a Gaussian).
A key challenge is to extend the above formulation to differentiable rendering, where the goal is to compute the derivative of the pixel intensity $I$ with respect to some scene parameter $\theta \in \mathbb{R}$. We could think of $\theta$ as a time-like parameter, where the scene "evolves" with it, though this does not need to be the case; there could be many such parameters controlling lighting, materials and geometry.
In the latter case, when the scene geometry (i.e., the union of object surfaces) $\mathcal{M}$ evolves with some parameter $\theta \in \mathbb{R}$, Zhang et al. [2020a; 2021] have demonstrated that the derivative $\mathrm{d}I/\mathrm{d}\theta$ of Eq. (1) can be expressed as the sum of an interior and a boundary path integral, where the latter is unique to differentiable rendering and captures light transport paths with a segment constrained to a visibility boundary.
In practice, our renderer PSDR-CUDA gives correct gradients with respect to parameters $\theta$ that cause geometric change, because it appropriately samples the boundary as well as the interior terms. Note that this formulation can be easily adapted beyond solving full light transport, and we use it to compute anti-aliased depth and object mask images as well, by modifying the contribution function $f$ accordingly.

Inverse rendering and losses
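Optimization-based inverse rendering, or analysis by synthesis, infers a set of $n_\theta$ scene parameters $\theta \in \mathbb{R}^{n_\theta}$ by minimizing some predetermined rendering loss $\mathcal{L}$ between the rendered image $I$ and the reference image $I_0$. Solving this optimization using stochastic gradient descent methods such as Adam [Kingma and Ba 2014] requires differentiating the rendering loss $\mathcal{L}$ with respect to the parameters $\theta$. According to the chain rule, the gradient $\mathrm{d}\mathcal{L}/\mathrm{d}\theta$ satisfies

$$\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\theta} = \frac{\partial \mathcal{L}}{\partial I} \, \frac{\mathrm{d}I}{\mathrm{d}\theta},$$

where the $\partial \mathcal{L}/\partial I$ on the right-hand side can be obtained using differentiable evaluation of the loss $\mathcal{L}$, and $\mathrm{d}I/\mathrm{d}\theta$ is computed using differentiable rendering (§3.1).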
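To make the chain rule concrete, the following minimal sketch runs analysis by synthesis on a toy problem. It is not PSDR-CUDA: the `render` function is a hypothetical stand-in in which the image is simply linear in one scalar parameter, the loss is a mean squared error chosen for illustration, and the Adam update is written out by hand.

```python
import numpy as np

def render(theta, base):
    """Toy stand-in for a renderer: the image is linear in a scalar parameter."""
    return theta * base

def loss_and_grad(theta, base, target):
    """Mean squared loss and its derivative, assembled via the chain rule."""
    image = render(theta, base)
    residual = image - target
    loss = np.mean(residual ** 2)
    dL_dI = 2.0 * residual / residual.size  # partial L / partial I (from the loss)
    dI_dtheta = base                         # d I / d theta (from the "renderer")
    return loss, float(np.sum(dL_dI * dI_dtheta))

def adam_optimize(theta, base, target, steps=500, lr=0.05,
                  b1=0.9, b2=0.999, eps=1e-8):
    """Plain Adam on a single scalar parameter."""
    m = v = 0.0
    for t in range(1, steps + 1):
        _, g = loss_and_grad(theta, base, target)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

rng = np.random.default_rng(0)
base = rng.uniform(0.1, 1.0, size=(8, 8))
target = render(2.5, base)                # reference image with known parameter
theta_fit = adam_optimize(0.5, base, target)
print(theta_fit)
```

In a real pipeline, `dI_dtheta` would come from the differentiable renderer rather than from a closed form, but the two factors are combined in the same way.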
Procedural materials. Substance materials [Adobe 2023] are an industry standard for defining realistic material textures through procedural node graphs, where nodes generate noises and patterns and adjust them using image processing filters. The MATch approach and DiffMat library by Shi et al. [2020] implemented differentiable versions of many of the filter nodes in the Substance engine, which can be used to optimize their parameters to match a target material appearance. The DiffMat library was recently improved (DiffMatV2) for faster optimization and generator node optimization [Li et al. 2023]. Similar to PhotoScene [Yeh et al. 2022], we use this approach (in its most recent version) to define a manifold of plausible textures, with a much smaller number of optimizable parameters than texels, providing important regularization such that even invisible parts of the scene objects receive valid texture.
In practice, the presence of procedural material parameters introduces another step into the chain rule above. The differentiable renderer back-propagates the loss to a texture-space gradient, after which the DiffMatV2 library (in Taichi) takes over and further back-propagates to a gradient with respect to the procedural parameters. This requires passing gradients between multiple systems written in different languages.
Texture descriptors. Gatys et al. [2016] leverage a pre-trained VGG neural network [Simonyan and Zisserman 2014] to guide style transfer, using the Gram matrices of extracted deep features from the VGG layers as their statistical representation. We use 5 layers (the ones after each pooling operation in the VGG19 variant), and concatenate the flattened Gram matrices into a single vector descriptor of the crop texture. We use this loss to compare the texture content between image crops, and found it to be the most reliable of the alternatives. Note that the descriptor size does not depend on the input crop size, so it can be used to compare differently-sized crops. Heitz et al. [2021] introduced an alternative sliced Wasserstein loss, which in theory compares the distributions of VGG activations more accurately, though at the cost of introducing more noise into the gradients. Some previous works have generalized these descriptors to arbitrarily-shaped image regions given by masks [Yeh et al. 2022; Hu et al. 2022b]; however, the masks often need to be processed by case-dependent amounts of erosion, and we find that rectangular crop regions remain more reliable.
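The size-independence of the descriptor can be seen in a small sketch. Here random arrays stand in for VGG activations (no network is loaded), the two layers with 4 and 8 channels are illustrative, and the L1 comparison in `gram_loss` is an assumption, since the text does not state which norm is used.

```python
import numpy as np

def gram_matrix(features):
    """Gram matrix of a (C, H, W) feature map, normalized by the pixel count."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return flat @ flat.T / (h * w)

def texture_descriptor(feature_maps):
    """Concatenate the flattened Gram matrices of several layers into one vector."""
    return np.concatenate([gram_matrix(f).ravel() for f in feature_maps])

def gram_loss(desc_a, desc_b):
    """Mean absolute difference between two descriptors (norm is an assumption)."""
    return float(np.mean(np.abs(desc_a - desc_b)))

rng = np.random.default_rng(0)
# Stand-ins for activations of two layers, for a small and a large crop.
feats_small = [rng.standard_normal((4, 16, 16)), rng.standard_normal((8, 8, 8))]
feats_large = [rng.standard_normal((4, 32, 48)), rng.standard_normal((8, 16, 24))]

desc_small = texture_descriptor(feats_small)
desc_large = texture_descriptor(feats_large)
print(desc_small.shape, desc_large.shape)  # both depend only on channel counts
print(gram_loss(desc_small, desc_large))
```

Each layer contributes C x C entries regardless of the spatial resolution, which is why crops of different sizes remain comparable.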

METHOD
Overview. Our PSDR-Room system starts from a single input image of an indoor scene, with an object segmentation, depth and camera intrinsics (specifically field-of-view) provided by existing estimation methods. Our goal is to obtain a visually matching scene reconstruction, which includes picking the right objects, optimizing their poses, picking the right materials, and optimizing the lighting and materials in the scene. For most objects (at least the ones where texturing makes sense), our goal is to obtain tileable material textures, including diffuse color, roughness and normal maps.

Fig. 2. A high-level overview of our pipeline. In the preprocess stage (Sec. 4.1), we segment the input image and estimate camera field-of-view and per-pixel depth. A user can optionally edit these estimates. In the room stage (Sec. 4.2) and object stage (Sec. 4.3), we initialize and optimize the geometry of the box representing the room, as well as the objects in it. The objects are selected from the 3D-FUTURE database by search using similarity in CLIP space. Finally, the material and lighting stage (Sec. 4.4) selects procedural node graph materials from a database using CLIP similarity, initializes material texture transforms and lighting, and refines the estimates using optimization, making use of corresponding crop pairs between input and rendered images, which can be automatically detected or user-provided. The final output is a reconstructed scene, which supports camera, geometry, material and lighting edits.
Our method includes four high-level stages. First, in the preprocess stage, we predict the camera field-of-view and estimate the depth map and per-object segmentation; this stage is entirely based on previous methods. These inputs can be interpreted as an approximate segmented point cloud, which is sufficient for initialization decisions. Second, in the room stage, we initialize a coarsely aligned room box based on the subset of the point cloud belonging to the room's walls, floor and ceiling, and use differentiable rendering to refine the room box to satisfy an image-space mask loss in combination with a depth loss. Next, in the object stage, we search a database of objects to match the segmented objects in the input image, coarsely align them to their point clouds, and use differentiable rendering to refine their poses to minimize their image-space mask loss and depth loss. Finally, in the material and lighting stage, we choose either a homogeneous material or a procedural node graph from a database for each material part, initialize lighting in the scene, and use differentiable rendering to jointly optimize the materials and lighting; for procedural node graph materials, this optimization backpropagates to their node parameters.

Preprocess stage
Camera intrinsics. We assume a simple pinhole camera model. Given an input image, we use PerspectiveFields [Jin et al. 2023] to estimate the camera vertical field-of-view and assume square pixels.
In our experiments, the model is robust enough and its predictions do not need manual adjustments.
Depth estimation. We use a recent proprietary monocular depth estimation model based on DPT [Ranftl et al. 2021] and MiDaS [Ranftl et al. 2020] to generate an approximate depth map. As the units of the returned depth are unknown, we simply use the normalized depth and do not assume anything about absolute scale. While not perfect, this depth information works reasonably well for room and object placement, which we further refine using differentiable rendering with image-space losses. Depth information from any other source, e.g. a depth sensor, can be used instead.
Segmentation. We use Mask2Former [Cheng et al. 2022] to perform a panoptic segmentation, which yields a mask image for each object, including the room ceiling, floor and walls; these are labeled by the model, and we use the labels to establish correspondence with our rendered box and objects.
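The preprocess outputs combine into an approximate point cloud by unprojecting the depth map through the pinhole model. The sketch below illustrates this under stated assumptions: depth is treated as distance along the camera's viewing axis, the camera looks along +z with y pointing down, and the function names are hypothetical rather than taken from our implementation.

```python
import numpy as np

def focal_from_vfov(vfov_deg, height):
    """Focal length in pixels from the vertical field of view (square pixels)."""
    return 0.5 * height / np.tan(0.5 * np.radians(vfov_deg))

def unproject(depth, vfov_deg):
    """Turn an (H, W) depth map into an (H, W, 3) camera-space point cloud."""
    h, w = depth.shape
    f = focal_from_vfov(vfov_deg, h)
    v, u = np.mgrid[0:h, 0:w]                  # pixel row and column indices
    x = (u + 0.5 - 0.5 * w) / f * depth        # offsets measured from pixel centers
    y = (v + 0.5 - 0.5 * h) / f * depth
    return np.stack([x, y, depth], axis=-1)

depth = np.full((4, 6), 2.0)                   # toy constant depth map
points = unproject(depth, vfov_deg=90.0)
print(points.shape)
print(points[0, 0])                            # top-left pixel in camera space
```

Indexing `points` with a segmentation mask then gives the per-object subsets of the cloud used for initialization.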

Room stage
For simplicity, we assume the walls, ceiling and floor of the room can be approximated as a box; more complex floor plans could be supported in the future using an extension of our approach. We aim to initialize a room box coarsely aligned to the approximate point cloud estimated above, followed by a refinement stage, where we further optimize the box size and placement to match the image-space segmentation.
Fig. 3. Room stage optimization (panels: Initialization, Optimized Result): We show the estimated segmentation masks of the walls/floor/ceiling versus the rendered room box before and after the room optimization. The optimized room box is better aligned to the mask (see circular insets). Also note that the estimated masks are much smaller than the rendered box sides, as the room was not empty; our differentiable IoU metric is appropriate for this scenario.

Room initialization. In the coarse stage, we take the depth and segmentation masks corresponding to room walls, ceiling and floor (with some erosion applied) to generate a segmented room point cloud, without objects. We estimate the floor plane using RANSAC on the floor-segmented point cloud. We then estimate the room height as the maximum height value among the points. If the room does not have a ceiling visible in the image, we simply set the room height sufficiently above the top visible point (1.2 × the maximum height value). We rotate the room box horizontally around the predicted floor normal and find the rotation and scaling that best align with the wall estimated by RANSAC from the point cloud. This results in a coarsely aligned room box that approximately matches the predicted room point cloud, and is sufficient as initialization.
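As an illustration of the floor-plane estimate, here is a minimal RANSAC plane fit on a synthetic point cloud. The iteration count, inlier threshold, and the synthetic data are illustrative assumptions and not the values used in our system.

```python
import numpy as np

def fit_plane_ransac(points, iters=200, threshold=0.02, seed=0):
    """Fit a plane n.x + d = 0 by RANSAC; returns ((normal, offset), inlier count)."""
    rng = np.random.default_rng(seed)
    best_inliers, best_plane = -1, None
    for _ in range(iters):
        a, b, c = points[rng.choice(len(points), size=3, replace=False)]
        normal = np.cross(b - a, c - a)
        norm = np.linalg.norm(normal)
        if norm < 1e-12:                       # skip degenerate (collinear) samples
            continue
        normal = normal / norm
        offset = -normal @ a
        dist = np.abs(points @ normal + offset)
        count = int(np.sum(dist < threshold))
        if count > best_inliers:
            best_inliers, best_plane = count, (normal, offset)
    return best_plane, best_inliers

rng = np.random.default_rng(1)
# Synthetic "floor" near the plane y = 0, plus uniformly scattered outliers.
floor = np.column_stack([rng.uniform(-1, 1, 300),
                         rng.normal(0.0, 0.005, 300),
                         rng.uniform(-1, 1, 300)])
outliers = rng.uniform(-1, 1, size=(60, 3))
points = np.vstack([floor, outliers])

(normal, offset), inliers = fit_plane_ransac(points)
print(normal, offset, inliers)
```

The recovered normal then serves as the vertical axis around which the room box is rotated.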
Room optimization. We next use differentiable rendering to further optimize the room box, improving visual alignment to the input image. We optimize the pose (object-to-world matrix) of the room, including rotation, translation and scaling. We base our differentiable renderer on PSDR-CUDA and optimize the wall, floor and ceiling positions based on differences between the rendered and predicted depth maps and masks. As PSDR-CUDA handles the edge discontinuities discussed in Section 3.1, losses based on depth and masks can be differentiated and optimized smoothly. Note that the rendered masks have anti-aliased edges, which is crucial for gradient descent to work, as only the edge pixels have non-zero gradients. We use the predicted depth and masks to compute our loss function, minimizing the mean L1 difference of depth predictions and maximizing the intersection over union (IoU) for room walls, ceiling and floor. The IoU metric is extended to correctly handle fractional anti-aliased pixels, giving correct gradients from these terms. Up to the relative weighting of the terms, the loss has the form

$$\mathcal{L}_{\mathrm{room}}(\theta_r) = \mathrm{mean}\,\big| D(\theta_r) - \tilde{D} \big| \;-\; \sum_{i} \mathrm{IoU}\big(M_i(\theta_r), \tilde{M}_i\big),$$

where $\theta_r$ are the room "box" parameters, $i$ varies over the set {wall, floor, ceiling}, $M_i$ and $\tilde{M}_i$ are the rendered and target masks respectively, and $D$ and $\tilde{D}$ are the rendered and target depths respectively. We normalize depths to a unit range.
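One common way to extend IoU to fractional mask values is to replace set intersection with a pixel-wise product and set union with the sum minus the product. The sketch below uses this form as an assumption (the exact extension in our renderer may differ) and writes the IoU term as 1 - IoU, which differs from maximizing IoU only by a constant.

```python
import numpy as np

def soft_iou(rendered, target, eps=1e-8):
    """IoU for masks with values in [0, 1]; reduces to regular IoU on binary masks."""
    inter = np.sum(rendered * target)
    union = np.sum(rendered + target - rendered * target)
    return inter / (union + eps)

def room_loss(rendered_masks, target_masks, rendered_depth, target_depth):
    """Mean L1 depth difference plus (1 - IoU) summed over the room parts."""
    depth_term = np.mean(np.abs(rendered_depth - target_depth))
    iou_term = sum(1.0 - soft_iou(rendered_masks[k], target_masks[k])
                   for k in target_masks)
    return depth_term + iou_term

a = np.array([1.0, 1.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 1.0, 0.0])
print(soft_iou(a, b))                                  # one shared pixel out of three
print(soft_iou(np.array([0.5]), np.array([1.0])))      # half-covered edge pixel

masks = {"floor": a, "wall": b}
depth = np.array([0.2, 0.4, 0.6, 0.8])
print(room_loss(masks, masks, depth, depth))           # near zero for a perfect match
```

Because the product and sum are smooth in the rendered mask values, the fractional edge pixels contribute well-defined gradients.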

Object stage
Model database. We use the 3D-FUTURE dataset [Fu et al. 2021] as the 3D model database. It contains a total of 10,000+ artist-made models (mostly furniture and similar assets typical of indoor scenes) with split material groups for each model, so that objects with multiple materials are possible (for example, a sofa with pillows of a different material). We discard the materials provided by the dataset, since we want to retrieve and optimize materials depending on the input photograph.

Fig. 4. We use CLIP as a zero-shot ranked classifier of 3D geometries from the 3D-FUTURE database. We encode the renderings provided with the database as well as a crop around each object from our input image, and find nearest neighbors using cosine similarity.
Model search. We use CLIP [Radford et al. 2021] to search the database for 3D models that match the input scene objects. We use the renderings available in the database and encode them all using CLIP into normalized 768-dimensional vectors. To search for an appropriate model for each scene object in our segmentation, we crop the target image around the object's mask and encode the cropped image using CLIP. We then select the nearest neighbors in CLIP space, using cosine distance, which gives the database assets that most closely match the input crop.
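Once embeddings are available, the retrieval step reduces to a ranked cosine-similarity lookup. The sketch below uses random vectors as stand-ins for CLIP encodings (no CLIP model is loaded), and `top_k_matches` is a hypothetical helper name.

```python
import numpy as np

def normalize(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def top_k_matches(query, database, k=3):
    """Indices and cosine similarities of the k database entries closest to query."""
    sims = normalize(database) @ normalize(query)
    order = np.argsort(-sims)[:k]
    return order, sims[order]

rng = np.random.default_rng(0)
database = rng.standard_normal((200, 768))     # stand-in for encoded asset renderings
query = database[42] + 0.1 * rng.standard_normal(768)  # stand-in for an encoded crop

order, sims = top_k_matches(query, database)
print(order, sims)
```

Returning the top k entries rather than only the best one is what lets a user pick among the closest matches.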
Coarse geometry prediction. Similarly to the room stage, we predict each object's pose by fitting to the approximate point cloud of the object and refining it using gradient-based optimization to match the image-space segmentation. For initial alignment, we first find the center as the median value of the point cloud over the three axes. We then compute the translation and scaling factor by matching the mesh bounding box center to the point cloud median, and the mesh radius to half the point cloud radius, where radius is defined as the median distance to the center. While this process initializes the scaling to a smaller value than reality, it provides a good starting point for the later stage of geometry optimization; making the objects too small at initialization and letting them grow appears to make the optimization better behaved than trying to initialize the scale accurately.
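The initial alignment can be sketched in a few lines. As an assumption, the mesh radius below is measured as the median vertex distance to the bounding box center; `coarse_align` is a hypothetical name, and rotation is not handled here.

```python
import numpy as np

def coarse_align(mesh_vertices, cloud):
    """Uniform scale and translation placing a mesh at a point cloud's median."""
    center = np.median(cloud, axis=0)
    cloud_radius = np.median(np.linalg.norm(cloud - center, axis=1))
    bbox_center = 0.5 * (mesh_vertices.min(axis=0) + mesh_vertices.max(axis=0))
    mesh_radius = np.median(np.linalg.norm(mesh_vertices - bbox_center, axis=1))
    scale = 0.5 * cloud_radius / mesh_radius   # deliberately half the cloud radius
    translation = center - scale * bbox_center
    return scale, translation

# Toy mesh: the 8 corners of a cube; toy cloud: the same cube, 4x larger and shifted.
mesh = np.array([[x, y, z] for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)],
                dtype=float)
cloud = 4.0 * mesh + np.array([1.0, 2.0, 3.0])

scale, translation = coarse_align(mesh, cloud)
print(scale, translation)
```

The cloud is four times larger than the mesh, yet the returned scale is 2, reflecting the intentionally undersized initialization.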
We make the vertical axis of each object orthogonal to the room floor; this heuristic is appropriate for most objects common in indoor scenes. We try a number of rotations around this vertical axis and pick as the initial rotation the one with the minimum mean L1 difference between the rendered mask and the object's segmentation mask. If the bottom face of an object's bounding box and the floor are close (distance < 0.1) and the mask edge is close enough to the floor mask (distance < 20px), we mark the object as "on the floor" and use a floor-distance loss when optimizing that object's position. If an object is floating, we check whether its mask edge is close enough (distance < 20px) to the edge of another object on the floor. If so, we constrain the floating object to stay on top of this parent object.
Refining geometry. Next, we use differentiable rendering to perform a joint pose optimization for all objects. We optimize the scaling, horizontal translation and rotation around the vertical axis (keeping alignment with the room floor/ceiling) for each object. This ensures that objects snapped to the floor/ceiling preserve this constraint. Similarly, for objects on the wall, the optimization transforms and scales them only along the wall. Our differentiable renderer generates the masks with anti-aliasing, which is critical for obtaining non-zero gradients in mask-based losses. For the loss function, we use the mean L1 difference between the rendered and estimated segmentation masks, multiplied by the depth, for each object. We also use the L1 difference of heavily blurred versions of the masks, using a Gaussian blur. Up to the relative weighting of the terms, the loss has the form

$$\mathcal{L}_{\mathrm{obj}}(\theta_o) = \sum_{j} \Big( \big\| M_j(\theta_o) \odot D(\theta_o) - \tilde{M}_j \odot \tilde{D} \big\|_1 + \big\| G(M_j(\theta_o)) - G(\tilde{M}_j) \big\|_1 \Big),$$

where $\theta_o$ are the object pose parameters, $j$ varies over the set of objects, $M_j$ and $\tilde{M}_j$ are the rendered and target masks respectively, and $D$ and $\tilde{D}$ are the rendered and target depths respectively, normalized to unit range. $G$ is a 2D Gaussian convolution pyramid of a mask.
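The benefit of the blurred term shows up when the rendered and target masks do not overlap: a plain L1 mask difference is then the same wherever the rendered mask sits, while the blurred version decreases as the masks approach each other. The sketch below demonstrates this with a simple separable Gaussian blur; the sigma values and the toy masks are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel truncated at 3 sigma."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def blur(mask, sigma):
    """Separable Gaussian blur with zero padding (kernel must be shorter than image)."""
    k = gaussian_kernel(sigma)
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, mask)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, rows)

def plain_l1(a, b):
    return float(np.mean(np.abs(a - b)))

def blurred_l1(a, b, sigmas=(2.0, 4.0, 8.0)):
    """Sum of mean L1 differences between blurred masks over several blur levels."""
    return sum(plain_l1(blur(a, s), blur(b, s)) for s in sigmas)

def square(col):
    """128 x 128 mask with a 10 x 10 square whose left edge is at column col."""
    m = np.zeros((128, 128))
    m[59:69, col:col + 10] = 1.0
    return m

target, far, near = square(90), square(30), square(75)
print(plain_l1(far, target), plain_l1(near, target))      # identical
print(blurred_l1(far, target), blurred_l1(near, target))  # near is smaller
```

The blurred term therefore gives the optimizer a direction to move in even before the rendered object starts to overlap its target segment.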

Material and lighting stage
In this stage, we predict and optimize the lighting and materials for the scene. For this, we establish a list of crop pairs between the input image and the rendering. Our optimization matches appearance between the crops of each pair, in addition to optimizing a low-resolution full-image loss. These crop pairs are normally detected automatically, but can optionally be provided by the user for the best appearance match.
Crop pairs. To achieve the best robustness and closest appearance match for different material parts, we sample a number of crops inside the intersection region of the rendered mask of a mesh material group and the object segmentation mask. For each object $j$, let $A_j := M_j(\theta_o) \cap \tilde{M}_j$ be the intersection between the rendered mask $M_j$ and the target mask $\tilde{M}_j$ for this object. Then, we compute a rectangular mask $W_j \subset A_j$ with maximal size to use as the crop window for this object in both the rendered and the target images. In this way, $(C_j, \tilde{C}_j)$ forms a crop pair, where $C_j = I[W_j]$ and $\tilde{C}_j = \tilde{I}[W_j]$ denote, respectively, crops of the rendered image $I$ and the input image $\tilde{I}$ using the window $W_j$.
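Finding a maximal rectangle inside a binary mask is a classical problem. One standard solution, shown here as an assumption about how such a window could be computed rather than as our exact procedure, sweeps the rows while maintaining per-column run heights and solves a largest-rectangle-in-histogram problem for each row with a stack.

```python
import numpy as np

def largest_rectangle(mask):
    """Return (top, left, height, width) of the largest all-True rectangle."""
    rows, cols = mask.shape
    heights = [0] * cols                        # consecutive True count per column
    best, best_area = (0, 0, 0, 0), 0
    for r in range(rows):
        heights = [h + 1 if m else 0 for h, m in zip(heights, mask[r])]
        stack = []                              # (start column, height), increasing
        for c in range(cols + 1):
            h = heights[c] if c < cols else 0   # sentinel flushes the stack
            start = c
            while stack and stack[-1][1] >= h:
                start, sh = stack.pop()
                area = sh * (c - start)
                if area > best_area:
                    best_area = area
                    best = (r - sh + 1, start, sh, c - start)
            stack.append((start, h))
    return best

mask = np.zeros((6, 8), dtype=bool)
mask[1:5, 2:7] = True                           # a 4 x 5 block
mask[0, :] = True                               # plus a full top row
print(largest_rectangle(mask))                  # the block extended by the top row
```

In the pipeline, `mask` would be the intersection of the rendered and target masks, and the returned window would be applied to both images to form the crop pair.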
To compare the textures between each pair of crops during optimization, we use a Gram matrix loss as we found it very reliable for optimization, outperforming alternative losses like masked VGG.
Light initialization. In this stage, we generate the initial lighting for the scene. We split the room ceiling into a grid of area lights and scale them down by a ratio of 0.8. We remove the lights that intersect the camera frustum or are completely behind the camera. We also add one large area light behind the camera and one per invisible wall, simulating an invisible window. We also turn visible windows and lamps into emissive objects. For windows, we add a simple frame, though a database of window geometries could be used if available. For lamps, we make the largest visible rendered part emissive.
We try a number of uniform light radiance values as initialization and pick the one with the minimum down-sampled L1 loss on luminance images as our initial light radiance. As light transport is linear in light intensity, only a single rendering with unit radiance is needed for this step.
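Because the image scales linearly with a uniform radiance, every candidate can be evaluated by rescaling one unit-radiance render. A minimal sketch follows; the Rec. 709 luminance weights, the downsampling factor, the candidate list, and the random stand-in images are all illustrative assumptions.

```python
import numpy as np

def luminance(rgb):
    """Rec. 709 luminance of an (H, W, 3) linear RGB image."""
    return rgb @ np.array([0.2126, 0.7152, 0.0722])

def downsample(img, factor):
    """Box-filter downsampling of an (H, W) image by an integer factor."""
    h, w = img.shape
    img = img[:h - h % factor, :w - w % factor]
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def pick_radiance(unit_render, target, candidates, factor=8):
    """Choose the radiance whose rescaled unit render best matches the target."""
    lum_unit = downsample(luminance(unit_render), factor)
    lum_target = downsample(luminance(target), factor)
    losses = [np.mean(np.abs(s * lum_unit - lum_target)) for s in candidates]
    return candidates[int(np.argmin(losses))]

rng = np.random.default_rng(0)
unit_render = rng.uniform(0.0, 1.0, size=(32, 32, 3))  # stand-in unit-radiance render
target = 4.0 * unit_render                              # stand-in input photo
candidates = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
best_radiance = pick_radiance(unit_render, target, candidates)
print(best_radiance)
```

This search only needs to land in the right order of magnitude, since the radiance is refined later in the joint optimization.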
Material initialization and search. For material parts that do not have any available crop pair because of a small mask area, or for which the cosine similarity of the best-matched CLIP search result is below 0.25, we use homogeneous parameters (constant albedo, roughness, specular). We initialize the homogeneous material albedo to the median color of the mesh-mask intersection area and the roughness to 0.5. If there is no intersection, we use the median color of the entire object segment.
For each material part that needs a procedural material (i.e., that has at least one crop pair $(C_j, \tilde{C}_j)$), we use the crop $\tilde{C}_j$ to run a CLIP search on rendered thumbnails of 118 procedural materials from the DiffMatV2 library provided by the MATch method [Li et al. 2023]. Our material search is based on cosine similarity of normalized CLIP encodings, the same as the object search described above. If multiple crops lie on a single material part, all pairs are used for the CLIP search; for each crop we evaluate the top-10 most similar materials. We use a voting scheme and select materials appearing for multiple crops. Out of these selected materials, we pick the one with the highest mean cosine similarity. To determine the right texture transform, we sample the chosen material with different scales (0.5-8.0) and rotations (-45, 0, 45, 90) and find the one with the minimum Gram matrix loss on that crop pair. If the Gram matrix loss of a homogeneous material is better than that of any of the texture transforms, we refrain from applying textures to this material and treat it as homogeneous. Note that we render and compare these crops under the initial lighting in the scene, as the final lighting has not yet been optimized at this stage.
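The voting step can be sketched as follows. The interpretation is an assumption: a material qualifies if it appears in the top-k list of at least two crops (or of the only crop, when there is just one), with a fallback to all voted materials when none qualifies; `select_material` is a hypothetical name.

```python
from collections import Counter
import numpy as np

def select_material(crop_sims, top_k=10):
    """Pick a material index from a (num_crops, num_materials) similarity matrix."""
    crop_sims = np.asarray(crop_sims, dtype=float)
    votes = Counter()
    for sims in crop_sims:                     # each crop votes for its top-k
        votes.update(np.argsort(-sims)[:top_k].tolist())
    needed = min(2, len(crop_sims))            # "appears for multiple crops"
    finalists = [m for m, v in votes.items() if v >= needed]
    if not finalists:                          # fallback: keep every voted material
        finalists = list(votes)
    mean_sims = crop_sims.mean(axis=0)
    return max(finalists, key=lambda m: mean_sims[m])

# Three crops, five materials, each crop voting for its two best matches.
crop_sims = [[0.90, 0.80, 0.10, 0.95, 0.10],
             [0.20, 0.70, 0.60, 0.10, 0.10],
             [0.85, 0.10, 0.50, 0.10, 0.10]]
chosen = select_material(crop_sims, top_k=2)
print(chosen)
```

In this toy example, material 3 has the single highest similarity but is supported by only one crop, so the vote discards it in favor of a material that matches several crops.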
Full scene material and lighting optimization. In the final stage of our pipeline, we jointly optimize all materials and lights in the scene, under full global illumination. The loss function we use combines several terms: the L1 difference between the input image and the rendering (both downsized to 1/8), the Gram matrix difference for each crop pair, and the mean RGB color per object mask. Up to the relative weighting of the terms, the complete loss for joint material and lighting optimization has the form

$$\mathcal{L}(\theta) = \big\| I_{1/8}(\theta) - \tilde{I}_{1/8} \big\|_1 + \sum_{k} \big\| T(C_k(\theta)) - T(\tilde{C}_k) \big\|_1 + \sum_{j} \big\| \mu_j(I(\theta)) - \tilde{c}_j \big\|_1,$$

where $\theta$ is the vector of all material and lighting parameters, $I_{1/8}$ and $\tilde{I}_{1/8}$ are the rendered and input images, downsized to 1/8 of the original height and width, $k$ iterates over all crop pairs $(C_k, \tilde{C}_k)$, $\mu_j$ denotes the RGB mean over a mesh mask, $\tilde{c}_j$ denotes the median color computed during initialization, and $T$ is the Gram matrix texture descriptor of a crop. Note that the optimization over procedural material parameters is bound to the plausible texture manifold allowed by the node graph, which acts as a prior and ensures that the optimization generates only high-quality textures without baking in lighting cues, even in invisible parts of the scene.

RESULTS
Scene edits. In Fig. 8, we show examples of material editing. We can modify the parameters of the procedural node graphs (c) or switch to a completely different procedural graph material for the floor (d). Please refer to the Supplemental Material for more editing examples.
Ablation on database size. As described in §4, our pipeline retrieves object geometries from an input database. Fortunately, as demonstrated in Fig. 9, our pipeline is robust to the choice of this database. Even with only 5% of the data used (containing 800 3D models), our pipeline remains well-behaved.
Comparison with IM2CAD. We compare to IM2CAD [Izadinia et al. 2017] in Fig. 10. Since the implementation of this technique has not been made publicly available, we apply our pipeline to two examples from their paper. To evaluate the results quantitatively, we use the LPIPS metric [Zhang et al. 2018], which is appropriate for comparing images with some misalignment, and RMSE (computed on downsampled images). The quantitative errors are shown under the images, with the lowest error in bold, showing that our approach better matches the target photographs.
Comparison with PhotoScene. In Fig. 11, we compare results generated with our technique (b) and two pipelines (c, d) based on PhotoScene [Yeh et al. 2022]. As in the comparison with IM2CAD in Fig. 10, we report LPIPS and RMSE metrics to show the effectiveness of our method. PhotoScene with our geometry (c) produces numerically good results, but worse than ours, due to several factors. First, PhotoScene uses simpler approximations for differentiable rendering than our path-space system. Second, we found that comparing texture on rectangular crops is more robust than the masked VGG losses used by PhotoScene. Third, our CLIP-based search provides better results than the VGG-based Gram matrix distance, as demonstrated on flat surfaces in the supplemental materials. When using PhotoScene with geometries reconstructed by Total3D (d), the reconstruction quality becomes poor, although it should be noted that Total3D [Nie et al. 2020] is solving the much harder problem of directly predicting the meshes, rather than searching for them in a database.
Using user-specified crops. Results generated by our automated pipeline can be further improved with minimal user input. We demonstrate this in Fig. 12, where user-specified crops are used to drive the material and lighting stage (§4.4).
Performance and supplementary materials. On average, the room optimization stage takes 100 iterations, with less than 0.1s per iteration. The geometry initialization and CLIP search usually take around 5s per object in the scene. The optimization time per iteration is around 1-3 seconds, with a total of 300 iterations. The fine stage usually takes 3-5 seconds per iteration, with about 120 iterations. A more detailed performance table is shown in the supplementary materials. The supplementary materials also show animations illustrating optimization progress and partial results.
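Multiplying the per-iteration times by the iteration counts gives rough totals: under 10 seconds for the room stage, about 5 to 15 minutes for the 300-iteration stage, and about 6 to 10 minutes for the fine stage. The short script below reproduces this arithmetic. The stage labels and the example object count are placeholders chosen for illustration; the text does not name the 300-iteration stage or give a typical number of objects.

```python
# (iterations, min seconds per iteration, max seconds per iteration)
STAGES = {
    "room optimization": (100, 0.0, 0.1),        # "less than 0.1s" per iteration
    "300-iteration optimization": (300, 1.0, 3.0),
    "fine stage": (120, 3.0, 5.0),
}

def stage_seconds(iterations, low, high):
    """Return the (min, max) total time of a stage in seconds."""
    return iterations * low, iterations * high

def init_seconds(n_objects, per_object=5.0):
    """Geometry initialization and CLIP search, about 5 s per object."""
    return n_objects * per_object

if __name__ == "__main__":
    for name, spec in STAGES.items():
        lo, hi = stage_seconds(*spec)
        print(f"{name}: {lo / 60:.1f} to {hi / 60:.1f} minutes")
    # Hypothetical scene with 10 objects (the object count is an assumption).
    print(f"initialization: {init_seconds(10):.0f} seconds")
```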

LIMITATIONS AND FUTURE WORK
While PSDR-Room already gives useful results, there are some challenges left. First, while our fully automatic pipeline outperforms previous work, it is not perfect, and can sometimes benefit from user inputs. For example, user-provided crop pairs can focus attention on matching materials that matter to a human viewer, and establish correspondences between material parts that are semantically related but not aligned in image space. Further, a user picking from, say, top-3 materials and objects returned by CLIP search can sometimes make a better choice than simply using the highest-ranked result.
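Presenting several candidates to the user amounts to ranking the database by similarity and keeping the best k entries instead of only the first. A minimal sketch follows, assuming the query crop and the database entries have already been encoded as embedding vectors (for example by a CLIP image encoder) and are compared by cosine similarity; the function name and array layout are hypothetical.

```python
import numpy as np

def top_k_matches(query, database, k=3):
    """Return the k database rows most similar to the query embedding.

    query:    (D,) embedding of the image crop (assumed precomputed)
    database: (N, D) embeddings of the candidate materials or objects
    Returns a list of (index, cosine similarity), best match first.
    """
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    scores = db @ q                      # cosine similarity per entry
    order = np.argsort(-scores)[:k]      # indices of the k highest scores
    return [(int(i), float(scores[i])) for i in order]
```

With k=1 this reduces to the fully automatic choice of the highest-ranked result; a larger k gives the user a short list to pick from.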
Our current box assumption is simple, and could be extended to more intricate floor plan layouts, as well as non-horizontal ceilings. Our work pushes even the most advanced differentiable rendering approaches to their limits. In particular, intersections between objects cause erroneous zero gradients due to a missing boundary term. Differentiable collision detection during the optimization process could help prevent two objects from intersecting, but remains outside the scope of this work.
Object databases with well-separated material parts are not easy to come by and our material database is fairly small; a commercial deployment of our method would likely invest in curating larger databases of objects and materials with the right properties.

CONCLUSION
In this paper, we introduced PSDR-Room, an end-to-end pipeline that generates 3D models of indoor scenes, complete with objects, lighting and textured materials, with minimal user input. Based on a single segmented image, our technique uses CLIP search to automatically select the shapes and materials for each object from databases. Further, our pipeline refines the object poses, lighting and materials of each object by leveraging physics-based differentiable rendering, significantly improving the reconstruction quality compared to previous work when evaluated on real indoor photographs. Our generated 3D models can be easily edited by the user as a postprocess. We believe our pipeline provides a solid foundation for future improvement towards better learning-based initialization approaches, more complex room layouts, or other environments beyond room scenes.
Fig. 5. Object Stage: We take the segmentation and depth map to construct segmented point clouds for coarse geometry prediction. We initialize object poses to the point cloud and use differentiable rendering in image space (with mask and depth losses) to further refine the poses.