Enhancing Diffusion Models with 3D Perspective Geometry Constraints

While perspective is a well-studied topic in art, it is generally taken for granted in images. However, for the recent wave of high-quality image synthesis methods such as latent diffusion models, perspective accuracy is not an explicit requirement. Since these methods are capable of outputting a wide gamut of possible images, it is difficult for these synthesized images to adhere to the principles of linear perspective. We introduce a novel geometric constraint in the training process of generative models to enforce perspective accuracy. We show that outputs of models trained with this constraint both appear more realistic and improve performance of downstream models trained on generated images. Subjective human trials show that images generated with latent diffusion models trained with our constraint are preferred over images from the Stable Diffusion V2 model 70% of the time. SOTA monocular depth estimation models such as DPT and PixelFormer, fine-tuned on our images, outperform the original models trained on real images by up to 7.03% in RMSE and 19.3% in SqRel on the KITTI test set for zero-shot transfer.

Additional Key Words and Phrases: Diffusion Models, Perspective Constraints, Depth Estimation

INTRODUCTION
"Re-draw The School of Athens in the style of Van Gogh", "Show an aerial viewpoint of the Washington Monument". The introduction of recent text-to-image synthesis methods such as latent diffusion models has drastically increased our creative capabilities. These models can generate anything from a Renaissance-style painting to an everyday smartphone selfie from just a simple text prompt. However, as powerful as these models can be, their limited ability to adhere to physical constraints that are explicitly present in natural images restricts their potential [Wang et al. 2022]. In contrast, traditional methods of image generation such as hand-drawn art or ray-traced images place careful attention on ensuring an accurate physical environment, including geometry and lighting. One of the largest advancements in the photo-realism of hand-drawn art was the development of a system for drawing accurate perspective geometry in the 1400s. While the gap between real and generated images is not as large for diffusion models as it was back then, a greater consideration for perspective accuracy can have a similarly large impact on the photo-realism of their outputs.
Perspective is one of the most important physical constraints because it ensures object properties such as size, relative location, and depth are accurately represented. In a sense, it ensures physical accuracy [Kadambi 2020]. This allows the use of perspective-accurate data for downstream tasks such as camera calibration [Beardsley and Murray 1992; Caprile and Torre 1990; Chen and Jiang 1991; He and Li 2007; Li et al. 2010], 3D reconstruction [Guillou et al. 2000; Wang et al. 2009], scene understanding [Geiger et al. 2014; Han and Zhu 2009; Satkin et al. 2012], and SLAM [Camposeco and Pollefeys 2015; Georgis et al. 2022; Lim et al. 2022].
However, current diffusion-based image generators such as [Bau et al. 2021; Radford et al. 2021; Razavi et al. 2019; Rombach et al. 2022b; Yu et al. 2022] do not generate perspectively accurate data [Farid 2022b]. Please refer to Fig. 1(a) for an example of this phenomenon. This is because latent diffusion models typically lack the interpretability necessary for explicit encoding of a physical prior such as perspective in the model architecture [Kadambi et al. 2023]. By utilizing a novel loss function that ensures the gradient field of an image aligns with its expected vanishing points, we are able to encode this physical prior. By enforcing this perspective prior on generated images, we also increase the accuracy of object properties important for downstream computer vision tasks and photo-realism.
As it turns out, the perspective correctness of an image has a strong influence over its overall scene coherence and therefore its realism. This is most likely because, as mentioned before, perspective provides crucial information regarding the size, relative location, and depth of a scene. To illustrate this, we set up a human subjective test where the photo-realism of our perspective-corrected images is put to the test. We show that latent diffusion models which utilize our novel perspective loss generate images that are rated as more realistic an overwhelming majority of the time as compared to images generated by the base diffusion model. We also verify the visual benefits of our proposed constraint by applying it to the inpainting task. We show that inpainted images generated from models trained with our loss consistently appear more perceptually similar to the original image than images from models without our loss.
Additionally, images generated with our novel loss prove beneficial to the accuracy of downstream tasks which are inherently reliant on these same object properties. As proof of this concept, we fine-tune multiple SOTA monocular depth estimation models such as DPT [Ranftl et al. 2021] and PixelFormer [Agarwal and Arora 2023]. We show that training on data with accurate perspective leads to models with higher performance that better capture high-frequency details.

Contributions
In summary, we make the following contributions:
• We introduce a novel geometric constraint on the training process of latent diffusion models to enforce perspective accuracy.
• We show that images from models trained with this constraint appear more realistic than images from models trained without it 69.6% of the time.

RELATED WORK

Synthetic Image Generation
Image generation, while a popular task, has proven to be difficult because of the high-dimensional space and variety of images. One of the most popular techniques for image generation has been Generative Adversarial Networks (GANs) [Goodfellow et al. 2020]. While GANs are capable of high-quality image synthesis [Brock et al. 2019], they are limited by the fact that they are difficult to train, often failing to converge or collapsing into a mode where all generated images are the same [Arjovsky et al. 2017; Mescheder et al. 2018]. More recently, diffusion models [Sohl-Dickstein et al. 2015] for image generation have grown in popularity. These models work by reversing a diffusion process which adds noise to high-quality images, and they are capable of generating high-quality samples from a variety of distributions [Daras et al. 2022; Dhariwal and Nichol 2021; Ho et al. 2020]. Subsequent works have expanded the scope even further by adding text guidance to the diffusion process [Ramesh et al. 2022; Saharia et al. 2022], simplifying the inverse process [Wallace et al. 2022], and reformulating the diffusion process to occur in a latent space for speed benefits [Rombach et al. 2022b]. While recent work has explored guiding diffusion models in various ways [Ho and Salimans 2022; Meng et al. 2023; Rombach et al. 2022a; Wallace et al. 2023], most diffusion models rely almost entirely on their vast datasets and text encoders for priors on scene composition and object properties. This means that there are no explicit guarantees that generated images will be physically accurate, making them a poor fit for use in synthetic datasets. Our work aims to add 3D geometry constraints to image generators in order to improve the quality of generated images.

A specific task in the space of synthetic image generation that is related to our work is the edge-to-image synthesis problem. In this task, the diffusion model is conditioned on both a text prompt as well as an edge map of the scene we want to generate [Batzolis et al. 2021, 2022]. Although this is similar to our task in terms of constraints on edges in an image, they are not quite the same problem: for the edge-to-image task, the goal of training is to have a model which can follow the provided edge map faithfully [Xu et al. 2020]. If this is achieved, perspective accuracy can be achieved by providing perspectively accurate edge maps. However, for our work, the task is to instead train a model that can generate perspectively accurate images without access to an edge map, meaning our models require less input and are more general.

Vanishing Points in Computer Vision
Vanishing points have many varied and important uses in computer vision. One common use for vanishing points is camera calibration. Early examples of this include [Beardsley and Murray 1992; Caprile and Torre 1990; Chen and Jiang 1991], who use vanishing point geometry to compute the intrinsics and extrinsics of one or more cameras given single or multiple images. Subsequent papers, such as [He and Li 2007; Li et al. 2010], provided improved techniques that were simpler or required less data and fewer assumptions. In addition, newer works began to not only compute camera parameters, but also use them to compute 3D reconstructions from single images [Guillou et al. 2000; Wang et al. 2009].

[Fig. 2 caption: Vanishing points are labeled in blue, perspective lines are in red, and the horizon lines are in light green. One-point perspective is typically used when there is one focal point of the image or when only one side of an object is visible. Two-point perspective is used to illustrate multiple sides of an object, while three-point perspective is used for viewpoints that are above or below the horizon line of the 3D scene.]

Beyond camera calibration, vanishing points are also useful for general scene understanding. [Han and Zhu 2009] use vanishing points to help create a generative grammar for synthetic scenes, [Geiger et al. 2014] use vanishing points as priors for 3D scene and traffic understanding, and [Satkin et al. 2012] estimate 3D models from single images using vanishing point priors. Vanishing points are also particularly useful for road detection thanks to easily identifiable perspective lines, as demonstrated by [Kong et al. 2009; Liou and Jain 1987]. Vanishing points are also regularly used in SLAM techniques. [Lee et al. 2009] were one of the first in this space, using vanishing points to identify the heading of a robot for navigation. Subsequent works further expanded the capabilities of SLAM systems built on vanishing points, including [Camposeco and Pollefeys 2015; Georgis et al. 2022; Lim et al. 2022], who use vanishing points to identify direction and perform structural mapping of scenes in real-time. Given the significance of vanishing points in computer vision, we aim to enhance image generators with accurate perspective, in order to benefit photo-realism and downstream tasks.
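As a concrete instance of calibration from vanishing points, under the common simplifying assumptions of square pixels, zero skew, and a known principal point, the focal length follows directly from two vanishing points of orthogonal 3D directions. The sketch below is illustrative (the function name and interface are ours, not from any cited work):

```python
import numpy as np

def focal_from_orthogonal_vps(v1, v2, principal_point):
    """Estimate focal length from two vanishing points of orthogonal
    3D directions. For K = [[f,0,cx],[0,f,cy],[0,0,1]] and orthogonal
    directions d1, d2, the constraint d1 . d2 = 0 reduces to
    f^2 = -(v1 - p) . (v2 - p), with p the principal point."""
    p = np.asarray(principal_point, dtype=float)
    d = np.dot(np.asarray(v1, float) - p, np.asarray(v2, float) - p)
    if d >= 0:
        raise ValueError("vanishing points inconsistent with orthogonal directions")
    return np.sqrt(-d)
```

For example, a camera with focal length 800 px and principal point (320, 240) viewing the orthogonal directions (1, 0, 1) and (-1, 0, 1) yields vanishing points (1120, 240) and (-480, 240), from which the focal length is recovered exactly.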
In addition to vanishing points, perspective has been used in computer vision for computational photography tasks. For example, many works use perspective principles to allow for editing the focal length and camera position of an image after it is taken [Badki et al. 2017; Liu et al. 2022]. Another application of perspective is techniques which aim to reduce distortion in wide-angle images [Carroll et al. 2009; Shih et al. 2019]. These techniques often learn the perspective projection of an image and then find transformations to achieve the desired un-distorted images. Other works have also gone the opposite direction by introducing new types of perspective projections that are not necessarily physically accurate but can result in artistic and aesthetic images [Agrawala et al. 2000; Collomosse and Hall 2003].

Monocular Depth Estimation
Supervised methods for monocular depth estimation typically require paired image and depth data. One of the first works in this area was Make3D [Saxena et al. 2008], which relied on hand-crafted features and Markov random fields. Subsequent works then applied deep learning to the problem, starting with multi-scale convolutional networks [Eigen et al. 2014] and followed by conditional random fields [Li et al. 2015], residual networks [Laina et al. 2016], convolutional neural fields [Liu et al. 2015; Xu et al. 2018], and most recently transformers [Agarwal and Arora 2023; Ranftl et al. 2020, 2021]. Many approaches also take advantage of known geometric relationships, such as normals [Qi et al. 2018] and planes [Lee et al. 2019; Yang and Zhou 2018]. Newer techniques have also taken an unsupervised approach [Fei et al. 2019; Wong and Soatto 2019] or use multi-modal data capture [Singh et al. 2023]. However, most supervised monocular depth estimation models are limited by the availability of paired data on which to train, as this data is difficult to collect.
In order to overcome the challenge of a lack of sufficient training data, many techniques turn to synthetic datasets. The renderers used to generate the images in these datasets can often generate corresponding ground-truth data, making it simple to acquire pixel-aligned ground-truth depth maps. In addition, these renderers often allow for different types of data, such as varied weather conditions or indoor vs. outdoor scenes, making them an attractive way to get training data. Examples of such datasets include Virtual KITTI, a photorealistic copy of the popular self-driving dataset KITTI [Gaidon et al. 2016; Geiger et al. 2013], and SYNTHIA, a dataset that includes depth and semantic segmentation information for images of a synthetic city [Ros et al. 2016; Zolfaghari Bengar et al. 2019]. Although these datasets are often quite realistic, there are key differences between synthetic and real images, which lead models trained on synthetic images to achieve lower performance when tested on real datasets compared to models trained and tested on real images. This difference in performance is referred to as the Sim2Real gap. As monocular depth estimation is a popular task, many works have attempted to address the problem of the Sim2Real gap [Cao et al. 2018; Damodaran et al. 2018; Long et al. 2015; Rozantsev et al. 2018; Sankaranarayanan et al. 2018]. However, all of these techniques approach the problem by attempting to improve the neural network architectures. On the other hand, we approach this problem from the perspective of improving the synthetic data used to train the neural networks.
In addition to monocular depth estimation, the techniques we describe in this paper can be easily applied to the task of depth completion as well, since the data format is the same for both tasks [Nazir et al. 2022; Wong et al. 2021; Wong and Soatto 2021].

PERSPECTIVE BACKGROUND

Linear Perspective
Although perspective is a word commonly used in a variety of contexts, it has a very specific meaning in art and photography: techniques used to draw objects in 2D such that their 3D attributes are correctly modeled. In practice, perspective refers to a multitude of different techniques which can be used to create a 3D feel, but the most common technique is called linear perspective. In linear perspective, all mutually parallel lines in 3D space (on the same or parallel planes) converge to a single point in the image plane, which is referred to as a vanishing point. The only exception to this rule is sets of lines that are exactly parallel to the camera sensor; in this case, the lines remain parallel in the image plane. A typical drawing or image has anywhere from one to three vanishing points, with the number of vanishing points determining the style and view of the image. Another key component of linear perspective is the horizon line: a horizontal line that represents the viewer's eye level in an image, on which at least one of the vanishing points typically lies. A visualization of these principles can be found in Fig. 2.

Perspective Consistency in Images
Perspective in images is not always easy to confirm, as the vanishing points of an image can only be easily identified with the aid of parallel lines in 3D space, which may not always exist in images.For images that do have sets of parallel lines, perspective consistency can be verified by extending sets of parallel lines in either direction until they intersect and ensuring that all pairs of lines in a set intersect at the same point.
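The pairwise-intersection check described above can be sketched in a few lines of numpy (an illustration only; the function names and the tolerance are ours):

```python
import numpy as np

def line_intersection(p1, d1, p2, d2):
    # Solve p1 + t1*d1 = p2 + t2*d2 for the 2D intersection point.
    A = np.array([[d1[0], -d2[0]], [d1[1], -d2[1]]], dtype=float)
    b = np.array([p2[0] - p1[0], p2[1] - p1[1]], dtype=float)
    t = np.linalg.solve(A, b)  # raises LinAlgError if the image lines are parallel
    return np.asarray(p1, float) + t[0] * np.asarray(d1, float)

def consistent_vanishing_point(lines, tol=1e-6):
    """lines: list of (point, direction) pairs for one parallel set in 3D.
    Returns (is_consistent, vp): whether every pair of lines meets at the
    same image point, and the mean of the pairwise intersections."""
    pts = [line_intersection(*lines[i], *lines[j])
           for i in range(len(lines)) for j in range(i + 1, len(lines))]
    vp = np.mean(pts, axis=0)
    ok = all(np.linalg.norm(p - vp) < tol for p in pts)
    return ok, vp
```

Three image lines that all pass through (100, 50) are judged consistent and yield that point as the vanishing point; perturbing any one of them breaks the consistency check.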
Natural Images. By the mathematics of perspective projection for a pinhole camera with focal length $f$, a point $\mathbf{X} = (X, Y, Z)$ is projected to a point $\mathbf{x} = (u, v) = (fX/Z,\; fY/Z)$ [Ma et al. 2003]. If we are concerned with a line $\ell(t) = \mathbf{O} + t\mathbf{D}$, substituting the line equation into the projection above and taking the limit as $t$ goes to positive or negative infinity, the projected point converges to $(f D_x / D_z,\; f D_y / D_z)$, which depends only on $\mathbf{D}$. Therefore, sets of parallel lines that are not parallel to the camera plane will all converge to the same point, known as a vanishing point.
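This limit can be checked numerically. The sketch below (assuming a unit focal length by default; names are ours) projects points far along two parallel 3D lines with different origins and compares them to the vanishing point predicted from the direction alone:

```python
import numpy as np

def project(X, f=1.0):
    # Pinhole projection: (X, Y, Z) -> (f X / Z, f Y / Z).
    X = np.asarray(X, dtype=float)
    return f * X[..., :2] / X[..., 2:3]

def vanishing_point(D, f=1.0):
    # Limit of project(O + t D) as t -> infinity; depends only on D.
    D = np.asarray(D, dtype=float)
    return f * D[:2] / D[2]
```

Projecting points at a very large parameter value along two lines sharing the direction (1, 2, 4) gives image points that agree with the predicted vanishing point (0.25, 0.5) regardless of the line origins.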
Synthetic Images. Although natural images are forced to follow perspective rules, there are no such restrictions on synthetic images, particularly images generated by deep learning approaches. Most of the loss functions used to train these models focus on image quality or how well prompts are followed, meaning physical properties such as perspective, shadows, or lighting can often be neglected [Farid 2022a,b]. An example of this can be seen in Fig. 1(a).

IMPROVING PERSPECTIVE ACCURACY OF GENERATED IMAGES
Our fine-tuned model is built on top of the latent diffusion models introduced by [Rombach et al. 2022b], using code from [Pinkney 2022]. We describe the latent diffusion process in Section 4.1. We add a new term to the traditional loss function and train on a specialized dataset that provides ground truth vanishing points. This new constraint is described in Section 4.2.

Latent Diffusion Models
Traditional image generation diffusion models are concerned with a forward diffusion process over images $\mathbf{x}_0, \ldots, \mathbf{x}_T$:

$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\!\left(\mathbf{x}_t;\; \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\; \beta_t \mathbf{I}\right),$$

where $q$ is the forward diffusion function, $t$ is the current time step, and $\mathbf{I}$ is the identity. $\alpha_t = 1 - \beta_t$, and $\beta_1, \ldots, \beta_T$ compose a pre-selected variance schedule. The reverse process is then parameterized as:

$$p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\!\left(\mathbf{x}_{t-1};\; \mu_\theta(\mathbf{x}_t, t),\; \Sigma(\mathbf{x}_t, t)\right),$$

where $p_\theta$ is defined as the reverse diffusion function and $\Sigma(\mathbf{x}_t, t)$ is typically set to time-dependent constants. $\mu_\theta(\mathbf{x}_t, t)$ is defined as:

$$\mu_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(\mathbf{x}_t, t)\right),$$

where $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, and $\epsilon_\theta(\mathbf{x}_t, t)$ is a learned function parameterized by a UNet model [Ronneberger et al. 2015] with learned parameters $\theta$. Based on this, the traditional diffusion model loss is as follows:

$$L_{DM} = \mathbb{E}_{\mathbf{x},\, \epsilon \sim \mathcal{N}(0,1),\, t}\!\left[\lVert \epsilon - \epsilon_\theta(\mathbf{x}_t, t) \rVert_2^2\right].$$

More details and derivations can be found in [Ho et al. 2020]. Latent diffusion models work very similarly, but perform the forward and reverse diffusion processes in a latent space. Specifically, an encoder and decoder are introduced to translate to and from the latent space.
The encoder is defined as $\mathcal{E}: \mathbf{x} \in \mathbb{R}^{H \times W \times 3} \mapsto \mathbf{z} \in \mathbb{R}^{h \times w \times 3}$, while the decoder is defined as $\mathcal{D}: \mathbf{z} \mapsto \tilde{\mathbf{x}}$, where $h = H/f$, $w = W/f$, and $f$ is a downsampling factor. With this formulation, the loss function now becomes:

$$L_{LDM} = \mathbb{E}_{\mathcal{E}(\mathbf{x}),\, \epsilon \sim \mathcal{N}(0,1),\, t}\!\left[\lVert \epsilon - \epsilon_\theta(\mathbf{z}_t, t) \rVert_2^2\right],$$

where the image $\mathbf{x}_t$ is replaced by its latent space representation $\mathbf{z}_t$.
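The closed-form forward noising and the $\epsilon$-prediction objective above can be sketched in a few lines of numpy (a toy illustration of the training target only, not the paper's PyTorch implementation; the schedule values and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    # Linear variance schedule beta_1..beta_T; alpha_bar_t = prod(1 - beta_s).
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def q_sample(z0, t, alpha_bars, eps):
    # Closed-form forward diffusion:
    # z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps, eps ~ N(0, I).
    return np.sqrt(alpha_bars[t]) * z0 + np.sqrt(1.0 - alpha_bars[t]) * eps

def ldm_loss(eps_pred, eps):
    # Epsilon-prediction objective: mean squared error ||eps - eps_theta(z_t, t)||^2.
    return np.mean((eps - eps_pred) ** 2)
```

A perfect noise predictor drives this objective to zero, and the cumulative products $\bar{\alpha}_t$ decay monotonically toward zero as $t$ grows.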
In order to add perspective priors to a latent diffusion model, we add an additional perspective loss term. At a high level, this loss works by sweeping lines extending out from a vanishing point over the image and calculating the sum of image gradients across each line, as illustrated in Fig. 3. Pseudocode for this algorithm is shown in Alg. 1. This sum is designed to represent how "edge-like" the region along that line is in the image. We can then write our new loss as:

$$L = L_{LDM} + w\, L_{persp}(\mathbf{v}_x, \hat{\mathbf{x}}),$$

where $w$ is a weight factor for our perspective loss, $\mathbf{v}_x$ is a set of vanishing points in image space, and $\hat{\mathbf{x}}$ is our reconstructed image, which can be written as:

$$\hat{\mathbf{x}} = \mathcal{D}\!\left(\frac{\mathbf{z}_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(\mathbf{z}_t, t)}{\sqrt{\bar{\alpha}_t}}\right),$$

where $t$ is randomly chosen between 0 and $T$ for each iteration. In order to define $L_{persp}$, we first define some intermediate quantities:
• $G_{\mathbf{x}}$ represents the gradients of an image $\mathbf{x}$ computed with a 3×3 Sobel filter.
• $\theta_{min}$ and $\theta_{max}$ represent the minimum and maximum angle from the vanishing point to a corner of the image, relative to the x-axis.
• $\theta_0, \ldots, \theta_N$ represent $N$ equally-spaced angles between $\theta_{min}$ and $\theta_{max}$.
• $v$ represents a particular vanishing point in the set $\mathbf{v}_x$.
• $r_i(v, t)$ represents a point at time $t$ on a ray $r_i(t)$ starting at $v$ in the direction of $\theta_i$.
• $n_i(t)$ represents a vector perpendicular to the ray $r_i(t)$.
Using these, we define:

$$S_i(v, \mathbf{x}) = \int_{t_0}^{t_1} \left| G_{\mathbf{x}}(r_i(v, t)) \cdot n_i(t) \right| \, dt,$$

where $t_0$ and $t_1$ represent the times of the intersection of $r_i(t)$ with the boundary of $\mathbf{x}$. $S_i(v, \mathbf{x})$ is then our measure of how "edge-like" the region along this ray is, and we can then define:

$$L_{persp}(\mathbf{v}_x, \mathbf{x}) = -\sum_{v \in \mathbf{v}_x} \sum_{i=0}^{N} S_i(v, \mathbf{x}).$$

In practice, the integral in Eq. 8 becomes a sum over the image pixels that the line intersects. Our loss function was implemented entirely in PyTorch and is fully differentiable end-to-end.
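The per-ray edge score can be sketched in numpy as follows. This is a simplified, non-differentiable illustration with nearest-pixel sampling (the paper's actual loss is implemented in PyTorch and is fully differentiable; all function names here are ours):

```python
import numpy as np

def sobel_gradients(img):
    # 3x3 Sobel filters over a single-channel image, edge-padded at the border.
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    pad = np.pad(img, 1, mode="edge")
    H, W = img.shape
    gx = np.zeros((H, W))
    gy = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            win = pad[i:i + 3, j:j + 3]
            gx[i, j] = np.sum(win * kx)
            gy[i, j] = np.sum(win * ky)
    return gx, gy

def ray_edge_score(img, vp, theta, n_samples=64):
    # Accumulate |gradient . n| at samples along the ray from vp at angle theta,
    # where n is the unit vector perpendicular to the ray direction. Edges that
    # run along the ray produce gradients parallel to n, so they score highly.
    gx, gy = sobel_gradients(img)
    d = np.array([np.cos(theta), np.sin(theta)])
    n = np.array([-d[1], d[0]])
    H, W = img.shape
    score = 0.0
    for t in np.linspace(0.0, max(H, W) * 1.5, n_samples):
        x, y = vp + t * d
        i, j = int(round(y)), int(round(x))
        if 0 <= i < H and 0 <= j < W:
            score += abs(gx[i, j] * n[0] + gy[i, j] * n[1])
    return score
```

On an image containing a single vertical edge, a ray running along the edge scores much higher than a ray crossing rows of constant intensity, matching the intended "edge-like" interpretation.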

EXPERIMENTS
In order to evaluate our proposed constraint, we conduct comprehensive experiments. In Section 5.1, we detail how we fine-tune latent diffusion models with the proposed constraint. In Section 5.2, we detail how we fine-tune monocular depth estimation models on images generated from our fine-tuned models. In Section 5.3, we describe how we evaluate the photo-realism of images generated from our fine-tuned models, and in Section 5.4, we describe our ablation studies.

Training Latent Diffusion Models
For all of our image generation experiments, we build off the depth-conditioned Stable Diffusion V2 model from [Rombach et al. 2022b]. This model is trained on LAION-5B, a database of 5.85 billion image-caption pairs [Schuhmann et al. 2022]. In this paper, we refer to this model as the baseline model.
Datasets. In order to fine-tune the baseline model, we use the HoliCity dataset [Zhou et al. 2020]. This dataset provides 50,078 real images taken in London along with ground truth vanishing points for each image. We use MiDaS [Ranftl et al. 2020] to compute a depth prediction for each image, which is then used as conditioning for the latent diffusion model. This is the same procedure used to originally train the depth-conditioned model [Rombach et al. 2022b]. Captions used for conditioning are generated for each image using the BLIP captioning model [Li et al. 2022].
Training Details. The code for our fine-tuned model is built using PyTorch on top of [Pinkney 2022], which is in turn built on top of the original code released by [Rombach et al. 2022b]. The original code from [Pinkney 2022] targets Stable Diffusion v1, so our modifications include updating the code to be compatible with Stable Diffusion v2 checkpoints, including updating the encoder/decoder and dataloaders. We update the loss function of the baseline model to the loss function detailed in Eq. 6. We train at an image resolution of 512×512 with a learning rate of 1e-6 and a perspective loss weight of 0.01. We train for 4 epochs, or approximately 200k steps, with an effective batch size of 16 after gradient accumulation. We found that the perspective loss had generally saturated by this point. This training takes approximately 12 hours on 4 RTX 3090 GPUs. Results are shown in Section 6.1.

Inpainting.
In addition to text-to-image generation, we also test the value of our constraint on the inpainting task, where a model is asked to fill in masked regions of an image. Applying our proposed constraint to the inpainting task does not require any extra training, as we are able to take our general text-to-image diffusion models and perform inpainting using the techniques described by [Lugmayr et al. 2022]. We evaluate the results using the LPIPS metric [Zhang et al. 2018], as is the norm for the inpainting task. LPIPS measures the perceptual similarity between two images using features from deep neural networks, in particular AlexNet. Results are shown in Fig. 7 and Table 4 and are discussed in Section 6.1.1.
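The training-free inpainting procedure of [Lugmayr et al. 2022] composites known and generated regions at every denoising step: known pixels are replaced by a freshly noised copy of the original image at the current noise level, while masked pixels keep the model's sample. A minimal numpy sketch of that compositing step (the function name and simplified interface are ours):

```python
import numpy as np

def repaint_step(x_t_generated, x0_known, mask, t, alpha_bars, rng):
    """One RePaint-style compositing step, sketched in numpy.
    mask == 1 marks known (unmasked) pixels; mask == 0 marks the
    region the diffusion model must fill in."""
    eps = rng.standard_normal(x0_known.shape)
    # Noise the known image to the current timestep via the closed-form
    # forward process: x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps.
    x_t_known = np.sqrt(alpha_bars[t]) * x0_known + np.sqrt(1.0 - alpha_bars[t]) * eps
    return mask * x_t_known + (1.0 - mask) * x_t_generated
```

At a noise-free timestep ($\bar{\alpha}_t = 1$) the known region is restored exactly while the masked region keeps the generated content, which is the boundary condition the full sampling loop relies on.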

Training Monocular Depth Estimation Models
In order to evaluate the performance from another perspective, we also conduct an experiment on the effect of our new images on monocular depth estimation models. In particular, we fine-tune DPT-Hybrid [Ranftl et al. 2021] and PixelFormer [Agarwal and Arora 2023] on images generated from both the baseline model and our fine-tuned model. DPT-Hybrid is originally trained on MIX 6, a collection of 10 datasets described in [Ranftl et al. 2021], and PixelFormer is originally trained on the KITTI dataset. In order to generate these images, we rely on the SYNTHIA-AL [Zolfaghari Bengar et al. 2019] and Virtual KITTI 2 [Cabon et al. 2020; Gaidon et al. 2016] datasets. SYNTHIA-AL contains 70,000 images and Virtual KITTI 2 contains 2,656 images. We take only depth maps from both datasets and use them as conditioning to generate synthetic images with both the baseline and our latent diffusion models. In addition, we use BLIP [Li et al. 2022] to generate captions for all images. For Virtual KITTI 2, we take 8 random crops per image. We also generate diffusion images with 4 different seeds, resulting in a total of 84,992 images derived from the Virtual KITTI 2 dataset. For SYNTHIA, we use the original images, resulting in a total of 70,000 images. Combined, our dataset is 154,992 images and covers various city and driving scenes. We refer to this dataset as All. We additionally train the depth estimation models on images generated only from vKITTI, and refer to this dataset as vKITTI. We additionally append the name of the model used to generate different datasets, so that All Enhanced refers to the full set of 155k images generated by our Enhanced model while All Base refers to the full set of images generated by the Baseline model. Results of fine-tuning on these datasets are discussed in Section 6.2.
Training Details. For DPT-Hybrid, we train with a learning rate of 5e-6 for 19,500 steps with a batch size of 16. We use a scale- and shift-invariant loss as described in [Eigen et al. 2014; Ranftl et al. 2021]. For PixelFormer, we train with a learning rate of 4e-6 for 20,800 steps with a batch size of 8. We train on 1 RTX 3090 GPU using the same loss as DPT.
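The core of a scale- and shift-invariant objective can be sketched as a least-squares alignment followed by a mean squared error (a simplified numpy variant; the losses described in [Eigen et al. 2014; Ranftl et al. 2021] add refinements such as trimming and gradient-matching terms, which are omitted here):

```python
import numpy as np

def ssi_loss(pred, gt, mask=None):
    """Scale-and-shift-invariant loss sketch: solve for the per-image
    scale s and shift b that best align the prediction to ground truth
    in the least-squares sense, then score the aligned prediction."""
    if mask is None:
        mask = np.ones_like(gt, dtype=bool)
    p, g = pred[mask], gt[mask]
    # Least squares for [s, b] in: s * p + b ~= g.
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, b), *_ = np.linalg.lstsq(A, g, rcond=None)
    return np.mean((s * p + b - g) ** 2)
```

Because the alignment absorbs any global affine relationship, a prediction that differs from the ground truth only by scale and shift incurs (numerically) zero loss, which is the invariance the loss is named for.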
Test Sets. We evaluate the trained depth estimation models on the commonly used real datasets KITTI [Geiger et al. 2012] and the outdoor subset of DIODE [Vasiljevic et al. 2019]. We use the Eigen split for KITTI [Eigen et al. 2014] and a test set of 500 images from DIODE.
Metrics. In order to evaluate the performance of the models, we follow the procedure used by [Ranftl et al. 2021] and adopt common depth estimation metrics: absolute relative error (Abs Rel), square relative error (Sq Rel), root mean squared error (RMSE), log RMSE (RMSE log), and threshold accuracy (δᵢ) at thresholds of 1.25, 1.25², and 1.25³.
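These metrics can all be computed directly from predicted and ground-truth depth maps; a numpy sketch (the valid-pixel masking convention and key names are ours):

```python
import numpy as np

def depth_metrics(pred, gt):
    # Standard monocular depth evaluation metrics over valid (gt > 0) pixels.
    valid = gt > 0
    p, g = pred[valid], gt[valid]
    ratio = np.maximum(p / g, g / p)  # symmetric ratio for threshold accuracy
    return {
        "AbsRel": np.mean(np.abs(p - g) / g),
        "SqRel": np.mean((p - g) ** 2 / g),
        "RMSE": np.sqrt(np.mean((p - g) ** 2)),
        "RMSElog": np.sqrt(np.mean((np.log(p) - np.log(g)) ** 2)),
        "delta1": np.mean(ratio < 1.25),
        "delta2": np.mean(ratio < 1.25 ** 2),
        "delta3": np.mean(ratio < 1.25 ** 3),
    }
```

A perfect prediction drives all the error metrics to zero and all the threshold accuracies to one, which makes this a convenient sanity check before running a full evaluation.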

Human Subjective Test Methodology
In order to evaluate the photo-realism of images generated by our fine-tuned models, we run human subjective tests on the Prolific [Academic Ltd 2023] website. We run two tests: one comparing our enhanced model with the baseline model, and one comparing our enhanced model with an ablation model. We set up the test as a ranking task where participants are asked to rank sets of three images (Real, Baseline, Ours or Real, Ablation, Ours) in order of photo-realism. The real images come from the HoliCity dataset [Zhou et al. 2020], a landscapes dataset from Kaggle [Rougetet 2020], and an animal images dataset from Kaggle [Awais 2020]. The baseline, ablation, and enhanced (ours) images are generated using depth maps extracted from the real image by MiDaS [Ranftl et al. 2020] and prompts from the BLIP captioning model [Li et al. 2022]. Participants were shown all three images side by side in random order. Please refer to Fig. 4 for a visualization of the testing setup. We recruit 50 participants across the world and ask them to rate 80 sets of images. Participants were given up to 90 minutes to complete the task. Results from this test are in Section 6.3 and Fig. 11.

Ablation Study
In order to evaluate the benefits of our proposed constraint, we perform two ablation studies. First, we fine-tune the baseline model on the same dataset but without our updated loss. We refer to this model as the No Loss/Ablation model. We also train a model which takes in vanishing points as conditioning and is trained without our loss. For both models, we generate the same synthetic datasets and train the same monocular depth estimation models described in Section 5.2. Results are shown in Section 6.4. An ablation study was also done for the human subjective tests and the inpainting task for the no loss model. Results are described in Section 6.3 and shown in Fig. 11, Fig. 7, and Table 4.

RESULTS
This results section is split into sub-sections according to the experiments described in Section 5. In Section 6.1, we describe the results of fine-tuning latent diffusion models. In Section 6.2, we discuss the results of fine-tuning SOTA monocular depth estimation models on our generated images. In Section 6.3, we discuss the results of our human subjective test, and in Section 6.4, we discuss the results of our ablation tests.

Fine-tuned Latent Diffusion Models
We show some representative generations from our fine-tuned model in Fig. 5. In the figure, we show the depth maps used to condition the diffusion models along with generations from the baseline model and our enhanced model. Images from the baseline model tend to suffer from curved lines and distortions that affect perspective accuracy. In particular, the baseline model tends to have trouble accurately generating regions with windows, high-frequency details such as many parallel horizontal or vertical lines, and corners. We also draw perspective lines on images from the baseline and our models in Fig. 8. Images from our model tend to have more coherent perspective lines and more accurate vanishing points. In addition, in both figures, because of the aforementioned distortions, the baseline images look further from the distribution of natural images than images from our model. Since our enhanced model is fine-tuned on a dataset of mainly cityscapes, we also generate varied nature [Rougetet 2020], animal [Awais 2020], and indoor scenes [Vasiljevic et al. 2019] to verify that this fine-tuning does not limit the ability of the model to generate other types of images. Some representative images are shown in Fig. 6. We additionally quantitatively evaluate these images using the FID metric [Heusel et al. 2017]. Our model outperforms both the baseline model and the no loss model. The results are shown in Table 5.
Inpainting. We evaluate the inpainting performance of our models using both qualitative (Fig. 7) and quantitative (Table 4) results. All three models of interest (the baseline model, the ablation model, and the enhanced model) were tested on the combination of two datasets: the HoliCity validation set [Zhou et al. 2020] and a landscape dataset [Rougetet 2020]. The LPIPS metric [Zhang et al. 2018], which measures perceptual similarity using features from deep image networks, was used to compare models, as is the norm for the inpainting task; note that lower is better for LPIPS. We used the official implementation provided by [Zhang et al. 2018]. As seen in Table 4, our enhanced model consistently outperforms both the baseline model and the ablation model, with a 7.1% improvement over the baseline model and a 3.6% improvement over the ablation model on the combined dataset.

Monocular Depth Estimation
In order to evaluate the performance of our fine-tuned depth estimation models, we use both qualitative and quantitative measures.
A qualitative comparison is shown in Fig. 9, while quantitative comparisons are in Table 1 and Table 2.

DPT-Hybrid. We fine-tune one model from the base DPT-Hybrid using the generated vKITTI datasets and then test the model on both the original KITTI test set (Eigen split) and a subset of the DIODE Outdoor test set. Results are in Table 1. The models fine-tuned on images generated from our diffusion model outperform the original DPT-Hybrid model on all metrics on both datasets, and outperform the model fine-tuned on images generated by the baseline model on all metrics for KITTI and all but one metric (SqRel) for DIODE Outdoor. In addition, for the DIODE Outdoor dataset, the original DPT-Hybrid model outperforms the model fine-tuned on baseline images on five of eight metrics, but outperforms our model on none. In particular, our model shows a 7.03% improvement in RMSE and a 19.3% improvement in SqRel over the original model, while also demonstrating a 3.4% improvement in SqRel and a 2.2% improvement in SiLog over the baseline model. Fig. 9 also shows qualitative comparisons between the original DPT-Hybrid model and the model fine-tuned on images generated by our enhanced diffusion model. Each set of images contains the input image, the ground truth depth map (dilated with a 3×3 kernel), and error maps from both the original model and our enhanced model. Additionally, the RMSE values for each of the depth predictions are shown in the top right of the error maps. The depth models trained on our images capture more high-frequency detail, such as corners and poles, and also consistently achieve lower RMSE values.
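The depth metrics reported here (RMSE, SqRel, SiLog) are the standard monocular depth measures. A minimal numpy sketch over valid pixels; the SiLog variance weight λ = 0.85 and the ×100 scaling follow the common KITTI benchmark convention and are assumptions here, not taken from the paper:

```python
import numpy as np

def depth_metrics(pred, gt, lam=0.85):
    """RMSE, squared relative error, and scale-invariant log error,
    computed over pixels with valid (positive) ground-truth depth."""
    mask = gt > 0
    p, g = pred[mask], gt[mask]
    rmse = np.sqrt(np.mean((p - g) ** 2))
    sq_rel = np.mean((p - g) ** 2 / g)
    err = np.log(p) - np.log(g)
    silog = np.sqrt(np.mean(err ** 2) - lam * np.mean(err) ** 2) * 100.0
    return {"rmse": float(rmse), "sq_rel": float(sq_rel), "silog": float(silog)}

# Sanity check: a perfect prediction scores 0 on all three metrics.
gt = np.full((4, 4), 10.0)
perfect = depth_metrics(gt.copy(), gt)
print(perfect)
```

Relative improvements such as the 7.03% RMSE gain are then simply `(old - new) / old * 100` on these values.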
PixelFormer. We fine-tune the base PixelFormer using both the generated vKITTI dataset and the full generated dataset, and evaluate on the DIODE Outdoor test set. We additionally fine-tune a model using the original training set, KITTI [Agarwal and Arora 2023; Geiger et al. 2012]. Results are shown in Table 2. The model fine-tuned on images from our diffusion model outperforms the original model, the models trained on images from the baseline model, and the model trained on KITTI on all metrics. Our model trained on the vKITTI dataset achieves a 4.1% improvement in RMSE over the original model, while our model trained on the entire dataset achieves an 11.6% improvement in SiLog over the original model and a 2.4% improvement over the model trained on baseline images.
Additionally, the original model outperforms the baseline model trained on the entire dataset on five of eight metrics, but does not outperform the model trained on our images on any metric.

Human Subjective Tests
Results from the human subjective tests are shown in Fig. 11: (a) compares our enhanced model with the baseline model, while (b) compares our enhanced model with the ablation model. Over all trials, images from our enhanced model appear more photo-realistic than images from the baseline model 69.6% of the time, and more photo-realistic than images from the ablation model 67.5% of the time. In addition, the average rank of our images (between 1 and 3, lower is better) was 1.9345 vs. 2.4383 against the baseline and 1.9584 vs. 2.4011 against the ablation model. The gaps in average rank between our enhanced images and the baseline images (0.5038) and between our images and the ablation images (0.4427) are also consistently larger than the gaps between our enhanced images and real images (0.3072 and 0.318, respectively), indicating that our images sit closer to real images than to either comparison. Overall, the results show that our proposed geometric constraint helps improve the photo-realism of generated images, as our enhanced images are consistently preferred over images from both the baseline model and the ablation model.

Table 1. Monocular Depth Estimation performance of DPT-Hybrid fine-tuned on our data compared to the base DPT-Hybrid model. The original DPT-Hybrid model was trained on a dataset referred to as MIX 6, a collection of 10 datasets described in [Ranftl et al. 2021]. Fine-tuned models were trained on synthetic datasets generated by either the base stable diffusion model or our fine-tuned model.
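Preference rates and average ranks of this kind can be computed directly from per-trial rankings. A small sketch using hypothetical trial data (the rank values below are illustrative, not the study's actual responses; rank 1 = most realistic):

```python
import numpy as np

def ranking_stats(ours, other):
    """ours/other: per-trial ranks (1-3, lower = more realistic).
    Returns the fraction of trials where ours ranked better,
    plus the two average ranks."""
    ours, other = np.asarray(ours), np.asarray(other)
    preferred = float(np.mean(ours < other))
    return preferred, float(ours.mean()), float(other.mean())

# Hypothetical trials: our model's ranks vs. the baseline's ranks.
ours_ranks = [1, 2, 1, 2, 1]
base_ranks = [2, 1, 3, 3, 2]
pref, avg_ours, avg_base = ranking_stats(ours_ranks, base_ranks)
print(pref, avg_ours, avg_base)  # 0.8 1.4 2.2
```

The reported 69.6% and 67.5% figures, and the average ranks of 1.9345 vs. 2.4383 and 1.9584 vs. 2.4011, correspond to the `preferred` fraction and the two mean ranks over all trials.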

Ablation Study
To evaluate the value of our proposed constraint, we perform extensive comparisons between our enhanced model and the ablation models. We include qualitative results comparing the no loss model and the enhanced model in Fig. 10. The edges and corners in our images are more consistent than similar features in the no loss model's images. We also include quantitative comparisons between depth estimation models trained on the vKITTI dataset from our enhanced diffusion model and depth estimation models trained on the vKITTI datasets from our no loss and conditioned diffusion models. The results of this experiment, for both DPT-Hybrid and PixelFormer, are shown in Table 3. The models trained on our enhanced model's images outperform the models trained on the no loss model's images on all metrics except one (SqRel for DPT-Hybrid trained on the vKITTI dataset and tested on DIODE Outdoor). In addition, our model demonstrates significant improvements, up to 16.11% in RMSE, compared to the no loss model. Our enhanced model also outperforms the conditioned model on all metrics. These results demonstrate that the superior performance of downstream models trained on our enhanced dataset is a result of our proposed constraint rather than of the new images introduced in fine-tuning. Beyond downstream tasks, the human subjective tests also show that our enhanced images are considered more photo-realistic than images from the no loss model 67.5% of the time (Fig. 11). In addition, quantitative and qualitative results on the inpainting task (Fig. 7 and Table 4) further highlight the improvement of our enhanced model over the no loss model. Combined, results from downstream tasks, human subjective tests, and the inpainting task demonstrate that the improvements achieved by our enhanced model are the result of our proposed geometric constraint rather than of fine-tuning on new images.

DISCUSSION
7.1 Limitations
One key limitation of our approach is that fine-tuning the diffusion model requires a dataset of images with vanishing points during training. Although these can be approximated using vanishing point detection tools [Lin et al. 2022; Liu et al. 2021], such tools generally only work for images with strong vanishing lines. For images without these lines, such as nature scenes, our proposed loss would likely be ineffective. Another limitation of our approach is the generation speed of latent diffusion models. On average, it takes ~3 seconds to generate a single image on one RTX 3090, so generating a dataset of 150,000 images takes ~125 hours on a single GPU. This significantly limits the potential size of synthetic datasets generated by latent diffusion models. Finally, although our images are improved compared to the baseline model's images, they are still not quite at the level of real images, as shown by our subjective test results. For example, Fig. 12 shows an image of Big Ben: although perspective lines are accurately depicted in the output, certain semantic details of the image are missing. Additionally, our technique only enforces perspective accuracy, meaning that other physical properties, such as lighting, shadows, or spatial relationships, may still be inaccurate.

Fig. 12. Outputs from stable diffusion are still unable to make certain semantic judgments. Note that the clock shown on Big Ben is not functional and has no hour or minute hand.
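The dataset-generation cost estimate above is simple throughput arithmetic; a quick check (the ~3 s/image figure is the single-GPU estimate quoted above):

```python
# Throughput arithmetic for generating a synthetic dataset on one GPU.
SECONDS_PER_IMAGE = 3        # ~3 s per image on one RTX 3090 (approximate)
NUM_IMAGES = 150_000

total_hours = NUM_IMAGES * SECONDS_PER_IMAGE / 3600
print(total_hours)  # 125.0
```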

Societal Impact
As with any improvement to generative models, there are potential downsides. As we increase the photo-realism of synthetic images, the potential for malicious use in the spread of disinformation also grows. In addition, perspective inconsistencies have been used as a tool to identify synthetic images from diffusion models [Corvi et al. 2022]. With the addition of our constraint, such tools could lose their efficacy, further increasing the potential for misuse of diffusion models.

Future Work
The current work is limited to 3D perspective geometry constraints, but many other physical properties affect the realism of generated images. Examples include lighting and shadow consistency [Farid 2022a,b] as well as semantic and physical consistency. Images generated by diffusion models often break physical laws, for example by depicting people walking on water. Future work can explore additional constraints that enforce these physical laws and further increase photo-realism and the performance of downstream tasks.

Conclusions
In the 1400s, Leon Battista Alberti established the foundations for perspective in art, pushing the boundaries of hand-drawn realism. In this work, we propose a first attempt at a novel geometric constraint that encodes perspective into latent diffusion models. We demonstrate that introducing these physically-based 3D perspective constraints improves both photo-realism in subjective tests and downstream performance on monocular depth estimation. We hope that our work can be a small step in the community's effort to improve the realism of image synthesis.

Fig. 1.
Fig. 1. Images generated with our novel geometric constraint preserve straight lines and perspective. (a) An image generated by Stable Diffusion V2, (b) an image generated by our fine-tuned diffusion model, and (c) the depth map and prompt both models were conditioned on.

Fig. 2.
Fig. 2. Examples of one-, two-, and three-point linear perspective. Vanishing points are labeled in blue, perspective lines are in red, and the horizon lines are in light green. One-point perspective is typically used when there is one focal point in the image or when only one side of an object is visible. Two-point perspective is used to illustrate multiple sides of an object, while three-point perspective is used for viewpoints above or below the horizon line of the 3D scene.

Fig. 3.
Fig. 3. Graphical description of our geometric constraint. Left: a visualization of how the loss function sweeps lines across the image. Right: the loss plotted for the image at right. The red and yellow lines in the left plot are identified by the corresponding dots.

Fig. 4.
Fig. 4. A screenshot of the graphical user interface for the human subjective test we performed on the Prolific platform. Annotators are asked to rank the images by realism, with "1" being the most and "3" being the least real. Images include one generated by a baseline model, one generated by our enhanced model, and one real image, in random order.

Fig. 5.
Fig. 5. Images from our model are better at preserving straight lines. Examples of outputs from the base model and from our enhanced model. The depth maps these outputs are conditioned on are shown at the top. Insets show specific regions of interest.

Fig. 6. Fig. 7.
Fig. 6. Despite being fine-tuned on images of city scenes, our model is able to generate high-quality images of varied settings including nature landscapes, indoor scenes, and pictures of animals. Images were taken from a landscapes dataset [Rougetet 2020], an animal dataset [Awais 2020], and the indoor subset of the DIODE dataset [Vasiljevic et al. 2019].

Fig. 8.
Fig. 8. Images from our model have more consistent vanishing point lines. This figure shows examples of stable diffusion outputs from the baseline model and from our model with perspective loss, along with perspective lines for each image. The depth maps these outputs are conditioned on are shown in the left-hand column. Note that for the baseline image in the first row, the lines do not intersect at a single vanishing point, violating perspective geometry. These violations can sometimes result in curved lines, as seen in the baseline image in the second row.

Fig. 9. Fig. 10.
Fig. 9. Qualitative comparisons of DPT-Hybrid fine-tuned on the data from our fine-tuned models and the original DPT-Hybrid model. The depth maps produced by models trained on images from our enhanced model capture more high-frequency detail than those from models trained on images from the baseline model. The RMSE error of our model's outputs is also consistently lower.

Fig. 11.
Fig. 11. Images from our enhanced model consistently appear more photo-realistic than images from the baseline model (a) and our ablation model (b) according to the results of the subjective human tests. Top: how often each set of images was ranked lower. Our enhanced images were ranked as more photo-realistic (lower) than baseline images in 69.6% of trials and more photo-realistic than ablation images in 67.5% of trials. Bottom: average ranking for our images, real images, and the comparison images. Although real images are consistently ranked lowest, our images beat out both baseline and ablation images and are closer to the real images than either comparison.

Table 2.
Monocular Depth Estimation performance of PixelFormer fine-tuned on our data compared to the base PixelFormer model (trained on KITTI) on the DIODE Outdoor dataset. Fine-tuned models were trained on synthetic datasets generated by either the base stable diffusion model or our fine-tuned model. The best performing model is in bold and the second best is underlined.

Table 3.
Ablation Study: Monocular Depth Estimation performance of DPT-Hybrid fine-tuned on data from a model trained with no loss, a model conditioned on vanishing points with no loss, and a model trained with our loss. The best performing model is in bold.

Table 4.
Inpainting Quantitative Results: Images generated by our enhanced model outperform both the baseline Stable Diffusion V2 model and the ablation models on the LPIPS metric. Our enhanced model performs best on all three datasets, while the ablation model is outperformed by the baseline model when tested on only landscapes. Lower is better for all columns.

Table 5.
Non-building scenes generated by our enhanced model outperform both the baseline Stable Diffusion V2 model and the No Loss model on the FID metric. The metric was computed on 6.7k images from nature [Rougetet 2020], animal [Awais 2020], and indoor [Vasiljevic et al. 2019] datasets. Lower is better.