Animating Street View

We present a system that automatically brings street view imagery to life by populating it with naturally behaving, animated pedestrians and vehicles. Our approach is to remove existing people and vehicles from the input image, insert moving objects with proper scale, angle, motion, and appearance, plan paths and traffic behavior, as well as render the scene with plausible occlusion and shadowing effects. The system achieves these by reconstructing the still image street scene, simulating crowd behavior, and rendering with consistent lighting, visibility, occlusions, and shadows. We demonstrate results on a diverse range of street scenes including regular still images and panoramas.


INTRODUCTION
Google Street View and similar services allow people to virtually visit locations worldwide with street-level imagery. These platforms offer an immersive experience of different areas such as neighborhoods, streets, and tourist attractions. Nevertheless, the major limitation of the visual content provided by these services is that it is entirely static: no people or cars move through the scenes. Street-level video could address this shortcoming but would be difficult to collect due to constraints on image capture systems, storage, and privacy requirements. We propose a system that mitigates these limitations by first erasing still pedestrians and vehicles from an image or panorama of a scene and then repopulating it with synthetic, moving pedestrians and vehicles. Our method can thus enhance the vividness of street view imagery without additional capture and without privacy concerns.
Our framework takes as input a single street image or panorama and generates a video of arbitrary length by populating it with moving pedestrians and vehicles. In order for the generated videos to look realistic, they must respect the geometry, lighting, and semantic information (e.g., the labeling of sidewalks and streets) of the given scene, while modeling reasonable behaviors of the newly added elements (people and vehicles). Our method takes all these factors into consideration in three main stages (as shown in Figure 2):
(1) Reconstruction: estimating the visible geometry, lighting, and semantic information of a scene.
(2) Simulation: inserting and modeling the behavior of pedestrians and vehicles in a given street setting.
(3) Rendering: synthesizing realistic videos with consistent lighting, shadows, and occlusions layered into the original frames.
Our method is the first to demonstrate realistic results for street scene image/panorama animation, producing arbitrarily long videos with moving people and cars. We combine learning-based approaches with traditional rendering to deliver visually appealing results without large training sets, large-scale annotations, long training time, or advanced hardware. Our approach leverages a number of ingredients from prior art. Specifically, our geometry reconstruction, crowd simulation, and shadow/occlusion rendering components build on existing pre-trained networks and rendering systems, except for a sun estimation sub-module that we trained. While these ingredients by themselves are not novel, our system combines them in novel ways to enable an exciting new application. We only require as input a single image with a known focal length, or a 360° panorama. The whole pipeline can be run on a conventional laptop. Figure 1 illustrates several output frames for three different scenes. Please refer to the supplementary material for video results.

RELATED WORK
Object Insertion
Early insertion work includes Poisson blending [Pérez et al. 2003], which produces seamless object boundaries but results in illumination and color mismatches between the object and the target background. [Karsch et al. 2014] and [Karsch et al. 2011] recover geometry and lighting from a single image for object insertion. However, [Karsch et al. 2011] requires manual insertion, and neither demonstrates automatic occlusion behavior, shadows cast by the scene onto objects, or object placement/movement. More recent work renders inserted objects with estimated lighting maps and shadow synthesis [Tang et al. 2022; Wang et al. 2022], but does not handle sharp lighting changes near shadow boundaries. A number of recent methods learn end-to-end object insertion and composition based on GANs [Azadi et al. 2020; Lin et al. 2018; Zhan et al. 2019] and diffusion models [Ma et al. 2023; Song et al. 2022].
Specifically for street scenes, [Chien et al. 2017; Lee et al. 2018; Sun et al. 2020] estimate the locations of roads and walking paths. [Wang et al. 2020, 2021] manually insert static pedestrians and vehicles into a scene image, selected to have scene-compatible lighting and pose, but do not support animation. [Wang et al. 2020] also requires video as input. In contrast, our approach inserts dynamic 3D subjects, is automatic, and operates on a single image.
The problem of animating street scenes has been less well studied. [Lee et al. 2019] learn to composite a given video clip of a pedestrian onto a scene video, but with limited quality. [Xu et al. 2022] train a 3D-aware GAN given a scene image, but only show short (1-2 second) video clips with noticeable artifacts. Closest to our work are [Chen et al. 2021], which composes moving car assets into a scene video, and [Wang et al. 2022], but they require 5-7 posed RGB images and LiDAR for a NeRF or 3D-aware geometry representation. [Chen et al. 2021] also only supports vehicles instead of the much more complex pedestrian/vehicle interaction. Compared with previous works, our system only requires a still image, and handles pedestrians in addition to vehicles.

Still Image Animation
Image animation aims to generate a video given one or a sequence of still images as input. [Holynski et al. 2021; Okabe et al. 2011] show beautiful animation results assuming repetitive motions, e.g., waterfalls. [Huang et al. 2022b; Peng et al. 2021; Weng et al. 2019; Yoon et al. 2021] animate existing people in the scene, and [Mallya et al. 2022; Pumarola et al. 2019] animate faces. [Mallya et al. 2022; Wang et al. 2019] take in conditional signals like facial keypoints or pose sequences provided by a driving video and synthesize video in 2D space, and [Huang et al. 2022a] reconstruct an animatable implicit representation from a still image. [Hu et al. 2022; Yu et al. 2022] animate a single image by semantic instructions. There is also a line of work on 3D-aware GANs for scene generation [Epstein et al. 2022; Nguyen-Phuoc et al. 2020; Niemeyer and Geiger 2021; Xue et al. 2022], which can be leveraged for scene animation by GAN inversion. Recent methods generate full videos via diffusion models, e.g., [Ni et al. 2023] focus on human motion with a fixed camera pose, and [Karras et al. 2023] show animation specific to fashion models. We focus on 3D scene modeling for the purpose of occlusion-aware and lighting-consistent animation of novel objects' interaction within the scene.

Playable Video Generation
[Menapace et al. 2021, 2022] address Playable Video Generation, which learns semantically consistent actions and generates realistic videos conditioned on the input. Similarly, [Zhang et al. 2021, 2023] produce videos by modeling interactively controllable video sprites and composing them onto an empty scene. [Davtyan and Favaro 2022] learn to segment video into foreground-background layers and generate transitions of the foreground over time. [Kim et al. 2021] learn to simulate a dynamic driving environment directly in pixel space. These methods require a large video dataset for training and are thus limited to a narrow set of scene images, for example a tennis court from a fixed angle, or a first-person driving scene. Our method is the first approach to animate street scenes without heavy training by populating them with naturally behaving characters.

APPROACH
Given an image of a street scene, our goal is to generate an arbitrarily long animation of this scene populated with objects including pedestrians and vehicles. The resulting video should respect the geometry and illumination of the scene, contain realistic shadow and occlusion effects, and feature natural traffic behaviors. Our automatic pipeline contains three stages. Section 3.1 describes a reconstruction stage where the basic geometry and illumination are estimated and reconstructed. Section 3.2 describes a simulation stage where pedestrians' and vehicles' behaviors are simulated in the reconstructed scene. Section 3.3 describes a rendering stage where characters are placed into the scene and processed to generate the desired visual effects.
For simplicity, this section focuses on traditional camera images and omits discussion of hyperparameters. For equirectangular panoramas, we decompose the imagery into six perspective images in a cubemap manner and process each direction separately. A comprehensive discussion of hyperparameters and panorama processing can be found in the supplementary materials.

Reconstruction
In this section, we describe the reconstruction process, which removes existing pedestrians and vehicles from the image, and estimates the (1) semantic information to determine walking and driving regions for the scene, (2) ground plane for deciding scale and camera angles, (3) sun light direction, intensity and ambient light intensity for lighting the objects, and (4) existing shadow regions in the image for darkening objects when they move into shadow.
3.1.1 Inpainting. If an image contains pedestrians and/or vehicles, we start by removing them with segmentation and inpainting.
As illustrated in Figure 3, we directly take advantage of the pretrained Stable Diffusion inpainting tool [Rombach et al. 2022] to remove existing objects. We segment people and cars using SegFormer [Xie et al. 2021] and compute the bounding box to inpaint as a rectangle covering the objects with a 10% margin to accommodate boundary inaccuracy in the semantic segmentation. We use "A photo of an empty street" as the inpainting prompt, and "Human, pedestrian, vehicle, car" as the negative prompt for Stable Diffusion.
3.1.2 Segmentation. We segment the image to determine where the pedestrians can walk and where the vehicles can drive. We start with the off-the-shelf semantic segmentation model SegFormer [Xie et al. 2021] to segment out the ground regions (the sidewalk and road classes in CityScapes [Cordts et al. 2016]). However, we found that SegFormer sometimes labels regions as Road that are not suitable for driving, especially when the scene is not a traditional CityScapes-style image. We thus additionally apply the zero-shot language-driven segmentation model CLIPSeg [Lüddecke and Ecker 2022] to segment the image with the keyword "drive". CLIPSeg is adaptable to more segmentation classes in a wider range of scenes, but produces less accurate pixel-level boundaries. If the identified "drive" region is smaller than a preset threshold, we make the scene "pedestrian-only" by treating the Road label pixels as Sidewalk pixels and not adding vehicles to the scene.
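The padded bounding box used for inpainting can be sketched as follows. This is only an illustration under stated assumptions: the function name `padded_bbox` is hypothetical, the mask is assumed to be a boolean array from the segmentation step, and the 10% margin is assumed to be applied on each side relative to the box width and height (the text does not say how the margin is measured).

```python
import numpy as np

def padded_bbox(mask: np.ndarray, margin: float = 0.10) -> tuple[int, int, int, int]:
    """Return (x0, y0, x1, y1) covering all True pixels plus a relative margin.

    x1 and y1 are exclusive. The margin is a fraction of the box size and is
    added on every side (an assumption, not stated in the paper).
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("mask contains no object pixels")
    y0, y1 = int(ys.min()), int(ys.max()) + 1
    x0, x1 = int(xs.min()), int(xs.max()) + 1
    pad_y = int(round((y1 - y0) * margin))
    pad_x = int(round((x1 - x0) * margin))
    h, w = mask.shape
    # Clamp to the image so the crop passed to the inpainter stays valid.
    return (max(0, x0 - pad_x), max(0, y0 - pad_y),
            min(w, x1 + pad_x), min(h, y1 + pad_y))
```

The returned rectangle would then be cropped, inpainted, and composed back into the image, as described for Figure 3.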
To handle the case where obstacles on the street partition the walking/driving regions into pieces (e.g., a pole in the image may split the visible sidewalk into disjoint 2D pieces), we dilate the region and then take the convex hull of each connected component. At the same time, we record the ground position of each obstacle to reconstruct a binary obstacle map with non-walkable obstacle pixels. We utilize the estimated ground plane equation to project the semantic information onto a bird's eye view (BEV) map (Figure 2, top figure in Stage 2), which contains the spatial information for walking/driving regions and obstacle positions. This BEV map is discretized into a grid and serves as the abstract representation for pedestrian simulation. Note that vehicle simulation is a different process and is not shown in this BEV map.
3.1.3 Ground Plane Estimation. We make the assumption that the ground region in the image can be well-approximated by a plane
$$a x + b y + c z = 1, \tag{1}$$
and the goal is to estimate the values of $a, b, c$ in metric units (1/meters). This approach is similar to [Wang et al. 2021].
We use the monocular depth estimation model AdaBins [Bhat et al. 2021] to predict the absolute depth map in meters. According to the properties of perspective projection, for known focal length $f$, image pixels $u, v$ and 3D positions $x, y, z$, we have $x = uz/f$ and $y = vz/f$. Plugging into Equation 1 and rearranging yields
$$a\,\frac{u}{f} + b\,\frac{v}{f} + c = \frac{1}{z}.$$
Given the segmentation map, we collect all 2D road/sidewalk pixels $(u_i, v_i)$ and their depths $z_i$, and solve for $a, b, c$ via linear least squares. For simplicity, we assume that the road and sidewalk are co-planar.
This plane is then the surface on which pedestrians will walk and vehicles will drive in simulation, augmented with the segmentation map to restrict those regions and the depth map to determine occlusions between the scene and inserted assets.
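A minimal sketch of the least-squares plane fit follows. It assumes the plane is written as a x + b y + c z = 1 (consistent with coefficients in units of 1/meters), that pixel coordinates u, v are measured relative to the principal point, and that depths are in meters; the function name `fit_ground_plane` is hypothetical.

```python
import numpy as np

def fit_ground_plane(u, v, z, f):
    """Fit (a, b, c) of the plane a*x + b*y + c*z = 1 from ground pixels.

    With x = u*z/f and y = v*z/f, dividing the plane equation by z gives the
    linear system  a*(u/f) + b*(v/f) + c = 1/z,  one row per ground pixel.
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    z = np.asarray(z, dtype=float)
    A = np.stack([u / f, v / f, np.ones_like(u)], axis=1)
    rhs = 1.0 / z
    coeffs, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return coeffs  # array [a, b, c]
```

In practice one would likely subsample the ground pixels or add a robust estimator to cope with depth outliers; neither is shown here.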
3.1.4 Sun Estimation. In order to realistically light the inserted objects, we train a sun estimation network that learns to predict the sun direction from a single image. We formulate it as a classification problem and predict a distribution over discrete sun angles. We divide the range of azimuth angles $[0, 2\pi)$ into 36 bins (10 degrees each) and elevation angles $[0, \pi/2)$ into 18 bins (5 degrees each). We train a ResNet50 [He et al. 2015], but replace the last fully connected layer with two, one for azimuth and one for elevation. We train this network using ground truth sun positions as supervision via a cross-entropy loss. This sun estimation approach is closely based on [Wang et al. 2021]. We compute the variance of the predicted elevation/azimuth bin distributions to decide if the scene has a strong, directional light or a diffuse environment light, and accordingly set the directional light source strength and ambient light when lighting inserted objects. The thresholds of variance and corresponding light intensities are provided in the supplementary material.
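To make the bin-based formulation concrete, the sketch below decodes a pair of predicted bin distributions into angles and computes a simple spread measure. The bin counts come from the text; everything else is an assumption for illustration: the function names are hypothetical, the angle is read off at the center of the most likely bin, and the variance is taken over bin indices without handling azimuth wrap-around. The thresholds on the variance are in the supplementary material and are not reproduced here.

```python
import numpy as np

N_AZIMUTH_BINS = 36    # azimuth in [0, 2*pi)
N_ELEVATION_BINS = 18  # elevation in [0, pi/2)

def bin_centers(n_bins: int, upper: float) -> np.ndarray:
    """Centers of n_bins equal-width bins covering [0, upper)."""
    width = upper / n_bins
    return (np.arange(n_bins) + 0.5) * width

def decode_sun(az_probs, el_probs):
    """Return (azimuth, elevation) in radians at the most likely bins."""
    az = bin_centers(N_AZIMUTH_BINS, 2 * np.pi)[int(np.argmax(az_probs))]
    el = bin_centers(N_ELEVATION_BINS, np.pi / 2)[int(np.argmax(el_probs))]
    return az, el

def bin_variance(probs) -> float:
    """Variance of the bin index under the predicted distribution.

    A peaked distribution (low variance) suggests a strong directional light;
    a flat one suggests diffuse lighting. Wrap-around is ignored here.
    """
    probs = np.asarray(probs, dtype=float)
    idx = np.arange(probs.size)
    mean = np.sum(idx * probs)
    return float(np.sum(probs * (idx - mean) ** 2))
```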
For training and ground truth data, we use a panorama dataset [Chang et al. 2018] with sun directions estimated for Google Street View panoramas from sky appearance and metadata. We take perspective images from each panorama with diverse camera angles and FOVs, and compute the ground truth sun elevation and azimuth for these images. Refer to Figure 4 for an example data point, and to Section 4 for a discussion on data preprocessing.
3.1.5 Scene Shadow Estimation. Analyzing local shadows in the scene is important for realistic object insertion. In particular, shadows cast by inserted objects should not darken existing shadows, and shadows cast by elements in the scene, e.g., by in-scene structures or trees, should also be cast onto the inserted objects.
We start by detecting the existing shadows in the image using an off-the-shelf shadow detection network [Zhu et al. 2021]. We keep ground shadow regions by intersecting this binary map with the ground region obtained from semantic segmentation.
With the estimated ground plane equation and sun direction from previous steps, we now construct a shadow occluder to explain shadows cast into the scene. In particular, we first create a plane parallel to the ground plane and displaced vertically above the height of any potentially inserted objects. We then intersect shadow rays, cast from the 3D positions of shadowed scene points toward the light direction, with this plane. We connect these intersection points into a mesh to construct the shadow occluder and use it to cast shadows onto inserted objects. See Figure 5 for an illustration of this pipeline.
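The ray-plane intersection behind the shadow occluder can be sketched as below. The sketch assumes the ground plane is parameterized as a x + b y + c z = 1 in camera coordinates with the camera at the origin above the ground, and that `sun_dir` points from the scene toward the sun; `occluder_points` and its parameters are hypothetical names. Connecting the resulting points into a mesh is not shown.

```python
import numpy as np

def occluder_points(points, plane_abc, sun_dir, height):
    """Intersect rays from shadowed ground points toward the sun with a plane
    parallel to the ground and `height` meters above it.

    points:    (N, 3) 3D positions of shadowed scene points
    plane_abc: (a, b, c) of the ground plane a*x + b*y + c*z = 1
    sun_dir:   direction from the scene toward the sun (any length)
    """
    n = np.asarray(plane_abc, dtype=float)
    norm = np.linalg.norm(n)
    n_hat = n / norm
    # The camera (origin) satisfies n_hat . X = 0 < 1/norm, so "up" is -n_hat
    # and the raised occluder plane is n_hat . X = 1/norm - height.
    offset = 1.0 / norm - height
    d = np.asarray(sun_dir, dtype=float)
    d = d / np.linalg.norm(d)
    denom = float(n_hat @ d)
    if denom >= 0:
        raise ValueError("sun direction must point above the ground plane")
    P = np.atleast_2d(np.asarray(points, dtype=float))
    t = (offset - P @ n_hat) / denom  # ray parameter, positive above ground
    return P + t[:, None] * d
```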

Simulation
After reconstructing the basic scene geometry and semantics, we utilize the scene understanding information to simulate the movement of pedestrians and vehicles in the scene. We start by estimating origin and destination points for pedestrian movements, and then run a simple yet effective potential field-based crowd simulation algorithm. For scenes with more complex structure, we also design a traffic simulation module that naturally controls the behavior of pedestrians and cars at traffic lights and crosswalks.
3.2.1 Origin/Destination Estimation. We would like each pedestrian to appear at (or disappear from) one edge of the image and disappear (or appear, respectively) at the far end of the scene. Thus we build a pool of possible origin/destination pairs containing the intersections of walkable regions with image boundaries, as well as the farthest points from the camera. To simulate each pedestrian, we randomly select one origin/destination pair from this pool and check if there is an existing pedestrian in the same or a neighboring grid cell. If not, we initialize a pedestrian in this grid cell.
For cars, we split the drivable region into lanes by estimating the width of the drivable region. See the supplementary material for more details.
3.2.2 Pedestrian Simulation. We base our pedestrian simulation algorithm on a simplified version of the potential field algorithm in [Treuille et al. 2006]. We start with a binary BEV map marking the walkable regions in the scene, with an additional binary obstacle map marking the position of obstacles on the street (poles, flowers, etc.). We build a discomfort field $g$ such that people prefer to be at point $\mathbf{x}$ rather than $\mathbf{x}'$ if $g(\mathbf{x}') > g(\mathbf{x})$. The choice of $g$ is flexible; we design it to discourage people from walking too close to (1) the edge of the sidewalk region, (2) any obstacles on the walkable regions, or (3) other pedestrians in the scene. Finally, we compute a dynamic speed map $f$ with each grid cell set inversely proportional to the local crowd density. $f$ and $g$ are re-computed at each time step. For a single pedestrian with non-obstacle start and target points in $\mathbb{R}^2$, we would like to predict a path $P = \{(x_i, y_i)\}_i$ between them in the walkable region. To make the path natural, we follow [Treuille et al. 2006] and assume that pedestrians would like to minimize:
• The length of the path.
• The amount of time to the destination.
• The discomfort felt per unit time along the path.
As in [Treuille et al. 2006], we compute the optimal path $P$ by minimizing
$$\alpha \int_P 1\,ds + \beta \int_P 1\,dt + \gamma \int_P g\,dt,$$
where $\alpha, \beta, \gamma$ are weights and $ds = f\,dt$ relates path length to time through the speed $f$. We start with building the unit cost field $C$ as a weighted combination of inverse speed and discomfort,
$$C = \frac{\alpha f + \beta + \gamma g}{f},$$
so that the cost above equals $\int_P C\,ds$. Then we use the fast marching algorithm [Tsitsiklis 1995] to compute a potential field $\phi$ such that $\phi = 0$ at the target point and otherwise satisfies the eikonal equation:
$$\|\nabla \phi(\mathbf{x})\| = C(\mathbf{x}).$$
At each time step, we compute the velocity of each pedestrian at position $\mathbf{x}$ and update its position with a pre-defined step size $\Delta t$:
$$\mathbf{v} = -f(\mathbf{x})\,\frac{\nabla \phi(\mathbf{x})}{\|\nabla \phi(\mathbf{x})\|}, \qquad \mathbf{x} \leftarrow \mathbf{x} + \Delta t\,\mathbf{v}.$$
See Algorithm 1 for the complete procedure, and the supplementary material for implementation details.
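The potential-field idea can be illustrated on a small grid. The sketch below is not the paper's implementation: it swaps the fast marching solver for Dijkstra's algorithm on a 4-connected grid (a coarser approximation of the eikonal solution), and it follows the potential by stepping to the lowest-potential neighbor rather than along the continuous gradient. The unit cost is assumed to take the form (alpha*f + beta + gamma*g) / f from [Treuille et al. 2006]; all function names are hypothetical.

```python
import heapq
import numpy as np

NEIGHBORS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def unit_cost(speed, discomfort, obstacles, alpha=1.0, beta=1.0, gamma=1.0):
    """Unit cost per grid cell; obstacle cells get infinite cost."""
    cost = (alpha * speed + beta + gamma * discomfort) / speed
    cost = cost.astype(float)
    cost[obstacles] = np.inf
    return cost

def potential_field(cost, target):
    """Dijkstra stand-in for fast marching: phi = 0 at target, grows with cost."""
    h, w = cost.shape
    phi = np.full((h, w), np.inf)
    phi[target] = 0.0
    heap = [(0.0, target)]
    while heap:
        p, (r, c) = heapq.heappop(heap)
        if p > phi[r, c]:
            continue  # stale heap entry
        for dr, dc in NEIGHBORS:
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and np.isfinite(cost[nr, nc]):
                cand = p + 0.5 * (cost[r, c] + cost[nr, nc])
                if cand < phi[nr, nc]:
                    phi[nr, nc] = cand
                    heapq.heappush(heap, (cand, (nr, nc)))
    return phi

def descend(phi, start, max_steps=10_000):
    """Follow decreasing potential from start until the target (phi == 0)."""
    h, w = phi.shape
    path = [start]
    r, c = start
    for _ in range(max_steps):
        if phi[r, c] == 0.0:
            break
        options = [(r + dr, c + dc) for dr, dc in NEIGHBORS
                   if 0 <= r + dr < h and 0 <= c + dc < w]
        best = min(options, key=lambda rc: phi[rc])
        if phi[best] >= phi[r, c]:
            break  # no downhill neighbor: target unreachable from here
        r, c = best
        path.append(best)
    return path
```

In the full system the speed and discomfort fields change at every time step, so the potential would be recomputed per step and each pedestrian would advance by only one step along it.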

Traffic Simulation.
We extend our attention to more complex scenes where vehicles could be present in addition to pedestrians. Specifically, we add cars to the scene if there exists a large enough "drive" region identified by CLIPSeg [Lüddecke and Ecker 2022]. The cars move with a fixed speed following the detected lane.
We additionally detect the Crosswalk region with CLIPSeg to activate the traffic simulation module in the system. Specifically, the system takes in a dynamic binary input indicating the status of the traffic light at each crosswalk. If the light is red for cars (pedestrians), the pedestrians (cars) cross the street as normal while the cars (pedestrians) stop before the crosswalk. We naturally accelerate or decelerate a car by checking its distance to the previous car or to the crosswalk, as illustrated in Algorithm 2. Here $\Delta t$ is the unit time step, and $d_{\min}$ is the minimum allowed distance between two cars or between a car and the crosswalk.
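The car update rule of Algorithm 2 can be sketched in a few lines. This is a hedged illustration: the `Car` class, the function `step_car`, and all parameter names are hypothetical; the braking threshold is assumed to be the stopping distance v^2/(2a) plus the minimum gap; and the clamping of speed to [0, v_max] and the explicit position update are additions not spelled out in the text.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Car:
    position: float  # distance travelled along the lane, in meters
    speed: float     # meters per second

def step_car(car: Car, a: float, dt: float, d_min: float, v_max: float,
             red_light: bool,
             leader_pos: Optional[float] = None,
             crosswalk_pos: Optional[float] = None) -> None:
    """Advance one car by one time step, braking for a red light or a leader."""
    stop_dist = car.speed ** 2 / (2 * a) + d_min
    d_cross = None if crosswalk_pos is None else crosswalk_pos - car.position
    d_car = None if leader_pos is None else leader_pos - car.position
    if red_light and d_cross is not None and 0 < d_cross < stop_dist:
        car.speed = max(0.0, car.speed - a * dt)    # brake for the crosswalk
    elif d_car is not None and 0 < d_car < stop_dist:
        car.speed = max(0.0, car.speed - a * dt)    # brake for the car ahead
    elif car.speed < v_max and not red_light:
        car.speed = min(v_max, car.speed + a * dt)  # speed up on green
    car.position += car.speed * dt
```

Calling `step_car` once per car per simulation step, with `leader_pos` set to the position of the car ahead in the same lane, reproduces the queueing behavior described above.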

Rendering
In this final stage, we render 3D pedestrians and cars into the reconstructed scene with plausible shading, shadows, and occlusions. Our approach operates frame by frame on the current video frame.

Shadow Rendering. We start by generating shadows, shadow masks, and object masks, and compositing over the background image. The colors and depth map of the background image are not used in this step. Specifically, we render the scene in two different layers. First, we render a color image with an opaque ground plane against a pure white background, using the estimated lighting direction to cast shadows. Second, we render a depth image with a transparent ground plane, which gives us a mask of objects in the scene without shadows (Figure 7).
The shadow mask and object mask are computed per pixel from these two renders using an indicator operator that evaluates to 0 or 1 for an argument that is false or true, respectively. We composite the final frame with shadow by darkening the shadow region by a shadow color factor, applied with element-wise multiplication.
We implement three additional refinements to improve shadow quality. First, we apply shadow color matching by taking the average color of a local non-shadowed patch and a local shadowed patch in the background image, and compute their per-channel ratio as the shadow color factor. Second, we exclude previously detected shadow regions in the background image from the shadow mask to avoid double shadows, and apply Gaussian blur to smooth out the sudden change in shadow factor. Third, we reconstruct vertical building walls if they appear in the segmentation map, and compute shadow masks cast on them in the same way as ground shadows. For cloudy days we set up a top-down light perpendicular to the ground, and apply Gaussian blur on the shadow mask to create a soft, diffuse shadow effect. See Figure 7 for an example result.
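The first two refinements can be illustrated with a short sketch. It is an assumption-laden example, not the paper's compositing equation: the helper names are hypothetical, images are float arrays in [0, 1] with shape (H, W, 3), masks are boolean arrays, patches are given as pairs of slices, and the Gaussian blur of the mask is omitted.

```python
import numpy as np

def shadow_color_factor(image, lit_patch, shadow_patch):
    """Per-channel ratio of the mean shadowed color to the mean lit color."""
    lit = image[lit_patch].reshape(-1, 3).mean(axis=0)
    shaded = image[shadow_patch].reshape(-1, 3).mean(axis=0)
    return shaded / np.maximum(lit, 1e-6)  # guard against division by zero

def composite_shadow(image, new_shadow, existing_shadow, factor):
    """Darken pixels under newly cast shadows, skipping existing shadows.

    Excluding existing shadow pixels avoids darkening them a second time.
    """
    mask = new_shadow & ~existing_shadow
    out = image.astype(float).copy()
    out[mask] *= factor
    return out
```

With a factor measured from the scene itself, a newly cast shadow lands at the same brightness as the shadows already present in the photograph.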

Occlusion Rendering. We use the estimated depth map with standard Z-buffering to model occlusions. Observing that monocular depth estimation is inaccurate for thin objects, we refine their estimated background depth by computing their intersection with the ground, and assigning depth based on the estimated ground plane. We also refine the background depth for ground pixels to fit the ground plane. Given the depth map of the rendered assets, the refined background depth, the shadow-composited frame from Equation 8, and the original background scene, we composite the final result image with a per-pixel depth test.
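A per-pixel Z-buffer composite of rendered assets over the background can be sketched as follows. This is a generic illustration of the depth test, not the paper's exact compositing equation; `composite_occlusion` is a hypothetical name, and pixels not covered by any asset are assumed to carry infinite asset depth.

```python
import numpy as np

def composite_occlusion(bg_rgb, bg_depth, asset_rgb, asset_depth):
    """Keep the asset color wherever the asset is closer than the background.

    bg_rgb, asset_rgb:     (H, W, 3) color images
    bg_depth, asset_depth: (H, W) depth maps in the same units;
                           asset_depth is np.inf where no asset was rendered
    """
    visible = asset_depth < bg_depth
    return np.where(visible[..., None], asset_rgb, bg_rgb)
```

Here `bg_rgb` would be the frame that already contains the composited shadows, so that occluded parts of an asset still leave their shadow on the visible ground.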

Data and Implementation
We apply our system to a collection of 156 street scene images, with resolutions ranging from 2160 × 1620 to 4032 × 3024. 126 of the images are taken with the main camera of an iPhone 12 mini (5.76 × 4.32 mm sensor), while the rest are photos from the Internet (Creative Commons or licensed).
The inference part of the system (pre-trained segmentation and depth estimation) runs on two GeForce RTX 2080 GPUs. We use the Unity 2022 3D High Definition Render Pipeline (HDRP) [Juliani et al. 2018] for rendering.

Reconstruction
4.2.1 Sun Estimation. We train the sun estimation network using the dataset from [Chang et al. 2018], which contains 19093 panorama images from Google Street View as well as the ground truth sun position derived from metadata. We take images from each panorama with various perspective camera angles and FOVs and compute the corresponding sun direction as ground truth.
We compare our system with [Hold-Geoffroy et al. 2019] and [Ma et al. 2016] by adding a fully connected layer to their methods to predict the elevation angle. Since there is no official implementation of either paper, nor a released dataset, we implement them from scratch and train with our prepared datasets. On average over the test set, our azimuth prediction has an angular error of 38.7°.
We place character assets into the scene and make them follow the path. Refer to the attached video for the smooth path-following behavior.
Figure 9: Crowd simulation algorithm applied to scene images (panels: scene image with overlaid paths; BEV paths). We simulate the paths in BEV space, and project the paths back to RGB image space. Each color represents a pedestrian path.
4.3.2 Traffic Simulation. We control the state of each crosswalk to simulate a traffic-light interaction model where pedestrians and cars wait in front of the crosswalk when the light is red. Figure 10 shows an example of an intersection. When the traffic light is green for cars and red for pedestrians, the pedestrians form a small group on the sidewalk waiting for vehicles to pass. When it is green for pedestrians and red for cars, the cars slow to a stop, while the pedestrians traverse the crosswalk. Refer to the video for a demonstration of traffic simulation.
Figure 12 compares our synthesized shadows with two state-of-the-art shadow synthesis methods, [Liu et al. 2020] and [Hong et al. 2022]. We also show our system's ability to match shadow color and handle double shadow artifacts in Figure 13.
4.4.2 Occlusion. We illustrate how taking advantage of semantic segmentation information can help generate occlusion effects on thin and small objects. Refer to Figure 14 for visual results.

Video Generation
We demonstrate 12 high-resolution video results for regular images (Figure 16) and one result for a panorama (Figure 15). These results are best viewed in the supplementary video. We record the run time of our video generation system. Reconstruction takes 5-10 seconds per image depending on the resolution. Simulation runs at 30 fps. Rendering is an offline process of around 10 minutes for a 1-minute video at the highest resolution (4032 × 3024), but could be made real-time with standard shader techniques or by producing lower-resolution alternatives.

Failure Cases
As the first end-to-end system for the task of animating street view images, we did not have a complete baseline to compare to directly. The relevant works either take different input formats than ours (multi-view images or videos) or produce a different output format (static images). Instead, we count the failure cases at each step. Note that the counts of different categories may overlap: for example, a single scene may suffer from both a depth estimation issue and a semantic segmentation issue. The statistics of failure cases are computed on our dataset of 156 images. Segmentation and depth estimation models contribute to most of our failure cases, and our results will improve with progress on these topics. Our contribution is to build a composable system using algorithms proven to be state-of-the-art in the literature.

CONCLUSION AND DISCUSSION
In this paper, we present a system that automatically populates a still scene image with naturally behaving pedestrians and vehicles. We describe the algorithms for each component of the system (reconstruction, simulation, rendering) as a combination of state-of-the-art deep learning methods and traditional simulation/rendering. We illustrate the quality of our approach with rendered videos.
Limitations: Our system does not currently handle curved lanes or hills/slopes, complete shadows cast on inserted objects when the full shadow is not in camera view, automatic traffic direction detection, recovery of the full extent of obstacles (including the unseen back half), or casting shadows onto arbitrary scene geometry; and our lighting model is approximate (not a full global illumination rendering). As our system relies on the accuracy of pre-trained models (depth, segmentation, shadow), it fails on out-of-distribution images, e.g., with non-street-level camera angles or difficult lighting conditions. Indeed, we are presenting a method that does not attempt to improve scene understanding, but takes advantage of existing information in an integrated way to achieve appealing visual effects. With advances in pre-trained scene understanding models, our system can be adapted to a wider range of scenes. See Figure 11 for a few examples of scenes we do not currently handle.
Diversity and Ethical Considerations: We acknowledge that our paper does not explicitly model any population distribution, so it could potentially generate videos that misrepresent a place's public space, tolerated outfits, vehicle types, etc. The assets we use to generate visualizations in the paper and demo video show limited diversity in appearance, gender, and ethnicity. However, our system can be applied to any type of assets compatible with the game engine. Modeling the actual population distribution based on a scene layout remains a problem worth exploring.

Figure 2: Our system has three major components. In Stage 1, we reason about the scene by predicting its semantic segmentation labels, depth values, sun direction and intensity, as well as shadow regions. We additionally determine walking and driving regions for adding pedestrians and cars (red straight lines: lane detection; blue points: origin and destination points). In Stage 2, we simulate the pedestrians in a 2D bird's eye view representation (BEV) of the scene, and simulate car movements with predicted lanes (four colors correspond to four predicted paths, both in BEV and scene images). If there is a detected crosswalk, we also simulate the traffic behavior by controlling a traffic light (not shown in this example, refer to Figure 10). In Stage 3, we render the scene with the estimated lighting, shadows, and occlusions. The whole pipeline is automated.

Figure 3: Stable Diffusion [Rombach et al. 2022] based inpainting. We detect existing people and cars with a segmentation map, crop the surrounding region, inpaint, and compose the result back into the image.

Figure 4: A 360-degree panorama is projected onto planar images with various camera poses and FOVs. Then the system is trained to predict the sun light direction (ground truth shown in the panorama) by estimating its elevation and azimuth. The bottom row visualizes possible light locations estimated by the network, shown as the outer product between the azimuth and elevation distribution vectors. Note that the sun is in different positions relative to the camera viewpoints for the two crops.

Figure 6: Example of potential field-based crowd simulation. We first project the walkable regions onto a BEV grid. We then compute a discomfort field based on the scene layout and other pedestrians' positions. Finally, we perform potential descent based on the potential field (planning a path perpendicular to the potential contours at each grid location).
In the path cost of the pedestrian simulation, $\alpha, \beta, \gamma$ are weight parameters that could be set manually, and $g$ is the discomfort value at each grid cell. The two variables $s, t$ indicate whether the integral is taken with respect to space or time, and satisfy the relationship $ds = f\,dt$, where $f$ is the speed.

Figure 7: Shadow and occlusion rendering. We render the inserted objects with the synthesized shadow effects. Red boxes highlight occluded objects.
Figure 8 compares the system with and without shadow occluders. Notice how the people and cars are darkened by cast shadows when they move into a shaded region.

Figure 8: Rendered video frames with and without shadow occluders. See the darkening effect when an object moves into a shaded region.

Figure 10: Example of traffic simulation. Pedestrians and cars stop before the crosswalk region when the corresponding traffic light is red.

Figure 14: Monocular depth estimation could be off in shadow regions (red box) or have vague boundaries (blue box). Semantic segmentation information helps refine the depth value on grounded, thin objects.

Figure 15: Visual results of the system on a panorama. Top: original panorama. Bottom: still images captured from different angles. Better viewed in video format in the supplementary material.

Figure 16: Visual results of the system on regular images. Three representative frames are shown here for each video. Better viewed in video format in the supplementary material.
Algorithm 2 Car Simulation Algorithm
Construct a list with each car pointing towards the previous car.
for car with speed $v$ and acceleration/deceleration $a$ do
    Compute $d_{car}$, the signed distance to the previous car
    Compute $d_{cross}$, the signed distance to the crosswalk
    if $0 < d_{cross} < v^2/(2a) + d_{\min}$ and red light then
        Decelerate, $v \leftarrow v - a\Delta t$
    else if $0 < d_{car} < v^2/(2a) + d_{\min}$ then
        Decelerate, $v \leftarrow v - a\Delta t$
    else if $v < v_{\max}$ and green light then
        Accelerate, $v \leftarrow v + a\Delta t$
    end if
end for