VALERIE22 - A photorealistic, richly metadata-annotated dataset of urban environments

The VALERIE tool pipeline is a synthetic data generator [14] developed to contribute to the understanding of domain-specific factors that influence the perception performance of DNNs (deep neural networks). This work was carried out within the German research project KI Absicherung, which aimed to develop a methodology for the validation of DNNs in the context of pedestrian detection in urban environments for automated driving. The VALERIE22 dataset was generated with the VALERIE procedural tools pipeline, providing a photorealistic sensor simulation rendered from automatically synthesized scenes. The dataset provides a uniquely rich set of metadata, allowing the extraction of specific scene and semantic features (like pixel-accurate occlusion rates, positions in the scene, and distance and angle to the camera). This enables a multitude of possible tests on the data, and we hope to stimulate research on understanding the performance of DNNs. Based on cross-domain semantic segmentation experiments, i.e. training on synthetic data and evaluating on real-world target data, a comparison with several other publicly available datasets is provided, demonstrating that VALERIE22 is one of the best-performing synthetic datasets currently available in the open domain.


Introduction
Recently, great progress has been made in applying machine learning techniques to deep neural networks to solve perception problems. Automated vehicles (AVs) are a recent focus as an important application of perception from cameras and other sensors, such as LiDAR and radar [31]. Although the current main effort is on developing the hardware and software to implement the functionality of AVs, it will be equally important to demonstrate that this technology is safe.
The German collaborative research project KI Absicherung [1] was a joint industry and academia effort to develop a methodology for the validation of DNNs in the context of pedestrian detection in urban environments for automated driving. Specifically, one important goal of that project was to make the safety aspects of ML-based perception functions predictable. One important research stream of this project used synthetic data generation as a base, as this allows full control over domain-specific scene parameters and the ability to generate parameter variations of these. Further, additional metadata annotations were specified, and their automated computation was added to the synthesis pipeline.
The VALERIE tools pipeline was developed as a research tool to improve the quality of data synthesis and to gain an understanding of the factors that determine the domain gap between synthetic and real datasets. For that purpose, a powerful synthesis pipeline has been developed which allows the fully automated creation of complex urban scenes. In this paper we only summarize some of the functionalities of the VALERIE synthesis pipeline and focus on a description of the (meta-)data formats of the VALERIE22 dataset that was generated with the tool chain. More details on the synthesis tools can be found in [12].
Additionally, we present evaluation results to assess the quality of our synthetic data compared to other synthetic datasets in the autonomous driving domain.

Related work
In [12] we suggest a computational data synthesis approach for the deep validation of perception functions based on parameterized synthetic data generation. We introduce a multi-stage strategy to sample the input domain and to reduce the vast amount of computational effort required. This concept is an extension and generalization of our previous work on the parameterization of the scene parameters of concrete scenarios. We extended this parameterization with a probabilistic scene generator, to widen the coverage of scenario spaces, and a more realistic sensor simulation. These approaches were used to generate the scenes and data in the VALERIE22 dataset.
Techniques to capture and render models of the real world have matured significantly over the last decades. Computer-generated imagery (CGI) is increasingly popular for the training and validation of deep neural networks (DNNs), as synthetic data can avoid the privacy issues found with recordings of members of the public and can automatically produce ground truth data at higher quality and reliability than costly manually labeled data. Moreover, simulations allow the synthesis of rare scene constellations, helping the validation of products targeting safety-critical applications, specifically automated driving. Because of the progress in visual and multi-sensor synthesis, building systems for the validation of these complex systems in the data center not only becomes feasible but also offers more possibilities for the integration of intelligent techniques into the engineering process of complex applications.
The use of synthesized data for development and validation is an accepted technique and has also been suggested for computer vision applications (e.g. [2]). Several methodologies for the verification and validation of AVs have been developed [7,16,17], and commercial options exist, for example Carmaker from IPG or PreScan from TASS International. These tools were originally designed for the virtual testing of automotive functions, like braking systems, and were then extended to provide simulation and management tools for virtual test drives in virtual environments. They provide real-time capable models for vehicles, roads, drivers, and traffic, which are then used to generate test (sensor) data, as well as APIs for users to integrate the virtual simulation into their own validation systems.
Recently, specifically in the domain of driving scenarios, game engines have been adapted [22,29]. Another virtual simulator system which has gained popularity in the research community is CARLA [9], also based on a commercial game engine (Unreal 4 [10]). Although game engines provide a good starting point to simulate environments, they usually only offer a closed rendering set-up with many trade-offs balancing real-time constraints against a subjectively good visual appearance to human observers. Specifically, the lighting computation in these rendering pipelines is limited and does not produce physically correct imagery. Instead, game engines only deliver a fixed rendering quality, typically with 8 bit per RGB color channel and only basic shadow computation.
In contrast, physically based rendering techniques have been applied to the generation of data for training and validation, for example in the Synscapes dataset [28]. For our experimental work we use the physically based open source Blender Cycles renderer in high dynamic range (HDR) resolution.
The effect of sensor and lens artifacts on perception performance has only been studied to a limited extent. In [3,19] the authors model camera effects to improve synthetic data for the task of bounding box detection. Metrics and parameter estimation of these effects from real camera images are suggested by [18] and [4]. A sensor model including sensor noise, lens blur, and chromatic aberration was developed based on real data sets [13] and integrated into our validation framework.
Looking at virtual scene content, most recent simulation systems for the validation of complete AD systems include simulation and testing of the ego-motion of a virtual vehicle and its behavior. The test content or scenarios used therefore aim to simulate environments with a large spatial extent, virtually driving a high number of test miles (or km) in the provided virtual world [7,20,27]. While this might be a good strategy to validate full AD stacks, one problem for the validation of perception systems is the limited coverage of data testing critical and performance-limiting factors.
A more suitable approach is to use probabilistic grammar systems [8,28] to generate 3D scenarios, which draw on a catalog of different object classes and place objects relative to each other to cover the complexity of the input domain. The VALERIE22 dataset demonstrates the effectiveness of our probabilistic grammar system together with our previous scene parameter variation [25] in a novel multi-stage strategy. This approach makes it possible to systematically test conditions and parameters relevant for the validation of the perception function under consideration in a structured way.
The remainder of this contribution is structured as follows: The next section gives an outline of our synthesis approach and a description of the generated metadata. In section 3 we give a comparison of VALERIE22 with a number of publicly available real and synthetic datasets.

VALERIE data synthesis pipeline
VALERIE is composed of several modules, as depicted in fig. 1. The validation flow control is in principle designed to run automated validation strategies in a data center, with the help of the 'SCALA' orchestration module based on Slurm (https://slurm.schedmd.com/documentation.html). A description of the concept of these modules is outside the scope of this paper; see [12] for more details. The aim here is only to give an overview of some of the modules in the data synthesis part, so that the reader is able to understand the features of the dataset and how to identify objects in the rendered frames.

Computation of synthetic data
Synthetic data is generated with computer graphics methods. Specifically for color (RGB) images, many software systems are available, both commercially and as open source. For the generation of the dataset described in this paper, Blender was used as a base to import, edit, and render 3D content.
The generation of highly varied synthetic data involves the following steps:
1. A 3D scene model with a city model is generated using a terrain/street generator. Parameters like the width of a street and pavement, the type of segment (e.g. tall houses, sub-urban residential, green/park, place, etc.) and the materials for roads, sidewalks, and segments are generated based on a scene description. Alongside this process, the semantic information about the types and geometry of the segments is passed as input to the next step.
2. A placement step inserts 3D assets, like cars, vegetation, road elements and pedestrians, into the scene. This placement inserts objects based on a density declaration (per segment) and a list of assets for this type of segment (e.g. road, sidewalk, etc.). The result is a complete scene. Fig. 3 shows examples of scenes with a variation of person densities.
3. (optionally) A set of scene parameters can be varied before each rendering pass. This includes the position of objects, cameras and time-of-the-day (to vary the sun position), and many more.
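The density-based placement in step 2 can be sketched as a minimal Python snippet. This is an illustration only, not the actual VALERIE placement module; the function name, the per-square-metre density unit and the asset identifiers are assumptions made for the example.

```python
import random

def place_assets(segment_area_m2, density_per_m2, asset_ids, rng=None):
    """Illustrative sketch of the placement step: derive the object count
    from a per-segment density and pick each asset from the segment's list."""
    rng = rng or random.Random(42)
    count = round(segment_area_m2 * density_per_m2)
    return [rng.choice(asset_ids) for _ in range(count)]

# Hypothetical sidewalk segment: 200 m^2 at 0.05 pedestrians per m^2.
placed = place_assets(200.0, 0.05, ["uuid-person-a", "uuid-person-b"])
print(len(placed))  # 10
```

Varying the density parameter per segment type is what produces the low-to-high pedestrian densities shown in fig. 3.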
The dataset contains a multitude of additional metadata. For example, all objects in the scenes are tagged with an identifier (see next section), and semantic and scene information, like the position in the scene and the distance and angle to the camera, is documented in the form of json files. This enables a multitude of possibilities to analyze the data, and we hope to stimulate research on understanding the performance of DNNs with our dataset.

Assets and object instances
The assets (an asset here means a 3D model or 2D texture) in the asset database (left side in fig. 1) have a unique identifier in the form of a UUID (Universally Unique IDentifier). This identifier is used in the scene description either explicitly (for static objects) or in selection lists used by the probabilistic scene generator.
The asset id (identifier) is also used to identify objects in the rendered frames. The dataset contains metadata files (json format) with a list of objects and their asset ids. Objects are also identified with a specific UUID. This is depicted in fig. 2. In the appendix, in the section on metadata, an example json file is listed. The "entities" key, in this example "91", is an integer and corresponds to the instance label (see below) of the instance ground truth. With the help of the scene metadata files and the unique UUIDs of the assets it is possible to identify assets in the rendered scene. This can be used for statistical purposes or to retrieve more information from the asset database (not included in the dataset).
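A reader of the dataset could resolve an instance label to its asset as sketched below. The json structure here merely mimics the appendix example; the exact field names ("entities", "class", "asset_uuid") and the UUID value are assumptions for illustration, so consult the actual metadata files for the real schema.

```python
import json

# Hypothetical per-frame metadata in the spirit of the appendix example.
frame_metadata = json.loads("""
{
  "entities": {
    "91": {"class": "person", "asset_uuid": "11111111-2222-3333-4444-555555555555"}
  }
}
""")

def asset_for_instance(metadata, instance_label):
    """Map an integer instance label from the instance ground-truth image
    to the corresponding entity entry in the frame's metadata."""
    return metadata["entities"].get(str(instance_label))

entity = asset_for_instance(frame_metadata, 91)
print(entity["class"])  # person
```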
The scene composition and the assets used in VALERIE22 are European, e.g. the traffic signs and road markings are German. The types of houses are also mainly in European style.

Ground truth and metadata
The VALERIE22 dataset provides a very rich set of metadata annotations and ground truth:
• pixel-aligned class groups (semantic label image)
• pixel-aligned object instances (instance label image)
• scene parameters, specifically time-of-the-day and sun (illumination)
• camera parameters, including the pose in the scene
The labels for object classes are mapped to a convention used in annotation formats and follow the Cityscapes convention [6] for the training and evaluation of the perception function. The 2D image of a scene is computed along with the ground truth extracted from the rendering engine of the modeling software.

Sensor Simulation
We implemented a sensor model to simulate real sensor behavior. The module works on HDR images in linear RGB space and floating point resolution, as provided by the Blender Cycles renderer.
We simulate a camera error model by applying sensor noise, as added Gaussian noise (mean = 0, variance: free parameter), and an automatic, histogram-based exposure control (linear tone mapping), followed by a non-linear gamma correction. Further, we simulate the following lens artifacts: chromatic aberration and blur. Fig. 4 shows a comparison of the standard tone-mapped 8 bit RGB output of Blender (left) with our sensor simulation (right). The parameters were adapted to approximate the camera characteristics of Cityscapes images. The images do not only look more realistic to the human eye, they also further close the domain gap between synthetic and real data (for details see [13]).
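The noise/exposure/gamma stages of the pipeline can be sketched in a few lines of NumPy. This is a simplified stand-in, not the actual sensor module of [13]: the 99th-percentile exposure heuristic and the default parameter values are assumptions, and lens blur and chromatic aberration are omitted.

```python
import numpy as np

def sensor_model(hdr_rgb, noise_sigma=0.01, gamma=2.2, seed=0):
    """Sketch of the described stages: additive Gaussian noise,
    histogram-based linear exposure, non-linear gamma, 8-bit quantization."""
    rng = np.random.default_rng(seed)
    img = hdr_rgb + rng.normal(0.0, noise_sigma, hdr_rgb.shape)  # sensor noise
    img = img / np.percentile(img, 99)      # auto exposure: 99th percentile -> 1.0
    img = np.clip(img, 0.0, 1.0) ** (1.0 / gamma)  # gamma correction
    return (img * 255.0).astype(np.uint8)   # quantize to 8 bit per channel

hdr = np.abs(np.random.default_rng(1).normal(0.5, 0.2, (4, 4, 3)))
ldr = sensor_model(hdr)
print(ldr.shape, ldr.dtype)
```

The real module operates before quantization in floating point, which is why starting from linear HDR input matters: the exposure and gamma stages would otherwise operate on already tone-mapped values.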

Sampling of variable parameters
Variations in the dataset were created by linear stepping through a parameter interval or by random sampling from it. Examples are the time-of-the-day, which controls the sun settings, or the position and orientation of the camera. The parameters used in variation runs are documented in a json file with the actual parameter variations. However, the sun and camera parameters are also documented in the 'per-frame-analysis' file.
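The two sampling strategies can be illustrated as follows; the interval bounds for time-of-the-day are hypothetical values chosen for the example, not the ones used in the dataset runs.

```python
import random

def linear_sweep(lo, hi, steps):
    """Linear stepping through a parameter interval, endpoints included."""
    return [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]

def random_samples(lo, hi, n, seed=7):
    """Uniform random sampling from the same interval."""
    rng = random.Random(seed)
    return [rng.uniform(lo, hi) for _ in range(n)]

# Hypothetical time-of-the-day variation between 06:00 and 20:00.
print(linear_sweep(6.0, 20.0, 8))
print(random_samples(6.0, 20.0, 8))
```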

Evaluation
To evaluate the quality of our dataset we conducted several experiments using the semantic segmentation task. We compare the performance of a DeeplabV3+ model trained on our synthetic data with models trained on several other synthetic datasets. The performance of these models is then evaluated on five different real-world automotive segmentation datasets. Use cases of our metadata include improved training and the identification of impairing factors (for more details see [14,15]).
Next, we investigated the segmentation performance on the person class of the CityPersons dataset when training the model on subsets of our dataset. We additionally evaluated the person class performance with models trained on subsets of the SynPeDS dataset [24] provided by the KI Absicherung project. Finally, we investigated how the performance of the models differs with the number of unique person assets used to create the datasets and their subsets.
Lastly, we investigated how the number of training images influences the segmentation performance. Again we trained on subsets of our dataset and the SynPeDS dataset and evaluated the segmentation performance on all classes with the DeeplabV3+ segmentation model.

Computation and evaluation of perceptional functions
State-of-the-art perception functions consist of a multitude of different approaches, considering the wide range of different tasks. For the experiments presented in this chapter, we consider the task of semantic segmentation. In this task, the perception function segments an input image into different objects by assigning a semantic label to each of the input image pixels. One of the main advantages of semantic segmentation is the visual representation of the task, which can be easily understood and analyzed for flaws by a human.
In this work, we consider the DeeplabV3+ model, which originated from [5] and utilizes a ResNet101 backbone.
We compare our dataset to three different synthetic datasets. The first is the synthetic SynPeDS dataset [24], consisting of urban street scenes inspired by real-world urban datasets. The second is the GTAV dataset [22], created by sampling data from the 3D game of the same name. Last, the Synscapes dataset [28], which is intended to synthetically re-create characteristics of the Cityscapes dataset, is considered.
To compare our dataset, we train segmentation models on each of these datasets and evaluate the segmentation performance on five real-world datasets. The first is the Cityscapes dataset [6], a collection of European urban street scenes in the daytime under good to medium weather conditions. The second is A2D2 [11]; similar to the Cityscapes dataset it is a collection of German urban street scenes, and it additionally contains sequences from driving on a freeway. The third is the BDD100K dataset [30], a diverse dataset recorded in North America under diverse weather conditions. Next, the India Driving Dataset (IDD) [26], which was recorded in India and contains entirely different street scenes compared to the European or American datasets. Last, the Mapillary Vistas dataset [21], a worldwide dataset with an emphasis on North America. All of these datasets are evaluated on a subset of 11 classes which are alike across these datasets, to provide comparability between the results of the different trained and evaluated models.
To measure the performance on the task of semantic segmentation, the mean Intersection over Union (mIoU) from the COCO semantic segmentation benchmark task is used [23]. The mIoU is the intersection between the predicted semantic label classes and their corresponding ground truth divided by the union of the same, averaged over all classes.
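The definition above translates directly into code. The sketch below computes mIoU for a pair of label maps; skipping classes absent from both prediction and ground truth is one common convention and is an assumption here, as implementations differ in how they treat empty classes.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """mIoU as defined above: per-class intersection over union,
    averaged over the classes (classes absent from both maps are skipped)."""
    ious = []
    for c in range(num_classes):
        intersection = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(intersection / union)
    return float(np.mean(ious))

gt = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
print(mean_iou(pred, gt, 2))  # (1/2 + 2/3) / 2
```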
In our previous work we showed how to use the extensive metadata accompanying our dataset to detect data biases in person detectors caused by the underlying training data used to train the bounding box detectors [15].
Another work investigated the use of the metadata to calculate visually impairing factors, i.e., factors that lead to detrimental detection performance of a person detector, such as increased occlusion or decreased contrast. Re-training a person detector with a focus on harder-to-detect samples, according to these factors, improves the overall detection performance [14].

Cross domain evaluation
To demonstrate the quality of our synthetic dataset, we conducted several cross-domain performance experiments with other real-world automotive and synthetic datasets. This cross-domain performance analysis is also commonly referred to as generalization distance. We trained a DeeplabV3+ model on our VALERIE22 dataset, as well as on the SynPeDS, GTAV and Synscapes datasets. Next, we evaluated the segmentation performance on the real-world datasets A2D2, BDD100K, Cityscapes, IDD and Mapillary Vistas.
As the real-world and synthetic datasets do not have exactly the same semantic annotation format, the segmentation models were trained on a subset of 11 labels per dataset to ensure consistency of the classes across datasets. The labels are defined as follows: Road and sidewalk incorporate the road markings and the curb, respectively. Further, the building, sky, car and truck classes are used, which are consistent across these datasets. The pole, traffic light and traffic sign classes are mapped from similar sub-classes in the used datasets, e.g., utility pole in Mapillary Vistas. The vegetation class consists of the Cityscapes sub-classes terrain, i.e., plants covering the ground, and the original vegetation class, i.e., trees and bushes. Last, the person class is defined as all humans in the dataset, e.g., pedestrians and riders. The mIoU cross-domain generalization performance results over all 11 classes are depicted in fig. 5. Our VALERIE22 dataset performs best on three datasets (BDD100K, Cityscapes, IDD) and just marginally worse than the SynPeDS-trained model on A2D2. Compared to the mainly North-American-based Mapillary Vistas dataset, our dataset shows a significant domain shift. Still, the cross-domain evaluation of VALERIE22 is significantly better than Synscapes and close to GTAV.
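The label harmonization described above amounts to a many-to-one mapping from dataset-specific labels to the 11 shared training classes. The sketch below illustrates the idea; the specific source-label strings are examples taken from the description (e.g. "utility pole", "terrain") plus assumed names, not the complete mapping tables actually used.

```python
# Illustrative 11-class mapping; source-label names are partly assumptions.
LABEL_MAP = {
    "road": ["road", "road marking"],
    "sidewalk": ["sidewalk", "curb"],
    "building": ["building"],
    "sky": ["sky"],
    "car": ["car"],
    "truck": ["truck"],
    "pole": ["pole", "utility pole"],
    "traffic light": ["traffic light"],
    "traffic sign": ["traffic sign"],
    "vegetation": ["vegetation", "terrain"],
    "person": ["pedestrian", "rider"],
}

def to_train_class(source_label):
    """Map a dataset-specific source label to one of the 11 shared classes."""
    for train_class, sources in LABEL_MAP.items():
        if source_label in sources:
            return train_class
    return None  # label ignored during training and evaluation

print(to_train_class("terrain"))  # vegetation
print(to_train_class("rider"))    # person
```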
Most notably, our dataset outperforms the SynPeDS dataset on the Cityscapes dataset. This comes as a surprise, as the SynPeDS dataset was created to synthetically resemble the Cityscapes dataset.

Number of Assets
We conducted experiments to understand the influence of the diversity of the training data. To this end, cross-domain performance is evaluated by comparing the number of unique training assets with the resulting cross-domain segmentation performance.
When comparing automotive real-world and synthetic images, it becomes obvious that most images and scenes in real-world images are unique, whereas in synthetic images the scenes are often composed of repetitive content, i.e., a limited amount of unique assets which are continuously arranged differently. In synthetic datasets the 3D assets, i.e., the 3D meshes and textures of objects in a scene, are expensive to create at a high fidelity and should therefore be reused as much as possible. Training a pedestrian detector on a dataset consisting of too few unique person assets will lead to a strongly biased detector, which is able to detect solely the few trained person assets but will fail to generalize to other persons. Overfitting will therefore occur if the training data is of low diversity and the model will fail to generalize, but it is non-obvious how much diversity is actually needed to generalize well.
To understand the required diversity, we investigated the semantic segmentation performance on the person class of a DeeplabV3+ model trained with different subsets of the VALERIE22 and SynPeDS datasets. The subsets, i.e., sequences, of our dataset are described in the appendix, whereas the subsets of the SynPeDS dataset, i.e., tranches, are described in [24]. To track the number of unique person assets per subset in our dataset we simply count the occurrences of unique asset IDs in the scene metadata files of a sequence.
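Counting unique person assets per sequence can be sketched as below. The directory layout and the json keys ("entities", "class", "asset_uuid") are assumptions consistent with the earlier metadata example, not the verified file schema; the demo builds a temporary two-frame "sequence".

```python
import json
import pathlib
import tempfile

def unique_person_assets(sequence_dir):
    """Count unique person asset UUIDs across the scene metadata
    json files of a sequence (file layout and key names are assumed)."""
    uuids = set()
    for path in pathlib.Path(sequence_dir).glob("*.json"):
        metadata = json.loads(path.read_text())
        for entity in metadata.get("entities", {}).values():
            if entity.get("class") == "person":
                uuids.add(entity["asset_uuid"])
    return len(uuids)

# Demo: two frames that share one person asset plus one car asset.
with tempfile.TemporaryDirectory() as seq_dir:
    frames = [
        {"entities": {"91": {"class": "person", "asset_uuid": "uuid-a"}}},
        {"entities": {"12": {"class": "person", "asset_uuid": "uuid-a"},
                      "13": {"class": "car", "asset_uuid": "uuid-c"}}},
    ]
    for i, frame in enumerate(frames):
        (pathlib.Path(seq_dir) / f"frame_{i}.json").write_text(json.dumps(frame))
    print(unique_person_assets(seq_dir))  # 1
```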
Each subset of both datasets represents a stage in the process of its development, and therefore these dataset subsets consist of an increasing number of pedestrian assets the further the development progressed. The trained models are cross-validated on the Cityscapes validation dataset to investigate the cross-domain generalization performance. Figure 6 shows the resulting number of unique person assets in the dataset subsets compared to the cross-domain person class performance measured as mIoU on the Cityscapes dataset. For higher unique person counts, the VALERIE22 subsets clearly outperform the SynPeDS subsets in cross-domain performance. While a low number of unique assets will lead to overfitting on these assets, a higher number clearly benefits the generalization capabilities of the model. Both the VALERIE22-trained models and the SynPeDS-trained models benefit from an increasing number of person assets in the cross-domain performance. The model trained on our full VALERIE22 dataset performs less than 1% worse than the baseline Cityscapes-trained model. The results clearly indicate that the more diverse a dataset is with regard to person assets, the better the generalization capabilities of a segmentation model on this class.

Number of Training Images
Training with a diversified dataset shows a significant improvement in the cross-domain performance. This raises the question of the performance difference between a huge number of training images with lower asset diversity and a smaller count of images with a higher number of assets. A very low number of images should obviously lead to overfitting, but training with a huge dataset with only marginal differences between images could lead to overfitting as well. From our previous experiment we found that the person asset diversity in the overall VALERIE22 dataset is higher than in the SynPeDS dataset and that this leads to a better segmentation performance. However, the number of training images is vastly different between these datasets. To understand the influence of the number of training images, we again compared the cross-domain performance on all 11 classes on the Cityscapes dataset, trained on subsets of the VALERIE22 and SynPeDS datasets. Figure 7 shows the generalization results with the respective cumulative frame counts that were used to train each segmentation model. While no model reaches the baseline performance of 82.34%, the cross-domain performance with sequences of our VALERIE22 dataset reaches higher mIoU values with far fewer image frames than the SynPeDS dataset. As previously shown, the diversity in the VALERIE22 dataset continuously improved, which is evident from the increasing cross-domain performance, whereas the performance of the SynPeDS model even decreased for tranche 4.
In tranche 4 a significant pedestrian object distribution bias was introduced into the dataset, as was found in [12]. In [12] we additionally showed how to utilize the exact positioning metadata of the person assets in the images to identify the pedestrian distributions and understand whether data biases were introduced. Overall, this result clearly shows that only increasing the frame count by reiterating the same assets in the scenes is not a viable strategy to increase the cross-domain generalization performance.

Summary
This paper describes the VALERIE22 dataset. The dataset and its underlying scene models are generated fully automatically with a parametric scene generation and rendering pipeline. The results of a cross-evaluation with real and other synthetic datasets demonstrate the performance of this approach. On European datasets, VALERIE22 performs best (or on par) compared with the synthetic SynPeDS, GTAV and Synscapes datasets.
VALERIE22 comes with a rich set of metadata annotations, making it a valuable asset for research on understanding the performance and domain aspects of DNNs.

Each sequence contains the rendered png images and the same images distorted with our sensor simulation as described in section 2.4 (in the "png distorted" folder). The following gives a brief description of the main sequence groups.
Sequence 50: This version of the scene generator uses an automatic layout of the street (autolane feature). Depending on the width, it includes parking lanes and separate lanes in each direction.

class-id png
The class-id png files contain the semantic-group-segmentation png files mapped to the following 11 classes in grayscale.

semantic-group-segmentation png
The semantic-group-segmentation png files contain the segmentation [RGB] files with the class-to-color mapping as defined in Table 1.

semantic-instance-segmentation png
The semantic-instance-segmentation png files contain the instance segmentation labels for each frame.

general-globally-per-frame-analysis json
The general-globally-per-frame-analysis file defines the Environment, Camera and Entities, i.e., Objects, per frame.

Figure 2. Object identifiers allow tracking of object instances through the rendered frames and metadata.

Figure 3. Variation of the density of pedestrians in the street and on the sidewalk, from low (top) to high (bottom).

Figure 6. Unique person assets per SynPeDS (blue) tranche or VALERIE22 (red) sequence and person class generalization performance on the Cityscapes dataset.

Figure 7. Number of training frames per SynPeDS (blue) tranche or VALERIE22 (red) sequence and overall generalization performance on the Cityscapes dataset.

Table 1. Ground truth class mapping of semantic group segmentation to trainIds and color.