Audible Panorama: Automatic Spatial Audio Generation for Panorama Imagery

As 360° cameras and virtual reality headsets become more popular, panorama images have become increasingly ubiquitous. While sounds are essential in delivering immersive and interactive user experiences, most panorama images do not come with native audio. In this paper, we propose an automatic algorithm to augment static panorama images through realistic audio assignment. We accomplish this goal through object detection, scene classification, object depth estimation, and audio source placement. We built an audio file database composed of over $500$ audio files to facilitate this process. We ran our method on a large variety of panorama images of indoor and outdoor scenes, and we designed and conducted a user study to verify the efficacy of various components in our pipeline. By analyzing the statistics, we learned the relative importance of these components, which can be used to prioritize computation in power-sensitive, time-critical tasks such as mobile augmented reality (AR) applications.


INTRODUCTION
Sound has been demonstrated to be an integral aspect of immersion [12,35], so it is no surprise that there have been numerous attempts to produce realistic sound for 3D environments [2,4].
Given a 360° panorama image, we present an approach to create a realistic and immersive audible companion. We start by detecting the overall scene type as well as all the individual sounding objects. Leveraging scene classification and object recognition, we then assign a background audio of the scene and customized audios for each object.
We use object and scene tags assigned during scene classification and object recognition for audio assignment. These tags are matched with audio tags from our audio file database, and the audio associated with the matching audio tags is assigned as audio sources. We estimate the depths of detected objects by comparing relative heights of objects and pre-established knowledge on the average heights of different types of objects.
We have three major contributions.
• We proposed a new framework for automatically assigning realistic spatial sounds to 360° panorama images based on object recognition and depth estimation.
• We constructed a large dataset of panorama images and audio files. The panorama images are made audible by running our approach on them. The dataset, the results obtained by running our approach, as well as the tool for experiencing the audible panorama will be publicly released for research purposes.
• We conducted statistical analysis that evaluates the importance of various factors in our proposed framework.

RELATED WORK
As camera hardware for 360° content improves significantly, there is increasing interest in facilitating better interaction with 360° content. However, most existing work has focused on the visual aspect, for example, sharing content playback [30], improving educational teaching [1], assisting visual focus [16], augmenting storytelling [8,23], controlling interactive cinematography [22], enhancing depth perception [13], and collaborative reviewing [19]. On the audio side, Finnegan et al. made use of audio perception to compress the virtual space, in addition to conventional visual-only approaches [5]. Rogers et al. designed and conducted a user study on sound and virtual reality (VR) in games, exploring how various factors influence the player experience [25]. Schoop et al. proposed HindSight, which detects objects in real time and thereby greatly enhances awareness and safety through notifications [27]. While previous methods all rely on accompanying audio signals, our project aims to enable a better interactive experience in 360° images by synthesizing realistic spatial audio from visual content alone. We achieve this by constructing a comprehensive audio dataset, which we discuss later.
To enhance the sense of presence enabled by immersive virtual environments, high-quality spatial audio that can convey a realistic spatial auditory perception is important [2,3,10,29]. Many researchers have studied realistic computer-generated sounds. One widely researched excitation mechanism is rigid body sound [20]. To model the sound propagation process, efficient wave-based [24] and ray-based simulation methods [21] have been proposed. More closely related to our method are scene-aware audio for 360° videos [14] and automatic mono-to-ambisonic audio generation [18], both of which require audio as part of the input. We draw inspiration from existing sound simulation algorithms and synthesize plausible spatial audio based on only visual information, without any audio input to our system.
In existing virtual sound simulations, scene information, such as the number of objects, their positions, and motions, is assumed to be known. However, in our project, we compute this information automatically through object detection and recognition. In computer vision, robust object detection has been a grand challenge for the past decades. Early work detects objects rapidly, for example, human faces, using carefully designed features [33]. Recently, more robust methods leveraging deep neural networks have been shown to achieve high accuracy [9,11]. We not only match the scene in a panorama image with an audio file, but also detect and recognize objects in the image and estimate their depth to place audio sources for those objects at convincing positions.
Since the production of sound is a dynamic process, we need to classify not only the objects, but also their movements, their actions, and their interaction with the surroundings. For example, a running pedestrian and a pedestrian talking on the phone should produce different sounds in the audible panorama. To this end, accurate action recognition can guide the generation of more realistic sounds. Most existing action analysis research requires video as input since the extra temporal information provides strong hints as to what certain human actions are [31,36]. Human action recognition on still images remains a challenging task. One approach is to use word embeddings and natural language descriptions to recognize actions [28].
Traditional sound texture synthesis can generate credible environmental sounds, such as wind, crowds, and traffic noise. Harnessing the temporal details of sounds using time-averaged statistics, McDermott et al. demonstrated synthesizing realistic sounds that capture perceptually important information [17]. In addition to sound texture, natural reverberation also plays an important role in the perception of sound and space [32]. An accurate reverberation conveys the acoustic characteristics of real-world environments. However, in these ambient sound synthesis methods, the spatial information is missing since the environment is treated as a diffuse audio source. We build upon these existing ideas and augment panorama images with spatial audio.
Figure 2: Overview of our approach. Given a panorama image, our approach performs scene classification and object detection on images sampled horizontally from the panorama image. Next, it performs object recognition. Based on the scene knowledge, it assigns a corresponding background sound for the scene; it also places appropriate sound sources at the estimated object locations accordingly.

OVERVIEW
Figure 2 depicts an overview of our approach. Our method takes a 360° image as input. After scene classification, object detection, and action recognition, the image is labeled with what we hereafter call a background tag, which matches the type of scene in the image (for example, "Town"), and the objects are labeled with object tags. Object tags are either object names or, if the detected object is a person, words for actions.
To start with, we built an audio file database. These files are organized into two types: background and object audio files. Each audio file is associated with an audio tag, which we set as the object tags for object audio files and scene tags for background audio files.
Our approach then estimates, with a single user interaction for inputting the depth of one object in the scene, the depth and hence the 3D location for each of the detected objects by using estimates of the real-life average height of objects and the relative height of objects recognized in the scene. Based on the detection and recognition results, our approach assigns appropriate audio sources at the calculated depth for each object by comparing the object tags with the audio tags in the database. If there is a match between the tags, we randomly select an audio file from our database that is labeled with the matching audio tag. These audio tags are effectively the different categories of sound that we have in our database.
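To illustrate how an estimated depth becomes a 3D location, here is a minimal sketch. It assumes the panorama is stored in equirectangular projection and uses a y-up coordinate frame; neither is stated in the text, and the function name `panorama_to_3d` is hypothetical.

```python
import math


def panorama_to_3d(u, v, depth):
    """Map a normalized panorama position and a depth to (x, y, z).

    Assumptions (not taken from the paper): the image is equirectangular,
    u and v are in [0, 1], u = 0.5 faces +z, u grows to the right (+x),
    v = 0 is the top of the image, and y points up.
    """
    yaw = (u - 0.5) * 2.0 * math.pi   # azimuth, 0 at the image center
    pitch = (0.5 - v) * math.pi       # elevation, 0 at the horizon
    x = depth * math.cos(pitch) * math.sin(yaw)
    y = depth * math.sin(pitch)
    z = depth * math.cos(pitch) * math.cos(yaw)
    return (x, y, z)
```

An object centered on the horizon a quarter-turn to the right of the image center, at a depth of 4 units, would land at roughly (4, 0, 0) under these conventions.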
To obtain the audio file for the background audio of the scene, we use the same approach except that we use the scene tag instead of the object tags. The output of our approach is an audible panorama image, which can then be experienced using a virtual reality headset.
Figure 3(b): A sample image and the corresponding tags assigned by our automatic scene classifier. In this case, some scene tags assigned were "Crowd", "Transport", and "Town". "Town" was ultimately the highest scored and most frequently occurring tag across all tested image segments, so it was selected as the scene tag for the Chinatown scene.

APPROACH

Sound Database
To ensure that our sound database is comprehensive, we select the sounding objects based on 1,305 different panorama images found on the internet. By running scene classification and object detection and recognition on these panorama images, we were able to detect repeatedly occurring objects and scene types, which we also use as the corresponding object and scene tags. We then set the categories for all the sounds and build a database with background and single audio files. The audio files represent background sounds, human actions, sounds of motor vehicles, etc. Our sound database comprises a total of 512 different audio sources, which are organized into two types: audio sources for objects and sounds for background ambient scenes:
• Object Sounds: There are 288 different object sounds, which include human chatting and walking, vehicle engine sounds, animal yelling and others in 23 categories. Each category, which matches previously mentioned object tags, is used for audio source assignment for objects in the scene. We normalize the volume of all sounds and generate versions of the audio files for various configurations.
• Background Audio: There are 224 different background audio files in total. We catalog these using 22 tags that match the different scene tags such as "City", "Room", "Sea", etc. These audio tags are used for selecting the audio source for the background based on the scene tag.
For a list of representative audio tags, refer to Table 1. A complete list can be found in our supplementary material. All audio files were obtained from the website "freesound.org" [6]. To generate 3D spatial audio from a 2D image, we estimate the relative depth of different objects in order to place the sounds reasonably in 3D space to simulate spatial audio.
Table 1: A partial list of the audio tags used in our database. Each audio tag is a label for audio files in our database. So the "Chatting" tag, for example, tags different files for audio of people chatting. The type refers to audio files either being for background audio (bg) or object audio (obj). Refer to our supplementary materials for a complete list of all audio tags. Our database contains a total of 512 audio files.

Scene Classification
The goal of scene classification is to assign a scene tag to the scene, which matches one of the background audio tags in our audio database. Beginning with a panorama image and a camera viewing horizontally from the center of the rendered image, we rotate the viewpoint horizontally in steps of 36° to capture different segments of the image. Our system also splits the captured samples vertically. If desired, the user may increase the number of samples adaptively to provide more input data for scene classification. Each segment is assigned a list of five scene categories, which we use as the scene tags, ranked in decreasing order of classification confidence scores.
We combine the classification results on all slices and select the most frequently occurring, highest-scored tag as the tag for the overall scene of the panorama image. For example, if the two most commonly occurring scene tags for an image are "Library" and "Living Room", with confidence scores of 0.8 and 0.92 respectively, then "Living Room" is selected as the scene tag. Refer to Figure 3(b) for an example.
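The tag selection described above can be sketched as follows. This is one possible reading of "most frequently occurring, highest-scored": rank tags by how many segments they appear in, then break ties by the best confidence seen. The function name `select_scene_tag` and the input layout are hypothetical.

```python
from collections import defaultdict


def select_scene_tag(segment_predictions):
    """Pick one scene tag from per-segment classifier output.

    segment_predictions: list with one entry per image segment; each entry
    is a list of (tag, confidence) pairs (the paper uses five per segment).
    Returns None when there are no predictions at all.
    """
    counts = defaultdict(int)
    best_score = defaultdict(float)
    for predictions in segment_predictions:
        for tag, score in predictions:
            counts[tag] += 1
            best_score[tag] = max(best_score[tag], score)
    if not counts:
        return None
    # Assumed rule: frequency first, highest confidence as the tie-breaker.
    return max(counts, key=lambda tag: (counts[tag], best_score[tag]))


segments = [
    [("Library", 0.8), ("Living Room", 0.92), ("Office", 0.3)],
    [("Living Room", 0.7), ("Library", 0.75)],
]
print(select_scene_tag(segments))
```

With the two tags tied on frequency, the higher confidence of 0.92 decides in favor of "Living Room", mirroring the example in the text.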

Object Detection and Recognition
We use TensorFlow Object Detection, which is based on a Convolutional Neural Network (CNN) model pre-trained on the COCO dataset [15]. Our approach slices the panorama image the same as in scene classification, and we run object detection on each slice. If there are any overlapping objects from one slice to the next, we count the detected objects as the same object. Once objects have been detected, we use Google Vision for object recognition, feeding in cropped images of all detected objects to the object recognition algorithm. We assign object tags to the objects in the same way that we tag the scene. Figure 4 depicts the object detection process.
Figure 5: We assume the average real-world heights of different object categories based on estimation and previously recorded averages, which are shown in our supplementary materials. In this example, we show both people (average height: 5.3 ft) and a car (average height: 5.0 ft). The person highlighted in red represents the reference object and the red line represents the baseline depth specified by the user. The depth estimation estimates the depth of the black objects from these inputs. (a) The system chooses the reference object, which corresponds to the object that received the highest confidence score during object detection. (b) The designer specifies the baseline depth to the reference object. (c) The system estimates the depths of the other objects. (d) Audio sources corresponding to the objects will be placed at their estimated depths to create spatial audio.
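The rule in this section of counting overlapping detections from adjacent slices as one object might look like the sketch below. It is a simplification under stated assumptions: detections are reduced to horizontal extents in panorama degrees, only detections with the same label are merged, vertical overlap and wrap-around at 360° are ignored, and `merge_detections` is a hypothetical helper.

```python
def merge_detections(detections):
    """Merge same-label detections whose horizontal extents overlap.

    detections: list of dicts with keys 'label', 'start', 'end' (degrees
    along the panorama) and 'score'. Input dicts are not modified.
    """
    merged = []
    # Sorting by label, then start angle, puts merge candidates side by side.
    for det in sorted(detections, key=lambda d: (d["label"], d["start"])):
        last = merged[-1] if merged else None
        if last and last["label"] == det["label"] and det["start"] <= last["end"]:
            # Overlap: grow the existing box and keep the better score.
            last["end"] = max(last["end"], det["end"])
            last["score"] = max(last["score"], det["score"])
        else:
            merged.append(dict(det))
    return merged
```

A person seen at 30° to 40° in one slice and 38° to 50° in the next would be reported once, spanning 30° to 50°.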

Depth Estimation
Our depth estimation technique requires comparing two objects detected in a panorama image. The first object is a reference object $r$, which we use as a baseline to estimate the depths of all other detected objects in the image. The depth of the reference object, $d_r$, is set by the user. This is the only user interaction required during the depth estimation process. By default, the reference object is chosen as the object with the highest score (which corresponds to the highest confidence) from running object recognition on the image.
Our goal is to estimate the depth $d_i$ for each object $i$ detected in the panorama image. Let $R_r$ and $R_i$ be the estimated real-world heights of the reference object $r$ and the detected object $i$ respectively (e.g., the average height of a "person" is 5.3 ft [34], and that of a "car" is 5.0 ft).
The average heights for all objects in the database were either estimated by us or taken from real-world data. Please refer to our supplementary materials for a complete list of the average heights for all the object tags. Savva et al. offer a technique for automatically estimating the appropriate size of virtual objects [26], which is complementary to our approach. Let $N_r$ be the normalized height of the reference object $r$ (i.e., the object's height in the image divided by the image height) and $N_i$ be the expected normalized height of the detected object $i$, that is, the normalized height object $i$ would have if it were at the reference depth $d_r$. By similar triangles, we have the following relationship:
$$\frac{N_i}{N_r} = \frac{R_i}{R_r}$$
Here, $N_i$ is the only unknown variable, which we solve for. The final step is to calculate the estimated depth $d_i$. To do this, we compare the expected normalized height $N_i$ with the actual normalized height $N'_i$ of the object in the image, whose depth we are trying to estimate. Then, by similar triangles, we obtain the estimated depth $d_i$ of object $i$ by:
$$d_i = \frac{N_i}{N'_i}\, d_r$$
This depth estimation process is applied for every detected object in the image. Figure 5 illustrates this process.
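As a worked sketch of the depth estimate, the function below applies the two similar-triangle steps in sequence: first the expected normalized height of object i at the reference depth, N_i = N_r * R_i / R_r, then the depth d_i = d_r * N_i / N'_i. These formulas are our reading of the description in this section, and the function and parameter names are hypothetical.

```python
def estimate_depth(ref_depth, ref_real_height, ref_norm_height,
                   obj_real_height, obj_norm_height):
    """Estimate an object's depth from a user-supplied reference depth.

    ref_depth:        d_r, depth of the reference object (set by the user)
    ref_real_height:  R_r, assumed real-world height of the reference object
    ref_norm_height:  N_r, reference height in the image / image height
    obj_real_height:  R_i, assumed real-world height of the object
    obj_norm_height:  N'_i, the object's actual normalized image height
    """
    if min(ref_real_height, ref_norm_height, obj_norm_height) <= 0:
        raise ValueError("heights must be positive")
    # Height the object would show in the image if it stood at depth d_r.
    expected_norm_height = ref_norm_height * obj_real_height / ref_real_height
    # An object that appears smaller than expected is proportionally farther.
    return ref_depth * expected_norm_height / obj_norm_height


# Hypothetical numbers: a 5.3 ft person fills 20% of the image height at a
# user-specified depth of 10; a 5.0 ft car fills 10% of the image height.
print(round(estimate_depth(10.0, 5.3, 0.2, 5.0, 0.1), 2))
```

With these made-up inputs the car comes out at about 18.87 depth units, nearly twice as far as the reference person, because it appears half as tall while being almost the same real-world height.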

Audio Source Assignment
Once objects in the image have been detected and tagged, the next step is to assign an adequate audio source to each one. We accomplish this by comparing the five tags assigned to each object during object recognition to the tags in our database. We assign a sound if one of the tags matches a tag in the database. The tags of each object are compared to the tags in the database in order of highest to lowest confidence scores.
For objects in the image detected as persons, the object recognition algorithm automatically assigns tags for actions, so action recognition in our approach is handled through these action tags. The assignable action tags are "Chatting", "Chatting on Phone", "Typing", and "Walking". Some action tags returned by object recognition, including "Sitting" and "Standing", are ignored by our approach because they are not audible.
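The matching procedure in this section can be sketched as below. The database is modeled as a plain dictionary from audio tag to file names, which is an assumption about its layout; the file names and the helper name `assign_audio` are hypothetical.

```python
import random

# Action tags the text describes as ignored because they make no sound.
SILENT_TAGS = {"Sitting", "Standing"}


def assign_audio(object_tags, audio_db, rng=random):
    """Return (tag, file) for the best-matching audible tag, or None.

    object_tags: list of (tag, confidence) pairs from object recognition.
    audio_db:    dict mapping an audio tag to a list of audio file names.
    rng:         source of randomness exposing choice(), e.g. random.Random.
    """
    # Try tags from highest to lowest confidence, as described in the text.
    for tag, _score in sorted(object_tags, key=lambda t: t[1], reverse=True):
        if tag in SILENT_TAGS:
            continue
        files = audio_db.get(tag)
        if files:
            # Any file carrying the matching tag is an acceptable source.
            return tag, rng.choice(files)
    return None


audio_db = {"Chatting": ["chat_01.wav"], "Car": ["car_01.wav", "car_02.wav"]}
tags = [("Standing", 0.95), ("Person", 0.90), ("Chatting", 0.60)]
print(assign_audio(tags, audio_db))
```

Here "Standing" is skipped as inaudible, "Person" has no entry in the database, and the object ends up with the single "Chatting" file.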

Audio Equalization Processing
As a post-processing step, all the assigned audio files can be equalized using Adobe Audition. This sound equalization is done in order to match the quality of the sound to fit either an indoor or outdoor environment according to the recognized scene type. In our implementation, we first normalized the volumes of the sounds from different sources before scaling them by distance, and applied an equalization matching algorithm to create indoor and outdoor versions for each sound [7].
[Figure panels: Seashore, Cafeteria, Living Room, Neighborhood, Bus, ..., Museum, Campus, Dock, Park]
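The volume normalization and distance scaling mentioned in this section could be approximated as in the sketch below. The text does not say which normalization or attenuation law was used, so peak normalization and an inverse-distance gain are assumptions here, as are all function names. Equalization itself is not covered.

```python
def peak_normalize(samples, target_peak=1.0):
    """Scale samples so the largest absolute value equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence (or empty input) is left unchanged
    gain = target_peak / peak
    return [s * gain for s in samples]


def distance_gain(depth, ref_depth=1.0):
    """Inverse-distance attenuation (an assumed law, not from the paper).

    Sources nearer than ref_depth are clamped to unity gain, not boosted.
    """
    return ref_depth / max(depth, ref_depth)


def prepare_source(samples, depth):
    """Normalize first, then attenuate by the object's estimated depth."""
    gain = distance_gain(depth)
    return [s * gain for s in peak_normalize(samples)]
```

Normalizing before attenuation means that two recordings with very different original loudness end up equally loud when placed at the same depth.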

EXPERIMENTS
Our experiments were conducted with a 3.3GHz Intel Core i7 processor, an NVIDIA Quadro M5000 GPU, and 32GB of RAM.
To conduct the experiments, we created a 360° panorama image viewer with the Unity engine, which supports spatial audio. We release our code, data, and viewer for research purposes. The data includes 1,305 panorama images obtained from flickr.com and four images which we captured with a Ricoh Theta V 360° camera.

Sound Placement Results
We discuss the results of running our approach on 4 different panorama images, namely, Chinatown, Seashore, Living Room, and Cafeteria, which are depicted in Figure 1 and Figure 6. For audible versions of these results for the Chinatown scene, please refer to the supplementary video.
Chinatown: Our approach automatically assigned an audio file that matched the "Town" tag as the background audio and identified many people. These people were assigned object tags like "Walking" and "Chatting". Cars were also detected and recognized, and the object tag "Car" was assigned to these objects. Audio files matching these tags were then assigned from the database. The locations of the people and vehicles are easily discernible, with the background audio making the scene experienced in VR feel like an actual city center. The background audio "Town" matches the sound that one could expect to hear in a town center, including background vehicle sounds and construction sounds.
Seashore: Our approach automatically assigned an audio file matching the "Sea" tag from our database as the background audio. Our approach also detected one person far off in the distance with a cellphone, so an object tag of "Chatting on Phone" was assigned to that object. This object tag matches the audio tag "Chatting on Phone" in the database, so an audio file associated with that tag was randomly chosen. Since the person is far off in the distance, the audio source was placed accordingly and can barely be heard. We used this image to test results of using background audio with few object audio sources. The background audio assigned consists of the sounds of waves reaching the shore. This mimics what one would expect to hear at a beach shore, as the sound of waves tends to drown out other sounds, especially in a scene like this, which is not crowded.
Living Room: The background sound assigned from our database matches the background tag "Room" and consists of quiet background noise, which mimics the background noises heard inside city apartments. Only a single audible object in the room was detected and recognized with the object tag "TV". By matching the object tag with the same audio tag in the database, we randomly selected an audio file with sounds from a television. In our case, music plays from the television. When viewing this scene with the Oculus Rift headset, it is immediately recognizable where, and from what object, the audio is coming.
Cafeteria: Our approach assigned an audio file matching the audio tag "Restaurant" as the background sound of this scene. It also identified people with the tag "Chatting", so audio files for chatting were also assigned. The restaurant audio file used as background audio, combined with the placed audio files for people chatting, produces the effect of being in an indoor crowded space.
Table 2: Audio configurations used in the user study. The bolded configuration is the standard configuration used in each set. The standard configuration is the result of running our approach on a particular scene without any modifications. We created these sets to investigate what features of the audio were important in delivering a realistic audio experience for the scenes.

USER STUDY
To evaluate our approach, we conducted an IRB-approved user study with 30 participants. The users were aged 18 to 50, consisting of 17 males and 13 females, with no self-reported hearing or vision impairment. They were asked to wear an Oculus Rift headset with Oculus Touch controllers and headphones to view four panorama images. The audio was spatialized to stereo via Unity's built-in spatializer plugin (Oculus Spatializer HRTF). The 4 scenes are Chinatown, Seashore, Cafeteria and Living Room.

Study Design
The goal of the user study was to test how different characteristics of the assigned audio affected user ratings. To this end, we set out to test 7 sets of different audio configurations. The goal of this setup is to investigate which aspects of the synthesized audio had the largest effect on the subjectively perceived quality. Please refer to Table 2 for a description of the audio configurations included in each set.
Users were asked to view each scene while the audio was played with a certain audio configuration. For each set of audio configurations, the user experienced the same image under the different configurations belonging to that set. The 7 sets were tested in random order, with within-set audio configurations also being played in random order. The initial viewing angle of the scene was randomized before each scene and audio configuration were shown to avoid bias. Users experienced each scene and audio configuration combination once so that they could give a rating for each audio configuration under each set on each scene.

Rating Evaluation
Users rated each configuration using a 1-5 Likert scale, with 1 meaning that the audio did not match the scene at all and 5 meaning that the audio was realistic. The audio for each configuration played for approximately 10-20 seconds. We also calculate the p-value for each audio configuration comparison using the Repeated Measures Analysis of Variance (RM-ANOVA) test for sets with 3 configurations and using the t-test for sets with only 2 configurations. The tests were run independently for each scene. We chose Repeated Measures ANOVA and the t-test since each participant did all configurations under each set. Note that the audio synthesized by our approach without any modification (referred to as the standard configuration) was included in each set. Any p-value below 0.05 indicates that we can reject the null hypothesis $H_0$, which states that the average user ratings for the audio configurations in each set are about the same. So, whenever we reject $H_0$ for configurations in a set, it means that the difference in rating caused by the different configurations in that set is significant.
Table 3: The p-value for each scene and set of audio configurations calculated from the user study data. The p-values smaller than 0.05, which reject the null hypothesis $H_0$, are bolded. We performed this statistical analysis to study which aspects of our system affect the user-perceived quality of the audio assigned in each case.

Results
Figure 7 shows a comparison of the different ratings given to each audio configuration by the users. The p-values calculated are shown in Table 3. Our supplementary materials contain the individual scores of each participant. We discuss the results for each set of audio configurations:
Set 1 (Space): For the space set, full spatial audio (the standard configuration) received the highest average score. If we look at the p-values for each scene, we see that they are all below the threshold of 0.05 for rejecting $H_0$, except for the Seashore scene. Seashore is assigned only background audio and one object audio for a person who is far away in the scene; the realism brought about by spatial audio may not be apparent or important for this scene. We can conclude that the spatial positioning of audio sources by our approach produces more realistic results than only using stereo or mono audio.
Set 2 (Background): For every scene except Seashore, configurations with background audio have scores about the same or slightly higher than the scores obtained by only including object sounds. The p-values for all scenes except Seashore are above 0.05. What we can conclude from this is that the effect of adding background audio is not significant when sufficient object audio sources are placed in the scene. This could be explained by the fact that for Chinatown and Cafeteria there are many distant, individual objects whose sounds may constitute a realistic background sound when mixed together, while for Living Room the environment is supposed to be quiet. For Seashore, turning off the background audio renders the scene almost silent because all the expected background beach sounds (e.g., sea waves sound) are gone while only the sound of a single detected person chatting on a phone can be heard. In conclusion, we believe that the background audio's importance depends on the scene's setting, while using both background and object audios may provide a more realistic experience in some scenes.
Figure 7: Results of the user study. Each plot corresponds to a set of audio configurations. The colored boxes represent the IQRs; the colored dots represent the means; the thick bars at the middle of the IQRs represent the medians; the upper and lower whiskers extend from the hinge by no further than 1.5 × IQR, which show an approximately 95% confidence interval for comparing medians; and the bold texts represent the standard configurations. For each set, the average user rating under each audio configuration for each scene is shown. Note that for the Seashore and Living Room scenes, sets 6 (Depth) and 7 (Number of Objects) were not tested since each scene only has the background audio and audio for one object assigned.
Set 3 (Synthesized): For all the scenes, the configuration with our synthesized audio had a higher average score than that of the recorded audio. However, only the p-value for the Cafeteria scene was significant (below the threshold of 0.05). We conclude that our synthesized audios are perceived to be at least as realistic as those recorded with a 360° camera.
Set 4 (Correctness): Configurations produced by our approach, in which audio is placed for correctly recognized objects, received higher average ratings across all four scenes. The p-values were all below 0.05, so we can conclude that using audio files that match correctly with the objects of the scene is important.
Set 5 (EQ): Across all four scenes, the average scores for configurations with and without audio equalization were approximately the same and the p-values were all above 0.05. We can conclude that the effect brought about by sound equalization is negligible, at least for our scenes.
Set 6 (Depth): On average, the configuration for audio placed at depths calculated by our approach scored higher than the configurations for uniform and random depths. However, the p-values for all scenes were above 0.05. We can conclude that while placing audio sources at proper depth may enhance realism, the effect is not significant in general. As shown in set 1, the positions of the sound sources are more important than their depths with regard to the realism perceived by the users. We conclude that while being able to recognize the depths of audible objects is important, there can be some flexibility in positioning the sound sources at their exact depths.
Set 7 (Number of Objects): On average, users preferred the configurations with all (i.e., 100%) audio sources used. However, the p-values were all above 0.05. We conclude that while using all detected audio sources produces more realistic results on average, the realism enhancement brought about by using all sources compared to using only some of the sources is not significant. In other words, having a few properly placed sound sources is enough to create the sense that the sound is realistic, even if other objects in the scene are not assigned audio files.
Post-hoc Tests: We also ran pairwise post-hoc tests on sets 1, 6, and 7 for the four scenes, with a total of 24 post-hoc tests. For set 1 (space), there is a significant difference (p-value smaller than 0.05) in ratings between the standard configuration (spatial) and the mono configuration in the Living Room scene. For set 6 (depth), there is a significant difference in ratings between the standard configuration (using our estimated depth) and the random depth configuration in the Chinatown scene. For set 7 (number of objects), there is a significant difference in ratings between the standard configuration (using all detected objects) and the configuration using 10% of the detected objects. Please refer to our supplementary documents for full results of the post-hoc tests.
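For the sets with only two configurations, the comparison can be illustrated with a paired t statistic computed by hand. The sketch assumes a paired (dependent-samples) t-test, which the within-subject design suggests but the text does not state outright. The ratings shown are made up for illustration and are not data from the study.

```python
import math
import statistics


def paired_t_statistic(ratings_a, ratings_b):
    """Return (t, degrees_of_freedom) for paired per-participant ratings.

    Each participant contributes one rating to each list, in the same order.
    """
    if len(ratings_a) != len(ratings_b) or len(ratings_a) < 2:
        raise ValueError("need two equally long lists with at least 2 ratings")
    diffs = [a - b for a, b in zip(ratings_a, ratings_b)]
    spread = statistics.stdev(diffs)  # sample standard deviation
    if spread == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical Likert ratings from five participants for two configurations.
with_eq = [5, 4, 4, 5, 3]
without_eq = [3, 3, 4, 4, 2]
t, dof = paired_t_statistic(with_eq, without_eq)
print(round(t, 3), dof)
```

The p-value is then read from the t distribution with the returned degrees of freedom; a statistics library such as SciPy (`scipy.stats.ttest_rel`) computes both values directly.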

User Feedback
Most users commented that they found the synthesized sounds to be realistic and immersive. However, some users commented that they found some audio sources unrealistic because they could not see moving objects. This was especially true in the Chinatown scene, where some users complained that they found the sound of cars unrealistic since no vehicles were moving. While this is a limitation posed by static panorama images, it does relate to the most common suggestion that users had on extending our approach to videos.
For the Living Room scene, some users stated that while the TV sound source was realistic, the audio for the sound source was too centralized. In other words, when turning their head away from the TV, they expected to hear more sound waves bouncing back across the room. We could explore incorporating sound waves bouncing off surfaces in our approach, which would require semantic understanding of what surfaces are in the scene.
Many users claimed that they could clearly tell the difference between the configurations that had audio sources positioned with respect to objects in the image and those that did not. Overall, most users were able to identify where audio sources were placed in the images. Most said that such 3D placement of sounds enhanced the scenes. Refer to our supplementary material for information on the frequency of certain types of comments made by users.

Discussion
Out of the 7 sets of audio confgurations tested, audio placed by our approach with no modifcations tested best in most experiments. From our user study, we see that our confguration received the same or highest average score among all audio confgurations in each set. From the parameters we tested, the most relevant ones are spatial audio and correct objects. In comparison, audio equalization, accurate depth, and using all detected objects are not as important as the spatialness and correct detection. As for the background audio, our results show that its importance depends on the scene complexity and it can enhance the realism in some cases (e.g., Seashore).
One application of these findings is for prioritizing in power-sensitive or real-time computing, for example, mobile AR applications where certain computations can be conducted with a lower priority without significantly deteriorating overall user experience. We advocate an emphasis on a high-quality ambisonic audio engine and robust object detection algorithm. On the other hand, estimating accurate, high-resolution depth and performing audio equalization could be given a lower priority.
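As a rough illustration of this prioritization, the following minimal sketch keeps pipeline stages in priority order until a compute budget runs out. It is not part of our implementation: the stage names, the `Stage` type, the `select_stages` helper, the placement of background audio in a middle tier, and all cost values are hypothetical placeholders chosen only to mirror the ordering discussed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    priority: int   # lower value = more important (assumed ordering, see lead-in)
    cost: float     # hypothetical relative compute cost, not a measured value


# Hypothetical stage list. Priorities follow the discussion above;
# the costs are made-up placeholders for illustration only.
STAGES = [
    Stage("spatial_audio_rendering", 0, 3.0),
    Stage("object_detection", 0, 4.0),
    Stage("background_audio", 1, 1.0),   # middle tier is an assumption
    Stage("depth_estimation", 2, 2.0),
    Stage("audio_equalization", 2, 1.0),
]


def select_stages(stages, budget):
    """Greedily keep stages in priority order while they fit in the budget."""
    selected = []
    remaining = budget
    # sorted() is stable, so stages with equal priority keep their list order.
    for stage in sorted(stages, key=lambda s: s.priority):
        if stage.cost <= remaining:
            selected.append(stage.name)
            remaining -= stage.cost
    return selected


if __name__ == "__main__":
    # With a tight budget, the lower-priority stages are the ones dropped.
    print(select_stages(STAGES, budget=8.0))
```

Under a reduced budget, such a scheme would drop depth estimation and equalization first, while spatial rendering and object detection are retained as long as they fit.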

CONCLUSION AND FUTURE WORK
We presented Audible Panorama, an approach for automatically generating spatial audio for panorama images. We leveraged scene understanding, object detection, and action recognition to identify the scene and objects present. We also estimated the depths of different objects, allowing for realistic audio source placement at desired locations. User evaluations show that the spatial audio synthesized by our approach can bring a realistic and immersive experience to viewing a panorama image in virtual reality. We are open-sourcing the audiovisual results we ran on Flickr panorama images, the viewer program, and the manually curated audible sound database.

Limitation and Future Work
Our current approach only applies to 360° panorama images. As an early attempt, we focus on panorama images, which are abundant but usually lack accompanying audio. A useful and natural extension would be to make our approach compatible with 360° videos while maintaining temporal consistency.
As with other data-driven synthesis approaches, one inherent issue with our approach is generalization. In our current study, only 4 scenes are comprehensively evaluated.
While we synthesized and released the audios for all the panorama images we collected on our project website, we did not conduct a full-scale evaluation on all the 1,305 results. One direction for future work is to determine whether those results are as good as the 4 evaluated ones. Our approach may not perform well on panorama images with a) moving objects with dynamic sound; b) tiny objects too small to be detected (e.g., a bird); and c) partially occluded objects that result in object detection failure. For example, while a user may expect a partially occluded car on a road to give car engine sounds, an object detection algorithm may fail to detect the car due to partial occlusion, and hence no car engine sound is assigned by our approach. We are interested in developing a more systematic way of measuring audio quality and perceptual similarity, especially for immersive audiovisual content.
The user feedback we received also hints that exploring how to synthesize sounds for moving objects could help improve our approach. Inferring semantic behavior from still images is inherently challenging due to the lack of temporal information, which carries important object movement and action cues. However, as humans can infer object motion from a single image in many cases, we believe that, with the advancement of computer vision techniques, it would be possible for computers to infer similarly, perhaps by leveraging a more sophisticated understanding of the scene context (e.g., a car near the curb rather than in the middle of the road is more likely to be parked and static) or by analyzing subtle details (e.g., motion blur) in different regions of the image.
Another interesting extension is to apply our approach to panoramic cinemagraphs: still panorama images in which a minor and repeated movement occurs on a few objects, giving the illusion that the viewer is watching an animation. Our approach could be applied to assign sounds of this repeated motion to the moving objects in a cinemagraph.
We have created a sound database containing audio sources for many types of scenes and objects observed in common panorama images. One further augmentation is to devise sound synthesis algorithms that can synthesize novel audios adaptively based on observations from new images. Such synthesized audios may match the scene even more closely and realistically, and may introduce more natural variations. By releasing our results and working toward a more comprehensive research toolkit for spatial audio, we look forward to user experience enhancement in virtual reality.