Toward Next-generation Volumetric Video Streaming with Neural-based Content Representations

Striking a balance between minimizing bandwidth consumption and maintaining high visual quality stands as the paramount objective in volumetric content delivery. However, achieving this ambitious target is a substantial challenge, especially for mobile devices with constrained computational resources, given the voluminous amount of 3D data to be streamed, strict latency requirements, and high computational load. Inspired by the advantages offered by neural radiance fields (NeRF), we propose, for the first time, to deliver volumetric videos by utilizing neural-based content representations. We delve deep into potential challenges and explore viable solutions for both video-on-demand (VOD) and live video streaming services, in terms of the end-to-end pipeline, real-time and high-quality streaming, rate adaptation, and viewport adaptation. Our preliminary results lend credence to the feasibility of our research proposition, offering a promising starting point for further investigation.


Introduction
Holographic communication [9], a major beneficiary of 3D content delivery, harnesses volumetric content to construct holograms that depict 3D objects or scenes, thereby offering an immersive experience for users. A key characteristic of volumetric content is its provision of six degrees of freedom (6DoF) in movement, enabling users to not only change viewing angles but also freely navigate in the 3D space.
While there have been increasing efforts in recent years to optimize volumetric content delivery and enhance its quality of experience (QoE) [19,23,29,73-76], existing work still falls short in several areas. For example, traditional representation methods with point clouds and meshes [46,54] have limitations when it comes to representing dynamic elements [32,41] and lighting effects [10,63], owing to their discrete nature. Thus, these techniques often fail to achieve photo-realistic rendering quality, affecting the QoE.
The latest advancements in implicit neural representations, such as neural radiance fields (NeRF) [38], have gained popularity as an attractive alternative for representing volumetric content with high visual quality [16,37,47], as shown in Figure 1. NeRF is a neural-based method for generating high-quality images through novel view synthesis. Rather than relying on discrete points or polygons, it leverages a multilayer perceptron (MLP) to depict a scene as a continuous function, enabling the rendering of photo-realistic images for an immersive viewing experience. It maps a continuous space of 3D position and viewing direction to a density and view-dependent radiance, leading to the creation of a 2D image through volume rendering, a process that aggregates colors along each ray.
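Concretely, for a camera ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$ with near and far bounds $t_n$ and $t_f$, the volume rendering step of the original NeRF formulation [38] computes the pixel color as

$$\hat{C}(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t),\mathbf{d})\,dt, \qquad T(t) = \exp\!\Big(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\Big),$$

where $\sigma$ and $\mathbf{c}$ are the density and view-dependent color predicted by the MLP, and the transmittance $T(t)$ accounts for occlusion along the ray. In practice, the integral is approximated by querying the MLP at discrete sample points along each ray.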
Given that the vanilla NeRF is computationally intensive and primarily suited for static scenes, recent work concentrates on optimizing the performance of NeRF [6,15,26,70] and extending it to dynamic scenes [13,44,45,56,57,64]. Nevertheless, current endeavors have yet to address the networking and systems challenges associated with streaming volumetric content with NeRF.
In this paper, we propose to deliver volumetric content by leveraging neural-based representations. Our goal is to make NeRF-based volumetric video streaming systems practical by boosting network efficiency and enhancing QoE. Despite extensive research on NeRF for dynamic scenes, the design and implementation of NeRF-based volumetric video streaming systems still pose the following challenges: (1) the trade-offs between model size, inference time, and visual quality; (2) the feasibility of incorporating rate and viewport adaptation into NeRF models; and (3) the stringent real-time requirement for live video streaming.
Our research aims to provide novel insights and potential strategies for real-time, high-quality volumetric video streaming systems by leveraging the power of NeRF. The motivation to utilize NeRF for representing volumetric content comes from its inherent capability to synthesize photo-realistic 3D scenes directly from 2D images, an attribute particularly valuable for outdoor scenes [35], where content capturing via RGB-D cameras is less effective. Driven by these remarkable capabilities, our work delves into the potential challenges and solutions in applying NeRF-based methods to video-on-demand (VOD) and live streaming services. In summary, our work makes the following contributions.
• We first explore the research challenges tied to NeRF-based volumetric content delivery in VOD services and propose potential solutions (§3.2-§3.4). For example, high-resolution image rendering is still demanding due to the increased latency for processing more pixels. We propose to leverage foveated rendering [3,39], which reduces the total number of to-be-rendered pixels and decreases the overall computation load. On the other hand, delivering NeRF models over the Internet to represent volumetric content may be bandwidth-intensive. Thus, we propose to explore model compression [60], rate adaptation with scalable neural networks [7,69], and viewport adaptation to alleviate bandwidth consumption.
• We then investigate NeRF-based live volumetric video streaming, a promising avenue for next-generation services such as telesurgery [8] and remote collaboration [61] (§3.5). The main challenges stem from the need for real-time and continuous learning, since future frames in a live setting are unknown. Given that real-time training of NeRF models remains challenging, we propose an acceleration approach that involves offline pre-training of the model for the initial scene, followed by frame-specific fine-tuning based on pixel alterations between subsequent frames.
• Finally, we study the feasibility of NeRF-based volumetric content delivery that utilizes state-of-the-art NeRF models for dynamic scenes. Our preliminary results indicate that there exists a trade-off between model size, inference time, and visual quality. In addition, current methods commonly train a single NeRF model for all frames in the video, and the model size stays the same even when using chunks with varying numbers of frames, making them less suitable for video streaming applications. These results underscore the importance of further optimizations to make NeRF-based volumetric video streaming practical.

Background
Traditional Volumetric Content Representations mainly utilize geometry structures such as point clouds [19,29], meshes [10,72], and voxels [12,42]. Point clouds are effective for non-manifold structures [11], yet their absence of spatial connectivity [50,58] may cause holes. Meshes excel in offering surface detail and efficient rendering via rasterization [5,27]. However, their reliance on a fixed topology [32,41] hampers modeling topological changes. Furthermore, they often struggle to model occlusions and optical effects [10,63], restricting their potential for generating photo-realistic 3D models. Comparatively, voxels surpass point clouds with their regular structure and editing efficiency [68], and, compared to meshes, provide internal features and facilitate volumetric operations for topological flexibility [17,36]. However, typical voxelization strategies, which map voxels to occupancy fields [36] or signed distance functions [17], still demand significant memory, confining their application to simple geometric shapes [38,43].

Neural Radiance Fields. Beyond traditional geometric representations of volumetric content, the advent of neural networks has introduced more innovative methods. With MLP models, NeRF [38] leverages the plenoptic function [1] to construct an implicit, continuous representation of a volumetric scene. For rendering, NeRF utilizes a differentiable version of ray marching [22] that involves querying the neural network at multiple positions along each camera ray to generate color and density values. The inherent differentiability of this approach facilitates the optimization of the scene representation, effectively narrowing the gap between 2D image pixels and 3D properties of the scene [21,33]. Hence, NeRF serves as an efficient method [35,59,71] for synthesizing novel views from 2D images, effectively capturing the dynamic interplay of light and color within the 3D space.

Learning-based Immersive Content Representations. In addition to NeRF, there are several other learning-based approaches to represent immersive content [4,31,32]. Neural volume [31] conducts volume rendering for view synthesis, similar to NeRF. It utilizes an encoder-decoder network architecture wherein the decoder generates a volume containing RGB and opacity values. MVP [32] is a follow-up to neural volume [31]. It combines the neural volume and traditional 3D mesh to represent volumetric content, enabling practitioners to strike a balance between rendering quality and latency. Despite these viable alternatives, NeRF, with its relatively easy implementation and high-quality representation capabilities, is our primary focus in this paper.

Streamable NeRF. The original NeRF is designed mainly for static scenes, making it not applicable for streaming. To adapt NeRF for free-view volumetric videos, early studies either directly integrated the time dimension as an additional input to NeRF [13,67] or employed a secondary MLP to model and learn deformations for each video frame [44,45,49]. However, these methods bear several limitations, including slow rendering speed [13,67], difficulty in representing large-scale motion or dynamic events such as topological changes [13,44,49], and large model size alongside lengthy training periods [44,49]. To address these challenges, recent efforts proposed several innovative solutions, for example, utilizing latent codes to represent the frames with the goal of reducing the model size and training time [25], as well as dynamically detecting foreground objects to accommodate the representation of large movements [30,57,66].

NeRF-based Volumetric Content Delivery
In this section, we start by outlining the end-to-end pipelines for streaming volumetric videos by delivering NeRF models. Subsequently, we delve into an in-depth exploration of the research challenges associated with VOD services. Finally, we pivot to discussing the distinct research challenges inherent to live video streaming.

End-to-end Pipeline
Figure 2 depicts the end-to-end pipeline of VOD and live video streaming scenarios. The setups encompass three components: the client, the client's edge server (referred to as "edge"), and the video content server (referred to as "server"). On the client side, the user wears an MR headset to watch videos for a truly immersive experience. Given the resource constraints of mobile headsets, the user is assisted by an edge that executes volume rendering based on a trained NeRF model. The data exchange between the client and the edge is as follows. During streaming, the client sends the headset's 6DoF pose to the edge. The edge then creates the input parameters of the NeRF model based on the received pose. After that, it performs volume rendering with NeRF's output and sends the rendered image back to the user's headset. (A minimal sketch of this per-frame exchange is given at the end of this subsection.) The main variation between different setups resides in the communication between the server and the edge.

VOD Service. Current research in the computer vision and graphics communities generally trains a single NeRF model for all video frames. In this setup, the server transmits the trained model to the edge prior to streaming, and during streaming, there is no data transmission between the server and the edge, as shown in Figure 2 (a). However, we argue that an ideal setup should divide the video into several chunks, each represented by a separate NeRF model, as demonstrated in Figure 2 (b). This setup is driven by the following two considerations. First, training a model for an entire video could result in a large model size, particularly for long videos. Furthermore, given that the model is not divisible (§3.3), this could prolong startup time, negatively impacting QoE [28]. Second, in real-world scenarios, users may watch only portions of a video, which can lead to inefficiencies if a single model is used for the entire video.
Live Video Streaming. Applying NeRF to live video streaming requires continuous learning (fine-tuning) of the NeRF model, since future frames are unknown. This process can be conducted on either the edge or the server, as shown in Figures 2 (c) and (d). This choice presents a trade-off between end-to-end latency and bandwidth requirements, warranting further exploration. Fine-tuning on the server could potentially decrease training latency, as servers usually possess better computational resources. However, transmitting the fine-tuned model over the Internet may demand higher bandwidth than transmitting images, introducing additional latency if the network is congested or has limited bandwidth. Conversely, while offloading the fine-tuning task to the edge might reduce bandwidth requirements, the computational resources at the edge might be insufficient for rapid model fine-tuning, leading to high end-to-end latency.
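To make the client-edge interaction concrete, the following is a minimal sketch of the edge-side loop described above. The length-prefixed socket framing and the `render_fn` callback (which stands in for building the NeRF input from the pose, running inference, volume rendering, and image encoding) are our own illustrative assumptions, not components prescribed by the pipeline.

```python
import json
import socket
import struct

def _recv_exact(conn: socket.socket, n: int) -> bytes:
    # Sockets may return partial reads; loop until exactly n bytes arrive.
    buf = b""
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("client disconnected")
        buf += chunk
    return buf

def edge_serve(render_fn, conn: socket.socket) -> None:
    """Per-frame loop on the edge: receive a 6DoF pose, reply with a frame.

    `render_fn(pose) -> bytes` is a placeholder for NeRF inference plus
    volume rendering and image encoding (e.g., JPEG) for the headset.
    """
    while True:
        (n,) = struct.unpack("!I", _recv_exact(conn, 4))
        pose = json.loads(_recv_exact(conn, n))   # {"position": [...], "orientation": [...]}
        frame = render_fn(pose)                   # rendered view for this pose
        conn.sendall(struct.pack("!I", len(frame)) + frame)
```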

Real-time and High-quality Streaming
Real-time and high-quality NeRF-based volumetric video streaming demands a delicate balance between (model) transmission latency, rendering latency, and visual quality. This presents two substantial challenges, as follows.
Delivery of NeRF models. A common strategy to accelerate frame rendering involves a trade-off between the storage footprint and frame rendering latency. For instance, Müller et al. [40] simplified the MLP model for real-time rendering by employing multi-level hash tables to encode low-dimensional inputs into high-dimensional features, which preserves the information necessary for high-quality rendering. This strategy reduces the rendering latency at the cost of increased storage usage, incurring the challenge of high bandwidth requirements for transmitting NeRF models (e.g., 245.2 MB for a 200-frame video processed by the model of Peng et al. [47], as shown in §4.2) from the server to the edge. Moreover, the inherent requirement of large models for high-resolution content could further increase the bandwidth needed for model transmission.
To address the above issue, we propose to leverage advanced compression techniques to reduce the model size for storage and transmission, without significantly affecting visual quality. For instance, the vector-quantized auto-decoder compression method, proposed by Takikawa et al. [60] for static scenes, could be adapted to dynamic videos. However, the computation overhead associated with decompression operations is typically high [60]. Thus, it is critical to facilitate real-time decompression while minimizing the usage of computation resources, to preserve sufficient computational capacity for frame rendering on the edge.

Frame Rendering. Despite the advancements in NeRF acceleration, further improvements are needed to achieve real-time rendering of high-resolution images for dynamic scenes. For example, the state-of-the-art design by Peng et al. [47], while efficient for rendering 512×612 images, may fail to maintain real-time rendering of high-definition content (e.g., with a resolution of 1920×1080). We propose to employ foveated rendering [3,39], inspired by the characteristics of the human visual system (HVS), to solve this problem. The HVS features a high-resolution foveal area and a peripheral region where resolution gradually decreases [18]. Accordingly, foveated rendering reduces computation overhead without perceptual loss by focusing computation resources on rendering the high-resolution foveal region in detail, while reducing the rendering quality in the peripheral vision. By reducing the total number of pixels to be rendered, foveated rendering can significantly decrease the overall computational load. Our proposed foveated rendering would selectively march more rays in the foveal region and fewer in the peripheral regions, as sketched below.
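The function below is a toy version of this ray-allocation step under our own assumptions; the fovea radius and peripheral sampling rate are illustrative constants, not tuned values.

```python
import numpy as np

def foveated_ray_mask(h: int, w: int, gaze_xy, fovea_radius=0.15,
                      periph_keep=0.25, seed=0) -> np.ndarray:
    """Decide per pixel whether to march a ray through the NeRF.

    Pixels within `fovea_radius` (a fraction of the image diagonal) of the
    gaze point are always rendered; peripheral pixels are rendered with
    probability `periph_keep`, and the remaining pixels can be filled by
    cheap interpolation from rendered neighbors.
    """
    gx, gy = gaze_xy
    ys, xs = np.mgrid[0:h, 0:w]
    dist = np.hypot(xs - gx, ys - gy) / np.hypot(h, w)   # normalized eccentricity
    rng = np.random.default_rng(seed)
    mask = (dist <= fovea_radius) | (rng.random((h, w)) < periph_keep)
    return mask  # True = march this ray; False = interpolate later
```

With these illustrative settings, only the foveal disk plus a fraction of the periphery is ray-marched, cutting the per-frame ray count substantially.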

Rate Adaptation
In NeRF-based volumetric video streaming, where the NeRF models are transmitted over the Internet, rate adaptation becomes a crucial aspect. Like traditional 2D video streaming, NeRF-based streaming may require substantial bandwidth (e.g., streaming at 30 FPS with a resolution of 512×612 [47] necessitates ∼300 Mbps). Under dynamic network conditions, a common strategy is to adjust image resolution based on the predicted available bandwidth [20,34]. However, in the case of NeRF-based streaming, altering the model size dynamically is impractical due to the intrinsic design of NeRF.
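As a back-of-the-envelope check on the magnitude of this figure (under our own simplifying assumption that rendered frames are delivered as uncompressed 24-bit RGB):

$$512 \times 612\ \text{px} \times 24\ \tfrac{\text{bit}}{\text{px}} \times 30\ \tfrac{\text{frames}}{\text{s}} \approx 2.26 \times 10^{8}\ \tfrac{\text{bit}}{\text{s}} \approx 226\ \text{Mbps},$$

which is on the same order as the quoted ∼300 Mbps.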
The original NeRF utilizes an MLP model that cannot directly handle different rendering resolutions through adjustment of the model size. In NeRF, all the model parameters contribute to the 3D scene reconstruction [14]. Due to the intricate interconnection of weight parameters within the model, omitting even a small portion could disrupt the process and result in reconstruction failure. A straightforward extension of traditional rate adaptation schemes to NeRF-based video streaming involves training different MLP models to handle various output resolutions by modifying their depth and width. However, this approach could lead to substantial increases in memory and storage requirements, making it less than ideal.
To address this challenge, we propose to extend the idea of scalable video coding (SVC) [55] to NeRF by exploring scalable neural networks [7,69], such as slimmable networks [24] and progressive networks [53]. These networks are designed to be segmented into multiple executable sub-networks of varying widths and depths, each trained to accommodate a particular rendering resolution. For example, a narrower sub-network would manage low-resolution outputs, whereas a wider sub-network, with the narrower ones incorporated, would accommodate high-resolution content. By dynamically adjusting the network size, the enhanced NeRF model could support various rendering resolutions.
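As a sketch of this direction, the snippet below applies the slimmable-network idea [24] to a NeRF-style MLP: each narrow sub-network is a prefix of the full weight matrices, so a low bitrate track could ship only the first fraction of every layer. This is a minimal illustration under our own assumptions (the layer sizes, the prefix-slicing scheme, and the fixed RGBσ head are all illustrative), not the design of an existing system.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlimmableLinear(nn.Linear):
    """Linear layer that can execute at a fraction of its full width."""
    def forward(self, x: torch.Tensor, ratio: float = 1.0) -> torch.Tensor:
        out_f = max(1, int(self.out_features * ratio))
        in_f = x.shape[-1]                      # previous layer may be slimmed
        return F.linear(x, self.weight[:out_f, :in_f], self.bias[:out_f])

class SlimmableNeRFMLP(nn.Module):
    """NeRF-style MLP whose width scales with the streamed model fraction."""
    def __init__(self, in_dim: int = 63, hidden: int = 256, depth: int = 4):
        super().__init__()
        dims = [in_dim] + [hidden] * depth
        self.layers = nn.ModuleList(
            SlimmableLinear(i, o) for i, o in zip(dims, dims[1:]))
        self.head = SlimmableLinear(hidden, 4)  # RGB + density, width fixed

    def forward(self, x: torch.Tensor, ratio: float = 1.0) -> torch.Tensor:
        for layer in self.layers:
            x = torch.relu(layer(x, ratio))
        return self.head(x, 1.0)

# Usage: one model, several effective bitrates.
model = SlimmableNeRFMLP()
pts = torch.randn(1024, 63)                     # encoded position/direction batch
low = model(pts, ratio=0.25)                    # narrow sub-network
full = model(pts, ratio=1.0)                    # full-width network
```

Training such a model would need to sample several width ratios per step (e.g., the "sandwich rule" of [24]) so that every prefix remains usable on its own.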

Viewport Adaptation
Viewport adaptation is a prevalent strategy in immersive video streaming, aiming at bandwidth reduction by delivering mainly the video content that the viewer is anticipated to consume, rather than the entire scene [19,51]. In bandwidth-hungry NeRF-based volumetric video streaming, viewport adaptation is a critical component. There are two basic requirements for its application in immersive video streaming: content segmentation for selective transmission and viewport prediction for content selection. The latter has been extensively studied in immersive video streaming, such as 360° video streaming [51] and point-cloud-based volumetric video streaming [19]. The former, however, is non-trivial for NeRF-based video streaming, in which a single MLP is typically trained to represent the whole scene. As illustrated in §3.3, to obtain high-quality reconstruction, we need to transmit all parameters of the MLP model, undermining the bandwidth-saving benefit of viewport adaptation.
A straightforward solution is to partition the whole scene into multiple cells and represent each cell with a small MLP [40,52]; the sketch after this paragraph illustrates the corresponding cell-selection step. However, this solution may be confined to bounded scenes [52]. Moreover, determining the optimal number of cells is non-trivial. A fine-grained partitioning strategy may incur high segmentation and storage overhead, while a limited number of cells might reduce the effectiveness of viewport adaptation. An innovative approach involves utilizing attention mechanisms [62] to assign weights to NeRF parameters. During streaming, this would allow us to rank and stream the high-weight parameters for the viewer's specific viewport, optimizing bandwidth consumption.
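The sketch below ranks cells by their relevance to a predicted viewport. The uniform grid, the view-cone visibility test, and all constants are our own illustrative assumptions; a practical system would likely use frustum-AABB intersection tests and account for the viewport predictor's uncertainty.

```python
import numpy as np

def rank_cells(grid_min, grid_max, n_cells, cam_pos, view_dir, fov_deg=90.0):
    """Order scene cells so that near, visible cells are streamed first.

    Each cell is assumed to own its own small MLP; only the top-ranked
    cells' parameters need to be transmitted for the predicted viewport.
    """
    # Cell centers of an n_cells^3 uniform grid over [grid_min, grid_max].
    axes = [np.linspace(lo, hi, n_cells, endpoint=False) + (hi - lo) / (2 * n_cells)
            for lo, hi in zip(grid_min, grid_max)]
    centers = np.stack(np.meshgrid(*axes, indexing="ij"), -1).reshape(-1, 3)

    to_cell = centers - np.asarray(cam_pos, dtype=float)
    dist = np.linalg.norm(to_cell, axis=1) + 1e-9
    d = np.asarray(view_dir, dtype=float)
    cos = to_cell @ (d / np.linalg.norm(d)) / dist       # angle to gaze direction
    visible = cos >= np.cos(np.radians(fov_deg / 2))     # inside the view cone
    return np.argsort(np.where(visible, dist, np.inf))   # invisible cells last

# Usage: stream the MLPs of the first k cells in this order.
order = rank_cells((-1, -1, -1), (1, 1, 1), n_cells=4,
                   cam_pos=(0, 0, -2.5), view_dir=(0, 0, 1))
```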

Live Video Streaming
Different from VOD services, live volumetric video streaming enables more exciting use cases, such as telesurgery [8] and remote collaboration [61]. In this section, we outline several challenges and potential solutions related to NeRF-based live volumetric video streaming.

Real-time and Continuous Learning. Live video streaming presents the complex challenge of real-time, continuous training of NeRF models for novel view synthesis, since the content of future frames remains unknown. Training NeRF models is notoriously time-consuming [2,48], intensifying this challenge. To mitigate this issue, we propose a solution hinging on the observation that the variation of content between frames may be limited. Hence, once the initial scene model is trained, there is no need for retraining from scratch. For subsequent frames, we can fine-tune the pre-trained model by feeding features extracted from the altered content [71], potentially expediting continuous training, as sketched below.
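The following is a minimal sketch of this fine-tuning loop on changed content. The pixel-difference threshold, the step count, and the `render_rays` callback (standing in for the model's ray renderer) are illustrative assumptions rather than a validated recipe.

```python
import torch
import torch.nn.functional as F

def finetune_on_changes(model, render_rays, optimizer,
                        prev_frames, cur_frames, rays,
                        diff_thresh=0.02, steps=50):
    """Adapt the previous frame's NeRF using only pixels that changed.

    `prev_frames`/`cur_frames`: (N, H, W, 3) multi-view RGB tensors.
    `rays`: (N, H, W, D) per-pixel ray parameters for the same cameras.
    `render_rays(model, rays) -> (M, 3)` renders the selected rays.
    """
    with torch.no_grad():
        changed = (cur_frames - prev_frames).abs().mean(-1).reshape(-1) > diff_thresh
    if not changed.any():
        return 0.0                                   # static frame: keep the model
    sel_rays = rays.reshape(-1, rays.shape[-1])[changed]
    target = cur_frames.reshape(-1, 3)[changed]
    for _ in range(steps):                           # a few quick gradient steps
        loss = F.mse_loss(render_rays(model, sel_rays), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return float(loss)
```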
Even though we still need to train the model for the initial scene, its impact on QoE is likely to be limited for the following two reasons. First, in real-world scenarios, the initial scene, such as a recording studio, is generally known before streaming commences, allowing us to pre-train the NeRF model offline. Second, recent advancements have significantly accelerated the NeRF training process [16].

Viewport Adaptation. In the context of live video streaming, we could conduct viewport adaptation on the transmitted multi-view images. For each frame, we could transmit only the content within the user's predicted viewport, effectively reducing the number of delivered pixels. This approach could decrease both the size of the transmitted images and the model. Consequently, it could alleviate the bandwidth demand of the live streaming schemes shown in Figure 2 (c) and (d). However, a caveat to consider is that the transmitted multi-view images are used to fine-tune the MLP, which was trained on the previous frames.
Given that the user's viewport is likely to change over time, the content in the current frame might not be present in the previous ones, making it difficult to fine-tune the model for the current frame. Devising an effective strategy to overcome this challenge necessitates further investigation.
Preliminary Results

Experiment Setup
We reproduce two state-of-the-art NeRF models, the dynamic MLP-maps [47] and Tensor4D [56], which are specialized for dynamic 3D reconstruction and rendering with dense and sparse input views, respectively. We leverage the RGB datasets associated with the aforementioned models, which capture foreground dynamic scenes within bounded regions.
In particular, for the dynamic MLP-maps model, we utilize the NHR [65] dataset that contains images at a resolution of 1224×1024, captured by an array of up to 80 synchronized cameras. For the Tensor4D model, we employ a synthetic dataset from D-NeRF [49] that consists of images with 800×800 resolution, captured by a single monocular camera.
To explore the trade-offs between model size, inference time, and visual quality, we experiment with two different settings. In the first setting, we train models corresponding to different chunks by partitioning the input frames. In the second setting, we generate images at different resolutions, with the downsampling ratios set to 2 and 4, utilizing the same pre-trained models. These models are implemented in vanilla PyTorch and executed on a machine with an NVIDIA GeForce RTX 3060 GPU and an Intel Core i7-12700K CPU.

Experimental Results
Model Size. In the context of volumetric video streaming based on NeRF representations, the server transmits the pre-trained MLP model to the edge. Therefore, the size of the model crucially determines the initial startup time, influencing the QoE. In this paper, we train the vanilla MLP-maps model with 200 frames of a video in the NHR dataset, resulting in a large model size of 245.2 MB. We then investigate the impact of using chunks with varying frame numbers on the model size and reconstruction quality. We train the "lego" model following the Tensor4D configuration with chunk sizes of 10 and 50 (i.e., the number of input frames). Despite the variation in chunk size, the model size remains consistent at 197.9 MB. Figure 3 shows the reconstructed images at resolution 400 for a monocular synthetic dataset, trained with 10 and 50 input frames, respectively. This figure indicates that smaller chunk sizes yield higher-quality reconstructions, featuring detailed aspects and no artifacts, despite the model size remaining the same. This suggests that a model trained on fewer input frames retains more detail, as the same model capacity has to compress less data.
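To put this model size in perspective, a rough startup-delay estimate (assuming, purely for illustration, a 100 Mbps downlink and ignoring protocol overhead) is

$$\frac{245.2\ \text{MB} \times 8\ \text{bit/byte}}{100\ \text{Mbit/s}} \approx 19.6\ \text{s},$$

well above typical startup-delay tolerances, which further motivates the chunked-model design discussed in §3.1.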
Inference Time. Table 1 shows a comparison of rendering time per frame at varying resolutions, in both sparse and dense view cases. The table illustrates that rendering high-resolution images typically requires more time, highlighting the need for a balance between video quality and rendering time. Furthermore, a noteworthy observation from the table is that the rendering time for the dense view using the MLP-maps model exhibits a significant improvement over the sparse view on the Tensor4D model. This enhancement potentially paves the way for achieving real-time rendering, at the cost of requiring dense views from more cameras.
Visual Quality. Figures 4 and 5 depict reconstructed images at various resolutions, along with the ground truth, for the monocular synthetic dataset with the Tensor4D model and the NHR dataset with the MLP-maps model. The qualitative comparison of these images reveals that, generally, as the resolution increases, the reconstructed images become clearer with enhanced details, such as the granularity of the "lego" and the defined facial features and clothing folds. Although the models produce photo-realistic images, there are still visible discrepancies compared to the ground truth. Therefore, there is still significant room for improvement in the visual quality of NeRF-based volumetric content representation.
Discussion. Our preliminary results demonstrate the inherent trade-offs between model size, inference time, and visual quality in NeRF-based volumetric video streaming. We observe a considerable discrepancy in rendering time between sparse and dense view inputs, even though the perceived difference in the visual quality of the reconstructed images is small. In the case of dense-view inputs, the MLP-maps model is already capable of achieving real-time rendering. However, a prevalent issue is that current models tend to train a single NeRF model on all frames in the video. This approach results in a NeRF model with a high storage cost, making it less suitable for video streaming applications. Therefore, additional design and optimization are required to achieve practical NeRF-based volumetric content delivery.

Conclusion
In this paper, we charted an ambitious research agenda focusing on neural-based volumetric video streaming. This approach harnesses the strengths of NeRF, aiming for photo-realistic visual quality. To reduce bandwidth consumption and ultimately enhance the QoE, we delved into the unique challenges related to NeRF-based volumetric content delivery in VOD and live video streaming services and proposed potential solutions to these problems. Our preliminary results suggest a delicate balance that needs to be maintained between model size, inference time, and visual quality. We hope that our work will inspire future research endeavors in NeRF-based volumetric video streaming, ultimately delivering immersive content with high visual quality, efficient bandwidth usage, and low end-to-end latency.

Figure 1: Comparison of volumetric content rendered on HoloLens, with NeRF (left) and point cloud (right).

Figure 2: End-to-end pipelines of VOD and live video streaming for NeRF-based volumetric content delivery. Top: Data delivery from the server to the edge for VOD, where the server trains an MLP for the whole video (a) and trains an MLP for each chunk of the video (b), respectively. Middle: Data delivery from the server to the edge for live video streaming, where the fine-tuning is conducted on the server (c) and the edge (d), respectively. Bottom: Data exchange between the client and the edge for VOD and live video streaming (e).

Table 1: Comparison of rendering time per frame at different resolutions in sparse and dense view cases.