QV4: QoE-based Viewpoint-Aware V-PCC-encoded Volumetric Video Streaming

Volumetric videos allow six degrees of freedom (6DoF) movement for viewers, enabling numerous applications in domains such as entertainment, healthcare, and education. MPEG's Video-based Point Cloud Compression (V-PCC) is a recent standard for volumetric video compression that achieves a considerable compression rate while maintaining the quality of the point cloud sequence. However, V-PCC is hard to fit into existing tiling-based volumetric video streaming frameworks due to the lack of proper viewing-adaptive techniques. In this paper, we propose QV4, a Quality-of-Experience (QoE) based streaming pipeline for viewpoint-aware V-PCC-encoded volumetric video. Specifically, we leverage the intermediate results produced by the V-PCC encoder to achieve effective and efficient viewpoint-aware tiling for V-PCC. We then build a QoE model and a 6DoF movement model based on real-world user data to predict the users' viewing experience and behaviors, respectively. The proposed QoE model and 6DoF movement model are combined with viewpoint-aware V-PCC tiling to maximize the visual quality of volumetric videos. Extensive simulations show that by enabling viewpoint-aware adaptation and optimization for V-PCC-encoded volumetric videos, QV4 can achieve up to 14.67% improvement in structural similarity index (SSIM) and 7.39% improvement in video multi-method assessment fusion (VMAF) under highly dynamic viewing behaviors and limited, fluctuating network bandwidth.


INTRODUCTION
Volumetric video is an emerging and highly interactive form of media that immerses users in a dynamic 3D experience [11]. These videos employ a data representation called the point cloud, which consists of a set of 3D points, each with coordinates and color. However, achieving high-quality 3D object representation in volumetric videos demands point clouds composed of millions, if not billions, of points, posing challenges for transmission-friendly codecs, adaptive transmission methods, and accurate prediction of viewing behavior.
The codecs for point cloud compression can be categorized into projection-based codecs, geometry-based codecs, and neural-based codecs. MPEG standardizes two compression technologies for point cloud compression (PCC): video-based PCC (V-PCC) and geometry-based PCC (G-PCC). V-PCC is a projection-based approach that projects each frame of 3D point clouds into three 2D images. The generated 2D images can then be encoded using state-of-the-art video codecs (e.g., HEVC). G-PCC uses octree or triangle soup (trisoup) data structures and performs arithmetic encoding on different attributes. Similarly, Google's Draco uses the kd-tree data structure to compress grid data and point clouds. PU-GAN [20] and MPU [38] resort to deep learning, using neural networks to extract features of the point cloud surface and upsample the point cloud.
V-PCC, in particular, takes advantage of successful 2D video compression techniques to achieve up to 100x better compression ratio compared with geometry-based PCC techniques (e.g., Google's Draco) and neural-based PCC techniques (e.g., MPU [38] or PU-GAN [20]), showing great potential for volumetric video applications. Moreover, as a projection-based method, V-PCC is well suited for dense point cloud sequences, since it can generate continuous and smooth surfaces of 2D projections from dense point clouds [3]. However, for the same reason, V-PCC cannot be fully utilized by tiling-based streaming techniques, which need to segment the video into small and sparse tiles for adaptive optimization.
Recently, Rudolph et al. [30] proposed a point cloud decomposition method called NoVA-PCC. They propose to utilize the 2D-projection mechanism of V-PCC for point cloud decomposition, taking the first step to enable user-adaptive optimization techniques on V-PCC-encoded volumetric videos. Nevertheless, V-PCC-encoded volumetric video streaming is still very challenging due to the lack of effective quality metrics, bitrate allocation, and accurate viewing prediction. Because of these limitations, most tiling-based volumetric video streaming works resort to geometry-based compression and deep-learning-based compression [13,16,18,20,38,40], which achieve low compression ratios and fail to work under limited and fluctuating network conditions.
In this paper, we present QV4, a QoE-based streaming pipeline for Viewpoint-aware V-PCC-encoded Volumetric Video. To the best of our knowledge, QV4 is the first tile-based volumetric video streaming system that exploits user viewing adaptations on V-PCC-encoded content. Specifically, to take advantage of V-PCC and decompose the point clouds for viewpoint-aware optimization, we leverage the intermediate results produced by the V-PCC encoder, as suggested by Rudolph et al. [30]. To take the viewer's subjective perception and quality of the viewing experience into account, a QoE model is built based on 7,680 ratings of V-PCC-encoded volumetric videos from 120 users collected by Cox et al. [6]. Our QoE model can accurately predict the subjective quality of volumetric videos encoded by V-PCC with a 0.06 root-mean-square error (RMSE). Additionally, to enable effective and efficient viewpoint adaptation for better user perception quality under diverse and dynamic user viewpoints in 6DoF, we develop a 6DoF viewpoint prediction model based on real-world viewpoint data collected from 26 users by Subramanyam et al. [33] and emulated dynamic viewpoint data, which achieves 0.02 mean absolute error (MAE) and 0.07 MAE on average for normalized translational and rotational movement, respectively. We combine the proposed QoE model and viewpoint prediction model with a state-of-the-art bitrate allocation algorithm to maximize the visual quality of volumetric video under dynamic network conditions and diverse user viewpoints. We perform extensive simulations under complex and diverse network profiles and viewpoint trajectories to verify the feasibility and performance of QV4. The results show that QV4 achieves up to 14.67% improvement in SSIM and 7.39% improvement in VMAF.
The rest of this paper is organized as follows. Section 2 presents the related work on volumetric video compression and streaming, and introduces the background of V-PCC. Section 3 introduces the proposed methods to achieve QoE-based viewpoint-aware streaming of volumetric video encoded by V-PCC, including viewpoint-aware V-PCC tiling, 6DoF viewpoint prediction, QoE estimation, and QoE-based bitrate adaptation. We show the experiments and results in Section 4. Finally, we discuss and conclude this paper in Section 5.

RELATED WORK AND BACKGROUND
Point Cloud Compression
The exploration of point cloud compression has an extensive background, originating from early studies by Devillers and Gandoin [8] as well as Peng and Kuo [22]. These early works focused on compressing the geometry of point clouds by employing tree-based data structures for representation. Recently, MPEG defined two PCC standards: V-PCC for dynamic point clouds and G-PCC for scenes and objects. By adopting a 2D projection structure, V-PCC projects the original 3D point cloud into 2D space and compresses it with 2D video compression methods. 3D data structures (e.g., kd-tree [4], octree [21], or trisoup [11]) are used in representative algorithms including G-PCC and Google's Draco to exploit 3D correlations. The field of neural point cloud compression has recently gained attention. For example, Quach et al. have contributed to this area with a series of works employing neural networks to compress point cloud geometry [26]. They frame the decoding process as a binary classification problem. Another notable contribution comes from Yan et al., who propose a specialized autoencoder architecture for compressing point cloud geometry [36]. Que et al. introduce a deep-learning framework that utilizes voxel context to compress octree-structured data in both static and dynamic point cloud compression scenarios [27]. Sheng et al. present an approach using neural networks for point cloud attribute compression, incorporating second-order point convolution [31]. Furthermore, Quach et al. provide a comprehensive survey summarizing recent advancements in neural point cloud compression [25]. While neural compression techniques show promise, the current research emphasizes static point clouds.

Video-based Point Cloud Compression
V-PCC's main idea is to project each frame of 3D point clouds into three kinds of 2D images. The generated 2D images can then be encoded using state-of-the-art video codecs (e.g., HEVC).
On the encoder side, patch generation first projects the 3D point cloud into 2D space with different projection angles. Specifically, normal estimation is applied to each point, and each point is assigned to a projection plane based on its normal direction. By default, the six faces of a bounding box are used as projection planes. By projecting the points onto the projection planes, the point cloud is segmented into patches. These patches are further refined to ensure connectedness in the projection. Figure 1 shows an example of a segmented point cloud with different colors representing different patches. Patch packing then sequentially places the patches into 2D video frames while maintaining consistent position and orientation of each patch across frames. The objective of patch packing is to fit all the patches onto 2D frames that can be compressed effectively by existing video coding standards. The process consists of applying a packing method followed by a global patch allocation (GPA). V-PCC produces three artifacts after patch packing: the occupancy map, which indicates whether a pixel in the resulting projection plane is occupied; the geometry image, which contains the distance of points from the projection plane (i.e., depth); and the attribute image, which contains the attributes, e.g., colors, of the projected points. An example of these three artifacts is shown in Figure 1. Finally, the generated 2D video is encoded with a 2D video codec, such as HEVC. As a result, V-PCC compression rates are controlled by the geometry and texture quantization parameters (QP) of the 2D video codec. On the decoder side, the 2D video decoder decodes the occupancy map, geometry image, and attribute image, respectively. The geometry and attributes of the point cloud are then reconstructed. Post-processing (e.g., duplicate pruning, smoothing) is then applied to the reconstructed point cloud. Please refer to [11] for more details.
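To make the plane-assignment step concrete, the following sketch (a simplification, not the reference implementation) assigns each point to one of the six default bounding-box projection planes by picking the plane normal closest to the point's estimated normal; the normal estimation and the subsequent patch refinement passes are omitted.

```python
import numpy as np

# The six axis-aligned faces of the bounding box used by V-PCC as the
# default projection planes, identified by their outward normals.
PLANE_NORMALS = np.array([
    [ 1, 0, 0], [-1, 0, 0],
    [ 0, 1, 0], [ 0, -1, 0],
    [ 0, 0, 1], [ 0, 0, -1],
], dtype=float)

def assign_projection_planes(normals: np.ndarray) -> np.ndarray:
    """Assign each point to the projection plane whose normal is closest
    to the point's estimated normal (maximum dot product).

    normals: (P, 3) array of estimated point normals.
    Returns a (P,) array of plane indices in [0, 5].
    """
    scores = normals @ PLANE_NORMALS.T   # dot product with each plane normal, shape (P, 6)
    return np.argmax(scores, axis=1)     # closest plane per point

# Example: three points whose normals roughly face +X, -Y, and +Z.
pts_normals = np.array([[0.9, 0.1, 0.0], [0.0, -1.0, 0.1], [0.1, 0.0, 0.95]])
print(assign_projection_planes(pts_normals))   # -> [0 3 4]
```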

Volumetric Video Streaming
Several volumetric video streaming systems have been developed, demonstrating the feasibility of volumetric video transmission [12, 13, 16-19, 24, 33]. YuZu [40] pioneered the use of point cloud super-resolution techniques, enhancing the user experience with models like PU-GAN [20] and MPU [38]. However, YuZu's experiments lacked a comparison with conventional point cloud codecs and were limited to small datasets. Subramanyam et al. [33] proposed a view-adaptive streaming approach that segments individual point cloud objects into non-overlapping tiles. The tiles are transmitted at different bitrates based on the position and orientation of the user's viewport. Li et al. [19] introduced a method that incorporates human saliency to optimize 3D tiling schemes for dynamic point cloud streaming, improving efficiency. Shi et al. [32] observed that V-PCC suffers a large quality drop at low bitrates. To address this problem, they proposed to exploit the redundancy of 3D point clouds through simple 3D sub-sampling and achieved considerable quality improvement for V-PCC-encoded point clouds. AITransfer [16] proposed a neural adaptive transmission scheme, but it does not address color attributes in point clouds. Transmission systems such as ViVo [13] and Groot [18] combined compression and optimization methods to achieve parallel acceleration based on visual characteristics. Nonetheless, these approaches adopt geometry-based compression techniques for tile-based transmission, thus suffering from limitations in compression ratios and bandwidth requirements. Both ViVo and Groot achieve a low compression ratio (less than 8), thus failing to work well under complex network conditions [37].
In contrast, we bring V-PCC and viewing-aware optimization to a tile-based volumetric video streaming system. These unique components enable our approach to achieve a significantly higher average compression ratio of 610, while maintaining satisfactory quality even under limited network conditions and complex viewing behavior. Further details of our experiments and analyses are presented in Section 4.

METHODOLOGY
Pipeline Overview
QV4 is, to the best of our knowledge, the first tile-based streaming system for V-PCC-encoded volumetric video that considers user viewing experience and behaviors. It is designed based on the Dynamic Adaptive Streaming over HTTP (DASH) standard. Figure 2 gives an overview of QV4.
QV4 streams video-on-demand volumetric content stored on an Internet server to client hosts. On the server side, the volumetric videos are segmented into view-aware tiles which are independent and non-overlapping (§3.2). Each tile is then independently encoded at different quality levels by V-PCC. We include tile metadata for each tile in the Media Presentation Description (MPD) document. During playback, the client continuously predicts the user's viewpoint (§3.3) and estimates the QoE of tiles (§3.4). The tiles with optimal quality are chosen based on the bitrate of the tiles, the predicted viewpoint, the estimated QoE, the network condition, and the buffer status (§3.5). The client then sends requests to the server for the selected tiles. After receiving the encoded tiles from the server, the V-PCC decoder decodes the tiles and stitches them together to form a complete Group of Frames (GoF) of the point cloud. The GoF is finally stored in the buffer, and the buffered GoFs are sent sequentially to the renderer.

Viewpoint-Aware V-PCC Tiling
In viewpoint-aware point cloud segmentation, a point cloud is expected to be divided into $N$ groups corresponding to $N$ view directions; we denote by $D = \{d_1, d_2, \ldots, d_N\}$ the collection of predefined view directions. As the performance of V-PCC is highly dependent on the performance of the 2D video codec, naive point cloud segmentation can destroy the spatial and temporal redundancy in the projected 2D images, making view-dependent V-PCC challenging [41].
As discussed in Section 2.2, the V-PCC encoder first generates patches, which are the 2D projections of the 3D point cloud. The projections are then packed and compressed by sophisticated 2D compression techniques. Rudolph et al. [30] first observed that during patch generation, the assignment of projections can be interpreted as view directions. For example, in Figure 1, the points of the point cloud are projected into patches based on their normal directions, which also indicate the view directions. Therefore, we can fully utilize the assignment of 2D projections produced within V-PCC encoding for point cloud segmentation, thus retaining the high compression efficiency brought by V-PCC.
Specifically, the V-PCC encoder generates patches in the following steps. First, a normal is estimated per point according to Hoppe et al. [14]. A tangent plane and its normal are defined per point, based on the nearest neighbors within a predefined search distance. To make the directions more uniform, each point's normal is enforced to point in a similar direction to its neighbors' normals. Then, each point is associated with the plane that has the closest normal. As a result, the point cloud is segmented into a set of patches. Several techniques are further used to refine the segmentation to ensure the inter-smoothness and intra-connectedness of the patches. Instead of packing all the patches into one 2D image, we pack them into distinct tiles. Each tile corresponds to a direction of projection that indicates the view direction. As a result, for the predefined view directions $d_1, d_2, \ldots, d_N$, we create $N$ attribute, geometry, and occupancy videos. In the end, $N$ bitstreams are produced by the V-PCC encoder. By default, $N$ is set to 6, since the six faces of a bounding box are used by V-PCC as projection planes. Because the aforementioned bitstreams are independent and non-overlapping, we can allocate different bitrates to point cloud segments based on the viewpoint of users.

6DoF Viewpoint Prediction
6DoF viewpoint prediction is a critical component of real-time volumetric video streaming, as it facilitates a smooth and seamless experience, where the user's movements are tracked in real time and the appropriate content is prefetched, thus minimizing delay. In this section, we propose our model to achieve accurate viewpoint prediction.
Problem Definition. For the current time $t$, our first objective is to predict the position of the viewpoint at time $t + w$ based on the history viewpoints of frames from $t - h$ to $t$, where $w$ is the prediction window and $h$ is the history window. Given the predicted viewpoint $\mathbf{v}_{t+w}$, we then infer the in-the-view tiles for QoE prediction and bitrate allocation, as discussed in Section 3.4 and Section 3.5, respectively. One-hot encoding is applied to represent the tiles. Specifically, we denote the tiles at $t + w$ as $\mathbf{t}_{t+w} \in \{0, 1\}^N$, where the $i$-th entry is 1 if and only if the $i$-th tile falls into the user view, and is 0 otherwise.

Viewpoint Trajectory. To better understand how real users move when watching volumetric videos, we adopt real viewpoint trajectories from [33]. The viewpoint trajectories were collected from 26 participants. Participants wore an Oculus Rift HMD to watch volumetric videos rendered from the 8i Dataset [9]. They were free to make rotational movements (pitch and yaw), move backward or forward (Z dimension), leftward or rightward (X dimension), and upward or downward (Y dimension). Figure 3(a) shows an example of the real viewpoint trajectories. We characterize the viewing behaviors and make two interesting observations. First, the horizontal positions X and Z range from -0.13 m to 1.36 m and 0.77 m to 2.75 m, respectively, indicating that the participants spent all their time looking toward the frontal body of the avatars in the volumetric videos and never moved to the back. Second, the median and 90th percentile of the vertical position Y are 0.02 m and 0.09 m, respectively, meaning that the participants hardly moved vertically. Based on the above observations, we find that the participants spent most of their time in a certain area looking toward the frontal body of the avatar, which is in agreement with previous studies [2,29,39].
Therefore, to better evaluate the generalization ability of our method, we additionally emulate more dynamic user movement and generate synthesized viewpoint trajectories. In the synthesized trajectories, the viewers keep moving around the object horizontally and vertically, as shown in Figure 3(b). The corresponding positional values for the X and Z axes span from -2.92 m to 3.16 m and -3.08 m to 3.21 m, respectively, while the Y axis ranges from -3.08 m to 3.21 m, resulting in a wide spatial coverage. The standard deviations for the X, Y, and Z coordinates amount to 1.58 m, 0.64 m, and 1.39 m, respectively, indicating notable dispersion tendencies within the dataset. This augmented dataset provides a more realistic representation of dynamic user movement, enabling a more robust evaluation of the proposed method's effectiveness in handling diverse motion patterns.

Prediction Model. Volumetric videos involve multiple dimensions of movement, which can make it challenging to predict the user's viewport in real time. As suggested by Han et al. [13], predicting each dimension separately and then combining the results to derive the predicted viewport allows for more accurate predictions and better real-time performance, making it suitable for real-world applications. We thus consider four learning-based models to predict each of the dimensions X, Y, Z, yaw, and pitch: Linear Regression (LR), Multi-Layer Perceptron (MLP), Support Vector Regression (SVR), and Gated Recurrent Unit RNN (GRU). Specifically, the MLP is configured with 2 hidden layers, with 64 neurons in the first hidden layer and 32 neurons in the second. The hyperbolic tangent is employed as the activation function and Adam is used for optimization. Default settings are applied to LR and SVR. As for the GRU model, 2 GRU layers are stacked together, with 64 neurons in each layer. Adam is adopted as the optimizer.
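For concreteness, the following PyTorch sketch mirrors the GRU configuration above (two stacked GRU layers with 64 units each, optimized with Adam, one model per dimension); the output head, input features, and training details are our assumptions for illustration and not necessarily the exact implementation.

```python
import torch
import torch.nn as nn

class ViewpointGRU(nn.Module):
    """Per-dimension viewpoint predictor: 2 stacked GRU layers, 64 units each."""
    def __init__(self, hidden_size: int = 64, num_layers: int = 2):
        super().__init__()
        # One scalar dimension (e.g., X, Y, Z, yaw, or pitch) per model.
        self.gru = nn.GRU(input_size=1, hidden_size=hidden_size,
                          num_layers=num_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)   # value at t + w (assumed output head)

    def forward(self, history):                 # history: (batch, h, 1)
        _, h_n = self.gru(history)              # h_n: (num_layers, batch, hidden)
        return self.head(h_n[-1])               # (batch, 1)

model = ViewpointGRU()
optimizer = torch.optim.Adam(model.parameters())   # Adam, as in the text
# History window h = 60 frames of one normalized dimension, batch of 8 trajectories.
pred = model(torch.randn(8, 60, 1))
```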
Evaluation. We consider history windows from 30 frames to 60 frames and prediction windows from 30 frames to 120 frames, at a step of 30 frames, for a frame rate of 30 fps. The X, Y, Z, yaw, and pitch are normalized using mean normalization. Mean absolute error (MAE) is utilized to evaluate the performance of the models. Figure 5 plots the MAEs of translational (X, Y, Z) and rotational (yaw, pitch) movement prediction, with $h = 60$ (the upper bound of inference time) and $w = 120$ (the lower bound of prediction accuracy). As shown, GRU outperforms the other models in both translational and rotational movement prediction, with the lowest MAE. To better illustrate the prediction performance of the models, we also visualize the predicted viewpoint trajectories of the four models in Figure 4, which shows that GRU is capable of accurately predicting the viewpoint trajectory compared to the other models.
After predicting the viewpoint, we then infer the tiles that fall into the view based on the predicted viewpoint. Three metrics are utilized to evaluate the performance of in-the-view tile prediction: exact-match accuracy, Hamming score, and weighted-average F1 score. We also measure the total inference time of viewpoint prediction and in-the-view tile prediction on a computer with an Apple M1 Pro and 32 GB RAM running macOS 13.2.
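As a reference, the sketch below computes the three tile-prediction metrics on binary in-the-view vectors; the Hamming score is taken here as per-tile accuracy (1 minus the Hamming loss), which is one common reading of that metric, so treat the exact definitions as assumptions.

```python
import numpy as np
from sklearn.metrics import f1_score

def tile_prediction_metrics(y_true: np.ndarray, y_pred: np.ndarray):
    """Multi-label metrics for in-the-view tile prediction over N tiles.
    y_true, y_pred: (samples, N) binary arrays."""
    exact_match = np.mean(np.all(y_true == y_pred, axis=1))   # all tiles correct
    hamming_score = np.mean(y_true == y_pred)                 # per-tile accuracy
    weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)
    return exact_match, hamming_score, weighted_f1

# Toy example with 6 tiles and 3 prediction instants.
y_true = np.array([[1, 1, 0, 0, 0, 0], [0, 1, 1, 0, 0, 0], [1, 0, 0, 0, 0, 1]])
y_pred = np.array([[1, 1, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [1, 0, 0, 0, 1, 1]])
print(tile_prediction_metrics(y_true, y_pred))
```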
As shown in Table 1, GRU achieves the best performance with 0.6440 exact-match accuracy, 0.8686 Hamming score, and 0.8742 weighted-average F1 score. In terms of time efficiency, the inference of all four models can be completed within 1 ms for all five dimensions. We thus adopt GRU as our viewpoint prediction model based on the above results.

QoE Prediction Model
We then explain the proposed QoE prediction model. The model considers two factors: (i) parameters associated with the encoder (i.e., the V-PCC compression rate), and (ii) the viewpoint of the user. We start by discussing the basic version of our QoE prediction model, which only considers the effect of the V-PCC compression rate. Given a tile $T_i$, we want to estimate its quality with respect to its representation level $l$, where the representation level is controlled by the geometry QP and texture QP, as mentioned in Section 2.2.
To accurately predict the quality, we resort to supervised machine learning algorithms. The volumetric video quality assessment dataset (VOLVQAD) [6] is adopted to train and evaluate our model. VOLVQAD consists of 376 video sequences and 7,680 ratings from 120 users. The volumetric video sequences are encoded with MPEG V-PCC using 4 different avatar models from the 8i Dataset [9] and 16 quality variations, where the quality is controlled by the geometry QP and texture QP. The volumetric video sequences are then rendered into test videos for subjective quality assessment. The participants were asked to compare the QoE of the rendered videos by providing mean opinion scores (MOS) ranging from 1 to 5. Each participant viewed 64 pairs of videos. The order of the videos was chosen randomly for each participant.
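To make the prediction task concrete, the sketch below fits a degree-2 polynomial regressor from (geometry QP, texture QP) to MOS with scikit-learn, which is the configuration ultimately selected below; the QP/MOS pairs are toy placeholders, since the actual model is trained on the VOLVQAD ratings.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Toy (geometry QP, texture QP) -> MOS pairs; placeholder values only.
qp_pairs = np.array([[24, 32], [28, 37], [32, 42], [36, 47], [40, 52],
                     [24, 42], [28, 47], [32, 52], [36, 32], [40, 37]], dtype=float)
mos = np.array([4.6, 4.1, 3.4, 2.6, 1.9, 4.2, 3.5, 2.7, 3.8, 2.9])

# Degree-2 polynomial regression over the two QP parameters.
qoe_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
qoe_model.fit(qp_pairs, mos)

# Predicted quality q_i for a tile encoded with geometry QP 30 and texture QP 40.
print(qoe_model.predict([[30.0, 40.0]]))
```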
We consider five lightweight machine learning (ML) models to predict the quality: (i) Polynomial Regression (PR), (ii) Support Vector Regression (SVR), (iii) Random Forest (RF), (iv) Multi-Layer Perceptron (MLP), and (v) K-Nearest Neighbour (K-NN). During the training stage, we perform 5-fold cross-validation for hyperparameter tuning using the grid search algorithm, where the coefficient of determination ($R^2$) is used as the objective criterion. The following hyperparameters are selected after tuning: (i) PR has degree two; (ii) SVR has a linear kernel; (iii) RF has 100 estimators with a maximum depth of 5; (iv) MLP has a 0.001 learning rate and a hidden layer with 100 neurons; (v) K-NN has two neighbours with uniform weights. Other hyperparameters are kept at their defaults. Three metrics are used to evaluate the ML models: RMSE, MAE, and $R^2$. The prediction performance of the five models is reported in Table 2. We find that PR achieves the best performance with 0.06 RMSE, 0.20 MAE, and 0.52 $R^2$, compared to the other methods. We thus adopt PR to predict the quality of tiles with respect to the geometry QP and texture QP, given its effectiveness and efficiency. For a tile $T_i$, we denote its geometry QP as $QP^g_i$ and its texture QP as $QP^t_i$; its quality, written $q_i$ for simplicity, is then predicted by the trained PR model from $(QP^g_i, QP^t_i)$ (Eq. 1).

We then consider the effect of the viewpoint of users. Ideally, we want to attach more importance to the tiles facing the user's viewpoint and less importance to the occluded tiles. To capture the positional relationship between tiles and the viewpoint, we calculate the cosine of the angle between the line of sight and the normal of each tile. We denote the cosine of the angle between the line of sight and the normal of tile $T_i$ as $\cos(T_i, p) \in [-1, 1]$, where $p$ is the position of the viewpoint. As the viewpoint moves toward the front of tile $T_i$, $\cos(T_i, p)$ decreases. Figure 6 gives an example of the viewpoint-tile relationship, where the point cloud is divided into six tiles. In this example, denote the angles between the line of sight and the normals of the frontal, right, and back tiles as $\theta_1$, $\theta_2$, and $\theta_3$, respectively. We can see that $\cos\theta_1$ is the smallest because the frontal tile directly faces the viewpoint, while the right and back tiles are almost occluded.
Therefore, the proposed QoE prediction model (Eq. 3) weights the predicted quality $q_i$ of tile $T_i$, obtained by Eq. 1, by the viewpoint term $\cos(T_i, p)$, so that tiles facing the user's viewpoint contribute more to the estimated quality than occluded ones.
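The sketch below illustrates this viewpoint term: it computes $\cos(T_i, p)$ from a tile's outward normal and the line of sight, and applies one plausible weighting that favors front-facing tiles. The weighting function here is our assumption for illustration only; it is not the exact form of Eq. 3.

```python
import numpy as np

def view_cosine(tile_normal, tile_center, viewpoint):
    """Cosine of the angle between the line of sight (viewpoint -> tile) and
    the tile's outward normal: near -1 for a tile facing the viewer,
    near +1 for a tile turned away."""
    sight = np.asarray(tile_center, dtype=float) - np.asarray(viewpoint, dtype=float)
    sight /= np.linalg.norm(sight)
    n = np.asarray(tile_normal, dtype=float)
    n /= np.linalg.norm(n)
    return float(np.dot(sight, n))

def weighted_qoe(q_i, cos_val):
    """One plausible viewpoint weighting (assumed, not the paper's Eq. 3):
    map cos in [-1, 1] to a weight in [0, 1] that is largest for
    front-facing tiles, then scale the predicted quality q_i."""
    return 0.5 * (1.0 - cos_val) * q_i

# Viewer in front of the object: the frontal tile keeps its full quality,
# the back tile is almost entirely discounted.
viewpoint = [0.0, 0.0, 2.0]
front = view_cosine([0, 0, 1], [0, 0, 0.5], viewpoint)    # ~ -1
back = view_cosine([0, 0, -1], [0, 0, -0.5], viewpoint)   # ~ +1
print(weighted_qoe(4.0, front), weighted_qoe(4.0, back))
```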

QoE-based Tile Rate Adaptation
Given the tiles of the point cloud segmented by the method described in Section 3.2, the user viewpoint predicted by the model discussed in Section 3.3, and the QoE of tiles estimated by the QoE model described in Section 3.4, the next step is to allocate a high bitrate to the portion of the point cloud content that falls within the viewing frustum while transmitting the remaining tiles at a low bitrate. Our algorithm involves two steps. First, the client estimates the optimal bitrate budget for the next point cloud content to download based on the network condition and buffer state. The estimated bitrate budget is then allocated to the tiles of the volumetric video, given the user viewpoint, the quality and bitrate of each representation level of the tiles, and the buffer state.
In the first step, we determine the ideal bitrate budget by using a state-of-the-art bitrate allocation algorithm, QUETRA [35], which is an effective yet simple rate adaptation algorithm based on an M/D/1/K queuing model of DASH. Note that other rate adaptation algorithms, such as ELASTIC [5] or BBA [15], can also be used in our pipeline.
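For intuition, the stand-in below shows what this step produces, namely a per-segment bitrate budget derived from the throughput estimate and buffer state. It is a deliberately simplified heuristic of our own for illustration, not QUETRA's queuing-model analysis.

```python
def bitrate_budget(throughput_mbps: float, buffer_s: float, target_buffer_s: float) -> float:
    """Simplified stand-in (NOT QUETRA): start from the throughput estimate
    and nudge the budget up or down depending on how far the playback
    buffer is from its target occupancy."""
    slack = (buffer_s - target_buffer_s) / max(target_buffer_s, 1e-6)
    return max(0.0, throughput_mbps * (1.0 + 0.5 * slack))

# e.g., 8 Mbps measured, buffer 2 s below a 6 s target -> a conservative budget.
print(bitrate_budget(8.0, 4.0, 6.0))   # ~ 6.67 Mbps
```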
We then allocate the bitrate budget estimated from QUETRA to the tiles. The tile rate allocation problem can be modeled as a multi-class knapsack problem, which is a special case of the 0-1 knapsack problem with a disjoint multi-class constraint. A frame is segmented into $N$ tile groups, and each group has $L$ representation levels. Let $T_{i,l}$ be the tile of group $i$ at representation level $l$; denote its estimated bitrate as $r_{i,l}$ and its rendered quality as $Q(T_{i,l}, p)$, where $p$ is the position of the user viewpoint. The rendered quality $Q(T_{i,l}, p)$ is obtained by Eq. 3, and the viewpoint position $p$ is predicted by our viewpoint prediction model described in Section 3.3.
We wish to choose tiles from different representation levels to maximize the expected rendered quality, while the total rate is kept within the given bitrate budget $B$ estimated from QUETRA. Let $X = \{x_{i,l} : 1 \le i \le N, 1 \le l \le L\}$, where $x_{i,l}$ is 1 if and only if tile $T_{i,l}$ is selected to be downloaded for rendering, and is 0 otherwise. We can formulate the problem as
$$\max_{X} \; \sum_{i=1}^{N}\sum_{l=1}^{L} x_{i,l}\, Q(T_{i,l}, p) \quad \text{s.t.} \quad \sum_{i=1}^{N}\sum_{l=1}^{L} x_{i,l}\, r_{i,l} \le B, \qquad \sum_{l=1}^{L} x_{i,l} = 1 \;\; \forall i, \qquad x_{i,l} \in \{0, 1\}.$$
To address this multi-class knapsack problem, we adopt a solution based on Pisinger's greedy heuristic [23]. This heuristic begins by sorting the representation levels of each tile by their bitrate and selecting the lightest one from each tile to cover all tiles. Next, it sorts the representations in each tile group by their quality-to-bitrate slope $\big(Q(T_{i,l}, p) - Q(T_{i,l-1}, p)\big) / \big(r_{i,l} - r_{i,l-1}\big)$, which quantifies the improvement in quality gained by selecting a higher representation level from the sorted list. The algorithm then replaces previously selected levels with the next level from the sorted sequence within the same tile until the total bitrate reaches the budget.
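A compact sketch of the greedy upgrade procedure is given below, using the notation above ($Q(T_{i,l}, p)$ as quality, $r_{i,l}$ as bitrate, $B$ as budget). It follows the spirit of the heuristic described here (cheapest level first, then slope-guided upgrades within the budget) rather than Pisinger's exact algorithm.

```python
def allocate_tiles(quality, bitrate, budget):
    """quality[i][l], bitrate[i][l]: rendered quality and bitrate of tile
    group i at level l, sorted by increasing bitrate.  Returns the chosen
    level index per tile group."""
    n = len(quality)
    chosen = [0] * n                                   # cheapest level per tile first
    total = sum(bitrate[i][0] for i in range(n))

    def slope(i, l):
        dq = quality[i][l] - quality[i][l - 1]
        dr = bitrate[i][l] - bitrate[i][l - 1]
        return dq / dr if dr > 0 else float("inf")

    upgraded = True
    while upgraded:
        upgraded = False
        best, best_slope = None, -1.0
        for i in range(n):                             # best affordable upgrade this pass
            l = chosen[i] + 1
            if l < len(quality[i]):
                extra = bitrate[i][l] - bitrate[i][chosen[i]]
                if total + extra <= budget and slope(i, l) > best_slope:
                    best, best_slope = (i, l), slope(i, l)
        if best is not None:
            i, l = best
            total += bitrate[i][l] - bitrate[i][chosen[i]]
            chosen[i] = l
            upgraded = True
    return chosen

# Two tile groups with three representation levels each, 10 Mbps budget.
q = [[2.0, 3.2, 4.0], [1.8, 2.4, 2.7]]
r = [[1.0, 3.0, 6.0], [1.0, 2.5, 5.0]]
print(allocate_tiles(q, r, budget=10.0))   # -> [2, 1]
```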

EVALUATION
Experiment Setup
Volumetric Video Dataset. We use four dynamic point cloud sequences from the 8i Dataset [9]: RedAndBlack, Loot, Soldier, and LongDress, for our evaluation. Each sequence has 300 frames at a frame rate of 30 fps. The average number of points per frame and the corresponding bitrates (in Gbps) of the uncompressed volumetric videos are summarized in Table 3.

Network Profile. Four network profiles (P1, P2, P3, and P4) are used for evaluation. We follow the DASH Industry Forum Guidelines [7] to generate Profiles P1 and P2. Specifically, P1 follows a high-low-high pattern and P2 follows a low-high-low pattern. The throughput of P1 and P2 varies regularly at intervals of 3 seconds across five levels with rates {3, 4, 6, 8, 10} Mbps. Profiles P3 and P4 are real 4G/LTE network traces collected from a moving bus and a moving car [28], respectively. The throughput of P3 and P4 ranges from 0 to 26 Mbps, varying every second. The throughput variation does not follow any regular pattern, making rate adaptation challenging. Figure 7 shows their throughput variations over 60 seconds.

Viewpoint Trajectory. We use two viewpoint trajectories (V1 and V2) for evaluation. As discussed in Section 3.3, V1 is the set of real viewpoint trajectories collected from 26 participants [33]. V2 is the synthesized trajectory generated by emulating more dynamic user movement.
Encoder Settings. V-PCC reference software version 15 is used as our point cloud codec. We use the "lossy geometry lossy attribute random access" configuration defined in the V-PCC common test condition (CTC) [10]. We divide the point cloud into six tiles with the decomposition method described in Section 3.2. V-PCC compression rates are controlled by the geometry and texture quantization parameters (QP). There are five compression rates defined in the CTC, labeled R5 to R1. R5 has the highest quality (lowest compression) and R1 has the lowest quality (highest compression). We also define two additional rate points with higher distortion levels than R1, denoted R0 and R-1; their geometry QPs are set to 36 and 40, and their attribute QPs to 47 and 52, respectively. Therefore, a total of seven representation levels are available for each tile. Table 4 summarizes the CTC encoder settings as well as our additional settings.

Decoding and Rendering Setting. To enable real-time V-PCC decoding, we reimplemented the V-PCC reference decoder in Rust with various optimizations, achieving an average of 15 times faster decoding than the reference decoder. The code is made public to support further research. We also work with a toolkit called VVTk, which provides an extendable video player for real-time dynamic point cloud rendering. The streaming simulation and rendering in our experiments are all conducted online and in real time through our V-PCC decoder and VVTk, running on a computer with an Apple M1 Pro and 32 GB RAM running macOS 13.2. The width and height of the rendered videos are set to 900 and 1600, respectively. The point size is set to 1, and the background color of all rendered videos is set to black. The playback was looped for 60 seconds for each combination of point cloud sequence, network profile, and viewpoint trajectory, resulting in 32 combinations.

Evaluation Metrics. For evaluation, PCC Arena [34] is adopted to compute the 2D quality of the rendered volumetric videos. Two metrics are used to quantify the quality of rendered volumetric videos: VMAF [1] and SSIM. To avoid disturbance from the background, which is irrelevant to the quality of the 3D point clouds, we generate a mask map to exclude the background pixels of each rendered 2D image when using SSIM to evaluate the quality of rendered point cloud frames.

Experimental Results
To demonstrate the effectiveness of our QoE model and viewing adaptations, we first evaluate the performance of QV4 and compare it with a baseline streaming pipeline denoted as V-PCC (Base). V-PCC (Base) encodes and decodes the whole point cloud with V-PCC without tiling. The proposed QoE-based tile rate allocation and viewpoint prediction are also disabled for V-PCC (Base). The results are summarized in Table 5. We can make several observations from the results. First, our proposed method consistently achieves the best quality, with up to 14.67% improvement in SSIM and 7.39% improvement in VMAF on average, compared with V-PCC (Base). Besides, both QV4 and V-PCC (Base) achieve better performance on V1 than on V2. As discussed in Section 3.3, users in V1 spent most of the time in a certain area looking at the frontal body, while the movement in V2 was more dynamic and complex. Under a dynamic viewpoint trajectory, the viewpoint predictions are less accurate. Mispredicted viewpoints lead to wrong bitrate allocation, thus degrading the video quality. Even so, our method still achieves a 3% improvement in SSIM and a 3.58-point improvement in VMAF on average over the four network profiles, compared with the baseline. The above results demonstrate the superiority of the proposed method, in terms of 2D visual quality, over the method without our proposed viewing optimizations.
To sum up, the overall quality reported in Table 5 and the quality-versus-bitrate results shown in Figure 8 and Figure 9 indicate that our method can effectively and robustly improve the visual quality of volumetric videos under diverse network conditions and complex viewing behaviors.
Bandwidth Efficiency. From Figure 8 and Figure 9, we can observe that at the same quality in terms of SSIM and VMAF, our method generally consumes much less bandwidth than V-PCC (Base). Taking LongDress as an example, QV4 saves up to 49.6% and 41.1% of the bitrate at the same SSIM and VMAF quality, respectively, compared to V-PCC (Base). In summary, the results plotted in Figure 8 and Figure 9 demonstrate the bandwidth efficiency of our method compared to V-PCC (Base).
Effect of Viewpoint Adaptation. The main reason that QV4 can save considerable bandwidth while keeping high visual quality is that QV4 considers users' viewing behaviors and assigns higher quality to the tiles in the view and lower quality to the tiles out of the view. To better illustrate the effect of viewpoint adaptation of QV4, we present Figure 10, which depicts the bitrate distributions of in-the-view and out-of-the-view tiles for all video frames over the two viewpoint trajectories and four network profiles. For comparison, we disable the QoE-based tile rate adaptation of QV4 described in Section 3.5 so that each tile is assigned the same bitrate, and denote this variant as QV4 (NVA), i.e., no viewpoint adaptation. The main observation is that QV4 consistently assigns higher quality to the tiles within the view and decreases the bitrate of the tiles out of the view, under all four network profiles and both viewpoint trajectories. Overall, QV4 allocates representation levels with 45.5% higher bitrate to the visible tiles and 62.5% lower bitrate to the invisible tiles, compared to QV4 (NVA). By adapting to the viewpoint of users, QV4 decreases the bitrate of invisible tiles and increases the bitrate of in-the-view tiles, thus achieving high-quality volumetric video streaming under limited and complex network conditions.
Effect of Viewpoint Prediction. We further conduct comparison experiments to demonstrate the feasibility and effectiveness of the proposed viewpoint prediction. We remove the viewpoint prediction model from QV4 and denote this variant as QV4 (NVP), i.e., no viewpoint prediction. We conducted experiments under the four network profiles and two viewpoint trajectories. The average SSIM and VMAF over the four network profiles are shown in Table 6. As shown, compared with the proposed method, the overall quality of the volumetric videos produced by QV4 (NVP) degrades by 4% to 6% in SSIM and by 1.78 to 5.71 points in VMAF. To visualize the difference, Figure 11 provides an example of the rendered point clouds with and without viewpoint prediction. One can see that QV4 (NVP) suffers great quality degradation compared to the proposed method. By predicting the user viewpoint, QV4 can prefetch the 3D content with higher quality and reduce the delay. In contrast, QV4 (NVP) cannot anticipate the user behavior, thus failing to fetch the correct 3D content in advance. As a result, QV4 (NVP) has to lower the quality of the point clouds for low-latency volumetric video streaming. In conclusion, Table 6 and Figure 11 validate the effectiveness of the proposed viewpoint prediction.
QoE. Moreover, we calculate the QoE value of each frame for the proposed and baseline streaming methods. Figure 12 presents the distribution of QoE for the four dynamic point clouds. The mean values are also plotted for better comparison.
The results indicate that when the bandwidth is limited and fluctuating, QV4 consistently provides considerably better QoE on average. Specifically, compared to V-PCC (Base), QV4 improves the QoE of LongDress, RedAndBlack, Loot, and Soldier by 45.45%, 29.65%, 12.98%, and 14.70% on average, respectively. Besides, the MOS of 52.34% of the point cloud frames streamed by QV4 is predicted as 4 (good) or 5 (excellent), whereas only 26.92% of the frames streamed by V-PCC (Base) are predicted as good or excellent. The improvement in QoE arises because the viewpoint-aware optimizations of QV4 reduce the bitrate of the tiles outside the viewport and assign higher quality to the tiles that are more visually important.
System Performance. We also conduct a comprehensive performance evaluation of QV4 under varying viewpoint trajectories (V1 and V2) and network conditions (P1, P2, P3, and P4). Figure 13 shows the overall frame update rate, measured including decoding and rendering time. Remarkably, our findings demonstrate that QV4 consistently achieves a frame rate exceeding 30 frames per second (fps), reaching up to 60 fps, across all four dynamic point clouds. This performance is maintained in the face of complex and dynamic network conditions and viewing trajectories. Furthermore, we assessed the average per-frame decoding and rendering latency of QV4, yielding 16.87 ms and 8.11 ms, respectively. The result reveals that QV4 effectively keeps decoding latency below 30 ms, making V-PCC decoding no longer the bottleneck for volumetric video streaming [18,30,32]. Notably, the latency analysis excludes the inference time of viewpoint prediction and QoE prediction, as their durations are negligible (within 0.01 ms).

DISCUSSION AND CONCLUSION
We present QV4, a volumetric video streaming pipeline empowered by QoE-based viewpoint-aware optimizations. With the proposed QoE model and viewpoint-aware optimizations, QV4 adaptively chooses the optimal tiles under complex user viewing behaviors and network conditions, thus maximizing the user's viewing experience. Extensive simulations show that our pipeline can improve the volumetric video quality by up to 14.67% in SSIM and 7.39% in VMAF over dynamic viewpoint trajectories and limited, fluctuating network conditions.
There are several limitations to our work that we can improve upon. First, the point cloud is partitioned into non-overlapping tiles based on the predefined view directions, and the tiles are then encoded independently. The main issue incurred by segmentation is that smaller tiles contain less inter-spatial redundancy, making the overall compression efficiency worse than that of unsegmented video. In particular, V-PCC is sensitive to the density and complexity of the volumetric videos. In this paper, we only validate the feasibility and effectiveness of six tiles, which is the default setting of V-PCC. Therefore, exploring the trade-off between the segmentation size and the overhead of V-PCC remains a potential research direction. Secondly, visual saliency serves as an important feature in tile rate allocation. By allocating a higher bitrate to the tiles with higher visual saliency and a lower bitrate to the unattended tiles, we can significantly reduce the utilized bandwidth and client-side decoding overhead, and improve the users' viewing experience. We plan to develop a saliency-detection model to segment the tiles based on their visual importance, thus allowing saliency-aware tile rate allocation. Thirdly, while the visual quality results measured by SSIM and VMAF provide valuable insights into the performance of our proposed method, user studies would provide a more comprehensive understanding of the effectiveness of QV4 in terms of the viewing experience. Therefore, future work should involve subjective evaluations to further assess our method.

Figure 1 :
Figure 1: Example of patch generation and packing of V-PCC. From left to right: the original point cloud; the segmented point cloud, with different colors representing different patches; the occupancy map, attribute image, and geometry image generated from these patches.

Figure 2 :
Figure 2: The pipeline of our proposed streaming system. On the server side, V-PCC tiling is performed to partition the point cloud into view-dependent representations of varying qualities. On the client side, the adaptation techniques described in §3.2, §3.3, §3.4, and §3.5 are leveraged to optimize the quality of the visible parts, resulting in higher visual quality for visible parts and lower quality for invisible parts.

Figure 3 :
Figure 3: Examples of viewpoint trajectories. The centroid of the point cloud object is located at (0,0,0). (a) Real user trajectories from [33]. (b) Synthesized trajectories.
(a) The predictions of LR. (b) The predictions of MLP. (c) The predictions of SVR. (d) The predictions of GRU.

Figure 4 :
Figure 4: Visualization of the predictions of the four models and the ground truth when $h = 60$ and $w = 120$.

Figure 6 :
Figure 6: An example of the relationship between viewpoint and tiles.

Figure 8 :
Figure 8: SSIM. Each point represents the average quality per second with the corresponding bitrate.

Figure 9 :
Figure 9: VMAF. Each point represents the average quality per second with the corresponding bitrate.
(a) Bitrate of in-the-view tiles. (b) Bitrate of out-of-the-view tiles.

Figure 10 :
Figure 10: The distributions of the bitrate of tiles in the view (left) and out of the view (right) over the viewpoint trajectories (V1 and V2) and network conditions (P1, P2, P3, and P4). The quartiles are shown for better comparison.

Figure 11 :
Figure 11: Sample rendered point clouds of LongDress using the proposed method with and without viewpoint prediction.

Table 1 :
Performance of in-the-view tile prediction.

Table 2 :
Comparison of the QoE models.

Table 4 :
Settings for MPEG V-PCC Reference Encoder

Table 6 :
Performance of the proposed method with and without viewpoint prediction.