Extending 3-DoF Metrics to Model User Behaviour Similarity in 6-DoF Immersive Applications

Immersive reality technologies, such as Virtual and Augmented Reality, have ushered in a new era of user-centric systems, in which every aspect of the coding-delivery-rendering chain is tailored to the interaction of the users. Understanding the actual interactivity and behaviour of users is still an open challenge and a key step towards enabling such user-centric systems. Our main goal is to extend the applicability of existing behavioural methodologies for studying user navigation to the case of 6 Degrees-of-Freedom (DoF). Specifically, we first compare navigation in 6-DoF with its 3-DoF counterpart, highlighting the main differences and novelties. Then, we define new metrics aimed at better modelling behavioural similarities between users in a 6-DoF system. We validate and test our solutions on real navigation paths of users interacting with dynamic volumetric media in 6-DoF Virtual Reality conditions. Our results show that metrics considering both user position and viewing direction perform better at detecting user similarity while navigating in a 6-DoF system. Having easy-to-use but robust metrics that underpin multiple tools and answer the question "how do we detect if two users look at the same content?" opens the gate to new solutions for user-centric systems.


INTRODUCTION
Immersive reality technology has revolutionised how users engage and interact with media content, going beyond the passive paradigm of traditional video technology and offering more degrees of presence and interaction in a virtual environment. Depending on how much a user can move in the 3D space, immersive environments can be classified as 3- or 6-Degrees-of-Freedom (DoF). In a 3-DoF scenario, the de-facto multimedia content is the omnidirectional or spherical video, representing an entire 360° environment on a virtual sphere. The viewer is fully immersed in a virtual space where they can navigate and interact thanks to an immersive device, typically a head-mounted display (HMD), which allows them to view only a portion of the surrounding environment, named the viewport. The media is displayed from an inward position, and the viewer can interact with the content only by changing the viewing direction (i.e., by looking up/down or left/right, or tilting the head side to side). In a 6-DoF system, the user can also change viewing perspective by moving (e.g., walking, jumping) inside the virtual space. The scene is therefore populated by volumetric objects (i.e., meshes or point clouds), which are observed from an outward position. These extra degrees of freedom bring the virtual experience even closer to reality: a higher level of interactivity makes the user more immersed and present within the virtual environment [4].
Despite their differences, the common denominator of both interactive systems is the viewer as an active decision-maker of the displayed content. This active role defines the user-centric era, in which content processing, streaming, and rendering need to be tailored to the viewer's interaction to remain bandwidth-tolerant whilst meeting quality and latency criteria [28,45]. Media codecs need to be optimised such that the quality experienced by the user is maximised [34,46]. Similarly, streaming should be tailored to users' interactivity to ensure high-quality content and smooth navigation while remaining bandwidth-tolerant [13,22,39]. Hence the importance of understanding, analysing and predicting users' movements (i.e., user behaviour) within an immersive scenario [12,26,29,43]. A better understanding of how the population behaves when experiencing immersive reality has an impact that goes beyond system applications, leading to user similarities, i.e., user clustering/profiling [23], which is essential for several purposes: from secure authentication [42] to medical applications [17].
Thanks to the large availability of public datasets [14,20,25], user navigation in 3-DoF immersive systems has been deeply investigated [30,35], showing the importance of analysing and detecting key behavioural aspects in interactive (user-centric) systems. However, the 6-DoF counterpart has been scarcely considered in the literature [1,38,47]. Due to the change in the viewing paradigm (from inward to outward) and the higher level of interaction in 6-DoF, current studies in 3-DoF cannot be directly applied to 6-DoF domains [33]. Filling this gap by providing new metrics for user analysis in 6-DoF is the main goal of this paper.
In this work, we focus on extending the applicability of clustering methods for investigating user similarity (i.e., users sharing common behaviours while interacting with the content) to 6-DoF environments. Specifically, clustering techniques usually rely on pairwise similarity metrics, with similarity being in this case in terms of 6-DoF interaction. To the best of our knowledge, such a metric has not yet been proposed in the 6-DoF context. Starting from the state-of-the-art clustering algorithm developed for 3-DoF [27], and the main limitations of that tool when extended to 6-DoF described in [33], we investigate new methodologies for better modelling user similarities and overcoming those limitations. First, we recall the definition of the user navigation trajectory in 6-DoF. Then, we present the exact user similarity metric, which we will consider as our ground truth. Given its computational complexity, after an exhaustive study, we propose a simpler and yet reliable proxy for it. More concretely, we define and compare 8 similarity metrics based on different distance features (i.e., user positions in the 3D space, user viewing directions) and distance measurements (i.e., Euclidean and geodesic distance). We validate and test our proposed similarity metrics on a publicly available dataset of navigation trajectories collected in a 6-DoF Virtual Reality (VR) scenario [39]. Results show that similarity metrics based on multiple distance features are promising solutions for correctly detecting users with similar behaviour while experiencing volumetric content.
Our work contributes to the overall open problem of behavioural analysis in a 6-DoF system with the following main contributions:
• presenting the general problem of detecting behavioural similarities in a 6-DoF system, and introducing novel similarity metrics able to model user behaviour in this scenario. These are expressed as a function of various distance features and measurements, and we divide them into two groups: single- and multi-feature metrics;
• an exhaustive analysis of the different proposed metrics aimed at capturing users' trajectory similarity (in terms of distance on the plane or from the object) and their ability to approximate the ground truth. This analysis, based on 6-DoF VR trajectories, reveals that the position on the floor alone is not sufficient to characterise user behaviour and that the viewing direction cannot be neglected.
The remainder of this article is organised as follows: related work on user behavioural analysis in both 3-DoF and 6-DoF systems is reported in Section 2. The main challenges of detecting behavioural similarities in a 6-DoF system and the importance of having a tool that approximates such similarities are described in Section 3. Our proposed similarity metrics are described in Section 4, while Section 5 and Section 6 present the experimental setup and the validation of our proposed metrics on real navigation trajectories collected in a 6-DoF VR setting, respectively. Further discussion and final conclusions are summarised in Section 7.

RELATED WORK
We now describe how user behaviour has been analysed in 3-DoF systems, showing also the benefit of this type of analysis in user-centric systems. Then, we show which methods have been used for the analysis in 6-DoF scenarios, highlighting the open challenges that remain.

User Behaviour in 3-DoF environment
User navigation within a 3-DoF environment has been intensely analysed from many perspectives. Many studies have focused on psychological investigations of user engagement and presence correlated to movements within the spherical content. In [15], a large-scale experiment (511 users and 80 omnidirectional videos) showed a positive correlation between a lower interactivity level and a higher engagement level (strong focus on few points of interest). Similarly, a correlation between the perceived sense of presence and the interactivity level was detected in [2], with more random exploratory interactions for less immersed (and hence less engaged) users. However, no objective metric to properly quantify and characterise user behaviour has been presented in these works.
To further understand how people observe and explore 360° content, many public datasets of navigation trajectories have been made available. These datasets usually come with statistical analyses aimed at capturing average user behaviour, as a function of maximum and average angular speeds under various video segment lengths [5], exploration time [35] or eye fixation distribution [7]. A deeper analysis was presented in [20], where the dataset was analysed through the clustering algorithm presented in [27], specifically built to group in the same cluster users who similarly explore 360° content. However, behavioural analysis based on such a clustering tool mainly provides a general idea of similarity among viewers without offering a quantitative metric. To overcome this limitation, the authors in [30] showed the benefit of studying spatio-temporal trajectories through information-theoretic metrics, and thus the possibility of identifying and quantifying behavioural aspects. Key outcomes of this quantitative analysis were the study of similarities between users watching the same content, but also the similarity of a given user across diverse content. The importance of these behavioural insights has then been exploited in different VR applications. For instance, the authors in [21] proposed a scalable prediction algorithm for user navigation, which considered previous navigation patterns, while in [19] a hybrid approach was presented based on both dominant user behaviour (detected via a clustering approach) and the video content. Recently, the authors in [11] showed that behavioural uncertainty can lead to different future navigation even for users who previously exhibited similar behaviour; thus, a deep variational learning framework to predict multiple plausible head trajectories was presented. Moreover, to extend publicly available navigation datasets, realistic synthetic head rotation data were generated using a deep learning algorithm that preserves a similar data distribution over time [37]. Finally, the analysis and understanding of user navigation in a VR environment have shown promising results also in determining mental health issues of subjects (e.g., anxiety, autism spectrum disorder, eating disorders, depression) and their treatment [9,10,18].

User Behaviour in 6-DoF environment
Extending such behavioural analysis to a 6-DoF environment is not straightforward, due to the change in the viewing paradigm (from inward to outward) and the addition of translation in 3D space. In the past, user navigation in 6-DoF scenarios was studied in the context of locomotion and display technology for CAVE environments [24,41]. A Cave Automatic Virtual Environment (CAVE) system is an immersive room in which the video content is projected on the walls and floor, and viewers are free to move inside [6]. For instance, the study in [41] focused on task performance analysis in terms of completion time and correct actions. The authors in [24] instead compared the effect of two different immersive platforms, CAVE and HMD, on user navigation. More traditional metrics, such as angular distance and linear velocity, alongside completion time, were also used to compare different navigation controllers (i.e., joystick-based vs head-controlled navigation) in 6-DoF [3]. In detail, the authors showed the superiority of head-controlled techniques, which allow a greater sense of presence and better control with less discomfort during navigation. While the tools mentioned above are highly informative for summarising the interaction of users within a 6-DoF environment, they usually fail to provide other key insights: which users navigate similarly, and which are the dominant interaction behaviours among users.
Recently, the focus has been put on subjective quality assessment of different coding techniques for volumetric content, both static [1] and dynamic [40]. These studies present a statistical analysis of user movements in terms of mean angular velocity, ratio of frames viewed while in movement, and most displayed areas of the content, showing an influence of the perceived content quality on navigation, and point out a users' preference for visualising the volumetric object from a close and frontal perspective. A behavioural analysis of users navigating in a 6-DoF social VR movie has also been presented in [32]. An investigation of how users are affected by virtual characters and narrative elements of the movie was conducted through objective metrics, showing a more static behaviour when an interactive task was requested, and more exploratory movements during dialogues. The authors in [31] present an exploratory behavioural analysis of users displaying volumetric content within a 6-DoF environment, focusing on understanding how the way of navigating is affected by the content and its features, such as dynamics and quality, but also by the intrinsic disposition of the single user. Finally, to encourage the collection of navigation data in 6-DoF immersive experiences, a new tool was recently released in [44].
These preliminary studies are based on conventional metrics, which consider only one user feature at a time (either position on the floor or viewing direction, but not both), suffering from the major shortcomings highlighted before. In this paper, we aim to overcome these limitations by proposing a general and efficient tool for detecting similar viewers experiencing 6-DoF content.

CHALLENGES
In this work, our main goal is to define a new pairwise metric able to capture the (dis)similarity between two 6-DoF users (in terms of displayed content). This metric needs to be reliable and yet simple to compute. In the following, we first present our assumption of similarity among users navigating in a 6-DoF environment, based on [33]. Then, we show an exact user similarity metric, highlighting its limitations and therefore the need for a simpler proxy. Finally, we emphasise the advantages of having a similarity metric for behavioural analysis via the clique-based clustering approach presented in [27], which identifies users attending the same portion of omnidirectional content in a 3-DoF system. This clustering technique relies on a pairwise similarity metric; thus, having a proper metric also for 6-DoF systems would extend the applicability of this state-of-the-art tool.

User Similarity in 6-DoF
Similarly to [33], we are interested in analysing user behaviour, assuming that users interact similarly when they observe the same volumetric content. The user behaviour can be identified by the spatio-temporal sequences of their movements within the virtual environment, namely navigation trajectories.
In a 3-DoF scenario, the trajectory of a generic user u can be formally denoted by the sequence of the user's viewing directions over time {v_1^u, v_2^u, ..., v_T^u}, where v_t^u is the centre of the viewport projected on the immersive content (i.e., the spherical video) at timestamp t. In fact, the viewport centre alone is highly informative of the user behaviour and can be used as a proxy of viewport overlap among users [27]. In particular, the geodesic distance has proved to be a reliable similarity metric, such that a low value indicates high similarity between 3-DoF users.
Differently, in a 6-DoF setting, the more degrees of freedom are given to the user, the more challenging become the system and the description of user navigation within it. The viewport centre alone is no longer sufficient to characterise the user behaviour in a 6-DoF scenario, since now the distance between the user and the immersive content can change over time due to the added degrees of freedom. Figure 1 shows an example of two users navigating in a 6-DoF system. On the left side of the figure, the navigation trajectories of two users u and z are projected on a 2-D domain (i.e., the floor). Each point p_t represents the spatial coordinates (i.e., [x, y, z]) of a viewer on the floor, while each associated vector symbolises the viewing direction. In the right part of Figure 1, we have instead a snapshot of a specific time instant t. In more detail, the shaded triangular areas represent the viewing frustum per user, which indicates the region within the user viewport, and d_t is the distance between the user and the volumetric content. We have also depicted the viewport centre v_t projected on the displayed volumetric object. Given the two users u and z at time t, in the case of d_t^u ≫ d_t^z, user z (very close to the object) is visualising a very focused and detailed part of it; conversely, user u is pointing to the same area but from a much greater distance, thus the experienced content is different, with less defined details. Despite this difference, the small distance G_t(u, z) between the viewport centres v_t^u and v_t^z might suggest a high similarity between the users, which does not reflect reality in the case of d_t^u ≫ d_t^z. Thus, in this scenario, we cannot rely on the viewport centre only to characterise the user behaviour. The distance d and the spatial coordinates on the virtual floor p are also needed to formally define the navigation trajectory for a generic 6-DoF user u as {(p_1^u, v_1^u, d_1^u), (p_2^u, v_2^u, d_2^u), ..., (p_T^u, v_T^u, d_T^u)}. This information is crucial to define a simple similarity metric among users in this new setting.
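As a minimal illustration, a per-timestamp 6-DoF sample can be stored as the triple (p, v, d) defined above; the names below (`Sample6DoF`, `make_sample`) are hypothetical helpers for this sketch, not part of any released tool:

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Sample6DoF:
    t: float   # timestamp
    p: tuple   # [x, y, z] position on the virtual floor
    v: tuple   # viewport centre projected on the volumetric content
    d: float   # distance between the user and the content

def make_sample(t, p, v):
    """Build one trajectory sample; d is derived here from p and v."""
    return Sample6DoF(t, p, v, math.dist(p, v))

# A trajectory is then simply the time-ordered list of samples.
trajectory = [make_sample(0.0, (0.0, 0.0, 0.0), (3.0, 4.0, 0.0))]
```

Storing d explicitly avoids recomputing the user-content distance for every pairwise metric evaluation.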

Overlap Ratio as the ground-truth metric
Since we are interested in capturing viewers attending similar volumetric content at the same time instant, following the work presented in [33], the most straightforward measure of this behaviour is the overlap among viewports. Given two users u and z, shown in the right part of Figure 1, we denote their displayed viewports as S_t^u and S_t^z, respectively, defined as the sets of points of the volumetric content falling within their viewing frustums. Then, we denote by S_t^u ∩ S_t^z the overlap set, i.e., the portion of points displayed by both users. Equipped with the above notation, we can now introduce a key metric for the analysis: the overlap ratio O_t(u, z). This is defined as the cardinality of the overlap set, normalised by the cardinality of the set containing all points of the volumetric content visualised by the two users. More formally, the overlap ratio at a specific time t is:

O_t(u, z) = |S_t^u ∩ S_t^z| / |S_t^u ∪ S_t^z|,     (1)

where S_t^u and S_t^z are the displayed viewports of users u and z, respectively. In particular, a high value of the overlap ratio means high similarity between the users' displayed content, and conversely. Even if this metric is exact and a clear indicator of how similar users are with respect to their displayed content, its evaluation is not trivial, as it is intensely time-consuming. For instance, computing the overlap ratio between two users requires 0.8986 seconds per frame on average on an Intel machine with a CPU E5-4620 at 2.10 GHz; the operation needs to be computed for all possible pairs of users, leading to a large overhead which does not meet the requirements of real-time and scalable applications. A simpler measure is needed for real-time applications. In the rest of the paper, we will use this metric as the ground truth of the overlap among users and investigate different similarity metrics as proxies for the viewport overlap.
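With each viewport represented as a set of indices of the displayed content points, the overlap ratio of Equation (1) reduces to a Jaccard-style set computation. A minimal sketch (the function name is ours):

```python
def overlap_ratio(S_u, S_z):
    """Overlap ratio O(u, z): points displayed by both users, normalised by
    all points displayed by either user (Equation (1))."""
    union = S_u | S_z
    if not union:          # neither user displays any point
        return 0.0
    return len(S_u & S_z) / len(union)

# Two viewports sharing 2 of 6 displayed points -> O = 1/3
print(overlap_ratio({1, 2, 3, 4}, {3, 4, 5, 6}))
```

The set operations themselves are cheap; the cost reported above comes from determining which content points fall inside each viewing frustum, repeated for every user pair and every frame.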

Clustering as a tool for behavioural analysis
Being able to assess user similarities in an objective way can be crucial for applications such as behavioural analysis. As shown in [27], a clique-based clustering algorithm can be used to detect users with similar behaviour. This requires a reliable graph constructed in such a way that only the nodes identifying similar users (i.e., users displaying the same portion of the content) are connected. Equipped with such a meaningful graph, the clique-based clustering identifies optimal sub-graphs of fully inter-connected nodes, ensuring the identification of the largest cluster of users all sharing a large viewport overlap. In more detail, given a set of users experiencing the same content, we can represent their movements in a time window W as a set of graphs {G_t}_{t=1}^{W}. Each unweighted and undirected graph G_t = {V, E_t, A_t} represents behavioural similarities among users at time t, where V and E_t denote the node and edge sets of G_t, respectively. Each node in V corresponds to a user interacting with the content. Each edge in E_t connects neighbouring nodes as defined by the binary adjacency matrix A_t. Assuming that users are connected if they are displaying similar content, we can formally define the adjacency matrix A_t as follows:

A_t(u, z) = 1 if w_t(u, z) ≥ w_th, and A_t(u, z) = 0 otherwise,     (2)

where w_t(u, z) is a similarity metric between users u and z and w_th is a thresholding value. On this final graph, the clique-based clustering algorithm can be applied to identify a set of fully connected users (i.e., a clique), and therefore users with similar behaviour. In [27], this graph construction is based on a pairwise similarity metric (the geodesic distance) designed specifically for 3-DoF trajectories. Identifying a generic and reliable metric w(u, z) that approximates behavioural similarities among users experiencing 6-DoF content is a key step to enable user behavioural analysis via the tools proposed for the 3-DoF scenario, and is the focus of the next section.
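The graph construction and clique search above can be sketched as follows; the brute-force clique enumeration only mirrors Equation (2) on a handful of users and is not the optimised algorithm of [27]:

```python
from itertools import combinations

def build_edges(users, w, w_th):
    """Binary adjacency of Equation (2): connect u, z when w(u, z) >= w_th."""
    return {frozenset((u, z)) for u, z in combinations(users, 2)
            if w(u, z) >= w_th}

def largest_clique(users, edges):
    """Largest fully inter-connected subset (exhaustive; fine for small graphs)."""
    for k in range(len(users), 1, -1):
        for sub in combinations(users, k):
            if all(frozenset((a, b)) in edges for a, b in combinations(sub, 2)):
                return set(sub)
    return set()

# Toy pairwise similarities: users 0-2 mutually similar, user 3 isolated.
sim = {frozenset((0, 1)): 0.9, frozenset((0, 2)): 0.8, frozenset((1, 2)): 0.85}
w = lambda u, z: sim.get(frozenset((u, z)), 0.0)
edges = build_edges([0, 1, 2, 3], w, w_th=0.5)
print(largest_clique([0, 1, 2, 3], edges))
```

In practice the exhaustive search is exponential in the number of users; [27] relies on a dedicated clique-finding procedure instead.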

PROPOSED METRICS
In this section, we present eight similarity metrics and provide an exhaustive study to understand which one best approximates the viewport overlap. These metrics are expressed as a function of various distance features and measurements, considering either the users' position on the floor (p), the users' viewing direction in terms of the viewport centre projected on the volumetric content (v), or both. We divide the metrics into two groups: single-feature and multi-feature metrics. For the sake of notation, we omit the temporal parameter t. Table 1 summarises the distance features and measurements that we consider, while our proposed similarity metrics are reported in Table 2.

Single-feature metrics to assess users similarity
The first set of similarity metrics is based on one single distance feature. We model the similarity functions via a radial basis function kernel. Specifically, we consider the Gaussian kernel [36] defined as follows:

w(u, z) = exp(−d(u, z)² / σ²),

where d(u, z) is the distance between two generic users u and z, while σ > 0 is a parameter to regularise the distance. This distance can be evaluated in multiple ways, and we consider the distance features and measurements taken into account in [33]. Specifically, the first two similarity metrics w_1 and w_2 are based on the location of users in the virtual space with respect to the virtual object or other viewers. The former is based on the Euclidean distance E(p_u, p_z) between users u and z on the virtual floor. Instead, w_2 considers the difference in terms of the relative distance of users to the centroid o of the displayed content, L_u = ||p_u − o||. Specifically, we define them as follows:

w_1(u, z) = exp(−E(p_u, p_z)² / σ²),  w_2(u, z) = exp(−(L_u − L_z)² / σ²).

The metrics w_3 and w_4 are instead based on the distance between the two viewport centres v_u and v_z of users u and z projected on the volumetric content. To take into account the heterogeneous shape of the volumetric content, this distance is measured in w_3 in terms of the geodesic distance G(v_u, v_z), while in w_4 in terms of the Euclidean distance E(v_u, v_z). More formally, they are defined as:

w_3(u, z) = exp(−G(v_u, v_z)² / σ²),  w_4(u, z) = exp(−E(v_u, v_z)² / σ²).

Multi-feature metrics to assess users similarity

As emerged in [33], both the user viewing direction and the position on the virtual floor are relevant to detect similar behaviour among users. Thus, the last set of proposed similarity metrics considers a combination of distance features. In detail, w_5 and w_6 build on the floor-position metrics above, but also include the distance between the viewport centres v projected on the volumetric content, in terms of the geodesic distance G(v_u, v_z) and the Euclidean distance E(v_u, v_z), respectively. More formally, we define w_5 as:

w_5(u, z) = exp(−(α E(p_u, p_z)² + β G(v_u, v_z)²) / σ²),

while the second weight is equal to:

w_6(u, z) = exp(−(α E(p_u, p_z)² + β E(v_u, v_z)²) / σ²).

For the sake of clarity, α and β are regulators balancing the contribution of the two distance features.
The preliminary analysis presented in [33] has also highlighted a correlation between the viewport overlap of two users and their relative distance from the volumetric content. The closer users are to the volumetric content, the smaller and more detailed is the displayed portion of the content; the farther they are, the bigger, but with fewer details, the displayed portion becomes. Thus, in the first case, a high overlap between the displayed areas of two different users is more difficult to achieve. To take this behaviour into consideration, we model the relative distance via a hyperbolic tangent kernel. Given the relative distance d_u between user u and the volumetric content, we evaluate it as follows:

T(u) = tanh(d_u).

As previously, the metrics w_7 and w_8 are based on both the user distance on the virtual floor E(p_u, p_z) and the distance on the volumetric content, in terms of the geodesic distance G(v_u, v_z) and the Euclidean distance E(v_u, v_z), respectively. More formally, we define w_7 as follows:

w_7(u, z) = T(u) T(z) exp(−(α E(p_u, p_z)² + β G(v_u, v_z)²) / σ²),

while w_8 is:

w_8(u, z) = T(u) T(z) exp(−(α E(p_u, p_z)² + β E(v_u, v_z)²) / σ²).
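Under our reading of the kernels above (a Gaussian single-feature weight, a multi-feature weight combining floor and viewport-centre distances, and a tanh factor for the user-content distance), representative metrics can be sketched as follows. The exact parameterisation is an assumption of this sketch, and the geodesic term is approximated by a Euclidean distance for brevity:

```python
import math

def w1(p_u, p_z, sigma=1.0):
    """Single-feature: Gaussian kernel on the floor distance E(p_u, p_z)."""
    return math.exp(-math.dist(p_u, p_z) ** 2 / sigma ** 2)

def w5(p_u, p_z, v_u, v_z, alpha=0.5, beta=0.5, sigma=1.0):
    """Multi-feature: floor distance plus viewport-centre distance
    (Euclidean stands in for the geodesic term in this sketch)."""
    d2 = alpha * math.dist(p_u, p_z) ** 2 + beta * math.dist(v_u, v_z) ** 2
    return math.exp(-d2 / sigma ** 2)

def w7(p_u, p_z, v_u, v_z, d_u, d_z, alpha=0.5, beta=0.5, sigma=1.0):
    """As w5, modulated by tanh of each user's distance from the content."""
    return math.tanh(d_u) * math.tanh(d_z) * w5(p_u, p_z, v_u, v_z,
                                                alpha, beta, sigma)

# Co-located users looking at the same point reach maximal similarity.
p = (0.0, 0.0, 0.0); v = (1.0, 1.0, 1.0)
print(w1(p, p), w5(p, p, v, v))
```

Note how every kernel maps an unbounded distance into (0, 1], so that a single threshold can later be applied uniformly during graph construction.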

EXPERIMENTAL SETUP
We now validate the above metrics using a point cloud dataset. We first describe the navigation dataset and how we evaluate the performance of our similarity metrics (Sections 5.1 and 5.2, respectively). Then, we run an ablation study to evaluate the best-performing set of regulators for each similarity metric.

Dataset and Methodology
Dataset. Existing datasets of user navigation collected while displaying volumetric objects in a 6-DoF environment are still very limited. In the following, we use the open dataset presented in [39]. It comprises navigation trajectories of 26 users participating in a visual quality assessment study in VR. For the study, four dynamic point cloud sequences were employed [16], namely Long dress (PC1), Loot (PC2), Red and black (PC3), and Soldier (PC4) (Figure 2). Each sequence was distorted at four different bit rate points with two compression algorithms: the anchor used for the MPEG call for proposals, and the upcoming MPEG standard V-PCC. Hidden references were additionally employed in the test, for a total of 36 stimuli. Similarly to what is shown in Figure 1, a single object of interest was placed in the VR scene, and users were instructed to focus on the volumetric content for the duration of the session and rate its visual quality. Therefore, the navigation data adheres to the assumptions listed in Section 3.
Graph Construction. To implement the graph-based clustering proposed in [27] with our proposed similarity metrics, we need to construct a binary graph following Equation (2), as described in Section 3.3. Note that our proposed similarity metrics are based on distance measurements. As shown in [27], the correlation between overlap and distance is inversely proportional: high values of overlap (and thus high similarity) correspond to low distance. Therefore, the condition to construct the adjacency matrix A_t based on our proposed similarity metrics becomes w(u, z) ≤ w_th, where w(u, z) is one of the similarity metrics proposed in Section 4 and w_th is a threshold value which identifies similar users and thus neighbours on the graph. In short, users with a similarity metric below the threshold value w_th are neighbours in the graph. Hence, the first step is to identify w_th. For each proposed similarity metric, we empirically evaluate the Receiver Operating Characteristic (ROC) curves based on the navigation trajectories of the entire dataset described above and select the best threshold value, as originally done in [27]. Specifically, we set the thresholding values such that a good trade-off between True Positive Rate (TPR) and False Positive Rate (FPR) is met. As ground truth for the ROC, we assume that two users are attending the same portion of the content, and thus are classified as similar, if their viewports overlap by at least 75% of their total viewed area. The predicted event is instead evaluated using the eight metrics presented in the previous section, and the corresponding threshold values are selected in order to have a TPR equal to 0.75. For the sake of clarity, the ground-truth value of viewport overlap has been set to 75% because it ensures, for each similarity metric, a low probability of wrong classification (i.e., FPR below 0.4) without compromising the probability of correctly classifying the similarity event (i.e., TPR), which remains above 0.75. In the last column of Table 2, we provide the selected w_th for each similarity metric, which will be used in the following.
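The threshold selection can be emulated by sweeping candidate thresholds and measuring TPR/FPR against the 75%-overlap ground truth. A sketch under the distance-style condition used above (metric value ≤ threshold predicts "similar"); the function names are ours:

```python
def tpr_fpr(pairs, w_th):
    """pairs: (metric_value, is_similar) per user pair, where is_similar is the
    ground-truth label (viewport overlap >= 0.75)."""
    tp = sum(1 for m, s in pairs if m <= w_th and s)
    fp = sum(1 for m, s in pairs if m <= w_th and not s)
    fn = sum(1 for m, s in pairs if m > w_th and s)
    tn = sum(1 for m, s in pairs if m > w_th and not s)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr

def pick_threshold(pairs, candidates, target_tpr=0.75):
    """Smallest candidate threshold whose TPR reaches the target."""
    for w_th in sorted(candidates):
        if tpr_fpr(pairs, w_th)[0] >= target_tpr:
            return w_th
    return None
```

Sweeping `candidates` over the observed range of metric values traces the ROC curve from which the operating point is chosen.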

Performance Evaluation Setup
To validate our proposed similarity metrics, we consider three performance metrics: the averaged overlap ratio per cluster, the relevant clustered population, and the precision. The first two are specific to navigation trajectories in a VR system, while the latter is a popular index used to evaluate clustering algorithm performance. Overlap ratio per cluster: as defined in Section 3.2, the overlap ratio computes the portion of displayed content in common between two users. Therefore, to compare the performance of the clusters detected with the different similarity metrics, we average the overlap ratio among all users who are put in the same group. More formally, given a detected cluster C_k, it is defined as the average of O(u, z) over all pairs of users u and z belonging to C_k, where |C_k| is the cardinality of C_k and O(u, z) is the overlap ratio as in Equation (1).
Relevant clustered population: the more users are clustered together with high viewport overlap, the more meaningful are our clusters. Thus, we consider as relevant clustered population the sum of the users that have been put in clusters with more than 2 elements. Precision: in a classification task, this index evaluates the portion of elements that are classified correctly, and has values between 0 and 1 [8]. More formally:

Precision = TP / (TP + FP),

where True Positive (TP) (respectively, False Positive (FP)) is the number of viewers classified correctly (incorrectly) together in a cluster. In our case, two users are identified as a positive if they are in the same cluster and their viewport overlap is actually over the desired threshold.
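The three evaluation metrics can be computed directly from the detected clusters and the pairwise ground-truth overlap; a minimal sketch (the function names are ours):

```python
from itertools import combinations

def avg_overlap_per_cluster(cluster, overlap):
    """Mean pairwise overlap ratio O(u, z) among the users of one cluster."""
    pairs = list(combinations(sorted(cluster), 2))
    return sum(overlap(u, z) for u, z in pairs) / len(pairs) if pairs else 0.0

def relevant_population(clusters):
    """Number of users placed in clusters with more than 2 elements."""
    return sum(len(c) for c in clusters if len(c) > 2)

def precision(clusters, overlap, th=0.75):
    """TP / (TP + FP) over all user pairs grouped together."""
    tp = fp = 0
    for c in clusters:
        for u, z in combinations(sorted(c), 2):
            if overlap(u, z) >= th:
                tp += 1
            else:
                fp += 1
    return tp / (tp + fp) if tp + fp else 0.0
```

Here `overlap(u, z)` is the ground-truth overlap ratio of Equation (1), so precision counts how many of the grouped pairs truly share at least the desired viewport overlap.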

Ablation Study
We finally present an ablation study to tune the set of regulator parameters that maximises the performance of each similarity metric. Equipped with the threshold values given in Table 2, we run a frame-based clustering to select the best regulators α, β and σ. We test their performance in terms of the metrics described above over the following range of values [0, 0.05, 0.1, 0.125, 0.2, 0.25, 0.5, 1, 2], based on the navigation trajectories of the entire dataset. Finally, we average over time and across content the performance of the clusters obtained by all the similarity metrics. Single-feature metrics. For the single-feature metrics (w_1 − w_4), we notice a very small variance in terms of performance. Thus, we selected σ = 1 for this set. Multi-feature metrics. The selection of parameters is instead more challenging for the multi-feature metrics (w_5 − w_8). Each similarity metric depends on three parameters: α, β and σ. To overcome this, we first select three sets of parameters taking into account only the navigation trajectories for the reference content: one group of parameters (set 1) based on the maximum overlap ratio, the second (set 2) on the maximum relevant clustered population, and the last group (set 3) as the one reaching the highest precision. As an example, Figure 3 shows the selection of these three sets of parameters for the metric w_7. Then, we test these on all the available trajectories included in the analysed dataset to finally select the best set of parameters. Table 3 provides the performance of the multi-feature similarity metrics obtained with the three selected sets of parameters. Since no particular configuration outperforms the others in terms of overlap ratio, relevant population and precision, we decided to select set 3.
This configuration, besides ensuring the highest value of precision, also guarantees acceptable values of overlap ratio and relevant population for all the similarity metrics. For example, for S7, selecting the values of set 3 means that users are correctly clustered almost 50% of the time (precision equal to 0.49); at the same time, 77% of the population is placed in clusters with more than 2 users (relevant population equal to 0.77), and on average the viewport overlap between users in the same cluster is consistent (overlap ratio equal to 65%). These values are close to the highest values achieved by S7 for the relevant population and the overlap ratio, which are 0.87 and 0.66, respectively. Table 2 summarises the values used in the following.

RESULTS
Equipped with the similarity metrics and the corresponding regulator and threshold values in Table 2, we now conduct our validation study, focusing on navigation trajectories experienced with non-distorted content.

Frame-Based Analysis
As a first step, we implement a frame-based analysis (i.e., frame-based clustering) to visually compare the clusters detected by the different similarity metrics. Figure 4 compares the clusters detected using the ground-truth overlap-ratio metric to construct the graph (Figure 4 (a)) with the ones given by each proposed similarity metric (Figure 4 (b-i)), for frame 50 of sequence PC1. In particular, each user is represented by a point on the VR floor, coloured based on the assigned cluster ID, whereas the volumetric content is symbolised by a blue star. For each relevant cluster (i.e., cluster with more than 2 users), the legend reports the number of users inside the cluster and the average and variance of the overlap ratio among all users within the cluster. Finally, we represent the remaining users, who are in either single- or two-user clusters, as black points; the total number of these users is also provided in the legend as "Small clusters (total number of non-relevant clusters)". Figure 4 (a) shows the clusters that we consider as our ground truth, since they are evaluated using the overlap ratio as the similarity metric. In this case, 5 main clusters are detected, with an average overlap ratio per cluster above 0.82. In particular, cluster ID 1 has the highest number of users (8) while retaining a high overlap ratio (0.84). Only 4 users are put in single clusters in this case. The goal is to find a similarity metric that can detect similar results. We notice that single-feature metrics, Figure 4 (b-e), have a tendency to create very populated clusters with a low overlap ratio. For instance, S3 and S4 generate one big main cluster with 18 and 19 users, respectively, while the corresponding overlap ratio drops drastically to 0.62. The only exception is S1, which generates a variable set of clusters with consistent overlap ratios, above 0.64.
Let us now consider as an example users 13, 15 and 17, who in the ground-truth case (Figure 4 (a)) form their own cluster (i.e., ID 5) with a high overlap ratio (0.83), and user 24, who is quite isolated from the other users and belongs to a single cluster. We notice that S2 and S4 fail to detect users 13, 15 and 17 as similar, dividing them instead into different clusters. On the other hand, S3 detects this similarity but puts user 24 in a relevant cluster (ID 1). From these observations, we conclude that the viewport centre on the volumetric content, on which S3 and S4 are based, is not sufficient to correctly identify similar users. Analogously, considering only the difference in relative distance between the user and the volumetric content, as done in S2, does not allow the detection of similarity among users. Thus, the most promising metric in this group is S1, which is based on the user position on the virtual floor. The last group, Figure 4 (f-i), shows clusters based on multi-feature similarity metrics. In all these settings, a total of four main clusters are detected, except for S6, which leads to three clusters, as shown in Figure 4 (g). The latter detects the highest number of small clusters (6) and is the only one that does not identify users 13, 15 and 17 within the same cluster. On the contrary, the other three metrics, S5, S7 and S8, detect a main cluster and three smaller clusters with a consistent overlap ratio. For instance, the clusters based on S5 always have an overlap ratio above 0.69, and only two users fall into small clusters. Overall, multi-feature metrics appear better suited to detect similar users than single-feature ones, with the exception of S6.
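The exact metric definitions live in Tables 1 and 2; purely as an illustration of how position and viewing-direction features can be combined, here is a hedged sketch. The exponential form and the product combination are our assumptions for the example, not the paper's formulas:

```python
# Hypothetical similarity metrics built from distance features:
# pos_dist  - distance between the two users on the virtual floor,
# dir_dist  - angular distance between their viewing directions,
# dist_diff - difference in their relative distance to the content.
import math

def single_feature_similarity(dist, regulator):
    """Map one distance feature to a similarity in (0, 1]; larger distances
    and larger regulators give lower similarity (assumed exponential form)."""
    return math.exp(-regulator * dist)

def multi_feature_similarity(pos_dist, dir_dist, dist_diff, regs):
    """Combine three features; each regulator weights one feature
    (assumed combination: product of per-feature exponential terms)."""
    a, b, c = regs
    return (math.exp(-a * pos_dist)
            * math.exp(-b * dir_dist)
            * math.exp(-c * dist_diff))

# Two users close on the floor and looking the same way score as similar.
print(multi_feature_similarity(0.2, 0.1, 0.05, (1, 1, 1)))  # exp(-0.35) ~ 0.7047
```

Setting a regulator to 0 switches its feature off, which is how a multi-feature metric degenerates into a single-feature one.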
In Table 4, we extend our per-frame clustering analysis to the entire dataset: we show the average and standard deviation of the performance metrics described in Section 5.2 for each of our proposed metrics. Clusters based on S2 group the majority of the population into relevant clusters for all the analysed PCs (reaching a maximum value of 0.94 in PC1), to the detriment of precision, which falls to values between 0.22 and 0.35. As in the previous investigation, the most promising similarity metrics in terms of precision and overlap ratio are S7 and S8, followed by S5. These outperform the other metrics in all PCs, ensuring an overlap ratio within the same cluster in the range of 0.59 to 0.70 for S7 and 0.60 to 0.72 for S8. Similarly, the precision is always above 0.42 for both S7 and S8. The only exception is PC1, where the best-performing metric in terms of precision is S6, which for the other contents is always the worst-performing metric.

Trajectory-Based analysis
Given the above remarks, we now analyse the performance metrics over time, taking into account only S1, S5, S7 and S8. Specifically, we select the best-performing similarity metrics of the previous investigation (S5, S7 and S8); for a fair comparison, we also keep the most promising single-feature metric, S1. We compute clique-based clusters over a time window of 1 (i.e., a chunk) and a time similarity threshold of 0.8. At each chunk, we evaluate the average overlap ratio per relevant cluster, the average relevant population and the precision of the detected clusters. As an example, Figure 5 shows the performance results for sequence PC1 (Longdress) as functions of time for each similarity metric. In Figure 5, we also add the performance of the clusters detected by the ground-truth overlap-ratio metric (i.e., red line); the goal is indeed to find a metric able to perform similarly to this ground truth over time. All the similarity metrics reach an average overlap ratio within clusters between 0.6 and 0.75 (Figure 5 (a)). However, clusters based on S1 have lower performance, while the other metrics perform quite similarly, with a slight predominance of S7. In terms of relevant users (Figure 5 (b)), it is worth noting that all the proposed similarity metrics generate bigger clusters than the ground-truth metric, which considers only half of the population as relevant. In more detail, the clusters resulting from S1, S5 and S8 place 0.8 of the entire population in relevant clusters for the whole sequence. Finally, in terms of precision, as highlighted in Figure 5 (c), the only similarity metric that generates clusters with precision above 0.4 over the entire sequence is S7. These investigations show that multi-feature similarity metrics, such as S7 and S8, are more promising for detecting users with similar behaviour while experiencing volumetric content. In summary, from this validation analysis, we can conclude the following:
• Overall, multi-feature metrics are more precise in detecting users with similar behaviour (in terms of displayed content) in both a frame- and a chunk-based analysis;
• In particular, in spite of their slightly more complex formulation, S7 and S8 are easy-to-use metrics that ensure a robust and reliable behavioural analysis via clustering tools;
• On the contrary, metrics based only on a single feature are not sufficient to correctly identify similar users;
• The only exception among single-feature metrics is S1, which is based only on the position of the user on the floor. Despite its simplicity, this metric is comparable with the multi-feature ones; hence, it can be used for an easy-to-implement preliminary behavioural analysis.
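The chunk-based step can be sketched as follows, with two simplifying assumptions flagged up front: two users are linked if their per-frame similarity exceeds the metric's threshold in at least 80% of the frames of the chunk, and clusters are then carved out greedily as cliques of the resulting graph (a simplification of the clique-based clustering tool the paper borrows from the 3-DoF literature):

```python
# Sketch of chunk-based, clique-style clustering (simplified; not the
# authors' implementation).

def chunk_graph(sim_frames, threshold, time_threshold=0.8):
    """sim_frames: list of {(u, v): similarity} dicts, one per frame.
    An edge links u and v if they are similar in >= time_threshold of frames."""
    users = sorted({u for f in sim_frames for p in f for u in p})
    edges = set()
    for u, v in ((u, v) for i, u in enumerate(users) for v in users[i + 1:]):
        hits = sum(f.get((u, v), f.get((v, u), 0)) >= threshold for f in sim_frames)
        if hits / len(sim_frames) >= time_threshold:
            edges.add((u, v))
    return users, edges

def greedy_cliques(users, edges):
    """Repeatedly grow a clique around the best-connected remaining user."""
    adj = {u: {v for a, b in edges for v in (a, b) if u in (a, b) and v != u}
           for u in users}
    unassigned, clusters = set(users), []
    while unassigned:
        seed = max(unassigned, key=lambda u: len(adj[u] & unassigned))
        clique = {seed}
        for u in sorted(unassigned - {seed}):
            if clique <= adj[u]:  # u must be linked to every current member
                clique.add(u)
        clusters.append(clique)
        unassigned -= clique
    return clusters

# Users 0-2 stay similar in all 3 frames; the 2-3 link holds in only 1 of 3.
frames = [{(0, 1): 0.9, (0, 2): 0.9, (1, 2): 0.9, (2, 3): 0.9},
          {(0, 1): 0.9, (0, 2): 0.9, (1, 2): 0.9, (2, 3): 0.1},
          {(0, 1): 0.9, (0, 2): 0.9, (1, 2): 0.9, (2, 3): 0.1}]
print(greedy_cliques(*chunk_graph(frames, 0.5)))  # [{0, 1, 2}, {3}]
```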
However, it is important to point out that these observations are currently only valid for similar volumetric contents (i.e., human bodies). We leave further analysis across multiple datasets and types of content for future work.

DISCUSSION AND CONCLUSION
In this paper, we have summarised the main challenges of user behavioural analysis in a 6-DoF system, arising from the new settings and the added locomotion functionalities. Behavioural analysis of 6-DoF users has not yet been considered in the literature; as such, there is no reference metric available to detect viewers who are displaying the same portion of the content. Thus, we considered a general ground-truth user similarity metric, the overlap ratio: the percentage of points displayed in common by two users. This is fairly straightforward, albeit time-consuming, to compute for point cloud contents, in which each point is rendered separately. For other types of volumetric content, determining the overlap ratio is not as simple: counting the vertices that fall into a given frustum can be misleading when large faces span sparsely distributed vertices. Moreover, the metric requires rendering each volumetric video at every time instant and for each viewer, making its computation non-trivial and intensely time-consuming. To overcome this issue, and to assess user similarity in a simple and objective way, we formulated and investigated several similarity metrics based on different distance features and measurements. We were interested in modelling similarities among users observing the same volumetric content. In detail, we investigated different features, and combinations of them, which consider the users' location in the virtual space and their viewing direction. We validated and tested our similarity metrics via a clique-based clustering tool proposed for the 3-DoF scenario, on real navigation trajectories collected in a 6-DoF VR environment. In this article we have therefore advanced the state of the art, proposing novel similarity metrics that take into account the new physical settings and locomotion functionalities given to users. Our results showed that solutions that consider both user position and viewing direction are promising for correctly detecting users with similar behaviour while experiencing volumetric content. Moreover, since these metrics are based on simple operations on data that are typically already known in a multimedia system (i.e., user position in the virtual space and viewing direction), they can be evaluated on average in a hundredth of a second. This makes our proposed metrics suitable for real-time applications. In future work, we will further test the robustness and versatility of these metrics on 6-DoF navigation trajectories collected in different virtual scenarios, for example in Augmented Reality (AR) applications [47].
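To make the ground-truth computation concrete, here is a heavily simplified sketch of the overlap ratio for a point cloud. It assumes a point is "displayed" when it falls inside a viewing cone, ignores occlusions and the full frustum geometry, and normalises by the union of the two visible sets; all of these are our modelling choices for illustration, not the paper's renderer:

```python
# Sketch of a ground-truth overlap ratio for point clouds (simplified
# visibility: a point is visible if it lies inside a cone of half-angle
# `half_angle` around the camera's viewing direction).
import math

def visible(points, cam_pos, cam_dir, half_angle):
    """Return the indices of points inside the viewing cone."""
    norm = math.sqrt(sum(c * c for c in cam_dir))
    d = tuple(c / norm for c in cam_dir)
    out = set()
    for i, p in enumerate(points):
        v = tuple(p[k] - cam_pos[k] for k in range(3))
        vn = math.sqrt(sum(c * c for c in v))
        if vn == 0:
            continue  # point coincides with the camera
        cos = sum(v[k] * d[k] for k in range(3)) / vn
        if cos >= math.cos(half_angle):
            out.add(i)
    return out

def overlap_ratio(points, cam_a, cam_b, half_angle=math.radians(45)):
    """Share of points displayed in common by two users (IoU normalisation)."""
    va = visible(points, *cam_a, half_angle)
    vb = visible(points, *cam_b, half_angle)
    union = va | vb
    return len(va & vb) / len(union) if union else 0.0
```

Even in this toy form, the cost is one visibility test per point, per viewer, per frame, which illustrates why the paper's cheaper position- and direction-based metrics are attractive for real-time use.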

Figure 1 :
Figure 1: Example of two 6-DoF trajectories projected onto a 2D domain for two users. On the right side, a snapshot at a given time instant: the coloured triangles represent each user's viewing frustum.

Figure 2 :
Figure 2: Human Body Point Cloud [16] contents used in the collection of the publicly available dataset presented in [39].

Figure 3 :
Figure 3: Example of parameter selection for S7 with one regulator fixed at 0.5. Set 1 is selected based on the maximum overlap, set 2 on the maximum number of clustered users, and set 3 on the precision.

Figure 4 :
Figure 4: Cluster results for frame 50 of sequence PC1 (Longdress). Each dot represents a user on the virtual floor, while the blue star stands for the volumetric content. For each cluster with more than 2 users, the legend reports in brackets: the number of users in the cluster, the averaged pairwise viewport overlap and the corresponding variance within the cluster.
Figure 5: (a) Mean overlap ratio in relevant clusters. (b) Mean relevant users. (c) Precision.

Table 1 :
Definition of distance features and measurements.

Table 2 :
Similarity metrics: definitions, included distance features and measurements, regulator and threshold values.
Table 3: (a) Overlap ratio. (b) Relevant population. (c) Precision.