4DSR-GCN: 4D Video Point Cloud Upsampling using Graph Convolutional Networks

Time-varying sequences of 3D point clouds, or 4D point clouds, are now being acquired at an increasing pace in several applications (e.g., LiDAR in autonomous or assisted driving). In many cases, such volumes of data are transmitted, thus requiring proper compression tools to reduce either the resolution or the bandwidth. In this paper, we propose a new solution for upscaling and restoration of time-varying 3D video point clouds after they have been heavily compressed. Our model consists of a specifically designed Graph Convolutional Network (GCN) that combines Dynamic Edge Convolution and Graph Attention Networks for feature aggregation in a Generative Adversarial setting. Taking inspiration from PointNet++, we present a different way to sample dense point clouds, with the intent of making these modules work in synergy and providing each node with enough features about its neighbourhood to later generate new vertices. Compared to other solutions in the literature that address the same task, our proposed model obtains comparable reconstruction quality while using a substantially lower number of parameters (about 300KB), making our solution deployable in edge computing devices such as LiDAR sensors.


INTRODUCTION
In light of emerging applications such as Augmented and Virtual Reality (AR/VR), there is a rising interest in capturing the real world in 3D at high resolution. For real-time applications in dynamic settings, such as 3D sensing for robotics, telepresence, or automated driving using LiDAR, this technology might need high-resolution point clouds with up to millions of points per frame. Considering an average point cloud video, under constraints such as keeping the identity of a human subject recognizable, we observe that the size of a single instance (a single frame) can be approximated as ∼10 Mbytes, which translates to a bitrate of ∼300 Mbytes per second without compression for a 30-fps dynamic point cloud. This high data rate is one of the main problems faced by dynamic point clouds, and efficient compression technologies that allow the distribution of such content are still widely sought. One result in this direction is represented by the Point Cloud Compression standard specifications, which include video-based PCC (V-PCC) and geometry-based PCC (G-PCC) [9], released in 2020 by the Moving Picture Experts Group (MPEG).
Given these premises, our task is to perform upscaling and artifact removal of sparsely populated 3D point cloud videos. The terms upscaling and artifact removal usually appear in the image/video super-resolution literature, so they might not have an immediate translation to the 3D context. We will use the term upscale to indicate the operation by which the total number of vertices of an input point cloud is increased; by artifact removal, instead, we will mean the correct reconstruction process after some form of compression or subsampling has been applied to an input point cloud. As shown in Figure 2, a high compression rate can achieve acceptable bandwidth requirements at the cost of a large decrease in fidelity. For some applications, such a low resolution is not acceptable, for example when the user experience matters and the identity of the subject must be maintained, or in autonomous driving.
Recent approaches that tackled this task, such as [39], employed a strategy that uses long sequences of input frames and a large encoder-decoder model. As we will detail below, we followed a different approach.
In this paper, we pose the upscaling problem in a Generative Adversarial setting using two architectural modules from the literature: Edge Convolution [43] and the Graph Attention Network (GAT) [38]. In particular, the input point clouds are modeled as graphs and processed by a Graph Convolutional Network (GCN). The convolution operation is performed using an EdgeConv module: this module incorporates local neighborhood information, can be stacked to learn global shape properties, and its affinity in feature space captures semantic characteristics over potentially long distances in the original embedding. While this module was originally used for CNN-based high-level tasks on point clouds, such as classification and segmentation, the GAT is used for feature aggregation, performing a learned attention-weighted mean of the neighbourhood features instead of simply averaging them. Experiments have been performed on the FAUST 4D dataset [2], also in comparison with state-of-the-art solutions. Overall, our method produced upscaled reconstructions that are comparable with those reported in the literature, while using a lower number of input frames and an architecture with a much lower number of parameters. This opens the way to the deployment of our architecture on edge computing devices.
In summary, the main contributions of our work are the following:
• We propose a new architecture for time-varying point cloud upscaling that combines a PointNet [32], used as the Discriminator of a GAN, with a Generator that applies Edge Convolution to the graphs derived from the input point clouds and a graph attention mechanism to aggregate the features of the local neighbourhood. The resulting GAN architecture represents a setting that, to our knowledge, has not been tried before for this task;
• The proposed solution demonstrates a clear advantage over existing methods in its capability of producing upscaled 3D point clouds with comparable accuracy while using a far lower number of parameters. Finally, the inference time is compatible with an online application of the method handling a stream of input frames.

RELATED WORK
Numerous studies have been conducted with the goal of reconstructing a 3D model given inputs in various possible forms: a mesh, a 3D point cloud, a collection of voxels, or an implicit function. Some of these works focused on the use of a 3D point cloud as input [5,36]. Others, instead, used a discretized version based on voxels, such as [7,41], or directly tried to reconstruct a mesh [21,40].
Point cloud upsampling was first approached using optimization-based solutions, while deep learning based methods were applied only more recently. Methods from both categories are summarized below.
Optimization-based methods. One of the first works addressing point set upsampling was proposed by Alexa et al. [1]. In their approach, points at the vertices of a Voronoi diagram were interpolated in the local tangent space. Lipman et al. [23] presented a Locally Optimal Projection (LOP) operator performing point resampling and surface reconstruction using the L1-median. The LOP operator showed satisfactory results even when the input point set was affected by noise and outliers. An improved version of the LOP approach, aiming to address the density problem of the upscaled point set, was then proposed by Huang et al. [13]. Overall, these works demonstrated good results, though their applicability was limited by the smoothness assumption on the underlying surface, which is rarely matched by data acquired with real scanners. To overcome this limitation, in [14] Huang et al. proposed an edge-aware point set resampling solution that first resamples away from edges, then progressively approaches edges and corners. One limitation of this method is that the quality of the results depends on the accuracy of the normals at the points, and on a careful tuning of the parameters. A point representation method based on volumetric voxelization was introduced by Wu et al. [45]. As a preliminary operation, they proposed to fuse consolidation and completion into one coherent step. However, the goal of this operation was filling large holes, so that global smoothness is not enforced, making the method sensitive to large noise. All these methods are not driven by the data; rather, they strongly rely on priors.
Deep-learning based methods. Only recently have methods adopted deep architectures to directly learn from point sets. This delay was mainly due to the inherent difficulty of such data, where points are unordered and do not follow any regular grid structure in their spatial arrangement. To circumvent this difficulty, some methods converted point clouds to other 3D representations, based on graphs [3,25] or volumetric grids [4,26,35,45]. PointNet [31] and PointNet++ [32] were the first successful attempts to directly process point clouds for classification and segmentation purposes, using a hierarchical feature learning architecture that captures both local and global geometry contexts. Other networks proposed for high-level analysis of point clouds, focusing on global or mid-level attributes, include [12,17,20,30,42]. Local shape properties, like normals and curvature, were estimated by the network proposed in [11]. Interesting network architectures were also proposed for 3D reconstruction from 2D images [5,10,22]. For example, Fan et al. [5] addressed the problem of 3D reconstruction from a single image, generating a straightforward form of output: point cloud coordinates. The 4D extension of the resulting Point Set Generation Network (PSGN-4D) was used in several studies as a baseline for comparison.
One of the first works aiming to perform point cloud upsampling was proposed by Yu et al. [48]. They introduced PU-Net, which learns per-point features at multiple scales and expands the set of points using a Multi-Layer Perceptron (MLP) with multiple branches. However, to learn multi-level features the input point sets are downsampled, thus potentially causing a loss of resolution. In [47], the same authors proposed an edge-aware network for point set consolidation (EC-Net) that uses a specific loss to encourage consolidating points toward edges. On the negative side, a very expensive edge annotation is needed for training EC-Net. In the work of Yifan et al. [46], a progressive network (3PU) was proposed that duplicates the input point patches over multiple steps. The progressive architecture of 3PU makes its training computationally expensive, and more data are required to supervise the middle-stage outputs of the network. A Generative Adversarial Network designed to learn upsampled point distributions (PU-GAN) was proposed by Li et al. [19], with the main performance improvement obtained through the discriminator. Qian et al. [34] proposed to upsample points by learning the first and second fundamental forms of the local geometry. However, their PUGeo-Net needs additional supervision in the form of normals. The PU-GCN proposed by Qian et al. [33] performed upsampling by leveraging an Inception-based module to extract multi-scale information, and a GCN-based upsampling module to capture local point information. This has the main advantage of not needing additional annotations, like edges, normals, or point clouds at intermediate resolutions, while also avoiding the use of a sophisticated discriminator.
Recently, more and more works have shifted attention towards 4D reconstruction, where a sequence of 3D objects is reconstructed from time-varying point clouds given as inputs [18,28].
In the Occupancy Network (ONet) proposed by Mescheder et al. [27], a 3D object was described using a continuous indicator function that indicates which subsets of the 3D space the object occupies, with an iso-surface retrieved by the Marching Cubes algorithm. Tang et al. [37] learned a temporal evolution of the 3D human shape through spatially continuous transformation functions among cross-frame occupancy fields. To this end, they established, in parallel, dense correspondence between predicted occupancy fields at different time steps by explicitly learning continuous displacement vector fields from spatio-temporal shape representations. Niemeyer et al. [29] introduced a learning-based framework for object reconstruction directly from 4D data without predefined templates. Their OFlow method calculates the integral of a motion field of 3D points, specified in space and time, to implicitly represent the trajectories of all the points in dense correspondence between occupancy fields. Vu et al. [39] proposed a network architecture, called RFNet-4D, that jointly reconstructs objects and their motion flows from 4D point clouds. They showed that jointly learning spatial and temporal features from a sequence of point clouds can leverage the individual tasks, leading to improved overall performance. To this end, they designed a temporal vector field learning module that uses an unsupervised learning approach for flow estimation, which in turn is leveraged by supervised learning of spatial structures for object reconstruction. Jiang et al. [15] introduced a compositional representation that disentangles shape, initial state, and motion for a 3D object deforming over a temporal interval. Each component is represented by a latent code via a trained encoder. A neural Ordinary Differential Equation (ODE) is used to model the motion: it is trained to update the initial state conditioned on the learned motion code, while a decoder takes the shape code and the updated state code to reconstruct the 3D model at each time stamp. An Identity Exchange Training (IET) strategy is also proposed to encourage the network to learn to decouple each component.
With respect to the above solutions, our approach is characterized by a specific design that combines two Graph Convolutional Networks (GCNs) working in an adversarial setting (GAN). The resulting architecture proved to be flexible in the number of input frames, and combines effective reconstructions with inference times that are compatible with online execution.

PROBLEM STATEMENT
We consider a sequence of point clouds in the 3D space. Each point cloud can be regarded as a frame of a 4D video at time t. In the following, we consider T point cloud frames fused together, forming a time-varying point cloud as an unordered list of (x, y, z, t) points. Our task is to upscale (a term borrowed from the 2D image super-resolution domain) each point cloud (frame) P_t of the input sequence and get a more detailed one by leveraging the information of the previous T − 1 low-resolution point cloud frames (i.e., P_{t−1}, . . ., P_{t−T+1}).
More in detail, given a buffer composed of T previous frames, the input point cloud X_t is defined as:

X_t = {P_t, P_{t−1}, . . ., P_{t−T+1}}    (1)

where each low-resolution point cloud P_i is composed of N points:

P_i = {p_1, . . ., p_N}    (2)

We are interested in learning a map f : X_t → Y_t, where Y_t is the target point cloud, representing the zeroth frame upscaled to have

M = N × T × s    (3)

points, with s being the scale factor.
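For concreteness, the sizes involved can be sketched in a few lines of NumPy. Here T is the number of buffered frames, N the points per frame, and s the scale factor; the concrete values (T = 3, N = 1024, s = 4) are illustrative, matching the configuration used later in the experiments:

```python
import numpy as np

# Illustrative sizes: T frames in the buffer, N points per frame, scale factor s.
T, N, s = 3, 1024, 4

# Each frame is an (N, 4) array of (x, y, z, t) points; the buffer fuses them
# into a single unordered list of N * T points.
frames = [np.random.rand(N, 4) for _ in range(T)]
fused_input = np.concatenate(frames, axis=0)   # shape: (N * T, 4)

# The target is the zeroth frame upscaled to M = N * T * s points.
M = N * T * s
print(fused_input.shape)  # (3072, 4)
print(M)                  # 12288
```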

PROPOSED METHOD
Our proposed method makes use of message-passing Graph Networks, different neighbourhood sampling techniques, and Generative Adversarial training. More in detail, our architecture has been developed starting from [32]. It works on unordered lists of (x, y, z, t) points, representing the last T frames fused together, using two Graph Convolutional Neural Networks (GCNs from here on) in an adversarial setting. The discriminator is based on [32], while the generator improves on the architecture proposed in [32]. In particular, we used different neighbour sampling techniques, developed with the intent of collecting, for each point, features simultaneously from its immediate neighborhood and from farther vertices of the whole point cloud, without making the computation too expensive.
The fully convolutional nature of our generator network allows us to potentially train and test at different input and output resolutions.

Edge Convolution and GAT
The basic module composing our generator network combines Edge Convolution [43] and Graph Attention Networks (GAT) [38]. Edge Convolution allows us to perform message passing over a dynamic graph in which the edges are updated as the point cloud changes. The GAT side is used to perform an attentional aggregation over the features collected from the dynamic local neighbourhood, in contrast with more common aggregation choices such as max or mean. We refer to this combined module as Edge Convolution with Attention.
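To illustrate the difference between a plain mean and the attention-weighted aggregation described above, the following minimal NumPy sketch scores each (center, neighbor) pair GAT-style. All weights here are random and purely illustrative; in the actual model, w and a are learned parameters inside the graph convolution:

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_aggregate(center, neighbors, w, a):
    """GAT-style aggregation: score each (center, neighbor) pair with a
    vector a over the projected features, softmax the scores over the
    neighborhood, and return the attention-weighted mean of the
    projected neighbor features."""
    h_c = center @ w                      # projected center features
    h_n = neighbors @ w                   # projected neighbor features
    pairs = np.concatenate([np.repeat(h_c[None, :], len(h_n), axis=0), h_n], axis=1)
    scores = pairs @ a
    scores = np.maximum(0.2 * scores, scores)    # LeakyReLU, as in GAT
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                  # softmax over the neighborhood
    return (alpha[:, None] * h_n).sum(axis=0)

# Toy neighborhood: 5 neighbors with 8-dim features, projected to 16 dims.
center = rng.standard_normal(8)
neigh = rng.standard_normal((5, 8))
w = rng.standard_normal((8, 16))
a = rng.standard_normal(32)

out = attention_aggregate(center, neigh, w, a)
mean_out = (neigh @ w).mean(axis=0)       # the plain-mean baseline
print(out.shape, mean_out.shape)          # (16,) (16,)
```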

Parallel Double Sampling (PDS) module
The core of the generator side of the architecture is the Parallel Double Sampling (PDS) module, which performs two different graph convolutions using two different sets of sampled points. A simplified illustration of this module is presented in Figure 3. For each point, two sets of operations are performed in parallel. The first set is a pipeline composed of:
• Radius filtering: for each vertex, a filtering step keeps as neighbors, with the capability of passing messages, only those vertices that belong to a sphere of a given radius centered on the vertex.
• Furthest Point Subsampling: we use the Furthest Point Sampling (FPS) algorithm in [32] to temporarily sample a fraction of the original points that are the farthest away, inside the radius, from a starting point.
• Convolution: graph convolution is applied over the remaining vertices, independently of their number, and their features are aggregated.
The second set of operations, performed in parallel to the first one, is composed of:
• K-NN: a fixed number k of closest vertices is selected as neighbors.
• Convolution: graph convolution is applied over these vertices, and their features are aggregated.
Finally, the two sets of features are concatenated and fed to a linear layer that maps the 2 × h concatenated features back to h features.
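The two sampling branches can be sketched as follows. This is a minimal NumPy illustration of radius filtering, greedy Furthest Point Sampling, and k-NN selection for a single query vertex; the radius and neighbor counts are illustrative, not the tuned hyperparameters of the model:

```python
import numpy as np

rng = np.random.default_rng(1)

def radius_filter(points, center, r):
    """Keep only the points inside a sphere of radius r around center."""
    d = np.linalg.norm(points - center, axis=1)
    return points[d <= r]

def farthest_point_sampling(points, k):
    """Greedy FPS: iteratively pick the point farthest from those chosen."""
    chosen = [0]
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(k - 1):
        nxt = int(dist.argmax())
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return points[chosen]

def knn(points, center, k):
    """Plain k-nearest-neighbor selection around center."""
    d = np.linalg.norm(points - center, axis=1)
    return points[np.argsort(d)[:k]]

cloud = rng.random((256, 3))
center = cloud[0]

# Branch 1: radius filtering followed by FPS on the surviving points.
inside = radius_filter(cloud, center, r=0.3)
branch1 = farthest_point_sampling(inside, k=min(8, len(inside)))

# Branch 2: fixed-size kNN neighborhood.
branch2 = knn(cloud, center, k=9)
print(branch1.shape, branch2.shape)
```

In the actual module, a graph convolution is then run over each branch and the two aggregated feature vectors are concatenated.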

Our Architecture
The developed architecture is composed of two Graph Convolutional Networks (GCNs) working in an adversarial setting (GAN) [8]. It is illustrated in Figure 4. Basically, the point cloud given as input is processed as a graph using message-passing based convolution.

Discriminator
The discriminator is inspired by the PointNet++ architecture [32], since it also targets a classification task. We use the same structure that progressively reduces the number of points using max-pooling operations, followed by a sequence of linear layers before the output, as shown in the bottom part of Figure 4.

Generator
The generator side of the model is instead built as an initial sequence of Edge Convolution with Attention modules followed by our Parallel Double Sampling (PDS) module. It is also inspired by the PointNet++ architecture [32], but with major changes as detailed in Section 4.2. The top part of Figure 4 presents a simplified visualization of the PDS generator. The generator is composed of multiple Graph Convolutions with Attention followed by a single PDS. The intuition behind this choice is to collect varied features for each node using different neighborhood sampling techniques. Once the original node has been enriched with local features, the PDS uses them to generate multiple new vertices according to the scale factor. Finally, each new vertex position is summed with that of the closest vertex that originated it, in a residual fashion (see Figure 1).
The generator loss L_G is composed of an adversarial component L_adv coming from the Discriminator, a full-reference reconstruction loss L_rec computed as the Chamfer Distance between the restored point cloud and the original one, and an additional Density Loss L_den. We used the LSGAN loss from [24] for our training, which assumes the form:

L_adv = (1/2) E[(D(G(X)) − 1)²]

for the generator, and:

L_D = (1/2) E[(D(Y) − 1)²] + (1/2) E[D(G(X))²]

for the discriminator.

Loss functions
The model is trained end-to-end using multiple losses. Beside the adversarial component L_adv, we also compute the point-to-set distance (Chamfer distance) L_rec between the reconstructed point cloud and the target one and, similarly to [44], we take into account the neighbourhood of each point. That is, for each reconstructed point we find the closest point in the target point cloud, and compute both the distance between them and the difference in terms of local neighbors. We define the neighbourhood density d(v) of a vertex v as the normalized count of its neighbours within a given radius; the Density Loss L_den penalizes the difference in density between matched points. The final generator loss is therefore given by:

L_G = λ1 L_adv + λ2 L_rec + λ3 L_den

where the values for λ have been empirically determined (λ1 = 1.0, λ2 = 0.5, λ3 = 0.1).
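For reference, the two non-adversarial terms can be sketched in NumPy as follows. This is a minimal illustration of a symmetric Chamfer distance and of a neighborhood-density difference of the kind described above; the exact normalization, radius, and weighting used in training may differ:

```python
import numpy as np

def chamfer_distance(p, q):
    """Symmetric point-to-set distance: for each point, the distance to its
    nearest neighbor in the other set, averaged over both directions."""
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=2)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def density(points, radius):
    """Per-vertex neighborhood density: normalized count of neighbors
    falling inside a ball of the given radius."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    counts = (d <= radius).sum(axis=1) - 1     # exclude the point itself
    return counts / max(len(points) - 1, 1)

rng = np.random.default_rng(2)
rec = rng.random((128, 3))                      # "reconstructed" cloud
tgt = rec + 0.01 * rng.standard_normal((128, 3))  # slightly perturbed target

cd = chamfer_distance(rec, tgt)
dens_loss = np.abs(density(rec, 0.1) - density(tgt, 0.1)).mean()
print(cd >= 0.0, dens_loss >= 0.0)              # True True
```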

EXPERIMENTS
The proposed solution for point cloud upscaling has been evaluated in a comprehensive set of experiments, both qualitative (Section 5.3) and quantitative (Section 5.4). An ablation study aiming to evidence the relevance of the different components of our architecture is reported in Section 5.5.

Implementation details
Our model is implemented in PyTorch, using the PyTorch Geometric (PyG) library [6]. This library is built upon PyTorch and is specifically designed for Graph Neural Networks (GNNs). The two networks are implemented as two Message Passing Networks put in an adversarial setting. Both the Discriminator and the Generator are optimized with Adam, using the standard learning rate of 1e−4 and betas β1 = 0.9, β2 = 0.999, with a linear decaying scheduler that drops the learning rate to 1/10th every 10 epochs.
Other hyperparameters, such as the radii of the Ball Query for the FPS sampling (0.06 and 0.1) and the number of neighbours for the KNN sampling (k = 9), were empirically determined through grid search.

Augmentation.
The training data is augmented using different operations. Each sequence of input point clouds, together with its ground truth point cloud, is randomly flipped along any of its axes, per-point random noise is added, and finally a random scale in the range [0.9, 1.1] is applied along each axis. As a form of augmentation, we also exploit the fully convolutional nature of the generator architecture: similarly to 2D image super-resolution, where patches of the target high-resolution image are used in training, we randomly feed a 3D slice of the video instead of the full body. A final form of augmentation used during training is a simple time inversion inside the sequence.
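The augmentation pipeline above can be sketched as follows. This NumPy sketch uses illustrative noise magnitude and flip probabilities (the exact values used in training are not specified in the text), and omits the 3D-slice cropping:

```python
import numpy as np

rng = np.random.default_rng(3)

def augment(sequence, target):
    """Apply the augmentations described above to a (T, N, 3) input sequence
    and its (M, 3) ground truth: random axis flips, per-point noise on the
    input, random per-axis scaling in [0.9, 1.1], and time inversion."""
    # Random flip along any subset of axes (applied to input and target alike).
    for axis in range(3):
        if rng.random() < 0.5:
            sequence[..., axis] *= -1
            target[..., axis] *= -1
    # Per-point Gaussian noise on the low-resolution input only.
    sequence = sequence + 0.005 * rng.standard_normal(sequence.shape)
    # Random per-axis scale in [0.9, 1.1].
    scale = rng.uniform(0.9, 1.1, size=3)
    sequence, target = sequence * scale, target * scale
    # Time inversion: reverse the frame order of the buffer.
    if rng.random() < 0.5:
        sequence = sequence[::-1]
    return sequence, target

seq = rng.random((3, 256, 3))     # 3 frames of 256 points
gt = rng.random((1024, 3))        # upscaled ground truth
aug_seq, aug_gt = augment(seq.copy(), gt.copy())
print(aug_seq.shape, aug_gt.shape)   # (3, 256, 3) (1024, 3)
```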

Dataset
To evaluate our proposed solution, we used the Dynamic FAUST (D-FAUST) dataset [2]. It contains animated meshes for 129 sequences of 10 human subjects (5 females and 5 males) performing various motions such as "shake hips", "punching", "running on spot", or "one leg jump".
In order to compare with other methods, we used the train/test split proposed in [29]. For each sequence, at training time, we randomly pick an index and then subsample the following frames according to the model's frame rate. We trained multiple models at different frame rates.
We also followed the evaluation setup used in [29]. Specifically, for each evaluation, we carried out two case studies: seen individuals but unseen motions (i.e., test subjects were included in the training data but their motions were not given in the training set), and unseen individuals but seen motions (i.e., test subjects appeared only in the test data but their motions were seen in the training set).

Qualitative results
Some qualitative results of the proposed upscaling are given in Figures 6 and 7. In Figure 6, the input low-resolution frame, our reconstructed point cloud, and the ground truth are shown from left to right. A second example is shown in Figure 7, where the input frame, our reconstruction, and the ground truth are compared both as point clouds (top) and as mesh reconstructions using the Poisson algorithm (bottom). Additional qualitative results are given as videos in the supplementary material.
To give some insights into the behaviour of the network layers, we inspected and visualized the responses of the various convolutional layers given an input point cloud. As an example, on the left of Figure 5, three frames of an input point cloud are shown (the frames are taken at three consecutive times, t0, t0 + 1 and t0 + 2). Points in the clouds are colored to highlight their movement with respect to the previous frame. On the right of Figure 5, instead, the response features of different layers are visualized (the depth of the layers increases from left to right). It is interesting to note that, similarly to CNNs, depth correlates with complexity: the first convolutional layers seem to respond strongly to large physical parts of the human subject, while the later stages take time and movement into consideration.

Evaluation Metrics.
To measure the reconstruction quality we applied the standard Chamfer Distance (CD), a point-to-set metric, following the same protocol as [39], which uses the CD as the reconstruction metric for measuring the dissimilarity between a point set and a target set.

Compared Methods. We compared our approach with six state-of-the-art solutions in the literature for 4D reconstruction from point cloud sequences, namely PSGN-4D, ONet-4D, OFlow, LPDC, 4DCR, and RFNet-4D. PSGN-4D extends the PSGN approach [5] to predict a 4D point cloud, i.e., the point cloud trajectory instead of a single point set. The ONet-4D network is an extension of ONet [27] that defines the occupancy field in the spatio-temporal domain by predicting occupancy values for points sampled in space and time. The OFlow network [29] assigns each 4D point an occupancy value and a motion velocity vector, and relies on a differential equation to calculate the trajectory. LPDC [37] learned a temporal evolution of the 3D human shape through spatially continuous transformation functions among cross-frame occupancy fields. The 4DCR solution [15] used a compositional representation that disentangles shape, initial state, and motion for a 3D object deforming over a temporal interval. Finally, RFNet-4D [39] jointly reconstructs objects and their motion flows from 4D point clouds.

Results
Tables 1 and 2 report results for our solution and for the other methods as given in [39]. For our method (last line in the tables) we used 3 frames for upscaling at 60fps with a scale factor of ×4, starting from low-resolution point clouds composed of 1024 vertices. For the unseen individuals and seen motions protocol in Table 1, our approach achieves the second best score. From Table 2, it can be observed that our method reached a reconstruction error of similar magnitude to the two best performing methods, i.e., RFNet-4D and LPDC. It is worth noting that RFNet-4D obtained the reported error using a larger number of input frames (17, against the 3 to 8 used in our tests). It was not possible to test RFNet-4D in our setting because the code was not publicly available.
In Table 3, we report the inference time, in seconds, for various configurations of our model. All measurements correspond to experiments executed on an Nvidia 2080Ti GPU. The values reported in the table show that our approach can open the way to real-time upscaling. As reported in [39], their method used 17 input frames to reconstruct an output frame, while our range of frames is between 3 (for models using larger input point clouds) and 8 (for smaller inputs) due to memory constraints at training time.

Ablation Studies
In this section, we present ablation studies to verify different aspects of the design of our model. In particular, we assess each of the components introduced in our architecture for 4D point cloud reconstruction by measuring the percentage performance decrease of the model when particular features are removed.
We performed a first set of experiments by using a stream of input point clouds at 60fps and with 256 points per frame; on this stream, we performed upscaling from subsets of 3 consecutive frames, using an upscale factor of ×2. From Table 4, we can notice that by removing individual components of our architecture, the performance of the model significantly and consistently decreases. In particular, we removed the attention aggregation module and substituted it with a more common mean aggregation. We also ablated the impact of the Density Loss and of the adversarial component.

Method          Chamfer Distance ×10⁻³ ↓
PSGN-4D [5]     0.6189
ONet-4D [27]    0.5921
OFlow [29]      0.1773
4DCR [15]       0.1667
LPDC [37]       0.1526
RFNet-4D [39]   0.1504
Ours            0.1638

Table 4: Ablation study for our model using 256 input points, 3 frames, 60fps, and upscale factor ×2.
In Table 5, we repeated the above ablation experiments using a different setup. In this case, the frame rate is changed to 30fps, the input resolution to 512 points per frame, and we performed upscaling with a factor of ×4.
Also in this case, ablating the density loss term results in the most significant decrease in the accuracy of the upscaled model. It is also interesting to observe that, while the percentage increase in the Chamfer distance when removing the attention layer and the adversarial loss shows small differences between the two tables, this is not the case for the density loss: removing this term has a much larger impact on the results in Table 5 (∼ +13%) than in Table 4 (∼ +6%).

Importance of the temporal information.
A question that arises with the proposed solution is the actual impact of having the time buffer compared to using just the last point cloud as an input.
To compare these two solutions, we feed our model with the same frame repeated T times. In this way, we keep the comparison fair by not changing the input size and the number of starting points, but only the information contained within them. We refer to this setup as Static Sequence, whilst we use the term Dynamic to refer to the proposed procedure that uses T different frames. In Table 6, we report some comparative results between the two ways of using the frames in a sequence. It can be observed that there is useful information in the time and movement of the cloud. Just like in a 2D video, the same frame repeated T times does not contain the same amount of useful data for reconstruction as T different subsequent frames.
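The Static and Dynamic input constructions can be illustrated in a few lines. This NumPy sketch uses hypothetical sizes (T = 3 frames of 256 points) and shows that the two inputs have identical shapes but different content:

```python
import numpy as np

rng = np.random.default_rng(4)
T, N = 3, 256

frames = [rng.random((N, 3)) for _ in range(T)]

# Dynamic sequence: T distinct consecutive frames fused together.
dynamic = np.concatenate(frames, axis=0)

# Static sequence: the last frame repeated T times, so the input size is
# unchanged but the temporal information is removed.
static = np.concatenate([frames[-1]] * T, axis=0)

print(dynamic.shape == static.shape)     # True: identical input sizes
print(np.array_equal(dynamic, static))   # False: different content
```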

CONCLUSIONS
In this paper, we presented a fully convolutional graph-based approach for time-varying point cloud upscaling, using a novel approach with respect to most state-of-the-art models. Our proposed method is comparable with state-of-the-art solutions in terms of upsampling performance, but it has a much lighter architecture that allows implementation on edge devices with limited computational capabilities. As a future development, this type of application could be deployed as an update for older LiDAR devices, or used to allow faster 3D point cloud streaming by transmitting/sampling only a subset of the original points.
While our method tackles the problem in a different way, bringing some advantages, it still has some limitations and drawbacks:
• Training time and memory footprint: not relying on an encoder-decoder model implies keeping the whole point cloud in memory at every stage of the network; this slows down training and limits the number of input frames;
• Results for the reconstruction accuracy are comparable with those reported in the state of the art, though a bit lower.

Figure 2 :
Figure 2: Left: sample of an input low-resolution point cloud with ∼3K vertices. Center: our model reconstruction with ∼12K vertices. Right: ground truth point cloud with ∼12K vertices.

Figure 3 :
Figure 3: Schematic representation of the proposed Parallel Double Sampling (PDS) module.

Figure 4 :
Figure 4: Schematic representation of the proposed GCN architecture. Top: Generator architecture; Bottom: Discriminator architecture.

Figure 5 :
Figure 5: The top and bottom rows show: (left) the point clouds of a three-frame input sequence with movements; colors indicate the movement of a point with respect to the previous frame; (right) the features obtained from subsequent edge graph convolutional layers of the proposed architecture in response to the three-frame sequence shown on the left. It can be noted that the response of the layers seems to pass from spatial to temporal details.

Figure 6 :
Figure 6: Left: sample of a single frame from an input low-resolution point cloud with ∼512 vertices; Center: reconstruction obtained with our proposed solution; Right: ground truth point cloud.

Figure 7 :
Figure 7: Top: point cloud visualization. Bottom: mesh reconstruction using Poisson surface reconstruction from [16]. Left: sample obtained from a single frame of an input low-resolution point cloud with 1024 vertices; Center: model reconstruction using our proposed approach; Right: ground truth point cloud.

Table 2 :
Reconstruction accuracy for the seen individuals and unseen motions protocol. We report the Chamfer distance (lower is better). Results for the best and second best performing methods are given in bold and underlined, respectively. Our approach results in the third best performance.

Table 3 :
Inference time for different configurations of our model using a three-frame buffer. Every test was performed on an Nvidia 2080Ti. Note that the other models used a 17-frame input sequence to output a frame.

Table 6 :
Ablation study for our model using the aforementioned sequences at different resolutions. It shows how the dynamic approach performs consistently better than the static one.