Open Access

Deep Saliency Mapping for 3D Meshes and Applications

Published: 06 February 2023


Abstract

Nowadays, three-dimensional (3D) meshes are widely used in various applications in different areas (e.g., industry, education, entertainment, and safety). The 3D models are captured with multiple RGB-D sensors, and the sampled geometric manifolds are processed, compressed, simplified, stored, and transmitted to be reconstructed in a virtual space. These low-level processing applications require an accurate representation of the 3D models, which can be achieved through saliency estimation mechanisms that identify areas of the 3D model representing surface patches of importance. Saliency maps therefore guide the selection of feature locations, facilitating the prioritization of 3D manifold segments and attributing more bits to salient vertices during compression or a lower decimation probability during simplification, since compression and simplification are counterparts of the same process. In this work, we present a novel deep saliency mapping approach for 3D meshes that emphasizes reducing the execution time of saliency map estimation, especially when compared with the corresponding time of other relevant approaches. Our method utilizes baseline 3D importance maps to train convolutional neural networks. Furthermore, we present applications that utilize the extracted saliency, namely feature-aware multiscale compression and simplification frameworks.


1 INTRODUCTION

Recent technological developments in the fields of real-time three-dimensional (3D) capturing [1], 3D displays [38], and wearable 3D glasses [35] lay the foundations for the development of fascinating applications. A key feature for engaging the user and providing realistic experiences in full 3D media environments is to allow unrestricted viewpoint experience and navigation, emphasizing the need for robust and scalable frameworks facilitating the immersive experience of the user [15]. The visual quality of 3D media is affected by geometry and texture parameters, while the temporal aspects of smooth movement and synchronization are affected by lag introduced by network transmission effects. Several efficient multi-sensor telepresence and teleimmersion frameworks have emerged in the literature [3, 11]. Nevertheless, the real-time streaming of huge 3D models introduces real-world challenges related to the increasing demands for low-cost, low-latency, and scalable coding of 3D data [5]. Additionally, besides the need for efficient transmission, these applications face the challenge of storing a huge amount of information. It is therefore necessary to use fast frameworks that achieve low compression ratios without noticeably affecting the visual perception of the user [18, 24], especially in immersive applications [10].

The reconstructed geometries are subsequently processed, compressed, simplified, stored, transmitted, and eventually reconstructed in a virtual space [2, 3, 20, 21, 22]. Efficient compression and simplification of 3D geometries, as counterparts of the same process, are based on the prioritization of groups of vertices or mesh segments [19]. This mechanism, which identifies the importance of a point in space with respect to its neighbouring region in a global or a local setting, is a stimulus-driven process also referred to as saliency [25]. Saliency detection was first introduced in the research areas of image [9, 39] and video [14] processing, highlighting specific areas that contain the most critical visual features. Saliency mapping provides tremendous advantages for the non-isotropic processing of 3D objects: the salient part is distinguished from its surroundings due to its lack of coherence with them. Saliency for 3D meshes was initially introduced in Reference [17], while several other methods have been presented in the literature, including spectral methods [31], curvature-based methods [17, 36], multi-scale methods [26, 37], anisotropic filtering approaches [40], entropy-based schemes [34], and hybrid methods [4]. Despite their success, their main drawback is that they are susceptible to noise, outliers, or missing parts. The authors of Reference [6] introduced a sparse-modelling-based approach that exploits the low-rank property of the geometry and the sparsity of the features in the Laplacian domain to generate importance maps while efficiently filtering noise and outliers. More specifically, this scheme fuses Robust Principal Component Analysis (RPCA) and eigenvector analysis of matrices containing the curvature-normal vectors of the 3D mesh. Robustness to noise and outliers also makes such methods suitable for applications including compression, simplification, and denoising.
However, their main drawback is that they exhibit high execution times for large and dense 3D meshes. To this end, using deep networks allows for lower complexity and higher performance. Even though recent works [12] have employed deep networks to extract saliency maps for 3D representations, they mainly focus on image-based or multi-view setups. The authors of Reference [27] employed RPCA-generated baseline 3D saliency maps to train deep Convolutional Neural Networks (CNNs), utilizing 3D saliency descriptors that operate directly on the 3D points, contributing to the field of geometric deep learning, where the sampling of the latent space is non-uniform. They classified the 3D mesh faces into four saliency levels and demonstrated feature-aware compression and simplification outcomes. CNN-based classification-oriented approaches generate outcomes equivalent to classical approaches with much lower execution times. However, classification into predefined classes offers poor visual outcomes and limited flexibility in the case of vertex prioritization. Furthermore, Nousias et al. [28] presented a regression-oriented CNN architecture, whose outcome was continuous saliency values with a greater level of detail. They also demonstrated its effect on multiscale feature-aware compression and simplification schemes. However, since CNN-based solutions traverse the entire geometry face by face, they exhibit high overlap, especially in the case of neighbouring faces.

To address this limitation, we propose a fully convolutional network (FCN)–based solution that simultaneously extracts the saliency for an entire patch. FCN deployment allows using only 6% of the 3D mesh faces, reducing the overlap. Furthermore, the performance evaluation study reveals that the FCN is much faster than CNN- and RPCA-based solutions. The FCN follows a UNet-based architecture consisting of two parts, a contracting part and an expansive part. The contracting path follows the typical structure of a CNN, while during each expanding step, the outcome is concatenated with the corresponding tensors of the contracting part. More details about the FCN architecture are discussed in Section 2.2.

Visual saliency is a subjective perception cue that differentiates a region from others and immediately attracts human attention. The human visual system has evolved to automatically detect salient regions over the entire field of view. It is first attracted by the most representative salient elements, and then visual attention is transferred to other regions. Most existing methods try to simulate how the human perceptual system works, emphasizing what the human brain assumes to be salient information. Nevertheless, what a human considers a salient feature may differ from what computational methods consider salient. However, in many applications (industrial, heritage), simple geometry is usually more common and useful than complex surfaces of high spatial frequency that trigger human visual attention.

The typical way to evaluate a saliency mapping method is to assess the extracted visual saliency map directly. Nevertheless, such an evaluation cannot clearly show whether a specific saliency mapping has achieved its purpose when applied in a specific application, nor can it provide a fair comparison with the results of other saliency mapping methods.

On the one hand, annotated databases for 3D point cloud data are not available; on the other hand, even if one were available, this does not mean that simplification/compression results based on this annotation would be better, since human perceptual saliency differs from geometrical saliency. Human annotators may not be perfect (limited or subjective evaluation) or may emphasize areas that are not salient in order to facilitate the simplification/compression process. Additionally, it is tough, if not impossible, to annotate temporal data objectively.

The objective of this work is not to supersede RPCA but to achieve similar performance with a fraction of the resources and time (the CNN/FCN accelerates the process). We choose the RPCA method for saliency extraction as the baseline for developing our approach because the performance and advantages of RPCA-based saliency mapping have been successfully evaluated in previous works [6, 7]. We do not evaluate the extracted RPCA-based saliency map directly, but rather the reconstructed results of the 3D models using RPCA in different applications (i.e., simplification and compression). It should be emphasized that this is because there is no ground-truth saliency map or a reliable metric that can be used for benchmarking purposes.

More specifically, the outcomes of the presented approach can be summarized in the following points: (i) We train a fully convolutional network architecture to estimate 3D mesh saliency values in a regression-oriented fashion. (ii) We provide qualitative, quantitative, and performance evaluation studies demonstrating the robustness of the presented approaches with respect to resolution and noise, also highlighting their invariance to scale, rotation, and translation. (iii) We examine and discuss the application of saliency extraction in sequences of animated meshes. (iv) We present applications that take advantage of the extracted saliency. Such applications include multiscale feature-aware compression and simplification.

The rest of this article is organized as follows. Section 2 describes our approach in detail, presenting our 3D saliency descriptor and the corresponding deep architecture. Section 3 presents the outcomes of the presented approaches, while Section 4 draws conclusions and discusses the outcomes of our approach.


2 DEEP SALIENCY

This section presents the pipeline of the proposed approach for the deep saliency analysis of 3D geometries, visualized in Figure 1. We can think of the 3D model as a point-sampled 3D manifold with connectivity. The saliency map is assumed to be registered upon this 3D manifold, where each face or point is assigned a value indicating the significance of this point with respect to its local neighbourhood or the global setting. Specifically, the RPCA and eigenvectors-based approach proposed in previous publications is taken as a baseline [6, 7]. This baseline practically captures the curvature variation profile within a particular patch of the manifold compared to all the other patches. The proposed approach employs deep neural networks to learn from this baseline metric so that one does not have to perform singular value decomposition on large matrices, which is intractable in some cases.


Fig. 1. Deep network–based saliency map extraction pipeline.

We extract training data from a simplistic 3D geometric model referred to as “the armchair.” Each face of the model is formulated as a training example. In total, 291,864 training examples were used to train an FCN and a CNN architecture. The CNN included in this study aims to highlight the benefits of the proposed FCN-based scheme. A 3D saliency patch descriptor, receiving as input the geometrical characteristics of \(N\) neighbouring faces, traverses the mesh surface. For the selected patch, we generate a rotation-, translation-, and scale-invariant representation. Afterwards, the trained models are used to extract the “data-driven” saliency map of 3D models available in public datasets. The extracted saliency maps are used for compression and simplification, identifying which mesh vertices need to be preserved during a decimation process or attributed more bits in a compression pipeline.

2.1 3D Saliency Descriptor

Triangular 3D meshes \(\mathcal {M}\) consist of \(n\) vertices \(\mathbf {v}\) and \(n_f\) faces \(f\). Each vertex \(\mathbf {v}_i\) is denoted by (1) \(\begin{equation} \mathbf {v}_i = \left[x_i,\ y_i,\ z_i\right]^T, \ \forall \ i = 1, \ldots , n \end{equation}\) and each face \(f_j\) is a triangle defined by (2) \(\begin{equation} f_j = \lbrace \mathbf {v}_{j1} \ \mathbf {v}_{j2} \ \mathbf {v}_{j3} \rbrace , \ \ \ \forall \ j = 1, \ldots , n_f \end{equation}\) with centroid (3) \(\begin{equation} \mathbf {c}_j =\frac{\left(\mathbf {v}_{j1}+\mathbf {v}_{j2}+\mathbf {v}_{j3}\right)}{3} \end{equation}\) and outward unit normal (4) \(\begin{equation} \mathbf {n}_{c_j} = \frac{\left(\mathbf {v}_{j2}-\mathbf {v}_{j1}\right) \times \left(\mathbf {v}_{j3}-\mathbf {v}_{j1}\right)}{\left\Vert \left(\mathbf {v}_{j2}-\mathbf {v}_{j1}\right) \times \left(\mathbf {v}_{j3}-\mathbf {v}_{j1}\right)\right\Vert }. \end{equation}\) The sliding 3D saliency descriptor receives as input a tensor consisting of the normal coordinates of the face in question and its \(N-1\) neighbouring faces and generates a representation to be employed as input for the deep network. We assume that \(S_i\) is the set of face indices neighbouring \(f_i\) and \(k=\left|S_i\right|\) is the cardinality of \(S_i\). We select \(k\) to be a perfect square, \(k=w^2\), with \(w \in \lbrace 4, 8, 16, 32 \rbrace\).
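For concreteness, the per-face quantities of Equations (3) and (4), plus the face areas needed later in Equation (9), can be computed in a vectorized manner; the following is a minimal numpy sketch, with the array names `V` and `F` chosen here purely for illustration:

```python
import numpy as np

def face_centroids_normals(V, F):
    """Per-face centroids (Eq. (3)) and outward unit normals (Eq. (4)).

    V: (n, 3) array of vertex coordinates.
    F: (n_f, 3) array of vertex indices, one triangle per row.
    """
    v1, v2, v3 = V[F[:, 0]], V[F[:, 1]], V[F[:, 2]]
    centroids = (v1 + v2 + v3) / 3.0
    cross = np.cross(v2 - v1, v3 - v1)            # unnormalized normal
    normals = cross / np.linalg.norm(cross, axis=1, keepdims=True)
    areas = 0.5 * np.linalg.norm(cross, axis=1)   # face areas, used in Eq. (9)
    return centroids, normals, areas
```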

\(\mathbf {S}=[\mathbf {c}_1 \: \mathbf {c}_2 \: \mathbf {c}_3 \: \cdots \: \mathbf {c}_k]\) contains the centroid normals of the patch. Then \(\mathbf {S}_{rot}=\mathbf {R}\mathbf {S}\), where (5) \(\begin{align} \mathbf {R}&=\mathbf {I} +(\sin \theta)\mathbf {K} +(1-\cos \theta)\mathbf {K} ^{2}, \end{align}\) (6) \(\begin{align} \mathbf {K} &=\left[\begin{array}{ccc}0&-a_{z}&a_{y}\\ a_{z}&0&-a_{x}\\ -a_{y}&a_{x}&0 \end{array}\right]\!, \end{align}\) (7) \(\begin{align} \cos \theta &=\frac{\hat{\mathbf {c}}\cdot \mathbf {c}_{const}}{||{\hat{\mathbf {c}}}||\cdot ||{\mathbf {c}_{const}}||}, \end{align}\) (8) \(\begin{align} \mathbf {a}&=[a_x\: a_y\: a_z]=\hat{\mathbf {c}}\times \mathbf {c}_{const}, \end{align}\) (9) \(\begin{align} \hat{\mathbf {c}}&=\frac{1}{k} \sum _{i\in {S}}A_i \mathbf {c}_i, \end{align}\) where \(\mathbf {c}_{const}\) is an arbitrary constant vector and \(A_i\) is the area of face \(f_i\). Then, to map the input data to \([0,1]\), (10) \(\begin{equation} \mathbf {S}^{\prime }=\frac{\mathbf {S}_{rot}+\mathbf {1}}{2}. \end{equation}\) A space filling function \(Q\) is employed to transform \(\mathbf {S}^{\prime }\) to the tensor \(\mathbf {\mathcal {P}}\) so that (11) \(\begin{equation} \mathbf {S}^{\prime } \in \mathcal {R}^{ k \times 3} \xrightarrow {} \mathbf {\mathcal {P}}\in \mathcal {R}^{w\times w \times 3}. \end{equation}\) Space filling functions are bijective transformations from a one-dimensional to an \(n\)-dimensional space, as presented in Reference [8]. In our case, two space filling functions are investigated, the sweep space curve \(\mathcal {S}\) and the Hilbert curve \(\mathcal {H}\). Specifically, \(\mathbf {\mathcal {P}}_{(x,y,:)}=\mathbf {S}^{\prime }_{(i,:)}\), where \(i=\mathcal {H}(x,y)\). The resulting representation is independent of patch scale, translation, and rotation. Figure 2 visualizes the Hilbert curve and the sweep space curve.
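The descriptor construction of Equations (5)–(11) can be sketched as follows. This is an illustrative numpy version, not the authors' implementation: the rotation axis of Equation (6) is explicitly normalized here (needed for Equation (5) to yield a valid rotation matrix), \(\mathbf{c}_{const}\) is arbitrarily set to \((0, 0, 1)\), and a standard iterative Hilbert-curve mapping stands in for \(\mathcal{H}\):

```python
import numpy as np

def rotation_to(c_hat, c_const):
    """Rodrigues rotation (Eqs. (5)-(8)) mapping c_hat onto c_const.
    The axis is normalized, which Eq. (6) leaves implicit; the exactly
    antiparallel case is not handled in this sketch."""
    a = np.cross(c_hat, c_const)                 # Eq. (8)
    na = np.linalg.norm(a)
    if na < 1e-12:                               # already (anti-)parallel
        return np.eye(3)
    denom = np.linalg.norm(c_hat) * np.linalg.norm(c_const)
    cos_t = np.dot(c_hat, c_const) / denom       # Eq. (7)
    sin_t = na / denom
    k = a / na
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])           # Eq. (6), unit axis
    return np.eye(3) + sin_t * K + (1.0 - cos_t) * (K @ K)   # Eq. (5)

def hilbert_d2xy(w, d):
    """Map distance d along the Hilbert curve of a w x w grid to (x, y)."""
    rx = ry = x = y = 0
    t, s = d, 1
    while s < w:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                               # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def patch_descriptor(normals, areas, w, c_const=np.array([0.0, 0.0, 1.0])):
    """Build the w x w x 3 tensor P of Eq. (11) from k = w*w patch normals,
    ordered along the Hilbert curve."""
    c_hat = np.sum(areas[:, None] * normals, axis=0) / len(normals)  # Eq. (9)
    R = rotation_to(c_hat, c_const)
    S_rot = normals @ R.T                        # rotate every normal
    S_prime = (S_rot + 1.0) / 2.0                # Eq. (10): map to [0, 1]
    P = np.zeros((w, w, 3))
    for d in range(w * w):
        x, y = hilbert_d2xy(w, d)
        P[x, y, :] = S_prime[d]
    return P
```

Applying the sweep curve instead amounts to replacing `hilbert_d2xy` with row-major indexing, i.e., `x, y = divmod(d, w)`.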


Fig. 2. Space filling curve functions for Hilbert curve on the left and sweep space curve on the right.

2.2 FCN Architecture Training and Inference

This section presents the deep architecture approaching the saliency extraction task as a regression problem. We assume that given \(\mathbf {\mathcal {P}}\) as input of dimensions \(w \times w \times 3\) there is a transformation function \(\mathcal {F}(\mathbf {\mathcal {P}})\) so that \(\mathcal {F}(\mathbf {\mathcal {P}})=s \in [0,1]\) where \(s\) is the corresponding saliency.

The fully convolutional network follows a UNet-based [29] architecture, presented in Figure 4. It consists of a contracting part and an expansive part. The contracting path follows the typical structure of a CNN. Each convolutional layer, applying padded \(3 \times 3\) convolutions, is followed by a rectified linear unit (ReLU). The first level consists of two convolutional layers followed by a \(2 \times 2\) max-pooling operation. The second level consists of three convolutional layers followed by a \(2 \times 2\) max-pooling operation. The third level of the contracting part consists of four convolutional layers followed by a \(2 \times 2\) max-pooling operation. At each downsampling, the number of feature channels is doubled. For the expanding part, \(2 \times 2\) up-convolutions are used that halve the number of feature maps. In each expanding step, the outcome is concatenated with the corresponding tensors of the contracting part. After each up-convolution and concatenation, a number of convolutional layers with padded \(3\times 3\) convolutions follow. The first expanding level consists of one convolutional layer and the second expanding level of five consecutive convolutional layers before the outcome is mapped to a \(w\times w \times 1\) output.
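To make the layer descriptions concrete, the two building blocks named above, a padded \(3 \times 3\) convolution followed by a ReLU and a \(2 \times 2\) max-pooling, can be sketched in plain numpy so the shape bookkeeping is explicit; an actual training pipeline would of course use a deep learning framework, and the kernel weights here are placeholders:

```python
import numpy as np

def conv3x3_relu(x, kernels):
    """Padded 3x3 convolution followed by ReLU, the basic unit of both
    the contracting and expanding paths.
    x: (h, w, c_in), kernels: (3, 3, c_in, c_out)."""
    h, w, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))      # zero padding keeps h x w
    out = np.empty((h, w, kernels.shape[3]))
    for i in range(h):
        for j in range(w):
            out[i, j, :] = np.tensordot(xp[i:i + 3, j:j + 3, :], kernels,
                                        axes=([0, 1, 2], [0, 1, 2]))
    return np.maximum(out, 0.0)                   # ReLU

def maxpool2x2(x):
    """2x2 max-pooling: halves the spatial resolution, keeps channels."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))
```

Stacking these blocks with channel doubling at each pooling step reproduces the contracting path described above; the expanding path additionally upsamples and concatenates the matching contracting tensor.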

A CNN architecture is also employed to provide a more extensive comparison and highlight the benefits of the FCN. The CNN architecture, presented in Figure 3, consists of three convolutional layers, each followed by a max-pooling layer and a ReLU activation function. A flattening layer succeeds the convolutional layers, followed by three fully connected layers and a single neuron to yield the saliency metric. The dimension of the convolutional kernels is \(3 \times 3\), performing a padded convolution operation.


Fig. 3. Schematic representation of the proposed CNN architecture.


Fig. 4. Schematic representation of the proposed FCN architecture.

To form the training set, we extract patches \(\mathbf {\mathcal {P}}_i\) from 3D meshes and assign a baseline saliency value to each patch. For the training, a 3D model referred to as “armchair,” presented in Figure 1, was employed. The model comprises 145,934 vertices and 291,864 faces, generating 291,864 training examples. Each batch consists of a tensor \(\mathbf {M}\in \mathcal {R}^{32 \times 32 \times c \times b}\) and the corresponding saliency values, where \(b=600\) is the batch size and \(c=3\) the number of channels. The training of the CNN took place on an NVIDIA GeForce GTX 1080 graphics card with 8-GB VRAM and compute capability 6.1.

2.3 Feature-Aware Simplification and Compression of 3D Meshes

Dense models, consisting of millions of vertices, often have to be simplified before further processing (e.g., denoising, reconstruction, deformation) or use in applications (e.g., rendering, transmission, storage). In more advanced strategies, the purpose of a simplification algorithm is to preserve the most representative and perceptually essential parts of a model (i.e., corners, edges, or high-curvature surfaces) and remove the least representative ones (i.e., flat areas). To this end, we suggest a multiscale feature-aware simplification process. At first, vertices are classified into different saliency levels, and then a different portion of vertices is kept from each class. Priority is given to those vertices categorized as belonging to more salient classes.

Simplification is a low-level application focusing on representing an object using less information (fewer primitives). The main objective of a successful simplification approach is to remove only those vertices that do not offer significant geometric information and whose removal will not change the shape of the 3D object significantly, while at the same time decreasing the perceptual error. Based on this assumption, we suggest removing the least perceptually important vertices (e.g., vertices lying in flat areas) and preserving only the most geometrically salient vertices, which will be utilized to reconstruct the new simplified 3D model. More specifically, the steps of the suggested multiscale feature-aware simplification process are shortly discussed next.

The first step is the extraction of the saliency map of the 3D model based on the approach presented in Section 2.1. Then, the vertices are classified into N classes based on the magnitude of their saliency values. The classification of the vertices is used to indicate the final remaining vertices based on the selected simplification strategy. When the least salient vertices have been removed, a KNN algorithm is used to recreate the new connectivity (triangulation). More specifically, for each vertex that is removed, we find the closest neighbouring vertex that remains, and we perform half edge collapse.
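The steps above can be sketched as follows. This is an illustrative numpy version under stated assumptions: the saliency classes are formed by quantile binning, the retention ratios are hypothetical parameters, and a brute-force nearest-neighbour search stands in for the KNN structure used to pick half-edge collapse targets:

```python
import numpy as np

def feature_aware_simplify(V, saliency, keep_ratios):
    """Multiscale feature-aware vertex selection (sketch).
    Vertices are binned into len(keep_ratios) saliency classes;
    keep_ratios[c] is the fraction of class c to keep (higher class =
    more salient). Returns kept indices, removed indices, and for each
    removed vertex the nearest kept vertex (half-edge collapse target)."""
    n_classes = len(keep_ratios)
    edges = np.quantile(saliency, np.linspace(0.0, 1.0, n_classes + 1))
    labels = np.clip(np.searchsorted(edges, saliency, side='right') - 1,
                     0, n_classes - 1)
    keep = []
    for c in range(n_classes):
        idx = np.where(labels == c)[0]
        idx = idx[np.argsort(-saliency[idx])]     # most salient first
        keep.extend(idx[:max(1, int(len(idx) * keep_ratios[c]))].tolist())
    keep = np.array(sorted(keep))
    removed = np.setdiff1d(np.arange(len(V)), keep)
    # nearest remaining vertex = collapse target (brute force for clarity)
    d = np.linalg.norm(V[removed][:, None, :] - V[keep][None, :, :], axis=2)
    targets = keep[np.argmin(d, axis=1)]
    return keep, removed, targets
```

In the actual pipeline the collapse targets would be restricted to the one-ring neighbourhood so that the half-edge collapse preserves a valid triangulation.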

Three-dimensional mesh compression is a low-level geometry process that aims to encode and reconstruct a 3D geometry from a subset of the initial mesh vertices, potentially coded in a different domain. Geometry information can be encoded using delta coordinates and the connectivity-derived Laplacian matrix, similarly to Reference [33]. High-saliency regions of the mesh encompass essential features that need to be encoded with more bits, while low-saliency faces correspond to flat or lower-curvature areas that can be selectively encoded with fewer bits. Thus, a level-of-detail scheme emerges. For the proposed multiscale feature-aware compression scheme, we uniformly sample 10% of the mesh vertices to serve as anchor points and quantize them to 12 bits. Subsequently, we classify the mesh vertices into levels of saliency, select a different portion from each class, and quantize the corresponding \(\delta\)-coordinates to 12 bits while setting the rest of them to zero. This creates a smoothing effect, forcing the vertices reconstructed at the decoder to move toward the centre of gravity of their neighbours. On the decoder side, we solve a sparse linear system of equations, similarly to Reference [16], to reconstruct the compressed 3D model. Algorithms 1 and 2 summarize the aforementioned approaches.
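A toy version of this encode/decode loop might look as follows. It is a hedged sketch, not the paper's codec: it uses a dense uniform Laplacian for clarity (the actual scheme solves a sparse system as in Reference [16]), a single saliency threshold instead of per-class portions, and illustrative parameter names and defaults:

```python
import numpy as np

def compress_decompress(V, adj, saliency, keep_frac=0.5, bits=12,
                        anchor_frac=0.1):
    """Sketch of saliency-aware delta-coordinate coding.
    adj: list of neighbour index lists per vertex."""
    n = len(V)
    L = np.eye(n)
    for i, nbrs in enumerate(adj):                 # uniform Laplacian
        L[i, nbrs] = -1.0 / len(nbrs)
    delta = L @ V                                  # delta coordinates
    kept = np.argsort(-saliency)[:max(1, int(n * keep_frac))]
    lo, hi = delta[kept].min(), delta[kept].max()
    levels = 2 ** bits - 1
    q = np.round((delta[kept] - lo) / (hi - lo) * levels)   # quantize
    delta_c = np.zeros_like(delta)                 # non-salient deltas -> 0
    delta_c[kept] = q / levels * (hi - lo) + lo    # dequantized deltas
    anchors = np.arange(0, n, max(1, int(1 / anchor_frac)))  # uniform sample
    # decoder: least-squares solve of [L; I_anchors] V = [delta_c; V_anchors]
    A = np.vstack([L, np.eye(n)[anchors]])
    b = np.vstack([delta_c, V[anchors]])
    V_rec, *_ = np.linalg.lstsq(A, b, rcond=None)
    return V_rec
```

Zeroing the deltas of low-saliency vertices is exactly what produces the smoothing effect described above: the least-squares solution pulls those vertices toward the centroid of their neighbours.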

2.4 Efficient Estimation of Saliency Maps in Dynamic 3D Meshes

The estimation of the saliency mapping for all of the frames constituting a dynamic 3D mesh can be a very time-consuming process. Nevertheless, motivated by the observation that the surface and the shape of a non-rigid dynamic 3D model do not change significantly from frame to frame, we assume that it is not necessary to re-estimate the saliency map of all vertices for each frame. Instead, we take advantage of the saliency information of the previously estimated frame and reuse it in the corresponding areas of the new frame that remain geometrically very similar. In other words, assuming non-isometric deformations, we suggest estimating the saliency values only for those vertices that exhibit a significant change, as measured by the normalized difference \(\mathbf {d}\), which represents the percentage difference of the first-ring area between corresponding vertices in two consecutive frames. Otherwise, if the first-ring area of a vertex has not changed significantly between two frames, then we assume that the vertex area has preserved its original geometry and therefore keeps the same saliency value as in the previous frame, so the saliency information is transferred to the next frame's vertex. The criterion used to decide whether the saliency value of a vertex must be re-estimated is presented in Equation (12) [5]: (12) \(\begin{equation} \Phi (i,l) = \left\lbrace \begin{matrix}1 & \text{if} \ \mathbf {d} = \frac{| \mathbf {A}(i,l)-\mathbf {A}(i,l-1)|}{\text{max}(\mathbf {A}(i,l),\mathbf {A}(i,l-1))} \ge 0.1,\\ 0 & \text{otherwise} \end{matrix}\right. \end{equation}\)

where \(\mathbf {A}(i,l)\) is the first-ring area of point \(i\) as it appears in frame \(l\); when \(\Phi (i,l) = 1\), the saliency value of vertex \(i\) in frame \(l\) must be estimated again.
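Equation (12) and the transfer rule it guards can be expressed compactly; the following is a minimal numpy sketch with hypothetical function names:

```python
import numpy as np

def recompute_mask(A_curr, A_prev, tau=0.1):
    """Eq. (12): Phi(i, l) = 1 for vertices whose normalized first-ring-area
    difference d is at least tau, i.e., those needing a fresh saliency value."""
    d = np.abs(A_curr - A_prev) / np.maximum(A_curr, A_prev)
    return d >= tau

def propagate_saliency(sal_prev, sal_fresh, A_curr, A_prev, tau=0.1):
    """Transfer the previous frame's saliency except where Eq. (12) fires;
    sal_fresh holds the re-estimated values for the flagged vertices."""
    phi = recompute_mask(A_curr, A_prev, tau)
    return np.where(phi, sal_fresh, sal_prev)
```

In practice only the flagged vertices are fed through the network, so `sal_fresh` need only be computed on the mask.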


3 EXPERIMENTAL EVALUATION

To establish a solid experimental strategy, we initially extract training data from a simplistic 3D geometric model referred to as “the armchair.” Each face of the model is formulated as a training example. In total, 291,864 training examples were used to train an FCN and a CNN architecture. The CNN included in this study highlights the benefits of the proposed FCN-based scheme. Afterwards, the trained models are used to extract the “data-driven” saliency map of 3D models available in public datasets, visualized in Figure 5. The extracted saliency maps are used for compression and simplification, identifying which mesh vertices need to be preserved during a decimation process or attributed more bits in a compression pipeline.


Fig. 5. Visualization of the 3D model utilized within this study. (a) “armchair” model used for training of the deep architectures, (b) “cad,” (c) “block,” (d) “casting,” (e) “fandisk,” (f) “joint,” (g) “stonecorner,” and (h) “centurion.”

The metrics that evaluate the applicability and the accuracy of our approach are twofold. The first refers to the accuracy of the deep networks, i.e., how close the inferred values are to the ground truth, where the RPCA-based saliency map is assumed to be the ground truth. For the comparison to take place, the saliency map is discretized for both prediction and ground truth into eight or four bins, converting the regression outcome into a classification outcome. The saliency is then compared using traditional confusion matrices. The second evaluation method is through simplification and compression applications. In this setup, we assume that a compression or simplification pipeline aims to keep intact the most significant points of the 3D mesh. To this end, the deep saliency extraction process takes place first. Figure 5 visualizes the publicly available 3D models utilized within this study.
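The discretization-plus-confusion-matrix evaluation just described can be sketched as follows; this is a minimal numpy illustration whose default bin count matches the eight-bin setup mentioned above:

```python
import numpy as np

def saliency_confusion(pred, truth, n_bins=8):
    """Discretize continuous saliency in [0, 1] into n_bins classes and
    build the confusion matrix (rows: ground truth, columns: prediction)
    together with the overall accuracy."""
    p = np.minimum((np.asarray(pred) * n_bins).astype(int), n_bins - 1)
    t = np.minimum((np.asarray(truth) * n_bins).astype(int), n_bins - 1)
    cm = np.zeros((n_bins, n_bins), dtype=int)
    np.add.at(cm, (t, p), 1)                      # accumulate pair counts
    return cm, np.trace(cm) / cm.sum()
```

The micro-, macro-, and weighted-averaged precision, recall, and F1 scores reported later are all derived from this matrix.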

3.1 Qualitative and Quantitative Evaluation of FCN-based Deep Saliency

Figure 6 presents a qualitative comparison of the extracted saliency maps of the deep approaches with respect to the baseline saliency. The first column visualizes the initial 3D model, while the second represents the baseline saliency map. The third and fourth columns visualize the prediction outcome, either for the CNN (first row) or the FCN (second row). Specifically, the third column corresponds to the Hilbert curve, while the fourth corresponds to the sweep space curve. Faces with higher saliency values are in red, while faces with lower saliency values are in blue. The qualitative evaluation reveals that the maps derived from the deep architectures efficiently capture the baseline saliency.


Fig. 6. Saliency maps of 3D scanned geometries. (a) 3D mesh geometry, (b) ground-truth estimation, (c) CNN-based Hilbert curve saliency map, and (d) CNN-based sweep space curve saliency map.

To present a more comprehensive assessment, we also provide confusion matrices measuring the quality of the generated saliency maps, quantizing the baseline and predicted saliency into eight classes. An ideal distribution would follow the green line that corresponds to the diagonal of the confusion matrix. The darker the diagonal, the better the prediction outcome. In the case of the CNN (first row of Figures 6(c) and 6(d)), the Hilbert curve and the sweep space curve do not seem to differ significantly. However, in the FCN case, the Hilbert curve outcome brings more predicted values near the diagonal, yielding a better result than the sweep space curve. To elaborate further, the explanation lies in the arrangement that each space-filling curve captures. Figure 2 reveals that convolutional kernels traversing a Hilbert curve arrangement capture more consistent regions of the mesh. That fact may not affect the CNN case, but it significantly affects the FCN case, which relies on the relation between nearby elements to constrain the latent space.

A quantitative evaluation is presented in Table 1. The extracted confusion matrices are evaluated using (a) overall accuracy; (b) micro precision, recall, and F1 score; (c) macro precision, recall, and F1 score; and (d) weighted precision, recall, and F1 score. The quantitative evaluation reveals that the CNN is slightly better than the FCN, although this also depends on the model. For example, the “head” model (Figure 6), which is a heritage model, lacks flat surfaces, while the industrial “fandisk” and “casting” models have more faces on flat areas. In the case of flat areas, the FCN yields lower accuracy than in the case of curved surfaces.

Table 1.

                   |       CNN 256        |       FCN 256
                   | Casting Fandisk Head | Casting Fandisk Head
-------------------|----------------------|---------------------
Accuracy           |  0.60    0.73   0.68 |  0.59    0.64   0.67
Micro precision    |  0.60    0.73   0.68 |  0.59    0.64   0.67
Micro recall       |  0.60    0.73   0.68 |  0.59    0.64   0.67
Micro F1 Score     |  0.60    0.73   0.68 |  0.59    0.64   0.67
Macro precision    |  0.71    0.67   0.73 |  0.70    0.53   0.70
Macro recall       |  0.49    0.56   0.38 |  0.49    0.43   0.36
Macro F1 Score     |  0.45    0.59   0.41 |  0.44    0.44   0.38
Weighted precision |  0.66    0.71   0.67 |  0.66    0.61   0.66
Weighted recall    |  0.60    0.73   0.68 |  0.59    0.64   0.67
Weighted F1 Score  |  0.53    0.70   0.62 |  0.52    0.60   0.60

Table 1. Quantitative Evaluation

Figure 7 visualizes the CNN outcome for different patch sizes \(k \in \lbrace 16, 64, 256, 1024 \rbrace\) and presents the corresponding confusion matrices. As becomes evident, for the “Centurion” and the “screw” models, smaller patches estimate the corresponding saliency values as successfully as larger patches. This observation indicates that it is possible to significantly reduce execution times without loss of accuracy, as the complexity study presented in the following sections reveals.


Fig. 7. (a) Original model. (b) Baseline saliency. (c) Saliency map with \(k=1,\!024\). (d) Saliency map with \(k=256\). (e) Saliency map with \(k=64\). (f) Saliency map with \(k=16\), where \(k\) is the number of faces included in the patch.

3.2 Robustness Evaluation of Proposed Deep Saliency

Figure 8 visualizes CNN-generated saliency maps of industrial models for different resolution, scale, rotation, and translation settings. The first row depicts the original model, a scaled version, and a randomly rotated and translated version, without alterations in the predicted saliency map. The second and third rows depict the same model at different resolutions, meaning that the model was simplified before the saliency map was recalculated. The number of vertices in the simplified models ranges from 90% to 50% of the initial value.


Fig. 8. Extracted saliency maps for different scale, rotation, and translation setups and different resolution quality. The decimated model contains (a) 90%, (b) 85%, (c) 80%, (d) 75%, (e) 70%, (f) 65%, (g) 60%, (h) 55%, and (i) 50%, of the initial number of vertices.

Likewise, the saliency map shows no apparent fluctuations. The same evaluation process is repeated for the “screw” model, presented in rows 4 to 6, and for the “composite” model, presented in rows 7 to 9, confirming similar observations.

Furthermore, Figure 9 visualizes saliency maps for different setups of added Gaussian noise ranging from \(N \sim (0,\;0.02\cdot \bar{L}_p)\) to \(N \sim (0,\;0.2\cdot \bar{L}_p)\), where \(\bar{L}_p\) is the average edge length. Two models are compared, the “propeller” model presented in rows 1 to 6 and the “gear” model presented in rows 7 to 12. Rows 3 and 4 show the model affected by the corresponding noise level and an enlarged detail. Finally, rows 5 and 6 show the corresponding saliency maps. As becomes evident, added noise affects the saliency maps, turning deep blue colours to light blue, corresponding to a deviation of low-saliency values toward higher-saliency values in the flat areas. However, high-saliency areas, i.e., the tips of the “propeller,” maintain their initial characterization. Similar observations are confirmed for the “gear” model. The observations mentioned above show that resolution, pose, and orientation do not affect the presented saliency extraction approaches, while added noise leads to higher misclassification toward the higher-saliency classes, yet the recognition of geometric features is maintained.

Fig. 9.

Fig. 9. Extracted saliency maps for different noise setups (a) \(\sigma =0.02\cdot L\) , (b) \(\sigma =0.04\cdot L\) , (c) \(\sigma =0.06\cdot L\) , (d) \(\sigma =0.08\cdot L\) , (e) \(\sigma =0.1\cdot L\) , (f) \(\sigma =0.12\cdot L\) , (g) \(\sigma =0.14\cdot L\) , (h) \(\sigma =0.16\cdot L\) , (i) \(\sigma =0.18\cdot L\) , and (j) \(\sigma =0.2\cdot L\) .

3.3 Complexity Study

Figures 10(a) and (b) present a performance evaluation of the deep learning–based approaches compared to the RPCA-based approach. The RPCA-based approach requires the decomposition of matrices containing patches of neighbouring faces for all the faces of the mesh, leading to very large matrices with dimensions \(n_f \times (3k)\). The resulting matrices are decomposed by robust principal component analysis, a computationally expensive process whose execution time increases steeply as the size of the matrix grows. The CNN- and FCN-based approaches act locally, receiving as input the neighbouring area of a certain face \(f_i\). However, for the FCN case, only 6% of the faces are used, significantly reducing the execution times. The sole drawback of the CNN- and FCN-based approaches is that the preprocessing step requires the rotation of normal coordinates around a computed rotation axis. Figures 10(a) and (b) compare the execution times for three models: the “head” model on the left, the “centurion” model in the middle, and the “stonecorner” model on the right. Different patch sizes were selected to demonstrate the effect on execution times. For the CNN, the patch size ranges from \(k_{CNN}=16\) to \(k_{CNN}=1024\); for the FCN, \(k_{FCN}=[64,256]\); while for the RPCA-based approach, the same patch size \(k_{RPCA}=60\) is used. Further increasing the RPCA patch size may lead to intractable computations, especially for large models. In each case, the execution time required by the CNN is lower than that of the RPCA-based method. Furthermore, in each case, the FCN is faster than both the RPCA- and CNN-based approaches. For a similar patch size (\(k_{FCN}=64\), \(k_{CNN}=64\), and \(k_{RPCA}=60\)), the CNN-based approach is 20 times faster than the RPCA-based method, and the FCN-based approach is 100 times faster than the RPCA-based method for a model with 300K faces, as Table 2 reveals.
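The local patch gathering that the CNN- and FCN-based approaches rely on can be sketched as a breadth-first traversal over face adjacency. The traversal order and helper names are assumptions for illustration; the paper's exact patch-construction and normal-rotation steps are not reproduced here:

```python
from collections import deque

def collect_patch(adjacency, seed_face, k):
    """Gather the k faces nearest (in hops) to seed_face by breadth-first
    traversal over face adjacency; such a local patch is what the
    network receives as input.  `adjacency` maps a face id to its
    edge-adjacent face ids."""
    patch, seen, queue = [], {seed_face}, deque([seed_face])
    while queue and len(patch) < k:
        face = queue.popleft()
        patch.append(face)
        for neighbour in adjacency[face]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return patch

# Four triangles arranged in a strip: 0-1-2-3 (illustrative adjacency).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
```

Because each patch touches only \(k\) faces, the per-face cost stays bounded, in contrast to the single global \(n_f \times 3k\) matrix that RPCA must decompose.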

Table 2.
Execution times (seconds):

Model        Faces  I/O (CPU)  RPCA (CPU, k=60)  CNN CPU (k=16/64/256)  CNN GPU (k=16/64/256)  FCN CPU (k=64/256)  FCN GPU (k=64/256)
Head         88K    18         635               27 / 57 / 171          30 / 59 / 180          43 / 160            16 / 46
Centurion    200K   41         1863              61 / 128 / 399         71 / 140 / 397         98 / 352            35 / 115
Stonecorner  300K   63         5775              100 / 198 / 637        117 / 217 / 608        153 / 547           59 / 166

Table 2. Execution Times for CNN-based Approaches and RPCA-based Approaches

Fig. 10.

Fig. 10. (a), (b) Execution time evaluation. \(n_f\) is the number of faces and \(k\) is the number of faces included in the patch.

For the testing and evaluation, all operations took place on an Intel Core i7-4790 CPU @ 3.60 GHz, with 32 GB of RAM and an NVIDIA GeForce GTX 1080 graphics card with 8-GB VRAM and compute capability 6.1, while the I/O operations were not taken into account in the execution time evaluation. It is essential to highlight that deep networks can take advantage of the GPU, while for the traditional geometric saliency extraction approaches, no GPU-based implementations are available in the literature.

3.4 Evaluation of Feature-aware Simplification and Compression of 3D Meshes

Figure 11 illustrates simplification scenarios under extreme target ratios ranging from \(70\%\) to \(90\%\). Enlarged details with the connections (i.e., edges) between the vertices of the simplified models are provided for easier visual comparison. As it becomes evident, the proposed granular scheme generates better results than a straightforward top-to-bottom scheme. Specifically, the granular approaches, visualized in Figure 11(c) and (d) for the CNN and FCN approaches, respectively, also sample uniformly different portions of vertices belonging to the less salient classes. In contrast, keeping only higher-class vertices leads to the loss of important details, as visualized in Figure 11(b). Additionally, Figure 11(a) provides a plot of the Hausdorff Distance (HD) error for the different simplification scenarios and approaches.
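The HD error plotted in Figure 11(a) is the symmetric Hausdorff distance; a minimal discrete sketch over two vertex sets (the toy coordinates are illustrative):

```python
import math

def hausdorff(points_a, points_b):
    """Symmetric discrete Hausdorff distance between two vertex sets:
    the largest distance from any point of one set to the closest
    point of the other set."""
    def directed(src, dst):
        return max(min(math.dist(p, q) for q in dst) for p in src)
    return max(directed(points_a, points_b), directed(points_b, points_a))

original = [(0, 0, 0), (1, 0, 0), (0, 1, 0)]
simplified = [(0, 0, 0), (1, 0, 0)]  # one vertex decimated
```

This brute-force form is quadratic in the number of vertices; production tools typically accelerate the nearest-neighbour queries with a spatial index.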

Fig. 11.

Fig. 11. Reconstructed simplified results under different simplification scenarios. (a) Original model and plots of HD error per different simplification scenarios. The models have been reconstructed using (b) the feature extraction method [16], (c) our proposed method utilizing the CNN approach, and (d) our proposed method utilizing the FCN approach.

Figure 12 presents the reconstruction error per vertex in terms of the mean theta error metric, defined as the mean value of the \(\theta\) angle difference between the ground-truth face normals and the reconstructed ones. The reconstruction error is plotted as a function of bits per vertex. For a more comprehensive comparison, we compare the CNN-based multiscale feature-aware compression with the RPCA-based feature-aware compression, uniform high-pass quantization [33], and the O3DGC encoder [23]. Blue corresponds to low \(\theta\) error, while red corresponds to high \(\theta\) error. The RPCA- and CNN-based feature-aware compression approaches exhibit similar mean \(\theta\) error. O3DGC and uniform high-pass quantization demonstrate a higher error rate; the former exhibits a blocky reconstructed surface and even higher error at lower bits-per-vertex values. Per-vertex visualization of the \(\theta\) error in Figure 12 and visual inspection of the reconstructed surface verify these observations. A close visual comparison between the RPCA- and CNN-based compression outcomes shows that the former exhibits lower error in high-curvature parts of the mesh, while the CNN-based solution has lower error in flat or low-curvature areas while being computationally much less intensive.
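The mean \(\theta\) error can be computed directly from the two sets of face normals; a minimal sketch with illustrative toy normals:

```python
import math

def angle_between(n1, n2):
    """Angle in radians between two unit normals (dot product clamped
    to [-1, 1] to guard against floating-point drift)."""
    dot = sum(a * b for a, b in zip(n1, n2))
    return math.acos(max(-1.0, min(1.0, dot)))

def mean_theta_error(gt_normals, rec_normals):
    """Mean theta error: average angular deviation between ground-truth
    and reconstructed face normals."""
    angles = [angle_between(g, r) for g, r in zip(gt_normals, rec_normals)]
    return sum(angles) / len(angles)

gt = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
rec = [(0.0, 0.0, 1.0), (0.0, 1.0, 0.0)]  # second normal off by 90 degrees
```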

Fig. 12.

Fig. 12. Mean \(\theta\) error as a function of compression ratio and \(\theta\) error visualization. Red colors correspond to higher \(\theta\) .

In Figure 13, we present the simplification results for six new 3D models, under two different simplification scenarios (namely 80% and 90% simplification), in comparison with six additional state-of-the-art and recent methods from the literature.

Fig. 13.

Fig. 13. Simplification of 3D models using the saliency mapping of different methods. (First line) Ninety percent simplification, (second line) 80% simplification, and (third line) heatmap of saliency mapping in four classes. The compared methods are (a) curvature co-occurrence histogram [36], (b) pointwise saliency detection [13], (c) fusion of eigen and RPCA saliency [7], (d) mesh saliency [17], (e) mesh saliency via spectral processing [32], (f) entropy-based salient model [34], (g) our CNN approach, and (h) our FCN approach.

3.5 Simplification Outcome Evaluation under Different Number of Classes and Percentage of Vertices per Class

The purpose of these experiments is to evaluate the accuracy of the reconstructed model under different simplification strategies. First, we separate the values of the estimated saliency mapping into four, six, and eight classes, respectively, and then, for each classification case, we follow the same steps for the simplification analysis. The percentage of the selected vertices per class is presented in Table 3 for the three different classification scenarios. The index of a class indicates its importance: the higher the index of a class, the more geometrically significant the vertices that belong to it. This means we intend to keep a higher percentage of vertices lying in the most important classes.

Table 3.
        c8      c7      c6      c5      c4      c3      c2      c1
(a)     —       —       —       —       100%    100%    30%     70%
(b)     —       —       100%    100%    80%     20%     30%     70%
(c)     100%    100%    100%    40%     15%     50%     30%     20%

Table 3. (a) Percentages of Vertices per Each Class (Four Classes), (b) Percentages of Vertices per Each Class (Six Classes), and (c) Percentages of Vertices per Each Class (Eight Classes)

The bold values of the tables show the percentage of vertices that we keep for the specific class, while the values in italics show the percentage of vertices that we keep, not of the total, but of the remaining number of vertices. An illustrative example is presented below. Assuming the existence of a model with 99,994 vertices, a simplified version of this model (70% simplification, or 30% remaining vertices) consists of 29,998 vertices. Regarding the CNN saliency mapping extraction approach, the classification of vertices to each class is c1 = 78,946 vertices, c2 = 19,800 vertices, c3 = 1,170 vertices, and c4 = 78 vertices (Figure 14(a)). However, regarding the FCN saliency mapping extraction approach, the classification of vertices to each class is c1 = 63,301 vertices, c2 = 34,491 vertices, c3 = 2,075 vertices, and c4 = 127 vertices (Figure 14(b)).
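The per-class selection described above can be sketched as uniform random sampling within each saliency class. The class populations follow the CNN worked example, while the keep fractions below are illustrative placeholders rather than Table 3's exact scheme:

```python
import random

def sample_per_class(class_sizes, keep_fractions, seed=0):
    """Keep a uniformly random fraction of vertex indices per saliency
    class; higher (more salient) classes get larger fractions."""
    rng = random.Random(seed)
    kept, offset = [], 0
    for size, fraction in zip(class_sizes, keep_fractions):
        ids = range(offset, offset + size)
        kept.extend(rng.sample(ids, round(fraction * size)))
        offset += size
    return sorted(kept)

# CNN class populations from the worked example (c1..c4); the keep
# fractions are illustrative, not the paper's exact percentages.
sizes = [78_946, 19_800, 1_170, 78]
kept = sample_per_class(sizes, [0.11, 0.5, 1.0, 1.0])
```

Because the per-class sampling is random, repeated runs keep different vertex subsets, which is exactly the variability examined in the experiments that follow.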

Fig. 14.

Fig. 14. (a) Distribution of vertices per class for the CNN approach. (b) Distribution of vertices per class for the FCN approach.

Figure 15 presents the visual simplified reconstructed results and the corresponding HD error, using different simplification scenarios and different classification approaches. The lowest values of HD error per case are shown in bold.

Fig. 15.

Fig. 15. Simplified reconstructed models. FCN implementation in (a) four classes, (b) six classes, and (c) eight classes, and CNN implementation in (d) four classes, (e) six classes, and (f) eight classes.

Additionally, in Figure 16, we visualize in different colours the class of each vertex, for both the initial and the simplified models. The classification of the vertices into different classes, as well as the selection of the remaining vertices per class, can indeed affect the reconstructed results. Since the vertices kept per class are selected by a uniform random function, the results may differ each time the algorithm runs.

Fig. 16.

Fig. 16. Heatmap that visualizes the class of each vertex. (a) Original heatmap of the initial model via FCN approach, (b) 90% simplified model via FCN approach, (c) 80% simplified model via FCN approach, (d) 70% simplified model via FCN approach, (e) original heatmap of the initial model via CNN approach, (f) 90% simplified model via CNN approach, (g) 80% simplified model via CNN approach, and (h) 70% simplified model via CNN approach.

In the following experiments, we show how this randomness can affect the reconstructed results. We start by assuming that the most important vertices of classes 3 and 4 remain as is (we do not remove them, since they represent the most geometrically salient information of the 3D surface), and we only change the percentages of the remaining vertices for the two less important classes 1 and 2, under different simplification strategies. Table 4 shows the different simplification approaches using different percentages of remaining vertices. The process is repeated 10 times for each simplification strategy. The HD errors of the reconstructed models are presented in Figure 17(a). Additionally, for easier comparison between the simplification strategies, Figure 17(b) presents the corresponding boxplots.

Table 4.
     c4      c3      c2      c1
     100%    100%    20%     80%
     100%    100%    30%     70%
     100%    100%    40%     60%
     100%    100%    50%     50%
     100%    100%    60%     40%

Table 4. Different Simplification Approaches Using Different Percentages of Remaining Vertices

Fig. 17.

Fig. 17. (a) HD error for different simplification approaches under different runs of the unified selection of vertices. (b) Boxplots of HD error under different simplification strategies.

3.6 Evaluation of Saliency Extraction in Dynamic Meshes

Observing Figure 18(a), we can see that most of the vertices of the dynamic 3D model Handstand (10,002 vertices and 175 frames) have a small first-ring area difference (i.e., \(\mathbf {d} \lt 0.1\)) for most of the frames. For these vertices, we do not need to re-estimate the saliency values; we can reuse the values of the previous frame. The results of this figure show that the first-ring area of only a few vertices per frame changes significantly, so only the saliency values of these vertices have to be re-estimated in that frame. Additionally, the experimental analysis has shown that the magnitude of \(\mathbf {d}\) is affected by the motion of the animation. More specifically, for the first and last frames, the value of \(\mathbf {d}\) is minimal (\(\lt\)0.1) for almost all the vertices (\(\sim\)10,000) of the models, since the motion in these frames is very smooth and slow.
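The first-ring area test can be sketched as follows. The relative-change normalization of \(\mathbf{d}\) and the toy two-triangle mesh are assumptions for illustration, since the exact formula is not restated here:

```python
import math

def triangle_area(a, b, c):
    """Area of a 3D triangle from the cross-product magnitude."""
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    cross = (u[1] * v[2] - u[2] * v[1],
             u[2] * v[0] - u[0] * v[2],
             u[0] * v[1] - u[1] * v[0])
    return 0.5 * math.sqrt(sum(x * x for x in cross))

def first_ring_area(vertices, faces, v):
    """Total area of the triangles incident to vertex v."""
    return sum(triangle_area(*(vertices[i] for i in f)) for f in faces if v in f)

def stale_vertices(prev_verts, curr_verts, faces, threshold=0.1):
    """Vertices whose first-ring area changed by d >= threshold between
    consecutive frames, i.e., the ones whose saliency must be
    re-estimated.  The relative-change form of d is an assumption."""
    flagged = []
    for v in range(len(curr_verts)):
        before = first_ring_area(prev_verts, faces, v)
        after = first_ring_area(curr_verts, faces, v)
        if before > 0 and abs(after - before) / before >= threshold:
            flagged.append(v)
    return flagged

# Two frames of a toy two-triangle mesh; vertex 1 moves in frame1.
frame0 = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
frame1 = [(0, 0, 0), (2, 0, 0), (1, 1, 0), (0, 1, 0)]
faces = [(0, 1, 2), (0, 2, 3)]
```

In this toy sequence only vertex 3 keeps its first-ring area, so only its saliency value would be carried over from the previous frame.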

Fig. 18.

Fig. 18. (a) Histogram representing the number of vertices per frame that have a specific value of first-ring area difference \(\mathbf {d}\) in [0–0.9]. (b) Heatmap visualization of different consecutive frames of the dynamic 3D mesh (Handstand). The vertices with a first-ring area difference \(\mathbf {d} \ge 0.1\) are also highlighted in red.

In Figure 18(b), we present two representative sub-sequences of frames of the same animation (i.e., (i) frames 2–10 and (ii) frames 78–86). The first row of each part presents the heatmap visualization of the saliency maps for nine consecutive frames. The second row highlights, in red, the vertices of each model with \(d \ge 0.1\); the remaining vertices are highlighted in blue. The third row depicts the heatmap visualization of the value \(\mathbf {d}\) for each model. The frames of the first sequence change slowly; as a result, the saliency map of each of these models remains almost the same. However, in the second sequence, the movement of the model is more intense, affecting more areas that need a re-estimation of their saliency map.

Skip 4DISCUSSION AND CONCLUSION Section

4 DISCUSSION AND CONCLUSION

In this article, we introduce a fully convolutional network for the fast and accurate estimation of the saliency mapping of 3D meshes. The deep architecture generates the saliency values of a patch of faces simultaneously, significantly speeding up the process at the cost of a small drop in accuracy. We further present multiscale, multi-locality extensions that facilitate compression and simplification applications, achieving aggressive compression ratios as well as simplification schemes that preserve salient features, minimizing the perceptually visible errors. Experimental results based on scanned and synthetic 3D models validated the performance of the proposed algorithms, their ability to discover discriminative features directly from 3D meshes in a fully unsupervised way, and their reliability and robustness in the presence of scanning noise, occlusion, and noise introduced by rotation and scale transformations.

The scale of the data acquired in real-time 3D scanning operations is growing rapidly, requiring the accurate reconstruction of entire scenes with various objects represented by structured shapes. These requirements introduce significant scientific challenges for a wide range of applications with tight timing restrictions, such as teleimmersion, real-time surface mapping and tracking, and aero-reconstruction for disaster management. Therefore, to support such challenging problems, the presented approaches should be able to extract spatiotemporal saliency maps for dynamic meshes in real time. This defines a clear future work path: utilizing the proposed schemes in a spatio-temporal sense and further extending them to function in an adaptive and distributed manner. Despite the significant progress in the landscape of saliency mapping, we believe that our approaches provide novel insight into a critical area with renewed research interest, where improvements such as online saliency mapping based on the motion, number, and scale of geometric features are feasible in the near future.

REFERENCES

[1] Alexiadis D. S., Chatzitofis A., Zioulis N., Zoidi O., Louizis G., Zarpalas D., and Daras P. 2017. An integrated platform for live 3D human reconstruction and motion capturing. IEEE Trans. Circ. Syst. Vid. Technol. 27, 4 (2017), 798–813.
[2] Alexiadis Dimitrios S., Zarpalas Dimitrios, and Daras Petros. 2012. Real-time, full 3-D reconstruction of moving foreground objects from multiple consumer depth cameras. IEEE Trans. Multimedia 15, 2 (2012), 339–358.
[3] Alexiadis Dimitrios S., Zarpalas Dimitrios, and Daras Petros. 2013. Real-time, realistic full-body 3D reconstruction and texture mapping from multiple kinects. In Proceedings of the Image, Video, and Multidimensional Signal Processing Workshop (IVMSP’13). IEEE, 1–4.
[4] An Guangming, Watanabe Taichi, and Kakimoto Masanori. 2016. Mesh simplification using hybrid saliency. In Proceedings of the International Conference on Cyberworlds (CW’16). IEEE, 231–234.
[5] Arvanitis Gerasimos, Lalos Aris S., and Moustakas Konstantinos. 2019. Adaptive representation of dynamic 3D meshes for low-latency applications. Comput. Aid. Geom. Des. 73 (2019), 70–85.
[6] Arvanitis G., Lalos A. S., and Moustakas K. 2019. Saliency mapping for processing 3D meshes in industrial modeling applications. In Proceedings of the IEEE 17th International Conference on Industrial Informatics (INDIN’19), Vol. 1. 683–686.
[7] Arvanitis G., Lalos A. S., and Moustakas K. 2021. Robust and fast 3-D saliency mapping for industrial modeling applications. IEEE Trans. Industr. Inf. 17, 2 (2021), 1307–1317.
[8] Asano Tetsuo, Ranjan Desh, Roos Thomas, Welzl Emo, and Widmayer Peter. 1997. Space-filling curves and their use in the design of geometric data structures. Theor. Comput. Sci. 181, 1 (1997), 3–15.
[9] Cornia Marcella, Baraldi Lorenzo, Serra Giuseppe, and Cucchiara Rita. 2018. Paying more attention to saliency: Image captioning with saliency and context attention. ACM Trans. Multimedia Comput. Commun. Appl. 14, 2, Article 48 (April 2018), 21 pages.
[10] Doumanoglou A., Drakoulis P., Zioulis N., Zarpalas D., and Daras P. 2019. Benchmarking open-source static 3D mesh codecs for immersive media interactive live streaming. IEEE J. Emerg. Select. Top. Circ. Syst. 9, 1 (2019), 190–203.
[11] Doumanoglou Alexandros, Griffin David, Serrano Javier, Zioulis Nikolaos, Phan Truong Khoa, Jiménez David, Zarpalas Dimitrios, Alvarez Federico, Rio Miguel, and Daras Petros. 2018. Quality of experience for 3-D immersive media streaming. IEEE Trans. Broadcast. 64, 2 (2018), 379–391.
[12] Favorskaya M. N. and Jain L. C. 2019. Saliency detection in deep learning era: Trends of development. Manage. Inf. Syst. 3 (2019), 10–36.
[13] Guo Yu, Wang Fei, and Xin Jingmin. 2018. Point-wise saliency detection on 3D point clouds via covariance descriptors. Vis. Comput. 34, 10 (2018), 1325–1338.
[14] He Jiale, Yang Gaobo, Liu Xin, and Ding Xiangling. 2020. Spatio-temporal saliency-based motion vector refinement for frame rate up-conversion. ACM Trans. Multimedia Comput. Commun. Appl. 16, 2, Article 55 (May 2020), 18 pages.
[15] Hu Xinjue, Shan Jingming, Liu Yu, Zhang Lin, and Shirmohammadi Shervin. 2020. An adaptive two-layer light field compression scheme using GNN-based reconstruction. ACM Trans. Multimedia Comput. Commun. Appl. 16, 2s, Article 72 (June 2020), 23 pages.
[16] Lalos Aris S., Arvanitis Gerasimos, Spathis-Papadiotis Aristotelis, and Moustakas Konstantinos. 2018. Feature aware 3D mesh compression using robust principal component analysis. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME’18). IEEE, 1–6.
[17] Lee Chang Ha, Varshney Amitabh, and Jacobs David W. 2005. Mesh saliency. In ACM SIGGRAPH 2005 Papers (SIGGRAPH’05). ACM, New York, NY, 659–666.
[18] Liu Shujie and Chen Chang Wen. 2012. A novel 3D video transcoding scheme for adaptive 3D video transmission to heterogeneous terminals. ACM Trans. Multimedia Comput. Commun. Appl. (TOMM) 8, 3s (2012), 1–21.
[19] Luo Guoliang, Deng Zhigang, Zhao Xin, Jin Xiaogang, Zeng Wei, Xie Wenqiang, and Seo Hyewon. 2020. Spatio-temporal segmentation based adaptive compression of dynamic mesh sequences. ACM Trans. Multimedia Comput. Commun. Appl. (TOMM) 16, 1 (2020), 1–24.
[20] Maimone Andrew, Bidwell Jonathan, Peng Kun, and Fuchs Henry. 2012. Enhanced personal autostereoscopic telepresence system using commodity depth cameras. Comput. Graph. 36, 7 (2012), 791–807.
[21] Maimone Andrew and Fuchs Henry. 2011. Encumbrance-free telepresence system with real-time 3D capture and display using commodity depth cameras. In Proceedings of the 10th IEEE International Symposium on Mixed and Augmented Reality. IEEE, 137–146.
[22] Maimone Andrew and Fuchs Henry. 2012. Real-time volumetric 3D capture of room-sized scenes for telepresence. In Proceedings of the 3DTV-Conference: The True Vision-capture, Transmission and Display of 3D Video (3DTV-CON’12). IEEE, Zurich, Switzerland, 1–4.
[23] Mamou Khaled, Zaharia Titus, and Prêteux Françoise. 2009. TFAN: A low complexity 3D mesh compression algorithm. Comput. Anim. Virt. Worlds 20, 2–3 (2009), 343–354.
[24] Milani Simone and Calvagno Giancarlo. 2010. A cognitive approach for effective coding and transmission of 3D video. ACM Trans. Multimedia Comput. Commun. Appl. (TOMM) 7, 1 (2011), 1–21.
[25] Nordfang Maria and Wolfe Jeremy M. 2014. Guided search for triple conjunctions. Attent. Percept. Psychophys. 76, 6 (2014), 1535–1559.
[26] Nouri Anass, Charrier Christophe, and Lézoray Olivier. 2015. Multi-scale saliency of 3D colored meshes. In Proceedings of the IEEE International Conference on Image Processing (ICIP’15). IEEE, Québec, 2820–2824.
[27] Nousias Stavros, Arvanitis Gerasimos, Lalos Aris S., and Moustakas Konstantinos. 2020. Mesh saliency detection using convolutional neural networks. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME’20). IEEE, London, 1–6.
[28] Nousias Stavros, Arvanitis Gerasimos, Lalos Aris S., Pavlidis George, Koulamas Christos, Kalogeras Athanasios, and Moustakas Konstantinos. 2020. A saliency aware CNN-based 3D model simplification and compression framework for remote inspection of heritage sites. IEEE Access 8 (2020), 169982–170001.
[29] Ronneberger Olaf, Fischer Philipp, and Brox Thomas. 2015. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-assisted Intervention. Springer, Munich, 234–241.
[30] Rossignac Jarek. 1999. Edgebreaker: Connectivity compression for triangle meshes. IEEE Trans. Vis. Comput. Graph. 5, 1 (1999), 47–61.
[31] Song Ran, Liu Yonghuai, Martin Ralph R., and Echavarria Karina Rodriguez. 2018. Local-to-global mesh saliency. Vis. Comput. 34, 3 (2018), 323–336.
[32] Song Ran, Liu Yonghuai, Martin Ralph R., and Rosin Paul L. 2014. Mesh saliency via spectral processing. ACM Trans. Graph. (TOG) 33, 1 (2014), 1–17.
[33] Sorkine Olga, Cohen-Or Daniel, and Toledo Sivan. 2003. High-pass quantization for mesh encoding. In Proceedings of the 2003 Eurographics/ACM SIGGRAPH Symposium on Geometry Processing (SGP’03). Eurographics Association, Aachen, 42–51.
[34] Tao Pingping, Zhang Lina, Cao Junjie, and Liu Xiuping. 2016. Mesh saliency detection based on entropy. In Proceedings of the 6th International Conference on Digital Home (ICDH’16). IEEE, Guangzhou, 288–295.
[35] Watanabe H., Sasaki H., Ikeya K., Okaichi N., Kano M., Omura T., Hisatomi K., Kawakita M., and Mishina T. 2020. 3D video technology based on spatial imaging for advanced broadcasting. SMPTE Motion Imag. J. 129, 9 (2020), 24–30.
[36] Wei Ning, Gao Kaiyuan, Ji Rongrong, and Chen Peng. 2018. Surface saliency detection based on curvature co-occurrence histograms. IEEE Access 6 (2018), 54536–54541.
[37] Wu Jinliang, Shen Xiaoyong, Zhu Wei, and Liu Ligang. 2013. Mesh saliency with global rarity. Graph. Models 75, 5 (2013), 255–264.
[38] Xing S., Sang X., Cao L., Guan Y., and Li Y. 2020. A real-time super multiview rendering pipeline for wide viewing-angle and high-resolution 3D displays based on a hybrid rendering technique. IEEE Access 8 (2020), 85750–85759.
[39] Zhang Jun, Wang Meng, Lin Liang, Yang Xun, Gao Jun, and Rui Yong. 2017. Saliency detection on light field: A multi-cue approach. ACM Trans. Multimedia Comput. Commun. Appl. (TOMM) 13, 3 (2017), 1–22.
[40] Zhao Yitian, Liu Yonghuai, Song Ran, and Zhang Min. 2012. Extended non-local means filter for surface saliency detection. In Proceedings of the 19th IEEE International Conference on Image Processing. IEEE, Orlando, Florida, 633–636.


Published in ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 2 (March 2023), 540 pages. ISSN: 1551-6857; EISSN: 1551-6865; Issue DOI: 10.1145/3572860. Editor: Abdulmotaleb El Saddik.

Copyright © 2023 held by the owner/author(s). Publisher: Association for Computing Machinery, New York, NY, United States.

Publication history: Received 13 December 2021; revised 28 May 2022; accepted 8 July 2022; online AM 27 July 2022; published 6 February 2023.
