Gender Classification via Graph Convolutional Networks on 3D Facial Models

The automatic classification of human gender and other demographic attributes, such as age and ethnicity, is gaining significant attention. These attributes provide rich information with applications in personalization, behavior analysis, consumer research, digital forensics, security, human-computer interaction, and mobile applications. In the literature, the face is a commonly used feature for gender classification. In this paper, we follow this line of work but refer to 3D face data, which offers advantages in terms of capturing spatial information and reducing sensitivity to ethnicity and acquisition conditions. In particular, we address gender classification using RGB-D data, which is structured as graphs and processed using a Graph Convolutional Neural Network (GCNN). Experiments conducted on the BP4D+ dataset demonstrate the effectiveness of this approach.


INTRODUCTION
Automatic classification of human gender and other demographic attributes, such as age and ethnicity, is gaining significant attention due to the rich and distinct information these attributes provide [33]. Indeed, numerous domains, including personalization and recommender systems, behavior analysis, consumer research, digital forensics, security and biometrics, human-computer interaction, and mobile applications, stand to improve their performance by having access to user gender information [22].
According to the literature, the approaches for gender detection based on data derived from the human body can be classified as appearance-based and non-appearance-based. The appearance-based approaches can be further categorized into those using static body features (face, hands, fingernails, body shape), dynamic body features (gesture, motion, gait, voice), and clothing features (clothing, footwear). The non-appearance-based approaches use biometric features (fingerprint, iris, ear, skin colour), bio-signals (DNA, EEG, ECG), and social information (blog, email, handwriting) for gender detection.
Beyond doubt, the face is the most frequently employed feature in gender classification [2,10,17,36,37], as it is easy to capture and serves other purposes such as identity [9,11] and expression recognition [4,20]. Depending on the field of application, face images can be characterized by RGB images acquired either in the wild [17] or in more controlled conditions [13]. Certainly, when it comes to evaluating anthropometric facial features, RGB-D data is the most suitable choice [40]. Indeed, three-dimensional data captures the spatial information of the face, providing a richer representation of facial characteristics. By incorporating depth information, 3D data can offer a more comprehensive view of facial features, making it better suited for multiple classification tasks, such as gender classification. Furthermore, with the growing availability and popularity of devices designed for capturing RGB-D data, it is evident that such data could become a practical and significant option in many applications, including automatic gender recognition.
The benefit of 3D face representation lies in its emphasis on geometric attributes rather than appearance-based characteristics. This shift diminishes sensitivity both to ethnic differences, including facial features such as skin tone and hairstyle, and to acquisition conditions such as lighting, camera angles, and image quality, thereby enhancing generalization across diverse populations and acquisition scenarios.
Nevertheless, 3D data poses distinct challenges. Primarily, it is susceptible to changes in facial expressions that alter the facial structure. Furthermore, the current lack of a sizable dataset containing RGB-D faces raises legitimate concerns about the use of deep learning methods, which notably require large volumes of labeled data in order to avoid overfitting and bias in the model's predictions.
In this paper, we address the gender classification problem using RGB-D data, and to achieve this, we suggest employing a Graph Neural Network (GNN), due to its established capacity for robust generalization on both 2D and 3D data [25,30]. Specifically, we implement a Graph Convolutional Neural Network (GCNN) [38], which is highly adept at capturing both the local and global geometric features present in 3D facial data. Indeed, GCNNs exhibit a remarkable capability to gather information from nearby facial landmarks, facilitating a comprehensive understanding of the intricate nuances within facial structures. This competence proves pivotal in tasks like gender classification, where the discernment of subtle geometric patterns indicative of gender plays a crucial role.
The experiments involve comparisons on the BP4D+ dataset [39], which comprises recordings from 140 participants with a large range of attributes, including gender, age, facial expressions, and ethnic variations. These comparisons with state-of-the-art methods in the field show the effectiveness of our approach.

RELATED WORKS
To our knowledge, the latest survey on gender recognition dates back to 2016 [22], and thus does not include deep learning based solutions. Traditionally, methods required a feature extraction module to extract spatial or textural features (e.g. LBP, wavelets, ...), followed by a classification module (e.g. support vector machines, linear discriminant analysis, ...) [3,29].
Since then, Convolutional Neural Networks (CNNs) have emerged as the leading approach in numerous computer vision tasks, including gender recognition, often tackled together with other demographic tasks (e.g. age, ethnicity). Levi and Hassner were among the first to propose a simple CNN architecture, with 5 layers, to perform age and gender prediction [19]. Besides, the effectiveness of employing pre-trained CNNs like AlexNet or VGG has been exemplified in numerous studies, as evidenced by papers such as [18,24,33].
Advancements have been achieved by integrating attention mechanisms into feed-forward models, enabling them to identify the most informative and reliable facial components for the specific task at hand, as demonstrated in [27]. In the same vein, Abdolrashidi et al. [1] proposed an ensemble of attentional and residual convolutional networks. The effectiveness of attention mechanisms has been amplified with the advent of transformers [34], which have been consistently gaining prominence in various tasks and benchmarks. Gender recognition is no exception, and Kuprashevich and Tolstykh [17] introduced a transformer model, namely MiVOLO, for age and gender estimation, establishing it as the current state-of-the-art approach.
To the best of our knowledge, no model working on 3D data has been proposed for this task yet.

LANDMARK-BASED GRAPH STRUCTURE
When working with a 3D facial mesh, various approaches can be employed for classification tasks. Direct manipulation of facial point clouds is one option, as demonstrated in techniques like PointNet [30]. Another method revolves around working with the graph structure that defines the mesh itself.
Let $M = (V, E)$ denote a mesh, with $V = \{v_i\}_{i=1}^{N} \subset \mathbb{R}^3$ representing the points in 3D Euclidean space and $E$ the connections between them. Due to the large number of points in a mesh, we adopt a method inspired by anthropometry in medicine that leverages a set of facial landmarks to capture key features of the human face [7,8].
In this way, we define a landmark-based graph $G = (L, E_L)$, where $L = \{l_i\}_{i=1}^{n} \subset V$, with $n \ll N$, is a set of extracted facial landmarks and $E_L$ the set of edges computed via $k$-NN$(l_i)$, for each $i \in [1..n]$. The rationale behind this approach is to focus on a few significant points and characterize them with a robust discriminative representation (cf. Sec. 3.1), such as surface curvatures or geometric relations such as distances.
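To make the construction concrete, the sketch below builds the undirected $k$-NN edge set over a set of extracted landmarks using plain NumPy. The function name and the brute-force pairwise-distance strategy are illustrative choices, not the paper's implementation:

```python
import numpy as np

def knn_graph(landmarks, k=32):
    """Build the undirected k-NN edge set over (n, 3) landmark coordinates."""
    n = landmarks.shape[0]
    # Brute-force pairwise Euclidean distances (n x n).
    diff = landmarks[:, None, :] - landmarks[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)  # exclude self-loops
    edges = set()
    for i in range(n):
        for j in np.argsort(dist[i])[:k]:  # k nearest neighbours of l_i
            edges.add((min(i, int(j)), max(i, int(j))))  # store undirected
    return edges
```

With $n = 84$ landmarks and $k = 32$, this yields at most $84 \cdot 32$ undirected edges, fewer when two nodes select each other.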
There is a significant body of literature on 3D facial landmark detection [14], with methods falling into two main categories: those based on 3D geometric information [21,23], and those relying on statistically learned models [15,32].
Here, we adopt a statistically learned model, more precisely the MVLM proposed by Paulsen et al. [26]. Briefly, MVLM exploits the 3D face mesh to render several views from different viewpoints and uses a CNN-based model to estimate the 2D location of each landmark in each view. These estimates are then combined using a least-squares (LSQ) and RANSAC approach to obtain a robust and reliable estimate of the 3D location of $n = 84$ landmarks. The MVLM model has been deployed in several pre-trained versions with different combinations of datasets and rendering methods, including RGB renderings as well as depth and geometry ones. In our work we refer to the version trained using geometry+depth image channels on the BU-3DFE dataset, which contains 100 subjects (56 female, 44 male), and which achieved an average error of 2.42 mm in localizing the 84 3D landmarks (Figure 1).

Node-level Feature Representation
In this section, a joint multimodal embedding space that includes both local geometric features and simple empirical statistics is proposed and motivated.
As geometric features we simply use the 3D coordinates of each landmark $l_i \in L$, i.e. $\mathrm{pos}(l_i) = (x_i, y_i, z_i)$, and for each pair of distinct landmarks $l_i, l_j \in L$ we compute the geodesic distance from $l_i$ to $l_j$. The computation of approximate shortest (i.e. geodesic) paths on a triangle mesh is a common operation in many computer graphics applications [5]. In particular, here we implemented a method based on the heat equation [6], obtaining the vector $\mathrm{geo}(l_i)$ of geodesic distances between a node $l_i \in L$ and all the landmarks in $L$.
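The heat-method solver of [6] involves sparse linear systems; as a lighter-weight stand-in, the sketch below approximates geodesic distances by Dijkstra shortest paths along mesh edges, with Euclidean edge lengths as weights. This is a common surrogate in graphics code, not the method used in the paper:

```python
import heapq
import numpy as np

def dijkstra_geodesics(vertices, edges, source):
    """Approximate geodesic distances from `source` to all mesh vertices
    by shortest paths along mesh edges (a surrogate for the heat method).

    vertices: (n, 3) array; edges: iterable of (a, b) vertex index pairs.
    """
    n = len(vertices)
    adj = [[] for _ in range(n)]
    for a, b in edges:
        w = float(np.linalg.norm(vertices[a] - vertices[b]))
        adj[a].append((b, w))
        adj[b].append((a, w))
    dist = [float("inf")] * n
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist
```

Edge-path distances overestimate true geodesics on coarse meshes, which is why the heat method is preferred in the paper.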
Furthermore, following [28], we introduce a feature called Fast Point Feature Histograms (FPFH), a method for representing feature points in 3D point cloud data, which can be used to accurately label points based on the type of surface they lie on. This approach aims to capture the geometric relationships between each point in a 3D point cloud and its neighboring points, going beyond surface normal and curvature estimations to characterize the mean curvature surrounding a specific point. The feature representation is based on a multi-dimensional histogram that characterizes the local geometry around a query point. A key property of this representation is its invariance to pose (i.e. 3D rotations and translations) and sampling density, and it copes well with noisy sensor data.
To define the feature space computational model, the authors of [28] introduce the following elements. For each pair of nodes $l_i$ and $l_j$, with coordinates $p_i, p_j$ and surface normals $n_i, n_j$, we build a fixed reference frame consisting of the three unit vectors $u, v, w$ defined as follows:

$$u = n_i, \qquad v = u \times \frac{p_j - p_i}{\|p_j - p_i\|}, \qquad w = u \times v,$$

where $\times$ denotes the cross product of two vectors.
Using the defined reference frame, the difference between the two normals $n_i$ and $n_j$ can be expressed as a set of angular features:

$$\alpha = v \cdot n_j, \qquad \phi = u \cdot \frac{p_j - p_i}{\|p_j - p_i\|}, \qquad \theta = \arctan\big(w \cdot n_j,\; u \cdot n_j\big),$$

where $\cdot$ denotes the scalar product of two vectors. The attributes $\theta$ and $\alpha$ represent $n_j$ as an azimuthal angle and the cosine of a polar angle, respectively, while $\phi$ represents the direction of the translation from $p_i$ to $p_j$.
In the model proposed in [28], the 3D feature distribution of points sampled from the 3D face surface is represented by histograms. In particular, for a point $l_i$ each feature of the tuple $(\alpha, \phi, \theta)$ is mapped onto exactly one bin $b_\alpha, b_\phi, b_\theta$ of the histograms $H_\alpha(l_i)$, $H_\phi(l_i)$ and $H_\theta(l_i)$, respectively, where $b_\alpha, b_\phi, b_\theta \in [1..11]$, thus providing three 11-dimensional histograms for each tuple. The final 33-dimensional SPFH (Simple Point Feature Histogram) for each vertex $l_i$ is obtained by horizontally concatenating the three histograms, i.e.

$$\mathrm{SPFH}(l_i) = \big[\, H_\alpha(l_i) \;\|\; H_\phi(l_i) \;\|\; H_\theta(l_i) \,\big].$$
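A minimal sketch of the SPFH construction described above, assuming precomputed per-landmark surface normals; the binning ranges ([-1, 1] for the two cosine-like features, [-pi, pi] for the azimuth) are conventional choices in FPFH implementations, not values stated in the paper:

```python
import numpy as np

def pair_angles(p_i, n_i, p_j, n_j):
    """Darboux-frame angular features (alpha, phi, theta) for a point pair,
    following the PFH/FPFH construction."""
    d = p_j - p_i
    dn = np.linalg.norm(d)
    u = n_i
    v = np.cross(u, d / dn)
    w = np.cross(u, v)
    alpha = float(np.dot(v, n_j))                 # cosine of a polar angle
    phi = float(np.dot(u, d / dn))                # translation direction
    theta = float(np.arctan2(np.dot(w, n_j), np.dot(u, n_j)))  # azimuth
    return alpha, phi, theta

def spfh(point_idx, points, normals, neighbors, bins=11):
    """33-dim Simple Point Feature Histogram: one 11-bin histogram per
    angular feature, horizontally concatenated."""
    h = np.zeros((3, bins))
    ranges = [(-1.0, 1.0), (-1.0, 1.0), (-np.pi, np.pi)]
    for j in neighbors:
        feats = pair_angles(points[point_idx], normals[point_idx],
                            points[j], normals[j])
        for c, (f, (lo, hi)) in enumerate(zip(feats, ranges)):
            b = min(int((f - lo) / (hi - lo) * bins), bins - 1)
            h[c, b] += 1
    return h.reshape(-1)  # concatenation -> 33 values
```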
The final FPFH descriptor representing each point of a face is computed by collecting the local information of each landmark point $l_i \in L$:

$$\mathrm{FPFH}(l_i) = \mathrm{SPFH}(l_i) + \frac{1}{k} \sum_{j=1}^{k} \frac{1}{\omega_{ij}} \cdot \mathrm{SPFH}(l_j),$$

where $\omega_{ij}$ is the distance between the query node $l_i$ and its neighbor $l_j$.
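Given precomputed SPFH vectors, the weighted combination above can be sketched as follows (the function name and argument layout are illustrative):

```python
import numpy as np

def fpfh(i, points, neighbors, spfh_all):
    """FPFH(l_i) = SPFH(l_i) + (1/k) * sum over neighbors j of SPFH(l_j)/omega_ij,
    where omega_ij is the Euclidean distance between l_i and neighbor l_j."""
    k = len(neighbors)
    acc = np.zeros_like(spfh_all[i])
    for j in neighbors:
        omega = np.linalg.norm(points[i] - points[j])
        acc += spfh_all[j] / omega
    return spfh_all[i] + acc / k
```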
In our approach the node characterization is obtained by horizontally concatenating the three sets of features defined above, that is point coordinates (3-dim), geodesic distances (84-dim) and histograms (33-dim), yielding the comprehensive 120-dimensional vector:

$$h^{(0)}(l_i) = \big[\, \mathrm{pos}(l_i) \;\|\; \mathrm{geo}(l_i) \;\|\; \mathrm{FPFH}(l_i) \,\big] \in \mathbb{R}^{120}. \qquad (2)$$

These node-level vectors, interpreted as embeddings in a 120-dimensional space, enable the use of the GNN model introduced in the next section, where the nodes of the graph correspond to landmarks of the face.
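The assembly of the 120-dimensional node feature of Eq. (2) is then a plain concatenation (a trivial sketch, with an illustrative function name):

```python
import numpy as np

def node_embedding(pos, geo, fpfh_vec):
    """h^0(l_i) of Eq. (2): coordinates (3) || geodesic distances (84) || FPFH (33)."""
    h = np.concatenate([pos, geo, fpfh_vec])
    assert h.shape == (120,), "expected 3 + 84 + 33 = 120 features"
    return h
```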

GENDER-GCNN RECOGNITION METHOD
Graph Neural Networks (GNNs) employ multiple layers to process node representations. In a nutshell, under the assumption of a static graph and a single input feature vector per node, at each layer indexed by $t \ge 0$ (with $t = 0$ representing the input layer), a GNN calculates a node representation by gathering information from the node's neighborhood through an aggregator. Stacking $T$ layers in a GNN allows a node's $T$-hop neighborhood to affect its representation.
Since we are dealing with a prediction problem over the entire graph (a graph-level task), we need to consider the relationships between all of the nodes in the graph. In our scenario, the graph structure is explicitly induced by the landmark-based graph described in the previous section, where the embedding initially assigned to each node $l_i$ corresponds to that calculated in Eq. (2). The edges in the graph are undirected, with no features, and are built by linking each node to its $k$ nearest neighbours, with $k = 32$.
Formally, let $h^{(t)}(u)$ be the embedding of node $u$ at layer $t$, with $h^{(0)}(u)$ defined in Eq. (2). A nonlinear aggregation mechanism is employed for the evolution of $h^{(t)}(u)$, taking into account its embedding at the previous layer, $h^{(t-1)}(u)$, as well as those of its neighbors in $\mathcal{N}(u)$. In this way, in each message-passing iteration of the GNN, the embedding $h^{(t)}(u)$ of each node $u \in L$ is updated based on information aggregated from $u$'s neighbors, that is

$$h^{(t)}(u) = \mathrm{AGG}^{(t)}\Big(\big\{\, \mathrm{MSG}^{(t)}\big(h^{(t-1)}(v)\big) : v \in \mathcal{N}(u) \cup \{u\} \,\big\}\Big), \qquad (3)$$

where $\mathrm{MSG}^{(t)}$ and $\mathrm{AGG}^{(t)}$ are arbitrary differentiable functions representing message computation and aggregation, respectively. While $\mathrm{MSG}^{(t)}$ employs a Multi-Layer Perceptron (MLP) network, $\mathrm{AGG}^{(t)}$ can implement various aggregators, such as graph convolution [16], attention mechanisms [35], or pooling [12].
In this work, Eq. (3) is implemented in the Gender-GCNN Layer according to the following two phases. (1) Message computation: each node receives messages from all of its 32 neighbors, and each message is computed using a shared-weight MLP, i.e.

$$m^{(t)}(v) = \mathrm{MLP}^{(t)}\big(h^{(t-1)}(v)\big).$$

(2) Node feature update: the node features are aggregated using the MAX operator, i.e.

$$h^{(t)}(u) = \max_{v \in \mathcal{N}(u) \cup \{u\}} m^{(t)}(v),$$
where the MAX function combines the information from all of a node's neighbors' messages into a single vector. This vector is then used to update the node's own features, so that the node learns to represent its neighbors and itself in a more informative way.
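The two phases above can be sketched in a few lines of NumPy. The single-linear-layer MLP with ReLU is a simplifying assumption for illustration; the paper's MLP architecture is not specified here:

```python
import numpy as np

def gcnn_layer(H, neighbors, W, b):
    """One Gender-GCNN layer sketch.

    Phase 1: a shared-weight MLP (here one linear layer + ReLU) computes a
    message for every node. Phase 2: each node takes the element-wise MAX
    over the messages of its neighbors and itself.

    H: (n, d_in) node features; neighbors: list of neighbor index lists;
    W: (d_in, d_out) weights; b: (d_out,) bias.
    """
    M = np.maximum(H @ W + b, 0.0)     # shared MLP applied to every node
    H_new = np.empty_like(M)
    for u, nbrs in enumerate(neighbors):
        idx = list(nbrs) + [u]         # include the node itself
        H_new[u] = M[idx].max(axis=0)  # element-wise MAX aggregation
    return H_new
```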
Our final Gender-GCNN network model (see Figure 3) consists of four Gender-GCNN Layers. After a first convolutional layer that preserves the feature dimensionality, the network progressively reduces it from 120 to 64, then to 32, and finally to 16.
Following the four Gender-GCNN Layers, the information of all nodes is aggregated into a single super-node, creating a representation of the entire graph via a global max-pooling approach: each feature of the super-node is determined by taking the maximum value among the corresponding features of all the nodes in the graph. With this approach, the super-node is expected to contain all the relevant information and fully characterize the associated graph. Subsequently, the features of the super-node are passed to the final classifier, which is implemented as a fully-connected (FC) layer. The FC layer includes a softmax layer at the end, which normalizes the sum of the predicted values for the two target classes to 1. By applying this normalization, the output can be interpreted as the model's predicted probability of the input belonging to each class.
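Putting the pieces together, a forward pass of the whole model can be sketched as follows. The layer widths (120, 120, 64, 32, 16) follow the description above, while the random weights, single-linear-layer MLPs, and function names are illustrative assumptions only:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gender_gcnn_forward(H, neighbors, layers, W_fc, b_fc):
    """Sketch of the full Gender-GCNN forward pass: four GCNN layers
    (120 -> 120 -> 64 -> 32 -> 16), global max-pooling into a super-node,
    and a fully-connected softmax classifier over the two classes."""
    for W, b in layers:                       # four Gender-GCNN layers
        M = np.maximum(H @ W + b, 0.0)        # shared MLP (message phase)
        H = np.stack([M[list(nb) + [u]].max(axis=0)
                      for u, nb in enumerate(neighbors)])
    g = H.max(axis=0)                         # super-node: global max-pool
    return softmax(g @ W_fc + b_fc)           # 2-class probabilities
```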

EXPERIMENTAL RESULTS
In this section, we describe the experimental setup and methods used to compare our proposed Gender-GCNN model, which works on 3D data, to other deep learning models designed for gender classification. In the following, we detail the dataset used, the models referred to for comparison, the experimental protocol, and the obtained results.

The dataset
The Multimodal Spontaneous Emotion (MMSE) dataset [39], also referred to as BP4D+, consists of 1400 multimodal recordings from 140 participants varying in gender, age, and racial ancestry. For each recording, the data includes a 3D dynamic facial model (each model has 30-50k vertices), two 2D texture videos acquired at 25 fps with resolution 1040 × 1392 pixels (using the Di3D dynamic imaging system), and a thermal video (acquired at 25 fps with resolution 640 × 480 pixels). An example of a record from the dataset is shown in Figure 4.
Furthermore, the Biopac MP150 data acquisition system is used to collect physiological signals, including blood pressure, respiration frequency, and EDA. Participants took part in 10 different tasks conceived to elicit different emotions. Each frame is manually labelled with its corresponding Action Unit (AU) vector, which delineates the activation status of each AU: 0 signifies inactivity, while 1 indicates activity.
In this study, only the 2D textures, 3D meshes and AU annotations of 136 people have been utilized, these being the publicly available ones.

Comparison models
To evaluate the performance of our proposed 3D-based method, we compare it to two state-of-the-art 2D gender recognition models.
The first model we consider is DeepFace [31], a facial recognition system developed by Facebook (now Meta Platforms, Inc.) that uses deep learning algorithms to identify and verify faces in photos and videos. It gained attention for its ability to achieve high accuracy in facial recognition tasks, including gender estimation and emotion analysis.
The second is MiVOLO (Multi Input VOLO) [17], a recent work that proposes a straightforward approach for age and gender estimation using a recent vision transformer. MiVOLO provides several pre-trained models, each with a different combination of input (face, body) and output (age, gender). The one used in this work requires only the face as input and provides age and gender as output.

Experiments
As previously mentioned, we carried out the experiments utilizing the BP4D+ dataset. As an initial study, we evaluated the methods under favorable conditions, reducing the complexity introduced by variations in facial expressions. To achieve this, we focused on data from a single task out of the 10 included in BP4D+. Specifically, we utilized data from Task 1, which is characterized by relaxed reactions and happiness responses.
Within this scenario, we constructed multiple datasets, gradually relaxing the constraint on the neutrality of facial expressions. More specifically, a perfectly neutral frame can be described as one where specialized psychologists carefully evaluate the action units, ensuring that their cumulative total equals 0. However, this requirement is rarely met in practice, so we relaxed the criterion at different levels, selecting, when possible, 10 frames for each person where the action units sum to less than 3, 4, or 5, respectively.
Let $V_i$ be the video of the $i$-th subject in the dataset and $F_i^s$ a collection of its frames selected according to the following rule:

$$F_i^s = \{\, f \in V_i : \mathrm{AU}(f) < s \,\},$$

where $s \in \{3, 4, 5\}$ and $\mathrm{AU}(f)$ is the sum of the AUs of the frame $f$. The final dataset for a given level $s$ is then given by

$$D_s = \bigcup_{i=1}^{P} F_i^s,$$

where $P$ is the number of subjects in the original dataset.
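The per-subject selection rule, capped at 10 frames per video as described above, can be sketched as:

```python
def select_frames(au_sums, s, max_frames=10):
    """Select up to `max_frames` frame indices of one subject's video whose
    total AU activation is below the threshold s (s in {3, 4, 5})."""
    picked = [t for t, a in enumerate(au_sums) if a < s]
    return picked[:max_frames]
```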
Additionally, we generated two more datasets with even less control over the neutrality of expressions, not relying on the psychologists' labeling. In one case, we leveraged the automatic expression classification provided by DeepFace, resulting in the dataset $D_{DF}$. In the other, we performed random sampling of frames from the entire video collection without considering specific facial expressions, yielding $D_{rand}$.
The $D_{DF}$ dataset collects up to 10 frames for each subject by identifying frames where the subject shows a neutral expression, with a neutral score greater than a threshold that we fixed at 95%.
To obtain $D_{rand}$, we performed a random sampling of 10 frames from each video without any control over the represented faces. This approach resulted in the inclusion of frames exhibiting a variety of facial expressions and occasional occlusions caused by the movements of various body parts; for instance, some frames feature an arm obstructing the camera view. These variations represent potential challenges for the classification process. It is also worth noting that $D_{rand}$ includes a wider range of facial expressions, which makes it more challenging to train a model on. However, it also makes the model more generalizable to new data, since it is less likely to overfit the training data.
The obtained datasets are characterized by low cardinality, the limited amount of data being particularly severe for the very selective dataset $D_3$, which incorporates 94 out of 136 subjects. This is due to the fact that several videos exhibit few or no frames with a total AU score lower than 3. Similarly, the datasets $D_4$ and $D_5$ incorporate 111 and 122 subjects, respectively, while $D_{DF}$ and $D_{rand}$ comprise 135 and 136 subjects, respectively. Furthermore, it is worth noting that, with the exception of $D_{rand}$, which contains precisely 10 frames per subject, the other datasets have, on average, 9 frames per subject meeting the specified criteria. This results in the cardinalities detailed in the first row of Table 2.
The generated datasets were used to evaluate the performance of the Gender-GCNN model, the Transformer-based system MiVOLO, and the DeepFace model. Data is used at the frame level, meaning that 2D images are used to test the MiVOLO and DeepFace models, while the 3D models corresponding to the chosen frames are used to evaluate the Gender-GCNN model. For our experiments, we employed a pre-trained MiVOLO model provided by the authors, which was originally trained on the IMDB dataset.
As for DeepFace, experiments were carried out using a model based on a pre-trained VGG-Face model, where both the IMDB and WIKI datasets had been used for the training and testing phases.
To train the Gender-GCNN model, we used leave-one-subject-out cross-validation (LOSOCV) to address the problem of limited data in each dataset. LOSOCV comes with several advantages. It helps to ensure that the model is robust by evaluating its performance on a wide variety of subjects in the dataset, allowing us to make the most of the available data and obtain accurate performance estimates. This technique also aids in mitigating biases that may exist in the dataset, reducing the impact of subject-specific idiosyncrasies. Moreover, we enhanced the robustness of our metrics by calculating both mean accuracy and standard deviation over multiple trials of LOSOCV (specifically, 10 trials were set up). A summary of the hyperparameters used for the training procedure is listed in Table 1. To address the imbalance in the dataset classes, we also implemented an oversampling technique, which duplicates samples from a subset of the training data until a perfect balance between male and female classes is achieved. Table 2 presents the performance results achieved by these models when tested on the datasets described earlier.
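The oversampling step can be sketched as follows; the random duplication of minority-class samples is our reading of "duplicating samples from a subset of the training data", and the function name is illustrative:

```python
import random

def oversample_balance(X, y, seed=0):
    """Duplicate randomly chosen minority-class samples until the two
    gender classes are perfectly balanced (applied to the training split only)."""
    rng = random.Random(seed)
    by_cls = {}
    for x, c in zip(X, y):
        by_cls.setdefault(c, []).append(x)
    target = max(len(v) for v in by_cls.values())
    X_out, y_out = [], []
    for c, items in by_cls.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for x in items + extra:
            X_out.append(x)
            y_out.append(c)
    return X_out, y_out
```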
Despite being trained on datasets with significantly fewer frames than the MiVOLO and DeepFace models, the Gender-GCNN model achieves superior gender recognition performance, with a mean accuracy of 90.41% and a minimal standard deviation. This remarkable ability is due to several inherent strengths of our model. One key strength of the Gender-GCNN model is its ability to capture subtle geometric features that are invariant to ethnicity, appearance, and facial expression. Unlike traditional models, which can struggle with variations in these factors, GCNNs are able to learn hierarchical features that transcend these differences. This adaptability to geometric features allows the Gender-GCNN model to generalize effectively and robustly, even when trained on limited data. Another strength of the Gender-GCNN model is its feature engineering process, which leverages 3D data to compute FPFH and geodesic distances. These features provide a more accurate representation of the facial structure than 2D images, which inherently lose depth information. 3D data also retains spatial relationships and surface curvatures, leading to a richer and more realistic portrayal of the face.

CONCLUSIONS
The automatic classification of human gender and other demographic attributes has gained significant attention in various domains. 3D data offers several advantages for gender classification, including increased robustness across diverse populations and acquisition scenarios, as well as a better representation of facial characteristics. However, challenges remain in terms of data availability and changes in facial expressions, which require further investigation. The combination of our Gender-GCNN model's inherent strengths in capturing geometric invariances with the strategic feature engineering process focusing on geometric and morphological cues leads to enhanced discriminative power. This two-fold approach empowers our model to excel in gender recognition, even in scenarios with limited training data cardinality, offering a more refined and accurate understanding of facial gender characteristics.
Additional work is needed to evaluate the robustness, generalization, and applicability of our model to a wider range of 3D face datasets, including those from diverse populations, acquisition scenarios, and real-world settings. Finally, building upon the promising performance observed in this context, it would be valuable to explore the potential application of this approach in other domains that face the challenge of low data cardinality. For instance, the medical field, where the objective may be to diagnose rare diseases based on facial characteristics, could benefit from such investigations.

Figure 1 :
Figure 1: Example of a BP4D+ subject triangular mesh, with the 84 extracted landmarks (not all visible)

Figure 2 :
Figure 2: Different views of a graph obtained by defining a node for each of the 84 landmarks and linking each node to its $k$ nearest neighbours, with $k = 32$. The graduated scale provides information about a coordinate of each landmark.

Figure 3 :
Figure 3: Our Gender-GCNN model. It consists of 4 Gender-GCNN Layers, where each layer shares the same MLP weights. After a max-pooling operation, an MLP performs the final classification. MLPs are represented as light blue arrows.

Figure 4 :
Figure 4: An example of a BP4D+ record.From left to right, a frame's 2D RGB texture, plain mesh and RGB mesh.

Table 2 :
Summary of the experiments conducted on the different datasets, with the number of frames for each dataset in parentheses.