Cross-Architecture Distillation for Face Recognition

Transformers have emerged as the superior choice for face recognition tasks, but their insufficient platform acceleration hinders their application on mobile devices. In contrast, Convolutional Neural Networks (CNNs) capitalize on hardware-compatible acceleration libraries. Consequently, it has become indispensable to preserve the distillation efficacy when transferring knowledge from a Transformer-based teacher model to a CNN-based student model, known as Cross-Architecture Knowledge Distillation (CAKD). Despite its potential, the deployment of CAKD in face recognition encounters two challenges: 1) the teacher and student share disparate spatial information for each pixel, obstructing the alignment of feature space, and 2) the teacher network is not trained in the role of a teacher, lacking proficiency in handling distillation-specific knowledge. To surmount these two constraints, 1) we first introduce a Unified Receptive Fields Mapping module (URFM) that maps pixel features of the teacher and student into local features with unified receptive fields, thereby synchronizing the pixel-wise spatial information of teacher and student. Subsequently, 2) we develop an Adaptable Prompting Teacher network (APT) that integrates prompts into the teacher, enabling it to manage distillation-specific knowledge while preserving the model's discriminative capacity. Extensive experiments on popular face benchmarks and two large-scale verification sets demonstrate the superiority of our method.


INTRODUCTION
Face recognition has attained tremendous success in various application areas [21,23,58]. However, compact yet discriminative face recognition models are highly desirable due to the proliferation of identification systems on mobile and peripheral devices [15]. Despite various proposals for enhanced neural network designs, there remains an immense performance disparity between these compressed networks and heavy networks with millions of parameters. A natural option is to optimize neural network architectures for mobile devices, e.g., MobileFaceNet [4] and MobileNetV3 [12]. However, discriminative networks always benefit from a large modeling capacity, which is time- and labor-intensive. Knowledge distillation (KD) refers to the vanilla method for enhancing the performance of light models [10,42]. A typical scenario involves distilling either the intermediate features or the subsequent logits from a strong teacher network to a compact student network, aiming to substantially improve the performance of the student model. Nevertheless, existing KD techniques primarily focus on homologous-architecture distillation, i.e., CNN to CNN. Recently, Transformers have demonstrated exceptional capabilities in various vision tasks [2,8,30]. Nonetheless, their high computational requirements and insufficient support for platform acceleration have hindered their deployment on mobile devices. On the other hand, CNNs have undergone significant development in recent years, with hardware-friendly acceleration libraries such as CUDA [36] and TensorRT [37] rendering them suitable for both servers and mobile devices. Consequently, considering the exceptional modeling capacity of Transformers and the compatibility of CNNs, it has become a prevalent practice to employ a Transformer as the teacher network and a CNN as the student network for KD. However, current KD methods concentrate on homologous-architecture distillation and overlook the architectural gap between teacher and student networks, leading to inferior performance of Transformer-to-CNN distillation compared with that of CNN to CNN. Therefore, we probe the implication of the architecture gap on knowledge distillation in face recognition. Specifically, we first train an IResNet-50 [7] on MS1M-V2 [7] as the teacher network under the homologous-architecture scenario, and then train a Swin-S [30] as the teacher network under the cross-architecture scenario. It is worth noting that Swin-S yields a slight performance improvement over IResNet-50. For the student network, we choose MobileFaceNet [4] as the backbone. We reproduce the canonical knowledge distillation methods [5,10,15,38,39,42,49] in face recognition in the homologous- and cross-architecture settings, respectively, and calculate the performance variation of each method from the homologous- to the cross-architecture scenario, as shown in Fig. 1. Most methods suffer from performance degradation in the cross-architecture scenario. Nevertheless, we believe cross-architecture knowledge distillation remains effective in face recognition due to the highly organized face structure.
In this paper, we find that the deployment of cross-architecture knowledge distillation in face recognition encounters two major challenges. First, as illustrated in Fig. 2, there is a significant architecture gap between teacher and student networks in terms of pixel-wise receptive fields, i.e., the teacher network adopts shifted window attention [30] while the student utilizes conventional convolution operations. To demonstrate this, we visualize the Effective Receptive Field (ERF) [32] of pixel-wise features for the teacher and student. As illustrated in Sec. 4.5.3, the teacher and student share disparate pixel-wise spatial information. Second, the teacher network is not trained in the role of a teacher, lacking awareness of managing distillation-specific knowledge. The challenge lies in developing an auxiliary module that enables the teacher to manage distillation-specific knowledge while preserving its discriminative capacity.
To address the aforementioned challenges, we first introduce a Unified Receptive Fields Mapping module (URFM) designed to map pixel features of the teacher and student models into local features with congruent receptive fields. To achieve this, we utilize learnable local centers as the query embedding, supplemented with facial positional encoding, to synchronize the receptive fields of the pixel features in both teacher and student. Additionally, recent research explores the feasibility of prompts in visual recognition and continual learning [17,45,53]. In this paper, we investigate the applicability of prompts in KD, allowing the teacher to optimize during distillation. Specifically, we develop an Adaptable Prompting Teacher network (APT) that integrates prompts into the teacher, enabling it to manage distillation-specific knowledge.
In summary, the contributions of this paper include:
• We propose a novel module called Unified Receptive Fields Mapping (URFM) that maps pixel-wise features to local features with unified receptive fields. In the URFM module, we exploit learnable local centers as the query embedding, on which we supplement a facial positional encoding built on facial key points to synchronize the pixel-wise receptive fields of the teacher and student networks.
• We introduce the Adaptable Prompting Teacher network (APT), which supplements learnable prompts in the teacher, enabling it to manage distillation-specific knowledge while preserving the model's discriminative capacity. We further propose to adapt the model's adaptable capacity by altering the number of prompts. To the best of our knowledge, we are the first to explore the feasibility of prompts in KD.
• Extensive experiments on popular face recognition benchmarks demonstrate the superiority of the proposed method over state-of-the-art methods.

RELATED WORK
2.1 Face Recognition
Face recognition (FR) is a demanding computer vision task that seeks to identify or authenticate a person's identity based on their facial features. A crucial component of face recognition systems is the loss function, responsible for measuring the similarity or dissimilarity between face embeddings. Two primary loss functions are employed in face recognition: verification loss and softmax-based loss. The former optimizes pairwise Euclidean distance in feature space using contrastive loss [6,47] or differentiates positive pairs from negative pairs by applying a distance margin through triplet loss [11,43]. The latter is extensively adopted by state-of-the-art deep face recognition methods. Softmax loss functions combined with heavy neural networks have been demonstrated to obtain satisfactory performance [7]. Various methods have been proposed to learn features with angular discrimination. SphereFace [27] introduces the angular softmax function (i.e., A-Softmax), adding discriminative constraints on a hypersphere manifold. CosFace [51] further suggests a large margin cosine loss to enhance the decision margin in the angular space. The ArcFace loss is designed to achieve highly discriminative features for FR by incorporating an additive angular margin [7]. CurricularFace [14] integrates the concept of curriculum learning into the loss function. MagFace [34] explores applying different margins based on recognizability, incorporating substantial angular margins for elevated-norm features that exhibit a heightened level of discernibility. These loss functions differ in their approaches to optimizing intra-class compactness and inter-class separability of face embeddings. However, most of them rely on large-scale training data and high-capacity models, constraining their applicability on mobile devices.
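As a concrete illustration of the softmax-based margin losses discussed above, the following sketch applies an ArcFace-style additive angular margin to the target-class logit before scaling. This is a minimal NumPy reconstruction, not the paper's implementation; function and variable names are our own, while `s = 64` and `m = 0.5` follow the common setting mentioned later in the experiments.

```python
import numpy as np

def arcface_logits(embeddings, weights, labels, s=64.0, m=0.5):
    """ArcFace-style sketch: l2-normalize features and class weights,
    add the angular margin m to the target-class angle, then scale by s."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    cos = np.clip(e @ w, -1.0, 1.0)          # cosine similarities (B, C)
    theta = np.arccos(cos)                    # angles
    onehot = np.eye(w.shape[1])[labels]       # margin only on target class
    return s * np.cos(theta + m * onehot)

# two unit embeddings, two classes whose weight vectors are the axes
emb = np.array([[1.0, 0.0], [0.0, 1.0]])
w = np.array([[1.0, 0.0], [0.0, 1.0]])
logits = arcface_logits(emb, w, labels=np.array([0, 1]))
```

The margin shrinks the target logit (here from 64 to roughly 56.2), which tightens the decision boundary during training; the cross-entropy loss is then applied to these modified logits.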

Knowledge Distillation
Knowledge distillation was first proposed by Hinton et al. [10], who suggested transferring the softened logits (before the softmax layer) from the teacher to the student by minimizing the Kullback-Leibler divergence. A temperature factor is used to smooth the logits. In pursuit of richer representations, Romero et al. [42] proposed transferring intermediate-layer features between the teacher and student networks. Subsequently, Zagoruyko and Komodakis [56] devised several statistical methods to emphasize the dominant areas of the feature map and disregard low-response areas as noise.
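Hinton's soft-target loss can be sketched as follows; this is a plain NumPy illustration (real training code would use a framework's KL-divergence op), and the helper names are our own.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-softened softmax; higher T smooths the distribution."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def kd_soft_loss(teacher_logits, student_logits, T=4.0):
    """KL(teacher || student) on temperature-softened logits, scaled by
    T^2 so gradient magnitudes stay comparable across temperatures."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(T * T * np.sum(p * (np.log(p) - np.log(q))))
```

The loss is zero when the student matches the teacher's logits exactly and grows as the two distributions diverge.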
Chen et al. [3] introduced semantic calibration, enabling the student to learn from the most semantically related teacher layer. In [16], feature similarities between the teacher and student networks are computed and utilized as weights to balance feature matching. However, these approaches overlook the problem of semantic mismatch, where pixels in the teacher feature map often contain more semantic information than those in the student map at the same spatial location. We note that some works [15,25,38-40] relax the spatial constraint during feature distillation. Typically, they define a relational graph or similarity matrix in the teacher network's feature space and transfer it to the student network. For instance, Tung and Mori [49] calculate a similarity matrix, with each entry encoding the similarity between two instances. Liu et al. [25] measure the correlation between channels using inner products. These methods reduce and compress entire features to specific properties, thereby eliminating spatial information. However, existing methods predominantly focus on homologous-architecture KD, limiting their applicability in cross-architecture KD.
Transformers have shown competitive or superior performance compared to convolutional neural networks (CNNs) on many benchmarks and datasets [19]. However, Transformers are computationally expensive and hard to accelerate on different platforms, especially on mobile devices. On the other hand, CNNs have been well developed in recent years, with libraries like CUDA [36] and TensorRT [37] that make them compatible with both servers and edge devices. Therefore, a common practice is to utilize a Transformer as the teacher network and a CNN as the student network for KD, which can improve the student's performance. Many existing KD methods cannot work with Transformers due to the architecture gap between Transformer and CNN [29]. Some works have studied how to distill knowledge between Transformers. For example, DeiT [48] supplements a distillation token to assist the student Transformer in learning from the teacher and the ground truth (GT). MINILM [21] focuses on distilling the self-attention information in a Transformer. IR [22] transfers the internal representations (e.g., the self-attention map) from the teacher to the student. However, most of these methods require similar or identical architectures for both teacher and student. To solve this problem, Liu et al. [29] propose to align the attention space and feature space of the teacher and student networks, assuming that they share identical spatial information for each pixel. However, we argue that this assumption does not hold. As illustrated in Fig. 2, there is a significant architecture gap between teacher and student networks in terms of pixel-wise receptive fields: the teacher network adopts shifted window attention [30], while the student utilizes conventional convolution operations. In accordance with [29], we mitigate the architecture gap by aligning the attention space and feature space. Going beyond that, we synchronize the receptive fields of the pixel-wise features of the student and teacher networks, which further alleviates the architecture gap.

Prompting in Vision
Prompting is a technique that utilizes language instruction at the beginning of the input text to assist a pre-trained language model in pre-understanding the task [26]. GPT-3 [1] generalizes strongly to downstream transfer learning tasks with manually chosen prompts, even in few-shot or zero-shot settings. Recent works propose to optimize the prompts as task-specific continuous vectors via gradients during fine-tuning, which is called Prompt Tuning [22,24,28].
It performs comparably to full fine-tuning but with hundreds of times less parameter storage. Jia et al. [17] explore the generality and feasibility of visual prompting across multiple domains. Wang et al. [53] probe the viability of prompts in continual learning. In this paper, we explore the applicability of prompts in KD. The objective is to optimize prompts that instruct the teacher to manage distillation-specific knowledge while maintaining the model's adaptable capacity.

METHOD
In this section, we first provide an overview of the proposed method, followed by a brief introduction to the general formulation of the Adaptable Prompting Teacher network (APT). Next, we detail the design of the proposed Unified Receptive Fields Mapping module (URFM). Lastly, we introduce the implementation of Facial Positional Encoding (FPE), including two candidate metric schemes, i.e., Saliency Distance (SD) and Relative Distance (RD).

Framework
We present the overall framework of our method in Fig. 3. For the teacher network, a facial image is initially divided into $N$ patches encoded via a linear projection. The patch embeddings are then fed to APT to produce pixel features, which are subsequently mapped through URFM to obtain local features with unified receptive fields. Likewise, the ultimate pixel features of the student are produced after the encoding of the CNN layers, followed by URFM to obtain the corresponding local features. Finally, we conduct a mutual alignment in the attention space and feature space between the teacher and student networks. The adopted classification loss is the ArcFace loss [7], and the alignment losses on the feature space and attention space are formulated as
$$\mathcal{L}_{feat} = \mathrm{MSE}(\mathbf{f}^{T}, \mathbf{f}^{S}), \qquad \mathcal{L}_{att} = \mathrm{MSE}\big(\mathrm{MSA}(\mathbf{f}^{T}), \mathrm{MSA}(\mathbf{f}^{S})\big),$$
where $\mathbf{f}^{T}$ and $\mathbf{f}^{S}$ denote the local features, and $\mathrm{MSA}(\mathbf{f}^{T})$ and $\mathrm{MSA}(\mathbf{f}^{S})$ represent the final attention maps of the teacher and student, respectively. $\mathrm{MSE}(\cdot,\cdot)$ denotes the mean squared error, and $\mathrm{MSA}(\cdot)$ denotes Multi-head Self-Attention [50].
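A minimal sketch of this reciprocal alignment, assuming a single-head attention map computed directly from the aligned local features (the paper's MSA is multi-head; all helper names here are our own):

```python
import numpy as np

def mse(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.mean((a - b) ** 2))

def attention_map(f):
    """Single-head self-attention map over local features f (M x d),
    standing in for the multi-head MSA(.) used in the paper."""
    f = np.asarray(f, float)
    scores = f @ f.T / np.sqrt(f.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # stability
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

def alignment_losses(f_teacher, f_student):
    """Feature-space MSE plus attention-space MSE, mirroring the two
    alignment terms; both vanish when the local features coincide."""
    return (mse(f_teacher, f_student),
            mse(attention_map(f_teacher), attention_map(f_student)))
```

Because URFM gives both networks the same number of local features, the two MSE terms are well-defined without any reshaping.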

Adaptable Prompting Teacher
Considering that the teacher is unaware of the student's capacity during training, an intuitive solution is to allow the teacher to optimize for distillation. However, the immense modeling-capacity gap between the teacher and student degenerates the distillation into inferior self-distillation [41]. Therefore, we propose inserting prompts into the teacher, enabling it to manage distillation-specific knowledge while preserving the model's discriminative capacity. In Sec. 4.5.3, we find that the number of learnable prompts determines the discriminative capacities of the teacher and student. For a plain Swin Transformer [30] with $L$ basic layers, an input facial image is resized and divided into $N$ patches $\mathbf{x} = \{\mathbf{x}_{i} \in \mathbb{R}^{h \times w \times 3} \mid 1 \le i \le N\}$, where $h$ and $w$ indicate the height and width of the image patches. Each patch is initially embedded into a patch feature, and the $l$-th basic layer feed-forwards the stacked patch and prompt tokens as
$$[\mathbf{x}_{l+1}, \mathbf{p}'_{l+1}] = B_{l}([\mathbf{x}_{l}, \mathbf{p}_{l}]).$$
Here, $[\cdot\,,\cdot]$ denotes stacking and concatenation on the token dimension, and $B_{l}$ indicates the $l$-th Transformer basic layer. $\mathbf{x}_{l+1}$ and $\mathbf{p}'_{l+1}$ refer to the output patch tokens and prompt tokens of the $l$-th basic layer, respectively, and $\mathrm{PM}(\cdot)$ indicates the Patch Merging manipulation applied to the patch tokens between layers. We incorporate prompts as basic components in calculating shifted window attention [30] while ignoring them in the patch merging. In the subsequent layer, fresh prompts $\mathbf{p}_{l+1}$ are initialized and inserted into $\mathbf{x}_{l+1}$ as the input of the $(l+1)$-th basic layer. Finally, we stack and concatenate the output of the $L$-th layer as the pixel features $\mathbf{f}_{pix}$ for both teacher and student.
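The per-layer prompt bookkeeping described above can be sketched as follows. This is an illustrative NumPy mock-up, not Swin code: the toy "frozen layer" stands in for a shifted-window-attention basic layer, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def apt_layer(patch_tokens, layer_fn, prompts):
    """One APT basic layer (sketch): concatenate the layer's fresh
    prompts with the patch tokens, run the frozen layer, then drop the
    prompt outputs so patch merging sees only patch tokens."""
    x = np.concatenate([patch_tokens, prompts], axis=0)  # [x_l, p_l]
    x = layer_fn(x)
    return x[: len(patch_tokens)]  # discard prompt tokens

# Toy frozen "layer": crude token mixing followed by a linear map,
# so the prompts actually influence the patch-token outputs.
W = rng.standard_normal((8, 8))
layer_fn = lambda x: (x + x.mean(axis=0, keepdims=True)) @ W

patches = rng.standard_normal((49, 8))  # N = 49 patch tokens, d = 8
prompts = rng.standard_normal((4, 8))   # n_p = 4 distillation prompts
out = apt_layer(patches, layer_fn, prompts)  # shape (49, 8)
```

During distillation only `prompts` would receive gradients; the layer weights stay frozen, which is what preserves the teacher's discriminative capacity.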

Unified Receptive Fields Mapping
The purpose of the URFM module is to map the pixel features extracted by the backbone into local features with unified receptive fields. Self-Attention (SA) can be considered an alternative solution due to its sequence-to-sequence functional form. However, it has two problems in our context. 1) SA maintains an equal number of input and output tokens, but the teacher and student networks generally have different numbers of pixel features (tokens), hindering the alignment of the feature space due to the inconsistent feature dimensions between teacher and student. 2) The vanilla positional encoding method in vision [55] merely considers the spatial distance between tokens while disregarding the variation of face structure between tokens. The proposed URFM solves these problems by modifying SA with 1) learnable query embeddings and 2) facial positional encoding. First, we review the generic attention formulation containing query $Q$, key $K$, and value $V$ embeddings:
$$\mathrm{Attn}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V, \quad Q = \mathbf{f}W_{Q}, \; K = \mathbf{f}W_{K}, \; V = \mathbf{f}W_{V},$$
where $W_{Q}$, $W_{K}$, $W_{V}$ are learnable weights, $\mathbf{f}$ denotes the input pixel features, and $d$ indicates the channel dimension. Then, we modify it with learnable local centers $\mathbf{c}$:
$$Q = \mathbf{c}W_{Q}.$$
The local centers $\mathbf{c} \in \mathbb{R}^{M \times d}$ ensure consistent numbers of the output features for both teacher and student networks.
Transformers inherently fail to capture the ordering of input tokens, which necessitates incorporating explicit position information through positional encoding (PE). The original vision Transformer proposes inserting fixed encodings generated by sine and cosine functions of varying frequencies, as well as learnable PE, into the input [8]. Swin [30] supplements PE in the attention map as a bias, resulting in significant performance improvements, as formulated below:
$$\mathrm{Attn}(i, j) = \langle q_{i}, k_{j} \rangle + B_{\Delta(i,j)},$$
where $\langle q_{i}, k_{j} \rangle$ represents the inner product between patches $i$ and $j$, $q_{i}$ and $k_{j}$ indicate the input elements of the patch embeddings, and $B_{\Delta(i,j)}$ refers to the learnable position weights indexed by the spatial distance between patches $i$ and $j$ [30]. However, the numbers of $q$ and $k$ may not be identical in our setting, leading to inconsistent dimensions between the attention map and the patch distance matrix. To address this, we propose incorporating absolute PE for the query:
$$\mathrm{Attn}(i, j) = \langle q_{i}, k_{j} \rangle + B_{g(i)},$$
where $B_{g(i)}$ indicates the parameters indexed by the absolute position of patch $i$, which is formulated as follows:
$$B_{g(i)} = \mathcal{B}\big[g(D(i, \hat{h}))\big].$$
The index function $g(\cdot)$ maps a relative distance to an integer in a finite set, and we employ PIF [55] as the index function. $\mathcal{B}$ is a randomly initialized parameter bucket. $D(i, \hat{h})$ denotes the distance between patch $i$ and the anchor patch $\hat{h}$. The conventional method for determining patch position involves measuring the Euclidean distance of coordinates from the anchor point [55]:
$$D_{spa}(i, \hat{h}) = \sqrt{(\hat{x}_{i} - \hat{x}_{\hat{h}})^{2} + (\hat{y}_{i} - \hat{y}_{\hat{h}})^{2}},$$
where $(\hat{x}_{i}, \hat{y}_{i})$ denote the coordinates of patch $i$. We choose the $\lfloor M/2 \rfloor$-th patch as the anchor, where $M$ indicates the number of local centers, as shown in Fig. 3. However, general visual PE methods primarily focus on the spatial distance of patch embeddings and overlook the differences in facial structure, which are pivotal in face domains. To address this, we propose incorporating the facial structure distance into the positional encoding, as formulated below:
$$D(i, \hat{h}) = D_{spa}(i, \hat{h}) + \lambda D_{face}(i, \hat{h}),$$
where $D_{face}(i, \hat{h})$ is the facial structure distance between patches $i$ and $\hat{h}$ (candidates are detailed in Sec. 3.4), and $\lambda$ is a constant that unifies the range of the spatial distance and the facial structure distance.
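The core of URFM, learnable local centers acting as queries over a variable number of pixel features, can be sketched as follows. This single-head NumPy version omits the bucket-indexed positional bias details and uses our own names throughout.

```python
import numpy as np

rng = np.random.default_rng(0)

def urfm(pixel_feats, centers, Wq, Wk, Wv, pos_bias=None):
    """URFM sketch: M learnable local centers act as queries over the
    backbone's N pixel features, so teacher and student both emit M
    local features regardless of their own token counts."""
    q = centers @ Wq                        # (M, d)
    k = pixel_feats @ Wk                    # (N, d)
    v = pixel_feats @ Wv                    # (N, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if pos_bias is not None:                # facial positional encoding bias
        scores = scores + pos_bias
    scores = scores - scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v                         # (M, d) local features

d, M = 16, 49
centers = rng.standard_normal((M, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
f_t = urfm(rng.standard_normal((196, d)), centers, Wq, Wk, Wv)  # teacher, N_T=196
f_s = urfm(rng.standard_normal((49, d)), centers, Wq, Wk, Wv)   # student, N_S=49
```

Note that although the teacher contributes 196 pixel tokens and the student 49, both outputs have the same `(M, d)` shape, which is what makes the subsequent feature-space MSE well-defined.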

Facial Structure Distance
In this section, we introduce two candidate methods for measuring the facial structure distance between patches, while preserving the Euclidean distance of coordinates for basic positional information.
Saliency Distance. We utilize FaceX-Zoo [52] to predict 106 keypoints for each face image and quantify the saliency of each patch embedding by the number of keypoints inside the patch. The saliency distance between patches is computed as follows:
$$D_{face}(i, \hat{h}) = \frac{|n_{i} - n_{\hat{h}}|}{n_{max}},$$
where $n_{i}$ and $n_{\hat{h}}$ indicate the numbers of landmarks in the corresponding patches, and $n_{max}$ denotes the maximum number of landmarks in a patch.
Relative Distance. We use dlib [20] to predict 5 keypoints for each face image and calculate the distance between the centroid coordinates of patch $i$ and those of each keypoint to compose a relative distance vector $\mathbf{r}_{i} = \{r_{i,k}, 1 \le k \le 5\}$, which is employed to determine the distance between patches $i$ and $\hat{h}$:
$$r_{i,k} = \sqrt{(\bar{x}_{i} - x_{k})^{2} + (\bar{y}_{i} - y_{k})^{2}}, \qquad D_{face}(i, \hat{h}) = \mathrm{Cos}(\mathbf{r}_{i}, \mathbf{r}_{\hat{h}}),$$
where $(\bar{x}_{i}, \bar{y}_{i})$ denotes the coordinates of the centroid of patch $i$, and $(x_{k}, y_{k})$ indicates the coordinates of the $k$-th keypoint. $\mathrm{Cos}(\cdot, \cdot)$ represents the cosine distance computation.
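The two candidate distances can be sketched as follows. These are hedged NumPy reconstructions from the surrounding definitions (landmark counts for SD, cosine distance of keypoint-distance vectors for RD); the exact normalizations in the original may differ, and the function names are our own.

```python
import numpy as np

def saliency_distance(n_i, n_h, n_max):
    """SD sketch: patches compared by the landmark counts they contain,
    normalized by the maximum count observed in any patch."""
    return abs(n_i - n_h) / max(n_max, 1)

def relative_distance(centroid_i, centroid_h, keypoints):
    """RD sketch: each patch gets a 5-dim vector of Euclidean distances
    from its centroid to the keypoints; patches are then compared via
    the cosine distance of those vectors."""
    kp = np.asarray(keypoints, float)
    r_i = np.linalg.norm(kp - np.asarray(centroid_i, float), axis=1)
    r_h = np.linalg.norm(kp - np.asarray(centroid_h, float), axis=1)
    cos = r_i @ r_h / (np.linalg.norm(r_i) * np.linalg.norm(r_h))
    return 1.0 - cos
```

Both distances are zero for a patch compared with itself and grow as the facial-structure context of the two patches diverges.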

EXPERIMENTS
4.1 Dataset
Training set. For fair comparisons with other SOTA approaches, we employ the refined MS1MV2 [7] as our training set. MS1MV2 consists of 5.8M facial images of 85K individuals.

Experimental Settings
Data Processing. The input facial images are cropped and resized to 112×112 for CNNs and ViT. For Swin, we utilize bilinear interpolation to resize the image from 112×112 to 224×224. Then the images are normalized by subtracting 127.5 and dividing by 128. For data augmentation, we adopt the random horizontal flip.
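The normalization step can be sketched as below (cropping, resizing, and flipping omitted; the function name is our own):

```python
import numpy as np

def normalize_face(img_u8):
    """Pixel normalization from the data pipeline: subtract 127.5 and
    divide by 128, mapping uint8 values into roughly [-1, 1]."""
    return (np.asarray(img_u8, dtype=np.float32) - 127.5) / 128.0
```

For example, pixel value 0 maps to -0.99609375 and 255 maps to 0.99609375, so the inputs are zero-centered for both the CNN and Transformer backbones.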
Training. We utilize Swin-S and ViT-S as the teacher models, which are trained with ArcFace [7]. For the student, there are two groups of backbone networks widely used in face recognition: one is MobileFaceNet [4], which is modified based on MobileNet [12], and the other is IResNet [7], which is adapted from ResNet [9]. To show the generality of our method, we utilize different teacher-student configurations. We set the batch size to 128 per GPU in all experiments, and train the models on 8 NVIDIA Tesla V100 (32GB) GPUs. We apply the SGD optimizer and cosine learning rate decay [48] with 4 warmup epochs and 16 normal epochs. The momentum is set to 0.9, and the weight decay is 5e-4. For the ArcFace loss, we follow the common setting with scale s = 64 and margin m = 0.5.

Comparison with SOTA Methods
In this section, we compare our method with state-of-the-art knowledge distillation methods, e.g., KD [10], FitNet [42], DarkRank [5], RKD [39], SP [15], CCKD [38], EKD [15] and GKD [57]. We also compare our method with CKD [29], which is specifically designed for cross-architecture knowledge distillation. Since the existing KD methods do not conduct experiments under the cross-architecture knowledge distillation scenario for face recognition, we reproduce them according to the settings in the original manuscripts. The majority of knowledge distillation methods surpass training the student network from scratch. Relation-based methods excel in comparison to feature-based methods but fall short of the performance achieved by methods specifically designed for cross-architecture scenarios. In contrast, our method synchronizes the receptive fields of local features in the teacher and student networks, ultimately outperforming all competitors on the small facial testing sets.

Results on IJB-B, IJB-C and MegaFace Challenge.
Tab. 1 offers a comparison of the 1:1 verification TPR@FPR=1e-4 and TPR@FPR=1e-5 between existing state-of-the-art KD methods and the proposed method on the IJB-B and IJB-C datasets. The majority of knowledge distillation methods exhibit substantial performance enhancements on these two large-scale datasets. Fig. 4 presents the comprehensive ROC curves of current state-of-the-art competitors and our method, illustrating that our approach surpasses the other KD methods. For the MegaFace challenge [18], we follow the testing protocol provided by ArcFace [7]. As Tab. 1 indicates, most competitors obtain superior performance to the baseline, whereas our method achieves the highest verification performance. For the rank-1 metric, our method performs marginally better than GKD [57]. To showcase the alignment of pixel-wise receptive fields between the teacher and student, we compute and visualize the Effective Receptive Field (ERF) [32] of their local features, denoted as PERF. Fig. 5 shows that URFM aligns the PERF of the teacher with that of the student, whereas the PERF of the student undergoes minimal change. Additionally, we find that the ERF of the Transformer exhibits a grid-like pattern.

Ablation Study
4.5.2 Adapting the Teacher's Adaptable Capacity. The teacher's adaptable capacity is manifested in the degree of alignment between the PERF of the teacher and that of the student. We posit that the scale of learnable parameters controls the teacher's adaptable capacity. As depicted in Fig. 6, we alter the number of prompts incorporated in the teacher and visualize the resulting PERF. We find that the PERF of the teacher converges to that of the student as the number of prompts increases, demonstrating that the number of prompts reflects the teacher's adaptable capacity. Notably, we consider the teacher without prompts that optimizes all parameters during distillation to hold the highest adaptable capacity, whereas the teacher frozen during distillation exhibits the lowest adaptable capacity; these are denoted as "All-learnable" and "Frozen", respectively. We first explore the effects of the teacher's adaptable capacity on its performance on CPLFW and AgeDB. Tab. 6 reveals a decline in the teacher's discriminative capacity as the adaptable capacity increases, since a higher adaptable capacity results in overfitting to the distillation. To identify the easy-to-learn teacher, we evaluate the performance of students distilled by teachers with different degrees of adaptable and discriminative capacity. As shown in Tab. 6, we find that the teacher with the highest discriminative capacity and the lowest adaptable capacity ("Frozen") is hard to learn from. In contrast, the teacher with the lowest discriminative capacity but the highest adaptable capacity ("All-learnable") exhibits only a marginal improvement. Interestingly, we observe that the teacher with a trade-off between discriminative capacity and adaptable capacity yields optimal performance for the student.

CONCLUSION
In this paper, we first demonstrate the implication of the architecture gap in cross-architecture knowledge distillation for face recognition. Subsequently, we identify two challenges for CAKD in face recognition: 1) the teacher and student share disparate spatial information for each pixel, obstructing the alignment of the feature space, and 2) the teacher network is not trained in the role of a teacher, lacking proficiency in handling distillation-specific knowledge. To tackle these problems, 1) we present a Unified Receptive Fields Mapping module (URFM), aiming at mapping pixel features of the teacher and student into local features with unified receptive fields. Additionally, 2) we propose an Adaptable Prompting Teacher network (APT) that supplements an adaptable number of prompts into the teacher network to instruct it to manage distillation-specific knowledge. We experimentally find that the teacher with a trade-off between discriminative capacity and adaptable capacity is the easiest for the student to learn from. Moreover, we construct different teacher-student pairs and demonstrate the generalization of the proposed method to different network settings. Finally, extensive experiments on popular face benchmarks and two large-scale verification datasets demonstrate the superiority of our method.

Figure 1: Existing KD methods suffer from performance degradation in cross-architecture distillation compared to homologous-architecture distillation. With the student network identified as MobileFaceNet [4], we adopt IResNet-50 [7] as the teacher for homologous-architecture distillation and Swin-S as the teacher for cross-architecture distillation. Then, we evaluate the performance variation of students with different KD methods [5,10,15,38,39,42,49] in both scenarios by: (a) the average accuracy on five popular face benchmarks [13,35,44,59,60], and (b) the average TPR@FAR=1e-4 on IJB-B [54] and IJB-C [33]. Practical application requires a solution to transfer knowledge from Transformer to CNN, which serves as the primary focus of this study.

Figure 2: Illustration of the receptive fields of a pixel feature for the teacher (Swin) and student (CNN). There exists a theoretical receptive-field misalignment between teacher and student due to the architectural difference.

Figure 3: An overview of the proposed method encompassing the Adaptable Prompting Teacher network (APT), the Unified Receptive Fields Mapping module (URFM), and the reciprocal alignment in feature space and attention space. For the teacher network, a facial image is initially divided into patches encoded via a linear projection. The patch embeddings are then fed to APT to produce pixel features, which are subsequently mapped through URFM to obtain local features with unified receptive fields. Likewise, the ultimate pixel features of the student are produced after the encoding of the CNN layers, followed by URFM to get the local features. We eventually execute a reciprocal alignment in the feature space and attention space.
Let $\mathbf{x}_{0} = \{\mathbf{x}_{i} \mid 1 \le i \le N\}$ denote the patch embedding set, which refers to the input of the 0-th Transformer basic layer. We supplement a collection of learnable embeddings $\mathbf{p}$, initialized normally, as prompts into the embedding set $\mathbf{x}$. Let $n_{p}$ indicate the number of introduced prompts, which controls the adaptable capacity of the teacher, as detailed in Sec. 4.5.3. The Transformer backbone is initialized with a pre-trained model and remains frozen; during distillation, only the distillation-specific prompts are optimized. Prompts are inserted into each Transformer basic layer. The prompts supplemented in the $l$-th Transformer basic layer form a set of $d$-dimensional vectors, denoted as $\mathbf{p}_{l} = \{\mathbf{p}_{l}^{k} \in \mathbb{R}^{d} \mid 1 \le k \le n_{p}\}$. The feed-forwarding process of APT is formulated as
$$[\mathbf{x}_{l+1}, \mathbf{p}'_{l+1}] = B_{l}([\mathbf{x}_{l}, \mathbf{p}_{l}]).$$

Figure 5: Pixel-wise Effective Receptive Fields (PERF) [32] before and after the Unified Receptive Fields Mapping (URFM). We measure the PERF for the teacher and student as the absolute value of the gradient of the pixel features or the local features. Results are averaged across all channels in the feature map for 500 randomly selected images.
The upper pink network and the lower gray network are the teacher and student networks, respectively. Suppose the teacher and student are a Swin Transformer and a convolutional neural network, denoted by $T$ and $S$. For the teacher network (Transformer), an input facial image is resized and divided into $N$ patches $\mathbf{x} = \{\mathbf{x}_{i} \in \mathbb{R}^{h \times w \times 3} \mid 1 \le i \le N\}$, which are then fed into the Adaptable Prompting Teacher to get $N_{T}$ $d$-dimensional pixel features (image tokens), where $h$ and $w$ indicate the height and width of the patches. Then the pixel features $\mathbf{f}_{pix}^{T} \in \mathbb{R}^{N_{T} \times d}$ are input into the URFM module to get local features $\mathbf{f}^{T} \in \mathbb{R}^{M \times d}$ with unified pixel receptive fields. For the student (CNN), the pixel features $\mathbf{f}_{pix}^{S} \in \mathbb{R}^{N_{S} \times d}$ are produced after the encoding of several CNN layers, followed by URFM to get local features $\mathbf{f}^{S} \in \mathbb{R}^{M \times d}$. $M$ and $d$ denote the number of local centers in URFM and the feature dimension, respectively. Note that the values of $N_{T}$ and $N_{S}$ are commonly different, due to the distinct architectures of the teacher and student networks as well as the supplemented prompts.

Table 1: Comparison on benchmark datasets of state-of-the-art knowledge distillation methods with our method.

Table 3: Ablation of facial structural distance for indexing positional encoding. Experiments are conducted on the basis of APT and URFM, and evaluated on popular facial benchmarks.

Table 4: Ablation experiments on the number of local centers. All experiments are evaluated on popular facial benchmarks.

Table 5: Generalization for different student and teacher networks, as well as the identification comparisons with other SOTA methods. Student (Stu.) and teacher (Tea.) networks are replaced by IResNet-18 [7] and ViT [8], respectively.

Note that we discard the final prompt embeddings to maintain an equal number of pixel features for the teacher and student networks, i.e., $N_{T} = N_{S}$. Furthermore, we unify the pixel-wise receptive fields of the teacher and student through the URFM module, which further enhances the student's performance.
4.4.2 Effects of Facial Structural Distance. We investigate two candidate facial structural distances for indexing positional encoding, i.e., Saliency Distance (SD) and Relative Distance (RD). We conduct the experiments based on APT and URFM, and compare the vanilla Euclidean distance (Euc) with SD and RD. From Tab. 3, all methods outperform the baseline model with Euclidean positional encoding, and SD outperforms RD. Therefore, we choose SD as the metric of facial structural distance in the following experiments.
4.4.3 Effects of the Number of Local Centers. We investigate the effects of the number $M$ of local centers. We conduct the experiments after introducing APT and facial positional encoding. From Tab. 4, we find that $M = 3 \times 3$ is inferior in comparison to the others, since few local centers hinder the structural and spatial information of faces. In contrast, $M = 7 \times 7$ achieves the best performance.
4.4.4 Generalization for Student of IResNet. We investigate the generalization of our method for the student of IResNet-18. As shown in Tab. 5, the identification performance of different knowledge distillation methods is evaluated on CFP-FP, CPLFW, AgeDB and CALFW. Most methods outperform the baseline model on average performance, and our method improves the baseline and achieves the best performance on these four datasets. For the ViT teacher, URFM takes both patch embeddings and class tokens. As shown in Tab. 5, most methods outperform the student trained from scratch with limited performance improvement. In contrast, our approach achieves superior performance over the other methods.

Table 6: Effects of varying adaptable and discriminative capacity (performance) of the teacher on the student's performance. Blue and Green denote the highest and lowest adaptable or discriminative capacity, respectively. Red denotes the trade-off that results in the best performance for the student.