Deep Learning for Iris Recognition: A Survey

In this survey, we provide a comprehensive review of more than 200 articles, technical reports, and GitHub repositories published over the last 10 years on the recent developments of deep learning techniques for iris recognition, covering broad topics on algorithm designs, open-source tools, open challenges, and emerging research. First, we conduct a comprehensive analysis of deep learning techniques developed for the two main sub-tasks in iris biometrics: segmentation and recognition. Second, we focus on deep learning techniques for the robustness of iris recognition systems against presentation attacks and via human-machine pairing. Third, we examine deep learning techniques for forensic applications, especially post-mortem iris recognition. Fourth, we review open-source resources and tools available for deep learning-based iris recognition. Finally, we highlight the technical challenges, emerging research trends, and an outlook on the future of deep learning in iris recognition.


INTRODUCTION
The human iris is a sight organ that controls the amount of light reaching the retina by changing the size of the pupil. The texture of the iris is fully developed before birth, its minutiae do not depend on genotype, it stays relatively stable across the lifetime (except for disease- and normal aging-related biological changes), and it may even be used for forensic identification shortly after a subject's death [36,110,170].
In terms of its information theory-related properties, the iris texture has an extremely high degree of randomness and is stable (permanent) over time, providing an exceptionally high entropy per mm² that justifies its higher discriminating power when compared to other biometric modalities (e.g., face or fingerprint). The iris' collectability is another feature of interest and has been the subject of discussion over recent years: while it can be acquired using commercial off-the-shelf (COTS) hardware, either handheld or stationary, data can even be collected at a distance, up to tens of meters away from the subjects [111]. Even though commercial visible-light (RGB) cameras are able to image the iris, near infrared-based (NIR) sensing dominates in most applications, due to a better visibility of iris texture for darker eyes, rich in melanin pigment, which is characterized by lower light absorption in the NIR spectrum compared to shorter wavelengths. In addition, NIR wavelengths are barely perceivable by the human eye, which augments users' comfort and avoids the pupil contraction/dilation that would appear under visible light.
A seminal work by John Daugman brought to the community the Gabor filtering-based approach that became the dominant approach for iris recognition [34,35,37]. Even though subsequent solutions to iris image encoding and matching have appeared, the IrisCode approach is still dominant due to its ability to effectively search massive databases with a minimal probability of false matches, at extreme time performance. By considering binary words, pairs of signatures are matched using XOR parallel-bit logic at lightning speed, enabling millions of comparisons per second per processing core. Also, most of the methods that outperformed the original techniques in terms of effectiveness do not work under the one-shot learning paradigm, assume multiple observations of each class to obtain appropriate decision boundaries, and, most importantly, have encoding/matching steps with a time complexity that forbids their use in large-scale environments (in particular, for all-against-all settings).
In short, Daugman's algorithm encodes the iris image into a binary sequence of 2,048 bits by filtering the iris image with a family of Gabor kernels. The varying pupil size is rectified by a Cartesian-to-polar coordinate system transformation, to end up with an image representation of canonical size, guaranteeing an identical structure of the iris code independently of the iris and pupil size. This makes it possible to use the Hamming Distance (HD) to measure the similarity between two iris codes [37]. Its low false match rate at acceptable false non-match rates is the key factor behind the success of global-scale iris recognition installations, such as the Aadhaar national person identification and border security program in India (with over 1.2 billion pairs of irises enrolled) [174], the Homeland Advanced Recognition Technology (HART) in the US (up to 500 million identities) [128], or the NEXUS system, designed to speed up border crossings for low-risk and pre-approved travelers moving between Canada and the US.
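The XOR-and-popcount matching described above can be sketched in a few lines. The following plain-Python illustration (not Daugman's actual implementation; the helper names and the shift range are hypothetical) represents each 2,048-bit code as an integer, computes the fractional Hamming distance with XOR, and takes the minimum over a few circular bit shifts, the usual trick for tolerating eye rotation:

```python
# Illustrative sketch of IrisCode-style matching. A real system would
# also mask occluded bits; that is omitted here for brevity.

CODE_BITS = 2048

def hamming_distance(code_a: int, code_b: int) -> float:
    """Fractional Hamming distance between two fixed-length bit codes."""
    return bin(code_a ^ code_b).count("1") / CODE_BITS

def rotate(code: int, shift: int) -> int:
    """Circular left shift of a CODE_BITS-wide integer."""
    mask = (1 << CODE_BITS) - 1
    shift %= CODE_BITS
    return ((code << shift) | (code >> (CODE_BITS - shift))) & mask

def match_score(code_a: int, code_b: int, max_shift: int = 8) -> float:
    """Best (lowest) distance over a small range of rotations."""
    return min(hamming_distance(rotate(code_a, s), code_b)
               for s in range(-max_shift, max_shift + 1))
```

Because the core operation is a single XOR plus a population count per comparison, this scheme scales to the millions of comparisons per second per core mentioned above.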
Deep learning-based methods, in particular those using various Convolutional Neural Network (CNN) architectures, have been driving remarkable improvements in many computer vision applications over the last decade. In terms of biometric technologies, it is not surprising that iris recognition has also seen an increasing adoption of purely data-driven approaches at all stages of the recognition pipeline: from preprocessing (such as off-axis gaze correction) and segmentation, to encoding and matching. Interestingly, however, the impact of deep learning on the various stages of the iris recognition pipeline is uneven. One of the primary goals of this survey is to assess where deep learning has helped in achieving higher-performing and more secure systems, and which procedures did not benefit from more complex modeling.
The remainder of the paper is structured as follows. Sections 2 and 3 review the application of deep learning in the two main stages of the recognition pipeline: segmentation and recognition (encoding and comparison). Sections 4 and 5 analyze the state of the art of deep learning-based approaches in two applications: Presentation Attack Detection (PAD) and forensics. Section 6 investigates how humans and machines can pair to improve deep learning-based iris recognition. Section 7 focuses on approaches for iris and periocular analysis in less controlled environments. Section 8 reviews public resources and tools available in the deep learning-based iris recognition domain. Section 9 focuses on the future of deep learning for iris recognition, with a discussion of emerging research directions in different aspects of iris analysis. The paper is concluded in Section 10.

DEEP LEARNING-BASED IRIS SEGMENTATION
The segmentation of the iris is seen as an extremely challenging problem. As illustrated in Fig. 1, segmenting the iris involves essentially three tasks: 1) parameterization of the pupillary (inner) boundary; 2) parameterization of the scleric (outer) boundary; and 3) discrimination between the unoccluded (noise-free) and occluded (noisy) regions inside the iris ring. Such pieces of information are further used to obtain dimensionless polar representations of the iris texture, where feature extraction methods typically operate.
Schlett et al. [144] provided a multi-spectral analysis to improve iris segmentation accuracy in visible wavelengths by preprocessing data before the actual segmentation phase, extracting multiple spectral components in the form of RGB color channels. Even though this approach does not propose a DL-based framework, the different versions of the input could easily be used to feed DL-based models and augment their robustness to non-ideal data. Chen et al. [22] used CNNs that include dense blocks, referred to as a dense-fully convolutional network (DFCN), where the encoder part consists of dense blocks, and the decoder counterpart obtains the segmentation masks via transpose convolutions. Hofbauer et al. [72] parameterize the iris boundaries based on segmentation maps produced by a CNN, using a cascaded architecture with four RefineNet units, each directly connecting to one Residual net. Huynh et al. [76] discriminate between three distinct eye regions with a DL model, and remove incorrect areas with heuristic filters. The proposed architecture is based on the encoder-decoder model, with depth-wise convolutions used to reduce the computational cost. Roughly at the same time, Li et al. [94] described the Interleaved Residual U-Net model for semantic segmentation and iris mask synthesis. In this work, unsupervised techniques (K-means clustering) were used to create intermediary pictorial representations of the ocular region, from which saliency points deemed to belong to the iris boundaries were found. Kerrigan et al.
[85] assessed the performance of four different convolutional architectures designed for semantic segmentation. Two of these models were based on dilated convolutions, as proposed by Yu and Koltun [188]. Wu and Zhao [186] described the Dense U-Net model, which combines dense layers with the U-Net network. The idea is to take advantage of the reduced set of parameters of the dense U-Net, while keeping the semantic segmentation capabilities of U-Net. The proposed model integrates dense connectivity into the U-Net contraction and expansion paths. Compared with traditional CNNs, this model is claimed to reduce learning redundancy and enhance information flow, while keeping the number of parameters of the model under control. Wei et al. [205] suggested performing dilated convolutions, which are claimed to obtain more consistent global features. In this setting, convolutional kernels are not contiguous, with zero values being artificially inserted between each non-zero position, increasing the receptive field without augmenting the number of parameters of the model.
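The mechanics of dilation can be stated precisely. In this minimal 1-D sketch (hypothetical helper names; the segmentation models above apply the same idea in 2-D), the kernel taps are spaced `dilation` positions apart, so the effective extent, and hence the receptive field of stacked layers, grows while the parameter count stays fixed:

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """1-D 'valid' convolution with a dilated kernel: taps are spaced
    `dilation` positions apart, enlarging the receptive field without
    adding parameters."""
    span = (len(kernel) - 1) * dilation + 1  # effective kernel extent
    out = []
    for start in range(len(signal) - span + 1):
        out.append(sum(kernel[i] * signal[start + i * dilation]
                       for i in range(len(kernel))))
    return out

def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of dilated conv layers (stride 1),
    one dilation factor per layer."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf
```

For example, three stacked 3-tap layers with dilations 1, 2, and 4 cover a receptive field of 15 inputs with only 9 learned weights, which is the property exploited by the works cited above.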
More recently, Ganeva and Myasnikov [55] compared the effectiveness of three convolutional neural network architectures (U-Net, LinkNet, and FC-DenseNet), determining the optimal parameterization for each one. Jalilian et al. [79] introduced a scheme to compensate for texture deformations caused by off-angle distortions, re-projecting the off-angle images back to a frontal view. The architecture used is a variant of RefineNet [96], which provides high-resolution predictions while preserving the boundary information (required for parameterization purposes).
The idea of interactive learning for iris segmentation was suggested by Sardar et al. [142], who describe an interactive variant of U-Net that includes Squeeze-Expand modules. Trokielewicz et al. [172] used DL-based iris segmentation models to extract highly irregular iris texture areas in post-mortem iris images. They used a pre-trained SegNet model, fine-tuned with a database of cadaver iris images. Wang et al. [178] (further extended in [179]) described a lightweight deep convolutional neural network specifically designed for iris segmentation of degraded images acquired by handheld devices. The proposed approach jointly obtains the segmentation mask and the parameterized pupillary/limbic boundaries of the iris.
Observing that edge-based information is extremely hard to obtain reliably in degraded data, Li et al. [7] presented a hybrid method that combines edge-based information with deep learning frameworks. A compact Faster R-CNN-like architecture was used to roughly detect the eye and define the initial region of interest, from which the pupil is further located using a Gaussian mixture model. Wang et al. [184] trained a deep convolutional neural network (DCNN) that automatically extracts the iris and pupil pixels of each eye from input images. This work combines the power of U-Net and SqueezeNet to obtain a compact CNN suitable for real-time mobile applications. Finally, Wang et al. [176] parameterize both the iris mask and the inner/outer iris boundaries jointly, by actively modeling such information in a unified multi-task network.
A final word is given to segmentation-less techniques. Assuming that the accurate segmentation of the iris boundaries is one of the hardest phases of the whole recognition chain and the main source of recognition errors, some recent works have proposed performing biometric recognition on non-segmented or roughly segmented data [132,135]. Here, the idea is to use the remarkable discriminating power of DL frameworks to perceive the agreeing patterns between pairs of images, even on such segmentation-less representations. As illustrated in Fig. 2, the idea is to analyze a dimensionless representation of the iris data and produce a feature vector that lies in a hyperspace (embedding) where recognition is carried out. In this context, Boyd et al. [15] explored five different sets of weights for the popular ResNet50 architecture to test whether iris-specific feature extractors perform better than models trained for general tasks. Minaee et al. [105] studied the application of deep features extracted from VGG-Net for iris recognition, observing that the resulting features can be well transferred to biometric recognition. Luo et al. [102] described a DL model with spatial attention and channel attention mechanisms, which are directly inserted into the feature extraction module. Also, a co-attention mechanism adaptively fuses features to obtain representative iris-periocular features. Hafner et al. [65] adapted the classical Daugman pipeline, using convolutional neural networks as feature extractors. The DenseNet-201 architecture outperformed its competitors, achieving state-of-the-art results in both the open- and closed-world settings. Menotti et al. [104] assessed how DL-based feature representations can be used in spoofing detection, observing that spoofing detection systems based on CNNs can be robust against attacks already known and adapted, with little effort, to image-based attacks that are yet to come.
Yang et al. [196] generated multi-level, spatially corresponding feature representations with an encoder-decoder structure. Also, a spatial attention feature fusion module was used to combine the resulting features more effectively. Chen et al. [23] addressed the large-scale recognition problem and described an optimized center loss function (tight center) to attenuate the insufficient discriminating power of the cross-entropy function. Nguyen et al. [112] explored the performance of state-of-the-art pre-trained CNNs on iris recognition, concluding that off-the-shelf CNN generic features are also extremely good at representing iris images, effectively extracting discriminative visual features and achieving promising results. Zhao et al. [207] proposed a method based on the capsule network architecture, describing a modified routing algorithm based on dynamic routing between two capsule layers, with three pre-trained models (VGG16, InceptionV3, and ResNet50) extracting the primary iris features. Next, a convolutional capsule replaces the fully connected capsule to reduce the number of parameters. Wang and Kumar [180] introduced the concept of residual features for iris recognition. They described a residual network learning procedure with offline triplet selection and dilated convolutional kernels.
Other works have addressed the extraction of appropriate feature representations in multi-biometric settings: Damer et al. [32] propose to jointly extract multi-biometric representations within a single DNN. Unlike previous solutions that create independent representations from each biometric modality, they create these representations from multi-modality (face and iris), multi-instance (left and right iris), and multi-presentation (two face samples) data, which can be seen as a fusion-at-the-data-level policy. Finally, concerned with the difficulty of performing reliable recognition on handheld devices, Odinokikh et al. [121] combined the advantages of handcrafted feature extractors and advanced deep learning techniques. The model utilizes shallow and deep feature representations in combination with characteristics describing the environment, to reduce the intra-subject variations expected in this kind of environment.

Deep Learning-based Iris Matching Strategies
The existing matching strategies can be categorized into three categories: (1) using conventional classifiers, such as SVM, RF, and Sparse Representation; (2) softmax-based losses; and (3) pairwise-based losses. A cohesive perspective of the most relevant recent DL-based methods is given in Table 2, with the techniques appearing in chronological (and then alphabetical) order.

Conventional Classifiers. Various researchers have used deep learning networks designed and pre-trained on the ImageNet dataset to extract iris feature representations, followed by a conventional classifier such as SVM, RF, Sparse Representation, etc. [15,18,112]. The key benefit of these approaches is the simplicity of "plug and play", where proven and pre-trained deep learning networks inherited from large-scale computer vision challenges are widely available and ready to be used [112]. Another benefit is that there is no need for large-scale iris image datasets to train these networks, because they have already been trained on large-scale datasets such as ImageNet. Considering that these networks usually contain hundreds of layers and millions of parameters, and require millions of images to train, using pre-trained networks is extremely beneficial.
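The "frozen features plus conventional classifier" pattern can be sketched end to end. In this illustrative Python (all names hypothetical; the surveyed works use SVM, RF, or Sparse Representation, whereas a nearest-centroid rule is used here only to keep the sketch self-contained), a frozen backbone would produce fixed-length embeddings and a lightweight classifier is fit on top, with no network training involved:

```python
# Sketch of the "pre-trained features + conventional classifier" pattern.
# In practice, `feats` would come from a frozen ImageNet backbone.
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

class CentroidClassifier:
    """Stand-in for the SVM/RF stage: one mean embedding per identity."""
    def fit(self, feats, labels):
        sums, counts = {}, defaultdict(int)
        for f, y in zip(feats, labels):
            sums[y] = f if y not in sums else [a + b for a, b in zip(sums[y], f)]
            counts[y] += 1
        self.centroids = {y: [a / counts[y] for a in s] for y, s in sums.items()}
        return self

    def predict(self, f):
        """Assign the identity whose centroid is most similar."""
        return max(self.centroids, key=lambda y: cosine(f, self.centroids[y]))
```

The appeal noted above is visible in the sketch: only the tiny classifier is fit on iris data, so no large iris-specific training set is needed.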

Iris Classification Networks. Iris classification networks couple deep learning architectures
with a family of softmax-based losses to classify an iris image into a list of known identities. Coupling a softmax loss with a backbone network enables training the backbone network in an end-to-end manner via popular optimization strategies such as back-propagation and steepest gradient descent. Compared to the conventional classifier approaches, the DL-based backbones in this category are learnable directly from the iris data, allowing them to better represent the iris. A key benefit is that the setup is similar to a generic image classification task, hence all designs and algorithms from generic image classification can be trivially applied to iris image data. Typical examples of these iris classification networks are [15,56]. However, these softmax-based networks require the identity of the test iris to be among the identity classes in the training set, which means the networks must be re-trained whenever a new class (i.e., a new identity) is added. Gangwar et al. proposed two backbone networks (i.e., DeepIrisNet-A and DeepIrisNet-B) followed by a softmax loss for the iris recognition task [56]. Later, they proposed another backbone network, still followed by a softmax loss, to classify a normalized iris image into a pre-defined list of identities [57].
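As a concrete reference for the objective these networks minimize, the softmax cross-entropy can be written out directly (a numerically stabilized plain-Python sketch; in practice this is computed by the framework over mini-batches):

```python
import math

def softmax(logits):
    """Convert raw class scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, true_class):
    """Softmax cross-entropy loss for a single sample: the negative
    log-probability assigned to the ground-truth identity."""
    return -math.log(softmax(logits)[true_class])
```

The closed-set limitation discussed above is also visible here: the length of `logits` is fixed to the number of enrolled identities, so adding a new identity changes the output layer and forces re-training.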
Backbone Network Architectures: A wide range of backbone network architectures have been borrowed from generic image classification for the iris recognition task due to their similarity.
• AlexNet: AlexNet is the most primitive of these architectures and has been shown to be the least accurate for iris recognition compared to the others [16,112,207].

Iris Similarity Networks.
Iris similarity networks couple deep learning architectures with a family of pairwise-based losses to learn a metric representing how similar or dissimilar two iris images are, without knowing their identities. The pairwise loss aims to pull images of the same iris closer and push images of different irises away in the similarity distance space. Different from the iris classification networks, which only operate in an identification mode on a pre-defined identity list, iris similarity networks operate in both verification and identification modes with an open set of identities [209]. Typical examples of these iris similarity networks are [80,97,113,180,209]. There are three key benefits of these networks: (i) verification and identification: iris similarity networks operate in both verification and identification modes; (ii) open set of identities: iris similarity networks operate on an open set of identities; and (iii) explicit reflection: iris similarity networks directly and explicitly reflect what we want to achieve, i.e., small distances between irises of the same subject and larger distances between irises of different subjects.
Pairwise loss: Nianfeng et al. [97] proposed a pairwise network, which accepts two input images and directly outputs a similarity score. They designed a pairwise layer which accepts two input images and encodes their features via a backbone network. The backbone network is trained iteratively to minimize the dissimilarity distance between genuine pairs (pairs of the same identity) and maximize the dissimilarity distance between impostor pairs (pairs of different identities).
Triplet loss: Since the pairwise network is trained with separate genuine and impostor pairs, it may not converge well, as has been shown in face recognition [145]. Rather than using one pair of two images per training iteration as in the pairwise loss, the triplet loss employs a triplet of three images: an anchor image, a positive image with the same identity, and a negative image with a different identity [145]. The backbone network is trained to simultaneously minimize the similarity distance between the positive and the anchor images and maximize the distance between the negative and the anchor images. Tailored for iris images, Zhao et al. [180,209,211] proposed the Extended Triplet Loss (ETL) to incorporate a bit-shifting operation that deals with rotation in the normalized iris images. Nguyen et al. also employed the ETL for their iris recognition network [113,115]. Kuehlkamp et al. [91] proposed to improve the generic triplet loss function for iris recognition by forcing the distance to be positive (through the use of a sigmoid output layer) and adding a logarithmic penalty to the error. This modification allows the network to learn even when the difference between samples is negative, and to converge faster. Yan et al. [195] extended the generic triplet loss to a batch triplet loss, in which the triplet loss is calculated over a batch containing multiple subjects and multiple images per subject. Performing batch triplet loss is usually expected to yield a smoother loss function. Yang et al. [196] improved the triplet selection method for training via Batch Hard mining [197].
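The shift-tolerant triplet objective can be sketched compactly. The following plain-Python illustration (hypothetical names; the actual ETL [180,209] operates on 2-D normalized iris feature maps with horizontal shifting) computes distances as the minimum over a few circular shifts of 1-D feature vectors, then applies the standard hinge-style triplet margin:

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def shifted_dist(u, v, max_shift=2):
    """Minimum distance over circular shifts of v, mimicking the
    rotation tolerance of the Extended Triplet Loss (sketch only)."""
    return min(sq_dist(u, v[s:] + v[:s])
               for s in range(-max_shift, max_shift + 1))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on the shift-tolerant distances:
    zero once the negative is at least `margin` farther than the positive."""
    return max(0.0, shifted_dist(anchor, positive)
                    - shifted_dist(anchor, negative) + margin)
```

Because the minimum over shifts is taken inside the loss, a genuine pair that differs only by a small rotation contributes no gradient pressure, which is the point of the bit-shifting extension.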
Backbone Network Architectures: Different from the iris classification networks, iris similarity networks are usually designed with their own network architectures and are usually much "shallower" than their classification counterparts.
• FCN: All similarity iris networks employ Fully Convolutional Networks (FCNs) instead of CNNs. Compared to CNNs, FCNs [100] do not have fully connected layers, allowing the output map to preserve the original spatial information. This is important for iris recognition, since the output map can preserve spatial correspondence with the original input image [113,209], thus enabling pixel-to-pixel matching. Zhao et al. [209] proposed an FCN architecture with 3 convolutional layers, followed by activation and pooling layers. The outputs of the convolutional layers are up-sampled to the original input image size. The up-sampled features are stacked and convolved by another convolutional layer to generate a 2-dimensional feature map with the same size as the input image. Later, they extended the backbone network with dilated convolutions [180]. Yan et al. [195] employed a ResNet architecture and fine-tuned it with the triplet loss. Kuehlkamp et al. only used a part of the ResNet architecture.
• NAS: Nguyen et al. [113] proposed to learn the network architecture directly from data rather than hand-designing it or using generic image classification architectures. They proposed a differentiable Neural Architecture Search (NAS) approach that models the architecture design process as a bi-level constrained optimization problem. This approach is not only able to search for the optimal network which achieves the best possible performance, but it can also impose constraints on resources such as model size or the number of computational operations.
• Complex-valued: Observing that there is an intrinsic difference between the iris texture and generic object-based images, where the iris texture is stochastic without consistent shapes, edges, or semantic structure, Nguyen et al. [115] argued that the network architecture has to be better tailored to incorporate domain-specific knowledge in order to reach its full potential in the iris recognition setting. Another observation they made is that a majority of well-known handcrafted features, such as IrisCode [35], first transformed the iris texture image into a complex-valued representation, and then further encoded the complex-valued representation to get a final representation. They proposed to use fully complex-valued networks rather than popular real-valued networks. Complex-valued backbone networks better retain phase, are more robust to variations in scale, resolution, and orientation, and have a solid correspondence with the classic Gabor wavelets [173], hence are much better suited to iris recognition than their real-valued counterparts.

End-to-end Joint Iris Segmentation+Recognition Networks
Almost all existing approaches perform segmentation and normalization to transform an input image into a normalized rectangular 2D representation before recognition, as this simplifies the representation learning. As segmentation and recognition may each require a separate network, this causes redundancy in both computation and training, further slowing down a DL-based iris recognition approach. Several researchers have looked at approaches to perform recognition with end-to-end networks. One category is to perform segmentation-less recognition. Another category is to jointly learn segmentation and recognition with a unified network via multi-task learning.
Segmentation-less: These approaches feed the cropped iris images directly into a deep learning network to extract features. For example, Kuehlkamp et al. [91] used Mask R-CNN for semantic segmentation and fed the cropped iris region directly into a ResNet50 to extract features. Similarly, Chen et al. [24] also fed the cropped iris images directly into a DenseNet. Rather than feeding the cropped iris images directly, Proenca et al. transformed the cropped region (detected by SSD) into a polar representation first, then fed the polar representation into VGG19 to extract features [135].
Multi-task: Segmentation and recognition can be jointly learned with one unified network. This paves the way for multi-task learning. However, segmentation and recognition may require different numbers of layers, hence research is required on using different intermediate layers for each task. To the best of our knowledge, no approach has yet explored this direction.

DEEP LEARNING-BASED IRIS PRESENTATION ATTACK DETECTION
In parallel to the popularity of biometrics, the security of these systems against attacks has become of paramount importance. The most common attack is a Presentation Attack (PA), which refers to presenting a fake sample to the sensor. The goal can be either to impersonate somebody else's identity (also known as an Impostor Attack Presentation) or to conceal one's own identity (also known as a Concealer Attack Presentation). Via impostor attacks, a person could also enroll fraudulently, allowing a continuous manipulation of the system. The previous acronyms and terms in italics correspond to the vocabulary recommended in the series of ISO/IEC 30107 standards of the ISO/IEC Subcommittee 37 (SC37) on Biometrics [163], which we will follow in the rest of this section. Presentation Attack Instruments (PAI) used to carry out impostor attacks are typically generated from bona fide images of an iris from an individual who has legitimate access to the system. The iris is printed on a piece of paper (printout attack) or displayed on a screen (replay attack) and then presented to the sensor. The irises of deceased individuals can also be used as PAI, since the texture remains intact for some hours [169]. Theoretically, it would be possible to print a genuine iris texture onto a contact lens as well, although this has not been successfully demonstrated yet [16]. Concealer attacks, on the other hand, are commonly done via textured contact lenses that obscure or alter properties of the eye (such as color) to prevent the system from identifying the user. Synthetic iris images [191] not belonging to any specific identity could be used for similar purposes. Concealers can also present their legitimate iris, but in a way not expected by the system, e.g., closing the eyelids as much as possible, looking to the sides (off-axis gaze), rotating the head, etc. Two challenges of PAs are that they happen outside the physical limits of the system, and that they do not require specific knowledge of its inner workings, or
any technical knowledge at all. Thus, if not properly tackled, they can derail public perception of even the most reliable biometric modality. This is even more critical if authentication is done without any supervision. Presentation Attack Detection (PAD) methods to counteract such attacks can operate [54]: (i) at the hardware (or sensor) level, using additional illuminators or sensors that detect intrinsic properties of a living eye or responses to external stimuli (like pupil contraction or reflection), or (ii) at the software level, using only the footprint of the PA (if any) left in the same images captured with the standard sensor that will be employed for authentication. Software-based techniques are in principle less expensive and less intrusive, since they do not demand extra hardware, and they will be the focus of this section.
Two comprehensive surveys on PAD are [30] (2018) and [16] (2020). While DL techniques played only a marginal role in the 2018 survey, they rose in popularity thereafter. We build this section upon the latest survey and summarize the most important developments in DL-based PAD since it was published (Table 3). A descriptive summary of the datasets employed is given later in Section 8. The aim of PAD is to classify an image as either a bona fide or an attack presentation, so it is usually modeled as a two-class classification task. Typical strategies mimic the trend of the previous section on applying DL to iris recognition: either a CNN backbone is used to extract features that feed a conventional classifier, or the network is trained end-to-end to do the classification itself. Some hybrid methods also combine traditional hand-crafted and deep-learned features. In the same manner, the network may be initialized, e.g., on the ImageNet dataset to take advantage of such a large generic corpus, since available iris PAD data is scarcer. Another strategy widely employed in the PAD literature is to use adversarial networks, where a GAN [60] is trained to generate synthetic iris images that the discriminator must use to detect attack samples.

CNNs for Feature Extraction
Since each layer of a CNN represents a different level of abstraction, Fang et al. [44] fused the features from the last four convolutional layers of two models (VGG16, MobileNetV3-small). The features are projected into a lower-dimensional space by PCA and either concatenated for classification with an SVM (feature fusion), or the classification scores of each level are combined (score fusion). Using two databases of printouts and textured contact lenses, the method showed superiority over the use of the different layers individually, or over the feature vector from the next-to-last layer of the networks.
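The two fusion variants differ only in where the combination happens; a minimal sketch (hypothetical names, with scalar scores standing in for per-layer classifier outputs, and PCA omitted for brevity):

```python
def feature_fusion(per_layer_features):
    """Feature-level fusion: concatenate the (already reduced) feature
    vectors from several layers into one vector for a single classifier."""
    return [x for feats in per_layer_features for x in feats]

def score_fusion(per_layer_scores, weights=None):
    """Score-level fusion: combine per-layer classifier scores,
    here via a (weighted) average."""
    if weights is None:
        weights = [1.0 / len(per_layer_scores)] * len(per_layer_scores)
    return sum(w * s for w, s in zip(weights, per_layer_scores))
```

Feature fusion trains one classifier on the concatenated representation, whereas score fusion trains one classifier per layer and merges only their decisions.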

End-to-end Classification Networks
Arora and Bhatia [8] trained a CNN with 10 convolutional layers to detect contact lenses and printouts. Rather than using the entire image, the network is trained on patches from all parts of the iris image. The system showed superior performance compared to state-of-the-art methods which, at that time, according to the paper, were mostly based on hand-crafted features.
Focusing on embedded low-power devices, Peng et al. [126] adopted a Lite Anti-attack Iris Location Network (LAILNet) based on three dense blocks featuring depthwise separable convolutions to reduce the number of parameters. The algorithm demonstrated very good performance on three databases with printouts, synthetic irises, contact lenses, and artificial plastic eyes.
Also targeting mobile devices, Fang et al. [45,48] used MobileNetv3-small. The contribution lies in the division of the normalized iris image into overlapping micro-stripes, which are fed individually, with a decision reached by majority voting. The claimed advantages are that the classifier is forced to focus on the iris/sclera boundaries (given by their exact micro-stripes), the input dimensionality is lower while the number of samples is higher (reducing overfitting), and the impact of imprecise segmentation is alleviated. Using three databases with contact lenses and printouts, the paper featured extensive experimentation with cross-database, cross-sensor, and cross-attack settings.
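The micro-stripe scheme can be sketched as follows; the stripe height, step, and per-stripe scores are illustrative values (the per-stripe classifier itself is omitted):

```python
import numpy as np

def micro_stripes(norm_iris, height=21, step=9):
    """Slice a normalized (rubber-sheet) iris image into overlapping
    horizontal micro-stripes; sizes here are illustrative, not from [45,48]."""
    h = norm_iris.shape[0]
    return [norm_iris[y:y + height] for y in range(0, h - height + 1, step)]

def majority_vote(per_stripe_scores, threshold=0.5):
    """Final bona fide / attack decision over per-stripe classifier scores."""
    votes = [s >= threshold for s in per_stripe_scores]
    return sum(votes) > len(votes) / 2

norm_iris = np.zeros((64, 512))            # stand-in normalized iris image
stripes = micro_stripes(norm_iris)
scores = [0.9, 0.8, 0.3, 0.7, 0.6]         # stand-in per-stripe scores
print(len(stripes), majority_vote(scores))  # → 5 True
```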
Sharma and Ross [150] proposed D-NetPAD, based on DenseNet121, chosen for benefits such as the maximum flow of information given by dense connections to all subsequent layers, and fewer parameters compared to counterparts like ResNet or VGG. The PAIs included printouts, artificial eyes, cosmetic contacts, kindle replays, and transparent domes on print, with experiments substantiating the effectiveness of the method in cross-PAI, cross-sensor, and cross-database scenarios.
Chen and Ross [19] proposed an explainable attention-guided detector (AG-PAD). To do so, the feature maps of a DenseNet121 were fed into two modules that independently capture inter-channel and inter-spatial feature dependencies. The outputs were then fused via element-wise sum to capture complementary attention features from both the channel and spatial dimensions. With three datasets containing colored contact lenses, artificial eyes (Van Dyke/Doll fake eyes), printouts, and textured contact lenses, the attention modules were shown to improve accuracy over the baseline network. Using heatmap visualization, it was also shown that the attention modules force the network to attend to the annular iris textural region which, intuitively, plays a vital role for PAD.
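A toy NumPy illustration of fusing channel and spatial attention by element-wise sum; the actual AG-PAD modules are learned, so the sigmoid-of-mean weights below are only a stand-in for the idea:

```python
import numpy as np

rng = np.random.default_rng(1)
fmap = rng.normal(size=(8, 16, 16))  # feature map: C x H x W

def channel_attention(x):
    """Weight each channel by a squashed global-average descriptor."""
    w = 1 / (1 + np.exp(-x.mean(axis=(1, 2))))   # sigmoid of GAP, shape (C,)
    return x * w[:, None, None]

def spatial_attention(x):
    """Weight each spatial location by a squashed channel-mean map."""
    w = 1 / (1 + np.exp(-x.mean(axis=0)))        # shape (H, W)
    return x * w[None, :, :]

# Element-wise sum fuses complementary channel and spatial attention features.
fused = channel_attention(fmap) + spatial_attention(fmap)
```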
Spatial attention was also explored by Fang et al. [46]. To find the local regions that contribute the most to accurate decisions and to capture pixel/patch-level cues, they proposed an attention-based pixel-wise binary supervision (A-PBS) method. To capture different levels of abstraction, they perform multi-scale fusion by adding spatial attention modules to feature maps from three levels of a DenseNet backbone. Using six datasets with textured lenses and printouts, they outperformed the previous state of the art, including scenarios with unknown attacks, sensors, and databases.
Given the difficulty of collecting iris PAD data, most databases contain, at most, a few hundred subjects. To address this, Fang et al. [47] studied data augmentation techniques that modify position, scale, or illumination. Using three architectures (ResNet50, VGG16, MobileNetv3-small) and three databases with printouts and textured contact lenses, they found that data augmentation improves PAD performance significantly, but that each technique has a positive effect only on a particular dataset or CNN. They also explored the selection of augmentation techniques, finding, again, no consensus regarding the best combination, which was attributed to differences in capture environment, subject population, the scale of the different datasets, or the imbalance between bona fide and attack samples.
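The three families of augmentations studied (position, scale, illumination) can be mimicked with basic array operations; the following dependency-free sketch uses simplified versions rather than the exact transformations of [47]:

```python
import numpy as np

rng = np.random.default_rng(2)
img = rng.uniform(0, 255, size=(120, 160))  # stand-in NIR iris image

def shift(img, dx, dy):
    """Position change: translate the image (wrap-around for simplicity)."""
    return np.roll(np.roll(img, dy, axis=0), dx, axis=1)

def center_zoom(img, factor=1.25):
    """Scale change: crop the center and resize back to the original size
    with nearest-neighbour sampling (kept dependency-free)."""
    h, w = img.shape
    ch, cw = int(h / factor), int(w / factor)
    y0, x0 = (h - ch) // 2, (w - cw) // 2
    crop = img[y0:y0 + ch, x0:x0 + cw]
    yi = (np.arange(h) * ch / h).astype(int)
    xi = (np.arange(w) * cw / w).astype(int)
    return crop[np.ix_(yi, xi)]

def brightness(img, gain=1.2, bias=10):
    """Illumination change: linear intensity adjustment."""
    return np.clip(gain * img + bias, 0, 255)

augmented = [shift(img, 5, -3), center_zoom(img), brightness(img)]
```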
Gupta et al. [63] proposed MVANet, with 5 convolutional layers and 3 branches of fully connected layers. They addressed the challenge of unseen databases, sensors, and imaging environments in textured contact lens detection. The size of each layer of MVANet is different, thus capturing different features. They used three databases, each captured in different settings (indoor/outdoor, different times of the day, varying weather, fixed/mobile sensors, etc.), with MVANet trained on one database at a time and tested on the other two. As baselines, they fine-tuned three popular CNNs (VGG16, ResNet18, DenseNet) initialized on ImageNet. The proposed network is shown to perform consistently better and more uniformly on the test databases than the baseline approaches.
Sharma and Ross [151] studied the viability of Optical Coherence Tomography (OCT). OCT provides a cross-sectional view of the eye, whereas traditional NIR or VW imaging provides 2D textural data. The PAIs considered are artificial eyes (Van Dyke eyes) and cosmetic lenses, evaluated on three different CNNs (VGG19, ResNet50, DenseNet121). Through both intra-attack (known PAs) and cross-attack (unknown PAs) scenarios, OCT is determined to be a viable solution, although hardware cost is still a limiting factor. Indeed, OCT outperforms NIR and VW in the intra-attack scenario, while NIR generalizes better to unseen PAs. Cosmetic lenses also appear to be more difficult to detect than artificial eyes with any modality. Via heatmaps, it is also seen that the fixation regions are different for each imaging modality and for each PAI, which could be a source of complementarity.
Zhang et al. [199] proposed a Weighted Region Network (WRN) to detect cosmetic lenses, which includes a local attention Weight Network (for evaluating the discriminating information of different regions) and a global classification Region Network (for characterizing global features). This strategy considers both the entire image and the attention effect by assigning different weights to regions. The mentioned networks are applied to a VGG16 backbone. The reported results showed improved performance compared to the state of the art over three different databases.
The works by Agarwal et al. [1,2] evaluated the detection of contact lenses. In [2], they trained a siamese CNN of 5 convolutional layers on two different inputs (the original image and its CLAHE version), which are then combined by weighted score fusion of the softmax layer. Adding a processed version of the raw image attempts to enhance the feature extraction capabilities of the CNN. A similar strategy is followed in [1], but here they used a siamese contraction-expansion CNN, and the processed image is an edge-enhanced image obtained via Histogram of Oriented Gradients (HOG). Another difference was the use of feature-level fusion of the next-to-last CNN feature vectors, testing different strategies (vector addition, multiplication, concatenation, and distance). The papers employed several databases, with an extensive protocol including unseen subjects, environments (indoor vs. outdoor), and databases (sensors) that showcases the strength of the solutions against cross-domain changes. The methods also showed superiority over popular CNN models (VGG16, ResNet18, DenseNet) and the popular LBP and HOG hand-crafted features.
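A simplified illustration of the two-input idea: plain global histogram equalization stands in for CLAHE (which additionally operates on local tiles with histogram clipping), and the fusion weight is illustrative, not the one learned in [2]:

```python
import numpy as np

def hist_equalize(img):
    """Global histogram equalization of an 8-bit image -- a simplified
    stand-in for CLAHE."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    lut = np.round(255 * (cdf - cdf.min()) / (cdf.max() - cdf.min()))
    return lut[img].astype(np.uint8)

def fuse_scores(softmax_raw, softmax_enhanced, w=0.6):
    """Weighted score fusion of the two branch softmax outputs."""
    return w * softmax_raw + (1 - w) * softmax_enhanced

rng = np.random.default_rng(3)
img = rng.integers(40, 120, size=(64, 64), dtype=np.uint8)  # low contrast
eq = hist_equalize(img)
print(img.max() - img.min(), eq.max() - eq.min())  # contrast is stretched

fused = fuse_scores(np.array([0.7, 0.3]), np.array([0.9, 0.1]))
```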
Gautam et al. [59] proposed a Deep Supervised Class Encoding (DSCE) approach consisting of an autoencoder that exploits class information and simultaneously minimizes the reconstruction and classification errors during training. Three datasets were used, containing textured lenses, printouts, and synthetic images, showing superiority over a variety of hand-crafted and deep-learned features.
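The joint objective can be written as a reconstruction term plus a weighted classification term; a NumPy sketch follows (the balance factor `lam` and the toy batch are illustrative, not taken from [59]):

```python
import numpy as np

def joint_loss(x, x_rec, y_true, y_prob, lam=0.5):
    """Joint objective in the spirit of DSCE: an autoencoder MSE
    reconstruction term plus a supervised cross-entropy term,
    balanced by `lam` (value is illustrative)."""
    rec = np.mean((x - x_rec) ** 2)
    cls = -np.mean(np.log(y_prob[np.arange(len(y_true)), y_true] + 1e-12))
    return rec + lam * cls

x = np.ones((4, 8)); x_rec = 0.9 * x                         # toy batch
y_true = np.array([0, 1, 0, 1])                              # bona fide / attack
y_prob = np.array([[0.8, 0.2], [0.3, 0.7], [0.9, 0.1], [0.4, 0.6]])
loss = joint_loss(x, x_rec, y_true, y_prob)
```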
Tapia et al. [162] used a two-stage serial architecture based on a modified MobileNetv2. A first network was trained to distinguish only two classes (bona fide vs. attack). If it votes bona fide, the image is sent to a second network trained to classify it among three or four classes (bona fide or a different type of PAI: contact lenses, printout, or cadaver). Four databases were combined to obtain a super-set with the different PAIs, and class weights were also incorporated into the loss to compensate for imbalance. The paper applied contrast enhancement (CLAHE) and aggressive data augmentation (rotation, blurring, contrast change, Gaussian noise, edge enhancement, image region dropout, etc.). They tested two image sizes, 224×224 and 448×448, observing that the extra detail of a higher-resolution image results in more effective features. The paper also carried out leave-one-out PAI tests for open-set evaluation, showing robustness in detecting unknown attacks.
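Class weighting for imbalanced PAI data can be sketched as follows; inverse-frequency weighting is one common scheme, and the exact weighting of [162] may differ:

```python
import numpy as np

def class_weights(labels, n_classes):
    """Inverse-frequency weights, so under-represented PAIs
    contribute more to the loss."""
    counts = np.bincount(labels, minlength=n_classes)
    return counts.sum() / (n_classes * counts)

def weighted_cross_entropy(y_true, y_prob, weights):
    picked = y_prob[np.arange(len(y_true)), y_true]
    return np.mean(weights[y_true] * -np.log(picked + 1e-12))

# Imbalanced toy set: many bona fide (class 0), one cadaver sample (class 3).
labels = np.array([0] * 12 + [1] * 4 + [2] * 3 + [3] * 1)
w = class_weights(labels, 4)
print(np.round(w, 2))  # rare classes get the largest weights

probs = np.full((len(labels), 4), 0.25)  # uninformative classifier
loss = weighted_cross_entropy(labels, probs, w)
```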

Hybrid Methods
Choudhary et al. [26,27] applied a Friedman test-based selection method to identify the best features among a set of hand-crafted and deep-learned ones. Each feature method feeds an SVM classifier, and the scores of the individual SVMs are fused via weighted sum. A preliminary version of [26] without feature selection appeared in [25]. The databases of [27] include a medley of different PAs (printouts, synthetic irises, artificial eyeballs, etc.), although the feature selection and classification methods are trained and evaluated separately on each database. The authors observed saturation after a certain number of features are combined, and a superiority of score-level fusion over other methods such as majority voting, feature-level fusion, and rank-level fusion. The work [26], on the other hand, concentrated on textured contact lens attacks, with an extensive set of evaluations including single-sensor, cross-sensor, and combined-sensor experiments. Apart from the generic live vs. attack scenario, it also reports binary and ternary classification across the different types of real (normal iris, soft lens) and fake (textured) classes. Naturally, the cross-sensor error is larger compared to single-sensor, and the combined-sensor error is also observed to be slightly larger. The latter is attributed to the larger intra-class variation created when images from different sensors are combined. In any case, an improvement in performance over previous works with the three datasets employed is observed after the proposed feature selection and score-level fusion method.
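The Friedman test underlying the selection can be illustrated with a hand-rolled statistic over a methods-by-datasets score matrix; the scores below are made up, and ties as well as the post-hoc selection step are omitted:

```python
import numpy as np

def friedman_statistic(scores):
    """Friedman chi-square over an (n_datasets x k_methods) score matrix:
    rank the k methods within each dataset, then test whether mean ranks
    differ: chi2 = 12n/(k(k+1)) * sum(Rbar_j^2) - 3n(k+1)."""
    n, k = scores.shape
    # rank within each row (1 = worst score; ties ignored for brevity)
    ranks = scores.argsort(axis=1).argsort(axis=1) + 1
    R = ranks.mean(axis=0)                      # mean rank per method
    chi2 = 12 * n / (k * (k + 1)) * np.sum(R ** 2) - 3 * n * (k + 1)
    return R, chi2

# Stand-in accuracies of 4 feature methods on 5 datasets.
scores = np.array([[0.90, 0.85, 0.95, 0.70],
                   [0.88, 0.80, 0.93, 0.72],
                   [0.91, 0.84, 0.96, 0.75],
                   [0.87, 0.83, 0.94, 0.71],
                   [0.89, 0.82, 0.92, 0.74]])
mean_ranks, chi2 = friedman_statistic(scores)
best = int(np.argmax(mean_ranks))  # method 2 ranks best on every dataset
```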

Adversarial Networks
Generative methods have been used by some approaches, either to use the trained discriminator for iris PAD, or to generate synthetic samples and augment under-represented classes. In this direction, Yadav and Ross [193] proposed CIT-GAN (Cyclic Image Translation Generative Adversarial Network) for multi-domain style transfer to generate synthetic samples of several PAIs (cosmetic contact lenses, printed eyes, artificial eyes, and kindle-display attacks). To do so, image translation is driven by a Styling Network that learns the style characteristics of each given domain. It also employs a Convolutional Autoencoder in the generator for image-to-image style translation, which takes a domain label as input along with an image. This differs from previous works by the same authors [191,192], which employed the traditional generator/discriminator approach driven by a noise vector. Different PAD methods using hand-crafted (BSIF, DESIST) and deep features (VGG16, D-NetPAD, AlexNet) were evaluated, demonstrating that they can be improved by adding synthetically generated data. The quality of the synthetic images is also superior to that of a competing generative method (StarGAN v2), measured via FID score distributions.

Open Research Questions in Iris PAD
One open research issue is the design of robust iris PAD methods with cross-sensor and cross-database capabilities, so they generalize to unseen imaging conditions. Attackers are constantly developing new attack methodologies to circumvent PAD systems, so an even more important issue is unseen PAIs (i.e., cross-PAI capabilities) [149]. Great results have been achieved in detecting known attack types (known as closed-set recognition), although cross-database evaluation (training on one database and testing on others) still appears to be a difficult challenge due to changes in sensors, acquisition environments, or subjects. Moreover, generalizing to attacks that are unknown at the time of training (open-set recognition) is an even greater challenge for state-of-the-art methods [45]. Part of the problem lies in the limited size of existing databases, which is an issue for data-hungry DL approaches. Some solutions, as studied by some of the methods above, are data augmentation by geometric or illumination modifications [47], or creating additional synthetic data via generative methods [193]. Human-aided DL training is another promising avenue. Indeed, humans and machines cooperating in vision tasks is not new, and this strategy is finding its way into DL as well [14,17]. For example, Boyd et al. [14] analyzed the utility of human judgment about salient regions of images to improve the generalization of DL models. Asking humans about regions they deem important for their decision about an image, the work proposed to transform the training data to incorporate such opinions, demonstrating an improvement in accuracy and generalization in leave-one-attack-type-out scenarios. In a similar work, Boyd et al. [17] incorporated annotated saliency maps into the loss function to penalize large differences with human judgment.
Recently, concerns have emerged about the observed bias of DL methods, which leads to discriminatory performance differences based on the user's demographics, with face biometrics being the most talked-about case, and many companies and authorities banning its use [78]. Naturally, this issue appears in iris PAD as well, as addressed by Fang et al. [49]. Using three baselines based on hand-crafted and DL approaches and a database of contact lenses, the authors showed a significant difference in performance between male and female samples. In dealing with this phenomenon, examination of biases towards eye color or race is another direction worth considering. Some elements considered as PAIs in this section, such as cosmetic lenses, may be worn normally by users without the purpose of fooling the biometric system, as is the case with facial retouching via make-up, digital beautification, or augmented reality [69]. This poses the question of whether it is possible to use such images for authentication while diminishing the effect on recognition performance. Suggested alternatives have been to detect and match portions of live iris tissue still visible [125] or to incorporate ocular information from the surrounding area [4]. Unfortunately, in iris biometrics, recognition with textured contact lenses remains a hard problem to solve.

DEEP LEARNING-BASED FORENSIC IRIS RECOGNITION
Iris recognition has become the next biometric mode (in addition to face, fingerprints, and palmprints) considered for large-scale forensic applications [52], coinciding in time with discoveries made in recent years about the possibility of employing the iris in the recognition of deceased subjects. This includes both the matching of iris patterns acquired a few hours after death with those with longer PMIs (Post-Mortem Intervals), ranging from days [12,143,167,168] to several weeks after demise [18,170], as well as the matching of patterns acquired before death with those collected post-mortem [141]. Due to decomposition changes in the eye tissues, post-mortem iris images differ significantly from live iris images and rarely meet ISO/IEC 29794-6 quality requirements, as shown in Fig. 3(a). The challenges are related to the appropriate detection of places where the cornea dries and generates irregular and large specular highlights, as well as regions where iris muscle furrows show up when the eyeball dehydrates. This is where DL-based methods may win over hand-crafted approaches, as the latter usually make strong assumptions about the anatomy of the iris appearance, which cannot be predicted for eyes undergoing random decomposition processes.
Trokielewicz et al. proposed the first iris recognition method known to us designed specifically for cadaver irises [171,172]. It incorporates a SegNet-based segmenter and a Siamese network-based feature extractor, both trained in a domain-specific way solely on post-mortem iris samples. An interesting element of this approach is that segmentation incorporates two models: one trained with "fine" ground-truth masks, marking all details associated with eye decomposition, and a "coarse" model, aiming at detecting the iris annulus and eyelids, as in classical iris recognition approaches. This allowed applying a standard "rubber sheet" iris image normalization based on the "coarse" masks while, at the same time, excluding decomposition-driven artifacts, marked by the "fine" mask, from encoding. Kuehlkamp et al. [91], in addition to detecting post-mortem deformations, as shown in Fig. 3(c), also proposed a human-interpretable visualization of the classification process. The visualization is based on the Class Activation Mapping mechanism [212] and highlights salient features used by the classifier in its judgment. This novelty in iris recognition algorithms may help human examiners locate iris regions that should be carefully inspected, or verify the algorithm's decision.

HUMAN-MACHINE PAIRING TO IMPROVE DEEP LEARNING-BASED IRIS RECOGNITION
Iris recognition is usually associated with automatic, solely machine-based, and rapid biometric means. This has been changing in the recent decade due to the constantly increasing ubiquity of iris recognition, especially owing to large governmental applications such as [174] or the FBI's Next Generation Identification System (NGI), gradually replacing the Integrated Automated Fingerprint Identification System (IAFIS) [52]. This, combined with the unique identification power of the iris, whetted the appetite to apply this technique to identification problems normally reserved for fingerprints and face: forensics, lost subject searches, or post-mortem identification. To have legal power, however, the judgment about samples originating or not from the same eye must be confirmed by a trained human expert. And this is where DL-based iris image processing may play a useful role. Trokielewicz et al. compared post-mortem iris recognition between humans and machines. They investigated which iris image regions humans and machines mainly attend to when comparing a pair of images. The machine-based attention maps are generated by Grad-CAM to highlight the regions that contribute the most to the deep learning model's prediction. The human-based attention maps are learned by tracking the gaze as the human looks around the screen that displays iris image pairs and recording the regions where the human spends the most time.
Interestingly, while humans and machines tend to focus on a limited number of iris areas, the region, appearance, and density of these areas differ between humans and machines. As the salient regions proposed by the deep learning model and those identified from human eye gaze do not overlap in general, computer-aided visual cues may constitute a valuable addition to the forensic examiner's expertise, as they can highlight important discriminatory regions that the human expert might otherwise miss. This human-machine pairing is important, as human subjects can provide an incorrect decision even after spending quite some time observing many iris regions [120]. In addition, a body of research has shown that humans and machines do not perform equally well under different conditions [20,108,154]. For example, Moreira et al. [108] showed that machines can outperform humans on healthy, easy iris image pairs, whereas humans outperform machines on disease-affected iris image pairs. Human-machine pairing thus holds promise to improve deep learning-based iris recognition.

RECOGNITION IN LESS CONTROLLED ENVIRONMENTS: IRIS/PERIOCULAR ANALYSIS
Rooted in the seminal work of Park et al. [124], efforts have been devoted to the development of human recognition methods that, apart from the iris, also consider information in the vicinity of the eye to infer identity. This is a relatively recent topic, termed periocular recognition. The rationale is that the periocular region represents a trade-off between the face and the iris. Periocular biometrics has been claimed to be particularly useful in environments that produce poor-quality data (e.g., visual surveillance). Recently, as in the case of the iris, several DL-based solutions have been proposed. Hernandez-Diaz et al. [70] tested the suitability of off-the-shelf CNN architectures for the periocular recognition task, observing that, albeit such networks are optimized to classify generic objects, their features can still be effectively transferred to the periocular domain.
In the visual surveillance context, Kim et al. [88] infer subject identities based either on loose or tight regions-of-interest, depending on the perceived image quality. Hwang and Lee [77] proposed a method that prevents the loss of mid-level features and dynamically selects the most important features for classification. Luo et al. [102] used self-attention channel and spatial mechanisms in the feature encoding module of a CNN, in order to obtain the most discriminative features of the iris and periocular regions.
The work of Jung et al. [82] is based on the concept of label smoothing regularization (LSR). With the main goal of reducing intra-class variability, they described a so-called Generalized LSR (GLSR), learning a pre-task network prediction that is claimed to improve the permanence of the obtained periocular features. With similar purposes, Zanlorensi et al. [198] described a preprocessing step based on generative networks able to compensate for the typical data variations in visual surveillance environments. Nie et al. [118] applied convolutional restricted Boltzmann machines to the periocular recognition problem. Starting from a set of genuine pairs that are used as a constraint, a Mahalanobis distance metric is learned.
Obtaining auxiliary information (e.g., soft biometrics) has been seen as an interesting direction to compensate for the lack of image quality. Zhao and Kumar [210] incorporate an attention model into a DL architecture to emphasize the most important regions in the periocular data. The same authors [208] described a semantics-assisted CNN framework to infer comprehensive periocular features. The whole model is composed of different networks, trained on ID and semantic (e.g., gender, ethnicity) data, which are fused at the score and prediction levels. Similarly, Talreja et al. [158] described a multi-branch CNN framework that predicts soft biometrics and ID labels simultaneously, which are finally fused into the final response.
With regard to cross-spectral settings, Hernandez-Diaz et al. [71] used conditional GANs (CGANs) to convert periocular images between domains, which are further fed to intra-domain off-the-shelf frameworks. Sharma et al. [148] described a shallow neural architecture where each model learns the data features in each spectrum. Then, at a subsequent phase, all models are jointly fine-tuned to learn the cross-spectral variability and correspondence features.
Finally, several works have attempted to faithfully fuse the scores/responses from iris and periocular data. Wang and Kumar [181] used periocular features to adaptively match iris data acquired in less constrained conditions. Their framework incorporates such discriminative information using a multilayer perceptron network. Zhang et al. [203] described a DL model that exploits complementary information from the iris and the periocular regions: it applies maxout units to obtain compact representations for each modality and then fuses the discriminative features of the modalities through weighted concatenation. In the opposite direction, Proença and Neves [134] argued that periocular recognition performance is optimized when the components inside the ocular globe (the iris and the sclera) are simply discarded.
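Weighted concatenation fusion can be sketched in a few lines; the feature sizes and modality weights below are illustrative, not those of [203]:

```python
import numpy as np

def weighted_concat(iris_feat, peri_feat, w_iris=0.7, w_peri=0.3):
    """Fuse modality features by weighted concatenation: L2-normalize
    each feature vector, scale by a modality weight, and concatenate."""
    f1 = w_iris * iris_feat / np.linalg.norm(iris_feat)
    f2 = w_peri * peri_feat / np.linalg.norm(peri_feat)
    return np.concatenate([f1, f2])

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(4)
probe = weighted_concat(rng.normal(size=64), rng.normal(size=128))
gallery = weighted_concat(rng.normal(size=64), rng.normal(size=128))
score = cosine_sim(probe, gallery)  # matching score of the fused templates
```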

OPEN-SOURCE DEEP LEARNING-BASED IRIS RECOGNITION TOOLS
Here we summarize the main properties of the datasets employed by the methods of the previous sections for DL-based iris segmentation, recognition, and PAD. We also describe available open-source software code for these tasks, and other relevant tools.

Data Sources
Table 4 gives the technical details of the datasets used in the segmentation and recognition methods of Tables 1 and 2. Table 5 does the same for the iris PAD methods of Table 3. We show the main properties (spectrum, image size, identities, images, sessions) and relevant features. Only the datasets of the methods reported in the previous sections are presented. Since we focus on the most recent developments, we consider that this approach covers the most relevant datasets for each task. Of course, the list of available datasets after decades of iris research is much longer [122].
A first observation is the dominance of near-infrared (NIR) over the visible (VW) spectrum, which should not be surprising, since NIR is regarded as most suitable for iris analysis. However, research-wise, many segmentation and recognition studies (Tables 1, 2) use VW images, pushed by the success of challenging databases such as MICHE and UBIRIS. On the contrary, the VW modality in iris PAD research is residual (Table 3), a tendency also observed in pre-DL research [16,30].
When it comes to the types of Presentation Attack Instruments (PAIs) employed in iris PAD databases, they can be categorized into:
• PP: paper printout of a real iris image, i.e., from a live person
• PPD: paper printout of a real iris image with a transparent 3D plastic eye dome on top
• CLL: textured contact lenses worn by a live person
• CLP: textured contact lenses on printout (either a printout of a CLL image, or a printout of a real iris image with a textured contact lens placed on top)
• RA: replay attack, i.e., a real iris image shown on a display
• AE: artificial eyeball (plastic eyes of two different types: Van Dyke Eyes, with higher iris quality details, and Scary eyes, plastic fake eyes with a simple pattern on the iris region)
• AEC: artificial eyeball with a textured contact lens on top
• SY: synthetic iris, i.e., an image created via generative methods
• PM: post-mortem iris, i.e., an image acquired from cadaver eyes
These PAIs mostly entail presenting the mentioned instrument to the iris sensor, which then captures an image of the artifact. An exception is SY, which directly produces a synthetic digital image, although such an image could be used as a base for, e.g., PP, PPD, RA, or AE attacks. In Table 5, it can be seen that CLL (textured lenses, live) and PP (paper printouts) largely dominate as the most popular PAIs in the existing databases and, consequently, in the related research (Table 3). CLP (textured lenses on printout) also appears in many studies, driven by the wide use of the LivDet-2017-IIITD-WVU set, which includes such a PAI. CASIA-Iris-Fake, which contains AE (artificial eyes) and SY (synthetic irises), also appears in a few studies. Other attacks that one may expect in the digital era, such as RA (replay), are however residual in datasets and recent studies.

Software Tools
The availability of DL-based tools for iris biometrics has been scarce for years, especially for PAD [51]. In the following, we provide a short description of peer-reviewed references with associated source code (link included in the paper, or easily found on the websites of the authors or on dedicated sites such as www.paperswithcode.com). We describe (in this order) tools for segmentation, recognition, and PAD. For each type, the references are presented in chronological order.
[Caption of Table 5, describing the iris PAD datasets of Table 3: NIR: near-infrared; VW: visible wavelength. The type of PAIs (second column) are PP: paper printout, PPD: paper printout with plastic dome, CLL: textured contact lenses (live), CLP: textured contact lenses (printout), RA: replay attack (display), AE: artificial eyeball, AEC: artificial eyeball with textured contact lens, SY: synthetic iris, PM: post-mortem iris. TTP (next-to-last column) indicates the existence of a training/test split. The features (last column) are MS: multi-sensor, ME: multi-environment (e.g., indoor/outdoor, light variability, mobile environment, etc.), UPAI: unseen PAIs in the test set. Notes: (1) contains IIITD-CLI and IIITD-IS; (2) Iris-LivDet-2017-ND-CLD is a subset of ND-CLD-15; (3) IIITD-IS images are printouts of IIITD-CLI captured with an iris scanner and a flatbed scanner; (4) ND-PSID is a subset of ND-CLD-15.]

Segmentation.
Lozej et al. [101] released their end-to-end DL model based on the U-Net architecture [139]. The model was trained and evaluated with a small set of 200 annotated iris images from the CASIA database. The authors also explored the impact of model depth and the use of batch normalization layers.
Kerrigan et al. [85] released the code and models of Iris-recognition-OTS-DNN, a set of four architectures based on off-the-shelf CNNs trained for iris segmentation (two VGG-16 with dilated convolutions, one ResNet with dilated kernels, and one SegNet encoder/decoder). Training databases included CASIA-Irisv4-Interval, ND-Iris-0405, Warsaw-Post-Mortem v2.0, and ND-TWINS-2009-2010, whereas testing data came from ND-Iris-0405 (disjoint subjects), BioSec, and UBIRIS.v2. Results showed that the DL solutions evaluated outperform traditional segmentation techniques, e.g., Hough transform or integro-differential operators. It was also seen that each test dataset had a method that performs best, with UBIRIS obtaining the worst performance. This should not come as a surprise, since it contains VW images with high variability, taken at a distance with a digital camera, whereas the other two come from close-up NIR iris sensors in controlled environments.
Wang et al. [176] released the code and models of their high-efficiency segmentation approach, IrisParseNet. A multi-task attention network is first applied to simultaneously predict the iris mask, pupil mask, and iris outer boundary. Then, from the predicted masks and outer boundary, a parameterization of the iris boundaries is calculated. The solution is complete, in the sense that the mask (including light reflections and occlusions) and the parameterized inner and outer iris boundaries are jointly obtained.
More recently, authors from the same group presented IrisSegBenchmark [177], an open iris segmentation evaluation benchmark in which they implemented six different CNN architectures, including Fully Convolutional Networks (FCN) [100], Deeplab V1, V2, V3 [21], ParseNet [98], PSPNet [206], SegNet [9], and U-Net [139]. The methods were evaluated on CASIA-Irisv4-Distance, MICHE-I, and UBIRIS.v2. As in [85], results showed that the best method depends on the database: ParseNet for CASIA (NIR data), DeeplabV3 for MICHE (VW images from mobile devices), and U-Net for UBIRIS (VW images from a digital camera). In this case, however, the three test databases behaved approximately equally, since they all contain difficult distant data. CASIA showed slightly better accuracy, suggesting that NIR data may be easier to segment. Traditional, non-DL methods were also evaluated, concluding that DL-based segmentation achieves superior accuracy.
Banerjee et al. [10] released the code of their V-Net architecture, designed to overcome some drawbacks of U-Net, such as instability when tackling iris segmentation or a tendency to overfit. A preprocessing stage on the YCrCb and HSV spaces was also added to detect salient regions and aid the detection of iris boundaries. The method was evaluated on the difficult UBIRIS.v2 VW dataset.

Recognition.
The code of the DL method ThirdEye was released by Ahmad and Fuller [3], based on a ResNet-50 trained with triplet loss. The authors directly used segmented images without normalization to a rectangular 2D representation, arguing that such a step may be counterproductive for unconstrained images. The model was evaluated on the ND-0405, IITD, and UBIRIS.v2 datasets.
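The triplet loss used in such training is standard and easy to state; the embeddings and margin below are toy values:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss: pull same-iris embeddings together and
    push different-iris embeddings at least `margin` further apart."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([1.0, 0.0])   # anchor embedding
p = np.array([0.9, 0.1])   # same identity, close embedding
n = np.array([0.0, 1.0])   # different identity, far embedding
print(triplet_loss(a, p, n))   # → 0.0: the triplet already satisfies the margin
```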
The models of Boyd et al. [15] for recognition have also been released, based on a ResNet-50 with different weight initialization techniques: from scratch (random), off-the-shelf ImageNet (general-purpose vision weights), off-the-shelf VGGFace2 (face recognition weights), fine-tuned ImageNet weights, and fine-tuned VGGFace2 weights. Both ImageNet and VGGFace2 are very large datasets with millions of images, and face images contain the iris region. Thus, using these datasets for initialization may be beneficial for iris recognition, where available training data is in the order of hundreds of thousands of images only. This strategy has been followed in ocular soft-biometrics as well [6]. The observed optimal strategy is indeed to fine-tune an off-the-shelf set of weights to the iris recognition domain, be they general-purpose or face recognition weights.

Segmentation and Recognition Packages.
A complete package comprising segmentation and feature encoding was provided by Tann et al. [161]. The segmenter is based on a Fully Convolutional Network (FCN), but encoding is based on hand-crafted Gabor filters [35]. Evaluation was done on CASIA-Irisv4-Interval and IITD.
For forensic investigation of diseased eyes and post-mortem samples, Czajka [29] also released a complete package combining segmentation and feature encoding. The models are based on previous efforts of the author and co-workers, comprising SegNet [172] and CCNet [106] DL segmenters, but the feature encoder is based on hand-crafted BSIF filters.
Another complete segmentation and recognition package was released by Kuehlkamp et al. [91]. The segmenter is based on a fine-tuned Mask R-CNN architecture, with the cropped iris region fed directly into a ResNet-50 pre-trained for face recognition on the very large VGGFace2 dataset and fine-tuned for iris recognition using triplet loss. The paper is oriented towards post-mortem iris analysis, so the methods use a mixture of live and post-mortem images for training and evaluation.
Parzianello and Czajka [125] also released the models and annotated data for their textured contact lens aware iris recognition method. The premise is that such lenses may be worn for purely cosmetic purposes, without any intention to fool the biometric system. They therefore proposed to detect and match portions of live iris tissue still visible, in order to enable recognition even when a person wears textured contact lenses. To do so, they applied a Mask R-CNN as a segmentation backbone, trained to detect authentic-looking parts of the iris using manually segmented samples from the NDIris3D dataset. Non-iris information is then removed from the training images by blurring it or replacing it with random noise, to guide the subsequent recognition network (based on ResNet-18) to salient, non-occluded regions that should be used for matching.
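The noise-replacement step above can be sketched as a simple masking operation; this is an illustrative reconstruction assuming a binary iris mask, not code from the released package.

```python
import numpy as np

def suppress_non_iris(image, iris_mask, seed=0):
    """Replace all pixels outside the binary iris mask with random noise so
    the downstream recognition CNN is steered towards the authentic-looking
    iris tissue only."""
    rng = np.random.default_rng(seed)
    noise = rng.integers(0, 256, size=image.shape).astype(image.dtype)
    return np.where(iris_mask.astype(bool), image, noise)
```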

Iris PAD.
In the iris PAD arena, Gragnaniello et al. [61] proposed a CNN that incorporates domain-specific knowledge. Based on the assumption that PAD relies on residual artifacts left mostly in high frequencies, a regularization term was added to the loss function which forces the first layer to behave as a high-pass filter. The method, which is available on the website of the first author, can be applied to PAD in multiple modalities, including iris and face.
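One way such a regularizer can be realized (a sketch under the assumption that "high-pass" is enforced by suppressing each kernel's DC response; the exact term used in [61] may differ):

```python
import numpy as np

def highpass_penalty(first_layer_kernels):
    """A convolution kernel whose coefficients sum to zero has no DC
    response, i.e. it acts as a high-pass filter. Penalising the squared
    coefficient sum of each first-layer kernel (shape: (num_kernels, ...))
    pushes the layer towards high-pass behaviour, where PAD artifacts are
    assumed to reside."""
    sums = first_layer_kernels.reshape(first_layer_kernels.shape[0], -1).sum(axis=1)
    return float(np.sum(sums ** 2))
```

Adding this penalty to the training loss leaves zero-sum (high-pass) kernels unpenalized while discouraging averaging (low-pass) kernels.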
The code and model of the method of Sharma and Ross [150] (D-NetPAD) are also available. It is based on DenseNet121 and trained on a variety of PAIs (printouts, artificial eyes, cosmetic contact lenses, Kindle replays, and transparent domes on prints), with a script to retrain the method also available.

Other Tools: Iris Image Quality Assessment
Several image properties considered to potentially influence the accuracy of iris biometrics have been defined in support of the standard ISO/IEC 29794-6 [164]. They include: grayscale spread (dynamic range); iris size (pixels across the iris radius when the boundaries are modeled by a circle); dilation (ratio of the pupil to iris radius); usable iris area (percentage of non-occluded iris, whether by eyelashes, eyelids or reflections); contrast of pupil and sclera boundaries; shape (irregularity) of pupil and sclera boundaries; margin (distance between the iris boundary and the closest image edge); sharpness (absence of defocus blur); motion blur; signal-to-noise ratio; gaze (deviation of the optical axis of the eye from the optical axis of the camera); and interlace of the acquisition device.
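Two of the simpler properties above can be computed directly from a circular boundary fit and binary masks; a minimal sketch (function names are illustrative, and ISO/IEC 29794-6 specifies the metrics more precisely than this):

```python
import numpy as np

def dilation_ratio(pupil_radius, iris_radius):
    """Dilation: pupil radius over iris radius, with both boundaries
    modelled as circles."""
    return pupil_radius / iris_radius

def usable_iris_area(iris_mask, occlusion_mask):
    """Percentage of iris pixels not occluded by eyelashes, eyelids or
    specular reflections, given binary masks of the same shape."""
    iris = iris_mask.astype(bool)
    visible = iris & ~occlusion_mask.astype(bool)
    return 100.0 * visible.sum() / iris.sum()
```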
Low-quality iris images, which can appear in uncontrolled or non-cooperative environments, are known to reduce the performance of iris location, segmentation, and recognition. Thus, accurate quality assessment can be a valuable tool in support of the overall pipeline, either by dropping low-quality images or by invoking specialized processing [5]. One possibility is to quantify the properties mentioned above and place a threshold on each. A more elaborate alternative is to combine them according to some rule and produce an overall quality score. However, it is difficult to provide metrics that cover all types of quality distortion [157], and computing some of them indeed entails segmenting the iris.
Broadly, a biometric sample is of good quality if it is suitable for recognition, so quality should correlate with recognition performance [62]. As such, quality assessment can be viewed as a regression problem. Wang et al. [182] considered that a non-ideal eye image will pivot in the feature space around the embedding of an ideal image. They defined quality as the distance to the embedding of such an "ideal" image, which is regarded as a registration sample collected under a highly controlled environment. They used a model to learn the mapping between images and Distance in Feature Space (DFS) directly from a given dataset. Quality is computed via attention-based pooling that combines a heatmap coming from a coarse U-Net-based segmentation and the feature map of an extraction network based on MobileNetV2, pre-trained on CASIA-Iris-V4 and NDIRIS-0405.

EMERGING RESEARCH DIRECTIONS
In this section, we discuss the most relevant open challenges and hypothesize about emerging research directions that could become hot topics in the biometrics literature in the near future.

Resource-aware designs of iris recognition networks
Application-wise, iris recognition can be performed on a wide range of hardware, from high-end computers to low-end embedded devices, or from large computer clusters to personal devices such as mobile phones. Performing recognition on resource-limited hardware poses new challenges for deep learning based iris networks, which usually contain hundreds of layers and millions of parameters. Therefore, the design of these deep learning networks necessarily needs to be aware of the hardware platforms on which they will run. Lightweight models: Lightweight CNNs employ advanced techniques to efficiently trade off resources against accuracy, minimising model size and computation in terms of the number of floating point operations (FLOPs) while retaining high accuracy. Specialized lightweight CNN architectures include MobileNets [73] and U-Net [139]. There are a few lightweight deep learning based models for both segmentation and feature extraction. Fang et al. [50] adapted the lightweight CC-Net [106] for iris segmentation. CC-Net has a U-Net structure [139] and is able to retain up to 95% accuracy using only 0.1% of the trainable parameters. Boutros et al. [13] benchmarked MobileNet-V3 against deeper networks for iris recognition and showed that the MobileNet-based model can achieve a similar EER with 85% fewer parameters and 80% less inference time. Model compression: Studies have found that most large deep learning models tend to be over-parameterized, leading to many redundant parameters and operations in the network. This becomes more severe considering that iris texture images differ from generic object-based images. This has motivated a hot trend aiming to remove these redundancies from the models, including pruning, quantization, and low-rank factorization [95]. In the iris recognition literature, Tann et al. [161] quantized the 64-bit floating-point weights and activations of their full FCN-based iris segmentation model using an 8-bit dynamic fixed-point (DFP) format, which provides an 8× memory saving as well as a speed enhancement due to the reduced complexity of lower-precision operations. Neural Architecture Search: Neural Architecture Search (NAS) automates the process of architecture design of neural networks by iteratively sampling a population of child networks, evaluating the child models' performance metrics as rewards, and learning to generate high-performance architecture candidates [43]. In the iris recognition literature, Nguyen et al. [113] showed that computation and memory can be incorporated into the NAS formulation to enable resource-constrained design of deep iris networks.
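Dynamic fixed-point quantization of the kind used by Tann et al. can be sketched as follows: the fractional length is chosen per tensor so the largest magnitude still fits in the 8-bit word. This is a simplified reconstruction of the general DFP idea, not the authors' implementation.

```python
import numpy as np

def quantize_dfp(x, bits=8):
    """8-bit dynamic fixed-point quantization: pick the integer/fractional
    bit split per tensor from its dynamic range, then round values onto
    that fixed-point grid (returned here de-quantized, for comparison)."""
    max_abs = np.max(np.abs(x))
    # integer bits needed to represent max_abs, including the sign bit
    int_bits = max(int(np.ceil(np.log2(max_abs + 1e-12))) + 1, 1)
    frac_bits = bits - int_bits
    scale = 2.0 ** frac_bits
    q = np.clip(np.round(x * scale), -2 ** (bits - 1), 2 ** (bits - 1) - 1)
    return q / scale
```

Values exactly representable on the grid pass through unchanged; all others incur at most half an LSB of rounding error.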

Human-interpretable methods and XAI
With hundreds of layers and millions of parameters, deep learning networks are usually opaque "black boxes", where humans struggle to understand why a deep network predicts what it predicts. This necessitates approaches to make deep learning methods more interpretable and understandable to humans. Interestingly, the need for human-interpretable methods was raised even in the handcrafted era. For example, Shen et al. published a series of works [20,152] on using iris crypts for iris matching. Iris crypts are clearly visible to humans, in a similar way to fingerprint minutiae. Another example is the macro-features approach [156], which uses SIFT to detect keypoints and performs iris matching based on these keypoints [136]. Another notable work is by Proença et al. [132], who proposed a deformation field to represent the correspondence between two iris images.
From a deep learning perspective, researchers have also attempted to visualize the matching. Kuehlkamp et al. [91] argued that existing iris recognition methods offer limited and non-standard means of visualization to let human examiners interpret the model output. They applied Class Activation Maps (CAM) [212] to visualize the level of contribution of each iris region to the overall matching score. Similarly, Nguyen et al. [115] decomposed the final matching score to the pixel level to visualize the contribution of each pixel to the overall matching score.
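The CAM computation referenced above reduces to a weighted sum over the final convolutional feature maps; a minimal sketch (upsampling of the map to input resolution is omitted):

```python
import numpy as np

def class_activation_map(feature_maps, class_weights):
    """CAM: weight each final conv feature map (shape (C, H, W)) by the
    fully connected weight of the target class (shape (C,)) and sum over
    channels, yielding a spatial map of each region's contribution to the
    class score."""
    return np.tensordot(class_weights, feature_maps, axes=1)  # (H, W)
```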

Deep learning-based synthetic iris generation
Data synthesis provides an alternative to time- and resource-consuming database collection. One could create as many images as desired, with new textures that do not even match any existing identity, which would also avoid privacy problems. On the other hand, fake irises that are indistinguishable from real ones can be used for identity concealment attacks (if the image does not match any identity) or impersonation attacks (if the image resembles an existing identity) [30]. Indeed, synthetic irises are present in databases employed for iris PAD, such as CASIA-Iris-Fake (Table 5).
Regardless of the purpose, or of the ability to detect whether an image is synthetic, Generative Adversarial Networks (GANs) [60] have shown impressive photo-realistic generation capabilities in many domains. GANs learn to model image distributions through an adversarial process, in which a discriminator assesses the realism of images synthesized by a generator. At the end, the generator has learned the distribution of the training data and is able to synthesize new images with the same characteristics.
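The adversarial objective described above can be written down in a few lines; this sketch shows the standard non-saturating GAN losses given discriminator outputs, abstracting away the networks themselves.

```python
import numpy as np

def gan_losses(d_real, d_fake, eps=1e-12):
    """Standard (non-saturating) GAN losses, given discriminator outputs in
    (0, 1): the discriminator maximises log D(x) + log(1 - D(G(z))), while
    the generator maximises log D(G(z)); both are returned as losses to
    minimise."""
    d_loss = -np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps))
    g_loss = -np.mean(np.log(d_fake + eps))
    return float(d_loss), float(g_loss)
```

When the discriminator confidently rejects fakes (low `d_fake`), the generator loss is large, driving it towards more realistic samples.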
For iris generation, some methods by Yadav et al. [191,192,193] were mentioned in iris PAD contexts (Section 4.4). RaSGAN [191,192] followed the traditional approach of driving the generation/discrimination training by randomly sampling so-called latent vectors from a probabilistic distribution. As training progresses, the generator learns to associate features of the latent vectors with semantically meaningful attributes that naturally vary in the images. However, this does not impose any restriction on the relationship between features in the latent space and factors of variation in the image domain, making it difficult to decode what the latent vectors represent. As a result, the image characteristics (eye color, eyelid shape, eyelashes, gender, age...) are generated randomly. Kohli et al. [90] presented iDCGAN for iris PAD, which also followed the latent vector sampling concept. To counteract this issue, researchers have tried to incorporate constraints or mechanisms that guide the generation process towards a desired characteristic. For example, CIT-GAN [193] employed a Styling Network that learns the style characteristics of each given domain, while taking as input a domain label that drives the network to embed a desired style into the generated data.
In a similar direction, Kaur and Manduchi [83,84] proposed to synthesize eye images with a desired style (skin color, texture, iris color, identity) using an encoder-decoder ResNet. The method is aimed at manipulating gaze, so the generator receives a segmentation mask with the desired gaze and an image with the style whose gaze will be modified. To achieve cross-spectral recognition, Hernandez-Diaz et al. [71] used CGANs to convert ocular images between the VW and NIR spectra while keeping identity, so that comparisons are done within the same spectrum. This allows the use of existing feature methods, which are typically optimized to operate in a single spectrum.
Despite great advances in DL-based synthetic image generation, one open problem is the possible identity leakage from the training set when creating data of non-existing identities, resulting in privacy issues. This has only recently been revealed in face generation [165]. Another issue, in the opposite direction, is the difficulty of preserving identity in the generation process when the target is precisely to create images of an existing identity with different properties. This issue is being addressed in face generation methods [reference under review], but is lacking in iris synthesis research.

Deep learning-based iris super-resolution
One of the main constraints of existing iris recognition systems is the short image acquisition distance, which usually requires a subject to stay still at less than 60 cm from the iris camera. This is due to the requirement of a high-resolution iris region, e.g., 120 pixels across the iris diameter according to European and NIST standards, despite the small physical size of an eye, i.e., 15 × 15 mm. The lack of resolution of imaging systems has critically adverse impacts on the recognition performance of biometric systems, especially in less constrained conditions and long-range surveillance applications [116].
Super-resolution, one of the core innovations in computer vision, has been an attractive but challenging solution to the low-resolution problem in both general imaging systems and biometric systems. Deep learning based super-resolution approaches have appeared in multiple works on iris recognition. Ribeiro et al. [137,138] experimented with two deep learning single-image super-resolution approaches: Stacked Auto-Encoders (SAE) and Convolutional Neural Networks (CNN). Both approaches learn an encoder to map high-resolution iris images to the low-resolution domain, and a decoder to reconstruct the original high-resolution images from the low-resolution ones. Zhang et al. [201] trained a single CNN to learn the non-linear mapping function between LR and HR images for mobile iris recognition. Wang et al. [183] extended the single CNN to two CNNs: a generator CNN and a discriminator CNN, as in the GAN architecture. The generator functions similarly to the single LR-HR mapping CNN; adding the discriminator CNN allows them to push the generator towards HR images that are not just visually sharper but also preserve the identity of the iris. Mostofa et al. [109] incorporated a GAN-based photo-realistic super-resolution approach [93] to improve the resolution of LR iris images from the NIR domain before cross-matching the HR outputs with HR images from the RGB domain. While these approaches showed improved performance, dealing with noisy data, as in iris at a distance and on the move, may require the quality of the input iris image to be included in the super-resolution process [114]. In addition, Nguyen et al. argued that a fundamental difference exists between conventional super-resolution objectives and those required for biometrics, hence proposing to perform super-resolution at the feature level, explicitly targeting the representation used by recognition [117].
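A formulation common to several CNN super-resolution methods is to upsample with a fixed interpolation and let the network predict only the missing high-frequency residual; a hedged sketch of that structure, where `predict_residual` stands in for a trained CNN and nearest-neighbour interpolation is used for simplicity:

```python
import numpy as np

def super_resolve(lr, predict_residual, factor=2):
    """Residual super-resolution: upsample the LR image with a fixed
    interpolation, then add a learned residual carrying the missing
    high frequencies. `predict_residual` is a placeholder for a trained
    network mapping the upsampled image to its residual."""
    up = np.repeat(np.repeat(lr, factor, axis=0), factor, axis=1)
    return up + predict_residual(up)
```

With a zero residual, the output reduces to plain interpolation, which is the baseline a learned model must beat.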

Privacy in deep learning-based iris recognition
Privacy is becoming a key issue in computer vision and machine learning domains.In particular, it is accepted that the accuracy attained by deep learning models depends on the availability of large amounts of visual data, which stresses the need for privacy-preserving recognition solutions.
In short, the goal of privacy-preserving deep learning is to appropriately train models while preserving the privacy of the training datasets. While the utility of such solutions is obvious, there are certain concerns about the training data that supported the model creation, as the collection of images from a large number of individuals comes with significant privacy risks. In particular, it should be considered that the subjects from whom the data were collected can neither delete nor control what will actually be learned from their data.
As with most existing biometric technologies, DL-based iris recognition poses challenges to privacy, which are even more concerning considering the data-driven nature of such systems. Particular attention should be paid to avoiding function creep, guaranteeing that the system resulting from a set of data is not used for a purpose different from the one originally communicated to the individuals at the time they provided their information. Covert collection is another major concern, which is particularly important for the iris trait, given the possibility of it being imaged from large distances and in a surreptitious way.
Particular attention has also been paid to the development of fair recognition systems, in the sense that such systems should attain similar effectiveness across different subgroups of the population, regarding features such as gender, age, race, or ethnicity. For data-driven systems, this might be a relevant challenge, considering that most of the existing datasets that support the learned systems have evident biases with regard to the subject characteristics above.
Lastly, from a more general machine learning perspective, potential attacks on the learned models have concerned the research community and have been the scope of various recent works, which attempt to provide defense mechanisms against: i) model inversion attacks, which aim to reconstruct the training data from the model parameters (e.g., [87] and [67]); ii) membership inference attacks, which attempt to infer whether an individual was part of a training set (e.g., [75] and [153]); and iii) training data extraction attacks, which aim to recover individual training samples by querying the models (e.g., [86] and [39]).

Deep learning-based iris segmentation
Being one of the earliest phases of the recognition process, segmentation is known as one of the most challenging, as it is at the front line in facing the dynamics of the data acquisition environments. This is particularly true in the case of less constrained data acquisition protocols, where the resulting data have highly varying features and the particular conditions of each environment strongly determine the most likely data covariates.
In the segmentation context, the main challenge remains the development of methods robust to cross-domain settings, i.e., able to segment the iris region for a broad range of image features, e.g., in terms of: 1) illumination, 2) scale, 3) gaze, 4) occlusions, 5) rotation, and 6) pose, corresponding to acquisition in very different environments. Over the past decades, many research groups have devoted their attention to improving the robustness of iris segmentation, which is known to be a primary factor in the final effectiveness of the recognition process. In this timeline, the proposed segmentation methods can be roughly grouped into three categories: 1) boundary-based methods (using the integro-differential operator or the Hough transform); 2) methods based on handcrafted features (particularly suited for non-cooperative recognition, e.g., [160] and [159]); and 3) DL-based solutions.
For the latter family of methods, the emerging trends are closely related to the general challenges of DL-based segmentation frameworks, namely obtaining interpretable models that allow us to perceive what exactly these systems are learning, or finding the minimal neural architecture that guarantees a predefined level of accuracy. Also, the development of weakly supervised or even unsupervised frameworks is another grand challenge, as it is accepted that such systems will likely adapt better to previously unseen data acquisition conditions. Finally, the computational cost of segmentation (both in terms of space and time) is another concern, with special impact on the deployment of such frameworks in mobile and IoT settings [140].

Deep learning-based iris recognition in visible wavelengths
Having been a topic of study for over a decade (e.g., [99] and [129]), iris recognition in visible wavelengths remains essentially an interesting possibility for delivering biometric recognition from large distances (in conditions typically associated with visual surveillance settings) and on handheld commercial devices, such as smartphones.
The emerging trends in this scope concern the development of alternative ways to analyze the multi-spectral information available in visible-light data (typically RGB), i.e., by developing deep learning architectures optimized for fusion, either at the data, feature, score, or decision levels [11].
In the visual surveillance setting, the main challenge concerns the development of optimized data acquisition settings, profiting from advances in remote sensing technologies, which should be able to augment the quality (e.g., resolution and sharpness) of the obtained iris images. In this scope, research on active data acquisition technologies (based on PTZ devices, or similar) may also be an interesting emerging possibility [66].

CONCLUSIONS
Motivated by the tremendous success of DL-based solutions for many everyday problems, machine learning is entering one of its golden eras, attracting growing interest from the research, commercial, and governmental communities. In short, deep learning uses multiple layers to represent abstractions of data and build computational models that, even in a somewhat surprising way, typically surpass the previous generation of handcrafted approaches. However, being extremely data-driven, the effectiveness of DL-based solutions is typically constrained by the availability of massive amounts of data, annotated in a consistent way.
As in most computer vision topics, a myriad of DL-based techniques has been proposed over the last years to perform biometric recognition, and in particular iris recognition. Nowadays, the existing methods cover all phases of the typical processing chain, from preprocessing and segmentation, through feature extraction, up to the matching and recognition steps.
Accordingly, this article provides the first comprehensive review of the historical and state-of-the-art approaches in DL-based techniques for iris recognition, followed by an in-depth analysis of pivotal and groundbreaking advances in each phase of the processing chain. We summarize and critically compare the most relevant methods for the iris acquisition, segmentation, quality assessment, feature encoding, matching, and recognition problems, also presenting the most relevant open problems for each phase.
Finally, we review the typical issues faced by DL-based methods in this domain of expertise, such as unsupervised learning, black-box models, and online learning, and illustrate how these challenges can open prolific future research paths and solutions.

Fig. 2 .
Fig. 2. The main task of DL-based iris feature extraction: given a dimensionless representation of the iris data, obtain a compact and representative encoding (the feature set) that is further used in the classification phase.

Fig. 3 .
Fig. 3. Post-mortem iris recognition and visualization: (a) a good-quality post-mortem iris image; (b) top to bottom: deep learning-based detection of iris annulus, specular highlights and decomposition-induced wrinkles; (c) segmentation results presented to a human examiner along with an overlaid heatmap visualizing regions judged as salient by the matching algorithm. Source: [91]

Table 1 .
Cohesive comparison of the most relevant DL-based iris segmentation methods (NIR: near-infrared; VW: visible wavelength).Methods are listed in chronological (and then alphabetical) order.

Table 2 .
Cohesive comparison of the most relevant DL-based iris recognition methods (NIR: near-infrared; VW: visible wavelength).Methods are listed in chronological (and then alphabetical) order.

Table 3 .
Cohesive comparison of the most relevant DL-based iris Presentation Attack Detection methods after the surveys [16,30] (NIR: near-infrared; VW: visible wavelength). Methods are listed in chronological (and then alphabetical) order.

Table 4 .
Summary of datasets used in the DL-based iris segmentation and recognition methods of Tables 1 and 2 (NIR: near-infrared; VW: visible wavelength).

Table 5 .
Summary of datasets used in the DL-based iris Presentation Attack Detection methods of Table