A Long-Range Mutual Gaze Detector for HRI

The detection of mutual gaze in the context of human-robot interaction is crucial for understanding the behavior of human partners. Indeed, monitoring the users' gaze from a long distance enables the prediction of their intentions and allows the robot to be proactive. Nonetheless, current implementations struggle, or cannot operate at all, in scenarios where detection from long distances is required. In this work, we propose a ROS2 software pipeline that detects mutual gaze at distances of up to 5 m. The code relies on robust off-the-shelf perception algorithms.


INTRODUCTION
In Human-Robot Interaction (HRI), nonverbal communication is given crucial importance [11, 20]. In particular, the user's gaze is a very powerful indicator of human thinking processes and intentions, even before they are expressed [6, 7]. Nonetheless, capturing nonverbal cues for HRI purposes is not an easy task [19]. For gaze detectors, technical limitations usually restrict usage to close distances [13, 23], where eye and face landmarks are more easily captured. In the context of social HRI, it is desirable to endow robots with perception skills that allow them to detect human behavior from afar, well before the start of a potential interaction [5].
The process of extracting gaze information can be divided into two main problems: estimation of the gaze direction and detection of mutual gaze. The former is usually tackled by using precise gaze-tracking algorithms. Solutions based on Deep Learning (DL) are available [13, 22, 23]; however, they lose performance at long distances.

Figure 1: Our approach implements a long-range mutual gaze detector fusing the information from the users' body frames (in green) and the facial landmarks (in orange). Panels shown at 1 m, 2.5 m, and 5 m.
Alternative approaches implementing longer-range detectors [8, 12] usually require specific hardware (like infrared projectors), which is not available in the standard sensory equipment of social robots. The tracking performance of such cumbersome sensors can be replicated using simpler RGB cameras [14], at the expense of keeping the maximum operating distance at around 1.8 m. Recent solutions [21] showed encouraging results in estimating the gaze direction from long distances; however, to the best of our knowledge, there is currently no released implementation. On the other hand, the detection of mutual gaze is a much simpler problem. In general, mutual gaze defines direct eye contact between individuals; in the HRI context of this work, it refers to whether a person is looking at a robot. This problem has been previously studied with good results [9], even in the robotics domain, making use of body landmarks in an approach similar to our proposed solution [15]. However, these methods are insufficient for many social HRI applications, as their maximum operating distance is below 2 m. Instead, a desirable operating zone should at least include the robot's social space, defined as a circle of 4 m around the robot [16, 19]. To this end, we use a face mesh tracking solution with a finer resolution (allowing more control over the number and location of the landmarks) and integrate it with a 3D body tracker. Fusing carefully chosen face landmarks with body information enables our mutual gaze detector to work at distances up to around 5 m.
Our approach aims at providing the HRI community with a long-range mutual gaze detector, see Fig. 1. We design a detector that only relies on a standard sensor and can be easily used in different HRI tasks and scenarios. Examples of applications are: (i) prediction of the intention to interact, as detecting mutual gaze from far distances is indicative of users' willingness to interact; (ii) engagement monitoring, to adapt the robot's behavior to maintain or increase engagement during tasks or conversations; (iii) collaborative tasks, e.g. for industrial manufacturing, as mutual gaze tracking can improve coordination. The approach leverages off-the-shelf algorithms providing information on the users' body motion and facial landmarks. It is implemented in the Robot Operating System 2 (ROS2) framework and can be easily installed and deployed.

APPROACH

2.1 Problem formulation
We consider a robot that performs social interaction with humans. The robot is equipped with an RGB-D sensor, and its reference frame is defined at the center of its RGB camera sensor. This frame is oriented so that its vertical axis consistently aligns with the gravity vector, while the heading angle remains unconstrained and free to follow the robot's orientation. This design choice is crucial because it makes the approach independent of the camera movement and orientation, thus allowing the camera to be mounted on moving parts, e.g. a robotic head. The information about the users' body motion is expressed by monitoring two particular frames of interest: one located on the person's chest, and the other on their head. We assume that the poses of these frames w.r.t. the robot's frame are measurable by the robot perception system. Furthermore, we also assume that the camera RGB images allow the detection of facial landmarks, which consist of the projected locations, on the image plane, of specific points of interest detected on a person's face. Each landmark is defined by a 3D vector consisting of its image coordinates (see Fig. 2) and the predicted depth of the corresponding point on the user's face. In the proposed approach, facial landmarks and body information are fed to a classifier that predicts whether the subject is looking at the robot, i.e. gives an estimate of the mutual gaze. We aim to provide an implementation of such a mutual gaze classifier that can work at distances greater than 2 m.
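To make the inputs concrete, the following sketch shows one possible way to organize the quantities defined above. The field names and the 7-component pose layout (3D position plus a unit quaternion) are illustrative assumptions, not the actual message definitions used in the pipeline.

```python
# Illustrative container for one user's observation; names and layout are
# assumptions for exposition, not the actual message definitions.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class FacialLandmark:
    u: float      # image x-coordinate (pixels)
    v: float      # image y-coordinate (pixels)
    depth: float  # predicted depth of the corresponding face point (m)

@dataclass
class UserObservation:
    chest_pose: Tuple[float, ...]    # chest frame pose w.r.t. the robot frame
    head_pose: Tuple[float, ...]     # head frame pose w.r.t. the robot frame
    landmarks: List[FacialLandmark]  # detected facial landmarks

obs = UserObservation(
    chest_pose=(0.0, 0.0, 1.1, 0.0, 0.0, 0.0, 1.0),  # x, y, z + quaternion (assumed)
    head_pose=(0.0, 0.0, 1.5, 0.0, 0.0, 0.0, 1.0),
    landmarks=[FacialLandmark(512.0, 384.0, 2.1)],
)
```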

2.2 Data structure
Previously available algorithms for mutual gaze estimation [15] are not designed for distances larger than 2 m. Publicly available gaze datasets do not contain data acquired at farther distances and are not fit for our needs (see, e.g., [10] or the dataset related to [23]). Therefore, we train our classifier on a dataset collected ad hoc. Such a dataset is composed of two subsets differing in the users' motion patterns during the acquisition campaign. The first subset is called Standing set and gathers data of users standing in front of the robot in 28 predefined positions. Such positions are arranged on a grid pattern to cover the entire sensor's Field of View (FoV), ranging from 0.8 m to 4.2 m. In each position, the user does not walk but moves their head and torso while alternating periods of looking at the robot and elsewhere. For each instant, we record the users' body frame poses, the facial landmarks, and the mutual gaze ground truth, switching position after about 90 s. Data are labeled by the subjects themselves, by pressing a wireless remote button when they look directly into the sensor. The data thus collected produced 12373 samples (34.8% true and 65.2% false). The Standing set provides a lot of data, useful to build the core capabilities of the classifier, but it does not offer much variability. Therefore, the Passing by set is also recorded. In this dataset, the subjects pass in front of the robot with different random movement patterns. This set contains 1449 samples (56.5% true and 43.5% false). Figure 3 shows the pattern of the users' positions for each sample contained in the whole training dataset, with the darker positions showing the grid pattern used for the Standing set. The complete dataset is composed of 13822 samples; 522 of them (all belonging to the Passing by set) are used as a test set; the remaining 13300 are used for training. The local institutional ethical committee has cleared the data-gathering campaign and the experiments.

2.3 Implementation details
The RGB-D sensor is placed at a height of around 1.10 m, which is typical of many commercially available social robots, e.g. Tiago by PAL Robotics [17] or Pepper by Aldebaran [18]. The sensor used in this work is the Azure Kinect RGB-D sensor [1], streaming images at the maximum resolution of 4096 × 3072 pixels at 5 Hz, with horizontal and vertical FoV equal to 90° and 59°, respectively. The sensor's Software Development Kit (SDK) provides the tracking of 32 salient body frames composing a skeleton of the user, as shown in green in Fig. 1. From such a skeleton, we select the head and chest frame information. As for the facial landmarks, we use a modified version of the MediaPipe library [2]. This package offers detection of up to 478 facial landmarks from RGB images, providing a very detailed mesh of the users' faces. However, the vanilla implementation lacks complete support for multiple users and long distances simultaneously. To overcome this issue, we use the head frame information from the Azure Kinect to manually track the users' faces in the input image. The users' face regions are then cropped from the image and rescaled to 200 × 200 pixels, i.e. the internal input resolution of MediaPipe's face landmark detector. This operation does not impact the precision of the extracted landmarks, thanks to the high-resolution image of the sensor. Indeed, the dimension, in the image space, of a normal-sized human face at around 5 m is only marginally smaller than the resizing window. This process also comes with the additional benefits of increased speed and stability of the tracking process, even at close range. The cropped regions around the users' heads can be seen as red squares in Fig. 1. The classifier is tested with different subsets of facial landmarks, spanning from the full mesh of 478 landmarks, to around half that size with 249 landmarks, and finally the smallest subset of 21 landmarks, i.e. about 4% of the original set. The different facial sets are shown in Fig. 2.
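The cropping step described above can be sketched as follows; the face-size parameter and the clamping logic are our own illustrative assumptions, not the released implementation.

```python
# Sketch of the face-cropping step: extract a square region around the
# projected head centre (u, v) and compute the scale factor implied by
# resizing it to MediaPipe's 200x200 internal input. Illustrative only.
def face_crop_box(u, v, face_px, img_w, img_h, target=200):
    half = face_px // 2
    # Clamp the box so it stays inside the image bounds.
    left = max(0, min(int(u) - half, img_w - face_px))
    top = max(0, min(int(v) - half, img_h - face_px))
    scale = target / face_px  # >1 upscales small (distant) faces
    return (left, top, face_px, face_px), scale

# A face near the centre of a full-resolution 4096x3072 frame:
box, scale = face_crop_box(2048, 1536, face_px=220, img_w=4096, img_h=3072)
```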
Regarding the classifier architecture, we rely on a simple stateless Random Forest (RF), which provides good robustness while being very lightweight. We use the scikit-learn [4] implementation of this algorithm, keeping the default training settings, including the default number of 100 trees.
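A minimal sketch of this setup with scikit-learn is shown below; the feature layout (21 landmarks with three components each, plus two 7-component poses) and the random toy data are assumptions for illustration only.

```python
# Minimal Random Forest setup mirroring the described configuration:
# scikit-learn defaults, 100 trees. Data and dimensions are toy values.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_features = 21 * 3 + 2 * 7             # assumed layout: landmarks + two poses
X = rng.normal(size=(200, n_features))  # toy feature vectors
y = rng.integers(0, 2, size=200)        # toy mutual-gaze labels

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
p_gaze = clf.predict_proba(X[:1])[0, 1]  # probability of mutual gaze
```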

2.4 Performance
We compare different versions of the classifier: using only facial landmarks (the corresponding model is denoted with Fn, where n is the number of landmarks) or in combination with chest and head pose information (CH-Fn). The performances are evaluated with the AUROC metric, calculated using a 10-fold cross-validation approach. The algorithm is first evaluated frame by frame; the results are shown in Fig. 4. From the results of the first 3 models (F21, F249, and F478), which only use facial landmarks, one can appreciate that richer facial features do not translate into better performances, but into a decrease of the mean AUROC from 89.1% to 87.9%. This trend is even more pronounced when combining chest and head information (models CH-F21, CH-F249, and CH-F478). Indeed, the model using the smallest facial landmarks subset (CH-F21) exhibits the best result, with an average AUROC of 92.6%. The best model (CH-F21) is then evaluated w.r.t. the distance from the camera sensor. The training samples are grouped into 5 bins according to the absolute distance from the sensor frame. The results in Fig. 5 show that the performance is always above 90%. Finally, to validate the generalization of our classifier w.r.t. unseen samples, CH-F21 is tested on the smaller testing set, which contains sequences completely separate from the training set; on this set, the classifier achieves an AUROC of 93.3%. This generalization capability comes from the fact that the proposed classifier works with pre-processed data and not raw inputs, thus relying on the robustness of the off-the-shelf components of the pipeline. The robustness has been empirically confirmed in further experiments with multiple unseen subjects. The system displays a maximum operating range of about 5 m. This is due to limitations of the sensor in both the maximum distance for reliable body tracking and the resolution of the image, which beyond 5 m causes the face region of interest to be blurred.
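For reference, the AUROC metric used throughout this evaluation can be computed directly from labels and scores as the probability that a positive sample outranks a negative one; the helper below is a self-contained illustration, not part of the released pipeline.

```python
# AUROC as a pairwise ranking probability: the chance that a randomly
# chosen positive sample gets a higher score than a randomly chosen
# negative one (ties count 0.5). Equivalent to the area under the ROC curve.
def auroc(labels, scores):
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfectly separating classifier scores 1.0, while chance-level predictions score around 0.5.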

SOFTWARE

3.1 System Architecture
The whole system, implemented using the ROS2 infrastructure [3], is composed of four different nodes (see Fig. 6), described below.
3.1.1 Azure Kinect Driver. This node runs the driver of the RGB-D sensor and the body tracking offered by its SDK. It publishes the following information: (i) the raw signals of the Kinect's onboard Inertial Measurement Unit (IMU); (ii) the RGB image stream; (iii) the body frames from the sensor SDK; and (iv) the RGB camera frame. The frame information is standardized in ROS2 as tf messages.
3.1.2 Gravity Alignment. This node calculates the gravity-aligned robot frame as defined in Sec. 2.1. It takes the IMU data to calculate the difference of the RGB sensor frame orientation w.r.t. the inertial vertical direction and broadcasts the aligned robot frame. Ultimately, the robot frame is defined as the one centered at the origin of the RGB sensor frame, constrained vertically to be aligned with gravity and horizontally with the camera heading.
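The idea behind this node can be illustrated with a simplified tilt computation: assuming a static sensor, the accelerometer reading is the gravity vector expressed in the camera frame, from which roll and pitch can be recovered while the heading is deliberately left free. The function below is a sketch under that assumption (and an assumed axis convention), not the node's actual implementation.

```python
import math

def tilt_angles(ax, ay, az):
    """Roll and pitch (rad) aligning the frame's vertical axis with the
    measured gravity vector; yaw (heading) is left unconstrained."""
    roll = math.atan2(ay, az)
    pitch = math.atan2(-ax, math.hypot(ay, az))
    return roll, pitch

# A level, upright sensor measures gravity purely along its z-axis:
roll, pitch = tilt_angles(0.0, 0.0, 9.81)
```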

3.1.3 Face Landmarks Tracking. This node is the core of the perception pipeline and implements the face landmark extraction. It takes as input the users' body and robot frame information and the current RGB image. Firstly, it projects the users' 3D head positions onto the related 2D points on the image plane. This information is used to crop the regions of interest of the RGB image corresponding to the faces of the detected users. Such cropped images are fed to multiple instances of the MediaPipe Face Mesher, which detects the face landmarks for each user. This crucial step, introduced in Sec. 2.3, allows us to overcome the distance range limits of the MediaPipe implementation. Exploiting the robust detection of the head frames provided by the Azure Kinect SDK, we enable face landmark detection by MediaPipe at distances beyond 2 m. To be independent of the camera orientation, the body frame poses, originally expressed in the RGB sensor frame, are transformed into the gravity-aligned robot frame. The detected face landmarks and the transformed body frame poses of the detected users are finally time-synchronized and published as a custom ROS2 topic. Such a message is called Users' Data in the scheme of Fig. 6.
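The projection step can be sketched with the standard pinhole camera model; the intrinsic parameters below are illustrative placeholders, not the Azure Kinect calibration.

```python
# Pinhole projection of a 3D point (camera frame, metres) to pixel
# coordinates (u, v), given focal lengths and the principal point.
def project(x, y, z, fx, fy, cx, cy):
    if z <= 0.0:
        raise ValueError("point must lie in front of the camera")
    return fx * x / z + cx, fy * y / z + cy

# A head on the optical axis, 3 m away, with made-up intrinsics:
u, v = project(0.0, 0.0, 3.0, fx=2000.0, fy=2000.0, cx=2048.0, cy=1536.0)
```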

3.1.4 Mutual Gaze Classifier. This node is a ROS2 wrapper for the actual scikit-learn implementation of our classifier. It takes as input the users' data custom topic provided by the Face Landmarks Tracking node. As output, it publishes a simple custom topic, denoted Mutual Gaze Info in Fig. 6. This message contains the IDs of the detected users and the corresponding probability of mutual gaze as computed by the classifier. This node also runs a GUI showing the real-time evolution of the predicted probability related to the user who has been tracked the longest.

3.2 Code release
Our software is open-sourced under the MIT license and hosted on the GitHub page of our institution, available at https://github.com/idsia-robotics/mutual_gaze_detector, where the README file explains in detail the procedure needed to create the right setup and test the code. Specific paragraphs address code maintenance, licensing, and deployment, with an emphasis on responsible use of the software. One can choose to either create an environment on their own machine or use the ready-to-use Docker images, which are provided to ease the setup by executing the code from within a container. This choice comes at the cost of a reduced execution speed of the pipeline. We provide two different images: the first can be used to run the entire system online using an Azure Kinect; the second offers a quick demo displaying the code capabilities using pre-recorded data. The data are provided as a ROS2 rosbag, which contains frame-wise pre-processed (anonymized) users' information. This is useful if an Azure Kinect is not readily available to the user. In the README file, we also provide a video showing the launch of the code and the GUI running on such offline data.

CONCLUSIONS
In this work, we presented a long-range mutual gaze detector implemented as a classifier and designed for HRI applications. We trained the classifier with diverse data and achieved good performance that did not degrade as the users' distance increased. Therefore, it can be used to swiftly detect mutual gaze at long distances, outperforming the options available in the literature; it can also find application beyond the HRI context. The whole framework is available online and is built on robust and popular off-the-shelf perception algorithms. The implementation is realized within the ROS2 framework and, thus, can be easily deployed for HRI tasks.

Figure 2: Different numbers of face landmarks (21 on the left, 249 in the center, 478 on the right) detected on the same face. The ones related to the eye pupils and irises are in blue.
Figure 3: Users' positions w.r.t. the robot frame during the data collection, and the normalized per-axis distributions.

Figure 4: Area Under the Receiver Operating Characteristic curve (AUROC) metric of different versions of the mutual gaze classifier using different sets of input features.

Figure 5: AUROC of the CH-F21 classifier (vertical axis) tested at different distance ranges (horizontal axis). The distances are grouped in quintiles of the test set.