Controlling Media Player with Hands: A Transformer Approach and a Quality of Experience Assessment

In this article, we propose a Hand Gesture Recognition (HGR) system based on a novel deep transformer (DT) neural network for media player control. The extracted hand skeleton features are processed by separate transformers for each finger in isolation to better identify the finger characteristics to drive the following classification. The achieved HGR accuracy (0.853) outperforms state-of-the-art HGR approaches when tested on the popular NVIDIA dataset. Moreover, we conducted a subjective assessment involving 30 people to evaluate the Quality of Experience (QoE) provided by the proposed DT-HGR for controlling a media player application compared with two traditional input devices, i.e., mouse and keyboard. The assessment participants were asked to evaluate objective (accuracy) and subjective (physical fatigue, usability, pragmatic quality, and hedonic quality) measurements. We found that (i) the accuracy of DT-HGR is very high (91.67%), only slightly lower than that of traditional alternative interaction modalities; and that (ii) the perceived quality for DT-HGR in terms of satisfaction, comfort, and interactivity is very high, with an average Mean Opinion Score (MOS) value as high as 4.4, whereas the alternative approaches did not reach 3.8, which encourages a more pervasive adoption of the natural gesture interaction.


INTRODUCTION
The digital world is increasingly present in our everyday activities, heading towards the "metaverse", a complete virtual environment linked to the real physical world [30]. Key technologies, such as extended reality (XR) and artificial intelligence (AI), are enabling this transition. In this work, we assessed the subjective quality perceived by users in terms of physical fatigue, usability, pragmatic quality, and hedonic quality. An objective metric was also considered to measure the accuracy of the proposed HGR system. Finally, we analyzed the collected data and discussed the obtained results.
The article is structured as follows. Section 2 discusses the related work on HGR and QoE assessment of hand gesture-based HCIs. In Section 3, we present the considered scenario, describe the proposed DT-HGR solution, and discuss the achieved performance. Section 4 describes the designed quality assessment methodology. Section 5 discusses the QoE results. Section 6 presents our conclusions.

RELATED WORK

Hand Gesture Recognition
Hand gesture recognition (HGR) refers to the process of identifying and tracking explicit human gestures and mapping them to representations for commanding computer systems [34]. There are two main methods to perform HGR: contact-based devices and CV systems.
Contact-based devices need physical interaction with the user; examples are data gloves, wearables, accelerometers, and multi-touch screens. Being in contact with the user, these devices are very precise in detecting the user's movements, but they can be intrusive and uncomfortable, and users need to become accustomed to them [23].
CV systems, instead, use one or more cameras to capture the user's movements, which are then analyzed to detect hand motion and gestures. Although these systems are more user friendly, there are several challenges to overcome, such as robustness (detecting the hand within the image under different lighting conditions and backgrounds), accuracy (detecting the correct hand gesture), and real-time processing (the ability to analyze the image at the frame rate of the input video) [14].
The two main components of CV-based HGR systems are the hand detector and the gesture recognizer. The hand detector detects the hand within the original captured images and extracts hand gesture features. Traditional image-based algorithms (e.g., the Viola-Jones algorithm) were commonly used to extract image-based hand gesture features, such as shape and color. However, these methods have shown some major limitations because (i) appropriate lighting conditions are needed; (ii) they are computationally slow; and (iii) the nature of hand gestures is linked to the spatial motion relationships of the hand joints [14].
Therefore, state-of-the-art methods for hand detection utilize AI-based techniques to extract hand skeleton information [5], which can also be fused with color and depth image information to enhance gesture recognition performance [14]. The features provided by the hand detector are the input of the gesture recognizer, which uses AI-based techniques to classify the extracted features for the recognition of different hand gestures. Convolutional Neural Networks (CNNs) were commonly used for gesture recognition because they are very fast and effective, particularly for static gestures, whose pattern does not change during the execution time [16, 46]. However, CNNs are not as effective for the recognition of dynamic hand gestures, which, in addition to spatial features (as for static gestures), also include temporal features. For this reason, recurrent neural networks, such as Long Short-Term Memory (LSTM) networks, are preferred for HGR of dynamic gestures in that they can keep track of arbitrarily long-term dependencies in the input sequences. Indeed, most of the state-of-the-art HGR methods rely on the combination of a CNN with an LSTM [14, 27].
Recently, transformer-based neural networks have started to be used in the CV field to replace traditional recurrent modules (e.g., LSTM) because they are designed to process sequential input data all at once; the attention mechanism provides context for any position in the input sequence [24]. The authors of [12] are among the first to propose a transformer-based architecture to classify dynamic hand gestures. The proposed model, which can process an input sequence of variable length and outputs the gesture classification, achieved state-of-the-art results.
Concerning HGR solutions specifically designed for controlling media players, the literature is limited and based on outdated methods, as follows. The approaches in [2, 3, 29, 33] lack robustness because the hand detection task was implemented using image-based analysis (e.g., color thresholding [2], the Viola-Jones algorithm [3]) for the extraction of the hand features. Therefore, the hands may not be detected if the light setting is different from that used to train the classifier.
Traditional classifiers were also used for the HGR task, such as the k-Nearest Neighbor (k-NN) [33], support vector machine (SVM) [2], decision tree [29], and Haar classifier [3]. A CNN architecture was used in [18, 26], but only a few static hand gestures (2 in [18] and 6 in [26]) were considered to control the media player. Moreover, none of these studies evaluated its approach on state-of-the-art datasets.
It is evident that the HGR systems for controlling media players found in the literature have significant limitations concerning both the hand detection and hand recognition tasks. For this reason, inspired by the most recent approaches for HGR [12, 14], we propose a novel transformer-based HGR application for controlling media players. In contrast to [12], in which a single transformer is used for all hand gesture features, our DT-HGR system exploits 5 separate transformers, one for each finger's feature set, to better identify the finger characteristics and to better feed the following classification procedure. The proposed approach is described in Section 3.

QoE Assessment of Hand Gesture-Based HCIs
The QoE is defined as "the degree of delight or annoyance of the user of an application or service" [22]. By reflecting the subjective quality perceived by the user, the evaluation of the QoE for multimedia services is of utmost importance, although it is not straightforward. Indeed, the QoE is affected by three types of influence factors, namely, system, human, and context, which, in turn, are composed of several different perceptual dimensions (PDs) [6, 22, 43].
The studies assessing the QoE of HCIs based on hand gestures are limited in the literature. Also, there is no standard methodology for their assessment, such as the ITU-T P.910 guidelines [17] for the assessment of video quality. For this reason, each literature study assessed the QoE following a different methodology and considering different PDs.
The usability and acceptability PDs were considered in [11] to evaluate the proposed gesture-controlled video player interface for mobile devices compared with standard buttons. A subjective experiment was conducted involving 15 people, who were asked to indicate the preferred interaction method and to judge how satisfying, tiring, and responsive the new interface was. The performance of the new interface was also measured objectively in terms of task execution time and the number of erroneous gestures performed.
In [40], the user experience is assessed by conducting two different experiments involving a total of 19 subjects. The first experiment compared the mouse with near-field hand gestures as the HCIs and consisted of the exploration of 3D objects presented on a stereoscopic display using actions such as rotation, zooming in and out, or pointing to a specific part of the volume. The second experiment compared the Wii controller with far-field hand gestures and consisted of controlling an icon by moving the hand in 4 directions. User experience was assessed in terms of physical fatigue for various upper body parts (using the Borg scale [9]), usability (performance, fun, and ease of learning), and pragmatic and hedonic qualities (using 21 semantic differential items on a 7-point response scale).
In [38], the usability and user experience of a hand gesture-based in-vehicle system as perceived by elderly people is investigated. The subjective expectations and the subjective experience were evaluated before and after the system usage, respectively, through the SUXES questionnaires [39], whereas the System Usability Scale (SUS) questionnaire was used to evaluate the post-test perceived usability of the system [10].
The study in [32] compared the utilization of the Leap Motion controller with standard keyboard and mouse controls for playing video games. A total of 15 participants were recruited to play two games and assess the experience in terms of usability, engagement, and personal motion control sensitivity. The SUS was used to rate the subjective experience of the system. Also, the participants had to select the device they considered to be most fun to use, the device they would choose for longer gaming sessions, the overall preferred mode, and the easiest gestures.
The gesture-based user experience for controlling the gameplay of three simple video games was evaluated in [31] using the User Experience Questionnaire (UEQ) [20], which consists of 26 total items classified into 6 different scales. Participants were also asked during the post-experiment interview how natural they thought each gesture was and to rate the reliability of each game's gestures. Furthermore, a performance score was defined to compute the percentage of correctly performed gestures (intended actions) compared with misses (no action) or errors (unintended actions).
Finally, a general model to evaluate the usability of gesture-based interfaces is proposed in [7], which considers the main factors that determine the quality of a gesture (i.e., fatigue, naturalness, accuracy, and duration) and how to measure them.
Concerning the HGR applications for controlling media players discussed in Section 2.1, most of the studies [18, 29, 33] only evaluated objective performance related to the HGR task. A subjective quality assessment was only conducted by the authors of [2] and [3].
In [2], 15 persons were asked to rate their experience with the proposed HGR application using a 4-point scale: poor, satisfactory, good, and excellent. A total of 70% of the persons rated their experience as good.
In [3], 17 individuals were asked to perform simple tasks on the computer, such as watching videos, listening to songs, and viewing images. They had to use a mouse, a keyboard, and the proposed hand gesture-based interface. They were asked which option they found more convenient for controlling the applications: 47.06% preferred using the hand gesture interface, 29.41% preferred using the keyboard, and 23.53% preferred using the mouse.
To gain a better understanding of the user-perceived quality when using this HCI and to identify current limitations, we designed and conducted a quality assessment including specific tasks to motivate participants to utilize the hand gestures, and we considered both objective and subjective quality metrics for the evaluation. The details of the assessment are described in Section 4; the QoE results are discussed in Section 5.

PROPOSED HGR SOLUTION FOR CONTROLLING MEDIA PLAYERS
We introduce the considered scenario in Section 3.1, describe the proposed DT-HGR solution in Section 3.2, and discuss the achieved performance results in Section 3.3.

Considered Scenario
We considered a scenario in which a person is sitting in front of a display equipped with a camera and wants to use a media player to watch videos or listen to music. The traditional HCIs used for interacting with applications are the mouse, keyboard, and touch screen, which are well known by most people. We want to investigate what the user experience would be when interacting with and controlling the applications with the hands instead of using an end device, such as a mouse or a keyboard. To this end, we designed an HGR solution for controlling a media player application that captures video sequences from the camera placed in front of the user and is capable of recognizing 6 different hand gestures.
For our HGR application, we selected 6 hand gestures (shown in Figure 1) from the dataset in [13], which includes 27 dynamic hand gestures intended for HCIs and captured at high image quality. Most of the gestures in [13] (i.e., the first 25 gestures) were adopted from the popular NVIDIA dataset [25], while the 2 additional hand gestures were designed to command the playback of previous/next content when controlling media players. There is a difference in the numbering of the gestures between the dataset in [13] and the NVIDIA dataset: the dataset in [13] numbers the gestures starting from 1, whereas the NVIDIA dataset starts from 0. In this article, we refer to the numbering used in the NVIDIA dataset. Thus, the 2 additional gestures introduced by the dataset in [13] are Gesture 25 and Gesture 26.
We selected 6 gestures able to control the media player with some of the most common actions:
- Stop the playback: Gesture 17 (open hand for 3 s).
- Start/play the video: the "ok" gesture.
- Increase/decrease the volume: Gestures 19 and 20 (clockwise/counterclockwise movement with two fingers).
- Go to the previous content: Gesture 25.
- Go to the next content: Gesture 26.
We chose these 6 gestures by considering related literature studies for controlling media players with hand gestures [3, 18, 26, 29, 33]. Although these studies use different gestures for the same player control commands we consider (e.g., play/pause, stop, increase/decrease volume), all the considered gestures involve using the flat hand and the closed fist, either statically or dynamically (e.g., moving the flat hand or closed fist right-left or up-down). The selected gestures are all included in the NVIDIA dataset [25], which includes 25 gestures adopted from existing commercial systems and popular datasets. Thus, we considered the flat hand and "ok" gestures to stop and start/play the video, respectively; moving the closed fist (right to left and left to right) to go to the previous/next video; and clockwise and counterclockwise movements with two fingers to increase/decrease the volume. The selection of these gestures was also driven by a preliminary survey involving 12 students who were asked to pick the most intuitive gesture among the 25 in [25] for each of the 6 control commands. We then selected the most popular gesture for each one.
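Once recognized, a gesture label only needs to be mapped to a player command. The following is a hypothetical dispatch table, not the authors' code, using the NVIDIA-numbered gesture IDs above; the assignment of Gestures 19 and 20 to volume up versus down is our assumption, and the "ok"/play gesture is omitted because its ID is not given in the text.

```python
# Hypothetical mapping from recognized gesture IDs (NVIDIA numbering) to
# media player commands. The up/down assignment of Gestures 19/20 is an
# assumption for illustration; the "ok"/play gesture ID is not given in
# the text and is therefore omitted.
GESTURE_COMMANDS = {
    17: "stop",         # open hand held for 3 s
    19: "volume_up",    # two-finger clockwise movement (assumed direction)
    20: "volume_down",  # two-finger counterclockwise movement (assumed direction)
    25: "previous",     # additional gesture introduced by the dataset in [13]
    26: "next",         # additional gesture introduced by the dataset in [13]
}

def dispatch(gesture_id):
    """Return the player command for a gesture ID, or None if unmapped."""
    return GESTURE_COMMANDS.get(gesture_id)
```

Unmapped gesture IDs simply produce no command, so spurious classifications outside the 6 selected gestures are ignored.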

The Proposed Solution
We present the proposed novel Deep Transformer neural network for HGR (DT-HGR). A typical HGR system relies on three main components: scene acquisition, hand detection, and gesture recognition. Once the scene has been acquired, we perform hand detection, which processes each frame to identify the hand skeleton. To this end, we thoroughly investigated the state of the art and adopted the most appropriate approach, which we adapted to the HGR problem at hand. The subsequent gesture recognition algorithm relies on a completely novel approach that exploits separate transformers for each finger's feature set to better identify the finger characteristics to feed the following classification.
In the following, we provide the details of the proposed procedure.

Scene Acquisition.
The scene acquisition provides a continuous video of the hand gesture at full HD resolution. In our design and testing phases, we used the Logitech Brio Stream 4K HDR video camera to acquire the video scenes at full HD resolution (1920 × 1080) at 30 frames per second (fps).

Hand Detection.
As discussed in Section 2.1, the hand detection task is a well-studied problem, and state-of-the-art methods for hand detection extract hand skeleton information [5]. None of the HGR systems for controlling media players found in the literature extracts hand skeleton data; rather, those methods rely on traditional image-based algorithms (i.e., image color analysis [2, 29, 33], the Viola-Jones algorithm [3]) to extract image-based hand gesture features, such as shape and color. However, these solutions have shown some major limitations because (i) appropriate lighting conditions are needed (lack of robustness); (ii) they are computationally slow; and (iii) the nature of hand gestures is linked to the spatial motion relationships of the hand joints [14].
The study in [18] is the only one using a CNN-based solution, which is commonly used to achieve fast and effective recognition of static gestures. However, no dynamic gestures and no hand skeleton data were considered in this study. For these reasons, for the hand detection task of the proposed HGR system, we relied on the pipeline described in [46], which was proposed for hand tracking. It is composed of the CNN proposed in [8], trained for palm detection, and the CNN Keypoint Detection via Confidence Maps proposed in [35] for hand landmark detection. These neural networks extract hand skeleton data and have proven to be lightweight and fast even on low-power devices. For instance, the CNN presented in [8] can execute the inference process in 0.3 s running on an iPhone XS. First, the pipeline identifies the hand's palm, i.e., it selects a boundary box within the video frame including the hand image. Then, this boundary box is processed by the second stage of the pipeline, the CNN Keypoint Detection, which detects 21 hand landmarks, as shown in Figure 2. Each landmark is described by a tuple (x, y), which identifies the landmark's position along the x- and y-axes of the image. As the same gesture may be performed with different "sizes" (e.g., Gestures 19 and 20 of Figure 1 can be performed both with a "thin" clockwise movement or a "large" clockwise movement of the fingers), the coordinate values have been normalized as in Equation (1):

x'_j = (x_j − x_0) / max(x),   y'_j = (y_j − y_0) / max(y),   (1)

where x_j and y_j are the x and y coordinates, respectively, of landmark j, which ranges from 0 to 20.
(x_0, y_0) are the coordinates of the reference landmark (shared by all fingers) and max(x) and max(y) are the maximum values of the coordinates of the 21 landmarks. From each video frame, the hand detection component computes 21 normalized landmark coordinate tuples, which are used as the input features for the gesture recognition component. The gesture recognition component applies a temporal combination to each set of finger landmarks. In particular, each set of finger landmarks is processed by a single Transformer Encoder (TE), which is composed of the Multi-Head Attention (MHA) layer proposed in [41], two normalization layers, and a Feed-Forward layer (composed of two one-dimensional [1D] Convolutional layers with a kernel size of 1 and a Dropout layer). Since the hand gestures are the result of subsequent movements of the 5 fingers, the MHA layer aims to detect the most relevant finger landmarks that identify the gesture. Each MHA block is a self-attention layer that repeats its computation 8 times to draw connections between any parts of the input sequence. Moreover, each TE repeats its computation 6 times. These numbers of repetitions provided the best trade-off between accuracy and computational cost and were determined empirically by employing a greedy algorithm that searched for the optimal parameter setting in a given search area. The output of each TE is the input of an Avg pooling module, which decreases the feature dimension. The outputs of the 5 Avg pooling modules undergo a concatenation and normalization process followed by a 1D Convolutional layer with a kernel size of 5, an Avg pooling layer for feature reduction, and, finally, a Dense layer and a Softmax activation function that outputs the gesture label.
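The landmark normalization of Equation (1) can be sketched as follows; a minimal sketch, assuming the shift-by-reference, divide-by-maximum form described above (the function accepts any list of (x, y) tuples; in the pipeline the list would hold the 21 detected landmarks).

```python
def normalize_landmarks(landmarks):
    """Normalize (x, y) landmark tuples as in Equation (1): shift each
    coordinate by the reference landmark (x0, y0) and divide by the maximum
    coordinate value over all landmarks, so the same gesture performed at
    different "sizes" yields similar features."""
    x0, y0 = landmarks[0]  # reference landmark, shared by all fingers
    max_x = max(x for x, _ in landmarks)
    max_y = max(y for _, y in landmarks)
    return [((x - x0) / max_x, (y - y0) / max_y) for x, y in landmarks]
```

After normalization, the reference landmark always maps to (0, 0), so the features are invariant to the hand's absolute position in the frame.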

Gesture Recognition.
In the following, we provide more details about the data processing. The input I is the concatenation of 5 matrices F_i, with i = {1, 2, ..., F}, where F = 5 is the number of fingers. Each matrix F_i has shape {N × L_i}, where N is the total number of video frames of the recorded gesture and L_i is the number of landmarks for finger i. Note that landmark 0 is shared among the 5 fingers and L_i = 5 for each finger. Each row of the matrix F_i, defined as f_i^n, is the concatenation of the coordinates of the landmarks l_{i,k}^n of finger i for the video frame n, with n = {1, 2, ..., N}. The index k identifies the finger landmark depending on the finger i, as defined in Equation (2):

k ∈ {0, 4(i − 1) + 1, 4(i − 1) + 2, 4(i − 1) + 3, 4i}.   (2)

As an example, for video frame 1, the arrays f_i^1 are defined as

f_1^1 = {l_{1,0}^1, l_{1,1}^1, l_{1,2}^1, l_{1,3}^1, l_{1,4}^1}, ..., f_5^1 = {l_{5,0}^1, l_{5,17}^1, l_{5,18}^1, l_{5,19}^1, l_{5,20}^1}.

Then, the matrix F_i describes the movements of finger i through time during the gesture execution. It is defined as the concatenation of the f_i^n arrays. For instance, for finger 1, the matrix F_1 is defined as

F_1 = {f_1^1, f_1^2, ..., f_1^N}.

Finally, the input I is defined as the concatenation of the 5 matrices F_i:

I = {F_1, F_2, F_3, F_4, F_5}.

The F_i matrices undergo the temporal combination (TempComb) process applied by the TE modules, whose outputs are the input of the Avg pooling (AvgPool) modules for feature reduction:

T_i = AvgPool(TempComb(F_i)).

The results of the 5 TempComb processes are then normalized and merged into a single matrix H by means of a concatenation (Conc) process:

H = Conc(Norm(T_1), ..., Norm(T_5)).

In order to identify the essential features from the landmark data collected from the 5 fingers, the matrix H undergoes a convolutional (Conv) process, whose result is reduced in size by means of the Avg pooling module:

Z = AvgPool(Conv(H)).

Finally, the Dense layer and the Softmax activation function output the predicted hand gesture label hg_p:

hg_p = Softmax(Dense(Z)).
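The construction of the per-finger matrices F_i can be sketched as follows; a minimal sketch, assuming a 21-landmark ordering in which landmark 0 is the shared reference point and finger i owns the four landmarks 4(i − 1) + 1, ..., 4i, consistent with Equation (2) and the f_5^1 example above.

```python
def finger_landmark_indices(i):
    """Landmark indices k for finger i in {1, ..., 5}, as in Equation (2):
    the shared landmark 0 plus the four landmarks 4(i-1)+1, ..., 4i."""
    return [0] + list(range(4 * (i - 1) + 1, 4 * i + 1))

def build_finger_matrix(frames, i):
    """Build F_i as a list of N rows f_i^n, each concatenating the (x, y)
    coordinates of the L_i = 5 landmarks of finger i.

    `frames` is a list of N video frames, each a list of 21 (x, y) tuples."""
    idx = finger_landmark_indices(i)
    return [[coord for k in idx for coord in frame[k]] for frame in frames]
```

Each of the resulting 5 matrices (one per finger) would then be fed to its own Transformer Encoder, as described above.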

Comparison of HGR Accuracy on the NVIDIA Dataset.
We computed the performance of the proposed DT-HGR approach on the NVIDIA dataset, which is widely used by state-of-the-art studies to prove the performance of their HGR methods. We computed the HGR performance by training the DT-HGR with 5-fold cross-validation and a 70%/30% training/validation split, resulting in 1,050 training videos and 482 validation videos. Table 1 compares the performance of state-of-the-art methods and the proposed DT-HGR in terms of HGR accuracy when training and testing the algorithms on the NVIDIA dataset [25]. Note that for a fair comparison of state-of-the-art methods with our approach, only the unimodal column should be considered, which reports the accuracy achieved with the unimodal HGR approach (i.e., RGB only). However, for completeness, the multimodal column reports the accuracy achieved by multimodal HGR approaches, which consider a combination of the RGB data with additional sources of image information, such as depth, flow, or infrared (IR). It can be seen that our proposed method achieved a mean HGR accuracy of 0.853, outperforming the HGR accuracy achieved by state-of-the-art unimodal methods. In particular, our solution achieved an accuracy 8.8 percentage points higher than that achieved by the transformer-based solution in [12], which used one transformer for all fingers. Thus, considering one transformer for each finger rather than a single transformer for all fingers allowed our approach to achieve a greater accuracy score. Furthermore, our proposed method outperformed 5 of the 9 considered multimodal methods. This is a promising result, which suggests that feeding our system with multimodal image data would likely enhance the HGR performance.

Method | Accuracy Unimodal (Color) | Accuracy Multimodal
HOG + HOG2 [28] | 0.245 | 0.369 (color + depth)
Simonyan and Zisserman [36] | 0.546 | 0.656 (color + flow)
Wang et al. [42] | 0.591 | 0.734 (color + flow)
C3D [37] | 0.693 | -
R3DCNN [25] | 0.741 | 0.838 (color + depth + flow + IR)
GPM [15] | 0.759 | 0.878 (color + depth + flow + IR)
PreRNN [44] | 0.765 | 0.850 (color + depth)
Transformer [12] | 0.765 | 0.876 (color + depth + normals + IR)
ResNeXt-101 [19] | 0.786 | -
MTUT [1] | 0.813 | 0.869 (color + depth + flow)
Yu et al. [45] | 0.836 | 0.884 (color + depth)
DT-HGR (Ours) | 0.853 | -

Figure 3 shows the confusion matrix computed for the 25 classes of the NVIDIA dataset with the proposed solution. It can be seen that all gestures are predicted with good accuracy. However, lower accuracy values are achieved for the classification of Gestures 21 and 22. The reason may be that, to execute Gesture 21, some users made hand movements that are similar to those needed to execute Gesture 22.

HGR Performance on the Dataset in [13].

Firstly, we computed the performance of the proposed DT-HGR approach trained on the NVIDIA dataset (as described in the previous section) and validated on the dataset in [13] (excluding Gestures 25 and 26, which are not present in the NVIDIA dataset). A mean HGR accuracy of 0.808 was achieved; this dataset cross-validation result further demonstrates the validity of the proposed DT-HGR method.
In Figure 4, we show the confusion matrix achieved by the DT-HGR for the first 25 classes (i.e., those adopted from the NVIDIA dataset) of the dataset in [13]. It can be seen that the proposed approach had some difficulty in the classification of Gesture 16, which was confused with Gesture 17. The reason is that the "push hand down" and "push hand up" gestures are very similar to each other. Moreover, some executions of these two gestures are very similar to each other because the subjects executed them incorrectly. Furthermore, Gesture 5 did not achieve classification performance comparable to that of the other gestures because the DT-HGR solution confused this gesture with other gestures. In addition, lower accuracy values are achieved for the classification of Gestures 21 and 22, as for the NVIDIA dataset. However, on average, our solution achieved great classification results both on the NVIDIA dataset and on the dataset in [13] (cross-validation).
Secondly, we computed the performance of the proposed DT-HGR approach trained with 5-fold cross-validation and a 70%/30% training/validation split on the dataset in [13]. A mean HGR accuracy of 0.885 is achieved over the 27 gestures of this dataset, which is greater than that achieved on the NVIDIA dataset (0.853), although the dataset in [13] has 2 additional gestures. This can be motivated by the higher image quality of the video frames collected in this dataset and the more precise gesture execution (participants were monitored during gesture execution and had to repeat the hand gesture if the movement was not correct) compared with the NVIDIA dataset. Figure 5 shows the confusion matrix for the 27 classes. Even in this case, Gestures 21 and 22 achieve the lowest accuracy scores. Note that a state-of-the-art comparison on the dataset in [13] (similar to Table 1) could not be provided because none of the literature studies used this dataset to test their models and no software was publicly provided to allow for computing the models' performance on the dataset in [13].
Finally, the proposed DT-HGR approach achieves a mean HGR accuracy of 0.983 if only the six hand gestures in Figure 1 are considered from the dataset in [13]. Table 2 reports the performance of the proposed HGR solution for these six hand gestures in terms of accuracy, precision, and recall, computed for the single classes. The HGR accuracy achieved for single gestures is greater than 0.96. In particular, except for Gestures 25 and 26, which achieve an accuracy of 0.97, the accuracy achieved for single gestures is 0.99. Indeed, Gestures 25 and 26 ("Next/Previous content") also achieved lower precision and recall values than those obtained by the other gestures, which could derive from the different motion speeds needed to recognize these kinds of gestures. For this reason, the camera could have lost some frames when trying to capture the high-speed movements. However, all gestures achieved precision and recall results greater than 0.93 and 0.91, respectively, which motivates their utilization for the real-time experiment. By only considering the 6 selected gestures, the DT-HGR approach achieves greater performance than when considering the entire dataset. The reason is that the lower the number of gestures to be recognized, the lower the chance for error. Also, the selected gesture movements are well distinguished from each other.
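Per-class precision and recall values such as those in Table 2 can be computed directly from a confusion matrix; a minimal sketch, assuming rows index the true class and columns the predicted class:

```python
def per_class_metrics(conf):
    """Per-class (precision, recall) from a square confusion matrix, where
    conf[t][p] counts samples of true class t predicted as class p."""
    n = len(conf)
    metrics = []
    for c in range(n):
        tp = conf[c][c]                                  # true positives for class c
        predicted_c = sum(conf[t][c] for t in range(n))  # column sum: predicted as c
        actual_c = sum(conf[c])                          # row sum: truly c
        precision = tp / predicted_c if predicted_c else 0.0
        recall = tp / actual_c if actual_c else 0.0
        metrics.append((precision, recall))
    return metrics
```

With this convention, a gesture that is often confused with a visually similar one (as reported for Gestures 21 and 22) shows up as off-diagonal mass that lowers both its recall and the other gesture's precision.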

QOE ASSESSMENT
In Section 2.2, we discussed the QoE assessments conducted by the state-of-the-art studies.
Although there is no standardized methodology to follow, each study focused on specific functionalities of the application under study that involved the utilization of the hand gestures to be evaluated. Concerning the QoE evaluation, there are objective and subjective quality metrics that are common to most of the literature studies. From the objective side, we considered and measured the performance score (PScore), which is defined in [31] as the ratio between the number of correct gestures and the total number of gestures. From the subjective side, four perceptual dimensions were considered: the physical fatigue perceived when performing the hand gestures; the usability of the HCI in controlling the video player; the pragmatic quality in terms of accuracy and responsiveness; and, finally, the hedonic quality in terms of satisfaction, comfort, and interactivity.
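The PScore defined in [31] reduces to a simple ratio; the function below is a hypothetical helper for illustration, not the authors' code:

```python
def pscore(correct_gestures, total_gestures):
    """PScore as defined in [31]: correctly performed gestures divided by
    the total number of gestures (misses and errors count against it)."""
    if total_gestures == 0:
        raise ValueError("no gestures were performed")
    return correct_gestures / total_gestures
```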

Methodology
The considered scenario is described in Section 3.1. The designed QoE assessment consists of three phases: Introduction, Training, and Experiment, as illustrated in Figure 6. During the Introduction phase, which lasted about 10 minutes, the participants were given explanations of the objective and the steps of the assessment. The hand gestures, mouse controls, and keyboard controls needed to control the video player were introduced and explained to the participants. They were given written and oral instructions to understand how to utilize each of the HCIs under test to operate the considered video player controls, which are summarized in Table 3. They were also given explanations regarding the tasks to be performed to complete a part of the experiment (answers to questions concerning the content of the watched video). The post-experiment surveys were introduced and explained so that participants understood the quality perceptual dimensions to be evaluated and the rating scales. During this phase, the participants had the opportunity to ask further questions to eliminate any doubts regarding the experiment. Finally, this phase ended with the participants providing written consent regarding the publication of the obtained results.
The Training phase required the test participants to practice the three HCIs under test by operating the video controls they would have to use during the Experiment phase. For the practice session, training videos were created using videos different from those used for the Experiment phase. The objective of this phase was to ensure that the participants acquired a base level of practice before judging the HCIs and had already used the controls in similar situations. The participants had to familiarize themselves with each of the 6 video player controls in Table 3 using the three HCIs. They had to perform the controls several times until they were sure they understood how to use them. In particular, most of the time was spent practicing with the hand gesture-based HCI, since each participant already had some experience using a mouse and keyboard to control video players. This phase lasted about 15 minutes.
Finally, during the Experiment phase, the test participants performed the experiment and evaluated the perceived quality. The Experiment phase consisted of three sessions, each performed with a different HCI. The order of the 3 HCIs was randomized for each participant so that it would not affect the data analysis. Each session consisted of 2 video series. During the first video series, the participant had to watch the videos and follow the written instructions appearing on the screen, which asked them to perform the 6 video player controls at specific times. The second video series was similar to the first; however, the participant was also required to pay careful attention to the video content, since specific questions were asked to test the participant's attention and ability to use the HCI to control the video player. Figure 7 shows an example of the tasks required during the second video series. The participant had to write the answers on a sheet of paper. A 5-minute pause was taken between sessions, each of which lasted an average of 10 minutes. At the end of each session, the participant had to fill out the subjective questionnaire to evaluate the physical fatigue, usability, pragmatic quality, and hedonic quality perceived when using that HCI. At the end of the experiment (all three sessions completed and rated), the final questionnaire had to be completed, in which participants were asked for their preferences regarding the three HCIs.
Figure 8 shows the experiment environment. The participant was seated at a desk in front of the screen. The distance from the screen was the same when using each of the three HCIs. The main limitation on the workable distance when using hand gestures is the video camera's field of view, which determines the maximum distance at which a hand gesture can be recognized. Similarly, the diagonal field of view determines the maximum workable angles. Even without extensive experimentation, we determined that the gesture area should be at least 200 × 200 pixels when the gesture is in front of the video camera. For a typical commercial webcam (full HD, diagonal field of view of 78 degrees), this corresponds to a distance of around 1 meter. The distance can be extended by zooming in, which reduces the diagonal field of view. As for the workable angle, we empirically found that it should not exceed 45 degrees.
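The field-of-view reasoning above can be made concrete with a short calculation. The sketch below estimates the maximum distance at which a hand still covers roughly 200 × 200 pixels; the 0.15 m hand size is our illustrative assumption, not a value from the study.

```python
import math

def max_distance_m(diag_fov_deg, res_px, hand_size_m, min_gesture_px):
    """Distance at which a hand of hand_size_m still spans min_gesture_px pixels."""
    diag_px = math.hypot(*res_px)  # diagonal resolution in pixels
    # At distance d the camera covers a diagonal extent of 2*d*tan(fov/2) meters,
    # so pixels-per-meter = diag_px / (2*d*tan(fov/2)). Solve for d.
    half_fov = math.radians(diag_fov_deg / 2)
    return hand_size_m * diag_px / (2 * math.tan(half_fov) * min_gesture_px)

# Typical full-HD webcam with a 78-degree diagonal field of view
d = max_distance_m(78, (1920, 1080), hand_size_m=0.15, min_gesture_px=200)
print(f"max workable distance ≈ {d:.2f} m")  # roughly 1 meter
```

With these assumed numbers the result is close to the 1-meter working distance reported above; narrowing the field of view (optical zoom) increases the distance proportionally.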

Evaluation
We evaluated the QoE using both objective and subjective quality metrics. On the objective side, we measured the performance score (PScore), defined in [31] as the ratio between the number of correct gestures and the total number of gestures. A correct gesture results in the intended control being executed by the video player. When a gesture is not performed correctly, it can be a miss or an error. A miss is a gesture that results in no action, i.e., a participant's delayed reaction or a wrong movement not recognized by the HGR application, which does not result in a control recognized by the video player. An error is a wrong movement that results in an unintended action, such as a different control being executed by the video player. We also applied the PScore to the mouse and keyboard controls: a miss, in this case, is a participant's delayed reaction, whereas an error is the click/press of a wrong button/key. Test participants were observed during the Experiment phase, and the numbers of correct gestures, misses, and errors were noted to compute the PScore for each video player control and each HCI.
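The PScore bookkeeping described above reduces to a simple ratio; the helper below is a minimal sketch (the function name and example counts are ours, not the study's raw data).

```python
def pscore(correct, misses, errors):
    """PScore = correct gestures / total gestures, where every attempted gesture
    is either correct, a miss (no action) or an error (unintended action)."""
    total = correct + misses + errors
    return correct / total if total else 0.0

# For example, 33 correct gestures, 2 misses and 1 error out of 36 attempts
print(f"{pscore(33, 2, 1):.2%}")  # 91.67%
```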
The considered subjective metrics are physical fatigue, usability, pragmatic quality, and hedonic quality. At the end of each Experiment session, the participants had to fill out the questionnaire, which included questions concerning these four subjective metrics and was divided into four sections, one for each metric. Physical fatigue regards the level of exertion perceived when performing hand gestures, clicking the mouse buttons, and pressing the keyboard keys, and had to be evaluated using the Borg scale, from 0 to 10 [9]. The SUS questionnaire was used to evaluate the perceived usability of the three HCIs [10]. The SUS is a ten-statement questionnaire that the participants had to use to rate the HCIs on a 5-point Likert scale, indicating whether they strongly disagreed (0) or strongly agreed (4) with each statement. From the ratings of the 10 statements, the SUS score, which ranges from 0 to 100, was computed. Usability aspects covered by the SUS statements include complexity, confidence, and ease of learning.
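For reference, SUS scoring alternates positively and negatively worded statements: positive items contribute their rating, negative items contribute the inverted rating, and the sum is scaled to 0-100. A minimal sketch using the 0-4 scale described above (the function name is ours):

```python
def sus_score(ratings):
    """SUS score (0-100) from ten ratings on a 0-4 scale. Odd-numbered
    statements (1, 3, 5, ...) are positively worded, even-numbered negatively."""
    assert len(ratings) == 10
    # i is 0-based, so i % 2 == 0 corresponds to odd-numbered (positive) items
    contrib = [r if i % 2 == 0 else 4 - r for i, r in enumerate(ratings)]
    return sum(contrib) * 2.5

# A participant who strongly agrees (4) with every positive statement and
# strongly disagrees (0) with every negative one gets the maximum score.
print(sus_score([4, 0] * 5))  # 100.0
```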
The pragmatic quality is defined as "the extent to which a system allows for effective and efficient goal-achievement, and is closely related to the notion of usability" [40]. We measured the pragmatic quality in terms of the accuracy and responsiveness of the HCIs. Accuracy concerns the system's ability to perform the action/control intended by the participant, whereas responsiveness covers the quality and speed of the system's reaction to the participant's input. Note that other pragmatic quality aspects, such as ease of use and ease of learning, were not considered because they are already included in the SUS questionnaire. On the other hand, the hedonic quality is defined as "the extent to which a system allows for stimulation by its challenging and novel character or identification by communicating important personal values" [40]. Therefore, with the hedonic quality, we aimed to investigate non-task-oriented aspects of the perceived quality: satisfaction, comfort, naturalness, interactivity, and fun. Satisfaction regards the perceived level of enjoyment when using the HCI, whereas comfort takes into account how comfortable the HCI is for controlling the video player. With interactivity, we refer to the correlation between the performed gestures and the reactions returned by the video player. Naturalness indicates how intuitive the HCI is for controlling the video player, whereas fun concerns the pleasure and entertainment provided by the HCI. Participants had to rate the hedonic qualities on a 5-point Likert scale, from 1 to 5.
Participants were also asked whether the HCIs met their expectations. At the end of the experiment, i.e., after all three sessions had been completed and evaluated, the participants were asked to fill out the final questionnaire, in which they had to provide their preferences regarding the three HCIs. In particular, they were asked to indicate their absolute favorite HCI and the HCI they preferred for each of the four kinds of controls: Play/Pause, Stop, Previous/Next video, and Increase/Decrease volume.

Objective Quality
Table 4 summarizes the number of misses and errors observed as well as the PScore computed for each video player control and each HCI. The mouse achieved the greatest overall PScore (97.78%). Two participants missed the action to pause the video playback at a specific instant to catch a particular scene in the video; they had a delayed reaction. One participant clicked the wrong button to stop the video, and another clicked the wrong button to go to the next video. The keyboard came second, with an overall PScore of 93.33%. In this case, the misses were delayed reactions to pause the video (as in the mouse case), to stop the video, and to play the next and previous videos. The 6 errors, instead, were due to pressing the wrong keys, which resulted in unintended actions (2 errors for Stop, 2 for Increase/Decrease volume, and 2 for Next/Previous video). Finally, the hand gestures achieved an overall PScore of 91.67%. The misses (1 for Play/Pause, 2 for Stop, 3 for Next/Previous video, and 1 for Increase/Decrease volume) concern gestures not recognized by the HGR application because they were not performed properly by the participants. On the other hand, all the errors (1 for Play/Pause, 3 for Next/Previous video, and 4 for Increase/Decrease volume) involved the unintended Pause command given by the participants when raising the hand to perform a gesture. Indeed, some of them tended to keep the thumb and index finger close to each other, which was recognized as the "ok" command to pause the video.

Subjective Quality
Figure 9 shows the mean perceived physical fatigue with the 95% confidence interval (CI) for the three HCIs. The physical fatigue for the hand gestures is slightly greater than that perceived for the mouse, which, in turn, is slightly greater than that perceived for the keyboard. However, these values are very similar and indicate that very little fatigue was perceived for the three HCIs (around 1 on a scale ranging from 0 to 10). Figure 10 shows the perceived usability in terms of the SUS with the 95% CI. The SUS perceived for the hand gestures and the mouse is comparable, whereas that perceived for the keyboard is slightly lower. However, all of the SUS scores are very high, which indicates that the usability of the three HCIs is more than acceptable for the participants (over 80 on a scale ranging from 0 to 100). The one-way Analysis of Variance (ANOVA) between the groups of scores provided for the three HCIs did not show significant differences in terms of physical fatigue or SUS.
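A mean with a 95% CI of the kind plotted in Figures 9-11 is computed from the sample mean, the sample standard deviation, and a t critical value. The sketch below uses hypothetical ratings (not the study's data); the critical value 2.262 is the two-sided t quantile for 9 degrees of freedom, and 2.045 would apply to the study's 30 participants.

```python
import math
from statistics import mean, stdev

def mean_ci95(scores, t_crit):
    """Return the sample mean and the 95% CI half-width.
    t_crit is the two-sided t critical value for len(scores) - 1 df."""
    half = t_crit * stdev(scores) / math.sqrt(len(scores))
    return mean(scores), half

# Hypothetical Borg fatigue ratings from ten participants (illustration only)
fatigue = [1, 0, 2, 1, 1, 0, 1, 2, 1, 1]
m, h = mean_ci95(fatigue, t_crit=2.262)  # t for df = 9
print(f"{m:.2f} ± {h:.2f}")  # 1.00 ± 0.48
```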
Figure 11 shows the mean opinion score (MOS) computed for the perceived pragmatic quality (accuracy and responsiveness) and hedonic quality (satisfaction, comfort, interactivity, naturalness, and fun) with the 95% CI. Concerning the pragmatic quality, the results obtained by the three HCIs are quite good and comparable. The hand gestures achieved the lowest perceived accuracy, which, however, still corresponds to a good accuracy (3.90). The mouse achieved the greatest perceived accuracy (4.20), slightly greater than that obtained by the keyboard (4.13). In contrast, the keyboard achieved the greatest perceived responsiveness (4.40), whereas the hand gestures and the mouse achieved comparable values (around 4.2). The ANOVA between the groups of scores provided for the three HCIs did not show significant differences in terms of accuracy or responsiveness.
Concerning the hedonic quality, the mouse achieved the lowest score for all quality aspects except naturalness, whereas the hand gestures always achieved the greatest score. The participants were very satisfied with the hand gesture experience (4.40), which made it the most enjoyable interface; only sufficient satisfaction was perceived for the keyboard (3.40) and the mouse (3.13). The hand gestures were also perceived as the most comfortable HCI for controlling the video player (4.23), followed by the keyboard (3.73) and the mouse (3.57), which achieved sufficient to good comfort. Concerning the interactivity provided by the HCIs, the hand gestures achieved a very high result (4.67), whereas, again, the keyboard (3.27) and the mouse (2.97) achieved merely sufficient ratings. The naturalness results are, instead, quite comparable among the three HCIs: the hand gestures (4.13) were perceived as the most natural HCI for controlling the video player, followed this time by the mouse (3.83) and the keyboard (3.47). Finally, the hand gestures are the most fun HCI for controlling the video player (4.73); the keyboard achieved poor to sufficient fun (2.53), whereas the mouse is the least fun HCI, achieving a poor rating (1.93).
We also computed the ANOVA between the groups of scores provided for the three HCIs, which showed significant differences for all 5 hedonic qualities. In particular:
- Satisfaction (p < 0.001): significant difference between hand gestures-mouse and hand gestures-keyboard; no significant difference between mouse-keyboard.
- Comfort (p < 0.05): significant difference between hand gestures-mouse; no significant difference between hand gestures-keyboard and mouse-keyboard.
- Interactivity (p < 0.001): significant difference between hand gestures-mouse and hand gestures-keyboard; no significant difference between mouse-keyboard.
- Naturalness (p < 0.05): significant difference between hand gestures-keyboard; no significant difference between hand gestures-mouse and mouse-keyboard.
- Fun (p < 0.001): significant difference between hand gestures-mouse, hand gestures-keyboard, and mouse-keyboard.
The hedonic quality results are confirmed by Figure 12, which summarizes the preferred HCI for each video player control as well as the overall favorite HCI. The mouse is the favorite HCI only for controlling the volume, whereas the keyboard is the preferred HCI only for playing the previous/next videos from the playlist. These results may be explained by the ease and user-friendliness of scrolling the mouse wheel to increase/decrease the volume and of pressing a single key to play the previous/next video. However, the hand gestures are the overall favorite HCI, the favorite HCI for playing/pausing and stopping the video, and the second favorite HCI for controlling the volume and selecting the next/previous video. Thus, the hand gestures were truly appreciated by the test participants for commanding the video player. The keyboard was also well appreciated for most of the player controls, whereas the mouse was the least liked HCI (except for controlling the volume).
Finally, Figure 13 shows the responses to the question "Has the used HCI met your expectations?". This result further confirms that the hand gesture HCI was not only well appreciated but also exceeded the expectations of almost half of the test participants. On the other hand, the mouse and keyboard mostly met participants' expectations, with little surprise.

CONCLUSION
Hand gesture-based HCIs are a promising means of interaction for current and future multimedia-based applications. In the current study, we focused on the QoE assessment of device-free hand gesture-based controls compared with traditional HCIs, i.e., the mouse and the keyboard. The application under test was a video player, and the test participants had to perform common video controls (i.e., play, pause, stop, increase/decrease volume, next/previous video) using the three considered HCIs. We proposed our DT-HGR system to control the video player using hand gestures; its HGR algorithm relies on a completely novel approach that exploits separate transformers for each hand-finger feature set. The DT-HGR aims to benefit from the transformer architecture to better identify the finger characteristics that drive the following classification. The recognition accuracy of the DT-HGR (0.853) outperforms the accuracy achieved by 11 state-of-the-art HGR approaches tested on the same NVIDIA dataset. Moreover, the DT-HGR achieved an accuracy of 0.808 when cross-validated on the dataset in [13].
The comprehensive evaluation (objective and subjective) conducted in this study highlights the importance of assessing the subjective QoE perceived by the users. Indeed, although the objective performance of the proposed HGR application showed some minor flaws (e.g., some gestures were not immediately recognized and participants had to repeat them), the subjective results emphasized the level of satisfaction and appreciation for the hand gestures compared with the two traditional HCIs. In particular, hand gestures achieved pragmatic quality comparable to the traditional HCIs and the greatest scores for all hedonic quality dimensions, i.e., satisfaction, comfort, interactivity, naturalness, and fun. Moreover, the hand gestures were chosen as the overall preferred HCI and the favorite HCI for performing half of the considered controls, i.e., play/pause/stop. Furthermore, subjective feedback can be the input for identifying the weak points of the HGR application and understanding how to solve them to enhance the objective performance.
Overall, the obtained results also confirm the current (and likely future) trend of end users willing and able to command high-tech systems in a multimodal, device-free way. Intelligent personal assistants (IPAs), such as Google Home and Amazon's Alexa, are examples of devices controlled through vocal commands. Hand gestures can be an alternative or a support for vocal commands (e.g., for deaf-mute people) for future IPAs (equipped with a camera) and can be extremely useful for several future applications, such as interacting with smart walls in public spaces to get information; controlling the radio, air conditioning, or navigation systems in cars (autonomous ones, in particular); and manipulating objects or controlling virtual/augmented reality environments.
However, the widespread use of hand gesture-based systems requires addressing some challenges, for example, gesture standardization, system scalability, and technology integration. Gesture standardization is required to avoid retraining the system for each new use case. However, a dataset such as that collected by NVIDIA [25] already includes a complete set of gestures for controlling multimedia systems (adopted from existing commercial systems and popular datasets), which may be considered a standard set for implementing hand gesture-based media player controls; indeed, most state-of-the-art HGR approaches tested their methods on this dataset. Concerning scalability, the 25 gestures included in the NVIDIA dataset can be more than enough to control multimedia applications, because the user should be able to remember the gestures needed to control an application; generally, only a subset of these 25 gestures is selected to control a specific multimedia application. Moreover, increasing the number of gestures raises the chances of having similar gestures, which may be confused by the HGR system, decreasing its performance. Finally, technology integration only requires HGR systems to be supported by a camera and software that links the recognized gesture to a specific action; thus, they are easily extendable to many applications.
In future work, we aim to test and validate the proposed system in different environments for further application use cases. From the experience gained in developing the proposed use case, we obtained and reported some guidelines concerning the maximum workable distances and angles that allow the system to work correctly. The proposed HGR system should perform well whenever the workable conditions are respected, i.e., whenever the user executes the correct gesture in front of a camera, regardless of the type of application to be controlled. These guidelines can be followed to precisely define the workable conditions for each considered environment and application, which requires specific experimental tests. We also aim to investigate the impact of feeding the proposed HGR system with multimodal image information (e.g., depth, optical flow) to verify whether the performance can be further enhanced.

Fig. 1. The six hand gestures selected from the dataset in [13] considered for the proposed DT-HGR system. Gesture 17: stop the playback. Gesture 19: increase the volume (clockwise rotation). Gesture 20: decrease the volume (anticlockwise rotation). Gesture 24: play/pause the playback. Gesture 25: go to the previous content. Gesture 26: go to the next content.
Figure 2 illustrates how the proposed DT-HGR solution implements the hand gesture recognition and classification tasks. The 21 hand landmarks provided by the hand detector are the input of the Temporal landmark analysis module, which applies a temporal
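The per-finger split of the landmarks can be illustrated as below. The index grouping follows the common 21-point hand-landmark convention (wrist plus four points per finger); whether the paper's detector emits landmarks in exactly this order is our assumption.

```python
# Common 21-point hand-landmark layout: index 0 = wrist, then 4 points per finger.
FINGER_LANDMARKS = {
    "thumb":  [1, 2, 3, 4],
    "index":  [5, 6, 7, 8],
    "middle": [9, 10, 11, 12],
    "ring":   [13, 14, 15, 16],
    "pinky":  [17, 18, 19, 20],
}

def split_by_finger(landmarks):
    """Split one frame's 21 (x, y, z) landmarks into per-finger feature sets,
    each of which would feed its own transformer branch."""
    assert len(landmarks) == 21
    return {finger: [landmarks[i] for i in idx]
            for finger, idx in FINGER_LANDMARKS.items()}

frame = [(i * 0.01, i * 0.01, 0.0) for i in range(21)]  # dummy frame
per_finger = split_by_finger(frame)
print({finger: len(pts) for finger, pts in per_finger.items()})
```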

Fig. 2. The proposed DT-HGR solution for gesture recognition and classification.

Fig. 3. Confusion matrix achieved by the proposed DT-HGR for the 25 classes of the NVIDIA dataset.

Fig. 4. Confusion matrix achieved by the proposed DT-HGR for the first 25 classes of the dataset in [13] (i.e., those adopted by the NVIDIA dataset).

Fig. 7. Example of video series where the participant is asked to perform specific tasks and answer video content-related questions.

Fig. 9. Mean perceived physical fatigue with the 95% CI for the three HCIs.

Fig. 10. Perceived usability in terms of SUS with the 95% CI for the three HCIs.

Fig. 12. Preferred HCI for each video player control and overall favorite HCI.

Fig. 13. Response to the question: "Has the used HCI met your expectations?".

Table 1 .
Comparison of the HGR Accuracy Achieved by State-of-the-Art HGR Systems and the Proposed DT-HGR When Both Training and Testing Are Performed on the NVIDIA Dataset [25]

Table 2 .
Performance of the Proposed DT-HGR Achieved for the Six Considered Hand Gestures in Terms of Accuracy, Precision, and Recall

Table 3 .
Video Player Controls with the 3 Different HCIs for the VLC Video Player

Control          Gesture   Mouse                                       Keyboard
Play/Pause       24        Left click on the "Play/Pause" button       Spacebar key
Previous video   25        Left click on the "Previous video" button   P key
Next video       26        Left click on the "Next video" button       N key

naturalness, and fun. In Section 4.1, we describe the designed assessment methodology. Section 4.2 discusses the evaluation process.

Table 4 .
Number of Miss Actions and Errors Observed, and PScore Results Computed for Each Video Player Control and Each HCI