EMG-Based Automatic Gesture Recognition Using Lipschitz-Regularized Neural Networks

This article introduces a novel approach for building a robust Automatic Gesture Recognition system based on Surface Electromyographic (sEMG) signals, acquired at the forearm level. Our main contribution is to propose new constrained learning strategies that ensure robustness against adversarial perturbations by controlling the Lipschitz constant of the classifier. We focus on nonnegative neural networks for which accurate Lipschitz bounds can be derived, and we propose different spectral norm constraints offering robustness guarantees from a theoretical viewpoint. Experimental results on four publicly available datasets highlight that a good tradeoff in terms of accuracy and performance is achieved. We then demonstrate the robustness of our models, compared with standard trained classifiers in four scenarios, considering both white-box and black-box attacks.


INTRODUCTION
In recent years, the concept of human-computer interaction (HCI) has been at the core of many scientific and sociological developments. Combined with the power of machine learning algorithms, it has led to some of the most outstanding achievements in today's technology, which are used successfully in an ever-increasing number of areas impacting our lives, e.g., medicine [29], autonomous driving [38], natural language processing [57], and so on. Researchers all around the world focus on providing new intuitive and accurate ways of interacting with nearby devices, based on gesture, voice, or vision analysis [36]. Gestures constitute a universal and intuitive way of communication, with the potential of bringing the Internet of Things (IoT) experience to a different, more organic level [48]. Automatic gesture recognition (AGR) algorithms can be successfully used in various applications, from sign language recognition (SLR) [12] to VR games [56].
Various solutions for AGR based on image or video stream analysis, leveraging computer vision algorithms, have been proposed; see for example [22,25,34]. A multi-stream solution for dynamic hand-gesture recognition is described in Reference [58]. Multi-modal approaches for gesture classification have also been studied [42]. A novel method showing a fully neuromorphic implementation [9] achieves good results (96% accuracy while reducing the inference time by 30%). Although a good performance is achieved on synthetic data, in real-life scenarios these systems may be sensitive to environmental conditions, e.g., light conditions, background, and so on. Additionally, these systems are often computationally demanding and consequently not always suited for real-time applications. Accelerometers and electromyography (EMG) sensors provide an alternative low-cost technology for gesture sensing [30]. sEMG stands for surface EMG and represents the electrical manifestation of the neuromuscular activation related to the contraction of the muscles [4]. In Reference [46], the authors propose a method combining feature selection with ensemble learning, achieving around 78% classification accuracy for 53 gestures. The applications of sEMG-based classification systems are focused on, but not limited to, assistive devices and rehabilitation or postural control therapy for physically impaired persons [31]. With the continuous development of more versatile signal processing techniques, the applications of EMG signal classification expanded to a wide range of domains including augmented reality, the gaming industry, military applications, and so on [32,43].
Two critical issues need to be addressed when developing AGR algorithms: inference that is fast enough to give the end-user a real-time feel, and classification that is accurate and robust enough to guarantee that the gesture is correctly identified no matter the environmental conditions.
However, deep neural networks (DNNs), which are probably the most powerful methods, may appear as black boxes whose robustness is not always well controlled. For real-life applications, it is mandatory to guarantee the reliability of such techniques. Nowadays, the main difficulty to overcome consists in developing high-performance systems that are also trustworthy and safe. An additional challenge is to avoid a heavy implementation burden during the learning phase.
In Reference [52], the authors showed that slightly altering data inputs that were correctly classified by the network can lead to a wrong classification [7,35,53]. This finding was at the origin of the concept of adversarial inputs, which constitute malicious input data that can deceive machine learning models. For example, Reference [7] shows how voice interfaces can be fooled by creating carefully crafted artificial audio inputs of unintelligible voice that are misclassified as specific vocal commands by the system. Also, Reference [27] introduces several methods for generating adversarial examples on ImageNet that are so close to the original data that the differences are indistinguishable to the human eye.
It must be emphasized that adversarial inputs are not necessarily artificially created with the intention to sabotage the system. Like other physiological signals, e.g., EEG or EKG, EMG signals have low-frequency components (usually between 10 and 150 Hz) and low amplitudes (≤ 10 mV peak-to-peak). This makes them very sensitive to noise and outside perturbations that can occur innately, in the form of noise stemming from acquisition devices, imperfect sensor contact, and so on. Those can seriously flaw the performance of real-life applications based on pre-trained models [40]. An empirical way of training more robust AGR systems is detailed in Reference [23], where a strategy of training using noisy labels is proposed.
As highlighted in Reference [27], the Lipschitz behaviour of a network is tightly correlated with its robustness against adversarial attacks. The Lipschitz constant allows one to upper-bound the output perturbation knowing the magnitude of the input one, for a given metric [50]. Controlling this constant thus represents a feasible solution to limit the effect of adversarial attacks. Computing the exact Lipschitz constant of a neural network is, however, a very complex problem, so the main challenge is to find clever ways to approximate this constant effectively.
Recently, several techniques to ensure the Lipschitz stability of neural networks have been explored. For example, Reference [53] proposes a novel weight spectral normalization technique applied to stabilize the training of the discriminator in Generative Adversarial Networks (GANs). The Lipschitz constant of the network is viewed as a hyper-parameter that can be tuned in the training process of the image generation task. Doing so leads to a model with improved generalization capabilities.
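Spectral normalization schemes of this kind typically avoid a full SVD and instead estimate the largest singular value by power iteration. The following is a minimal illustrative sketch of such an estimate (the helper name `spectral_norm` is ours, not code from the cited work):

```python
import numpy as np

def spectral_norm(W, n_iter=50, rng=None):
    """Estimate the spectral norm (largest singular value) of W by power iteration."""
    rng = np.random.default_rng(rng)
    u = rng.standard_normal(W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v)   # right singular vector estimate
        u = W @ v
        u /= np.linalg.norm(u)   # left singular vector estimate
    # Rayleigh-quotient-like estimate of the top singular value
    return float(u @ W @ v)

W = np.array([[3.0, 0.0], [0.0, 1.0]])
print(spectral_norm(W))  # close to 3.0
```

In practice, a single power-iteration step per training update is often enough, since the weight matrices change slowly between iterations.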
In Reference [3], norm-constrained GroupSort-based architectures are proposed and it is shown that they can be used as universal Lipschitz function approximators. The authors apply gradient norm preservation to create Lipschitzian networks that offer adversarial robustness guarantees. In Reference [14], the authors introduce Parseval networks, another approach for designing networks that are intrinsically robust to adversarial noise, by imposing that the Lipschitz constant of each layer of the system be less than 1. In Reference [24], a convex optimization framework is introduced to compute tight upper bounds on the Lipschitz constant of DNNs. It makes use of the observation that commonly used activation operators are gradients of convex functions. Semidefinite programming approaches to ensure robustness are also explored in Reference [45].
The main contributions of this article are:
- To propose a robust real-time AGR system based on sEMG signals. The robustness is ensured by using a novel learning algorithm for training feedforward neural networks.
- To show that a good accuracy-robustness balance can be reached. To do so, we train the system under carefully crafted spectral norm constraints, allowing us to finely control its Lipschitz constant. A tight Lipschitz constant is efficiently estimated by focusing on neural networks with positive weights, as in Reference [13].
- To demonstrate the performance of the final architecture in real-life experiments, where we show that the proposed robust model outperforms those trained conventionally.
- To analyze how our system behaves when the input is affected by different noise levels, simulating perturbations that may occur in real scenarios.
- To show the validity of our solution by experimenting on four distinct publicly available sEMG gesture datasets.
The rest of the article is structured as follows. The theoretical background of our work is detailed in Section 2. In Section 3, we present the proposed optimization algorithm and we investigate the way of dealing with the constraints. The application and the results are discussed in Section 4, while Section 5 deals with how our model behaves when facing adversarial data. Finally, Section 6 contains some concluding remarks.

ROBUSTNESS SOLUTIONS IN THE CONTEXT OF NONNEGATIVE NEURAL NETWORKS

Problem Formulation
Any feedforward neural network is obtained by cascading m layers associated with operators (T_i)_{1≤i≤m}. The neural network can thus be expressed as the following composition of operators:

T = T_m ∘ T_{m−1} ∘ ⋯ ∘ T_1.  (1)

Each layer i ∈ {1, ..., m} has a real-valued vector input x_{i−1} ∈ R^{N_{i−1}} and computes

T_i(x_{i−1}) = R_i(W_i x_{i−1} + b_i),  (2)

where W_i ∈ R^{N_i × N_{i−1}} and b_i ∈ R^{N_i} are the weight matrix and bias parameter, respectively. R_i : R^{N_i} → R^{N_i} constitutes a non-linear activation operator which is applied component-wise (e.g., ReLU or Sigmoid) or globally (e.g., Softmax). Figure 1 shows a graphical representation of this concept.
Even though the choice of the activation R_i may differ depending on the task at hand, it has been shown in References [16,18] that most of them are actually α_i-averaged operators with α_i ∈ [1/2, 1]. When α_i = 1/2, R_i is said to be firmly nonexpansive. For standard choices of activation operators, R_i is firmly nonexpansive since it is the proximity operator of a proper, lower-semicontinuous function (see Reference [16] for more details). Note that, in Reference [24], it is assumed that R_i operates component-wise and is slope-bounded. The authors emphasize that the most common case corresponds to lower and upper slope values equal to 0 and 1, respectively. It follows from [15, Proposition 2.4] that a function satisfies this property if and only if it is the proximity operator of some proper lower-semicontinuous convex function, so that assumptions similar to those made in Reference [16] are recovered. As explained in Reference [18], examples of activation operators R_i which are α_i-averaged with α_i > 1/2 can be encountered. They basically correspond to over-relaxations of firmly nonexpansive operators. An example of such operators is the Swish activation function [49]. Another famous example is the group-sort operator

R_i(x_i) = (x↑_{i,1}, ..., x↑_{i,M}),

where the vector x_i has been decomposed into M subvectors x_{i,j} with j ∈ {1, ..., M}, of dimension B (N_i = BM), and x↑_{i,j} designates the vector of components of x_{i,j} sorted in ascending order. R_i is then purely nonexpansive, i.e., α_i = 1. Note that max-pooling can be achieved by composing this group-sort operation with a linear operator. Indeed, if i < m, M = N_{i+1}, and W_{i+1} is the matrix extracted from the N_i × N_i identity matrix Id_{N_i} by selecting the rows with indices that are multiples of B, then W_{i+1} ∘ R_i corresponds to a max-pooling.
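To make the group-sort operator and its max-pooling composition concrete, here is a small illustrative sketch (the helper names are ours, not code from the article):

```python
import numpy as np

def group_sort(x, B):
    """GroupSort activation: split x into blocks of size B and sort each block ascending."""
    x = np.asarray(x, dtype=float)
    M = x.size // B  # number of blocks (N_i = B * M)
    return np.sort(x.reshape(M, B), axis=1).ravel()

def max_pool_via_group_sort(x, B):
    """Max-pooling obtained by composing GroupSort with a row-selection linear map:
    keep the last (largest) entry of each sorted block."""
    y = group_sort(x, B)
    return y[B - 1::B]

x = np.array([4.0, -1.0, 0.5, 2.0, 3.0, -2.0])
print(group_sort(x, 2))               # blocks sorted: -1, 4, 0.5, 2, -2, 3
print(max_pool_via_group_sort(x, 2))  # block maxima: 4, 2, 3
```

The selection `y[B - 1::B]` plays the role of the matrix W_{i+1} extracted from the identity, so the composition is indeed a linear map applied after the (1-Lipschitz) group-sort.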

Lipschitz Robustness Certificate
Consider a neural network T as described in Figure 1. Let x ∈ R^{N_0} be the input of the network and let T(x) ∈ R^{N_m} be its associated output. By adding some small perturbation z ∈ R^{N_0} to the input, the perturbed input is x̃ = x + z. The effect of the perturbation on the output of the system can be quantified by the following inequality:

‖T(x + z) − T(x)‖ ≤ θ_m ‖z‖,

where θ_m ≥ 0 denotes a Lipschitz constant of the network. θ_m thus represents an important parameter that allows us to assess and control the sensitivity of a neural network to various perturbations. It needs, however, to be accurately estimated to provide valuable information. A standard approximation to the Lipschitz constant [27] is given by

‖W_m‖_S × ⋯ × ‖W_1‖_S,  (6)

where ‖·‖_S denotes the spectral norm of a matrix. Although simple to compute, this approximate bound is over-pessimistic. Different methods for obtaining tighter estimates of the Lipschitz constant have been presented in the recent literature; see for example [11,18,24,37,50]. Local estimates of the Lipschitz constant can also be performed, which may appear more relevant. But they are more complex to compute and, as we will see, controlling the global Lipschitz constant is usually sufficient to get a good performance. Estimating the global Lipschitz constant of the network is an NP (non-deterministic polynomial-time)-hard problem [50]. Although there exist efficient approaches to approximate an accurate bound [11,24,37], computing these estimates may be expensive for wide or deep networks. In addition, using these bounds within a training procedure is a difficult task [45]. In this work, we will make the following assumption.
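The product-of-norms bound above is straightforward to compute from the layer weights. A minimal sketch (with `naive_lipschitz_bound` a hypothetical helper name):

```python
import numpy as np

def naive_lipschitz_bound(weights):
    """Product of the spectral norms of the layer weight matrices:
    a simple (often loose) upper bound on the network's Lipschitz constant."""
    return float(np.prod([np.linalg.norm(W, 2) for W in weights]))

# Two toy layers; np.linalg.norm(W, 2) is the spectral norm.
W1 = np.array([[1.0, 2.0], [0.0, 1.0]])
W2 = np.array([[0.5, 0.0], [0.0, 0.5]])
print(naive_lipschitz_bound([W1, W2]))
```

For deep networks, the pessimism of this bound compounds multiplicatively across layers, which is what motivates the tighter estimates discussed next.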
Assumption 2.1. Let a neural network be given by (1), where the i-th layer with i ∈ {1, ..., m} is given by (2). We assume that (i) all the activation layers, except possibly the last one, consist of separable averaged operators, that is, for every i ∈ {1, ..., m − 1}, there exist averaged scalar functions (ρ_{i,k})_{1≤k≤N_i} such that, for every x = (x_k)_{1≤k≤N_i}, R_i(x) = (ρ_{i,k}(x_k))_{1≤k≤N_i}; (ii) at the last activation layer, R_m is an averaged operator.
Our approach will be grounded on the following result.

Proposition 2.2 ([18]). Suppose that Assumption 2.1 holds. For every i ∈ {1, ..., m}, let A_i be the matrix whose elements are the absolute values of those of W_i. Then

ϑ_m = ‖A_m ⋯ A_1‖_S

is a Lipschitz constant of T. In addition,

‖W_m ⋯ W_1‖_S ≤ ϑ_m ≤ ‖W_m‖_S ⋯ ‖W_1‖_S.  (8)

In particular, if, for every i ∈ {1, ..., m}, the elements of W_i are nonnegative, then A_i = W_i and ϑ_m is equal to the lower bound in (8).
Based on this proposition, the best estimate for the Lipschitz constant of a given feedforward neural network having nonnegative weights simplifies to the spectral norm of the product of all the weight matrices composing the network.More precisely, the obtained Lipschitz constant is the Lipschitz constant of a purely linear network, where all the non-linear activation operators have been replaced with the identity operator.
The above result is guaranteed to be valid only in the case when all the weights are nonnegative. In the general case of networks with weights having arbitrary signs, it can be proved that ‖W_m × ⋯ × W_1‖_S represents only a lower bound on the Lipschitz constant established in Reference [18]. It is also worth mentioning that the proposed results hold for any algebraic structure of the weight matrices (W_i)_{1≤i≤m}. Using the above-defined bound, in the following, we will propose an algorithm for training models with theoretical robustness guarantees, and validate it in the context of gesture classification. By focusing on gesture recognition, we aim to showcase the effectiveness of our methodology in a challenging domain having multiple applications and for which real experiments can be made. Moreover, gesture recognition tasks often involve complex and dynamic data, making them suitable testbeds for evaluating the robustness and adaptability of our proposed approach.
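The gap between the two bounds in (8) is easy to observe numerically. The sketch below draws random nonnegative weight matrices and compares the product of spectral norms with the spectral norm of the product, which is the valid Lipschitz constant for nonnegative networks:

```python
import numpy as np

rng = np.random.default_rng(0)
# Random nonnegative weights, as in the networks considered here.
W1 = rng.random((64, 32))
W2 = rng.random((16, 64))

loose = np.linalg.norm(W1, 2) * np.linalg.norm(W2, 2)  # product of spectral norms
tight = np.linalg.norm(W2 @ W1, 2)                     # spectral norm of the product

print(f"product of norms: {loose:.2f}, norm of product: {tight:.2f}")
```

By submultiplicativity of the spectral norm, the second quantity never exceeds the first, and for generic matrices it is strictly smaller.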

OPTIMIZATION METHODS FOR TRAINING ROBUST FEEDFORWARD NETWORKS

Stochastic Gradient Descent-Projected Variant
Standard training of neural networks consists in the minimization of a nonconvex cost function with respect to the model parameters by means of an iterative strategy. Let L be the cost function defined as follows:

L(η) = Σ_{k=1}^{K} ℓ(η, z_k),

where η = (η_i)_{1≤i≤m} is a vector encompassing all the model parameters. For each layer i ∈ {1, ..., m}, η_i denotes a vector of dimension N_i(N_{i−1} + 1) that contains the scalar variables associated with the weight matrix W_i and the corresponding bias components b_i. The data information is represented by (z_k)_{1≤k≤K}. For every k ∈ {1, ..., K}, z_k is a pair consisting of an input of the system and the associated desired output (ground truth). Also, ℓ represents the loss function, assumed to be differentiable (almost everywhere) with respect to η.
To ensure robustness, we shall impose spectral norm constraints on the weight matrices.In other words, the vector of parameters η is constrained to belong to a closed set S that will be described in the next section.We propose to use an extension of a standard optimization technique for training neural networks [19].More specifically, we will implement a projected stochastic gradient algorithm.A momentum parameter is introduced in this algorithm to accelerate the convergence process.
Algorithm 1 describes the iterations performed at each epoch n > 0. We see that there are two nested loops: the outer loop operates on the batch index q and the inner one on the layer index i. In this algorithm, γ_n ∈ ]0, +∞[ is the learning rate, while ζ_n ∈ [0, +∞[ denotes the inertia parameter for momentum. The algorithm is very similar to block-iterative techniques used in convex optimization [19]. The parameters of each layer are indeed updated successively by performing a gradient step on the data in the current mini-batch (which can be epoch-dependent). ∇_i represents the gradient, computed by the standard backpropagation mechanism, with respect to η_i for each i ∈ {1, ..., m}. This stochastic gradient step is followed by a projection P_{S_{i,n}} onto the constraint set S_{i,n}. The definition of this set as well as the way of handling this projection are detailed in the following.
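A single parameter update of such a projected stochastic gradient scheme with momentum can be sketched as follows (a simplified illustration of the update structure, not the article's Algorithm 1; the helper names and the toy problem are ours):

```python
import numpy as np

def projected_momentum_step(eta, grad, velocity, lr, zeta, project):
    """One projected gradient update with momentum for a single layer's
    parameter vector eta. `project` maps onto the constraint set."""
    velocity = zeta * velocity - lr * grad(eta)  # momentum accumulation
    eta = project(eta + velocity)                # gradient step + projection
    return eta, velocity

# Toy usage: minimize ||eta||^2 subject to eta >= 0 (projection = clipping).
eta = np.array([2.0, -1.0])
v = np.zeros(2)
for _ in range(100):
    eta, v = projected_momentum_step(
        eta, grad=lambda e: 2 * e, velocity=v, lr=0.1, zeta=0.02,
        project=lambda e: np.clip(e, 0.0, None))
print(eta)  # approaches the constrained minimizer [0, 0]
```

In the full algorithm, `project` is the layer- and iteration-dependent projection P_{S_{i,n}} discussed below, and the gradient is computed on the current mini-batch.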

Constraint Sets
As mentioned before, this work revolves around feed-forward networks with positive weights. Thus, the first condition that we impose is nonnegativity for each layer i ∈ {1, ..., m}, which is modeled by the constraint set

D_i = {W_i ∈ R^{N_i × N_{i−1}} | W_i has nonnegative elements}.  (10)

Moreover, based on our standing assumptions and Proposition 2.2, we must impose a spectral norm constraint on the weight matrices to control the robustness of the system. This translates mathematically as the following upper bound constraint:

‖W_m ⋯ W_1‖_S ≤ ϑ,  (11)

where ϑ represents the target maximum Lipschitz constant of the network. This bound constitutes a direct measure of the system's level of robustness against adversarial inputs. We need to handle these two constraints simultaneously during the training process. Imposing nonnegativity is fairly easy since (10) defines a simple convex constraint. By contrast, constraint (11) does not satisfy the convexity property. Since (11) corresponds to a closed set in the underlying space of weight matrices and this set has a nonempty intersection with the nonnegativity set, the projection onto the intersection of the two sets can be defined, but it is not guaranteed to be unique. To circumvent this difficulty, it can be noticed that (11) actually defines a multi-convex constraint in the sense that if, for every i ∈ {1, ..., m}, the matrices (W_j)_{1≤j≤m, j≠i} are given, then (11) imposes a convex constraint on W_i. This suggests introducing the following closed and convex set:

C_{i,n} = {W_i ∈ R^{N_i × N_{i−1}} | ‖A_{i,n} W_i B_{i,n}‖_S ≤ ϑ}  (12)

in order to control the Lipschitz constant. Hereabove, the matrices A_{i,n} and B_{i,n} represent the products of the weight matrices of the posterior and the previous layers, respectively. By adopting the convention that A_{i,n} = Id if i = m and B_{i,n} = Id if i = 1, we define these matrix products as

A_{i,n} = W_{m,n} ⋯ W_{i+1,n},   B_{i,n} = W_{i−1,n} ⋯ W_{1,n},

where (W_{j,n})_{1≤j≤m} denote the estimates of the weight matrices at iteration n, as they appear in Algorithm 1. Thus, our objective will be to perform the projection onto the set S_{i,n} = D_i ∩ C_{i,n}, for each layer i ∈ {1, ..., m} and at each iteration n. Several algorithms can be envisaged to solve this convex optimization problem.
Before describing our proposed algorithmic solution, let us recall the expressions of the required elementary projections. For every W ∈ R^{S×T}, the projection of W onto [0, +∞[^{S×T} is given component-wise, for every s ∈ {1, ..., S} and t ∈ {1, ..., T}, by

(P_{[0,+∞[^{S×T}}(W))_{s,t} = max{W_{s,t}, 0}.

Let B(0, ϑ) be the closed spectral ball of center 0 and radius ϑ defined as

B(0, ϑ) = {W ∈ R^{S×T} | ‖W‖_S ≤ ϑ}.

For every W = (W_{s,t})_{1≤s≤S, 1≤t≤T} ∈ R^{S×T}, let U Λ V^⊤ be the singular value decomposition of W, where U ∈ R^{S×R} and V ∈ R^{T×R} are matrices such that U^⊤U = Id and V^⊤V = Id, R = min{S, T}, and Λ = Diag(λ_1, ..., λ_R), with (λ_r)_{1≤r≤R} ∈ [0, +∞[^R being the singular values of W. Then the projection of W onto B(0, ϑ) is expressed as

P_{B(0,ϑ)}(W) = U Λ̃ V^⊤, where Λ̃ = Diag(λ̃_1, ..., λ̃_R) and λ̃_r = min{λ_r, ϑ}.

To compute the projection onto S_{i,n} of a given matrix, we propose to employ the FISTA (Fast Iterative Shrinkage-Thresholding Algorithm) version of a dual forward-backward method in Algorithm 2. This algorithm is based on a dual proximal approach [33] and constitutes an extension of the optimization method originally proposed in Reference [17]. The rationale for this algorithm is given in the appendix.
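The projection onto the spectral ball by singular value clipping can be sketched directly (the helper name is ours):

```python
import numpy as np

def project_spectral_ball(W, theta):
    """Project W onto the spectral ball {M : ||M||_S <= theta} by
    clipping its singular values at theta."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.minimum(s, theta)) @ Vt

W = np.diag([3.0, 0.5])
P = project_spectral_ball(W, 1.0)
print(P)  # singular value 3.0 clipped to 1.0; 0.5 left untouched
```

Note that singular values already below ϑ are left unchanged, so matrices inside the ball are fixed points of the projection.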

Handling Looser Constraints
The Lipschitz constant of the network can be controlled in multiple ways. Besides the solution formulated in Section 3.2, a more standard approach to control it [52] consists in imposing

‖W_m‖_S ⋯ ‖W_1‖_S ≤ ϑ.

Two strategies have been implemented to enforce this constraint.
(i) The first one consists in imposing a uniform bound on the spectral norm of each weight matrix (W_i)_{1≤i≤m}, which leads to the following convex constraint sets:

C̄_i = {W_i ∈ R^{N_i × N_{i−1}} | ‖W_i‖_S ≤ ϑ^{1/m}}.  (20)

(ii) The second strategy aims at introducing more flexible bounds on the spectral norms of each layer. It is based on the following choice for the individual convex constraint sets:

Č_{i,n} = {W_i ∈ R^{N_i × N_{i−1}} | ‖W_i‖_S Π_{j≠i} ‖W_{j,n}‖_S ≤ ϑ}.

For every i ∈ {1, ..., m}, projecting onto C̄_i or Č_{i,n} is performed by truncating a singular value decomposition, similarly to the technique described at the end of Section 3.2. The projections onto C̄_i ∩ D_i and Č_{i,n} ∩ D_i can then be computed by using the same iterative method as in Algorithm 2, with the spectral ball radius set accordingly. In all the proposed constrained optimization methods, the projection P_{B(0,ϑ)} onto a spectral ball with radius ϑ > 0 plays a prominent role. The ball radius depends on the handled constraint (11), (20), or (10). A complex operation such as a singular value decomposition may be very demanding in terms of computational resources when dealing with large-size matrices. In that case, we propose to use an approximate projection [53] defined as

P̃_{B(0,ϑ)}(W) = W if ‖W‖_S ≤ ϑ, and (ϑ/‖W‖_S) W otherwise.  (21)

Using this approximation in Algorithm 2 yields approximate projections (P̃_{C_{i,n} ∩ D_i})_{1≤i≤m, n>0}. Note, however, that we then lose the theoretical guarantees of convergence of Algorithm 2, even if this issue was not observed in our implementation.
An additional advantage of Formula (21) is that it preserves the nonnegativity of the elements of the input matrix. This allows us to derive cheap approximate versions of the projection onto C̄_i ∩ D_i with i ∈ {1, ..., m} by first projecting onto D_i and then applying the approximate projection onto C̄_i. The resulting approximate projection is denoted by (P̃_{C̄_i ∩ D_i})_{1≤i≤m}. A similar procedure can be followed to compute approximate projections onto Č_{i,n} ∩ D_i.
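A common SVD-free surrogate of this kind simply rescales the whole matrix so that its spectral norm does not exceed ϑ. The sketch below illustrates that idea under this assumption (it may differ in detail from the article's exact formula); because it is a single nonnegative scaling, it keeps nonnegative entries nonnegative, unlike an SVD truncation:

```python
import numpy as np

def approx_project_spectral_ball(W, theta):
    """Approximate projection onto {||M||_S <= theta}: rescale W so its
    spectral norm is at most theta. A global scaling preserves the signs
    (and hence the nonnegativity) of all entries."""
    s = np.linalg.norm(W, 2)  # could be replaced by a power-iteration estimate
    return W if s <= theta else (theta / s) * W

W = np.array([[2.0, 1.0], [0.0, 2.0]])
P = approx_project_spectral_ball(W, 1.0)
print(np.linalg.norm(P, 2))  # at most 1.0
```

The price of this shortcut is that all singular values are shrunk by the same factor, so the result is generally not the exact (closest-point) projection.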

sEMG Datasets
We test our proposed training scheme on four online datasets containing EMG information of different hand gestures. The first three were acquired using the Myo armband, a device developed by Thalmic Labs, equipped with eight sEMG sensors arranged circularly, while the last one was acquired using 10 active double-differential OttoBock MyoBock 13E200 sEMG electrodes.

Myo-sEMG. The first dataset, detailed in Reference [21], contains EMG signals characterizing 7 hand gestures correlated to the primary movements of the hand. There are four mobility gestures (i.e., wrist flexion and extension, ulnar, and radial deviation) and two gestures used for grasping and releasing objects (i.e., spread fingers and close fist). The 7th gesture characterizes the neutral position, corresponding to the relaxation of the muscles.
13Myo-sEMG. The second dataset includes 13 gestures: the same 7 gestures described above, plus 6 additional classes. It contains gestures from 50 different subjects and two sets of trials per user. All 13 gestures are depicted in Figure 2. More details about the dataset can be found in Reference [2].

NinaPro DB5.C. The third dataset is a subset of the NinaPro DB5 dataset, detailed in [47]. The dataset is acquired using two Myo armbands, one positioned just below the elbow and the other one closer to the arm. For our experiments, we considered subset C, which contains sEMG data associated with 24 gestures.
NinaPro DB1. The fourth dataset was introduced in Reference [5], and encompasses physiological data acquired from 27 able-bodied subjects, performing a total of 53 different gestures. The sEMG data is recorded using 10 electrodes, positioned as follows. The first eight electrodes are evenly distributed around the forearm using an elastic band, maintaining a consistent distance from the radio-humeral joint located directly below the elbow. Two more electrodes are strategically positioned on the major flexor and extensor muscles of the forearm.
We also validate our models in a real-context scenario.For the real-life predictions, we recorded the EMG activity associated with each gesture at forearm level using Myo armband.The information collected from each channel is transmitted to a computer via Bluetooth protocol where it is processed to extract relevant time domain features that will be used by the classifier to determine which gesture has been performed.

Proposed Architecture
The raw 8/10-channel EMG signal is split using a 250 ms sliding window, with 50% overlap. A 250 ms window is long enough to cover the most common gesture durations, ensuring that the essential temporal aspects of each gesture are captured within this window. Overlap ensures that important signal characteristics, such as abrupt changes or transient patterns, are not missed due to window boundaries. By using overlapping windows, the feature extraction process also becomes more robust, as multiple windows contribute to representing the same temporal information from the EMG signal. From each window of each channel, a series of 8 time descriptors is extracted. The information from all the channels is then concatenated, forming a 64 (80 for the fourth dataset) dimensional vector. The 7-gestures dataset contains around 200k vector samples, the 13-gestures dataset has around 59k vector samples, the 24-gestures dataset has around 20k vector samples, while the 53-gestures dataset has 250k vector samples. These are split into training, validation, and test sets at user level according to the ratio 70%, 20%, and 10%. These vectors are fed to the network in mini-batches of size 2048. For our experiments, we used the categorical cross-entropy as the loss function, with a learning rate γ = 10^{−3} and momentum parameter ζ = 0.02. The considered architectures consist of 6-hidden-layer (m = 6) fully connected neural networks, with different parameters depending on the considered datasets, but the same core structure, as displayed in Figure 3. Let x = (x_k)_{0≤k≤K−1} be the vector of EMG samples acquired on a window from one channel. For this work, we considered some of the most relevant features to describe sEMG data, as follows.
(i) Mean Absolute Value (MAV)-represents the average muscle activation level within a specific time window. As different gestures involve varying degrees of muscle activation, MAV can capture the overall muscle activity pattern, helping to distinguish between gestures with low and high muscle involvement. It is given by

MAV(x) = (1/K) Σ_{k=0}^{K−1} |x_k|.
(ii) Zero Crossing Rate (ZCR)-indicates how frequently the EMG signal crosses zero within a time window. Rapid changes in muscle activation lead to higher ZCR values, making it relevant for identifying gestures involving quick and repetitive movements. A threshold α ≥ 0 is used in order to lessen the noise effect. This feature can be computed in an incremental manner and is defined as

ZCR(x) = card{k ∈ {0, ..., K − 2} | x_k x_{k+1} < 0 and |x_k − x_{k+1}| ≥ α}.

(iii) Waveform Length (WL)-quantifies the amplitude variations within a time window.
Longer WL values may correspond to gestures involving sustained muscle activity or complex patterns. It corresponds to the following total variation seminorm:

WL(x) = Σ_{k=1}^{K−1} |x_k − x_{k−1}|.

(iv) Slope Sign Changes (SSC)-counts the number of times the slope of the EMG signal changes its sign within a window. It is effective in detecting abrupt changes in muscle activation, which is crucial for recognizing gestures with distinct start and stop points. It amounts to checking a condition on three consecutive samples x_{k−1}, x_k, x_{k+1} with k ∈ {1, ..., K − 2}:

(x_k − x_{k−1})(x_k − x_{k+1}) ≥ α,

where the threshold α > 0 is employed to reduce the influence of the noise.

(v) Root Mean Square (RMS)-provides information about the overall energy of the EMG signal within a time window. High energy levels may correspond to forceful gestures, while lower energy levels may indicate more subtle movements. RMS helps in recognizing gestures with varying intensity levels and is given by

RMS(x) = ( (1/K) Σ_{k=0}^{K−1} x_k^2 )^{1/2}.

(vi) Hjorth parameters-are a set of three features originally developed for characterizing electroencephalography signals and then successfully applied to sEMG signal recognition. The most relevant, the Hjorth activity parameter, can be thought of as the integrated power spectrum and basically corresponds to the variance of the signal, calculated as follows:

Activity(x) = (1/K) Σ_{k=0}^{K−1} (x_k − μ(x))^2,

where μ(x) represents the mean value of the signal. The standard deviation and RMS(x) are equal when the mean of the signal is zero.

(vii) Skewness-measures the asymmetry of the EMG signal amplitude distribution within a time window. Positive skewness indicates a longer tail on the right side, while negative skewness indicates a longer tail on the left side; this can be useful in identifying gestures with asymmetric muscle activations.
(viii) Integrated Square-root EMG (ISEMG)-provides a measure of the total muscular activity and is particularly useful for capturing the overall muscle involvement over time. ISEMG is commonly used to quantify muscle fatigue and effort during movements. In the context of gesture recognition, ISEMG can help differentiate between gestures with varying levels of sustained muscle activation and can be indicative of the gesture intensity and duration.
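The windowing and a few of the descriptors above can be sketched as follows (illustrative helper names; only a subset of the 8 features is shown, and the exact thresholds and normalizations used in the article are assumptions):

```python
import numpy as np

def sliding_windows(x, win, overlap=0.5):
    """Split a 1-D signal into windows of length `win` with the given overlap."""
    step = int(win * (1 - overlap))
    return [x[k:k + win] for k in range(0, len(x) - win + 1, step)]

def emg_features(x, alpha=0.0):
    """A few of the time-domain descriptors listed above, for one window."""
    x = np.asarray(x, dtype=float)
    mav = np.mean(np.abs(x))                     # Mean Absolute Value
    zcr = np.sum((x[:-1] * x[1:] < 0)            # Zero Crossing Rate
                 & (np.abs(x[:-1] - x[1:]) >= alpha))
    wl = np.sum(np.abs(np.diff(x)))              # Waveform Length
    rms = np.sqrt(np.mean(x ** 2))               # Root Mean Square
    activity = np.var(x)                         # Hjorth activity (variance)
    return np.array([mav, zcr, wl, rms, activity])

x = np.array([0.5, -0.5, 1.0, -1.0, 0.5, -0.5, 0.0, 0.5])
for w in sliding_windows(x, 4):
    print(emg_features(w))
```

In the full pipeline, one such feature vector is computed per channel and per window, and the channel vectors are concatenated before being fed to the classifier.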

Performance Analysis in Terms of Accuracy and Robustness
Our best AGR system trained conventionally achieves state-of-the-art performance [2,30,41] of over 99% accuracy for the first two datasets, around 86% in the case of the 24-gestures dataset [23,46], and around 88.5% in the case of the 53-gestures dataset [55]. A more detailed comparison is presented in Table 1, where we consider other recent works proposing neural-network solutions on the same datasets. (In Table 1, the training time is computed for one epoch, with a batch size of 2048. All models were trained for 1000 iterations. All experiments were performed using 2 × A100 40 GB Nvidia GPUs.) Since, in this case, the weights are not guaranteed to be positive, the lower bound introduced in Proposition 2.2 does not constitute a valid Lipschitz constant. Computing the exact Lipschitz constant θ_m of the system is a very difficult task [18], but we can easily bound θ_m between the spectral norm of the product of all the weight matrices of the network and the estimate given by (6). We found that the Lipschitz constant upper bound θ_m is greater than 10^{12} for all our baseline models. Also, while training our models, we faced the problem of overfitting, which is a challenging issue in the classification of physiological signals. This suggests that, despite the high performance of the classifiers, their robustness is poorly controlled, leaving the systems vulnerable to adversarial perturbations. A first step towards controlling the Lipschitz constant of the classification algorithm, and implicitly its robustness, is to impose the nonnegativity condition associated with constraint D. Training under such a nonnegativity constraint is shown to improve the network operation interpretability [13] and acts as a regularization, reducing overfitting. On the other hand, it can affect the approximation capability and potentially lead to a performance decay. To further study the effect of other regularization techniques from a dual performance-robustness perspective, we trained several models for 1000 iterations using common regularization methods, such as Dropout, ℓ1/ℓ2 Regularization, and Batch Normalization. Such comparisons were also featured in other works like [28]. The results for the 7-gesture dataset are summarized in Table 2. As expected, employing regularization techniques during the training phase improves the overall performance of the baseline classifiers. While the positive impact of regularization techniques on enhancing neural network model performance by mitigating overfitting has been extensively researched and validated, the exploration of their influence on system robustness remains an understudied area. It can be observed that Batch Normalization is the most efficient technique from the accuracy viewpoint, but it comes with an increase in the overall Lipschitz constant of the classifier. Training the proposed system subject to the nonnegativity constraint (D) results in an overall accuracy of 96.92%, 95.87%, 84.75%, and 85.65% for the case of
7, 13, 24, and 53 classes, respectively. The performance decay is balanced by an increase in robustness, since the Lipschitz constant, computed as indicated in Proposition 2.2, equals θ_m = 9.69 × 10¹⁰ for 7 classes, θ_m = 9.73 × 10¹⁰ for 13 classes, θ_m = 1.03 × 10¹¹ for 24 classes, and θ_m = 8.4 × 10¹⁰ for 53 classes. We observed that the accuracy reduction can be overcome by adding layers to the architecture. Indeed, we were able to obtain an accuracy similar to the baseline by adding an extra layer to the existing architecture and retraining both systems subject to D, i.e., 98.68%, 97.21%, 85.12%, and 87.03% for the 7-gesture, 13-gesture, 24-gesture, and 53-gesture datasets, respectively. Furthermore, compared with the unconstrained models, we managed to maintain a high performance while improving the robustness, i.e., θ_m = 1.02 × 10¹¹ for the 7-class dataset, θ_m = 9.96 × 10¹⁰ for the 13-class dataset, θ_m = 4.24 × 10¹¹ for the 24-class dataset, and θ_m = 3.15 × 10¹¹ for the 53-class dataset. We can however conclude from these tests that imposing the nonnegativity of the weight coefficients is not sufficient to reach a satisfactory robustness.
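As an illustration, enforcing constraint D amounts to projecting the weights onto the nonnegative orthant after each update. A minimal sketch, assuming a plain projected-gradient step rather than the full Algorithm 1 (function names and toy values are ours):

```python
import numpy as np

def nonneg_projected_step(W, grad, lr=1e-2):
    """One projected-gradient update under the nonnegativity constraint D:
    take a gradient step, then clamp negative entries to zero."""
    return np.maximum(W - lr * grad, 0.0)

rng = np.random.default_rng(0)
W = rng.uniform(0.0, 1.0, size=(4, 3))   # current nonnegative weights
g = rng.normal(size=(4, 3))              # hypothetical loss gradient
W_new = nonneg_projected_step(W, g)
assert (W_new >= 0.0).all()              # the iterate stays in D
```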
To further control the robustness of the systems, we have to manage the Lipschitz constant of the networks by training them under additional spectral norm constraints, as described by Equation (11). Searching for the optimal accuracy-robustness tradeoff, we trained several models considering each of the aforementioned constraints, namely (C_{i,n})_{1≤i≤m, n∈N} in Equation (12), (C_i)_{1≤i≤m} in Equation (20), and (Č_{i,n})_{1≤i≤m, n∈N} in Equation (10).
By adjusting the upper bound ϑ, we were able to assess the effect of a robustness constraint on the overall performance of the neural network-based classifiers and, finally, to achieve the optimal tradeoff. All our models were trained using Algorithm 1 as the optimizer.
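The exact projection onto a spectral-norm ball, which underlies the constraints above, can be sketched via singular value clipping (a standard construction; names and values are illustrative):

```python
import numpy as np

def project_spectral_ball(W, theta):
    """Exact projection of W onto {M : ||M||_S <= theta}:
    clip the singular values at theta and recompose the matrix."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.minimum(s, theta)) @ Vt

rng = np.random.default_rng(1)
W = rng.normal(size=(6, 4))
theta = 0.95
W_proj = project_spectral_ball(W, theta)
assert np.linalg.norm(W_proj, 2) <= theta + 1e-8   # spectral norm respects the bound
```

Projecting a matrix already inside the ball leaves it unchanged, which is the defining property of an exact projection.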
The obtained results are summarized in Table 3. As expected, obtaining a good robustness-accuracy tradeoff requires paying attention to the way the constrained networks are designed. In all cases, we show that using tight constraints to approximate the Lipschitz bound during the training phase improves the overall performance of the classifier, demonstrating the generalization properties of our solution.
For comparison, for each of the proposed constraints, we also evaluated the use of an inexact projection, designated by P̃ (see Section 3.3). It can be observed that using an exact projection yields significantly better results. By combining tight constraints and exact projection techniques, the robustness of the network can be properly ensured while keeping a good accuracy in both cases. Indeed, we succeeded in ensuring a Lipschitz constant around 1 for a 95% accuracy on the first two datasets. The observed loss in accuracy with respect to standard training is consistent with the "no free lunch" theorem [54].
Training neural networks subject to tight spectral norm constraints can be challenging, and the cost of obtaining a good performance is the training time. We used a learning rate scheduler during training, reducing the learning rate by a factor of 2 if the performance does not improve for 1000 epochs. Figure 4 shows the training curves on both the validation and training sets for the unconstrained baseline model (yellow and green lines) and for a constrained version (red and blue lines) trained using the optimal projection P_{C_{i,n}∩D_i}, with ϑ = 0.95. Even though it requires more iterations, the constrained model is capable of reaching an accuracy comparable with the baseline, while providing a robustness certificate.
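The learning rate schedule mentioned above can be sketched as follows (an illustrative class of our own, with a small patience value for demonstration; the actual experiments use a patience of 1000 epochs):

```python
class PlateauHalver:
    """Halve the learning rate when the monitored accuracy has not improved
    for `patience` consecutive epochs."""
    def __init__(self, lr, patience=1000):
        self.lr, self.patience = lr, patience
        self.best, self.wait = -float("inf"), 0

    def step(self, metric):
        if metric > self.best:              # new best: reset the counter
            self.best, self.wait = metric, 0
        else:                               # plateau: count stalled epochs
            self.wait += 1
            if self.wait >= self.patience:
                self.lr /= 2.0
                self.wait = 0
        return self.lr

sched = PlateauHalver(lr=1e-3, patience=3)   # small patience for illustration
for acc in [0.50, 0.60, 0.60, 0.60, 0.60]:   # accuracy stalls after epoch 2
    lr = sched.step(acc)
print(lr)   # 0.0005: halved once after 3 epochs without improvement
```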
Since the training curves may show slight variations, we measured the accuracy variations in two ways: by computing the classical standard deviation (std), and by employing the median absolute deviation (MAD). For a vector x, the MAD is defined as the median of the values |x_k − ζ(x)|, where ζ(x) denotes the median of the components of x. From this quantity, we can derive an empirical estimate of the standard deviation by multiplying the MAD by a factor equal to 1.4826. The latter estimate is known to be more robust to outliers for Gaussian distributed data, especially in the case of small populations. The results are summarized in Table 4. It can be observed that the empirical standard deviation is below 1.6% and its robust estimate is below 1.1% for all four datasets. These deviation values are reasonable considering the size of the datasets and show that the presented results are relevant and consistent. Next, we also evaluated how the positivity constraint impacts the overall accuracy of our system. We trained a robust network by allowing the weights to have arbitrary signs. For this purpose, we control individually the Lipschitz constant of each layer i ∈ {1, ..., m} so that it is less than a given value ϑ^{1/m}. The exact projection onto C_i, P_{C_i}, as well as the approximate one, P̃_{C_i}, were computed as described previously. In this case, ϑ represents an upper bound on the Lipschitz constant of the system. Table 5 summarizes the results for different values of ϑ, for two datasets. We compare our method for handling Lipschitz constraints with the approach proposed in Reference [51]. This approach, which is implemented in the deel-lip library, allows the user to train robust networks in a convenient manner, offering a robustness certificate by performing a spectral normalization of each layer. It can be observed on these datasets that our method yields similar results when using the approximate projection, but better ones when using the exact projection. These results underline again the importance of carefully managing the projections and their effect on the accuracy of the system.
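The robust deviation estimate used above can be computed in a few lines (the accuracy values below are toy data; the factor 1.4826 matches the text):

```python
import numpy as np

def robust_std(x):
    """Robust estimate of the standard deviation via the median absolute
    deviation (MAD): 1.4826 * median(|x_k - median(x)|)."""
    x = np.asarray(x, dtype=float)
    return 1.4826 * np.median(np.abs(x - np.median(x)))

acc = np.array([92.1, 92.5, 92.3, 92.4, 99.0])  # toy accuracies with one outlier
print(robust_std(acc))   # ≈ 0.148, whereas np.std(acc) ≈ 2.67 is inflated by the outlier
```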

ROBUSTNESS VALIDATION
In this section, we investigate to what extent the theoretical concepts described in the previous sections help improve the robustness of the classifier in different settings. To this end, we consider the following three scenarios. In the first one, we examine the impact of adversarial attacks on the performance of the classifier. The second scenario takes into account the effect of noise in the acquisition process; in the case of sEMG signals, this noise may come from imperfect skin-sensor contact caused by hairs or drops of sweat. In the last scenario, we perform a real-life experiment involving 10 able-bodied volunteers.

Sensitivity to Adversarial Attacks
We evaluate our robust model on purposely designed perturbations by studying their influence on the overall performance of the system. We lead attacks on our best robust model in terms of accuracy and robustness, achieving 92.95% accuracy and a Lipschitz constant ϑ = 0.87 on the 7-gesture dataset. We compare the results with two conventionally trained models: the best one in terms of performance, which achieves 99.78% prediction accuracy on non-adversarial data, and another one trained to have a performance similar to our robust model, reaching 92.99% accuracy on the original test set.
To create the adversarial samples, we used some of the most popular attackers, namely:
- Fast Gradient Sign Method (FGSM) [27]: generates adversarial data based on the gradient of the cost function with respect to the input data;
- Jacobian Saliency Map Attacker (JSMA) [44]: computes a perturbation based on the ℓ2 distance metric by iteratively selecting the input samples that will increase the chance of misclassification;
- Projected Gradient Descent (PGD) [39]: uses local first-order information about the network to create adversarial examples;
- Carlini and Wagner (C&W) [8]: uses the ℓ2 distance to compute the optimal adversarial perturbation;
- Gradient Matching (GM) [26]: a data-poisoning black-box attack; the attacker does not have access to the victim model parameters, but instead tries to match the gradient direction of adversarial examples.
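For intuition, FGSM can be sketched in a few lines on a toy logistic model (the model and all values below are illustrative, not the actual AGR classifier):

```python
import numpy as np

def fgsm(x, grad_x, eps):
    """Fast Gradient Sign Method: perturb the input in the direction of the
    sign of the loss gradient with respect to the input."""
    return x + eps * np.sign(grad_x)

# Toy logistic model: loss(x) = -log sigmoid(y * w.x); grad_x = -y * (1 - p) * w
rng = np.random.default_rng(2)
w = rng.normal(size=8)
x = rng.normal(size=8)
y = 1.0

def loss(z):
    return -np.log(1.0 / (1.0 + np.exp(-y * (w @ z))))

p = 1.0 / (1.0 + np.exp(-y * (w @ x)))
grad_x = -y * (1.0 - p) * w
x_adv = fgsm(x, grad_x, eps=0.1)
assert np.max(np.abs(x_adv - x)) <= 0.1 + 1e-12   # L_inf budget respected
assert loss(x_adv) > loss(x)                      # the attack increases the loss
```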
We also show a comparison with another popular technique for ensuring the robustness of neural network-based models, namely adversarial training. It consists in training on an extended version of the dataset, containing the original training data together with a perturbed version of the samples, in an effort to increase the system stability against adversarial inputs. Note that this method is purely empirical and gives no theoretical robustness certificate. We implemented the adversarial training strategy detailed in Reference [39], training the system using an augmented version of the dataset, which was updated every 25 epochs. The adversarial samples were created using the PGD attack, and the model was then validated on data containing perturbations computed with various attacks.
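The PGD perturbations used for augmentation can be sketched as follows (a generic L_inf-ball variant with illustrative parameters; the exact settings of Reference [39] may differ):

```python
import numpy as np

def pgd_attack(x, grad_fn, eps=0.1, alpha=0.02, steps=10):
    """Projected gradient descent in the L_inf ball of radius eps around x:
    repeated signed-gradient steps, each followed by a projection onto the ball."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))
        x_adv = np.clip(x_adv, x - eps, x + eps)   # projection onto the L_inf ball
    return x_adv

# Toy linear loss: loss = w.x, so the input gradient is w everywhere.
w = np.array([1.0, -2.0, 0.5])
x = np.zeros(3)
x_adv = pgd_attack(x, lambda z: w)
assert np.max(np.abs(x_adv - x)) <= 0.1 + 1e-12   # perturbation stays in the budget
```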
The results summarized in Table 6 show the performance obtained on the 7-gesture test set. The first four lines of the table correspond to white-box attacks, whereas the last line shows a black-box attack. We consider our best constrained model, having a Lipschitz constant θ = 0.97, and two models trained conventionally: the best baseline and another one having a performance similar to the constrained one. The last columns feature an adversarially trained model using PGD-generated perturbations. Note that the robust model performance is barely affected by the adversarial perturbations, whereas the baseline models show a huge drop in accuracy. It can be observed that adversarial training helps to increase the robustness, but our method of controlling the Lipschitz constant of the network provides better results when facing data perturbed with attackers other than PGD. As expected, the poisoning attack is less effective than the white-box ones against the baseline models, but our robust model still showcases a better performance. This shows that our method is more versatile, since its performance remains stable whatever the attacker.

Noisy Input Behaviour
To simulate the effect of the underlying noise generated during the acquisition process, we added synthetic noise directly to the raw sEMG data, prior to the feature extraction step. The noise is chosen independent and identically distributed according to a Gaussian mixture law (1 − p)N(0, σ₀²) + pN(0, σ₁²). The mixture comprises a background component, corresponding to the intrinsic electronic noise in the armband, such as thermal or quantization noise, and an impulsive component accounting for outliers; those may be related to imperfect wiring that can generate impulse-like artifacts. In our experiments, we consider background and impulse noises with standard deviations σ₀ = α and σ₁ = 10α, with α ∈ [0, +∞[. We generate different levels of noise by varying the parameter α. The probability of peaks p ∈ [0, 1] is also adjusted to simulate more or less severe scenarios in terms of outliers.
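This noise model can be sampled as follows (a direct sketch of the mixture described above; the parameter values are illustrative):

```python
import numpy as np

def mixture_noise(n, alpha, p, rng):
    """Draw i.i.d. samples from (1 - p) N(0, alpha^2) + p N(0, (10 alpha)^2):
    a background component plus an impulsive component modelling outliers."""
    impulsive = rng.random(n) < p                    # which samples are outliers
    sigma = np.where(impulsive, 10.0 * alpha, alpha)
    return rng.normal(0.0, 1.0, n) * sigma

rng = np.random.default_rng(0)
noise = mixture_noise(100_000, alpha=0.5, p=0.2, rng=rng)
# Theoretical std: sqrt((1 - p) alpha^2 + p (10 alpha)^2) ≈ 2.28 for these values
```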
From the resulting noisy signals, we extract the features described in Section 4 and pass them to the classifier, using our robust models reaching an accuracy of 92.95% (ϑ = 0.87) on the 7-gesture dataset and 93.05% (ϑ = 0.98) on the 13-gesture dataset, trained with non-altered data. We compared the results achieved with our robust training with those obtained with (i) classical training and (ii) adversarial training. In this case, the adversarial training was performed by generating an extended dataset, containing the original data and versions of them corrupted by additive noise following the Gaussian mixture law described above, where the parameters p and α were drawn uniformly at random in [0.15, 0.45] and [0, 2], respectively. In the absence of noise, a similar performance in terms of accuracy was obtained: 92.99% and 92.97% on the 7-gesture dataset, and 93.03% and 92.98% on the 13-gesture dataset, for the baseline and the adversarial training, respectively. The experimental results obtained on the two datasets are depicted in Figure 5. The red, blue, and green lines correspond to the unconstrained, constrained, and adversarial models, respectively. We observe that the constrained model is significantly less affected by the presence of noise in the inputs than the one trained without robustness guarantees. It is also worth noting that training with adversarial inputs leads to satisfactory results as well, although usually slightly less accurate. The Lipschitz lower and upper bounds computed for the networks trained in an adversarial manner are indeed much lower than those obtained with standard training, but they remain quite large: (1845.23, 79534.2) for the 7-gesture dataset and (1754.74, 64595.8) for the 13-gesture dataset. This experiment emphasizes that controlling the Lipschitz constant of a network improves its robustness not only against targeted adversarial attacks, as shown previously, but also in the case of black-box attacks, where no prior information about the model is used.

Real-Life Scenario Validation
To illustrate the practical applicability of our findings, we validate our model in a real-life context. For this purpose, we designed an experiment comparing a conventionally trained model with the constrained one. We integrated both models in a real-time application that controls a 3D hand on a screen, as well as in a game that can be controlled by gestures, to give the user tangible feedback. We used the Unity platform to design and control a 3D hand, and then encapsulated our models in an application which performed real-time inference, the hand moving on the screen in accordance with the predicted gesture. We asked 10 volunteers (males and females) to test both models by performing each gesture 20 times. We emphasize that the users had no prior knowledge about which model was implemented, since it was randomly selected at the beginning of each new trial. Pictures of the experimental setup are provided in Figure 6. Table 7 details, at the user level, how many of the 20 trials were erroneously classified; U and C denote the Unconstrained and the Constrained models, respectively. Note that, despite obtaining very good results on the test set, the unconstrained model loses a lot in terms of performance (up to 15%) when facing real-life data. We can observe that training a positive neural network subject to Lipschitz constraints improves the overall robustness of the classifier against adversarial perturbations, not only from a theoretical viewpoint, but also in practice, by leading to more reliable systems with a greater generalization power.
As for the other application, we asked the volunteers to play two rounds of a gesture-controlled game, one with each model. The game was inspired by the famous Temple Run and consists of a moving cube which the user controls via gestures. The player can move their hand left or right to move the character to either side of the screen to avoid obstacles. The player can also move the hand up to jump, or spread their fingers to shoot and clear the obstacles ahead. The game is over when the player fails to take a turn or to jump over/clear an obstacle. We observed that 70% of the users obtained higher scores when using the constrained model, showing again that our solution is more stable when it comes to real-life applications.

Limitations
Increased training time is one of the main limitations of our proposed approach. Indeed, to compute the true projection, the proposed method uses an iterative algorithm which performs a singular value decomposition at each iteration, a resource-consuming operation, especially when performed on large matrices. We propose several lower-complexity solutions, which offer a good tradeoff between training time, robustness, and performance. Table 8 shows the training time for all the proposed constraint algorithms. The time is measured per step, where a step corresponds to a batch of 2048 examples. Nevertheless, it is worth noting that the additional time overhead applies only during the training phase; inference time is the same for all the models, around 7 ms per step.
Another limitation is related to the fact that our method for controlling the Lipschitz constant of the system is currently applicable in the context of nonnegative-weighted fully connected feedforward neural networks.Although the performance remains good for the considered AGR systems, the nonnegativity constraint might lead to a loss of expressivity of the neural networks in other inference tasks.In a future work, we plan to extend our method towards more general neural network architectures, including convolutional layers, skip connections, and so on.

CONCLUSION
This work has shown the usefulness of designing robust feed-forward neural networks for AGR based on sEMG physiological signals.More precisely, we proposed to finely control the Lipschitz constant of these nonlinear systems by considering positively weighted neural architectures.To offer robustness certificates, we also developed new optimization techniques for training classifiers subject to spectral norm constraints on the weights.We studied various constrained formulations and showed that robustness can be secured without sacrificing accuracy when using a combination of tight constraints and exact projections.We also provide several lower-complexity solutions, which reduce the training time significantly.
Experiments on four distinct datasets illustrated the good performance of our approach.We further demonstrated the effectiveness of our robust classifier, compared with classically trained ones, when facing white-box and black-box attacks.
We also want to highlight that one of the key advantages of our research was the ability to conduct real-life experiments.This was made possible because we had access to a specialized acquisition module tailored for capturing gesture data.The availability of this acquisition module allowed us to gather real-world gesture data in a controlled setting, which closely mimics practical scenarios.By conducting experiments with real users and their gestures, we could thoroughly evaluate the performance and accuracy of our proposed methodology.This real-life experimentation not only provided us with invaluable insights into the effectiveness of our approach, but also demonstrated its feasibility and potential for implementation in real-world applications.
In future works, it would be interesting to apply such a robust training procedure to other applications in pattern recognition involving data acquired in real-time.

APPENDIX A ACCELERATED DFB ALGORITHM
Let n ∈ N \ {0} and i ∈ {1, ..., m}. Computing the projection of a matrix W_i ∈ R^{N_i × N_{i−1}} onto D_i ∩ C_{i,n} is equivalent to solving the following matrix optimization problem:

minimize_Y  ι_{D_i}(Y) + ι_{C_{i,n}}(Y) + (1/2) ‖Y − W_i‖²_F,

where ‖·‖_F is the Frobenius norm and ι_S denotes the indicator function of a set S (this function is equal to 0 on this set and +∞ otherwise). The dual optimization problem associated with this strongly convex minimization problem reads

minimize_Y  f*(−A_{i,n} Y B_{i,n}) + ι*_{B(0,ϑ)}(Y),

where, for a given function д, д* denotes its Fenchel-Legendre conjugate. In our case, f = ι_{D_i} + (1/2) ‖· − W_i‖²_F. From standard conjugation rules [33], f* is equal to

f*(Z) = M(Z + W_i) − (1/2) ‖W_i‖²_F,

where M is the Moreau envelope of ι*_{D_i}, given by

M(V) = inf_U { ι*_{D_i}(U) + (1/2) ‖V − U‖²_F }.

The Moreau envelope of a proper lower-semicontinuous convex function is differentiable. Thus f* is differentiable and its gradient is [6, Example 17.33]

∇f*(Z) = P_{D_i}(Z + W_i).

We deduce that the gradient of Y ↦ f*(−A_{i,n} Y B_{i,n}) is

−A_{i,n}ᵀ P_{D_i}(W_i − A_{i,n} Y B_{i,n}) B_{i,n}ᵀ.

Since P_{D_i} is a nonexpansive operator, the latter function has a Lipschitz-continuous gradient with constant β = ‖A_{i,n}‖²_S ‖B_{i,n}‖²_S. Solving the dual problem with a proximal algorithm requires the proximity operator of γ ι*_{B(0,ϑ)} for some scaling parameter γ ∈ ]0, +∞[. By using Moreau's formula [6], this proximity operator is expressed as

prox_{γ ι*_{B(0,ϑ)}}(V) = V − γ P_{B(0,ϑ)}(V/γ).

A classical solution for solving the dual problem consists in using the standard forward-backward algorithm [15, 20]; this leads to Algorithm 3 [17]. Another solution consists in using the FISTA-like algorithm in [10], which leads to the accelerated version in Algorithm 2. The sequence (Y_ℓ)_{ℓ∈N} generated by either of these algorithms is guaranteed to converge to a solution Ȳ of the dual problem.
In addition, from the Kuhn-Tucker conditions, the solution to the primal problem, W̄_i = P_{S_{i,n}}(W_i), is equal to ∇f*(−A_{i,n} Ȳ B_{i,n}). It follows from (34) and the continuity of P_{D_i} that the sequence (V_ℓ)_{ℓ∈N} converges to W̄_i.
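Under the simplifying assumption A_{i,n} = B_{i,n} = Id, the resulting (non-accelerated) dual forward-backward iteration can be sketched in NumPy as follows. This is our own illustrative implementation, in the spirit of Algorithm 3, not the paper's code; the primal iterate V = P_D(W − Y) converges to the sought projection:

```python
import numpy as np

def project_spectral(M, theta):
    # Exact projection onto the spectral-norm ball B(0, theta): clip singular values.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.minimum(s, theta)) @ Vt

def project_nonneg_spectral(W, theta, gamma=1.0, n_iter=20_000):
    """Dual forward-backward sketch for the projection onto
    {nonnegative matrices} ∩ {M : ||M||_S <= theta}, with A = B = Id.
    gamma must lie in ]0, 2/beta[ with beta = 1 here."""
    Y = np.zeros_like(W)
    for _ in range(n_iter):
        V = np.maximum(W - Y, 0.0)            # primal recovery V = P_D(W - Y)
        Z = Y + gamma * V                     # forward (gradient) step on the dual
        Y = Z - gamma * project_spectral(Z / gamma, theta)  # prox of gamma * iota*_B
    return np.maximum(W - Y, 0.0)

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 4))
V = project_nonneg_spectral(W, theta=1.0)
assert V.min() >= 0.0                         # V lies in D by construction
assert np.linalg.norm(V, 2) <= 1.0 + 0.05     # approximately in the spectral ball
```

When W is already feasible (nonnegative with spectral norm below theta), the iteration returns W unchanged, as a projection should.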

Fig. 1 .
Fig. 1.Representation of a NN as a composition of operators.

Fig. 3 .
Fig. 3. Proposed neural network architecture for AGR. All the layers except the last one use ReLU activation functions; the last layer uses Softmax. The number of neurons per layer is: 128, 128, 128, 64, 32, 16 for the 7-gesture dataset; 256, 256, 256, 128, 64, 32 for the 13-gesture dataset; and 512, 512, 256, 128, 64 for the 24-gesture and 53-gesture datasets. The last layer has 7, 13, 24, or 53 neurons, corresponding to the number of gestures to be recognized. Each EMG box represents a column vector containing 8 time-descriptors.

Fig. 4 .
Fig. 4. Accuracy vs. Iterations-constrained and unconstrained models in the context of 7-gesture dataset.The training and validation curves are displayed in green and yellow, respectively, for the unconstrained model.The training and validation curves are displayed in blue and red, respectively, in the case of constrained training, with the bound ϑ = 0.95.

β = ‖A_{i,n}‖²_S ‖B_{i,n}‖²_S.
The dual problem (31) thus corresponds to the minimization of the sum of a smooth convex function and a proper lower-semicontinuous one. Consequently, it can be minimized by a proximal algorithm. Such a strategy requires calculating the proximity operator of γ ι*_{B(0,ϑ)}.

Table 1 .
Comparison to Other sEMG-Based AGR Systems

Table 2 .
Performance and Robustness Results for 7-Gestures Dataset Baseline Models

Table 4 .
Standard Deviation of Accuracy Computed on 15 Epochs, After Convergence, on the Test Set for Constrained Models

Table 6 .
Adversarial Attack Results

Table 7 .
Real-Scenario Experiment Results

Table 8 .
Training Time for Different Constraints in the Case of an m = 6-layer Network. Constraints compared: None, P_{C_i∩D_i}, P̃_{C_i∩D_i}, P_{Č_{i,n}∩D_i}, P̃_{Č_{i,n}∩D_i}, P_{C_{i,n}∩D_i}, P̃_{C_{i,n}∩D_i}