TongueTap: Multimodal Tongue Gesture Recognition with Head-Worn Devices

Mouth-based interfaces are a promising new approach enabling silent, hands-free and eyes-free interaction with wearable devices. However, interfaces sensing mouth movements are traditionally custom-designed and placed near or within the mouth. TongueTap synchronizes multimodal EEG, PPG, IMU, eye tracking and head tracking data from two commercial headsets to facilitate tongue gesture recognition using only off-the-shelf devices on the upper face. We classified eight closed-mouth tongue gestures with 94% accuracy, offering an invisible and inaudible method for discreet control of head-worn devices. Moreover, we found that the IMU alone differentiates eight gestures with 80% accuracy and a subset of four gestures with 92% accuracy. We built a dataset of 48,000 gesture trials across 16 participants, allowing TongueTap to perform user-independent classification. Our findings suggest tongue gestures can be a viable interaction technique for VR/AR headsets and earables without requiring novel hardware.


INTRODUCTION
Head-worn devices are increasingly ubiquitous in our lives due to the growing usage of headphones and virtual or augmented reality (VR/AR) headsets. With the rising prevalence of such devices, new interaction methods have become necessary to control the devices without requiring an external controller. These interactions commonly rely on the hands, such as pressing a button on the headphones or hand-based gesture control for augmented reality headsets. Hands-free interaction methods such as speech recognition and eye tracking provide an alternative for use cases where the user's hands may be permanently or situationally impaired. A wide range of neuromotor disorders including Amyotrophic Lateral Sclerosis (ALS), muscular dystrophy and stroke greatly reduce the ability to move the hand voluntarily. Meanwhile, head-worn devices are used in settings such as warehouses [50], manufacturing [8] and surgeries [21] where users' hands are occupied and cannot be used for interactions.
However, speech recognition, the most common hands-free interaction method, is unusable when the environment is noisy or privacy is necessary. Gaze tracking requires continuous attention to sustain the interaction, making it difficult to control and distracting from other tasks a user may be performing. Gaze and dwell, the most common approach for gaze-based interaction, is slow and has a high error rate, especially for novices [14,39]. As a result, both speech recognition and gaze tracking are inaccessible to a wide range of users, particularly when their interactions need to be discreet and ephemeral.
Mouth-based interaction methods are hands-free, voice-free and eyes-free, offering a deeply enabling approach to interacting with head-worn devices. Past work on mouth-based interaction methods has involved custom hardware placed around the jaw [44] and neck [54] or within the mouth [16,53]. For mouth-based interaction to be used in everyday devices, both the device and the interaction must be discreet, necessitating sensors that can be embedded in existing form factors. While recent studies have investigated sensors around the ear [5] and eyes [66], these studies have focused on a single sensing modality with custom hardware, limiting the reproducibility and accessibility of research on mouth-based interaction. Moreover, due to the emphasis on silent speech commands [12] for these interfaces, there has been a bias towards modalities that can capture the multi-organ movement, from the lips to the larynx, involved in vocal articulation. As a result, there is a gap in multimodality and a lack of mouth interactions that are invisible for daily use.
We created a tongue gesture interface, TongueTap, by combining sensors in two commercial off-the-shelf headsets. Using this interface, we demonstrate that even sensors far from the mouth can recognize tongue movement. Compared to silent speech interfaces, interfaces using tongue gestures minimally engage the lips and jaws, and can be performed with the mouth closed, generating limited visual movement. Closed-mouth tongue gestures allow privacy during ephemeral interactions such as increasing volume in earables or closing a notification in augmented reality (AR). The availability of the two devices we used in this study may make it easier for researchers to reproduce our results and experiment with their own mouth-based interaction methods using the same devices. We evaluate the performance of eight different gestures and six sensing modalities via data gathered simultaneously for comparability and make the dataset publicly available.
We evaluated TongueTap in a series of offline tests comparing accuracy across different gestures, sensing modalities, data amounts and moving window sizes across sixteen participants spread over two study locations. We compared our eight selected gestures to two controls, blinking and sticking out the tongue, as a comparison for gaze and facial interaction researchers. We developed a pipeline for real-time user-independent classification of tongue gestures, demonstrating it in different desktop applications. We also collected informal qualitative feedback and NASA Task Load Index (TLX) [22] questionnaires for each gesture.
The key contributions of this paper are:
(1) Tongue Gesture Recognition using commercial head-worn devices, increasing the accessibility and reproducibility of research on mouth-based interaction methods. To the best of our knowledge, this is the first tongue interface designed for off-the-shelf use. To facilitate such use, we've made our data open-access at https://zenodo.org/record/8247217.
(2) Gesture Recognition Experiments using eight closed-mouth tongue gestures and two baseline conditions. We report our recognition accuracy on user-dependent and user-independent models, and present our findings for the ideal window sizes, gesture subsets, effects of pre-training, and a NASA-TLX.
(3) Sensing Modality Experiments that reveal the most descriptive sensors and sensor groups (Table 1). Our findings, such as the Inertial Measurement Unit (IMU) alone reaching 80% accuracy, are useful for the design of future head-worn tongue gesture platforms.

RELATED WORK
Wearable hands-free interaction approaches have diversified significantly over the years with advanced speech recognition and techniques such as eye tracking [3], facial gestures [41,60] and brain-computer interfaces (BCIs) [63]. For developing TongueTap, we primarily drew from past research in gaze and teeth interactions, facial muscle sensing in earables, and the glossokinetic potential, an electrophysiological motion artifact caused by the tongue.

Hands-free Interaction
Speech recognition and voice commands have an extensive history in facilitating hands-free interaction with head-worn devices [15] and have improved alongside speech recognition technologies. The limitations of speech recognition in different settings have led to a need for hands-free and voice-free interaction approaches. Eye tracking emerged quickly as one such approach [3]. Gaze and dwell, an interaction method relying on fixating the eye on a single point, has remained the most common method of gaze-based interaction [62]. While expert users of the method can reach 300ms with adjustable dwell times [39], dwell-based gaze interactions have suffered from high error rates and cognitive load with limited speed [14]. Eye gestures have been used at up to 250ms in head-worn displays [7,13]. However, occupying the eyes with gestures is often undesirable as it draws too much of the user's attention.
BCIs have attempted to control devices without requiring movement [63]. Steady state visually evoked potentials (SSVEP), elicited by showing rhythmic visual stimuli to the user, have been used with VR headsets to classify visual targets [30,40]. While useful for paralyzed users, SSVEP has the same problems as eye tracking because it demands constant visual attention. BCIs have also been effective with movement, and Bleichner et al. have shown attempted mouth movements to be decodable even with paralyzed users, providing support for the viability of mouth-based interactions [6].

Mouth-based Interaction
The mouth has been a target of physiological sensing for various research aims. Human activity recognition researchers have focused on detecting daily activities such as chewing, drinking and speaking [4]. Facial and mouth expressions have been sensed for use in virtual reality and teleconferencing [9,35,36]. Much interest in mouth-based sensing and interaction has focused on silent speech, an interaction method enabling speech communication when an audible signal cannot be used [12]. Silent speech interfaces have been targeted as a strategic interaction method for enabling fast, hands-free communication using sensors within and around the mouth. These have allowed communication with head-worn displays [5], interactions with voice assistants [26,29] and text entry [28] by developing recognition models with large vocabularies for sensors around or inside the mouth.
Some silent speech interfaces have relied on non-contact approaches using lip reading [49,58], infrared imaging [65] and acoustics [17]. While these are useful for interacting with mobile devices such as smartphones, head-worn displays and earables already have contact points where sensors can detect mouth movements, allowing greater flexibility in sensing approaches while keeping sensors invisible. The potential uses have resulted in a push towards nonintrusive silent speech for interacting with head-worn displays, via infrared camera in HMDSpeller [1] and via acoustics in EchoSpeech [66]. Particularly exciting about EchoSpeech is the discreet form factor of the sensors, showing that they could be integrated into future head-worn displays without changing device shape.
After Sahni et al. introduced a silent speech interface detecting ear canal deformation, there has been interest in making silent speech interactions for earables [52]. EarCommand achieved 32-word silent speech recognition with earphones and MuteIt characterized silent speech recognition using the jaw as a secondary articulator [23,27,56]. Röddiger et al. note that mouth-based interactions with earables have successfully detected gestures from the jaw, teeth and tongue with surprising accuracy [51]. Such earables have made use of muscles that connect the region around the mouth, including the tongue, to the styloid process near the ear. The styloglossal muscle has made tongue sensing possible through sensors in the upper face, which TongueTap also makes use of, for mouth gestures rather than silent speech.
Mouth gestures differ from silent speech commands by allowing a wider range of more ephemeral and short-term interactions for daily, quick usage. Many mouth gesture interfaces have involved teeth clicks [59,61] and jaw clenching [27], but we attempted to minimize jaw and teeth movement as such gestures are audible and visible to an observer. Mouth gestures can be more discreet than silent speech by keeping input intraoral [16]. They can make mobile input easy for wearable devices without necessitating silent speech commands [42]. Chen et al. mapped the space of mouth gesture design in more detail, finding that users prefer short and direct gestures while avoiding natural motions like smiling [10]. Chen et al. further note that mouth gestures can provide haptic feedback for themselves through various surfaces around the mouth, making closed-mouth tongue gestures an appealing intraoral interaction method.

Tongue Interfaces
Many tongue interfaces have used intrusive methods that require a retainer or magnet inside the mouth [37,43,52,53]. While this approach provides reliable signals, as demonstrated by SilentSpeller's 1164-word vocabulary, it comes at the cost of making users uncomfortable and limiting interaction duration [28].
Non-intrusive approaches have tried to replace such tongue interfaces using electromyography (EMG) signals from around the cheeks, neck and jaw [44,54,64] or pressure sensors on the cheek [11]. Such interfaces still occupy the lower face, making tongue interactions very inconvenient for daily use. Instead, non-contact methods have used cameras [38,48] and Doppler radar [20]. These methods require the tongue gestures to be detected through external movements, making them less viable for discreet, closed-mouth tongue gestures. A tongue interface that stands out among these is TYTH, which only uses electroencephalography (EEG) and EMG sensors around the ear to detect tongue gestures [47]. TYTH uses the hypoglossal cranial nerve and the styloglossus and hyoglossus muscles, the same muscles that enable earable silent speech interfaces and a primary target for TongueTap. However, TYTH requires a custom headset and was still highly visible to observers due to the gestures chosen.
Some tongue interfaces have made use of the glossokinetic potential, an electrophysiological motion artifact caused by tongue movement that is commonly observed in EEG studies. Nam et al. have explored the glossokinetic potential for their tongue gesture interfaces controlling robots and electric wheelchairs [45,46]. Kaeseler et al. have investigated the glossokinetic potential as a movement-based brain-computer interface, achieving the discreetness of brain-computer interfaces with a much more reliable movement-related potential than typically possible [25,34]. This was the second signal targeted by TongueTap in addition to the styloglossus muscle outlined in the previous section, although it showed an underwhelming result compared to the IMU. By including all of these modalities in a single study, we hope to provide a more comprehensive comparison of the different sensing modalities for tongue gestures.

DESIGN

Hardware Selection
We primarily selected hardware to include a range of sensors, with an emphasis on motion and electrophysiology based on past performance of IMUs in earables [41,56] and EEG/EMG in tongue gestures [26,47,54]. While IMUs are available in some earables and in the correct position for tongue sensing, no commercial headphones contain EEG/EMG sensors at the time of writing. Instead, we sought a VR/AR headset containing all such sensors, but the location of IMUs and the lack of reliable EEG or EMG sensors in commercial VR/AR headsets made this difficult. We thought the HP Reverb G2 Omnicept Edition (OE), the VR headset with the widest range of sensors among the headsets we looked at, would be sufficient for our goals as its documentation mentioned facial EMG, yet these sensors were not included in the headset. We combined the Reverb G2 OE with an EEG headset such that wearing both at the same time wouldn't be too uncomfortable for the study duration. We note that despite using a VR headset and EEG headset, we believe the most meaningful use cases of tongue gestures are for earables and AR. The sensor placement of the headsets we selected is equivalent to what it could be in earables and AR headsets.
The hardware for TongueTap consists of an HP Reverb G2 OE VR headset [2] and a Muse 2 EEG headband [33]. The sensors contained by these devices are described in more detail in Table 1. Notably, both headsets contained IMUs and photoplethysmography (PPG) sensors. We excluded the calculated measures of the Reverb G2 OE as their frequency was too low, with the heart rate and variability at 0.2Hz and cognitive load at 1Hz. Moreover, we excluded the mouth camera, originally one of the most promising sensors, due to challenges with the Omnicept software used for data collection making it impossible to obtain the images. As we later elaborate in Section 8.1, the Muse 2 EEG headband may have limited our EEG results due to the five dry electrodes being on the forehead and noisier than gel electrodes.
The two headsets can be fitted to a user by extending, then contracting the Muse 2 on the user's forehead and repeating the same process with the Reverb G2, finalized by tightening the head strap to the top of the user's head. The combined hardware puts the Muse 2's forehead sensors slightly above the top of the Reverb G2's face gasket, as shown in Figure 1b.

Gesture Design
In selecting gestures, we made sure that all of the gestures could be performed with the mouth closed so that there were neither auditory nor visual cues to a third-party observer. As Chen et al. have already conducted a gesture elicitation study for mouth gestures, we relied on their findings in choosing our gestures [10]. However, we deviated from their "best" gestures as we also sought to have a spatial mapping of the gestures around the mouth while ensuring they would be easy to recognize by machine learning models [18]. For example, we sought to have a gesture pointing up, which became curling the tongue above and backward, and another pointing left and right, which was performed as a tap on the left and right cheeks. The eight gestures selected are shown in Figure 2. Notably, only three of the gestures require any jaw movement while others only engage the tongue. All the gestures are silent, contained within the mouth and use the teeth, cheeks and palate for haptic feedback.
We had a total of 10 gestures for our study. In addition to the eight tongue gestures described in Figure 2, we selected two control gestures, "Blink" and "Stick Out", to benchmark our performance. The "Blink" serves as a point of comparison for gaze tracking and BCI researchers while helping verify signal quality and timestamping by using the high-amplitude EEG signals and eye tracking measurements generated during the gesture. Meanwhile, the "Stick Out" gesture is an open-mouth gesture where the tongue is stuck out to make usage obvious because the eight closed-mouth gestures were sometimes too discreet to be noticed by the experimenters. The "Stick Out" gesture is also comparable to lip-based gestures such as those used in LipIO [24] as the tongue and jaw motion are similar.

IMPLEMENTATION

Data Collection Software
The data from the Muse 2 and Reverb G2 OE devices was synchronized using the Lab Streaming Layer (LSL) [31], a system for time synchronization commonly used for multimodal brain-computer interfaces. LSL allows both real-time streaming and recording of streamed data to an Extensible Data Format (XDF) file using its own Lab Recorder software. For the Muse 2, we used BlueMuse [32], an open-source tool for streaming LSL data from Muse. For the Reverb G2 OE, we created a custom data streaming tool in the Unity game engine built on HP's Omnicept software and the C# bindings for LSL. Outside the Omnicept software, the Reverb G2 also provides the
During data collection, the user can press the "A" button on a Windows Mixed Reality controller to start a gesture and release it to stop the gesture, continuing to the next one. As gestures often take variable duration to complete, this allows more accurate boundaries to the gesture while also measuring the duration. If the user believes they made a mistake, they can instead press the "B" button to delete the previous gesture and redo it. The "Press", "Release" and "Delete" signals from these controller activities are also synchronized over LSL. All data is either stored in an XDF file using the LSL Lab Recorder for offline recognition or streamed directly to a Python script processing moving windows from the data stream. The full data flowchart is shown in Figure 3.
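To illustrate how the "Press", "Release" and "Delete" markers can be turned into gesture boundaries, the following is a minimal sketch rather than the study's actual code. It assumes the markers have already been read from the recording as time-sorted `(timestamp, marker)` tuples; the function name `segment_gestures` and this tuple layout are hypothetical.

```python
def segment_gestures(events):
    """Convert controller markers into (start, end) gesture intervals.

    `events` is assumed to be a time-sorted iterable of (timestamp_s, marker)
    tuples, where marker is "Press", "Release" or "Delete".
    """
    gestures = []
    press_time = None
    for timestamp, marker in events:
        if marker == "Press":
            # Button down marks the start of a gesture.
            press_time = timestamp
        elif marker == "Release" and press_time is not None:
            # Button up closes the gesture that is currently open.
            gestures.append((press_time, timestamp))
            press_time = None
        elif marker == "Delete" and gestures:
            # "Delete" discards the previous gesture so it can be redone.
            gestures.pop()
    return gestures


events = [
    (0.0, "Press"), (0.5, "Release"),
    (1.0, "Press"), (1.4, "Release"),
    (2.0, "Delete"),                    # the second gesture was a mistake
    (3.0, "Press"), (3.6, "Release"),
]
intervals = segment_gestures(events)
```

The difference between the two values of each interval gives the gesture duration mentioned above.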

Gesture Recognition Approach
Our pre-processing pipeline used a 128Hz low-pass filter using SciPy and Independent Component Analysis (ICA) on the EEG signals while applying Principal Component Analysis (PCA) to the other sensors, each sensor separately from the others. ICA and PCA components were equal to the number of channels or axes for each sensor, for example, five components for EEG and six for IMU. The accelerometer values from the IMUs had gravity subtracted onboard the devices, so no additional pre-processing was performed for them. Then, we extracted 400ms windows from each gesture using MNE, beginning 100ms before the button press and ending 300ms after. Our gesture recognition models were not capable of handling invalid or raw time series data, so we removed chunks of the time series where any sensor was invalid, flattened the data into a single vector for every gesture and concatenated the sensors. We note that a model meant for time series may not require flattening and may have better accuracy, although the varying frequencies of the different sensors make applying such a model to the data challenging.
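The windowing and flattening step can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the MNE-based pipeline used in the study: the helper names, the dictionary layout of the streams, and the 50Hz and 10Hz sampling rates in the usage lines are all hypothetical.

```python
from bisect import bisect_left


def extract_window(timestamps, samples, press_time, pre=0.1, post=0.3):
    """Return the samples with press_time - pre <= t < press_time + post."""
    lo = bisect_left(timestamps, press_time - pre)
    hi = bisect_left(timestamps, press_time + post)
    return samples[lo:hi]


def flatten_features(streams, press_time):
    """Flatten each sensor's 400ms window and concatenate the sensors.

    `streams` maps a sensor name to (timestamps, samples); each sample is a
    list with one value per channel. Sensors may have different rates.
    """
    vector = []
    for name in sorted(streams):  # fixed order keeps feature positions stable
        timestamps, samples = streams[name]
        for sample in extract_window(timestamps, samples, press_time):
            vector.extend(sample)
    return vector


# Hypothetical streams: a 2-channel sensor at 50Hz and a 1-channel one at 10Hz.
imu = ([i / 50 for i in range(100)], [[i, -i] for i in range(100)])
ppg = ([i / 10 for i in range(20)], [[i] for i in range(20)])
features = flatten_features({"imu": imu, "ppg": ppg}, press_time=1.005)
```

Because each sensor contributes a number of values proportional to its own sampling rate, the flattened vector has a fixed length per gesture only when every window is complete, which is one reason invalid chunks have to be removed first.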
For gesture recognition, we designed a hierarchical model as shown in Figure 4. Our final model used a Support Vector Machine (SVM) in Scikit-Learn using a radial basis function (RBF) kernel with hyperparameters C=100 and gamma=1 to do binary classification and determine whether a moving window of data contained a gesture or if it was a non-gesture. If the model decided it was a gesture, the final classification was done by a multi-class Random Forest Classifier with hyperparameters: 40 max. depth, 2 min. samples per leaf, 800 estimators. Prior to reaching the hierarchical model, we experimented with Support Vector Machines, Random Forest Classifiers, Multi-Layer Perceptrons and Logistic Regression for the classifier. For dimensionality reduction, we tried PCA, ICA as well as Linear Discriminant Analysis (LDA). We found that the Random Forest Classifier always outperformed when doing multi-class classification yet the Support Vector Machine outperformed in binary classification, leading to the hierarchical approach for more optimally handling rest sequences. The dimensionality reduction differed for the sensing modalities, where ICA was more effective for EEG while other sensors were more successful using PCA. For tuning, as well as testing the accuracy of these models, we used 5-fold cross-validation while keeping a distinct testing set from 20% of the data. By doing so, we prevented overfitting on the testing set while tuning the models.
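A two-stage model of this kind could be sketched with Scikit-Learn roughly as below. The hyperparameters are the ones reported above; the class name, the `"rest"` label for non-gesture windows, and the way the two stages are wired together are assumptions made for illustration and may differ from the authors' implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

REST = "rest"  # hypothetical label for non-gesture windows


class HierarchicalGestureClassifier:
    """Stage 1: RBF SVM decides gesture vs. non-gesture.
    Stage 2: Random Forest names the gesture for detected windows."""

    def __init__(self, n_estimators=800, random_state=0):
        self.detector = SVC(kernel="rbf", C=100, gamma=1)
        self.classifier = RandomForestClassifier(
            n_estimators=n_estimators,
            max_depth=40,
            min_samples_leaf=2,
            random_state=random_state,
        )

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        is_gesture = y != REST
        self.detector.fit(X, is_gesture)                   # binary stage
        self.classifier.fit(X[is_gesture], y[is_gesture])  # multi-class stage
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        labels = np.full(len(X), REST, dtype=object)
        detected = self.detector.predict(X).astype(bool)
        if detected.any():
            # Only windows flagged as gestures reach the Random Forest.
            labels[detected] = self.classifier.predict(X[detected])
        return labels
```

Training the Random Forest only on gesture windows keeps the rest class out of the multi-class stage, so rest handling is left entirely to the binary detector.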
We used traditional machine learning methods instead of deep learning approaches as we were aiming for a classifier that could be executed reliably in less than 100ms. However, given the size of our dataset, deep learning methods could be plausible for recognizing tongue gestures. In our case, we didn't find it particularly necessary as we were already able to achieve a high enough accuracy in multi-class classification without leveraging deep neural networks.

DATA COLLECTION
The goal of our study was to create a large dataset of tongue gestures for evaluating tongue gesture recognition with sensors in off-the-shelf devices. Our study procedure was reviewed and approved by the Ethics Review Board at Microsoft prior to recruitment.

Participants
Participants were recruited at two locations (Redmond, WA, USA and Atlanta, GA, USA) through flyers around campus with a QR code, a mailing list for participants of past studies, and channels on Microsoft Teams. Participants were required to be 18-69 years in age, fluent in English and have normal vision, motor and cognitive abilities to be able to follow instructions and use the VR headset safely. After the study, participants were compensated $50 in the form of a gift card of their choice. The demographics for the 16 participants are shown in Table 2. Participants also had a diverse range of hair length, style and texture including braided and curly hair, ensuring signals could be obtained even for users for whom BCIs traditionally fail to work.

Tasks and Procedure
When participants arrived at the study, we described the procedures and obtained informed consent. After participants were introduced to the study, we fitted the Muse 2 and then the Reverb G2 onto the participant's head and verified that they were able to see the Unity experimental interface on their display. We confirmed EEG contact quality by ensuring all electrodes had a standard deviation below 20 microvolts and waited 1 minute for all the sensors and calculated measures to stabilize before starting data collection.
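The contact check described above amounts to a per-electrode threshold on signal spread. A minimal sketch follows, assuming the population standard deviation over a short buffer of samples in microvolts; the function name, channel names and buffer contents are hypothetical, and the study may have computed the statistic differently.

```python
from statistics import pstdev


def contact_ok(channels, threshold_uv=20.0):
    """Return True when every electrode's standard deviation is below the threshold.

    `channels` maps an electrode name to a list of recent samples in microvolts.
    """
    return all(pstdev(values) < threshold_uv for values in channels.values())


good = {"ch1": [0.0, 5.0, -5.0, 3.0], "ch2": [1.0, -1.0, 2.0, -2.0]}
noisy = {"ch1": [0.0, 5.0, -5.0, 3.0], "ch2": [0.0, 100.0, -100.0, 50.0]}
```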
The participants were then asked to do a practice round where they performed each gesture 5 times. The practice round served to help verify the signal quality, familiarize the participants with the press-and-release approach to recording and ensure that participants were doing the gestures correctly. As the gesture descriptions were not very clear and the gestures were difficult to demonstrate, this step served an important role in normalizing gesture movements across participants.
Afterward, participants started the main study for collecting the full dataset. The study consisted of 60 self-paced trials, separated into six batches of our 10 gestures. Participants performed the study fully in VR using the visual display in the Reverb G2, shown in Figure 5. At the start of a trial, participants were prompted which gesture they were to perform. During a trial, participants performed that gesture repeatedly, marking the start and end point of each gesture using the "A" button on the Windows Mixed Reality controller (i.e. button-down, button-up). Participants repeated the gesture 50 times in each trial while a visual counter incremented with each button press. Once they reached 50, the trial would end. This created a total of 3000 training examples per participant. In between batches, participants received a 10 second mandatory rest period to recoup attention. They were allowed to make other movements during rests, and we handled this "non-gesture" data as a null sequence where normal mouth and head motions could occur. Due to the long duration of the study, participants could also take an optional break of up to two minutes after every 15 trials.
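The dataset size follows directly from the protocol figures above and the 16 participants. The short calculation below only restates that arithmetic; the variable names are illustrative.

```python
BATCHES = 6        # batches per participant
GESTURES = 10      # eight tongue gestures plus two controls
REPETITIONS = 50   # repetitions per trial
PARTICIPANTS = 16

trials_per_participant = BATCHES * GESTURES                       # 60 trials
examples_per_participant = trials_per_participant * REPETITIONS   # 3000 examples
examples_per_gesture = BATCHES * REPETITIONS                      # 300 per gesture, per participant
total_examples = examples_per_participant * PARTICIPANTS          # 48000 in the full dataset
```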
At the end of the study, participants filled out a basic demographic survey and gave qualitative feedback on their experience with the interface. Additionally, the eight participants at the Atlanta site completed a NASA-TLX questionnaire for each of the 10 gestures. This was not completed at the other site due to being a later addition to the study protocol. The study took approximately 1.5 hours in total.

RESULTS
After data was collected from all participants and the models were optimized as described in Section 4.2, we performed a series of offline experiments for gesture classification. For the experiments below, unless otherwise specified, we used an 80/20 train-test split to build a user-dependent model with the eight gestures and rest condition using the hierarchical random forest and support vector machine model.

Classification and Sensing Modality
The result of most interest to us from the study was which sensors were most effective at classifying tongue gestures. While some of our sensors already contained multiple modalities, such as the IMUs including an accelerometer and gyroscope, we treat each stream as its own modality for the purpose of this comparison as they can be packaged together. Initially, we compared each sensing modality independently, but we observed that multimodal combinations were able to achieve a higher accuracy than a single modality alone. In particular, the most effective method was to combine the IMU on the Muse EEG headset with the PPG. The results for each modality and multimodal configuration are shown in Figure 6.
To our surprise, EEG was not particularly effective, although this may have been due to the location of the sensors being too close to the eyes, which produce a much stronger artifact. The IMU on the Muse turned out to be our most effective sensor, achieving 80% accuracy alone. Multimodal combinations including the Muse IMU were even more effective, with a combination with the PPG sensor achieving 94% accuracy. While we have not observed this in prior literature, as the PPG has not been used there, we suspect this may be due to a greater blood flow to the entire face during tongue movement. We also found promising results when using the head tracking of the VR headset, although the head tracking may be less effective in a more ecologically valid configuration.

Gesture Classification
For classifying between the gestures, we created confusion matrices for both user-dependent and user-independent classification. In this case, we decided to include the "Blink" and "Stick Out" control gestures as we were curious if gestures would be confused with other movements of the face. For user-independent classification, we used leave-one-user-out cross-validation for testing instead of an 80/20 test split averaged across users. We chose this approach as we sought to include no data from the participant being tested in the training dataset. As shown in Figure 7, the "Shake" gesture, where the tongue is swung sideways, was the gesture with the most error in the user-dependent model, being confused for the "Mouth Floor" gesture. The user-independent model had the classification error far more distributed, although the overall accuracy decreased to 80%.
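Leave-one-user-out cross-validation can be sketched as a generator of index splits, as below. This is an illustrative pure-Python version; the function name and the participant labels in the usage lines are hypothetical, and Scikit-Learn's `LeaveOneGroupOut` offers an equivalent splitter.

```python
def leave_one_user_out(participant_ids):
    """Yield (held_out, train_indices, test_indices) once per participant.

    `participant_ids[i]` is the participant who produced sample i, so the
    held-out participant never contributes any training data.
    """
    for held_out in sorted(set(participant_ids)):
        train = [i for i, p in enumerate(participant_ids) if p != held_out]
        test = [i for i, p in enumerate(participant_ids) if p == held_out]
        yield held_out, train, test


sample_owner = ["P1", "P1", "P2", "P3", "P3"]
folds = list(leave_one_user_out(sample_owner))
```

Each fold trains on all other participants and tests on the held-out one, and the per-fold accuracies can then be averaged.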
In addition to classifying between gestures, we evaluated the amount of data per participant and window size necessary to get reliable accuracy. As shown in Figure 8, we found that recognition accuracy decayed rapidly below a dataset size of 180 samples per gesture and a window size of 400 milliseconds. While this may be due to the hyperparameters chosen, the inability to reduce the dataset size suggests data augmentation methods may be necessary for achieving generalizable tongue gestures without collecting even larger datasets.

Gesture Usability
Quantitative metrics on the usability of gestures were collected using a NASA-TLX questionnaire, reported in Figure 9. Some of the gestures, such as curling the tongue back, were challenging to perform, with participants pointing out they felt more tired after trials for the gesture. However, we found that the single and double tap, as well as biting the tongue, were comparable to blinking in cognitive load. Aligning with Chen et al.'s results [10], participants showed a preference for tongue gestures that were shorter in duration.
For the informal qualitative feedback, participants noted that tongue movements in the front of the mouth, such as "Bite" and "Single Tap", were more convenient, which aligns with the NASA-TLX results. Many participants struggled with interpreting what gestures meant; almost all of them asked for clarification on how "Mouth Floor" should be performed. P3 pointed out that they didn't know when to stop "Shake", which made them confused. P9 mentioned that while they felt they touched the cheek for "Left Cheek" and "Right Cheek", they weren't sure when was the right time to stop. Such an issue could be solved by real-time feedback when using the interface.

Integrating Tongue Gestures to Devices
Based on the sensors with the best accuracy, we can observe that the IMU behind the ear is a low-cost method of detecting tongue gestures with a position allowing it to be combined with past mouth sensing approaches such as Nguyen et al.'s ear EMG [47] and Jin et al.'s in-ear acoustics [23]. As a result, an IMU or a combination of these approaches can be used in earables or smart headphones and head-worn displays with relatively few modifications to existing hardware. Discreet, hands-free tongue gestures could replace touch-based gestures on these devices or be an alternative configuration for them. Potentially, the user could receive additional haptic feedback after performing gestures by adding additional components such as ultrasound transducers [55] or capacitive electrodes [24]. We chose not to include any custom hardware in this study as it would be against our goal of convenient replicability, although our omission of existing custom approaches makes comparison harder.
Another step critical for making tongue gestures viable for products is a reliable, user-independent classification model. While the user-independent model can already achieve above 80% accuracy, this wouldn't be sufficient for using such a classifier repeatedly to control an application. We also expect that current user-independent accuracy would decay when taken outside the lab conditions. A two-IMU approach as shown by Srivastava et al. may help improve the generalization of the IMU signals [56]. An ear IMU could be combined with VR position tracking in a similar manner. An especially vital task would be obtaining a better model of non-gesture motion.

Real-time Implementation
In addition to the offline results, we implemented a real-time recognizer for tongue gestures. The offline model was pickled in Python, then unpickled for the real-time algorithm, which repeatedly captured a window of data from the LSL stream using a ring buffer. As extra conditions for a positive gesture classification, there had to be enough data to fill the ring buffer and 400 milliseconds must have passed since the previous positive gesture classification. This helped prevent repeated activations of the gestures. When the real-time recognizer detected a gesture, it would trigger a keyboard press or mouse click, which was used to facilitate interactions with Universal Windows Platform apps. Given the success of IMU and PPG, we decided to test the real-time interface using only the Muse 2, without a VR headset. We combined the real-time tongue gesture recognizer with a Tobii desktop eye tracker, successfully controlling multiple hands-free games and an interface for controlling a musical instrument [19] as shown in Figure 10. Real-time recognition typically required participants to be sitting still due to the limitations of the current data, but the interface successfully detected tongue gestures with a latency of 400 milliseconds.
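The two gating conditions described above (a full ring buffer, and 400 milliseconds since the last positive classification) can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: the class name `DebouncedRecognizer`, the `classify` callable standing in for the unpickled model, and the window size are hypothetical, and samples are passed in directly instead of being pulled from an LSL stream.

```python
from collections import deque
from typing import Callable, List, Optional

REFRACTORY_S = 0.4  # 400 ms between positive classifications, as in the text


class DebouncedRecognizer:
    """Ring-buffer recognizer with a refractory period (illustrative sketch)."""

    def __init__(
        self,
        classify: Callable[[List[float]], Optional[str]],  # hypothetical model stand-in
        window_size: int,                                   # hypothetical window length
        refractory_s: float = REFRACTORY_S,
    ) -> None:
        self.classify = classify
        # deque with maxlen acts as a ring buffer: the oldest sample is dropped
        self.buffer: deque = deque(maxlen=window_size)
        self.refractory_s = refractory_s
        self.last_hit: Optional[float] = None  # time of last positive classification

    def push(self, sample: float, timestamp: float) -> Optional[str]:
        """Add one sample; return a gesture label or None."""
        self.buffer.append(sample)
        # Condition 1: the ring buffer must be full.
        if len(self.buffer) < self.buffer.maxlen:
            return None
        # Condition 2: the refractory period since the last detection must have passed.
        if self.last_hit is not None and timestamp - self.last_hit < self.refractory_s:
            return None
        label = self.classify(list(self.buffer))
        if label is not None:
            self.last_hit = timestamp
        return label
```

In a full system, a returned label would be mapped to a keyboard press or mouse click; that step is omitted here because it depends on the target platform.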

Ecological Validity
While our study included multiple locations, they were both controlled lab environments where the participant was only moving to execute the gestures. For the gestures to translate to more realistic environments, a more ecologically valid study design with multiple sessions and mobility between environments is necessary. The real-time model developed in this paper will help in testing the usability of tongue gesture interactions in realistic settings.
In the future, we plan to conduct studies where participants may be performing other tasks such as walking, eating and talking to better characterize gestural and non-gestural movements. We would also like to test our tongue interface in settings emulating typical tongue gesture use cases, where the sensor fit quality may decay and environmental noise would vary due to the mobility of head-worn devices. Our chunked block-based study design could be replaced by an event-based study design to better capture and evaluate gestures in daily interactions.

FUTURE WORK

Sensing Alternatives
The most useful sensors in our experiment were the IMU and PPG, but there are still many sensors that could be used in head-worn displays to detect tongue gestures. Acoustic approaches have been effective in sensing mouth movements in earables [23,35] and could easily be applied in current head-worn devices. Moreover, the motion at the back of the ear captured by the Muse 2 IMU could potentially be detected in other modalities such as stretch sensors.
The sensors we used could also be positioned differently to make them more effective. Our EEG results fell short of our expectations based on prior work by Kaeseler et al. [25,34], which we suspect is due to the poor facial positioning of the Muse 2 EEG sensors, where there was much greater noise due to eye movements, as demonstrated by the Blink control gesture. EEG and EMG sensors in the ears, around the nose piece and on the head strap may prove more useful for sensing tongue movements in head-worn displays. For earables, in-ear EEG would likely be a better option.

Tongue Interaction for Augmented Reality
We believe the most promising application for tongue interactions is in controlling AR interfaces. In contrast with VR, AR headsets are often used while interacting with other people in shared spaces, making discreet control of the interface more important. While typically not near the ear, most AR headsets contain an IMU already, making it easy to integrate our system into these devices. Tongue taps could be a suitable alternative to hand gestures like the "air tap" present in the HoloLens. AR headsets like the HoloLens 2 and Magic Leap 2 also include eye tracking as a potential sensor for facilitating interactions [57], which could be combined with tongue gestures to enable point & click interactions [19]. We plan to study this multi-organ interaction further by experimenting with its use in AR headsets and comparing it to other gaze-based interactions.

CONCLUSION
In this paper, we presented TongueTap, a tongue gesture interface that does not require any additional sensors beyond those available in commercial head-worn devices. We found that IMUs, PPG and motion tracking capabilities in head-worn devices can perform eight-gesture classification at greater than 70% accuracy, and tried combinations of sensors that may enable tongue gesture interactions in head-worn displays with minimum cost and hardware modifications. We determined that some tongue gestures, like tapping on the front teeth and biting on the tongue, had cognitive demands comparable to blinking. We also found that tongue gestures can be executed and recognized within 400 ms, less than gaze and dwell rates for most users. Through the sensing capabilities, gestures and interactions demonstrated by TongueTap, we put forth tongue gestures as a method for discreet, hands-free control of head-worn devices without requiring any additional hardware.
Gesture Name    Description
Single Tap      Tap front upper teeth once with tongue
Double Tap      Tap front upper teeth twice in a row with tongue
Shake           Swing tongue left and right repeatedly
Left Cheek      Tap left cheek with tongue
Right Cheek     Tap right cheek with tongue
Mouth Floor     Touch bottom of mouth, behind lower teeth with tongue
Curl Back       Curl tongue up and towards the back of the palate
Bite            Gently bite on tongue with front teeth

Figure 2: Eight discreet, closed-mouth tongue gestures and how they are performed.

Figure 3: Data flowchart for both offline and online recognition with TongueTap.

position tracking data used in VR applications via OpenXR, which we added to our streaming tool for another measure of motion tracking. During data collection, the user can press the "A" button on a Windows Mixed Reality controller to start a gesture and release it to stop the gesture, continuing to the next one. As gestures often take a variable duration to complete, this allows more accurate gesture boundaries while also measuring the duration. If the user believes they made a mistake, they can instead press the "B" button to delete the previous gesture and redo it. The "Press", "Release" and "Delete" signals from these controller activities are also synchronized over LSL. All data is either stored in an XDF file using the LSL Lab Recorder for offline recognition or streamed directly to a Python script processing moving windows from the data stream. The full data flowchart is shown in Figure 3.
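As a rough illustration of how the "Press", "Release" and "Delete" markers could be turned into gesture boundaries, the sketch below pairs each press with the following release and drops the most recent completed trial when a delete arrives. The function name `segment_trials` and the `(timestamp, marker)` tuple layout are assumptions made for this example; how the study's own tooling handles these markers is not specified in the text.

```python
from typing import List, Optional, Tuple


def segment_trials(events: List[Tuple[float, str]]) -> List[Tuple[float, float]]:
    """Convert time-ordered (timestamp, marker) events into (start, end) trials.

    Assumed behaviour (hypothetical, for illustration only):
    - "Press" opens a trial, the next "Release" closes it.
    - "Delete" removes the most recently completed trial so it can be redone.
    """
    trials: List[Tuple[float, float]] = []
    start: Optional[float] = None
    for timestamp, marker in events:
        if marker == "Press":
            start = timestamp
        elif marker == "Release" and start is not None:
            trials.append((start, timestamp))
            start = None
        elif marker == "Delete" and trials:
            trials.pop()  # discard the previous gesture
    return trials
```

The duration of each gesture then follows directly as `end - start` for every returned pair.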

Figure 4: Architecture of gesture recognition model. Each sensor is processed separately until the SVM binary classification stage.

Figure 5: A participant wearing our experimental interface while performing data collection in VR, with sensor positions shown over it. Eye tracking is within the gasket and omitted. Participants marked gestures using a VR controller, and received visual feedback indicating their position in the study.

Figure 7: Confusion matrices for user-dependent and user-independent classification with all gestures and controls.

Figure 8: Samples per gesture and window size experiments. Classification across 8 gestures, mean of 16 participants.

Figure 10: Four hands-free applications used for real-time tongue interactions. (a) A maze game. (b) A matching game. (c) A "double up" game similar to 2048. (d) A tubular bell musical instrument interface.

Table 1: Sensors and calculated measures from the Muse 2 and Reverb G2 Omnicept Edition. Modalities marked with * were not used for classification for reasons explained in Section 3.1.