Left or right? Detecting driver's head movement on the road

The Internet of Things is enabling innovations in the automotive industry by expanding the capabilities of vehicles through connecting them with the cloud. One important application domain is traffic safety, which can benefit from monitoring the driver's condition and how safely they are handling the vehicle. By detecting drowsiness, inattentiveness, and distraction of the driver, it is possible to react before accidents happen. This paper uses accelerometer and gyroscope data collected using an ear-worn sensor to classify the orientation of the driver's head in a moving vehicle. We show that lightweight machine learning algorithms such as Random Forest and K-Nearest Neighbour can be used to reach accurate classifications even without applying any noise reduction to the signal data. Data cleaning and transformation approaches are studied to see how they give deeper insights into the classification problem. This study paves the way for the development of driver monitoring systems capable of reacting to anomalous driving behaviour before traffic accidents can happen.


INTRODUCTION
Head movement detection technology can be applied in various fields, including security surveillance, interaction with wheelchairs and similar devices, and automotive applications, where it is used to detect the head orientation and the driver's field of vision. Traditional approaches for head movement and orientation detection utilise cameras as the main sensors and apply image processing to the streamed data. However, image streams are heavy-weight, and driving-safety-related automotive applications require minimal response times from the critical systems. Some attempts have already been made to develop real-time head movement identification systems. However, the challenges in this field remain high, as it requires high computational hardware [1]. A microcontroller, i.e. low-computational hardware, is not always capable of executing head movement detection algorithms. Therefore, the time required for the CPU to process and analyse head movement data needs to be improved. Moreover, developing an appropriate method for head movement detection is a crucial challenge, as the success rates of different methods differ considerably.
In-vehicular sensing, i.e. sensing the movements of a person driving a moving vehicle, sets its own challenges for data quality. Different wearable sensors (wristbands, smart rings, glasses, etc.) can detect the driver's condition, such as sleepiness and attention level, and behavioural patterns, such as head movements and the extent of the field of vision [4]. Considering the potential and challenges of head movement detection technology, we aim to detect head movement by applying machine learning models to lightweight gyroscope data. We use an in-ear sensor, the eSense device, that produces accelerometer and gyroscope data when worn in the ear, similarly to any wireless earbud used, e.g., for listening to music. We use the gyroscope data generated through the eSense device to detect the individual's head direction: left, right, or straight. We evaluate the results by comparing them against the accuracy of other models and methods reported in the existing literature.
The concept of earables is fairly recent, and it seems to have entered the academic literature during the past few years [19]. Therefore, there is still plenty to explore in different application domains for them. Many studies have already used machine learning to classify situations and activities, and signal processing techniques [4, 18] to track head orientations. Application domains such as ergonomics, healthcare, and navigation have already seen related studies. However, applications for traffic safety and driver training are less considered. In this work, we consider adopting earables for improving traffic safety. Possible use cases include reminding drivers to look at so-called dead angles and vehicle blind spots [3, 13] when changing lanes and teaching such activity to new drivers.
The contributions of this paper can be summarised as follows:
• We investigate lightweight machine learning models to predict three categories of head poses: left, right, and straight, with a simple ear-worn device and its accelerometer and gyroscope readings. The best of our models achieve 0.92 to 0.99 accuracy, comparable to computer-vision-based results in the related work.
• We investigate the results beyond accuracy, also looking for the best methods for avoiding false positives and false negatives by analysing the prediction results for each class with precision, recall, and F1 scores. Our results show that, in some cases, high accuracy can "hide" an imbalance in how well each class is predicted individually if not properly studied.
• We consider data preprocessing strategies, including undersampling to balance the dataset and the choice of window sizes, and discuss their effects on the classification results. Our results highlight that the balance between classes in the dataset and undersampling, indeed, have a noticeable effect on the class-based results.

RELATED WORK
Head movement detection is traditionally applied in video surveillance and security for different purposes, such as person identification [21], but it also has other uses. In driving assistant systems, recognising the driver's attention is exceedingly important. The driver's head position can assist in identifying their attention and distractions, which is critical for applications to prevent road accidents.
For instance, to identify drivers' attention, Liu et al. [14] employed a head posture detection method, and Lee et al. [11] presented a system where head movement is used to recognise drivers' drowsiness. The methods for detecting head movement are also wide-ranging, traditionally centring around computer-vision-based approaches. For example, Murphy-Chutorian and Trivedi [16] used image processing and pattern recognition techniques to develop a static head movement detection algorithm as well as a visual 3-D tracking algorithm. They detect the preliminary posture of the head and then track the head position and direction in six degrees of freedom. Haar-wavelet AdaBoost cascades are used in this head movement detection module. As inputs, the preliminary head pose detection stage applies a support vector regressor (SVR) and localised gradient orientation (LGO) histograms. In the tracking module, an appearance-based particle filter technique is employed to pinpoint the 3-D movement of the head. Moreover, Liu et al. proposed a video-based method [14] for detecting head pose from the general posture between adjoining views in subsequent video frames. Scale-Invariant Feature Transform (SIFT) descriptors were used to coordinate corresponding feature points of two contiguous perspectives.
Some more creative methods have been proposed to enable different applications, such as power wheelchair control systems. Jianzheng and Zheng [6] introduced another idea for detecting head movements, combining mathematical methods with image processing techniques. They observe that feature points such as the nostrils are closely positioned with the head. Therefore, the head movements can be identified by tracking the direction of these feature points. Aiming to obtain the pattern of head positions, they used the Lucas-Kanade (LK) algorithm and GentleBoost classifiers to employ the coordinates of the nostrils in a video frame. Kupetz et al. [10] developed a head movement tracking system using an IR camera and IR LEDs. It tracks a 2x2 infrared LED array attached to the back of the head. Light-tracking video analysis is applied to process the LED motion. In this method, each frame is sectioned into regions of interest, and the movement of key feature points is followed between frames [5].
Furthermore, Berjón et al. [2] offered a combination of alternative human-computer interface systems that include head movements, voice recognition, and mobile devices. An RGB camera and image processing techniques were applied with Haar-like features and an optical flow algorithm. The job of the Haar-like features is to detect the direction of the face, whereas the optical flow algorithm is responsible for detecting the changes in the face position within the image. Acoustic-signal-based methods are another approach that has been considered for estimating head movement. For example, an acoustic-based method introduced by Sasou [20] employs a microphone array aiming to detect the source of the sound. For each head direction, the sounds produced by the user, including their localised positions, are distributed around specific areas. Hence, by defining the boundaries between these areas, it is possible to detect the head movement based on the areas of the generated sounds.
With automotive applications, the vital aim is to deliver head movement detection and tracking technology in real time. Video-based methods are suitable for very accurate head pose detection, but they require high-quality data streams and compromise security and privacy design principles due to the data collection phases required for model training. In addition, we argue there are more lightweight data sources for detecting the head pose with a useful level of accuracy. Thus, we prefer to use sensors such as gyroscopes and accelerometers, which are among the widely applied means of gaining information about head movement. Nguyen et al. [17] proposed a method that can detect the head movement of users by investigating data gathered from a dual-axis accelerometer with a pattern recognition procedure. Using this method, head movement can be classified into one of four gestures with an optimised version of a Bayesian Neural Network.
Another approach was presented by Manogna et al. [15], where an accelerometer device was attached to the user's forehead. This accelerometer device is able to sense the tilt generated by the user's head movement. The tilt is then used to produce an analogue voltage value, from which control signals can be created to identify the head movements. The study of Kim et al. [8] introduced a head tracking system that can detect head movement using gyroscope sensor data. Angular velocity data collected via the gyroscope sensor is converted to angles by applying an integral operation. Relative coordinates are preferred over absolute coordinates.

METHODS AND MATERIALS

Earable sensor
Earables, i.e. in-ear sensors, are a relatively new concept but have already shown great potential. The eSense device [4, 7] is built on a customised 15 x 15 x 3 mm PCB and comprises a Qualcomm CSR8670 dual-mode Bluetooth audio system-on-chip (SoC) and a microphone in each earbud. It has a built-in InvenSense MPU6500 six-axis inertial measurement unit (IMU), comprising a three-axis accelerometer and a three-axis gyroscope, as well as a two-state button. A circular LED, associated power regulation, and battery-charging circuitry are fitted in the device as well. It does not have any internal storage or real-time clock functionality. An ultra-thin 40-mAh LiPo battery powers the device. The eSense earbuds can be recharged on the go using a carrier case with a built-in battery. Each earbud weighs 20 g, and its dimensions are 18 x 20 x 20 mm. In our study, a Samsung S10 5G mobile phone was the host device.

Experimental setup
Figure 1 shows the procedure of data collection using the earable device. First, the earbuds are inserted softly into the ear canal, and their placement can then be adjusted by rotating them slightly so that they sit properly and do not fall off during head movements. The left earbud is equipped with a BLE interface, which is used to configure the IMU sensor for collecting accelerometer and gyroscope data. It also enables proximity detection, as the left earbud transmits periodic BLE advertisement packets. Generally, IMU sampling starts with a certain sampling rate. The accelerometer and gyroscope full-scale ranges can then be configured, which define the minimum and maximum acceleration and rotational speed detected by the sensor. IMU sensors, particularly gyroscopes, produce a bias in their readings. The eSense is calibrated to compensate for this bias before starting the experiment. The data is transmitted through the smartphone (Samsung S10 5G with Android 9) gateway into local data storage.
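As a rough illustration of the bias compensation step, the sketch below estimates the bias as the mean of gyroscope readings recorded while the device is held still and subtracts it from subsequent readings; the function and variable names are our own and do not reflect the actual eSense calibration routine.

```python
import numpy as np

def estimate_gyro_bias(stationary_readings: np.ndarray) -> np.ndarray:
    """Estimate the per-axis gyroscope bias from readings captured while the
    sensor is kept still: the true angular velocity is then zero, so any
    non-zero mean is treated as bias."""
    return stationary_readings.mean(axis=0)

def remove_bias(readings: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Subtract the estimated bias from every gyroscope reading (N x 3 array)."""
    return readings - bias

# Hypothetical usage with an N x 3 array of gyroscope rows recorded before the experiment:
# bias = estimate_gyro_bias(calibration_readings)
# gyro_clean = remove_bias(gyro_readings, bias)
```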
We divided the experiment into two phases: first, collecting stationary data from people sitting still and turning their heads left and right, and second, collecting head movement data from an actual driving scenario with a real car. The movement of the car brings noise into the collected data, which is expected to hinder accurate classification. As shown later, the driving data is also more imbalanced, as the driver of the vehicle faces forward most of the time during the driving event.
Six volunteering participants from our research unit staff (half women, half men) were asked to participate in the study to gather the preliminary data. First, each participant performed the task of sitting still and moving their head slowly but continuously in different directions: left, right, and straight. This was done six times for each participant. The orientation of the head was labelled into three categories (left, straight, and right) using a smartphone application we designed for this purpose. Figure 2 shows an overview of the data with the labels added later and coloured differently to highlight the directions of the head.
Second, a single individual (a woman) performed the second stage of the experiment by driving a car with the experimental setup. She was requested to drive as normally as possible, turning her head left and right naturally while driving. A second person sat in the front seat to label the data using the mobile app. As expected, the head positioning data collected during the driving scenario has more noise due to the movement of the car itself. In addition, the data collected from the driving case requires more preprocessing, as discussed in the next section.

Data processing and analysis
The collected data consists of a time series of six values, covering the three axes of both the accelerometer and the gyroscope. In addition, the data was labelled during the experiment with the head direction using three options: 'L', 'S', and 'R', representing left, straight, and right, respectively.
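For illustration only, one plausible layout for such labelled records in a data frame is sketched below, with made-up values; the column names are our own shorthand rather than the exact field names produced by the logging application.

```python
import pandas as pd

# One row per IMU reading: three accelerometer axes, three gyroscope axes,
# and the head-direction label recorded during the experiment.
columns = ["acc_x", "acc_y", "acc_z", "gyro_x", "gyro_y", "gyro_z", "label"]

samples = pd.DataFrame(
    [
        [0.02, -0.98, 0.05, 1.3, -0.4, 0.2, "S"],   # head facing straight
        [0.03, -0.96, 0.07, 14.8, -2.1, 0.6, "L"],  # head turning left
    ],
    columns=columns,
)
```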
Table 1 shows the distribution of the samples between the labels both for the stationary and the driving data. As it turns out, there are almost twice as many samples in the driving dataset compared to the stationary dataset. The imbalance comes from contextual factors, as the driver of the car is facing straight ahead most of the time. The slight imbalance in the stationary data comes from the practical implementation of how the data was collected. The participants were asked to move their heads to face from left to right and back again during the data collection sessions. As the head faces straight forward in between the left and right stages, the activity provides more labels for the transition stage of facing forward in comparison to the two extremities.
To obtain more general and useful models from the driving dataset, the data was balanced before training any models. The main approach for balancing the dataset was randomised undersampling, which means the size of the dataset was reduced to a total of 6915 samples with a distribution ratio of 1:1:1, i.e. roughly 33% per label. Undersampling was chosen instead of oversampling because, with the small dataset size, directly duplicating samples could bring an unacceptable amount of bias to the models, meaning generalizability would suffer too much. In addition, adding random noise to the duplicated samples would reduce the realism of the data, which in turn might cause the models to learn patterns not visible in real-world data. With limited data to validate the models, oversampling was not adopted to balance the datasets.
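A minimal sketch of such randomised undersampling is given below, assuming the samples sit in a pandas DataFrame with a label column as in the illustration above; the helper name and the fixed random seed are our own choices.

```python
import pandas as pd

def undersample(df: pd.DataFrame, label_col: str = "label", seed: int = 42) -> pd.DataFrame:
    """Randomly undersample every class down to the size of the smallest class,
    giving a 1:1:1 distribution over the labels."""
    smallest = df[label_col].value_counts().min()
    balanced = (
        df.groupby(label_col, group_keys=False)
          .apply(lambda g: g.sample(n=smallest, random_state=seed))
    )
    # Shuffle so the classes are not stored in contiguous blocks.
    return balanced.sample(frac=1.0, random_state=seed).reset_index(drop=True)
```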
Data transformation efforts include splitting the data into constant-length sequences of samples. This smoothens the dataset, as it increases the amount of data per sample and decreases the differences between data samples. Two main approaches were chosen for the data transformation. First, a sequencing algorithm was developed for splitting the data into more consistent sequences. Second, data filtering algorithms were applied to the data both to reduce the dimensions and to highlight the orientational value of the data for the machine learning models to harness.
Different lengths of sequences were considered to find the optimal window size for training the classification ML algorithms. This was done with the labelled data by looping through the data samples and building a list of them until the label changed. Whenever the label changed, the built list was looped through and split to match the desired window length with a desired overlap. Window sizes of 5, 10, 15, and 20 and overlaps of 10 were explored.
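The sequencing step can be sketched roughly as follows; it mirrors the description above (collect consecutive samples until the label changes, then cut each run into fixed-length windows), although the exact handling of the overlap is our own interpretation.

```python
from itertools import groupby

def make_windows(samples, labels, window_size=5, step=None):
    """Split a labelled time series into constant-length windows.

    Samples are first grouped into runs sharing the same label; each run is
    then cut into windows of `window_size` samples, advancing `step` samples
    at a time (step < window_size yields overlapping windows)."""
    step = step or window_size
    windows, window_labels = [], []
    for label, run in groupby(range(len(labels)), key=lambda i: labels[i]):
        indices = list(run)
        for start in range(0, len(indices) - window_size + 1, step):
            chunk = [samples[i] for i in indices[start:start + window_size]]
            windows.append(chunk)
            window_labels.append(label)
    return windows, window_labels
```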

Machine learning models
Three machine learning algorithms were used for solving the classification task: Random Forest Classifier (RFC), K-Nearest Neighbour (kNN), and Logistic Regression (LR). These algorithms were selected as they are suitable for classification tasks, they are capable of supervised learning, and their implementations were readily available. In addition, at least the SVM, RFC, and kNN algorithms can be considered popular shallow classifiers [12].
The data was then split into training and testing sets: 80% of the data was allocated for training the model and 20% for testing the final model. In addition, as part of the training, 10-fold cross-validation was used to produce evaluation scores for each trained model. This means that the training dataset was split into ten equal-sized subsets, and each model was trained so that each of these subsets served as the validation dataset while the others were used for training. The predicted label is the direction of the head for each classification algorithm. A total of 44 runs of training and testing on the models were executed for the final evaluation of the effects of the different data cleaning and transformation approaches. Twelve different models were trained, and they were tested on both datasets with different data preprocessing steps.
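The split and cross-validation setup corresponds roughly to the following scikit-learn sketch; the placeholder data, the stratified split, and the fixed random seeds are our own additions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Placeholder data: replace with the (sequenced) sensor features and labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(600, 6))
y = rng.choice(["L", "S", "R"], size=600)

# 80/20 split into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 10-fold cross-validation on the training set produces per-fold scores.
model = RandomForestClassifier(n_estimators=100, random_state=42)
cv_scores = cross_val_score(model, X_train, y_train, cv=10)

# The final model is fitted on all training data and evaluated on the held-out set.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
```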
With the small sample size, it is important to consider many factors when assessing a model's performance. Specifically, models were trained using 10-fold validation, and macro, micro, and weighted averaging approaches were all used separately. Micro averaging calculates metrics on the binary confusion matrix formed by adding each class's confusion matrix values together; macro averaging calculates an unweighted mean over the labels. Finally, weighted averaging builds on top of the macro approach by considering label imbalance, using label counts as support weights. Label-wise precision, recall, F1 score, and their support were calculated to provide insight into possible problems with dataset imbalance. The models' hyperparameters were the following: RFCs were trained with 100 estimators, LR was configured to have 5,000 as the maximum number of iterations, and the optimal value of K for kNN was searched from [0..100] using the mean accuracy of 5-fold cross-validation.
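Continuing the sketch above, the label-wise metrics, the three averaging modes, and the search for K could be computed along the following lines; this is an illustrative outline rather than the exact training code, and the searched K range starting at 1 is our own adjustment.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, precision_recall_fscore_support
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

y_pred = model.predict(X_test)

# Label-wise precision, recall, F1 score, and support.
print(classification_report(y_test, y_pred))

# Micro, macro, and weighted averages of the same metrics.
for avg in ("micro", "macro", "weighted"):
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average=avg
    )

# LR with the maximum number of iterations set to 5,000.
lr = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# Search the best K for kNN using the mean accuracy of 5-fold cross-validation.
knn_search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": range(1, 101)},
    cv=5,
    scoring="accuracy",
).fit(X_train, y_train)
best_k = knn_search.best_params_["n_neighbors"]
```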

Prediction results
Table 2 shows the results for the three different algorithms: Random Forest Classifier (RFC), K-Nearest Neighbour (kNN), and Logistic Regression (LR). Results are presented separately for the stationary and driving datasets. In general, our approach reaches high accuracy (up to 99% with stationary data), comparable to the related work. For example, a computer-vision-based head tracking system has a detection accuracy of 92.86% [22]. On the other hand, accelerometer- and gyroscope-based head movement detection methods have reached detection accuracies of 99.05% [9] and 93.75% [17], similar to what we achieved in our study.
Through multiple cross-validation and folding exercises, we can conclude that the individual's effect on how the head movement is measured is minimal. The models were trained over all the individuals who participated in the "stationary", i.e. sitting-still, experiment (see Table 2, labelled as "stationary"), with no significant effects depending on whose data was used for the training or testing parts. However, we note that the data may not generalise over a larger population, given the limited number of participants.
Table 2 (labelled as "driving") shows the results for the driving case, including the most effective configurations regarding the use of data undersampling for the models. Accuracy is a simple metric for evaluating the models, so let us consider the average scores for how the models performed with and without undersampling. RFC is the most effective of the three (average accuracy 0.96), followed by kNN (average accuracy 0.83), while LR is the least effective (average accuracy 0.65). It must be noted that accuracy tells only one part of the story, and it is also necessary to consider the label-wise metrics to see how a model manages the imbalanced dataset. Looking at the lowest label-wise metrics, each model's lowest metric was the 'right' label's recall (0.59 for RFC, 0.47 for kNN, and 0.11 for LR). Whereas RFC and kNN had better metrics overall for the model trained on the imbalanced dataset, LR benefited more from the balanced dataset.
One noteworthy point is that for RFC and kNN, the balance between the trained and tested models' accuracy remains reasonable (0.01 and 0.15 accuracy loss, respectively), but for LR, the model trained on the imbalanced dataset performs worse with the undersampled data (0.55 accuracy loss). This is because that model seemed to learn the bias that comes with the imbalance, as seen from the model's very low recall scores (0.01 to 0.02) for the smaller labels.
It seems that for both RFC and kNN, it makes sense to work directly with the imbalanced dataset, as the scores remain high even when testing with the undersampled dataset. The precision of the label 'straight' had the lowest score for both of these models when tested with the undersampled dataset, with values of 0.87 and 0.53, respectively. In contrast, the model trained on the undersampled dataset did not adapt well to the imbalanced dataset, as can be seen from the decreased precision score, which went from 0.89 down to 0.31 for the label 'left' with RFC and from 0.82 to 0.22 for the same label with kNN.
While all of the trained models in Table 2 have some high scores (each of them achieved at least 0.97 in at least one metric for a label with driving data and 0.99 with stationary data), only the RFC trained on the imbalanced dataset seems to produce acceptable results all around (the average over all label-wise precision and recall values for both datasets being almost 0.90) for reliable classification. The worst of the metrics is the recall for the smaller labels, which falls below 0.60, but even this score is at an acceptable level, considering a random classifier would produce around 0.33.
In addition, we also tested whether models built on stationary data can be used to predict the head direction in the driving scenario (i.e. whether people sitting still reflect the driving scenario). We tested the models trained on undersampled data against data without undersampling to see how well they work on realistic label distributions. Overall, models trained with driving data performed better on driving data than on stationary data, and models trained with stationary data performed better on stationary data than on driving data, indicating that the use cases are, indeed, very different in nature. When training the models with stationary data and testing with driving data, the accuracy is only slightly better than a random guess (RFC: 0.42, kNN: 0.39, LR: 0.46).

Sequencing window sizes
Sequencing is usually done to increase the effectiveness of the trained models, as the amount of data per sample increases. As the sequencing can be done with different window sizes, there was a need to assess which one would be the most appropriate. The window sizes were explored by training an RFC model on the imbalanced driving dataset using sequencing for each explored window size: 5, 10, 15, and 20. A similarly processed dataset with undersampling was used to test the model. Figure 4 shows the effectiveness of different-sized sequencing windows for the dataset when using the RFC algorithm. It is noticeable that the support values keep going down as the window size is increased. This happens because the number of samples decreases significantly, as a larger number of dataset samples are required for producing a single sequenced sample. It also seems that the recall of the 'left' label decreases approximately linearly (0.39, 0.31, 0.24, 0.21) as the window size is increased. The recall of the 'right' label also decreases, from 0.40 to 0.24 with the jump from window size 5 to 10, but there is an increase afterwards, to 0.28 and further to 0.29. Considering the window size of 5, the model does not seem as effective as without the sequencing.
The difference in accuracy is 0.02, and the biggest drop is in the label-wise metric for 'left', which drops from 0.61 to 0.39. Gathering multiple samples into one does not seem to be the way to go when improving the effectiveness of the model. The accuracy remains more or less the same (a maximum difference of 0.09) for all window sizes. Overall, it seems that the scores are either slowly decreasing or remaining the same as the window size is increased. This is supported by considering the lowest (0.39, 0.24, 0.24, 0.21) and the average (0.71, 0.65, 0.64, 0.60) label-wise values for precision and recall. These values are visualised in Figure 4. The trend is mainly negative as the window size is increased. As an additional aspect to consider, the decreased number of samples used for training and verification brings about bias regarding the generalizability of the resulting models. The exploration of the window sizes for the data transformation sequencing algorithm provided some insight into their effect on the results. Figure 3a shows the confusion matrix of the window size 5 sequenced, undersampled driving data tested on a model trained on the imbalanced driving data. For comparison, Figure 3b shows the exact same situation except without using sequencing at all. Increasing the window size reduced the number of samples both because more of the data is used for a single sample and because more of the naturally shorter sequences were dropped as the size increased.
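For concreteness, the exploration described above could be organised as a loop over the window sizes, reusing the make_windows sketch introduced earlier; this is an illustrative outline under those assumptions, not the exact evaluation code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

def evaluate_window_sizes(samples, labels, sizes=(5, 10, 15, 20)):
    """Train an RFC for each sequencing window size and collect the
    label-wise classification reports for comparison."""
    reports = {}
    for window_size in sizes:
        win_x, win_y = make_windows(samples, labels, window_size=window_size)
        # Flatten each window of readings into a single feature vector.
        features = np.asarray([np.ravel(w) for w in win_x])
        x_train, x_test, y_train, y_test = train_test_split(
            features, win_y, test_size=0.2, random_state=42
        )
        model = RandomForestClassifier(n_estimators=100, random_state=42)
        model.fit(x_train, y_train)
        reports[window_size] = classification_report(y_test, model.predict(x_test))
    return reports
```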

DISCUSSION
In some cases, simply looking at the model's accuracy seems to overestimate the model's effectiveness. Thus, it is important to consider a variety of different performance metrics to gain a reliable overview of how the model actually performs. Next, we wish to highlight some of the misleadingly one-sided pictures painted by the collected metrics, along with other metrics that fill in the missing pieces and reveal the real effectiveness of the model.
Often, most of the calculated metrics indicated that the trained model was good (with metric values of 0.80-0.95), even if the label-wise results disagreed. One example of these misleading metrics is shown in Figure 5. Based on the confusion matrix, the model is clearly not a good one. The 'right' label is predicted correctly well enough (representing 1331 predictions, which is the highest value in the matrix), but there are a lot of misclassifications, e.g., 1040 for 'straight' classified as 'right' and 856 for 'left' classified as 'straight', and the diagonal from top-left to bottom-right is not the most represented. Fortunately, the low label-wise metrics raise suspicion about the model's effectiveness.
Another finding is that models trained with driving data and tested with stationary data often steered towards predicting 'straight' and 'left'. One example is an LR model that classified only one sample as 'right', and even that one was actually 'left'. There were also often generalizability issues related to models trained with the imbalanced driving dataset, as the models learned to predict 'straight' and neglected the two other labels. These models may have fared well against the matching test set, but the results suffered when an undersampled driving or stationary dataset was used. What we learned is that mismatching the training and testing datasets' preprocessing (e.g. using stationary data to learn about driving data, or processed data to learn about unprocessed situations) does not improve the model results but, on the contrary, increases the number of misclassifications.
After experimenting with data sequencing and different window sizes, it seems that grouping samples into sequences does not improve the models' prediction accuracy. The metric scores seemed to go down as the window size was increased, but the confusion matrices maintained mostly the same proportions and distributions. The window size of five data samples per training dataset sample provided the most optimal models with a reasonable amount of generalizability. However, the non-sequenced models seemed to provide similar, if not better, results. Sequencing might have a more positive impact on the models if the dataset were larger, as then the reduced number of samples would become more negligible.

CONCLUSIONS
To summarise, our results show that a small ear-worn device, i.e. an earable with accelerometer and gyroscope sensors, can be used to reliably detect the head pose in three categories: left, right, and straight. The methods utilised in this study are more lightweight than the traditional video-based techniques used for head orientation measurement: first, the data consists of significantly fewer points than a video stream, and second, the machine learning models used are generally lightweight yet effective. This information will be useful for designing applications that help the driver to detect blind spots by reminding them of the importance of sufficient head movement when driving a vehicle, or for teaching such behaviour.
In our study, we highlight the importance of data preprocessing and of understanding the prediction results beyond accuracy. We show that imbalance in the data is a crucial factor leading to unsuccessful results and propose methods to correct the matter. At the same time, we show that in this sensor case, the window size does not cause a significant difference in the prediction results. In addition to looking at model accuracy, we show how different metrics, i.e. precision, recall, and F1 scores, should be utilised to understand the prediction success for the individual classes in order to avoid false negative and false positive predictions.
Finally, we highlight that the driving data from the earable is, indeed, noisier than the stationary data. However, when training and testing the models with the same type of data (i.e. the driving use case), we get equally good results as in the stationary use case. Training a model with stationary data and applying it to the driving use case, however, is not a good idea, resulting in an accuracy barely better than a random guess.

Figure 1: Procedure of data collection using an earable device.

Figure 2: Example of the data where a person was sitting still and moving their head left (later hand-labelled and coloured red) and then right (coloured blue). The green parts represent the straight pose. The three axes (x, y, and z) of the accelerometer (left) and gyroscope (right) are represented separately.
(a) RFC undersampled and sequenced (window size 5) driving dataset on an imbalanced model. (b) RFC undersampled driving dataset on an imbalanced model.

Figure 3: Confusion matrices of two example cases with the RFC algorithm. The y-axis shows the true label, and the x-axis the predicted label.

Figure 5: An example confusion matrix of a "bad" model.

Table 2: Summary of the training results.