Using touch sensor and vision feedback to adapt skewering strategy for robust assistive feeding

Assistive feeding using a robotic arm can help people become more independent and reduce caregiver burden. The ability to skewer different types of food is a key requirement for assistive feeding. State-of-the-art methods have shown promising results by learning food skewering strategies based on vision and force-torque sensor feedback. However, force-torque sensors are expensive and have a lengthy and complicated fabrication process. In this work, we demonstrate how MWCNT/PDMS-based tactile sensor arrays, developed by the Leong Research Group, which are much cheaper and easier to fabricate, can be used for learning food skewering strategies. We first show how the sensors can be calibrated to be sensitive to different food textures. We then create a touch sensor and vision feedback dataset for skewering different foods and show that a neural network trained on this dataset can learn food skewering strategies with a much higher success rate than naive skewering strategies, comparable to the strategies learned using force-torque sensors. Overall, our system presents a much cheaper alternative for assistive feeding while providing similar accuracy.


INTRODUCTION
Increased independence with activities of daily living (ADL) can significantly enhance the quality of life for the elderly and people with disabilities [20]. One important ADL is self-feeding, a difficult task for those who experience reduced fine motor skills, such as stroke patients and people with Parkinson's disease, spinal cord injuries, and several other medical conditions. Robot arms are commonly used in assistive feeding applications [11]. The ability to skewer different types of food is one of the key requirements for assistive feeding. In this work, we focus on food skewering using a fork, which is challenging due to the need to adapt the skewering strategy to the texture of the food. For example, a vertical skewering approach is suitable for hard foods like apples, while an angled skewering approach is better suited to softer foods like bananas.
Most prior works [10, 2, 3, 16, 4, 21] have used force-torque (FT) sensors to estimate the texture of the food and determine the skewering strategy, with promising results. Most recently, Sundaresan et al. [21] skewered a variety of foods based on FT sensor and camera feedback obtained during an initial 26 ms probe. They train a neural network called HapticVisualNet, which takes as input the image of the food and the FT sensor reading during the initial probe and outputs whether to skewer using a vertical or an angled approach.
While results with FT sensors have been promising, FT sensors are very expensive and have a very lengthy fabrication process, making current assistive feeding systems unaffordable. Works like Song et al. [17] have attempted to use GelSight touch sensors instead of FT sensors for assistive feeding but have reported difficulties such as high hysteresis and unstable gripping. In general, compared to FT sensors, touch sensors have noisy feedback and require careful calibration. To the best of our knowledge, no prior work has used touch sensors in a real-time assistive feeding application.
In this work, we present a real-time assistive feeding system that adapts its skewering strategy based on feedback from a camera and touch sensors. We use an MWCNT/PDMS-based tactile sensor array developed by the Leong Research Group. Similar MWCNT/PDMS-based tactile sensors have been reported previously [23] but have not been used in an assistive feeding application. The flat PDMS surface of our sensors provides sufficient frictional force to counter the shear force exerted by the fork, which is needed for stable skewering. We first calibrate these touch sensors so that they are sensitive to the forces experienced when skewering different foods. We then follow [21] and create a touch sensor and vision feedback dataset for training HapticVisualNet, using our touch sensors instead of FT sensors. Experiments on real food with our trained model show a success rate much higher than naive skewering strategies and comparable to FT sensors.

Related Work
Assistive feeding has been widely studied, and many works focus on different aspects of feeding, such as detecting food [14, 10, 6], scooping food [12, 18], skewering food [3, 21], and putting food in the mouth of the person [16, 4]. In this work, our main focus is on skewering food using a fork.
Food skewering.
As discussed in Section 1, most food skewering approaches [10, 2, 3, 16, 4, 21] have used force-torque (FT) sensors along with camera feedback to adapt skewering strategies. The key methodology in all these approaches is to train a model on an offline dataset in a supervised manner, mapping haptic feedback from FT sensors either to object properties or directly to the robot action. Recent work by Sundaresan et al. [21] has shown the most promising results. We follow a similar approach but use touch sensors instead of FT sensors and show that real-time skewering can be done with touch sensors as well.
Touch Sensors.
Compared to FT sensors, touch sensors usually provide haptic feedback based on resistance or capacitance [1, 5]: a change in pressure causes a measurable change in resistance or capacitance. Touch sensors have been used for grasping and manipulation of rigid objects [9, 7, 19] and for direct grasping of soft food [15]. However, for soft deformable objects like food, they have not yet been integrated into a full skewering system, which needs to measure shear force, owing to noisy output, high hysteresis, and unstable gripping [17]. In this work, we use an MWCNT/PDMS-based piezoresistive tactile sensor array developed by the Leong Research Group for haptic feedback (Fig. 2), with coplanar electrodes and isolated conductive layers to reduce inter-taxel cross-talk. The MWCNT improves the conductivity of the composite while the PDMS imparts mechanical stability.

SYSTEM OVERVIEW
Fig. 1 gives an overview of how our system works. In the following sections, we explain the important details of our system.

Touch Sensor Placement and Calibration
We use the xArm6 [22], a 6-DoF robotic arm, for skewering food. One MWCNT/PDMS-based piezoresistive tactile sensor array with 6 channels is placed on the left inner surface of the xArm6 gripper (Fig. 2). With this placement, all 6 channels on the tactile sensor array maintain contact with the fork.
Each channel in the tactile sensor array provides a voltage reading. We take the average of the readings from the 6 channels as the touch sensor feedback. Since we are using a piezoresistive sensor, the range and sensitivity of the voltage reading can be tuned by varying the resistance in the voltage divider circuit (see Fig. 3). To make sure that our touch sensors can differentiate between different types of food, we tried 3 different resistor values in the voltage divider circuit and measured the voltage readings as different amounts of shear force were generated. To generate different shear forces, we made the gripper grasp the hook of weights of varying values. The plot in Fig. 3 shows the averaged ADC output of the six channels over the first 10 ms (data collected at a frequency of 1 kHz) as the weights are varied from 0 to 1 kg. The 10 kΩ resistor provided the greatest range and sensitivity and was therefore chosen.
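For concreteness, the listing below sketches how such an averaged reading could be computed. This is a minimal sketch, not our driver code: the acquisition helper `read_adc_frame` is a hypothetical placeholder for the actual ADC interface.

```python
import time
import numpy as np

NUM_CHANNELS = 6   # taxels on the array in contact with the fork
WINDOW_S = 0.010   # 10 ms averaging window used for the calibration plot

def read_adc_frame() -> np.ndarray:
    """Hypothetical driver call: one ADC sample per channel, shape (6,)."""
    raise NotImplementedError

def averaged_touch_reading(window_s: float = WINDOW_S) -> float:
    """Average all channels over the sampling window (sampled at ~1 kHz)."""
    frames = []
    t_end = time.monotonic() + window_s
    while time.monotonic() < t_end:
        frames.append(read_adc_frame())
    return float(np.stack(frames).mean())  # mean over time and channels
```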

Acquisition of haptic and vision feedback
After calibrating the touch sensors, the next step is to collect the vision and haptic feedback. For HapticVisualNet, the vision feedback is the RGB image taken from the camera sensor mounted on the gripper when the gripper is over the food; Fig. 1 shows an example of one such image. We call it the pre-contact RGB image. The haptic feedback is the touch sensor reading while probing the food item. Following [21], we set the probing time to 26 ms post contact with the food.
For acquiring haptic feedback, one crucial requirement is determining contact with the food. The touch sensors cannot be used for this because the probing time is only 26 ms: by the time contact is detected from the touch sensor values, the probing window would already be over. Thus the contact point needs to be determined before the actual contact. This is done using object detection on the pre-contact RGB image. We detect the food bounding box in the pre-contact RGB image using RetinaNet [13] pre-trained to detect food [8]. The lowest depth value in that bounding box (discarding any value less than 15 cm in depth, so as not to include the depth of the occluding fork) is taken as the skewering start depth at which contact will begin. Once the fork reaches that depth, we start collecting the touch sensor reading for 26 ms. Fig. 1 shows an example of the haptic feedback collected during probing.
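As an illustration, the depth-based contact estimate can be computed as below. The sketch assumes `depth` is a metric depth image (in metres) aligned with the pre-contact RGB image and that the bounding box arrives as pixel coordinates; both are our assumptions about the interface, not specified in the text.

```python
import numpy as np

MIN_VALID_DEPTH_M = 0.15  # discard closer returns such as the occluding fork

def skewering_start_depth(depth: np.ndarray, bbox: tuple) -> float:
    """Lowest valid depth inside the detected food bounding box."""
    x1, y1, x2, y2 = bbox
    roi = depth[y1:y2, x1:x2]
    valid = roi[roi >= MIN_VALID_DEPTH_M]
    if valid.size == 0:
        raise ValueError("no valid depth inside the bounding box")
    return float(valid.min())  # depth at which contact is expected to begin
```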

Inputs and Outputs.
HapticVisualNet is the neural network model used in [21]. It takes as input the pre-contact RGB image and the haptic feedback and outputs whether a vertical or an angled skewer should be used to skewer the food with a fork. The vertical skewering strategy continues probing directly downwards before bringing the food to a pre-determined feeding position.
For the angled skewering strategy, the fork is rotated about the pitch axis before being brought to the feeding position.
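A hedged sketch of how the two strategies could be executed with the xArm Python SDK is shown below; the probe depth, pitch angle, and feeding pose are illustrative assumptions, not the values used in our system.

```python
from xarm.wrapper import XArmAPI

PROBE_DEPTH_MM = 30    # assumed downward probe distance
ANGLED_PITCH_DEG = 25  # assumed pitch rotation for the angled skewer

def execute_skewer(arm: XArmAPI, strategy: str, feed_pose: dict) -> None:
    """Execute the chosen skewering strategy, then move to the feeding pose."""
    if strategy == "angled":
        # rotate the fork about the pitch axis before completing the skewer
        arm.set_position(pitch=ANGLED_PITCH_DEG, relative=True, wait=True)
    # both strategies finish the skewer by continuing downwards
    arm.set_position(z=-PROBE_DEPTH_MM, relative=True, wait=True)
    arm.set_position(**feed_pose, wait=True)  # bring food to feeding position
```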

Network architecture.
HapticVisualNet is based on the ResNet-50 architecture. To extract features from the sensor inputs, the network uses a combination of convolutional neural networks (CNNs) and fully connected layers. The CNN layers extract spatial features from the pre-contact RGB image, while the fully connected layers capture temporal information from the haptic feedback.
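Based only on this description, a minimal PyTorch sketch of such an architecture might look as follows; the hidden sizes, the 26-sample haptic input length, the fusion scheme, and the class ordering are our assumptions, not the published implementation.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class HapticVisualNetSketch(nn.Module):
    """Sketch: ResNet-50 image branch + FC haptic branch, fused for a
    binary vertical-vs-angled decision (class order assumed)."""

    def __init__(self, haptic_len: int = 26):
        super().__init__()
        backbone = models.resnet50(weights=None)
        backbone.fc = nn.Identity()          # expose 2048-d visual features
        self.visual = backbone
        self.haptic = nn.Sequential(         # FC layers over the probe window
            nn.Linear(haptic_len, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
        )
        self.head = nn.Linear(2048 + 64, 2)  # vertical vs. angled logits

    def forward(self, image: torch.Tensor, haptic: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.visual(image), self.haptic(haptic)], dim=1)
        return self.head(fused)
```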

Training.
To train HapticVisualNet, we collect data from four types of fruit: apple, banana, mango, and plum. We skewer each fruit type 8 times and record the pre-contact RGB image and the haptic feedback while probing. To label each sample as vertical or angled skewer, we follow [21] and label data from the hard fruit, i.e. apple, as vertical skewer and data from soft fruits, i.e. banana, mango, and plum, as angled skewer. The skewering results using the naive strategies of always using a vertical or always using an angled skewer, presented in Table 1, support this classification of hard and soft foods. To ensure the samples were representative of proper fork usage, data was discarded and re-collected if fewer than two tines of the fork contacted the food. To mitigate overfitting and expand the training data without time-consuming data collection, we employed data augmentation techniques following [21]: image mirroring, contrast changes, and colour modifications to artificially create new samples from existing data. This increased the diversity and variability of the samples and helped the model learn more generalized features, improving performance without additional data collection effort, as shown in Table 2.
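A minimal sketch of such an image augmentation pipeline with torchvision is given below; the specific jitter parameters are illustrative assumptions, and the haptic feedback is left unaugmented.

```python
import torchvision.transforms as T

# mirroring, contrast and colour changes applied to the pre-contact RGB image
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                 # image mirroring
    T.ColorJitter(brightness=0.2, contrast=0.3,    # contrast changes
                  saturation=0.2, hue=0.05),       # colour modifications
    T.ToTensor(),
])
```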

EXPERIMENTATION
To evaluate the performance of our learned model, we conducted experiments with real food. Our hypothesis is that training HapticVisualNet on touch sensor data for food acquisition will lead to better performance than a naive strategy of always using a vertical or an angled skewer.

Setup and Protocol
The same fruit types used for training, i.e. apple, banana, plum, and mango, were also used for testing; however, the fruit instances were different and could vary in hardness. The foods were sliced into bite sizes of approximately 3 cm in diameter. During each trial, the food is first placed at a fixed position below the gripper and the pre-contact RGB image is acquired. The skewering depth is then determined and the food is approached. When the fork reaches the skewering depth, haptic feedback is collected for the probing time of 26 ms. The image and haptic feedback are then sent to the trained HapticVisualNet model, which outputs whether to use a vertical or an angled skewering strategy. The robot arm then executes the strategy and brings the food to a feeding position. Fig. 4 shows a sample trial.
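Putting the protocol together, a single trial can be summarised in code. This is glue-level pseudocode around the pieces sketched earlier: `camera.capture`, `detect_food`, `descend_to`, `collect_haptic_window`, and `rgb_to_tensor` are all hypothetical helpers, and the class ordering of the logits is assumed.

```python
import torch

def run_trial(arm, camera, model, feed_pose) -> str:
    rgb, depth = camera.capture()                     # pre-contact images
    bbox = detect_food(rgb)                           # RetinaNet food detection
    start_depth = skewering_start_depth(depth, bbox)  # see earlier sketch
    descend_to(arm, start_depth)                      # approach until contact
    haptic = collect_haptic_window(0.026)             # 26 ms probe at 1 kHz
    with torch.no_grad():
        logits = model(rgb_to_tensor(rgb), haptic)
    strategy = "vertical" if logits.argmax().item() == 0 else "angled"
    execute_skewer(arm, strategy, feed_pose)          # see earlier sketch
    return strategy
```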
A trial was considered successful if the food was skewered intact and brought to a feeding position, and unsuccessful if the food slipped off the fork or was moved out of the way during the skewering attempt. To ensure that the evaluation reflects the performance of the sensor under test, any failures due to perception, such as an incorrect depth image input or incorrect bounding box generation, were discarded and the trial was repeated. 10 trials were then performed for each fruit type using our proposed approach.

Results and Discussion
Table 3 shows the success rate with the HapticVisualNet-generated strategy and Table 1 shows the results with the naive skewering strategies. The success rate of the HapticVisualNet strategy (80%) is much higher than that of always using a vertical skewer (38.75%) or always using an angled skewer (61%). Even the strategy of using a vertical skewer for the hard fruit (apple) and an angled skewer for the soft fruits (banana, mango, and soft plum) achieves only 71%, still less than the success rate obtained using HapticVisualNet. This shows that touch sensor and camera feedback is useful in determining the skewering strategy and that a model trained on the collected data can learn such a strategy.

Comparison with FT sensors.
In Sundaresan et al. [21], the observed success rate on the food class of assorted fruits and vegetables is 90%. We observe this success rate with apples; however, our success rate on softer foods, such as banana, mango, and soft plum, is slightly lower. One reason for our marginally lower success rate could be our relatively small training dataset.

Importance of haptic feedback.
The trained model successfully outputs the distinct skewering strategies required for plum and apple. Plum and apple are visually similar but have different hardness. This shows that the model is not classifying the fruits based on appearance alone and that haptic feedback plays an important role in determining the skewering strategy.

Generalization to unseen fruit instances.
Although the fruit types were the same during testing, the fruit instances were different. An interesting case is the mango: the mango at test time was a bit unripe and thus harder. HapticVisualNet successfully chose the vertical skewering strategy for the unripe mango during testing.

CONCLUSION AND FUTURE WORK
We have presented a real-time food skewering system based on camera and touch sensor feedback, which leverages this feedback to give better results than naive skewering strategies and is comparable to FT sensor-based systems. Our system thus provides a cheaper alternative to existing systems that use FT sensors. In the future, we would like to extend our system to more varieties of food by gathering more training data. We would also like to build this into a complete system by adding advanced food recognition capabilities so that food can be detected on a full plate and then skewered successfully.

Figure 1: Overview of the real-time food skewering system.

Figure 3: Characterisation of resistive touch sensors (left), and placement in voltage divider circuit (right).

Table 1: Success rates using a vertical vs. angled strategy.

Table 2: Training results pre- and post-data augmentation.

Table 3: Success rates using HapticVisualNet-generated strategies.