A Real-life Chinese Dishes Recognition System Evolved from Full Training to Transfer Learning and Domain Adaptation

This study presents a real-life Chinese dishes recognition system. To enhance prediction accuracy, the training strategy of the system evolves from full training to transfer learning and domain adaptation. Firstly, a Chinese dishes database with 28 types, 16,904 images and 45,061 instances is collected. Secondly, five networks pre-trained on Microsoft COCO are transferred to this specific task, and the network leading to the best results is selected as the backbone of the dishes recognition system. Thirdly, the backbone network trained with full training is compared to that trained with fine-tuning. Fourthly, domain adaptation using contrastive-learning-based unpaired image-to-image translation from Japanese dishes (UEC Food100) is considered for improving the backbone performance. Extensive experiments suggest that transfer learning benefits Chinese dishes recognition by fine-tuning hidden parameters, while domain adaptation remains challenging due to high data dependency and heavy time consumption. Meanwhile, ≥ 200 instances per dish type should be prepared for upgrading the menu list of the prototype system. In conclusion, transfer learning is promising for improving real-life Chinese dishes recognition, and domain adaptation requires further investigation.


INTRODUCTION
Dishes recognition is meaningful yet difficult in the multimedia community [25]. It has the potential to assist in cuisine recommendation, calorie monitoring, nutrition therapy and personalized health management. On the other hand, the task is challenging. Apart from a diverse range of ingredients shown in different shapes, colors and textures, the cooking styles, procedural attributes and serving environments may also vary.
Numerous methods have been proposed for accurate dishes recognition [25]. According to the backbone workflow, these methods can be grouped into multi-stage and one-stage categories. In a multi-stage method, feature extraction and dishes classification are required. Representative features include visual appearance, ingredient texture, restaurant localization modeling, serving menu analysis, and other external messages. For instance, a dish recognition method using a support vector machine combines food shape, color, size and texture information [22], and messages of dish types, restaurants and geolocalized settings are collected for dish recognition [27]. It is important to note that multi-stage methods often require considerable time and expert knowledge for the selection of proper features and classifiers.
The field of dishes recognition has been advanced by one-stage deep learning methods, which fold the multi-stage procedure into an end-to-end optimization problem [1,11,13,23,24]. For instance, a multi-task learning procedure is designed by exploring the semantic relationship among dish categories for smooth dish recognition [26]; a modified YOLO [23] combined with feature maps and multi-task learning is proposed for dish detection and calorie estimation [4]; a multi-attention neural network using multi-scale guidance of ingredient analysis is implemented for sequential localization of multiple informative food regions [19]; and a neural network embraces both handcrafted and deep learnt features as well as local and global features for dish image recognition and dish health assessment [6]. Overall, to effectively train a one-stage approach, a large number of dish samples is indispensable for hyper-parameter optimization.
A dozen dishes recognition systems have been implemented on smartphones or in the cloud for real-world applications. As to multi-stage approaches, a smartphone system combines a linear support vector machine, bounding box adjustment and food region approximation for personalized health management [9], and a practical system deployed in an embedded environment integrates dishes position localization, dishes category re-identification and dishes attribute estimation for balancing recognition accuracy and time consumption [3]. As to one-stage approaches, NutriNet, designed for recognizing food and drink images, is evaluated on images captured by hand-held cameras and shows promising performance [18], and a recognition application utilizes a mean classifier and deep features for sequential personalized food logging and diet monitoring [7].
Despite the implementation of numerous dish recognition algorithms and food logging systems, several issues still need to be addressed. Firstly, dishes served in real-life circumstances have received less attention, since the majority of data samples in public databases are high-quality and high-resolution web recipe and menu pictures [2,15,16]. Secondly, most existing dishes databases focus on Western, Japanese, or miscellaneous food categories [15-17], and insufficient attention is paid to Chinese dishes analysis [10]. The ChineseFoodNet [2] and the Food2K database [20] are the pioneering and only large-scale Chinese food image databases.
Real-life Chinese dishes recognition poses significant challenges. Beyond the common issues, lighting changes, imaging noise, and serving circumstances pose considerable difficulties for dishes recognition. In this study, a real-life Chinese dishes recognition system is presented. Firstly, a Chinese dishes database with 45,061 instances is prepared. Secondly, a one-stage dishes recognition system is deployed, and its training strategy follows a long-term evolution from full training to transfer learning and domain adaptation. Extensive experiments have been conducted, which may provide valuable insights for further investigation of dishes image recognition tasks.

MATERIALS AND METHODS

The prototype system
Figure 1 shows the prototype system deployed in a Chinese restaurant. It comprises a video recording system, a deep learning-based dishes recognition system, and a user interface for dish-sale analysis. The video recording system is equipped with a compact camera that acquires low-resolution images ([640,480,3]) at a rate of 10 frames per second. The dishes recognition system embeds a well-trained deep network. Specifically, the system is linked to food prices and is useful for automating charging. In addition, the user interface of the system enables analysis of dish sales and facilitates efficient management of the canteen.

The real-life Chinese dishes database
An annotated database of Chinese dishes has been prepared, consisting of 28 common dish types, 16,904 images and 45,061 instances. The database will be made available online for the further development, comparison, and reproducibility of algorithms for the recognition of dishes (or Chinese dishes). The data collection and preparation can be divided into three steps.
Data cleaning. A total of 30 hours of daily dish-sale videos were recorded. Among the mathematical methods considered, cosine similarity was found to be the most suitable for identifying background or high-similarity images. Then, images were divided into 15 subsets according to the shooting date, and Baidu AI EasyData was used to eliminate images with motion blur, focus blur, noise, and high-similarity content, thereby ensuring the visual image quality.
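The cosine-similarity screening described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: each frame is assumed to be flattened into a numeric vector, consecutive frames are compared, and the threshold of 0.98 as well as the function names are hypothetical, since the source does not report the features or the cutoff that were used.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors (assumption)
    return dot / (norm_a * norm_b)


def drop_near_duplicates(frames, threshold=0.98):
    """Keep a frame only if it differs enough from the last kept frame.

    `threshold` is a hypothetical value, not taken from the paper.
    """
    kept = []
    for frame in frames:
        if kept and cosine_similarity(kept[-1], frame) >= threshold:
            continue  # too similar to the previous kept frame, skip it
        kept.append(frame)
    return kept


# Toy "frames" (flattened pixel vectors), for illustration only.
frames = [
    [10, 10, 10, 10],  # empty background
    [10, 10, 10, 11],  # nearly identical to the previous frame
    [200, 30, 5, 90],  # a tray with dishes enters the view
]
kept = drop_near_duplicates(frames)
```

In this toy run the second frame is discarded because it is almost collinear with the first, while the third frame is kept.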
Image annotation. Image annotation was conducted using Baidu AI EasyData. About 30% of the images were manually labelled, with each category having more than 10 instances annotated. The images with annotated instances were uploaded to the platform for training the embedded model, and the model was then used to label the remaining images automatically.
Quality assurance. After annotation, many images were found to contain no dishes or dishes in plastic bags. Therefore, we manually removed these useless or irrelevant images to guarantee the annotation quality of the Chinese dishes database.

Experiment design
Experiment design encompasses the backbone selection among five candidate deep networks, the training strategies of full training, fine tuning and domain adaptation, and the parameter settings.
Candidate backbone networks. Five off-the-shelf deep networks are examined for one-stage dishes recognition. In brief, YOLO associates isolated bounding boxes with class probabilities [23], SSD categorizes the output bounding boxes into different scales and aspect ratios for object scoring [13], Faster-RCNN predicts object boundaries and objectness scores at potential positions [24], RetinaNet develops the focal loss and addresses data imbalance by down-weighting the loss of easy examples [11], and Cascade-RCNN is composed of a series of detectors trained with increasing intersection-over-union thresholds to enhance the quality of hypotheses [1].
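Intersection over union (IoU), the overlap measure on which Cascade-RCNN relies, can be illustrated with a short sketch. The box format `(x1, y1, x2, y2)`, the helper names and the three stage thresholds are assumptions made for this example; they are not settings reported in this study.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so that disjoint boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_positive(proposal, ground_truth, threshold):
    """A proposal counts as a positive sample if its IoU reaches the threshold."""
    return iou(proposal, ground_truth) >= threshold


proposal = (0, 0, 10, 10)
ground_truth = (2, 0, 12, 10)

# Illustrative stage thresholds (assumption): each later stage is stricter.
stage_thresholds = (0.5, 0.6, 0.7)
labels = [is_positive(proposal, ground_truth, t) for t in stage_thresholds]
```

Here the proposal overlaps the ground truth with an IoU of 2/3, so it is a positive sample for the first two illustrative stages and a negative one for the strictest stage.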
Training strategies. Three training strategies are considered. (a) Transfer learning is applied primarily due to its success in the field of artificial intelligence [29]. In this study, the networks pre-trained on the Microsoft COCO database [12] are fine-tuned on the training set of dishes images with no parameters frozen. (b) Full training is also explored. The hidden parameters in the deep networks are fully trained and optimized on the training set of dishes images. (c) Domain adaptation follows the semi-supervised domain adaptive workflow [28] for cross-domain dishes recognition. It obtains instance-level features by knowledge distillation and remedies image-level differences using contrastive-learning-based unpaired image-to-image translation, which transfers scene style to generate pseudo images across domains [21]. Notably, an intuitive consistency loss is designed for improving cross-domain prediction alignment and generalization performance.
Parameter settings. Table 1 lists the batch size, base learning rate, weight decay, and weight decay coefficient used in the model fine-tuning procedure. The learning rate decay type is PiecewiseDecay, and the decay is scheduled at the 25th, 30th and 40th epochs. The optimizer used is Momentum, except for SSD, which uses Adam. If not specified, the parameters of the deep networks are set to their default values. It should be noted that in the full-training procedure, except for one parameter set to 5 × 10⁻³, the parameters are the same as those defined in the fine-tuning procedure.
Experimental design. To compare the performance of deep networks under different training strategies, the experiments are repeated 10 times. In each run, the images in the ChinaDishSet database are divided into a training set with roughly 80% of the images (13,000) and a testing set with the remaining images (3,904). The performance of the deep networks is evaluated with the classification accuracy, recall, and mean average precision (MAP).
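The repeated train/test partition can be sketched as below. The image counts come from the text; the use of one random seed per run, plain (non-stratified) shuffling and the function name `split_indices` are assumptions, because the source does not describe how the partitions were drawn.

```python
import random


def split_indices(n_images, n_train, seed):
    """Shuffle image indices and split them into train and test lists."""
    rng = random.Random(seed)  # one seed per run (assumption)
    indices = list(range(n_images))
    rng.shuffle(indices)
    return indices[:n_train], indices[n_train:]


N_IMAGES = 16_904  # images in the database
N_TRAIN = 13_000   # training images per run
N_RUNS = 10        # number of repeated experiments

splits = [split_indices(N_IMAGES, N_TRAIN, seed) for seed in range(N_RUNS)]

# Fraction of images that ends up in the training set.
train_fraction = N_TRAIN / N_IMAGES
```

With these counts the training set holds about 77% of the images, which is close to the nominal 80% quoted in the text.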

System implementation
The algorithms of the dishes recognition system are implemented with PyTorch (version 1.

The Chinese dishes database
After removing the categories with ≤ 200 instances, 28 categories remain. Figure 2 shows representative instances. Due to variations in lighting conditions, shooting time, food ingredients, cooking procedure and background, the visual quality of dish images varies in real-life image acquisition.
The ChinaDishSet database contains 45,061 instances. Among the items, rice is the most consumed, with a total of 6,496 instances, followed by scrambled eggs with cucumber. Comparing the most frequent category (rice, 6,496 instances) with the least frequent one (dumpling, 244 instances) gives a maximum imbalance ratio of 26.62.
The average number of dish instances is 1,609 per category. In the data cleaning stage, categories with fewer than 200 instances have been set aside for further database upgrading. In a Chinese restaurant, the dish categories may change as chefs join or leave, and as seasonal food supplies vary. On the other hand, setting aside the categories with a small number of samples helps to prevent data imbalance, which may negatively affect model training.
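The two summary statistics quoted above follow directly from the reported counts, as the short check below shows. Only the rice and dumpling counts are given in the text, so the dictionary holds just these two extremes; the helper name `imbalance_ratio` is introduced for illustration.

```python
def imbalance_ratio(counts):
    """Ratio between the largest and the smallest per-category instance count."""
    return max(counts.values()) / min(counts.values())


TOTAL_INSTANCES = 45_061
N_CATEGORIES = 28

# Only the most and least frequent categories are reported in the text.
extremes = {"rice": 6_496, "dumpling": 244}

ratio = imbalance_ratio(extremes)
mean_per_category = TOTAL_INSTANCES / N_CATEGORIES

print(f"maximum imbalance ratio: {ratio:.2f}")
print(f"mean instances per category: {mean_per_category:.0f}")
```

The printed values are 26.62 and 1609, matching the figures stated in the text.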

The backbone determination
Figure 3 shows MAP values at different confidence thresholds for dishes classification. When the threshold value is set to 0.20, all the networks obtain good performance (MAP ≥ 0.80). Even if the threshold is set to 0.0, both Faster-RCNN and Cascade-RCNN achieve MAP > 0.85. As the confidence threshold changes, Faster-RCNN and Cascade-RCNN achieve consistently superior MAP values, followed by YOLOV3. In addition, RetinaNet obtains satisfactory results when the confidence threshold is set to 0.40, while the MAP values of the SSD network remain no larger than 0.90.
Table 2 summarizes the results for the backbone selection. It shows that Cascade-RCNN, RetinaNet, and Faster-RCNN achieve promising outcomes, followed by YOLOV3 and SSD. Notably, Faster-RCNN exhibits the fastest run-time at 146 ms, and all the networks maintain a time cost of around 200 ms, which meets the real-time requirement in real-life scenarios. To balance accuracy, recall, MAP, and time cost, Faster-RCNN is chosen as the backbone of the prototype system for Chinese dishes recognition.
Table 4 shows the recognition results using domain adaptation from UEC Food100 to this Chinese dishes recognition task. As the number of pseudo-label generation iterations increases from 10 to 90, the recognition performance improves from 61.24% to 82.78% (MAP value). This indicates the feasibility of cross-generation of pseudo images for improving dishes recognition performance. On the other hand, unpaired image-to-image translation costs ≈ 3 hours per iteration. As shown in Tables 2, 3 and 4, with Faster-RCNN as the backbone of the Chinese dishes recognition system, fine-tuning yields the best performance with MAP 98.30%, followed by full training (87.84%) and domain adaptation (82.78%). Notably, domain adaptation is highly time-consuming.
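To make the MAP-versus-confidence-threshold evaluation of Figure 3 concrete, the sketch below computes an all-point interpolated average precision per class after discarding detections below a confidence threshold, then averages over classes. It is a simplified illustration: detections are assumed to be already matched to the ground truth, the toy scores are invented for the example, and the exact evaluation protocol used in the study is not stated in the source.

```python
def average_precision(detections, n_ground_truth, conf_threshold=0.0):
    """All-point interpolated AP for one class.

    detections: list of (confidence, is_true_positive) pairs, already
    matched against the ground truth (assumption).
    """
    kept = sorted(
        (d for d in detections if d[0] >= conf_threshold),
        key=lambda d: d[0],
        reverse=True,
    )
    precisions, recalls = [], []
    true_positives = 0
    for rank, (_, is_tp) in enumerate(kept, start=1):
        true_positives += 1 if is_tp else 0
        precisions.append(true_positives / rank)
        recalls.append(true_positives / n_ground_truth)
    # Precision envelope: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Area under the precision-recall curve.
    ap, previous_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


def mean_average_precision(per_class, conf_threshold=0.0):
    """Mean of the per-class AP values."""
    aps = [
        average_precision(dets, n_gt, conf_threshold)
        for dets, n_gt in per_class.values()
    ]
    return sum(aps) / len(aps)


# Toy detections (invented): class -> (detections, number of ground-truth boxes).
per_class = {
    "rice": ([(0.9, True), (0.8, True), (0.6, False), (0.3, True)], 4),
    "dumpling": ([(0.7, True), (0.4, False)], 1),
}

map_at_0 = mean_average_precision(per_class, conf_threshold=0.0)
map_at_05 = mean_average_precision(per_class, conf_threshold=0.5)
```

On this toy input, raising the threshold removes a low-confidence true positive for rice and therefore changes the MAP value, which is the kind of dependence that a curve such as Figure 3 visualizes.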

Daily dishes recognition in the canteen
Figure 4 shows the confusion matrix of the one-day-sale dishes recognition result. A total of 7,274 dishes were served with 25 instances misclassified (accuracy, 98.10%; recall, 97.21%; MAP, 98.34%). In each category, no more than 4 samples are wrongly predicted. A closer inspection of the wrong predictions between potato_meat and zucchini_meat reveals that the ingredient composition and cooking procedure result in a similar appearance.
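How such a matrix of true versus predicted categories is turned into summary figures can be sketched as follows. The three-class matrix below uses hypothetical counts chosen for illustration (rows are true labels, columns are predictions); it does not reproduce the data behind Figure 4, and the helper names are introduced here.

```python
def accuracy(matrix):
    """Fraction of samples on the diagonal of the matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_class_recall(matrix):
    """Recall of each class: diagonal count divided by the row total."""
    return [
        row[i] / sum(row) if sum(row) else 0.0
        for i, row in enumerate(matrix)
    ]


def most_confused_pair(matrix, labels):
    """Return (true_label, predicted_label, count) of the largest off-diagonal cell."""
    best = (None, None, 0)
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            if i != j and count > best[2]:
                best = (labels[i], labels[j], count)
    return best


labels = ["rice", "potato_meat", "zucchini_meat"]
# Hypothetical counts: rows = true category, columns = predicted category.
matrix = [
    [50, 0, 0],
    [0, 27, 3],
    [0, 2, 18],
]

overall_accuracy = accuracy(matrix)
recalls = per_class_recall(matrix)
worst_pair = most_confused_pair(matrix, labels)
```

Scanning the off-diagonal cells in this way is one simple route to spotting visually similar pairs such as potato_meat and zucchini_meat.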

DISCUSSION
Accurate real-life dishes recognition is crucial in a smart canteen for food ordering, charging and payment. Despite the deployment of several systems, their performance on real user photos is still not satisfactory [18]. In this study, a prototype system (Figure 1) is designed and evaluated on daily dish images in a Chinese canteen. Specifically, two issues are explored: the selection of an appropriate backbone among five networks, and the determination of the training strategy (full training, fine-tuning, and domain adaptation).
Faster-RCNN [24] serves as the backbone of the prototype system, since it generally outperforms the other four networks [1,11,13,23] (Figure 3 and Table 2). According to Table 2, Faster-RCNN, RetinaNet, Cascade-RCNN, and YOLOV3 all achieve competitive results, while Faster-RCNN stands out for the lowest time cost in dish recognition. Notably, Faster-RCNN has also been used as a food identifier by incorporating cross-connected layers and an attention module [14]. It is worth mentioning that some networks are continuously updated to enhance their capacity.
The comparison of training strategies on the backbone network indicates that fine-tuning is more effective than full training (Table 3) and domain adaptation (Table 4). Fine-tuning adjusts the parameters of a pre-trained network, updating either a portion of the layers or the entire model [29]. It enables the pre-trained Faster-RCNN to achieve higher MAP values (≥ 96%), outperforming both full training (87.84%, Table 3) and domain adaptation (82.78%, Table 4). Table 4 shows the potential of domain adaptation in improving dishes recognition, although each iteration takes considerable time. In this study, the parameters of the pre-trained Faster-RCNN are fine-tuned using a small learning rate (Table 1) that softly transfers knowledge learnt from common objects in context to real-life dishes recognition.
Further, the impact of training size on dishes recognition indicates that 150 annotations per category are sufficient for system upgrading (Table 3). Increasing the number of training samples benefits dishes recognition. Using the fine-tuned Faster-RCNN, the MAP value increases from 91.34% to 96.48% when the number of instances in each dish type increases from 50 to 200. On the other hand, the improvement is marginal when the training size increases from 150 to 200 samples. Thus, collecting and annotating 150 dish samples can be considered acceptable for system upgrading.
The daily dishes recognition results further validate the prototype system (Figure 4). On the one hand, the system performs well in daily operations, accurately identifying the categories of dishes in the recognition list (accuracy, 98.10%; recall, 97.10%; MAP, 98.34%). On the other hand, there is a 1.90% misclassification rate in daily service, indicating the need for future work to enhance the system's capacity through more advanced techniques, workflow refinement and dishes modeling.
There are several limitations in the current study. For boosting the performance, multiple cues can be exploited, such as the analysis of visual appearance, ingredient compositions, inherent semantic relationships among fine-grained classes, food procedural attributes and external knowledge [26]. Meanwhile, novel techniques should be considered, such as multi-scale representation learning [8], local and global feature aggregation [6], pre-training using a large-scale dish database and multi-model fusion [5]. Most urgently, it is imperative to upgrade the video recording system with a high-resolution micro-camera for high-quality image acquisition.

CONCLUSION
Accurate dishes recognition helps improve dining service, nutritional intake monitoring, food retrieval and cuisine recommendation. However, a large gap exists between laboratory test photos and real user photos, hampering the deployment of many novel approaches. In this study, we simplify real-life dishes recognition by deploying a prototype system in a Chinese canteen. After the dish image database is built, practical issues related to dishes recognition are investigated. Experimental results suggest that the fine-tuned Faster-RCNN can serve as the backbone, that fine-tuning is more effective than full training and domain adaptation, and that when incorporating a new dish type into the recognition list, a minimum of 150 samples per category should be prepared. In the future, the prototype system will be deployed in other canteens, and we aim to explore the cues of visual appearance, ingredient composition and food procedure attributes, as well as the techniques of multi-scale and multi-view representation, for boosting the recognition performance.

Figure 1: A prototype system deployed in a Chinese restaurant for real-time accurate dishes recognition.

Figure 2: Representative instances of 28 categories of Chinese dishes in the ChinaDishSet database.

Figure 3: MAP values with different confidence thresholds.

Figure 4: The confusion matrix of one-day-sale results.

Table 1: The parameters used in model fine-tuning

Table 2: Chinese dishes recognition using fine-tuning

Table 3 summarizes dishes recognition results when the training data size changes. It shows that increasing training samples improves the performance. When using full training, the backbone MAP value improves from 61.46% (50 instances per category) to 87.84% (200 instances per category). Additionally, fine-tuning also leads to performance enhancement, and the MAP value stays high, from 91.33% to 96.48%. When the number of training samples per type increases from 150 to 200, the MAP value of the fine-tuned model exhibits a small increase (≈ 0.33%). This observation implies that when incorporating new types of dishes, 150 samples may be sufficient for model fine-tuning.

Table 4: The recognition performance using domain adaptation from UEC Food100 to the Chinese dishes