Rapid Image Labeling via Neuro-Symbolic Learning

ABSTRACT

The success of Computer Vision (CV) relies heavily on manually annotated data. However, it is prohibitively expensive to annotate images in key domains such as healthcare, where data labeling requires significant domain expertise and cannot be easily delegated to crowd workers. To address this challenge, we propose a neuro-symbolic approach called Rapid, which infers image labeling rules from a small amount of labeled data provided by domain experts and automatically labels unannotated data using the rules. Specifically, Rapid combines pre-trained CV models and inductive logic learning to infer the logic-based labeling rules. Rapid achieves a labeling accuracy of 83.33% to 88.33% on four image labeling tasks with only 12 to 39 labeled samples. In particular, Rapid significantly outperforms finetuned CV models in two highly specialized tasks. These results demonstrate the effectiveness of Rapid in learning from small data and its capability to generalize among different tasks. Code and our dataset are publicly available at https://github.com/Neural-Symbolic-Image-Labeling/



INTRODUCTION
Deep learning methods have shown great power in challenging computer vision tasks, such as traffic scene detection [5,19] and disease diagnosis [41,51]. These methods often require a vast amount of labeled image data to achieve good performance. Labeling these data is laborious and expensive. This challenge is exacerbated in highly specialized domains such as healthcare, where data labeling requires significant domain expertise.
Many data labeling methods have been proposed to address this challenge with a small amount of labeled data (e.g., fewer than 100 labeled samples). One mainstream line of research is to develop models for automated data labeling [8,24,29,39]. Existing approaches in this category often learn class prototypes from the training samples and infer the class of an unlabeled sample by assigning the class of its nearest prototype. To adapt these approaches to a low-resource setting, the distance between data samples is often designed to depend on task-specific information such as meta-data or other task-specific insights. However, the task-specific nature of these approaches restricts their generalizability to other tasks. Besides, designing a specific model for each task requires extensive human effort, which defeats the original motivation of saving human effort in data labeling.
To address the aforementioned limitations, a new data labeling paradigm called data programming [32] has been proposed for rapid data labeling. For example, Snorkel [31] asks domain experts to create labeling functions and uses a generative model to combine those labeling functions to produce probabilistic labels. However, those labeling functions must be written in a programming language such as Python. This requirement not only incurs the overhead of manually composing labeling functions but also imposes a steep learning curve on domain experts and end users who typically do not have any programming experience.
In this paper, we propose a new neuro-symbolic approach called Rapid for image labeling in low-resource settings (e.g., fewer than 100 labeled images). The novelty of this framework lies in synergizing the strength of neural models (i.e., handling rich, complex image data) and the strength of inductive logic learning (i.e., learning from small datasets) to handle the image labeling challenge. Specifically, Rapid automatically infers logic rules from a small amount of labeled data, applies the inferred rules to label the unlabeled data, and solicits user feedback to refine the rules iteratively.
Unlike Snorkel, Rapid infers labeling rules automatically rather than requiring users to manually construct these rules. Rapid leverages the First-Order Inductive Learner (FOIL) algorithm to infer logic rules based on the low-level visual attributes extracted by pre-trained models. This way, our approach disentangles the perception and the learning process, making it more transparent and explainable to human labelers. Furthermore, to maximize the efficiency of data usage, we develop a multi-criteria active learning method to iteratively elicit human feedback to refine the labeling rules.
We conduct extensive experiments on datasets from two highly specialized domains and two common domains. Our method achieves significantly higher labeling accuracy on the two highly specialized domains (85.52% on disease diagnosis and 86.11% on bird species labeling, respectively) compared to the baseline models. We demonstrate that by actively refining labeling rules with rapid, incremental human feedback, Rapid can effectively embed expert knowledge and achieve high image labeling accuracy with a limited amount of training data.
Overall, this work makes the following contributions:
• We proposed a new neuro-symbolic learning framework that synergizes pre-trained computer vision models with inductive logic learning for rapid image labeling with a limited amount of training data;
• We designed a new conflict-based informativeness metric for data selection in active learning;
• We conducted comprehensive evaluations on four labeling tasks from different domains with user simulation and multiple baselines.

RELATED WORK

Image Labeling
Reducing human effort in image labeling has become increasingly important in recent years with the advent of deep learning, which requires a massive amount of labeled data to train a model. There has been a continuous effort to develop automated solutions for image labeling. The basic idea is to train a model with some labeled data and automatically assign labels to new data samples without further human involvement. Among existing methods, fully automated approaches have attracted significant attention and achieved promising performance. Some methods exploit the similarity between unlabeled and labeled images [8]. Others resolve the problem of data scarcity by creating representations of images using auxiliary information such as corresponding captions [29], meta-data [24], or pseudo-labels generated by other models [39]. Despite their promising performance, these methods often lack generalizability to domains where data acquisition is challenging, as they usually require a substantial amount of data to learn the necessary knowledge. Thus, some semi-automated labeling approaches involve humans in the training or inference process to provide the information needed for training [37] or a coarse initial label [13]. Compared with existing work, our approach uses inductive logic learning to infer logic rules from a small amount of human-annotated data to label images.

Active Learning
Active learning is widely adopted to involve humans in the labeling process iteratively while minimizing the amount of data they must label. Active learning methods select the data samples that can benefit the model most in each iteration and ask humans to label them, keeping human effort to a minimum. To determine which data samples should be labeled by humans, some approaches use probability models and prioritize the data samples with high prediction inconsistency [12,14,26,52], some rely on vectorized representations from deep learning models [16,34,53], and some compute a low-rank matrix representation of both labeled and unlabeled data to calculate informativeness [45].
Though active learning can optimize data selection to reduce human effort, it often requires many iterations to achieve a reasonable labeling accuracy. Thus, it remains too expensive for labeling tasks where time is precious for domain experts such as clinicians. Snorkel [31] enables users to explicitly embed their domain knowledge by creating labeling functions. However, the labeling functions must be written either in a programming language or in special declarative functions defined by the Snorkel authors, which imposes a substantial overhead in learning the labeling function grammar. Our work combines interactive learning with an inductive logic learner, generating logic rules to classify images. The generated rules are represented with simple logic and descriptive predicates, which are easy for users to read and edit.

Neuro-Symbolic Learning
There has been a growing interest in combining neural networks with symbolic methods [1,17,23,25,49]. Here, the term neuro refers to artificial neural networks or connectionist systems, while the term symbolic refers to AI approaches that perform explicit symbol manipulation, such as term rewriting, graph algorithms, and formal logic. There are different ways of combining neural network modules with symbolic learning. Following the categorization in a recent survey [33], our method belongs to a cascading neuro-symbolic paradigm that extracts latent patterns from input data using a neural system and then feeds them into a symbolic reasoner for final prediction.
Existing approaches in this category include NS-VQA [50], NS-CL [23], and FO-SL [25]. NS-VQA [50] first parses an image into a structural scene representation with Mask R-CNN and ResNet-34 and converts a natural language question into a query program with an LSTM model. Then it uses a symbolic executor to run the program on the scene representation to obtain the answer to the given question. NS-CL [23] adopts a similar approach to NS-VQA but learns the feature vector representation of an object from question-answer pairs instead of extracting it directly with pre-trained models. FO-SL [25] represents images in first-order logic and uses a SAT solver to solve visual discrimination puzzles.
Unlike existing neuro-symbolic learning approaches in this category, we are the first to use inductive logic learning as the symbolic method for rule inference. In this way, we can explicitly model the logic of labeling rules. Furthermore, our approach is also the first to apply neuro-symbolic learning to image labeling.

Human Feedback in Inductive Logic Learning
To the best of our knowledge, our work is the first to integrate active learning with Inductive Logic Learning (ILL). Existing research in ILL has focused on improving the learning algorithm for better efficiency and scalability. Only a few studies, dating back to the 1990s, investigated the interactivity of ILL [4,9,10]. Specifically, the approaches of De Raedt et al. [9,10] have an active learning component (i.e., a data selection algorithm) to carefully rank and select data samples for user inspection and correction.

Figure 2: The overview of our image labeling approach.

METHOD
Figure 2 gives an overview of our approach, Rapid. Rapid consists of three parts: (1) a pre-trained visual attribute extractor that extracts basic, low-level visual attributes from images, (2) an inductive learner that infers logic labeling rules from the relationship between the visual attributes and the target classes, and (3) a multi-criteria data selection module that selects a small set of informative and diverse automatically labeled data for users to inspect and fix. Rapid works iteratively. It starts with a small initial set of training images labeled by human users. In each iteration, Rapid first extracts the visual attributes of each labeled image. Then, it takes the visual attributes and the corresponding class labels as input and infers a set of labeling rules for each class. The inferred rules are then applied to automatically generate labels for unannotated images. In case of conflicting labels, Rapid selects the label of the rule with the highest Clause Satisfaction Ratio (detailed in Section 3.3.2, Equation 7). Next, Rapid adopts a multi-criteria data selection strategy to compute an informativeness score for each unannotated data sample and selects a diverse set of samples for users to inspect. Users can fix the incorrect labels, and the fixes are used by Rapid to refine the labeling rules.
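To make this control flow concrete, the following is a minimal sketch of the iterative loop. The five callables are placeholders for the pipeline components; their names and signatures are illustrative assumptions, not our actual implementation.

```python
# A minimal sketch of Rapid's iterative labeling loop. The injected callables
# stand in for the pipeline components described above (all hypothetical).
def rapid_loop(initial_labeled, unlabeled, extract_attributes, infer_rules,
               apply_rules, select_samples, get_corrections,
               n_iterations=20, k=3):
    labeled = dict(initial_labeled)              # image id -> class label
    unlabeled = set(unlabeled)
    rules = None
    for _ in range(n_iterations):
        # 1. Extract low-level visual attributes with pre-trained models.
        facts = {i: extract_attributes(i) for i in set(labeled) | unlabeled}
        # 2. Infer one logic labeling rule per class from the labeled data.
        rules = infer_rules(facts, labeled)
        # 3. Auto-label the rest; conflicts are resolved inside apply_rules
        #    by picking the rule with the highest Clause Satisfaction Ratio.
        predicted = {i: apply_rules(rules, facts[i]) for i in unlabeled}
        # 4. Pick k informative and diverse samples for user inspection.
        inspect = select_samples(predicted, facts, k)
        # 5. Users (or a simulation script) fix incorrect labels; the fixes
        #    become training data for the next iteration.
        for i, label in get_corrections(inspect, predicted).items():
            labeled[i] = label
            unlabeled.discard(i)
    return rules
```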

Visual Attribute Extraction with Pre-trained Models
A visual attribute extractor processes a given image and extracts the basic, low-level attributes of the image that are useful and relevant to the labeling task. Visual attributes can be the object types in an image, the relationships between the objects, and an object's properties (e.g., size, number). This work mainly uses pre-trained perception models as the visual attribute extractors, detailed in Section 4.3. However, one can also use a traditional feature extractor such as SIFT [21] to extract visual attributes. The visual attribute extractors are designed as pluggable components in our approach: for different labeling tasks, we use different pre-trained models to extract visual attributes related to the task. This design increases the flexibility of our approach to be reused for new labeling tasks.
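As one way to realize this pluggable design, an extractor can be modeled as a small interface that any pre-trained model is wrapped behind. The class and method names below are our own illustration, not a prescribed API.

```python
from abc import ABC, abstractmethod

# A sketch of the pluggable extractor design: each labeling task plugs in a
# different concrete extractor while the rest of the pipeline stays unchanged.
class VisualAttributeExtractor(ABC):
    @abstractmethod
    def extract(self, image) -> dict:
        """Return visual attributes, e.g. {"truck": 6, "pedestrian": 0}."""

class ObjectCountExtractor(VisualAttributeExtractor):
    """Wraps any pre-trained object detector (assumed detect() interface)."""

    def __init__(self, detector):
        self.detector = detector

    def extract(self, image) -> dict:
        counts = {}
        for obj in self.detector.detect(image):   # assumed to yield objects
            counts[obj.label] = counts.get(obj.label, 0) + 1
        return counts
```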

Rule Inference via Inductive Logic Learning
Given its capability to learn from a small amount of data, we use the First-Order Inductive Learner (FOIL) [30] to infer labeling rules. Furthermore, the declarative nature of logic rules makes them easy for human labelers to understand and refine based on their domain knowledge. FOIL was originally designed to learn a logic rule over pre-defined predicates that distinguishes a set of positive examples from a set of negative examples. In our design, each predicate represents one trait of a visual attribute. The original FOIL algorithm can only infer logic rules with variables, which lacks the expressiveness for logic rules with constant values. Therefore, we extend FOIL to support the inference of constant values. This extension, however, increases the search space exponentially. To address this challenge, we design several inductive biases, such as a TF-IDF-based heuristic, to improve search efficiency (detailed in Appendix A).
Table 1 lists the primitive predicates, for example: a counting predicate (there are N objects A in the image); area(A, N) (object A has an area of N in the image); greater(N, v) (N is greater than a constant v); and smaller(N, v) (N is smaller than a constant v). In our approach, a labeling rule is defined in disjunctive normal form with n clauses, as shown below.
$$L \leftarrow c_1 \vee c_2 \vee \dots \vee c_n$$

Here, $c_1, \dots, c_n$ denote clauses and $L$ denotes the label. If at least one clause is satisfied, an image is labeled as class $L$. A clause is defined as $c = p_1 \wedge p_2 \wedge \dots \wedge p_m$, where $p_1, \dots, p_m$ are logic predicates over visual attributes. A clause is thus a conjunction of $m$ predicates; it is satisfied if and only if all of its predicates are satisfied. In this work, we design a set of primitive predicates for different kinds of visual attributes, as shown in Table 1. For example, in the traffic scene labeling task, a target image class, "highway", can be inferred based on the types of objects on the road, e.g., "trucks". An example labeling rule for "highway" images is:

highway(X) ← ¬ type(A, pedestrian) ∨ (type(B, truck) ∧ count(B, N) ∧ greater(N, 5))

where X is an input image, and A and B are objects detected by a pre-trained object detection model. This rule means that if there are no pedestrians or there exist more than five trucks, the image is classified as "highway". In practice, users can redesign the predicates, e.g., by removing irrelevant predicates or adding domain-specific predicates based on the characteristics of each labeling task, to improve the efficiency of inferring logic rules.
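To illustrate how such a DNF rule can be checked against an image's extracted attributes, here is a minimal sketch; the representation (lists of callables over an attribute dictionary) is our own illustration, not the paper's implementation.

```python
# A sketch of DNF rule evaluation: a rule is a disjunction of clauses, a
# clause is a conjunction of predicates over the extracted attributes.
from typing import Callable, Dict, List

Predicate = Callable[[Dict], bool]   # attributes -> True/False
Clause = List[Predicate]             # conjunction of predicates
Rule = List[Clause]                  # disjunction of clauses

def rule_satisfied(rule: Rule, attrs: Dict) -> bool:
    # An image gets the rule's label if at least one clause is fully satisfied.
    return any(all(p(attrs) for p in clause) for clause in rule)

# Example: "highway" if there are no pedestrians, or more than five trucks.
highway_rule: Rule = [
    [lambda a: a.get("pedestrian", 0) == 0],
    [lambda a: a.get("truck", 0) > 5],
]

print(rule_satisfied(highway_rule, {"truck": 6}))  # True
```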

Algorithm 1 Inductive Learning for Labeling Rules
Input: positive examples E+ and negative examples E− for a label; clauses that must be included (C_in) and must be excluded (C_ex)
Output: a labeling rule R
1:  R ← ∅
2:  R ← R ∪ C_in
3:  while E+ ≠ ∅ do
4:      clause ← ∅
5:      N ← E−
6:      S ← Initialize(E+)
7:      while N ≠ ∅ do
8:          clause.append(Max_Gain(S))
9:          N ← remove(N, clause)
10:     end while
11:     if clause ∉ C_ex then
12:         E+ ← remove(E+, clause)
13:         R.append(clause)
14:     end if
15: end while
16: return R

Algorithm 1 describes how to infer labeling rules with inductive learning. For each image label, our algorithm takes the set of positive examples E+ and the set of negative examples E− as input and infers a logic rule. It also allows human labelers to specify which clauses must be included (C_in) or excluded (C_ex) based on their domain knowledge. It first adds the must-include clauses to the rule R (Line 2). Then, it keeps searching for possible clauses until E+ is empty (Lines 3 to 15). If the clause does not need to be excluded (Line 11), the algorithm removes from E+ the positive examples that contain all visual attributes in the clause (Line 12) and then adds the clause to the rule (Line 13). Before the first iteration of the inner loop, the algorithm initializes a negative set N as E− (Line 5) and a set S containing all possible predicates derived from the positive examples (Line 6). To find a possible clause (Lines 7 to 10), the algorithm repeatedly selects the predicate with the maximum information gain from S and adds it to the clause (Line 8) until N is empty (Line 7); in each inner iteration, the negative examples that contain all visual attributes in the clause are removed from N (Line 9). The information gain of each predicate $p_k$ is defined as

$$Gain(p_k) = \log_2 \frac{|E^+_{k+1}|}{|E^+_{k+1}| + |E^-_{k+1}|} - \log_2 \frac{|E^+_{k}|}{|E^+_{k}| + |E^-_{k}|}$$

where $E^+_k$ and $E^-_k$ denote the sets of positive and negative examples covered before adding the new predicate $p_k$, and $E^+_{k+1}$ and $E^-_{k+1}$ denote those covered after adding $p_k$.
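The following is a minimal, runnable sketch of this FOIL-style search under simplifying assumptions: each example is propositionalized into a set of ground predicates (strings), and candidate predicates are drawn from the positive examples. Function names are our own, and the must-include/must-exclude constraints are omitted for brevity.

```python
import math

# A simplified FOIL-style learner over propositionalized examples: a sketch
# of the idea in Algorithm 1, not the paper's exact implementation.
def covers(clause, example):
    return clause <= example  # all predicates of the clause appear in the example

def gain(pred, clause, pos, neg_covered):
    new_clause = clause | {pred}
    p1 = sum(covers(new_clause, e) for e in pos)
    n1 = sum(covers(new_clause, e) for e in neg_covered)
    if p1 == 0:
        return float("-inf")  # a predicate that kills all positives is useless
    p0 = sum(covers(clause, e) for e in pos)
    n0 = len(neg_covered)
    base = math.log2(p0 / (p0 + n0)) if p0 > 0 else float("-inf")
    return math.log2(p1 / (p1 + n1)) - base

def learn_rule(pos, neg):
    pos = {frozenset(e) for e in pos}
    neg = {frozenset(e) for e in neg}
    rule = []
    while pos:  # keep adding clauses until all positives are covered
        clause, n = set(), set(neg)
        candidates = set().union(*pos)  # predicates seen in positive examples
        while n and candidates:  # specialize until no negative is covered
            best = max(candidates, key=lambda p: gain(p, clause, pos, n))
            clause.add(best)
            candidates.discard(best)
            n = {e for e in n if covers(clause, e)}
        remaining = {e for e in pos if not covers(clause, e)}
        if remaining == pos:  # no progress; stop to avoid looping forever
            break
        pos = remaining
        rule.append(clause)
    return rule

# Toy usage mirroring the "highway" example:
print(learn_rule(pos=[{"trucks>5", "road"}, {"no_pedestrian", "road"}],
                 neg=[{"pedestrian", "road"}]))
```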

Labeling Rule Refinement via Active Learning
Due to the ambiguity and incompleteness of the small amount of training data, the inductive learning module may not learn the best labeling rules in one pass. We propose to use active learning to improve the performance of inductive learning by iteratively soliciting more human labels.
In each iteration of active learning, our goal is to select the data samples with the most information and variety, so as to reduce data usage and improve performance. We propose a multi-criteria data selection strategy to achieve this goal.

Algorithm 2 Multi-criteria Data Selection
Input: unlabeled data U; the size of the intermediate set m; the number of data samples to select k
Output: the set of selected data samples S
1: Calculate the informativeness score for U
2: U ← sort U by informativeness score
3: I ← pick the top m data instances
4: S ← K-Means(I)
5: return S

3.3.1 Multi-Criteria Data Selection. Algorithm 2 describes the multi-criteria data selection process. First, it calculates the informativeness score of each unlabeled data instance and ranks the instances by their scores. Then, it selects the m most informative samples to form an intermediate set. These samples are then clustered based on similarity, and our algorithm selects the final set of k samples that are both informative and diverse. The informativeness metric and the clustering algorithm are detailed in the following subsections.

3.3.2 Informativeness. Existing informativeness metrics in the active learning literature are typically calculated based on prediction probabilities, e.g., entropy-based uncertainty [18]. Thus, they are only applicable to statistical models that output prediction probabilities. Since our approach uses logic rules, it does not produce a probability for an inferred image label.
To bridge this gap, we propose a novel informativeness metric based on the extent of labeling conflicts among labeling rules. This design is based on the insight that multiple labeling rules may generate conflicting labels for the same image, which can be leveraged to measure the uncertainty of image labeling. The more the labeling rules conflict on an image, the more information the image can bring to our model.
The informativeness of an image is primarily measured by the number of inconsistent labels it receives. To break ties among images with the same number of inconsistent labels, we further consider the extent of conflict in the unsatisfied labels ($\overline{CSR}$ in Equation 5):

$$Info(i) = N_{conflict}(i) + \lambda \cdot \overline{CSR}(i) \quad (5)$$

$$\overline{CSR}(i) = \frac{1}{|Unsatisfied(i)|} \sum_{r \in Unsatisfied(i)} CSR(i, r) \quad (6)$$

In the equations above, $i$ denotes an image and $r$ denotes a labeling rule composed of a disjunction of clauses ($r \leftarrow c_1 \vee c_2 \vee \dots \vee c_n$). Here, each rule exclusively defines a unique label. $N_{conflict}(i)$ denotes the number of conflicting labels inferred for image $i$, and $Unsatisfied(i)$ denotes the set of rules that image $i$ does not satisfy. $\lambda$ is a hyper-parameter, which is empirically set to 0.6 in our experiments. The Clause Satisfaction Ratio (CSR) measures the degree to which a single rule is satisfied, as defined in Equation (7); $\overline{CSR}$ averages the CSR over the rules that the image does not satisfy.

$$CSR(i, r) = \max_{j} \frac{\sum_{k=1}^{m_j} sat(i, p_{jk})}{m_j} \quad (7)$$

In Equation (7), $p_{jk}$ denotes the $k$-th predicate of the $j$-th clause in rule $r$, $m_j$ is the number of predicates in the $j$-th clause, and $sat(i, p)$ indicates whether image $i$ satisfies predicate $p$ (1 if satisfied, 0 otherwise).

3.3.3 Diversity. The robustness of image labeling models largely depends on the variety of labeled data. Therefore, our data selection algorithm also accounts for the diversity of images to maximize the variety of the selected group of data samples. We cluster all the unlabeled images into k clusters and choose one sample from each cluster to form the group of data to label. We use k-Means to cluster the data samples and choose the final centroids to generate a diverse group of samples. In k-Means, each image is represented as a vector, and the similarity between two images is measured by the cosine similarity of their vectors. The feature vector of an image has d dimensions, one for each of the d object types we consider for the task (or dataset); the value in each dimension is the number of corresponding objects detected in the image.
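The following is a minimal sketch of this multi-criteria selection under our reading of Equations 5 to 7. The rule objects are assumed to expose a satisfied(attrs) method and the csr callable is assumed to implement Equation (7); both are illustrative assumptions, and scikit-learn's KMeans stands in for the clustering step.

```python
import numpy as np
from sklearn.cluster import KMeans

# Sketch: rank unlabeled images by conflict-based informativeness (Eq. 5),
# keep the top-m, then pick k diverse representatives via k-Means (Alg. 2).
def informativeness(attrs, rules, csr, lam=0.6):
    unsatisfied = [r for r in rules if not r.satisfied(attrs)]
    n_conflict = len(rules) - len(unsatisfied)   # one label per satisfied rule
    avg_csr = (sum(csr(attrs, r) for r in unsatisfied) / len(unsatisfied)
               if unsatisfied else 0.0)
    return n_conflict + lam * avg_csr            # Equation 5

def select_samples(images, attrs_of, vec_of, rules, csr, m=30, k=3):
    ranked = sorted(images, reverse=True,
                    key=lambda i: informativeness(attrs_of[i], rules, csr))[:m]
    X = np.array([vec_of[i] for i in ranked], dtype=float)
    # Normalize rows so Euclidean k-Means approximates cosine similarity.
    X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-9)
    cluster = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    # One representative per cluster for diversity.
    return [ranked[int(np.argmax(cluster == c))] for c in range(k)]
```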

EXPERIMENTS
We design experiments to answer the following research questions:
RQ1: How effective is Rapid on image labeling tasks from different domains?
RQ2: How effective are the two kinds of human feedback solicited by Rapid?
RQ3: How effective is inductive logic learning compared to statistical and neural network models?
RQ4: How sensitive is our active learning algorithm to different data selection strategies?

Experiment Design & Setup
To answer RQ1, we measure the image labeling accuracy and efficiency of Rapid on four different image labeling tasks: two from highly specialized domains and two from common domains. Section 4.3 describes the four tasks and datasets. To represent the condition of learning from limited training data, for each task we bootstrap Rapid with only 3 randomly sampled labeled instances to learn the initial set of rules. In each following iteration, the active learning module selects 3 images, whose labels are corrected if wrong, to refine the learned labeling rules. The choice of 3 simulates rapid, incremental feedback from users. This process continues until the 20th iteration; thus, for each task, Rapid is trained with 60 images in total. We compare Rapid with 6 baselines. Section 4.4 describes these baselines and their training procedures.
To answer RQ2, we measure the degradation of image labeling accuracy and efficiency when ablating the human feedback mechanisms in Rapid. Rapid supports two kinds of human feedback: (1) directly editing the labeling rules generated by Rapid, and (2) fixing incorrect labels inferred by Rapid and supplementing new labels. Thus, we create three variants of Rapid: Rapid without rule editing (Rapid−rule), Rapid without any label correction (Rapid−label), and Rapid without any kind of feedback (Rapid−feedback).
RQ3 aims to measure the effectiveness of adopting inductive logic learning for image labeling. To answer RQ3, we create four variants of Rapid by replacing the inductive logic learning module with three statistical models (SVM, random forest, and XGBoost [7]) and a neural network. The design of these variants is inspired by existing frameworks such as Snorkel [31] and Concept Bottleneck Models [20], which use statistical models to make the final prediction based on symbolic representations extracted from raw input data. For the neural network variant, we adopt the design of the fully connected layers in ResNet-18 [15].
To answer RQ4, we compare Rapid with three alternative data selection strategies: random selection, selection with only the informativeness criterion, and selection with only the diversity criterion. We use image labeling accuracy, as well as the hit rate of misclassified data, as the evaluation metrics. Specifically, the hit rate is defined as the percentage of selected data samples that Rapid mislabels in the current iteration and that are thus worth fixing. A higher hit rate indicates more effective data selection.
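Concretely, let $S_t$ denote the set of samples selected in iteration $t$ and $M_t$ the set of samples Rapid mislabels in that iteration; then the hit rate can be written as

$$HitRate_t = \frac{|S_t \cap M_t|}{|S_t|}$$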

User Simulation
Since Rapid is designed as a human-in-the-loop approach, it needs to keep soliciting feedback from human experts to refine the labeling rules. It is expensive to recruit human participants to provide feedback, especially for the two highly specialized labeling tasks, which require domain experts such as ophthalmologists and ornithologists. Therefore, we develop an automated script to simulate human feedback based on the ground truth data. To simulate label corrections, our script compares the ground truth label of each image with the label inferred by Rapid and automatically fixes incorrect labels. To simulate human edits to labeling rules, the authors first manually constructed a set of high-quality labeling rules based on their own knowledge and information shared on professional websites. In each iteration of the training process, our script compares the labeling rule inferred by Rapid with the corresponding manually curated rule and replaces the first inconsistent clause with the clause from the curated rule. In all experiments, we restrict the simulation script to edit only one clause per iteration to simulate the incremental editing process of human labelers.
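The following is a minimal sketch of the two simulation steps just described; the data representations (label dictionaries, rules as clause lists) are illustrative assumptions.

```python
# Sketch of simulated human feedback based on ground truth data.
def simulate_label_corrections(selected, predicted, ground_truth):
    # Fix every incorrect label among the samples chosen for inspection.
    return {img: ground_truth[img]
            for img in selected if predicted[img] != ground_truth[img]}

def simulate_rule_edit(inferred_rule, curated_rule):
    # Replace the first clause that disagrees with the manually curated rule;
    # only one clause is edited per iteration to mimic incremental editing.
    edited = list(inferred_rule)
    for idx, (inf, cur) in enumerate(zip(edited, curated_rule)):
        if inf != cur:
            edited[idx] = cur
            return edited
    if len(curated_rule) > len(edited):    # curated rule has an extra clause
        edited.append(curated_rule[len(edited)])
    return edited
```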

Image Labeling Tasks and Datasets
Rapid is designed for image labeling tasks in highly specialized domains. Therefore, we first select two datasets, Glaucoma Diagnosis and Bird Species Labeling, from highly specialized domains. To test the generalizability of Rapid, we further construct two datasets in common domains, traffic scene labeling and occupation labeling, by searching on Google and Flickr.
Glaucoma Diagnosis. Given a color fundus image, this task requires labeling the eye in the image as either normal or diseased. We combine the color fundus images from three datasets: Drishti-GS [36], RIM-ONE r3 [3], and REFUGE [28]. In total, we have 116 images of glaucomatous eyes and 189 images of normal eyes, each with both a glaucoma diagnosis and a structure segmentation. We use a pre-trained model called BEAL [43] as the visual attribute extractor to obtain the segmentation of eye fundus structures in the images. The visual attributes designed for this task are the diameter, area, and cup-to-disc ratio calculated from the segmentation results.
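As an illustration of how such attributes can be derived from segmentation output, the following is a minimal sketch of computing areas, vertical diameters, and the vertical cup-to-disc ratio from binary masks; the mask format is our assumption for illustration, not BEAL's actual output.

```python
import numpy as np

# Sketch: derive glaucoma-related attributes from cup/disc segmentation masks.
def fundus_attributes(cup_mask: np.ndarray, disc_mask: np.ndarray) -> dict:
    def vertical_diameter(mask):
        rows = np.where(mask.any(axis=1))[0]   # rows containing the structure
        return int(rows.max() - rows.min() + 1) if rows.size else 0

    cup_d = vertical_diameter(cup_mask)
    disc_d = vertical_diameter(disc_mask)
    return {
        "cup_area": int(cup_mask.sum()),
        "disc_area": int(disc_mask.sum()),
        "cup_diameter": cup_d,
        "disc_diameter": disc_d,
        "cdr": cup_d / disc_d if disc_d else 0.0,  # vertical cup-to-disc ratio
    }
```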
Bird Species Labeling. Given a bird image, this task requires labeling the bird species. We use the Caltech-UCSD Birds-200-2011 (CUB-200-2011) dataset [40], which contains 11,788 bird images annotated with 200 bird species and 312 attributes that describe each body part of a bird, e.g., wing color and tail shape. Following the experiment settings in Koh et al. [20], we use 112 of the 312 attributes and randomly choose three bird species to label in our experiments. We use the pre-trained concept models from Koh et al. [20] to extract the visual attributes.
Occupation Labeling. Given an image of a person, this task requires labeling the person's occupation. For this task, we build a dataset containing 300 images of three occupations (chef, farmer, and teacher), with 100 labeled images per occupation. We use a pre-trained object detection model [2] to detect objects in the images and use their types (e.g., glasses, long hair, kitchen), colors, and overlapping relationships as visual attributes.
Traffic Scene Labeling. Given a road image, this task requires labeling the traffic scene of the image. For this task, we build a dataset containing 420 images of three traffic scenes (mountain road, highway, and downtown), with 140 labeled images per scene. We use a pre-trained object detection model called DETR [6] to detect the objects in the images and use the positions, colors, and types (e.g., pedestrian, truck, car) of the objects and the overlapping relationships between them as visual attributes.

Comparison Baselines
For RQ1, we compare Rapid with four image classification neural network baselines (ResNet-18 [15], ResNet-34, ResNeXt-32 [46], and Inception-V3 [38]) and an active learning baseline called CEAL [42]. We further compare Rapid with GARDNet [22] on the Glaucoma diagnosis task, since GARDNet is specifically designed for this task. To represent the condition of learning from limited training data, in each task we randomly sample 30 training images per class label to fine-tune the baseline models. That is a total of 60 training samples for the Glaucoma Diagnosis task, which has only two class labels, and 90 for the other three tasks, which have three class labels. For the first five baselines, we pre-train them on ImageNet [11] and then fine-tune them on the four datasets, with training sets of the same size as those used to train Rapid. For GARDNet, we obtained the trained model from its original paper and then fine-tuned it on our Glaucoma diagnosis dataset.
For RQ2, we build three variants of Rapid: Rapid without rule editing by users (Rapid−rule), Rapid without label correction or new labels (Rapid−label), and Rapid without any feedback when selecting training samples (Rapid−feedback). Similar to the training setting of Rapid, both Rapid−rule and Rapid−label are initially trained with 3 randomly selected images. In the following iterations, Rapid−rule only fixes incorrect labels in the 3 images selected by the multi-criteria active learning algorithm per iteration but does not apply any direct edits to the inferred rules. By contrast, Rapid−label only makes direct edits to one clause of the inferred rules per iteration but does not fix any incorrect labels. Rapid−feedback does not use any active feedback from users; it is trained ahead of time with various numbers of randomly sampled images (e.g., 3, 6, 9, etc.) without soliciting further human feedback.
For RQ3, we create four variants of Rapid by replacing the inductive logic learner with other machine learning methods: SVM, random forest, gradient boosting, and a neural network. For these variants, we represent each image with a feature vector generated from the extracted visual attributes. The feature vector is constructed in the same way as for the diversity criterion in Section 3.3.3. We repeat each training run three times and report the average performance of each variant.

RQ1. Effectiveness on Different Labeling Tasks
Table 2 shows the image labeling accuracy of Rapid on the four labeling tasks from different domains in comparison to the fine-tuned models. Rapid outperforms all baselines on the two highly specialized domains (Glaucoma Diagnosis and Bird Species Labeling) by 11.75% to 12.03%. Specifically, Rapid achieves a high accuracy of 85.52% and 86.11% on these two tasks, respectively. The result shows that Rapid can effectively infer accurate labeling rules in highly specialized domains with a small amount of training data. For more details about the labeling rules, such as examples and statistics of the optimal rules and how the labeling rules change over iterations, please refer to Appendix B.

For the two common domains, Rapid achieves comparable or worse accuracy. Specifically, Rapid achieves an accuracy of 83.33% in the traffic scene labeling task, while the best baseline model achieves 92.86%. For the occupation labeling task, the accuracy of Rapid is 88.33%, while the best baseline model achieves 97.78%. This result is not surprising, since the baseline models are pre-trained on ImageNet, which includes a considerable number of images similar to the ones in these two common tasks. Thus, the baseline models have already learned from many similar cases during pre-training.

Figure 3 shows the image labeling accuracy of Rapid during the training process. At the 4th iteration, with only 12 training samples, Rapid has already achieved a reasonable accuracy: 85% in Glaucoma diagnosis, 70% in occupation labeling, 65% in traffic scene labeling, and 64% in bird species labeling. By the 13th iteration, Rapid has reached its peak accuracy on all four tasks. Moreover, the performance of Rapid improves stably during the training process. These results show that Rapid can effectively learn and refine labeling rules within a small number of iterations.

RQ2. Effectiveness of Human Feedback
Table 3 shows the image labeling accuracy and the number of iterations needed to reach the optimal accuracy for Rapid in comparison to its variants ablating each feedback mechanism. Overall, Rapid always achieves the highest accuracy with the smallest number of iterations (10.25 on average). Rapid−feedback has the worst performance with the largest number of iterations (25.25 on average). Without direct rule editing, Rapid−rule takes 4X more iterations to achieve a comparable accuracy on the Glaucoma diagnosis task and has significantly lower accuracy on the other three tasks (an 11.87% accuracy decrease on average). Though Rapid−label can achieve the same final accuracy as Rapid, it takes 6 more iterations in the bird species labeling task. Interestingly, Rapid−label takes the same or even fewer iterations in the other three tasks. This is because the user simulation script always makes the right edit based on the ground-truth labeling rule. When using active learning together with rule editing, Rapid regenerates the rules after receiving new labels in each iteration; these newly generated rules may deviate from the ground-truth rules, leading to more iterations.
Compared with the three variants, Rapid achieves the largest performance gain in the bird species labeling task. This gain can be largely attributed to the inherent learning challenge of the dataset. For example, birds of the same species exhibit great variety (e.g., a Kentucky Warbler can exhibit eight distinct wing color variations), which increases the difficulty of learning a proper labeling rule for what a certain bird species looks like from such a small training set. On the other hand, by editing the rules, experts can directly embed their knowledge into the labeling rules, leading to a large performance improvement.

Figure 4 shows the image labeling accuracy of Rapid and its variants over iterations in the bird species labeling task. For ease of comparison, we plot the accuracy of Rapid−feedback over the number of randomly sampled training samples it uses. In other words, for Rapid−feedback in Figure 4, 1 on the x-axis means training with 3 samples, 2 means training with 6 samples, 3 means training with 9 samples, and so forth. Overall, Rapid achieves the highest image labeling accuracy (86.11%) within only 12 iterations, using 36 images in total. While Rapid−label achieves the same accuracy, it takes 7 more iterations and thus 21 more images to reach it. In this task, both Rapid−rule and Rapid−feedback take significantly more iterations to reach their final accuracy.

RQ3. Effectiveness of Inductive Logic Learning
Table 4 shows the comparison of accuracy between Rapid and the variants replacing the inductive logic learner with statistical and neural network models. Rapid outperforms all variant models on all four tasks, by 6.02%, 15.74%, 8.89%, and 3.57%, respectively. The results demonstrate the strong capability of the inductive logic learner on image labeling tasks in a low-resource setting.
Though the random forest variant of Rapid does not achieve an accuracy as high as Rapid's, it is worth noting that it outperforms all the baseline models in Table 2 on the two highly specialized tasks. This implies that our pipeline, which, similar to Concept Bottleneck Models [20], disentangles the perception process (visual attribute extraction) from the learning process, is well suited to highly specialized image labeling tasks.

RQ4. Sensitivity Analysis of Active Learning
Figure 5 shows the comparison of accuracy and hit rate among the four data selection strategies for active learning in the traffic scene labeling task. Recall that the hit rate is defined as the percentage of selected data samples that Rapid mislabels in the current iteration and that are thus worth fixing; a higher hit rate indicates more effective data selection. Due to the page limit, we only showcase the experiment results on this task. Figures for the other three tasks are provided in Appendix C.
Among the four data selection strategies, the multi-criteria strategy achieves the highest image labeling accuracy (73.81%) with the smallest number of iterations (3). By contrast, using the diversity criterion alone takes 19 iterations (6X more) to achieve similar accuracy. Random selection and using the informativeness criterion alone achieve lower accuracy (50.00% and 67.86%, respectively). Yet compared with random selection, informativeness-based selection still helps, achieving significantly higher accuracy within fewer iterations.
In the traffic scene labeling task, random selection, the diversity criterion alone, the informativeness criterion alone, and the multi-criteria strategy achieve 42.22%, 40.00%, 71.11%, and 76.67% average hit rate over the iterations, respectively. By combining informativeness and diversity, our multi-criteria strategy achieves the highest hit rate in this task. The informativeness-only strategy maintains the highest hit rate at the beginning of the training process, which demonstrates the effectiveness of the informativeness metric designed in Section 3.3.2. The results on accuracy and hit rate imply that the two metrics are complementary: with the informativeness metric selecting training samples that carry high information and the diversity metric choosing representatives among them, Rapid achieves high accuracy within a small number of iterations.

DISCUSSION
The experiment results demonstrate the effectiveness of Rapid in learning accurate labeling rules with a small amount of training data. Based on the ablation studies in RQ3 and RQ4, we found that both inductive logic learning and multi-criteria active learning play vital roles in the success of Rapid. Furthermore, given the inherent transparency and explainability of logic-based labeling rules, Rapid provides affordances for users to directly embed their expertise into the labeling rules via rule editing. This further improves the efficiency of rule inference, leading to significantly fewer feedback iterations compared with using active learning alone.
Our work is the first to apply inductive logic learning to neuro-symbolic learning. The experiment for RQ3 demonstrates the learning capability of our FOIL-based inductive logic learner. Specifically, our logic learner outperforms alternative statistical and neural network models. It is interesting to observe that when replacing the inductive logic learner with Random Forest, the resulting model can still achieve better performance than the fine-tuned models in Table 2 in highly specialized domains. This implies that simply disentangling the perception process from the reasoning process can still be beneficial when learning highly specialized tasks in a low-resource setting (i.e., training with a small amount of data).
Compared with Snorkel [31], our approach has two significant advantages. First, Rapid can learn an initial set of labeling rules from a small amount of labeled data (e.g., 3 training samples in our case) as a starting point for expert users to refine. In contrast, Snorkel requires users to manually write labeling functions from scratch, which can require significant effort when expressing complex knowledge. Second, to use Snorkel, users need to be familiar with a programming language such as Python to write labeling functions; this imposes a steep learning curve on domain experts and end users who do not have a programming background. By contrast, the logic labeling rules in our approach are readable and intuitive, so our approach comes with a gentler learning curve than Snorkel.
This work is also a good demonstration of effective human-AI collaboration on challenging tasks. In our case, Rapid infers an initial set of labeling rules and continuously refines them based on human feedback in the form of direct edits or label corrections. Our experiment in RQ2 has shown that incorporating human feedback is critical to inferring accurate labeling rules. Without humans in the loop, Rapid achieves significantly lower accuracy even when trained on the same amount of data that the interactive setting accumulates over many iterations.
Despite the promising results, the final image labeling accuracy of Rapid may still not be on par with human experts. Our approach achieves an average accuracy of 85% across the four labeling tasks, while the accuracy of human labelers on ImageNet is about 95% [27]. Similar to Snorkel [31], we argue that such relatively noisy labels can still provide valuable supervision signals for model training, especially when used together with noise-robust learning [44,48] and weak supervision [31,47]. This can be extremely beneficial in highly specialized domains such as medical imaging, where human labelers are expensive and hard to recruit.
The current design of the inductive logic learner in Rapid has a rigid objective: learning a rule that matches all positive samples while rejecting all negative samples. Consequently, our approach is sensitive to user mistakes (e.g., incorrect labels and edits); a single incorrect training sample can lead to a labeling rule that makes no sense to users. In future work, we will improve the inductive logic learning method by relaxing the rigid rule satisfaction requirement, which is expected to increase tolerance to noisy labels and improve robustness. Besides, when training Rapid in this work, we follow the FOIL algorithm and use an information gain-based search method. In future work, we will explore improving the search mechanism by adding heuristics based on visual attributes.
The performance of Rapid highly depends on the visual attributes extracted by pre-trained computer vision models. For specialized domains, it is possible that no pre-trained computer vision model can extract meaningful visual attributes. However, we think this situation will not occur very often, for two reasons. First, numerous large computer vision models have been developed and made available online (e.g., HuggingFace, GitHub) these days. These models can recognize a wide range of objects, shapes, colors, and other basic visual attributes that generally apply to different domains, including many highly specialized domains. Second, for rare visual attributes, the pre-trained models can be substituted with conventional handcrafted computer vision techniques, such as SIFT. Our inductive logic learning component can still act on those low-level attributes and synthesize meaningful labeling rules.

B.3 Change of Rules over Iterations
Two kinds of rule updates may occur across iterations of active learning. First, one or more parameters in a rule may get updated with new labeled data samples provided by a user. Take the glaucoma diagnosis task as an example: as shown in Table 7, the parameters of the logic predicates "greater" and "smaller" were updated in each iteration. Second, logic predicates or clauses may be added or removed in an iteration. Take the bird species labeling task as an example: as users provided more labeled images over the iterations, more logic predicates and clauses were added to distinguish those examples. For simplicity, we show the labeling rule for one bird species, the orange-crowned warbler, at Iterations 1, 5, and 13 in Table 8. The Iteration-13 rule (86.11% labeling accuracy) is a long disjunction including clauses such as:

(¬ has forehead color(X,grey) ∧ has underparts color(X,olive)) ∨ (¬ has leg color(X,black) ∧ has eye color(X,black)) ∨ has forehead color(X,green) ∨ (¬ has leg color(X,black) ∧ has head pattern(X,plain)) ∨ (has nape color(X,yellow) ∧ ¬ has leg color(X,black)) ∨ (has under tail color(X,yellow) ∧ has leg color(X,buff)) ∨ (has nape color(X,buff) ∧ ¬ has under tail color(X,yellow)) ∨ has forehead color(X,olive) ∨ (¬ has nape color(X,grey) ∧ ¬ has under tail color(X,yellow) ∧ has head pattern(X,eyeline)) ∨ (has under tail color(X,brown) ∧ has forehead color(X,brown)) ∨ (has nape color(X,yellow) ∧ has forehead color(X,yellow) ∧ has underparts color(X,yellow)) ∨ (has forehead color(X,buff) ∧ has bill color(X,buff) ∧ has underparts color(X,buff)) ∨ (has leg color(X,yellow) ∧ has belly color(X,black)) ∨ has breast color(X,red)

C EXPERIMENTAL RESULTS OF RQ4 IN THE OTHER THREE TASKS
Due to the page limit, the main paper only showcases the comparison of accuracy and hit rate on the traffic scene labeling dataset for different data selection strategies. Here, we show the results on the other three tasks: bird species labeling in Figure 6, Glaucoma diagnosis in Figure 7, and occupation labeling in Figure 8.
The results match those of the traffic scene labeling task. Our multi-criteria strategy achieves higher image labeling accuracy than the other three image selection strategies. Using informativeness only, Rapid achieves the best hit rate when selecting images, and a high labeling accuracy as the number of iterations (i.e., the number of human-annotated images) increases. Compared with using informativeness only, our multi-criteria strategy achieves higher accuracy within fewer iterations.

Figure 1: Two image labeling tasks from highly specialized domains and two from common domains, with example labeling rules.

Figure 3: Image labeling accuracy of Rapid on the four tasks during the training process.

Figure 4: Comparison of accuracy during the training process between Rapid and three variants ablating two types of user feedback in the bird species labeling task.

Figure 5: Comparison of accuracy and hit rate for different data selection strategies in the traffic scene labeling task.

Figure 6: Comparison of accuracy and hit rate in the bird species labeling task for different data selection strategies.

Figure 7: Comparison of accuracy and hit rate in the Glaucoma diagnosis task for different data selection strategies.

Figure 8: Comparison of accuracy and hit rate in the occupation labeling task for different data selection strategies.


Table 1: Logic Predicates for Expressing Visual Attributes.


Table 2: Comparison of accuracy (%) between Rapid and image labeling baseline models in the four tasks.

Table 3: Comparison of accuracy (%) and the number of iterations needed (Iter.) between Rapid and three variants ablating two types of user feedback in the four tasks.

Table 4: Comparison of accuracy (%) between Rapid and four variants replacing the inductive logic learner with statistical and neural network modules in the four tasks.

Table 6: The optimal rules on the four tasks.