Understanding Novice's Annotation Process For 3D Semantic Segmentation Task With Human-In-The-Loop

Large-scale 3D point clouds are often used as training data for 3D semantic segmentation, but the labor-intensive nature of the annotation process challenges the acquisition of sufficient labeled data. Meanwhile, little research has explored involving novice annotators in acquiring labeled data by enhancing their annotation performance and user experience. Therefore, in this study, we explored solutions involving two dimensions: the presence of AI assistance and the number of classes visualized simultaneously in the model's segmentation results in human-in-the-loop (HITL). We conducted a user study with 16 novice annotators who had no prior experience in 3D semantic segmentation, asking them to perform annotation tasks. The results revealed an interaction effect between the two dimensions on annotation accuracy and labeling efficiency. We also found that displaying multiple classes at once reduced the time taken for annotation. Moreover, visualizing multiple classes at once or the absence of AI assistance led to a greater increase in model accuracy compared to our baselines. The best user experience was observed when the visualization showed a single class at a time without AI assistance. Based on these findings, we discuss which environments can enhance novice annotators' annotation performance and user experience in 3D semantic segmentation tasks within HITL contexts.


INTRODUCTION
3D semantic segmentation has shown great potential in various applications such as autonomous driving and robotics. However, it suffers from a data hunger problem [20,53] as it relies heavily on fully-supervised learning with large-scale 3D point cloud data [10,81]. Although generating a substantial amount of high-quality training data tailored to specific tasks can be a potential solution, it is labor-intensive [48], time-consuming [14,29], and costly [73]. In particular, due to the inherent characteristics of 3D point cloud data, such as sheer volume, irregularity [47], lack of grid alignment [7], and sparsity [66], preparing training data for 3D point clouds becomes even more challenging. Therefore, generating 3D point cloud data has been done by a limited number of annotation experts with domain expertise [38,41].
As an alternative approach, human-in-the-loop (HITL) has been considered [78], which allows human annotators to label a subset of data instead of the entire dataset, with the rest done by AI. There have also been attempts to involve novice annotators in a labeling process with the support of AI/ML techniques [23,50,54,61,77,80] and effective visualization methods [8,43,65]. However, given that the data hunger problem is especially challenging in the context of 3D point clouds, little has been explored about how these AI assistance and visualization methods can be used to support novices' annotation for semantic segmentation tasks with 3D point clouds in HITL. Thus, we aim to explore whether these two dimensions could be potential methods to enhance novice annotators' annotation performance and user experience in HITL environments for 3D point cloud data. Particularly, we focus on how the presence of AI assistance (AI Presence) and the number of simultaneously visualized classes in the model's segmentation result (Class Num) impact the performance of the annotators. Therefore, we set our research questions as below:
• RQ1: How does the presence of AI assistance affect novice annotators' annotation performance and user experience with 3D point cloud in HITL?
• RQ2: How does the number of simultaneously visualized classes in the model's segmentation result affect novice annotators' annotation performance and user experience with 3D point cloud in HITL?
To answer these questions, we conducted a user study with 16 novice annotators. We assessed their annotation performance and user experience under different conditions varying AI Presence (with AI assistance vs. without AI assistance) and Class Num (single label type at once vs. multiple label types at once). In particular, for performance, we estimated Annotation accuracy, Time taken, the number of points annotated per millisecond (Efficiency), and the increase in accuracy from their annotations compared to our baseline models (Effectiveness).
As a result, we found an interaction effect between AI Presence and Class Num on Annotation accuracy and Efficiency. Both were highest with AI assistance and a single visualized class. Time taken was shorter when Class Num was multiple. Additionally, Effectiveness was higher without AI assistance or when Class Num was multiple. Finally, user experience was best without AI assistance and with a single visualized class.
The contributions of this work are as follows: (1) we discovered environments that enhance the annotation capabilities of novice annotators in 3D semantic segmentation tasks with 3D point clouds in a HITL setting, (2) we conducted an empirical analysis of the positive and negative effects of AI assistance on human augmentation and user experience in 3D point cloud annotation, and (3) we identified the impact of varying the number of visualized segment classes on human augmentation and user experience in 3D point cloud annotation.

RELATED WORKS

Presence of AI support in a data labeling process
Visual interactive labeling (VIL) is a data labeling method in which human annotators interact with visually represented model results and data during the labeling process [6]. It is a human-centered labeling method since the annotators solely select the data subsets to label [2] throughout the entire annotation process. In VIL, a human annotator is expected to label the data that is most likely to improve the model performance significantly. It has drawn attention as its effect on increasing model performance was comparable to that of active learning [3,5], a labeling method in which the machine itself selects a data subset to label based on computation [40]. Moreover, VIL has additional benefits over active learning, as it enables a human annotator to take more initiative during the annotation process [33,64] and resolves the cold start problem [5,46]. Mostly, VIL has been utilized on 2D image data [42,51,75]. For instance, Xiang et al. [75] showed that VIL was helpful for generating a training dataset for a clothing image classification model. On the other hand, few studies apply VIL to 3D point cloud data. Although Zhi et al. [82] and Kontogianni et al. [39] explored annotation techniques for 3D point cloud data, they did not employ VIL; annotators were not instructed to choose the data that would most improve performance.
However, VIL has two notable drawbacks [6]. First, VIL relies solely on human annotators to select the data instances used for training accurate and robust models, so the purely human-selected instances need to be reliably validated. Second, VIL prioritizes human learning and knowledge acquisition, treating the creation of high-quality datasets as a secondary outcome. Therefore, VIL may not be the best approach when it comes to training elaborate models.
To address these shortcomings, hybrid solutions that leverage the complementary relationship between VIL and active learning have been explored [6,27,28,30,33]. In the hybrid process, the model initially selects data subsets to be labeled, and then the human annotator chooses from these selected subsets to provide the labels. This method has been acknowledged to enhance the collaboration between humans and AI [17,24,25,33,49,79]. Related to this, Jia et al. [33] showed that creating a collaborative environment where human annotators can get assistance from AI's computational guidance is effective for building a robust classifier.
While the hybrid approach can increase the benefit of interactive labeling in the context of 2D data in HITL [79], there is limited research demonstrating its advantages when applied to the task of 3D point cloud data labeling. Therefore, in this paper, we explore whether the merits of the hybrid approach apply to 3D point cloud data as well.

Visualizing the training state of a model
The visualization method for model results plays a crucial role in enhancing human comprehension of the model's training process and outcomes. For example, depending on the method, human annotators can obtain a deeper understanding of the computations taking place at each layer of the model [35,76]. Additionally, they can find out which factors significantly influence the model's output generation [55,72]. As a result, visualization methods for model results have been actively studied for HITL scenarios where human annotators need to repeatedly assess the model's training state [4,26,68].
Especially when illustrating the results of a segmentation model, most studies have primarily displayed multiple classes in the result at once [15,31,34,37,63]. This approach typically utilizes class coloring [4,22], which assigns a different color to each class to differentiate the classes. For example, Kyle et al. [22] visualized a segmentation model's results by assigning different colors to each class (e.g., trees in green, cars in blue). It enables differentiating between adjacent objects of different classes and makes it easier to discern the boundaries of each segment.
Meanwhile, a technique that displays segments for a single class at a time has also been used to visualize the results of segmentation models [1]. For instance, Andriluka et al. [1] proposed an annotation tool that shows a segment for each class in a sequence ordered by Mask-RCNN score or by Mahalanobis distance to the click location. This technique has been actively employed in the field of explainable AI [9,19]. It has been demonstrated to provide more trustworthy explanations to humans compared to showing the prediction results for all classes at once [60]. Furthermore, reducing the number of simultaneously displayed classes could have a positive impact in terms of cognitive load reduction during the annotation process [67,70].
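To make these two visualization strategies concrete, the following sketch (illustrative only; the palette, names, and logic are our assumptions rather than details from the cited tools) maps per-point class predictions to colors, either for all classes at once, as in class coloring, or for one selected class at a time with the remaining points in gray.

```python
import numpy as np

# Hypothetical palette: one RGB color per class (e.g., trees in green, cars in blue).
PALETTE = np.array([
    [0.2, 0.7, 0.2],   # class 0
    [0.2, 0.3, 0.9],   # class 1
    [0.9, 0.6, 0.1],   # class 2
])
GRAY = np.array([0.6, 0.6, 0.6])

def color_points(labels: np.ndarray, show_class: int | None = None) -> np.ndarray:
    """Return an (N, 3) RGB array for N points.

    show_class=None -> color every class at once (multi-class view).
    show_class=k    -> color only class k; all other points are gray
                       (single-class-at-a-time view).
    """
    colors = PALETTE[labels]  # class coloring for all points
    if show_class is not None:
        colors = np.where((labels == show_class)[:, None], colors, GRAY)
    return colors

labels = np.array([0, 1, 2, 1, 0])
all_at_once = color_points(labels)                   # multi-class visualization
one_class_only = color_points(labels, show_class=1)  # single-class visualization
```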
However, few studies have deeply investigated how the number of visualized classes in segmentation results affects the performance and experience of novice annotators. Therefore, this study delves into the effects of visualization methods on the annotation performance and user experience of novice annotators by comparing the two methods described above.

METHODS
Our aim is to investigate the potential influence of AI assistance and visualization methods on the annotation performance and user experience of novice annotators while employing the HITL approach for training 3D point cloud semantic segmentation models.
To accomplish this, we conducted a 3-hour single-session study with 16 novice annotators. They were asked to annotate 3D point cloud data for a segmentation task under different conditions.

Conditions
We explored the impact of the presence of AI assistance (AI Presence) and the number of classes in a model's results (Class Num) on the annotation performance and the user experience. To be specific, for AI Presence, we conducted a comparison between w/o AI assistance and with AI assistance. The w/o AI assistance condition involves annotators selecting data subsets for labeling on their own, as in the VIL method. Conversely, in the with AI assistance condition, AI aids the human labeling process, as in the hybrid method. In our study, the AI assists the annotators by highlighting points with low confidence scores in color while displaying the other points in gray. That is, the annotators can see the points that most need to be labeled. We elaborate on our method for identifying low-confidence points in subsection 3.3. Therefore, as shown in Figure 1b and Figure 1e, points with low confidence scores are represented by colors assigned to each segmented class in with AI assistance. On the other hand, w/o AI assistance assigns colors to all points and does not narrow down the annotation candidates for the annotators (Figure 1a and Figure 1d).
For Class Num, we manipulated the number of classes displayed in the model's segmentation results; the MLONE condition displays segments across multiple label types at once, while the SLONE condition presents segments for a single label type at a time. As mentioned above, we assigned colors to each of the segmented classes. As shown in Figure 1a and Figure 1b, MLONE shows the results of all classes at once, while SLONE shows the result for one class at a time (Figure 1d and Figure 1e).
We hypothesized that with AI assistance would outperform the w/o AI assistance condition in enhancing novice annotators' annotation performance and user experience, because it combines the capabilities of AI and humans, which are known to be complementary. Also, we expected to see better performance with SLONE than MLONE, as it could reduce novice annotators' cognitive load.

Participants
16 participants (10 female, 6 male) were recruited via social network services. The criteria for participation were: (1) no prior experience with annotation for 3D semantic segmentation tasks, (2) no background knowledge in AI, and (3) no problem with distinguishing colors. Their ages ranged from 19 to 38, with an average age of 25.3 (SD = 5.95). They were compensated with $27 for their participation.

Apparatus
For the experiment, we implemented an annotation tool for segmenting 3D point cloud data. It is based on labelCloud [58], an open-source point cloud labeling tool. We added features such as adjusting the position and size of a bounding box, saving annotated data, and measuring the task duration. The tool is designed to use a mouse as an input device and a computer monitor to see the output during the annotation task. Its user interface and detailed functions are described in appendix A. In our study, a dual-window setup was employed to present the tool and the original point cloud data. Such a setting allows participants to refer to the original data while performing labeling tasks in the tool.

Dataset and Model
In our study, we employed ScanNet [14], a representative 3D point cloud dataset. It contains over 1,500 RGB-D indoor scenes and provides instance-level semantic segmentation information. In terms of the model, we used PointNet++ [56], a leading 3D point cloud-based semantic segmentation model.
We selected four different scenes from ScanNet for our experiment, presented in a counterbalanced order. To minimize the impact of scene characteristics on each condition, we selected scenes with similar features based on the object types, as well as the number of objects in the scenes. Furthermore, we cropped the chosen scenes so that the number of points per scene was approximately the same. Thereby, we alleviated the influence of differences in the quantity of points within the four selected scenes.
In our study, we designed the experiment using the Wizard of Oz method [13] to eliminate external factors (e.g., model training time) that could influence the results (e.g., user experience). Therefore, we presented pre-generated intermediate training results to participants when they labeled data to simulate the real HITL process. We created three results for each scene since participants were required to perform annotations twice in each scene. Before starting the annotation process, participants see the first result. After completing one round of annotations, they are shown the second result, and upon completing the second round, they are presented with the final result.
To create the results, we trained the PointNet++ model with ScanNet data. As the training state to be shown could affect the participants' annotation process, we controlled the models to have similar ranges of Voxel mIoU and loss on the four scenes, as shown in Table 1. Note that Voxel mIoU and loss are frequently used as metrics to estimate 3D point cloud-based models' performance [16,44]. In addition, we set the performance of the training states in each iteration to increase gradually to let the participants believe their annotation helps the model's training.
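Voxel mIoU discretizes the point cloud into voxels and averages per-class intersection-over-union between the voxelized predictions and labels. A minimal sketch of the metric is shown below; the voxel size and the use of one label per voxel are our simplifying assumptions, not details taken from the evaluation code of [16,44].

```python
import numpy as np

def voxel_miou(points, pred, gt, num_classes, voxel_size=0.05):
    """Sketch of Voxel mIoU: voxelize, keep one label per voxel, average per-class IoU."""
    # Map each point to an integer voxel coordinate.
    vox = np.floor(points / voxel_size).astype(np.int64)
    _, first_idx = np.unique(vox, axis=0, return_index=True)
    # Simplification: take the first point's label in each voxel
    # (implementations often use majority voting instead).
    vox_pred, vox_gt = pred[first_idx], gt[first_idx]
    ious = []
    for c in range(num_classes):
        inter = np.sum((vox_pred == c) & (vox_gt == c))
        union = np.sum((vox_pred == c) | (vox_gt == c))
        if union > 0:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return float(np.mean(ious))
```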
As mentioned in subsection 3.1, the with AI assistance condition used confidence scores for visualizing segmentation results. These scores are calculated with the models in Table 1 at each iteration. In this condition, we used a pre-made PLY file in our tool that displays only the one-third of points with the lowest confidence for each iteration and scene. This ensures uniformity in the number of colored points across all scenes, thus mitigating any potential effects that varying the number of colored points might have on participants' annotations.
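A plausible way to derive such a highlight from a segmentation model's per-point class probabilities, sketched under the assumption that confidence is the maximum softmax probability (the paper does not specify the exact scoring beyond selecting the lowest-confidence third):

```python
import numpy as np

def low_confidence_mask(probs: np.ndarray, fraction: float = 1 / 3) -> np.ndarray:
    """probs: (N, C) per-point class probabilities from the model.

    Returns a boolean mask over the N points marking the `fraction` with the
    lowest confidence; these points would be shown in class colors, while the
    remaining points are rendered gray.
    """
    confidence = probs.max(axis=1)              # max softmax probability per point
    k = int(len(confidence) * fraction)         # number of points to highlight
    threshold = np.partition(confidence, k)[k]  # k-th smallest confidence value
    return confidence < threshold
```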

Procedure
Each session began with signing a consent form. Then, participants were given a brief introduction to the study. A tutorial was provided for participants to get familiar with the annotation tool for each condition before the actual task. Once they felt comfortable with the interface, participants were asked to select the two parts of the scene that they believed to be most helpful for improving the model's segmentation performance and to provide labels with two bounding boxes. They were given four minutes to label based on the current training results displayed on the screen. Afterward, participants were shown the updated results and then started over for one more round. This process was repeated for all four conditions in a counterbalanced order. After the labeling for each condition was completed, they were asked to fill out a 7-point user experience questionnaire regarding the ease, enjoyment, and fatigue of the labeling process with that condition. Considering the exhausting nature of the labeling task, sufficient rest time was provided after the completion of each of the two labeling tasks for each condition. Additionally, a 10-minute break was given after completing annotations for two conditions. Finally, after completing the tasks for all four conditions, participants filled out an overall user experience questionnaire, comparing all conditions using ranked voting. For each question in this questionnaire, we asked participants to provide reasons for the ranks they assigned. In this procedure, we divided our 16 participants into four groups and employed a Latin square design to randomize the sequence of conditions.

FINDINGS
We assessed the effects of AI Presence and Class Num on Annotation accuracy, Time taken, Efficiency, Effectiveness, and user experience for each condition. We used the Shapiro-Wilk test [62] to test data normality. Additionally, if the data were not normally distributed, we applied Z-score normalization to our data, following the methods used in prior studies [69,71]. After normalizing the data, we employed the two-way repeated measures ANOVA to determine whether there were any interaction effects between AI Presence and Class Num. If there were no interaction effects, we utilized pairwise t-tests to identify the main effects of the two dimensions. Meanwhile, user experience responses rated on a 7-point scale for each condition were tested using Friedman tests [18]. For the final user experience questionnaire, we compared the conditions based on the ranking vote proportions.
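For concreteness, the analysis pipeline described above could be expressed as follows with standard Python statistics libraries; this is a sketch assuming a long-format table with one row per participant and condition (the column names and file are hypothetical), with pingouin providing the two-way repeated measures ANOVA.

```python
import pandas as pd
import pingouin as pg
from scipy import stats

# Hypothetical long-format results: one row per participant x condition,
# with columns: subject, ai ("with"/"without"), classnum ("SLONE"/"MLONE"), accuracy.
df = pd.read_csv("results.csv")

# 1) Shapiro-Wilk normality test within each condition.
normal = all(stats.shapiro(g["accuracy"]).pvalue > .05
             for _, g in df.groupby(["ai", "classnum"]))

# 2) If the data are not normally distributed, apply Z-score normalization.
if not normal:
    df["accuracy"] = stats.zscore(df["accuracy"])

# 3) Two-way repeated measures ANOVA over AI Presence x Class Num.
aov = pg.rm_anova(dv="accuracy", within=["ai", "classnum"],
                  subject="subject", data=df)
print(aov)

# 4) Absent an interaction effect, pairwise t-tests (stats.ttest_rel) probe main
#    effects; 7-point ratings are compared with stats.friedmanchisquare.
```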

Annotation accuracy
Annotation accuracy indicates how accurately participants annotate the 3D point cloud data. We calculated the average annotation accuracy of participants for the two labeled data from each condition based on the annotated data they provided and the ground truth data. The two-way repeated measures ANOVA analysis indicated an interaction effect between AI Presence and Class Num (p < .001). In particular, when AI assistance exists (with AI assistance), the average annotation accuracy was higher in SLONE (M = 77.53%, SD = 0.231) than in MLONE (M = 53.91%, SD = 0.372) (p = .005). On the other hand, when there is no AI assistance (w/o AI assistance), there was no statistically significant difference between MLONE and SLONE. As shown in Table 2, the highest Annotation accuracy was observed in the with AI assistance-SLONE condition.

Time taken and efficiency
To measure the productivity level of participants' labeling, we evaluated Time taken and Efficiency. Time taken refers to the amount of time participants consumed to complete each round of labeling. It was measured from when the participants pressed the start button to when they pressed the stop button on the interface. Efficiency, on the other hand, represents the quantity of labeled data in relation to time [4], which is a primary evaluation criterion in assessing human-provided labeling in HITL. To assess Efficiency, we divided the number of 3D points labeled by the Time taken for participants to label two bounding boxes in every iteration within each condition.
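Expressed as a formula, Efficiency is simply the labeled-point count divided by the elapsed milliseconds for one iteration; a minimal sketch with made-up numbers:

```python
def efficiency(num_labeled_points: int, time_taken_ms: float) -> float:
    """Labeled points per millisecond for one iteration (two bounding boxes)."""
    return num_labeled_points / time_taken_ms

# Hypothetical example: 5,000 points labeled in 200,000 ms gives 0.025 points
# per millisecond, the same order of magnitude as the averages reported below.
print(efficiency(5_000, 200_000))
```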
For Time taken, as shown in Figure 2a, there was no interaction effect between AI Presence and Class Num. However, a main effect of the visualization method existed (p = .009). With SLONE, participants took 202,962 ms on average (SD = 36,358 ms), which is significantly higher than the 187,769 ms (SD = 43,799 ms) with MLONE. One participant stated that performing the task was difficult in the SLONE condition, as she had to remember the segmentation results of other classes. Meanwhile, there was no main effect of AI assistance. However, in the with AI assistance condition, one participant mentioned that they could concentrate on the labeling process without going through complex thinking, unlike in w/o AI assistance. Additionally, another participant stated that it was challenging to decide right away which part to label relying solely on the model's segmentation results in the w/o AI assistance condition.
For Efficiency, as shown in Figure 2b, the two-way repeated measures ANOVA analysis revealed a significant interaction effect between AI Presence and Class Num (p = .027). In the with AI assistance condition, Efficiency was 0.026 points per millisecond on average (SD = 0.019) in MLONE, which is lower than that of SLONE, 0.047 points per millisecond on average (SD = 0.042) (p = .010). In the w/o AI assistance condition, there was no significant difference between the two visualization methods.

Effectiveness
Effectiveness, another representative evaluation metric for assessing human-provided labeling in HITL environments, refers to the degree to which the model's accuracy is improved by a participant's annotation [4]. To measure Effectiveness, we looked at the improvement ratio of Voxel mIoU each participant achieved per iteration per condition compared to the baseline models (see Table 1 for details). First, we trained the baseline models with the labeled data provided by each participant in each iteration. In this process, we used three random seeds to get a generalized result. Then, we calculated Effectiveness by comparing the Voxel mIoU of the re-trained model to that of the baseline.
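A sketch of this computation, under the assumption that the per-seed Voxel mIoU values of the re-trained model are already available (the numbers and names below are illustrative):

```python
import numpy as np

def effectiveness(retrained_mious, baseline_miou):
    """Improvement ratio of Voxel mIoU over the baseline model.

    retrained_mious: Voxel mIoU of the model re-trained with one participant's
    labels, one value per random seed (three seeds in the study).
    """
    return float(np.mean(retrained_mious) / baseline_miou)

# Hypothetical example: baseline at 0.40 mIoU and re-trained runs at
# 0.58, 0.56, and 0.59 give a ratio of about 1.44.
print(effectiveness([0.58, 0.56, 0.59], baseline_miou=0.40))
```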
Unlike Efficiency, there was no interaction effect between AI Presence and Class Num for Effectiveness. The main effect of AI Presence was found to be significant (p < .001). In the w/o AI assistance condition, the improved accuracy ratio over the baseline was 1.439 (SD = 0.256), which is higher than the 1.177 (SD = 0.261) of the with AI assistance condition; details are shown in Figure 3. One participant mentioned that in with AI assistance, it is difficult to know how the uncolored points were segmented because these points carry no information about the segmentation results. Additionally, a main effect of the visualization method was present (p < .001). The average ratio of MLONE was 1.439 (SD = 0.310), which was significantly higher than that of SLONE (M = 1.176, SD = 0.194). Among the participants, nine stated that displaying multiple classes at once was good for grasping the results at a glance.

User experience
After the labeling for each condition was completed, we administered 7-point user experience questionnaires regarding the ease, enjoyment, and fatigue of the labeling process. The Friedman test revealed no statistically significant differences among conditions.
Additionally, at the end of the study, we asked participants to rank all four conditions in terms of Enjoyment, Ownership, and Ease of use. The results are shown in Figure 4. For the analysis, we grouped the first and second places as top-ranked, and the third and fourth places as bottom-ranked for each condition. The results show that SLONE received more top-ranked votes than MLONE regardless of AI Presence for all metrics. Also, in the case of SLONE, w/o AI assistance received even more top-ranked votes than with AI assistance.
One participant who ranked SLONE with w/o AI assistance first for all metrics said that since it shows the results for all data for each object, determining the priorities among labeling candidates was far easier than in other conditions. With w/o AI assistance, MLONE and SLONE tied for first place in receiving the most first-ranked votes for Ownership. At the same time, it took the fewest fourth-ranked votes in that metric. For SLONE under with AI assistance, it received the fewest fourth-ranked votes for Enjoyment, but it also got the fewest first-place votes for Ownership and Ease of use. Under the same condition (with AI assistance), MLONE tied for first place in attaining first-ranked votes for Ease of use, while it received the most fourth-ranked votes across all metrics.

DISCUSSION
In this section, we discuss environments that improve novice annotators' annotation performance and user experience for 3D semantic segmentation tasks in HITL.

Reducing time taken and enhancing annotation efficiency while alleviating cognitive overload in decision-making

5.1.1 Requiring less memory with MLONE. As shown in subsection 4.2, Time taken was lower in MLONE than in SLONE. In other words, the annotators labeled faster in MLONE. In the SLONE condition, as mentioned in subsection 4.2, they struggled to decide on the parts to label while remembering the segmentation results of other classes. Therefore, when they forgot the results for prior classes, some returned to those classes' segmentation results. Since such repetition increases the number of choices unnecessarily, the Paradox of Choice [11,32] could be at work. It argues that having too many choices can lead to stress and anxiety, suggesting that a limited set of options can actually reduce the time taken in decision-making [57,59].

5.1.2 Narrowing down labeling candidates with AI assistance. In the SLONE environment, the number of points displayed at once was small, so deciding which points to label heavily involved the annotators' cognitive effort. In this setting, as shown in subsection 4.2, Efficiency was higher with the support of AI (with AI assistance) than without it (w/o AI assistance). In other words, the annotators annotated more quickly when they received assistance in narrowing down the labeling candidates. These observations are supported by studies showing that decreased cognitive load increases task efficiency [21,45,52]. As mentioned in subsection 4.2, novice annotators also mentioned that they could work without the mental pressure of intricate thinking processes in with AI assistance compared to w/o AI assistance.

Boosting effectiveness through richer contextual information

5.2.1 Abundant contextual information with MLONE. As mentioned in subsection 4.3, Effectiveness was found to be higher in MLONE compared to SLONE. Such a result could come from the abundant contextual information in MLONE, which shows the results for objects of all classes together. Such context allows humans to understand the meaning of the provided information better [36]. Context information is also known to support object identification in the visualization field [12]. Therefore, the contextual information in MLONE may enable annotators to better identify areas for labeling by allowing a comprehensive grasp of the model's training state on the entire data.

5.2.2 Increased amount of information in the model's results with w/o AI assistance. As shown in subsection 4.3, Effectiveness was also observed to be higher in environments with no AI assistance (w/o AI assistance) compared to those with AI assistance (with AI assistance). This discrepancy may be attributed to the limited visibility of points in environments with AI assistance. In such settings, annotators are presented only with AI-recommended labeling points, which can restrict the overall context available for interpretation. On the other hand, in environments with no AI assistance, a broader range of labeling points is visible, providing a richer context that makes it easier for annotators to understand and label effectively.
In the results, Effectiveness was found to be better in MLONE than in SLONE, and also higher without AI assistance than with AI assistance. Meanwhile, Annotation accuracy was the highest in with AI assistance-SLONE. We speculate that this may be due to potentially lower class diversity in the labeled data in the with AI assistance-SLONE condition compared to the other conditions. In with AI assistance-SLONE, the novice annotators tended to focus on drawing bounding boxes for a single class and on targeting narrow regions with low confidence scores. In contrast, in conditions such as w/o AI assistance or MLONE, there was a tendency to draw bounding boxes on objects segmented into multiple classes, potentially providing the AI with more diverse class information. This could have led to the higher Annotation accuracy in with AI assistance-SLONE, but the lower Effectiveness.

Dealing with annotators' speed-accuracy tradeoff by applying conditions flexibly
Based on our findings, we found that the speed-accuracy tradeoff [74] also exists between novice annotators' Efficiency and Effectiveness in HITL. The tradeoff means that consuming less time mostly leads to lower accuracy when conducting a task. Our findings show that with AI assistance-SLONE helps to increase Efficiency (as in subsection 4.2), while w/o AI assistance and MLONE aid in boosting Effectiveness (as in subsection 4.3). In other words, conditions that enhance efficiency give relatively less support for increasing effectiveness than other conditions. Since the tradeoff prevents optimizing novice annotators' annotation process for both efficiency and effectiveness, solutions to minimize its effect are needed. One way to mitigate the tradeoff's effects is to adopt combinations of the conditions flexibly based on the purpose. For example, if the priority is to obtain a large amount of labeled data quickly, we can use with AI assistance along with SLONE. Conversely, if the focus is on obtaining labeled data that significantly enhances model performance, we might consider applying w/o AI assistance and MLONE. Additionally, since with AI assistance-SLONE and w/o AI assistance-MLONE are at opposite ends of the spectrum in both AI Presence and Class Num, further exploration could be conducted to find optimal solutions at an intermediate point.

Augmenting annotators' user experience through support in establishing labeling criteria
As in subsection 4.4, the w/o AI assistance under SLONE condition received the best evaluation on user experience. We hypothesize that this condition is most helpful for setting criteria when determining which part to label. Getting support in establishing labeling criteria can be beneficial for improving the user experience because it reduces the uncertainty and sense of being overwhelmed that annotators might experience until the criteria are set. In detail, first, SLONE displays the segments predicted for each class type. In this way, SLONE makes it easier to assess the segmentation model's performance for each class type compared to MLONE. It enables the annotators to identify which classes the model struggles to segment. Understanding the specific areas where the AI model falls short can guide annotators in determining their labeling criteria. This is crucial because the annotators aim to provide labels that will most effectively improve those weak areas in the model's training. Also, w/o AI assistance displays all segments for all classes, while with AI assistance shows only partial data for all classes. The former allows the annotators to get the full picture of the model's training state. Such deep comprehension of the current state also helps them identify the most supportive labeling candidates.

Limitations and future work
First, to mitigate external factors such as model training time, we employed the Wizard of Oz approach. Therefore, the pre-generated model results for this approach were not actually trained on the data provided by the participants in real time. As a result, we could not fully replicate the exact HITL environments that participants would encounter in real situations. Second, in our experiment, we explored novice annotators' task performance and user experience using four point cloud scenes. In this process, to keep only the two dimensions (AI Presence and Class Num) as independent variables across the four conditions, we aimed for our participants to see, at each labeling stage and in every condition, a model result with performance similar to the baseline. Furthermore, the model results needed to contain similar types and quantities of objects. However, it was practically challenging to find scenes that satisfied both of these settings simultaneously.
For future work, we plan to explore methods to improve the annotation process for novice annotators in a real HITL environment. Through this, we intend to increase the number of options within the Class Num category, which currently has only two: single or multiple. Furthermore, we could conduct the same experiment using room-sized point cloud data in virtual reality. This would allow novice annotators to explore 3D data in a 3D environment, potentially offering a different annotation experience. Finally, we plan to conduct a class diversity analysis of the labeled data to delve deeper into the labeling accuracy of novice annotators.

CONCLUSION
In this paper, we conducted a user study with sixteen novice annotators to investigate methods to improve their annotation performance and user experience in HITL for 3D semantic segmentation models. In the study, we found fast, productive, and usable approaches related to the presence of AI assistance and the number of classes displayed in a model's results. Based on our work, we discuss environments from which novel approaches that enhance annotation performance and user experience could emerge. As a result, this paper contributes breakthroughs to the 3D point cloud field, which suffers from a data hunger problem despite high demand. Through this research, we hope to inspire more studies that augment the annotation capacity of novice annotators and enable them to play a more active role in the rapidly evolving AI environment.

Figure 1: Our experiment explores four conditions. In conditions (a) and (d), the overall model's segmentation results are displayed in color, while in conditions (b) and (e), only the segmentation results that the AI selected are shown in color. Additionally, conditions (a) and (b) display segmentation results for all classes simultaneously, whereas conditions (d) and (e) present the segmentation results separately, one class at a time. Therefore, only points of a single color are shown in (d) and (e).

Figure 2: Time taken (left) and Efficiency (right) for the four conditions.

Figure 4: User experience results after completing tasks for all conditions (N = 16). From top to bottom, the four colors in each bar indicate counts of first-ranked to fourth-ranked votes the condition received. The numbers in each color are the counts for each rank.

Table 1: The range of the Voxel mIoU and loss for the four scenes in each iteration (1st iteration, 2nd iteration, and final result).


Table 2: The average annotation accuracy for each condition.