If in a Crowdsourced Data Annotation Pipeline, a GPT-4

Recent studies indicated that GPT-4 outperforms online crowd workers in data labeling accuracy, notably workers from Amazon Mechanical Turk (MTurk). However, these studies were criticized for deviating from standard crowdsourcing practices and for emphasizing individual workers' performances over the whole data-annotation process. This paper compared GPT-4 against an ethical and well-executed MTurk pipeline, in which 415 workers labeled 3,177 sentence segments from 200 scholarly articles using the CODA-19 scheme. Two worker interfaces yielded 127,080 labels, which were then used to infer the final labels through eight label-aggregation algorithms. Our evaluation showed that despite best practices, the MTurk pipeline's highest accuracy was 81.5%, whereas GPT-4 achieved 83.6%. Interestingly, when GPT-4's labels were combined with crowd labels collected via an advanced worker interface for aggregation, 2 of the 8 algorithms achieved even higher accuracies (87.5% and 87.0%). Further analysis suggested that, when the crowd's and GPT-4's labeling strengths are complementary, aggregating them can increase labeling accuracy.


INTRODUCTION
With GPT-4 demonstrating its impressive ability to follow written instructions and respond to questions, a series of new studies emerged, stating that GPT-4's ability to label data has surpassed online crowd workers, notably Amazon Mechanical Turk (MTurk) workers [1,21,52,66]. While these studies could be encouraging (the ultimate goal of collecting human annotations at scale is to train advanced AI models), several criticisms have emerged questioning their credibility [7,53,62]. Aside from obvious issues like testing GPT-4's ability with datasets that existed before GPT-4's knowledge cut-off date, the criticisms fall into two categories:

• First, many studies deviated from standard crowdsourcing practices for annotating data. For example, several studies solely used MTurk's Master Qualification to select workers, which is known to be ineffective [45]. Furthermore, in many of these studies, requesters did not monitor the quality of workers' labels or remove underperforming workers in the midst of the annotation process, which is a common practice when constructing a new dataset.

• Second, and more fundamentally, these studies focused primarily on individual workers' performances instead of the holistic data-annotation process. Literature in collective intelligence suggests that aggregating individual judgments, despite their potential inaccuracy, can lead to a final decision superior to any single person's judgment [44,65]. Furthermore, in real-world data annotation efforts, many more factors beyond individual workers' performances contribute to the quality of the final labels, including interface design, requesters' monitoring and communication effort, payment, and label-aggregation techniques. Some of these studies, with the goal of isolating the variable of worker performance, adopted approaches that were impractical or unfeasible for real-world crowdsourcing, e.g., using the entire dataset's gold-standard labels to eliminate underperforming workers, or collecting only one label per data instance.
This progress in AI can be exciting, especially given that the motivation for collecting human annotations has long been to train powerful AI models. However, this research direction is not without its skeptics. A primary critique, evident from Table 1, is that many studies leveraged existing datasets with gold labels already available online before ChatGPT's knowledge cutoff date (September 2021) [1,23,52,66]. Given that ChatGPT utilized vast swaths of online data for training, testing it with datasets available before this cutoff raises concerns about data contamination [14,28,37], essentially testing GPT-4 with its training data. While it is convenient to use existing datasets for initial GPT-4 benchmarks, an unbiased assessment requires curating and using new datasets.
Beyond the data contamination concern, the crowdsourcing research community has highlighted additional issues. Many of these studies deviated from standard MTurk data annotation practices. For instance, they have depended on the frequently criticized MTurk Master Qualification [45], either underpaid workers or failed to disclose pay rates, and lacked thoroughness in filtering out underperforming workers during the annotation process [1,23,26,52]. Additionally, these studies often overlooked the collective nature of crowdsourcing, focusing instead on individual performance. This focus gave rise to impractical testing methods, like using the entire dataset's gold labels to filter out underperforming workers [52]. Consequently, these studies did not explore label aggregation techniques, which, when implemented, could yield high-quality results even if individual worker contributions were mediocre.
One of the few studies that took a more holistic view of crowdsourced labels, rather than solely focusing on individual workers' performance, was conducted by Li [31]. They compared LLMs' labeling performances to both individual and aggregated labels in crowdsourced NLP datasets, including RTE (Recognizing Textual Entailment) [48] and QUIZ [32]. However, the datasets used in this study, from 2008 and 2017, predate ChatGPT's knowledge cutoff.

Crowd Label Aggregation and Quality Control
In crowdsourcing, human computation, and database literature, there has been a significant focus on addressing the issue of unreliable quality in crowdsourced labels. This has led to two main research areas: (i) improving the quality of labels collected via crowdsourcing [2,12,48] and (ii) developing label-aggregation techniques [65]. Both strategies, which can be used simultaneously for better final label quality [18,29], involve different approaches. Quality control in crowdsourcing examines how factors like fair payment [38,51], instructions and training [22,33], task design [20], and interface design [51] affect the quality of crowd work. Common practices [2] include inserting gold labels for quality checks, using reputation systems, time locks, attention checks [41], and recruiting workers with specific qualifications [30]. Meanwhile, label-aggregation research aims to derive high-quality labels from numerous unreliable ones [6,65]. Techniques include EM-based methods, weighted voting, bidding, and, more recently, neural aggregation methods. (We describe the details of all the aggregation algorithms used in this work in Section 3.6.) However, most of these prior works were based on the assumption that AI models could not yet perform these labeling tasks effectively. Our research operates under a new paradigm where large language models can already perform these tasks fairly well in a zero-shot manner, leading to new research questions and challenges.

Table 1: Recent studies have conducted comparisons between LLMs and crowd workers. However, many of these studies utilized data predating ChatGPT's cutoff date, potentially resulting in data contamination, where GPT-4 was evaluated using its own training data. Notably, none of these studies incorporated manual filtering, and they often did not employ multiple aggregation methods.

COMPARATIVE STUDY PROCEDURE
In this paper, we followed crowdsourcing best practices to use MTurk to label new data, applied a variety of label-aggregation techniques to induce final labels, and compared the results with GPT-4. This section details the procedure. Previous research indicates that worker interface design can influence performance [42,51]; thus, we tested two different interfaces in our study. Among several available guidelines for reporting crowdsourcing experiments [15,43], we adhered to the guidelines and checklist developed by Ramírez et al. [43] to ensure our experiments are detailed enough for reliable replication by the community.

Annotation Scheme and Data
Annotation Scheme and Instruction. We aimed to compare MTurk's and GPT-4's ability to label text items, as GPT-4 currently performs best with text rather than with video or images. For our study, we chose the CODA-19 label scheme [27], which categorizes sentence segments in paper abstracts into research aspects, i.e., Background, Purpose, Method, Finding/Contribution, and Other. We obtained the detailed annotation instructions via CODA-19's GitHub repository and used them in our study.
This task was picked for its balanced difficulty: it demands reading scholarly articles, making it more challenging than basic sentiment labeling. However, it is not as hard as expert-only labeling tasks like identifying disease mentions, and MTurk workers have successfully completed it before [9,27].
Data. The original CODA-19 dataset [27] contains biomedical papers published before April 2020, extracted from the COVID-19 Open Research Dataset (CORD-19) [58]. In this study, we sampled papers from the most recent release of the CORD-19 dataset, dated June 2, 2022, which housed around one million documents. To prevent our test data from overlapping with OpenAI GPT's training data, we limited our study to documents published after ChatGPT's last knowledge update in September 2021, focusing on publications from 2022 or later. Using langdetect, we identified and retained only English papers. After this process, our dataset comprised 123,881 papers with full text and metadata.
For our main study, we randomly sampled 200 papers from this dataset as the test set.We segmented the abstracts of these papers into 3,177 sentence segments, averaging 15.89 segments per abstract, following CODA-19's approach [27].
For developing worker interfaces (Section 3.2), we also randomly sampled 200 different papers from the dataset as the interface development set.During the interface design phase of the Basic Worker Interface (Section 3.2), the papers in the interface development set were used as prototyping materials, such as placeholder texts for layout adjustments and texts for MTurk tasks to test interface functionalities.We deliberately separated the test set from the interface development set to prevent bias, ensuring the interface does not unfairly favor papers in the test set.

Collecting Labels via Amazon Mechanical Turk
Worker Interfaces. Prior studies suggested that the design of the worker interface would impact annotation performance on MTurk [42,51]. To address potential biases, we tested two interfaces in our study. Both displayed the original CODA-19 instructions but were independently designed by different individuals:

• Basic Worker Interface (Figure 1): An author of this paper, unfamiliar with designing interfaces for MTurk tasks, was tasked with creating a worker interface using the original CODA-19 annotation instructions, including examples and FAQs. We emphasized simplicity and usability in the design. The interface had instructions at the top (Figure 1a) and an annotation section below. Workers skimmed the abstract first (Figure 1b), labeled text segments, and then reviewed and corrected their labels using the "prev" and "next" buttons (Figure 1c).

• Advanced Worker Interface (Figure 2): This is the original interface used for constructing the CODA-19 dataset [27], designed by a crowdsourcing expert with extensive experience in designing MTurk task interfaces. Although this interface had a similar layout to the basic worker interface, it had several advanced features, such as visual feedback on button clicks, a color-coded annotation overview, and a time lock to prevent hasty spam submissions.

We did not explicitly tell workers that they were part of an experiment comparing MTurk pipelines with GPT; we simply stated it was a data labeling task. This approach was chosen to replicate a typical data labeling scenario.
CODA-19's successful collection with MTurk workers [27] confirmed the efficacy of the labeling scheme and task instructions, so we did not perform a pilot study for the identical Advanced Worker Interface.For the Basic Worker Interface, which was newly created for this work, we conducted a small set of pilot studies on MTurk to verify its functionality.
Worker Recruitment and Grouping. We first set up a $1 qualification Human Intelligence Task (HIT) on MTurk with the basic interface, in which workers needed to watch a tutorial video, review instructions, and annotate a research abstract to qualify. Of those who passed, we divided 800 workers into two groups: 400 for the basic and 400 for the advanced interface. Workers could only access the HITs of their qualified group. Additionally, we applied four MTurk built-in qualifications: Locale (US Only), HIT Approval Rate (≥98%), Approved HITs (≥3000), and Adult Content Qualification.

Posting Tasks in Batches and Monitoring Label Quality. We followed best crowdsourcing practices to annotate the data using MTurk. When experienced requesters use MTurk to label unseen data, they rarely post all the data at once. It is more common to post data in batches and carefully monitor label quality between each. The original CODA-19 dataset was constructed using this approach, and we adopted the same strategy in our study.
We divided 200 abstracts (see Section 3.1) into four batches of 50, posting one at a time.For each abstract, we created two HITs: one with the basic interface and the other with the advanced interface.We recruited 20 workers via 20 assignments (from the qualified pool of 400) for each HIT.
Once a batch was completed, we first approved all submitted work in that batch. Next, we assessed label quality and removed qualifications from underperforming workers to prevent them from accessing our future HITs. In a practical label collection scenario, we would not have gold-standard labels for the entire dataset and could not afford to check every label manually. Therefore, the first author, who is a Ph.D. student in Informatics (the "CS Expert" in Section 3.3), manually labeled only 10 abstracts per batch. We then used these labels to compute three worker quality control statistics: (i) label accuracy, based on only the 10 manually labeled abstracts per batch; (ii) probability of agreeing with the majority label; and (iii) probability of labeling "Other," a rare label. For (i) and (ii), we reviewed the bottom 30 workers' labels, and for (iii), the top 30's. We developed an interface allowing for rapid label inspection (details in Appendix Figure 8). If we observed a worker consistently providing incorrect labels or seemingly spamming our task, we revoked their qualification. When a few removed workers contacted us to complain and request reinstatement, we justified our exclusion decision using the three statistics.
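The three quality-control statistics above can be sketched as follows. The statistics themselves, the 10-abstract expert reference set, and the rare "Other" label are from the text; the function and variable names are illustrative.

```python
from collections import Counter

def worker_stats(worker_labels, all_labels, ref_labels):
    """Per-worker quality-control statistics.

    worker_labels: {item_id: label} for one worker.
    all_labels:    {item_id: [labels from every worker]}.
    ref_labels:    {item_id: label} for the small expert-labeled subset.
    """
    # (i) Accuracy on the expert-labeled subset only.
    scored = [i for i in worker_labels if i in ref_labels]
    accuracy = sum(worker_labels[i] == ref_labels[i] for i in scored) / len(scored)

    # (ii) Probability of agreeing with the majority label.
    agree = 0
    for i, lab in worker_labels.items():
        majority = Counter(all_labels[i]).most_common(1)[0][0]
        agree += lab == majority
    p_majority = agree / len(worker_labels)

    # (iii) Probability of choosing the rare label "Other".
    p_other = sum(l == "Other" for l in worker_labels.values()) / len(worker_labels)
    return accuracy, p_majority, p_other
```

A worker with low (i), low (ii), or unusually high (iii) would then be flagged for manual inspection, as described above.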
The crowdsourced data collection took place between August 22nd, 2023, and September 7th, 2023. The first HIT batch was posted on August 22nd; the second on August 30th; the third on September 5th; and the fourth on September 7th. Each batch was completed within one day after posting. With two different interfaces, we collected a total of 127,080 labels (3,177 sentence segments × 2 interface variations × 20 workers).

Worker Wage. To target an hourly wage of $10, we referenced literature stating that the average reading speed on computer monitors is 250 words per minute [47]. We determined working minutes by dividing the number of tokens by 250, then rounding up. The task payment was set using the formula: $0.05 + (Estimated Working Minutes × $0.17). Consequently, 53% of our HITs were priced at $0.22, 46% at $0.39, 0.5% at $0.56, and 0.5% at $0.73. Across the two HITs for each abstract, we posted 40 assignments in total. Factoring in the 40% MTurk fee, the average cost to code each abstract with 40 workers was $16.94.
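The payment formula can be reproduced as a short sketch. The reading-speed constant (250 words per minute) and the pricing constants come from the text; the function name is ours.

```python
import math

def hit_payment(num_tokens, words_per_minute=250, base=0.05, per_minute=0.17):
    """Estimate a HIT's payment from its abstract length.

    Working minutes = tokens / reading speed, rounded up;
    payment = base + minutes * per-minute rate.
    """
    minutes = math.ceil(num_tokens / words_per_minute)
    return round(base + minutes * per_minute, 2)
```

For example, an abstract of 240 tokens is estimated at one working minute, so its HIT pays $0.22, while 260 tokens round up to two minutes and $0.39; the four price points in the text correspond to one through four estimated minutes.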

Collecting Gold-Standard Labels Using Experts
Similar to CODA-19 [27], we worked with two experts: a biomedical expert (Bio Expert) and a computer science expert (CS Expert). Both of these experts, who are also co-authors of this paper, manually annotated the entire test set of 200 abstracts from our MTurk study using the advanced interface. The inter-annotator agreement (Cohen's kappa) between the two was 0.788.
Bio Expert provided gold-standard labels.Dr. Chien-Kuang Cornelia Ding is the biomedical expert, referred to as the "Bio Expert" in Table 8.She is a faculty member in the Department of Pathology at the University of California, San Francisco.Dr. Ding possesses an M.D. and a Ph.D. in Genetics and Genomics.Notably, Dr. Ding played a critical role in the creation of the original CODA-19 label scheme, spending considerable time on manual data annotation to pinpoint corner cases and shape the initial CODA-19 instructions [27].Therefore, we trusted the Bio Expert's judgment throughout this study and treated her labels as the gold standard.
CS Expert labeled data for quality control in the annotation process and for benchmarking non-experts' performance limits. The "CS Expert" in Table 8, the first author of the paper, is a Ph.D. student in Informatics and well-acquainted with our annotation scheme. The purpose of collecting the CS Expert's labels is two-fold. First, we used a subset of the CS Expert's labels to remove underperforming workers in the annotation process (Section 3.2), simulating situations where the Bio Expert's gold-standard labels are partially unavailable, requiring requesters to label some data to evaluate workers' performance. These situations are typical when computer science researchers develop datasets in specialized fields like biomedicine [34,36,39,40,46,67]. In such instances, experts within the domain often are unable to label large datasets quickly, so out-of-domain experts, frequently CS graduate students, sometimes need to label portions of the data to assess the crowd labels' quality. Second, the accuracy of the CS Expert's labels sets an estimated upper limit for non-expert performance, given their familiarity with the task and focused attention. This benchmark helps in understanding the potential improvements in MTurk workers' performance through interface enhancements and attention checks.

Annotating Data Using GPT-4
We used the full worker instruction from the original CODA-19 dataset as GPT-4's prompt for our data labeling [27]. Our initial perception was that GPT-4 underperformed in this specific task, given that it was reported to have inferior performance compared to a SciBERT model fine-tuned on the CODA-19 dataset [10]. However, we noticed that the prompt used in that study did not contain the entire abstract [10], which might have led GPT-4 to rely on partial context for predictions. So, we modified the prompt to include the full abstract for a zero-shot approach. See Table 13 in the Appendix for our prompt details.
Following prior studies that compared GPT-4's zero-shot capabilities with crowd workers [52], we tested GPT-4 using both high (1.0) and low (0.2) temperature settings. For each setting, we executed the model five times and employed majority voting to determine the final label for every sentence segment. We employed OpenAI's GPT-4 8K context model, priced at $0.03 per 1,000 input tokens and $0.06 per 1,000 output tokens. The zero-shot GPT-4 experiment was conducted after collecting all crowd workers' data, on September 8th and 9th, 2023. Post annotation, we recorded 2,507,240 input tokens and 780,979 output tokens, bringing the total GPT-4 cost to $122.08. This amounted to an average cost of $0.61 per abstract.
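The per-segment majority vote over the five GPT-4 runs can be sketched as below. The five-run voting scheme is from the text; the tie handling shown (most common label wins, ties broken by first occurrence via `Counter`) is our assumption, as the exact tie-breaking rule for model runs is not specified.

```python
from collections import Counter

def vote_over_runs(runs):
    """Final label per segment from several model runs.

    runs: list of label lists, one per run, aligned by segment index.
    """
    final = []
    for seg_labels in zip(*runs):
        # Most frequent label; Counter breaks ties by first occurrence.
        final.append(Counter(seg_labels).most_common(1)[0][0])
    return final
```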

Label Cleaning Strategies
As described in Section 3.2, we removed underperforming workers' qualifications after each data batch so they could not participate in future batches. This raises an interesting question: How should we treat the labels these removed workers submitted? We explored three strategies in this paper:

• All: Retain every collected label without any exclusions.

• Exclude-By-Worker: Exclude labels from any MTurk worker who was ever removed.

• Exclude-By-Batch: Only exclude a label if its annotator was removed during that specific data batch. This means that if a worker was removed from a given batch, we only exclude their labels from that batch but retain those from prior batches.
Only the selected labels will proceed to the follow-up label aggregation step.
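Assuming labels are stored as (worker, batch, item, label) records, the three strategies amount to simple filters; the record layout and function names here are illustrative, not the paper's implementation.

```python
def clean_labels(records, removals, strategy="All"):
    """Apply a label-cleaning strategy.

    records:  list of (worker, batch, item, label) tuples.
    removals: {worker: batch in which the worker was removed}.
    """
    if strategy == "All":
        return list(records)
    if strategy == "Exclude-By-Worker":
        # Drop every label from any worker who was ever removed.
        return [r for r in records if r[0] not in removals]
    if strategy == "Exclude-By-Batch":
        # Keep a removed worker's labels from batches before their removal.
        return [r for r in records
                if r[0] not in removals or r[1] < removals[r[0]]]
    raise ValueError(strategy)
```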
3.6.1 Choosing Crowd-Kit. In our study, we implemented all the algorithms using Crowd-Kit, except for Majority Voting and MACE. This choice was driven by our aim to mimic a realistic crowdsourced data annotation process, leading us to select publicly available and easy-to-use toolkits for label aggregation. We selected Crowd-Kit for its regular updates and better maintenance over other tools like the Active Crowd Toolkit [56] and the CrowdTruth [17] framework's GitHub repository. Additionally, Crowd-Kit, with over 180 stars, is GitHub's most popular label-aggregation toolkit. It is likely to be the preferred choice for many users seeking aggregation algorithms for crowd labels. Evtikhiev et al. [19] employed the M-MSR algorithm from Crowd-Kit to derive human "ground truth" grades; Hiippala et al. [24] utilized its Dawid-Skene algorithm for identifying the most probable answers from three responses.

3.6.2 Comparison of the Algorithms.
We chose these algorithms for their representativeness and common use in label aggregation. This sub-subsection describes and compares all the algorithms used in our work. Majority Vote, a basic method, determines the final label based on the most common answer among workers. Its primary drawback is the equal weighting it assigns to all workers, regardless of their expertise level. The Dawid-Skene algorithm [13] assesses the expertise level of each worker by representing them with a confusion matrix and utilizes the Expectation-Maximization (EM) framework to formulate two iterative steps for this process; the One-Coin Dawid-Skene [63] variant follows the same principles as the original Dawid-Skene model but differs in the way it calculates worker errors during the M-step of the algorithm. The one-coin model, due to its simplicity, is easier to estimate and has better convergence properties. The M-MSR [35] model operates under the assumption that workers possess varying levels of expertise and represents these workers through vectors denoting their skills. It then estimates each worker's probability distribution by solving a rank-one matrix completion problem. The GLAD [60] algorithm models the difficulty of each task and then analyzes each worker's response to determine the optimal values through the application of Gradient Descent [65]. The MACE [25] approach focuses on determining whether a worker is spamming and creates a probability model that associates each worker with a specific label probability distribution. For more detailed insights, we point readers to the paper by Zheng et al. [65], which surveys 17 aggregation algorithms and finds the Dawid-Skene algorithm to generally provide reliable results.
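As a concrete illustration of the EM loop behind Dawid-Skene, here is a minimal, unoptimized sketch (not the Crowd-Kit implementation used in the study): posteriors are initialized from vote shares, the M-step estimates class priors and per-worker confusion matrices, and the E-step re-scores each task under those estimates.

```python
def dawid_skene(labels, n_iter=20):
    """Minimal Dawid-Skene EM over crowd labels.

    labels: list of (task, worker, label) triples.
    Returns {task: most probable label}.
    """
    tasks = sorted({t for t, _, _ in labels})
    workers = sorted({w for _, w, _ in labels})
    classes = sorted({l for _, _, l in labels})

    # Initialize per-task label posteriors from vote shares (soft majority vote).
    post = {}
    for t in tasks:
        votes = [l for tt, _, l in labels if tt == t]
        post[t] = {c: votes.count(c) / len(votes) for c in classes}

    for _ in range(n_iter):
        # M-step: class priors and per-worker confusion matrices.
        prior = {c: sum(post[t][c] for t in tasks) / len(tasks) for c in classes}
        conf = {w: {c: {k: 1e-6 for k in classes} for c in classes} for w in workers}
        for t, w, l in labels:
            for c in classes:
                conf[w][c][l] += post[t][c]
        for w in workers:
            for c in classes:
                z = sum(conf[w][c].values())
                for k in classes:
                    conf[w][c][k] /= z

        # E-step: recompute task posteriors given priors and confusions.
        for t in tasks:
            scores = {}
            for c in classes:
                s = prior[c]
                for tt, w, l in labels:
                    if tt == t:
                        s *= conf[w][c][l]
                scores[c] = s
            z = sum(scores.values()) or 1.0
            post[t] = {c: scores[c] / z for c in classes}

    return {t: max(post[t], key=post[t].get) for t in tasks}
```

In a toy run with two reliable workers and one worker who always gives the same answer, the learned confusion matrices discount the constant worker, which is exactly the behavior that distinguishes Dawid-Skene from plain majority voting.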
Moreover, we engaged in discussions with the developers of Crowd-Kit (who are not co-authors of this paper) about the Wawa [3] and ZBS [50] methods, which were not based on published academic papers. They explained that both Wawa and ZBS were developed in-house at Toloka, the data labeling platform that develops and maintains Crowd-Kit. Wawa offers a modest yet intuitive enhancement over the Majority Vote method. ZBS, on the other hand, was initially designed to manage streaming responses, particularly when new annotations are introduced. It can rapidly and efficiently update the skills of annotators, maintaining a modest quality improvement over Majority Voting. Importantly, it achieves this without the need to reprocess the entire dataset to infer the final label.
3.6.3 Implementation Details. First, we used the majority voting method, including its tie-breaker approach, directly from CODA-19 [27]. In cases of a tie, we prioritized the labels in the following sequence: Finding, Method, Purpose, Background, and Other. Second, we used the Dawid-Skene, One-Coin Dawid-Skene, M-MSR, GLAD, Wawa, and ZBS algorithms provided by Crowd-Kit. For all these models, we adhered to the default parameters specified by Crowd-Kit. M-MSR occasionally encountered failures when the number of crowd workers was less than 10. We chose to ignore these failures in our simulation phase; in cases where a failure occurred, we reinitiated the simulation using a newly shuffled group of workers. Finally, we selected the MACE implementation by Hovy et al. [25] due to its extensive use and high speed. Additionally, its aggregation results proved to be more stable compared to those of Crowd-Kit's MACE.
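The CODA-19 tie-breaker described above can be written as a keyed vote; the priority order (Finding, Method, Purpose, Background, Other) is from the text, and the function name is ours.

```python
from collections import Counter

# Lower index = higher priority when vote counts tie.
PRIORITY = ["Finding", "Method", "Purpose", "Background", "Other"]

def majority_vote(labels):
    """Most common label; ties resolved by the CODA-19 priority order."""
    counts = Counter(labels)
    best = max(counts.values())
    tied = [l for l, c in counts.items() if c == best]
    return min(tied, key=PRIORITY.index)
```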

EXPERIMENTAL RESULTS
In this section, we first overview the comparative results of GPT-4 and the MTurk pipeline under a variety of settings (Section 4.1) and then show the results of incorporating GPT-4 into MTurk pipelines (Section 4.2). Notably, we experimented with two interfaces, the basic interface and the advanced interface, so all the experiments have two sets of results.

GPT-4 vs. MTurk Pipelines
To evaluate labeling accuracy, we used the Bio Expert's labels as the gold standard. We employed the majority vote as our baseline model, as it was used in the original CODA-19 paper. Table 2 shows the results. GPT-4 exhibited accuracies of 83.6% and 83.3% at low (0.2) and high (1.0) temperatures, respectively. Using Majority Voting (MV), labels provided by MTurk workers from the Basic and Advanced interface groups achieved accuracies of 47.7% and 44.2%, respectively. The CS Expert achieved the highest accuracy of 85.9%.
Figure 3 shows the accuracy distribution of MTurk workers within the basic and advanced interface groups.We adopted the histogram format from Törnberg [52]'s work for easier comparison.As highlighted in our introduction, while most workers reached accuracies of 20% to 30%, the aggregated result is notably more reliable.We also noticed that workers in the advanced interface group exhibited marginally better performance, with their histogram leaning more to the right.
4.1.1 Exclude-By-Worker is the best label-cleaning strategy. Next, we experimented with the three label-cleaning strategies mentioned in Section 3.6. Tables 3 and 4 show the accuracy of the three strategies paired with different aggregation models. Both the Exclude-By-Worker and Exclude-By-Batch strategies enhanced aggregation accuracy, but Exclude-By-Worker produced superior results.
When pairing the One-Coin Dawid-Skene aggregation method with MTurk workers in the Advanced Interface group using the Exclude-By-Worker approach, we achieved the best accuracy of 81.5%.While this surpassed other aggregation models with different strategies, it did not exceed the performance of GPT-4 (83.6%).
4.1.2 While more crowd labels enhanced accuracy, even 20 could not surpass GPT-4. To better grasp how the number of workers affects aggregated labels' accuracy, we ran simulations using varying numbers of randomly selected worker results. The figures presented are averages from 20 rounds of these random selections. Figure 4 shows these results, with a detailed breakdown for the top-performing Exclude-By-Worker strategy in Table 5. Detailed results for the other two strategies can be found in the Appendix, specifically Table 8 and Table 9.
As a result, while more crowd labels improved accuracy, even aggregating 20 could not surpass GPT-4.We also noticed that in the simulations, accuracy improved for most aggregation methods as the number of MTurk workers increased.
As the Exclude-By-Worker label cleaning strategy systematically produced the best results, we showcase results only using the Exclude-By-Worker cleaning strategy throughout the remainder of the paper.

GPT-4 in MTurk Pipelines
Driven by a collaborative perspective on crowdwork, this paper emphasizes the importance of aggregating results from all labels for the final output.The previous section compared GPT-4 against MTurk pipelines as separate entities, and despite our efforts, nothing surpassed GPT-4's performance.It is intriguing to consider the potential impact of integrating GPT-4 as a worker into the label aggregation process.This subsection presents the findings.

4.2.1 Combining GPT-4 with crowd labels can potentially exceed GPT-4's solo performance. In our simulation study, we treated GPT-4 as an MTurk worker and selected its t=0.2 output due to its high accuracy. The results are shown in Figure 5. The x-axis in each figure indicates the number of human workers in the aggregation. A count of 5 workers, for example, refers to a mix of 5 MTurk workers and the GPT-4 model.
In aggregations that included GPT-4, the Advanced Interface results using One-Coin Dawid-Skene (Figure 5f) and MACE (Figure 5d) consistently surpassed GPT-4's performance (83.6%), peaking at an accuracy of 87.5%. Both settings show statistically significant improvements, with detailed results in Table 7. Notably, these were the only two settings in our experiments that bested GPT-4, demonstrating both the potential and the challenge of enhancing accuracy by incorporating crowd labels. This finding suggests that even a handful of crowd labels can be beneficial.
We also conducted a t-test analysis to compare GPT-4 (t=0.2) across various settings, including different aggregation methods, cleaning strategies, and user interfaces. The analysis included two-tailed paired t-tests at both the sentence and article levels. At the sentence level, each sentence's correctness was treated as a sample (N=3,177), while at the article level, we considered the average score within each article (N=200) as a sample. Despite the sentence-level t-test aligning more with our other analyses, its large sample size (N=3,177) resulted in very small p-values (p ≤ 0.001), limiting meaningful interpretation. The only exception was observed in the One-Coin Dawid-Skene method combined with the Exclude-By-Worker strategy and Advanced Interface, yielding a p-value of 0.007. Conversely, the article-level t-test produced marginally more interpretable results with slightly higher p-values. Consequently, we include the article-level t-test p-values in Tables 3 to 7, noting that most p-values remained quite small (p ≤ 0.001). These t-test results suggest that the integration of GPT-4 with the Advanced Interface, particularly with the One-Coin Dawid-Skene and MACE methods, significantly outperformed the pure GPT-4 setting (as detailed in Table 7).
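For reference, the article-level paired t statistic can be computed as below; the p-value then comes from the t distribution with N−1 degrees of freedom (e.g., via scipy.stats, which we omit here to keep the sketch dependency-free). The function name and toy values are illustrative.

```python
import math
from statistics import mean, stdev

def paired_t(scores_a, scores_b):
    """Paired t statistic over per-article accuracy scores (N pairs).

    t = mean(d) / (stdev(d) / sqrt(N)), where d are the per-pair differences.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))
```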
To further aid in interpreting these results, we calculated 95% confidence intervals for sentence-level accuracy (N=3,177).Tables 3, 4, 14, and 7 present these accuracies alongside their respective 95% confidence intervals, highlighting the precision and reliability of our findings.
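Under a normal approximation for a binomial proportion (our assumption; the text does not state which interval construction was used), a 95% interval for an accuracy p over N sentences is p ± 1.96·sqrt(p(1−p)/N):

```python
import math

def accuracy_ci(p, n, z=1.96):
    """Normal-approximation 95% confidence interval for accuracy p over n items."""
    half = z * math.sqrt(p * (1 - p) / n)  # half-width of the interval
    return p - half, p + half
```

At GPT-4's 83.6% accuracy over N=3,177 segments, the half-width is roughly ±1.3 percentage points.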
However, in contrast to the MTurk-only results (Section 4.1), merging GPT-4's outputs showed varied trends across aggregation models, as shown in Figure 5. Interestingly, as more MTurk workers were added, accuracy generally declined.

Table 2: Performance using Bio Expert as the gold standard. The CS Expert achieves the highest accuracy of 85.9% over all models. GPT-4 at temperatures of 0.2 and 1.0 has accuracies of 83.6% and 83.3%. Basic and Advanced Majority Vote had accuracies of 47.7% and 44.2%.

All models use Bio Expert as the gold standard. The baseline is Majority Vote (MV). From the Exclude-By-Worker results, the One-Coin Dawid-Skene aggregation model achieves the highest accuracy for both the basic and advanced interfaces. Advanced One-Coin Dawid-Skene reaches 81.5%, outperforming the other aggregation models in every aspect and almost reaching the accuracy of GPT-4 (t=0.2), 83.6%.

This subsection analyzes the potential factors contributing to these improvements.

4.3.1 Improvement occurs as the crowd's and GPT-4's labeling capabilities complement each other. We first noticed that, in Table 5, OneCoin and MACE were the only two algorithms that exceeded or matched GPT-4's class-specific F1 scores in any individual class. When aggregating crowd labels, most algorithms failed to surpass GPT-4's F1 score for any label class. However, OneCoin (0.880) and MACE (0.872) achieved F1 scores higher than or similar to GPT-4's (0.872) in the Finding/Contribution class (Table 5). We further visualized OneCoin's and MACE's confusion matrices, alongside WAWA and GLAD for comparison, in Figure 6. The confusion matrices show that while OneCoin and MACE did not outperform GPT-4 in the other classes, they achieved higher recall than GPT-4, specifically in the Finding/Contribution class.

Table 7: Aggregation Accuracy Results of the Advanced Interface integrated with GPT-4 Group. Bold and underline highlight the highest score within the column and across the table, respectively. OneCoin and MACE are the only two aggregation methods that outperform GPT-4, and the differences are statistically significant, as shown in the table. P-values are obtained by comparing with GPT-4 over the article-level accuracy. (**: p<0.01; ***: p<0.001. Paired t-test. N=200.)

These observations suggest a hypothesis: when crowd labels exhibit specific strengths surpassing GPT-4's capabilities, effectively compensating for GPT-4's weaknesses (in our case, labeling the Finding/Contribution class), their aggregation can lead to even higher accuracy.
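The per-class comparison above rests on confusion matrices and per-class recall, which can be computed in a few lines of plain Python; the gold and predicted label lists here are small illustrative examples, not our annotations.

```python
# Sketch: build a confusion matrix and per-class recall in plain Python,
# on small illustrative label lists (not our actual data).
from collections import defaultdict

classes = ["Background", "Purpose", "Method", "Finding", "Other"]
gold = ["Finding", "Method", "Finding", "Background", "Finding", "Purpose"]
pred = ["Finding", "Method", "Finding", "Background", "Method", "Purpose"]

# confusion[g][p] counts items with gold label g and predicted label p.
confusion = defaultdict(lambda: defaultdict(int))
for g, p in zip(gold, pred):
    confusion[g][p] += 1

for cls in classes:
    total = sum(confusion[cls].values())
    recall = confusion[cls][cls] / total if total else 0.0
    print(f"{cls}: recall = {recall:.2f}")
```

Comparing such per-class recall between an aggregation method and GPT-4 is exactly how the Finding/Contribution advantage above was identified.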
To test this hypothesis, we counted the GPT-4 labels that were flipped to correct or to incorrect after aggregation with crowd labels. The results are shown in Figure 7. It appears that the Finding class played the most important role in enhancing the accuracy of GPT-4's labels when aggregated with crowd labels using the MACE and OneCoin algorithms. As Finding is the only class where the crowd outperformed GPT-4, these results suggest that the aggregation algorithms leveraged the strengths of both the crowd workers and GPT-4 to achieve better overall accuracy.
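This flip analysis reduces to comparing three label assignments per item: the gold label, GPT-4's label, and the aggregated label. A minimal sketch, with entirely hypothetical label dictionaries:

```python
# Sketch: count GPT-4 labels flipped to correct / incorrect by aggregation,
# grouped by gold class. All label dictionaries here are illustrative.
from collections import Counter

gold = {"s1": "Finding", "s2": "Method", "s3": "Finding", "s4": "Purpose"}
gpt = {"s1": "Method", "s2": "Method", "s3": "Background", "s4": "Purpose"}
aggregated = {"s1": "Finding", "s2": "Background", "s3": "Finding", "s4": "Purpose"}

flipped_correct, flipped_incorrect = Counter(), Counter()
for item, g in gold.items():
    was_right = gpt[item] == g
    now_right = aggregated[item] == g
    if not was_right and now_right:
        flipped_correct[g] += 1      # aggregation fixed GPT-4's label
    elif was_right and not now_right:
        flipped_incorrect[g] += 1    # aggregation broke GPT-4's label

print(dict(flipped_correct))    # {'Finding': 2}
print(dict(flipped_incorrect))  # {'Method': 1}
```

Plotting these two counters per class, as positive and negative bars, yields charts of the kind shown in Figure 7.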

4.3.2 Where did the crowd and GPT-4 disagree with gold-standard labels? We manually coded a small set of error cases from the crowd and GPT-4, identifying four primary sources of disagreement with the gold-standard labels:
• [Ambiguous]: First, the text was unclear and could be categorized under either of two different labels. For instance, the following sentence segment mentioned both "Purpose" and "Method": "allow us to provide health strategies with more intelligent capabilities to accurately predict the outcomes of the disease in daily life and the hospital and prevent the progression of this disease and its many complications." GPT-4 labeled it as "Purpose", while the Bio Expert marked it as "Method".
• [Keyword]: Second, the sentence segment began with an annotation keyword, yet the subsequent content did not align with the implied meaning of that keyword. For example, this segment contained the keyword "Conclusion", which ideally should be categorized as "Finding": "CONCLUSION: Regarding all initial variables," However, GPT-4 classified it as "Other".
• [Short]: Third, the text was just a short fragment of a longer sentence, containing minimal information. For instance, "Most participants were female (68%)," and "4.5 M) in Canada,".
• [Context]: Finally, some sentence segments can only be annotated correctly within context, because they could be classified under a different label when appearing alone. For example, the gold-standard labels of the following two segments were both "Finding": "and examine associations longitudinally to better understand causality." and "and financial strain may be mutually reinforcing and compound the health consequence of smoking." Appearing alone, the former could be labeled as "Purpose," and the latter as "Background".
To understand how frequently each of the four main reasons for disagreement between crowd workers and the gold standard occurred, we selected 200 cases where the top-performing model (Advanced Interface using One-Coin Dawid-Skene, excluding GPT-4 labels) and the gold standard did not match. The first author then classified these disagreements into five categories: Ambiguous, Keyword, Short, Context, and Other, based on the main reasons for disagreement described earlier. The "Other" category included disagreements that either did not align with the first four categories or had unclear causes. Upon reviewing the 200 cases, our analysis showed the following distribution: 65 (32.5%) cases were due to Ambiguous, 9 (4.5%) to Keyword, 17 (8.5%) to Short, 27 (13.5%) to Context, and 82 (41%) fell into the "Other" category. Notably, ambiguity was the predominant reason for disagreement, presenting classification challenges even to our experts. We also found that around 40% of the disagreements were categorized as "Other," highlighting the need for more research to better align crowd workers with expert judgments.

Why did some aggregation algorithms outperform others?
In our study, the One-Coin Dawid-Skene and MACE methods were the top two algorithms in most cases, achieving the best accuracies. Zheng et al. [65] surveyed 17 label-aggregation algorithms and concluded that the Dawid-Skene algorithm generally provides reliable results. One-Coin Dawid-Skene is a variant of the original Dawid-Skene algorithm; its simplicity allows it to perform more effectively on smaller datasets. Given the modest size of the dataset used in our study, this attribute likely contributed to its superior performance.
Another insight we can offer is that many label-aggregation algorithms approach the problem of inferring final labels by dividing it into two aspects: (i) assessing the capabilities of the crowd workers and (ii) estimating the difficulty of the tasks. Based on our experience, in crowdsourced data-labeling tasks, particularly within the same data batch, the difficulty of the tasks tends to be relatively uniform. Therefore, the crucial factor for accuracy appears to be effective estimation of the workers' performance.
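One-Coin Dawid-Skene illustrates this focus on worker performance directly: it models each worker with a single accuracy parameter, spreading errors uniformly over the remaining classes, and alternates between estimating item labels and worker accuracies via EM. The sketch below is our own simplified illustration (no priors or smoothing, unlike production implementations such as Crowd-Kit's), and the example votes are hypothetical:

```python
# Simplified EM for the One-Coin Dawid-Skene model: each worker has one
# accuracy parameter; errors are uniform over the other classes.
from collections import defaultdict

def one_coin_dawid_skene(votes, classes, iters=20):
    """votes: iterable of (item, worker, label) triples -> {item: label}."""
    k = len(classes)
    by_item = defaultdict(list)
    for item, worker, label in votes:
        by_item[item].append((worker, label))

    # Initialize posteriors from raw vote fractions (a soft majority vote).
    post = {
        item: {c: sum(1 for _, l in wl if l == c) / len(wl) for c in classes}
        for item, wl in by_item.items()
    }

    for _ in range(iters):
        # M-step: a worker's accuracy is the expected share of correct votes.
        correct, total = defaultdict(float), defaultdict(float)
        for item, wl in by_item.items():
            for worker, label in wl:
                correct[worker] += post[item][label]
                total[worker] += 1.0
        skill = {w: min(max(correct[w] / total[w], 1e-6), 1 - 1e-6)
                 for w in total}

        # E-step: recompute each item's label posterior from worker skills.
        for item, wl in by_item.items():
            scores = {c: 1.0 for c in classes}
            for worker, label in wl:
                p = skill[worker]
                for c in classes:
                    scores[c] *= p if c == label else (1 - p) / (k - 1)
            z = sum(scores.values())
            post[item] = {c: s / z for c, s in scores.items()}

    return {item: max(p, key=p.get) for item, p in post.items()}

votes = [
    ("s1", "w1", "Finding"), ("s1", "w2", "Finding"), ("s1", "w3", "Method"),
    ("s2", "w1", "Purpose"), ("s2", "w2", "Purpose"), ("s2", "w3", "Purpose"),
]
final = one_coin_dawid_skene(
    votes, ["Background", "Purpose", "Method", "Finding", "Other"])
print(final)  # {'s1': 'Finding', 's2': 'Purpose'}
```

Because the model estimates one skill per worker rather than a full per-class confusion matrix, it has far fewer parameters, which is consistent with its stronger performance on modest datasets like ours.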

Generalizability
Our study focused solely on the CODA-19 data-labeling task, and thus we cannot guarantee that our specific findings, such as the settings that could outperform GPT-4, will apply to other tasks. However, we believe the methodology we used, as well as our high-level takeaways, are broadly applicable to a range of labeling tasks. Our confidence is based on two key observations. First, the distribution of worker performance in our "MTurk vs. GPT" study (Figure 3) aligns with previous research at the individual-worker level [52]. This suggests that, at the pipeline level, MTurk annotation pipelines would behave similarly on the tasks from these prior studies, such as tweet labeling. Second, the principle of combining two labeling sources with complementary strengths to enhance annotation quality is well established in existing research [4,49]. Based on these points, we are optimistic that our higher-level findings, particularly that GPT-4 generally surpasses traditional crowdsourcing pipelines in accuracy and that combining crowd and GPT inputs can further improve results, could apply to other data-labeling tasks.

Crowdsourced Data Annotation Practices in the Era of LLMs
Our study demonstrates that even a meticulously crafted MTurk pipeline may not outperform zero-shot GPT-4 in labeling accuracy. (Note that we did not even test GPT-4 in few-shot settings.) Despite our extensive experience in crowdsourcing, with several years of using MTurk, this paper's crowdsourced data-annotation effort was still demanding. We spent weeks developing, testing, and implementing the Basic and Advanced interfaces. After finalizing these interfaces, another two weeks were dedicated to posting tasks and gathering data from MTurk workers, mixed with significant effort to review and filter their submissions. Even with this level of commitment, we could not match GPT-4's performance. In contrast, the efficiency of GPT-4 was outstanding.
The design, testing, and execution of GPT-4's annotation tasks took two days. From a cost perspective, GPT-4 is also more affordable than hiring MTurk workers: the main experiment with GPT-4 totaled only $122.08, compared to the $4,508 spent on MTurk ($3,388 for annotation tasks and $1,120 for qualification tasks). This brings us to a pivotal question: given that LLMs can now, in some instances, outperform human annotators, how will data-annotation practices evolve? While we cannot answer this definitively, we offer a few thoughts based on our study:
• First, the value of expert-level, high-quality labels will likely rise significantly. In our study, gold labels played a central role in several critical decisions: refining prompts for greater efficacy, choosing the most effective label-cleaning strategy, and selecting the best label-aggregation algorithms. These decisions led us to the few parameter combinations (Advanced Interface + OneCoin/MACE + incorporating GPT-4) that eventually surpassed GPT-4's performance. Given that GPT-4 has matched crowd workers in many scenarios but has not yet consistently outdone domain experts, we anticipate a shift in data annotation toward amassing smaller, exceptionally high-quality datasets.
• Second, the research focus might shift from "using AI to support human labelers" to "using humans to enhance AI labeling." Our study showed that by carefully adding a few crowd labels, GPT-4's accuracy can be improved.
Given the cost and difficulty of finding expert labelers, using non-expert labels to enhance LLMs' performance will likely become more critical.
• Finally, while it might appear to be a nuanced point, we believe the human-computer interaction (HCI) challenges in the human annotation process will become central again. In our study, initial observations suggested marginal differences between the Basic and Advanced interfaces: the variations in accuracy among workers from both groups were subtle, and aggregation methods did not offer an apparent advantage to those using the Advanced interface. Majority voting even favored the Basic interface group (Table 5). However, the more detailed analyses, i.e., the worker-number analysis and the GPT-4 aggregation study, showed the strengths of the Advanced interface: workers using it provided more consistent labels, making it the only interface to surpass GPT-4. Given LLMs' high labeling accuracy, we will likely need even more reliable human labels in the future to boost their performance further. Developing systems that allow users, especially non-expert annotators, to perform reliably and consistently is essentially an HCI problem.

Limitations
Our study has several limitations worth noting. First, we focused exclusively on one annotation task, CODA-19. Consequently, our findings may not apply to other tasks, especially those where crowd workers might surpass LLMs. Second, our research employed only GPT-4, so it may not reflect the behaviors of other LLMs, such as GPT-3.5. Additionally, GPT-4 is a closed model, so generalizations to open models like LLaMA may not apply. Third, although it appears unlikely, we cannot dismiss the possibility that crowd workers used LLMs while completing tasks, as noted by Veselovsky et al. [57]. We designed our interface to enable quick completion of each sentence segment, usually within seconds; this design discouraged workers from copying text into ChatGPT for answers, which would take significantly longer. However, admittedly, our interfaces lack specific mechanisms to prevent workers from using LLMs. Finally, we invested significant resources in worker recruitment and segmentation through Qualification tasks, which was essential for our study's objectives. However, this approach diverges from many MTurk tasks with broader recruitment strategies. While we stand by our conclusions, given our study's expansive scale, crowd performance might be marginally lower in more open recruitment settings due to the inclusion of newer workers and the lack of a cohesive group dynamic.

CONCLUSION AND FUTURE WORK
This paper evaluated GPT-4's labeling capabilities against a well-executed, ethical crowdsourcing pipeline for annotating unseen data. Using the CODA-19 labeling scheme, we exhaustively tested various label-cleaning strategies, label-aggregation techniques, and interface designs on MTurk. Despite adhering to best crowdsourcing practices, the best-performing MTurk pipeline achieved an accuracy of 81.5%, slightly below GPT-4's 83.6%. Interestingly, by optimizing the combination of label-aggregation techniques and interfaces, integrating GPT-4 labels into the MTurk aggregation process boosted accuracy to 87.5%. Moving forward, our research will focus on generating a smaller set of high-quality labels via MTurk, aiming to further enhance the labeling performance of already sophisticated LLMs like GPT-4. Additionally, we will delve deeper into the influence of worker-interface design on label quality to further improve LLM performance.

C PROMPT
If in a Crowdsourced Data Annotation Pipeline, a GPT-4 (CHI '24, May 11-16, 2024, Honolulu, HI, USA)

Zero-shot Prompt

Classify the given text into one of the following labels.
[Background]: Text segments answer one or more of these questions: Why is this problem important? What relevant works have been created before? What is still missing in the previous works? What are the high-level research questions? How might this help other research or researchers?
[Purpose]: Text segments answer one or more of these questions: What specific things do the researchers want to do? What specific knowledge do the researchers want to gain? What specific hypothesis do the researchers want to test?
[Method]: Text segments answer one or more of these questions: How did the researchers do the work or find what they sought? What are the procedures and steps of the research?
[Finding]: Text segments answer one or more of these questions: What did the researchers find out? Did the proposed methods work? Did the thing behave as the researchers expected?
[Other]: Text fragments that do NOT fit into any of the four categories above.
3. This text fragment is NOT in English. What should I do? If the whole fragment (or the majority of words in the fragment) is not in English, please label it as "Other". If the majority of the words in this fragment are in English with a few non-English words, please judge the label normally.
4. I'm not sure if this should be a "background" or a "finding." How do I tell? When a sentence occurs in the earlier part of an article, and it is presented as a known fact or looks authoritative, it is often "background" information.
5. Do "potential applications of the proposed work" count as "background" or "purpose"? It should be "background." The "purpose" refers to specific things the paper wants to achieve.
6. If the article says it's a "literature review" (e.g., "We reviewed the literature" / "In this article, we review ..." etc.), would we classify those as finding/contribution or purpose? Most parts of a literature review paper should still be "background" or "purpose", and only the "insight" drawn from a set of prior works can be viewed as a "finding/contribution".
7. What should I do with a case study on a patient? Typically, a patient comes into the ER with a set of signs and symptoms, and then the patient gets assessed and diagnosed. The patient is admitted to the hospital ICU, tests are done, and they may be diagnosed with something else. In such cases, please label the interventions done by the medical staff (e.g., CT scans, X-rays, and medications given) as "Method", and the patient's final result (e.g., the patient's pneumonia resolved and he was released from the hospital) as "Finding/Contribution".
(1) In Table 6 in the original paper, the data for the eight aggregation methods listed under the 'Exclude-By-Worker' and 'Exclude-By-Batch' categories were mistakenly swapped during the editing process. The 'Avg. Acc' and '#workers' rows were correct. (2) The results presented in Table 12 in the original paper should be replaced by the results in Table 15 in this document.
In Table 12 (in the Appendix) of the original paper, within the 'Basic UI' section, the data in the Majority Vote (MV) row was pasted incorrectly: the data presented in the original paper were mistakenly copied from the ZBS row under the 'Advanced UI' section of Table 11 (in the Appendix). This error did not affect our conclusion. (3) Additionally, the sentence on page 8 of the original paper, "We also noticed that in the simulations, accuracy improved for most aggregation methods as the number of MTurk workers increased, but MACE's performance was notably inconsistent." should be corrected as follows: "We also noticed that in the simulations, accuracy improved for most aggregation methods as the number of MTurk workers increased." The last part of the original sentence referred to the unstable performance of an older tool used in early experiments, the Crowd-Kit implementation of the MACE method, as detailed in Sections 3.6.1 and 3.6.3 of the original paper. In the final version of the paper, we instead used Hovy's MACE implementation, which we also described in Section 3.6 of the original paper. Therefore, the last part of the sentence needed to be removed. This error did not impact the main conclusion or affect any other results presented in the original paper. These corrections do not affect our overall conclusions; our core findings remain valid.

Figure 1 :
Figure 1: The basic worker interface, designed by one of the authors, prioritizes task simplicity and user-friendliness.

Figure 2 :
Figure 2: The advanced worker interface, adopted from CODA-19 [27], incorporates advanced features such as a visual feedback button, a color-coded annotation view, and a time-lock mechanism to deter hasty spam submissions.

Figure 3 :
Figure 3: A density histogram of all individual crowd workers' accuracy (filtered by the Exclude-By-Worker strategy). The dashed lines show the majority-vote accuracy and the best aggregation-algorithm accuracy for both the basic and advanced interface responses, the accuracy of GPT-4 (t=0.2 and t=1.0), and the accuracy of the CS Expert. Accuracy uses the Bio Expert as the gold standard.

Figure 4 :
Figure 4: Aggregation methods for All Workers, Exclude-By-Worker, and Exclude-By-Batch. Among the various models and strategies, only the combination of the One-Coin Dawid-Skene aggregation method and workers in the Advanced Interface group using the Exclude-By-Worker approach closely approached the performance of GPT-4. The other models and strategies were unable to reach GPT-4's performance level.

Figure 5 :
Figure 5: Exclude-By-Worker simulation results applied to different aggregation models. The One-Coin Dawid-Skene and MACE algorithms, combined with the advanced-interface results, had the best accuracy and outperformed GPT-4 at temperature 0.2 (see (d) and (f)).

Figure 6 :
Figure 6: Comparison of the confusion matrices before and after combining with GPT-4's predictions for the Advanced Interface when applying the Exclude-By-Worker cleaning strategy. Integrating crowd labels with GPT-4 enhances overall performance only when the crowd-only condition's Finding accuracy surpasses that of GPT-4 alone, as is evident for both the One-Coin and MACE methods.
(a) One-Coin Dawid-Skene on Crowd + GPT Exclude-By-Worker results: annotations flipped to correct and flipped to incorrect relative to the GPT-only result, under the Advanced Interface condition. (b) MACE on Crowd + GPT results: annotations flipped to correct and flipped to incorrect relative to the pure crowd workers' result, under the Advanced Interface Exclude-By-Worker condition.

Figure 7 :
Figure 7: One-Coin Dawid-Skene and MACE flip-to-correct/incorrect stacked bar charts for the Advanced Interface under Exclude-By-Worker, compared to the GPT-only results, respectively. Figures 7a and 7b show that crowd workers benefit GPT-4's results mainly in the "Finding" category (positive X-axis), while some of GPT-4's annotations decline across all labels at different levels (negative X-axis).

Figure 8 :
Figure 8: Worker-result color-coded visualization interface for rapid label inspection. It displayed worker annotations, the majority vote, and expert annotations (if available), where different labels had different colors. To simulate a practical scenario, we pretended that we had limited expert annotations during the experiment. As a result, some abstracts did not have expert annotations when we did the comparison.

Figure 9 :
Figure 9: All Workers simulation results applied to different aggregation models.

Figure 10 :
Figure 10: Exclude-By-Batch simulation results applied to different aggregation models.

Table 3 :
Aggregation Accuracy Results of the Basic Interface Group. Bold and underline highlight the highest score within the column and across the table, respectively. P-values are obtained by comparing with GPT-4 over the article-level accuracy. (**: p<0.01; ***: p<0.001. Paired t-test. N=200.)

Table 6 :
Aggregation Accuracy Results of the Basic Interface integrated with GPT-4 Group. Bold and underline highlight the highest score within the column and across the table, respectively. P-values are obtained by comparing with GPT-4 over the article-level accuracy. (**: p<0.01; ***: p<0.001. Paired t-test. N=200.)

Table 8 :
All Workers Table. All models use Bio Expert as the gold standard. The baseline is the Majority Vote (MV).

Table 9 :
Exclude-By-Batch Table. All models use Bio Expert as the gold standard. The baseline is the Majority Vote (MV).

(Continuation of the worker-visualization description:) ...vote and, if used, expert annotations. This visualization also includes a central worker panel, which shows detailed statistics for each worker, such as the number of correct answers, total attempts, and their accuracy compared to the majority vote and expert annotations.

Table 10 :
All Workers integrated with GPT-4 Table. All models use Bio Expert as the gold standard. The baseline is the Majority Vote (MV).

Table 13
shows the zero-shot prompt we used for querying LLMs.

Table 11 :
Exclude-By-Worker integrated with GPT-4 Table. All models use Bio Expert as the gold standard. The baseline is the Majority Vote (MV). From the Exclude-By-Worker results, the One-Coin Dawid-Skene aggregation model achieves the highest accuracy for both the basic and advanced interfaces. Advanced One-Coin Dawid-Skene reaches 86.6% and outperforms the other aggregation models in every aspect, surpassing the accuracy of GPT-4 (t=0.2), 82.7%.

Table 12 :
Exclude-By-Batch integrated with GPT-4 Table. All models use Bio Expert as the gold standard. The baseline is the Majority Vote (MV).
Text fragments that are NOT part of the article. Text fragments that are NOT in English. Text fragments that contain ONLY reference marks (e.g., "[1,2,3,4,5") or ONLY dates (e.g., "April 20, 2008"). Captions for figures and tables (e.g., "Figure 1: Experimental Result of ...", or "Table 1: The Typical Symptoms of ..."). Formatting errors. I really don't know or I'm not sure.

FAQs
1. This text fragment has terms that I don't understand. What should I do? Please use the context in the article to figure out the focus. You can look up terms you don't know if you feel you need to understand them.
2. This text fragment is too short to mean anything. What should I do? If the text fragment is too short to have significant meaning, you could consider the entire sentence and answer based on the entire sentence.