The State of Pilot Study Reporting in Crowdsourcing: A Reflection on Best Practices and Guidelines

Pilot studies are an essential cornerstone of the design of crowdsourcing campaigns, yet they are often only mentioned in passing in the scholarly literature. A lack of details surrounding pilot studies in crowdsourcing research hinders the replication of studies and the reproduction of findings, stalling potential scientific advances. We conducted a systematic literature review on the current state of pilot study reporting at the intersection of crowdsourcing and HCI research. Our review of ten years of literature included 171 articles published in the proceedings of the Conference on Human Computation and Crowdsourcing (AAAI HCOMP) and the ACM Digital Library. We found that pilot studies in crowdsourcing research (i.e., crowd pilot studies) are often under-reported in the literature. Important details, such as the number of workers and rewards to workers, are often not reported. On the basis of our findings, we reflect on the current state of practice and formulate a set of best practice guidelines for reporting crowd pilot studies in crowdsourcing research. We also provide implications for the design of crowdsourcing platforms and make practical suggestions for supporting crowd pilot study reporting.


INTRODUCTION
Crowdsourcing is an empirical research area that involves human subjects. The very ingredients that make crowdsourcing a powerful paradigm (diversity in the background of participating individuals and independence in their opinion [203]) also lead to a wide range of behavior and a high variance in performance. It is therefore no surprise that a majority of work in crowdsourcing research over the last two decades has focused on addressing challenges related to quality [29,65,112,113]. This well-documented variability in human behavior and performance while carrying out crowdsourcing tasks interacts with other task parameters, such as the task reward [240], task complexity [238], task clarity [68], batch size [47], and reward schemes [59], to shape outcomes. Many such influential configuration parameters of a crowdsourcing campaign are not known before the campaign is launched. For this reason, researchers and practitioners turn to pilot studies to inform their design choices and fine-tune such parameters. Pilot studies are a vital part of crowdsourcing research, and researchers often launch one or several small-scale studies before launching the main study. One typical reason, among others, is to estimate the average completion time of crowdsourced tasks with the aim of appropriately setting the monetary rewards for the larger main study. In this work, we refer to these small preliminary studies, which are often used to calibrate crowdsourcing task design parameters or inform main studies, as crowd pilot studies.
[...] different sections or in different levels of detail. Yet in our literature review, we show that there are many commonalities between crowd pilot studies across a multitude of different fields that would allow common reporting standards to be developed. But while there are guidelines and checklists for running and reporting crowdsourcing studies [50,168,170,174,196], there is a gap in the scholarly literature on pilot studies in crowdsourcing research.
In this paper, we aim to address this gap and synthesize the best practices in reporting crowd pilot studies. To this end, we first conducted a systematic literature review. Our screening of 513 articles downloaded from the ACM Digital Library (ACM-DL) and the proceedings of the AAAI Conference on Human Computation and Crowdsourcing (HCOMP), a premier venue for crowdsourcing related research, resulted in a corpus of 171 articles. We systematically analyze this corpus to capture the current state of crowd pilot study reporting in the scholarly literature. Our aim is to report and reflect on the current state of pilot study reporting at the intersection of the HCI and crowdsourcing literature. To this end, we identified whether and to what extent the following information is being reported in articles:
RQ1: Why are crowd pilot studies typically conducted?
RQ2: How are crowd pilot studies typically reported?
RQ3: What do crowd pilot studies report?
While pilot studies are very common in crowdsourcing research, little is known and reported about them in the scholarly literature [168,170]. Therefore, much of the knowledge from running pilot studies is held by researchers with experience in crowdsourcing. An experienced researcher may, for instance, decide not to conduct a crowd pilot study at all because experience tells them which parameters of a crowdsourcing campaign will work best. We therefore complemented our literature review with a survey study with experienced crowdsourcing researchers to fill the aforementioned gap. The survey study investigated broader topics not explicitly reported in the scholarly literature:
RQ4: What makes a "good" crowd pilot study?
RQ5: What are the factors that promote or obstruct reporting crowd pilot studies?
RQ6: How can crowd pilot studies be facilitated with platform-specific features?
To the best of our knowledge, our work is the first to provide a detailed investigation of the current state of practice of crowd pilot study reporting in the crowdsourcing and HCI literature. Based on the findings of our literature review and survey study, we provide a set of guidelines for reporting crowd pilot studies. We reflect on the trade-offs around running pilot studies and discuss implications for the design of crowdsourcing platforms. All data pertaining to our work in this paper are made publicly available for the benefit of the research community and in the spirit of open science.
Our work is structured as follows. We first provide a brief review of related literature in Section 2. We then describe our methodological approach for conducting the systematic literature review and the complementary survey study in Section 3. In Section 4, we present the results of our analysis, followed by a reflection and discussion of our findings in Section 5. We discuss caveats and limitations of our work in Section 5.5 and conclude in Section 6.

Pilot Studies in Crowdsourcing-based Research
The crowdsourcing paradigm has seen vast adoption in academia and industry. Crowdsourcing is a cost-effective method for conducting online experiments [156] and user studies [112]. However, designing an effective crowdsourcing campaign is not an easy task and there are many pitfalls for requesters when designing crowdsourcing campaigns. For instance, task clarity is one important determinant of work quality [68]. Many other factors can potentially affect the work quality, such as a task's complexity [21], usability, and accessibility [211].
Crowd pilot studies are typically conducted to address these challenges. Crowd pilot studies aim to iteratively design a task and empirically determine design parameters of a crowdsourcing campaign, such as an estimate of the average completion time per task. This estimate can then be used to calculate the price per task for the main study. Before running a pilot study, the average completion time is unknown. Therefore, trial and error is needed to determine an accurate task pricing for microtasks [229]. Determining the amount of pay is part of the design of every crowdsourcing campaign involving monetary incentives.
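The step from a completion-time estimate to a price per task can be illustrated with a minimal sketch. The function name, the use of the median as the "typical" completion time, the target hourly wage, and rounding up to the next cent are all assumptions made for illustration; they are not taken from the reviewed literature.

```python
import math
from statistics import median


def price_per_task(completion_times_sec, hourly_wage_usd):
    """Derive a per-task reward (USD) from pilot completion times.

    Assumptions (illustrative only): the median summarizes the pilot
    observations, and the reward is rounded up to the next full cent so
    that a typical worker is not paid below the target hourly wage.
    """
    if not completion_times_sec:
        raise ValueError("need at least one pilot observation")
    typical_seconds = median(completion_times_sec)
    cents = hourly_wage_usd * 100 * typical_seconds / 3600
    # round() guards against floating-point noise before rounding up
    return math.ceil(round(cents, 6)) / 100


# Hypothetical pilot: five workers, target wage of US$15 per hour
pilot_times = [90, 120, 150, 110, 400]
print(price_per_task(pilot_times, 15.0))
```

The median is used here because a single very slow worker (400 seconds in the hypothetical data) would otherwise inflate a mean-based estimate; whether that is desirable depends on the campaign.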
Tools and methods have been developed to support requesters in determining the above parameters. Objective measures like ETA (error-time area) [32] have been proposed to help researchers accurately structure and price their work. ETA empirically models the relationship between time and error rate by manipulating the time that workers have to complete a task [32]. Requesters may use the ETA measure to rapidly and iteratively test different task designs and measure whether the changes improve the performance of workers and task outcomes. Besides ETA, other tools supporting requesters in designing tasks and crowdsourcing campaigns have been developed. Manam et al. [131] developed a linting tool that automatically uncovers ambiguities in task instructions and supports requesters in writing task instructions with greater clarity. Nouri et al. [148] proposed methods to computationally assess the clarity of tasks and designed a tool to help requesters improve tasks iteratively. Nobre et al. [147] presented a system for running and monitoring pilot studies.
However, in practice, the most typical remedy to the above challenges is running informal, small-scale studies with prototypical tasks. These small-scale studies are often conducted iteratively to rapidly uncover issues in the design of the task or to empirically derive estimates of important determinants of the crowdsourcing campaign (such as the task pricing).
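To make the idea of relating time to error rate concrete, the following is a minimal sketch that summarizes an error-versus-time-limit curve by its area. It assumes that the measure can be approximated as the trapezoidal area under measured (time limit, error rate) points; the exact definition in [32] may differ, and the function name and data are hypothetical.

```python
def error_time_area(time_limits, error_rates):
    """Approximate the area under an error-time curve (trapezoidal rule).

    Assumption: each point pairs a time limit given to workers with the
    error rate observed under that limit. A smaller area suggests that
    workers reach low error rates with less time.
    """
    if len(time_limits) != len(error_rates) or len(time_limits) < 2:
        raise ValueError("need at least two matching (time, error) points")
    points = sorted(zip(time_limits, error_rates))
    area = 0.0
    for (t0, e0), (t1, e1) in zip(points, points[1:]):
        area += (t1 - t0) * (e0 + e1) / 2
    return area


# Hypothetical comparison of two task designs over the same time limits
design_a = error_time_area([1, 2, 3], [0.5, 0.3, 0.1])
design_b = error_time_area([1, 2, 3], [0.6, 0.5, 0.3])
print(design_a < design_b)
```

Under these assumptions, comparing the areas of two designs gives a single number for deciding whether a design change helped.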

Guidelines for Conducting and Reporting Crowdsourcing Studies
Several best practices and guidelines have been developed for requesters to design crowdsourcing campaigns. These guidelines are motivated by two primary concerns. Some authors take the workers' perspective and aim to provide guidelines for requesters to conduct fair and responsible crowd work. Other authors provide guidelines from the requester's point of view, aiming to optimize the efficiency, cost, quality, and accuracy of crowdsourced work.
From the requester's perspective, Cobanoglu et al. [35] presented a guide and best practices for using crowdsourcing platforms. These guidelines are primarily meant as a beginner's guide to crowdsourcing. Simperl [196] also provided guidelines and examples on using crowdsourcing effectively. The guidelines take a system development perspective aiming to provide "design and participation best practices" guiding the development of crowdsourcing systems. Alonso [6] provided a short list of guidelines for designing crowdsourcing studies. The article is scoped to practical aspects when conducting relevance evaluations. Redish and Laskowski [174] presented guidelines for writing clear instructions for voters and poll workers. While this report is not written for the crowdsourcing domain, it provides takeaways for writing clear instructions to crowd workers. Gadiraju et al. [66] explored the different ways in which tasks can be exploited by unreliable workers in surveys and proposed task design guidelines to thwart such behavior and ensure quality control. Whiting et al. [229] introduced a means to help requesters automatically pay workers a minimum wage by adding a one-line script tag to their task HTML on Amazon Mechanical Turk (MTurk). Draws et al. [50] proposed a checklist as a practical tool that requesters can use to improve their task designs by mitigating cognitive biases of workers and to appropriately describe potential limitations of collected data.
Guidelines written with the workers' perspective in mind are fewer in number.For instance, Dynamo by Salehi et al. [186] provided worker-generated "Guidelines for Academic Requesters" for ethical research on Amazon Mechanical Turk [53].The guidelines aim to provide guidance for requesters on "how to be a good requester," fair payment, and other aspects of fair crowd work.Schäfer et al. [191] formulated key principles for effective communication with workers in crowdsourcing contests.In a similar vein, the "ground rules" hosted at Crowdsourcing-code.com aim to provide guidance "for a prosperous and fair cooperation between crowdsourcing companies and crowdworkers" [39].Besides the above documents, guidance and feedback for requesters can also be found on worker-focused websites and in online forums, such as Turkopticon [103] and Turker Nation [133].
However, when it comes to reporting crowdsourcing studies, little guidance is available in the scholarly literature [168,170].To the best of our knowledge, there are only two papers providing guidance on how to report crowdsourcing studies and experiments.Ramírez et al. [168] proposed DREC, a datasheet for reporting experiments in crowdsourcing.Based on an analysis of a sample of 15 scientific articles, the authors provided a glimpse into the state of reporting on crowdsourcing studies.The authors found that details of crowdsourcing studies are often not being reported in scientific articles.The authors created a taxonomy of attributes relevant to crowdsourcing studies aiming to support requesters in reporting crowdsourcing experiments.Ramírez et al. later extended the DREC taxonomy in scope and provided a checklist for reporting crowdsourcing experiments, based on an analysis of the literature [170].The article examines the state of reporting on crowdsourcing experiments and offers guidance for requesters.
In relation to these two studies, there is an overlap with our work in that the authors make recommendations for the reporting of key statistics of a crowdsourcing campaign, such as the number of participants. However, the checklist is clearly scoped to reporting the results of the main crowdsourcing experiment. The section on pilot studies (Item No. 6) in Ramírez et al. [170] is very short and only recommends to "[d]escribe if pilot studies were performed before the main experiment" (p. 30).
Our work connects to the above two guidelines by providing an extensive and in-depth investigation on the state and practice of pilot study reporting as well as detailed recommendations for reporting crowd pilot studies.

METHOD
We conducted a systematic literature review [162] to investigate how pilot studies are being reported in the HCI and crowdsourcing literature (Sections 3.1-3.4). The literature review is complemented with an online survey with requesters (see Section 3.5).

Scope of the Literature Review
Before we started our literature review, we needed to clearly define the research questions (see Section 1) and delineate the boundaries of the review [161]. Our work focuses specifically on pilot studies that are crowdsourced to a group of diverse and independent individuals online. Throughout the remainder of this work, we refer to this type of pilot study as a crowd pilot study. This concept has two components ('pilot study' and 'crowd'), which we clarify and define in the following two sections.

Pilot study.
From the onset of our literature review, we were particularly interested in small-scale formative pilot studies in the crowdsourcing domain. The way these studies are being reported in the scholarly literature is often opaque and we wanted to illuminate researchers' practices around reporting pilot studies. This is important because opaque reporting of pilot studies may obfuscate information and, from a systemic perspective, attenuate the spread of best practices. The first iteration of our literature review, conducted as a scoping review [161], found over 100 articles reporting small-scale pilot studies in a formative way. However, in this scoping review, we found that a considerable number of pilot studies are also being conducted for summative purposes. We therefore extended the scope of our literature review to a broader definition of pilot studies that better captures the state of reporting on both formative and summative studies.
Many different user studies and experiments have been conducted on crowdsourcing platforms.In our work, a pilot study is any small-scale or larger scale experiment or study that is being conducted to inform the design of a prototype, validate a proof of concept, or for other formative or summative reasons.The scope of our work is defined by the authors' use of the term crowdsourcing in combination with the 'pilot' keyword.For instance, user studies on crowdsourcing platforms are only included in our literature corpus if the studies were referred to by the authors as pilot studies.

Crowd.
The pilot studies could be conducted internally by the researchers within a lab setting or with an external crowd [67].In our work, we exclusively focus on pilot studies conducted with an external crowd.This external crowd needs to consist of people other than the researchers (otherwise it would be considered an "internal pilot study").Crowdsourcing comes in many different forms (e.g., crowdfunding, contests, microtasking, among others).Our work follows a broad and integrative definition of the crowd.Our literature review includes studies conducted on traditional paid microtasking platforms, situated and mobile crowdsourcing [73,92], or on other paid and unpaid crowdsourcing platforms with different types of participants (e.g., students and volunteers).

Creating a Corpus of Relevant Articles
We limited our search to articles published in the past ten years (2012-2021). The time frame of ten years was chosen to provide a representative window into current best practices that have emerged since the inception of crowdsourcing.
Our search was conducted in two bibliographic sources. First, we downloaded all articles (excluding posters) from the Proceedings of the Conference on Human Computation and Crowdsourcing (HCOMP), widely considered a primary venue for crowdsourcing related research. This resulted in 215 articles (of which two articles from the oldest proceedings could not be retrieved). Second, we searched the Digital Library of the Association for Computing Machinery (ACM). The ACM-DL is the document storage for all articles published by the ACM and therefore covers a wide range of different conferences and journals (including the ACM CHI and ACM CSCW conferences where premier works on crowdsourcing research at the intersection of HCI are commonly published). The search in the ACM-DL used the following query string:
query: Fulltext:(pilot) AND Abstract:(crowdsourc*)
filter: Article Type: Research Article, Publication Date (01/01/2012 to 12/31/2021)
The choice of the 'pilot' keyword reflects the scope of our literature review. We found that limiting our search to occurrences of "crowdsourcing" (or derivations thereof) in the abstract was a good compromise between identifying relevant literature and avoiding false positives (e.g., studies that mention crowdsourcing in the related work or the references).
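The logic of the query string can be approximated locally, for instance when re-screening a set of downloaded articles. The sketch below is an assumption-laden illustration: the `Article` record layout is hypothetical, and the regular expressions only approximate how the ACM-DL search engine interprets `Fulltext:(pilot)` and the wildcard `crowdsourc*`.

```python
import re
from dataclasses import dataclass


@dataclass
class Article:
    """Hypothetical record layout for a downloaded article."""
    title: str
    abstract: str
    full_text: str
    year: int


# Matches 'pilot', 'pilots', 'piloted', but also 'co-pilot' and similar,
# which is why a later manual screening step remains necessary.
PILOT = re.compile(r"\bpilot", re.IGNORECASE)
# Approximates the wildcard crowdsourc* (crowdsourcing, crowdsourced, ...)
CROWDSOURC = re.compile(r"\bcrowdsourc\w*", re.IGNORECASE)


def matches_query(article, first_year=2012, last_year=2021):
    """Return True if the article satisfies the approximated query."""
    return (
        first_year <= article.year <= last_year
        and PILOT.search(article.full_text) is not None
        and CROWDSOURC.search(article.abstract) is not None
    )


example = Article(
    title="A labeling study",
    abstract="We study crowdsourced image labeling.",
    full_text="A pilot study with 20 workers informed the task design.",
    year=2015,
)
print(matches_query(example))
```

Restricting the second pattern to the abstract mirrors the compromise described above: an article that mentions crowdsourcing only in its related work would not match.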
Our aim was to arrive at a representative coverage of articles in the literature [219].We focused on the proceedings of AAAI HCOMP and the ACM Digital Library due to their prominence and influence in the crowdsourcing research community.These venues are known for their rigorous peer-review processes and attract a wide range of high-quality submissions from researchers globally, making them representative sources for our analysis.While we acknowledge that relevant literature might also exist in the IEEE Xplore library and other repositories, our choice was driven by the desire to capture the core developments and trends in crowdsourcing from these leading communities.Our search resulted in a corpus of 513 articles (213 articles at HCOMP and 300 articles in the ACM-DL).

Article filtering and exclusion criteria
We filtered the corpus of 513 articles in five consecutive steps. The steps are depicted in the flowchart in Figure 1. First, we identified whether the keyword "pilot" was present in the article. Articles from the HCOMP proceedings that did not include this term were excluded (n = 191). This step also excluded articles from the ACM-DL where the pilot keyword was only mentioned in the references. Second, we identified whether the term "pilot" refers to a pilot study or experiment, or whether it denoted something else (e.g., 'Palm Pilot' or 'airplane co-pilot'). This step excluded 32 articles. Third, we identified whether the pilot study was conducted by the authors in the article. This step excluded articles that mentioned pilot studies in the related work section or only provided recommendations for conducting pilot studies, without conducting one in the article. Articles that did not conduct a pilot study were excluded (n = 51). Next, we excluded articles in which authors referenced and discussed their pilot study in other articles. We decided not to conduct a backward reference search because the cited articles did not contain our search keyword and we wanted to focus on full articles. This step excluded five articles. Finally, we identified whether the pilot study involved crowd workers. As mentioned in Section 3.1.2, we apply a broad definition of crowd work that includes everything from situated crowdsourcing, citizen science (e.g., Zooniverse), and volunteering to paid microwork. We specifically focused on pilot studies conducted with a crowd, excluding other participants (such as experts) in our analysis. If the article did not include a pilot study conducted with a crowd, it was excluded (n = 63). If it was not fully clear from the authors' statements whether the pilot study involved crowdsourcing, we included the article in our analysis (n = 2).
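The exclusion counts of the five steps can be checked against the size of the final corpus with a short sketch. The counts are those reported above; the step labels are paraphrased, and the function name is chosen for illustration only.

```python
INITIAL_CORPUS = 513

# (paraphrased step label, number of articles excluded in that step)
STEPS = [
    ("keyword 'pilot' absent or only in the references", 191),
    ("'pilot' does not denote a study or experiment", 32),
    ("no pilot study conducted by the authors", 51),
    ("pilot study reported in another article", 5),
    ("pilot study not conducted with a crowd", 63),
]


def remaining_after_each_step(initial, steps):
    """Return the number of articles left after each exclusion step."""
    remaining = []
    current = initial
    for _label, excluded in steps:
        current -= excluded
        remaining.append(current)
    return remaining


for (label, excluded), left in zip(
    STEPS, remaining_after_each_step(INITIAL_CORPUS, STEPS)
):
    print(f"{label}: -{excluded} -> {left}")
```

The last value is 171, which matches the size of the final literature corpus.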
We validated the robustness of our 5-step literature screening procedure by conducting a sensitivity analysis. The sensitivity analysis involved slightly altering the criteria for the first step and then applying the full 5-step filtering process on a subset of articles. More specifically, we expanded the first step to include "preliminary," "initial," and "formative" as search keywords, besides "pilot," and observed the impact on the sampled articles. As the subset for the sensitivity analysis, we selected two HCOMP proceedings (2017 and 2018) with 46 articles (8.97% of the initial corpus). Validation of our filtering methodology showed strong internal consistency. The sensitivity analysis revealed that modifications in the search keyword resulted in less than a 5% change in the subset of papers (2 out of 46; 4.35%), confirming the robustness of our filtering process. The sensitivity analysis also confirmed our initial suspicion that "pilot" is the most commonly used term to refer to a crowd pilot study.
Our final set of literature comprises 171 articles (23 articles from HCOMP and 148 articles from the ACM-DL). Throughout our work, direct quotes from articles are printed in italics. The literature corpus was analyzed as follows.
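The two percentages of the sensitivity analysis follow from simple ratios, as the sketch below shows. It uses the 46 articles of the subset as the denominator for the share of changed articles, since this is the value consistent with the reported 4.35%; the helper name is illustrative.

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


subset_share = percent(46, 513)   # share of the initial corpus in the subset
changed_share = percent(2, 46)    # share of subset articles affected

print(subset_share, changed_share)
```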

Analyzing the Corpus
We started our analysis by familiarizing ourselves with the articles.To this end, we manually extracted all verbatim statements that mentioned the pilot keyword from each article together with the surrounding context.Typically, there were only few mentions of the pilot keyword in one paragraph or short sentences of the article.If there were pilot studies with multiple participant samples, we focused on the pilot studies conducted with the crowd in our analysis.
For answering our research questions, we followed an inductive approach based on grounded theory [70].We iteratively revisited the verbatim statements to identify what could be reported about the articles (e.g., the number of pilot studies conducted in the article or whether payment to workers was reported).If our research questions could not be answered from the verbatim statements, we revisited the article for closer reading.
Coding was straightforward in cases when variables were binary (e.g., deciding whether the pilot study was reported in its own section) or when information had to be extracted (e.g., the year of publication).This straightforward coding required only one coder and no inter-rater agreement was calculated [135].Other cases were more challenging.These cases were analyzed by two postdoctoral researchers.The coding was developed iteratively and from the bottom-up in several coding passes.The first coding iteration stayed close to the information provided in the articles.This first iteration allowed us to form an understanding of categories in the data, which we then iteratively grouped into codes.The coding results were frequently discussed among all authors which resulted in codes being adjusted and articles iteratively being re-coded.The coding was done in an Excel sheet which was then used to produce graphs in R. All data and code relevant to this process will be shared publicly for the benefit of further research, in the spirit of Open Science. 2

Survey Study
It is noteworthy that from the literature alone, we cannot discern anything about authors who decide not to report pilot studies.Therefore, our literature review can only provide incomplete insights into the prevalence of pilot studies and the requesters' motivations for running pilot studies.To limit if not alleviate this publication bias [160] of our literature review, we complemented our analysis with an online survey study with crowdsourcing researchers in academia and industry.
The survey was implemented on Qualtrics.Participants' consent was collected before starting the survey.Participation was incentivized with a raffle of 10 Amazon gift cards (each worth US$15).The survey included 25 items and was estimated to take between 10 and 15 minutes.Many questions were closed-ended (with an option to enter an open-ended response, if preferred or needed) to be mindful of researchers' time and not overburden the participants.The open-ended survey items focused on two key areas: 1) the motivation for running pilot studies and 2) the requester's practices around reporting pilot studies.The former included questions about the participants' motivation for conducting pilot studies, what they consider as a "good" pilot study, and criteria for running pilot studies.The latter asked what factors promote or obstruct the pilot study reporting and possible features of a crowdsourcing platform that could support requesters in running pilot studies.Participants with experience in collaborating with industry were asked about differences between pilot studies in industry and academia.
We aimed to invite researchers and industry professionals with experience in crowdsourcing.For this reason, we followed a mix of snowball [74] and convenience sampling [86] to disseminate the survey study to experienced researchers in academia and industry.We also announced the study in communities dedicated to human computation and crowdsourcing, including the HCOMP Slack Community (with 396 members at the time of writing) and the Google Group on Crowdsourcing and Human Computation (with 579 members).The survey had valid responses from twelve participants, but we excluded one because she did not consent to the study.Participants included researchers from academia and industry with a background in computer science, human-computer interaction, and design.The sample includes five Ph.D. students, one postdoctoral researcher, and researchers at the professor-level.Participants had between 2 and over 12 years of experience in crowdsourcing research.

THE STATE OF CROWD PILOT STUDY REPORTING
In this section, we first provide an overview of the literature corpus before we turn to answering our research questions in the subsequent sections.

Literature Corpus
The literature corpus includes 171 articles (see Table 1). We find the number of articles reporting crowd pilot studies has more than tripled in the past decade (see Figure 2).

[Table 1: number of articles per year; columns: Year, Articles. Table rows not recoverable.]
The corpus includes articles published in academic conferences (n = 144), journals (n = 24), and workshops (n = 3). The articles have between 2 and 41 pages (including references and appendices). The distribution of articles over the different venues is long-tailed (see Figure 3). About half of the articles (n = 82, 48.0%) are published in the three conferences on Human Factors in Computing Systems (CHI), Human Computation and Crowdsourcing (HCOMP), and Computer Supported Cooperative Work (CSCW), with other conference venues and journals reporting significantly fewer pilot studies, $\chi^2(70, N = 171) = 923.06$, $p < 0.001$. Analyzing the Computing Classification System (CCS) concepts provided by the authors, we find the bulk of the articles were classified as research on Human-Centered Computing (n = 47), HCI (n = 45), Information Systems (n = 26), Computing Methodologies (n = 13), and Applied Computing (n = 7). A similar human-centered pattern is notable in the author-provided keywords (see Figure 4), which revolve around human factors, design, user studies, and experimentation involving human subjects. The corpus includes 770 affiliations (including double-affiliations) of authors from academia and industry. Authors are predominantly affiliated with universities (n = 607), with Carnegie Mellon University (42 authors), Stanford University (26 authors), the University of Washington (24 authors), and the University of Waterloo (24 authors) having the most author affiliations in our corpus of literature. A smaller number of authors are from industry (n = 89) and other institutions (e.g., colleges and research institutes; 74 authors). Among the institutions from industry are Microsoft (32 authors), Google (10 authors), and Disney Research (7 authors).
The corpus provides us with a source of documentation on the state of practice of crowd pilot study reporting in the HCI and crowdsourcing literature, which we analyzed to answer research questions RQ1-RQ4.
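A test statistic of this kind can be computed as a chi-square goodness-of-fit statistic. The sketch below assumes that the observed number of articles per venue is compared against a uniform expectation across venues; the source does not state the expected distribution, so this is an assumption, and the venue counts in the example are hypothetical toy data rather than the corpus data. Under this reading, 70 degrees of freedom would correspond to 71 venue categories.

```python
def chi_square_uniform(counts):
    """Chi-square goodness-of-fit statistic against a uniform expectation.

    Returns (statistic, degrees_of_freedom). Assumption: every category
    is expected to receive the same share of the total.
    """
    k = len(counts)
    total = sum(counts)
    if k < 2 or total <= 0:
        raise ValueError("need at least two categories and a positive total")
    expected = total / k
    statistic = sum((c - expected) ** 2 / expected for c in counts)
    return statistic, k - 1


# Hypothetical toy data: one dominant venue and three small ones
statistic, dof = chi_square_uniform([30, 10, 10, 10])
print(statistic, dof)
```

The statistic would then be compared against the chi-square distribution with the returned degrees of freedom to obtain a p-value, for example with a statistics library.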

RQ1: Why are Crowd Pilot Studies Typically Conducted?
Most crowd pilot studies in the literature are formative studies conducted during the design or development phase (n = 143, 83.6%). About 15% of the crowd pilot studies are summative studies (n = 26, 15.2%) conducted to evaluate or validate a prototype, proof of concept, or an idea. Three quarters of the articles in our corpus report crowd pilot studies only in passing, that is, in a few sentences, footnotes, or short paragraphs (128 articles, 74.9%).
We classified the articles based on the amount of space a crowd pilot study is given in the article and the type of crowdsourcing study (formative versus summative). In this classification, 'in passing' refers to articles which mention the pilot study only in a few sentences, footnotes, or short paragraphs. 'Detailed reporting' denotes articles which dedicate a larger amount of space (e.g., a full section) to the pilot study. Finally, 'main study' refers to articles in which the entire article is considered as being a pilot study. Based on this classification, we find there are five different types of crowd pilot studies with varying levels of prevalence in the literature (as depicted in Figure 5):
• Formative crowd pilot study, mentioned in passing (n = 119, 69.6%): Over two thirds of the articles in our corpus contain formative crowd pilot studies that are mentioned in passing. These articles are by far the largest group in the literature. The articles in this group are often opaque in how the crowd pilot study is being reported and details about the crowd pilot study are often not provided. As this is the largest group of articles, we analyze these articles in more detail in Section 4.3.1.
• Formative crowd pilot study, detailed reporting (n = 24, 14.0%): This type of article devoted more than just a few sentences to the crowd pilot study. For instance, Winther et al. [233] report on a series of three formative pilot studies in which the authors iterated on parameters related to the design of the task and campaign (such as the task design, task instructions, task reliability, task accuracy, task difficulty, and worker behavior) to inform a crowdsourcing campaign.
• Summative crowd pilot study, pilot is main study (n = 17, 9.9%): Several articles presented a summative crowd pilot study as the main study of the article. This type of study is conducted to test the feasibility, provide a proof of concept, or evaluate and validate a system. A representative article is the work by Dow et al. [49] who established the feasibility of using crowds for design feedback in the classroom. Other examples are the technical evaluation of the VidQuiz system by Davis et al. [41] and the work by Ramchurn et al. [167] who mention pilot studies with the task allocation system of a disaster response system.
• Summative crowd pilot study, detailed reporting (n = 2, 1.2%): Only very few articles contained summative pilot studies reported in detail (i.e., in a separate section of the article). These studies were conducted to validate the design and functionality of systems or to show the generalizability of the system. Qiu et al. [164] conducted a pilot study with the prototype of a spatial crowdsourcing system. This pilot study was reported after the main results section (titled "performance evaluation"). Eickhoff et al. [57] used a crowd pilot study to demonstrate that their game generalizes to other domains.
• Summative crowd pilot study, mentioned in passing (n = 7, 4.1%): A few articles mentioned a summative pilot study in passing. Some articles in this group report the crowd pilot study in the context of evaluating a system, demonstrating a proof of concept, or validating the feasibility. Vaish et al. [212], for instance, report that a small team of participants conducted a pilot experiment. Winkler et al. [232] mention that a "set of pilot runs" was executed "to ensure the feasibility of the study design" (p. 34). Oppenlaender et al. [152] pilot tested their CrowdUI system to ensure the system's functionality before the main study. Similarly, Yang et al. [237] launched a pilot study with the intention of verifying the quality of results before deploying the main study. Other articles validate that a task produces the intended results, such as Noy et al. [149] who compared the results of a pilot study with the main results of their article, finding no differences.
• Other articles: Two articles (1.2%) could not be classified as formative or summative due to insufficient information in the article. These two instances are an article in which a pilot study is mentioned in the acknowledgments section [122] and the article by Kim and Follmer [110] in which the authors state that crowd workers were excluded if they had "previously participated in any of [the authors'] pilot studies" (p. 10).
Digging deeper into the details reported about the crowd pilot studies, we find that task design and crowdsourcing campaign are the most commonly reported reasons for undertaking a crowd pilot study in the literature. We split our further analysis into three parts related to the motivation for running the pilot study: the design of the task, the crowdsourcing campaign, and other reasons.

4.2.1 Task design related reasons for conducting crowd pilot studies (n = 101, 59.1%). The design of crowdsourced tasks is a central component of crowdsourcing and includes factors affecting the performance of tasks and quality of results, such as, for instance, the task design, the usability of the user interface, and the clarity of the task instructions. Authors in our corpus often mentioned iterative experimentation aiming to improve the task design, but without providing details. Timmermans et al., for instance, mention that "[p]ilots were run for optimizing the microtasks settings in terms of cost, amount of judgments and task design" (p.2). Singh et al. [197] report that they "iterated extensively in pilot studies with crowd workers to strike a balance between simplicity (avoid complex or numerous instructions) and effectiveness (make the layout better)" (p.4).
If details about the task design are mentioned in the article, we find it is most common for authors to report on the outcome of the crowd pilot study (see Figure 6). Under this category, we subsume any information reported by authors about the crowd pilot study's results, performance, accuracy, validity, and quality. Hu et al. [95], for instance, determined a similarity threshold "[t]hrough a pilot study" from the accuracy of results. Kim et al. [109] "learned from pilot runs that longer video segments lead to lower annotation accuracy [...] and slower responses on Mechanical Turk" (p.4021). The task outcome was the most often reported factor related to task design (n = 55, 32.2%), used both in formative and summative crowd pilot studies.
Crowd pilot studies were also often conducted to assess the difficulty of the task during the formative design stage (n = 26, 15.2%). For instance, Swinger et al. [204] reported that "in pilot experiments," workers were unfamiliar with items presented to them. The solution by the authors was to use custom qualifications. A qualification is "a set of questions [...] that the worker must answer to qualify and therefore work on the assignments" [6]. Similarly, Kiesel et al. [108] determined "[i]n pilot experiments" that the "task does not require expert workers, so we just required workers to have at least 100 previously approved HITs" (p.3051). Other authors carried out a more rigorous process to avoid systematic biases from seeping into the collected data [98]. For instance, Vogogias et al. [218] systematically experimented with different difficulty levels in a pilot study "to identify the correct level of difficulty to avoid floor and ceiling effects" (p.5).
Another reason for conducting the crowd pilot study is the iterative design of task instructions (n = 14, 8.2%), including to improve a task's intelligibility (e.g., Wang et al. [223]) and clarity (e.g., Simoiu et al. [194]). However, many authors were not specific on how the task instructions were iterated on. For instance, Vertanen and Kristensson [216] simply reported that the "exact instructions we gave workers evolved over the course of several pilot experiments" (p.18).
Analyzing worker behavior and preferences was another reason for conducting crowd pilot studies, although a less common one with 13 articles (7.6%). Qiu et al. [166], for instance, measured parameters of a worker model "according to a pilot task on Figure Eight" (p.225). Roy et al. [180] reported on pilot experiments that "revealed people had a strong preference to use manual control" (p.4). Acer et al. [1] investigated the workers' behavior during the crowd pilot study, including the response rate, and reported that workers already adopted the tasks as a habit during the crowd pilot study.
Some authors elicited open-ended feedback in the crowd pilot study and ideas for improving the study during the formative design phase (n = 10, 5.8%). For instance, Chen et al. [31] wrote that "[t]o better inform the interface, we conducted a pilot study with 5 non-expert workers and asked them to rate the appearance of the marks after they finished their tasks" (p.9). Siangliulue et al. [192] reported that according to feedback from the pilot studies, their "approach was intuitive and matched users' expectations well" (p.613). Lykourentzou et al. [128] used the exit survey on CrowdFlower during the crowd pilot study to monitor the workers' satisfaction with the payment to "ensure fair worker treatment" (p.265). However, CrowdFlower's exit survey was rarely used to specifically support the aims of crowd pilot studies.
Usability (n = 10, 5.8%) and the task abandonment [77,78] or completion rate (n = 8, 4.7%) were also mentioned in articles. Wilson et al. [231], for instance, reported that the "iterative design resulted in substantial usability improvements" (p.135) and Kandappu et al. [106] observed the task completion rate in "a pilot study [with] over 900 workers in Sept 2015. From that study, [the authors] observed that 15% of the accepted tasks are not completed by the crowd workers" (p.907). The task description (n = 3, 1.8%) and the task clarity (n = 3, 1.8%) were only explicitly mentioned in a few articles, although these two items are implicitly part of the design of the task and its instructions. Another equally infrequent reason for conducting the crowd pilot study is empirically determining the optimal size of a task (n = 3, 1.8%). For instance, Goncalves et al. [72] "tested a variety of gameplay settings" to determine the optimal number of items included in a task to "not cause fatigue" (p.708).

4.2.2 Campaign design related reasons for conducting crowd pilot studies (n = 58, 33.9%). The design of a crowdsourcing campaign includes parameters necessary for launching the campaign, such as the number of tasks assigned to workers or batch sizes, the average task completion time, and the task pricing. These factors have been shown to influence task outcomes [32,46,47].
Related to the campaign design, crowd pilot studies are often conducted to empirically determine the price of the crowdsourced task (n = 35, 20.5%). As mentioned in the introduction, the task price is typically estimated from the workers' average task completion times in a crowd pilot study. We see evidence for this practice in our literature corpus. Typical ways of reporting this information include, for instance, Hara et al. [81] who reported that workers were "paid $0.75 per HIT ($0.047-0.054 per labeling task); which was decided based on the task completion time in pilot studies (e.g., approximately $0.10 per minute)" (p.6). Another way was to provide information on a target hourly rate, such as Correll et al. [38] who mentioned that "[b]ased on piloting, we paid participants $2 for participation, for a target rate of $8/hour" (p.5). Similarly, Han et al. [79] mention in a footnote that "[b]ased on [a] pilot experiment" the hourly pay was "equivalent to US$13.5 per hour on average" (p.3). A slightly more extensive report was given by Roitero et al. [179] who "performed several small pilots of the task, and after measuring the time and effort taken to successfully complete it, we set the HIT reward to $1.5. This was computed based on the expected time to complete it and targeting to pay at least the US federal minimum wage of $7.25 per hour" (p.441).
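The price-setting practice described above reduces to simple arithmetic: a target hourly wage multiplied by the typical completion time observed in the pilot. The following is a minimal sketch of that calculation and not the procedure of any cited article. The function name `reward_per_task`, the use of the median, the rounding rule, and the sample completion times are all assumptions made for illustration.

```python
import math
from statistics import median


def reward_per_task(completion_times_sec, target_hourly_wage):
    """Return a per-task reward (in dollars) that meets a target hourly wage.

    completion_times_sec: completion times observed in a pilot (hypothetical data).
    target_hourly_wage: the wage the requester aims for, in dollars per hour.
    """
    # The median is used here (an assumption) because it is less sensitive
    # to a few very slow or idle workers than the mean.
    typical_sec = median(completion_times_sec)
    raw = target_hourly_wage * typical_sec / 3600.0
    # Round up to the next cent so the target rate is met rather than undercut.
    # The small epsilon guards against floating-point noise on exact values.
    return math.ceil(raw * 100 - 1e-9) / 100


# Hypothetical pilot: five workers, completion times in seconds.
pilot_times = [610, 720, 745, 800, 1500]
print(reward_per_task(pilot_times, 7.25))
```

As a plausibility check, a task that takes 15 minutes at a target rate of $8/hour yields a reward of $2 under this calculation, which is consistent with the figures quoted from Correll et al. [38].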
Besides the very common combination of task price estimated from average task completion times, some authors also empirically determined other campaign related parameters in crowd pilot studies. This includes determining a time limit for the task [5,22,72,114,119,138,181,245] and determining the optimal sample size [30,102,115,117,126,163,171,172] for the main study. Only a few crowd pilot studies involved qualifications for a task. Swinger et al. [204] used a "qualification exam" to identify well-performing workers based on worker accuracy. Kiesel et al. [108] used results from a crowd pilot study to determine the minimum number of previously approved HITs for workers. Ramírez et al. [171] analyzed the geographic location of workers in the crowd pilot study to identify countries for the main study. Aigrain et al. [3] used a quiz on CrowdFlower to filter workers. Feyisetan and Simperl [60] ruled out the use of qualifying questions through crowd pilot studies to avoid an increase in attrition rate.
A common way of controlling quality in crowdsourcing studies is the use of gold questions (i.e., questions for which the answer is known) [40]. One way of developing and verifying gold standard questions would be through crowd pilot studies. However, only a few articles mentioned gold standard questions in the context of crowd pilot studies. McDonnell et al. [136] found an "inexplicable problem" with the gold judgments and subsequently abandoned the use of the gold dataset in favor of another dataset. Chang et al. [30] used a crowd pilot study with seven MTurk workers to verify that the quality of work done is comparable to trained experts, concluding that "judgements from 20 workers on Mturk can serve as the gold standard data set" (p.403). Nguyen et al. [146] mention that "[s]mall pilot experiments were carried out while developing the design of the HIT" which included "gold labels" (p.323). This gold standard was taken from existing corpora and not verified. Last, Winther et al. [233] went one step further and presented experimentation on gold standards in one of their crowd pilot studies. The authors found that "the gold standard proved to be too restrictive" and "gold tests and majority voting produced approximately the same acceptance results" (p.29).
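To make the two quality-control strategies mentioned above more concrete, the sketch below filters workers by their accuracy on gold questions and, separately, aggregates answers to a single item by majority vote. It is an illustration only and does not reproduce the setup of any cited study. The data layout, the function names, and the 0.7 accuracy threshold are assumptions.

```python
from collections import Counter


def gold_accuracy(answers, gold):
    """Fraction of gold questions a worker answered correctly.

    answers: dict mapping question id -> the worker's answer.
    gold: dict mapping question id -> known correct answer.
    """
    graded = [q for q in gold if q in answers]
    if not graded:
        return 0.0
    return sum(answers[q] == gold[q] for q in graded) / len(graded)


def accepted_workers(all_answers, gold, threshold=0.7):
    """Workers whose gold accuracy meets the (assumed) threshold."""
    return sorted(w for w, a in all_answers.items()
                  if gold_accuracy(a, gold) >= threshold)


def majority_vote(all_answers, question):
    """Most frequent answer to one question across all workers."""
    votes = Counter(a[question] for a in all_answers.values() if question in a)
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical pilot data: three workers, two gold questions (g1, g2),
# and one item without a known answer (t1).
gold = {"g1": "cat", "g2": "dog"}
pilot = {
    "w1": {"g1": "cat", "g2": "dog", "t1": "bird"},
    "w2": {"g1": "cat", "g2": "cat", "t1": "bird"},
    "w3": {"g1": "dog", "g2": "cat", "t1": "fish"},
}
print(accepted_workers(pilot, gold))
print(majority_vote(pilot, "t1"))
```

Running both strategies on pilot data in this way is one possible means of checking, before the main study, whether a gold standard is stricter than necessary.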

4.2.3 Other reasons for conducting a crowd pilot study (n = 72, 42.1%). In the above, the majority of crowd pilot studies are conducted for formative reasons with the aim of iteratively designing a crowdsourced task in rapid fashion with a small set of participants. We found that another formative reason for conducting a crowd pilot study is collecting data for the main study (20 articles, 11.7%). This category of articles can be split into articles which conducted the crowd pilot study with the sole purpose of collecting data (9 articles) and articles that conducted the crowd pilot study also for other purposes (11 articles). Amir et al. [13], for instance, reported that "pre-generated solutions represented common wrong solutions that were submitted by participants (as determined in a pilot study)" (p.5). Yu et al. [242] conducted a pilot study to collect constraints for a design task. In these articles, the collected data was then used in the main experiment or study of the article. For instance, Agarwal et al. [2] crowdsourced a dataset of tagged tweets in a crowd pilot study which was then used as input for a machine-based classifier, "thereby making the classifier emotionally intelligent" (p.3).
Another reason for conducting the crowd pilot study was the iterative design of a study or an experiment. This purpose of a crowd pilot study was explicit in 11 articles (6.4%), although we believe this motive is implicitly present in many articles, such as the ones reporting on iteratively designing a task in the context of an experiment or the design of a system. As a single outstanding instance, d'Eon et al. [42] used a crowd pilot study to qualify and recruit participants for the main study.
As observed through our findings, crowd pilot studies are carried out for a broad range of compelling reasons, which others who carry out crowdsourcing studies are very likely to face. It is therefore important to understand how and at what level of detail crowd pilot studies are reported in the literature.

4.3 RQ2: How are Crowd Pilot Studies Typically Reported?
In the previous section, we already provided some examples on how authors reported results of crowd pilot studies in the scholarly literature. In this section, we go in-depth and investigate how authors report crowd pilot studies. We analyze which terms and phrasing authors use to refer to pilot studies and how consistent they are in their wording (Section 4.3.1) as well as in which section of the article crowd pilot studies are being reported (Section 4.3.2).
Given that the vast majority of articles reported on pilot studies only in passing, a surprisingly low number of articles (n = 5, 2.9%) referred to the crowd pilot study as an "informal" study [7,83,178,190,213]. This informal study provided authors an "informal sense" of worker behavior in the context of an "open-ended exploration" [7] as well as support in the design of tasks to "understand how different interfaces affected crowd performance" [83].
We find that about two thirds of the articles (n = 115, 67.3%) are internally consistent in how they refer to the pilot study within the article. In these articles, authors used only one single term to refer to the pilot study. In the other third of the articles, some authors used up to four different terms to refer to the pilot study (43 articles used two different terms, 12 articles used three terms, and one article used four terms). The most common combinations among the articles using multiple terms are 'pilot study' and 'pilot' (n=12), 'pilot test' and 'pilot study' (n=78), 'pilot study' and 'pilot experiment' (n=3), 'pilot experiment' and 'pilot' (n=3), and 'pilot study' and 'pilot run' (n=3).
In formative crowd pilot studies mentioned in passing, the term 'pilot study' is often used as a blanket statement to justify design decisions without providing details about the crowd pilot study. For this reason, we analyzed the phrasing used by authors of these crowd pilot studies in more detail (see Figure 8). The most common way in which authors mention formative crowd pilot studies is by stating that a crowd pilot study was conducted, followed by selected details about the outcome of the pilot study. Nguyen et al. [146], for instance, mention that "[s]mall pilot experiments were carried out while developing the design of the HIT," followed by details about the HIT, the assignment of the HIT to workers, and the price of the HIT. The phrase 'in a pilot study, we found' (or close derivations thereof) is also common. For instance, Hara et al. [81] report that "[i]n early pilot studies, we found that users would get disoriented" (p.6). It is also very common for authors to derive design decisions 'based on a pilot study.' This phrasing was often used to refer to the estimation of the task price from average (or in some cases median) task completion times. For instance, Diakopoulos et al. [44] report that workers were offered $0.50 per rating "based on the median time taken on a pilot task" (p.10). Similarly, Li et al. [125] estimated "the time needed for each microtask based on pilot studies" (p.7). Some other phrases used among authors include that crowd pilot studies 'showed,' 'revealed,' or 'demonstrated' some specific results and that parameters were iteratively 'refined in pilot studies.'
Next, to draw further insights about pilot studies based on the context in which they are described, we explored sections of articles in which they are reported.
4.3.2 In which section are pilot studies reported? We analyzed in which section authors report on the crowd pilot study, using a closed-coding approach. Our initial coding scheme reflected the standard structure of academic articles (i.e., Introduction, Related Work, Method, Results, Discussion, Conclusion, Appendix). However, the codes were slightly modified after one iteration of coding to better accommodate differences in the methodological approaches used in the articles. The result of the coding is depicted in Figure 9. We find the majority of articles (n = 149, 87.1%) report on the crowd pilot studies in sections related to the methodology. These sections include the study design or experiment design (n = 88, 51.5%), the system design (or related sections; n = 36, 21.1%), dataset creation (n = 6, 3.5%), as well as separate sections dedicated to the pilot study, as found in about 10% of the articles (n = 19). The choice of section depends on the methodological approach taken in the article. For instance, articles that develop a novel system often mention the results of the crowd pilot study in the section on the system's design.
Besides the general trend described above, a few articles stood out by taking a different approach to reporting on the crowd pilot study. For the articles that report on the crowd pilot study in the limitations section, we expected the authors to discuss weaknesses and limitations of the crowd pilot study. Instead, the crowd pilot study was, in some cases, used to validate the results of the main study. For instance, Sabou et al. [184] mention in the limitations section that "a set of pilot runs" was executed "to ensure the feasibility of the study design" in an application domain to "address external threats to validity" (p.171). Robertson et al. [177] discuss differences between their main study and an (independent) pilot study, reporting that the results "were fully consistent with those from a pilot version of this study that we conducted in July 2019" and that "results are robust to pseudoreplication" (p.12).
Rekatsinas et al. [175] reported a crowd pilot study in the introduction section to motivate their article. Rodríguez et al. [178] mentioned an "independent" crowd pilot study which was used to estimate the optimal price of the task and Robertson et al. [177] also mentioned an independent crowd pilot study. However, besides these three articles, crowd pilot studies were typically not conducted as independent studies, but as an integral part of the article. On the other hand, some authors used the extended space of the appendix to report on the crowd pilot study in detail. For instance, Fogliato et al. [61] report differences between the main experiment and the crowd pilot study in a separate appendix.
In the following section, we investigate in more detail what is known about the pilot studies from the reporting in the literature.

4.4 RQ3: What do Crowd Pilot Studies Report?
This section reports findings on which crowdsourcing platform is being used (Section 4.4.1), how many crowd pilot studies are being conducted in each article (Section 4.4.2), and which other key details are being reported about the crowd pilot study (Section 4.4.3).

4.4.1 Which crowdsourcing platform is being used? We analyzed which crowdsourcing platform is being used in crowd pilot studies. This information was often not explicitly stated and had to be inferred from context. Amazon Mechanical Turk (MTurk) is by far the most common crowdsourcing platform (n = 102, 59.6%) in the literature corpus. Other platforms include, for instance, CrowdFlower/Figure Eight (now Appen) (n = 23, 13.5%), Prolific [117,141,150], Microworkers [220,233], LabInTheWild [192], ZBJ [223], Clickworker [177], and the Yahoo! crowdsourcing platform [236], among others. In about 12% of the articles (n = 20, 11.7%), the crowd pilot study was conducted with other participant samples, such as students [104,106,182,183], citizens [11,58,72,100,143,201], and volunteers [25,88,89,105,134,213,214,239]. In-house or custom crowdsourcing platforms were only reported in five studies (2.9%) [31,93,157,164,225]. These articles include a pilot study conducted on an "indigenous crowdsourcing platform" [157] and an article by authors from Google who "ran numerous pilots to tune task hyper-parameters [...] sourced from contracted operators through an in-house crowdsourcing platform" [225], a study with the prototype of a spatial crowdsourcing system [164], a study with a web-based platform for collecting ratings [93], and a study conducted with a crowdsourcing system designed for analyzing industrial tomographic images [31]. In ten articles (5.8%), neither the type of participant sample nor crowdsourcing platform was mentioned.
4.4.2 How many pilot studies are being conducted in the article? As in the previous section, the information on how many pilot studies were conducted in the article had to, in many instances, be inferred from the wording used by the authors. In many cases, this wording was opaque and the exact number of crowd pilot studies could not be determined (see Figure 10). For instance, some authors mentioned conducting 'pilot studies' (e.g., Dimara et al. [48], Hara et al. [81]), 'pilots' (e.g., Feyisetan and Simperl [60], Luther et al. [127]), or 'pilot experiments' (e.g., Swinger et al. [204], Timmermans et al. [207]). From these terms, we can only infer that there was more than one crowd pilot study conducted. Due to the use of non-descriptive terms such as 'pilot testing' [15,52,116] and 'piloting' [38,76,84], the number of crowd pilot studies could not be determined in six articles.
About half of the articles (n = 90, 52.6%) report conducting one pilot study (see Figure 10). A third of the articles (n = 55, 32.2%) report conducting more than one pilot study. A high number of pilot studies within an article was rare (n = 6, 3.5%). Simoiu et al. [194], for instance, conducted "six small pilot tests" to "ensure that the questions were clearly phrased, and of appropriate difficulty" (p.174). The highest number of pilot studies was reported by Inel et al. [102] who conducted extensive experimentation in eight crowd pilot studies.

4.4.3 Which key attributes are typically reported about crowd pilot studies? We analyzed what authors choose to report about crowd pilot studies. We specifically looked at three key attributes of crowd pilot studies: the number of workers participating in the crowd pilot study, the number of tasks (or any other information given in the article that would allow determining the number of assignments to workers), and the rewards to workers. About 60% of the articles did not provide any information on the three key statistics (n = 102, 59.6%; see Figure 11). We find authors who report one key statistic also often report other key statistics about the crowd pilot study. Twenty-four articles (14.0%) report on all three key statistics. These articles often dedicated a full section or the entire article to the crowd pilot study. About 40% of the articles (n = 68, 39.8%) report at least one of the three key statistics.
About a third of the articles mentioned the number of workers participating in the crowd pilot study (55 articles, 32.2%). In these articles, the number of workers ranged from three (e.g., Oppenlaender and Hosio [150]) to over 2,000 [88]. Some authors were imprecise about the number of participating workers, such as Ambrosino et al. [11] who reported "almost 40" participants (p.5). Among the 50 articles in which we could identify or calculate the exact number of participants in the crowd pilot study, the average number of participants was 111.7 (SD = 169.9).
The number of tasks or assignments was mentioned in 42 articles (24.6%). This information was more difficult to analyze because some authors mentioned the number of assignments, others the number of tasks. Further complicating the analysis was the fundamental difference between the studies: some collected tags or annotations, while others conducted situated crowdsourcing studies. In the articles in which we could infer the number of tasks, the number ranged from 10 tasks [224] to 55,000 [225] (assuming one rating per task). Because of the difficulty of determining the exact number, we do not report the mean and standard deviation of the number of tasks in these articles.
Monetary rewards were reported in 33 articles (19.3%), but often without explicitly mentioning MTurk fees. Of these articles, eight involved unpaid volunteers [1,11,33,88,89,105,214,239], two articles simply stated that participants were paid minimum wage [100,139], and one article involved a raffle for an iPad [134]. Three articles reported the monetary rewards in the crowd pilot study as an average hourly [4,132] or per minute wage [81]. Among the remaining 20 articles, the monetary pay for participating in the crowd pilot study ranged from $0.01 [149] to $10 [61] per task. Welbourne et al. [224] paid "a maximum of $30 US" (p.3), but in this case, workers were recruited on Elance and ODesk (now Upwork) and the actual bids may have been lower.
Only a few articles (n = 3, 1.8%) reported experimenting with different price points. Kim et al. [111] experimented with paying workers in increments "from $0.00 to $8.00 [...] up to the federal minimum wage in the United States ($7.25/hour as of April 2, 2011)" (p.4). Similarly, Borish and Lok [20] posted tasks "in $.05 increments, starting at $.15 and going up through $.50" (p.10). Rodríguez et al. [178] investigated the robustness of results against varying levels of reward (from US$0.05 to US$0.10). Bonuses to workers in the context of crowd pilot studies, in general, were only mentioned in a few articles. Vonikakis et al. [220], for instance, mention experimenting with different incentive schemes that involve a bonus to well-performing workers and Huang and Fu [96] conducted a crowd pilot study to determine a bonus based on the workers' average accuracy.

4.5 Differences in how Crowd Pilot Studies are reported
4.5.1 Are there differences in crowd pilot study reporting between research communities? As mentioned in Section 4.1 and depicted in Figure 3, the bulk of crowd pilot studies were reported in three research communities: CHI, CSCW, and HCOMP. The former two are closely related human-centered venues and researchers often submit to both venues. The latter is a venue specialized in advancing the state of the art of human computation and crowdsourcing, but also in applying it practically. We explored differences between the three venues in how crowd pilot studies are being reported in human-centered conferences (CHI and CSCW) as compared to crowdsourcing research (HCOMP). Our initial hypothesis is that there will be a difference between the communities since best practices will likely emerge from within the community of practice [227] in the crowdsourcing-focused domain at HCOMP.
We first investigate how consistently crowd pilot studies are being reported in the three communities. We define 'consistency' in the reporting of crowd pilot studies as the uniformity in the use of terminologies, methodologies, and presentation of results across the surveyed articles. Specifically, an article is deemed 'internally consistent' if its descriptions, methodologies, and terminologies related to pilot studies remain coherent and unambiguous throughout the article's text. In contrast, articles with varied references to pilot studies are deemed 'inconsistent.' Looking at the ratio of internally consistent articles in the three venues (cf. Figure 12), we find CHI and CSCW are about comparable in consistency (73.0% and 72.7%, respectively). The ratio of internally consistent articles published at HCOMP is slightly lower (69.6%), but this difference is not significant (pairwise t-tests, each with p > 0.05). We find there is no agreement between articles in the three venues on which term is used to denote pilot studies, even among the articles which use only one term. A wide range of different terms are being used in the three communities, with 'pilot study' being most common, followed by 'pilot' and 'pilot experiment' (see Figure 13).
Looking at the placement of pilot studies within articles (see Figure 14), we find that crowd pilot studies are often reported in sections relating to the methodology (e.g., study design or experiment design). There is no significant difference between the three venues when it comes to the section in which the pilot study is being reported, χ²(22, N = 90) = 19.37, p = 0.6224. Four HCOMP articles (17.4% of the articles in this venue) reported the pilot study in a separate section, which highlights the importance of pilot studies in the field of crowdsourcing, as compared to articles in CHI (5.4%) and CSCW (9.1%). The number of pilot studies conducted within an article is similar in all three venues, with one single crowd pilot study being most common (see Figure 15). At CHI and HCOMP, it is also common for articles to report more than one crowd pilot study. The difference between the three venues is, however, not statistically significant (χ²(10, N = 82) = 8.3317, p = 0.5965).
When it comes to reporting key statistics about the crowd pilot studies, we find that in all three venues, the most common way of reporting a crowd pilot study is in passing without providing any details about the number of participating workers, the number of tasks, or the exact monetary rewards provided to workers. A large percentage of articles (between 65% and over 95% of the articles reporting crowd pilot studies in each conference venue) do not report these three key statistics (see Figure 16). There are, however, differences between HCOMP and CHI/CSCW when it comes to reporting details about the crowd pilot study. Authors in HCOMP are more likely to report key statistics about the crowdsourcing campaign as compared to CHI and CSCW (see Figure 16). HCOMP articles report the number of workers almost twice as often as CHI articles. Similarly, the number of tasks assigned to workers is more likely to be reported in HCOMP articles (34.8%) as compared to CHI (5.4%) and CSCW articles (13.6%). This difference is even more pronounced when it comes to reporting payments to workers. About a quarter of the HCOMP articles reported payments to workers participating in crowd pilot studies, as compared to 8.1% of the CHI articles and 4.5% of the CSCW articles. One possible reason for this is that authors at HCOMP may be more sensitive to issues surrounding crowd work due to fairness of crowd work being a long-standing research topic in human computation and crowdsourcing. The differences between the three conference venues were, however, only statistically significant for the number of tasks, χ²(2, N = 82) = 9.2864, p = 0.0096.
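The test on the number of tasks can be illustrated with a small chi-square test of independence. The sketch below is not the authors' analysis script. The cell counts are an assumption: they are back-calculated from the percentages reported above, which are consistent with 37 CHI, 22 CSCW, and 23 HCOMP articles (e.g., 17.4% of HCOMP articles corresponds to 4 of 23). The function name `chi_square_independence` is likewise hypothetical.

```python
import math


def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of venue and reporting.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Rows: CHI, CSCW, HCOMP. Columns: number of tasks reported, not reported.
# Counts are back-calculated from the reported percentages (assumption).
table = [[2, 35], [3, 19], [8, 15]]
stat, dof = chi_square_independence(table)
# For exactly two degrees of freedom, the upper-tail probability
# of the chi-square distribution has the closed form exp(-x / 2).
p_value = math.exp(-stat / 2)
print(round(stat, 4), dof, round(p_value, 4))
```

With these back-calculated counts, the sketch yields a statistic of about 9.29 with two degrees of freedom and a p-value just under 0.01, in line with the values reported above. For tables with other degrees of freedom, the closed-form p-value no longer applies and a statistics library would be needed.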
In summary, we found no major differences between the HCOMP and CHI/CSCW communities in terms of the number of crowd pilot studies being conducted and the wording used within articles. The consistency of wording within articles was comparable between the three venues, with many different terms being used to denote the crowd pilot study (some more common than others). Authors in all three venues prefer to report the results of crowd pilot studies in a section relating to methods, with the study design section being the top choice of authors.

4.5.2 How do crowd pilot studies differ between academia and industry? In our survey study, we asked participants if they had industry experience or worked closely with the industry. In response to this, four participants (36.36%) responded they had industrial experience, while seven did not have experience (63.64%). Two out of four (50%) who had industrial experience do research in collaboration with an industrial partner, while one (25%) indicated that he is planning to conduct a pilot study with industry. When asked what differences the participants found between the academic and industrial crowd pilot studies, one indicated that "industrial pilot studies cost more in salaries than academics" while another respondent stated that "the pilot study was to create a digital asset for the company." We also asked about potential differences between crowd pilot studies on in-house/internal platforms and other commercial platforms (e.g., MTurk). Participants came up with a variety of feedback. One participant indicated that "the internal CS platform is more accurate than other commercial platforms" and that "internal systems are easier to use as managers would have no issue with them, external ones are more tricky due to privacy and security issues."

4.6 RQ4: What makes a "good" crowd pilot study?
In response to this question, researchers in our survey (cf. Section 3.5) identified several qualities that define a good crowd pilot study. These qualities relate to the objectives for running a pilot study and may stand in tension with each other, as evident in the following sub-sections.
4.6.1 Mimicking the main experiment. Two researchers stated that a successful crowd pilot study is "as similar to a formal experiment" and "one that only differs from the complete study by sample size." This finding is also consistent with the recommendations of other researchers that a (crowd) pilot study should mirror all the processes of the main research and adhere to the identical protocol, including inclusion and exclusion criteria for participants, measuring tools, and training resources [101].
4.6.2 Exploration and experimentation. Others stated that the paramount quality of a good crowd pilot study is its exploratory nature, which provides them with different directions for their primary research questions or hypothesis. For instance, one noted that "[a good pilot is the] one which gives a clear direction of which RQs/directions would be more promising to pursue in an actual study" and one that "should give researchers some useful inputs about their hypothesis or prototype." This finding shows that researchers use pilot studies in the conception phase of their projects when they need supporting evidence to develop a research question and research plan [215].

4.6.3 Validating the feasibility of a study. Another quality mentioned by researchers is the ability of a crowd pilot study to assess the feasibility of an approach. This feasibility could also refer to the technical feasibility, where researchers test rigorously through several trials that "all functions are working, and log [that the] system can repeat what users have done." Others viewed a good crowd pilot study as one that "allows to validate the functioning of your task and it allows you to gather a sample of the final expected data" because "it costs high to redo a formal exp." Other respondents named the assessment of task-related information as the main criterion that a crowd pilot study should incorporate, such as the "task length, number of workers required, task complexity and task design". For instance, one researcher summarized this in the following words: "I think we need to always do a pilot study, to figure out if both the design and the technical problems are solved."
4.6.4 Accurate estimation of campaign parameters. Another critical dimension that defines a good crowd pilot study is its ability to inform sample size and power calculations. For instance, researchers reported that a good crowd pilot study could help to "[correct] the sample size errors" and "help to calibrate the power calculation." Similarly, one participant reported that a crowd pilot study should help to estimate the "statistical data involved, e.g., mean/median/sd."
Thus, supporting sample size and power calculations is another essential quality of a good crowd pilot study. These estimates are even more crucial in crowdsourcing research, where the affordability and affordances of crowdsourcing platforms allow researchers to hire hundreds of participants. However, the sample size required for the main trial needs to be estimated cautiously, since a crowd pilot study provides only an estimate of the standardized effect size [123]. Moreover, one may also need to account for participant exclusion in such cases.
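How a pilot estimate can feed into the sample size of the main study is sketched below. This is a minimal illustration, assuming a two-sided comparison of two independent groups under a normal approximation. The function name, the pilot effect size of 0.5, and the 10% exclusion rate are hypothetical and not taken from the survey responses.

```python
import math
from statistics import NormalDist


def per_group_sample_size(effect_size, alpha=0.05, power=0.80, exclusion_rate=0.0):
    """Participants to recruit per group for a two-sided, two-sample comparison.

    Normal-approximation formula: n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2,
    then inflated so that enough participants remain after exclusions.
    """
    if effect_size == 0:
        raise ValueError("effect_size must be non-zero")
    if not 0 <= exclusion_rate < 1:
        raise ValueError("exclusion_rate must be in [0, 1)")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n_analyzable = 2 * (z_alpha + z_power) ** 2 / effect_size ** 2
    return math.ceil(n_analyzable / (1 - exclusion_rate))


# Hypothetical pilot estimate: standardized effect size d = 0.5
print(per_group_sample_size(0.5))                       # no exclusions assumed
print(per_group_sample_size(0.5, exclusion_rate=0.10))  # 10% of workers excluded
```

Because the pilot effect size is itself only an estimate, it may be prudent to repeat the calculation with a smaller, more conservative value before committing a budget to the main study.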

RQ5: What factors promote or obstruct the reporting of crowd pilot studies?
We approached this question by posing both closed- and open-ended questions. In a closed-ended question, we asked participants to indicate potential reasons that either encourage or restrict the reporting of crowd pilot studies. These reasons include page limit restrictions, funding availability, article types, and reviewer preferences (see Figure 17). Most participants indicated that page limitations were the most critical factor (n = 6), followed by availability of funding (n = 5). Only two respondents answered that the article type was the most essential factor, and one responded that he does crowd pilot studies because reviewers want them.
4.7.1 Page limit restrictions. Regarding the page limitations, respondents felt that "it limits the content length of the report" and they would prefer actual experiments over crowd pilot studies "because the results of the formal experiment are more interesting [than pilot studies]." Another respondent believed that "conf[erence] papers normally require a tight page limit which would squeeze the space for rather important content (e.g., results)." Another respondent who worked in the area of crowd-powered applications responded that "justifying some design decisions of a big crowd-powered system is probably not very critical. We will likely cut these justifications when we don't have space."
We also asked "if there is no page limit, will this make you more likely to report crowd pilot studies in your articles?" Three out of eleven respondents believed that they will 'very likely' report pilot studies, two affirmed that they would undoubtedly report pilot studies, while one was neutral about this opinion. We also noted that no respondents selected 'unlikely' or 'highly unlikely', which suggests that page limitation is a rather decisive factor. This situation is slowly changing. For instance, conferences have been slowly transitioning to a revise and resubmit cycle along with more flexible manuscript lengths, which removes page restrictions and permits authors to expand the methodology and design sections, enabling the reporting of crowd pilot studies.

Availability of funding.
Participants also felt that the availability of funding may encourage the scalability of an experiment and extensive testing of a product before it could be made available. A participant, for example, thought that "funding is crucial in scaling the experiment, and the funding sources tend to encourage folks to include pilot results in the grant application." Another participant who believed crowd pilot studies were important for iteration and testing stated this as follows: "A good project needs to be developed and tested for a long time before it can be released. Therefore, the initial investment is relatively large and stable sources of funds are needed."

Article types.
The article type also played a significant role in inhibiting the reporting of a pilot study. For example, one respondent explained: "To report this we need to write a paragraph or at least some sentences describing it. If we need to cut down something due to exceeding the page limit, this would be an option. For conference or journal, because the target audience have different focus. For some system-focused venue, we may shorten the description of data collection and experiment design by skipping this."

Reviewer preferences.
One respondent was of the opinion that "reviewers perceive pilot studies as less impactful and therefore would not be willing to accept them for publication."

4.8 RQ6: How can crowd pilot studies be facilitated with platform-specific features?
In our literature review, we found a handful of mentions of platform-specific technical features that were used for conducting and monitoring crowd pilot studies. In the remainder of this section, we reflect on the design of such features, based on our literature review, the results from our survey study, and our experience with different crowdsourcing platforms.
4.8.1 Exit surveys for facilitating crowd pilot studies. One possible feature for supporting crowd pilot studies is the 'exit survey.' An exit survey is a short questionnaire that workers fill out after completing tasks. Exit surveys were used by a few authors to measure or monitor workers' satisfaction with the payment during crowd pilot studies. For instance, Lykourentzou et al. [128] used the aggregated results of the exit survey (provided by CrowdFlower/Figure Eight) to validate and justify the choice of payment. The authors reported the results of the exit survey indicated "that the chosen payment was considered acceptable by the workers" and that "the selected compensation was appropriate for the specific study setting" (p. 265). Another use for an exit survey is collecting demographics, which is especially important on microtask platforms where tasks are typically too short to collect demographics. For instance, Wilson et al. [230] used a custom exit survey on Amazon Mechanical Turk to collect demographic information.
4.8.2 Reward calculation. Besides the above feature, participants in our survey mentioned a number of other features that could facilitate crowd pilot studies. Most often mentioned was a "reward calculator" which could calculate rewards based on estimated completion times. As found in our literature review, the calculation of rewards from average task completion times is one of the most common reasons for conducting crowd pilot studies. Prolific, a crowdsourcing platform for academic studies, already offers a recommendation for the price of a task, based on the estimated time. This is, however, only an incomplete solution because it is difficult for a requester to estimate the completion time; often, the very reason for conducting the crowd pilot study is finding this estimate. However, crowdsourcing platforms are host to many different types of tasks. Given the large variety and number of tasks on crowdsourcing platforms, it would be possible for platform operators to collect information on tasks and to devise machine learning based platform features to support the estimation of task completion times and task rewards, based on empirical data collected on the crowdsourcing platform.
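The core of such a reward calculator can be sketched in a few lines. This is an illustrative sketch rather than a feature of any existing platform: the function name, the pilot completion times, and the target wage of US$15 per hour are hypothetical.

```python
from statistics import median


def reward_per_task(completion_times_sec, target_hourly_wage):
    """Suggest a per-task reward from pilot completion times (in seconds)."""
    if not completion_times_sec:
        raise ValueError("need at least one completion time")
    # ASSUMPTION: the median is used as the typical completion time
    typical_seconds = median(completion_times_sec)
    return round(typical_seconds / 3600 * target_hourly_wage, 2)


pilot_times = [120, 150, 180, 240, 600]  # hypothetical pilot data, in seconds
print(reward_per_task(pilot_times, target_hourly_wage=15.0))
```

The median is used here so that a single very slow worker does not dominate the estimate. A requester who wants to protect slower workers could instead base the reward on the mean or on a higher percentile of the completion times.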
4.8.3 Better support for running qualification studies. Screening criteria were mentioned often by the survey participants. Crowdsourcing platforms differ in their capabilities to support screening and qualification studies. Amazon Mechanical Turk offers only a limited set of qualification criteria for screening participants. Custom qualifications can be created, but this requires running a study, collecting results, and then uploading a comma-separated values (CSV) file to Amazon Mechanical Turk to assign the custom qualifications to workers. Only then can the qualification be selected in future studies. Although Prolific offers a broader array of pre-defined qualification criteria, setting up custom qualification studies can be just as complex as in MTurk (when implemented via a survey study) or restricted to Prolific's in-built multiple-choice survey options. Better user interfaces for running qualification studies and setting (or deleting) qualifications are needed. The survey participants further perceived a need for an MTurk feature to extend running studies with more participants. On Amazon Mechanical Turk, no changes can be made to a running crowdsourcing campaign. This leads to disparate sets of survey results which then need to be manually integrated by the researcher. Last, the participants in our survey mentioned wanting better tools to communicate with workers, such as a chat or e-mail service. This speaks to the survey participants' need for less dehumanizing communication with crowd workers [17]. Features to communicate with the crowd would allow requesters to better monitor ongoing studies and grow a base of trusting participants [191].
Clearly, there is an opportunity for the design of dedicated features on crowdsourcing platforms that could better support and facilitate running crowd pilot studies. Features, such as the above, could support best practices in crowdsourcing. We reflect on the importance of best practices and make recommendations for reporting crowd pilot studies in the following section.

MOVING FORWARD: BEST PRACTICES FOR REPORTING CROWD PILOT STUDIES
Crowd pilot studies are a common and required method in crowdsourcing research due to the empirical nature of the crowdsourcing paradigm. Unsurprisingly, many authors report having conducted crowd pilot studies in the scholarly literature. Yet, no scientific study spanning crowdsourcing has investigated this topic in depth. Our work aimed to fill this gap. In this section, we reflect on our findings and the current state of best practice on reporting crowd pilot studies.

Readdressing current practices for reporting crowd pilot studies in crowdsourcing research
Crowd pilot studies connect to two strands of research in the field of crowdsourcing that touch upon the very nature of crowdsourcing: fair and responsible crowdsourcing [193,229] as well as reproducibility in empirical computer science [36]. These two issues have long been debated in the scholarly literature.
5.1.1 Best practices for fair and responsible crowdsourcing research. Crowd pilot studies account for a significant amount of work that is, to a large extent, unaccounted for in the scholarly literature. Since the majority of authors use opaque language that masks the extent of studies, little is known about the real extent of crowd pilot studies. Further, due to the empirical nature of crowdsourcing, it is likely that crowd pilot studies often underpay participants. Estimating the rewards for crowdsourced tasks is hard, and one way of addressing this shortcoming is to raise the basic level of payment or to assign bonuses to workers in a post-hoc manner to fairly compensate participants in crowd pilot studies [16,85,165]. However, it is likely that workers in crowd pilot studies are substantially and potentially systematically underpaid [45,80,133]. Recent work has unearthed different forms of invisible labor that crowd workers put in as they strive to earn their livelihood in various crowdsourcing marketplaces [75,208]. Prior work has also revealed how crowd workers are often subject to unfair rejections following qualification studies [55,64,137]. It is likely that such practices carry over to ill-reported crowd pilot studies. Interestingly, extremely few articles in our literature review reported that bonuses were given to workers in or after crowd pilot studies. Based on results from our literature review and survey, we find it is more typical, though still not common, to pay bonuses to participants for performing well in the main study.
later used as input for the main study of the article. In this sense, much of the reporting on crowd pilot studies uses 'hedging' language, a "rhetorical means of gaining acceptance of claims" [99]. We argue that researchers should do their due diligence on the research claims made and report transparently on the aims and results of crowd pilot studies. Readers need to know the details about crowd pilot studies. For instance, a reader needs to know the number of participants in a crowd pilot study "to know that the study was big enough to justify the claims made" [129]. Authors need to realize that opaque reporting on crowd pilot studies, especially if it is done as a summative evaluation (as was the case in a few articles we reviewed), weakens the claims of the authors' research. Insufficient details can impede the progress of science in general. The current state of reporting on crowd pilot studies exacerbates and affirms this widespread practice of opaque reporting. More transparency on reporting crowd pilot studies is needed to nudge the current state of reporting in the field of crowdsourcing toward a code of practiced ethics that values transparent reporting of crowd pilot studies. However, crowdsourcing is still a relatively young field where good practices need building.

5.1.3 Treating the crowd workforce fairly. One of the pivotal realizations that has emerged through research and practice within the crowdsourcing community over the last few years is the need to treat crowd workers fairly and with dignity, whether in terms of the hourly wages paid or with respect to communication with workers [103,193,229]. It is now commonplace in most HCI communities to declare the hourly wage that participants are paid in reported main studies. By raising the bar for what is expected in the reporting of crowd pilot studies in scholarly literature, we can hope to instill the otherwise (potentially) dormant desire to pay workers fairly within crowd pilot studies. This will increase the overall accountability of researchers and other requesters, and help bridge a gap in the invisible labor prevalent in crowdsourcing marketplaces [75].
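One way to act on this in a crowd pilot study, where completion times are only known afterwards, is to top up each worker's payment post hoc to a declared hourly wage. The following is a minimal sketch under stated assumptions: the function name, the worker identifiers, the logged times, and the target wage of US$12 per hour are all hypothetical.

```python
def top_up_bonuses(work_log, target_hourly_wage):
    """Return {worker_id: bonus} lifting each worker to the target hourly wage."""
    bonuses = {}
    for worker_id, (seconds_worked, amount_paid) in work_log.items():
        fair_pay = seconds_worked / 3600 * target_hourly_wage
        shortfall = round(fair_pay - amount_paid, 2)
        if shortfall > 0:  # only underpaid workers receive a bonus
            bonuses[worker_id] = shortfall
    return bonuses


# Hypothetical pilot log: worker -> (seconds worked, amount already paid in US$)
pilot_log = {
    "worker_a": (300, 1.00),
    "worker_b": (600, 1.00),
    "worker_c": (120, 1.00),
}
print(top_up_bonuses(pilot_log, target_hourly_wage=12.0))
```

Reporting both the target wage and the total bonus paid would make the effective compensation in the pilot study transparent to readers.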
Beyond crowd pilot studies, the broader domain of data annotation stands as another significant area where fairness in treatment and payment of crowd workers is paramount.Data annotators play a foundational role in shaping machine learning models and other AI systems by providing high-quality labeled data.Yet, there have been growing concerns about the remuneration, working conditions, and well-being of these data annotators, especially given the labor-intensive nature of their tasks [75,103,153,202,221,244].Inadequate compensation for data annotators not only poses ethical dilemmas but also risks compromising the quality of annotated datasets.By ensuring fair wages and conditions for these workers, we not only uphold the principles of ethical research and practice but also contribute to the production of more reliable and robust AI systems.It is crucial for the HCI and broader AI communities to address this concern head-on, establishing standards that reflect the true value of this indispensable labor.
Creating a widespread change in how crowd pilot studies are reported will require collective action. This is especially so because well-meaning authors often face trade-offs when reporting crowd pilot studies, as discussed in the following section.

The trade-offs around reporting crowd pilot studies
Researchers are influenced, consciously or unconsciously, in how they report crowd pilot studies. In this section, we discuss confounding factors and biases that may affect the reporting of crowd pilot studies in academia and industry.
5.2.1 Page limitations. Traditionally, the page limit at conference venues such as CHI was 10 pages (in two-column format, not including references). Some venues continue to uphold such strict restrictions on the number of pages in articles. Authors, therefore, face a difficult trade-off between reporting on pilot studies in detail and reporting on the main study. Our literature review provides evidence for this trade-off, with many crowd pilot studies being reported briefly and casually. In recent years, however, many venues in HCI (e.g., CHI and CSCW) have relaxed the page limitations, which could, in theory, encourage authors to dedicate more space to crowd pilot studies. However, a number of other biases and trade-offs may still make authors decide otherwise, such as the academic publishing model.
5.2.2 Academic publishing model. Publication bias is "the failure to publish the results of a study 'on the basis of the direction or strength of the study findings'" [43]. Due to the publication bias in academia, authors may feel the need to report positive findings in order to get published. A formative crowd pilot study, in particular, may (in the mind of authors and/or reviewers) not add to this goal. Further, if authors feel the pilot study does not contribute towards the acceptance of the article, the authors may decide to omit the pilot study or shorten the reporting. Another concern that authors may have is that if a formative pilot study is given too much space in the article, reviewers may view the article as a work-in-progress and recommend it for acceptance in a lesser capacity (e.g., as a poster). Therefore, researchers may decide not to report pilot studies because of a perceived need to produce writing that pleases reviewers. However, in the past years, some conference venues have opened up to the possibility of submitting works following the principles of open science. These venues encourage the submission of replications and articles with null or negative results which have been traditionally hard to publish. While these advances, so far, have been limited to special tracks, such as the Open Science track at the Conference on Intelligent User Interfaces (IUI) 2023, they could lead to a slow systemic change toward an academic system in which reporting on pilot studies is encouraged. In this regard, some referees may consider it favorable if crowd pilot studies are reported transparently and in detail.

Funding.
The availability of funding is another important factor that may influence the authors' decision to conduct or report pilot studies. For instance, the availability of funding may affect the extent of crowd pilot studies. If researchers are short on budget, they may skip or reduce the number of formative pilot studies. However, even if funding is available, authors may decide not to run crowd pilot studies in order to not "waste" the funding organization's money on formative studies with an anonymous crowd. For similar reasons, authors may decide not to report on crowd pilot studies. On the other hand, iterative experimentation is important, especially in the field of Human-Computer Interaction (HCI) where emphasis is placed on iterative and participatory design to ensure optimal outcomes in a variety of contexts [19]. The very process of design requires iteration and formative experimentation to arrive at an acceptable solution.

Corporate or organizational culture.
If not the academic system or external funding, then the internal culture of an author's organization could discourage conducting pilot studies. For instance, universities in Finland recently started following stricter directives from the Tax Administration in a move towards a system where any rewards to participants, whether cinema vouchers, gift cards, or monetary payments, need to be declared to the tax office, regardless of the monetary value of the rewards. Because monetary compensations to participants are subject to withholding tax, this creates overhead for the university administration. Even more concerning is that researchers are asked to collect private information from their study participants (name, address, and social security number) if participants are to be rewarded. Therefore, researchers in Finland are strongly discouraged from using paid participant samples in their research. This development is deeply worrisome as it discourages researchers in Finland from conducting ethical and fair science.

Guidelines for Reporting Pilot Studies
Our analysis of the HCI and crowdsourcing literature allows us to provide recommendations for reporting crowd pilot studies. In this section, we revisit the research questions RQ1-RQ3 and connect the findings of our literature review to recommendations for reporting crowd pilot studies.
Be transparent on the reasons for conducting crowd pilot studies. Most articles in our literature review conducted crowd pilot studies for formative reasons. However, in articles that report crowd pilot studies in passing, it was sometimes not explicitly stated why a pilot study was conducted. Clearly motivating the crowd pilot study will provide clarity to the writing and increase the readers' understanding of why a pilot study was needed.
Inform crowd workers that they are participating in a crowd pilot study. Crowd workers are subject to a wide range of tasks posted on crowdsourcing platforms. Some tasks are more and some less lucrative for the workers. Crowd pilot studies may fall into the latter category, especially if the task price is not estimated accurately. While some workers are not motivated by extrinsic factors and may enjoy participating in crowd pilot studies [151], other workers may want to avoid them. Workers should be informed that they are participating in a small-scale study (that may potentially be under-priced).

How are crowd pilot studies typically reported? (RQ2).
Use consistent wording. Academic writing requires precise language. Using different terms within an article to refer to crowd pilot studies may confuse an uninitiated reader. We recommend using the term 'pilot study' to denote crowd pilot studies. This term was the most common term used in the literature (cf. Figure 7).
Report crowd pilot study findings in one section. We found that many authors scatter findings from their pilot studies throughout their papers. To improve the clarity of pilot study reporting and to better showcase the results of pilot studies, we recommend bundling the reporting of crowd pilot studies in a single section of the article. This would improve both the reader's understanding of the extent of pilot studies conducted and the reproducibility of the pilot study. Our analysis of the literature indicates that it is most common to report the results of crowd pilot studies in design-related sections (cf. Figure 9).

What do crowd pilot studies report? (RQ3).
Report the number and extent of crowd pilot studies. A considerable number of articles in our literature corpus (62 articles, about 36%) did not provide information on the exact number of crowd pilot studies being conducted. In other cases, the number of studies being conducted had to be calculated from information scattered in the article. Authors should clearly state the number of pilot studies and their respective extent.
Clearly identify the participants. There should be no room for interpretation when it comes to who participated in the crowd pilot study. In particular, researchers should identify whether crowd workers participated in the pilot study or whether the pilot study was conducted with a different participant sample (e.g., students, experts, or internal participants). This also includes information on the crowdsourcing platform used in the pilot study, if it cannot be reliably inferred from the context in the article. Internal pilot studies should be clearly denoted as such.
Report the key attributes of each crowd pilot study. If page restrictions limit authors from reporting in-depth on crowd pilot studies, we recommend including at least the following key information when reporting crowd pilot studies:
• number of participating crowd workers,
• number of tasks (or assignments to workers),
• payment per task,
• participation constraints enforced (including platform settings), and
• the type of crowd or crowdsourcing platform.
The latter could be omitted if it is clear from the context that only one crowdsourcing platform was used throughout the article. The selection of rewards to workers in the crowd pilot study should be justified. If there are major discrepancies between the rewards paid in the crowd pilot study and the main study, it should be explained how these discrepancies arose and what measures were taken to remedy them.
Report a minimum set of information. Inspired by scientific reporting guidelines, such as guidelines by the American Psychological Association [12], and based on the above recommendations while also considering the trade-offs discussed in Section 5.2, we propose a condensed format for reporting formative crowd pilot studies:
. . . pilot study (MTurk; N=12; 1000 HITs; US$4.5 per HIT) . . .
In combination with any of the preferred methods of reporting on pilot studies, such as "in a pilot study (. . .), we found. . ." or ". . .based on a pilot study (. . .)" (cf. Figure 8), this condensed format provides key statistics about the formative crowd pilot study (i.e., the crowdsourcing platform, number of participants, number of HITs, currency, and price per HIT) without taking up an undue amount of space in the article.
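The condensed format can also be generated from structured study metadata, which may help keep it consistent across articles. The sketch below is one possible rendering: the class name, field names, and defaults are hypothetical, and only the output format follows the proposal above.

```python
from dataclasses import dataclass


@dataclass
class CrowdPilotReport:
    """Key statistics of one formative crowd pilot study (hypothetical schema)."""
    platform: str
    participants: int
    tasks: int
    price_per_task: float
    currency: str = "US$"
    task_unit: str = "HIT"

    def condensed(self) -> str:
        """Render the statistics in the condensed in-text format."""
        return (
            f"pilot study ({self.platform}; N={self.participants}; "
            f"{self.tasks} {self.task_unit}s; "
            f"{self.currency}{self.price_per_task:g} per {self.task_unit})"
        )


report = CrowdPilotReport(platform="MTurk", participants=12, tasks=1000, price_per_task=4.5)
print(report.condensed())
# prints: pilot study (MTurk; N=12; 1000 HITs; US$4.5 per HIT)
```

Participation constraints are left out of this sketch because the condensed format does not include them; they would need to be reported in the running text.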
We hope that authors will adopt at least this condensed way of reporting crowd pilot studies to increase the transparency and reproducibility of their research. In the same vein, we hope that reviewers in conferences and journals publishing research with crowd pilot studies will, in the future, place increased emphasis on seeing transparent reporting of crowd pilot studies.

Practical Suggestions for Supporting Better Crowd Pilot Study Reporting
Creating a centralized repository for crowd pilot studies in crowdsourcing research could be a possible way to enhance the reporting and transparency of such investigations. The repository would serve as a dedicated platform for researchers to submit their crowd pilot studies following a standardized report format, as suggested in the previous section. This format should encapsulate critical elements including research questions, methods, results, and challenges encountered, thereby enabling a comprehensive understanding of the study without exceeding paper page limits. Furthermore, the structured reporting style within the repository should include specifics such as sample size, study duration, data cleaning methods, and outcomes. This standardized approach, coupled with a requirement for authors to detail challenges and potential improvements, could not only foster transparency but also provide insights for researchers undertaking similar studies.
To further encourage crowd pilot study reporting, a system of incentives could be introduced for authors who make use of the repository, ranging from formal acknowledgments within the academic community and citation opportunities to reduced publication fees in affiliated journals. At the same time, scientific journals and conferences could set forth clear guidelines encouraging the citation of crowd pilot studies from the repository in their submissions. Creating this repository and encouraging its use would help cultivate a research culture that values transparently reporting crowd pilot studies, ultimately leading to more accurate, rigorous, and replicable crowdsourcing research.

Limitations
In our literature review, we made pragmatic choices to limit the set of literature to what we believe is a representative coverage of the literature, as mentioned in Section 3.2. Our screened corpus included all articles from HCOMP, the premier venue for crowdsourcing research. The ACM Digital Library contains articles from human-centered journals and conferences, such as CHI and CSCW. We do not claim that the selected corpus generalizes to all publications involving pilot studies. However, this corpus provided a good view into the prevailing practices in diverse research communities.
Another aspect in achieving representativeness is the choice of search keyword. Our literature review may have missed articles that do not contain the 'pilot' keyword and, instead, refer to the pilot study in other ways, such as "preliminary study" or "formative study." However, both our survey and scoping review of the literature found that 'pilot study' is the most common term to refer to crowd pilot studies. Further, there is a difference between pilot studies and preliminary studies. The latter are primarily conducted to identify user needs and for defining requirements. In the context of crowdsourcing, however, pilot studies are conducted for the specific purpose of determining and validating the parameters of a crowdsourcing campaign. We argue that 'pilot study' is an established term that is used to refer to small-scale formative studies in crowdsourcing-based research. It is this type of study that we investigated in our paper.
We acknowledge that there may be more reasons that prevent authors from reporting crowd pilot studies more elaborately, which did not surface in our investigation. We therefore cannot treat this work as an exhaustive account of why or how crowd pilot studies are being reported.

CONCLUSION
In this paper, we provided an extensive investigation into the state of pilot study reporting in crowdsourcing research. Our systematic screening of over 500 publications at the intersection of HCI and crowdsourcing literature resulted in a corpus of 171 articles which we analyzed in depth. Our analysis revealed that authors are often vague about the extent and content of their crowd pilot studies. Insufficient details pertaining to such pilot studies can hinder replication and reproducibility, and stall the progress of scientific research. We explored the various reasons that drive authors to carry out crowd pilot studies (RQ1), how they are typically reported (RQ2), and what such reports contain (RQ3). Through synthesizing related literature and via a survey study with crowdsourcing researchers in academia and industry, we explored the desirable attributes of a crowd pilot study (RQ4), and the factors that influence the reporting of crowd pilot studies (RQ5). Finally, we explored platform-specific features that can support and facilitate crowd pilot studies (RQ6). Based on our findings, we reflected on how detailed reporting of crowd pilot studies can further aid fair, responsible, and reproducible crowdsourcing research. We presented insights into the trade-offs that authors make while reporting crowd pilot studies and proposed guidelines for reporting them. Our proposed guidelines for reporting crowd pilot studies and the APA-inspired way of doing so, concisely but effectively, can have important implications on the proliferation of crowdsourcing research, crowdsourcing as a sound scientific method, and on the anonymous crowd workers who undoubtedly play the most pivotal role in sustaining the crowdsourcing paradigm.

Fig. 2. Articles reporting crowd pilot studies in the ACM Digital Library and HCOMP proceedings.

Fig. 3. Conference venues and journals in the corpus.

Fig. 4. Bar chart of author-provided keywords appearing in the analyzed papers, including only keywords mentioned at least three times.

Fig. 5. Types of crowd pilot studies and ways of reporting crowd pilot studies in the literature.

Fig. 6. Task-design-related reasons for conducting crowd pilot studies reported in the literature.

Fig. 7. Different terms used to refer to pilot studies in the articles. Note that some authors used multiple terms to refer to crowd pilot studies within their article.

Fig. 8. Phrasing used to report on the results of pilot studies (abbreviated 'ps' in this figure) among articles which mention formative crowd pilot studies.

Fig. 9. Sections in which authors report the results of their pilot studies.

Fig. 10. Number of pilot studies conducted in the article.

Fig. 12. Consistency of wording within articles in different research communities.

Fig. 13. Wording used to refer to crowd pilot studies within articles in different research communities.

Fig. 14. Type of section in which pilot studies are reported in different research communities.

Fig. 15. Number of crowd pilot studies conducted in the articles in different research communities.

Fig. 17. Factors that promote or inhibit the reporting of crowd pilot studies.

Table 1. Research articles reporting crowd pilot studies per year.

5.1.2 Reproducibility in crowdsourcing research. The crowdsourcing paradigm has many known limitations. For instance, results obtained from crowdsourcing studies may be difficult to reproduce due to the anonymity of the workforce. The opaque reporting of crowd pilot studies, as evidenced in our literature review, adds one additional layer to the issue of reproducibility. The strong prevalence of reporting on study results in passing accentuates and entrenches bias in research and helps bad practices to endure. For instance, some authors used the crowd pilot study to substantiate claims. Crowd pilot studies are sometimes used as a magic linguistic device to materialize results that are

Proc. ACM Hum.-Comput. Interact., Vol. 4, No. CSCW, Article . Publication date: December 2023.