"Are we all in the same boat?" Customizable and Evolving Avatars to Improve Worker Engagement and Foster a Sense of Community in Online Crowd Work

Human intelligence continues to be essential in building ground-truth data, training sets, and for evaluating a plethora of systems. The democratized and distributed nature of online crowd work — an attractive and accessible feature that has led to the proliferation of the paradigm — has also meant that crowd workers may not always feel connected to their remote peers. Despite the prevalence of collaborative crowdsourcing practices, workers on many microtask crowdsourcing platforms work on tasks individually and are seldom directly exposed to other crowd workers. In this context, improving worker engagement on microtask crowdsourcing platforms is an unsolved challenge. At the same time, fostering a sense of community among workers can improve the sustainability and working conditions in crowd work. This work aims to increase worker engagement in conversational microtask crowdsourcing by leveraging evolving avatars that workers can customize as they progress through monotonous task batches. We also aim to improve group identification in individual tasks by creating a community space where workers can share their avatars and feelings on task completion. To this end, we carried out a preregistered between-subjects controlled study (N = 680) spanning five experimental conditions and two task types. We found that evolving and customizable worker avatars can increase worker retention. The prospect of sharing worker avatars and task-related feelings in a community space did not consistently affect group identification. Our exploratory analysis indicated that workers who identify themselves as crowd workers experienced greater intrinsic motivation, subjective engagement, and perceived workload. Furthermore, we discuss how task differences shape the relative effectiveness of our interventions. 
Our findings have important theoretical and practical implications for designing conversational crowdsourcing tasks and in shaping new directions for research to improve crowd worker experiences.

Figure 1: This screenshot illustrates the worker community space in one of our experimental conditions (Evolving⊕Comm), in which workers could customize their avatars with an evolving set of features as they progressed through a batch of tasks. The community space includes the REFRESH, SIMILAR MOOD, ORDER ON LEVEL, and LOAD MORE buttons. These allow workers to (a) see avatars of a random subset of other workers who completed the same tasks, (b) see avatars of all workers who expressed the same task-related feelings after completing the same tasks, (c) order worker avatars based on the highest level that workers progressed to within the task batch, and (d) load avatars of all workers who completed the same tasks. On entering the community page, a worker's own avatar is displayed in the middle, with a random subset of other worker avatars displayed around it. All avatars are rendered with a text bubble describing their task-related feelings.


INTRODUCTION
The need for human input on demand has steadily increased alongside the growing adoption of artificial intelligence (AI) and machine learning (ML) systems across all domains [28]. The foundations of many AI systems we interact with daily rely on the labor of crowd workers [30]. With the availability of crowd workers on demand [12], human intelligence tasks (HITs) can be distributed and completed at scale on crowdsourcing platforms like Amazon Mechanical Turk,¹ Prolific,² and Toloka.³ Tasks range from data labeling [10], image annotation [48], and classification [82] to the creation and support of real-time healthcare applications [3,7].
Due to their repetitiveness, HITs can be monotonous and boring, causing task rejection and drop-out [34,60], which is problematic for both crowd workers and task requesters. Task rejection can affect the morale of crowd workers [17], and high drop-out rates result in low-quality crowd work, which also affects worker pay. Monotonous and boring work decreases the motivation of workers [9], resulting in reduced worker engagement. Furthermore, motivation is known to be an essential factor in reducing work-related stress and burnout [87]. Similarly, job satisfaction has been shown to be positively related to subjective well-being [6]. To decrease the problematic effects of monotonous and tedious tasks for crowd workers and task distributors, we need to improve worker engagement by creating better worker experiences. In the long run, this can also improve the quality of crowd work [97].
Although some crowdsourcing tasks require collaboration and teamwork among workers [10,19,55,66], workers typically execute microtasks individually and sometimes in isolation [24,57]. As a result, not all workers have the opportunity to experience a sense of community, and little is typically done to increase group identification among workers. In addition to improving worker engagement during task execution, increasing a sense of community can go a long way toward creating better worker experiences. Prior work has shown that crowd workers use external forums, such as Reddit HWTF, Facebook, MTurkGrind, MTurkForum, and Turkernation, to communicate with other crowd workers [91,95,96]. However, elaborate social interventions and facilitating extensive engagement via forums are not viable solutions for all workers. While several crowd workers have been shown to communicate with other workers, many do not and instead work alone [96]. In part, this may be because workers do not have time to engage in external forums as a result of other commitments not related to crowd work [1]. It is, therefore, prudent to explore whether a lightweight method that does not require extensive social engagement or exchange of private information can still help build a sense of community among workers while they complete tasks individually. Through our work, we aim to address these challenges pertaining to both research and empirical gaps.
Digital avatars are known to increase identification and user experience in online multi-player video games [89], solitary educational games [40], and conversational crowdsourcing tasks [68]. Moreover, the ability to personalize the avatar by customizing its appearance further increases users' self-identification with the avatar [5]. Prior HCI research has shown a promising impact of crowd worker avatar customization within a conversational interface to reduce cognitive workload and increase worker retention [68]. However, the notion of evolving and customizable worker avatars and their effect on worker experience and task-related outcomes remains unexplored. Addressing this research gap, we propose to couple avatar evolution and customization with workers' progress in task batches.
Since digital avatars facilitate the creation of a virtual identity [5,63,89], we argue that a personal worker avatar can be an effective tool to increase a sense of community among the workers while protecting their privacy. Prior research found that avatar identification relates to [90] and predicts [23] group identification in online video games. Gabbiadini et al. [23] explained that when users see their avatar in the group, they imagine themselves as being part of the group. Similarly, Takano and Taka [86] found that avatar identification has a positive effect on the feeling of belonging, partially mediated by self-expression. Inspired by this prior literature, we aim to facilitate group identification by creating a community space where workers can share their personalized avatars with other crowd workers. With the worker community space, we aim to build a lightweight intervention that can be used in tasks without elements of collaboration to reflect a feeling of unity [84] by placing the virtual identity of the worker among other worker avatars. As a part of customization, the facial expressions of avatars can then be used to share (task-related) feelings with other workers in a community space on task completion, as sharing feelings (affective self-disclosure) can contribute to a feeling of connection [84]. Combining the interventions of evolving avatars and group identification, we address the following research questions in our work:
RQ1 How do evolving and customizable worker avatars affect worker experience and task-related outcomes in conversational crowdsourcing?
RQ2 To what extent can the sharing of worker avatars in a community space affect the sense of group identification among crowd workers on a crowdsourcing platform?
RQ3 How does a sense of group identification, induced by a community space where customizable and evolving avatars among crowd workers can be shared, affect worker experience and task-related outcomes in conversational crowdsourcing?
By combining avatar customization, gamified avatar evolution, and creating a sense of community, we aim to improve overall worker experiences and the quality of the task outcomes. Worker experiences can be described and measured by their perceived workload, intrinsic motivation, and subjective engagement. Furthermore, we aim to analyze the impact of these interventions on task-related outcomes, such as retention, accuracy, and overall task execution time. To this end, we carried out a between-subjects study by recruiting workers from the Prolific crowdsourcing platform (N = 680), spanning five experimental conditions and considering two popular types of tasks (information finding and credibility analysis). We found that evolving and customizable worker avatars can increase worker retention. Although the worker community space was not successful in fostering an increased sense of group identification among crowd workers, we found that this varied across workers based on the extent to which they considered themselves as crowd workers. Workers who identify themselves as crowd workers experience a significantly greater perceived workload, intrinsic motivation, and subjective engagement. Our findings have important implications for the design of future conversational crowdsourcing tasks and for crowdsourcing platforms, with an aim to improve worker experiences and foster a sense of community. All code and data pertaining to this work can be found in the OSF repository for the benefit of the community and in the spirit of open science.⁴

RELATED LITERATURE AND HYPOTHESES
We position our work in the context of worker experiences in microtask crowdsourcing and literature in the realm of creating a sense of community among users. By building on existing works in these areas, we present and ground our research hypotheses.

Worker Engagement in Microtask Crowdsourcing
A promising way to improve worker engagement in repetitive crowdsourcing tasks is to improve the worker experience through gamification. Gamification in crowdsourcing tasks often leads to increased worker motivation, higher participation and throughput rates, and higher-quality work [62]. Feng et al. [21] proposed a model that describes how gamification indirectly increases crowd workers' intention to participate through an increased level of intrinsic motivation. Examples of how to incorporate (context-independent) gamification in crowdsourcing tasks are tracking scores, leaderboards, badges/achievements, and the use of increasing levels. Highlighting the importance of targeting intrinsic motivation compared to extrinsic motivation, the results of a study by Maddalena et al. [56] suggest that while a monetary incentive may increase retention, it decreases the quality of the work. Interestingly, the same study showed that while the total number of completed voluntary tasks was higher with gamification compared to no gamification (no furtherance incentives), this effect was caused by a number of outlier workers. This finding implies that only the workers who favor gamification show more engagement with the task. Another study that tested the effect of gamification on task retention and quality of the results found that retention and output quality increased when the task was gamified using levels [22]. The study tested multiple furtherance incentives and showed that game elements (badges, a leaderboard, levels, access, power, and a monetary bonus) can increase accuracy and cause tasks to be perceived as more rewarding and engaging, particularly for social incentives that involve visibility among crowd workers.
Prior work has explored competitive game designs, including monetary reward schemes inspired by the success of competitions, lotteries, and games of luck, to improve the cost-effectiveness of crowdsourcing tasks [76]. Rokicki et al. [77] proposed strategies for team-based crowdsourcing to improve crowdsourcing competitions, leading to performance boosts. Kobren et al. [46] proposed a survival model to predict the probability that workers will proceed to the next task available and leveraged this model to dynamically decide what task to assign and what motivating goals to present to the user. They proposed to jointly optimize for the short term (getting complex tasks done) and for the long term (keeping users engaged for more extended periods). Similarly, Gadiraju and Dietze [26] proposed using achievement priming to engage workers in long task batches and provide them with learning opportunities that can positively impact their performance. More recently, researchers proposed the use of conversational crowdsourcing as a more engaging interface for completing crowdsourcing microtasks [72] and found that using worker avatars can reduce the cognitive workload among workers and increase worker retention [68]. Inspired by prior work in this realm, we propose to leverage customizable and evolving worker avatars with features that become available to workers as they progress through task batches (giving rise to potentially evolving worker avatars).

Fostering a Sense of Community
2.2.1 Importance of Group Identification.
Community identification is one of the main intrinsic motivations for crowd workers [42]. In addition, a lack of intrinsic motivation is one of the reasons why crowd workers quit their work, as intrinsic motivation starts to outweigh extrinsic motivation (often a monetary incentive) after some time [83]. Kaufmann et al. [42] describe community-based motivation as "the acting of workers guided by the platform community, which is caused by a personal identification process". Furthermore, they mention social contact as another type of community-based motivation: "motivation caused by the sheer existence of the community that offers the possibility to foster social contact". Their study found that the main motivators of crowd workers who spend more than 8 hours per week on MTurk are skill variety (tasks that require multiple skills that fit with the specific skill set of the worker), human capital advancement (the possibility to train useful skills), and community identification. Ihl et al. [39] investigated social support (affective and instrumental), group identification, engagement, and experienced meaningfulness on crowdsourcing platforms by conducting surveys among crowd workers. Group identification was measured by the group identification scale of Doosje et al. [14]. Affective social support was measured with a questionnaire about how supported the worker felt by other crowd workers (e.g., "The members of the crowd communities care about me."). Instrumental social support was more focused on useful support from other workers (e.g., "The members of the crowd communities give useful advice on job problems"). Their main results showed that social support fosters a sense of group identification and experienced meaningfulness, contributing positively to crowd workers' subjective engagement. Corresponding to these results, through qualitative interviews with crowd workers, Soliman et al. [83] revealed that community identification is positively related to continuous participation. Thus, a sense of group identification with peers has been found to be an essential asset for motivation [42] and engagement [39,83] in online crowd work.

2.2.2 External forums.
The solitary, online nature of individual crowd work tasks makes it difficult for workers to connect with their peers and foster a sense of group identification. Therefore, crowd workers often connect with peers through external platforms [31,91,95,96]. These external forums help crowd workers to identify with others who do similar work, forming online communities [53]. Online communities are important as they facilitate a shared working experience among crowd workers [85]. While these online communities serve a social goal, many crowd workers mention that their main motivation to engage in online forums is to gain information about how to optimize the quality of their work, which can in turn optimize their earnings [58,85,91,96]. Moreover, the time that crowd workers spend on these forums to gain information to improve their crowdsourcing skills is part of the 'invisible' work of crowd workers [58]. The 'invisible labor' of crowd workers refers to their work outside the tasks they perform, which is typically unpaid and unaccounted for by platforms or task requesters [30,88]. Thus, not all workers are able or willing to spend the time and effort to engage in these forums. Indeed, a study by Yin et al. [96] found that 59.1% of 10,000 workers on MTurk reported using at least one forum, while the other 40.9% reported not being engaged on forums. Such workers cannot benefit from the social and learning opportunities that external forums offer as a community space. Therefore, researchers have suggested that social interactions between crowd workers should be facilitated and integrated into the crowdsourcing platform itself [85,96].

2.2.3 Fostering group identification internally.
Kobayashi et al. [45] used a communication platform and a worker ranking based on the number of completed tasks to foster a sense of community. They found that fostering a sense of community positively relates to continued participation. Using such a platform increases worker visibility, which is considered to induce a sense of community and group identification [8].
We build on such prior works by attempting to foster a sense of community and increase group identification among workers completing task batches individually. To this end, we create a worker community space where workers can share their avatars and task-related feelings on successful task completion. A key difference in our effort is our focus on a lightweight intervention that does not require extensive social engagement, communication, or exchange of additional information among workers (since not all workers can engage in such interactions, and time-consuming methods can affect workers' earnings). A personalized worker avatar contributes to workers' ability to express themselves and form their worker identity within the group of other crowd workers. We explore whether creating such visibility among crowd workers can induce a reflection on unity, causing the workers to relate to each other, thereby developing a sense of belonging [23,84].

Hypotheses
Customizable worker avatars and avatar character selection have been shown to reduce perceived workload in information-finding tasks in conversational microtask crowdsourcing compared to conventional web interfaces without customizable worker avatars [68]. The evolution of the customizable avatars introduces a gaming element that unlocks new editable features of the avatar when the worker completes more tasks (the worker unlocks new levels). Lee et al. [52] used a similar gamification approach, with levels that unlock new features, within a crowdsourcing task that requires workers to label cultural heritage design elements. They found that the use of gamification reduced the perceived workload of the workers. Therefore, we expect the perceived workload to decrease when using evolving and customizable worker avatars.
H1a: Evolving and customizable worker avatars will reduce the perceived workload among workers.
While the study of Qiu et al. [68] did not find any significant effects on intrinsic motivation, another study by Birk et al. [5] did find increased intrinsic motivation due to customizable avatar identification. Moreover, gamification in crowdsourcing tasks often increases motivation [62]. Specifically, prior work found that using levels in crowdsourcing tasks can improve intrinsic motivation [52]. Therefore, we expect that combining avatar customization with gamification (evolving customizable avatars) can increase intrinsic motivation. Adding gamification elements to crowdsourcing tasks can improve worker engagement [22]. We expect that increased intrinsic motivation can lead to improved subjective worker engagement. Thus, we formulate the following hypotheses:
H1b: Evolving and customizable worker avatars will lead to an increased level of intrinsic motivation.
H1c: Evolving and customizable worker avatars will lead to improved subjective worker engagement.
Prior work showed that customizable worker avatars have a positive effect on worker retention [68]. In addition, prior studies show that the willingness to complete more tasks increases as a result of gamification [21,22,52,62]. Interestingly, the results presented by Maddalena et al. [56] suggest this is only the case for workers who favor gamification. Overall, we expect evolving and customizable worker avatars to increase task retention.
H2a: Evolving and customizable worker avatars will lead to increased task retention.
Prior work showed no significant improvement in task accuracy due to worker avatars [68]. Furthermore, while some prior studies suggest that data quality can be improved by gamification [22,62], other studies did not find increased data quality [52,56]. The task execution time might be longer when using evolving avatars, as workers might spend more time interacting with the avatar editor throughout the task.
H2b: Evolving and customizable worker avatars do not affect task accuracy.
H2c: Evolving and customizable worker avatars will lead to a longer task execution time.
Since worker avatars are known to facilitate identification [40,68,89], we expect that workers will identify with their avatars. Sharing and presenting their avatars in a community space with other workers can help them identify themselves as being a part of a group of crowd workers without necessarily revealing other private information. In other words, the visibility of the worker avatars might facilitate a reflection of unity [84]. Similar findings have been seen in studies about group identification and avatar customization in (serious) games [23,86,90]. Furthermore, the option to share their feelings about a task that all workers in the cohort completed can contribute to feeling a connection with other workers [84]. Exposure to similar opinions from others has been shown to induce group identification [64].
H3: Sharing worker avatars and feelings about the task in a community space will facilitate a sense of group identification among crowd workers.
Facilitating group identification allows crowd workers to reflect on the fact that others are completing the same tasks as them. This notion of being part of a group might contribute to an increased intrinsic motivation of workers, which in turn can reduce their perceived workload [50]. Prior studies have found a positive relation between group or community identification and intrinsic motivation [42,83]. As feeling part of a group can be an intrinsic motivator for workers, we expect that facilitating a community space where workers can share their avatars and task-related feelings can induce group identification and increase intrinsic motivation. Prior research found that group identification among crowd workers is positively related to user engagement [39]. Moreover, organizational identification of employees is positively related to work engagement [41]. Therefore, we expect that user engagement will be positively impacted by inducing group identification.
H4a: Creating a sense of group identification by facilitating the sharing of worker avatars and feelings reduces the perceived workload among workers.
H4b: Creating a sense of group identification by facilitating the sharing of worker avatars and feelings will lead to increased intrinsic motivation.
H4c: Creating a sense of group identification by facilitating the sharing of worker avatars and feelings improves subjective worker engagement.
As a result of the improved intrinsic motivation [42,50] and worker engagement [39,41], we expect that facilitating a sense of group identification can increase task retention. Prior work found that community identification aids the continued participation of workers in crowd work [45]. Based on prior work, the prospect of sharing worker avatars and feelings may not affect workers' accuracy. On the other hand, the total task execution time might be longer due to the time spent by workers in the community space.
H5a: Creating a sense of group identification by facilitating the sharing of worker avatars and feelings will lead to increased task retention.
H5b: Creating a sense of group identification by facilitating the sharing of worker avatars and feelings does not affect task accuracy.
H5c: Creating a sense of group identification by facilitating the sharing of worker avatars and feelings will lead to a longer task execution time.

STUDY DESIGN
To address the aforementioned research questions (RQ1, RQ2, RQ3), we conducted a preregistered between-subjects study with five different experimental conditions, considering two different types of tasks. In this section, we describe our overall study design, including our experimental setup, measures, and procedure in detail. Details about our technical implementation and statistical methods can be found in the Appendix, Sections A.1 and A.4, respectively.

Task Design
Prior work has revealed the impact of task types on worker performance and experience-related outcomes [2,25,33]. To account for task type effects and better understand the generalizability of our findings, we consider two different types of tasks: an information finding task and a credibility analysis task. These types of tasks have been shown to be popular in microtask marketplaces and are commonly considered in similar studies [13,27,68]. Inspired by prior work that has shown that conversational crowdsourcing is an effective way to increase user engagement and satisfaction [61,68,72], we presented tasks to workers using a conversational interface. In both tasks, workers can search on Google⁵ or Wikipedia⁶ to answer the question. Workers must complete at least five mandatory tasks, after which they are free to stop whenever they wish.
Information Finding: In this task, workers are asked to find the middle names of famous people by searching the Web. We used a subset of 40 questions from the dataset of Qiu et al. [69], comprising questions that provide the first and last name of a famous person, together with the profession and the active year. The task is considered to be difficult, as the dataset consists of famous people whose names and professions are similar to those of other famous people. To find the correct middle name, workers had to actively search based on the active year that tells these famous people apart. An example of a question can be found in Figure 2.
Credibility Analysis: In this task, workers are asked to read the text of statements posted online and assess their credibility as 'CREDIBLE' or 'NOT CREDIBLE'. To this end, we used the dataset compiled by Robbemond et al. [74]. The dataset consisted of 40 statements that were labeled as credible, somewhat credible, not credible, or somewhat not credible. Each category consisted of 10 statements. To increase difficulty, we combined the somewhat credible and the credible categories, and we combined the somewhat not credible and the not credible categories. This resulted in a final set of 20 credible statements and 20 not credible statements. See Figure 3a for a not credible statement, and Figure 3b for a credible statement that is considered to be more difficult. The statements were ordered alphabetically to randomize the order of credibility.
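The label merging described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the label strings, the placeholder statements, and the helper names `collapse_labels` and `order_alphabetically` are hypothetical and do not come from the dataset of Robbemond et al. [74] or from the study's own code.

```python
from collections import Counter

# Hypothetical label strings; the actual dataset encoding may differ.
BINARY_LABEL = {
    "credible": "CREDIBLE",
    "somewhat credible": "CREDIBLE",
    "somewhat not credible": "NOT CREDIBLE",
    "not credible": "NOT CREDIBLE",
}


def collapse_labels(statements):
    """Map each (text, four_way_label) pair to (text, binary_label)."""
    return [(text, BINARY_LABEL[label]) for text, label in statements]


def order_alphabetically(statements):
    """Order statements by their text, ignoring case."""
    return sorted(statements, key=lambda pair: pair[0].lower())


# Placeholder data: 10 statements per original category, 40 in total.
placeholder = [
    (f"statement {label} {i:02d}", label)
    for label in BINARY_LABEL
    for i in range(10)
]

binary = order_alphabetically(collapse_labels(placeholder))
counts = Counter(label for _, label in binary)
```

With 10 statements in each of the four original categories, the merged set contains 20 statements per binary label, matching the totals reported above.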

Experimental Conditions
To test our hypotheses and address the research questions, we designed the following experimental conditions:
(1) No relatable avatar (Control): This control condition has a standard, non-human, default avatar. We expect no form of identification with this avatar. See Figure 4a for the conversational interface of this condition.
(2) Basic avatar (Basic): In this condition, workers are prompted to edit their avatar using the avatar editor before they can proceed to the tasks (cf. Section 3.2.1 and Figure 5). Workers are only able to customize basic avatar features in the avatar editor. After starting the task, no changes can be made to the customized avatar. Workers can see their personalized avatar when working on the task in the conversational interface (see Figure 4b).
(3) Basic avatar with community space (Basic⊕Comm): This condition is similar to the Basic experimental condition. However, before starting the task, workers are informed that their final avatars will be shared with other crowd workers in the worker community space on task completion. To this end, we created a worker community space supporting different interactions (cf. Section 3.2.2 and Figure 1).
(4) Evolving avatar (Evolving): This condition starts like the Basic condition. However, after every 4 tasks, the worker unlocks a new level that reveals new editable features to further personalize the avatar. In this way, we add a gamification aspect to the avatar customization. Whenever a new level is unlocked, a pop-up notification tells the worker that they have reached a new level and which features are unlocked. The worker is able to move back and forth between the avatar editor and the task to immediately check the newly unlocked features. See Section 3.2.1 for a more detailed description of the avatar editor.
(5) Evolving avatar with community space (Evolving⊕Comm): This condition is similar to the Evolving experimental condition. However, workers are informed at the beginning that on finishing their tasks, their avatars will be shared on a page with all other workers' avatars. They are informed that they can express their feelings about the task using the facial gestures of their avatar and by stating how the task made them feel. By creating a space to provide visibility and expression, we aim to create a sense of group identification among the workers working on the task.
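The five conditions and the level-unlock rule can be summarized in a minimal sketch. The field names, the level numbering (level 0 as the starting feature set), and the absence of a level cap are assumptions made for illustration; only the condition names and the "new level every 4 tasks" rule come from the description above.

```python
from dataclasses import dataclass

TASKS_PER_LEVEL = 4  # from the text: a new level is unlocked every 4 tasks


@dataclass(frozen=True)
class Condition:
    name: str
    has_editor: bool       # avatar editor shown before the task
    evolving: bool         # new editable features unlock during the task
    community_space: bool  # avatar is shared in the community space afterwards


CONDITIONS = {
    c.name: c
    for c in [
        Condition("Control", False, False, False),
        Condition("Basic", True, False, False),
        Condition("Basic⊕Comm", True, False, True),
        Condition("Evolving", True, True, False),
        Condition("Evolving⊕Comm", True, True, True),
    ]
}


def avatar_level(condition: Condition, completed_tasks: int) -> int:
    """Level reached so far; level 0 is the starting feature set (assumed numbering)."""
    if not condition.evolving:
        return 0
    return completed_tasks // TASKS_PER_LEVEL


def unlocks_new_level(condition: Condition, completed_tasks: int) -> bool:
    """True when the task just completed should trigger the level-up pop-up."""
    return (
        condition.evolving
        and completed_tasks > 0
        and completed_tasks % TASKS_PER_LEVEL == 0
    )
```

Under these assumptions, a worker in an evolving condition would see the pop-up after tasks 4, 8, 12, and so on, while workers in the Control and Basic conditions never leave level 0.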

Avatar Editor.
The avatar editor is used by workers to customize their avatar prior to the task (Basic, Basic⊕Comm, Evolving, and Evolving⊕Comm), during the task (Evolving and Evolving⊕Comm), and after the task (Basic⊕Comm, Evolving, and Evolving⊕Comm). At the start, the avatar editor sets the avatar's eye type, mouth type, and eyebrow type to default. Furthermore, the initial hair/top type is set to no hair, and the skin color is randomly chosen. An example of the initial phase of the avatar editor can be seen in Figure 5. For the conditions including the worker community space (Basic⊕Comm and Evolving⊕Comm), an extra line of text is added to the avatar editor to notify and remind workers that their avatar will be shared with other workers on the worker community space. An overview of the initial editable features (Basic and Basic⊕Comm) and those that can be unlocked with new levels (Evolving and Evolving⊕Comm) can be found in Table 2 in the appendix, along with further details of the technical implementation.

Worker Community Space.
Workers in the community conditions (Basic⊕Comm and Evolving⊕Comm) get the opportunity to share their customized avatars and feelings about the task in the worker community space. Before entering the community space upon successful task completion, workers are given a final chance to edit and update their avatars. Workers are asked to complete a sentence with the prompt 'I am feeling ...' by choosing a mood from the Pick-A-Mood (PAM) scale [11] (see Figure 13 in Appendix A.3), which is displayed alongside their avatar on the community space (as shown in Figure 1). PAM is a character-based pictorial scale for reporting moods, and it has been shown to be particularly useful in capturing moods in a crowdsourcing context [70,93,97].
In addition, workers have the agency to choose from a variety of facial expressions to share their feelings. The moods from which workers were able to choose pertain to pleasant (i.e., one of excited, cheerful, relaxed, calm), unpleasant (i.e., one of tense, irritated, bored, sad), and the neutral mood. We created the worker community space with the aim of fostering group identification. In the worker community space, workers see a random subset of 8 other workers' avatars and how they felt about the task. Their own avatar is placed in the middle to induce a sense of being part of the group of avatars displayed on the screen. We have implemented several interactive elements in the worker community space. Workers can use a REFRESH button to change the displayed subset of worker avatars at random, and the LOAD MORE button to display all other workers. To further increase a sense of group identification, a SIMILAR MOOD button was created to filter avatars of workers who reported a similar feeling. Workers in the Evolving⊕Comm condition were also able to order avatars based on their evolution using the ORDER ON LEVEL button. The worker community space only shows avatars of workers who were in the same condition and successfully completed their tasks. Furthermore, to prevent a cold start problem with a blank community space, we added two avatars for each mood per condition to the community space (this resulted in a start with 18 avatars per condition). This design choice was made to ensure that workers could always see at least a few other avatars in the community space even when filtering on mood, with an aim to positively impact the sense of group identification among workers.
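The behaviour of the community space described above can be sketched as follows. This is an illustrative sketch and not the implementation used in the study: the dictionary layout, the function names, and the reading of "similar feeling" as "exactly the same mood" are assumptions. The nine moods, the two seed avatars per mood, and the subset size of 8 are taken from the text.

```python
import random

PLEASANT = ["excited", "cheerful", "relaxed", "calm"]
UNPLEASANT = ["tense", "irritated", "bored", "sad"]
MOODS = PLEASANT + UNPLEASANT + ["neutral"]
SEEDS_PER_MOOD = 2  # two seed avatars per mood and condition
SUBSET_SIZE = 8     # avatars shown around the worker's own avatar


def seed_avatars(condition):
    """Placeholder avatars that prevent an empty community space (cold start)."""
    return [
        {"worker": f"seed-{mood}-{i}", "condition": condition, "mood": mood, "level": 0}
        for mood in MOODS
        for i in range(SEEDS_PER_MOOD)
    ]


def others_in_condition(avatars, me):
    """Avatars of other workers who were in the same condition as `me`."""
    return [
        a for a in avatars
        if a["condition"] == me["condition"] and a["worker"] != me["worker"]
    ]


def visible_subset(avatars, me, rng, size=SUBSET_SIZE):
    """Random subset of other workers' avatars (what REFRESH would redraw)."""
    others = others_in_condition(avatars, me)
    return rng.sample(others, min(size, len(others)))


def similar_mood(avatars, me):
    """Other workers who reported the same mood (assumed meaning of SIMILAR MOOD)."""
    return [a for a in others_in_condition(avatars, me) if a["mood"] == me["mood"]]


def order_on_level(avatars):
    """Avatars sorted from most to least evolved (ORDER ON LEVEL)."""
    return sorted(avatars, key=lambda a: a["level"], reverse=True)
```

With nine moods and two seeds each, the sketch starts every condition with 18 avatars, so a mood filter returns at least two avatars even for the first worker who finishes.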

Measures
We used previously validated questionnaires to measure worker experience (i.e., their perceived workload, intrinsic motivation, and subjective engagement) and group identification. When applicable, the questions were slightly altered to fit the context of our task (e.g., 'I think I did pretty well at this activity, compared to other students.' was changed to 'I think I did pretty well at this task, compared to other workers'). Furthermore, we measured worker retention, accuracy, and total task execution time as the task-related outcomes.
Perceived Workload. To measure the workers' perceived workload, we used the NASA-TLX [35] with a 7-point Likert scale. This questionnaire assesses workload on six different single-question dimensions. The dimensions of mental demand and physical demand describe how mentally or physically demanding the task was. Temporal demand describes how hurried or rushed the pace of the task was. The performance dimension describes how successful the worker was in accomplishing the task and the effort dimension describes how hard the worker had to work to accomplish this task. Lastly, the frustration dimension describes how insecure, discouraged, irritated, stressed, and/or annoyed the worker was when doing the task. To study the effect of evolving avatars and fostering group identification among crowd workers, we assessed the average of all dimensions (performance reversed) and each dimension separately. A high average score on the perceived workload implies that the workers perceived a high task workload.
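As a worked illustration of the overall workload score, the sketch below averages the six ratings after reverse-scoring the performance item. It assumes that the raw performance item is coded so that a higher rating means a more successful worker, and that reversal on a 7-point scale maps a rating x to 8 - x; the coding direction and the dictionary keys are assumptions, not details confirmed by the text.

```python
from statistics import mean

SCALE_MIN, SCALE_MAX = 1, 7
DIMENSIONS = ("mental", "physical", "temporal", "performance", "effort", "frustration")


def tlx_overall(ratings):
    """Mean of the six ratings, with the performance item reverse-scored."""
    missing = [d for d in DIMENSIONS if d not in ratings]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    scored = dict(ratings)
    # Assumed reversal: 1 <-> 7, 2 <-> 6, ... so that high always means high workload.
    scored["performance"] = SCALE_MIN + SCALE_MAX - ratings["performance"]
    return mean(scored[d] for d in DIMENSIONS)
```

For example, a worker who rates every dimension at the scale midpoint of 4 obtains an overall score of 4, because the midpoint is unchanged by the reversal.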
Intrinsic Motivation. To measure the intrinsic motivation of the workers, we used three dimensions of the Intrinsic Motivation Inventory (IMI): Interest/Enjoyment (INT-ENJ), Perceived Competence (PER-COMP), and Effort/Importance (EFF-IMP) [59]. The questions were asked with a 7-point Likert scale, ranging from 1: Not at all true to 7: Very true. The interest/enjoyment sub-scale is considered to measure intrinsic motivation directly and consists of seven questions. The perceived competence sub-scale describes the subjective performance of the worker based on the worker's own judgment (six questions). Lastly, the effort/importance sub-scale contains five questions that address how much energy and effort the worker put into the task. Similar to the perceived workload, we analyze the average of each sub-scale separately and the total score over all sub-scales. A high overall average score means that the worker has a strong intrinsic motivation to work on the task.
Subjective Engagement. To measure subjective engagement, we used the short form of the User Engagement Scale (UES-SF) with a 5-point Likert scale [65]. This scale consists of multiple subdimensions with three questions each: Focused Attention (FA; how focused was the worker on performing the task?), Perceived Usability (PU; how difficult was it to interact with the task?), Aesthetic Appeal (AE; how attractive is the interface?), and Reward (RW; how rewarding was the task?). The average score for each subdimension and the total average score are used for our analysis. A high overall subjective engagement score means that the worker was highly engaged in the task.
Group Identification. To measure the extent to which workers identify themselves as crowd workers, we used the Group Identification Measure (GIM) [14]. The group identification measure consists of four questions with a 7-point Likert scale (1: Not at all to 7: Extremely). The questions cover the cognitive, evaluative, and affective aspects of identification. The mean score over all four questions was measured. A high score implies a strong group identification.
In addition, to gain further insights into whether and why workers feel connected to other workers, we used a 7-point Likert scale question asking workers: 'To what extent do you feel connected to the other crowd workers that participated in this study?', followed by an open-ended question asking why they did or did not feel connected to the other workers. Two coders used the responses to these two questions to code the open-ended answers into categories.
Worker Retention. To measure the objective engagement of workers in the task, we used worker retention. Worker retention is measured as the number of completed questions within one task batch. For instance, a worker retention of 30 for the credibility task means that a worker classified 5 mandatory and 25 optional statements as credible or not credible. Note that there are 5 mandatory tasks and 35 additional tasks available within the task batch in each of the task types (i.e., information finding and credibility analysis).
Worker Accuracy. For both tasks, worker accuracy is calculated as the percentage of tasks correctly completed. For the information finding task, a task is correctly completed if a worker's response contains the middle name of the famous person. For the credibility analysis task, a worker's response is considered to be correct if the right button (i.e., Credible or Not credible) is pressed. Workers have the option to edit their responses to each task before their final task submission.
Task Execution Time. The task execution time is based on the total time that workers spend within the task interface (including the avatar editor, worker community space, and conversational interface). It is measured from the moment the worker starts the task in the conversational interface (Control) or enters the avatar editor (all remaining conditions) up to the moment the worker is redirected to the post-task questionnaires.
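The retention and accuracy definitions above translate into a few lines of code. The sketch below is illustrative only: the function names are hypothetical, and the case-insensitive substring match for the information finding task is an assumption about how "contains the middle name" was checked.

```python
MANDATORY_TASKS = 5
OPTIONAL_TASKS = 35


def retention(optional_completed):
    """Number of completed questions in one batch (mandatory plus optional)."""
    if not 0 <= optional_completed <= OPTIONAL_TASKS:
        raise ValueError("optional_completed must be between 0 and 35")
    return MANDATORY_TASKS + optional_completed


def info_finding_correct(response, middle_name):
    """Correct if the response contains the middle name (case-insensitive, assumed)."""
    return middle_name.casefold() in response.casefold()


def credibility_correct(pressed, truth):
    """Correct if the pressed button matches the reference label."""
    return pressed == truth


def accuracy(outcomes):
    """Percentage of completed tasks that were answered correctly."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("no completed tasks")
    return 100 * sum(outcomes) / len(outcomes)
```

A worker who completes 25 optional tasks therefore has a retention of 30, matching the example in the text, and the maximum possible retention is 40.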

Participant Recruitment and Procedure
Workers in our study were recruited from the Prolific crowdsourcing platform. Our study was approved by the 'Human Research Ethics Committee' of Delft University of Technology. Participation was restricted to workers with adequate English proficiency to ensure that all workers understood the task and the questionnaires. Furthermore, workers needed to be at least 18 years old. To ensure the quality of the data, we only allowed workers with an approval rate of at least 95% to participate. Workers were only allowed to participate once in our study. Based on a G*Power analysis [20], the required sample size was found to be 610 workers, i.e., 305 workers per task type (one-way ANOVA, f = 0.2, α = 0.05, (1 − β) = 0.8). To account for potential exclusions due to data quality, we increased this number by roughly 10% to a total sample size of 680. Therefore, we recruited 340 workers per task, and 68 workers per condition within each task. Workers were paid a fair hourly wage of 9 GBP, which is above the minimum hourly wage suggested by the Prolific platform and rated as a 'good' hourly rate on the dashboard.
Procedure. On beginning the task, workers from Prolific are redirected to a Qualtrics survey containing the informed consent. After signing the informed consent, the workers are randomly assigned to a condition and task. Subsequently, workers are redirected to the task hosted on a server. After finishing the task, the workers are directed to the post-task Qualtrics survey. Here, workers complete a set of questionnaires (cf. Section 3.3) before being redirected to Prolific on successful completion.
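The recruitment numbers follow from simple arithmetic on the power analysis result. The short sketch below only reproduces that arithmetic from the figures reported in the text; it does not recompute the power analysis itself, and the variable names are illustrative.

```python
N_PER_TASK_REQUIRED = 305  # per task type, as reported from the power analysis
TASK_TYPES = 2             # information finding and credibility analysis
N_CONDITIONS = 5
RECRUITED_TOTAL = 680

required_total = N_PER_TASK_REQUIRED * TASK_TYPES           # minimum sample size
buffer_share = RECRUITED_TOTAL / required_total - 1         # share added for exclusions
per_task = RECRUITED_TOTAL // TASK_TYPES                    # workers per task type
per_cell = per_task // N_CONDITIONS                         # workers per condition and task

print(required_total, per_task, per_cell, f"{buffer_share:.1%}")
```

The buffer works out to slightly more than one tenth of the required sample, which leaves room for a small number of exclusions without dropping below the required size per task type.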

RESULTS AND ANALYSIS

4.1 Demographic Distribution
A total of 680 workers participated in our experiment, equally divided across both task types. One worker was excluded due to technical problems, and three workers were excluded due to invalid answers (all four from the information finding task). This resulted in a final sample of 676 workers (mean age = 33.83, SD = 11.23). Of those workers, 61.5% identified as male (416 workers), 37.3% as female (252 workers), 1% as non-binary (7 workers), and 0.1% as other (1 worker). For the information finding task, 66 workers participated in the Control condition, 67 in the Basic condition, 68 in the Basic⊕Comm condition, 67 in the Evolving condition, and 68 in the Evolving⊕Comm condition. For the credibility analysis task, this was 68, 67, 68, 69, and 68 respectively. Descriptive statistics related to the use of the avatar editor can be found in the Appendix, Section B.1. Based on Shapiro-Wilk tests for normality, no dependent measure was normally distributed across all conditions (p < .05). Therefore, we employed Kruskal-Wallis tests to verify our hypotheses.
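For readers who want to see what the Kruskal-Wallis test computes, the following is a minimal standard-library sketch of the H statistic with the usual tie correction. It is not the analysis code used in the study. The p-value helper uses a closed form of the chi-square upper tail that only holds for even degrees of freedom, which covers the five-condition comparisons here (df = 4); all function names are illustrative.

```python
import math
from itertools import groupby


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    for _, run in groupby(order, key=lambda i: values[i]):
        idx = list(run)
        mean_rank = pos + (len(idx) + 1) / 2
        for i in idx:
            ranks[i] = mean_rank
        pos += len(idx)
    return ranks


def kruskal_wallis(groups):
    """Return (H, df) for a list of samples; undefined if all values are equal."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    tie_sizes = [len(list(run)) for _, run in groupby(sorted(pooled))]
    correction = 1 - sum(t ** 3 - t for t in tie_sizes) / (n ** 3 - n)
    return h / correction, len(groups) - 1


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive, even df")
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))
```

Calling `chi2_sf_even_df(h, df)` on the output of `kruskal_wallis` gives the asymptotic p-value. Because Likert-type measures produce many ties, the tie correction matters in practice.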

Perceived Workload
A non-parametric Kruskal-Wallis test was performed to investigate whether the overall TLX score and its different dimensions differ significantly across the conditions. For both tasks, the overall TLX score and the TLX dimensions did not differ across the different conditions (α = 0.05). Thus, no significant effect was found of evolving avatars and the worker community space on workers' perceived workload.
Summary: H1a) We did not find any evidence for a reduced perceived workload as an effect of evolving and customizable worker avatars. H4a) We did not find any effect of the worker community space on workers' perceived workload. Therefore, we reject both hypotheses.

Intrinsic Motivation
A non-parametric Kruskal-Wallis test was performed to investigate whether the overall IMI score and its dimensions differ significantly across the conditions. For both tasks, there were no significant differences found between the conditions for the overall IMI score and its subdimensions (α = 0.05). Thus, no significant effect was found of evolving avatars and a worker community space on workers' intrinsic motivation.
Summary: H1b) We found no evidence of an increased intrinsic motivation as an effect of evolving and customizable worker avatars. H4b) Our results found no effect of a worker community space on workers' intrinsic motivation. Therefore, we reject both hypotheses.

Subjective Worker Engagement
A non-parametric Kruskal-Wallis test was performed to investigate whether the overall UES score and its dimensions differ significantly across the experimental conditions (H1c and H4c). For the credibility task, we found a significant difference between conditions for the aesthetic appeal (AE) dimension (df = 4, H = 9.739, p = .045, α = 0.05). A Dunn test was performed with a Bonferroni correction for the p-value to test which conditions differ significantly. Workers in the credibility analysis task with evolving avatars had a significantly higher aesthetic appeal score compared to workers without an avatar (Z = −3.029, p = .025, α = 0.05; cf. Figure 6b). In contrast, there was no significant difference in aesthetic appeal for the information finding task (cf. Figure 6a).
Summary: H1c) Despite no significant differences found for the overall subjective engagement, workers with an evolving and customizable avatar experienced significantly greater aesthetic appeal within the credibility task. For the information finding task, no significant differences were found. Therefore, we found partial support for hypothesis 1c. H4c) We found no evidence of an effect of a worker community space on workers' subjective engagement. Therefore, we reject hypothesis 4c.
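The relation between a Dunn test statistic and its Bonferroni-corrected p-value can be sketched with the standard library. The sketch assumes that the correction covers all pairwise comparisons between the five conditions (ten comparisons) and that the test is two-sided; both are assumptions, and the function names are illustrative.

```python
import math


def two_sided_p(z):
    """Two-sided p-value of a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def bonferroni_adjusted_p(z, n_groups):
    """Bonferroni-corrected p-value over all pairwise group comparisons."""
    comparisons = n_groups * (n_groups - 1) // 2
    return min(1.0, comparisons * two_sided_p(z))


# Example: a statistic of -3.029 with five groups (ten pairwise comparisons).
print(round(bonferroni_adjusted_p(-3.029, 5), 3))
```

Under these assumptions, a statistic of −3.029 maps to an adjusted p-value of roughly .025, in line with the value reported above for the aesthetic appeal comparison.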

Worker Retention
A non-parametric Kruskal-Wallis test was performed to investigate whether the retention differs significantly across the conditions. The Kruskal-Wallis test showed no significant differences between the conditions for the information finding task (H = 8.657, df = 4, p = .070, α = 0.05; see Figure 7a). For the credibility analysis task, the Kruskal-Wallis test showed significant differences between the conditions (H = 13.848, df = 4, p = .008, α = 0.05; see Figure 7b). Based on the Dunn test with a Bonferroni corrected p-value, workers with an evolving avatar had significantly higher retention than workers without an avatar (Z = −3.121, p = .018, α = 0.05). Interestingly, workers with an evolving avatar and the worker community space did not have significantly higher worker retention compared to workers without an avatar (Z = −2.684, p = .073). To further understand our results and their effect sizes, Figure 8 shows the estimation plots for worker retention [36]. The Control condition is compared to the other conditions. Based on these plots, we see larger effect sizes for the Evolving condition of the information finding task, and the Basic, Evolving, and Evolving⊕Comm conditions for the credibility analysis task.
Summary: H2a) The results show that customizable and evolving worker avatars can significantly improve worker retention for the credibility analysis task. Furthermore, the estimation plots show a positive effect of evolving and customizable worker avatars across both tasks. Therefore, we found partial support for hypothesis 2a. H5a) We found no effect of the worker community space on worker retention. Therefore, we reject hypothesis 5a.

Worker Accuracy
A non-parametric Kruskal-Wallis test was performed to investigate whether the accuracy differs significantly across the conditions. There were no significant differences found between the conditions for the accuracy of the information finding task (H = 1.287, …).
Summary: H2b) There is no effect found on worker accuracy as a result of evolving and customizable worker avatars. H5b) Likewise, the worker community space does not impact the worker's accuracy. Therefore, we accept both our hypotheses.

Task Execution Time
For the analysis of task execution time, we removed outliers outside the whiskers of the boxplot (Q3 + 1.5 × IQR; Q1 − 1.5 × IQR) for both tasks, since these long task execution times could be an artifact of different external factors such as workers completing multiple tasks simultaneously [29], using different working strategies [33], a function of their work environments [24], and so forth. This resulted in 18 outliers being removed from the information finding task across all experimental conditions, and 12 outliers being removed from the credibility analysis task. For the information finding task, this resulted in 64 workers in the Control condition, 67 workers in Basic, 62 workers in Basic⊕Comm, 62 workers in Evolving, and 63 workers in Evolving⊕Comm. For the credibility task, this was 65, 65, 66, 67, and 65 respectively.
A Kruskal-Wallis test was performed to investigate whether there are significant differences in task duration across the conditions. The Kruskal-Wallis test revealed significant differences between the conditions for the information finding task (H = 15.84, df = 4, p = 0.003; cf. Figure 9a) and the credibility analysis task (H = 36.977, df = 4, p < .001; cf. Figure 9b). For the information finding task, the Dunn test with a Bonferroni corrected p-value showed that workers in the Evolving condition had a significantly longer task execution time than the Control condition (Z = −3.298, p = .01, α = 0.05) and the Basic⊕Comm condition (Z = −3.143, p = .017, α = 0.05). For the credibility analysis task, the Dunn test …
Summary: H2c) For both tasks, the task execution time is significantly longer for workers with an evolving and customizable worker avatar. Therefore, we accept hypothesis 2c. H5c) We found no significant effect of the worker community space on task execution time. Therefore, we reject hypothesis 5c.
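The boxplot-whisker rule for outliers can be illustrated with a short sketch. Quartile definitions differ between statistical packages, so the choice of the "inclusive" method of Python's `statistics.quantiles` is an assumption and may not match the software used in the study; the function names are illustrative.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Lower and upper whisker limits: Q1 - k * IQR and Q3 + k * IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(values, k=1.5):
    """Keep only the values that lie within the whisker limits."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if low <= v <= high]
```

In the analysis described above, such a filter would be applied to the task execution times of each task before the Kruskal-Wallis test is run.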

Group Identification
A non-parametric Kruskal-Wallis test was performed to investigate whether the GIM score and the connectedness question differ significantly across the conditions (H3). There were no significant differences found across conditions for the GIM score and the connectedness question (α = 0.05).
To explore why workers did or did not feel connected to the other crowd workers who worked on the same tasks and whether this was related to the worker community space, the answers to the open-ended question were manually coded into categories for workers in a condition that included the worker community space. Furthermore, workers are classified based on their responses on the 7-point Likert scale as either not feeling connected (score < 4) or feeling connected (score > 4). Open coding was used to define different categories based on the open-ended answers of both the credibility task and the information finding task, similar to the methods of a conventional qualitative content analysis [37]. Some responses could be categorized into two different categories. The open-ended answers from both tasks were categorized using these created categories. Subsequently, a second coder used the same defined categories to categorize roughly half of the data, consisting of the open-ended answers from the credibility task (n = 136). A substantial inter-annotator agreement was found between the two coders, as measured with Cohen's Kappa (κ = 0.744) [51]. An overview of the description of the categories and the results can be found in the Appendix, Section B.3.
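Cohen's Kappa compares the observed agreement between two coders with the agreement expected by chance. The following is a minimal sketch that assumes each response receives exactly one category per coder, which simplifies the coding scheme described above (where some responses fell into two categories); the function name is illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who each assign one label per item."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both coders pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / n ** 2
    if expected == 1:
        raise ValueError("kappa is undefined when both coders use a single label")
    return (observed - expected) / (1 - expected)
```

Values between 0.61 and 0.80 are commonly interpreted as substantial agreement, which is the range the reported value falls into.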
Information finding tasks. Of all the workers who worked on the information finding task and reported not feeling connected to the other workers (n = 63), most (65%, n = 41) did not feel connected because of a lack of direct interaction with other workers. Some workers (13%, n = 8) did not believe that the workers in the worker community space were indeed other workers. A smaller group of workers (6%, n = 4) did not feel connected because of the feelings shown in the worker community space. Of the workers who did feel connected (n = 43), the largest groups felt connected because they shared a similar goal (28%, n = 12) or because of the feelings on the worker community space (23%, n = 10). A smaller fraction of the workers (9%, n = 4) felt connected due to the avatars in the worker community space.
Credibility analysis tasks. Of all workers from the credibility analysis task who did not feel connected to the other workers (n = 63), most (76%, n = 48) did not feel connected because there was a lack of interaction with the other workers. They felt like they were completing the tasks on their own. A smaller fraction of the workers did not feel connected because other workers mentioned they felt differently about the task (6%, n = 4), or because the avatar was too basic an instrument to make them feel connected to other workers (6%, n = 4). The largest group of workers who felt connected (n = 50) did so because they all shared the same goal when working on the task (36%, n = 18). Furthermore, some workers (20%, n = 10) felt connected because they saw other workers reporting the same feelings about the task. Of the workers who did feel connected, a few also mentioned a lack of interaction between them and the other workers (14%, n = 7).
Summary: H3) Our findings revealed that there was no significant effect of the worker community space, where workers share their avatar and feelings about the task, on either self-identification as a crowd worker or on how much they feel connected to other workers that worked on the task. Therefore, we reject our hypothesis.

Exploratory Analysis - Group Identification
We did not find an increased sense of group identification for the conditions containing the worker community space (H3). With an aim to further understand group identification in our study, we explored the differences between workers who reported different levels of group identification across all conditions. To do this, we divided the workers into three groups based on their reported GIM scores: low (1 ≤ GIM ≤ 3.5), mid (3.5 < GIM ≤ 4.5), and high (4.5 < GIM ≤ 7). For the information finding task, 104 workers were found to be in the low group, 102 workers in the mid group, and 130 workers in the high group. For the credibility analysis task, 112 workers were in the low group, 93 in the mid group, and 135 in the high group.
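The grouping rule can be written down directly from the thresholds above. In this sketch the GIM score is the mean of the four 7-point items, as described in the Measures section; the function names are illustrative.

```python
from statistics import mean


def gim_score(items):
    """Mean of the four 7-point Group Identification Measure items."""
    if len(items) != 4 or any(not 1 <= x <= 7 for x in items):
        raise ValueError("expected four ratings between 1 and 7")
    return mean(items)


def gim_group(score):
    """Map a GIM score to the low, mid, or high identification group."""
    if not 1 <= score <= 7:
        raise ValueError("GIM score must lie between 1 and 7")
    if score <= 3.5:
        return "low"
    if score <= 4.5:
        return "mid"
    return "high"
```

Because the score is a mean of four integer ratings, it moves in steps of 0.25, so the mid group covers the scores 3.75, 4, 4.25, and 4.5.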
To analyze how the task duration (i.e., the execution time) varied between these groups, outliers were removed from both tasks. For the information finding task, 27 outliers were removed in a similar way as described in Section 4.7, resulting in 125 workers in the high GIM group, 91 workers in the mid GIM group, and 93 workers in the low GIM group. For the credibility task, 18 workers were removed, resulting in 123 workers in the high GIM group, 91 workers in the mid GIM group, and 108 workers in the low GIM group.
CHI '24, May 11-16, 2024, Honolulu, HI, USA
4.9.1 Differences Across GIM Groups: Worker Experiences. Similar to the experimental conditions, all measurements had at least one group that did not have a normal distribution based on the Shapiro-Wilk test (p < .05). Therefore, we performed Kruskal-Wallis tests to investigate the differences in task-related outcomes and worker experience measurements between the different GIM groups. The results of the Kruskal-Wallis tests with all our dependent measurements can be found in Table 1. For the information finding task, we found significant differences between workers with different GIM levels for worker retention, task duration, overall TLX score (and the dimensions of mental demand, physical demand, effort, and frustration), overall IMI score (across all dimensions), and the UES score (across all dimensions). For the credibility task, we found significant differences in the accuracy, overall TLX score (the dimensions of mental demand, physical demand, and effort), overall IMI score (across all dimensions), and the overall UES score (and the dimensions of FA, AE, and RW).
The results of the Dunn test for the worker experience measures, based on the Bonferroni corrected p-values, are visualized in Figure 10 (metrics for all tests can be found in the appendix, Table 6 and Table 7). For the information finding task, the workers in the high GIM group (Z = 4.708, p < .001) and the mid GIM group (Z = −3.26, p = .003) had a significantly lower TLX score than the low GIM group. For the credibility analysis task, the high GIM group had a significantly higher TLX score than the low GIM group (Z = 3.64, p = .001). For both tasks, workers in the high GIM …
Note that for the TLX measurements, a low score for the subdimension performance indicates a high perceived performance.
Summary: Workers who strongly identify themselves as a crowd worker (i.e., report high GIM scores) experience a significantly greater perceived workload but also greater intrinsic motivation and subjective engagement compared to workers who do not identify themselves with other crowd workers.
4.9.2 Differences Across GIM Groups: Task-related Outcomes. The Dunn test with Bonferroni correction showed that workers in the high GIM group had significantly higher retention than workers in the low GIM group for the information finding task (Z = 2.643, p = .025; see Figure 11a). Furthermore, the task duration of the high GIM group was significantly longer than the task duration of the low GIM group (Z = 4.162, p < .001) and the mid GIM group (Z = 3.117, p = .005) for the information finding task (see Figure 11b). For the credibility task, the accuracy of the high GIM group was significantly lower than that of the mid GIM group (Z = −2.733, p = .019; see Figure 12).
Summary: i) In the information finding task, workers who strongly identified as a crowd worker showed greater worker retention and task execution time. ii) In the credibility analysis task, workers who strongly identified as a crowd worker (high group) showed lower accuracy than workers who identified moderately as a crowd worker (mid group).

Exploratory Analysis - Task Differences
Following our results, which revealed differences between the credibility task and the information finding task, an exploratory analysis was carried out to further investigate how these two types of tasks were perceived differently by workers (see Figure 14 in the Appendix). Based on Wilcoxon rank tests, we found that the credibility analysis task had a significantly lower (p = .018) perceived workload compared to the information finding task, caused by a lower level of frustration (p < .001) and temporal demand (p < .001). Furthermore, the credibility analysis task scored higher in intrinsic motivation (p = .004), caused by greater interest and enjoyment (p < .001). In line with this, user engagement was greater for the credibility analysis task (p < .001), caused by greater perceived usability (p < .001), aesthetic appeal (p < .001), and reward (p < .001).
Summary: Workers in the information finding task perceived a higher workload, lower intrinsic motivation, and lower subjective engagement compared to workers in the credibility analysis task.

Key Findings
5.1.1 Evolving and Customizable Avatars. The aim of our first research question (RQ1) was to investigate the effect of evolving and customizable worker avatars on worker experience and task-related outcomes. While we did not find any significant impact on the perceived workload, intrinsic motivation, and overall subjective engagement, our results indicate that evolving and customizable worker avatars can positively impact worker retention without decreasing accuracy. This finding is in line with prior research on the effect of avatar customization in crowdsourcing [68] and gamification in crowdsourcing [21,22,52,62]. As expected, the increase in worker retention, together with some extra time that workers use in customizing their avatars, led to a significantly increased total task execution time.
Interestingly, the increased worker retention, which can be considered an objective measurement of engagement, is not accompanied by a significant increase in subjective engagement. Only one dimension of subjective engagement, aesthetic appeal (the attractiveness of the interface), was perceived as being significantly higher for workers with an evolving and customizable avatar within the credibility analysis task. This suggests a potentially orthogonal relationship between objective worker retention and subjective worker engagement.

Group Identification and the Worker Community Space. We aimed to investigate whether we could foster a sense of group identification among crowd workers by providing a worker community space where workers could share their personalized worker avatar and how the tasks made them feel (RQ2). We proposed this as a lightweight and non-intrusive method of sharing individual information and task-related impressions to promote group identification. We expected that workers would identify with their avatar [5,63,89] and that seeing their avatar among the other worker avatars would induce group identification [23,84]. Our results suggest that this does not induce a statistically significant sense of group identification among the crowd workers using the worker community space. As mentioned in Section 4.8, the workers who did not feel connected to other workers reported a lack of interaction as the main reason. Therefore, we suggest that future work incorporates direct interaction between the crowd workers in a community space, which also resonates with prior findings related to personalized avatars and group identification in online video games [23,86,90]. The workers who did feel connected to other workers predominantly mentioned that sharing a goal and/or seeing the feelings of other workers on the community page made them feel connected to the other workers. The latter reason corresponds to prior work about how sharing feelings can make people feel more connected [84], and how exposure to similar opinions can induce group identification [64]. However, as we did not find any significant differences in group identification and connectedness between workers in the experimental conditions with and without a community space, we expect that feeling connected and identifying with other crowd workers in our study is more likely caused by existing individual differences between the workers. Our findings suggest that workers who identify themselves as crowd workers find more meaning in the worker community space.

Exploratory
Findings on Group Identification.Our third research question (RQ3) aimed to answer how a sense of group identifcation, induced by the worker community space, can afect worker experience and task-related outcomes.Although we did not fnd signifcant diferences in the level of group identifcation across our experimental conditions, results from our exploratory analysis suggest that workers who strongly identify as being crowd workers experience greater intrinsic motivation and subjective engagement, corroborating prior work on group identifcation being related to intrinsic motivation [42,83] and subjective engagement [39,41].An unexpected result is a greater perceived workload for workers who strongly identify as a crowd worker, compared to workers who do not (strongly) identify as crowd workers.A potential explanation for this could be that workers who identify themselves as crowd workers consider doing the work as an essential part of their lives and draw more meaning out of their work [97].It is likely that those who strongly identify themselves as crowd workers also rely on crowd work for their primary livelihood (or a signifcant portion of their livelihood).While these workers may have greater intrinsic motivation and feel more engaged to participate in crowdsourcing tasks, their perceived workload might also be higher as they are more motivated to perform well.More research is necessary to further explore how group identifcation among crowd workers relates to their perceived workload, intrinsic motivation, and subjective engagement, perhaps focusing on crowd workers who spend relatively more time working on crowdsourcing tasks.
Interestingly, on exploring the relationship between group identification and task-related outcomes, we found some differences between the information finding task and the credibility analysis task. In the information finding task, we found increased worker retention and total task execution time for workers who strongly identified as crowd workers. This finding is in line with the increased level of intrinsic motivation and subjective engagement of workers who strongly identified as crowd workers and prior work on community identification and continued participation in crowdsourcing tasks [45]. However, workers in the credibility analysis task who identified strongly as crowd workers did not exhibit increased worker retention and task duration but exhibited a decrease in accuracy compared to workers who identified slightly as crowd workers. This suggests a potential task type-related effect, which has also been demonstrated in prior research revealing the distinct impact of different task types in crowdsourcing marketplaces [25,68,94].

Exploratory Findings on Task Differences. Our results indicate differences in the impact of evolving and customizable avatars and group identification between the information finding and credibility analysis tasks. Workers in the credibility analysis task show significantly greater worker retention due to evolving and customizable worker avatars. The results of the information finding task do not show a significant effect, but they point in the same positive direction for worker retention (see section 4.5). A similar effect is seen for the workers in the experimental conditions with evolving avatars and the worker community space. We found that workers in the credibility analysis task reported a significantly higher perception of aesthetic appeal, which was not the case for the information finding task. Our exploratory findings for the different GIM levels also revealed differences in the task-related outcomes between the two task types.
These differences in our findings across the tasks suggest that there might be an important role for task features that can either mitigate or amplify the impact of evolving avatar customization or group identification. Based on prior work, some task features that could have influenced this effect may be the task complexity, enjoyment, and/or the effort to come up with an answer [94]. We carried out an exploratory analysis to understand potential differences in worker perceptions of the credibility analysis and the information finding task. This analysis revealed that the credibility analysis task was perceived as less frustrating, less hurried/rushed, inducing greater interest and enjoyment, being less difficult to interact with, having a more attractive interface, and being more rewarding than the information finding task. These differences may have mitigated the impact of the evolving and customizable avatars in the information finding task on worker retention and the perception of the attractiveness of the task interface. Furthermore, we saw that workers who identify strongly as crowd workers put in more work and time in a task that is generally perceived as more frustrating and less enjoyable (the information finding task). For a more enjoyable task (the credibility task), workers who did not (strongly) identify put in the same amount of work and time as those who strongly identified as crowd workers. Future research can further explore the role of task types in the effectiveness of gamification interventions and the effect of group identification.

Caveats, Limitations, and Other Considerations
Novelty Effect. It is possible that the effects we observed as a result of gamifying the avatar customization by tying it together with task progress are caused by a novelty effect, and may not be sustainable over the long term [32]. Such novelty effects often occur for gamification that is focused on extrinsic game elements [75]. However, we chose evolving and customizable worker avatars because it is an extrinsic game element and is therefore not bound to a specific crowdsourcing task context. Future work is necessary to determine whether this approach can reap continued benefits over the long term. For instance, incorporating evolving customizable avatars in a crowdsourcing platform and/or integrating them within a permanent or dynamic worker community space can ensure that any progress made by workers does not get lost beyond the task itself. This way, the virtual worker identity formed by the avatar is maintained over time through its integration in the crowdsourcing platform itself. Perhaps future work could investigate how this virtual identity can contribute to more elaborate social interactions that can be implemented directly in crowdsourcing tasks and platforms.
Potential Biases. Cognitive biases can negatively impact the outcome of crowdsourcing experiments [18,38,80]. We used the Cognitive Bias Checklist to analyze and report potential biases in our study [15]. Confirmation bias may have surfaced in our work through the credibility analysis tasks that we considered. The statements used in the credibility analysis task could relate to a worker's prior beliefs about specific topics. For instance, a worker who identifies as an anti-vaxxer might have a confirmation bias to flag the statement 'The CDC issued a warning to all Americans urging them not to get the flu shot this year.' as being 'CREDIBLE.' Another potential cognitive bias that may have surfaced is loss aversion. Although we mentioned to the workers that they would get paid based on an hourly wage, workers may have chosen to drop out of the task batch earlier to ensure their earnings. Prior work in crowdsourcing literature has identified and corroborated such behavior [34]. While both cognitive biases could have influenced the task-related outcomes, it is unlikely that these biases have caused significant differences across the experimental conditions and, therefore, may not affect the validity of conclusions drawn in this study.
Ethical Issues and Considerations. Shahri et al. [81] identified different ethical issues that can be caused by deploying gamification techniques in a workplace. Some ethical issues raised are related to leaderboards, privacy, exploitation, and personal and cultural values. Within our study, the functionality within the worker community space to order workers based on the levels reached might have caused workers to feel bad about their performance and their relatively less evolved avatars. However, this effect may have been mitigated by ensuring the anonymity of workers.
Furthermore, the worker community space is limited to workers who completed the task successfully, which might conflict with personal and/or cultural values. It can be considered unfair towards other workers, as fostering group identification and improving worker experience can be seen as a right for all workers. Future work could investigate ways to foster group identification during and before tasks to deal with this value conflict. From a task requester's perspective, fostering a sense of group identification before task completion might also benefit engagement during the task [39]. Another ethical issue related to gamification in a workplace is whether increasing workers' productivity with gamification is exploitative [43,81]. As observed in our study, gamification might cause workers to complete more work. This is a problem when workers are not paid for their extra efforts or suffer due to the workload. We argue that there is positive value in employing gamification to increase productivity, aiming to improve workers' experience and motivation to engage in the work [52,62]. However, increasing productivity should not cause an excessive workload or affect the short and long-term health of workers [4], and workers should be paid fair wages [92].
Platform Differences. As our current study focused on the Prolific crowdsourcing platform, we are unsure how the results generalize to other platforms, such as Amazon Mechanical Turk (MTurk), Appen, or Toloka. Platforms differ in how they are used, the number of hours that workers generally spend on them, and their workers' demographic and geographic features [67]. Moreover, some workers are active on multiple crowdsourcing platforms. Future research can investigate the potential platform-specific needs of workers and how to facilitate an appropriate working identity that suits worker needs.

Implications and Future Work
Our work has important design and theoretical implications, which we discuss in detail in this section.
Evolving and Customizable Worker Avatars for Crowdsourcing Tasks. Our findings have important implications for the design of future crowdsourcing microtasks. Task requesters often desire worker retention in tasks with elaborate training or tutorial phases. Based on our results, evolving and customizable worker avatars in monotonous crowdsourcing tasks can improve worker retention in conversational crowdsourcing. Though the evolving aspect of the customizable worker avatars can lead to an increased focus on completing more microtasks among workers, our results suggest that accuracy is not negatively impacted. Prior crowdsourcing literature has also revealed a positive impact of increased worker retention on overall accuracy [26]. Furthermore, evolving worker avatars can be particularly interesting when designing crowdsourcing tasks where worker retention plays an important role, for instance, tasks that require training or tutorials. In that case, increasing worker retention might save costs related to training the worker. Considering the benefits that can be reaped from worker retention in long batches of tasks (such as learning effects, improvement in accuracy, task efficiency, and stable performance), this method shows the potential to improve worker experiences while meeting task requesters' needs. Additionally, the context-independent nature of integrating evolving and customizable worker avatars makes this viable for different tasks. Our results, however, indicate that task-specific features can play a role in mediating the effect of customizable worker avatars and group identification. Future work is necessary to investigate how and the extent to which task-dependent features shape the impact of evolving and customizable avatars in fostering group identification and shaping task-related outcomes and worker experience.
Group Identification and Sustainable Crowd Work. Our exploratory findings have highlighted the importance of improving group identification among crowd workers working on individual crowdsourcing tasks. This has important theoretical and practical implications for the broad context of crowdsourcing. Our results indicate that group identification is related to greater intrinsic motivation and subjective engagement. Based on this, we believe that fostering group identification contributes positively to the worker experience, which can help create a stronger and thriving workforce [44]. Therefore, we envision that fostering group identification can aid in improving the sustainability of crowd work. While workers who identified themselves strongly as crowd workers showed greater intrinsic motivation and subjective engagement, they also experienced a greater perceived workload. These findings highlight important future directions for optimizing a healthy and sustainable work environment for crowd workers. Future work can further explore effective means to foster a sense of community among crowd workers who predominantly work on tasks individually. More work is needed to understand how we can increase workers' intrinsic motivation and engagement while maintaining a healthy level of perceived workload.

CONCLUSIONS
Our first research question was to investigate the effect of evolving and customizable worker avatars on worker experience and task-related outcomes (RQ1). To address this question, we created a conversational crowdsourcing task where workers were able to customize their worker avatars, and as they progressed through the task batches, they unlocked new levels that allowed them to use new features to customize their avatars. We measured task-related outcomes, such as worker retention, accuracy, and total task execution time. The worker experience was measured by perceived workload, intrinsic motivation, and subjective engagement. Our results suggest that evolving and customizable worker avatars can increase worker retention. Our second research question addressed the extent to which the sharing of worker avatars and task-related feelings in a worker community space could foster a sense of group identification among crowd workers (RQ2). We created an interactive worker community space where workers shared their personalized worker avatars with their feelings on the task. However, the worker community space did not successfully foster an increased sense of group identification among crowd workers, although exploratory findings revealed that this could be a function of individual differences among crowd workers. With our third research question, we investigated the effect of group identification, induced by the worker community space, on worker experience and task-related outcomes (RQ3). We found that the worker community space did not improve group identification among the crowd workers. We conducted an exploratory analysis to investigate the effect of different levels of group identification across all workers on task-related outcomes and worker experience. Our results indicated that workers who identify themselves as crowd workers experience a significantly greater perceived workload, intrinsic motivation, and subjective engagement. Our study extends the understanding of how to design future crowdsourcing tasks. It sheds light on new directions to improve the sustainability of the crowdsourcing paradigm for crowd workers, task requesters, and crowdsourcing platforms.

A STUDY DESIGN

A.1 Technical Implementation
We used TickTalkTurk [71] to design the conversational task interface and leveraged a Vue.js implementation of the Avataaars library to create an avatar editor for workers. The front end of the task interface, including the avatar editor, conversational interface, and the worker community space, was built using the JavaScript framework Vue.js. The back end was built with Flask in Python and connected to a MongoDB database. The application was hosted on an Ubuntu 22.04 server using Nginx [73] and Gunicorn, and secured with an SSL certificate by Let's Encrypt.

A.2 Editable Features of Avatars
An overview of the editable features in the avatar editor can be found in Table 2.

A.3 Pick-A-Mood Scale
Figure 13 shows the interface where workers are asked how the task made them feel based on the Pick-A-Mood scale [11] before entering the worker community space.

A.4 Statistical Analysis
To test our hypotheses, we compared the conditions for each dependent variable related to the crowd worker experience or task-related outcomes. For each dependent variable, we tested whether the data in each condition were normally distributed using a Shapiro-Wilk test [78]. If the dependent variable was normally distributed across all conditions, we tested the homogeneity of variances among the conditions with Levene's test [54]. If the assumptions of normality and homogeneity of variances were met, a one-way ANOVA test was performed to test for significant differences between the conditions. If the assumptions were not met, a Kruskal-Wallis test was performed [49]. To further investigate the differences between the conditions, post-hoc tests were carried out, while appropriately adjusting for multiple comparisons to avoid type-I error inflation. For the parametric one-way ANOVA test, Tukey's test [79] was performed. In the case of the non-parametric Kruskal-Wallis test, a Dunn test [16] was performed. As demographic differences can influence the effect of gamification [47], we explored potential confounds of age and/or gender by carrying out corresponding ANCOVA tests while considering these variables as covariates. These results can be found in Section B.2.
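
The non-parametric branch of this procedure can be sketched in plain Python. The following is a minimal illustration under stated assumptions, not the analysis code used in the study: it computes the tie-corrected Kruskal-Wallis H statistic and a p-value from the usual chi-square approximation. The function names (`average_ranks`, `kruskal_wallis_h`, `chi2_sf`) and the sample data are hypothetical and chosen only for this example.

```python
from collections import Counter
from itertools import chain
from math import erfc, exp, pi, sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H statistic for a list of samples."""
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    # Divide by the tie correction factor 1 - sum(t^3 - t) / (n^3 - n).
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    correction = 1 - ties / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    return h / correction


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= x / (2 * r)
            total += term
        return exp(-x / 2) * total
    chi = sqrt(x)
    term, total = 0.0, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        term = chi if r == 1 else term * x / (2 * r - 1)
        total += term
    return erfc(chi / sqrt(2)) + sqrt(2 / pi) * exp(-x / 2) * total


if __name__ == "__main__":
    # Hypothetical values of one dependent variable in three conditions.
    conditions = [[4, 6, 7, 9], [5, 8, 10, 12], [11, 13, 14, 15]]
    h = kruskal_wallis_h(conditions)
    p = chi2_sf(h, df=len(conditions) - 1)
    print(f"H = {h:.3f}, p = {p:.4f}")
```

A significant result from such a test would then be followed by pairwise post-hoc comparisons (a Dunn test in this paper) with adjusted p-values.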

B RESULTS

B.1 Descriptive Statistics
Table 3 shows the descriptive statistics of the avatar editor and the worker community space to gain insights into how workers interacted with the avatar editor and the community space. As expected, the number of changes made in the avatar editor is higher for the evolving avatar conditions. Furthermore, the descriptive results show that workers actively customized their avatars. The descriptive statistics of the worker community indicate that on average, the workers did not interact much with the buttons, while they did spend some time in the worker community space.

B.2 Covariance Analysis
To verify whether gender and age played a role in shaping the significant differences we found in worker retention and aesthetic appeal across the different experimental conditions for the credibility analysis task, we performed an ANCOVA test between all conditions, using gender and age as covariates. For worker retention, our ANCOVA test does not show any effect of age (df = 1, F = 1.357, p = .245) or gender (df = 3, F = 0.613, p = .607). The ANCOVA test for aesthetic appeal does not show any significant effect of age (df = 1, F = 0.610, p = .285) but does show a significant effect of gender (df = 3, F = 2.692, p = .046, = 0.05). A post-hoc Tukey test revealed a significant difference between the categories non-binary and other (p = .026). These two categories only consist of 3 and 1 worker, respectively. Therefore, we can conclude that the variables of age and gender did not affect our findings.

B.3 Group Identification - Qualitative Analysis
The description of the categories that emerged from the open coding of the responses to the open-ended question about group identification can be found in Table 4, together with an example response. Furthermore, Table 5 shows an overview of the descriptive statistics of our qualitative data analysis.

B.4 Task Differences
The task differences in perceived workload, intrinsic motivation, and subjective user engagement between the credibility analysis task and the information finding task can be found in Figure 14.

B.5 GIM Level Differences
The details of the exploratory statistical analyses for the Dunn tests between the different levels of group identification can be found in Table 6 (information finding task) and Table 7 (credibility analysis task).

Table 4 entries (category: description; example response):
Avatars not perceived as real workers: Example: "They were just icons on my screen and did not feel like real people."
Feelings -: Seeing the feelings of other workers about the task made the worker feel less connected. Example: "People had different feelings."
Feelings +: Seeing the feelings of other workers about the task made the worker feel more connected. Example: "Most of the other workers were relaxed and calm just like me."
Interaction -: The worker experienced a lack of interaction, mentions working alone, or does not know other workers personally. Example: "I just saw them at the end. During the experiment there was no interaction."
Avatar -: Seeing the avatars did not make the worker feel connected. Example: "It's hard to feel connected to someone behind an avatar with very little customisation."
Avatar +: Seeing the avatars made the worker feel connected. Example: "The last page made me feel connected because we were all shown together."
Shared Goal: Worker feels connected because they work on the same task. Example: "*We're all in the same boat, doing the same thing for the same compensation."
Other: All answers that did not fit the categories. Example: "I just didn't feel any connection."

Figure 2: An example of a question from the information finding task.

Figure 3: Examples of the credibility statement questions in the credibility analysis task. Figure a) shows a statement that is not credible. Figure b) shows a credible statement that used to be a somewhat credible statement. Therefore the statement in Figure b) is considered to be difficult.

Figure 5: A screenshot showing the avatar editor interface.
(a) Aesthetic Appeal (dimension of UES) for the information finding task. (b) Aesthetic Appeal (dimension of UES) for the credibility analysis task.

Figure 7: Worker retention across the different experimental conditions and the two task types.
(a) Information Finding task (b) Credibility Analysis task

Figure 8: Estimation plots for worker retention. For both tasks, all conditions are compared to the control condition.
(a) Task execution time for the information finding task. (b) Task execution time for the credibility analysis task.

Figure 9: Task execution time of workers across different experimental conditions in the two task types (with outliers removed).

Figure 10: Worker experience measures for different levels of group identification (GIM: low, mid, high, represented respectively by the lower, middle, and upper boxplot per measurement). Significant differences from the Kruskal-Wallis test are shown at the measurement level (y-axis), and the significant differences (adjusted p-value) from the Dunn test within these measurements are shown with significance brackets between the GIM levels. * indicates p < .05, ** indicates p < .01, and *** indicates p < .001. The TLX and IMI scores are measured on a 7-point Likert scale, and the UES measurements are measured on a 5-point Likert scale. Note that for the TLX measurements, a low score for the subdimension performance indicates a high perceived performance.

Figure 13: Workers are asked how the task made them feel, based on the Pick-A-Mood scale.

Figure 14: Significant differences between the worker experience of the credibility analysis task (cred) and the information finding task (info). * means p < .05, ** means p < .01, and *** means p < .001. The TLX and IMI scores are measured on a 7-point Likert scale, and the UES measurements are measured on a 5-point Likert scale. Note that for the TLX measurements, a low score for the subdimension performance indicates a high perceived performance.

Table 2: All the editable features of the avatar editor. The 'Basic Items' are the items that are always available in the avatar editor. The 'Evolving Items' are the features that can be unlocked by reaching a new level. The 'Unlocked' column shows when these items are unlocked. These items are based on the Avataaars generator (https://getavataaars.com).

Table 3: Descriptive statistics for the avatar editor and worker community space. The number of changes describes how often a worker changed features to edit their avatar. Total interactions describes how many times a worker clicked one of the interactive buttons in the worker community space, and the Time (in seconds) describes the amount of time the worker spent in the worker community space.

Table 4: Categories emerging from the open coding of responses to the open-ended question on why the workers did or did not feel connected to other workers who completed the same tasks. The categories are described, and an example from the open-ended responses is presented as it stands in the original quotes. Furthermore, the quote in the title of this paper is an adjusted version of the original quote from our data marked in this table with *. Worker felt as if the avatars on the worker community page were not real workers.

Table 5: The number (N) and percentage (%) of workers for each category of why they did not feel connected (Connected <4) or felt connected (Connected >4) for both tasks. The total row describes the number and percentage of workers per task who felt connected or not.