Hear Me Out: A Study on the Use of the Voice Modality for Crowdsourced Relevance Assessments

The creation of relevance assessments by human assessors (often nowadays crowdworkers) is a vital step when building IR test collections. Prior works have investigated assessor quality and behaviour, though not the impact of a document's presentation modality on assessor efficiency and effectiveness. Given the rise of voice-based interfaces, we investigate whether it is feasible for assessors to judge the relevance of text documents via a voice-based interface. We ran a user study (n = 49) on a crowdsourcing platform where participants judged the relevance of short and long documents, sampled from the TREC Deep Learning corpus, presented to them either in the text or voice modality. We found that: (i) participants are equally accurate in their judgements across both the text and voice modality; (ii) with increased document length it takes participants significantly longer (for documents of length > 120 words it takes almost twice as much time) to make relevance judgements in the voice condition; and (iii) the ability of assessors to ignore stimuli that are not relevant (i.e., inhibition) impacts the assessment quality in the voice modality: assessors with higher inhibition are significantly more accurate than those with lower inhibition. Our results indicate that we can reliably leverage the voice modality as a means to effectively collect relevance labels from crowdworkers.


INTRODUCTION
The assessment of document relevance by human assessors, with respect to a given set of information needs, is a vital step in the building of an Information Retrieval (IR) test collection [33,67]. Depending on the corpus, documents are represented in a variety of forms, including text (the most common form at TREC), images [19,48], or videos [25,42]. Prior works have investigated assessor quality, their behaviour, and tooling to support assessors, most often in the context of text documents [6,38,56,62,63]. Given the prevalent nature of text corpora, we continue in this vein and focus on an aspect that has received little attention so far: the presentation modality of the text documents during the judging process.
Thanks to the development of voice-based conversational search systems, people have become accustomed to being presented search results that are read out to them, an approach that is very different from the presentation of text on-screen. We posit that by utilising such audio-based devices, we can increase the scope for collecting relevance judgements for text documents in a number of ways. For example, assessors can contribute by judging documents on their smartphones [3,77], if they have visual impairments [55,79,86], or if they come from a low-resource background [5,55].
Two important aspects of collecting relevance judgements are: (i) the quality of assessments [62]; and (ii) the time taken by assessors to make their judgements [69]. Since relevance judgements are used to train and evaluate Learning to Rank (LtR) systems, the quality of judgements impacts the effectiveness of such systems [15,83]. The time taken by assessors to judge relevance may not only affect the quality of judgements, but also contribute to the cost of building (and maintaining) test collections. NIST assessors [16,17] and crowdworkers [4,39] are often paid according to the time they spend on a task (e.g., as on Prolific). The longer it takes assessors to judge, the costlier it becomes. There are a number of factors (not limited to topic difficulty [18,62], document familiarity [63], or relevance judgement session length [63]) that have been shown to affect the quality of (and the time taken for) judging relevance.
In our work, we focus on two such factors in our pursuit to examine the feasibility of using the voice modality for text-document relevance assessments: document length [23,59,63,68] and an assessor's cognitive abilities [60,62] expressed in terms of working memory and inhibition. Our selection of factors is motivated by a range of prior works. The serial [40] and temporal [61] nature of the voice medium makes it more difficult for listeners to "skim" back and forth over a piece of information as compared to reading it on-screen [49,82,84]. Voice interfaces also demand greater cognitive load when compared to text interfaces for processing information [40,54,66]. These difficulties are exacerbated as the amount of information to be conveyed increases in size [51,64]. Understanding how these factors affect the relevance judgement process can help us design tasks for assessors with a wide range of abilities and for different document presentation modalities. While there exist various measures for cognitive abilities, we selected two, working memory (someone's ability to hold information in short-term memory) [21] and inhibition (someone's ability to ignore or inhibit attention to stimuli that are not relevant) [21], which have been shown to play an important role in speech understanding [26,58,71]. We posit that they will also be crucial in the relevance judgement process, especially when documents are presented in the voice modality. Taken together, we investigate the following research questions.
RQ1 How does the modality of document presentation (text vs. voice) affect an assessor's relevance judgement in terms of accuracy, time taken, and perceived workload?
RQ2 How does the length of documents affect assessors' ability to judge relevance? Specifically, we look into the main effect of document length and the effect of its interplay with presentation modality.
RQ3 How do the cognitive abilities of an assessor (with respect to their working memory and inhibition) affect their ability to judge relevance? Specifically, we look into the main effect of the cognitive abilities and the effect of their interplay with the presentation modality.
To answer these questions, we conducted a quantitative user study (n = 49) on the crowdsourcing platform Prolific. Participants judged the relevance of 40 short and long documents sampled from the passage retrieval task data of the 2019 & 2020 TREC Deep Learning (DL) track [16,17]. Our findings are summarised as follows.
• Participants judging documents presented in the voice modality were equally accurate as those judging them in the text modality.
• As documents got longer, participants judging documents in the voice modality took significantly longer than those in the text modality. For documents of length greater than 120 words, the former took twice as much time with less reliable judgements.
• We also found that inhibition (a participant's individual ability to ignore or inhibit attention to stimuli that are not relevant) impacts relevance judgements in the voice modality. Indeed, those with higher inhibition were significantly more accurate than their lower inhibition counterparts.
Overall, our results indicate that we can leverage the voice modality to effectively collect relevance labels from crowdworkers.

RELATED WORK

Relevance Judgement Collection
The general approach for gathering relevance assessments for large document corpora (large enough that a full judgement of all corpus documents is not possible) was established by TREC in the early 1990s [28]. Given a set of information needs, a pooled set of documents based on the top-k results of (ideally) a wide range of retrieval runs is assessed by topic experts. This method is typically costly and does not scale up [4] once the number of information needs or k increases. In the last decade, creating test collections using crowdsourcing via platforms like Prolific or Amazon Mechanical Turk (AMT) has been shown to be a less costly yet reliable alternative [4,39,63,85]. While the potential of crowdsourcing for more efficient relevance assessment has been acknowledged, concerns have been raised regarding its quality, as workers might be too inexperienced, lack the necessary topical expertise, or be paid an insufficient salary. In turn, these issues may lead them to complete the tasks to a low standard [36,46,53]. Aggregation methods (e.g., majority voting) can be used as effective countermeasures to improve the reliability of judgements [32,34].
There are a number of factors that have been shown to affect the relevance judgement process. Scholer et al. [62] observed that participants exposed to non-relevant documents at the start of a judgement session assigned higher overall relevance scores to documents than those exposed to relevant documents. Damessie et al. [18] found that for easier topics, assessors processed documents more quickly, and spent less time overall. Document length was also shown to be an important factor for judgement reliability. Hagerty [27] found that the precision and recall of abstracts judged increased as the abstract lengths increased (30, 60, and 300 words). In a similar vein, Singhal et al. [68] observed that the likelihood of a document being judged relevant by an assessor increased with the document length. Chandar et al. [12] found that shorter documents that are easier to understand provoked higher disagreement, and that there was a weak relationship between document length and disagreement between assessors. In terms of time spent for relevance judgement, Konstan et al. [37] and Shinoda [65] asserted that there is no significant correlation between time and document length. On the other hand, Smucker et al. [70] found participants took more time to read as document length increased (from ∼10s for 100 words, to ∼25s for 1000 words).

Voice Modality
Voice-based crowdsourcing has been shown to be more accessible for people with visual impairments [79,86], or those from low resource backgrounds [55]. It can also provide greater flexibility to crowdworkers by allowing them to work in brief sessions, enabling multitasking, reducing effort required to initiate tasks, and being reliable [31,78]. However, information processing via voice is inherently different compared to when it is presented as text. The use of voice has often been shown to lead to a higher cognitive load [50,80]. Individuals also exhibit different preferences. For example, Trippas et al. [75] observed that participants preferred longer summaries for text presentation. For voice, however, shortened summaries were preferred when the queries were single-faceted. Although their study did not measure the accuracy of judgements against a ground truth, what participants considered the most relevant was similar across both conditions (text vs. voice presentation). Furthermore, the voice modality can leverage its own unique characteristics for information presentation. For instance, Chuklin et al. [14] varied the prosody features (pauses, speech rate, pitch) of sentences containing answers to factoid questions. They found that emphasising the answer phrase with a lower speaking rate and higher pitch increased the perceived level of information conveyed.
Concerning the collection of relevance assessments, Tombros and Crestani [74] found in their lab study that participants were more accurate and faster in judging relevance when the list of documents (with respect to a query) was presented as text on screen as compared to when it was read out to the participants, either in person or via telephone. It should however be noted that this work was conducted more than two decades ago, barely ten years after the invention of the Web, when the now common voice assistants and voice-enabled devices were yet to be developed.
The work closest to ours is the study by Vtyurina et al. [80], who presented crowdworkers with five results of different ranks from Google, either in text or voice modality. They asked their participants to select the two most useful results and the least useful one. They observed that the relevance judgements of participants in the text condition were significantly more consistent with the true ranking of the results than those who were presented with five audio snippets. The ability to identify the most relevant result was however not different between the experimental cohorts. This study did not consider the effect of document length or cognitive abilities of participants on their relevance judgement performance, which is what we explore.

Cognitive Abilities
Prior works have explored how the cognitive abilities of assessors impact relevance judgements. Davidson [20] observed that openness to information, measured by a number of cognitive style variables such as open-mindedness, rigidity, and locus of control, accounted for approximately 30% of the variance in relevance assessments. Scholer et al. [62] found that assessors with a higher need for cognition (i.e., a predisposition to enjoy cognitively demanding activities) had higher agreement with expert assessors, and took longer to judge compared to their lower need for cognition counterparts. Our work focuses on working memory and inhibition.
Working Memory (WM) refers to an individual's capacity for keeping information in short-term memory even when it is no longer perceptually present [21]. This ability plays a role in higher-level tasks, such as reading comprehension [43] and problem solving [81]. MacFarlane et al. [44] observed that participants with dyslexia (a learning disorder characterised by low working memory) judged fewer text documents as non-relevant when compared to participants without the learning disorder. They posited that it might be cognitively more demanding to identify text documents as non-relevant for the cohort with dyslexia. With regard to processing speech, high WM has also been shown to be helpful in adapting to distortion of speech signals caused by background noise [26]. Rudner et al. [58] and Stenbäck [71] observed that high WM individuals perceived less effort while recognising speech from noise.
Inhibition (IN) refers to the capacity to regulate attention, behaviour, thoughts, and/or emotions by overriding internal impulses or external 'lure', and maintaining focus on what is appropriate or needed [21]. To our knowledge, prior studies have not investigated the effect of IN on the relevance assessment process. High IN has been shown to help in speech recognition, especially in adverse conditions like the presence of background noise [71,72].
A significant number of prior works have explored various aspects related to the process of relevance assessment. This work, however, considers the novel effect of document length and the cognitive abilities of assessors to explore the utility of the voice modality with regards to judging relevance.

METHODOLOGY
To address our three research questions outlined in §1, we conducted a crowdsourced user study. The study participants were asked to judge the relevance of Query/Passage (Q/P) pairings, where passages were presented either in the form of text (i.e., a piece of text) or voice (i.e., an audio clip). In our study, passage presentation modality is a between-subjects variable. We also controlled the length of passages; this is a within-subjects variable to ensure that participants judged passages of varying lengths. The independent variables working memory and inhibition allow us to estimate the impact of the cognitive abilities of the participants on the accuracy of their judgements, time taken, and perceived workload.

Study Overview
Figure 1 presents an overview of the user study design.¹ The diagram highlights the main tasks that study participants undertook. Lasting approximately 32 minutes for text and 40 minutes for voice, the study consisted of four main parts: (i) the pre-task survey (§3.6); (ii) the cognitive ability tests (§3.3); (iii) the judgements (§3.4); and (iv) the post-task survey (§3.5).
After agreeing to the terms of the study, participants completed a pre-task survey (A). This survey included demographics questions, including questions about their familiarity with voice assistants, as reported in §3.6. Participants would then move onto two psychometric tests; as outlined in §3.3, these tests measured their cognitive abilities with respect to working memory (B) and inhibition (C). Participants undertook a short practice task to help them familiarise themselves with the interface for each test.
After the psychometric tests, participants moved to the main part of the study: judging Q/P pairings (D). The experimental system first assigned the participants to either text or voice randomly (E) (§3.4). Based on the assigned condition, participants then judged a total of 42 Q/P pairings presented to them in a random order to mitigate the effect of topic ordering [62,63] (§3.2): 40 were selected from the 2019 and 2020 TREC Deep Learning (DL) track, and the remaining two acted as a sanity check (SC) (F). The 40 passages belonged to different answer length buckets (§3.2) (G). Finally, the participants would be taken to the post-task survey (H).

Query/Passage Pairings
As mentioned, we obtained the Q/P pairings from the 2019 and 2020 TREC DL track, specifically the passage retrieval task [16,17,45]. The test partitions of the datasets contain 43 and 54 natural language queries, respectively, with passages that are judged by NIST assessors. Using a graded relevance scale, passages for each query were judged by assessors as: (i) perfectly relevant when the passage is dedicated to the query, containing an exact answer; (ii) related when the passage appears somewhat related to the query, but does not answer it completely; or (iii) non-relevant, when the passage has nothing to do with the provided query [16,17]. We note that an additional relevance category exists (highly relevant). However, we ignore judgements of this category in our work (similar to [39]) in order to have a clear distinction between the different categories.
Sampling Procedure. From the available test queries, we sampled 40 (due to budget constraints). As RQ2 states, we are interested in how passage length affects assessments; we therefore determined five different buckets of passage length, from very short to very long (more details follow below). We randomly assigned the 40 queries to these five buckets, leading to eight queries per passage length bucket. For each query, we sampled three passages from the QRELs, with the additional condition that the sampled passages must fall into the query's passage length bucket: one perfectly relevant, one related, and one non-relevant passage. Thus, each bucket contains 24 passages pertaining to eight queries. Table 2 shows three Q/P examples, each coming from a different passage length bucket.
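The sampling procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function names, the toy `qrels` structure, and the fixed random seed are all hypothetical, and the toy data guarantees that every query has a candidate passage for each label in its bucket.

```python
import random

BUCKETS = ["XS", "S", "M", "L", "XL"]
LABELS = ["relevant", "related", "non-relevant"]

def assign_queries_to_buckets(query_ids, rng):
    """Randomly split the queries evenly across the passage length buckets."""
    shuffled = list(query_ids)
    rng.shuffle(shuffled)
    per_bucket = len(shuffled) // len(BUCKETS)
    return {
        bucket: shuffled[i * per_bucket:(i + 1) * per_bucket]
        for i, bucket in enumerate(BUCKETS)
    }

def sample_passages(query_id, bucket, qrels, rng):
    """Pick one passage per relevance label whose length falls in `bucket`.

    `qrels` (hypothetical layout) maps query_id -> list of
    (passage_id, label, bucket) tuples. Raises IndexError if a label
    has no candidate passage in the bucket.
    """
    chosen = {}
    for label in LABELS:
        candidates = [pid for pid, lab, b in qrels[query_id]
                      if lab == label and b == bucket]
        chosen[label] = rng.choice(candidates)
    return chosen

rng = random.Random(0)  # fixed seed, purely for reproducibility of the sketch
assignment = assign_queries_to_buckets(range(40), rng)

# Toy QRELs: two candidate passages per label for every query, all in its bucket.
toy_qrels = {
    q: [(f"p{q}-{label}-{i}", label, bucket) for label in LABELS for i in range(2)]
    for bucket, queries in assignment.items() for q in queries
}
sampled = {
    q: sample_passages(q, bucket, toy_qrels, rng)
    for bucket, queries in assignment.items() for q in queries
}
total_pairs = sum(len(passages) for passages in sampled.values())
```

With 40 queries and five buckets, each bucket receives eight queries, and three passages per query give the 24 passages per bucket stated in the text.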
Sanity Check (SC). We also created two additional Q/P pairings to act as sanity checks² in order to perform quality control of the relevance judgements by our participants, as suggested by Scholer et al. [63]. We did not consider the SC Q/P pairs in our data analysis.
Judgements per Participant. We presented all our participants with the same set of 40 queries + 2 SC queries in order to mitigate effects arising due to differences in queries [18]. Each participant judged one randomly sampled passage, out of the three available ones, for each of the 40 queries (ignoring the SC queries). We thus collected relevance judgements on a total of 40 × 3 = 120 Q/P pairs³. Each participant judged 13 passages per QREL.
Passage Length Buckets. To add more detail to our passage length bucketing procedure, we chose five types of length buckets: XS (Very Short); S (Small); M (Medium); L (Long); and XL (Very Long). They corresponded to the 0-5, 5-50, 50-75, 75-99, and 99+ percentile ranges of the lengths of all judged passages of the 97 test queries in our TREC-DL datasets. We selected the percentiles to have a range of 20 to 30 words per passage length bucket. The concrete word ranges for each passage length bucket can be found in Table 1.
From Text Passage to Audio Clip. We processed the passages to remove any unwanted punctuation and leading and trailing whitespace, and corrected a few spelling errors. These cleaning steps were necessary as we did not want the participants to be distracted by unclean text, and to create legible audio clips for the voice interface. We used Amazon Polly⁴, a cloud-based text-to-speech service with an array of options for language and voice types, to generate the audio clips for the voice results. Specifically, we chose Matthew, a male US English voice, with a speed of 95%, as the authors unanimously agreed that this particular setting (among other evaluated voice options) had the clearest pronunciation, in particular of difficult words⁵ that might appear in the passages. Lastly, we ran a pilot study (n = 5) where participants were asked to rate the pace, accent, and length of our generated audio clips on a seven-point scale. They reported an average score of 6.3, confirming the high quality of the audio clips for our task. Table 1 shows the minimum, maximum, and average length of the audio clips in seconds for the passages belonging to the five length buckets.⁶
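To make the percentile-based bucketing concrete, the following Python sketch maps a passage's word count to one of the five buckets. It is an illustration only: the nearest-rank percentile convention, the handling of values that fall exactly on a cut point, and the toy word counts are assumptions, since the text does not specify them.

```python
from bisect import bisect_left

BUCKET_NAMES = ["XS", "S", "M", "L", "XL"]
CUT_PERCENTILES = [5, 50, 75, 99]  # bucket boundaries named in the text

def percentile(sorted_values, pct):
    """Nearest-rank percentile of a sorted list (one of several conventions)."""
    rank = max(1, -(-pct * len(sorted_values) // 100))  # ceil(pct * n / 100)
    return sorted_values[rank - 1]

def make_bucketer(word_counts):
    """Return (bucket_of, cuts), where bucket_of maps a word count to a bucket."""
    ordered = sorted(word_counts)
    cuts = [percentile(ordered, p) for p in CUT_PERCENTILES]

    def bucket_of(n_words):
        # Assumption: a count equal to a cut point goes to the lower bucket.
        return BUCKET_NAMES[bisect_left(cuts, n_words)]

    return bucket_of, cuts

# Toy corpus: passages with 1..100 words (not the TREC DL length distribution).
bucket_of, cuts = make_bucketer(range(1, 101))
examples = {n: bucket_of(n) for n in (3, 5, 6, 60, 99, 100)}
```

On real data, the cut points would be computed from the lengths of all judged passages, and the resulting word ranges would correspond to those reported in Table 1.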

Cognitive Ability Tests
In order to measure the cognitive abilities of our participants with relation to judging the presented Q/P pairings, we chose two established psychometric tests that examine both an individual's working memory and their inhibition. Prior work [26,58,71] has shown that working memory and inhibition play an important role in speech understanding.
Table 2: Example Query/Passage pairs from TREC DL, each from a different passage length bucket.
Label: RELEVANT. Passage: Some prosthesis, like hip and knee joints made of cobalt chrome, contain some trace of nickel and for patients with allergies to this may have to go with Titanium joints.
Short (S). Query: Who has the highest career passer rating in the nfl? (1056416). Label: SOMEWHAT-RELEVANT. Passage: Wilson is the only quarterback in NFL history to post a 100-plus passer rating in each of his first two seasons, and he's already won a Super Bowl. Dan Marino is really the only quarterback you could argue was better out of the gate.
Long (L). Query: What is the appearance of granulation tissue? (1133579). Label: NON-RELEVANT. Passage: The protective outer layer of the plant. Everything needs skin, or at least some sort of a covering, for plants, it's a system of dermal tissue. Which covers the outside of a plant and it protects the plant in a variety of ways. Dermal tissue called epidermis is made up of live parenchyma cells in the non-woody parts of plants. Epidermal cells can secrete a wax-coated substance on leaves and stems, which becomes the cuticle. Dermal tissue that is made up of dead parenchyma cells is what makes up the outer bark in woody plants.

Working Memory. To measure working memory capacity, we used the Operation-word-SPAN (OSPAN) test [76] that has also been used in prior Interactive IR (IIR) work [13]. The OSPAN test measures an individual's ability to recall letters displayed in sequence, while concurrently completing simple secondary tasks. Participants completed eight trials of varying lengths. During each trial, participants were shown a sequence of 3 to 7 letters, and were then asked to recall the letters in their original order from a grid display. Additionally, during each trial, participants completed simple mathematical problems between each letter shown in sequence (e.g., "is 8+6=15?"). The final score was equal to the sum of sequence lengths of all trials perfectly recalled. A higher score in the OSPAN test indicates a participant's greater ability to hold information (the letter sequence in correct order) in short-term memory when it is no longer perceptually present.
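The OSPAN scoring rule described above (the sum of the sequence lengths of all perfectly recalled trials) can be written as a short Python function. This is a minimal sketch: the `(shown, recalled)` trial representation and the example letter sequences are hypothetical.

```python
def ospan_score(trials):
    """Sum the lengths of the letter sequences recalled perfectly and in order.

    `trials` is a list of (shown, recalled) pairs of letter sequences;
    this representation is an assumption made for illustration.
    """
    return sum(
        len(shown)
        for shown, recalled in trials
        if list(shown) == list(recalled)
    )

trials = [
    ("FHK", "FHK"),      # perfect recall, contributes 3
    ("PQRS", "PQSR"),    # wrong order, contributes 0
    ("LNTYR", "LNTYR"),  # perfect recall, contributes 5
    ("JKQ", "JK"),       # incomplete recall, contributes 0
]
score = ospan_score(trials)  # 3 + 5 = 8
```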
Inhibition. To measure inhibition, we used the Stroop test, which was first introduced in 1935 [73]. As an example, the Stroop test has been used to measure inhibitory attention control in learning [24,35] and speech processing [71]. We used a computerised version of the test that was also used in the IIR study undertaken by Arguello and Choi [7]. During the Stroop test, participants were shown a sequence of words indicating one of four colours: red, green, yellow, or blue. Some of the words displayed are congruent (e.g., the word "blue" displayed in blue font), and others are incongruent (e.g., the word "blue" displayed in red font). For each word, participants had to indicate the font colour of the word as quickly as possible by clicking on the correct option presented as a list (the trial continued until the correct colour was chosen). Participants had to complete 48 correct trials (similar to the study by Arguello and Choi [7]), of which 24 are congruent and 24 are incongruent. The final score is equal to the participant's average response time (in milliseconds) for the incongruent trials, minus the average response time for the congruent trials. Response times are typically slower for the incongruent trials, an effect referred to as the Stroop effect. Lower scores are better for the Stroop test, with higher scores indicating a greater difficulty in focusing on the relevant stimulus (the colour of the word) and ignoring the non-relevant stimulus (the word itself).
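The Stroop score defined above is a simple difference of means, which the following Python sketch illustrates. The `(is_congruent, response_time_ms)` trial layout and the example response times are hypothetical.

```python
from statistics import mean

def stroop_score(trials):
    """Mean response time (ms) on incongruent trials minus that on congruent trials.

    `trials` is a list of (is_congruent, response_time_ms) pairs; this
    layout is an assumption made for illustration. Lower scores indicate
    better inhibition.
    """
    congruent = [rt for is_congruent, rt in trials if is_congruent]
    incongruent = [rt for is_congruent, rt in trials if not is_congruent]
    return mean(incongruent) - mean(congruent)

# Toy data: two congruent and two incongruent trials.
trials = [(True, 600), (True, 700), (False, 800), (False, 900)]
score = stroop_score(trials)  # 850 - 650 = 200 ms
```

A negative score is possible when a participant happens to respond faster on incongruent trials, which is consistent with the negative minimum reported later for the Stroop test.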

Assessor Interface
Our study interface is shown in Figure 2, as a composition of both the text and voice interfaces. The text-specific components are highlighted in blue; voice-specific ones in orange. For each Q/P pairing they were required to judge, participants were presented with a static query box (1), which could not be altered; it displayed the query against which the participant was to judge the passage. Only one passage was shown (2); depending on the condition, this was either presented as text (for text), or a series of buttons to control the audio clip (for voice). In the case of voice, the participant had to press the Play Answer button to listen to the audio clip. They could also pause and restart the audio clip by pressing the Pause Answer and Restart Answer buttons respectively.
Once they had read or listened to the answer passage, participants then moved to the underlying form located at (3) to provide their judgement of the passage. Participants could choose between 'Relevant', 'Somewhat relevant', 'Non relevant', and 'I do not know'.
We included the final option to ensure that participants were not forced to make a relevance decision in the case that they were not sure, as it has been shown that assessors are not always certain of their judgements [1]. We did not provide the participants with the option to skip parts of the audio clip or adjust the speed. Certain checks were in place to ensure reliability of relevance judgements of participants, in addition to the two SC pairings as outlined in §3.2. For text, the form for marking relevance (3) appeared after five seconds. For voice, the form for marking relevance (3) appeared after 50% of the audio clip had been played. Participants could also proceed to judge the next query/passage pair by clicking the Next Query button (4), which was enabled only after a participant made their judgement. Once participants moved on to the next pairing, they could not go back to revise earlier judgements. No time limit was imposed on participants during the judging process.
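The two reliability checks on when the judgement form appears can be expressed as a small predicate. The sketch below is written in Python purely for illustration (the study interface itself is web-based), and the function name, parameter names, and the use of inclusive thresholds are assumptions.

```python
TEXT_DELAY_S = 5.0    # from the text: form appears after five seconds
VOICE_FRACTION = 0.5  # from the text: form appears after 50% of the clip is played

def form_visible(modality, elapsed_s=0.0, played_s=0.0, clip_length_s=0.0):
    """Return True once the relevance form may be shown (hypothetical helper).

    For text, `elapsed_s` is the time since the passage was displayed.
    For voice, `played_s` is how much of the clip has been played so far.
    """
    if modality == "text":
        return elapsed_s >= TEXT_DELAY_S
    if modality == "voice":
        return clip_length_s > 0 and played_s >= VOICE_FRACTION * clip_length_s
    raise ValueError(f"unknown modality: {modality!r}")

checks = [
    form_visible("text", elapsed_s=4.9),
    form_visible("text", elapsed_s=5.0),
    form_visible("voice", played_s=10.0, clip_length_s=30.0),
    form_visible("voice", played_s=15.0, clip_length_s=30.0),
]
```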

Outcome Measures
In addition to the use of the two psychometric tests outlined in §3.3, we used interaction logging apparatus and additional surveys to capture both behavioural and experience data respectively.
Measuring Participant Behaviours. We added the JavaScript library LogUI [47] into our web-based judgement interface; it allowed us to capture a variety of different behaviours and events such as: (i) when the page was loaded; (ii) clicks on the form to record the judgement made by a participant; and (iii) clicks on the Play/Pause/Restart buttons (for voice). From these events, we could compute the amount of time taken for an individual to make a judgement, that is, from when the page loaded (showing the query/passage pairing) to when the Next Query button (4) was clicked (Figure 2). In turn, this allowed us to compute the time per relevance judgement, as reported in our results.
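The time-per-judgement computation described above can be sketched as follows. The event dictionaries, their field names, and the event type strings are hypothetical and do not reflect LogUI's actual log schema; the sketch only shows the page-load-to-Next-Query difference.

```python
def judgement_times(events):
    """Seconds from page load to the Next Query click, per Q/P pairing.

    `events` is a list of dicts with 'type', 'pair_id', and 'timestamp'
    (seconds); this layout is an assumption made for illustration.
    """
    loaded = {}
    durations = {}
    for event in sorted(events, key=lambda e: e["timestamp"]):
        pair_id = event["pair_id"]
        if event["type"] == "page_loaded":
            loaded[pair_id] = event["timestamp"]
        elif event["type"] == "next_query_clicked" and pair_id in loaded:
            durations[pair_id] = event["timestamp"] - loaded[pair_id]
    return durations

events = [
    {"type": "page_loaded", "pair_id": "q1-p3", "timestamp": 0.0},
    {"type": "play_clicked", "pair_id": "q1-p3", "timestamp": 1.5},
    {"type": "next_query_clicked", "pair_id": "q1-p3", "timestamp": 12.5},
    {"type": "page_loaded", "pair_id": "q2-p1", "timestamp": 13.0},
    {"type": "next_query_clicked", "pair_id": "q2-p1", "timestamp": 40.0},
]
times = judgement_times(events)
```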
Measuring Participant Experiences. After completing the relevance judgements, participants completed the post-task survey. Participants were asked about their perceived workload. They were asked specifically to answer the questions based only on their perceived experiences of the relevance judgement tasks. To measure workload, we used five questions from the raw NASA TLX survey, as proposed by Hart and Staveland [29]. This instrument has been used (in slightly different forms) in several prior IIR studies (e.g., [7,8,57]). The five selected questions from the NASA TLX are designed to measure perceived: (i) mental demand; (ii) effort; (iii) temporal demand; (iv) frustration; and (v) performance. We omitted the 'physical demand' question from the survey as it was not relevant to our task.⁷ Participants responded to the five NASA TLX questions using a seven-point scale with labelled endpoints (from "poor" to "good" for performance and from "low" to "high" for the remaining four).
Measuring Participant Performance. We also computed the accuracy of our participants in the relevance judgement tasks. Accuracy was calculated in terms of how many Q/P pairs participants judged correctly, that is, their relevance judgement matching the ground truth from the QRELs. We also aggregated relevance judgements of participants on each Q/P pairing based on majority voting, as done by Kutlu et al. [39], to observe if collective judgements are more accurate. We used Krippendorff's alpha (α) to measure inter-annotator agreement (as used by Damessie et al. [18]). Lastly, we calculated Cohen's kappa (κ) [9,10,11], which measures the agreement of judgements with ground truths by considering chance.
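As a rough illustration of the performance measures above, the following Python sketch computes accuracy against the ground truth, a majority vote over several judgements, and unweighted Cohen's kappa. The label abbreviations, the tie-breaking rule for majority voting, and the toy data are assumptions; the text does not state how ties were resolved or whether a weighted kappa was used.

```python
from collections import Counter

def accuracy(judged, truth):
    """Fraction of judgements that match the ground-truth label."""
    return sum(j == t for j, t in zip(judged, truth)) / len(truth)

def majority_vote(labels):
    """Most frequent label; ties go to the label seen first (an assumption)."""
    return Counter(labels).most_common(1)[0][0]

def cohens_kappa(judged, truth):
    """Unweighted Cohen's kappa: chance-corrected agreement with the ground truth."""
    n = len(truth)
    observed = sum(j == t for j, t in zip(judged, truth)) / n
    judged_counts, truth_counts = Counter(judged), Counter(truth)
    expected = sum(judged_counts[c] * truth_counts[c] for c in truth_counts) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy data: R = relevant, S = somewhat relevant, N = non-relevant.
truth = ["R", "S", "N", "R"]
judged = ["R", "N", "N", "R"]
acc = accuracy(judged, truth)        # 3 of 4 correct
kappa = cohens_kappa(judged, truth)  # (0.75 - 0.375) / (1 - 0.375)
voted = majority_vote(["R", "R", "S"])
```

In the toy example, the raw agreement of 0.75 drops to a kappa of 0.6 once the agreement expected by chance (0.375) is taken into account.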

Participant Demographics
We conducted an a-priori power analysis using G*Power [22] to determine the minimum sample size required to test our RQs. The results indicated that the required sample size, to achieve 95% power for detecting an effect of 0.25 with two groups (modality) and five measurements (passage length), is 46. As such, we recruited 50 participants from the Prolific platform. We disqualified one participant as they failed to correctly judge our sanity check query/passage pairs (§3.2). Our n = 49 (25 for text, 24 for voice) participants were native English speakers, with a 98% approval rate on the platform, a minimum of 250 prior successful task submissions, and self-declared as having no issues in seeing colour. Participants were required to use a desktop/laptop device in order to control for variables that might affect results of the Stroop and OSPAN tests on other (smaller) devices. From our participants, 22 identified as female, 24 as male, with 3 declining to disclose this information. The mean age of our participants was 38 (min. 22, max. 69). With respect to the highest completed education level, 28 possessed a Bachelors (or equivalent), nine had a Masters (or equivalent), ten had a high school degree, and two had a PhD (or equivalent). We also asked participants how often they used a smart speaker to search for information and listen to the provided answer; 13 reported daily usage, 20 reported usage on a weekly basis, and 16 said never. Participants were paid at the rate of GBP£11/hour, a value that is greater than the 2022-2023 UK Real Living Wage (outside London).
⁷ This was also done in prior studies, such as the study reported by Vtyurina et al. [80].

RESULTS AND DISCUSSION
This section presents the results of our experiments pertaining to our three RQs. First, we provide details on the statistical tests we conducted, and how we utilised the cognitive ability tests to divide participants into low- and high-ability groups.
Statistical Tests. For our analyses⁸, we conducted a series of independent sample t-tests with Bonferroni correction (α = 0.05) to observe if the modality of presentation has a significant effect on our dependent variables: accuracy of relevance judgements, the time taken to judge, and the perceived workload (RQ1). We also conducted a series of mixed factorial ANOVA tests (where modality of presentation is a between-subjects variable, and passage length is a within-subjects variable) to observe if presentation modality, passage length, or the interaction between them have a significant effect on accuracy of relevance judgement and time taken (RQ2). Lastly, we conducted a series of three-way ANOVA tests to observe if the two user dispositions (working memory and inhibition) or their interaction with modality of presentation have a significant effect on the three dependent variables (RQ3). For RQ2 and RQ3, we followed up the ANOVA with pairwise Tukey tests with Bonferroni correction (α = 0.05) to observe where significant differences lay. In the case where no significant difference was observed between the two conditions, we used equivalence testing between conditions through the two one-sided t-tests (TOST) procedure. The upper and lower bounds for the TOST were set at 7.5% (-ΔL = ΔU = 7.5) for accuracy, as Xu et al. [83] observed that LtR models were robust to errors of up to 10% in the dataset (we used 7.5% for conservativeness).
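For readers unfamiliar with the TOST procedure, it can be summarised as follows. This is a generic textbook formulation for two independent groups, assuming a pooled standard deviation $s_p$; it is a sketch of the technique, not necessarily the exact computation used in the study.

```latex
% Generic TOST for two independent groups (pooled variance assumed).
% \Delta_L < 0 < \Delta_U are the equivalence bounds, e.g. -\Delta_L = \Delta_U = 7.5 for accuracy.
\begin{align*}
H_{01}&:\ \mu_{\text{text}} - \mu_{\text{voice}} \le \Delta_L,
&
H_{02}&:\ \mu_{\text{text}} - \mu_{\text{voice}} \ge \Delta_U, \\[4pt]
t_L &= \frac{(\bar{x}_{\text{text}} - \bar{x}_{\text{voice}}) - \Delta_L}
            {s_p \sqrt{\tfrac{1}{n_{\text{text}}} + \tfrac{1}{n_{\text{voice}}}}},
&
t_U &= \frac{(\bar{x}_{\text{text}} - \bar{x}_{\text{voice}}) - \Delta_U}
            {s_p \sqrt{\tfrac{1}{n_{\text{text}}} + \tfrac{1}{n_{\text{voice}}}}}.
\end{align*}
% Both null hypotheses are rejected, and equivalence is concluded, when
%   t_L \ge t_{1-\alpha,\,\mathrm{df}}  and  t_U \le -t_{1-\alpha,\,\mathrm{df}},
% with \mathrm{df} = n_{\text{text}} + n_{\text{voice}} - 2.
% The TOST p-value is the larger of the two one-sided p-values.
```

Under this formulation, a small TOST p-value supports equivalence within the chosen bounds, which is the opposite reading from an ordinary difference test.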
For each scale of NASA-TLX, we set -ΔL = ΔU = 2.04, following Lee et al. [41], who used a bound of ±18 on a 100-point NASA TLX.For our seven-point scale, it translates to ±2.08 according to the formula of Hertzum [30].
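To make the equivalence testing step concrete, the following is a minimal sketch of a TOST for two independent groups. It assumes a pooled-variance (Student) t-test and computes the t distribution's CDF by numerical integration so that only the standard library is needed; the function names and the sample accuracy values are hypothetical and are not taken from the study, and the study's own analysis may have used a different variant.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=2000):
    """CDF of Student's t via Simpson's rule on [0, |x|] (steps must be even)."""
    a = abs(x)
    if a == 0:
        return 0.5
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x > 0 else 0.5 - area


def tost_independent(a, b, low, high):
    """TOST p-value for H0: mean(a) - mean(b) lies outside (low, high).

    Assumption: equal variances, so a pooled standard error is used.
    """
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    se = math.sqrt(pooled * (1 / na + 1 / nb))
    diff = mean(a) - mean(b)
    p_lower = 1 - t_cdf((diff - low) / se, df)  # H0: diff <= low
    p_upper = t_cdf((diff - high) / se, df)     # H0: diff >= high
    return max(p_lower, p_upper)


# Hypothetical per-participant accuracies (in %), for illustration only.
text_acc = [66, 70, 68, 72, 64, 69, 71, 67]
voice_acc = [65, 69, 67, 70, 66, 68, 72, 66]
p_equiv = tost_independent(text_acc, voice_acc, low=-7.5, high=7.5)
```

Equivalence is claimed when the returned p-value falls below the chosen significance level, which is the same as requiring both one-sided tests to reject. The bounds are expressed in the same unit as the data (percentage points here).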
Table 3: RQ1: Effect of modality of passage presentation on accuracy of relevance judgement, time taken per judgement in seconds, and perceived workload (IV-VIII) per participant. We also report Krippendorff's α and Cohen's κ for accuracy. † indicates a significant difference between the two conditions according to an independent sample t-test. ★ indicates the corresponding metric is equivalent for both conditions based on the TOST procedure.
Cognitive Ability Scores and High vs. Low Ability Groups. To examine the effect of a participant's cognitive abilities on relevance judgement accuracy (RQ3), we performed a median split of the scores obtained by the participants in the OSPAN (min. = 0, max. = 50, mean = 25.4 (±12), median = 22) and Stroop (min. = −300, max. = 650, mean = 171.25 (±184), median = 170) tests, respectively. The mean scores of our participants for working memory and inhibition were within one standard deviation of the reference mean scores reported in [7], which supports our methodology. Participants were thus divided into a high- and a low-ability group for each of working memory (based on OSPAN test scores) and inhibition (based on Stroop test scores). Note that for inhibition, a low test score indicates high ability. Prior studies have also analysed the effects of different cognitive abilities by dividing participants into low/high ability groups using a median split [2,7,13,62].
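The median split described above can be sketched as follows. This is an illustrative sketch only: the function name, the participant identifiers, and the scores are hypothetical, and the handling of scores that equal the median is an assumption, since the text does not state how such ties were assigned.

```python
from statistics import median


def median_split(scores, low_score_is_high_ability=False):
    """Map each participant id to 'high' or 'low' ability via a median split.

    Assumption: a score exactly equal to the median is grouped with the
    lower-scoring half.
    """
    cut = median(scores.values())
    groups = {}
    for pid, score in scores.items():
        above = score > cut
        if low_score_is_high_ability:
            # Stroop-style scores: a lower score indicates higher inhibition.
            groups[pid] = "low" if above else "high"
        else:
            # OSPAN-style scores: a higher score indicates higher working memory.
            groups[pid] = "high" if above else "low"
    return groups


# Hypothetical scores for four participants.
ospan = {"p1": 10, "p2": 22, "p3": 35, "p4": 50}
stroop = {"p1": -300, "p2": 120, "p3": 170, "p4": 650}

wm_groups = median_split(ospan)
in_groups = median_split(stroop, low_score_is_high_ability=True)
```

Passing `low_score_is_high_ability=True` for the Stroop scores reflects the note above that a low inhibition test score indicates high ability.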

RQ1: Modality of Passage Presentation
Table 3 presents the main results for RQ1. There was no significant difference in judgement accuracy (row I, Table 3) between participants in text and those in voice (t(47) = 0.97, p = 0.33). TOST revealed that the accuracy of judgements was equivalent across both conditions (p = 0.02). The inter-annotator agreement (α) was slightly higher in text. When using majority voting to aggregate relevance judgements (on average we had eight judgements per Q/P pair in each condition), we found that the accuracy increased from 68% and 66% to 79% and 76% for text and voice, respectively (II, Table 3). This observation is in line with prior work [39], which shows that aggregating judgements from several assessors is more reliable than relying on a single untrained assessor. Cohen's κ also increased with majority voting for both experimental conditions, indicating an increase in judgement reliability. Participants also showed similar trends of relevance judgement accuracy per relevance label category for both experimental conditions. As shown in Figure 3, participants in both conditions were most accurate in judging 'relevant' passages (in line with findings by Alonso and Mizzaro [4]), followed by 'non-relevant' passages. 'Somewhat relevant' passages were the most difficult to judge, as participants in both conditions judged them correctly about half the time. With respect to the time taken to judge (III, Table 3), judgements in text were made significantly faster (t(47) = −4.93, p < 0.001) than in voice.
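As an illustration of the majority voting aggregation mentioned above, the sketch below aggregates several judgements per Q/P pair and scores the result against ground-truth labels. The function names, pair identifiers, and labels are hypothetical, and leaving ties unresolved is an assumption, as the text does not describe a tie-breaking rule.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label of a non-empty list; None on a tie."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # assumption: ties are left unresolved
    return counts[0][0]


def accuracy(predicted, truth):
    """Share of Q/P pairs whose predicted label matches the true label."""
    hits = sum(predicted[pair] == truth[pair] for pair in truth)
    return hits / len(truth)


# Hypothetical judgements: R = relevant, SR = somewhat relevant, NR = non-relevant.
judgements = {
    "q1/p1": ["R", "R", "SR", "R"],
    "q2/p1": ["NR", "SR", "NR", "NR"],
    "q3/p1": ["SR", "R", "R", "NR"],
}
truth = {"q1/p1": "R", "q2/p1": "NR", "q3/p1": "SR"}

aggregated = {pair: majority_vote(votes) for pair, votes in judgements.items()}
aggregated_accuracy = accuracy(aggregated, truth)
```

A tied pair yields `None`, which never matches a true label and is therefore counted as incorrect in this sketch.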
In terms of workload measured using NASA-TLX, there was no significant difference in averages between the two cohorts in terms of perceived mental demand, effort, and temporal demand (IV-VI, Table 3). The TOST procedure revealed equivalent scores (p < 0.05) provided by participants for these three items of the NASA-TLX scale. For the other dimensions of the NASA-TLX questionnaire, participants in text reported feeling significantly more frustrated (VII, Table 3) while performing the task than those in voice (t(47) = 4.69, p < 0.001). Participants in voice also reported significantly higher perceived performance (VIII, Table 3) than those in text (t(47) = −3.60, p < 0.001).
Overall, we found that participants listening to voice passages were as accurate as their text counterparts. Vtyurina et al. [80] also observed that the probability of participants identifying the most relevant document was the same for both text and voice conditions. However, the authors implemented a different task design to ours. Their participants were presented with a list of results, and were significantly better at identifying the correct order of relevance when the summaries were presented in the text modality. Acknowledging this difference in task design, our observations on relevance judgement accuracy across modalities are partially in line with those of Vtyurina et al. [80]. We also observed that voice participants perceived a lower or equal workload when compared to text participants, in contrast to the other study's findings [80]. This may be attributable to their study setup: contrary to ours, their presentation modality was a within-subjects variable. Our results indicate the proficiency of participants with both modalities for the given design of the task.

RQ2: Passage Length
Table 4: RQ2: Effects of passage length and presentation modality on accuracy of relevance judgements (with Krippendorff's α, Cohen's κ) and time taken. A bold number indicates that the metric for the corresponding presentation modality is significantly higher than that for the other modality for the particular passage length. Superscripts indicate a significant difference (within the same experimental condition) compared to XS, S, M, L, XL passage lengths. ★ indicates equivalence between the two conditions.
Table 4 presents results related to RQ2. As with modality of presentation, neither passage length nor its interaction with presentation modality had a significant effect on relevance judgement accuracy (comparing rows Ia and Ib, Table 4). The TOST procedure revealed that for XS (p = 0.01) and L (p = 0.001) passages, judgement accuracy was equivalent across both conditions. Aggregating judgements via majority voting increased relevance judgement accuracy across all passage lengths for both text and voice conditions (comparing rows Ia-IIa and Ib-IIb, Table 4). However, for XL passages (IIa-IIb, Table 4), the difference in accuracy after majority voting was more than 10% (with text being more accurate). We also observed a larger difference in Cohen's κ and Krippendorff's α between the text and voice conditions for XL passages. These results indicate higher inter-annotator agreement and reliability of judgements in text than in voice for XL passages. With respect to the time taken for judging, we have already seen (Section 4.1) that presentation modality significantly affected the time to judge. Mixed factorial ANOVA showed that passage length had a significant main effect (F = 21.6, p = 3.3 × 10^−15) on the time taken to assess. A post-hoc test revealed a significant difference in the time taken to judge for the following pairs of passage lengths (with the latter passage length category taking more time): XS-M (p = 0.02), XS-L (p < 0.001), XS-XL (p < 0.001), S-XL (p < 0.001), and M-XL (p = 0.001). There was also a significant interaction effect between passage length and presentation modality on the amount of time taken. A pairwise Tukey test revealed that, except for XS passages, judging relevance in voice took participants significantly longer than doing the same in text (bold numbers, row III, Table 4). In voice (IIIb, Table 4), it took participants significantly longer to judge relevance as passages (audio clips) increased in length. Superscripts (in Table 4) indicate which pairs of passage lengths were significantly different in voice in terms of time taken per judgement.
In summary, we did not observe a significant difference in relevance judgement accuracy across different passage lengths in either condition. We observed that judging the relevance of XS passages was equivalent in terms of accuracy and time taken across both text and voice. However, for XL passages, relevance judgements in text were more reliable than in voice (as indicated by majority voting accuracy, Cohen's κ, and Krippendorff's α). In contrast to the findings of [12], we observed no clear trend between passage length and assessor agreement, possibly due to differences in the type of documents assessed. Although it took longer on average to judge a lengthier passage in text, there was no significant difference in the time taken to judge the relevance of passages of different lengths (a similar trend to that observed in [37,65]). For longer passages, participants in voice took significantly longer to judge relevance than in text. For XL passages, we found that participants took twice as long in voice as in text.
Why does it take longer for participants to judge longer passages in the voice condition? In order to control for confounding variables, we did not let participants speed up the audio clips, nor did we provide them with a seeker bar to skip ahead. We found evidence that participants moved on to the next Q/P pair as soon as they were satisfied with their assessment. Indeed, for longer passages they did not wait for the audio clip to finish playing before moving on to the next Q/P pair (Figure 4 (a)). We also let participants mark the relevance of a passage in voice only after 50% of the audio clip had been played (Section 3.1). However, as seen from Figure 4 (b), participants took longer to judge relevance (rather than judging right at the 50% mark). For XL passages, the judgement was made at 66% of the audio clip on average. This suggests that it indeed took participants more time in voice than in text to assimilate the information and come to a judgement decision for longer passages.
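The two quantities discussed above (how often the full clip was played, and at what point of the clip relevance was judged) can be derived from simple interaction logs. The sketch below assumes each log entry is a pair of the judgement time and the clip length in seconds; this log format, the function names, and the sample values are hypothetical and do not reflect the study's actual logging.

```python
def can_judge(elapsed_s, clip_length_s, gate=0.5):
    """True once at least `gate` of the clip has played (50% in the study)."""
    return elapsed_s >= gate * clip_length_s


def judgement_point(judged_at_s, clip_length_s):
    """Point of judgement as a percentage of the clip length, capped at 100."""
    return min(judged_at_s / clip_length_s, 1.0) * 100


def summarise(events):
    """Summarise (judged_at_s, clip_length_s) pairs.

    Returns the percentage of judgements made after the whole clip had
    played, and the mean judgement point as a percentage of clip length.
    """
    points = [judgement_point(t, d) for t, d in events]
    listened_fully = sum(t >= d for t, d in events) / len(events) * 100
    return listened_fully, sum(points) / len(points)


# Hypothetical log entries for four judgements on a 60-second clip.
events = [(40, 60), (60, 60), (33, 60), (75, 60)]
pct_full, mean_point = summarise(events)
```

Capping the judgement point at 100 treats any time spent after the clip has ended as having listened to the whole clip; whether that matches the measure plotted in Figure 4 is an assumption.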

RQ3: Assessor Cognitive Abilities
Table 5 contains the results for our third research question. Here, ✓ indicates a significant effect (p < 0.05) on the particular dependent variable, and ✗ indicates no significant effect. None of the independent variables (modality of passage presentation (PM), working memory (WM), and inhibition (IN)) had a significant main effect on judgement accuracy. The interaction between the IN of participants and presentation modality (IN x PM) had a significant effect on accuracy (F = 4.89, p = 0.03). A pairwise Tukey test revealed that, in voice, participants with higher IN performed significantly better than those with lower IN (70.5 ± 7.2% vs. 59.5 ± 4.8%). The post-hoc test (p = 0.01) also revealed that participants with low IN performed significantly better in text than in voice (70.0 ± 9.5% vs. 59.5 ± 4.8%). We found a significant main effect of PM on the time taken to judge relevance (F = 22.17, p < 0.001), reaffirming findings from Section 4.1 and Section 4.2.
With respect to the perceived workload, working memory had a significant main effect on perceived temporal demand (F = 7.88, p = 0.01). A post-hoc test (p < 0.001) revealed that participants with high WM reported significantly less temporal demand than those with low WM (2.5 ± 1.3 vs. 4.6 ± 1.7, respectively). IN also had a significant main effect on perceived temporal demand (F = 7.4, p = 0.01). A post-hoc test (p < 0.001) revealed that participants with high IN reported significantly less temporal demand than those with low IN (2.74 ± 1.4 vs. 4.59 ± 1.9, respectively). Presentation modality had significant main effects on perceived frustration (F = 8.36, p = 0.008) and performance (F = 5.83, p = 0.02), confirming observations from Section 4.1, with participants in voice reporting a lower workload. Lastly, the interaction between WM and presentation modality (WM x PM) had a significant effect on perceived effort for the task (F = 5.1, p = 0.03). Post-hoc tests revealed that participants with high WM felt that judging using text required significantly more effort when compared to those in voice (p = 0.001).
In summary, we found that IN is a more important trait than WM, specifically for relevance judgement accuracy in the voice modality. Low IN participants in the voice condition were less accurate. A possible explanation is that, since we did not control for the audio device of the participants, and consequently not for the background noise they were subjected to, low IN participants in voice were less effective in focusing on the passages while judging relevance [71,72]. We leave exploring the effect of background noise as future work. In our study, the interplay between cognitive abilities and modality of presentation affected perceived workload in different ways. High IN and high WM participants felt less temporal demand. High WM participants in text perceived more effort than those in voice. Our results imply that we should design tasks for collecting relevance assessments to match the preferences and abilities of crowdworkers [5,52].

CONCLUSIONS
We explored the feasibility of using voice as a modality to collect relevance judgements of query-passage pairs. We investigated the effect of passage length and the cognitive abilities of participants on judgement accuracy, the time taken, and perceived workload.
RQ1 On average, the relevance judgement accuracy was equivalent across both text and voice. Participants also perceived equal or less workload in voice when compared to text.
RQ2 For XS passages, the accuracy of and time taken for relevance judgements were equivalent between voice and text. As passages increased in length, it took participants significantly longer to make relevance judgements in the voice condition; for XL passages, voice participants took twice as much time and their judgements were less reliable compared to text.
RQ3 Inhibition impacted the relevance judgement accuracy in the voice condition: participants with higher inhibition were significantly more accurate than those with lower inhibition.
Our results from RQ1 suggest that we can leverage the voice modality for this task. RQ2 points to the possibility of designing hybrid tasks, where we can use the voice modality for judging shorter passages and text for longer passages. The results of RQ3 showed that selecting the right participants for the relevance judgement task is important. We should be mindful to personalise the task to match the preferences and abilities of crowdworkers [5,52].
There are several open questions for future work. We did not provide participants with the option to speed up voice passages: does letting them speed up or skip parts of a passage reduce the time for longer passages without reducing accuracy? We also did not test the limit of length: how long can documents be for equal accuracy in the text and voice modality? Future work should also explore mobile devices for playing voice passages: can we collect relevance judgements by offering more flexibility to crowdworkers? Lastly, since asking crowdworkers to provide rationales for their judgements has been shown to improve relevance judgement accuracy in the text modality [39], exploring the effects of rationales in voice-based relevance judgements should be a worthwhile endeavour.
P pairs are randomly selected from each bucket + 2 from Sanity Check (SC) bucket; 42 total.

Figure 1: A high-level overview of the user study protocol, including approximate times for participants to complete each component. Refer to §3.1 for mappings to the letters highlighting key aspects of the study procedure.

Figure 2: Composition screenshot of both the text and voice interfaces used by participants for judging query-passage pairs. Circled numbers correspond to the same in the narrative, found in §3.4.

Figure 3: Accuracy of relevance judgements per label category for both text and voice. Diagonals represent the percentage of time the true labels were correctly predicted by participants. Here, R = RELEVANT, SR = SOMEWHAT-RELEVANT, NR = NON-RELEVANT and IDK = I do not know.

Figure 4: The trend of voice participants judging relevance w.r.t. time taken for passages of various lengths: (a) % of time participants listened to the entire audio clip; and (b) at what point relevance was judged (as a % of audio clip length).

Table 1: Overview of passage length buckets. Averages are reported together with the standard deviation.

Table 2: Examples of Query/Passage (Q/P) pairs for different passage length categories. The (Qid) is taken from the TREC datasets. We also provide links to [audio] clips of the respective passages.

Table 5: RQ3: Summary of main effects of Presentation Modality (PM), Working Memory (WM), Inhibition (IN), and effects of the interaction of WM and IN with PM on accuracy of relevance judgement, time taken, and perceived workload. A ✓ indicates a significant effect in a 3-way ANOVA test (p < 0.05) on the particular dependent variable and ✗ indicates no significant effect.